# EDA (Exploratory Data Analysis)

> We will work through our own EDA in the following [Deepnote notebook](https://deepnote.com/workspace/sebastian-minaya-a67e42f1-471f-4ef3-b708-827621c005a4/project/Curso-EDA-Communication-Duplicate-829d77b8-46d3-4ab9-bd18-e867a252bb80/notebook/1.0-jvelezmagic-conociendo-nuestros-datos-2a5ad82e8af540619589fae1d201dec4)

## What is EDA?

EDA is the process of analyzing a dataset to summarize its main characteristics, often with visual methods. It is a key step in data analysis and data science, and it usually comes after data collection and before data modeling.

## Why EDA?

EDA is important because it helps us understand the data and draw conclusions from it. It also helps us find patterns, determine relationships between variables, confirm or reject assumptions, and check the quality of the data.

## What are the steps in EDA?

1. Formulate a question

    * What do you want to know?
    * What do you want to prove?
    * What is the reason for the analysis? <br><br>

2. Determine the size of the data

    - How many observations are there?
    - How many variables are there?
    - Do I need all observations and variables? <br><br>

3. Determine the type of data

    - How many categorical variables are there?
    - How many numerical variables are there?
    - How can I explore each variable? <br><br>

4. Clean and validate the data

    - Check for each of the following, then remove or correct it as appropriate:
        - Duplicates
        - Missing values
        - Outliers
        - Wrong data types
        - Spelling errors
        - Inconsistent data
        - Inaccurate data
        - Incomplete data
        - Invalid data
        - Irrelevant data
        - Improperly formatted data
    - What is the proportion of missing values?
    - How can I handle missing values?
    - What is the data distribution?
    - Are there any outliers? <br><br>

5. Establish the relationships between variables

    - What is the relationship between variable X and variable Y?
    - What happens if I consider Z instead of Y?
    - What does it mean for observations to be similar?
    - What does this trend mean? <br><br>

6. Draw conclusions

    - What can I conclude from the analysis?
    - What are the limitations of the analysis?
    - What are the next steps?
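
The size, type and cleaning questions in steps 2 to 4 can be sketched with pandas. This is only a minimal sketch under assumptions: the `df` table, its column names and its values are made up for illustration and are not taken from the notebook linked above.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are made up.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, 3750.0, 5000.0, None, 5400.0],
        "year": [2007, 2007, 2008, 2008, 2009],
    }
)

# Step 2: size of the data (observations, variables).
n_rows, n_cols = df.shape

# Step 3: type of each variable.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Step 4: proportion of missing values per column, and duplicated rows.
missing_share = df.isna().mean()
n_duplicates = int(df.duplicated().sum())

print(f"{n_rows} observations, {n_cols} variables")
print("numeric:", numeric_cols, "| categorical:", categorical_cols)
print(missing_share)
print("duplicated rows:", n_duplicates)
```

Whether a duplicated row or a missing value should be dropped, kept or filled depends on the question from step 1, so the sketch only measures them.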

## What are the types of data analysis?

Some of the types of data analysis are:

1. Descriptive analysis: It is used to describe, summarize and find patterns in data.

2. Diagnostic analysis: It is used to find the cause of an event.

3. Predictive analysis: It is used to make predictions about future events based on past data.

4. Prescriptive analysis: It is used to find the best solution to a problem.

## What are the types of EDA?

Some of the types of EDA are:

1. Univariate analysis: It is used to describe, summarize and find patterns in a single variable.

2. Bivariate analysis: It is used to find the relationship between two variables.

3. Multivariate analysis: It is used to find the relationship between more than two variables.

## What are the tools for EDA?

Some of the tools for EDA are:

1. Cloud-based

    - Google Colab
    - Kaggle
    - Databricks
    - IBM Watson Studio
    - Amazon SageMaker <br><br>

2. Desktop-based

    - Jupyter Notebook
    - RStudio
    - PyCharm
    - Spyder
    - Visual Studio Code <br><br>

## Steps to perform EDA

1. Data collection

    - Data can be collected from different sources such as:
        - Web scraping
        - APIs
        - Databases
        - CSV files
        - Excel files
        - JSON files
        - XML files
        - Text files
        - PDF files
        - Images
        - Videos
        - Audio files
        - Social media
        - Sensors
        - IoT devices
        - etc. <br><br>
    - There are three types of data collection:
        - Primary data collection: It is the process of collecting data directly from the source.
        - Secondary data collection: It is the process of collecting data from a third-party source.
        - Tertiary data collection: It is the process of collecting data from a third-party source that has itself collected the data from another source. <br><br>

2. Data cleaning and validation

    - Data cleaning is the process of removing or correcting inaccurate, incomplete, irrelevant, duplicated or improperly formatted data.
    - Data validation is the process of checking the accuracy, completeness, consistency, relevancy and validity of data.
    - For data cleaning and validation, it is important to look at the following:
        - Data Model: If a third party collected the data, verify what questions they wanted to answer with the data. If you are the one collecting the data, ask yourself many questions and consider if that data is sufficient to answer them.
        - Standard File Format Tracking: Verify that the file extensions you are handling correspond to the internal format they have. Make sure that numbers are expressed in the format you are working with.
        - Data Types: Verify that the data is of the type indicated in the dataset.
        - Variable Range: Verify that the variables are within the range established in the data collection. If you find values outside the range, ask yourself: how did that data get here? Does it have an alternate meaning? Should I preserve or eliminate it?
        - Uniqueness: Verify how unique the data is. Detect if there is duplication in the data and correct it.
        - Consistency of Expressions: Refers to how the person collecting the data defines their variables: date format, time format, and whether variables are written in the same way throughout the table. These are not erroneous data; it is just a matter of giving them the appropriate format.
        - Null Values: These are missing data, and they can be explicit or implicit in the dataset. Why is the value empty? Can I fill it with another value? Is it empty due to a random process, or does the absence have a meaning? <br><br>

3. Data exploration

    1. Univariate Data Exploration

        - Univariate data exploration is the process of exploring a single variable.
        - Some of the techniques used in univariate data exploration are:
            - Histograms
            - Box plots
            - Bar plots
            - Pie charts
            - Frequency tables
            - Descriptive statistics
            - etc. <br><br>

    2. Bivariate Data Exploration

        - Bivariate data exploration is the process of exploring two variables.
        - Some of the techniques used in bivariate data exploration are:
            - Scatter plots
            - Line plots
            - Bar plots
            - Box plots
            - Correlation
            - etc. <br><br>

    3. Multivariate Data Exploration

        - Multivariate data exploration is the process of exploring more than two variables.
        - Some of the techniques used in multivariate data exploration are:
            - Scatter plots
            - Line plots
            - Bar plots
            - Box plots
            - Correlation
            - etc. <br><br>
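
The three levels of data exploration above can be sketched with pandas summaries. This is a minimal sketch under assumptions: the `df` table, with a made-up categorical column `group` and numerical columns `height` and `weight`, is hypothetical, and plots are left out so the example stays short.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are made up.
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "height": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "weight": [2.0, 4.1, 5.9, 8.2, 9.9, 12.1],
    }
)

# Univariate: descriptive statistics and a frequency table.
height_summary = df["height"].describe()
group_counts = df["group"].value_counts()

# Bivariate: correlation between two numerical variables (Pearson by default).
r = df["height"].corr(df["weight"])

# Multivariate: the same numerical summary split by a third variable.
means_by_group = df.groupby("group")[["height", "weight"]].mean()

print(height_summary)
print(group_counts)
print(r)
print(means_by_group)
```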

## PMF, PDF and CDF

1. PMF (Probability Mass Function)

    - The probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.
    - The probability mass function (PMF) is useful because it shows how probability is distributed over the possible values of a discrete variable; the probabilities of all values add up to 1. <br><br>

2. PDF (Probability Density Function)
    
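    - The probability density function (PDF) is a function that describes the relative likelihood of a continuous random variable near a given value. The probability of any single exact value is zero; the probability that the variable falls within an interval is the area under the PDF over that interval.
    - The probability density function (PDF) is useful because it shows the shape of a continuous distribution, for example its center, spread, skewness and number of modes. <br><br>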

3. CDF (Cumulative Distribution Function)

    - The cumulative distribution function (CDF) is a function that gives the probability that a random variable is less than or equal to some value.
    - The cumulative distribution function (CDF) is useful because percentiles can be read from it directly, and it makes it easy to compare distributions without having to choose bins. <br><br>
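
The three functions above can be illustrated with the Python standard library. This is a minimal sketch under assumptions: the `rolls` list is made-up data, and the standard normal distribution is just one convenient continuous example.

```python
from collections import Counter
from statistics import NormalDist

# Discrete case: empirical PMF and CDF of hypothetical die rolls.
rolls = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
counts = Counter(rolls)
pmf = {value: counts[value] / len(rolls) for value in sorted(counts)}

cdf = {}
running_total = 0.0
for value, probability in pmf.items():
    running_total += probability
    cdf[value] = running_total  # P(X <= value)

# Continuous case: a standard normal distribution.
normal = NormalDist(mu=0.0, sigma=1.0)
density_at_zero = normal.pdf(0.0)  # a density, not a probability
prob_within_one_sd = normal.cdf(1.0) - normal.cdf(-1.0)  # area under the PDF

print(pmf)
print(cdf)
print(density_at_zero, prob_within_one_sd)
```

Note that the PDF value at a point is a density: only the difference of two CDF values, that is, an area under the PDF, is a probability.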

## Law of Large Numbers and Central Limit Theorem

1. Law of Large Numbers

    - The law of large numbers states that as the number of trials increases, the average of the results approaches the expected value.
    - The law of large numbers is important because it justifies using the average of a large sample as an estimate of the expected value of the population. <br><br>


2. Central Limit Theorem

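    - The central limit theorem states that the distribution of the mean (or sum) of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution, whatever the shape of the original distribution.
    - The central limit theorem is important because it justifies using normal-based confidence intervals and tests for sample means, even when the data themselves are not normally distributed. <br><br>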
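
Both results can be checked with a small simulation. This is a minimal sketch under assumptions: the fair die, the exponential distribution, the sample sizes and the seed are all arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Law of large numbers: the mean of fair-die rolls approaches 3.5.
rolls = [rng.randint(1, 6) for _ in range(100_000)]
mean_small = statistics.fmean(rolls[:10])
mean_large = statistics.fmean(rolls)

# Central limit theorem: means of samples drawn from a skewed
# (exponential) distribution are approximately normal.
sample_size = 50
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(5_000)
]
center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
share_within_one_sd = sum(
    abs(m - center) <= spread for m in sample_means
) / len(sample_means)

print(mean_small, mean_large)
print(center, spread, share_within_one_sd)
```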

## Choosing the right correlation coefficient

1. Pearson Correlation Coefficient

    - The Pearson correlation coefficient is a measure of the linear correlation between two variables.
    - The Pearson correlation coefficient ranges from -1 to 1. It is appropriate when the relationship between two numerical variables is roughly linear, and it is sensitive to outliers. <br><br>

2. Spearman Correlation Coefficient

    - The Spearman correlation coefficient is a measure of the monotonic correlation between two variables.
    - The Spearman correlation coefficient is the Pearson correlation computed on the ranks of the values, so it suits monotonic but nonlinear relationships and ordinal data, and it is less sensitive to outliers. <br><br>

3. Kendall Correlation Coefficient

    - The Kendall correlation coefficient is a measure of the ordinal correlation between two variables.
    - The Kendall correlation coefficient is based on the number of concordant and discordant pairs of observations. It is often recommended for small samples and for data with many tied ranks. <br><br>
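
The difference between the three coefficients shows up on data that is monotonic but not linear. This is a minimal sketch under assumptions: the helper functions are written from the textbook definitions rather than taken from a library, the data is made up, and ties are not handled (Kendall's coefficient is computed in its simplest form, tau-a).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Linear correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks from 1 to n; assumes no ties to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)


# Made-up data: y grows with x, but not linearly (y = x ** 3).
x = [1, 2, 3, 4, 5]
y = [value ** 3 for value in x]

print(pearson(x, y), spearman(x, y), kendall(x, y))
```

On this data the two rank-based coefficients reach 1 because the relationship is perfectly monotonic, while Pearson stays below 1 because it is not linear.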

## Correlation vs Causation

1. Correlation

    - Correlation is a measure of the statistical association between two variables: how strongly, and in which direction, they tend to vary together.
    - Correlation is useful for describing and predicting, but it does not by itself show that one variable causes the other. <br><br>

2. Causation

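    - Causation means that a change in one variable produces a change in another variable.
    - Causation is important because only a causal relationship tells us what will happen if we intervene on a variable. A correlation can also arise from a confounding variable, reverse causation or chance. <br><br>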
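
A simulation can show how a correlation appears without any causal link. This is a minimal sketch under assumptions: the variables `x`, `y` and the hidden common cause `z` are simulated, and the effect of `z` is removed by simple subtraction only because the data-generating process is known here.

```python
import random
from math import sqrt

rng = random.Random(1)  # fixed seed so the run is repeatable


def pearson(x, y):
    """Linear correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


n = 5_000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]       # hidden common cause
x = [value + rng.gauss(0.0, 0.5) for value in z]  # x depends only on z
y = [value + rng.gauss(0.0, 0.5) for value in z]  # y depends only on z

# x and y are strongly correlated although neither causes the other.
r_xy = pearson(x, y)

# Removing the known effect of z leaves only independent noise.
x_without_z = [a - c for a, c in zip(x, z)]
y_without_z = [b - c for b, c in zip(y, z)]
r_after_removing_z = pearson(x_without_z, y_without_z)

print(r_xy, r_after_removing_z)
```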

## How to measure causation?

1. A/B Testing

    - A/B testing is a method of comparing two versions of a product or service against each other to determine which one performs better.
    - A/B testing is important because users are randomly assigned to each version, so a difference in the outcome can be attributed to the change itself rather than to confounding variables. <br><br>

2. Regression Analysis

    - Regression analysis is a method of estimating the relationship between a dependent variable and one or more independent variables.
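    - Regression analysis is important because it can adjust for measured confounding variables. With observational data, however, its coefficients describe associations; they support causal claims only under additional assumptions, such as the absence of unmeasured confounders. <br><br>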
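
One common way to analyze an A/B test on conversion rates is a two-proportion z-test. This is a minimal sketch under assumptions: the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical, and the normal approximation it relies on needs reasonably large samples.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Standard normal CDF evaluated at |z| via the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical counts: conversions and visitors for versions A and B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(z, p_value)
```

A small p-value suggests that the difference between the two versions is unlikely to be due to chance alone; it does not say how large or how useful the difference is.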

## Multiple Linear Regression

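- Multiple linear regression is a method of estimating the relationship between a numerical dependent variable and two or more independent variables, modeling the dependent variable as a linear combination of the independent ones.
- Multiple linear regression is important because it lets us estimate the association of each independent variable with the outcome while holding the other variables constant. <br><br>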
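
A multiple linear regression can be fitted by ordinary least squares. This is a minimal sketch under assumptions: it uses NumPy, the data is simulated from known coefficients (1, 2 and -3) so the fit can be checked, and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Made-up data generated from y = 1 + 2*x1 - 3*x2 + small noise.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: minimize the squared error of X @ beta - y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_x1, coef_x2 = beta

print(intercept, coef_x1, coef_x2)
```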

## Logistic Regression

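- Logistic regression is a method of estimating the relationship between a categorical (typically binary) dependent variable and one or more independent variables. It models the probability of the outcome with the logistic (sigmoid) function.
- Logistic regression is important because it is a simple, interpretable model for classification problems: each coefficient can be read as the change in the log-odds of the outcome for a one-unit change in that variable. <br><br>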
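
A logistic regression with one independent variable can be fitted with plain gradient descent. This is a minimal sketch under assumptions: it uses NumPy, the binary outcome is simulated, and the learning rate and number of iterations are arbitrary choices; in practice a library implementation would normally be used.

```python
import numpy as np


def sigmoid(t):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Made-up binary outcome: the probability of y = 1 rises with x.
n = 500
x = rng.normal(size=n)
y = (rng.random(n) < sigmoid(0.5 + 2.0 * x)).astype(float)

X = np.column_stack([np.ones(n), x])  # intercept column + feature
weights = np.zeros(2)

# Gradient descent on the average log-loss.
learning_rate = 0.5
for _ in range(2000):
    gradient = X.T @ (sigmoid(X @ weights) - y) / n
    weights -= learning_rate * gradient

intercept, slope = weights
predicted_class = sigmoid(X @ weights) >= 0.5
accuracy = float(np.mean(predicted_class == (y == 1.0)))

print(intercept, slope, accuracy)
```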

## Simpson's Paradox

- Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.
- Simpson's paradox is important because it shows that aggregated data can hide or reverse the trends present within groups, so relationships should also be checked within relevant subgroups. <br><br>
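
The reversal can be reproduced with a few counts. This is a minimal sketch under assumptions: the success counts for two treatments, split into a "small" and a "large" group, are illustrative numbers chosen to show the effect.

```python
# Illustrative counts (successes, patients) for two treatments,
# split by a grouping variable. The numbers are for demonstration only.
data = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, total):
    """Share of successes among all cases."""
    return successes / total


# Within each group, treatment A has the higher success rate.
a_wins_in_every_group = all(
    rate(*group["A"]) > rate(*group["B"]) for group in data.values()
)

# Combined over both groups, the ordering reverses.
totals = {
    treatment: tuple(
        sum(group[treatment][k] for group in data.values()) for k in (0, 1)
    )
    for treatment in ("A", "B")
}
overall_a = rate(*totals["A"])
overall_b = rate(*totals["B"])

print(a_wins_in_every_group, overall_a, overall_b)
```

The reversal happens because the two treatments are spread very unevenly across the groups: A is used mostly in the group with the lower success rates, B mostly in the group with the higher ones.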

## What to do when there are a lot of variables?

1. Dimensionality Reduction: Dimensionality reduction is the process of reducing the number of variables in a dataset. Some examples of dimensionality reduction are:

    - Principal Component Analysis (PCA)
    - Linear Discriminant Analysis (LDA)
    - t-distributed Stochastic Neighbor Embedding (t-SNE)
    - Uniform Manifold Approximation and Projection (UMAP)
    - etc. <br><br>

2. Feature Selection: Feature selection is the process of selecting the most important variables in a dataset. Some examples of feature selection are:

    - Univariate Feature Selection
    - Recursive Feature Elimination
    - Feature Importance
    - etc. <br><br>

3. Feature Extraction: Feature extraction is the process of extracting new variables from existing variables in a dataset. Some examples of feature extraction are:

    - Bag of Words
    - Word2Vec
    - etc. <br><br>
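
As an example of dimensionality reduction, PCA can be computed from the singular value decomposition of the centered data. This is a minimal sketch under assumptions: it uses NumPy, and the three observed variables are simulated from a single hidden factor so that one component is enough.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Made-up data: 3 observed variables driven mostly by 1 hidden factor.
n = 300
factor = rng.normal(size=n)
X = np.column_stack(
    [
        factor + rng.normal(scale=0.1, size=n),
        2.0 * factor + rng.normal(scale=0.1, size=n),
        -factor + rng.normal(scale=0.1, size=n),
    ]
)

# PCA via singular value decomposition of the centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Keep only the first principal component: 3 variables become 1.
X_reduced = X_centered @ components[:1].T

print(explained_variance_ratio)
print(X_reduced.shape)
```

The explained variance ratio indicates how much of the total variance each component keeps, which helps decide how many components to retain.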