## Exploratory Data Analysis (EDA)

Exploratory data analysis is an approach to analyzing data in order to summarize its main characteristics, gain a better understanding of the data set, uncover relationships between variables, and extract the variables that matter most for the problem you're trying to solve.

### Descriptive Statistics

When you begin to analyze data, it's important to first explore your data before you spend time building complicated models.

Descriptive statistical analysis describes the basic features of a dataset and gives a short summary of the sample and its measures.

There are several useful methods.

1. Using the describe() function in pandas. Applied to a data frame, describe() automatically computes basic statistics for all numerical variables: the total number of data points, the mean, the standard deviation, the quartiles, and the extreme values (minimum and maximum). Any NaN values are automatically skipped in these statistics. This function gives you a clearer idea of the distribution of your variables.

Your dataset may also contain categorical variables. These are variables that can be divided into different categories or groups and take discrete values. One way to summarize categorical data is the value_counts() function.
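The two summaries described above can be sketched as follows. This is a minimal example on a made-up car data frame; the column names (price, engine_size, drive_wheels) and all values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "price": [13495.0, 16500.0, 13950.0, np.nan, 17450.0],
    "engine_size": [130, 152, 109, 136, 136],
    "drive_wheels": ["rwd", "rwd", "fwd", "fwd", "4wd"],
})

# describe() summarizes the numeric columns and skips NaN values,
# so the count for "price" is 4 rather than 5.
summary = df.describe()
print(summary)

# value_counts() counts how often each category occurs.
counts = df["drive_wheels"].value_counts()
print(counts)
```

Note that the categorical column does not appear in the describe() output by default, which is why value_counts() is used for it.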

2. Box plots are a great way to visualize numeric data, since you can visualize the various distributions of the data.

The main features that a box plot shows are the median, which marks the middle data point; the upper quartile, which marks the 75th percentile; and the lower quartile, which marks the 25th percentile. The data between the upper and lower quartiles makes up the inter-quartile range (IQR). Next, you have the lower and upper extremes. These are calculated as 1.5 times the IQR below the 25th percentile and 1.5 times the IQR above the 75th percentile.

Finally, box plots also display outliers as individual dots that occur outside the upper and lower extremes. With box plots, you can easily spot outliers and also see the distribution and skewness of the data. Box plots make it easy to compare between groups.
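To make the quartile arithmetic concrete, here is a small sketch that computes the same quantities a box plot draws. The sample values are made up, and the variable names are assumptions for illustration.

```python
import numpy as np

# Made-up sample; the value 30 sits far from the rest of the data.
values = np.array([2, 4, 5, 5, 6, 7, 8, 9, 30])

# Lower quartile, median, and upper quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Lower and upper extremes: 1.5 * IQR beyond the quartiles.
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

# Points outside the extremes are the ones drawn as individual dots.
outliers = values[(values < lower_extreme) | (values > upper_extreme)]
print(q1, median, q3, iqr, outliers)
```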

3. Visualizing the relationship between continuous variables in the data. Continuous variables take numeric values within some range. One good way to visualize this relationship is a scatter plot, in which each observation is represented as a point.

The predictor variable is the variable that you are using to predict an outcome. The target variable is the variable that you are trying to predict. 

In a scatter plot, we typically put the predictor variable on the x-axis (horizontal axis) and the target variable on the y-axis (vertical axis). Matplotlib's scatter function is often used for this; it takes an x and a y variable.

Something to note is that it's always important to label your axes and write a general plot title so that you know what you're looking at.
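A minimal sketch of a labeled scatter plot, assuming engine size as the predictor and price as the target. The numbers are invented for illustration.

```python
import matplotlib.pyplot as plt

# Made-up values: predictor on the x-axis, target on the y-axis.
engine_size = [97, 109, 120, 136, 152, 181]
price = [7000, 9500, 11000, 15000, 16500, 22000]

fig, ax = plt.subplots()
ax.scatter(engine_size, price)

# Label both axes and give the plot a title.
ax.set_xlabel("Engine size")
ax.set_ylabel("Price")
ax.set_title("Price versus engine size")
```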

### GroupBy in Python


This section covers grouping and how it can help you transform your data set.

The groupby method is used on categorical variables. It groups the data into subsets according to the different categories of that variable. You can group by a single variable, or you can group by multiple variables by passing in multiple variable names.
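As a rough illustration, the sketch below groups a made-up data frame first by one categorical variable and then by two. The column names and values are assumptions, not taken from a real data set.

```python
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "rwd", "rwd", "rwd", "4wd"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan", "hatchback", "sedan"],
    "price": [9000.0, 8000.0, 20000.0, 24000.0, 15000.0, 12000.0],
})

# Group by a single variable and average the price within each group.
by_drive = df.groupby("drive_wheels")["price"].mean()
print(by_drive)

# Group by multiple variables by passing a list of names.
by_both = df.groupby(["drive_wheels", "body_style"], as_index=False)["price"].mean()
print(by_both)
```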

### Correlation

 Correlation is a statistical metric for measuring to what extent different variables are interdependent. In other words, when we look at two variables over time, if one variable changes, how does this affect change in the other variable?
 
Examples:

1. Smoking is known to be correlated with lung cancer, since you have a higher chance of getting lung cancer if you smoke.

2. There is a correlation between the umbrella and rain variables: more precipitation means more people use umbrellas, and if it doesn't rain, people do not carry umbrellas. Therefore, we can say that umbrellas and rain are interdependent and, by definition, correlated.
 
 It is important to know that correlation doesn't imply causation. In fact, we can say that umbrella and rain are correlated, but we would not have enough information to say whether the umbrella caused the rain or the rain caused the umbrella. In data science, we usually deal more with correlation.

### Correlation - Statistics

One way to measure the strength of the correlation between continuous numerical variables is a method called Pearson Correlation.

The Pearson Correlation method gives you two values: the correlation coefficient and the p-value. How do we interpret these values?

1. The correlation coefficient tells us the strength and direction of the correlation:

   1. a value close to 1 implies a large positive correlation,

   2. a value close to -1 implies a large negative correlation,

   3. a value close to zero implies no correlation between the variables.

2. The p-value tells us how certain we are about the correlation that we calculated:

   1. a value less than 0.001 gives us strong certainty about the correlation coefficient that we calculated,

   2. a value between 0.001 and 0.05 gives us moderate certainty,

   3. a value between 0.05 and 0.1 gives us weak certainty,

   4. a value larger than 0.1 gives us no certainty of correlation at all.

We can say that there is a strong correlation when the correlation coefficient is close to 1 or -1 and the p-value is less than 0.001.
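One way to obtain both values is the pearsonr function from scipy.stats, sketched below on made-up data. The variables x and y are assumptions for illustration; they are constructed to be almost perfectly linear.

```python
from scipy import stats

# Made-up paired observations with a nearly linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# pearsonr returns the correlation coefficient and the p-value.
coef, p_value = stats.pearsonr(x, y)
print(coef, p_value)

# Apply the rule of thumb from the text.
is_strong = abs(coef) > 0.9 and p_value < 0.001
```

The 0.9 cutoff for "close to 1 or -1" is an arbitrary choice made for this sketch; the text does not fix a threshold.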



A heat map can show the correlation between each of the variables and every other variable. The color scheme indicates the Pearson correlation coefficient, that is, the strength of the correlation between two variables. We can see a diagonal line with a dark red color, indicating that all the values on this diagonal are highly correlated. This makes sense because the values on the diagonal are the correlations of the variables with themselves, which are always 1. A correlation heat map gives us a good overview of how the different variables are related to one another.
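A minimal sketch of such a correlation heat map, assuming a small made-up data frame of numeric columns. The column names, the values, and the choice of the "RdBu_r" color map (so that a coefficient of 1 appears dark red) are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric data; names and values are illustrative only.
df = pd.DataFrame({
    "engine_size": [97, 109, 120, 136, 152, 181],
    "horsepower": [69, 85, 97, 110, 121, 160],
    "highway_mpg": [37, 34, 30, 27, 25, 19],
    "price": [7000, 9500, 11000, 15000, 16500, 22000],
})

# Pairwise Pearson correlation coefficients (pandas' default method).
corr = df.corr()

fig, ax = plt.subplots()
mesh = ax.pcolor(corr.to_numpy(), cmap="RdBu_r", vmin=-1, vmax=1)

# Put the variable names at the center of each cell.
centers = [i + 0.5 for i in range(len(corr.columns))]
ax.set_xticks(centers)
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(centers)
ax.set_yticklabels(corr.index)
fig.colorbar(mesh, ax=ax)
```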

### Chi-Square Test for Categorical Variables

**Introduction**

The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. This test is widely used in various fields, including social sciences, marketing, and healthcare, to analyze survey data, experimental results, and observational studies.

**Concept**

The chi-square test is non-parametric. It evaluates whether the frequencies of observed outcomes deviate significantly from the frequencies that would be expected if the variables were independent. The test is grounded in the chi-square distribution, which is applied to count data and helps determine whether the observed deviations could have arisen by random chance.

**Null Hypothesis and Alternative Hypothesis**

The chi-square test involves formulating two hypotheses:

1. Null Hypothesis (H0): Assumes that there is no association between the categorical variables, implying that any observed differences are due to random chance.

2. Alternative Hypothesis (H1): Assumes that there is a significant association between the variables, indicating that the observed differences are not due to chance alone.

**Applications**

1. Market Research: Analyzing the association between customer demographics and product preferences.
2. Healthcare: Studying the relationship between patient characteristics and disease incidence.
3. Social Sciences: Investigating the link between social factors (e.g., education level) and behavioral outcomes (e.g., voting patterns).
4. Education: Examining the connection between teaching methods and student performance.
5. Quality Control: Assessing the association between manufacturing conditions and product defects.

**Practical Example - Weak Correlation**

Suppose a researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The researcher surveys 100 people and records the following data:

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)


![image-2.png](attachment:image-2.png)


![image-3.png](attachment:image-3.png)
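A test of this kind can be sketched with chi2_contingency from scipy.stats. The counts below are hypothetical and were invented for this sketch; they are not the figures from the survey table shown in the images. Yates' continuity correction is switched off so that the statistic matches the plain sum of (observed - expected)² / expected.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts for 100 respondents (NOT the data from the images).
observed = pd.DataFrame(
    [[20, 30], [25, 25]],
    index=["male", "female"],
    columns=["like", "dislike"],
)

# Returns the test statistic, the p-value, the degrees of freedom,
# and the table of expected frequencies under independence.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
print(expected)

# With a 0.05 significance level, fail to reject H0 when p_value >= 0.05.
reject_null = p_value < 0.05
```

For these made-up counts the p-value is well above 0.05, so the sketch would not reject the null hypothesis of independence.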

### Data Visualization commands in Python

Importing libraries

In [1]:
from matplotlib import pyplot as plt
# Alternatively,
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## matplotlib functions

### Standard line plot

The simplest and most fundamental plot is a standard line plot. The function expects two arrays as input, x and y, both of the same size. x is treated as the independent variable and y as the dependent one. The graph is drawn as straight line segments joining consecutive x,y point pairs in the order in which they appear in the arrays, so the data is usually sorted by x first.

In [3]:
plt.plot(x,y)

![image.png](attachment:image.png)

### Scatter plot

A scatter plot presents the relationship between two variables in a data set by drawing the data points on a two-dimensional plane. The independent variable or attribute is plotted on the x-axis, while the dependent variable is plotted on the y-axis.

Scatter plots are used in any of the following situations:

1. When we have paired numerical data
2. When there are multiple values of the dependent variable for a unique value of an independent variable
3. When determining the relationship between variables in some scenarios

Here, x contains the independent variable, and y contains the dependent variable. You have the option to change the size, color, and shape of the markers with additional attributes in the function.
A sample scatter plot is shared below.

In [None]:
plt.scatter(x,y)

![image.png](attachment:image.png)

### Histogram

A histogram is an important visual representation of the distribution of numerical data. To view the data in a "binned" form, we may use the histogram plot with the required number of bins or with the data points that mark the bin edges. The x-axis represents the data bins, and the y-axis represents the number of elements in each of the bins.

An example of a histogram plot is shown below. Use an additional argument, edgecolor, for better clarity of plot.
Consider the graph shown below. The left graph is the histogram plot for a data set, plotted without setting the edgecolor. The right one is the same graph but has the edgecolor argument set as the color black.

In [None]:
plt.hist(x,bins)

![image.png](attachment:image.png)
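The edgecolor comparison described above can be reproduced with a sketch like the following. The data is randomly generated for illustration, and the names ax_left and ax_right are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated sample, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Left: default histogram, bars blend into each other.
ax_left.hist(x, bins=10)
ax_left.set_title("Without edgecolor")

# Right: same data, with black edges separating the bars.
counts, bin_edges, _ = ax_right.hist(x, bins=10, edgecolor="black")
ax_right.set_title("With edgecolor='black'")
```

Passing an integer for bins asks for that many equal-width bins; passing a sequence instead would set the bin edges explicitly.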

### Bar plot

A bar plot is used for visualizing categorical data. The x-axis shows the different categories, while the height of each bar on the y-axis shows a value for that category, such as the number of data points in it or the average of some variable.

Here, x is the categorical variable, and height is the value to plot for each category, for example the number of values belonging to it. You can adjust the width of each bar using an additional width argument in the function.

A sample graph is shown below.

In [None]:
plt.bar(x,height)

![image.png](attachment:image.png)
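As a small sketch, the bar heights can come from counting the categories first. The drive_wheels values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data.
drive_wheels = pd.Series(["fwd", "fwd", "rwd", "rwd", "rwd", "4wd"])

# Number of data points in each category.
counts = drive_wheels.value_counts()

fig, ax = plt.subplots()
# x is the list of categories, height is the count for each one.
bars = ax.bar(list(counts.index), list(counts.values), width=0.6)
ax.set_xlabel("Drive wheels")
ax.set_ylabel("Number of cars")
```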

### Pseudo Color Plot

A pseudocolor plot displays matrix data as an array of colored cells (known as faces). This plot is created as a flat surface in the x-y plane. The surface is defined by a grid of x and y coordinates that correspond to the corners (or vertices) of the faces. Matrix C specifies the colors at the vertices. The color of each face depends on the color of one of its four surrounding vertices. Of the four vertices, the one that comes first in the x-y grid determines the color of the face.

In this course, you use the pcolor plot for visualizing the contents of a pivot table that has been grouped on the basis of 2 parameters. Those parameters then represent the x and y-axis components that create the grid. The values in the pivot table are the average values of a third parameter. These values act as the code for the color the cell is going to take.

You can define an additional **cmap** argument to specify the color scheme of the plot.

Two sample pcolor plots are shown below, created for same data but for different color schemes.

In [None]:
plt.pcolor(C)

![image.png](attachment:image.png)
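The pivot-table use case can be sketched as follows, assuming a made-up data frame grouped on two parameters with the average price as the third. All names and values are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd"],
    "body_style": ["sedan", "sedan", "hatchback", "sedan", "hatchback", "hatchback"],
    "price": [9000.0, 11000.0, 8000.0, 20000.0, 15000.0, 17000.0],
})

# Average price for each combination of the two grouping parameters.
grouped = df.groupby(["drive_wheels", "body_style"], as_index=False)["price"].mean()

# Rows become drive_wheels, columns become body_style.
pivot = grouped.pivot(index="drive_wheels", columns="body_style", values="price")
print(pivot)

fig, ax = plt.subplots()
# cmap selects the color scheme; the cell values drive the colors.
mesh = ax.pcolor(pivot.to_numpy(), cmap="RdBu")
fig.colorbar(mesh, ax=ax)
```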

## seaborn functions

### Regression plot

A regression plot draws a scatter plot of two variables, x and y, then fits a regression model and plots the resulting regression line along with a 95% confidence interval for that regression. The x and y parameters can be given as the names of the data frame columns to use, and the data frame itself is passed to the function as well.

A sample regression plot is shared below.

In [None]:
sns.regplot(x = 'header_1',y = 'header_2',data= df)

![image.png](attachment:image.png)

### Box and whisker plot

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers".

Consider the Box and whisker plot interpretation figure shown below.

![image.png](attachment:image.png)

The whiskers extend from the 25% quartile down to the minimum value and from the 75% quartile up to the maximum value, not counting outliers. The range between the 25% quartile and the 75% quartile is the Inter-Quartile Range (IQR). Points are generally classified as outliers when they lie more than 1.5 times the IQR below the 25% quartile or above the 75% quartile.

A sample box plot is shown below

![image-2.png](attachment:image-2.png)
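A box plot across the levels of a categorical variable could be drawn with seaborn's boxplot function, as in this sketch. The data frame, its column names, and its values are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "drive_wheels": ["fwd"] * 5 + ["rwd"] * 5,
    "price": [7000, 8000, 9000, 9500, 11000,
              15000, 17000, 20000, 24000, 45000],
})

# One box per category on the x-axis, price distribution on the y-axis.
ax = sns.boxplot(x="drive_wheels", y="price", data=df)
ax.set_title("Price by drive wheels")
```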

### Residual Plot

A residual plot is used to display the quality of a regression fit. The residplot function regresses y on x (as a linear regression by default, or as a polynomial regression when the order argument is set) and then draws a scatterplot of the residuals.
Residuals are the differences between the observed values of the dependent variable and the predicted values obtained from the regression model. In other words, a residual is a measure of how much a regression line vertically misses a data point, meaning how far off the predictions are from the actual data points.

In [None]:
sns.residplot(data=df,x='header_1', y='header_2')
#or 
sns.residplot(x=df['header_1'], y=df['header_2'])

![image.png](attachment:image.png)

### KDE plot

A Kernel Density Estimate (KDE) plot draws a smooth curve that estimates the probability density of the data, so the height of the curve at a given value reflects how likely observations are to fall near that value. It is created for a single vector of information. It is used in the course to compare the distribution curve of the actual data with that of the predicted data.

A sample graph made for a random set of values is shown below.

In [None]:
sns.kdeplot(X)

![image.png](attachment:image.png)

### Distribution Plot

This plot has the capacity to combine the histogram and the KDE plots. This plot creates the distribution curve using the bins of the histogram as a reference for estimation. You can optionally keep or discard the histogram from being displayed. In the context of the course, this plot can be used interchangeably with the KDE plot.

Here, keeping the argument hist as True would plot the histogram along with the distribution plot. Both variations are shown in the image below.

In [None]:
sns.distplot(X,hist=False)

![image.png](attachment:image.png)

## Summary

At this point, you know: 

1. Tools like the 'describe' function in pandas can quickly calculate key statistical measures like mean, standard deviation, and quartiles for all numerical variables in your data frame. 

2. Use the 'value_counts' function to summarize data into different categories for categorical data. 

3. Box plots offer a more visual representation of the data's distribution for numerical data, indicating features like the median, quartiles, and outliers.

4. Scatter plots are excellent for exploring relationships between continuous variables, like engine size and price, in a car data set.

5. Use Pandas' 'groupby' method to explore relationships between categorical variables.

6. Use pivot tables and heat maps for better data visualizations.

7. Correlation between variables is a statistical measure that indicates how the changes in one variable might be associated with changes in another variable.

8. When exploring correlation, use scatter plots combined with a regression line to visualize relationships between variables.

9. Visualization functions like regplot, from the seaborn library, are especially useful for exploring correlation.

10. The Pearson correlation, a key method for assessing the correlation between continuous numerical variables, provides two critical values—the coefficient, which indicates the strength and direction of the correlation, and the P-value, which assesses the certainty of the correlation.

11. A correlation coefficient close to 1 or -1 indicates a strong positive or negative correlation, respectively, while one close to zero suggests no correlation.

12. For P-values, values less than .001 indicate strong certainty in the correlation, while larger values indicate less certainty. Both the coefficient and P-value are important for confirming a strong correlation.

13. Heatmaps provide a comprehensive visual summary of the strength and direction of correlations among multiple variables.