# Correlation and P Value Concepts
Correlation is a statistical measure of the extent to which two variables vary together, which makes it useful in feature selection. The most common measure is the Pearson Correlation Coefficient, which quantifies the strength and direction of a linear relationship; its usual significance test assumes that the attributes involved are approximately normally distributed. The mathematical representation for the Pearson Correlation Coefficient is as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
where:

* $r$ is the Pearson Correlation Coefficient
* $n$ is the number of data points
* $x_i$ is the value of the first variable for the $i^{th}$ data point
* $y_i$ is the value of the second variable for the $i^{th}$ data point
* $\bar{x}$ is the mean of the first variable
* $\bar{y}$ is the mean of the second variable
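
The formula can be checked by hand with a few lines of NumPy. The following is a minimal sketch, not the notebook's own code: the helper name `pearson_r` is hypothetical, and the inputs are only the five `age` and `charges` values visible in the `df.head()` output of this notebook, so the result says nothing about the full dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, computed term by term from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # x_i - x_bar
    dy = y - y.mean()  # y_i - y_bar
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# The five rows shown by df.head(); far too few to draw conclusions from.
age = [19, 18, 28, 33, 32]
charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

r = pearson_r(age, charges)
print(r)
```

The result should agree with `np.corrcoef(age, charges)[0, 1]`, which is a convenient cross-check.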


In [1]:
import pandas as pd
df = pd.read_csv('../../datasets/insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


| Label data type | Feature data type | Effect size stat        | Visualization     |
|-----------------|-------------------|-------------------------|-------------------|
| Numeric         | Numeric           | Pearson Correlation (r) | Scatterplot       |
| Numeric         | Categorical       | One-way ANOVA (F)       | Boxplot           |
| Categorical     | Categorical       | Pearson Chi-Square      | CrossTab/Barplot  |
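
For the numeric label / categorical feature row, the one-way ANOVA F statistic compares the variation between group means with the variation within groups. The sketch below computes it from scratch with NumPy; the helper name `one_way_anova_f` is hypothetical, and the groups are just the `charges` values from the five rows shown above, split by `region`, so the number it prints is illustrative only.

```python
import numpy as np

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return float((ss_between / (k - 1)) / (ss_within / (n_total - k)))

# charges grouped by region, taken from the five rows of df.head()
charges_by_region = {
    "southwest": [16884.924],
    "southeast": [1725.5523, 4449.462],
    "northwest": [21984.47061, 3866.8552],
}

f_stat = one_way_anova_f(list(charges_by_region.values()))
print(f_stat)
```

A larger F means the group means differ more than the within-group spread alone would suggest.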

# P-Value Concepts
The p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. A small p-value (typically ≤ 0.05) is taken as evidence against the null hypothesis, so the null hypothesis is rejected. A large p-value does not prove the null hypothesis; it only means the data do not provide enough evidence to reject it.
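
One way to make this definition concrete is a permutation test: shuffle one variable to simulate the null hypothesis of no association, and count how often the shuffled correlation is at least as extreme as the observed one. This is a sketch under stated assumptions, not the test the notebook uses: the function name `permutation_p_value` is hypothetical, and the data are synthetic values generated only for illustration.

```python
import numpy as np

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link to x, which simulates the null hypothesis.
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself, so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)

# Synthetic, made-up data: y depends linearly on x plus noise.
rng = np.random.default_rng(42)
x = np.arange(30, dtype=float)
y = 2.0 * x + rng.normal(0.0, 5.0, size=30)

p_value = permutation_p_value(x, y)
print(p_value)
```

Because `y` is built from `x`, almost no shuffle reproduces a correlation as strong as the observed one, so the p-value comes out far below 0.05.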

# Bivariate: Numerical vs Numerical: Stats
Bivariate statistics describe the relationship between two variables. For two numeric variables, the most common statistic is the Pearson Correlation Coefficient, which measures the strength and direction of their linear relationship.



# Independent and Dependent Variables
Independent Variables
-  age, sex, bmi, children, smoker, region

Dependent Variables
- charges

## Effect Size
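Effect size is a measure of the strength of the relationship between two variables, or of the size of the difference between two groups. It complements the p-value: the p-value indicates whether an effect is likely to be real, while the effect size indicates how large it is. For the difference between the means of two groups, the most common effect size measure is Cohen's d, which is calculated as follows:
$$d = \frac{\bar{x} - \bar{y}}{s}$$
where $\bar{x}$ and $\bar{y}$ are the means of the two groups and $s$ is their pooled standard deviation.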
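
A minimal sketch of this calculation follows, taking $s$ to be the pooled sample standard deviation of the two groups. The function name `cohens_d` and the two small groups are hypothetical and exist only to illustrate the arithmetic.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Made-up groups: means 4 and 3, each with sample variance 4, so d = 1 / 2.
group_a = [2, 4, 6]
group_b = [1, 3, 5]
print(cohens_d(group_a, group_b))  # 0.5
```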
    

# Effect Size: Pearson Correlation (r)
The Pearson Correlation Coefficient is itself an effect size: it measures the strength and direction of the linear relationship between two numeric variables and ranges from -1 to 1. It is calculated as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

* small effect size: $.10 \le |r| < .30$
* medium effect size: $.30 \le |r| < .50$
* large effect size: $.50 \le |r| \le 1.0$
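
These rules of thumb are easy to apply programmatically. The helper below is a sketch, not part of the notebook: the name `effect_size_label` is hypothetical, the intervals are treated as half-open so that no value falls between two bands, and the "negligible" label for magnitudes below .10 is an added assumption. The inputs are the correlations with `charges` reported in the correlation matrix in this notebook.

```python
def effect_size_label(r):
    """Map a correlation coefficient to a rule-of-thumb effect size label."""
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude >= 0.50:
        return "large"
    if magnitude >= 0.30:
        return "medium"
    if magnitude >= 0.10:
        return "small"
    return "negligible"  # assumed label for values below the "small" band

# Correlations with charges, copied from the df.corr() output
correlations_with_charges = {"age": 0.299008, "bmi": 0.198341, "children": 0.067998}

labels = {name: effect_size_label(r) for name, r in correlations_with_charges.items()}
print(labels)  # {'age': 'small', 'bmi': 'small', 'children': 'negligible'}
```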

In [2]:
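# Pearson correlation between the numeric columns.
# numeric_only=True (pandas >= 1.5) skips sex, smoker and region;
# without it, pandas >= 2.0 raises an error on those text columns.
df.corr(numeric_only=True)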

Unnamed: 0,age,bmi,children,charges
age,1.0,0.109272,0.042469,0.299008
bmi,0.109272,1.0,0.012759,0.198341
children,0.042469,0.012759,1.0,0.067998
charges,0.299008,0.198341,0.067998,1.0
