# Day 15 - Bivariate Stats: Numeric to Numeric

Bivariate statistics deal with **relationships**. A relationship refers to how one feature changes as another feature changes. It does not imply causation.

We divide the variables into two roles. An **independent variable** represents a potential cause and acts as a predictor. A **dependent variable** is a potential effect that we want to predict or explain.

Any data field used to explain or predict another variable (label) is called a **feature**.

In bivariate statistics, we look for an **effect size**, which captures the strength and direction of a relationship.

In [1]:
import pandas as pd

df = pd.read_csv('data/insurance.csv')

Which features can we use for a numeric-to-numeric comparison? Check the data type of each column:

In [2]:
print(df.head())

df.dtypes

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
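
The numeric columns can also be picked out programmatically rather than by reading the `dtypes` listing. The sketch below is an illustration only: it rebuilds the five preview rows shown above as a stand-in for `data/insurance.csv` (the name `sample` is hypothetical) and uses `select_dtypes` to keep the integer and float columns.

```python
import pandas as pd

# Rebuild the five preview rows shown above (a stand-in for data/insurance.csv)
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.900, 33.770, 33.000, 22.705, 28.880],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Keep only the int/float columns; text columns are dropped
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['age', 'bmi', 'children', 'charges']
```

On the full dataset the same call would be `df.select_dtypes(include='number')`.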

## Pearson Correlation (r)

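[Pearson correlation](https://www.scribbr.com/statistics/pearson-correlation-coefficient/) is a statistical measure of effect size that indicates the strength and direction of the linear relationship between two numeric variables. It ranges from -1 to 1, and the sign gives the direction of the relationship.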

| effect size | absolute value of r |
| -- | -- |
| small effect size | .10 to .29 |
| medium effect size | .30 to .49 |
| large effect size | .50 or greater |
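
The table can be turned into a small helper for labeling results. This is a minimal sketch, not part of the original notebook: the function name `effect_size_label` is hypothetical, it assumes the cutoffs apply to the absolute value of r with boundaries at .10, .30, and .50, and the `'negligible'` label for values below .10 is an added assumption because the table has no row for them.

```python
def effect_size_label(r):
    """Map a correlation coefficient to a rough effect-size label."""
    magnitude = abs(r)  # direction does not affect strength
    if magnitude >= 0.50:
        return 'large'
    if magnitude >= 0.30:
        return 'medium'
    if magnitude >= 0.10:
        return 'small'
    return 'negligible'  # assumed label; the table has no row below .10


# Illustrative values only
for r in (0.93, 0.299, -0.202, 0.068):
    print(r, effect_size_label(r))
```

Note that a negative r is labeled by its magnitude, so -0.202 counts as a small effect.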

#### Assumptions
1. Normality
2. Continuous data
3. Linear relationship
4. Homoscedastic relationship


In [3]:
import numpy as np

height = [60, 62, 65, 68, 70, 74]
weight = [140, 138, 150, 166, 190, 250]

print(np.corrcoef(height, weight))
print('') # print a line break
# print only the r rather than the whole matrix; round to 2 decimals
print(round(np.corrcoef(height, weight)[0][1], 2))

[[1.         0.92989745]
 [0.92989745 1.        ]]

0.93
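
To see what `np.corrcoef` is computing, the same value can be reproduced from the definition of Pearson's r: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The helper name `pearson_r` below is hypothetical and the code is a plain-Python sketch for illustration, not a replacement for the library call.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


height = [60, 62, 65, 68, 70, 74]
weight = [140, 138, 150, 166, 190, 250]

print(round(pearson_r(height, weight), 2))  # 0.93
```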


In [4]:

num_df = df.loc[:, ['age', 'bmi', 'children', 'charges']]
num_df.corr()

| | age | bmi | children | charges |
| -- | -- | -- | -- | -- |
| age | 1.0 | 0.109272 | 0.042469 | 0.299008 |
| bmi | 0.109272 | 1.0 | 0.012759 | 0.198341 |
| children | 0.042469 | 0.012759 | 1.0 | 0.067998 |
| charges | 0.299008 | 0.198341 | 0.067998 | 1.0 |


In [5]:
df.charges.corr(df.age)

0.2990081933306478

## P-value

In addition to effect size, we commonly calculate a second statistic that tells us how easily an effect of the observed size could arise from random chance alone.

By default, we assume that there is no relationship between the variables (the **null hypothesis**). In a sample, however, we often see trends in which a change in one variable accompanies a change in the other. How strong do those trends have to be before we reject the null hypothesis? Answering that question is the role of the p-value.

**P-value** is the probability of obtaining results at least as extreme as those observed in the sample, assuming that there is actually no relationship. A small p-value (conventionally below .05) is taken as evidence against the null hypothesis.
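
One way to make this definition concrete is a permutation test. If there is truly no relationship, any pairing of the two variables is equally likely, so we can re-pair the values in every possible order and count how often the correlation is at least as extreme as the one observed. The sketch below applies this to the small height and weight sample from the Pearson section; the helper `pearson_r` is hypothetical, and this is an illustration of the idea rather than the method `scipy` uses.

```python
import math
from itertools import permutations


def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


height = [60, 62, 65, 68, 70, 74]
weight = [140, 138, 150, 166, 190, 250]
r_obs = pearson_r(height, weight)

# Enumerate all 6! = 720 ways of pairing the weights with the heights
total = 0
as_extreme = 0
for shuffled in permutations(weight):
    total += 1
    # Two-sided: compare magnitudes; the tolerance guards against float round-off
    if abs(pearson_r(height, shuffled)) >= abs(r_obs) - 1e-12:
        as_extreme += 1

p_value = as_extreme / total
print(p_value)
```

`stats.pearsonr` computes its p-value analytically by default rather than by shuffling, so its number need not match this one exactly.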

In [7]:
from scipy import stats

corr = stats.pearsonr(df.charges, df.age)
corr

PearsonRResult(statistic=0.29900819333064743, pvalue=4.886693331718505e-29)

In [8]:
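# Extract r and the p-value from the result, which can be indexed like a tuple
print('r: \t' + str(round(corr[0], 4)))
# The p-value (about 4.9e-29) is so small that rounding to 4 decimals prints 0.0
print('p-value:' + str(round(corr[1], 4)))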

r: 	0.299
p-value:0.0
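
A printed p-value of 0.0 is a rounding artifact, not a true zero. As a small illustration, scientific notation keeps the magnitude visible; the literal below is the p-value reported by `stats.pearsonr` above.

```python
p = 4.886693331718505e-29  # p-value reported by stats.pearsonr above

print(round(p, 4))   # 0.0 (the magnitude is lost)
print(f'{p:.3e}')    # 4.887e-29
print(p < 0.001)     # True
```

When reporting results, very small values are often written as "p < .001" instead of as a rounded zero.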


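Let’s create a loop that calculates r and the p-value between every numeric column and the `charges` label. The label itself is numeric, so it is included and shows a perfect correlation of 1.0 with itself.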

In [9]:
# Create an empty DataFrame to store the correlations and p-values
corr_df = pd.DataFrame(columns=['r', 'p-value'])

for col in df:  # Use this to loop through the insurance.csv DataFrame
    if pd.api.types.is_numeric_dtype(df[col]): # Only calculate r, p-value for the numeric columns
        r, p = stats.pearsonr(df.charges, df[col])
        corr_df.loc[col] = [round(r, 3), round(p, 3)]

corr_df.sort_values(by=['r'], ascending=False)

| | r | p-value |
| -- | -- | -- |
| charges | 1.0 | 0.0 |
| age | 0.299 | 0.0 |
| bmi | 0.198 | 0.0 |
| children | 0.068 | 0.013 |


In [10]:
# Function to calculate bivariate stats (r and p-value) against a label
def bivariate_stats(df, label):
    """Return Pearson r and p-value between `label` and every other numeric column."""
    corr_df = pd.DataFrame(columns=['r', 'p-value'])

    for col in df:
        # Skip non-numeric columns and the label itself
        if pd.api.types.is_numeric_dtype(df[col]) and col != label:
            r, p = stats.pearsonr(df[label], df[col])
            corr_df.loc[col] = [round(r, 3), round(p, 3)]

    return corr_df.sort_values(by=['r'], ascending=False)

In [11]:
bivariate_stats(df, 'charges')

| | r | p-value |
| -- | -- | -- |
| age | 0.299 | 0.0 |
| bmi | 0.198 | 0.0 |
| children | 0.068 | 0.013 |


In [12]:
bikedf = pd.read_csv('data/bikebuyers.csv')
bivariate_stats(bikedf, 'PurchaseBikeNumeric')

| | r | p-value |
| -- | -- | -- |
| EducationNumeric | 0.141 | 0.0 |
| ID | 0.056 | 0.075 |
| Income | 0.042 | 0.181 |
| GenderNumeric | 0.011 | 0.721 |
| HomeOwnerNumeric | -0.019 | 0.542 |
| Age | -0.106 | 0.001 |
| MaritalStatusNumeric | -0.109 | 0.001 |
| Children | -0.122 | 0.0 |
| CommuteDistanceNumeric | -0.141 | 0.0 |
| Cars | -0.202 | 0.0 |
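
Because the results are sorted by signed r, the strongest negative relationship (`Cars`) ends up at the bottom even though it has the largest magnitude in the table. One possible variation, sketched here as an assumption rather than as part of the original function, is to order the rows by the absolute value of r. The r values are copied from the output above, and the name `by_strength` is hypothetical.

```python
import pandas as pd

# r values copied from the bikebuyers output above
corr_df = pd.DataFrame(
    {'r': [0.141, 0.056, 0.042, 0.011, -0.019,
           -0.106, -0.109, -0.122, -0.141, -0.202]},
    index=['EducationNumeric', 'ID', 'Income', 'GenderNumeric',
           'HomeOwnerNumeric', 'Age', 'MaritalStatusNumeric', 'Children',
           'CommuteDistanceNumeric', 'Cars'],
)

# Rank rows by the magnitude of r, ignoring its sign
order = corr_df['r'].abs().sort_values(ascending=False).index
by_strength = corr_df.reindex(order)

print(by_strength.head(3))
```

With this ordering, `Cars` moves to the top, followed by the two features whose r has a magnitude of 0.141.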
