## 1 - Set Up Environment

In [10]:
import numpy as np
import pandas as pd
import pickle

<hr>

## 2 - Load Dataframe

In [12]:
with open('../Assets/Version 1-4.pickle', 'rb') as file:
    df = pickle.load(file)

df.head()

| | PassengerId | Name | Ticket | Age | Parch | Fare | Pclass | Sex | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | A/5 21171 | 22.0 | 0 | 7.25 | 3 | male | Unknown | S | 0 |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | PC 17599 | 38.0 | 0 | 71.2833 | 1 | female | C | C | 1 |
| 2 | 3 | Heikkinen, Miss. Laina | STON/O2. 3101282 | 26.0 | 0 | 7.925 | 3 | female | Unknown | S | 1 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 113803 | 35.0 | 0 | 53.1 | 1 | female | C | S | 1 |
| 4 | 5 | Allen, Mr. William Henry | 373450 | 35.0 | 0 | 8.05 | 3 | male | Unknown | S | 0 |


<hr>

## 3 - Pearson: Numerical Variable

To measure the linear relationship between the numerical variables, we use the Pearson correlation coefficient.

In [14]:
pearson_correlation = df[['Age' , 'Parch' , 'Fare']].corr()
pearson_correlation

| | Age | Parch | Fare |
|---|---|---|---|
| Age | 1.0 | -0.124444 | 0.177769 |
| Parch | -0.124444 | 1.0 | 0.222327 |
| Fare | 0.177769 | 0.222327 | 1.0 |


*   Age ~ Parch : weak negative (-0.12)
*   Age ~ Fare : weak positive (0.18)
*   Parch ~ Fare : weak positive (0.22)
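
As a rough illustration of what `corr()` computes for each pair, the sketch below evaluates the Pearson coefficient directly from its definition (covariance divided by the product of the standard deviations). The helper name `pearson_r` and the sample values are made up for this example and are not part of the notebook or the Titanic data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up sample values, not taken from the dataset
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(round(r, 3))  # 0.8
```

The coefficient always lies between -1 and 1; values near 0, such as those in the matrix above, indicate little linear association.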

<hr>

## 4 - Chi Square: Categorical Variable

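To test whether two categorical variables are related, we use the chi-square test of independence.
* Null hypothesis: the two variables are independent (there is no relationship between them)
* Alternative hypothesis: the two variables are not independent (there is a significant relationship between them)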

In [27]:
from scipy.stats import chi2_contingency

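def chi_square(a, b):
    """Run a chi-square test of independence on columns a and b of df."""
    # Cross-tabulate the two columns into a table of observed counts
    contingency_table = pd.crosstab(df[a], df[b])
    stat, p, dof, expected = chi2_contingency(contingency_table)

    print(f"Chi-squared: {stat}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")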
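
To make the statistic less of a black box, here is a minimal sketch of how the chi-square value and degrees of freedom come out of a contingency table: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The function name `chi_square_stat` and the 2×2 counts are hypothetical. Note that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so its result for a 2×2 table will differ somewhat from this uncorrected version.

```python
def chi_square_stat(table):
    """Return (statistic, dof) for a contingency table given as a list of rows.

    Illustrative helper: no continuity correction, no p-value.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    # Degrees of freedom: (rows - 1) * (columns - 1)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Made-up counts, not taken from the dataset
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square_stat(observed)
print(stat, dof)
```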

Now we use a for loop to run the test on every pair of categorical variables:

In [32]:
qualitative_list = ['Sex' , 'Survived' , 'Cabin' , 'Embarked' , 'Pclass']

for i in range(len(qualitative_list)):
    for j in range(i+1, len(qualitative_list)):
        a = qualitative_list[i]
        b = qualitative_list[j]
        print(a , " ~ " , b)
        chi_square(a , b)
        print()

Sex  ~  Survived
Chi-squared: 614.1656258936159
P-value: 1.3890628126435168e-135
Degrees of freedom: 1

Sex  ~  Cabin
Chi-squared: 40.8529447444464
P-value: 2.2208180154061487e-06
Degrees of freedom: 8

Sex  ~  Embarked
Chi-squared: 19.469838232597045
P-value: 5.91804612917703e-05
Degrees of freedom: 2

Sex  ~  Pclass
Chi-squared: 19.323683653010775
P-value: 6.366715003095156e-05
Degrees of freedom: 2

Survived  ~  Cabin
Chi-squared: 92.8391184966289
P-value: 1.23182785274672e-16
Degrees of freedom: 8

Survived  ~  Embarked
Chi-squared: 24.547760057729995
P-value: 4.672202225231174e-06
Degrees of freedom: 2

Survived  ~  Pclass
Chi-squared: 89.72803168407883
P-value: 3.2794837536918885e-20
Degrees of freedom: 2

Cabin  ~  Embarked
Chi-squared: 128.0895108725087
P-value: 1.506745029609524e-19
Degrees of freedom: 16

Cabin  ~  Pclass
Chi-squared: 931.584253653275
P-value: 4.903722448794014e-188
Degrees of freedom: 16

Embarked  ~  Pclass
Chi-squared: 205.72649157662522
P-value: 2.2055751

The results show that the p-value for every pairwise test is below 0.05. We therefore reject the null hypothesis in each case: there is a statistically significant relationship between each pair of categorical variables.
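
The ten comparisons above are all the unordered pairs of the five categorical columns. As a small sketch, the same enumeration can be written with `itertools.combinations` instead of index arithmetic; only the pair generation is shown here, and the call to `chi_square` is left out so the snippet runs on its own.

```python
from itertools import combinations

qualitative_list = ['Sex', 'Survived', 'Cabin', 'Embarked', 'Pclass']

# Every unordered pair, in the same order as the nested index loops
pairs = list(combinations(qualitative_list, 2))

for a, b in pairs:
    print(a, "~", b)
```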

<hr>

## 5 - ANOVA Test: Categorical vs. Numerical

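To test whether a numerical variable differs across the levels of a categorical variable, we use a one-way ANOVA test.
* Null hypothesis: the mean of the numerical variable is the same in every category (there is no relationship between the two variables)
* Alternative hypothesis: at least one category mean differs (there is a significant relationship between the two variables)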
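
As a sketch of what the F statistic measures, the code below computes a one-way ANOVA by hand: the between-group sum of squares captures how far each group mean lies from the grand mean, the within-group sum of squares captures the spread inside each group, and F is the ratio of the two after dividing each by its degrees of freedom. The function name `one_way_anova_f` and the three groups are invented for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups (illustrative helper)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up groups, not taken from the dataset
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

One caveat, stated as an assumption about the data types: if a grouping column such as `Pclass` is stored as integers, a formula like `Fare ~ Pclass` treats it as a numeric regressor with one degree of freedom, whereas `Fare ~ C(Pclass)` would treat it as categorical.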

In [46]:
qualitative_list = ['Sex' , 'Survived' , 'Cabin' , 'Embarked' , 'Pclass']
quantitative_list = ['Age' , 'Parch' , 'Fare']

import statsmodels.api as sm
from statsmodels.formula.api import ols

for qualitative_var in qualitative_list:
    for quantitative_var in quantitative_list:
        # Formulate the ANOVA model formula
        formula = f'{quantitative_var} ~ {qualitative_var}'

        # Fit the ANOVA model
        model = ols(formula, data = df).fit()
        x = sm.stats.anova_lm(model, typ = 2)

        # Print the ANOVA results
        print(f" {qualitative_var} ~ {quantitative_var}")
        print(x)
        print()
        print()

 Sex ~ Age
                 sum_sq      df         F    PR(>F)
Sex          668.786362     1.0  4.055455  0.044234
Residual  215043.055139  1304.0       NaN       NaN


 Sex ~ Parch
              sum_sq      df          F        PR(>F)
Sex        44.932317     1.0  62.693901  5.129917e-15
Residual  934.568448  1304.0        NaN           NaN


 Sex ~ Fare
                sum_sq      df          F        PR(>F)
Sex       1.184396e+05     1.0  45.712906  2.060148e-11
Residual  3.378591e+06  1304.0        NaN           NaN


 Survived ~ Age
                 sum_sq      df         F    PR(>F)
Survived     433.535045     1.0  2.626041  0.105365
Residual  215278.306456  1304.0       NaN       NaN


 Survived ~ Parch
              sum_sq      df          F    PR(>F)
Survived   11.791271     1.0  15.888878  0.000071
Residual  967.709494  1304.0        NaN       NaN


 Survived ~ Fare
                sum_sq      df          F        PR(>F)
Survived  1.886885e+05     1.0  74.372534  1.847173e-17