### Fisher Score: Chi-Square Test for Feature Selection

Compute the chi-squared statistic between each non-negative feature and the class label.

- This score should be used to evaluate categorical variables in a classification task.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.
The Chi Square statistic is commonly used for testing relationships between categorical variables.
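As a rough illustration of the selection step described above, the sketch below pairs `chi2` with scikit-learn's `SelectKBest` on a small synthetic dataset. The data, the column layout, and the choice of `k=1` are assumptions made for this example only; they are not part of the Titanic walkthrough.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative integer features (hypothetical data for illustration)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# One feature whose values depend strongly on the class, two that do not
informative = y * 3 + rng.integers(0, 2, size=200)
noise_a = rng.integers(0, 5, size=200)
noise_b = rng.integers(0, 5, size=200)
X = np.column_stack([noise_a, informative, noise_b])

# Keep the single feature with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X, y)

# Boolean mask marking which input columns were kept
selected_mask = selector.get_support()
print(selected_mask)
```

In practice `k` would be tuned, for example by cross-validation, rather than fixed in advance.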

It compares the observed distribution of the different classes of target Y among the different categories of the feature, against the expected distribution of the target classes, regardless of the feature categories.
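To make the observed-versus-expected comparison concrete, here is a minimal sketch that computes the textbook Pearson chi-square statistic by hand with NumPy. The 2x2 table of counts is made up for illustration and is not taken from the Titanic data.

```python
import numpy as np

# Hypothetical contingency table: rows = feature categories, columns = target classes
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Counts expected if the feature and the target were independent
expected = row_totals @ col_totals / grand_total

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi_square = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(chi_square)
```

The larger the gap between the observed and expected counts, the larger the statistic and the stronger the evidence against independence.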

#### YouTube Videos

Statistical test: https://www.youtube.com/watch?v=4-rxTA_5_xA

https://www.youtube.com/watch?v=YrhlQB3mQFI

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
df=pd.read_csv(r'C:\Users\TEJKIRAN\Desktop\DataAnalytics_files\titanic.csv')

In [2]:
df.head()

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Name         891 non-null    object 
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
 11  Survived     891 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
## Keep only the categorical features and the target.
## .copy() avoids a SettingWithCopyWarning on the assignments below.
df=df[['Sex','Embarked','Pclass','Survived']].copy()
df.head()

Unnamed: 0,Sex,Embarked,Pclass,Survived
0,male,S,3,0
1,female,C,1,1
2,female,S,3,1
3,female,S,1,1
4,male,S,3,0


In [5]:
### Label-encode the Sex column: male -> 1, anything else -> 0
df['Sex']=np.where(df['Sex']=="male",1,0)
df.head()

Unnamed: 0,Sex,Embarked,Pclass,Survived
0,1,S,3,0
1,0,C,1,1
2,0,S,3,1
3,0,S,1,1
4,1,S,3,0


In [6]:
df["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [7]:
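### Label-encode Embarked in order of first appearance (S=0, C=1, Q=2).
### unique() also returns NaN, so NaN becomes a key in the mapping as well.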
ordinal_label = {k: i for i, k in enumerate(df['Embarked'].unique(), 0)}
df['Embarked'] = df['Embarked'].map(ordinal_label)
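Because `Embarked` contains missing values, one possible variant is to give them an explicit label before encoding. The sketch below is an alternative, not the notebook's method: it uses `fillna` and `pd.factorize` on a small hypothetical Series, and the `'Missing'` label is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the Embarked column, including a missing value
embarked = pd.Series(['S', 'C', 'Q', np.nan, 'S'])

# Give missing values an explicit category so they do not stay NaN
filled = embarked.fillna('Missing')

# factorize assigns integer codes in order of first appearance
codes, categories = pd.factorize(filled)
print(codes)
print(list(categories))
```

Filling first matters here: `pd.factorize` would otherwise code missing values as -1, and `chi2` rejects negative inputs.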

In [8]:
df.head()

Unnamed: 0,Sex,Embarked,Pclass,Survived
0,1,0,3,0
1,0,1,1,1
2,0,0,3,1
3,0,0,1,1
4,1,0,3,0


In [9]:
### A train/test split is usually done to avoid overfitting
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df[['Sex','Embarked','Pclass']],
                                              df['Survived'],test_size=0.3,random_state=100)

In [10]:
X_train.head()

Unnamed: 0,Sex,Embarked,Pclass
69,1,0,3
85,0,0,3
794,1,0,3
161,0,0,2
815,1,0,1


In [11]:
X_train['Sex'].unique()

array([1, 0])

In [12]:
X_train.isnull().sum()

Sex         0
Embarked    0
Pclass      0
dtype: int64

In [47]:
## Perform the chi2 test
### chi2 returns two arrays:
### the chi-squared statistics and the p-values
from sklearn.feature_selection import chi2
f_p_values=chi2(X_train,y_train)

In [48]:
f_p_values

(array([65.67929505,  7.55053653, 21.97994154]),
 array([5.30603805e-16, 5.99922095e-03, 2.75514881e-06]))
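One caveat worth keeping in mind: scikit-learn's `chi2` treats each column as non-negative counts, so the integer codes chosen for a multi-category feature influence its score. A possible alternative, sketched below on a tiny made-up frame, is to one-hot encode the category with `pd.get_dummies` first so that each level gets its own score. The toy data and the `Embarked_` prefix are assumptions of this example.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Tiny hypothetical dataset (not the real Titanic rows)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S', 'C', 'S', 'Q', 'C'],
    'Survived': [0, 1, 0, 0, 1, 0, 1, 1],
})

# One 0/1 indicator column per category level
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked').astype(int)

# chi2 now scores each level separately
scores, p_vals = chi2(dummies, toy['Survived'])
result = pd.DataFrame({'chi2': scores, 'p_value': p_vals}, index=dummies.columns)
print(result)
```

With indicator columns, the scores no longer depend on an arbitrary ordering of the category codes.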

In [49]:
import pandas as pd
p_values=pd.Series(f_p_values[1])
p_values.index=X_train.columns
p_values

Sex         5.306038e-16
Embarked    5.999221e-03
Pclass      2.755149e-06
dtype: float64

In [50]:
## Sort by p-value (smallest first); sort_index would order by column name instead
p_values.sort_values(ascending=True)

Sex         5.306038e-16
Pclass      2.755149e-06
Embarked    5.999221e-03
dtype: float64

### Observation
Sex has the smallest p-value, so it shows the strongest dependence on the output feature Survived, followed by Pclass and then Embarked.
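The ranking can also be turned into a simple filter. The sketch below reuses the p-values printed above and keeps the features that fall under a significance level; the threshold `alpha = 0.05` is a conventional choice assumed here, not something the notebook specifies.

```python
import pandas as pd

# p-values copied from the chi2 output above
p_values = pd.Series(
    [5.30603805e-16, 5.99922095e-03, 2.75514881e-06],
    index=['Sex', 'Embarked', 'Pclass'],
)

alpha = 0.05  # assumed significance level

# Keep features whose p-value is below alpha, most significant first
significant = p_values[p_values < alpha].sort_values()
print(significant)
```

With this threshold all three features are retained, in the order Sex, Pclass, Embarked.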