## Fisher Score - Chi-Square Test for Feature Selection
Compute the chi-squared statistic between each non-negative feature and the class label.

* Use this score to evaluate categorical variables in a classification task.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.
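
The selection step described above can be sketched with scikit-learn's `SelectKBest` wrapper. The toy count matrix and the choice of `k=2` are made up for illustration and are not part of the Titanic example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative data (made up for illustration): 6 samples, 3 count features.
X = np.array([
    [5, 1, 0],
    [4, 0, 1],
    [6, 1, 0],
    [0, 1, 5],
    [1, 0, 4],
    [0, 1, 6],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Keep the k=2 features with the largest chi-squared statistic.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # one chi-squared statistic per column
print(selector.get_support(indices=True))  # indices of the columns that were kept
print(X_selected.shape)
```

The middle column is distributed identically in both classes, so it receives the lowest score and is the one dropped.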

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are most likely to be independent of the class and therefore irrelevant for classification. The chi-square statistic is commonly used to test for relationships between categorical variables.

It compares the observed distribution of the different classes of target Y among the different categories of the feature, against the expected distribution of the target classes, regardless of the feature categories.
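
As a minimal sketch of this observed-versus-expected comparison, the snippet below builds a contingency table for a made-up toy sample (not the Titanic data), computes the expected counts under independence, and cross-checks the result against `scipy.stats.chi2_contingency`. This is the textbook contingency-table calculation; the numbers returned by scikit-learn's `chi2` are computed differently and need not match it exactly.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data: one categorical feature and a binary target.
toy = pd.DataFrame({
    "sex":      ["male"] * 10 + ["female"] * 10,
    "survived": [0] * 8 + [1] * 2 + [0] * 3 + [1] * 7,
})

# Observed counts: rows are feature categories, columns are target classes.
observed = pd.crosstab(toy["sex"], toy["survived"])

# Expected counts if feature and target were independent:
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1).to_numpy().reshape(-1, 1)
col_totals = observed.sum(axis=0).to_numpy().reshape(1, -1)
expected = row_totals * col_totals / observed.to_numpy().sum()

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2_stat = ((observed.to_numpy() - expected) ** 2 / expected).sum()

# Cross-check against SciPy (correction=False disables Yates' continuity correction).
scipy_stat, p_value, dof, scipy_expected = chi2_contingency(observed, correction=False)
print(chi2_stat, scipy_stat, p_value)
```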


In [12]:
import seaborn as sns
import numpy as np

df=sns.load_dataset('titanic')

In [13]:
df.sample(6)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
489,1,3,male,9.0,1,1,15.9,S,Third,child,False,,Southampton,yes,False
574,0,3,male,16.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
555,0,1,male,62.0,0,0,26.55,S,First,man,True,,Southampton,no,True
338,1,3,male,45.0,0,0,8.05,S,Third,man,True,,Southampton,yes,True
750,1,2,female,4.0,1,1,23.0,S,Second,child,False,,Southampton,yes,False
482,0,3,male,50.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [14]:
# Consider categorical features to compare with the target ["survived"]
# ['sex','embarked','alone','pclass','survived']

# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
df=df[['sex','embarked','alone','pclass','survived']].copy()
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


#### Before applying the chi-square test, label-encode the categorical features

In [15]:
df['sex']=np.where(df['sex']=="male",1,0)   # male=1 ; female=0
df['alone']=df['alone'].astype(int)          # True=1 ; False=0

# label encoding on embarked; missing values (NaN) get their own code
ordinal_label = {k: i for i, k in enumerate(df['embarked'].unique(), 0)}
print(f"Ordinal Labels: {ordinal_label}")
print("-"*50)
df['embarked'] = df['embarked'].map(ordinal_label)

df.head()

Ordinal Labels: {'S': 0, 'C': 1, 'Q': 2, nan: 3}
--------------------------------------------------


Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,0,3,0
1,0,1,0,1,1
2,0,0,1,3,1
3,0,0,0,1,1
4,1,0,1,3,0
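
One caveat with integer codes: scikit-learn's `chi2` sums the feature values per class as if they were counts, so the score of a label-encoded column such as `embarked` depends on which integer each category happens to receive. A possible alternative, sketched below on a made-up toy frame, is to one-hot encode the column so that each category gets its own 0/1 indicator. The toy data and the `embarked_` prefix are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Made-up toy frame that mimics the columns used above.
toy = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q", "C", "S", "Q", "C"],
    "survived": [0, 1, 0, 0, 1, 0, 1, 1],
})

# One indicator (0/1) column per port, so no artificial order is implied.
X_onehot = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=int)

scores, p_values = chi2(X_onehot, toy["survived"])
print(pd.DataFrame({"chi2": scores, "p_value": p_values}, index=X_onehot.columns))
```

The trade-off is that each category is scored separately, so a single per-feature score is no longer available.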


In [16]:
from sklearn.model_selection import train_test_split
X = df[['sex','embarked','alone','pclass']]
y = df['survived']

X_train,X_test, y_train,y_test=train_test_split(X, y, test_size=0.3, random_state=100)

In [17]:
X_train.isnull().sum()

sex         0
embarked    0
alone       0
pclass      0
dtype: int64

In [18]:
X_train.head()

Unnamed: 0,sex,embarked,alone,pclass
69,1,0,0,3
85,0,0,0,3
794,1,0,1,3
161,0,0,1,2
815,1,0,1,1


#### Perform Chi2 Test
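> `chi2` returns two arrays: the chi-squared statistic for each feature and the corresponding p-value. The statistic is a chi-squared score, not an F-score.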

In [21]:
from sklearn.feature_selection import chi2

# chi2 returns a tuple: (chi-squared statistics, p-values), one entry per column of X_train
f_p_values=chi2(X_train,y_train)
print(f_p_values)
print("-"*60)
print(f"Chi2 scores : {f_p_values[0]}")
print("-"*60)
print(f"p-values : {f_p_values[1]}")

(array([65.67929505,  7.55053653, 10.88471585, 21.97994154]), array([5.30603805e-16, 5.99922095e-03, 9.69610546e-04, 2.75514881e-06]))
------------------------------------------------------------
Chi2 scores : [65.67929505  7.55053653 10.88471585 21.97994154]
------------------------------------------------------------
p-values : [5.30603805e-16 5.99922095e-03 9.69610546e-04 2.75514881e-06]


#### A higher chi-squared score (the first array returned by `chi2`) indicates a more important feature
#### A lower p-value (the second array) indicates a more important feature

In [24]:
import pandas as pd
p_values=pd.DataFrame(f_p_values[1], index=X_train.columns, columns=['p_value'])
# sort by p-value (smallest first), not by index label
p_values.sort_values(by='p_value', ascending=True)



Unnamed: 0,p_value
sex,5.306038e-16
pclass,2.755149e-06
alone,0.0009696105
embarked,0.005999221


#### `sex` is the most important feature with respect to the target: it has the highest chi-squared score and the lowest p-value
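
To cross-check this conclusion, the two arrays printed earlier can be placed side by side and ranked. The snippet below is a small sketch that hard-codes the values from the output above so it runs on its own. The significance level `alpha = 0.05` is a conventional threshold assumed here, not something prescribed by the test.

```python
import numpy as np
import pandas as pd

# Values copied from the chi2 output printed earlier in this section.
features = ["sex", "embarked", "alone", "pclass"]
scores = np.array([65.67929505, 7.55053653, 10.88471585, 21.97994154])
p_values = np.array([5.30603805e-16, 5.99922095e-03, 9.69610546e-04, 2.75514881e-06])

# Rank features by chi-squared statistic, largest first.
ranking = (
    pd.DataFrame({"chi2": scores, "p_value": p_values}, index=features)
    .sort_values("chi2", ascending=False)
)
print(ranking)

# alpha = 0.05 is a conventional but arbitrary threshold, assumed for illustration.
alpha = 0.05
significant = ranking.index[ranking["p_value"] < alpha].tolist()
print(significant)
```

Ranking by descending score and by ascending p-value gives the same order here: `sex`, `pclass`, `alone`, `embarked`.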