### Fisher Score - Chi-Square Test for Feature Selection
Compute chi-squared stats between each non-negative feature and class.

- This score should be used to evaluate categorical variables in a classification task.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so scoring features with it “weeds out” those that are most likely to be independent of the class and therefore irrelevant for classification. The chi-square statistic is commonly used to test relationships between categorical variables.

It compares the observed distribution of the target classes Y across the categories of a feature with the distribution that would be expected if the target were independent of that feature.
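The observed-versus-expected comparison can be sketched by hand on a small contingency table. The counts below are hypothetical (they are not taken from the Titanic data) and the variable names are illustrative only; expected counts are derived from the row and column totals under the assumption of independence.

```python
import numpy as np

# Hypothetical counts: rows are the categories of one feature,
# columns are the target classes.
observed = np.array([[10.0, 30.0],
                     [40.0, 20.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Counts expected if the feature and the target were independent.
expected = row_totals @ col_totals / total

# Chi-square statistic: squared deviations scaled by the expected counts.
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi_square, 3))
```

A larger statistic means the observed counts sit further from what independence would predict, which is why higher values are read as stronger association.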

#### YouTube Videos
Statistical test: https://www.youtube.com/watch?v=4-rxTA_5_xA

https://www.youtube.com/watch?v=YrhlQB3mQFI

In [1]:
import seaborn as sns
df=sns.load_dataset('titanic')

In [2]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


In [5]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [8]:
df.columns[df.isnull().sum() > 0]

Index(['age', 'embarked', 'deck', 'embark_town'], dtype='object')

In [10]:
missing_col = df.columns[df.isnull().sum() > 0].tolist()
missing_col

['age', 'embarked', 'deck', 'embark_town']

In [11]:
## Keep four candidate features and the target; .copy() avoids a SettingWithCopyWarning on later assignments
df=df[['sex','embarked','alone','pclass','survived']].copy()
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


In [12]:
import numpy as np
## Encode sex as binary: male -> 1, female -> 0
df['sex']=np.where(df['sex']=="male",1,0)
df.head()


Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,S,False,3,0
1,0,C,False,1,1
2,0,S,True,3,1
3,0,S,False,1,1
4,1,S,True,3,0


In [13]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

### Label Encoding using pandas

In [14]:
### Label-encode embarked; note that the missing value (nan) gets its own code (3)
ordinal_value = {k:i for i,k in enumerate(df['embarked'].unique())}
ordinal_value

{'S': 0, 'C': 1, 'Q': 2, nan: 3}

In [15]:
df['embarked'] = df['embarked'].map(ordinal_value)

In [16]:
df['embarked'].unique()

array([0, 1, 2, 3])
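The mapping above turns the missing `embarked` values into a fourth category. One possible alternative, shown here only as a sketch on a hypothetical toy Series (`embarked_toy` and `toy_mapping` are illustrative names, not part of the notebook), is to fill the gaps with the most frequent value before encoding. Whether that is preferable depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
embarked_toy = pd.Series(["S", "C", np.nan, "Q", "S"])

# Fill missing entries with the most frequent category (the mode).
most_frequent = embarked_toy.mode()[0]
embarked_filled = embarked_toy.fillna(most_frequent)

# Same enumerate-based label encoding as in the notebook cell above.
toy_mapping = {k: i for i, k in enumerate(embarked_filled.unique())}
embarked_encoded = embarked_filled.map(toy_mapping)

print(toy_mapping)
print(embarked_encoded.tolist())
```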

### Label Encoding using sklearn - Just trying out

In [23]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
alone_encoded = encoder.fit_transform(df['alone'])
print(type(alone_encoded))
print(np.unique(alone_encoded))
df['alone']=alone_encoded

<class 'numpy.ndarray'>
[0 1]


In [24]:
df['alone']

0      0
1      0
2      1
3      0
4      1
      ..
886    1
887    1
888    0
889    1
890    1
Name: alone, Length: 891, dtype: int64

In [25]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,0,3,0
1,0,1,0,1,1
2,0,0,1,3,1
3,0,0,0,1,1
4,1,0,1,3,0


In [26]:
### A train/test split holds out data so that feature selection and the model can be checked on unseen samples
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df[['sex','embarked','alone','pclass']],
                                              df['survived'],test_size=0.3,random_state=100)

In [28]:
X_train.head()

Unnamed: 0,sex,embarked,alone,pclass
69,1,0,0,3
85,0,0,0,3
794,1,0,1,3
161,0,0,1,2
815,1,0,1,1


In [29]:
y_train.head()

69     0
85     1
794    0
161    1
815    0
Name: survived, dtype: int64

In [30]:
X_train.isnull().sum()

sex         0
embarked    0
alone       0
pclass      0
dtype: int64

In [31]:
## Perform the chi2 test
### chi2 returns two arrays:
### the chi-squared statistics and the p-values
from sklearn.feature_selection import chi2
f_p_values = chi2(X_train,y_train)

### 1st array holds the chi-squared statistics (not F-scores). The higher the statistic, the more important the feature
### 2nd array holds the p-values. The lower the p-value, the more important the feature

In [32]:
# 1st array holds the chi-squared statistics. The higher the statistic, the more important the feature
# 2nd array holds the p-values. The lower the p-value, the more important the feature
f_p_values

(array([65.67929505,  7.55053653, 10.88471585, 21.97994154]),
 array([5.30603805e-16, 5.99922095e-03, 9.69610546e-04, 2.75514881e-06]))
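To make the two returned arrays less opaque, here is a minimal NumPy sketch of how such a statistic can be computed per feature. It follows my understanding of the approach used by `sklearn.feature_selection.chi2` (per-class sums of each feature as observed values, class proportion times feature total as expected values). The function `chi2_sketch` and the toy data are hypothetical, and the p-value shortcut assumes a binary target (one degree of freedom).

```python
import math
import numpy as np

def chi2_sketch(X, y):
    """Per-feature chi-squared statistic and p-value for a binary target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("this sketch only handles a binary target")

    # One-hot encode the target: shape (n_samples, 2).
    Y = (y[:, None] == classes[None, :]).astype(float)

    observed = Y.T @ X                            # feature sum per class
    feature_count = X.sum(axis=0, keepdims=True)  # total per feature
    class_prob = Y.mean(axis=0, keepdims=True)    # share of each class
    expected = class_prob.T @ feature_count       # assumes no all-zero feature

    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    # Chi-squared survival function for 1 degree of freedom.
    pvals = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])
    return stat, pvals

# Hypothetical toy data: column 0 tracks the target, column 1 does not.
X_toy = np.array([[1, 1], [1, 0], [1, 1], [1, 0],
                  [0, 1], [0, 0], [0, 1], [0, 0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

stat, pvals = chi2_sketch(X_toy, y_toy)
print(stat)
print(pvals)
```

Because the feature values are summed as if they were counts, this style of test suits booleans and frequencies; integer codes produced by label encoding are treated as magnitudes, which is worth keeping in mind when reading the scores.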

In [36]:
import pandas as pd
p_value = pd.Series(f_p_values[1])
p_value.index = X_train.columns
p_value

sex         5.306038e-16
embarked    5.999221e-03
alone       9.696105e-04
pclass      2.755149e-06
dtype: float64

In [43]:
p_value.sort_values() # The lower the p-value, the more important the feature. Sorted in ascending order

sex         5.306038e-16
pclass      2.755149e-06
alone       9.696105e-04
embarked    5.999221e-03
dtype: float64

### Observation
The sex column has the lowest p-value (and the highest chi-squared statistic), so it is the feature most strongly associated with the target, survived.

In [44]:
p_value.index

Index(['sex', 'embarked', 'alone', 'pclass'], dtype='object')

In [47]:
# Keep features whose p-value is below 0.05 (an assumed significance level; all four pass here)
imp_cols = [col for col in X_train.columns if p_value[col] < 0.05]
imp_cols

['sex', 'embarked', 'alone', 'pclass']

In [49]:
imp_X_train = X_train[imp_cols]
imp_X_train.head()

Unnamed: 0,sex,embarked,alone,pclass
69,1,0,0,3
85,0,0,0,3
794,1,0,1,3
161,0,0,1,2
815,1,0,1,1
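The manual selection above can also be expressed with scikit-learn's `SelectKBest`, which ranks features by a score function and keeps the top `k`. The sketch below uses hypothetical toy data rather than the Titanic frame, and `k=2` is an arbitrary choice for illustration. The selector is fitted on the training part only and then applied to the test part, so the held-out rows do not influence which features are kept.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: three binary features and a binary target.
X_train_toy = pd.DataFrame({
    "f0": [1, 1, 1, 1, 0, 0, 0, 0],
    "f1": [1, 0, 1, 0, 1, 0, 1, 0],
    "f2": [1, 1, 1, 0, 1, 0, 0, 0],
})
y_train_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
X_test_toy = pd.DataFrame({"f0": [1, 0], "f1": [0, 1], "f2": [1, 1]})

# Fit on the training data only; k=2 is an assumed value.
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X_train_toy, y_train_toy)

# get_support() returns a boolean mask over the input columns.
selected_cols = X_train_toy.columns[selector.get_support()].tolist()

# Apply the same column choice to the held-out data.
X_test_selected = X_test_toy[selected_cols]

print(selected_cols)
print(X_test_selected.shape)
```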
