# Feature Selection - Fisher Score (F Score) and Chi2 Test on Titanic Dataset

## What Are the Fisher Score and the Chi2 (χ2) Test?

Fisher score is one of the most widely used supervised feature selection methods. However, it scores each feature independently under the Fisher criterion, so it ignores redundancy and interactions between features and can lead to a suboptimal subset of features.
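
To make the criterion concrete, here is a minimal NumPy sketch of one common formulation of the Fisher score: for each feature, the between-class scatter of the class means divided by the within-class variance. The function name `fisher_score` and the toy arrays are illustrative assumptions and are not part of this notebook's pipeline.

```python
import numpy as np


def fisher_score(X, y):
    """Per-feature Fisher score: between-class scatter / within-class scatter.

    Assumes every feature has non-zero variance inside at least one class,
    otherwise the division below produces inf or nan.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_c = X[y == label]          # rows belonging to this class
        n_c = X_c.shape[0]
        between += n_c * (X_c.mean(axis=0) - overall_mean) ** 2
        within += n_c * X_c.var(axis=0)
    return between / within


# Toy data: feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [9.0, 5.0], [10.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])
scores = fisher_score(X_toy, y_toy)
print(scores)
```

Each feature is scored on its own, which is exactly why two highly correlated features can both rank at the top even though the second adds little.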

## Chi Square (χ2) Test
A chi-squared test, also written as χ2 test, is any statistical hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution.

The chi-square test measures dependence between stochastic variables, so using it weeds out the features that are most likely to be independent of the class and therefore irrelevant for classification.
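
As a small worked example of what "dependence" means here, the sketch below computes the textbook chi-square statistic for a 2x2 contingency table by comparing observed counts with the counts expected under independence. The counts are made up for illustration and do not come from the Titanic data.

```python
import numpy as np

# Hypothetical counts: rows = feature value (0 / 1), columns = class (0 / 1).
observed = np.array([[30.0, 10.0],
                     [10.0, 30.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Counts we would expect if the feature and the class were independent.
expected = row_totals @ col_totals / n

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)
```

The larger the statistic, the further the observed counts are from independence, and the smaller the corresponding p-value.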

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
titanic = sns.load_dataset("titanic")
titanic.head()

In [None]:
# Checking for null values
titanic.isnull().sum()

In [None]:
# Dropping the columns with many null values.
titanic.drop(labels=["age", "deck"], axis=1, inplace=True)

In [None]:
titanic = titanic.dropna()

In [None]:
# Confirm that no NaN / null values remain
titanic.isnull().sum()

In [None]:
data = titanic[["pclass", "sex", "sibsp", "parch", "embarked", "who", "alone"]].copy()

In [None]:
data.head()

In [None]:
data.isnull().sum()

In [None]:
# Converting strings to numbers
sex = {"male": 0, "female": 1}
data["sex"] = data["sex"].map(sex)

In [None]:
data.head()

In [None]:
ports = {"S": 0, "C": 1, "Q": 2}
data["embarked"] = data["embarked"].map(ports)

In [None]:
data.head()

In [None]:
who = {"man": 0, "woman": 1, "child": 2}
data["who"] = data["who"].map(who)

In [None]:
data.head()

In [None]:
alone = {False: 0, True: 1}
data["alone"] = data["alone"].map(alone)

In [None]:
data.head()

## Chi2 Test

In [None]:
x = data.copy()
y = titanic["survived"]
x.shape, y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [None]:
f_score = chi2(x_train, y_train)

In [None]:
# chi2 returns (chi2 statistics, p-values); features with p-values below 0.05 are the more important ones
f_score

In [None]:
p_value = pd.Series(f_score[1], index=x_train.columns)
p_value.sort_values(ascending=True, inplace=True)

In [None]:
# As we can see, "who" and "sex" have the lowest p-values, so they are the most important features.
p_value

In [None]:
p_value.plot.bar()

In [None]:
x_train_2 = x_train[["who", "sex"]]
x_test_2 = x_test[["who", "sex"]]
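
The columns above were picked by hand from the sorted p-values. As a hedged alternative, the `SelectKBest` class imported earlier can do the same selection automatically. The sketch below uses a small made-up DataFrame (`toy_x`, `toy_y`, and the column names are hypothetical); in this notebook one would presumably fit on `x_train` and `y_train` instead.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: "strong" matches the target exactly,
# "weak" is loosely related, "noise" is unrelated.
toy_x = pd.DataFrame({
    "strong": [1, 1, 1, 0, 0, 0],
    "weak":   [1, 0, 1, 0, 1, 0],
    "noise":  [1, 1, 0, 1, 1, 0],
})
toy_y = pd.Series([1, 1, 1, 0, 0, 0])

# Keep the k features with the highest chi2 statistic.
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(toy_x, toy_y)

# get_support() returns a boolean mask over the input columns.
selected = list(toy_x.columns[selector.get_support()])
print(selected)
```

Fitting the selector on the training split only, then calling `selector.transform` on both splits, keeps the test data out of the selection step.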

## Build Model

In [None]:
def run_random_forest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print("Accuracy on test set: ")
    print(accuracy_score(y_test, y_pred))

In [None]:
%%time
run_random_forest(x_train_2, x_test_2, y_train, y_test)

In [None]:
# Adding one more feature "pclass"
x_train_3 = x_train[["who", "sex", "pclass"]]
x_test_3 = x_test[["who", "sex", "pclass"]]

In [None]:
%%time
run_random_forest(x_train_3, x_test_3, y_train, y_test)

##### Here we can see the accuracy increased.

In [None]:
# Adding one more feature "embarked"
x_train_4 = x_train[["who", "sex", "pclass", "embarked"]]
x_test_4 = x_test[["who", "sex", "pclass", "embarked"]]

In [None]:
%%time
run_random_forest(x_train_4, x_test_4, y_train, y_test)

##### Here we can see the accuracy increased further.

In [None]:
# Replacing "embarked" with "alone"
x_train_4 = x_train[["who", "sex", "pclass", "alone"]]
x_test_4 = x_test[["who", "sex", "pclass", "alone"]]

In [None]:
%%time
run_random_forest(x_train_4, x_test_4, y_train, y_test)

##### Here we can see the accuracy remained unchanged.

In [None]:
# Adding "embarked" again
x_train_5 = x_train[["who", "sex", "pclass", "alone", "embarked"]]
x_test_5 = x_test[["who", "sex", "pclass", "alone", "embarked"]]

In [None]:
%%time
run_random_forest(x_train_5, x_test_5, y_train, y_test)

##### Here we can see the accuracy didn't change much.

In [None]:
%%time
# Testing on the full feature set.
run_random_forest(x_train, x_test, y_train, y_test)

##### Here we can see the accuracy has decreased.

This shows that proper feature selection can, in some cases, improve accuracy, and training on fewer features also tends to reduce training time.