
## Chi-Squared statistical test (SelectKBest)

<p style='text-align: right;'> 20 points</p>

The chi-squared (chi2) statistic measures the dependency between two variables. It works like a goodness-of-fit measure: it compares the observed distribution of a feature with the distribution we would expect if the feature and the target were independent. The larger the gap between the two, the stronger the evidence that they are dependent.
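To make the idea concrete, here is a minimal sketch that computes the chi-squared statistic by hand for a small 2x2 table of counts. The counts are made up for illustration (they are not taken from the diabetes dataset), and the helper name `chi_squared_statistic` is hypothetical.

```python
def chi_squared_statistic(observed):
    """Chi-squared statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Made-up counts: rows = feature present / absent, columns = outcome 0 / 1
observed_counts = [[30, 10],
                   [20, 40]]
chi2_value = chi_squared_statistic(observed_counts)
print(round(chi2_value, 4))  # 16.6667
```

A table whose observed counts already match the independence assumption (for example `[[10, 10], [10, 10]]`) gives a statistic of 0.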

Scikit-Learn offers a feature selection estimator named SelectKBest, which selects the K highest-scoring features according to a statistical scoring function such as chi2.

Reference link: https://chrisalbon.com/machine_learning/feature_selection/chi-squared_for_feature_selection/

The helper functions below are used later in this notebook: get_selected_features recovers the names of the selected columns, and generate_feature_scores_df pairs each feature with its score from the chi-squared test.

In [2]:
import pandas as pd

In [15]:
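def get_selected_features(raw_df, processed_df):
    # Return the names of the raw_df columns whose values match a column of processed_df.
    # Columns are matched by value, so duplicated columns in raw_df would all be reported.
    selected_features = []
    for i in range(processed_df.shape[1]):
        for j in range(raw_df.shape[1]):
            if processed_df.iloc[:, i].equals(raw_df.iloc[:, j]):
                selected_features.append(raw_df.columns[j])
    return selected_features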

In [1]:
def generate_feature_scores_df(X, Score):
    # Pair each column name of X with its score, one row per feature
    feature_score = pd.DataFrame({"Features": X.columns, "Score": Score})
    return feature_score

Hey coder! Let's use the dataset diabetes.csv from the link below:

https://github.com/arupbhunia/Data-Pre-processing/blob/master/datasets/diabetes.csv

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# create a data frame named diabetes by loading the zipped csv file (pandas infers the compression)
diabetes = pd.read_csv("/content/drive/MyDrive/CloudyML/diabetes.zip")
diabetes.head()

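| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |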


In [6]:
# assign the features to X and the 'Outcome' column to y
X = diabetes.drop('Outcome', axis=1)

y = diabetes["Outcome"]

In [7]:
#import chi2 and SelectKBest
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [8]:
# cast the feature columns to float
X = X.astype(float)

Let's use SelectKBest to find the best-scoring features. Use chi2 as the score function and set k, the number of features to keep, to 4.
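Before wiring chi2 into SelectKBest, it can help to see what the score function returns on its own. The sketch below calls `chi2` on a tiny made-up DataFrame (`X_demo` and `y_demo` are illustrative names, not part of the diabetes data). It returns one score and one p-value per feature, and it rejects negative feature values with a `ValueError`, so the features are assumed to be non-negative.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Made-up data: "strong" tracks the label closely, "constant" carries no information
X_demo = pd.DataFrame({
    "weak": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "strong": [0.0, 9.0, 1.0, 8.0, 0.0, 9.0],
    "constant": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})
y_demo = pd.Series([0, 1, 0, 1, 0, 1])

# chi2 returns two arrays, each with one entry per feature
scores, p_values = chi2(X_demo, y_demo)
for name, score, p in zip(X_demo.columns, scores, p_values):
    print(f"{name}: score={score:.3f}, p-value={p:.3f}")

# chi2 only accepts non-negative features; shifting the data below zero is rejected
try:
    chi2(X_demo - 10, y_demo)
    negative_rejected = False
except ValueError:
    negative_rejected = True
print("negative input rejected:", negative_rejected)
```

If a real dataset had negative values, one option would be to rescale the features to a non-negative range before scoring them.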


In [9]:
# Initialise SelectKBest with the above parameters
chi2_test = SelectKBest(chi2, k=4)

# fit it with X and y
chi2_model = chi2_test.fit(X, y)

In [10]:
#print the scores of chi2_model
chi2_model.scores_

array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])

In [11]:
# use the generate_feature_scores_df function to get the features and their respective scores, passing X and chi2_model.scores_ as parameters
feature_score_df = generate_feature_scores_df(X, chi2_model.scores_)


# display feature_score_df
feature_score_df

| | Features | Score |
|---|---|---|
| 0 | Pregnancies | 111.519691 |
| 1 | Glucose | 1411.887041 |
| 2 | BloodPressure | 17.605373 |
| 3 | SkinThickness | 53.108040 |
| 4 | Insulin | 2175.565273 |
| 5 | BMI | 127.669343 |
| 6 | DiabetesPedigreeFunction | 5.392682 |
| 7 | Age | 181.303689 |


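Did you see the features and their corresponding chi-squared scores? Reading them is easy: the higher the score, the stronger the evidence that the feature and the target are dependent, and so the better the feature. Just like the higher the marks in an assignment, the better the student of ours.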
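Ranking the scores makes it easier to see which features k=4 will keep. The sketch below rebuilds the score table from the values printed above so that it runs on its own; the names `scores_demo`, `ranked`, and `top_4` are illustrative.

```python
import pandas as pd

# Scores copied from the output above
scores_demo = pd.DataFrame({
    "Features": ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"],
    "Score": [111.519691, 1411.887041, 17.605373, 53.108040,
              2175.565273, 127.669343, 5.392682, 181.303689],
})

# Sort from highest to lowest score and keep the names of the first four
ranked = scores_demo.sort_values("Score", ascending=False).reset_index(drop=True)
top_4 = ranked.head(4)["Features"].tolist()
print(top_4)  # ['Insulin', 'Glucose', 'Age', 'BMI']
```

With the notebook's own variable, `feature_score_df.sort_values("Score", ascending=False)` should give the same ranking.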

In [12]:
# Let's reduce X to the features selected by chi2_model using the transform function; the result is X_new
X_new = chi2_model.transform(X)

Whaaaat! transform()? How is it different from fit_transform()? You know, buddy, fit() can confuse you too. Hey, you inquisitive learner, we will tell you the difference.

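The fit() function learns what the estimator needs from the data. For SelectKBest, that means computing a score for every feature and recording which k features score highest. The transform() function applies what was learned to the data: here it returns only the selected columns (for a scaler, it would return the normalized values instead). The fit_transform() function performs both in the same step. Note that, on the same data, we get the same result whether we work in two steps or in a single step.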
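The equivalence of the two routes can be checked directly. This is a minimal sketch on randomly generated, non-negative data (`X_demo` and `y_demo` are made-up stand-ins, not the diabetes data), comparing fit() followed by transform() with a single fit_transform() call.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up non-negative features and a binary target
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(50, 5)).astype(float)
y_demo = rng.integers(0, 2, size=50)

# Two steps: learn the scores first, then keep the two best columns
two_step = SelectKBest(chi2, k=2).fit(X_demo, y_demo).transform(X_demo)

# One step: fit_transform does both at once
one_step = SelectKBest(chi2, k=2).fit_transform(X_demo, y_demo)

same = np.array_equal(two_step, one_step)
print(two_step.shape, same)
```

The two-step form is the one to reach for when the selection learned on one dataset has to be applied to another, since transform() can be called on new data without refitting.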

For more information on this, see: https://www.youtube.com/watch?v=BotYLBQfd5M

In [13]:
# Convert X_new into a dataframe

X_new = pd.DataFrame(X_new)

In [16]:
# call the get_selected_features function, passing X as the raw data and X_new as the processed data
selected_features = get_selected_features(X, X_new)

# display selected_features
selected_features

['Glucose', 'Insulin', 'BMI', 'Age']
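The helper get_selected_features finds the names by comparing column values. As an alternative sketch, a fitted selector also exposes a boolean mask through `get_support()`, which can index the column names directly. The example below uses a tiny made-up DataFrame (`X_demo`, `y_demo`, and `selector` are illustrative names) so that it runs on its own.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up data: "strong" tracks the label closely, "constant" carries no information
X_demo = pd.DataFrame({
    "weak": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "strong": [0.0, 9.0, 1.0, 8.0, 0.0, 9.0],
    "constant": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})
y_demo = pd.Series([0, 1, 0, 1, 0, 1])

selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)

# Boolean mask with one entry per input column: True means the column is kept
mask = selector.get_support()
selected = X_demo.columns[mask].tolist()

# Selecting with the mask keeps the original column names on the result
X_selected = X_demo.loc[:, mask]
print(selected)  # ['weak', 'strong']
```

In this notebook, the equivalent call would be `X.columns[chi2_model.get_support()]`, which avoids matching columns by value.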

Let's build a DataFrame that contains only the features listed in selected_features and save it in the variable chi2_best_features.

In [17]:
chi2_best_features = pd.DataFrame(X_new)
chi2_best_features.columns = selected_features

# print chi2_best_features.head()
chi2_best_features.head()

| | Glucose | Insulin | BMI | Age |
|---|---|---|---|---|
| 0 | 148.0 | 0.0 | 33.6 | 50.0 |
| 1 | 85.0 | 0.0 | 26.6 | 31.0 |
| 2 | 183.0 | 0.0 | 23.3 | 32.0 |
| 3 | 89.0 | 94.0 | 28.1 | 21.0 |
| 4 | 137.0 | 168.0 | 43.1 | 33.0 |


As you can see, the chi-squared test helps us pick, out of the original independent (input) features, the ones that have the strongest relationship with the target feature.

You did it well!

------------------------------

In [None]:
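from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

# X_train and y_train are not defined earlier in this notebook; as an assumption,
# create them here by splitting X and y (random_state is fixed only for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# use f_classif (the default) and SelectPercentile to select 50% of features:
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)

# transform training set:
X_train_selected = select.transform(X_train)

print(X_train.shape)
print(X_train_selected.shape)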

In [None]:
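import matplotlib.pyplot as plt

# boolean mask with one entry per feature: True means the feature was selected
mask = select.get_support()
print(mask)
# visualize the mask. black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r')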