## Categorical Input and Categorical Output

A categorical variable has a measurement scale consisting of a set of categories. For example, an incoming email can be ‘spam’ or ‘not spam’. When both the input and the output are categorical, feature selection relies on methods designed for categorical data, such as the Chi-Square test.

## Chi-Square Test for Independence
A contingency table cross-classifies two variables, say X and Y, into rows (r) and columns (c), where each cell holds the count of observations that fall into that combination of categories. The Chi-Square test uses the contingency table to check whether the two variables are associated.
Expected frequencies are computed under the assumption that the variables are independent and are then compared with the observed frequencies. If the difference between the two is large, independence is rejected and the variables are likely associated; if the difference is small, there is no evidence of an association.
The scikit-learn machine learning library provides the chi2() scoring function in sklearn.feature_selection, which can be passed to the SelectKBest class to select the k best features. It computes the Chi-Squared statistic between each non-negative feature and the class.
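
To make the comparison of observed and expected frequencies concrete, here is a minimal pure-Python sketch of the statistic for a contingency table. The helper name `chi_square_statistic` and the 2x2 counts are made up for illustration; they are not taken from the Titanic data.

```python
def chi_square_statistic(observed):
    """Return (statistic, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


# Hypothetical 2x2 table: rows = two categories of X, columns = two categories of Y
table = [[10, 30],
         [40, 20]]
stat, expected = chi_square_statistic(table)
print(expected)        # [[20.0, 20.0], [30.0, 30.0]]
print(round(stat, 3))  # 16.667
```

A larger statistic means the observed counts deviate more from what independence would predict. To turn it into a p-value, it is compared against a chi-square distribution with (r - 1)(c - 1) degrees of freedom.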

### About the Data
The Titanic dataset is used to demonstrate the Chi-Square test for feature selection.

In [None]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [30]:
df = pd.read_csv('tested.csv')
#Considering only categorical features
df.drop(['PassengerId','Name','Age','Ticket','Cabin','Fare'],axis = 1,inplace = True)


In [31]:
df.head()

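| | Survived | Pclass | Sex | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 0 | 0 | Q |
| 1 | 1 | 3 | female | 1 | 0 | S |
| 2 | 0 | 2 | male | 0 | 0 | Q |
| 3 | 0 | 3 | male | 0 | 0 | S |
| 4 | 1 | 3 | female | 1 | 1 | S |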


In [36]:
#Encoding the Data
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])
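
As a rough illustration of what the encoding step does, the sketch below mimics the mapping in plain Python: each distinct label gets an integer code, assigned in sorted order. The helper `label_encode` is hypothetical and is not part of scikit-learn; it only mirrors the behaviour of LabelEncoder on small lists.

```python
def label_encode(values):
    """Map each distinct label to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


sex_codes, sex_mapping = label_encode(['male', 'female', 'male', 'male', 'female'])
print(sex_mapping)  # {'female': 0, 'male': 1}
print(sex_codes)    # [1, 0, 1, 1, 0]

# With three ports present, 'C' sorts first, so 'Q' becomes 1 and 'S' becomes 2
embarked_codes, embarked_mapping = label_encode(['Q', 'S', 'Q', 'S', 'S', 'C'])
print(embarked_mapping)  # {'C': 0, 'Q': 1, 'S': 2}
```

The integer codes carry no real ordering, which is worth keeping in mind because the chi2 score function works on the numeric values of the encoded feature.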

In [40]:
df.head()

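| | Survived | Pclass | Sex | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 0 | 0 | 1 |
| 1 | 1 | 3 | 0 | 1 | 0 | 2 |
| 2 | 0 | 2 | 1 | 0 | 0 | 1 |
| 3 | 0 | 3 | 1 | 0 | 0 | 2 |
| 4 | 1 | 3 | 0 | 1 | 1 | 2 |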


In [70]:
# Split into input (X) and output (y) variables
X = df.iloc[:, 1:]
y = df.iloc[:, 0]  # 'Survived' as a 1-D Series

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Score every feature with chi2 and keep the 3 highest-scoring ones
select = SelectKBest(score_func=chi2, k=3)
new = select.fit_transform(X_train, y_train)

# Positions (within X) of the features that have been selected
cols = select.get_support(indices=True)

# Print the score of each selected column; scores_ is indexed by position in X
for i in cols:
    print('Feature %d: %f' % (i, select.scores_[i]))

# Create a new dataframe with the selected columns. Index X rather than df,
# otherwise the positions are shifted by the 'Survived' column.
features_df_new = X.iloc[:, cols]
features_df_new.head(3)

Feature 1: 106.000000
Feature 2: 3.579026
Feature 3: ...


| | Sex | SibSp | Parch |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
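
The per-feature scores can be reproduced by hand on toy data. The sketch below follows my understanding of how scikit-learn's chi2 scores a single feature: the "observed" value for a class is the sum of the feature's values over the samples of that class, and the "expected" value is the class proportion times the feature's overall sum. The function name `chi2_feature_score` and the four-row toy data are assumptions made for illustration, not part of the library or the Titanic file.

```python
def chi2_feature_score(feature, target):
    """Chi-square score of one non-negative feature against a class target."""
    n_samples = len(target)
    feature_total = sum(feature)  # must be > 0, otherwise expected counts are zero
    score = 0.0
    for cls in sorted(set(target)):
        # Observed: sum of the feature values within this class
        observed = sum(x for x, t in zip(feature, target) if t == cls)
        # Expected: share of samples in this class times the feature total
        class_prob = sum(1 for t in target if t == cls) / n_samples
        expected = class_prob * feature_total
        score += (observed - expected) ** 2 / expected
    return score


# Hypothetical toy data: the first feature mirrors the target, the second is unrelated
target = [0, 0, 1, 1]
mirrors_target = [1, 1, 0, 0]
unrelated = [1, 0, 1, 0]

print(chi2_feature_score(mirrors_target, target))  # 2.0
print(chi2_feature_score(unrelated, target))       # 0.0
```

Because the score is built from sums of feature values, it depends on how the categories were encoded as numbers, and it requires the values to be non-negative.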
