# Feature Selection

In this notebook we select the best features for training.
We have used Chi2Test from sklearn.feature_selection to rank the features based on their pvalues. 

In [15]:
import pandas as pd 
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.preprocessing import LabelEncoder
#chi2 test 
from sklearn.feature_selection import chi2

df = pd.read_excel("./data/HYPE Dataset for Group Project.xlsx", sheet_name='HYPE-Retention')
df.drop(df[df['SUCCESS LEVEL'] == 'In Progress'].index, inplace=True)

df['ADMIT TERM CODE'].fillna(df['ADMIT TERM CODE'].mode(), inplace=True)
df['ADMIT YEAR'] = df['ADMIT TERM CODE'].astype(str).str[:4]
df['ADMIT MONTH'] = df['ADMIT TERM CODE'].astype(str).str[4:]

df = df.groupby(["ID 2"]).apply(lambda x: x.sort_values(["ADMIT YEAR"], ascending=False)).drop_duplicates('ID 2',
                                                                                                        keep='first').reset_index(drop=True)
label_encoder = LabelEncoder()

df['SUCCESS LEVEL'] = label_encoder.fit_transform(df['SUCCESS LEVEL'])

X = df.drop(columns=['SUCCESS LEVEL'])
y = df['SUCCESS LEVEL']

X.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375 entries, 0 to 374
Data columns (total 40 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   INTAKE TERM CODE               375 non-null    int64  
 1   ADMIT TERM CODE                374 non-null    float64
 2   ID 2                           375 non-null    int64  
 3   RECORD COUNT                   375 non-null    int64  
 4   INTAKE COLLEGE EXPERIENCE      375 non-null    object 
 5   PRIMARY PROGRAM CODE           375 non-null    int64  
 6   SCHOOL CODE                    375 non-null    object 
 7   PROGRAM LONG NAME              375 non-null    object 
 8   PROGRAM SEMESTERS              375 non-null    int64  
 9   TOTAL PROGRAM SEMESTERS        375 non-null    int64  
 10  STUDENT LEVEL NAME             375 non-null    object 
 11  STUDENT TYPE NAME              375 non-null    object 
 12  STUDENT TYPE GROUP NAME        375 non-null    obj

Let us make a list of categorical columns

In [17]:
cat_columns = X.columns

X_cat = X[cat_columns]
X_cat.info()
X_cat.head()
X_cat.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375 entries, 0 to 374
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   INTAKE COLLEGE EXPERIENCE      375 non-null    object
 1   SCHOOL CODE                    375 non-null    object
 2   STUDENT LEVEL NAME             375 non-null    object
 3   TIME STATUS NAME               375 non-null    object
 4   RESIDENCY STATUS NAME          375 non-null    object
 5   FUNDING SOURCE NAME            375 non-null    object
 6   GENDER                         375 non-null    object
 7   DISABILITY IND                 375 non-null    object
 8   MAILING COUNTRY NAME           373 non-null    object
 9   CURRENT STAY STATUS            375 non-null    object
 10  ACADEMIC PERFORMANCE           375 non-null    object
 11  FIRST YEAR PERSISTENCE COUNT   375 non-null    int64 
 12  AGE GROUP LONG NAME            373 non-null    object
 13  APPL 

(375, 18)

Let us build a custom transformer to retain only top 6 categories in all categorical columns and encode categorical columns

In [18]:
class CategoricalTransformer(BaseEstimator,TransformerMixin):
    def __init__ (self,cols):
        self.cols=cols
        
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):        
        return X[self.cols]
    
from sklearn.pipeline import Pipeline
cat_pipeline = Pipeline([
                ('cat_pipe',CategoricalTransformer(cols=cat_columns))
])

X_cat_trans = cat_pipeline.fit_transform(X_cat)
X_cat_trans.head()

cat_encoder = LabelEncoder()
for i in X_cat_trans.columns:
    X_cat_trans[i] = cat_encoder.fit_transform(X_cat_trans[i])

X_cat_trans.head()


Unnamed: 0,INTAKE COLLEGE EXPERIENCE,SCHOOL CODE,STUDENT LEVEL NAME,TIME STATUS NAME,RESIDENCY STATUS NAME,FUNDING SOURCE NAME,GENDER,DISABILITY IND,MAILING COUNTRY NAME,CURRENT STAY STATUS,ACADEMIC PERFORMANCE,FIRST YEAR PERSISTENCE COUNT,AGE GROUP LONG NAME,APPL FIRST LANGUAGE DESC,APPLICANT CATEGORY NAME,APPLICANT TARGET SEGMENT NAME,PREV EDU CRED LEVEL NAME,ADMIT MONTH
0,4,2,1,0,1,1,1,0,0,2,0,1,2,1,1,2,0,3
1,0,3,1,0,1,1,0,0,0,6,2,1,5,2,3,2,1,1
2,3,1,1,0,1,1,1,0,0,6,2,1,2,0,1,2,0,1
3,1,0,1,0,1,1,0,0,0,5,2,0,3,2,1,2,1,3
4,2,0,1,0,1,1,1,0,0,5,0,0,2,2,3,2,2,3


In [19]:
X_cat_trans.isnull().sum()
type(X_cat_trans)

f_p_values = chi2(X_cat_trans,y)
f_p_values

f_values = pd.Series(f_p_values[0])
p_values = pd.Series(f_p_values[1])

p_values.index = X_cat_trans.columns
p_values.sort_values(ascending=False)

MAILING COUNTRY NAME             7.789638e-01
STUDENT LEVEL NAME               7.674223e-01
RESIDENCY STATUS NAME            6.875489e-01
AGE GROUP LONG NAME              6.803171e-01
TIME STATUS NAME                 6.714830e-01
ADMIT MONTH                      5.694109e-01
SCHOOL CODE                      5.255581e-01
APPLICANT TARGET SEGMENT NAME    4.351682e-01
APPL FIRST LANGUAGE DESC         3.498137e-01
FUNDING SOURCE NAME              3.305055e-01
INTAKE COLLEGE EXPERIENCE        2.720758e-01
DISABILITY IND                   1.051031e-01
GENDER                           7.383873e-02
APPLICANT CATEGORY NAME          4.435434e-02
PREV EDU CRED LEVEL NAME         8.520508e-03
FIRST YEAR PERSISTENCE COUNT     2.303092e-13
CURRENT STAY STATUS              9.342392e-29
ACADEMIC PERFORMANCE             2.007597e-34
dtype: float64

The above code has listed features in the in the decreasing order of p-values. A higher pvalue indicates a good depedence with the Target Variable. Seeing the pvalues, we can conclude that FIRST YEAR PERSISTENCE COUNT,CURRENT STAY STATUS and ACADEMIC PERFORMANCE features can be dropped because it has very low p-value. Below is the list of features that will be used to train the models.