Welcome to **ROOSEVELT NATIONAL FOREST** :)

In [None]:
from PIL import Image
Image.open('roosevelt.jpg')

The dataset consists of cartographic variables used to predict the forest cover type in the four wilderness areas of Roosevelt National Forest, Colorado, USA. The independent variables were derived from data obtained from the US Geological Survey (USGS) and the United States Forest Service (USFS).

In [None]:
Image.open('forest_area.PNG')

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as st
pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2,mutual_info_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFE

In [None]:
df=pd.read_csv('covertype.csv')
df.head(2)

In [None]:
plt.figure(figsize=(6,6))
df['class'].value_counts(normalize=True).plot.pie(autopct='%1.f%%')
plt.title('Distribution of classes')
plt.show()

**Classes**
* Spruce/Fir (Picea pungens)
* Lodgepole Pine (Pinus contorta)
* Ponderosa Pine (Pinus ponderosa)
* Cottonwood/Willow (Populus deltoides)
* Aspen (Populus tremuloides)
* Douglas-fir (Pseudotsuga menziesii)
* Krummholz

In [None]:
print("Any missing sample in set:",df.isnull().values.any())
print('Number of records in dataset are {} and Features are {}.'.format(*df.shape))

**Let's dive deep into the feature information.**
* All wilderness-area and soil-type features are dummified.
* The target feature is 'class', with 7 categories.
* All remaining features are scaled between 0 and 1.
* As the number of features in this dataset is large, we need to select the significant features for our analysis.

# 1) ANOVA + Chi-Square

* **ANOVA**: for a continuous independent feature + categorical target feature
* **Chi-Square**: for a categorical independent feature + categorical target feature

### 1.1 One-Way ANOVA

* We perform a one-way ANOVA between each numerical feature and our categorical target feature 'class', which has 7 categories.
* The one-way ANOVA determines whether there are statistically significant differences between the means of a particular numerical feature when it is divided into the 7 class subgroups.

* H0:

      The mean of the quantitative feature is the same in all 7 class subgroups.
      The subgroups are therefore similar and the feature cannot differentiate between them,
      so that particular quantitative feature cannot predict the class type.

* H1:

       The mean of the quantitative feature differs in at least one of the 7 class subgroups,
       and hence that particular quantitative feature is significant in predicting the class type.
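To make the two hypotheses concrete, here is a minimal sketch on synthetic data (not the covertype data). The group means, sizes and seed are assumptions chosen purely for illustration: one set of three groups has shifted means, the other shares a single mean.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Hypothetical subgroups of one numerical feature, one array per class.
# 'shifted' groups have different means; 'same' groups share one mean.
shifted = [rng.normal(loc=m, scale=1.0, size=200) for m in (0.0, 0.5, 1.0)]
same = [rng.normal(loc=0.0, scale=1.0, size=200) for _ in range(3)]

f_shifted, p_shifted = st.f_oneway(*shifted)
f_same, p_same = st.f_oneway(*same)

print(f"different means: F={f_shifted:.2f}, p={p_shifted:.3g}")
print(f"equal means:     F={f_same:.2f}, p={p_same:.3g}")
```

A large F statistic with a small p-value is evidence against H0; when the groups really share one mean, the F statistic is expected to stay close to 1.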

In [None]:
anova_results={}
num_cols=df.columns.to_list()[:10]  # the first 10 columns are the numerical features

for col in num_cols:
    # One sample of this feature per class (classes are labelled 1 to 7)
    groups=[df[df['class']==c][col] for c in range(1,8)]

    statistic,p_value=st.f_oneway(*groups)
    anova_results[col]=[statistic,p_value]

df_anova=pd.DataFrame(anova_results).T
df_anova=df_anova.rename(columns={0:'F_statistic',1:'p_value'})
df_anova=df_anova.sort_values(by=['p_value','F_statistic'],ascending=[False,False])
# Flag features whose p-value is below the 5% significance level
df_anova['Significant']=df_anova['p_value'].apply(lambda x : 'True' if x<0.05  else 'False')
df_anova

* As p-value < 0.05 in all cases, we reject the null hypothesis in all cases.
* Hence all quantitative features are significant in deciding the class type.

### 1.2 Chi Square

* The chi-square test, also known as the chi-square test of association, is a nonparametric test.
* It compares two variables in a contingency table to see whether they are related.
* In a more general sense, it tests whether the distributions of categorical variables differ from one another.
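As a small illustration of the contingency-table idea, the sketch below builds a toy table from made-up data and runs `scipy.stats.chi2_contingency` on it. The column names `soil_dummy` and `target` and all counts are hypothetical.

```python
import pandas as pd
import scipy.stats as st

# Hypothetical toy data: a binary dummy feature and a 3-category target
toy = pd.DataFrame({
    "soil_dummy": [0] * 60 + [1] * 60,
    "target": ["a"] * 40 + ["b"] * 15 + ["c"] * 5
            + ["a"] * 5 + ["b"] * 15 + ["c"] * 40,
})

# Rows are feature levels, columns are target categories
table = pd.crosstab(toy["soil_dummy"], toy["target"])

# Returns the statistic, the p-value, the degrees of freedom and the
# expected counts under independence
stat, p_value, dof, expected = st.chi2_contingency(table)

print(table)
print(f"chi2={stat:.2f}, p={p_value:.3g}, dof={dof}")
```

The degrees of freedom equal (rows - 1) * (columns - 1), and the expected counts are what the usual "at least 5 per cell" rule of thumb refers to.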

In [None]:
X=df.drop('class',axis=1)  # pass axis by keyword; the positional form is not accepted by recent pandas
Y=df['class']
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y)

cat_cols=X_train.columns.to_list()[10:-2]
not_satisfied=[]
chi2_results={}

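for col in cat_cols:
    cross_table=pd.crosstab(X_train[col],Y_train)

    # Skip columns where a feature level has fewer than 5 samples in total
    if np.any(cross_table.sum(axis=1)<5):
        not_satisfied.append(col)
    # Skip constant columns: a single-row table cannot be tested
    elif len(cross_table)==1:
        not_satisfied.append(col)

    else:
        stat,p_value,dof=st.chi2_contingency(cross_table)[0:3]
        chi2_results[col]=[stat,p_value,dof]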
            
            
print(f'The column(s) not satisfied for chi2 test {not_satisfied}')

df_chi2=pd.DataFrame(chi2_results).T
df_chi2=df_chi2.rename(columns={0:'chi2_statistic',1:'p_value',2:'dof'})
df_chi2=df_chi2.sort_values(by=['p_value','chi2_statistic'],ascending=[False,False])
df_chi2['Significant']=df_chi2['p_value'].apply(lambda x : 'True' if x<0.05  else 'False')
df_chi2

* As p-value < 0.05 in all cases, we reject the null hypothesis in all cases.
* Hence all tested categorical (dummified) features are significant in deciding the class type. 'Soil_Type15' is excluded because its counts are less than 5, so the test conditions are not satisfied.

# 2) Mutual Information

* Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
* The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances. 
* It can be used for univariate feature selection using **mutual_info_classif from the sklearn package**.
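The behaviour described above can be checked on synthetic data. In this sketch (sample size, noise level and seed are assumptions), one feature is built from the target and the other is pure noise, so their estimated MI scores should be clearly different.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Hypothetical 3-class target
y = rng.integers(0, 3, size=n)

# 'informative' depends on y; 'noise' is independent of it
informative = y + rng.normal(scale=0.3, size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([informative, noise])

# Estimated MI (in nats) between each column and the target
scores = mutual_info_classif(X_toy, y, random_state=0)
print(dict(zip(["informative", "noise"], scores.round(3))))
```

The scores are estimates, so an independent feature may get a small positive value rather than exactly zero.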

In [None]:
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(X_train,Y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Columns','Score']  
print(featureScores.nlargest(10,'Score')) 

* We keep the top 10 features, which have the strongest dependency on the target class.
* Elevation is the most important feature, with the highest score.
* Next, check whether the top 10 selected features show multicollinearity, using VIF values.
* High VIF values indicate the presence of multicollinearity.
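For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data; `vif_manual` is a hypothetical helper written for this illustration and is not the statsmodels implementation.

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)             # independent of a
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a

X_toy = np.column_stack([a, b, c])
vifs = [vif_manual(X_toy, j) for j in range(X_toy.shape[1])]
print(dict(zip("abc", np.round(vifs, 2))))
```

Here `a` and `c` explain each other almost perfectly, so both get a large VIF, while the independent `b` stays near 1.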

In [None]:
signi_cols=list(featureScores.nlargest(10,'Score')['Columns'].values)

X_train_s=X_train[signi_cols]
Y_train_s=Y_train.copy()
X_test_s=X_test[signi_cols]
Y_test_s=Y_test.copy()

X_const=sm.add_constant(X_train_s)  # add the intercept column once and reuse it
V_values=[vif(X_const.values,i) for i in range(X_const.shape[1])]
l=list(zip(X_const.columns,V_values))
l.sort(key=lambda x : x[1],reverse=True)
df_vif=pd.DataFrame(l,columns=['Feature','vif_value'])
df_vif

* There is no multicollinearity among these selected features, so we can proceed with them for model building and compare the performance of a model built on these features with a base model built on all features.

## 3.1) SelectFromModel (RandomForestClassifier)

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

#X=df.drop('class', axis=1)
Y=df['class']

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=50), max_features=54)
embeded_rf_selector.fit(X, Y)


In [None]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)),'features are significant enough according to RandomForestClassifier. ')
print('\n',embeded_rf_feature)

## 3.2) SelectFromModel (LGBMClassifier)

In [None]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

lgbc=LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

embeded_lgb_selector = SelectFromModel(lgbc, max_features=54)
embeded_lgb_selector.fit(X, Y)

embeded_lgb_support = embeded_lgb_selector.get_support()
embeded_lgb_feature = X.loc[:,embeded_lgb_support].columns.tolist()
print(str(len(embeded_lgb_feature)),'features are significant enough according to lightGBM')
print('\n',embeded_lgb_feature)

# 4) Recursive Feature Elimination


* The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. 
* First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. 
* Then, the least important features are pruned from current set of features. 
* That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
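The elimination loop can be seen on a small synthetic problem. This is a sketch under assumed settings (dataset from `make_classification`, `step=1`, 3 features kept), not the configuration used on the covertype data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem: 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# step=1 removes one feature per round until 3 remain
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
selector.fit(X_toy, y_toy)

print("kept:   ", selector.support_)   # boolean mask of selected features
print("ranking:", selector.ranking_)   # 1 = selected, higher = dropped earlier
```

With `step=1`, each eliminated feature gets its own rank, so `ranking_` shows the order in which features were pruned.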

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(), \
                   n_features_to_select=10, step=10, verbose=5)
rfe_selector.fit(X, Y)
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print('\n',str(len(rfe_feature)), 'most significant features according to RFE are ')
print(rfe_feature)


**Happy learning!!**

* **Random hypothesis testing**
* H0: Opening this notebook was significant enough.
* H1: Wasn't useful.

**Upvote** If this notebook passed your Random hypothesis testing and you found this work useful :)

For more info visit:
https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2
