### 15-05-2020

### Topics:
* Feature Selection vs Feature Engineering vs Dimensionality Reduction
* Feature Selection Methods
* Filter Methods
* Wrapper Methods
* Embedded Methods
* Hybrid Methods
* Advanced Methods

### Feature Selection vs Feature Engineering vs Dimensionality Reduction
* Feature Selection - The process of reducing the number of features considered by a machine learning model. Benefits - a more interpretable model, shorter training time, and reduced overfitting, i.e. a more generalized model.

* Feature Engineering - Data preprocessing that makes the model more effective. It transforms the data so that the model performs better. Benefits - improved model accuracy.

* Dimensionality Reduction - Techniques such as SVD and PCA that transform features into a lower-dimensional space. Benefits - faster model training and improved accuracy (in other words, we combine different columns to get better results).
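
The difference between the three ideas can be sketched on a small random dataset. This is only an illustration under assumed data: the array `X`, the target `y` and the choice of `k=2` are made up for the example and do not come from the notes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Assumed toy data: 5 random features, target depends on columns 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Feature selection: keeps a subset of the ORIGINAL columns, unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Feature engineering: adds a NEW column derived from existing ones.
X_engineered = np.column_stack([X, X[:, 0] * X[:, 1]])

# Dimensionality reduction: every new column is a combination of all inputs.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("kept column indices:", kept)
print(X_selected.shape, X_engineered.shape, X_reduced.shape)
```

The selected columns are exact copies of input columns, while each PCA component mixes all five inputs.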

### Feature Selection Methods
* __Filter Methods__ - _A simple way of choosing features that you expect to affect the target, without using any ML algorithm._

* __Wrapper Methods__ - _Use an ML algorithm to identify the subset of features that predicts best. The result depends on the algorithm used._

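* __Embedded Methods__ - _Feature selection happens as part of model training itself, e.g. L1 (Lasso) regularization or tree-based feature importances._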
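
A rough sketch of how a wrapper method and an embedded method might look in scikit-learn. The synthetic dataset from `make_classification`, the choice of `RFE` for the wrapper, and a random forest for the embedded case are assumptions made for illustration, not methods named in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Assumed synthetic data: 10 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Wrapper: repeatedly fit the model and discard the weakest feature
# until only the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# Embedded: the model computes feature importances while training;
# features below the mean importance are discarded (default threshold).
embedded = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
embedded.fit(X, y)

print("RFE keeps:", rfe.get_support(indices=True))
print("Embedded keeps:", embedded.get_support(indices=True))
```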

### Filter Methods for Feature Selection
* Select features independently of any ML algorithm.
* Based on characteristics of the data.
* A simple and quick way of selecting features.

### Advantages
* Selected features can be used with any ML algorithm. If you change the algorithm, there is no need to change the selected features.
* Computationally inexpensive.


### Types
* __Univariate Filter Based Methods - These methods treat each feature independently__
* __Multivariate Filter Based Methods - These methods use the relationships between features__

### Filter Methods:


#### Basic
* _Removing constant or quasi-constant features_

In [1]:
from sklearn.feature_selection import VarianceThreshold

In [2]:
vt = VarianceThreshold(threshold=0.2)

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set(color_codes=True)
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
df = pd.DataFrame({'A':['m','f','m','m','m','m','m','m'], 
              'B':[1,2,3,1,2,1,1,1], 
              'C':[1,2,3,1,2,1,1,1]})

In [5]:
from sklearn.preprocessing import OrdinalEncoder

In [6]:
oe = OrdinalEncoder()

In [7]:
df['A'] = oe.fit_transform(df[['A']])

In [8]:
df

     A  B  C
0  1.0  1  1
1  0.0  2  2
2  1.0  3  3
3  1.0  1  1
4  1.0  2  2
5  1.0  1  1
6  1.0  1  1
7  1.0  1  1


In [9]:
vt.fit_transform(df)

array([[1., 1.],
       [2., 2.],
       [3., 3.],
       [1., 1.],
       [2., 2.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [12]:
vt.variances_

array([0.109375, 0.5     , 0.5     ])
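
Column A has variance 0.109375, which is below the threshold of 0.2, so it is dropped. The transformed array no longer carries column names. One way to recover them is `get_support()`, sketched here on a copy of the same toy data (the names `toy` and `selector` are made up for this example).

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Same values as the encoded toy frame above.
toy = pd.DataFrame({
    'A': [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'B': [1, 2, 3, 1, 2, 1, 1, 1],
    'C': [1, 2, 3, 1, 2, 1, 1, 1],
})

selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

# get_support() returns a boolean mask: True for columns that are kept.
mask = selector.get_support()
kept_columns = toy.columns[mask]
dropped_columns = toy.columns[~mask]

print("kept:", list(kept_columns))
print("dropped:", list(dropped_columns))
```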

__Dropping duplicated columns__
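
In the toy frame above, columns B and C hold identical values, so one of them adds nothing. A minimal sketch of one way to drop such columns with pandas follows; transposing is an assumption that suits small frames and may be slow on wide data.

```python
import pandas as pd

# Same values as the encoded toy frame above.
toy = pd.DataFrame({
    'A': [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'B': [1, 2, 3, 1, 2, 1, 1, 1],
    'C': [1, 2, 3, 1, 2, 1, 1, 1],
})

# Transpose so columns become rows, then flag rows repeating an earlier row.
is_duplicate = toy.T.duplicated()

# Keep only the columns that are not flagged as duplicates.
deduplicated = toy.loc[:, ~is_duplicate.values]

print(list(deduplicated.columns))
```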

### Correlation Filter
* Correlation measures the strength of the relationship between two quantitative columns (linear for Pearson, monotonic for Spearman and Kendall). It tells how strongly one variable moves with the other.
* Say we have 3 features A, B, C and one target T. To find the important features for a model predicting T, we measure the correlation between A & T, B & T, and C & T.
* What if features A & B are correlated with each other?
* Answer: A & B provide redundant information to the model for predicting T, so one of them should be removed.
* Three ways of calculating correlation - Pearson, Spearman, Kendall
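
The three coefficients can disagree when a relationship is monotonic but not linear. The sketch below uses made-up data (`y = x ** 3`) to show this; it assumes SciPy is installed, since pandas relies on it for some rank-based methods.

```python
import numpy as np
import pandas as pd

# Assumed toy data: y grows with x, but not along a straight line.
x = np.arange(1, 11, dtype=float)
data = pd.DataFrame({'x': x, 'y': x ** 3})

results = {
    method: data.corr(method=method).loc['x', 'y']
    for method in ('pearson', 'spearman', 'kendall')
}

for method, value in results.items():
    print(f"{method}: {value:.4f}")
```

Spearman and Kendall work on ranks, so they report a perfect relationship here, while Pearson stays below 1 because the points do not lie on a line.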

In [13]:
df = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/winequality-white.csv', sep=';')

In [14]:
corr = df.corr(method='pearson')

In [15]:
np.abs(corr.loc['fixed acidity']).sort_values(ascending=False)[:6]

fixed acidity    1.000000
pH               0.425858
citric acid      0.289181
density          0.265331
alcohol          0.120881
quality          0.113663
Name: fixed acidity, dtype: float64
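
Removing one feature from each highly correlated pair could be sketched as follows. The synthetic frame and the 0.9 cut-off are assumptions for illustration; a suitable threshold depends on the data.

```python
import numpy as np
import pandas as pd

# Assumed synthetic data: B is almost an exact multiple of A, C is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
frame = pd.DataFrame({
    'A': a,
    'B': 2 * a + rng.normal(scale=0.01, size=200),
    'C': rng.normal(size=200),
})

corr_abs = frame.corr(method='pearson').abs()

# Keep only the upper triangle so each pair of columns is considered once.
upper_mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(upper_mask)

threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = frame.drop(columns=to_drop)

print("dropped:", to_drop)
print("remaining:", list(reduced.columns))
```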

### Chi-squared Method
* Used for testing the relationship between categorical variables (binary targets, counts, etc.).
* It calculates the relationship between each feature and the target (both categorical).
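
For intuition, the classical chi-squared test of independence can be worked through on a tiny contingency table. This sketch uses `scipy.stats.chi2_contingency` on made-up data. It is not the same computation as scikit-learn's `chi2` scorer, which expects non-negative feature values and treats them like counts.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed toy data: 8 people, one categorical feature and one categorical target.
toy = pd.DataFrame({
    'sex':    ['m', 'm', 'm', 'm', 'f', 'f', 'f', 'f'],
    'salary': ['high', 'high', 'high', 'low', 'low', 'low', 'low', 'high'],
})

# Observed counts for every (sex, salary) combination.
table = pd.crosstab(toy['sex'], toy['salary'])

# correction=False disables Yates' continuity correction so the statistic
# matches the textbook formula sum((observed - expected)^2 / expected).
stat, p_value, dof, expected = chi2_contingency(table, correction=False)

print(table)
print("chi2 =", stat, "p =", round(p_value, 4), "dof =", dof)
```

Every expected count is 2 and every observed count differs from it by 1, so the statistic is 4 * (1 / 2) = 2.0 with one degree of freedom.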

In [16]:
from sklearn.feature_selection import chi2, SelectKBest

In [17]:
cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship'
        ,'race','sex','capital-gain','capital-loss','hours-per-week','native-country','Salary']
adult_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/adult.data.txt', names=cols)

In [18]:
cat_adult_data = adult_data.select_dtypes(include=['object'])

In [19]:
oe = OrdinalEncoder()

In [20]:
data_td = oe.fit_transform(cat_adult_data)

In [21]:
df = pd.DataFrame(data_td, columns=list(cat_adult_data.columns.values))

In [23]:
df.head()

   workclass  education  marital-status  occupation  relationship  race  sex  native-country  Salary
0        7.0        9.0             4.0         1.0           1.0   4.0  1.0            39.0     0.0
1        6.0        9.0             2.0         4.0           0.0   4.0  1.0            39.0     0.0
2        4.0       11.0             0.0         6.0           1.0   4.0  1.0            39.0     0.0
3        4.0        1.0             2.0         6.0           0.0   2.0  1.0            39.0     0.0
4        4.0        9.0             2.0        10.0           5.0   2.0  0.0             5.0     0.0


In [24]:
chi_2, pval = chi2(df.drop(columns=['Salary']), df.Salary)

In [25]:
chi_2

array([  47.50811916,  297.94227041, 1123.46981798,  504.5588538 ,
       3659.14312486,   33.03130514,  502.43941948,   13.61925602])

In [26]:
feature_importances = pd.Series(chi_2, index=list(df.drop(columns=['Salary']).columns.values))

In [27]:
feature_importances.sort_values(ascending=False)[:4]

relationship      3659.143125
marital-status    1123.469818
occupation         504.558854
sex                502.439419
dtype: float64

In [28]:
df[list(feature_importances.sort_values(ascending=False)[:4].index)][:5]

   relationship  marital-status  occupation  sex
0           1.0             4.0         1.0  1.0
1           0.0             2.0         4.0  1.0
2           1.0             0.0         6.0  1.0
3           0.0             2.0         6.0  1.0
4           5.0             2.0        10.0  0.0


In [29]:
fs = SelectKBest(k=4, score_func=chi2)

In [30]:
fs.fit_transform(df.drop(columns=['Salary']), df.Salary)

array([[4., 1., 1., 1.],
       [2., 4., 0., 1.],
       [0., 6., 1., 1.],
       ...,
       [6., 1., 4., 0.],
       [4., 1., 3., 1.],
       [2., 4., 5., 0.]])

In [31]:
fs.scores_

array([  47.50811916,  297.94227041, 1123.46981798,  504.5588538 ,
       3659.14312486,   33.03130514,  502.43941948,   13.61925602])
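
`SelectKBest` returns a plain array in the original column order, so the column names are lost. A small sketch of recovering them with `get_support()` follows, using made-up count-like data instead of the adult dataset so that it runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Assumed synthetic data: one feature tied to the target, two noise features.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=300)
features = pd.DataFrame({
    'informative': target * 3 + rng.integers(0, 2, size=300),
    'noise_1': rng.integers(0, 5, size=300),
    'noise_2': rng.integers(0, 5, size=300),
})

selector = SelectKBest(score_func=chi2, k=1).fit(features, target)

# Boolean mask over the input columns: True where the column was selected.
selected_names = list(features.columns[selector.get_support()])

# Pair each score with its column name for easier reading.
scores = pd.Series(selector.scores_, index=features.columns)

print(selected_names)
print(scores.sort_values(ascending=False))
```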

* __PS: Correlation is for finding the relationship between a continuous feature & a continuous target. Chi2 is for finding the relationship between categorical features & a categorical target__

### ANOVA Univariate Test
* Suited to features that are continuous and normally distributed.
* The target can be discrete/categorical: use f_classif.
* The target can also be continuous: use f_regression.
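
The two scoring functions can be compared on a small synthetic example. The data below is made up: one column drives a categorical target and another drives a continuous target, so each scorer should rank a different column highest.

```python
import numpy as np
from sklearn.feature_selection import f_classif, f_regression

# Assumed synthetic data with three continuous, normally distributed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Categorical target that depends only on column 0.
y_class = (X[:, 0] > 0).astype(int)

# Continuous target that depends only on column 1 (plus a little noise).
y_cont = 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Both functions return one F-statistic and one p-value per feature.
f_cls, p_cls = f_classif(X, y_class)
f_reg, p_reg = f_regression(X, y_cont)

print("f_classif scores:   ", np.round(f_cls, 2))
print("f_regression scores:", np.round(f_reg, 2))
```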

In [32]:
df = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/winequality-white.csv', sep=';')

In [33]:
from sklearn.feature_selection import f_classif
fs = SelectKBest(k=8,score_func=f_classif)

In [34]:
feature_data = fs.fit_transform(df.drop(columns=['quality']), df.quality)

In [35]:
from sklearn.linear_model import LogisticRegression

In [36]:
lr = LogisticRegression()

In [37]:
from sklearn.model_selection import train_test_split

In [38]:
trainX, testX, trainY, testY = train_test_split(feature_data, df.quality)

In [39]:
lr.fit(trainX, trainY)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [40]:
lr.score(testX,testY)

0.5420408163265306

In [41]:
from sklearn.tree import DecisionTreeClassifier

In [42]:
dt = DecisionTreeClassifier()

In [43]:
dt.fit(trainX, trainY)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [44]:
dt.score(testX,testY)

0.6146938775510205
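
In the cells above, the features are selected on the full dataset before the train/test split, so the test rows influence which features are kept. One way to avoid this is to put the selector inside a pipeline that is fitted on the training split only. The sketch below uses synthetic data from `make_classification` in place of the wine file; the `StandardScaler` step and `max_iter=1000` are assumptions, not part of the original cells.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed synthetic stand-in for the wine data: 11 features, binary target.
X, y = make_classification(n_samples=500, n_features=11, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The selector is a pipeline step, so it only ever sees the training rows.
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=8),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print("test accuracy:", accuracy)
```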