# Python Machine Learning 5

# Feature Selection / PCA

## By Sal Lascano

Let's start with the question: **why bother with dimensionality reduction?** Removing irrelevant features usually gives a more accurate model, and with fewer features to process it also trains and predicts faster.

Machine learning algorithms are only as good as the features we provide (garbage in, garbage out). Coming up with good features is one of the **most important jobs in machine learning**, and choosing the right number of features to use is more of an art than a science. Now let's discuss what makes a good feature.

A good feature makes it easier for the algorithm to decide between different things. It gives the algorithm plenty of information or, statistically speaking, it is rich in variance (as long as that variance does not come from outliers).

We want our features to be independent. We don't want features that are correlated with or redundant to one another. If you come across such a case, pick the one that seems most useful and discard the others. Algorithms are not smart enough to realize that, for example, height in inches and height in centimeters are the same thing, so they will double count the importance of the feature.
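One quick way to spot this kind of redundancy is to look at the correlation between two features. The sketch below uses made-up heights (not real data) and shows that the inches and centimeters columns are perfectly correlated, so one of them can be dropped.

```python
import numpy as np

# Hypothetical heights, invented for illustration only
height_in = np.array([60.0, 62.5, 65.0, 68.0, 71.5, 74.0])
height_cm = height_in * 2.54  # same information, different unit

# Pearson correlation between the two columns
corr = np.corrcoef(height_in, height_cm)[0, 1]
print(round(corr, 6))
```

A correlation this close to 1 (or -1) is a hint that the two features carry the same information.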

Again, we want to help our algorithm be as accurate as possible, so we want features that are easy for the algorithm to read and interpret. For example, if we want to know how long a letter will take to get from city A to city B, a good, easy-to-read feature is the distance between the two cities in miles. A hard-to-read feature is the pair of coordinates of the two cities. Both carry much the same information, but in the second format it is harder for the algorithm to get at the essence of the feature.
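As a rough illustration of turning the hard-to-read feature into the easy one, the sketch below converts two pairs of coordinates into a single distance using the haversine formula. The function name `haversine_miles` is hypothetical, the Earth radius is an approximation, and the result is a straight great-circle distance, not the route a letter would actually travel.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    earth_radius_miles = 3958.8  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Approximate coordinates of New York and Los Angeles
distance = haversine_miles(40.7128, -74.0060, 34.0522, -118.2437)
print(round(distance))
```

Four coordinate columns collapse into one number that relates directly to delivery time.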

To recap, the goals of feature selection are:

- Improving the prediction performance
- Providing faster and more cost-effective predictors
- Providing a better understanding of the underlying process that generated the data
  
In this notebook we will introduce some basic methods from a mathematical perspective. Let's start.

### Removing features with low variance

If a feature has the same value in all observations, it tells us nothing, so it is useless. In these cases **VarianceThreshold** is used to remove the useless features. As its name states, **VarianceThreshold** removes features whose variance does not meet a stated threshold. By default it removes all features with 0 variance.
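Before moving on to real data, here is a minimal sketch of the default behavior on a tiny made-up matrix (invented for illustration) whose middle column is constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix invented for illustration: the middle column is constant
X = np.array([[1.0, 7.0, 0.2],
              [2.0, 7.0, 0.4],
              [3.0, 7.0, 0.1]])

selector = VarianceThreshold()  # default threshold=0.0
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)         # the constant column is gone
print(selector.get_support())  # boolean mask of the kept columns
```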

Let's see an example.

We will work with the famous iris data set, which has four features: sepal length, sepal width, petal length, and petal width. Let's look at the variance of these features.

In [8]:
# Import the needed libraries and the dataset
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()

# Print the shape of the data set: we have 150 observations and 4 features
print('Shape: (%d, %d)' % iris.data.shape)

# Print title
print('Variance:')

# Variance of each of the 4 features (axis=0 means column-wise)
dict(zip(iris.feature_names, np.var(iris.data, axis=0)))

Shape: (150, 4)
Variance:


{'petal length (cm)': 3.0924248888888854,
 'petal width (cm)': 0.5785315555555559,
 'sepal length (cm)': 0.6811222222222222,
 'sepal width (cm)': 0.1867506666666667}

In [43]:
# Let's say we want to turn this numpy array into a DataFrame; we use pandas for that.
import pandas as pd
irisDF = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# See the first 5 rows using head()
irisDF.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


Now let's remove all features with a variance below 0.6. According to the variances above, we should be left with 2 features (sepal length and petal length). Let's see.

In [11]:
# Import feature selection from sklearn
import sklearn.feature_selection as fs

# The argument "threshold=0.6" sets the variance threshold.
x_new = fs.VarianceThreshold(threshold=0.6).fit_transform(iris.data)

# After the selection, the new variable x_new has only 2 predictors.
x_new.shape

(150, 2)
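The shape tells us how many features survived but not which ones. A small sketch, assuming we fit the selector separately so we can ask it for its mask with `get_support()`:

```python
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

iris = datasets.load_iris()

# Fit only, so the selector object stays available for inspection
selector = VarianceThreshold(threshold=0.6).fit(iris.data)

# Boolean mask with one entry per original feature
mask = selector.get_support()
kept = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(kept)
```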

### Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests.

#### SelectKBest

SelectKBest keeps the $k$ highest-scoring features. The scoring function is passed as an argument; here we use the $\chi^2$ test to score the features. Let's see how it works in code.

In [56]:
# In fs.SelectKBest, fs.chi2 means we are going to use chi-square and k=2 means we will select the best 2 features
# In fit_transform, we should enter the independent and the dependent variables
best2 = fs.SelectKBest(fs.chi2, k=2).fit_transform(iris.data, iris.target)

print('Shape')
print(best2.shape)
print('   ')
print('Slice first 5 rows of our new variable best2 with the best 2 features only')
print(best2[:5,])
print('   ')
print('Slice first 5 rows of the iris data with all original features')
print(iris.data[:5,])

Shape
(150, 2)
   
Slice first 5 rows of our new variable best2 with the best 2 features only
[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]]
   
Slice first 5 rows of the iris data with all original features
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
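To see why these two columns were chosen, we can look at the score each feature received. This is a small sketch that fits the selector without transforming, then reads its `scores_` attribute and the indices of the selected columns.

```python
from sklearn import datasets
import sklearn.feature_selection as fs

iris = datasets.load_iris()

selector = fs.SelectKBest(fs.chi2, k=2).fit(iris.data, iris.target)

# One chi-squared score per feature; higher means more related to the class
for name, score in zip(iris.feature_names, selector.scores_):
    print(f'{name}: {score:.2f}')

# Column indices of the features that were kept
selected = selector.get_support(indices=True)
print(selected)
```

The two petal measurements get the highest scores, which matches the columns seen in the output above.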


#### SelectPercentile 

SelectPercentile is handy when there are so many features (say, thousands) that picking an absolute number is awkward. It works the same way as SelectKBest, here again with the $\chi^2$ test as the scoring function, except that we ask for a percentage of the features rather than an actual number. Let's get to the code.

We will ask for the top 50%. With 4 features that means 2 features, so the result should be exactly the same as in the previous example.

In [59]:
# In fs.SelectPercentile, fs.chi2 means we are going to use chi-square and percentile=50 means
# we will select the top 50% of the features (recent scikit-learn versions require the keyword)
# In fit_transform, we pass the independent and the dependent variables
percent50 = fs.SelectPercentile(fs.chi2, percentile=50).fit_transform(iris.data, iris.target)


print('Shape')
print(percent50.shape)
print('   ')
print('Slice first 5 rows of our new variable percent50 with the top 50% features only')
print(percent50[:5,])
print('   ')
print('Slice first 5 rows of the iris data with all original features')
print(iris.data[:5,])
print('   ')
print('It worked just as expected YEY')

Shape
(150, 2)
   
Slice first 5 rows of our new variable percent50 with the top 50% features only
[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]]
   
Slice first 5 rows of the iris data with all original features
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
   
It worked just as expected YEY


In regression we can use the f_regression function, which scores each feature with an F-test. We will use the iris data set to do a little exercise. Let's say we want to predict the fourth feature from all four features and ask for the single best predictor, which in essence will be the fourth feature itself. Let's see the code.

In [84]:
# Set all 4 features as our independent variables
X = iris.data

# Set our fourth feature as the dependent variable
y = iris.data[:, 3]

# In fs.SelectKBest, fs.f_regression means we are using the F-test and k=1 means we will select the best feature
# In fit_transform, we pass the independent and the dependent variables
best1 = fs.SelectKBest(fs.f_regression, k=1).fit_transform(X, y)

print('Slice first 5 rows of our new variable best1 with the feature that best explains the target (itself)')
print(best1[:5, :])
print('   ')
print('Slice first 5 rows of the iris data with all original features')
print(X[:5, :])
print('   ')
print('Again, it worked just as expected, it chose itself')

Slice first 5 rows of our new variable best1 with the feature that best explains the target (itself)
[[0.2]
 [0.2]
 [0.2]
 [0.2]
 [0.2]]
   
Slice first 5 rows of the iris data with all original features
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


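For reference, f_regression turns the correlation between each feature and the target (corr) into an F statistic:

$$F = \frac{\text{corr}^2}{1 - \text{corr}^2} \times \text{degrees of freedom}$$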
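The F statistic above can be checked by hand. The sketch below assumes the default settings of `f_regression`, where the degrees of freedom are the number of observations minus 2. It uses only the first three features as predictors, since a feature compared with itself has a correlation of exactly 1 and the formula would divide by zero.

```python
import numpy as np
from sklearn import datasets
import sklearn.feature_selection as fs

iris = datasets.load_iris()

# Predict petal width from the first three features only
X = iris.data[:, :3]
y = iris.data[:, 3]
n = len(y)

# Pearson correlation of each feature with the target
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# F statistic by hand, assuming n - 2 degrees of freedom
f_manual = corr ** 2 / (1 - corr ** 2) * (n - 2)

# F statistic from scikit-learn (the second return value holds the p-values)
f_sklearn, p_values = fs.f_regression(X, y)

print(f_manual)
print(f_sklearn)
```

Both arrays should agree up to floating-point error, and petal length comes out as the strongest single predictor of petal width.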
