# M/L Commando Course, Cambridge 2018

## Feature Engineering and Selection

_Our usual scenario for learning tasks includes a list of instances (represented as feature/value pairs) and a special feature (the target class) that we want to predict for future instances based on the values of the remaining features. However, the source data does not usually come in this format. We have to extract what we think are potentially useful features and convert them to our learning format. This process is called feature extraction or feature engineering, and it is an often underestimated but very important and time-consuming phase in most real-world machine learning tasks._

Start by importing numpy, scikit-learn, pandas, and pyplot, the Python libraries we will be using in this chapter. Show the versions we will be using (in case you have problems running the notebooks).

In [None]:
%matplotlib inline
import matplotlib  # imported explicitly so matplotlib.__version__ below resolves
import matplotlib.pyplot as plt
import IPython
import sklearn as sk
import numpy as np
import pandas as pd

print('IPython version:', IPython.__version__)
print('numpy version:', np.__version__)
print('scikit-learn version:', sk.__version__)
print('matplotlib version:', matplotlib.__version__)
print('pandas version:', pd.__version__)

## Import titanic data using pandas

The Python package pandas (http://pandas.pydata.org/) provides data structures and tools for data analysis. It aims to offer features similar to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert it to the scikit-learn format.

In [None]:
titanic_raw = pd.read_csv('data/titanic.csv')
print (titanic_raw[12:14])

You can see that each CSV column has a corresponding feature in the DataFrame, and that the feature type is inferred from the available data. We can inspect some features to see what they look like.

In [None]:
print (titanic_raw.head(14)[['pclass', 'survived', 'age', 'embarked', 'boat', 'sex']]) 
# "head" method just gets us the first n rows
#<-rjm49 note mix of indexing and list notation [[]]


In [None]:
titanic_raw.describe() #rjm49 - handy Pandas summarisation method

## Feature extraction

As we know, scikit-learn methods expect real numbers as feature values. Last time, we used the LabelEncoder and OneHotEncoder preprocessing methods to manually convert certain categorical features into 1-of-K values (generating a new feature for each possible value, set to 1 if the original feature had the corresponding value and 0 otherwise). This time, we will use a similar scikit-learn method, DictVectorizer, which automatically builds these features from the different original feature values. Moreover, we will write a function that encodes a set of columns in a single step.

In [None]:
from sklearn import feature_extraction

def one_hot_dataframe(data, cols, replace=False):
    """ Takes a dataframe and a list of columns that need to be encoded.
    Returns a 2-tuple comprising the data (with the encoded columns
    swapped in when replace is True) and the vectorized data.
    Modified from https://gist.github.com/kljensen/5452382
    """
    dic_vecr = feature_extraction.DictVectorizer()
    stuff_to_transform = data[cols]
    print(stuff_to_transform.head(), "\n")
    dty_list = stuff_to_transform.to_dict(orient='records') #rjm49 - this call makes a dict for each record in the table, returns them all as a list
    print("first dicts:", dty_list[0:14], "\n") #rjm49 - look at the first few to see what's in them...

    txd = dic_vecr.fit_transform(dty_list) #converts string types to 1-hot-encoded classes as a SciPy sparse matrix
    print("first transformed vec:\n", txd[0], "\n")

    vecData = pd.DataFrame(txd.toarray())
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    vecData.columns = dic_vecr.get_feature_names_out()
    vecData.index = data.index
    if replace: #replace the columns in data with those from our vecData object
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData)

titanic, titanic_n= one_hot_dataframe(titanic_raw, ['pclass', 'embarked', 'sex'], replace=True)
print(titanic.head())

In [None]:
titanic.describe()

The heck is going on with the "embarked" feature?

In [None]:
print (titanic_n['embarked'])
print ("- - - -")
mask = (titanic_n['embarked'] != 0) #returns a boolean True/False Series
print (titanic_n[mask].head()['embarked'])
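
A minimal sketch of where a bare `embarked` column can come from, assuming the column holds port names as strings plus some missing values stored as NaN. The two toy records are hypothetical and not taken from the Titanic file. DictVectorizer turns each string value into a `feature=value` column, but passes numeric values through under the plain feature name, and NaN is a float:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records: one port name (string) and one missing value (NaN)
records = [{'embarked': 'Southampton'}, {'embarked': float('nan')}]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(records)

# get_feature_names_out() exists from scikit-learn 1.0; fall back for older releases
if hasattr(vec, 'get_feature_names_out'):
    names = list(vec.get_feature_names_out())
else:
    names = list(vec.get_feature_names())

print(names)    # one column per string value, plus a plain 'embarked' column
print(encoded)  # the NaN record lands in the plain 'embarked' column
```

If this reading is right, the nonzero entries picked out by the mask above are the passengers whose port of embarkation was missing.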

Convert the remaining categorical features...

In [None]:
titanic, titanic_n = one_hot_dataframe(titanic, ['home.dest', 'room', 'ticket', 'boat'], replace=True)


We also have to deal with missing values, since the DecisionTreeClassifier we plan to use does not accept them as input (at least in the scikit-learn releases this notebook targets). Pandas allows us to replace them with a fixed value using the fillna method. We will use the mean age for the age feature, and 0 for the remaining missing attributes.

Replace missing ages with the mean age:

In [None]:
print (titanic['age'].describe())
print (titanic[12:14]['age'])

print("- - - - -")
mean = titanic['age'].mean()
titanic['age'] = titanic['age'].fillna(mean)  # assign back; in-place fillna on a selected column is unreliable in recent pandas
print (titanic['age'].describe())
print (titanic[12:14]['age'])

Fill the remaining missing values with zeros

In [None]:
titanic.fillna(0, inplace=True)
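
The two imputation steps can be tried on a small frame first. This is a sketch on made-up values (the `toy` frame and its `fare` column are hypothetical), showing the mean replacing a missing age and zero replacing everything else:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({'age': [20.0, np.nan, 40.0],
                    'fare': [7.5, np.nan, 30.0]})

# Step 1: fill missing ages with the mean of the observed ages (30.0 here)
toy['age'] = toy['age'].fillna(toy['age'].mean())

# Step 2: fill every remaining missing value with 0
toy = toy.fillna(0)

print(toy)
```

Note that filling with zero is a blunt choice: for a column like `fare`, a zero is indistinguishable from a genuine value of zero.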

In [None]:
print (titanic.head(15))

Build the training and testing datasets

In [None]:
from sklearn.model_selection import train_test_split
titanic_target = titanic['survived']
titanic_data = titanic.drop(['name', 'row.names', 'survived'], axis=1) #can use inplace=True to alter original
X_train, X_test, y_train, y_test = train_test_split(titanic_data, titanic_target, test_size=0.25, random_state=33)


Let's see how a decision tree works with the current feature set.

In [None]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='entropy')
dt = dt.fit(X_train, y_train)


In [None]:
import pydot, io
dot_data = io.StringIO()
# feature_names is passed as a plain list of column names
tree.export_graphviz(dt, out_file=dot_data, feature_names=list(titanic_data.columns))
# graph_from_dot_data returns a list of graphs; unpack the single one
(graph,) = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('titanic.png')
from IPython.display import Image
Image(graph.create_png())

In [None]:
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

In [None]:
from sklearn import metrics
measure_performance(X_test, y_test, dt, show_confusion_matrix=False, show_classification_report=False)
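
To read the numbers that `measure_performance` prints, it may help to run the same metrics on a handful of labels. This is a sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical), not output from the Titanic model:

```python
from sklearn import metrics

# Hypothetical true labels and predictions: three of the four match
y_true_demo = [1, 0, 1, 1]
y_pred_demo = [1, 0, 0, 1]

acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)

print("Accuracy:{0:.3f}".format(acc))
# Rows are true classes, columns are predicted classes
print(cm)
```

Here one survivor (class 1) is predicted as class 0, so it shows up in the bottom-left cell of the confusion matrix.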

## Feature Selection

Working with a smaller feature set may lead to better results, so we want an algorithmic way to find the best features. This task is called feature selection, and it is a crucial step when we aim to get decent results with machine learning algorithms. If we have poor features, we will get poor results no matter how sophisticated our machine learning algorithm is.

Select only the 20% most important features, using a chi2 test:

In [None]:
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
print (titanic_data.columns[fs.get_support()])
print (fs.scores_[2])
print (titanic_data.columns[2])
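
A small illustration of what the chi2 selector does, on a hypothetical four-row matrix rather than the Titanic data. The first column tracks the label exactly, the second is unrelated to it, and the third is constant, so only the first should get a nonzero score:

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

# Hypothetical toy data: column 0 equals the label, column 1 is unrelated,
# column 2 is constant. chi2 requires non-negative feature values.
X_demo = np.array([[1, 0, 3],
                   [1, 1, 3],
                   [0, 0, 3],
                   [0, 1, 3]])
y_demo = np.array([1, 1, 0, 0])

# Keep roughly the top third of the features, i.e. one of the three
selector = SelectPercentile(chi2, percentile=34)
X_demo_fs = selector.fit_transform(X_demo, y_demo)

print(selector.scores_)        # one chi2 score per column
print(selector.get_support())  # boolean mask of the columns that were kept
```

The mask returned by `get_support()` is what the cell above uses to index `titanic_data.columns`.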


Evaluate performance with the new feature set

In [None]:
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confusion_matrix=False, show_classification_report=False)

Find the best percentile using cross-validation on the training set

In [None]:
from sklearn import model_selection #replaces the old cross_validation module

percentiles = range(1, 100, 5)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = model_selection.cross_val_score(dt, X_train_fs, y_train, cv=5)
    #print(i, scores.mean())
    results = np.append(results, scores.mean())

# Index of the best mean score (first one if there are ties)
optimal_percentil = np.where(results == results.max())[0]
print(optimal_percentil)
print(percentiles)
print("Optimal percentile of features:", percentiles[optimal_percentil[0]], "\n")

# Plot percentile of features vs. cross-validation scores
plt.figure()
plt.xlabel("Percentile of features selected")
plt.ylabel("Cross-validation accuracy")
plt.plot(percentiles, results)
print("Mean scores:", results)
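
One possible variation, not the approach used in this notebook: the loop above fits the selector on the whole training set before cross-validating, so every fold has already influenced which features were kept. Wrapping the selector and the tree in a scikit-learn `Pipeline` refits the selection inside each fold. The sketch below uses synthetic data from `make_classification` (an assumption, purely for illustration), shifted to be non-negative because chi2 requires it:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic set
X_syn, y_syn = make_classification(n_samples=200, n_features=20,
                                   n_informative=4, random_state=0)
X_syn = X_syn - X_syn.min(axis=0)  # shift each column so all values are >= 0

# Selector and classifier are fitted together on each training fold
pipe = Pipeline([
    ('select', SelectPercentile(chi2, percentile=20)),
    ('tree', DecisionTreeClassifier(criterion='entropy', random_state=0)),
])

cv_scores = cross_val_score(pipe, X_syn, y_syn, cv=5)
print("Mean CV accuracy: {0:.3f}".format(cv_scores.mean()))
```

The same pipeline could be scored once per candidate percentile to reproduce the sweep above with the selection kept inside the folds.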

Evaluate our best number of features on the test set

In [None]:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=percentiles[optimal_percentil[0]])
X_train_fs = fs.fit_transform(X_train, y_train) #rjm49 - select just the most relevant features, train on those
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confusion_matrix=False, show_classification_report=False)

In [None]:
print(dt.get_params())

Compare the split criteria using cross-validation on the training set (see next notebook on Model Selection)

In [None]:
dt = tree.DecisionTreeClassifier(criterion='entropy')
scores = model_selection.cross_val_score(dt, X_train_fs, y_train, cv=5)
print("Entropy criterion accuracy on cv: {0:.3f}".format(scores.mean()))
dt = tree.DecisionTreeClassifier(criterion='gini')
scores = model_selection.cross_val_score(dt, X_train_fs, y_train, cv=5)
print("Gini criterion accuracy on cv: {0:.3f}".format(scores.mean()))



In [None]:
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confusion_matrix=False, show_classification_report=False)