In [1]:
%pylab inline
import platform
import IPython
import sklearn as sk
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

print ('Python version:', platform.python_version())
print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sk.__version__)
print ('matplotlib version:', matplotlib.__version__)

Populating the interactive namespace from numpy and matplotlib
Python version: 3.5.2
IPython version: 4.0.1
numpy version: 1.13.1
scikit-learn version: 0.18.2
matplotlib version: 1.5.0


In the previous chapters we have studied several algorithms for very different tasks,
from classification and regression to clustering and dimensionality reduction. We
showed how we can apply these algorithms to predict results when faced with new
data. That is what machine learning is all about. In this last chapter, we want to show
some important concepts and methods you should take into account if you want to
do real-world machine learning.
- In real-world problems, data is usually not already expressed as
attribute/float-value pairs; it comes in more complex structures or is not structured at
all. We will learn __feature extraction__ techniques that allow us to extract
features usable by scikit-learn from such data.
- From the initial set of available features, not all of them will be useful
for our algorithms to learn from; in fact, some of them may degrade our
performance. We will address the problem of selecting the most adequate
feature set, a process known as __feature selection__.
- Finally, as we have seen in the examples in this book, many machine
learning algorithms have parameters that must be set before we can use them.
To do that, we will review __model selection__ techniques; that is, methods to
select the most promising hyperparameters for our algorithms.

# Feature extraction

...the source data does not usually come in this format. We have to
extract what we think are potentially useful features and convert them to our learning
format. This process is called feature extraction or feature engineering, and it is an
often underestimated but very important and time-consuming phase in most real-world
machine learning tasks. We can identify two different steps in this task:
 - __Obtain features__: This step involves processing the source data and extracting
the learning instances, usually in the form of feature/value pairs where
the value can be an integer or float value, a string, a categorical value, and
so on. The method used for extraction depends heavily on how the data
is presented. For example, we can have a set of pictures and generate an
integer-valued feature for each pixel, indicating its color level, as we did
in the face recognition example in Chapter 2, Supervised Learning. Since this
is a very task-dependent job, we will not delve into details and assume we
already have this setting for our examples.
 - __Convert features__: Most scikit-learn algorithms assume as an input a set of
instances represented as a list of float-valued features. How to get these
features will be the main subject of this section.
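
To make the pixel example from the first step concrete, here is a minimal sketch using a made-up 2x3 grayscale image (the array and the names toy_image, pixel_features, and float_features are hypothetical and are not the face data from Chapter 2). It produces one integer-valued feature per pixel and then converts the result to the float-valued form most scikit-learn estimators expect.

```python
import numpy as np

# Hypothetical 2x3 grayscale image; each entry is a pixel intensity (0-255)
toy_image = np.array([[0, 128, 255],
                      [64, 32, 16]], dtype=np.uint8)

# Obtain features: one integer-valued feature per pixel, read row by row
pixel_features = toy_image.ravel()

# Convert features: the same values as a float-valued vector
float_features = pixel_features.astype(np.float64)

print(pixel_features)
print(float_features.shape)
```

Real images would of course need task-specific preprocessing first; the sketch only illustrates the shape of the result.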

We can, as we did in Chapter 2, Supervised Learning, build ad hoc procedures to
convert the source data. There are, however, tools that can help us to obtain a
suitable representation. The Python package __pandas__ (http://pandas.pydata.org/),
for example, provides data structures and tools for data analysis. It aims to
provide similar features to those of R, the popular language and environment for
statistical computing. We will use pandas to import the Titanic data we presented in
Chapter 2, Supervised Learning, and convert them to the scikit-learn format.

In [2]:
%pylab inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib


In [32]:
titanic = pd.read_csv('data/titanic.csv')
print (titanic[:5])

   row.names pclass  survived  \
0          1    1st         1   
1          2    1st         0   
2          3    1st         0   
3          4    1st         0   
4          5    1st         1   

                                              name      age     embarked  \
0                     Allen, Miss Elisabeth Walton  29.0000  Southampton   
1                      Allison, Miss Helen Loraine   2.0000  Southampton   
2              Allison, Mr Hudson Joshua Creighton  30.0000  Southampton   
3  Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton   
4                    Allison, Master Hudson Trevor   0.9167  Southampton   

                         home.dest room      ticket   boat     sex  
0                     St Louis, MO  B-5  24160 L221      2  female  
1  Montreal, PQ / Chesterville, ON  C26         NaN    NaN  female  
2  Montreal, PQ / Chesterville, ON  C26         NaN  (135)    male  
3  Montreal, PQ / Chesterville, ON  C26         NaN    NaN  female  

In [6]:
print (titanic.head()[['pclass', 'survived', 'age', 'embarked',
'boat', 'sex']])

  pclass  survived      age     embarked   boat     sex
0    1st         1  29.0000  Southampton      2  female
1    1st         0   2.0000  Southampton    NaN  female
2    1st         0  30.0000  Southampton  (135)    male
3    1st         0  25.0000  Southampton    NaN  female
4    1st         1   0.9167  Southampton     11    male


The main difficulty we have now is that scikit-learn methods expect real numbers
as feature values. In Chapter 2, Supervised Learning, we used the LabelEncoder and
OneHotEncoder preprocessing methods to manually convert certain categorical
features into 1-of-K values (generating a new feature for each possible value; valued
1 if the original feature had the corresponding value and 0 otherwise). This time, we
will use a similar scikit-learn method, __DictVectorizer__, which automatically builds
these features from the different original feature values. Moreover, we will write
a function that encodes a set of columns in a single step.
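
As a minimal sketch of the 1-of-K encoding described above, the following toy example shows what DictVectorizer produces for string-valued and numeric features. The two passenger records are made up for illustration and are not rows from the Titanic file; the names toy_records, toy_vec, and toy_matrix are likewise hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one dict per instance, feature name -> value
toy_records = [
    {'pclass': '1st', 'sex': 'female', 'age': 29.0},
    {'pclass': '3rd', 'sex': 'male', 'age': 2.0},
]

# sparse=False returns a dense NumPy array instead of a sparse matrix
toy_vec = DictVectorizer(sparse=False)
toy_matrix = toy_vec.fit_transform(toy_records)

# String values become one "name=value" column each; numbers pass through
print(toy_vec.feature_names_)
print(toy_matrix)
```

Each string value gets its own 0/1 column, while the numeric age feature keeps its name and value.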

In [7]:
from sklearn import feature_extraction

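def one_hot_dataframe(data, cols, replace=False):
    # Encode each column in cols as 1-of-K features with DictVectorizer
    vec = feature_extraction.DictVectorizer()
    # Build one {column: value} dict per row
    mkdict = lambda row: dict((col, row[col]) for col in cols)
    vecData = pd.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
    # get_feature_names() matches the scikit-learn 0.18 used here;
    # newer releases provide get_feature_names_out() instead
    vecData.columns = vec.get_feature_names()
    vecData.index = data.index
    if replace:
        # Swap the original columns for their encoded versions
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData)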

The one_hot_dataframe function (based on the script at
https://gist.github.com/kljensen/5452382) takes a pandas DataFrame and a list of
columns and encodes each column into the necessary 1-of-K features. If the replace
parameter is True, it also substitutes the original columns with the new set. Let's
see it applied to the categorical pclass, embarked, and sex features (titanic_n
contains only the newly created columns):

In [34]:
titanic,titanic_n = one_hot_dataframe(titanic, ['pclass', 'embarked', 'sex'], replace=True)
titanic.describe()

| | row.names | survived | age | embarked | embarked=Cherbourg | embarked=Queenstown | embarked=Southampton | pclass=1st | pclass=2nd | pclass=3rd | sex=female | sex=male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1313.0 | 1313.0 | 633.0 | 821 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 |
| mean | 657.0 | 0.341965 | 31.194181 | 0 | 0.154608 | 0.034273 | 0.436405 | 0.24524 | 0.213252 | 0.541508 | 0.352628 | 0.647372 |
| std | 379.174762 | 0.474549 | 14.747525 | 0 | 0.361668 | 0.181998 | 0.496128 | 0.430393 | 0.40976 | 0.498464 | 0.47797 | 0.47797 |
| min | 1.0 | 0.0 | 0.1667 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 329.0 | 0.0 | 21.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 657.0 | 0.0 | 30.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 75% | 985.0 | 1.0 | 41.0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| max | 1313.0 | 1.0 | 71.0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


The pclass attribute has been converted to three features, pclass=1st, pclass=2nd, and
pclass=3rd, and similarly for the other two attributes. Note that the
embarked feature has not disappeared. This is because the original
embarked attribute included NaN values, indicating a missing value; in those cases,
every feature based on embarked will be valued 0, but the original feature whose
value is NaN remains, indicating the feature is missing for certain instances. Next, we
encode the remaining categorical attributes:

In [35]:
titanic, titanic_n = one_hot_dataframe(titanic, ['home.dest', 'room', 'ticket', 'boat'], replace=True)

In [36]:
titanic.describe()

| | row.names | survived | age | embarked | embarked=Cherbourg | embarked=Queenstown | embarked=Southampton | pclass=1st | pclass=2nd | pclass=3rd | ... | ticket=248744 L13 | ticket=248749 L13 | ticket=250647 | ticket=27849 | ticket=28220 L32 10s | ticket=34218 L10 10s | ticket=36973 L83 9s 6d | ticket=392091 | ticket=7076 | ticket=L15 1s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1313.0 | 1313.0 | 633.0 | 821 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | ... | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 |
| mean | 657.0 | 0.341965 | 31.194181 | 0 | 0.154608 | 0.034273 | 0.436405 | 0.24524 | 0.213252 | 0.541508 | ... | 0.000762 | 0.000762 | 0.000762 | 0.000762 | 0.002285 | 0.000762 | 0.001523 | 0.001523 | 0.000762 | 0.000762 |
| std | 379.174762 | 0.474549 | 14.747525 | 0 | 0.361668 | 0.181998 | 0.496128 | 0.430393 | 0.40976 | 0.498464 | ... | 0.027597 | 0.027597 | 0.027597 | 0.027597 | 0.047764 | 0.027597 | 0.039014 | 0.039014 | 0.027597 | 0.027597 |
| min | 1.0 | 0.0 | 0.1667 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 329.0 | 0.0 | 21.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 657.0 | 0.0 | 30.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 985.0 | 1.0 | 41.0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1313.0 | 1.0 | 71.0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
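
The leftover embarked column in these summaries can be reproduced on a toy frame. The sketch below is an illustrative variant of the encoding helper, not the function used in this notebook: the name one_hot_sketch and the toy rows are hypothetical, and it builds the row dicts with to_dict and reads column names from the feature_names_ attribute, on the assumption that both are available in the installed pandas and scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

def one_hot_sketch(data, cols):
    """Return a DataFrame of 1-of-K columns for cols (illustrative variant)."""
    vec = DictVectorizer(sparse=False)
    # One {column: value} dict per row; missing values stay as float NaN
    records = data[cols].to_dict(orient='records')
    encoded = pd.DataFrame(vec.fit_transform(records),
                           columns=vec.feature_names_, index=data.index)
    return encoded

# Hypothetical rows; the last one has no port of embarkation
toy = pd.DataFrame({'embarked': ['Southampton', 'Cherbourg', np.nan]})
toy_encoded = one_hot_sketch(toy, ['embarked'])
print(toy_encoded)
```

Because NaN is a number rather than a string, it is passed through under the original column name, while every embarked=... column for that row is 0.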


We also have to deal with missing values, since the DecisionTreeClassifier we plan
to use does not accept them as input. pandas allows us to replace them with a fixed
value using the fillna method. We will use the mean age for the age feature, and 0
for the remaining missing attributes.

In [37]:
mean = titanic['age'].mean()
titanic['age'] = titanic['age'].fillna(mean)
titanic.fillna(0, inplace=True)
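
The effect of this two-stage fill can be checked on a small made-up frame. This is only a sketch: the frame toy and the name toy_mean_age are hypothetical, and it uses assignment rather than inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numeric and an encoded column
toy = pd.DataFrame({'age': [20.0, np.nan, 40.0],
                    'embarked': [0.0, np.nan, 0.0]})

# Mean of the observed ages (NaN values are skipped by default)
toy_mean_age = toy['age'].mean()
toy['age'] = toy['age'].fillna(toy_mean_age)

# Everything that is still missing is set to 0
toy = toy.fillna(0)

print(toy)
print(toy.isnull().sum().sum())
```

The order matters: filling the whole frame with 0 first would also overwrite the missing ages, so the mean would never be applied.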

Now, all of our features (except for name) are in a suitable format. We are ready to
build the test and training sets, as usual.

In [39]:
from sklearn.model_selection import train_test_split
titanic_target = titanic['survived']
titanic_data = titanic.drop(['name', 'row.names', 'survived'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(titanic_data, titanic_target, test_size=0.25, random_state=33)
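
To see what this call returns, here is a small sketch on made-up arrays (X_toy, y_toy, and the split names are hypothetical). It assumes a scikit-learn version that provides sklearn.model_selection, which is 0.18 or later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 instances with 2 features each, and binary targets
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33)

# 25% of the 8 instances are held out for testing
print(X_tr.shape, X_te.shape)

# Fixing random_state makes the shuffle reproducible
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33)
print(np.array_equal(X_te, X_te2))
```

Every instance ends up in exactly one of the two sets, and the targets are shuffled together with their rows.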

We decided to simply drop the name attribute, since we do not expect it to be
informative about the survival status (we have one different value for each instance,
so we cannot generalize over it). We also specified the survived feature as the target
class, and consequently removed it from the feature matrix.
Let's see how a decision tree works with the current feature set.

In [40]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='entropy')
dt = dt.fit(X_train, y_train)
from sklearn import metrics
y_pred = dt.predict(X_test)
print ("Accuracy:{0:.3f}".format(metrics.accuracy_score(y_test, y_pred)), "\n")

Accuracy:0.833 
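
The accuracy reported here is the fraction of test instances whose predicted class matches the true class. The sketch below checks that definition against metrics.accuracy_score on made-up labels (y_true_toy and y_pred_toy are hypothetical and are not the Titanic predictions).

```python
from sklearn import metrics

# Hypothetical labels, not the Titanic predictions
y_true_toy = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_toy = [1, 0, 1, 1, 0, 0, 0, 0]

# Fraction of positions where prediction and truth agree
matches = sum(t == p for t, p in zip(y_true_toy, y_pred_toy))
manual_accuracy = matches / len(y_true_toy)

sk_accuracy = metrics.accuracy_score(y_true_toy, y_pred_toy)

print("Accuracy:{0:.3f}".format(sk_accuracy))
```

Six of the eight toy predictions are correct, so both computations give 0.75.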

