## Homework 1: Titanic Survival Classification

Using the Titanic dataset, create features to classify the survival of passenger

![alt-text](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/450px-RMS_Titanic_3.jpg "titanic")


### First and foremost, the imports

Obviously one of the most important reasons people use Python is because of its codebase. If you ever have a problem coding in python, chances are somone else has solved your problem better than you probably could have and  has shared reproducable code that is most likely awesome!

So I prefer to use what other people have done becuase I am lazy.

**The packages I will be using for this homework:**
- **Pandas**
    - This is my favorite python package for data manipulation. Pandas has saved me so much time it is unbelievable. I highly recommend if you plan on using Python for data science.
    - http://pandas.pydata.org/pandas-docs/stable/
- **Numpy**
- **Scikit-Learn (sklearn)**
    - Preprogrammed ml so your life is easier.
    - http://scikit-learn.org/stable/user_guide.html
- **Matplotlib / Seaborn**
    - Matplotlib is easy matlab style graphing in Python but truth be told, it's ugly as shit so just install seaborn and it well render overlay that looks wayyy better. 
    - https://stanford.edu/~mwaskom/software/seaborn/

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

### Reading the data with pandas

I have the titanic training data saved in the same directory as this notebook so loading the data into a pandas DataFrame is really really easy, like one line of code easy. A pandas data frame is like a fancy overlay for numpy arrays that allows you to construct a heterogenous array containing all types.

Below is two lines of code that imports and displays the first 5 lines of the DataFrame

In [2]:
tr = pd.read_csv('train.csv', index_col=[0])
tr.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Yeah so as you can see, a really awesome way to load in data.

Now we have a data frame in python that we can manipulate to clean this data so that we can pass it into a classifierb

In [3]:
#lets look at the nan values present and figure out what to do with those
tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


So we can see that there are NA values present in Age, Cabin and Embarked

When building predictive models, it is often important to deal with NA's in an inteligent way. Since there aren't many NA's in the data and it is still early in the class, I will simply impute the median for Age and the most common value for Embarked. Embarked won't be an issue because there are only two missing, age will definitly be impacted and definitly devalue its predictive ability.

I will be dealing with the Cabin variable separately so we are just filling Age and Embarked.

** Note, this isn't the proper way to deal with missing data and there is an entire field of statistics that is dedicated to this matter.**

In [33]:
#Fill the age values with the median
train = tr.copy()
train.Age = tr.Age.fillna(np.median(train.Age.dropna()))


#fill the embarked value with the most frequent
embarked_val = tr.Embarked.dropna().value_counts().idxmax()
train.Embarked = tr.Embarked.replace(np.nan, embarked_val)

Pandas doesn't like setting vaues on only portions of a dataframe, so the common practice is to create a copy of the dataframe to set vaues on. Obviously this isn't efficient so if your data is big then don't copy and just grit your teeth and deal with the bitching that pandas is gonna provide.

If you know yourself to be a careful and diligent data setter, then here is a link to a SE post that explains how to disable the warning that crops up when you try to do that.

http://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas

Here is the command: 

`pd.options.mode.chained_assignment = None  # default='warn'`

In [37]:
#clean up the training data to use in numpy array
#create a vector of the response variable
y = train.Survived.values

#encode sex and embarked as numerical variables using sklearns label encoder
#that we imported above
le = LabelEncoder()
sex = le.fit_transform(train.Sex)
le1 = LabelEncoder()
embarked = le1.fit_transform(train.Embarked)


#grab all of the categorical variables
cat_vars = train[['Pclass', 'SibSp', 'Parch']].values

#Since the cabin feature is a little confusing lets just binarize it to
#just include the information whether an individual was in a cabin or not
inCabin = [0 if i == np.nan else 1 for i in train.Cabin]
inCabin = np.array(inCabin)

#lets combine all of the categorical variables
#here using the np.insert function to add the 
#converted categorical features to the array

cat_vars = np.insert(cat_vars, len(cat_vars[0]), inCabin, axis = 1)
cat_vars = np.insert(cat_vars, len(cat_vars[0]), sex, axis=1)
cat_vars = np.insert(cat_vars, len(cat_vars[0]), embarked, axis=1)

#create new array for storing numeric variables
num_vars = train[["Age","Fare"]].values

#here I am adding in the numeric variables to the categorical variables, but ticket price is a float
#converting integers to floats will not lose any information, however ticket price would suffer from
#conversion to a float, although I suspect that wouldn't damage our results too much
cat_vars = np.insert(cat_vars, len(cat_vars[0]), num_vars[:,0], axis=1)
cat_vars = cat_vars.astype(float)
cat_vars = np.insert(cat_vars, len(cat_vars[0]), num_vars[:,1], axis=1)

train_array = cat_vars
train_array[1:5,]

array([[  1.    ,   1.    ,   0.    ,   1.    ,   0.    ,   0.    ,
         38.    ,  71.2833],
       [  3.    ,   0.    ,   0.    ,   1.    ,   0.    ,   2.    ,
         26.    ,   7.925 ],
       [  1.    ,   1.    ,   0.    ,   1.    ,   0.    ,   2.    ,
         35.    ,  53.1   ],
       [  3.    ,   0.    ,   0.    ,   1.    ,   1.    ,   2.    ,
         35.    ,   8.05  ]])

So there is the head of our numpy array containing all of the data that we are going to use as features. All of the categorical variables have been encoded and the numeric variables have been added at the end.

A note on the above code - *as you can see it is very repetative. It is very easy to create methods for doing all of this repetative work but methods that call fucntions from packages aren't always the easiest to demonstrate and especially difficult to present in a notebook.*

This is clearley not as easy to read as the pandas version.

Below I am printing a dictionary that contains the names and column locations for easier reference. If there are a large amount of features this won't be doable by hand but there are alot of ways of doing this using the `pandas.DataFame.columns` to get a list of the variable names.

In [39]:
feature_names = {'Pclass':0, 'SibSp':1, 'Parch':2,'inCabin':3, 'Sex':4, 'Embarked':5, 'Age':6, 'Fare':7}
print(feature_names)

{'Parch': 2, 'Pclass': 0, 'inCabin': 3, 'Fare': 7, 'SibSp': 1, 'Embarked': 5, 'Age': 6, 'Sex': 4}


So the above section was all about cleaning up the data that was initially loaded. At this point I would either save my cleaned features to a csv or pickle the data into a precompiled python object. Essentially storing the final training numpy ndarray to disk and being able to clear my memory. 

Clearly this isn't large data but I will show what I mean by this below.

In [53]:
import pickle

pickle.dump(train_array, open( "train_array.p", "wb" ))

In [57]:
%timeit rain_array = pickle.load(open( "train_array.p", "rb" ))
train_array[0:4,]

The slowest run took 5.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 95.3 µs per loop


array([[  3.    ,   1.    ,   0.    ,   1.    ,   1.    ,   2.    ,
         22.    ,   7.25  ],
       [  1.    ,   1.    ,   0.    ,   1.    ,   0.    ,   0.    ,
         38.    ,  71.2833],
       [  3.    ,   0.    ,   0.    ,   1.    ,   0.    ,   2.    ,
         26.    ,   7.925 ],
       [  1.    ,   1.    ,   0.    ,   1.    ,   0.    ,   2.    ,
         35.    ,  53.1   ]])

In [58]:
%timeit tr = pd.read_csv('train.csv', index_col=[0])

100 loops, best of 3: 3.4 ms per loop


So in practivce I wouldn't time how long it took but you can see here that loading a serialized object is way faster than using pandas. Than preformance advantage becomes important as the datasets become larger as well as having the extra memory.

IRL I would do the cleaning/serialization in a python scrip or just restart the notebook server and load the serialized array to work with.

## Part II: Modeling the data

So now that we have this trimmed down feature array loaded into memory, we can start to model our data.

We talked about splitting the data set into a labeled train and test dataset. This is easily done using sklearn and I highly recomend using the `train_test_split` function from the cross_validation class. 

There is no hard and fast rule about hom much data to put into each partition. I usually go with 70% or 80% but depending on the amount of data and the distribution of classes you could do anywhere from half and half to 95%.

In [59]:
from sklearn.cross_validation import train_test_split


#we need four objects to hold the four arrays that this function is going to return
#the split is 80% in train and 20% in test
X_train, X_test, y_train, y_test = train_test_split(train_array, y, test_size=0.2, random_state=0)

This next step of standardization of the features is done in the book but is something that isn't always default atleast in a statistical approach. Non-parametric classifiers don't rely as heavily on underlying distributions and therefore it isn't necessary to standardize always unless you are seeing certain variables unduely dominate the classifier.

Standardizing also is removing interpretability from the model. So say we were preforming some kind of logistic regression, the coeffiecnts of that regression model now wouldn't be interpretable in the same way as the non-scaled training data.

In [60]:
from sklearn.preprocessing import StandardScaler
#create a scaler object
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

Now that the data is split and standard we can actually start to train a prediction model.

I am just going to use two very simple clasifiers that are fairly easy to understand their methodology.

In [63]:
from sklearn.linear_model import LogisticRegression

#here we are creating a logistic regression object and training it on standardized features
lr = LogisticRegression(random_state = 0)
lr.fit(X_train_std, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [67]:
from sklearn.metrics import confusion_matrix
#here we can evaluate 
cfm = confusion_matrix(lr.predict(X_test_std), y_test)
cfm

array([[92, 18],
       [18, 51]])

This array above is a very informative

![alt-text](http://www.gepsoft.com/gepsoft/APS3KB/Chapter09/Section2/confusionmatrix.png "Confusion Matrix")

Using this we can calculate our predictive accuracy and precision, but if we train a few models with different parameters we can start to evaluate these models in an objective way using the information that we get from this matrix.

In [66]:
tr.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [71]:
print("Accuracy:", (cfm[0][0] + cfm[1][1])/len(y_test))

Accuracy: 0.798882681564


In [76]:
#base occurance rate of positive in test set
print("Ocuurance of positive: ", 549/(342+549))

Ocuurance of positive:  0.6161616161616161


So here accuracy may not be the best measurement of predictive ability because the base occurance rate of the positives in the test set is roughly 60% and we saw our predictive accuracy around 80%. This is much less impressive than 80% accuray when the positives and negatives are all held even.

Below I am going to train another very simple model that is even easier to understand than Logistic regression.

Here we are going to train a KNN classifier that essentially finds the points that the observation is nearest too in p-space and classifies the observation to be the most common class of the k nearest neighbors.

In [78]:
from sklearn.neighbors import KNeighborsClassifier as KNN

knn = KNN(5)
knn.fit(X_train_std, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [79]:
cfm1 = confusion_matrix(knn.predict(X_test_std), y_test)
cfm1

array([[97, 20],
       [13, 49]])

In [80]:
print("Accuracy:", (cfm1[0][0] + cfm1[1][1])/len(y_test))

Accuracy: 0.815642458101


So we can see here that KNN is slightly more accurate than the Logistic regression but not by much. This comparison is ignoring a rather important part of ML which is paramater tuning or optimization of the models.