Exercises based on:

Data Analysis with Python by Marco Bonzanini from Packt

Data source: 
https://www.kaggle.com/c/titanic/data

In [None]:
import pandas as pd 
fname = '~/titanic_data/train.csv'
data = pd.read_csv(fname)

In [None]:
len(data)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.count()

In [None]:
data['Age'].min(), data['Age'].max()

In [None]:
data['Survived'].value_counts()

In [None]:
data['Survived'].value_counts() *100/len(data)

In [None]:
data['Sex'].value_counts()

In [None]:
data['Pclass'].value_counts()

In [None]:
%matplotlib inline
alpha_color=0.5
data['Survived'].value_counts().plot(kind='bar')

In [None]:
data['Sex'].value_counts().plot(kind='bar', color=['b','r'], alpha=alpha_color)

In [None]:
data['Pclass'].value_counts().sort_index().plot(kind='bar', color=['b','r','g'], alpha=alpha_color)

In [None]:
data.plot(kind='scatter', x='Survived', y='Age')

In [None]:
data[data['Survived']==1]['Age'].value_counts().sort_index().plot(kind='bar')

In [None]:
# create intervals
bins = [0,10,20,30,40,50,60,70,80]
data['AgeBin'] = pd.cut(data['Age'], bins)
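
A minimal sketch of how `pd.cut` assigns values to these bins, using a few made-up ages (the `toy_` names and values are illustrative, not taken from the dataset). By default the intervals are open on the left and closed on the right, so an age of exactly 0 would fall outside the first bin and become NaN.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# Made-up ages chosen to sit on or near the bin edges
toy_ages = pd.Series([0.42, 10, 10.5, 80, 0])

# Each value is mapped to the interval (left, right] that contains it
toy_binned = pd.cut(toy_ages, bins)
print(toy_binned)
```

If ages of exactly 0 should be counted, passing `include_lowest=True` to `pd.cut` is one option.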

In [None]:
data[data['Survived']==1]['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
data[data['Survived']==0]['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
data['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
data[data['Pclass']==1]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[data['Pclass']==3]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[data['Sex']=='male']['Survived'].value_counts().plot(kind='bar')

In [None]:
data[data['Sex']=='female']['Survived'].value_counts().plot(kind='bar')

In [None]:
# crossreference gender with class
data[(data['Sex']=='male') & (data['Pclass']==1)]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex']=='male') & (data['Pclass']==3)]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex']=='female') & (data['Pclass']==1)]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex']=='female') & (data['Pclass']==3)]['Survived'].value_counts().plot(kind='bar')

In [None]:
# Machine Learning - supervised learning with scikit-learn

Implementing a classifier with scikit-learn:
- dummy classifier
- random forest classifier - simple and fast to run, good for unbalanced and missing data
- train/test split
- adding more features
- accuracy of our classifier

In [None]:
data.head()

We use the column 'Survived' as our label. That's the variable we want to predict. We use other features to train our model.

Using just one feature 

In [None]:
data['IsFemale'] = (data['Sex'] == 'female') # true/false
samples = data[['IsFemale']] #x
labels = data['Survived'] #y

Train/test split (70% train, 30% test) 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(samples, labels, train_size=0.7, random_state=0)
print("Samples: train={}, test={}".format(len(X_train), len(X_test)))
# random_state=0 fixes the shuffle, so the split is the same on every run
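
A small sketch of what a fixed `random_state` buys us, on a made-up ten-row frame (`toy_X` and `toy_y` are illustrative names, not part of the notebook's data). Splitting twice with the same seed should select exactly the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: ten rows, one boolean feature, one label
toy_X = pd.DataFrame({"IsFemale": [True, False] * 5})
toy_y = pd.Series([1, 0] * 5)

# Two independent calls with the same seed
first = train_test_split(toy_X, toy_y, train_size=0.7, random_state=0)
second = train_test_split(toy_X, toy_y, train_size=0.7, random_state=0)

# Compare which rows ended up in the training part of each split
same_split = first[0].index.equals(second[0].index)
print(len(first[0]), len(first[1]), same_split)
```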

In [None]:
X_train['IsFemale'].value_counts() # majority - male passengers

In [None]:
X_test['IsFemale'].value_counts()

Dummy classifier (most frequent class) - simply assigns the most frequent class from the training set to every sample in the test set.

Why we use it:
- to check that our data pipeline is fully working and producing some results
- to give us a baseline result

In [None]:
from sklearn.dummy import DummyClassifier
clf_dummy = DummyClassifier(strategy="most_frequent")
clf_dummy.fit(X_train, Y_train)
# fit(samples,labels)
Y_predicted = clf_dummy.predict(X_test)
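
To make the "most frequent class" strategy concrete, here is a sketch that reproduces it by hand on a few made-up rows and compares the result with `DummyClassifier`. The `toy_` variables are illustrative only.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up training and test data
toy_X_train = pd.DataFrame({"IsFemale": [False, False, True, False, True]})
toy_y_train = pd.Series([0, 0, 1, 0, 1])
toy_X_test = pd.DataFrame({"IsFemale": [True, False, True]})

# By hand: find the most frequent training label and repeat it
most_frequent = toy_y_train.mode().iloc[0]
manual_prediction = [most_frequent] * len(toy_X_test)

# With scikit-learn: the features are ignored, only the labels matter
toy_dummy = DummyClassifier(strategy="most_frequent")
toy_dummy.fit(toy_X_train, toy_y_train)
sklearn_prediction = list(toy_dummy.predict(toy_X_test))

print(manual_prediction, sklearn_prediction)
```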

Once we have results, we want to know how well we are doing.

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy={}".format(accuracy_score(Y_test,Y_predicted)))
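
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and checks that `accuracy_score` gives the same number.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0])

# By hand: share of positions where prediction equals truth (3 of 5 here)
manual_accuracy = (y_true == y_pred).mean()

# Same quantity from scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

For the dummy classifier this means its accuracy equals the share of test samples that belong to the majority class of the training set.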

Random Forest Classifier - fairly quick, fairly robust

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train,Y_train)
Y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(Y_test,Y_predicted)))

Using more features - IsFemale & Pclass

In [None]:
samples = data[['IsFemale', 'Pclass']]
labels = data['Survived']

X_train, X_test, Y_train, Y_test = train_test_split(samples,labels,train_size=0.7,random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train,Y_train)
Y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(Y_test,Y_predicted)))

Adding Pclass did not bring any improvement; the accuracy is the same.

In [None]:
data['AgeSentinel'] = data['Age'].fillna(-100) # because 'Age' contains a lot of missing data

The sentinel value lies so far outside the range of real ages that the algorithm can treat it differently from the other data.
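
A sketch of the sentinel idea on a made-up four-row frame. The `AgeMissing` column is an assumption added for illustration: an explicit missing-value flag is one possible alternative to a sentinel, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan]})

# Sentinel approach: replace NaN with a value no real age can take
toy["AgeSentinel"] = toy["Age"].fillna(-100)

# Hypothetical alternative: keep an explicit flag for missing ages
toy["AgeMissing"] = toy["Age"].isna()

print(toy)
```

A tree-based model can isolate the sentinel rows with a single split such as `AgeSentinel < 0`, which is why the trick tends to suit random forests better than models that treat age as a continuous quantity.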

In [None]:
features = ['IsFemale', 'Pclass', 'AgeSentinel']
samples = data[features]
labels = data['Survived']

X_train, X_test, Y_train, Y_test = train_test_split(samples, labels, train_size=0.7, random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train,Y_train)
Y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(Y_test, Y_predicted)))

The quality of prediction has gone down. This is typical at the beginning and gives us a better understanding of the dataset. The conclusion is that adding more features does not necessarily give better accuracy.

In [None]:
features = ['IsFemale', 'Pclass', 'AgeSentinel', 'Fare']
samples = data[features]
labels = data['Survived']

X_train, X_test, Y_train, Y_test = train_test_split(samples, labels, train_size=0.7, random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train,Y_train)
Y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(Y_test, Y_predicted)))

Quality has gone up. 

In [None]:
# add family size (siblings, spouses, parents, children)
data['FamilySize'] = data['SibSp'] + data['Parch']
features = ['IsFemale', 'Pclass', 'AgeSentinel', 'Fare', 'FamilySize']
samples = data[features]
labels = data['Survived']

X_train, X_test, Y_train, Y_test = train_test_split(samples, labels, train_size=0.7, random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train,Y_train)
Y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(Y_test, Y_predicted)))

Quality has gone down a bit.
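
The split, fit, predict and score steps have been repeated for every feature set. One way to compare feature sets side by side is to wrap those steps in a helper. The sketch below does this; `evaluate_features` is a hypothetical name, and the `toy` frame is randomly generated stand-in data, not the Titanic dataset, so its scores say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_features(frame, features, label="Survived"):
    """Split 70/30, train a random forest, return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[features], frame[label], train_size=0.7, random_state=0
    )
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


# Randomly generated stand-in data (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "IsFemale": rng.integers(0, 2, n).astype(bool),
    "Pclass": rng.integers(1, 4, n),
    "Fare": rng.uniform(5, 100, n),
})
toy["Survived"] = (toy["IsFemale"] & (toy["Pclass"] < 3)).astype(int)

# Evaluate growing feature sets with identical split and model settings
feature_sets = [
    ["IsFemale"],
    ["IsFemale", "Pclass"],
    ["IsFemale", "Pclass", "Fare"],
]
scores = {", ".join(fs): evaluate_features(toy, fs) for fs in feature_sets}
for name, score in scores.items():
    print("{}: accuracy={}".format(name, score))
```

On the real data the call would look like `evaluate_features(data, ['IsFemale', 'Pclass'])`.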

Feature importance

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.bar(range(len(features)), clf.feature_importances_, tick_label=features)
plt.show()
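
The importances can also be read as numbers rather than bars. The sketch below trains a forest on randomly generated stand-in data (the `toy` frame and its `Noise` column are assumptions, not the Titanic dataset) and lists the importances in descending order. For a fitted forest, `feature_importances_` is normalised so the values sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Randomly generated stand-in data (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "IsFemale": rng.integers(0, 2, n).astype(bool),
    "Fare": rng.uniform(5, 100, n),
    "Noise": rng.normal(size=n),
})
toy["Survived"] = (toy["IsFemale"] | (toy["Fare"] > 60)).astype(int)

toy_features = ["IsFemale", "Fare", "Noise"]
toy_clf = RandomForestClassifier(random_state=0)
toy_clf.fit(toy[toy_features], toy["Survived"])

# Pair each importance with its feature name and sort, largest first
importances = pd.Series(toy_clf.feature_importances_, index=toy_features)
importances = importances.sort_values(ascending=False)
print(importances)
print("sum =", importances.sum())
```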

Adding new features can affect the quality of prediction both positively and negatively. The bar values add up to 1 (100%). What we see from the chart is that the most important features are:
- IsFemale
- AgeSentinel
- Fare 

Passenger class and family size are less important. What we do not see is the relationships between the features. For example, passenger class is not important in absolute terms, but it becomes a strong indicator in conjunction with gender.
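
One way to look at how two features interact is to compute the survival rate for every combination of their values. The sketch below does this with `groupby` on a made-up eight-row frame, so the rates are illustrative only and do not come from the Titanic data.

```python
import pandas as pd

# Made-up passengers (assumption, for illustration only)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the survival rate; unstack puts classes in columns
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rates)
```

On the real data the same expression applied to `data` would give one table in place of the four separate bar charts for gender and class.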