In [None]:
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd

In [None]:
titanic = pd.read_csv('titanic.csv')
print(titanic.shape)
titanic.head()

The `Titanic` data set collects information about almost 900 passengers aboard the Titanic during the fateful voyage when it crashed into an iceberg in 1912 and sank. The information includes their age; the fare they paid for their ticket (in British pounds); their sex; and the passenger class `Pclass`, with 1st class corresponding to VIP treatment and 3rd class corresponding to a much less luxurious experience. Crucially, the data set also records whether each passenger survived the sinking of the ship in the `Survived` column, with `1` indicating that the passenger survived and `0` indicating that the passenger tragically perished.

We are eventually going to train an algorithm to predict whether a passenger survived the Titanic based on their available information. Before we do, let's get a sense for some trends using familiar pandas summarization.

# Exploratory analysis

How wealthy were these passengers? We can't know for certain, but we can get a sense for how much was paid for each passenger class.

In [None]:
titanic.groupby('Pclass').mean(numeric_only=True)

# group everyone by passenger class, and calculate mean 

- The average price of 84 pounds for a first-class ticket corresponds to nearly \$15,000 USD today.
- The second-class ticket corresponds to roughly \$3,500.
- The third-class ticket corresponds to roughly \$2,500.

We can safely assume that the first-class passengers were indeed substantially more wealthy on average than the others.

Did Pclass have an effect on survival rate?

This difference in wealth made a considerable difference in how likely passengers were to survive.

In [None]:
# 'save the women first' (and maybe children, TBD)

titanic.groupby(['Pclass', 'Sex']).mean(numeric_only=True)

This table reflects the famous maritime tradition of prioritizing women and children first into the lifeboats, resulting in vastly higher survival rates among women in these data. Note the role of class: a 1st-class woman was twice as likely to survive as a 3rd-class woman, and a 1st-class man was nearly three times as likely to survive as a 3rd-class man.
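
To make comparisons like "twice as likely" explicit, one option is to keep only the `Survived` column, pivot `Sex` into columns, and divide the rows. The sketch below uses a small made-up frame (not the real Titanic file) whose column names mirror the ones in this notebook, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy data, NOT the real Titanic file; only the column names match.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate.
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
print(rates)

# Ratio of 1st-class to 3rd-class survival rates, separately for each sex.
ratio = rates.loc[1] / rates.loc[3]
print(ratio)
```

Swapping `toy` for `titanic` should give the corresponding ratios for the real data, assuming the columns are named as in the table above.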

In [None]:
# Characteristics of who survived and who didn't

titanic.groupby('Survived').mean(numeric_only=True)

In [None]:
titanic.groupby('Survived').Sex.value_counts()

# Our first ML algorithm (classification)

## Preparing data

We'd like to develop automated models that can use these trends and others to make predictions about survival. However, we need to do a bit of data cleaning before we're ready for this. In particular, machine learning algorithms generally can't work with text directly, so we need to transform text data into numbers before we can proceed. We'll do this for the `Sex` column.

We also don't really have any use for the actual names of passengers, so let's start by removing that column.

In [None]:
titanic = titanic.drop(columns=['Name']) # drop the column 'Name'
titanic.head()

In [None]:
# Turn Sex column into numbers
is_F = (titanic['Sex'] == 'female') # array of True and False
titanic['Sex'] = is_F.astype(int) # 1 = female, 0 = male
titanic.head()
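
The comparison-and-cast approach above is one way to encode a two-valued text column. As a hedged alternative, an explicit dictionary passed to `Series.map` spells out the encoding and turns any unexpected label into `NaN` rather than silently into `0`. The sketch below runs on a made-up column, not the real data.

```python
import pandas as pd

# Toy column standing in for titanic['Sex'] (values are made up).
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Approach used in the notebook: True/False cast to 1/0.
by_comparison = (toy["Sex"] == "female").astype(int)

# Alternative: explicit mapping; labels missing from the dict become NaN.
by_map = toy["Sex"].map({"male": 0, "female": 1})

print(by_comparison.tolist())  # [0, 1, 1, 0]
print(by_map.tolist())         # [0, 1, 1, 0]
```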

Now in an ideal world, we would find the best model or classifier using all the data we have (train), then go find future or unseen data (test), then try making predictions on the new test data. In this case, that's not only impossible but also a weird concept (what's the point of making really good predictions on an event that happened a century ago?). But we do need to find a way to check if our model is actually good at predicting.

In cases like this (which are very common), people **randomly split their samples into two groups, and reserve the samples in the test set for evaluating the model**.

Think of it as a professor reserving some questions in the question bank for the actual test (test) and releasing the rest as practice questions (train). Why would giving all the questions ahead of time be an inaccurate way to evaluate a student's understanding (model)?
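
For reference, scikit-learn also ships a helper, `train_test_split`, that performs this kind of random split in one call. The sketch below is a minimal illustration on a made-up frame (the `toy` data and the `random_state=0` seed are assumptions for the example); it also checks that no row lands in both groups.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `titanic` (values are made up).
toy = pd.DataFrame({"Age": range(20, 30), "Survived": [0, 1] * 5})

# Hold out 20% of the rows; random_state fixes the shuffle so reruns agree.
train, test = train_test_split(toy, test_size=0.2, random_state=0)

print(train.shape, test.shape)

# The two groups share no rows: every "question" is either practice or exam.
print(set(train.index) & set(test.index))
```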

In [None]:
train = titanic.sample(frac=0.8) # 80% rows for training
test = titanic.drop(index=train.index) # rest of rows for testing
print(train.shape, test.shape)

The next thing to do is to separate out the target data `Survived` from the predictor data (everything else).

In [None]:
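y_train = train['Survived'] # target: what we want to predict
X_train = train.drop(columns=['Survived']) # predictors: everything else
print(X_train.shape, y_train.shape)

y_test = test['Survived']
X_test = test.drop(columns=['Survived'])
print(X_test.shape, y_test.shape)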

In [None]:
# this is from last lecture demo
# y = titanic['Survived']
# X = titanic.drop(columns=['Survived'])
# print(X.shape, y.shape)

## Training or fitting a model 

To use a machine learning model from `scikit-learn`, you should import the relevant model. For example, let's use a decision tree classifier.

Arguments passed to the model upon instantiation (such as `max_depth=2`) are typically used to control how complex the model can be. These arguments are often referred to as hyperparameters. In practice, we don't usually know what the right hyperparameters are, and so we need to resort to various computational techniques (in coming lectures) to select good ones.
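
To get a feel for how a hyperparameter controls complexity, the sketch below fits trees of several depths and prints training and test accuracy for each. It uses synthetic data from `make_classification`, unrelated to the Titanic set, and the particular depths and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# Synthetic classification data (an assumption for this example).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

train_scores = {}
test_scores = {}
for depth in [1, 2, 5, None]:  # None lets the tree grow until leaves are pure
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    train_scores[depth] = clf.score(X_tr, y_tr)
    test_scores[depth] = clf.score(X_te, y_te)
    print(depth, clf.get_depth(), train_scores[depth], test_scores[depth])
```

An unrestricted tree can fit its training data perfectly, which says little about how it does on held-out data; comparing the two printed columns is the point of the exercise.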

In [None]:
from sklearn import tree 
# sklearn is a package called scikit-learn 
# that contains a lot of useful ML algorithms

T = tree.DecisionTreeClassifier(max_depth=2) 

T.fit(X_train, y_train) # find the best f

X -> [model] -> f(X)

What does the score function calculate? It depends on the model, but in the case of classifiers, the score is the fraction of the time that the model made the correct prediction (the accuracy). The score does the same job as the loss function, but the signs are flipped -- high scores mean good models.
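
As a quick check of that definition, the sketch below computes the fraction of correct predictions by hand and compares it with `score`. The data is synthetic (an assumption for the example), and the model is scored on its own training data only so the two computations can be compared directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn import tree

# Synthetic data, unrelated to the Titanic set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Accuracy by hand: fraction of rows where the prediction matches the label.
manual = np.mean(clf.predict(X) == y)

print(manual, clf.score(X, y))
```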

In [None]:
T.score(X_train, y_train) 


In [None]:
T.score(X_test, y_test) 


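Our model was able to use the predictor variables to be right nearly 80% of the time. That's pretty impressive, but be careful about which score you trust. When doing machine learning, it's not advised to evaluate your model on the same data used for training or fitting it, because a model can simply memorize its training data. This is why we held out a test set: the test score is the more honest estimate of how the model does on passengers it has not seen. We'll come back to this in the next lecture.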

In most cases, you should aim to understand how your model makes the decisions that it does. This is the problem of machine learning interpretation.

Every machine learning algorithm has both strengths and weaknesses when it comes to interpretation. Decision trees are pretty pleasant to interpret, as they correspond to "flow-chart" style reasoning that many of us are familiar with. The `tree` module of `scikit-learn` provides a convenient method for visualizing decision trees.

In [None]:
fig, ax = plt.subplots(1, figsize = (10, 10))
p = tree.plot_tree(T, 
                   filled=True, 
                   feature_names=X_train.columns)

The automated decision tree classifier has found the following rule to predict whether a passenger survived the Titanic. First, check whether `Sex <= 0.5` -- that is, check whether the passenger is male (0) or female (1).

- If the passenger is male, then check how old they are. If they are young, the algorithm gives them a fair chance of survival (remember: "women and children."). Otherwise, the algorithm gives them very low odds indeed.

- If the passenger is female, next check whether the passenger was first- or second-class. If so, then they survive with high probability; otherwise the algorithm isn't sure and gives them a roughly 50-50 chance.
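
Rules like these can also be read off as plain text using `tree.export_text`, which is handy when the plot is hard to read. The sketch below fits a tree on a tiny made-up table (not the real data) with columns named like ours, so the printed thresholds are illustrative only.

```python
import pandas as pd
from sklearn import tree

# Toy predictors and labels, made up for this example.
X = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 3],
})
y = [0, 0, 0, 0, 1, 1, 1, 0]

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One indented line per split or leaf, using the column names as labels.
rules = tree.export_text(clf, feature_names=list(X.columns))
print(rules)
```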

We've now gone through one cycle of:

1. Acquiring data.
2. Exploratory analysis.
3. Modeling.
4. Interpreting results.

In practice, we would not stop here. Having trained our model, we should then ask:

- What new insights can we gain from our model about the underlying data set?
- Can we improve our model?
- How can we use what we have learned in other data sets?

These questions, and many others, are part of the cycle of data science!

# Regression example (if we have time)

In [None]:
# controls random number generation
# always get the same data
np.random.seed(1234) 

# true model is linear with a = 1 and b = 1
a = 1
b = 1

n_points = 100

X = np.random.rand(n_points)
Y = a*X + b + 0.2*np.random.randn(n_points) # final term is random noise

In [None]:
fig, ax = plt.subplots(1)

ax.plot([0,1], [1, 2], color = "black", label = "true model")
ax.plot([0,1], [1.2, 2.2], color = "black", linestyle=':', label = "another model 1")
ax.plot([0,1], [0.8, 2.2], color = "black", linestyle='-.', label = "another model 2")
ax.scatter(X, Y, label = "data")
ax.set(xlabel='X', ylabel='Y')
plt.legend()
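
To compare the candidate lines numerically rather than by eye, one option is to compute each line's mean squared error on the data, then compare against a least-squares fit. The sketch below regenerates the same data with the same seed; the `mse` helper and the use of `np.polyfit` are choices made for this illustration, and the slopes and intercepts are read off the plotted endpoints at x = 0 and x = 1.

```python
import numpy as np

# Regenerate the data exactly as in the cell above.
np.random.seed(1234)
a, b = 1, 1
n_points = 100
X = np.random.rand(n_points)
Y = a*X + b + 0.2*np.random.randn(n_points)

def mse(slope, intercept):
    """Mean squared error of the line y = slope*x + intercept on (X, Y)."""
    return np.mean((Y - (slope*X + intercept))**2)

# (slope, intercept) of each plotted line.
candidates = {
    "true model": (1.0, 1.0),
    "another model 1": (1.0, 1.2),
    "another model 2": (1.4, 0.8),
}
for name, (s, i) in candidates.items():
    print(name, mse(s, i))

# Least-squares line: polyfit returns coefficients from highest degree down.
slope_hat, intercept_hat = np.polyfit(X, Y, deg=1)
print(slope_hat, intercept_hat, mse(slope_hat, intercept_hat))
```

By construction, the least-squares line has the smallest mean squared error on this sample of any straight line, so it can even edge out the true model on the data it was fit to.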