## Keypoints:

* a collection of useful code snippets that you can reuse in other tasks
* examples of how to monitor the performance of an ML model working on imperfect data
* examples of designing an ML model for classification using "Decision Trees" and "Random Forests"

In [None]:
# import the packages we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# get titanic passenger list csv file as a DataFrame
titanic_df = pd.read_csv("data/titanic_train.csv")

In [None]:
# preview the data
titanic_df.head()

In [None]:
# a shortcut to get the null values
titanic_df.isnull().sum()

In [None]:
# Drop unnecessary columns; these columns won't be useful in analysis and prediction.
# i.e. we are arguing that the passenger ID number, Name, Ticket code and "Cabin" have no power to predict survival. That
# may or may not be true; it would be interesting to explore, especially "Cabin", but unfortunately it has many missing values.

# axis=1 tells .drop to look at the columns

titanic_df = titanic_df.drop(['PassengerId','Name','Ticket', 'Cabin'], axis=1)

In [None]:
# How many passengers embarked where? S = Southampton, C=Cherbourg, Q=Queenstown
titanic_df["Embarked"].value_counts()

* S (Southampton) is the most frequent port of embarkation, so we will use it to fill the Null values (there are only 2). This is quite arbitrary, but it shows one approach to dealing with missing data. You can probably think of others...
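Rather than reading the most frequent value off the counts by eye, it can be looked up programmatically. The following is a minimal sketch on a made-up toy Series (the values are illustrative, not the real Titanic column):

```python
import pandas as pd

# Toy stand-in for the "Embarked" column (illustrative values only)
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None])

# mode() ignores missing values and returns the most frequent value(s);
# [0] picks the first one
most_common = embarked.mode()[0]

# fill the missing entries with that value
filled = embarked.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

If several values tie for most frequent, `mode()` returns all of them, so taking the first one is itself an arbitrary choice.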

In [None]:
# in titanic_df, fill the two missing values with the most frequent value, which is "S".
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")

In [None]:
# check it's worked
titanic_df["Embarked"].value_counts()


* We will turn the categorical data on embarkation (S, C, Q) into columns of 0s and 1s using the get_dummies method. 

* Assigning numerical values to non-numeric data is usually necessary: scikit-learn expects numeric values for anything we want to use as a feature.
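To see what `get_dummies` produces before applying it to the real data, here is a small sketch on a toy column (the four rows are made up for illustration). Passing `dtype=int` is an assumption about what you want to see: recent pandas versions return True/False columns by default, and `dtype=int` forces 0s and 1s.

```python
import pandas as pd

# Toy column with the three embarkation codes (illustrative rows only)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# one new column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy["Embarked"], dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column that matches its original category.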

In [None]:
# make a new dummy dataframe with columns of 0s and 1s depending on embarkation port
embark_dummies_titanic  = pd.get_dummies(titanic_df['Embarked'])

In [None]:
# show what this has done
embark_dummies_titanic.head()

In [None]:
# this adds the three new columns in embark_dummies_titanic to the original dataframe
titanic_df = titanic_df.join(embark_dummies_titanic)

In [None]:
# show what we have so far
titanic_df.head()

In [None]:
# and because the idea was to replace the Embarked column, we now drop that from the dataframe
titanic_df = titanic_df.drop(['Embarked'], axis=1)

Next we look at the ages. Where the age is missing (177 values), we replace it with a random age drawn from a distribution that resembles the known ages. Again, this is arbitrary - you can think of other methods...
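As one example of another method (not the one used in this notebook), missing ages could be filled with the median age of passengers in the same group. The sketch below uses a made-up toy frame and assumes grouping by `Pclass` is sensible; that choice is purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): two passenger classes, two missing ages
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# median age within each class, aligned row-by-row with the original frame
group_median = toy.groupby("Pclass")["Age"].transform("median")

# only the missing entries are replaced
toy["Age_filled"] = toy["Age"].fillna(group_median)

print(toy)
```

Unlike the random approach, this is deterministic, but it narrows the spread of the age distribution because every filled value in a group is identical.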

In [None]:
# plot a histogram of the ages with seaborn. This will ignore the missing values.
sns.histplot(titanic_df["Age"])

In [None]:
# Some basic statistical methods in pandas on the "Age" column
# get average, std, and number of NaN values in titanic_df["Age"]
average_age_titanic   = titanic_df["Age"].mean()
std_age_titanic       = titanic_df["Age"].std()
count_nan_age_titanic = titanic_df["Age"].isnull().sum()
print("ages: mean, stdev, number_nan", average_age_titanic, std_age_titanic, count_nan_age_titanic)

# Assign each missing age a random number drawn from a normal distribution with the same mean and
# standard deviation as the known ages. (A normal distribution can occasionally produce negative ages.)
random_floats = np.random.normal(loc=average_age_titanic, scale=std_age_titanic, size=count_nan_age_titanic)
# round them to 1 decimal place
random_ages = np.round(random_floats, 1)

In [None]:
# we should have an array of 177 random ages
random_ages.shape

In [None]:
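# fill NaN values in the Age column with the random values generated above
# (.loc avoids chained assignment, which may not update titanic_df in newer pandas versions)
titanic_df.loc[titanic_df["Age"].isnull(), "Age"] = random_ages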

In [None]:
titanic_df["Age"].shape

In [None]:
# plot a histogram of the ages - should look slightly different to the previous example
sns.histplot(titanic_df["Age"])

In [None]:
# what have we got so far
titanic_df.info()

In [None]:
# Now, for fun, create a couple of subsets for those that died or survived
survived = titanic_df[ titanic_df["Survived"] == 1 ]
died = titanic_df[titanic_df["Survived"] == 0]

In [None]:
# use some pandas dataframe methods to plot semi-transparent age histograms for both subsets
survived["Age"].plot.hist(alpha=0.3,color='red',bins=50).set_xlabel("Age")
died["Age"].plot.hist(alpha=0.3,color='blue',bins=50).set_xlabel("Age")
plt.legend(['Survived','Died'])
plt.show()

In [None]:
# For more fun we'll examine whether survival correlates with the continuously distributed
# variables "Age" and "Fare" using a seaborn pairplot

pairplot_df = titanic_df[["Survived", "Age","Fare"]]
sns.pairplot(pairplot_df,hue='Survived',palette='Set1')

* Now we are going to turn other columns into numbers. First, create a new column called Person.

In [None]:
# As we see, children (age < ~16) on board seem to have a high chance of survival.
# So, maybe we could classify passengers as male, female or child.
# This is an example of writing a little function that can be applied to a dataframe
def get_person(passenger):
    # passenger is going to contain both the age and sex of the passenger
    age,sex = passenger
    if age < 16:
        return 'child'  
    else:
        return sex
    
titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)

In [None]:
# look at the first 10 rows to see what's been created
titanic_df.head(10)

* and now the dummies variable for the person categorical data

In [None]:
# create dummy variables for Person column
person_dummies_titanic  = pd.get_dummies(titanic_df['Person'])

titanic_df = titanic_df.join(person_dummies_titanic)

titanic_df.head()

In [None]:
# No need to use Sex or Person columns once we have our numerical data
titanic_df = titanic_df.drop(['Sex', 'Person'], axis=1)
titanic_df.head()

### A Decision Tree Classifier
Having prepared a dataset and got rid of or replaced missing data, we move to the "Machine Learning" part.
We're going to see how well we can predict the survival of passengers based on the data above. The "label" is the "Survived" value (1 = survived), the other columns are the "features".

In [None]:
# define training and testing sets
X = titanic_df.drop("Survived",axis=1) # these are the features (in general, a matrix)
y = titanic_df["Survived"]             # these are the labels   (in general, a vector)

In [None]:
# now use a method from sklearn that will give us a random split between a training set and a test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30) # we choose 30%
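Because the split is random, the scores later in the notebook will change from run to run. If you want reproducible results, `train_test_split` accepts a `random_state`, and `stratify` keeps the class proportions similar in both sets. A minimal sketch on made-up toy arrays (the seed value 0 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features, 30% positive labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 14 + [1] * 6)

# the same random_state gives the same split every time;
# stratify=y_toy keeps the share of 0s and 1s similar in train and test
split_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print("identical splits:", all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```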

### Let's build the model in scikit-learn
In this case, a decision tree classifier: https://scikit-learn.org/stable/modules/tree.html#tree

In [None]:
# import the model class from sklearn 
from sklearn.tree import DecisionTreeClassifier

In [None]:
# create the model
dtree = DecisionTreeClassifier()

In [None]:
# fit the data to the model - this IS the training step
dtree.fit(X_train,y_train)

### Prediction and Evaluation 


* a prediction for a new data point is made by finding which region of the feature-space partition the point lies in, then predicting the majority target in that region (or the single target, if the leaf is pure)
* it is also possible to use decision trees for regression tasks. We still find the region in which the new data point lies, but this time we predict the mean target value of the training points in that leaf.

* store the predictions in a variable called predictions
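The two prediction rules described above can be checked on a tiny example. This is a sketch on made-up one-feature data, with `max_depth=1` so each tree makes a single split; the numbers are chosen only to make the leaves easy to inspect.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (illustrative only): one feature, six training points
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_class = np.array([0, 0, 0, 1, 1, 0])                # class labels
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])  # regression targets

# a single split each, so there are exactly two leaves per tree
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_class)
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_value)

# classification: the right-hand leaf holds labels [1, 1, 0], so the majority is 1,
# even for x=6 whose own training label was 0
print(clf.predict([[6.0]]), clf.predict_proba([[6.0]]))

# regression: the right-hand leaf holds targets [10, 11, 12], so the prediction is their mean
print(reg.predict([[5.0]]))
```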

In [None]:
# check how we're doing by making predictions of the labels using the features in the test set and then comparing them with
# the actual labels in the test set
predictions = dtree.predict(X_test)

In [None]:
# use a couple of standard functions from the sklearn library
from sklearn.metrics import classification_report,confusion_matrix

print (classification_report(y_test,predictions))

The way to read this is:

Precision: The ability of a classification model to identify only the relevant data points. Mathematically, precision is the number of true positives divided by the number of true positives plus the number of false positives.

Recall: The ability of a model to find all the relevant cases within a data set. Mathematically, we define recall as the number of true positives divided by the number of true positives plus the number of false negatives.

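The f1-score is the harmonic mean of precision and recall: 2 × (precision × recall) / (precision + recall). Like precision and recall, it can be calculated from the "confusion matrix" - see below and https://builtin.com/data-science/precision-and-recall

Support is the number of samples that actually belong to each category.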

macro avg is an unweighted mean of the scores, while weighted avg weights by the numbers in each category (support).
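The definitions above can be verified by hand. The sketch below uses made-up labels and predictions (not the Titanic results), counts the true positives, false positives and false negatives directly, and compares the resulting scores with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels and predictions (illustrative only); class 1 is the "positive" class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly predicted positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives wrongly predicted as positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # positives that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("by hand:", precision, recall, f1)
print("sklearn:", precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

The classification report shows these scores once per class, treating each class in turn as the positive one.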

In [None]:
print (confusion_matrix(y_test,predictions))

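# scikit-learn puts the true classes in rows and the predicted classes in columns, both in label order (0, 1).
# Taking class 1 (survived) as the "positive" class, the 2x2 confusion matrix gives the number of
# true negatives (0s classified as 0s) in the top left and true positives (1s classified as 1s) in the bottom right;
# bottom left is the false negatives (1s that were classified as 0s) and top right is the false positives
# (0s that were classified as 1s)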
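The layout of the matrix is easy to mix up, so here is a small sketch on made-up labels that unpacks the four cells by name. For a two-class problem, `ravel()` flattens the matrix row by row, which gives the order shown.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (illustrative only)
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# rows are true classes, columns are predicted classes, so flattening gives:
# top left, top right, bottom left, bottom right
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```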

### Controlling complexity

There is always a danger of underfitting or overfitting the model in any Machine Learning process. There are tools to examine whether that is happening.

In [None]:
dtree.score(X_train, y_train)

In [None]:
dtree.score(X_test, y_test) 

* We see that the accuracy is very high in training: we let the tree grow until it finds "pure leaves".
* However, it performs much worse on the test data. This suggests overfitting.
* We can try to avoid overfitting by restricting the depth of the decision tree. In scikit-learn we can apply 'pre-pruning', which stops growing the tree before it perfectly fits the training data.
* The aim is a model that generalises better to unseen data.


### Create a second model

* This technique is called pruning.
By default, the tree keeps making split decisions until every leaf is pure, which in the extreme case means as many leaves as training data points - almost perfect fitting. We can stop this by limiting the depth of the tree, i.e. the number of successive branching decisions allowed along any path.

In [None]:
dtree2 = DecisionTreeClassifier(max_depth= 2)
dtree2.fit(X_train, y_train)

dtree2.score(X_train,y_train)

In [None]:
dtree2.score(X_test,y_test)

The scores above should be much more similar, suggesting that overfitting has not occurred. Of course we may also have lost some accuracy in our predictions...
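To see the trade-off more systematically, one could compare training and test accuracy over several depths. The sketch below does this on synthetic data from `make_classification` (an assumption made so the example runs on its own; the sample sizes, noise level and depth values are arbitrary), not on the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (illustrative only)
X_syn, y_syn = make_classification(n_samples=600, n_features=10, n_informative=4,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

# max_depth=None means "no limit", i.e. grow until the leaves are pure
scores = {}
for depth in [1, 2, 4, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("max_depth =", depth, " train/test accuracy =", scores[depth])
```

A large gap between the two numbers at a given depth is the sign of overfitting discussed above; two low, similar numbers suggest underfitting.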

In [None]:
predictionsDtree = dtree2.predict(X_test)
print (classification_report(y_test,predictionsDtree))

### Feature importance

* feature importance rates how important each feature is for the decisions a tree makes
* it is a number from 0 to 1 for each feature, where 0 means the feature is not used at all; the importances sum to 1
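Both properties can be checked directly. This sketch fits a shallow tree to synthetic data (the data and the feature names `f0` to `f4` are hypothetical, chosen only for illustration) and pairs each importance with its feature name in a pandas Series.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=0)
features = pd.DataFrame(X_syn, columns=["f0", "f1", "f2", "f3", "f4"])

# a depth-2 tree makes at most three splits, so it can use at most three features
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(features, y_syn)

# label each importance with its feature name and sort, largest first
importances = pd.Series(model.feature_importances_, index=features.columns)
importances = importances.sort_values(ascending=False)

print(importances)
print("sum:", importances.sum())
```

Features the tree never splits on show up with an importance of exactly 0.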

In [None]:
# attribute that holds the feature importances
dtree.feature_importances_


In [None]:
import sklearn.tree as tree

In [None]:
# use the feature labels from the dataframe as the labels on a bar chart
# (columns[1:] skips "Survived", the first column, leaving the feature names in the same order as X)
titanic_df.columns[1:]

In [None]:
n_features = X.shape[1]
plt.barh(range(n_features),dtree.feature_importances_)
plt.yticks(np.arange(n_features),titanic_df.columns[1:])

In [None]:
# tree.plot_tree(dtree) # This would be very slow because the tree is very complex for model 1.

In [None]:
# plot the feature importance for dtree2

n_features = X.shape[1]
plt.barh(range(n_features),dtree2.feature_importances_)
plt.yticks(np.arange(n_features),titanic_df.columns[1:])

In [None]:
# This shows the limited depth of the tree. The x[i] labels in the plot refer to the i-th feature column,
# in the same order as the columns of X. The tree uses only the features with non-zero importance in the plot above.

tree.plot_tree(dtree2)

### Feature scaling

* the algorithm is invariant to scaling of the data.
* each feature is processed separately, and the possible splits of the data don't depend on scaling, so no preprocessing such as normalisation or standardisation of features is needed for decision tree algorithms. Decision trees therefore work well when you have features on completely different scales, or a mix of binary and continuous values (but you might want to scale the data for visualisation purposes).
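The invariance can be demonstrated by fitting the same tree to raw and rescaled copies of a data set. The sketch below uses synthetic data and arbitrary per-column scale factors (powers of two, an assumption made so the rescaling is exact in floating point).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)

# put every column on a very different scale
factors = np.array([1.0, 1024.0, 4.0, 8.0, 64.0])
X_scaled = X_syn * factors

raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)
scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_syn)

# the trees split on the same features and make the same predictions;
# only the numerical thresholds differ
same_predictions = np.array_equal(raw.predict(X_syn), scaled.predict(X_scaled))
same_features = np.array_equal(raw.tree_.feature, scaled.tree_.feature)
print(same_predictions, same_features)
```

The same experiment with a distance-based model such as k-nearest neighbours would generally give different predictions after rescaling.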

## Random Forests

* Random Forest is one of the most common ensemble methods. It consists of a collection of Decision Trees.
* We repeatedly draw samples from the data set (with replacement) and build a Decision Tree from each new sample.
* It is important to note that since we are sampling with replacement, many data points will be repeated in a given sample and many won’t be included at all.
* A further feature of Random Forest is that each node of a Decision Tree is limited to considering splits on a random subset of the features.

#### How it works

* In the case of classification with Random Forests, we use each tree in our forest to get a prediction, then the label with the most votes becomes the predicted class for that data point.
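The ingredients described above (bootstrap samples, random feature subsets, majority vote) can be put together by hand. This is a simplified sketch on synthetic data, not scikit-learn's implementation: `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes. The number of trees (11, odd to avoid ties) and the depth are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
n_samples = X_syn.shape[0]

trees = []
for i in range(11):
    # bootstrap sample: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features="sqrt" makes each split consider a random subset of the features
    t = DecisionTreeClassifier(max_depth=2, max_features="sqrt", random_state=i)
    trees.append(t.fit(X_syn[idx], y_syn[idx]))

# fraction of distinct rows in the last bootstrap sample (roughly 63% on average)
unique_fraction = len(np.unique(idx)) / n_samples

# one row of 0/1 predictions per tree, then a majority vote per data point
votes = np.array([t.predict(X_syn) for t in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_syn).mean()

print("unique fraction:", unique_fraction)
print("ensemble accuracy on the training data:", accuracy)
```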

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=2) # 10 trees in the forest, to a maximum tree depth of 2
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)

confusion_matrix(y_test,rfc_pred)

In [None]:
print (classification_report(y_test,rfc_pred))

In [None]:
rfc.feature_importances_

In [None]:
n_features = X.shape[1]
plt.barh(range(n_features),rfc.feature_importances_)
plt.yticks(np.arange(n_features),titanic_df.columns[1:])

# YOUR TURN


* try changing the number of estimators in the Random Forest and see whether that affects the feature importances