# Machine Learning Basics
## with sklearn

In [None]:
# install scikit-learn into the environment of the running kernel
%pip install scikit-learn

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## Machine Learning
A learning problem considers a data sample and then tries to predict properties of previously unknown data.

The three main categories of learning problems are:
- **Supervised learning**: the data has an additional attribute ("label") that we want to predict $\rightarrow$ classification or regression
- **Unsupervised learning**: the data has no target value; we want to identify groups of similar samples $\rightarrow$ clustering
- **Reinforcement learning**: no fixed data set is given; a system is trained by interacting with an environment and learning from the corresponding reactions
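
To make the difference between the first two categories concrete, here is a minimal sketch on made-up toy data. The data values and the names `X`, `y`, `clf_toy` and `km` are assumptions for illustration only; they are not part of the Titanic project below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# toy data (made up): two well-separated groups of 1-D points
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, only used in the supervised case

# supervised: learn the mapping X -> y, then predict labels for new points
clf_toy = DecisionTreeClassifier(random_state=0).fit(X, y)
pred_toy = clf_toy.predict([[0.9], [5.1]])

# unsupervised: no labels are given, the algorithm groups similar samples itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print(pred_toy)
print(clusters)
```

Note that the cluster numbers returned by `KMeans` are arbitrary: only the grouping is meaningful, not which group is called 0 or 1.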

![Data Science Lifecycle](https://ajgoldsteindotcom.files.wordpress.com/2017/11/ds-deconstructed.jpg?w=740)

## A Sample ML Project

![Titanic Image](http://oliviak.blob.core.windows.net/blog/ML%20series/Titanic%20Sinking.jpg)

## Data Dictionary
| Variable |                 Definition                 |                       Key                      |
|:--------:|:------------------------------------------:|:----------------------------------------------:|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |  fractional if less than 1                     |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |  definition of family on board                 |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |


For more information, please refer to the dataset description on [kaggle.com](https://www.kaggle.com/c/titanic/data).


### 01. Frame the Problem
We are given information about the passengers of the Titanic (e.g. gender, age, ticket category) and want to predict who survived the tragedy.

### 02. Load Data

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/ADSLab-Salzburg/DataAnalysiswithPython/main/data/titanic.csv')
df.head()

### 03. Process the Data
We need to understand what the columns mean and, where necessary, clean the data.

#### Is the PassengerID consecutive?

In [None]:
df.tail() # ID and pandas index agree...
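
Looking at the last rows is only a visual check. A programmatic check could look like the following sketch, shown on a made-up toy frame; the column name `PassengerId` is an assumption about how the ID column is spelled in the CSV file.

```python
import pandas as pd

# toy frame (made up); the column name 'PassengerId' is an assumption
toy = pd.DataFrame({'PassengerId': [1, 2, 3, 4]})
ids = toy['PassengerId']

# unique and increasing in steps of exactly 1 -> consecutive
is_consecutive = ids.is_unique and (ids.diff().dropna() == 1).all()

# counter-example with a gap between 2 and 4
has_no_gap = (pd.Series([1, 2, 4]).diff().dropna() == 1).all()

print(is_consecutive, has_no_gap)
```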

#### Do we know the survival status for each passenger?

In [None]:
# isna() returns a boolean Series -> a sum > 0 would mean we have NaNs
df['Survived'].isna().sum()

#### How many passenger classes are there?

In [None]:
df['Pclass'].value_counts()

#### Do we know the name, gender and age of each passenger?

Check if name column contains NaNs.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### How to handle NaN values?
- remove row
- remove feature (=column)
- replace
  - mean
  - median
  - zero (min)
  - max
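
The options listed above can be sketched with pandas as follows. The toy frame and its values are made up for illustration; which strategy is appropriate depends on the feature and on how many values are missing.

```python
import numpy as np
import pandas as pd

# toy frame (made up) with missing values in two of three columns
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 30.0, 70.0],
    'Fare': [7.5, 8.0, np.nan, 30.0],
    'Pclass': [3, 3, 2, 1],
})

dropped_rows = toy.dropna()          # remove every row that contains a NaN
dropped_cols = toy.dropna(axis=1)    # remove every column that contains a NaN

# replace NaNs in a single column by a summary statistic
filled_mean = toy['Age'].fillna(toy['Age'].mean())
filled_median = toy['Age'].fillna(toy['Age'].median())
filled_zero = toy['Age'].fillna(0)
filled_max = toy['Age'].fillna(toy['Age'].max())

print(dropped_rows.shape, list(dropped_cols.columns))
print(filled_mean.tolist(), filled_median.tolist())
```

The mean is sensitive to outliers (here the age of 70), while the median is not, so the two replacements can differ noticeably.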

In [None]:
# inspect Age before replacing NaNs (the count excludes NaNs)
df['Age'].describe()

In [None]:
df.loc[df['Age'].isna(), 'Age'] = df['Age'].mean()
df['Age'].describe()

#### What is the age distribution of the passengers?

In [None]:
import matplotlib.pyplot as plt
# use pandas plots for simplicity
df.boxplot(column='Age')
df.hist(column='Age')
plt.show()

#### How many passengers had siblings/spouses on board the Titanic?

In [None]:
# SibSp gives the number of siblings/spouses per passenger;
# sum() adds up these counts, which is not the number of passengers
df['SibSp'].sum()

In [None]:
# count the passengers with at least one sibling/spouse (non-zero -> True)
df['SibSp'].astype(bool).sum()
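
The difference between the two cells above is easy to see on a made-up toy column (the values are an assumption for illustration):

```python
import pandas as pd

# toy column (made up): siblings/spouses per passenger
sibsp_toy = pd.Series([0, 1, 3, 0])

total_relatives = sibsp_toy.sum()                     # adds up the counts: 0 + 1 + 3 + 0
passengers_with_any = sibsp_toy.astype(bool).sum()    # counts the non-zero entries
same_result = (sibsp_toy > 0).sum()                   # equivalent, more explicit spelling

print(total_relatives, passengers_with_any, same_result)
```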

#### How many passengers had parents/children on board the Titanic?

In [None]:
# Parch gives the number of parents/children
df['Parch'].astype(bool).sum()

#### How much did the passengers pay for their ticket on average?

In [None]:
df['Fare'].describe()

In [None]:
df['Fare'].mean()

In [None]:
df.boxplot(column='Fare')
plt.show()

#### Where did the passengers embark?

In [None]:
df['Embarked'].value_counts()
# S = Southampton
# C = Cherbourg
# Q = Queenstown

#### Which features (=columns) are categorical, numerical etc.?

In [None]:
df.columns.values

### 04. Exploratory Data Analysis
How does survival differ between women and men, and how does it depend on age?

In [None]:
import seaborn as sns
# one subplot per sex
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

women = df[df['Sex'] == 'female']
men = df[df['Sex'] == 'male']

# plot women (histplot replaces the deprecated distplot)
ax = sns.histplot(women[women['Survived']==0].Age, bins=18, label='not survived', ax=axes[0])
ax = sns.histplot(women[women['Survived']==1].Age, bins=18, label='survived', ax=axes[0])
ax.legend()
ax.set_title('Female')

# plot men
ax = sns.histplot(men[men['Survived']==0].Age, bins=18, label='not survived', ax=axes[1])
ax = sns.histplot(men[men['Survived']==1].Age, bins=18, label='survived', ax=axes[1])
ax.legend()
ax.set_title('Male')
plt.show()

#### Survival Rate by Ticket Price

In [None]:
ax = sns.violinplot(x="Survived", y="Fare", data=df)
plt.show()

### 05. Build a Model
Given the training data for the binary classification problem "survival", we want to fit an estimator to be able to predict the class (0=not survived, 1=survived) of previously unseen data (=test data).

In scikit-learn (sklearn), an estimator for classification is a Python object that implements the methods ``fit(X, y)`` and ``predict(T)``.
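
To illustrate this interface, here is a minimal sketch of a hand-written object with the same two methods. The class `MajorityClassifier` is hypothetical and not part of scikit-learn; it simply predicts the most frequent training label.

```python
import numpy as np

class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # "learning" here only means remembering the most common label
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, T):
        # one prediction per row of T, ignoring the features
        return np.full(len(T), self.majority_)

toy_clf = MajorityClassifier().fit([[0], [1], [2]], [0, 0, 1])
toy_pred = toy_clf.predict([[5], [6]])
print(toy_pred)
```

Real scikit-learn estimators follow the same pattern: `fit` stores what was learned on the object, and `predict` uses it on new data.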

An example of an estimator is the ``DecisionTreeClassifier``, which learns simple decision rules to classify the data. The estimator’s constructor takes the model’s parameters as arguments.

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)

### Training set and testing set

Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the **training set**, on which we learn some properties. We call the other set the **testing set**, on which we test the learned properties.

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size=0.8)
df_train.shape

In [None]:
df_test.shape
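
Without further arguments, the split is drawn randomly and differs from run to run. The following sketch on made-up toy data shows two optional parameters of `train_test_split` that are often useful: `random_state` makes the split reproducible, and `stratify` keeps the class proportions similar in both sets. Whether to use them in this project is a design choice, not something the steps above require.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (made up): 10 samples, 2 features, balanced binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.8,
    random_state=0,    # fixed seed -> same split on every run
    stratify=y_toy,    # preserve the 50/50 class ratio in both sets
)

print(X_tr.shape, X_te.shape, sorted(y_te))
```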

**Let's start training! Wait...**

In [None]:
# this call fails on purpose: 'Sex' contains strings
clf.fit(df_train[['Sex', 'Pclass']], df_train['Survived'])

**We need numeric features!**

In [None]:
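# work on copies to avoid pandas' SettingWithCopyWarning
df_train, df_test = df_train.copy(), df_test.copy()
# convert to categorical and use integer codes instead of strings (female=0, male=1)
df_train["Gender"] = df_train["Sex"].astype('category').cat.codes
df_test["Gender"] = df_test["Sex"].astype('category').cat.codes
df_train.head()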
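
How the codes are assigned can be checked on a small made-up Series: the categories are sorted, so for strings the codes follow alphabetical order. The toy values below are an assumption for illustration.

```python
import pandas as pd

# toy column (made up)
sex_toy = pd.Series(['male', 'female', 'female'])
as_category = sex_toy.astype('category')

categories = list(as_category.cat.categories)  # sorted category labels
codes = as_category.cat.codes.tolist()         # position of each value in that list

print(categories, codes)
```

An explicit alternative is a fixed mapping such as `sex_toy.map({'female': 0, 'male': 1})`, which does not depend on which values happen to occur in the column.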

In [None]:
clf.fit(X=df_train[['Gender', 'Pclass']], y=df_train['Survived'])  

### Predict on Test Data

In [None]:
# predict the class for each passenger in the test set
y_pred = clf.predict(X=df_test[['Gender', 'Pclass']])
y_pred[:10]

#### Evaluation
Accuracy is a very simple measure to evaluate the performance of the classifier on the test data. It gives the fraction of correctly classified samples.

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(df_test['Survived'], y_pred)
print(f'The accuracy is {acc*100:.2f}%.')
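
As a sanity check of what `accuracy_score` computes, the following sketch compares it with a manual calculation on made-up labels (the values are an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy labels (made up): 3 of the 4 predictions are correct
y_true_toy = np.array([1, 0, 1, 1])
y_pred_toy = np.array([1, 0, 0, 1])

acc_manual = (y_true_toy == y_pred_toy).mean()        # fraction of matches
acc_sklearn = accuracy_score(y_true_toy, y_pred_toy)

print(acc_manual, acc_sklearn)
```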

#### Visualize the Decision Tree
The decision tree classifier is quite special, as we can also visualize, literally, the decisions it makes:

In [None]:
from sklearn import tree
plt.figure(figsize=(16,9)) # enlarge the figure so that the tree is readable
tree.plot_tree(clf, feature_names=['gender', 'passenger class'], class_names=['not survived', 'survived'])
plt.show()

#### Confusion Matrix
A more in-depth analysis is possible with a so-called confusion matrix, which counts the samples for each combination of true and predicted class:

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(df_test['Survived'], y_pred)
cm
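
The layout of the matrix is easiest to read on a tiny made-up example: rows correspond to the true class, columns to the predicted class. The labels below are an assumption for illustration.

```python
from sklearn.metrics import confusion_matrix

# toy labels (made up)
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# for a binary problem, ravel() yields the four cells row by row
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print(tn, fp, fn, tp)
```

Here `tn` and `tp` count the correct predictions for class 0 and class 1, while `fp` and `fn` count the two kinds of errors. The accuracy is the sum of the diagonal divided by the total number of samples.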

### 06. Visualize the Results
Finally, plot the confusion matrix so that it is easier to interpret.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()
plt.show()

## Wrap-up Exercises
1. Draw a plot showing the number of men and women who survived and did not survive (bar plot) on the test data only.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

2. Create a simple "classifier": Compute the accuracy score if all women had been predicted as having survived, and all men as not. $\rightarrow$ Since the column "Gender" encodes female as 0 and male as 1, you can use ``1 - Gender`` as ``y_pred``.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

3. Add more columns (e.g. ``Age``) to train the decision tree classifier and check if the performance improves. What happens to the graphical output of the tree?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Further Reading
- [scikit-learn documentation](https://scikit-learn.org/stable/modules/classes.html#)

and see below:

[<img src="https://cloud.google.com/products/ai/ml-comic-1/assets/panel_01_2x.png" width=500>](https://cloud.google.com/products/ai/ml-comic-1/)