In [1]:
import pandas as pd
import numpy as np

# Machine Learning Workflow

Let's go through the (beginning of the) machine learning workflow using a familiar dataset.

![penguins](./penguins.png)

## The Workflow

![workflow](./ml_workflow.png)

## 1. Define Business Goal

A goal should be measurable.

**Penguins:**
> Predict which species a penguin belongs to<br>
> We want to achieve an accuracy of 0.8

**Titanic:**
> Predict who survived and who died<br>
> Arbitrarily: we want a model accuracy higher than 0.77

**Accuracy:** The ratio of correct predictions to all cases, i.e. the share of correctly classified cases.

**Loss:** A measure of the difference between the true values y and the predictions y_hat.
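
To make the two definitions concrete, here is a minimal sketch on made-up labels. The arrays `y_true_demo` and `y_hat_demo` are hypothetical and not part of the penguins or Titanic data. It computes accuracy by hand and, as one simple choice of loss, the share of wrong predictions (the 0-1 loss). Other losses, such as log loss, are more common when training a model.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true_demo = np.array([1, 0, 1, 1, 0])
y_hat_demo = np.array([1, 0, 0, 1, 1])

# Accuracy: share of predictions that match the true label
accuracy = np.mean(y_true_demo == y_hat_demo)

# 0-1 loss: share of predictions that do not match (1 - accuracy)
zero_one_loss = np.mean(y_true_demo != y_hat_demo)

print(accuracy, zero_one_loss)
```

Three of the five predictions match, so the accuracy is 0.6 and the 0-1 loss is 0.4.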


## 2. Get Data

For both the penguins data and the Titanic data, we just have to load a .csv file.

Potential data sources:
- Databases
- Create your own data (simulation) / run a survey
- Sensors / Devices that measure data
- Web Crawling
- API (Application Programming Interface) 

In [2]:
df = pd.read_csv('./data/penguins_simple.csv', sep=';')
df.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


In [3]:
df.shape

(333, 6)

## 3. Train-Test-Split

What is the purpose of splitting the data into training and test data? We want to be able to detect whether our model is overfitting.
The train-test split does not prevent overfitting, but it helps to detect it.

**Overfitting:**

The algorithm is, to some extent, memorizing the correct answers for the training data. This means that it will not work well on data it has not been trained on. The model does not **generalize** well.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression

In [45]:
training_scores = []

for i in range(100):

    # Create an array of random 0s and 1s
    y = np.random.randint(low=0, high=2, size=50)

    # Create a random matrix of features X
    X = pd.DataFrame(np.random.normal(loc=0, scale=1, size=(50, 10)))

    # Create our logistic regression
    m = LogisticRegression()

    m.fit(X, y)

    # Let's inspect the accuracy of the model
    training_scores.append(m.score(X, y))

In [44]:
np.mean(training_scores)

0.8048000000000001

This looks like a fantastic model! Should we deploy the model and use it in production?

Let's check how it performs on some test data that was created by the same data-generating process.

In [62]:
test_scores = []

for i in range(100):

    # Let's create some test data
    y_test = np.random.randint(low=0, high=2, size=10)
    X_test = np.random.normal(loc=0, scale=1, size=(10, 10))

    # Let's inspect the accuracy on the test data
    test_scores.append(m.score(X_test, y_test))

In [42]:
np.mean(test_scores)

0.501

:( It looks like our model just picked up on random fluctuations and did not actually learn anything. I could have saved the time and energy and used a coin flip instead.

**This is why splitting your data into training and test data is so important. It allows you to estimate the out-of-sample performance of your model and understand whether it is generalizing well.**
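
The experiment above can also be sketched with a single split per run. This is a minimal sketch, not the notebook's original code: the seed, the sample sizes and all names ending in `_demo` or `_noise` are assumptions chosen for illustration. The labels are pure noise, so any gap between training and test accuracy is due to overfitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
train_scores_demo, test_scores_demo = [], []

for _ in range(20):
    # Pure noise: the labels are unrelated to the features
    X_noise = rng.normal(loc=0, scale=1, size=(100, 10))
    y_noise = rng.integers(low=0, high=2, size=100)

    # Hold out half of the rows; the model never sees them during fitting
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_noise, y_noise, test_size=0.5, random_state=0
    )

    model = LogisticRegression().fit(X_tr, y_tr)
    train_scores_demo.append(model.score(X_tr, y_tr))
    test_scores_demo.append(model.score(X_te, y_te))

mean_train = np.mean(train_scores_demo)
mean_test = np.mean(test_scores_demo)
print(mean_train, mean_test)
```

The mean training accuracy should come out clearly above the mean test accuracy, which should hover around chance level.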

In [46]:
# Import train-test-split
from sklearn.model_selection import train_test_split

In [47]:
df.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


In [49]:
y = df.Species
X = df.loc[:, 'Culmen Length (mm)':'Body Mass (g)']

In [57]:
# Split X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [58]:
X_train.shape

(266, 4)

In [59]:
X_test.shape

(67, 4)

In [60]:
y_train.shape

(266,)

In [61]:
y_test.shape

(67,)
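
With `test_size=0.2` and 333 rows, scikit-learn rounds the test set up to 67 rows, which leaves 266 for training. The split itself is random, so the rows that end up in each set change from run to run. The following sketch shows one way to make a split reproducible and to keep the class proportions equal in both sets. It uses a small hypothetical DataFrame (`toy`) in place of the penguins data, and the seed `42` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    "Body Mass (g)": [3750, 3800, 3250, 3450, 3650, 5000, 5200, 4900, 5100, 5300],
    "Species": ["Adelie"] * 5 + ["Gentoo"] * 5,
})
X_toy = toy[["Body Mass (g)"]]
y_toy = toy["Species"]

# random_state fixes the shuffle; stratify keeps the species proportions
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_toy_train.value_counts())
print(y_toy_test.value_counts())
```

Stratifying matters most when one class is rare, because a purely random split could leave it out of the test set entirely.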

In the Titanic dataset, your y is the column **Survived**.

In the case of the Titanic data, Kaggle already provides split data. This means that you do not have to split the data again.

## 4. Explore Data

This is your task for the afternoon. You learnt how to do this last week.

Exploration is only done on the training data, so that nothing about the test data leaks into your modeling decisions. To do this, it makes sense to merge X_train and y_train into a single DataFrame, df_train.
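
One way to do the merge is `pd.concat` along the columns, which aligns the two objects on their shared index. The sketch below uses small hypothetical stand-ins (`X_train_demo`, `y_train_demo`) with a shuffled index, as a random split would produce. With the real data, the same call would take `X_train` and `y_train`.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and y_train after a random split
X_train_demo = pd.DataFrame(
    {
        "Culmen Length (mm)": [39.1, 39.5, 40.3],
        "Body Mass (g)": [3750.0, 3800.0, 3250.0],
    },
    index=[10, 4, 7],
)
y_train_demo = pd.Series(
    ["Adelie", "Adelie", "Adelie"], index=[10, 4, 7], name="Species"
)

# axis=1 places y next to X as an extra column, matching rows by index
df_train_demo = pd.concat([X_train_demo, y_train_demo], axis=1)

print(df_train_demo.shape)
print(df_train_demo.head())
```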

## 5. Feature Engineering

The lecture on feature engineering is scheduled for Wednesday morning.

## 6. Train Model

The lecture on the model itself is scheduled for Tuesday morning.

## 7. Optimize / Cross-Validation

The lecture on cross-validation is scheduled for Wednesday afternoon.

## 8. Calculate Test Score

You will see how to do this when talking about training the model.

## 9. Deploy the model

We will talk about model deployment towards the end of the bootcamp.