# Machine Learning

## Goals

* Learn techniques in data science
* Practice programming in Python and using data frames
* Learn the jargon and understand what machine learning pipelines look like
* Set up a classification model
* Set up a regression model
* Set up an unsupervised (clustering) model
* Explore some basic neural networks

---

**Approach to learning**: I would like to take an approach where we are playing with data as early as possible. As such, we will be talking about concepts as we encounter them in the data.

---

**Assumptions**: you have at least 12 hours of Python training, including some exposure to Pandas.

---


## What is machine learning?

Loosely, machine learning can be defined as:

```Algorithms that allow a computer to predict patterns in unseen data based on learning done on data that has previously been seen```

Typically, the more data you allow the computer to learn with, the better job it will do.

---

## Why use machine learning?

A non-exhaustive list of applications:

* Fraud detection
* Spam detection
* Credit risk
* Voice recognition
* Image recognition
* Recommendations (search engines)
* Finding patterns in the stock market
* Housing prices

## Traditional modeling vs machine learning

Usually in science we work with a **white box model**, such as a set of equations, and we fully understand how the model works. We can see the process that turns inputs into outputs.

With machine learning, we let the machine figure out the details. While we understand the algorithms that allow the computer to learn, we don't always understand the specific patterns the algorithms learn from the data (**black box model**).

This aspect is sometimes used as an argument against the practice of machine learning.

Here is a thought-provoking [blog entry about this from Rich Sutton](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) (pioneer of reinforcement learning, U of A faculty).

___

## The Machine Learning Process


* **Define the problem**
 * This can be difficult. What are we trying to achieve? What are we trying to predict?
* **Get the data**
 * This is often connected with the problem definition step, because knowing about the data helps clarify what we can do with it
* **Prepare data** 
 * Exploratory data analysis and visualization
 * Cleaning data
 * Often the most tedious and time consuming step
* **Select Algorithm**
 * Setting up one or more machine learning pipelines 
* **Train the model**
 * Feed the algorithm data.
* **Test the model**
  * Maybe we need to go back and select a different algorithm to work with?
* **Select the best model** 
 * The definition of "best" depends on the type of problem, the type of data, and our goals
* **Predict**
  * Use the model to make predictions based on unseen data
* $$$ (?)


---

## Data science tools in Python we will be using

* Data analysis and cleaning/transforming: **pandas**
* Visualization: **matplotlib** (possibly **seaborn** and **plotly**)
* Scientific computing/number crunching: **numpy**
* Machine learning algorithms: **Scikit-learn**
* Neural networks: **keras** (using **tensorflow** as a backend)

In [None]:
# Let's make sure that we have the tools available to us ...

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import tensorflow
import keras

In [None]:
# ... if not, may need to uncomment one or more of:
# (Replace pip with conda where applicable)

# !pip install tensorflow
# !pip install keras
# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install scikit-learn

## Getting the data for this notebook

We can download the data if we don't have it already

In [None]:
# Download data and solutions

import urllib.request
import os

def download_data(path):
    """Download `path` from the course repository unless it already exists locally."""
    if os.path.exists(path):
        return
    # Make sure the target directories exist before downloading
    for directory in ('data/titanic', 'solutions'):
        os.makedirs(directory, exist_ok=True)
    url = 'https://raw.githubusercontent.com/ualberta-rcg/python-machine-learning/main/notebooks/' + path
    urllib.request.urlretrieve(url, path)
    print("Downloaded " + path)

def show_solution(file):
    """Print the contents of a downloaded solution file."""
    # The `with` block closes the file once we are done reading it
    with open('solutions/{}'.format(file), 'r') as fp:
        print(fp.read())

download_data('data/titanic/train.csv')
download_data('solutions/titanic-passenger-class.py')
download_data('solutions/titanic-random-forest-pipeline.py')
download_data('solutions/titanic-age-dropna.py')

## More about Machine Learning tools

Four packages that are available to us for free:

* Scikit-learn
 * Easy to understand
 * Great for learning
 * Consistent interface
* TensorFlow
 * From Google
 * Takes advantage of GPUs
* PyTorch
 * From Facebook
 * Takes advantage of GPUs
* Keras
 * Built on top of TensorFlow
 * Easier to understand and use

Some comparisons:
https://towardsdatascience.com/scikit-learn-tensorflow-pytorch-keras-but-where-to-begin-9b499e2547d0

---

## A problem: The Titanic Kaggle challenge

We will learn some of the concepts and jargon of machine learning by walking through an example.

[Kaggle](https://www.kaggle.com/) (owned by Google) is a machine learning competition website.

Competitions are either for fun, for money, or might lead to a job offer.

The introductory competition involves the sinking of the [Titanic](https://www.kaggle.com/c/titanic):

> The sinking of the Titanic is one of the most infamous shipwrecks in history.
>
> On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
> 
> While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
>
> In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

In the competition, we are given 891 rows of data (one row per passenger) in the file `train.csv` where we know whether the passenger survived. As the name indicates, we will use this data to train a machine learning model.

For the competition there is also a file called `test.csv` that includes information on 418 other passengers, but missing from this data is whether the passenger survived or not. To enter the competition, we make predictions on this file and submit it to the competition.

When I was first learning data science, I did two submissions to the Titanic competition: one which guessed 71% correctly, the other guessed 76% correctly.

We'll explore this problem to learn some of the concepts related to machine learning, and build a model that at least beats my earlier attempts.

## An initial look ...

We'll load the data file in and get a sense for the data it contains.

In [None]:
train_df = pd.read_csv('data/titanic/train.csv')
train_df.info()

In [None]:
# Let's look at the first 10 records ...
train_df.head(10)

Do you see any good candidates for predictors of survival?

---

## Features and Labels

In short, inputs to a machine learning model are called **features** (or **attributes**), and the output predicted is called a **label** (or a **target**).

The **label** that we are trying to predict in the Titanic challenge is clearly the information in the **Survived** data column.

We have 11 potential **features** we can use to predict this **label**.

* A feature can be binary, nominal, or numerical.
* We want to choose features that have predictive power.
* We want to choose features that are as independent as possible
    * E.g., if `weight_in_pounds` is a predictive feature in a model, don't also choose `weight_in_kilograms`
* Note that we can also design our own features from the data provided. (E.g., if we are predicting stock prices and we are given the previous opening and closing prices of a stock, maybe the difference would be a good predictor?). The process of designing and choosing features is called **feature engineering**.

A row of features (a single record in the input) is often called a **feature vector**.

A **dataset** often refers to a collection of feature vectors (either labeled or unlabeled)
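
To make the feature engineering idea concrete, here is a minimal sketch based on the stock example above. The prices and the column names `open`, `close`, and `daily_change` are made up for illustration; they are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical daily prices for one stock (invented numbers, illustration only)
prices = pd.DataFrame({
    "open":  [10.0, 10.5, 10.25, 11.0],
    "close": [10.5, 10.25, 11.0, 10.75],
})

# Engineered feature: how much the price moved during each trading day
prices["daily_change"] = prices["close"] - prices["open"]

print(prices)
```

Each row of `prices` is now a feature vector with three entries, one of which we designed ourselves.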

---

**Question**: what could be possible features for a model that predicts housing prices?

---

## Types of Learning

### Supervised
* All instances in training data are **labeled**
* **Classification** - predicting nominal label
  * We are looking to build models that separate data into distinct classes
  * Algorithms:
    * Decision Trees
    * Random Forest
    * Support Vector Machine
  * E.g.,
    * Did the person survive the titanic? (True/False)
    * What species of plant is this?
* **Regression** - predicting numerical label
  * Based on previous data, predict a continuous numerical quantity
  * Algorithms
    * Linear regression
    * Polynomial regression
  * E.g.,
    * Predict the high temperature for tomorrow
    * Predict the closing price of a stock tomorrow

### Unsupervised
* There are **no labels** for the instances
* We are trying to find hidden meaning in data without additional guidance
* E.g., Find ten categories that a collection of emails fall into (clustering)
* Algorithms:
  * KMeans

### Reinforcement Learning
* Algorithms learn which actions to take based on responses from the environment
* It’s impossible to get a label without taking an action
* Check this out, a DeepMind model learning to play Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk

---

## Exploratory Data Analysis

**Exploratory Data Analysis (EDA)** is poking around data (e.g., looking at descriptive statistics or plots) to gather insights.

---

The `value_counts` method for a series is useful for exploring the data. In essence, it gives us the frequency table for a column in the data.

Let's find out the survival rate...

In [None]:
train_df['Survived'].value_counts()

In [None]:
counts = train_df['Survived'].value_counts()
percent = round(100 * counts[1] / len(train_df))
print('Survival rate: {}%'.format(percent))

We can look at the influence of various features on survival rate...

In [None]:
sex_counts = train_df['Sex'].value_counts()
sex_counts

In [None]:
# This is how the series returned from `value_counts` is organized

print(sex_counts.keys())
print(sex_counts.values)

Let's look at the data for the males alone

In [None]:
male_df = train_df[train_df['Sex'] == 'male']

In [None]:
male_df['Survived'].value_counts()

In [None]:
# and now the survival rate for males ...

counts = male_df['Survived'].value_counts()
percent = round(100 * counts[1] / len(male_df), 2)
print('Survival rate: {}%'.format(percent))

At this point you will have recognized that you have done an almost identical set of operations on two different dataframes ... this sounds like a good time to create a function!

In [None]:
def survival_report(df):
    if len(df) == 0:
        print('Empty data')
        return
    counts = df['Survived'].value_counts()
    print('             Lived: {}'.format(counts[1]))
    print('              Died: {}'.format(counts[0]))
    print('Chance of survival: {}%'.format(round(100 * counts[1] / len(df), 2)))

In [None]:
survival_report(male_df)

The new function makes exploring the data for the female passengers easy ...

In [None]:
survival_report(train_df[train_df['Sex'] == 'female'])

**Question: is gender a good predictor of survival?**
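
As a cross-check on per-group reports like the ones above, pandas can compute the rate for every group in one step with `groupby`. This is a minimal sketch on a made-up stand-in for the Titanic data (same column names, invented rows), so the numbers it prints are not the real survival rates.

```python
import pandas as pd

# Made-up stand-in for the Titanic data (same column names, invented rows)
toy_df = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male",
                 "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rates = toy_df.groupby("Sex")["Survived"].mean()
print((100 * rates).round(2))
```

On the real data, the same idea would be `train_df.groupby('Sex')['Survived'].mean()`.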

---

## Exercise

Using our `survival_report` function above, explore the survival rates for the different values of Passenger class ...

In [None]:
### Your code here ...


In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
show_solution('titanic-passenger-class.py')

## Selecting some features

Clearly `Sex` and `Pclass` have information that would be valuable to a model that would predict survival on the Titanic, so we will include them in our collection of features.

We will add a couple of other features:
* `SibSp` (number of siblings/spouses also aboard)
* `Parch` (number of parents/children also aboard)

In [None]:
features = ["Pclass", "Sex", "SibSp", "Parch"]

Our dataset with just these features is:

In [None]:
train_df[features]

## One-hot encoding

Most machine learning algorithms only work on numerical features, not with categorical strings like 'male' and 'female'.

**One-hot** encoding is the process of converting categorical variables into a binary representation.

Pandas has a function called [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) that does this for us:

In [None]:
X = pd.get_dummies(train_df[features])
X

The `get_dummies` function leaves the other columns (which already have numerical information) untouched.

(Question: while passenger class is expressed as a number, should it perhaps also be treated as nominal?)
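
If we wanted to experiment with treating passenger class as nominal, one option is the `columns` argument of `get_dummies`, which lets us name the columns to encode even when they hold numbers. The sketch below uses a few made-up rows rather than the real data.

```python
import pandas as pd

# Made-up rows with the same column names as the Titanic features
toy_df = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Sex":    ["male", "female", "female", "male"],
    "SibSp":  [0, 1, 0, 2],
})

# `columns` names the columns to encode; numeric columns listed here are
# encoded too, while unlisted columns (SibSp) pass through unchanged
encoded = pd.get_dummies(toy_df, columns=["Pclass", "Sex"])
print(encoded.columns.tolist())
```

Whether this helps a particular model is something to test rather than assume.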

The `Sex_female` column doesn't provide any additional predictive information when combined with the `Sex_male` column (the information it contains is entirely redundant), so we can ask `get_dummies` to leave it out with the `drop_first` keyword option.

In [None]:
X = pd.get_dummies(train_df[features], drop_first=True)
X

---

`X` is now a dataframe with our features. We now define our labels as `y`.

In [None]:
y = train_df['Survived']
y

(The `X` is uppercase because it is a set of vectors, the `y` is lowercase because it is a set of individual values.)

---

## Training and test data

We would like to now (randomly) split our data (both the features, and the labels) into two sets:

* Data we will **train** a machine learning model on
* Data we will use to **test** the model's performance. This data will be previously **unseen** by the training process.

Arbitrarily, we can decide that we would like **two thirds of the data to train on**, reserving the remaining one third of the data for evaluating the model (we make predictions with the test data, then compare these predictions with the real answers).

Scikit-learn has a function called [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) that makes this work easy.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

As the names indicate:
* `X_train` are the features that are used for training (2/3 of the `X` values)
* `y_train` are the labels that are used for training (2/3 of the `y` values)
* `X_test` are the features used to make predictions to test the model (1/3 of the `X` values)
* `y_test` are the labels that are used to compare the predictions with reality (1/3 of the `y` values)
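
To see what the split does to sizes and row alignment, here is a small sketch on made-up data (30 rows, so the numbers are easy to check by hand). The names `X_toy`, `y_toy`, `X_tr`, and so on are only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 30 rows, one feature, alternating labels
X_toy = pd.DataFrame({"feature": range(30)})
y_toy = pd.Series([0, 1] * 15, name="label")

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33)

# Roughly one third of the rows end up in the test set
print(len(X_tr), len(X_te))

# Features and labels are shuffled together, so rows stay aligned
print(X_tr.index.equals(y_tr.index))
print(X_te.index.equals(y_te.index))
```

Every row lands in exactly one of the two sets, which is what keeps the test data unseen during training.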

---

## Our first machine learning model: Decision Tree

The model we are going to initially try is a decision tree.

![](https://raw.githubusercontent.com/ualberta-rcg/python-machine-learning/main/notebooks/assets/youdroppedfood.jpg)

(Image: Audrey Fukman and Andy Wright on SFoodie, via Serious Eats)

A number of questions are asked about the data, and decisions are made based on the answers. These answers are refined or changed based on additional information.

---

### `sklearn.tree.DecisionTreeClassifier`

The [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) looks at your data and figures out what questions to ask automatically.

We can specify an option for `max_depth` (basically, how deep do we want the questions to go).

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)

---

### `fit`

Most models (all?) from Scikit-learn have a `fit` method that trains the model based on your features and labels.

In [None]:
# fit trains the model in place and also returns it,
# so the assignment below is optional

model = model.fit(X_train, y_train)
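
A quick way to check what `fit` gives back is to compare the returned object with the original estimator. This is a minimal sketch on made-up data; `toy_model` and `returned` are names used only here.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: one feature, two classes
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]

toy_model = DecisionTreeClassifier(max_depth=3)
returned = toy_model.fit(X_toy, y_toy)

# `fit` trains the estimator in place and returns that same object
print(returned is toy_model)
print(toy_model.predict([[0], [3]]))
```

Because the same object comes back, `model.fit(X_train, y_train)` and `model = model.fit(X_train, y_train)` leave us with the same trained model.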

---

### `predict`

Most models (all?) from Scikit-learn have a `predict` method that allows a trained model to make predictions when given unlabeled features.

Here we use our **test data** (data that the model wasn't trained on) as an input

In [None]:
predictions = model.predict(X_test)

In [None]:
print("The first ten predicted labels for the test data")
print(list(predictions[:10]))
print('The ten actual labels')
print(list(y_test[:10]))

### But what did the decision tree do?

In [None]:
# If you want to install graphviz ....
# Note for conda, you may have to install both graphviz and python-graphviz
# !pip install graphviz
# !conda install graphviz python-graphviz

In [None]:
import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(model,
                           feature_names=X.columns,
                           class_names=['Died', 'Survived'],
                           filled=True, rounded=True,
                           special_characters=True,
                           out_file=None)
graph = graphviz.Source(dot_data)
graph
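
If graphviz is not available, scikit-learn can also print a tree as plain text with `export_text`. The sketch below trains a throwaway tree on made-up data with two hypothetical feature columns, just to show the output format.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data with two hypothetical features: [Pclass, Sex_male]
X_toy = [[1, 0], [1, 1], [3, 0], [3, 1]]
y_toy = [1, 1, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Plain-text rendering of the learned questions; no graphviz install needed
rules = export_text(toy_tree, feature_names=["Pclass", "Sex_male"])
print(rules)
```

For the model trained in this notebook, the equivalent call would be `export_text(model, feature_names=list(X.columns))`.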

## Measuring the quality of predictions in classification models

Ah, statistics and jargon! Let's look at something called a [**confusion matrix**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) that compares our predicted values to the actual values in `y_test`.

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

The confusion matrix reports the number of true positive, true negative, false positive, and false negative predictions in a table:

|          | Predicted 0         | Predicted 1         |
|----------|---------------------|---------------------|
| **Actual 0** | True negative (TN)  | False positive (FP) |
| **Actual 1** | False negative (FN) | True positive (TP)  |

Entries on the main diagonal report correct predictions, entries on the other diagonal report incorrect predictions.

Here we have the following interpretations for our predictions:

* Correct predictions
  * **True negative**: person was predicted to die and actually died
  * **True positive**: person was predicted to survive and actually survived
* Incorrect predictions
  * **False negative**: person was predicted to die, but actually survived
  * **False positive**: person was predicted to survive but actually died
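
For binary 0/1 labels, the four counts can be pulled out of the matrix by flattening it. This is a small sketch on made-up labels (1 = survived, 0 = died), not on our real predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = survived, 0 = died
actual    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]

# For binary 0/1 labels the matrix is [[TN, FP], [FN, TP]],
# so flattening it row by row gives the counts in that order
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)
```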

**Accuracy** is the proportion of predictions that are correct (the share of true positives and true negatives among all predictions), so accuracy is **(TN + TP) / (TN + TP + FN + FP)**.

In Scikit-learn, this is represented by the [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

Sometimes accuracy isn't the best measure. For example, if you wanted to predict the presence of a disease that affected only 1 in every 1000 people, you could have a 99.9% accurate model just by predicting that nobody ever has the disease.

In fact, we could easily create a model for the Titanic challenge that is 62% accurate by just predicting that everybody died!
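
Scikit-learn has a `DummyClassifier` that makes this kind of baseline explicit. The sketch below uses made-up labels with roughly the class balance mentioned above (62 died, 38 survived) rather than the real data, and shows that decent accuracy can hide a recall of zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with roughly the Titanic class balance: 62 died, 38 survived
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((100, 1))  # this baseline ignores the features entirely

# Always predict the most frequent class seen during fit (here: 0, "died")
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_predictions = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, baseline_predictions)
baseline_recall = recall_score(y_toy, baseline_predictions)
print(baseline_accuracy, baseline_recall)
```

A real model is only interesting if it beats a baseline like this one.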

---

[**Precision** and **recall**](https://en.wikipedia.org/wiki/Precision_and_recall) are less obvious terms, but depending on what is important in the problem, they may be better indicators of model performance.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/330px-Precisionrecall.svg.png)

The **circle** represents the passengers that we **predicted to survive**.

The **left hand side** represents the passengers that **actually survived**.

**[Precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)** is the proportion of passengers we **correctly predicted** to survive among all of the passengers we predicted to survive: **TP/(TP + FP)**

**[Recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)** is the proportion of passengers we correctly predicted to survive, among all of the passengers that actually survived: **TP/(TP + FN)**
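
The formulas are simple enough to work through by hand. Here is a sketch in plain Python on made-up labels (1 = survived, 0 = died) that counts the four outcomes and applies the definitions of accuracy, precision, and recall.

```python
# Made-up labels: 1 = survived, 0 = died
actual    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted survive, did survive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted die, did die
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted survive, but died
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted die, but survived

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)
```

Here the model is right 7 times out of 10, 3 of its 4 "survived" predictions are correct, and it finds 3 of the 5 actual survivors.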

Examples when **high precision** is important (and we want to reduce the number of false positives):
* **Trials**: of all of the people we declare/predict to be guilty, we want as many of these to be guilty as possible. Otherwise innocent people go to jail.
* **Spam filter**: of all the emails we predict to be spam, we want as many as possible to actually be spam (or else legitimate email gets flagged)

Examples when **high recall** is important (and we want to reduce the number of false negatives):
* **Cancer screening**: a false negative means somebody who got a negative result actually has cancer

**Question**: what do you think is more important for YouTube/Netflix/Spotify recommendation engines, precision or recall?


---

The scores for our model ...

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

---

### Putting it all together ...

We now have a lot of code spread out over a number of cells in this notebook.

Here we combine it together so we can get a better picture of what the entire pipeline looks like.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# More about this line shortly ...
# np.random.seed(1337)

# Load data
train_df = pd.read_csv('data/titanic/train.csv')

# Choose features and labels
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_df[features], drop_first=True)
y = train_df['Survived']

# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Initialize model and fit to training data
model = DecisionTreeClassifier(max_depth=3)
model = model.fit(X_train, y_train)

# Use model to predict on unseen test data
predictions = model.predict(X_test)

# Evaluate how well the model did
print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

---

## Repeatability ...

Run the above cell a few times. What happens?

A lot of the code related to machine learning depends on the output of random number generators, which can make repeating results difficult.

Scikit-learn uses NumPy's random number generator, and luckily we can **seed** this random number generator (give the generator an initial number so that the **same sequence of random numbers** is generated whenever that seed is used).

**Uncomment the line that looks like this: `np.random.seed(1337)`**

Now run the code several times and watch the results.

Note: the number `1337` isn't special (it's often used for seeds, as it's internet slang for 'leet' or 'elite'). 

Pick whichever number makes you happy -- but use it every time for the same pipeline if you are interested in repeatability.
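
An alternative to seeding the global generator is to pass a `random_state` argument to the individual pieces that use randomness; both `train_test_split` and `DecisionTreeClassifier` accept one. Here is a minimal sketch on made-up data showing that two splits with the same `random_state` pick the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 30 rows, one feature, alternating labels
X_toy = pd.DataFrame({"feature": range(30)})
y_toy = pd.Series([0, 1] * 15)

# Same random_state, same split, no matter how often the cell is run
_, X_te_a, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=1337)
_, X_te_b, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=1337)

print(X_te_a.index.equals(X_te_b.index))
```

This keeps the repeatability local to one call instead of depending on everything that has run earlier in the notebook.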


---
---

## If a tree is good, is a forest better?

What if we could set up a whole bunch of decision trees, each asking different questions, then vote on their final decisions to get a prediction? This is what Random Forest does.

Random Forest is called an **ensemble** model, because it combines the results of multiple models to try to get a better answer.
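
The voting idea can be sketched in a few lines of plain Python. The three "trees" below are hand-written rules invented for illustration, not trained models, and this is not how scikit-learn is implemented internally (as far as I know, its random forest averages the trees' predicted probabilities rather than counting hard votes).

```python
from collections import Counter

# Three hypothetical "trees", each a hand-written rule on a passenger dict
def tree_by_sex(p):
    return 1 if p["Sex"] == "female" else 0

def tree_by_class(p):
    return 1 if p["Pclass"] == 1 else 0

def tree_by_family(p):
    return 1 if p["SibSp"] + p["Parch"] > 0 else 0

def forest_predict(passenger, trees):
    """Return the label that most of the trees voted for."""
    votes = [tree(passenger) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

toy_trees = [tree_by_sex, tree_by_class, tree_by_family]

passenger_a = {"Sex": "female", "Pclass": 3, "SibSp": 1, "Parch": 0}
passenger_b = {"Sex": "male", "Pclass": 1, "SibSp": 0, "Parch": 0}

print(forest_predict(passenger_a, toy_trees))  # votes: 1, 0, 1
print(forest_predict(passenger_b, toy_trees))  # votes: 0, 1, 0
```

Using an odd number of trees avoids ties when there are only two possible labels.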

![](https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png)

---

In Scikit-learn, Random Forest is implemented by [`sklearn.ensemble.RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

As with `DecisionTreeClassifier`, we have a setting for the `max_depth` of the generated decision trees.

We also have a setting for the number of trees to use, `n_estimators`.

E.g., `model = RandomForestClassifier(n_estimators=100, max_depth=3)`

---

### Exercise: use a `RandomForestClassifier`

Set up a machine learning pipeline using a `RandomForestClassifier` model:

* Read the Titanic data from a file
* Split the data into training and testing data
* Fit a random forest model to our training data
* Predict labels using our test data.
* Evaluate this model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Your code goes here ...
# Hint: other than a couple of lines of code, it should look
#   very much like the decision tree pipeline above 

In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
show_solution('titanic-random-forest-pipeline.py')

---

## Handling missing data

Let's look at the information about our data set again:

In [None]:
train_df.info()

---

Suppose we thought that age was a good predictor of survival.

We have a problem though:

**Not every row has data for age recorded**

Unfortunately, many machine learning algorithms **don't know how to handle empty data**.

We have some strategies to deal with this:

* We could throw out any rows with missing data
* We could substitute a default value (e.g., a mean if available, zero, -1, 9999, or some other value)
* Maybe the presence of a null value could be a feature in itself? (e.g., record a one if there was a cabin number recorded for the passenger, and a zero otherwise)

Let's look at how to throw out rows, and how to replace a null value with a mean.
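
The third strategy (turning "is the value missing?" into a feature) only takes one line in pandas. Here is a minimal sketch on a few made-up rows; the column name `HasCabin` is a hypothetical name chosen for this example.

```python
import numpy as np
import pandas as pd

# Made-up rows: two passengers with a cabin recorded, two without
toy_df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 where a cabin number was recorded, 0 where it is missing
toy_df["HasCabin"] = toy_df["Cabin"].notnull().astype(int)
print(toy_df)
```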

---

### Throwing out rows

First, here is a tool for detecting the presence of missing (null) values:

In [None]:
train_df.isnull()

In [None]:
# Count the missing data in each column
train_df.isnull().sum()

Pandas dataframes have a method called [`dropna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) that can be used to filter out rows that have missing data.

We can specify which columns to look at using the `subset` keyword argument.

Note that this method by default does not modify the dataframe in place, but **returns a new dataframe** with those rows removed.

In [None]:
age_non_null_train_df = train_df.dropna(subset=['Age'])
age_non_null_train_df.info()

# If we decide to go further down this road, we might do either:
#    train_df.dropna(subset=['Age'], inplace=True)
#                or
#    train_df = train_df.dropna(subset=['Age'])

---

### Replacing missing data with a mean

With this strategy, we don't throw out any rows, but instead create a 'fictional' age value for the rows with missing age.

Let's look at the mean age from the non-null values:

In [None]:
train_df.describe()

Pandas `Series` have a method called [`fillna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html) that can replace null values in a column with a specific value.

In [None]:
# We can make a copy of the dataframe if we don't want to
# modify the original (optional)...
age_mean_train_df = train_df.copy()
# Overwrite the column with new data with the missing data filled
age_mean_train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())

In [None]:
age_mean_train_df.describe()

Tip: if you wanted to get a sense of the distribution of ages in the data, you could run value counts using binning with the `bins` option:

In [None]:
train_df["Age"].value_counts(bins=10, sort=False)

## Exercise

Use the age of the passenger as an additional feature in a machine learning pipeline.

* Use either the dataset with the null rows thrown out (e.g., `train_df = age_non_null_train_df`) or the dataset with the missing age data replaced by mean (e.g., `train_df = age_mean_train_df`). You decide.
* Use whichever classifier algorithm you'd like (`DecisionTreeClassifier` or `RandomForestClassifier`).
* Play with whichever keyword arguments you might like to change ... any effect on the results?

In [None]:
# Your code here ...

In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
# (one possible solution ...)
show_solution('titanic-age-dropna.py')

## Exercise

Predict your own survival (or possibly that of your entire family) with the most recently trained model.

E.g.,
```
# You may need to sanity check the order of features, should look like:
# ["Pclass", "SibSp", "Parch", "Age", "Sex_male"]
print(X_train.columns)
features = X_train.columns

family = [
  [2, 1, 1, 53.0, 1], # Me
  [2, 1, 1, 52.0, 0], # Wife
  [2, 0, 2, 10.0, 0]  # Daughter
]

family_df = pd.DataFrame(family, columns=features)
model.predict(family_df)
```

In [None]:
# Your code here ...

## Overfitting

One potential problem to look out for in machine learning is overfitting. This occurs when your model is so finely tuned to the data you give it that it only works well with that data, and does a poor job when it encounters new unseen data. This can happen when a model has more adjustable parameters than is optimal.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/200px-Overfitting.svg.png)

From Wikipedia: `The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line.`
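
One common symptom of overfitting is a large gap between the score on the training data and the score on the test data. The sketch below uses synthetic data with deliberately noisy labels (generated by `make_classification`, not the Titanic data) and compares a shallow tree with one that is allowed to grow without a depth limit. The exact numbers depend on the generated data, so treat them as an illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y assigns random labels to 20% of the rows (label noise)
X_syn, y_syn = make_classification(n_samples=400, n_features=10,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.33, random_state=0)

scores = {}
for depth in (3, None):  # None lets the tree grow until it fits every training row
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # `score` reports accuracy for classifiers
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print('max_depth={}: train={:.2f}, test={:.2f}'.format(depth, *scores[depth]))
```

The unlimited tree memorizes the training rows, noise included, so its training accuracy says very little about how it will do on unseen data.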

---

Onto the next notebook, on [regression](02-regression.ipynb).