# PyData Toronto - August 2017

# Machine Learning - Interactive Session

## Software

We are going to use a few key pieces of the PyData stack. For the data manipulation, we will be using **pandas** and for the machine learning, we will use **scikit-learn**.

scikit-learn is a good starting point because it provides a broad set of machine learning tools and its defaults are sensible.

## Types of Machine Learning

* supervised vs. unsupervised
* classification vs. regression

**Supervised** means that we observe an outcome (a label) for each observation and learn to predict it. **Unsupervised** means there is no observed outcome; instead, we group the observations based on their characteristics. (We don't have any information beforehand about which groups they belong to.)

**Classification** assigns each observation to one of a finite set of categories. **Regression** is numeric: it predicts a number for each observation.
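As a rough illustration of these distinctions, the sketch below fits one estimator of each kind on a small synthetic dataset. The data and variable names are made up for this example; the estimators are standard scikit-learn classes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))  # synthetic features, for illustration only

# Supervised classification: the outcome is a category (0 or 1).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
class_pred = clf.predict(X)  # predicted categories

# Supervised regression: the outcome is a number.
y_num = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_num)
num_pred = reg.predict(X)  # predicted numbers

# Unsupervised: no outcome at all; group the rows by their features.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = km.labels_  # cluster assignment for each row
```

Note that the two supervised estimators need both `X` and an outcome to `fit`, while `KMeans` only ever sees `X`.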

## Supervised Classification

The purpose of this is to make predictions about what **category** the observations will fit into. The classic example is identifying which borrowers are going to default on their loans.

This differs from statistical modeling in the objective, but many of the models are similar.

In statistical modeling, we are much more interested in the relationships between various characteristics of an observation and their outcomes.

## Training vs. Test Data

With machine learning, the only real way to know how well your model works is to check the accuracy of its predictions. **But** you don't want to make predictions and then wait to see what happens.

So, we would like to **test** our models on some observations before we put them into production.

We test our models by setting aside some data and not looking at it until the model is finished.

We don't look at the data because we don't want to select an **overfitted** model.

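Overfitting is a problem because you can create a model that is really good at predicting the data you already have, but not that good at predicting new data. This is one side of the bias-variance tradeoff: a very flexible model has low bias but high variance, so it ends up fitting the noise in the data it was trained on.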
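To make overfitting concrete, here is a minimal sketch on synthetic data (not the session's dataset). An unrestricted decision tree is used only because it overfits readily; the data, the noise level, and the hand-rolled slice-based split are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
# The outcome depends on the first feature plus a lot of noise,
# so no model can predict it perfectly on new data.
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

# Hold out the last 125 rows; the rows are already in random order.
X_seen, y_seen = X[:375], y[:375]
X_new, y_new = X[375:], y[375:]

# With no depth limit, the tree can memorise the rows it was given.
tree = DecisionTreeClassifier(random_state=0).fit(X_seen, y_seen)

acc_seen = tree.score(X_seen, y_seen)  # accuracy on data the model has seen
acc_new = tree.score(X_new, y_new)     # accuracy on held-out data
print(acc_seen, acc_new)
```

The gap between the two accuracies is the symptom to watch for: the first number says how well the model memorised, and only the second says how well it predicts.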

When your data is organized correctly, you can split it using the `train_test_split` function from scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df)

This function accepts pandas `DataFrame`s and returns them as `DataFrame`s, so no worries there.
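A slightly fuller sketch of the split, using a hypothetical toy `DataFrame` (the column names `v1` and `outcome` are placeholders). The `test_size`, `random_state`, and `stratify` arguments are optional, but they are often worth setting explicitly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; `outcome` stands in for the label column.
df = pd.DataFrame({
    "v1": range(100),
    "outcome": [0] * 80 + [1] * 20,
})

df_train, df_test = train_test_split(
    df,
    test_size=0.25,          # hold out 25% of the rows (also the default)
    random_state=42,         # fix the shuffle so the split is reproducible
    stratify=df["outcome"],  # keep the class balance the same in both parts
)

print(len(df_train), len(df_test))
print(df_train["outcome"].mean(), df_test["outcome"].mean())
```

Stratifying matters most when one class is rare, as with loan defaults: without it, a small test set can end up with very few positive cases.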

## Creating Models

We organize our data in a way that makes analysis and access to the data easier, but the requirements are a bit different for building models.

I like the `patsy` package for this. You can read about it at [patsy.readthedocs.io](http://patsy.readthedocs.io).

`patsy` lets you define your models using R-style formulas and builds new data frames that are suitable for machine learning problems.

In [None]:
from patsy.highlevel import dmatrices

formula = "outcome ~ v1 + v2 + v3 + v4"
y_train, X_train = dmatrices(formula, data=df_train, return_type='dataframe')

The `formula` can be quite complicated - including polynomials and interactions between categories. It will handle categorical variables correctly by transforming them into a set of "dummy" variables (also known as one-hot encoding).
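As a small sketch of what such a formula produces, the example below uses a made-up `DataFrame` with one numeric and one categorical predictor (`income` and `region` are hypothetical names). `C(...)` marks a column as categorical and `I(...)` wraps an arithmetic expression such as a squared term.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical data: one numeric and one categorical predictor.
df = pd.DataFrame({
    "outcome": [0, 1, 0, 1, 1, 0],
    "income": [30.0, 45.0, 28.0, 60.0, 52.0, 33.0],
    "region": ["east", "west", "north", "west", "east", "north"],
})

# C(region) is expanded into dummy columns; I(income**2) adds a squared term.
formula = "outcome ~ income + I(income**2) + C(region)"
y, X = dmatrices(formula, data=df, return_type="dataframe")

# Inspect the generated design-matrix columns.
print(list(X.columns))
```

Two things are worth noticing in the output: `patsy` adds an `Intercept` column by default, and because of that it drops one level of `region` (here `east`) to act as the reference category.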

## Fitting Models

To create a model, you create an estimator object and then call its `fit` method.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [None]:
# patsy returns y_train as a one-column DataFrame; scikit-learn expects a 1-D array
model.fit(X_train, y_train.values.ravel())

In [None]:
y_pred_train = model.predict(X_train)
p_pred_train = model.predict_proba(X_train)[:, 1]
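The relationship between `predict` and `predict_proba` can be sketched on synthetic data (the data here is made up and is not the session's dataset). For a binary model, `predict_proba` returns one column per class, in the order given by `model.classes_`, which is why the code above takes column `1` for the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)  # shape (n_rows, 2); each row sums to 1
p_pos = proba[:, 1]             # probability of class 1
labels = model.predict(X)       # hard 0/1 predictions

# For a binary model, predict() picks the more probable class,
# which amounts to thresholding the class-1 probability at 0.5.
labels_from_proba = (p_pos > 0.5).astype(int)
print((labels == labels_from_proba).all())
```

Keeping the probabilities around is useful because you can pick a threshold other than 0.5 later, for example when missing a default costs more than a false alarm.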

## Model Evaluation

In [None]:
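from sklearn.metrics import roc_auc_score
# Build the test matrices with the same formula used for training
y_test, X_test = dmatrices(formula, data=df_test, return_type='dataframe')
# Probability of the positive class (second column of predict_proba)
p_pred_test = model.predict_proba(X_test)[:, 1]
auc_test = roc_auc_score(y_test.values.ravel(), p_pred_test)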
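One way to use this metric is to compute it on both the training and the test data and compare the two. The sketch below does that on synthetic data (made up for illustration, with plain NumPy arrays instead of `patsy` matrices). An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# roc_auc_score takes the true labels and a 1-D score for the positive class.
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(auc_train, auc_test)
```

A training AUC far above the test AUC is a sign of overfitting; if the two are close, the test figure is a reasonable estimate of how the model will rank new observations.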