# Integrating Pandas with Scikit-Learn, an Exciting New Workflow

## About Me - Ted Petrou

* Author of Pandas Cookbook
* Author of [Master Data Analysis with Python][2]
* Founder of Dunder Data - Expert Data Science Instruction
* Specialize in finding best practices for using the Python data science ecosystem
* Follow me on Twitter [@TedPetrou][3]

## Make sure you have scikit-learn 0.20 installed

## Major Objective
The major objective of this tutorial is to teach the latest and most robust workflows for those who use pandas for data exploration and scikit-learn for machine learning. The primary focus will be on the new features added in version 0.20 of scikit-learn, released in September 2018. [See the changelog here][1] for a list of all the new features.

[1]: https://scikit-learn.org/stable/whats_new.html#version-0-20-0
[2]: https://online.dunderdata.com/courses/master-data-analysis-with-python-volume-1-foundations-of-data-exploration
[3]: https://twitter.com/tedpetrou

## 1. The Scikit-Learn Estimator
The scikit-learn library has one primary type of object to do machine learning - the **estimator**.

All estimators:
* Learn from data
* Are Python types (classes)
* Are written in CamelCase
* Use the three-step process: import, instantiate, fit

Types of estimators:
The following are common types of estimators. [Visit the glossary][1] to see more types.
* Regressors - Supervised learning with continuous target
* Classifiers - Supervised learning with categorical target
* Clusterers - Unsupervised learning
* Transformers - Transform the input/output data
* Meta-estimators - Learn from other estimators
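
The estimator types listed above can be sketched with one example of each. The specific classes chosen here (`LogisticRegression`, `KMeans`, `StandardScaler`, `GridSearchCV`) are illustrative picks that do not appear elsewhere in this tutorial, and the `hasattr` check for transformers is only a rough heuristic, not an official test.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# One instance of each estimator type (illustrative choices).
estimators = {
    "regressor": LinearRegression(),
    "classifier": LogisticRegression(),
    "clusterer": KMeans(n_clusters=2),
    "transformer": StandardScaler(),
    # A meta-estimator wraps another estimator.
    "meta-estimator": GridSearchCV(
        LinearRegression(), param_grid={"fit_intercept": [True, False]}
    ),
}

# Every estimator, whatever its type, learns from data through a fit method.
all_have_fit = all(hasattr(est, "fit") for est in estimators.values())

# scikit-learn ships helpers that identify regressors and classifiers.
regressor_flag = is_regressor(estimators["regressor"])
classifier_flag = is_classifier(estimators["classifier"])

# Rough heuristic: transformers expose a transform method.
transformer_flag = hasattr(estimators["transformer"], "transform")
```

Nothing is fitted here; the point is only that each type is an ordinary Python class that you instantiate before use.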

### Helper Functions
Nearly every object in scikit-learn is either an estimator or a **helper function**. Each helper function performs a single task and is written in snake_case.
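
As a small sketch of the distinction, the snippet below checks that an estimator is a class while two commonly used helpers, `train_test_split` and `mean_squared_error`, are not. These two helpers are chosen only for illustration, and the six-row dataset is made up.

```python
import inspect

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Estimators are classes (CamelCase); helpers are plain callables (snake_case).
estimator_is_class = inspect.isclass(LinearRegression)
helpers_are_not_classes = not any(
    inspect.isclass(f) for f in (train_test_split, mean_squared_error)
)

# A helper performs one task, here splitting six made-up rows into two sets.
X = [[1], [2], [3], [4], [5], [6]]
y = [10, 20, 30, 40, 50, 60]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)
```

With `test_size=2`, two rows land in the test set and the remaining four in the training set.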

### Finding estimators in the scikit-learn API
The scikit-learn package is divided neatly into about 35 modules. Most modules contain several estimators. It can be valuable to take a look at the entire scikit-learn API to help learn which module an estimator is located in. Note that the estimators (CamelCase) are listed first in the modules followed by the helper functions. We display the API in the notebook below.

[1]: https://scikit-learn.org/stable/glossary.html#class-apis-and-estimator-types

In [None]:
from IPython.display import IFrame
IFrame('https://scikit-learn.org/stable/modules/classes.html', 800, 600)

## Common Estimators and Helper Functions

The complete API above is huge, and many of its estimators and helper functions are needed only in very specific circumstances.

### House - Room - Object
I like to analogize scikit-learn to a house, where the modules are the rooms, and the estimators and helper functions are the objects in the room. The following house consolidates the most common estimators and helper functions into one diagram.

![](images/scikit_house.png)

## The Housing Dataset
We will be using the housing dataset from the ["Advanced Regression Techniques" Kaggle competition][1] for the entire tutorial. The full training dataset contains 79 explanatory variables along with the sale price target variable for 1,460 houses in Ames, Iowa, sold from 2006 to 2010. We will begin by reading in the `housing_sample` dataset, which contains a small subset of the columns.

[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In [1]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
hs.head()

|   | Neighborhood | Exterior1st | YearBuilt | LotFrontage | GrLivArea | GarageArea | SalePrice |
|---|---|---|---|---|---|---|---|
| 0 | CollgCr | VinylSd | 2003 | 65.0 | 1710 | 548 | 208500 |
| 1 | Other | Other | 1976 | 80.0 | 1262 | 460 | 181500 |
| 2 | CollgCr | VinylSd | 2001 | 68.0 | 1786 | 608 | 223500 |
| 3 | Other | Other | 1915 | 60.0 | 1717 | 642 | 140000 |
| 4 | Other | VinylSd | 2000 | 84.0 | 2198 | 836 | 250000 |


In [9]:
hs.columns

Index(['Neighborhood', 'Exterior1st', 'YearBuilt', 'LotFrontage', 'GrLivArea',
       'GarageArea', 'SalePrice'],
      dtype='object')

In [3]:
hs['Neighborhood'].value_counts()

Other      863
NAmes      224
CollgCr    148
OldTown    112
Edwards     99
Name: Neighborhood, dtype: int64

Get the shape, data types, and number of missing values:

In [4]:
hs.shape

(1460, 7)

In [5]:
hs.dtypes

Neighborhood     object
Exterior1st      object
YearBuilt         int64
LotFrontage     float64
GrLivArea         int64
GarageArea        int64
SalePrice         int64
dtype: object

In [6]:
hs.isna().sum()

Neighborhood     14
Exterior1st      43
YearBuilt         0
LotFrontage     259
GrLivArea         0
GarageArea        0
SalePrice         0
dtype: int64

## Prepare Data - scikit-learn gotchas

Before we can learn from the data, we need to prepare it so that it works with scikit-learn estimators. 

* Assign input data to `X` and output to `y` - convention used throughout scikit-learn documentation
* Input data must be two-dimensional
* Input data must be numeric (no strings)
* Input and output data cannot contain missing values

Some transformers can handle data that is non-numeric or contains missing values, but the machine learning estimators (regressors, classifiers, and clusterers) generally cannot.
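
The requirements above can be checked with pandas before anything is handed to scikit-learn. The sketch below uses a tiny hand-typed stand-in for the housing data, and `usable_columns` is a hypothetical helper written for this example, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the housing data (hand-typed, not read from the real file).
df = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Other", None],
    "LotFrontage": [65.0, np.nan, 68.0],
    "GrLivArea": [1710, 1262, 1786],
    "SalePrice": [208500, 181500, 223500],
})


def usable_columns(frame):
    """Hypothetical helper: columns that are numeric and have no missing values."""
    numeric = frame.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].notna().all()]


cols = usable_columns(df.drop(columns="SalePrice"))
X = df[cols].to_numpy()         # two-dimensional input
y = df["SalePrice"].to_numpy()  # one-dimensional output
print(cols, X.ndim, y.ndim)
```

Here only `GrLivArea` survives: `Neighborhood` holds strings and `LotFrontage` has a missing value.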

### Model sale price with ground living area

Let's use the ground living area to model the sale price of the house and extract the data into NumPy arrays. It is possible to keep the data in pandas DataFrames/Series, but scikit-learn has historically been designed to work with NumPy arrays. Note that we use the `pop` method, which returns the `SalePrice` column and removes it from our DataFrame.

In [None]:
X = hs[['GrLivArea']].values
y = hs.pop('SalePrice').values

View the first five input and output values. Notice that `X` is two-dimensional.

In [None]:
X[:5]

In [None]:
y[:5]
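
A common stumbling block is the difference between selecting with a list of column names and selecting with a single name. The following sketch retypes the first three rows of the sample by hand and shows how each selection affects the array's dimensions, along with what `pop` does to the DataFrame.

```python
import pandas as pd

# First three rows of the sample, retyped by hand for illustration.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "SalePrice": [208500, 181500, 223500],
})

X_2d = df[["GrLivArea"]].values   # list of names -> DataFrame -> 2-D array
x_1d = df["GrLivArea"].values     # single name -> Series -> 1-D array
X_fixed = x_1d.reshape(-1, 1)     # turn a 1-D array into a single column

y = df.pop("SalePrice").values    # pop returns the column and drops it from df
print(X_2d.shape, x_1d.shape, list(df.columns))
```

If you ever end up with a one-dimensional input, `reshape(-1, 1)` gives it the single-column, two-dimensional shape that estimators expect.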

## Import, Instantiate, Fit — The three-step process for each estimator
The scikit-learn API is consistent for all estimators and uses the same three-step process to learn from the data.

* **Import** the estimator from its module
* **Instantiate** the estimator, possibly changing the (hyper)parameters
* **Fit** the estimator to the data

## Linear regression with the three-step process

For this problem, we need to use an estimator that is a [Regressor][1] - one that models target variables with continuous values such as sale price. The word 'Regressor' is often contained within the names of these estimators. Specifically, we will do linear regression by importing the `LinearRegression` estimator from the `linear_model` module.

### Step 1: Import
Open up the scikit-learn house (package), go to the `linear_model` room (module) and select the `LinearRegression` object (estimator) to import.

[1]: https://scikit-learn.org/stable/glossary.html#term-regressors

In [None]:
from sklearn.linear_model import LinearRegression

### Step 2: Instantiate
The `LinearRegression` object above is merely a blueprint. We must instantiate it (construct an instance of the class) in order to have an object that can learn from the data. I sometimes refer to this as "constructing the machine learning vehicle" from the blueprint. I typically use the first letters of each word of the estimator class name as the variable name for the instance.

In [None]:
lr = LinearRegression()

### Step 3: Fit
All estimators learn from the data via the `fit` method. In this particular case, the estimator learns the parameters of the linear regression model (the slope and intercept) that minimize the sum of squared errors.

In [None]:
lr.fit(X, y)
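
To make the least-squares idea concrete, here is a small sketch that fits only the five rows shown in the sample output earlier and compares the result with NumPy's `polyfit`, which solves the same problem for a straight line. The `sse` function is a helper defined just for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# First five rows of the sample output shown earlier (GrLivArea, SalePrice).
X = np.array([[1710], [1262], [1786], [1717], [2198]])
y = np.array([208500, 181500, 223500, 140000, 250000])

lr = LinearRegression()
lr.fit(X, y)

# np.polyfit with degree 1 solves the same least-squares problem.
slope, intercept = np.polyfit(X[:, 0], y, deg=1)


def sse(m, b):
    """Sum of squared errors for the line y = m * x + b."""
    return float(np.sum((y - (m * X[:, 0] + b)) ** 2))


fitted_sse = sse(lr.coef_[0], lr.intercept_)
nudged_sse = sse(lr.coef_[0] * 1.01, lr.intercept_)  # a slightly different line
```

The fitted line should agree with `polyfit` up to floating-point error, and nudging the slope away from the fitted value increases the squared error.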

## Estimated Parameters - end in a single underscore
scikit-learn stores parameters learned from the data as public attributes that end in a single underscore. For linear regression, it stores the intercept and coefficient (slope) as separate attributes.

In [None]:
lr.intercept_

In [None]:
lr.coef_
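
The trailing-underscore convention can be seen on a toy example. The sketch below uses synthetic points lying exactly on the line y = 3x + 2 (an assumption made so the learned values are easy to recognize) and checks for the attribute before and after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 3x + 2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([2.0, 5.0, 8.0, 11.0])

lr = LinearRegression()
exists_before = hasattr(lr, "coef_")   # learned attributes do not exist yet

lr.fit(X, y)
exists_after = hasattr(lr, "coef_")    # created by fit

# coef_ holds one slope per input column; intercept_ is a single number here.
print(lr.coef_.shape, lr.coef_, lr.intercept_)
```

Attributes ending in an underscore appear only after `fit` has run, which is a handy way to tell whether an estimator has been trained.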

## Make predictions
Once the model is trained, you will be able to use the `predict` method to make predictions. Pass it a two-dimensional input array with the same columns used during training. Here, we predict using our original training data.

In [None]:
lr.predict(X)
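
For linear regression, a prediction is just the intercept plus the coefficient times the input. This sketch verifies that on synthetic data lying on y = 3x + 2 (not the housing data) and predicts for two inputs the model has not seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data lying exactly on the line y = 3x + 2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([2.0, 5.0, 8.0, 11.0])
lr = LinearRegression().fit(X, y)

# New inputs must be two-dimensional, just like the training data.
X_new = np.array([[4.0], [10.0]])
predictions = lr.predict(X_new)

# The same numbers computed by hand from the learned parameters.
manual = lr.intercept_ + X_new[:, 0] * lr.coef_[0]
```

Both arrays should hold values very close to 14 and 32, one prediction per input row.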

## Summary of commands

In [None]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
X = hs[['GrLivArea']].values
y = hs.pop('SalePrice').values

from sklearn.linear_model import LinearRegression  # step 1 - import
lr = LinearRegression()                            # step 2 - instantiate
lr.fit(X, y)                                       # step 3 - fit

lr.predict(X)

## Exercise
All other regression estimators use the same three-step process to learn from the data. Complete the three-step process for the following models:
* K-nearest neighbors
* Decision trees
* Random forests
* Gradient boosted trees

The learned model can change drastically depending on the hyperparameters set in step 2 during instantiation. We aren't concerned with hyperparameters at this point. Also, you may choose input data from other columns that have no missing values.
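
To illustrate how much a hyperparameter can matter, here is a small sketch using `KNeighborsRegressor` on four made-up points rather than the housing data, so the exercise itself is left to you. The two values of `n_neighbors` are chosen as extremes for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four made-up training points.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Same estimator class, two different hyperparameter settings.
knn_1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
knn_4 = KNeighborsRegressor(n_neighbors=4).fit(X, y)

pred_1 = knn_1.predict(X)  # each point's nearest neighbor is itself
pred_4 = knn_4.predict(X)  # every prediction averages all four targets
```

With one neighbor the model reproduces the training targets exactly, while with four neighbors it predicts the overall mean of 25 everywhere.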