<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Modeling Walkthrough

_Authors: Riley Dallas (AUS), Adi Bronshtein (Live Online)_

---

### Learning Objectives
*After this lesson, you will be able to:*

- Gather, clean, explore and model a dataset from scratch.
- Split data into training and testing sets using both train/test split and cross-validation, and apply both techniques to score a model.


## Importing libraries
---

We'll need the following libraries for today's lesson:

1. `pandas`
2. `numpy`
3. `seaborn`
4. `matplotlib.pyplot`
5. `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
6. `LinearRegression` from `sklearn`'s `linear_model` module
7. `r2_score` and `mean_squared_error` from `sklearn`'s `metrics` module

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

## Load the Data

---

Today's [dataset](http://www-bcf.usc.edu/~gareth/ISL/data.html) (`College.csv`) is from the [ISLR website](http://www-bcf.usc.edu/~gareth/ISL/). 

Rename `Unnamed: 0` to `University`.
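
A minimal sketch of the load-and-rename step. To stay self-contained it reads a two-row stand-in with made-up values from a string; with the real data you would pass the path to your local copy of `College.csv` instead (the path and the `college_sample` name are assumptions).

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for College.csv (values are made up).
# The empty first header field is what pandas labels "Unnamed: 0".
sample_csv = io.StringIO(
    ",Private,Apps,Accept\n"
    "College A,Yes,1660,1232\n"
    "College B,No,2186,1924\n"
)

# With the real file this would be something like pd.read_csv("College.csv")
college_sample = pd.read_csv(sample_csv)

# rename() returns a new DataFrame, so reassign to keep the change
college_sample = college_sample.rename(columns={"Unnamed: 0": "University"})
print(college_sample.columns.tolist())
```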

## Data cleaning: Initial check
---

Check the following in the cells below:
1. Do we have any null values?
2. Are any numerical columns being read in as `object`?

In [None]:
# Check for nulls


In [None]:
# Check column data types


![](./assets/modeling-data-cleaning.jpg)

## Data cleaning: Clean up `PhD` column
---

`PhD` is being read in as a string because some of the cells contain non-numerical values. In the cell below, replace any non-numerical values with `NaN`'s, and change the column datatype to float.
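
Besides the approaches in the cells below, `pd.to_numeric` with `errors="coerce"` is another option. This is a sketch on a made-up stand-in Series (`phd_sample` is a hypothetical name), assuming the non-numerical entries are placeholder strings such as `'?'`.

```python
import pandas as pd

# Hypothetical stand-in for the PhD column (values are made up)
phd_sample = pd.Series(["70", "29", "?", "53"], name="PhD")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
phd_clean = pd.to_numeric(phd_sample, errors="coerce").astype(float)

print(phd_clean.dtype)         # dtype of the cleaned column
print(phd_clean.isna().sum())  # number of entries that became NaN
```

The trade-off is that coercion silently converts every unparseable value, so it is worth checking `value_counts()` first to see what is being replaced.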

In [None]:
# One way - get value_counts()


In [None]:
# Second way - get unique values


In [None]:
# If it's a very long list, we can sort it!


In [None]:
# Grab all the universities and colleges with missing PhD data


In [None]:
# Use the "replace()" method:

# # To make it stick, reassign to the column (uncomment to run).
# # Note: replace(..., inplace=True) returns None, so .astype() cannot be chained onto it.
# college['PhD'] = college['PhD'].replace(to_replace='?', value=np.nan).astype(float)

In [None]:
# Use a lambda function

# # if we want to make it "stick" - we need to reassign to the column (uncomment to run)
# college["PhD"] = college['PhD'].apply(lambda x:np.nan if x == '?' else float(x))

In [None]:
# Third way - using a (defined) function


In [None]:
# Map the column using the function


In [None]:
# Make sure the PhD column is now a float
college.dtypes

## Data cleaning: Drop `NaN`'s
---

Since only a small percentage of cells are null, let's go ahead and drop those rows.
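
A minimal sketch of counting and dropping missing values, using a made-up four-row DataFrame (`toy` and its values are hypothetical, not the real data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing PhD value
toy = pd.DataFrame({
    "University": ["A", "B", "C", "D"],
    "PhD": [70.0, np.nan, 53.0, 92.0],
})

# Null count per column, and the share of rows with a missing PhD value
print(toy.isnull().sum())
print(toy["PhD"].isnull().mean())

rows_before = toy.shape[0]

# dropna() returns a new DataFrame without the rows that contain NaN
toy_clean = toy.dropna()
print(rows_before, toy_clean.shape[0])
```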

In [None]:
# Get the number of missing rows


In [None]:
# Get the number of rows before dropping missing values


In [None]:
# Drop the missing values, check the shape


## Feature engineering: Binarize `'Private'` column
---

In the cells below, convert the `Private` column into numerical values.
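
As a quick illustration of one option, here is `map()` with a dictionary on a made-up Series. It assumes the column holds the strings `'Yes'` and `'No'`; confirm that with `value_counts()` before applying it to the real column.

```python
import pandas as pd

# Hypothetical stand-in for the Private column (assumed 'Yes'/'No' values)
private_sample = pd.Series(["Yes", "No", "Yes"], name="Private")

# map() returns a new Series; any value missing from the dict becomes NaN
private_binary = private_sample.map({"Yes": 1, "No": 0})
print(private_binary.tolist())
```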

In [None]:
college['Private'].value_counts()

In [None]:
# With lambda function:


In [None]:
# Use the "map()" method (to make it "stick" we'll need to reassign to the column)


In [None]:
# Use the "replace()" method (to make it stick, reassign to the column)


In [None]:
# Using np.where() (need to reassign to column)


In [None]:
# Check changes

## Feature engineering: Create an `Elite` column
---

`Top10perc` is the percentage of enrolled students who graduated in the top 10% of their high school class. Let's create a column called `Elite` that has the following values:
- 1 if `Top10perc` is greater than or equal to 50%
- 0 if `Top10perc` is less than 50%
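
One way to sketch this rule is a boolean comparison cast to integers. The DataFrame below is a made-up stand-in (`top10_sample` is a hypothetical name) that includes the boundary value 50.

```python
import pandas as pd

# Hypothetical stand-in values, including the boundary case of exactly 50
top10_sample = pd.DataFrame({"Top10perc": [23, 50, 76]})

# The comparison yields True/False; astype(int) converts that to 1/0
top10_sample["Elite"] = (top10_sample["Top10perc"] >= 50).astype(int)
print(top10_sample)
```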

## EDA: Plot a Heatmap of the Correlation Matrix
---

Heatmaps are an effective way to visually examine the correlational structure of your predictors. 

In [None]:
plt.figure(figsize=(12, 10))
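# numeric_only=True skips text columns such as University, which newer pandas versions will not ignore on their own
sns.heatmap(college.corr(numeric_only=True), cmap='twilight_r', annot=True, vmin=-1, vmax=1, linewidths=1);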

In [None]:
# Heatmap of "Apps" with other features only!

## EDA: Use seaborn's `pairplot()` function to create scatterplots
---

Let's create a pairplot to see how some of our stronger predictors correlate with our target (`Apps`). Instead of creating a pairplot of the entire DataFrame, we can use the `y_vars` and `x_vars` params to plot only a small subset of columns.
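
A small sketch of the `x_vars`/`y_vars` usage on made-up rows. The choice of `Accept` and `Enroll` as predictors is an assumption for illustration; pick whichever columns stand out in your heatmap.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (values are made up)
toy = pd.DataFrame({
    "Apps": [1660, 2186, 1428, 417, 193],
    "Accept": [1232, 1924, 1097, 349, 146],
    "Enroll": [721, 512, 336, 137, 55],
})

# One row of scatterplots: each predictor on the x-axis, the target on the y-axis
grid = sns.pairplot(toy, x_vars=["Accept", "Enroll"], y_vars=["Apps"])
```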

## EDA: Create histograms of all numerical columns
---

## EDA: Boxplots
---

In the cells below, create two boxplots:
1. One for our target (`Apps`)
2. And one for our strongest predictor (`Accept`)

## Model Prep: Create our features matrix (`X`) and target vector (`y`)
---

Every **numerical** column (that is not our target) will be used as a feature.

The `Apps` column is our label: the number of applications received by that university.

In the cell below, create your `X` and `y` variables.
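
As a rough sketch of the idea (not necessarily the solution code), `select_dtypes` can pick out the numerical columns, and a list comprehension can drop the target. The `toy` DataFrame and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in data: one text column, the target, and two features
toy = pd.DataFrame({
    "University": ["A", "B", "C"],
    "Apps": [1660, 2186, 1428],
    "Accept": [1232, 1924, 1097],
    "Private": [1, 0, 1],
})

# Keep only the numerical columns, then exclude the target
numeric_cols = toy.select_dtypes(include="number").columns
features = [col for col in numeric_cols if col != "Apps"]

X = toy[features]  # features matrix (DataFrame)
y = toy["Apps"]    # target vector (Series)
print(features)
```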

**First way (of grabbing all numerical columns) - same as solution code**:

In [None]:
# This gives me a list of numerical columns


In [None]:
# Using list comprehension - grab all columns EXCEPT the "Apps"


**Second way**:

## Model Prep: Train/test split
---

We always want to have a holdout set to test our model. Use the `train_test_split` function to split our `X` and `y` variables into a training set and a holdout set.
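
A minimal sketch of the split on toy arrays. The `test_size` and `random_state` values are illustrative choices, not values prescribed by this lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Hold out 25% of the rows; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```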

## Model Prep: Instantiate our model
---

Create an instance of `LinearRegression` in the cell below.

In [None]:
linreg = LinearRegression()

## Cross validation
---

Use `cross_val_score` to evaluate our model.
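
For reference, here is a sketch of `cross_val_score` on synthetic data (the arrays and coefficients are made up). With a regressor and no `scoring` argument, each score is the estimator's default score, which is R-squared for `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# cv=5 fits the model five times, each time scoring on a different held-out fold
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)
print(scores)         # one R-squared per fold
print(scores.mean())  # average cross-validated R-squared
```

In the lesson itself you would typically pass the training split rather than the full dataset, so the holdout set stays untouched.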

In [None]:
# This gives us 5 cross validated testing scores (R-squared)


In [None]:
# This gives us an average cross validated testing score (R-squared)


## Model Fitting and Evaluation
---

Fit the model to the training data, and evaluate the training and test scores below.
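
A sketch of the fit-then-score pattern on synthetic data (all names and values here are made up). Comparing the two scores is a quick check for overfitting: a training score far above the test score suggests the model does not generalize well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: two features with a linear signal plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=42)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R-squared for a regressor
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```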

In [None]:
# Fit the model on training data


In [None]:
# Training score


In [None]:
# Testing score


**Let's check the residuals! (errors)**
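
As a small worked example with made-up numbers: MSE is the mean of the squared residuals, and RMSE is its square root, which puts the error back in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

# Residuals are 0, 0, 0, -4, so the squared errors are 0, 0, 0, 16
residuals = y_true - y_pred

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors: 16 / 4 = 4.0
rmse = np.sqrt(mse)                       # square root of the MSE: 2.0
print(mse, rmse)
```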

In [None]:
# Let's create predictions!


In [None]:
# MSE for training


In [None]:
# MSE for testing


In [None]:
# RMSE for training

In [None]:
# RMSE for testing


In [None]:
# Look at the coefficients


In [None]:
# Zip the coefficients and features together 


In [None]:
# Put it into a dataframe