<a id="feature-engineering-to-improve-performance"></a>
## Feature Engineering to Improve Performance
---

Machine learning models are powerful, but they cannot automatically handle every aspect of our data. We often have to explicitly transform our features so that they express relationships our models can understand. In this case, since we are using linear regression, we want features that have a roughly linear relationship with our response variable.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")

df = pd.read_csv(r'..\data\bikeshare.csv', index_col='datetime', parse_dates=True)

In [2]:
df.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
2011-01-01 00:00:00,Spring,0,0,Clear Skies,9.84,14.395,81,0.0,16
2011-01-01 01:00:00,Spring,0,0,Clear Skies,9.02,13.635,80,0.0,40
2011-01-01 02:00:00,Spring,0,0,Clear Skies,9.02,13.635,80,0.0,32
2011-01-01 03:00:00,Spring,0,0,Clear Skies,9.84,14.395,75,0.0,13
2011-01-01 04:00:00,Spring,0,0,Clear Skies,9.84,14.395,75,0.0,1


<a id="handling-categorical-features"></a>
### Handling Categorical Features

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

- **Ordered categories:** Transform them to sensible numeric values (example: small=1, medium=2, large=3)
- **Unordered categories:** Use dummy encoding (0/1). Here, each possible category would become a separate feature.

What are the categorical features in our data set?

- **Ordered categories:** `weather`
- **Unordered categories:** `season`, `holiday` (already dummy encoded), `workingday` (already dummy encoded)

For season, we can't simply encode the values as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables.

### Your Turn:

Transform the values in the `weather` column so that they map to the following values:

 - Heavy Storms/Rain: 1
 - Light Storms/Rain: 2
 - Partly Cloudy: 3
 - Clear Skies: 4

In [3]:
weather_map = {
    'Clear Skies':       4,
    'Partly Cloudy':     3,
    'Light Storms/Rain': 2,
    'Heavy Storms/Rain': 1
}

df.weather = df.weather.map(weather_map)

In [4]:
df.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
2011-01-01 00:00:00,Spring,0,0,4,9.84,14.395,81,0.0,16
2011-01-01 01:00:00,Spring,0,0,4,9.02,13.635,80,0.0,40
2011-01-01 02:00:00,Spring,0,0,4,9.02,13.635,80,0.0,32
2011-01-01 03:00:00,Spring,0,0,4,9.84,14.395,75,0.0,13
2011-01-01 04:00:00,Spring,0,0,4,9.84,14.395,75,0.0,1


### Categorical Variables

 - Exist when there's no inherent order to any of the values
  - Male/Female
  - California/Ohio/New York, etc.
  
 - For these values, you want to represent each category as its own 0/1 column
 - This is also called one-hot encoding
 - The easiest way to do this is with `pd.get_dummies()`

In [5]:
pd.get_dummies(df.season).head()

datetime,Fall,Spring,Summer,Winter
2011-01-01 00:00:00,0,1,0,0
2011-01-01 01:00:00,0,1,0,0
2011-01-01 02:00:00,0,1,0,0
2011-01-01 03:00:00,0,1,0,0
2011-01-01 04:00:00,0,1,0,0


In [6]:
pd.get_dummies(df.season, drop_first=True).head()

datetime,Spring,Summer,Winter
2011-01-01 00:00:00,1,0,0
2011-01-01 01:00:00,1,0,0
2011-01-01 02:00:00,1,0,0
2011-01-01 03:00:00,1,0,0
2011-01-01 04:00:00,1,0,0
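
The `drop_first=True` argument removes one dummy column. A minimal sketch of why that loses no information, using a made-up `season` series rather than the bikeshare data: every row of the full encoding sums to 1, so the dropped column can always be rebuilt from the others. With an intercept in a linear model, keeping all four columns would make them redundant.

```python
import pandas as pd

# Toy stand-in for the season column (illustrative values, not the real data)
season = pd.Series(['Spring', 'Summer', 'Fall', 'Winter', 'Spring'], name='season')

full = pd.get_dummies(season, dtype=int)                      # one column per category
reduced = pd.get_dummies(season, drop_first=True, dtype=int)  # first category (Fall) dropped

# Every row of the full encoding sums to 1, so any one column is implied by the rest
row_sums = full.sum(axis=1)

# The dropped 'Fall' column can be rebuilt from the remaining three
fall_rebuilt = 1 - reduced.sum(axis=1)

print(reduced.columns.tolist())
print((fall_rebuilt == full['Fall']).all())
```

The dropped category becomes the baseline: the coefficients on the remaining dummies are read relative to it.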


In [7]:
# this one-hot-encodes all categorical variables
df = pd.get_dummies(df, drop_first=True)
df.head()

datetime,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,season_Spring,season_Summer,season_Winter
2011-01-01 00:00:00,0,0,4,9.84,14.395,81,0.0,16,1,0,0
2011-01-01 01:00:00,0,0,4,9.02,13.635,80,0.0,40,1,0,0
2011-01-01 02:00:00,0,0,4,9.02,13.635,80,0.0,32,1,0,0
2011-01-01 03:00:00,0,0,4,9.84,14.395,75,0.0,13,1,0,0
2011-01-01 04:00:00,0,0,4,9.84,14.395,75,0.0,1,1,0,0


### Your Turn:

 - Re-run linear regression on this dataset using two different variants:
  - the entire dataset we currently have
  - the original version we used on Monday with four variables: `weather`, `temp`, `atemp`, `humidity`
 - Make sure to standardize your data
 - Write down the version with the highest r-squared value, and the value of that metric
 - Make sure to check which coefficients have the largest weights (by absolute value) in your model

In [11]:
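from sklearn.linear_model import LinearRegression

# Use every column except the target as a feature
X = df.loc[:, df.columns != 'count']
y = df['count']

# Standardize each feature to mean 0 and standard deviation 1
X_std = (X - X.mean()) / X.std()

lreg = LinearRegression()
lreg.fit(X_std, y)
lreg.score(X_std, y)  # R-squared on the same data the model was fit on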

0.2752503232099429

In [14]:
coeffs = pd.DataFrame({
    'Variable': X_std.columns,
    'Weight'  : lreg.coef_
}).sort_values(by='Weight', ascending=False)
coeffs

,Variable,Weight
3,temp,62.038316
9,season_Winter,44.661416
4,atemp,24.652284
7,season_Spring,16.284213
8,season_Summer,15.168533
6,windspeed,4.420027
1,workingday,-1.333938
0,holiday,-1.437827
2,weather,-1.680746
5,humidity,-54.257686


### A Few Points About Categorical Data:

 - Ambiguity over whether or not a variable is truly 'ordered':
  - The more conservative assumption is to use one-hot encoding
 - Be careful with categorical columns that have a large number of unique values
  - What if your dataset has more columns than rows?
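
To see why more columns than rows is a problem, here is a small sketch on pure noise (synthetic data, not one of the course datasets). With 50 columns and only 20 rows, a linear model can reproduce the training targets exactly even though there is nothing real to learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 20 rows but 50 columns of pure noise; y is unrelated to X by construction
X_train = rng.normal(size=(20, 50))
y_train = rng.normal(size=20)
X_new = rng.normal(size=(20, 50))
y_new = rng.normal(size=20)

model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # essentially 1.0: the noise is fit exactly
new_r2 = model.score(X_new, y_new)        # much lower on rows the model has not seen

print(train_r2, new_r2)
```

A one-hot encoded column with hundreds of unique values can push a small dataset into this situation.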

### With High Dimensional Categories:

 - Often need to re-assemble them in some way:
  - Group them into a smaller number of categories:
    - Bin street addresses into neighborhoods
  - Extract a smaller piece of information from them:
    - Someone's surname or greeting from their full name
    - Someone's deck level from their seat on a ticket
  - Group together values with a low count into one aggregate value:
    - 'Other', 'N/A', etc.
 - Generally you will want at least 10 occurrences of a particular value in order to use it
 - Linear models (and others) tend to be problematic or dysfunctional when the number of columns exceeds the number of rows
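
The ideas above can be sketched in pandas as follows. The `city` and `names` series are hypothetical stand-ins, and the threshold of 10 comes from the rule of thumb above.

```python
import pandas as pd

# Hypothetical city column: two common values and two rare ones
city = pd.Series(['Sacramento'] * 30 + ['Elk Grove'] * 12 + ['Galt'] * 3 + ['Lodi'] * 1)

min_count = 10  # rule-of-thumb threshold for keeping a value
counts = city.value_counts()
rare = counts[counts < min_count].index

# Replace rare values with a single aggregate label before dummy encoding
city_grouped = city.where(~city.isin(rare), 'Other')
print(city_grouped.value_counts())

# Extract a smaller piece of information: the greeting from a "Surname, Title. First" name
names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
title = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(title.tolist())
```

After grouping, `pd.get_dummies(city_grouped)` produces three columns instead of four, and the gap widens quickly on real data with many rare values.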

### Your Turn:

Re-open the Sacramento housing dataset, this time using the one with categorical values.

These categories are a bit problematic, because some of them have a very large number of unique values.

Try re-running the model we created on Monday, but this time incorporating at least one of the following variables:

 - **Zip Code**:  Is encoded as a number, but is really a category
 - **type**: What type of land it is
 - **city**: Which city the property is in

Compare your r_squared value to the one we had on Monday, which was 0.182.

### Evaluation Metrics for Regression Problems

In addition to the r_squared value, here are two common metrics for regression problems:

**Mean absolute error (MAE)** is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean squared error (MSE)** is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
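
As a worked example of both formulas, here is a small sketch on four made-up values, computed by hand with NumPy and then checked against the scikit-learn helpers.

```python
import numpy as np
from sklearn import metrics

# Small made-up example: actual values and predictions
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 18, 33, 36])

errors = y_true - y_pred          # [-2, 2, -3, 4]
mae = np.mean(np.abs(errors))     # (2 + 2 + 3 + 4) / 4 = 2.75
mse = np.mean(errors ** 2)        # (4 + 4 + 9 + 16) / 4 = 8.25

# The scikit-learn helpers compute the same quantities
sk_mae = metrics.mean_absolute_error(y_true, y_pred)
sk_mse = metrics.mean_squared_error(y_true, y_pred)

print(mae, sk_mae)
print(mse, sk_mse)
```

Because MSE squares each error, the single error of 4 contributes 16 of the 33 total, so MSE penalizes large misses more heavily than MAE does.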

### Take 5 Minutes:  

 - Refit your model on the bikeshare data, using the entire dataset

### SKLearn:  Preprocessing & Metrics

 - Useful modules that automate common, repetitive calculations
 - Preprocessing:
   - Allows you to standardize data
   - Impute missing values
 - Metrics:
   - Practically any scoring metric you could think of
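
A minimal sketch of the preprocessing side, using a made-up frame rather than the bikeshare data: `SimpleImputer` fills a missing value and `StandardScaler` standardizes the columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up frame with one missing temperature reading
toy = pd.DataFrame({'temp': [10.0, np.nan, 14.0, 18.0],
                    'humidity': [80.0, 75.0, 60.0, 65.0]})

# Fill the gap with the column mean: (10 + 14 + 18) / 3 = 14
imputed = SimpleImputer(strategy='mean').fit_transform(toy)

# Rescale each column to mean 0 and unit variance
scaled = StandardScaler().fit_transform(imputed)

print(imputed[:, 0])
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that `StandardScaler` divides by the population standard deviation, while pandas' `.std()` defaults to the sample standard deviation, so the results differ slightly from a manual `(X - X.mean()) / X.std()`.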

In [None]:
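from sklearn import metrics

# `df` has no 'prediction' column, so generate predictions from the fitted model
# (assumes `lreg` has been fit on X_std and y, as in the earlier cell)
y_pred = lreg.predict(X_std)

print('MAE:', metrics.mean_absolute_error(y, y_pred))
print('MSE:', metrics.mean_squared_error(y, y_pred))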

### Feature Engineering

 - The practice of deriving new features from existing ones so that they better capture relationships in the data
 - Common practices:
  - Capturing a numeric relationship between different quantitative variables:
   - e.g., the ratio of two columns (Profit/Sales, etc.)
   - Log-transforming numeric columns that are skewed or otherwise not bell-shaped
 - Transforming categorical values with low counts into something more useful
 - Extracting additional information from dates!
  - pandas gives you a lot of useful options here
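
The practices listed above can be sketched on a small made-up frame. The column names `profit` and `sales` are illustrative only and do not come from the bikeshare data.

```python
import numpy as np
import pandas as pd

# Made-up daily frame with a DatetimeIndex
idx = pd.date_range('2011-01-01', periods=4, freq='D')
toy = pd.DataFrame({'profit': [5.0, 20.0, 1.0, 50.0],
                    'sales': [50.0, 100.0, 20.0, 200.0]}, index=idx)

toy['margin'] = toy['profit'] / toy['sales']  # ratio of two columns
toy['log_sales'] = np.log1p(toy['sales'])     # log(1 + x) compresses a long right tail
toy['dayofweek'] = toy.index.dayofweek        # Monday=0 ... Sunday=6
toy['month'] = toy.index.month                # calendar month, 1 to 12

print(toy)
```

Date parts such as hour or day of week are usually better treated as categories than as numbers, since their relationship with the target is rarely linear.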

In [16]:
df['Hour'] = df.index.hour
df.head()

datetime,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,season_Spring,season_Summer,season_Winter,Hour
2011-01-01 00:00:00,0,0,4,9.84,14.395,81,0.0,16,1,0,0,0
2011-01-01 01:00:00,0,0,4,9.02,13.635,80,0.0,40,1,0,0,1
2011-01-01 02:00:00,0,0,4,9.02,13.635,80,0.0,32,1,0,0,2
2011-01-01 03:00:00,0,0,4,9.84,14.395,75,0.0,13,1,0,0,3
2011-01-01 04:00:00,0,0,4,9.84,14.395,75,0.0,1,1,0,0,4


### Training & Test Sets

Scoring a model on the same rows it was fit on can overstate how well it generalizes, so we hold out 20% of the rows as a test set and score the model on those as well.

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=2)

In [25]:
X_train.shape

(8708, 10)

In [26]:
X_test.shape

(2178, 10)

In [27]:
y_train.shape

(8708,)

In [28]:
y_test.shape

(2178,)

In [29]:
lreg.fit(X_train, y_train)
print(lreg.score(X_train, y_train), lreg.score(X_test, y_test))

0.2691189782895864 0.2983179259155764
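
In the cells above, the data was standardized before it was split, so the test rows contributed to the means and standard deviations. One possible alternative, sketched here on synthetic data rather than the bikeshare set, is to put the scaler and the model in a scikit-learn pipeline so that scaling statistics are learned from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: y depends linearly on three features plus noise
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(500, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2)

# The scaler's mean and standard deviation are learned from the training rows only
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

train_r2 = pipe.score(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(train_r2, test_r2)
```

Calling `pipe.score(X_te, y_te)` applies the training-set scaling to the test rows before predicting, which mirrors how the model would see genuinely new data.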
