# Regression

In the last lecture, we looked at the problem of classification, where the goal was to find a rule that best predicts class membership from a discrete set of classes, such as "Fraud vs. Not Fraud" or "Survived vs. Not Survived". In this lecture, we are going to study real-valued prediction, such as predicting the number of sales on a given day or the energy usage of a device. However, the basic construct will still be the same: you have a set of features X and a set of labels Y. The goal is to find a functional relationship between X and Y that best predicts Y.

## Example
Bike-sharing rental activity is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior. The core dataset is the two-year historical log, covering 2011 and 2012, from the Capital Bikeshare system in Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data.

In [99]:
import pandas as pd
data = pd.read_csv('hour.csv')
data[:5]

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


Let's break down each of the columns, so we have a better idea of what we are dealing with.

In [100]:
print(data.groupby(['yr'])['yr'].count()) #Two years of data

yr
0    8645
1    8734
Name: yr, dtype: int64


In [76]:
print(data.groupby(['season'])['season'].count()) #seasons count 1-4

season
1    4242
2    4409
3    4496
4    4232
Name: season, dtype: int64


In [77]:
print(data.groupby(['weekday'])['weekday'].count()) #day of the week

weekday
0    2502
1    2479
2    2453
3    2475
4    2471
5    2487
6    2512
Name: weekday, dtype: int64


Let's try something simple first and then refine our approach. Like before, we can construct features and labels.

In [101]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def linmodel():
    feature_cols = ['season','yr','mnth','hr','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed']
    #feature_cols = ['yr','mnth','hr']
    X = data[feature_cols].to_numpy() #defines the features
    Y = data['cnt'].to_numpy() #labels (or what we are predicting)

    #create a testing dataset
    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1)

    lin = LinearRegression() #model that we use
    lin.fit(X_train,Y_train)
    
    Y_pred = lin.predict(X_test) #evaluation and prediction
    print(r2_score(Y_test, Y_pred))

In [102]:
linmodel()

0.3975373993285606


The printed value is the R² score on the held-out test set. This is better than simply predicting the mean value each time. But can we do even better?
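The comparison with "picking the mean" follows from how R² is defined: it measures a model's squared error relative to a model that always predicts the mean of the true values. A minimal sketch, using the first five `cnt` values from the table above; the helper `r2` is a hypothetical name that follows the textbook formula and is not scikit-learn's implementation:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

# First five `cnt` values from the table above
y = np.array([16.0, 40.0, 32.0, 13.0, 1.0])

baseline = np.full_like(y, y.mean())  # always predict the mean
print(r2(y, baseline))  # mean baseline scores 0
print(r2(y, y))         # perfect predictions score 1
```

On a held-out test set, a model that predicts the mean of the training labels can score slightly below 0, because the test-set mean is generally a little different.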

## Feature Engineering
What might be going wrong? Many datasets that are collected over time exhibit seasonal effects, where the pattern in the data repeats:

In [104]:
print(data.groupby(['mnth'])['cnt'].mean())

mnth
1      94.424773
2     112.865026
3     155.410726
4     187.260960
5     222.907258
6     240.515278
7     231.819892
8     238.097627
9     240.773138
10    222.158511
11    177.335421
12    142.303439
Name: cnt, dtype: float64


Why might this be a problem for linear regression? Fundamentally, the relationship between the month id and the count is non-linear: ridership rises toward the summer and falls again afterwards, which a single straight line cannot capture. One simple trick is to transform the feature into "distance to June" rather than keep it as a month id.

In [105]:
import numpy as np
data['mnth'] = np.abs((data['mnth'] - 6))
print(data.groupby(['mnth'])['cnt'].mean())

mnth
0    240.515278
1    227.363575
2    213.010989
3    197.563918
4    169.664756
5    135.995813
6    142.303439
Name: cnt, dtype: float64
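To get a rough feel for why the transformation helps, one can fit a straight line to the twelve monthly averages printed earlier, once against the raw month id and once against the distance to June. This is only a sketch: it fits the monthly means (unweighted) rather than the full dataset, and `line_fit_r2` is a hypothetical helper.

```python
import numpy as np

# Monthly mean counts copied from the groupby output above (months 1..12)
month = np.arange(1, 13)
mean_cnt = np.array([94.424773, 112.865026, 155.410726, 187.260960,
                     222.907258, 240.515278, 231.819892, 238.097627,
                     240.773138, 222.158511, 177.335421, 142.303439])

def line_fit_r2(x, y):
    """R^2 of the least-squares straight line y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    residual = y - (a * x + b)
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

r2_raw = line_fit_r2(month, mean_cnt)               # month id as-is
r2_dist = line_fit_r2(np.abs(month - 6), mean_cnt)  # distance to June

print(f"raw month id:     {r2_raw:.2f}")
print(f"distance to June: {r2_dist:.2f}")
```

The line through the transformed feature explains noticeably more of the variation in the monthly means, although it is still far from a perfect fit.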


We can apply a similar transformation to the "hour" feature, shifting it so that the 5 pm peak sits at 0 on a 12-hour clock.

In [107]:
print(data.groupby(['hr'])['cnt'].mean())
data['hr'] = (data['hr']-17)%12 #put this on a 12 hr clock
print(data.groupby(['hr'])['cnt'].mean())

hr
0      53.898072
1      33.375691
2      22.869930
3      11.727403
4       6.352941
5      19.889819
6      76.044138
7     212.064649
8     359.011004
9     219.309491
10    173.668501
11    208.143054
12    253.315934
13    253.661180
14    240.949246
15    251.233196
16    311.983562
17    461.452055
18    425.510989
19    311.523352
20    226.030220
21    172.314560
22    131.335165
23     87.831044
Name: cnt, dtype: float64
hr
0     242.654457
1     251.138334
2     261.828179
3     292.474914
4     195.795876
5     152.487285
6     147.945704
7     153.744154
8     143.897454
9     132.966759
10    134.167602
11    162.702172
Name: cnt, dtype: float64


In [108]:
linmodel()

0.4075595780242215
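The hour shift above folds the day onto a 12-hour clock, so 5 am and 5 pm both map to 0, while 4 pm maps to 11 even though it is only one hour from the peak. One possible alternative, sketched here as an assumption rather than as part of the original analysis, is a circular distance that measures how many hours separate each value from a chosen center. The helper `circular_distance` is hypothetical, and whether it improves the score on this data would need to be checked, particularly since the hourly averages show two peaks (8 am and 5 pm).

```python
import numpy as np

def circular_distance(values, center, period):
    """Shortest distance from each value to `center` on a cycle of length `period`."""
    diff = np.abs(np.asarray(values) - center) % period
    return np.minimum(diff, period - diff)

hours = np.array([16, 17, 18, 23, 0, 5])

# Shift used in the cell above: 4 pm -> 11, 5 am -> 0
print((hours - 17) % 12)                 # [11  0  1  6  7  0]

# Hours away from 5 pm on a 24-hour clock: 4 pm -> 1, 5 am -> 12
print(circular_distance(hours, 17, 24))  # [ 1  0  1  6  7 12]
```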


What else could be challenging? There could be outliers, or points that are not well correlated with the features. We can mitigate this with a technique called ridge regression, which penalizes large coefficients so that the model cannot lean too heavily on any single feature.

In [122]:
from sklearn.linear_model import Ridge #penalties

def ridgemodel(alpha=1):
    feature_cols = ['season','yr','mnth','hr','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed']
    X = data[feature_cols].to_numpy()
    Y = data['cnt'].to_numpy()

    #create a testing dataset

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1)
    
    Y_train[500:1000] = -500 #overwrite 500 training labels with -500 to simulate outliers

    lin = Ridge(alpha=alpha)
    lin.fit(X_train,Y_train)
    Y_pred = lin.predict(X_test)

    print(r2_score(Y_test, Y_pred))

ridgemodel(50)


0.37987645058621644
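To make the penalty concrete, here is a small sketch of the ridge objective solved in closed form on synthetic data. The function `ridge_weights` and the data are illustrative assumptions. Unlike scikit-learn's `Ridge`, which by default also fits an unpenalized intercept, this version has no intercept.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Minimize ||X w - y||^2 + alpha * ||w||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# alpha = 0 is ordinary least squares; larger alpha shrinks the weights
for alpha in (0.0, 50.0, 5000.0):
    w = ridge_weights(X, y, alpha)
    print(alpha, np.round(w, 3), round(float(np.linalg.norm(w)), 3))
```

As `alpha` grows, the norm of the weight vector shrinks toward zero. A larger `alpha` trades some fit on the training data for smaller, more stable coefficients.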


## Non-linear models
Maybe your problem is fundamentally non-linear. One easy way to make a linear model more successful at capturing non-linear relationships is to engineer your features so that they express the non-linearity directly:

In [131]:
from sklearn.preprocessing import PolynomialFeatures
pol = PolynomialFeatures(degree=4)
X = np.array([[1,2]])
Xp = pol.fit_transform(X)
Xp

array([[ 1.,  1.,  2.,  1.,  2.,  4.,  1.,  2.,  4.,  8.,  1.,  2.,  4.,
         8., 16.]])
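The fifteen columns above are all products of the two inputs up to total degree 4, starting with the constant term. The sketch below reproduces them with the standard library only. `poly_expand` is a hypothetical helper that mimics the ordering seen in the output; it is not scikit-learn's implementation.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def poly_expand(row, degree):
    """All monomials of the entries of `row` up to `degree`, lowest degree first."""
    terms = []
    for d in range(degree + 1):
        for idx in combinations_with_replacement(range(len(row)), d):
            terms.append(prod(row[i] for i in idx))  # empty product is 1 (constant term)
    return terms

print(poly_expand([1, 2], 4))
# [1, 1, 2, 1, 2, 4, 1, 2, 4, 8, 1, 2, 4, 8, 16]

# Number of output columns for n input features: C(n + degree, degree)
for degree in (1, 2, 3, 4):
    print(degree, comb(12 + degree, degree))  # 12 input features -> 13, 91, 455, 1820
```

The column count grows quickly with the degree, which is worth keeping in mind when applying the expansion to a dataset with a dozen features.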

In [126]:
def polymodel(degree=1):
    feature_cols = ['season','yr','mnth','hr','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed']
    X = data[feature_cols].to_numpy()
    Y = data['cnt'].to_numpy()
    
    pol = PolynomialFeatures(degree=degree)
    X = pol.fit_transform(X)

    #create a testing dataset

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1)

    lin = LinearRegression()
    lin.fit(X_train,Y_train)
    Y_pred = lin.predict(X_test)

    print('Degree:',degree,r2_score(Y_test, Y_pred))
    
    #training R^2, for comparison with the test score printed above
    Y_pred = lin.predict(X_train)
    print(r2_score(Y_train, Y_pred))
    
    

polymodel(1)
polymodel(2)
polymodel(3)
polymodel(4) #what is going on here???

Degree: 1 0.4075595780242216
0.4010699589510761
Degree: 2 0.549239847239622
0.5350103113676354
Degree: 3 0.6614844208425092
0.65922124560599
Degree: 4 -1091604887.4043262
0.7597899663796178
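In the degree 4 result, the training score keeps improving while the test score collapses. That pattern is typical of overfitting: with 12 input features, a degree 4 expansion has 1820 columns, and an unregularized fit over that many correlated columns can produce very large coefficients that behave badly on unseen rows. The sketch below shows the same effect in one dimension on synthetic data. All values are made up for illustration, and the exact numbers depend on the random seed.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

rng = np.random.default_rng(0)

# Synthetic data: a noisy straight line with 12 training points
x_train = np.linspace(0.0, 1.0, 12)
y_train = 2.0 * x_train + rng.normal(scale=0.1, size=x_train.size)

# Test points fall between, and slightly beyond, the training points
x_test = np.linspace(0.02, 1.1, 30)
y_test = 2.0 * x_test + rng.normal(scale=0.1, size=x_test.size)

scores = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    scores[degree] = (r2(y_train, np.polyval(coeffs, x_train)),
                      r2(y_test, np.polyval(coeffs, x_test)))
    print(f"degree {degree}: train R^2 = {scores[degree][0]:.3f}, "
          f"test R^2 = {scores[degree][1]:.3f}")
```

The higher-degree fit can never score worse on the training points, since it contains the straight line as a special case. Its test score is what reveals whether the extra flexibility generalizes.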


scikit-learn also has other non-linear models that might be more effective for your task.

In [127]:
from sklearn.ensemble import RandomForestRegressor
def rfmodel():
    feature_cols = ['season','yr','mnth','hr','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed']
    X = data[feature_cols].to_numpy()
    Y = data['cnt'].to_numpy()
    
    #create a testing dataset

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1)
    
    #Y_train[0:1000] = 100

    rf = RandomForestRegressor() #ensemble of decision trees (not a linear model)
    rf.fit(X_train,Y_train)
    Y_pred = rf.predict(X_test)

    print(r2_score(Y_test, Y_pred))

rfmodel()



0.7692034764655856
