# Pre-Work: Dummy Variables

#### What are dummy variables?
Dummy variables are a set of variables that each take the value 0 or 1 to indicate the absence or presence of one category of a categorical variable.

#### Why do we need dummy variables?
Most machine learning techniques only accept numerical values as inputs and will not accept string objects.

For example, most machine learning techniques will not accept the variable `state = ['DC', 'VA', 'MD']` but will understand the corresponding dummy variables `state_DC = [0, 1]`, `state_VA = [0, 1]`, and `state_MD = [0, 1]`, where `0` indicates absence of that category and `1` indicates presence.

| state | state_DC | state_VA | state_MD |
|-------|----------|----------|----------|
| DC    | 1        | 0        | 0        |
| VA    | 0        | 1        | 0        |
| MD    | 0        | 0        | 1        |
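
As a minimal sketch of the table above, the same encoding can be produced with `pd.get_dummies()`. The small `states` DataFrame is made up for illustration, and `dtype=int` is passed so the result holds 0/1 integers (newer pandas versions otherwise return booleans). Note that pandas orders the new columns alphabetically by category, so `state_MD` comes before `state_VA`.

```python
import pandas as pd

# toy data for illustration only (not part of the bikeshare dataset)
states = pd.DataFrame({'state': ['DC', 'VA', 'MD']})

# one 0/1 column per category, named with the given prefix
state_dummies = pd.get_dummies(states['state'], prefix='state', dtype=int)

# keep the original column next to its dummy variables for comparison
encoded = states.join(state_dummies)
print(encoded)
```

Each row has exactly one `1` across the three dummy columns, because each observation belongs to exactly one state.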

#### Why are numerical values sometimes converted to dummy variables?
If a variable takes on numerical values, but the values themselves do not imply an order, then they can be treated as categorical values and converted to dummy variables.

For example, consider a variable that holds internet socket port numbers used in network communications and can take on values from the list `ports = [20, 22, 25, 80]`. Port 20 is typically used for FTP traffic and port 80 is typically used for web traffic. There is no implied meaning that port 80 is 4 times as much as port 20. Here it is better to treat ports as a categorical variable and use dummy variables. Read more about port numbers [here](https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers).
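
A short sketch of this idea, using a made-up series of observed ports: `pd.get_dummies()` treats the integers as category labels, so each port gets its own 0/1 column and no ordering or magnitude is implied. The `observed_ports` values and the `port` prefix are assumptions for illustration.

```python
import pandas as pd

# hypothetical observations; the values come from the example list of ports
observed_ports = pd.Series([20, 22, 25, 80, 80, 22], name='port')

# each distinct port number becomes its own indicator column
port_dummies = pd.get_dummies(observed_ports, prefix='port', dtype=int)
print(port_dummies)
```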

#### Is there an easy way to create dummy variables?
Yes, pandas `pd.get_dummies()` makes it really easy to make dummy variables. Read the API docs for the function [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).

In [1]:
import numpy as np
import pandas as pd

from sklearn import linear_model

### Let's use the bikeshare data to demonstrate dummy variables

In [2]:
# load bikeshare data
wd = '../../assets/dataset/bikeshare/'
bike_data = pd.read_csv(wd + 'bikeshare.csv')
bike_data.head()

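|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---------|--------|--------|----|------|----|---------|---------|------------|------------|------|-------|-----|-----------|--------|------------|-----|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |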


In [3]:
# get details about the bikeshare data
with open(wd + 'bikeshare.txt') as f:
    print(f.read())

Bike Sharing Dataset

Hadi Fanaee-T

Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto
INESC Porto, Campus da FEUP
Rua Dr. Roberto Frias, 378
4200 - 465 Porto, Portugal


Background 

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return 
back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return 
back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of 
over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, 
environmental and health issues. 

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by
these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration
of tra

### Check the data types of all columns

In [4]:
bike_data.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

### All columns that are to be used to fit the model need to be numerical (int or float).
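Here `dteday` is not numerical and must be removed or converted to a numerical value. Raw dates are not always useful as a feature; it is often better to convert a date to its day of the week, which is what the `weekday` column contains. If you are interested in how a date can be converted to a weekday number, see the code cell below. It uses the `weekday()` method of `datetime` objects from the standard `datetime` module, which numbers Monday as 0 and Sunday as 6. Read more about it [here](https://docs.python.org/2/library/datetime.html#datetime.date.weekday).

Note that the cell overwrites the existing `weekday` column and the numbering may differ: in the sample rows above, 2011-01-01 (a Saturday) has `weekday` 6, whereas `weekday()` returns 5 for a Saturday. This does not matter once the column is converted to dummy variables.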

In [6]:
from datetime import datetime

# this check allows this code to be run multiple times
if 'dteday' in bike_data.columns:
    # Convert date to weekday
    bike_data['weekday'] = bike_data['dteday'].apply(
        lambda x: datetime.strptime(x, '%Y-%m-%d').weekday()
    )
    # Remove dteday
    bike_data.drop(['dteday'], axis=1, inplace=True)
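
To make the conversion concrete, here is a small sketch on three hand-picked dates (a made-up `dates` series, not the bikeshare file). It shows the row-by-row approach used in the cell above next to a vectorized alternative based on `pd.to_datetime` and the `.dt.weekday` accessor, which is an assumption about how one might write it rather than part of the original notebook.

```python
from datetime import datetime

import pandas as pd

# Saturday, Sunday, Monday
dates = pd.Series(['2011-01-01', '2011-01-02', '2011-01-03'])

# row-by-row conversion, as in the cell above (Monday=0 ... Sunday=6)
per_row = dates.apply(lambda x: datetime.strptime(x, '%Y-%m-%d').weekday())

# vectorized alternative: parse the whole column once, then read the weekday
vectorized = pd.to_datetime(dates, format='%Y-%m-%d').dt.weekday

print(per_row.tolist())     # [5, 6, 0]
print(vectorized.tolist())  # [5, 6, 0]
```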

### Now let's get dummy variables for weekday

As previously mentioned, even though weekday is a number, perhaps it is better to treat it as a categorical variable and create dummy variables.

In [10]:
# this check allows this code to be run multiple times
if 'weekday' in bike_data.columns:
    # get dummy variables for weekday
    bike_data = bike_data.join(pd.get_dummies(bike_data['weekday'], prefix='weekday'))
    # remove weekday
    bike_data.drop(['weekday'], axis=1, inplace=True)
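
The join-then-drop pattern above can also be written as a single call by passing the whole DataFrame to `pd.get_dummies()` with the `columns` argument. The sketch below compares both forms on a tiny made-up frame (`toy` is hypothetical); when `columns` is given, the column name is used as the prefix by default.

```python
import pandas as pd

# tiny stand-in for bike_data, for illustration only
toy = pd.DataFrame({'weekday': [5, 6, 0], 'cnt': [16, 40, 32]})

# two steps: build the dummies, join them, then drop the original column
two_step = toy.join(
    pd.get_dummies(toy['weekday'], prefix='weekday', dtype=int)
).drop(['weekday'], axis=1)

# one step: encode the named column and drop it automatically
one_step = pd.get_dummies(toy, columns=['weekday'], dtype=int)

print(one_step)
```

Both forms leave `cnt` untouched and append `weekday_0`, `weekday_5`, and `weekday_6`.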

### Now we're ready to train a model

In [11]:
# set independent variables
# here you can manually choose which columns to use
# X = bike_data[['season', 'temp']]
# or remove the columns you do not want to use
X = bike_data.drop(['casual', 'registered', 'cnt'], axis=1)

# set dependent variable
y = bike_data['registered']

In [12]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

R-squared: 0.335694432332


### Now let's also get dummy variables for the `weathersit` feature

In [13]:
# this check allows this code to be run multiple times
if 'weathersit' in X.columns:
    # get dummy variables for weathersit
    X = X.join(pd.get_dummies(X['weathersit'], prefix='weathersit'))
    # remove weathersit
    X.drop(['weathersit'], axis=1, inplace=True)

In [14]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

R-squared: 0.338341281134


### Now let's also get dummy variables for the `hr` feature

In [15]:
# this check allows this code to be run multiple times
if 'hr' in X.columns:
    # get dummy variables for hr
    X = X.join(pd.get_dummies(X['hr'], prefix='hr'))
    # remove hr
    X.drop(['hr'], axis=1, inplace=True)

In [16]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

R-squared: 0.676941859387
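
One plausible reason for the large jump after dummy-coding `hr` is that demand does not rise or fall steadily with the hour of day, so a single slope on the raw number fits poorly while one coefficient per hour can follow any daily shape. The sketch below illustrates this on synthetic data, not the bikeshare data: the `demand` values are invented with peaks at 8:00 and 17:00, and the helper `r_squared` is a hypothetical function that uses plain NumPy least squares instead of scikit-learn.

```python
import numpy as np
import pandas as pd


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R-squared."""
    A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A.dot(coef)
    return 1.0 - residuals.dot(residuals) / ((y - y.mean()) ** 2).sum()


# synthetic data: 30 days of hourly observations with two commute peaks
hours = pd.Series(np.tile(np.arange(24), 30), name='hr')
demand = np.where(hours.isin([8, 17]), 100.0, 10.0)

# hour as a single numeric feature vs. one dummy variable per hour
r2_raw = r_squared(hours.to_frame(), demand)
r2_dummies = r_squared(pd.get_dummies(hours, prefix='hr', dtype=float), demand)

print("R-squared, raw hr:     " + str(r2_raw))
print("R-squared, hr dummies: " + str(r2_dummies))
```

In this toy setup the dummy-coded model fits almost perfectly because demand depends only on the hour, while the raw numeric feature explains almost nothing.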


### As an exercise, create dummy variables for the `season` feature and check the model score

In [17]:
if 'season' in X.columns:
    # get dummy variables for season
    X = X.join(pd.get_dummies(X['season'], prefix='season'))
    # remove season
    X.drop(['season'], axis=1, inplace=True)

In [18]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

R-squared: 0.679148138732
