# Pre-Work: Dummy Variables

#### What are dummy variables?
Dummy variables are a set of variables that each take the value 0 or 1 to indicate the absence or presence of one category of a categorical variable.

#### Why do we need dummy variables?
Most machine learning techniques accept only numerical values as inputs and will not accept string objects.

For example, most machine learning techniques will not accept the variable `state = ['DC', 'VA', 'MD']` but will understand the corresponding dummy variables `state_DC = [0, 1]`, `state_VA = [0, 1]`, and `state_MD = [0, 1]`, where `0` indicates absence of that category and `1` indicates presence.

| state | state_DC | state_VA | state_MD |
|-------|----------|----------|----------|
| DC    | 1        | 0        | 0        |
| VA    | 0        | 1        | 0        |
| MD    | 0        | 0        | 1        |
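
As a minimal sketch, the table above can be reproduced with pandas. The toy DataFrame `df` is an assumption made up for illustration, and `dtype=int` is passed so the output uses 0/1 (recent pandas versions otherwise return `True`/`False`). Note that pandas orders the dummy columns alphabetically, so the column order may differ from the table.

```python
import pandas as pd

# Toy data matching the table above (illustrative only)
df = pd.DataFrame({'state': ['DC', 'VA', 'MD']})

# One 0/1 column per category; prefix gives names like state_DC
dummies = pd.get_dummies(df['state'], prefix='state', dtype=int)

# Put the original column and its dummy variables side by side
result = df.join(dummies)
print(result)
```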

#### Why are numerical values sometimes converted to dummy variables?
If a variable takes on numerical values, but the values themselves do not imply an order, then they can be treated as categorical values and converted to dummy variables.

For example, consider a variable that holds the network port numbers used in network communications and can take on values from the list `ports = [20, 22, 25, 80]`. Port 20 is typically used for FTP traffic and port 80 is typically used for web traffic. Port 80 is not in any meaningful sense 4 times as much as port 20. Here it is better to treat ports as a categorical variable and use dummy variables. Read more about port numbers [here](https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers).
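
A small sketch of the port example, assuming a made-up Series of observed port values. `pd.get_dummies()` treats each distinct number as its own category, so the model never sees the port numbers as magnitudes.

```python
import pandas as pd

# Hypothetical observations of the port variable (illustrative only)
ports = pd.Series([20, 22, 25, 80, 80, 20], name='port')

# Each distinct port number becomes its own 0/1 column
port_dummies = pd.get_dummies(ports, prefix='port', dtype=int)
print(port_dummies)
```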

#### Is there an easy way to create dummy variables?
Yes, pandas `pd.get_dummies()` makes it really easy to make dummy variables. Read the API docs for the function [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).

In [None]:
import numpy as np
import pandas as pd

from sklearn import linear_model

### Let's use the bikeshare data to demonstrate dummy variables

In [None]:
# load bikeshare data
wd = '../../assets/dataset/bikeshare/'
bike_data = pd.read_csv(wd + 'bikeshare.csv')
bike_data.head()

In [None]:
# get details about the bikeshare data
with open(wd + 'bikeshare.txt') as f:
    print(f.read())

### Check the data types of all columns

In [None]:
bike_data.dtypes

### All columns that are to be used to fit the model need to be numerical (int or float).
Here `dteday` is not numerical and must be removed or converted to a numerical value. Raw dates are not always useful as a feature; it is often better to convert them to a day of the week, which is what the `weekday` column contains. If you are interested in how a date can be converted to a weekday number, see the code cell below. It uses the `weekday()` method of `datetime` objects from Python's `datetime` module. Read more about it [here](https://docs.python.org/2/library/datetime.html#datetime.date.weekday).

In [None]:
from datetime import datetime

# this check allows this code to be run multiple times
if 'dteday' in bike_data.columns:
    # Convert date to weekday
    bike_data['weekday'] = bike_data['dteday'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').weekday())
    # Remove dteday
    bike_data.drop(['dteday'], axis=1, inplace=True)
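
The same conversion can also be sketched with pandas' vectorized datetime accessor instead of `apply`. The small `dates` DataFrame below is an assumption for illustration; only the `dteday` column name and the `%Y-%m-%d` format are taken from the cell above. As with `datetime.weekday()`, Monday is 0 and Sunday is 6.

```python
import pandas as pd

# Toy frame with date strings in the same format as dteday (illustrative only)
dates = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-02', '2011-01-03']})

# Parse the strings once, then read the weekday number (Monday=0 ... Sunday=6)
dates['weekday'] = pd.to_datetime(dates['dteday'], format='%Y-%m-%d').dt.weekday

# Drop the raw date column now that the weekday has been extracted
dates = dates.drop(['dteday'], axis=1)
print(dates)
```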

### Now let's get dummy variables for weekday

As mentioned earlier, even though `weekday` is a number, its values do not represent a meaningful magnitude, so it is better to treat it as a categorical variable and create dummy variables.

In [None]:
# this check allows this code to be run multiple times
if 'weekday' in bike_data.columns:
    # get dummy variables for weekday
    bike_data = bike_data.join(pd.get_dummies(bike_data['weekday'], prefix='weekday'))
    # remove weekday
    bike_data.drop(['weekday'], axis=1, inplace=True)
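
As an alternative sketch, `pd.get_dummies()` also accepts a whole DataFrame together with a `columns` argument, which creates the dummy columns and drops the original column in one call. The `toy` DataFrame and its values are assumptions for illustration, not the bikeshare data.

```python
import pandas as pd

# Toy frame with a numeric categorical column and one ordinary feature
toy = pd.DataFrame({'weekday': [0, 6, 3], 'temp': [0.24, 0.22, 0.30]})

# Encode only the listed columns; the original weekday column is replaced
# by weekday_<value> columns, and other columns are left untouched
encoded = pd.get_dummies(toy, columns=['weekday'], dtype=int)
print(encoded.columns.tolist())
```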

### Now we're ready to train a model

In [None]:
# set independent variables
# here you can manually choose which columns to use
# X = bike_data[['season', 'temp']]
# or remove the columns you do not want to use
X = bike_data.drop(['casual', 'registered', 'cnt'], axis=1)

# set dependent variable
y = bike_data['registered']

In [None]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

### Now let's also get dummy variables for the `weathersit` feature

In [None]:
# this check allows this code to be run multiple times
if 'weathersit' in X.columns:
    # get dummy variables for weathersit
    X = X.join(pd.get_dummies(X['weathersit'], prefix='weathersit'))
    # remove weathersit
    X.drop(['weathersit'], axis=1, inplace=True)

In [None]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

### Now let's also get dummy variables for the `hr` feature

In [None]:
# this check allows this code to be run multiple times
if 'hr' in X.columns:
    # get dummy variables for hr
    X = X.join(pd.get_dummies(X['hr'], prefix='hr'))
    # remove hr
    X.drop(['hr'], axis=1, inplace=True)

In [None]:
# Create linear regression object
lm = linear_model.LinearRegression()

# Fit the model
lm.fit(X, y)

# Score model
print("R-squared: " + str(lm.score(X, y)))

### As an exercise, create dummy variables for the `season` feature and check the model score