# Mini-lab 7: Transportation forecasting with kNN

As in mini-lab 6, we will analyze mode choice as a function of delta_travel_time and delta_travel_cost, the differences in travel time and cost between the transit option and the driving option. In this mini-lab we will build a method to predict the travel mode for a trip based on the travel mode of its "nearest neighbors".

### k nearest neighbors algorithm:
The [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is a popular classification technique in computer science. In this mini-lab we will build a classifier that predicts the travel mode for unobserved trips based on the travel mode of the nearest neighbors. Here, the nearest neighbors are the observations with the most similar delta_travel_time and delta_travel_cost.

### The dataset
The dataset is the same one we used in mini-lab 6. It contains the actual travel decisions of travelers given various travel alternatives. Below is the detailed dataset description.
#### "The California Department of Transportation (Caltrans) conducts the California Household Travel Survey (CHTS) every ten years to obtain detailed information about the socioeconomic characteristics and travel behavior of households statewide." -[Caltrans website](http://www.dot.ca.gov/hq/tpp/offices/omsp/statewide_travel_analysis/chts.html)


The modechoice.csv file contains data from the CHTS on trips that people living in the Bay Area actually took. The dataset contains demographic info on the traveler as well as the trip origin TAZ and destination TAZ. We have combined this data with the inter-TAZ travel time/cost data that we used in mini-labs 4 and 5 to provide information on trip cost, time, and distance for all available travel modes.

Note that for some trips or some people, not all modes are available. Some people do not have a driver's license or do not have access to a car. Sometimes biking is infeasible due to bike ownership, trip distance, or restrictions on biking across bridges.

The data in modechoice.csv is as follows:
<table>
    <tr>
        <td>'observation_id'</td>   <td>int id</td>
    </tr><tr>
        <td>'choice'        </td>   <td>  string mode chosen <li>'drive_alone' - drive alone,<li>'shared_ride_2' - 2 person shared ride,<li>'shared_ride_3' - 3 person shared ride,<li>'walk_transit_walk' - walk transit walk,<li>'drive_transit_walk' - drive transit walk,<li>'walk_transit_drive' - walk transit drive,<li>'walk' - walk,<li>'bike' - bike </td>
    </tr><tr>
    <td> 'availability_drive_alone'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_shared_ride_2'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_shared_ride_3+'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_walk_transit_walk'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_drive_transit_walk'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_walk_transit_drive'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_walk'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'availability_bicycle'</td>   <td>1 if available else 0</td>
    </tr><tr>
    <td> 'household_id'</td>   <td>      int
    </tr><tr>
    <td> 'person_id'</td>   <td>              int
    </tr><tr>
    <td> 'tour_id'</td>   <td>                 int
    </tr><tr>
    <td> 'tour_origin_taz'</td>   <td>                   int taz id
    </tr><tr>
    <td> 'primary_dest_taz'</td>   <td>    int taz id
    </tr><tr>
    <td> 'age'</td>   <td>          int age in years
    </tr><tr>
    <td> 'household_size'</td>   <td>                       int, number of people
    </tr><tr>
    <td> 'household_income'</td>   <td>    int 1-8, 1 = lowest income bracket, 8=highest
    </tr><tr>
    <td> 'household_income_values'</td>   <td>         int dollar value household income
    </tr><tr>
    <td> 'transit_subsidy'</td>   <td>   1 if has subsidy, else 0
    </tr><tr>
    <td> 'transit_subsidy_amount'</td>   <td>           subsidy dollar amount
    </tr><tr>
    <td> 'cross_bay'</td>   <td>    1 if trip crosses bay, else 0
    </tr><tr>
    <td> 'total_travel_time_drive_alone'</td>   <td>    door to door travel time in minutes
    </tr><tr>
    <td> 'total_travel_time_shared_ride_2'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_time_shared_ride_3+'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_time_walk_transit_walk'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_time_drive_transit_walk'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_time_walk_transit_drive'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_time_walk'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_time_bicycle'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_distance_drive_alone' </td>   <td> travel distance in miles
    </tr><tr>
    <td> 'total_travel_distance_shared_ride_2'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_distance_shared_ride_3+'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_distance_walk'     </td>   <td> 
    </tr><tr>
    <td> 'total_travel_distance_bicycle'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_cost_drive_alone'</td>   <td> travel cost in dollars
    </tr><tr>
    <td> 'total_travel_cost_shared_ride_2'</td>   <td> Note driving costs include fixed per mile rate divided evenly among passengers, and tolls. Does not include parking and other car ownership related costs
    </tr><tr>
    <td> 'total_travel_cost_shared_ride_3+'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_cost_walk_transit_walk'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_cost_drive_transit_walk'</td>   <td> 
    </tr><tr>
    <td> 'total_travel_cost_walk_transit_drive'</td>   <td> 
    </tr><tr>
    <td> 'age_ctgry'</td>   <td>  str age category:
                                       <li>'0-04' = 0-4 years old,
                                       <li>'05-19' = 5-19 years old, 
                                       <li>'20-44' = 20-44 years old, 
                                       <li>'45-64' = 45-64 years old, 
                                       <li>'65+' = 65+ years old
    </td></tr>
</table>




In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
mc = Table.read_table('../minilab6/modechoice.csv')
mc

## Drive vs. transit travel time comparison

As in mini-lab 6, we first take the trips where both drive and transit are available, then compute delta_travel_time and delta_travel_cost separately for trips where choice = transit and for trips where choice = drive. These steps should have been completed in mini-lab 6, but the code is copied below:

In [None]:
# Get rows where both drive and walk to transit are available
transit_drive_avail = mc.where('availability_drive_alone',1).where('availability_walk_transit_walk',1)

# From transit_drive_avail, store the rows where the mode 'choice' is 'walk_transit_walk'
# in a table called took_transit
took_transit = transit_drive_avail.where('choice','walk_transit_walk')

# from transit_drive_avail, store the rows where the selected mode 'choice' is 'drive_alone'
drove = transit_drive_avail.where('choice','drive_alone')


# Compute the difference in travel time (the 'total_travel_time_walk_transit_walk' column-
# the 'total_travel_time_drive_alone' column) for people who took transit.
chose_wtw_tt_delta = (took_transit.column('total_travel_time_walk_transit_walk')-
                      took_transit.column('total_travel_time_drive_alone'))

# Compute the difference in travel time (the 'total_travel_time_walk_transit_walk' column-
# the 'total_travel_time_drive_alone' column) for people who drove.
chose_drive_tt_delta = (drove.column('total_travel_time_walk_transit_walk')-
                        drove.column('total_travel_time_drive_alone'))

# Compute the cost delta for transit cost vs. drive alone cost for people who took transit
chose_wtw_cost_delta = (took_transit.column('total_travel_cost_walk_transit_walk')
                        -took_transit.column('total_travel_cost_drive_alone'))

# Compute the cost delta for transit cost vs. drive alone cost for people who drove
chose_drive_cost_delta = (drove.column('total_travel_cost_walk_transit_walk')
                          -drove.column('total_travel_cost_drive_alone'))

# Building a classifier
## 0. Visualizing the data
The first step when we start building a predictive model is to visualize the data. We already created this scatter plot in mini-lab 6, but I am including it here to remind us of what the data set looks like.

In [None]:
plt.figure(figsize=(12,8))

plt.scatter(chose_wtw_tt_delta, chose_wtw_cost_delta, color='blue', 
            alpha=.5, label = 'took transit')

plt.scatter(chose_drive_tt_delta, chose_drive_cost_delta, color='green', alpha=.5, label = 'drove')

plt.xlabel('delta travel time (min)')
plt.ylabel('delta travel cost ($)')
plt.legend(shadow=True)


## 1. Build the input data table
First we need to build a table that contains the response variable (the thing we want to predict - in this case, whether someone will drive or not), and the inputs that will be used to predict the response variable (in this case delta_travel_time and delta_travel_cost). 

In [None]:
input_table = Table().with_columns('transit_time-drive_time', chose_drive_tt_delta,
                                   'transit_cost-drive_cost', chose_drive_cost_delta,
                                   'choice=drive',1)

transit_input_table = Table().with_columns('transit_time-drive_time', chose_wtw_tt_delta,
                                           'transit_cost-drive_cost', chose_wtw_cost_delta,
                                           'choice=drive',0)
input_table.append(transit_input_table)

## 2. Normalize the data.
If the input data columns are all in the same units, then we're good to go. But if they are in different units, e.g. one column contains travel time in minutes and another contains travel cost in dollars, a distance measure doesn't make much sense. Instead, we normalize every column. The "units" of the normalized data are the number of standard deviations each data point lies from the column mean. I have built a norm() helper function to normalize the data. Later we will use the transform() function to rescale a new point that we want to predict in the same way that we rescaled the other points.
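
To see what normalization does, here is a minimal sketch that applies the same z-score formula to made-up numbers (the values and the names `delta_time`, `delta_cost`, and `z_score` are illustrative and are not taken from modechoice.csv). It also shows that a new point should be rescaled with the mean and standard deviation of the existing data, not with statistics of its own.

```python
import numpy as np

# Made-up illustrative values, not from modechoice.csv
delta_time = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # minutes
delta_cost = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])       # dollars

def z_score(data):
    # Number of standard deviations each value lies from the column mean
    return (data - np.mean(data)) / np.std(data)

normed_time = z_score(delta_time)
normed_cost = z_score(delta_cost)

# After normalization both columns have mean 0 and standard deviation 1
print(np.mean(normed_time), np.std(normed_time))
print(np.mean(normed_cost), np.std(normed_cost))

# A new point is rescaled using the statistics of the existing data
new_time = 30.0
new_time_normed = (new_time - np.mean(delta_time)) / np.std(delta_time)
print(new_time_normed)
```

In this toy example the two columns differ by a factor of ten in raw scale, yet their normalized values coincide, so neither column dominates a distance calculation.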

In [None]:
def norm(data):
    x_minus_mean = data - np.mean(data)
    x_norm = x_minus_mean/np.std(data)
    return x_norm

def transform(to_predict, data):
    return (to_predict - np.mean(data))/np.std(data)


# Your task: Create two new columns in the table, one called 'normed_transit_time-drive_time', 
# another called 'normed_transit_cost-drive_cost'. The values in these two columns should be
# the normalized values of 'transit_time-drive_time' and 'transit_cost-drive_cost'. Use the
# norm function above to help you out:



## 3. Build a classifier
### 3a. We have decided to use the k-Nearest Neighbor (kNN) algorithm. 
Below I have built a basic kNN class with two methods, a fit method where we load in the nearest neighbor candidates, and a predict method, where we identify the nearest neighbors and return the most common response category among the nearest neighbors.
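
The "most common response category" step can be sketched with `np.bincount` and `np.argmax`. The neighbor labels below are hypothetical (1 = drove, 0 = took transit), and `majority` is an illustrative helper name. Note that `np.argmax` returns the first maximum, so a tied vote resolves to the smaller label; this can matter when k is even.

```python
import numpy as np

# Hypothetical labels of the k nearest neighbors: 1 = drove, 0 = took transit
labels_odd = np.array([1, 0, 1, 1, 0])   # k = 5
labels_even = np.array([1, 0, 1, 0])     # k = 4, a tie

def majority(labels):
    # np.bincount counts how often each non-negative integer label occurs;
    # np.argmax picks the label with the largest count (the first one on ties)
    return np.argmax(np.bincount(labels))

print(majority(labels_odd))    # 3 votes for 1, 2 votes for 0
print(majority(labels_even))   # 2 votes each, so the tie resolves to label 0
```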


### 3b. Determine a distance function
Since we need to identify the nearest neighbors, we first need to choose a distance function. In this case we will use the Euclidean distance.
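
As a rough sketch of this distance calculation, the example below computes the distance from one query point to every row of a small made-up array. It relies on NumPy broadcasting: subtracting a 2-D array from a length-2 point subtracts each row from that point. The names `points` and `query` are illustrative.

```python
import numpy as np

# Made-up normalized points, one row per observation: [time, cost]
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [1.0, 1.0]])
query = np.array([0.0, 0.0])

# Squared differences per coordinate, summed across each row (axis=1)
dists = np.sqrt(np.sum((query - points) ** 2, axis=1))
print(dists)

# Indices of the two closest rows, nearest first
nearest = np.argsort(dists)[:2]
print(nearest)
```

Sorting the distances with `np.argsort` and keeping the first k indices is one simple way to pick out the k nearest rows.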



In [None]:
def distance(x, y):
    return np.sqrt(np.sum((x-y)**2,1))

class KNearestNeighbors():
    def __init__ (self, n_neighbors=5):
        '''
        n_neighbors: number of neighbors
        '''
        self.n_neighbors = n_neighbors
    
    def fit(self, input_data, response):
        '''
        input_data: a table, the values of this table will be used to 
            compute the distance to the neighbors
        response: a table with one column, the values in this column 
            represent the category of the thing we are trying to predict 
        '''
        self.input_data = input_data.values
        self.response = response.values.flatten()
    
    def predict(self, to_predict, return_kneighbor_inds=True):
        '''
        to_predict: A single input data point. It should contain one value 
           for each of the columns in the input_data table.
        return_kneighbor_inds: boolean. If True, return the indices of the
            nearest neighbors from the input table, otherwise, only the 
            majority category of the k-nearest neighbors is returned.
        '''
        # get the distance from to_predict to every row of input_data
        dists = distance(to_predict, self.input_data)

        #get indices of k nearest points
        inds = np.argsort(dists)[0:self.n_neighbors]

        #return the most common response among the neighbors
        most_common_response = (np.argmax(np.bincount(self.response[inds])))
        if return_kneighbor_inds:
            return most_common_response, inds
        return most_common_response
            

## 4. Using the classifier

In [None]:
# Your task: Use the input_table.select() method to select only the columns 
# to be used to determine the nearest neighbors



# Your task: Use the input_table.select() method to select the column with the response variable



# Your task: Create an instance of the KNearestNeighbors class, let's set n_neighbors to 5.



# Your task: Load in the nearest neighbor candidates using the fit method:



# Set to_predict =[25,1.5] This means we are predicting the travel mode of a trip with 
# delta_travel_time = 25 min, delta_travel_cost = 1.5
to_predict = [25,1.5]

#transform the first element of this point to find out how many standard deviations 
#the delta_travel_time is from the mean. Do the same for delta_travel_cost
normed_to_predict = [transform(to_predict[0],input_table['transit_time-drive_time']),
                     transform(to_predict[1],input_table['transit_cost-drive_cost'])]


prediction, nn_inds = kNN.predict(normed_to_predict, True)
predicted_travel_mode = 'drive' if prediction ==1 else 'take transit'

print ('Based on the k nearest neighbors, the predicted travel mode is %s' %predicted_travel_mode)

## Verify results
Make sure we are actually locating the nearest neighbors. In the cell below we use the Table take() method to see which rows have been identified as nearest neighbors. Take a look at the values and confirm that they are similar to the values of the to_predict point.

In [None]:
input_table.take[nn_inds]

### Visualize the nearest neighbors
Below is the same scatter plot that we plotted above, but now we have added a black dot for the to_predict point and red circles to identify the nearest neighbors of that point.

In [None]:
plt.figure(figsize=(12,8))
nn_to_plot = input_table.take[nn_inds]

plt.scatter(chose_wtw_tt_delta, chose_wtw_cost_delta, color='blue', 
            alpha=.5, label = 'took transit')

plt.scatter(chose_drive_tt_delta, chose_drive_cost_delta, color='green', 
            alpha=.5, label = 'drove')


plt.scatter(nn_to_plot['transit_time-drive_time'], nn_to_plot['transit_cost-drive_cost'], 
            facecolors='none', edgecolors='red', label = 'nearest neighbors')

plt.scatter([to_predict[0]],[to_predict[1]], color='black', label = 'to_predict')

plt.xlabel('delta travel time (min)')
plt.ylabel('delta travel cost ($)')
plt.legend(shadow=True)


### Questions:
* What is the kNN predicted travel mode when delta_travel_time = 50, delta_travel_cost = 1.5 when we use... 
 - 1 nearest neighbor to predict travel mode? 
 - 3 nearest neighbors to predict travel mode? 
 - 5 nearest neighbors to predict travel mode? 
 - 10 nearest neighbors to predict travel mode? 
 - 50 nearest neighbors to predict travel mode?



* Describe in words the decision rule (how we decide the travel mode) for k-nearest neighbors when k=1?


* Describe in words the decision rule (how we decide the travel mode) for k-nearest neighbors when k=10?


* For the task of classifying travel mode, do you think it makes more sense to use k=1 or k=10? Why?

In [None]:
# Your answers here
