# New York Taxi Fare Prediction (Exploratory Data Analysis in iPython)

This homework explores exploratory data analysis in iPython. It is based on the New York City Taxi Fare Prediction competition on Kaggle (https://www.kaggle.com/c/new-york-city-taxi-fare-prediction), which asks for the fare of a taxi ride given its pickup and drop-off locations.


In [None]:
# Basic Imports
import pandas as pd
import numpy as np
import urllib.request
import json
from time import sleep

# Visualization and geo-data imports
import matplotlib.pyplot as plt
import folium
from folium import plugins
import fiona
from shapely.geometry import shape,mapping, Point, Polygon, MultiPolygon
import geopandas
from geopandas.tools import sjoin
import seaborn as sns

# Modelling and training
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgbm

## Load Datasets

In this part, we load all 55 million NYC taxi fare records as follows:
* First, load only the first 10 rows of the training csv with pandas to check the dataframe column types.

* Store the dataset efficiently by optimizing the dataframe column datatypes:

    * float64 > float32 (about 7 significant digits is enough for GPS coordinates, more is overkill)
      (referenced from: https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude/8674)

    * object > string (pickup_datetime is read from csv as a string, then parsed back into a datetime)

    * int64 > uint8 (passenger_count, since the maximum passenger count in any record is below 2^8=256)

    * Ignore the key field (not used for training)

* Read the entire training csv in chunks of 10 million rows, collecting the chunks in a single list
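
To make the saving from the narrower datatypes concrete, here is a minimal sketch on synthetic data. The array names and the sample longitude are assumptions chosen for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical sample: one million longitude values and passenger counts.
n = 1_000_000
lon64 = np.full(n, -73.985428, dtype=np.float64)
passengers64 = np.ones(n, dtype=np.int64)

# Narrow the types the same way the dtype mapping for read_csv does.
lon32 = lon64.astype(np.float32)
passengers8 = passengers64.astype(np.uint8)

# Bytes used before and after narrowing.
print(lon64.nbytes, lon32.nbytes)
print(passengers64.nbytes, passengers8.nbytes)

# float32 keeps roughly 7 significant digits, so a longitude near -74
# is stored to within a few millionths of a degree.
rounding_error = abs(float(lon32[0]) - float(lon64[0]))
print(rounding_error)

# Largest passenger_count that uint8 can hold.
print(np.iinfo(np.uint8).max)
```

The coordinate columns shrink to half their size and passenger_count to one eighth, at the cost of a rounding error that is small relative to typical GPS accuracy.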

In [None]:
# Sample and visualize first 10 rows by reading part of training data

sample_train_df = pd.read_csv('../input/train.csv', nrows=10)
sample_train_df.info()
#display(sample_train_df.head())
#display(sample_train_df.tail())
del sample_train_df

In [None]:
%%time

### Optimize and compress certain features' data types
#   1. float32 is enough for up to 7 significant digits (sufficient for GPS coordinates)
#   2. pickup_datetime is read from csv as a string
#   3. passenger_count is stored as uint8 (values stay below 256 in the training data)
#   4. The key object field is ignored

train_df_type = {
    'fare_amount':'float32',
    'pickup_datetime':'str',
    'pickup_longitude':'float32',
    'pickup_latitude':'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

# Read training csv in chunks and storing chunks in single list

chunksize=10**7
df_train_list=[]
cnt=1
for chunk in pd.read_csv('../input/train.csv', dtype=train_df_type, usecols=list(train_df_type), chunksize=chunksize):
    # converting pickup_datetime from string to datetime object
    chunk['pickup_datetime'] = pd.to_datetime(chunk['pickup_datetime'].str.slice(0,19), utc=True, format='%Y-%m-%d %H:%M:%S')
    df_train_list.append(chunk)
    print(cnt, "chunk appended")
    cnt+=1
    
# concatenating list of training data to single dataframe

train_df = pd.concat(df_train_list)
del df_train_list
train_df.info()

## Task 1

Take a look at the training data. There may be anomalies in the data that you may need to factor in before you start on the other tasks. Clean the data first to handle these issues. Explain what you did to clean the data (in bulleted form).

### Taking Look at Training and Testing Data

* After reading both train and test data, use describe method to find out statistics (e.g. min, max, mean, count of values inside dataset)

In [None]:
# Describe training data statistics
train_df.describe()

In [None]:
# read test csv and describe its statistics

test_df = pd.read_csv('../input/test.csv')
test_df.describe()

### Find Anomalies in Data

* Fare Amount 
    * Fare amount is negative in 254 training records.
    * Fare amount is 0 in certain training records (here we check for unusually long distances travelled and may filter those records out).
    * Fare amount is abnormally high in some records that cover only a short distance (we may filter out records whose fare is more than 10 standard deviations above the mean).

* Passenger Count
    * Maximum passenger count in test data is 6.
    * Passenger count is 0 in 195416 taxi trips.
    * In 116 training records, there are more than 6 passengers.

* Coordinates (Pickup and Dropoff)
    * Test dataset is clean (no NULL / NaN values).
    * In the training set, we find 376 records with NULL values.
    * Use the test dataset's min and max coordinates as a bounding box for the training data (since we only predict for this test data).

In [None]:
### Observations from description of training and testing data:
#   1. Minimum fare amount in training data is negative
#   2. Maximum count of passenger in test data is 6

print("Null counts per feature in training data: ")
display(train_df.isnull().sum()) # any null values in train data

print("Occurrences of negative fare amount: " + str(len(train_df[train_df['fare_amount']<0]))) # fare amount is negative
print("Occurrences with more than 6 passengers: " + str(len(train_df[train_df['passenger_count']>6]))) # passenger count is more than 6
print("Occurrences with exactly 0 passengers: " + str(len(train_df[train_df['passenger_count']==0])))

### Helper functions to get Distances (Euclidean, Manhattan)

In [None]:
def getEuclidean_distance(pickup_long, pickup_lat, dropoff_long, dropoff_lat):
    return np.sqrt(((pickup_long-dropoff_long)**2) + ((pickup_lat-dropoff_lat)**2))

def get_manhattan_dist(pickup_long, pickup_lat, dropoff_long, dropoff_lat):
    return ((dropoff_long - pickup_long).abs() + (dropoff_lat - pickup_lat).abs())


In [None]:
%%time

# Add new column (manhattan_dist) to both training and test dataframe.

train_df['manhattan_dist'] = get_manhattan_dist(train_df.pickup_longitude, train_df.pickup_latitude,
                                              train_df.dropoff_longitude, train_df.dropoff_latitude).astype(np.float32)

test_df['manhattan_dist'] = get_manhattan_dist(test_df.pickup_longitude, test_df.pickup_latitude,
                                       test_df.dropoff_longitude, test_df.dropoff_latitude).astype(np.float32)

### Clean Data (Steps)

* Remove any rows containing NULL / NaN values
* Filter based on passenger counts (only consider data with passenger count between 1 and 6)
* Filter based on fare amount (remove negative fares, or fares with 0 amount but long distances covered)
* Filter based on outliers (discard records with fare amount more than 10 standard deviations upward from mean)
* Filter based on test data coordinates boundary
                

In [None]:
def clean_data(train_df):
    
    print("Initial Train dataframe length: " + str(len(train_df)))
    
    # Remove null data
    train_df=train_df.dropna(how='any',axis='rows')
    print("Train dataframe length after removing NULL values: " + str(len(train_df)))
    
    train_df=train_df[(train_df.passenger_count<=6) & (train_df.passenger_count>=1)]
    print("Train dataframe length after filtering based on passenger counts: " + str(len(train_df)))
    
    train_df=train_df[(train_df.fare_amount>0) | ((train_df.fare_amount==0) & (train_df.manhattan_dist<0.75))]
    train_df=train_df[(train_df.fare_amount <= train_df.fare_amount.mean()+10*train_df.fare_amount.std())]
    print("Train dataframe length after filtering based on fare amount: " + str(len(train_df)))
    
    train_df=train_df[(train_df.pickup_longitude>=min(test_df.pickup_longitude)) & (train_df.pickup_longitude<=max(test_df.pickup_longitude))]
    train_df=train_df[(train_df.pickup_latitude>=min(test_df.pickup_latitude)) & (train_df.pickup_latitude<=max(test_df.pickup_latitude))]    
    train_df=train_df[(train_df.dropoff_longitude>=min(test_df.dropoff_longitude)) & (train_df.dropoff_longitude<=max(test_df.dropoff_longitude))]
    train_df=train_df[(train_df.dropoff_latitude>=min(test_df.dropoff_latitude)) & (train_df.dropoff_latitude<=max(test_df.dropoff_latitude))]
    print("Train dataframe length after filtering based on test data coordinates boundary: " + str(len(train_df)))
    
    return train_df



train_df=clean_data(train_df)
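
The outlier rule in `clean_data` (keep fares at or below the mean plus 10 standard deviations) can be sketched on a toy list. The fares below are made up for illustration, and the sketch uses the standard library rather than pandas.

```python
from statistics import mean, stdev

# Hypothetical fares: 200 ordinary rides plus one implausible value.
fares = [8.0] * 100 + [12.0] * 100 + [5000.0]

# Same rule as clean_data: keep fares at or below mean + 10 * std.
# statistics.stdev is the sample standard deviation, like pandas' Series.std().
threshold = mean(fares) + 10 * stdev(fares)
kept = [f for f in fares if f <= threshold]

print(round(threshold, 1), len(fares), len(kept))
```

Note that the threshold is computed from data that still contains the outlier, so extreme values inflate both the mean and the standard deviation. With only a handful of records no value can sit 10 standard deviations above the mean, so the rule only has an effect on large samples.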

### Helper functions to get different distances

### Airport Distances


* Get coordinates for each of the 3 airports:

| Airport | Longitude | Latitude |
|---------|-----------|----------|
| JFK     | -73.78    | 40.64    |
| LGA     | -73.87    | 40.77    |
| EWR     | -74.175   | 40.69    |
    
* Calculates euclidean distances from all airports to all data points (both with pickup and dropoff at airports)

* Add total of 6 new columns (each 3 airport pickup and dropoff distances) to both training and test data.

In [None]:
def getAirport_pickup_distance(dropoff_long, dropoff_lat, airport):
    pickup_long=0
    pickup_lat=0
    if airport == "JFK":
        pickup_long=-73.7822222222
        pickup_lat=40.6441666667
    elif airport == "LGA":
        pickup_long=-73.87
        pickup_lat=40.77
    elif airport == "EWR":
        pickup_long=-74.175
        pickup_lat=40.69
    return np.sqrt(((pickup_long-dropoff_long)**2) + ((pickup_lat-dropoff_lat)**2))

def getAirport_dropoff_distance(pickup_long, pickup_lat, airport):
    dropoff_long=0
    dropoff_lat=0
    if airport == "JFK":
        dropoff_long=-73.7822222222
        dropoff_lat=40.6441666667
    elif airport == "LGA":
        dropoff_long=-73.87
        dropoff_lat=40.77
    elif airport == "EWR":
        dropoff_long=-74.175
        dropoff_lat=40.69
    return  np.sqrt(((pickup_long-dropoff_long)**2) + ((pickup_lat-dropoff_lat)**2))


In [None]:
%%time

train_df['jfk_pickup_dist']=getAirport_pickup_distance(train_df.dropoff_longitude, train_df.dropoff_latitude, "JFK").astype(np.float32)
train_df['lga_pickup_dist']=getAirport_pickup_distance(train_df.dropoff_longitude, train_df.dropoff_latitude, "LGA").astype(np.float32)
train_df['ewr_pickup_dist']=getAirport_pickup_distance(train_df.dropoff_longitude, train_df.dropoff_latitude, "EWR").astype(np.float32)

train_df['jfk_dropoff_dist']=getAirport_dropoff_distance(train_df.pickup_longitude, train_df.pickup_latitude, "JFK").astype(np.float32)
train_df['lga_dropoff_dist']=getAirport_dropoff_distance(train_df.pickup_longitude, train_df.pickup_latitude, "LGA").astype(np.float32)
train_df['ewr_dropoff_dist']=getAirport_dropoff_distance(train_df.pickup_longitude, train_df.pickup_latitude, "EWR").astype(np.float32)

test_df['jfk_pickup_dist']=getAirport_pickup_distance(test_df.dropoff_longitude, test_df.dropoff_latitude, "JFK").astype(np.float32)
test_df['lga_pickup_dist']=getAirport_pickup_distance(test_df.dropoff_longitude, test_df.dropoff_latitude, "LGA").astype(np.float32)
test_df['ewr_pickup_dist']=getAirport_pickup_distance(test_df.dropoff_longitude, test_df.dropoff_latitude, "EWR").astype(np.float32)

test_df['jfk_dropoff_dist']=getAirport_dropoff_distance(test_df.pickup_longitude, test_df.pickup_latitude, "JFK").astype(np.float32)
test_df['lga_dropoff_dist']=getAirport_dropoff_distance(test_df.pickup_longitude, test_df.pickup_latitude, "LGA").astype(np.float32)
test_df['ewr_dropoff_dist']=getAirport_dropoff_distance(test_df.pickup_longitude, test_df.pickup_latitude, "EWR").astype(np.float32)
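
The two airport helpers above compute the same quantity and differ only in argument names. As a possible simplification, here is a sketch with a single function and a lookup table; `AIRPORTS` and `airport_distance` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Airport coordinates (longitude, latitude) copied from the helpers above.
AIRPORTS = {
    "JFK": (-73.7822222222, 40.6441666667),
    "LGA": (-73.87, 40.77),
    "EWR": (-74.175, 40.69),
}

def airport_distance(lon, lat, airport):
    """Euclidean distance in degrees from (lon, lat) to the named airport.

    Works on scalars, NumPy arrays and pandas Series alike.
    """
    airport_lon, airport_lat = AIRPORTS[airport]  # KeyError on unknown codes
    return np.sqrt((airport_lon - lon) ** 2 + (airport_lat - lat) ** 2)

# Distance from a point in Midtown Manhattan to each airport, in degrees.
for code in AIRPORTS:
    print(code, airport_distance(-73.985, 40.758, code))
```

One behavioral difference: the original helpers silently fall back to the point (0, 0) for an unknown airport code, while the dictionary lookup raises a `KeyError`.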

### Euclidean Distances

Compute euclidean distance between pickup and dropoff points

In [None]:
train_df['euclidean_dist'] = getEuclidean_distance(train_df.pickup_longitude, train_df.pickup_latitude,
                                                   train_df.dropoff_longitude, train_df.dropoff_latitude).astype(np.float32)

test_df['euclidean_dist'] = getEuclidean_distance(test_df.pickup_longitude, test_df.pickup_latitude,
                                       test_df.dropoff_longitude, test_df.dropoff_latitude).astype(np.float32)

### Haversine Distances

Compute haversine distance between pickup and dropoff points
(Formula referenced from: https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points)

In [None]:
%%time
def haversine_np(lon1,lat1,lon2,lat2):
    lon1,lat1,lon2,lat2 = map(np.radians, [lon1,lat1,lon2,lat2])
    dlon=lon2-lon1
    dlat=lat2-lat1
    a=np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c=2*np.arcsin(np.sqrt(a))
    km=6367*c
    return km

train_df['haversine_dist']=haversine_np(train_df.pickup_longitude, train_df.pickup_latitude, train_df.dropoff_longitude, train_df.dropoff_latitude).astype(np.float32)
test_df['haversine_dist']=haversine_np(test_df.pickup_longitude, test_df.pickup_latitude, test_df.dropoff_longitude, test_df.dropoff_latitude).astype(np.float32)
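
As a sanity check of the haversine formula, the scalar sketch below uses the same formula and radius as `haversine_np` to compare one degree of latitude with one degree of longitude near New York. The function name `haversine_km` and the sample coordinates are assumptions for illustration.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367.0):
    """Great-circle distance in km; same formula and radius as haversine_np."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * radius_km * math.asin(math.sqrt(a))

# One degree of latitude versus one degree of longitude at 40.75 N.
north = haversine_km(-73.98, 40.25, -73.98, 41.25)
east = haversine_km(-74.48, 40.75, -73.48, 40.75)
print(round(north, 1), round(east, 1), round(east / north, 3))
```

At this latitude a degree of longitude covers roughly three quarters of the ground distance of a degree of latitude. The Euclidean and Manhattan features are computed on raw degrees and treat both directions equally, so they overweight east-west displacement compared with the haversine distance.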

### Save / Load Cleaned Data

Save preprocessed cleaned data to feather file (don't need to spend computation time again for cleaning)

Feather is fast, interoperable binary data frame storage for python (much faster than reading from csv)

### Save / Load training data from Feather

In [None]:
%%time
# saving training dataframe to feather file
train_df=train_df.reset_index(drop=True)
train_df.to_feather('train.feather')

In [None]:
%%time

# read from feather file (not from original train csv) for fast loading 

train_df = pd.read_feather('train.feather')
train_df.info()

### Save / Load test data from feather

In [None]:
%%time
# saving test dataframe to feather file
test_df=test_df.reset_index(drop=True)
test_df.to_feather('test.feather')

In [None]:
%%time

# read from feather file (not from original test csv) for fast loading 

test_df = pd.read_feather('test.feather')
test_df.info()

## Task 2

Compute the Pearson correlation between the following: (9 pt)
* Euclidean distance of the ride and the taxi fare
* time of day and distance traveled
* time of day and the taxi fare


Since we have computed haversine and manhattan distances in addition to euclidean distance, we might want to get pearson correlation for those as well.

In [None]:
# Haversine Correlation
train_df.haversine_dist.corr(train_df.fare_amount)

In [None]:
# Manhattan Correlation
train_df.manhattan_dist.corr(train_df.fare_amount)

### Correlation between Euclidean distance of the ride and the taxi fare

In [None]:
# Euclidean Correlation
train_df.euclidean_dist.corr(train_df.fare_amount)

From above we can see that all (Euclidean, Manhattan and Haversine) are strongly correlated with fare amount (each with value more than 0.8)
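
For reference, the Pearson correlation returned by `Series.corr` is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of that computation follows; `pearson_r` and the sample values are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical rides: distance in degrees and fare in dollars.
distance = [0.01, 0.02, 0.03, 0.05, 0.08]
fare = [4.5, 6.0, 8.5, 12.0, 19.0]

r = pearson_r(distance, fare)
# The manual value should agree with NumPy's built-in estimate.
print(r, np.corrcoef(distance, fare)[0, 1])
```

Because the coefficient is unaffected by rescaling either variable, measuring distance in degrees or in kilometers does not change it, as long as the conversion is linear.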

### Correlation between time of day and distance traveled

For time of day, we have three options:
    1. Take continuous values of hours (0 to 23)
    2. Take minutes elapsed (from 12am)
    3. Take seconds elapsed (from 12am)

Here time of day is calculated as seconds elapsed from midnight

In [None]:
%%time

time_of_day = train_df.pickup_datetime.dt.hour * 3600 + train_df.pickup_datetime.dt.minute * 60 + train_df.pickup_datetime.dt.second

In [None]:
# time of day and distance traveled correlation
time_of_day.corr(train_df.euclidean_dist)

### Correlation between time of day and the taxi fare

In [None]:
# time of day and the taxi fare correlation
time_of_day.corr(train_df.fare_amount)

As seen, time of day is almost uncorrelated with both distance traveled and fare amount.
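
A Pearson coefficient near zero only rules out a linear relation. The sketch below uses a made-up, symmetric daily pattern to show that a strong dependence on the hour can still produce a coefficient of about zero; the data are synthetic and say nothing about the actual taxi fares.

```python
import numpy as np

# Hypothetical pattern: the value is lowest around midday and highest at night,
# symmetric around 11:30, repeated for 50 days.
hours = np.tile(np.arange(24), 50)
value = (hours - 11.5) ** 2

r = np.corrcoef(hours, value)[0, 1]

# Mean value for each hour of the day (what a pandas groupby would return).
hourly_mean = np.array([value[hours == h].mean() for h in range(24)])

print(round(abs(r), 6))
print(hourly_mean.min(), hourly_mean.max())
```

For this reason it can be worth comparing per-hour averages, or plotting them, before concluding that time of day carries no information.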

### Highest Correlation

* Highest correlation is between Euclidean distance of the ride and the taxi fare

## Task 3

For each subtask of (2), create a plot visualizing the relation between the variables. Comment on whether you see non-linear or any other interesting relations. (9 pt)

### Scatter plot between Euclidean distance of the ride and the taxi fare

* In the Distance vs Fare graph, some records with a low Euclidean distance have a higher fare than certain long-distance records, although for distances below 0.5 the distance appears strongly correlated with fare.

* In the Fare vs Distance graph, apart from the top-left region where the fare is near 0 and the distance is unusually high, the points show a strong correlation: as fare increases, distance increases as well.

These two graphs are shown below.

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(train_df.euclidean_dist[0:10**3],train_df.fare_amount[0:10**3])
plt.title('Distance vs Fare')
plt.xlabel('Distances')
plt.ylabel('Fare')

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(train_df.fare_amount[0:10**3],train_df.euclidean_dist[0:10**3])
plt.title('Fare vs Distance')
plt.xlabel('Fare')
plt.ylabel('Distances')

### Scatter plot between time of day and distance traveled

* The plot shows that distance traveled is uncorrelated with time of day (distances are similar across different times).

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(time_of_day[0:10**3],train_df.euclidean_dist[0:10**3])
plt.title('Time vs Distance')
plt.xlabel('Seconds elapsed from midnight')
plt.ylabel('Distances')

### Scatter plot between time of day and the taxi fare

The plot shows that fare is uncorrelated with time of day (fares are distributed similarly across the day).

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(time_of_day[0:10**3],train_df.fare_amount[0:10**3])
plt.title('Time vs Fare')
plt.xlabel('Seconds elapsed from midnight')
plt.ylabel('Fare')

In [None]:
del time_of_day # free up bit memory

## Task 4
Create an exciting plot of your own using the dataset that you think reveals something very interesting.   Explain what it is, and anything else you learned. (15 pt)

Here are 4 interesting plots.

### Plot between Hour of day vs Number of Rides (during that time)

Observations:
* 5am (early morning) has the least number of rides.
* 7pm has the most rides.
* The number of taxi rides stays high throughout work hours and increases further from 6pm to 10pm.
(This may be due to the evening rush hour, with people going home from work, in addition to people heading out to restaurants, malls, movies or tourist spots in the evening.)

In [None]:
plt.figure(figsize=(15,7))
plt.hist(train_df.pickup_datetime.dt.hour, bins=24, range=(0, 24))  # one bin per hour of the day
plt.title('Hour vs Number of Rides')
plt.xlabel('Hour')
plt.ylabel('Frequency')

### Plot between Hour of day vs Fare (during that time)

Observations:
* The fare amount seems similar throughout the entire day. (nothing like spike during rush hours or night time)

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(x=train_df.pickup_datetime[0:10**6].dt.hour, y=train_df['fare_amount'][0:10**6], s=1.5)
plt.title('Hour vs Fare')
plt.xlabel('Hour')
plt.ylabel('Fare')

### Heatmap of Pickup points on NYC map

Reference: https://alysivji.github.io/getting-started-with-folium.html

Observations:
* Most pickup points are in Manhattan (more specifically in Lower Manhattan, Midtown, the Upper East Side and the Upper West Side).
* Certain areas in Brooklyn (Dumbo, near the Manhattan and Brooklyn Bridges, and Williamsburg) and Long Island City in Queens account for most of the pickup points in those two boroughs.
* The airports (JFK and LGA, both in Queens) show very high heatmap intensity (suggesting people frequently take taxis from there to the city).
* On zooming in, hotspots like Penn Station, Times Square, Port Authority, Grand Central, South Ferry and Columbus Circle have high pickup counts (maybe due to commuters taking cabs from Penn Station, Port Authority, Grand Central and South Ferry (Staten Island people :P), and from office areas around Columbus Circle and Times Square).

In [None]:
# initialize map with first row from training data as coordinates
m = folium.Map([40.721317, -73.844315], zoom_start=11)

for index, row in train_df[0:1000].iterrows():
    folium.CircleMarker([row['pickup_latitude'], row['pickup_longitude']],
                        radius=0.00001,
                        fill_color="#3db7e4"
                       ).add_to(m)
    
# convert to (n, 2) nd-array format for heatmap
stationArr = train_df[0:2500][['pickup_latitude', 'pickup_longitude']].values  # as_matrix() was removed in pandas 1.0

# plot heatmap
m.add_child(plugins.HeatMap(stationArr, radius=15))
m

In [None]:
del stationArr
del m

### Heatmap of Dropoff points on NYC map

Reference: https://alysivji.github.io/getting-started-with-folium.html

Observations:
* Unlike pickup points, which are mostly concentrated in or near Manhattan, dropoff points spread a little farther out. Most dropoff points are still in Manhattan, now also covering Harlem, the western side of Uptown and part of the Bronx.
* Certain areas in Brooklyn (Dumbo, near the Manhattan and Brooklyn Bridges, and Williamsburg, this time extending up to Prospect Park) and Long Island City plus Jamaica in Queens account for most of the dropoff points in those two boroughs. (Jamaica maybe because people take a cab there and then the AirTrain.)
* Again the airports (JFK and LGA in Queens, and Newark (EWR) in NJ) show very high heatmap intensity.

In [None]:
m = folium.Map([40.712276, -73.841614], zoom_start=11)

for index, row in train_df[0:1000].iterrows():
    folium.CircleMarker([row['dropoff_latitude'], row['dropoff_longitude']],
                        radius=0.00001,
                        fill_color="#3db7e4"
                       ).add_to(m)
    
# convert to (n, 2) nd-array format for heatmap
stationArr = train_df[0:2500][['dropoff_latitude', 'dropoff_longitude']].values  # as_matrix() was removed in pandas 1.0

# plot heatmap
m.add_child(plugins.HeatMap(stationArr, radius=15))
m

In [None]:
del stationArr
del m

In [None]:
train_df=train_df[0:10**5]
train_df=train_df.drop(['manhattan_dist','haversine_dist'],axis=1)

## Task 5

Generate additional features like those from (2) from the given data set. What additional features can you create? (10 pt)


### Additional Features:

* Distance from / to Airports (already added 6 new columns as features as done above while cleaning dataset)

* Cross Borough Travel: Using NYC Open Data for Borough Boundaries (https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm), get the shape file containing the boundary polygons of each of the five boroughs. Find out whether the pickup and dropoff points of each taxi ride are in the same borough or not. (part of external dataset as well)

* Compute actual road driving distance between points using OSRM (Open Source Routing Machine) APIs. (part of external dataset as well)

OSRM and NYC Open Data are used as part of external datasets incorporated.

### Feature: Cross Borough Travel

Using fiona to read shape file and display each borough shapes

multipolys = fiona.open("../input/nycboundariesshape/borough.shp")
print(multipolys.schema)
#len(multipolys)

Convert longitude and latitude to point object necessary for checking whether it is inside or outside borough boundary

%%time

pickup_points = train_df.apply(lambda x: Point((float(x.pickup_longitude), float(x.pickup_latitude))), axis=1)
dropoff_points = train_df.apply(lambda x: Point((float(x.dropoff_longitude), float(x.dropoff_latitude))), axis=1)
train_df["cross_borough"]=0

# Borough name and group number (0,1,2,3,4)
poly  = geopandas.GeoDataFrame.from_file('../input/nycboundariesshape/borough.shp')
print(poly.boro_name)

In [None]:
# Optimized method to compute whether point is inside any of the polygons or not
'''
%%time

tstPT = train_df.apply(lambda x: Point((float(x.pickup_longitude), float(x.pickup_latitude))), axis=1)
crs = {'init': 'epsg:27700'}
gdf = geopandas.GeoDataFrame(train_df, crs=crs, geometry = tstPT)

pointInPolys = sjoin(gdf, poly, how='left')
pickup_grouped = pointInPolys.groupby('index_right').groups

tstPT = train_df.apply(lambda x: Point((float(x.dropoff_longitude), float(x.dropoff_latitude))), axis=1)
crs = {'init': 'epsg:27700'}
gdf = geopandas.GeoDataFrame(train_df, crs=crs, geometry = tstPT)

pointInPolys = sjoin(gdf, poly, how='left')
dropoff_grouped = pointInPolys.groupby('index_right').groups
#display(len(dropoff_grouped[0.0]))

###
across_borough=[]
for ind, row in train_df.iterrows():
    in_bronx=(ind in pickup_grouped[1.0]) and (ind in dropoff_grouped[1.0])
    in_staten=(ind in pickup_grouped[2.0]) and (ind in dropoff_grouped[2.0])
    in_brooklyn=(ind in pickup_grouped[3.0]) and (ind in dropoff_grouped[3.0])
    in_queens=(ind in pickup_grouped[4.0]) and (ind in dropoff_grouped[4.0])
    in_manhattan=(ind in pickup_grouped[0.0]) and (ind in dropoff_grouped[0.0])
    if in_manhattan==True or in_bronx==True or in_staten==True or in_brooklyn==True or in_queens==True:
        across_borough.append(0)
    else:
        across_borough.append(1)
        
###
train_df['across_borough']=across_borough 

###
del tstPT
del crs
del gdf
del pointInPolys
del pickup_grouped
del dropoff_grouped

print("Number of rides across borough")
len(train_df[train_df.across_borough==1])
'''

Check whether pickup and dropoff points are in the same borough or not (cross_borough is 1 if they are in different boroughs, else 0)

# slower method to work on smaller dataset
%%time

already_visited=set()  # a set gives constant-time membership checks
for multi in multipolys:
    shp=shape(multi['geometry'])
    print(multi["properties"])
    for i in range(len(pickup_points)):
        if i%20000==0:
            print(i)
        if i in already_visited:
            continue
        a_in=pickup_points[i].within(shp)
        b_in=dropoff_points[i].within(shp)
        # exactly one endpoint inside this borough means the ride crosses a borough boundary
        if a_in != b_in:
            # .at avoids chained assignment, which may write to a copy
            train_df.at[i, "cross_borough"]=1
            already_visited.add(i)
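
The flag logic of the loop above can be sketched without shapely. The rectangles below are hypothetical stand-ins for the borough polygons, and a bounding-box test replaces the `within` check; only the comparison between the pickup region and the dropoff region mirrors the notebook.

```python
# Hypothetical rectangular "boroughs" as (min_lon, min_lat, max_lon, max_lat).
# Real borough boundaries are irregular polygons; rectangles keep the sketch short.
REGIONS = {
    "west": (-74.05, 40.70, -73.95, 40.80),
    "east": (-73.95, 40.70, -73.85, 40.80),
}

def region_of(lon, lat):
    """Return the name of the region containing the point, or None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in REGIONS.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return name
    return None

def cross_region(pickup, dropoff):
    """1 if pickup and dropoff fall in different regions, else 0."""
    return int(region_of(*pickup) != region_of(*dropoff))

print(cross_region((-74.00, 40.75), (-73.99, 40.76)))  # both in "west"
print(cross_region((-74.00, 40.75), (-73.90, 40.75)))  # "west" to "east"
print(cross_region((-74.00, 40.75), (-73.00, 40.75)))  # dropoff outside every region
```

As in the loop, a ride with exactly one endpoint outside every region counts as crossing, while a ride with both endpoints outside every region does not.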

Number of cross borough rides among the first 100K records: 11727

len(train_df[train_df.cross_borough==1])

del pickup_points
del dropoff_points

Check correlation between cross_borough vs Fare amount: 0.6375859534462518 (positive correlation)

train_df.cross_borough.corr(train_df.fare_amount)

Create new feature for test data as well (similar process to train set)

multipolys = fiona.open("borough.shp")
print(multipolys.schema)

%%time

pickup_points = test_df.apply(lambda x: Point((float(x.pickup_longitude), float(x.pickup_latitude))), axis=1)
dropoff_points = test_df.apply(lambda x: Point((float(x.dropoff_longitude), float(x.dropoff_latitude))), axis=1)
test_df["cross_borough"]=0

%%time

already_visited=set()  # a set gives constant-time membership checks
for multi in multipolys:
    shp=shape(multi['geometry'])
    print(multi["properties"])
    for i in range(len(pickup_points)):
        if i%2000==0:
            print(i)
        if i in already_visited:
            continue
        a_in=pickup_points[i].within(shp)
        b_in=dropoff_points[i].within(shp)
        # exactly one endpoint inside this borough means the ride crosses a borough boundary
        if a_in != b_in:
            # .at avoids chained assignment, which may write to a copy
            test_df.at[i, "cross_borough"]=1
            already_visited.add(i)
            
len(test_df[test_df.cross_borough==1])

## Task 7 (Task 6 to follow next)

Consider external datasets that may be helpful to expand your feature set. Give bullet points explaining all the datasets you could identify that would help improve your predictions. If possible, try finding such datasets online to incorporate into your training. List any that you were able to use in your analysis. (10 pt)


* NYC Open Data to get borough boundaries shape (https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm) (as shown above to construct new feature: cross borough travel, giving correlation of 0.64 with fare amount)

* OSRM (Open Source Routing Machine) APIs to compute actual road distances between two points. (Since the API has limits, it is used here on the first 1 million records to compute distances for small data models, not for the highest ranked model achieved on Kaggle.) OSRM achieves a correlation of 0.82 with fare amount.


Query the OSRM HTTP API with the pickup and dropoff coordinates to get the actual driving distance.

%%time

train_df["actual_dist"]=0.0  # float column, since OSRM distances are not integers
for index, row in train_df.iterrows():
    if index%10000==0:
        print(index)
    url = "http://router.project-osrm.org/route/v1/driving/"+str(round(row["pickup_longitude"],6))+","+str(round(row["pickup_latitude"],6))+";"+str(round(row["dropoff_longitude"],6))+","+str(round(row["dropoff_latitude"],6))+"?overview=false"
    contents = urllib.request.urlopen(url).read()
    data = json.loads(contents.decode('utf8'))
    # .at avoids chained assignment, which may write to a copy
    train_df.at[index, "actual_dist"]=data["routes"][0]["legs"][0]["distance"]
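
The request and the response handling can be separated into two small helpers, which makes them testable without network access. This is a sketch only: `osrm_route_url` and `first_leg_distance` are hypothetical names, and the response bodies are trimmed, made-up examples that keep just the fields the loop above reads.

```python
import json

OSRM_BASE = "http://router.project-osrm.org/route/v1/driving/"

def osrm_route_url(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """Build the route URL: coordinates are lon,lat pairs separated by ';'."""
    coords = "{:.6f},{:.6f};{:.6f},{:.6f}".format(
        pickup_lon, pickup_lat, dropoff_lon, dropoff_lat)
    return OSRM_BASE + coords + "?overview=false"

def first_leg_distance(body):
    """Return the first leg's distance from a JSON body, or None if no route is present."""
    data = json.loads(body)
    routes = data.get("routes") or []
    if not routes or not routes[0].get("legs"):
        return None
    return routes[0]["legs"][0]["distance"]

# Hypothetical, heavily trimmed response bodies for illustration only.
ok_body = '{"code": "Ok", "routes": [{"legs": [{"distance": 1234.5}]}]}'
no_route_body = '{"code": "NoRoute"}'

print(osrm_route_url(-73.985, 40.758, -73.7822, 40.6442))
print(first_leg_distance(ok_body), first_leg_distance(no_route_body))
```

Returning `None` for a missing route lets the caller decide how to fill the gap, instead of the loop stopping with a `KeyError` partway through the data. The `sleep` function imported at the top could also be used to pause between requests on a rate-limited server.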

Correlation between actual road distance obtained from OSRM against fare amount

train_df.actual_dist.corr(train_df.fare_amount)

# drop the geometry field if it is present; it is no longer required
train_df=train_df.drop(['geometry'],axis=1,errors='ignore')

Calculate OSRM distances for test data as well

%%time

test_df["actual_dist"]=0.0  # float column, since OSRM distances are not integers

for index, row in test_df.iterrows():
    if index%1000==0:
        print(index)
    url = "http://router.project-osrm.org/route/v1/driving/"+str(round(row["pickup_longitude"],6))+","+str(round(row["pickup_latitude"],6))+";"+str(round(row["dropoff_longitude"],6))+","+str(round(row["dropoff_latitude"],6))+"?overview=false"
    contents = urllib.request.urlopen(url).read()
    data = json.loads(contents.decode('utf8'))
    # .at avoids chained assignment, which may write to a copy
    test_df.at[index, "actual_dist"]=data["routes"][0]["legs"][0]["distance"]

### Saving final train / test data to Feather files

In [None]:
%%time
# saving training dataframe to feather file
train_df=train_df.reset_index(drop=True)
train_df.to_feather('final_train.feather')

Remove not required columns from test data

In [None]:
test_df=test_df.drop(['manhattan_dist','haversine_dist'],axis=1)

In [None]:
%%time
# saving test dataframe to feather file
test_df=test_df.reset_index(drop=True)
test_df.to_feather('final_test.feather')

Again read train / test data from feather file

Now we will build models using this data.

In [None]:
%%time

# read from feather file (not from original train csv) for fast loading 

train_df = pd.read_feather('final_train.feather')
train_df.info()

In [None]:
%%time

# read from feather file (not from original test csv) for fast loading 

test_df = pd.read_feather('final_test.feather')
test_df.info()

Check correlation between features before building models.

High positive correlation of fare amount with euclidean distance, cross borough travel and actual distance computed by OSRM.

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
corr = train_df.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)

In [None]:
train_df.corr(method='pearson', min_periods=1)

Drop certain features from train and test data

In [None]:
key = test_df['key']
test_df = test_df.drop(['key','pickup_datetime'],axis=1)

In [None]:
y_train_fare=train_df['fare_amount']
train_df=train_df.drop(['fare_amount','pickup_datetime'],axis=1)

In [None]:
### Drop any more features if you want

#test_df = test_df.drop(['passenger_count','lga_pickup_dist','lga_dropoff_dist','actual_dist'],axis=1)

#train_df=train_df.drop(['passenger_count','lga_pickup_dist','lga_dropoff_dist','actual_dist'],axis=1)

## Task 6
Set up a simple linear regression model to predict taxi fare. Use your generated features from the previous task if applicable. How well/badly does it work? What are the coefficients for your features? Which variable(s) are the most important one? (12 pt)


Split the training set into a train part and a held-out test part, so that the linear regression can be fitted on one and evaluated on the other. Use sklearn's train_test_split function to achieve this.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df,y_train_fare, test_size=0.01)

### Simple Linear Regression Model

In [None]:
lm = LinearRegression()
lm.fit(X_train,y_train)
print(lm.score(X_train,y_train))
print(lm.score(X_test,y_test))

The model works well on the held-out part taken from the training data (its score is similar to the score on the training part).

Coefficients for features are computed and outputted as follows:

In [None]:
train_df.info()

print('Intercept', round(lm.intercept_, 4))

# Pair each coefficient with its column name instead of hard-coding positions,
# so the labels stay correct when features are added or dropped.
for name, coef in zip(train_df.columns, lm.coef_):
    print(name, 'coef:', round(coef, 4))

### Important features:
    euclidean_dist, ewr_pickup_dist, lga_pickup_dist (positively correlated with fare)
    lga_dropoff_dist, dropoff_longitude, pickup_longitude (negatively correlated with fare)
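
One caveat when ranking features by coefficient: raw coefficients depend on each feature's units, so a feature with a tiny numeric spread (such as a longitude in degrees) can get a large coefficient without contributing much to the prediction. A common remedy is to standardize the features before fitting. The sketch below shows the effect on synthetic data; the frame `toy_X`, the target `toy_y`, and all numbers are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
toy_X = pd.DataFrame({
    "euclidean_dist": rng.uniform(0.5, 20.0, n),       # wide numeric range
    "pickup_longitude": rng.normal(-73.98, 0.03, n),   # very narrow range
})
toy_y = (2.5
         + 2.0 * toy_X["euclidean_dist"]
         + 10.0 * (toy_X["pickup_longitude"] + 73.98)
         + rng.normal(0.0, 1.0, n))

# Raw coefficients: change in fare per one unit of each feature.
raw = pd.Series(LinearRegression().fit(toy_X, toy_y).coef_, index=toy_X.columns)

# Standardized coefficients: change in fare per one standard deviation.
X_scaled = StandardScaler().fit_transform(toy_X)
standardized = pd.Series(LinearRegression().fit(X_scaled, toy_y).coef_,
                         index=toy_X.columns)
ranked = standardized.reindex(standardized.abs().sort_values(ascending=False).index)

print("raw:\n", raw)
print("standardized, ranked:\n", ranked)
```

In this toy setup the longitude has the larger raw coefficient, but after scaling the distance feature dominates, which matches how much each one actually moves the prediction.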

Check the RMSE on the entire training set with linear regression.

In [None]:
y_pred = lm.predict(train_df)
lrmse = np.sqrt(metrics.mean_squared_error(y_train_fare, y_pred))
lrmse

Now, let's predict on the test data and write the result to a CSV file for Kaggle upload.
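
Every model in this notebook writes to the same `submission.csv`, so each run overwrites the previous file. One way to keep a file per model is a small helper like the hypothetical `write_submission` below; the function name, its parameters, and the file naming scheme are assumptions for illustration and not part of the original notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(keys, predictions, model_name, out_dir="."):
    """Hypothetical helper: save one Kaggle submission file per model."""
    path = os.path.join(out_dir, f"submission_{model_name}.csv")
    frame = pd.DataFrame(
        {"key": keys, "fare_amount": np.round(predictions, decimals=2)},
        columns=["key", "fare_amount"],
    )
    frame.to_csv(path, index=False)
    return path


# Usage with toy keys and predictions, written to a temporary directory.
toy_keys = pd.Series(["a", "b", "c"])
out_dir = tempfile.mkdtemp()
linear_path = write_submission(toy_keys, np.array([10.123, 7.5, 3.999]), "linear", out_dir)
written = pd.read_csv(linear_path)
print(written)
```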

In [None]:
LinearPredictions = lm.predict(test_df)
LinearPredictions = np.round(LinearPredictions, decimals=2)
LinearPredictions

linear_submission = pd.DataFrame({"key": key,"fare_amount": LinearPredictions},columns = ['key','fare_amount'])
linear_submission.to_csv('submission.csv', index = False)

## Task 8

Now, try to build a better prediction model that works harder to solve the task. Perhaps it will still use linear regression but with new features. Perhaps it will preprocess features better (e.g. normalize or scale the input vector, convert non-numerical value into float, or do a special treatment of missing values). Perhaps it will use a different machine learning approach (e.g. nearest neighbors, random forests, etc). Briefly explain what you did differently here versus the simple model. Which of your models minimizes the squared error? (10 pt)


Retrain using xgboost and random forest.

Both give better results than linear regression, most likely because tree ensembles capture non-linear relationships and feature interactions that a linear model cannot represent. With this much data, the linear model is more likely underfitting than overfitting.

### xgboost

Below is xgboost with a squared-error regression objective, the learning rate set to 0.1, RMSE as the evaluation metric, and 50 boosting rounds.

It runs quickly and gives better results than linear regression, most likely because each boosting round fits a new tree to the errors left by the previous rounds, which lets the model pick up non-linear effects.
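
The core idea of gradient boosting with a squared-error loss can be sketched in a few lines: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a scaled-down copy of its output. The code below is a simplified illustration on synthetic data using sklearn's `DecisionTreeRegressor`; it is not how xgboost is implemented internally (xgboost adds regularization and second-order gradient information, among other things), and the data and settings are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target that a straight line cannot fit well.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = 5.0 * np.sin(X[:, 0]) + X[:, 0]

learning_rate = 0.1
n_rounds = 50

# Round 0: predict the mean everywhere.
pred = np.full(y.shape, y.mean())
rmse_history = [float(np.sqrt(np.mean((y - pred) ** 2)))]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    # Take a small step in the direction the tree suggests.
    pred = pred + learning_rate * tree.predict(X)
    rmse_history.append(float(np.sqrt(np.mean((y - pred) ** 2))))

print("training RMSE before boosting:", rmse_history[0])
print("training RMSE after boosting: ", rmse_history[-1])
```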

In [None]:
dtrain = xgb.DMatrix(train_df, label=y_train_fare)
dtest = xgb.DMatrix(test_df)
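params = {'max_depth':7,
          'verbosity':0,                   # suppress log output (formerly 'silent')
          'objective':'reg:squarederror',  # squared-error regression (formerly 'reg:linear')
          'eval_metric':'rmse',
          'learning_rate':0.1              # 'learning_rate' is an alias of 'eta', so set only one of them
         }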
num_rounds = 50
xb = xgb.train(params, dtrain, num_rounds)
y_pred_xgb = xb.predict(dtest)
print(y_pred_xgb)
xgb_submission = pd.DataFrame({"key": key,"fare_amount": y_pred_xgb},columns = ['key','fare_amount'])
xgb_submission.to_csv('submission.csv', index = False)

### Random Forest

Use sklearn's `RandomForestRegressor` to build the model and predict on the test data.

It takes a long time to run on the entire dataset but gives the lowest RMSE of the three models.
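
Two practical points may help here. Training can be spread across CPU cores with `n_jobs=-1`, and the RMSE measured on the training rows is usually optimistic for a random forest, so a held-out split gives a more honest estimate. The sketch below shows both on synthetic data; the data, the tree count, and the split size are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem with unit-variance noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(2000, 3))
y = 3.0 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0.0, 1.0, size=2000)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# n_jobs=-1 uses all available cores; random_state fixes the result.
rf_sketch = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
rf_sketch.fit(X_tr, y_tr)

rmse_train = np.sqrt(mean_squared_error(y_tr, rf_sketch.predict(X_tr)))
rmse_val = np.sqrt(mean_squared_error(y_val, rf_sketch.predict(X_val)))

print("RMSE on training rows:", rmse_train)
print("RMSE on held-out rows:", rmse_val)
```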

In [None]:
rf = RandomForestRegressor()
rf.fit(train_df,y_train_fare)
rf_predict = rf.predict(test_df)

In [None]:
y_pred = rf.predict(train_df)
lrmse = np.sqrt(metrics.mean_squared_error(y_pred, y_train_fare))
lrmse

In [None]:
rf_predict

In [None]:
rf_submission = pd.DataFrame({"key": key,"fare_amount": rf_predict},columns = ['key','fare_amount'])
rf_submission.to_csv('submission.csv', index = False)

Random Forest achieves the lowest RMSE on Kaggle (best RMSE: 3.35137).

## Task 9

Predict all the taxi fares for instances at file “sample_submission.csv”. Write the result into a csv file and submit it to the website. You should do this for every model you develop. Report the rank, score, number of entries, for your highest rank. Include a snapshot of your best score on the leaderboard as confirmation. (15 pt)


* Linear Regression: Best RMSE 4.33129
* Xgboost: Best RMSE 3.68630
* Random Forest: Best RMSE 3.35137

Total Number of Entries: 12

Profile Link: https://www.kaggle.com/yashah19
    
![Kaggle Rank / RMSE](kaggle.PNG)