## Data analysis, feature engineering, and data cleaning, with some bonus performance tips ##

We will load the data and do some analysis, feature engineering and data cleaning. 

The features will include measures of trip speed, direction, and alignment with the New York City street grid, as well as a breakdown of the timestamp into more useful components.

On the way, we'll learn how to avoid a common pandas performance issue, and find some interesting groups of outliers.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

Import various utilities for later

In [None]:
# date/time maths
import datetime as datetime

# plotting tools
import matplotlib.pyplot as plt
import seaborn as sns

# general maths
import math as math

In [None]:
train_data = pd.read_csv("../input/train.csv", parse_dates=['pickup_datetime','dropoff_datetime'])
train_data.head()

Get a summary of the data. Using `include='all'` means the datetime columns are included; by default non-numerics are excluded.

In [None]:
train_data.describe(include ='all')

**Interesting points**

 - At least one trip lasted a second! We should investigate outliers at this lower end of trip_duration.
 - At least one trip lasted around 350,000 seconds, or nearly 100 hours! We should investigate outliers at this higher end of trip duration.
 - The data covers the first 6 months of 2016. Any supplementary datasets we gather need to cover this time range.
 - ids are unique. One suspects they won't be of great value.
 - Vendor ID is either 1 or 2, probably distinguishing between two operators.
 - It looks like there aren't missing values as the numbers of entries on the top line are consistently identical, but let's check that to be sure.
 - Passenger count goes up to 9, and is generally low (over 50% of rides are for zero or 1 passenger)
 - Passenger count can be zero. One suspects this is user error by the cabbie, and may line up with the very short trip durations. Let's investigate with those outliers.
 - We have max and min values for latitude and longitude; we should look at outliers there too, but we will need a way to convert those to something we have a better understanding of.


In [None]:
# Check for null entries
train_data.isnull().sum()

Happily there are no missing data points to impute.

## Feature engineering ##
Before we examine the outliers and run algorithms over the data, let's think about which features may help and can be extracted from the data provided.

### Time of day ###
We want to model rush hour, people coming home from clubs at weekends, people heading from work to the airport on Friday evening, and so on. We need time of day, as we expect some cyclical behaviour here.

### Day of week ###
Traffic patterns will change daily due to work habits, people leaving for the weekend, and so forth. 
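As a minimal sketch of both features, assuming a datetime column like `pickup_datetime`, the pandas `.dt` accessor can extract them for the whole column at once. The three-row frame and the `hour_of_day` column name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Illustrative sample only; the real notebook reads pickup_datetime from train.csv.
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 08:30:00",  # a Friday
        "2016-01-02 23:15:00",  # a Saturday
        "2016-01-04 17:45:00",  # a Monday
    ])
})

# Monday = 0 ... Sunday = 6, the same convention as datetime.weekday()
sample["day_of_week"] = sample["pickup_datetime"].dt.weekday

# Hour plus fraction of an hour, giving one numeric time-of-day value
# (hour_of_day is a hypothetical feature name)
sample["hour_of_day"] = (
    sample["pickup_datetime"].dt.hour
    + sample["pickup_datetime"].dt.minute / 60.0
)

print(sample)
```

A numeric hour is often easier for a regression model to use than a `datetime.time` object, though either captures the daily cycle.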

### Distance travelled and approximate speed ###
We can convert the longitude and latitude to a measure of point to point distance, and similarly get a measure of speed "as the crow flies".

The speed can't be used in our model, as it relies on the duration which we are trying to predict. However, we can use it as a sanity check to filter out undesirable outliers from our training data. Unrealistically high speeds likely mean bad readings; unrealistically low speeds over long time periods probably mean someone has hired the cab for a protracted period on retainer. These latter cases will, I anticipate, be unpredictable. Better to accept that we will have some test error from such cases and exclude them than to have them pollute our model for the cases which we can reasonably predict.
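The idea can be sketched as follows, assuming a flat-earth approximation with the same metres-per-degree constants used later in this notebook. The two sample trips, the snake_case function names, and the 100 km/h cutoff are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Approximate metres per degree near 40 degrees north
LAT_SCALE_METRES = 111000
LONG_SCALE_METRES = 85000

def crow_flies_metres(lng1, lat1, lng2, lat2):
    """Straight-line distance using a flat-earth approximation."""
    dx = (lng2 - lng1) * LONG_SCALE_METRES
    dy = (lat2 - lat1) * LAT_SCALE_METRES
    return np.sqrt(dx * dx + dy * dy)

def speed_kmh(metres, seconds):
    """km/hour = (metres / 1000) / (seconds / 3600)."""
    return (metres / 1000.0) / (seconds / 3600.0)

# Two invented trips: one plausible, one covering over a km in 2 seconds
trips = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.98],
    "pickup_latitude": [40.75, 40.75],
    "dropoff_longitude": [-73.97, -73.98],
    "dropoff_latitude": [40.75, 40.76],
    "trip_duration": [300, 2],
})

trips["distance_in_metres"] = crow_flies_metres(
    trips["pickup_longitude"], trips["pickup_latitude"],
    trips["dropoff_longitude"], trips["dropoff_latitude"],
)
trips["speed_in_kmh"] = speed_kmh(trips["distance_in_metres"], trips["trip_duration"])

# Arbitrary illustrative cutoff; the cleaning section chooses one from the data
plausible = trips[trips["speed_in_kmh"] < 100]
print(trips[["distance_in_metres", "speed_in_kmh"]])
```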

### Day of month/month/date ###
We clearly want to separate out the date in some way. How best to do this? We might decide that the day of the month might have cyclical behaviour as we expect the day of the week to. I can't see a good reason for this, but might try to explore it graphically to see if there is some legitimacy to it.

As we have 1.45M rows, spread over approx 180 days (half a year), we have around 8k data points/day. That's enough for a good ML algorithm to work with without breaking it down further. So, instead of using day and month separately, I prefer for now to have the day and month combined into a date.

To make this easy for regression algorithms to work with we will represent date as an offset in days from the 1st Jan 2016. 
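A minimal sketch of that offset, assuming the pickup times are already parsed as pandas datetimes (the sample timestamps are invented):

```python
import pandas as pd

# Invented sample timestamps spanning the first half of 2016
pickups = pd.Series(pd.to_datetime([
    "2016-01-01 00:05:00",
    "2016-02-01 12:00:00",
    "2016-06-30 23:59:59",
]))

# Drop the time of day, then count whole days since 1 Jan 2016
new_year = pd.Timestamp("2016-01-01")
days_from_new_year = (pickups.dt.normalize() - new_year).dt.days

print(days_from_new_year.tolist())  # [0, 31, 181]
```

Because 2016 is a leap year, 30 June falls 181 days after 1 January.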

### Direction of travel ###
Manhattan's streets are laid out on a grid. Traffic going one way in the morning may well reverse flow in the evening.
Traffic aligned with the grid may well have a more direct route. So perhaps direction of travel and/or grid alignment may be useful features.

What direction do the streets go in? This link has a good analysis: http://www.charlespetzold.com/etc/AvenuesOfManhattan/

It turns out the "north/south" streets actually run 29 degrees clockwise of north/south.

### Grid alignment ###
How close are we to running a trip in direct alignment with the Manhattan grid? Is our journey aligned with the Manhattan avenues or streets, or is the trip more along the diagonal of the grid? This is something we can extract from the data.
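One way to sketch this, assuming the trip displacement has already been converted to metres east and metres north, is to measure the heading, rotate it by 29 degrees, and fold the result onto the range 0 to 45. The function name `grid_misalignment_deg` is hypothetical.

```python
import numpy as np

GRID_ROTATION_DEG = 29.0  # avenues run about 29 degrees clockwise of north

def grid_misalignment_deg(east_m, north_m):
    """Angle in degrees (0 to 45) between a displacement and the nearest grid axis."""
    # Heading measured anticlockwise from due east
    direction = np.degrees(np.arctan2(north_m, east_m))
    # Rotate 29 degrees anticlockwise so the grid lines up with the compass axes,
    # then fold onto 0..90; the modulo also handles negative angles.
    folded = (direction + GRID_ROTATION_DEG) % 90.0
    # Anything past 45 degrees is closer to the next axis round
    return np.minimum(folded, 90.0 - folded)

# A trip straight up an avenue (61 degrees anticlockwise from east) scores about 0,
# a trip due north scores 29, and a trip along the grid diagonal scores 45.
avenue = np.radians(61.0)
diagonal = np.radians(16.0)
east = np.array([np.cos(avenue), 0.0, np.cos(diagonal)])
north = np.array([np.sin(avenue), 1.0, np.sin(diagonal)])
print(grid_misalignment_deg(east, north))
```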

### Gridwise distance ###
Imagine the start and end points of a journey aligned with the grid, so the journey doesn't require any turns. The taxi will only need to cover the point to point distance. 
Conversely, if the journey is along the grid diagonal, the journey would have to go along two sides of a triangle, so the required distance would be longer.
In general, if we imagine the journey is on a grid aligned with that of Manhattan, the gridwise distance is the distance covered along the two sides of a right angled triangle where the hypotenuse is between the start and end points of the journey.
This may be a useful feature as it will help any algorithm factor in alignment with the grid.
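In code, the two sides of that triangle are the sine and cosine components of the point to point distance. A minimal sketch, assuming the misalignment angle is given in degrees between 0 and 45 (the function name is illustrative):

```python
import numpy as np

def grid_distance(distance, misalignment_deg):
    """Distance along the two perpendicular sides of the grid triangle."""
    theta = np.radians(misalignment_deg)
    # The hypotenuse is the point to point distance; the taxi covers the
    # other two sides, distance * sin(theta) and distance * cos(theta).
    return distance * (np.sin(theta) + np.cos(theta))

# Aligned with the grid: no extra distance. On the diagonal: sqrt(2) times as far.
print(grid_distance(np.array([1000.0, 1000.0]), np.array([0.0, 45.0])))
```

The ratio of grid distance to straight-line distance therefore ranges from 1 (perfectly aligned) to about 1.414 (along the diagonal).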

## Other data sets ##
Other things I would like to have information on include weather, unusual traffic issues, and major events such as sport home games. These can't be extracted from the data set we have, but perhaps we can find them elsewhere.

## A "gotcha" on generating features in pandas ##
One natural way, coming into Python, to generate these features is to:
- Write a function that operates on a row
- Apply that function to all rows using `dataframe.apply`.

This is shown in one example below, and is the way I first tried (being quite new to Python). What with shunting data around, type inference, and so forth, it can be very slow. It seems particularly bad when there are datatype conversions, as there are with the direction function below.

A faster approach is to use a vectorised version, which takes the input columns of the dataframe (ie pandas series objects) and returns a pandas series object for the new column. 

I have coded both approaches below so you can try for yourself. The slow way using apply takes maybe 20 minutes on my system; the vectorised version a few seconds!

See https://tomaugspurger.github.io/modern-4-performance.html for more information.
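The difference is easy to reproduce on a small synthetic frame. The sketch below uses random coordinates and hypothetical column names (`x1`, `x2`, `y1`, `y2`) rather than the taxi data, and simply times both approaches on the same calculation.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data standing in for pickup/dropoff coordinates
rng = np.random.default_rng(0)
n = 20000
df = pd.DataFrame({
    "x1": rng.random(n), "x2": rng.random(n),
    "y1": rng.random(n), "y2": rng.random(n),
})

def row_distance(row):
    # Operates on one row at a time; pandas builds a Series for every row
    return np.sqrt((row["x2"] - row["x1"]) ** 2 + (row["y2"] - row["y1"]) ** 2)

def column_distance(x1, x2, y1, y2):
    # Operates on whole columns; the arithmetic runs inside NumPy
    return np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

start = time.perf_counter()
slow = df.apply(row_distance, axis=1)
apply_secs = time.perf_counter() - start

start = time.perf_counter()
fast = column_distance(df["x1"], df["x2"], df["y1"], df["y2"])
vector_secs = time.perf_counter() - start

print(f"apply: {apply_secs:.3f}s, vectorised: {vector_secs:.5f}s")
```

Both versions give the same numbers; only the time taken differs, and the gap widens as the row count grows.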

In [None]:
# Let's get some features!
# Showing that apply works - although it's slow. 
train_data['day_of_week'] = train_data['pickup_datetime'].apply(lambda dt: dt.weekday())
train_data['time_of_day'] = train_data['pickup_datetime'].apply(lambda dt: dt.time())



In [None]:
# WARNING - SLOW VERSION WHICH OPERATES ON DATAFRAME ROWS. INCLUDED ONLY SO YOU CAN TRY OUT 
# JUST HOW SLOW THIS IS BY COMPARISON 

# At 40° north or south*, the distance between a degree of longitude is 53 miles (85 kilometers).
# We are pretty much 40 degrees North, so 85 Km will do.
# Each degree of latitude is approximately 69 miles (111 kilometers) apart.
# ref: https://www.thoughtco.com/degree-of-latitude-and-longitude-distance-4070616     

LAT_SCALE_METRES = 111000
LONG_SCALE_METRES = 85000

def latToM(lat) :
    return lat * LAT_SCALE_METRES

def longToM(lng) :
    return lng * LONG_SCALE_METRES

def distanceInM( row ) :
    longDiffm = longDiffM(row)
    latDiffm = latDiffM(row)
    return np.sqrt( longDiffm * longDiffm + latDiffm * latDiffm)

def calcSpeedKmh( metres, secs) :
    # km/hour = (metres / 1000) / (secs/ (60*60) )
    return (3600 * metres) / (secs * 1000)

def speedInKmh (row) :
    return calcSpeedKmh(row['distance_in_metres'], row['trip_duration'])       

def longDiffM(row) :
    return  longToM( abs(row['dropoff_longitude'] - row['pickup_longitude']) )

def latDiffM(row) :
    return latToM( abs(row['dropoff_latitude'] - row['pickup_latitude']))

def daysFromNewYear2016(dt) :
   nyDay = datetime.date(2016,1,1)
   return (dt.date() - nyDay ).days

def direction( row ) :
    # Longitude increases eastwards (New York longitudes are negative, as we are west of London).
    # Latitude increases northwards, as we are north of the equator.
    # tangent = opposite/adjacent, ie lat diff / long diff
    # The angle is measured anticlockwise from due east and scaled so 90 degrees = 1.
    # This function returns, for example:
    # Heading eastwards = 0, northwards = 1, westwards = 2, southwards = -1.
    # Heading northeast = 0.5, northwest = 1.5, southwest = -1.5, southeast = -0.5
    # The closer the absolute value is to 1, the closer we are to going n/s
    return 2 * math.atan2( (row['dropoff_latitude'] - row['pickup_latitude']), (row['dropoff_longitude'] - row['pickup_longitude']))/math.pi

# 0 for going directly e/w or n/s
# 0.5 for a diagonal movement
# other values in between as appropriate.
# likely a more elegant way to do this
def diffFromGridDirection ( row ) :
    nsDiff = abs(abs(row['direction']) - 1)
    ewDiff = min(abs(row['direction']), abs(abs(row['direction']) - 2))
    return min(ewDiff, nsDiff)
                
            

In [None]:
# At 40° north or south*, the distance between a degree of longitude is 53 miles (85 kilometers).
# We are pretty much 40 degrees North, so 85 Km will do.
# Each degree of latitude is approximately 69 miles (111 kilometers) apart.
# ref: https://www.thoughtco.com/degree-of-latitude-and-longitude-distance-4070616     

# Vectorised version.
# See https://tomaugspurger.github.io/modern-4-performance.html for more info on why this is 
# a big speed improvement over apply


LAT_SCALE_METRES = 111000
LONG_SCALE_METRES = 85000

def latToMVec(lat) :
    return lat * LAT_SCALE_METRES

def longToMVec(lng) :
    return lng * LONG_SCALE_METRES

def distanceInMVec( long1, long2, lat1, lat2 ) :
    longDiffm = longDiffMVec(long1, long2)
    latDiffm = latDiffMVec(lat1, lat2)
    return np.sqrt( longDiffm * longDiffm + latDiffm * latDiffm)

def calcSpeedKmhVec( metres, secs) :
    # km/hour = (metres / 1000) / (secs/ (60*60) )
    return (3600 * metres) / (secs * 1000)

def speedInKmhVec ( metres, duration) :
    # Call the helper defined in this cell, so the cell does not depend on the slow version above
    return calcSpeedKmhVec(metres, duration)

def longDiffMVec(long1, long2) :
    return  longToMVec( abs(long2-long1) )

def latDiffMVec(lat1, lat2) :
    return latToMVec( abs(lat2-lat1))

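def daysFromNewYear2016Vec(dt) :
    # Whole days between each pickup date and 1 Jan 2016, computed on the whole column at once.
    # Expects a pandas Series of datetimes.
    return (dt.dt.normalize() - pd.Timestamp("2016-01-01")).dt.days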

def directionVec( long1, long2, lat1, lat2 ) :
    # Longitude increases eastwards (New York longitudes are negative, as we are west of London).
    # Latitude increases northwards, as we are north of the equator.
    # tangent = opposite/adjacent, ie lat diff / long diff
    # I've given results in degrees as it's easier to sanity check the numbers are right.
    # The angle is measured anticlockwise from due east, so this function returns, for example:
    # Heading eastwards = 0, northwards = 90, westwards = 180, southwards = -90.
    # Heading northeast = 45, northwest = 135, southwest = -135, southeast = -45
    # The closer the absolute value is to 90, the closer we are to going n/s
    # Note the differences are in raw degrees, not metres, so the angle is only approximate.
    lngDiff = long2 - long1
    latDiff = lat2 - lat1
    # np.arctan2 works on whole columns at once, so no np.vectorize wrapper is needed
    return np.degrees(np.arctan2(latDiff, lngDiff))
    
    
def shiftAntiClockwise( direction ) :
    newDirection = direction + 29
    # Wrap back into the -180 to 180 range by subtracting a full turn
    if newDirection > 180 :
       newDirection = newDirection - 360
    return newDirection


def gridDiff( direction ) :
    # Work out how many degrees a direction differs from n/s or e/w.
    # First rotate negative directions about e/w to map onto a semicircle
    diff = direction % 90
    
    if diff < 45 :
       return diff
    else :
       return 90 - diff


# 0 for going directly e/w or n/s
# 45 for a perfect diagonal.
def diffFromGridDirectionVec ( direction ) :
    # To make the maths easier, let's rotate our direction 29 degrees anticlockwise.
    # Then we check how close the result is to north/south or east/west alignment.
    
    # np.vectorize lets the scalar helpers above accept whole columns. It is a convenience
    # wrapper (essentially a Python loop) rather than true vectorisation. See:
    # https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
    # https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html    
    vFuncShift = np.vectorize( shiftAntiClockwise )
    vFuncDiff = np.vectorize( gridDiff )
    return vFuncDiff(vFuncShift(direction ))
            
   
def gridDistance ( distance, diffFromGridDirection) :
    radians = (diffFromGridDirection * math.pi)/180
    return math.sin( radians ) * distance + math.cos( radians ) * distance

def gridDistanceVec( distance, diffFromGridDirection) :
     vFuncDist = np.vectorize(gridDistance)
     return vFuncDist(distance, diffFromGridDirection)
        
        
    

In [None]:
train_data['distance_in_metres'] = distanceInMVec(train_data['pickup_longitude'], train_data['dropoff_longitude'], train_data['pickup_latitude'], train_data['dropoff_latitude'])
train_data['days_from_new_year'] = daysFromNewYear2016Vec(train_data['pickup_datetime'])
train_data['direction'] = directionVec(train_data['pickup_longitude'], train_data['dropoff_longitude'], train_data['pickup_latitude'], train_data['dropoff_latitude'])    


In [None]:
train_data['speed_in_kmh'] = speedInKmhVec(train_data['distance_in_metres'], train_data['trip_duration'])
train_data['diffFromGridDirection'] = diffFromGridDirectionVec(train_data['direction'])


In [None]:
train_data['grid_distance'] = gridDistanceVec( train_data['distance_in_metres'], train_data['diffFromGridDirection'] )

In [None]:
# Slow way, using apply
#train_data['distance_in_metres'] = train_data.apply(lambda row: distanceInM(row),axis=1)
#train_data['grid_distance_in_metres'] = train_data.apply(lambda row: gridDistanceInM(row),axis=1)
#train_data['direction'] = train_data.apply(lambda row: direction(row),axis=1)
#train_data['days_from_new_year'] = train_data['pickup_datetime'].apply(lambda dt: daysFromNewYear2016(dt))




In [None]:
# Slow way, using apply
#train_data['speed_in_kmh'] = train_data.apply(lambda row: speedInKmh(row),axis=1)
#train_data['grid_speed_in_kmh'] = train_data.apply(lambda row: gridSpeedInKmh(row),axis=1)
#train_data['diffFromHorVert'] = train_data.apply(lambda row: diffFromHorVert(row),axis=1)

In [None]:
train_data.describe()

A quick sanity check.

 - Average speed is 14 km/h. This seems slow, but remember it's speed as the crow flies; most routes will be far from straight lines. So this seems to be in the right ballpark.
 - The value of 9000 km/h seems wrong, and looking at that row, the pickup and dropoff longitude are identical to 6 decimal places, which seems deeply suspicious. It looks like using abnormally high (and perhaps low) speeds will indeed help identify and filter bogus data.
 - Max days since new year looks right - 181.
 - At least one trip covers an abnormal distance. We should drill into those.
 - Grid alignments range from 0 to 45. We should check some of the close to zero values to see if they align on a map.
 - A reminder that we need to review those zero passenger and very low duration trips.

## Cleaning the data ##
First, let's look at those abnormally high speeds. Do they correlate with other outlier values? Do they look like genuinely bogus data or a problem with our preprocessing? What's a sensible cutoff for filtering them?

### Speed outliers ###





In [None]:
train_data[train_data['speed_in_kmh'] > 200]

The good-ish news is that these are only 68 rows in 1.45M: approx 1 row in 21,000.

Many of these have very low trip duration. We might hope that they are due to noise in the very low duration, but this is not realistic in all cases. Some rows have managed to travel a km in a few seconds.

Clearly those with very high speed are not legit. Such speeds are not realistic and must be due to faulty readings (or perhaps bad code on our side...with more time I would sanity check a few values on google maps and calculate "by hand" to ensure there's been no mistake).

We will need to strip some of these outliers from our training data, but we don't want to throw out the baby with the bathwater. First we will visualise the data, and get a feel for the kind of distribution we have for distance, duration and speed. This will help inform our cutoff process for outliers.

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(train_data['speed_in_kmh'].values, bins=50, kde=True)
plt.xlabel('speed', fontsize=12)
plt.show()          

Wow! The really high outliers skew things hugely. Let's strip those out of the plot; there are only a few of them, after all.

In [None]:
plt.figure(figsize=(12,8))
#sns.distplot(train_data['speed_in_kmh'].values, bins=50, kde=True)

sns.distplot(train_data[train_data['speed_in_kmh'] < 200]['speed_in_kmh'].values, bins = 100, kde=True)
plt.xlabel('speed', fontsize=12)
plt.show()

This looks more like what we should expect. Traffic crawling along. Happily, this looks not too far from a normal distribution. It looks like a cutoff for speed between 50 and 100 kmh should work to get rid of some of the bogus data without affecting our training set much. Let's check that.

In [None]:
train_data['speed_in_kmh'].quantile(q= (0.5,0.75, 0.99, 0.995, 0.999, 0.9995, 0.9997, 0.9999, 0.99999))

At this point, I wondered whether the faster trips had anything in common. It may be that they are legitimately going down freeways at night, for example. The filter below picks out trips between 65 and 120 km/h that cover more than 1 km.

In [None]:
train_data[ (train_data['speed_in_kmh'] > 65) & (train_data['speed_in_kmh'] < 120) & (train_data['distance_in_metres'] > 1000) ]

Spot checking a few of these just confirmed they were almost certainly bad data: Google Maps suggested 15 mile journeys that should take 35 minutes, yet they were recorded as taking 5 minutes.

For now I will cut off at the 0.99 quantile, so 40.853 km/h. I'm unsure whether I'm excluding some legit data points, and indeed whether I'm still including some bad ones; it would be good to do further analysis later if there is time.
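Rather than typing the cutoff in by hand, it can be derived from the data so that it stays in step if the features are recomputed. The sketch below uses synthetic speeds (a made-up gamma distribution plus a few invented outliers), not the taxi data.

```python
import numpy as np
import pandas as pd

# Synthetic speeds: mostly slow city traffic, plus a handful of bogus readings
rng = np.random.default_rng(42)
speeds = pd.Series(rng.gamma(shape=4.0, scale=3.5, size=10000), name="speed_in_kmh")
speeds.iloc[:5] = [250.0, 900.0, 9000.0, 400.0, 1200.0]

# Take the cutoff from the 0.99 quantile instead of hard-coding a number
cutoff = speeds.quantile(0.99)
kept = speeds[speeds <= cutoff]

print(f"cutoff: {cutoff:.3f} km/h, kept {len(kept)} of {len(speeds)} rows")
```

On the real frame the same pattern would be `train_data['speed_in_kmh'].quantile(0.99)`, used in place of the literal value.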

In [None]:
train_data_cleaned = train_data[train_data['speed_in_kmh'] <= 40.853]

In [None]:
train_data[ (train_data['trip_duration'] > 20000) ]

Now, this is interesting. These long journeys all seem to last a similar amount of time, around 86000 seconds, and it looks like there is similarity in the pickup and dropoff point. Also, they all belong to taxi firm 2.

86000 seconds is close to 24 hours (86,400 seconds). It looks like there are cases where taxi firm 2 hires you a taxi for 24 hours.

It's hard to know how we can handle these cases, unless we can find some way to identify them from the rest of the data. They are about 0.1% of the dataset, so my instinct is to filter them out initially, and perhaps drill into them more later. 

Let's confirm this theory by drilling in with a plot


In [None]:
plt.figure(figsize=(12,8))
#sns.distplot(train_data['speed_in_kmh'].values, bins=50, kde=True)

big_trips = train_data[(train_data['trip_duration'] > 7200) ]['trip_duration']
sns.distplot(big_trips.values, bins = 100, kde=True)
plt.xlabel('trip duration when over 2 hours', fontsize=12)
plt.show()

We do indeed have this big spike. I leave it to the reader to confirm this is at the 24 hour mark, and proceed to remove these longer trips from the cleaned data set.
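For readers who want to check the position of the spike, one possible approach is to bucket the long trips into whole hours and see which bucket is fullest. The durations below are invented for illustration, and `busiest_hour_bin` is a hypothetical helper name.

```python
import pandas as pd

def busiest_hour_bin(durations_secs, min_secs=7200):
    """Return the whole-hour bucket holding the most trips longer than min_secs."""
    long_trips = durations_secs[durations_secs > min_secs]
    # Integer division by 3600 puts each trip in its whole-hour bucket
    hours = (long_trips // 3600).astype(int)
    return hours.value_counts().idxmax()

# Invented durations in seconds: a few normal trips and a cluster just under 24 hours
durations = pd.Series([600, 900, 8000, 9000, 40000, 85000, 86000, 86200, 86350])

print(busiest_hour_bin(durations))  # 23, i.e. between 23 and 24 hours
```

On the real data, passing `train_data['trip_duration']` would show whether the fullest bucket sits just below the 24 hour mark.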

In [None]:
train_data_cleaned = train_data_cleaned[train_data_cleaned['trip_duration'] < 70000]


Next, I want to review these trips with very low duration (less than a minute). Do they correlate, for example, with zero passengers and minimal distance?

In [None]:
train_data_cleaned[train_data_cleaned['trip_duration'] < 60].describe()

It looks like the zero passenger trips are a minority of these cases - we can check those separately.

The percentiles for the distance are reassuring; these trips are a few hundred metres, so the passenger probably just changed their mind. As they are still point to point trips, there's no reason our ML algorithms can't handle them.

Mostly tiny trips, as expected, though one was a 20km journey. For now, I'll leave these in the data set, as the smaller trips should be easy enough to predict.

Some outliers, and some interesting spikes. Let's again filter out any huge outliers to get a nicer view

## Next Steps ##
We are done for the moment, as I need to break for a while.

We might want to apply some more outlier analysis to the trip start and end points so we focus on New York and clean the data of resulting worrying outliers. I leave this to the reader for now.

I will add a feature representing "grid distance", that is the distance assuming the taxi can only travel along the 29 degrees from north/south axis or perpendicular to it. 

We should also check for correlations between our engineered features to see if they are useful.

I then plan to run this through XGBoost, and try a neural network based approach.