# Taxi trips

The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

Data can be found [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [64]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
from scipy import stats

path = 'C:/Users/Zaca/Documents/Datasets/taxis/yellow_tripdata_2017-06.csv'

In [None]:
# open data and explore
taxi = pd.read_csv(path)

In [None]:
# looking at the size of the data
taxi.shape

A data dictionary describing each column can be found [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

In [None]:
# examine the data
taxi.head()

In [None]:
# examine dtypes
taxi.dtypes

In [None]:
# checking for nans
taxi.isna().sum()
# beautiful

# Exploring and understanding column content

**VendorID**: code indicating the TPEP provider that provided the record. 
- The TLC (Taxi and Limousine Commission) requires all medallion taxicabs to be equipped with a Taxicab Technology System (“T-PEP”), which processes credit, debit, and prepaid card payments, enables taxicab drivers to receive text messages from the TLC, allows the TLC to collect electronic trip sheet data, and includes a Passenger Information Monitor (“PIM”), which displays content to taxicab passengers.

In [None]:
# vendorID
taxi.VendorID.value_counts()
# I don't think I'll be very interested in this column for my analysis.


**passenger_count**: the number of passengers in the vehicle. This is a driver-entered value.

In [None]:
taxi.passenger_count.value_counts()

# there are some trips with zero passengers. because this is a driver input value,
# I think zeros are probably mistakes. In any case they are only 595 in 10M, so we can remove them later.

**trip_distance**: the elapsed trip distance in miles reported by the taximeter.

In [None]:
print(taxi.trip_distance.describe())
# there seems to be a 600 mile trip as the maximum. This doesn't feel like your normal taxi trip in NYC.
# we might be interested in removing these outliers.

plt.hist(taxi.trip_distance, bins=100, range=(0, 20));
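
One possible way to trim such extreme distances is a quantile cutoff. The snippet below is only a sketch on made-up numbers: the toy values and the 0.90 cutoff are assumptions for illustration, not something derived from the TLC file.

```python
import pandas as pd

# Toy distances in miles (made up for illustration, not taken from the TLC file).
trips = pd.DataFrame(
    {"trip_distance": [0.8, 1.2, 2.5, 3.1, 4.0, 5.6, 9.9, 17.3, 21.0, 600.0]}
)

# Assumed cutoff: the 90th percentile of the toy column.
cutoff = trips["trip_distance"].quantile(0.90)

# Keep only the trips at or below the cutoff.
trimmed = trips[trips["trip_distance"] <= cutoff]
print(len(trips), len(trimmed), trimmed["trip_distance"].max())
```

On the real data a much higher quantile (or a fixed mileage limit) would probably be more appropriate; the right threshold is a judgment call.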

**RatecodeID**: the final rate code in effect at the end of the trip.
1. Standard rate
2. JFK
3. Newark
4. Nassau or Westchester
5. Negotiated fare
6. Group ride


In [None]:
# ratecodeID
taxi.RatecodeID.value_counts()
# the number 99 is probably some error in the system
# nonetheless, I think I might only be interested in keeping the normal fare, it's most of the data anyway.

**Store_and_fwd_flag**: this flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.

In [None]:
# store and forward
taxi.store_and_fwd_flag.value_counts()

# doesn't seem to have much interest.

**PULocationID**: TLC Taxi Zone in which the taximeter was engaged.

In [None]:
# pick-up-location code has 261 unique values
taxi.PULocationID.value_counts()

# this is a pretty important variable, i'm probably gonna have to think of a way of dealing with those later.
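
One option for a column with this many categories is to keep the most frequent zones and lump the rest together. This is a sketch on made-up zone codes, not the approach used later in this notebook; `top_n` and the placeholder code `-1` are hypothetical choices.

```python
import pandas as pd

# Toy pick-up zone codes (made up for illustration).
zones = pd.Series([132, 132, 132, 161, 161, 237, 237, 237, 237, 4, 7], name="PULocationID")

top_n = 2  # assumed number of zones to keep as their own category
top_zones = zones.value_counts().nlargest(top_n).index

# Zones outside the top N are replaced by -1, a hypothetical "other" bucket.
grouped = zones.where(zones.isin(top_zones), other=-1)
print(grouped.value_counts())
```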

**DOLocationID**: TLC Taxi Zone in which the taximeter was disengaged.

In [None]:
# drop-off location has 262 unique values.
taxi.DOLocationID.value_counts()

**payment_type**: a numeric code signifying how the passenger paid for the trip.
1. Credit card
2. Cash
3. No charge
4. Dispute
5. Unknown
6. Voided trip

In [None]:
taxi.payment_type.value_counts()
# In the payment type we can observe one of the main problems with the data.
# Tips are only included for credit card payments.
# We might have to only include these rides in our tip recommendation system.
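
A quick way to check this is to compare the share of trips with a recorded tip per payment type. The rows below are made up for illustration; only the column names and the payment codes (1 = credit card, 2 = cash) come from the data dictionary.

```python
import pandas as pd

# Toy rows (made up): payment_type 1 = credit card, 2 = cash.
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "tip_amount": [2.0, 0.0, 3.5, 0.0, 0.0, 0.0],
})

# Share of trips with a recorded (non-zero) tip, per payment type.
tip_share = (toy["tip_amount"] > 0).groupby(toy["payment_type"]).mean()
print(tip_share)

# Keep only credit card trips, where the tip field is populated.
card_only = toy[toy["payment_type"] == 1]
```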

**fare_amount**: the time-and-distance fare calculated by the meter.

In [None]:
print(taxi.fare_amount.describe())

# it seems like there was a trip worth 175k, that doesn't sound right at all.

plt.hist(taxi.fare_amount, bins=100, range=(0, 100));

# there are a couple of weird things here:
# there are some negative values. if we assume these are sign errors, we can fix them by taking the absolute value.
# there's also a weird peak at $52.

# taxi.fare_amount.value_counts() here I should detect the peak using the mode between 40-60 or something like that.

taxi[taxi.fare_amount == 52]['RatecodeID'].value_counts(normalize=True)

# it seems that a lot of these $52 fares come from zone 132.
# ok I figured it out, it seems to be a fixed fare for trips from JFK Airport.
# this is solved if we only include trips with ratecode 1
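
The idea of finding the peak with the mode between 40 and 60 dollars could look roughly like this. It is a sketch on made-up fares; the window bounds are the ones suggested in the comment above, and the toy rows are assumptions.

```python
import pandas as pd

# Toy fares (made up); rate code 2 is JFK in the list of rate codes.
toy = pd.DataFrame({
    "fare_amount": [8.5, 12.0, 52.0, 52.0, 52.0, 45.5, 58.0, 9.0, 52.0],
    "RatecodeID": [1, 1, 2, 2, 2, 1, 1, 1, 1],
})

# Restrict to the 40-60 dollar window and take the most common fare in it.
window = toy[toy["fare_amount"].between(40, 60)]
peak = window["fare_amount"].mode().iloc[0]

# Share of each rate code among the trips at the peak fare.
shares = toy.loc[toy["fare_amount"] == peak, "RatecodeID"].value_counts(normalize=True)
print(peak, shares.to_dict())
```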

**extra**: Miscellaneous extras and surcharges. Currently, this only includes the .5 and 1 dollar rush hour and overnight charges.

In [None]:
taxi.extra.value_counts()

**improvement_surcharge**: 0.30 dollar improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.

In [None]:
# this charge seems to be present in all of the trips so it doesn't have much information.
taxi.improvement_surcharge.value_counts()

In [None]:
# same story as the improvement surcharge: this tax seems to be present in almost every trip, so it carries little information.
taxi.mta_tax.value_counts()

**tip_amount**: This field is automatically populated for credit card tips. Cash tips are not included.

In [None]:
# the target variable for our purposes
taxi.tip_amount.describe()
#plt.hist(taxi.tip_amount, bins=100);

**tolls_amount**: Total amount of all tolls paid in trip.

In [None]:
taxi.tolls_amount.value_counts()
plt.hist(taxi.tolls_amount, log=True);

In [None]:
# date columns

In [None]:
# transforming to datetime
taxi['tpep_pickup_datetime'] = pd.to_datetime(taxi['tpep_pickup_datetime'])
taxi['tpep_dropoff_datetime'] = pd.to_datetime(taxi['tpep_dropoff_datetime'])

In [None]:
taxi['duration'] = taxi['tpep_dropoff_datetime'] - taxi['tpep_pickup_datetime']
# total_seconds() keeps the days component of the timedelta; .dt.seconds would drop it
taxi['duration'] = taxi['duration'].dt.total_seconds() / 60
taxi['weekday'] = taxi['tpep_pickup_datetime'].dt.weekday
taxi['hour'] = taxi['tpep_pickup_datetime'].dt.hour
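
When turning a timedelta into minutes, `.dt.seconds` and `.dt.total_seconds()` are not interchangeable: the first returns only the seconds component (0 to 86399) and ignores days, which also distorts negative durations. The timestamps below are made up to show the difference.

```python
import pandas as pd

# Toy timestamps (made up): a 15-minute trip, a trip spanning more than a day,
# and a trip whose drop-off is recorded before its pick-up.
toy = pd.DataFrame({
    "pickup": pd.to_datetime(
        ["2017-06-01 08:00:00", "2017-06-01 23:50:00", "2017-06-02 10:00:00"]
    ),
    "dropoff": pd.to_datetime(
        ["2017-06-01 08:15:00", "2017-06-03 00:20:00", "2017-06-02 09:50:00"]
    ),
})

delta = toy["dropoff"] - toy["pickup"]

# .dt.seconds drops the days component; .dt.total_seconds() keeps the full span.
toy["minutes_seconds_attr"] = delta.dt.seconds / 60
toy["minutes_total"] = delta.dt.total_seconds() / 60
print(toy[["minutes_seconds_attr", "minutes_total"]])
```

Only the `total_seconds()` column reports the long trip as 1470 minutes and the reversed one as a negative value, which makes such records easy to spot and filter.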

# Data cleaning

In [None]:
taxi.columns

In [None]:
# dropping columns
drop_cols = ['VendorID', 'store_and_fwd_flag', 'mta_tax', 'improvement_surcharge', 'tpep_pickup_datetime', 'tpep_dropoff_datetime']
taxi.drop(labels=drop_cols, axis=1, inplace=True)

In [None]:
# removing 0-passenger trips and keeping only standard-rate, credit card trips
taxi = taxi[(taxi.passenger_count != 0) & (taxi.RatecodeID == 1) & (taxi.payment_type == 1)]

In [None]:
plt.hist(taxi.fare_amount, bins=100);

In [None]:
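# RatecodeID and payment_type are constant after the filter above, so their z-scores would be NaN
z = np.abs(stats.zscore(taxi.drop(columns=['RatecodeID', 'payment_type'])))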

In [None]:
taxi_o = taxi[(z < 3).all(axis=1)]
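
A small sketch of how this kind of z-score filter behaves, on made-up data. It also shows one pitfall: a constant column has zero standard deviation, so its z-scores are NaN, and since `NaN < 3` is False the `.all(axis=1)` test rejects every row. The toy values and column selection are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data (made up): 19 ordinary fares plus one extreme value, and a constant column.
toy = pd.DataFrame({
    "fare_amount": [10.0] * 19 + [1000.0],
    "payment_type": [1] * 20,
})

# Scoring every column: the constant column yields NaN z-scores, so no row passes.
z_all = np.abs(stats.zscore(toy.to_numpy(dtype=float)))
kept_all = toy[(z_all < 3).all(axis=1)]

# Scoring only the varying column keeps the ordinary rows and drops the extreme one.
z_fare = np.abs(stats.zscore(toy[["fare_amount"]].to_numpy(dtype=float)))
kept_fare = toy[(z_fare < 3).all(axis=1)]
print(len(kept_all), len(kept_fare))
```

SciPy may emit a runtime warning for the constant column; the result is still NaN for that column.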