# NYC Taxi fare problem - Assignment 1
***
> ##### __Author__: _Vanshaj Lokwani_
> ##### __SBU ID__: _112025869_

In [1]:
# Importing the tools required
import pandas as pd
import numpy as np
import sklearn as skl
from matplotlib import pyplot as plt
import seaborn as sns

The dataset contains 55M rows, and loading all of them at once causes the system to hang and intermittently crash. For the initial cleaning and testing phase, we use a subset of the data. Once we have an idea of the actual cleaning effort required, we can apply the same steps iteratively to the whole dataset for further processing.

__For the time being, we use the first 10M rows as our subset. The number was chosen by intuition rather than by any formal criterion: 10M rows are enough to identify patterns in the data, yet small enough not to overload the RAM during pre-processing.__
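As a rough sketch of what "iteratively" processing the whole file could look like, `pd.read_csv` accepts a `chunksize` argument and then yields the file piece by piece. The helper names `clean_chunk` and `load_in_chunks`, the placeholder cleaning rule, and the tiny in-memory CSV are assumptions made for illustration; they are not part of the assignment code.

```python
import io

import pandas as pd


def clean_chunk(chunk):
    """Hypothetical cleaning step: drop rows with missing dropoff
    coordinates or a negative fare."""
    chunk = chunk.dropna(subset=["dropoff_longitude", "dropoff_latitude"])
    return chunk[chunk["fare_amount"] >= 0]


def load_in_chunks(source, chunksize):
    """Read `source` chunk by chunk, clean each chunk, and concatenate."""
    cleaned = [clean_chunk(c) for c in pd.read_csv(source, chunksize=chunksize)]
    return pd.concat(cleaned, ignore_index=True)


# Tiny in-memory stand-in for train.csv (made-up values, illustration only)
sample_csv = io.StringIO(
    "key,fare_amount,dropoff_longitude,dropoff_latitude\n"
    "a,4.5,-73.98,40.75\n"
    "b,-2.0,-73.99,40.76\n"
    "c,7.0,,\n"
    "d,10.0,-73.97,40.74\n"
)
sample_df = load_in_chunks(sample_csv, chunksize=2)
print(sample_df)
```

With the real file, `source` would be the CSV path and `chunksize` something on the order of a million rows, so that only one chunk has to fit in memory at a time.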

In [2]:
# Getting the dataset
NUM_ROWS = 10**7 # 10M ROWS
train_df = pd.read_csv('../dataset/train.csv', nrows=NUM_ROWS)


Displaying the first 5 rows of the dataframe to get a rough idea of what it looks like and to check that it loaded correctly.

In [None]:
display(train_df.head())

In [None]:
# A few steps to make our life easier in the long run.
# Set the 'key' column as the new index. The dataset specifies that the key is unique,
# so it does not make sense to maintain two different indexes.
train_df.set_index('key', inplace=True, drop=True) 

# Getting the columns in a list. Might be helpful later on.
columns = list(train_df.columns)
display(columns)
display(train_df.head())

## Question 1: 
##### **Take a look at the training data. There may be anomalies in the data that you may need to factor in before you start on the other tasks. Clean the data first to handle these issues. Explain what you did to clean the data (in bulleted form). (10 pt)**


##### Let's print a statistical summary of the data using the `.describe()` API provided in pandas. This may help us detect any anomalies that exist in the data.

But before we get into that, we need to suppress scientific notation in pandas. For this we use the `set_option` API call, setting the display format so that numbers are shown with 3 decimal places. More information on how to do that can be found at the Stack Overflow link: https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
display(train_df.describe())
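To illustrate what the option changes, here is a small sketch on a toy frame with made-up values. It uses `pd.option_context`, which applies the setting only inside the `with` block, as an alternative to the global `set_option` call; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy values spanning several orders of magnitude (made up for illustration)
toy = pd.DataFrame({"fare_amount": [2.5, 1e7, 0.000123]})

# option_context restores the previous display setting when the block exits
with pd.option_context("display.float_format", "{:.3f}".format):
    formatted = toy.to_string()

print(formatted)
```

Every value is rendered in fixed-point notation with three decimals, so very large and very small numbers in a summary table can be compared at a glance.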

#### We can already see that there are clearly some issues with the data. Let's go through the columns one by one.

__ColumnName__: *pickup_longitude, pickup_latitude, dropoff_longitude & dropoff_latitude* Possible Issues:

* The count for dropoff_longitude and dropoff_latitude is lower than for the other columns. This indicates missing (NaN) values in these two columns. How we handle them depends on how many values are missing.

In [None]:
# Collect the index labels of rows where either dropoff coordinate is missing
missing_long_indexes = train_df[train_df.dropoff_longitude.isnull()].index
missing_lat_indexes = train_df[train_df.dropoff_latitude.isnull()].index
missing_values_list = set(missing_long_indexes).union(missing_lat_indexes)
print("Number of rows with missing values: {}".format(len(missing_values_list)))

The number is small enough that these rows can be dropped without worrying about losing important data: 69 rows in 10M is not significant.

In [None]:
train_df.drop(missing_values_list, inplace=True)
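As a possible shorthand for the count-then-drop steps above, `DataFrame.dropna` with a `subset` argument removes the rows where any of the listed columns is NaN. The sketch below runs on a toy frame with made-up values; `toy`, `coord_cols`, and `toy_clean` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one row missing both dropoff coordinates (made-up values)
toy = pd.DataFrame(
    {
        "fare_amount": [4.5, 7.0, 10.0],
        "dropoff_longitude": [-73.98, np.nan, -73.97],
        "dropoff_latitude": [40.75, np.nan, 40.74],
    },
    index=["a", "b", "c"],
)

coord_cols = ["dropoff_longitude", "dropoff_latitude"]

# Count the rows where at least one of the two columns is missing
n_missing = int(toy[coord_cols].isnull().any(axis=1).sum())

# Drop those rows in one call
toy_clean = toy.dropna(subset=coord_cols)

print(n_missing)
print(toy_clean)
```

This returns a new frame rather than modifying in place, which makes it easy to compare row counts before and after the drop.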

In [None]:
# describe() should now return equal counts for all columns, since no NaN values remain
display(train_df.describe())
display("Any null values left in the two columns: %s" % train_df[['dropoff_longitude', 'dropoff_latitude']].isnull().values.any())

* Valid latitude values range from -90 to 90 and valid longitude values range from -180 to 180. From the data summary, it's clear that some values fall outside these ranges. Calculating the number of rows with out-of-range values:

In [None]:
# Index labels of rows with out-of-range coordinates (prefix or_ = out of range)
or_pickup_longitude = set(train_df[(train_df.pickup_longitude < -180.0) | (train_df.pickup_longitude > 180.0)].index)
or_pickup_latitude = set(train_df[(train_df.pickup_latitude < -90.0) | (train_df.pickup_latitude > 90.0)].index)
or_dropoff_longitude = set(train_df[(train_df.dropoff_longitude < -180.0) | (train_df.dropoff_longitude > 180.0)].index)
or_dropoff_latitude = set(train_df[(train_df.dropoff_latitude < -90.0) | (train_df.dropoff_latitude > 90.0)].index)
or_indexes = or_pickup_longitude.union(or_pickup_latitude).union(or_dropoff_latitude).union(or_dropoff_longitude)
display("Number of rows with out-of-range values: %s" % len(or_indexes))

In [None]:
train_df.loc[list(or_indexes)].describe()
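If these rows are to be removed, one possible approach is a single boolean mask built with `Series.between`, which is inclusive on both ends by default. This is only a sketch on a toy frame with made-up coordinates; `toy`, `in_range`, and `toy_valid` are illustrative names.

```python
import pandas as pd

# Toy frame: row "b" has an invalid longitude, row "c" an invalid latitude
toy = pd.DataFrame(
    {
        "pickup_longitude": [-73.98, -200.0, -73.95],
        "pickup_latitude": [40.75, 40.70, 95.0],
        "dropoff_longitude": [-73.99, -73.90, -73.96],
        "dropoff_latitude": [40.76, 40.71, 40.73],
    },
    index=["a", "b", "c"],
)

# True only for rows where all four coordinates are geographically valid
in_range = (
    toy["pickup_longitude"].between(-180, 180)
    & toy["dropoff_longitude"].between(-180, 180)
    & toy["pickup_latitude"].between(-90, 90)
    & toy["dropoff_latitude"].between(-90, 90)
)

n_out_of_range = int((~in_range).sum())
toy_valid = toy[in_range]

print(n_out_of_range)
print(toy_valid)
```

Note that `between` evaluates to False for NaN, so rows with missing coordinates would also be excluded by this mask.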


__ColumnName__: *fare_amount*
Possible Issues: 
* fare_amount is negative for some rows, which should not be possible. Assumption: there was an issue with data entry or fare calculation. Looking into the dataset to fetch all the rows where fare_amount is negative.

In [None]:
neg_fare = train_df[train_df.fare_amount < 0]
display(neg_fare.describe())

Only 420 data points in 10M have a negative fare value. We can safely drop these rows from the dataset.

In [None]:
train_df.drop(neg_fare.index, inplace=True)
display(train_df.head())
display(train_df.describe())