# This is a basic Starter Kernel for the New York City Taxi Fare Prediction Playground Competition 
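Here we'll use a simple linear model which predicts the `fare_amount` of each ride. The features start from the travel vector between the taxi's pickup and dropoff locations, and later cells add the pickup time, day of the week, trip distance, and distance to JFK airport.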

This kernel uses `pandas` and `numpy` for the feature preparation and `sklearn` for the validation split and the linear regression.  Other libraries, for example `statsmodels`, would work as well.

In [None]:
# Initial Python environment setup...
import numpy as np # linear algebra
import pandas as pd # CSV file I/O (e.g. pd.read_csv)
import os # reading the input files we have access to

print(os.listdir('../input'))

### Setup training data
First let's read in our training data.  Kernels do not yet support enough memory to load the whole dataset at once, at least using `pd.read_csv`.  The entire dataset is about 55M rows, so we read only the first 10M rows here and skip the rest, though it's certainly possible to build a model using all the data.

In [None]:
train_df =  pd.read_csv('../input/train.csv', nrows = 10_000_000)
train_df.dtypes
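
One way to stretch the available memory is to pass explicit dtypes to `pd.read_csv`. The sketch below is only an illustration: it reads a tiny in-memory CSV with made-up rows rather than `train.csv`, the `passenger_count` column is assumed from the competition data, and the choice of `float32` is an assumption that trades some coordinate precision for roughly half the memory of the default `float64`.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for train.csv (rows are made up for illustration).
sample_csv = io.StringIO(
    "key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,passenger_count\n"
    "a,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1\n"
    "b,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1\n"
)

# Assumed dtypes: float32 instead of the default float64 for the numeric columns.
float_columns = [
    "fare_amount",
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
]
sample_dtypes = {name: "float32" for name in float_columns}

# nrows caps the number of rows read, exactly as in the cell above.
sample_df = pd.read_csv(sample_csv, nrows=2, dtype=sample_dtypes)
print(sample_df.dtypes)
```

Whether the reduced precision is acceptable depends on the features built afterwards, so treat the dtype choice as something to validate rather than a default.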

In [None]:
test_df = pd.read_csv("../input/test.csv")

In [None]:
test_df.dtypes

Let's create two new features in our training set representing the "travel vector" between the start and end points of the taxi ride, in both longitude and latitude coordinates.  We'll take the absolute value since we're only interested in distance traveled. We'll use a helper function since we'll want to do the same thing for the test set later.

In [None]:
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.


def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()
    
    

add_travel_vector_features(train_df)


In [None]:
add_travel_vector_features(test_df)

In [None]:
train_df.columns

### Explore and prune outliers
First let's see if there are any `NaN`s in the dataset.

In [None]:
print(train_df.isnull().sum())

There are a small number of them, so let's remove those rows from the dataset.

In [None]:
print('Old size: %d' % len(train_df))
train_df = train_df.dropna(how = 'any', axis = 0)
print('New size: %d' % len(train_df))

Now let's quickly plot a subset of our travel vector features to see their distribution.

In [None]:
plot = train_df.iloc[:2000].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

We expect most of these values to be very small (likely between 0 and 1) since they should all be differences between GPS coordinates within one city.  For reference, one degree of latitude is about 69 miles.  However, we can see the dataset has extreme values which do not make sense.  Let's remove those values from our training set. Based on the scatterplot, it looks like we can safely exclude values above 5 (though remember the scatterplot is only showing the first 2000 rows).

In [None]:
print('Old size: %d' % len(train_df))
train_df = train_df[(train_df.abs_diff_longitude < 5.0) & (train_df.abs_diff_latitude < 5.0)]
print('New size: %d' % len(train_df))
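
To get a feel for how generous the 5-degree cutoff is, the arithmetic below converts degrees to miles. It starts from the 69 miles per degree of latitude quoted above; the latitude of 40.7 degrees is an assumed representative value for New York City, and the cosine scaling for longitude is the usual spherical approximation.

```python
import math

MILES_PER_DEG_LAT = 69.0   # figure quoted in the text above
NYC_LATITUDE_DEG = 40.7    # assumed representative latitude for New York City

# A degree of longitude shrinks with the cosine of the latitude.
miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(NYC_LATITUDE_DEG))

cutoff_deg = 5.0
cutoff_lat_miles = cutoff_deg * MILES_PER_DEG_LAT
cutoff_lon_miles = cutoff_deg * miles_per_deg_lon

print("miles per degree of longitude: %.1f" % miles_per_deg_lon)
print("5 degrees of latitude in miles: %.0f" % cutoff_lat_miles)
print("5 degrees of longitude in miles: %.0f" % cutoff_lon_miles)
```

Under these assumptions the cutoff allows trips of roughly 345 miles north-south and about 260 miles east-west, so it only removes clearly implausible rides.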

In [None]:
train_df.head()

In [None]:
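# .iloc[0] selects the first remaining row; label 0 may have been dropped by the filters above
train_df['pickup_datetime'].iloc[0][11:19]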

In [None]:
# Creating an extra col of pickup time (HH:MM:SS), extracted from pickup_datetime
train_df['pickup_time'] = train_df['pickup_datetime'].str[11:19]
test_df['pickup_time'] = test_df['pickup_datetime'].str[11:19]
    

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
# [:-4] strips the trailing " UTC"; .iloc[0] avoids relying on label 0 still being present
x = pd.Timestamp(train_df['pickup_datetime'].iloc[0][:-4]).dayofweek
x

In [None]:
# Creating an extra col for day of the week (0 = Monday ... 6 = Sunday)
# [:-4] strips the trailing " UTC" before parsing

train_df['weekday'] = pd.to_datetime(train_df['pickup_datetime'].str[:-4]).dt.dayofweek
test_df['weekday'] = pd.to_datetime(test_df['pickup_datetime'].str[:-4]).dt.dayofweek


In [None]:
test_df.head()

In [None]:
# Dropping "pickup_datetime" col

train_df.drop("pickup_datetime",axis=1,inplace=True)
test_df.drop("pickup_datetime",axis=1,inplace=True)

In [None]:
test_df.head()

In [None]:
train_df.head()

In [None]:
col = test_df.columns.tolist()
col = col[:6] + col[8:] +col[6:8]
col

test_df = test_df[col]
test_df.head()

In [None]:
train_df.shape

In [None]:
# Map weekday numbers (0-6) to names. Assigning the result back avoids
# relying on in-place modification of a column selection.
weekday_names = ["monday","tuesday",'wednesday','thursday','friday','saturday','sunday']
weekday_map = dict(enumerate(weekday_names))

train_df['weekday'] = train_df['weekday'].map(weekday_map)
test_df['weekday'] = test_df['weekday'].map(weekday_map)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_one_hot = pd.get_dummies(train_df['weekday'])
train_df = pd.concat([train_df,train_one_hot],axis=1)

test_one_hot = pd.get_dummies(test_df['weekday'])
test_df = pd.concat([test_df,test_one_hot],axis=1)
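
`pd.get_dummies` only creates columns for values that actually occur, so a dataframe that happens to miss a weekday would end up with fewer columns than the other. A minimal sketch of one way to guard against that, assuming a fixed list of the seven day names; `weekday_dummies` is a hypothetical helper, not part of this kernel:

```python
import pandas as pd

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def weekday_dummies(weekday_names):
    """One-hot encode weekday names with a fixed set of seven columns."""
    # A categorical dtype with explicit categories makes get_dummies emit
    # a column for every day, including days absent from the data.
    as_category = pd.Series(weekday_names).astype(pd.CategoricalDtype(categories=DAYS))
    return pd.get_dummies(as_category)


small = weekday_dummies(["monday", "friday", "monday"])
print(list(small.columns))
print(small.sum())
```

Note that the columns come out in the order of `DAYS` rather than alphabetically, so any later step that aligns test columns to train columns still applies.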


In [None]:
test_df.head()

In [None]:
train_df.head()

In [None]:
train_df.drop("weekday",axis=1,inplace=True)
test_df.drop("weekday",axis=1,inplace=True)

In [None]:
# .iloc[0] selects the first remaining row regardless of its index label
a = train_df['pickup_time'].iloc[0].split(":")
(int(a[0])*100) + int(a[1]) + float(a[2])/100


In [None]:
# Converting pickup_time "HH:MM:SS" to a float of the form HHMM.SS

def pickup_time_to_float(times):
    # Split into hour, minute and second columns, then combine them
    parts = times.str.split(":", expand=True).astype(float)
    return parts[0] * 100 + parts[1] + parts[2] / 100

train_df['pickup_time'] = pickup_time_to_float(train_df['pickup_time'])
test_df['pickup_time'] = pickup_time_to_float(test_df['pickup_time'])


In [None]:
train_df.head()

In [None]:
test_df.head()
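
The `pickup_time` conversion above packs the time into a number of the form HHMM.SS. The sketch below reproduces that formula on single strings and compares it with fractional hours since midnight. The second function is a hypothetical alternative shown only for comparison, not something this kernel uses.

```python
def encode_hhmm_ss(time_string):
    """Encode 'HH:MM:SS' as HHMM.SS, the same formula as the conversion above."""
    hours, minutes, seconds = time_string.split(":")
    return int(hours) * 100 + int(minutes) + float(seconds) / 100


def hours_since_midnight(time_string):
    """Hypothetical alternative: fractional hours since midnight."""
    hours, minutes, seconds = time_string.split(":")
    return int(hours) + int(minutes) / 60 + float(seconds) / 3600


for t in ("17:26:21", "17:59:59", "18:00:00"):
    print(t, encode_hhmm_ss(t), round(hours_since_midnight(t), 4))
```

With the HHMM.SS form, one second between 17:59:59 and 18:00:00 shows up as a jump of about 40 units, which a linear model treats as a real numeric gap. Fractional hours avoid that jump, though whether this improves the fit here is untested.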

In [None]:
# rearranging cols
test_df = test_df[train_df.drop('fare_amount',axis=1).columns]

In [None]:
test_df.head()

In [None]:
train_df.head()

In [None]:
# Calculating the haversine (great-circle) distance from pickup to dropoff.
# R is the Earth's radius in km; the factor 0.621 converts km to miles,
# so 'Distance' is stored in miles.

R = 6373.0

for df in (train_df, test_df):
    lat1 = np.radians(df['pickup_latitude'].to_numpy())
    lon1 = np.radians(df['pickup_longitude'].to_numpy())
    lat2 = np.radians(df['dropoff_latitude'].to_numpy())
    lon2 = np.radians(df['dropoff_longitude'].to_numpy())

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    df['Distance'] = R * c * 0.621

In [None]:
train_df.head()

In [None]:
test_df.head()
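
As a sanity check on the distance calculation, the same haversine formula can be wrapped in a small function and applied to a case with a known answer: one degree of latitude, which should come out close to the 69 miles mentioned earlier. `haversine_miles` is a hypothetical helper written for this check; the radius and the km-to-miles factor are the ones used in this kernel.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0   # same radius as used in this kernel
KM_TO_MILES = 0.621        # same conversion factor as used in this kernel


def haversine_miles(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1_deg, lon1_deg, lat2_deg, lon2_deg))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return EARTH_RADIUS_KM * c * KM_TO_MILES


# One degree of latitude along the same meridian.
one_degree_north = float(haversine_miles(40.0, -74.0, 41.0, -74.0))
print(round(one_degree_north, 1))

# The function also accepts arrays, one entry per ride.
pair = haversine_miles(np.array([40.0, 40.0]), np.array([-74.0, -74.0]),
                       np.array([40.0, 41.0]), np.array([-74.0, -74.0]))
print(pair)
```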

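John F. Kennedy Airport is located at latitude 40.6413111, longitude -73.7781391. Next we'll add the distance from each ride's pickup and dropoff points to the airport as features.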


In [None]:
# Calculating haversine distances (in miles) from the pickup and dropoff
# points to JFK airport

R = 6373.0
jfk_lat = np.radians(40.6413111)
jfk_lon = np.radians(-73.7781391)

def distance_to_jfk(lat_deg, lon_deg):
    # Same haversine formula as for 'Distance', with JFK as the second point
    lat = np.radians(lat_deg.to_numpy())
    lon = np.radians(lon_deg.to_numpy())
    a = np.sin((jfk_lat - lat) / 2)**2 + np.cos(lat) * np.cos(jfk_lat) * np.sin((jfk_lon - lon) / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return R * c * 0.621

for df in (train_df, test_df):
    df['Pickup_Distance_airport'] = distance_to_jfk(df['pickup_latitude'], df['pickup_longitude'])
    df['Dropoff_Distance_airport'] = distance_to_jfk(df['dropoff_latitude'], df['dropoff_longitude'])


In [None]:
# Rounding off data to two decimal places

train_df['Distance']=np.round(train_df['Distance'],2)
train_df['Pickup_Distance_airport']=np.round(train_df['Pickup_Distance_airport'],2)
train_df['Dropoff_Distance_airport']=np.round(train_df['Dropoff_Distance_airport'],2)

test_df['Distance']=np.round(test_df['Distance'],2)
test_df['Pickup_Distance_airport']=np.round(test_df['Pickup_Distance_airport'],2)
test_df['Dropoff_Distance_airport']=np.round(test_df['Dropoff_Distance_airport'],2)

In [None]:
train_df.drop(['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'],axis=1,inplace=True)
test_df.drop(['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'],axis=1,inplace=True)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
print(train_df.shape , test_df.shape)

In [None]:
from sklearn.model_selection import train_test_split

X=train_df.drop(['key','fare_amount'],axis=1)
y=train_df['fare_amount']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.01,random_state=80)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
from sklearn.linear_model import LinearRegression
reg=LinearRegression()
reg.fit(X_train,y_train)
reg.score(X_test,y_test)
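
`reg.score` reports the R² of the fit, which is unitless. It can help to also look at the error in the units of the target (dollars here), for example the root mean squared error (RMSE). A minimal sketch on synthetic data, so it runs without the competition files; all `demo_*` names and the generated data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two features and a noisy linear target.
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(500, 2))
demo_y = 2.5 + 3.0 * demo_X[:, 0] + 1.0 * demo_X[:, 1] + rng.normal(0, 1.0, size=500)

demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=80)

demo_reg = LinearRegression().fit(demo_X_train, demo_y_train)

# R^2 (what score() returns) next to RMSE in the units of the target.
demo_r2 = demo_reg.score(demo_X_val, demo_y_val)
demo_rmse = float(np.sqrt(mean_squared_error(demo_y_val, demo_reg.predict(demo_X_val))))
print("R^2: %.3f  RMSE: %.3f" % (demo_r2, demo_rmse))
```

Applied to this kernel, the same two lines would use `reg`, `X_test` and `y_test` in place of the `demo_*` objects.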

In [None]:
predictions = reg.predict(test_df.drop("key",axis=1))
predictions = np.round(predictions,2)
predictions

In [None]:
Submission=pd.DataFrame(data=predictions,columns=['fare_amount'])

Submission['key']=test_df['key']

Submission=Submission[['key','fare_amount']]

In [None]:
Submission.set_index('key',inplace=True)

In [None]:
Submission.reset_index().head()

In [None]:
Submission.to_csv('Submission.csv')
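
The submission steps above can also be written in one pass by building the dataframe from a dict and writing it without the index. This is a sketch with made-up keys and predictions, written to an in-memory buffer instead of a file; the `demo_*` names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

demo_keys = pd.Series(["ride-1", "ride-2", "ride-3"])   # hypothetical keys
demo_predictions = np.array([8.456, 12.1, 5.0])         # made-up predictions

demo_submission = pd.DataFrame({
    "key": demo_keys,
    "fare_amount": np.round(demo_predictions, 2),
})

# index=False keeps the default integer index out of the file,
# leaving exactly the two columns 'key' and 'fare_amount'.
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```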