# This is a basic Starter Kernel for the New York City Taxi Fare Prediction Playground Competition 
Here we'll use a simple linear model based on the travel vector from the taxi's pickup location to dropoff location which predicts the `fare_amount` of each ride.

This kernel uses some `pandas` and mostly `numpy` for the critical work.  There are many higher-level libraries you could use instead, for example `sklearn` or `statsmodels`.  

In [None]:
# Initial Python environment setup...
import numpy as np # linear algebra
import pandas as pd # CSV file I/O (e.g. pd.read_csv)
import os # reading the input files we have access to

print(os.listdir('../input'))

### Setup training data
First let's read in our training data.  Kernels do not yet support enough memory to load the whole dataset at once, at least using `pd.read_csv`.  The entire dataset is about 55M rows, so by reading only the first 1,000,000 rows we're skipping most of the data, but it's certainly possible to build a model using all the data.

In [None]:
train_df =  pd.read_csv('../input/train.csv', nrows = 1_000_000)
train_df.dtypes

Let's create two new features in our training set representing the "travel vector" between the start and end points of the taxi ride, in both longitude and latitude coordinates.  We'll take the absolute value since we're only interested in distance traveled. We'll use a helper function since we'll want to do the same thing for the test set later.

In [None]:
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train_df)

### Explore and prune outliers
First let's see if there are any `NaN`s in the dataset.

In [None]:
print(train_df.isnull().sum())

There are a small number of them, so let's remove those rows from the dataset.

In [None]:
print('Old size: %d' % len(train_df))
train_df = train_df.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(train_df))

Now let's quickly plot a subset of our travel vector features to see their distribution.

In [None]:
plot=train_df.iloc[:2000].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

We expect most of these values to be very small (likely between 0 and 1) since they should all be differences between GPS coordinates within one city.  For reference, one degree of latitude is about 69 miles.  However, we can see the dataset has extreme values which do not make sense.  Let's remove those values from our training set. Based on the scatterplot, it looks like we can safely exclude values above 5 (though remember the scatterplot is only showing the first 2000 rows).

In [None]:
print('Old size: %d' % len(train_df))
train_df = train_df.loc[(train_df.abs_diff_longitude < 5.0) & (train_df.abs_diff_latitude < 5.0)]
print('New size: %d' % len(train_df))
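To get a feel for how generous the 5.0-degree cutoff is, the conversion quoted above can be turned into arithmetic. This is only a rough sketch: it uses the 69 miles-per-degree figure from the text and, as an added assumption, shrinks longitude differences by the cosine of a latitude near New York City (about 40.7 degrees). The helper name `degrees_to_miles` is hypothetical.

```python
import math

MILES_PER_DEGREE_LATITUDE = 69.0  # approximate figure quoted in the text


def degrees_to_miles(diff_degrees, at_latitude_degrees=None):
    """Approximate miles covered by a coordinate difference in degrees.

    With no latitude given, the difference is treated as latitude.
    With a latitude given, it is treated as a longitude difference
    and scaled by cos(latitude) (an assumption for illustration).
    """
    miles = diff_degrees * MILES_PER_DEGREE_LATITUDE
    if at_latitude_degrees is not None:
        miles *= math.cos(math.radians(at_latitude_degrees))
    return miles


lat_cutoff_miles = degrees_to_miles(5.0)
lon_cutoff_miles = degrees_to_miles(5.0, at_latitude_degrees=40.7)
print(lat_cutoff_miles, round(lon_cutoff_miles))
```

Either way the cutoff corresponds to hundreds of miles, far more than a ride within one city, so it only removes clearly implausible rows.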

In [None]:
test_df = pd.read_csv('../input/test.csv')
test_df.dtypes

In [None]:
add_travel_vector_features(test_df)

### Train our model
Our model will take the form $X \cdot w = y$ where $X$ is a matrix of input features, and $y$ is a column of the target variable, `fare_amount`, for each row. The weight column $w$ is what we will "learn".

First let's set up our input matrix $X$ and target column $y$ from our training set.  The matrix $X$ should consist of the two GPS coordinate differences, plus a third term of 1 to allow the model to learn a constant bias term.  The column $y$ should consist of the target `fare_amount` values.
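A minimal sketch of that construction, run on a tiny made-up DataFrame rather than the real training set. The helper name `get_input_matrix` matches the one referenced in the commented-out code later in this kernel, but its body here is an assumption, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd


def get_input_matrix(df):
    # Two travel-vector features plus a column of ones for the bias term.
    return np.column_stack(
        (df.abs_diff_longitude, df.abs_diff_latitude, np.ones(len(df)))
    )


# Toy data standing in for train_df (values are made up).
toy_df = pd.DataFrame({
    'abs_diff_longitude': [0.010, 0.030, 0.002],
    'abs_diff_latitude': [0.020, 0.015, 0.001],
    'fare_amount': [8.5, 12.0, 4.5],
})

toy_X = get_input_matrix(toy_df)
toy_y = np.array(toy_df['fare_amount'])

print(toy_X.shape, toy_y.shape)
```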

In [None]:

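# Caution: unlike the training filter, this drops rows from the test set.
# A submission normally needs a prediction for every test key, so any row
# removed here will be missing from the output.
test_df = test_df.loc[(test_df.abs_diff_longitude < 5.0) & (test_df.abs_diff_latitude < 5.0)]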


Now let's use `numpy`'s `lstsq` library function to find the optimal weight column $w$.
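As a small illustration of what that call looks like, here is a sketch on synthetic data. The "true" weights are made up, and the target is generated without noise so that `np.linalg.lstsq` should recover them; on the real training data the numbers will of course differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic input matrix: two coordinate differences plus a bias column of ones.
X = np.column_stack((
    rng.uniform(0.0, 0.2, n_rows),
    rng.uniform(0.0, 0.2, n_rows),
    np.ones(n_rows),
))

# Made-up weights used only to generate a noise-free target.
true_w = np.array([150.0, 100.0, 3.0])
y = X @ true_w

# lstsq returns the weights, the residual sum of squares, the rank of X,
# and the singular values of X.
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(w)
```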

These weights pass a quick sanity check, since we'd expect the first two values -- the weights for the absolute longitude and latitude differences -- to be positive, as more distance should imply a higher fare, and we'd expect the bias term to loosely represent the cost of a very short ride.

Sidenote:  we can actually calculate the weight column $w$ directly using the [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) method:
$w = (X^T \cdot X)^{-1} \cdot X^T \cdot y$
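The closed form can be checked numerically against `lstsq`. The sketch below uses synthetic data with made-up weights and noise, and solves the linear system $(X^T \cdot X) \cdot w = X^T \cdot y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is a common choice for numerical stability rather than anything this kernel prescribes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows = 500

X = np.column_stack((
    rng.uniform(0.0, 0.2, n_rows),
    rng.uniform(0.0, 0.2, n_rows),
    np.ones(n_rows),
))
# Made-up weights plus noise, for illustration only.
y = X @ np.array([150.0, 100.0, 3.0]) + rng.normal(0.0, 1.0, n_rows)

# Normal equations: solve (X^T X) w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from numpy's least-squares routine.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(w_normal)
print(w_lstsq)
```

The two weight vectors should agree to within floating-point tolerance when $X$ has full column rank.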

### Make predictions on the test set
Now let's load up our test inputs and predict the `fare_amount`s for them using our learned weights!

In [None]:
train_df['pickup_datetime'].head()

In [None]:
test_df['pickup_datetime'].head()

In [None]:
ls1=list(train_df['pickup_datetime'])
for i in range(len(ls1)):
    ls1[i]=ls1[i][11:-7]
train_df['pickup_time']=ls1


ls2=list(test_df['pickup_datetime'])
for i in range(len(ls2)):
    ls2[i]=ls2[i][11:-7]
test_df['pickup_time']=ls2

In [None]:
ls1=list(train_df['pickup_datetime'])
for i in range(len(ls1)) :
    ls1[i]=ls1[i][:-4:]
    ls1[i]=pd.Timestamp(ls1[i])
    ls1[i]=ls1[i].weekday()
train_df['Weekday']=ls1

ls=list(test_df['pickup_datetime'])
for i in range(len(ls)) :
    ls[i]=ls[i][:-4:]
    ls[i]=pd.Timestamp.weekday(pd.Timestamp(ls[i]))
test_df['Weekday']=ls

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.drop('pickup_datetime',inplace=True, axis=1)
test_df.drop('pickup_datetime', inplace=True,axis=1)

In [None]:
train_df.head()

In [None]:
# Map weekday numbers (0 = Monday ... 6 = Sunday) to names. Assigning the result back
# avoids replace(inplace=True) on a column selection, which may not update the DataFrame.
weekday_names=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
train_df['Weekday']=train_df['Weekday'].replace(to_replace=list(range(7)), value=weekday_names)
test_df['Weekday']=test_df['Weekday'].replace(to_replace=list(range(7)), value=weekday_names)

In [None]:
train_df.head()

In [None]:
train_onehot=pd.get_dummies(train_df['Weekday'])
test_onehot=pd.get_dummies(test_df['Weekday'])
train_df=pd.concat([train_df,train_onehot],axis=1)
test_df=pd.concat([test_df,test_onehot],axis=1)


In [None]:
train_df.drop('Weekday', axis=1,inplace=True)
test_df.drop('Weekday', axis=1,inplace=True)

In [None]:
train_df.head()

In [None]:
# Use positional access: after dropping rows, index label 0 is not guaranteed to exist.
type(train_df['pickup_time'].iloc[0])

In [None]:
ls1=list(train_df['pickup_time'])
for i in range(len(ls1)) :
    z=ls1[i].split(':')
    ls1[i]=int(z[0])*100+int(z[1])
train_df['pickup_time']=ls1

ls1=list(test_df['pickup_time'])
for i in range(len(ls1)) :
    z=ls1[i].split(':')
    ls1[i]=int(z[0])*100+int(z[1])
test_df['pickup_time']=ls1
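The row-by-row loops above can get slow on a million rows. As a possible alternative, here is a vectorized sketch on a toy DataFrame; it assumes `pickup_datetime` strings look like `2009-06-15 17:26:21 UTC`, and the toy rows are made up. It produces the same kind of `Weekday` number (0 = Monday) and `HHMM` integer `pickup_time` as the cells above.

```python
import pandas as pd

# Toy rows in the assumed 'YYYY-MM-DD HH:MM:SS UTC' format.
toy = pd.DataFrame({
    'pickup_datetime': [
        '2009-06-15 17:26:21 UTC',
        '2010-01-05 16:52:16 UTC',
        '2011-08-18 00:35:00 UTC',
    ]
})

# Strip the trailing ' UTC' and parse the whole column at once.
parsed = pd.to_datetime(
    toy['pickup_datetime'].str[:-4], format='%Y-%m-%d %H:%M:%S'
)

toy['Weekday'] = parsed.dt.weekday                          # 0 = Monday
toy['pickup_time'] = parsed.dt.hour * 100 + parsed.dt.minute  # e.g. 17:26 -> 1726

print(toy)
```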

In [None]:
m=len(train_df)
print(m)

In [None]:
train_df['pickup_time'].head()

In [None]:
type(train_df['pickup_time'])

In [None]:
ls=list(train_df['pickup_time'])
m=len(ls)
for i in range(m) :
    if ls[i]>700 and ls[i]<1000 :
        ls[i]='peak'
    elif ls[i]>1600 and ls[i]<2000 :
        ls[i]='peak'
    else :
        ls[i]='not Peak'
train_df['Peak_hour']=ls

In [None]:
train_df.head()

In [None]:
ls=list(test_df['pickup_time'])
m=len(ls)
for i in range(m) :
    if ls[i]>700 and ls[i]<1000 :
        ls[i]='peak'
    elif ls[i]>1600 and ls[i]<2000 :
        ls[i]='peak'
    else :
        ls[i]='not Peak'
test_df['Peak_hour']=ls
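The peak-hour rule used in the two loops above (strictly between 07:00 and 10:00, or strictly between 16:00 and 20:00, on the `HHMM` integer) can also be written without a loop. The sketch below applies it to a few made-up times with `np.where`; the label strings match the ones used above.

```python
import numpy as np
import pandas as pd

# Made-up HHMM values chosen to sit on and around the boundaries.
toy_times = pd.Series([659, 701, 959, 1000, 1601, 1999, 2000])

# Same strict inequalities as the loops: the boundary values are not peak.
is_peak = ((toy_times > 700) & (toy_times < 1000)) | \
          ((toy_times > 1600) & (toy_times < 2000))

toy_labels = np.where(is_peak, 'peak', 'not Peak')
print(list(toy_labels))
```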

In [None]:
trainoh=pd.get_dummies(train_df['Peak_hour'])
testoh=pd.get_dummies(test_df['Peak_hour'])
train_df=pd.concat([train_df,trainoh],axis=1)
test_df=pd.concat([test_df,testoh],axis=1)

In [None]:
test_df.tail()

In [None]:
train_df.drop('Peak_hour',inplace=True,axis=1)
test_df.drop('Peak_hour',inplace=True,axis=1)

In [None]:
train_df.head()

In [None]:
R=6373.0
lat1=np.asarray(np.radians(train_df['pickup_latitude']))
lon1=np.asarray(np.radians(train_df['pickup_longitude']))
lat2=np.asarray(np.radians(train_df['dropoff_latitude']))
lon2=np.asarray(np.radians(train_df['dropoff_longitude']))

dlat=lat2-lat1
dlon=lon1-lon2
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2*np.arctan2(np.sqrt(a), np.sqrt(1-a))
distance=R*c
train_df['Distance']=np.asarray(distance)*0.621

lat1=np.asarray(np.radians(test_df['pickup_latitude']))
lon1=np.asarray(np.radians(test_df['pickup_longitude']))
lat2=np.asarray(np.radians(test_df['dropoff_latitude']))
lon2=np.asarray(np.radians(test_df['dropoff_longitude']))

dlat=lat2-lat1
dlon=lon1-lon2
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2*np.arctan2(np.sqrt(a), np.sqrt(1-a))
distance=R*c
test_df['Distance']=np.asarray(distance)*0.621

In [None]:
R=6373.0
lat1=np.asarray(np.radians(train_df['pickup_latitude']))
lon1=np.asarray(np.radians(train_df['pickup_longitude']))
lat2=np.asarray(np.radians(train_df['dropoff_latitude']))
lon2=np.asarray(np.radians(train_df['dropoff_longitude']))

lat3=np.zeros(len(train_df))+np.radians(40.6413111)
lon3=np.zeros(len(train_df))+np.radians(-73.7781391)

dlat_pickup=lat3-lat1
dlon_pickup=lon3-lon1
dlat_dropoff=lat3-lat2
dlon_dropoff=lon3-lon2

a1 = np.sin(dlat_pickup/2)**2 + np.cos(lat1) * np.cos(lat3) * np.sin(dlon_pickup/2)**2
c1 = 2*np.arctan2(np.sqrt(a1), np.sqrt(1-a1))
distance1=R*c1
train_df['pickup_Distance_airport']=np.asarray(distance1)*0.621

a2 = np.sin(dlat_dropoff/2)**2 + np.cos(lat2) * np.cos(lat3) * np.sin(dlon_dropoff/2)**2
c2 = 2*np.arctan2(np.sqrt(a2), np.sqrt(1-a2))
distance2=R*c2
train_df['Dropoff_Distance_airport']=np.asarray(distance2)*0.621

In [None]:
R=6373.0
lat1=np.asarray(np.radians(test_df['pickup_latitude']))
lon1=np.asarray(np.radians(test_df['pickup_longitude']))
lat2=np.asarray(np.radians(test_df['dropoff_latitude']))
lon2=np.asarray(np.radians(test_df['dropoff_longitude']))

lat3=np.zeros(len(test_df))+np.radians(40.6413111)
lon3=np.zeros(len(test_df))+np.radians(-73.7781391)

dlat_pickup=lat3-lat1
dlon_pickup=lon3-lon1
dlat_dropoff=lat3-lat2
dlon_dropoff=lon3-lon2

a1 = np.sin(dlat_pickup/2)**2 + np.cos(lat1) * np.cos(lat3) * np.sin(dlon_pickup/2)**2
c1 = 2*np.arctan2(np.sqrt(a1), np.sqrt(1-a1))
distance1=R*c1
test_df['pickup_Distance_airport']=np.asarray(distance1)*0.621

a2 = np.sin(dlat_dropoff/2)**2 + np.cos(lat2) * np.cos(lat3) * np.sin(dlon_dropoff/2)**2
c2 = 2*np.arctan2(np.sqrt(a2), np.sqrt(1-a2))
distance2=R*c2
test_df['Dropoff_Distance_airport']=np.asarray(distance2)*0.621
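The last three cells repeat the same great-circle (haversine) computation for the train and test sets. One way to avoid the repetition is a small helper, sketched below. The function name `haversine_miles` and the constant names are hypothetical; the radius (6373.0 km), the km-to-miles factor (0.621), and the airport coordinates are taken from the cells above.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0   # radius used in the cells above
KM_TO_MILES = 0.621        # conversion factor used in the cells above
AIRPORT_LAT, AIRPORT_LON = 40.6413111, -73.7781391


def haversine_miles(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Great-circle distance in miles; accepts scalars or array-likes."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float))
        for v in (lat1_deg, lon1_deg, lat2_deg, lon2_deg)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return EARTH_RADIUS_KM * c * KM_TO_MILES


# One degree of latitude should come out close to the 69 miles quoted earlier.
one_degree = haversine_miles(40.0, -74.0, 41.0, -74.0)

# Distances from two made-up pickup points to the airport, computed in one call.
toy_airport_dist = haversine_miles(
    [40.75, 40.65], [-73.99, -73.80], AIRPORT_LAT, AIRPORT_LON
)
print(float(one_degree), toy_airport_dist)
```

With a helper like this, each distance column becomes a single call, for example `haversine_miles(df['pickup_latitude'], df['pickup_longitude'], AIRPORT_LAT, AIRPORT_LON)`.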

In [None]:
train_df['Distance']=np.round(train_df['Distance'],2)
train_df['pickup_Distance_airport']=np.round(train_df['pickup_Distance_airport'],2)
train_df['Dropoff_Distance_airport']=np.round(train_df['Dropoff_Distance_airport'],2)
test_df['Distance']=np.round(test_df['Distance'],2)
test_df['pickup_Distance_airport']=np.round(test_df['pickup_Distance_airport'],2)
test_df['Dropoff_Distance_airport']=np.round(test_df['Dropoff_Distance_airport'],2)

In [None]:
#train_df.drop(['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'],inplace=True,axis=1)
#test_df.drop(['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'],inplace=True,axis=1)

In [None]:
train_df.head()

In [None]:
train_df['abs_diff_longitude']=np.abs(train_df['abs_diff_longitude']-np.mean(train_df['abs_diff_longitude']))
train_df['abs_diff_longitude']=train_df['abs_diff_longitude']/np.var(train_df['abs_diff_longitude'])


In [None]:
test_df['abs_diff_longitude']=np.abs(test_df['abs_diff_longitude']- np.mean(test_df['abs_diff_longitude']))
test_df['abs_diff_longitude']=test_df['abs_diff_longitude']/np.var(test_df['abs_diff_longitude'])
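The two cells above rescale `abs_diff_longitude` using each dataset's own mean and variance. A common alternative, shown here only as a sketch on made-up data and not as the method this kernel uses, is z-score standardization with statistics computed on the training data and reused for the test data, so both sets are on the same scale.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up stand-ins for one feature in the training and test sets.
toy_train = rng.uniform(0.0, 0.2, size=1000)
toy_test = rng.uniform(0.0, 0.2, size=100)

# Statistics come from the training data only.
train_mean = toy_train.mean()
train_std = toy_train.std()

# Apply the same shift and scale to both sets.
toy_train_scaled = (toy_train - train_mean) / train_std
toy_test_scaled = (toy_test - train_mean) / train_std

print(toy_train_scaled.mean(), toy_train_scaled.std())
```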

In [None]:
train_df.shape

In [None]:
test_df.shape

In [None]:
train_df.head()

In [None]:
from sklearn.model_selection import train_test_split
X=train_df.drop(['key','fare_amount'],axis=1)
y=train_df['fare_amount']
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.1,random_state=80)

In [None]:
from sklearn.linear_model import LinearRegression
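# The `normalize` argument was removed from LinearRegression in scikit-learn 1.2.
# For ordinary least squares it should not change the predictions, so it is omitted.
lr=LinearRegression()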
lr.fit(X_train,y_train)


In [None]:
print(lr.score(X_test,y_test))
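Note that `lr.score` reports the coefficient of determination ($R^2$), while the competition score discussed at the end of this kernel is an RMSE. A minimal sketch of computing RMSE with `numpy` follows; the helper name `rmse` and the example fares are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up fares and predictions.
example_true = [10.0, 7.5, 20.0]
example_pred = [12.0, 7.5, 16.0]

example_rmse = rmse(example_true, example_pred)
print(round(example_rmse, 3))
```

On the held-out split from the cell above, the same helper could be called as `rmse(y_test, lr.predict(X_test))`.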

In [None]:
# Reuse the above helper functions to add our features and generate the input matrix.
#add_travel_vector_features(test_df)
#test_X = get_input_matrix(test_df)
# Predict fare_amount on the test set using our model (w) trained on the training set.
#test_y_predictions = np.matmul(test_X, w).round(decimals = 2)

# Write the predictions to a CSV file which we can submit to the competition.
#submission = pd.DataFrame(
 #   {'key': test_df.key, 'fare_amount': test_y_predictions},
  #  columns = ['key', 'fare_amount'])
#submission.to_csv('submission.csv', index = False)

#print(os.listdir('.'))

In [None]:
pred=np.round(lr.predict(test_df.drop('key',axis=1)),2)


In [None]:
# Build the submission from plain arrays so keys and predictions line up by position,
# even if test_df no longer has a 0..n-1 index after filtering.
submission=pd.DataFrame({'key': test_df['key'].values, 'fare_amount': pred})

In [None]:
submission

In [None]:
submission.set_index('key',inplace=True)

In [None]:
submission.to_csv('submission.csv')

## Ideas for Improvement
The output here will score an RMSE of $5.74, but you can do better than that!  Here are some suggestions:

* Use more columns from the input data.  Here we're only using the start/end GPS points from columns `[pickup|dropoff]_[latitude|longitude]`.  Try to see if the other columns -- `pickup_datetime` and `passenger_count` -- can help improve your results.
* Use absolute location data rather than relative.  Here we're only looking at the difference between the start and end points, but maybe the actual values -- indicating where in NYC the taxi is traveling -- would be useful.
* Use a non-linear model to capture more intricacies within the data.
* Try to find more outliers to prune, or construct useful feature crosses.
* Use the entire dataset -- here we're only using about 2% of the training data (1,000,000 of roughly 55M rows)!

Special thanks to Dan Becker, Will Cukierski, and Julia Elliot for reviewing this Kernel and providing suggestions!