# This is a basic Starter Kernel for the New York City Taxi Fare Prediction Playground Competition 
Here we'll use a simple linear model, built on the travel vector from each taxi's pickup location to its dropoff location, to predict the `fare_amount` of each ride.

This kernel uses `pandas` and `numpy` for the feature engineering and `sklearn` for the final linear fit.  Other higher-level libraries, for example `statsmodels`, would work as well.

In [None]:
# Initial Python environment setup...
import numpy as np # linear algebra
import pandas as pd # CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import time
import os # reading the input files we have access to

print(os.listdir('../input'))

### Set up training data
First let's read in our training data.  Kernels do not yet support enough memory to load the whole dataset at once, at least using `pd.read_csv`.  The entire dataset is about 55M rows, so we read only the first 10M rows here; with more memory it's certainly possible to build a model using all the data.

In [None]:
train_df =  pd.read_csv('../input/train.csv', nrows = 10_000_000)
train_df.dtypes
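
One way to stretch the available memory is to read the file in pieces, clean each piece, and store the columns in a smaller dtype. The following is a minimal sketch of that idea, not what this kernel does: `read_csv_in_chunks` is a hypothetical helper, the tiny in-memory CSV stands in for `train.csv`, and `float32` is an assumed trade-off between precision and size.

```python
import io

import pandas as pd


def read_csv_in_chunks(source, chunksize, dtype=None):
    """Read a CSV piece by piece, drop incomplete rows, and concatenate."""
    chunks = []
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype=dtype):
        # Cleaning per chunk keeps the peak memory use lower.
        chunks.append(chunk.dropna())
    return pd.concat(chunks, ignore_index=True)


# Tiny stand-in for train.csv (made-up values, one row with a missing fare).
toy_csv = io.StringIO(
    "fare_amount,pickup_longitude,pickup_latitude\n"
    "4.5,-73.98,40.75\n"
    "16.9,-74.01,40.71\n"
    ",-73.99,40.76\n"
    "5.7,-73.97,40.77\n"
)

toy_df = read_csv_in_chunks(
    toy_csv,
    chunksize=2,
    dtype={
        'fare_amount': 'float32',
        'pickup_longitude': 'float32',
        'pickup_latitude': 'float32',
    },
)
print(toy_df)
```

Note that `float32` keeps roughly seven significant digits, so coordinates lose a little precision compared with the default `float64`.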

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.info()

In [None]:
test_df = pd.read_csv('../input/test.csv')
test_df.dtypes

In [None]:
test_df.head()

In [None]:
test_df.info()

In [None]:
test_df.shape

In [None]:
train_df.isna().sum()

Let's create two new features in our training set representing the "travel vector" between the start and end points of the taxi ride, in both longitude and latitude coordinates.  We'll take the absolute value since we're only interested in distance traveled. Use a helper function since we'll want to do the same thing for the test set later.

In [None]:
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train_df)
add_travel_vector_features(test_df)
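
To see why the absolute value matters, here is a small worked example on made-up coordinates. It is only a sketch: `travel_vector` is a hypothetical variant of the helper above that returns a copy instead of modifying the dataframe in place.

```python
import pandas as pd


def travel_vector(df):
    """Return a copy of df with the two absolute-difference columns added."""
    out = df.copy()
    out['abs_diff_longitude'] = (out['dropoff_longitude'] - out['pickup_longitude']).abs()
    out['abs_diff_latitude'] = (out['dropoff_latitude'] - out['pickup_latitude']).abs()
    return out


# Two made-up rides covering the same route in opposite directions.
toy = pd.DataFrame({
    'pickup_longitude': [-74.0, -73.5],
    'pickup_latitude': [40.5, 41.0],
    'dropoff_longitude': [-73.5, -74.0],
    'dropoff_latitude': [41.0, 40.5],
})

toy_vec = travel_vector(toy)
print(toy_vec[['abs_diff_longitude', 'abs_diff_latitude']])
```

Both rides end up with the same feature values, which is the point of discarding the direction of travel.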

In [None]:
print(f'Before Dropping null values: {len(train_df)}')
train_df.dropna(inplace=True)
print(f'After Dropping null values: {len(train_df)}')

### Explore and prune outliers
First let's see if there are any `NaN`s in the dataset.

In [None]:
print(train_df.isnull().sum())

There are only a few, so let's remove them from the dataset.

In [None]:
print('Old size: %d' % len(train_df))
train_df = train_df.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(train_df))

Now let's quickly plot a subset of our travel vector features to see their distribution.

In [None]:
plot = train_df.iloc[:2000].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

We expect most of these values to be very small (likely between 0 and 1), since they should all be differences between GPS coordinates within one city.  For reference, one degree of latitude is about 69 miles.  However, the dataset contains extreme values that do not make sense, so let's remove them from our training set.  Based on the scatterplot, it looks like we can safely exclude values above 5 (though remember the scatterplot only shows the first 2000 rows).

In [None]:
print('Old size: %d' % len(train_df))
train_df = train_df[(train_df.abs_diff_longitude < 5.0) & (train_df.abs_diff_latitude < 5.0)]
print('New size: %d' % len(train_df))
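
The filter above keeps a row only when both differences are below the cutoff. Using the figure of about 69 miles per degree of latitude, a cutoff of 5 degrees corresponds to roughly 345 miles, far more than any ride within one city. A small sketch on made-up values, where `drop_extreme_vectors` is a hypothetical helper:

```python
import pandas as pd

MAX_DEGREES = 5.0  # same cutoff as the filter above


def drop_extreme_vectors(df, max_degrees=MAX_DEGREES):
    """Keep rows whose longitude and latitude differences are both below the cutoff."""
    mask = (df['abs_diff_longitude'] < max_degrees) & (df['abs_diff_latitude'] < max_degrees)
    return df[mask]


# Made-up rows: one plausible ride and two with nonsensical differences.
toy = pd.DataFrame({
    'abs_diff_longitude': [0.01, 73.9, 0.02],
    'abs_diff_latitude': [0.02, 40.7, 6.0],
})

kept = drop_extreme_vectors(toy)
print('Old size: %d' % len(toy))
print('New size: %d' % len(kept))
```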

In [None]:
# Extract the 'HH:MM' part of each 'pickup_datetime' string.
def creating_time(df):
    df['pickuptime'] = df['pickup_datetime'].str[11:-7]

creating_time(train_df)
creating_time(test_df)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
# Derive the day of the week (Monday=0 ... Sunday=6) from 'pickup_datetime'.
def creating_weekdays(df):
    # Strip the last four characters (the timezone suffix) before parsing.
    df['Weekday'] = pd.to_datetime(df['pickup_datetime'].str[:-4]).dt.weekday

creating_weekdays(train_df)
creating_weekdays(test_df)    
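
An alternative to slicing the strings is to let `pandas` parse the timestamps and read the weekday and time of day from the parsed values. This is a sketch under the assumption that every `pickup_datetime` looks like `'2009-06-15 17:26:21 UTC'`; `add_time_features` is a hypothetical helper, and it produces the integer `HHMM` encoding directly.

```python
import pandas as pd


def add_time_features(df):
    """Return a copy of df with 'Weekday' and an integer 'pickuptime' (HHMM)."""
    out = df.copy()
    # Assumed input format: 'YYYY-MM-DD HH:MM:SS UTC'.
    ts = pd.to_datetime(out['pickup_datetime'], format='%Y-%m-%d %H:%M:%S UTC')
    out['Weekday'] = ts.dt.weekday                        # Monday=0 ... Sunday=6
    out['pickuptime'] = ts.dt.hour * 100 + ts.dt.minute   # 17:26 becomes 1726
    return out


toy = pd.DataFrame({
    'pickup_datetime': ['2009-06-15 17:26:21 UTC', '2010-01-05 16:52:16 UTC'],
})

toy_time = add_time_features(toy)
print(toy_time)
```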

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_df.drop('pickup_datetime',inplace=True,axis=1)
test_df.drop('pickup_datetime',inplace=True,axis=1)

In [None]:
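def replace_weekday(df):
    # Map weekday numbers (Monday=0 ... Sunday=6) to their names.  Assigning the
    # result back avoids relying on an in-place replace of a selected column.
    names = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
    df['Weekday'] = df['Weekday'].map(dict(enumerate(names)))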
replace_weekday(train_df)
replace_weekday(test_df)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_one_hot=pd.get_dummies(train_df['Weekday'])
test_one_hot=pd.get_dummies(test_df['Weekday'])
train_df=pd.concat([train_df,train_one_hot],axis=1)
test_df=pd.concat([test_df,test_one_hot],axis=1)
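
One-hot encoding the two sets separately only gives matching columns if every weekday occurs in both. A hedged sketch of one way to guard against a mismatch is to reindex the test dummies to the training columns; the two small series below are made up for illustration.

```python
import pandas as pd

# Made-up weekday columns; the test set is missing most days.
train_days = pd.Series(['Monday', 'Tuesday', 'Sunday'], name='Weekday')
test_days = pd.Series(['Monday', 'Monday'], name='Weekday')

train_dummies = pd.get_dummies(train_days)

# Force the same columns, in the same order, filling absent days with 0.
test_dummies = pd.get_dummies(test_days).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(train_dummies.columns.tolist())
print(test_dummies)
```

With identical column sets, the test features line up with what the model saw during training.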

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_df.drop('Weekday',axis=1,inplace=True)
test_df.drop('Weekday',axis=1,inplace=True)

In [None]:
# Encode the 'HH:MM' pickup time as the integer HHMM, e.g. '17:26' -> 1726.
def creating_pickupdate(df):
    parts = df['pickuptime'].str.split(':', expand=True).astype(int)
    df['pickuptime'] = parts[0] * 100 + parts[1]

creating_pickupdate(train_df)
creating_pickupdate(test_df)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
def finding_distance(df):
    # Haversine (great-circle) distance between pickup and dropoff, in miles.
    R = 6373.0  # approximate radius of the Earth in km
    lat1 = np.asarray(np.radians(df['pickup_latitude']))
    lon1 = np.asarray(np.radians(df['pickup_longitude']))
    lat2 = np.asarray(np.radians(df['dropoff_latitude']))
    lon2 = np.asarray(np.radians(df['dropoff_longitude']))

    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/ 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c

    df['Distance'] = np.asarray(distance) * 0.621  # km -> miles

finding_distance(train_df)
finding_distance(test_df)
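
The distance above follows the haversine formula: with latitudes and longitudes in radians, a = sin²(Δlat/2) + cos(lat1)·cos(lat2)·sin²(Δlon/2), and the distance is 2·R·atan2(√a, √(1−a)). As a sanity check, here is a standalone sketch that takes plain numbers or arrays instead of a dataframe. `haversine_miles` is a hypothetical helper; the radius and the km-to-miles factor are the same constants used in the cell above.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same value as in finding_distance
KM_TO_MILES = 0.621       # same conversion factor as in finding_distance


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return EARTH_RADIUS_KM * c * KM_TO_MILES


# One degree of latitude along a fixed longitude.
one_degree = haversine_miles(40.0, -74.0, 41.0, -74.0)
print(round(float(one_degree), 2))
```

The result comes out at about 69 miles, matching the rule of thumb for one degree of latitude.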

In [None]:
def creating_pickup_dropoff_distance(df):
    R = 6373.0
    lat1 =np.asarray(np.radians(df['pickup_latitude']))
    lon1 = np.asarray(np.radians(df['pickup_longitude']))
    lat2 = np.asarray(np.radians(df['dropoff_latitude']))
    lon2 = np.asarray(np.radians(df['dropoff_longitude']))

    # Fixed reference point: these coordinates correspond to JFK airport.
    lat3=np.zeros(len(df))+np.radians(40.6413111)
    lon3=np.zeros(len(df))+np.radians(-73.7781391)
    dlon_pickup = lon3 - lon1
    dlat_pickup = lat3 - lat1
    d_lon_dropoff=lon3 -lon2
    d_lat_dropoff=lat3-lat2
    a1 = np.sin(dlat_pickup/2)**2 + np.cos(lat1) * np.cos(lat3) * np.sin(dlon_pickup/ 2)**2
    c1 = 2 * np.arctan2(np.sqrt(a1), np.sqrt(1 - a1))
    distance1 = R * c1
    df['Pickup_Distance_airport']=np.asarray(distance1)*0.621

    a2=np.sin(d_lat_dropoff/2)**2 + np.cos(lat2) * np.cos(lat3) * np.sin(d_lon_dropoff/ 2)**2
    c2 = 2 * np.arctan2(np.sqrt(a2), np.sqrt(1 - a2))
    distance2 = R * c2

    
    df['Dropoff_Distance_airport']=np.asarray(distance2)*0.621

creating_pickup_dropoff_distance(train_df)
creating_pickup_dropoff_distance(test_df)

In [None]:
train_df['Distance']=np.round(train_df['Distance'],2)
train_df['Pickup_Distance_airport']=np.round(train_df['Pickup_Distance_airport'],2)
train_df['Dropoff_Distance_airport']=np.round(train_df['Dropoff_Distance_airport'],2)
test_df['Distance']=np.round(test_df['Distance'],2)
test_df['Pickup_Distance_airport']=np.round(test_df['Pickup_Distance_airport'],2)
test_df['Dropoff_Distance_airport']=np.round(test_df['Dropoff_Distance_airport'],2)

In [None]:
train_df.drop(['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'],axis=1,inplace=True)
test_df.drop(['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'],axis=1,inplace=True)

In [None]:
train_df['abs_diff_longitude']=np.abs(train_df['abs_diff_longitude']-np.mean(train_df['abs_diff_longitude']))
train_df['abs_diff_longitude']=train_df['abs_diff_longitude']/np.var(train_df['abs_diff_longitude'])

In [None]:
train_df['abs_diff_latitude']=np.abs(train_df['abs_diff_latitude']-np.mean(train_df['abs_diff_latitude']))
train_df['abs_diff_latitude']=train_df['abs_diff_latitude']/np.var(train_df['abs_diff_latitude'])

In [None]:
test_df['abs_diff_longitude']=np.abs(test_df['abs_diff_longitude']-np.mean(test_df['abs_diff_longitude']))
test_df['abs_diff_longitude']=test_df['abs_diff_longitude']/np.var(test_df['abs_diff_longitude'])

test_df['abs_diff_latitude']=np.abs(test_df['abs_diff_latitude']-np.mean(test_df['abs_diff_latitude']))
test_df['abs_diff_latitude']=test_df['abs_diff_latitude']/np.var(test_df['abs_diff_latitude'])
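
The cells above rescale each difference by taking its absolute deviation from the mean and dividing by the variance, with separate statistics for the training and test sets. A more conventional option, shown here only as a sketch of an alternative and not as what this kernel does, is a z-score that reuses the training mean and standard deviation for the test set. `standardize_with_train_stats` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def standardize_with_train_stats(train, test, columns):
    """Z-score the given columns, using statistics from the training set only."""
    train_out, test_out = train.copy(), test.copy()
    for col in columns:
        mean = train[col].mean()
        std = train[col].std(ddof=0)  # population standard deviation
        train_out[col] = (train[col] - mean) / std
        test_out[col] = (test[col] - mean) / std
    return train_out, test_out


toy_train = pd.DataFrame({'abs_diff_longitude': [0.0, 0.02, 0.04]})
toy_test = pd.DataFrame({'abs_diff_longitude': [0.02]})

scaled_train, scaled_test = standardize_with_train_stats(
    toy_train, toy_test, ['abs_diff_longitude']
)
print(scaled_train)
print(scaled_test)
```

Reusing the training statistics means a given raw value is mapped to the same scaled value in both sets.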

In [None]:
print(train_df.shape)
print(test_df.shape)

In [None]:
from sklearn.model_selection import train_test_split
X=train_df.drop(['key','fare_amount'],axis=1)
y=train_df['fare_amount']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.01,random_state=80)

In [None]:
from sklearn.linear_model import LinearRegression
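# The `normalize` argument was removed in scikit-learn 1.2; ordinary least
# squares yields the same predictions without it.
lr=LinearRegression()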
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
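
Note that `lr.score` reports the coefficient of determination (R²), not an error in dollars. If the leaderboard metric is root mean squared error (an assumption here; check the competition's evaluation page), a small helper gives a number that is easier to compare. `rmse` is a hypothetical function and the fares below are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up fares and predictions: errors of 1, -2 and 0 dollars.
example_rmse = rmse([10.0, 12.0, 8.0], [9.0, 14.0, 8.0])
print(round(example_rmse, 3))
```

In the notebook this could be called as `rmse(y_test, lr.predict(X_test))`.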

In [None]:
# Select the test columns in the same order as the training features.
pred=np.round(lr.predict(test_df[X.columns]),2)
print(pred)

In [None]:
Submission = pd.DataFrame(data = pred,columns = ['fare_amount'])
Submission['key'] = test_df['key']
Submission = Submission[['key','fare_amount']]

In [None]:
Submission.set_index('key', inplace = True)

In [None]:
Submission.head()

In [None]:
Submission.to_csv('Submission.csv')
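
The submission file needs exactly two columns, `key` and `fare_amount`. As a compact sketch of the same steps, the hypothetical helper `build_submission` below builds that frame directly and writes it without the index; the keys and predictions are made up, and an in-memory buffer stands in for the output file.

```python
import io

import numpy as np
import pandas as pd


def build_submission(keys, predictions):
    """Build a two-column frame with fares rounded to cents."""
    return pd.DataFrame({
        'key': list(keys),
        'fare_amount': np.round(predictions, 2),
    })


toy_submission = build_submission(['a', 'b'], np.array([11.456, 8.0]))

# Write to a buffer here; pass a filename instead to create a real file.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```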