In this competition, Kaggle is challenging you to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

Longtime Kagglers will recognize that this competition objective is similar to the ECML/PKDD trip time challenge we hosted in 2015. But, this challenge comes with a twist. Instead of awarding prizes to the top finishers on the leaderboard, this playground competition was created to reward collaboration and collective learning.

We are encouraging you (with cash prizes!) to publish additional training data that other participants can use for their predictions. We also have designated bi-weekly and final prizes to reward authors of kernels that are particularly insightful or valuable to the community.

# Import


In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb

import os
from pathlib import Path

# cross-validation utilities
from sklearn.model_selection import cross_val_score

# Random Forest regression
from sklearn.ensemble import RandomForestRegressor

# linear regression fitted with stochastic gradient descent
from sklearn.linear_model import SGDRegressor

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import ShuffleSplit


import matplotlib.pyplot as plt
import datetime as dt


%matplotlib inline

# 1. Data loading

In [None]:
#train = pd.read_csv('training/train.csv')
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
sample = pd.read_csv('../input/sample_submission.csv')


In [None]:
train.head()

In [None]:
train.dtypes

In [None]:
train.info()

# 2. Data exploration

In [None]:
train.isna().sum()

In [None]:
train.trip_duration.min()


In [None]:
train.trip_duration.max()

One minute has 60 seconds.  
One hour has 60 minutes, so 60 * 60 = 3600 seconds.  
One day has 24 hours, so 24 * 3600 = 86400 seconds.  
=> 3526282 / 86400 = 40.81 days


The minimum trip duration is 1 second and the maximum is about 40 days.
These values should be removed because they will skew the results.  
In my view, a taxi trip should last between 5 minutes and a few hours.
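
The bounds described above can be applied with a simple mask. This is a minimal sketch on a toy frame, not the notebook's own filtering step: the upper bound of 3 hours (`MAX_SECONDS`) is an assumption standing in for "a few hours", and the filter is meant to run on raw seconds, before any log transform of `trip_duration`.

```python
import pandas as pd

# Toy durations in seconds; 3526282 is the maximum discussed above.
toy = pd.DataFrame({"trip_duration": [1, 240, 300, 900, 7200, 3526282]})

SECONDS_PER_DAY = 24 * 60 * 60   # 86400
MIN_SECONDS = 5 * 60             # 5 minutes, as stated in the text
MAX_SECONDS = 3 * 60 * 60        # assumption: "a few hours" taken as 3 hours

# Longest trip expressed in days
max_days = toy["trip_duration"].max() / SECONDS_PER_DAY

# Keep only trips inside the plausible range (both bounds inclusive)
filtered = toy[toy["trip_duration"].between(MIN_SECONDS, MAX_SECONDS)]
print(round(max_days, 2), len(filtered))
```

On the real data the same mask would be applied to `train` only, since `test` has no `trip_duration` column.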

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1,figsize=(12,10))
plt.ylim(40.6, 40.9)
plt.xlim(-74.1,-73.7)
ax.scatter(train['pickup_longitude'],train['pickup_latitude'], s=0.0002, alpha=1)

# 3. Data preprocessing

#### 3.1 Outlier handling

In [None]:
plt.subplots(figsize=(18,7))
plt.title("Outlier distribution")
train.boxplot()

In [None]:
#train.loc[train.trip_duration<4000,"trip_duration"].hist(bins=120)
# log-transform the target to reduce the skew of the duration distribution
train['trip_duration'] = np.log(train['trip_duration'].values)

1. We could choose a maximum trip_duration (4000) and a minimum trip_duration (0, since trips can have zero duration or be cancelled).

In [None]:
train['passenger_count'].value_counts()

Distances

In [None]:
import math

def haversine(lat1, lon1, lat2, lon2):
   R = 6372800  # Earth radius in meters
   phi1, phi2 = math.radians(lat1), math.radians(lat2)
   dphi       = math.radians(lat2 - lat1)
   dlambda    = math.radians(lon2 - lon1)

   a = math.sin(dphi/2)**2 + \
       math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2

   return 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))

# coordinate differences between pickup and dropoff, in degrees
train['dist_long'] = train['pickup_longitude'] - train['dropoff_longitude']
test['dist_long'] = test['pickup_longitude'] - test['dropoff_longitude']

train['dist_lat'] = train['pickup_latitude'] - train['dropoff_latitude']
test['dist_lat'] = test['pickup_latitude'] - test['dropoff_latitude']

# straight-line (Euclidean) distance in degrees; the haversine function above is not applied here
train['dist'] = np.sqrt(np.square(train['dist_long']) + np.square(train['dist_lat']))
test['dist'] = np.sqrt(np.square(test['dist_long']) + np.square(test['dist_lat']))
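
The `haversine` function in this cell relies on `math`, so it only accepts scalars and cannot be called directly on whole columns. A possible vectorized variant is sketched below; the name `haversine_np` is hypothetical, and the formula and Earth radius are the same as in the scalar version.

```python
import numpy as np

EARTH_RADIUS_M = 6372800  # same radius as the scalar haversine, in meters

def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars, arrays, or pandas Series."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)

    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2

    return 2 * EARTH_RADIUS_M * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Sanity check: one degree of longitude along the equator
d = haversine_np(np.array([0.0]), np.array([0.0]), np.array([0.0]), np.array([1.0]))
```

With the notebook's frames it could be called as `haversine_np(train['pickup_latitude'], train['pickup_longitude'], train['dropoff_latitude'], train['dropoff_longitude'])` to get a distance in meters instead of degrees.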

In [None]:
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'])
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])

train['hour'] = train.pickup_datetime.dt.hour
train['day'] = train.pickup_datetime.dt.dayofweek
train['month'] = train.pickup_datetime.dt.month
test['hour'] = test.pickup_datetime.dt.hour
test['day'] = test.pickup_datetime.dt.dayofweek
test['month'] = test.pickup_datetime.dt.month

Removing inconsistent speeds and distances
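
No code accompanies this step in the notebook. One way it could look, sketched on toy data: derive an average speed from a distance in meters and a raw duration in seconds, then keep only plausible rows. The column `dist_m` and both thresholds are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "dist_m": [1200.0, 0.0, 5000.0, 90000.0],  # hypothetical distance column, in meters
    "trip_duration": [600, 400, 900, 300],      # raw seconds, not log-transformed
})

MAX_SPEED_KMH = 120.0  # assumption: upper bound on a plausible average speed
MIN_DIST_M = 1.0       # assumption: drop trips that do not move

# meters per second -> kilometers per hour
toy["speed_kmh"] = toy["dist_m"] / toy["trip_duration"] * 3.6

# Keep rows that pass both the speed and the distance check
plausible = toy[(toy["speed_kmh"] <= MAX_SPEED_KMH) & (toy["dist_m"] >= MIN_DIST_M)]
```

Such a filter only makes sense on the training set and would have to run before the target is log-transformed.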


#### 3.2 Missing values handling

In [None]:
train.isnull().sum()

#### 3.4 Data scaling

# 4. Feature engineering: selection, extraction, creation

In [None]:
# columns present in train but missing from test
col_diff = list(set(train.columns).difference(set(test.columns)))

In [None]:
train.head()

In [None]:
# feature columns shared by train and test
features = ["vendor_id", "passenger_count", "pickup_longitude", "pickup_latitude",
            "dropoff_longitude", "dropoff_latitude", "month", "hour", "day", "dist"]

y_train = train["trip_duration"]  # <-- target (log-transformed earlier)
X_train = train[features]         # <-- features

X_datatest = test[features]

# 5. Model and/or dataset selection (if there are several)

In [None]:
# declare the model and train it

#sgd = SGDRegressor()
#sgd.fit(X_train, y_train)

# 6. Model training & predictions

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.1, random_state=42)

In [None]:
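# max_features=1.0 uses all features at each split; this is what "auto" meant for
# regressors before that option was removed from scikit-learn
rfr = RandomForestRegressor(
    n_estimators=100,
    min_samples_leaf=10,
    min_samples_split=15,
    max_depth=80,
    verbose=0,
    max_features=1.0,
    bootstrap=True,
    n_jobs=-1,
)
rfr.fit(X_train, y_train)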

In [None]:
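# Too slow
# compute the model's cross-validation scores over 5 folds of the training set
# the target is already log-transformed, so the squared error is taken directly in log space;
# neg_mean_squared_log_error would apply the log a second time
cv_scores = cross_val_score(rfr, X_train, y_train, cv=5, scoring='neg_mean_squared_error')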

In [None]:
cv_scores

In [None]:
# scores are negated squared errors: take the absolute value, then the square root, per fold
cv_scores = np.sqrt(np.abs(cv_scores))
cv_scores
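
The hold-out split created earlier (`X_valid`, `y_valid`) is not used in the cells above. A minimal sketch of a root mean squared logarithmic error helper that could score it is given below; the function name `rmsle` is hypothetical, and it assumes both arguments are in original units (seconds), so log-scale values would need `np.exp` first.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on values in original units (seconds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy checks: identical values give zero error,
# and values whose log1p differ by exactly 1 give an error of 1
score_same = rmsle([300, 900], [300, 900])
score_off = rmsle([np.e - 1], [np.e ** 2 - 1])
```

In this notebook it might be called as `rmsle(np.exp(y_valid), np.exp(rfr.predict(X_valid)))`, since the model is trained on the log of the duration.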


In [None]:
# despite the name, these are predictions on the test set, still on the log scale
train_pred = rfr.predict(X_datatest)
train_pred[:5]

In [None]:
train_pred

In [None]:
# np.exp undoes the log transform applied to the target, giving durations in seconds
my_submission = pd.DataFrame({"id": test.id, "trip_duration": np.exp(train_pred)})
print(my_submission)

In [None]:
my_submission.to_csv('submission.csv', index=False)


In [None]:
my_submission.head()