# Task 1

## In the previous project, you learnt how to build a logistic regression model to predict whether a customer will churn.

### In this assignment, we provide the NYC trip duration dataset, and the goal is to predict trip duration using regression models.

### At some point, almost every one of us has taken a ride with Ola or Uber.

#### Ride-hailing services use online platforms to connect passengers with local drivers who use their personal vehicles. In most cases they offer comfortable door-to-door transport, and they are usually cheaper than licensed taxicabs. Examples include Uber and Lyft.

#### To improve the efficiency of taxi dispatching systems for such services, it is important to predict how long a driver's taxi will be occupied. If a dispatcher knew approximately when each driver would finish their current ride, they could better decide which driver to assign to each pickup request. We can therefore try to predict trip duration with machine learning regression models.

### You can download the dataset from the link given below and build a regression model on it. Once you have built the model, submit the Jupyter notebook and we will evaluate it. Make sure you calculate the R² score for the model.
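Since the task asks for the R² score, it may help to see what that number measures. The sketch below computes the coefficient of determination by hand as 1 - SS_res / SS_tot. The helper name `r2_manual` and the toy values are assumptions made for illustration; `sklearn.metrics.r2_score` computes the same quantity.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

# Toy values, made up for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_manual(y_true, y_true)          # predictions equal to the truth
mean_only = r2_manual(y_true, [6.0] * 4)     # always predicting the mean of y_true
print(perfect, mean_only)
```

A model that always predicts the mean of the target scores 0, and a perfect model scores 1, so R² can be read as the improvement over that mean-only baseline.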

In [1]:
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Load the dataset
df = pd.read_csv("D:\\Sourav_Singh\\Excell ,SQL & Tableau\\Project_to_make\\nyc_taxi_trip_duration.csv")

In [2]:
df.shape

(729322, 11)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 729322 entries, 0 to 729321
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  729322 non-null  object 
 1   vendor_id           729322 non-null  int64  
 2   pickup_datetime     729322 non-null  object 
 3   dropoff_datetime    729322 non-null  object 
 4   passenger_count     729322 non-null  int64  
 5   pickup_longitude    729322 non-null  float64
 6   pickup_latitude     729322 non-null  float64
 7   dropoff_longitude   729322 non-null  float64
 8   dropoff_latitude    729322 non-null  float64
 9   store_and_fwd_flag  729322 non-null  object 
 10  trip_duration       729322 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 61.2+ MB


In [4]:
df["vendor_id"].unique()

array([2, 1], dtype=int64)

In [5]:
df['vendor_id'] = df['vendor_id'].astype(str)

In [6]:
df["vendor_id"].unique()

array(['2', '1'], dtype=object)

In [7]:
df["vendor_id"].dtype

dtype('O')

In [8]:
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])

In [9]:
df["dropoff_datetime"] = pd.to_datetime(df["dropoff_datetime"])

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 729322 entries, 0 to 729321
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   id                  729322 non-null  object        
 1   vendor_id           729322 non-null  object        
 2   pickup_datetime     729322 non-null  datetime64[ns]
 3   dropoff_datetime    729322 non-null  datetime64[ns]
 4   passenger_count     729322 non-null  int64         
 5   pickup_longitude    729322 non-null  float64       
 6   pickup_latitude     729322 non-null  float64       
 7   dropoff_longitude   729322 non-null  float64       
 8   dropoff_latitude    729322 non-null  float64       
 9   store_and_fwd_flag  729322 non-null  object        
 10  trip_duration       729322 non-null  int64         
dtypes: datetime64[ns](2), float64(4), int64(2), object(3)
memory usage: 61.2+ MB


In [11]:
# List the categorical columns to encode.
# vendor_id was cast to string above; store_and_fwd_flag is already an object column.
categorical_columns = ["vendor_id", "store_and_fwd_flag"]

In [12]:
#Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

In [13]:
# Apply one-hot encoding to the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

In [14]:
#Create a DataFrame with the one-hot encoded columns
#We use get_feature_names_out() to get the column names for the encoded data
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

In [15]:
# Concatenate the one-hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, one_hot_df], axis=1)

In [16]:
# Drop the original categorical columns
df_encoded = df_encoded.drop(categorical_columns, axis=1)

In [17]:
df_encoded

,id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,vendor_id_1,vendor_id_2,store_and_fwd_flag_N,store_and_fwd_flag_Y
0,id1080784,2016-02-29 16:40:21,2016-02-29 16:47:01,1,-73.953918,40.778873,-73.963875,40.771164,400,0.0,1.0,1.0,0.0
1,id0889885,2016-03-11 23:35:37,2016-03-11 23:53:57,2,-73.988312,40.731743,-73.994751,40.694931,1100,1.0,0.0,1.0,0.0
2,id0857912,2016-02-21 17:59:33,2016-02-21 18:26:48,2,-73.997314,40.721458,-73.948029,40.774918,1635,0.0,1.0,1.0,0.0
3,id3744273,2016-01-05 09:44:31,2016-01-05 10:03:32,6,-73.961670,40.759720,-73.956779,40.780628,1141,0.0,1.0,1.0,0.0
4,id0232939,2016-02-17 06:42:23,2016-02-17 06:56:31,1,-74.017120,40.708469,-73.988182,40.740631,848,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
729317,id3905982,2016-05-21 13:29:38,2016-05-21 13:34:34,2,-73.965919,40.789780,-73.952637,40.789181,296,0.0,1.0,1.0,0.0
729318,id0102861,2016-02-22 00:43:11,2016-02-22 00:48:26,1,-73.996666,40.737434,-74.001320,40.731911,315,1.0,0.0,1.0,0.0
729319,id0439699,2016-04-15 18:56:48,2016-04-15 19:08:01,1,-73.997849,40.761696,-74.001488,40.741207,673,1.0,0.0,1.0,0.0
729320,id2078912,2016-06-19 09:50:47,2016-06-19 09:58:14,1,-74.006706,40.708244,-74.013550,40.713814,447,1.0,0.0,1.0,0.0


In [18]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 729322 entries, 0 to 729321
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   id                    729322 non-null  object        
 1   pickup_datetime       729322 non-null  datetime64[ns]
 2   dropoff_datetime      729322 non-null  datetime64[ns]
 3   passenger_count       729322 non-null  int64         
 4   pickup_longitude      729322 non-null  float64       
 5   pickup_latitude       729322 non-null  float64       
 6   dropoff_longitude     729322 non-null  float64       
 7   dropoff_latitude      729322 non-null  float64       
 8   trip_duration         729322 non-null  int64         
 9   vendor_id_1           729322 non-null  float64       
 10  vendor_id_2           729322 non-null  float64       
 11  store_and_fwd_flag_N  729322 non-null  float64       
 12  store_and_fwd_flag_Y  729322 non-null  float64       
dtypes: datetime64[ns](2), float64(8), int64(2), object(1)

In [19]:
df_encoded["id"].unique()

array(['id1080784', 'id0889885', 'id0857912', ..., 'id0439699',
       'id2078912', 'id1053441'], dtype=object)

In [20]:
# Check whether id repeats: if every value is unique, the column has no predictive value
df_cato = df_encoded["id"]
df_cato.describe()

count        729322
unique       729322
top       id1080784
freq              1
Name: id, dtype: object

In [21]:
# Extract time-based features from the pickup timestamp (dayofweek: Monday=0 ... Sunday=6)
df_encoded['pickup_hour'] = df_encoded['pickup_datetime'].dt.hour
df_encoded['pickup_day'] = df_encoded['pickup_datetime'].dt.dayofweek
df_encoded['pickup_month'] = df_encoded['pickup_datetime'].dt.month

In [22]:
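# Saturday (5) and Sunday (6) count as the weekend
df_encoded['is_weekend'] = df_encoded['pickup_datetime'].dt.weekday >= 5
# Flag pickups in the morning (hours 7-9) and evening (hours 17-19) rush periods
df_encoded['rush_hour'] = df_encoded['pickup_hour'].isin([7, 8, 9, 17, 18, 19]).astype(int)
# Model log(1 + duration) so that very long trips do not dominate the fit
df_encoded['log_trip_duration'] = np.log1p(df_encoded['trip_duration'])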
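The target is stored as `log1p(trip_duration)`, so model outputs live on a log scale. The short sketch below shows the transform and its inverse, `np.expm1`. The values 400, 1100 and 1635 are durations from the sample rows displayed earlier; the leading 0 is added only to show that `log1p` is defined at zero.

```python
import numpy as np

durations = np.array([0, 400, 1100, 1635])   # seconds
log_durations = np.log1p(durations)          # log(1 + x), finite even when x = 0
recovered = np.expm1(log_durations)          # exp(x) - 1, the inverse of log1p

print(log_durations)
print(recovered)
```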

In [23]:
# Define the Haversine function to calculate the distance between two points
def distance(lat1, lon1, lat2, lon2):
    R = 6371  # Radius of the Earth in kilometers
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c  # Returns distance in kilometers
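A quick sanity check of the Haversine formula can catch swapped arguments or unit mistakes. The sketch below repeats the function under the hypothetical name `haversine_km` so that it runs on its own, then checks three cases: identical points, one degree of latitude, and the coordinates of the first displayed trip.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (standalone copy for this check)."""
    R = 6371  # Radius of the Earth in kilometers
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return R * 2 * np.arcsin(np.sqrt(a))

same_point = haversine_km(40.7, -74.0, 40.7, -74.0)     # zero distance
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)           # about 111.19 km (R * pi / 180)
first_trip = haversine_km(40.778873, -73.953918,
                          40.771164, -73.963875)        # about 1.199 km

print(same_point, one_degree, first_trip)
```

Note that this is a straight-line distance over the Earth's surface, so it will generally understate the distance actually driven along streets.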

In [24]:
# Calculate trip_distance using Haversine formula
df_encoded['trip_distance'] = distance(df_encoded['pickup_latitude'], df_encoded['pickup_longitude'],
                                  df_encoded['dropoff_latitude'], df_encoded['dropoff_longitude'])

In [25]:
# Drop the identifier, raw timestamps, raw coordinates, and the untransformed target.
# dropoff_datetime must be removed: together with pickup_datetime it reveals the trip duration.
df_encoded_dt = df_encoded.drop(columns=[
    'id', 'pickup_datetime', 'dropoff_datetime',
    'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
    'trip_duration',
])

In [26]:
df_encoded_dt.head()

,passenger_count,vendor_id_1,vendor_id_2,store_and_fwd_flag_N,store_and_fwd_flag_Y,pickup_hour,pickup_day,pickup_month,is_weekend,rush_hour,log_trip_duration,trip_distance
0,1,0.0,1.0,1.0,0.0,16,0,2,False,0,5.993961,1.199073
1,2,1.0,0.0,1.0,0.0,23,4,3,False,0,7.003974,4.129111
2,2,0.0,1.0,1.0,0.0,17,6,2,True,1,7.40001,7.250753
3,6,0.0,1.0,1.0,0.0,9,1,1,False,1,7.040536,2.361097
4,1,1.0,0.0,1.0,0.0,6,2,2,False,0,6.744059,4.328534


In [27]:
# Define the features (X) and the target (y)
X = df_encoded_dt.drop(["log_trip_duration"], axis=1)
y = df_encoded_dt['log_trip_duration']  # Target: log1p of the trip duration in seconds

In [28]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [29]:
# Check that the target is numeric
print(y_train.dtypes)

float64


In [30]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(583457, 11) (583457,)
(145865, 11) (145865,)


In [31]:
print(X_train.isna().sum())

passenger_count         0
vendor_id_1             0
vendor_id_2             0
store_and_fwd_flag_N    0
store_and_fwd_flag_Y    0
pickup_hour             0
pickup_day              0
pickup_month            0
is_weekend              0
rush_hour               0
trip_distance           0
dtype: int64


In [32]:
rf_model = RandomForestRegressor(n_estimators=10, random_state=42)

In [33]:
rf_model.fit(X_train, y_train)

In [34]:
y_pred_rf = rf_model.predict(X_test)
print(y_pred_rf)

[6.34630638 5.40628578 7.48563582 ... 6.16633333 7.25539379 6.05418192]


In [35]:
r2_rf = r2_score(y_test, y_pred_rf)
print(r2_rf)

0.5876996530149412
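The R² above is measured on the log-transformed target, which is hard to interpret as a trip time. As a sketch with made-up numbers (not results from this notebook), the snippet below reports two companion metrics: the RMSE in log space, which is the same thing as the RMSLE of the raw durations, and the mean absolute error in seconds after mapping predictions back with `np.expm1`.

```python
import numpy as np

# Made-up durations in seconds, for illustration only
actual_seconds = np.array([400.0, 1100.0, 1635.0, 848.0])
predicted_seconds = np.array([450.0, 900.0, 1700.0, 800.0])

# Values as a model trained on log1p(trip_duration) would see and produce them
y_true_log = np.log1p(actual_seconds)
y_pred_log = np.log1p(predicted_seconds)

# RMSE on the log scale, i.e. the RMSLE of the raw durations
rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

# Map log-scale predictions back to seconds before computing an error in seconds
back_to_seconds = np.expm1(y_pred_log)
mae_seconds = np.mean(np.abs(actual_seconds - back_to_seconds))

print(rmse_log, mae_seconds)
```

In the notebook itself the same idea would apply to `y_test` and `y_pred_rf`, both of which are already on the log scale.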


In [36]:
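# Fit the scaler on the training split only so that test-set statistics do not leak into training.
# The scaled arrays are not used by the random forest below, which does not need feature scaling.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)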
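Scaling matters mainly for models such as `Ridge` or `Lasso`, which are imported at the top of the notebook but not fitted. One way to keep the scaler from seeing held-out rows is to put it inside a pipeline, so it is refitted on the training part of every cross-validation fold. The sketch below uses synthetic data; `X_demo`, `y_demo` and the choice of `Ridge(alpha=1.0)` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a known linear signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = X_demo @ np.array([2.0, 0.3, 0.01]) + rng.normal(scale=0.1, size=200)

# The scaler is fitted inside each fold, on that fold's training rows only
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
demo_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print(demo_scores.mean())
```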

In [37]:
# Small grid over forest size and tree depth, evaluated with 5-fold cross-validation
param_grid = {'n_estimators': [10, 20], 'max_depth': [1, 2]}
grid = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid.fit(X_train, y_train)
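After a grid search has been fitted, the winning configuration can be read from the search object. The sketch below runs the same grid on a small synthetic problem and inspects `best_params_`, `best_score_` and `best_estimator_`. The data (`X_demo`, `y_demo`), the fixed `random_state` and `cv=3` are assumptions chosen to keep the example fast and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1]

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [1, 2]},
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # best combination found on the grid
print(search.best_score_)             # mean cross-validated R² of that combination
best_model = search.best_estimator_   # refitted on all of X_demo with the best parameters
```

With depths limited to 1 and 2 the trees are very shallow, so a wider `max_depth` range (or `None`) is usually worth including when the search is meant to improve on the default forest.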

In [38]:
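# Nested cross-validation: each of the 2 outer folds reruns the whole grid search on its training part
scores = cross_val_score(grid, X_train, y_train, cv=2, scoring='r2')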

In [39]:
print(f"Average R² score: {np.mean(scores)}")

Average R² score: 0.5316173852630953
