In [40]:
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 45.4M  100 45.4M    0     0  3526k      0  0:00:13  0:00:13 --:--:-- 3738k


In [41]:
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 45.5M  100 45.5M    0     0  3491k      0  0:00:13  0:00:13 --:--:-- 3608k


In [42]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

**Q1. Downloading the data**

Read the data for January. How many columns are there?

In [43]:
# Read the parquet file
df_january = pd.read_parquet('yellow_tripdata_2023-01.parquet')

# Count the number of columns
num_columns = len(df_january.columns)
print(f"Number of columns: {num_columns}")

Number of columns: 19


**Q2. Computing duration**

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the standard deviation of the trip durations in January?
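
As a small illustration of the duration computation described above, here is a sketch on a toy DataFrame. The two trips are made-up values rather than rows from the taxi data; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy trips standing in for the taxi data (values are made up for illustration).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 10:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 00:12:30", "2023-01-01 11:00:00"]),
})

# Subtracting two datetime columns gives a Timedelta column;
# total_seconds() / 60 converts it to fractional minutes.
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

durations = trips["duration"].tolist()
print(durations)  # [12.5, 45.0]
```

Keeping the duration as a float in minutes, rather than as a Timedelta, lets `describe()` report the standard deviation in the same unit the question asks for.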

In [44]:
# Make sure both timestamp columns are datetimes
tpep_pickup_datetime = pd.to_datetime(df_january.tpep_pickup_datetime)
tpep_dropoff_datetime = pd.to_datetime(df_january.tpep_dropoff_datetime)
# Trip duration as a Timedelta, then converted to minutes
df_january['duration'] = tpep_dropoff_datetime - tpep_pickup_datetime
df_january['duration'] = df_january['duration'].dt.total_seconds() / 60

In [45]:
df_january['duration'].describe()

count    3.066766e+06
mean     1.566900e+01
std      4.259435e+01
min     -2.920000e+01
25%      7.116667e+00
50%      1.151667e+01
75%      1.830000e+01
max      1.002918e+04
Name: duration, dtype: float64

**Q3. Dropping outliers**

Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?
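
The inclusive 1 to 60 minute filter can also be sketched with a boolean mask. The durations below are made up, and `Series.between` is used here as one possible alternative to chained comparisons.

```python
import pandas as pd

# Made-up durations in minutes, including values outside the 1-60 range.
durations = pd.Series([-5.0, 0.5, 1.0, 12.0, 60.0, 61.0, 300.0, 25.0])

# Series.between is inclusive on both ends by default, matching ">= 1 and <= 60".
mask = durations.between(1, 60)

kept = durations[mask]
fraction_left = mask.mean()  # mean of a boolean mask = share of True values

print(len(kept), fraction_left)  # 4 0.5
```

Multiplying `fraction_left` by 100 gives the same quantity expressed as a percentage.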

In [46]:
valid_records = df_january[(df_january['duration'] >= 1) & (df_january['duration'] <= 60)]

In [47]:
percentage_of_records_left = (len(valid_records) / len(df_january)) * 100
percentage_of_records_left

98.1220282212598

**Q4. One-hot encoding**

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings; otherwise the vectorizer will treat them as numeric features instead of one-hot encoding them)
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?
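
To see why the cast to strings matters, here is a minimal sketch that fits `DictVectorizer` on the same toy records twice, once with string IDs and once with integer IDs. The location IDs are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up location IDs. String values are one-hot encoded,
# while numeric values are passed through as one numeric column per feature.
records_str = [
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "43", "DOLocationID": "141"},
    {"PULocationID": "161", "DOLocationID": "237"},
]
records_int = [{k: int(v) for k, v in r.items()} for r in records_str]

dv_str = DictVectorizer()
X_str = dv_str.fit_transform(records_str)

dv_int = DictVectorizer()
X_int = dv_int.fit_transform(records_int)

print(X_str.shape)  # (3, 4): 2 distinct pickup IDs + 2 distinct dropoff IDs
print(X_int.shape)  # (3, 2): one numeric column per feature
print(dv_str.feature_names_)
```

With string values, the number of columns equals the number of distinct `feature=value` pairs seen during fitting, which is the dimensionality the question asks about.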

In [48]:
feature_col = ['PULocationID', 'DOLocationID']
features = valid_records[feature_col].astype(str)
train_dict = features.to_dict(orient='records')

In [49]:
dv = DictVectorizer()
X_train = dv.fit_transform(train_dict)

In [50]:
X_train

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6018346 stored elements and shape (3009173, 515)>

In [51]:
target = 'duration'
y_train = valid_records[target]

**Q5. Training a model**

Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters, where duration is the response variable
- Calculate the RMSE of the model on the training data

What's the RMSE on train?
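
As a sanity check on what the RMSE measures, the following sketch fits a linear regression on made-up data that follows `y = 2*x + 1` exactly and computes the RMSE by hand with NumPy. The data and variable names are illustrative assumptions, not part of the homework.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2*x + 1 exactly, so a linear model can fit it perfectly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)

# RMSE written out by hand: square root of the mean squared residual.
rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
print(round(rmse, 6))  # 0.0
```

On real data the residuals are not zero, and the RMSE can be read as a typical prediction error in the units of the target, here minutes.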

In [52]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [53]:
y_pred = lr.predict(X_train)

In [54]:
rmse_train = root_mean_squared_error(y_true=y_train, y_pred=y_pred)
rmse_train

7.649261822035489

**Q6. Evaluating the model**

Now let's apply this model to the validation dataset (February 2023).

What's the RMSE on validation?
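
When applying the model to a new month, the vectorizer fitted on the training data is reused with `transform` rather than refitted, so both matrices share the same columns. A minimal sketch with made-up records, where the validation set contains a pickup ID that was never seen during fitting:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records: the validation month contains a pickup ID ("999")
# that the vectorizer never saw during fitting.
train_records = [
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "43", "DOLocationID": "237"},
]
valid_records_toy = [
    {"PULocationID": "999", "DOLocationID": "141"},
]

dv_toy = DictVectorizer()
X_tr = dv_toy.fit_transform(train_records)   # learns the column layout
X_va = dv_toy.transform(valid_records_toy)   # reuses it, no refitting

print(X_tr.shape[1] == X_va.shape[1])  # True
print(X_va.toarray())  # only the DOLocationID=141 column is set
```

Feature values that were not encountered during fitting are silently ignored by `transform`, so such rows simply get all zeros for that feature.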

In [55]:
# Load Parquet
df_validation = pd.read_parquet('yellow_tripdata_2023-02.parquet')

# Calculate duration
tpep_pickup_datetime = pd.to_datetime(df_validation.tpep_pickup_datetime)
tpep_dropoff_datetime = pd.to_datetime(df_validation.tpep_dropoff_datetime)
df_validation['duration'] = tpep_dropoff_datetime - tpep_pickup_datetime
df_validation['duration'] = df_validation['duration'].dt.total_seconds() / 60

# Remove Outliers
df_validation = df_validation[(df_validation['duration'] >= 1) & (df_validation['duration'] <= 60)]

# One hot encode features
valid_features = df_validation[feature_col].astype(str)
valid_dict = valid_features.to_dict(orient='records')
X_valid = dv.transform(valid_dict)
y_valid = df_validation[target]

In [56]:
y_pred = lr.predict(X_valid)
rmse_valid = root_mean_squared_error(y_true=y_valid, y_pred=y_pred)
rmse_valid

7.811821332387183

**HW1 COMPLETED**