# Data Cleaning & Feature Engineering

The goal of this notebook is to clean and transform the data into the panel format ready for analysis

In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt 

In [2]:
df = pd.read_csv('../data/trips.csv')
df.head()

Unnamed: 0,usertype,zip_code_start,borough_start,neighborhood_start,zip_code_end,borough_end,neighborhood_end,start_time,stop_time,day_mean_temperature,day_mean_wind_speed,day_total_precipitation,trip_minutes,avg_trip_minutes
0,Subscriber,10110,Manhattan,Chelsea and Clinton,10004,Manhattan,Lower Manhattan,2023-04-12 08:31:03.611,2023-04-12 09:09:15.819,51.5,3.8,0.0,40.0,38.2
1,Subscriber,11232,Brooklyn,Sunset Park,11215,Brooklyn,Northwest Brooklyn,2023-05-27 17:01:20.904,2023-05-27 17:13:43.114,67.2,5.8,0.09,10.0,12.366667
2,Subscriber,11106,Queens,Northwest Queens,11101,Queens,Northwest Queens,2023-05-02 13:19:52.243,2023-05-02 13:29:13.009,77.0,4.7,0.0,10.0,9.333333
3,Subscriber,10075,Manhattan,Upper East Side,10167,Manhattan,Gramercy Park and Murray Hill,2023-01-12 06:56:51.471,2023-01-12 07:07:25.899,54.5,5.7,0.0,10.0,10.566667
4,Subscriber,11102,Queens,Northwest Queens,11102,Queens,Northwest Queens,2023-05-24 10:41:44.260,2023-05-24 10:48:11.830,73.2,2.2,0.29,10.0,6.45


## 1. Data Cleaning

* Datatypes
* Missing Data
* Duplicates
* Outliers

In [3]:
# Summary statistics of the dataset
df.describe()

Unnamed: 0,zip_code_start,zip_code_end,day_mean_temperature,day_mean_wind_speed,day_total_precipitation,trip_minutes,avg_trip_minutes
count,5506273.0,5506273.0,5506273.0,5506273.0,5506273.0,5506273.0,5506273.0
mean,10242.5,10242.55,50.95836,4.494994,0.129057,16.7301,16.12945
std,456.0281,456.071,14.38971,2.076705,0.283195,440.9013,440.8894
min,10001.0,10001.0,9.7,1.0,0.0,0.0,1.016667
25%,10010.0,10010.0,39.8,2.7,0.0,10.0,5.75
50%,10018.0,10018.0,48.7,4.3,0.0,10.0,9.6
75%,10065.0,10065.0,63.0,5.7,0.1,20.0,16.65
max,11238.0,11238.0,80.9,11.5,1.68,325170.0,325167.5


In [4]:
# Check datatypes
df.dtypes

usertype                    object
zip_code_start               int64
borough_start               object
neighborhood_start          object
zip_code_end                 int64
borough_end                 object
neighborhood_end            object
start_time                  object
stop_time                   object
day_mean_temperature       float64
day_mean_wind_speed        float64
day_total_precipitation    float64
trip_minutes               float64
avg_trip_minutes           float64
dtype: object

In [5]:
# Check for missing values
df.isna().sum() 

usertype                   0
zip_code_start             0
borough_start              0
neighborhood_start         0
zip_code_end               0
borough_end                0
neighborhood_end           0
start_time                 0
stop_time                  0
day_mean_temperature       0
day_mean_wind_speed        0
day_total_precipitation    0
trip_minutes               0
avg_trip_minutes           0
dtype: int64

In [6]:
# Check for duplicates
df.duplicated().sum()

0

## 2. Feature Engineering

Some features already generated during the inital data pull.

Adding a few more features for further analysis

In [12]:
# Filter data for casual users since this is the focus of the analysis
df = df[df['usertype'] == 'Customer']

In [17]:
# Create holiday dummy variable based on US federal holidays 1 if not 0 
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2018-01-01', end='2018-12-31')
df['start_time'] = pd.to_datetime(df['start_time'])
df['holiday'] = df['start_time'].dt.normalize().isin(holidays).astype(int)

# Create day of week variable as numeric and string
df['day_of_week_num'] = df['start_time'].dt.dayofweek
df['day_of_week'] = df['start_time'].dt.day_name()  

# Encode neighborhood names as numeric ids
neighborhoods = df['neighborhood_start'].unique()
neighborhood_to_id = {name: idx for idx, name in enumerate(neighborhoods)}
df['neighborhood_id'] = df['neighborhood_start'].map(neighborhood_to_id)

In [32]:

# Set up panel data format with neighborhood and day of week as index
df_panel = df.groupby(['neighborhood_id', 'day_of_week_num']).agg({
    'avg_trip_minutes': 'mean',
    'day_mean_temperature': 'mean',
    'day_mean_wind_speed': 'mean',
    'day_total_precipitation': 'mean',
    'holiday': 'max'  # if any day in the group is a holiday, mark the whole group as holiday
}).reset_index()

df_panel.to_csv('../data/trips_panel.csv', index=False)
df_panel.head().sort_values(by=['neighborhood_id', 'day_of_week_num'])

  df_panel = df.groupby(['neighborhood_id', 'day_of_week_num']).agg({


Unnamed: 0,neighborhood_id,day_of_week_num,avg_trip_minutes,day_mean_temperature,day_mean_wind_speed,day_total_precipitation,holiday
0,7,0,216.141809,58.514334,2.779181,0.061809,0
1,7,1,33.855193,61.546618,3.564734,0.110531,0
2,7,2,642.319895,61.175787,3.779177,0.190872,0
3,7,3,554.48806,63.247974,4.384435,0.067463,0
4,7,4,255.974007,58.556244,4.472927,0.029724,0
