# Individual Assignment 2

##### You can work on this file directly and fill in your answers/code below. Please save as HTML or PDF and submit to Blackboard.


#### Student Name: Venkata Rithish Sai Reddy Yarasu [G46195694]


# Capital Bikeshare Problem

## Classification: More Pickups or More Dropoffs

In Part I of the group assignment, we developed prediction models for both pickups and drop-offs. If predicted pickups exceed predicted drop-offs, the operator should stock more bikes than open docks at the station. One straightforward way to make this prediction is to compare the predicted 'pu_ct' and 'do_ct' from the regression models directly. However, this approach gives no indication of how likely it is that 'pu_ct' will exceed 'do_ct'.

Alternatively, we can frame it as a classification problem by initially determining, within the dataset, whether 'pu_ct $\geq$ do_ct' for each day (here, we consider 'greater than or equal to'). Subsequently, the target variable transforms into True/False. We can then utilize logistic regression for the classification task. Moving forward, our aim is to construct a logistic regression model with features 'temp', 'precip', and 'windspeed', utilizing the same datasets from the group assignment.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import scale

In [3]:
#read bikeshare files
df_Feb = pd.read_csv('/Users/yvrit/Desktop/GW ClassRoom/Spring 24/ML1/grp assgmt/202302-captialbikeshare-tripdata.csv')
df_Mar = pd.read_csv('/Users/yvrit/Desktop/GW ClassRoom/Spring 24/ML1/grp assgmt/202303-capitalbikeshare-tripdata.csv')
df_Apr = pd.read_csv('/Users/yvrit/Desktop/GW ClassRoom/Spring 24/ML1/grp assgmt/202304-capitalbikeshare-tripdata.csv')
df_May = pd.read_csv('/Users/yvrit/Desktop/GW ClassRoom/Spring 24/ML1/grp assgmt/202305-capitalbikeshare-tripdata.csv')
df_Jun = pd.read_csv('/Users/yvrit/Desktop/GW ClassRoom/Spring 24/ML1/grp assgmt/202306-capitalbikeshare-tripdata.csv')

In [4]:
#concat the data
df_bike = pd.concat([df_Feb, df_Mar, df_Apr, df_May, df_Jun])

#transform time to date only
df_bike['started_at_date'] = pd.to_datetime(df_bike['started_at']).dt.date
df_bike['ended_at_date'] = pd.to_datetime(df_bike['ended_at']).dt.date

#weather data
df_weather = pd.read_csv('/Users/yvrit/Desktop/GW ClassRoom/Spring 24/ML1/ind 2/washington, dc 2023-01-01 to 2023-12-31.csv')

#datetime column to date format
df_weather['datetime'] = pd.to_datetime(df_weather['datetime']).dt.date

#filter pu_ct
pu_ct = df_bike[df_bike['start_station_name'] == "22nd & H St NW"].groupby('started_at_date').size().reset_index(name='pu_ct')

#filter do_ct
do_ct = df_bike[df_bike['end_station_name'] == "22nd & H St NW"].groupby('ended_at_date').size().reset_index(name='do_ct')


In [5]:
pu_ct

| | started_at_date | pu_ct |
|---|---|---|
| 0 | 2023-02-01 | 20 |
| 1 | 2023-02-02 | 26 |
| 2 | 2023-02-03 | 14 |
| 3 | 2023-02-04 | 12 |
| 4 | 2023-02-05 | 17 |
| ... | ... | ... |
| 145 | 2023-06-26 | 21 |
| 146 | 2023-06-27 | 20 |
| 147 | 2023-06-28 | 26 |
| 148 | 2023-06-29 | 32 |


In [6]:
do_ct

| | ended_at_date | do_ct |
|---|---|---|
| 0 | 2023-02-01 | 24 |
| 1 | 2023-02-02 | 28 |
| 2 | 2023-02-03 | 17 |
| 3 | 2023-02-04 | 13 |
| 4 | 2023-02-05 | 24 |
| ... | ... | ... |
| 145 | 2023-06-26 | 18 |
| 146 | 2023-06-27 | 21 |
| 147 | 2023-06-28 | 26 |
| 148 | 2023-06-29 | 43 |


In [7]:
# Merge on started_at_date and ended_at_date
df_bike_merge = pd.merge(pu_ct, do_ct, left_on='started_at_date', right_on='ended_at_date')

# Display the combined DataFrame
df_bike_merge

| | started_at_date | pu_ct | ended_at_date | do_ct |
|---|---|---|---|---|
| 0 | 2023-02-01 | 20 | 2023-02-01 | 24 |
| 1 | 2023-02-02 | 26 | 2023-02-02 | 28 |
| 2 | 2023-02-03 | 14 | 2023-02-03 | 17 |
| 3 | 2023-02-04 | 12 | 2023-02-04 | 13 |
| 4 | 2023-02-05 | 17 | 2023-02-05 | 24 |
| ... | ... | ... | ... | ... |
| 145 | 2023-06-26 | 21 | 2023-06-26 | 18 |
| 146 | 2023-06-27 | 20 | 2023-06-27 | 21 |
| 147 | 2023-06-28 | 26 | 2023-06-28 | 26 |
| 148 | 2023-06-29 | 32 | 2023-06-29 | 43 |
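
`pd.merge` defaults to an inner join, so a date that appears in only one of the two count tables would be dropped from the merged result. Below is a minimal sketch of how an outer join would keep such days with a zero count. It runs on made-up counts, and the names `pu`, `do` and `date` are illustrative, not taken from the assignment data.

```python
import pandas as pd

# Hypothetical daily counts: 2023-02-02 has pickups only, 2023-02-03 has drop-offs only.
pu = pd.DataFrame({"date": ["2023-02-01", "2023-02-02"], "pu_ct": [20, 26]})
do = pd.DataFrame({"date": ["2023-02-01", "2023-02-03"], "do_ct": [24, 5]})

# Default inner join: keeps only dates present in both tables.
inner = pd.merge(pu, do, on="date")

# Outer join: keeps every date and treats a missing count as zero trips.
outer = (
    pd.merge(pu, do, on="date", how="outer")
    .fillna({"pu_ct": 0, "do_ct": 0})
    .astype({"pu_ct": int, "do_ct": int})
    .sort_values("date")
    .reset_index(drop=True)
)

print(len(inner), "row(s) after inner join")
print(outer)
```

Whether a missing day should count as zero trips or be excluded is a modelling choice. The sketch only shows the mechanics.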


In [8]:
#creating a column which returns True if pu_ct >= do_ct
df_bike_merge['greater'] = df_bike_merge['pu_ct'] >= df_bike_merge['do_ct']

In [9]:
df_bike_merge

| | started_at_date | pu_ct | ended_at_date | do_ct | greater |
|---|---|---|---|---|---|
| 0 | 2023-02-01 | 20 | 2023-02-01 | 24 | False |
| 1 | 2023-02-02 | 26 | 2023-02-02 | 28 | False |
| 2 | 2023-02-03 | 14 | 2023-02-03 | 17 | False |
| 3 | 2023-02-04 | 12 | 2023-02-04 | 13 | False |
| 4 | 2023-02-05 | 17 | 2023-02-05 | 24 | False |
| ... | ... | ... | ... | ... | ... |
| 145 | 2023-06-26 | 21 | 2023-06-26 | 18 | True |
| 146 | 2023-06-27 | 20 | 2023-06-27 | 21 | False |
| 147 | 2023-06-28 | 26 | 2023-06-28 | 26 | True |
| 148 | 2023-06-29 | 32 | 2023-06-29 | 43 | False |


In [10]:
# Selecting specific columns
w = df_weather[['datetime', 'temp', 'precip', 'windspeed']]

# Display the DataFrame with selected columns
w

| | datetime | temp | precip | windspeed |
|---|---|---|---|---|
| 0 | 2023-01-01 | 51.8 | 0.004 | 8.8 |
| 1 | 2023-01-02 | 50.9 | 0.000 | 9.1 |
| 2 | 2023-01-03 | 59.3 | 0.000 | 17.7 |
| 3 | 2023-01-04 | 59.5 | 0.000 | 14.1 |
| 4 | 2023-01-05 | 56.4 | 0.000 | 12.6 |
| ... | ... | ... | ... | ... |
| 360 | 2023-12-27 | 50.4 | 1.046 | 10.0 |
| 361 | 2023-12-28 | 53.2 | 0.041 | 9.9 |
| 362 | 2023-12-29 | 50.1 | 0.000 | 16.1 |
| 363 | 2023-12-30 | 44.4 | 0.000 | 17.3 |


In [11]:
# Merge df_bike_merge with the selected weather columns (w) on 'started_at_date' and 'datetime'
df_bike_merge = df_bike_merge.merge(w, left_on='started_at_date', right_on='datetime')

# Drop redundant 'datetime' column
df_bike_merge.drop('datetime', axis=1, inplace=True)

# Display the merged DataFrame
df_bike_merge

| | started_at_date | pu_ct | ended_at_date | do_ct | greater | temp | precip | windspeed |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-02-01 | 20 | 2023-02-01 | 24 | False | 35.6 | 0.043 | 15.3 |
| 1 | 2023-02-02 | 26 | 2023-02-02 | 28 | False | 36.0 | 0.000 | 11.3 |
| 2 | 2023-02-03 | 14 | 2023-02-03 | 17 | False | 31.7 | 0.000 | 28.1 |
| 3 | 2023-02-04 | 12 | 2023-02-04 | 13 | False | 24.5 | 0.000 | 16.9 |
| 4 | 2023-02-05 | 17 | 2023-02-05 | 24 | False | 44.2 | 0.000 | 17.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 2023-06-26 | 21 | 2023-06-26 | 18 | True | 80.8 | 0.000 | 14.4 |
| 146 | 2023-06-27 | 20 | 2023-06-27 | 21 | False | 74.2 | 0.755 | 16.4 |
| 147 | 2023-06-28 | 26 | 2023-06-28 | 26 | True | 74.7 | 0.000 | 15.7 |
| 148 | 2023-06-29 | 32 | 2023-06-29 | 43 | False | 75.7 | 0.000 | 9.8 |


[10 marks for each question.]

[Note: you need to load the relevant packages from the lecture Python example.]
### Questions

#### (1)	Prepare the target variable (Y) and features (X).

In [12]:
X = df_bike_merge[['temp','precip', 'windspeed']]
y = df_bike_merge['greater']

#### (2) Use train_test_split function from sklearn package to randomly split the entire data into training data (e.g., 40%) and test set (e.g., 60%).



In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size = 0.60, random_state=20)
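
With a training set this small, a purely random split can leave the two classes in different proportions in the training and test sets. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The sketch below uses synthetic data, so `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 150 days, 3 features, 40% True labels (assumed numbers).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = np.array([True] * 60 + [False] * 90)

# Same 40/60 split as above, but stratified on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.60, random_state=20, stratify=y_demo
)

print("train size:", len(y_tr), "test size:", len(y_te))
print("share of True in train:", y_tr.mean())
print("share of True in test: ", y_te.mean())
```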

#### (3)	Train a logistic regression model (simply with the default setting), get the model coefficients, and write the decision boundary formula by replacing the corresponding $\beta$'s below:

$$
f(x)=\beta_0+\beta_1*temp+\beta_2*precip+\beta_3*windspeed
$$

Answer:

In [22]:
from sklearn.linear_model import LogisticRegression

log = LogisticRegression(random_state=20).fit(X_train, y_train)

#print coefficients and the intercept
print('The intercept beta_0 is ', log.intercept_, ' and the feature coefficients are ',log.coef_)

The intercept beta_0 is  [-1.98523324]  and the feature coefficients are  [[ 0.00749616 -0.57846369  0.08694727]]


$$
f(x)=-1.98523324+0.00749616*temp-0.57846369*precip+0.08694727*windspeed
$$

The model predicts True ('pu_ct $\geq$ do_ct') when $f(x)>0$, so the decision boundary is $f(x)=0$.
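
To make the formula concrete, the sketch below plugs the fitted coefficients reported above into $f(x)$ and converts the score to a probability with the logistic function. The weather values (60°F, no rain, wind speed 10) are a hypothetical day chosen for illustration, not a row from the dataset, and the helper names `score` and `probability` are made up.

```python
import math

# Coefficients copied from the fitted model output above.
beta_0 = -1.98523324
beta_temp, beta_precip, beta_wind = 0.00749616, -0.57846369, 0.08694727


def score(temp, precip, windspeed):
    """Linear score f(x) of the logistic regression model."""
    return beta_0 + beta_temp * temp + beta_precip * precip + beta_wind * windspeed


def probability(f):
    """Logistic function: maps a score f(x) to P(class True)."""
    return 1.0 / (1.0 + math.exp(-f))


# Hypothetical day (illustrative values only).
f = score(temp=60.0, precip=0.0, windspeed=10.0)
p = probability(f)

print("f(x) =", f)
print("P(pu_ct >= do_ct) =", p)
print("predicted class:", f > 0)
```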

#### (4)  Report both the in-sample and out-of-sample accuracy scores.

Answer:

In [23]:
#in-sample prediction performance/ accuracy using only training data
y_pred = log.predict(X_train)
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred)

0.65

In [24]:
#out of sample accuracy scores/ accuracy using test data
y_pred = log.predict(X_test)
accuracy_score(y_test, y_pred)

0.5111111111111111
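
An accuracy score is easier to judge next to a trivial baseline that always predicts the most frequent training class. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. It uses synthetic data (`X_demo` and `y_demo` are made up), so the numbers it prints say nothing about the bikeshare model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed): 150 rows, 3 features, roughly 40% True labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = rng.random(150) < 0.4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.60, random_state=20
)

# Baseline: always predict the majority class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))

print("baseline test accuracy:", baseline_acc)
print("logistic regression test accuracy:", model_acc)
```

If the model's test accuracy is not clearly above the baseline, the features carry little signal for the label at this sample size.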

#### (5)	Check the order of classes in the model and use sklearn function 'predict_proba(X)' to calculate the probability of 'more pickups than dropoffs' for the first instance in the test data.
Answer:

In [25]:
log.classes_

array([False,  True])

In [26]:
#class probabilities for the first test instance; columns follow the order of log.classes_
proba = log.predict_proba(X_test[[0]])
print(proba)

[[0.6662315 0.3337685]]


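Since the class order is [False, True], the second column is the probability of True. Probability of 'more pickups than dropoffs' (True, i.e. pu_ct $\geq$ do_ct) for the first test instance = 0.3337685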
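
Rather than hard-coding the column position, the index of the True class can be looked up from `classes_`. The sketch below does this on synthetic data, where `clf`, `X_demo` and `y_demo` are illustrative stand-ins for `log`, `X_test` and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumed): the label depends loosely on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0

clf = LogisticRegression().fit(X_demo, y_demo)

# Find which predict_proba column corresponds to the True class.
true_col = list(clf.classes_).index(True)

# Keep the 2-D shape by indexing with [[0]], as in the cell above.
proba = clf.predict_proba(X_demo[[0]])
p_true = proba[0, true_col]

print("classes:", clf.classes_)
print("column for True:", true_col)
print("P(True) for first instance:", p_true)
```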

#### (6)	Use 'decision_function(x)' to score each instance in the test data, print the score for the first instance in the test data, verify the probability of 'more pickups than dropoffs' for this instance using the score.
Answer:

In [27]:
f_x = log.decision_function(X_test)
print('Using decision function f(x) to score the 1st instance:', f_x[0])

# convert f(x) to class probability using the logistic function
print('The estimated probability:',  np.exp(f_x[0])/(1+np.exp(f_x[0])))

Using decision function f(x) to score the 1st instance: -0.6911895682401055
The estimated probability: 0.33376850013564113
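
The same check can be run for every instance at once. The sketch below, again on synthetic data with made-up names (`clf`, `X_demo`, `y_demo`), verifies that applying the logistic function to `decision_function` scores reproduces the True column of `predict_proba`, and that the scores equal the intercept plus the dot product of features and coefficients. It uses `scipy.special.expit`, which computes $1/(1+e^{-f})$ in a numerically stable way.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumed).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0

clf = LogisticRegression().fit(X_demo, y_demo)

# Scores f(x) for all instances, and the probabilities they imply.
scores = clf.decision_function(X_demo)
p_from_scores = expit(scores)

# Probability of the True class as reported by the model (second column).
p_from_model = clf.predict_proba(X_demo)[:, 1]

# The scores themselves are just intercept + X . coefficients.
manual_scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

print("probabilities match:", np.allclose(p_from_scores, p_from_model))
print("scores match:", np.allclose(manual_scores, scores))
```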
