Build a regression model.

In [42]:
import pandas as pd
import statsmodels.api as sm

# Retrieve the df_final_combined csv
df_final_combined = pd.read_csv('/Users/ruu/documents/LHL/Project-Statistical-Modelling/data/final_combined.csv',index_col=0)

# Independent variables (X) for the multiple linear regression:
X = df_final_combined.loc[:, ['num_shops', 'avg_rating', 'min_distance']]

# Dependent variable (y) for the multiple linear regression:
y = df_final_combined['num_bike_slots']

In [43]:
# Print and inspect the top 5 rows of X to confirm results
X.head()

   num_shops  avg_rating  min_distance
0         21    3.633333    209.920901
1         50    4.142000     50.119272
2         42    3.602381    105.914680
3         43    3.974419    135.167919
4          7    4.357143    448.380658


In [44]:
# Similarly for y values
y.head()

0    40
1    23
2    30
3    20
4    40
Name: num_bike_slots, dtype: int64

In [45]:
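# Add an intercept column ('const') to the predictors
X = sm.add_constant(X)
# Specify and fit the ordinary least squares model
lin_reg = sm.OLS(y, X)
model = lin_reg.fit()
print_model = model.summary()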

Provide model output and an interpretation of the results. 

In [46]:
print(print_model)

                            OLS Regression Results                            
Dep. Variable:         num_bike_slots   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.084
Method:                 Least Squares   F-statistic:                     4.472
Date:                Sat, 26 Jul 2025   Prob (F-statistic):            0.00528
Time:                        14:14:21   Log-Likelihood:                -388.25
No. Observations:                 115   AIC:                             784.5
Df Residuals:                     111   BIC:                             795.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           40.6223     13.153      3.088   

Of the three independent variables, only the number of nearby coffee shops (num_shops) has a statistically significant relationship with the number of bike slots at a station (p-value below 0.05). Its coefficient is -0.1384: holding the other variables constant, each additional coffee shop within the specified radius is associated with a predicted decrease of about 0.14 bike slots. In contrast, avg_rating (coefficient = -1.1774, p = 0.729) and min_distance (coefficient = 0.0021, p = 0.806) are not statistically significant, since their p-values are above 0.05.

Overall, the model explains about 10.8% of the variation in bike slots (R-squared = 0.108; adjusted R-squared = 0.084), which suggests that other unmeasured factors are likely influencing bike slot counts. The model as a whole is statistically significant (Prob (F-statistic) = 0.00528), but the low R-squared means it is unlikely to be reliable for predicting bike slots.
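As a quick sanity check on the fit statistics, the adjusted R-squared can be recomputed by hand from the values printed in the summary. This is only a sketch using the rounded R-squared, so it reproduces the reported figure to three decimals rather than exactly; the variable names are illustrative.

```python
# Values copied from the OLS summary above
r_squared = 0.108   # R-squared (rounded in the summary)
n_obs = 115         # No. Observations
n_predictors = 3    # Df Model

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))  # 0.084
```

The penalty term (n - 1) / (n - k - 1) is what pulls the adjusted value below the raw R-squared as more predictors are added.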

# Stretch

How can you turn the regression model into a classification model?

The dependent variable (num_bike_slots) can be converted into a categorical variable. For example, we can bin the number of bike slots at each station as follows:
    a. 0-15 slots are low (0)
    b. 16-30 slots are medium (1)
    c. 31-45 slots are high (2)

We can then apply multinomial logistic regression to model the relationship between the independent variables and the bike slot category. Because the categories are ordered (low < medium < high), an ordinal logistic regression is another option.
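The binning step described above could be sketched with `pd.cut`, shown here on the five `num_bike_slots` values from the `y.head()` output. The helper `fit_multinomial` is hypothetical (not part of the original notebook) and assumes statsmodels' `MNLogit` is an acceptable choice of multinomial model; it is defined but not run here, since five rows are not enough to fit it.

```python
import pandas as pd

# First five num_bike_slots values, copied from the y.head() output above
y = pd.Series([40, 23, 30, 20, 40], name="num_bike_slots")

# Bin edges for the low (0) / medium (1) / high (2) categories listed above.
# include_lowest=True makes the first bin [0, 15] instead of (0, 15].
y_class = pd.cut(
    y, bins=[0, 15, 30, 45], labels=[0, 1, 2], include_lowest=True
).astype(int)

print(y_class.tolist())  # [2, 1, 1, 1, 2]


def fit_multinomial(X, y_class):
    """Hypothetical helper: fit a multinomial logit of the slot category on X."""
    # Imported lazily so the binning step above runs without statsmodels
    import statsmodels.api as sm

    return sm.MNLogit(y_class, sm.add_constant(X)).fit()
```

On the full dataset, the same `pd.cut` call would be applied to `df_final_combined['num_bike_slots']` before passing the result and the three predictor columns to `fit_multinomial`. Any station with more than 45 slots would fall outside the bins and become NaN, so the upper edge may need adjusting to the data.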
