Build a regression model.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

Provide model output and an interpretation of the results. 

In [2]:
#Importing data from CSV
bikes_venues_df = pd.read_csv("bikes_venues_df.csv")
bikes_venues_df

Unnamed: 0,distance_station_fsq,name,category_name_fsq,rating_yelp,free_bikes,empty_slots,station_name
0,97,Marulilu Cafe,café,4.0,1,32,10th & Cambie
1,197,Hokkaido Ramen Santouka,restaurant,4.2,1,32,10th & Cambie
2,993,The Cascade Room,restaurant,3.9,1,32,10th & Cambie
3,936,Fable Diner,restaurant,3.9,1,32,10th & Cambie
4,142,Sushi California,restaurant,3.6,1,32,10th & Cambie
...,...,...,...,...,...,...,...
2794,569,Le Crocodile Restaurant,restaurant,4.3,4,14,Bute & Davie
2795,614,CinCin Ristorante + Bar,restaurant,4.1,4,14,Bute & Davie
2796,707,Joe Fortes Seafood & Chop House,restaurant,4.0,4,14,Bute & Davie
2797,927,Ancora Waterfront Dining and Patio,restaurant,4.1,4,14,Bute & Davie


In [3]:
#Removing columns that do not contribute to the model.
#The goal is to check whether the number of free bikes can be predicted solely from the distance between the bike station and the venue
van_bikes_model = bikes_venues_df.drop(columns = ["name","category_name_fsq", "station_name", "empty_slots", "rating_yelp"])
van_bikes_model.head()

Unnamed: 0,distance_station_fsq,free_bikes
0,97,1
1,197,1
2,993,1
3,936,1
4,142,1


In [4]:
van_bikes_model.describe()

Unnamed: 0,distance_station_fsq,free_bikes
count,2799.0,2799.0
mean,560.631654,7.3194
std,264.465222,5.349289
min,12.0,0.0
25%,346.5,3.0
50%,565.0,6.0
75%,789.0,10.0
max,1194.0,27.0


In [5]:
#Creating model
y = van_bikes_model["free_bikes"]
X = van_bikes_model.drop("free_bikes", axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     4.446
Date:                Sun, 11 Aug 2024   Prob (F-statistic):             0.0351
Time:                        22:22:35   Log-Likelihood:                -8662.7
No. Observations:                2799   AIC:                         1.733e+04
Df Residuals:                    2797   BIC:                         1.734e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    6.8677 

R-squared = 0.002:

The R-squared value of 0.002 indicates that the linear regression model above explains only about 0.2% of the variance in the dependent variable (number of available bikes) using the independent variable (distance between the bike station and the venue). In other words, almost all of the variation in the number of available bikes is left unexplained by distance in this model.
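
As a rough cross-check, with a single predictor R-squared can be recovered from the F-statistic and the residual degrees of freedom printed in the summary above. The sketch below uses only numbers copied from that output; the variable names are my own and are not part of the notebook.

```python
# Values copied from the OLS summary above
f_stat = 4.446    # F-statistic
df_model = 1      # Df Model
df_resid = 2797   # Df Residuals
n_obs = 2799      # No. Observations

# For OLS with an intercept: R^2 = F * df_model / (F * df_model + df_resid)
r_squared = (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R^2 penalises R^2 for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"R-squared: {r_squared:.3f}")
print(f"Adj. R-squared: {adj_r_squared:.3f}")
```

Both values round to the figures reported in the summary (0.002 and 0.001), so the reported F-statistic and R-squared are consistent with each other.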

P-value = 0.035:

The p-value of 0.035 is below the conventional 0.05 threshold, so the relationship between distance and the number of available bikes is statistically significant at the 5% level. In other words, if there were no actual linear relationship between the variables, a relationship at least as strong as the one observed would be expected to arise by chance only about 3.5% of the time. Despite the low R-squared value, the p-value suggests that the relationship, though weak, is unlikely to be purely the result of random chance.
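
To see where a p-value of this size comes from, note that with one predictor the slope's |t| statistic is the square root of the F-statistic. The sketch below approximates the two-sided p-value with a standard normal distribution instead of the exact t distribution, which is an assumption I am making for simplicity; it should be reasonable here because there are 2797 residual degrees of freedom.

```python
import math

f_stat = 4.446  # F-statistic from the OLS summary above

# With a single predictor, |t| for the slope equals sqrt(F)
t_stat = math.sqrt(f_stat)

# Two-sided p-value under a standard normal approximation:
# P(|Z| > t) = erfc(t / sqrt(2))
p_approx = math.erfc(t_stat / math.sqrt(2))

print(f"|t| = {t_stat:.3f}, approximate p-value = {p_approx:.3f}")
```

The approximation lands very close to the reported Prob (F-statistic) of 0.0351.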

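Coefficient = 0.0008:

The coefficient of 0.0008 indicates that each one-unit increase in the distance between the bike station and the venue is associated with an average increase of 0.0008 in the number of available bikes. Even a 1,000-unit increase in distance corresponds to only about 0.8 more bikes, so distance has a minimal impact on the number of available bikes.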
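
To put the size of this effect in context, the sketch below plugs the smallest and largest observed distances (from the `describe()` output) into the fitted line. It uses the intercept printed in the summary and the slope as quoted in the text, which is rounded, so the results are approximate; the helper `predict_free_bikes` is a hypothetical name, not part of the notebook.

```python
intercept = 6.8677            # const from the OLS summary
slope = 0.0008                # distance coefficient as quoted in the text (rounded)
d_min, d_max = 12.0, 1194.0   # min and max of distance_station_fsq from describe()

def predict_free_bikes(distance):
    """Predicted number of free bikes at a given station-to-venue distance."""
    return intercept + slope * distance

low = predict_free_bikes(d_min)
high = predict_free_bikes(d_max)

print(f"Predicted free bikes at min distance: {low:.2f}")
print(f"Predicted free bikes at max distance: {high:.2f}")
print(f"Difference across the observed range: {high - low:.2f}")
```

Across the entire observed range of distances, the predicted difference is under one bike, which is small next to the standard deviation of `free_bikes` (about 5.35).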

Conclusion:
While the model finds a statistically significant relationship between the station-to-venue distance and the number of available bikes, the effect is so small that it is unlikely to be practically meaningful. The low R-squared indicates that distance alone is not a good predictor of available bikes, and other factors should be explored to better explain the variation in the number of available bikes.

# Stretch

How can you turn the regression model into a classification model?