Build a regression model.

In [6]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

Provide model output and an interpretation of the results. 

In [7]:
#Importing data from CSV
bikes_venues_df = pd.read_csv("bikes_venues_df.csv")
bikes_venues_df

Unnamed: 0,name,rating,distance_station,free_bikes,empty_slots,station_name
0,Saku,4.3,178.845344,1,32,10th & Cambie
1,La Taqueria Pinche Taco Shop,4.2,170.590723,1,32,10th & Cambie
2,Hokkaido Ramen Santouka,4.2,191.044234,1,32,10th & Cambie
3,Uma Sushi,4.3,152.157897,1,32,10th & Cambie
4,Seaport City Seafood Restaurant,4.2,168.615262,1,32,10th & Cambie
...,...,...,...,...,...,...
1420,Freshii,2.8,70.443227,11,15,Wesbrook Village - Berton & Shrum
1421,The Portside Pub,3.1,12.791033,15,10,Maple Tree Square
1422,The Greek,3.6,70.240358,15,10,Maple Tree Square
1423,Thida Thai Resturant,3.3,31.257034,4,14,Bute & Davie


In [10]:
# Keep only the columns needed for the model. The goal is to check whether
# the number of free bikes can be predicted solely from the distance between
# the bike station and the venue.
van_bikes_model = bikes_venues_df.drop(columns=["name", "station_name", "empty_slots", "rating"])
van_bikes_model.head()

Unnamed: 0,distance_station,free_bikes
0,178.845344,1
1,170.590723,1
2,191.044234,1
3,152.157897,1
4,168.615262,1


In [11]:
van_bikes_model.describe()

Unnamed: 0,distance_station,free_bikes
count,1425.0,1425.0
mean,510.821667,7.694035
std,371.2381,5.240199
min,5.02302,0.0
25%,188.30599,3.0
50%,398.930139,8.0
75%,861.928415,10.0
max,1389.662545,25.0


In [12]:
# Create the model: free_bikes is the response, distance_station the predictor
y = van_bikes_model["free_bikes"]
X = van_bikes_model.drop("free_bikes", axis=1)
X = sm.add_constant(X)  # adds a column of 1s so the model includes an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     19.27
Date:                Sun, 11 Aug 2024   Prob (F-statistic):           1.22e-05
Time:                        23:58:10   Log-Likelihood:                -4372.2
No. Observations:                1425   AIC:                             8748.
Df Residuals:                    1423   BIC:                             8759.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                6.8606      0.235  

The R-squared value of 0.013 means the linear regression model explains only 1.3% of the variance in the dependent variable (number of available bikes) using the independent variable (distance between the venue and the bike station). In other words, almost all of the variation in the number of available bikes is left unexplained by distance in this model.
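As a minimal sketch of what R-squared measures, the block below fits a one-predictor line to synthetic data and computes R-squared as 1 - SS_res / SS_tot. The data, seed, and variable names are assumptions made for illustration; they are not the bike and venue data. With a single predictor, R-squared also equals the squared correlation between x and y.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the bike/venue data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1400, size=1000)               # hypothetical distances
y = 7 + 0.0016 * x + rng.normal(0, 5, size=1000)  # weak linear trend plus noise

# Least-squares line: np.polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
print(f"R-squared: {r_squared:.4f}, r^2: {r ** 2:.4f}")
```

A weak slope buried in large noise gives a small ratio of explained to total variation, which is the situation the summary above describes.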

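The p-value for the distance coefficient is reported as 0.000, which means it rounds to zero at three decimal places, not that it is exactly zero. The overall F-test, which is equivalent to the coefficient's t-test in a one-predictor model, gives Prob (F-statistic) = 1.22e-05. The relationship between distance and the number of available bikes is therefore statistically significant: a slope this far from zero would be very unlikely if there were no linear relationship. Significance and explanatory power are separate questions, however. With 1,425 observations, even a very weak relationship can be detected reliably. One caveat: every venue near the same station carries that station's free_bikes value, so the rows are not independent and the nonrobust standard errors may overstate the significance.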
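To see how a tiny R-squared and a small p-value can coexist, the arithmetic below uses only figures printed in the summary above (F-statistic, Df Model, Df Residuals). It relies on the standard identity linking F and R-squared for OLS with an intercept; the variable names are my own.

```python
import math

# Values copied from the OLS summary above.
f_stat = 19.27
df_model = 1
df_resid = 1423

# For OLS with an intercept: F = (R^2 / df_model) / ((1 - R^2) / df_resid),
# which rearranges to the expression below.
r_squared = (f_stat * df_model) / (f_stat * df_model + df_resid)

# With one predictor, the slope's t-statistic has magnitude sqrt(F).
t_abs = math.sqrt(f_stat)

print(f"R-squared implied by F: {r_squared:.4f}")
print(f"|t| for the slope: {t_abs:.2f}")
```

The implied R-squared rounds to 0.013, matching the reported value. With 1,423 residual degrees of freedom, even a trend this weak produces a large test statistic.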

The coefficient of 0.0016 means that for every one-unit increase in distance_station, the predicted number of available bikes increases by 0.0016 on average. Distance is therefore associated with only a minimal change in the number of available bikes.
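To put the size of the coefficient in context, the sketch below plugs the reported intercept (6.8606) and slope (0.0016, rounded) into the fitted line at the 25th and 75th percentile distances from the describe() output. The function name is hypothetical, and the results are approximate because the slope is rounded.

```python
# Coefficients as reported above (the slope is rounded to four decimals).
intercept = 6.8606
slope = 0.0016

# Quartiles of distance_station from the describe() output.
dist_q1 = 188.30599
dist_q3 = 861.928415


def predict_free_bikes(distance):
    """Return the predicted number of free bikes from the fitted line."""
    return intercept + slope * distance


pred_q1 = predict_free_bikes(dist_q1)
pred_q3 = predict_free_bikes(dist_q3)
print(f"Predicted at 25th percentile distance: {pred_q1:.2f}")
print(f"Predicted at 75th percentile distance: {pred_q3:.2f}")
print(f"Difference across the interquartile range: {pred_q3 - pred_q1:.2f}")
```

Moving across the whole interquartile range of distance changes the prediction by roughly one bike, which is small next to the standard deviation of free_bikes (5.24).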

Conclusion:
While the model finds a statistically significant relationship between distance from the bike station and the number of available bikes, the effect is so small that it may not be meaningful. The low R-squared indicates that distance alone is not a good predictor of available bikes, and other factors could be explored to better explain the variation in the number of available bikes.

# Stretch

How can you turn the regression model into a classification model?