Build a regression model.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [13]:
data = pd.read_csv('../data/Bike_Foursquare_Yelp.csv')
data

Unnamed: 0,free_bikes,name,ll,avg_rating,avg_review,avg_distance,no_of_bars_yelp
0,21,10th & Cambie,"49.262487,-123.114397",3.73,100.29,891.97,28
1,9,Yaletown-Roundhouse Station,"49.274566,-123.121817",3.70,98.74,482.66,50
2,13,Dunsmuir & Beatty,"49.279764,-123.110154",3.88,152.90,569.80,50
3,2,12th & Yukon (City Hall),"49.260599,-123.113504",3.78,81.73,902.70,30
4,9,8th & Ash,"49.264215,-123.117772",3.45,138.09,640.18,11
...,...,...,...,...,...,...,...
236,11,Heather & 29th,"49.245535,-123.120496",4.00,1.00,942.16,1
237,16,Cardero & Robson,"49.289255,-123.132677",3.62,146.94,729.72,50
238,0,Commercial & 20th,"49.252656,-123.067965",0.00,0.00,0.00,0
239,2,Hornby & Drake,"49.277527,-123.129464",3.60,129.14,537.85,50


In [9]:
# X: independent variables; y: dependent variable

X = data[['avg_rating', 'avg_review', 'avg_distance','no_of_bars_yelp']]
y = data['free_bikes']

In [10]:
X = sm.add_constant(X)

In [11]:
model = sm.OLS(y, X).fit()

Provide model output and an interpretation of the results. 

In [12]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.033
Method:                 Least Squares   F-statistic:                     3.072
Date:                Sun, 26 Feb 2023   Prob (F-statistic):             0.0171
Time:                        21:06:13   Log-Likelihood:                -740.42
No. Observations:                 241   AIC:                             1491.
Df Residuals:                     236   BIC:                             1508.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               7.0258      2.285     

The R-squared value is 0.049, indicating that only 4.9% of the variation in "free_bikes" is explained by the independent variables. The adjusted R-squared, which penalizes the number of predictors, is lower still at 0.033, so the model is a poor fit for the data.

The F-statistic is 3.072 with an associated p-value of 0.0171, so at the 5% level we reject the hypothesis that all slope coefficients are zero: at least one independent variable is significantly related to "free_bikes".

The coefficients show the direction and magnitude of the relationship between each independent variable and "free_bikes", holding the other variables constant. For example, "avg_review" has a positive coefficient of 0.0288, indicating that a one-unit increase in "avg_review" is associated with about 0.029 more free bikes. On the other hand, "avg_rating" has a negative coefficient of -0.4218, suggesting that a one-point higher average rating is associated with about 0.42 fewer free bikes.

The p-values for each independent variable show whether it is statistically significant in predicting "free_bikes". For example, "avg_review" has a p-value of 0.016, indicating that it is statistically significant, while "no_of_bars_yelp" has a p-value of 0.993, suggesting that it is not statistically significant in predicting "free_bikes".

In [19]:
# X: independent variables; y: dependent variable

X = data[['avg_rating', 'avg_review', 'avg_distance']]
y = data['free_bikes']

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.037
Method:                 Least Squares   F-statistic:                     4.113
Date:                Sun, 26 Feb 2023   Prob (F-statistic):            0.00720
Time:                        22:49:55   Log-Likelihood:                -740.42
No. Observations:                 241   AIC:                             1489.
Df Residuals:                     237   BIC:                             1503.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            7.0264      2.279      3.083   

In [20]:
# X: independent variables; y: dependent variable

X = data[['avg_rating', 'avg_review']]
y = data['free_bikes']

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     6.142
Date:                Sun, 26 Feb 2023   Prob (F-statistic):            0.00251
Time:                        22:50:42   Log-Likelihood:                -740.47
No. Observations:                 241   AIC:                             1487.
Df Residuals:                     238   BIC:                             1497.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0772      2.269      3.119      0.0

In [21]:
# X: independent variables; y: dependent variable

X = data[['avg_review']]
y = data['free_bikes']

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.048
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     12.13
Date:                Sun, 26 Feb 2023   Prob (F-statistic):           0.000591
Time:                        22:51:31   Log-Likelihood:                -740.57
No. Observations:                 241   AIC:                             1485.
Df Residuals:                     239   BIC:                             1492.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.1341      0.821      7.473      0.0

This is the output of a simple linear regression model where free_bikes is the dependent variable and avg_review is the independent variable.

The coefficient of avg_review is 0.0271, which indicates that, on average, each one-unit increase in avg_review is associated with 0.0271 more free_bikes at a bike-sharing station. The p-value for avg_review is 0.001, which is less than the commonly used significance level of 0.05, indicating that the relationship between avg_review and free_bikes is statistically significant.

The R-squared value of 0.048 indicates that only 4.8% of the variance in free_bikes is explained by avg_review. Dropping the other three predictors barely changed R-squared (0.049 to 0.048) while the adjusted R-squared rose from 0.033 to 0.044, which suggests that the extra variables add little explanatory power.
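As a quick check on how adjusted R-squared relates to R-squared, the standard formula can be applied to the values printed in the four summaries above (241 observations, with 4, 3, 2, and 1 predictors). The function name `adjusted_r2` is just an illustrative helper, and the inputs are the rounded R-squared values from the output.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (number of predictors, reported R-squared) for each fitted model
fits = [(4, 0.049), (3, 0.049), (2, 0.049), (1, 0.048)]
results = [round(adjusted_r2(r2, 241, k), 3) for k, r2 in fits]
print(results)  # [0.033, 0.037, 0.041, 0.044]
```

The results match the reported adjusted values: with R-squared essentially flat, each predictor removed reduces the penalty and adjusted R-squared goes up.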

The coefficient of const is 6.1341, which represents the intercept of the regression line. It can be interpreted as the expected value of free_bikes when avg_review is zero.
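To make the fitted line concrete, the two reported coefficients can be plugged into the regression equation free_bikes = 6.1341 + 0.0271 * avg_review. The function name `predict_free_bikes` and the example inputs are illustrative assumptions; only the two coefficients come from the output above.

```python
INTERCEPT = 6.1341  # reported coefficient of const
SLOPE = 0.0271      # reported coefficient of avg_review


def predict_free_bikes(avg_review):
    """Point prediction from the fitted simple regression line."""
    return INTERCEPT + SLOPE * avg_review


for value in (0, 50, 100):
    print(value, round(predict_free_bikes(value), 2))
# 0 6.13
# 50 7.49
# 100 8.84
```

Given the low R-squared, these point predictions carry wide uncertainty and are best read as a rough average trend.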

Finally, the Omnibus and Jarque-Bera tests check whether the residuals are normally distributed, and the Durbin-Watson statistic checks for autocorrelation among the residuals. A significant p-value in the Omnibus or Jarque-Bera test indicates that the residuals may not be normally distributed, while a Durbin-Watson value close to 2 suggests that there is no significant autocorrelation among the residuals.
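To show what the Durbin-Watson statistic measures, here is a small pure-Python sketch of its formula: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The helper name `durbin_watson_stat` and the toy residual lists are hypothetical; in practice `statsmodels.stats.stattools.durbin_watson` computes the same quantity on `model.resid`.

```python
def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residuals (hypothetical): alternating signs vs. long runs of one sign
alternating = [1, -1, 1, -1]
runs = [1, 1, 1, -1, -1, -1]

print(durbin_watson_stat(alternating))       # 3.0, leaning toward negative autocorrelation
print(round(durbin_watson_stat(runs), 3))    # 0.667, leaning toward positive autocorrelation
```

Values range from 0 to 4, with values well below 2 pointing to positive autocorrelation and values well above 2 pointing to negative autocorrelation.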

# Stretch

How can you turn the regression model into a classification model?