Build a regression model.

In [1]:
# import
import statsmodels.api as sm
import pandas as pd
import numpy as np
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy

Provide model output and an interpretation of the results. 

In [2]:
# create connection
conn = sqlite3.connect('../data/mtl_bike_rentals.sqlite')

In [None]:
# Select data for first model

yelp_model_all_review = pd.read_sql(
    """
    SELECT
        s.station_name AS name,
        s.station_id AS station,
        AVG(s.total_bikes) AS total_bikes,
        AVG(pd.to_station_m) AS avg_distance_to_station_m,
        AVG(review_count) AS avg_review,
        AVG(p.rating) AS avg_rating,
        AVG(p.price) AS avg_price,
        COUNT(*) AS number_poi
    FROM
        stations s
    JOIN poi_detail pd USING(station_id)
    JOIN pois p USING(poi_id)
    JOIN api a USING(poi_id)
    WHERE
        api_name = 'Yelp'
        AND total_bikes BETWEEN 10 AND 35
        AND rating > 6
    GROUP BY 1, 2
    HAVING
        avg_review IS NOT NULL
    ORDER BY
        station;
    """, conn
)

print(yelp_model_all_review)

                                              name  station  total_bikes  \
0                            Union / Ste-Catherine       26         35.0   
1                            University / des Pins       27         35.0   
2                                 Dorion / Ontario       28         35.0   
3                   de la Montagne / Ste-Catherine       29         35.0   
4                          Larivière / de Lorimier       30         35.0   
..                                             ...      ...          ...   
759                       de Chateaubriand / Jarry      789         11.0   
760                           Cégep Marie-Victorin      790         11.0   
761                         Ste-Famille / des Pins      791         10.0   
762                            Waverly / Van Horne      792         10.0   
763  Gare d'autocars de Montréal (Berri / Ontario)      793         10.0   

     avg_distance_to_station_m  avg_review  avg_rating  avg_price  number_poi  
0                   401.842105  265.947368    8.105263   2.421053          19  
1                   771.000000  530.777778    8.500000   2.444444           9  
2                   539.062500  113.312500    8.242188   2.250000          16  
3                   254.650000  292.850000    8.031250   2.350000          20  
4                   618.187500   68.125000    7.960938   1.875000          16  
..                         ...         ...         ...        ...         ...  
759                 352.125000   31.812500    8.171875   1.750000          16  
760                 764.428571    4.285714    8.232143   2.000000           7  
761                 623.000000  427.833333    8.375000   2.222222          18  
762                 534.705882  155.705882    8.411765   2.294118          17  
763                 404.157895  240.157895    8.342105   2.368421          19  

In [5]:
# Build the response (y) and the design matrix (X) with all variables plus an intercept
y = yelp_model_all_review['total_bikes']
X = yelp_model_all_review.drop(['total_bikes', 'name', 'station'], axis=1)
X = sm.add_constant(X)
X

[Output: DataFrame X with columns const, avg_distance_to_station_m, avg_review, avg_rating, avg_price, number_poi]


In [None]:
# Run model and output summary
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:            total_bikes   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.102
Method:                 Least Squares   F-statistic:                     18.27
Date:                Sun, 19 NOV 2023   Prob (F-statistic):           3.75e-17
Time:                        22:52:18   Log-Likelihood:                -2309.6
No. Observations:                 764   AIC:                             4631.
Df Residuals:                     758   BIC:                             4659.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                        26.8686      5.110      5.258      0.000      16.837      36.900
avg_distance_to_station_m    -0.0018      0.001     -1.427      0.154      -0.004       0.001
avg_review                    0.0057      0.002      3.249      0.001       0.002       0.009
avg_rating                   -1.6184      0.529     -3.058      0.002      -2.657      -0.580
avg_price                     2.7360      0.954      2.868      0.004       0.864       4.608
number_poi                    0.0916      0.047      1.966      0.050       0.000       0.183
==============================================================================
Omnibus:                       29.935   Durbin-Watson:                   0.213
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.648
Skew:                           0.505   Prob(JB):                     8.14e-08
Kurtosis:                       3.067   Cond. No.                     1.63e+04
==============================================================================

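The standard errors assume that the covariance matrix of the errors is correctly specified.
The condition number is large (1.63e+04), which might indicate strong multicollinearity or other numerical problems. The predictors are also on very different scales (for example, avg_review is in the hundreds while avg_price is around 2), which by itself can inflate the condition number.

Interpretation: the model explains only about 11% of the variance in total_bikes (R-squared = 0.108), although the F-test shows the predictors are jointly significant (p = 3.75e-17). avg_review, avg_rating, and avg_price are significant at the 5% level; number_poi is borderline (p = 0.050) and avg_distance_to_station_m is not significant (p = 0.154). The Durbin-Watson statistic of 0.213 is far below 2, which points to positive autocorrelation in the residuals, so the reported standard errors should be read with caution.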
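
One way to follow up on the condition-number warning is to check whether it comes from correlated predictors or simply from differing scales. The sketch below is a minimal illustration on synthetic data, not the notebook's data: the column values, the helper names `with_const` and `vif`, and the built-in correlation between rating and price are all assumptions made for the example. It compares the condition number before and after standardizing the predictors and computes variance inflation factors (VIF) by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the predictors (assumed values, NOT the notebook's data)
avg_distance = rng.normal(500, 150, n)
avg_review = rng.normal(200, 100, n)
avg_rating = rng.normal(8.2, 0.3, n)
# Price is deliberately built to correlate with rating
avg_price = 2.0 + 0.8 * (avg_rating - 8.2) + rng.normal(0, 0.2, n)

names = ["avg_distance", "avg_review", "avg_rating", "avg_price"]
X = np.column_stack([avg_distance, avg_review, avg_rating, avg_price])


def with_const(M):
    """Prepend a column of ones (hypothetical helper, like sm.add_constant)."""
    return np.column_stack([np.ones(len(M)), M])


def vif(M, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = M[:, j]
    others = with_const(np.delete(M, j, axis=1))
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)


# Standardize each predictor to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = np.linalg.cond(with_const(X))
cond_std = np.linalg.cond(with_const(X_std))
vifs = {name: vif(X, j) for j, name in enumerate(names)}

print(f"condition number, raw columns:          {cond_raw:.1f}")
print(f"condition number, standardized columns: {cond_std:.1f}")
for name, value in vifs.items():
    print(f"VIF {name}: {value:.2f}")
```

If the condition number drops sharply after standardizing while the VIFs stay small, the warning is mostly a scaling issue; VIFs well above 1 for specific columns point to genuine collinearity between those predictors.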

# Stretch

How can you turn the regression model into a classification model?

Conduct a survey to collect data on whether customers believe there are enough bikes at a station, then encode the answers as a binary label: 0 for "Not enough bikes" and 1 for "Enough bikes".
I would then fit a classification model (for example, logistic regression) on the points-of-interest features to see how they relate to the customers' answers. We could use this model to predict whether an existing station has enough bikes, or to estimate how many bikes a new station should start with.


