Build a regression model.

In [5]:
import sqlite3
import pandas as pd
import statsmodels.api as sm

# 1) Load the joined station–POI table from SQLite
conn = sqlite3.connect("data/bike_poi_data.db")
df = pd.read_sql_query("SELECT * FROM station_poi", conn)
conn.close()

# 2) Define predictor(s) and response
X = df[["poi_count"]]
y = df["free_bikes"]

# 3) Add a constant term so we fit an intercept
X = sm.add_constant(X)

# 4) Fit an OLS regression
model = sm.OLS(y, X).fit()

# 5) Display the full regression results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9375
Date:                Thu, 15 May 2025   Prob (F-statistic):              0.334
Time:                        22:21:11   Log-Likelihood:                -810.44
No. Observations:                 264   AIC:                             1625.
Df Residuals:                     262   BIC:                             1632.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         15.8224      8.515      1.858      0.0

Provide model output and an interpretation of the results. 

In [4]:
# show the key stats
print(f"R² = {model.rsquared:.3f}")
print(f"Intercept = {model.params['const']:.2f}")
print(f"POI coefficient = {model.params['poi_count']:.2f} (p = {model.pvalues['poi_count']:.3f})")

R² = 0.004
Intercept = 15.82
POI coefficient = -0.83 (p = 0.334)


## Interpretation of the Regression Output

- **R² ≈ 0.004**  
  Only about 0.4% of the variation in **free_bikes** is explained by **poi_count**, and the adjusted R² is effectively zero (-0.000). In other words, POI count has almost no linear predictive value for bike availability in this sample.

- **Intercept ≈ 15.82**  
  When there are zero POIs within the buffer, the model predicts about **15.8 free bikes** at a station on average. This serves as the baseline level of availability.

- **POI coefficient ≈ –0.83 (p = 0.334)**  
  - The point estimate suggests each additional POI is associated with about **0.83 fewer free bikes**.  
  - However, the p-value of **0.334** is well above 0.05, so this effect is **not statistically significant**. We cannot rule out that the true effect is zero (or even positive).
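
The "cannot rule out zero" statement can be made concrete with an approximate 95% confidence interval. The sketch below backs the standard error out of the printed summary, using the fact that in a one-predictor regression the F-statistic equals the squared t-statistic of the slope. It assumes the rounded values shown above (coefficient -0.83, F = 0.9375) and substitutes a normal approximation for the t distribution, so the bounds are approximate rather than the exact ones statsmodels would report.

```python
from math import sqrt
from statistics import NormalDist

coef = -0.83      # rounded slope for poi_count from the printed output
f_stat = 0.9375   # F-statistic from the OLS summary

# With a single predictor, F = t^2, so |t| = sqrt(F).
t_abs = sqrt(f_stat)
std_err = abs(coef) / t_abs

# Normal approximation to the t critical value (assumption; 262 residual df).
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = coef - z * std_err, coef + z * std_err

# Two-sided p-value under the same approximation, as a sanity check
# against the reported p = 0.334.
p_approx = 2 * (1 - NormalDist().cdf(t_abs))

print(f"std err ≈ {std_err:.3f}")
print(f"95% CI ≈ ({ci_low:.2f}, {ci_high:.2f})")
print(f"approx. p-value ≈ {p_approx:.3f}")
```

The interval comfortably spans zero, which is the same conclusion the p-value gives. With the fitted model in memory, `model.conf_int()` returns the exact interval.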

---

### Bottom Line

This sample shows **no statistically significant** linear relationship between POI count and free-bike availability, and the fitted slope explains almost none of the variance. To improve the model, consider:

1. **Adding more or different predictors**  
   e.g. time of day, station capacity, day of week, weather conditions.  
2. **Exploring non-linear patterns**  
   Perhaps availability changes sharply around certain POI thresholds.  
3. **Expanding your POI definition**  
   Include cafés, bars, transit stops, etc., to capture broader “activity density.”  
4. **Increasing sample size**  
   More stations, or repeated snapshots over a longer time window, would narrow the confidence intervals and make a small effect easier to detect if one exists.  
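
Suggestions 1 and 2 could be tried along the following lines. This is a minimal sketch on synthetic data, not the project's table: the `capacity` column, the `demo` DataFrame, and the way `free_bikes` is generated are all assumptions made for illustration. It compares adjusted R² for the original one-predictor model against a model with a quadratic POI term and one extra predictor.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the station_poi table (hypothetical data).
rng = np.random.default_rng(0)
n = 264
demo = pd.DataFrame({
    "poi_count": rng.integers(0, 20, size=n),
    "capacity": rng.integers(10, 40, size=n),  # hypothetical extra predictor
})
# Assumption for the demo: availability depends on capacity, not on POIs.
demo["free_bikes"] = 0.4 * demo["capacity"] + rng.normal(0, 3, size=n)

# Original specification: a single linear POI term.
baseline = smf.ols("free_bikes ~ poi_count", data=demo).fit()

# Extended specification: quadratic POI term plus the extra predictor.
extended = smf.ols(
    "free_bikes ~ poi_count + I(poi_count ** 2) + capacity", data=demo
).fit()

print(f"baseline adj. R² = {baseline.rsquared_adj:.3f}")
print(f"extended adj. R² = {extended.rsquared_adj:.3f}")
```

Adjusted R² is used for the comparison because plain R² never decreases when predictors are added. Whether real station data behaves like this synthetic example is an open question.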

# Stretch

How can you turn the regression model into a classification model?

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# 1. Binarize target at the median
df["high_availability"] = (df["free_bikes"] >= df["free_bikes"].median()).astype(int)

# 2. Feature matrix
X_clf = df[["poi_count"]]
y_clf = df["high_availability"]

# 3. Fit logistic regression
clf = LogisticRegression().fit(X_clf, y_clf)

# 4. Evaluate on the training data (in-sample, so the scores are optimistic)
y_pred = clf.predict(X_clf)
print(classification_report(y_clf, y_pred))
print("ROC AUC:", roc_auc_score(y_clf, clf.predict_proba(X_clf)[:, 1]))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       122
           1       0.54      1.00      0.70       142

    accuracy                           0.54       264
   macro avg       0.27      0.50      0.35       264
weighted avg       0.29      0.54      0.38       264

ROC AUC: 0.5100727314707919
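
The report is consistent with a classifier that labels every station as class 1: recall is 1.00 for class 1 and 0.00 for class 0, and the ROC AUC of about 0.51 is close to chance. The short check below recomputes the headline numbers for such a constant predictor from the support counts alone (122 and 142, taken from the report). The variable names are illustrative.

```python
# Support counts taken from the classification report above.
n_low, n_high = 122, 142
total = n_low + n_high

# A constant "always predict 1" classifier gets every class-1 station
# right and every class-0 station wrong.
accuracy = n_high / total
precision_1 = n_high / total   # all 264 predictions are class 1
recall_1 = 1.0
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 is never predicted, so its F1 is taken as 0.
macro_f1 = (0.0 + f1_1) / 2
weighted_f1 = (n_low * 0.0 + n_high * f1_1) / total

print(f"accuracy    = {accuracy:.2f}")     # 0.54, as in the report
print(f"class-1 F1  = {f1_1:.2f}")         # 0.70
print(f"macro F1    = {macro_f1:.2f}")     # 0.35
print(f"weighted F1 = {weighted_f1:.2f}")  # 0.38
```

Since these match the report, the logistic model adds nothing over a majority-class baseline here, which mirrors the regression finding that `poi_count` carries little information about `free_bikes`.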


