#### Load Data from CSV File

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [6]:
# Load the cleaned data from the CSV file
final_df_clean = pd.read_csv('cleaned_bike_stations_pois.csv')

### Build Regression Model

In [7]:
# Define the target variable
target = 'number_of_bikes'

In [8]:
# One-hot encode categorical variables
final_df_encoded = pd.get_dummies(final_df_clean, columns=['poi_price', 'poi_category', 'source'], drop_first=True)

In [9]:
# Select all numeric and one-hot encoded features dynamically
all_features = [col for col in final_df_encoded.columns if col.startswith(('poi_price_', 'poi_category_', 'source_'))] + ['poi_rating', 'poi_latitude', 'poi_longitude']

In [10]:
# Check if there are any features selected
if not all_features:
    raise ValueError("No features selected")

In [11]:
# Convert features to numeric
X = final_df_encoded[all_features].apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(final_df_encoded[target], errors='coerce')

In [12]:
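# Keep only numeric columns. Note: select_dtypes(include=[np.number]) does not match bool,
# so if get_dummies produced boolean dummies (the default in pandas >= 2.0), they are dropped here.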
X = X.select_dtypes(include=[np.number])
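
The numeric filter above can silently discard one-hot columns when they are stored as booleans, which would be consistent with the summary further down reporting only three model degrees of freedom. The following is a minimal sketch on a toy frame, not the notebook's data: the `source` values `"yelp"` and `"foursquare"` are hypothetical placeholders, and `dtype=int` is one possible way to keep the dummies numeric.

```python
import numpy as np
import pandas as pd

# Toy data; the category values are hypothetical stand-ins
toy = pd.DataFrame({
    "poi_rating": [4.5, 3.0, 4.0],
    "source": ["yelp", "foursquare", "yelp"],
})

# dtype=int asks get_dummies for 0/1 integer columns instead of booleans
encoded = pd.get_dummies(toy, columns=["source"], drop_first=True, dtype=int)
numeric_only = encoded.select_dtypes(include=[np.number])

# For comparison: the same dummy column stored as bool is filtered out
as_bool = encoded.astype({"source_yelp": bool})
bool_filtered = as_bool.select_dtypes(include=[np.number])

print(list(numeric_only.columns))   # integer dummy survives the filter
print(list(bool_filtered.columns))  # boolean dummy does not
```

If the dummies are meant to enter the regression, checking `X.columns` right after the filter is a quick way to confirm they are still present.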

In [13]:
# Drop any rows with missing values in features or target
missing_rows = X.isna().any(axis=1) | y.isna()
X = X.loc[~missing_rows]
y = y.loc[~missing_rows]

In [14]:
# Check for sufficient data
if len(X) < len(X.columns):
    raise ValueError("Not enough data to fit the model")

In [15]:
# Add a constant term for the OLS model
X = sm.add_constant(X)

In [16]:
# Create and fit the OLS regression model
model = sm.OLS(y, X).fit()

In [17]:
# Print the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.116
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     26.34
Date:                Tue, 07 May 2024   Prob (F-statistic):           5.06e-16
Time:                        21:22:37   Log-Likelihood:                -2163.6
No. Observations:                 607   AIC:                             4335.
Df Residuals:                     603   BIC:                             4353.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          3481.4418    475.983      7.314

### Regression Model Interpretation

1. **R-squared**: The R-squared value of 0.116 indicates that the model explains only about 11.6% of the variation in the number of bikes. Most of the variation is left unexplained, so variables not included in the model likely account for much of it.

2. **Adjusted R-squared**: The adjusted R-squared is 0.111, slightly below the R-squared of 0.116. Adjusted R-squared penalizes each additional predictor, so the small gap reflects that penalty for three predictors on 607 observations. It does not change the conclusion that the model explains little of the variation in `number_of_bikes`.

3. **Coefficients**: The coefficients tell us about the relationship between the predictors (poi_rating, poi_latitude, poi_longitude) and the response variable (number_of_bikes).

   - `poi_rating`: The coefficient of -0.1669 suggests that for each unit increase in poi_rating, the number of bikes decreases by about 0.17, assuming all other variables are held constant. However, the p-value for this variable is 0.298, which is greater than 0.05, so the effect of poi_rating on the number of bikes is not statistically significant at the 5% level.
   
   - `poi_latitude`: The coefficient of -59.2988 suggests that for each unit increase in poi_latitude, the number of bikes decreases by about 59, assuming all other variables are held constant. The p-value for this variable is less than 0.05, suggesting that the effect of poi_latitude on the number of bikes is statistically significant at the 5% level.
   
   - `poi_longitude`: The coefficient of -43.3587 suggests that for each unit increase in poi_longitude, the number of bikes decreases by about 43, assuming all other variables are held constant. The p-value for this variable is less than 0.05, suggesting that the effect of poi_longitude on the number of bikes is statistically significant at the 5% level.

4. **F-statistic**: The F-statistic is used to test whether at least one predictor variable has a non-zero coefficient. In this case, the p-value of the F-statistic is very small (5.06e-16), suggesting that at least one of the predictors is statistically significant.
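
As a sanity check on the points above, the adjusted R-squared and the F-statistic can both be recomputed from the values printed in the summary header (607 observations, 3 predictors, R-squared of 0.116). This is a sketch using the standard textbook formulas; because the R-squared input is already rounded to three decimals, the results should land close to, but not exactly on, the reported 0.111 and 26.34.

```python
# Values read from the OLS summary header
n = 607      # No. Observations
p = 3        # Df Model (number of predictors, excluding the constant)
r2 = 0.116   # R-squared, as printed (rounded)

# Adjusted R-squared: penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor
# divided by unexplained variance per residual degree of freedom
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(f"Adjusted R-squared: {adj_r2:.4f}")
print(f"F-statistic: {f_stat:.2f}")
```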

In summary, `poi_latitude` and `poi_longitude` show a statistically significant association with `number_of_bikes`, while `poi_rating` does not. However, the model explains only about 11.6% of the variation in the number of bikes, and it should still be checked for autocorrelation in the residuals and multicollinearity among the predictors before the coefficients are relied on.

# Stretch

How can you turn the regression model into a classification model?

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [19]:
# Load the cleaned data from the CSV file
df = pd.read_csv('cleaned_bike_stations_pois.csv')

In [20]:
# Create a binary target: 1 if the station has more than 20 bikes, 0 otherwise
df['is_high'] = (df['number_of_bikes'] > 20).astype(int)

X = df[['poi_rating', 'poi_latitude', 'poi_longitude']]
y = df['is_high']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Predict the high/low class for each station in the test set
y_pred = clf.predict(X_test)

In [21]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.860655737704918
Precision: 0.7142857142857143
Recall: 0.25
F1 Score: 0.37037037037037035


Interpretation of the metrics used to evaluate the performance of the classification model:

1. **Accuracy**: This is the ratio of correct predictions to the total number of predictions. An accuracy of 0.86 means that the model correctly predicted the class 86% of the time. This is a general measure of how often the classifier is correct.

2. **Precision**: This is the ratio of true positive predictions to the total number of positive predictions (true positives + false positives). A precision of 0.71 means that when the model predicts the positive class, it is correct 71% of the time. Precision is a measure of how many of the positive predictions were actually correct.

3. **Recall (Sensitivity)**: This is the ratio of true positive predictions to the total number of actual positives (true positives + false negatives). A recall of 0.25 means that the model correctly identifies 25% of all actual positive instances. Recall is a measure of how many of the actual positive instances the model is able to identify.

4. **F1 Score**: This is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0. Because the harmonic mean is pulled toward the smaller of the two values, the F1 score of 0.37 is low mainly because recall is only 0.25, even though precision is 0.71.
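
The four definitions above can be made concrete with confusion-matrix counts. The counts below are inferred, not taken from the notebook output: they are the set that reproduces the reported metrics under the assumption of a 122-row test set (20% of roughly 607 rows, rounded up). The sketch also computes the accuracy of a hypothetical baseline that always predicts the majority class, which puts the 0.86 accuracy in context.

```python
# Inferred counts (assumption: 122 test rows); not printed by the notebook
tp, fp, fn, tn = 5, 2, 15, 100
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the more common class
baseline = max(tp + fn, fp + tn) / total

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Under these assumed counts, only 20 of the 122 test stations are in the positive class, so a model that never predicts "high" would already be right about 84% of the time.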

In summary, the model has high accuracy but low recall and a low F1 score. It is correct most of the time, but it misses most of the stations that actually have a high number of bikes: only 25% of them are identified. Accuracy alone is therefore a misleading measure here, since it can stay high when the positive class is rare and the model mostly predicts the negative class.
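
One common response to low recall is to lower the probability cutoff used to assign the positive class, trading some precision for more detected positives. The sketch below illustrates this on synthetic, imbalanced data generated with `make_classification`; the data, the 0.3 cutoff, and the `_demo` names are all assumptions for illustration, and this is not something the notebook itself does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical): roughly 15% positives
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=42,
)

# stratify keeps the class ratio the same in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf_demo = LogisticRegression(random_state=42).fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = clf_demo.predict_proba(X_te)[:, 1]

# Default cutoff of 0.5 versus a lower, illustrative cutoff of 0.3
pred_default = (proba >= 0.5).astype(int)
pred_low = (proba >= 0.3).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_low = recall_score(y_te, pred_low)

print(f"Recall at 0.5 cutoff: {recall_default:.2f}")
print(f"Recall at 0.3 cutoff: {recall_low:.2f}")
```

Lowering the cutoff can only add positive predictions, so recall never decreases, but precision usually does. The right cutoff depends on how costly a missed "high" station is relative to a false alarm.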