### Section 1: Importing Libraries

In [22]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Section 2: Loading Data

In [23]:
# Load the cleaned data from the CSV file
df = pd.read_csv('cleaned_bike_stations_pois.csv')

### Section 3: Building a Regression Model

#### 3.1 Feature Selection and Encoding

In [24]:
# Define the target variable (features are selected after encoding)
target = 'number_of_bikes'

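# One-hot encode categorical variables; dtype=int keeps the dummy columns numeric
# (pandas >= 2.0 defaults to bool, which select_dtypes(np.number) would later drop)
df_encoded = pd.get_dummies(df, columns=['poi_price', 'poi_category', 'source'], drop_first=True, dtype=int)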

# Collect the one-hot encoded columns and the numeric POI features
encoded_features = [col for col in df_encoded.columns if col.startswith(('poi_price_', 'poi_category_', 'source_'))]
all_features = encoded_features + ['poi_rating', 'poi_latitude', 'poi_longitude']

# all_features always contains the three numeric columns, so check the encoded list instead
if not encoded_features:
    raise ValueError("No one-hot encoded features selected")

#### 3.2 Data Preprocessing

In [25]:
# Convert features to numeric
X = df_encoded[all_features].apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(df_encoded[target], errors='coerce')

# Drop any columns in X that couldn't be converted to numeric data
X = X.select_dtypes(include=[np.number])

# Drop any rows with missing values in features or target
missing_rows = X.isna().any(axis=1) | y.isna()
X = X.loc[~missing_rows]
y = y.loc[~missing_rows]

# Check for sufficient data
if len(X) < len(X.columns):
    raise ValueError("Not enough data to fit the model")

#### 3.3 Fitting the OLS Regression Model

In [26]:
# Add a constant term for the OLS model
X = sm.add_constant(X)

# Create and fit the OLS regression model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.370
Model:                            OLS   Adj. R-squared:                  0.156
Method:                 Least Squares   F-statistic:                     1.727
Date:                Mon, 23 Sep 2024   Prob (F-statistic):           7.51e-06
Time:                        18:46:37   Log-Likelihood:                -2060.5
No. Observations:                 607   AIC:                             4431.
Df Residuals:                     452   BIC:                             5114.
Df Model:                         154                                         
Covariance Type:            nonrobust                                         
                                                                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------

#### 3.4 Interpretation of the Regression Model Results

The Ordinary Least Squares (OLS) regression model has yielded the following results:

1. **R-squared**: The R-squared value is 0.370, meaning that approximately 37% of the variability in the number of bikes is explained by the model. This means a substantial portion of the variance is unexplained, suggesting that additional factors may be influencing the number of bikes at each station.

2. **Adjusted R-squared**: The adjusted R-squared is 0.156, less than half the R-squared value. Adjusted R-squared penalizes each additional predictor, so a gap this large with 154 predictors and 607 observations suggests that many predictors add little explanatory power beyond what chance alone would produce.

3. **F-statistic**: The F-statistic is 1.727 with a p-value of 7.51e-06, so the hypothesis that all slope coefficients are zero is rejected and the model as a whole is statistically significant. This says nothing about which of the 154 predictors matter, and many of them may not be individually significant.

4. **Coefficients and Predictors**:
   - **poi_rating**: The coefficient for `poi_rating` is -1.4899, with a p-value of 0.027, indicating a significant negative relationship. This suggests that as the POI rating increases by 1 unit, the number of bikes decreases by approximately 1.49, holding all else constant. This is somewhat counterintuitive and might warrant further investigation.
   
   - **poi_latitude and poi_longitude**: Both latitude and longitude have significant negative relationships with the number of bikes, with coefficients of -66.0190 and -36.4241, respectively. This indicates that geographical location is a significant factor in determining the number of bikes at a station. Specifically, as the latitude or longitude increases, the number of bikes tends to decrease.

   - **Price Categories**: None of the price categories, including `poi_price_2.0`, `poi_price_3.0`, `poi_price_4.0`, and `poi_price_Unknown`, were statistically significant predictors of the number of bikes. This suggests that the price level of POIs in the vicinity may not be a meaningful factor in explaining bike availability.
   
   - **Category-Specific Predictors**: A few categories, such as `poi_category_Brasserie`, `poi_category_Bistros`, `poi_category_Tabernas`, and `poi_category_Seafood`, were significant positive predictors of the number of bikes, suggesting that certain types of POIs might be located near more popular bike stations. Other categories, such as `poi_category_Delis` and `poi_category_Chee Kufta`, had significant negative relationships with the number of bikes, indicating that these categories are associated with stations that tend to have fewer bikes.
   
   - **Multicollinearity Warning**: The model summary warns of strong multicollinearity or a singular design matrix, based on the condition number and the smallest eigenvalue (1.84e-30). An eigenvalue this close to zero suggests that some predictors are exact or near-exact linear combinations of others, which inflates the standard errors of the affected coefficients and makes their individual effects hard to assess.

5. **Model Performance**:
   - **Model Fit**: Although the model is statistically significant, the relatively low adjusted R-squared and multicollinearity issues suggest that this model may not provide the most accurate or robust predictions. Future work could involve removing or consolidating some of the highly correlated predictors or exploring alternative models (e.g., regularization techniques such as Ridge or Lasso regression).
   
   - **Residual Analysis**: The high skew and kurtosis values, combined with the significant Omnibus and Jarque-Bera test results, suggest that the residuals are not normally distributed. Non-normal residuals do not bias the coefficient estimates, but they weaken the justification for the t- and F-tests, so the reported p-values and confidence intervals should be read with caution.
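
The adjusted R-squared and F-statistic discussed above can be reproduced from the summary values alone. The sketch below is a minimal check, assuming n = 607 observations and p = 154 predictors as reported. Because R-squared is rounded to 0.370 in the printout, the results should agree with the summary only to about two decimal places. The function names are illustrative and not part of the notebook.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors (constant excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall F-statistic for the null hypothesis that all p slopes are zero."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Values read from the OLS summary (R-squared is rounded in the printout)
r2, n, p = 0.370, 607, 154

adj = adjusted_r2(r2, n, p)
f_stat = f_statistic(r2, n, p)

print(f"Adjusted R-squared: {adj:.3f}")
print(f"F-statistic: {f_stat:.3f}")
```

The penalty factor (n - 1) / (n - p - 1) is roughly 1.34 here, which is why the adjusted value falls so far below the raw R-squared.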

#### 3.5 Summary
- **Significant Predictors**: Latitude, longitude, and `poi_rating` were significant predictors of the number of bikes. Additionally, a few POI categories, such as `Brasserie`, `Bistros`, `Tabernas`, and `Seafood`, had significant positive relationships with the number of bikes, while `Delis` and `Chee Kufta` had significant negative relationships.
- **Model Limitations**: The model's explanatory power is moderate (R-squared = 0.370), but the adjusted R-squared is low (0.156), suggesting that many predictors may not contribute significantly to explaining the variation in the number of bikes. Multicollinearity and non-normal residuals may be affecting the model's performance.
- **Next Steps**: Consider using feature selection or regularization techniques to address multicollinearity and improve the model's predictive accuracy. Additionally, investigating alternative models and transformations for the response variable could yield better insights.
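
As one possible way to act on the regularization suggestion, the sketch below fits a cross-validated Lasso, which shrinks the coefficients of uninformative predictors to exactly zero. It runs on synthetic stand-in data rather than the bike dataset, and the names `X_demo`, `y_demo`, and `lasso` are assumptions made for illustration. Results on the real data would differ.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only the first 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + 1.5 * X_demo[:, 2]
    + rng.normal(size=200)
)

# Standardize, then let 5-fold cross-validation choose the L1 penalty strength
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X_demo, y_demo)

# Indices of features whose coefficients were not shrunk to zero
coefs = lasso.named_steps["lassocv"].coef_
kept = np.flatnonzero(coefs)

print(f"Selected alpha: {lasso.named_steps['lassocv'].alpha_:.4f}")
print(f"Features kept: {len(kept)} of {X_demo.shape[1]}")
```

Standardizing first matters because the L1 penalty treats all coefficients on the same scale; without it, features measured in large units would be penalized less.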

### Section 4: Convert Regression to Classification Model

#### 4.1 Problem Definition

The goal is to recast the regression problem as a binary classification problem. We create a binary target variable `is_high`, which equals 1 when a station has more than 20 bikes and 0 otherwise.
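
Before training, it may help to check how the threshold splits the data, since a skewed split affects how the later metrics should be read. The following is a minimal sketch on hypothetical bike counts (the `bikes` values are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical bike counts for eight stations (not taken from the dataset)
bikes = pd.Series([5, 12, 18, 20, 21, 25, 9, 30], name="number_of_bikes")

# Strictly greater than 20 counts as "high", so a station with exactly 20 bikes is labeled 0
is_high = (bikes > 20).astype(int)

# Share of each class; a heavily skewed split makes accuracy harder to interpret
class_share = is_high.value_counts(normalize=True).sort_index()
print(class_share)
```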

#### 4.2 Data Preprocessing for Classification

In [27]:
# Create a binary target variable 'is_high' (1 if more than 20 bikes, else 0)
df['is_high'] = (df['number_of_bikes'] > 20).astype(int)

# Define features and target variable (only the three numeric features are used here)
X = df[['poi_rating', 'poi_latitude', 'poi_longitude']]
y = df['is_high']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### 4.3 Fitting the Logistic Regression Model

In [28]:
# Train the Logistic Regression model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Use the model to make predictions
y_pred = clf.predict(X_test)

#### 4.4 Evaluating the Classification Model

In [29]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.8688524590163934
Precision: 0.75
Recall: 0.3
F1 Score: 0.42857142857142855


#### 4.5 Interpretation of Classification Metrics

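- **Accuracy**: 87% (the model classifies 87% of the test stations correctly). Accuracy alone can be misleading when one class is much more common than the other.
- **Precision**: 75% (75% of the stations predicted as high actually were high).
- **Recall**: 30% (the model identifies only 30% of the stations that actually were high).
- **F1 Score**: 0.43 (the harmonic mean of precision and recall, pulled down by the low recall).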
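
To make the relationship between these four metrics concrete, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are the smallest integer counts consistent with the printed metrics (multiples of them give the same values), and the notebook does not show the actual confusion matrix.

```python
# Illustrative test-set counts chosen to reproduce the reported metrics;
# the actual confusion matrix is not shown in the notebook output.
tp, fp, fn, tn = 3, 1, 7, 50
total = tp + fp + fn + tn

acc = (tp + tn) / total
prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1_demo = 2 * tp / (2 * tp + fp + fn)

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_acc = (tn + fp) / total

print(f"Accuracy:  {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall:    {rec:.3f}")
print(f"F1 score:  {f1_demo:.3f}")
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

If the test set does look like this, a model that never predicts "high" would already reach about 84% accuracy, so the 87% figure would be only a small improvement over that baseline.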

### Section 5: Conclusion

- The regression model shows that `poi_latitude`, `poi_longitude`, and `poi_rating` are significant predictors of the number of bikes, but the model explains only 37% of the variation (adjusted R-squared: 0.156).
- The logistic regression model reaches 87% accuracy, but recall is low (30%): it misses about 70% of the stations with high bike availability, so the accuracy figure overstates how well the model detects the positive class.
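
One common way to raise recall is to lower the decision threshold applied to the predicted probabilities, at the cost of some precision. The sketch below illustrates the trade-off on synthetic, imbalanced stand-in data. The data, the 0.3 threshold, and the names `X_demo`, `y_demo`, and `clf_demo` are assumptions for illustration, not results from the bike dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data in which positives are the minority class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=600) > 1.0).astype(int)

# Stratify so the train and test sets keep the same class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf_demo = LogisticRegression(random_state=42).fit(X_tr, y_tr)
proba = clf_demo.predict_proba(X_te)[:, 1]  # predicted probability of class 1

# Compare the default 0.5 cut-off with a lower, arbitrarily chosen one
results = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred, zero_division=0),
    )
    print(
        f"threshold={threshold}: "
        f"precision={results[threshold][0]:.2f}, recall={results[threshold][1]:.2f}"
    )
```

Lowering the threshold can only add predicted positives, so recall never decreases; whether the accompanying loss in precision is acceptable depends on the cost of false alarms in the application.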