# Rileigh - DIVE Analysis

## Formulate Substantive Question

**Substantive Question:**

Given the observed influence of `avg_daily_temperature_c` and `avg_daily_wind_speed_kph` on churn prediction, **"How can Gym management leverage daily weather forecasts to proactively identify and engage at-risk members, particularly during periods of extreme temperatures or high winds, to mitigate potential churn?"**

This question focuses on operationalizing the model's insights: how the gym can use an external factor (weather) that is associated with churn to take preventative action. It moves beyond predicting churn to deciding how to intervene. Because the current model has low recall and its weather weights describe associations rather than causes, the question is best treated as a starting point for further investigation into targeted member engagement strategies.

# Task
Create a comprehensive DIVE analysis for the trained BQML churn prediction model, including a 'Data' section summarizing the `GymData_Curated` and `Weather_Raw_Streaming` datasets and features, an 'Insights' section detailing key findings from `ML.EVALUATE` and `ML.WEIGHTS` with a focus on weather features, and an 'Evaluation & Next Steps' section outlining model performance, limitations (e.g., low recall), and future improvements. Conclude with one substantive question the model can address.

## Structure DIVE Analysis - Data

### Subtask:
Compile and present the 'Data' section of the DIVE analysis, summarizing the input datasets (`GymData_Curated`, `Weather_Raw_Streaming`) and the features used for training the BQML churn prediction model.


## DIVE Analysis - Data

### Input Datasets and Feature Engineering

The BQML churn prediction model was trained using a combination of two primary datasets:

1.  **`GymData_Curated`**: This dataset served as the source for **batch features** related to gym member demographics and activity. Key features extracted from this table include:
    *   `Churn_Label`: The target variable for the model, indicating whether a member churned (1) or not (0).
    *   `Tenure_Days`: The duration in days since the member joined.
    *   `Age`: The age of the member.
    *   `Gender`: The gender of the member.
    *   `Membership_Type`: The type of membership the member holds.
    
2.  **`Weather_Raw_Streaming`**: This dataset provided **streaming weather data**, which was processed and aggregated to enrich the model with environmental context. The raw streaming data contained hourly records, which were then aggregated to daily averages to create the following features:
    *   `avg_daily_temperature_c`: Average daily temperature in Celsius.
    *   `avg_daily_wind_speed_kph`: Average daily wind speed in kilometers per hour.
    *   `avg_daily_relative_humidity`: Average daily relative humidity.
    *   `day_proportion`: The proportion of daylight hours in a given day.

### Data Joining for Model Training

To build the training dataset, `GymData_Curated` was joined to the daily aggregated `Weather_Raw_Streaming` data. The join matched `Last_Visit_Date` from `GymData_Curated` to `weather_date` (derived from `event_timestamp`) in the aggregated weather data. Each member is therefore paired with the weather on the day of their last visit, which is how external environmental factors enter the churn prediction.
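The aggregation and join described above can be sketched in pandas. This is a minimal illustration on made-up rows, not the pipeline's actual SQL: the `member_id` and `wind_speed_kph` column names and every value below are assumptions for the example.

```python
import pandas as pd

# Hypothetical hourly weather records (values are made up for illustration).
hourly = pd.DataFrame({
    "event_timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 20:00",
        "2024-01-02 08:00", "2024-01-02 20:00",
    ]),
    "temperature_c": [2.0, 6.0, 10.0, 14.0],
    "wind_speed_kph": [10.0, 20.0, 5.0, 15.0],  # assumed raw column name
})

# Hypothetical member rows standing in for the curated batch table.
members = pd.DataFrame({
    "member_id": [1, 2, 3],  # assumed identifier column
    "Last_Visit_Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "Churn_Label": [1, 0, 0],
})

# Step 1: collapse hourly records to one row per calendar day.
daily = (
    hourly
    .assign(weather_date=hourly["event_timestamp"].dt.normalize())
    .groupby("weather_date", as_index=False)
    .agg(
        avg_daily_temperature_c=("temperature_c", "mean"),
        avg_daily_wind_speed_kph=("wind_speed_kph", "mean"),
    )
)

# Step 2: attach the weather of each member's last-visit day.
# A left join keeps members whose last visit has no weather record (NaN features).
training = members.merge(
    daily, how="left", left_on="Last_Visit_Date", right_on="weather_date"
)
print(training)
```

Aggregating to daily rows before joining keeps one row per member. Joining the raw hourly table directly would repeat each member once per hourly record on that date.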

## Structure DIVE Analysis - Insights

### Subtask:
Formulate the 'Insights' section of the DIVE analysis. This will involve extracting and explaining key findings from the model's performance evaluation (ML.EVALUATE) and feature importances (ML.WEIGHTS), with a specific focus on the contribution of weather-related features to churn prediction.


**Reasoning**:
First, I'll extract the model evaluation metrics and feature importances from the previously generated DataFrames to prepare for the 'Insights' section of the DIVE analysis. This will involve identifying the top features and specifically examining the impact of weather-related variables.



In [None]:
from google.colab import auth
from google.cloud import bigquery

# Redefine constants (as they were not preserved in the previous execution context)
PROJECT_ID = "mgmt467-project"
DATASET_ID = "GymDB"
MODEL_NAME = "churn_prediction_model"

# Re-authenticate first so the BigQuery client is created with fresh credentials
auth.authenticate_user()
print("✅ Re-authenticated to Google Cloud")

bq = bigquery.Client(project=PROJECT_ID)

evaluation_query = f"""
SELECT
  *
FROM
  ML.EVALUATE(MODEL `{PROJECT_ID}.{DATASET_ID}.{MODEL_NAME}`)
"""

print("Running model evaluation query...")
eval_result = bq.query(evaluation_query).result().to_dataframe()

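# Note: EXCEPT(category_weights) drops the per-category weights of categorical
# (one-hot encoded) features, so those features appear with a NULL/NaN weight.
feature_importance_query = f"""
SELECT
  * EXCEPT(category_weights)
FROM
  ML.WEIGHTS(MODEL `{PROJECT_ID}.{DATASET_ID}.{MODEL_NAME}`)
ORDER BY
  ABS(weight) DESC
"""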

print("Running feature importance query...")
feature_importance_result = bq.query(feature_importance_query).result().to_dataframe()

model_metrics = eval_result.iloc[0]
feature_importances = feature_importance_result.copy()

# Identify top influential features (excluding intercept)
top_features = feature_importances[feature_importances['processed_input'] != '__INTERCEPT__'] \
    .sort_values(by='weight', key=abs, ascending=False)

# Extract weather-related features
weather_features = top_features[
    top_features['processed_input'].isin([
        'avg_daily_temperature_c',
        'avg_daily_wind_speed_kph',
        'avg_daily_relative_humidity',
        'day_proportion'
    ])
]

print("Model metrics:")
print(model_metrics)
print("\nTop features (excluding intercept) by absolute weight:")
print(top_features)
print("\nWeather-related feature importances:")
print(weather_features)


✅ Re-authenticated to Google Cloud
Running model evaluation query...
Running feature importance query...
Model metrics:
precision    1.000000
recall       0.025641
accuracy     0.746667
f1_score     0.050000
log_loss     0.555046
roc_auc      0.594548
Name: 0, dtype: float64

Top features (excluding intercept) by absolute weight:
               processed_input    weight
1      avg_daily_temperature_c -0.059069
2                          Age  0.012715
3     avg_daily_wind_speed_kph -0.011281
4  avg_daily_relative_humidity -0.000162
5                  Tenure_Days -0.000023
6               day_proportion  0.000000
7                       Gender       NaN
8              Membership_Type       NaN

Weather-related feature importances:
               processed_input    weight
1      avg_daily_temperature_c -0.059069
3     avg_daily_wind_speed_kph -0.011281
4  avg_daily_relative_humidity -0.000162
6               day_proportion  0.000000


## DIVE Analysis - Insights

### Model Performance Evaluation (`ML.EVALUATE`)

The logistic regression model for churn prediction exhibits the following performance metrics:

*   **Accuracy:** `0.746667`
*   **Precision:** `1.0`
*   **Recall:** `0.025641`
*   **F1-Score:** `0.05`
*   **Log Loss:** `0.555046`
*   **ROC AUC:** `0.594548`

These metrics describe a model that almost never predicts churn. Precision is 1.0, so every churn prediction made on the evaluation set was correct, but a recall of about 2.6% means the model caught roughly 1 in 39 actual churners. With no false positives, the recall and accuracy figures together imply that about 26% of the evaluated members churned, so the 74.7% accuracy is barely above the 74% a model would score by always predicting "no churn". The F1-Score (the harmonic mean of precision and recall) of 0.05 and a ROC AUC of 0.59, only modestly above the 0.5 of random ranking, confirm that the model is not yet practically effective at identifying churners.
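The arithmetic behind these metrics can be checked by reconstructing the smallest confusion matrix consistent with them. This is a sketch under the assumption that the printed values are rounded simple fractions; the real evaluation set could be any whole multiple of the counts recovered here.

```python
from fractions import Fraction

# Metrics as printed by ML.EVALUATE above.
precision, recall, accuracy = 1.0, 0.025641, 0.746667

# Recover the rounded decimals as small fractions.
recall_frac = Fraction(recall).limit_denominator(1000)      # TP / (TP + FN)
accuracy_frac = Fraction(accuracy).limit_denominator(1000)  # (TP + TN) / N

tp = recall_frac.numerator
fn = recall_frac.denominator - tp
fp = 0  # precision of exactly 1.0 means no false positives

# With FP = 0, every error is a false negative, so N = FN / (1 - accuracy).
n_frac = Fraction(fn) / (1 - accuracy_frac)
scale = n_frac.denominator  # scale up if N is not yet a whole number
tp, fn = tp * scale, fn * scale
n = int(n_frac * scale)
tn = n - tp - fn - fp

churn_rate = (tp + fn) / n
baseline_accuracy = 1 - churn_rate  # accuracy of always predicting "no churn"

print(f"TP={tp} FN={fn} FP={fp} TN={tn} N={n}")
print(f"churn rate={churn_rate:.2%}, majority-class baseline={baseline_accuracy:.2%}")
```

Under that assumption, the reported precision depends on a very small number of positive predictions, which is why it should not be read as evidence of a reliable classifier.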

### Feature Importances (`ML.WEIGHTS`)

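Ranking features by the absolute value of their weights gives a rough view of their relative impact on the churn prediction. Each coefficient gives the direction and size of the change in the log-odds of churn per one-unit increase in that feature. Because the weights are reported on each feature's original scale (`ML.WEIGHTS` does not standardize by default), features measured in different units, such as degrees Celsius and days of tenure, are not directly comparable.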

**Top Influential Features (by absolute weight, excluding intercept):**

1.  **`avg_daily_temperature_c`**: `weight = -0.059069`
2.  **`Age`**: `weight = 0.012715`
3.  **`avg_daily_wind_speed_kph`**: `weight = -0.011281`
4.  **`avg_daily_relative_humidity`**: `weight = -0.000162`
5.  **`Tenure_Days`**: `weight = -0.000023`
6.  **`day_proportion`**: `weight = 0.000000`

(`Gender` and `Membership_Type` show `NaN` weights because they are categorical features. For one-hot encoded columns, `ML.WEIGHTS` leaves `weight` NULL and reports one weight per category in the `category_weights` array, which the query above excluded with `EXCEPT(category_weights)`.)
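If `category_weights` is kept in the query result, the per-category weights can be flattened into one row per category. The sketch below assumes each array element is a struct with `category` and `weight` fields, as `ML.WEIGHTS` documents. The category names and their weight values are made up for illustration; only the `Age` weight comes from the output above.

```python
import pandas as pd

# Hypothetical ML.WEIGHTS-style result that still includes category_weights.
weights = pd.DataFrame({
    "processed_input": ["Age", "Gender", "Membership_Type"],
    "weight": [0.012715, None, None],  # NULL for categorical features
    "category_weights": [
        [],
        [{"category": "F", "weight": 0.10}, {"category": "M", "weight": -0.08}],
        [{"category": "Basic", "weight": 0.25}, {"category": "Premium", "weight": -0.30}],
    ],
})

# Categorical features are the rows whose scalar weight is missing.
categorical = weights[weights["weight"].isna()]

# One row per (feature, category) pair.
exploded = categorical.explode("category_weights").reset_index(drop=True)
flat = pd.DataFrame({
    "processed_input": exploded["processed_input"],
    "category": exploded["category_weights"].map(lambda cw: cw["category"]),
    "weight": exploded["category_weights"].map(lambda cw: cw["weight"]),
})

# Rank categories by absolute weight, like the numeric features.
flat = flat.sort_values("weight", key=abs, ascending=False).reset_index(drop=True)
print(flat)
```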

### Contribution of Weather-Related Features

Weather-related features play a notable role in the model's predictions:

*   **`avg_daily_temperature_c`** has the largest absolute weight among the non-intercept numeric features. Its **negative weight** (`-0.059069`) suggests that **higher average daily temperatures are associated with a decreased likelihood of customer churn**. One possible reading is that warmer weather encourages gym attendance and so indirectly reduces churn, but the weight describes an association, not a cause.

*   **`avg_daily_wind_speed_kph`** also has a **negative weight** (`-0.011281`), indicating that **higher average daily wind speeds are associated with a slightly decreased likelihood of churn**. This effect is less pronounced than temperature but still present.

*   **`avg_daily_relative_humidity`** shows a very small **negative weight** (`-0.000162`), implying a negligible inverse relationship with churn.

*   **`day_proportion`** has a **zero weight**, indicating it does not contribute to the model's prediction in this specific logistic regression setup.

In summary, daily temperature and wind speed carry the largest weather-related weights, with warmer and windier conditions weakly associated with lower churn. Given the model's ROC AUC of about 0.59, this is a tentative signal that external environmental factors may matter for churn at the gym, not a confirmed effect.
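To make the size of these weights concrete, a logistic regression weight can be converted to an odds ratio with `exp(weight * delta)`. The sketch below assumes the weights are unstandardized (the `ML.WEIGHTS` default) and that the effect is linear in log-odds across the whole range, which is a strong assumption for a 10-unit change. The helper name `odds_ratio` is introduced here for illustration.

```python
import math

# Weights reported by ML.WEIGHTS above (change in log-odds per unit of the feature).
weights = {
    "avg_daily_temperature_c": -0.059069,
    "avg_daily_wind_speed_kph": -0.011281,
}

def odds_ratio(weight: float, delta: float = 1.0) -> float:
    """Multiplicative change in the odds of churn for a `delta`-unit increase."""
    return math.exp(weight * delta)

for name, w in weights.items():
    print(
        f"{name}: odds x{odds_ratio(w):.4f} per unit, "
        f"x{odds_ratio(w, 10):.4f} per 10 units"
    )
```

Under these assumptions, each additional degree Celsius multiplies the odds of churn by about 0.94, and a 10 °C difference by about 0.55. The odds ratio applies to odds, not to the churn probability itself.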

## Structure DIVE Analysis - Evaluation & Next Steps

### Subtask:
Develop the 'Evaluation' section of the DIVE analysis. This will summarize the model's performance metrics, discuss its limitations (e.g., low recall), and propose concrete next steps for improving the model or conducting further analysis, such as exploring different model types or feature engineering.


## DIVE Analysis - Evaluation & Next Steps

### Model Performance Summary

The evaluation of the BQML logistic regression model named `churn_prediction_model` yielded the following key performance metrics:

*   **Accuracy:** 0.746667
*   **Precision:** 1.0
*   **Recall:** 0.025641
*   **F1-Score:** 0.05
*   **Log Loss:** 0.555046
*   **ROC AUC:** 0.594548

The model shows perfect precision (1.0): every churn prediction it made on the evaluation set was correct. However, its recall is extremely low (0.025641, about 1 in 39 churners), so that precision rests on very few positive predictions. This gap between precision and recall produces a very low F1-score (0.05) and means the model is not effective at identifying the positive class (churners).

### Limitations

The primary limitation of the current model is its **poor recall** for predicting customer churn. While highly precise, the model misses the vast majority of customers who actually churn, so most opportunities to intervene would be lost. The low F1-score confirms that the model is not yet practically useful for proactively identifying at-risk customers. Likely contributors are class imbalance in the training data (fewer churners than non-churners), a default classification threshold of 0.5 that few predicted probabilities exceed, and features with limited power to separate churners from non-churners (the ROC AUC is only about 0.59).
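One way to probe the recall limitation is to recompute precision and recall at thresholds below 0.5. The sketch below uses made-up probabilities and labels purely for illustration; in practice the probabilities would come from the model (for example via `ML.PREDICT`), and the helper name `precision_recall_at` is hypothetical.

```python
def precision_recall_at(threshold, probs, labels):
    """Precision and recall when predicting churn for probs >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined with no positives
    recall = tp / (tp + fn) if (tp + fn) else None     # undefined with no churners
    return precision, recall

# Hypothetical predicted churn probabilities and true labels (made up).
probs = [0.55, 0.45, 0.40, 0.35, 0.30, 0.28, 0.25, 0.20, 0.15, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.4, 0.3, 0.2):
    p, r = precision_recall_at(threshold, probs, labels)
    print(f"threshold={threshold:.1f} precision={p} recall={r}")
```

In this toy data, lowering the threshold from 0.5 to 0.3 trades some precision for much higher recall. Whether the real model behaves similarly depends on how its predicted probabilities are distributed.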

### Next Steps and Recommendations

To improve the `churn_prediction_model` and address its limitations, the following steps are recommended:

1.  **Address Class Imbalance:**
    *   Investigate the distribution of `Churn_Label` in the training data. If the imbalance is material, consider oversampling the minority class (churners), undersampling the majority class (non-churners), or reweighting the classes. In BigQuery ML, the `CREATE MODEL` options `AUTO_CLASS_WEIGHTS` and `CLASS_WEIGHTS` handle the reweighting.
    *   Focus on evaluation tools better suited to imbalanced data, such as the confusion matrix (`ML.CONFUSION_MATRIX`), the precision-recall curve, and average precision. Because recall is low while precision is perfect, also evaluate classification thresholds below the default 0.5 (for example with `ML.ROC_CURVE`).

2.  **Explore Alternative Model Types:**
    *   Logistic Regression can be sensitive to class imbalance and feature scaling. Consider exploring other BigQuery ML models that might handle these challenges better or capture non-linear relationships, such as:
        *   **Boosted Tree (`BOOSTED_TREE_CLASSIFIER`, XGBoost-based):** Often performs well on tabular data and can handle non-linearities and interactions.
        *   **Random Forest (`RANDOM_FOREST_CLASSIFIER`):** Another robust ensemble method.

3.  **Further Feature Engineering:**
    *   **Interaction Terms:** Create new features by combining existing ones (e.g., `Age` * `Tenure_Days`).
    *   **Time-Series Features from Weather Data:** Instead of just daily averages, incorporate features like `weather_trend` (change in temperature over the last week), `weather_variability` (standard deviation of temperature), or `extreme_weather_events` (binary flag for days above/below certain thresholds).
    *   **More Granular Demographic/Activity Features:** If available, include more detailed membership activity data, visit frequency, usage of gym facilities, or personal trainer interactions.
    *   **Lagged Features:** Incorporate previous month's churn status or weather conditions as features.

4.  **Data Acquisition:**
    *   If possible, gather more historical data, especially for churned customers, to provide the model with a richer learning signal for the minority class.

5.  **Refine Weather Data Aggregation:**
    *   Experiment with different aggregation windows for weather data. For instance, using weekly or bi-weekly averages instead of daily might capture longer-term trends that influence churn more effectively. The `Last_Visit_Date` join might not be the most relevant for weather impact; perhaps averaging weather conditions over a period *before* the `Last_Visit_Date` could be more insightful.
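The weather-related suggestions in steps 3 and 5 can be sketched as trailing-window features computed over the days *before* each date. This is a minimal pandas illustration on made-up temperatures. The 7-day window, the 30 °C cutoff, and the `temp_mean_prev_7d` column name are assumptions, and it expects one row per consecutive day with no gaps.

```python
import pandas as pd

# Hypothetical daily weather table (values are made up for illustration).
daily = pd.DataFrame({
    "weather_date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "avg_daily_temperature_c": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 32.0, 34.0],
}).set_index("weather_date")

WINDOW_DAYS = 7         # assumed look-back length
HOT_THRESHOLD_C = 30.0  # assumed cutoff for an extreme-heat day

temp = daily["avg_daily_temperature_c"]
# Shift by one day so each date's window covers only the days strictly before it.
prior = temp.shift(1)

features = pd.DataFrame({
    # Mean temperature over the previous 7 days.
    "temp_mean_prev_7d": prior.rolling(WINDOW_DAYS).mean(),
    # Variability: standard deviation over the previous 7 days.
    "weather_variability": prior.rolling(WINDOW_DAYS).std(),
    # Trend: newest minus oldest temperature within the previous 7 days.
    "weather_trend": prior - prior.shift(WINDOW_DAYS - 1),
    # Binary flag for days above the assumed extreme-heat threshold.
    "extreme_weather_event": (temp > HOT_THRESHOLD_C).astype(int),
})
print(features)
```

These features could then be joined to members on `Last_Visit_Date` in place of, or alongside, the same-day averages. The first seven dates have no complete look-back window, so their rolling features are `NaN`.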

## Final Task

### Subtask:
Present the complete DIVE analysis, including the Data, Insights, and Evaluation sections, along with the formulated substantive question.


## Summary:

### Q&A
The substantive question the model can help address is: "How can Gym management leverage daily weather forecasts to proactively identify and engage at-risk members, particularly during periods of extreme temperatures or high winds, to mitigate potential churn?"

### Data Analysis Key Findings
*   **Data Overview**: The churn prediction model utilized `GymData_Curated` (for batch features like `Churn_Label`, `Tenure_Days`, `Age`, `Gender`, `Membership_Type`) and `Weather_Raw_Streaming` (for aggregated daily weather features like `avg_daily_temperature_c`, `avg_daily_wind_speed_kph`, `avg_daily_relative_humidity`, and `day_proportion`). These datasets were joined on `Last_Visit_Date` and `weather_date` to link member behavior with weather conditions.
*   **Model Performance**: The logistic regression model achieved an accuracy of 0.746667 and a precision of 1.0. However, its recall was very low at 0.025641, resulting in a low F1-Score of 0.05. The ROC AUC was 0.594548, and Log Loss was 0.555046. The perfect precision means every churn prediction on the evaluation set was correct, but the extremely low recall shows that the model made very few such predictions and identified only about 1 in 39 actual churners.
*   **Feature Importance**:
    *   `avg_daily_temperature_c` was the most influential non-intercept feature with a weight of -0.059069, suggesting higher temperatures are associated with a decreased likelihood of churn.
    *   `Age` (weight: 0.012715) was the second most influential.
    *   `avg_daily_wind_speed_kph` also had a negative weight (-0.011281), indicating a slight decrease in churn with higher wind speeds.
    *   `avg_daily_relative_humidity` had a negligible negative weight (-0.000162).
    *   `day_proportion` had a zero weight, indicating no contribution to the model's prediction.
    *   `Gender` and `Membership_Type` showed `NaN` weights because they are categorical: their per-category weights live in the `category_weights` column, which the query excluded.
*   **Model Limitations**: The primary limitation is the model's very low recall, making it ineffective at proactively identifying most actual churners, thus missing significant intervention opportunities.

### Insights or Next Steps
*   **Address Class Imbalance and Model Selection**: The model's low recall and F1-score point to some combination of class imbalance, a classification threshold that is too high, and weak features. Future work should prioritize techniques to address this (e.g., oversampling, undersampling, class weights via `AUTO_CLASS_WEIGHTS`, or a lower threshold) and explore alternative BigQuery ML models like Boosted Trees (XGBoost) or Random Forest, which can capture non-linear relationships.
*   **Enhanced Feature Engineering and Data Sourcing**: To improve predictive power, consider generating more sophisticated features, including interaction terms, time-series weather attributes (e.g., weather trends, variability, extreme events), and potentially acquiring more granular member activity data or historical churn data. Refining how weather data is aggregated and linked (e.g., averaging weather conditions over a period *before* the `Last_Visit_Date`) could also yield better insights.


In [None]:
import plotly.express as px
import pandas as pd
from google.cloud import bigquery

# Redefine constants if not already defined in the current session
PROJECT_ID = "mgmt467-project"
DATASET_ID = "GymDB"
GYM_DATA_CURATED_TABLE = "GymData_Curated"
WEATHER_RAW_STREAMING_TABLE = "Weather_Raw_Streaming"

bq = bigquery.Client(project=PROJECT_ID)

# Aggregate hourly weather to daily averages first, then join to members so that
# each member contributes exactly one row to the churn-rate calculation
combined_data_query = f"""
WITH daily_weather AS (
  SELECT
    DATE(event_timestamp) AS weather_date,
    AVG(temperature_c) AS avg_daily_temperature_c
  FROM
    `{PROJECT_ID}.{DATASET_ID}.{WEATHER_RAW_STREAMING_TABLE}`
  GROUP BY
    weather_date
)
SELECT
  g.Churn_Label,
  w.weather_date,
  w.avg_daily_temperature_c
FROM
  `{PROJECT_ID}.{DATASET_ID}.{GYM_DATA_CURATED_TABLE}` AS g
LEFT JOIN
  daily_weather AS w
ON
  g.Last_Visit_Date = w.weather_date
WHERE
  g.Churn_Label IS NOT NULL
"""

print("Fetching combined data from BigQuery...")
combined_df = bq.query(combined_data_query).result().to_dataframe()

# Drop rows where avg_daily_temperature_c is NaN (no matching weather data)
combined_df.dropna(subset=['avg_daily_temperature_c'], inplace=True)

# Define temperature bins for visualization
temp_bins = [-float('inf'), 0, 10, 20, 30, float('inf')] # Example bins
temp_labels = ['<0°C', '0-10°C', '10-20°C', '20-30°C', '>30°C']

combined_df['temperature_bin'] = pd.cut(
    combined_df['avg_daily_temperature_c'],
    bins=temp_bins,
    labels=temp_labels,
    right=False # Bins are [low, high)
)

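# Calculate churn rate per temperature bin
# An explicit observed=False keeps empty bins in the result and avoids the
# pandas FutureWarning about this default changing
churn_rate_by_temp = (
    combined_df.groupby('temperature_bin', observed=False)['Churn_Label']
    .mean()
    .reset_index()
    .rename(columns={'Churn_Label': 'Churn_Rate'})
)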

# Plotly Bar Chart
fig = px.bar(
    churn_rate_by_temp,
    x='temperature_bin',
    y='Churn_Rate',
    title='Churn Rate by Average Daily Temperature Bin',
    labels={'temperature_bin': 'Average Daily Temperature (°C)', 'Churn_Rate': 'Churn Rate'},
    color='Churn_Rate',
    color_continuous_scale=px.colors.sequential.Plasma
)

fig.update_layout(
    xaxis={'categoryorder':'array', 'categoryarray': temp_labels},
    yaxis_tickformat='.2%', # Format y-axis as percentage
)

fig.show()


Fetching combined data from BigQuery...




Rileigh Dethy - Collaboration

I helped with brainstorming and with getting our API and dataset into our pipeline. I also created our model, using prompts in Gemini to build the best model we could with our given dataset. I learned that our dataset included a lot of missing values and that the batch dataset lacked many fields that would have been more helpful in creating the model.

I also organized our GitHub and folders within it to help keep our repo clean.

One lesson I learned is that it is important to avoid mistakes when you set up your API and insert your dataset, because they can cause a lot of issues later if you can't trace back your work. This was the first project I've worked on that really expected you to use what you had learned throughout the whole semester. It was challenging, but it was a good reminder of what we've done.