'''
This project aims to analyze digital well‑being and mental health outcomes using both SQL and Python. 
We first construct a Composite Digital Well‑being Index (DWI) to capture overall digital well‑being, 
and then apply an explanatory model (multinomial logistic regression) to identify the main predictive 
factors of mental_state. SQL is used for descriptive analysis, while Python is required for predictive modeling.



Composite Digital Well‑being Index (DWI): designed to capture the overall digital well‑being of a user.
It combines several dimensions of digital and personal life:
- Screen time (daily_screen_time_min) → higher screen time reduces the score, since excessive use is considered negative.
- Negative interactions (negative_interactions_count) → more negative interactions lower the score.
- Sleep hours (sleep_hours) → sufficient sleep increases the score.
- Physical activity (physical_activity_min) → higher levels of physical activity increase the score.

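Interpreting the score (higher is better):
- Low values → low digital well‑being (heavy screen use, little sleep, low activity, many negative interactions).
- High values → high digital well‑being (balanced usage, sufficient sleep, regular physical activity, few negative interactions).
Note: in the query below, screen time and negative interactions enter with a negative sign, so the raw index runs from about −0.5 to 0.5 (not 0 to 1)
when every input stays within its normalization cap (600 min, 10 interactions, 9 h, 60 min).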
'''
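
# Sketch: the DWI formula in pandas
'''
To make the index definition concrete, here is a minimal sketch of the same formula used in the SQL query of this analysis, written in pandas.
The helper name wellbeing_index and the two example rows are illustrative assumptions, not part of the dataset.
The final rescaling step (raw + 0.5, clipped to [0, 1]) is one possible way to map the raw score onto a 0-1 scale; it is not taken from the original analysis.
'''
```python
import pandas as pd


def wellbeing_index(df: pd.DataFrame) -> pd.Series:
    """Raw DWI: mean of four normalized terms (two penalties, two bonuses)."""
    return (
        -df["daily_screen_time_min"] / 600.0
        - df["negative_interactions_count"] / 10.0
        + df["sleep_hours"] / 9.0
        + df["physical_activity_min"] / 60.0
    ) / 4


# Two hypothetical users: worst case and best case within the normalization caps
example = pd.DataFrame({
    "daily_screen_time_min": [600, 0],
    "negative_interactions_count": [10, 0],
    "sleep_hours": [0.0, 9.0],
    "physical_activity_min": [0, 60],
})

raw = wellbeing_index(example)
scaled = (raw + 0.5).clip(0, 1)  # assumed rescaling, see note above

print(raw.tolist())     # [-0.5, 0.5]
print(scaled.tolist())  # [0.0, 1.0]
```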

# Average well‑being index by age group and gender (SQL)
SELECT
    CASE
        WHEN age BETWEEN 18 AND 25 THEN '18–25'
        WHEN age BETWEEN 26 AND 35 THEN '26–35'
        WHEN age BETWEEN 36 AND 50 THEN '36–50'
        WHEN age > 50 THEN '51+'
        ELSE 'Under 18 / unknown'  -- keeps minors and missing ages out of the 51+ bucket
    END AS age_group,
    gender,
    AVG(
        (
            (daily_screen_time_min / 600.0 * -1) +
            (negative_interactions_count / 10.0 * -1) +
            (sleep_hours / 9.0) +
            (physical_activity_min / 60.0)
        ) / 4
    ) AS wellbeing_index,
    COUNT(*) AS user_count
FROM mental_health_social_media_correct
GROUP BY age_group, gender
ORDER BY age_group, gender;
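
# Sketch: the same grouping in pandas
'''
For readers who prefer to stay in Python, the aggregation above can be approximated with pd.cut and groupby.
This is a sketch under stated assumptions: the five rows and their wellbeing_index values are made up for illustration,
and ages outside the bins (for example under 18) are left unassigned rather than placed in the last group.
'''
```python
import pandas as pd

# Hypothetical rows; column names follow the dataset used in this analysis
df = pd.DataFrame({
    "age": [19, 24, 30, 42, 60],
    "gender": ["Female", "Female", "Male", "Other", "Male"],
    "wellbeing_index": [0.10, 0.20, 0.25, 0.40, 0.30],
})

# Right-closed bins: (17, 25], (25, 35], (35, 50], (50, inf)
bins = [17, 25, 35, 50, float("inf")]
labels = ["18–25", "26–35", "36–50", "51+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Mean index and number of users per (age_group, gender) combination
summary = (
    df.groupby(["age_group", "gender"], observed=True)
      .agg(wellbeing_index=("wellbeing_index", "mean"),
           user_count=("age", "size"))
      .reset_index()
)
print(summary)
```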

# Findings:
'''
- Age effect dominates: the youngest group (18–25) has the lowest average index, while mid‑adults (36–50) have the highest.
- Older adults (51+) also score relatively well, though below the 36–50 group.
- Gender differences are minimal: within each age group, the Female, Male, and Other categories have very similar scores.
Of the two grouping variables examined here, age is the one more strongly associated with digital well‑being; gender plays only a minor role.
'''

# What are the main predictive factors of mental_state?
'''We apply an explanatory model, a multinomial logistic regression, to evaluate how behavioral and demographic factors influence the 
probability of belonging to each category (Healthy, At_Risk, or Stressed). The fitted coefficients show which factors make each mental state more probable 
according to the model, which points to the strongest predictors of mental_state.
'''
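
# Sketch: how a multinomial logistic regression turns features into class probabilities
'''
The model computes one linear score per class and passes the scores through a softmax, so the class probabilities sum to 1.
The coefficients below are hypothetical values chosen for illustration (two standardized features: screen time, sleep); they are not the fitted model.
'''
```python
import numpy as np

classes = ["At_Risk", "Healthy", "Stressed"]

# Hypothetical coefficients: one row per class, one column per standardized feature
coef = np.array([
    [0.0, 0.0],    # At_Risk
    [-1.0, 1.0],   # Healthy: less screen time, more sleep
    [1.0, -1.0],   # Stressed: more screen time, less sleep
])
intercept = np.zeros(3)


def predict_proba(x: np.ndarray) -> np.ndarray:
    """Softmax over the per-class linear scores."""
    scores = coef @ x + intercept
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


# A user one standard deviation above average screen time and one below average sleep
x = np.array([1.0, -1.0])
probs = predict_proba(x)
print(dict(zip(classes, probs.round(3))))
```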

# Why use Python
'''
The analysis of predictive factors for mental_state could not be implemented directly in SQL. 
Instead, Python was used with libraries such as pandas and scikit‑learn, which provide the necessary tools 
to build explanatory models, train and test them, and interpret the coefficients. 
This allowed us to identify which variables make each mental state more probable according to the model.
'''

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv("/kaggle/input/mental-health-social-media-correct/mental_health_social_media_correct.csv")

# Select explanatory variables
X = df[[
    "daily_screen_time_min",        # screen time
    "negative_interactions_count",  # negative interactions
    "sleep_hours",                  # sleep
    "physical_activity_min",        # physical activity
    "age",                          # age
    "gender",                       # categorical, will need encoding
    "platform"                      # categorical, will need encoding
]]

# Encode categorical variables (gender, platform)
X = pd.get_dummies(X, drop_first=True)

# Target variable
y = df["mental_state"]

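# Train/test split (done before scaling so the scaler never sees the test data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)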

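# Multinomial logistic regression
# (with the lbfgs solver, scikit-learn fits a multinomial model by default for a multi-class target;
#  the explicit multi_class argument is deprecated in recent releases)
model = LogisticRegression(solver="lbfgs", max_iter=1000)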
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# Report includes:
'''
    precision: share of the predictions for a class that are correct.
    recall: share of the true cases of a class that the model identifies.
    f1-score: harmonic mean of precision and recall.
    support: number of true test examples in each class.
'''
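
# Sketch: computing the metrics by hand
'''
A short worked example shows how a class can have perfect precision and poor recall at the same time.
The counts are hypothetical (19 true cases, 7 detected, no false alarms); they were chosen to reproduce a precision of 1.00 with a recall of 0.37
and are not taken from the actual confusion matrix.
'''
```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-class precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical rare class: 7 true positives, 0 false positives, 12 missed cases
p, r, f1 = precision_recall_f1(tp=7, fp=0, fn=12)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.37 f1=0.54
```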

# Inspect coefficients
'''
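    Features were standardized, so each coefficient refers to a one‑standard‑deviation increase in that variable, holding the others fixed.
    For the dummy‑coded gender and platform columns, coefficients are relative to the reference category dropped by drop_first=True.
    Positive coefficient → increasing the variable makes this class more probable according to the model.
    Negative coefficient → increasing the variable makes this class less probable according to the model.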
'''
coefficients = pd.DataFrame(model.coef_, columns=X.columns, index=model.classes_)
print(coefficients)
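
# Sketch: ranking features by the Stressed-versus-Healthy contrast
'''
In a softmax model, the difference between two classes' coefficients gives the change in the log-odds of one class versus the other
for a one-standard-deviation increase in a feature. The sketch below applies this to the Healthy and Stressed coefficients reported in the Results
of this analysis; the variable names contrast and ranked are illustrative, and the At_Risk row and the dummy columns are left out for brevity.
'''
```python
import pandas as pd

# Coefficients as reported in the Results of this analysis (Healthy and Stressed rows only)
coefficients = pd.DataFrame(
    {
        "daily_screen_time_min": [-2.79, 2.51],
        "negative_interactions_count": [-1.62, 1.98],
        "sleep_hours": [2.19, -2.18],
        "physical_activity_min": [2.52, -2.35],
        "age": [0.05, 0.30],
    },
    index=["Healthy", "Stressed"],
)

# Change in log-odds of Stressed versus Healthy per one-standard-deviation increase
contrast = (coefficients.loc["Stressed"] - coefficients.loc["Healthy"]).round(2)

# Order features by the absolute size of that contrast, largest first
ranked = contrast.reindex(contrast.abs().sort_values(ascending=False).index)
print(ranked)
```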

'''
Results:
Overall accuracy = 0.99 (99%). Because the classes are heavily imbalanced, this figure mostly reflects performance on the dominant Stressed class, so the per‑class results matter more:
- Healthy: precision 0.95, recall 0.97 → almost all “Healthy” cases correctly identified.
- Stressed: precision 0.99, recall 1.00 → nearly perfect prediction.
- At_Risk: precision 1.00 but recall 0.37 → reliable when predicted, but about 63% of true cases are missed (likely due to class imbalance: only 19 cases vs. 1381 “Stressed”).

Main predictive factors:
- Screen time:
      Positive for Stressed (+2.51) → more screen time = more stress.
      Negative for Healthy (−2.79) → excessive screen time reduces probability of being Healthy.
- Negative interactions:
      Positive for Stressed (+1.98) → more negative interactions increase stress.
      Negative for Healthy (−1.62) → they reduce mental health.
- Sleep hours:
      Positive for Healthy (+2.19) → more sleep = better mental health.
      Negative for Stressed (−2.18) → lack of sleep = more stress.
- Physical activity:
      Positive for Healthy (+2.52) → physical activity protects mental health.
      Negative for Stressed (−2.35) → inactivity favors stress.
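- Age:
      Positive for Stressed (+0.30) → with the other variables held fixed, older users are slightly more likely to be classified as Stressed (a small effect next to the behavioral factors).
      Slightly positive for Healthy (+0.05) → weak effect.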
- Platforms (dummy‑coded; coefficients are relative to the reference platform dropped by drop_first=True):
      Instagram, Snapchat, TikTok → positive for Stressed (+2.00, +2.29, +1.86) → associated with stress.
      YouTube → strongly positive for Stressed (+2.50), strongly negative for Healthy (−1.95).
      WhatsApp → slightly positive for Healthy (+0.02), negative for Stressed (−0.23).
      Fast‑paced platforms (TikTok, Snapchat, Instagram, YouTube) are linked to stress, while WhatsApp is close to neutral or mildly protective.

Protective factors (Healthy): sufficient sleep, physical activity, low screen time, few negative interactions.
Risk factors (Stressed): high screen time, many negative interactions, lack of sleep, low physical activity, intensive use of TikTok/Snapchat/Instagram/YouTube.
At_Risk: difficult to predict due to very few cases in the dataset → low recall.

Conclusion
This combined SQL and Python analysis suggests that age is more strongly associated with digital well‑being than gender, while behavioral factors such as screen time, 
negative interactions, sleep, and physical activity are the strongest predictors of mental_state. Platform choice also plays a role, with fast‑paced social media 
environments associated with stress. These are associations in the data rather than proven causes, but they point to the importance of balanced digital habits and lifestyle factors for healthier mental states.
'''
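
# Sketch: one way to address the low At_Risk recall
'''
The low At_Risk recall reported above is attributed to class imbalance. One common mitigation, not used in the original run, is to stratify the
train/test split and reweight the classes. The sketch below is an assumption-laden illustration on synthetic data generated with make_classification,
not on the real dataset, so its scores say nothing about mental_state. Wrapping the scaler in a pipeline also ensures it is fitted on the training data only.
'''
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class data standing in for the real dataset
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.02, 0.28, 0.70],
    random_state=42,
)

# stratify=y keeps the rare class represented in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# class_weight="balanced" weights each class inversely to its frequency
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```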