## Exploratory Data Analysis (EDA)

### Business Context:
Explore the relationship between screen time and mental health. Your goal is to understand
the structure of the dataset, perform analysis, build predictive models, and interpret the
results. You need treat the mental_health_score either as a continuous variable (regression)
or convert it into a binary classification target:


### Packages and Libraries

In [2]:
import pandas as pd 
import pathlib
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

#https://medium.com/@prosun.csedu/polynomialfeatures-is-a-preprocessing-tool-provided-by-the-scikit-learn-library-in-python-that-is-84118adea049

In [3]:
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import time
import numpy as np

### Read data

In [20]:
def read_file(file_name : str )  -> pd.DataFrame:
    """
    summary
    """
    try:
        dir_folder = pathlib.Path().cwd()
        print(dir_folder)
        file_path  = dir_folder
        print(file_path)
        df = pd.read_csv(os.path.join(file_path/file_name))
        return df
    except FileNotFoundError:
        print(f"Error: The file at '{file_name}' was not found.")
        raise
    except Exception as e:
         print(f"An error occurred: {e}")
        
        
        
df = read_file('digital_diet_mental_health.csv')      
  
df.head()   





c:\Abdelouaheb\perso\Data_science_2024_projects\2025\machine_learning_project_Ames_House_Price\test_solution
c:\Abdelouaheb\perso\Data_science_2024_projects\2025\machine_learning_project_Ames_House_Price\test_solution


Unnamed: 0,user_id,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,user_1,51,Female,4.8,3.4,1.3,1.6,1.6,4.1,2.0,...,10,0.7,Urban,32,1,1,125.2,13,15,4.0
1,user_2,64,Male,3.9,3.5,1.8,0.9,2.0,2.7,3.1,...,6,4.3,Suburban,75,0,1,150.4,19,18,6.5
2,user_3,41,Other,10.5,2.1,2.6,0.7,2.2,3.0,2.8,...,5,3.1,Suburban,22,0,0,187.9,7,3,6.9
3,user_4,27,Other,8.8,0.0,0.0,0.7,2.5,3.3,1.6,...,5,0.0,Rural,22,0,1,73.6,7,2,4.8
4,user_5,55,Male,5.9,1.7,1.1,1.5,1.6,1.1,3.6,...,7,3.0,Urban,64,1,1,217.5,8,10,0.0


### Data Cleaning

#### Handling missing values

In [21]:
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_dataframe(df, strategy_num="mean", strategy_cat="most_frequent"):
    """
    Imputes missing values in a DataFrame separately for numerical and categorical columns.

    Parameters:
    df (pd.DataFrame): Input DataFrame.
    strategy_num (str): Strategy for imputing numerical columns (default="mean").
    strategy_cat (str): Strategy for imputing categorical columns (default="most_frequent").

    Returns:
    pd.DataFrame: DataFrame with imputed values.
    """
    df = df.copy()

    # Separate numerical and categorical columns
    num_cols = df.select_dtypes(include=["number"]).columns
    cat_cols = df.select_dtypes(exclude=["number"]).columns

    # Impute numerical columns
    if len(num_cols) > 0:
        num_imputer = SimpleImputer(strategy=strategy_num)
        df[num_cols] = num_imputer.fit_transform(df[num_cols])

    # Impute categorical columns
    if len(cat_cols) > 0:
        cat_imputer = SimpleImputer(strategy=strategy_cat)
        df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

    return df


#### Removing duplicates rows


In [22]:
def remove_duplicates(df, subset=None, keep='first', inplace=False):
    """
    Removes duplicate rows from the DataFrame and provides a summary of duplicates.
    
    Parameters:
        df (pd.DataFrame): The DataFrame from which to remove duplicates.
        subset (list): List of columns to consider for duplicate checking. 
                       If None, checks all columns.
        keep (str): Which duplicates to keep. Options:
            - 'first': Keep the first occurrence (default).
            - 'last': Keep the last occurrence.
            - 'none': Drop all duplicates.
        inplace (bool): If True, modifies the original DataFrame. 
                        If False, returns a new DataFrame.
        
    Returns:
        pd.DataFrame: DataFrame with duplicates removed (if inplace=False).
    """
    if keep not in ['first', 'last', 'none']:
        raise ValueError("keep must be one of 'first', 'last', or 'none'.")
    
    # Count duplicates before removal
    total_rows = len(df)
    duplicate_rows = df.duplicated(subset=subset, keep=False).sum()
    percentage_duplicates = (duplicate_rows / total_rows) * 100
    
    print(f"Total Rows: {total_rows}")
    print(f"Duplicate Rows: {duplicate_rows} ({percentage_duplicates:.2f}%)")
    
    if duplicate_rows == 0:
        print("No duplicates found. No rows removed.")
        return df if not inplace else None
    
    # Handle duplicate removal
    if keep == 'none':
        # Drop all duplicates and keep only unique rows
        duplicated_mask = df.duplicated(subset=subset, keep=False)
        result = df[~duplicated_mask]
    else:
        # Use pandas built-in drop_duplicates
        result = df.drop_duplicates(subset=subset, keep=keep)
    
    # Count duplicates after removal
    remaining_rows = len(result)
    rows_removed = total_rows - remaining_rows
    
    print(f"Rows Removed: {rows_removed}")
    print(f"Remaining Rows: {remaining_rows}")
    
    if rows_removed > 0:
        print("Duplicates successfully removed.")
    else:
        print("No duplicates were removed.")
    
    if inplace:
        df.drop_duplicates(subset=subset, keep=keep, inplace=True)
    else:
        return result


In [23]:
dm=remove_duplicates(df)

Total Rows: 2000
Duplicate Rows: 0 (0.00%)
No duplicates found. No rows removed.


#### Handles outliers

In [24]:
def handle_outliers(df, numerical_columns=None, method='IQR', verbose=True):
    """
    Handles outliers in continuous numerical variables using the specified method.
    
    Parameters:
        df (pd.DataFrame): The DataFrame to process.
        numerical_columns (list): List of numerical columns to check for outliers. If None, auto-selects all numerical columns.
        method (str): Method to handle outliers ('IQR' or 'Z-Score'). Default is 'IQR'.
        verbose (bool): If True, displays outlier statistics.
        
    Returns:
        pd.DataFrame: DataFrame with outliers handled.
    """
    # Select numerical columns if not provided
    if numerical_columns is None:
        numerical_columns = df.select_dtypes(include=['number']).columns
    
    # Iterate over each numerical column
    for col in numerical_columns:
        if col in df.columns:
            if method == 'IQR':
                # Calculate Q1, Q3, and IQR
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                
                # Count and handle outliers
                outliers = ((df[col] < lower_bound) | (df[col] > upper_bound))
                num_outliers = outliers.sum()
                
                # Option 1: Capping
                df.loc[df[col] < lower_bound, col] = lower_bound
                df.loc[df[col] > upper_bound, col] = upper_bound
                
                if verbose:
                    print(f"[INFO] Outliers handled in '{col}' using IQR:")
                    print(f"         - Number of Outliers: {num_outliers}")
                    print(f"         - Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
            
            elif method == 'Z-Score':
                # Alternative method (if needed)
                mean = df[col].mean()
                std_dev = df[col].std()
                threshold = 3  # Standard deviation threshold
                
                # Count and handle outliers
                outliers = ((df[col] < (mean - threshold * std_dev)) | (df[col] > (mean + threshold * std_dev)))
                num_outliers = outliers.sum()
                
                # Option 1: Capping
                lower_bound = mean - threshold * std_dev
                upper_bound = mean + threshold * std_dev
                df.loc[df[col] < lower_bound, col] = lower_bound
                df.loc[df[col] > upper_bound, col] = upper_bound
                
                if verbose:
                    print(f"[INFO] Outliers handled in '{col}' using Z-Score:")
                    print(f"         - Number of Outliers: {num_outliers}")
                    print(f"         - Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
    
    return df

In [25]:
dc=handle_outliers(dm, numerical_columns=None, method='IQR', verbose=True)

[INFO] Outliers handled in 'age' using IQR:
         - Number of Outliers: 0
         - Lower Bound: -11.5, Upper Bound: 88.5
[INFO] Outliers handled in 'daily_screen_time_hours' using IQR:
         - Number of Outliers: 18
         - Lower Bound: 0.7625000000000002, Upper Bound: 11.2625
[INFO] Outliers handled in 'phone_usage_hours' using IQR:
         - Number of Outliers: 6
         - Lower Bound: -1.0, Upper Bound: 7.0
[INFO] Outliers handled in 'laptop_usage_hours' using IQR:
         - Number of Outliers: 7
         - Lower Bound: -0.8, Upper Bound: 4.800000000000001
[INFO] Outliers handled in 'tablet_usage_hours' using IQR:
         - Number of Outliers: 4
         - Lower Bound: -0.45000000000000007, Upper Bound: 2.35
[INFO] Outliers handled in 'tv_usage_hours' using IQR:
         - Number of Outliers: 6
         - Lower Bound: -1.3, Upper Bound: 4.300000000000001
[INFO] Outliers handled in 'social_media_hours' using IQR:
         - Number of Outliers: 9
         - Lower Bound:

  df.loc[df[col] < lower_bound, col] = lower_bound
  df.loc[df[col] < lower_bound, col] = lower_bound
  df.loc[df[col] < lower_bound, col] = lower_bound
  df.loc[df[col] < lower_bound, col] = lower_bound
  df.loc[df[col] < lower_bound, col] = lower_bound
  df.loc[df[col] < lower_bound, col] = lower_bound
  df.loc[df[col] < lower_bound, col] = lower_bound


#### Data type conversions

In [26]:
dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 25 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   user_id                           2000 non-null   object 
 1   age                               2000 non-null   float64
 2   gender                            2000 non-null   object 
 3   daily_screen_time_hours           2000 non-null   float64
 4   phone_usage_hours                 2000 non-null   float64
 5   laptop_usage_hours                2000 non-null   float64
 6   tablet_usage_hours                2000 non-null   float64
 7   tv_usage_hours                    2000 non-null   float64
 8   social_media_hours                2000 non-null   float64
 9   work_related_hours                2000 non-null   float64
 10  entertainment_hours               2000 non-null   float64
 11  gaming_hours                      2000 non-null   float64
 12  sleep_

In [27]:
import pandas as pd

def get_object_columns(df):
    """
    Cette fonction retourne la liste des colonnes de type 'object' dans un DataFrame.
    
    Paramètres :
    df (pd.DataFrame) : Le DataFrame à analyser.
    
    Retour :
    list : Liste des noms de colonnes de type 'object'.
    """
    return df.select_dtypes(include='object').columns.tolist()

# Exemple d'utilisation
# df = pd.read_csv('ton_fichier.csv')
object_columns = get_object_columns(dc)
print(object_columns)


['user_id', 'gender', 'location_type']


In [28]:
dc = dc.drop('user_id', axis=1)

In [13]:
def label_encode_columns(df, columns=object_columns, verbose=True):
    """
    Performs Label Encoding on specified categorical columns.
    
    Parameters:
        df (pd.DataFrame): The DataFrame to encode.
        columns (list): List of column names to be label encoded.
        verbose (bool): If True, displays the label mappings.
        
    Returns:
        pd.DataFrame: DataFrame with encoded columns.
    """
    # Loop through the specified columns and encode them
    for col in columns:
        if col in df.columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            
            # Display label mapping
            if verbose:
                mapping = dict(zip(le.classes_, le.transform(le.classes_)))
                print(f"\n[INFO] Label Encoding for '{col}': {mapping}")
        else:
            print(f"[WARNING] Column '{col}' not found in DataFrame.")
    
    return df



In [29]:
dc=label_encode_columns(dc, columns=object_columns, verbose=True)



[INFO] Label Encoding for 'gender': {'Female': 0, 'Male': 1, 'Other': 2}

[INFO] Label Encoding for 'location_type': {'Rural': 0, 'Suburban': 1, 'Urban': 2}


In [30]:
dc

Unnamed: 0,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,entertainment_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,51.0,0,4.8,3.4,1.3,1.6,1.6,4.1,2.0,1.00,...,10.0,0.7,2,32.0,1.0,1.0,125.2,13,15,4.0
1,64.0,1,3.9,3.5,1.8,0.9,2.0,2.7,3.1,1.00,...,6.0,4.3,1,75.0,0.0,1.0,150.4,19,18,6.5
2,41.0,2,10.5,2.1,2.6,0.7,2.2,3.0,2.8,4.10,...,5.0,3.1,1,22.0,0.0,0.0,187.9,7,3,6.9
3,27.0,2,8.8,0.0,0.0,0.7,2.5,3.3,1.6,1.30,...,5.0,0.0,0,22.0,0.0,1.0,73.6,7,2,4.8
4,55.0,1,5.9,1.7,1.1,1.5,1.6,1.1,3.6,0.80,...,7.0,3.0,2,64.0,1.0,1.0,217.5,8,10,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,58.0,0,5.6,4.0,2.5,0.3,1.5,1.1,1.2,2.10,...,9.0,0.0,2,62.0,0.0,1.0,164.9,20,17,4.9
1996,62.0,0,3.9,3.1,1.0,1.5,1.1,2.7,4.1,5.85,...,8.0,2.7,2,29.0,0.0,0.0,172.6,15,15,25.5
1997,64.0,0,7.4,3.0,0.0,1.4,0.9,0.8,2.6,4.10,...,4.0,6.5,2,54.0,1.0,0.0,101.3,1,20,9.5
1998,19.0,1,4.2,4.4,2.3,0.9,1.4,1.7,1.2,2.00,...,8.0,2.6,2,28.0,0.0,0.0,123.7,1,11,13.4


#### Target Variable

In [None]:
# Create a binary target based on the median of &#39;mental_health_score&#39;
median_score = 30
dc['mental_health_binary'] = (dc['mental_health_score'] >= median_score).astype(int)

In [32]:
dc = dc.drop('mental_health_score', axis=1)

In [34]:
from sklearn.model_selection import train_test_split
x= dc.drop(columns='mental_health_binary')
y= dc['mental_health_binary']

#### Data split


In [37]:
x_t,x_te,y_t,y_te= train_test_split(x,y,test_size=.25,random_state=20,stratify=y)

#### Pipeline

In [39]:

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform only the selected columns
x_t = scaler.fit_transform(x_t)
x_te = scaler.fit_transform(x_te)
#x_t[columns_to_scale] = scaler.fit_transform(x_t[columns_to_scale])
#x_te[columns_to_scale] = scaler.fit_transform(x_te[columns_to_scale])

#### Classification Evaluation Function 


In [40]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix, classification_report
)
import time

def evaluate_classification_model(model, model_name, x_t, y_t, x_te, y_te):
    print(f"\n===== {model_name} =====")

    # Train model
    training_start = time.perf_counter()
    model.fit(x_t, y_t)
    training_end = time.perf_counter()

    best_model = model.best_estimator_ if hasattr(model, "best_estimator_") else model

    # Predict
    prediction_start = time.perf_counter()
    y_pred_train = best_model.predict(x_t)
    y_pred_test = best_model.predict(x_te)
    prediction_end = time.perf_counter()

    # If model supports predict_proba, compute AUC
    y_proba_test = best_model.predict_proba(x_te)[:, 1] if hasattr(best_model, "predict_proba") else None

    # Metrics
    train_acc = accuracy_score(y_t, y_pred_train)
    test_acc = accuracy_score(y_te, y_pred_test)

    train_f1 = f1_score(y_t, y_pred_train)
    test_f1 = f1_score(y_te, y_pred_test)

    test_auc = roc_auc_score(y_te, y_proba_test) if y_proba_test is not None else "N/A"

    train_time = training_end - training_start
    prediction_time = prediction_end - prediction_start

    # Output
    print("\nBest Parameters:", model.best_params_ if hasattr(model, "best_params_") else "Default")

    print("\nTraining Set:")
    print(f"  - Accuracy: {train_acc:.4f}")
    print(f"  - F1 Score: {train_f1:.4f}")
    print(f"  - Training Time: {train_time:.4f} seconds")

    print("\nTesting Set:")
    print(f"  - Accuracy: {test_acc:.4f}")
    print(f"  - F1 Score: {test_f1:.4f}")
    print(f"  - ROC AUC: {test_auc}")
    print(f"  - Prediction Time: {prediction_time:.5f} seconds")

    print("\nClassification Report:")
    print(classification_report(y_te, y_pred_test))

    print("Confusion Matrix:")
    print(confusion_matrix(y_te, y_pred_test))

#### SVM

In [41]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# SVM Classifier
svm_params = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
svm_clf = GridSearchCV(SVC(probability=True), svm_params, cv=3, scoring='f1', n_jobs=-1)
evaluate_classification_model(svm_clf, "SVM Classifier", x_t, y_t, x_te, y_te)

# KNN Classifier
knn_params = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan or Euclidean
}
knn_clf = GridSearchCV(KNeighborsClassifier(), knn_params, cv=3, scoring='f1', n_jobs=-1)
evaluate_classification_model(knn_clf, "KNN Classifier", x_t, y_t, x_te, y_te)

# Decision Tree Classifier
tree_params = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
tree_clf = GridSearchCV(DecisionTreeClassifier(random_state=42), tree_params, cv=3, scoring='f1', n_jobs=-1)
evaluate_classification_model(tree_clf, "Decision Tree Classifier", x_t, y_t, x_te, y_te)

# Random Forest Classifier
rf_params = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
rf_clf = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=3, scoring='f1', n_jobs=-1)
evaluate_classification_model(rf_clf, "Random Forest Classifier", x_t, y_t, x_te, y_te)



===== SVM Classifier =====

Best Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}

Training Set:
  - Accuracy: 0.8367
  - F1 Score: 0.9111
  - Training Time: 5.1720 seconds

Testing Set:
  - Accuracy: 0.8380
  - F1 Score: 0.9119
  - ROC AUC: 0.5377883850437549
  - Prediction Time: 0.02112 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        81
           1       0.84      1.00      0.91       419

    accuracy                           0.84       500
   macro avg       0.42      0.50      0.46       500
weighted avg       0.70      0.84      0.76       500

Confusion Matrix:
[[  0  81]
 [  0 419]]

===== KNN Classifier =====


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Best Parameters: {'n_neighbors': 7, 'p': 2, 'weights': 'uniform'}

Training Set:
  - Accuracy: 0.8433
  - F1 Score: 0.9140
  - Training Time: 0.3224 seconds

Testing Set:
  - Accuracy: 0.8280
  - F1 Score: 0.9059
  - ROC AUC: 0.3983617666990778
  - Prediction Time: 0.28627 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        81
           1       0.84      0.99      0.91       419

    accuracy                           0.83       500
   macro avg       0.42      0.49      0.45       500
weighted avg       0.70      0.83      0.76       500

Confusion Matrix:
[[  0  81]
 [  5 414]]

===== Decision Tree Classifier =====

Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5}

Training Set:
  - Accuracy: 0.8553
  - F1 Score: 0.9204
  - Training Time: 0.1565 seconds

Testing Set:
  - Accuracy: 0.8200
  - F1 Score: 0.9007
  - ROC AUC: 0.4911606116856713
  - Prediction Time: 0.001

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
