## Exploratory Data Analysis (EDA)

### Business Context:
Explore the relationship between screen time and mental health. Your goal is to understand
the structure of the dataset, perform analysis, build predictive models, and interpret the
results. You need treat the mental_health_score either as a continuous variable (regression)
or convert it into a binary classification target:


### Packages and Libraries

In [1]:
import pandas as pd 
import pathlib
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

#https://medium.com/@prosun.csedu/polynomialfeatures-is-a-preprocessing-tool-provided-by-the-scikit-learn-library-in-python-that-is-84118adea049

In [49]:
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import time
import numpy as np

### Read data

In [4]:
def read_file(file_name : str )  -> pd.DataFrame:
    """
    summary
    """
    try:
        dir_folder = pathlib.Path().cwd()
        print(dir_folder)
        file_path  = dir_folder
        print(file_path)
        df = pd.read_csv(os.path.join(file_path/file_name))
        return df
    except FileNotFoundError:
        print(f"Error: The file at '{file_name}' was not found.")
        raise
    except Exception as e:
         print(f"An error occurred: {e}")
        
        
        
df = read_file('digital_diet_mental_health.csv')      
  
df.head()   





c:\Abdelouaheb\perso\Data_science_2024_projects\2025\machine_learning_project_Ames_House_Price\test_solution
c:\Abdelouaheb\perso\Data_science_2024_projects\2025\machine_learning_project_Ames_House_Price\test_solution


Unnamed: 0,user_id,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,user_1,51,Female,4.8,3.4,1.3,1.6,1.6,4.1,2.0,...,10,0.7,Urban,32,1,1,125.2,13,15,4.0
1,user_2,64,Male,3.9,3.5,1.8,0.9,2.0,2.7,3.1,...,6,4.3,Suburban,75,0,1,150.4,19,18,6.5
2,user_3,41,Other,10.5,2.1,2.6,0.7,2.2,3.0,2.8,...,5,3.1,Suburban,22,0,0,187.9,7,3,6.9
3,user_4,27,Other,8.8,0.0,0.0,0.7,2.5,3.3,1.6,...,5,0.0,Rural,22,0,1,73.6,7,2,4.8
4,user_5,55,Male,5.9,1.7,1.1,1.5,1.6,1.1,3.6,...,7,3.0,Urban,64,1,1,217.5,8,10,0.0


### Data structure

In [5]:
def data_diagnostic(df):
        print("#"*50)
        print(df.info())
        print("#"*50)
        print("The number of total rows  {x: .0f} ".format(x=df.shape[0]))
        print("The number of total variables {x: .0f} ".format(x=df.shape[1]))
        print("The variables names {x:} ".format(x=list(df.columns.values)))

        column_headers =list(df.columns.values)
        qualitative_columns = [col for col in column_headers if df[col].dtype=="object"]
        quantitative_columns = [col for col in column_headers if df[col].dtype in ['int64', 'float64']]

        print("The qualitative variables {x:} ".format(x=qualitative_columns))
        print("The quantitative variables {x:} ".format(x=quantitative_columns))
        print("#"*50)
        print("Total number missing value {x:} ".format(x=df.isnull().sum()))

In [6]:
df

Unnamed: 0,user_id,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,user_1,51,Female,4.8,3.4,1.3,1.6,1.6,4.1,2.0,...,10,0.7,Urban,32,1,1,125.2,13,15,4.0
1,user_2,64,Male,3.9,3.5,1.8,0.9,2.0,2.7,3.1,...,6,4.3,Suburban,75,0,1,150.4,19,18,6.5
2,user_3,41,Other,10.5,2.1,2.6,0.7,2.2,3.0,2.8,...,5,3.1,Suburban,22,0,0,187.9,7,3,6.9
3,user_4,27,Other,8.8,0.0,0.0,0.7,2.5,3.3,1.6,...,5,0.0,Rural,22,0,1,73.6,7,2,4.8
4,user_5,55,Male,5.9,1.7,1.1,1.5,1.6,1.1,3.6,...,7,3.0,Urban,64,1,1,217.5,8,10,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,user_1996,58,Female,5.6,4.0,2.5,0.3,1.5,1.1,1.2,...,9,0.0,Urban,62,0,1,164.9,20,17,4.9
1996,user_1997,62,Female,3.9,3.1,1.0,1.5,1.1,2.7,4.1,...,8,2.7,Urban,29,0,0,172.6,15,15,25.5
1997,user_1998,64,Female,7.4,3.0,0.0,1.4,0.9,0.8,2.6,...,4,6.5,Urban,54,1,0,101.3,1,20,9.5
1998,user_1999,19,Male,4.2,4.4,2.3,0.9,1.4,1.7,1.2,...,8,2.6,Urban,28,0,0,123.7,1,11,13.4


In [7]:
data_diagnostic(df)

##################################################
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 25 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   user_id                           2000 non-null   object 
 1   age                               2000 non-null   int64  
 2   gender                            2000 non-null   object 
 3   daily_screen_time_hours           2000 non-null   float64
 4   phone_usage_hours                 2000 non-null   float64
 5   laptop_usage_hours                2000 non-null   float64
 6   tablet_usage_hours                2000 non-null   float64
 7   tv_usage_hours                    2000 non-null   float64
 8   social_media_hours                2000 non-null   float64
 9   work_related_hours                2000 non-null   float64
 10  entertainment_hours               2000 non-null   float64
 11  gaming_hours      

In [8]:
def na(df, percent = True, verbose = True):
    srs = df.isna().sum()[df.isna().sum() > 0]
    srs = srs.sort_values(ascending=False)
    if percent:
       if verbose:
           print('% of NaNs in df:')
       return srs / df.shape[0]
    else:
        if verbose:
           print('# of NaNs in df:')
        return srs

na(df, False)

# of NaNs in df:


Series([], dtype: int64)

### Analyse univary

#### Numerical varaiables

In [9]:
dtypes = df.dtypes.to_frame().reset_index()
dtypes.columns = ['col', 'dtype']
print('Df dtypes:')
dtypes.groupby('dtype').size()

Df dtypes:


dtype
int64       9
float64    13
object      3
dtype: int64

In [10]:
def numeric_analysis(df):
        
    
        return print(df.describe().T)

In [11]:
numeric_analysis(df)

                                   count       mean        std   min    25%  \
age                               2000.0   38.80550  14.929203  13.0   26.0   
daily_screen_time_hours           2000.0    6.02560   1.974123   0.0    4.7   
phone_usage_hours                 2000.0    3.02370   1.449399   0.0    2.0   
laptop_usage_hours                2000.0    1.99995   0.997949   0.0    1.3   
tablet_usage_hours                2000.0    0.99565   0.492714   0.0    0.6   
tv_usage_hours                    2000.0    1.50370   0.959003   0.0    0.8   
social_media_hours                2000.0    2.03920   1.133435   0.0    1.2   
work_related_hours                2000.0    2.01025   1.116111   0.0    1.2   
entertainment_hours               2000.0    2.46735   1.236860   0.0    1.6   
gaming_hours                      2000.0    1.27950   0.894500   0.0    0.6   
sleep_duration_hours              2000.0    6.53755   1.203856   3.0    5.7   
sleep_quality                     2000.0    5.56700 

#### Numerical varaiables visualization

In [12]:
def univariate_analysis(df, base_folder="univariate_analysis"):
        # Ensure the base folder exists
        if not os.path.exists(base_folder):
            os.makedirs(base_folder)
        
        # Select numeric columns
        numeric_cols = df.select_dtypes(include=['number']).columns
        
        for col in numeric_cols:
            # Create a folder for the analysis
            col_folder = os.path.join(base_folder, col)
            if not os.path.exists(col_folder):
                os.makedirs(col_folder)
            
            print(f"\nPerforming Univariate Analysis for: {col}")
            
            # Create a single figure with 2x2 layout
            fig, axes = plt.subplots(2, 2, figsize=(12, 10))
            fig.suptitle(f'Univariate Analysis for {col}', fontsize=16)
            
            # Bar Chart
            sns.barplot(
                x=df[col].value_counts().index, 
                y=df[col].value_counts().values, 
                palette="viridis", 
                ax=axes[0, 0]
            )
            axes[0, 0].set_title('Bar Chart')
            axes[0, 0].set_xlabel(col)
            axes[0, 0].set_ylabel('Frequency')
            
            # Box Plot
            sns.boxplot(y=df[col], palette="viridis", ax=axes[0, 1])
            axes[0, 1].set_title('Box Plot')
            axes[0, 1].set_xlabel(col)
            
            # Density Plot
            sns.kdeplot(df[col], fill=True, color="blue", alpha=0.6, ax=axes[1, 0])
            axes[1, 0].set_title('Density Plot')
            axes[1, 0].set_xlabel(col)
            axes[1, 0].set_ylabel('Density')
            
            # Histogram
            sns.histplot(df[col], kde=False, color="green", ax=axes[1, 1])
            axes[1, 1].set_title('Histogram')
            axes[1, 1].set_xlabel(col)
            axes[1, 1].set_ylabel('Frequency')
            
            # Adjust layout
            plt.tight_layout(rect=[0, 0, 1, 0.96])  # Make room for the main title
            
            # Save the combined plot
            combined_plot_path = os.path.join(col_folder, f"{col}_univariate_analysis.png")
            plt.savefig(combined_plot_path, bbox_inches='tight')
            plt.close()
            
            print(f"Combined plots for {col} saved in: {combined_plot_path}")

In [None]:
univariate_analysis(df, base_folder="univariate_analysis")

#### Categorical varaiables

In [14]:
def categorical_analysis(df):
    return print(df.select_dtypes(include='object').describe().T)

In [15]:
categorical_analysis(df)
# Percentage

              count unique     top freq
user_id        2000   2000  user_1    1
gender         2000      3  Female  935
location_type  2000      3   Urban  999


#### Categorical varaiables visualization

In [16]:
def univariate_analysis_categorical(df, base_folder="univariate_analysis_categorical"):
    # Ensure the base folder exists
    if not os.path.exists(base_folder):
        os.makedirs(base_folder)
    
    # Select categorical columns
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    
    for col in categorical_cols:
        # Create a folder for the analysis
        col_folder = os.path.join(base_folder, col)
        if not os.path.exists(col_folder):
            os.makedirs(col_folder)
        
        print(f"\nPerforming Univariate Analysis for: {col}")
        
        # Create a figure with 1x2 layout
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        fig.suptitle(f'Univariate Analysis for {col}', fontsize=16)
        
        # Bar Plot
        sns.countplot(x=df[col], palette="viridis", ax=axes[0])
        axes[0].set_title('Count Plot')
        axes[0].set_xlabel(col)
        axes[0].set_ylabel('Frequency')
        axes[0].tick_params(axis='x', rotation=45)
        
        # Pie Chart
        df[col].value_counts().plot.pie(
            autopct='%1.1f%%', 
            colors=sns.color_palette("viridis", len(df[col].unique())), 
            ax=axes[1], 
            startangle=90
        )
        axes[1].set_title('Pie Chart')
        axes[1].set_ylabel('')  # Remove y-label for better visualization
        
        # Adjust layout
        plt.tight_layout(rect=[0, 0, 1, 0.96])  # Make room for the main title
        
        # Save the combined plot
        combined_plot_path = os.path.join(col_folder, f"{col}_univariate_analysis.png")
        plt.savefig(combined_plot_path, bbox_inches='tight')
        plt.close()
        
        print(f"Combined plots for {col} saved in: {combined_plot_path}")


In [None]:
univariate_analysis_categorical(df, base_folder="univariate_analysis_categorical")

#### Correlation matrix

In [18]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.stats import pearsonr
import numpy as np

def correlation_and_significance(df, base_folder="correlation_analysis"):
    """
    Generates a correlation matrix heatmap with significance markers.
    Outputs:
        - Correlation Matrix with significance points (red = not significant, green = significant)
    
    Parameters:
        df (pd.DataFrame): The DataFrame to analyze.
        base_folder (str): Directory to save the analysis images.
    """
    # Ensure the base folder exists
    if not os.path.exists(base_folder):
        os.makedirs(base_folder)
    
    # Select only numerical columns
    numeric_df = df.select_dtypes(include=['number'])
    cols = numeric_df.columns
    
    # Calculate the correlation matrix
    corr_matrix = numeric_df.corr()
    
    # Initialize p-value matrix
    p_values = pd.DataFrame(np.ones((len(cols), len(cols))), columns=cols, index=cols)
    
    # Calculate p-values for each pair of variables
    for row in cols:
        for col in cols:
            if row != col:
                _, p_value = pearsonr(numeric_df[row], numeric_df[col])
                p_values.loc[row, col] = p_value
    
    # --- Plot Correlation Matrix with Significance ---
    plt.figure(figsize=(14, 12))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
    
    # Add significance markers
    for i in range(len(cols)):
        for j in range(len(cols)):
            if i != j:  # Skip diagonal
                p_val = p_values.iloc[i, j]
                x = j + 0.5
                y = i + 0.5
                
                # Significant if p-value < 0.05
                if p_val < 0.05:
                    plt.plot(x, y, 'o', color='green')  # Green for significant
                else:
                    plt.plot(x, y, 'o', color='red')    # Red for not significant
    
    plt.title('Correlation Matrix with Significance (Numerical Variables)', fontsize=18)
    plt.xticks(rotation=45)
    plt.yticks(rotation=0)
    plt.tight_layout()
    
    # Save Correlation Matrix with Significance
    corr_plot_path = os.path.join(base_folder, "correlation_matrix_significance.png")
    plt.savefig(corr_plot_path, bbox_inches='tight')
    plt.close()
    
    print(f"Correlation matrix with significance saved in: {corr_plot_path}")


In [19]:
# Assuming df is your DataFrame
correlation_and_significance(df=df)

Correlation matrix with significance saved in: correlation_analysis\correlation_matrix_significance.png


### Data Cleaning

#### Handling missing values

In [20]:
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_dataframe(df, strategy_num="mean", strategy_cat="most_frequent"):
    """
    Imputes missing values in a DataFrame separately for numerical and categorical columns.

    Parameters:
    df (pd.DataFrame): Input DataFrame.
    strategy_num (str): Strategy for imputing numerical columns (default="mean").
    strategy_cat (str): Strategy for imputing categorical columns (default="most_frequent").

    Returns:
    pd.DataFrame: DataFrame with imputed values.
    """
    df = df.copy()

    # Separate numerical and categorical columns
    num_cols = df.select_dtypes(include=["number"]).columns
    cat_cols = df.select_dtypes(exclude=["number"]).columns

    # Impute numerical columns
    if len(num_cols) > 0:
        num_imputer = SimpleImputer(strategy=strategy_num)
        df[num_cols] = num_imputer.fit_transform(df[num_cols])

    # Impute categorical columns
    if len(cat_cols) > 0:
        cat_imputer = SimpleImputer(strategy=strategy_cat)
        df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

    return df


#### Removing duplicates rows


In [21]:
def remove_duplicates(df, subset=None, keep='first', inplace=False):
    """
    Removes duplicate rows from the DataFrame and provides a summary of duplicates.
    
    Parameters:
        df (pd.DataFrame): The DataFrame from which to remove duplicates.
        subset (list): List of columns to consider for duplicate checking. 
                       If None, checks all columns.
        keep (str): Which duplicates to keep. Options:
            - 'first': Keep the first occurrence (default).
            - 'last': Keep the last occurrence.
            - 'none': Drop all duplicates.
        inplace (bool): If True, modifies the original DataFrame. 
                        If False, returns a new DataFrame.
        
    Returns:
        pd.DataFrame: DataFrame with duplicates removed (if inplace=False).
    """
    if keep not in ['first', 'last', 'none']:
        raise ValueError("keep must be one of 'first', 'last', or 'none'.")
    
    # Count duplicates before removal
    total_rows = len(df)
    duplicate_rows = df.duplicated(subset=subset, keep=False).sum()
    percentage_duplicates = (duplicate_rows / total_rows) * 100
    
    print(f"Total Rows: {total_rows}")
    print(f"Duplicate Rows: {duplicate_rows} ({percentage_duplicates:.2f}%)")
    
    if duplicate_rows == 0:
        print("No duplicates found. No rows removed.")
        return df if not inplace else None
    
    # Handle duplicate removal
    if keep == 'none':
        # Drop all duplicates and keep only unique rows
        duplicated_mask = df.duplicated(subset=subset, keep=False)
        result = df[~duplicated_mask]
    else:
        # Use pandas built-in drop_duplicates
        result = df.drop_duplicates(subset=subset, keep=keep)
    
    # Count duplicates after removal
    remaining_rows = len(result)
    rows_removed = total_rows - remaining_rows
    
    print(f"Rows Removed: {rows_removed}")
    print(f"Remaining Rows: {remaining_rows}")
    
    if rows_removed > 0:
        print("Duplicates successfully removed.")
    else:
        print("No duplicates were removed.")
    
    if inplace:
        df.drop_duplicates(subset=subset, keep=keep, inplace=True)
    else:
        return result


In [22]:
dm=remove_duplicates(df)

Total Rows: 2000
Duplicate Rows: 0 (0.00%)
No duplicates found. No rows removed.


#### Handles outliers

In [23]:
def handle_outliers(df, numerical_columns=None, method='IQR', verbose=True):
    """
    Handles outliers in continuous numerical variables using the specified method.
    
    Parameters:
        df (pd.DataFrame): The DataFrame to process.
        numerical_columns (list): List of numerical columns to check for outliers. If None, auto-selects all numerical columns.
        method (str): Method to handle outliers ('IQR' or 'Z-Score'). Default is 'IQR'.
        verbose (bool): If True, displays outlier statistics.
        
    Returns:
        pd.DataFrame: DataFrame with outliers handled.
    """
    # Select numerical columns if not provided
    if numerical_columns is None:
        numerical_columns = df.select_dtypes(include=['number']).columns
    
    # Iterate over each numerical column
    for col in numerical_columns:
        if col in df.columns:
            if method == 'IQR':
                # Calculate Q1, Q3, and IQR
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                
                # Count and handle outliers
                outliers = ((df[col] < lower_bound) | (df[col] > upper_bound))
                num_outliers = outliers.sum()
                
                # Option 1: Capping
                df.loc[df[col] < lower_bound, col] = lower_bound
                df.loc[df[col] > upper_bound, col] = upper_bound
                
                if verbose:
                    print(f"[INFO] Outliers handled in '{col}' using IQR:")
                    print(f"         - Number of Outliers: {num_outliers}")
                    print(f"         - Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
            
            elif method == 'Z-Score':
                # Alternative method (if needed)
                mean = df[col].mean()
                std_dev = df[col].std()
                threshold = 3  # Standard deviation threshold
                
                # Count and handle outliers
                outliers = ((df[col] < (mean - threshold * std_dev)) | (df[col] > (mean + threshold * std_dev)))
                num_outliers = outliers.sum()
                
                # Option 1: Capping
                lower_bound = mean - threshold * std_dev
                upper_bound = mean + threshold * std_dev
                df.loc[df[col] < lower_bound, col] = lower_bound
                df.loc[df[col] > upper_bound, col] = upper_bound
                
                if verbose:
                    print(f"[INFO] Outliers handled in '{col}' using Z-Score:")
                    print(f"         - Number of Outliers: {num_outliers}")
                    print(f"         - Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
    
    return df

In [None]:
dc=handle_outliers(dm, numerical_columns=None, method='IQR', verbose=True)

#### Data type conversions

In [28]:
dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 25 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   user_id                           2000 non-null   object 
 1   age                               2000 non-null   float64
 2   gender                            2000 non-null   object 
 3   daily_screen_time_hours           2000 non-null   float64
 4   phone_usage_hours                 2000 non-null   float64
 5   laptop_usage_hours                2000 non-null   float64
 6   tablet_usage_hours                2000 non-null   float64
 7   tv_usage_hours                    2000 non-null   float64
 8   social_media_hours                2000 non-null   float64
 9   work_related_hours                2000 non-null   float64
 10  entertainment_hours               2000 non-null   float64
 11  gaming_hours                      2000 non-null   float64
 12  sleep_

In [31]:
import pandas as pd

def get_object_columns(df):
    """
    Cette fonction retourne la liste des colonnes de type 'object' dans un DataFrame.
    
    Paramètres :
    df (pd.DataFrame) : Le DataFrame à analyser.
    
    Retour :
    list : Liste des noms de colonnes de type 'object'.
    """
    return df.select_dtypes(include='object').columns.tolist()

# Exemple d'utilisation
# df = pd.read_csv('ton_fichier.csv')
object_columns = get_object_columns(dc)
print(object_columns)


['gender', 'location_type']


In [30]:
dc = dc.drop('user_id', axis=1)

In [32]:
def label_encode_columns(df, columns=object_columns, verbose=True):
    """
    Performs Label Encoding on specified categorical columns.
    
    Parameters:
        df (pd.DataFrame): The DataFrame to encode.
        columns (list): List of column names to be label encoded.
        verbose (bool): If True, displays the label mappings.
        
    Returns:
        pd.DataFrame: DataFrame with encoded columns.
    """
    # Loop through the specified columns and encode them
    for col in columns:
        if col in df.columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            
            # Display label mapping
            if verbose:
                mapping = dict(zip(le.classes_, le.transform(le.classes_)))
                print(f"\n[INFO] Label Encoding for '{col}': {mapping}")
        else:
            print(f"[WARNING] Column '{col}' not found in DataFrame.")
    
    return df



In [33]:
dc=label_encode_columns(dc, columns=object_columns, verbose=True)



[INFO] Label Encoding for 'gender': {'Female': 0, 'Male': 1, 'Other': 2}

[INFO] Label Encoding for 'location_type': {'Rural': 0, 'Suburban': 1, 'Urban': 2}


In [34]:
dc

Unnamed: 0,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,entertainment_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,51.0,0,4.8,3.4,1.3,1.6,1.6,4.1,2.0,1.00,...,10.0,0.7,2,32.0,1.0,1.0,125.2,13,15,4.0
1,64.0,1,3.9,3.5,1.8,0.9,2.0,2.7,3.1,1.00,...,6.0,4.3,1,75.0,0.0,1.0,150.4,19,18,6.5
2,41.0,2,10.5,2.1,2.6,0.7,2.2,3.0,2.8,4.10,...,5.0,3.1,1,22.0,0.0,0.0,187.9,7,3,6.9
3,27.0,2,8.8,0.0,0.0,0.7,2.5,3.3,1.6,1.30,...,5.0,0.0,0,22.0,0.0,1.0,73.6,7,2,4.8
4,55.0,1,5.9,1.7,1.1,1.5,1.6,1.1,3.6,0.80,...,7.0,3.0,2,64.0,1.0,1.0,217.5,8,10,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,58.0,0,5.6,4.0,2.5,0.3,1.5,1.1,1.2,2.10,...,9.0,0.0,2,62.0,0.0,1.0,164.9,20,17,4.9
1996,62.0,0,3.9,3.1,1.0,1.5,1.1,2.7,4.1,5.85,...,8.0,2.7,2,29.0,0.0,0.0,172.6,15,15,25.5
1997,64.0,0,7.4,3.0,0.0,1.4,0.9,0.8,2.6,4.10,...,4.0,6.5,2,54.0,1.0,0.0,101.3,1,20,9.5
1998,19.0,1,4.2,4.4,2.3,0.9,1.4,1.7,1.2,2.00,...,8.0,2.6,2,28.0,0.0,0.0,123.7,1,11,13.4


#### Target Variable

In [35]:
from sklearn.model_selection import train_test_split
x= dc.drop(columns='mental_health_score')
y= dc['mental_health_score']

#### Data split


In [36]:
x_t,x_te,y_t,y_te= train_test_split(x,y,test_size=.25,random_state=20)

#### Pipeline

In [37]:

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform only the selected columns
x_t = scaler.fit_transform(x_t)
x_te = scaler.fit_transform(x_te)
#x_t[columns_to_scale] = scaler.fit_transform(x_t[columns_to_scale])
#x_te[columns_to_scale] = scaler.fit_transform(x_te[columns_to_scale])

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import time
import numpy as np

#### LinearRegression

In [39]:
# Initialize Linear Regression model
lr = LinearRegression()

# Timing the training process
training_start = time.perf_counter()
model = lr.fit(x_t, y_t)
training_end = time.perf_counter()

# Timing the prediction process
prediction_start = time.perf_counter()
y_pred_train = model.predict(x_t)
y_pred_test = model.predict(x_te)
prediction_end = time.perf_counter()

# Calculate KPIs
# Root Mean Squared Error (RMSE)
rmse_train = np.sqrt(mean_squared_error(y_t, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_te, y_pred_test))

# Mean Absolute Error (MAE)
mae_train = mean_absolute_error(y_t, y_pred_train)
mae_test = mean_absolute_error(y_te, y_pred_test)

# R-squared score
r2_train = r2_score(y_t, y_pred_train)
r2_test = r2_score(y_te, y_pred_test)

# Time metrics
train_time = training_end - training_start
prediction_time = prediction_end - prediction_start

# Output metrics
print("Linear Regression Model Performance:")
print("\nTraining Set:")
print(f"  - RMSE: {rmse_train:.4f}")
print(f"  - MAE: {mae_train:.4f}")
print(f"  - R^2 Score: {r2_train:.4f}")
print(f"  - Training Time: {train_time:.4f} seconds")

print("\nTesting Set:")
print(f"  - RMSE: {rmse_test:.4f}")
print(f"  - MAE: {mae_test:.4f}")
print(f"  - R^2 Score: {r2_test:.4f}")
print(f"  - Prediction Time: {prediction_time:.5f} seconds")

Linear Regression Model Performance:

Training Set:
  - RMSE: 17.3544
  - MAE: 14.9520
  - R^2 Score: 0.0103
  - Training Time: 0.0162 seconds

Testing Set:
  - RMSE: 17.9485
  - MAE: 15.7154
  - R^2 Score: -0.0134
  - Prediction Time: 0.00134 seconds


##### Interpreation

In [40]:
print(y_te.mean())
print(y_te.std())

49.294
17.847033288319626


Target Variable: mental_health_score
Mean: 49.294

Standard Deviation: 17.847

Model Evaluation Metric: Root Mean Squared Error (RMSE)
RMSE Value: 23939.3614

Interpretation:
1. Magnitude and Practical Relevance
The RMSE is extremely large compared to both the mean (49.294) and standard deviation (17.847) of the target variable.

This suggests that the model's predictions are highly inaccurate. An RMSE of 23939.3614, when the average value of the target is only around 49, means predictions are orders of magnitude off.

2. Relative Error Analysis
To assess model performance, compare RMSE with the mean:

Relative RMSE = 
23939.36
49.294
≈
485.5
49.294
23939.36
​
 ≈485.5
This means the model's predictions deviate by roughly 485 times the average mental health score, which is unacceptably high.

Similarly, comparing RMSE to the standard deviation:

Relative to standard deviation = 
23939.36
17.847
≈
1341.0
17.847
23939.36
​
 ≈1341.0
This indicates the error is far beyond the natural variability in the data.

Conclusion:
The model's RMSE is too high relative to both the mean and standard deviation of the mental_health_score.

This likely points to a serious issue in model training, target leakage, incorrect units, or data preprocessing errors.

You should double-check:

Whether the target variable was scaled or transformed.

That the model is not predicting an unrelated or mismatched target.

That the metric wasn't accidentally computed on a misaligned scale (e.g., predicting dollars instead of a 0–100 score).



#### Ridge and Lasso regression

##### Cross Validation

In [41]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler


# Define models and parameter grids for GridSearchCV
ridge = Ridge()
lasso = Lasso()

ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
lasso_params = {'alpha': [0.01, 0.1, 1, 10, 100,200,300,400,500]}
ridge_grid = GridSearchCV(estimator=ridge, param_grid=ridge_params, 
                          cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# Lasso Regression Hypertuning
lasso_grid = GridSearchCV(estimator=lasso, param_grid=lasso_params, 
                          cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

##### Ridge and Lasso Regression

In [42]:
ridge_grid.fit(x_t, y_t)
best_ridge = ridge_grid.best_estimator_
print("\nBest Ridge Model:", ridge_grid.best_params_)


lasso_grid.fit(x_t, y_t)
best_lasso = lasso_grid.best_estimator_
print("\nBest Lasso Model:", lasso_grid.best_params_)


Best Ridge Model: {'alpha': 100}

Best Lasso Model: {'alpha': 10}


In [43]:
# Evaluate the best models
best_models = {'Ridge': best_ridge, 'Lasso': best_lasso}

for model_name, model in best_models.items():
    print(f"\nEvaluating {model_name} Model:")

    # Timing the training process
    training_start = time.perf_counter()
    model.fit(x_t, y_t)
    training_end = time.perf_counter()
    
    # Timing the prediction process
    prediction_start = time.perf_counter()
    y_pred_train = model.predict(x_t)
    y_pred_test = model.predict(x_te)
    prediction_end = time.perf_counter()

    # Calculate KPIs
    rmse_train = np.sqrt(mean_squared_error(y_t, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_te, y_pred_test))
    mae_train = mean_absolute_error(y_t, y_pred_train)
    mae_test = mean_absolute_error(y_te, y_pred_test)
    r2_train = r2_score(y_t, y_pred_train)
    r2_test = r2_score(y_te, y_pred_test)

    # Time metrics
    train_time = training_end - training_start
    prediction_time = prediction_end - prediction_start

    # Output metrics
    print("\nTraining Set:")
    print(f"  - RMSE: {rmse_train:.4f}")
    print(f"  - MAE: {mae_train:.4f}")
    print(f"  - R^2 Score: {r2_train:.4f}")
    print(f"  - Training Time: {train_time:.4f} seconds")

    print("\nTesting Set:")
    print(f"  - RMSE: {rmse_test:.4f}")
    print(f"  - MAE: {mae_test:.4f}")
    print(f"  - R^2 Score: {r2_test:.4f}")
    print(f"  - Prediction Time: {prediction_time:.5f} seconds")


Evaluating Ridge Model:

Training Set:
  - RMSE: 17.3548
  - MAE: 14.9554
  - R^2 Score: 0.0102
  - Training Time: 0.0015 seconds

Testing Set:
  - RMSE: 17.9350
  - MAE: 15.7046
  - R^2 Score: -0.0119
  - Prediction Time: 0.00069 seconds

Evaluating Lasso Model:

Training Set:
  - RMSE: 17.4440
  - MAE: 15.0515
  - R^2 Score: 0.0000
  - Training Time: 0.0022 seconds

Testing Set:
  - RMSE: 17.8355
  - MAE: 15.6291
  - R^2 Score: -0.0007
  - Prediction Time: 0.00100 seconds


#### Method to run all the models seperatlty

In [47]:
def evaluate_model(model, model_name, x_t, y_t, x_te, y_te):
    print(f"\n===== {model_name} =====")

    # Training
    training_start = time.perf_counter()
    model.fit(x_t, y_t)
    training_end = time.perf_counter()

    # Prediction
    prediction_start = time.perf_counter()
    y_pred_train = model.predict(x_t)
    y_pred_test = model.predict(x_te)
    prediction_end = time.perf_counter()

    # Metrics
    rmse_train = np.sqrt(mean_squared_error(y_t, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_te, y_pred_test))

    mae_train = mean_absolute_error(y_t, y_pred_train)
    mae_test = mean_absolute_error(y_te, y_pred_test)

    r2_train = r2_score(y_t, y_pred_train)
    r2_test = r2_score(y_te, y_pred_test)

    train_time = training_end - training_start
    prediction_time = prediction_end - prediction_start

    # Output
    print("\nTraining Set:")
    print(f"  - RMSE: {rmse_train:.4f}")
    print(f"  - MAE: {mae_train:.4f}")
    print(f"  - R^2 Score: {r2_train:.4f}")
    print(f"  - Training Time: {train_time:.4f} seconds")

    print("\nTesting Set:")
    print(f"  - RMSE: {rmse_test:.4f}")
    print(f"  - MAE: {mae_test:.4f}")
    print(f"  - R^2 Score: {r2_test:.4f}")
    print(f"  - Prediction Time: {prediction_time:.5f} seconds")


#### SVM

In [50]:
# SVM
svm_model = SVR()
evaluate_model(svm_model, "Support Vector Machine", x_t, y_t, x_te, y_te)


===== Support Vector Machine =====

Training Set:
  - RMSE: 16.8954
  - MAE: 14.3899
  - R^2 Score: 0.0619
  - Training Time: 0.1069 seconds

Testing Set:
  - RMSE: 17.9688
  - MAE: 15.7236
  - R^2 Score: -0.0157
  - Prediction Time: 0.23183 seconds


#### KNN

In [51]:
# KNN
knn_model = KNeighborsRegressor()
evaluate_model(knn_model, "K-Nearest Neighbors", x_t, y_t, x_te, y_te)




===== K-Nearest Neighbors =====

Training Set:
  - RMSE: 15.4634
  - MAE: 12.8645
  - R^2 Score: 0.2142
  - Training Time: 0.0012 seconds

Testing Set:
  - RMSE: 19.8890
  - MAE: 17.0400
  - R^2 Score: -0.2444
  - Prediction Time: 0.23565 seconds


#### Decision Tree

In [52]:
# Decision Tree
tree_model = DecisionTreeRegressor(random_state=42)
evaluate_model(tree_model, "Decision Tree", x_t, y_t, x_te, y_te)




===== Decision Tree =====

Training Set:
  - RMSE: 0.0000
  - MAE: 0.0000
  - R^2 Score: 1.0000
  - Training Time: 0.0322 seconds

Testing Set:
  - RMSE: 26.1623
  - MAE: 21.4300
  - R^2 Score: -1.1532
  - Prediction Time: 0.00073 seconds


#### Random Forest

In [53]:
# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
evaluate_model(rf_model, "Random Forest", x_t, y_t, x_te, y_te)


===== Random Forest =====

Training Set:
  - RMSE: 6.6018
  - MAE: 5.5907
  - R^2 Score: 0.8568
  - Training Time: 1.5907 seconds

Testing Set:
  - RMSE: 18.2149
  - MAE: 15.8943
  - R^2 Score: -0.0437
  - Prediction Time: 0.02850 seconds


### Evaluation Function with Best Estimator

In [54]:
def evaluate_model(model, model_name, x_t, y_t, x_te, y_te):
    print(f"\n===== {model_name} =====")

    # Training
    training_start = time.perf_counter()
    model.fit(x_t, y_t)
    training_end = time.perf_counter()

    best_model = model.best_estimator_ if hasattr(model, "best_estimator_") else model

    # Prediction
    prediction_start = time.perf_counter()
    y_pred_train = best_model.predict(x_t)
    y_pred_test = best_model.predict(x_te)
    prediction_end = time.perf_counter()

    # Metrics
    rmse_train = np.sqrt(mean_squared_error(y_t, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_te, y_pred_test))

    mae_train = mean_absolute_error(y_t, y_pred_train)
    mae_test = mean_absolute_error(y_te, y_pred_test)

    r2_train = r2_score(y_t, y_pred_train)
    r2_test = r2_score(y_te, y_pred_test)

    train_time = training_end - training_start
    prediction_time = prediction_end - prediction_start

    # Output
    print("\nBest Parameters:", model.best_params_ if hasattr(model, "best_params_") else "Default")
    print("\nTraining Set:")
    print(f"  - RMSE: {rmse_train:.4f}")
    print(f"  - MAE: {mae_train:.4f}")
    print(f"  - R^2 Score: {r2_train:.4f}")
    print(f"  - Training Time: {train_time:.4f} seconds")

    print("\nTesting Set:")
    print(f"  - RMSE: {rmse_test:.4f}")
    print(f"  - MAE: {mae_test:.4f}")
    print(f"  - R^2 Score: {r2_test:.4f}")
    print(f"  - Prediction Time: {prediction_time:.5f} seconds")


In [57]:
# SVM
svm_params = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}
svm_grid = GridSearchCV(SVR(), svm_params, cv=3, scoring='neg_root_mean_squared_log_error', n_jobs=-1)
evaluate_model(svm_grid, "Support Vector Machine (SVM)", x_t, y_t, x_te, y_te)

# KNN
knn_params = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan or Euclidean
}
knn_grid = GridSearchCV(KNeighborsRegressor(), knn_params, cv=3, scoring='neg_root_mean_squared_log_error', n_jobs=-1)
evaluate_model(knn_grid, "K-Nearest Neighbors (KNN)", x_t, y_t, x_te, y_te)

# Decision Tree
tree_params = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
tree_grid = GridSearchCV(DecisionTreeRegressor(random_state=42), tree_params, cv=3, scoring='neg_root_mean_squared_log_error', n_jobs=-1)
evaluate_model(tree_grid, "Decision Tree", x_t, y_t, x_te, y_te)

# Random Forest
rf_params = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42), rf_params, cv=3, scoring='neg_root_mean_squared_log_error', n_jobs=-1)
evaluate_model(rf_grid, "Random Forest", x_t, y_t, x_te, y_te)



===== Support Vector Machine (SVM) =====

Best Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'rbf'}

Training Set:
  - RMSE: 17.3779
  - MAE: 14.9739
  - R^2 Score: 0.0076
  - Training Time: 0.7760 seconds

Testing Set:
  - RMSE: 17.8509
  - MAE: 15.6494
  - R^2 Score: -0.0024
  - Prediction Time: 0.24657 seconds

===== K-Nearest Neighbors (KNN) =====

Best Parameters: {'n_neighbors': 7, 'p': 1, 'weights': 'distance'}

Training Set:
  - RMSE: 0.0000
  - MAE: 0.0000
  - R^2 Score: 1.0000
  - Training Time: 0.0670 seconds

Testing Set:
  - RMSE: 18.8566
  - MAE: 16.1676
  - R^2 Score: -0.1186
  - Prediction Time: 0.02000 seconds

===== Decision Tree =====

Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10}

Training Set:
  - RMSE: 16.5455
  - MAE: 14.0127
  - R^2 Score: 0.1004
  - Training Time: 0.1815 seconds

Testing Set:
  - RMSE: 18.8191
  - MAE: 16.3478
  - R^2 Score: -0.1141
  - Prediction Time: 0.00041 seconds

===== Random Forest =====

Best P