<div style="
    background-color: #1f77b4; 
    color: white; 
    padding: 20px; 
    border-radius: 10px; 
    text-align: center; 
    font-family: 'Arial', sans-serif;
    box-shadow: 2px 2px 12px rgba(0,0,0,0.3);
">
    <h1 style="margin: 0; font-size: 36px;">MENTOPRED</h1>
    <p style="margin: 5px 0 0; font-size: 18px; color: #dce6f1;">
        Predicting Mental Health Treatment in the Tech Industry
    </p>
</div>


# 1.0 Problem Statement
*In tech workplaces, mental health is increasingly recognized as crucial, yet it remains stigmatized. The goal of this project is to build a model that predicts whether an individual has sought treatment for a mental health condition, based on their survey responses about demographics, workplace culture, attitudes, and support systems. The model should be accurate, interpretable, and useful for informing policy or workplace interventions.*

# 2.0 Dataset Setup
- *Downloading the data to the specified folders*

In [2]:
# Dataset download script
import os
import requests

url = "https://drive.google.com/uc?id=1oudxpap1iR8Xg7GBpAzB6KVPzIHA5RD4"
save_path = "../data/raw/mental_health_survey.csv"

# Create the target folder if it does not exist yet
os.makedirs(os.path.dirname(save_path), exist_ok=True)

response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()  # Raise an error if the request failed
with open(save_path, 'wb') as file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:  # Filter out keep-alive chunks
            file.write(chunk)

print(f"Dataset downloaded and saved to {save_path}")

Dataset downloaded and saved to ../data/raw/mental_health_survey.csv


# 3.0 Dataset Metadata

In [14]:
# first 5 rows of our dataset
import pandas as pd
df = pd.read_csv(save_path)
df.head()  # Display the first 5 rows

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

## 3.1 Summary stats

In [16]:
df.describe()

Unnamed: 0,Age
count,1259.0
mean,79428150.0
std,2818299000.0
min,-1726.0
25%,27.0
50%,31.0
75%,36.0
max,100000000000.0
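
The summary above reports a minimum age of -1726 and a maximum of 100000000000, which cannot be real ages and which inflate the mean and standard deviation. The cells in this notebook do not show an age-cleaning step, so the following is only a minimal sketch of one option: treat ages outside a plausible range as invalid and replace them with the median of the valid ages. The function name `clean_age` and the bounds 18 and 75 are assumptions chosen for illustration, not values taken from the survey.

```python
import pandas as pd

def clean_age(ages: pd.Series, low: int = 18, high: int = 75) -> pd.Series:
    """Replace ages outside [low, high] with the median of the in-range ages.

    The bounds are illustrative assumptions, not values from the survey.
    """
    valid = ages.between(low, high)        # inclusive on both ends
    median_age = ages[valid].median()      # median of the plausible ages only
    return ages.where(valid, median_age)   # keep valid ages, replace the rest

# Toy example with two implausible entries
toy = pd.Series([37, 44, -1726, 32, 100000000000, 31])
cleaned = clean_age(toy)
print(cleaned.tolist())
```

Dropping the affected rows instead of imputing them is an equally reasonable choice when only a handful of rows are involved.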


## 3.2 Null Values

In [18]:
df.isna().sum().head(20)

Timestamp                      0
Age                            0
Gender                         0
Country                        0
state                        515
self_employed                 18
family_history                 0
treatment                      0
work_interfere               264
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
dtype: int64

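- Among the first 20 columns, three contain null values: `state` (515), `self_employed` (18), and `work_interfere` (264).
- Together these account for 797 null values. Columns beyond the first 20 (such as `comments`) are not covered by this count.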
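
Because `head(20)` only shows the first 20 columns, a summary over every column can be easier to read. The sketch below is one way to do that; `null_summary` is a hypothetical helper, and the toy frame stands in for the survey data.

```python
import pandas as pd

def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns with at least one null."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain nulls, largest first
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

# Toy frame standing in for the survey data
toy = pd.DataFrame({
    "state": ["IL", None, None, "TX"],
    "self_employed": [None, "No", "Yes", "No"],
    "treatment": ["Yes", "No", "No", "Yes"],
})
report = null_summary(toy)
print(report)
```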


In [91]:
# listing the Unique values in each column
from itertools import zip_longest

unique = []
cat_cols = df.select_dtypes(include=['object']).columns

for col in cat_cols[1:-2]:
    unique.append(df[col].unique())

unique_dict = {k:v for k, v in zip(cat_cols[1:-2], unique)}

unique_df = pd.DataFrame(
    zip_longest(*unique_dict.values(), fillvalue="END--"),
    columns=unique_dict.keys()
)


In [92]:
unique_df

Unnamed: 0,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical
0,female,United States,IL,,No,Yes,Often,6-25,No,Yes,...,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes
1,male,Canada,IN,Yes,Yes,No,Rarely,More than 1000,Yes,No,...,Don't know,Don't know,Don't know,Maybe,Yes,No,No,Yes,No,Don't know
2,male-ish,United Kingdom,,No,END--,END--,Never,26-100,END--,END--,...,No,No,Somewhat difficult,Yes,Maybe,Yes,Some of them,Maybe,Yes,No
3,maile,Bulgaria,TX,END--,END--,END--,Sometimes,100-500,END--,END--,...,END--,END--,Very difficult,END--,END--,END--,END--,END--,END--,END--
4,trans-female,France,TN,END--,END--,END--,,1-5,END--,END--,...,END--,END--,Very easy,END--,END--,END--,END--,END--,END--,END--
5,cis female,Portugal,MI,END--,END--,END--,END--,500-1000,END--,END--,...,END--,END--,END--,END--,END--,END--,END--,END--,END--,END--
6,something kinda male?,Netherlands,OH,END--,END--,END--,END--,END--,END--,END--,...,END--,END--,END--,END--,END--,END--,END--,END--,END--,END--
7,cis male,Switzerland,CA,END--,END--,END--,END--,END--,END--,END--,...,END--,END--,END--,END--,END--,END--,END--,END--,END--,END--
8,woman,Poland,CT,END--,END--,END--,END--,END--,END--,END--,...,END--,END--,END--,END--,END--,END--,END--,END--,END--,END--
9,mal,Australia,MD,END--,END--,END--,END--,END--,END--,END--,...,END--,END--,END--,END--,END--,END--,END--,END--,END--,END--


In [93]:
unique_df.to_csv("../data/processed/unique_values.csv", index=False)

# 4.0 Data Cleaning
- Strategy: consolidate `gender` into three categories: `male`, `female`, and `other`.

## 4.1 Standardizing column names

In [95]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.columns

Index(['timestamp', 'age', 'gender', 'country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')

## 4.2 Mapping `gender` typos and variants to the right category

In [96]:
# Map gender typos to three categories: 'male', 'female', 'other'
typo = {
    'Male': 'male', 'M': 'male', 'male': 'male',
    'Female': 'female', 'F': 'female', 'female': 'female',
    'Cis Male': 'male', 'Cis Female': 'female',
    'Man': 'male', 'Woman': 'female',
    'Trans Male': 'male', 'Trans Man': 'male',
    'Trans Female': 'female', 'Trans Woman': 'female',
    'msle': 'male', 'femail': 'female',
    'Malr': 'male', 'Mail': 'male', 'Feamale': 'female',
    'cis male': 'male', 'cis female': 'female',
    'non-binary': 'other', 'Nonbinary': 'other', 'Agender': 'other',
    'Genderqueer': 'other', 'Gender fluid': 'other', 'Enby': 'other',
    'queer': 'other', 'All': 'other', 'other': 'other',
    'A little about you': 'other', 'p': 'other', 'something kinda male?': 'male',
    'Nah': 'other', 'none': 'other', 'fluid': 'other', 'Androgyne': 'other',
    'male leaning androgynous': 'male', 'female (trans)': 'female',
    'male (trans)': 'male', 'female (cis)': 'female', 'male (cis)': 'male',
    'f': 'female', 'm': 'male',
}

df['gender'] = df['gender'].map(lambda x: typo.get(str(x).strip(), 'other'))
df['gender'].value_counts()

gender
male      981
female    243
other      35
Name: count, dtype: int64
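
The lookup above matches dictionary keys exactly after stripping whitespace, so a spelling whose casing differs from the key (for example `MALE`) or a typo that is not listed falls through to `other`. A minimal sketch of a case-insensitive variant follows. The helper names `build_casefolded_lookup` and `normalize_gender` are hypothetical, and `sample_mapping` is a small illustrative subset rather than the full dictionary from the cell above.

```python
def build_casefolded_lookup(mapping):
    """Lowercase and strip every key so lookups ignore casing and padding."""
    return {key.strip().lower(): value for key, value in mapping.items()}

def normalize_gender(raw, lookup, default="other"):
    """Map a raw survey answer to a category, falling back to `default`."""
    return lookup.get(str(raw).strip().lower(), default)

# Illustrative subset of a typo dictionary (assumption, not the full mapping)
sample_mapping = {
    "Male": "male", "M": "male",
    "Cis Female": "female", "Trans Woman": "female",
    "Enby": "other",
}
lookup = build_casefolded_lookup(sample_mapping)

print(normalize_gender("  MALE ", lookup))
print(normalize_gender("cis female", lookup))
print(normalize_gender("unlisted answer", lookup))
```

In the notebook this could be applied with `df['gender'].map(lambda x: normalize_gender(x, lookup))`; doing so would likely shift some rows out of `other`, so the counts should be re-checked afterwards.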

## 4.3 Dropping Unwanted Columns

In [101]:
df = df.drop(columns=['timestamp', 'state'])

## 4.4 Filling NaN Values for `self_employed`

In [104]:
df['self_employed'] = df['self_employed'].fillna('No')

## 4.5 Filling NaN Values for `work_interfere`

In [109]:
df['work_interfere'] = df['work_interfere'].fillna('Unknown')

In [110]:
df['work_interfere'].value_counts()

work_interfere
Sometimes    465
Unknown      264
Never        213
Rarely       173
Often        144
Name: count, dtype: int64

## 4.6 Filling NaN Values for `comments`

In [106]:
df.head()

Unnamed: 0,age,gender,country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,37,female,United States,No,No,Yes,Often,6-25,No,Yes,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,44,male,United States,No,No,No,Rarely,More than 1000,No,No,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,32,male,Canada,No,No,No,Rarely,6-25,No,Yes,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,31,male,United Kingdom,No,Yes,Yes,Often,26-100,No,Yes,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,31,male,United States,No,No,No,Never,100-500,Yes,Yes,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [113]:
df['comments'] = df['comments'].fillna('No comments')


# 5.0 Exporting the Cleaned Dataset

In [114]:
df.to_csv("../data/processed/mental_health_cleaned.csv", index=False)

# 6.0 Creating an Encoded Dataset for Training and Experimentation

In [115]:
# Create a copy of the cleaned dataframe for encoding
encoded_df = df.copy()


In [116]:
# drop the comments column as it won't be used for training
encoded_df = encoded_df.drop(columns=['comments'])

In [117]:
encoded_df.head()

Unnamed: 0,age,gender,country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,female,United States,No,No,Yes,Often,6-25,No,Yes,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,male,United States,No,No,No,Rarely,More than 1000,No,No,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,male,Canada,No,No,No,Rarely,6-25,No,Yes,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,male,United Kingdom,No,Yes,Yes,Often,26-100,No,Yes,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,male,United States,No,No,No,Never,100-500,Yes,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


In [118]:
# Let's analyze the current data structure for encoding
print("Dataset shape:", encoded_df.shape)
print("\nColumn data types:")
print(encoded_df.dtypes)
print("\nTarget variable distribution:")
print(encoded_df['treatment'].value_counts())
print("\nSample of the data:")
encoded_df.head()

Dataset shape: (1259, 24)

Column data types:
age                           int64
gender                       object
country                      object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
dtype: object

Target variable distribution:
treatment
Yes    637
No     622
Name: count, 

Unnamed: 0,age,gender,country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,female,United States,No,No,Yes,Often,6-25,No,Yes,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,male,United States,No,No,No,Rarely,More than 1000,No,No,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,male,Canada,No,No,No,Rarely,6-25,No,Yes,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,male,United Kingdom,No,Yes,Yes,Often,26-100,No,Yes,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,male,United States,No,No,No,Never,100-500,Yes,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


In [119]:
# Examine unique values in categorical columns
categorical_cols = encoded_df.select_dtypes(include=['object']).columns
print("Categorical columns and their unique value counts:")
for col in categorical_cols:
    print(f"\n{col}: {encoded_df[col].nunique()} unique values")
    print(f"Values: {encoded_df[col].unique()[:10]}")  # Show first 10 unique values

Categorical columns and their unique value counts:

gender: 3 unique values
Values: ['female' 'male' 'other']

country: 48 unique values
Values: ['United States' 'Canada' 'United Kingdom' 'Bulgaria' 'France' 'Portugal'
 'Netherlands' 'Switzerland' 'Poland' 'Australia']

self_employed: 2 unique values
Values: ['No' 'Yes']

family_history: 2 unique values
Values: ['No' 'Yes']

treatment: 2 unique values
Values: ['Yes' 'No']

work_interfere: 5 unique values
Values: ['Often' 'Rarely' 'Never' 'Sometimes' 'Unknown']

no_employees: 6 unique values
Values: ['6-25' 'More than 1000' '26-100' '100-500' '1-5' '500-1000']

remote_work: 2 unique values
Values: ['No' 'Yes']

tech_company: 2 unique values
Values: ['Yes' 'No']

benefits: 3 unique values
Values: ['Yes' "Don't know" 'No']

care_options: 3 unique values
Values: ['Not sure' 'No' 'Yes']

wellness_program: 3 unique values
Values: ['No' "Don't know" 'Yes']

seek_help: 3 unique values
Values: ['Yes' "Don't know" 'No']

anonymity: 3 unique valu

## 6.1 Encoding Strategy

Based on the analysis above, here's our encoding approach:

1. **Binary Variables** (2 unique values): Use Label Encoding (0, 1)
   - `self_employed`, `family_history`, `treatment`, `remote_work`, `tech_company`, `obs_consequence`

2. **Ordinal Variables** (meaningful order): Use custom ordinal encoding
   - `work_interfere`: Unknown < Never < Rarely < Sometimes < Often
   - `no_employees`: 1-5 < 6-25 < 26-100 < 100-500 < 500-1000 < More than 1000
   - `leave`: Very difficult < Somewhat difficult < Don't know < Somewhat easy < Very easy

3. **Nominal Variables with low cardinality** (≤5 categories): Use One-Hot Encoding
   - `gender`, `benefits`, `care_options`, `wellness_program`, `seek_help`, `anonymity`
   - `mental_health_consequence`, `phys_health_consequence`, `coworkers`, `supervisor`
   - `mental_health_interview`, `phys_health_interview`, `mental_vs_physical`

4. **High cardinality nominal** (`country`): Use target encoding (mean treatment rate per country, blended with the global mean for countries with fewer than 10 responses)

5. **Numerical variables**: Apply standard scaling

In [120]:
# Import necessary libraries for encoding
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.feature_selection import mutual_info_classif
import warnings
warnings.filterwarnings('ignore')

# Create a copy of the dataframe for encoding
final_encoded_df = encoded_df.copy()
print("Starting encoding process...")
print(f"Original shape: {final_encoded_df.shape}")

Starting encoding process...
Original shape: (1259, 24)


In [121]:
# 1. Label Encoding for Binary Variables
binary_cols = ['self_employed', 'family_history', 'remote_work', 'tech_company', 'obs_consequence']

le = LabelEncoder()
for col in binary_cols:
    final_encoded_df[col] = le.fit_transform(final_encoded_df[col])
    print(f"Label encoded {col}: {dict(zip(le.classes_, le.transform(le.classes_)))}")

# Special handling for treatment (target variable) - ensure 'Yes' = 1, 'No' = 0
final_encoded_df['treatment'] = final_encoded_df['treatment'].map({'Yes': 1, 'No': 0})
print(f"Target variable 'treatment': Yes=1, No=0")

print("\nBinary variables encoded successfully!")

Label encoded self_employed: {'No': np.int64(0), 'Yes': np.int64(1)}
Label encoded family_history: {'No': np.int64(0), 'Yes': np.int64(1)}
Label encoded remote_work: {'No': np.int64(0), 'Yes': np.int64(1)}
Label encoded tech_company: {'No': np.int64(0), 'Yes': np.int64(1)}
Label encoded obs_consequence: {'No': np.int64(0), 'Yes': np.int64(1)}
Target variable 'treatment': Yes=1, No=0

Binary variables encoded successfully!


In [122]:
# 2. Ordinal Encoding for variables with meaningful order

# work_interfere: Unknown < Never < Rarely < Sometimes < Often
work_interfere_mapping = {
    'Unknown': 0, 'Never': 1, 'Rarely': 2, 'Sometimes': 3, 'Often': 4
}
final_encoded_df['work_interfere'] = final_encoded_df['work_interfere'].map(work_interfere_mapping)

# no_employees: 1-5 < 6-25 < 26-100 < 100-500 < 500-1000 < More than 1000
no_employees_mapping = {
    '1-5': 0, '6-25': 1, '26-100': 2, '100-500': 3, '500-1000': 4, 'More than 1000': 5
}
final_encoded_df['no_employees'] = final_encoded_df['no_employees'].map(no_employees_mapping)

# leave: Very difficult < Somewhat difficult < Don't know < Somewhat easy < Very easy
leave_mapping = {
    'Very difficult': 0, 'Somewhat difficult': 1, "Don't know": 2, 
    'Somewhat easy': 3, 'Very easy': 4
}
final_encoded_df['leave'] = final_encoded_df['leave'].map(leave_mapping)

print("Ordinal variables encoded:")
print(f"work_interfere mapping: {work_interfere_mapping}")
print(f"no_employees mapping: {no_employees_mapping}")
print(f"leave mapping: {leave_mapping}")
print("\nOrdinal encoding completed!")

Ordinal variables encoded:
work_interfere mapping: {'Unknown': 0, 'Never': 1, 'Rarely': 2, 'Sometimes': 3, 'Often': 4}
no_employees mapping: {'1-5': 0, '6-25': 1, '26-100': 2, '100-500': 3, '500-1000': 4, 'More than 1000': 5}
leave mapping: {'Very difficult': 0, 'Somewhat difficult': 1, "Don't know": 2, 'Somewhat easy': 3, 'Very easy': 4}

Ordinal encoding completed!
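
`Series.map` with a dictionary silently returns `NaN` for any label that is missing from the mapping, so a misspelled or unexpected category would go unnoticed in the cell above. The sketch below shows one way to fail loudly instead; `map_strict` is a hypothetical helper, and the toy series stands in for the `leave` column.

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Apply an ordinal mapping and raise if any label is not covered."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped values in '{series.name}': {sorted(unmapped)}")
    return series.map(mapping)

leave_mapping = {
    'Very difficult': 0, 'Somewhat difficult': 1, "Don't know": 2,
    'Somewhat easy': 3, 'Very easy': 4,
}

# Toy series standing in for the `leave` column
toy = pd.Series(["Very easy", "Don't know", "Very difficult"], name="leave")
encoded = map_strict(toy, leave_mapping)
print(encoded.tolist())
```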


In [123]:
# 3. One-Hot Encoding for nominal variables with low cardinality
nominal_cols = ['gender', 'benefits', 'care_options', 'wellness_program', 'seek_help', 
                'anonymity', 'mental_health_consequence', 'phys_health_consequence', 
                'coworkers', 'supervisor', 'mental_health_interview', 
                'phys_health_interview', 'mental_vs_physical']

# Apply one-hot encoding
final_encoded_df = pd.get_dummies(final_encoded_df, columns=nominal_cols, prefix=nominal_cols, drop_first=True)

print(f"Applied One-Hot Encoding to {len(nominal_cols)} nominal variables")
print(f"New shape after one-hot encoding: {final_encoded_df.shape}")
print("\nNew columns created:")
new_cols = [col for col in final_encoded_df.columns if any(prefix in col for prefix in nominal_cols)]
print(f"Total new columns: {len(new_cols)}")
for col in new_cols[:10]:  # Show first 10 new columns
    print(f"  - {col}")

Applied One-Hot Encoding to 13 nominal variables
New shape after one-hot encoding: (1259, 37)

New columns created:
Total new columns: 26
  - gender_male
  - gender_other
  - benefits_No
  - benefits_Yes
  - care_options_Not sure
  - care_options_Yes
  - wellness_program_No
  - wellness_program_Yes
  - seek_help_No
  - seek_help_Yes


In [124]:
# 4. Target Encoding for high cardinality variable (country)
# Calculate the mean treatment rate for each country
country_treatment_mean = final_encoded_df.groupby('country')['treatment'].mean()

# Add smoothing to avoid overfitting - use global mean for countries with few samples
global_mean = final_encoded_df['treatment'].mean()
country_counts = final_encoded_df['country'].value_counts()

# Apply smoothing: if country has fewer than 10 samples, blend with global mean
smoothing_factor = 10
country_encoded = {}

for country in country_treatment_mean.index:
    count = country_counts[country]
    if count >= smoothing_factor:
        country_encoded[country] = country_treatment_mean[country]
    else:
        # Blend with global mean
        weight = count / (count + smoothing_factor)
        country_encoded[country] = weight * country_treatment_mean[country] + (1 - weight) * global_mean

# Apply the encoding
final_encoded_df['country_encoded'] = final_encoded_df['country'].map(country_encoded)
final_encoded_df = final_encoded_df.drop('country', axis=1)

print(f"Applied target encoding to 'country' variable")
print(f"Countries with their encoded values (first 10):")
for i, (country, value) in enumerate(list(country_encoded.items())[:10]):
    print(f"  {country}: {value:.4f}")
print(f"Global treatment rate: {global_mean:.4f}")
print(f"Final shape: {final_encoded_df.shape}")

Applied target encoding to 'country' variable
Countries with their encoded values (first 10):
  Australia: 0.6190
  Austria: 0.3892
  Bahamas, The: 0.5509
  Belgium: 0.3787
  Bosnia and Herzegovina: 0.4600
  Brazil: 0.4412
  Bulgaria: 0.5043
  Canada: 0.5139
  China: 0.4600
  Colombia: 0.4216
Global treatment rate: 0.5060
Final shape: (1259, 37)
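
The encoding above is computed from the whole dataset, so rows that are later held out for evaluation also contribute to the country means, which leaks the target into the feature. A minimal sketch of an alternative follows: fit the encoding on the training split only and shrink every country toward the global mean with the formula `(n * mean + m * global_mean) / (n + m)`. This continuous shrinkage is a swapped-in technique, not the threshold rule used in the cell above; the function names and `m = 10` are assumptions.

```python
import pandas as pd

def fit_target_encoding(train: pd.DataFrame, col: str, target: str, m: float = 10.0):
    """Learn a smoothed mean of `target` per category from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; small groups shrink more
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed.to_dict(), global_mean

def apply_target_encoding(frame: pd.DataFrame, col: str, encoding: dict, global_mean: float):
    """Encode a column; categories unseen during fitting get the global mean."""
    return frame[col].map(encoding).fillna(global_mean)

# Toy split standing in for a train/test partition of the survey
train = pd.DataFrame({
    "country": ["US", "US", "US", "US", "CA", "CA"],
    "treatment": [1, 1, 0, 1, 0, 0],
})
test = pd.DataFrame({"country": ["US", "CA", "FR"]})

encoding, global_mean = fit_target_encoding(train, "country", "treatment")
test_encoded = apply_target_encoding(test, "country", encoding, global_mean)
print(test_encoded.round(4).tolist())
```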


In [125]:
# 5. Feature Scaling for numerical variables
numerical_cols = ['age', 'country_encoded']

scaler = StandardScaler()
final_encoded_df[numerical_cols] = scaler.fit_transform(final_encoded_df[numerical_cols])

print("Applied StandardScaler to numerical columns:")
for col in numerical_cols:
    print(f"  - {col}: mean = {final_encoded_df[col].mean():.4f}, std = {final_encoded_df[col].std():.4f}")

print(f"\nFinal dataset shape: {final_encoded_df.shape}")
print(f"All columns are now numerical: {final_encoded_df.dtypes.nunique() <= 2}")  # Counts distinct dtypes; bool, int64 and float64 columns give three, so this prints False

Applied StandardScaler to numerical columns:
  - age: mean = 0.0000, std = 1.0004
  - country_encoded: mean = -0.0000, std = 1.0004

Final dataset shape: (1259, 37)
All columns are now numerical: False
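
The check prints `False` even though every column holds numeric data: it counts distinct dtypes, and the frame appears to contain three of them (`bool` one-hot columns, `int64`, and `float64`). A sketch of a per-column check using `pandas.api.types.is_numeric_dtype`, which treats `bool` as numeric, is shown below; `all_numeric` is a hypothetical helper and the toy frames are illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def all_numeric(frame: pd.DataFrame) -> bool:
    """Return True when every column has a numeric (or boolean) dtype."""
    return all(is_numeric_dtype(dtype) for dtype in frame.dtypes)

# Toy frame with the same three dtypes as the encoded dataset
toy = pd.DataFrame({
    "age": [0.1, -0.3],
    "self_employed": [0, 1],
    "gender_male": [True, False],
})
mixed = toy.assign(country=["Canada", "France"])  # adds a text column

print(all_numeric(toy))
print(all_numeric(mixed))
```

The reported standard deviation of 1.0004 is expected rather than an error: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample version (`ddof=1`), and `sqrt(1259 / 1258)` is approximately 1.0004.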


In [126]:
# Final data validation and summary
print("=== ENCODING SUMMARY ===")
print(f"Original dataset shape: {encoded_df.shape}")
print(f"Final encoded dataset shape: {final_encoded_df.shape}")
print(f"\nFeature expansion: {final_encoded_df.shape[1] - encoded_df.shape[1]} additional features created")

print(f"\nData types in final dataset:")
print(final_encoded_df.dtypes.value_counts())

print(f"\nTarget variable distribution:")
print(final_encoded_df['treatment'].value_counts())

print(f"\nFirst 5 rows of encoded dataset:")
final_encoded_df.head()

=== ENCODING SUMMARY ===
Original dataset shape: (1259, 24)
Final encoded dataset shape: (1259, 37)

Feature expansion: 13 additional features created

Data types in final dataset:
bool       26
int64       9
float64     2
Name: count, dtype: int64

Target variable distribution:
treatment
1    637
0    622
Name: count, dtype: int64

First 5 rows of encoded dataset:


Unnamed: 0,age,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,leave,obs_consequence,...,coworkers_Yes,supervisor_Some of them,supervisor_Yes,mental_health_interview_No,mental_health_interview_Yes,phys_health_interview_No,phys_health_interview_Yes,mental_vs_physical_No,mental_vs_physical_Yes,country_encoded
0,-0.028194,0,0,1,4,1,0,1,3,0,...,False,False,True,True,False,False,False,False,True,0.459198
1,-0.028194,0,0,0,2,5,0,0,2,0,...,False,False,False,True,False,True,False,False,False,0.459198
2,-0.028194,0,0,0,2,1,0,1,1,0,...,True,False,True,False,True,False,True,True,False,-0.049081
3,-0.028194,0,1,1,4,2,0,1,1,1,...,False,False,False,False,False,False,False,True,False,-0.226484
4,-0.028194,0,0,0,1,3,1,1,2,0,...,False,False,True,False,True,False,True,False,False,0.459198


In [127]:
# Export the fully encoded dataset
output_path = "../data/processed/mental_health_encoded.csv"
final_encoded_df.to_csv(output_path, index=False)

print(f"✅ Fully encoded dataset saved to: {output_path}")
print(f"📊 Dataset ready for machine learning!")
print(f"🎯 Target variable: 'treatment' (1=Yes, 0=No)")
print(f"📈 Features: {final_encoded_df.shape[1]-1} (excluding target)")
print(f"📝 Samples: {final_encoded_df.shape[0]}")

# Optional: Display column names for reference
print(f"\n📋 All feature columns:")
feature_cols = [col for col in final_encoded_df.columns if col != 'treatment']
for i, col in enumerate(feature_cols, 1):
    print(f"{i:2d}. {col}")
    
print(f"\nEncoding techniques applied:")
print("   • Label Encoding: Binary variables")
print("   • Ordinal Encoding: Variables with meaningful order")
print("   • One-Hot Encoding: Nominal variables (low cardinality)")
print("   • Target Encoding: Country variable (high cardinality)")
print("   • Standard Scaling: Numerical variables")

✅ Fully encoded dataset saved to: ../data/processed/mental_health_encoded.csv
📊 Dataset ready for machine learning!
🎯 Target variable: 'treatment' (1=Yes, 0=No)
📈 Features: 36 (excluding target)
📝 Samples: 1259

📋 All feature columns:
 1. age
 2. self_employed
 3. family_history
 4. work_interfere
 5. no_employees
 6. remote_work
 7. tech_company
 8. leave
 9. obs_consequence
10. gender_male
11. gender_other
12. benefits_No
13. benefits_Yes
14. care_options_Not sure
15. care_options_Yes
16. wellness_program_No
17. wellness_program_Yes
18. seek_help_No
19. seek_help_Yes
20. anonymity_No
21. anonymity_Yes
22. mental_health_consequence_No
23. mental_health_consequence_Yes
24. phys_health_consequence_No
25. phys_health_consequence_Yes
26. coworkers_Some of them
27. coworkers_Yes
28. supervisor_Some of them
29. supervisor_Yes
30. mental_health_interview_No
31. mental_health_interview_Yes
32. phys_health_interview_No
33. phys_health_interview_Yes
34. mental_vs_physical_No
35. mental_vs_physi

In [130]:
# 7.0 Comprehensive Encoder for Deployment
import pickle
import os
import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

os.makedirs("../models", exist_ok=True)

class MentoPredEncoder(BaseEstimator, TransformerMixin):
    """Unified encoder handling all preprocessing steps for deployment."""

    def __init__(self):
        self.gender_mapping = typo
        self.work_interfere_mapping = work_interfere_mapping
        self.no_employees_mapping = no_employees_mapping
        self.leave_mapping = leave_mapping
        self.binary_cols = binary_cols
        self.nominal_cols = nominal_cols
        self.numerical_cols = ['age', 'country_encoded']
        self.country_encoding = country_encoded
        self.global_mean = global_mean
        self.scaler = scaler
        self.expected_columns = None

    def fit(self, X, y=None):
        sample_df = self._preprocess(X.copy())
        self.expected_columns = sample_df.columns.tolist()
        return self

    def _standardize_column_names(self, df):
        df.columns = df.columns.str.lower().str.replace(' ', '_')
        return df

    def _preprocess(self, df):
        df = self._standardize_column_names(df)

        if 'gender' in df.columns:
            df['gender'] = df['gender'].map(lambda x: self.gender_mapping.get(str(x).strip(), 'other'))
        if 'self_employed' in df.columns:
            df['self_employed'] = df['self_employed'].fillna('No')
        if 'work_interfere' in df.columns:
            df['work_interfere'] = df['work_interfere'].fillna('Unknown')

        drop_cols = [c for c in ['timestamp', 'state', 'comments'] if c in df.columns]
        df = df.drop(columns=drop_cols, errors='ignore')

        for col in self.binary_cols:
            if col in df.columns:
                if set(df[col].unique()) == {'Yes', 'No'}:
                    df[col] = df[col].map({'Yes': 1, 'No': 0})
                else:
                    df[col] = LabelEncoder().fit_transform(df[col])

        if 'treatment' in df.columns:
            df['treatment'] = df['treatment'].map({'Yes': 1, 'No': 0})

        if 'work_interfere' in df.columns:
            df['work_interfere'] = df['work_interfere'].map(self.work_interfere_mapping)
        if 'no_employees' in df.columns:
            df['no_employees'] = df['no_employees'].map(self.no_employees_mapping)
        if 'leave' in df.columns:
            df['leave'] = df['leave'].map(self.leave_mapping)

        if 'country' in df.columns:
            df['country_encoded'] = df['country'].map(self.country_encoding).fillna(self.global_mean)
            df = df.drop('country', axis=1)

        cols_to_encode = [col for col in self.nominal_cols if col in df.columns]
        if cols_to_encode:
            df = pd.get_dummies(df, columns=cols_to_encode, prefix=cols_to_encode, drop_first=True)

        return df

    def transform(self, X):
        df = self._preprocess(X.copy())
        cols_to_scale = [col for col in self.numerical_cols if col in df.columns]
        if cols_to_scale:
            df[cols_to_scale] = self.scaler.transform(df[cols_to_scale])

        if self.expected_columns:
            missing_cols = set(self.expected_columns) - set(df.columns)
            for col in missing_cols:
                df[col] = 0
            df = df[self.expected_columns]
        return df

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        df = X.copy()
        cols = [c for c in self.numerical_cols if c in df.columns]
        if cols:
            df[cols] = self.scaler.inverse_transform(df[cols])
        return df


# === Encoder training & saving ===
encoder_final = MentoPredEncoder()
encoder_final.fit(df)

with open('../models/encoder_final.pkl', 'wb') as f:
    pickle.dump(encoder_final, f)

print("✅ Encoder saved to '../models/encoder_final.pkl'")

# === Test ===
sample_input = df.iloc[[0]].copy()
try:
    encoded_sample = encoder_final.transform(sample_input)
    print("Encoder test successful!")
    print("Encoded sample shape:", encoded_sample.shape)
    display(encoded_sample.head())
except Exception as e:
    print(f"Error during encoder test: {e}")

# === Deployment usage example ===
print("""
# Deployment Example:
import pickle
import pandas as pd

with open('models/encoder_final.pkl', 'rb') as f:
    encoder = pickle.load(f)
with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

def predict_treatment(user_data):
    if isinstance(user_data, dict):
        user_df = pd.DataFrame([user_data])
    else:
        user_df = user_data
    encoded_input = encoder.transform(user_df)
    pred = model.predict(encoded_input)[0]
    prob = model.predict_proba(encoded_input)[0][1]
    return {
        'prediction': 'Yes' if pred == 1 else 'No',
        'probability': float(prob),
        'needs_treatment': bool(pred == 1)
    }
""")


✅ Encoder saved to '../models/encoder_final.pkl'
Encoder test successful!
Encoded sample shape: (1, 37)


Unnamed: 0,age,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,leave,obs_consequence,...,coworkers_Some of them,coworkers_Yes,supervisor_Some of them,supervisor_Yes,mental_health_interview_No,mental_health_interview_Yes,phys_health_interview_No,phys_health_interview_Yes,mental_vs_physical_No,mental_vs_physical_Yes
0,-0.028194,0,0,1,4,1,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
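
In the encoded sample above, every one-hot column is 0 and `tech_company` is 0, although the first survey row has `tech_company` = `Yes` and `supervisor` = `Yes`. A likely cause is that `pd.get_dummies(..., drop_first=True)` on a single row sees only one category per column and drops it, and that the `LabelEncoder` fallback, fitted on a single value, always returns 0. The sketch below shows one possible remedy for the one-hot part: declare the category list up front with `pd.Categorical`. The helper `one_hot_fixed` and the `categories` dictionary are hypothetical; in the encoder they would be collected during `fit`.

```python
import pandas as pd

def one_hot_fixed(frame: pd.DataFrame, categories: dict) -> pd.DataFrame:
    """One-hot encode with a fixed category list per column.

    Declaring the categories up front keeps the output columns identical
    for a full training frame and for a single-row input.
    """
    out = frame.copy()
    for col, cats in categories.items():
        out[col] = pd.Categorical(out[col], categories=cats)
    return pd.get_dummies(out, columns=list(categories), drop_first=True, dtype=int)

# Category list assumed to have been recorded at fit time
categories = {"supervisor": ["No", "Some of them", "Yes"]}
single_row = pd.DataFrame({"supervisor": ["Yes"]})

# Naive call: the only observed category is dropped, leaving no columns
naive = pd.get_dummies(single_row, columns=["supervisor"], drop_first=True)
# Fixed categories: both expected columns are present with the right values
fixed = one_hot_fixed(single_row, categories)

print(list(naive.columns))
print(fixed)
```

For the binary columns, always applying the explicit `{'Yes': 1, 'No': 0}` mapping would avoid the single-value `LabelEncoder` fallback.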



# Deployment Example:
import pickle
import pandas as pd

with open('models/encoder_final.pkl', 'rb') as f:
    encoder = pickle.load(f)
with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

def predict_treatment(user_data):
    if isinstance(user_data, dict):
        user_df = pd.DataFrame([user_data])
    else:
        user_df = user_data
    encoded_input = encoder.transform(user_df)
    pred = model.predict(encoded_input)[0]
    prob = model.predict_proba(encoded_input)[0][1]
    return {
        'prediction': 'Yes' if pred == 1 else 'No',
        'probability': float(prob),
        'needs_treatment': bool(pred == 1)
    }
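
One detail to watch in this example: the encoder was fitted on a frame that still contains `treatment` (the encoded sample above includes a `treatment` column), so for user input without that column `transform` would add `treatment` filled with 0. Assuming the model was trained on features only, which this notebook does not show, the target column would need to be removed before calling `model.predict`. A minimal sketch with a hypothetical `drop_target` helper:

```python
import pandas as pd

def drop_target(encoded: pd.DataFrame, target: str = "treatment") -> pd.DataFrame:
    """Remove the target column, if present, before passing features to a model."""
    return encoded.drop(columns=[target], errors="ignore")

# Toy encoded row standing in for the output of encoder.transform(...)
encoded_row = pd.DataFrame({"age": [-0.03], "treatment": [0], "family_history": [1]})
features = drop_target(encoded_row)
print(list(features.columns))
```

Note also that the returned probability describes how likely a respondent is to have sought treatment, which is what the `treatment` column records, rather than a clinical need for treatment.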

