Mandatory (very important): before any processing, you must complete the following steps:
- Mandatory: present the basic information about the problem, in the following format:
    - Number of features
    - Number of samples
    - Name of the target variable and physical meaning, if not obvious
- Present the dataset briefly in your own words (please do not copy a description from another resource) and link to the URL you downloaded it from. In particular, it is very important to explain the features of the dataset that are not obvious to someone unfamiliar with it.
- Explain very explicitly what problem you are trying to solve, and in particular what quantity you are trying to predict, as a function of which features. If the quantity to predict is one of the columns of the dataset, name it explicitly. If relevant, discuss why solving this problem would be interesting or valuable for an industry.
You are encouraged to compare several estimators / optimization procedures from different points of view (scoring, training time, etc.). General guideline: as this course is dedicated to discovering and exploring some of the many principles of machine learning, rather than being production-oriented, you are encouraged to explore original and personal approaches. It is not a big deal if the final scores are not outstanding, as long as you took the chance to explore a custom approach and learned new methods.
Suggested steps:
- Provide a general analysis of the dataset that studies its statistical properties, outliers, correlation matrices, or any other interesting aspect. Discuss whether the findings are consistent with intuition (instead of, for instance, just showing a correlation matrix).
- If relevant or necessary, preprocess the data and justify this preprocessing. You could compare the estimators obtained with and without preprocessing.
- Discuss the relevant optimization details: cross-validation, hyperparameters, etc.
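The analysis step in the list above could start from something like the following sketch. It runs on a small synthetic table, not on the real dataset: the column names (`sleep_hours`, `stress_level`, `productivity`), the planted outlier, and the helper `iqr_outlier_counts` are all hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset (made-up columns).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sleep_hours": rng.normal(7.0, 1.0, 200),
    "stress_level": rng.normal(5.0, 2.0, 200),
})
# Make one column depend on another so the correlation matrix is informative.
toy["productivity"] = 10.0 - 0.8 * toy["stress_level"] + rng.normal(0.0, 0.5, 200)
# Plant one obvious outlier to check that the IQR rule catches it.
toy.loc[0, "sleep_hours"] = 25.0


def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    num = frame.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    mask = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return mask.sum()


summary = toy.describe()          # count, mean, std, quartiles per column
corr = toy.corr()                 # Pearson correlation matrix
outliers = iqr_outlier_counts(toy)

print(summary.round(2))
print(corr.round(2))
print(outliers)
```

The numbers only become useful once they are interpreted: for instance, whether a strong negative correlation between two columns matches what one would expect from the meaning of those columns.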
Again, explicit conclusions are very important:
- Provide an evaluation, or multiple evaluations, of the obtained estimators, using scoring metrics of your choice.
- Discuss the results obtained. Has the problem been solved with this model? Could the model actually be used? What precision can be expected from the model on unseen data?
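One way to estimate the precision to expect on unseen data is k-fold cross-validation, which reports a spread of scores instead of a single number. The sketch below is a minimal NumPy-only version under stated assumptions: the data are synthetic, and `k_fold_indices`, `CentroidBaseline`, and `cross_val_accuracy` are hypothetical helpers, not part of this notebook or of any library.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


class CentroidBaseline:
    """Hypothetical baseline: predict the class whose mean is closest."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every sample to every centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]


def cross_val_accuracy(make_model, X, y, n_splits=5, seed=0):
    """Fit a fresh model on each training fold and score it on the held-out fold."""
    scores = []
    for tr, va in k_fold_indices(len(X), n_splits, seed):
        model = make_model().fit(X[tr], y[tr])
        scores.append(float((model.predict(X[va]) == y[va]).mean()))
    return np.array(scores)


# Synthetic three-class data, only to exercise the functions above.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y_demo = rng.integers(0, 3, 300)
X_demo = centers[y_demo] + rng.normal(0.0, 1.0, (300, 2))

scores = cross_val_accuracy(CentroidBaseline, X_demo, y_demo)
print("fold accuracies:", scores.round(3))
print("mean / std:", round(scores.mean(), 3), round(scores.std(), 3))
```

Any estimator exposing `fit` and `predict` could be passed as `make_model`, which also gives a simple way to compare several estimators on identical folds.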

In [None]:
pip install numpy pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [None]:
import time
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix


Starting imports...
Importing numpy...
OK numpy 2.4.0 time: 0.053 s
Importing pandas...
OK pandas 2.3.3 time: 0.272 s
Importing sklearn.model_selection...
OK train_test_split time: 32.47 s
Importing sklearn.metrics...
OK metrics time: 32.471 s
All imports OK. Total time: 32.471 s


In [2]:
DATA_PATH = "mental_health_dataset.csv"
TARGET = "mental_health_risk"

TEST_SIZE = 0.25
RANDOM_STATE = 7

In [3]:
df = pd.read_csv(DATA_PATH)
df.columns = [c.strip() for c in df.columns]

print("Number of features:", df.shape[1] - 1)
print("Number of samples:", df.shape[0])
print("Target variable:", TARGET)

Number of features: 13
Number of samples: 10000
Target variable: mental_health_risk


In [4]:
def preprocess_fit_transform(X_train):
    X = X_train.copy()

    num_cols = [c for c in X.columns if pd.api.types.is_numeric_dtype(X[c])]
    cat_cols = [c for c in X.columns if c not in num_cols]

    num_medians = {}
    for c in num_cols:
        col = pd.to_numeric(X[c], errors="coerce")
        med = col.median()
        med = float(med) if pd.notna(med) else 0.0
        num_medians[c] = med
        X[c] = col.fillna(med).astype(float)

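    cat_modes = {}
    for c in cat_cols:
        # Compute the mode on the raw column: casting to str first would turn
        # NaN into the literal string "nan" and make fillna a no-op.
        mode = X[c].mode(dropna=True)
        m = str(mode.iloc[0]) if len(mode) else ""
        cat_modes[c] = m
        X[c] = X[c].fillna(m).astype(str)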

    X_num = X[num_cols].astype(float)
    X_cat = pd.get_dummies(X[cat_cols], dummy_na=False)

    means, stds = {}, {}
    for c in num_cols:
        m = X_num[c].mean()
        s = X_num[c].std(ddof=0)
        if s == 0:
            s = 1
        means[c] = m
        stds[c] = s
        X_num[c] = (X_num[c] - m) / s

    X_proc = pd.concat([X_num, X_cat], axis=1)

    state = {
        "num_cols": num_cols,
        "cat_cols": cat_cols,
        "num_medians": num_medians,
        "cat_modes": cat_modes,
        "means": means,
        "stds": stds,
        "features": X_proc.columns
    }

    return X_proc.to_numpy(dtype=np.float64), state

In [5]:
def preprocess_transform(X_test, state):
    X = X_test.copy()

    for c in state["num_cols"]:
        col = pd.to_numeric(X[c], errors="coerce").fillna(state["num_medians"][c])
        X[c] = (col - state["means"][c]) / state["stds"][c]

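    for c in state["cat_cols"]:
        # Fill missing values before casting to str; after the cast, NaN
        # becomes the string "nan" and can no longer be filled.
        X[c] = X[c].fillna(state["cat_modes"][c]).astype(str)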

    X_num = X[state["num_cols"]].astype(float)
    X_cat = pd.get_dummies(X[state["cat_cols"]], dummy_na=False)

    X_proc = pd.concat([X_num, X_cat], axis=1)
    X_proc = X_proc.reindex(columns=state["features"], fill_value=0)

    return X_proc.to_numpy(dtype=np.float64)

In [6]:
def encode_labels(y_train, y_test):
    classes = sorted(y_train.astype(str).unique())
    mapping = {c: i for i, c in enumerate(classes)}

    y_train_enc = y_train.astype(str).map(mapping).to_numpy()
    y_test_enc = y_test.astype(str).map(mapping).to_numpy()

    return y_train_enc, y_test_enc, classes

In [7]:
class MACHINE_LEARNING_TEK5_MODEL:
    def __init__(self, lr=0.15, epochs=1200, l2=1e-3, batch_size=512, seed=7):
        self.lr = lr
        self.epochs = epochs
        self.l2 = l2
        self.batch_size = batch_size
        self.seed = seed

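    def _softmax(self, z):
        # Subtract the row-wise max for numerical stability, without
        # modifying the caller's array in place.
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)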

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n, d = X.shape
        k = y.max() + 1

        self.W = rng.normal(0, 0.01, (d, k))
        self.b = np.zeros((1, k))

        Y = np.eye(k)[y]

        for _ in range(self.epochs):
            idx = rng.permutation(n)
            for i in range(0, n, self.batch_size):
                xb = X[idx[i:i+self.batch_size]]
                yb = Y[idx[i:i+self.batch_size]]

                probs = self._softmax(xb @ self.W + self.b)
                grad = (probs - yb) / xb.shape[0]

                self.W -= self.lr * (xb.T @ grad + self.l2 * self.W)
                self.b -= self.lr * grad.sum(axis=0, keepdims=True)

    def predict(self, X):
        return np.argmax(self._softmax(X @ self.W + self.b), axis=1)


In [8]:
X = df.drop(columns=[TARGET])
y = df[TARGET]

X_train_df, X_test_df, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)

X_train, state = preprocess_fit_transform(X_train_df)
X_test = preprocess_transform(X_test_df, state)

y_train_enc, y_test_enc, class_names = encode_labels(y_train, y_test)


In [9]:
model = MACHINE_LEARNING_TEK5_MODEL()

t0 = time.perf_counter()
model.fit(X_train, y_train_enc)
train_time = time.perf_counter() - t0


In [10]:
t1 = time.perf_counter()
y_pred = model.predict(X_test)
pred_time = time.perf_counter() - t1

acc = accuracy_score(y_test_enc, y_pred)
cm = confusion_matrix(y_test_enc, y_pred)

print("Accuracy:", round(acc, 3))
print("Training time (s):", round(train_time, 3))
print("Prediction time (s):", round(pred_time, 3))
print()
print("Confusion matrix:")
print(cm)


Accuracy: 0.977
Training time (s): 1.722
Prediction time (s): 0.001

Confusion matrix:
[[ 576    0   16]
 [   0  410   25]
 [   5   12 1456]]


In [12]:
print("Per-class metrics:")
for i, cls in enumerate(class_names):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp
    fn = cm[i, :].sum() - tp

    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tp / (tp + fn) if (tp + fn) else 0
    support = cm[i, :].sum()

    print(cls)
    print("  Precision:", round(precision, 3))
    print("  Recall:", round(recall, 3))
    print("  Support:", support)


Per-class metrics:
High
  Precision: 0.991
  Recall: 0.973
  Support: 592
Low
  Precision: 0.972
  Recall: 0.943
  Support: 435
Medium
  Precision: 0.973
  Recall: 0.988
  Support: 1473
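
Because the classes are imbalanced (support of 592, 435 and 1473), overall accuracy is dominated by the Medium class. As a complementary view, the sketch below derives the macro-averaged F1 score and the balanced accuracy from the confusion matrix printed above. The matrix is copied by hand from that output, and `per_class_f1` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class; order: High, Low, Medium).
cm = np.array([[576, 0, 16],
               [0, 410, 25],
               [5, 12, 1456]])


def per_class_f1(cm):
    """Return the F1 score of each class from a square confusion matrix."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how often each class really occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom,
                     out=np.zeros_like(tp), where=denom > 0)


f1 = per_class_f1(cm)
macro_f1 = f1.mean()                                        # every class weighs the same
balanced_accuracy = (np.diag(cm) / cm.sum(axis=1)).mean()   # mean per-class recall

print("Per-class F1:", f1.round(3))
print("Macro F1:", round(macro_f1, 3))
print("Balanced accuracy:", round(balanced_accuracy, 3))
```

Both metrics give the smaller Low class the same weight as the others, so they react more strongly than plain accuracy to the lower recall observed on that class.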
