# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model that has the best performance metrics on the test set.
        - You may need multiple preprocessed versions of the datasets.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results in a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiments results, include experiment name, metrics results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


##### Initial data

In [1]:
import pandas as pd

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

# With skipinitialspace=True the missing-value marker is read as "?" (no leading space)
data = pd.read_csv(data_url, header=None, names=columns, na_values="?", skipinitialspace=True)
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
17799,40,Private,195394,Assoc-acdm,12,Married-civ-spouse,Prof-specialty,Husband,White,Male,7688,0,40,United-States,>50K
6151,27,Private,39232,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
7066,58,Local-gov,98361,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States,<=50K
3321,37,Private,189922,Assoc-acdm,12,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,50,United-States,>50K
4134,38,Private,278253,HS-grad,9,Divorced,Adm-clerical,Unmarried,White,Female,0,0,48,United-States,<=50K
5948,39,Private,327435,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,36,United-States,>50K
13628,30,Private,154120,Bachelors,13,Never-married,Sales,Not-in-family,White,Male,0,0,65,United-States,<=50K
861,43,Private,191547,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,40,Mexico,<=50K
26700,26,Self-emp-not-inc,102476,HS-grad,9,Never-married,Craft-repair,Own-child,White,Male,0,0,50,United-States,<=50K
7417,23,Private,273206,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,23,United-States,<=50K


##### Data preparation

In [8]:
# Load the preprocessed data from Task 1

import pandas as pd

X_train = pd.read_csv("data\\X_train.csv")
X_test = pd.read_csv("data\\X_test.csv")
y_train = pd.read_csv("data\\y_train.csv").squeeze()
y_test = pd.read_csv("data\\y_test.csv").squeeze()

In [None]:
# Check dimensions

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26029, 32), (6508, 32), (26029,), (6508,))

In [10]:
# Split a validation set off from the training set

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [11]:
# Check that the data is numeric
X_train.info()

<class 'pandas.DataFrame'>
Index: 20823 entries, 15272 to 23654
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   education-num                  20823 non-null  float64
 1   sex                            20823 non-null  int64  
 2   income                         20823 non-null  int64  
 3   is_married                     20823 non-null  int64  
 4   capital_net                    20823 non-null  float64
 5   is_usa                         20823 non-null  int64  
 6   occupation_Armed-Forces        20823 non-null  bool   
 7   occupation_Craft-repair        20823 non-null  bool   
 8   occupation_Exec-managerial     20823 non-null  bool   
 9   occupation_Farming-fishing     20823 non-null  bool   
 10  occupation_Handlers-cleaners   20823 non-null  bool   
 11  occupation_Machine-op-inspct   20823 non-null  bool   
 12  occupation_Other-service       20823 non-null  bool   
 13

In [12]:
# Convert the data (including the bool columns) to float

X_train = X_train.astype(float)
X_val = X_val.astype(float)
X_test = X_test.astype(float)

In [None]:
# Scale the data to remove differences in magnitude between features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
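
The scaler above is fitted on the training set only and then reused for validation and test. As a minimal sketch of what that means, the toy arrays below (hypothetical values, not taken from the Census data) show that the training columns end up with mean 0 and standard deviation 1, while a validation row is expressed in units of the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
val = np.array([[2.0, 400.0]])

# Fit on the training rows only, then apply the same transform everywhere
scaler = StandardScaler().fit(train)
train_s = scaler.transform(train)
val_s = scaler.transform(val)

print(train_s.mean(axis=0), train_s.std(axis=0))  # per-column mean and std after scaling
print(val_s)  # validation row, scaled with the TRAINING mean and std
```

Fitting the scaler on validation or test rows would leak information about those sets into training, which is why only `transform` is called on them.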

##### Model Selection and Setup

I chose to implement several models (Linear Regression, Decision Tree Regression, Random Forest Regression, Ridge Regression, Lasso Regression) to predict the hours spent at work, to explore whether the data needs complex models or not, and to find the best-performing solution.
I will experiment with MSE (Mean Squared Error) because squaring penalizes large errors more heavily and pushes the model to avoid predictions that are far from reality. For example, the capital_net variable has a very high variance, and MSE pushes the model not to ignore people with large gains or losses (treated as outliers), who could influence the hours worked. I will also use MAE (Mean Absolute Error) because we already know the data contains outliers (only one hour worked or 99, incomes that are too high or too low), and this metric is more robust to atypical values. It tells us by how many hours per week the model is wrong on average, without the result being distorted by cases with extreme working hours. I will also look at RMSE (Root Mean Squared Error) because it brings the error back to the same unit as the target variable (hours) but, unlike MAE, keeps the sensitivity to large errors, so it is useful for seeing whether the model has major misses on certain segments of the data. It will be used to check whether the model's prediction errors depend on certain occupational categories.
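
To make the difference between these losses concrete, here is a minimal sketch on a handful of made-up weekly hours (hypothetical values, not from the dataset) with one extreme case of 99 hours. The Huber loss and its `delta=5.0` threshold are included only as an illustrative assumption, since the task lists Huber as an option:

```python
import numpy as np

def huber(y_true, y_pred, delta):
    """Mean Huber loss: quadratic for small errors, linear beyond delta."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

# Hypothetical weekly hours: four typical workers and one extreme case
y_true = np.array([40.0, 40.0, 38.0, 45.0, 99.0])
y_pred = np.full_like(y_true, 40.0)  # a model that always predicts 40 hours

mse_val = float(np.mean((y_true - y_pred) ** 2))
mae_val = float(np.mean(np.abs(y_true - y_pred)))
rmse_val = float(np.sqrt(mse_val))
huber_val = huber(y_true, y_pred, delta=5.0)

print(f"MSE={mse_val:.1f} MAE={mae_val:.1f} RMSE={rmse_val:.2f} Huber={huber_val:.1f}")
```

The single 99-hour case contributes 3481 of the 3510 total squared error, so it dominates MSE and RMSE, while in MAE it counts only in proportion to its size.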

Linear Regression - well suited to the large volume of data
- Pro: control over the learning process
- Cons: very sensitive to feature scaling and to outliers

Decision Tree Regression - Captures non-linear relationships between variables: age and hours worked are not linearly related, with a large cluster of middle-aged people working 40 hours, while young and elderly people work differently
- Pro: easy to interpret, can capture interactions between several variables
- Cons: prone to overfitting

Random Forest - Combines several trees to stabilize the prediction; given the variety of binary variables, a single tree could be unstable, whereas the average of several trees gives a more balanced view
- Pro: robust to outliers, reduces the variance and the overfitting risk of a single decision tree
- Cons: consumes more resources and acts as a black box that is hard to interpret; slow given the large number of columns in the set (32)

Ridge Regression - Adds a penalty that keeps the coefficients small, useful because some occupations may not provide important information about the number of hours worked
- Pro: prevents overfitting in the presence of correlated variables
- Cons: does not eliminate useless variables, even if, for example, some occupations do not influence the hours worked per week

Lasso Regression - A penalty that can force coefficients to zero
- Pro: performs automatic variable selection
- Cons: can eliminate important variables when there is strong correlation; for example, if two occupations are correlated, it may drop one of them arbitrarily and the model may lose important information
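
The Ridge versus Lasso contrast above can be sketched on synthetic data. This is only an illustration under stated assumptions: the data is randomly generated (not the Census set), only the first two of five features influence the target, and the `alpha` values simply mirror the ones used later in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 5 features, only the first two drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=1000)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print("Ridge coefficients:", np.round(ridge.coef_, 3), "zeros:", n_zero_ridge)
print("Lasso coefficients:", np.round(lasso.coef_, 3), "zeros:", n_zero_lasso)
```

Ridge shrinks the irrelevant coefficients towards zero but keeps them in the model, whereas Lasso can set them to exactly zero, which is the automatic feature selection mentioned above.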

##### Model training and experimentation




In [14]:
# Define a helper function that computes all the metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def get_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    return mae, mse, rmse, r2

In [16]:
# Build a baseline that always predicts the training mean, as a reference for the other models

from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, y_train)
mae_b, mse_b, rmse_b, r2_b = get_metrics(y_val, dummy_regr.predict(X_val))

results = []
results.append({
    "Model": "Baseline (Mean)",
    "MAE": mae_b, "MSE": mse_b, "RMSE": rmse_b, "R2": r2_b
})

In [17]:
# Define the models to test

from sklearn.linear_model import SGDRegressor, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

trained_models = {
    "SGD_Regressor": SGDRegressor(random_state=42),
    "Decision_Tree": DecisionTreeRegressor(max_depth=10, random_state=42),
    "Random_Forest": RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1)
}

for name, model in trained_models.items():
    # Train
    model.fit(X_train, y_train)
    # Predict on the VALIDATION set
    preds = model.predict(X_val)
    # Compute metrics
    mae, mse, rmse, r2 = get_metrics(y_val, preds)
    # Save results
    results.append({
        "Model": name,
        "MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2
    })

In [23]:
# Set the float display format to 3 decimal places

import pandas as pd
pd.options.display.float_format = '{:.3f}'.format

In [24]:
# Display the results

df_results = pd.DataFrame(results)
print(df_results.sort_values(by="MAE"))

             Model        MAE                  MSE         RMSE                  R2
3    Random_Forest      7.393              116.543       10.796               0.224
0  Baseline (Mean)      7.496              150.155       12.254              -0.000
5            Lasso      7.523              118.257       10.875               0.212
4            Ridge      7.574              118.104       10.868               0.213
2    Decision_Tree      7.589              125.330       11.195               0.165
1    SGD_Regressor 825936.991 1546673823734331.500 39327774.203 -10300598207214.150


After the first run, the Random Forest model obtained the best results and is the only model that improves on the simple mean of the data on every metric. Its MAE (7.393) shows that the model is off by about 7 hours and 24 minutes per week on average. This precision is still low, which indicates that the model approximates the hours rather than predicting them exactly. R² indicates that the model explains only 22.4% of the variation in hours worked from the given variables, a fairly low score. Compared with the MAE of the baseline (7.496), the remaining models (Lasso, Ridge, Decision Tree) all show a slightly higher value, so on this metric their extra complexity brings no benefit, although they do improve on the baseline in MSE, RMSE and R². The fact that MAE is about 7 hours while MSE is above 115 confirms the presence of extreme values that affect the results. RMSE is larger than MAE by at least 3 units for every model that converged, which indicates that the models make some large prediction errors they cannot yet explain; for Random Forest the RMSE is about 11 hours. Compared with the mean baseline (RMSE 12.254), Random Forest reduces RMSE by about 1.5 units.
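
The figures in this interpretation can be checked with a few lines of arithmetic. The sketch below uses only the Random Forest and baseline values copied from the validation results table; the variable names are introduced here for illustration.

```python
# Values copied from the validation results table
mae_rf, mae_base = 7.393, 7.496
rmse_rf, rmse_base = 10.796, 12.254

# Express the Random Forest MAE in hours and minutes
hours, frac = divmod(mae_rf, 1)
minutes = frac * 60

# Improvement over the mean baseline
mae_gain_minutes = (mae_base - mae_rf) * 60
rmse_gain = rmse_base - rmse_rf
rmse_gain_pct = 100 * rmse_gain / rmse_base

print(f"MAE: {int(hours)} h {minutes:.1f} min")
print(f"MAE gain over baseline: {mae_gain_minutes:.1f} min per week")
print(f"RMSE gain over baseline: {rmse_gain:.3f} hours ({rmse_gain_pct:.1f}%)")
```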
SGD Regressor failed completely, most likely because the learning rate was too high; its learning process needs to be recalibrated.
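
One possible way to recalibrate the learning process is to try a few smaller initial learning rates and compare the validation MAE. The sketch below is not the notebook's actual fix: it runs on synthetic data standing in for the Census features, and the `eta0` values, `learning_rate="adaptive"` and `max_iter=2000` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Census features (assumption: NOT the real data)
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = 40.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=2.0, size=2000)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_va_s = scaler.transform(X_tr), scaler.transform(X_va)

# Try several initial learning rates and record the validation MAE for each
val_mae = {}
for eta0 in (0.01, 0.001, 0.0001):
    sgd = SGDRegressor(
        learning_rate="adaptive",  # keeps eta0 until the loss stalls, then divides it by 5
        eta0=eta0,
        max_iter=2000,
        random_state=42,
    )
    sgd.fit(X_tr_s, y_tr)
    val_mae[eta0] = mean_absolute_error(y_va, sgd.predict(X_va_s))

best_eta0 = min(val_mae, key=val_mae.get)
print(val_mae)
print("Best eta0 on validation:", best_eta0)
```

On the real data the same loop would use the scaled `X_train` and `X_val` from the cells above; whether a smaller `eta0` alone is enough to fix the divergence there would still have to be verified.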