# Factors that influence salary (wages, earnings)
Exam Example

Enter your name here:

## Name:

## First Name:
    

We are given one-hot encoded panel data on earnings of 595 individuals for the years 1976–1982, originating from the [Panel Study of Income Dynamics](https://rdrr.io/cran/AER/man/PSID7682.html). The data were originally analyzed by Cornwell and Rupert (1988) and employed for assessing various instrumental-variable estimators for panel models.

**Your task is to predict the earnings class (`wage_class`) based on the remaining features.**

In [None]:
%%html
<style>
table td, table th, table tr {text-align:left !important;}
</style>

A data frame containing 7 annual observations on 12 variables for 595 individuals.


| feature | description |
| --------| -------------|
| `experience` | Years of full-time work experience |
| `weeks` | Weeks worked |
| `education` | Years of education |
| `occupation_white` | factor. Is the individual a white-collar ("white"=`True`) or blue-collar ("blue"=`False`) worker? |
| `industry` | factor. Does the individual work in a manufacturing industry? |
| `south_yes` | factor. Does the individual reside in the South? |
| `smsa_yes` | factor. Does the individual reside in an SMSA (standard metropolitan statistical area)? |
| `married_yes` | factor. Is the individual married? |
| `gender_male` | factor indicating a male gender. |
| `union_yes` | factor. Is the individual's wage set by a union contract? |
| `ethnicity_other` | factor indicating ethnicity. Is the individual African-American ("afam") or not ("other")? |
| `wage_class` | **response** $y$: Wage class (`['average', 'high', 'low', 'very high']`) |






Here, we import the necessary libraries.


In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Evaluation & CV Libraries
from sklearn.metrics import precision_score, accuracy_score
from sklearn.model_selection import GridSearchCV


In [None]:
df_onehot = pd.read_csv('PSID_earnings_onehot.csv', index_col=0)
#df_onehot.info()
df_onehot.head()

We drop rows with missing values (NaN) and verify that none remain.

In [None]:
# Drop rows with missing values
df_onehot.dropna(inplace=True)
# Verify that no missing values remain (all counts should be 0)
df_onehot.isnull().sum()

### (a) Extract the features $X$ and the response (label, target) $y$ of the dataset

- Generate a `numpy` array `X` that contains the features $X$.
- Generate a `numpy` array `y` that contains the response $y$.
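
As a hedged illustration of this kind of split into features and response, the sketch below uses a tiny made-up frame (`toy`, a hypothetical stand-in for `df_onehot` with invented values). It assumes the response column is named `wage_class`, as in the table above, and casts the features to `float` so that boolean and integer columns end up in one numeric array.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_onehot (values are made up)
toy = pd.DataFrame({
    "experience": [3, 10, 25],
    "weeks": [40, 48, 52],
    "union_yes": [False, True, False],
    "wage_class": ["low", "average", "high"],
})

# Features: every column except the response, as a float numpy array
X = toy.drop(columns="wage_class").to_numpy(dtype=float)

# Response: the wage_class column as a 1-D numpy array
y = toy["wage_class"].to_numpy()

print(X.shape, y.shape)
```

Dropping the response by name rather than by position keeps the sketch independent of the column order.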


In [None]:
# START CODE HERE 

# END CODE HERE 

### (b) Plot a histogram of the response $y$ (`'wage_class'`)
- Are the classes well balanced?
- Answer: ...
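
One possible way to visualize a categorical response is a bar chart of class counts. The sketch below uses invented labels (`y_toy` is hypothetical, not the exam data) and also prints relative frequencies, which can make imbalance easier to judge.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up labels standing in for y (not the exam data)
y_toy = np.array(["low"] * 5 + ["average"] * 12 + ["high"] * 6 + ["very high"] * 2)

# Count observations per class, in a fixed class order
order = ["low", "average", "high", "very high"]
counts = pd.Series(y_toy).value_counts().reindex(order)

# Bar chart of the class counts
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel("wage_class")
ax.set_ylabel("count")

# Relative class frequencies
print((counts / counts.sum()).round(2))
```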

In [None]:
#START CODE HERE

#END CODE HERE

### (c) Split the data into 80% training data and 20% test data
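
A minimal sketch of an 80/20 split with `train_test_split`, shown on synthetic data (`X_toy` and `y_toy` are hypothetical). The `stratify` and `random_state` arguments are assumptions that the task does not require: stratification keeps the class proportions similar in both parts, and a fixed seed makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 features, 4 balanced classes
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.repeat(["average", "high", "low", "very high"], 25)

# test_size=0.2 gives 80% training data and 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(X_train.shape, X_test.shape)
```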


In [None]:
# Data Pre-processing Libraries
from sklearn.model_selection import train_test_split

# START CODE HERE 


# END CODE HERE 

### (d) Use the `StandardScaler` to standardize the data
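
The usual pattern is to fit the scaler on the training data only and reuse its mean and standard deviation for the test data, so that no information from the test set leaks into preprocessing. A small sketch on synthetic arrays (`X_train_toy` and `X_test_toy` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and test features
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=10.0, scale=3.0, size=(80, 3))
X_test_toy = rng.normal(loc=10.0, scale=3.0, size=(20, 3))

scaler = StandardScaler()
# Learn mean and standard deviation from the training data, then transform it
X_train_std = scaler.fit_transform(X_train_toy)
# Apply the same training statistics to the test data (no refitting)
X_test_std = scaler.transform(X_test_toy)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

After this step the training features have mean 0 and standard deviation 1 per column. The test features are only approximately standardized, because they are scaled with the training statistics.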

In [None]:
# Data Pre-processing Libraries
from sklearn.preprocessing import StandardScaler

#START CODE HERE



#END CODE HERE

### (e) Model Evaluation

Use the following **classifiers as baselines** and evaluate the **precision** (macro average: `average='macro'`) on the training and test data for each of them:

- Random Forest classifier (`RandomForestClassifier`) with default parameters
- k-nearest neighbors classifier (`KNeighborsClassifier`) with `k=3` (`n_neighbors=3`)
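
The evaluation loop could look roughly like the sketch below. It runs on synthetic data from `make_classification` (an assumption made so that the example is self-contained; on the exam data the standardized features from (d) would be used instead). `random_state` and `zero_division=0` are also assumptions added for reproducibility and to avoid warnings when a class is never predicted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class problem with 11 features (stand-in for the exam data)
X_syn, y_int = make_classification(
    n_samples=400, n_features=11, n_informative=5, n_classes=4, random_state=0
)
y_syn = np.array(["average", "high", "low", "very high"])[y_int]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "kNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Macro precision: unweighted mean of the per-class precisions
    train_prec = precision_score(y_tr, model.predict(X_tr), average="macro", zero_division=0)
    test_prec = precision_score(y_te, model.predict(X_te), average="macro", zero_division=0)
    scores[name] = (train_prec, test_prec)
    print(f"{name}: train={train_prec:.3f}, test={test_prec:.3f}")
```

A large gap between training and test precision is a common sign of overfitting.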
  


In [None]:
# Modelling Libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import precision_score, accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

classes=['average', 'high', 'low', 'very high']


In [None]:
#START CODE HERE



#END CODE HERE

### (f) Plot a confusion matrix for each classifier and interpret the results

Plot a **Confusion Matrix** for each of the two classifiers, e.g. using
   - `cm = confusion_matrix(y_test, y_pred, labels=model.classes_)`
   - `disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)`
   - `disp.plot()`
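
A self-contained sketch of these calls on synthetic data (the dataset and the choice of a kNN model are assumptions for illustration only). In the resulting matrix, rows correspond to the true classes and columns to the predicted classes, so off-diagonal entries show which classes are confused with each other.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class problem (stand-in for the exam data)
X_syn, y_int = make_classification(
    n_samples=400, n_features=11, n_informative=5, n_classes=4, random_state=0
)
y_syn = np.array(["average", "high", "low", "very high"])[y_int]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows: true classes, columns: predicted classes, ordered as model.classes_
cm = confusion_matrix(y_te, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()

print(cm)
```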
   

Enter your comments in at least two sentences here:
- ...
- ...

### (g) Hyperparameter Tuning of random forest

- Tune the hyperparameters of a random forest classifier (`RandomForestClassifier()`) using a 10-fold cross-validated grid search with `GridSearchCV`.
- Use the following hyperparameters for your grid search:
    - `params= {'n_estimators':[10,50,100,200], 'max_depth':list(range(1,7))}`
- Use the macro-averaged F1 score (`scoring='f1_macro'`) as the metric.
- What are the best parameters out of the grid?
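
The mechanics of such a grid search can be sketched as follows. To keep the example fast, it uses synthetic data and a deliberately reduced grid (`small_params` is a hypothetical stand-in for the `params` grid given above), and `random_state` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class problem (stand-in for the training data)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=4, random_state=0
)

# Reduced grid for speed; the exam task specifies a larger one
small_params = {"n_estimators": [10, 50], "max_depth": [1, 3, 5]}

# 10-fold cross-validated grid search scored by macro-averaged F1
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    small_params,
    cv=10,
    scoring="f1_macro",
)
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```

By default `GridSearchCV` refits the best parameter combination on all the data passed to `fit`, so `search.best_estimator_` can be used directly for prediction afterwards.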

In [None]:
from sklearn.model_selection import GridSearchCV

rf=RandomForestClassifier()
params= {'n_estimators':[10,50,100,200],
         'max_depth':list(range(1,7))}

In [None]:
from sklearn.model_selection import cross_val_score

#START CODE HERE



#END CODE HERE


### (h) Compute and plot the permutation feature importances of the best tuned random forest classifier
- What are the most important factors for a high salary?
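
Permutation importance measures how much a score drops when the values of one feature are shuffled. The sketch below computes and plots it on synthetic data. The feature names, the fixed forest settings, `n_repeats=10`, and the `f1_macro` scoring are all assumptions for illustration. On the exam data the tuned model from (g) and the real column names would take their place.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem with hypothetical feature names
X_syn, y_syn = make_classification(
    n_samples=400, n_features=11, n_informative=5, n_classes=4, random_state=0
)
feature_names = np.array([f"feature_{i}" for i in range(X_syn.shape[1])])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the score drop
result = permutation_importance(
    model, X_te, y_te, scoring="f1_macro", n_repeats=10, random_state=0
)

# Horizontal bar chart, sorted from least to most important
order = np.argsort(result.importances_mean)
fig, ax = plt.subplots()
ax.barh(
    feature_names[order],
    result.importances_mean[order],
    xerr=result.importances_std[order],
)
ax.set_xlabel("mean decrease in macro F1")
```

Computing the importances on held-out data rather than on the training data shows which features the model relies on for generalization.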


In [None]:
from sklearn.inspection import permutation_importance

#START CODE HERE 


#END CODE HERE


### (i) Hyperparameter Tuning of kNN

- Tune the hyperparameters of a k-nearest neighbors classifier (`KNeighborsClassifier()`) using a 10-fold cross-validated grid search.
- Use the following parameters for your grid search:
    - `params= {'n_neighbors':list(range(20,50))}`
- Use the macro-averaged F1 score (`scoring='f1_macro'`) as the metric.
- What is the best number of neighbors?
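
Beyond reading off the best value, it can be informative to look at how the cross-validated score changes with the number of neighbors. The sketch below does this on synthetic data (an assumption made to keep the example self-contained) using the `cv_results_` attribute of `GridSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class problem (stand-in for the standardized training data)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=4, random_state=0
)

params = {"n_neighbors": list(range(20, 50))}
search = GridSearchCV(KNeighborsClassifier(), params, cv=10, scoring="f1_macro")
search.fit(X_syn, y_syn)

# Mean cross-validated macro F1 for every candidate k, in grid order
ks = np.array(params["n_neighbors"])
mean_f1 = search.cv_results_["mean_test_score"]
best_k = search.best_params_["n_neighbors"]

print("best k:", best_k, "score:", round(search.best_score_, 3))
print("score range over grid:", round(mean_f1.min(), 3), "to", round(mean_f1.max(), 3))
```

If the score curve is flat over the grid, the exact choice of `n_neighbors` matters little.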

In [None]:
knn=KNeighborsClassifier()
params= {'n_neighbors':list(range(20,50))}

In [None]:
#START CODE HERE


#END CODE HERE


### (j) Compare and discuss the different approaches

- Comparing the untuned baseline classifiers (e/f) with the tuned classifiers (g/i), which classifier would you recommend, and why?

Answers:
- ...
- ... 

## Upload this notebook as an `.ipynb` file and as an HTML file (File → Download as → HTML) to the upload field of this question (2 files are allowed).