
### Try-it 9.2: Predicting Wages

This activity summarizes your work with regularized regression models. You will combine your earlier work on data preparation and pipelines with what you've learned about grid searches to determine an optimal model. This example is also a good opportunity to use the `TransformedTargetRegressor` estimator in scikit-learn.

### The Data

This dataset is loaded from the OpenML resource library. Originally drawn from census data, it contains wage and demographic information on 534 individuals. From the [dataset documentation](https://www.openml.org/d/534):

```
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence, and union membership.
```

In [8]:
from sklearn.datasets import fetch_openml

In [9]:
wages = fetch_openml(data_id=534, as_frame=True)

In [19]:
wages.frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    int64   
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    int64   
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    int64   
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB


In [10]:
wages.frame.head()


| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.1 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.0 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.5 | 35 | White | Other | Other | Married |


#### Task

Build regression models to predict `WAGE`.  Incorporate the categorical features and transform the target using a logarithm.  Build `Ridge` models and consider some different amounts of regularization.  

After fitting your model, interpret it and try to understand which features are associated with higher wages.  Consider using `permutation_importance`, which you encountered in module 8.  Discuss your findings in the class forum.

For an in-depth discussion of the perils of interpreting coefficients, see [this example](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html) in the scikit-learn documentation.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
import plotly.express as px
import pandas as pd
import warnings

In [12]:
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary', handle_unknown='ignore'),
     ['SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR']),
    remainder=StandardScaler())

In [13]:
pipe1 = Pipeline([('encoder', transformer),
                   ('ridge', Ridge())])

In [14]:
pipe1.fit(wages.data, wages.target)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [20]:
pipe1.score(wages.data, wages.target)

0.3264211783656321

In [21]:
ridge_coef = pipe1['ridge'].coef_

# Get feature names from OneHotEncoder
onehot_features = pipe1['encoder'].named_transformers_['onehotencoder'].get_feature_names_out()

# Feature names for the remainder (scaled numerical columns), i.e. the
# columns not passed to OneHotEncoder, in the order they appear in wages.data
original_numerical_cols = ['EDUCATION', 'EXPERIENCE', 'AGE']

# Combine all feature names
all_feature_names = np.concatenate([onehot_features, original_numerical_cols])

# Create a DataFrame to display coefficients with their names
coefficients_df = pd.DataFrame({'Feature': all_feature_names, 'Coefficient': ridge_coef})

print("Ridge Model Coefficients:")
print(coefficients_df.sort_values(by='Coefficient', ascending=False))


Ridge Model Coefficients:
                    Feature  Coefficient
7     OCCUPATION_Management     2.600894
1                  SEX_male     1.923345
16                EDUCATION     1.592056
9   OCCUPATION_Professional     1.308674
18                      AGE     0.572738
13     SECTOR_Manufacturing     0.525357
5                RACE_White     0.475702
17               EXPERIENCE     0.469099
12      SECTOR_Construction    -0.023882
3             RACE_Hispanic    -0.120987
15           MARR_Unmarried    -0.297611
4                RACE_Other    -0.354715
14             SECTOR_Other    -0.501475
0                 SOUTH_yes    -0.564597
8          OCCUPATION_Other    -0.613940
6       OCCUPATION_Clerical    -0.615913
11       OCCUPATION_Service    -1.303909
10         OCCUPATION_Sales    -1.375806
2          UNION_not_member    -1.577608


In [22]:
perm_importance = permutation_importance(pipe1, wages.data, wages.target, n_repeats = 10)

In [23]:
perm_importance

{'importances_mean': array([0.20562976, 0.007245  , 0.07629157, 0.01548172, 0.02548108,
        0.02482088, 0.00664449, 0.12973964, 0.01414742, 0.00239284]),
 'importances_std': array([0.02394346, 0.00306839, 0.01591444, 0.00565583, 0.00863262,
        0.00736333, 0.00451738, 0.02151687, 0.00325803, 0.00124184]),
 'importances': array([[ 2.05664854e-01,  2.23821651e-01,  1.90022760e-01,
          1.91852575e-01,  2.58431960e-01,  1.89777827e-01,
          2.31113503e-01,  1.75702083e-01,  2.03793139e-01,
          1.86117199e-01],
        [ 9.41730453e-03,  9.30292258e-03, -5.24507484e-04,
          7.43500269e-03,  8.26232390e-03,  8.44651399e-03,
          9.72569581e-03,  9.39507344e-03,  7.14003145e-03,
          3.84963277e-03],
        [ 4.92875574e-02,  8.04960919e-02,  6.65394222e-02,
          8.40918651e-02,  8.43107575e-02,  7.03905114e-02,
          4.99285759e-02,  9.05497304e-02,  8.98492141e-02,
          9.74719597e-02],
        [ 2.35484278e-02,  1.34463371e-02,  5.272
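
The raw result is hard to read. One way to summarize it, sketched here on synthetic stand-in data, is to pair `importances_mean` and `importances_std` with the input column names and sort. Since `permutation_importance` shuffles the columns passed to the estimator, a pipeline yields one entry per original column rather than one per one-hot column. The column names, the `NOISE` feature, and the coefficients used to generate the target are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption, not the wage dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "EDUCATION": rng.normal(13, 2, size=200),
    "EXPERIENCE": rng.normal(18, 10, size=200),
    "NOISE": rng.normal(0, 1, size=200),  # hypothetical column unrelated to the target
})
y_demo = 1.0 * X_demo["EDUCATION"] + 0.1 * X_demo["EXPERIENCE"] + rng.normal(0, 1, size=200)

model = Ridge().fit(X_demo, y_demo)
# random_state makes the shuffles, and therefore the table, reproducible.
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)

summary = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std},
    index=X_demo.columns,
).sort_values("mean", ascending=False)
print(summary)
```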

In [24]:
sample_female = pd.DataFrame({
    'EDUCATION': [12],
    'SOUTH': ['no'],
    'SEX': ['female'],
    'EXPERIENCE': [10],
    'UNION': ['not_member'],
    'AGE': [30],
    'RACE': ['White'],
    'OCCUPATION': ['Other'],
    'SECTOR': ['Other'],
    'MARR': ['Unmarried'],
})

# The columns above are listed in the same order as the training data
# (wages.data), which keeps the input consistent with what the pipeline saw in fit.

predicted_wage = pipe1.predict(sample_female)
print(f"Predicted wage for a generic female: ${predicted_wage[0]:.2f}")


Predicted wage for a generic female: $5.79


In [27]:
df = pd.DataFrame(perm_importance['importances'])
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.205665 | 0.223822 | 0.190023 | 0.191853 | 0.258432 | 0.189778 | 0.231114 | 0.175702 | 0.203793 | 0.186117 |
| 1 | 0.009417 | 0.009303 | -0.000525 | 0.007435 | 0.008262 | 0.008447 | 0.009726 | 0.009395 | 0.00714 | 0.00385 |
| 2 | 0.049288 | 0.080496 | 0.066539 | 0.084092 | 0.084311 | 0.070391 | 0.049929 | 0.09055 | 0.089849 | 0.097472 |
| 3 | 0.023548 | 0.013446 | 0.005272 | 0.015085 | 0.013988 | 0.006745 | 0.018787 | 0.01733 | 0.018465 | 0.022151 |
| 4 | 0.02345 | 0.028273 | 0.021246 | 0.010372 | 0.028463 | 0.039072 | 0.016036 | 0.039199 | 0.021751 | 0.026949 |
| 5 | 0.036222 | 0.02349 | 0.011785 | 0.023449 | 0.027007 | 0.012036 | 0.029586 | 0.025719 | 0.028068 | 0.030847 |
| 6 | 0.001189 | 0.012335 | 0.008675 | 0.013719 | 0.0106 | 0.006288 | 0.000239 | 0.001857 | 0.007575 | 0.003969 |
| 7 | 0.110631 | 0.147328 | 0.121631 | 0.152661 | 0.154297 | 0.110807 | 0.133127 | 0.109796 | 0.160107 | 0.09701 |
| 8 | 0.01482 | 0.02004 | 0.012693 | 0.013083 | 0.015299 | 0.017792 | 0.008233 | 0.011179 | 0.012009 | 0.016326 |
| 9 | 0.002691 | -0.000374 | 0.001939 | 0.002736 | 0.001921 | 0.001255 | 0.002984 | 0.002745 | 0.003802 | 0.00423 |


In [29]:
df = df.T
df.columns = wages.feature_names
df.columns

Index(['EDUCATION', 'SOUTH', 'SEX', 'EXPERIENCE', 'UNION', 'AGE', 'RACE',
       'OCCUPATION', 'SECTOR', 'MARR'],
      dtype='object')

In [30]:
px.box(data_frame=df, orientation='h', title = 'Feature importance for wage prediction')

