# Transformation and Regression I

**Outline**

1. Topic Review
2. Case 1 - HR Dataset
3. Case 2 - Kid's Score and Mom Background

- We first import the libraries and functions needed to run the code in this notebook, in order to:
  - load data
  - run simulations
  - draw graphs and other visualizations
  - perform cross-validation using statsmodels

In [1]:
%pip install "https://files.pythonhosted.org/packages/83/11/00d3c3dfc25ad54e731d91449895a79e4bf2384dc3ac01809010ba88f6d5/seaborn-0.13.2-py3-none-any.whl"
# The following code is to import libraries necessary to run this notebook

# Data manipulation
import pandas as pd
import numpy as np

# Model fitting
import statsmodels.formula.api as smf

# Visualization
import matplotlib.pyplot as plt
import seaborn as sb

Collecting seaborn==0.13.2
  Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)


In [2]:
def print_coef_std_err(results):
    """
    Function to combine estimated coefficients and standard error in one DataFrame
    :param results: <statsmodels RegressionResultsWrapper> OLS regression results
    :return df: <pandas DataFrame>
    """
    coef = results.params
    std_err = results.bse

    df = pd.DataFrame(data = np.transpose([coef, std_err]),
                      index = coef.index,
                      columns=["coef","std err"])
    return df

## **Topic Review**
---

From the material covered in the video lessons:
### **Linear transformations**

- Standardization is a linear transformation that can improve model interpretation without affecting the model's fit
- Standardization can be done in various ways
  - Scaling predictors to a reasonable unit
  - Centering variables on their mean
  - Standardizing with z-scores: subtracting a variable's mean and dividing by its standard deviation
$$z = \frac{x-\bar{x}}{s_x}$$
     - Use 2 standard deviations when dealing with a binary predictor
     - Use externally specified parameters ($\mu$ and $\sigma$) when the result needs to be compared with an external standard
- We can improve the interpretation of a model with interactions by using both centering and standardization, and by considering two standard deviations if a binary predictor is included in the model
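The claim that standardization changes the coefficients but not the fit can be checked on simulated data. The sketch below is only an illustration: the data, the seed, and the helper `fit_ols` are hypothetical and are not part of the HR or kid-score datasets.

```python
import numpy as np

# Hypothetical simulated data (not from the notebook's datasets)
rng = np.random.default_rng(0)
n = 200
x = rng.normal(50, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(0, 2, size=n)

def fit_ols(x, y):
    """Return (intercept, slope, r_squared) of a one-predictor least-squares fit."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[0], beta[1], r2

# z-score the predictor: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

b0_raw, b1_raw, r2_raw = fit_ols(x, y)
b0_z, b1_z, r2_z = fit_ols(z, y)

print(f"raw predictor:          intercept={b0_raw:.3f}, slope={b1_raw:.3f}, R2={r2_raw:.4f}")
print(f"standardized predictor: intercept={b0_z:.3f}, slope={b1_z:.3f}, R2={r2_z:.4f}")
```

The two fits share the same R-squared. After standardization the slope is the raw slope multiplied by the standard deviation of `x`, and the intercept becomes the mean of `y`.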

# **Case 1:** HR Dataset
___

A company is interested in explaining how certain factors relate to paid compensation. The HR department is building regression models, but needs to make it in a way that's easily explainable to upper management.

The HR department is especially interested in explaining variation of salary in terms of performance score, employee satisfaction and gender.

## **Load data**

The dataset is loaded as follows.


In [4]:
hr = pd.read_csv('HRDataset_prep.csv')
hr

Unnamed: 0,Employee_Name,Salary,Perf,Gender,EmpSatisfaction
0,"Adinolfi, Wilson K",62506,4,1,5
1,"Ait Sidi, Karthikeyan",104437,3,1,3
2,"Akinkuolie, Sarah",64955,3,0,3
3,"Alagbe,Trina",64991,3,0,5
4,"Anderson, Carol",50825,3,0,4
...,...,...,...,...,...
306,"Woodson, Jason",65893,3,1,4
307,"Ybarra, Catherine",48513,1,0,2
308,"Zamora, Jennifer",220450,4,0,5
309,"Zhou, Julia",89292,3,0,3


The dataset consists of the following columns:

1. `Employee_Name`.
2. `Salary`: yearly salary, in US dollars.
3. `Perf`: performance score, where:
    - 1 means PIP (performance improvement plan: not meeting job performance goals)
    - 3 means Fully Meets
    - 4 means Exceeds
4. `Gender`: 0 for female and 1 for male.
5. `EmpSatisfaction` : A basic satisfaction score between 1 and 5, as reported on a recent employee satisfaction survey

### Overview of Data

The descriptive statistics are summarized as follows.

In [5]:
hr.describe()

Unnamed: 0,Salary,Perf,Gender,EmpSatisfaction
count,311.0,311.0,311.0,311.0
mean,69020.684887,2.977492,0.434084,3.890675
std,25156.63693,0.587072,0.496435,0.909241
min,45046.0,1.0,0.0,1.0
25%,55501.5,3.0,0.0,3.0
50%,62810.0,3.0,0.0,4.0
75%,72036.0,3.0,1.0,5.0
max,250000.0,4.0,1.0,5.0


- The average employee salary is about 69 thousand dollars
- The average performance score is about 2.98 $\approx$ 3, which means that on average employees roughly fully meet their targets
- The average employee satisfaction score is about 3.9 out of 5

In [6]:
hr.groupby(["Perf"])[["Salary","EmpSatisfaction"]].mean()

Unnamed: 0_level_0,Salary,EmpSatisfaction
Perf,Unnamed: 1_level_1,Unnamed: 2_level_1
1,57956.0,2.461538
2,68407.555556,3.611111
3,68421.024691,3.954733
4,77144.864865,4.108108


- Employees with a higher performance score tend to have a higher satisfaction score
- Employees with a higher performance score tend to have a higher salary

In [7]:
hr.groupby(["Gender"])[["Salary"]].mean()

Unnamed: 0_level_0,Salary
Gender,Unnamed: 1_level_1
0,67786.727273
1,70629.4


- On average, male employees have a higher salary than female employees
- From the descriptive statistics we can form initial assumptions about the data, such as: male employees tend to have a higher salary, and performance and satisfaction scores have a positive relationship with salary.
- Next, we can build a regression model to see how exactly the other variables relate to salary

### Fit Linear Regression

In [8]:
# Create OLS model object
model = smf.ols("Salary ~ EmpSatisfaction + Perf + Gender", hr)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_hr = print_coef_std_err(results)

In [9]:
results_hr

Unnamed: 0,coef,std err
Intercept,48575.265981,8581.609832
EmpSatisfaction,755.005144,1639.814368
Perf,5405.498361,2541.008101
Gender,3255.389434,2865.971776


In [10]:
results.rsquared

0.021828044750521358

- Here we see high standard errors, especially for the satisfaction score (over two times the estimated coefficient itself). This means the satisfaction score does not carry enough information about salary, since its coefficient is highly uncertain
- The model fits poorly: it explains only about 2% of the variance in salary
- However, let's continue to interpret the coefficients first, since we still want to know how the predictors relate to salary

### Coefficient Interpretation

- The `intercept`, `$48,575`, is the predicted average salary for female employees with a satisfaction score and performance score of 0 (this is meaningless, since no score of 0 exists in our data)
- The coefficient of `EmpSatisfaction`: the predictive difference comparing two employees with the same gender and performance score who differ by 1 point in satisfaction is `$755`
- The coefficient of `Perf`: the predictive difference comparing two employees with the same gender and satisfaction score who differ by 1 point in performance score is `$5,405`
- The coefficient of `Gender`: the predictive difference comparing two employees with the same performance and satisfaction scores but different gender is `$3,255`
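To make "predictive difference" concrete, the sketch below plugs the coefficients reported in the output above into the regression equation by hand. The helper `predict_salary` and the two example employees are hypothetical, introduced only for illustration.

```python
# Coefficients copied from the fitted-model output above (in dollars)
coef = {
    "Intercept": 48575.265981,
    "EmpSatisfaction": 755.005144,
    "Perf": 5405.498361,
    "Gender": 3255.389434,
}

def predict_salary(emp_satisfaction, perf, gender):
    """Hypothetical helper: evaluate the fitted linear equation by hand."""
    return (coef["Intercept"]
            + coef["EmpSatisfaction"] * emp_satisfaction
            + coef["Perf"] * perf
            + coef["Gender"] * gender)

# Two hypothetical female employees with the same satisfaction score,
# differing by one point in performance score
salary_perf3 = predict_salary(emp_satisfaction=4, perf=3, gender=0)
salary_perf4 = predict_salary(emp_satisfaction=4, perf=4, gender=0)
diff = salary_perf4 - salary_perf3

print(f"Predicted salary (Perf=3): {salary_perf3:,.0f}")
print(f"Predicted salary (Perf=4): {salary_perf4:,.0f}")
print(f"Difference: {diff:,.0f}")
```

The difference between the two predictions equals the `Perf` coefficient, whatever values are chosen for the other two predictors, as long as they are held equal.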

That's the initial model, and we can definitely improve the interpretation upon it. Let's go!

### **1. Use a better unit for salary**

Upper management cares about salary in terms of thousand dollars. Unit dollars are too granular.

We can achieve this by rescaling the `Salary` variable.

In [11]:
hr['SalaryK'] = hr['Salary']/1000

Fit Linear Regression

In [12]:
# Create OLS model object
model = smf.ols("SalaryK ~ EmpSatisfaction + Perf + Gender", hr)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_hr = print_coef_std_err(results)
results_hr

Unnamed: 0,coef,std err
Intercept,48.575266,8.58161
EmpSatisfaction,0.755005,1.639814
Perf,5.405498,2.541008
Gender,3.255389,2.865972


We can see that the coefficients and standard errors are scaled down by 1,000, just like the outcome variable.

- A difference of one point in performance corresponds to a salary difference of 5.4 thousand dollars, while a difference in gender corresponds to a salary difference of 3.26 thousand dollars.

- The initial model offered the same explanation, but with coefficient values in unit dollars (too granular). Upper management cares more about salary in thousands of dollars.

Next, we try to improve the interpretation of the intercept.
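The effect of rescaling the outcome can be verified on simulated data. The following is a minimal sketch: the simulated `perf` and `salary` arrays are hypothetical stand-ins, not the HR dataset.

```python
import numpy as np

# Hypothetical simulated data standing in for a salary regression
rng = np.random.default_rng(1)
n = 100
perf = rng.integers(1, 5, size=n).astype(float)      # scores 1..4
salary = 50_000 + 5_000 * perf + rng.normal(0, 10_000, size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), perf])

# Fit once with the outcome in dollars, once in thousands of dollars
beta_dollars, *_ = np.linalg.lstsq(X, salary, rcond=None)
beta_thousands, *_ = np.linalg.lstsq(X, salary / 1000, rcond=None)

print("coefficients (dollars):  ", beta_dollars)
print("coefficients (thousands):", beta_thousands)
```

Every coefficient in the second fit is the corresponding coefficient of the first fit divided by 1,000, because least-squares estimates are linear in the outcome.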

### **2. Make intercept explain average salary across genders instead of for female**

We can achieve this by standardizing `Gender` using 2 standard deviations

In [13]:
gender = hr['Gender']

hr['z2_Gender'] = (gender - np.mean(gender)) / (2*np.std(gender))

In [14]:
# Create OLS model object
model = smf.ols("SalaryK ~ EmpSatisfaction + Perf + z2_Gender", hr)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_hr = print_coef_std_err(results)
results_hr

Unnamed: 0,coef,std err
Intercept,49.988377,8.414986
EmpSatisfaction,0.755005,1.639814
Perf,5.405498,2.541008
z2_Gender,3.226976,2.840957


Now the intercept is 49.99 instead of the original 48.58.

- The original intercept reads as "the average salary **for female employees** is 48.58 thousand dollars when the other variables are 0".
- The improved model's intercept reads as "the average salary ***across genders*** is 49.99 thousand dollars when the other variables are 0".

- Performance and satisfaction scores are only defined on the ranges 1 - 4 and 1 - 5. The intercept is still not meaningful because neither score ever takes the value zero.
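The change in the intercept and in the gender coefficient can be reproduced by hand from the outputs shown earlier. This is a sketch that assumes `Gender` is coded 0/1, so its population standard deviation (what `np.std` computes by default) is `sqrt(p * (1 - p))`; the variable names are illustrative.

```python
import math

# Values copied from the notebook outputs above
b0 = 48.575266         # intercept of the SalaryK model with raw Gender
b_gender = 3.255389    # coefficient of raw Gender
p_male = 0.434084      # mean of Gender from hr.describe()

# Population standard deviation of a 0/1 variable with mean p
sd_gender = math.sqrt(p_male * (1 - p_male))

# Centering shifts the intercept to the average gender mix;
# dividing by 2 SD multiplies the coefficient by 2 SD
new_intercept = b0 + b_gender * p_male
new_coef = b_gender * 2 * sd_gender

print(f"intercept with z2_Gender:   {new_intercept:.3f}")
print(f"coefficient of z2_Gender:   {new_coef:.3f}")
```

Up to rounding, both values agree with the `Intercept` and `z2_Gender` rows of the refitted model. Because two standard deviations of `Gender` is about 0.99, the coefficient barely changes.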

### **3. Make intercept more meaningful by centering**

- We could center the satisfaction and performance scores by subtracting their means, but since they are ordinal variables it is better to center them on a category or baseline that makes sense.
- Suppose upper management wants the baseline defined as 'Fully Meets' for performance and a score of 3 for satisfaction
- This means we set `Fully Meets` in the performance score and a score of 3 in satisfaction as the center of each variable
    - The Performance Score will change from [1,2,3,4] to [-2,-1,0,1]
    - The Satisfaction Score will change from [1,2,3,4,5] to [-2,-1,0,1,2]

In [15]:
# Rescale Perf via centering, by using 3 as center so that 'Fully Meets' becomes baseline (or 0)
hr['Perf_centered'] = hr['Perf'] - 3

# Rescale EmpSatisfaction via centering, by using 3 as center so that a satisfaction score of 3 becomes baseline (or 0)
hr['EmpSatisfaction_centered'] = hr['EmpSatisfaction'] - 3

Then, build the regression using the variables that denote salary in thousand dollars, performance score relative to 'Fully Meets', satisfaction score relative to a score of 3, and the standardized gender variable.

In [16]:
# Create OLS model object
model = smf.ols("SalaryK ~ EmpSatisfaction_centered + Perf_centered + z2_Gender", hr)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_hr = print_coef_std_err(results)
results_hr

Unnamed: 0,coef,std err
Intercept,68.469888,2.048617
EmpSatisfaction_centered,0.755005,1.639814
Perf_centered,5.405498,2.541008
z2_Gender,3.226976,2.840957


Here's the final interpretation of the model, which tells us how satisfaction score, performance score, and gender relate to salary:
- The `intercept`, `$68.47 thousand`, is the predicted average salary, at the average gender mix, for employees with a satisfaction score of 3 and a performance score of 3 ('Fully Meets')
- The coefficient of `EmpSatisfaction`: the predictive difference comparing two employees with the same gender and performance score who differ by 1 point in satisfaction is about `$0.8 thousand`
- The coefficient of `Perf`: the predictive difference comparing two employees with the same gender and satisfaction score who differ by 1 point in performance score is about `$5.4 thousand`
- The coefficient of `Gender`: the predictive difference comparing two employees with the same performance and satisfaction scores but different gender is about `$3.2 thousand`
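The centered intercept is simply the previous model's prediction evaluated at the new baseline. A short check using the coefficients reported in the earlier outputs (variable names are illustrative):

```python
# Values copied from the model with z2_Gender and uncentered scores
b0 = 49.988377       # intercept
b_sat = 0.755005     # EmpSatisfaction coefficient
b_perf = 5.405498    # Perf coefficient

# Centering both scores at 3 moves the intercept to the prediction at
# EmpSatisfaction = 3 and Perf = 3; the slopes are unchanged
centered_intercept = b0 + b_sat * 3 + b_perf * 3

print(f"intercept after centering: {centered_intercept:.3f}")
```

Up to rounding, this agrees with the `Intercept` row of the final model (68.47).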

## **Case 2 - Kid Score vs. Mom Background**
---

A study of children's test scores attempts to relate each child's score to variables describing their mother.
A model has been fitted, but it is not very intuitive. We set out to make the interpretation more intuitive using linear transformations.

### Load Data

The dataset is read as follows.

In [18]:
kidiq = pd.read_csv("kid_iq.csv")
kidiq.head()

Unnamed: 0,kid_score,mom_hs,mom_iq,mom_work,mom_age
0,65,1,121.117529,4,27
1,98,1,89.361882,4,25
2,85,1,115.443165,4,27
3,83,1,99.449639,3,25
4,115,1,92.74571,4,27


- One of the mother's variables is maternal employment, recorded on an ordered scale from 1 to 4. However, treating `mom_work` as continuous is not appropriate, because the steps from 1 to 4 do not represent equal differences:
  - mom_work = 1: mother did not work in first three years of child’s life
  - mom_work = 2: mother worked in second or third year of child’s life
  - mom_work = 3: mother worked part-time in first year of child’s life
  - mom_work = 4: mother worked full-time in first year of child’s life.
- So, it's better to treat `mom_work` as a categorical variable

### Fit Linear Regression
- Add mom_work multilevel categorical predictor by using `C(...)`


In [19]:
# Create OLS model object
model = smf.ols("kid_score ~ mom_iq + C(mom_work)", kidiq)
# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_kid_iqhs = print_coef_std_err(results)
results_kid_iqhs

Unnamed: 0,coef,std err
Intercept,24.142256,6.142757
C(mom_work)[T.2],3.970261,2.789798
C(mom_work)[T.3],6.601399,3.239858
C(mom_work)[T.4],3.063925,2.446824
mom_iq,0.594777,0.059424



In this model `mom_work=1` is set as the baseline category, because the baseline is chosen by order and `mom_work = 1` is the smallest value.
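The indicator columns behind `C(mom_work)` can be imitated by hand. The sketch below mimics treatment (dummy) coding with the smallest level as baseline. It is an illustration with NumPy, not the formula interface's own code, and the last two values of the example array are hypothetical.

```python
import numpy as np

# First five values are from kidiq.head(); the last two are hypothetical,
# added so that every level appears at least once
mom_work = np.array([4, 4, 4, 3, 4, 1, 2])

levels = np.unique(mom_work)   # sorted levels: [1, 2, 3, 4]
baseline = levels[0]           # smallest level serves as the reference

# One 0/1 indicator column per non-baseline level
dummies = np.column_stack(
    [(mom_work == level).astype(int) for level in levels[1:]]
)

print("baseline level:", baseline)
print("columns for levels:", levels[1:])
print(dummies)
```

A row of all zeros corresponds to the baseline category, which is why each `C(mom_work)[T.k]` coefficient is read as a difference from `mom_work = 1`.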

- The `intercept`, 24, is the predicted average test score for children whose mother did not work in the first three years of the child’s life and had an IQ of 0, which is not a meaningful scenario
- The coefficient of `mom_work=2`: the predictive difference comparing test scores of children whose mother worked in the second or third year of the child’s life with children whose mother did not work in the first three years, for the same mother IQ, is 4
- The coefficient of `mom_work=3`: the predictive difference comparing test scores of children whose mother worked part-time in the first year of the child’s life with children whose mother did not work in the first three years, for the same mother IQ, is 6.6
- The coefficient of `mom_work=4`: the predictive difference comparing test scores of children whose mother worked full-time in the first year of the child’s life with children whose mother did not work in the first three years, for the same mother IQ, is 3.1
- The coefficient of `mom_iq`: the predictive difference comparing test scores of children whose mothers are in the same employment category but differ by 1 point in IQ is 0.6

The interpretation of the intercept is not insightful, since no mother has IQ = 0 in our data. A 1-point difference in `mom_iq` is also a very small unit of comparison, so we would like to express the difference in a larger unit.

### Improve Coefficient Interpretation by Standardization



In [22]:
mom_iq_mean = kidiq["mom_iq"].mean()
mom_iq_std = kidiq["mom_iq"].std()
mom_iq = kidiq["mom_iq"]

kidiq["z_mom_iq"] = (mom_iq-mom_iq_mean)/mom_iq_std

In [23]:
print(f"standard deviation of mom_iq = {mom_iq_std}")

standard deviation of mom_iq = 14.99999999999999


### Fit Linear Regression

In [24]:
# Create OLS model object
model = smf.ols("kid_score ~ z_mom_iq + C(mom_work)", kidiq)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_kid_iqhs_std = print_coef_std_err(results)

### Coefficient Interpretation

In [25]:
results_kid_iqhs_std

Unnamed: 0,coef,std err
Intercept,83.619981,2.084466
C(mom_work)[T.2],3.970261,2.789798
C(mom_work)[T.3],6.601399,3.239858
C(mom_work)[T.4],3.063925,2.446824
z_mom_iq,8.921659,0.891359


The estimated coefficients change only for the intercept and `z_mom_iq`, because of the standardization of `mom_iq`, while the coefficients of `mom_work` remain the same.

### Coefficient Interpretation

- The `intercept`, 83.6, is the predicted average test score for children whose mother did not work in the first three years of the child’s life and had an average IQ
- The coefficient of `z_mom_iq`: the predictive difference comparing test scores of children whose mothers are in the same employment category but differ by 1 standard deviation (15 points) in IQ is about 8.9
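Why only the intercept and the IQ slope change can be seen by substituting $x = \bar{x} + s_x z$ into the fitted line, where $b_0$ and $b_1$ denote the intercept and the `mom_iq` slope of the unstandardized model (the `mom_work` terms are untouched and omitted here):

$$\hat{y} = b_0 + b_1 x = b_0 + b_1(\bar{x} + s_x z) = (b_0 + b_1\bar{x}) + (b_1 s_x)\,z$$

The new slope is $b_1 s_x = 0.594777 \times 15 \approx 8.92$, which agrees with the reported `z_mom_iq` coefficient, and the new intercept is the old prediction evaluated at $x = \bar{x}$.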

### Insight
The coefficients of the model allow for different averages for the children of mothers in each category of maternal employment, evaluated at the average mother IQ.

This allows us to see that children of mothers who worked part-time in the first year after the child was born (mom_work=3) achieve the highest average test score, 83.6 + 6.6 = 90.2, because that category has the largest coefficient (6.6).
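The average for each employment category at the average mother IQ can be tabulated directly from the coefficients reported above. A small sketch with illustrative variable names:

```python
# Values copied from the standardized model output above
intercept = 83.619981   # average score for mom_work = 1 at average mom_iq

# Offsets relative to the baseline category (mom_work = 1)
offsets = {1: 0.0, 2: 3.970261, 3: 6.601399, 4: 3.063925}

# Predicted average score per category when z_mom_iq = 0
averages = {level: intercept + offset for level, offset in offsets.items()}
best_level = max(averages, key=averages.get)

for level, avg in averages.items():
    print(f"mom_work = {level}: {avg:.1f}")
print("highest average:", best_level)
```

These are point estimates only. The standard errors of the `mom_work` coefficients are large relative to the differences between categories, so the ranking should be read with caution.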

### **Standardization using externally specified parameter distribution**

- Suppose we know of a survey that concludes that, globally, mother's IQ has $\mu=115$ and $\sigma=12$.
- We are interested in explaining our model in terms of that survey's mean and standard deviation.
- We can do so by standardizing with the externally specified parameters

In [26]:
mean_mom_iq_global = 115
std_mom_iq_global = 12

kidiq['zx_mom_iq'] = (mom_iq - mean_mom_iq_global)/std_mom_iq_global

### Fit Linear Regression

In [27]:
# Create OLS model object
model = smf.ols("kid_score ~ zx_mom_iq + C(mom_work)", kidiq)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_kid_iq_x = print_coef_std_err(results)

### Coefficient Interpretation

In [28]:
results_kid_iq_x

Unnamed: 0,coef,std err
Intercept,92.54164,2.329818
C(mom_work)[T.2],3.970261,2.789798
C(mom_work)[T.3],6.601399,3.239858
C(mom_work)[T.4],3.063925,2.446824
zx_mom_iq,7.137327,0.713087


- This model has an intercept of 92.54 and a `zx_mom_iq` coefficient of 7.14.
- We interpret these values based on the specified standardization parameters, which in this case are the **global mean and standard deviation of mother's IQ**.

The interpretations for these values are:

- Based on our data, children whose mother did not work in the first three years of the child’s life and had the global average IQ (115) score 92.5 on average
- Children whose mother's IQ is one global standard deviation (12 points) higher, with the same maternal employment category, score about 7 points more on average
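Both numbers can be recovered from the unstandardized model, since standardizing with external parameters only shifts and rescales `mom_iq`. A quick check using the coefficients reported in the first kid-score model (variable names are illustrative):

```python
# Values copied from the unstandardized model output
b0_raw = 24.142256     # intercept (mom_work = 1, mom_iq = 0)
b_mom_iq = 0.594777    # slope per IQ point

# Externally specified parameters from the assumed global survey
mu_global = 115
sigma_global = 12

# Intercept moves to the prediction at mom_iq = 115;
# slope is expressed per 12 IQ points
intercept_ext = b0_raw + b_mom_iq * mu_global
slope_ext = b_mom_iq * sigma_global

print(f"intercept: {intercept_ext:.2f}")
print(f"slope per global SD: {slope_ext:.2f}")
```

Up to rounding, these agree with the `Intercept` and `zx_mom_iq` rows of the fitted model.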


Reference and dataset source:

https://www.kaggle.com/datasets/rhuebner/human-resources-data-set