# Baseline model 

Ideas based on the analysis:
- Use the average salary within one of the categorical variables as the sole predictor
- Group by a combination of 2 or more (possibly all) of the categorical variables and use each group's average salary as the prediction
- Be able to incorporate the continuous variables
    + calculate the average salary for each value of the variable, subtract the overall mean salary from those averages, and add the resulting difference to the prediction
    + fit a simple single-variable OLS between each continuous variable and the salary, then combine the fitted intercept and slope with the categorical predictions somehow
        + they could be averaged, i.e. the categorical averages give one salary prediction, the years of experience gives another, and the two predicted salaries get averaged
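
The averaging idea in the last bullet could look roughly like the sketch below. The toy data is invented for illustration only, and `np.polyfit` is just one convenient way to get a single-variable OLS fit; nothing here is taken from `src.Baseline`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the real salary data).
toy = pd.DataFrame({
    "jobType": ["JUNIOR", "JUNIOR", "SENIOR", "SENIOR"],
    "yearsExperience": [0, 2, 4, 6],
    "salary": [80.0, 90.0, 100.0, 110.0],
})

# Categorical prediction: mean salary per jobType, broadcast back to each row.
category_pred = toy.groupby("jobType")["salary"].transform("mean")

# Numeric prediction: single-variable OLS of salary on yearsExperience.
slope, intercept = np.polyfit(toy["yearsExperience"], toy["salary"], deg=1)
ols_pred = intercept + slope * toy["yearsExperience"]

# Combine by averaging the two predicted salaries row by row.
combined_pred = (category_pred + ols_pred) / 2
print(combined_pred)
```

Averaging gives both predictors equal weight, which is an arbitrary choice; the additive-difference approach in the first sub-bullet is the one the rest of this notebook follows.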

*fitting the numeric data*
1. calculate the mean of the target on the data used for fitting
2. calculate the average salary for each value of the numeric variable (0-24 for years, 0-99 for miles)
3. take the difference between each grouped average and the overall average
4. store the differences, indexed by the value of the variable
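
The fitting steps above can be sketched as a small function. The name `fit_numeric_offsets` and the toy frame are hypothetical, made up for this example; the real data and class are loaded further down.

```python
import pandas as pd

def fit_numeric_offsets(data, numeric_vars, target="salary"):
    """Hypothetical helper: per-value mean of the target minus the overall mean."""
    # Step 1: overall mean of the target on the fitting data.
    overall_mean = data[target].mean()
    # Steps 2-4: grouped mean per value, difference from the overall mean, stored per variable.
    return {
        var: data.groupby(var)[target].mean() - overall_mean
        for var in numeric_vars
    }

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "yearsExperience": [0, 0, 1, 1],
    "salary": [90.0, 100.0, 110.0, 120.0],
})

offsets = fit_numeric_offsets(toy, ["yearsExperience"])
print(offsets["yearsExperience"])
```

Each stored Series is indexed by the values of its variable, so it can be joined back onto new data by that column.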

*predictions with numeric predictors*
(this runs after the category predictions are made)
1. ensure that the numeric variables are present in the data
2. loop over the self.numeric_vars list and join the stored differences with on = the looping var (key of the dict)
3. take the row sum of the category preds and the numeric diff columns
4. drop the category preds and the numeric diff columns, keeping only the row sum
5. profit
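
A minimal sketch of these prediction steps, assuming the category predictions already sit in a `salary_preds` column. The function name `add_numeric_offsets`, the `_diff` column naming, and the toy values are assumptions for illustration. Instead of dropping the category preds, the sketch overwrites that column with the row sum.

```python
import pandas as pd

def add_numeric_offsets(preds, offsets, pred_col="salary_preds"):
    """Hypothetical helper: add stored per-value differences to the category predictions."""
    out = preds.copy()
    diff_cols = []
    for var, offset in offsets.items():
        # Step 1: make sure the variable is actually in the data.
        if var not in out.columns:
            raise KeyError(f"{var} is missing from the data")
        # Step 2: join the stored differences on the looping variable.
        col = f"{var}_diff"
        out = out.join(offset.rename(col), on=var)
        diff_cols.append(col)
    # Step 3: row sum of the category preds and the diff columns.
    # sum(axis=1) skips NaN, so a value never seen in fitting adds nothing.
    out[pred_col] = out[[pred_col] + diff_cols].sum(axis=1)
    # Step 4: drop the helper columns.
    return out.drop(columns=diff_cols)

# Toy inputs, invented for illustration only.
toy_preds = pd.DataFrame({
    "yearsExperience": [0, 1],
    "milesFromMetropolis": [10, 20],
    "salary_preds": [100.0, 120.0],
})
toy_offsets = {
    "yearsExperience": pd.Series([-10.0, 10.0], index=[0, 1]),
    "milesFromMetropolis": pd.Series([5.0, -5.0], index=[10, 20]),
}

result = add_numeric_offsets(toy_preds, toy_offsets)
print(result)
```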

___

In [1]:
import pandas as pd
import numpy as np
from itertools import combinations
import seaborn as sns

from src.eda_utils import salary_per_category_table
from src.Baseline import BaselineModel

# Load data

In [2]:
train_salaries = pd.read_csv("../data/interim/salaries_train_85_15_split.csv", index_col = 0)
test_salaries = pd.read_csv("../data/interim/salaries_test_85_15_split.csv", index_col = 0)

In [3]:
print(f"training set shape: {train_salaries.shape}")
print(f"test set shape: {test_salaries.shape}")

training set shape: (850000, 8)
test set shape: (150000, 8)


# `BaselineModel` Class

*Params*
- constructor:
    + grouping_vars: list of categorical variables to group on for fitting
    + numeric_vars:  list of numerical variables to fit 
    
*Methods*
- fit():
    + For the grouping variables, this calculates the average salary per group and stores these values with the grouping variables as the index
    + For the numeric variables, this groups by each value, calculates the mean salary, subtracts the overall mean salary, and stores the difference with each value of the variable as the index
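
The source of `src.Baseline` is not shown in this notebook, so the class below is only a guess at the behaviour described above, not the actual implementation. The name `BaselineSketch`, the attribute names, and the fallback to the overall mean for unseen groups are all assumptions. It uses the `grouping_vars` parameter name from the description; some later cells call the constructor with `category_vars` instead.

```python
import pandas as pd

class BaselineSketch:
    """Hypothetical stand-in for BaselineModel; not the author's code."""

    def __init__(self, grouping_vars, numeric_vars=None, target="salary"):
        # Accept a single column name or a list of names.
        if isinstance(grouping_vars, str):
            grouping_vars = [grouping_vars]
        self.grouping_vars = list(grouping_vars)
        self.numeric_vars = list(numeric_vars or [])
        self.target = target

    def fit(self, data):
        self.overall_mean_ = data[self.target].mean()
        # Average target per group, indexed by the grouping variables.
        self.group_means_ = (
            data.groupby(self.grouping_vars)[self.target].mean().rename("salary_preds")
        )
        # Per-value mean minus the overall mean for each numeric variable.
        self.numeric_offsets_ = {
            var: data.groupby(var)[self.target].mean() - self.overall_mean_
            for var in self.numeric_vars
        }
        return self

    def predict(self, data):
        out = data.join(self.group_means_, on=self.grouping_vars)
        # Assumption: groups not seen in fitting fall back to the overall mean.
        preds = out["salary_preds"].fillna(self.overall_mean_)
        for var, offset in self.numeric_offsets_.items():
            # Assumption: values not seen in fitting contribute no offset.
            preds = preds + out[var].map(offset).fillna(0.0)
        out["salary_preds"] = preds
        return out

# Toy data, invented for illustration only.
toy_train = pd.DataFrame({
    "jobType": ["A", "A", "B", "B"],
    "yearsExperience": [0, 1, 0, 1],
    "salary": [90.0, 110.0, 190.0, 210.0],
})
toy_new = pd.DataFrame({"jobType": ["C"], "yearsExperience": [0]})

model = BaselineSketch("jobType", numeric_vars=["yearsExperience"]).fit(toy_train)
train_preds = model.predict(toy_train)
new_preds = model.predict(toy_new)
print(train_preds)
print(new_preds)
```

A left join leaves `NaN` for any group that never appeared in the fitting data, which is why the sketch fills in the overall mean; whether the real class does this is not stated here.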

In [4]:
# Test prototyping
grouping_vars = ['industry', 'major', 'jobType']

avg_values = salary_per_category_table(grouping_vars, train_salaries)

avg_values.set_index(grouping_vars, inplace=True)

print(avg_values)

# This is the predict stage
train_salaries.join(avg_values, on = grouping_vars, rsuffix = '_preds').head()

                                   salary
industry  major       jobType            
EDUCATION NONE        JANITOR   55.152589
SERVICE   NONE        JANITOR   60.129620
AUTO      NONE        JANITOR   64.680927
EDUCATION NONE        JUNIOR    69.534633
HEALTH    NONE        JANITOR   70.294427
...                                   ...
OIL       ENGINEERING CFO      177.157631
FINANCE   ENGINEERING CEO      177.809339
OIL       BUSINESS    CEO      178.989669
FINANCE   BUSINESS    CEO      185.525577
OIL       ENGINEERING CEO      187.346798

[448 rows x 1 columns]


Unnamed: 0,jobId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary,salary_preds
891719,JOB1362685299406,SENIOR,BACHELORS,PHYSICS,WEB,21,68,128,126.375962
177062,JOB1362684584749,CTO,NONE,NONE,HEALTH,1,14,116,125.447948
27080,JOB1362684434767,SENIOR,HIGH_SCHOOL,NONE,OIL,11,49,133,108.911572
546274,JOB1362684953961,CEO,NONE,NONE,AUTO,17,65,126,129.640336
832069,JOB1362685239756,JUNIOR,HIGH_SCHOOL,NONE,OIL,24,26,137,99.563506


In [28]:
test = BaselineModel(grouping_vars='jobType')

In [29]:
test.fit(train_salaries)

In [30]:
test.fitted_average_salaries

jobType,salary
JANITOR,70.80305
JUNIOR,95.29357
SENIOR,105.453358
MANAGER,115.373455
VICE_PRESIDENT,125.40885
CFO,135.419149
CTO,135.436604
CEO,145.294699


In [31]:
test.predict(test_salaries)

Unnamed: 0,jobId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary,salary_preds
987231,JOB1362685394918,CFO,HIGH_SCHOOL,NONE,WEB,12,21,174,135.419149
79954,JOB1362684487641,JANITOR,HIGH_SCHOOL,NONE,HEALTH,15,64,58,70.803050
567130,JOB1362684974817,CTO,HIGH_SCHOOL,NONE,OIL,21,4,168,135.436604
500891,JOB1362684908578,CTO,HIGH_SCHOOL,NONE,FINANCE,5,89,85,135.436604
55399,JOB1362684463086,JUNIOR,DOCTORAL,BIOLOGY,WEB,23,64,145,95.293570
...,...,...,...,...,...,...,...,...,...
484822,JOB1362684892509,SENIOR,MASTERS,ENGINEERING,HEALTH,21,12,189,105.453358
902986,JOB1362685310673,JANITOR,HIGH_SCHOOL,NONE,OIL,4,28,100,70.803050
138960,JOB1362684546647,VICE_PRESIDENT,NONE,NONE,SERVICE,6,11,87,125.408850
895087,JOB1362685302774,SENIOR,DOCTORAL,CHEMISTRY,EDUCATION,10,13,109,105.453358


In [15]:
class Test:
    def __init__(self, target = 'salary'):
        self.target = target
    
    def func(self, data):
        return data[[self.target]]

Test().func(train_salaries)

Unnamed: 0,salary
891719,128
177062,116
27080,133
546274,126
832069,137
...,...
259178,113
365838,109
131932,123
671155,165


In [34]:
avg_salary_overall = train_salaries.salary.mean()

avg_salary_by_year = train_salaries.groupby('yearsExperience').salary.mean()
print(avg_salary_by_year)

# These are the fitted values
print("\n\nThese are the fitted values")
years_diff = avg_salary_by_year - avg_salary_overall
years_diff

yearsExperience
0      91.968565
1      94.039934
2      96.141088
3      97.979226
4     100.037532
5     101.686603
6     103.865457
7     106.107381
8     107.839315
9     110.075783
10    112.134154
11    113.919876
12    116.347211
13    117.870305
14    120.354740
15    121.888924
16    124.041466
17    126.013482
18    128.244769
19    130.103428
20    132.110810
21    134.127544
22    136.309815
23    138.460865
24    140.006013
Name: salary, dtype: float64


These are the fitted values


yearsExperience
0    -24.075262
1    -22.003893
2    -19.902739
3    -18.064601
4    -16.006295
5    -14.357224
6    -12.178370
7     -9.936446
8     -8.204512
9     -5.968044
10    -3.909673
11    -2.123951
12     0.303384
13     1.826478
14     4.310913
15     5.845097
16     7.997639
17     9.969655
18    12.200942
19    14.059601
20    16.066983
21    18.083717
22    20.265988
23    22.417038
24    23.962186
Name: salary, dtype: float64

In [27]:
avg_salary_by_miles = train_salaries.groupby('milesFromMetropolis').salary.mean()

print(avg_salary_by_miles)
miles_diff = avg_salary_by_miles - avg_salary_overall
miles_diff

milesFromMetropolis
0     135.537307
1     135.483163
2     134.640791
3     134.824962
4     135.117070
         ...    
95     98.076253
96     97.344226
97     97.329820
98     96.816241
99     95.816550
Name: salary, Length: 100, dtype: float64


milesFromMetropolis
0     19.493479
1     19.439336
2     18.596964
3     18.781135
4     19.073243
        ...    
95   -17.967574
96   -18.699601
97   -18.714007
98   -19.227586
99   -20.227277
Name: salary, Length: 100, dtype: float64

In [26]:
miles_diff.loc[68]

-7.364727832926263

In [35]:
train_salaries.join(miles_diff, on = 'milesFromMetropolis', rsuffix = '_miles_diff').join(years_diff, on = 'yearsExperience', rsuffix = "_years_diff")

Unnamed: 0,jobId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary,salary_miles_diff,salary_years_diff
891719,JOB1362685299406,SENIOR,BACHELORS,PHYSICS,WEB,21,68,128,-7.364728,18.083717
177062,JOB1362684584749,CTO,NONE,NONE,HEALTH,1,14,116,14.061349,-22.003893
27080,JOB1362684434767,SENIOR,HIGH_SCHOOL,NONE,OIL,11,49,133,0.081878,-2.123951
546274,JOB1362684953961,CEO,NONE,NONE,AUTO,17,65,126,-6.588774,9.969655
832069,JOB1362685239756,JUNIOR,HIGH_SCHOOL,NONE,OIL,24,26,137,9.745740,23.962186
...,...,...,...,...,...,...,...,...,...,...
259178,JOB1362684666865,CTO,HIGH_SCHOOL,NONE,EDUCATION,1,34,113,6.414863,-22.003893
365838,JOB1362684773525,VICE_PRESIDENT,BACHELORS,COMPSCI,EDUCATION,2,38,109,4.619681,-19.902739
131932,JOB1362684539619,SENIOR,BACHELORS,ENGINEERING,EDUCATION,8,19,123,12.417469,-8.204512
671155,JOB1362685078842,CEO,MASTERS,COMPSCI,WEB,9,30,165,8.469702,-5.968044


In [40]:
train_salaries.groupby('yearsExperience')['salary'].mean()

yearsExperience
0      91.968565
1      94.039934
2      96.141088
3      97.979226
4     100.037532
5     101.686603
6     103.865457
7     106.107381
8     107.839315
9     110.075783
10    112.134154
11    113.919876
12    116.347211
13    117.870305
14    120.354740
15    121.888924
16    124.041466
17    126.013482
18    128.244769
19    130.103428
20    132.110810
21    134.127544
22    136.309815
23    138.460865
24    140.006013
Name: salary, dtype: float64

## Run through numeric fit

In [3]:
avg_salary_overall = train_salaries.salary.mean()
avg_salary_overall

116.04382705882352

In [4]:
numeric_vars = ['yearsExperience', 'milesFromMetropolis']

fitted_numeric = {column:None for column in numeric_vars}
fitted_numeric

{'yearsExperience': None, 'milesFromMetropolis': None}

In [7]:
# loop over keys, get the grouped mean, subtract the overall mean from it, store it
for key in fitted_numeric.keys():
    avg = (train_salaries.groupby(key)['salary'].mean()) - avg_salary_overall
    fitted_numeric[key] = avg
    
print(fitted_numeric['yearsExperience'])
print(fitted_numeric['milesFromMetropolis'])

yearsExperience
0    -24.075262
1    -22.003893
2    -19.902739
3    -18.064601
4    -16.006295
5    -14.357224
6    -12.178370
7     -9.936446
8     -8.204512
9     -5.968044
10    -3.909673
11    -2.123951
12     0.303384
13     1.826478
14     4.310913
15     5.845097
16     7.997639
17     9.969655
18    12.200942
19    14.059601
20    16.066983
21    18.083717
22    20.265988
23    22.417038
24    23.962186
Name: salary, dtype: float64
milesFromMetropolis
0     19.493479
1     19.439336
2     18.596964
3     18.781135
4     19.073243
        ...    
95   -17.967574
96   -18.699601
97   -18.714007
98   -19.227586
99   -20.227277
Name: salary, Length: 100, dtype: float64


In [8]:
fitted_numeric

{'yearsExperience': yearsExperience
 0    -24.075262
 1    -22.003893
 2    -19.902739
 3    -18.064601
 4    -16.006295
 5    -14.357224
 6    -12.178370
 7     -9.936446
 8     -8.204512
 9     -5.968044
 10    -3.909673
 11    -2.123951
 12     0.303384
 13     1.826478
 14     4.310913
 15     5.845097
 16     7.997639
 17     9.969655
 18    12.200942
 19    14.059601
 20    16.066983
 21    18.083717
 22    20.265988
 23    22.417038
 24    23.962186
 Name: salary, dtype: float64,
 'milesFromMetropolis': milesFromMetropolis
 0     19.493479
 1     19.439336
 2     18.596964
 3     18.781135
 4     19.073243
         ...    
 95   -17.967574
 96   -18.699601
 97   -18.714007
 98   -19.227586
 99   -20.227277
 Name: salary, Length: 100, dtype: float64}

## Run through numeric predict

In [10]:
# loop through keys, join on the key name
preds = test_salaries.copy()
for key in fitted_numeric.keys():
    column_suffix = f"_{key}_diff"
    preds = preds.join(fitted_numeric[key], on = key, rsuffix = column_suffix)
    
preds

Unnamed: 0,jobId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary,salary_yearsExperience_diff,salary_milesFromMetropolis_diff
987231,JOB1362685394918,CFO,HIGH_SCHOOL,NONE,WEB,12,21,174,0.303384,10.699932
79954,JOB1362684487641,JANITOR,HIGH_SCHOOL,NONE,HEALTH,15,64,58,5.845097,-5.445622
567130,JOB1362684974817,CTO,HIGH_SCHOOL,NONE,OIL,21,4,168,18.083717,19.073243
500891,JOB1362684908578,CTO,HIGH_SCHOOL,NONE,FINANCE,5,89,85,-14.357224,-15.314648
55399,JOB1362684463086,JUNIOR,DOCTORAL,BIOLOGY,WEB,23,64,145,22.417038,-5.445622
...,...,...,...,...,...,...,...,...,...,...
484822,JOB1362684892509,SENIOR,MASTERS,ENGINEERING,HEALTH,21,12,189,18.083717,15.124927
902986,JOB1362685310673,JANITOR,HIGH_SCHOOL,NONE,OIL,4,28,100,-16.006295,8.683477
138960,JOB1362684546647,VICE_PRESIDENT,NONE,NONE,SERVICE,6,11,87,-12.178370,15.127079
895087,JOB1362685302774,SENIOR,DOCTORAL,CHEMISTRY,EDUCATION,10,13,109,-3.909673,14.453634


In [3]:
BL = BaselineModel(category_vars = 'jobType', numeric_vars = ['yearsExperience', 'milesFromMetropolis'])

BL.fit(train_salaries)

In [4]:
BL.fitted_category_salaries

jobType,salary
JANITOR,70.80305
JUNIOR,95.29357
SENIOR,105.453358
MANAGER,115.373455
VICE_PRESIDENT,125.40885
CFO,135.419149
CTO,135.436604
CEO,145.294699


In [5]:
BL.fitted_numeric_salaries

{'yearsExperience': yearsExperience
 0    -24.075262
 1    -22.003893
 2    -19.902739
 3    -18.064601
 4    -16.006295
 5    -14.357224
 6    -12.178370
 7     -9.936446
 8     -8.204512
 9     -5.968044
 10    -3.909673
 11    -2.123951
 12     0.303384
 13     1.826478
 14     4.310913
 15     5.845097
 16     7.997639
 17     9.969655
 18    12.200942
 19    14.059601
 20    16.066983
 21    18.083717
 22    20.265988
 23    22.417038
 24    23.962186
 Name: salary, dtype: float64,
 'milesFromMetropolis': milesFromMetropolis
 0     19.493479
 1     19.439336
 2     18.596964
 3     18.781135
 4     19.073243
         ...    
 95   -17.967574
 96   -18.699601
 97   -18.714007
 98   -19.227586
 99   -20.227277
 Name: salary, Length: 100, dtype: float64}

In [6]:
preds = BL.predict(test_salaries)
preds

Unnamed: 0,jobId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary,salary_preds,salary_yearsExperience_diff,salary_milesFromMetropolis_diff
987231,JOB1362685394918,CFO,HIGH_SCHOOL,NONE,WEB,12,21,174,135.419149,0.303384,10.699932
79954,JOB1362684487641,JANITOR,HIGH_SCHOOL,NONE,HEALTH,15,64,58,70.803050,5.845097,-5.445622
567130,JOB1362684974817,CTO,HIGH_SCHOOL,NONE,OIL,21,4,168,135.436604,18.083717,19.073243
500891,JOB1362684908578,CTO,HIGH_SCHOOL,NONE,FINANCE,5,89,85,135.436604,-14.357224,-15.314648
55399,JOB1362684463086,JUNIOR,DOCTORAL,BIOLOGY,WEB,23,64,145,95.293570,22.417038,-5.445622
...,...,...,...,...,...,...,...,...,...,...,...
484822,JOB1362684892509,SENIOR,MASTERS,ENGINEERING,HEALTH,21,12,189,105.453358,18.083717,15.124927
902986,JOB1362685310673,JANITOR,HIGH_SCHOOL,NONE,OIL,4,28,100,70.803050,-16.006295,8.683477
138960,JOB1362684546647,VICE_PRESIDENT,NONE,NONE,SERVICE,6,11,87,125.408850,-12.178370,15.127079
895087,JOB1362685302774,SENIOR,DOCTORAL,CHEMISTRY,EDUCATION,10,13,109,105.453358,-3.909673,14.453634


In [9]:
diff_cols = preds.loc[:,['salary_yearsExperience_diff', 'salary_milesFromMetropolis_diff']]
diff_cols

Unnamed: 0,salary_yearsExperience_diff,salary_milesFromMetropolis_diff
987231,0.303384,10.699932
79954,5.845097,-5.445622
567130,18.083717,19.073243
500891,-14.357224,-15.314648
55399,22.417038,-5.445622
...,...,...
484822,18.083717,15.124927
902986,-16.006295,8.683477
138960,-12.178370,15.127079
895087,-3.909673,14.453634


In [14]:
diff_cols.sum(axis = 1)

987231    11.003316
79954      0.399475
567130    37.156960
500891   -29.671873
55399     16.971416
            ...    
484822    33.208643
902986    -7.322818
138960     2.948708
895087    10.543961
835147   -19.005065
Length: 150000, dtype: float64
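
The row sum above covers only the numeric differences; the remaining step is to add it to the category prediction. A minimal sketch using the first two rows of the `preds` output, copied by hand. The column name `salary_final` is made up for this example.

```python
import pandas as pd

# First two rows of the `preds` output shown above, copied by hand.
preds_sample = pd.DataFrame(
    {
        "salary_preds": [135.419149, 70.803050],
        "salary_yearsExperience_diff": [0.303384, 5.845097],
        "salary_milesFromMetropolis_diff": [10.699932, -5.445622],
    },
    index=[987231, 79954],
)

diff_names = ["salary_yearsExperience_diff", "salary_milesFromMetropolis_diff"]

# Final prediction = category average + sum of the numeric differences,
# then drop the intermediate columns.
final = preds_sample.assign(
    salary_final=preds_sample["salary_preds"] + preds_sample[diff_names].sum(axis=1)
).drop(columns=["salary_preds"] + diff_names)
print(final)
```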