In [None]:
%load_ext nb_black

# 🏢 Chicago Salaries 💰

Chicago publishes salary data for many of its public servant roles.  This data can be found [here](https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w).

### Warm Up 🥵

* What are C and epsilon used for in SVR?

----

#### Boosting 🚀

Boosting is an 'ensembling' technique.

What does ensembling mean in the context of machine learning?

In boosting, we'll build models iteratively (aka build models in a series; aka build one model after another). The overview is:

1. Build a pretty dumb model (more typically called a 'weak learner')
   * In the image doing classification, this is the first grid
2. See where that model makes mistakes
3. Build another model with a focus on not making the same mistakes again
   * In the image, this is the second/third grids.  The mistakes are enlarged in these to show that they're a priority.
4. Repeat steps 2-3 as much as you want
5. Combine the output of these models somehow
   * In the image, this is the final grid.


<img src='https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2019/06/How-Does-Boosting-Algorithm-Work-Boosting-Machine-Learning-Edureka-min-528x254.png' width='50%'>

----

This image is more focused on boosting in a regression setting.

<p align='center'><img src='https://i.imgur.com/RewteYv.png' width=70%></p>
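
The regression flavor of the steps above can be sketched by hand.  This is a minimal illustration, not how `GradientBoostingRegressor` is implemented internally: the toy sine data, the `learning_rate`, and the `n_rounds` values are all made up for the example.  Each round fits a depth-1 tree (a 'stump') to the current residuals and adds a shrunken copy of its predictions to the running total.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (made-up data, just for illustration)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # assumed value
n_rounds = 50  # assumed value

# Step 1: a deliberately weak first model (predict the mean everywhere)
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

stumps = []
for _ in range(n_rounds):
    # Step 2: see where the current ensemble is wrong
    residuals = y - prediction
    # Step 3: build another weak model that targets those mistakes
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    stumps.append(stump)
    # Step 5: combine by adding a shrunken copy of the new model's output
    prediction += learning_rate * stump.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The training error shrinks as rounds are added; whether the test error follows is what the learning rate and number of trees end up trading off.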

## Data Import and EDA

In [None]:
# !pip install gender_guesser
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

from category_encoders import LeaveOneOutEncoder
from sklearn.pipeline import Pipeline

from gender_guesser.detector import Detector

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# This is a direct link to the published data.
# The data might change (last updated Oct 2019)
# The data might be moved and this link might break
# A snapshot of the data can be found on kaggle:
# https://www.kaggle.com/chicago/chicago-citywide-payroll-data

# The commentary in this notebook also falls out of date pretty quickly... sorry
data_url = (
    "https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv?accessType=DOWNLOAD"
)
chicago = pd.read_csv(data_url)

Do some general 'get to know you' EDA

If you didn't already run an `np.nan` check, do one now to show the percent of missing values by column
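
One way to run this check is sketched below on a small made-up dataframe (the `toy` rows are invented; only the column names come from the real data).  `isna()` marks each missing cell as `True`, and the column-wise mean of those booleans is the fraction missing.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the shape of the real data
toy = pd.DataFrame(
    {
        "Salary or Hourly": ["Salary", "Hourly", "Salary", "Hourly"],
        "Annual Salary": [90000.0, np.nan, 75000.0, np.nan],
        "Hourly Rate": [np.nan, 21.5, np.nan, 40.0],
    }
)

# Fraction of missing cells per column, expressed as a percent
pct_missing = toy.isna().mean() * 100
print(pct_missing)
```

On the real data, the same expression would be applied to `chicago` instead of `toy`.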

Those percents look suspiciously related...

Let's say we only want to predict salary.  Drop the hourly data from the dataset and re-check for missing values.
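
A possible sketch of this step, again on invented rows.  It assumes the `'Salary or Hourly'` column uses the literal label `"Salary"` for salaried employees; check the value counts on the real data before relying on that.

```python
import numpy as np
import pandas as pd

# Made-up rows; the "Salary"/"Hourly" labels are an assumption about the real column
toy = pd.DataFrame(
    {
        "Salary or Hourly": ["Salary", "Hourly", "Salary", "Hourly"],
        "Typical Hours": [np.nan, 40.0, np.nan, 20.0],
        "Annual Salary": [90000.0, np.nan, 75000.0, np.nan],
        "Hourly Rate": [np.nan, 21.5, np.nan, 40.0],
    }
)

# Keep only salaried rows, then re-check the percent missing per column
salaried = toy[toy["Salary or Hourly"] == "Salary"]
print(salaried.isna().mean() * 100)
```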

We now see that the missing values were directly related to the `'Salary or Hourly'` distinction.  We can now drop some columns that don't give us any info going forward.

In [None]:
drop_cols = ["Salary or Hourly", "Typical Hours", "Hourly Rate"]
chicago = chicago.drop(columns=drop_cols)
chicago.head()

All categorical variables... 

Filter the dataframe to just full-time workers and drop the `'Full or Part-Time'` column.

If there are any NAs remaining in the salary column, drop them.
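
These two steps might look like the sketch below.  The rows are made up, and the `"F"` code for full-time is an assumption; confirm it against the value counts of the real column.

```python
import numpy as np
import pandas as pd

# Made-up rows; "F"/"P" codes are assumed, not confirmed from the real data
toy = pd.DataFrame(
    {
        "Full or Part-Time": ["F", "P", "F", "F"],
        "Annual Salary": [90000.0, 30000.0, np.nan, 75000.0],
    }
)

# Keep full-time rows and drop the now-constant column
full_time = toy[toy["Full or Part-Time"] == "F"].drop(columns=["Full or Part-Time"])

# Drop rows only where the target itself is missing
full_time = full_time.dropna(subset=["Annual Salary"])
print(full_time)
```

Using `subset=` keeps rows that are missing values in other columns, which a bare `dropna()` would throw away.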

In general, names aren't too informative for prediction.  BUT! We don't have a lot of features here, so we probably want to put our feature engineering hats on.

* Maybe we want to investigate nepotism.  If that were the case, name could be valuable and we might want to restrict to surname...
* Maybe we want to investigate if there's a gender pay gap in our data.  To do this we might try to guess the gender of the person based on their first name.  This, of course, won't be amazingly accurate... but it's kind of neat that we can do this.
    * Note, this type of feature is definitely a bit of a stretch; we're almost surely introducing some bias.  This type of feature might not be too good to use in practice unless you want to do some analysis that can generate clicks but might not have the strongest backing.

Let's go down this maybe ill-advised rabbit trail of engineering a gender column.

First, we need to isolate first name.  Below are some example names in the form we'll be working with in the dataframe. However, before we think about doing this in pandas, let's figure out how to isolate the name in a string.  

Write some code to extract the first names.

In [None]:
name = "ADRIANO,  RACQUEL ANNE"  # Expected output: 'RACQUEL'
# name = 'AFFANEH,  MAHIR A'       # Expected output: 'MAHIR'
# name = 'SPANNBAUER, ADAM M'      # Expected output: 'ADAM'
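
One possible approach in plain string land, assuming every name follows the `SURNAME,  GIVEN MIDDLE` pattern of the three examples (the helper name `first_name_from_string` is made up for this sketch):

```python
def first_name_from_string(full_name):
    """Return the first given name from a 'SURNAME,  GIVEN MIDDLE' string."""
    # Everything after the first comma is the given-name part
    given_names = full_name.split(",", 1)[1]
    # split() with no arguments ignores the variable amount of whitespace
    return given_names.split()[0]


examples = ["ADRIANO,  RACQUEL ANNE", "AFFANEH,  MAHIR A", "SPANNBAUER, ADAM M"]
for example in examples:
    print(first_name_from_string(example))
```

A name with no comma, or nothing after the comma, would raise an `IndexError` here, so the real data may need a guard.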

Now translate this to pandas and apply it to the `'Name'` column

We probably want this as a function so we can hide away all this logic.

In [None]:
def get_first_name(chitown_names):
    # some awesome pandas code here
    return first_name

In [None]:
chicago["First Name"] = get_first_name(chicago["Name"])
chicago.head()
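
A sketch of one way the function body could go, using the `.str` accessor to mirror the string logic.  The name `get_first_name_sketch` is made up to keep it separate from the exercise stub, and it assumes every name contains a comma.

```python
import pandas as pd


def get_first_name_sketch(names):
    """Pull the first given name out of a Series of 'SURNAME,  GIVEN MIDDLE' strings."""
    # Split once on the comma and keep the part after it
    given = names.str.split(",", n=1).str[1]
    # Split on whitespace and keep the first token
    return given.str.split().str[0]


names = pd.Series(["ADRIANO,  RACQUEL ANNE", "AFFANEH,  MAHIR A", "SPANNBAUER, ADAM M"])
first_names = get_first_name_sketch(names)
print(first_names)
```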

Now we need to classify these as male/female... A couple ways we could do this:

* Find a database (like Social Security or something idk) of names by gender and look up the names and label with the most common
* Use a model trained on a database like this to make predictions
   * ^One of those is `pip` installable (`!pip install gender_guesser`)
   
Below is an example on how to use it.

In [None]:
# from gender_guesser.detector import Detector

gd = Detector()
print("Title case:")
print(gd.get_gender("Candy"))
print(gd.get_gender("Scott"))
print(gd.get_gender("Tonks"))  # my dog's name (she's a lady)

# It doesn't know how to handle casing...
print("\nUpper case:")
print(gd.get_gender("CANDY"))
print(gd.get_gender("SCOTT"))
print(gd.get_gender("TONKS"))

We need to change our first names to title case to get predictions it seems.

In [None]:
# Example in string land
'THE GREAT GATSBY'.title()

Apply title casing to the first names in the dataframe

In [None]:
chicago['First Name'] = chicago['First Name'].___._____
chicago.head()

Create a new column named `'gender_guess'` by applying `gd.get_gender` to the title-cased first name column

In [None]:
chicago["gender_guess"] = chicago["First Name"]._____(____________)
chicago.head()
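
The two steps above can be sketched together on a toy dataframe.  To keep the example runnable without `gender_guesser`, it uses a made-up lookup function, `toy_gender_lookup`, as a stand-in for `gd.get_gender`; the real detector returns more labels than this.

```python
import pandas as pd


def toy_gender_lookup(first_name):
    """Hypothetical stand-in for gd.get_gender; only knows two names."""
    known = {"Candy": "female", "Scott": "male"}
    return known.get(first_name, "unknown")


toy = pd.DataFrame({"First Name": ["CANDY", "SCOTT", "TONKS"]})

# Vectorized title casing through the .str accessor
toy["First Name"] = toy["First Name"].str.title()

# Call the function once per element of the column
toy["gender_guess"] = toy["First Name"].apply(toy_gender_lookup)
print(toy)
```

With the real detector, `gd.get_gender` would be passed to `.apply` in place of `toy_gender_lookup`.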

We can now drop our name columns (unless we wanted to (1) investigate nepotism, (2) check if Bradleys make more money, or (3) do something else name related).

In [None]:
drop_cols = ["Name", "First Name"]
chicago = chicago.drop(columns=drop_cols)

Create a violin plot of `'Annual Salary'` by `'gender_guess'`.

Ah, of course, the very often forgotten gender, 'andy'.

Per documentation, andy is their shorthand for androgynous.  Let's collapse down to 3 categories: male, female, other.

In [None]:
replacements = {
    "mostly_male": "male",
    "mostly_female": "female",
    "unknown": "other",
    "andy": "other",
}

chicago["gender_guess"] = chicago["gender_guess"].replace(replacements)

# Pass x and y by keyword; newer seaborn versions reject positional data arguments
sns.violinplot(x="Annual Salary", y="gender_guess", data=chicago)
plt.show()

In our plot we might be seeing a gender pay gap, with the biggest loser being... Andy.  At least Andy has Woody and Buzz to help cope.

💥 BOOM 💥 new feature is now engineered.  Let's get back to some more on-topic stuff.

Look at the value counts for `'Job Titles'` and `'Department'`.  Spoiler: there are a lot, so create an 'other' category for both.  Decide on some cutoff point for what's too few (threshold by count, threshold by count percentile, take the top n, etc.).

* Perform the process on one of the columns
* Translate this logic into a function
* Use your function on the other column
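
Here is a minimal sketch of the threshold-by-count option.  The function name `collapse_rare`, the `min_count` cutoff, and the toy job titles are all made up for illustration; the other cutoff strategies would only change how `keep` is computed.

```python
import pandas as pd


def collapse_rare(column, min_count, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = column.value_counts()
    keep = counts[counts >= min_count].index
    # where() keeps values where the condition is True and swaps in the label elsewhere
    # (missing values are not in `keep`, so they also become other_label)
    return column.where(column.isin(keep), other_label)


titles = pd.Series(
    ["POLICE OFFICER"] * 3 + ["FIREFIGHTER"] * 2 + ["LIBRARIAN", "ARBORIST"]
)
collapsed = collapse_rare(titles, min_count=2)
print(collapsed.value_counts())
```

Because the function takes a Series and returns a Series, the same call works for both columns.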

## Model Prep

Perform a train test split stratified by our gender guess feature.

In [None]:
# Drop incomplete rows once so X, y, and the stratify column stay aligned
complete = chicago.dropna()
X = complete.drop(columns=["Annual Salary"])
y = complete["Annual Salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=X["gender_guess"],
)

Our `X` data is all categories; let's use the `LeaveOneOutEncoder()`.  If we get poor performance we can come back and drop in a `OneHotEncoder()` with fairly little effort.

* Complete the `Pipeline`
    * Fill in the category encoder
    * Fill in the gradient boosted regressor
* Fit the pipeline to the training data
* Report the scores for the training and testing data

In [None]:
# fmt: off
pipeline = Pipeline([
    ("encode_cats", _____),
    ("gbr", ____)
])
# fmt: on

pipeline.fit(X_train, y_train)

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"train_score {train_score}")
print(f"test_score {test_score}")

A pretty smart guy said this is his parameter grid for this model type (the names don't line up with sklearn's names).

<img src='https://i.stack.imgur.com/9GgQK.jpg' width='70%'>

* Grid search some hyperparams to increase performance
* Print out the best parameters from the CV

In [None]:
# Adjusted max_features/max_depth to have smaller grid
grid = {
    "gbr__subsample": _____,
    "gbr__max_features": _____,
    "gbr__max_depth": _____,
}

n_trees = 100
learning_rate = 2 / n_trees

# fmt: off
pipeline = Pipeline([
    ("encode_cats", LeaveOneOutEncoder()),
    ("gbr", GradientBoostingRegressor(n_estimators=n_trees, 
                                      learning_rate=learning_rate))
])
# fmt: on

pipeline_cv = GridSearchCV(pipeline, grid, verbose=1)
pipeline_cv.fit(X_train, y_train)

pipeline_cv.best_params_
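
For reference, here is what a filled-in grid can look like.  This is only a sketch: the values are illustrative rather than recommended, the data comes from `make_regression`, and the pipeline has no encoder step because the toy features are already numeric.  All `toy_` names are made up to avoid clobbering the notebook's variables.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic numeric data so the sketch runs on its own
toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Keys use the "<step name>__<parameter>" convention; values are illustrative only
toy_grid = {
    "gbr__subsample": [0.5, 0.75, 1.0],
    "gbr__max_features": ["sqrt", None],
    "gbr__max_depth": [2, 3],
}

toy_pipeline = Pipeline(
    [("gbr", GradientBoostingRegressor(n_estimators=20, random_state=0))]
)

toy_cv = GridSearchCV(toy_pipeline, toy_grid, cv=3)
toy_cv.fit(toy_X, toy_y)
print(toy_cv.best_params_)
```

The grid has 3 x 2 x 2 = 12 combinations, each fit once per fold, so the cost grows quickly as values are added.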

Print out the train and test scores

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"train_score {train_score}")
print(f"test_score {test_score}")

Extract the `.feature_importances_` from the gradient boosting regressor in your pipeline.  What was most important?
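
A sketch of the extraction on made-up data.  It swaps in sklearn's `OrdinalEncoder` for `LeaveOneOutEncoder` so the example has no extra dependencies, and the toy salaries depend only on department by construction.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up data where salary is fully determined by department
toy_X = pd.DataFrame(
    {
        "Department": ["POLICE", "FIRE", "POLICE", "LIBRARY", "FIRE", "LIBRARY"] * 5,
        "gender_guess": ["male", "female", "other", "female", "male", "other"] * 5,
    }
)
toy_y = toy_X["Department"].map({"POLICE": 90000, "FIRE": 95000, "LIBRARY": 60000})

# OrdinalEncoder is a stand-in for the notebook's LeaveOneOutEncoder
toy_pipeline = Pipeline(
    [
        ("encode_cats", OrdinalEncoder()),
        ("gbr", GradientBoostingRegressor(n_estimators=20, random_state=0)),
    ]
)
toy_pipeline.fit(toy_X, toy_y)

# Reach into the fitted pipeline by step name, then label the importances by column
importances = pd.Series(
    toy_pipeline.named_steps["gbr"].feature_importances_, index=toy_X.columns
).sort_values(ascending=False)
print(importances)
```

For a fitted `GridSearchCV`, the refit pipeline lives on `.best_estimator_`, so the same lookup would start from `pipeline_cv.best_estimator_.named_steps["gbr"]`.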