<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Baltimore Salaries

_Authors: Greg Baker (SYD)_

---

The City of Baltimore publishes data about all of its employees, including their salaries. An employee's annual salary can differ from their gross pay: perhaps they work overtime and earn more than their official salary, or perhaps they are employed for only part of the year and earn less.

In this lab, we'll estimate what a typical city employee's gross pay will be based on their annual salary.

In [None]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt

## Read the Data Set

The Baltimore salaries data set is available in `datasets/Baltimore_City_Employee_Salaries_2011.csv`. You can use Column 0 as an index. Column 4 is a date.

In [None]:
# A:
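
One possible shape for the read step, as a minimal sketch. It assumes the file has a header row and that the dates are in a format pandas can infer. The inline sample and every column name other than `AnnualSalary` and `GrossPay` are made up for illustration; in the lab, pass the CSV path instead of the `StringIO` buffer.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay\n"
    "Doe Jane,Clerk,A01,Finance,10/24/1979,$50000.00,$52000.50\n"
    "Roe Sam,Driver,B02,Transport,03/01/2005,$31200.00,$30900.00\n"
    "Poe Lee,Analyst,C03,Planning,02/14/2011,$64000.00,$21000.00\n"
)

# index_col=0 uses the first column as the index; parse_dates=[4] asks pandas
# to parse the fifth column of the file (position 4) as dates.
salaries = pd.read_csv(sample_csv, index_col=0, parse_dates=[4])
print(salaries.dtypes)
```

Checking `dtypes` right after reading is a cheap way to confirm that the date column was actually parsed rather than left as strings.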

## Pre-Process the Data (Convert Strings to Numbers)

The `AnnualSalary` and `GrossPay` columns are strings and start with a `$`. Strip this off and convert these columns to floats.

In [None]:
# A:
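
A minimal sketch of the conversion, assuming every value is either missing or a plain number with a leading `$` and no thousands separators. The two-row data frame is a hypothetical stand-in for the lab's data.

```python
import pandas as pd

# Hypothetical sample with the two string columns described in the text.
salaries = pd.DataFrame(
    {
        "AnnualSalary": ["$50000.00", "$31200.00"],
        "GrossPay": ["$52000.50", "$12000.00"],
    }
)

for column in ["AnnualSalary", "GrossPay"]:
    # Drop the leading dollar sign, then convert the remaining text to floats.
    salaries[column] = salaries[column].str.lstrip("$").astype(float)

print(salaries.dtypes)
```

If the real file turns out to contain other formatting, such as commas, the `astype(float)` call will raise an error, which is a useful signal that more cleaning is needed.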

## Perform Exploratory Data Analysis

Create a scatterplot of annual salary versus gross pay.

In [None]:
# A:

## Look for a Linear Relationship

It seems like there is a linear relationship in there, but it's obscured by a lot of noise.

Split the data into training and testing data sets.

In [None]:
# A:
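
A sketch of the split using `train_test_split`. The synthetic salaries, the 25% test size, and the `random_state` are assumptions for illustration, not values prescribed by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data: gross pay is roughly 90% of salary.
rng = np.random.default_rng(0)
annual_salary = rng.uniform(20_000, 120_000, size=200)
gross_pay = 0.9 * annual_salary + rng.normal(0, 5_000, size=200)

# scikit-learn expects a 2-D feature matrix, so reshape the single feature.
X = annual_salary.reshape(-1, 1)
y = gross_pay

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible, which makes the later model comparisons easier to repeat.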

## Plot Ordinary Least Squares

The errors in the graph above don't look evenly balanced, which doesn't bode well for ordinary least squares.

Let's see what it gives us: Import `sklearn.linear_model`, create an ordinary least squares regressor, and train it.

In [None]:
# A:
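
A minimal sketch of fitting ordinary least squares with `LinearRegression`. The synthetic data is an assumption for illustration; in the lab, fit on your training set instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_train = rng.uniform(20_000, 120_000, size=200).reshape(-1, 1)
y_train = 0.9 * X_train[:, 0] + rng.normal(0, 5_000, size=200)

ols = LinearRegression()
ols.fit(X_train, y_train)

# coef_ holds one slope per feature; intercept_ is the constant term.
print(ols.coef_, ols.intercept_)
```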

### Visualize

Plot the test data, then plot the predictions from the linear model over it. OLS will generally predict a gross pay that's a little too high.

In [None]:
# A:

### Measure

Initially, let's look at three metrics to understand how well this line represents the data.

- Calculate the $R^2$ score for the predictions it made.
- Calculate the median absolute error.
- Calculate the mean absolute error.

Remember that `sklearn.metrics` has functions for all of these.

In [None]:
# A:
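
The three metrics can be sketched on a small hand-made example. The numbers below are invented so that one prediction misses badly; they are not from the Baltimore data.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical actual and predicted gross pay; the last prediction is far off.
y_true = np.array([50_000.0, 60_000.0, 70_000.0, 80_000.0])
y_pred = np.array([52_000.0, 59_000.0, 70_000.0, 90_000.0])

r2 = r2_score(y_true, y_pred)
median_ae = median_absolute_error(y_true, y_pred)
mean_ae = mean_absolute_error(y_true, y_pred)

print(f"R^2: {r2:.2f}")
print(f"Median absolute error: {median_ae:.0f}")
print(f"Mean absolute error: {mean_ae:.0f}")
```

In this example the single large miss pulls the mean absolute error well above the median absolute error, which is why it helps to look at both when the data contains outliers.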

## Robust Regression

Perform the same analysis using Theil-Sen, RANSAC, and Huber.

### Theil-Sen

Train the Theil-Sen regressor, plot its predictions for the testing data, and calculate the three metrics above. You can copy and paste most of the code you wrote.

Expect the $R^2$ score, and perhaps the other metrics, to be worse, even though the fit looks better.

In [None]:
# A:

### RANSAC

Perform this analysis again using RANSAC.

In [None]:
# A:

### Huber

Note: `HuberRegressor` was added in scikit-learn 0.18, so if you are running an older version, you might not have the option to create a Huber regressor.

In [None]:
# A:
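
The three robust regressors in this section share the same `fit`/`predict` interface, so one loop can train them all. The sketch below uses synthetic data in which 10% of employees earn far less than their salary; expressing amounts in thousands of dollars and raising `max_iter` for Huber are assumptions made here to keep the optimizer well behaved, not requirements of the lab.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    RANSACRegressor,
    TheilSenRegressor,
)

# Synthetic data in thousands of dollars: gross pay is about 90% of salary.
rng = np.random.default_rng(0)
annual_salary = rng.uniform(20, 120, size=200)
gross_pay = 0.9 * annual_salary + rng.normal(0, 2, size=200)
gross_pay[:20] *= 0.3  # pretend 10% of staff worked only part of the year

X = annual_salary.reshape(-1, 1)

models = {
    "Theil-Sen": TheilSenRegressor(random_state=0),
    "RANSAC": RANSACRegressor(random_state=0),
    "Huber": HuberRegressor(max_iter=1000),
}

slopes = {}
for name, model in models.items():
    model.fit(X, gross_pay)
    # RANSAC wraps a base estimator; the fitted line lives on estimator_.
    fitted = model.estimator_ if name == "RANSAC" else model
    slopes[name] = float(fitted.coef_[0])
    print(name, round(slopes[name], 3))
```

Each fitted model can then be passed through the same plotting and metric code used for OLS.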

## Review

- Which model had the highest $R^2$ score? Why is this obvious?
- Which model had the lowest median absolute error?
- Which model had the lowest mean absolute error?

In [None]:
# A:

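- OLS has the highest $R^2$ score on the training data by construction: it minimizes the sum of squared errors, which is equivalent to maximizing $R^2$. On the testing set it will usually, but not necessarily, still come out on top.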
- Huber usually wins on median absolute error and mean absolute error.

# Commercial Analysis

Say that you are the City of Baltimore's hiring manager. New employees regularly ask how much they are actually likely to earn given the salary to which they are about to agree.

You don't want to give an answer that is too high because you might be putting the city at risk for a lawsuit for misrepresenting the job. On the other hand, you don't want to give an answer that's too low because the candidate might pass up on the job and work elsewhere.

You decide that it will cost \$0.05 in lawsuit risk for each dollar you overrepresent, but only \$0.01 for each dollar you underrepresent.

E.g., if a candidate is actually likely to earn \$100,000 and you say \$120,000, the \$20,000 overstatement is worth \$1,000 in potential lawsuits for misrepresentation. If you say \$80,000, the \$20,000 understatement will cost you \$200 in potential recruiters' fees to find someone else.

## Evaluate Existing Models

You will need to choose between the four models you've built. Select the one that costs the city the least amount of money if you were to use it on all of the employees in your testing set.

Write a scoring function that returns the total dollar cost given an estimator, an $X$ testing set, and a $Y$ testing set.

In [None]:
# A:
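
One way such a scoring function might look, as a sketch. The names `dollar_cost`, `OVER_COST`, `UNDER_COST`, and the `ConstantGuess` stand-in estimator are hypothetical; the two rates come from the cost assumptions stated above.

```python
import numpy as np

OVER_COST = 0.05   # cost per dollar of overrepresented pay
UNDER_COST = 0.01  # cost per dollar of underrepresented pay


def dollar_cost(estimator, X_test, y_test):
    """Total cost of quoting the estimator's predictions for the test set."""
    predicted = np.asarray(estimator.predict(X_test), dtype=float)
    actual = np.asarray(y_test, dtype=float)
    error = predicted - actual
    over = np.clip(error, 0, None)    # dollars quoted above actual pay
    under = np.clip(-error, 0, None)  # dollars quoted below actual pay
    return float(OVER_COST * over.sum() + UNDER_COST * under.sum())


class ConstantGuess:
    """Hypothetical stand-in for a fitted regressor: always quotes one value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)


# One employee who actually earns $100,000, quoted too high and too low.
X_one = np.array([[100_000.0]])
y_one = np.array([100_000.0])
cost_high = dollar_cost(ConstantGuess(120_000.0), X_one, y_one)
cost_low = dollar_cost(ConstantGuess(80_000.0), X_one, y_one)
print(cost_high, cost_low)
```

Because the function only relies on `predict`, any of the fitted scikit-learn regressors from this lab can be passed in unchanged.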


### Score the Four Models Using This Function

- OLS
- RANSAC
- Theil-Sen
- Huber

In [None]:
# A:


# Optional: Find the Optimal Coefficient

Note: gradient descent would quickly find the best coefficient to minimize the dollar risk to the city. If you're familiar with this technique, feel free to use it here.

Alternatively, you can brute-force through small ranges of coefficients to create a linear model that poses the least dollar risk.

Remember that you can set the `coef_` and `intercept_` attributes instead of training a regressor.

In [None]:
# A:
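
A brute-force search could be sketched as follows. It assumes a line through the origin (intercept fixed at zero), a hypothetical slope grid from 0.5 to 1.2, and synthetic data; the helper `dollar_cost` here takes predictions directly and is an illustrative name, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def dollar_cost(predicted, actual):
    """Cost at $0.05 per dollar over and $0.01 per dollar under."""
    error = predicted - actual
    over = np.clip(error, 0, None)
    under = np.clip(-error, 0, None)
    return float(0.05 * over.sum() + 0.01 * under.sum())


# Synthetic stand-in for the data: gross pay is roughly 90% of salary.
rng = np.random.default_rng(0)
X = rng.uniform(20_000, 120_000, size=500).reshape(-1, 1)
y = 0.9 * X[:, 0] + rng.normal(0, 5_000, size=500)

# Set the fitted attributes by hand instead of calling fit().
model = LinearRegression()
model.intercept_ = 0.0  # assumption: search over slopes only

best_slope, best_cost = None, np.inf
for slope in np.linspace(0.5, 1.2, 71):
    model.coef_ = np.array([slope])
    cost = dollar_cost(model.predict(X), y)
    if cost < best_cost:
        best_slope, best_cost = float(slope), cost

print(best_slope, round(best_cost, 2))
```

Because overstating pay is five times as expensive as understating it, the cheapest slope on this synthetic data lands below the 0.9 used to generate it.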


# Optional (2): Improve the Model

One factor that will make a big difference to an employee's gross pay is whether they were employed for the whole year.

Can you improve the model if you exclude recent hires?

In [None]:
# A:
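
Filtering on the hire date is one way to exclude recent hires. In this sketch the `HireDate` column name, the sample rows, and the cutoff date are all hypothetical; pick a cutoff that matches the pay period the data actually covers.

```python
import pandas as pd

# Hypothetical sample; the last employee was hired partway through the year.
staff = pd.DataFrame(
    {
        "HireDate": pd.to_datetime(["1979-10-24", "2005-03-01", "2011-02-14"]),
        "AnnualSalary": [50_000.0, 31_200.0, 64_000.0],
        "GrossPay": [52_000.5, 30_900.0, 21_000.0],
    }
)

# Hypothetical cutoff: keep only employees hired before this date.
cutoff = pd.Timestamp("2010-07-01")
full_year = staff[staff["HireDate"] < cutoff]
print(len(staff), len(full_year))
```

The filtered data frame can then go through the same split, fit, and scoring steps to see whether the metrics improve.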
