# Exercise:
Can we predict a person's income just by looking at Age - Salary data?
Let's try to find out.

### Data
Let's look at the "HR Employee Attrition and Performance" dataset, provided by IBM for predicting attrition, and plot the Age - MonthlyIncome relation:

In [None]:
import numpy as np
import pandas as pd
from ggplot import *

income_raw_data = pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [None]:
ggplot(aes(x='Age', y='MonthlyIncome'), data = income_raw_data) + \
    geom_point() + \
    theme_bw() + \
    ggtitle('Monthly Income by Age - Raw Data')

There are other variables that have a very strong impact on MonthlyIncome (like JobRole - see graph below), but let's skip them on purpose.

In [None]:
ggplot( aes(x='Age', y='MonthlyIncome', colour='JobRole'), data = income_raw_data) + \
    geom_point() + \
    stat_smooth(span=0.3, level=0.95) + \
    theme_bw() + \
    ggtitle('Monthly Income by age grouped by JobRole')

Let's try to create some age bins and see if we'll get any trend here:

In [None]:
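# Age bin edges; pd.cut uses right-closed intervals, so age 18 falls into bin 0
bins = [-float('inf'), 18, 25, 35, 55, float('inf')]
labels = range(0, len(bins) - 1)
income_raw_data['AgeRange'] = pd.cut(income_raw_data['Age'], bins = bins, labels = labels, retbins=False)
# Min, max and mean MonthlyIncome per age bin
range_stats = income_raw_data[['MonthlyIncome', 'AgeRange']].groupby('AgeRange').agg(['min', 'max', 'mean'])
range_stats.columns = range_stats.columns.get_level_values(1)
# Wide form (one row per bin), reused later for merging
range_stats_to_merge = range_stats.reset_index()
# Long form (one row per bin and statistic); the value column holds min, max and mean despite its name
range_stats = range_stats.stack()
range_stats = range_stats.reset_index()
range_stats.columns = ['AgeRange', 'stat', 'meanMonthlyIncome']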

In [None]:
ggplot(aes(x = 'AgeRange', y='MonthlyIncome'), data = income_raw_data) + \
    geom_boxplot() + \
    theme_bw() + \
    ggtitle('Distribution of MonthlyIncome in age bins')

We can observe a positive trend, but the variability is very high and increases with age. The variability is much larger than the trend, so it is not clear that using age as the only dimension makes sense for this data. But let's try to do the best we can.
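
One way to put numbers on this observation is to compare the change in the mean between neighbouring bins with the standard deviation inside each bin. The sketch below is only an illustration: `toy` is a made-up frame standing in for `income_raw_data`, and `spread_by_bin` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for income_raw_data (values are made up).
toy = pd.DataFrame({
    'AgeRange': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'MonthlyIncome': [2000, 2500, 3000, 2500, 4000, 8500, 3000, 7000, 17000],
})

def spread_by_bin(df, bin_col='AgeRange', value_col='MonthlyIncome'):
    """Return the per-bin mean and standard deviation of value_col."""
    return df.groupby(bin_col)[value_col].agg(['mean', 'std'])

spread = spread_by_bin(toy)
# Change of the mean from one bin to the next, to set against the 'std' column.
spread['mean_step'] = spread['mean'].diff()
print(spread)
```

If `std` is consistently larger than `mean_step`, the spread inside a bin dominates the trend across bins. On the real data the same helper could be called as `spread_by_bin(income_raw_data)`.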

## Naive prediction
Let's try a naive approach: compute the mean income in each age bin, measure how far each person is from the mean of their own bin, and then apply that same distance to the means of the other bins to estimate the person's income there.

In [None]:
data_to_merge = income_raw_data[['Age', 'AgeRange', 'EmployeeNumber', 'MonthlyIncome']].copy()
data_to_merge['OriginalAgeRange'] = data_to_merge['AgeRange']
# Attach the statistics of each person's own age bin
data_to_merge = data_to_merge.merge(range_stats_to_merge, on='AgeRange')
# Offset of each person's income from the mean of their own bin
data_to_merge['distance_to_mean'] = data_to_merge['MonthlyIncome'] - data_to_merge['mean']
predictions_users = data_to_merge[['EmployeeNumber', 'Age', 'OriginalAgeRange', 'MonthlyIncome', 'distance_to_mean']].copy()
# Cross join on a constant key: pair every person with every age bin
predictions_users['key'] = 1
predictions_ranges = range_stats_to_merge[['AgeRange', 'mean']].copy()
predictions_ranges['key'] = 1
predictions_users = predictions_users.merge(predictions_ranges, on='key')
# Predicted income in a target bin = mean of that bin + the person's offset
predictions_users['PredictedIncome'] = predictions_users['mean'] + predictions_users['distance_to_mean']
predictions_users = predictions_users[['EmployeeNumber', 'Age', 'OriginalAgeRange', 'MonthlyIncome', 'AgeRange', 'PredictedIncome']]

In [None]:
predictions_users.head(10)

## Validation
Unfortunately, there is no time-series data at per-person resolution to validate the predictions.

One validation could be to apply some heuristics, for example that income cannot be negative. Here we can observe that backward predictions do not work well and most likely should be skipped.

Another option would be to compare the original and predicted distributions within buckets. In the graph below we can observe that the first bucket has a much larger variance than in the original data.
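
Both checks can also be expressed numerically. The sketch below is an assumption about how this could be done, not code from the original notebook: `toy_predictions` is a made-up frame with the same columns as `predictions_users`, built by the naive rule above from four invented employees, and `validate_predictions` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as predictions_users.
# Bin means are 2000 (bin 0) and 7000 (bin 1); all values are made up.
toy_predictions = pd.DataFrame({
    'EmployeeNumber':   [1, 1, 2, 2, 3, 3, 4, 4],
    'OriginalAgeRange': [0, 0, 0, 0, 1, 1, 1, 1],
    'AgeRange':         [0, 1, 0, 1, 0, 1, 0, 1],
    'MonthlyIncome':    [1500, 1500, 2500, 2500, 3000, 3000, 11000, 11000],
    'PredictedIncome':  [1500.0, 6500.0, 2500.0, 7500.0, -2000.0, 3000.0, 6000.0, 11000.0],
})

def validate_predictions(df):
    """Count negative predictions and compare per-bin spread of original vs predicted income."""
    negatives = int((df['PredictedIncome'] < 0).sum())
    # Original spread: one row per employee, grouped by the bin they actually belong to.
    originals = df.drop_duplicates('EmployeeNumber')
    original_std = originals.groupby('OriginalAgeRange')['MonthlyIncome'].std()
    # Predicted spread: all predictions that land in a given target bin.
    predicted_std = df.groupby('AgeRange')['PredictedIncome'].std()
    comparison = pd.DataFrame({'original_std': original_std,
                               'predicted_std': predicted_std})
    return negatives, comparison

negatives, comparison = validate_predictions(toy_predictions)
print('negative predictions:', negatives)
print(comparison)
```

On the real data the call would be `validate_predictions(predictions_users)`. A bin where `predicted_std` is much larger than `original_std` is one where the naive rule distorts the distribution.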



In [None]:
ggplot(aes(x = 'AgeRange', y='PredictedIncome'), data = predictions_users) + \
    geom_boxplot() + \
    theme_bw() + \
    ggtitle('Distribution of predicted MonthlyIncome in age bins')

## Better estimation approach

A variation of the naive algorithm could be to use a person's relative position between the min and max of their bin, instead of the distance to the mean, and then rescale that position to the other bins. This approach should eliminate negative incomes and keep the within-bucket distributions similar to the original ones.
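
A minimal sketch of this idea, under stated assumptions: `rescale_by_position` is a hypothetical helper, `toy_stats` holds made-up per-bin min and max values shaped like `range_stats_to_merge`, and mapping a degenerate bin (min equal to max) to the middle of the target range is an arbitrary choice made here.

```python
import pandas as pd

def rescale_by_position(income, source_min, source_max, target_min, target_max):
    """Map income to the same relative position inside the target bin's [min, max] range."""
    if source_max == source_min:
        # Assumption: a bin with a single income value maps to the middle of the target range.
        position = 0.5
    else:
        position = (income - source_min) / (source_max - source_min)
    return target_min + position * (target_max - target_min)

# Hypothetical per-bin statistics (made-up values).
toy_stats = pd.DataFrame({
    'AgeRange': [0, 1],
    'min': [1000, 2000],
    'max': [3000, 12000],
}).set_index('AgeRange')

# An income of 1500 sits a quarter of the way up bin 0, so it lands a quarter of the way up bin 1.
predicted = rescale_by_position(
    1500,
    toy_stats.loc[0, 'min'], toy_stats.loc[0, 'max'],
    toy_stats.loc[1, 'min'], toy_stats.loc[1, 'max'],
)
print(predicted)
```

Because the result always lies between the target bin's min and max, it cannot be negative as long as the observed minimum incomes are non-negative.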

We can also try to fit a linear model here.
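
As a starting point, a straight line of income against age could be fitted with `np.polyfit`. This is only a sketch with made-up numbers: `ages` and `incomes` stand in for the `Age` and `MonthlyIncome` columns, and `predict_income` is a hypothetical helper.

```python
import numpy as np

# Hypothetical toy data (made-up values) standing in for Age and MonthlyIncome.
ages = np.array([20, 30, 40, 50], dtype=float)
incomes = np.array([2000, 4000, 6000, 8000], dtype=float)

# Least-squares fit of a degree-1 polynomial: income ~ slope * age + intercept.
slope, intercept = np.polyfit(ages, incomes, deg=1)

def predict_income(age):
    """Predict income for a given age from the fitted line."""
    return slope * age + intercept

print(slope, intercept, predict_income(35))
```

On the real data the inputs would be `income_raw_data['Age']` and `income_raw_data['MonthlyIncome']`. Given the variability seen earlier, a single line is likely to explain only a small part of the spread, and it can still produce negative values outside the observed age range.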

TBD ...