
## Introduction

We take a first look at the keywords that appear in job titles: how they affect
salaries and what range of salaries they span. The methodology should be applicable
to many similar metrics.

## Step 1. Data loading

We open the dataset and have a quick look...

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd



src = pd.read_csv("../input/Salaries.csv", low_memory=False)
src.describe()
src.head(10)

## Step 2. Having a look at titles

In this step I keep only the 2014 data and have a look at the job titles. There are
997 distinct job titles and most of them have just a few words in them. This is
good because it means that even algorithms with relatively high complexity will run fine.

In [None]:
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')
# stopwords lists the basic filler words to drop: i, as, how, etc.

src2014 = src[src["Year"]==2014]

job_titles = src2014["JobTitle"]
unique_job_titles = job_titles.unique()
# get the unique job titles, similar to grouping by job title

print("Job titles: ", job_titles.count(), " unique job titles ", len(unique_job_titles))

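import re
def tokenize(title):
    # Split on runs of non-letters. '+' is required here: with '*' the pattern can match
    # the empty string, and Python 3.7+ then splits between every single character.
    return filter(lambda w: w and (w not in en_stopwords), re.split('[^a-z]+', title.lower()))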


tokenized_titles = list(filter(None, [list(tokenize(title)) for title in unique_job_titles]))

lengths = list(map(len, tokenized_titles))

pd.DataFrame({"number of keywords per title": lengths}).hist(figsize=(16, 4));
#pd.DataFrame({"number of keywords per title": lengths}).plot(figsize=(16, 4));
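
As a quick check of what the tokenizer is meant to produce, here is a minimal, self-contained sketch. The small `STOPWORDS` set is a hypothetical stand-in for the nltk English stopword list, and the sample titles are made up for illustration.

```python
import re

# Hypothetical stand-in for nltk's English stopword list (assumption, not the real list)
STOPWORDS = {"of", "the", "and", "to", "in"}

def tokenize(title):
    """Lowercase the title, split on runs of non-letters, drop empties and stopwords."""
    words = re.split(r"[^a-z]+", title.lower())
    return [w for w in words if w and w not in STOPWORDS]

print(tokenize("Chief of Police"))
print(tokenize("DEPUTY CHIEF 3"))
```

Digits and punctuation act as separators, so a title made only of non-letters yields an empty list and is dropped by the `filter(None, ...)` step in the cell above.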

## Step 3. Extracting n-order tuples from titles
Let's write a small generator that yields all combinations of the keywords in a title.
We are going to use this in a bit. For example, `deputy chief` breaks down to 3
combinations: `deputy`, `chief` and `deputy chief`.

In [None]:
tokenized_title = tokenized_titles[0]

print("original: ", tokenized_title)

import itertools

def keywords_for_title(tokenized_title):
    for keyword_len in range(1, len(tokenized_title)+1):
        for keyword in itertools.combinations(tokenized_title, keyword_len):
            yield keyword

print("keywords: ", list(keywords_for_title(tokenized_title)))
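
The `deputy chief` example can be reproduced on its own. This sketch repeats the generator so it runs in isolation; the three-word title is a made-up input showing that a title with n keywords yields 2^n - 1 combinations, including non-adjacent pairs.

```python
import itertools

def keywords_for_title(tokens):
    """Yield every combination of the tokens, from single words up to the full title."""
    for size in range(1, len(tokens) + 1):
        yield from itertools.combinations(tokens, size)

two_words = list(keywords_for_title(["deputy", "chief"]))
print(two_words)

# Hypothetical three-word title: 2**3 - 1 = 7 combinations
three_words = list(keywords_for_title(["senior", "deputy", "sheriff"]))
print(len(three_words))
```

Because the count grows exponentially with title length, this approach is only practical while titles stay short, as they do in this dataset.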

## Step 4. Counting keyword combinations across titles

Let's have a look at the distinct keyword combinations from titles. As we can see
there's a great variety of titles that include "supervisor", "senior", "assistant" etc.

In [None]:
from collections import Counter
keywords_stats = Counter()
for title in tokenized_titles:
    for keyword in keywords_for_title(title):
        keywords_stats[keyword] += 1

print("distinct keywords: ", len(keywords_stats))
keywords_stats.most_common(1115)
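
To see the counting on something small, here is a sketch over three made-up tokenized titles. The names `toy_titles` and `toy_stats` are illustrative and not part of the notebook.

```python
from collections import Counter
import itertools

# Hypothetical tokenized titles for illustration
toy_titles = [["deputy", "chief"], ["deputy", "sheriff"], ["chief"]]

toy_stats = Counter(
    combo
    for tokens in toy_titles
    for size in range(1, len(tokens) + 1)
    for combo in itertools.combinations(tokens, size)
)

print(len(toy_stats))
print(toy_stats.most_common(2))
```

Single keywords shared by several titles rise to the top, while full-title combinations usually appear only once.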

## Step 5. Do the data extraction and aggregation

This is a relatively slow step because we go through the initial dataset and we
create primitives that will help us compute more advanced statistics in the next step.
Essentially it's just increasing a few counters per keyword combination.

In [None]:
salaries = Counter()
counts = Counter()
min_salaries = {}
max_salaries = {}

for index, row in src2014.iterrows():
    job_title = row["JobTitle"]
    salary_plus_benefits = row["TotalPayBenefits"]
    
    # Remove temporary jobs
    if salary_plus_benefits < 10000:
        continue
    
    tokenized_title = list(tokenize(job_title))
    if not tokenized_title:
        continue
    for keyword in keywords_for_title(tokenized_title):
        salaries[keyword] += salary_plus_benefits
        counts[keyword] += 1
        if keyword in max_salaries:
            if salary_plus_benefits < min_salaries[keyword]:
                min_salaries[keyword] = salary_plus_benefits
            if salary_plus_benefits > max_salaries[keyword]:
                max_salaries[keyword] = salary_plus_benefits
        else:
            min_salaries[keyword] = salary_plus_benefits
            max_salaries[keyword] = salary_plus_benefits
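
If the row-by-row loop turns out to be too slow, the same four primitives (sum, count, min, max) could probably be computed with a pandas `explode` plus `groupby`. This is only a sketch on made-up rows: it joins each combination into a string key rather than a tuple, and it tokenizes with a plain `split()` instead of the notebook's `tokenize`.

```python
import itertools
import pandas as pd

def keyword_strings(tokens):
    """Yield each keyword combination joined into a single space-separated string."""
    for size in range(1, len(tokens) + 1):
        for combo in itertools.combinations(tokens, size):
            yield " ".join(combo)

# Hypothetical rows standing in for src2014
toy = pd.DataFrame({
    "JobTitle": ["deputy chief", "deputy sheriff", "chief"],
    "TotalPayBenefits": [200000.0, 120000.0, 250000.0],
})

# One list of keyword strings per row, then one row per (title, keyword) pair
toy["keyword"] = toy["JobTitle"].map(lambda t: list(keyword_strings(t.split())))
per_keyword = (
    toy.explode("keyword")
       .groupby("keyword")["TotalPayBenefits"]
       .agg(["sum", "count", "min", "max"])
)
print(per_keyword)
```

The resulting frame has one row per keyword combination, which corresponds to the `salaries`, `counts`, `min_salaries` and `max_salaries` containers built in the loop.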

## Step 6. We create the two statistics: shift and variance.

For a given keyword, shift is the difference between the average salary of titles
that contain the keyword and the average of titles that don't, divided by the overall
average. For example, if jobs with "senior" average USD 1100, jobs without it average
USD 900, and the overall average is USD 1000, then the shift is (1100 - 900) / 1000 =
200/1000 = 0.2 = 20%.

The variance, as computed below, is the maximum salary divided by the minimum salary for
the given keyword; it is a spread ratio, not the statistical variance. These are
two interesting statistics to investigate.

In [None]:
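# Apply the same 10000 cutoff as in Step 5 so the totals cover the same rows as the per-keyword sums
pay_2014 = src2014.loc[src2014["TotalPayBenefits"] >= 10000, "TotalPayBenefits"]
s_all = pay_2014.sum()
n_all = pay_2014.count()
avg_salary = s_all / n_all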

shifts = Counter()
variances = Counter()

for keyword in salaries:
    s_with = salaries[keyword]
    n_with = counts[keyword]
    
    # Skip ill-cases
    if min_salaries[keyword] == 0:
        continue
    
    if n_with < 5:
        continue
    
    if len(keyword) > 2:
        continue

    avg_salary_with = s_with / n_with
        
    shifts[keyword] = (avg_salary_with - ((s_all-s_with)/(n_all-n_with))) / avg_salary
    variances[keyword] = max_salaries[keyword] / min_salaries[keyword]

print('shifts: ', shifts.most_common(15))
print()
print('variances: ', variances.most_common(15))
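
The arithmetic of the "senior" example can be checked with a tiny sketch. The helper names `shift` and `spread_ratio` are hypothetical; they mirror the expressions used in the cell above, where the second statistic is the maximum salary divided by the minimum. The numbers are made up: five jobs at 1100 with the keyword and five at 900 without.

```python
def shift(s_with, n_with, s_all, n_all):
    """Difference between the with-keyword and without-keyword averages, over the overall average."""
    avg_with = s_with / n_with
    avg_without = (s_all - s_with) / (n_all - n_with)
    avg_all = s_all / n_all
    return (avg_with - avg_without) / avg_all

def spread_ratio(min_salary, max_salary):
    """Maximum salary divided by minimum salary for a keyword."""
    return max_salary / min_salary

example_shift = shift(s_with=5 * 1100, n_with=5, s_all=5 * 1100 + 5 * 900, n_all=10)
print(round(example_shift, 3))
print(spread_ratio(50000, 150000))
```

Note that `shift` divides by `n_all - n_with`, so a keyword present in every row would need to be skipped before calling it.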

## Step 7. Scatter-plot the results

We compile a `DataFrame` that helps us plot and gain intuition on the results.

In [None]:
keys=[]
shifts_v=[]
variances_v=[]

for i in shifts.keys():
    keys.append(i)
    shifts_v.append(shifts[i])
    variances_v.append(variances[i])

stats = pd.DataFrame({"keys": keys, "shifts": shifts_v, "variances": variances_v})
stats.plot("shifts", "variances", "scatter")

## Step 8. Focus on above average

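Keep only the keywords with a shift above 1, that is, keywords whose average salary exceeds
the average without the keyword by more than the overall average salary.

In [None]:
# Keep only keywords whose shift is above 1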
keys=[]
shifts_v=[]
variances_v=[]

for i in shifts.keys():
    if shifts[i] <= 1:
        continue
    keys.append(i)
    shifts_v.append(shifts[i])
    variances_v.append(variances[i])

stats = pd.DataFrame({"keys": keys, "shifts": shifts_v, "variances": variances_v})
stats.plot("shifts", "variances", "scatter")
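
The same filter can also be written as a boolean mask on the `DataFrame`, which avoids rebuilding the three lists. A minimal sketch with made-up values; `toy_stats` and its string keys are illustrative only.

```python
import pandas as pd

# Hypothetical values standing in for the shifts and variances computed earlier
toy_stats = pd.DataFrame({
    "keys": ["chief", "clerk", "deputy chief"],
    "shifts": [1.4, -0.2, 2.1],
    "variances": [3.0, 1.5, 2.2],
})

# Keep only the rows whose shift is above 1, matching the loop's `<= 1` skip
above = toy_stats[toy_stats["shifts"] > 1].reset_index(drop=True)
print(above)
```

Resetting the index keeps row positions aligned with the cluster labels that are later looked up by position.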

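## Step 9. Clustering with KMeans (DBSCAN left commented out).

Let's try to identify some clusters. The cell below fits KMeans with three clusters; the
DBSCAN call it replaced is kept as a comment, and the plotting code is adapted from the
scikit-learn DBSCAN example.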

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
X = stats[['shifts', 'variances']].values
#X = StandardScaler().fit_transform(X)
#db = DBSCAN(eps=0.4, min_samples=10).fit(X)
db = KMeans(n_clusters=3).fit(X)

##############################################################################
# From http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
##############################################################################

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
#core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)

##############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
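
For comparison, the commented-out DBSCAN variant can be tried on toy data. This is a sketch under assumptions: the points are made up (two tight groups and one far-away point), and `min_samples=3` is chosen to suit such a tiny sample rather than the `min_samples=10` in the comment above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two tight groups plus one isolated point
group_a = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05]]
group_b = [[5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1], [5.05, 5.05]]
outlier = [[20.0, -20.0]]
X_toy = np.array(group_a + group_b + outlier)

# Standardize both axes so that eps is measured in comparable units
X_scaled = StandardScaler().fit_transform(X_toy)
toy_db = DBSCAN(eps=0.4, min_samples=3).fit(X_scaled)

toy_labels = toy_db.labels_
# DBSCAN marks noise with the label -1, so it is excluded from the cluster count
n_toy_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
print(toy_labels, n_toy_clusters)
```

Unlike KMeans, DBSCAN does not take the number of clusters as input and leaves isolated points unassigned, which is why the plotting code above checks for the label -1.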

## Step 10. Print general cluster information

Print the identified clusters and the number of data points in each cluster.
A -1 cluster is the noise one and can be ignored; it appears only with DBSCAN, since KMeans
assigns every point to a cluster.

In [None]:
from collections import defaultdict
labels_per_class = defaultdict(list)
for label in set(labels):
    for cnt, i_label in enumerate(labels):
        if i_label == label:
            labels_per_class[label].append(cnt)

[(k, len(labels_per_class[k])) for k in labels_per_class.keys()]
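
The grouping of positions by label can also be done in a single pass over the labels. A small sketch with made-up labels; `indices_per_label` and `sizes` are illustrative names.

```python
from collections import defaultdict

# Hypothetical cluster labels, with -1 standing for DBSCAN noise
toy_labels = [0, 1, 0, -1, 1]

indices_per_label = defaultdict(list)
for position, label in enumerate(toy_labels):
    indices_per_label[label].append(position)

sizes = {label: len(positions) for label, positions in indices_per_label.items()}
print(sizes)
```

This visits each label once instead of once per distinct cluster, which matters little here but scales better with many clusters.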

## Step 11. Print cluster results

For each interesting class of data points, print details on the samples.

In [None]:
for i_class in range(len(labels_per_class)):
    print("class: ", i_class)
    for ikey in labels_per_class[i_class]:
        dp = [
            ("shift", format(shifts_v[ikey], '.2f')),
            ("variance", format(variances_v[ikey], '.2f')),
            ("key", keys[ikey]),
        ]
        print("  -", dp)


## Step 12. Summary

We can see here a segmentation between highly paid, conservative jobs and highly paid,
more open-market-related jobs. Professions that have to compete with the private sector,
e.g. 'attorney', 'architect', 'physician', 'nurse' and some managerial
positions, have large variance, which may indicate performance-driven incentives. In contrast, jobs with
keywords such as 'sheriff', 'civil', 'commander' etc. are very well paid but they are more
conservative in terms of performance incentives. This might or might not be the right
thing to do.