# Explore SF Salary Data (Quick EDA)

* Data Overview
* Numerical Variables
  * Total Pay
  * Base Pay, Overtime Pay and Benefits
* Categorical Variables
* Recap

Kaggle provided San Francisco city employee salary data. Let’s do a quick EDA!

Let's import the necessary modules and read the data.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

from collections import Counter

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('../input/Salaries.csv')

# Data Overview

First, let's look at the structure of our data. I use the `head` DataFrame method for this. The first three rows will be enough.

In [None]:
data.head(3)

It's pretty clear what most of the columns mean. Let's group them according to their intended types.

Numerical:
* BasePay
* OvertimePay
* OtherPay
* Benefits
* TotalPay
* TotalPayBenefits
* Year

Categorical:
* EmployeeName
* JobTitle
* Notes
* Agency
* Status

Now let's get the actual column types and the non-null counts, using the `info` method.

In [None]:
data.info()

Looks like *BasePay*, *OvertimePay*, *OtherPay* and *Benefits* have the wrong types. They are stored as objects, but they should clearly be numeric. Let's convert them. Note that these columns may contain non-numeric values, so I pass `errors='coerce'` to turn invalid values into NaN.


In [None]:
series_list = ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']
for series in series_list:
    data[series] = pd.to_numeric(data[series], errors='coerce')
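
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column mixing numeric strings with a text placeholder (made-up values)
raw = pd.Series(['100.5', '0', 'Not Provided', '42'])

# Anything that cannot be parsed as a number becomes NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1, only the placeholder became NaN
```

Counting the NaN values before and after the conversion is a quick way to check how many entries were coerced.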

Column *Year* obviously contains the year. Let's look at it. I use the `unique` Series method to get the distinct values in the column.

In [None]:
data['Year'].unique()

Pretty good. We have observations from 2011 to 2014. 

Let's keep digging. The dataset info shows that the *Notes* column contains no information. We'll drop it a little later. What do *Agency* and *Status* mean? Let's get the unique values in these columns.

In [None]:
print('Agency unique values:')
data['Agency'].unique()

In [None]:
print('Status unique values:')
data['Status'].unique()

The *Agency* column doesn't give us any useful information. *Status* contains the work schedule and has two options: Part-Time (PT) or Full-Time (FT). We'll use it later.
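
`unique` only lists the distinct values. To also see how many rows carry each value, including missing ones, `value_counts(dropna=False)` is one option. A small sketch on made-up data, not the real column:

```python
import numpy as np
import pandas as pd

# Toy status column with some missing entries (made-up values)
status = pd.Series(['FT', 'PT', np.nan, 'FT', np.nan, np.nan])

# dropna=False keeps NaN as its own category in the counts
counts = status.value_counts(dropna=False)
print(counts)
```

This makes it obvious when a column is mostly empty, which `unique` alone does not reveal.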

Drop unnecessary columns.

In [None]:
data.drop(['Notes', 'Agency'], axis=1, inplace=True)

# Numerical Variables

Let's look at our numerical variables. First, some summary statistics. I use the `describe` DataFrame method for this.

In [None]:
data.describe()

Looks normal, but I see one weird thing — some people have negative pay. What does that mean?

In [None]:
data[data['TotalPay'] < 0]

It's one person. One strange person who has negative pay.

How about people whose pay is 0?

In [None]:
data[data['TotalPay'] == 0].head(3)

In [None]:
len(data[data['TotalPay'] == 0])

So we have 368 people. I didn't show them all. Zero payment is really sad, but some of these people are getting benefits. Not too bad.

Let's show people who get a small positive amount, below roughly \$400.

In [None]:
data[(data['TotalPay'] > 0) & (data['TotalPay'] < 400)].head(3)

In [None]:
len(data[(data['TotalPay'] > 0) & (data['TotalPay'] < 400)])

We have 1475 people here.
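
The same counts can be taken directly from boolean masks, since `True` sums as 1. A minimal sketch on made-up pay values (not the real data):

```python
import pandas as pd

# Made-up pay values covering the three cases discussed above
pay = pd.Series([-50.0, 0.0, 0.0, 150.0, 399.99, 400.0, 52000.0])

negative = (pay < 0).sum()               # rows with negative pay
zero = (pay == 0).sum()                  # rows with exactly zero pay
small = ((pay > 0) & (pay < 400)).sum()  # small positive pay, 400 excluded

print(negative, zero, small)  # 1 2 2
```

Summing the mask avoids building an intermediate DataFrame just to call `len` on it.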

## Total Pay

Now do some visualisation. Let's look at *TotalPay* for each year.

In [None]:
# `height` and `fill` replace the older `size` and `shade` arguments (seaborn 0.11 or newer)
g = sns.FacetGrid(data, col="Year", col_wrap=2, height=5, dropna=True)
g.map(sns.kdeplot, 'TotalPay', fill=True);

The plots are very similar. This suggests that the situation in the labour market is stable. Also, it looks like we have two spikes on each plot. What are they? Different types of jobs, or maybe different forms of employment?

Let's visualise *TotalPay* for different forms of employment: Part-Time (PT) and Full-Time (FT).

In [None]:
ft = data[data['Status'] == 'FT']
pt = data[data['Status'] == 'PT']

fig, ax = plt.subplots(figsize=(9, 6))

sns.kdeplot(ft['TotalPay'].dropna(), label="Full-Time", fill=True, ax=ax)
sns.kdeplot(pt['TotalPay'].dropna(), label="Part-Time", fill=True, ax=ax)
ax.legend()  # show the Full-Time / Part-Time labels

plt.xlabel('Total Pay')
plt.ylabel('Density')
title = plt.title('Total Pay Distribution')

Got it! We found a parameter that divides our original distribution into two parts. We can also clearly see the difference between the two forms of employment.
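
The visual difference can also be backed by a quick numeric summary per status. A sketch on a toy frame with made-up numbers, assuming the same `Status` and `TotalPay` column names:

```python
import pandas as pd

# Toy frame with made-up pay values for each status
toy = pd.DataFrame({
    'Status': ['FT', 'FT', 'FT', 'PT', 'PT', 'PT'],
    'TotalPay': [90000.0, 110000.0, 100000.0, 20000.0, 35000.0, 5000.0],
})

# Row count and median pay for each form of employment
summary = toy.groupby('Status')['TotalPay'].agg(['count', 'median'])
print(summary)
```

The median is used here rather than the mean because pay distributions tend to have long right tails.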

## Base Pay, Overtime Pay and Benefits

Also, visualise the other numerical features.

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))

sns.kdeplot(ft['BasePay'].dropna(), label="Full-Time", fill=True, ax=ax)
sns.kdeplot(pt['BasePay'].dropna(), label="Part-Time", fill=True, ax=ax)
ax.legend()  # show the Full-Time / Part-Time labels

plt.xlabel('Base Pay')
plt.ylabel('Density')
title = plt.title('Base Pay Distribution')

Looks similar to the previous graph. We also have a few spikes there.

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))

sns.kdeplot(ft['OvertimePay'].dropna(), label="Full-Time", fill=True, ax=ax)
sns.kdeplot(pt['OvertimePay'].dropna(), label="Part-Time", fill=True, ax=ax)
ax.legend()  # show the Full-Time / Part-Time labels

plt.xlabel('Overtime Pay')
plt.ylabel('Density')
title = plt.title('Overtime Pay Distribution')

Nice graph! It's not difficult to guess that most people have overtime pay around 0. But look at the tail of the graph. Someone gets more than \$170000 in overtime pay!

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))

sns.kdeplot(ft['Benefits'].dropna(), label="Full-Time", fill=True, ax=ax)
sns.kdeplot(pt['Benefits'].dropna(), label="Part-Time", fill=True, ax=ax)
ax.legend()  # show the Full-Time / Part-Time labels

plt.xlabel('Benefits')
plt.ylabel('Density')
title = plt.title('Benefits Distribution')

In this graph, we also see large differences between full-time and part-time workers. Full-time workers get much more in benefits.

# Categorical Variables

Let's dig into the categorical variables. We have only two meaningful categorical variables: *EmployeeName* and *JobTitle*. The first one is not interesting, so I'm going to deal with *JobTitle*.

How many unique job titles do we have?

In [None]:
# Count distinct titles, excluding the "Not provided" placeholder explicitly
for label, df in [('All', data), ('Full-time', ft), ('Part-time', pt)]:
    titles = df.loc[df['JobTitle'] != 'Not provided', 'JobTitle']
    print(label, 'unique job titles:', titles.nunique())

The data in this column is very varied, so we can't use it directly for frequency analysis: there are too many different values. We need to simplify them. I'm going to split all titles into individual words and count their frequency. Then I'll look through the top 200 words and try to form several job groups. Each group corresponds to a set of words that appear in the job titles of the people working in that group. In the end, I'll assign each person to one of the groups. The remaining people will be placed in the *Other* group.

I'm going to use the handy `Counter` class, which lets me build a dictionary of frequencies and display the top words.

In [None]:
from collections import Counter

In [None]:
# Unique titles, skipping the "Not provided" placeholder
job_titles = [t for t in data['JobTitle'].dropna().unique() if t != 'Not provided']

# Split every title into lowercase words
words_in_titles = []
for job_title in job_titles:
    words_in_titles += job_title.lower().split()

# A little cleaning: drop pure numbers and words of three characters or fewer
words = [word for word in words_in_titles if not word.isdigit() and len(word) > 3]

words_count = Counter(words)

In [None]:
# words_count.most_common(200)
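
To make the counting step concrete, here is the same idea on a handful of made-up titles. This is an illustrative sketch, not the real column.

```python
from collections import Counter

# Made-up sample of job titles
titles = ['Registered Nurse', 'Special Nurse', 'Police Officer 3', 'Transit Operator']

# Lowercase, split into words, drop pure numbers and words of three characters or fewer
words = [
    word
    for title in titles
    for word in title.lower().split()
    if not word.isdigit() and len(word) > 3
]

word_counts = Counter(words)
print(word_counts.most_common(1))  # [('nurse', 2)]
```

`most_common(n)` returns the `n` most frequent words with their counts, which is what makes it convenient for scanning the top of the list.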

After looking through the 200 most common words, I selected the following groups:

In [None]:
job_groups = {'Fire'    : ['fire'],
              'Airport' : ['airport'],
              'Animal'  : ['animal'],
              'Mayor'   : ['mayor'],
              'Library' : ['librar'],
              'Parking' : ['parking'],
              'Clerk'   : ['clerk'],
              'Porter'  : ['porter'],
              'Engineer and Tech': ['engineer', 'programmer', 'electronic', 'tech'], 
              'Court'   : ['court', 'legal', "attorney's", 'atty', 'eligibility'], 
              'Police'  : ['sherif', 'officer', 'police', 'probation', "sheriff's", 'sergeant'],
              'Medical' : ['nurse', 'medical', 'health', 'physician', 'therapist', 'psychiatric', 'treatment', 'hygienist'],
              'Public Works' : ['public'],
              'Food Service' : ['food'],
              'Architectural' : ['architect']}

Now let's define a transform function that converts a job title into a job group.

In [None]:
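def transform_func(title):
    # Return the first group with a keyword that occurs in the title;
    # groups are checked in the order they appear in job_groups
    title = title.lower()
    for key, value in job_groups.items():
        for each_value in value:
            if each_value in title:
                return key
    return 'Other'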
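
One thing worth knowing about this kind of matching is that the first matching group wins, so the order of the dictionary matters. A small self-contained sketch with a reduced, hypothetical group dictionary and a hypothetical helper name `to_group`:

```python
# Reduced, hypothetical version of the groups dictionary
job_groups = {
    'Fire': ['fire'],
    'Police': ['police', 'officer'],
    'Medical': ['nurse', 'medical'],
}

def to_group(title):
    """Return the first group whose keyword occurs in the title, else 'Other'."""
    title = title.lower()
    for group, keywords in job_groups.items():
        if any(keyword in title for keyword in keywords):
            return group
    return 'Other'

print(to_group('Registered Nurse'))     # Medical
print(to_group('Fire Safety Officer'))  # Fire, because 'Fire' is checked before 'Police'
print(to_group('Gardener'))             # Other
```

Because the match is a plain substring test, short keywords can also hit unrelated titles, so it may be worth spot-checking a few rows per group.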

Save new information into *JobGroup* column.

In [None]:
data['JobGroup'] = data['JobTitle'].apply(transform_func)

And let's see the data.

In [None]:
data.head(3)

Let's visualize it!

In [None]:
# `height` and `fill` replace the older `size` and `shade` arguments (seaborn 0.11 or newer)
g = sns.FacetGrid(data, col="JobGroup", col_wrap=3, height=4.5, dropna=True)
res = g.map(sns.kdeplot, 'TotalPay', fill=True)

Well, that's a little interesting. You can immediately notice the huge peak on the Public Works graph. The values are mostly concentrated around zero, which shows how small the salaries in this group are.

Most of the graphs have more than one spike. What does that mean? Most likely, it reflects the internal distribution within a job group. That's normal: some people hold senior posts, some hold junior ones.

Some groups have a single spike (Architectural, Fire, Airport, etc.). This shows that people in these groups generally get similar salaries.

Why don’t we take a closer look at Medical? As before, we divide the workers into two groups: FT and PT.

In [None]:
ft_med = data[(data['Status'] == 'FT') & (data['JobGroup'] == 'Medical')]
pt_med = data[(data['Status'] == 'PT') & (data['JobGroup'] == 'Medical')]

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))

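# Plot the Medical subsets defined above, not the full ft / pt frames
sns.kdeplot(ft_med['TotalPay'].dropna(), label="Full-Time", fill=True, ax=ax)
sns.kdeplot(pt_med['TotalPay'].dropna(), label="Part-Time", fill=True, ax=ax)
ax.legend()  # show the Full-Time / Part-Time labels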

plt.xlabel('TotalPay')
plt.ylabel('Density')
title = plt.title('Medical Total Pay Distribution')

Predictably, full-time workers receive a higher salary on average. However, there appear to be significantly more part-time workers than full-time workers. For medicine, this is not surprising.

# Recap

We looked through the SF Salaries dataset, analysed the data and did a lot of cool visualisations. Of course, this is not intended to be a complete analysis, but as my first public script... I hope it's good :)

Thx for your attention.