 - In this project, I mainly examined the salaries of jobs with the word
   "analyst" in the job title from the database.
 - I found the ten "analyst" jobs with the most employees on record and compared their total pay, benefits, and total pay with benefits.

In [None]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
con = sqlite3.connect('../input/database.sqlite')

 - Let's find jobs with "data" in their titles.

In [None]:
JobTitle_with_data_summery = pd.read_sql_query('''
        select COUNT(*) AS Nums, AVG(TotalPay) AS TotalPay_Avg, 
        AVG(Benefits) AS Benefits_Avg, AVG(TotalPayBenefits) AS TotalPayBenefits_Avg
        from 
        Salaries
        WHERE JobTitle LIKE '%data%'
        ''',con)
JobTitle_with_data_summery

 - More specifically, the three jobs with "data" are:

In [None]:
JobTitle_with_data = pd.read_sql_query('''
        SELECT JobTitle, COUNT(*) AS Nums
        FROM
        (select *
        from 
        Salaries
        WHERE JobTitle LIKE '%data%') AS a
        GROUP BY JobTitle
        ''',con)
JobTitle_with_data

 - Not many, and two of the job titles are actually the same job.
 - OK, let's find jobs with "analyst" in their titles.

In [None]:
JobTitle_with_analyst_summery = pd.read_sql_query('''
        select COUNT(*) AS Nums, AVG(TotalPay) AS TotalPay_Avg, 
        AVG(Benefits) AS Benefits_Avg, AVG(TotalPayBenefits) AS TotalPayBenefits_Avg,
        MIN(TotalPay) AS TotalPay_Min, 
        MIN(Benefits) AS Benefits_Min, MIN(TotalPayBenefits) AS TotalPayBenefits_Min
        from 
        Salaries
        WHERE JobTitle LIKE '%analyst%'
        ''',con)
JobTitle_with_analyst_summery

 - There are quite a lot of "analyst" jobs in the database. Next,
   let's see what these "analyst" jobs are.
 - Since the same job may appear in lowercase or uppercase, let's use the
   upper() function to convert every title to uppercase.
 - Also, the 'TotalPay' column contains zero values, which I think
   should be excluded from the salary analysis.

In [None]:
JobTitle_with_analyst = pd.read_sql_query('''
        SELECT upper(JobTitle) AS JobTitle, 
        COUNT(*) AS Nums
        FROM
        (select *
        from 
        Salaries
        WHERE JobTitle LIKE '%analyst%' AND TotalPay>0) AS analyst_table
        GROUP BY upper(JobTitle)
        ''',con)
JobTitle_with_analyst
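
As a small self-contained check of the two choices above, the sketch below builds a toy in-memory table (the rows are made up for illustration and are not taken from the database; `toy_con` is a hypothetical name). It shows that SQLite's `LIKE` is already case-insensitive for ASCII letters, so `'%analyst%'` matches both spellings, while `GROUP BY` still treats them as different titles unless `upper()` is applied.

```python
import sqlite3
import pandas as pd

# Toy in-memory table; the rows are invented for illustration only.
toy_con = sqlite3.connect(':memory:')
toy_con.execute('CREATE TABLE Salaries (JobTitle TEXT, TotalPay REAL)')
toy_con.executemany(
    'INSERT INTO Salaries VALUES (?, ?)',
    [('Business Analyst', 90000.0),
     ('BUSINESS ANALYST', 85000.0),
     ('Business Analyst', 0.0),       # zero pay, filtered out below
     ('Police Officer', 100000.0)])

# Without upper(): LIKE matches both spellings, but they form two groups.
raw_groups = pd.read_sql_query('''
    SELECT JobTitle, COUNT(*) AS Nums
    FROM Salaries
    WHERE JobTitle LIKE '%analyst%' AND TotalPay > 0
    GROUP BY JobTitle
    ''', toy_con)

# With upper(): the two spellings collapse into a single group.
merged_groups = pd.read_sql_query('''
    SELECT upper(JobTitle) AS JobTitle, COUNT(*) AS Nums
    FROM Salaries
    WHERE JobTitle LIKE '%analyst%' AND TotalPay > 0
    GROUP BY upper(JobTitle)
    ''', toy_con)

print(raw_groups)
print(merged_groups)
```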

 - Notice that some jobs are similar. For example, the first job,
   "admin analyst", should be the same as "administrative analyst".
   We can group similar titles together.
 - One thing to be careful about: I think programmer analyst (which covers
   programmer analyst and prg analyst in the table) and program analyst
   (which covers program analyst and program support analyst) are
   different jobs.
 - Here are all "analyst" jobs with a basic summary:

In [None]:
Job_Salaries = pd.read_sql_query('''
        SELECT CASE
        WHEN upper(JobTitle) LIKE '%ADMIN%' THEN 'ADMINISTRATIVE ANALYST'
        WHEN upper(JobTitle) LIKE '%RETIREMENT%' THEN 'RETIREMENT ANALYST'
        WHEN upper(JobTitle) LIKE '%HUMAN RESOURCE%' THEN 'HUMAN RESOURCES ANALYST'
        WHEN upper(JobTitle) LIKE '%BENEFIT%' THEN 'BENEFITS ANALYST'
        WHEN (upper(JobTitle) LIKE '%COMP APP%') OR (upper(JobTitle) LIKE '%COMPUTER APPLICATION%') THEN 'COMPUTER APPLICATIONS ANALYST'
        WHEN upper(JobTitle) LIKE '%FEASIBILITY%' THEN 'FEASIBILITY ANALYST'
        WHEN upper(JobTitle) LIKE '%HEALTH%' THEN 'HEALTH CARE ANALYST'
        WHEN upper(JobTitle) LIKE '%MEDICAL%' THEN 'MEDICAL STAFF SERVICES DEPARTMENT ANALYST'
        WHEN upper(JobTitle) LIKE '%BUSINESS%' THEN 'BUSINESS ANALYST'
        WHEN upper(JobTitle) LIKE '%OPERATOR%' THEN 'OPERATOR ANALYST'
        WHEN (upper(JobTitle) LIKE '%PROGRAMMER%') OR (upper(JobTitle) LIKE '%PRG ANALYST%') THEN 'PROGRAMMER ANALYST'
        WHEN upper(JobTitle) LIKE '%PERF%' THEN 'PERFORMANCE ANALYST'
        WHEN (upper(JobTitle) LIKE '%MANAGEMENT%') OR (upper(JobTitle) LIKE '%MGMT%') THEN 'MANAGEMENT ANALYST'
        WHEN upper(JobTitle) LIKE '%PERSONNEL%' THEN 'PERSONNEL ANALYST'
        WHEN (upper(JobTitle) LIKE '%PROGRAM ANALYST%') OR (upper(JobTitle) LIKE '%PROGRAM SUPPORT%') THEN 'PROGRAM ANALYST'
        WHEN upper(JobTitle) LIKE '%SAFETY%' THEN 'SAFETY ANALYST'
        WHEN upper(JobTitle) LIKE '%SECURITY%' THEN 'SECURITY ANALYST'
        WHEN upper(JobTitle) LIKE '%UTILITY%' THEN 'UTILITY ANALYST'
        WHEN upper(JobTitle) LIKE '%WATER%' THEN 'WATER OPERATIONS ANALYST'
        END AS Job, COUNT(*) AS Nums, AVG(TotalPay) AS TotalPay_Avg, 
        AVG(Benefits) AS Benefits_Avg, AVG(TotalPayBenefits) AS TotalPayBenefits_Avg
        FROM
        (select *
        from 
        Salaries
        WHERE JobTitle LIKE '%analyst%' AND TotalPay>0) AS analyst_table
        GROUP BY Job
        ORDER BY Nums DESC
        ''',con)
Job_Salaries
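
Two properties of `CASE` matter for the grouping above: the first matching `WHEN` wins, and a title that matches no branch gets `NULL` because there is no `ELSE`. The sketch below illustrates this on made-up titles (not rows from the database), using only the two "program" branches; the table name `Titles` and the variable names are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy titles, invented for illustration only.
toy_con = sqlite3.connect(':memory:')
toy_con.execute('CREATE TABLE Titles (JobTitle TEXT)')
toy_con.executemany(
    'INSERT INTO Titles VALUES (?)',
    [('IS Programmer Analyst',), ('PRG Analyst',),
     ('Program Support Analyst',), ('Budget Analyst',)])

toy_jobs = pd.read_sql_query('''
    SELECT JobTitle,
    CASE
    WHEN (upper(JobTitle) LIKE '%PROGRAMMER%') OR (upper(JobTitle) LIKE '%PRG ANALYST%') THEN 'PROGRAMMER ANALYST'
    WHEN (upper(JobTitle) LIKE '%PROGRAM ANALYST%') OR (upper(JobTitle) LIKE '%PROGRAM SUPPORT%') THEN 'PROGRAM ANALYST'
    END AS Job
    FROM Titles
    ''', toy_con)

# 'Budget Analyst' matches neither branch, so its Job is NULL (None in pandas).
print(toy_jobs)
```

Rows with a `NULL` job end up in one unnamed group when grouping by `Job`, so it may be worth checking whether such a group appears near the top of the table before taking the first ten rows.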

 - In the table above, the jobs are ordered by number of employees. Next, I am going to focus on the first 10 "hot" jobs in the table and visualize some results. I have two reasons for doing so:
 - 1. Jobs with very few employees could probably be folded into other categories.
 - 2. As a newbie myself, I am more likely to get hired for a job
   that needs more people.
 
 - The ten hot analyst jobs are:

In [None]:
Hot_Job = Job_Salaries['Job'][0:10]
Hot_Job.values

Let's examine salaries related to these 10 jobs.

In [None]:
Hot_Job_Salaries = pd.read_sql_query('''
        SELECT JobTitle, 
        CASE
        WHEN upper(JobTitle) LIKE '%ADMIN%' THEN 'ADMINISTRATIVE ANALYST'
        WHEN upper(JobTitle) LIKE '%BUSINESS%' THEN 'BUSINESS ANALYST'
        WHEN upper(JobTitle) LIKE '%PERSONNEL%' THEN 'PERSONNEL ANALYST' 
        WHEN (upper(JobTitle) LIKE '%PROGRAMMER%') OR (upper(JobTitle) LIKE '%PRG ANALYST%') THEN 'PROGRAMMER ANALYST'
        WHEN upper(JobTitle) LIKE '%RETIREMENT%' THEN 'RETIREMENT ANALYST'
        WHEN upper(JobTitle) LIKE '%BENEFIT%' THEN 'BENEFITS ANALYST'
        WHEN (upper(JobTitle) LIKE '%PROGRAM ANALYST%') OR (upper(JobTitle) LIKE '%PROGRAM SUPPORT%') THEN 'PROGRAM ANALYST'
        WHEN upper(JobTitle) LIKE '%UTILITY%' THEN 'UTILITY ANALYST'
        WHEN upper(JobTitle) LIKE '%PERF%' THEN 'PERFORMANCE ANALYST'
        WHEN upper(JobTitle) LIKE '%HEALTH%' THEN 'HEALTH CARE ANALYST'
        END AS Job, 
        TotalPay, Benefits, TotalPayBenefits, Year
        FROM
        (select *
        from 
        Salaries
        WHERE JobTitle LIKE '%analyst%') AS analyst_table
        
        ''',con)

In [None]:
Hot_Job_Salaries.describe()

I do not show the full table here. Let's use graphs to compare total pay, benefits, and total pay with benefits.

In [None]:
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1,1,1)
# Boxplot of TotalPay per job. The x-axis labels show each job with its employee count,
# and the jobs are ordered by employee count (from largest to smallest).
order = Hot_Job.values
sns.boxplot(x = 'Job', y = 'TotalPay', data = Hot_Job_Salaries,ax=ax,order = order)
Hot_Job_Number = Job_Salaries['Job'][0:10]+' '+Job_Salaries['Nums'][0:10].map(str)
ax.set_xticklabels(Hot_Job_Number.values)
plt.xticks(size = 10, rotation = 80)
ax.set_xlabel('Jobs with corresponding Employee numbers')
fig.suptitle('TotalPays of ten hot analyst jobs in SF')

I tried to make the same plot for "Benefits", but it failed at first. The reason turned out to be that
the "Benefits" column had type object. I converted the column to a numeric type and then plotted it.

In [None]:
Hot_Job_Salaries.dtypes

In [None]:
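# Convert the whole column at once; entries that cannot be parsed become NaN instead of raising.
Hot_Job_Salaries["Benefits"] = pd.to_numeric(Hot_Job_Salaries["Benefits"], errors="coerce")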
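
A minimal sketch of this kind of conversion on a made-up column. It assumes the object dtype comes from non-numeric entries (such as empty strings) mixed in with the numbers; that is an assumption for illustration, not something checked against this database. With `errors='coerce'`, `pd.to_numeric` turns such entries into NaN instead of raising an error.

```python
import pandas as pd

# Made-up values: numeric strings, an empty string, a number, and free text.
toy = pd.DataFrame({'Benefits': ['1200.5', '', 3000, 'Not Provided']})
print(toy['Benefits'].dtype)   # object, because the column mixes strings and numbers

# Entries that cannot be parsed as numbers become NaN.
toy['Benefits'] = pd.to_numeric(toy['Benefits'], errors='coerce')
print(toy['Benefits'].dtype)
print(toy['Benefits'].isna().sum())
```

Seaborn can then draw the boxplot from the numeric column, with the NaN entries left out of the statistics.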

In [None]:
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1,1,1)
sns.boxplot(x = 'Job', y = 'Benefits', data = Hot_Job_Salaries,ax=ax,order = order)
ax.set_xticklabels(Hot_Job_Number.values)
plt.xticks(size = 10, rotation = 80)
ax.set_xlabel('Jobs with corresponding Employee numbers')
fig.suptitle('Benefits of ten hot analyst jobs in SF')

In [None]:
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1,1,1)
sns.boxplot(x = 'Job', y = 'TotalPayBenefits', data = Hot_Job_Salaries,ax=ax,order = order)
ax.set_xticklabels(Hot_Job_Number.values)
plt.xticks(size = 10, rotation = 80)
ax.set_xlabel('Jobs with corresponding Employee numbers')
fig.suptitle('TotalPay+Benefits of ten hot analyst jobs in SF')

 - Summary:
 - 1. The ten "Analyst" jobs with the most employees in SF are: Administrative Analyst, Business Analyst, Personnel Analyst,
   Programmer Analyst, Retirement Analyst, Benefits Analyst, Program
   Analyst, Utility Analyst, Performance Analyst, Health Care Analyst.
 - 2. The average total pay of all "Analyst" jobs is 82019, the average
   benefit is 29854, and the average total pay with benefits is 105293.
 - 3. The median total pay is 88353 and the median total pay with benefits
   is 112216. The averages are lower than the corresponding
   medians because of a large number of low values.
 - The next three facts are about median values:
 - 4. The median salaries of Administrative Analyst, Business Analyst,
   Personnel Analyst, Programmer Analyst, Program Analyst, and Performance
   Analyst are above 80000. The median salaries of the other four are between
   60000 and 80000.
 - 5. Most of the ten jobs have median benefits between 30000 and 40000. However, the average benefits are all below 30000 according to the Job_Salaries table, again because of a large number of low benefits values.
 - 6. Overall, the median total pay with benefits for most of the ten jobs is around 100000.
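
The boxplots show medians, while the SQL tables report averages. The sketch below shows one way both could be computed per job with pandas, using made-up numbers (the column names follow `Hot_Job_Salaries`, but the values are invented and `toy` and `stats` are hypothetical names). It also illustrates why a few very low values pull the mean below the median.

```python
import pandas as pd

# Invented pay values; one very low record in the first group.
toy = pd.DataFrame({
    'Job': ['BUSINESS ANALYST'] * 5 + ['UTILITY ANALYST'] * 3,
    'TotalPay': [90000, 92000, 95000, 88000, 5000,
                 70000, 72000, 74000],
})

# Mean and median of TotalPay for each job group.
stats = toy.groupby('Job')['TotalPay'].agg(['mean', 'median'])
print(stats)
```

In the first toy group the single low value drags the mean well below the median, while in the second group, which has no low values, the two agree.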