<h1>Exploratory Data Analysis with SQL and Python</h1>

<h2>We'll answer the following questions</h2>
<h3><ol>
    <li><a href="#1">How many unique job titles are there?</a></li>
    <li><a href="#2">What are the unique job titles?</a></li>
    <li><a href="#3">For which years is the data reported?</a></li>
    <li><a href="#4">What's the yearly trend in the count of reported jobs?</a></li>
    <li><a href="#5">Are there any jobs related to data or machine learning?</a></li>
    <li><a href="#6">What's the trend in average TotalPay over the years?</a></li>
    <li><a href="#7">What are the top 5 jobs in terms of mean TotalPay in the most recent year (2014)?</a></li>
    <li><a href="#8">What's the trend in Average TotalPay of the above 5 jobs?</a></li>
    <li><a href="#9">What are the bottom 5 jobs in terms of mean TotalPay in the most recent year (2014)?</a></li>
    <li><a href="#10">What's the trend in Average TotalPay of the above 5 jobs?</a></li>
    <li><a href="#11">Which employee earned the most in terms of TotalPay by Year?</a></li>
    <li><a href="#12">Which employee earned the least in terms of TotalPay by Year?</a></li>
    <li><a href="#13">Is there any pattern in salaries of Junior, Senior, and Chief employee titles?</a></li>
    <li><a href="#14">Is there any correlation between various pay components?</a></li>
    <li><a href="#15">Who are the employees with TotalPay between 500,000 and 1,000,000?</a></li>
    <li><a href="#16">Who held a single job across all 4 years?</a></li>
    <li><a href="#17">Who changed jobs every year across all 4 years?</a></li>
    <li><a href="#18">Who changed jobs at least once in 4 years?</a></li>
    <li><a href="#19">How does the mean TotalPay vary between full-time and part-time jobs?</a></li>
</ol></h3>

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Creating a Connection object to the database
conn = sqlite3.connect("../input/sf-salaries/database.sqlite")

In [None]:
# Run a query and return the result as a dataframe
def query_result(query):
    cursor = conn.cursor()  # cursor object used to run the query
    cursor.execute(query)   # execute the query passed as an argument to this function
    # fetchall() returns a list of plain tuples without column names,
    # so the names are read from cursor.description
    columns = [col_name[0] for col_name in cursor.description]
    # Passing columns= also keeps the header when the query returns no rows
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    cursor.close()
    return df
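
pandas can also build the dataframe, column names included, straight from a connection with `pd.read_sql_query`. The sketch below is an alternative to the helper above rather than a replacement for it. It runs against a throwaway in-memory database, and the rows (`Analyst`, `Clerk`) and the names `demo_conn`, `demo_df` and `empty_df` are made up for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table standing in for Salaries
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Salaries (JobTitle TEXT, TotalPay REAL, Year INTEGER)")
demo_conn.executemany(
    "INSERT INTO Salaries VALUES (?, ?, ?)",
    [("Analyst", 50000.0, 2013), ("Clerk", 30000.0, 2014)],
)

# read_sql_query runs the query and names the columns from the result set
demo_df = pd.read_sql_query(
    "SELECT JobTitle, TotalPay FROM Salaries ORDER BY TotalPay", demo_conn
)
print(demo_df)

# A query with no matching rows still yields a dataframe with the expected column
empty_df = pd.read_sql_query(
    "SELECT JobTitle FROM Salaries WHERE Year = 1999", demo_conn
)
print(list(empty_df.columns), len(empty_df))
demo_conn.close()
```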

<h2>Select all the records from the table Salaries</h2>

In [None]:
src = query_result("SELECT * FROM Salaries;")
src

Missing values are marked as 'Not Provided'. Almost every column has missing values.
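
Because the marker is a string, a column that contains it cannot be treated as numeric until the marker is converted. A minimal sketch of one way to do that in pandas, using a made-up series (`raw`) rather than the real table:

```python
import pandas as pd

# Made-up values mimicking a pay column that mixes numbers and the 'Not Provided' marker
raw = pd.Series([1200.5, "Not Provided", "300", 0], name="OvertimePay")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
print("missing:", clean.isna().sum())
```

After the conversion the usual tools such as `isna()`, `dropna()` and `mean()` treat the marker as a missing value.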

<a id="1"></a><h2>How many unique job titles are there?</h2>

In [None]:
query_result("SELECT COUNT(JobTitle) AS '#Job Titles', COUNT(DISTINCT JobTitle) AS '#Unique Job Titles' FROM Salaries;")
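
The two counts in the query above answer different questions: `COUNT(JobTitle)` counts rows whose title is not NULL, while `COUNT(DISTINCT JobTitle)` counts different titles. A small sketch on a hypothetical four-row table (the rows and the name `demo` are invented for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Salaries (JobTitle TEXT)")
demo.executemany(
    "INSERT INTO Salaries VALUES (?)",
    [("Clerk",), ("Clerk",), ("Nurse",), (None,)],  # one duplicate, one NULL
)

total_rows, titled_rows, unique_titles = demo.execute(
    "SELECT COUNT(*), COUNT(JobTitle), COUNT(DISTINCT JobTitle) FROM Salaries"
).fetchone()
print(total_rows, titled_rows, unique_titles)  # 4 3 2
demo.close()
```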

<a id="2"></a><h2>What are the unique job titles?</h2>

Printing only the first 10.

In [None]:
" | ".join(list(query_result("SELECT DISTINCT JobTitle FROM Salaries ORDER BY JobTitle asc")['JobTitle'])[:10])

<a id="3"></a><h2>For which years is the data reported?</h2>

In [None]:
query_result("SELECT DISTINCT Year FROM Salaries;")

<a id="4"></a><h2>What's the yearly trend in the count of reported jobs?</h2>

In [None]:
trend = query_result("SELECT Year,COUNT(*) AS 'Job Count' FROM Salaries GROUP BY Year ORDER BY Year")
print(trend)
trend.plot(x="Year",y="Job Count",legend=False,marker="o")
plt.title("Count of reported jobs every year")
plt.xticks(trend.Year)
plt.ylabel("No of jobs reported");

<a id="5"></a><h2>Are there any jobs related to data or machine learning?</h2>

In [None]:
# Finding job titles containing the words 'data' or 'machine' (case-insensitive)
query_result("SELECT * FROM Salaries WHERE LOWER(JobTitle) LIKE '%machine%' OR LOWER(JobTitle) LIKE '%data%' ORDER BY JobTitle;")

The only data-related job found is 'Senior Data Entry Operator'.
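
A note on case handling: SQLite's `LIKE` is case-insensitive for ASCII characters by default, so wrapping the column in `LOWER()` does not change which rows match here. It only matters if `PRAGMA case_sensitive_like` has been switched on. The sketch below checks this on a few invented titles; the table contents and the name `demo` are assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Salaries (JobTitle TEXT)")
demo.executemany(
    "INSERT INTO Salaries VALUES (?)",
    [("SENIOR DATA ENTRY OPERATOR",), ("Data Analyst",), ("Clerk",)],
)

# Same pattern, with and without LOWER() on the column
plain = demo.execute(
    "SELECT JobTitle FROM Salaries WHERE JobTitle LIKE '%data%' ORDER BY JobTitle"
).fetchall()
lowered = demo.execute(
    "SELECT JobTitle FROM Salaries WHERE LOWER(JobTitle) LIKE '%data%' ORDER BY JobTitle"
).fetchall()

print(plain)
print(plain == lowered)
demo.close()
```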

<a id="6"></a><h2>What's the trend in average TotalPay over the years?</h2>

In [None]:
sal_trend = query_result("SELECT Year,ROUND(AVG(TotalPay),2) AS 'Avg_Tot_Pay' FROM Salaries GROUP BY Year ORDER BY Year;")
print(sal_trend)
sal_trend.plot(x='Year',y='Avg_Tot_Pay',legend=False,marker="o")
plt.title("Trend in mean TotalPay")
plt.xticks(sal_trend.Year)
plt.ylabel("Mean TotalPay");

<a id="7"></a><h2>What are the top 5 jobs in terms of mean TotalPay in the most recent year (2014)?</h2>

In [None]:
query_result("SELECT JobTitle,ROUND(AVG(TotalPay),2) AS Avg_TotalPay FROM Salaries WHERE Year=2014 AND TotalPay>0 GROUP BY JobTitle ORDER BY AVG(TotalPay) desc LIMIT 5")

<a id="8"></a><h2>What's the trend in Average TotalPay of the above 5 jobs?</h2>

In [None]:
# TotalPay>0 is applied in the outer query too, so the 2014 averages match the top-5 table above
trend = query_result("""SELECT Year,JobTitle,ROUND(AVG(TotalPay),2) AS Avg_TotalPay FROM Salaries WHERE TotalPay>0 AND JobTitle IN 
(SELECT JobTitle FROM Salaries WHERE Year=2014 AND TotalPay>0 GROUP BY JobTitle ORDER BY AVG(TotalPay) desc LIMIT 5) GROUP BY JobTitle,Year
ORDER BY Year,AVG(TotalPay) desc""")
print(trend)
fig = plt.figure(figsize=(10,5))
sns.lineplot(x="Year",y="Avg_TotalPay",marker="o",data=trend,hue="JobTitle")
plt.xticks(trend.Year);

'Chief Investment Officer' has no salary history before 2014. None of the 5 jobs has data for 2011.

<a id="9"></a><h2>What are the bottom 5 jobs in terms of mean TotalPay in the most recent year (2014)?</h2>

In [None]:
query_result("SELECT JobTitle,ROUND(AVG(TotalPay),2) AS Avg_TotalPay FROM Salaries WHERE Year=2014 AND TotalPay>0 GROUP BY JobTitle ORDER BY AVG(TotalPay) asc LIMIT 5")

<a id="10"></a><h2>What's the trend in Average TotalPay of the above 5 jobs?</h2>

In [None]:
trend = query_result("""SELECT Year,JobTitle,ROUND(AVG(TotalPay),2) AS Avg_TotalPay FROM Salaries WHERE TotalPay>0 AND JobTitle IN 
(SELECT JobTitle FROM Salaries WHERE Year=2014 AND TotalPay>0 GROUP BY JobTitle ORDER BY AVG(TotalPay) asc LIMIT 5) GROUP BY JobTitle,Year
ORDER BY Year,AVG(TotalPay)""")
print(trend)
fig = plt.figure(figsize=(14,5))
sns.lineplot(x="Year",y="Avg_TotalPay",marker="o",data=trend,hue="JobTitle")
plt.xticks(trend.Year);

'Cashier 3' has no salary history before 2014.

<a id="11"></a><h2>Which employee earned the most in terms of TotalPay by Year?</h2>

In [None]:
query_result("SELECT Year,EmployeeName,JobTitle,MAX(TotalPay) FROM Salaries GROUP BY Year")
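
The query above relies on SQLite-specific behaviour: when a query contains a single `MAX()` or `MIN()` aggregate, the other bare columns are taken from the row that holds that extreme value. Other databases generally reject such a query or handle it differently. A more portable sketch, shown here on invented rows, joins the table back to the per-year maximum. The names `demo`, `bare` and `portable` are hypothetical.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Salaries (EmployeeName TEXT, TotalPay REAL, Year INTEGER)")
demo.executemany(
    "INSERT INTO Salaries VALUES (?, ?, ?)",
    [("Ann", 100.0, 2013), ("Bob", 250.0, 2013), ("Cy", 90.0, 2014), ("Dee", 300.0, 2014)],
)

# SQLite shortcut: bare columns follow the row that holds MAX(TotalPay)
bare = demo.execute(
    "SELECT Year, EmployeeName, MAX(TotalPay) FROM Salaries GROUP BY Year ORDER BY Year"
).fetchall()

# Portable form: compute the maximum per year, then join back to find the rows
portable = demo.execute("""
    SELECT s.Year, s.EmployeeName, s.TotalPay
    FROM Salaries AS s
    JOIN (SELECT Year, MAX(TotalPay) AS MaxPay FROM Salaries GROUP BY Year) AS m
      ON s.Year = m.Year AND s.TotalPay = m.MaxPay
    ORDER BY s.Year
""").fetchall()

print(bare)
print(portable)
demo.close()
```

Unlike the shortcut, the join returns every employee tied for the top pay in a year, which may or may not be what is wanted.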

<a id="12"></a><h2>Which employee earned the least in terms of TotalPay by Year?</h2>

In [None]:
query_result("SELECT Year,EmployeeName,JobTitle,MIN(TotalPay) FROM Salaries WHERE TotalPay>0 GROUP BY Year")

<a id="13"></a><h2>Is there any pattern in salaries of Junior, Senior, and Chief employee titles?</h2>

In [None]:
title = ["junior","senior","chief"]
for i,ti in enumerate(title):
    if i==0:
        title_sal = query_result(f"SELECT ROUND(AVG(TotalPay),2) as {ti} FROM Salaries WHERE TotalPay>0 AND LOWER(JobTitle) LIKE '%{ti}%'")
    else:
        title_sal[ti] = query_result(f"SELECT ROUND(AVG(TotalPay),2) as {ti} FROM Salaries WHERE TotalPay>0 AND LOWER(JobTitle) LIKE '%{ti}%'")
print(title_sal)
title_sal.T.plot(kind="bar",legend=False)
plt.ylabel("Avg TotalPay");

Employees with 'Junior' in their titles have the lowest average TotalPay, followed by 'Senior', with 'Chief' the highest.
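
The loop above builds each query with an f-string, which is fine for a fixed list of keywords. If the keywords ever came from user input, binding them as parameters would be safer. A minimal sketch of that variant with `sqlite3` placeholders, run on invented rows (the table contents and the names `demo` and `avg_by_keyword` are assumptions):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Salaries (JobTitle TEXT, TotalPay REAL)")
demo.executemany(
    "INSERT INTO Salaries VALUES (?, ?)",
    [
        ("Junior Clerk", 100.0),
        ("Senior Clerk", 200.0),
        ("Senior Nurse", 400.0),
        ("Chief Engineer", 900.0),
        ("Chief of Staff", 0.0),  # excluded by the TotalPay > 0 filter
    ],
)

avg_by_keyword = {}
for keyword in ["junior", "senior", "chief"]:
    # The ? placeholder is filled by sqlite3, so the keyword is never pasted into the SQL text
    row = demo.execute(
        "SELECT ROUND(AVG(TotalPay), 2) FROM Salaries "
        "WHERE TotalPay > 0 AND LOWER(JobTitle) LIKE ?",
        (f"%{keyword}%",),
    ).fetchone()
    avg_by_keyword[keyword] = row[0]

print(avg_by_keyword)
demo.close()
```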

<h3>The same comparison can be done directly in SQL, as shown below.</h3>

In [None]:
title = query_result("""SELECT CASE WHEN LOWER(JobTitle) LIKE '%junior%' THEN 'junior' 
             WHEN LOWER(JobTitle) LIKE '%senior%' THEN 'senior' 
             WHEN LOWER(JobTitle) LIKE '%chief%' THEN 'chief' 
             ELSE 'others' END 
             AS Title,ROUND(AVG(TotalPay),2) AS Avg_TotalPay FROM Salaries WHERE TotalPay>0 GROUP BY Title ORDER BY AVG(TotalPay)""")
title = title.loc[title["Title"]!="others",:].copy()
print(title)
title.plot(kind="barh",x="Title",y="Avg_TotalPay",legend=False,figsize=(10,5))
plt.xlabel("Avg TotalPay");

<a id="14"></a><h2>Is there any correlation between various pay components?</h2>

In [None]:
pay_df = query_result("""SELECT OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits FROM Salaries WHERE 
OvertimePay>0 AND OtherPay>0 AND Benefits>0 AND TotalPay>0 AND TotalPayBenefits>0 AND 
OvertimePay!='Not Provided' AND OtherPay!='Not Provided' AND Benefits!='Not Provided' AND TotalPay!='Not Provided' AND TotalPayBenefits!='Not Provided'""")
sns.heatmap(pay_df.corr(),annot=True,mask=np.triu(pay_df.corr()),cbar=False)
plt.title("Correlation b/w various pay components");

We can see a moderate to strong positive correlation among the various pay components.
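
`sns.heatmap` hides the cells where `mask` is true. The cell above passes the upper triangle of the correlation values themselves as the mask, which works as long as none of those values is exactly zero. An explicit boolean mask makes the intent clearer. The sketch below builds one for a made-up three-column frame (`demo_pay`); it stops short of plotting and only shows which cells would be hidden.

```python
import numpy as np
import pandas as pd

# Invented numbers, only to get a 3x3 correlation matrix
demo_pay = pd.DataFrame({
    "OvertimePay": [1.0, 2.0, 3.0, 4.0],
    "OtherPay": [2.0, 1.0, 4.0, 3.0],
    "TotalPay": [3.0, 3.5, 7.0, 8.0],
})
corr = demo_pay.corr()

# True on and above the diagonal, False below it
upper_mask = np.triu(np.ones(corr.shape, dtype=bool))
print(upper_mask)

# Preview of what stays visible: masked cells become NaN
lower_only = corr.mask(upper_mask)
print(lower_only)
```

The same array could then be passed as `mask=upper_mask` in the heatmap call.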

<a id="15"></a><h2>Who are the employees with TotalPay between 500,000 and 1,000,000?</h2>

In [None]:
query_result("SELECT * FROM Salaries WHERE TotalPay BETWEEN 500000 AND 1000000")
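
`BETWEEN` includes both endpoints, so employees earning exactly 500,000 or exactly 1,000,000 are part of the result. A quick check on invented values (the rows and the names `demo` and `in_range` are hypothetical):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Salaries (EmployeeName TEXT, TotalPay REAL)")
demo.executemany(
    "INSERT INTO Salaries VALUES (?, ?)",
    [
        ("below", 499999.99),
        ("lower bound", 500000.0),
        ("middle", 750000.0),
        ("upper bound", 1000000.0),
        ("above", 1000000.01),
    ],
)

in_range = [
    name
    for (name,) in demo.execute(
        "SELECT EmployeeName FROM Salaries "
        "WHERE TotalPay BETWEEN 500000 AND 1000000 ORDER BY TotalPay"
    )
]
print(in_range)
demo.close()
```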

<a id="16"></a><h2>Who held a single job across all 4 years?</h2>

<h3>This query gets complicated in SQL. Hence, we'll implement it in Python.</h3>

In [None]:
pd.set_option("display.min_rows",200)
df = query_result("SELECT * FROM Salaries")
df.drop_duplicates(inplace=True)

df["EmployeeName"] = df["EmployeeName"].str.lower()
df["JobTitle"] = df["JobTitle"].str.lower()


df["EmployeeName"]=df["EmployeeName"].str.replace("  "," ",regex=False)
df["JobTitle"]=df["JobTitle"].str.replace("  "," ",regex=False)


tot_yrs = df[["EmployeeName","Year"]].drop_duplicates().groupby("EmployeeName").count().reset_index().copy()
tot_yrs.columns = ["EmployeeName","year_count"]
tot_yrs.sort_values(by="year_count",ascending=False,inplace=True)
tot_yrs.drop_duplicates("EmployeeName",inplace=True)

df=df.merge(tot_yrs,on="EmployeeName",how="left").copy()

df["EmployeeName"] = df["EmployeeName"].replace("not provided",np.nan)
df["JobTitle"] = df["JobTitle"].replace("not provided",np.nan)

df.dropna(inplace=True)

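# Count the records per (employee, job title) pair, then keep the largest count per employee.
# job_count is therefore the number of records in the title the employee held most often,
# not the number of different titles.
tot_jobs = df[["EmployeeName","JobTitle","Year"]].groupby(["EmployeeName","JobTitle"]).count().reset_index().copy()
tot_jobs.drop(columns="JobTitle",inplace=True)
tot_jobs.columns = ["EmployeeName","job_count"]
tot_jobs.sort_values(by="job_count",ascending=False,inplace=True)
tot_jobs.drop_duplicates("EmployeeName",inplace=True)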


df = df.merge(tot_jobs,on=["EmployeeName"],how="left").copy()

sol = df.loc[(df["year_count"]==df["job_count"]) & (df["year_count"]==4),:].copy()
sol.sort_values(["EmployeeName","Year"]).reset_index(drop=True)

<a id="17"></a><h2>Who changed jobs every year across all 4 years?</h2>

In [None]:
sol = df.loc[(df["year_count"]==4) & (df["job_count"]==1),:].copy()
sol.sort_values(["EmployeeName","Year"]).reset_index(drop=True)

<a id="18"></a><h2>Who changed jobs at least once in 4 years?</h2>

In [None]:
sol = df.loc[(df["year_count"]==4) & (df["job_count"]<4),:].copy()
sol.sort_values(["EmployeeName","Year"]).reset_index(drop=True)
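
For comparison, questions 16 to 18 can also be approximated in SQL with `GROUP BY`, `HAVING` and `COUNT(DISTINCT ...)`. The sketch below is not the notebook's method: it assumes one record per employee per year, treats a name as identifying a single person, and skips the cleaning steps above apart from lower-casing. The employees (`Ann`, `Bob`, `Cy`, `Dee`) and their titles are invented.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Salaries (EmployeeName TEXT, JobTitle TEXT, Year INTEGER)")
years = [2011, 2012, 2013, 2014]
toy_titles = {
    "Ann": ["Clerk", "CLERK", "Clerk", "Clerk"],  # same job, inconsistent casing
    "Bob": ["Clerk", "Clerk", "Nurse", "Nurse"],  # changed once
    "Cy": ["Clerk", "Nurse", "Cook", "Driver"],   # changed every year
}
for name, titles in toy_titles.items():
    demo.executemany(
        "INSERT INTO Salaries VALUES (?, ?, ?)",
        [(name, title, year) for title, year in zip(titles, years)],
    )
# Dee appears in only two years and is dropped by the HAVING clause
demo.executemany(
    "INSERT INTO Salaries VALUES (?, ?, ?)",
    [("Dee", "Clerk", 2013), ("Dee", "Clerk", 2014)],
)

# One row per employee present in all 4 years, with the number of distinct titles held
counts = demo.execute("""
    SELECT LOWER(EmployeeName), COUNT(DISTINCT LOWER(JobTitle))
    FROM Salaries
    GROUP BY LOWER(EmployeeName)
    HAVING COUNT(DISTINCT Year) = 4
    ORDER BY LOWER(EmployeeName)
""").fetchall()

single_job = [name for name, n_titles in counts if n_titles == 1]
changed_every_year = [name for name, n_titles in counts if n_titles == 4]
changed_at_least_once = [name for name, n_titles in counts if n_titles > 1]
print(single_job, changed_every_year, changed_at_least_once)
demo.close()
```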

<a id="19"></a><h2>How does the mean TotalPay vary between full-time and part-time jobs?</h2>
Full time: Status = FT<br>
Part time: Status = PT

In [None]:
stt = query_result("SELECT Status,AVG(TotalPay) FROM Salaries WHERE Status IN ('FT','PT') GROUP BY Status")
print(stt)
stt.plot(kind="bar",x='Status',y='AVG(TotalPay)',legend=False);