In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## About this data
This [dataset](https://www.kaggle.com/andrewmvd/data-analyst-jobs) was created by [picklesueat](https://github.com/picklesueat/data_jobs_data) and contains more than 2000 job listings for data analyst positions, with features such as:

- Salary Estimate
- Location
- Company Rating
- Job Description
- and more.

## List of things we will explore:
- Which sectors and industries pay the highest?
- Which sectors and industries have the most available jobs?
 > There's no point in finding out which sectors or industries pay handsomely if there are only a handful of jobs available, right?
- The top and bottom 25 cities in terms of salaries
- States that pay the best salaries
- Effect of Company Size/Age on salaries
 > Do larger or older companies pay more? We find out!
- How much better do senior-level jobs pay?

First, let's read in the data and have a feel of what it looks like.

In [None]:
df = pd.read_csv('/kaggle/input/data-analyst-jobs/DataAnalyst.csv')
df.head(10)

Now let's look at the types of columns we have:

In [None]:
df.info()

### The first look at the data tells us there are 15 columns. Some ideas about what this EDA can explore are:
- Which sector/industry pays more?
- What's the range of salaries for different seniority levels?
- Does the size of the company have an impact on salary?
- Does the length of the job description correlate with the salary?
- Do certain locations pay more?

## Which sector pays the highest?

Since salary estimates are given as a range, it is useful to reduce each one to a single number: the midpoint of the range, stored as `median_sal`.

In [None]:
# Salary ranges are assumed to look like '$37K-$66K (...)'; missing estimates are stored as '-1'
df['low_bound_sal'] = df['Salary Estimate'].apply(lambda x : (x.split()[0]).split('-')[0] if x != '-1' else x)
df['upp_bound_sal'] = df['Salary Estimate'].apply(lambda x : (x.split()[0]).split('-')[1] if x != '-1' else x)

# strip the leading '$' and the trailing 'K'
df['upp_bound_sal'] = df['upp_bound_sal'].apply(lambda x : x[1:-1] if x != '-1' else x)
df['low_bound_sal'] = df['low_bound_sal'].apply(lambda x : x[1:-1] if x != '-1' else x)

# convert to numbers and turn the -1 placeholder into NaN so it is not treated as a salary
df['upp_bound_sal'] = pd.to_numeric(df['upp_bound_sal']).replace(-1, np.nan)
df['low_bound_sal'] = pd.to_numeric(df['low_bound_sal']).replace(-1, np.nan)

# midpoint of the range, in dollars
df['median_sal'] = (df['upp_bound_sal'] + df['low_bound_sal'])/2 * 1000
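The same parsing step can also be written with a regular expression. This is only a sketch: the function name `parse_salary_range` is made up for illustration, and it assumes every estimate looks like `$37K-$66K (...)`, with `-1` marking a missing value.

```python
import re

# Assumed format: "$37K-$66K (Glassdoor est.)"; '-1' marks a missing estimate.
_SALARY_RE = re.compile(r"\$(\d+)K-\$(\d+)K")

def parse_salary_range(text):
    """Return (low, high, midpoint) in dollars, or None if no range is found."""
    match = _SALARY_RE.search(text)
    if match is None:
        return None
    low, high = (int(group) * 1000 for group in match.groups())
    return low, high, (low + high) / 2

print(parse_salary_range("$37K-$66K (Glassdoor est.)"))  # (37000, 66000, 51500.0)
print(parse_salary_range("-1"))                          # None
```

Rows without a parsable range come back as `None`, so they are easy to spot and drop before plotting.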

In [None]:
sns.set(font_scale=1.2)
sns.catplot(y='Sector', x='median_sal', kind='box', data=df, height=7, aspect=2,
            order=df.groupby('Sector')['median_sal'].median().sort_values(ascending=False).index)
plt.title('Distribution of median salaries by Sector',fontsize='20')
plt.xticks(rotation=90)

- Judging by the median lines, **Biotech** looks like the sector with the best chance of a higher salary.
- Separately, the **Arts & Entertainment** sector has HUGE variation: an analyst role pays anywhere from roughly `$40,000` to over `$140,000`.
- **'Restaurants, Bars & Food Services'** isn't a great place to look if you want a higher salary, as the range is pretty limited and it has the lowest median.

## Which industry pays the highest?

In [None]:
sns.set(font_scale=1.2)
sns.catplot(y='Industry', x='median_sal', kind='box', data=df, height=15, aspect=1, order=df.groupby('Industry')['median_sal'].median().sort_values(ascending=False).index)
plt.title('Distribution of median salaries by Industry', fontsize='20')
plt.xticks(rotation=90)

There's a long list of industries here but the top 5 paying are:
- Education Training Services  
- Health Care Products Manufacturing 
- Drug & Health Stores
- Gambling
- Biotech & Pharmaceuticals

Out of the top 5, having **Education Training Services** as the top paying is a little surprising to me, maybe they're hiring data analyst instructors? 

And the bottom 5 are:
- Oil & Gas Services
- Membership Organizations
- Trucking
- Audiovisual
- Grocery Stores & Supermarkets 

## Which sector and industry is hiring the most?

In [None]:
plt.figure(figsize=(12, 7))
sns.set(font_scale=1.2)
sns.countplot(y='Sector', data=df, order=df['Sector'].value_counts().index)
plt.title('Counts of job postings by Sector', fontsize='18')
plt.xticks(rotation=90)

Information Technology and Business Services are by far the sectors with the most data analyst job postings.

In [None]:
plt.figure(figsize=(7, 18))
sns.set(font_scale=1.2)
sns.countplot(y='Industry', data=df, order=df['Industry'].value_counts().index)
plt.title('Counts of job postings by Industry', fontsize='16')
plt.xticks(rotation=90)

Apart from the missing data (-1), **'IT Services'** and **'Staffing & Outsourcing'** are by far the two industries with the highest counts of job postings.

## Salaries by Location

In [None]:
count = df['Location'].nunique()

print(f'There are {count} unique cities')

Since there are 253 different locations, it makes sense to show only the top and bottom 25 in terms of median salary. We can also aggregate by state by creating a new column.

### Salaries by Location (Top 25 Cities)

In [None]:
sns.set(font_scale=1.2)
sns.catplot(y='Location', x='median_sal', kind='box', data=df, height=10, aspect=1, order=df.groupby('Location')['median_sal'].median().sort_values(ascending=False).iloc[:25].index)

plt.title('Top 25 median salaries by Location', fontsize='18')
plt.xticks(rotation=90)

These are the cities to watch for jobs, but about half of them have too few postings to give a reliable picture of salaries. It may be better to look at more aggregated data, e.g. at the state level. From the looks of it, the majority of these cities are in CA.
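One way to soften the sparsity problem is to rank only locations with a minimum number of postings. The sketch below uses plain Python on made-up rows; the helper name `group_medians` and the threshold of 5 are arbitrary choices for illustration, not something taken from the dataset.

```python
from collections import defaultdict
from statistics import median

def group_medians(rows, min_count=5):
    """Median salary per location, keeping only locations with at least min_count postings."""
    by_location = defaultdict(list)
    for location, salary in rows:
        by_location[location].append(salary)
    return {
        location: median(salaries)
        for location, salaries in by_location.items()
        if len(salaries) >= min_count
    }

# Made-up rows for illustration only
sample = [("City A, CA", 90000)] * 3 + [("City A, CA", 70000)] * 3 + [("City B, TX", 120000)]
print(group_medians(sample, min_count=5))  # {'City A, CA': 80000.0}
```

"City B, TX" has the highest salary in the sample but is dropped because it rests on a single posting.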

### Salaries by Location (Bottom 25 Cities)

In [None]:
sns.set(font_scale=1.2)
sns.catplot(y='Location', x='median_sal', kind='box', data=df, height=10, aspect=1, order=df.groupby('Location')['median_sal'].median().sort_values(ascending=False).iloc[-25:].index)

plt.title('Bottom 25 median salaries by Location', fontsize='18')
plt.xticks(rotation=90)

Same data sparsity issue when we look at the bottom 25. Most of the cities seem to be in UT and PA.

### Salaries by Location (States)

In [None]:
# create the state column
# take the last comma-separated part of the location and strip the leading space
df['state'] = df['Location'].apply(lambda x : x.split(',')[-1].strip())

In [None]:
count = df['state'].nunique()
print(f'There are {count} states')

In [None]:
sns.set(font_scale=1.2)
sns.catplot(y='state', x='median_sal', kind='box', data=df, height=10, aspect=1, order=df.groupby('state')['median_sal'].median().sort_values(ascending=False).index)

plt.title('Salaries by State', fontsize='18')
plt.xticks(rotation=90)

This looks much better, and as we guessed earlier, CA is the state with the highest salaries, though it also has the widest range. IL, in second place, has a much tighter distribution, and you can be fairly sure of a salary of at least about $60,000 there.

## Effect of Size/Age of Company on Salary

One would think that more established companies (bigger and older) would be in a better financial position to pay more. On the flip side, well-funded startups, especially in Silicon Valley, also need to offer competitive salaries to attract talent. Let's find out if there is a relationship between salaries and the size or age of a company.

In [None]:
df['Age'] = df['Founded'].apply(lambda x : (2020 - x) if x != -1 else x)

In [None]:
sns.set(font_scale=1.2)
sns.set_palette("Paired")
plt.figure(figsize=(15, 7))
# exclude rows where the founding year is unknown (Age == -1)
sns.scatterplot(x='Age', y='median_sal', data=df[df['Age'] != -1], hue='state')

plt.title('Salaries by Company Age', fontsize='18')
plt.xticks(rotation=90)

I also took the liberty to add a colour code to the scatterplot to show the differences between states.

A couple of interesting observations here:
- Salaries appear to trend downward as the `Age` of the company increases, which runs against our earlier guess that more established companies would pay more.
- Younger companies are posting more data analyst roles.
- Job postings start to thin out beyond a company age of about 50.
- Like we've seen before, CA takes the top slot in terms of high salaried data analyst roles 
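A trend read off a scatterplot can be checked with a correlation coefficient. The sketch below computes Pearson's r in plain Python on made-up numbers; the helper names are hypothetical, and on the real data `df['Age'].corr(df['median_sal'])` would be a shorter route once the `-1` placeholders are excluded.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def age_salary_correlation(pairs):
    """Correlation over (age, salary) pairs, skipping rows where age is the -1 placeholder."""
    known = [(age, salary) for age, salary in pairs if age != -1]
    ages, salaries = zip(*known)
    return pearson_r(ages, salaries)

# Made-up numbers for illustration only
toy = [(5, 90000), (20, 80000), (50, 60000), (-1, 75000), (100, 50000)]
print(round(age_salary_correlation(toy), 3))
```

A value close to -1 would support a downward trend, while a value near 0 would suggest the pattern in the plot is mostly noise.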

In [None]:
# remove unknown and -1 company size
df_filtered = df.loc[(df['Size'] != '-1') & (df['Size'] != 'Unknown')]

sns.set(font_scale=1.2)
sns.catplot(y='Size', x='median_sal', kind='box', data=df_filtered, height=10, aspect=1,\
           order=df_filtered.groupby('Size')['median_sal'].median().sort_values(ascending=False).index)

plt.title('Salaries by Company Size', fontsize='18')
plt.xticks(rotation=90)

It looks like there isn't a material difference in salaries across companies of different sizes.
One minor thing to point out is that companies with 5001 to 10000 employees seem to have a larger range of salaries above the 50th percentile.

## Seniority Levels

- If you're a data analyst with some experience under your belt, which places are hiring?
- What kind of salary should you expect as an experienced hire?
- What seniority levels are the jobs at?

In [None]:
def extract_seniority(t):
    t  = t.lower()
    if 'senior' in t:
        return 'Senior'
    elif 'manager' in t:
        return 'Manager'
    elif 'lead' in t:
        return 'Lead'
    elif 'principal' in t:
        return 'Principal'
    else:
        return 'Everyone Else'
    

df['Level'] = df['Job Title'].apply(extract_seniority)



In [None]:
sns.set(font_scale=1.2)
sns.catplot(y='Level', x='median_sal', kind='box', data=df, height=10, aspect=1, \
            order=df.groupby('Level')['median_sal'].median().sort_values(ascending=False).index)

plt.title('Salaries by Seniority Levels', fontsize='18')
plt.xticks(rotation=90)

The results are a little surprising, because 'Lead' and 'Principal' roles would be expected to pay more than 'Senior' roles. Maybe the way we segment the levels is too simplistic and misses other titles that imply seniority.
I'm most curious about why 'Principal' level jobs pay so much less than the others.

Let's find out!

In [None]:
df[df['Job Title'].str.contains('principal', case=False)].tail(20)

Digging a bit deeper into the data, it turns out there are only 8 jobs with 'Principal' in their names so it could be easily skewed by extreme values. 

Of the 8 jobs above, a **Principal Business Analyst - Data Governance** at **Rockstar** pays only `$27K-$52K` and a **Principal Data Management Analyst** at **Northrop Grumman** pays only `$42K-$66K`, both significantly lower than the rest of the group.
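The segmentation itself could also be made a little less simplistic. The sketch below is one possible variant, not a tested improvement: it matches whole words only (so "lead" does not fire inside "misleading"), accepts "sr" as an abbreviation, and checks the rarer titles first. The pattern list and the name `extract_level` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; checked in order, first match wins.
LEVEL_PATTERNS = [
    ("Principal", re.compile(r"\bprincipal\b")),
    ("Manager", re.compile(r"\bmanager\b")),
    ("Lead", re.compile(r"\blead\b")),
    ("Senior", re.compile(r"\b(senior|sr)\b")),
]

def extract_level(title):
    """Return the first seniority level whose pattern matches the job title."""
    lowered = title.lower()
    for level, pattern in LEVEL_PATTERNS:
        if pattern.search(lowered):
            return level
    return "Everyone Else"

# Made-up titles for illustration only
titles = ["Sr. Data Analyst", "Senior Principal Data Analyst", "Data Analyst", "Lead Data Analyst"]
print(Counter(extract_level(t) for t in titles))
```

Printing the counts per level before plotting also makes small groups, such as the 8 'Principal' postings, visible up front.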

## Summary of Findings

So what did we learn so far?
- Education Training Services, Health Care Products Manufacturing, Drug & Health Stores, Gambling, and Biotech & Pharmaceuticals are the highest-paying industries, but jobs in these industries might not be abundant
- Most job postings are in the Information Technology and Business Services sectors
- Salaries in CA are the highest, followed by IL and AZ
- Older companies don't necessarily pay more, most of the job postings and high paying roles are from younger companies (< 40 years old)
- Company size doesn't have much impact on salaries
- Seniority levels don't necessarily get paid higher salaries across the board