# Introduction

Data analyst roles are in growing demand. A general understanding of the job market can help job seekers navigate the job search and prepare for their careers. This exploratory data analysis aims to answer some questions for aspiring data analysts:

* How are salaries?
* Where are the jobs?
* Who are the top hirers?
* What skills/education/years of experience are needed?

This study is based on the data-analyst-jobs dataset created by picklesueat. The dataset contains 2253 job listings for data analyst positions in the US, collected from Glassdoor. Based on an analysis of features such as Salary Estimate, Location, Company Rating and Job Description, some recommendations are given at the end of the study.

Keywords: EDA, Visualization, NLP

# Content:

Preparation:

* Package importation
* Data importation
* Data Observation & Cleaning

Analysis:

* Salary
* Location
* Top Hirers
* Skills

Summary

# Preparation

Package Importation:

In [None]:
# Import necessary packages

# Packages for dataframe operations
import numpy as np 
import pandas as pd

# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import collections

# Packages for text analysis
from wordcloud import WordCloud
import string
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
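#!pip install gensim --user
# gensim.summarization was removed in gensim 4.0 and is not used below,
# so the import is guarded rather than required.
try:
    from gensim.summarization import keywords, summarize
except ImportError:
    pass
import spacy
import plotly
plotly.offline.init_notebook_mode(connected=True)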

In [None]:
# Use Seaborn's 'white' style for all plots in the study.
sns.set_style('white')

Data Importation

In [None]:
job = pd.read_csv("/kaggle/input/data-analyst-jobs/DataAnalyst.csv")

Data Observation:

In [None]:
job.head()

In [None]:
job.shape

In [None]:
job.columns

In [None]:
job.info()

Data Cleaning

In [None]:
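# Drop the unneeded 'Unnamed: 0' column and reset the index
# (the original reset_index call discarded its result)
job = job.drop(columns=['Unnamed: 0']).reset_index(drop=True)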

In [None]:
job.head()

In [None]:
job.isnull().sum()

In [None]:
# fill the NA with -1
job.fillna(-1,inplace=True)

In [None]:
job.isnull().sum()

In [None]:
# To check the unique categories of 'Size'
job["Size"].unique()

In [None]:
# To check the unique categories of 'Revenue'
job["Revenue"].unique()

In [None]:
# To check the unique categories of 'Salary Estimate'
job["Salary Estimate"].unique()

We can see that the column consists of salary-range strings. To make the data comparable and easy to analyze, the code below adds four columns: 'Salary_From', 'Salary_To', 'Salary_Mean' and 'Salary_Range'.

In [None]:
# Add 4 columns based on "Salary Estimate"
job['Salary_From'] = job["Salary Estimate"].str.extractall(r"[$](\d+)").xs(0, level='match')
job['Salary_To'] = job["Salary Estimate"].str.extractall(r"[$](\d+)").xs(1, level='match')
job.fillna(-1,inplace=True)
job['Salary_From'] = job['Salary_From'].astype(int)
job["Salary_To"] = job["Salary_To"].astype(int)
job['Salary_Mean'] = (job["Salary_To"] + job['Salary_From'])/2
job['Salary_Range'] = job["Salary_To"] - job['Salary_From']

# Can also be achieved by:
#job['Salary_From'] = job["Salary Estimate"].str.extract(r"[$](\d+)")
#job['Salary_To'] = job["Salary Estimate"].str.extract(r"(\d+)\S ")
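
As a cross-check on the parsing above, here is a minimal sketch that pulls both bounds in a single pass with named capture groups. The three sample strings are made up for illustration and assume the `$<low>K-$<high>K` layout; in this variant, rows that do not match stay as NaN instead of being filled with -1.

```python
import pandas as pd

# Hypothetical sample rows; the last one mimics a missing estimate.
sample = pd.DataFrame({
    "Salary Estimate": [
        "$37K-$66K (Glassdoor est.)",
        "$46K-$87K (Glassdoor est.)",
        "-1",
    ]
})

# Named groups become the column names of the extracted frame.
pattern = r"\$(?P<Salary_From>\d+)K-\$(?P<Salary_To>\d+)K"
bounds = sample["Salary Estimate"].str.extract(pattern).astype(float)

# Derived columns, in $ thousands per year.
bounds["Salary_Mean"] = (bounds["Salary_From"] + bounds["Salary_To"]) / 2
bounds["Salary_Range"] = bounds["Salary_To"] - bounds["Salary_From"]
print(bounds)
```

Keeping NaN for missing estimates means `describe()` and the plots would skip those rows automatically, at the cost of float columns instead of integers.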


In [None]:
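# Strip '\n' and everything after it from "Company Name"
# (regex=True is required; recent pandas versions treat the pattern as a literal string by default)
job["Company Name"] = job["Company Name"].str.replace(r"\n.*", "", regex=True)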

In [None]:
# Add two columns based on 'Location', to separate city and state
# (assumes the "City, ST" layout: the last two characters are the state code)
job["Location_State"] = job["Location"].str[-2:]
job["Location_City"] = job["Location"].str[:-4]
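
The fixed-width slicing above relies on every location ending in a two-letter state code. A possible alternative, sketched here on made-up locations, is to split once from the right on the last comma, which does not depend on character positions:

```python
import pandas as pd

# Hypothetical locations, including one with an extra comma-separated part.
locations = pd.Series([
    "New York, NY",
    "San Jose, CA",
    "Greenwood Village, Arapahoe, CO",
])

# Split once from the right so only the trailing state code is peeled off.
parts = locations.str.rsplit(", ", n=1, expand=True)
parts.columns = ["Location_City", "Location_State"]
print(parts)
```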

In [None]:
job.head(3)

In [None]:
job[job["Company Name"].isnull()]

In [None]:
job.fillna(-1,inplace=True)

In [None]:
# To check the unique categories of "Type of ownership"
job["Type of ownership"].unique()

In [None]:
# To check the unique categories of "Industry"
job["Industry"].unique()

# Analysis 

* Salary

In [None]:
# Visualize the salary lower and upper bound
fig,(ax0, ax1) = plt.subplots(nrows =1, ncols = 2, figsize = (20,4))
# histplot replaces the deprecated distplot; stat='density' keeps the same y-scale
sns.histplot(job['Salary_From'], kde=True, stat='density', ax=ax0)
sns.histplot(job['Salary_To'], kde=True, stat='density', color='r', ax=ax1)
ax0.set(xlabel = "Salary Lower Bound ($ k/year)", ylabel = 'Density')
ax0.set(title = "Salary Lower Bound Distribution")
ax1.set(xlabel = "Salary Upper Bound ($ k/year)", ylabel = 'Density')
ax1.set(title = "Salary Upper Bound Distribution")
plt.show()

In [None]:
# Visualize the salary range and the salary mean
fig,(ax0, ax1) = plt.subplots(nrows =1, ncols = 2, figsize = (20,4))
# histplot replaces the deprecated distplot; stat='density' keeps the same y-scale
sns.histplot(job['Salary_Range'], kde=True, stat='density', color='g', ax=ax0)
sns.histplot(job['Salary_Mean'], kde=True, stat='density', color='y', ax=ax1)
ax0.set(xlabel = "Salary Range ($ k/year)", ylabel = 'Density')
ax0.set(title = "Salary Range Distribution")
ax1.set(xlabel = "Average Salary ($ k/year)", ylabel = 'Density')
ax1.set(title = "Average Salary Distribution")
plt.show()

In [None]:
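# Exclude listings with no salary estimate (filled with -1 above) instead of hard-coding row 2149
Salary_df = job.loc[job['Salary_From'] != -1, ['Salary_From','Salary_To', 'Salary_Range', 'Salary_Mean']]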
Salary_df.describe()

The above shows:

* Salary lower bound: the majority are within 40-60k; the average is 54k.
* Salary upper bound: the majority are within 70-100k; the average is 90k.
* The upper bound varies more (std about 30) than the lower bound (std about 20).
* Salary range: the majority are within 25-45k; the average is 35k.
* Salary mean: the majority are within 55-80k; the average is 72k.

Next, we look at the mean salary distribution across states.

In [None]:
# Build an ordered list of states, sorted by median salary (highest first)
sort_list = sorted(job.groupby('Location_State')['Salary_Mean'].median().items(), key= lambda x:x[1], reverse = True)
state_list_sort = [x[0] for x in sort_list]
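
The same ordering can probably be obtained without the manual `sorted` and list comprehension. The sketch below uses a made-up mini-dataset with the two columns used above; `toy` and `state_order` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical mini-dataset with the same two columns used above.
toy = pd.DataFrame({
    "Location_State": ["CA", "CA", "TX", "TX", "NY", "NY"],
    "Salary_Mean": [90.0, 80.0, 60.0, 50.0, 75.0, 65.0],
})

# Median salary per state, sorted from highest to lowest.
state_order = (
    toy.groupby("Location_State")["Salary_Mean"]
    .median()
    .sort_values(ascending=False)
    .index.tolist()
)
print(state_order)
```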

In [None]:
plt.figure(figsize = (20,8))
sns.boxplot(x='Salary_Mean',y = 'Location_State', data=job, whis = 10, order = state_list_sort, palette="vlag" )
plt.xlabel("Average Salary ($ k/year)")
plt.title("Average Salary Per State", size =20)
plt.show()

We can see that California and Illinois lead on salaries.

Next, we check the mean salary distribution across cities.

In [None]:
# Build an ordered list: the 20 cities with the most listings, sorted by median salary
city_20_list = job.groupby('Location')['Location'].count().sort_values(ascending = False).head(20)
city_count_list = [x for x in city_20_list.index]
sort_list_city = sorted(job[job['Location'].isin(city_count_list)].groupby('Location')['Salary_Mean'].median().items(), key= lambda x:x[1], reverse = True)
city_list = [x[0] for x in sort_list_city ]

In [None]:
plt.figure(figsize = (20,8))
sns.boxplot(x='Salary_Mean',y = 'Location', data=job, whis = 10, order = city_list, palette="vlag" )
plt.xlabel("Average Salary ($ k/year)")
plt.title("Average Salary Per City", size = 20)
plt.show()

The above shows that jobs in California cities (San Jose, San Francisco, San Diego and Los Angeles) have higher average salaries.

Now we have a general idea of data analyst salaries and how they differ among locations. In the following sections, we will also look at salaries in more detail (combined with job numbers) by location, company and sector.

* Location

First we will have a look at the job numbers by state:

In [None]:
# Create a Count dataframe with the number of jobs in each city
State_City_df = job.groupby(['Location_State','Location_City'])['Location_City'].count().to_frame('Count').reset_index()
State_City_df['Country'] = 'US'

# Use the Count dataframe to draw the treemap
fig = px.treemap(State_City_df, path=['Country', 'Location_State', 'Location_City'], values='Count',
                 color= 'Count'
                ,color_continuous_scale='Blues', title = 'Job No. by State and City')
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()

From the above treemap, we can see that among the 19 states which offer data analyst jobs:

* The top 5 states (California, Texas, New York, Illinois and Pennsylvania) offer more than 50% of total jobs.
* California and Texas are the most significant job markets at the state level:
    * Four big cities of CA (San Jose, San Francisco, San Diego and Los Angeles) are the major job contributors. Also, about 50% of CA jobs are spread across numerous other cities.
    * Jobs in TX come mainly from 6 big cities (including Austin, Houston, Dallas and San Antonio).
* New York City and Chicago are the most significant job markets at the city level.
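
Shares like "the top 5 states offer more than 50% of total jobs" can be read off the treemap, but they can also be computed directly. A minimal sketch on made-up state labels (not the real counts):

```python
import pandas as pd

# Hypothetical state labels for ten listings.
states = pd.Series(["CA", "CA", "CA", "CA", "TX", "TX", "TX", "NY", "NY", "IL"])

share = states.value_counts(normalize=True)   # fraction of listings per state
cumulative = share.cumsum()                   # running total, largest state first
print(cumulative.round(2))
```

On the real data, the same two calls on `job['Location_State']` would show how many states are needed to pass the 50% mark.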

Now we compare the job numbers with the average salary, first by state:

In [None]:
# Build an ordered list of states by number of jobs
state_list = sorted(job.groupby('Location_State')['Location_State'].count().items(), key= lambda x:x[1], reverse = True)
state_count_list = [x[0] for x in state_list]

In [None]:
# Show both the job # and average salaries of each state
plt.figure(figsize = (20,8))
state_mean = job.groupby('Location_State')['Salary_Mean'].mean()
sns.countplot(x='Location_State',data=job, order = state_count_list, palette="ch:s=.25,rot=-.25")
ax2 = plt.twinx()
sns.pointplot(x = 'Location_State', y = 'Salary_Mean', data = job,order = state_count_list,ax=ax2, linestyles=["--"])
ax2.set(ylabel = 'Average Salary ($ k/year)')
plt.title("Job No. and Average Salary Per State", size = 20)
plt.show()

From the above, we can see that California has the most job offerings and the highest average salaries; Texas jobs have the lowest average wages among the top 5 job supply states.
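
To back the chart with numbers, job counts and average salaries per state can be put side by side in one table. This is only a sketch on made-up data; `Job_Count` and `Avg_Salary` are illustrative column names.

```python
import pandas as pd

# Hypothetical listings with the same two columns used in the plot.
toy = pd.DataFrame({
    "Location_State": ["CA", "CA", "CA", "TX", "TX", "NY"],
    "Salary_Mean": [90.0, 80.0, 85.0, 60.0, 50.0, 75.0],
})

# Named aggregation: one row per state with its listing count and mean salary.
summary = (
    toy.groupby("Location_State")
    .agg(Job_Count=("Salary_Mean", "size"), Avg_Salary=("Salary_Mean", "mean"))
    .sort_values("Job_Count", ascending=False)
)
print(summary)
```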

Then we compare the job numbers with the average salary by city:

In [None]:
city_15_list = job.groupby('Location')['Location'].count().sort_values(ascending = False).head(15)
city_count_list_15 = [x for x in city_15_list.index]

# Show both the job # and average salaries of each city
plt.figure(figsize = (20,8))
state_mean = job.groupby('Location')['Salary_Mean'].mean()
sns.countplot(x='Location',data=job, order = city_count_list_15, palette="ch:s=.25,rot=-.25")
ax2 = plt.twinx()
sns.pointplot(x = 'Location', y = 'Salary_Mean', data = job, alpha = 0.1, linestyles=["--"],order = city_count_list_15,ax=ax2)
ax2.set(ylabel = 'Average Salary ($ k/year)')
plt.title("Job No. and Average Salary Per City (Top 15)", size = 20)
plt.show()

There are 253 cities with DA jobs, and New York is the biggest job market at the city level.
Salary-wise, jobs in California cities have significantly higher salaries than those in other cities.

* Top Hirers

Now we check the top 20 companies in terms of job listing numbers:

In [None]:
# Count the job listings per company and keep the 20 companies with the most listings
company_list = sorted(job.groupby('Company Name')['Company Name'].count().items(), key= lambda x:x[1], reverse = True)
top20_company = company_list[0:20]
company_count_list = [x[0] for x in top20_company]

Company_df = job.groupby(['Company Name','Job Title'])['Job Title'].count().to_frame('Count').reset_index()
Company_df = Company_df[Company_df['Company Name'].isin(company_count_list)]
Company_df['Country'] = 'US'
Company_df_sort = Company_df.sort_values('Count', ascending = False)

In [None]:
# Use the Count dataframe to draw the treemap (the dataframe holds the top 20 companies)
fig = px.treemap(Company_df_sort, path=['Country', 'Company Name','Job Title'], values='Count',
                 color= 'Count'
                ,color_continuous_scale='Blues', title = 'Top 20 Hirers')
fig.data[0].textinfo = 'label+text+value'
fig.show()

From the above, we can see that among all companies (1502), one company (Staffigo Technical Services) offers 58 openings, and nine companies have 20+ positions.
The jobs are sparsely distributed among different companies.

In [None]:
# Number of distinct companies in the dataset
job['Company Name'].nunique()

Then let's have a look at the salaries of the top 10 hirers:

In [None]:
company_s = job.groupby(['Company Name'])['Company Name'].count().sort_values(ascending = False)[:10]
comapny_l = [x for x in company_s.index if x != '-1' ]
plt.figure(figsize = (18,6))
sns.countplot(x='Company Name', data=job, order = comapny_l, palette="ch:s=.25,rot=-.25")
ax2 = plt.twinx()
sns.pointplot(x = 'Company Name', y = 'Salary_Mean', data = job, alpha = 0.1, linestyles=["--"],order = comapny_l,ax=ax2)
ax2.set(ylabel = 'Average Salary ($ k/year)')
plt.title("Job No. and Average Salary of Top 10 Hirers", size = 20)
plt.show()

The average salary at the top hirer, Staffigo Technical Services, is lower than at other companies, as its openings are mostly for junior DAs.
Apple jobs come with higher average salaries.
Here are links to the websites of the top 5 hirers:

* Staffigo: https://www.staffigo.com/it-staffing.html
* Diverse Lynx: https://www.diverselynx.com/
* Kforce: https://www.kforce.com/
* Lorven Technologies Inc: https://www.lorventech.com/
* Mondo: https://mondo.com/

The research finds that almost all top hirers are IT staffing companies, so it is reasonable to assume that the real demand for DA jobs comes from their clients. Since staffing companies account for a large proportion of listings, we do not dig further into company-related information such as size or revenue.

Top 10 Hirers' Company Profile:

In [None]:
company_s = job.groupby(['Company Name'])['Company Name'].count().sort_values(ascending = False)[:10]
comapny_l = [x for x in company_s.index if x != '-1' ]
temp=job[job['Company Name'].isin(comapny_l)]
df_top10_company = temp.groupby('Company Name').first().reset_index()
df_top10_company[['Company Name','Location','Headquarters','Size','Founded','Type of ownership','Revenue','Sector','Industry','Competitors']]

Then we categorize companies by sector:

In [None]:
# Create a Count dataframe with the number of jobs in each sector and industry
Sector_df = job.groupby(['Sector','Industry'])['Industry'].count().to_frame('Count').reset_index()
Sector_df = Sector_df [Sector_df ['Industry']!= '-1']
Sector_df['Country'] = 'US'

# Use the Count dataframe to draw the treemap
fig = px.treemap(Sector_df, path=['Country', 'Sector', 'Industry'], values='Count',
                 color= 'Count'
                ,color_continuous_scale='Blues', title = 'Job No. by Sector')
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()

As we learned from the top hirers, IT Services and Staffing & Outsourcing companies account for a big chunk of the total job market.

In [None]:
sector_s = job.groupby(['Sector'])['Sector'].count().sort_values(ascending = False)[:11]
sector_l = [x for x in sector_s.index if x != '-1' ]
plt.figure(figsize = (18,6))
sns.countplot(x='Sector', data=job, order = sector_l, palette="ch:s=.25,rot=-.25")
ax2 = plt.twinx()
sns.pointplot(x = 'Sector', y = 'Salary_Mean', data = job, alpha = 0.1, linestyles=["--"],order = sector_l,ax=ax2)
ax2.set(ylabel = 'Average Salary ($ k/year)')
plt.title("Job No. and Average Salary of Top 10 Sectors", size = 20)
plt.show()


In [None]:
plt.figure(figsize = (18,6))
job_4 = job[job['Sector'].isin(sector_l)]
job_4 = job_4[job_4['Salary_Mean'] != -1 ]
sns.swarmplot(y = job_4['Salary_Mean'], x = job_4['Sector'], order = sector_l)
plt.title("Average Salary of Top 10 Sectors", size = 20)
plt.show()

As seen above, companies in IT, Business Services and Health Care offer higher average salaries.

* Skills

Finally, we check the general skills needed for DA jobs. We will look into the skills mentioned in:

1. Job titles
2. Job descriptions

In [None]:
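# Download stopwords and tokenizer data for NLP analysis
nltk.download('stopwords')
nltk.download('punkt')  # needed by nltk.word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
stopwords = nltk.corpus.stopwords.words('english')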

In [None]:
# Build the set of tokens to remove: stopwords, punctuation marks and digits
remove_these = set(stopwords + list(string.punctuation) + list(string.digits))

We will first check job titles:

In [None]:
# Check most frequent words in job titles after removing unwanted words
jt_lst = job[job['Job Title'] != '-1']['Job Title'].tolist()
# 'data' and 'analyst' appear in almost every title, so drop them as well
exclude_words = remove_these | {'data', 'analyst'}
word_list = []
for item in jt_lst:
    for word in nltk.word_tokenize(item):
        if word.lower() not in exclude_words:
            word_list.append(word.lower())
freq = nltk.FreqDist(word_list)
plt.figure(figsize = (10,4))
freq.plot(20)
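
For readers without the NLTK data installed, the counting step can be approximated with the standard library alone. The sketch below uses made-up titles and a plain regular expression in place of `nltk.word_tokenize`, so its tokens will not match NLTK's exactly.

```python
import re
from collections import Counter

# Hypothetical job titles standing in for the 'Job Title' column.
titles = [
    "Senior Data Analyst",
    "Data Analyst, SQL",
    "Senior Business Analyst (SQL/Tableau)",
]

# Words to ignore; 'data' and 'analyst' appear in almost every title.
ignore = {"data", "analyst"}

tokens = []
for title in titles:
    # Keep alphabetic runs only, which also drops punctuation and digits.
    for word in re.findall(r"[a-z]+", title.lower()):
        if word not in ignore:
            tokens.append(word)

title_counts = Counter(tokens)
print(title_counts.most_common(3))
```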

WordCloud for job titles:


In [None]:
# word_list is already lower-cased and filtered, so it only needs to be joined
text = " ".join(word_list)

wordcloud = WordCloud(width = 800, height = 400).generate(text)
plt.figure(figsize = (20,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

We notice that 'senior' is the most frequently mentioned word in job titles.
Next, we try to find the most frequently mentioned skills in job titles:

In [None]:
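# Requires the model to be installed: python -m spacy download en_core_web_sm
# (the 'en' shortcut and the tagger/parser keyword flags no longer work in spaCy 3)
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])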

In [None]:
# Extract named entities from all job titles, grouped by entity label
job_title_ner = {}
for item in jt_lst:
    doc = nlp(item)
    for ent in doc.ents:
        job_title_ner.setdefault(ent.label_, []).append(ent.text)


In [None]:
job_title_ner.keys()

In [None]:
collections.Counter(job_title_ner['ORG']).most_common(10)

By observation, we notice that some hard tech skills are mentioned:

* SQL
* SAP
* Python
* SAS
* BI
* ERP
* Microsoft Dynamics
* USMTF
* Excel
* DAX
* LATAM
* Tableau
* AB Testing
* EFL
* Java
* Mongo
* AWS
* VBA
* Oracle

Having identified them, we now check how often they appear in the job titles.

In [None]:
# 'usmtf' was misspelled as 'usmft' and could never match the token
skill_word_list = ['sql','sap','sas','python','erp','bi','microsoft','dynamics','usmtf','excel','dax','latam','tableau','efl','java','mongo','aws','vba','oracle']

In [None]:
skill_word_count = []
for word in word_list:
    if word in skill_word_list:
        skill_word_count.append(word)

collections.Counter(skill_word_count).most_common(10)
freq_skills = nltk.FreqDist(skill_word_count)
plt.figure(figsize = (10,4))
freq_skills.plot(10)

We can see from above that SQL, BI, SAP and Python seem to be the most commonly mentioned skills in job titles.

Then we move on to job descriptions:

In [None]:
# Check most frequent words in job descriptions after removing unwanted words
jd_lst = job[job['Job Description'] != '-1']['Job Description'].tolist()
# 'data' and 'analyst' appear in almost every description, so drop them as well
exclude_words_jd = remove_these | {'data', 'analyst'}
word_list_jd = []
for item in jd_lst:
    for word in nltk.word_tokenize(item):
        if word.lower() not in exclude_words_jd:
            word_list_jd.append(word.lower())
freq = nltk.FreqDist(word_list_jd)
plt.figure(figsize = (10,4))
freq.plot(20)

In [None]:
# word_list_jd is already lower-cased, so it only needs to be joined
text_jd = " ".join(word_list_jd)

wordcloud = WordCloud(width = 800, height = 400).generate(text_jd)
plt.figure(figsize = (20,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

'Support', 'team' and 'experience' seem to be among the most frequently mentioned words in JDs.

In [None]:
# Extract named entities from all job descriptions, grouped by entity label
job_description_ner = {}
for item in jd_lst:
    doc = nlp(item)
    for ent in doc.ents:
        job_description_ner.setdefault(ent.label_, []).append(ent.text)


In [None]:
job_description_ner.keys()

Then we check the most common skills mentioned under the tag 'ORG':

In [None]:
collections.Counter(job_description_ner['ORG']).most_common(10)

By observation, we notice some additional tech skills to add to skill_word_list:

In [None]:
# 'usmtf' was misspelled as 'usmft' and could never match the token
skill_word_list = ['sql','sap','sas','python','erp','bi','microsoft','dynamics','usmtf','excel','dax','latam','tableau','efl','java','mongo','aws','vba','oracle','spss','javascript','visio','access','git','github','python/r']

In [None]:
skill_word_count_jd = []
for word in word_list_jd:
    if word in skill_word_list:
        skill_word_count_jd.append(word)

collections.Counter(skill_word_count_jd).most_common(10)
freq_skills = nltk.FreqDist(skill_word_count_jd)
plt.figure(figsize = (10,4))
freq_skills.plot(20)

From the above analysis, we can see that SQL, Excel, Tableau, Python and BI are all very popular skills mentioned in JDs.
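
Token counts reward descriptions that repeat a skill several times and cannot match two-word skills such as "Power BI". A complementary view, sketched here on made-up descriptions, is to count how many postings mention each skill at least once; the `skills` list and sample texts are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical descriptions standing in for the 'Job Description' column.
descriptions = [
    "Strong SQL and Excel skills; SQL tuning a plus.",
    "Experience with Tableau or Power BI, plus Python.",
    "Advanced Excel and SQL required.",
]
skills = ["sql", "excel", "tableau", "power bi", "python"]

postings_per_skill = Counter()
for text in descriptions:
    lowered = text.lower()
    for skill in skills:
        # \b anchors stop 'sql' from matching inside longer words.
        if re.search(r"\b" + re.escape(skill) + r"\b", lowered):
            postings_per_skill[skill] += 1

print(postings_per_skill.most_common())
```

Each posting contributes at most one count per skill, so the result reads as "number of listings asking for X".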

We continue with other observations (education and experience requirements):

In [None]:
# Education: count how often each degree keyword appears in the job descriptions
jd_counts = collections.Counter(word_list_jd)
print('Times of GED mentioned: ' + str(jd_counts['ged']))
print('Times of Bachelor mentioned: ' + str(jd_counts['bachelor'] + jd_counts['undergraduate']))
print('Times of Master mentioned: ' + str(jd_counts['master'] + jd_counts['postgraduate']))
print('Times of Doctor mentioned: ' + str(jd_counts['dr.'] + jd_counts['dr']))

In [None]:
# Experience: keep only DATE entities that mention years
new_lst = [word for word in job_description_ner['DATE'] if 'year' in word]

# Re-classify the phrases into 1-9 years by the smallest digit (1-9) that appears in each phrase.
# Note: this is a substring match, so a phrase such as '10 years' is counted under '1'.
years = collections.Counter(new_lst).most_common()
new_year_dict = {str(d): 0 for d in range(1, 10)}
for phrase, count in years:
    for digit in new_year_dict:
        if digit in phrase:
            new_year_dict[digit] += count
            break


years_s = pd.Series(new_year_dict)

plt.figure(figsize = (10,5))
sns.barplot(x= years_s.index, y = years_s.values,  palette="ch:s=.25,rot=-.25")
plt.xlabel('Years of Experience')
plt.ylabel('Times appeared in Job Descriptions')
plt.title("Years of Experience Mentioned in Job Descriptions", size = 15)
plt.show()


Experience of 1-5 years is the most commonly mentioned requirement in job descriptions.
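
Because the bucketing above matches single digits as substrings, a phrase like "10 years" lands in the "1" bucket. One possible refinement, shown on made-up phrases, is to parse the first whole number in each phrase instead; this is a sketch of an alternative, not the method used for the chart above.

```python
import re
from collections import Counter

# Hypothetical DATE-like phrases similar to those returned by the NER step.
phrases = ["3+ years", "1-2 years", "10 years", "5 years", "at least 3 years", "last year"]

years_required = Counter()
for phrase in phrases:
    match = re.search(r"\d+", phrase)   # first whole number in the phrase
    if match:
        years_required[int(match.group())] += 1

print(sorted(years_required.items()))
```

Ranges such as "1-2 years" are still counted under their lower bound, which seems a reasonable reading of a minimum requirement.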

# Summary

We gained a general picture of the US job market for data analysts through the dataset analysis. 

Firstly, **salary-wise**, the average salary of most jobs is within 55-80k, with an overall average of 72k; companies typically give a salary range 25-45k wide, with an average of 35k.

**Geographically**, all jobs are located in 19 states; California, Texas and New York provide more than 50% of total employment. New York City and Chicago are the most significant job markets at the city level.

Almost all **top hirers** (such as Staffigo, Diverse Lynx and Kforce, with the exception of Apple) are IT staffing companies, so it is reasonable to assume that the real demand for DA jobs comes from their clients. On the other hand, job seekers may consider getting in touch with these agencies to join their talent pools.

Finally, job titles and descriptions show that some **hard tech skills** are popular, such as SQL, Tableau, Power BI and Excel, so job seekers could consider honing these skills beforehand. Besides, a bachelor's degree and 1-5 years of experience are generally preferred as well.

