<a id='Top'></a>
<center>
<h1><u>Glassdoor Data Analyst Job Market Exploration</u></h1>
<h3>Author: Robert Kwiatkowski</h3>
</center>

---

<!-- Start of Unsplash Embed Code - Centered (Embed code by @BirdyOz)-->
<div style="width:60%; margin: 20px 20% !important;">
    <img src="https://images.unsplash.com/photo-1599658880436-c61792e70672?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=720&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjEyMDd9" class="img-responsive img-fluid img-med" alt="person using macbook pro on black table " title="person using macbook pro on black table ">
    <div class="text-muted" style="opacity: 0.5">
        <small><a href="https://unsplash.com/photos/eveI7MOcSmw" target="_blank">Photo</a> by <a href="https://unsplash.com/@mjessier" target="_blank">@mjessier</a> on <a href="https://unsplash.com" target="_blank">Unsplash</a>, accessed 24/09/2020</small>
    </div>
</div>
<!-- End of Unsplash Embed code -->


As data analytics becomes an increasingly popular field, it is worth digging into the details of job offers to see exactly what is required from candidates. PwC has prepared a great [report](https://www.pwc.com/us/en/library/data-science-and-analytics.html) about the future of this market, which I highly recommend reading.


Here we will explore US data analyst job offers scraped from Glassdoor and answer many interesting questions, such as:
* What are the salaries of data analysts in general?
* Which state has the largest number of offers?
* What skills are required?

and many, many more...

In [None]:
# importing basic libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# **DATA READING AND CLEANING**

Data are stored in a CSV file.

In [None]:
data = pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv')

In [None]:
print("In our dataset there are: {} rows and {} columns.".format(data.shape[0],data.shape[1]))

Let's look at the general structure and form of our data.

In [None]:
data.head()

In [None]:
data.dtypes

Most of the columns are of *object* type. However, the "Founded" column contains a year, so it can be changed to an integer type, and "Easy Apply" is a *yes/no* column, so it can be changed to boolean.

In [None]:
data["Founded"] = data["Founded"].astype("Int64")
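# astype("bool") would turn every non-empty string into True, so compare with the text "True" instead
data["Easy Apply"] = data["Easy Apply"].astype(str).str.lower().eq("true")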
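
One caveat with boolean casts on text columns is worth a quick sketch. The sample below is hypothetical; the values `"True"` and `"-1"` are an assumption about how a yes/no column might be stored in the raw file.

```python
import pandas as pd

# Hypothetical sample of a yes/no column stored as text
# (the values "True" and "-1" are an assumption about the raw file).
easy_apply = pd.Series(["True", "-1", "True", "-1"], dtype=object)

# Casting to bool uses Python truthiness: every non-empty string becomes True
naive = easy_apply.astype("bool")

# Comparing against the literal text keeps the yes/no distinction
explicit = easy_apply.eq("True")

print(naive.tolist())
print(explicit.tolist())
```

If the two lists differ on your data, the explicit comparison is probably the safer choice.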

Column "Unnamed: 0" can be dropped or used as a new index. I will drop it.

In [None]:
data.drop(columns=["Unnamed: 0"], inplace=True)
data.replace([-1,-1.0,"-1"],np.nan, inplace=True)
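
The second line of the cell above swaps the `-1` placeholders for `NaN`. A minimal sketch on a hypothetical two-row frame (column names and values are made up for illustration) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with -1 used as a "missing" marker in both numeric and text form
sample = pd.DataFrame({
    "Rating": [3.5, -1.0],
    "Sector": ["Finance", "-1"],
})

# Replace every flavour of the placeholder with a proper missing value
cleaned = sample.replace([-1, -1.0, "-1"], np.nan)

# One missing value per column is now visible to pandas
print(cleaned.isna().sum())
```

After this step, functions like `mean()` or `count()` skip the missing entries instead of treating `-1` as real data.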

# **Feature Engineering and EDA**

To make our analysis more informative we will create additional columns.

First, we will check which tools are mentioned in the job descriptions. We will look for:
* [Python](https://www.python.org/) - a general purpose language
* [R](https://www.r-project.org/) - a statistical programming language
* [SQL](https://en.wikipedia.org/wiki/SQL) - Structured Query Language (the most common database querying language)
* [Excel](https://en.wikipedia.org/wiki/Microsoft_Excel) - super popular Microsoft analytics product
* [SAS](https://www.sas.com/en_us/solutions/analytics.html) - statistical software suite
* [AWS](https://aws.amazon.com/) - Amazon Web Services - cloud services provider
* [Stata](https://www.stata.com/) - statistical software package
* [Power BI](https://powerbi.microsoft.com/en-us/) - Microsoft interactive BI/visualisation software
* [Microstrategy](https://www.microstrategy.com/en) - BI/visualisation software
* [Tableau](https://www.tableau.com/) - BI/visualisation software
* [VBA](https://en.wikipedia.org/wiki/Visual_Basic_for_Applications) - Microsoft's event-driven programming language

In [None]:
data["Python"] = data["Job Description"].apply(lambda x: 1 if "Python" in x or "python" in x else 0)
# require a leading space before "R,"; a bare "R," would also match words ending in R, e.g. "HR,"
data["R"] = data["Job Description"].apply(lambda x: 1 if " R " in x or " R/" in x or " R," in x else 0)

toolset = ["Python", "R","SQL", "Excel", "SAS","AWS", "Stata", "Power BI", "Microstrategy", "Tableau", "VBA"]

for tool in toolset[2:]:
    data[tool] = data["Job Description"].apply(lambda x: 1 if tool in x else 0)
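
Substring checks such as `tool in x` are case-sensitive and can also fire inside longer words. A possible alternative, sketched here on hypothetical descriptions, is a case-insensitive regular expression with word boundaries. The helper name `mentions` is made up for this example.

```python
import re
import pandas as pd

# Hypothetical job descriptions used only for illustration
descriptions = pd.Series([
    "Experience with SQL and python required; HR, finance exposure a plus.",
    "Strong R, SAS or Stata skills.",
    "Advanced excel and Tableau.",
])

def mentions(text_column, tool):
    # \b matches only at word boundaries, so "R" does not match inside "HR"
    pattern = r"\b" + re.escape(tool) + r"\b"
    return text_column.str.contains(pattern, case=False, regex=True).astype(int)

flags = pd.DataFrame({tool: mentions(descriptions, tool)
                      for tool in ["Python", "R", "SQL", "Excel"]})
print(flags)
```

Word boundaries are not a complete answer for a one-letter name like R (for example, "R&D" would still match), so the counts remain an approximation.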

In [None]:
tools_sum = data[toolset].sum().sort_values(ascending=False).div(len(data)).mul(100)
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(x=tools_sum.index,
            y=tools_sum.values)
plt.title("DA tools in job offers")
plt.ylabel("Percentage")
plt.show()

SQL is mentioned in more than 60% of job offers. It looks like a core skill for a data analyst, so you may consider honing your SQL skills on [hackerrank](https://www.hackerrank.com/) or [leetcode](https://leetcode.com/). Moreover, here's a very good article I recommend reading: ["Top 5 SQL Analytic Functions Every Data Analyst Needs to Know"](https://towardsdatascience.com/top-5-sql-analytic-functions-every-data-analyst-needs-to-know-3f32788e4ebb).  

Excel, a very common and popular Microsoft product, takes second place (over 50%). It is still a basic analysis tool for many companies.  
Then come Python and Tableau, followed by R and SAS.

Below we'll generate some [Venn diagrams](https://en.wikipedia.org/wiki/Venn_diagram) showing how some of these tools are combined with each other.

In [None]:
from matplotlib_venn import venn2, venn2_circles, venn3, venn3_circles

# number of offers mentioning each tool
py = data["Python"].sum()
r = data["R"].sum()
sql = data["SQL"].sum()
excel = data["Excel"].sum()

# number of offers mentioning both tools of a pair (and all three languages)
py_r = ((data["Python"]==1) & (data["R"]==1)).sum()
py_sql = ((data["Python"]==1) & (data["SQL"]==1)).sum()
r_sql = ((data["R"]==1) & (data["SQL"]==1)).sum()
py_r_sql = ((data["Python"]==1) & (data["R"]==1) & (data["SQL"]==1)).sum()
py_excel = ((data["Python"]==1) & (data["Excel"]==1)).sum()

fig, axes = plt.subplots(2,2,figsize=(10,8))

# venn2 expects region sizes: (only first set, only second set, both sets)
venn2(subsets = (py - py_r, r - py_r, py_r), set_labels = ("Python", "R"), ax=axes[0][0], set_colors=('red', 'green'))
venn2_circles(subsets = (py - py_r, r - py_r, py_r), ax=axes[0][0])

venn2(subsets = (py - py_sql, sql - py_sql, py_sql), set_labels = ("Python", "SQL",), ax=axes[0][1], set_colors=('red', 'blue'))
venn2_circles(subsets = (py - py_sql, sql - py_sql, py_sql), ax=axes[0][1])

venn2(subsets = (r - r_sql, sql - r_sql, r_sql), set_labels = ("R", "SQL",), ax=axes[1][0], set_colors=('green', 'blue'))
venn2_circles(subsets = (r - r_sql, sql - r_sql, r_sql), ax=axes[1][0])

venn2(subsets = (py - py_excel, excel - py_excel, py_excel), set_labels = ("Python", "Excel",), ax=axes[1][1], set_colors=('red', 'yellow'))
venn2_circles(subsets = (py - py_excel, excel - py_excel, py_excel), ax=axes[1][1])

fig.suptitle("Venn diagrams - DA tools in job offers", size=15)

plt.show()
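
As far as I can tell from the matplotlib-venn documentation, the `subsets` argument describes the sizes of the diagram regions (only A, only B, both) rather than the set totals. The sketch below uses hypothetical sets of offer ids to show how region sizes follow from totals and intersections. It uses plain Python sets, so no plotting library is needed.

```python
# Hypothetical sets of offer ids mentioning each tool
python_offers = {1, 2, 3, 4, 5}
r_offers = {4, 5, 6}
sql_offers = {2, 3, 5, 6, 7, 8}

# Two-set diagram: (only Python, only R, both)
both = len(python_offers & r_offers)
venn2_regions = (len(python_offers) - both, len(r_offers) - both, both)

# Three-set diagram: start from the centre and peel it off the pairwise overlaps
all_three = len(python_offers & r_offers & sql_offers)
only_py_r = len(python_offers & r_offers) - all_three
only_py_sql = len(python_offers & sql_offers) - all_three
only_r_sql = len(r_offers & sql_offers) - all_three
only_py = len(python_offers) - only_py_r - only_py_sql - all_three
only_r = len(r_offers) - only_py_r - only_r_sql - all_three
only_sql = len(sql_offers) - only_py_sql - only_r_sql - all_three

# Keys follow the membership pattern (Python, R, SQL)
venn3_regions = {"100": only_py, "010": only_r, "001": only_sql,
                 "110": only_py_r, "101": only_py_sql, "011": only_r_sql,
                 "111": all_three}
print(venn2_regions)
print(venn3_regions)
```

A quick sanity check is that the region sizes add up to the number of distinct offers in the union of the sets.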

Let's see how three main languages (Python, R and SQL) combine with each other.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(8,8))

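# venn3 expects the size of each exclusive region, keyed by membership in (Python, R, SQL),
# e.g. "110" = offers mentioning Python and R but not SQL
membership = data["Python"].astype(str) + data["R"].astype(str) + data["SQL"].astype(str)
regions = membership.value_counts().to_dict()
regions.pop("000", None)  # offers mentioning none of the three are not drawn

venn3(subsets = regions,
    set_labels = ("Python", "R", "SQL"),
    ax=ax)

venn3_circles(subsets = regions, ax=ax)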
plt.show()

Now we will investigate which job titles are the most common.

In [None]:
def job_name_cleaner(cell,pos):
    try:
        value = str(cell).split(",")[pos]
        return value
    except:
        return np.nan
    
data["Job Title 1"] = data["Job Title"].apply(lambda x: job_name_cleaner(x,0))
data["Job Title 2"] = data["Job Title"].apply(lambda x: job_name_cleaner(x,1))

In [None]:
jobT_1 = data["Job Title 1"].value_counts(normalize=True).mul(100)
print("There are {} distinct job titles.".format(len(jobT_1)))

Let's plot the 20 most common job titles.

In [None]:
jobT_1 = data["Job Title 1"].value_counts(normalize=True).mul(100)

fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x=jobT_1.index[:20], 
            y=jobT_1.values[:20])
plt.ylabel("Percentage")
plt.xticks(rotation=70, ha="right")
plt.show()

The most basic name, simply "Data Analyst", is definitely the most popular. It is followed by titles with a specified level, such as "Senior Data Analyst" and "Junior Data Analyst". There are also variations of these names and levels, like "Sr. Data Analyst", "Data Analyst I", "Data Analyst III" or simply "Analyst".

Because of that, we may want to consolidate some of the names to better reflect the experience levels. For this we will create a new column.

In [None]:
import re

def experience(job):
    # \b word boundaries stop "Data Analyst I" from also matching "Data Analyst II" or "Data Analyst III"
    if re.search(r"\b(Junior Data Analyst|Jr|Data Analyst I|1)\b", job):
        return "Junior"
    if re.search(r"\b(Senior|III|Lead|Sr|3|Principal|Master)\b", job):
        return "Senior"
    return "Regular/Other"

data["Exp. Level"] = data["Job Title 1"].apply(experience)

In [None]:
jobT_2 = data["Exp. Level"].value_counts(normalize=True).mul(100)

fig, ax = plt.subplots(figsize=(12,6))
plt.pie(jobT_2.values, labels=jobT_2.index, autopct='%1.1f%%', shadow = True, startangle=90, colors=["#fccb05","#059efc","#36fa28"], textprops={"size":14})
plt.title("Job offers experience levels")
plt.show()

Over 70% of job offers have a generic title or can be treated as regular level. Many of these general-level offers contain domain-specific names, such as "Marketing Data Analyst", "Financial Data Analyst", "Data Management Analyst", etc.

Around 22% of job offers are senior level. 

Only around 7% of job offers are aimed at the junior level, which is an obvious problem for fresh graduates and anyone without previous experience in this domain.  

Now let's look at salaries in various locations and at the number of offers in each. This will require a little cleaning because salaries are stored as text with an estimated range.

In [None]:
def salary_cleaner(cell,pos):
    if cell == -1:
        return np.nan
    else:
        try:
            value = str(cell).split("K")[pos].replace("$","").replace("-","")
            return int(value)
        except:
            return np.nan
    
data["lower_salary"] = data["Salary Estimate"].apply(lambda x: salary_cleaner(x,0))
data["upper_salary"] = data["Salary Estimate"].apply(lambda x: salary_cleaner(x,1))
data["average_salary"] = (data["lower_salary"]+data["upper_salary"])/2
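
The same parsing can also be sketched with a single regular expression. This is an alternative illustration, not the method used above; the helper name `parse_salary_range` and the sample string are assumptions about a format like `$37K-$66K (Glassdoor est.)`.

```python
import re
import numpy as np

def parse_salary_range(text):
    """Return (lower, upper) in thousands of dollars, or (nan, nan) if no range is found."""
    match = re.search(r"\$(\d+)K\s*-\s*\$(\d+)K", str(text))
    if match is None:
        return (np.nan, np.nan)
    return (int(match.group(1)), int(match.group(2)))

# Hypothetical value in the assumed format
print(parse_salary_range("$37K-$66K (Glassdoor est.)"))  # (37, 66)

# Missing values fall through to NaN instead of raising
print(parse_salary_range(np.nan))
```

A pattern like this fails visibly (returns NaN) when the text does not contain a range, which makes unexpected formats easier to spot.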

In [None]:
# "Job Title" is text, so only the numeric salary columns are averaged
locations_salaries = data.groupby(["Location"])[["lower_salary","upper_salary","average_salary"]].mean().round(1)
locations_salaries["offers"] = data.groupby(["Location"])["Job Title"].count()
locations_salaries = locations_salaries.sort_values(by=["average_salary"], ascending=False)
locations_salaries.head(10)

The top of the list is occupied by cities in California. However, the city level is a bit too granular. Let's compare salaries across states instead. For this we need more data cleaning.

In [None]:
locations = data["Location"].str.split(r",",expand=True)
locations.columns = ["City","State","temp"]
locations.drop(["temp"],axis=1,inplace=True)

# dealing with the "..., Arapahoe, CO" syntax: fix only the State column and leave City untouched
locations.loc[locations["State"]==" Arapahoe", "State"] = " CO"
locations["State"] = locations["State"].str.strip()

# concatenating
data = pd.concat([data,locations],axis=1)
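
A possible way to avoid special-casing three-part locations is to take the state from the right-hand side of the string. The sketch below uses hypothetical location strings; the city name in the three-part entry is made up.

```python
import pandas as pd

# Hypothetical location strings, including one with an extra county part
locations_sample = pd.Series([
    "New York, NY",
    "Greenwood Village, Arapahoe, CO",
    "Chicago, IL",
])

# The state is whatever follows the last comma
state = locations_sample.str.rsplit(",", n=1).str[-1].str.strip()

# The city is whatever precedes the first comma
city = locations_sample.str.split(",", n=1).str[0].str.strip()

print(pd.DataFrame({"City": city, "State": state}))
```

Splitting from the right means any number of intermediate parts is ignored, at the cost of silently dropping them.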

In [None]:
states_salaries = data.groupby(["State"])[["lower_salary","upper_salary","average_salary"]].mean().round(1)
states_salaries["offers"] = data.groupby(["State"])[["average_salary"]].count()
states_salaries = states_salaries.sort_values(by=["average_salary"], ascending=False)
states_salaries
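
The mean and the offer count can also be computed in one pass with named aggregation. This is a minimal sketch on a hypothetical frame; the numbers are made up.

```python
import pandas as pd

# Hypothetical offers with one missing salary
offers = pd.DataFrame({
    "State": ["CA", "CA", "TX", "TX", "TX"],
    "average_salary": [90.0, 70.0, 60.0, None, 80.0],
})

summary = (
    offers.groupby("State")
    .agg(average_salary=("average_salary", "mean"),   # NaN values are skipped
         offers=("average_salary", "count"))          # counts non-missing salaries only
    .round(1)
    .sort_values("average_salary", ascending=False)
)
print(summary)
```

Note that `count` ignores missing salaries, so the `offers` column here means "offers with a salary estimate", the same convention as in the cell above.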

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x=states_salaries.index, 
            y=states_salaries["average_salary"])
plt.ylabel("Salary [k$]")
plt.xticks(rotation=70, ha="right")
plt.title("Average Salaries in various US states")
plt.show()

Our dataset is incomplete because it contains job offers from only 19 states. However, from the available data we observe:
* The **biggest number of job offers** is from **California** (626) and **Texas** (394)
* The **highest average salaries** are in **California** and **Illinois**
* The fewest job offers are in Kansas, South Carolina and Georgia, with fewer than 4 each
* The lowest average salaries are in Utah and Georgia

Let's create a choropleth map from what we have.

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locations=states_salaries.index, # Spatial coordinates
    z = states_salaries['average_salary'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Average salary [k$]",
))

fig.update_layout(
    title_text = 'Average salary',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

It's also worth checking which industries are looking for data analysts the most.

In [None]:
industries = data["Industry"].value_counts(normalize=True).mul(100)

plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x=industries.index[:20], 
            y=industries.values[:20])
plt.ylabel("Percentage")
plt.xticks(rotation=70, ha="right")
plt.text(15,16, "No. of industries: {}".format(len(industries)), size=15)
plt.show()

The first two places are taken by the generic "IT services" and "Staffing & Outsourcing". Looking further down the list, we can see that the following industries are hiring as well:
* Healthcare and Biotech/pharmaceutical
* Financial institutions and banks
* Advertising/marketing
* Insurance
* Colleges and Universities  

Now let's investigate the sizes of the companies we are dealing with.

In [None]:
sizes = data["Size"].value_counts(normalize=True).mul(100)

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(x=sizes.index, 
            y=sizes.values,
            order = ['1 to 50 employees', '51 to 200 employees',
                  '201 to 500 employees', '501 to 1000 employees',
                  '1001 to 5000 employees', '5001 to 10000 employees',
                  '10000+ employees', 'Unknown'])
plt.ylabel("Percentage")
plt.xticks(rotation=70, ha="right")
plt.show()

The bar chart above shows no strong trend, but job offers are often posted by smaller companies (up to 200 employees). However, big ones (1000+ employees) are hiring as well.

In [None]:
data["Company Name"] = data["Company Name"].apply(lambda x: str(x).split("\n")[0])
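
The split above keeps only the text before the first line break. Assuming the part after the break is a rating appended to the company name (an assumption about the raw file, with made-up company names below), both pieces could be recovered like this:

```python
import pandas as pd

# Hypothetical raw values: a name, optionally followed by a line break and a rating
raw_names = pd.Series(["Acme Analytics\n3.2", "Globex\n4.1", "Initech"])

parts = raw_names.str.split("\n")

# First piece is the company name
clean_name = parts.str[0]

# Second piece, when present, is parsed as a number; missing pieces become NaN
rating = pd.to_numeric(parts.str[1], errors="coerce")

print(pd.DataFrame({"Company Name": clean_name, "Rating": rating}))
```

Names without a line break pass through unchanged, so the cleaning step is safe to apply to the whole column.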

In [None]:
data["Company Name"].value_counts().head()

The top 5 companies by overall number of job offers are:
* [Staffigo](https://www.staffigo.com/) - IT staffing and recruiting company
* [Diverse Lynx](https://www.diverselynx.com/) - IT staffing and recruiting
* [Lorven Technologies Inc](https://www.lorventech.com/) - IT staffing and recruiting
* [Kforce](https://www.kforce.com/) - IT staffing and recruiting
* [Robert Half](https://www.roberthalf.com/) - IT staffing and recruiting

As you can see, staffing and recruiting agencies post quite a number of job offers on Glassdoor.  

What about the healthcare industry?

In [None]:
data[data["Industry"]=="Health Care Services & Hospitals"][["Company Name","Headquarters"]].value_counts().head()

In the Healthcare and Hospitals category:
* [Cedars-Sinai](https://www.cedars-sinai.org/) - a non-profit hospital in Los Angeles
* [NYU Langone Health](https://nyulangone.org/) - a healthcare provider
* [Kaiser Permanente](https://healthy.kaiserpermanente.org/) - one of the largest US healthcare providers
* [Mount Sinai Medical Center](https://www.mountsinai.org/) - a hospital in New York
* [IPRO](https://ipro.org/) - a non-profit health organisation

Now it's time for the Computer Hardware and Software category:

In [None]:
data[data["Industry"]=="Computer Hardware & Software"][["Company Name","Headquarters"]].value_counts().head()

Here the top companies are:
* [Apple](https://www.apple.com/) - a well known software/hardware brand
* [APN Software Services](http://www.apninc.com/) - an offshore outsourcing company
* [Microsoft Corporation](https://www.microsoft.com/de-de/) - a multinational technology corporation
* [Intuit](https://www.intuit.com/) - a company developing software for finance and accounting
* [Autodesk](https://www.autodesk.com/) - a company developing engineering software (e.g. CAD systems)

I'm still working on this notebook. If you like it, please upvote.

If you have any questions or suggestions, please write them in the comments.