**This dataset contains data analyst job postings collected from the original sources. It includes the job role, location, salary, and company information, along with other details.**

**We analyze the data to gain insight into data analyst job trends and to help potential job seekers identify prospective employers.**

The following are the questions we try to answer:
1. Which US state has the most job opportunities?
2. What are the average minimum and maximum salaries in the US?
3. What are the top 10 data analyst job roles in demand?
4. Which companies have the most open positions (top 10)?
5. Which company size is hiring the most data analysts?
6. Which sector needs the most data analysts?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly as py
import plotly.graph_objs as go
import plotly.express as px

%matplotlib inline

print("Python libraries loaded successfully")

In [None]:
# Read the data analyst jobs CSV file and display the first 20 records
df = pd.read_csv("../input/data-analyst-jobs/DataAnalyst.csv")
df.head(20)

From the data loaded above, we can see that:
1. the file already has an index column (Unnamed: 0)
2. missing values are encoded as -1

So when reading the file, we can use Unnamed: 0 as the index and convert the -1 placeholders to NaN, the standard pandas missing-value marker.

In [None]:
df = pd.read_csv("../input/data-analyst-jobs/DataAnalyst.csv",index_col="Unnamed: 0",
                 na_values=[-1,-1.0,"-1","-1.0"] )
df.head(20)

In [None]:
# Get high level information about the dataframe columns
df.info()

The `info()` output confirms that the Competitors and Easy Apply columns have no data for a large share of the records. It would be difficult to derive anything useful from them, so we drop those columns.

In [None]:
df.drop(columns=["Competitors","Easy Apply"],inplace=True)
df.head(5)
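
The decision to drop a column can be backed by the share of missing values per column. The sketch below is illustrative only: `sample_csv` is a hypothetical three-row stand-in for the real file, and the column names in it are chosen to mirror the ones discussed above.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the layout described above:
# an unnamed index column and -1 used as the missing-value placeholder
sample_csv = io.StringIO(
    ",Job Title,Rating,Easy Apply\n"
    "0,Data Analyst,3.2,True\n"
    "1,Senior Data Analyst,-1.0,-1\n"
    "2,Data Analyst,4.1,-1\n"
)
sample = pd.read_csv(sample_csv, index_col=0, na_values=[-1, -1.0, "-1", "-1.0"])

# Fraction of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `df` before the drop would show how sparse each column is, which may make the cutoff for dropping easier to justify.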

The Job Title column holds the job role and the domain information, separated by a comma (,). We split on the comma and keep only the role in the Job Title column.

In [None]:
df["Job Title"] = df["Job Title"].str.split(",", expand=True)[0]
df["Job Title"].value_counts()

Some of the job titles are not in a standard format. For example, we have Senior Data Analyst, Sr. Data Analyst, etc.

In [None]:
# Assign the result back rather than calling replace(..., inplace=True) on a
# selected column, which may leave df unchanged under pandas copy-on-write
df["Job Title"] = df["Job Title"].replace(
    to_replace=["Sr. Data Analyst", "Sr Data Analyst",
                "Senior Analysts", "Sr Analyst", "Sr. Analyst",
                "Senior Contract Data Analyst", "Data Analyst Senior",
                "Senior Analyst", "SENIOR ANALYST"],
    value="Senior Data Analyst")
df["Job Title"] = df["Job Title"].replace(
    to_replace=["Jr. Data Analyst", "Jr Data Analyst", "Junior Analysts",
                "Jr Analyst", "Jr. Analyst", "Junior Contract Data Analyst",
                "Data Analyst Junior", "JUNIOR ANALYST"],
    value="Junior Data Analyst")
df["Job Title"] = df["Job Title"].replace(to_replace=["Analyst", "Data analyst"],
                                          value="Data Analyst")
df["Job Title"].value_counts()

Now we have standardized job titles.
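
An enumerated list only catches the spellings that were noticed. As a rough alternative, the titles could be normalized with a small token-based rule. The function `normalize_title` and the word sets below are hypothetical, and the rule is slightly broader than the list above (for example, it would also turn "Contract Data Analyst" into "Data Analyst"), so treat it as a sketch rather than a drop-in replacement.

```python
import re
import pandas as pd

SENIOR = {"sr", "senior"}
JUNIOR = {"jr", "junior"}
FILLER = {"data", "contract"}  # words ignored when checking for a plain analyst role

def normalize_title(title):
    """Collapse senior/junior spelling variants of plain analyst roles (illustrative rules)."""
    tokens = re.findall(r"[a-z]+", title.lower())
    core = [t for t in tokens if t not in SENIOR | JUNIOR | FILLER]
    if core not in (["analyst"], ["analysts"]):
        return title  # anything other than a plain analyst role is left untouched
    if SENIOR & set(tokens):
        return "Senior Data Analyst"
    if JUNIOR & set(tokens):
        return "Junior Data Analyst"
    return "Data Analyst"

titles = pd.Series(["Sr. Data Analyst", "SENIOR ANALYST", "Data Analyst Junior",
                    "Analyst", "Senior Business Analyst"])
normalized = titles.map(normalize_title)
print(normalized.value_counts())
```

Titles such as "Senior Business Analyst" pass through unchanged because their remaining words are not just "analyst".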

The Salary Estimate column holds the min-max salary range for each job posting. We extract the minimum and maximum salary so that we can easily get the estimated salary by location or job.

In [None]:
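# Split the range on "-" and pull the digits out of each half (thousands of USD)
salary_parts = df["Salary Estimate"].str.split("-", expand=True)
df.insert(2, "Min_Salary_USD_K", salary_parts[0].str.extract(r"(\d+)", expand=False))
df.insert(3, "Max_Salary_USD_K", salary_parts[1].str.extract(r"(\d+)", expand=False))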
df.head()
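
The same extraction can be sketched with a single regular expression that captures both bounds at once. The sample strings below are hypothetical and assume the estimates look like `$37K-$66K (Glassdoor est.)`; the named groups simply reuse the column names from the cell above.

```python
import pandas as pd

# Hypothetical salary strings in the assumed "$<min>K-$<max>K (...)" shape
estimates = pd.Series(["$37K-$66K (Glassdoor est.)",
                       "$110K-$190K (Glassdoor est.)",
                       None])

# First run of digits is the minimum, the next run of digits is the maximum
bounds = estimates.str.extract(r"(?P<Min_Salary_USD_K>\d+)\D+(?P<Max_Salary_USD_K>\d+)")

# The captured groups are strings, so convert them to numbers (missing rows stay NaN)
bounds = bounds.apply(pd.to_numeric)
print(bounds)
```

Because the groups are named, `str.extract` returns a dataframe whose columns already carry the target names.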


In [None]:
df.info()

The minimum and maximum salary columns have dtype object, and there is 1 missing value.
We fill the missing value with 0 and convert both columns to int.

In [None]:
df[["Min_Salary_USD_K","Max_Salary_USD_K"]] = df[["Min_Salary_USD_K","Max_Salary_USD_K"]].fillna(0)
df[["Min_Salary_USD_K","Max_Salary_USD_K"]] = df[["Min_Salary_USD_K","Max_Salary_USD_K"]].astype(int)
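
Filling with 0 keeps the columns as plain integers, but the placeholder then counts as a real salary in any later average. One possible alternative, sketched here on hypothetical values, is the pandas nullable `Int64` dtype, which keeps the gap as `<NA>` so that aggregations skip it.

```python
import pandas as pd

# Hypothetical extracted salary strings with one missing row
raw = pd.DataFrame({"Min_Salary_USD_K": ["37", "46", None],
                    "Max_Salary_USD_K": ["66", "87", None]})
numeric = raw.apply(pd.to_numeric)

# Approach used above: replace the gap with 0, then cast to int
filled = numeric.fillna(0).astype(int)

# Alternative: nullable integers keep the gap as <NA>
nullable = numeric.astype("Int64")

print(filled["Min_Salary_USD_K"].mean())    # the 0 is included in the mean
print(nullable["Min_Salary_USD_K"].mean())  # the missing row is skipped
```

With a single missing row out of a large dataset the difference is small, so either choice is defensible here.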

In [None]:
df.info()

The Company Name column also has the rating appended at the end. Since we have a separate column for the rating, we can remove it from Company Name.

In [None]:
df["Company Name"] = df["Company Name"].str.split("\n", expand=True)[0]
df["Company Name"]

Next, the Location column holds both the city and the US state code, separated by a comma. We split them into two separate columns in the dataframe so we can analyze the data by city and by state.

In [None]:
df.insert(8,"City",df["Location"].str.split(",", expand=True)[0].str.strip())
df.insert(9,"State",df["Location"].str.split(",", expand=True)[1].str.strip())
df.head()

In [None]:
df["State"].value_counts()
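
Taking the second piece of the split assumes exactly one comma per location. If some rows carry an extra part between city and state (assumed here to look like `City, County, ST`; the sample values are hypothetical), taking the last piece instead keeps the state code intact.

```python
import pandas as pd

# Hypothetical locations, one of them with a county between city and state
locations = pd.Series(["Denver, CO",
                       "Greenwood Village, Arapahoe, CO",
                       "New York, NY"])

# City is the first piece; state is whatever follows the last comma
city = locations.str.split(",").str[0].str.strip()
state = locations.str.rsplit(",", n=1).str[-1].str.strip()
print(pd.DataFrame({"City": city, "State": state}))
```

With this variant a county name never lands in the State column, so it needs no special case later.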

Replace the US state codes with the full state names so they are easier to read. Arapahoe is a county in Colorado (CO), so we map it to Colorado as well.
We create a mapping and replace the values.

In [None]:
mapping_state = {"CA" :  "California",
       "TX" :  "Texas",
       "NY" :  "New York",
       "IL" :  "Illinois",
       "PA" :  "Pennsylvania",
       "AZ" :  "Arizona",
       "CO" :  "Colorado",
       "NC" :  "North Carolina",
       "NJ" :  "New Jersey",
       "WA" :  "Washington",
       "VA" :  "Virginia",
       "OH" :  "Ohio",
       "UT" :  "Utah",
       "FL" :  "Florida",
       "IN" :  "Indiana",
       "DE" :  "Delaware",
       "GA" :  "Georgia",
       "SC" :  "South Carolina",
       "KS" :  "Kansas",
       "Arapahoe" : "Colorado"  # county in Colorado that appears in place of a state code
        }
df.State = df.State.map(mapping_state)
df["State"].value_counts()
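
`Series.map` with a dictionary turns every value that is not a key into NaN, so a code missing from the mapping would silently disappear from the counts. A quick check, sketched here with a hypothetical shortened mapping and hypothetical codes, lists such values before they are lost.

```python
import pandas as pd

# Hypothetical, shortened mapping and state codes for illustration
mapping = {"CA": "California", "TX": "Texas"}
codes = pd.Series(["CA", "TX", "NV", "CA"])

# Codes present in the data but absent from the mapping
unmapped = sorted(set(codes.dropna()) - set(mapping))
print(unmapped)

# Optionally keep the original code where no full name is known
names = codes.map(mapping).fillna(codes)
print(names.value_counts())
```

Running the same set difference on the State column before mapping would confirm whether the dictionary covers every value.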

We have standardized most of the columns of interest. Now we can analyze and visualize the data.

First, we look at which US states have the most jobs.

In [None]:
fig = px.bar(x= df["State"].value_counts().index, y= df["State"].value_counts().values, labels={"x":"State","y":'no of jobs'})
fig.update_layout(title="US state wise Data Analyst Job openings")
fig.show()

In the chart above, California tops the list with the most job opportunities. Texas, New York, Illinois and Pennsylvania complete the top 5 states.

Next, we look at the top 10 job titles that companies are looking for.

In [None]:
top_10_jobs = df["Job Title"].value_counts().head(10)
top_10_jobs

In [None]:
fig = px.bar(data_frame=top_10_jobs, x=top_10_jobs.index, y=top_10_jobs.values, title="Top 10 Job Roles",
             labels={"index":"Job Roles", "y":"No of jobs"} )
fig.show()

In [None]:
top_hiring_company = df["Company Name"].value_counts().head(10)
fig = px.pie(data_frame=top_hiring_company, names=top_hiring_company.index,values=top_hiring_company.values,
            labels={"index":"Company","values":"No.of Jobs"}, title="Top 10 hiring companies")
fig.show()

Calculate the average minimum and maximum salary in each state.

In [None]:
mean_salary = df[["State","Min_Salary_USD_K","Max_Salary_USD_K"]].groupby(by="State",
                as_index=False).mean().sort_values(by="Max_Salary_USD_K", ascending=False)
mean_salary

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(name ="Max Salary USD K", x=mean_salary["State"], y=mean_salary["Max_Salary_USD_K"]))
fig.add_trace(go.Bar(name ="Min Salary USD K", x=mean_salary["State"], y=mean_salary["Min_Salary_USD_K"]))
fig.update_layout(title="US State wise Max/Min Salary", yaxis_title="USD(K)")
fig.show()

In [None]:

cmp_size_dict ={"1 to 50 employees":1,"51 to 200 employees":2,"201 to 500 employees":3,"501 to 1000 employees":4,
         "1001 to 5000 employees":5,"5001 to 10000 employees":6,"10000+ employees":7,"Unknown":8}
company_size = df[["Size","Job Title"]].groupby(by="Size", as_index=False).count().sort_values(by="Job Title", ascending=False)
company_size["sort_company_size"] = company_size["Size"].apply(lambda cmp_size: cmp_size_dict[cmp_size])
company_size_sorted = company_size.sort_values(by="sort_company_size")
company_size_sorted
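
The helper dictionary and extra sort column can also be expressed with an ordered categorical, which lets pandas sort the size bands in their natural order. This is a sketch on hypothetical rows; `size_order` reuses the band labels from the dictionary above.

```python
import pandas as pd

# Size bands in the order they should appear on the chart
size_order = ["1 to 50 employees", "51 to 200 employees", "201 to 500 employees",
              "501 to 1000 employees", "1001 to 5000 employees",
              "5001 to 10000 employees", "10000+ employees", "Unknown"]

# Hypothetical postings, one row per job
jobs = pd.DataFrame({"Size": ["10000+ employees", "1 to 50 employees",
                              "Unknown", "1 to 50 employees"]})
jobs["Size"] = pd.Categorical(jobs["Size"], categories=size_order, ordered=True)

# Count postings per band and sort by the category order, not alphabetically
size_counts = jobs["Size"].value_counts().sort_index()
size_counts = size_counts[size_counts > 0]  # hide bands with no postings
print(size_counts)
```

A label that is not in `size_order` becomes NaN instead of raising a `KeyError`, which may or may not be the behavior wanted for unexpected values.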

In [None]:
fig = px.bar(data_frame=company_size_sorted, x=company_size_sorted["Size"],y=company_size_sorted["Job Title"])
fig.update_layout(title="Job Openings VS Company Size",xaxis_title = "Company Size",yaxis_title="No of job openings")
fig.show()

In [None]:
sector_data = df.Sector.value_counts()
fig = px.pie(data_frame=sector_data, names=sector_data.index,values=sector_data.values,
            labels={"index":"Sector","values":"No.of Jobs"}, title="Sector wise job opportunities")
fig.show()