# 1. Importing the relevant libraries and data

Let us import all the relevant libraries and the data on which we will perform EDA.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

In [None]:
df=pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv')
df.head()

# 2. Data Cleaning

At the very start, we would like to check for the presence of any null values in our data. To get a quick glance, we can use a heatmap that will give us a good idea of what features require extra attention with respect to missing values.

In [None]:
sns.heatmap(df.isnull(),cmap='gnuplot')

Great! It looks like there are no missing values in the dataframe, or only a negligible number. However, instead of a null value, missing data may have been recorded with keywords such as 'NA' or 'null'. We shall check for those later in our data cleaning operation.
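
As a rough sketch of what such a keyword check could look like, the cell below counts suspected placeholder strings per column. The toy dataframe and the list of placeholders are assumptions made for illustration, not values taken from the real file.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    'Easy Apply': ['True', '-1', '-1'],
    'Sector': ['Finance', 'NA', 'null'],
})

# Placeholder strings we suspect (an assumption, extend as needed).
placeholders = ['NA', 'null', '-1']

# isin() marks exact matches cell by cell; sum() counts them per column.
placeholder_counts = toy.isin(placeholders).sum()
print(placeholder_counts)
```

A non-zero count for a column suggests that its "real" number of missing values is higher than `isnull()` reports.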

Let us confirm that there aren't any null values in the dataframe.

In [None]:
df.isna().any()

Oh wow! There seem to be a few missing values in the company name column. However, since our analysis won't make much use of the company name, we aren't bothered by it. Still, let us check the number of missing values for curiosity's sake.

In [None]:
df['Company Name'].isna().value_counts()

So, there is just a single missing company name. We can safely ignore it, since we shall be dropping the Company Name column anyway.

Next, we shall check the various data types that we will be dealing with. 

In [None]:
df.info()

## Droppable columns

As we look at the various columns, we see that not all the columns are important to us. These can be dropped immediately to make the data less cluttered. Let's see which columns may be removed.

In [None]:
unn_col=['Unnamed: 0','Job Description','Company Name']
for cols in unn_col:
    df.drop(cols,axis=1,inplace=True)
df.head()

## Salary Estimate

Looking at the data types above, we see that there is an issue with the salary estimate. Since the estimate is given as a range stored in a single string, we need to split its lower and upper bounds into a min salary and a max salary column.

Let us solve the salary range issue now.


**But wait! Something is odd about the data: one of the entries is -1, which makes no sense as a salary range. Let us assume that -1 stands for a missing value and replace it with the mode of the column.**



In [None]:
df['Salary Estimate'].mode()[0]

In [None]:
df['Salary Estimate']=df['Salary Estimate'].replace('-1',df['Salary Estimate'].mode()[0])
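
The same idea can be tried out on a small made-up series. This is only a sketch: the sample strings are assumptions about the format, and the mode is computed over the genuine entries so that the '-1' placeholder itself can never be picked as the most common value.

```python
import pandas as pd

# Made-up sample values in the assumed 'range (source)' format.
salary = pd.Series([
    '$41K-$78K (Glassdoor est.)',
    '-1',
    '$41K-$78K (Glassdoor est.)',
    '$37K-$66K (Glassdoor est.)',
])

# Compute the mode over the genuine entries only.
most_common = salary[salary != '-1'].mode()[0]

# Replace the placeholder with that most common range.
cleaned = salary.replace('-1', most_common)
print(cleaned.tolist())
```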

Let us first remove the source of the salary listing i.e. Glassdoor.

**split(separator,max_splits)**

In [None]:
# Keep only the text before the first '(' ; n=1 limits the split to one cut
df['Salary Estimate']=df['Salary Estimate'].str.split('(',n=1).str[0]

In [None]:
df['Salary Estimate']
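
To see what the split limit does, here is a small sketch on made-up strings (the second one deliberately contains two brackets, which is an assumption and may never occur in the real data).

```python
import pandas as pd

s = pd.Series([
    '$37K-$66K (Glassdoor est.)',
    '$30K-$50K (Employer est.) (hourly)',
])

# Without a limit, every '(' produces a new piece.
all_pieces = s.str.split('(')

# With n=1, only the first '(' is used, so each row yields at most two pieces.
two_pieces = s.str.split('(', n=1)

print(all_pieces.str.len().tolist())
print(two_pieces.str.len().tolist())
print(two_pieces.str[0].tolist())
```

Taking `.str[0]` of the limited split keeps everything before the first bracket, which is the salary range followed by a trailing space.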

In [None]:
# expand=True returns one column per piece: the lower and the upper bound
df[['Min_Salary','Max_Salary']]=df['Salary Estimate'].str.split('-',n=1,expand=True)

In [None]:
df.head()

As we can see, we have separated the salaries into minimum and maximum salary columns. However, there is still some unnecessary text in the columns (the '$' sign and the 'K' suffix) which we need to clean. Let us do that using the strip functions. In addition, it is important to change the datatype of the salaries from string to int, so let us do that as well.

In [None]:
df['Min_Salary']=df['Min_Salary'].str.strip(' ').str.lstrip('$').str.rstrip('K').fillna(0).astype('int')
df['Max_Salary']=df['Max_Salary'].str.strip(' ').str.lstrip('$').str.rstrip('K').fillna(0).astype('int')

In [None]:
df.head()

In [None]:
df.info()

As we can see, we have successfully cleaned the data for max salary and min salary in their integer forms. 
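
As an alternative to chaining strip calls, the two bounds could also be pulled out with a single regular expression. This is only a sketch on made-up strings, and the pattern assumes every entry looks like `$<digits>K-$<digits>K`.

```python
import pandas as pd

s = pd.Series(['$37K-$66K ', '$46K-$87K '])

# Two capture groups: the digits after each '$'. expand=True yields two columns.
bounds = s.str.extract(r'\$(\d+)K\s*-\s*\$(\d+)K', expand=True).astype(int)
bounds.columns = ['Min_Salary', 'Max_Salary']
print(bounds)
```

Rows that do not match the pattern would come back as NaN, in which case the `astype(int)` step fails loudly instead of silently producing wrong numbers.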

## Missing Values

Although we initially found almost no missing values, a closer look at the dataframe shows that -1 has been entered in place of null values. Hence, we can replace every -1 with a null value. This will let us fill the nulls later with suitable values such as the mean or mode.

Let us now create the heatmap again that will give us an idea of the number of missing values in each column. 


In [None]:
df.replace('-1',np.nan,inplace=True)
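
One thing to keep in mind is that replacing the string '-1' only touches text columns; a numeric column holding the number -1 is left as it is. The sketch below shows the difference on a made-up two-column frame. The Rating cell later in this notebook relies on -1 still being present, so treat the second variant as something to consider rather than a drop-in change.

```python
import numpy as np
import pandas as pd

# Made-up frame: one text column and one numeric column, both using -1 as a placeholder.
toy = pd.DataFrame({'Easy Apply': ['True', '-1'], 'Rating': [3.9, -1.0]})

# Only the string '-1' is replaced; the numeric -1.0 survives.
only_strings = toy.replace('-1', np.nan)

# Passing both the string and the number replaces the placeholder in both columns.
both = toy.replace(['-1', -1], np.nan)

print(only_strings.isna().sum())
print(both.isna().sum())
```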

In [None]:
df.head()

In [None]:
sns.heatmap(df.isna(),cmap='viridis')

Oh boy! There is an extremely high number of missing values in the Competitors and Easy Apply columns, and a considerable number in a few more columns.

Let us check the number of missing values.

In [None]:
miss_values=[]
def check_null(df):
    # Print and record the number of missing values for every column that has any
    miss_values.clear()  # start fresh so that re-running the cell does not duplicate entries
    for col in df.columns:
        n_missing=df[col].isna().sum()
        if n_missing>0:
            print('Missing values in {} : {} '.format(col,n_missing))
            miss_values.append(n_missing)


In [None]:
check_null(df)

In [None]:
miss_val_arr=np.array(miss_values)

In [None]:
miss_val_arr

Let us make this into a more readable dataframe.

In [None]:
null_cols=[]
for i in range(df.columns.shape[0]):
    if df.iloc[:,i].isnull().any():
        null_cols.append(df.columns[i])
null_arr=np.array(null_cols)

In [None]:
miss_val_arr

In [None]:
miss_val=pd.DataFrame(null_arr)
miss_val.rename(columns={0:'Column name'},inplace=True)
miss_val['Missing values']=miss_val_arr
miss_val['Percentage missing (%)']=np.round(100* miss_val['Missing values']/df.shape[0],1)
miss_val
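
For reference, the same kind of table can be built more compactly from `isna().sum()`. This is a sketch on a made-up frame, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two of its three columns.
toy = pd.DataFrame({
    'Easy Apply': ['True', np.nan, np.nan, np.nan],
    'Competitors': ['A, B', np.nan, 'C', np.nan],
    'Location': ['NY', 'SF', 'NY', 'LA'],
})

# Count missing values per column and turn the counts into percentages.
counts = toy.isna().sum()
summary = pd.DataFrame({
    'Missing values': counts,
    'Percentage missing (%)': (100 * counts / len(toy)).round(1),
})

# Keep only the columns that actually have gaps.
summary = summary[summary['Missing values'] > 0]
print(summary)
```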

There we have it! As we can see, 96.45% of the values in Easy Apply and 76.88% of the values in Competitors are missing. These shares are quite high.

Easy Apply could be an indication that applications are currently open for the particular role in that company. If so, only about 3.6% of the listings are open to hire.

Regarding the Competitors column, I believe we can consider it unimportant since it doesn't provide us with any insights. Hence, we shall drop this column too.

In [None]:
df.drop('Competitors',axis=1,inplace=True)

In order to make the Easy Apply data more readable, let us fill all the null values with false. This shall clearly indicate that the value is false and the company isn't hiring at the moment.

In [None]:
df['Easy Apply']=df['Easy Apply'].fillna('False')
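
Note that the cell above stores the text 'False', not a boolean. If real booleans were preferred, one possible variant is sketched below on a made-up series; the later cells in this notebook filter on the string 'True', so they would need adjusting if this were adopted.

```python
import numpy as np
import pandas as pd

# Made-up column: 'True' where the flag is set, NaN everywhere else.
easy_apply = pd.Series(['True', np.nan, 'True', np.nan])

# Comparing with the string 'True' gives real booleans; NaN compares as False.
is_easy_apply = easy_apply.eq('True')

print(is_easy_apply.tolist())
print(is_easy_apply.mean())  # share of listings with the flag set
```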

With this, I believe we have cleaned the data just enough for us to perform some insightful data visualisations.

# 3. Data Visualisation

Alright, first off, let's check how the jobs pay.

## Salary Estimate

Let us see which salary ranges are most commonly offered to Data Analysts in the United States.

In [None]:
df['Count']=1
df_salaries=df.groupby('Salary Estimate')['Count'].sum().reset_index()

In [None]:
df_salaries.sort_values(by='Count',ascending=False,inplace=True)
df_salaries.head()
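
The helper column plus groupby above does the job. For comparison, `value_counts()` gives the same frequency ranking in one step, as this sketch on made-up values shows.

```python
import pandas as pd

# Made-up salary ranges, one of them repeated.
toy = pd.DataFrame({'Salary Estimate': ['$41K-$78K ', '$37K-$66K ', '$41K-$78K ']})

# value_counts() counts each distinct value and sorts by frequency, descending.
counts = toy['Salary Estimate'].value_counts()
print(counts)
```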

In [None]:

sns.catplot(x='Salary Estimate',y='Count',data=df_salaries,height=5,aspect=3)
plt.xticks(rotation=90)

As we can see, the plot is extremely cluttered. Still, it shows that the largest number of jobs pay Data Analysts in the range of $41K-$78K a year.

In order to make the data less cluttered, we shall consider the data for only the top 10 salary values. 

In [None]:
df_salaries_top=df_salaries.head(10)
plt.figure(figsize=(10,8))
plt.bar(df_salaries_top['Salary Estimate'],df_salaries_top['Count'],color=['red','blue','green','orange','brown','purple'])
plt.xticks(rotation=45)
plt.xlabel('Salary estimates',size=15)
plt.ylabel('Number of jobs',size=15)
plt.title('Top 10 salary estimates',size=20)


Let us see how the minimum and maximum salaries are exactly distributed through the distplots below.

In [None]:
fig1=plt.figure(figsize=(10,5))
ax1=fig1.add_subplot(121)
g=sns.distplot(df['Min_Salary'],ax=ax1,color='green')
ax1.set_xlabel('Minimum Salary \n Median min salary:$ {0:.1f} K'.format(df['Min_Salary'].median()))
l1=g.axvline(df['Min_Salary'].median(),color='red')

ax2=fig1.add_subplot(122)
h=sns.distplot(df['Max_Salary'],ax=ax2,color='Red')
ax2.set_xlabel('Maximum Salary \n Median max salary:$ {0:.1f} K'.format(df['Max_Salary'].median()))
l2=h.axvline(df['Max_Salary'].median(),color='Blue')


As we can see, the minimum and max salaries are distributed differently. The vertical red and blue lines in the above distplots show the median salary in each section.

Let us in fact superimpose the two distplots to understand their difference.

In [None]:
fig2=plt.figure(figsize=(20,10))
ax1=fig2.add_subplot(111)

#Plots


g=sns.distplot(df['Min_Salary'],ax=ax1,color='green',label='Minimum salary')
h=sns.distplot(df['Max_Salary'],ax=ax1,color='Red',label='Maximum salary')

# Vertical median lines
l1=g.axvline(df['Min_Salary'].median(),color='black',label='Median min salary')
l2=h.axvline(df['Max_Salary'].median(),color='Blue',label='Median max salary')

#Font descriptions

plt.xlabel('Salary distribution',size=20)
plt.title('Min/Max salary distribution',size=20)

#Legend box
plt.legend(fontsize='x-large', title_fontsize='40')

## Current job openings

Current job openings can be understood through the Easy Apply column. As we established earlier, very few listings (about 3.6%) are currently open to recruit. Let us check which industries and cities are reporting the most job openings right now.

In [None]:
df_ea_ind=df[df['Easy Apply']=='True']
df_ea_ind_grouped=df_ea_ind.groupby('Industry')['Easy Apply'].count().reset_index()

In [None]:
df_ea_ind_grouped.sort_values(by='Easy Apply',ascending=False,inplace=True)
df_ea_ind_grouped

In [None]:

sns.catplot(x='Industry',y='Easy Apply',data=df_ea_ind_grouped,kind='bar',height=10,aspect=2)
plt.xticks(rotation=90,size=15)
plt.ylabel('Job openings',size=20)
plt.xlabel('Industry',size=20)
plt.title('Current job openings in the industry',size=25)
ticks=np.arange(20)
plt.yticks(ticks,fontsize=15)

As we can see from the above barplot, Staffing & Outsourcing has the highest number of vacancies (about 18).

Let us now check which cities are showing the highest number of job postings.

In [None]:
df_ea_loc=df_ea_ind.groupby('Location')['Easy Apply'].count().reset_index()
df_ea_loc

In [None]:
df_ea_loc.sort_values('Easy Apply',ascending=False,inplace=True)

In [None]:
sns.catplot(x='Location',y='Easy Apply',data=df_ea_loc,kind='bar',height=10,aspect=2,palette='summer')
plt.xticks(rotation=90,size=15)
plt.ylabel('Job openings',size=20)
plt.xlabel('Location',size=20)
plt.title('Current job openings in different locations',size=25)
ticks=np.arange(20)
plt.yticks(ticks,fontsize=15)

From the above barplot, we can see that NY, SF and Chicago are the top 3 cities by number of job postings.

## Location wise mean salaries

Let us check how the mean salaries vary across cities. We will try to find the top 10 locations with the highest mean maximum salaries.

In [None]:
df_salaries=df.groupby('Location')[['Min_Salary','Max_Salary']].mean()

In [None]:
df_salaries.sort_values(by='Max_Salary',ascending=False,inplace=True)
df_salaries=df_salaries.head(10)
df_salaries
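
A ranking by mean salary can be dominated by locations with very few postings, so it may help to look at the group size next to the means. The following is a sketch on made-up numbers; the `Postings` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Made-up postings: location A has two rows, B and C have one each.
toy = pd.DataFrame({
    'Location': ['A', 'A', 'B', 'C'],
    'Min_Salary': [40, 60, 30, 70],
    'Max_Salary': [80, 100, 50, 120],
})

# Mean bounds per location, plus the number of postings behind each mean.
summary = toy.groupby('Location').agg(
    Min_Salary=('Min_Salary', 'mean'),
    Max_Salary=('Max_Salary', 'mean'),
    Postings=('Max_Salary', 'size'),
)

# nlargest() sorts by the given column and keeps the top rows in one step.
top = summary.nlargest(2, 'Max_Salary')
print(top)
```

In this toy example the top location rests on a single posting, which is exactly the situation where the mean says little about the location as a whole.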

Here, we have the top 10 cities with highest mean maximum salaries. Let us depict this through a visualisation.

In [None]:
fig3=go.Figure()
fig3.add_trace(go.Bar(x=df_salaries.index,y=df_salaries['Min_Salary'],name='Minimum Salary'))
fig3.add_trace(go.Bar(x=df_salaries.index,y=df_salaries['Max_Salary'],name='Maximum Salary'))

fig3.update_layout(title='Top 10 cities with mean minimum and maximum salaries',barmode='stack')

From the above stacked graph, we see that Newark provides the highest mean maximum salary to its Data Analysts. It also provides the second highest mean minimum salary among the top 10 destinations for Data Analysts. Hence, Newark can be said to be the most ideal location for Data Analysts.


## Industry wise mean salaries

Let us check the mean maximum and minimum salaries. We shall find the top 10 industries providing the highest mean maximum salaries to the data analysts.

In [None]:
df_sal_ind=df.groupby('Industry')[['Min_Salary','Max_Salary']].mean()

In [None]:

df_sal_ind=df_sal_ind.sort_values('Max_Salary',ascending=False)

In [None]:
df_sal_ind=df_sal_ind.head(10)

In [None]:
fig4=go.Figure()
fig4.add_trace(go.Bar(x=df_sal_ind.index,y=df_sal_ind['Min_Salary'],name='Minimum Salary'))
fig4.add_trace(go.Bar(x=df_sal_ind.index,y=df_sal_ind['Max_Salary'],name='Maximum Salary'))

fig4.update_layout(title='Top 10 Industries with mean minimum and maximum salaries in $',barmode='stack')

## Job Ratings

Job ratings might be an important indicator for job seekers. Let us check how the jobs are distributed across ratings. We will visualise the 10 most common job ratings.

In [None]:
df_rate=df.groupby('Rating')['Count'].sum().reset_index()
df_rate.sort_values(by='Count',ascending=False,inplace=True)

In [None]:
df_rate=df_rate[df_rate['Rating']!=-1].head(10)  #Drop the -1 placeholder (missing rating), then keep the 10 most common ratings

In [None]:

sns.catplot(x='Rating',y='Count',data=df_rate,kind='bar',palette='winter',height=5,aspect=2)

We can see that the largest number of jobs have a rating of 3.9. A fair number of jobs also have a rating of 5.

