# Hello, guys!
* I analyzed the Data Analyst Jobs dataset.
* First, I cleaned the data and derived some new fields from the existing ones.
* Then I performed exploratory analysis on the dataset.
* Enjoy the analysis, and upvote if you find it interesting.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [None]:
df = pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv')

In [None]:
df.head(10)

In [None]:
df.info()

# Cleaning the data & creating new fields 

Checking null values

In [None]:
msno.bar(df, color='green')
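
One caveat: `msno.bar` only counts real NaN values, while this dataset also seems to mark missing entries with a `-1` placeholder (the cleaning steps further down replace `'-1'` in several columns). A minimal sketch for counting those placeholders per column, run on a small made-up frame. The helper name `count_placeholders` and the toy values are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

def count_placeholders(frame, placeholders=("-1", "-1.0")):
    """Count, per column, the cells whose string form equals a placeholder."""
    # Comparing string forms handles text columns ('-1') and float columns (-1.0) alike
    return frame.astype(str).isin(list(placeholders)).sum()

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    "Rating": [3.5, -1.0, 4.1],
    "Industry": ["IT Services", "-1", "-1"],
})
counts = count_placeholders(toy)
print(counts)
```

On the real data this would be called as `count_placeholders(df)`, which gives a rough picture of how much information is hidden behind the placeholder.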

Splitting the salary column into a salary minimum (sal_min) and a salary maximum (sal_max)

In [None]:
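# Keep only the salary range itself (everything before the first space)
sal = df['Salary Estimate'].apply(lambda x: x.split()[0])
# Split on '-' and drop the first and last character of each side
# (presumably the currency symbol and the 'K' suffix)
df['sal_min'] = sal.apply(lambda x: x.split('-')[0][1:-1])
df['sal_max'] = sal.apply(lambda x: x.split('-')[1][1:-1])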
df[['sal_min', 'sal_max']].head()

* sal_min and sal_max could not be converted to float directly because some entries were empty strings
* We replaced those with zero and then converted both columns to float

In [None]:
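# Change the salary data type: empty strings become '0', then cast to float.
# The result is assigned back because inplace=True on a column selection
# is unreliable in recent pandas versions.
df['sal_min'] = df['sal_min'].replace('', '0').astype('float')
df['sal_max'] = df['sal_max'].replace('', '0').astype('float')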
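
Filling the blanks with zero makes the conversion work, but those zeros then count as real salaries in any sum, mean, or boxplot. A possible alternative, sketched here in plain Python, is to parse the range with a regular expression and keep missing values as missing. The assumed text format (`$37K-$66K` followed by a note) and the helper name `parse_salary` are illustrations, not something taken from the notebook:

```python
import re

# Assumed format: "$<min>K-$<max>K", optionally followed by a note in parentheses
SALARY_RE = re.compile(r"\$(\d+)K-\$(\d+)K")

def parse_salary(text):
    """Return (min, max) in thousands of dollars, or (None, None) if no range is found."""
    match = SALARY_RE.search(text)
    if match is None:
        return (None, None)
    return (float(match.group(1)), float(match.group(2)))

print(parse_salary("$37K-$66K (Glassdoor est.)"))  # (37.0, 66.0)
print(parse_salary("-1"))                          # (None, None)
```

Applied to a pandas column, the `None` values would turn into NaN, which most plotting and aggregation functions skip.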

Then we recoded the Easy Apply column as 0 and 1, and gave missing competitors a readable label
* 1 = easy apply
* 0 = not easy apply

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Easy Apply'] = df['Easy Apply'].replace({'-1': '0', 'True': '1'})
df['Competitors'] = df['Competitors'].replace('-1', 'No Competitor Found')

We extracted the state abbreviations from the Location column because plotly needs them to plot the map

In [None]:
df['Location_Abb'] = df['Location'].apply(lambda x: x.split(',')[1])
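
Taking the second comma-separated part keeps the leading space and assumes every location looks like `City, ST`. A hedged alternative is to take the last part and strip the whitespace. The helper name `state_abbreviation` and the example strings below are hypothetical:

```python
def state_abbreviation(location):
    """Take the last comma-separated part of a location and strip whitespace."""
    return location.split(",")[-1].strip()

# Illustrative strings; the three-part one is a hypothetical example
print(state_abbreviation("New York, NY"))                     # NY
print(state_abbreviation("Greenwood Village, Arapahoe, CO"))  # CO
```

This only helps when the state really is the last part of the string, so it is worth checking the unique results before plotting.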

* We also derived a new column from the company size column that holds the minimum company size
* We changed 10000+ to 10000 so that the largest bracket also has a plain number as its minimum

In [None]:
# The first token of the size label is the lower bound of the bracket
df['minimum_company_size'] = df['Size'].apply(lambda x: x.split()[0])
df['minimum_company_size'] = df['minimum_company_size'].replace('10000+', '10000')
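
After these steps the column still holds strings, so plots treat it as categories rather than as numbers. A minimal sketch of a numeric conversion, assuming that labels such as `'Unknown'` and `'-1'` should count as missing (the sample values are made up):

```python
import pandas as pd

# Made-up size labels in the style the split above would produce
sizes = pd.Series(["51", "10000", "-1", "Unknown", "201"])

# Non-numeric labels become NaN
numeric_sizes = pd.to_numeric(sizes, errors="coerce")
# Treat the negative placeholder as missing too (an assumption)
numeric_sizes = numeric_sizes.where(numeric_sizes >= 0)
print(numeric_sizes.tolist())
```

With a numeric column, the x axis of a scatter plot would be ordered by size instead of by order of appearance.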

In [None]:
df.head()

* We start the analysis by looking at which companies offer the most jobs, grouped by revenue
* Are the companies with the largest revenue hiring more?

In [None]:
df['Revenue'].unique()

In [None]:
rev = df.groupby('Revenue').count().sort_values(ascending=False, by='Job Title')
rev = rev.reset_index()
plt.figure(figsize=(14,8))
sns.barplot(x='Revenue',y='Job Title', data=rev, palette='OrRd_r')
plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)
sns.despine(left=True)

* The largest number of jobs comes from companies that have not reported revenue, so it is hard to draw a firm conclusion
* However, judging from the graph, most of the hiring seems to come from mid-size companies
* Billion-dollar companies post far fewer jobs than small and medium-sized companies making around 10-100 million in revenue
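
Because the unreported bracket dominates the chart, it can help to look at shares among the companies that did report revenue. A small sketch on made-up labels; the exact label `'Unknown / Non-Applicable'` is an assumption and should be checked against the output of `df['Revenue'].unique()`:

```python
import pandas as pd

# Made-up revenue labels; the 'Unknown / Non-Applicable' label is an assumption
revenue = pd.Series([
    "Unknown / Non-Applicable", "Unknown / Non-Applicable",
    "$10 to $25 million (USD)", "$10+ billion (USD)",
])

# Drop the unreported bracket, then compute each remaining bracket's share
known = revenue[revenue != "Unknown / Non-Applicable"]
shares = known.value_counts(normalize=True)
print(shares)
```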

* Next, we analyze the data by industry
* Since there are 89 industries in the dataset, I decided to focus on the top 15, ranked by total minimum salary
* I grouped the data by industry and drew a boxplot of the minimum salary for each

In [None]:
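# Rank industries by total minimum salary; summing only sal_min avoids errors on text columns
Ind_10 = df.groupby('Industry')['sal_min'].sum().sort_values(ascending=False).index[0:16]
filtered_df = df[df['Industry'].isin(Ind_10)]
# Blank out '-1' placeholders (they become NaN) so they are left out of the plot
filtered_df = filtered_df[filtered_df != '-1']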
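
Ranking by the sum of `sal_min` favors industries with many postings, since every extra posting adds to the total. If the goal is to compare pay levels, ranking by the median may be closer to the intent. A small sketch on made-up data showing how the two rankings can differ:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Industry": ["A", "A", "A", "B", "B", "C"],
    "sal_min": [40.0, 45.0, 50.0, 90.0, 80.0, 60.0],
})

# One row per industry with the total, the median, and the number of postings
summary = toy.groupby("Industry")["sal_min"].agg(
    total="sum", median="median", postings="count"
)
by_total = summary.sort_values("total", ascending=False).index.tolist()
by_median = summary.sort_values("median", ascending=False).index.tolist()
print(by_total)   # ['B', 'A', 'C']
print(by_median)  # ['B', 'C', 'A']
```

Industry A has the most postings, so it ranks second by total even though its median is the lowest of the three.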


In [None]:
plt.figure(figsize=(14,8))
#sns.stripplot(x='Industry', y='sal_min', data=filtered_df, palette='coolwarm')
sns.boxplot(x='Industry', y='sal_min', data=filtered_df, palette='magma')

plt.xticks(
    rotation=70, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)
sns.despine(left=True)

* Then I analyzed how companies of different sizes pay for these jobs
* It seems that companies of all sizes offer both high- and low-salary jobs
* I also added rating as the hue to see which jobs are highly rated
* Small businesses seem to offer more highly rated jobs than the big ones
* Jobs without a good rating tend to lack minimum company size information in the dataset

In [None]:
plt.figure(figsize=(14,8))
sns.scatterplot(x='minimum_company_size', y='sal_min', data=df, hue='Rating', palette='coolwarm', s=100)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
sns.despine(left=True)

* This is the same view, but colored by whether the job offers an Easy Apply option
* Clearly, the majority of jobs do not offer Easy Apply
* It is a very rare feature among data analyst job postings

In [None]:
plt.figure(figsize=(14,8))
sns.stripplot(x='minimum_company_size', y='sal_min', data=df, hue='Easy Apply', palette='magma', s=6)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
sns.despine(left=True)

* This is the same analysis as the industry one above; the difference is that here we look at sector
* I took the top 10 sectors and plotted the minimum salary they pay

In [None]:
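# Rank sectors by total minimum salary; summing only sal_min avoids errors on text columns
top_sectors = df.groupby('Sector')['sal_min'].sum().sort_values(ascending=False).index[0:11]
filtered_df_1 = df[df['Sector'].isin(top_sectors)]
# Blank out '-1' placeholders (they become NaN) so they are left out of the plot
filtered_df_1 = filtered_df_1[filtered_df_1 != '-1']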

plt.figure(figsize=(14,8))
#sns.stripplot(x='Industry', y='sal_min', data=filtered_df, color='green')
sns.boxplot(x='Sector', y='sal_min', data=filtered_df_1, palette='OrRd')

plt.xticks(
    rotation=35, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)
sns.despine(left=True)

* Then I analyzed the top 10 job titles in the job market together with their minimum salary
* It seems that the minimum offered in all these roles is at least $50,000

In [None]:
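# Rank job titles by total minimum salary; summing only sal_min avoids errors on text columns
top_titles = df.groupby('Job Title')['sal_min'].sum().sort_values(ascending=False).index[0:10]
filtered_df_2 = df[df['Job Title'].isin(top_titles)]
# Blank out '-1' placeholders (they become NaN) so they are left out of the plot
filtered_df_2 = filtered_df_2[filtered_df_2 != '-1']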

plt.figure(figsize=(14,8))
#sns.stripplot(x='Industry', y='sal_min', data=filtered_df, color='green')
sns.boxplot(x='Job Title', y='sal_min', data=filtered_df_2, palette='GnBu_r')

plt.xticks(
    rotation=35, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)
sns.despine(left=True)

Now we will plot the job counts by state and see where in America the major clusters of jobs are

In [None]:
# Plotly in offline (notebook) mode; cufflinks is switched to offline mode as well
import cufflinks as cf
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)
cf.go_offline()


* The extraction did not work for every location: at least one row ended up with ' Arapahoe' instead of a state
* Arapahoe is a county in Colorado, so we simply change it to CO
* The apply function was used because the abbreviations had surrounding spaces, which did not work with the maps

In [None]:
# Map the stray county name to its state
df['Location_Abb'] = df['Location_Abb'].replace({' Arapahoe': 'CO'})
# Strip surrounding spaces before grouping so each state ends up in a single row
df['Location_Abb'] = df['Location_Abb'].apply(lambda x: x.strip())
df_loc = df.groupby('Location_Abb').count().reset_index()

In [None]:
data = dict(type='choropleth', colorscale='RdBu_r', locationmode='USA-states',
            locations=df_loc['Location_Abb'], z=df_loc['Job Title'],
            colorbar={'title': 'Scale'}, marker=dict(line=dict(width=0)))
# Optional extras for geo: showlakes=True, lakecolor='grey'
layout = dict(title='Data Analyst Job Market!', geo=dict(scope='usa'))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)

# Hope you like it! 