# About data 

This dataset contains information on more than 2,000 Data Analyst job postings. Some of its features are:

* Job Title
* Salary Estimate 
* Company Size and rating 
* Industry 
* Revenue, etc.

# Helpful for : 

This data will be very helpful for:

1. Anyone actively looking for a data analyst job with a decent salary, mostly in the USA
2. Especially those who lost their jobs during the current pandemic
3. Students of Statistics and Computer Science


# Topics : 

Here I have tried to visualize the following topics:

1. Top rated companies with the highest salary
2. Highest paying job titles
3. Best rated companies 
4. Best place to look for jobs 
5. Location of company headquarters
6. Size of the companies 
7. Average Salary according to the size and rating 
8. Demand across the industries 
9. Most popular Job Titles 
10. Revenue of companies
11. Salary in the large companies 
12. Available jobs in different sectors 
13. Foundation Year of the companies
14. Distribution of salaries 
15. Companies in India
16. Correlation between rating and salary

Hope this will help...

## Import section

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  
sns.set(style="whitegrid")
from wordcloud import WordCloud

In [None]:
missing_value = [-1]
df = pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv', na_values = missing_value)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

# Data Cleaning Section : 

I removed the **Job Description** column, since every company writes its own job description and there is nothing useful to visualize in it.

I also removed the **Unnamed** index column.

In [None]:
df.columns = df.columns.str.replace(" " , "_")
df.drop(['Unnamed:_0', 'Job_Description'], axis = 1, inplace = True )

**Companies are categorized according to their rating:**

* 0-2 is Bad
* 2-3 is Not Bad
* 3-4 is Good
* 4-5 is Excellent

In [None]:
df["Rated"] = pd.cut(df.Rating , bins= [0 , 2.0 ,3.0, 4.0 ,5] , 
                                 labels = ['Bad' , 'Not Bad','Good', 'Excellent'])
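
To make the bin edges concrete, here is a minimal sketch of how `pd.cut` treats boundary values with these bins. The ratings below are toy values chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy ratings (illustrative values, not taken from the dataset)
ratings = pd.Series([0.0, 2.0, 2.1, 3.0, 4.5, 5.0])

rated = pd.cut(ratings, bins=[0, 2.0, 3.0, 4.0, 5],
               labels=['Bad', 'Not Bad', 'Good', 'Excellent'])

# Intervals are right-closed by default: (0, 2], (2, 3], (3, 4], (4, 5]
# so 2.0 is 'Bad', 3.0 is 'Not Bad', and 0.0 falls outside every bin (NaN)
print(rated.tolist())
```

A rating that is exactly on an edge goes to the lower category, and anything outside the outer edges is left as NaN.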

**Missing values are either dropped or filled with suitable values**

In [None]:
# Assign the filled columns back instead of calling fillna(inplace=True) on an attribute
df['Easy_Apply'] = df['Easy_Apply'].fillna(False)
df['Competitors'] = df['Competitors'].fillna(0)
# Drop rows without a salary estimate (dropna on the column alone leaves df unchanged)
df.dropna(subset = ['Salary_Estimate'], inplace = True)
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())
df['Industry'] = df['Industry'].fillna('Others')
df['Revenue'] = df['Revenue'].fillna('Unknown')
df['Founded'] = df['Founded'].fillna(df['Founded'].mean())
df['Founded'] = df['Founded'].astype(int)
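
The difference between filling and dropping can be sketched on a toy frame. Filling works column by column, while dropping has to happen on the frame itself for the rows to actually disappear. The values below are made up for illustration; only the column names match the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one gap in each column
toy = pd.DataFrame({
    'Salary_Estimate': ['$40K-$60K', np.nan, '$50K-$80K'],
    'Industry': ['IT Services', 'Banking', np.nan],
})

# Fill: compute the filled column and assign it back
toy['Industry'] = toy['Industry'].fillna('Others')

# Drop: remove rows from the frame, based on one column only
toy = toy.dropna(subset=['Salary_Estimate'])

print(toy)
```

Note that `dropna` keeps the original index labels, so the remaining rows here are still labelled 0 and 2.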

Minimum, maximum and average salary offered by the companies (in thousands of USD)

In [None]:
split_sal = df['Salary_Estimate'].str.split("-" , expand = True)
# Raw strings avoid the invalid-escape warning for \d
df["Min_Salary_US_K"] = pd.to_numeric(split_sal[0].str.extract(r'(\d+)' , expand = False))
df["Max_Salary_US_K"] = pd.to_numeric(split_sal[1].str.extract(r'(\d+)' , expand = False))
df["Average_Salary_US_K"] = (df["Min_Salary_US_K"] + df["Max_Salary_US_K"])/2
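
As a worked example of the parsing step, the sketch below runs the same split-and-extract logic on two sample strings. The sample strings and their `$<min>K-$<max>K (...)` format are an assumption about what the `Salary_Estimate` column looks like, not rows copied from the file.

```python
import pandas as pd

# Hypothetical sample strings in the assumed "$<min>K-$<max>K (...)" format
sample = pd.Series(['$37K-$66K (Glassdoor est.)',
                    '$110K-$190K (Glassdoor est.)'])

# Split once on the dash, then pull the first run of digits from each half
parts = sample.str.split('-', expand=True)
min_k = pd.to_numeric(parts[0].str.extract(r'(\d+)', expand=False))
max_k = pd.to_numeric(parts[1].str.extract(r'(\d+)', expand=False))
avg_k = (min_k + max_k) / 2

print(pd.DataFrame({'min': min_k, 'max': max_k, 'avg': avg_k}))
```

The average is simply the midpoint of the range, so the first sample gives 51.5 and the second gives 150.0.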

In [None]:
split_c_name = df['Company_Name'].str.split("\n" , expand = True)
df.Company_Name = split_c_name[0]
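
The split above keeps only the text before the first newline. A small sketch, assuming the raw company name is followed by a newline and some extra text such as a rating (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical raw values: a name followed by "\n" and extra text, and a plain name
raw = pd.Series(['Acme Analytics\n3.9', 'Example Corp'])

# Column 0 of the expanded split is everything before the first newline
name = raw.str.split('\n', expand=True)[0]
print(name.tolist())
```

Values without a newline pass through unchanged.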

States in which the **companies** are located 

In [None]:
state = df.Location.str.split(',', expand = True )
state 

States where the **company headquarters** are located

In [None]:
state_h = df.Headquarters.str.split(',', expand = True )
state_h 

In [None]:
df['Location_State'] = state[1]
df['Hq_State'] =state_h[1]
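
One detail worth knowing: splitting on a bare comma keeps the space that follows it, so the state values start with a blank. A minimal sketch, assuming `City, State` formatted strings (the values are hypothetical):

```python
import pandas as pd

# Hypothetical "City, State" values (assumed format)
loc = pd.Series(['New York, NY', 'Bengaluru, India'])

parts = loc.str.split(',', expand=True)
with_space = parts[1].tolist()              # [' NY', ' India']
stripped = parts[1].str.strip().tolist()    # ['NY', 'India']

print(with_space)
print(stripped)
```

This is why a later filter compares against `' India'` with a leading space. If the columns were stripped, that comparison would have to drop the space as well.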

**Salary categories according to the average salary**

* $0 - $50k is Poor
* $50k - $100k is Medium
* $100k - $150k is Good

Average salaries above $150k fall outside these bins and are left uncategorized (NaN).


In [None]:
df["Salary_Range"] = pd.cut(df.Average_Salary_US_K , bins= [0 , 50 ,100, 150] , 
                                 labels = ['Poor' , 'Medium','Good'])

In [None]:
df.head()

# Visualization :

# 1. Top rated companies with the highest salary :

These are the companies offering the highest maximum salary, $190k.

A few rows with NaN values are dropped here.

In [None]:
df.sort_values(by = 'Max_Salary_US_K', ascending = False).head()

In [None]:
# .copy() avoids SettingWithCopyWarning when the subset is modified in place below
top_sal = df[df['Max_Salary_US_K'] == 190.0].copy()
top_sal.sort_values(by = 'Rating', inplace = True)

In [None]:
top_sal.tail()

In [None]:
top_sal.drop([1479,1493], inplace = True)
# these rows contain NaN values

In [None]:
top_sal.Company_Name.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.barplot( x='Company_Name',y='Rating', data = top_sal,
            dodge=False, hue = 'Rated').set_title('Companies having the highest salary');
plt.xticks(rotation=80);
plt.xlabel('Company Name');

# 2. Highest paying job titles : 

In [None]:
top_sal.Job_Title.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(13,7);
sns.barplot( y='Job_Title',x='Rating', data = top_sal,
            dodge=False, ci= None, palette= 'Reds', orient = 'h').set_title('Job Title with the highest salary');
plt.xticks(rotation=80);
# The x axis shows the rating; job titles are on the y axis
plt.xlabel('Rating');
plt.ylabel('Job Title');

# Word Art :

Word art of the available job titles

In [None]:
wordCloud = WordCloud(background_color='white',max_font_size = 50).generate(' '.join(df['Job_Title']))
plt.figure(figsize=(15,7))
plt.axis('off')
plt.imshow(wordCloud)
plt.show()

# 3. Best rated companies :

Only companies with a 5-star rating are shown.

In [None]:
top_rated = df[df['Rating'] == 5.0]
top_rated = top_rated.sort_values(by= 'Average_Salary_US_K')

In [None]:
top_rated.Company_Name.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(25,7);
sns.barplot( x='Company_Name',y='Average_Salary_US_K',
            data = top_rated, palette = 'copper',ci=None).set_title('Companies having the 5.0 rating');
plt.xticks(rotation=90);
plt.xlabel('Company Name');
plt.ylabel('Average Salary');

# 4. Where most of the companies are located :


In [None]:
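c = df.groupby('Location_State').count()
# .copy() so the 'Others' row can be added without a SettingWithCopyWarning
loc_st = c[c['Job_Title'] > 20].copy()
# Note: a state with exactly 20 postings falls into neither group
others = c[c['Job_Title'] < 20]
loc_st.loc['Others'] = others.sum()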

In [None]:
loc_st.reset_index(inplace = True)
loc_st.sort_values(by = 'Job_Title', inplace = True)
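
The idea of collapsing small groups into a single "Others" entry can also be sketched with `value_counts`. This is an alternative formulation on made-up labels, not the notebook's own method. The threshold of 3 is a hypothetical stand-in for the cut-off of 20 used above, and here every group lands on one side of the threshold or the other.

```python
import pandas as pd

# Made-up state labels, for illustration only
states = pd.Series(['CA'] * 5 + ['NY'] * 4 + ['TX'] * 2 + ['UT'] * 1)

counts = states.value_counts()
threshold = 3  # hypothetical cut-off; the notebook uses 20

large = counts[counts >= threshold]
others_total = counts[counts < threshold].sum()

# Append the combined small groups as a single 'Others' entry
summary = pd.concat([large, pd.Series({'Others': others_total})])
print(summary)
```

Because the two masks are complementary, the summary still adds up to the total number of rows.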

In [None]:
state[[0,1]]

New York is pulled out of the pie because it is a popular location, even though it is not the state with the highest number of available jobs.

The Others slice is pulled out as well.

**Others is the sum of the states with fewer than 20 job postings**

* The Others slice is very small, so most of the jobs are in the states shown here

In [None]:
fig = plt.gcf();
fig.set_size_inches(20,10);
plt.pie(loc_st.Job_Title, labels = loc_st.Location_State,
        wedgeprops = dict(width =0.3),shadow = True, 
        startangle = 10,
        explode=[0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0]);
plt.title('States with the most jobs');
plt.show();

# 5. Where most of the HQs are located :

In [None]:
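c = df.groupby('Hq_State').count()
# .copy() so the 'Others' row can be added without a SettingWithCopyWarning
loc_hq = c[c['Job_Title'] > 20].copy()
# Note: a state with exactly 20 postings falls into neither group
others = c[c['Job_Title'] < 20]
loc_hq.loc['Others'] = others.sum()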

In [None]:
loc_hq.reset_index(inplace = True)

In [None]:
loc_hq.sort_values(by = 'Job_Title', inplace = True)

In [None]:
state_h[[0,1]]

**Others is the sum of the headquarters locations with fewer than 20 job postings**

* Here the Others slice is fairly large
* A few companies have their headquarters in India and the UK

In [None]:
fig = plt.gcf();
fig.set_size_inches(20,10);
plt.pie(loc_hq.Job_Title, labels = loc_hq.Hq_State,
        wedgeprops = dict(width =0.3),shadow = True, 
        startangle = -45,
       explode=[0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]);
plt.title('States with the most headquarters');
plt.show();

# 6. Size of the companies :

Sorted according to the number of **available jobs**

In [None]:
sz = df.groupby('Size').count()

In [None]:
sz.sort_values(by = 'Job_Title',ascending = False,  inplace = True)
sz.reset_index(inplace = True )

In [None]:
sz.Size.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.barplot( x='Size',y='Job_Title', data = sz, palette = 'autumn').set_title('Available jobs according to the size of the company');
plt.xticks(rotation=80);
plt.xlabel('Size of the company');
plt.ylabel('Available jobs');

# 7. Average Salary according to the size and rating :

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.swarmplot( x='Size',y='Average_Salary_US_K', data = df,
              hue = 'Rated', palette = 'winter').set_title('Average salary according to the size and rating of the company');
plt.xticks(rotation=80);
plt.xlabel('Size of the company');
plt.ylabel('Average Salary');

# 8. Demand across the industries :

In [None]:
c = df.groupby('Industry').count().sort_values(by = 'Job_Title')
# .copy() so the 'Others' row can be added without a SettingWithCopyWarning
ind = c[c.Job_Title > 2].copy()
# Industries with 2 or fewer postings are combined into 'Others'
others = c[c.Job_Title <= 2]
ind.loc['Others'] = others.sum()
ind.reset_index(inplace= True)
ind.sort_values(by= 'Job_Title', inplace = True)

In [None]:
ind.Industry.unique()

* As you can see, **IT Services** and **Staffing & Outsourcing** are the two industries with the highest demand, well ahead of the rest
* Right after them comes **Health Care Services**, and so on
* **Here Others is the sum of the industries with 2 or fewer job postings**

In [None]:
fig= plt.gcf();
fig.set_size_inches(20,10);
sns.barplot(x='Industry', y ='Job_Title',data = ind,
            ci=None, palette = 'autumn').set_title('Available jobs according to the industry');
plt.xticks(rotation=90);
plt.xlabel('Industry Name');
plt.ylabel('No of jobs available');

# 9. Popular Job Titles :

**Data Analyst** is clearly the dominant job title.

In [None]:
jt = df.groupby('Job_Title').count().sort_values(by = 'Salary_Estimate',ascending = False)
jt.reset_index(inplace= True)

In [None]:
jt.Job_Title.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(10,7);
sns.barplot(y='Job_Title', x ='Salary_Estimate',data = jt.head(30),
            ci=None, palette = 'winter', orient = 'h').set_title('Most popular Job Titles')
plt.xticks(rotation=90);
plt.ylabel('Job Title')
plt.xlabel('No of jobs available');

# 10. Revenue :

Number of jobs in each **Revenue** category

In [None]:
rev = df.groupby('Revenue').count().sort_values(by = 'Job_Title',ascending = False)
rev.reset_index(inplace= True)

In [None]:
rev.Revenue.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(10,7);
sns.barplot(y='Revenue', x ='Job_Title',data = rev,
            ci = None, palette = 'summer', orient = 'h').set_title('No of jobs according to company revenue')
plt.xticks(rotation=90);
plt.ylabel('Revenue')
plt.xlabel('No of jobs available');

# 11. Salary in the large companies :

Only companies with revenue of $10+ billion (USD) are included.

In [None]:
lg_comp = df[df.Revenue == '$10+ billion (USD)']

In [None]:
# Reassign instead of sorting a filtered slice in place (avoids SettingWithCopyWarning)
lg_comp = lg_comp.sort_values(by = 'Rating')

In [None]:
lg_comp.Company_Name.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(20,10);
sns.barplot(x='Company_Name', y ='Average_Salary_US_K',data = lg_comp,
            ci = None, dodge= False, hue = 'Rated',
            palette = 'cool' ).set_title('Salary of the large companies\n Revenue = $10+ billion (USD)')
plt.xticks(rotation=90);
plt.ylabel('Average Salary')
plt.xlabel('Company Name');

# 12. Available jobs in different sectors : 

In [None]:
df.Sector.unique()

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.swarmplot( y='Sector',x='Average_Salary_US_K', data = df,
              hue = 'Rated', palette=['#ff0000','#ffa200','#00bd13','#0080ff'],
              orient = 'h').set_title('Salary Across the Sectors');
plt.xticks(rotation=90);
plt.ylabel('Sectors');
plt.xlabel('Average Salary');
plt.legend(loc='upper left',bbox_to_anchor=(1,1));

# 13. Foundation Year of the companies : 

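Most of the companies were founded between 1950 and the 2000s. Missing founding years were filled with the mean year during cleaning, which adds an artificial spike around that value.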

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
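# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(df.Founded, kde = True, stat = 'density', color ='r');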

# 14. Distribution of salaries :

Distribution of the minimum, maximum and average salaries

*  For most companies the minimum salary is a little below 50k (USD)
*  For most companies the maximum salary is a little above 75k (USD)
*  For most companies the average salary is between 50k and 75k (USD)

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.kdeplot(df.Min_Salary_US_K, color ='r', shade = True);
sns.kdeplot(df.Max_Salary_US_K, color ='g', shade = True);
sns.kdeplot(df.Average_Salary_US_K, color = 'b', shade = True);
plt.legend(['Min Salary', 'Max Salary', 'Average Salary']);
plt.xlabel('Salary');

# 15. Companies in India :

In [None]:
india = df[df.Hq_State == ' India']
india.Company_Name.unique()

Companies headquartered in India, according to their size

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.barplot( x='Size',y='Average_Salary_US_K', data = india,
              hue = 'Rated', ci= None,
            palette = ['#ff0000','#ffa200','#00bd13','#0080ff']).set_title('Average salary of companies headquartered in India according to their size');
plt.xticks(rotation=80);
plt.xlabel('Size of the company');
plt.ylabel('Average Salary');

# 16. Correlation between rating and salary :

In [None]:
fig= plt.gcf();
fig.set_size_inches(15,7);
sns.regplot(x='Average_Salary_US_K', y='Rating', data = df,
            marker = '+', color='Black').set_title('Correlation between rating and salary');
plt.xlabel('Average Salary');

**There is a correlation, but it is not a strong linear correlation.**
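
To put a number on the strength of a linear relationship, one option is the Pearson correlation coefficient. The sketch below uses made-up salary and rating pairs purely for illustration; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Made-up salary (in $K) and rating pairs, for illustration only
toy = pd.DataFrame({
    'Average_Salary_US_K': [40, 55, 60, 72, 90, 110],
    'Rating': [3.1, 3.9, 3.4, 3.6, 4.0, 3.5],
})

# Pearson r: +1 or -1 means a perfect linear relation, 0 means none
r = toy['Average_Salary_US_K'].corr(toy['Rating'])
print(round(r, 2))
```

On the real data the same call would be `df['Average_Salary_US_K'].corr(df['Rating'])`. A value close to 0 would back up the reading of the regression plot.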

If you liked my work and found these visualizations useful, please upvote this notebook.

Thank you!