# Data Analyst jobs visualization

## About Dataset

This dataset was created by picklesueat and contains more than 2000 job listings for data analyst positions, with features such as:

* Job Title.
* Salary Estimate.
* Company Name.
* Location.
* Industry.
* Sector.
* Rating
* Size.
* Revenue

## Importing necessary Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv')

In [None]:
df.head(2)

## Data Cleaning

* ### Removing 'Unnamed' column

In [None]:
df.drop(['Unnamed: 0'], axis=1,inplace=True)

* ### Checking for null values in the dataset

In [None]:
df.isna().sum()

> We can see that only the 'Company Name' column has a null value. Displaying the affected row:

In [None]:
df[df.isnull().any(axis=1)]

* Most of the other fields in this row are missing as well, so the null value cannot be filled in reliably. It is better to remove the row.

In [None]:
df.dropna(axis=0 , subset=['Company Name'], inplace=True)

* ## Creating a dictionary that maps missing-value placeholders to NaN

In [None]:
missing_val_dict = {
    -1 : np.nan,
    -1.0 : np.nan,
    '-1' : np.nan
}

* Replacing the placeholder values for missing or incorrect data with NaN in the dataframe

In [None]:
df.replace(missing_val_dict, inplace=True)
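
As a minimal sketch of this replacement step, the toy frame below (hypothetical data, not taken from the dataset) shows how `-1` placeholders, whether numeric or string, become `NaN` and then show up in the null counts. The column values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -1 (numeric or string) marks a missing value.
toy = pd.DataFrame({
    'Rating': [3.5, -1.0, 4.1],
    'Founded': [1999, -1, 2010],
    'Industry': ['IT Services', '-1', 'Banking'],
})

# Map every placeholder to NaN in a single pass.
cleaned = toy.replace({-1: np.nan, '-1': np.nan})

# Each column now reports exactly one missing value.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```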

In [None]:
df['Easy Apply'] = df['Easy Apply'].fillna(False)

In [None]:
df.isna().sum()

* ### The output above shows many missing values in the dataset, but we only need to work with these columns:
    1. Job Title.
    2. Salary Estimate.
    3. Company Name.
    4. Location.
    5. Industry.
    6. Sector.
    7. Rating
    8. Size.
    9. Revenue
    
    
    
* ### So let's look at these columns in detail and prepare them for visualization

## Starting with the 'Job Title' column

In [None]:
# Everything after the first comma is treated as the department
df[['Job Title', 'Department']] = df['Job Title'].str.split(',', n=1, expand=True)

In [None]:
df['Job Title'].value_counts()[:20]

In [None]:
df['Job Title'] = df['Job Title'].replace(['Sr. Data Analyst', 'Sr Data Analyst'], 'Senior Data Analyst')

In [None]:
df['Job Title'].value_counts()[:20]
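
The title clean-up above can be sketched on a few made-up titles. The sample strings are assumptions for illustration only. Splitting on the first comma with `expand=True` returns one column per part, and titles without a comma get a missing department.

```python
import pandas as pd

# Hypothetical job titles; only the first comma separates title and department.
titles = pd.Series([
    'Data Analyst, Marketing',
    'Sr. Data Analyst',
    'Data Analyst, Center on Immigration, Justice',
])

parts = titles.str.split(',', n=1, expand=True)
parts.columns = ['Job Title', 'Department']
parts['Department'] = parts['Department'].str.strip()

# Normalize abbreviated seniority prefixes to a single spelling.
parts['Job Title'] = parts['Job Title'].replace(
    ['Sr. Data Analyst', 'Sr Data Analyst'], 'Senior Data Analyst'
)
print(parts)
```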

### Cleaning Salary Estimate column

In [None]:
df['Salary Estimate']

> As we can see, the 'Salary Estimate' column needs some cleaning: remove the "(Glassdoor est.)" suffix and split the salary range into two columns (min and max).

In [None]:
# Keep only the text before the opening parenthesis, then split the range on '-'
df['Salary Estimate'] = df['Salary Estimate'].str.split('(', n=1).str[0]
df[['Min Salary', 'Max Salary']] = df['Salary Estimate'].str.split('-', n=1, expand=True)
df.dropna(axis=0 , subset=['Max Salary'], inplace=True)

In [None]:
# Raw strings avoid invalid-escape warnings; expand=False returns a Series
df['Max Salary'] = df['Max Salary'].str.extract(r'(\d+)', expand=False)
df['Min Salary'] = df['Min Salary'].str.extract(r'(\d+)', expand=False)

df['Min Salary'] = df['Min Salary'].astype(int)
df['Max Salary'] = df['Max Salary'].astype(int)

In [None]:
del df['Salary Estimate']
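
For comparison, both bounds can be pulled out in one step with a regular expression. This is only a sketch, assuming the estimates look like `$37K-$66K (Glassdoor est.)`; the sample values and the pattern are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical salary strings in the assumed "$<min>K-$<max>K (...)" format.
salaries = pd.Series([
    '$37K-$66K (Glassdoor est.)',
    '$46K-$87K (Glassdoor est.)',
    None,  # a missing estimate
])

# One capture group per bound; rows that do not match become NaN.
pattern = r'\$(?P<min>\d+)K-\$(?P<max>\d+)K'
bounds = salaries.str.extract(pattern).dropna().astype(int)
bounds.columns = ['Min Salary', 'Max Salary']
print(bounds)
```

The values stay in thousands of USD, matching the 'Min Salary' and 'Max Salary' columns built above.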

### Cleaning 'Company Name' column

In [None]:
# The rating is appended after a newline; keep only the company name
df['Company Name'] = df['Company Name'].str.split('\n', n=1).str[0]

### Cleaning 'Location' column

In [None]:
df['Location'].value_counts()[:20]

* Splitting the 'Location' column into city and state

In [None]:
df[['City', 'State']] = df['Location'].str.split(',', n=1, expand=True)

In [None]:
df['State'] = df['State'].replace([' Arapahoe, CO'], ' CO')

In [None]:
df['State'] = df['State'].str.strip()
df['City'] = df['City'].str.strip()

In [None]:
df['State'].value_counts()
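
An alternative sketch for the location split: splitting from the right with `str.rsplit` puts the state in the last part even when a location contains a county, so no special-case replacement is needed. The sample locations are assumptions for illustration. Note that, unlike the approach above, the county then stays in the city part.

```python
import pandas as pd

# Hypothetical locations; the second one carries an extra county segment.
locations = pd.Series(['New York, NY', 'Greenwood Village, Arapahoe, CO'])

# Split once from the right so the last segment is always the state.
parts = locations.str.rsplit(',', n=1, expand=True)
parts.columns = ['City', 'State']
parts['City'] = parts['City'].str.strip()
parts['State'] = parts['State'].str.strip()
print(parts)
```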

### Cleaning 'Industry' column

In [None]:
df['Industry'] = df['Industry'].fillna('Others')

### Cleaning 'Sector' column

In [None]:
df['Sector'] = df['Sector'].fillna('Others')

### Cleaning 'Rating' column

In [None]:
df['Rating'] = df['Rating'].fillna(round(df['Rating'].mean(), 1))

### Cleaning 'Revenue' column

### filter revenue Function

In [None]:
def filter_revenue(x):
    # Returns the upper bound of the revenue range in millions of USD (0 if unknown)
    revenue=0
    if(x== 'Unknown / Non-Applicable' or type(x)==float):
        revenue=0
    elif(('million' in x) and ('billion' not in x)):
        maxRev = x.replace('(USD)','').replace("million",'').replace('$','').strip().split('to')
        if('Less than' in maxRev[0]):
            revenue = float(maxRev[0].replace('Less than','').strip())
        else:
            if(len(maxRev)==2):
                revenue = float(maxRev[1])
            elif(len(maxRev)<2):
                revenue = float(maxRev[0])
    elif(('billion'in x)):
        maxRev = x.replace('(USD)','').replace("billion",'').replace('$','').strip().split('to')
        if('+' in maxRev[0]):
            revenue = float(maxRev[0].replace('+','').strip())*1000
        else:
            if(len(maxRev)==2):
                revenue = float(maxRev[1])*1000
            elif(len(maxRev)<2):
                revenue = float(maxRev[0])*1000
    return revenue

In [None]:
df['Revenue'] = df['Revenue'].apply(filter_revenue)
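
To make the intended result concrete, here is a compact regex-based sketch of the same idea. It is not the notebook's function: `max_revenue_millions` is a hypothetical helper, and the sample labels are assumed formats based on the keywords handled above. It takes the last number in the label and scales it by the last unit mentioned.

```python
import re

def max_revenue_millions(text):
    """Upper bound of a revenue label in millions of USD; 0.0 when unknown."""
    if not isinstance(text, str):
        return 0.0  # NaN or other non-text input
    numbers = re.findall(r'\d+(?:\.\d+)?', text)
    units = re.findall(r'million|billion', text)
    if not numbers or not units:
        return 0.0  # e.g. 'Unknown / Non-Applicable'
    # The last unit in the label applies to the upper bound.
    scale = 1000.0 if units[-1] == 'billion' else 1.0
    return float(numbers[-1]) * scale

# Assumed label formats, for illustration only.
samples = [
    '$100 to $500 million (USD)',
    'Less than $1 million (USD)',
    '$10+ billion (USD)',
    '$500 million to $1 billion (USD)',
    'Unknown / Non-Applicable',
]
for label in samples:
    print(label, '->', max_revenue_millions(label))
```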

In [None]:
important_column = ['Job Title', 'Rating', 'Company Name', 'State', 'City','Size', 'Industry', 'Sector', 'Min Salary', 'Max Salary', 'Revenue']

## Now our working dataset is ready

In [None]:
df[important_column].head()

## Now it's time to visualize.

* Top 20 roles by number of openings

In [None]:
# rename_axis + reset_index(name=...) yields the same column names on any pandas version
top_20_job = df['Job Title'].value_counts().head(20).rename_axis('Job Title').reset_index(name='No. of Openings')

In [None]:
fig = go.Figure(go.Bar(
    x=top_20_job['Job Title'],
    y=top_20_job['No. of Openings'],
))
fig.update_layout(title_text='Current openings in different Roles',xaxis_title="Job Title",yaxis_title="Number of openings")
fig.show()
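
The counting step behind these bar charts can be sketched as a small reusable helper. `top_counts` is a hypothetical name introduced here for illustration, and the sample data is made up. It returns a two-column frame that can be passed straight to a bar chart.

```python
import pandas as pd

def top_counts(series, n=20, count_name='No. of Openings'):
    """Return the n most frequent values of a named Series as a two-column frame."""
    counts = series.value_counts().head(n)
    # Name the index explicitly so the result does not depend on the pandas version.
    return counts.rename_axis(series.name).reset_index(name=count_name)

# Hypothetical job titles.
jobs = pd.Series(
    ['Data Analyst', 'Data Analyst', 'Business Analyst',
     'Data Analyst', 'Business Analyst', 'Data Engineer'],
    name='Job Title',
)
top = top_counts(jobs, n=2)
print(top)
```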

* Top 20 industries offering the most jobs

In [None]:
# Exclude the 'Others' placeholder explicitly instead of skipping the first row
top_20_industry = df['Industry'].value_counts().drop('Others', errors='ignore').head(20).rename_axis('Industry').reset_index(name='No. of Openings')

In [None]:
fig = go.Figure(go.Bar(
    x=top_20_industry['Industry'],
    y=top_20_industry['No. of Openings'],
))
fig.update_layout(title_text='Current openings by industry',xaxis_title="Industry",yaxis_title="Number of openings")
fig.show()

* Job openings by city

In [None]:
# rename_axis + reset_index(name=...) yields the same column names on any pandas version
top_20_city = df['City'].value_counts().head(20).rename_axis('City').reset_index(name='No. of Openings')

In [None]:
fig = go.Figure(go.Bar(
    x=top_20_city['City'],
    y=top_20_city['No. of Openings'],
))
fig.update_layout(title_text='Current openings by city',xaxis_title="City",yaxis_title="Number of openings")
fig.show()

* Top 20 companies with the most job openings

In [None]:
# rename_axis + reset_index(name=...) yields the same column names on any pandas version
top_20_company = df['Company Name'].value_counts().head(20).rename_axis('Company Name').reset_index(name='No. of Openings')

* Ratings and Revenue

In [None]:
companies = top_20_company['Company Name'].values
revenue_rating = df[df['Company Name'].isin(companies)][['Company Name','Rating', 'Revenue']]
revenue_rating = revenue_rating.groupby('Company Name').mean()

In [None]:
fig = go.Figure(go.Bar(
    x=top_20_company['Company Name'],
    y=top_20_company['No. of Openings'],
))
fig.update_layout(title_text='Current openings by company',xaxis_title="Company",yaxis_title="Number of openings")
fig.show()

* Average minimum and maximum salary for the top 20 job titles

In [None]:
df.dropna(axis=0 , subset=['Max Salary','Min Salary'], inplace=True)

In [None]:
grp_job_title = df[['Job Title','Min Salary', 'Max Salary']].groupby('Job Title').mean().reset_index()
grp_job_title = grp_job_title[grp_job_title['Job Title'].isin(top_20_job['Job Title'].values)].reset_index()
del grp_job_title['index']

In [None]:
grp_job_title['Min Salary'] = grp_job_title['Min Salary'].round(2)
grp_job_title['Max Salary'] = grp_job_title['Max Salary'].round(2)
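
The averaging step can be sketched on a toy frame. The data is hypothetical and the salaries are in thousands of USD, as in the columns above. Grouping by title and taking the mean gives one row per job title.

```python
import pandas as pd

# Hypothetical listings with salary bounds in thousands of USD.
toy = pd.DataFrame({
    'Job Title': ['Data Analyst', 'Data Analyst', 'Senior Data Analyst'],
    'Min Salary': [40, 50, 70],
    'Max Salary': [60, 80, 110],
})

# as_index=False keeps 'Job Title' as a regular column, so no reset_index is needed.
avg = (
    toy.groupby('Job Title', as_index=False)[['Min Salary', 'Max Salary']]
    .mean()
    .round(2)
)
print(avg)
```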

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Min Salary', x=grp_job_title['Job Title'], y=grp_job_title['Min Salary'],marker_color='indianred'),
    go.Bar(name='Max Salary', x=grp_job_title['Job Title'], y=grp_job_title['Max Salary'],marker_color='lightsalmon'),
])
# Change the bar mode
fig.update_layout(barmode='group', title='Min and Max salary of top 20 Job openings',
                 yaxis=dict(
                    # title.font replaces the deprecated titlefont attribute
                    title=dict(text='USD (thousands)', font=dict(size=16)),
                    tickfont_size=14,
                ),
                xaxis=dict(
                    title=dict(text='Job Title', font=dict(size=16)),
                    tickfont_size=14,
                ))

fig.show()
