# Introduction
<p>The dataset is from <a href='https://www.kaggle.com/andrewmvd/data-analyst-jobs'> <span> Kaggle</span> </a>.</p>
<p>There are three questions of interest:</p>
<p>1. Find the best jobs by salary and company rating</p>
<p>2. Explore the skills required in job descriptions</p>
<p>3. Predict salary based on industry, location, and company revenue</p>

First, import the required libraries.

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import warnings 
warnings.filterwarnings("ignore")

Step 1: Load and clean the data

In [110]:
# load dataset
df = pd.read_csv('./Dataset/DataAnalyst.csv',index_col=0)

df.head()

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\r\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True
1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\r\n\r\nProvides analytical and techni...,3.8,Visiting Nurse Service of New York\r\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1
2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\r\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1
3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\r\nRemote:Yes\r\n...,4.1,Celerity\r\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\r\n\r\nFanDuel Group is a ...,3.9,FanDuel\r\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True


In [111]:
# 1) get the number of rows and columns
print(df.shape)  # rows: 2253, columns: 15

# 2) check for null values in each column
df.info()  # only 'Company Name' has a NaN (1 row); other gaps are encoded as -1

(2253, 15)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2253 entries, 0 to 2252
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          2253 non-null   object 
 1   Salary Estimate    2253 non-null   object 
 2   Job Description    2253 non-null   object 
 3   Rating             2253 non-null   float64
 4   Company Name       2252 non-null   object 
 5   Location           2253 non-null   object 
 6   Headquarters       2253 non-null   object 
 7   Size               2253 non-null   object 
 8   Founded            2253 non-null   int64  
 9   Type of ownership  2253 non-null   object 
 10  Industry           2253 non-null   object 
 11  Sector             2253 non-null   object 
 12  Revenue            2253 non-null   object 
 13  Competitors        2253 non-null   object 
 14  Easy Apply         2253 non-null   object 
dtypes: float64(1), int64(1), object(13)
memory usage: 281.6+ KB


In [120]:
# 3) check for abnormal values column by column
# df.columns[0]
print(df['Job Title'].dtype)
df['Job Title'].value_counts()  # some job titles are variants of the same role and can be merged
df['Job Title'] = df['Job Title'].str.lower()
df['Job Title'].value_counts()[:50]

object


data analyst                                                                        411
senior data analyst                                                                 122
junior data analyst                                                                  47
data business analyst                                                                31
data quality analyst                                                                 21
data analyst ii                                                                      17
data governance analyst                                                              16
reporting data analyst                                                               16
lead data analyst                                                                    15
financial data analyst                                                               12
data analyst i                                                                       11
data analyst iii                

In [121]:
# merge variants of the same title; apply to 'Job Title' only so other columns are untouched
title_variants = {
    'sr. data analyst': 'senior data analyst',
    'sr data analyst': 'senior data analyst',
    'data analyst junior': 'junior data analyst',
    'quality data analyst': 'data quality analyst',
    'business data analyst': 'data business analyst',
    'data reporting analyst': 'reporting data analyst',
}
df['Job Title'] = df['Job Title'].replace(title_variants)
df['Job Title'].value_counts()[:50]

data analyst                                                                        411
senior data analyst                                                                 122
junior data analyst                                                                  47
data business analyst                                                                31
data quality analyst                                                                 21
data analyst ii                                                                      17
data governance analyst                                                              16
reporting data analyst                                                               16
lead data analyst                                                                    15
financial data analyst                                                               12
data analyst i                                                                       11
data analyst iii                
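The manual replacements above depend on spotting each variant by eye. As a rough sketch of how candidates could be found automatically, the block below builds a key for each title that ignores word order and expands a few abbreviations, then maps every title to the most frequent spelling within its key. The toy titles, the `ABBREVIATIONS` table, and the `title_key` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy titles standing in for df['Job Title'] (already lower-cased).
titles = pd.Series([
    'data analyst',
    'senior data analyst', 'senior data analyst', 'sr. data analyst',
    'junior data analyst', 'junior data analyst', 'data analyst junior',
    'data quality analyst', 'data quality analyst', 'quality data analyst',
])

# Hypothetical abbreviation table; extend it after inspecting value_counts().
ABBREVIATIONS = {'sr.': 'senior', 'sr': 'senior', 'jr.': 'junior', 'jr': 'junior'}

def title_key(title: str) -> str:
    """Expand abbreviations, then sort the words so word order is ignored."""
    words = [ABBREVIATIONS.get(word, word) for word in title.split()]
    return ' '.join(sorted(words))

keys = titles.map(title_key)
# Within each key, take the most frequent spelling as the canonical title.
canonical = titles.groupby(keys).agg(lambda s: s.value_counts().idxmax())
normalized = keys.map(canonical)
print(normalized.value_counts())
```

Ignoring word order can merge titles that mean different things, so the groups are worth reviewing before the mapping is applied to the real column.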

In [86]:
# df.columns[1]
print(df['Salary Estimate'].dtype)
df['Salary Estimate'].unique()  # '-1' marks a missing estimate; replace it with np.nan
df['Salary Estimate'] = df['Salary Estimate'].replace('-1',np.nan)
df['Salary Estimate'].value_counts(dropna=False)
df['Salary Estimate'].isnull().mean() * 100  # percentage of missing values
df[df['Salary Estimate'].isnull()]
# only one row (index 2149) has no salary estimate, so drop it
df = df.dropna(subset=['Salary Estimate'])

object


In [87]:
# split 'Salary Estimate' (e.g. '$37K-$66K (Glassdoor est.)') into lower and upper bounds
s = df["Salary Estimate"].str.split(" ",n=1,expand=True)
sr = s[0].str.split('-',expand=True,n=1)
# regex=False treats '$' as a literal character rather than the end-of-string anchor
df['lower_salary'] = sr[0].str.replace('$','',regex=False)
df['lower_salary'] = df['lower_salary'].str.replace('K','000',regex=False)
df['upper_salary'] = sr[1].str.replace('$','',regex=False)
df['upper_salary'] = df['upper_salary'].str.replace('K','000',regex=False)

# convert the bounds to integers
df['upper_salary'] = df['upper_salary'].astype('int')
df['lower_salary'] = df['lower_salary'].astype('int')

# add new column 'average_salary' (midpoint of the range)
df['average_salary'] = (df['upper_salary'] + df['lower_salary']) / 2

# drop the columns that are no longer needed
df = df.drop(columns=['Salary Estimate', 'upper_salary', 'lower_salary'])
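The same parsing can be done with one regular expression instead of a chain of splits and replacements. The sketch below is an alternative, not the notebook's method. It assumes every estimate follows the `$<low>K-$<high>K` pattern seen in the sample rows; the second toy value is made up for illustration.

```python
import pandas as pd

# Toy values in the same format as the 'Salary Estimate' column.
salary = pd.Series(['$37K-$66K (Glassdoor est.)', '$46K-$87K (Glassdoor est.)'])

# Capture the two numbers; the named groups become the column names.
bounds = salary.str.extract(r'\$(?P<lower>\d+)K-\$(?P<upper>\d+)K').astype(float) * 1000

# Midpoint of each range.
average_salary = bounds.mean(axis=1)
print(average_salary.tolist())
```

Casting to `float` rather than `int` means a value that does not match the pattern becomes `NaN` instead of raising an error, which makes unexpected formats easier to spot.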

In [88]:
# df.columns[2]
print(df['Job Description'].dtype)
df['Job Description'].value_counts(dropna=False)
df['Job Description'].isnull().mean()   # no missing values

object


0.0

In [89]:
# df.columns[3]
print(df['Rating'].dtype)
df['Rating'].unique()
df['Rating'] = df['Rating'].replace(-1,np.nan)
print(df['Rating'].isnull().mean())
df['Rating'].value_counts(dropna=False)
df[df['Rating'].isnull()]  # rows without a rating; handle them later

float64
0.12285581826611033


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
11,data analyst,BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",-1,1 to 50 employees,-1,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1,51500.0
21,data science analyst,"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,-1,Company - Public,-1,-1,Unknown / Non-Applicable,-1,-1,51500.0
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",-1,-1,-1,-1,-1,-1,-1,-1,-1,66500.0
36,data business analyst,"At Clear Street, we are disrupting the institu...",,Clear Street,"New York, NY","New York, NY",51 to 200 employees,2018,Company - Public,-1,-1,$1 to $5 million (USD),-1,-1,66500.0
40,business analyst,Company Description\r\n\r\nPinto is building t...,,Pinto,"New York, NY","New York, NY",1 to 50 employees,-1,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1,66500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,data analyst,Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,70000.0
2202,financial data analyst,Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,70000.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,91000.0


In [90]:
# df.columns[4]
print(df['Company Name'].dtype)
df['Company Name'].unique()
# the rating is appended after '\r\n'; keep only the company name
df['Company Name'] = df['Company Name'].str.split('\r\n',n=1,expand=True)[0]
print(df['Company Name'].isnull().mean())
df[df['Company Name'].isnull()]
# only one row (index 1860) has no company name, so drop it
df = df.dropna(subset=['Company Name'])

object
0.0004636068613815484


In [91]:
# df.columns[5]
print(df['Location'].dtype)
df['Location'].value_counts(dropna=False)
print(df['Location'].isnull().mean())

object
0.0


In [92]:
# df.columns[6]
print(df['Headquarters'].dtype)
df['Headquarters'].unique()
df['Headquarters'].value_counts(dropna=False)
df['Headquarters'] = df['Headquarters'].replace('-1',np.nan)
print(df['Headquarters'].isnull().mean())
df[df['Headquarters'].isnull()]  # rows without headquarters; handle them later

object
0.0774582560296846


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
11,data analyst,BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,-1,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1,51500.0
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1,66500.0
55,data reporting analyst,OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1,66500.0
68,data science analyst,Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1,69500.0
90,data analyst,NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1,69000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,data analyst,Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,-1,-1,-1,-1,-1,-1,-1,-1,70000.0
2202,financial data analyst,Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,-1,-1,-1,-1,-1,-1,-1,-1,70000.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,-1,-1,-1,-1,-1,-1,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,-1,-1,-1,-1,-1,-1,-1,-1,91000.0


In [93]:
# df.columns[7]
print(df['Size'].dtype)
df['Size'].unique()
df['Size'] = df['Size'].replace('-1',np.nan)
print(df['Size'].isnull().mean())
df[df['Size'].isnull()]  # rows without a company size; handle them later

object
0.07328385899814471


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,-1,-1,-1,-1,-1,-1,-1,66500.0
55,data reporting analyst,OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,,-1,-1,-1,-1,-1,-1,-1,66500.0
68,data science analyst,Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,,-1,-1,-1,-1,-1,-1,-1,69500.0
90,data analyst,NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,,-1,-1,-1,-1,-1,-1,-1,69000.0
109,data analyst,"Data Analyst\r\n\r\nJersey City, NJ\r\n\r\n12+...",,Vertex Intel Systems,"Jersey City, NJ",,,-1,-1,-1,-1,-1,-1,-1,69000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,data analyst,Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,,-1,-1,-1,-1,-1,-1,-1,70000.0
2202,financial data analyst,Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,,-1,-1,-1,-1,-1,-1,-1,70000.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,-1,-1,-1,-1,-1,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,-1,-1,-1,-1,-1,-1,-1,91000.0


In [94]:
# df.columns[8]
print(df['Founded'].dtype)
df['Founded'].unique()
df['Founded'] = df['Founded'].replace(-1,np.nan)
df['Founded'].value_counts(dropna=False)
print(df['Founded'].isnull().mean())
df[df['Founded'].isnull()]  # rows without a founding year; handle them later

int64
0.29545454545454547


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
11,data analyst,BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1,51500.0
15,sustainability data analyst,Job Description\r\nRole Description\r\n\r\nSus...,3.6,CodeGreen Solutions,"New York, NY","New York, NY",1 to 50 employees,,Company - Private,Building & Personnel Services,Business Services,Unknown / Non-Applicable,-1,-1,51500.0
21,data science analyst,"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,,Company - Public,-1,-1,Unknown / Non-Applicable,-1,-1,51500.0
23,data analyst,Haven Life is an insurtech innovator at MassMu...,3.5,Andiamo,"New York, NY","Warren, MI",201 to 500 employees,,Company - Private,Casual Restaurants,"Restaurants, Bars & Food Services",$1 to $5 million (USD),-1,-1,51500.0
24,entry level / jr. data analyst,Dash Technologies is an industry leading softw...,3.8,Dash Technologies Inc,"New York, NY","Columbus, OH",1 to 50 employees,,Unknown,-1,-1,Unknown / Non-Applicable,-1,-1,51500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,-1,-1,-1,-1,-1,-1,91000.0
2241,data security analyst fr,Maintains systems to protect data from unautho...,2.5,"Avacend, Inc.","Denver, CO","Alpharetta, GA",51 to 200 employees,,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1,91000.0
2244,data security analyst,Contract Duration: 9 Months\r\n\r\nLocation: D...,2.5,"Avacend, Inc.","Denver, CO","Alpharetta, GA",51 to 200 employees,,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,-1,-1,-1,-1,-1,-1,91000.0


In [95]:
# df.columns[9]
print(df['Type of ownership'].dtype)
df['Type of ownership'].unique()
df['Type of ownership'] = df['Type of ownership'].replace('-1',np.nan)
print(df['Type of ownership'].isnull().mean())
df[df['Type of ownership'].isnull()]  # rows without an ownership type; handle them later

object
0.07328385899814471


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,-1,-1,-1,-1,-1,66500.0
55,data reporting analyst,OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,,,,-1,-1,-1,-1,-1,66500.0
68,data science analyst,Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,,,,-1,-1,-1,-1,-1,69500.0
90,data analyst,NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,,,,-1,-1,-1,-1,-1,69000.0
109,data analyst,"Data Analyst\r\n\r\nJersey City, NJ\r\n\r\n12+...",,Vertex Intel Systems,"Jersey City, NJ",,,,,-1,-1,-1,-1,-1,69000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,data analyst,Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,,,,-1,-1,-1,-1,-1,70000.0
2202,financial data analyst,Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,,,,-1,-1,-1,-1,-1,70000.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,-1,-1,-1,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,-1,-1,-1,-1,-1,91000.0


In [96]:
# df.columns[10]
print(df['Industry'].dtype)
df['Industry'].unique()
df['Industry'] = df['Industry'].replace('-1',np.nan)
print(df['Industry'].isnull().mean())
df[df['Industry'].isnull()]  # rows without an industry; handle them later

object
0.16001855287569575


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
11,data analyst,BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,,Company - Private,,-1,Unknown / Non-Applicable,-1,-1,51500.0
21,data science analyst,"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,,Company - Public,,-1,Unknown / Non-Applicable,-1,-1,51500.0
24,entry level / jr. data analyst,Dash Technologies is an industry leading softw...,3.8,Dash Technologies Inc,"New York, NY","Columbus, OH",1 to 50 employees,,Unknown,,-1,Unknown / Non-Applicable,-1,-1,51500.0
32,data analyst,Job Description:\r\nLegal experience is requir...,3.5,Pozent,"New York, NY","Piscataway, NJ",1 to 50 employees,,Contract,,-1,Less than $1 million (USD),-1,-1,66500.0
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,,-1,-1,-1,-1,66500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2220,mdg functional data analyst,MDG Functional Data Analyst\r\nLocation: Green...,5.0,Georgia IT Inc.,"Greenwood Village, Arapahoe, CO","Alpharetta, GA",1 to 50 employees,,Company - Private,,-1,Less than $1 million (USD),-1,-1,78500.0
2234,data analyst 3,Business Unit: Summary Responsible for working...,3.6,Comcast,"Englewood, CO","Philadelphia, PA",10000+ employees,1963.0,Company - Public,,-1,$10+ billion (USD),"AT&T, Verizon",-1,78500.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,,-1,-1,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,-1,-1,-1,-1,91000.0


In [97]:
# df.columns[11]
print(df['Sector'].dtype)
df['Sector'].unique()
df['Sector'] = df['Sector'].replace('-1',np.nan)
print(df['Sector'].isnull().mean())
df[df['Sector'].isnull()]  # rows without a sector; handle them later

object
0.16001855287569575


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
11,data analyst,BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,,Company - Private,,,Unknown / Non-Applicable,-1,-1,51500.0
21,data science analyst,"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,,Company - Public,,,Unknown / Non-Applicable,-1,-1,51500.0
24,entry level / jr. data analyst,Dash Technologies is an industry leading softw...,3.8,Dash Technologies Inc,"New York, NY","Columbus, OH",1 to 50 employees,,Unknown,,,Unknown / Non-Applicable,-1,-1,51500.0
32,data analyst,Job Description:\r\nLegal experience is requir...,3.5,Pozent,"New York, NY","Piscataway, NJ",1 to 50 employees,,Contract,,,Less than $1 million (USD),-1,-1,66500.0
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,,,-1,-1,-1,66500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2220,mdg functional data analyst,MDG Functional Data Analyst\r\nLocation: Green...,5.0,Georgia IT Inc.,"Greenwood Village, Arapahoe, CO","Alpharetta, GA",1 to 50 employees,,Company - Private,,,Less than $1 million (USD),-1,-1,78500.0
2234,data analyst 3,Business Unit: Summary Responsible for working...,3.6,Comcast,"Englewood, CO","Philadelphia, PA",10000+ employees,1963.0,Company - Public,,,$10+ billion (USD),"AT&T, Verizon",-1,78500.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,,,-1,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,,-1,-1,-1,91000.0


In [98]:
# df.columns[12]
print(df['Revenue'].dtype)
df['Revenue'].unique()
df['Revenue'] = df['Revenue'].replace('-1',np.nan)
print(df['Revenue'].isnull().mean())
df[df['Revenue'].isnull()]  # rows without revenue; handle them later

object
0.07328385899814471


Unnamed: 0,Job Title,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,average_salary
34,data analyst,Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,,,,-1,-1,66500.0
55,data reporting analyst,OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,,,,,,,-1,-1,66500.0
68,data science analyst,Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,,,,,,,-1,-1,69500.0
90,data analyst,NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,,,,,,,-1,-1,69000.0
109,data analyst,"Data Analyst\r\n\r\nJersey City, NJ\r\n\r\n12+...",,Vertex Intel Systems,"Jersey City, NJ",,,,,,,,-1,-1,69000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,data analyst,Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,,,,,,,-1,-1,70000.0
2202,financial data analyst,Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,,,,,,,-1,-1,70000.0
2239,senior contract data analyst,OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,,,,-1,-1,91000.0
2246,technical business analyst,Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,,,-1,-1,91000.0


In [99]:
# df.columns[13]
print(df['Competitors'].dtype)
df['Competitors'].unique()
df['Competitors'] = df['Competitors'].replace('-1',np.nan)
print(df['Competitors'].isnull().mean())
df[df['Competitors'].isnull()]  # about 77% of the values are missing, so drop this column
df = df.drop(columns = ['Competitors'])

object
0.7690166975881262


In [100]:
# df.columns[14]
print(df['Easy Apply'].dtype)
df['Easy Apply'].value_counts()
# encode 'True' as 1; postings without Easy Apply keep the '-1' placeholder
df['Easy Apply'] = df['Easy Apply'].replace('True',1)

object


In [108]:
# percentage of missing values in each column after cleaning
print(df.isnull().mean().sort_values() * 100)

Job Title             0.000000
Job Description       0.000000
Company Name          0.000000
Location              0.000000
Easy Apply            0.000000
average_salary        0.000000
Size                  7.328386
Type of ownership     7.328386
Revenue               7.328386
Headquarters          7.745826
Rating               12.244898
Industry             16.001855
Sector               16.001855
Founded              29.545455
dtype: float64
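The column-by-column cells above all apply the same pattern: replace the `-1` placeholder with `np.nan`, then measure the share of missing values. A compact sketch of that pattern applied to every column at once is shown below. The toy frame and its values are assumptions made for illustration; only the `-1` and `'-1'` placeholders are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame using the same placeholders as the dataset: -1 and '-1'.
toy = pd.DataFrame({
    'Rating': [3.2, -1.0, 4.1],
    'Founded': [1961, -1, -1],
    'Size': ['201 to 500 employees', '-1', '-1'],
})

# Replace the numeric and the string placeholder with NaN in every column.
cleaned = toy.replace([-1, '-1'], np.nan)

# Percentage of missing values per column, smallest first.
missing_pct = cleaned.isnull().mean().sort_values() * 100
print(missing_pct)
```

Values such as `Unknown / Non-Applicable` in `Revenue` are not touched by this replacement; whether they should also count as missing is a separate decision.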


In [103]:
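# correlate numeric columns only; recent pandas versions raise on non-numeric columns
df.select_dtypes(include='number').corr()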

Unnamed: 0,Rating,Founded,average_salary
Rating,1.0,0.179638,0.047219
Founded,0.179638,1.0,0.093385
average_salary,0.047219,0.093385,1.0


In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2156 entries, 0 to 2252
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          2156 non-null   object 
 1   Job Description    2156 non-null   object 
 2   Rating             1892 non-null   float64
 3   Company Name       2156 non-null   object 
 4   Location           2156 non-null   object 
 5   Headquarters       1989 non-null   object 
 6   Size               1998 non-null   object 
 7   Founded            1519 non-null   float64
 8   Type of ownership  1998 non-null   object 
 9   Industry           1811 non-null   object 
 10  Sector             1811 non-null   object 
 11  Revenue            1998 non-null   object 
 12  Easy Apply         2156 non-null   object 
 13  average_salary     2156 non-null   float64
dtypes: float64(3), object(11)
memory usage: 252.7+ KB


1. Find the best jobs by salary and company rating

In [109]:
df.groupby('Job Title').size().sort_values(ascending=False)[:10]

Job Title
data analyst               595
senior data analyst        202
junior data analyst         62
business data analyst       28
data quality analyst        19
lead data analyst           19
data governance analyst     16
data reporting analyst      14
data analyst iii            13
financial data analyst      13
dtype: int64
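Counting postings per title is only the first step toward ranking jobs. One possible next step, sketched below under stated assumptions, is to aggregate `average_salary` and `Rating` per title and keep only titles with enough postings to make the averages meaningful. The toy rows and the `MIN_POSTINGS` threshold are hypothetical and would need tuning on the real data.

```python
import pandas as pd

# Toy rows with the three columns used below; the numbers are made up.
jobs = pd.DataFrame({
    'Job Title': ['data analyst', 'data analyst', 'senior data analyst',
                  'senior data analyst', 'lead data analyst'],
    'average_salary': [60000.0, 70000.0, 90000.0, 100000.0, 120000.0],
    'Rating': [3.5, 4.0, 4.0, None, 4.5],
})

MIN_POSTINGS = 2  # hypothetical threshold: skip titles with too few postings

# One row per title: number of postings, mean salary, mean rating (NaN ignored).
summary = jobs.groupby('Job Title').agg(
    postings=('average_salary', 'count'),
    mean_salary=('average_salary', 'mean'),
    mean_rating=('Rating', 'mean'),
)

# Keep well-represented titles and rank by salary, then by rating.
best = (summary[summary['postings'] >= MIN_POSTINGS]
        .sort_values(['mean_salary', 'mean_rating'], ascending=False))
print(best)
```

Sorting by salary first and rating second is one choice among several; a combined score would work equally well if both criteria should carry weight.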

2. Explore the skills required in job descriptions
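A simple way to start on this question is to count how many job descriptions mention each skill. The sketch below is one possible approach, not an analysis from the notebook. The `SKILLS` list, the `skill_share` helper, and the toy descriptions are assumptions chosen for illustration.

```python
import re
import pandas as pd

# Toy descriptions standing in for df['Job Description'].
descriptions = pd.Series([
    'Strong SQL and Excel skills required; Python is a plus.',
    'Experience with Tableau, SQL, and R.',
    'Excel reporting and PowerPoint decks.',
])

SKILLS = ['sql', 'python', 'excel', 'tableau', 'r']  # hypothetical skill list

def skill_share(texts: pd.Series, skills: list) -> pd.Series:
    """Return the share of texts that mention each skill as a whole word."""
    lowered = texts.str.lower()
    shares = {}
    for skill in skills:
        # Lookarounds keep short names such as 'r' from matching inside words.
        pattern = r'(?<![a-z0-9])' + re.escape(skill) + r'(?![a-z0-9+#])'
        shares[skill] = lowered.str.contains(pattern, regex=True).mean()
    return pd.Series(shares).sort_values(ascending=False)

shares = skill_share(descriptions, SKILLS)
print(shares)
```

Keyword matching of this kind misses synonyms and misspellings, so the resulting shares are best read as lower bounds.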

3. Predict salary based on industry, location, and company revenue