# Introduction
<p>The dataset is from <a href='https://www.kaggle.com/andrewmvd/data-analyst-jobs'> <span> Kaggle</span> </a>.</p>
<p>There are three questions of interest:</p>
<p>1. Find the best jobs by salary and company rating</p>
<p>2. Explore the skills required in job descriptions</p>
<p>3. Predict salary based on industry, location, and company revenue</p>

First, import the required libraries, then read in the dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
%matplotlib inline

Step 1: Load and clean the data

In [2]:
# load dataset
df = pd.read_csv('./Dataset/DataAnalyst.csv',index_col=0)

df.head()

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True
1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1
2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1
3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True


In [3]:
# 1) Get the number of rows and columns in the dataset
print(df.shape)  # rows: 2253, columns: 15

# 2) Check for null values in each column
df.info()  # only 'Company Name' has a NaN (2252 of 2253 non-null)

(2253, 15)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2253 entries, 0 to 2252
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          2253 non-null   object 
 1   Salary Estimate    2253 non-null   object 
 2   Job Description    2253 non-null   object 
 3   Rating             2253 non-null   float64
 4   Company Name       2252 non-null   object 
 5   Location           2253 non-null   object 
 6   Headquarters       2253 non-null   object 
 7   Size               2253 non-null   object 
 8   Founded            2253 non-null   int64  
 9   Type of ownership  2253 non-null   object 
 10  Industry           2253 non-null   object 
 11  Sector             2253 non-null   object 
 12  Revenue            2253 non-null   object 
 13  Competitors        2253 non-null   object 
 14  Easy Apply         2253 non-null   object 
dtypes: float64(1), int64(1), object(13)
memory usage: 281.6+ KB
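
`df.info()` only counts true nulls, so placeholder values such as `-1` still show up as non-null. A quick way to see how many placeholders each column holds is sketched below. This is a minimal illustration, assuming the placeholders appear as `-1` or `'-1'`; `sentinel_share` and the `toy` frame are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def sentinel_share(frame):
    """Return, per column, the fraction of rows holding a -1 placeholder."""
    # Compare as text so that -1, -1.0 and '-1' are all caught the same way.
    as_text = frame.astype(str)
    return as_text.isin(["-1", "-1.0"]).mean()

# Small made-up frame that mimics the placeholder pattern in the dataset.
toy = pd.DataFrame({
    "Rating": [3.2, -1.0, 4.1, -1.0],
    "Industry": ["Internet", "-1", "IT Services", "-1"],
    "Location": ["New York, NY", "Denver, CO", "Aurora, CO", "Denver, CO"],
})
shares = sentinel_share(toy)
print(shares)
```

On the real data, the same call would be `sentinel_share(df)`.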


In [4]:
# 3) Check each column for abnormal values, one at a time
# df.columns[0]
print(df['Job Title'].dtype)
df['Job Title'].value_counts()  # no abnormal values


object


Data Analyst                                           405
Senior Data Analyst                                     90
Junior Data Analyst                                     30
Business Data Analyst                                   28
Sr. Data Analyst                                        21
                                                      ... 
Senior Business Intelligence & Data Science Analyst      1
Product Analyst - Data                                   1
Senior Data Analyst and Applied Scientist                1
Data Reporting Analyst, BI                               1
Sr Data Management Analyst                               1
Name: Job Title, Length: 1272, dtype: int64

In [5]:
# df.columns[1]
print(df['Salary Estimate'].dtype)
df['Salary Estimate'].value_counts(dropna=False)  # '-1' marks a missing estimate; replace it with np.nan
df['Salary Estimate'] = df[df.columns[1]].replace('-1',np.nan)
df['Salary Estimate'].value_counts(dropna=False)
df['Salary Estimate'].isnull().mean()  # only one value is missing, so drop that row
df[df['Salary Estimate'].isnull()]
df.drop(index=2149,inplace=True)

object


In [13]:
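# split "$37K-$66K (Glassdoor est.)" into the range and the trailing note
s = df["Salary Estimate"].str.split(" ",n=1,expand=True)
# split the range into its lower and upper bounds
sr = s[0].str.split('-',expand=True,n=1)
# regex=False so that '$' is removed as a literal character, not read as a regex anchor
df['lower_salary'] = sr[0].str.replace('$','',regex=False)
df['lower_salary'] = df['lower_salary'].str.replace('K','000',regex=False)
df['upper_salary'] = sr[1].str.replace('$','',regex=False)
df['upper_salary'] = df['upper_salary'].str.replace('K','000',regex=False)

# convert both bounds to integers (dollars)
df['upper_salary'] = df['upper_salary'].astype('int')
df['lower_salary'] = df['lower_salary'].astype('int')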
df.head()

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,lower_salary,upper_salary
0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\r\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,37000,66000
1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\r\n\r\nProvides analytical and techni...,3.8,Visiting Nurse Service of New York\r\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,37000,66000
2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\r\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,37000,66000
3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\r\nRemote:Yes\r\n...,4.1,Celerity\r\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,37000,66000
4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\r\n\r\nFanDuel Group is a ...,3.9,FanDuel\r\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,37000,66000
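
As an alternative to the chained splits and replacements, both bounds can be pulled out with a single regular expression. The sketch below is not the notebook's own method; it assumes every estimate follows the `$<low>K-$<high>K` pattern seen in the output, and `split_salary` and the sample series are hypothetical names.

```python
import pandas as pd

def split_salary(estimates):
    """Extract lower and upper salary bounds (in dollars) from strings like '$37K-$66K (...)'."""
    # Two capture groups: digits after the first '$' and digits after the second '$'.
    bounds = estimates.str.extract(r"\$(\d+)K-\$(\d+)K")
    bounds.columns = ["lower_salary", "upper_salary"]
    # Rows that do not match the pattern stay NaN instead of raising an error.
    return bounds.apply(pd.to_numeric) * 1000

sample = pd.Series([
    "$37K-$66K (Glassdoor est.)",
    "$78K-$104K (Glassdoor est.)",
    None,
])
salary_bounds = split_salary(sample)
print(salary_bounds)
```

Because unmatched rows become NaN, this version does not require dropping the missing estimate before parsing, at the cost of float rather than int columns.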


In [6]:
# df.columns[2]
print(df['Job Description'].dtype)
df['Job Description'].value_counts(dropna=False)
df['Job Description'].isnull().mean()

object


0.0

In [7]:
# df.columns[3]
print(df['Rating'].dtype)
df['Rating'].value_counts(dropna=False)
df['Rating'] = df[df.columns[3]].replace(-1,np.nan)
df['Rating'].isnull().mean()
df['Rating'].value_counts(dropna=False)
df[df['Rating'].isnull()]  # handle these missing values later

float64


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
11,Data Analyst,$37K-$66K (Glassdoor est.),BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",-1,1 to 50 employees,-1,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1
21,Data Science Analyst,$37K-$66K (Glassdoor est.),"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,-1,Company - Public,-1,-1,Unknown / Non-Applicable,-1,-1
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",-1,-1,-1,-1,-1,-1,-1,-1,-1
36,Data Business Analyst,$46K-$87K (Glassdoor est.),"At Clear Street, we are disrupting the institu...",,Clear Street,"New York, NY","New York, NY",51 to 200 employees,2018,Company - Public,-1,-1,$1 to $5 million (USD),-1,-1
40,"Business Analyst, Data Platforms",$46K-$87K (Glassdoor est.),Company Description\r\n\r\nPinto is building t...,,Pinto,"New York, NY","New York, NY",1 to 50 employees,-1,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,$49K-$91K (Glassdoor est.),Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1
2202,Financial Data Analyst,$49K-$91K (Glassdoor est.),Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1


In [8]:
# df.columns[4]
print(df['Company Name'].dtype)
df['Company Name'].value_counts(dropna=False)
df['Company Name'].isnull().mean()
df[df['Company Name'].isnull()]
df.drop(index=1860,inplace=True)

object


In [9]:
# df.columns[5]
print(df['Location'].dtype)
df['Location'].value_counts(dropna=False)
df['Location'].isnull().mean()

object


0.0

In [10]:
# df.columns[6]
print(df['Headquarters'].dtype)
df['Headquarters'].value_counts(dropna=False)
df['Headquarters'] = df[df.columns[6]].replace('-1',np.nan)
df['Headquarters'].isnull().mean()
df[df['Headquarters'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
11,Data Analyst,$37K-$66K (Glassdoor est.),BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,-1,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1
55,Data Reporting Analyst,$46K-$87K (Glassdoor est.),OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1
68,Data Science Analyst,$51K-$88K (Glassdoor est.),Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1
90,Data Analyst,$51K-$87K (Glassdoor est.),NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,$49K-$91K (Glassdoor est.),Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,-1,-1,-1,-1,-1,-1,-1,-1
2202,Financial Data Analyst,$49K-$91K (Glassdoor est.),Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,-1,-1,-1,-1,-1,-1,-1,-1
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,-1,-1,-1,-1,-1,-1,-1,-1
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,-1,-1,-1,-1,-1,-1,-1,-1


In [11]:
# df.columns[7]
print(df['Size'].dtype)
df['Size'].value_counts(dropna=False)
df['Size'] = df[df.columns[7]].replace('-1',np.nan)
df['Size'].isnull().mean()
df[df['Size'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,-1,-1,-1,-1,-1,-1,-1
55,Data Reporting Analyst,$46K-$87K (Glassdoor est.),OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,,-1,-1,-1,-1,-1,-1,-1
68,Data Science Analyst,$51K-$88K (Glassdoor est.),Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,,-1,-1,-1,-1,-1,-1,-1
90,Data Analyst,$51K-$87K (Glassdoor est.),NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,,-1,-1,-1,-1,-1,-1,-1
109,Data Analyst,$51K-$87K (Glassdoor est.),"Data Analyst\r\n\r\nJersey City, NJ\r\n\r\n12+...",,Vertex Intel Systems,"Jersey City, NJ",,,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,$49K-$91K (Glassdoor est.),Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,,-1,-1,-1,-1,-1,-1,-1
2202,Financial Data Analyst,$49K-$91K (Glassdoor est.),Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,,-1,-1,-1,-1,-1,-1,-1
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,-1,-1,-1,-1,-1,-1,-1
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,-1,-1,-1,-1,-1,-1,-1


In [12]:
# df.columns[8]
print(df['Founded'].dtype)
df['Founded'].value_counts(dropna=False)
df['Founded'] = df[df.columns[8]].replace(-1,np.nan)
df['Founded'].value_counts(dropna=False)
df['Founded'].isnull().mean()
df[df['Founded'].isnull()]  # handle these missing values later

int64


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
11,Data Analyst,$37K-$66K (Glassdoor est.),BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1
15,Sustainability Data Analyst,$37K-$66K (Glassdoor est.),Job Description\r\nRole Description\r\n\r\nSus...,3.6,CodeGreen Solutions\r\n3.6,"New York, NY","New York, NY",1 to 50 employees,,Company - Private,Building & Personnel Services,Business Services,Unknown / Non-Applicable,-1,-1
21,Data Science Analyst,$37K-$66K (Glassdoor est.),"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,,Company - Public,-1,-1,Unknown / Non-Applicable,-1,-1
23,Data Analyst,$37K-$66K (Glassdoor est.),Haven Life is an insurtech innovator at MassMu...,3.5,Andiamo\r\n3.5,"New York, NY","Warren, MI",201 to 500 employees,,Company - Private,Casual Restaurants,"Restaurants, Bars & Food Services",$1 to $5 million (USD),-1,-1
24,Entry Level / Jr. Data Analyst,$37K-$66K (Glassdoor est.),Dash Technologies is an industry leading softw...,3.8,Dash Technologies Inc\r\n3.8,"New York, NY","Columbus, OH",1 to 50 employees,,Unknown,-1,-1,Unknown / Non-Applicable,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2244,"Data Security Analyst, Sr",$78K-$104K (Glassdoor est.),Contract Duration: 9 Months\r\n\r\nLocation: D...,2.5,"Avacend, Inc.\r\n2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,-1,-1,-1,-1,-1,-1
2247,Marketing/Communications - Data Analyst-Marketing,$78K-$104K (Glassdoor est.),Job Description\r\nJob Title: Marketing/Commun...,4.1,APN Software Services Inc.\r\n4.1,"Broomfield, CO","Newark, CA",51 to 200 employees,,Company - Private,Computer Hardware & Software,Information Technology,$25 to $50 million (USD),-1,-1
2248,RQS - IHHA - 201900004460 -1q Data Security An...,$78K-$104K (Glassdoor est.),Maintains systems to protect data from unautho...,2.5,"Avacend, Inc.\r\n2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1


In [13]:
# df.columns[9]
print(df['Type of ownership'].dtype)
df['Type of ownership'].value_counts(dropna=False)
df['Type of ownership'] = df[df.columns[9]].replace('-1',np.nan)
df['Type of ownership'].isnull().mean()
df[df['Type of ownership'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,-1,-1,-1,-1,-1
55,Data Reporting Analyst,$46K-$87K (Glassdoor est.),OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,,,,-1,-1,-1,-1,-1
68,Data Science Analyst,$51K-$88K (Glassdoor est.),Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,,,,-1,-1,-1,-1,-1
90,Data Analyst,$51K-$87K (Glassdoor est.),NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,,,,-1,-1,-1,-1,-1
109,Data Analyst,$51K-$87K (Glassdoor est.),"Data Analyst\r\n\r\nJersey City, NJ\r\n\r\n12+...",,Vertex Intel Systems,"Jersey City, NJ",,,,,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,$49K-$91K (Glassdoor est.),Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,,,,-1,-1,-1,-1,-1
2202,Financial Data Analyst,$49K-$91K (Glassdoor est.),Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,,,,-1,-1,-1,-1,-1
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,-1,-1,-1,-1,-1
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,-1,-1,-1,-1,-1


In [19]:
# df.columns[10]
print(df['Industry'].dtype)
df['Industry'].value_counts(dropna=False)
df['Industry'] = df[df.columns[10]].replace('-1',np.nan)
df['Industry'].isnull().mean()
df[df['Industry'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
11,Data Analyst,$37K-$66K (Glassdoor est.),BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,,Company - Private,,,Unknown / Non-Applicable,,0
21,Data Science Analyst,$37K-$66K (Glassdoor est.),"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,,Company - Public,,,Unknown / Non-Applicable,,0
24,Entry Level / Jr. Data Analyst,$37K-$66K (Glassdoor est.),Dash Technologies is an industry leading softw...,3.8,Dash Technologies Inc\r\n3.8,"New York, NY","Columbus, OH",1 to 50 employees,,Unknown,,,Unknown / Non-Applicable,,0
32,Data Analyst,$46K-$87K (Glassdoor est.),Job Description:\r\nLegal experience is requir...,3.5,Pozent\r\n3.5,"New York, NY","Piscataway, NJ",1 to 50 employees,,Contract,,,Less than $1 million (USD),,0
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2220,"MDG Functional Data Analyst-Greenwood Village, CO",$57K-$100K (Glassdoor est.),MDG Functional Data Analyst\r\nLocation: Green...,5.0,Georgia IT Inc.\r\n5.0,"Greenwood Village, Arapahoe, CO","Alpharetta, GA",1 to 50 employees,,Company - Private,,,Less than $1 million (USD),,0
2234,"Data Analyst 3, Customer Experience - Centennial",$57K-$100K (Glassdoor est.),Business Unit: Summary Responsible for working...,3.6,Comcast\r\n3.6,"Englewood, CO","Philadelphia, PA",10000+ employees,1963.0,Company - Public,,,$10+ billion (USD),"AT&T, Verizon",0
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,,,,,0
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,,,,0


In [20]:
# df.columns[11]
print(df['Sector'].dtype)
df['Sector'].value_counts(dropna=False)
df['Sector'] = df[df.columns[11]].replace('-1',np.nan)
df['Sector'].isnull().mean()
df[df['Sector'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
11,Data Analyst,$37K-$66K (Glassdoor est.),BulbHead is currently seeking a Data Analyst t...,,BulbHead,"Fairfield, NJ",,1 to 50 employees,,Company - Private,,,Unknown / Non-Applicable,,0
21,Data Science Analyst,$37K-$66K (Glassdoor est.),"Job Description\r\nOur client, a music streami...",,MUSIC & Entertainment,"New York, NY","Marina del Rey, CA",Unknown,,Company - Public,,,Unknown / Non-Applicable,,0
24,Entry Level / Jr. Data Analyst,$37K-$66K (Glassdoor est.),Dash Technologies is an industry leading softw...,3.8,Dash Technologies Inc\r\n3.8,"New York, NY","Columbus, OH",1 to 50 employees,,Unknown,,,Unknown / Non-Applicable,,0
32,Data Analyst,$46K-$87K (Glassdoor est.),Job Description:\r\nLegal experience is requir...,3.5,Pozent\r\n3.5,"New York, NY","Piscataway, NJ",1 to 50 employees,,Contract,,,Less than $1 million (USD),,0
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2220,"MDG Functional Data Analyst-Greenwood Village, CO",$57K-$100K (Glassdoor est.),MDG Functional Data Analyst\r\nLocation: Green...,5.0,Georgia IT Inc.\r\n5.0,"Greenwood Village, Arapahoe, CO","Alpharetta, GA",1 to 50 employees,,Company - Private,,,Less than $1 million (USD),,0
2234,"Data Analyst 3, Customer Experience - Centennial",$57K-$100K (Glassdoor est.),Business Unit: Summary Responsible for working...,3.6,Comcast\r\n3.6,"Englewood, CO","Philadelphia, PA",10000+ employees,1963.0,Company - Public,,,$10+ billion (USD),"AT&T, Verizon",0
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,,,,,0
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,,,,0


In [21]:
# df.columns[12]
print(df['Revenue'].dtype)
df['Revenue'].value_counts(dropna=False)
df['Revenue'] = df[df.columns[12]].replace('-1',np.nan)
df['Revenue'].isnull().mean()
df[df['Revenue'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
34,Data Analyst (Games),$46K-$87K (Glassdoor est.),Carry1st is the leading mobile game publisher ...,,Carry1st,"New York, NY",,,,,,,,,0
55,Data Reporting Analyst,$46K-$87K (Glassdoor est.),OverviewThe Data Analyst is a new position in ...,,NADAP NYS INC.,"New York, NY",,,,,,,,,0
68,Data Science Analyst,$51K-$88K (Glassdoor est.),Job Details\r\n\r\nLevel\r\n\r\nExperienced\r\...,,Greater New York Mutual Insurance Companies (GNY),"New York, NY",,,,,,,,,0
90,Data Analyst,$51K-$87K (Glassdoor est.),NYU Grossman School of Medicine is one of the ...,,NYU Langone Medical Center,"New York, NY",,,,,,,,,0
109,Data Analyst,$51K-$87K (Glassdoor est.),"Data Analyst\r\n\r\nJersey City, NJ\r\n\r\n12+...",,Vertex Intel Systems,"Jersey City, NJ",,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,$49K-$91K (Glassdoor est.),Role Data Analyst Duration12+ months Location ...,,"TechAspect Solutions, Inc. dba TA Digital","Centennial, CO",,,,,,,,,0
2202,Financial Data Analyst,$49K-$91K (Glassdoor est.),Position:Financial Data AnalystJob Description...,,Black Knight Financial Technology Solutions,"Denver, CO",,,,,,,,,0
2239,Senior Contract Data Analyst,$78K-$104K (Glassdoor est.),OverviewAmyx is seeking to hire a Senior Contr...,,"Amyx, Iinc.","Aurora, CO",,,,,,,,,0
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,,,,0


In [24]:
# df.columns[13]
print(df['Competitors'].dtype)
df['Competitors'].value_counts(dropna=False)
df['Competitors'] = df['Competitors'].replace('-1',np.nan)
df['Competitors'].isnull().mean()
df[df['Competitors'].isnull()]  # handle these missing values later

object


Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\r\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961.0,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),,1
1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\r\n\r\nProvides analytical and techni...,3.8,Visiting Nurse Service of New York\r\n3.8,"New York, NY","New York, NY",10000+ employees,1893.0,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),,0
3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\r\nRemote:Yes\r\n...,4.1,Celerity\r\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002.0,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),,0
5,Data Analyst,$37K-$66K (Glassdoor est.),About Cubist\r\nCubist Systematic Strategies i...,3.9,Point72\r\n3.9,"New York, NY","Stamford, CT",1001 to 5000 employees,2014.0,Company - Private,Investment Banking & Asset Management,Finance,Unknown / Non-Applicable,,0
6,Business/Data Analyst (FP&A),$37K-$66K (Glassdoor est.),Two Sigma is a different kind of investment ma...,4.4,Two Sigma\r\n4.4,"New York, NY","New York, NY",1001 to 5000 employees,2001.0,Company - Private,Investment Banking & Asset Management,Finance,Unknown / Non-Applicable,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2246,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),Spiceorb is looking for Technical Business Ana...,,Spiceorb,"Denver, CO",,,,,,,,,0
2247,Marketing/Communications - Data Analyst-Marketing,$78K-$104K (Glassdoor est.),Job Description\r\nJob Title: Marketing/Commun...,4.1,APN Software Services Inc.\r\n4.1,"Broomfield, CO","Newark, CA",51 to 200 employees,,Company - Private,Computer Hardware & Software,Information Technology,$25 to $50 million (USD),,0
2248,RQS - IHHA - 201900004460 -1q Data Security An...,$78K-$104K (Glassdoor est.),Maintains systems to protect data from unautho...,2.5,"Avacend, Inc.\r\n2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,,0
2250,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),"Title: Technical Business Analyst (SQL, Data a...",,Spiceorb,"Denver, CO",,,,,,,,,0
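
The column-by-column cells above repeat one pattern: replace the `-1` placeholder with `np.nan`, then inspect the share of missing values. That pattern could be folded into a small helper, sketched here under the assumption that `-1` and `'-1'` are the only placeholders. `replace_sentinels` and the `toy` frame are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def replace_sentinels(frame, columns):
    """Return a copy of `frame` with -1 / '-1' placeholders set to NaN in `columns`."""
    cleaned = frame.copy()
    for col in columns:
        # Same Series.replace call as in the per-column cells, applied in a loop.
        cleaned[col] = cleaned[col].replace(["-1", -1], np.nan)
    return cleaned

# Small made-up frame with one numeric and one text placeholder.
toy = pd.DataFrame({
    "Rating": [3.2, -1.0, 4.1],
    "Industry": ["Internet", "-1", "IT Services"],
    "Location": ["New York, NY", "Denver, CO", "Aurora, CO"],
})
cleaned = replace_sentinels(toy, ["Rating", "Industry"])
print(cleaned.isnull().mean())
```

Returning a copy keeps the original frame available for comparison while the cleaning rules are still being worked out.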


In [23]:
# df.columns[14]
print(df['Easy Apply'].dtype)
df['Easy Apply'].value_counts()
df['Easy Apply'] = df['Easy Apply'].replace('-1',0)
df['Easy Apply'] = df['Easy Apply'].replace('True',1)

int64


In [18]:
# Salary Estimate: split into lower and upper columns and convert to int
# Do the same for the Size column
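
One possible way to split the Size column into bounds is sketched below. It assumes the values look like those in the output above (`201 to 500 employees`, `10000+ employees`, `Unknown`); `split_size`, `lower_size`, and `upper_size` are hypothetical names, and leaving the upper bound empty for open-ended sizes is a design choice rather than something the notebook specifies.

```python
import pandas as pd

def split_size(sizes):
    """Extract lower and upper employee counts from strings like '201 to 500 employees'."""
    # First group: the leading number. Second group (optional): the number after 'to'.
    bounds = sizes.str.extract(r"(\d+)(?:\s+to\s+(\d+))?")
    bounds.columns = ["lower_size", "upper_size"]
    # 'Unknown' and missing values yield NaN in both columns; '10000+' has no upper bound.
    return bounds.apply(pd.to_numeric)

sample_sizes = pd.Series(["201 to 500 employees", "10000+ employees", "Unknown"])
size_bounds = split_size(sample_sizes)
print(size_bounds)
```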