**Introduction**

Glassdoor, a platform for employee reviews and job listings, holds data that can offer insight into job markets and workplace experiences. Before that data can support analysis and decision-making, its cleanliness and reliability need to be addressed. In this project, we clean a dataset of data science job listings scraped from Glassdoor: we remove inconsistencies, handle missing values, and standardize fields so that the raw web data becomes usable for business and employment-related decisions.

**Importing necessary libraries**

In [2]:
import pandas as pd
import numpy as np

**Loading the dataset and understanding the data**

In [3]:
#Load dataset
df = pd.read_csv(r"C:\Users\vishy\Downloads\datacleaning\Uncleaned_DS_jobs.csv")
df

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,667,Data Scientist,$105K-$167K (Glassdoor est.),Summary\n\nWe’re looking for a data scientist ...,3.6,TRANZACT\n3.6,"Fort Lee, NJ","Fort Lee, NJ",1001 to 5000 employees,1989,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,-1
668,668,Data Scientist,$105K-$167K (Glassdoor est.),Job Description\nBecome a thought leader withi...,-1.0,JKGT,"San Francisco, CA",-1,-1,-1,-1,-1,-1,-1,-1
669,669,Data Scientist,$105K-$167K (Glassdoor est.),Join a thriving company that is changing the w...,-1.0,AccessHope,"Irwindale, CA",-1,-1,-1,-1,-1,-1,-1,-1
670,670,Data Scientist,$105K-$167K (Glassdoor est.),100 Remote Opportunity As an AINLP Data Scient...,5.0,ChaTeck Incorporated\n5.0,"San Francisco, CA","Santa Clara, CA",1 to 50 employees,-1,Company - Private,Advertising & Marketing,Business Services,$1 to $5 million (USD),-1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              672 non-null    int64  
 1   Job Title          672 non-null    object 
 2   Salary Estimate    672 non-null    object 
 3   Job Description    672 non-null    object 
 4   Rating             672 non-null    float64
 5   Company Name       672 non-null    object 
 6   Location           672 non-null    object 
 7   Headquarters       672 non-null    object 
 8   Size               672 non-null    object 
 9   Founded            672 non-null    int64  
 10  Type of ownership  672 non-null    object 
 11  Industry           672 non-null    object 
 12  Sector             672 non-null    object 
 13  Revenue            672 non-null    object 
 14  Competitors        672 non-null    object 
dtypes: float64(1), int64(2), object(12)
memory usage: 78.9+ KB


In [5]:
df.describe()

Unnamed: 0,index,Rating,Founded
count,672.0,672.0,672.0
mean,335.5,3.518601,1635.529762
std,194.133974,1.410329,756.74664
min,0.0,-1.0,-1.0
25%,167.75,3.3,1917.75
50%,335.5,3.8,1995.0
75%,503.25,4.3,2009.0
max,671.0,5.0,2019.0
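
The summary above hints at a placeholder value: `Founded` has a minimum of -1 and a mean of about 1636, which is not a plausible founding year. A minimal sketch of how such -1 placeholders could be counted per column is shown here. It uses a small hypothetical frame named `sample` rather than the real dataset, and its values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the scraped data; the values are illustrative only.
sample = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Founded": [1993, -1, -1],
    "Headquarters": ["New York, NY", "-1", "Herndon, VA"],
})

# Convert every cell to text so numeric and string columns can be checked
# the same way, then count cells equal to the placeholder spellings.
as_text = sample.astype(str)
sentinel_counts = as_text.isin(["-1", "-1.0"]).sum()
print(sentinel_counts)
```

Running the same check on the full DataFrame would show which columns need the placeholder handled in the sections that follow.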


**1) Job Title**

Two helper functions, title_simplifier and seniority, preprocess the job titles: the first maps each title to a simplified category, and the second assigns a seniority level. This makes the job title data easier to analyze and visualize in a structured way. Around 20 percent of the job titles contain extra information, which these functions standardize into more readable values for analysis.

In [6]:
df["Job Title"].value_counts()

Data Scientist                                            337
Data Engineer                                              26
Senior Data Scientist                                      19
Machine Learning Engineer                                  16
Data Analyst                                               12
                                                         ... 
Data Science Instructor                                     1
Business Data Analyst                                       1
Purification Scientist                                      1
Data Engineer, Enterprise Analytics                         1
AI/ML - Machine Learning Scientist, Siri Understanding      1
Name: Job Title, Length: 172, dtype: int64

In [7]:
def title_simplifier(title):
    # Map a raw job title to a coarse category; the first matching keyword wins.
    title = title.lower()
    if 'data scientist' in title:
        return 'data scientist'
    elif 'data engineer' in title:
        return 'data engineer'
    elif 'analyst' in title:
        return 'analyst'
    elif 'machine learning' in title:
        return 'mle'
    elif 'manager' in title:
        return 'manager'
    elif 'director' in title:
        return 'director'
    else:
        return 'na'

def seniority(title):
    # Substring checks: 'sr' also covers 'sr.', and 'jr' also covers 'jr.'.
    title = title.lower()
    if 'sr' in title or 'senior' in title or 'lead' in title or 'principal' in title:
        return 'senior'
    elif 'jr' in title:
        return 'jr'
    else:
        return 'na'

In [8]:
df['job_simp'] = df['Job Title'].apply(title_simplifier)
df.job_simp.value_counts()

data scientist    455
na                 69
analyst            55
data engineer      47
mle                36
manager             7
director            3
Name: job_simp, dtype: int64

In [9]:
df['seniority'] = df['Job Title'].apply(seniority)
df.seniority.value_counts()

na        576
senior     94
jr          2
Name: seniority, dtype: int64
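
The substring checks used above also match inside longer words (for example, `'sr'` would match any title that happens to contain those two letters). The following is a sketch of a variant, not the notebook's own method: it keeps the same keyword order for categories but uses regular-expression word boundaries for seniority. The names `TITLE_RULES`, `simplify_title`, and `seniority_level` are hypothetical and were introduced for this illustration.

```python
import re

# Ordered (keyword, label) pairs; the first match wins, mirroring the if/elif chain.
TITLE_RULES = [
    ("data scientist", "data scientist"),
    ("data engineer", "data engineer"),
    ("analyst", "analyst"),
    ("machine learning", "mle"),
    ("manager", "manager"),
    ("director", "director"),
]

# \b requires the keyword to appear as a whole token, not inside another word.
SENIOR_RE = re.compile(r"\b(sr|senior|lead|principal)\b", re.IGNORECASE)
JUNIOR_RE = re.compile(r"\bjr\b", re.IGNORECASE)


def simplify_title(title):
    """Return the label of the first keyword found in the title, else 'na'."""
    lowered = title.lower()
    for keyword, label in TITLE_RULES:
        if keyword in lowered:
            return label
    return "na"


def seniority_level(title):
    """Return 'senior', 'jr', or 'na' based on whole-word matches."""
    if SENIOR_RE.search(title):
        return "senior"
    if JUNIOR_RE.search(title):
        return "jr"
    return "na"


for example in ["Sr Data Scientist", "Senior Machine Learning Engineer", "Purification Scientist"]:
    print(example, "->", simplify_title(example), "/", seniority_level(example))
```

Because the matching rules differ slightly, the counts from this variant may not equal the ones shown above for every dataset.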

**2) Salary Estimate**

The original "Salary Estimate" column contained values like "$137K-$171K (Glassdoor est.)". We applied a lambda function that keeps only the text before the opening parenthesis, so the values become "$137K-$171K" and the "(Glassdoor est.)" or "(Employer est.)" suffix is removed. The column could also be converted to an integer data type if needed.

In [10]:
df['Salary Estimate'].unique()

array(['$137K-$171K (Glassdoor est.)', '$75K-$131K (Glassdoor est.)',
       '$79K-$131K (Glassdoor est.)', '$99K-$132K (Glassdoor est.)',
       '$90K-$109K (Glassdoor est.)', '$101K-$165K (Glassdoor est.)',
       '$56K-$97K (Glassdoor est.)', '$79K-$106K (Glassdoor est.)',
       '$71K-$123K (Glassdoor est.)', '$90K-$124K (Glassdoor est.)',
       '$91K-$150K (Glassdoor est.)', '$141K-$225K (Glassdoor est.)',
       '$145K-$225K(Employer est.)', '$79K-$147K (Glassdoor est.)',
       '$122K-$146K (Glassdoor est.)', '$112K-$116K (Glassdoor est.)',
       '$110K-$163K (Glassdoor est.)', '$124K-$198K (Glassdoor est.)',
       '$79K-$133K (Glassdoor est.)', '$69K-$116K (Glassdoor est.)',
       '$31K-$56K (Glassdoor est.)', '$95K-$119K (Glassdoor est.)',
       '$212K-$331K (Glassdoor est.)', '$66K-$112K (Glassdoor est.)',
       '$128K-$201K (Glassdoor est.)', '$138K-$158K (Glassdoor est.)',
       '$80K-$132K (Glassdoor est.)', '$87K-$141K (Glassdoor est.)',
       '$92K-$155K (Glassdo

In [11]:
df['Salary Estimate'].apply(lambda x: x.split('(')[0]).unique()

array(['$137K-$171K ', '$75K-$131K ', '$79K-$131K ', '$99K-$132K ',
       '$90K-$109K ', '$101K-$165K ', '$56K-$97K ', '$79K-$106K ',
       '$71K-$123K ', '$90K-$124K ', '$91K-$150K ', '$141K-$225K ',
       '$145K-$225K', '$79K-$147K ', '$122K-$146K ', '$112K-$116K ',
       '$110K-$163K ', '$124K-$198K ', '$79K-$133K ', '$69K-$116K ',
       '$31K-$56K ', '$95K-$119K ', '$212K-$331K ', '$66K-$112K ',
       '$128K-$201K ', '$138K-$158K ', '$80K-$132K ', '$87K-$141K ',
       '$92K-$155K ', '$105K-$167K '], dtype=object)

In [12]:
# Keep the text before '(' and strip the trailing space left behind by the split.
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.split('(')[0].strip())
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,job_simp,seniority
0,0,Sr Data Scientist,$137K-$171K,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",data scientist,senior
1,1,Data Scientist,$137K-$171K,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,data scientist,na
2,2,Data Scientist,$137K-$171K,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1,data scientist,na
3,3,Data Scientist,$137K-$171K,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",data scientist,na
4,4,Data Scientist,$137K-$171K,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",data scientist,na
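
As noted above, the salary range could be turned into numeric columns if needed. A minimal sketch of one way to do that follows, assuming every value has the form "$<low>K-$<high>K" as in the output shown earlier. The column names `min_salary_k`, `max_salary_k`, and `avg_salary_k` are hypothetical.

```python
import pandas as pd

# Illustrative values copied from the salary strings shown above.
salaries = pd.Series(["$137K-$171K ", "$145K-$225K", "$31K-$56K "])

# Capture the two numbers on either side of the hyphen (in thousands of dollars).
bounds = salaries.str.extract(r"\$(\d+)K-\$(\d+)K").astype(int)
bounds.columns = ["min_salary_k", "max_salary_k"]

# Midpoint of the range, useful as a single numeric target.
bounds["avg_salary_k"] = (bounds["min_salary_k"] + bounds["max_salary_k"]) / 2
print(bounds)
```

If any value did not follow this pattern, `str.extract` would return missing values and the integer conversion would fail, so the format should be checked first.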


**3) Job Description**

The "Job Description" column can be split on newline (\n) characters, which helps with NLP analysis and with readability. Using the text of each description, we also added binary features for skills such as python, excel, and aws, so that a machine learning model can see which skills each job listing mentions.

In [13]:
df['Job Description'][2].split('\n')

['Overview',
 '',
 '',
 'Analysis Group is one of the largest international economics consulting firms, with more than 1,000 professionals across 14 offices in North America, Europe, and Asia. Since 1981, we have provided expertise in economics, finance, health care analytics, and strategy to top law firms, Fortune Global 500 companies, and government agencies worldwide. Our internal experts, together with our network of affiliated experts from academia, industry, and government, offer our clients exceptional breadth and depth of expertise.',
 '',
 'We are currently seeking a Data Scientist to join our team. The ideal candidate should be passionate about working on cutting edge research and analytical services for Fortune 500 companies, global pharma/biotech firms and leaders in industries such as finance, energy and life sciences. The Data Scientist will be a contributing member to client engagements and have the opportunity to work with our network of world-class experts and thought 

In [14]:
# Flag each skill with 1 if its keyword appears anywhere in the description, else 0.
skills = {'python': 'python', 'excel': 'excel', 'hadoop': 'hadoop', 'spark': 'spark',
          'aws': 'aws', 'tableau': 'tableau', 'big_data': 'big data'}
for column, keyword in skills.items():
    df[column] = df['Job Description'].apply(lambda x, kw=keyword: 1 if kw in x.lower() else 0)
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
0,0,Sr Data Scientist,$137K-$171K,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,...,"EmblemHealth, UnitedHealth Group, Aetna",data scientist,senior,0,0,0,0,1,0,0
1,1,Data Scientist,$137K-$171K,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,...,-1,data scientist,na,0,0,1,0,0,0,1
2,2,Data Scientist,$137K-$171K,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,...,-1,data scientist,na,1,1,0,0,1,0,0
3,3,Data Scientist,$137K-$171K,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,...,"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",data scientist,na,1,1,0,0,1,0,0
4,4,Data Scientist,$137K-$171K,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,...,"Commerce Signals, Cardlytics, Yodlee",data scientist,na,1,1,0,0,0,0,0
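
One caveat with plain substring checks is that a keyword such as "excel" also matches words like "excellent". The sketch below compares the substring check with a whole-word check on a few made-up descriptions. The helper `skill_flag` and the sample sentences are hypothetical and not part of the original notebook.

```python
import re

import pandas as pd

# Made-up descriptions for illustration only.
descriptions = pd.Series([
    "Excellent communication skills required.",
    "Experience with Excel and Python.",
    "Build pipelines on AWS with Spark.",
])


def skill_flag(text_series, keyword):
    """Return 1 where the keyword appears as a whole word (case-insensitive), else 0."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return text_series.str.contains(pattern, case=False, regex=True).astype(int)


# Substring check, as in the cell above: also fires on "Excellent".
excel_substring = descriptions.apply(lambda x: 1 if "excel" in x.lower() else 0)

# Whole-word check: only fires on the standalone word "Excel".
excel_word = skill_flag(descriptions, "excel")

print(pd.DataFrame({"substring": excel_substring, "whole_word": excel_word}))
```

Whether the stricter check is preferable depends on the keywords; it would, for instance, stop matching plurals unless they are listed explicitly.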


**4) Rating**

The Rating column contained -1.0 for some job listings (50 rows), which appears to be a placeholder for a missing rating. We have replaced those values with zero.

In [15]:
df.sort_values(by='Rating', ascending=True)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
457,457,Data Scientist,$69K-$116K,Work in a fast growing startup with unlimited ...,-1.0,Stride Search,"San Francisco, CA","Westlake Village, CA",1 to 50 employees,-1,...,-1,data scientist,na,1,0,0,0,0,0,0
425,425,Data Scientist,$124K-$198K,"Job Description\nSelecting features, building ...",-1.0,Microagility,"New York, NY","Princeton, NJ",1 to 50 employees,-1,...,-1,data scientist,na,0,1,0,0,0,0,0
351,351,Data Scientist,$122K-$146K,About Our AI/ML Team\n\nOur mission is to buil...,-1.0,Point72 Ventures,"Palo Alto, CA",-1,-1,-1,...,-1,data scientist,na,0,0,0,0,0,0,0
504,504,Data Scientist,$95K-$119K,Job Description\nWorking at Sophinea\n\nSophin...,-1.0,Sophinea,"Chantilly, VA",-1,1 to 50 employees,-1,...,-1,data scientist,na,0,0,0,0,0,0,0
500,500,Data Scientist,$95K-$119K,Job Overview: The Data Scientist is a key memb...,-1.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,209,Data Scientist - TS/SCI FSP or CI Required,$79K-$106K,US Citizenship Required and (TS/SCI with FSP o...,5.0,Phoenix Operations Group\n5.0,"Annapolis Junction, MD","Woodbine, MD",1 to 50 employees,2011,...,-1,data scientist,na,1,0,0,1,0,0,1
405,405,Senior Machine Learning Engineer,$110K-$163K,We are looking for an experienced engineer wit...,5.0,LifeOmic\n5.0,"Raleigh, NC","Indianapolis, IN",51 to 200 employees,2016,...,-1,mle,senior,0,0,0,1,1,0,0
113,113,Data Engineer,$99K-$132K,"At Phantom AI, experience the fast paced envir...",5.0,Phantom AI\n5.0,"Burlingame, CA","Burlingame, CA",1 to 50 employees,2016,...,-1,data engineer,na,1,0,0,0,0,0,0
355,355,Data Scientist - TS/SCI FSP or CI Required,$122K-$146K,US Citizenship Required and (TS/SCI with FSP o...,5.0,Phoenix Operations Group\n5.0,"Annapolis Junction, MD","Woodbine, MD",1 to 50 employees,2011,...,-1,data scientist,na,1,0,0,1,0,0,1


In [16]:
df[df['Rating']==-1.0].shape

(50, 24)

In [17]:
df.Rating = np.where(df.Rating==-1.0,0,df.Rating)

In [18]:
df.sort_values(by='Rating', ascending=True)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
457,457,Data Scientist,$69K-$116K,Work in a fast growing startup with unlimited ...,0.0,Stride Search,"San Francisco, CA","Westlake Village, CA",1 to 50 employees,-1,...,-1,data scientist,na,1,0,0,0,0,0,0
425,425,Data Scientist,$124K-$198K,"Job Description\nSelecting features, building ...",0.0,Microagility,"New York, NY","Princeton, NJ",1 to 50 employees,-1,...,-1,data scientist,na,0,1,0,0,0,0,0
351,351,Data Scientist,$122K-$146K,About Our AI/ML Team\n\nOur mission is to buil...,0.0,Point72 Ventures,"Palo Alto, CA",-1,-1,-1,...,-1,data scientist,na,0,0,0,0,0,0,0
504,504,Data Scientist,$95K-$119K,Job Description\nWorking at Sophinea\n\nSophin...,0.0,Sophinea,"Chantilly, VA",-1,1 to 50 employees,-1,...,-1,data scientist,na,0,0,0,0,0,0,0
500,500,Data Scientist,$95K-$119K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,209,Data Scientist - TS/SCI FSP or CI Required,$79K-$106K,US Citizenship Required and (TS/SCI with FSP o...,5.0,Phoenix Operations Group\n5.0,"Annapolis Junction, MD","Woodbine, MD",1 to 50 employees,2011,...,-1,data scientist,na,1,0,0,1,0,0,1
405,405,Senior Machine Learning Engineer,$110K-$163K,We are looking for an experienced engineer wit...,5.0,LifeOmic\n5.0,"Raleigh, NC","Indianapolis, IN",51 to 200 employees,2016,...,-1,mle,senior,0,0,0,1,1,0,0
113,113,Data Engineer,$99K-$132K,"At Phantom AI, experience the fast paced envir...",5.0,Phantom AI\n5.0,"Burlingame, CA","Burlingame, CA",1 to 50 employees,2016,...,-1,data engineer,na,1,0,0,0,0,0,0
355,355,Data Scientist - TS/SCI FSP or CI Required,$122K-$146K,US Citizenship Required and (TS/SCI with FSP o...,5.0,Phoenix Operations Group\n5.0,"Annapolis Junction, MD","Woodbine, MD",1 to 50 employees,2011,...,-1,data scientist,na,1,0,0,1,0,0,1
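
Setting the placeholder to zero keeps the column numeric, but the zeros then take part in any average. The sketch below contrasts that choice with marking the ratings as missing (`NaN`), using a small made-up series. It is an illustration of the trade-off under that assumption, not a change to the notebook's approach.

```python
import numpy as np
import pandas as pd

# Made-up ratings; -1.0 stands for "no rating available".
ratings = pd.Series([3.1, -1.0, 4.2, -1.0])

as_zero = ratings.replace(-1.0, 0.0)        # the choice made in this notebook
as_missing = ratings.replace(-1.0, np.nan)  # alternative: treat as missing

print(as_zero.mean())     # zeros are included in the average
print(as_missing.mean())  # NaN values are skipped by default
```

Which representation fits better depends on how the Rating column is used downstream.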


**5) Company Name**

In the Company Name column, each company's rating was appended to its name after a newline character. We removed the rating and kept only the name of the company.

In [19]:
df['Company Name'] = df['Company Name'].apply(lambda x: x.split('\n')[0])
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
0,0,Sr Data Scientist,$137K-$171K,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,...,"EmblemHealth, UnitedHealth Group, Aetna",data scientist,senior,0,0,0,0,1,0,0
1,1,Data Scientist,$137K-$171K,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,...,-1,data scientist,na,0,0,1,0,0,0,1
2,2,Data Scientist,$137K-$171K,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,...,-1,data scientist,na,1,1,0,0,1,0,0
3,3,Data Scientist,$137K-$171K,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,...,"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",data scientist,na,1,1,0,0,1,0,0
4,4,Data Scientist,$137K-$171K,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200 employees,1998,...,"Commerce Signals, Cardlytics, Yodlee",data scientist,na,1,1,0,0,0,0,0


**6) Headquarters**

Some entries in the Headquarters column contained the value -1. We have changed those to Unknown.

In [20]:
df.sort_values(by='Headquarters', ascending=True)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
496,496,Data Scientist,$95K-$119K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
360,360,Data Scientist,$122K-$146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
613,613,Data Scientist,$87K-$141K,DESCRIPTION\n\nGrainBridge is seeking a talent...,0.0,"GrainBridge, LLC","Omaha, NE",-1,-1,-1,...,-1,data scientist,na,1,1,0,0,1,0,1
497,497,Data Scientist,$95K-$119K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
158,158,Machine Learning Engineer,$101K-$165K,Overview\n\nRadical Convergence is a fast-pace...,0.0,Radical Convergence,"Reston, VA",-1,-1,-1,...,-1,mle,na,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
507,507,Data Scientist - TS/SCI Required,$95K-$119K,US Citizenship Required and (TS or TS/SCI) Req...,5.0,Phoenix Operations Group,"Baltimore, MD","Woodbine, MD",1 to 50 employees,2011,...,-1,data scientist,na,1,0,0,1,0,0,1
212,212,Data Scientist - TS/SCI Required,$79K-$106K,US Citizenship Required and (TS or TS/SCI) Req...,5.0,Phoenix Operations Group,"Baltimore, MD","Woodbine, MD",1 to 50 employees,2011,...,-1,data scientist,na,1,0,0,1,0,0,1
225,225,Data Scientist,$71K-$123K,Hello we are looking full timeContract candida...,4.5,Global Data Management Inc,"Hartford, CT","Woodbridge, NJ",51 to 200 employees,2008,...,-1,data scientist,na,1,0,0,0,0,0,0
86,86,Data Analyst,$79K-$131K,What are we looking for in a Data Analyst?\n\n...,2.6,Comprehensive Healthcare,"Yakima, WA","Yakima, WA",501 to 1000 employees,1971,...,-1,analyst,na,0,0,0,0,0,0,0


In [21]:
df['Headquarters'] = df['Headquarters'].str.replace(r'^\s*-?1(\.0+)?\s*$', 'Unknown', regex=True)

In [22]:
df['Headquarters']

0                New York, NY
1                 Herndon, VA
2                  Boston, MA
3      Bad Ragaz, Switzerland
4                New York, NY
                ...          
667              Fort Lee, NJ
668                   Unknown
669                   Unknown
670           Santa Clara, CA
671           Carle Place, NY
Name: Headquarters, Length: 672, dtype: object
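
To make the regular expression used above easier to follow, here is a small sketch that applies the same pattern to a handful of made-up strings. The pattern matches a whole cell consisting of an optional minus sign, the digit 1, an optional ".0" suffix, and optional surrounding whitespace. Note that because the minus sign is optional, a bare "1" is replaced as well.

```python
import pandas as pd

# Made-up cell values for illustration only.
values = pd.Series(["-1", " -1.0 ", "1", "-10", "New York, NY", "1 to 50 employees"])

# Same pattern as in the cell above; ^ and $ force a match on the whole cell.
pattern = r'^\s*-?1(\.0+)?\s*$'
cleaned = values.str.replace(pattern, 'Unknown', regex=True)
print(cleaned.tolist())
```

Values that merely contain a 1, such as "-10" or "1 to 50 employees", are left unchanged because the pattern is anchored at both ends.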

**7) Size**

As with the Headquarters column, some entries in the Size column contained the value -1. We have changed those to Unknown.

In [23]:
df.sort_values(by='Size', ascending=True)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
158,158,Machine Learning Engineer,$101K-$165K,Overview\n\nRadical Convergence is a fast-pace...,0.0,Radical Convergence,"Reston, VA",Unknown,-1,-1,...,-1,mle,na,1,0,0,0,0,0,0
358,358,Data Scientist,$122K-$146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",Unknown,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
357,357,Data Scientist,$122K-$146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",Unknown,-1,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
613,613,Data Scientist,$87K-$141K,DESCRIPTION\n\nGrainBridge is seeking a talent...,0.0,"GrainBridge, LLC","Omaha, NE",Unknown,-1,-1,...,-1,data scientist,na,1,1,0,0,1,0,1
555,555,Data Scientist,$128K-$201K,"Job Description\nAs a Data Scientist, you will...",0.0,HireAi,"San Francisco, CA",Unknown,-1,-1,...,-1,data scientist,na,1,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
615,615,Data Scientist,$87K-$141K,Position Description\n\nSprezzatura seeks a Da...,0.0,Sprezzatura Management Consulting,"Washington, VA","McLean, VA",Unknown,-1,...,-1,data scientist,na,0,1,0,0,0,0,0
193,193,Data Scientist,$56K-$97K,Job Description\nClient JD below:\n\nWe need a...,5.0,SkillSoniq,"San Francisco, CA","Jersey City, NJ",Unknown,-1,...,-1,data scientist,na,1,0,0,0,0,0,0
189,189,Principal Data Scientist - Machine Learning,$56K-$97K,Are you a highly experienced Data Scientist wi...,3.6,Constant Contact,"Waltham, MA","Waltham, MA",Unknown,1995,...,"Drip, iContact, Mailchimp",data scientist,senior,1,1,1,1,1,0,1
524,524,Data Scientist,$212K-$331K,CompuForce is seeking an experienced and highl...,0.0,CompuForce,"New York, NY","New York, NY",Unknown,-1,...,-1,data scientist,na,0,1,1,1,1,0,1


In [24]:
df['Size'] = df['Size'].str.replace(r'^\s*-?1(\.0+)?\s*$', 'Unknown', regex=True)
df['Size']

0       1001 to 5000 employees
1      5001 to 10000 employees
2       1001 to 5000 employees
3        501 to 1000 employees
4          51 to 200 employees
                ...           
667     1001 to 5000 employees
668                    Unknown
669                    Unknown
670          1 to 50 employees
671     1001 to 5000 employees
Name: Size, Length: 672, dtype: object

**8) Founded**

All the -1 values in the Founded column have been changed to Unknown. Because the column now mixes years and text, its dtype becomes object.

In [25]:
df.sort_values(by='Founded', ascending=True)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
335,335,Data Scientist,$79K-$147K,One of the largest health insurers in the nati...,3.4,Solving IT International Inc,"Chicago, IL","Chicago, IL",501 to 1000 employees,-1,...,-1,data scientist,na,0,1,0,0,0,0,0
389,389,Data Scientist,$110K-$163K,"Job Description\nAs a Data Scientist, you will...",0.0,HireAi,"San Francisco, CA",Unknown,Unknown,-1,...,-1,data scientist,na,1,1,0,1,0,1,0
499,499,Data Scientist,$95K-$119K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",Unknown,Unknown,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
500,500,Data Scientist,$95K-$119K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",Unknown,Unknown,-1,...,-1,data scientist,na,1,0,0,1,1,0,0
388,388,Data Scientist,$110K-$163K,"Job Description\nAs a Data Scientist, you will...",0.0,HireAi,"San Francisco, CA",Unknown,Unknown,-1,...,-1,data scientist,na,1,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
444,444,Data Scientist,$79K-$133K,About Hive\n\nHive is a full-stack deep learni...,2.1,Hive (CA),"San Francisco, CA","Los Angeles, CA",Unknown,2019,...,-1,data scientist,na,1,0,1,1,0,0,0
77,77,Data Scientist,$79K-$131K,WHAT WE DO MATTERS:\n\nHere at The Knot Worldw...,3.5,The Knot Worldwide,"Washington, DC","Chevy Chase, MD",1001 to 5000 employees,2019,...,Zola Registry,data scientist,na,1,0,0,1,1,0,0
488,488,Data Scientist,$95K-$119K,WHAT WE DO MATTERS:\n\nHere at The Knot Worldw...,3.5,The Knot Worldwide,"Washington, DC","Chevy Chase, MD",1001 to 5000 employees,2019,...,Zola Registry,data scientist,na,1,0,0,1,1,0,0
43,43,Scientist - Molecular Biology,$75K-$131K,ArsenalBio’s mission is to develop efficacious...,5.0,Arsenal Biosciences,"South San Francisco, CA","South San Francisco, CA",51 to 200 employees,2019,...,-1,na,na,0,0,0,0,0,0,0


In [26]:
df.Founded = np.where(df.Founded==-1.0,'Unknown',df.Founded)
df['Founded']

0         1993
1         1968
2         1981
3         2000
4         1998
        ...   
667       1989
668    Unknown
669    Unknown
670    Unknown
671       1976
Name: Founded, Length: 672, dtype: object
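
If the founding year needs to stay numeric (for example, to sort or to compute with it), one alternative is a nullable integer column in which the placeholder becomes a missing value. The following is a sketch under that assumption, using a made-up series, and it is not what the notebook does above.

```python
import pandas as pd

# Made-up founding years; -1 stands for "unknown".
founded = pd.Series([1993, -1, 1968, -1])

# where() keeps values that satisfy the condition and sets the rest to NaN;
# casting to the nullable "Int64" dtype then stores them as integers plus <NA>.
founded_nullable = founded.where(founded != -1).astype("Int64")
print(founded_nullable)
```

With this representation, numeric operations such as `min()` or `max()` skip the missing entries instead of failing on the text "Unknown".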

**9) Competitors**

The same -1 placeholder appears in other columns as well: Competitors, Type of ownership, Industry, Sector, and Revenue. Those values were also changed to Unknown.

In [27]:
df.sort_values(by='Competitors', ascending=True)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
335,335,Data Scientist,$79K-$147K,One of the largest health insurers in the nati...,3.4,Solving IT International Inc,"Chicago, IL","Chicago, IL",501 to 1000 employees,Unknown,...,-1,data scientist,na,0,1,0,0,0,0,0
419,419,Data Scientist,$124K-$198K,Job Title: Data Scientist\n\nLocation: New Jer...,4.8,InvenTech Info,"Jersey City, NJ","Bengaluru, India",201 to 500 employees,2010,...,-1,data scientist,na,1,0,0,0,0,0,0
421,421,Analytics - Business Assurance Data Analyst,$124K-$198K,Analytics - Business Assurance Data Analyst (C...,4.6,GreatAmerica Financial Services,"Cedar Rapids, IA","Cedar Rapids, IA",501 to 1000 employees,1992,...,-1,analyst,na,0,1,0,0,0,1,0
422,422,Data Scientist,$124K-$198K,Job Description\nJob Description for Data Scie...,4.5,Conflux Systems Inc.,"Winters, TX","Alpharetta, GA",1 to 50 employees,Unknown,...,-1,data scientist,na,1,1,1,1,1,0,0
424,424,Data Scientist,$124K-$198K,"Get To Know Voice\n\nAt Voice, we are on a mis...",3.4,Voice,"Brooklyn, NY",Unknown,Unknown,Unknown,...,-1,data scientist,na,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39,39,Senior Analyst/Data Scientist,$75K-$131K,At Edmunds were driven to make car buying easi...,3.4,Edmunds.com,"Santa Monica, CA","Santa Monica, CA",501 to 1000 employees,1966,...,"TrueCar, Cars.com, Kelley Blue Book",data scientist,senior,1,1,0,0,1,1,0
634,634,Data Scientist,$92K-$155K,"Overview:\n\n\nGood people, working with good ...",2.5,KeHE Distributors,"Naperville, IL","Naperville, IL",5001 to 10000 employees,1954,...,"United Natural Foods, US Foods, DPI Specialty ...",data scientist,na,1,1,1,1,1,1,1
77,77,Data Scientist,$79K-$131K,WHAT WE DO MATTERS:\n\nHere at The Knot Worldw...,3.5,The Knot Worldwide,"Washington, DC","Chevy Chase, MD",1001 to 5000 employees,2019,...,Zola Registry,data scientist,na,1,0,0,1,1,0,0
488,488,Data Scientist,$95K-$119K,WHAT WE DO MATTERS:\n\nHere at The Knot Worldw...,3.5,The Knot Worldwide,"Washington, DC","Chevy Chase, MD",1001 to 5000 employees,2019,...,Zola Registry,data scientist,na,1,0,0,1,1,0,0


In [34]:
# Apply the same placeholder-to-Unknown replacement to each remaining text column.
for col in ['Competitors', 'Type of ownership', 'Industry', 'Sector', 'Revenue']:
    df[col] = df[col].str.replace(r'^\s*-?1(\.0+)?\s*$', 'Unknown', regex=True)

In [35]:
df

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,Competitors,job_simp,seniority,python,excel,hadoop,spark,aws,tableau,big_data
0,Sr Data Scientist,$137K-$171K,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,...,"EmblemHealth, UnitedHealth Group, Aetna",data scientist,senior,0,0,0,0,1,0,0
1,Data Scientist,$137K-$171K,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,...,Unknown,data scientist,na,0,0,1,0,0,0,1
2,Data Scientist,$137K-$171K,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,...,Unknown,data scientist,na,1,1,0,0,1,0,0
3,Data Scientist,$137K-$171K,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,...,"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",data scientist,na,1,1,0,0,1,0,0
4,Data Scientist,$137K-$171K,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,...,"Commerce Signals, Cardlytics, Yodlee",data scientist,na,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,Data Scientist,$105K-$167K,Summary\n\nWe’re looking for a data scientist ...,3.6,TRANZACT,"Fort Lee, NJ","Fort Lee, NJ",1001 to 5000 employees,1989,Company - Private,...,Unknown,data scientist,na,1,1,1,0,0,1,1
668,Data Scientist,$105K-$167K,Job Description\nBecome a thought leader withi...,0.0,JKGT,"San Francisco, CA",Unknown,Unknown,Unknown,Unknown,...,Unknown,data scientist,na,0,0,0,0,0,0,0
669,Data Scientist,$105K-$167K,Join a thriving company that is changing the w...,0.0,AccessHope,"Irwindale, CA",Unknown,Unknown,Unknown,Unknown,...,Unknown,data scientist,na,1,1,1,0,0,1,0
670,Data Scientist,$105K-$167K,100 Remote Opportunity As an AINLP Data Scient...,5.0,ChaTeck Incorporated,"San Francisco, CA","Santa Clara, CA",1 to 50 employees,Unknown,Company - Private,...,Unknown,data scientist,na,1,0,1,1,0,0,1


In [36]:
df.to_csv(r'C:\Users\vishy\OneDrive\Desktop\Cleaned.csv', index=False)