## Data Wrangling of Glassdoor Job Posting Data

### Glassdoor

Glassdoor is an American website where current and former employees anonymously review companies. Headquartered in San Francisco, California, it has additional offices in Chicago, Dublin, London, and São Paulo.

In 2018, the company was acquired by the Japanese Recruit Holdings for US$1.2 billion, and it continues to operate as an independent subsidiary.

### Kaggle
Kaggle is a data science competition platform and online community of data scientists and machine learning practitioners under Google LLC. Kaggle enables users to find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

### Dataset
The following dataset is provided by RASHIK RAHMAN, it was primarily webscrapped from https://www.glassdoor.co.in. You can find the details about the dataset on the [Kaggle](https://www.kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor?select=Uncleaned_DS_jobs.csv)

### TODO List:
- Examine the data, checking missing values, duplicates, shape, columns and information about data.
- Converting the salary column into numerical value.
- Remove number from company name.
- Get State and City name from Location column.
- Removing the incorrect data.
- Creating new column based on the age of the company.
- Creating Skills columns based on the Job description.
- Simplifying Job title column and creating seniority column.
- Renaming and dropping unecessary columns.

In [1]:
#Importing relevant libraries
import pandas as pd
import numpy as np

In [2]:
#Reading data using pandas
df = pd.read_csv('Uncleaned_DS_jobs.csv')
df.head(3)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1


#### TODO - 1: Examine the data, checking missing values, duplicates, shape, columns and information about data.

In [3]:
df.columns

Index(['index', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')

In [4]:
df.shape

(672, 15)

In [5]:
#Checking for missing values
df.isnull().sum()

index                0
Job Title            0
Salary Estimate      0
Job Description      0
Rating               0
Company Name         0
Location             0
Headquarters         0
Size                 0
Founded              0
Type of ownership    0
Industry             0
Sector               0
Revenue              0
Competitors          0
dtype: int64

In [6]:
#Checking for duplicate values
df.duplicated().sum()

0

In [7]:
df.describe()

Unnamed: 0,index,Rating,Founded
count,672.0,672.0,672.0
mean,335.5,3.518601,1635.529762
std,194.133974,1.410329,756.74664
min,0.0,-1.0,-1.0
25%,167.75,3.3,1917.75
50%,335.5,3.8,1995.0
75%,503.25,4.3,2009.0
max,671.0,5.0,2019.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              672 non-null    int64  
 1   Job Title          672 non-null    object 
 2   Salary Estimate    672 non-null    object 
 3   Job Description    672 non-null    object 
 4   Rating             672 non-null    float64
 5   Company Name       672 non-null    object 
 6   Location           672 non-null    object 
 7   Headquarters       672 non-null    object 
 8   Size               672 non-null    object 
 9   Founded            672 non-null    int64  
 10  Type of ownership  672 non-null    object 
 11  Industry           672 non-null    object 
 12  Sector             672 non-null    object 
 13  Revenue            672 non-null    object 
 14  Competitors        672 non-null    object 
dtypes: float64(1), int64(2), object(12)
memory usage: 78.9+ KB


In [9]:
#Droping unecessary column
df.drop('index', axis = 1, inplace=True)
df.head()

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


#### TODO - 2: Converting the salary column into numerical value.

In [10]:
df['Salary Estimate'].unique()

array(['$137K-$171K (Glassdoor est.)', '$75K-$131K (Glassdoor est.)',
       '$79K-$131K (Glassdoor est.)', '$99K-$132K (Glassdoor est.)',
       '$90K-$109K (Glassdoor est.)', '$101K-$165K (Glassdoor est.)',
       '$56K-$97K (Glassdoor est.)', '$79K-$106K (Glassdoor est.)',
       '$71K-$123K (Glassdoor est.)', '$90K-$124K (Glassdoor est.)',
       '$91K-$150K (Glassdoor est.)', '$141K-$225K (Glassdoor est.)',
       '$145K-$225K(Employer est.)', '$79K-$147K (Glassdoor est.)',
       '$122K-$146K (Glassdoor est.)', '$112K-$116K (Glassdoor est.)',
       '$110K-$163K (Glassdoor est.)', '$124K-$198K (Glassdoor est.)',
       '$79K-$133K (Glassdoor est.)', '$69K-$116K (Glassdoor est.)',
       '$31K-$56K (Glassdoor est.)', '$95K-$119K (Glassdoor est.)',
       '$212K-$331K (Glassdoor est.)', '$66K-$112K (Glassdoor est.)',
       '$128K-$201K (Glassdoor est.)', '$138K-$158K (Glassdoor est.)',
       '$80K-$132K (Glassdoor est.)', '$87K-$141K (Glassdoor est.)',
       '$92K-$155K (Glassdo

`salary_extract(salary)` this function will take the salary string as input and give minimum and maximum salary offered in integer format.
<br>For Example:
<br>salary_extract( "$137K-$171K (Glassdoor est.)" )  will give  ('137000', '171000')

In [11]:
#Function to get min and max salary
def salary_extract(salary):
    min_ = salary.split('(')[0].strip().split('-')[0]
    max_ = salary.split('(')[0].strip().split('-')[1]
    if min_[-1] == 'K':
        min_= min_[1:-1] + '000'
    else:
        min_= min_[1:-1]
    if max_[-1] == 'K':
        max_= max_[1:-1] + '000'
    else:
        max_= max_[1:-1]
    return min_, max_

In [12]:
#Applying the function and creating new columns for the same
df['min_salary'] = df['Salary Estimate'].apply(salary_extract).str[0].astype(int)
df['max_salary'] = df['Salary Estimate'].apply(salary_extract).str[1].astype(int)

In [13]:
df.head(3)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,min_salary,max_salary
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",137000,171000
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,137000,171000
2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1,137000,171000


In [14]:
df.describe()

Unnamed: 0,Rating,Founded,min_salary,max_salary
count,672.0,672.0,672.0,672.0
mean,3.518601,1635.529762,99196.428571,148130.952381
std,1.410329,756.74664,33009.958111,48035.110051
min,-1.0,-1.0,31000.0,56000.0
25%,3.3,1917.75,79000.0,119000.0
50%,3.8,1995.0,91000.0,133000.0
75%,4.3,2009.0,122000.0,165000.0
max,5.0,2019.0,212000.0,331000.0


#### TODO - 3: Remove number from company name.

In [15]:
#Applying lambda function to remove number from company name
df['Company Name'] = df['Company Name'].apply(lambda x : x.split('\n')[0])

In [16]:
df.head(2)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,min_salary,max_salary
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",137000,171000
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,137000,171000


#### TODO - 4: Get State and City name from Location column.

In [17]:
df['City'] = df['Location'].apply(lambda x: x.split(',')[0])

In [18]:
#Getting State name from Location column
df['State'] = df['Location'].apply(lambda x: x.split(',')[-1])


In [19]:
df.head(3)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,min_salary,max_salary,City,State
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",137000,171000,New York,NY
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,137000,171000,Chantilly,VA
2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1,137000,171000,Boston,MA


#### TODO - 5: Removing the incorrect data.

In [20]:
#Checking for incorrect data
df[df.Founded == -1].shape

(118, 18)

In [21]:
invalid_data_index = df.query('Founded == -1 and Rating == -1').index

In [22]:
invalid_data_index

Int64Index([154, 158, 230, 282, 285, 290, 319, 322, 329, 338, 351, 357, 358,
            359, 360, 361, 362, 388, 389, 409, 425, 430, 431, 437, 438, 440,
            457, 459, 495, 496, 497, 498, 499, 500, 504, 519, 524, 555, 568,
            613, 615, 637, 650, 656, 657, 660, 664, 668, 669],
           dtype='int64')

In [23]:
#Dropping invalid data
df.drop(invalid_data_index, inplace=True)

In [24]:
df.shape

(623, 18)

In [25]:
#Replacing rest incorrect data with 0
df.Founded.replace(-1, 0, inplace=True)

In [26]:
df.Founded.unique()

array([1993, 1968, 1981, 2000, 1998, 2010, 1996, 1990, 1983, 2014, 2012,
       2016, 1965, 1973, 1986, 1997, 2015, 1945, 1988, 2017, 2011, 1967,
       1860, 1992, 2003, 1951, 2005, 2019, 1925, 2008, 1999, 1978, 1966,
       1912, 1958, 2013, 1849, 1781, 1926, 2006, 1994, 1863, 1995,    0,
       1982, 1974, 2001, 1985, 1913, 1971, 1911, 2009, 1959, 2007, 1939,
       2002, 1961, 1963, 1969, 1946, 1957, 1953, 1948, 1850, 1851, 2004,
       1976, 1918, 1954, 1947, 1955, 2018, 1937, 1917, 1935, 1929, 1820,
       1952, 1932, 1894, 1960, 1788, 1830, 1984, 1933, 1880, 1887, 1970,
       1942, 1980, 1989, 1908, 1853, 1875, 1914, 1898, 1956, 1977, 1987,
       1896, 1972, 1949, 1962], dtype=int64)

#### TODO - 6: Creating new column based on the age of the company.

In [27]:
#Determining Companies age
df['Company Age'] = df.Founded.apply(lambda x : x if x <= 0 else 2023-x)

In [28]:
#Company with age more than 50 years
df[df['Company Age'] > 50]

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,min_salary,max_salary,City,State,Company Age
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,137000,171000,Chantilly,VA,55
12,"Data Scientist - Statistics, Early Career",$137K-$171K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310918\n\n...,3.7,PNNL,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,Energy,"Oil, Gas, Energy & Utilities",$500 million to $1 billion (USD),"Oak Ridge National Laboratory, National Renewa...",137000,171000,Richland,WA,58
17,Data Scientist,$137K-$171K (Glassdoor est.),Job Success Profile\n\nData Scientist\n\nBuckm...,3.5,Buckman,"Memphis, TN","Memphis, TN",1001 to 5000 employees,1945,Company - Private,Chemical Manufacturing,Manufacturing,$500 million to $1 billion (USD),-1,137000,171000,Memphis,TN,78
22,Human Factors Scientist,$137K-$171K (Glassdoor est.),Human Factors Scientist\nID\n\n3336\n\nLocatio...,3.5,Exponent,"Phoenix, AZ","Menlo Park, CA",1001 to 5000 employees,1967,Company - Public,Consulting,Business Services,$100 to $500 million (USD),-1,137000,171000,Phoenix,AZ,56
23,Business Intelligence Analyst I- Data Insights,$137K-$171K (Glassdoor est.),Position Summary\n\nIndividuals within the\nBu...,3.5,Guardian Life,"Appleton, WI","New York, NY",5001 to 10000 employees,1860,Company - Private,Insurance Carriers,Insurance,$5 to $10 billion (USD),Northwestern Mutual,137000,171000,Appleton,WI,163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
631,"Data Scientist, Kinship - NYC/Portland",$92K-$155K (Glassdoor est.),Back to search results\nPrevious job\n\nNext j...,3.9,Mars,"New York, NY","Mc Lean, VA",10000+ employees,1911,Company - Private,Food & Beverage Manufacturing,Manufacturing,$10+ billion (USD),-1,92000,155000,New York,NY,112
634,Data Scientist,$92K-$155K (Glassdoor est.),"Overview:\n\n\nGood people, working with good ...",2.5,KeHE Distributors,"Naperville, IL","Naperville, IL",5001 to 10000 employees,1954,Company - Private,Wholesale,Business Services,Unknown / Non-Applicable,"United Natural Foods, US Foods, DPI Specialty ...",92000,155000,Naperville,IL,69
644,Data Scientist,$92K-$155K (Glassdoor est.),Job Description\n\nCACI is seeking fully clear...,3.5,CACI International,"Chantilly, VA","Arlington, VA",10000+ employees,1962,Company - Public,Aerospace & Defense,Aerospace & Defense,$2 to $5 billion (USD),"CSC, ManTech, SAIC",92000,155000,Chantilly,VA,61
647,ENGINEER - COMPUTER SCIENTIST - RESEARCH COMPU...,$92K-$155K (Glassdoor est.),Join our Defense and Intelligence Solutions Di...,3.9,Southwest Research Institute,"Oklahoma City, OK","San Antonio, TX",1001 to 5000 employees,1947,Nonprofit Organization,Research & Development,Business Services,$500 million to $1 billion (USD),"Los Alamos National Laboratory, Battelle, SRI ...",92000,155000,Oklahoma City,OK,76


In [29]:
df['Company Age'].unique()

array([ 30,  55,  42,  23,  25,  13,  27,  33,  40,   9,  11,   7,  58,
        50,  37,  26,   8,  78,  35,   6,  12,  56, 163,  31,  20,  72,
        18,   4,  98,  15,  24,  45,  57, 111,  65,  10, 174, 242,  97,
        17,  29, 160,  28,   0,  41,  49,  22,  38, 110,  52, 112,  14,
        64,  16,  84,  21,  62,  60,  54,  77,  66,  70,  75, 173, 172,
        19,  47, 105,  69,  76,  68,   5,  86, 106,  88,  94, 203,  71,
        91, 129,  63, 235, 193,  39,  90, 143, 136,  53,  81,  43,  34,
       115, 170, 148, 109, 125,  67,  46,  36, 127,  51,  74,  61],
      dtype=int64)

In [30]:
#Data with company age 242 years
df[df['Company Age'] == 242] 

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,min_salary,max_salary,City,State,Company Age
51,Data Scientist,$75K-$131K (Glassdoor est.),"By clicking the Apply button, I understand tha...",3.7,Takeda,"Cambridge, MA","OSAKA, Japan",10000+ employees,1781,Company - Public,Biotech & Pharmaceuticals,Biotech & Pharmaceuticals,$10+ billion (USD),"Novartis, Baxter, Pfizer",75000,131000,Cambridge,MA,242
365,Data Scientist,$112K-$116K (Glassdoor est.),"By clicking the Apply button, I understand tha...",3.7,Takeda,"Cambridge, MA","OSAKA, Japan",10000+ employees,1781,Company - Public,Biotech & Pharmaceuticals,Biotech & Pharmaceuticals,$10+ billion (USD),"Novartis, Baxter, Pfizer",112000,116000,Cambridge,MA,242


In [31]:
df['Rating'].unique()

array([ 3.1,  4.2,  3.8,  3.5,  2.9,  3.9,  4.4,  3.6,  4.5,  4.7,  3.7,
        3.4,  4.1,  3.2,  4.3,  2.8,  5. ,  4.8,  3.3,  2.7,  2.2,  2.6,
        4. ,  2.5,  4.9,  2.4,  2.3,  4.6,  3. , -1. ,  2.1,  2. ])

In [32]:
df[df['Rating'] == -1].shape

(1, 19)

In [33]:
drop_index = df[df['Rating'] == -1].index

In [34]:
df.drop(drop_index, inplace=True)

In [35]:
df[df['Rating'] == -1].shape

(0, 19)

In [36]:
df.describe()

Unnamed: 0,Rating,Founded,min_salary,max_salary,Company Age
count,622.0,622.0,622.0,622.0,622.0
mean,3.881833,1763.969453,98496.784566,147062.700965,34.614148
std,0.610805,624.773093,33187.640385,47979.439952,40.258852
min,2.0,0.0,31000.0,56000.0,0.0
25%,3.5,1951.25,79000.0,119000.0,11.0
50%,3.8,1996.0,91000.0,132000.0,21.5
75%,4.4,2010.0,122000.0,164500.0,42.0
max,5.0,2019.0,212000.0,331000.0,242.0


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 622 entries, 0 to 671
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          622 non-null    object 
 1   Salary Estimate    622 non-null    object 
 2   Job Description    622 non-null    object 
 3   Rating             622 non-null    float64
 4   Company Name       622 non-null    object 
 5   Location           622 non-null    object 
 6   Headquarters       622 non-null    object 
 7   Size               622 non-null    object 
 8   Founded            622 non-null    int64  
 9   Type of ownership  622 non-null    object 
 10  Industry           622 non-null    object 
 11  Sector             622 non-null    object 
 12  Revenue            622 non-null    object 
 13  Competitors        622 non-null    object 
 14  min_salary         622 non-null    int32  
 15  max_salary         622 non-null    int32  
 16  City               622 non

#### TODO - 7: Creating Skills columns based on the Job description.

In [38]:
#Checking the job description column, we can get the skill required for the job using the job description.
df['Job Description'][1]

"Secure our Nation, Ignite your Future\n\nJoin the top Information Technology and Analytic professionals in the industry to make invaluable contributions to our national security on a daily basis. In this innovative, self-contained, Big Data environment, the ManTech team is responsible for everything from infrastructure, to application development, to data science, to advanced analytics and beyond. The team is diverse, the questions are thought-provoking, and the opportunities for growth and advancement are numerous\n\nThe successful candidate will possess a diverse range of data-focused skills and experience, both technical and analytical. They will have a strong desire and capability for problem solving, data analysis and troubleshooting, analytical thinking, and experimentation.\n\nDuties, Tasks & Responsibilities\nWorking with large, complex, and disparate data sets\nDesigning and implementing innovative ways to analyze and exploit the Sponsors data holdings\nResearching and report


df['python'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
df['excel'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
df['hadoop'] = df['Job Description'].apply(lambda x: 1 if 'hadoop' in x.lower() else 0)
df['spark'] = df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0)
df['aws'] = df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)
df['tableau'] = df['Job Description'].apply(lambda x: 1 if 'tableau' in x.lower() else 0)
df['big_data'] = df['Job Description'].apply(lambda x: 1 if 'big data' in x.lower() else 0)
df.head()

In [39]:
#Getting the most common skills required for the job.
df = df.assign(python=df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0),
          excel=df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0),
          hadoop=df['Job Description'].apply(lambda x: 1 if 'hadoop' in x.lower() else 0),
          spark=df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0),
          aws=df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0),
          tableau=df['Job Description'].apply(lambda x: 1 if 'tableau' in x.lower() else 0),
          big_data=df['Job Description'].apply(lambda x: 1 if 'big data' in x.lower() else 0))

In [40]:
df.columns

Index(['Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'min_salary', 'max_salary', 'City', 'State', 'Company Age', 'python',
       'excel', 'hadoop', 'spark', 'aws', 'tableau', 'big_data'],
      dtype='object')

In [41]:
df.head(1)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,City,State,Company Age,python,excel,hadoop,spark,aws,tableau,big_data
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,...,New York,NY,30,0,0,0,0,1,0,0


In [42]:
df.describe()

Unnamed: 0,Rating,Founded,min_salary,max_salary,Company Age,python,excel,hadoop,spark,aws,tableau,big_data
count,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0
mean,3.881833,1763.969453,98496.784566,147062.700965,34.614148,0.73955,0.440514,0.220257,0.273312,0.245981,0.186495,0.212219
std,0.610805,624.773093,33187.640385,47979.439952,40.258852,0.439233,0.496848,0.414754,0.446018,0.431014,0.389819,0.409208
min,2.0,0.0,31000.0,56000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.5,1951.25,79000.0,119000.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3.8,1996.0,91000.0,132000.0,21.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.4,2010.0,122000.0,164500.0,42.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
max,5.0,2019.0,212000.0,331000.0,242.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


#### TODO - 8: Simplifying Job title column and creating Seniority column.

In [43]:
#Checking all the job title available
df['Job Title'].value_counts()

Data Scientist                                            291
Data Engineer                                              26
Senior Data Scientist                                      19
Machine Learning Engineer                                  15
Data Analyst                                               12
                                                         ... 
Data Engineer (Remote)                                      1
Data Science Instructor                                     1
Business Data Analyst                                       1
Purification Scientist                                      1
AI/ML - Machine Learning Scientist, Siri Understanding      1
Name: Job Title, Length: 170, dtype: int64

`jobtitle(title)` this function will simplify the job title, string which contain some most common job title will be replaced with the simplest title

In [44]:
#Function to simplify the job title
def jobtitle(title):
    if 'data scientist' in title.lower():
        return 'Data Scientist'
    elif 'machine learning' in title.lower() or 'ml' in title.lower():
        return 'Machine Learning Engineer'
    elif 'data Engineer' in title.lower():
        return 'Data Engineer'
    elif 'data analyst' in title.lower():
        return 'Data Analyst'
    elif 'analyst' in title.lower():
        return 'Analyst'
    elif 'director' in title.lower():
        return 'Director'
    elif 'manager' in title.lower():
        return 'Manager'
    elif 'software engineer' in title.lower():
        return 'Software Engineer'
    elif 'engineer' in title.lower():
        return 'Engineer'
    elif 'intern' in title.lower():
        return 'Intern'
    elif 'scientist' in title.lower():
        return 'Scientist'    
    else:
        return 'NA'

In [45]:
jobtitle('Sr Data Scientist')

'Data Scientist'

In [46]:
#Creating new column for job title and applying the function
df['simple_job_title'] = df['Job Title'].apply(jobtitle)

In [47]:
df.head(2)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,State,Company Age,python,excel,hadoop,spark,aws,tableau,big_data,simple_job_title
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,...,NY,30,0,0,0,0,1,0,0,Data Scientist
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,...,VA,55,0,0,1,0,0,0,1,Data Scientist


In [48]:
#Checking the value count based on job title
df['simple_job_title'].value_counts()

Data Scientist               409
Engineer                      64
Data Analyst                  47
Machine Learning Engineer     35
Scientist                     31
NA                            13
Analyst                        8
Manager                        7
Software Engineer              5
Director                       3
Name: simple_job_title, dtype: int64

`seniority(title)` this function will check the job title based on that it determine seneority of the job title

In [49]:
#Function to get the seniority of the job title
def seniority(title):
    if 'jr' in title.lower() or 'junior' in title.lower():
        return 'Junior'
    elif 'senior' in title.lower() or 'sr' in title.lower():
        return 'Senior'
    else:
        return 'NA'

In [50]:
df['seniority'] = df['Job Title'].apply(seniority)

In [51]:
df.head(2)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,Company Age,python,excel,hadoop,spark,aws,tableau,big_data,simple_job_title,seniority
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,...,30,0,0,0,0,1,0,0,Data Scientist,Senior
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,...,55,0,0,1,0,0,0,1,Data Scientist,


In [52]:
df.seniority.value_counts()

NA        544
Senior     76
Junior      2
Name: seniority, dtype: int64

In [53]:
df[df.seniority == 'NA'].head(3)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,Company Age,python,excel,hadoop,spark,aws,tableau,big_data,simple_job_title,seniority
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,...,55,0,0,1,0,0,0,1,Data Scientist,
2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,...,42,1,1,0,0,1,0,0,Data Scientist,
3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,...,23,1,1,0,0,1,0,0,Data Scientist,


In [54]:
df.columns

Index(['Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'min_salary', 'max_salary', 'City', 'State', 'Company Age', 'python',
       'excel', 'hadoop', 'spark', 'aws', 'tableau', 'big_data',
       'simple_job_title', 'seniority'],
      dtype='object')

In [55]:
#Examining the type of ownership column
df['Type of ownership'].value_counts()

Company - Private                 380
Company - Public                  151
Nonprofit Organization             36
Subsidiary or Business Segment     28
Government                         10
Other Organization                  5
Private Practice / Firm             3
College / University                3
Self-employed                       2
Contract                            2
Unknown                             1
Hospital                            1
Name: Type of ownership, dtype: int64

In [56]:
#Examining the type of Industry column
df['Industry'].value_counts()

Biotech & Pharmaceuticals                   66
Computer Hardware & Software                57
IT Services                                 57
Aerospace & Defense                         46
Enterprise Software & Network Solutions     43
Consulting                                  36
Staffing & Outsourcing                      36
Insurance Carriers                          28
-1                                          28
Internet                                    27
Advertising & Marketing                     23
Health Care Services & Hospitals            20
Research & Development                      17
Federal Agencies                            16
Investment Banking & Asset Management       13
Banks & Credit Unions                        8
Lending                                      8
Energy                                       5
Consumer Products Manufacturing              5
Telecommunications Services                  5
Insurance Agencies & Brokerages              4
Food & Bevera

In [57]:
df[df['Industry'] == '-1'].head(4)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,Company Age,python,excel,hadoop,spark,aws,tableau,big_data,simple_job_title,seniority
42,Data Analyst I,$75K-$131K (Glassdoor est.),Who is Cenlar?\n\nYou are.\n\nEmployee-owners ...,2.6,Cenlar,"Ewing, NJ","Ewing, NJ",1001 to 5000 employees,1958,Company - Private,...,65,0,0,0,0,0,0,0,Data Analyst,
168,Data Engineer,$101K-$165K (Glassdoor est.),Job Number: 10202\nGroup: Cosma International\...,3.5,Magna International Inc.,"Birmingham, AL","Aurora, Canada",10000+ employees,1957,Company - Public,...,66,1,0,1,1,1,0,1,Engineer,
193,Data Scientist,$56K-$97K (Glassdoor est.),Job Description\nClient JD below:\n\nWe need a...,5.0,SkillSoniq,"San Francisco, CA","Jersey City, NJ",Unknown,0,Company - Public,...,0,1,0,0,0,0,0,0,Data Scientist,
195,Data Scientist,$56K-$97K (Glassdoor est.),"About Joby\nLocated in Northern California, th...",4.3,Joby Aviation,"San Carlos, CA","Santa Cruz, CA",51 to 200 employees,0,Company - Private,...,0,1,1,0,1,1,0,1,Data Scientist,


In [58]:
df['Sector'].value_counts()

Information Technology                184
Business Services                     118
Biotech & Pharmaceuticals              66
Aerospace & Defense                    46
Finance                                33
Insurance                              32
-1                                     28
Manufacturing                          23
Health Care                            20
Government                             17
Oil, Gas, Energy & Utilities           10
Retail                                  7
Telecommunications                      7
Transportation & Logistics              6
Media                                   5
Real Estate                             3
Travel & Tourism                        3
Agriculture & Forestry                  3
Education                               3
Accounting & Legal                      3
Construction, Repair & Maintenance      2
Consumer Services                       2
Non-Profit                              1
Name: Sector, dtype: int64

In [59]:
#Some data is incorrect in Industry and Sector column hence replacing them with NA.
df['Industry'].replace('-1', 'NA', inplace=True)
df['Sector'].replace('-1', 'NA', inplace=True)

In [60]:
df[df['Sector'] == '-1'].shape

(0, 28)

In [61]:
df['Sector'].value_counts()

Information Technology                184
Business Services                     118
Biotech & Pharmaceuticals              66
Aerospace & Defense                    46
Finance                                33
Insurance                              32
NA                                     28
Manufacturing                          23
Health Care                            20
Government                             17
Oil, Gas, Energy & Utilities           10
Retail                                  7
Telecommunications                      7
Transportation & Logistics              6
Media                                   5
Real Estate                             3
Travel & Tourism                        3
Agriculture & Forestry                  3
Education                               3
Accounting & Legal                      3
Construction, Repair & Maintenance      2
Consumer Services                       2
Non-Profit                              1
Name: Sector, dtype: int64

In [62]:
df.columns

Index(['Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'min_salary', 'max_salary', 'City', 'State', 'Company Age', 'python',
       'excel', 'hadoop', 'spark', 'aws', 'tableau', 'big_data',
       'simple_job_title', 'seniority'],
      dtype='object')

#### TODO - 9: Renaming and dropping unecessary columns.


In [63]:
#Dropping uneccessary columns from the dataframe
df.drop(['Location', 'Salary Estimate', 'Competitors', 'Size', 'Revenue'], axis = 1, inplace=True)

In [64]:
df.columns

Index(['Job Title', 'Job Description', 'Rating', 'Company Name',
       'Headquarters', 'Founded', 'Type of ownership', 'Industry', 'Sector',
       'min_salary', 'max_salary', 'City', 'State', 'Company Age', 'python',
       'excel', 'hadoop', 'spark', 'aws', 'tableau', 'big_data',
       'simple_job_title', 'seniority'],
      dtype='object')

In [65]:
#Renaming the columns
rename_dict = {'min_salary': 'Minimum Salary',
 'max_salary': 'Maximum Salary',
 'python': 'Python',
 'excel' : 'Excel',
 'hadoop' : 'Hadoop',
 'spark': 'Spark',
 'aws': 'AWS',
 'tableau': 'Tableau',
 'big_data': 'Big Data',
 'simple_job_title': 'Simple_Job_Title',
 'seniority': 'Seniority'
}

In [66]:
df.rename(columns = rename_dict, inplace=True)

In [67]:
df.head()

Unnamed: 0,Job Title,Job Description,Rating,Company Name,Headquarters,Founded,Type of ownership,Industry,Sector,Minimum Salary,...,Company Age,Python,Excel,Hadoop,Spark,AWS,Tableau,Big Data,Simple_Job_Title,Seniority
0,Sr Data Scientist,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY",1993,Nonprofit Organization,Insurance Carriers,Insurance,137000,...,30,0,0,0,0,1,0,0,Data Scientist,Senior
1,Data Scientist,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Herndon, VA",1968,Company - Public,Research & Development,Business Services,137000,...,55,0,0,1,0,0,0,1,Data Scientist,
2,Data Scientist,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA",1981,Private Practice / Firm,Consulting,Business Services,137000,...,42,1,1,0,0,1,0,0,Data Scientist,
3,Data Scientist,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Bad Ragaz, Switzerland",2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,137000,...,23,1,1,0,0,1,0,0,Data Scientist,
4,Data Scientist,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY",1998,Company - Private,Advertising & Marketing,Business Services,137000,...,25,1,1,0,0,0,0,0,Data Scientist,


In [68]:
#Creating a new column with average value of minimum and maximum salary
df['Average Salary'] = (df['Minimum Salary'] + df['Maximum Salary']) / 2

In [69]:
df.head(3)

Unnamed: 0,Job Title,Job Description,Rating,Company Name,Headquarters,Founded,Type of ownership,Industry,Sector,Minimum Salary,...,Python,Excel,Hadoop,Spark,AWS,Tableau,Big Data,Simple_Job_Title,Seniority,Average Salary
0,Sr Data Scientist,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY",1993,Nonprofit Organization,Insurance Carriers,Insurance,137000,...,0,0,0,0,1,0,0,Data Scientist,Senior,154000.0
1,Data Scientist,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Herndon, VA",1968,Company - Public,Research & Development,Business Services,137000,...,0,0,1,0,0,0,1,Data Scientist,,154000.0
2,Data Scientist,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA",1981,Private Practice / Firm,Consulting,Business Services,137000,...,1,1,0,0,1,0,0,Data Scientist,,154000.0


In [70]:
df.columns

Index(['Job Title', 'Job Description', 'Rating', 'Company Name',
       'Headquarters', 'Founded', 'Type of ownership', 'Industry', 'Sector',
       'Minimum Salary', 'Maximum Salary', 'City', 'State', 'Company Age',
       'Python', 'Excel', 'Hadoop', 'Spark', 'AWS', 'Tableau', 'Big Data',
       'Simple_Job_Title', 'Seniority', 'Average Salary'],
      dtype='object')

In [71]:
column_name = ['Job Title', 'Seniority','Simple_Job_Title', 'Company Name', 'Rating','Job Description','Minimum Salary', 'Maximum Salary','Average Salary','Python', 'Excel', 'Hadoop', 'Spark', 'AWS', 'Tableau', 'Big Data',
               'City','State','Headquarters', 'Type of ownership', 'Industry', 'Sector','Company Age']

In [72]:
#Changing the sequence of the columns in the dataframe
final_job_df = df[column_name]

In [73]:
final_job_df.head(3)

Unnamed: 0,Job Title,Seniority,Simple_Job_Title,Company Name,Rating,Job Description,Minimum Salary,Maximum Salary,Average Salary,Python,...,AWS,Tableau,Big Data,City,State,Headquarters,Type of ownership,Industry,Sector,Company Age
0,Sr Data Scientist,Senior,Data Scientist,Healthfirst,3.1,Description\n\nThe Senior Data Scientist is re...,137000,171000,154000.0,0,...,1,0,0,New York,NY,"New York, NY",Nonprofit Organization,Insurance Carriers,Insurance,30
1,Data Scientist,,Data Scientist,ManTech,4.2,"Secure our Nation, Ignite your Future\n\nJoin ...",137000,171000,154000.0,0,...,0,0,1,Chantilly,VA,"Herndon, VA",Company - Public,Research & Development,Business Services,55
2,Data Scientist,,Data Scientist,Analysis Group,3.8,Overview\n\n\nAnalysis Group is one of the lar...,137000,171000,154000.0,1,...,1,0,0,Boston,MA,"Boston, MA",Private Practice / Firm,Consulting,Business Services,42


In [74]:
#Adding _ in the name of the columns and remove the spaces.
final_job_df.columns = final_job_df.columns.str.replace(" ", "_")

In [75]:
final_job_df.head()

Unnamed: 0,Job_Title,Seniority,Simple_Job_Title,Company_Name,Rating,Job_Description,Minimum_Salary,Maximum_Salary,Average_Salary,Python,...,AWS,Tableau,Big_Data,City,State,Headquarters,Type_of_ownership,Industry,Sector,Company_Age
0,Sr Data Scientist,Senior,Data Scientist,Healthfirst,3.1,Description\n\nThe Senior Data Scientist is re...,137000,171000,154000.0,0,...,1,0,0,New York,NY,"New York, NY",Nonprofit Organization,Insurance Carriers,Insurance,30
1,Data Scientist,,Data Scientist,ManTech,4.2,"Secure our Nation, Ignite your Future\n\nJoin ...",137000,171000,154000.0,0,...,0,0,1,Chantilly,VA,"Herndon, VA",Company - Public,Research & Development,Business Services,55
2,Data Scientist,,Data Scientist,Analysis Group,3.8,Overview\n\n\nAnalysis Group is one of the lar...,137000,171000,154000.0,1,...,1,0,0,Boston,MA,"Boston, MA",Private Practice / Firm,Consulting,Business Services,42
3,Data Scientist,,Data Scientist,INFICON,3.5,JOB DESCRIPTION:\n\nDo you have a passion for ...,137000,171000,154000.0,1,...,1,0,0,Newton,MA,"Bad Ragaz, Switzerland",Company - Public,Electrical & Electronic Manufacturing,Manufacturing,23
4,Data Scientist,,Data Scientist,Affinity Solutions,2.9,Data Scientist\nAffinity Solutions / Marketing...,137000,171000,154000.0,1,...,0,0,0,New York,NY,"New York, NY",Company - Private,Advertising & Marketing,Business Services,25


In [76]:
final_job_df.describe()

Unnamed: 0,Rating,Minimum_Salary,Maximum_Salary,Average_Salary,Python,Excel,Hadoop,Spark,AWS,Tableau,Big_Data,Company_Age
count,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0,622.0
mean,3.881833,98496.784566,147062.700965,122779.742765,0.73955,0.440514,0.220257,0.273312,0.245981,0.186495,0.212219,34.614148
std,0.610805,33187.640385,47979.439952,39637.321682,0.439233,0.496848,0.414754,0.446018,0.431014,0.389819,0.409208,40.258852
min,2.0,31000.0,56000.0,43500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.5,79000.0,119000.0,103000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0
50%,3.8,91000.0,132000.0,114000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,21.5
75%,4.4,122000.0,164500.0,136500.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,42.0
max,5.0,212000.0,331000.0,271500.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,242.0


In [77]:
final_job_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 622 entries, 0 to 671
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job_Title          622 non-null    object 
 1   Seniority          622 non-null    object 
 2   Simple_Job_Title   622 non-null    object 
 3   Company_Name       622 non-null    object 
 4   Rating             622 non-null    float64
 5   Job_Description    622 non-null    object 
 6   Minimum_Salary     622 non-null    int32  
 7   Maximum_Salary     622 non-null    int32  
 8   Average_Salary     622 non-null    float64
 9   Python             622 non-null    int64  
 10  Excel              622 non-null    int64  
 11  Hadoop             622 non-null    int64  
 12  Spark              622 non-null    int64  
 13  AWS                622 non-null    int64  
 14  Tableau            622 non-null    int64  
 15  Big_Data           622 non-null    int64  
 16  City               622 non

## References:
- [Kaggle Dataset](https://www.kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor?select=Uncleaned_DS_jobs.csv)
- [Pandas Documentation](https://pandas.pydata.org/docs/#pandas-documentation)

