# Task 3
<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>Data Parsing, Cleansing and Integration</strong></h3>

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* seaborn
* matplotlib

| **COLUMN**   | **DESCRIPTION**                                               |
|--------------|---------------------------------------------------------------|
| Id           | 8 digit Id of the job advertisement                           |
| Title        | Title of the advertised job position                          |
| Location     | Location of the advertised job position                       |
| Company      | Company (employer) of the advertised job position             |
| ContractType | The contract type of the advertised job position              |
| ContractTime | The contract time of the advertised job position              |
| Category     | The category of the advertised job position                   |
| Salary       | Annual salary of the advertised job position                  |
| OpenDate     | The opening time for the job application                      |
| CloseDate    | The closing time for applying for the advertised job position |
| SourceName   | The website where the job position is advertised              |



## Introduction

The dataset combines job advertisements and related information gathered from various job-hunting websites. Pre-processing this data is increasingly valuable for job-search websites because cleaner data improves the search experience for users.
In this task I identify and resolve data integration problems across a sizeable collection of job advertisement records that are stored in XML format and have unknown data quality issues.

## Identify conflict

### Schema level conflicts:
+ Typos and spelling mistakes
+ Irregularities, e.g., abnormal data values and data formats
+ Violations of integrity constraints
+ Outliers
+ Duplications
+ Missing values
+ Inconsistency, e.g., inhomogeneity in values and types in representing the same data

### Data level conflicts:
+ Duplications

##  Import libraries 

In [1]:
import pandas as pd
import numpy as np

# Silence pandas' SettingWithCopyWarning for chained assignment
pd.options.mode.chained_assignment = None

dataset2.csv

In [2]:
# Code to inspect the provided data file
df2 = pd.read_csv('dataset2.csv')

df2.head(3)

Unnamed: 0,Location,Job Title,Monthly Payment,Closing,Category,Type,Opening,Organisation,Full-Time Equivalent (FTE)
0,Berkshire,Lead CRA UK,4583.33,2012-03-08 12:00:00,Health,,2012-01-08 12:00:00,SEC Recruitment,1.0
1,Bristol,Possession Manager,2812.5,2013-09-06 12:00:00,Engineering,Permanent,2013-08-07 12:00:00,Navartis Limited,1.0
2,Coventry,NVQ Assessor Banking/Financial Services Salary...,1791.67,2013-05-02 00:00:00,Hospitality,Permanent,2013-02-01 00:00:00,Pertemps,1.0


In [3]:
# add a new column SourceName since all records came from the same source
df2['SourceName'] = 'www.jobhuntlisting.com'

# add column Id to generate unique id for each row
df2['Id'] = df2.index + 1

df2 = df2[['Id','Location', 'Job Title', 'Monthly Payment', 'Closing', 'Category',
       'Type', 'Opening', 'Organisation', 'Full-Time Equivalent (FTE)',
       'SourceName']]

# overview of the dataset
print(df2.info())
df2.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Id                          5000 non-null   int64  
 1   Location                    5000 non-null   object 
 2   Job Title                   5000 non-null   object 
 3   Monthly Payment             5000 non-null   float64
 4   Closing                     5000 non-null   object 
 5   Category                    5000 non-null   object 
 6   Type                        3593 non-null   object 
 7   Opening                     5000 non-null   object 
 8   Organisation                4531 non-null   object 
 9   Full-Time Equivalent (FTE)  5000 non-null   float64
 10  SourceName                  5000 non-null   object 
dtypes: float64(2), int64(1), object(8)
memory usage: 429.8+ KB
None


Unnamed: 0,Id,Location,Job Title,Monthly Payment,Closing,Category,Type,Opening,Organisation,Full-Time Equivalent (FTE),SourceName
0,1,Berkshire,Lead CRA UK,4583.33,2012-03-08 12:00:00,Health,,2012-01-08 12:00:00,SEC Recruitment,1.0,www.jobhuntlisting.com
1,2,Bristol,Possession Manager,2812.5,2013-09-06 12:00:00,Engineering,Permanent,2013-08-07 12:00:00,Navartis Limited,1.0,www.jobhuntlisting.com
2,3,Coventry,NVQ Assessor Banking/Financial Services Salary...,1791.67,2013-05-02 00:00:00,Hospitality,Permanent,2013-02-01 00:00:00,Pertemps,1.0,www.jobhuntlisting.com


In [4]:
# Code to inspect the provided data file
df1 = pd.read_csv('dataset1_solution.csv')

# overview of the dataset
print(df1.info())
df1.sample(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50751 entries, 0 to 50750
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            50751 non-null  int64  
 1   Title         50751 non-null  object 
 2   Location      50751 non-null  object 
 3   Company       50043 non-null  object 
 4   ContractType  50751 non-null  object 
 5   ContractTime  50751 non-null  object 
 6   Category      50751 non-null  object 
 7   OpenDate      50751 non-null  object 
 8   CloseDate     50751 non-null  object 
 9   SourceName    50751 non-null  object 
 10  Salary        50751 non-null  float64
dtypes: float64(1), int64(1), object(9)
memory usage: 4.3+ MB
None


Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,OpenDate,CloseDate,SourceName,Salary
50707,68445079,coventure graduate role – investment managemen...,uk,non-specified,non-specified,non-specified,sales jobs,2013-02-12 00:00:00,2013-03-14 00:00:00,wikijob.co.uk,31500.0
46386,69568540,allied health care professional : mobile optom...,kent,trident,non-specified,non-specified,healthcare & nursing jobs,2013-06-04 12:00:00,2013-07-04 12:00:00,jobs4medical.co.uk,40500.0
27256,70601253,counselling concepts lecturer,crewe,,non-specified,non-specified,teaching jobs,2013-06-03 15:00:00,2013-09-01 15:00:00,jobcentre plus,42736.0


<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>2. Resolving schema conflicts</strong></h3>

+ Column names
+ Typos and spelling mistakes
+ Irregularities, e.g., abnormal data values and data formats
+ Violations of integrity constraints
+ Outliers
+ Duplications
+ Missing values
+ Inconsistency, e.g., inhomogeneity in values and types in representing the same data

### Conflict 1: Naming conflict
I'll rename the columns so that columns with the same meaning have the same name in both dataframes.

In [5]:
print(f'List of column names for the first dataframe df1: {df1.columns.to_list()}\n')
print(f'List of column names for the second dataframe df2: {df2.columns.to_list()}\n')

List of column names for the first dataframe df1: ['Id', 'Title', 'Location', 'Company', 'ContractType', 'ContractTime', 'Category', 'OpenDate', 'CloseDate', 'SourceName', 'Salary']

List of column names for the second dataframe df2: ['Id', 'Location', 'Job Title', 'Monthly Payment', 'Closing', 'Category', 'Type', 'Opening', 'Organisation', 'Full-Time Equivalent (FTE)', 'SourceName']



### --------> OBSERVATION
+ Both dataframes have 11 columns
+ Columns with the same name and meaning in both datasets: 'Id', 'Location', 'Category', 'SourceName'
+ Columns with the same meaning but different names (df1 name: df2 name)
> + 'Title': 'Job Title'
> + 'Company': 'Organisation'
> + 'ContractType': 'Full-Time Equivalent (FTE)' **note: any value other than 1 indicates a part-time job**
> + 'ContractTime': 'Type'
> + 'OpenDate': 'Opening'
> + 'CloseDate': 'Closing'
> + 'Salary': 'Monthly Payment' **note: df1 holds annual salary, df2 holds monthly salary**

In [6]:
# Rename columns
df2.rename(columns={'Job Title':'Title','Organisation':'Company',
                   'Full-Time Equivalent (FTE)':'ContractType','Type':'ContractTime',
                   'Opening':'OpenDate','Closing':'CloseDate', 'Monthly Payment':'Salary'},inplace=True)


df2.head(3)

Unnamed: 0,Id,Location,Title,Salary,CloseDate,Category,ContractTime,OpenDate,Company,ContractType,SourceName
0,1,Berkshire,Lead CRA UK,4583.33,2012-03-08 12:00:00,Health,,2012-01-08 12:00:00,SEC Recruitment,1.0,www.jobhuntlisting.com
1,2,Bristol,Possession Manager,2812.5,2013-09-06 12:00:00,Engineering,Permanent,2013-08-07 12:00:00,Navartis Limited,1.0,www.jobhuntlisting.com
2,3,Coventry,NVQ Assessor Banking/Financial Services Salary...,1791.67,2013-05-02 00:00:00,Hospitality,Permanent,2013-02-01 00:00:00,Pertemps,1.0,www.jobhuntlisting.com
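
As a quick sanity check on a mapping like the one above, the following sketch confirms that two frames expose the same set of column names after renaming. The toy rows and the names `toy_df1`, `toy_df2` and `column_map` are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (values are made up for illustration)
toy_df1 = pd.DataFrame({'Id': [1], 'Title': ['a'], 'Company': ['x'], 'Salary': [30000.0]})
toy_df2 = pd.DataFrame({'Id': [2], 'Job Title': ['b'], 'Organisation': ['y'],
                        'Monthly Payment': [2500.0]})

# Mapping from the second schema's names to the global names
column_map = {'Job Title': 'Title', 'Organisation': 'Company', 'Monthly Payment': 'Salary'}
toy_df2 = toy_df2.rename(columns=column_map)

# Columns present in one frame but not the other; an empty set means the schemas agree
mismatch = set(toy_df1.columns) ^ set(toy_df2.columns)
print(mismatch)
```

An empty set here only shows that the names line up; it says nothing about whether the values share units or types, which the following conflicts deal with.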


### Conflict 2: Convert Data type

In [7]:
def coerce_df_columns_to_best_dtype(df, int32_column_list, float_column_list, datetime_column_list):

    # convert to integer dtype 
    df[int32_column_list] = df[int32_column_list].astype('int32', errors='ignore')
    
    # convert to float dtype
    df[float_column_list] = df[float_column_list].astype('float', errors='ignore').round(2)
    
    # convert to datetime format yyyy-mm-dd hh:mm:ss
    df[datetime_column_list] = df[datetime_column_list].apply(pd.to_datetime, errors='coerce')

    # convert object columns to string datatype
    df = df.convert_dtypes()

    # select numeric columns
    df_numeric = df.select_dtypes(include=[np.number]).columns.to_list()

    # select non-numeric columns
    df_string = df.select_dtypes(include='string').columns.tolist()

    # print out the name and number of numeric column
    print("Number of numeric columns: ", len(df_numeric))
    print("List of numeric columns: ", df_numeric, "\n")

    # print out the name and number of categorical column
    print("Number of categorical columns: ", len(df_string))
    print("List of string columns: ", df_string, "\n\n")

    # print the dtype summary, then return the converted frame so the caller can keep it
    df.info()
    return df

# apply the coerce_df_columns_to_best_dtype function and keep the converted dataframe
df2 = coerce_df_columns_to_best_dtype(df2, 'Id', 'Salary', ['OpenDate', 'CloseDate'])

Number of numeric columns:  3
List of numeric columns:  ['Id', 'Salary', 'ContractType'] 

Number of categorical columns:  6
List of string columns:  ['Location', 'Title', 'Category', 'ContractTime', 'Company', 'SourceName'] 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Id            5000 non-null   Int32         
 1   Location      5000 non-null   string        
 2   Title         5000 non-null   string        
 3   Salary        5000 non-null   Float64       
 4   CloseDate     5000 non-null   datetime64[ns]
 5   Category      5000 non-null   string        
 6   ContractTime  3593 non-null   string        
 7   OpenDate      5000 non-null   datetime64[ns]
 8   Company       4531 non-null   string        
 9   ContractType  5000 non-null   Float64       
 10  SourceName    5000 non-null   string        
dtypes: Float64

### Conflict 3: Salary from monthly to annually 

In [8]:
df2.describe().T.round(2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,5000.0,2500.5,1443.52,1.0,1250.75,2500.5,3750.25,5000.0
Salary,5000.0,2854.15,1310.43,424.0,1916.67,2583.33,3541.67,8000.0
ContractType,5000.0,0.99,0.09,0.2,1.0,1.0,1.0,1.0


In [9]:
df1.describe().T.round(2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,50751.0,68847007.68,4238794.44,12612628.0,68347141.5,69214702.0,71268853.0,72705203.0
Salary,50751.0,34331.12,15521.12,5000.0,23500.0,31500.0,42000.0,116951.0


In [10]:
# convert monthly to annual salary
df2['Salary'] = df2['Salary'] * 12
df2.describe().T.round(2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,5000.0,2500.5,1443.52,1.0,1250.75,2500.5,3750.25,5000.0
Salary,5000.0,34249.78,15725.22,5088.0,23000.04,30999.96,42500.04,96000.0
ContractType,5000.0,0.99,0.09,0.2,1.0,1.0,1.0,1.0
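
The factor of 12 can be cross-checked against the summary statistics above. This is a minimal sketch that reuses the medians printed by `describe()` for df1 and for df2 before conversion; the variable names are my own.

```python
# Medians copied from the describe() outputs (50% row of Salary)
median_annual_df1 = 31500.0    # df1, annual salary
median_monthly_df2 = 2583.33   # df2, before conversion

# If df2 really holds monthly pay, this ratio should sit close to 12
ratio = median_annual_df1 / median_monthly_df2
print(round(ratio, 2))

# Converting a single monthly value (first row of df2) to an annual figure
annual = round(4583.33 * 12, 2)
print(annual)
```

A ratio near 12 supports reading `Monthly Payment` as one twelfth of the annual salary, although it does not prove it for every record.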


### Conflict 4: Misspelling and inconsistency 

In [11]:
# select non-numeric columns
df2_string = df2.select_dtypes(include='string').columns.tolist()
    
# convert all string to lowercase and strip trailing and leading spaces
df2 = df2.convert_dtypes().applymap(lambda s: s.lower().strip() if type(s) == str else s)  

In [12]:
print(f'NUMBER OF CATEGORIES ContractType of df1: {df1.ContractType.nunique()}; \n\nUNIQUE NAMES OF THE CATEGORIES {df1.ContractType.unique()}\n\n\n')

NUMBER OF CATEGORIES ContractType of df1: 3; 

UNIQUE NAMES OF THE CATEGORIES ['non-specified' 'part_time' 'full_time']





In [13]:
print(f'NUMBER OF CATEGORIES ContractType of df2: {df2.ContractType.nunique()}; \n\nUNIQUE NAMES OF THE CATEGORIES {df2.ContractType.unique()}\n\n\n')

NUMBER OF CATEGORIES ContractType of df2: 5; 

UNIQUE NAMES OF THE CATEGORIES [1.  0.4 0.8 0.2 0.6]





In [14]:
# vectorised mapping of ContractType: an FTE of 1 is full time, anything else is part time
df2['ContractType'] = np.where(df2['ContractType'] == 1, 'full_time', 'part_time')
        
print(f'NUMBER OF CATEGORIES: {df2.ContractType.nunique()}; \n\nUNIQUE NAMES OF THE CATEGORIES {df2.ContractType.unique()}\n\n\n')

NUMBER OF CATEGORIES: 2; 

UNIQUE NAMES OF THE CATEGORIES ['full_time' 'part_time']





In [15]:
print(f'NUMBER OF CATEGORIES ContractTime of df2: {df2.ContractTime.nunique()}; \n\nUNIQUE NAMES OF THE CATEGORIES {df2.ContractTime.unique()}\n\n\n')

NUMBER OF CATEGORIES ContractTime of df2: 2; 

UNIQUE NAMES OF THE CATEGORIES [<NA> 'permanent' 'fixed term contract']





In [16]:
print(f'NUMBER OF CATEGORIES ContractTime of df1: {df1.ContractTime.nunique()}; \n\nUNIQUE NAMES OF THE CATEGORIES {df1.ContractTime.unique()}\n\n\n')

NUMBER OF CATEGORIES ContractTime of df1: 3; 

UNIQUE NAMES OF THE CATEGORIES ['non-specified' 'permanent' 'contract']





In [17]:
df2.loc[df2['ContractTime'] == 'fixed term contract', "ContractTime"] = "contract" # map 'fixed term contract' to 'contract' to match df1

# double check
print(f'NUMBER OF CATEGORIES: {df2.ContractTime.nunique()}; \n\nUNIQUE NAMES OF THE CATEGORIES {df2.ContractTime.unique()}\n\n\n')

NUMBER OF CATEGORIES: 2; 

UNIQUE NAMES OF THE CATEGORIES [<NA> 'permanent' 'contract']





### Conflict 5: Identify and Impute null values

In [18]:
# fill missing values in the categorical columns with 'non-specified'
missing_categorical_columns = ['Title', 'Location', 'Company', 'ContractType', 'ContractTime', 'SourceName']

# loop through the categorical columns and impute 'non-specified' in both dataframes
for col in missing_categorical_columns:
    df1[col] = df1[col].fillna('non-specified')
    df2[col] = df2[col].fillna('non-specified')

In [19]:
def missing_percentage(df):
    """This function takes a DataFrame(df) as input and returns two columns, total missing values and total missing values percentage"""
    total = df.isnull().sum().sort_values(ascending=False)[df.isnull().sum().sort_values(ascending=False) != 0]
    percent = round(df.isnull().sum().sort_values(ascending=False) / len(df) * 100, 2)[
        round(df.isnull().sum().sort_values(ascending=False) / len(df) * 100, 2) != 0]
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [20]:
# display missing values in descending
print("Missing values in the dataframe 1 in descending: \n", missing_percentage(df1).sort_values(by='Total', ascending=False))

Missing values in the dataframe 1 in descending: 
 Empty DataFrame
Columns: [Total, Percent]
Index: []


In [21]:
# display missing values in descending
print("Missing values in the dataframe 2 in descending: \n", missing_percentage(df2).sort_values(by='Total', ascending=False))

Missing values in the dataframe 2 in descending: 
 Empty DataFrame
Columns: [Total, Percent]
Index: []
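
Both reports above are empty because every gap has already been imputed. To show what a non-empty report looks like, here is a sketch on a made-up frame. `missing_summary` is a hypothetical compact helper written for this example; it is intended to produce the same two columns as `missing_percentage`, not to replace it.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    total = df.isnull().sum()
    percent = (total / len(df) * 100).round(2)
    summary = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    # keep only columns with at least one missing value, largest first
    return summary[summary['Total'] > 0].sort_values(by='Total', ascending=False)

# Toy frame with gaps in two of its three columns (values are illustrative)
toy = pd.DataFrame({
    'Company': ['acme', None, None, 'initech'],
    'ContractTime': ['permanent', None, 'contract', 'permanent'],
    'Salary': [30000.0, 25000.0, 41000.0, 28000.0],
})

summary = missing_summary(toy)
print(summary)
```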


### Conflict 6: Adapt datatypes across schemas df1 and df2
We've completed the mapping of df1 and df2 to the global schema. Now we make the datatypes consistent between df1 and df2 so the data can be merged.

In [22]:
# print out the summary of 2 datasets
print(df1.info())
print(df2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50751 entries, 0 to 50750
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            50751 non-null  int64  
 1   Title         50751 non-null  object 
 2   Location      50751 non-null  object 
 3   Company       50751 non-null  object 
 4   ContractType  50751 non-null  object 
 5   ContractTime  50751 non-null  object 
 6   Category      50751 non-null  object 
 7   OpenDate      50751 non-null  object 
 8   CloseDate     50751 non-null  object 
 9   SourceName    50751 non-null  object 
 10  Salary        50751 non-null  float64
dtypes: float64(1), int64(1), object(9)
memory usage: 4.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Id            5000 non-null   int64         
 1   Loc

In [23]:
# now adapt the datatypes in df1 to match df2
for y in [c for c in df1.columns if c in df2.columns]: # common columns
    if df1[y].dtype != df2[y].dtype:
        print(f"Column {y} in df1: {df1[y].dtype} to {df2[y].dtype}")
        df1[y] = df1[y].astype(df2[y].dtype)

Column OpenDate in df1: object to datetime64[ns]
Column CloseDate in df1: object to datetime64[ns]


<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>3. Merging data</strong></h3>

In [24]:
# Code to merge two data sets
df = pd.concat([df1,df2])
print (df.shape)
df.sample(3)

(55751, 11)


Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,OpenDate,CloseDate,SourceName,Salary
49568,72460358,"investor, visitor and skills attraction (ivsa)...",aberdeen,4children,non-specified,non-specified,accounting & finance jobs,2013-11-17 00:00:00,2013-12-17 00:00:00,jobsinscotland.com,57500.0
48563,71375968,personal lines executive co antrim,antrim,head hunt international ltd.,non-specified,permanent,accounting & finance jobs,2013-01-14 00:00:00,2013-03-15 00:00:00,nijobs.com,28800.0
2074,2075,senior business analyst sc clearable,guildford,non-specified,full_time,permanent,information technology,2012-01-29 15:00:00,2012-03-29 15:00:00,www.jobhuntlisting.com,36999.96
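
Note that `pd.concat` keeps each frame's original row labels, which is why the sample above shows the label 2074 for a df2 row next to much larger labels from df1. The sketch below, on two made-up frames, shows the repeated labels and how `ignore_index=True` would give a fresh 0..n-1 index instead. Whether that matters here is a judgment call, since the later steps in this notebook do not rely on the index.

```python
import pandas as pd

# Two small illustrative frames, each with its own 0..1 index
a = pd.DataFrame({'Id': [10, 11], 'Title': ['x', 'y']})
b = pd.DataFrame({'Id': [1, 2], 'Title': ['p', 'q']})

kept = pd.concat([a, b])                      # original labels are preserved
fresh = pd.concat([a, b], ignore_index=True)  # labels are rebuilt from 0

print(kept.index.tolist())
print(fresh.index.tolist())
print(kept.index.is_unique, fresh.index.is_unique)
```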


<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>4. Resolving data conflicts:</strong></h3>

### Conflict 1: Check duplications
Check for duplicated rows, i.e. rows where every column has the same value.

In [25]:
### check duplication
duplicates = df[df.duplicated(keep=False)] # showing all duplicated records
print ("There are "+ str(len(duplicates)) + " duplicate records found")
duplicates.sort_values(by=duplicates.columns.tolist()).head(10)

There are 0 duplicate records found


Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,OpenDate,CloseDate,SourceName,Salary


### Conflict 2: Finding a global key for the data

+ The global key for the integrated data is the combination of 'Location', 'Company', 'ContractType', 'ContractTime', 'Category' and 'CloseDate'.

+ A job position can be posted on several websites, and the same position can be posted at different dates and times, while its closing datetime usually stays fixed. Two advertisements are treated as different jobs when they differ in location, company, category, contract type or contract time.

In [26]:
# check duplication by unique keys
cols = ['Location','Company','ContractType','ContractTime','Category','CloseDate']
duplicates = df[df.duplicated(cols,keep=False)]
print ("There are "+ str(len(duplicates)) + " duplicate records found")
duplicates.sort_values(by=cols).head(10)

There are 38 duplicate records found


Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,OpenDate,CloseDate,SourceName,Salary
40746,67959738,domiciliary care worker jobs bexleyheath,london,-,non-specified,permanent,healthcare & nursing jobs,2013-10-15 15:00:00,2013-12-14 15:00:00,careworx.co.uk,25680.0
47651,69266144,account therapy specialist oncology and infec...,london,-,non-specified,permanent,healthcare & nursing jobs,2013-09-15 15:00:00,2013-12-14 15:00:00,emedcareers.com,32500.0
49062,69502456,team manager central london up to ****k bonus,london,non-specified,non-specified,non-specified,accounting & finance jobs,2012-05-18 15:00:00,2012-07-17 15:00:00,londonjobs.co.uk,10000.0
50579,72550738,senior internal auditor global insurance company,london,non-specified,non-specified,non-specified,accounting & finance jobs,2012-06-17 15:00:00,2012-07-17 15:00:00,onlineinsurancejobs.co.uk,33600.0
50291,69072514,pension administrator,london,non-specified,non-specified,non-specified,accounting & finance jobs,2012-10-28 12:00:00,2012-12-27 12:00:00,professionalpensionsjobs.com,15744.0
50375,72599332,management accountant (3 days),london,non-specified,non-specified,non-specified,accounting & finance jobs,2012-10-28 12:00:00,2012-12-27 12:00:00,jobsfinancial.com,50000.0
49789,68070981,operational audit senior,london,non-specified,non-specified,non-specified,accounting & finance jobs,2013-09-20 15:00:00,2013-11-19 15:00:00,careersinaudit.com,31000.0
50053,71342845,credit control/sales ledger,london,non-specified,non-specified,non-specified,accounting & finance jobs,2013-11-05 15:00:00,2013-11-19 15:00:00,jobsincredit.com,35000.0
49071,69734378,support desk operator,london,non-specified,non-specified,non-specified,it jobs,2012-09-26 00:00:00,2012-11-25 00:00:00,londonjobs.co.uk,33600.0
49900,68854695,purley opticians manager,london,non-specified,non-specified,non-specified,it jobs,2012-08-27 00:00:00,2012-11-25 00:00:00,emptylemon.co.uk,15000.0


### Conflict 3: Drop duplicates based on the global key

In [27]:
print(df.shape)
df = df.drop_duplicates(cols, keep='last')
df.shape

(55751, 11)


(55732, 11)
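
The two counts fit together: `duplicated(..., keep=False)` flagged 38 rows, while `drop_duplicates(..., keep='last')` removed 55751 - 55732 = 19 rows, which is what one would expect if the 38 rows form 19 pairs. The sketch below reproduces the same behaviour on a made-up frame; the toy rows and the shortened `key` list are assumptions for illustration.

```python
import pandas as pd

# Rows 1 and 2 share the same key; rows 3 and 4 differ in location or company
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Location': ['london', 'london', 'kent', 'london'],
    'Company': ['acme', 'acme', 'acme', 'initech'],
    'CloseDate': pd.to_datetime(['2013-12-14'] * 4),
})
key = ['Location', 'Company', 'CloseDate']

flagged = toy[toy.duplicated(key, keep=False)]   # every member of a duplicate group
deduped = toy.drop_duplicates(key, keep='last')  # only the last member of each group survives

print(len(flagged), len(toy) - len(deduped))
print(deduped['Id'].tolist())
```

With `keep='last'`, which record survives depends on row order, so in the integrated frame a df2 record wins over a matching df1 record simply because df2 was concatenated second.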

<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>5. Saving the integrated and reshaped data</strong></h3>


The last part of the integration process is to export the output data to CSV format, named:
- `dataset_integrated.csv`

In [28]:
# code to save output data

# names of the saved files
csv_integrated = 'dataset_integrated.csv'

# To write the data from data frame into a CSV file, use the to_csv function.
df.to_csv(csv_integrated, index=False)

# print out the success message
print(f'Saved the output data to csv format as {csv_integrated} successfully!\n\n')

Saved the output data to csv format as dataset_integrated.csv successfully!
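
A CSV file stores dates as plain text, so anyone reading the file back has to parse the date columns again. The following sketch shows a round trip on a made-up frame written to a temporary directory; the file name `toy_integrated.csv` and the toy values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame with one datetime column
toy = pd.DataFrame({
    'Id': [1, 2],
    'OpenDate': pd.to_datetime(['2013-01-01 12:00:00', '2013-02-01 00:00:00']),
    'Salary': [30000.0, 25000.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_integrated.csv')  # hypothetical file name
    toy.to_csv(path, index=False)
    # ask read_csv to turn the text dates back into datetimes
    reloaded = pd.read_csv(path, parse_dates=['OpenDate'])

print(reloaded.dtypes)
print(reloaded.shape == toy.shape)
```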




<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>Summary</strong></h3>

In this task, I mostly focused on data integrity: making sure the two datasets were concatenated correctly and identifying duplicated records using a global key. This was quite challenging for me because I need more domain knowledge of job-hunting websites to understand what employers post on them and how job hunters search for positions.

<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>References</strong></h3>

+ [Concatenate pandas objects](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
+ [Drop duplications](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)