# Data Wrangling Project
## Cleaning a Job Postings Data Set 
#### Scott Lee

Date: 14/06/2021

Version: 1.0

Environment: Python 3.8.3 and Jupyter notebook

Libraries used:
* pandas 
* numpy

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re 

## Task 2: Integration

In [2]:
df_2 = pd.read_csv('dataset2.csv')
df_1 = pd.read_csv('dataset1_solution.csv')

Exploratory analysis reveals several key inconsistencies:
- There is no Id column in df2
- Some columns hold the same data but are named differently
- Salary per month in df2 is equal to 1/12 of Salary in df1
<br>
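
The first two points can be surfaced by comparing column names directly. The following is a minimal sketch, not the analysis actually run for this notebook; `left` and `right` are hypothetical empty frames that only mimic some of the column names seen here.

```python
import pandas as pd

# Hypothetical stand-ins that mimic a few of the column names in df1 and df2
left = pd.DataFrame(columns=['Id', 'Title', 'Company', 'Salary'])
right = pd.DataFrame(columns=['Job Title', 'Organisation', 'Salary per month'])

# Columns that appear in only one of the two frames
only_left = sorted(set(left.columns) - set(right.columns))
only_right = sorted(set(right.columns) - set(left.columns))

print('Only in left :', only_left)
print('Only in right:', only_right)
```

Columns listed on one side only are candidates for renaming, adding, or dropping before the two frames are combined.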

An Id field is added to df2, sequentially increasing from the highest value in df1. This gives every entry in the final dataframe a unique value.

In [3]:
#Adding an Id column, numbered sequentially from 72705244
df_2['Id'] = range(72705244, 72705244 + len(df_2))
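
The starting value in the cell above is hard-coded. As a sketch of how it could be derived instead, assuming the Ids in the first frame are integers, the next free Id is one more than the current maximum. The frames `first` and `second` and their values are hypothetical, not the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one with existing Ids, one without an Id column
first = pd.DataFrame({'Id': [101, 205, 310]})
second = pd.DataFrame({'Opening': ['2013-10-06', '2012-10-03', '2012-01-01']})

# Start one above the largest existing Id so the new Ids cannot collide
start_id = int(first['Id'].max()) + 1
second['Id'] = np.arange(start_id, start_id + len(second))

print(second['Id'].tolist())
```

Deriving the start value this way keeps the numbering correct if the first data set changes.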

In [4]:
#Renaming columns in df2 to be consistent with those in df1
df_2 = df_2.rename(columns={
    'Job Title': 'Title',
    'Opening': 'OpenDate',
    'Closing': 'CloseDate',
    'Organisation': 'Company',
})


In [5]:
#Adjusting 'Salary per month' in df2 to be consistent with the annual 'Salary' in df1
#and dropping the monthly column once the annual value is stored as 'Salary'

df_2['Salary'] = df_2['Salary per month'] * 12
df_2.drop('Salary per month', axis = 1, inplace = True)
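
One way to sanity-check a unit conversion like this is to compare the typical salary in both sources before and after rescaling. The sketch below uses made-up figures (`annual` and `monthly` are hypothetical series, not the assignment data) and only illustrates the idea.

```python
import pandas as pd

# Hypothetical values: annual salaries from one source, monthly from the other
annual = pd.Series([25000, 30000, 27500], name='Salary')
monthly = pd.Series([2800, 2917, 3750], name='Salary per month')

# Rescale monthly figures to annual ones
converted = monthly * 12

# Ratio of medians before and after the conversion
ratio_before = annual.median() / monthly.median()
ratio_after = annual.median() / converted.median()

print(round(ratio_before, 2), round(ratio_after, 2))
```

A ratio far from 1 before the conversion and of order 1 afterwards suggests the two columns are now on the same scale.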

In [6]:
#'Contract Type' in df2 holds the same information as 'ContractTime' in df1, so it is renamed
df_2['ContractTime'] = df_2['Contract Type']
df_2.drop('Contract Type', axis = 1, inplace = True)

#No data in df2 available for contract type so designating as 'non_specified'
df_2['ContractType'] = 'non_specified'

#No data in df2 available for source name so designating as 'non-specified'
df_2['SourceName'] = "non-specified"

#Storing Id as a string
df_2['Id'] = df_2['Id'].astype(str)


In [7]:
#Keeping only the first 8 characters of each Id (drops any trailing '.0' left by the string conversion)
df_2['Id'] = df_2['Id'].str[:8]
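
If an identifier column ends up stored as floats (which can happen when a new column is filled row by row through `.loc`), converting it to text yields values such as `72705244.0`. Slicing to eight characters removes the suffix, but only when every Id has exactly eight digits. The sketch below shows a length-independent alternative on a hypothetical frame `ids`; the short third value is included only to show the difference.

```python
import pandas as pd

# Hypothetical float-typed Ids, one of them shorter than eight digits
ids = pd.DataFrame({'Id': [72705244.0, 72705245.0, 123.0]})

# Slicing the text form relies on every Id having eight digits
sliced = ids['Id'].astype(str).str[:8]

# Casting to integer first removes the decimal part for any length
via_int = ids['Id'].astype('int64').astype(str)

print(sliced.tolist())
print(via_int.tolist())
```

The integer cast assumes there are no missing values in the column; with missing values it would raise an error rather than silently producing a wrong Id.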


In [8]:
#Re-organising df2 to match the layout of df1
df_2 = df_2[['Id', 'Title','Location', 'Company', 'ContractType','ContractTime','Category','Salary','OpenDate','CloseDate','SourceName']]
df_2.head()


Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,Salary,OpenDate,CloseDate,SourceName
0,72705244,Aviation loans administration,London,cer Financial,non_specified,Contract,Finance and Accounting,33600,2013-10-06 00:00:00,2013-12-05 00:00:00,non-specified
1,72705245,"Payroll Analyst City upto **** , ****",London,LMA Recruitment Ltd,non_specified,Permanent,Finance and Accounting,35004,2012-10-03 12:00:00,2012-11-02 12:00:00,non-specified
2,72705246,Investment Team Assistant for leading Private ...,London,Austin Andrew Ltd,non_specified,Permanent,Finance and Accounting,45000,2012-01-01 00:00:00,2012-01-31 00:00:00,non-specified
3,72705247,SWAPS COLLATERAL CONTROL OFFICER,City,Brian Durham Recruitment Services Limited,non_specified,Permanent,Finance and Accounting,39996,2012-10-14 00:00:00,2012-11-13 00:00:00,non-specified
4,72705248,Loans Administration Temp,London,cer Financial,non_specified,Contract,Finance and Accounting,39360,2012-11-17 12:00:00,2013-01-16 12:00:00,non-specified


## The Global Key

Id is the global key: it gives every posting a unique value across both data sets.
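
A global key only works if its values are unique across both sources and share one type. The sketch below illustrates such a check on two small hypothetical frames (`a` and `b` are not the assignment data). An integer 101 and a string '101' compare as different values, so both keys are cast to strings before checking.

```python
import pandas as pd

# Hypothetical frames: integer keys in one, string keys in the other
a = pd.DataFrame({'Id': [101, 102, 103]})
b = pd.DataFrame({'Id': ['104', '105', '106']})

# Bring both keys to one dtype so that 101 and '101' would be seen as a clash
a['Id'] = a['Id'].astype(str)
b['Id'] = b['Id'].astype(str)

# Keys present in both frames, and uniqueness of the combined key column
overlap = set(a['Id']) & set(b['Id'])
combined = pd.concat([a, b])

print('Overlap:', overlap)
print('Unique key:', combined['Id'].is_unique)
```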

In [9]:
#Concatenating df1 and df2
dataset_integrated = pd.concat([df_1, df_2])
dataset_integrated

Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,Salary,OpenDate,CloseDate,SourceName
0,12612628,ENGINEERING SYSTEMS ANALYST,DORKING,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,25000.0,2013-07-08 12:00:00,2013-09-06 12:00:00,cv-library.co.uk
1,12612830,STRESS ENGINEER,GLASGOW,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,30000.0,2012-01-30 00:00:00,2012-03-30 00:00:00,cv-library.co.uk
2,12612844,MODELLING AND SIMULATION ANALYST,HAMPSHIRE,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,30000.0,2012-12-21 15:00:00,2013-01-20 15:00:00,cv-library.co.uk
3,12613049,ENGINEERING SYSTEMS ANALYST / MATHEMATICAL MOD...,SURREY,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,27500.0,2013-12-08 15:00:00,2014-02-06 15:00:00,cv-library.co.uk
4,12613647,PIONEER MISER ENGINEERING SYSTEMS ANALYST,SURREY,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,25000.0,2013-03-02 12:00:00,2013-05-01 12:00:00,cv-library.co.uk
...,...,...,...,...,...,...,...,...,...,...,...
329,72705573,Financial Control Officer Regulatory Reporting,The City,Hays Financial Markets,non_specified,Permanent,Finance and Accounting,39996.0,2013-05-31 00:00:00,2013-08-29 00:00:00,non-specified
330,72705574,Management Accountant London,The City,Morgan McKinley Group Limited,non_specified,Contract,Finance and Accounting,39000.0,2012-11-11 12:00:00,2012-11-25 12:00:00,non-specified
331,72705575,Risk Manager FSA / Regulatory Frameworks,Bath,Incite Solutions Ltd,non_specified,Permanent,Finance and Accounting,37500.0,2013-03-26 15:00:00,2013-05-25 15:00:00,non-specified
332,72705576,Relationship / Sales executive Multilingual,London,C.K.R. Recruitment Limited,non_specified,Permanent,"PR, Advertising and Marketing",30000.0,2013-07-14 15:00:00,2013-09-12 15:00:00,non-specified
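
The row labels in the output above restart for the rows that came from df_2 (the last label shown is 332), because `pd.concat` keeps each frame's own index by default. Passing `ignore_index=True` renumbers the rows instead. A minimal sketch with hypothetical frames `top` and `bottom`:

```python
import pandas as pd

# Hypothetical frames, each with its own default 0..n-1 index
top = pd.DataFrame({'Title': ['A', 'B']})
bottom = pd.DataFrame({'Title': ['C', 'D']})

# Default behaviour keeps the original labels, so they repeat
kept = pd.concat([top, bottom])

# ignore_index=True assigns a fresh 0..n-1 index to the result
renumbered = pd.concat([top, bottom], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels do not affect the exported CSV when `index = False` is used, but they can make label-based selection with `.loc` return more than one row.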


## Saving Data

## Export to CSV

In [10]:
dataset_integrated.to_csv('dataset_integrated.csv', index = False)

In [13]:
#Check the output is as expected:
check_df = pd.read_csv('dataset_integrated.csv')
check_df

Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,Salary,OpenDate,CloseDate,SourceName
0,12612628,ENGINEERING SYSTEMS ANALYST,DORKING,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,25000.0,2013-07-08 12:00:00,2013-09-06 12:00:00,cv-library.co.uk
1,12612830,STRESS ENGINEER,GLASGOW,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,30000.0,2012-01-30 00:00:00,2012-03-30 00:00:00,cv-library.co.uk
2,12612844,MODELLING AND SIMULATION ANALYST,HAMPSHIRE,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,30000.0,2012-12-21 15:00:00,2013-01-20 15:00:00,cv-library.co.uk
3,12613049,ENGINEERING SYSTEMS ANALYST / MATHEMATICAL MOD...,SURREY,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,27500.0,2013-12-08 15:00:00,2014-02-06 15:00:00,cv-library.co.uk
4,12613647,PIONEER MISER ENGINEERING SYSTEMS ANALYST,SURREY,GREGORY MARTIN INTERNATIONAL,non_specified,permanent,Engineering Jobs,25000.0,2013-03-02 12:00:00,2013-05-01 12:00:00,cv-library.co.uk
...,...,...,...,...,...,...,...,...,...,...,...
55498,72705573,Financial Control Officer Regulatory Reporting,The City,Hays Financial Markets,non_specified,Permanent,Finance and Accounting,39996.0,2013-05-31 00:00:00,2013-08-29 00:00:00,non-specified
55499,72705574,Management Accountant London,The City,Morgan McKinley Group Limited,non_specified,Contract,Finance and Accounting,39000.0,2012-11-11 12:00:00,2012-11-25 12:00:00,non-specified
55500,72705575,Risk Manager FSA / Regulatory Frameworks,Bath,Incite Solutions Ltd,non_specified,Permanent,Finance and Accounting,37500.0,2013-03-26 15:00:00,2013-05-25 15:00:00,non-specified
55501,72705576,Relationship / Sales executive Multilingual,London,C.K.R. Recruitment Limited,non_specified,Permanent,"PR, Advertising and Marketing",30000.0,2013-07-14 15:00:00,2013-09-12 15:00:00,non-specified
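
Beyond inspecting the reloaded frame by eye, the round trip can be checked with a few assertions. The sketch below writes a small hypothetical frame to an in-memory buffer rather than to disk; reading `Id` back with `dtype={'Id': str}` is an assumption made here so that the key keeps its text form.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the integrated data
frame = pd.DataFrame({
    'Id': ['101', '102'],
    'Title': ['Analyst', 'Engineer'],
    'Salary': [25000.0, 30000.0],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={'Id': str})

# The reloaded frame should have the same shape, columns and keys
assert reloaded.shape == frame.shape
assert list(reloaded.columns) == list(frame.columns)
assert reloaded['Id'].tolist() == frame['Id'].tolist()
print('Round trip OK')
```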


In [12]:
from platform import python_version

print(python_version())

3.8.3


## Summary
<br>
The data challenges and the wrangling solutions are detailed in the report.