# Data Cleaning Notebook
`SCGE133- 6502062, 6502097, 6502107, 6502114`

---
In this notebook:

- Kaggle API setup
- Data download
- Data cleaning
- Addressing data imbalance
- Data export

### Kaggle API Setup

In [1]:
!pip install -q kaggle
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"urseamlaccs","key":"<redacted>"}'}

In [2]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
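
The same credential setup can be written in Python, which keeps the step safe to re-run. This is a minimal sketch, assuming `kaggle.json` has already been uploaded to the working directory; the helper name `install_kaggle_credentials` and its parameters are hypothetical and not part of the Kaggle client.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", home=None):
    """Copy the API token to <home>/.kaggle/kaggle.json with owner-only access.

    Hypothetical helper: `src` and `home` are illustrative parameters.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return target
```

Calling `install_kaggle_credentials()` with no arguments mirrors the three shell commands above. Clearing the cell output after `files.upload()` keeps the API key out of the saved notebook.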

### Data download and cleaning

In [3]:
!kaggle datasets download jobspikr/30000-latest-healthcare-jobs-emedcareers-europe

Downloading 30000-latest-healthcare-jobs-emedcareers-europe.zip to /content
 94% 21.0M/22.3M [00:01<00:00, 15.4MB/s]
100% 22.3M/22.3M [00:01<00:00, 12.0MB/s]


In [4]:
!unzip /content/30000-latest-healthcare-jobs-emedcareers-europe.zip

Archive:  /content/30000-latest-healthcare-jobs-emedcareers-europe.zip
  inflating: emed_careers_eu.csv     


In [5]:
import pandas as pd
data = pd.read_csv('emed_careers_eu.csv').drop(columns=['salary_offered', 'post_date']).dropna()
data

| | category | company_name | job_description | job_title | job_type | location |
|---|---|---|---|---|---|---|
| 0 | Clinical Research | PPD GLOBAL LTD | As part of our on-going growth, we are current... | Senior / Medical Writer (Regulatory) | Permanent | Cambridge |
| 1 | Science | AL Solutions | Manager of Biometrics – Italy\nAL Solutions ar... | Manager of Biometrics | Permanent | Europe |
| 2 | Science | Seltek Consultants Ltd | A fantastic opportunity has arisen for an expe... | Field Service Engineer \| Chromatography | Permanent | UK |
| 3 | Data Management and Statistics | Docs International UK Limited | Job Details\n:\nUtilise extensive clinical dat... | Data Manager of Project Management | Permanent | M4 Corridor |
| 4 | Science | Hyper Recruitment Solutions Ltd | Hyper Recruitment Solutions are currently look... | Strategic Market Analyst | Permanent | Cambridge |
| ... | ... | ... | ... | ... | ... | ... |
| 29995 | Clinical Research | Covance Clinical Services Ltd | Job Summary:\nSenior Clinical Research Associa... | Clinical Research Associate II | Permanent | Europe |
| 29996 | Regulatory Affairs | Discover People International Limited | I am currently representing a Global Biopharma... | CMC lead | Permanent | Europe |
| 29997 | Clinical Research | Skills Alliance (Pharma) Limited | Seeking a Clinical Project Manager/Senior Cli... | Clinical Project Manager | Permanent | UK |
| 29998 | science | Hyper Recruitment Solutions Ltd | Senior Scientist Sports Screening\nJob purpos... | Senior Scientist Sports Screening | Permanent | cambridge |


In [6]:
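# Text-cleaning helpers
import re
import string

# Matches runs of characters from common Unicode emoji blocks.
emoji_pattern = re.compile("["
                        "\U0001F600-\U0001F64F"  # emoticons
                        "\U0001F300-\U0001F5FF"  # symbols & pictographs
                        "\U0001F680-\U0001F6FF"  # transport & map symbols
                        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

def clean_text(text):
    """Lowercase text and strip brackets, punctuation, digit tokens, curly quotes, newlines and emoji."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # bracketed spans
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # tokens containing a digit
    text = re.sub('[‘’“”…]', '', text)  # curly quotes and ellipsis
    text = re.sub('\n', '', text)  # newlines are removed without a space, so adjacent words are joined
    text = emoji_pattern.sub('', text)
    return text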
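
Because `clean_text` deletes punctuation and newlines outright, words on either side get fused (for example `italy\nal` becomes `italyal` in the cleaned output). The sketch below is a hypothetical variant, not the notebook's method: it replaces those characters with a space and then collapses whitespace. The name `clean_text_spaced` is an assumption, and emoji removal is omitted for brevity.

```python
import re
import string

# Map every ASCII punctuation character to a space, so "on-going" becomes "on going".
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text_spaced(text):
    """Hypothetical variant of clean_text that preserves word boundaries."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)      # drop bracketed spans
    text = text.translate(_PUNCT_TO_SPACE)    # punctuation -> space
    text = re.sub(r"\w*\d\w*", " ", text)     # drop tokens containing a digit
    text = re.sub(r"[‘’“”…]", " ", text)      # curly quotes and ellipsis -> space
    return re.sub(r"\s+", " ", text).strip()  # newlines and space runs -> one space


print(clean_text_spaced("Job Details\n:\nUtilise extensive clinical data"))
```

Whether separated or fused tokens work better depends on the tokenizer used downstream, so this is a design choice rather than a strict fix.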

In [7]:
def clean(df):
    """Apply clean_text to every column of df in place and return df."""
    for column in df:
        df[column] = df[column].apply(clean_text)
    return df

In [8]:
newdata = clean(data)
newdata

| | category | company_name | job_description | job_title | job_type | location |
|---|---|---|---|---|---|---|
| 0 | clinical research | ppd global ltd | as part of our ongoing growth we are currently... | senior medical writer regulatory | permanent | cambridge |
| 1 | science | al solutions | manager of biometrics – italyal solutions are ... | manager of biometrics | permanent | europe |
| 2 | science | seltek consultants ltd | a fantastic opportunity has arisen for an expe... | field service engineer chromatography | permanent | uk |
| 3 | data management and statistics | docs international uk limited | job detailsutilise extensive clinical data man... | data manager of project management | permanent | corridor |
| 4 | science | hyper recruitment solutions ltd | hyper recruitment solutions are currently look... | strategic market analyst | permanent | cambridge |
| ... | ... | ... | ... | ... | ... | ... |
| 29995 | clinical research | covance clinical services ltd | job summarysenior clinical research associate ... | clinical research associate ii | permanent | europe |
| 29996 | regulatory affairs | discover people international limited | i am currently representing a global biopharma... | cmc lead | permanent | europe |
| 29997 | clinical research | skills alliance pharma limited | seeking a clinical project managersenior clin... | clinical project manager | permanent | uk |
| 29998 | science | hyper recruitment solutions ltd | senior scientist sports screeningjob purposew... | senior scientist sports screening | permanent | cambridge |


### Addressing data imbalance (EDA)

In [17]:
# Reload the filtered file written by the export cell (In [14]).
newdata = pd.read_csv('emed_careers_cleaned_dropped.csv')

In [18]:
import numpy as np
import matplotlib.pyplot as plt
cat = newdata['category'].value_counts()
jtype = newdata['job_type'].value_counts()

In [19]:
cat

pharmaceutical healthcare and medical sales    7594
clinical research                              5227
science                                        4890
manufacturing  operations                      3802
regulatory affairs                             1984
pharmaceutical marketing                       1750
data management and statistics                 1392
qualityassurance                               1150
medical information and pharmacovigilance      1011
medical affairs  pharmaceutical physician       772
pharmacy                                         47
Name: category, dtype: int64

In [20]:
jtype

permanent            26521
contractinterim       2319
contracttemp           515
temporaryseasonal      185
any                     43
parttime                36
Name: job_type, dtype: int64

Counting values in the `category` column shows:
- Heavy class imbalance: the largest category has 7,594 postings, while the smallest (`pharmacy`) has 47.
- Up to 428 rows with irrelevant category values.

Counting values in the `job_type` column shows:
- Heavy class imbalance: most postings (26,521 of 29,619) are permanent jobs.
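
The imbalance can be put into numbers directly from the counts printed above. The snippet below is a small sketch that copies the `category` counts by hand from the `value_counts()` output; the variable names are illustrative.

```python
# Category counts copied from the value_counts() output above.
category_counts = {
    "pharmaceutical healthcare and medical sales": 7594,
    "clinical research": 5227,
    "science": 4890,
    "manufacturing  operations": 3802,
    "regulatory affairs": 1984,
    "pharmaceutical marketing": 1750,
    "data management and statistics": 1392,
    "qualityassurance": 1150,
    "medical information and pharmacovigilance": 1011,
    "medical affairs  pharmaceutical physician": 772,
    "pharmacy": 47,
}

total = sum(category_counts.values())
largest = max(category_counts, key=category_counts.get)
smallest = min(category_counts, key=category_counts.get)

# Ratio between the most and the least frequent class.
imbalance_ratio = category_counts[largest] / category_counts[smallest]

# Share of all rows held by each class.
shares = {name: count / total for name, count in category_counts.items()}

print(f"{largest!r} vs {smallest!r}: ratio {imbalance_ratio:.1f}")
for name, share in shares.items():
    print(f"{name:45s} {share:6.1%}")
```

With these counts the largest class is roughly 162 times the size of the smallest, which is the gap any resampling step has to close.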


### Fixing imbalanced and irrelevant data with oversampling and row filtering

In [13]:
# Drop rows whose category is a country name rather than a job category.
irrelevant_categories = ['switzerland', 'uk', 'germany', 'italy', 'spain', 'france']
newdata = newdata[~newdata['category'].isin(irrelevant_categories)]

In [14]:
# index=False avoids writing the row index as an extra 'Unnamed: 0' column.
newdata.to_csv('emed_careers_cleaned_dropped.csv', index=False)
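
The cells above only filter out irrelevant rows; the oversampling step itself is not shown. The following is a minimal sketch of plain random oversampling with pandas, offered as one possible approach rather than the notebook's actual method. The function name `oversample_to_max` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def oversample_to_max(df, label_col, random_state=0):
    """Randomly duplicate rows so every class reaches the size of the largest class."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)
        extra = target - len(group)
        if extra > 0:
            # Draw the missing rows from the same class, with replacement.
            parts.append(group.sample(n=extra, replace=True, random_state=random_state))
    # Shuffle so duplicated rows are not grouped by class.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy frame standing in for the cleaned job postings.
toy = pd.DataFrame({
    "job_description": ["a", "b", "c", "d", "e"],
    "category": ["science", "science", "science", "pharmacy", "clinical research"],
})
balanced = oversample_to_max(toy, "category")
print(balanced["category"].value_counts())
```

Oversampling is best applied to the training split only, after the train/test split, so that duplicated rows do not leak into evaluation data.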

(Strategy from this point onward)
# Input - Output Shapes
---

Input: job description (text)

Output: array of
- category (text -> one-hot array)
- job type (text -> one-hot array)
- job title (text)
- location (text)
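
The input/output layout above can be sketched in plain Python. The label vocabularies are copied from the `value_counts()` output earlier in the notebook; the helper names `one_hot` and `encode_target` are hypothetical, and how the free-text outputs (job title, location) are modeled is left open.

```python
# Label vocabularies copied from the value_counts() output earlier in the notebook.
CATEGORIES = [
    "pharmaceutical healthcare and medical sales",
    "clinical research",
    "science",
    "manufacturing  operations",
    "regulatory affairs",
    "pharmaceutical marketing",
    "data management and statistics",
    "qualityassurance",
    "medical information and pharmacovigilance",
    "medical affairs  pharmaceutical physician",
    "pharmacy",
]
JOB_TYPES = [
    "permanent",
    "contractinterim",
    "contracttemp",
    "temporaryseasonal",
    "any",
    "parttime",
]


def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown label: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


def encode_target(row):
    """Build the output structure for one posting (hypothetical helper)."""
    return {
        "category": one_hot(row["category"], CATEGORIES),
        "job_type": one_hot(row["job_type"], JOB_TYPES),
        "job_title": row["job_title"],  # kept as text
        "location": row["location"],    # kept as text
    }


# First row of the cleaned table shown earlier (description truncated as displayed).
row = {
    "job_description": "as part of our ongoing growth we are currently...",
    "category": "clinical research",
    "job_type": "permanent",
    "job_title": "senior medical writer regulatory",
    "location": "cambridge",
}
target = encode_target(row)
print(target["category"], target["job_type"])
```

Fixing the vocabulary order up front keeps the one-hot positions stable between training and inference.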