# Data Preparation

1. [Read all the files](#Read-all-the-files)
2. [Concat into a single pandas object](#Concat-into-a-single-pandas-object)
    * Remove duplicates
    * Drop NAs for job title
3. [Clean Job Titles](#Clean-Job-Titles)
4. [Save as a pickle](#Save-as-a-pickle)
5. [Example how to use pickle](#Example-how-to-use-pickle)

## Read all the files

In [36]:
import os

reviews_path = "../data/Database/"

# Keep only files with a literal ".csv" extension
files = [os.path.join(reviews_path, name) for name in os.listdir(reviews_path) if name.endswith(".csv")]
display(files)

['../data/Database/construction.csv',
 '../data/Database/computer_software.csv',
 '../data/Database/automotive.csv',
 '../data/Database/higher_ed.csv',
 '../data/Database/retail.csv',
 '../data/Database/accounting.csv',
 '../data/Database/IT_services.csv',
 '../data/Database/financial.csv',
 '../data/Database/education_management.csv',
 '../data/Database/hospital_care.csv']
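
`os.listdir` returns entries in arbitrary order, which is why the listing above is not alphabetical. A minimal sketch of an alternative, assuming the same flat directory of CSV files, uses `pathlib.Path.glob` with sorting. The helper name `list_csv_files` and the file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

def list_csv_files(directory):
    """Return the .csv files directly inside `directory`, sorted by name."""
    # glob("*.csv") matches only names ending in ".csv";
    # sorted() makes the order reproducible between runs and machines.
    return sorted(Path(directory).glob("*.csv"))

# Demonstration on a throwaway directory (made-up file names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["retail.csv", "automotive.csv", "notes.txt"]:
        (Path(tmp) / name).touch()
    csv_paths = list_csv_files(tmp)
    found = [p.name for p in csv_paths]
    # Path.stem drops the extension, which gives an industry-style label.
    stems = [p.stem for p in csv_paths]

print(found)  # ['automotive.csv', 'retail.csv']
print(stems)  # ['automotive', 'retail']
```
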

## Concat into a single pandas object

In [37]:
import pandas as pd

li = []

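# Load every file and tag its rows with the industry taken from the file name
for f in files:
    data = pd.read_csv(f)
    # e.g. "../data/Database/retail.csv" -> "retail"
    data["industry"] = os.path.splitext(os.path.basename(f))[0]
    li.append(data)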
    
reviews = pd.concat(li, axis=0, ignore_index=True)

reviews.drop(columns=["Unnamed: 0"], inplace=True)

reviews.drop_duplicates(inplace=True)

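# Drop rows with a missing job title (dropna on the column alone would leave the frame unchanged)
reviews.dropna(subset=["job_title"], inplace=True)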

print(reviews.shape)
display( reviews.head() )

(2350814, 13)


Unnamed: 0,company_name,review_title,job_title,employee_status,location,date,review,pros,cons,rating,yes_helpful,no_helpful,industry
0,Fluor Corp.,fun place to work,Principal Process Engineering Technician,Former Employee,"Sugar Land, TX",2020-02-03,"It is a 9 hour day, but if you are busy, it go...",,,5.0,0,0,construction
1,Fluor Corp.,Im satisfied,Journeyman Pipefitter,Former Employee,"Crystal River, FL",2020-04-05,My boss are very understanding on the situatio...,,,5.0,0,0,construction
2,Fluor Corp.,New and intellectually stimulating work.,Systems Engineer and IT Specialist,Current Employee,"Aliso Viejo, CA",2020-04-05,Great place to work with great people. The ch...,,,5.0,0,0,construction
3,Fluor Corp.,For me.... GREAT group of people!,General Foreman - Crane and Rigging,Former Employee,"Deer Park, TX",2020-04-04,I have worked on numerous Fluor projects now. ...,Compensation was more than fair. People worked...,Time away from home.,5.0,0,0,construction
4,Fluor Corp.,Good place to work in a Combat Zone,Security Specialist,Former Employee,Afghanistan,2020-04-04,I am very comfortable with the work schedule t...,,,5.0,0,0,construction
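
The outline calls for dropping rows with a missing job title. The sketch below, on a toy frame with made-up values, shows the difference between calling `dropna` on the column and calling it on the frame with `subset=`: only the latter removes rows from the frame.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
df = pd.DataFrame({
    "company_name": ["A", "B", "C"],
    "job_title": ["Engineer", None, "Analyst"],
})

# dropna on a single column yields a shorter Series; the frame keeps all rows.
titles_only = df["job_title"].dropna()

# dropna with subset= removes the whole row wherever job_title is missing.
cleaned = df.dropna(subset=["job_title"])

print(len(df), len(titles_only), len(cleaned))  # 3 2 2
```
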


## Clean Job Titles

In [38]:
import string
import re

replacement_words = {
    "it":"technology"
    ,"sr":"senior"
    ,"qa":"quality"
}

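# Map every punctuation character to a space
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def cleanTitle(title):
    """Lowercase a job title, strip punctuation, and expand known abbreviations."""
    title = str(title).translate(translator)
    # Collapse runs of spaces and trim the ends so splitting yields no empty tokens
    title = re.sub(' +', ' ', title).strip()
    title = title.lower()
    title_split = title.split(" ")
    # Replace whole words only, e.g. "it" -> "technology" while "auditor" is untouched
    title_split = [replacement_words.get(word, word) for word in title_split]
    return " ".join(title_split)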

reviews["clean_job_title"] = reviews.job_title.apply(cleanTitle)


display( reviews[["job_title","clean_job_title"]].head() )

Unnamed: 0,job_title,clean_job_title
0,Principal Process Engineering Technician,principal process engineering technician
1,Journeyman Pipefitter,journeyman pipefitter
2,Systems Engineer and IT Specialist,systems engineer and technology specialist
3,General Foreman - Crane and Rigging,general foreman crane and rigging
4,Security Specialist,security specialist
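
The cleaning step replaces abbreviations only when they appear as whole words. Here is a compact sketch of the same idea, assuming the three abbreviations listed above; the names `clean_title_sketch`, `REPLACEMENTS` and `PUNCT_TO_SPACE`, as well as the sample titles, are hypothetical and not part of the notebook.

```python
import string

REPLACEMENTS = {"it": "technology", "sr": "senior", "qa": "quality"}
# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_title_sketch(title):
    """Lowercase, replace punctuation with spaces, expand whole-word abbreviations."""
    # str.split() with no argument collapses whitespace runs and trims the ends.
    words = str(title).translate(PUNCT_TO_SPACE).lower().split()
    # dict.get falls back to the word itself when it is not an abbreviation.
    return " ".join(REPLACEMENTS.get(word, word) for word in words)

samples = ["Sr. QA Engineer", "IT Auditor", "General Foreman - Crane and Rigging"]
cleaned_samples = [clean_title_sketch(s) for s in samples]
for original, cleaned_title in zip(samples, cleaned_samples):
    print(f"{original!r} -> {cleaned_title!r}")
```

Because matching happens per token, the "it" inside "auditor" is left alone while the standalone "IT" becomes "technology".
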


## Save as a pickle

In [39]:
import pickle

with open("../data/all_reviews.pkl","wb") as f:
    pickle.dump(reviews,f)
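
pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same serialization. A minimal round-trip sketch on a toy frame, assuming a temporary directory and a hypothetical file name rather than the notebook's real path:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the full reviews table.
frame = pd.DataFrame({"job_title": ["Journeyman Pipefitter"], "rating": [5.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_reviews.pkl"  # hypothetical file name
    frame.to_pickle(path)            # write the frame to disk
    restored = pd.read_pickle(path)  # read it back

# equals() compares values, dtypes and labels.
print(restored.equals(frame))  # True
```

Pickle files can execute arbitrary code when loaded, so only load pickles from a trusted source.
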

## Example how to use pickle

In [40]:
import pandas as pd
import pickle

with open("../data/all_reviews.pkl","rb") as f:
    reviews = pickle.load(f)
    
display(reviews.head())

Unnamed: 0,company_name,review_title,job_title,employee_status,location,date,review,pros,cons,rating,yes_helpful,no_helpful,industry,clean_job_title
0,Fluor Corp.,fun place to work,Principal Process Engineering Technician,Former Employee,"Sugar Land, TX",2020-02-03,"It is a 9 hour day, but if you are busy, it go...",,,5.0,0,0,construction,principal process engineering technician
1,Fluor Corp.,Im satisfied,Journeyman Pipefitter,Former Employee,"Crystal River, FL",2020-04-05,My boss are very understanding on the situatio...,,,5.0,0,0,construction,journeyman pipefitter
2,Fluor Corp.,New and intellectually stimulating work.,Systems Engineer and IT Specialist,Current Employee,"Aliso Viejo, CA",2020-04-05,Great place to work with great people. The ch...,,,5.0,0,0,construction,systems engineer and technology specialist
3,Fluor Corp.,For me.... GREAT group of people!,General Foreman - Crane and Rigging,Former Employee,"Deer Park, TX",2020-04-04,I have worked on numerous Fluor projects now. ...,Compensation was more than fair. People worked...,Time away from home.,5.0,0,0,construction,general foreman crane and rigging
4,Fluor Corp.,Good place to work in a Combat Zone,Security Specialist,Former Employee,Afghanistan,2020-04-04,I am very comfortable with the work schedule t...,,,5.0,0,0,construction,security specialist
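
Once loaded, the object is an ordinary DataFrame. As a sketch of typical follow-up use, assuming the `industry`, `clean_job_title` and `rating` columns shown above, the snippet below counts titles and averages ratings per industry on a toy frame (`reviews_demo` and its values are made up).

```python
import pandas as pd

# Toy stand-in for the loaded reviews; the values are made up.
reviews_demo = pd.DataFrame({
    "industry": ["construction", "construction", "retail"],
    "clean_job_title": ["security specialist", "security specialist", "cashier"],
    "rating": [5.0, 3.0, 4.0],
})

# Most frequent cleaned job titles, in descending order of count.
title_counts = reviews_demo["clean_job_title"].value_counts()

# Mean rating per industry.
mean_rating = reviews_demo.groupby("industry")["rating"].mean()

print(title_counts)
print(mean_rating)
```
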
