# Data Cleaning
---

### Table of Contents <a class="anchor" id="toc"></a>

* [Overview](#overview)
* [Importing Libraries](#importinglibraries)
* [Custom Functions](#customfunctions)
* [Importing Raw Datasets](#importcsv)
* [Checking `salary_minimum` and `salary_maximum`](#checkminmaxsalary)
* [Standardizing `job_title`](#standardizejobtitle)
* [Standardizing `employment_types`](#standardizeemploymenttype)
* [Standardizing `position_levels`](#standardizepositionlevel)
* [Converting datetime variables](#convertdatetime)
* [Exporting Dataset for EDA](#exportcsv)

## Overview<a class="anchor" id="overview"></a>
---
[Back to top!](#toc)

This notebook concatenates and cleans the datasets scraped in the previous notebook. The cleaning steps are:
* Feature engineering `salary_average` from `salary_minimum` and `salary_maximum`
* Removing outlier points from `salary_average`
* Standardizing `job_title` into a categorical variable
* Standardizing `employment_types` into a categorical variable
* Retaining only the job listings whose `employment_types` is full-time
* Standardizing `position_levels` into a categorical variable
* Converting `new_posting_date`, `original_posting_date` and `closing_date` to datetime variables and feature engineering the number of days from each posting date to the application closing date
* Dropping unnecessary columns such as `job_id`, `ext_job_id`, `ssoc_code`, `last_updated`, `salary_type`, `api_link` and `job_url`
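
One common way to carry out the outlier-removal step listed above is the 1.5 × IQR rule. The sketch below is an assumption, not necessarily the rule applied to this dataset: the helper name `remove_iqr_outliers`, the multiplier `k` and the toy salaries are all hypothetical.

```python
import pandas as pd

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper: the 1.5*IQR rule is an assumed choice of outlier rule.
    """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; copy() returns an independent frame
    return frame[frame[column].between(lower, upper)].copy()

# Toy salaries (made-up values): one listing is far above the rest.
toy = pd.DataFrame({'salary_average': [3400.0, 4500.0, 6000.0, 6000.0, 13125.0, 250000.0]})
trimmed = remove_iqr_outliers(toy, 'salary_average')
print(trimmed['salary_average'].tolist())
```

With these toy values only the 250000.0 row falls outside the fences. A percentile cut-off or a domain-specific salary cap would be equally reasonable choices.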

#### Data Dictionary

| |Feature|Type|Description|
|---|---|---|---|
|**0**|job_id|object|Identification no. for job posting|
|**1**|ext_job_id|object|Identification no. for job posting|
|**2**|job_title|object|Job Title|
|**3**|job_description|object|Job Description|
|**4**|minimum_years_experience|int|Minimum no. of years experience required|
|**5**|ssoc_code|int|Singapore Standard Occupational Classification (SSOC) code for job posting|
|**6**|categories|object|Job Category e.g. Information Technology, Banking & Finance|
|**7**|employment_types|object|Type of employment e.g. full-time, part-time|
|**8**|position_levels|object|Position Level e.g. entry level, executive, management|
|**9**|new_posting_date|object|New posting date|
|**10**|original_posting_date|object|Original posting date|
|**11**|closing_date|object|Closing date for application|
|**12**|last_updated|object|Last updated posting date|
|**13**|skills|object|Skills required for job listing|
|**14**|organisation|object|Company Name|
|**15**|salary_minimum|int|Minimum salary offered|
|**16**|salary_maximum|int|Maximum salary offered|
|**17**|salary_type|object|Type of salary e.g. monthly, yearly|
|**18**|api_link|object|API URL for job posting|
|**19**|job_url|object|Web page URL for job posting|

## Importing Libraries <a class="anchor" id="importinglibraries"></a>
---
[Back to top!](#toc)

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Custom Functions <a class="anchor" id="customfunctions"></a>
---
[Back to top!](#toc)

In [2]:
def get_top_words(df, top_num):
    '''Input: dataframe of word counts (one column per word), no. of top words to keep
       Output: dataframe with columns `words` and `counts`, sorted by descending count'''
    # summing each column gives the total occurrences of that word
    top_words = df.sum().sort_values(ascending=False).head(top_num)
    top_words = top_words.reset_index()
    top_words.columns = ['words', 'counts']
    return top_words
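
As a quick illustration of what the function above expects and returns, here is a minimal sketch on a made-up document-term table. The function body is repeated so the example runs on its own; `toy_counts` and its values are hypothetical.

```python
import pandas as pd

def get_top_words(df, top_num):
    # Same logic as the function above, repeated so this example is self-contained.
    df = pd.DataFrame(df.sum().sort_values(ascending=False).head(top_num))
    df = df.reset_index()
    df.columns = ['words', 'counts']
    return df

# Made-up counts: each row is a job title, each column is a word.
toy_counts = pd.DataFrame({
    'data':     [1, 1, 1],
    'analyst':  [1, 0, 1],
    'engineer': [0, 1, 0],
})

top = get_top_words(toy_counts, 2)
print(top)
```

The column sums are 3, 2 and 1, so the two rows returned are `data` and `analyst`.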

## Importing Raw Datasets <a class="anchor" id="importcsv"></a>
---
[Back to top!](#toc)


In [3]:
df_data = pd.read_csv('../data/mycareersfuturesg_results_data_210511.csv')
df_dscientist = pd.read_csv('../data/mycareersfuturesg_results_data_scientist_210511.csv')
df_danalyst = pd.read_csv('../data/mycareersfuturesg_results_data_analyst_210511.csv')
df_dengineer = pd.read_csv('../data/mycareersfuturesg_results_data_engineer_210511.csv')
df_bi = pd.read_csv('../data/mycareersfuturesg_results_business_analyst_210511.csv')
df_ml = pd.read_csv('../data/mycareersfuturesg_results_machine_learning_210511.csv')
df_py = pd.read_csv('../data/mycareersfuturesg_results_python_210511.csv')

In [4]:
print("from 'mycareersfuture.gov.sg'")
print('-----------------------------')
print(f"search for 'data': {df_data.shape}")
print(f"search for 'data scientist': {df_dscientist.shape}")
print(f"search for 'data analyst': {df_danalyst.shape}")
print(f"search for 'data engineer': {df_dengineer.shape}")
print(f"search for 'business analyst': {df_bi.shape}")
print(f"search for 'machine learning': {df_ml.shape}")
print(f"search for 'python': {df_py.shape}")

from 'mycareersfuture.gov.sg'
-----------------------------
search for 'data': (15102, 20)
search for 'data scientist': (200, 20)
search for 'data analyst': (291, 20)
search for 'data engineer': (284, 20)
search for 'business analyst': (890, 20)
search for 'machine learning': (933, 20)
search for 'python': (2677, 20)


In [5]:
# concatenating dataframes
df = pd.concat([df_data, df_dscientist, df_danalyst, df_dengineer, df_bi, df_ml, df_py], axis=0)
df.shape

(20377, 20)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20377 entries, 0 to 2676
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   job_id                    20377 non-null  object
 1   ext_job_id                20377 non-null  object
 2   job_title                 20377 non-null  object
 3   job_description           20377 non-null  object
 4   minimum_years_experience  20377 non-null  int64 
 5   ssoc_code                 20377 non-null  int64 
 6   categories                20377 non-null  object
 7   employment_types          20377 non-null  object
 8   position_levels           19776 non-null  object
 9   new_posting_date          20377 non-null  object
 10  original_posting_date     20377 non-null  object
 11  closing_date              20377 non-null  object
 12  last_updated              20377 non-null  object
 13  skills                    20377 non-null  object
 14  organisation           

In [7]:
# dropping unnecessary columns
df.drop(columns=['job_id', 'ext_job_id', 'ssoc_code', 'api_link', 'job_url'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20377 entries, 0 to 2676
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   job_title                 20377 non-null  object
 1   job_description           20377 non-null  object
 2   minimum_years_experience  20377 non-null  int64 
 3   categories                20377 non-null  object
 4   employment_types          20377 non-null  object
 5   position_levels           19776 non-null  object
 6   new_posting_date          20377 non-null  object
 7   original_posting_date     20377 non-null  object
 8   closing_date              20377 non-null  object
 9   last_updated              20377 non-null  object
 10  skills                    20377 non-null  object
 11  organisation              20377 non-null  object
 12  salary_minimum            20377 non-null  int64 
 13  salary_maximum            20377 non-null  int64 
 14  salary_type            

In [8]:
# checking for no. of duplicates
df.duplicated().sum()

3830

In [9]:
# dropping duplicates
df.drop_duplicates(keep = "first", inplace = True)
df.shape

(16547, 15)

## Checking `salary_minimum` and `salary_maximum` <a class="anchor" id="checkminmaxsalary"></a> 
---
[Back to top!](#toc)

In [10]:
# checking indexes without salary input
# (a cell displays only its last expression, so both counts are printed)
print((df['salary_minimum'] == 0).sum())
print((df['salary_maximum'] == 0).sum())

639
639

In [11]:
# dropping indexes without salary input
# (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
df = df[df.salary_minimum != 0].copy()
df.shape

(15908, 15)

In [12]:
# creating new column for average salary
df['salary_average'] = (df['salary_minimum'] + df['salary_maximum'])/2
df.drop(columns=['salary_minimum', 'salary_maximum'], inplace=True)
df.head()

| |job_title|job_description|minimum_years_experience|categories|employment_types|position_levels|new_posting_date|original_posting_date|closing_date|last_updated|skills|organisation|salary_type|salary_average|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**0**|Senior Derivatives XA Market Data Engineer|At ICE we approach every challenge as an oppor...|5|Banking and Finance; Information Technology|Permanent|Professional|2021-04-28|2021-03-17|2021-05-28|2021-05-10T16:39:19.000Z|Analytics; Communication; Data; Detail Oriente...|ICE DATA SERVICES SINGAPORE PTE. LTD.|Monthly|13125.0|
|**1**|IT - Business Analyst (Data Analytics and Mach...|[ COLLECTION, USE AND DISCLOSURE OF PERSONAL D...|3|Information Technology|Contract; Full Time|Executive|2021-04-23|2021-04-23|2021-05-23|2021-05-10T16:28:53.000Z|Agile; Agile Methodolgy; Banking; Business Ana...|NTT DATA SINGAPORE PTE. LTD.|Monthly|4500.0|
|**2**|Junior Data Analyst (Finance / Python) - up to...|We are a top 8 global IT services company with...|2|Information Technology|Contract; Full Time|Junior Executive|2021-04-14|2021-04-14|2021-05-14|2021-05-10T16:34:46.000Z|Access; Business Process; Data; Data Analysis;...|NTT DATA SINGAPORE PTE. LTD.|Monthly|3400.0|
|**3**|Data Analytics Consultant|This TMCA\* (TeSA Mid-Career Advancement) progr...|15|Consulting; Education and Training; Informatio...|Full Time; Internship/Traineeship|Professional|2021-04-30|2020-12-18|2021-05-30|2021-05-10T16:41:51.000Z|Analysis; Business Change Management; Business...|DATA & ANALYTICS CAPITALS PTE. LTD.|Monthly|6000.0|
|**4**|(IT)- Data science analyst|We are a top 8 global IT services company with...|2|Information Technology|Contract; Full Time|Senior Executive|2021-04-18|2021-04-18|2021-05-18|2021-05-10T16:28:23.000Z|Business Process; Data; MySQL; MySQL DBA; Orac...|NTT DATA SINGAPORE PTE. LTD.|Monthly|6000.0|


In [13]:
# checking salary type
df.salary_type.value_counts()

# dropping since all are the same values
df.drop(columns=['salary_type'], inplace=True)

## Standardizing `job_title` <a class="anchor" id="standardizejobtitle"></a> 
---
[Back to top!](#toc)

In [14]:
df['job_title'].value_counts()

Business Analyst                                                                                                                                                                       93
Software Engineer                                                                                                                                                                      69
Data Engineer                                                                                                                                                                          59
Admin Assistant                                                                                                                                                                        54
Product Manager                                                                                                                                                                        51
Data Analyst                                                          

In [15]:
# using cvec to find out top occurring words in job title

# selecting only the job title
df_job_title = df['job_title']

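# Instantiate a CountVectorizer with custom stop words, counting bigrams only.
# Note: inside a custom list, 'english' is a literal word to ignore;
# it does not add scikit-learn's built-in English stop word list.
# A list is used because recent scikit-learn versions reject a tuple here.
cvec = CountVectorizer(stop_words=['english', 'sgunitedtraineeships', 
                                   'up', 'admin', 'head', 'contract', 
                                   'senior', 'manager', 'customer', 
                                   'executive', 'vice', 'president', 'sgunitedjobs', 
                                   'assistant', '12', 'months', 'technology', 'and', 
                                   'days', 'associate', 'sgup', 'marketing', 'supply', 'sales', 
                                   'consumer'], 
                       ngram_range=(2, 2))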

# fitting the CountVectorizer
cvec.fit(df_job_title)
df_job_title = cvec.transform(df_job_title)

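# df of top occurring words
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
df_job_title_topwords = pd.DataFrame(df_job_title.toarray(), columns=cvec.get_feature_names_out())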
df_job_title_topwords = get_top_words(df_job_title_topwords, 30)
df_job_title_topwords

| |words|counts|
|---|---|---|
|**0**|business analyst|575|
|**1**|software engineer|542|
|**2**|business development|166|
|**3**|research engineer|162|
|**4**|data analyst|158|
|**5**|full stack|158|
|**6**|data scientist|151|
|**7**|data engineer|145|
|**8**|software developer|131|
|**9**|devops engineer|129|


In [16]:
# custom function for standardizing job title
def job_title_standardizer(job_title):
    
    if 'data analyst' in job_title.lower():
        return 'data analyst'
    elif 'data analytics' in job_title.lower():
        return 'data analyst'
    elif 'analytics' in job_title.lower():
        return 'data analyst'
    elif 'visualization' in job_title.lower():
        return 'data analyst'
    
    elif 'data scientist' in job_title.lower():
        return 'data scientist'
    elif 'machine learning' in job_title.lower():
        return 'data scientist'
    elif 'science' in job_title.lower():
        return 'data scientist'
    elif 'modeler' in job_title.lower():
        return 'data scientist'
    
    elif 'business analyst' in job_title.lower():
        return 'business analyst'
    
    elif 'software engineer' in job_title.lower():
        return 'software engineer'
    elif 'front end' in job_title.lower():
        return 'software engineer'
    elif 'devops' in job_title.lower():
        return 'software engineer'    
    elif 'full stack' in job_title.lower():
        return 'software engineer'
    
    elif 'big data' in job_title.lower():
        return 'data engineer'
    elif 'data engineer' in job_title.lower():
        return 'data engineer'
    elif 'data integration' in job_title.lower():
        return 'data engineer' 
    elif 'data migration' in job_title.lower():
        return 'data engineer' 
    
    elif 'system engineer' in job_title.lower():
        return 'system engineer'
    elif 'solution architect' in job_title.lower():
        return 'system engineer'
    elif 'network engineer' in job_title.lower():
        return 'system engineer'
    elif 'solutions architect' in job_title.lower():
        return 'system engineer'
    elif 'architect' in job_title.lower():
        return 'system engineer'
    
    elif 'application' in job_title.lower():
        return 'software developer'
    elif 'java' in job_title.lower():
        return 'software developer'

    elif 'data governance' in job_title.lower():
        return 'data quality management'
    elif 'data strategy' in job_title.lower():
        return 'data quality management'   
    elif 'strategist' in job_title.lower():
        return 'data quality management'  
    elif 'data quality' in job_title.lower():
        return 'data quality management' 
    elif 'data solutions' in job_title.lower():
        return 'data quality management' 
    elif 'data solution' in job_title.lower():
        return 'data quality management' 
    elif 'data management' in job_title.lower():
        return 'data quality management' 
    elif 'data product manager' in job_title.lower():
        return 'data quality management' 
    elif 'security' in job_title.lower():
        return 'data quality management' 
    elif 'data maintenance' in job_title.lower():
        return 'data quality management' 
    elif 'protection' in job_title.lower():
        return 'data quality management'
    elif 'acquisition' in job_title.lower():
        return 'data quality management'
    
    elif 'research' in job_title.lower():
        return 'research scientist' 
    
    else:
        return 'others'
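
The function above returns the label of the first keyword that matches, so the order of the checks decides how a title containing several keywords is labelled. The same first-match idea can be written as an ordered rule table. This is a hypothetical sketch: `TITLE_RULES` holds only a shortened, illustrative subset of the rules, and `standardize_title` is not part of the notebook.

```python
# Hypothetical table-driven variant; the rule list is a shortened subset for illustration.
# Rules are checked top to bottom and the first matching keyword wins.
TITLE_RULES = [
    ('data analyst', 'data analyst'),
    ('analytics', 'data analyst'),
    ('data scientist', 'data scientist'),
    ('machine learning', 'data scientist'),
    ('business analyst', 'business analyst'),
    ('data engineer', 'data engineer'),
]

def standardize_title(job_title, rules=TITLE_RULES, default='others'):
    title = job_title.lower()  # lowercase once instead of once per rule
    for keyword, label in rules:
        if keyword in title:
            return label
    return default

# Made-up titles: the first one contains both 'business analyst' and 'analytics'.
print(standardize_title('Business Analyst (Data Analytics)'))
print(standardize_title('Senior Business Analyst'))
print(standardize_title('Admin Assistant'))
```

Because `analytics` is listed before `business analyst`, the first title is labelled `data analyst`. Reordering the table changes that outcome without touching the function.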

In [17]:
# applying custom function to column
df['job_title_standardized'] = df['job_title'].apply(job_title_standardizer)

# dropping indexes with 'others'
# (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
df = df[df['job_title_standardized'] != 'others'].copy()
df['job_title_standardized'].value_counts()

software engineer          869
business analyst           554
research scientist         534
system engineer            450
software developer         416
data analyst               398
data quality management    374
data scientist             279
data engineer              205
Name: job_title_standardized, dtype: int64

## Standardizing `employment_types` <a class="anchor" id="standardizeemploymenttype"></a> 
---
[Back to top!](#toc)

In [18]:
df.employment_types.value_counts()

Full Time                                                   1126
Permanent; Full Time                                         892
Permanent                                                    651
Contract; Full Time                                          624
Contract                                                     604
Internship/Traineeship                                        63
Contract; Permanent; Full Time                                48
Contract; Permanent                                           33
Part Time; Permanent                                           6
Temporary                                                      4
Permanent; Full Time; Internship/Traineeship                   4
Part Time                                                      3
Contract; Freelance; Permanent; Full Time                      3
Full Time; Internship/Traineeship                              3
Contract; Part Time                                            2
Contract; Part Time; Full

In [19]:
# finding the unique employment types
employment_types_list = df['employment_types'].tolist()

employment_types_list_split = []

for x in employment_types_list:
    x = x.split('; ')
    employment_types_list_split.extend(x)

# removing duplicates while keeping first-seen order
employment_types = list(dict.fromkeys(employment_types_list_split))

employment_types

['Permanent',
 'Contract',
 'Full Time',
 'Internship/Traineeship',
 'Part Time',
 'Temporary',
 'Flexi-work',
 'Freelance']
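
The same list of unique types can also be obtained with pandas string methods instead of an explicit loop. A minimal sketch on made-up values, assuming pandas 0.25 or later for `Series.explode`; the `sample` series is hypothetical and only mimics the `'; '`-separated format of `employment_types`.

```python
import pandas as pd

# Made-up sample in the same '; '-separated format as `employment_types`.
sample = pd.Series(['Permanent', 'Contract; Full Time', 'Permanent; Full Time', 'Part Time'])

# Split each entry into a list, flatten the lists with explode(),
# then keep the first occurrence of each value in order of appearance.
unique_types = sample.str.split('; ').explode().unique().tolist()
print(unique_types)
```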

In [20]:
# custom function for standardizing employment types
def employment_type_standardizer(employment_types):
    if 'permanent' in employment_types.lower():
        return 'permanent'
    elif 'contract' in employment_types.lower():
        return 'contract'
    elif 'full time' in employment_types.lower():
        return 'fulltime'
    elif 'part time' in employment_types.lower():
        return 'parttime'
    elif 'temporary' in employment_types.lower():
        return 'temporary'
    elif 'internship' in employment_types.lower():
        return 'internship'
    elif 'flexi-work' in employment_types.lower():
        return 'flexi-work'
    elif 'freelance' in employment_types.lower():
        return 'freelance'
    else:
        return 'others'

In [21]:
# applying custom function to column
df['employment_type_standardized'] = df['employment_types'].apply(employment_type_standardizer)
df['employment_type_standardized'].value_counts()

permanent     1642
contract      1233
fulltime      1133
internship      63
temporary        4
parttime         3
flexi-work       1
Name: employment_type_standardized, dtype: int64
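
Because the function returns on the first match, a listing tagged `Permanent; Full Time` is labelled `permanent`, so the `fulltime` label on its own does not identify every full-time listing. If the aim is to retain full-time listings, one option is a boolean flag computed from the raw column before it is dropped. This is a sketch under that assumption; `raw` and `is_full_time` are hypothetical names and the values are made up.

```python
import pandas as pd

# Made-up sample in the same format as the raw `employment_types` column.
raw = pd.Series(['Permanent; Full Time', 'Contract; Full Time', 'Full Time', 'Part Time; Permanent'])

# True wherever 'full time' appears anywhere in the entry, regardless of position.
is_full_time = raw.str.lower().str.contains('full time', regex=False)
print(is_full_time.tolist())

# Keeping only the flagged rows.
full_time_only = raw[is_full_time]
print(full_time_only.tolist())
```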

In [22]:
# dropping column
df.drop(columns=['employment_types'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4079 entries, 0 to 2668
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   job_title                     4079 non-null   object 
 1   job_description               4079 non-null   object 
 2   minimum_years_experience      4079 non-null   int64  
 3   categories                    4079 non-null   object 
 4   position_levels               4076 non-null   object 
 5   new_posting_date              4079 non-null   object 
 6   original_posting_date         4079 non-null   object 
 7   closing_date                  4079 non-null   object 
 8   last_updated                  4079 non-null   object 
 9   skills                        4079 non-null   object 
 10  organisation                  4079 non-null   object 
 11  salary_average                4079 non-null   float64
 12  job_title_standardized        4079 non-null   object 
 13  emp

## Standardizing `position_levels` <a class="anchor" id="standardizepositionlevel"></a>
---
[Back to top!](#toc)

In [23]:
df.position_levels.value_counts()

Professional                            1326
Executive                                945
Senior Executive                         708
Manager                                  313
Junior Executive                         249
Fresh/entry level                        195
Middle Management                        145
Senior Management                        102
Non-executive                             92
Senior Management; Middle Management       1
Name: position_levels, dtype: int64

In [24]:
# custom function for standardizing position levels
def position_levels_standardizer(position_levels):
    if 'Senior Executive' in position_levels:
        return 'senior executive'
    elif 'Junior Executive' in position_levels:
        return 'junior executive'
    elif 'Non-executive' in position_levels:
        return 'non-executive'
    elif 'Executive' in position_levels:
        return 'executive'
    elif 'Middle Management' in position_levels:
        return 'middle management'
    elif 'Senior Management' in position_levels:
        return 'senior management'
    elif 'Professional' in position_levels:
        return 'professional'
    elif 'Manager' in position_levels:
        return 'manager'
    elif 'entry level' in position_levels:
        return 'entry level'
    else:
        return 'others'

In [25]:
# ensuring dtype is str
df['position_levels'] = df['position_levels'].astype('str', copy=True)

# applying custom function to column
df['position_level_standardized'] = df['position_levels'].apply(position_levels_standardizer)

# dropping indexes with 'others'
# (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
df = df[df['position_level_standardized'] != 'others'].copy()
df['position_level_standardized'].value_counts()

professional         1326
executive             945
senior executive      708
manager               313
junior executive      249
entry level           195
middle management     146
senior management     102
non-executive          92
Name: position_level_standardized, dtype: int64
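
The standardized labels are stored as plain strings (dtype `object`). If a true categorical dtype is wanted, a minimal sketch on made-up labels follows. The conversion is optional and is an assumption about intent; the dtype is also not preserved when the dataframe is written to CSV, so it would need to be reapplied after loading.

```python
import pandas as pd

# Made-up labels in the same style as `position_level_standardized`.
labels = pd.Series(['professional', 'executive', 'professional', 'manager'])

# astype('category') stores each distinct label once and maps rows to integer codes.
as_category = labels.astype('category')
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.tolist())
```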

In [26]:
# dropping column
df.drop(columns=['position_levels'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4076 entries, 0 to 2668
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   job_title                     4076 non-null   object 
 1   job_description               4076 non-null   object 
 2   minimum_years_experience      4076 non-null   int64  
 3   categories                    4076 non-null   object 
 4   new_posting_date              4076 non-null   object 
 5   original_posting_date         4076 non-null   object 
 6   closing_date                  4076 non-null   object 
 7   last_updated                  4076 non-null   object 
 8   skills                        4076 non-null   object 
 9   organisation                  4076 non-null   object 
 10  salary_average                4076 non-null   float64
 11  job_title_standardized        4076 non-null   object 
 12  employment_type_standardized  4076 non-null   object 
 13  pos

## Converting datetime variables <a class="anchor" id="convertdatetime"></a> 
---
[Back to top!](#toc)

In [27]:
df.head()

| |job_title|job_description|minimum_years_experience|categories|new_posting_date|original_posting_date|closing_date|last_updated|skills|organisation|salary_average|job_title_standardized|employment_type_standardized|position_level_standardized|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**0**|Senior Derivatives XA Market Data Engineer|At ICE we approach every challenge as an oppor...|5|Banking and Finance; Information Technology|2021-04-28|2021-03-17|2021-05-28|2021-05-10T16:39:19.000Z|Analytics; Communication; Data; Detail Oriente...|ICE DATA SERVICES SINGAPORE PTE. LTD.|13125.0|data engineer|permanent|professional|
|**1**|IT - Business Analyst (Data Analytics and Mach...|[ COLLECTION, USE AND DISCLOSURE OF PERSONAL D...|3|Information Technology|2021-04-23|2021-04-23|2021-05-23|2021-05-10T16:28:53.000Z|Agile; Agile Methodolgy; Banking; Business Ana...|NTT DATA SINGAPORE PTE. LTD.|4500.0|data analyst|contract|executive|
|**2**|Junior Data Analyst (Finance / Python) - up to...|We are a top 8 global IT services company with...|2|Information Technology|2021-04-14|2021-04-14|2021-05-14|2021-05-10T16:34:46.000Z|Access; Business Process; Data; Data Analysis;...|NTT DATA SINGAPORE PTE. LTD.|3400.0|data analyst|contract|junior executive|
|**3**|Data Analytics Consultant|This TMCA\* (TeSA Mid-Career Advancement) progr...|15|Consulting; Education and Training; Informatio...|2021-04-30|2020-12-18|2021-05-30|2021-05-10T16:41:51.000Z|Analysis; Business Change Management; Business...|DATA & ANALYTICS CAPITALS PTE. LTD.|6000.0|data analyst|fulltime|professional|
|**4**|(IT)- Data science analyst|We are a top 8 global IT services company with...|2|Information Technology|2021-04-18|2021-04-18|2021-05-18|2021-05-10T16:28:23.000Z|Business Process; Data; MySQL; MySQL DBA; Orac...|NTT DATA SINGAPORE PTE. LTD.|6000.0|data scientist|contract|senior executive|


In [28]:
# converting to datetime format
df['new_posting_date'] = pd.to_datetime(df['new_posting_date'], format="%Y-%m-%d")
df['original_posting_date'] = pd.to_datetime(df['original_posting_date'], format="%Y-%m-%d")
df['closing_date'] = pd.to_datetime(df['closing_date'], format="%Y-%m-%d")

# getting no. of days from posting date
df['days_new_posting_closing'] = df['closing_date'] - df['new_posting_date']
df['days_new_posting_closing'] = df['days_new_posting_closing']/np.timedelta64(1,"D")

# getting no. of days from original posting date
df['days_original_posting_closing'] = df['closing_date'] - df['original_posting_date']
df['days_original_posting_closing'] = df['days_original_posting_closing']/np.timedelta64(1,"D")

# dropping columns
df.drop(columns=['new_posting_date', 'original_posting_date', 'closing_date', 'last_updated'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4076 entries, 0 to 2668
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   job_title                      4076 non-null   object 
 1   job_description                4076 non-null   object 
 2   minimum_years_experience       4076 non-null   int64  
 3   categories                     4076 non-null   object 
 4   skills                         4076 non-null   object 
 5   organisation                   4076 non-null   object 
 6   salary_average                 4076 non-null   float64
 7   job_title_standardized         4076 non-null   object 
 8   employment_type_standardized   4076 non-null   object 
 9   position_level_standardized    4076 non-null   object 
 10  days_new_posting_closing       4076 non-null   float64
 11  days_original_posting_closing  4076 non-null   float64
dtypes: float64(3), int64(1), object(8)
memory usage:
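
Dividing by `np.timedelta64(1, "D")` yields the day counts as floats. When whole days are enough, the `.dt.days` accessor is an alternative that returns integers. A minimal sketch, using the posting and closing dates of the first two rows shown above; the `dates` dataframe itself is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Dates copied from the first two rows of the preview, in '%Y-%m-%d' format.
dates = pd.DataFrame({
    'new_posting_date': ['2021-04-28', '2021-04-23'],
    'closing_date': ['2021-05-28', '2021-05-23'],
})
for col in ['new_posting_date', 'closing_date']:
    dates[col] = pd.to_datetime(dates[col], format='%Y-%m-%d')

# Subtracting two datetime columns gives a timedelta column;
# .dt.days extracts the whole number of days as integers.
dates['days_new_posting_closing'] = (dates['closing_date'] - dates['new_posting_date']).dt.days
print(dates['days_new_posting_closing'].tolist())
```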

## Exporting Dataset for EDA <a class="anchor" id="exportcsv"></a> 
---
[Back to top!](#toc)

In [29]:
df.to_csv('../data/cleaned_dataset.csv', index=False)