# 4 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Feature Engineering](#4.5_Feature_Engineering)
    * [4.5.1 Create New Features](#4.5.1_Create_Features)
    * [4.5.2 Pick the features for ML Model](#4.5.2_Pick_Features)
  * [4.6 Train/Test Split](#4.6_Train/Test_Split)
  * [4.7 Save File](#4.7_Save_File)
  * [4.8 Summary](#4.8_Summary)

## 4.2 Introduction<a id='4.2_Introduction'></a>

After doing EDA, we now have two datasets which we need to prepare before passing them to ML models.

We expect the ML models to predict the gender that would prefer a perfume given the notes in that perfume.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import datetime

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectKBest, f_regression

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [2]:
perfume_data = pd.read_csv('../data/interim/perfume_data_step3_features.csv')
df = perfume_data.copy()
df.head()

Unnamed: 0,title,gender,notes,top_notes,middle_notes,base_notes,top_notes_list,middle_notes_list,base_notes_list,label
0,Aamal The Spirit of Dubai for women and men,unisex,"Bulgarian Rose,Bergamot,Fruits,Agarwood Oud,Sa...","Bulgarian Rose,Bergamot,Fruits,Agarwood Oud","Sandalwood,Agarwood Oud,Cypriol Oil or Nagarmo...","Amber,Castoreum,Civet,Moss,Agarwood Oud,Indian...","['Bulgarian Rose', 'Bergamot', 'Fruits', 'Agar...","['Sandalwood', 'Agarwood Oud', 'Cypriol Oil or...","['Amber', 'Castoreum', 'Civet', 'Moss', 'Agarw...",men
1,Aatifa Ajmal for women and men,unisex,"Rose,Cumin,Amber,Musk,Amber","Rose,Cumin",Amber,"Musk,Amber","['Rose', 'Cumin']",['Amber'],"['Musk', 'Amber']",women
2,AA Al-Jazeera Perfumes for women and men,unisex,"Rose,Sandalwood,Apple,Agarwood Oud","Rose,Sandalwood,Apple,Agarwood Oud",,,"['Rose', 'Sandalwood', 'Apple', 'Agarwood Oud']",[''],[''],unisex
3,aarewasser Art of Scent - Swiss Perfumes for w...,unisex,"Green Tea,White Flowers","Green Tea,White Flowers",,,"['Green Tea', 'White Flowers']",[''],[''],unisex
4,Aaliyah Hamidi Oud & Perfumes for women and men,unisex,"Amber,Sandalwood,Vetiver,Saffron","Amber,Sandalwood,Vetiver,Saffron",,,"['Amber', 'Sandalwood', 'Vetiver', 'Saffron']",[''],[''],unisex


In [3]:
df.shape

(47649, 10)

In [4]:
final_notes = pd.read_csv('../data/interim/perfume_notes_columns.csv')

In [5]:
final_notes.head()

Unnamed: 0,note,women,men,unisex,total
0,Musk,9910,4084,4830,18824
1,Jasmine,8010,2028,3415,13453
2,Amber,6180,3651,3484,13315
3,Patchouli,4862,3625,3241,11728
4,Rose,6626,1548,2732,10906


## 4.5 Feature Engineering<a id='4.5_Feature_Engineering'></a>

### 4.5.1 Create New Features<a id='4.5.1_Create_Features'></a>

In [6]:
df['notes_list'] = df.notes.str.split(',')

We have made the business decision to consider all perfume Notes that appear in 100 or more perfumes. 

We are creating a column for each of these selected notes. 

The value for a note column will be set as per the rules below.

- 2^2 if it is in Top layer
- 3^2 if it is in Middle layer
- 4^2 if it is in Base layer
- the sum of the squares if a note appears in more than one layer, i.e., if a note appears in both the Top and Middle layers, the value will be 2^2 + 3^2 = 4 + 9 = 13

To accommodate the notes we have discarded, we will add a separate column holding the number of notes in the perfume that are not represented by their own columns.

E.g., a perfume has the notes:

- Top: Lemon, Sage
- Middle: Mint, Mahogany, Sandalwood
- Base: Sandalwood, Amber

The values for each note will be Lemon (4), Mint (9), Sandalwood (9 + 16 = 25) and Amber (16). As there are also 2 other notes (Sage, Mahogany) which do not have independent columns, we will record the number 2 in the new column to denote that there are 2 more perfume notes.
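
The worked example above can be reproduced with a minimal sketch in plain Python. The helper `encode_note` and the `LAYER_WEIGHTS` mapping are illustrative names that do not appear in the notebook, and the sketch assumes each layer is available as a list of note names.

```python
# Illustrative sketch: the names below are hypothetical, not from the notebook.
LAYER_WEIGHTS = {"top": 2, "middle": 3, "base": 4}

def encode_note(note, layers):
    """Sum the squared layer weights for every layer that contains the note."""
    return sum(weight ** 2
               for layer, weight in LAYER_WEIGHTS.items()
               if note in layers.get(layer, []))

perfume = {
    "top": ["Lemon", "Sage"],
    "middle": ["Mint", "Mahogany", "Sandalwood"],
    "base": ["Sandalwood", "Amber"],
}
# Notes assumed (for this example only) to have their own columns
selected_notes = ["Lemon", "Mint", "Sandalwood", "Amber"]

encoded = {note: encode_note(note, perfume) for note in selected_notes}

# Count the distinct notes that have no column of their own
all_notes = {n for notes in perfume.values() for n in notes}
other_notes = len(all_notes - set(selected_notes))

print(encoded)      # {'Lemon': 4, 'Mint': 9, 'Sandalwood': 25, 'Amber': 16}
print(other_notes)  # 2
```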

In [7]:
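for note in final_notes.note:
    col_name = note.lower().replace(' ', '_')
    # Layer weights: top = 2, middle = 3, base = 4 (0 if the note is absent from the layer).
    # Note: if the *_notes_list columns are strings (as after a CSV load), `in` is a substring test.
    top_value = df.top_notes_list.map(lambda x: 2 if note in x else 0)
    middle_value = df.middle_notes_list.map(lambda x: 3 if note in x else 0)
    base_value = df.base_notes_list.map(lambda x: 4 if note in x else 0)
    # Square each weight and sum across layers
    df[col_name] = (top_value ** 2) + (middle_value ** 2) + (base_value ** 2)
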
df.head()

Unnamed: 0,title,gender,notes,top_notes,middle_notes,base_notes,top_notes_list,middle_notes_list,base_notes_list,label,...,cannabis,exotic_woods,pink_peony,sea_salt,fennel,white_lily,clover,tomato_leaf,rice,red_rose
0,Aamal The Spirit of Dubai for women and men,unisex,"Bulgarian Rose,Bergamot,Fruits,Agarwood Oud,Sa...","Bulgarian Rose,Bergamot,Fruits,Agarwood Oud","Sandalwood,Agarwood Oud,Cypriol Oil or Nagarmo...","Amber,Castoreum,Civet,Moss,Agarwood Oud,Indian...","['Bulgarian Rose', 'Bergamot', 'Fruits', 'Agar...","['Sandalwood', 'Agarwood Oud', 'Cypriol Oil or...","['Amber', 'Castoreum', 'Civet', 'Moss', 'Agarw...",men,...,0,0,0,0,0,0,0,0,0,0
1,Aatifa Ajmal for women and men,unisex,"Rose,Cumin,Amber,Musk,Amber","Rose,Cumin",Amber,"Musk,Amber","['Rose', 'Cumin']",['Amber'],"['Musk', 'Amber']",women,...,0,0,0,0,0,0,0,0,0,0
2,AA Al-Jazeera Perfumes for women and men,unisex,"Rose,Sandalwood,Apple,Agarwood Oud","Rose,Sandalwood,Apple,Agarwood Oud",,,"['Rose', 'Sandalwood', 'Apple', 'Agarwood Oud']",[''],[''],unisex,...,0,0,0,0,0,0,0,0,0,0
3,aarewasser Art of Scent - Swiss Perfumes for w...,unisex,"Green Tea,White Flowers","Green Tea,White Flowers",,,"['Green Tea', 'White Flowers']",[''],[''],unisex,...,0,0,0,0,0,0,0,0,0,0
4,Aaliyah Hamidi Oud & Perfumes for women and men,unisex,"Amber,Sandalwood,Vetiver,Saffron","Amber,Sandalwood,Vetiver,Saffron",,,"['Amber', 'Sandalwood', 'Vetiver', 'Saffron']",[''],[''],unisex,...,0,0,0,0,0,0,0,0,0,0


In [8]:
all_selected_notes = set(final_notes.note)

In [9]:
# Count the distinct notes of each perfume that have no column of their own
df['other_notes'] = df['notes_list'].apply(lambda row: len(set(row) - all_selected_notes))
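
Because the count is taken over a set difference, a note that is repeated within a perfume is counted only once. The short sketch below illustrates this with made-up stand-ins for one row of `notes_list` and for the set of selected notes; the values are hypothetical and chosen only for illustration.

```python
# Hypothetical stand-ins for one row of `notes_list` and the set of selected notes.
notes_list = ["Rose", "Cumin", "Amber", "Musk", "Amber"]
selected = {"Rose", "Cumin", "Amber", "Musk"}

# Converting to a set removes duplicates, so "Amber" is considered once.
count_all_selected = len(set(notes_list) - selected)
print(count_all_selected)  # 0

# A note missing from the selected set is counted exactly once, even if repeated.
notes_list = ["Agarwood Oud", "Indian Oud", "Indian Oud"]
count_one_missing = len(set(notes_list) - {"Agarwood Oud"})
print(count_one_missing)  # 1
```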

Let's verify that the notes have been processed correctly by spot-checking a few perfumes.

In [10]:
# First Perfume
df.loc[0, 'notes_list']

['Bulgarian Rose',
 'Bergamot',
 'Fruits',
 'Agarwood Oud',
 'Sandalwood',
 'Agarwood Oud',
 'Cypriol Oil or Nagarmotha',
 'Benzoin',
 'Amber',
 'Castoreum',
 'Civet',
 'Moss',
 'Agarwood Oud',
 'Indian Oud']

In [11]:
df.iloc[0, 11:].loc[lambda x: x!=0].to_frame().T

Unnamed: 0,amber,rose,bergamot,agarwood_oud,benzoin,moss,bulgarian_rose,civet,cypriol_oil_or_nagarmotha,castoreum,fruits,other_notes
0,16,4,4,29,9,16,4,16,9,16,4,2
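
In the output above, `rose` is 4 even though the first perfume lists 'Bulgarian Rose' rather than 'Rose'. That is what one would expect if the `*_notes_list` columns are read back from the CSV as strings, because `in` applied to a string is a substring test. The sketch below illustrates the difference and one possible way to get exact matches, by parsing the string with `ast.literal_eval`. Whether exact matching is the desired behavior for this dataset is an assumption, not something the notebook states.

```python
import ast

# A list column as it would look after a CSV round trip: a string, not a list.
raw = "['Bulgarian Rose', 'Bergamot', 'Fruits', 'Agarwood Oud']"

substring_match = "Rose" in raw    # True: 'Rose' occurs inside 'Bulgarian Rose'
parsed = ast.literal_eval(raw)     # parse the string back into a real Python list
exact_match = "Rose" in parsed     # False: no list element equals 'Rose'

print(substring_match, exact_match)
```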


In [12]:
# Second Perfume
df.loc[15000, 'notes_list']

['Musk']

In [13]:
df.iloc[15000, 11:].loc[lambda x: x!=0].to_frame().T

Unnamed: 0,musk
15000,16


In [14]:
# Third Perfume
df.loc[45000, 'notes_list']

['Tangerine',
 'Peach',
 'Jasmine',
 'Neroli',
 'Pomegranate',
 'Musk',
 'Patchouli',
 'Vanille']

In [15]:
df.iloc[45000, 11:].loc[lambda x: x!=0].to_frame().T

Unnamed: 0,musk,jasmine,patchouli,peach,neroli,vanille,tangerine,pomegranate
45000,16,9,16,4,9,16,4,9


### 4.5.2 Pick the features for ML Model<a id='4.5.2_Pick_Features'></a>

Now that the features have been created, let's construct the dataframe with the features that will be passed on to our ML model.

In [16]:
feature_cols = list(df.columns[11:].values)
feature_cols.append('label')
print(feature_cols)

['musk', 'jasmine', 'amber', 'patchouli', 'rose', 'vanilla', 'cedar', 'bergamot', 'vetiver', 'tonka_bean', 'mandarin_orange', 'lily-of-the-valley', 'violet', 'lavender', 'lemon', 'ylang-ylang', 'iris', 'orange_blossom', 'oakmoss', 'leather', 'cardamom', 'geranium', 'white_musk', 'peach', 'freesia', 'grapefruit', 'agarwood_oud', 'cinnamon', 'benzoin', 'incense', 'pink_pepper', 'nutmeg', 'tuberose', 'neroli', 'orange', 'peony', 'labdanum', 'black_currant', 'ginger', 'pepper', 'virginia_cedar', 'heliotrope', 'magnolia', 'raspberry', 'mint', 'vanille', 'saffron', 'gardenia', 'coriander', 'lily', 'ambergris', 'pear', 'carnation', 'guaiac_wood', 'apple', 'orchid', 'tobacco', 'plum', 'cloves', 'violet_leaf', 'basil', 'sage', 'honey', 'citruses', 'black_pepper', 'lime', 'rosemary', 'orris_root', 'galbanum', 'moss', 'pineapple', 'olibanum', 'caramel', 'spices', 'osmanthus', 'petitgrain', 'tangerine', 'mimosa', 'lotus', 'cashmere_wood', 'coconut', 'cyclamen', 'clary_sage', 'myrrh', 'amalfi_lemon

In [17]:
final_df = df[feature_cols]
final_df

Unnamed: 0,musk,jasmine,amber,patchouli,rose,vanilla,cedar,bergamot,vetiver,tonka_bean,...,pink_peony,sea_salt,fennel,white_lily,clover,tomato_leaf,rice,red_rose,other_notes,label
0,0,0,16,0,4,0,0,4,0,0,...,0,0,0,0,0,0,0,0,2,men
1,16,0,25,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,women
2,0,0,0,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,unisex
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,unisex
4,0,0,4,0,0,0,0,0,4,0,...,0,0,0,0,0,0,0,0,1,unisex
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47644,0,0,0,16,0,16,0,0,0,0,...,0,0,0,0,0,0,0,0,1,men
47645,16,9,16,0,9,0,0,0,0,16,...,0,0,0,0,0,0,0,0,0,unisex
47646,0,0,16,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,unisex
47647,16,9,0,16,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,unisex


We now have to translate the `label` field to numeric values.

- women = 0
- men = 1
- unisex = 2

In [18]:
gender_map = {'women': 0, 'men': 1, 'unisex': 2}

# final_df was created by selecting columns from df; take an explicit copy
# so the assignment below does not raise SettingWithCopyWarning.
final_df = final_df.copy()
final_df['label'] = final_df.label.map(gender_map)


In [19]:
final_df

Unnamed: 0,musk,jasmine,amber,patchouli,rose,vanilla,cedar,bergamot,vetiver,tonka_bean,...,pink_peony,sea_salt,fennel,white_lily,clover,tomato_leaf,rice,red_rose,other_notes,label
0,0,0,16,0,4,0,0,4,0,0,...,0,0,0,0,0,0,0,0,2,1
1,16,0,25,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,4,0,0,0,0,0,4,0,...,0,0,0,0,0,0,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47644,0,0,0,16,0,16,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
47645,16,9,16,0,9,0,0,0,0,16,...,0,0,0,0,0,0,0,0,0,2
47646,0,0,16,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2
47647,16,9,0,16,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


We have a final feature dataset with **47649** perfumes, **299** features and **1** label.

## 4.6 Train/Test Split<a id='4.6_Train/Test_Split'></a>

Let's check what partition sizes a 75/25 train/test split would give.

In [20]:
len(final_df) * .75, len(final_df) * .25

(35736.75, 11912.25)

In [21]:
X_train, X_test, y_train, y_test = train_test_split(final_df.drop(columns='label'),
                                                    final_df.label, test_size=0.25, 
                                                    random_state=47, shuffle=True, stratify=final_df.label)

In [22]:
X_train.shape, X_test.shape

((35736, 299), (11913, 299))

In [23]:
y_train.shape, y_test.shape

((35736,), (11913,))
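
The integer sizes above differ slightly from the fractional values 35736.75 and 11912.25. As far as we understand scikit-learn's behavior when only a fractional `test_size` is given, the test size is rounded up and the training set receives the remainder. The arithmetic can be sketched as follows; the variable names are illustrative.

```python
import math

n_samples = 47649
test_size = 0.25

# Round the test partition up, then give the rest to the training partition
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 35736 11913
```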

In [24]:
# Check the `dtypes` attribute of `X_train` to verify that all features are numeric
X_train.dtypes

musk           int64
jasmine        int64
amber          int64
patchouli      int64
rose           int64
               ...  
clover         int64
tomato_leaf    int64
rice           int64
red_rose       int64
other_notes    int64
Length: 299, dtype: object

In [25]:
# Repeat the check for the test split in `X_test`
X_test.dtypes

musk           int64
jasmine        int64
amber          int64
patchouli      int64
rose           int64
               ...  
clover         int64
tomato_leaf    int64
rice           int64
red_rose       int64
other_notes    int64
Length: 299, dtype: object

We now have only numeric features in both `X_train` and `X_test`.

## 4.7 Save File<a id='4.7_Save_File'></a>

Now that we have split our dataset, let's save it into separate csv files as test and train datasets.

In [26]:
X_train['label'] = y_train

In [27]:
datapath = '../data/processed/'
datapath_perfumedata = os.path.join(datapath, 'train.csv')
if not os.path.exists(datapath_perfumedata):
    X_train.to_csv(datapath_perfumedata, index=False)

Let's also attach the labels to the test set and save it as a separate csv file.

In [28]:
X_test['label'] = y_test

In [29]:
datapath_perfumedata = os.path.join(datapath, 'test.csv')
if not os.path.exists(datapath_perfumedata):
    X_test.to_csv(datapath_perfumedata, index=False)

## 4.8 Summary<a id='4.8_Summary'></a>

We started with two datasets: one containing the notes of each perfume and another listing how often each note occurs, in decreasing order of frequency.

We created new features from them.

1. For each popular _note_, we created a new column holding an encoded value based on whether the note is in the Top, Middle or Base layer of the perfume. Notes that appear in multiple layers are handled by summing the squared layer values.
2. For _notes_ that are not captured in these columns, a separate `other_notes` column holds the count of such notes in that perfume.
3. This gave us a dataset with 299 independent features and one dependent target.

The feature set has been divided into __train__ and __test__ sets and saved so that it can be passed on to an ML model.