# Capstone Project: Data Cleaning for Independent Set

In this code notebook, we will clean the scraped data so that it matches the characteristics of the dataset used to train the model.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
graye = pd.read_csv('data/graye_og.csv')

In [3]:
graye.shape

(94, 3)

In [4]:
# check for duplicates
# tees that are also categorised as tops

graye[graye.duplicated(keep=False)]

,product,material,wash_care
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...
20,High Collar Shān - White,75% Cotton\n22% Nylon\n3% Spandex,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
21,High Collar Shān - Duo-toned Sand,75% Cotton\n22% Nylon\n3% Spandex,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
26,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...
27,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...
32,Shōto Linen Baseball Collar Top - Off White,100% German Linen,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
33,Shōto Linen Baseball Collar Top - Olive,100% German Linen,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
34,Shōto Linen Baseball Collar Top - Navy Blue,100% German Linen,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
35,Seamline Curved Pocket Panel Top - Black,100% Premium Cotton Twill,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...


In [5]:
# drop duplicates

graye.drop_duplicates(keep='first', inplace=True)
graye.shape

(84, 3)
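
The two duplicate-handling calls above behave differently: `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates(keep='first')` retains one row per group. A minimal sketch on a toy frame (the rows are illustrative, not GRAYE data):

```python
import pandas as pd

# Toy frame with one duplicated row (illustrative values only)
toy = pd.DataFrame({
    'product': ['Tee - Blue', 'Tee - Red', 'Tee - Blue', 'Top - Black'],
    'material': ['100% Cotton'] * 4,
})

# keep=False marks all copies, which is convenient for inspecting duplicates
flagged = toy[toy.duplicated(keep=False)]

# keep='first' keeps the first copy; returning a new frame avoids in-place mutation
deduped = toy.drop_duplicates(keep='first')

print(flagged.index.tolist())
print(deduped.shape)
```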

### Get GRAYE dataframe to match training dataframe

1. Use `product` column to get new `type` column, with only basic name/description of the garment
2. Use `material` column to get one column per material, holding that material's share of the garment
3. Use `wash_care` column to get `Washing_instruction` and `Drying_instruction`
   - Machine Wash Cool will be taken to mean `Machine wash_ cold`
   - Do Not Tumble Dry will be taken to mean `Line dry`
   - The other wash and care instructions with no equivalents in the training dataset will be disregarded and discarded
4. Values for `Material_label`, `Chemicals_label`, `Production_label`, `Reusability_label`, and `Recylability_label` will be populated based on observation of the actual garments at GRAYE's physical store
5. `Manufacturing_location` will be imputed as `Asia`. GRAYE is based in Singapore, so its garments are likely manufactured in nearby countries such as China or Vietnam
6. `Use_location` will be randomly imputed with the [regions that GRAYE ships to](https://grayestudio.com/pages/delivery-information), with a higher weight placed on Asia
7. `Transporation_distance` (spelled as in the training dataset) will be imputed with the mean distance for the matching `Manufacturing_location` and `Use_location` in the training dataset

#### Get `type` column

In [6]:
# convert all product values to title case
# so that the case-sensitive match on 'Tee' does not pick up 'tee' inside 'steel'

graye['product'] = graye['product'].str.title()
graye

,product,material,wash_care
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...
2,Cupro Tee - Stone,49.8% Cupro\n46.1% Cotton\n4.1% Lycra\nDouble ...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
3,Cupro Tee - Brick,49.8% Cupro\n46.1% Cotton\n4.1% Lycra\nDouble ...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
4,Cupro Tee - Jade Green,49.8% Cupro\n46.1% Cotton\n4.1% Lycra \nDouble...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
...,...,...,...
89,Unisex Boxer Shorts - Camel,100% Cotton Poplin,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
90,Unisex Boxer Shorts - Ivory,100% Cotton Poplin,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
91,Elasticated Workwear Pants,100% Cotton,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...
92,Detachable Suspender Trousers,100% Cotton,Machine Wash Cool\nWash With Similar Colours\n...


In [7]:
types = ['Tee', 'Shirt', 'Pullover', 'Top', 'Shān', 'Vest', 'Blazer', 'Kimono', 'Jacket', 'Pants', 'Trousers', 'Shorts', 'Jogger']

In [8]:
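garment = []

# take the first entry of `types` found in each product name, so that every
# product yields exactly one label and the list stays aligned with the df rows
for product in graye['product']:
    for garment_type in types:
        if garment_type in product:
            garment.append(garment_type)
            break
    else:
        garment.append(None)  # no known type found in this product name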
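
As an alternative to relying on title case, a word-boundary regex can keep 'tee' inside 'steel' from matching. The sketch below is one possible approach, not the method used in this notebook; the product names and the shortened `types` list are made up for illustration. Note that `str.extract` returns the first match by position in the string, and `NaN` when nothing matches.

```python
import re
import pandas as pd

# Hypothetical product names and a shortened type list, for illustration only
products = pd.Series(['V Ribbed Tee - Mint Green', 'Steel Grey Shirt', 'Linen Shorts - Navy'])
types = ['Tee', 'Shirt', 'Shorts']

# \b anchors each alternative to whole words, so 'tee' inside 'Steel' is skipped
pattern = r'\b(' + '|'.join(re.escape(t) for t in types) + r')\b'
garment = products.str.extract(pattern, flags=re.IGNORECASE, expand=False)

print(garment.tolist())
```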

In [9]:
# create new type column in df

graye['type'] = garment
graye.head()

,product,material,wash_care,type
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,Tee
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,Tee
2,Cupro Tee - Stone,49.8% Cupro\n46.1% Cotton\n4.1% Lycra\nDouble ...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,Tee
3,Cupro Tee - Brick,49.8% Cupro\n46.1% Cotton\n4.1% Lycra\nDouble ...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,Tee
4,Cupro Tee - Jade Green,49.8% Cupro\n46.1% Cotton\n4.1% Lycra \nDouble...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,Tee


In [10]:
# match product type to training dataset's

graye['type'] = graye['type'].str.lower()
graye['type'] = graye['type'].map({'tee':'t-shirt',
                                   'shirt':'shirt',
                                   'pullover': 'sweater',
                                   'top': 'shirt',
                                   'shān': 'shirt',
                                   'vest': 'jacket',
                                   'jacket': 'jacket',
                                   'blazer': 'jacket',
                                   'kimono': 'jacket',
                                   'pants': 'trousers',
                                   'trousers': 'trousers',
                                   'shorts': 'short',
                                   'jogger': 'trousers'})
graye.head()

,product,material,wash_care,type
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt
2,Cupro Tee - Stone,49.8% Cupro\n46.1% Cotton\n4.1% Lycra\nDouble ...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt
3,Cupro Tee - Brick,49.8% Cupro\n46.1% Cotton\n4.1% Lycra\nDouble ...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt
4,Cupro Tee - Jade Green,49.8% Cupro\n46.1% Cotton\n4.1% Lycra \nDouble...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt
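
`Series.map` with a dict returns `NaN` for any key missing from the dict, so it can be worth confirming that every type was mapped. A small sketch, using a subset of the mapping above plus one hypothetical unmapped value (`'dress'`):

```python
import pandas as pd

# Subset of the mapping above; 'dress' is a hypothetical value with no entry
type_map = {'tee': 't-shirt', 'shorts': 'short', 'jogger': 'trousers'}
raw = pd.Series(['tee', 'shorts', 'dress'])

mapped = raw.map(type_map)

# Values that fell through the mapping show up as NaN
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```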


### Get material columns and values

In [11]:
# remove new line (\n)

graye['material'] = graye['material'].str.replace('\n', ' ', regex=False)

In [12]:
# split percentages and materials

m_regex = r'(\d+(?:\.\d+)?%)\s+([A-Za-z\s]+)'
material_match = graye['material'].str.extractall(m_regex)
material_match.columns = ['pct', 'material']
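
To see what `extractall` produces, the sketch below applies the same regex to a single composition string taken from the data. Because `[A-Za-z\s]+` also consumes the space before the next percentage, intermediate names keep a trailing space. The final `str.strip()` is shown only as an optional normalisation and is not applied in this notebook.

```python
import pandas as pd

m_regex = r'(\d+(?:\.\d+)?%)\s+([A-Za-z\s]+)'
sample = pd.Series(['49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey'])

# One row per match, indexed by (original row, match number)
matches = sample.str.extractall(m_regex)
matches.columns = ['pct', 'material']

# Intermediate names keep the trailing space captured by \s
print(matches['material'].tolist())

# Optional: stripping whitespace gives one spelling per name
print(matches['material'].str.strip().tolist())
```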

In [13]:
# change pct to decimal

material_match['pct'] = material_match['pct'].str.rstrip('%').astype(float)
material_match['pct'] = (material_match['pct'] / 100).round(2)

In [14]:
# pivot df so that each material has its own column

material_match = material_match.pivot(columns='material', values='pct')

In [15]:
# fill NaN with 0

material_match.fillna(value = 0.00, inplace=True)

In [16]:
# groupby level_0 so that each garment only occupies one row

material_match.reset_index(inplace=True)
material_match = material_match.groupby(['level_0']).agg('sum')
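
The pivot, fill and group-by steps above could also be collapsed into one `pivot_table` call. This is a sketch on made-up rows that mimic the `extractall` output, offered as an alternative rather than the approach used here:

```python
import pandas as pd

# Toy long-format matches indexed by (garment row, match number); values are illustrative
long = pd.DataFrame(
    {'pct': [0.50, 0.46, 0.04, 1.00],
     'material': ['Cupro', 'Cotton', 'Lycra', 'Cotton']},
    index=pd.MultiIndex.from_tuples([(2, 0), (2, 1), (2, 2), (5, 0)], names=[None, 'match']),
)

# reset_index names the unnamed first level 'level_0'
flat = long.reset_index()

# One row per garment, one column per material, missing combinations filled with 0
wide = flat.pivot_table(index='level_0', columns='material', values='pct',
                        aggfunc='sum', fill_value=0.0)
print(wide)
```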

In [17]:
# check

material_match.columns

Index(['match', 'Cotton', 'Cotton ', 'Cotton  ', 'Cotton Modal',
       'Cotton Poplin', 'Cupro ', 'German Linen', 'Linen', 'Lycra',
       'Lycra  Double Jersey', 'Lycra Double Jersey', 'Lyocell', 'Nylon',
       'Nylon ', 'Polyester', 'Premium Cotton Twill', 'Spandex', 'cotton',
       'cotton ', 'cotton canvas', 'polyester blend'],
      dtype='object', name='material')

**Comments:**

There is a difference between fibre and fabric. Raw materials like cotton, polyester, nylon, etc. indicate the fibres used, while poplin, double jersey, canvas, etc. indicate the weave/knit fabrics that the fibres are transformed into.

Spandex and Lycra refer to the same material: Lycra is The Lycra Company's trademarked brand name for spandex.

We will group the materials into each column accordingly.
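
One way to express this grouping is a lookup from a cleaned label to a training-set fibre column. The sketch below is illustrative only: the `fibre_map` dict, the `to_fibre` helper and the `'Other'` fallback are assumptions, not part of this notebook.

```python
# Hypothetical lookup from cleaned label to fibre column; groupings follow the comments above
fibre_map = {
    'cotton': 'Cotton', 'cotton poplin': 'Cotton', 'premium cotton twill': 'Cotton',
    'linen': 'Other_plant', 'german linen': 'Other_plant',
    'lycra': 'Spandex', 'lycra double jersey': 'Spandex', 'spandex': 'Spandex',
    'cupro': 'Other_regenerated',
}

def to_fibre(label: str) -> str:
    """Collapse whitespace, lower-case, then look up the fibre column."""
    key = ' '.join(label.split()).lower()
    # Falling back to 'Other' is an assumption made for this sketch
    return fibre_map.get(key, 'Other')

labels = ['Cotton ', 'Lycra  Double Jersey', 'German Linen', 'Cupro ']
fibres = [to_fibre(label) for label in labels]
print(fibres)
```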

In [18]:
# add all material type pcts together

cotton_cols = ['Cotton', 'Cotton ', 'Cotton  ', 'Cotton Poplin', 'Premium Cotton Twill',
               'cotton', 'cotton ', 'cotton canvas']
spandex_cols = ['Spandex', 'Lycra', 'Lycra  Double Jersey', 'Lycra Double Jersey']

material_match['Cotton'] = material_match[cotton_cols].sum(axis=1)
material_match['Other_plant'] = material_match['German Linen'] + material_match['Linen']
material_match['Polyester'] = material_match['Polyester'] + material_match['polyester blend']
material_match['Nylon'] = material_match['Nylon'] + material_match['Nylon ']
material_match['Spandex'] = material_match[spandex_cols].sum(axis=1)
material_match['Other_regenerated'] = material_match['Cupro '] + material_match['Cotton Modal']

In [19]:
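# drop unnecessary columns
# note: 'Lyocell' is dropped here although the training set has a Lyocell column;
# it is re-created as 0.0 when missing training columns are added later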

to_drop = ['match', 'Cotton ', 'Cotton  ', 'Cotton Modal', 'Cotton Poplin', 'Cupro ', 'German Linen', 'Linen', 
           'Lycra', 'Lycra  Double Jersey', 'Lycra Double Jersey', 'Lyocell', 'Nylon ', 'Premium Cotton Twill', 'cotton',
           'cotton ', 'cotton canvas', 'polyester blend']
material_match.drop(columns = to_drop, inplace=True)
material_match

level_0,Cotton,Nylon,Polyester,Spandex,Other_plant,Other_regenerated
0,1.00,0.0,0.0,0.00,0.0,0.0
1,1.00,0.0,0.0,0.00,0.0,0.0
2,0.46,0.0,0.0,0.04,0.0,0.5
3,0.46,0.0,0.0,0.04,0.0,0.5
4,0.46,0.0,0.0,0.04,0.0,0.5
...,...,...,...,...,...,...
89,1.00,0.0,0.0,0.00,0.0,0.0
90,1.00,0.0,0.0,0.00,0.0,0.0
91,1.00,0.0,0.0,0.00,0.0,0.0
92,1.00,0.0,0.0,0.00,0.0,0.0


In [20]:
# merge with original graye df

graye_df = graye.merge(material_match, left_index=True, right_index=True)
graye_df.head()

,product,material,wash_care,type,Cotton,Nylon,Polyester,Spandex,Other_plant,Other_regenerated
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt,1.0,0.0,0.0,0.0,0.0,0.0
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt,1.0,0.0,0.0,0.0,0.0,0.0
2,Cupro Tee - Stone,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5
3,Cupro Tee - Brick,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5
4,Cupro Tee - Jade Green,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Je...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5
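
Since some parsed labels are dropped rather than summed into a fibre column, a quick check that each garment's shares still add up to roughly 1 may help surface incomplete rows. A sketch on toy rows; the 0.02 tolerance is an arbitrary allowance for rounding each share to two decimals:

```python
import numpy as np
import pandas as pd

fibre_cols = ['Cotton', 'Nylon', 'Polyester', 'Spandex', 'Other_plant', 'Other_regenerated']

# Toy rows: the second deliberately omits part of its composition
toy = pd.DataFrame(
    [[1.00, 0.0, 0.0, 0.00, 0.0, 0.0],
     [0.46, 0.0, 0.0, 0.04, 0.0, 0.0]],
    columns=fibre_cols,
)

totals = toy[fibre_cols].sum(axis=1)

# Rows whose shares do not sum to about 1 are worth inspecting by hand
incomplete = toy[~np.isclose(totals, 1.0, atol=0.02)]
print(incomplete.index.tolist())
```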


### Get `Washing_instruction` and `Drying_instruction` columns

In [21]:
# remove new line (\n)
# all lower case for standardisation

graye['wash_care'] = graye['wash_care'].str.replace('\n', ' ', regex=False).str.lower()

In [22]:
# get washing instructions

wash_instructions = []

for care in graye['wash_care']:
    if 'machine wash cool' in care:
        wash_instructions.append('Machine wash_ cold')
    elif 'hand wash' in care:
        wash_instructions.append('Hand wash')
    else:
        wash_instructions.append('Unknown')

In [23]:
# get drying instructions

dry_instructions = []

for care in graye['wash_care']:
    if 'do not tumble dry' in care:
        dry_instructions.append('Line dry')
    else:
        dry_instructions.append('Unknown')
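
The same wash and dry rules can be written without explicit loops using `np.select` and `np.where`. This is a sketch on made-up care strings, shown as an alternative to the loops above:

```python
import numpy as np
import pandas as pd

# Illustrative, already lower-cased care strings
care = pd.Series([
    'machine wash cool do not bleach do not tumble dry',
    'hand wash only',
    'dry clean',
])

# Conditions are checked in order; the first that holds decides the label
wash = np.select(
    [care.str.contains('machine wash cool', regex=False),
     care.str.contains('hand wash', regex=False)],
    ['Machine wash_ cold', 'Hand wash'],
    default='Unknown',
)
dry = np.where(care.str.contains('do not tumble dry', regex=False), 'Line dry', 'Unknown')

print(wash.tolist())
print(dry.tolist())
```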

In [24]:
# add wash and dry instructions into df columns
# check

graye_df['Washing_instruction'] = wash_instructions
graye_df['Drying_instruction'] = dry_instructions
graye_df.head()

,product,material,wash_care,type,Cotton,Nylon,Polyester,Spandex,Other_plant,Other_regenerated,Washing_instruction,Drying_instruction
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt,1.0,0.0,0.0,0.0,0.0,0.0,Machine wash_ cold,Line dry
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt,1.0,0.0,0.0,0.0,0.0,0.0,Machine wash_ cold,Line dry
2,Cupro Tee - Stone,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5,Machine wash_ cold,Line dry
3,Cupro Tee - Brick,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5,Machine wash_ cold,Line dry
4,Cupro Tee - Jade Green,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Je...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5,Machine wash_ cold,Line dry


### Impute values for `Material_label`, `Chemicals_label`, `Production_label`, `Reusability_label`, and `Recylability_label` 

The garments in GRAYE's physical store carry only material labels. Hence, we will impute 1 for `Material_label` and 0 for the other label columns.

In [25]:
# impute values for columns

graye_df['Material_label'] = 1
graye_df['Chemicals_label'] = 0
graye_df['Production_label'] = 0
graye_df['Reusability_label'] = 0
graye_df['Recylability_label'] = 0
graye_df.head()

,product,material,wash_care,type,Cotton,Nylon,Polyester,Spandex,Other_plant,Other_regenerated,Washing_instruction,Drying_instruction,Material_label,Chemicals_label,Production_label,Reusability_label,Recylability_label
0,V Ribbed Tee - Mint Green,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt,1.0,0.0,0.0,0.0,0.0,0.0,Machine wash_ cold,Line dry,1,0,0,0,0
1,V Ribbed Tee - Space Blue,100% Cotton,Machine Wash Cool Do not bleach Do not tumble ...,t-shirt,1.0,0.0,0.0,0.0,0.0,0.0,Machine wash_ cold,Line dry,1,0,0,0,0
2,Cupro Tee - Stone,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5,Machine wash_ cold,Line dry,1,0,0,0,0
3,Cupro Tee - Brick,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Jersey,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5,Machine wash_ cold,Line dry,1,0,0,0,0
4,Cupro Tee - Jade Green,49.8% Cupro 46.1% Cotton 4.1% Lycra Double Je...,Machine Wash Cool\nDo Not Bleach\nDo Not Tumbl...,t-shirt,0.46,0.0,0.0,0.04,0.0,0.5,Machine wash_ cold,Line dry,1,0,0,0,0


### Impute values for `Manufacturing_location` and `Use_location`

In [26]:
# impute Asia for manufacturing location column

graye_df['Manufacturing_location'] = 'Asia'

In [27]:
# impute use location with higher weight given to Asia

locations = ['Asia', 'EU', 'UK', 'USA']
weights = [0.7, 0.1, 0.1, 0.1]

graye_df['Use_location'] = np.random.choice(locations, size = len(graye_df), p=weights)
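
The draw above is unseeded, so each run assigns different use locations. If repeatable output is wanted, one option is a seeded NumPy `Generator`. This is a sketch only; the seed value and the size of 84 (the row count after dropping duplicates) are chosen for illustration:

```python
import numpy as np

locations = ['Asia', 'EU', 'UK', 'USA']
weights = [0.7, 0.1, 0.1, 0.1]

# The seed value is arbitrary; any fixed integer makes the draw repeatable
rng = np.random.default_rng(seed=42)
draw_a = rng.choice(locations, size=84, p=weights)

# Re-creating the generator with the same seed reproduces the same draw
rng = np.random.default_rng(seed=42)
draw_b = rng.choice(locations, size=84, p=weights)

print((draw_a == draw_b).all())
```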

### Impute values for `Transporation_distance`

In [28]:
# get train dataset to get transportation distance

train_df = pd.read_csv('data/for_modelling.csv')
train_df.head()

,Type,Cotton,Organic_cotton,Other_plant,Wool,Other_animal,Polyester,Nylon,Spandex,Polyamide,...,Chemicals_label,Production_label,Manufacturing_location,Transporation_distance,Use_location,Washing_instruction,Drying_instruction,Reusability_label,Recylability_label,EI
0,jeans,0.4,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,Africa,2072.0,Netherlands,Machine wash_ cold,Line dry,1,1,1
1,jeans,0.4,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,Africa,2389.0,Germany,Machine wash_ cold,Line dry,1,1,1
2,jeans,0.4,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,Africa,2262.0,Belgium,Machine wash_ cold,Line dry,1,1,1
3,jacket,0.4,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,Africa,2728.0,France,Machine wash_ cold,Line dry,1,1,1
4,jacket,0.4,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,Africa,2887.0,Austria,Machine wash_ cold,Line dry,1,1,1


In [29]:
# find mean distance based on manufacturing and use locations in train dataset

asia_asia = train_df[(train_df['Manufacturing_location'] == 'Asia') & (train_df['Use_location'] == 'Asia')]['Transporation_distance'].mean()
asia_uk = train_df[(train_df['Manufacturing_location'] == 'Asia') & (train_df['Use_location'] == 'UK')]['Transporation_distance'].mean()
asia_eu = train_df[(train_df['Manufacturing_location'] == 'Asia') & (train_df['Use_location'] == 'EU')]['Transporation_distance'].mean()
asia_usa = train_df[(train_df['Manufacturing_location'] == 'Asia') & (train_df['Use_location'] == 'USA')]['Transporation_distance'].mean()

print(asia_asia)
print(asia_uk)
print(asia_eu)
print(asia_usa)

21273.0
18273.0
18354.53846153846
11898.458333333334
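
The four filtered means could also be computed in one pass with `groupby`. The sketch below uses a toy stand-in for `train_df` with made-up distances, so its numbers do not correspond to the values printed above:

```python
import pandas as pd

# Toy stand-in for train_df; distances are invented for illustration
toy_train = pd.DataFrame({
    'Manufacturing_location': ['Asia', 'Asia', 'Asia', 'Africa'],
    'Use_location': ['UK', 'UK', 'EU', 'UK'],
    'Transporation_distance': [18000.0, 18500.0, 18300.0, 2000.0],
})

# Mean distance per use location, restricted to garments made in Asia
asia_means = (
    toy_train[toy_train['Manufacturing_location'] == 'Asia']
    .groupby('Use_location')['Transporation_distance']
    .mean()
)
print(asia_means.to_dict())
```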


In [30]:
# impute transportation distance based on above mean distance

graye_df.loc[graye_df['Use_location'] == 'Asia', 'Transporation_distance'] = asia_asia
graye_df.loc[graye_df['Use_location'] == 'UK', 'Transporation_distance'] = asia_uk
graye_df.loc[graye_df['Use_location'] == 'EU', 'Transporation_distance'] = asia_eu
graye_df.loc[graye_df['Use_location'] == 'USA', 'Transporation_distance'] = asia_usa
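
An equivalent way to assign the distances is a dict lookup with `Series.map`. In this sketch the mean distances are the values printed above, rounded to two decimals for readability, and the `use_location` series is a toy example:

```python
import pandas as pd

# Mean distances from the output above, rounded here for readability
mean_distance = {'Asia': 21273.0, 'UK': 18273.0, 'EU': 18354.54, 'USA': 11898.46}

# Toy use locations, for illustration
use_location = pd.Series(['Asia', 'EU', 'USA', 'Asia'])

# Any location missing from the dict would come back as NaN
distance = use_location.map(mean_distance)
print(distance.tolist())
```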

### Clean up GRAYE dataset

In [31]:
graye_df.columns

Index(['product', 'material', 'wash_care', 'type', 'Cotton', 'Nylon',
       'Polyester', 'Spandex', 'Other_plant', 'Other_regenerated',
       'Washing_instruction', 'Drying_instruction', 'Material_label',
       'Chemicals_label', 'Production_label', 'Reusability_label',
       'Recylability_label', 'Manufacturing_location', 'Use_location',
       'Transporation_distance'],
      dtype='object')

In [32]:
# clean up graye_df so columns match with training dataset
# drop unnecessary columns

graye_df.drop(columns = ['product', 'material', 'wash_care'], inplace=True)

In [33]:
# rename `type` to `Type` to match training dataset

graye_df.rename(columns = {'type':'Type'}, inplace=True)

In [34]:
# add missing columns in graye df to match train df

for col in train_df.columns:
    if col not in graye_df.columns:
        graye_df[col] = 0.0

In [35]:
train_df.columns

Index(['Type', 'Cotton', 'Organic_cotton', 'Other_plant', 'Wool',
       'Other_animal', 'Polyester', 'Nylon', 'Spandex', 'Polyamide',
       'Other_synthetic', 'Lyocell', 'Viscose', 'Rayon', 'Other_regenerated',
       'Other', 'Recycled_content', 'Reused_content', 'Material_label',
       'Chemicals_label', 'Production_label', 'Manufacturing_location',
       'Transporation_distance', 'Use_location', 'Washing_instruction',
       'Drying_instruction', 'Reusability_label', 'Recylability_label', 'EI'],
      dtype='object')

In [36]:
# rearrange columns in graye_df

graye_df = graye_df[['Type', 'Cotton', 'Organic_cotton', 'Other_plant', 'Wool',
       'Other_animal', 'Polyester', 'Nylon', 'Spandex', 'Polyamide',
       'Other_synthetic', 'Lyocell', 'Viscose', 'Rayon', 'Other_regenerated',
       'Other', 'Recycled_content', 'Reused_content', 'Material_label',
       'Chemicals_label', 'Production_label', 'Manufacturing_location',
       'Transporation_distance', 'Use_location', 'Washing_instruction',
       'Drying_instruction', 'Reusability_label', 'Recylability_label', 'EI']]

In [37]:
# drop EI column

graye_df = graye_df.drop(columns='EI')
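
Adding the missing columns, reordering them and dropping `EI` can also be done in one step with `reindex`. This is a sketch with a shortened, illustrative column list; in the notebook the list would come from `train_df.columns`:

```python
import pandas as pd

# Shortened, illustrative column list standing in for train_df.columns
train_cols = ['Type', 'Cotton', 'Wool', 'Polyester', 'EI']

# Toy frame with columns out of order and 'Wool' missing
toy = pd.DataFrame({'Polyester': [0.0], 'Type': ['t-shirt'], 'Cotton': [1.0]})

# EI is left out because it is dropped from graye_df above
feature_cols = [c for c in train_cols if c != 'EI']

# reindex orders the columns and fills any that are missing with 0.0
aligned = toy.reindex(columns=feature_cols, fill_value=0.0)
print(aligned.columns.tolist())
```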

In [38]:
# check final df

graye_df

,Type,Cotton,Organic_cotton,Other_plant,Wool,Other_animal,Polyester,Nylon,Spandex,Polyamide,...,Material_label,Chemicals_label,Production_label,Manufacturing_location,Transporation_distance,Use_location,Washing_instruction,Drying_instruction,Reusability_label,Recylability_label
0,t-shirt,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0
1,t-shirt,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,1,0,0,Asia,18354.538462,EU,Machine wash_ cold,Line dry,0,0
2,t-shirt,0.46,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0
3,t-shirt,0.46,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,1,0,0,Asia,18354.538462,EU,Machine wash_ cold,Line dry,0,0
4,t-shirt,0.46,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,short,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0
90,short,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0
91,trousers,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0
92,trousers,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,1,0,0,Asia,21273.000000,Asia,Machine wash_ cold,Line dry,0,0


In [39]:
# save df

graye_df.to_csv('data/graye_modelling.csv', index=False)