<a href="https://colab.research.google.com/github/ryann-arruda/machine-learning/blob/main/machine_learning_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [75]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [76]:
df = pd.read_csv('clothes_price_prediction_data.csv')
df

,Brand,Category,Color,Size,Material,Price
0,New Balance,Dress,White,XS,Nylon,182
1,New Balance,Jeans,Black,XS,Silk,57
2,Under Armour,Dress,Red,M,Wool,127
3,Nike,Shoes,Green,M,Cotton,77
4,Adidas,Sweater,White,M,Nylon,113
...,...,...,...,...,...,...
995,Puma,Jeans,Black,L,Polyester,176
996,Puma,Jacket,Red,XXL,Silk,110
997,Reebok,Sweater,Blue,XS,Denim,127
998,Under Armour,Sweater,Black,XXL,Denim,69


## **Preprocessing**

Let's inspect the data types.

In [77]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Brand     1000 non-null   object
 1   Category  1000 non-null   object
 2   Color     1000 non-null   object
 3   Size      1000 non-null   object
 4   Material  1000 non-null   object
 5   Price     1000 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 47.0+ KB


The output above shows that five columns of this data set have dtype *object* (they hold strings), while one column (**Price**) has dtype *int64*.
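To separate the text columns from the numeric ones programmatically, something like the following sketch may help. The `toy` frame and its values are illustrative assumptions, not rows taken from the real data set.

```python
import pandas as pd

# Toy frame mirroring part of the column layout above (values are made up).
toy = pd.DataFrame({
    "Brand": ["Nike", "Puma"],
    "Size": ["M", "XL"],
    "Price": [77, 176],
})

# Columns with a numeric dtype (here only Price).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical text.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

Selecting by "not numeric" rather than by a specific string dtype keeps the sketch independent of how a given pandas version stores text columns.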

Before proceeding with preprocessing, we need to check for missing data.

In [78]:
df.isnull().sum()

Brand,0
Category,0
Color,0
Size,0
Material,0
Price,0


The output above shows that there is no missing data. Another important check is for duplicate rows, because any duplicates must be removed.

In [79]:
df.duplicated().sum()

0
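This data set has no duplicates, but if the count were non-zero, the rows could be removed roughly as sketched below. The `toy` frame is a hypothetical example built only to show the call.

```python
import pandas as pd

# Hypothetical frame with one repeated row (not taken from the real data).
toy = pd.DataFrame({
    "Brand": ["Nike", "Nike", "Puma"],
    "Price": [77, 77, 176],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```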

In [80]:
df['Category'].unique()

array(['Dress', 'Jeans', 'Shoes', 'Sweater', 'Jacket', 'T-shirt'],
      dtype=object)

Next, we convert the categorical values to numeric values using one-hot encoding. Because this technique suits values that have no inherent order, we apply it only to the **Brand**, **Category**, **Color** and **Material** columns.

In [81]:
encoded_brand = pd.get_dummies(df['Brand'])
encoded_category = pd.get_dummies(df['Category'])
encoded_color = pd.get_dummies(df['Color'])
encoded_material = pd.get_dummies(df['Material'])

In [82]:
df = df.drop(columns=['Brand', 'Category', 'Color', 'Material'])

In [83]:
df = pd.concat([df, encoded_brand, encoded_category, encoded_color, encoded_material], axis=1)
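As an alternative to the encode, drop and concatenate steps above, `pd.get_dummies` can do all three in one call through its `columns` parameter. The sketch below uses a hypothetical `toy` frame. Unlike the cells above, this form prefixes each new column with the source column name (for example `Brand_Nike`), which avoids clashes if two columns share a label.

```python
import pandas as pd

# Hypothetical miniature of the data set (values are illustrative).
toy = pd.DataFrame({
    "Brand": ["Nike", "Puma", "Nike"],
    "Color": ["Red", "Black", "Black"],
    "Price": [77, 176, 110],
})

# One-hot encode the listed columns; the originals are dropped automatically
# and the untouched columns (Price) are kept in front.
encoded = pd.get_dummies(toy, columns=["Brand", "Color"], dtype=int)

print(encoded.columns.tolist())
```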

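The **Size** column represents a sequence of increasing size, so it is converted from categorical to numeric values using label encoding. Note that `LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, so the resulting codes do not follow the smallest-to-largest size order.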

In [87]:
label_encoder = LabelEncoder()

In [88]:
encoded_size = label_encoder.fit_transform(df['Size'])

In [90]:
df['Size'] = encoded_size
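If the integer codes should follow the actual size order, one option is an explicit mapping, sketched below. The list `size_order` is an assumption: the excerpt of the data shown earlier only confirms XS, M, L and XXL, so the full list should be checked against `df['Size'].unique()` before use.

```python
import pandas as pd

# Hypothetical sample of size labels (not the real column).
sizes = pd.Series(["XS", "M", "L", "XXL", "S", "XL"])

# Assumed order from smallest to largest; adjust to the labels in the data.
size_order = ["XS", "S", "M", "L", "XL", "XXL"]
size_to_rank = {size: rank for rank, size in enumerate(size_order)}

# Map each label to its rank; labels missing from size_order become NaN,
# which makes unexpected values easy to spot.
ordinal_codes = sizes.map(size_to_rank)

print(ordinal_codes.tolist())
```

With this mapping a larger code always means a larger size, which is the property the ordinal encoding is meant to preserve.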