Data preprocessing commonly involves the following steps:

1. Data Cleaning: This involves handling missing values, removing duplicates, and dealing with outliers in the data.

2. Data Transformation: This step converts the data into a format suitable for analysis, for example by scaling numeric features, normalizing them, or encoding categorical variables.

3. Feature Selection: This step involves selecting the most relevant features from the dataset that will be used for analysis or modeling.

4. Feature Engineering: This step involves creating new features from the existing ones to improve the performance of the model. It may include creating interaction terms, polynomial features, or extracting meaningful information from the data.

5. Data Splitting: This step involves splitting the dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.

6. Data Integration: This step involves combining data from multiple sources or merging multiple datasets to create a unified dataset for analysis.

7. Data Normalization: Often treated as part of data transformation (step 2), this step rescales features to a common range or distribution so that features with large numeric ranges do not dominate the analysis.

8. Handling Imbalanced Data: If the dataset has imbalanced classes, techniques such as oversampling or undersampling can be used to address this issue.
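As an illustration of the oversampling mentioned in step 8, here is a minimal sketch of random oversampling with plain pandas. The toy data, the column names, and the helper `random_oversample` are all hypothetical and are not part of this notebook's dataset or code.

```python
import pandas as pd

# Toy, made-up data: class "b" is the minority class.
df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8],
    "label":   ["a", "a", "a", "a", "a", "a", "b", "b"],
})

def random_oversample(frame, label_col, random_state=0):
    """Hypothetical helper: grow every class to the majority class size
    by re-drawing its own rows with replacement."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        deficit = target - len(group)
        if deficit > 0:
            extra = group.sample(n=deficit, replace=True, random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    combined = pd.concat(parts).sample(frac=1, random_state=random_state)
    return combined.reset_index(drop=True)

balanced = random_oversample(df, "label")
print(balanced["label"].value_counts())
```

In practice, oversampling is usually applied to the training split only, so that duplicated rows do not end up in both the training and testing sets.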

These are only some of the common preprocessing steps. Which ones apply, and in what order, depends on the nature of the data and on the analysis or modeling task at hand.
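A few of the steps above (cleaning, encoding, splitting, and scaling) can be sketched on a small made-up table. The column names and values below are assumptions chosen for illustration only. The scaler is fitted on the training split alone, which is one common way to keep test-set statistics out of training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only.
raw = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 175.0, 170.0, 160.0, 185.0],
    "colour":    ["red", "blue", "blue", "red", "green", "red", "blue", "green"],
    "label":     [0, 1, 1, 0, 1, 0, 1, 0],
})

# Cleaning: drop exact duplicate rows, then rows with missing values.
clean = raw.drop_duplicates().dropna()

# Transformation: one-hot encode the categorical column.
X = pd.get_dummies(clean.drop(columns="label"), columns=["colour"])
y = clean["label"]

# Splitting: hold out a test set while preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Scaling: fit on the training split, then apply the same scaling to both.
scaler = StandardScaler()
X_train["height_cm"] = scaler.fit_transform(X_train[["height_cm"]]).ravel()
X_test["height_cm"] = scaler.transform(X_test[["height_cm"]]).ravel()

print(X_train.shape, X_test.shape)
```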

# Library Imports and Data Reading

In [1]:
pip install -r requirements.txt

Collecting pandas==1.3.3 (from -r requirements.txt (line 1))
  Using cached pandas-1.3.3.tar.gz (4.7 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 2
  ╰─> [114 lines of output]
      Ignoring numpy: markers 'python_version == "3.7" and (platform_machine != "arm64" or platform_system != "Darwin") and platform_machine != "aarch64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.8" and (platform_machine != "arm64" or platform_system != "Darwin") and platform_machine != "aarch64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_machine == "aarch64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.8" and platform_machine == "aarch64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.8" and platform_machine == "arm64" and platform_system == "Darwin"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.9" and platform_machine == "arm6

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

In [3]:
file_path = r'D:\OneDrive - Trường ĐH CNTT - University of Information Technology\Individual Honours Project\Might need for individual prj\my_project\data\fashion products small\styles.csv'
data = pd.read_csv(file_path, on_bad_lines='skip')

data.head()

| | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011.0 | Casual | Turtle Check Men Navy Blue Shirt |
| 1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012.0 | Casual | Peter England Men Party Blue Jeans |
| 2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016.0 | Casual | Titan Women Silver Watch |
| 3 | 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011.0 | Casual | Manchester United Men Solid Black Track Pants |
| 4 | 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012.0 | Casual | Puma Men Grey T-shirt |
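The `on_bad_lines='skip'` argument passed to `pd.read_csv` above (available in pandas 1.3 and later) silently drops lines that have more fields than the header. The sketch below shows that behaviour on a small made-up CSV string; the rows are invented and are not taken from `styles.csv`.

```python
import io
import pandas as pd

# Made-up CSV text: the third data row has one field too many.
csv_text = (
    "id,gender,season\n"
    "1,Men,Fall\n"
    "2,Women,Summer\n"
    "3,Men,Winter,extra\n"
    "4,Women,Fall\n"
)

# The malformed row is skipped instead of raising a parser error.
df = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(df)
```

Using `on_bad_lines='warn'` instead reports each skipped line, which can help when it matters how many rows were lost during loading.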


# Data Cleaning

In [7]:
# Data info
print(data.info())
print()
print(data.nunique())

<class 'pandas.core.frame.DataFrame'>
Index: 44079 entries, 0 to 44423
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   gender          44079 non-null  object
 1   masterCategory  44079 non-null  object
 2   subCategory     44079 non-null  object
 3   articleType     44079 non-null  object
 4   baseColour      44079 non-null  object
 5   season          44079 non-null  object
 6   usage           44079 non-null  object
dtypes: object(7)
memory usage: 2.7+ MB
None

gender              5
masterCategory      7
subCategory        45
articleType       142
baseColour         46
season              4
usage               8
dtype: int64


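The raw dataset has 10 columns and 44,424 rows. The summary above already reflects the cleaned data (7 columns, 44,079 rows), since that cell was last run (In [7]) after the cleaning cell (In [5]).
I am dropping three columns:
- `id`: it is just an identifier.
- `year`: it is not useful for recommending flattering fashion items.
- `productDisplayName`: it has too many unique values.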
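One way to spot identifier-like or very high-cardinality columns such as `id` and `productDisplayName` is to compare the number of distinct values with the number of rows. The sketch below uses a few made-up rows, and the 0.9 threshold is an arbitrary assumption rather than a rule used in this notebook.

```python
import pandas as pd

# Made-up rows that mimic the shape of some of the dataset's columns.
sample = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "gender": ["Men", "Men", "Women", "Men"],
    "productDisplayName": ["Shirt A", "Jeans B", "Watch C", "Shirt D"],
})

# Fraction of distinct values per column: 1.0 means every row is unique.
uniqueness = sample.nunique() / len(sample)

# Columns whose values are (almost) all unique carry little categorical signal.
high_cardinality = uniqueness[uniqueness > 0.9].index.tolist()
print(high_cardinality)
```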

In [5]:
# Drop unnecessary columns
data = data.drop(columns=['id', 'year', 'productDisplayName'])

# Drop rows with missing value(s)
data = data.dropna()

# Data info
print(data.info())
print()
print(data.nunique())

<class 'pandas.core.frame.DataFrame'>
Index: 44079 entries, 0 to 44423
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   gender          44079 non-null  object
 1   masterCategory  44079 non-null  object
 2   subCategory     44079 non-null  object
 3   articleType     44079 non-null  object
 4   baseColour      44079 non-null  object
 5   season          44079 non-null  object
 6   usage           44079 non-null  object
dtypes: object(7)
memory usage: 2.7+ MB
None

gender              5
masterCategory      7
subCategory        45
articleType       142
baseColour         46
season              4
usage               8
dtype: int64
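Going by the row counts reported in this section (44,424 before and 44,079 after), `dropna()` appears to have removed 345 rows. Before dropping, it can be useful to see which columns contain the missing values. The following sketch shows that check on made-up data; the frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries.
frame = pd.DataFrame({
    "gender": ["Men", "Women", None, "Men"],
    "season": ["Fall", np.nan, "Summer", "Winter"],
})

# Count missing values per column before dropping anything.
missing_per_column = frame.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed.
rows_before = len(frame)
cleaned = frame.dropna()
rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```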


# Data Transformation