# **Product Category Dataset Creation**

**Description**

[Olist](https://olist.com/pt-br/) is a Brazilian unicorn that offers e-commerce solutions to small and mid-sized companies in Brazil.

The Olist datasets cover sales between 2017 and 2018 across many product categories, from Bed Bath & Table to Agro.

**Objective**

Creating a dataset based on product category to support a Machine Learning Feasibility Study on Time Series.

**Source**

Olist Datasets: https://www.kaggle.com/olistbr/brazilian-ecommerce

Exploratory Analysis: https://github.com/santos-elisa/StackLabs/blob/main/DatasetOlist_ExploratoryAnalysis.ipynb

Stack Labs is promoted by [Stack Tecnologias](https://stacktecnologias.com.br).

# **Settings**

In [3]:
# Importing drive using Google Colab.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Importing library.
import pandas as pd

# **Reading Olist datasets**

Reading selected columns from four datasets with a focus on time series analysis.

In [5]:
# Selecting columns from the Order Items dataset.
df_order_items_cols = pd.read_csv(
    '/content/drive/MyDrive/StackLabs2201/Datasets/olist_order_items_dataset.csv',
    sep=',',
    header=0,
    usecols=['order_id', 'product_id', 'price'],
)

In [6]:
# Selecting columns from the Orders dataset.
df_orders_cols = pd.read_csv(
    '/content/drive/MyDrive/StackLabs2201/Datasets/olist_orders_dataset.csv',
    sep=',',
    header=0,
    usecols=['order_id', 'order_status', 'order_purchase_timestamp'],
)

In [7]:
# Selecting columns from Products dataset.
df_products_cols = pd.read_csv(
    '/content/drive/MyDrive/StackLabs2201/Datasets/olist_products_dataset.csv',
    sep=',',
    header=0,
    usecols=['product_id', 'product_category_name'],
)

In [8]:
# Selecting columns from the Product Category Name Translation dataset.
df_prod_cat_cols = pd.read_csv(
    '/content/drive/MyDrive/StackLabs2201/Datasets/product_category_name_translation.csv',
    sep=',',
    header=0,
)
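
The column selection above can be tried without access to the Drive files. The following is a minimal sketch that reads a hypothetical two-row sample from memory; the sample rows, the `customer_id` column, and the names `sample_csv` and `orders_sample` are assumptions made for illustration. It also shows `parse_dates`, an optional `read_csv` parameter that converts the timestamp column while reading, as an alternative to converting it in a later cell.

```python
import io

import pandas as pd

# Hypothetical sample standing in for olist_orders_dataset.csv.
sample_csv = io.StringIO(
    "order_id,customer_id,order_status,order_purchase_timestamp\n"
    "a1,c1,delivered,2017-09-13 08:59:02\n"
    "a2,c2,shipped,2018-01-14 14:33:31\n"
)

# usecols keeps only the listed columns; parse_dates converts the
# timestamp column to datetime64 during the read.
orders_sample = pd.read_csv(
    sample_csv,
    usecols=["order_id", "order_status", "order_purchase_timestamp"],
    parse_dates=["order_purchase_timestamp"],
)

print(orders_sample.dtypes)
```

Columns not listed in `usecols` (here, `customer_id`) are never loaded, which keeps memory usage down on the full files.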

# **Creating a dataset with selected columns**

In [9]:
# Merging Order Items and Orders datasets based on the order id.
merge_1 = pd.merge(df_order_items_cols, df_orders_cols, on='order_id', how='left')

In [10]:
# Merging merge_1 and Products datasets based on the product id.
merge_2 = pd.merge(merge_1, df_products_cols, on='product_id', how='left')

In [11]:
# Merging merge_2 and Product Category Name Translation datasets based on the product category name.
merge_3 = pd.merge(merge_2, df_prod_cat_cols, on='product_category_name', how='left')
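
Left merges silently keep rows that find no match and can duplicate rows if the right-hand key is not unique. As a hedged sketch on hypothetical toy frames (`items`, `products`, and their values are invented for illustration), the `validate` and `indicator` parameters of `pd.merge` can make both situations visible:

```python
import pandas as pd

# Hypothetical toy frames mirroring the columns used in the merges above.
items = pd.DataFrame({
    "order_id": ["a1", "a1", "a2"],
    "product_id": ["p1", "p2", "p3"],
    "price": [58.9, 239.9, 199.0],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["perfumaria", "pet_shop"],
})

merged = pd.merge(
    items,
    products,
    on="product_id",
    how="left",
    validate="many_to_one",  # raises MergeError if product_id repeats in products
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows from items that found no matching product.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With `validate="many_to_one"`, the row count of the result always equals the row count of the left frame, which is the property the three merges above rely on.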

In [12]:
# Naming the final merge result. This binds a new name to the same DataFrame as merge_3; it does not copy it.
df_selected_cols = merge_3

In [13]:
# General info about the dataset: class, entries, columns, non-null count, data type and memory usage.
df_selected_cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       112650 non-null  object 
 1   product_id                     112650 non-null  object 
 2   price                          112650 non-null  float64
 3   order_status                   112650 non-null  object 
 4   order_purchase_timestamp       112650 non-null  object 
 5   product_category_name          111047 non-null  object 
 6   product_category_name_english  111023 non-null  object 
dtypes: float64(1), object(6)
memory usage: 6.9+ MB


In [14]:
# First five rows of the dataset, including the header.
df_selected_cols.head()

| | order_id | product_id | price | order_status | order_purchase_timestamp | product_category_name | product_category_name_english |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 4244733e06e7ecb4970a6e2683c13e61 | 58.9 | delivered | 2017-09-13 08:59:02 | cool_stuff | cool_stuff |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | e5f2d52b802189ee658865ca93d83a8f | 239.9 | delivered | 2017-04-26 10:53:06 | pet_shop | pet_shop |
| 2 | 000229ec398224ef6ca0657da4fc703e | c777355d18b72b67abbeef9df44fd0fd | 199.0 | delivered | 2018-01-14 14:33:31 | moveis_decoracao | furniture_decor |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 7634da152a4610f1595efa32f14722fc | 12.99 | delivered | 2018-08-08 10:00:35 | perfumaria | perfumery |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | ac6c3623068f30de03045865e4e10089 | 199.9 | delivered | 2017-02-04 13:57:51 | ferramentas_jardim | garden_tools |


In [15]:
# Transforming the column order_purchase_timestamp from object to datetime.
df_selected_cols["order_purchase_timestamp"] = pd.to_datetime(df_selected_cols["order_purchase_timestamp"])
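
The conversion above lets pandas infer the timestamp layout. A minimal sketch of a stricter variant follows, assuming the `YYYY-MM-DD HH:MM:SS` layout seen in the `head()` output; the series `raw` and its malformed third value are hypothetical and only there to show what `errors="coerce"` does.

```python
import pandas as pd

# Two timestamps in the layout shown by head(), plus one hypothetical bad value.
raw = pd.Series(["2017-09-13 08:59:02", "2018-01-14 14:33:31", "not a date"])

# An explicit format avoids inference; errors="coerce" turns unparseable
# values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_bad = parsed.isna().sum()
print(parsed.dtype, n_bad)
```

Counting the `NaT` values afterwards is a cheap check that no timestamps were lost in the conversion.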

In [16]:
# General info about the dataset after changing the datatype.
df_selected_cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   order_id                       112650 non-null  object        
 1   product_id                     112650 non-null  object        
 2   price                          112650 non-null  float64       
 3   order_status                   112650 non-null  object        
 4   order_purchase_timestamp       112650 non-null  datetime64[ns]
 5   product_category_name          111047 non-null  object        
 6   product_category_name_english  111023 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 6.9+ MB


In [17]:
# Checking missing values.
df_selected_cols.isnull().sum()

order_id                            0
product_id                          0
price                               0
order_status                        0
order_purchase_timestamp            0
product_category_name            1603
product_category_name_english    1627
dtype: int64

In [18]:
# Dropping rows with any missing value so that every remaining entry has a product category name in English.
df_selected_cols.dropna(inplace=True)
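
The missing-value counts above (1603 versus 1627) suggest that 24 rows have a Portuguese category name with no English translation. Calling `dropna()` without arguments removes a row if any column is missing; a hedged alternative, sketched here on a hypothetical toy frame (`toy`, `cleaned`, and `categoria_x` are invented names), is to state the column that matters with `subset` and to renumber the index afterwards.

```python
import pandas as pd

# Hypothetical toy frame: one complete row, one untranslated category,
# and one row with no category at all.
toy = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "product_category_name": ["perfumaria", "categoria_x", None],
    "product_category_name_english": ["perfumery", None, None],
})

# Drop only rows lacking the English name, then rebuild a contiguous index.
cleaned = (
    toy.dropna(subset=["product_category_name_english"])
       .reset_index(drop=True)
)
print(cleaned)
```

On this dataset both approaches should remove the same rows, since the English name is the column with the most missing values; `subset` simply makes that intent explicit.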

In [19]:
# General info about the dataset after eliminating missing values.
df_selected_cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111023 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   order_id                       111023 non-null  object        
 1   product_id                     111023 non-null  object        
 2   price                          111023 non-null  float64       
 3   order_status                   111023 non-null  object        
 4   order_purchase_timestamp       111023 non-null  datetime64[ns]
 5   product_category_name          111023 non-null  object        
 6   product_category_name_english  111023 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 6.8+ MB


In [20]:
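# Counting the number of rows (order items) per product category name in English.
# Series.value_counts() replaces the top-level pd.value_counts(), which is deprecated in recent pandas versions.
df_selected_cols['product_category_name_english'].value_counts()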

bed_bath_table               11115
health_beauty                 9670
sports_leisure                8641
furniture_decor               8334
computers_accessories         7827
                             ...  
arts_and_craftmanship           24
la_cuisine                      14
cds_dvds_musicals               14
fashion_childrens_clothes        8
security_and_services            2
Name: product_category_name_english, Length: 71, dtype: int64
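
Absolute counts are easier to compare across categories when shown as shares. The sketch below uses a hypothetical four-row series (`categories`) to illustrate the `normalize` parameter of `Series.value_counts()` and `head()` for limiting the output to the top entries.

```python
import pandas as pd

# Hypothetical toy column standing in for product_category_name_english.
categories = pd.Series(
    ["bed_bath_table", "bed_bath_table", "health_beauty", "sports_leisure"],
    name="product_category_name_english",
)

counts = categories.value_counts()                # rows per category, descending
shares = categories.value_counts(normalize=True)  # same, as fractions of the total
top = counts.head(1)                              # most frequent category only

print(counts)
print(shares)
```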

# **Analysis**

The best-selling categories in 2017-2018, by number of order items, were:

* bed_bath_table
* health_beauty
* sports_leisure
* furniture_decor
* computers_accessories
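
To put the ranking in context, the counts reported by `value_counts()` can be related to the 111023 rows left after removing missing values. The short calculation below copies those figures from the outputs above; the variable names are chosen here for illustration.

```python
# Counts copied from the value_counts() output above.
top_five = {
    "bed_bath_table": 11115,
    "health_beauty": 9670,
    "sports_leisure": 8641,
    "furniture_decor": 8334,
    "computers_accessories": 7827,
}
total_rows = 111023  # rows remaining after dropna(), per the info() output

top_five_total = sum(top_five.values())
top_five_share = top_five_total / total_rows

print(f"{top_five_total} of {total_rows} rows ({top_five_share:.1%})")
```

The five categories together account for 45587 rows, roughly 41% of the cleaned dataset, with bed_bath_table alone at about 10%.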

# **Next Steps**

Creating a Time Series dataset for the Bed Bath & Table products category to support a Machine Learning Feasibility Study.
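
As a rough sketch of what that next step might look like, the example below filters a hypothetical toy frame to one category and aggregates it per day with `resample`. The daily frequency, the two aggregates, and the names `toy`, `bbt`, `daily`, `items_sold`, and `revenue` are assumptions; the actual study may choose a different granularity or target.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the merged dataset.
toy = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime([
        "2017-09-13 08:59:02",
        "2017-09-13 17:20:00",
        "2017-09-15 10:00:00",
        "2017-09-14 12:00:00",
    ]),
    "price": [58.9, 41.1, 100.0, 12.99],
    "product_category_name_english": [
        "bed_bath_table", "bed_bath_table", "bed_bath_table", "perfumery",
    ],
})

# Keep one category, index by purchase time, and aggregate per calendar day.
bbt = toy[toy["product_category_name_english"] == "bed_bath_table"]
daily = (
    bbt.set_index("order_purchase_timestamp")
       .resample("D")["price"]
       .agg(["count", "sum"])
       .rename(columns={"count": "items_sold", "sum": "revenue"})
)
print(daily)
```

Days without sales appear with a count of zero, which keeps the series regularly spaced, a property most time series models expect.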