# Proyek Analisis Data: E-Commerce Public
- **Nama:** Rafael Simson Riston
- **Email:** rafaelsimsonriston@gmail.com
- **ID Dicoding:** rafaelsimsonr

## About Dataset

Brazilian E-Commerce Public Dataset by Olist

Welcome! This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

This is real commercial data, it has been anonymised, and references to the companies and partners in the review text have been replaced with the names of Game of Thrones great houses.

Context
This dataset was generously provided by Olist, the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners. See more on our website: www.olist.com

After a customer purchases the product from Olist Store a seller gets notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where he can give a note for the purchase experience and write down some comments.

Attention
1. An order might have multiple items.
2. Each item might be fulfilled by a distinct seller.
3. All text identifying stores and partners where replaced by the names of Game of Thrones great houses.

## Menentukan Pertanyaan

1. 


## 1. Import Packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_colwidth', None)

## 2. Data Wrangling

### 1. Gathering Data

In [2]:
customers_df = pd.read_csv(os.path.join(os.getcwd(), 'data/customers_dataset.csv'))
geo_df = pd.read_csv(os.path.join(os.getcwd(), 'data/geolocation_dataset.csv'))
order_items_df = pd.read_csv(os.path.join(os.getcwd(), 'data/order_items_dataset.csv'))
order_pay_df = pd.read_csv(os.path.join(os.getcwd(), 'data/order_payments_dataset.csv'))
order_reviews_df = pd.read_csv(os.path.join(os.getcwd(), 'data/order_reviews_dataset.csv'))
orders_df = pd.read_csv(os.path.join(os.getcwd(), 'data/orders_dataset.csv'))
product_category_name_df = pd.read_csv(os.path.join(os.getcwd(), 'data/product_category_name_translation.csv'))
products_df = pd.read_csv(os.path.join(os.getcwd(), 'data/products_dataset.csv'))
sellers_df = pd.read_csv(os.path.join(os.getcwd(), 'data/sellers_dataset.csv'))

### 2. Accesing Data

#### a. Find missing values and duplicate values

In [3]:
# Assign df_names and dataframes
dataframe_names = ['sellers_df', 'products_df', 'product_category_name_df', 
                'orders_df', 'order_reviews_df', 'order_pay_df', 
                'order_items_df', 'geo_df', 'customers_df']
dataframes = [sellers_df, products_df, product_category_name_df, orders_df, 
              order_reviews_df, order_pay_df, order_items_df, geo_df, customers_df]

In [4]:
def missing_values_check(dataframe_names, dataframes):

    # Create a dictionary to store check results
    datas_check = {
        'data_name': [], 
        'n_rows': [], 
        'n_cols': [], 
        'sum_null': [], 
        'sum_col_null':[], 
        'name_col_null':[], 
        'sum_duplicated':[], 
        'sum_col_duplicate':[],
        'name_col_duplicate':[]
    }

    # Loop through dataframes and perform checks
    for data_name, data in zip(dataframe_names, dataframes):
        datas_check['data_name'].append(data_name)
        datas_check['n_rows'].append(data.shape[0])
        datas_check['n_cols'].append(data.shape[1])
        datas_check['sum_null'].append(data.isna().sum().sum())
        datas_check['sum_duplicated'].append(data.duplicated().sum().sum())
        
        # Initialize lists for storing column-wise null and duplicate information
        sum_col_null = []
        name_col_null = []
        sum_col_duplicate = []
        name_col_duplicate = []
        
        # Loop through columns of each dataframe
        for col in data.columns:
            # Count null values and duplicates for each column
            sum_col_null.append(data[col].isna().sum())
            sum_col_duplicate.append(data.duplicated().sum().sum())
        
        # Count columns with null and duplicate values
        sum_col_n = sum(n != 0 for n in sum_col_null)
        sum_col_d = sum(n != 0 for n in sum_col_duplicate)
        
        # Append column names with null and duplicate values
        for idx, (n_null, n_duplicate) in enumerate(zip(sum_col_null, sum_col_duplicate)):
            if n_null != 0:
                name_col_null.append(data.columns[idx])
            if n_duplicate != 0:
                name_col_duplicate.append(data.columns[idx])

        datas_check['sum_col_null'].append(sum_col_n)
        datas_check['name_col_null'].append(name_col_null)
        datas_check['sum_col_duplicate'].append(sum_col_d)
        datas_check['name_col_duplicate'].append(name_col_duplicate)

    # Convert dictionary to dataframe
    return pd.DataFrame(datas_check)


In [5]:
missing_values_check(dataframe_names, dataframes)

Unnamed: 0,data_name,n_rows,n_cols,sum_null,sum_col_null,name_col_null,sum_duplicated,sum_col_duplicate,name_col_duplicate
0,sellers_df,3095,4,0,0,[],0,0,[]
1,products_df,32951,9,2448,8,"[product_category_name, product_name_lenght, product_description_lenght, product_photos_qty, product_weight_g, product_length_cm, product_height_cm, product_width_cm]",0,0,[]
2,product_category_name_df,71,2,0,0,[],0,0,[]
3,orders_df,99441,8,4908,3,"[order_approved_at, order_delivered_carrier_date, order_delivered_customer_date]",0,0,[]
4,order_reviews_df,99224,7,145903,2,"[review_comment_title, review_comment_message]",0,0,[]
5,order_pay_df,103886,5,0,0,[],0,0,[]
6,order_items_df,112650,7,0,0,[],0,0,[]
7,geo_df,1000163,5,0,0,[],261831,5,"[geolocation_zip_code_prefix, geolocation_lat, geolocation_lng, geolocation_city, geolocation_state]"
8,customers_df,99441,5,0,0,[],0,0,[]


As you can see, there are a lot of missing values in the products dataset, orders, and reviews. On the other hand, duplicate values can be found in this data, specifically in the geographical dataset. However, this is normal because of the zip code, city, and state.

#### b. Check Data Types and Descriptive Statistics

In [6]:
# Initialize dict for store data
def describe_data(dataframe_names, dataframes):
    datas_describe = {
        'data_name':[],
        'column_name': [],
        'n_null':[],
        'n_null_%':[],
        'dtype': [],
        'count':[],
        'mean':[],
        'median':[],
        'min':[],
        'max':[]

    }

    # Use for loop to iterate to all dataframe
    for data_name, data in zip(dataframe_names, dataframes):
        for col in data.columns:
            n_null = data[col].isna().sum() 

            datas_describe['data_name'].append(data_name)
            datas_describe['column_name'].append(col)
            datas_describe['dtype'].append(str(data[col].dtype))  # Convert dtype to string
            datas_describe['count'].append(data[col].count())  # Use count() for non-null values count
            datas_describe['mean'].append(f"{data[col].mean():.2f}" if data[col].dtype in ['int64','float64'] else "")  # Calculate mean for numeric columns
            datas_describe['min'].append(f"{data[col].min():.2f}" if data[col].dtype in ['int64','float64'] else "")  # Calculate min for numeric columns
            datas_describe['max'].append(f"{data[col].max():.2f}" if data[col].dtype in ['int64','float64'] else "")  # Calculate max for numeric columns
            datas_describe['median'].append(f"{data[col].median():.2f}" if data[col].dtype == ['int','float'] else "")  # Calculate median for numeric columns
            datas_describe['n_null'].append(n_null) # Calculate null 
            datas_describe['n_null_%'].append(f'{round(n_null/len(data)*100)}%') # Calculate percentage null

    # For the purpose of readability, using pandas dataframe is the solution
    return pd.DataFrame(datas_describe)


In [7]:
describe_data(dataframe_names,dataframes)

Unnamed: 0,data_name,column_name,n_null,n_null_%,dtype,count,mean,median,min,max
0,sellers_df,seller_id,0,0%,object,3095,,,,
1,sellers_df,seller_zip_code_prefix,0,0%,int64,3095,32291.06,,1001.0,99730.0
2,sellers_df,seller_city,0,0%,object,3095,,,,
3,sellers_df,seller_state,0,0%,object,3095,,,,
4,products_df,product_id,0,0%,object,32951,,,,
5,products_df,product_category_name,610,2%,object,32341,,,,
6,products_df,product_name_lenght,610,2%,float64,32341,48.48,,5.0,76.0
7,products_df,product_description_lenght,610,2%,float64,32341,771.5,,4.0,3992.0
8,products_df,product_photos_qty,610,2%,float64,32341,2.19,,1.0,20.0
9,products_df,product_weight_g,2,0%,float64,32949,2276.47,,0.0,40425.0


There are many instances of incorrect data types and missing values in the order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, review_comment_title, review_comment_message, product_name_length, product_description_length, product_photos_qty, and product_category_name fields.

Most of the missing values are in string fields, with only a few in integer fields.

|         Column                | Data Type                 |           
| ----------------------------- | ------------------------- |
| order_purchase_timestamp      | **str** -> **datetime**   |
| order_approved_at             | **str** -> **datetime**   |
| order_delivered_carrier_date  | **str** -> **datetime**   |
| order_delivered_customer_date | **str** -> **datetime**   |
| order_estimated_delivery_date | **str** -> **datetime**   |
| review_creation_date          | **str** -> **datetime**   |
| review_answer_timestamp       | **str** -> **datetime**   |
| shipping_limit_date           | **str** -> **datetime**   |
| order_item_id                 | **int** -> **str**        |
| geolocation_zip_code_prefix   | **int** -> **str**        |
| customer_zip_code_prefix      | **int** -> **str**        |
| seller_zip_code_prefix        | **int** -> **str**        |


Also, check the payment_value column in the order payments data, because there are transactions with a payment of 0. There are zero values in the product_weight_g column in products_df, and in the freight_value column in order_items_df.

In [8]:
order_pay_df[order_pay_df['payment_value'] == order_pay_df['payment_value'].min()]

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
19922,8bcbe01d44d147f901cd3192671144db,4,voucher,1,0.0
36822,fa65dad1b0e818e3ccc5cb0e39231352,14,voucher,1,0.0
43744,6ccb433e00daae1283ccc956189c82ae,4,voucher,1,0.0
51280,4637ca194b6387e2d538dc89b124b0ee,1,not_defined,1,0.0
57411,00b1cb0320190ca0daa2c88b35206009,1,not_defined,1,0.0
62674,45ed6e85398a87c253db47c2d9f48216,3,voucher,1,0.0
77885,fa65dad1b0e818e3ccc5cb0e39231352,13,voucher,1,0.0
94427,c8c528189310eaa44a745b8d9d26908b,1,not_defined,1,0.0
100766,b23878b3e8eb4d25a158f57d96331b18,4,voucher,1,0.0


Most orders have a payment of 0 because of vouchers, but some have a payment_type that is not defined. So, try changing it to ‘voucher’.

In [9]:
# Check product_weight_g column
products_df[products_df['product_weight_g'] == products_df['product_weight_g'].min()]

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
9769,81781c0fed9fe1ad6e8c81fca1e1cb08,cama_mesa_banho,51.0,529.0,1.0,0.0,30.0,25.0,30.0
13683,8038040ee2a71048d4bdbbdc985b69ab,cama_mesa_banho,48.0,528.0,1.0,0.0,30.0,25.0,30.0
14997,36ba42dd187055e1fbe943b2d11430ca,cama_mesa_banho,53.0,528.0,1.0,0.0,30.0,25.0,30.0
32079,e673e90efa65a5409ff4196c038bb5af,cama_mesa_banho,53.0,528.0,1.0,0.0,30.0,25.0,30.0


In [10]:
filter_desc_lenght = ((products_df['product_description_lenght'] <= 529.0) & (products_df['product_description_lenght'] >= 528.0))
products_df[(products_df['product_category_name'] == 'cama_mesa_banho') & filter_desc_lenght]

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
5145,b2ca2c6a893354e1429f316eb07cfcff,cama_mesa_banho,40.0,529.0,3.0,650.0,51.0,11.0,12.0
6553,a81fc5e66120ebdb295873f74a45c5b7,cama_mesa_banho,49.0,528.0,2.0,2350.0,33.0,8.0,33.0
9082,fe6550863a0e6a0167e2d2859f8f3b9e,cama_mesa_banho,57.0,529.0,1.0,650.0,16.0,10.0,16.0
9769,81781c0fed9fe1ad6e8c81fca1e1cb08,cama_mesa_banho,51.0,529.0,1.0,0.0,30.0,25.0,30.0
13683,8038040ee2a71048d4bdbbdc985b69ab,cama_mesa_banho,48.0,528.0,1.0,0.0,30.0,25.0,30.0
13868,341624e20527d473745831850a7263ea,cama_mesa_banho,48.0,528.0,1.0,900.0,38.0,20.0,28.0
14997,36ba42dd187055e1fbe943b2d11430ca,cama_mesa_banho,53.0,528.0,1.0,0.0,30.0,25.0,30.0
19297,d6b80738418fd3491b89c8d2cf5f8256,cama_mesa_banho,51.0,528.0,1.0,2837.0,36.0,8.0,37.0
24066,71d9c2b8b10871fa61851922aabc292d,cama_mesa_banho,52.0,528.0,1.0,800.0,16.0,10.0,16.0
32079,e673e90efa65a5409ff4196c038bb5af,cama_mesa_banho,53.0,528.0,1.0,0.0,30.0,25.0,30.0


For product_weight_g column that has zero, we'll drop it

In [11]:
# Check freight_value 
order_items_df[order_items_df['freight_value'] == order_items_df['freight_value'].min()]

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
114,00404fa7a687c8c44ca69d42695aae73,1,53b36df67ebb7c41585e8d54d6772e08,7d13fca15225358621be4086e1eb0964,2018-05-15 04:31:26,99.9,0.0
258,00a870c6c06346e85335524935c600c0,1,aca2eb7d00ea1a7b8ebd4e68314663af,955fee9216a65b617aa5c0531780ce60,2018-05-14 00:14:29,69.9,0.0
483,011c899816ea29773525bd3322dbb6aa,1,53b36df67ebb7c41585e8d54d6772e08,7d13fca15225358621be4086e1eb0964,2018-05-07 05:30:45,99.9,0.0
508,012b3f6ab7776a8ab3443a4ad7bef2e6,1,422879e10f46682990de24d770e7f83d,1f50f920176fa81dab994f9023523100,2018-05-09 21:30:50,53.9,0.0
509,012b3f6ab7776a8ab3443a4ad7bef2e6,2,422879e10f46682990de24d770e7f83d,1f50f920176fa81dab994f9023523100,2018-05-09 21:30:50,53.9,0.0
...,...,...,...,...,...,...,...
111094,fc698f330ec7fb74859071cc6cb29772,1,422879e10f46682990de24d770e7f83d,1f50f920176fa81dab994f9023523100,2018-04-25 02:31:57,53.9,0.0
111497,fd4907109f6bac23f07064af84bec02d,1,7a10781637204d8d10485c71a6108a2e,4869f7a5dfa277a7dca6462dcf3b52b2,2018-04-30 11:31:32,219.0,0.0
111649,fd95e4b85ebbb81853d4a6be3d61432b,1,53b36df67ebb7c41585e8d54d6772e08,4869f7a5dfa277a7dca6462dcf3b52b2,2018-05-04 11:10:31,106.9,0.0
112182,fee19a0dc7358b6962a611cecf6a37b4,1,f1c7f353075ce59d8a6f3cf58f419c9c,37be5a7c751166fbc5f8ccba4119e043,2017-09-07 22:06:31,195.0,0.0


In [12]:
order_items_df[order_items_df['freight_value'] != order_items_df['freight_value'].min()]

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.90,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.90,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.00,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...,...,...
112645,fffc94f6ce00a00581880bf54a75a037,1,4aa6014eceb682077f9dc4bffebc05b0,b8bc237ba3788b23da09c0f1f3a3288c,2018-05-02 04:11:01,299.99,43.41
112646,fffcd46ef2263f404302a634eb57f7eb,1,32e07fd915822b0765e448c4dd74c828,f3c38ab652836d21de61fb8314b69182,2018-07-20 04:31:48,350.00,36.53
112647,fffce4705a9662cd70adb13d4a31832d,1,72a30483855e2eafc67aee5dc2560482,c3cfdc648177fdbbbb35635a37472c53,2017-10-30 17:14:25,99.90,16.95
112648,fffe18544ffabc95dfada21779c9644f,1,9c422a519119dcad7575db5af1ba540e,2b3e4a2a3ea8e01938cabda2a3e5cc79,2017-08-21 00:04:32,55.99,8.72


In [13]:
order_items_df[order_items_df['freight_value'] == order_items_df['freight_value'].min()].shape[0] / order_items_df['freight_value'].shape[0] *100

0.33999112294718153

We could drop the rows with a freight_value of zero, because it wouldn’t ruined the information in our data.

Next, check the product_category_name in products_df, and compare it to product_category_name_df

In [14]:
len(products_df['product_category_name'].unique())

74

In [15]:
len(product_category_name_df['product_category_name'])

71

There're three products different from product_category_name_df

In [16]:
# Filter product_df 
# if product name in products_df != product name in product_category_name_df return True (~) 
mask = ~products_df['product_category_name'].isin(product_category_name_df['product_category_name']) 
filtered_products_name_df = products_df[mask]
filtered_products_name_df['product_category_name'].unique()

array([nan, 'pc_gamer', 'portateis_cozinha_e_preparadores_de_alimentos'],
      dtype=object)

In [17]:
filtered_products_name_df.shape[0] / products_df.shape[0] * 100

1.890686170374192

We could drop it, as it would have no impact on our datasets, given that we would only lose 2% of the data.

### 3. Cleaning Data

Fix the incorrect data types

|         Column                | Data Type                 |  Dataset          |
| ----------------------------- | ------------------------- | ------------------|
| order_purchase_timestamp      | **str** -> **datetime**   | orders_df         |
| order_approved_at             | **str** -> **datetime**   | orders_df         |
| order_delivered_carrier_date  | **str** -> **datetime**   | orders_df         |
| order_delivered_customer_date | **str** -> **datetime**   | orders_df         |
| order_estimated_delivery_date | **str** -> **datetime**   | orders_df         |
| review_creation_date          | **str** -> **datetime**   | order_reviews_df  |
| review_answer_timestamp       | **str** -> **datetime**   | order_reviews_df  |
| shipping_limit_date           | **str** -> **datetime**   | order_items_df    |
| order_item_id                 | **int** -> **str**        | order_items_df    |
| geolocation_zip_code_prefix   | **int** -> **str**        | geo_df	        |
| customer_zip_code_prefix      | **int** -> **str**        | customers_df      |
| seller_zip_code_prefix        | **int** -> **str**        | sellers_df        |

In [18]:
# Create a dictionary that maps column names to their new data types
column_dtype_mapping = {
    'orders_df': {
        'order_purchase_timestamp': 'datetime64[ns]',
        'order_approved_at': 'datetime64[ns]',
        'order_delivered_carrier_date': 'datetime64[ns]',
        'order_delivered_customer_date': 'datetime64[ns]',
        'order_estimated_delivery_date': 'datetime64[ns]'
    },
    'order_reviews_df': {
        'review_creation_date': 'datetime64[ns]',
        'review_answer_timestamp': 'datetime64[ns]'
    },
    'order_items_df': {
        'shipping_limit_date': 'datetime64[ns]',
        'order_item_id': 'str'
    },
    'geo_df': {
        'geolocation_zip_code_prefix': 'str'
    },
    'customers_df': {
        'customer_zip_code_prefix': 'str'
    },
    'sellers_df': {
        'seller_zip_code_prefix': 'str'
    }
}

# Use a loop to apply the changes
for df_name, columns in column_dtype_mapping.items():
    for column, dtype in columns.items():
        globals()[df_name][column] = globals()[df_name][column].astype(dtype)


In [19]:
describe_data(dataframe_names, dataframes)

Unnamed: 0,data_name,column_name,n_null,n_null_%,dtype,count,mean,median,min,max
0,sellers_df,seller_id,0,0%,object,3095,,,,
1,sellers_df,seller_zip_code_prefix,0,0%,object,3095,,,,
2,sellers_df,seller_city,0,0%,object,3095,,,,
3,sellers_df,seller_state,0,0%,object,3095,,,,
4,products_df,product_id,0,0%,object,32951,,,,
5,products_df,product_category_name,610,2%,object,32341,,,,
6,products_df,product_name_lenght,610,2%,float64,32341,48.48,,5.0,76.0
7,products_df,product_description_lenght,610,2%,float64,32341,771.5,,4.0,3992.0
8,products_df,product_photos_qty,610,2%,float64,32341,2.19,,1.0,20.0
9,products_df,product_weight_g,2,0%,float64,32949,2276.47,,0.0,40425.0


In [20]:
# Filter order_pay_df that contain payment_type not_defined and payment_value of zero.
order_pay_not_defined = (order_pay_df['payment_type'] == 'not_defined') & (order_pay_df['payment_value'] == order_pay_df['payment_value'].min())

In [21]:
# Replace not_defined to voucher
order_pay_df.loc[order_pay_not_defined, :] = order_pay_df.loc[order_pay_not_defined, :].replace('not_defined', 'voucher')

In [23]:
order_pay_df[order_pay_df['payment_value'] == order_pay_df['payment_value'].min()]

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
19922,8bcbe01d44d147f901cd3192671144db,4,voucher,1,0.0
36822,fa65dad1b0e818e3ccc5cb0e39231352,14,voucher,1,0.0
43744,6ccb433e00daae1283ccc956189c82ae,4,voucher,1,0.0
51280,4637ca194b6387e2d538dc89b124b0ee,1,voucher,1,0.0
57411,00b1cb0320190ca0daa2c88b35206009,1,voucher,1,0.0
62674,45ed6e85398a87c253db47c2d9f48216,3,voucher,1,0.0
77885,fa65dad1b0e818e3ccc5cb0e39231352,13,voucher,1,0.0
94427,c8c528189310eaa44a745b8d9d26908b,1,voucher,1,0.0
100766,b23878b3e8eb4d25a158f57d96331b18,4,voucher,1,0.0


## 3. Explonatory Data Analysis

## 4. Data Visualization & Explanatory Data Analysis

## 5. Conclusion