# Data Analysis Project: E-Commerce Public Dataset
- **Name:** Rafael Simson Riston
- **Email:** rafaelsimsonriston@gmail.com
- **Dicoding ID:** rafaelsimsonr

## About Dataset

Brazilian E-Commerce Public Dataset by Olist

Welcome! This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information on 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

This is real commercial data that has been anonymised; references to the companies and partners in the review text have been replaced with the names of Game of Thrones great houses.

Context
This dataset was generously provided by Olist, the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners. See more on our website: www.olist.com

After a customer purchases a product from Olist Store, a seller is notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer receives a satisfaction survey by email in which they can rate the purchase experience and write comments.

Attention
1. An order might have multiple items.
2. Each item might be fulfilled by a distinct seller.
3. All text identifying stores and partners was replaced by the names of Game of Thrones great houses.
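
The first two points can be checked directly on the order items table. The sketch below uses a small hand-made table rather than the real file: the column names `order_id`, `order_item_id`, and `seller_id` are assumed to match `order_items_dataset.csv`, and every value is invented for illustration.

```python
import pandas as pd

# Toy stand-in for order_items_dataset.csv (all values are invented)
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3", "o3", "o3"],
    "order_item_id": [1, 2, 1, 1, 2, 3],
    "seller_id": ["s1", "s2", "s1", "s3", "s3", "s4"],
})

# Number of items and number of distinct sellers per order
per_order = toy_items.groupby("order_id").agg(
    n_items=("order_item_id", "count"),
    n_sellers=("seller_id", "nunique"),
)
print(per_order)
```

In this toy table, order `o1` has two items from two different sellers, while `o2` has a single item from a single seller.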

## Defining the Questions

1. 


## 1. Import Packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_colwidth', None)

## 2. Data Wrangling

### 1. Gathering Data

In [None]:
customers_df = pd.read_csv(os.path.join(os.getcwd(), 'data/customers_dataset.csv'))
geo_df = pd.read_csv(os.path.join(os.getcwd(), 'data/geolocation_dataset.csv'))
order_items_df = pd.read_csv(os.path.join(os.getcwd(), 'data/order_items_dataset.csv'))
order_pay_df = pd.read_csv(os.path.join(os.getcwd(), 'data/order_payments_dataset.csv'))
order_reviews_df = pd.read_csv(os.path.join(os.getcwd(), 'data/order_reviews_dataset.csv'))
orders_df = pd.read_csv(os.path.join(os.getcwd(), 'data/orders_dataset.csv'))
product_category_name_df = pd.read_csv(os.path.join(os.getcwd(), 'data/product_category_name_translation.csv'))
products_df = pd.read_csv(os.path.join(os.getcwd(), 'data/products_dataset.csv'))
sellers_df = pd.read_csv(os.path.join(os.getcwd(), 'data/sellers_dataset.csv'))

### 2. Assessing Data

#### a. Find missing values and duplicate values

In [23]:
# Define names of dataframes
dataframe_names = ['sellers_df', 'products_df', 'product_category_name_df', 
                   'orders_df', 'order_reviews_df', 'order_pay_df', 
                   'order_items_df', 'geo_df', 'customers_df']

# List containing all dataframes
dataframes = [sellers_df, products_df, product_category_name_df, orders_df, order_reviews_df, order_pay_df, order_items_df, geo_df, customers_df]

# Create a dictionary to store check results
datas_check = {
    'data_name': [], 
    'n_rows': [], 
    'n_cols': [], 
    'sum_null': [], 
    'sum_col_null':[], 
    'name_col_null':[], 
    'sum_duplicated':[], 
    'sum_col_duplicate':[],
    'name_col_duplicate':[]
}

# Loop through dataframes and perform checks
for data_name, data in zip(dataframe_names, dataframes):
    datas_check['data_name'].append(data_name)
    datas_check['n_rows'].append(data.shape[0])
    datas_check['n_cols'].append(data.shape[1])
    datas_check['sum_null'].append(data.isna().sum().sum())
    datas_check['sum_duplicated'].append(data.duplicated().sum())
    
    # Initialize lists for storing column-wise null and duplicate information
    sum_col_null = []
    name_col_null = []
    sum_col_duplicate = []
    name_col_duplicate = []
    
    # Loop through columns of each dataframe
    for col in data.columns:
        # Count null values per column; the duplicate count is row-level
        # (whole dataframe), so it is the same for every column
        sum_col_null.append(data[col].isna().sum())
        sum_col_duplicate.append(data.duplicated().sum())
    
    # Count columns with null and duplicate values
    sum_col_n = sum(n != 0 for n in sum_col_null)
    sum_col_d = sum(n != 0 for n in sum_col_duplicate)
    
    # Append column names with null and duplicate values
    for idx, (n_null, n_duplicate) in enumerate(zip(sum_col_null, sum_col_duplicate)):
        if n_null != 0:
            name_col_null.append(data.columns[idx])
        if n_duplicate != 0:
            name_col_duplicate.append(data.columns[idx])

    datas_check['sum_col_null'].append(sum_col_n)
    datas_check['name_col_null'].append(name_col_null)
    datas_check['sum_col_duplicate'].append(sum_col_d)
    datas_check['name_col_duplicate'].append(name_col_duplicate)

# Convert dictionary to dataframe
pd.DataFrame(datas_check)

|   | data_name | n_rows | n_cols | sum_null | sum_col_null | name_col_null | sum_duplicated | sum_col_duplicate | name_col_duplicate |
| - | --------- | ------ | ------ | -------- | ------------ | ------------- | -------------- | ----------------- | ------------------ |
| 0 | sellers_df | 3095 | 4 | 0 | 0 | [] | 0 | 0 | [] |
| 1 | products_df | 32951 | 9 | 2448 | 8 | [product_category_name, product_name_lenght, product_description_lenght, product_photos_qty, product_weight_g, product_length_cm, product_height_cm, product_width_cm] | 0 | 0 | [] |
| 2 | product_category_name_df | 71 | 2 | 0 | 0 | [] | 0 | 0 | [] |
| 3 | orders_df | 99441 | 8 | 4908 | 3 | [order_approved_at, order_delivered_carrier_date, order_delivered_customer_date] | 0 | 0 | [] |
| 4 | order_reviews_df | 99224 | 7 | 145903 | 2 | [review_comment_title, review_comment_message] | 0 | 0 | [] |
| 5 | order_pay_df | 103886 | 5 | 0 | 0 | [] | 0 | 0 | [] |
| 6 | order_items_df | 112650 | 7 | 0 | 0 | [] | 0 | 0 | [] |
| 7 | geo_df | 1000163 | 5 | 0 | 0 | [] | 261831 | 5 | [geolocation_zip_code_prefix, geolocation_lat, geolocation_lng, geolocation_city, geolocation_state] |
| 8 | customers_df | 99441 | 5 | 0 | 0 | [] | 0 | 0 | [] |


The products, orders, and reviews datasets contain many missing values (2,448, 4,908, and 145,903 null cells, respectively). Duplicated rows appear only in the geolocation dataset (261,831 rows), which is expected because many entries share the same zip code prefix, city, and state.
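
To look at such duplicates more closely, a minimal sketch on a toy geolocation table is shown below. The column names follow the geolocation dataset listed in the check above, but the rows are invented, and whether duplicates should actually be dropped is left to the cleaning step.

```python
import pandas as pd

# Toy stand-in for the geolocation dataset (all values are invented)
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046, 1046],
    "geolocation_lat": [-23.54, -23.54, -23.55, -23.56],
    "geolocation_lng": [-46.63, -46.63, -46.64, -46.64],
    "geolocation_city": ["sao paulo"] * 4,
    "geolocation_state": ["SP"] * 4,
})

# Rows that are identical to an earlier row in every column
n_dup_rows = toy_geo.duplicated().sum()

# Show every member of each duplicate group, not only the repeats
dup_groups = toy_geo[toy_geo.duplicated(keep=False)]

# Null count per column, the same quantity the check above aggregates
null_per_col = toy_geo.isna().sum()

print(n_dup_rows)
print(dup_groups)
```

Only the first two rows match in every column, so they form the single duplicate group; the last two rows share a zip code prefix, city, and state but differ in latitude, so they are not counted as duplicates.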

#### b. Check Data Types and Descriptive Statistics

In [22]:
# Initialize a dict to store the per-column summary
datas_describe = {
    'data_name':[],
    'column_name': [],
    'dtype': [],
    'count':[],
    'mean':[],
    'median':[],
    'min':[],
    'max':[]
}

# Iterate over every column of every dataframe
for data_name, data in zip(dataframe_names, dataframes):
    for col in data.columns:
        series = data[col]
        # Summary statistics are only computed for numeric columns
        is_numeric = pd.api.types.is_numeric_dtype(series)
        datas_describe['data_name'].append(data_name)
        datas_describe['column_name'].append(col)
        datas_describe['dtype'].append(str(series.dtype))  # Store dtype as a string
        datas_describe['count'].append(series.count())  # Number of non-null values
        datas_describe['mean'].append(f"{series.mean():.2f}" if is_numeric else "")
        datas_describe['median'].append(f"{series.median():.2f}" if is_numeric else "")
        datas_describe['min'].append(f"{series.min():.2f}" if is_numeric else "")
        datas_describe['max'].append(f"{series.max():.2f}" if is_numeric else "")

# Convert the dict to a DataFrame for readability
pd.DataFrame(datas_describe)

|   | data_name | column_name | dtype | count | mean | median | min | max |
| - | --------- | ----------- | ----- | ----- | ---- | ------ | --- | --- |
| 0 | sellers_df | seller_id | object | 3095 | | | | |
| 1 | sellers_df | seller_zip_code_prefix | int64 | 3095 | 32291.06 | 14940.0 | 1001.0 | 99730.0 |
| 2 | sellers_df | seller_city | object | 3095 | | | | |
| 3 | sellers_df | seller_state | object | 3095 | | | | |
| 4 | products_df | product_id | object | 32951 | | | | |
| 5 | products_df | product_category_name | object | 32341 | | | | |
| 6 | products_df | product_name_lenght | float64 | 32341 | 48.48 | 51.0 | 5.0 | 76.0 |
| 7 | products_df | product_description_lenght | float64 | 32341 | 771.5 | 595.0 | 4.0 | 3992.0 |
| 8 | products_df | product_photos_qty | float64 | 32341 | 2.19 | 1.0 | 1.0 | 20.0 |
| 9 | products_df | product_weight_g | float64 | 32949 | 2276.47 | 700.0 | 0.0 | 40425.0 |


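Several columns have an incorrect data type. The table below lists each column with its current and intended type.

| Column                        | Data Type (current -> target) |
| ----------------------------- | ----------------------------- |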
| order_purchase_timestamp      | **str** -> **datetime**   |
| order_approved_at             | **str** -> **datetime**   |
| order_delivered_carrier_date  | **str** -> **datetime**   |
| order_delivered_customer_date | **str** -> **datetime**   |
| order_estimated_delivery_date | **str** -> **datetime**   |
| review_creation_date          | **str** -> **datetime**   |
| review_answer_timestamp       | **str** -> **datetime**   |
| shipping_limit_date           | **str** -> **datetime**   |
| order_item_id                 | **int** -> **str**        |
| geolocation_zip_code_prefix   | **int** -> **str**        |
| customer_zip_code_prefix      | **int** -> **str**        |
| zip_code_prefix               | **int** -> **str**        |
| seller_zip_code_prefix        | **int** -> **str**        |


### 3. Cleaning Data

## 3. Exploratory Data Analysis

## 4. Data Visualization & Explanatory Data Analysis

## 5. Conclusion