### Brazilian E-Commerce Dataset
[Kaggle](https://www.kaggle.com/olistbr/brazilian-ecommerce)

**Attention**
- An order might have multiple items.
- Each item might be fulfilled by a distinct seller.
- All text identifying stores and partners was replaced by the names of Game of Thrones great houses.
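
The first two points can be checked directly on the items table. A minimal sketch on made-up rows, assuming only the column names `order_id`, `order_item_id` and `seller_id` described in the column table; the values and the `toy_items` name are hypothetical:

```python
import pandas as pd

# Hypothetical rows shaped like the order items table (not real data).
toy_items = pd.DataFrame({
    'order_id':      ['o1', 'o1', 'o2'],
    'order_item_id': [1, 2, 1],
    'seller_id':     ['s1', 's2', 's1'],
})

# Count items and distinct sellers for each order.
per_order = toy_items.groupby('order_id').agg(
    n_items=('order_item_id', 'count'),
    n_sellers=('seller_id', 'nunique'),
)
print(per_order)
```

Here order `o1` has two items from two different sellers, while `o2` has a single item.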

**Schema**
<img src="ds-schema.png" style="width: 600px;">

### Description of columns

| Column  | Description  |
| --- | --- |
| order_id | unique identifier of the order. |
| customer_id | key to the customer dataset. Each order has a unique customer_id. |
| order_status | Reference to the order status (delivered, shipped, etc). |
| order_purchase_timestamp | Shows the purchase timestamp. |
| order_approved_at | Shows the payment approval timestamp. |
| order_delivered_carrier_date | Shows the order posting timestamp, i.e. when it was handed over to the logistics partner. |
| order_delivered_customer_date | Shows the actual order delivery date to the customer. |
| order_estimated_delivery_date | Shows the estimated delivery date given to the customer at the time of purchase. |
| --- | --- |
| payment_sequential | a customer may pay for an order with more than one payment method. In that case, a sequence is created to accommodate all payments. |
| payment_type | method of payment chosen by the customer. |
| payment_installments | number of installments chosen by the customer. |
| payment_value | transaction value. |
| --- | --- |
| customer_unique_id | unique identifier of a customer. |
| customer_zip_code_prefix | first five digits of customer zip code |
| customer_city | customer city name |
| customer_state | customer state |
| --- | --- |
| review_id | unique review identifier |
| review_score | Score ranging from 1 to 5 given by the customer on a satisfaction survey. |
| review_comment_title | Comment title from the review left by the customer, in Portuguese. |
| review_comment_message | Comment message from the review left by the customer, in Portuguese. |
| review_creation_date | Shows the date on which the satisfaction survey was sent to the customer. |
| review_answer_timestamp | Shows satisfaction survey answer timestamp. |
| --- | --- |
| order_item_id | sequential number identifying each item within the same order. |
| product_id | product unique identifier |
| seller_id | seller unique identifier |
| shipping_limit_date | Shows the seller's deadline for handing the order over to the logistics partner. |
| price | item price |
| freight_value | item freight value (if an order has more than one item, the freight value is split between items) |
| --- | --- |
| product_category_name | root category of product, in Portuguese. |
| product_name_lenght | number of characters extracted from the product name. |
| product_description_lenght | number of characters extracted from the product description. |
| product_photos_qty | number of published product photos |
| product_weight_g | product weight measured in grams. |
| product_length_cm | product length measured in centimeters. |
| product_height_cm | product height measured in centimeters. |
| product_width_cm | product width measured in centimeters. |
| --- | --- |
| seller_zip_code_prefix | first five digits of seller zip code |
| seller_city | seller city name |
| seller_state | seller state |
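
The timestamp columns listed above are read as plain strings unless they are parsed explicitly. A minimal sketch of converting two of them with `pd.to_datetime` and comparing actual against estimated delivery; the rows are made up, and the `toy_orders` frame and the `late` column are hypothetical names introduced only for illustration:

```python
import pandas as pd

# Hypothetical rows using two of the order timestamp columns (not real data).
toy_orders = pd.DataFrame({
    'order_id': ['o1', 'o2'],
    'order_delivered_customer_date': ['2018-01-10 14:00:00', '2018-02-05 09:30:00'],
    'order_estimated_delivery_date': ['2018-01-12 00:00:00', '2018-02-01 00:00:00'],
})

# Convert the string columns to datetime64 so they can be compared.
date_cols = ['order_delivered_customer_date', 'order_estimated_delivery_date']
for col in date_cols:
    toy_orders[col] = pd.to_datetime(toy_orders[col])

# Flag orders delivered after the estimated date ('late' is an illustrative column).
toy_orders['late'] = (
    toy_orders['order_delivered_customer_date']
    > toy_orders['order_estimated_delivery_date']
)
print(toy_orders[['order_id', 'late']])
```

The same conversion can be done at load time by passing the column names to the `parse_dates` argument of `pd.read_csv`.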

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [22]:
data_path = 'C:/Users/Zaca/Documents/Datasets/brazilian-ecommerce/olist_'

In [23]:
# opening datasets individually
orders = pd.read_csv(data_path + 'orders_dataset.csv')
items = pd.read_csv(data_path + 'order_items_dataset.csv')
products = pd.read_csv(data_path + 'products_dataset.csv')
payments = pd.read_csv(data_path + 'order_payments_dataset.csv')
customers = pd.read_csv(data_path + 'customers_dataset.csv')
sellers = pd.read_csv(data_path + 'sellers_dataset.csv')
reviews = pd.read_csv(data_path + 'order_reviews_dataset.csv')

In [25]:
# Merge them all together based on the schema.
# A single wide table may not be optimal for very large datasets, but it is easier to visualize.
orders_payments_customers_reviews = orders.merge(payments, on='order_id', how='inner').merge(customers, on='customer_id', how='inner').merge(reviews, on='order_id', how='inner')
items_products_sellers = items.merge(products, on='product_id', how='inner').merge(sellers, on='seller_id', how='inner')
ds = orders_payments_customers_reviews.merge(items_products_sellers, on='order_id', how='inner')

In [28]:
ds.shape

(118315, 39)
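
Because payments and items are both one-to-many with respect to orders, the merged table holds one row per combination of payment and item, not one row per order. A minimal sketch of that fan-out on hypothetical miniature tables (the `toy_*` names and values are made up; `validate` is an optional argument of `DataFrame.merge` that raises if the key relationship is not the one expected):

```python
import pandas as pd

# Hypothetical miniature tables (not real data): one order, two payments, three items.
toy_orders = pd.DataFrame({'order_id': ['o1']})
toy_payments = pd.DataFrame({'order_id': ['o1', 'o1'], 'payment_sequential': [1, 2]})
toy_items = pd.DataFrame({'order_id': ['o1', 'o1', 'o1'], 'order_item_id': [1, 2, 3]})

merged = (
    toy_orders
    # validate checks that order_id is unique on the left side of this merge.
    .merge(toy_payments, on='order_id', how='inner', validate='one_to_many')
    .merge(toy_items, on='order_id', how='inner')
)

print(merged.shape)                  # (6, 3): 2 payments x 3 items for a single order
print(merged['order_id'].nunique())  # 1
```

On a table built this way, counting orders calls for `nunique()` on `order_id` rather than the row count, and summing a per-payment or per-item column directly would count some values more than once.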