<h1 style="text-align:center;">Final Project - Omni Oracle</h1>
<hr>

<table style="margin-left: auto; margin-right: auto;">
  <tr><th>Team members</th></tr>
  <tr><td>Francis Tan Hong Xin</td></tr>
  <tr><td>Gan Lai Soon</td></tr>
  <tr><td>Khadijah Anhardeen</td></tr>
  <tr><td>Muhammad Nur Syafaat Bin Mohamed Saat</td></tr>
</table>

<hr>
<h2 style="text-align: center">Table of contents</h2>
<hr>

<ul>
  <li>
    <a href="#section_1">Section 1: Package Installation and Imports</a>
    <ul>
      <li><a href="#section_1a">Section 1-A: Install required packages and modules</a></li>
      <li><a href="#section_1b">Section 1-B: Importing libraries and modules</a></li>
      <li><a href="#section_1c">Section 1-C: Initialize logging procedure</a></li>
    </ul>
  </li>
  <li>
    <a href="#section_2">Section 2: Data Extraction</a>
    <ul>
      <li><a href="#section_2a">Section 2-A: Checking if data files exist in project directory</a></li>
      <li><a href="#section_2b">Section 2-B: Extract from Kaggle via their API if data files not found</a></li>
      <li><a href="#section_2c">Section 2-C: Write to the logging process</a></li>
    </ul>
  </li>
  <li>
    <a href="#section_3">Section 3: Data Transformation</a>
    <ul>
      <li><a href="#section_3a">Section 3-A: The <code>geolocations</code> data</a></li>
      <li><a href="#section_3b">Section 3-B: The <code>customers</code> data</a></li>
      <li><a href="#section_3c">Section 3-C: The <code>sellers</code> data</a></li>
      <li><a href="#section_3d">Section 3-D: The <code>orders</code> data</a></li>
      <li><a href="#section_3e">Section 3-E: The <code>product_category_name_translation</code> data</a></li>
      <li><a href="#section_3f">Section 3-F: The <code>products</code> data</a></li>
      <li><a href="#section_3g">Section 3-G: The <code>order_reviews</code> data</a></li>
      <li><a href="#section_3h">Section 3-H: The <code>order_items</code> data</a></li>
      <li><a href="#section_3i">Section 3-I: The <code>order_payments</code> data</a></li>
      <li><a href="#section_3j">Section 3-J: Write to the logging process</a></li>
    </ul>
  </li>
  <li>
    <a href="#section_4">Section 4: Data Loading</a>
    <ul>
      <li><a href="#section_4a">Section 4-A: Establish connection to pgAdmin and creating the PostgreSQL database</a></li>
      <li><a href="#section_4b">Section 4-B: Create the tables (and establish relationships via foreign keys) inside the database</a></li>
      <li><a href="#section_4c">Section 4-C: Transfer the DataFrames into their respective tables in the database</a></li>
      <li><a href="#section_4d">Section 4-D: Close connection to the database</a></li>
      <li><a href="#section_4e">Section 4-E: Write to the logging process</a></li>
    </ul>
  </li>
</ul>

<section id="section_1">
    <hr>
    <h2 style="text-align: center">Section 1: Package Installation and Imports</h2>
    <hr>
</section>

<section id="section_1a">
    <h3 style="text-align: center">Section 1-A: Install required packages and modules</h3>
</section>

In [14]:
!pip install unidecode
!pip install pyspellchecker
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.6.17-py3-none-any.whl size=105795 sha256=97f9b376b1b79c0c2f3ad06bcb3a2f04da4388b5784d1cdbf530c73d7a53c17a
  Stored in directory: c:\users\asus\appdata\local\pip\cache\whee

<section id="section_1b">
    <h3 style="text-align: center">Section 1-B: Importing libraries and modules</h3>
</section>

In [1]:
# for data extraction using Kaggle's API
import kaggle

# for OS related methods
import os

# for DataFrames, Series, read_csv(), to_csv(), etc.
import pandas as pd

# for math, data types
import numpy as np

# for dates, times, etc.
import datetime as dt

# for recording logs
import logging

# for foreign characters (Portuguese) and converting them
from unidecode import unidecode

# for checking the spelling of English words
from spellchecker import SpellChecker

# for connecting to PostgreSQL/pgAdmin and inserting data to database
import sqlalchemy as db
from sqlalchemy_utils import create_database
from sqlalchemy_utils import database_exists
from sqlalchemy.types import Integer, String, Text, ARRAY, VARCHAR, Date, Numeric, BigInteger

<section id="section_1c">
    <h3 style="text-align: center">Section 1-C: Initialize logging procedure</h3>
</section>

In [2]:
# Configure logging
logging.basicConfig(level=logging.INFO, 
                    filename='etl_pipeline.log', 
                    format='%(asctime)s:%(levelname)s:%(message)s')

logging.info("ETL process started.")
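To see what a line in `etl_pipeline.log` looks like under this format string, here is a minimal sketch that sends one record to an in-memory stream instead of the file. The logger name `etl_format_demo` is hypothetical and exists only for this illustration.

```python
import io
import logging

# A throwaway logger that writes to memory, so the real log file is untouched.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s:%(message)s'))

demo_logger = logging.getLogger('etl_format_demo')  # hypothetical name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo record out of the root logger
demo_logger.addHandler(handler)

demo_logger.info("ETL process started.")
demo_logger.debug("This record is below INFO and is dropped.")

line = stream.getvalue().strip()
print(line)
```

The timestamp produced by `%(asctime)s` contains colons itself, so a log line cannot be split into its three fields by simply splitting on `:`.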

<section id="section_2">
    <hr>
    <h2 style="text-align: center">Section 2: Data Extraction</h2>
    <hr>
</section>

<section id="section_2a">
    <h3 style="text-align: center">Section 2-A: Checking if data files exist in project directory</h3>
</section>

In [3]:
geolocations_data_file_path = '../resources/olist_geolocation_dataset.csv'
customers_data_file_path = '../resources/olist_customers_dataset.csv'
sellers_data_file_path = '../resources/olist_sellers_dataset.csv'
orders_data_file_path = '../resources/olist_orders_dataset.csv'
product_categories_data_file_path = '../resources/product_category_name_translation.csv'
products_data_file_path = '../resources/olist_products_dataset.csv'
order_reviews_data_file_path = '../resources/olist_order_reviews_dataset.csv'
order_items_data_file_path = '../resources/olist_order_items_dataset.csv'
order_payments_data_file_path = '../resources/olist_order_payments_dataset.csv'

# store all the paths in a list
files_paths = [
    geolocations_data_file_path,
    customers_data_file_path,
    sellers_data_file_path,
    orders_data_file_path,
    product_categories_data_file_path,
    products_data_file_path,
    order_reviews_data_file_path,
    order_items_data_file_path,
    order_payments_data_file_path
]

need_extraction = False
for path in files_paths:
    if not os.path.exists(path):
        need_extraction = True
        break
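The loop above only records whether something is missing. If it is useful to know which files are missing, for example to write their names to the log, the same check can be sketched as below. The helper name `find_missing` and the example file names are assumptions made for this illustration.

```python
from pathlib import Path

def find_missing(paths):
    """Return the paths that do not exist on disk, in input order."""
    return [p for p in paths if not Path(p).exists()]

# Hypothetical file names, used only to illustrate the call.
example_paths = ['definitely_missing_a.csv', 'definitely_missing_b.csv']

missing = find_missing(example_paths)
need_extraction = bool(missing)  # True as soon as one file is absent

print(missing)
print(need_extraction)
```

With the real `files_paths` list, `need_extraction` would take the same value as in the cell above.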

<section id="section_2b">
    <h3 style="text-align: center">Section 2-B: Extract from Kaggle via their API if data files not found</h3>
</section>

In [4]:
if need_extraction:
    # authenticate to an instance of the Kaggle API
    kaggle.api.authenticate()
    # download the dataset from Kaggle
    kaggle.api.dataset_download_files('olistbr/brazilian-ecommerce', path='../resources/', unzip=True)
    logging.info("Downloaded dataset from Kaggle.")
else:
    msg = "Data files already exist."
    print(msg)
    logging.info(msg)

Data files already exist.


<section id="section_2c">
    <h3 style="text-align: center">Section 2-C: Write to the logging process</h3>
</section>

In [5]:
logging.info("Extraction process completed.")

<section id="section_3">
    <hr>
    <h2 style="text-align: center">Section 3: Data Transformation</h2>
    <hr>
</section>

<section id="section_3a">
    <h3 style="text-align: center">Section 3-A: The <code>geolocations</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>geolocations_df</code></h4>
</section>

In [13]:
# Reading geolocation data
geolocations_df = pd.read_csv(geolocations_data_file_path,
                              header=0,
                              na_values='')

logging.info("Geolocation data loaded successfully.")

geolocations_na = geolocations_df.isna().any().to_dict()
if True in geolocations_na.values():
    na_columns = [col for col in geolocations_na.keys() if geolocations_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [14]:
geolocations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB


In [15]:
geolocations_df.sort_values(by=['geolocation_zip_code_prefix', 'geolocation_city', 'geolocation_state'], ascending=[True, False, False]).head(10)

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
575,1001,-23.549779,-46.633957,são paulo,SP
608,1001,-23.550263,-46.634196,são paulo,SP
1351,1001,-23.549951,-46.634027,são paulo,SP
99,1001,-23.549292,-46.633559,sao paulo,SP
206,1001,-23.550498,-46.634338,sao paulo,SP
235,1001,-23.550642,-46.63441,sao paulo,SP
299,1001,-23.549698,-46.633909,sao paulo,SP
326,1001,-23.551427,-46.634074,sao paulo,SP
429,1001,-23.550498,-46.634338,sao paulo,SP
519,1001,-23.551337,-46.634027,sao paulo,SP


In [16]:
before_transform_shape = geolocations_df.shape
before_transform_shape

(1000163, 5)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, several things need to be fixed and some improvements can be made:

1. The columns can be renamed into nicer names (`zip_code_prefix`, `latitude`, `longitude`, `city`, and `state`).

2. The column data types need to be set explicitly (`int32` for `zip_code_prefix`, `float64` for the coordinates, and `string` for `city` and `state`).

3. The strings in the `city` column are spelled inconsistently: some use accented Portuguese characters such as 'ã', 'ç', and 'é', while others do not.

4. There are duplicate `zip_code_prefix` values, which is a problem because this column is supposed to be the primary key of the table.

5. Merging the duplicate `zip_code_prefix` values into one row raises another problem: a single `zip_code_prefix` can have several different `city` and `state` values.
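Problems 4 and 5 can be quantified before any cleaning is done. The sketch below uses a small toy DataFrame modelled on the output above (the rows are illustrative, not taken from the real file) and the renamed column names from item 1.

```python
import pandas as pd

# Toy rows modelled on the output above; not the real data.
toy = pd.DataFrame({
    'zip_code_prefix': [1001, 1001, 1001, 1002],
    'city': ['são paulo', 'sao paulo', 'sao paulo', 'sao paulo'],
    'state': ['SP', 'SP', 'SP', 'SP'],
})

# Problem 4: how many rows share their prefix with at least one other row?
duplicated_rows = int(toy['zip_code_prefix'].duplicated(keep=False).sum())

# Problem 5: which prefixes carry more than one distinct city spelling?
cities_per_prefix = toy.groupby('zip_code_prefix')['city'].nunique()
conflicting = cities_per_prefix[cities_per_prefix > 1].index.tolist()

print(duplicated_rows)
print(conflicting)
```

Here prefix 1001 is flagged only because of the 'são'/'sao' spelling difference, which is why the accents are normalized before grouping.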

#### Rename the column names and change the data types of the columns.

In [17]:
geolocations_col_names = {
    'geolocation_zip_code_prefix': 'zip_code_prefix',
    'geolocation_lat': 'latitude',
    'geolocation_lng': 'longitude',
    'geolocation_city': 'city',
    'geolocation_state': 'state'
}

geolocations_data_types = {
    "zip_code_prefix": np.int32,
    "latitude": np.float64,
    "longitude": np.float64,
    "city": "string",
    "state": "string"
}

geolocations_df.rename(columns=geolocations_col_names, inplace=True)
geolocations_df = geolocations_df.astype(geolocations_data_types)

geolocations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   zip_code_prefix  1000163 non-null  int32  
 1   latitude         1000163 non-null  float64
 2   longitude        1000163 non-null  float64
 3   city             1000163 non-null  string 
 4   state            1000163 non-null  string 
dtypes: float64(2), int32(1), string(2)
memory usage: 34.3 MB
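The drop in reported memory usage is roughly what the `int64` to `int32` change alone would predict. The back-of-the-envelope check below assumes that `info()` reports sizes in units of 1024 × 1024 bytes and that a prefix never has more than five digits; both are assumptions of this sketch, not something the notebook verifies.

```python
rows = 1_000_163              # row count reported by info() above
bytes_saved_per_row = 8 - 4   # int64 uses 8 bytes, int32 uses 4

saved_bytes = rows * bytes_saved_per_row
saved_mib = saved_bytes / 2**20

# Range check: the largest signed 32-bit integer vs. the largest prefix.
int32_max = 2**31 - 1
largest_prefix = 99_999       # assumption: at most five digits

print(f"{saved_mib:.1f}")
print(largest_prefix <= int32_max)
```

The result, about 3.8, is close to the difference between the two `memory usage` figures shown above (38.2 and 34.3) once rounding is taken into account.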


#### Change the data in `city` to lowercase and the data in `state` to uppercase.

This makes it easier for us to handle the conversion and grouping process later on.

In [18]:
geolocations_df['city'] = geolocations_df['city'].str.lower()
geolocations_df['state'] = geolocations_df['state'].str.upper()

geolocations_df.head(5)

Unnamed: 0,zip_code_prefix,latitude,longitude,city,state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


#### Normalize the data in `city` to plain ASCII, replacing accented characters like 'ã', 'ç', 'é', etc.

This makes it easier to group the data by `zip_code_prefix`.

We will use the **unidecode** library to convert the accented characters to their closest ASCII counterparts. We store the converted `city` data in another column called `city_normalized`.
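To make the idea concrete without installing anything, the sketch below strips accents using only the standard library's `unicodedata` module. This is a stand-in technique, not what `unidecode` does internally, and the helper name `strip_accents` is ours.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('são paulo'))
print(strip_accents('ãçé'))
```

For the accented letters listed above, this gives the same result as `unidecode`. The two differ for characters that have no decomposition: `unidecode` transliterates 'ß' to 'ss', for example, while this helper leaves it unchanged.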

In [20]:
geolocations_df['city_normalized'] = geolocations_df['city'].apply(unidecode)

geolocations_df.sort_values(by=['zip_code_prefix', 'city', 'state'], ascending=[True, False, False]).head(10)

Unnamed: 0,zip_code_prefix,latitude,longitude,city,state,city_normalized
575,1001,-23.549779,-46.633957,são paulo,SP,sao paulo
608,1001,-23.550263,-46.634196,são paulo,SP,sao paulo
1351,1001,-23.549951,-46.634027,são paulo,SP,sao paulo
99,1001,-23.549292,-46.633559,sao paulo,SP,sao paulo
206,1001,-23.550498,-46.634338,sao paulo,SP,sao paulo
235,1001,-23.550642,-46.63441,sao paulo,SP,sao paulo
299,1001,-23.549698,-46.633909,sao paulo,SP,sao paulo
326,1001,-23.551427,-46.634074,sao paulo,SP,sao paulo
429,1001,-23.550498,-46.634338,sao paulo,SP,sao paulo
519,1001,-23.551337,-46.634027,sao paulo,SP,sao paulo


#### Group the data by the `zip_code_prefix` and perform aggregations on the other columns.

To get a single row per unique `zip_code_prefix` in `geolocations_df`, we group the data by `zip_code_prefix` and apply aggregate functions to the other columns.

For the `latitude` and `longitude` columns, we will take their average values.

For `city_normalized` and `state`, we take the mode, that is, the value that occurs most often within each group. We define the helper function below to wrap `mode()` for these two columns.

We will assign the aggregated data to a temporary DataFrame called `aggregated_geo_df`.

In [21]:
def get_mode(series):
    mode = series.mode()
    if not mode.empty:
        return mode.iloc[0]
    else:
        return None
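One detail of this helper is worth spelling out: when two values are tied for most frequent, `Series.mode()` returns all of them in sorted order, so `iloc[0]` picks the first one alphabetically. The toy data below is invented to show both the normal case and a tie.

```python
import pandas as pd

def get_mode(series):
    mode = series.mode()
    return mode.iloc[0] if not mode.empty else None

# Invented rows: prefix 1001 has a clear winner, prefix 2000 has a tie.
toy = pd.DataFrame({
    'zip_code_prefix': [1001, 1001, 1001, 2000, 2000],
    'city_normalized': ['sao paulo', 'sao paulo', 'sp', 'rio', 'niteroi'],
    'latitude': [-23.5, -23.7, -23.6, -22.9, -22.7],
})

result = (
    toy.groupby('zip_code_prefix')
       .agg({'city_normalized': get_mode, 'latitude': 'mean'})
       .reset_index()
)
print(result)
```

For prefix 2000 the helper returns 'niteroi', not because it is more common but because it sorts before 'rio'. On the real data, ties like this would be resolved the same way.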

In [22]:
statistics = {
                'city_normalized': get_mode,
                'state': get_mode,
                'latitude': 'mean',
                'longitude': 'mean'
             }

aggregated_geo_df = geolocations_df.groupby('zip_code_prefix').agg(statistics).reset_index()
aggregated_geo_df.head(10)

Unnamed: 0,zip_code_prefix,city_normalized,state,latitude,longitude
0,1001,sao paulo,SP,-23.55019,-46.634024
1,1002,sao paulo,SP,-23.548146,-46.634979
2,1003,sao paulo,SP,-23.548994,-46.635731
3,1004,sao paulo,SP,-23.549799,-46.634757
4,1005,sao paulo,SP,-23.549456,-46.636733
5,1006,sao paulo,SP,-23.550102,-46.636137
6,1007,sao paulo,SP,-23.550046,-46.637251
7,1008,sao paulo,SP,-23.546002,-46.635886
8,1009,sao paulo,SP,-23.546835,-46.636491
9,1010,sao paulo,SP,-23.546389,-46.635226


#### Rename columns and assign the data back to the original DataFrame.

We need to rename the `city_normalized` column back to `city` and assign the aggregated DataFrame back to the original `geolocations_df` DataFrame. We will also export the dataset to a CSV file for reference.

In [23]:
aggregated_geo_df.rename(columns={'city_normalized': 'city'}, inplace=True)
geolocations_df = aggregated_geo_df[['zip_code_prefix', 'latitude', 'longitude', 'city', 'state']]

geolocations_df.head(10)

Unnamed: 0,zip_code_prefix,latitude,longitude,city,state
0,1001,-23.55019,-46.634024,sao paulo,SP
1,1002,-23.548146,-46.634979,sao paulo,SP
2,1003,-23.548994,-46.635731,sao paulo,SP
3,1004,-23.549799,-46.634757,sao paulo,SP
4,1005,-23.549456,-46.636733,sao paulo,SP
5,1006,-23.550102,-46.636137,sao paulo,SP
6,1007,-23.550046,-46.637251,sao paulo,SP
7,1008,-23.546002,-46.635886,sao paulo,SP
8,1009,-23.546835,-46.636491,sao paulo,SP
9,1010,-23.546389,-46.635226,sao paulo,SP


#### Export the cleaned DataFrame to a CSV file.

In [24]:
cleaned_geolocations_data_file_path = '../resources/cleaned_geolocation_dataset.csv'
geolocations_df.to_csv(cleaned_geolocations_data_file_path, index=False)

logging.info(f"Cleaned geolocation data and exported to {cleaned_geolocations_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the **geolocations_df** DataFrame went from **1,000,163** records to **19,015** records. The **981,148** records that disappeared are rows whose `zip_code_prefix` duplicated another row and were merged into it.

In [25]:
before_transform_shape

(1000163, 5)

In [26]:
geolocations_df.shape

(19015, 5)
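The figures quoted above follow directly from the two shapes. A short check, using only the row counts shown in the outputs:

```python
before_rows = 1_000_163  # rows in the raw data (before_transform_shape)
after_rows = 19_015      # rows after grouping, one per zip_code_prefix

rows_removed = before_rows - after_rows
share_removed = rows_removed / before_rows
rows_per_prefix = before_rows / after_rows

print(rows_removed)
print(f"{share_removed:.1%}")
print(round(rows_per_prefix, 1))
```

In other words, each unique prefix appeared in roughly 53 raw rows on average.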

<section id="section_3b">
    <h3 style="text-align: center">Section 3-B: The <code>customers</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>customers_df</code></h4>
</section>

In [35]:
# Reading customers data
customers_df = pd.read_csv(customers_data_file_path,
                           header=0,
                           na_values='')

logging.info("Customer data loaded successfully.")

customers_na = customers_df.isna().any().to_dict()
if True in customers_na.values():
    na_columns = [col for col in customers_na.keys() if customers_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [36]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [37]:
customers_df.head(10)

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP
5,879864dab9bc3047522c92c82e1212b8,4c93744516667ad3b8f1fb645a3116a4,89254,jaragua do sul,SC
6,fd826e7cf63160e536e0908c76c3f441,addec96d2e059c80c30fe6871d30d177,4534,sao paulo,SP
7,5e274e7a0c3809e14aba7ad5aae0d407,57b2a98a409812fe9618067b6b8ebe4f,35182,timoteo,MG
8,5adf08e34b2e993982a47070956c5c65,1175e95fb47ddff9de6b2b06188f7e0d,81560,curitiba,PR
9,4b7139f34592b3a31687243a302fa75b,9afe194fb833f79e300e37e580171f22,30575,belo horizonte,MG


In [38]:
before_transform_shape = customers_df.shape
before_transform_shape

(99441, 5)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, several things need to be fixed and some improvements can be made:

1. The `customer_city` and `customer_state` columns can be removed from the DataFrame since we can just reference them from `geolocations_df` using `customer_zip_code_prefix`.

2. We can rename the `customer_zip_code_prefix` column to `zip_code_prefix` to standardise the naming.

3. The column data types need to be set explicitly (`string` for the two ID columns and `int32` for `zip_code_prefix`).

4. We need to check that every value in the `zip_code_prefix` column exists in the `geolocations_df` DataFrame.

#### Remove the `customer_city` and `customer_state` columns.

In [39]:
customers_df = customers_df[['customer_id', 'customer_unique_id', 'customer_zip_code_prefix']]

#### Rename the column names and change the data types of the columns.

In [40]:
customers_df.rename(columns={'customer_zip_code_prefix': 'zip_code_prefix'}, inplace=True)

customers_data_types = {
    "customer_id": "string",
    "customer_unique_id": "string",
    "zip_code_prefix": np.int32
}

customers_df = customers_df.astype(customers_data_types)

customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         99441 non-null  string
 1   customer_unique_id  99441 non-null  string
 2   zip_code_prefix     99441 non-null  int32 
dtypes: int32(1), string(2)
memory usage: 1.9 MB


#### Remove rows where the `zip_code_prefix` does not exist inside the `geolocations_df` DataFrame.

From the cell below, we found that there are **278** rows whose `zip_code_prefix` does not exist in `geolocations_df`. So, we remove them from the DataFrame.

In [41]:
existing_condition = customers_df['zip_code_prefix'].isin(geolocations_df['zip_code_prefix'])
inconsistent_rows = customers_df[~existing_condition]

inconsistent_rows.shape

(278, 3)

In [42]:
customers_df = customers_df[existing_condition]

customers_df.shape

(99163, 3)
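The `isin()` filter used above is one way to enforce this foreign-key style constraint. An equivalent way to find the orphaned rows is a left merge with `indicator=True`, sketched below on invented data. The variable names and values are hypothetical.

```python
import pandas as pd

# Invented data: customer 'c' points at a prefix with no geolocation row.
customers = pd.DataFrame({
    'customer_id': ['a', 'b', 'c'],
    'zip_code_prefix': [1001, 1002, 99999],
})
geolocations = pd.DataFrame({'zip_code_prefix': [1001, 1002, 1003]})

# Approach used in the notebook: boolean mask with isin().
mask = customers['zip_code_prefix'].isin(geolocations['zip_code_prefix'])
kept = customers[mask]

# Alternative: a left merge tags each row with where its key was found.
keys = geolocations[['zip_code_prefix']].drop_duplicates()
merged = customers.merge(keys, on='zip_code_prefix', how='left', indicator=True)
orphans = merged.loc[merged['_merge'] == 'left_only', 'customer_id'].tolist()

print(kept['customer_id'].tolist())
print(orphans)
```

The merge variant only behaves like the mask when the keys on the right side are unique, which is why the sketch calls `drop_duplicates()` first.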

#### Export the cleaned DataFrame to a CSV file.

In [43]:
cleaned_customers_data_file_path = '../resources/cleaned_customer_dataset.csv'
customers_df.to_csv(cleaned_customers_data_file_path, index=False)

logging.info(f"Cleaned customer data and exported to {cleaned_customers_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the **customers_df** DataFrame went from **99,441** records to **99,163** records. The **278** records that were removed had `zip_code_prefix` values that are not present in `geolocations_df`.

In [44]:
before_transform_shape

(99441, 5)

In [45]:
customers_df.shape

(99163, 3)

<section id="section_3c">
    <h3 style="text-align: center">Section 3-C: The <code>sellers</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>sellers_df</code></h4>
</section>

In [6]:
sellers_df = pd.read_csv(sellers_data_file_path,
                         header=0,
                         na_values='')

logging.info("Seller data loaded successfully.")

sellers_na = sellers_df.isna().any().to_dict()
if True in sellers_na.values():
    na_columns = [col for col in sellers_na.keys() if sellers_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [7]:
sellers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


In [8]:
sellers_df.head(10)

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP
5,c240c4061717ac1806ae6ee72be3533b,20920,rio de janeiro,RJ
6,e49c26c3edfa46d227d5121a6b6e4d37,55325,brejao,PE
7,1b938a7ec6ac5061a66a3766e0e75f90,16304,penapolis,SP
8,768a86e36ad6aae3d03ee3c6433d61df,1529,sao paulo,SP
9,ccc4bbb5f32a6ab2b7066a4130f114e3,80310,curitiba,PR


In [9]:
before_transform_shape = sellers_df.shape
before_transform_shape

(3095, 4)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, several things need to be fixed and some improvements can be made:

1. Just like the `customers_df`, the `seller_city` and `seller_state` columns can be removed from the DataFrame since we can just reference them from `geolocations_df` using `seller_zip_code_prefix`.

2. We can rename the `seller_zip_code_prefix` column to `zip_code_prefix` to standardise the naming.

3. The column data types need to be set explicitly (`string` for `seller_id` and `int32` for `zip_code_prefix`).

4. We need to check that every value in the `zip_code_prefix` column exists in the `geolocations_df` DataFrame.

#### Remove the `seller_city` and `seller_state` columns.

In [10]:
sellers_df = sellers_df[['seller_id', 'seller_zip_code_prefix']]

#### Rename the column names and change the data types of the columns.

In [11]:
sellers_df.rename(columns={'seller_zip_code_prefix': 'zip_code_prefix'}, inplace=True)

sellers_data_types = {
    "seller_id": "string",
    "zip_code_prefix": np.int32
}

sellers_df = sellers_df.astype(sellers_data_types)

sellers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   seller_id        3095 non-null   string
 1   zip_code_prefix  3095 non-null   int32 
dtypes: int32(1), string(1)
memory usage: 36.4 KB


#### Remove rows where the `zip_code_prefix` does not exist inside the `geolocations_df` DataFrame.

From the cell below, we see that there are **7** rows where the `zip_code_prefix` does not exist inside the `geolocations_df`. So, we remove them from the DataFrame.

In [30]:
existing_rows = sellers_df['zip_code_prefix'].isin(geolocations_df['zip_code_prefix'])
inconsistent_rows = sellers_df[~existing_rows]
inconsistent_rows.shape

(7, 2)

In [31]:
sellers_df = sellers_df[existing_rows]
sellers_df.shape

(3088, 2)

#### Export the cleaned DataFrame to a CSV file.

In [32]:
cleaned_sellers_data_file_path = '../resources/cleaned_seller_dataset.csv'
sellers_df.to_csv(cleaned_sellers_data_file_path, index=False)

logging.info(f"Cleaned seller data and exported to {cleaned_sellers_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the `sellers_df` DataFrame went from **3,095** records to **3,088** records. The **7** records that were removed had `zip_code_prefix` values that are not present in `geolocations_df`.

In [None]:
before_transform_shape

In [34]:
sellers_df.shape

(3088, 2)

<section id="section_3d">
    <h3 style="text-align: center">Section 3-D: The <code>orders</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>orders_df</code></h4>
</section>

In [48]:
orders_df = pd.read_csv(orders_data_file_path,
                        header=0,
                        na_values='')

logging.info("Order data loaded successfully.")

orders_na = orders_df.isna().any().to_dict()
if True in orders_na.values():
    na_columns = [col for col in orders_na.keys() if orders_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [49]:
orders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


In [50]:
orders_df.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


In [51]:
orders_df.describe()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
count,99441,99441,99441,99441,99281,97658,96476,99441
unique,99441,99441,8,98875,90733,81018,95664,459
top,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2018-04-11 10:48:14,2018-02-27 04:31:10,2018-05-09 15:48:00,2018-05-08 23:38:46,2017-12-20 00:00:00
freq,1,1,96478,3,9,47,3,522


In [52]:
before_transform_shape = orders_df.shape
before_transform_shape

(99441, 8)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, several things need to be fixed and some improvements can be made:

1. The column data types need to be set. In particular, the columns holding timestamps are read in as `object` and should be converted to `datetime64` values.

2. We need to check that every value in the `customer_id` column exists in the `customers_df` DataFrame.

#### Change the data types of the columns.

In [53]:
orders_data_types = {
    "order_id": "string",
    "customer_id": "string",
    "order_status": "string"
}

orders_df = orders_df.astype(orders_data_types)

orders_df['order_purchase_timestamp'] = pd.to_datetime(orders_df['order_purchase_timestamp'])
orders_df['order_approved_at'] = pd.to_datetime(orders_df['order_approved_at'])
orders_df['order_delivered_carrier_date'] = pd.to_datetime(orders_df['order_delivered_carrier_date'])
orders_df['order_delivered_customer_date'] = pd.to_datetime(orders_df['order_delivered_customer_date'])
orders_df['order_estimated_delivery_date'] = pd.to_datetime(orders_df['order_estimated_delivery_date'])

orders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       99441 non-null  string        
 1   customer_id                    99441 non-null  string        
 2   order_status                   99441 non-null  string        
 3   order_purchase_timestamp       99441 non-null  datetime64[ns]
 4   order_approved_at              99281 non-null  datetime64[ns]
 5   order_delivered_carrier_date   97658 non-null  datetime64[ns]
 6   order_delivered_customer_date  96476 non-null  datetime64[ns]
 7   order_estimated_delivery_date  99441 non-null  datetime64[ns]
dtypes: datetime64[ns](5), string(3)
memory usage: 6.1 MB
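The conversions above let pandas infer the timestamp layout. A stricter variant, sketched below on invented values, passes the layout explicitly and turns anything that does not match into `NaT` instead of raising. The format string is our reading of the sample rows shown earlier, not something documented with the dataset.

```python
import pandas as pd

# Invented values: one valid timestamp, one missing value, one bad string.
raw = pd.Series(['2017-10-02 10:56:33', None, 'not a timestamp'])

parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed.isna().tolist())
print(parsed.iloc[0])
```

Missing values stay missing after conversion, which matches the non-null counts in the `info()` output above being the same before and after. Note that `errors='coerce'` also hides malformed strings, so it is worth comparing the null counts before and after when using it.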


#### Remove rows where the `customer_id` does not exist inside the `customers_df` DataFrame.

From the cell below, we found that there are **278** rows whose `customer_id` does not exist in `customers_df`. So, we remove them from the DataFrame.

In [54]:
existing_condition = orders_df['customer_id'].isin(customers_df['customer_id'])
inconsistent_rows = orders_df[~existing_condition]

inconsistent_rows.shape

(278, 8)

In [55]:
orders_df = orders_df[existing_condition]

orders_df.shape

(99163, 8)

#### Export the cleaned DataFrame to a CSV file.

In [56]:
cleaned_orders_data_file_path = '../resources/cleaned_order_dataset.csv'
orders_df.to_csv(cleaned_orders_data_file_path, index=False)

logging.info(f"Cleaned order data and exported to {cleaned_orders_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the `orders_df` DataFrame lost **278** records whose `customer_id` does not exist in `customers_df`, going from **99,441** records to **99,163** records.

In [57]:
before_transform_shape

(99441, 8)

In [58]:
orders_df.shape

(99163, 8)

<section id="section_3e">
    <h3 style="text-align: center">Section 3-E: The <code>product_category_name_translation</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>product_categories_df</code></h4>
</section>

In [59]:
product_categories_df = pd.read_csv(product_categories_data_file_path,
                                    header=0,
                                    na_values='')

logging.info("Product category data loaded successfully.")

product_categories_na = product_categories_df.isna().any().to_dict()
if True in product_categories_na.values():
    na_columns = [col for col in product_categories_na.keys() if product_categories_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")
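
The NA check above is repeated for every file that is read in. A minimal sketch of how it could be factored into one function follows; `columns_with_na` is a hypothetical helper name and `toy_df` is made-up data, not part of the project.

```python
import logging
import pandas as pd

def columns_with_na(df):
    """Return the names of the columns that contain at least one NA value."""
    has_na = df.isna().any()              # boolean Series indexed by column name
    return has_na[has_na].index.tolist()  # keep only the columns flagged True

toy_df = pd.DataFrame({"a": [1, 2], "b": [None, "x"], "c": [1.0, float("nan")]})

na_columns = columns_with_na(toy_df)
if na_columns:
    logging.warning("Detected columns with NA values: %s", ", ".join(na_columns))
```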

In [60]:
product_categories_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


In [63]:
product_categories_df.sort_values(by=['product_category_name_english']).head(30)

Unnamed: 0,product_category_name,product_category_name_english
37,agro_industria_e_comercio,agro_industry_and_commerce
24,climatizacao,air_conditioning
46,artes,art
67,artes_e_artesanato,arts_and_craftmanship
20,audio,audio
2,automotivo,auto
11,bebes,baby
3,cama_mesa_banho,bed_bath_table
48,livros_interesse_geral,books_general_interest
57,livros_importados,books_imported


#### Problems and Possible Improvements:

As seen in the outputs from the cells above, there are some things that need to be fixed and some improvements that can be made:

1. We need to change the data types of the columns.

2. We need to replace the `_` underscores in the string values with spaces.

3. As you can see in the output of the DataFrame above, some of the English words are misspelled (for example, `arts_and_craftmanship`), so we need to correct the spelling.

#### Change the data type of the columns.

In [65]:
product_categories_data_types = {
    "product_category_name": "string",
    "product_category_name_english": "string"
}

product_categories_df = product_categories_df.astype(product_categories_data_types)

product_categories_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     string
 1   product_category_name_english  71 non-null     string
dtypes: string(2)
memory usage: 1.2 KB


#### Remove the `_` underscore from the string values.

In [67]:
product_categories_df['product_category_name'] = product_categories_df['product_category_name'].str.replace('_', ' ')
product_categories_df['product_category_name_english'] = product_categories_df['product_category_name_english'].str.replace('_', ' ')

product_categories_df.head(5)

Unnamed: 0,product_category_name,product_category_name_english
0,beleza saude,health beauty
1,informatica acessorios,computers accessories
2,automotivo,auto
3,cama mesa banho,bed bath table
4,moveis decoracao,furniture decor


#### Define the SpellChecker method

We use the SpellChecker library to check the spelling of the English words found under the `product_category_name_english` column.

Some terms in the column, such as `blu`, `cd`, `christmas`, and `dvd`, are valid, but SpellChecker treats them as misspelled and changes them to different words. To prevent this, we added a text file containing these extra words so that SpellChecker leaves them unchanged.

In [68]:
spell = SpellChecker()
# load additional words that are correct but unknown to SpellChecker, such as 'blu', 'cd', 'christmas', 'dvd'
spell.word_frequency.load_text_file('../resources/product_categories.txt')

# split the text into individual words and check them one by one,
# then join them back together into a single string and return it
def correct_spelling(text):
    list_of_words = text.split()
    for i, word in enumerate(list_of_words):
        # correction() can return None in recent pyspellchecker versions when it
        # finds no candidate, so fall back to the original word in that case
        list_of_words[i] = spell.correction(word) or word
    return ' '.join(list_of_words)
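
To show the split, correct, and join pattern in isolation, here is a small sketch that runs without the SpellChecker library. The corrector is passed in as a plain function; `correct_text`, `toy_corrections`, and `toy_corrector` are hypothetical names, and the lookup table is a stand-in for a real spell checker rather than how SpellChecker works internally.

```python
def correct_text(text, correct_word):
    """Correct each whitespace-separated word, keeping the original when no correction is offered."""
    corrected = []
    for word in text.split():
        suggestion = correct_word(word)
        corrected.append(suggestion if suggestion else word)
    return " ".join(corrected)

# hypothetical stand-in for a spell checker: returns None for words it has no entry for
toy_corrections = {"craftmanship": "craftsmanship"}
toy_corrector = toy_corrections.get

print(correct_text("arts and craftmanship", toy_corrector))
```

Keeping the original word whenever the corrector has no suggestion means unknown terms pass through unchanged instead of being lost.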

#### Apply the SpellChecker method on the `product_category_name_english` column

As you can see in the output below, the spelling of the English words is corrected (for example, `craftmanship` becomes `craftsmanship`).

In [69]:
product_categories_df['product_category_name_english'] = product_categories_df['product_category_name_english'].apply(correct_spelling)

product_categories_df.sort_values(by=['product_category_name_english']).head(30)

Unnamed: 0,product_category_name,product_category_name_english
37,agro industria e comercio,agro industry and commerce
24,climatizacao,air conditioning
46,artes,art
67,artes e artesanato,arts and craftsmanship
20,audio,audio
2,automotivo,auto
11,bebes,baby
3,cama mesa banho,bed bath table
48,livros interesse geral,books general interest
57,livros importados,books imported


#### Export the cleaned DataFrame to a CSV file.

In [70]:
cleaned_products_categories_data_file_path = '../resources/cleaned_product_categories_dataset.csv'
product_categories_df.to_csv(cleaned_products_categories_data_file_path, index=False)

logging.info(f"Cleaned product category data and exported to {cleaned_products_categories_data_file_path}.")

#### As there are no membership constraints to check against other DataFrames, there is no difference in the number of records before and after cleaning the data for `product_categories_df`.

<section id="section_3f">
    <h3 style="text-align: center">Section 3-F: The <code>products</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>products_df</code></h4>
</section>

In [83]:
products_df = pd.read_csv(products_data_file_path,
                          header=0,
                          na_values='')

logging.info("Product data loaded successfully.")

products_na = products_df.isna().any().to_dict()
if True in products_na.values():
    na_columns = [col for col in products_na.keys() if products_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [84]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB


In [85]:
products_df.head(5)

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0


In [86]:
before_transform_shape = products_df.shape
before_transform_shape

(32951, 9)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, there are some things that need to be fixed and some improvements that can be made:

1. There are some rows that have null values under the `product_category_name` so we need to remove them.

2. There are null values under the numeric columns so we need to set them to `0`.

3. The columns `product_name_lenght` and `product_description_lenght` misspell the word `length`.

4. We need to change the data types of the columns.

5. Since we removed the `_` underscore from the `product_category_name` in the `product_categories_df`, we need to do the same to the same column in this DataFrame.

6. We need to check if the values under the `product_category_name` exist inside the `product_categories_df`. If they don't, we need to remove those records.

#### Remove rows with null values under the `product_category_name` and fill the null values under the quantitative columns with `0`.

In [87]:
products_df.dropna(subset=['product_category_name'], inplace=True)

products_na_values = {
    "product_name_lenght": 0,
    "product_description_lenght": 0,
    "product_photos_qty": 0,
    "product_weight_g": 0,
    "product_length_cm": 0,
    "product_height_cm": 0,
    "product_width_cm": 0
}
products_df.fillna(value=products_na_values, inplace=True)

products_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32341 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32341 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32341 non-null  float64
 6   product_length_cm           32341 non-null  float64
 7   product_height_cm           32341 non-null  float64
 8   product_width_cm            32341 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.5+ MB


#### Rename the column names and change the data types of the columns.

In [88]:
products_col_names = {
    "product_name_lenght": "product_name_length",
    "product_description_lenght": "product_description_length"
}

products_data_types = {
    "product_id": "string",
    "product_category_name": "string",
    "product_name_length": np.int8,
    "product_description_length": np.int16,
    "product_photos_qty": np.int8,
    "product_weight_g": np.int16,
    "product_length_cm": np.int8,
    "product_height_cm": np.int8,
    "product_width_cm": np.int8
}

products_df.rename(columns=products_col_names, inplace=True)
products_df = products_df.astype(products_data_types)

products_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32341 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   product_id                  32341 non-null  string
 1   product_category_name       32341 non-null  string
 2   product_name_length         32341 non-null  int8  
 3   product_description_length  32341 non-null  int16 
 4   product_photos_qty          32341 non-null  int8  
 5   product_weight_g            32341 non-null  int16 
 6   product_length_cm           32341 non-null  int8  
 7   product_height_cm           32341 non-null  int8  
 8   product_width_cm            32341 non-null  int8  
dtypes: int16(2), int8(5), string(2)
memory usage: 1.0 MB
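
Casting to a narrow integer type only preserves values that lie inside that type's range (`int8` holds -128 to 127, `int16` holds -32,768 to 32,767). A check like the one sketched below could be run before the cast. `fits_integer_dtype` is a hypothetical helper, and `toy_weights` reuses the first three `product_weight_g` values from the `head()` output above.

```python
import numpy as np
import pandas as pd

def fits_integer_dtype(series, dtype):
    """Return True when every value in series lies within the range of the integer dtype."""
    limits = np.iinfo(dtype)
    return bool(series.min() >= limits.min and series.max() <= limits.max)

toy_weights = pd.Series([225.0, 1000.0, 154.0])  # grams

print(fits_integer_dtype(toy_weights, np.int16))
print(fits_integer_dtype(toy_weights, np.int8))
```

Running the check on each column against its entry in `products_data_types` would flag any column whose maximum exceeds the chosen type before `astype` is applied.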


#### Remove the `_` underscore from the values under `product_category_name` and remove rows where the `product_category_name` does not exist in `product_categories_df`

As you can see below, there are 13 rows where the `product_category_name` does not exist in `product_categories_df`. So, we remove them from the DataFrame.

In [89]:
products_df['product_category_name'] = products_df['product_category_name'].str.replace('_', ' ')

existing_condition = products_df['product_category_name'].isin(product_categories_df['product_category_name'])
inconsistent_rows = products_df[~existing_condition]

inconsistent_rows.shape

(13, 9)

In [90]:
products_df = products_df[existing_condition]
products_df.shape

(32328, 9)

#### Export the cleaned DataFrame to a CSV file.

In [91]:
cleaned_products_data_file_path = '../resources/cleaned_product_dataset.csv'
products_df.to_csv(cleaned_products_data_file_path, index=False)

logging.info(f"Cleaned product data and exported to {cleaned_products_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the **products_df** DataFrame lost **623** records due to missing and invalid `product_category_name` values (610 rows with null values and 13 rows with unknown categories), going from **32,951** records to **32,328** records.

In [92]:
before_transform_shape

(32951, 9)

In [93]:
products_df.shape

(32328, 9)

<section id="section_3g">
    <h3 style="text-align: center">Section 3-G: The <code>order_reviews</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>order_reviews_df</code></h4>
</section>

In [103]:
order_reviews_df = pd.read_csv(order_reviews_data_file_path,
                               header=0,
                               na_values='')

logging.info("Order review data loaded successfully.")

order_reviews_na = order_reviews_df.isna().any().to_dict()
if True in order_reviews_na.values():
    na_columns = [col for col in order_reviews_na.keys() if order_reviews_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [104]:
order_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99224 non-null  object
 1   order_id                 99224 non-null  object
 2   review_score             99224 non-null  int64 
 3   review_comment_title     11568 non-null  object
 4   review_comment_message   40977 non-null  object
 5   review_creation_date     99224 non-null  object
 6   review_answer_timestamp  99224 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB


In [105]:
order_reviews_df.head(10)

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53
5,15197aa66ff4d0650b5434f1b46cda19,b18dcdf73be66366873cd26c5724d1dc,1,,,2018-04-13 00:00:00,2018-04-16 00:39:37
6,07f9bee5d1b850860defd761afa7ff16,e48aa0d2dcec3a2e87348811bcfdf22b,5,,,2017-07-16 00:00:00,2017-07-18 19:30:34
7,7c6400515c67679fbee952a7525281ef,c31a859e34e3adac22f376954e19b39d,5,,,2018-08-14 00:00:00,2018-08-14 21:36:06
8,a3f6f7f6f433de0aefbb97da197c554c,9c214ac970e84273583ab523dfafd09b,5,,,2017-05-17 00:00:00,2017-05-18 12:05:37
9,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47


In [106]:
before_transform_shape = order_reviews_df.shape
before_transform_shape

(99224, 7)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, there are some things that need to be fixed and some improvements that can be made:

1. We need to change the data types of the columns. In particular, the columns `review_creation_date` and `review_answer_timestamp` hold dates and should be converted to a DateTime data type.

2. We need to check if the values under the `order_id` exist inside the `orders_df`. If they don't, we need to remove those records.

#### Change the data types of the columns

In [107]:
order_reviews_data_types = {
    "review_id": "string",
    "order_id": "string",
    "review_score": np.int8,
    "review_comment_title": "string",
    "review_comment_message": "string"
}

order_reviews_df = order_reviews_df.astype(order_reviews_data_types)
order_reviews_df['review_creation_date'] = pd.to_datetime(order_reviews_df['review_creation_date'])
order_reviews_df['review_answer_timestamp'] = pd.to_datetime(order_reviews_df['review_answer_timestamp'])

order_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   review_id                99224 non-null  string        
 1   order_id                 99224 non-null  string        
 2   review_score             99224 non-null  int8          
 3   review_comment_title     11568 non-null  string        
 4   review_comment_message   40977 non-null  string        
 5   review_creation_date     99224 non-null  datetime64[ns]
 6   review_answer_timestamp  99224 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int8(1), string(4)
memory usage: 4.6 MB
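
`pd.to_datetime` infers the layout of the strings when no format is given. As an optional variation, the format can be stated explicitly; the sketch below assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the `head()` output above, and `toy_dates` is made-up data that includes one deliberately invalid value.

```python
import pandas as pd

toy_dates = pd.Series(["2018-01-18 21:46:59", "2018-03-11 03:05:13", "not a date"])

# an explicit format documents the expected layout; errors="coerce" turns
# anything that does not match into NaT instead of raising an exception
parsed = pd.to_datetime(toy_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# number of values that could not be parsed
print(parsed.isna().sum())
```

Counting the `NaT` values afterwards gives a quick way to see whether any timestamps failed to parse.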


#### Remove rows where the `order_id` does not exist inside the `orders_df`

There are 279 rows with an invalid `order_id`.

In [108]:
existing_condition = order_reviews_df['order_id'].isin(orders_df['order_id'])
inconsistent_rows = order_reviews_df[~existing_condition]

order_reviews_df = order_reviews_df[existing_condition]

inconsistent_rows.shape

(279, 7)

#### Export the cleaned DataFrame to a CSV file.

In [109]:
cleaned_order_reviews_data_file_path = '../resources/cleaned_order_reviews_dataset.csv'
order_reviews_df.to_csv(cleaned_order_reviews_data_file_path, index=False)

logging.info(f"Cleaned order review data and exported to {cleaned_order_reviews_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the **order_reviews_df** DataFrame lost **279** records due to invalid `order_id` values, going from **99,224** records to **98,945** records.

In [111]:
before_transform_shape

(99224, 7)

In [110]:
order_reviews_df.shape

(98945, 7)

<section id="section_3h">
    <h3 style="text-align: center">Section 3-H: The <code>order_items</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>order_items_df</code></h4>
</section>

In [112]:
order_items_df = pd.read_csv(order_items_data_file_path,
                             header=0,
                             na_values='')

logging.info("Order item data loaded successfully.")

order_items_na = order_items_df.isna().any().to_dict()
if True in order_items_na.values():
    na_columns = [col for col in order_items_na.keys() if order_items_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [113]:
order_items_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


In [114]:
order_items_df.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


In [115]:
before_transform_shape = order_items_df.shape
before_transform_shape

(112650, 7)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, there are some things that need to be fixed and some improvements that can be made:

1. We need to change the data types of some of the columns. In particular, the column `shipping_limit_date` holds dates and should be converted to a DateTime data type.

2. We need to remove rows where the `order_id` does not exist in the `orders_df`.

3. We need to remove rows where the `product_id` does not exist in the `products_df`.

4. We need to remove rows where the `seller_id` does not exist in the `sellers_df`.

#### Change the data types of the columns.

In [117]:
order_items_data_types = {
    "order_id": "string",
    "order_item_id": np.int8,
    "product_id": "string",
    "seller_id": "string"
}

order_items_df = order_items_df.astype(order_items_data_types)
order_items_df['shipping_limit_date'] = pd.to_datetime(order_items_df['shipping_limit_date'])

order_items_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   order_id             112650 non-null  string        
 1   order_item_id        112650 non-null  int8          
 2   product_id           112650 non-null  string        
 3   seller_id            112650 non-null  string        
 4   shipping_limit_date  112650 non-null  datetime64[ns]
 5   price                112650 non-null  float64       
 6   freight_value        112650 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int8(1), string(3)
memory usage: 5.3 MB


#### Remove rows where the `order_id` does not exist in the `orders_df`

There are 302 rows with invalid `order_id`.

In [118]:
existing_condition = order_items_df['order_id'].isin(orders_df['order_id'])
inconsistent_rows = order_items_df[~existing_condition]

order_items_df = order_items_df[existing_condition]

inconsistent_rows.shape

(302, 7)

#### Remove rows where the `product_id` does not exist in the `products_df`

There are 1,621 rows with invalid `product_id`.

In [120]:
existing_condition = order_items_df['product_id'].isin(products_df['product_id'])
inconsistent_rows = order_items_df[~existing_condition]

order_items_df = order_items_df[existing_condition]

inconsistent_rows.shape

(1621, 7)

#### Remove rows where the `seller_id` does not exist in the `sellers_df`

There are 251 rows with invalid `seller_id`.

In [121]:
existing_condition = order_items_df['seller_id'].isin(sellers_df['seller_id'])
inconsistent_rows = order_items_df[~existing_condition]

order_items_df = order_items_df[existing_condition]

inconsistent_rows.shape

(251, 7)

#### Export the cleaned DataFrame to a CSV file.

In [122]:
cleaned_order_items_data_file_path = '../resources/cleaned_order_items_dataset.csv'
order_items_df.to_csv(cleaned_order_items_data_file_path, index=False)

logging.info(f"Cleaned order item data and exported to {cleaned_order_items_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the **order_items_df** DataFrame lost **2,174** records due to invalid `order_id`, `product_id`, and `seller_id` values (302 + 1,621 + 251), going from **112,650** records to **110,476** records.

In [123]:
before_transform_shape

(112650, 7)

In [124]:
order_items_df.shape

(110476, 7)
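
As a quick sanity check, the counts reported by the three membership checks above should add up to the difference between the two shapes. The snippet below only restates the numbers already shown in the outputs; the variable names are illustrative.

```python
# rows removed by each membership check, as reported in the outputs above
removed_by_key = {"order_id": 302, "product_id": 1621, "seller_id": 251}

rows_before = 112650
rows_after = 110476

total_removed = sum(removed_by_key.values())

# the three checks together must account for every lost row
assert total_removed == rows_before - rows_after
print(total_removed)
```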

<section id="section_3i">
    <h3 style="text-align: center">Section 3-I: The <code>order_payments</code> data</h3>
    <h4>Read in the CSV file and store inside a DataFrame called <code>order_payments_df</code></h4>
</section>

In [125]:
order_payments_df = pd.read_csv(order_payments_data_file_path,
                                header=0,
                                na_values='')

logging.info("Order payment data loaded successfully.")

order_payments_na = order_payments_df.isna().any().to_dict()
if True in order_payments_na.values():
    na_columns = [col for col in order_payments_na.keys() if order_payments_na[col] == True]
    na_columns = ', '.join(na_columns)
    logging.warning(f"Detected columns with NA values: {na_columns}")

In [126]:
order_payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


In [127]:
order_payments_df.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


In [128]:
before_transform_shape = order_payments_df.shape
before_transform_shape

(103886, 5)

#### Problems and Possible Improvements:

As seen in the outputs from the cells above, there are some things that need to be fixed and some improvements that can be made:

1. We need to change the data types of the columns.

2. We need to remove rows where the `order_id` does not exist in the `orders_df`.

#### Change the data types of the columns.

In [129]:
order_payments_data_types = {
    "order_id": "string",
    "payment_sequential": np.int8,
    "payment_type": "string",
    "payment_installments": np.int8
}

order_payments_df = order_payments_df.astype(order_payments_data_types)
order_payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  string 
 1   payment_sequential    103886 non-null  int8   
 2   payment_type          103886 non-null  string 
 3   payment_installments  103886 non-null  int8   
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int8(2), string(2)
memory usage: 2.6 MB


#### Remove rows where the `order_id` does not exist inside the `orders_df`

There are **287** rows with invalid `order_id`.

In [130]:
existing_condition = order_payments_df['order_id'].isin(orders_df['order_id'])
inconsistent_rows = order_payments_df[~existing_condition]

order_payments_df = order_payments_df[existing_condition]

inconsistent_rows.shape

(287, 5)

#### Export the cleaned DataFrame to a CSV file.

In [131]:
cleaned_order_payments_data_file_path = '../resources/cleaned_order_payments_dataset.csv'
order_payments_df.to_csv(cleaned_order_payments_data_file_path, index=False)

logging.info(f"Cleaned order payment data and exported to {cleaned_order_payments_data_file_path}.")

#### Difference before and after cleaning:

As you can see below, the **order_payments_df** DataFrame lost **287** records due to invalid `order_id` values, going from **103,886** records to **103,599** records.

In [132]:
before_transform_shape

(103886, 5)

In [133]:
order_payments_df.shape

(103599, 5)

<section id="section_3j">
    <h3 style="text-align: center">Section 3-J: Write to the logging process</h3>
</section>

In [134]:
logging.info("Transformation process completed.")

<section id="section_4">
    <hr>
    <h2 style="text-align: center">Section 4: Data Loading</h2>
    <hr>
</section>

<section id="section_4a">
    <h3 style="text-align: center">Section 4-A: Establish a connection to the PostgreSQL server and create the database</h3>
</section>

In [135]:
# define name of the database to be created
database_name = 'omni_oracle_ecommerce_db'

# connection URL (user: postgres, password: admin), defined once and reused
database_url = f'postgresql://postgres:admin@localhost:5432/{database_name}'
db_engine = db.create_engine(database_url)
if database_exists(database_url):
    print(f"{database_name} already exists.")
    logging.info(f"{database_name} already exists.")
else:
    # create database
    create_database(db_engine.url)
    print(f"Created database: {database_name}")
    logging.info(f"Created database: {database_name}")

# obtain a raw DBAPI connection from the engine
conn = db_engine.raw_connection()

# release the engine's pooled connections; the checked-out raw connection stays open
db_engine.dispose()

# create a cursor to execute SQL commands
cur = conn.cursor()

omni_oracle_ecommerce_db already exists.
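
The connection URL above contains the user name and password in plain text. One possible alternative, sketched here under assumptions, is to read them from environment variables. `build_postgres_url` is a hypothetical helper; the variable names `PGUSER`, `PGPASSWORD`, `PGHOST`, and `PGPORT` follow the names used by PostgreSQL client tools, and using them in this notebook is an assumption, not something the project does.

```python
import os
from urllib.parse import quote

def build_postgres_url(database_name):
    """Assemble a postgresql:// URL from environment variables, falling back to the notebook's local defaults."""
    user = os.environ.get("PGUSER", "postgres")
    password = os.environ.get("PGPASSWORD", "admin")
    host = os.environ.get("PGHOST", "localhost")
    port = os.environ.get("PGPORT", "5432")
    # percent-encode characters such as '@' or '/' that would otherwise break the URL
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database_name}"
    )

database_url = build_postgres_url("omni_oracle_ecommerce_db")
```

The resulting string can be passed to `db.create_engine` and `database_exists` in place of the hard-coded URL.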


<section id="section_4b">
    <h3 style="text-align: center">Section 4-B: Create the tables (and establish relationships via foreign keys) inside the database</h3>
</section>

In [36]:
# Create new tables in PostgreSQL
commands = ('''
            -- geolocations
            CREATE TABLE geolocations (
                zip_code_prefix INTEGER PRIMARY KEY,
                latitude NUMERIC,
                longitude NUMERIC,
                city VARCHAR(50),
                state VARCHAR(10)
            );

            -- customers
            CREATE TABLE customers (
                customer_id VARCHAR(40) PRIMARY KEY,
                customer_unique_id VARCHAR(40),
                customer_zip_code_prefix INTEGER REFERENCES geolocations(zip_code_prefix)
            );

            -- sellers
            CREATE TABLE sellers (
                seller_id VARCHAR(40) PRIMARY KEY,
                seller_zip_code_prefix INTEGER REFERENCES geolocations(zip_code_prefix)
            );

            -- orders
            CREATE TABLE orders (
                order_id VARCHAR(40) PRIMARY KEY,
                customer_id VARCHAR(40) REFERENCES customers (customer_id),
                order_status VARCHAR(20),
                order_purchase_timestamp TIMESTAMP,
                order_approved_at TIMESTAMP,
                order_delivered_carrier_date TIMESTAMP,
                order_delivered_customer_date TIMESTAMP,
                order_estimated_delivery_date TIMESTAMP
            );

            -- product_category_name_translations
            CREATE TABLE product_category_name_translations (
                product_category_name TEXT PRIMARY KEY,
                product_category_name_english TEXT
            );

            -- products
            CREATE TABLE products (
                product_id VARCHAR(40) PRIMARY KEY,
                product_category_name TEXT REFERENCES product_category_name_translations (product_category_name),
                product_name_length INTEGER,
                product_description_length INTEGER,
                product_photos_qty INTEGER,
                product_weight_g INTEGER,
                product_length_cm INTEGER,
                product_height_cm INTEGER,
                product_width_cm INTEGER
            );

            -- order_reviews
            CREATE TABLE order_reviews (
                review_id VARCHAR(40),
                order_id VARCHAR(40) REFERENCES orders (order_id),
                review_score INTEGER,
                review_comment_title TEXT,
                review_comment_message TEXT,
                review_creation_date TIMESTAMP,
                review_answer_timestamp TIMESTAMP,
                PRIMARY KEY(review_id, order_id)
            );

            -- order_items
            CREATE TABLE order_items (
                order_id VARCHAR(40) REFERENCES orders (order_id),
                order_item_id INTEGER,
                product_id VARCHAR(40) REFERENCES products (product_id),
                seller_id VARCHAR(40) REFERENCES sellers (seller_id),
                shipping_limit_date TIMESTAMP,
                price MONEY,
                freight_value MONEY,
                PRIMARY KEY(order_id, order_item_id, product_id, seller_id)
            );

            -- order_payments
            CREATE TABLE order_payments (
                order_id VARCHAR(40) REFERENCES orders (order_id),
                payment_sequential INTEGER,
                payment_type VARCHAR(20),
                payment_installments INTEGER,
                payment_value MONEY,
                PRIMARY KEY(order_id, payment_sequential)
            );
            ''')

# Execute the CREATE TABLE statements with the existing cursor
cur.execute(commands)

# Commit changes
conn.commit()

logging.info("Created tables in database.")
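
Every `REFERENCES` clause above creates a foreign key, so a parent row must exist before any child row that points to it can be inserted. The sketch below illustrates this with cut-down versions of `geolocations` and `customers`. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, an assumption made only so the example runs without a server: SQLite needs `PRAGMA foreign_keys = ON`, whereas PostgreSQL always enforces foreign keys. The function name `demo_foreign_key_order` is hypothetical.

```python
import sqlite3


def demo_foreign_key_order():
    """Show that a child row is rejected until its parent row exists."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE geolocations (
            zip_code_prefix INTEGER PRIMARY KEY
        );
        CREATE TABLE customers (
            customer_id TEXT PRIMARY KEY,
            customer_zip_code_prefix INTEGER
                REFERENCES geolocations (zip_code_prefix)
        );
    """)

    # Child first: the referenced zip code prefix does not exist yet.
    try:
        conn.execute("INSERT INTO customers VALUES ('c1', 1000)")
        child_first_accepted = True
    except sqlite3.IntegrityError:
        child_first_accepted = False

    # Parent first, then child: both inserts succeed.
    conn.execute("INSERT INTO geolocations VALUES (1000)")
    conn.execute("INSERT INTO customers VALUES ('c1', 1000)")
    customer_count = conn.execute(
        "SELECT COUNT(*) FROM customers"
    ).fetchone()[0]

    conn.close()
    return child_first_accepted, customer_count


print(demo_foreign_key_order())  # (False, 1)
```

This is why the loads in Section 4-C start with `geolocations` and only then move on to the tables that reference it.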

<section id="section_4c">
    <h3 style="text-align: center">Section 4-C: Transfer the DataFrames into their respective tables in the database</h3>
</section>

#### `geolocations_df` --> **geolocations**

In [37]:
geolocations_df.to_sql(
    name='geolocations', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded geolocations into database.")

#### `customers_df` --> **customers**

In [38]:
customers_df.to_sql(
    name='customers', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded customers into database.")

#### `sellers_df` --> **sellers**

In [39]:
sellers_df.to_sql(
    name='sellers', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded sellers into database.")

#### `orders_df` --> **orders**

In [40]:
orders_df.to_sql(
    name='orders', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded orders into database.")

#### `product_categories_df` --> **product_category_name_translations**

In [41]:
product_categories_df.to_sql(
    name='product_category_name_translations', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded product_category_name_translations into database.")

#### `products_df` --> **products**

In [42]:
products_df.to_sql(
    name='products', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded products into database.")

#### `order_reviews_df` --> **order_reviews**

In [43]:
order_reviews_df.to_sql(
    name='order_reviews', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded order_reviews into database.")

#### `order_items_df` --> **order_items**

In [44]:
order_items_df.to_sql(
    name='order_items', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded order_items into database.")

#### `order_payments_df` --> **order_payments**

In [45]:
order_payments_df.to_sql(
    name='order_payments', 
    con=db_engine, 
    if_exists='append', 
    index=False
)

logging.info("Loaded order_payments into database.")
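
The nine loads above run in an order that satisfies the foreign keys declared in Section 4-B. As a sketch, that order can be derived from the schema instead of being maintained by hand. The dependency map below is transcribed from the `REFERENCES` clauses in Section 4-B; the names `FK_DEPENDENCIES` and `load_order` are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it references through a foreign key.
FK_DEPENDENCIES = {
    "geolocations": set(),
    "customers": {"geolocations"},
    "sellers": {"geolocations"},
    "orders": {"customers"},
    "product_category_name_translations": set(),
    "products": {"product_category_name_translations"},
    "order_reviews": {"orders"},
    "order_items": {"orders", "products", "sellers"},
    "order_payments": {"orders"},
}


def load_order(dependencies):
    """Return the table names with every parent before its children."""
    return list(TopologicalSorter(dependencies).static_order())


order = load_order(FK_DEPENDENCIES)
print(order)
```

Several valid orders exist, so the printed list may differ from the sequence used in this notebook. Any order returned by `load_order` could drive a single loop over a dictionary of DataFrames keyed by table name, calling `to_sql` with `if_exists='append'` and `index=False` as in the cells above.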

<section id="section_4d">
    <h3 style="text-align: center">Section 4-D: Close connection to the database</h3>
</section>

In [46]:
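# Close the cursor and the connection to the server
cur.close()
conn.close()

# Release the connections pooled by the SQLAlchemy engine used for to_sql
db_engine.dispose()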
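
If any earlier cell raises an exception, the close calls above never run and the connection stays open. A minimal sketch of one way to guard against that is shown below, using `try`/`finally` together with `contextlib.closing`. It again uses the built-in `sqlite3` module as a stand-in for the PostgreSQL connection (an assumption so that the example is self-contained), and `run_statements` is a hypothetical helper, not part of this project.

```python
import sqlite3
from contextlib import closing


def run_statements(conn, statements):
    """Run SQL statements, commit on success, and always close the connection."""
    try:
        # closing() guarantees cur.close() even if a statement fails.
        with closing(conn.cursor()) as cur:
            for statement in statements:
                cur.execute(statement)
        conn.commit()
    except Exception:
        # Undo any partial work before re-raising the error.
        conn.rollback()
        raise
    finally:
        conn.close()


run_statements(
    sqlite3.connect(":memory:"),
    ["CREATE TABLE demo (id INTEGER PRIMARY KEY)",
     "INSERT INTO demo VALUES (1)"],
)
```

The same structure should carry over to the `cur` and `conn` objects in this notebook, since both expose `close()`, `commit()`, and `rollback()` under the Python DB-API.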

<section id="section_4e">
    <h3 style="text-align: center">Section 4-E: Write to the logging process</h3>
</section>

In [None]:
logging.info("Loading process completed.")