# Melbourne Electronics Store

## Relational Model

**Libraries and imports**

In [1]:
import re

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from faker import Faker

# from warnings import simplefilter
# simplefilter('ignore')
pd.set_option('display.max_columns', None)

**In this notebook, we will build the different entities that will later compose our relational database.**



The dataset was converted into `.csv` during our pre-EDA work, and is in good shape after the previous transformations.

From the Kaggle download, we already have the following table:
1. warehouses

Using the original data, `original_data.csv`, as a starting point, we will build five more tables:

2. products
3. order_items
4. order_reviews
5. orders
6. customers

## Creating Entities (tables)

### Warehouses

Even though the warehouses dataset already exists, I want to make a minor addition to it: the `warehouse_id` feature will help with the Data Normalization process in our relational database later.

In [48]:
# Load warehouses dataset

warehouses_df = pd.read_csv('../data/raw/warehouses.csv')
warehouses_df

Unnamed: 0,names,lat,lon
0,Nickolson,-37.818595,144.969551
1,Thompson,-37.812673,144.947069
2,Bakers,-37.809996,144.995232


In [3]:
# Add Primary Key, warehouse_id

warehouses_df['warehouse_id'] = ['MELBNICK', 'MELBTHOM', 'MELBBAKE']
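The IDs above are typed by hand. As a minimal sketch, assuming the convention is a `MELB` prefix followed by the first four letters of the warehouse name in upper case (an inference from the three IDs above, not a documented rule), the same keys could be derived from the names and checked for uniqueness. The frame `demo_df` is a stand-in for `warehouses_df`.

```python
import pandas as pd

# Stand-in for warehouses_df, using the three names from the dataset
demo_df = pd.DataFrame({'names': ['Nickolson', 'Thompson', 'Bakers']})

# Assumed convention: 'MELB' + first four letters of the name, upper-cased
demo_df['warehouse_id'] = 'MELB' + demo_df['names'].str[:4].str.upper()

# A Primary Key must be unique and non-null
assert demo_df['warehouse_id'].is_unique
assert demo_df['warehouse_id'].notna().all()

print(demo_df['warehouse_id'].tolist())
```

Deriving the key keeps it in sync with the names if a warehouse is ever added, while the two assertions guard against collisions between names that share their first four letters.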

In [4]:
# Standardize column names and reorder the columns

warehouses_df = warehouses_df.rename(columns={'names': 'warehouse_name'})

warehouses_df = warehouses_df[['warehouse_id', 'warehouse_name', 'lat', 'lon']]
warehouses_df

Unnamed: 0,warehouse_id,warehouse_name,lat,lon
0,MELBNICK,Nickolson,-37.818595,144.969551
1,MELBTHOM,Thompson,-37.812673,144.947069
2,MELBBAKE,Bakers,-37.809996,144.995232


In [5]:
# Save warehouses table to csv

warehouses_df.to_csv('../data/relational/warehouses.csv', index = False)

! ls ../data/relational/

customers.csv     order_reviews.csv products.csv
order_items.csv   orders.csv        warehouses.csv


### Original (base) dataset

From the original dataset (which is a stacked version of `dirty_df` and `missing_df` from the previous notebook), we can create all the other tables.

It is important to have it tidy!

Let's make it tidy and shiny (sorry for the puns, R), then:

In [6]:
# Load original_df and check dataset head

original_df = pd.read_csv('../data/raw/original_data.csv')

In [7]:
# Display dataset and dataset shape

display(original_df.head())
original_df.shape

Unnamed: 0,order_id,customer_id,date,nearest_warehouse,shopping_cart,order_price,delivery_charges,customer_lat,customer_long,coupon_discount,order_total,season,is_expedited_delivery,distance_to_nearest_warehouse,latest_customer_review,is_happy_customer
0,ORD493666,ID0844489306,01 27 2019,Thompson,"[('Lucent 330S', 1), ('pearTV', 2), ('iAssist ...",18300.0,74.75,-37.820755,144.948063,0,18374.75,Summer,False,0.9039,love it received the lucent g very fast and in...,True
1,ORD129378,ID6167417934,02 03 2019,Nickolson,"[('Olivia x460', 1), ('Universe Note', 2)]",8125.0,80.74,-37.814972,144.96024,10,7393.24,Autumn,True,0.9127,nice battery life great phone overall,True
2,ORD455246,ID4326586172,02-24-2019,Thompson,"[('iAssist Line', 1), ('Alcon 10', 2), ('Candl...",20985.0,81.66,-37.80077,144.95741,15,17918.91,Summer,False,1.6071,five stars absolutely fabulous,True
3,ORD497096,ID4735909071,03 10 2019,Thompson,"[('pearTV', 2), ('Thunder line', 2)]",16980.0,103.77,-37.804318,144.950049,5,16234.77,Spring,True,0.9663,this phone is wonderful!! olivia really out di...,True
4,ORD414419,ID0207085738,03 29 2019,Bakers,"[('iStream', 1), ('pearTV', 1)]",6460.0,81.31,-37.812585,145.015529,5,6218.31,Autumn,True,1.8081,awesome! the product fit the description. i lo...,True


(1000, 16)

In [8]:
# Take a closer look at the dataset info

original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   order_id                       1000 non-null   object 
 1   customer_id                    1000 non-null   object 
 2   date                           1000 non-null   object 
 3   nearest_warehouse              990 non-null    object 
 4   shopping_cart                  1000 non-null   object 
 5   order_price                    990 non-null    float64
 6   delivery_charges               1000 non-null   float64
 7   customer_lat                   990 non-null    float64
 8   customer_long                  990 non-null    float64
 9   coupon_discount                1000 non-null   int64  
 10  order_total                    990 non-null    float64
 11  season                         990 non-null    object 
 12  is_expedited_delivery          1000 non-null   bo

The dataset looks mostly fine. But since the objective of this notebook is to create a relational database, I will include the `warehouse_id` as a Foreign Key, instead of using the `warehouse_name`.

For that I will create a dictionary holding each `warehouse_name` as a key and its `warehouse_id` as the corresponding value:

In [9]:
# Create warehouse name and id mapping dictionary

warehouse_id_name_dict = dict(zip(warehouses_df['warehouse_name'], warehouses_df['warehouse_id']))
warehouse_id_name_dict

{'Nickolson': 'MELBNICK', 'Thompson': 'MELBTHOM', 'Bakers': 'MELBBAKE'}

In [10]:
# Map the nearest_warehouse (warehouse name) to nearest_warehouse_id

original_df['nearest_warehouse_id'] = original_df['nearest_warehouse'].map(warehouse_id_name_dict)

In [11]:
# Check

original_df.head(3)

Unnamed: 0,order_id,customer_id,date,nearest_warehouse,shopping_cart,order_price,delivery_charges,customer_lat,customer_long,coupon_discount,order_total,season,is_expedited_delivery,distance_to_nearest_warehouse,latest_customer_review,is_happy_customer,nearest_warehouse_id
0,ORD493666,ID0844489306,01 27 2019,Thompson,"[('Lucent 330S', 1), ('pearTV', 2), ('iAssist ...",18300.0,74.75,-37.820755,144.948063,0,18374.75,Summer,False,0.9039,love it received the lucent g very fast and in...,True,MELBTHOM
1,ORD129378,ID6167417934,02 03 2019,Nickolson,"[('Olivia x460', 1), ('Universe Note', 2)]",8125.0,80.74,-37.814972,144.96024,10,7393.24,Autumn,True,0.9127,nice battery life great phone overall,True,MELBNICK
2,ORD455246,ID4326586172,02-24-2019,Thompson,"[('iAssist Line', 1), ('Alcon 10', 2), ('Candl...",20985.0,81.66,-37.80077,144.95741,15,17918.91,Summer,False,1.6071,five stars absolutely fabulous,True,MELBTHOM
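`Series.map` returns a missing value for any name that is not a key of the dictionary, so a quick check can separate rows where the name was already missing from rows where it exists but did not match. The sketch below uses a hypothetical four-row sample (`demo_df`), with one missing name and one lower-cased name; whether such spellings occur in `original_df` is an assumption to verify, not a finding.

```python
import pandas as pd

# Mapping taken from the cell above
name_to_id = {'Nickolson': 'MELBNICK', 'Thompson': 'MELBTHOM', 'Bakers': 'MELBBAKE'}

# Hypothetical sample: one missing name and one lower-cased name
demo_df = pd.DataFrame({'nearest_warehouse': ['Thompson', None, 'nickolson', 'Bakers']})
demo_df['nearest_warehouse_id'] = demo_df['nearest_warehouse'].map(name_to_id)

# Rows where a name exists but no id was found point to a spelling or case mismatch
unmapped = demo_df[demo_df['nearest_warehouse'].notna() & demo_df['nearest_warehouse_id'].isna()]
print(unmapped)

# Normalizing the case before mapping resolves this particular mismatch
demo_df['nearest_warehouse_id'] = demo_df['nearest_warehouse'].str.title().map(name_to_id)
print(demo_df)
```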


### Products

The feature `shopping_cart` in the `original_df` dataset contains the list of products purchased in every order registered.

This is the information we need to start.

But first, a closer look:

In [12]:
# Check a sample of data points

original_df['shopping_cart'].sample(5)

481    [('iStream', 2), ('Alcon 10', 2), ('Candle Inf...
594             [('Lucent 330S', 1), ('Olivia x460', 1)]
311                [('iStream', 2), ('Thunder line', 1)]
485    [('Thunder line', 1), ('Olivia x460', 1), ('To...
728    [('Universe Note', 1), ('iStream', 2), ('iAssi...
Name: shopping_cart, dtype: object

In [13]:
# Print a single data point to inspect its contents

print(original_df['shopping_cart'][7])

[('Universe Note', 2), ('Olivia x460', 1), ('Thunder line', 2), ('pearTV', 2)]


The list of products seems to be contained within a `List[Tuple]` object, but in fact it is a string object. 

Since the value is plain text, one practical approach is to extract the pieces we need with regex.
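As an alternative sketch (not the approach used in this notebook): because each value is the text of a valid Python literal, the standard library's `ast.literal_eval` can turn it into a real list of tuples. This assumes every row is well-formed; a malformed cart would raise an error instead of being skipped.

```python
import ast

# One cart value, copied from the output above; it is a str, not a list
cart_text = "[('Universe Note', 2), ('Olivia x460', 1), ('Thunder line', 2), ('pearTV', 2)]"

# literal_eval safely evaluates Python literals (lists, tuples, strings, numbers)
cart = ast.literal_eval(cart_text)

for product_name, quantity in cart:
    print(product_name, quantity)
```

Regex stays tolerant of irregular rows, while `literal_eval` fails loudly on them; which behavior is preferable depends on how clean the column is.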

In [14]:
# Save regex pattern for product names in a variable

product_pattern = (r'\'(\w* ?\d*[\w.]*)\'')

regex_results_df = original_df['shopping_cart'].str.extractall(product_pattern, flags=re.IGNORECASE).reset_index()
regex_results_df.head(10)

Unnamed: 0,level_0,match,0
0,0,0,Lucent 330S
1,0,1,pearTV
2,0,2,iAssist Line
3,1,0,Olivia x460
4,1,1,Universe Note
5,2,0,iAssist Line
6,2,1,Alcon 10
7,2,2,Candle Inferno
8,3,0,pearTV
9,3,1,Thunder line


In [15]:
# Transform every string to Title case (and avoid problems with mismatching cases)

regex_results_df[0] = regex_results_df[0].str.title()
regex_results_df.head(10)

Unnamed: 0,level_0,match,0
0,0,0,Lucent 330S
1,0,1,Peartv
2,0,2,Iassist Line
3,1,0,Olivia X460
4,1,1,Universe Note
5,2,0,Iassist Line
6,2,1,Alcon 10
7,2,2,Candle Inferno
8,3,0,Peartv
9,3,1,Thunder Line


We have the list of products from the `original_df`.

Now if we rearrange things just a little, it will result in the `products_df` table.

Let's do it:

In [16]:
# Capture unique products names

products_set = set(regex_results_df.loc[:, 0])
products_set

{'Alcon 10',
 'Candle Inferno',
 'Iassist Line',
 'Istream',
 'Lucent 330S',
 'Olivia X460',
 'Peartv',
 'Thunder Line',
 'Toshika 750',
 'Universe Note'}

In [17]:
# From the newly-created set, create Products entity

pre_products_df = pd.DataFrame(products_set, columns=['product_name'])
pre_products_df

Unnamed: 0,product_name
0,Iassist Line
1,Alcon 10
2,Thunder Line
3,Olivia X460
4,Peartv
5,Candle Inferno
6,Istream
7,Toshika 750
8,Universe Note
9,Lucent 330S


The Electronics Store sells only 10 products.

To make it look more like a relational table, I will create a Primary Key `product_id`. 

For that, I will use `Faker`, a Python library designed to generate fake data.

In [18]:
# Create product_id from Faker

faker = Faker()

pre_products_df['product_id'] = [faker.ean13() for _ in range(len(products_set))]
pre_products_df

Unnamed: 0,product_name,product_id
0,Iassist Line,1817516002164
1,Alcon 10,5033789888489
2,Thunder Line,2832965681981
3,Olivia X460,3453269640826
4,Peartv,5601667425486
5,Candle Inferno,5216491962641
6,Istream,4075780128516
7,Toshika 750,9028783813533
8,Universe Note,8627546160887
9,Lucent 330S,5667303310202


In [19]:
# Organize features in prep work for Database

products_df = pre_products_df[['product_id', 'product_name']]
products_df

Unnamed: 0,product_id,product_name
0,1817516002164,Iassist Line
1,5033789888489,Alcon 10
2,2832965681981,Thunder Line
3,3453269640826,Olivia X460
4,5601667425486,Peartv
5,5216491962641,Candle Inferno
6,4075780128516,Istream
7,9028783813533,Toshika 750
8,8627546160887,Universe Note
9,5667303310202,Lucent 330S


In [20]:
# Save products table

products_df.to_csv('../data/relational/products.csv', index=False)

! ls ../data/relational/

customers.csv     order_reviews.csv products.csv
order_items.csv   orders.csv        warehouses.csv
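One thing to keep in mind when this file is read back: the `product_id` values are strings of digits, and an EAN-13 code can begin with zero. The sketch below, using a hypothetical product whose id starts with zeros, shows how the default CSV parsing would change such a key and how passing `dtype` avoids it.

```python
import io
import pandas as pd

# Hypothetical product with an id that starts with zeros
demo_products_df = pd.DataFrame({'product_id': ['0012345678905'],
                                 'product_name': ['Demo Product']})

csv_text = demo_products_df.to_csv(index=False)

# Default parsing infers an integer column and drops the leading zeros
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing the dtype keeps the key exactly as it was written
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'product_id': str})

print(as_int['product_id'].iloc[0])
print(as_str['product_id'].iloc[0])
```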


### Order Items

The next natural step would be creating the Orders table. But since the `original_df` is already the orders table in denormalized form, I will leave this one for later.

We can focus on creating the `order_items`, a dataset that will store information about products and product quantities purchased for every order.

In the following steps, I will:
- Create an initial dataset (`pre_order_items_df`) from the `original_df` containing the `index`, `order_id`, and `customer_id` features.
- Merge the resulting table with a temporary dataset that was already created when we used regex for the first time in this project (`regex_results_df`) and rename columns for readability.
- Map the `product_id` from the `products_df` table to the `product_name` in the newly-merged table.
- Create and store a sequential Primary Key for the merged table.
- Extract the product quantities with a second regex and merge them in through a temporary key.


In [21]:
# Create a pre_order_items_df dataset that will serve as base for building the order_items_df later

pre_order_items_df = original_df.reset_index()[['index', 'order_id', 'customer_id']]
pre_order_items_df.head()

Unnamed: 0,index,order_id,customer_id
0,0,ORD493666,ID0844489306
1,1,ORD129378,ID6167417934
2,2,ORD455246,ID4326586172
3,3,ORD497096,ID4735909071
4,4,ORD414419,ID0207085738


In [22]:
# Rename columns for readability (and avoid confusion)

regex_results_df_columns_renamed = regex_results_df.rename(columns={'level_0': 'index', 
                                                                    'match': 'match', 
                                                                    0: 'product_name'})

regex_results_df_columns_renamed.head(10)

Unnamed: 0,index,match,product_name
0,0,0,Lucent 330S
1,0,1,Peartv
2,0,2,Iassist Line
3,1,0,Olivia X460
4,1,1,Universe Note
5,2,0,Iassist Line
6,2,1,Alcon 10
7,2,2,Candle Inferno
8,3,0,Peartv
9,3,1,Thunder Line


##### A break for understanding what is going to happen next.

In the `regex_results_df_columns_renamed` dataset generated above, the `index` feature contains duplicated values. Those are expected.

When we computed the regex, the method `extractall` from `pandas` has converted every single object captured into its own row. 

That means that the first purchase registered in the dataset (`index == 0`) broke into as many rows as there were products in the `original_df['shopping_cart']` for that order.

The column `match` is the **sequential position of each product within a specific order's shopping cart** (again, orders are represented here by the `index` column).

By concatenating the `index` and `match` columns we will generate a unique temporary key for every product in every order. Example:

The order with `index == 0` purchased three different products:
- `0_0`, index 0, product 0
- `0_1`, index 0, product 1
- `0_2`, index 0, product 2

The order with `index == 1` purchased two products:
- `1_0`, index 1, product 0
- `1_1`, index 1, product 1
   

That key will be used to merge the `order_item_id` with the `product_quantity`, a feature we will also generate by leveraging regex. 
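A possible alternative, which is not what this notebook does, is to capture the name and the quantity in a single pass, so that both land on the same row and no temporary key is needed. The sketch below assumes every cart item follows the `('name', quantity)` layout; `carts`, `item_pattern` and `items_df` are hypothetical names, and the two carts are made-up examples in the same format as `shopping_cart`.

```python
import pandas as pd

# Two hypothetical carts in the same format as shopping_cart
carts = pd.Series([
    "[('Lucent 330S', 1), ('pearTV', 2), ('iAssist Line', 2)]",
    "[('Olivia x460', 1), ('Universe Note', 2)]",
])

# Hypothetical pattern: one named group per field of each (name, quantity) tuple
item_pattern = r"\('(?P<product_name>[^']+)',\s*(?P<product_quantity>\d+)\)"

items_df = (
    carts.str.extractall(item_pattern)   # one row per matched tuple
         .rename_axis(['index', 'match'])  # name the two index levels
         .reset_index()
)

# extractall returns text, so the quantity is converted to an integer
items_df['product_quantity'] = items_df['product_quantity'].astype(int)
print(items_df)
```

The trade-off is a stricter pattern: an item that deviates from the assumed layout is silently skipped, so comparing the number of extracted rows against the expected item count would be a sensible follow-up check.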

It sounds chaotic, but so is data in the real world.

Let's clean up that mess:

In [23]:
# Merge pre_order_items_df with regex_results_df_columns_renamed on index

pre_order_items_df_merged = pre_order_items_df.merge(regex_results_df_columns_renamed, on = 'index')
pre_order_items_df_merged.head()

Unnamed: 0,index,order_id,customer_id,match,product_name
0,0,ORD493666,ID0844489306,0,Lucent 330S
1,0,ORD493666,ID0844489306,1,Peartv
2,0,ORD493666,ID0844489306,2,Iassist Line
3,1,ORD129378,ID6167417934,0,Olivia X460
4,1,ORD129378,ID6167417934,1,Universe Note


In [24]:
# Get product_id from product_name in the products_df dataset and store it into a dict

product_id_name_dict = dict(zip(products_df['product_name'], products_df['product_id']))
product_id_name_dict

{'Iassist Line': '1817516002164',
 'Alcon 10': '5033789888489',
 'Thunder Line': '2832965681981',
 'Olivia X460': '3453269640826',
 'Peartv': '5601667425486',
 'Candle Inferno': '5216491962641',
 'Istream': '4075780128516',
 'Toshika 750': '9028783813533',
 'Universe Note': '8627546160887',
 'Lucent 330S': '5667303310202'}

In [25]:
# Map the product_name in the pre_order_items_df_merged dataset with the product_id_name_dict 

pre_order_items_df_merged['product_id'] = pre_order_items_df_merged['product_name'].map(product_id_name_dict)
pre_order_items_df_merged

Unnamed: 0,index,order_id,customer_id,match,product_name,product_id
0,0,ORD493666,ID0844489306,0,Lucent 330S,5667303310202
1,0,ORD493666,ID0844489306,1,Peartv,5601667425486
2,0,ORD493666,ID0844489306,2,Iassist Line,1817516002164
3,1,ORD129378,ID6167417934,0,Olivia X460,3453269640826
4,1,ORD129378,ID6167417934,1,Universe Note,8627546160887
...,...,...,...,...,...,...
2986,998,ORD032042,ID1725216340,1,Thunder Line,2832965681981
2987,998,ORD032042,ID1725216340,2,Istream,4075780128516
2988,999,ORD227618,ID0571730335,0,Thunder Line,2832965681981
2989,999,ORD227618,ID0571730335,1,Istream,4075780128516


In [26]:
# Create a sequential Primary Key (order_item_id), thinking ahead to the creation of the database

pre_order_items_df_merged['order_item_id'] = np.arange(len(pre_order_items_df_merged)) + 1
pre_order_items_df_merged.head()

Unnamed: 0,index,order_id,customer_id,match,product_name,product_id,order_item_id
0,0,ORD493666,ID0844489306,0,Lucent 330S,5667303310202,1
1,0,ORD493666,ID0844489306,1,Peartv,5601667425486,2
2,0,ORD493666,ID0844489306,2,Iassist Line,1817516002164,3
3,1,ORD129378,ID6167417934,0,Olivia X460,3453269640826,4
4,1,ORD129378,ID6167417934,1,Universe Note,8627546160887,5


In [27]:
# Create Temporary Key for merge with product_quantity
# .copy() makes the selection an independent DataFrame, so adding a column does not raise SettingWithCopyWarning

pre_order_items_df_org = pre_order_items_df_merged[['order_item_id', 
                                                    'order_id', 
                                                    'customer_id', 
                                                    'product_id', 
                                                    'product_name', 'index', 'match']].copy()

pre_order_items_df_org['temp_key'] = pre_order_items_df_org['index'].astype(str) + \
                                                        '_' + pre_order_items_df_org['match'].astype(str)

pre_order_items_df_org.sample(10)


Unnamed: 0,order_item_id,order_id,customer_id,product_id,product_name,index,match,temp_key
2394,2395,ORD016552,ID0283334264,8627546160887,Universe Note,797,1,797_1
2591,2592,ORD463467,ID6126066495,2832965681981,Thunder Line,866,1,866_1
568,569,ORD409041,ID0387700304,9028783813533,Toshika 750,192,1,192_1
204,205,ORD451115,ID0248746937,5601667425486,Peartv,70,1,70_1
2508,2509,ORD106414,ID0575392033,5667303310202,Lucent 330S,838,0,838_0
1282,1283,ORD261780,ID4655086233,8627546160887,Universe Note,428,1,428_1
1656,1657,ORD226165,ID6167489442,5216491962641,Candle Inferno,552,1,552_1
2636,2637,ORD197918,ID0630463867,4075780128516,Istream,882,0,882_0
2113,2114,ORD042295,ID0579512331,5033789888489,Alcon 10,702,0,702_0
725,726,ORD298098,ID0471441541,5667303310202,Lucent 330S,242,0,242_0


In [28]:
# Save regex pattern for product quantities in a variable
# (\d+ also captures quantities with more than one digit)

quantity_pattern = r"(\d+)\)"

# Extract the order_items quantities with the quantity_pattern

qty_regex_results_df = original_df['shopping_cart'].str.extractall(quantity_pattern, flags=re.IGNORECASE).reset_index()
qty_regex_results_df.columns = ['index', 'match', 'product_quantity']
qty_regex_results_df.head()

Unnamed: 0,index,match,product_quantity
0,0,0,1
1,0,1,2
2,0,2,2
3,1,0,1
4,1,1,2


In [29]:
# Create the same temporary key in the qty_regex_results_df, that will help bringing quantities to correct order_items

qty_regex_results_df['temp_key'] = qty_regex_results_df.loc[:, 'index'].astype(str) + \
                                                        '_' + qty_regex_results_df.loc[:, 'match'].astype(str)

qty_regex_results_df.head()

Unnamed: 0,index,match,product_quantity,temp_key
0,0,0,1,0_0
1,0,1,2,0_1
2,0,2,2,0_2
3,1,0,1,1_0
4,1,1,2,1_1
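As a side note, `DataFrame.merge` accepts a list of columns in `on`, so the pair (`index`, `match`) could serve as the join key directly, without building a concatenated string. A minimal sketch with hypothetical miniature frames (`names_df`, `quantities_df`):

```python
import pandas as pd

# Hypothetical miniature versions of the two frames being combined
names_df = pd.DataFrame({
    'index': [0, 0, 1],
    'match': [0, 1, 0],
    'product_name': ['Lucent 330S', 'Peartv', 'Olivia X460'],
})
quantities_df = pd.DataFrame({
    'index': [0, 0, 1],
    'match': [0, 1, 0],
    'product_quantity': ['1', '2', '1'],
})

# Merging on both columns plays the role of the concatenated temp_key;
# validate raises MergeError if a key pair is duplicated on either side
merged_df = names_df.merge(quantities_df, on=['index', 'match'], validate='one_to_one')
print(merged_df)
```

The `validate='one_to_one'` argument is optional, but it turns the assumption that each item matches exactly one quantity into an explicit check.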


In [30]:
# Create order_items_df by merging pre_order_items_df_org and qty_regex_results_df.
# product_name is left out, as product_id already identifies the product

order_items_columns = ['order_item_id', 'order_id', 'customer_id', 'product_id', 'product_quantity']

order_items_df = pre_order_items_df_org\
    .merge(qty_regex_results_df[['temp_key', 'product_quantity']], on = 'temp_key')[order_items_columns]

order_items_df.head()

Unnamed: 0,order_item_id,order_id,customer_id,product_id,product_quantity
0,1,ORD493666,ID0844489306,5667303310202,1
1,2,ORD493666,ID0844489306,5601667425486,2
2,3,ORD493666,ID0844489306,1817516002164,2
3,4,ORD129378,ID6167417934,3453269640826,1
4,5,ORD129378,ID6167417934,8627546160887,2


In [31]:
order_items_df.to_csv('../data/relational/order_items.csv', index=False)

! ls ../data/relational/

customers.csv     order_reviews.csv products.csv
order_items.csv   orders.csv        warehouses.csv


### Order Reviews

The `original_df` holds information about the `latest_customer_review`. 

The metadata explaining this feature in the Data Dictionary tells us that:

    `latest_customer_review`: A string representing the latest customer review on his/her most recent order
    
We also know that most of our orders were made by first-time customers, but not all of them.

If the information about the `latest_customer_review` in the Data Dictionary is true, we will see duplicated reviews for customers who are not first-time clients, each row repeating the review of their last order.

That would be odd. We assumed that `original_df` is a transactional dataset, and in a transactional dataset we would not expect earlier rows to be rewritten every time a customer buys for a second (or third, or fourth) time.

Let's dig deeper into that:

In [32]:
# Total duplicated rows

original_df[original_df.duplicated(subset=['customer_id'], keep=False)].shape

(54, 17)

In [33]:
# Filter duplicated rows, including all occurrences (keep=False)

is_happy_cust_check_df = original_df[original_df.duplicated(subset=['customer_id'], 
                                                            keep = False)].sort_values('customer_id')

is_happy_cust_check_df.head(10)

Unnamed: 0,order_id,customer_id,date,nearest_warehouse,shopping_cart,order_price,delivery_charges,customer_lat,customer_long,coupon_discount,order_total,season,is_expedited_delivery,distance_to_nearest_warehouse,latest_customer_review,is_happy_customer,nearest_warehouse_id
827,ORD437147,ID0052450505,2019-10-31,Thompson,"[('iAssist Line', 2), ('Alcon 10', 2)]",22350.0,85.96,-37.795479,144.936073,15,19083.46,,False,2.1445,this was a gift for a family member they reall...,True,MELBTHOM
663,ORD380695,ID0052450505,2019-09-06,Thompson,"[('Lucent 330S', 1), ('Olivia x460', 2)]",3680.0,86.92,-37.795479,144.936073,0,3766.92,Spring,False,2.1445,perfectly as described perfectly as described,True,MELBTHOM
117,ORD113085,ID0245493801,2019-02-06,Thompson,"[('Olivia x460', 2), ('Thunder line', 1)]",2757.0,84.54,-37.800862,144.960816,5,4483.04,Summer,False,1.7862,good phone phone is working good. the only thi...,True,MELBTHOM
177,ORD193841,ID0245493801,2019-02-27,Thompson,"[('Universe Note', 1), ('Olivia x460', 2), ('T...",16570.0,102.72,-37.800862,144.960816,10,15015.72,Summer,True,1.7862,just the right size for reading books and play...,True,MELBTHOM
388,ORD015960,ID0247024616,2019-05-19,Nickolson,"[('Thunder line', 1), ('iStream', 1), ('Lucent...",11690.0,63.78,-37.815455,,5,11169.28,Autumn,False,0.3585,nice cell phone... great value! so far so good...,True,MELBNICK
179,ORD397354,ID0247024616,2019-02-28,Nickolson,"[('pearTV', 2), ('Thunder line', 1), ('Candle ...",15660.0,87.96,-37.815455,144.970457,0,15747.96,Summer,True,0.3585,five stars easy to setup and use,True,MELBNICK
873,ORD139543,ID0283334330,2019-11-16,Thompson,"[('Universe Note', 2), ('Candle Inferno', 1), ...",18195.0,79.06,-37.800255,144.950962,0,18274.06,Spring,False,1.4241,"""uicc unlock"" amazing phone, especiall for the...",True,MELBTHOM
734,ORD336609,ID0283334330,2019-10-01,Thompson,"[('Lucent 330S', 2), ('Toshika 750', 1), ('iSt...",15880.0,105.94,-37.800255,144.950962,5,15191.94,Spring,True,1.4241,i received the thunder as a gift. i needed ano...,True,MELBTHOM
324,ORD029359,ID0305909619,2019-04-28,Bakers,"[('Alcon 10', 1), ('Thunder line', 2)]",13310.0,70.42,-37.802017,145.006551,5,12714.92,Autumn,False,1.3342,works great good price work great,True,MELBBAKE
220,ORD319019,ID0305909619,2019-03-17,Bakers,"[('iStream', 1), ('Universe Note', 2), ('iAssi...",12730.0,82.71,-37.802017,145.006551,25,9630.21,Autumn,True,1.3342,note 8 amazing product,True,MELBBAKE


In [34]:
# Group by for completing check

latest_customer_review_df_groupedby = is_happy_cust_check_df.groupby(['customer_id', 
                                                                   'latest_customer_review', 
                                                                   'is_happy_customer']).size() # size() only collapses the groups; the counts are not used

pd.DataFrame(latest_customer_review_df_groupedby).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,0
customer_id,latest_customer_review,is_happy_customer,Unnamed: 3_level_1
ID0052450505,perfectly as described perfectly as described,True,1
ID0052450505,this was a gift for a family member they really enjoy all the options of getting books,True,1
ID0245493801,good phone phone is working good. the only thing is they sent a crappy sim eject tool and not the original tool.,True,1
ID0245493801,just the right size for reading books and playing some games.,True,1
ID0247024616,five stars easy to setup and use,True,1
ID0247024616,nice cell phone... great value! so far so good... just activated on boost and will update review once i've broken it in a little.,True,1
ID0283334330,"""uicc unlock"" amazing phone, especiall for the price. you will need to go into the setting and select ""uicc unlock"" for it to work with some carriers, but it is unlocked after that. edit: i have to downgrade from 5 to 3. was locked to sprint and could not use data with other networks.",True,1
ID0283334330,"i received the thunder as a gift. i needed another bluetooth or something to play music easily accessible, and found this smart speaker. can’t wait to see what else it can do.",True,1
ID0305909619,note 8 amazing product,True,1
ID0305909619,works great good price work great,True,1


There are multiple distinct reviews for returning customers. The information in the Data Dictionary is wrong. We may want to correct that.

And that is only the start of the trouble.

Customers who left an individual review for each of their purchases also present conflicting information in the `is_happy_customer` feature. Let's review what the Data Dictionary has to say about this feature:

    `is_happy_customer`: A boolean denoting whether the customer is a happy customer or had an issue with his/her last order.
    
That's a very vague definition of a KPI, an instrument that should ideally follow the [SMART](https://www.tableau.com/learn/articles/smart-goals-criteria) pattern.

Moreover, `is_happy_customer` is an attribute of a customer, therefore it should not be stored in `order_reviews`, but instead in the `customers_df` itself.

On the other hand, since different orders carry different reviews, we can treat the sentiment of each review as a new order-level attribute, `is_positive_review`.

Before making any changes to `is_happy_customer`, let's use this feature to create the `is_positive_review` attribute:

In [35]:
# Create is_positive_review attribute

original_df['is_positive_review'] = original_df['is_happy_customer'].copy()
original_df.head()

Unnamed: 0,order_id,customer_id,date,nearest_warehouse,shopping_cart,order_price,delivery_charges,customer_lat,customer_long,coupon_discount,order_total,season,is_expedited_delivery,distance_to_nearest_warehouse,latest_customer_review,is_happy_customer,nearest_warehouse_id,is_positive_review
0,ORD493666,ID0844489306,01 27 2019,Thompson,"[('Lucent 330S', 1), ('pearTV', 2), ('iAssist ...",18300.0,74.75,-37.820755,144.948063,0,18374.75,Summer,False,0.9039,love it received the lucent g very fast and in...,True,MELBTHOM,True
1,ORD129378,ID6167417934,02 03 2019,Nickolson,"[('Olivia x460', 1), ('Universe Note', 2)]",8125.0,80.74,-37.814972,144.96024,10,7393.24,Autumn,True,0.9127,nice battery life great phone overall,True,MELBNICK,True
2,ORD455246,ID4326586172,02-24-2019,Thompson,"[('iAssist Line', 1), ('Alcon 10', 2), ('Candl...",20985.0,81.66,-37.80077,144.95741,15,17918.91,Summer,False,1.6071,five stars absolutely fabulous,True,MELBTHOM,True
3,ORD497096,ID4735909071,03 10 2019,Thompson,"[('pearTV', 2), ('Thunder line', 2)]",16980.0,103.77,-37.804318,144.950049,5,16234.77,Spring,True,0.9663,this phone is wonderful!! olivia really out di...,True,MELBTHOM,True
4,ORD414419,ID0207085738,03 29 2019,Bakers,"[('iStream', 1), ('pearTV', 1)]",6460.0,81.31,-37.812585,145.015529,5,6218.31,Autumn,True,1.8081,awesome! the product fit the description. i lo...,True,MELBBAKE,True


In [36]:
# Get customer_id for customers with different is_happy_customer records

is_happy_cust_check_df_grouped_by = is_happy_cust_check_df.groupby(['customer_id'])['is_happy_customer'].nunique()
not_happy_customers_ids_list = is_happy_cust_check_df_grouped_by[is_happy_cust_check_df_grouped_by == 2].index.tolist()
not_happy_customers_ids_list

['ID0419481490',
 'ID0457365601',
 'ID0710001161',
 'ID1217532720',
 'ID1492175313',
 'ID2189161536',
 'ID4544035096',
 'ID6167231018',
 'ID6167441023']

In [37]:
# Get the index values for customer_ids listed in the not_happy_customers_ids_list
idx_not_happy_customers = []

for cust_id in not_happy_customers_ids_list:
    for i in original_df[original_df['customer_id'] == cust_id].index.tolist():
        idx_not_happy_customers.append(i)
        
idx_not_happy_customers

[235,
 404,
 106,
 809,
 108,
 723,
 32,
 224,
 223,
 256,
 285,
 673,
 568,
 953,
 409,
 437,
 111,
 540]

In [38]:
# Generate an array of False values, the same size as idx_not_happy_customers

falses_list = np.zeros(len(idx_not_happy_customers)).astype('bool')
falses_list

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False])

In [39]:
# Update is_happy_customer with False values, since they diverge across the different orders of these customers

original_df.loc[idx_not_happy_customers, 'is_happy_customer'] = falses_list
original_df[original_df['is_happy_customer'] != original_df['is_positive_review']].head()

Unnamed: 0,order_id,customer_id,date,nearest_warehouse,shopping_cart,order_price,delivery_charges,customer_lat,customer_long,coupon_discount,order_total,season,is_expedited_delivery,distance_to_nearest_warehouse,latest_customer_review,is_happy_customer,nearest_warehouse_id,is_positive_review
108,ORD057375,ID0710001161,2019-02-03,Thompson,"[('Thunder line', 1), ('Alcon 10', 1), ('Candl...",18460.0,93.52,-37.813736,144.936811,25,665085.66,Summer,True,0.9098,best buy i've made on digico functions properl...,False,MELBTHOM,True
224,ORD268320,ID1217532720,2019-03-18,Thompson,"[('Candle Inferno', 2), ('iStream', 1)]",1010.0,65.83,-37.818128,144.948585,10,974.83,Autumn,False,0.6217,amazing value! excellent phone! amazing value.,False,MELBTHOM,True
235,ORD215616,ID0419481490,2019-03-21,Bakers,"[('iAssist Line', 2), ('Olivia x460', 2), ('Ca...",7760.0,79.46,-37.818237,144.996153,0,7839.46,Autumn,True,0.9209,great phone this phone works perfectly thanks,False,MELBBAKE,True
256,ORD434041,ID1492175313,2019-04-01,Thompson,"[('Lucent 330S', 2), ('iAssist Line', 1), ('iS...",5845.0,71.88,-37.802653,144.963491,10,5332.38,Autumn,False,1.8248,very pleased with my purchase of thunder smart...,False,MELBTHOM,True
285,ORD348417,ID2189161536,2019-04-12,Nickolson,"[('pearTV', 2), ('Candle Inferno', 2)]",13480.0,79.79,-37.817613,144.95911,0,13559.79,Autumn,True,0.9247,five stars great product at a great price.,False,MELBNICK,True
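The update carried out in the cells above (collect the customers, collect their row indexes, build an array of `False`, assign) can also be written with a boolean mask. The sketch below is a condensed variant run on a hypothetical four-row frame (`demo_df`), not a replacement for the cells above; it relies on `isin` to select the rows and on a scalar `False` being broadcast by `.loc`.

```python
import pandas as pd

# Hypothetical miniature of original_df: customer 'B' has conflicting flags
demo_df = pd.DataFrame({
    'customer_id': ['A', 'B', 'B', 'C'],
    'is_happy_customer': [True, True, False, True],
})

# Customers whose orders carry more than one distinct flag value
flags_per_customer = demo_df.groupby('customer_id')['is_happy_customer'].nunique()
conflicting_ids = flags_per_customer[flags_per_customer > 1].index

# A boolean mask replaces the index list, and a scalar False broadcasts to every selected row
mask = demo_df['customer_id'].isin(conflicting_ids)
demo_df.loc[mask, 'is_happy_customer'] = False
print(demo_df)
```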


In [40]:
# Create order_reviews dataset

order_reviews_df = original_df[['order_id', 'customer_id', 'latest_customer_review', 'is_positive_review']]

order_reviews_df.head()

Unnamed: 0,order_id,customer_id,latest_customer_review,is_positive_review
0,ORD493666,ID0844489306,love it received the lucent g very fast and in...,True
1,ORD129378,ID6167417934,nice battery life great phone overall,True
2,ORD455246,ID4326586172,five stars absolutely fabulous,True
3,ORD497096,ID4735909071,this phone is wonderful!! olivia really out di...,True
4,ORD414419,ID0207085738,awesome! the product fit the description. i lo...,True


In [41]:
# Save order_reviews table to csv
order_reviews_df.to_csv('../data/relational/order_reviews.csv', index = False)

!ls ../data/relational/

customers.csv     order_reviews.csv products.csv
order_items.csv   orders.csv        warehouses.csv


### Orders

This is the simplest of them all. We just need to select the columns that make sense from the `original_df`.

In [42]:
original_df.columns.to_list()

['order_id',
 'customer_id',
 'date',
 'nearest_warehouse',
 'shopping_cart',
 'order_price',
 'delivery_charges',
 'customer_lat',
 'customer_long',
 'coupon_discount',
 'order_total',
 'season',
 'is_expedited_delivery',
 'distance_to_nearest_warehouse',
 'latest_customer_review',
 'is_happy_customer',
 'nearest_warehouse_id',
 'is_positive_review']

In [43]:
orders_df = original_df[
    [
        'order_id',
        'customer_id',
        'date',
        'is_expedited_delivery',
        'nearest_warehouse_id',
        'distance_to_nearest_warehouse',
        'order_price',
        'delivery_charges',
        'coupon_discount',
        'order_total',
    ]
]

orders_df.head()

Unnamed: 0,order_id,customer_id,date,is_expedited_delivery,nearest_warehouse_id,distance_to_nearest_warehouse,order_price,delivery_charges,coupon_discount,order_total
0,ORD493666,ID0844489306,01 27 2019,False,MELBTHOM,0.9039,18300.0,74.75,0,18374.75
1,ORD129378,ID6167417934,02 03 2019,True,MELBNICK,0.9127,8125.0,80.74,10,7393.24
2,ORD455246,ID4326586172,02-24-2019,False,MELBTHOM,1.6071,20985.0,81.66,15,17918.91
3,ORD497096,ID4735909071,03 10 2019,True,MELBTHOM,0.9663,16980.0,103.77,5,16234.77
4,ORD414419,ID0207085738,03 29 2019,True,MELBBAKE,1.8081,6460.0,81.31,5,6218.31
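Before saving, the monetary columns can be cross-checked against each other. In the sample rows above, `order_total` appears to equal `order_price` reduced by `coupon_discount` percent, plus `delivery_charges`. That formula is an inference from those rows, not something stated in the Data Dictionary, so the sketch below should be read as a sanity check under that assumption. The three rows are copied from outputs shown earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Three rows copied from outputs shown earlier; the last total looks suspicious
demo_orders_df = pd.DataFrame({
    'order_id': ['ORD493666', 'ORD129378', 'ORD057375'],
    'order_price': [18300.0, 8125.0, 18460.0],
    'delivery_charges': [74.75, 80.74, 93.52],
    'coupon_discount': [0, 10, 25],
    'order_total': [18374.75, 7393.24, 665085.66],
})

# Assumed formula: discounted price plus delivery charges
expected_total = (
    demo_orders_df['order_price'] * (1 - demo_orders_df['coupon_discount'] / 100)
    + demo_orders_df['delivery_charges']
)

# Compare with a one-cent absolute tolerance to absorb floating point noise
is_consistent = np.isclose(demo_orders_df['order_total'], expected_total, rtol=0, atol=0.01)
print(demo_orders_df.loc[~is_consistent, 'order_id'].tolist())
```

A flagged row only says that the three amounts disagree; it does not say which of them is the wrong one.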


In [44]:
orders_df.to_csv('../data/relational/orders.csv', index=False)

!ls ../data/relational/

customers.csv     order_reviews.csv products.csv
order_items.csv   orders.csv        warehouses.csv


### Customers

The `original_df` is a transactional table storing information about the Electronics Store's orders. We know that the same customer can order multiple times from the Electronics Store. 

Selecting the distinct values of `customer_id` will output a list of unique customers. And that is exactly what I am interested in.

Other customer-related information present in the original dataset is:
- `customer_lat`
- `customer_long`
- `nearest_warehouse_id`
- `is_happy_customer`

In [45]:
# How many unique customers?

original_df['customer_id'].nunique()

973

In [46]:
# Selecting unique values of customer_id and combining it with other customer-related features

customers_df = original_df.drop_duplicates(subset=['customer_id'], keep='first')[['customer_id', 
                                                                                  'customer_lat', 
                                                                                  'customer_long', 
                                                                                  'nearest_warehouse_id',
                                                                                  'is_happy_customer']]

customers_df.head()

Unnamed: 0,customer_id,customer_lat,customer_long,nearest_warehouse_id,is_happy_customer
0,ID0844489306,-37.820755,144.948063,MELBTHOM,True
1,ID6167417934,-37.814972,144.96024,MELBNICK,True
2,ID4326586172,-37.80077,144.95741,MELBTHOM,True
3,ID4735909071,-37.804318,144.950049,MELBTHOM,True
4,ID0207085738,-37.812585,145.015529,MELBBAKE,True
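One caveat with `drop_duplicates(keep='first')`: it keeps the whole first row of each customer, including any missing coordinate, even if a later order of the same customer has the value. Whether that affects any customer here depends on the row order, so the following is only a sketch of an alternative on a hypothetical three-row frame (`demo_df`). It uses `GroupBy.first`, which takes the first non-null value of each column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of original_df: the first row of customer 'A' lacks a longitude
demo_df = pd.DataFrame({
    'customer_id': ['A', 'A', 'B'],
    'customer_lat': [-37.815455, -37.815455, -37.802017],
    'customer_long': [np.nan, 144.970457, 145.006551],
})

# drop_duplicates keeps the whole first row, missing value included
first_row_df = demo_df.drop_duplicates(subset=['customer_id'], keep='first')

# GroupBy.first takes the first non-null value of each column within a group
first_valid_df = demo_df.groupby('customer_id', sort=False).first().reset_index()

print(first_row_df)
print(first_valid_df)
```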


In [47]:
# Save dataset to csv
customers_df.to_csv('../data/relational/customers.csv', index=False)

! ls ../data/relational

customers.csv     order_reviews.csv products.csv
order_items.csv   orders.csv        warehouses.csv


## End of Notebook

In this notebook, I have created 6 different datasets that will later be used to create 6 different relational tables in our database. They are:

1. warehouses
2. products
3. order_items
4. order_reviews
5. orders
6. customers
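Since these tables will be linked through Primary and Foreign Keys, a last check worth running before loading them into a database is that every Foreign Key value exists in its parent table. The sketch below illustrates the idea with hypothetical miniature tables; `find_orphans` is a helper name made up for this example.

```python
import pandas as pd

# Hypothetical miniature tables; the last order item points to an unknown product
demo_products_df = pd.DataFrame({'product_id': ['P1', 'P2']})
demo_order_items_df = pd.DataFrame({
    'order_item_id': [1, 2, 3],
    'product_id': ['P1', 'P2', 'P9'],
})

def find_orphans(child_df, parent_df, key):
    """Return the child rows whose key has no match in the parent table."""
    return child_df[~child_df[key].isin(parent_df[key])]

orphans_df = find_orphans(demo_order_items_df, demo_products_df, 'product_id')
print(orphans_df)
```

The same call could be repeated for each relationship, for example `order_items` against `orders` on `order_id`, or `orders` against `customers` on `customer_id`.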