# Melbourne Electronics Store

## Exploratory Data Analysis

**Libraries & imports**

In [None]:
import re

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from faker import Faker

from warnings import simplefilter
simplefilter('ignore')
pd.set_option('display.max_columns', None)

## The Dataset

**This dataset is publicly available, courtesy of Shahrayar (Owner).**

**Thank you, Shahrayar!**


----
### Melbourne Electronics Store dataset

**Data Dictionary** 
- `order_id`: A unique id for each order
- `customer_id`: A unique id for each customer
- `date`: The date the order was made, given in YYYY-MM-DD format
- `nearest_warehouse`: A string denoting the name of the nearest warehouse to the customer
- `shopping_cart`: A list of tuples representing the order items: the first element of the tuple is the item ordered, and the second element is the quantity ordered for such item.
- `order_price`: A float denoting the order price in USD. The order price is the price of items before any discounts and/or delivery charges are applied.
- `delivery_charges`: A float representing the delivery charges of the order
- `customer_lat`: Latitude of the customer’s location
- `customer_long`: Longitude of the customer’s location
- `coupon_discount`: An integer denoting the percentage discount to be applied to the `order_price`.
- `order_total`: A float denoting the total of the order in USD after all discounts and/or delivery charges are applied.
- `season`: A string denoting the season in which the order was placed.
- `is_expedited_delivery`: A boolean denoting whether the customer has requested an expedited delivery
- `distance_to_nearest_warehouse`: A float representing the arc distance, in kilometres, between the customer and the nearest warehouse to him/her.
- `latest_customer_review`: A string representing the latest customer review on his/her most recent order
- `is_happy_customer`: A boolean denoting whether the customer is a happy customer or had an issue with his/her last order.
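The dictionary says `order_total` is the price after discounts and delivery charges, but it does not spell out the arithmetic. The sketch below assumes the discount applies to `order_price` only and delivery is added afterwards. That formula is an assumption to verify against the data, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset.
sample = pd.DataFrame({
    'order_price': [1000.0, 500.0],
    'delivery_charges': [50.0, 80.0],
    'coupon_discount': [10, 0],
    'order_total': [950.0, 600.0],
})

# Assumed relationship: the discount applies to order_price, delivery is added afterwards.
expected_total = (
    sample['order_price'] * (1 - sample['coupon_discount'] / 100)
    + sample['delivery_charges']
)

# Flag rows whose recorded total deviates from the assumed formula by more than a cent.
sample['total_mismatch'] = (sample['order_total'] - expected_total).abs() > 0.01
print(sample)
```

If the assumed formula holds for most rows, the flagged ones are candidates for the "dirty" entries.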

**Inspiration**

Use this dataset to perform graphical and/or non-graphical EDA methods to understand the data first and then find and fix the data problems.

**Possible analysis**
- Detect and fix errors in dirty_data.csv
- Impute the missing values in missing_data.csv
- Detect and remove anomalies
- Check whether a customer is happy with their last order
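Before imputing anything, it helps to see which columns have gaps. This is a minimal sketch on a made-up frame; the column names follow the data dictionary, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps, for illustration only.
toy = pd.DataFrame({
    'order_price': [100.0, np.nan, 250.0],
    'season': ['Summer', 'Winter', None],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

The same call on the real `missing` frame should show which columns need imputation.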

In [None]:
# Check file names
!ls ../data/raw

In [None]:
# Load dirty_data dataset
dirty = pd.read_csv('../data/raw/dirty_data.csv')
dirty.head()

In [None]:
# Check dirty dataset's info
dirty.info()

In [None]:
# Loading missing_data dataset
missing = pd.read_csv('../data/raw/missing_data.csv')
missing.head()

In [None]:
# Check missing dataset's info
missing.info()

In [None]:
# Load warehouses dataset
warehouses = pd.read_csv('../data/raw/warehouses.csv')
warehouses.head()

In [None]:
# Check warehouses dataset's info
warehouses.info()

In [None]:
warehouses['warehouse_id'] = [1, 2, 3]

In [None]:
warehouses

In [None]:
# Where are these lat, lon?
color_scale = [(0, 'blue'), (1, 'gray')]

fig = px.scatter_mapbox(warehouses, 
                        lat="lat", 
                        lon="lon", 
                        hover_name="names", 
                        hover_data=["names", "names"],
                        color="names",
                        color_continuous_scale=color_scale,
                        zoom=12, 
                        height=300,
                        width=950)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

**Melbourne**!

In [None]:
# Missing? Dirty? Why two datasets?

print('Missing dataset: ', missing.shape)
display(missing.head())
print('\n')
print('Dirty dataset: ', dirty.shape)
display(dirty.head())

In [None]:
# Check that both datasets share the same columns, in the same order
missing.columns.equals(dirty.columns)

In [None]:
orders = pd.concat([missing, dirty]).sort_values('date').reset_index(drop=True)
orders.head()

In [None]:
orders.info()

Research:
- Shopping cart? Are the order items glued together within a `List[Tuple]`?
- We don't have a price per order item?
- The dataset is organized by orders.
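One way to unpack a cart, assuming each `shopping_cart` value is a string holding a valid Python literal (a list of `(item, quantity)` tuples), is `ast.literal_eval` followed by `explode`. The cart strings below are hypothetical; this is a sketch, not necessarily the approach used in the cells that follow.

```python
import ast
import pandas as pd

# Hypothetical cart strings in the "list of tuples" shape from the data dictionary.
carts = pd.Series([
    "[('Widget A', 2), ('Widget B', 1)]",
    "[('Widget C', 3)]",
], name='shopping_cart')

# Parse each string into a real Python list of (item, quantity) tuples.
parsed = carts.apply(ast.literal_eval)

# One row per order item; the index keeps the position of the originating order.
exploded = parsed.explode()
items = pd.DataFrame(
    exploded.tolist(),
    index=exploded.index,
    columns=['product_name', 'quantity'],
)
print(items)
```

Unlike a name-only regex, this keeps the quantity next to each item, which matters if per-item prices are to be inferred later.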

In [None]:
orders[['order_price', 'order_total', 'coupon_discount']]

In [None]:
# Capture every single-quoted product name inside the cart string
pattern = r"'(\w* ?\d*[\w.]*)'"

orders_items_re_df = orders['shopping_cart'].str.extractall(pattern, flags=re.IGNORECASE)
orders_items_re_df

In [None]:
level_0_product_df = orders_items_re_df.reset_index()
level_0_product_df

In [None]:
index_customer_id_df = orders.reset_index()[['index', 'order_id', 'customer_id']]

order_items_messed = index_customer_id_df.merge(level_0_product_df, right_on='level_0', left_on='index')
order_items_messed.columns = ['index', 'order_id', 'customer_id', 'level_0', 'match', 'product_name']
order_items_df = order_items_messed.iloc[:, [1, 2, 5]]
order_items_df.head()

In [None]:
products_unique = order_items_df['product_name'].unique()
products_unique

In [None]:
faker = Faker()

In [None]:
# Generate one fake EAN-13 code per unique product name
product_id = [faker.ean13() for _ in range(len(products_unique))]

len(set(product_id)) == len(product_id) # check if values are unique

In [None]:
# product_id_dict = dict(zip(product_id, products_unique))
# product_id_dict

In [None]:
products_df = pd.DataFrame()
products_df['product_id'] = product_id
products_df['product_name'] = products_unique

In [None]:
products_df

In [None]:
products_df['product_name'] = products_df['product_name'].str.title()
products_df.sample(10)

In [None]:
display(products_df.head())

display(order_items_df.head())

In [None]:
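# product_name in products_df was title-cased above, so normalize the left key the same way;
# drop names that collapse to the same title-cased value to avoid duplicated rows
order_items_df = (
    order_items_df
    .assign(product_name=order_items_df['product_name'].str.title())
    .merge(products_df.drop_duplicates('product_name'), on='product_name')
    [['order_id', 'customer_id', 'product_id', 'product_name']]
)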
order_items_df.head()

In [None]:
orders.head()

In [None]:
orders[['order_id', 'order_price', 'delivery_charges', 'is_expedited_delivery', 'nearest_warehouse', 'distance_to_nearest_warehouse']]
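
The data dictionary describes `distance_to_nearest_warehouse` as an arc distance in kilometres. A common way to recompute such a distance is the haversine formula, sketched below. The helper name `haversine_km` and the Earth radius of 6371 km are assumptions; the dataset may have been built with a different radius or formula.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees.

    radius_km is an assumed mean Earth radius, not a value taken from the dataset.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Hypothetical coordinates, for illustration only.
print(haversine_km(-37.81, 144.96, -37.80, 144.97))
```

Because the function uses NumPy operations, it also accepts whole columns, for example `orders['customer_lat']` and `orders['customer_long']` against one warehouse's coordinates.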

In [None]:
len('melb-electronics-store-data-modeling')

## Customers

In [None]:
orders.head()

In [None]:
orders['customer_id'].nunique() # Almost all clients are first-time buyers
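
The count of unique customers alone does not show how many bought only once. A minimal sketch of how to measure that share, using a made-up order log (the ids below are hypothetical):

```python
import pandas as pd

# Hypothetical order log for illustration only.
toy_orders = pd.DataFrame({
    'order_id': ['ORD1', 'ORD2', 'ORD3', 'ORD4'],
    'customer_id': ['ID1', 'ID2', 'ID1', 'ID3'],
})

# Number of orders placed by each customer.
orders_per_customer = toy_orders['customer_id'].value_counts()

# Share of customers that appear exactly once.
single_order_share = (orders_per_customer == 1).mean()
print(single_order_share)
```

Running the same two lines on `orders['customer_id']` would back the first-time-buyer observation with a number.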

In [None]:
customers_df = orders.drop_duplicates(subset=['customer_id'], keep='first')[['customer_id', 
                                                                             'customer_lat', 
                                                                             'customer_long', 
                                                                             'nearest_warehouse']]

customers_df.head()