# **Cleaning and Preparation of data**

The source data is the Amazon Sales Report file from https://www.kaggle.com/datasets/thedevastator/unlock-profits-with-e-commerce-sales-data/data.

It captures the fulfilment and delivery status of individual transactions. However, the dataset primarily covers textile products, while our project focuses on electronics. Hence, we generate synthetic data that follows the constraints observed in the columns of interest in the Amazon Sales Report dataset.

In [1]:
# Importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
# from thefuzz import process, fuzz
from faker import Faker
import random
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load datasets

df = pd.read_csv('../source/Amazon Sale Report.csv')
online_sales_df = pd.read_csv('../online_sales_edited.csv')

In [60]:
# Check the first few rows of df

df.head()

Unnamed: 0,index,Order ID,Date,Status,Fulfilment,Sales Channel,ship-service-level,Style,SKU,Category,...,currency,Amount,ship-city,ship-state,ship-postal-code,ship-country,promotion-ids,B2B,fulfilled-by,Unnamed: 22
0,0,405-8078784-5731545,04-30-22,Cancelled,Merchant,Amazon.in,Standard,SET389,SET389-KR-NP-S,Set,...,INR,647.62,MUMBAI,MAHARASHTRA,400081.0,IN,,False,Easy Ship,
1,1,171-9198151-1101146,04-30-22,Shipped - Delivered to Buyer,Merchant,Amazon.in,Standard,JNE3781,JNE3781-KR-XXXL,kurta,...,INR,406.0,BENGALURU,KARNATAKA,560085.0,IN,Amazon PLCC Free-Financing Universal Merchant ...,False,Easy Ship,
2,2,404-0687676-7273146,04-30-22,Shipped,Amazon,Amazon.in,Expedited,JNE3371,JNE3371-KR-XL,kurta,...,INR,329.0,NAVI MUMBAI,MAHARASHTRA,410210.0,IN,IN Core Free Shipping 2015/04/08 23-48-5-108,True,,
3,3,403-9615377-8133951,04-30-22,Cancelled,Merchant,Amazon.in,Standard,J0341,J0341-DR-L,Western Dress,...,INR,753.33,PUDUCHERRY,PUDUCHERRY,605008.0,IN,,False,Easy Ship,
4,4,407-1069790-7240320,04-30-22,Shipped,Amazon,Amazon.in,Expedited,JNE3671,JNE3671-TU-XXXL,Top,...,INR,574.0,CHENNAI,TAMIL NADU,600073.0,IN,,False,,


In [61]:
# Check column names of df

df.columns

Index(['index', 'Order ID', 'Date', 'Status', 'Fulfilment', 'Sales Channel ',
       'ship-service-level', 'Style', 'SKU', 'Category', 'Size', 'ASIN',
       'Courier Status', 'Qty', 'currency', 'Amount', 'ship-city',
       'ship-state', 'ship-postal-code', 'ship-country', 'promotion-ids',
       'B2B', 'fulfilled-by', 'Unnamed: 22'],
      dtype='object')

We are interested in the `Status`, `Fulfilment`, `ship-service-level`, `Courier Status`, `B2B` and `fulfilled-by` columns. Hence, we create corresponding columns in `online_sales_df`.

In [62]:
# Create column names of interest in online_sales_df

column_names = ['status', 'fulfilment', 'ship_service_level', 'courier_status', 'b2b', 'fulfilled_by']

for i in column_names:
    online_sales_df[i] = np.nan

The columns that we wish to generate synthetic data for based on the constraints on the original dataset are: `status`, `fulfilment`, `ship_service_level`, `courier_status` and `b2b`.
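
As a rough illustration of how such constraints can be read off a column, the sketch below uses a toy Series (its values are made up, not taken from the dataset). It shows that a single `value_counts(normalize=True)` call yields both the categories and their shares, and that missing entries are excluded by default, so the shares are computed over non-missing rows only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one source column; the values are illustrative only.
toy_status = pd.Series(["Shipped", "Shipped", "Cancelled", np.nan, "Shipped"])

# One value_counts call gives both the categories and their shares.
# NaN is dropped by default, so shares are over the 4 non-missing rows.
shares = toy_status.value_counts(normalize=True)

categories = shares.index.tolist()   # ['Shipped', 'Cancelled']
probabilities = shares.to_numpy()    # array([0.75, 0.25])

print(categories, probabilities)
```

Because the shares sum to 1, they can be passed directly as the `p` argument of a weighted random choice.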

In [63]:
# Tabulate the values and constraints of df for synthetic data

values = []
prob = []

data_generation_names = ["Status", "Fulfilment", "ship-service-level", "Courier Status", "B2B"]

for name in data_generation_names:
    # Compute the normalized counts once and reuse them for both lists
    shares = df[name].value_counts(normalize=True)
    values.append(shares.index.tolist())
    prob.append(shares.values)

Now, we populate the columns with synthetic data based on the values and probabilities we have derived from our original Amazon Sales Report dataset.
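
A minimal sketch of this sampling step on toy inputs follows. The categories, the shares and the seed are assumptions chosen for illustration, and the seeded `np.random.default_rng` generator is a swapped-in alternative to the unseeded `np.random.choice` used in this notebook; it makes the synthetic column reproducible across runs. The sketch also compares the target shares with the shares actually drawn.

```python
import numpy as np
import pandas as pd

# Hypothetical categories and shares, standing in for one entry of `values` and `prob`.
categories = ["Standard", "Expedited"]
probabilities = np.array([0.3, 0.7])

# A seeded generator makes the synthetic column reproducible across runs.
rng = np.random.default_rng(42)  # the seed value is arbitrary
synthetic = pd.Series(rng.choice(categories, size=10_000, p=probabilities))

# Compare the target shares with the shares actually drawn.
comparison = pd.DataFrame({
    "target": pd.Series(probabilities, index=categories),
    "synthetic": synthetic.value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["target"] - comparison["synthetic"]).abs()
print(comparison)
```

Note that each column is sampled independently, so relationships between columns in the source data (for example between `status` and `courier_status`) are not preserved by this approach.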

In [64]:
# Generate synthetic data based on the constraints tabulated

synthetic_data_names = ["status", "fulfilment", "ship_service_level", "courier_status", "b2b"]
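
# Pair each target column with the values and probabilities tabulated for it
for name, column_values, column_probs in zip(synthetic_data_names, values, prob):
    online_sales_df[name] = np.random.choice(column_values, size = len(online_sales_df), p = column_probs)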

online_sales_df.head()

Unnamed: 0,user_id,transaction_id,date,product_id,Quantity,Delivery_Charges,Coupon_Status,Coupon_Code,Discount_pct,status,fulfilment,ship_service_level,courier_status,b2b,fulfilled_by
0,17850,16679,2019-01-01,B09DL9978Y,1,6.5,Used,ELEC10,0.1,Shipped,Amazon,Standard,Shipped,False,
1,17850,16680,2019-01-01,B09DL9978Y,1,6.5,Used,ELEC10,0.1,Shipped,Amazon,Expedited,Shipped,False,
2,17850,16681,2019-01-01,B07GXHC691,1,6.5,Used,OFF10,0.1,Shipped,Amazon,Expedited,Shipped,False,
3,17850,16682,2019-01-01,B08NCKT9FG,5,6.5,Not Used,SALE10,0.1,Shipped - Delivered to Buyer,Amazon,Expedited,Shipped,False,
4,17850,16682,2019-01-01,B08H21B6V7,1,6.5,Used,AIO10,0.1,Cancelled,Amazon,Expedited,Shipped,False,


The value of `fulfilled_by` depends on the `fulfilment` and `b2b` columns, so we condition on both when generating it. The top third-party e-commerce fulfilment companies in India include Quickshift Fulfillment, Shiprocket Fulfillment, Prozo and DHL. We also include Easy Ship, the company featured in the original dataset. Prozo and DHL offer services for B2B transactions, so they are assigned to merchant-fulfilled B2B orders, while the other three are assigned to merchant-fulfilled non-B2B orders. Orders fulfilled by Amazon are left with an empty `fulfilled_by` value.

In [65]:
# Generate synthetic data for fulfilled_by column

fulfilled_by = ["Easy Ship", "Quickshift Fulfillment", "Shiprocket Fulfillment"]
fulfilled_by_b2b = ["Prozo", "DHL"]

conditions = [(online_sales_df["fulfilment"] == "Merchant") & (online_sales_df["b2b"] == False), 
              (online_sales_df["fulfilment"] == "Merchant") & (online_sales_df["b2b"] == True)]
choices = [np.random.choice(fulfilled_by, size = len(online_sales_df)), np.random.choice(fulfilled_by_b2b, size = len(online_sales_df))]

online_sales_df["fulfilled_by"] = (
    np.select(
        conditions,
        choices,
        default = ""
    )
)

online_sales_df[online_sales_df["fulfilment"] == "Merchant"].head()

Unnamed: 0,user_id,transaction_id,date,product_id,Quantity,Delivery_Charges,Coupon_Status,Coupon_Code,Discount_pct,status,fulfilment,ship_service_level,courier_status,b2b,fulfilled_by
6,17850,16682,2019-01-01,B09Y5FZK9N,15,6.5,Not Used,EXTRA10,0.1,Shipped,Merchant,Expedited,Shipped,False,Quickshift Fulfillment
9,13047,16682,2019-01-01,B08XMG618K,52,6.5,Used,OFF10,0.1,Cancelled,Merchant,Expedited,Shipped,False,Shiprocket Fulfillment
12,13047,16682,2019-01-01,B07Z1X6VFC,5,6.5,Used,SALE10,0.1,Shipped,Merchant,Expedited,Shipped,False,Quickshift Fulfillment
16,13047,16685,2019-01-01,B00KRCBA6E,1,6.5,Clicked,SALE10,0.1,Cancelled,Merchant,Expedited,Unshipped,False,Shiprocket Fulfillment
18,13047,16687,2019-01-01,B08ZN4B121,1,6.5,Clicked,EXTRA10,0.1,Shipped,Merchant,Expedited,Shipped,False,Shiprocket Fulfillment
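
The rules above can be checked with a small helper. This is only a sketch: the function name `fulfilled_by_is_consistent` and the toy rows are hypothetical, while the provider lists mirror the ones used in the cell above.

```python
import pandas as pd

# Provider lists mirror those used when generating the column.
CONSUMER_PROVIDERS = {"Easy Ship", "Quickshift Fulfillment", "Shiprocket Fulfillment"}
B2B_PROVIDERS = {"Prozo", "DHL"}

def fulfilled_by_is_consistent(frame: pd.DataFrame) -> bool:
    """Return True if every row follows the fulfilment / b2b rules."""
    merchant = frame["fulfilment"] == "Merchant"
    b2b = frame["b2b"].astype(bool)
    # Merchant, non-B2B rows must use a consumer provider.
    ok_consumer = frame.loc[merchant & ~b2b, "fulfilled_by"].isin(CONSUMER_PROVIDERS).all()
    # Merchant, B2B rows must use a B2B provider.
    ok_b2b = frame.loc[merchant & b2b, "fulfilled_by"].isin(B2B_PROVIDERS).all()
    # All other rows (e.g. fulfilled by Amazon) must be empty.
    ok_other = (frame.loc[~merchant, "fulfilled_by"] == "").all()
    return bool(ok_consumer and ok_b2b and ok_other)

# Toy rows, illustrative only.
toy = pd.DataFrame({
    "fulfilment": ["Merchant", "Merchant", "Amazon"],
    "b2b": [False, True, False],
    "fulfilled_by": ["Easy Ship", "DHL", ""],
})
print(fulfilled_by_is_consistent(toy))  # True
```

A check like this is best run before the frame is written to CSV, since `pd.read_csv` reads empty strings back as `NaN` by default and the empty-string comparison would then fail.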


In [None]:
online_sales_df.to_csv('../online_sales_edited.csv', index = False)

# **Creating origin_area for Products dataset**

To analyse supplier performance, we create a new column in `products.csv`, `origin_area`.

We want to generate synthetic data for this new column. We use another dataset, containing information about the suppliers of products sold on Amazon, to obtain the distribution and constraints for `origin_area`, which holds both the city and country of origin.

In [4]:
# Load datasets
products_df = pd.read_csv('../products.csv')
suppliers_df = pd.read_csv('../source/Amazon Supplier List.csv')

suppliers_df.head()

Unnamed: 0,SITE,ADDRESS,CITY,STATE/REGION,COUNTRY
0,3Q Vina,"8 An Duong Vuong Street, Ward 16, District 8",Hồ Chí Minh,Hồ Chí Minh City,Vietnam
1,A Mount Inc.,"No. 65, Dongshing Street, Shulin District",New Taipei City,Taipei,Taiwan
2,A. R. Industries,"Mauza Rampur Jattan, Dhakwala Moginand, Tehsil...",Kala Amb,Himachal Pradesh,India
3,AAC Technologies Holdings Inc.\n(Shenzhen),"No.1, Chengxin Road, Baolong Ind. Park, Longga...",Shenzhen,Guangdong,China
4,"Ability Opto-Electronics Technology Co., Ltd.","4F. No.31, Keya Rd., Daya Dist.",Taichung City,Taichung,Taiwan


We filter the sites by name, keeping those whose name contains "elec" or "tech", so that we only consider cities and countries that are likely to export electronics through Amazon.
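
To make the behaviour of such a substring filter concrete, here is a small sketch on hypothetical site names (none of them come from the supplier list). It shows that the match is case-insensitive, that missing names are excluded, and that a substring such as "elec" can also match unrelated names.

```python
import pandas as pd

# Hypothetical site names, chosen to show what the substring pattern matches.
toy_sites = pd.DataFrame({"SITE": [
    "Acme Electronics Co.",      # matches "elec"
    "Northwind Technologies",    # matches "tech"
    "Select Garments Ltd.",      # also matches "elec" (inside "Select")
    "Plain Packaging Inc.",      # no match
    None,                        # missing name, excluded by na=False
]})

pattern = "|".join(["elec", "tech"])
mask = toy_sites["SITE"].str.contains(pattern, case=False, na=False)
matched = toy_sites.loc[mask, "SITE"].tolist()
print(matched)
```

If false positives like the third row turn out to matter, one option is a pattern anchored at a word boundary, such as `r"\b(?:elec|tech)"`, which would keep "Electronics" and "Opto-Electronics" but skip "Select".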

In [5]:
# Filter companies to show only tech sites

keywords = ["elec", "tech"]
pattern = '|'.join(keywords)

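# .copy() gives an independent frame, so columns can be added later without a SettingWithCopyWarning
filtered_supplier_df = suppliers_df[suppliers_df['SITE'].str.contains(pattern, case = False, na = False)].copy()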

filtered_supplier_df.head()

Unnamed: 0,SITE,ADDRESS,CITY,STATE/REGION,COUNTRY
3,AAC Technologies Holdings Inc.\n(Shenzhen),"No.1, Chengxin Road, Baolong Ind. Park, Longga...",Shenzhen,Guangdong,China
4,"Ability Opto-Electronics Technology Co., Ltd.","4F. No.31, Keya Rd., Daya Dist.",Taichung City,Taichung,Taiwan
6,"AcBel Electronic (Dong Guan) Co., Ltd.","No.17-28, (Hong Yeh Rd.), Hong Yeh Industrial ...",Dongguan,Guangdong,China
10,"Acrox Technologies Co., Ltd","No. 2 Xinmin Road, Xinmin Village, Changan Town",Dongguan,Guangdong,China
30,Amperex Technology Limited,"1 West Industrial Road, North Zone of SongShan...",Dongguan,Guangdong,China


In [68]:
# Check the length of the filtered df

len(filtered_supplier_df)

188

In [6]:
# Join the city and country columns together to form area

filtered_supplier_df['AREA'] = filtered_supplier_df['CITY'].astype(str) + ', ' + filtered_supplier_df['COUNTRY'].astype(str)

filtered_supplier_df.head()

Unnamed: 0,SITE,ADDRESS,CITY,STATE/REGION,COUNTRY,AREA
3,AAC Technologies Holdings Inc.\n(Shenzhen),"No.1, Chengxin Road, Baolong Ind. Park, Longga...",Shenzhen,Guangdong,China,"Shenzhen, China"
4,"Ability Opto-Electronics Technology Co., Ltd.","4F. No.31, Keya Rd., Daya Dist.",Taichung City,Taichung,Taiwan,"Taichung City, Taiwan"
6,"AcBel Electronic (Dong Guan) Co., Ltd.","No.17-28, (Hong Yeh Rd.), Hong Yeh Industrial ...",Dongguan,Guangdong,China,"Dongguan, China"
10,"Acrox Technologies Co., Ltd","No. 2 Xinmin Road, Xinmin Village, Changan Town",Dongguan,Guangdong,China,"Dongguan, China"
30,Amperex Technology Limited,"1 West Industrial Road, North Zone of SongShan...",Dongguan,Guangdong,China,"Dongguan, China"


In [70]:
# Generate synthetic data in products_df based on the counts and constraints of the suppliers df

# Compute the normalized counts once and reuse them for values and probabilities
area_shares = filtered_supplier_df['AREA'].value_counts(normalize=True)
area_values = area_shares.index.tolist()
area_probs = area_shares.values

products_df["origin_area"] = np.random.choice(area_values, size = len(products_df), p = area_probs)

products_df.head()

Unnamed: 0,product_id,product_name,about_product,category,actual_price,discounted_price,discount_percentage,origin_area
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,High Compatibility : Compatible With iPhone 12...,Computers&Accessories|Accessories&Peripherals|...,13.19,4.79,0.64,"Guangzhou, China"
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,"Compatible with all Type C enabled devices, be...",Computers&Accessories|Accessories&Peripherals|...,4.19,2.39,0.43,"Hefei, China"
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,【 Fast Charger& Data Sync】-With built-in safet...,Computers&Accessories|Accessories&Peripherals|...,22.79,2.39,0.9,"Foshan, China"
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,The boAt Deuce USB 300 2 in 1 cable is compati...,Computers&Accessories|Accessories&Peripherals|...,8.39,3.95,0.53,"Shenzhen, China"
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,[CHARGE & SYNC FUNCTION]- This cable comes wit...,Computers&Accessories|Accessories&Peripherals|...,4.79,1.85,0.61,"Kunshan, China"


In [None]:
products_df.to_csv('../products.csv', index = False)