
# E-commerce Business Data Cleaning

This notebook provides the data cleaning process of an online retail business' sales data. The aim of this notebook is to prepare the dataset for further analysis. I have also published a notebook with the sales analysis and visualizations, which you can find [<ins>here</ins>](https://www.kaggle.com/atanaskanev/sales-analysis-and-visualization).

The business questions answered in the analysis notebook include:
* What is the overall sales trend?
* Which is the best selling product in each country?
* How many new customers are there each month?
* When do customers make the most purchases?

The data contains 541,909 sales records and 8 columns, including a product description, quantity of items sold, unit price, date of sale and country. In short, the cleaning process includes:
* cleaning erroneous and missing data
* removing duplicated descriptions for the same stockcodes
* handling outliers



## Table of Contents
Click on any heading to jump straight to the content

[<font size="5">Importing Libraries and Data. Initial Data Overview</font>](#section-one)

[<font size="5">Data Cleaning</font>](#section-two)
* [Negative and 0 Unit Prices](#section-three)
    - [UnitPrice = 0](#section-four)
    - [UnitPrice < 0](#section-five)
* [Clean Erroneous and Non-Sales Related Descriptions](#section-six)
* [Assign Unique Descriptions to Each StockCode](#section-seven)
* [Outliers](#section-eight)

<a id="section-one"></a>
# Importing Libraries and Data. Initial Data Overview

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats # used to calculate z-scores to investigate outliers

import warnings        
warnings.filterwarnings("ignore") # ignores warnings


In [None]:
pd.options.display.float_format = "{:.2f}".format # formats floats to two decimal places

In [None]:
data = pd.read_csv("../input/ecommerce-data/data.csv", encoding = "ISO-8859-1")

In [None]:
data.shape

In [None]:
data.head(10)

Note that product descriptions are written in upper case.

In [None]:
data.info()

In [None]:
# cast InvoiceDate as a date type
data["InvoiceDate"] = pd.to_datetime(data["InvoiceDate"])
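Passing an explicit format can make the date parsing less ambiguous, since pandas no longer has to guess whether the day or the month comes first. The sketch below assumes the timestamps look like `12/1/2010 8:26` (month/day/year); both that format and the sample values are assumptions, so check them against `data.head()` before reusing it.

```python
import pandas as pd

# hypothetical sample values in the assumed month/day/year hour:minute layout
sample = pd.Series(["12/1/2010 8:26", "1/13/2011 14:05"])

# an explicit format removes the day/month guesswork;
# rows that do not match raise an error instead of being parsed silently
parsed = pd.to_datetime(sample, format="%m/%d/%Y %H:%M")
print(parsed)
```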

In [None]:
data.isna().sum()

<a id="section-two"></a>
# Data Cleaning

Missing values are not always stored as NaN; they are sometimes entered as strings. We check for these by comparing every cell against a simple list of common placeholders:

In [None]:
data[data.isin(["NA","NaN","Na","na","N/A",
                "n/a","missing","MISSING",
                "no data","nodata","","?",
                "??","???","????","?????"]).any(axis=1)].shape

We did in fact find some missing data. Let's have a look at some of those records:

In [None]:
data[data.isin(["NA","NaN","Na","na","N/A",
                "n/a","missing","MISSING",
                "no data","nodata","","?",
                "??","???","????","?????"]).any(axis=1)].head(20)

We see that records with a missing Description also have a UnitPrice of 0 and a missing CustomerID. Many of them, but not all, have a negative Quantity.
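The pattern described above can be checked directly rather than by eye. This is a minimal sketch on a made-up frame: the rows in `toy` are hypothetical, and on the real data you would run the same expressions on `data`.

```python
import numpy as np
import pandas as pd

# hypothetical rows imitating the pattern described above
toy = pd.DataFrame({
    "Description": [np.nan, "?", "WHITE METAL LANTERN", np.nan],
    "Quantity": [-10, 5, 6, 3],
    "UnitPrice": [0.0, 0.0, 3.39, 0.0],
    "CustomerID": [np.nan, np.nan, 17850.0, np.nan],
})

missing_descr = toy[toy["Description"].isna()]

# do all rows without a Description have a zero price and no customer?
all_zero_price = (missing_descr["UnitPrice"] == 0).all()
all_no_customer = missing_descr["CustomerID"].isna().all()

# share of those rows that also have a negative Quantity
share_negative_qty = (missing_descr["Quantity"] < 0).mean()

print(all_zero_price, all_no_customer, share_negative_qty)
```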

Let's start by investigating negative and 0 unit prices:

<a id="section-three"></a>
## Negative and 0 Unit Prices

In [None]:
data["UnitPrice"].describe()

<a id="section-four"></a>
### UnitPrice = 0
First let's investigate records with a UnitPrice of 0.

In [None]:
data[data["UnitPrice"] == 0].shape

2,515 records have a UnitPrice of 0.

In [None]:
# number of unique Descriptions when UnitPrice is 0
data[data["UnitPrice"] == 0]["Description"].nunique(dropna = False) # also counts NaN

In [None]:
# look at these descriptions
data[data["UnitPrice"] == 0]["Description"].unique()[0:10] # only the first 10 shown here, but I have looked at them

Some are actual product Descriptions e.g. 'ROUND CAKE TIN VINTAGE GREEN', but others are ambiguous: '?', '?display?' etc. These were probably entered as errors and a UnitPrice of 0 was assigned, so that they do not affect the sales figures.

A lot of these descriptions are adjustments for missing or damaged items e.g. 'missing?', 'Breakages' etc.
Such records have negative quantities in order to adjust for the missing/damaged inventory, but have a UnitPrice of 0 so that they do not affect sales negatively.

In other words, records with a UnitPrice of 0 are likely related to inventory and not sales. It could be that the system used to enter this data does not have separate accounts for sales and inventory, and they were entered in the same place.

We can therefore drop records with a UnitPrice of 0.

In [None]:
data = data[data["UnitPrice"] != 0]

In [None]:
data.isna().sum()

Dropping these records has also removed the NaN values in Description.

<a id="section-five"></a>
### UnitPrice < 0
Let's check records with a negative UnitPrice.

In [None]:
data[data["UnitPrice"] < 0]

In [None]:
data[data["Description"].str.lower().str.contains("adjust")]

These records are bad debt adjustments. Bad debts are expenses and not sales, therefore we drop them.

In [None]:
data = data[data["Description"].str.lower().str.contains("adjust") == False]

We now have records with positive UnitPrices only. 

<a id="section-six"></a>
## Clean Erroneous and Non-Sales Related Descriptions
Let's start with cleaning some of the Descriptions, since we saw issues with them above.

In [None]:
# number of unique descriptions
data["Description"].nunique(dropna = False)

In [None]:
# first strip any leading and trailing whitespace
data["Description"] = data["Description"].str.strip()

In [None]:
# number of unique descriptions
data["Description"].nunique(dropna = False)

Simply stripping whitespace has reduced the number of unique Descriptions.

Above we saw InvoiceNos starting with C, which probably refers to credit. To illustrate what credit entries are, the below records show that an error was likely made: Quantity = 80995, and then it is cancelled out with a credit entry. 

In [None]:
data[data["Description"].str.contains("PAPER CRAFT")]
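One way to confirm that a credit entry cancels a sale is to sum Quantity per customer and product and look for a net of 0. The sketch below uses a made-up frame; apart from the quantity of 80995 mentioned above, every value in `toy` is a hypothetical placeholder.

```python
import pandas as pd

# hypothetical records: a sale, its credit reversal, and an unrelated sale
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "C1002", "1003"],
    "StockCode": ["P1", "P1", "P2"],
    "Quantity": [80995, -80995, 6],
    "CustomerID": [111, 111, 222],
})

# net quantity per customer and product; 0 means the entries cancel out
net = toy.groupby(["CustomerID", "StockCode"])["Quantity"].sum()
cancelled = net[net == 0]
print(cancelled)
```

A net of 0 only shows that the quantities offset each other; whether the original sale was an error still has to be judged from the records themselves.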

 Let's see the Descriptions of records with credit InvoiceNos and clean some of them:

In [None]:
data[data["InvoiceNo"].str.startswith("C")].head(10)

In [None]:
#  list of credit Descriptions
credit_descr = pd.Series(data[data["InvoiceNo"].str.startswith("C")]["Description"].unique())
credit_descr

Let's look into the Descriptions that are not fully upper case, since upper-case Descriptions are usually products.

In [None]:
credit_descr[credit_descr.str.isupper() == False]

Manual, Discount and Next Day Carriage are related to sales, so we keep them. The other Descriptions are expenses, so we drop those records.

In [None]:
data = data[data["Description"].isin(["Bank Charges", "CRUK Commission"]) == False]

In [None]:
# check short credit Descriptions
credit_descr[credit_descr.str.len() < 15]

Drop "AMAZON FEE", "SAMPLES", "POSTAGE" and "PACKING CHARGE" since they are not related to sales: 

In [None]:
data = data[data["Description"].isin(["AMAZON FEE", "SAMPLES", "POSTAGE", "PACKING CHARGE"]) == False]

The other credit Descriptions have actual product names, so they are likely related to returned items, and therefore affect sales. "Manual" and "Discount" also affect sales, so we leave them in the data as well.

There are also items sold on DOTCOM:

In [None]:
list_dotcom = data[data["Description"].replace({np.nan:""}).str.lower().str.contains("dotcom", regex = True)] \
["Description"].unique().tolist()

list_dotcom

Let's have all items related to DOTCOM share the same Description "DOTCOM":

In [None]:
data["Description"] = data["Description"].replace(list_dotcom, "DOTCOM")
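As an alternative to building a list first, the same normalisation can be sketched with a boolean mask. This is not the approach used in the cells above, just an equivalent under the assumption that every Description containing "dotcom" (in any case) should be relabelled; the rows in `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical Descriptions, including a missing one
toy = pd.DataFrame({
    "Description": ["DOTCOM POSTAGE", "Dotcom sales", np.nan, "WHITE METAL LANTERN"]
})

# case=False matches any capitalisation; na=False treats NaN as "no match",
# so missing Descriptions do not need to be replaced beforehand
is_dotcom = toy["Description"].str.contains("dotcom", case=False, na=False)
toy.loc[is_dotcom, "Description"] = "DOTCOM"
print(toy)
```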

<a id="section-seven"></a>
## Assign Unique Descriptions to Each StockCode

The next step is to remove very similar Descriptions which relate to the same item and have the same StockCode. To do this, we will match each StockCode with a unique Description.

In [None]:
# check number of unique Descriptions
data["Description"].nunique()

In [None]:
# check number of unique StockCodes
data["StockCode"].nunique()

We see there are more unique Descriptions than unique StockCodes, so some StockCodes likely have more than 1 Description. Let's see which StockCodes these are:

In [None]:
num_descriptions = data.groupby("StockCode")["Description"].nunique().sort_values(ascending = False)
num_descriptions

We see StockCodes with multiple Descriptions. Let's see the distribution:

In [None]:
num_descriptions.value_counts()

We see that 200 StockCodes have 2 Descriptions each; 15 StockCodes have 3, and 2 StockCodes have 4.

We can group by StockCode and create a list of Descriptions for each StockCode:

In [None]:
groups = data.groupby("StockCode")["Description"].unique()
groups

In [None]:
# check some of the groups with multiple Descriptions
groups[groups.str.len() > 1]

As we can see, some StockCodes have multiple very similar Descriptions. In order to remove duplicated Descriptions, we can take the first Description in each group and assign this Description to all records with the same StockCode.
For this purpose we create a dictionary matching each StockCode with the first Description from its corresponding group:

In [None]:
# map each StockCode to the first Description in its group
dictionary = {}
for index, group in groups.items():
    dictionary[index] = group[0]

In [None]:
# what the dictionary looks like
list(dictionary.items())[0:10]

Create a DataFrame from the dictionary:

In [None]:
descriptions = pd.DataFrame()

In [None]:
descriptions["StockCode"] = list(dictionary.keys())

In [None]:
descriptions["Unique_Description"] = list(dictionary.values())

In [None]:
descriptions.head(10)

We now have a reference table matching every StockCode with a unique Description.

We can now assign these unique Descriptions to their StockCode in the original data:

In [None]:
data = data.merge(descriptions, on = "StockCode", how = "inner")

In [None]:
data.head()

Now every record has an assigned unique Description.
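The dictionary, reference table and merge above can also be collapsed into a single step with `groupby(...).transform("first")`. The sketch below is an alternative, not the method used in this notebook, and the rows in `toy` are hypothetical. Note that `"first"` takes the first non-null value in each group, which matches the dictionary approach only when no Description is NaN.

```python
import pandas as pd

# hypothetical records where one StockCode has several similar Descriptions
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "A1"],
    "Description": ["RED MUG", "RED MUG LARGE", "BLUE PLATE", "MUG RED"],
})

# broadcast the first Description of each StockCode group to all of its rows
toy["Unique_Description"] = (
    toy.groupby("StockCode")["Description"].transform("first")
)
print(toy)
```

Unlike the merge, `transform` keeps the original row order and index, which matters if later steps refer to rows by position.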

In [None]:
data["Description"].nunique()

In [None]:
data["Unique_Description"].nunique()

As we see, all similar Descriptions have been converted to a single unique Description. We can now drop the original Description column and replace it with the cleaned Unique_Description column:

In [None]:
data["Description"] = data["Unique_Description"]
data = data.drop("Unique_Description", axis = 1)

In [None]:
data["StockCode"].nunique()

There are more unique StockCodes than unique Descriptions, which means that some Descriptions are shared by different StockCodes. Let's see which Descriptions these are:

In [None]:
descr_counts = pd.Series(descriptions["Unique_Description"].value_counts())
descr_counts[descr_counts > 1]

Check a few of those:

In [None]:
descriptions[descriptions["Unique_Description"].str.contains("METAL SIGN,CUPCAKE")]

In [None]:
descriptions[descriptions["Unique_Description"].str.contains("COLUMBIAN CANDLE ROUND")]

We see that these products are similar, so they can be grouped together in analyses.

<a id="section-eight"></a>
## Outliers

In [None]:
# add an ItemTotal column
data["ItemTotal"] = data["Quantity"] * data["UnitPrice"]

In [None]:
data.describe()

In [None]:
plt.style.use("default")
fig, (ax1,ax2) = plt.subplots(1,2, figsize = (8,3.7))

ax1.boxplot(data["UnitPrice"])
ax1.set_title("Unit Price")
ax2.boxplot(data["Quantity"])
ax2.set_title("Quantity")
fig.suptitle("Unit Price and Quantity Outlier Analysis")
plt.show()

We see some big outliers.

Since a lot of values lie beyond the boxplot whiskers (more than 1.5 times the interquartile range from the quartiles), we expect that there would also be a lot of values beyond 3 standard deviations (absolute z-score > 3), which means that using the z-score as a means to eliminate outliers would be problematic.

Let's see how many values are outside 3 standard deviations:


In [None]:
z = np.abs(stats.zscore(data["Quantity"])) # calculate z-scores for Quantity
len(np.where(z>3)[0]) # how many values are outside 3 std.dev

In [None]:
z = np.abs(stats.zscore(data["UnitPrice"])) # calculate z-scores for UnitPrice
len(np.where(z>3)[0]) # how many values are outside 3 std.dev

As we see, there are a lot of values beyond 3 standard deviations. Moreover, we expect high values for Quantity to be related to very cheap items.
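Another reason to be careful with the z-score on data like this is that the mean and standard deviation are themselves inflated by the most extreme values. The sketch below compares the z-score rule with the boxplot (1.5 × IQR) rule on made-up, right-skewed quantities; the numbers are hypothetical and only the value 80995 echoes the data above.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical quantities: 1000 small orders plus three very large ones
quantity = np.concatenate([rng.integers(1, 13, size=1000), [3000, 4000, 80995]])

# z-score rule: the single huge value inflates the standard deviation
z = np.abs((quantity - quantity.mean()) / quantity.std())
n_z = int((z > 3).sum())

# boxplot rule: points more than 1.5 * IQR away from the quartiles
q1, q3 = np.percentile(quantity, [25, 75])
iqr = q3 - q1
outside = (quantity < q1 - 1.5 * iqr) | (quantity > q3 + 1.5 * iqr)
n_iqr = int(outside.sum())

print(n_z, n_iqr)
```

In this toy example the two rules disagree about the large orders of 3000 and 4000, which is why the records are inspected manually in the cells that follow rather than filtered by a single threshold.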

Let's investigate by starting from the extreme outliers for Quantity:

In [None]:
data[np.abs(data["Quantity"]) > 10000]

We see that these records were likely entered by mistake, so the amounts were cancelled out with credit entries. We drop these records for data clarity, although they have a net impact of 0 on ItemTotal.

In [None]:
# remove extreme outliers
data = data[np.abs(data["Quantity"]) < 10000]

In [None]:
# look at some more records 
data[np.abs(data["Quantity"]) > 2000]

As expected, these records relate to high Quantity sales of cheap items - perhaps purchases by other retailers. Some other records are error reversals by credit entries, as seen above. We leave those records as they are.

Let's now look at high UnitPrice records:

In [None]:
z = np.abs(stats.zscore(data["UnitPrice"])) # calculate z-scores for UnitPrice
data[z > 3].sort_values(by = "UnitPrice", ascending = False)

In [None]:
data[z > 3]["Description"].unique()

We see Manual entries, discounts and DOTCOM sales. Some of the Manual entries cancel each other out.

In reality, we would contact the sales department to inquire about what Manual entries refer to. For the present analysis, we assume that these are correct entries, possibly about returned items or crediting wrongly entered sales. 

Discounts affect sales, so we leave them in the data as well.

We see vintage and antique items, which are expected to have high prices, so we keep such records as well.

Something interesting we see, however, is the "PICNIC BASKET WICKER SMALL" item.
Let's investigate:

In [None]:
data[data["Description"].str.contains("PICNIC BASKET")].sort_values(by = "UnitPrice", ascending = False)

We see that the first two records are likely mistakes, since these UnitPrices are far above the normal prices for this product, and since we saw a Manual credit entry of exactly the same amount above. We therefore drop these records, together with the corresponding Manual credit adjustment.

In [None]:
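# drop erroneous high UnitPrice records and the matching Manual credit entry
# note: these are positional row numbers; they are only valid for this exact sequence of cleaning steps
data = data.drop(data.index[[88771,88772,297271]])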
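Because hard-coded row positions stop being valid as soon as an earlier step changes, a condition-based selection may be more robust. The sketch below is only an illustration on a made-up frame: the prices, invoice numbers and the `price_cap` threshold are all assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# hypothetical records: two mispriced baskets, one normal sale, one credit entry
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1002", "1003", "C1004"],
    "Description": ["PICNIC BASKET WICKER SMALL"] * 3 + ["Manual"],
    "Quantity": [1, 1, 4, -1],
    "UnitPrice": [649.50, 649.50, 5.95, 1299.00],
})
toy["ItemTotal"] = toy["Quantity"] * toy["UnitPrice"]

price_cap = 100  # assumed cut-off for a plausible basket price

# mispriced basket records
bad_basket = toy["Description"].str.contains("PICNIC BASKET") & (toy["UnitPrice"] > price_cap)

# Manual credit entry whose total exactly offsets the mispriced records
bad_total = toy.loc[bad_basket, "ItemTotal"].sum()
matching_credit = (toy["Description"] == "Manual") & np.isclose(toy["ItemTotal"], -bad_total)

cleaned = toy[~(bad_basket | matching_credit)]
print(cleaned)
```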

The data is now cleaned and ready for analyses and visualizations.

In [None]:
# data.to_csv("online_retail.csv", index = False) # saves the file

 I have continued the analysis of this dataset including visualizations in another notebook, which you can find [<ins>here</ins>](https://www.kaggle.com/atanaskanev/sales-analysis-and-visualization).
 
 If you wish to download the cleaned dataset, you can do so [<ins>here</ins>](https://www.kaggle.com/atanaskanev/online-retail-business-cleaned-dataset).

<font size="5">Thank you for reading my notebook!</font>

Any comments and suggestions are highly appreciated!