# Cleaned Data

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("../raw/data.csv", encoding='unicode_escape')
data

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


In [4]:
data.shape

(541909, 8)

In [5]:
data.isnull().sum()


InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Here we can see that 1454 values are missing from Description and 135080 from CustomerID. Computing the percentage of missing data gives a clearer picture.

In [6]:
data.isnull().sum() / data.shape[0] * 100


InvoiceNo       0.000000
StockCode       0.000000
Description     0.268311
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     24.926694
Country         0.000000
dtype: float64

Now we see that about 0.27% of the Description values are missing, but almost 25% of the CustomerID values are unknown! Let's investigate further by looking at some examples.
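
The same percentages can also be obtained with `isnull().mean()`, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "MUG", "BOWL"],
    "CustomerID": [17850.0, np.nan, np.nan, 12680.0],
})

# Mean of the boolean mask = share of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Same result as the sum / row-count formula used above.
missing_pct_alt = toy.isnull().sum() / toy.shape[0] * 100

print(missing_pct)
```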

## Missing descriptions

In [7]:
data[data['Description'].isnull()]

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.0,,United Kingdom
1970,536545,21134,,1,12/1/2010 14:32,0.0,,United Kingdom
1971,536546,22145,,1,12/1/2010 14:33,0.0,,United Kingdom
1972,536547,37509,,1,12/1/2010 14:33,0.0,,United Kingdom
1987,536549,85226A,,1,12/1/2010 14:34,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
535322,581199,84581,,-2,12/7/2011 18:26,0.0,,United Kingdom
535326,581203,23406,,15,12/7/2011 18:31,0.0,,United Kingdom
535332,581209,21620,,6,12/7/2011 18:35,0.0,,United Kingdom
536981,581234,72817,,27,12/8/2011 10:33,0.0,,United Kingdom


In [8]:
data[data['Description'].isnull()]['UnitPrice'].value_counts()

0.0    1454
Name: UnitPrice, dtype: int64

In [9]:
data[data['Description'].isnull()]['CustomerID'].isnull().value_counts()

True    1454
Name: CustomerID, dtype: int64

Look at the pattern. Whenever the description is missing, the customer is missing as well and the unit price is 0.0. Why did the retailer record such transactions without a proper description? We should expect strange values in our dataset, and they will be difficult to detect.

Let's investigate the descriptions even further. Can we find "nan" strings and empty "" strings?
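
The two `value_counts()` checks above can also be condensed into boolean summaries with `.all()`. This is only a sketch on a hypothetical `toy` frame, not on the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "Description": [None, "MUG", None, "BOWL"],
    "UnitPrice": [0.0, 2.55, 0.0, 3.39],
    "CustomerID": [np.nan, 17850.0, np.nan, np.nan],
})

no_description = toy["Description"].isnull()

# For rows without a description: is the price always zero,
# and is the customer always missing?
always_zero_price = bool((toy.loc[no_description, "UnitPrice"] == 0).all())
always_no_customer = bool(toy.loc[no_description, "CustomerID"].isnull().all())

print(always_zero_price, always_no_customer)
```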

In [10]:
# Lowercase copy of every non-missing description.
has_description = data["Description"].notnull()
data.loc[has_description, "lowercase_descriptions"] = data.loc[
    has_description, "Description"
].str.lower()

# Count descriptions that contain the substring "nan".
data["lowercase_descriptions"].dropna().apply(lambda l: "nan" in l).value_counts()

False    539724
True        731
Name: lowercase_descriptions, dtype: int64

In [11]:
# Count descriptions that are empty strings.
data["lowercase_descriptions"].dropna().apply(lambda l: l == "").value_counts()

False    540455
Name: lowercase_descriptions, dtype: int64

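We found 731 additional descriptions that contain the string "nan". The check is a substring match, so it flags every description containing these letters, not only entries that are exactly "nan"; here we treat all of them as hidden missing values. Let's transform them to NaN.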

In [12]:
# Replace every description containing "nan" with a missing value.
has_lowercase = data["lowercase_descriptions"].notnull()
data.loc[has_lowercase, "lowercase_descriptions"] = data.loc[
    has_lowercase, "lowercase_descriptions"
].apply(lambda l: None if "nan" in l else l)
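
Because `"nan" in l` is a substring test, it behaves differently from an exact comparison. The sketch below contrasts the two on a few made-up strings (the `descriptions` series is hypothetical and not taken from the dataset); which variant is appropriate depends on what the flagged descriptions actually look like.

```python
import pandas as pd

# Hypothetical descriptions, only for illustration.
descriptions = pd.Series(["nan", "banana split bowl", "white metal lantern"])

# Substring test: flags every string that contains the letters "nan".
contains_nan = descriptions.str.contains("nan", regex=False)

# Exact test: flags only strings that are exactly "nan".
equals_nan = descriptions == "nan"

print(contains_nan.tolist())  # [True, True, False]
print(equals_nan.tolist())    # [True, False, False]
```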

## Missing CustomerID

In [13]:
data[data["CustomerID"].isnull()]

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,lowercase_descriptions
622,536414,22139,,56,12/1/2010 11:52,0.00,,United Kingdom,
1443,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom,decorative rose bathroom bottle
1444,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom,decorative cats bathroom bottle
1445,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom,polkadot rain hat
1446,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom,rain poncho retrospot
...,...,...,...,...,...,...,...,...,...
541536,581498,85099B,JUMBO BAG RED RETROSPOT,5,12/9/2011 10:26,4.13,,United Kingdom,jumbo bag red retrospot
541537,581498,85099C,JUMBO BAG BAROQUE BLACK WHITE,4,12/9/2011 10:26,4.13,,United Kingdom,jumbo bag baroque black white
541538,581498,85150,LADIES & GENTLEMEN METAL SIGN,1,12/9/2011 10:26,4.96,,United Kingdom,ladies & gentlemen metal sign
541539,581498,85174,S/4 CACTI CANDLES,1,12/9/2011 10:26,10.79,,United Kingdom,s/4 cacti candles


In [14]:
data.loc[data["CustomerID"].isnull(), ["UnitPrice", "Quantity"]].describe()

,UnitPrice,Quantity
count,135080.0,135080.0
mean,8.076577,1.995573
std,151.900816,66.696153
min,-11062.06,-9600.0
25%,1.63,1.0
50%,3.29,1.0
75%,5.45,3.0
max,17836.46,5568.0


That's bad as well. Entries without a customer ID contain extreme outliers in both price and quantity: unit prices range from -11062.06 to 17836.46 and quantities from -9600 to 5568, so both variables include negative values.

As we don't know why customers or descriptions are missing, and we have seen strange outliers in quantities and prices, we decided to drop all rows with a missing CustomerID or description.
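
To see how many rows are affected rather than only the extremes, the negative entries can be counted per column. A minimal sketch on a hypothetical `toy` frame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "UnitPrice": [2.51, -11062.06, 0.85],
    "Quantity": [1, 1, -9600],
})

# Number of negative entries in each column.
negative_counts = (toy[["UnitPrice", "Quantity"]] < 0).sum()

print(negative_counts)
```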

In [15]:
# Keep only rows that have both a customer ID and a usable description.
data = data.loc[
    data["CustomerID"].notnull() & data["lowercase_descriptions"].notnull()
].copy()
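
An equivalent way to express this filter is `DataFrame.dropna` with the `subset` argument, which drops a row when any of the listed columns is missing. A sketch on a hypothetical `toy` frame that reuses the column names from this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 12680.0, 12680.0],
    "lowercase_descriptions": ["lantern", "mug", None, "apron"],
    "Quantity": [6, 1, 4, 2],
})

# Keep only rows where both columns are present.
cleaned = toy.dropna(subset=["CustomerID", "lowercase_descriptions"]).copy()

print(cleaned)
```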

In [17]:
# Drop the helper column and keep the original eight columns.
data = data.iloc[:, 0:8]

In [19]:
data.isnull().sum().sum()
data.shape

(406223, 8)

In [18]:
data.to_csv("new_data.csv")
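
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back without `index_col`. Whether that is wanted here is a judgment call; the sketch below, on a hypothetical `toy` frame written to in-memory buffers, only shows the difference that `index=False` makes.

```python
import io

import pandas as pd

# Hypothetical toy frame with a non-contiguous index, as after filtering rows.
toy = pd.DataFrame(
    {"InvoiceNo": ["536365", "536366"], "Quantity": [6, 8]}, index=[0, 5]
)

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'InvoiceNo', 'Quantity']
print(cols_without)  # ['InvoiceNo', 'Quantity']
```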