# Data

We are going to review and clean the Online Retail dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/online+retail

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import datetime as dt

In [2]:
online = pd.read_excel('Online Retail.xlsx')

# 00. Exploring and cleaning the data

In this section we'll scan the dataset and do the following tasks:
 - Remove rows where `Quantity` is zero or negative
 - Remove rows where `UnitPrice` is zero or negative
 - Remove rows with a missing (`NULL`) `CustomerID`
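The three tasks above can be sketched as a single helper. This is only an illustrative sketch on toy data: the function name `clean_transactions` and the rows in `toy` are assumptions, not part of the original notebook, which applies the same filters one cell at a time.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with positive Quantity and UnitPrice and a known CustomerID."""
    mask = (
        (df["Quantity"] > 0)
        & (df["UnitPrice"] > 0)
        & df["CustomerID"].notna()
    )
    return df.loc[mask].copy()


# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "Quantity": [6, -1, 3, 2],
    "UnitPrice": [2.55, 1.25, 0.0, 4.13],
    "CustomerID": [17850.0, 17850.0, 13047.0, None],
})

cleaned = clean_transactions(toy)
print(cleaned)
```

Only the first toy row survives: the second has a negative quantity, the third a zero price, and the fourth no customer.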

In [3]:
online.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


### Removing negative values

Below we can see that both the `Quantity` and `UnitPrice` variables have negative values.

We will remove these rows, together with any rows where either value is zero, and then investigate the dataset further.

In [4]:
online.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |


In [5]:
online = online[online['Quantity'] > 0]
online = online[online['UnitPrice'] > 0]

In [6]:
online.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 530104.0 | 530104.0 | 397884.0 |
| mean | 10.542037 | 3.907625 | 15294.423453 |
| std | 155.524124 | 35.915681 | 1713.14156 |
| min | 1.0 | 0.001 | 12346.0 |
| 25% | 1.0 | 1.25 | 13969.0 |
| 50% | 3.0 | 2.08 | 15159.0 |
| 75% | 10.0 | 4.13 | 16795.0 |
| max | 80995.0 | 13541.33 | 18287.0 |
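The two filters reduce the row count from 541,909 to 530,104, so 11,805 rows are dropped. Before filtering, it can be useful to count how many rows each condition would remove. The sketch below shows one way to do that on toy data; the `toy` frame and the dictionary keys are illustrative assumptions.

```python
import pandas as pd

# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "Quantity": [6, -2, 4, 1, -1],
    "UnitPrice": [2.55, 3.39, 0.0, -5.0, 1.25],
})

# Boolean masks marking the rows that the `> 0` filters would discard
non_positive_qty = toy["Quantity"] <= 0
non_positive_price = toy["UnitPrice"] <= 0

dropped = {
    "non_positive_quantity": int(non_positive_qty.sum()),
    "non_positive_price": int(non_positive_price.sum()),
    "either": int((non_positive_qty | non_positive_price).sum()),
}
print(dropped)
```

Summing a boolean mask counts its `True` values, so each entry is the number of rows matching that condition.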


### Identifying missing data

We can see below that the `CustomerID` variable has fewer non-null rows than the other variables.

In [7]:
online.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530104 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      530104 non-null object
StockCode      530104 non-null object
Description    530104 non-null object
Quantity       530104 non-null int64
InvoiceDate    530104 non-null datetime64[ns]
UnitPrice      530104 non-null float64
CustomerID     397884 non-null float64
Country        530104 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.4+ MB


We can count the missing values in each column and see that 132,220 rows have a `NULL` `CustomerID`; no other column has missing values. We will remove these rows in the next steps.

In [8]:
online.isnull().sum()

InvoiceNo           0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     132220
Country             0
dtype: int64
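The absolute count is easier to judge as a share of rows: 132,220 of 530,104 is roughly a quarter of the remaining data. A minimal sketch of computing that percentage per column, using a toy frame that is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing values per column.
missing_share = toy.isnull().mean() * 100
print(missing_share)
```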

In [9]:
online = online[pd.notnull(online['CustomerID'])]

In [10]:
online.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
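The boolean mask used above has an equivalent spelling, `DataFrame.dropna(subset=...)`. The sketch below compares the two on toy data and also shows an optional integer cast of `CustomerID`, which only becomes possible once no missing values remain. The cast is a suggestion, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "CustomerID": [17850.0, np.nan, 13047.0],
})

by_mask = toy[pd.notnull(toy["CustomerID"])]
by_dropna = toy.dropna(subset=["CustomerID"])

# Both approaches keep exactly the same rows
same_rows = by_mask.equals(by_dropna)
print(same_rows)

# Optional: store the IDs as integers now that no NaN is left
as_int = by_dropna.astype({"CustomerID": "int64"})
print(as_int.dtypes)
```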

### Save the cleaned dataset

Let's save the cleaned dataset as a separate Excel file for other uses.

In [11]:
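# The context manager saves and closes the file on exit;
# writer.save() was removed in pandas 2.0.
with pd.ExcelWriter('OnlineClean.xlsx') as writer:
    online.to_excel(writer, sheet_name='Sheet1')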

## Exploring the dataset

In this section we will scan the dataset at a high level to understand its structure, variable distributions and other information.

### View number of invoice lines per country

Let's review the number of invoice lines (rows) per country. The United Kingdom is by far the largest, accounting for about 89% of the rows in this dataset (354,321 of 397,884).

In [12]:
online.groupby(['Country'])['InvoiceNo'].agg('count').sort_values(ascending=False)

Country
United Kingdom          354321
Germany                   9040
France                    8341
EIRE                      7236
Spain                     2484
Netherlands               2359
Belgium                   2031
Switzerland               1841
Portugal                  1462
Australia                 1182
Norway                    1071
Italy                      758
Channel Islands            748
Finland                    685
Cyprus                     614
Sweden                     451
Austria                    398
Denmark                    380
Poland                     330
Japan                      321
Israel                     248
Unspecified                244
Singapore                  222
Iceland                    182
USA                        179
Canada                     151
Greece                     145
Malta                      112
United Arab Emirates        68
European Community          60
RSA                         57
Lebanon                     45
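Note that `count` here counts rows, and one invoice usually spans several rows (the first five rows of the dataset all share `InvoiceNo` 536365). If distinct invoices are of interest, `nunique` can be computed alongside. A sketch on toy data, where the output column names `rows` and `invoices` are illustrative choices:

```python
import pandas as pd

# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "France", "France"],
    "InvoiceNo": ["536365", "536365", "536366", "536370", "536370"],
})

# Named aggregation: rows per country and distinct invoices per country
per_country = (
    toy.groupby("Country")["InvoiceNo"]
    .agg(rows="count", invoices="nunique")
    .sort_values("rows", ascending=False)
)
print(per_country)
```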


### Most popular products

Let's review the top 20 most popular products by share of total units sold. Three products each account for more than 1 percent of units sold: Little Birdie paper craft, Medium ceramic top storage jar, and World War 2 gliders assorted designs.

In [13]:
online.groupby(['Description'])['Quantity'].agg('sum').sort_values(\
    ascending=False)[:20] / np.sum(online['Quantity']) * 100

Description
PAPER CRAFT , LITTLE BIRDIE           1.567298
MEDIUM CERAMIC TOP STORAGE JAR        1.507717
WORLD WAR 2 GLIDERS ASSTD DESIGNS     1.052960
JUMBO BAG RED RETROSPOT               0.893628
WHITE HANGING HEART T-LIGHT HOLDER    0.710649
ASSORTED COLOUR BIRD ORNAMENT         0.684274
PACK OF 72 RETROSPOT CAKE CASES       0.651978
POPCORN HOLDER                        0.598532
RABBIT NIGHT LIGHT                    0.526374
MINI PAINT SET VINTAGE                0.504585
PACK OF 12 LONDON TISSUES             0.490440
PACK OF 60 PINK PAISLEY CAKE CASES    0.469522
BROCADE RING PURSE                    0.444347
VICTORIAN GLASS HANGING T-LIGHT       0.434091
ASSORTED COLOURS SILK FAN             0.423313
RED  HARMONICA IN BOX                 0.405878
JUMBO BAG PINK POLKADOT               0.390204
SMALL POPCORN HOLDER                  0.353186
LUNCH BAG RED RETROSPOT               0.342447
60 TEATIME FAIRY CAKE CASES           0.342292
Name: Quantity, dtype: float64
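The same calculation can be split into named steps, which makes it clear that the percentage is each product's share of all units sold. This is a sketch on toy data; `units`, `share` and `top` are illustrative names, and `nlargest` is used here in place of sorting and slicing.

```python
import pandas as pd

# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "Description": ["A", "B", "A", "C", "B", "A"],
    "Quantity": [10, 5, 20, 15, 25, 30],
})

# Total units per product, then each product's percentage of all units
units = toy.groupby("Description")["Quantity"].sum()
share = units / units.sum() * 100

# Keep the two largest shares (use 20 for the real dataset)
top = share.nlargest(2)
print(top)
```

Products above a given threshold can then be selected with an expression such as `share[share > 1]`.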

### Least popular products

Let's review the 20 least popular products. Interestingly, the store has sold only one unit of each.

In [14]:
online.groupby(['Description'])['Quantity'].agg('sum').sort_values(ascending=False)[-20:]

Description
SET/3 FLORAL GARDEN TOOLS IN BAG       1
BAROQUE BUTTERFLY EARRINGS CRYSTAL     1
FIRE POLISHED GLASS NECKL GOLD         1
PURPLE CHUNKY GLASS+BEAD NECKLACE      1
PACK 4 FLOWER/BUTTERFLY PATCHES        1
POTTING SHED SOW 'N' GROW SET          1
DUSTY PINK CHRISTMAS TREE 30CM         1
EASTER CRAFT IVY WREATH WITH CHICK     1
FIRE POLISHED GLASS BRACELET BLACK     1
BLUE PADDED SOFT MOBILE                1
MUMMY MOUSE RED GINGHAM RIBBON         1
MARIE ANTOIENETT TRINKET BOX GOLD      1
CAKE STAND LACE WHITE                  1
HEN HOUSE W CHICK IN NEST              1
CHERRY BLOSSOM PURSE                   1
SET/3 TALL GLASS CANDLE HOLDER PINK    1
LASER CUT MULTI STRAND NECKLACE        1
CRACKED GLAZE EARRINGS BROWN           1
DOLPHIN WINDMILL                       1
SET OF 3 PINK FLYING DUCKS             1
Name: Quantity, dtype: int64

### Calculate how many items have only 1 sale

Only 59 products have total historical sales of a single unit, which is a low number and doesn't require further cleaning.

In [15]:
online.groupby(['Description']).filter(lambda x: x['Quantity'].sum() == 1)['Description'].agg('count')

59
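The `filter` call above counts rows, which equals the number of products only because every remaining `Quantity` is a positive integer, so a product whose total is 1 has exactly one row. An alternative sketch that counts products directly from the per-product totals, shown on toy data (the names `units` and `n_single` are illustrative):

```python
import pandas as pd

# Toy rows (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    "Description": ["X", "Y", "Y", "Z"],
    "Quantity": [1, 1, 1, 1],
})

# Count products whose total units sold equals 1
units = toy.groupby("Description")["Quantity"].sum()
n_single = int((units == 1).sum())

# Row-based count, as in the notebook cell above
n_rows = int(
    toy.groupby("Description")
    .filter(lambda g: g["Quantity"].sum() == 1)["Description"]
    .count()
)
print(n_single, n_rows)
```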

## Further cleaning to make the dataset smaller (<10 MB) for my DataCamp course on Customer Segmentation in Python

### Remove all countries, leave only UK

In [16]:
onlineuk = online[online['Country']=='United Kingdom']

### Randomly sample 20% of the data

In [17]:
onlinesampled = onlineuk.sample(frac=0.2, replace=False, random_state=1)

In [18]:
print('onlineuk: {}, onlinesampled: {}'.format(onlineuk.shape, onlinesampled.shape))

onlineuk: (354321, 8), onlinesampled: (70864, 8)
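The sample size matches expectations: 20% of 354,321 rows is 70,864.2, which rounds to 70,864. Fixing `random_state` makes the draw repeatable, and `replace=False` means no row is picked twice. A small sketch of both properties on a toy frame (an assumption for illustration):

```python
import pandas as pd

# Toy frame with 1,000 rows (hypothetical data)
toy = pd.DataFrame({"value": range(1000)})

first = toy.sample(frac=0.2, replace=False, random_state=1)
second = toy.sample(frac=0.2, replace=False, random_state=1)

print(len(first))                # number of sampled rows
print(first.equals(second))      # same seed, same sample
print(first.index.is_unique)     # sampling without replacement
```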


In [19]:
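# The context manager saves and closes the file on exit;
# writer.save() was removed in pandas 2.0.
with pd.ExcelWriter('OnlineSampled.xlsx') as writer:
    onlinesampled.to_excel(writer, sheet_name='Sheet1')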

In [20]:
onlinesampled.to_csv('OnlineSampled.csv')
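By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Whether to keep it depends on whether the original row labels matter downstream. The sketch below compares both options using in-memory buffers instead of files; the toy frame is an assumption for illustration.

```python
import io

import pandas as pd

# Toy rows with a non-default index, as left behind by sampling (hypothetical data)
toy = pd.DataFrame(
    {"InvoiceNo": ["536365", "536366"], "Quantity": [6, 8]},
    index=[10, 42],
)

# Default: the index is written as an unnamed first column
with_index = toy.to_csv()
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()

# index=False: only the data columns are written
without_index = toy.to_csv(index=False)
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)
print(cols_without)
```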