# Retail Data Cleaning Notebook

From `explore.ipynb`, we have identified the following changes that need to be made to our data:

- Remove Invoice Numbers that begin with "C" or "A" (cancellations and bad debt adjustments).
- Remove all non-standard Stock Codes (anything other than five digits optionally followed by letters, e.g. `85123A`), except for the code "M".
- Remove rows with NA values in the Customer ID column.

## Table of Contents
1. [Load Data](#load-data)
2. [Clean InvoiceNo](#clean-invoiceno)
3. [Clean StockCodes](#clean-stockcodes)
4. [Clean CustomerID](#clean-customerid)

## Load Data

In [11]:
import pandas as pd
import numpy as np

In [14]:
data = pd.read_excel("../data/online_retail.xlsx")
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


## Clean InvoiceNo

In [10]:
#Generate list of invoice numbers to remove (those beginning with "A" or "C")
numbers_to_remove = []
for number in data["InvoiceNo"].astype(str):
    #str.startswith accepts a tuple of prefixes
    if number.startswith(("A", "C")):
        numbers_to_remove.append(number)

numbers_to_remove[0:10], len(numbers_to_remove)

(['C536379',
  'C536383',
  'C536391',
  'C536391',
  'C536391',
  'C536391',
  'C536391',
  'C536391',
  'C536391',
  'C536506'],
 9291)
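
The loop above can also be expressed as a vectorized boolean mask. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset), assuming the "A" or "C" marker is always the first character of the invoice number:

```python
import pandas as pd

# Hypothetical toy frame mixing integer and prefixed string invoice numbers
toy = pd.DataFrame({"InvoiceNo": [536365, "C536379", "A563185", 536366]})

# Take the first character of each invoice number and flag "A"/"C" prefixes
first_char = toy["InvoiceNo"].astype(str).str[0]
is_cancel_or_adjust = first_char.isin(["A", "C"])

# Keep only the rows that are not flagged
toy_cleaned = toy[~is_cancel_or_adjust]
removed_count = int(is_cancel_or_adjust.sum())
```

Working with a mask avoids building an intermediate list, and `removed_count` gives the number of flagged rows directly.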

In [13]:
#Keep only the rows whose InvoiceNo is not in numbers_to_remove
data[~data["InvoiceNo"].isin(numbers_to_remove)]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


In [16]:
#Check that the number of rows removed matches len(numbers_to_remove)
data.shape[0] - data[~data["InvoiceNo"].isin(numbers_to_remove)].shape[0]

9291

In [17]:
#Store the filtered rows as a new DataFrame
cleaned_data = data[~data["InvoiceNo"].isin(numbers_to_remove)]
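
Because `cleaned_data` is a filtered slice of `data`, assigning to one of its columns can trigger pandas' `SettingWithCopyWarning`. One common way to avoid this is to take an explicit copy when filtering. A minimal sketch on a made-up frame (`toy` and `toy_cleaned` are illustrative names, not part of this notebook):

```python
import pandas as pd

# Hypothetical toy frame standing in for the retail data
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "StockCode": [71053, "85123A"],
})

# .copy() makes the filtered result an independent DataFrame
toy_cleaned = toy[~toy["InvoiceNo"].str.startswith("C")].copy()

# This assignment changes only toy_cleaned; the original frame is untouched
toy_cleaned["StockCode"] = toy_cleaned["StockCode"].astype(str)
```

The same idea would apply here by appending `.copy()` to the filter expression that creates `cleaned_data`.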

## Clean StockCodes

In [34]:
cleaned_data["StockCode"] = cleaned_data["StockCode"].astype(str)
#There are two types of non-standard StockCodes: a letter on the end, or an actual word
cleaned_data[~cleaned_data["StockCode"].str.match(r"^\d{5}$")]
#Are there more words? Exclude both plain 5-digit codes and 5-digit codes with a letter suffix
is_plain = cleaned_data["StockCode"].str.match(r"^\d{5}$")
is_lettered = cleaned_data["StockCode"].str.match(r"^\d{5}[a-zA-Z]+$")
alt_stock_codes = cleaned_data[~is_plain & ~is_lettered]["StockCode"].unique()
alt_stock_codes, len(alt_stock_codes) #All "word" StockCodes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_data["StockCode"] = cleaned_data["StockCode"].astype(str)


(array(['POST', 'C2', 'DOT', 'M', 'BANK CHARGES', 'AMAZONFEE', 'DCGS0076',
        'DCGS0003', 'gift_0001_40', 'DCGS0070', 'm', 'gift_0001_50',
        'gift_0001_30', 'gift_0001_20', 'DCGS0055', 'DCGS0072', 'DCGS0074',
        'DCGS0069', 'DCGS0057', 'DCGSSBOY', 'DCGSSGIRL', 'gift_0001_10',
        'S', 'PADS', 'DCGS0004', 'DCGS0073', 'DCGS0071', 'DCGS0068',
        'DCGS0067', 'DCGS0066P'], dtype=object),
 30)
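
The two patterns used above can be folded into one: five digits followed by zero or more letters. The sketch below shows that combined check in plain Python on a handful of codes copied from the outputs in this notebook. `STANDARD_CODE` and `is_standard` are hypothetical names introduced only for illustration.

```python
import re

# Assumed standard format: exactly five digits, optionally followed by letters
STANDARD_CODE = re.compile(r"\d{5}[a-zA-Z]*")

def is_standard(code):
    """Return True when the stock code matches the assumed standard format."""
    return STANDARD_CODE.fullmatch(str(code)) is not None

samples = ["71053", "85123A", "POST", "DCGS0076", "gift_0001_40", "M"]

# Codes that fail the check are the "word" style codes
non_standard = [code for code in samples if not is_standard(code)]
```

Using `fullmatch` removes the need for the explicit `^` and `$` anchors.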

In [35]:
alt_stock_codes = list(alt_stock_codes)
len(alt_stock_codes)


30

In [36]:
alt_stock_codes.remove("M")

In [37]:
alt_stock_codes, len(alt_stock_codes)

(['POST',
  'C2',
  'DOT',
  'BANK CHARGES',
  'AMAZONFEE',
  'DCGS0076',
  'DCGS0003',
  'gift_0001_40',
  'DCGS0070',
  'm',
  'gift_0001_50',
  'gift_0001_30',
  'gift_0001_20',
  'DCGS0055',
  'DCGS0072',
  'DCGS0074',
  'DCGS0069',
  'DCGS0057',
  'DCGSSBOY',
  'DCGSSGIRL',
  'gift_0001_10',
  'S',
  'PADS',
  'DCGS0004',
  'DCGS0073',
  'DCGS0071',
  'DCGS0068',
  'DCGS0067',
  'DCGS0066P'],
 29)
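
Note that the lowercase code `m` is still in the removal list, since `remove("M")` only drops the exact uppercase string. If `m` is meant to be the same code as `M` (an assumption; the exploration notes only mention "M"), a case-insensitive comparison would keep both. A minimal sketch on a few of the codes listed above:

```python
# A few codes copied from the output above
codes = ["POST", "M", "m", "S", "PADS"]

# Assumption: "m" and "M" refer to the same code, so neither is removed
codes_to_remove = [code for code in codes if code.upper() != "M"]
```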