# Section 1
Clean data in `ecom_data.csv`

## Load

In [33]:
import pandas as pd
from datetime import date

df = pd.read_csv(
    'ecom_data.csv',
    dtype={
        'SalesOrder': str,
        'SKU': str,
        'Description': str,
        'UnitPrice': float,
        'CustomerID': int,
        'Channel': str,
        'State': str,
        'Sales': float,
        'Quantity': int
    },
    converters={
        'InvoiceDay': date.fromisoformat
    })

## Clean
First, let's examine what the data looks like:

In [34]:
df.head(10)

Unnamed: 0,SalesOrder,SKU,Description,UnitPrice,CustomerID,Channel,State,InvoiceDay,Sales,Quantity
0,580636,22474,SPACEBOY TV DINNER TRAY,1.95,16746,Mailing,IL,2011-12-05,31.2,16
1,581426,70006,LOVE HEART POCKET WARMER,0.79,17757,Organic Social,WA,2011-12-08,2.37,3
2,575063,22697,GREEN REGENCY TEACUP AND SAUCER,2.95,16764,Display,TX,2011-11-08,8.85,3
3,544065,20726,LUNCH BAG WOODLAND,1.65,14346,Organic Social,TX,2011-02-15,13.2,8
4,568896,85049E,SCANDINAVIAN REDS RIBBONS,1.25,16361,Store,NY,2011-09-29,52.5,42
5,559542,23209,LUNCH BAG DOILEY PATTERN,1.65,17126,Email,CA,2011-07-10,9.9,6
6,569868,23493,VINTAGE DOILY TRAVEL SEWING KIT,1.95,13018,Organic Social,MO,2011-10-06,15.6,8
7,575303,23321,SMALL WHITE HEART OF WICKER,1.65,12893,Store,IA,2011-11-09,13.2,8
8,567145,21154,RED RETROSPOT OVEN GLOVE,1.25,12921,Organic Social,AK,2011-09-16,10.0,8
9,574444,21967,PACK OF 12 SKULL TISSUES,0.39,18122,Store,CA,2011-11-04,39.78,102


Next, let's check for duplicate rows:

In [35]:
# check for duplicate SalesOrder IDs
print(f"Dataset has {df.shape[0]} rows and {df.SalesOrder.unique().shape[0]} unique SalesOrder values.")

Dataset has 406829 rows and 20665 unique SalesOrder values.


John clarified that `SalesOrder` values are sales IDs, so that seems like a lot.

A single `SalesOrder` value can be associated with multiple rows. This is because a single sale can contain multiple items, and even identical items sold via different channels (e.g. a single sale could have a row for `SPACEBOY TV DINNER TRAY`s sold via `Mailing`, and another row for those sold via `Store`).

However, we would expect there to be only one row for each `SKU` and `Channel` combination on a given sale. So we should safely assume we can dedupe the data.

In [36]:
# remove duplicate rows
df.drop_duplicates(inplace=True)

Next, let's take a look at the range of values in `Quantity` and `Sales`.

In [37]:
print(f"Range for Quantity: ({df.Quantity.min()}, {df.Quantity.max()})")
print(f"Range for Sales: ({df.Sales.min()}, {df.Sales.max()})")

Range for Quantity: (-61437, 97405)
Range for Sales: (-127788.96, 127788.96)


So, both can be negative or positive. We expect this because (as John clarified), the dataset also contains refunds, which have negative `Quantity` and `Sales` values. Let's take a look at a few refunds:

In [38]:
df[df.Quantity < 0].head()

Unnamed: 0,SalesOrder,SKU,Description,UnitPrice,CustomerID,Channel,State,InvoiceDay,Sales,Quantity
166,C573283,22776,SWEETHEART 3 TIER CAKE STAND,9.95,18030,Email,IL,2011-10-28,-19.9,-2
184,C538341,22726,ALARM CLOCK BAKELIKE GREEN,3.75,15514,SEO,CA,2010-12-10,-18.75,-5
247,C569114,22832,BROCANTE SHELF WITH HOOKS,10.75,14911,Email,CA,2011-09-30,-43.0,-4
356,C537788,16202E,BLACK PHOTO ALBUM,5.55,15916,Store,TN,2010-12-08,-11.1,-2
411,C566280,22723,SET OF 6 HERB TINS SKETCHBOOK,3.95,12748,Store,CA,2011-09-11,-31.6,-8


Looks like the `SalesOrder` value for refunds begins with a `C`. But let's confirm that:

In [39]:
# are SalesOrder values that start with a C all negative?
df_sc = df[df.SalesOrder.apply(lambda x: x[0] == 'C')]
(df_sc.Sales < 0).all()

False

In [40]:
# are SalesOrder values that start with a C all negative or 0?
(df_sc.Sales <= 0).all()

True

In [41]:
# are SalesOrder values that don't start with a C all positive?
df_nc = df[df.SalesOrder.apply(lambda x: x[0] != 'C')]
(df_nc.Sales > 0).all()

False

Looks like refunds can have either a negative `Quantity` and `Sales` value, or one of 0. But refunds with an amount of 0 don't make much sense. Let's look at them:

In [42]:
print(f"{df[df.Quantity == 0].shape[0]} rows have a Quantity of 0.")
df[df.Quantity == 0].head()

24 rows have a Quantity of 0.


Unnamed: 0,SalesOrder,SKU,Description,UnitPrice,CustomerID,Channel,State,InvoiceDay,Sales,Quantity
625,542608,21175,GIN + TONIC DIET METAL SIGN,2.1,16770,Display,FL,2011-01-30,0.0,0
10443,570482,21930,JUMBO STORAGE BAG SKULLS,2.08,17459,Display,IL,2011-10-10,0.0,0
11021,570488,85173,SET/6 FROG PRINCE T-LIGHT CANDLES,4.96,14096,Display,TX,2011-10-10,0.0,0
16806,538015,22777,GLASS CLOCHE LARGE,8.5,13240,Display,CA,2010-12-09,0.0,0
59354,542607,22770,MIRROR CORNICE,14.95,13148,Display,CA,2011-01-30,0.0,0


These rows don't really contain any information, so we'll drop them too.

In [43]:
df = df[df.Quantity != 0].reset_index(drop=True)

Next, let's confirm that `Sales` is equal to `Quantity` times `UnitPrice`:

In [44]:
# check Sales equals Quantity * UnitPrice
all(df.Sales == (df.Quantity * df.UnitPrice))

False

In [49]:
# examine rows where Sales != Quantity * UnitPrice
print(f"{df[df.Sales != (df.Quantity * df.UnitPrice)].shape[0]} of {df.shape[0]} rows do not meet this critera.")
df[df.Sales != ()].head()

29957 of 256639 rows do not meet this critera.


Unnamed: 0,SalesOrder,SKU,Description,UnitPrice,CustomerID,Channel,State,InvoiceDay,Sales,Quantity
0,580636,22474,SPACEBOY TV DINNER TRAY,1.95,16746,Mailing,IL,2011-12-05,31.2,16
1,581426,70006,LOVE HEART POCKET WARMER,0.79,17757,Organic Social,WA,2011-12-08,2.37,3
2,575063,22697,GREEN REGENCY TEACUP AND SAUCER,2.95,16764,Display,TX,2011-11-08,8.85,3
3,544065,20726,LUNCH BAG WOODLAND,1.65,14346,Organic Social,TX,2011-02-15,13.2,8
4,568896,85049E,SCANDINAVIAN REDS RIBBONS,1.25,16361,Store,NY,2011-09-29,52.5,42


Looks like we can chalk the issue above to a Python rounding error, as these `Sales` values look right.

I think that's good for an inital pass at cleaning. Let's kick this over to a new notebook (`section_1b.ipynb`) for exploration.

**Note**: John M. clarified that the number of different channels for a sale can be exaggerated as it's a manufactured dataset.