In [1]:
import warnings
warnings.filterwarnings("ignore")

import os

import pandas as pd

In [2]:
buyers_transaction_df = pd.read_csv("../data/cleaned/buyers_transaction.csv")

display(buyers_transaction_df.head())

Unnamed: 0,invoiceno,stockcode,description,quantity,invoicedate,unitprice,customerid,country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [3]:
buyers_transaction_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   invoiceno    541909 non-null  object 
 1   stockcode    541909 non-null  object 
 2   description  540455 non-null  object 
 3   quantity     541909 non-null  int64  
 4   invoicedate  541909 non-null  object 
 5   unitprice    541909 non-null  float64
 6   customerid   406829 non-null  float64
 7   country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [4]:
display(buyers_transaction_df[["quantity", "unitprice"]].describe())

Unnamed: 0,quantity,unitprice
count,541909.0,541909.0
mean,9.55225,4.611114
std,218.081158,96.759853
min,-80995.0,-11062.06
25%,1.0,1.25
50%,3.0,2.08
75%,10.0,4.13
max,80995.0,38970.0


**Observation:** The minimum values of both `quantity` and `unitprice` are negative, so some rows carry negative quantities or prices. This raises the question: **_Why are these values negative?_**
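
Before inspecting individual rows, it can help to know how many rows are affected. The sketch below counts negative entries per column; `toy_df` is a hypothetical stand-in for `buyers_transaction_df`, and its values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for buyers_transaction_df (illustrative values only)
toy_df = pd.DataFrame(
    {
        "quantity": [6, -1, -12, 8, 1],
        "unitprice": [2.55, 27.5, 1.65, 0.0, -11062.06],
    }
)

# Comparing against 0 yields a boolean frame; summing it counts the True cells per column
negative_counts = (toy_df[["quantity", "unitprice"]] < 0).sum()
print(negative_counts)
```

The same expression applied to the full frame would show whether negative values are rare outliers or a sizeable share of the data.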

## Inspecting negative `quantity` and `unitprice` values

### 1. When `quantity < 0`

In [5]:
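# Take an explicit copy so that adding a column does not write to a slice of the original frame
negative_qty_df = buyers_transaction_df[buyers_transaction_df["quantity"] < 0].copy()
negative_qty_df["first_char_invoiceno"] = negative_qty_df["invoiceno"].str[0]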

display(negative_qty_df.head())

Unnamed: 0,invoiceno,stockcode,description,quantity,invoicedate,unitprice,customerid,country,first_char_invoiceno
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527.0,United Kingdom,C
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom,C
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom,C
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom,C
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom,C


#### Different values of first character in `invoiceno`

In [6]:
negative_qty_df["first_char_invoiceno"].unique()

array(['C', '5'], dtype=object)
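
`unique()` shows which prefixes occur but not how often. A minimal sketch using `value_counts()` is given below; `toy_negative_qty` is a hypothetical frame with made-up invoice numbers, used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical rows with negative quantity (invoice numbers are made up for illustration)
toy_negative_qty = pd.DataFrame(
    {
        "invoiceno": ["C100001", "C100002", "500003", "C100004"],
        "quantity": [-1, -1, -10, -12],
    }
)

# First character of each invoice number, then the number of rows per prefix
prefix_counts = toy_negative_qty["invoiceno"].str[0].value_counts()
print(prefix_counts)
```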

#### a. Number of unique `customerid` values when first char is 'C'

In [7]:
first_char_C = negative_qty_df[negative_qty_df["first_char_invoiceno"] == "C"]

first_char_C.customerid.nunique()

1589

#### b. Number of unique `customerid` values when first char is '5'

In [8]:
first_char_5 = negative_qty_df[negative_qty_df["first_char_invoiceno"] == "5"]

first_char_5.customerid.nunique()

0

#### c. `unitprice` when first char is 'C'

In [9]:
sorted(first_char_C["unitprice"].unique(), reverse=True)[:5]

[38970.0, 17836.46, 16888.02, 16453.71, 13541.33]

#### d. `unitprice` when first char is '5'

In [10]:
first_char_5["unitprice"].unique()

array([0.])

### 2. When `unitprice < 0`

In [11]:
negative_unit_price_df = buyers_transaction_df[buyers_transaction_df["unitprice"] < 0]

display(negative_unit_price_df.head())

Unnamed: 0,invoiceno,stockcode,description,quantity,invoicedate,unitprice,customerid,country
299983,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00,-11062.06,,United Kingdom
299984,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00,-11062.06,,United Kingdom


### Conclusion

#### 1. When `quantity < 0`

* Rows with a negative `quantity` have an `invoiceno` that starts with either `C` or `5`. The `C` prefix most likely marks orders that the buyer returned or cancelled, in full or in part.

* When the first character of `invoiceno` is `C`, there are 1589 distinct `customerid` values and `unitprice` varies widely (up to 38970.0).

* When the first character of `invoiceno` is `5`, `customerid` is always missing and `unitprice` is always `0`.
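
The per-prefix checks behind these points can also be condensed into a single grouped summary. The following is a sketch under assumptions: `toy` is a hypothetical frame with illustrative values, and the output column names (`rows`, `unique_customers`, `max_unitprice`) are chosen here for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical negative-quantity rows (illustrative values, not taken from the dataset)
toy = pd.DataFrame(
    {
        "invoiceno": ["C100001", "C100002", "C100002", "500003", "500004"],
        "quantity": [-1, -12, -24, -10, -38],
        "unitprice": [27.5, 1.65, 0.29, 0.0, 0.0],
        "customerid": [14527.0, 17548.0, 17548.0, np.nan, np.nan],
    }
)

# Group by the first character of invoiceno and summarise each group.
# nunique() ignores missing values, so a group with only NaN customer IDs yields 0.
summary = (
    toy.assign(prefix=toy["invoiceno"].str[0])
    .groupby("prefix")
    .agg(
        rows=("invoiceno", "size"),
        unique_customers=("customerid", "nunique"),
        max_unitprice=("unitprice", "max"),
    )
)
print(summary)
```

A grouped view like this makes it easier to compare prefixes side by side than running one cell per prefix.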

---

#### 2. When `unitprice < 0`

* The two rows with a negative `unitprice` have an `invoiceno` that starts with `A`, the description `Adjust bad debt`, and a missing `customerid`. They look like accounting adjustments rather than customer purchases.
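
One possible follow-up, not part of the analysis above, is to flag the rows that look like ordinary sales so that later steps can treat the rest separately. The sketch below assumes that an ordinary sale has a purely numeric `invoiceno`, a positive `quantity`, and a positive `unitprice`; this rule, the `is_regular_sale` column name, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows covering the cases seen above (illustrative values only)
toy = pd.DataFrame(
    {
        "invoiceno": ["100001", "C100002", "500003", "A100004"],
        "quantity": [6, -1, -10, 1],
        "unitprice": [2.55, 27.5, 0.0, -11062.06],
    }
)

# Assumed rule: numeric invoice number, positive quantity and positive unit price
toy["is_regular_sale"] = (
    toy["invoiceno"].str.isdigit()
    & (toy["quantity"] > 0)
    & (toy["unitprice"] > 0)
)
print(toy)
```

Whether cancellations and adjustments should be dropped or kept for a separate analysis depends on the question being asked, so the flag only marks rows and does not remove them.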