# 1 - Data Cleaning - Transactional Dataset for Customer Analytics

Transactional datasets are one of the most common customer datasets available to all businesses. Such datasets hold information about the purchase history of customers including the amount, frequency and date of purchases.

The dataset contains the following information:

- **CustomerID**: Unique ID assigned to each customer
- **InvoiceNo**: Unique number assigned for each invoice 
- **AmountSpent**: Amount spent by the customer
- **InvoiceDate**: Date of transaction
- **Country**: Name of the country where the order was placed

- The first step is to import the libraries used in the data cleaning, which here are `pandas` and the standard `warnings` module:

In [19]:
import pandas as pd
import warnings as wn

wn.filterwarnings('ignore')

- Now import the `transaction_dataset` which contains all the information about the customers:

In [4]:
path = '/PATH'

In [5]:
transaction_df = pd.read_csv(path, low_memory=False)

In [6]:
transaction_df.head()

Unnamed: 0,CustomerID,InvoiceNo,AmountSpent,InvoiceDate,Country
0,17850.0,536365,15.3,12/1/2010 8:26,United Kingdom
1,17850.0,536365,20.34,12/1/2010 8:26,United Kingdom
2,17850.0,536365,22.0,12/1/2010 8:26,United Kingdom
3,17850.0,536365,20.34,12/1/2010 8:26,United Kingdom
4,17850.0,536365,20.34,12/1/2010 8:26,United Kingdom


- Look at the shape of the dataframe to see how large it is:

In [7]:
transaction_df.shape

(541908, 5)

As a first step, one must always clean a dataset before performing any analysis.

Data cleaning varies from dataset to dataset, but the main idea is the same: ensure that the data is verified before any insights are generated from it.

So, to begin, one must check whether any null values exist.

In [8]:
transaction_df.isnull().sum()

CustomerID     135080
InvoiceNo           0
AmountSpent         0
InvoiceDate         0
Country             0
dtype: int64

There are two ways to handle missing data. The first is to remove the affected rows, and the second is to fill the gaps with a statistical value, e.g. median, mean, moving average, etc.
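
The two options can be compared on a small example. This is a minimal sketch on toy data: the frame `toy_df` and its values are hypothetical and are not taken from the transactional dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the real dataset)
toy_df = pd.DataFrame({
    "CustomerID": [1.0, np.nan, 3.0, 4.0],
    "AmountSpent": [10.0, 20.0, np.nan, 40.0],
})

# Option 1: remove every row that has at least one missing value
dropped_df = toy_df.dropna()

# Option 2: fill the gaps of a numeric column with a statistic (here, the median)
filled_df = toy_df.copy()
amount_median = filled_df["AmountSpent"].median()
filled_df["AmountSpent"] = filled_df["AmountSpent"].fillna(amount_median)

print(dropped_df.shape)
print(filled_df["AmountSpent"].tolist())
```

Filling is usually reasonable for numeric measurements only; an identifier column has no meaningful median or mean.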

Here, as the null values are in `CustomerID`, it's not a good idea to fill them, since one cannot guess which customer made the purchase in the transactional dataset. So the approach here will be to drop the rows where the customer ID is missing.

In [9]:
transaction_df.dropna(inplace=True)
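
A slightly more explicit variant restricts the null check to `CustomerID` and reports how many rows were removed. This is a sketch on toy data, assuming only the customer ID should decide whether a row is kept; `toy_df`, `cleaned_df` and `rows_removed` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0],
    "InvoiceNo": ["536365", "536366", "536367"],
})

rows_before = len(toy_df)

# Only a missing CustomerID causes a row to be dropped;
# a new frame is returned instead of modifying toy_df in place
cleaned_df = toy_df.dropna(subset=["CustomerID"])

rows_removed = rows_before - len(cleaned_df)
print(f"Removed {rows_removed} of {rows_before} rows")
```

In this dataset the result is the same as a plain `dropna()`, because `CustomerID` is the only column with null values.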

In [10]:
transaction_df.isnull().sum()

CustomerID     0
InvoiceNo      0
AmountSpent    0
InvoiceDate    0
Country        0
dtype: int64

Now there are no missing values.\
Next, one must check the data types of the columns:

In [11]:
transaction_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 406828 entries, 0 to 541907
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   CustomerID   406828 non-null  float64
 1   InvoiceNo    406828 non-null  object 
 2   AmountSpent  406828 non-null  float64
 3   InvoiceDate  406828 non-null  object 
 4   Country      406828 non-null  object 
dtypes: float64(2), object(3)
memory usage: 18.6+ MB


There are three columns stored as objects, so the first step is to convert them to appropriate data types.\
`CustomerID` is a float, but its values are always integers. `InvoiceNo` is stored as an object, but its values are also integers.\
`InvoiceDate` is a date and must be converted accordingly. The last one, `Country`, holds strings.

In [12]:
# convert CustomerID
transaction_df['CustomerID'] = transaction_df['CustomerID'].astype(int)

# convert InvoiceNo
transaction_df['InvoiceNo'] = transaction_df['InvoiceNo'].astype(int)

# convert InvoiceDate
transaction_df['InvoiceDate'] = pd.to_datetime(transaction_df['InvoiceDate'])

# convert Country
transaction_df['Country'] = transaction_df['Country'].astype(str)

Now, one quick glance at the first five rows to check that the values were converted:

In [13]:
transaction_df.head()

Unnamed: 0,CustomerID,InvoiceNo,AmountSpent,InvoiceDate,Country
0,17850,536365,15.3,2010-12-01 08:26:00,United Kingdom
1,17850,536365,20.34,2010-12-01 08:26:00,United Kingdom
2,17850,536365,22.0,2010-12-01 08:26:00,United Kingdom
3,17850,536365,20.34,2010-12-01 08:26:00,United Kingdom
4,17850,536365,20.34,2010-12-01 08:26:00,United Kingdom


In [14]:
transaction_df.dtypes

CustomerID              int64
InvoiceNo               int64
AmountSpent           float64
InvoiceDate    datetime64[ns]
Country                object
dtype: object
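
Conversions like these can be made a little more defensive. The sketch below, on toy data, first confirms that the float IDs are whole numbers before casting, and passes an explicit date format. The month-first format string is an assumption, chosen to be consistent with the converted output shown above (`12/1/2010` became `2010-12-01`).

```python
import pandas as pd

# Toy data (hypothetical rows in the same layout as the raw file)
toy_df = pd.DataFrame({
    "CustomerID": [17850.0, 13047.0],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:34"],
})

# Confirm every float is integer-valued, so the cast cannot silently truncate
is_whole = (toy_df["CustomerID"] % 1 == 0).all()
if is_whole:
    toy_df["CustomerID"] = toy_df["CustomerID"].astype("int64")

# An explicit format removes any day/month ambiguity (month-first is assumed here)
toy_df["InvoiceDate"] = pd.to_datetime(
    toy_df["InvoiceDate"], format="%m/%d/%Y %H:%M"
)

print(toy_df.dtypes)
```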

Now, one must check whether all numeric values in the dataset are proper:

In [15]:
transaction_df.describe()

Unnamed: 0,CustomerID,InvoiceNo,AmountSpent
count,406828.0,406828.0,406828.0
mean,15287.694552,560581.737412,20.401913
std,1713.600528,13105.458755,427.592241
min,12346.0,536365.0,-168469.6
25%,13953.0,549130.0,4.2
50%,15152.0,561873.0,11.1
75%,16791.0,572065.0,19.5
max,18287.0,581587.0,168469.6


The `AmountSpent` column has negative values, even though this column should hold only positive values, i.e. greater than zero. So it must be fixed:

In [16]:
transaction_df['AmountSpent'] = transaction_df['AmountSpent'].abs()

In [17]:
transaction_df.describe()

Unnamed: 0,CustomerID,InvoiceNo,AmountSpent
count,406828.0,406828.0,406828.0
mean,15287.694552,560581.737412,23.407303
std,1713.600528,13105.458755,427.438254
min,12346.0,536365.0,0.0
25%,13953.0,549130.0,4.68
50%,15152.0,561873.0,11.8
75%,16791.0,572065.0,19.8
max,18287.0,581587.0,168469.6
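
Before changing the sign, it can help to see how many rows are affected. The sketch below uses toy values and shows both the absolute-value approach used in this notebook and, as an alternative that is not the notebook's method, filtering the negative rows out. Which one is appropriate depends on what the negative amounts represent, which the dataset description does not state.

```python
import pandas as pd

# Toy data (hypothetical amounts)
toy_df = pd.DataFrame({"AmountSpent": [15.3, -20.34, 22.0, -5.0]})

# Count the rows with a negative amount
negative_mask = toy_df["AmountSpent"] < 0
n_negative = int(negative_mask.sum())
print(f"{n_negative} negative rows")

# Approach used in this notebook: flip the sign of negative amounts
abs_amounts = toy_df["AmountSpent"].abs()

# Alternative (an assumption, not the notebook's method): keep non-negative rows only
non_negative_df = toy_df.loc[~negative_mask]
```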


Now the cleaning process is over!\
So one must save the cleaned dataset in order to use it in future analyses.

In [20]:
transaction_df.to_csv('PATH',
                     index=False)
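
One thing to keep in mind is that CSV files do not store data types, so the datetime column comes back as plain text unless it is parsed again on load. The sketch below illustrates this with an in-memory buffer instead of a real file; the buffer and the toy frame are stand-ins for illustration only.

```python
import io

import pandas as pd

# Toy frame (hypothetical row) with an already-converted datetime column
toy_df = pd.DataFrame({
    "CustomerID": [17850],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00"]),
})

# Write to an in-memory buffer, standing in for a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when the file is read back
reloaded_df = pd.read_csv(buffer, parse_dates=["InvoiceDate"])
print(reloaded_df.dtypes)
```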