# Preprocessing

**Overview**

This notebook preprocesses a dataset of 536,641 entries and 8 columns: 'InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', and 'Country'.

In [1]:
# Load the important libraries.

import pandas as pd

In [2]:
# Load the dataset
df = pd.read_csv('eccomerce_business.csv')

In [3]:
# check for the duplicates
df.duplicated().sum()

5268

The dataset contains 5,268 duplicate rows.

In [4]:
# Drop the duplicates 
df.drop_duplicates(inplace = True)

In [5]:
# Recheck for duplicates
df.duplicated().sum()

0

Now, there are no duplicates in the dataset.
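
The duplicate check and drop above can be sketched on a small frame. This is only an illustration: `toy` and its values are made up and are not taken from the dataset. It shows that `duplicated()` counts only the repeated rows, while `keep=False` marks every member of a duplicate group, which is useful for inspecting them before dropping.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2', 'A3', 'A3'],
    'StockCode': ['X', 'X', 'Y', 'Z', 'Z'],
    'Quantity': [1, 1, 2, 3, 4],
})

# duplicated() flags the second and later copies of a fully identical row
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicate group
dupes = toy[toy.duplicated(keep=False)]

# drop_duplicates() returns a new frame and keeps the first occurrence by default
deduped = toy.drop_duplicates()
n_removed = len(toy) - len(deduped)
```

Rows that share an invoice and stock code but differ in `Quantity` are not treated as duplicates, because the comparison covers all columns unless `subset` is passed.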

In [6]:
# Checking for null values in the dataset
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135037
Country             0
dtype: int64
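
Raw null counts are easier to judge as a share of all rows. A minimal sketch, assuming a made-up frame `toy` rather than the real data: the mean of `isnull()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'Description': ['mug', None, 'lamp', 'pen'],
    'CustomerID': [17850.0, np.nan, np.nan, 13047.0],
})

# isnull() yields booleans, so the column mean is the fraction of missing values
missing_pct = (toy.isnull().mean() * 100).round(2)
```

The same expression applied to `df` would report the percentage of missing values in each column.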

In [7]:
# Fill null values in the CustomerID column with a placeholder ID.
# Assigning the result back avoids the chained inplace call, which raises a
# FutureWarning in recent pandas and will stop working in pandas 3.0.
df['CustomerID'] = df['CustomerID'].fillna('Unknown')

# Check the updated dataset
updated_null_values = df.isnull().sum()
updated_null_values

InvoiceNo         0
StockCode         0
Description    1454
Quantity          0
InvoiceDate       0
UnitPrice         0
CustomerID        0
Country           0
dtype: int64

In [8]:
# Imputing missing values in the 'Description' column with the placeholder 'Unknown'.
# The result is assigned back instead of using a chained inplace call.
df['Description'] = df['Description'].fillna('Unknown')

# Verifying that there are no more null values in the 'Description' column
null_values_after_imputation = df['Description'].isnull().sum()
null_values_after_imputation

0
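
Both placeholder fills can also be done in one call by passing a column-to-value mapping to `DataFrame.fillna`. The sketch below uses a made-up frame `toy`; it is an alternative form, not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'Description': ['mug', None, 'lamp'],
    'CustomerID': [17850.0, np.nan, 13047.0],
})

# A dict maps each column name to its own fill value; a new frame is returned
filled = toy.fillna({'Description': 'Unknown', 'CustomerID': 'Unknown'})

remaining_nulls = int(filled.isnull().sum().sum())
```

Filling a numeric column with a string turns it into an object column that mixes numbers and text, so later numeric operations on 'CustomerID' would need care.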

In [9]:
# Converting 'CustomerID' to object
df['CustomerID'] = df['CustomerID'].astype('object')

# Converting 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
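
When the date format is known, passing it explicitly makes parsing stricter and usually faster, and `errors='coerce'` turns unparseable entries into `NaT` so they can be counted. A minimal sketch: the format string and the sample values are assumptions for illustration, since the notebook does not state how 'InvoiceDate' is formatted.

```python
import pandas as pd

# Hypothetical raw values; the month/day/year layout is an assumption
raw = pd.Series(['12/1/2010 8:26', '12/9/2011 12:50', 'not a date'])

# Entries that do not match the format become NaT instead of raising
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M', errors='coerce')

# Count the rows that failed to parse
n_bad = int(parsed.isna().sum())
```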

In [10]:
# Extracting month and year from 'InvoiceDate'
# (the .dt accessor is built into pandas, so no extra import is needed)
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.strftime('%B')
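
Month names stored as plain strings sort alphabetically (April before January). If calendar order matters in later analysis, one option is an ordered categorical. This sketch uses made-up dates and `dt.month_name()`, which returns English names by default; it is a possible follow-up step, not part of the original notebook.

```python
import pandas as pd

# Hypothetical dates; values are illustrative only
dates = pd.to_datetime(pd.Series(['2011-03-15', '2010-12-01', '2011-01-20']))
months = dates.dt.month_name()

# Calendar order, used as the category order
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# An ordered categorical sorts by category position rather than alphabetically
months_cat = pd.Series(pd.Categorical(months, categories=month_order, ordered=True))
sorted_months = list(months_cat.sort_values())
```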

In [13]:
df.to_csv('eccomerce.csv', index = False)