# Project: Cohort Analysis for Assessing Customer Retention in the E-commerce Industry

## 01 - Data Loading & Initial Cleaning

This notebook handles the initial steps of our cohort analysis project:
- Loading the raw e-commerce dataset
- Inspecting structure and content
- Identifying missing values and data types
- Preparing columns for further analysis (e.g., TotalPrice, InvoiceDate)

Dataset: `Dataset_ecommerce.csv`

In [27]:
# Import the necessary libraries
import pandas as pd
import os

# Define path to the dataset (relative path from 'notebooks' folder)
dataset_path = os.path.join("..", "dataset", "Dataset_ecommerce.csv")

# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)

# Display the shape of the DataFrame and the first few rows
print(f"Datashape : {df.shape}")
df.head(10)

Datashape : (541909, 8)


| | InvoiceNo | InvoiceDate | CustomerID | StockCode | Description | Quantity | UnitPrice | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC1734 | Electronics | 65 | 10.23 | Egypt |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC2088 | Furniture | 95 | 19.61 | Mali |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC3463 | Books | 78 | 61.49 | Mali |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC6228 | Toys | 15 | 24.73 | South Africa |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC2149 | Toys | 50 | 38.83 | Rwanda |
| 5 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC7895 | Toys | 41 | 45.31 | Sierra Leone |
| 6 | 536365 | 2010-12-01 08:26:00 | 17850.0 | SC8608 | Books | 44 | 39.31 | Benin |
| 7 | 536366 | 2010-12-01 08:28:00 | 17850.0 | SC3216 | Toys | 47 | 77.35 | Burkina Faso |
| 8 | 536366 | 2010-12-01 08:28:00 | 17850.0 | SC1236 | Kitchenware | 19 | 35.11 | Nigeria |
| 9 | 536367 | 2010-12-01 08:34:00 | 13047.0 | SC4513 | Furniture | 55 | 3.21 | Cote d'Ivoire |


In [31]:
# Check for missing values in each column
df.isnull().sum()

InvoiceNo           0
InvoiceDate         0
CustomerID     135080
StockCode           0
Description         0
Quantity            0
UnitPrice           0
Country             0
dtype: int64

In [41]:
# Show the percentage of missing values, keeping only columns that have any
missing_percent = df.isnull().mean() * 100
missing_percent[missing_percent > 0].sort_values(ascending=False)

CustomerID    24.926694
dtype: float64
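The missing-value count and percentage can also be gathered into a single table. The following is a minimal sketch, not part of the original notebook: it assumes a small hand-built sample (values borrowed from the preview above, with one `CustomerID` deliberately blanked out) in place of the full CSV, and `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Small illustrative sample; the second CustomerID is deliberately set to NaN
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "CustomerID": [17850.0, np.nan, 17850.0, 13047.0],
    "Quantity": [65, 95, 47, 55],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    summary = pd.DataFrame({
        "missing_count": frame.isnull().sum(),
        "missing_percent": frame.isnull().mean() * 100,
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_percent", ascending=False)


summary = missing_summary(sample)
print(summary)
```

On the sample, only `CustomerID` appears in the result, mirroring what the full dataset shows.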

In [71]:
# Drop rows with a missing CustomerID (the only column with null values)
# .copy() avoids a possible SettingWithCopyWarning when columns are modified later
df = df[df['CustomerID'].notnull()].copy()
# Display the shape after cleaning
print(f"Shape cleaned : {df.shape}")

Shape cleaned : (406829, 8)
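The reported counts are consistent with each other: 541,909 - 135,080 = 406,829 rows. The sketch below shows one way such a check could be automated. It uses `dropna(subset=...)` as an equivalent alternative to the boolean mask, on a hypothetical four-row sample rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing CustomerID
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "CustomerID": [17850.0, np.nan, 17850.0, 13047.0],
})

rows_before = len(sample)
missing_ids = int(sample["CustomerID"].isnull().sum())

# dropna(subset=...) only looks at CustomerID; .copy() returns an independent frame
cleaned = sample.dropna(subset=["CustomerID"]).copy()

# Every removed row should be one that had a missing CustomerID
assert len(cleaned) == rows_before - missing_ids
print(f"Removed {rows_before - len(cleaned)} of {rows_before} rows")
```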


In [73]:
# Confirm that no null values remain
df.isnull().sum()

InvoiceNo      0
InvoiceDate    0
CustomerID     0
StockCode      0
Description    0
Quantity       0
UnitPrice      0
Country        0
dtype: int64

In [75]:
# Inspect the data type of each column
df.dtypes

InvoiceNo       object
InvoiceDate     object
CustomerID     float64
StockCode       object
Description     object
Quantity         int64
UnitPrice      float64
Country         object
dtype: object

In [77]:
# Convert InvoiceDate from string (object) to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.dtypes

InvoiceNo              object
InvoiceDate    datetime64[ns]
CustomerID            float64
StockCode              object
Description            object
Quantity                int64
UnitPrice             float64
Country                object
dtype: object
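The introduction also lists `TotalPrice` among the columns to prepare. A minimal sketch of that step follows, assuming `TotalPrice` means `Quantity * UnitPrice` per invoice line (the notebook does not define it) and using three rows copied from the preview above instead of the full dataset. The cast of `CustomerID` to an integer is an optional extra that is only safe once no missing IDs remain.

```python
import pandas as pd

# Three rows copied from the preview table
sample = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:26:00", "2010-12-01 08:34:00"],
    "CustomerID": [17850.0, 17850.0, 13047.0],
    "Quantity": [65, 95, 55],
    "UnitPrice": [10.23, 19.61, 3.21],
})

# Parse the timestamp strings into datetime64 values
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"])

# With no missing IDs left, the float column can be stored as integers
sample["CustomerID"] = sample["CustomerID"].astype("int64")

# Assumed definition: line total = quantity times unit price
sample["TotalPrice"] = sample["Quantity"] * sample["UnitPrice"]

print(sample.dtypes)
```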