# 🛒 Retail Sales Analysis with Pandas  

## 📊 Dataset  
We will use the **[Online Retail Dataset (UCI Machine Learning Repository)](https://archive.ics.uci.edu/ml/datasets/online+retail)**.  
It contains transactions from an online store in the UK between 2010 and 2011.  

**Main columns:**  
- `InvoiceNo` → Invoice ID  
- `StockCode` → Product code  
- `Description` → Product description  
- `Quantity` → Units purchased (can be negative if returned)  
- `InvoiceDate` → Date and time of transaction  
- `UnitPrice` → Price per unit  
- `CustomerID` → Customer identifier  
- `Country` → Customer’s country  

---

## 🧩 Project Steps  

### 1. Data Loading  
- Download the dataset file (`Online Retail.xlsx`) from the UCI repository.  
- Load the data with `pandas.read_excel()`.  

### 2. Data Cleaning  
- Handle missing values in `CustomerID` and `Description`.  
- Remove rows with `Quantity <= 0` (returns or invalid).  
- Convert `InvoiceDate` to `datetime`.  

### 3. Feature Engineering  
- Create a new column: `TotalPrice = Quantity * UnitPrice`.  
- Extract `Year`, `Month`, `Weekday` from `InvoiceDate`.  

### 4. Data Exploration & Analysis  
- Total sales by country.  
- Top 10 best-selling products.  
- Customers with the highest total spending.  
- Monthly sales trends to detect seasonality.  

### 5. Visualization (using Pandas `.plot()`)  
- Time series plot of sales over time.  
- Bar chart of top 10 products.  
- Pie/bar chart of sales by country.  

### 6. Conclusions  
- Identify the most valuable customers.  
- Detect the products that generate the most revenue.  
- Highlight any seasonal sales patterns.  

---

## ✅ Expected Outcome  
By the end of this project, you will demonstrate the ability to:  
- Clean and prepare real-world retail data with **Pandas**.  
- Perform


### 1. Data Loading

In [2]:
import pandas as pd
orders_df = pd.read_excel('../data/raw/Online Retail.xlsx')

In [3]:
orders_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [6]:
orders_df.shape

(541909, 8)

### 2. Data Cleaning

In [7]:
orders_df.isna().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

<p>Rows missing 'CustomerID' or 'Description' cannot be attributed to a customer or a product, and filling those fields with zeros or leaving them null would not change that, so we drop those rows entirely. We also drop rows where 'Quantity' is zero or negative.</p>
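The cleaning rule described above can be sketched on a few toy rows. The values below are made up for illustration and are not taken from the dataset; `toy_df` and `toy_clean` are hypothetical names. Passing `subset=` to `dropna` makes the intent explicit, although in this dataset only those two columns contain nulls.

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative values only)
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536379", "536380"],
    "Description": ["WHITE METAL LANTERN", None, "POPCORN HOLDER", "RABBIT NIGHT LIGHT"],
    "Quantity": [6, 2, -1, 0],
    "CustomerID": [17850.0, 17850.0, 14527.0, None],
})

# Drop rows missing either key column, then keep strictly positive quantities.
# .copy() gives an independent frame so later column assignments are safe.
toy_clean = (
    toy_df
    .dropna(subset=["CustomerID", "Description"])
    .query("Quantity > 0")
    .copy()
)
print(toy_clean)
```

Only the first row survives: the second lacks a description, the third has a negative quantity, and the fourth lacks a customer.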

In [11]:
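# Keep strictly positive quantities and drop rows with any missing value.
# .copy() avoids SettingWithCopyWarning when columns are added later.
orders_df_clean = orders_df[orders_df['Quantity'] > 0].dropna().copy()
orders_df_clean.head(3)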

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom


In [22]:
orders_df_clean['InvoiceDate'] = pd.to_datetime(orders_df_clean['InvoiceDate'])

### 3. Feature Engineering

In [25]:
orders_df_clean['TotalPrice'] = orders_df_clean['Quantity'] * orders_df_clean['UnitPrice']

In [35]:
orders_df_clean['Year'] = orders_df_clean['InvoiceDate'].dt.year
orders_df_clean['Month'] = orders_df_clean['InvoiceDate'].dt.month
orders_df_clean['Weekday'] = orders_df_clean['InvoiceDate'].dt.weekday
orders_df_clean.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalPrice,Year,Month,Weekday
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3,2010,12,2
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,2010,12,2
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0,2010,12,2
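The `Weekday` value of 2 in the rows above is Wednesday, because `.dt.weekday` numbers days from Monday = 0 to Sunday = 6. If readable labels are preferred, a minimal sketch with `.dt.day_name()` follows; the second timestamp is an arbitrary example, and `dates` is a hypothetical name.

```python
import pandas as pd

# Two sample timestamps; the first matches the rows shown above
dates = pd.Series(pd.to_datetime(["2010-12-01 08:26:00", "2011-12-09 12:50:00"]))

weekday_num = dates.dt.weekday     # Monday=0 ... Sunday=6
weekday_name = dates.dt.day_name() # English day names

print(pd.DataFrame({"num": weekday_num, "name": weekday_name}))
```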


### 4. Data Exploration & Analysis

In [46]:
orders_df_clean.pivot_table(values='TotalPrice', index='Country', aggfunc='sum').head()

Country,TotalPrice
Australia,138521.31
Austria,10198.68
Bahrain,548.4
Belgium,41196.34
Brazil,1143.6


In [31]:
orders_df_clean[orders_df_clean['Country'] == 'Unspecified'].shape

(244, 12)

In [54]:
orders_df_clean.pivot_table(values='Quantity', index='Description', aggfunc='sum').sort_values('Quantity', ascending=False).iloc[:10]

Description,Quantity
"PAPER CRAFT , LITTLE BIRDIE",80995
MEDIUM CERAMIC TOP STORAGE JAR,77916
WORLD WAR 2 GLIDERS ASSTD DESIGNS,54415
JUMBO BAG RED RETROSPOT,46181
WHITE HANGING HEART T-LIGHT HOLDER,36725
ASSORTED COLOUR BIRD ORNAMENT,35362
PACK OF 72 RETROSPOT CAKE CASES,33693
POPCORN HOLDER,30931
RABBIT NIGHT LIGHT,27202
MINI PAINT SET VINTAGE,26076


In [63]:
orders_df_clean.pivot_table(values='TotalPrice', index='CustomerID', aggfunc='sum').sort_values('TotalPrice', ascending=False).iloc[0:10]

CustomerID,TotalPrice
14646.0,280206.02
18102.0,259657.30
17450.0,194550.79
16446.0,168472.50
14911.0,143825.06
...,...
17956.0,12.75
16454.0,6.90
14792.0,6.20
16738.0,3.75


In [56]:
orders_df_clean.pivot_table(values='TotalPrice', index='Month', aggfunc='sum')

Month,TotalPrice
1,569445.04
2,447137.35
3,595500.76
4,469200.361
5,678594.56
6,661213.69
7,600091.011
8,645343.9
9,952838.382
10,1039318.79
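Grouping on `Month` alone folds December 2010 and December 2011 into one bucket, since the data runs from 2010-12-01 to 2011-12-09. One possible alternative, sketched here on made-up rows, is to group by a year-month period. `toy_sales` and its values are hypothetical.

```python
import pandas as pd

# Illustrative rows spanning two Decembers
toy_sales = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01", "2010-12-15", "2011-01-10", "2011-12-05"]),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})

# Month number only: both Decembers end up in the same group
by_month_only = toy_sales.groupby(toy_sales["InvoiceDate"].dt.month)["TotalPrice"].sum()

# Year-month period: each calendar month stays separate
by_year_month = toy_sales.groupby(toy_sales["InvoiceDate"].dt.to_period("M"))["TotalPrice"].sum()

print(by_month_only)
print(by_year_month)
```

The period-based series is also sorted chronologically, which suits a time series plot.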


In [65]:
orders_df_clean.pivot_table(values='TotalPrice', index='Description', aggfunc='sum').sort_values('TotalPrice', ascending=False).iloc[0:10]

Description,TotalPrice
"PAPER CRAFT , LITTLE BIRDIE",168469.6
REGENCY CAKESTAND 3 TIER,142592.95
WHITE HANGING HEART T-LIGHT HOLDER,100448.15
JUMBO BAG RED RETROSPOT,85220.78
MEDIUM CERAMIC TOP STORAGE JAR,81416.73
POSTAGE,77803.96
PARTY BUNTING,68844.33
ASSORTED COLOUR BIRD ORNAMENT,56580.34
Manual,53779.93
RABBIT NIGHT LIGHT,51346.2
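The ranking above includes `POSTAGE` and `Manual`, which look like service or adjustment lines rather than products. That reading is an assumption, not something the dataset documents here. If it holds, a sketch like the following would exclude them before ranking; `toy_lines`, its values, and `non_product_labels` are hypothetical.

```python
import pandas as pd

# Illustrative line items (values are made up)
toy_lines = pd.DataFrame({
    "Description": ["REGENCY CAKESTAND 3 TIER", "POSTAGE", "Manual",
                    "PARTY BUNTING", "REGENCY CAKESTAND 3 TIER"],
    "TotalPrice": [25.5, 18.0, 40.0, 9.9, 12.75],
})

# Assumption: these labels mark non-product lines
non_product_labels = ["POSTAGE", "Manual"]

product_revenue = (
    toy_lines[~toy_lines["Description"].isin(non_product_labels)]
    .groupby("Description")["TotalPrice"]
    .sum()
    .sort_values(ascending=False)
)
print(product_revenue)
```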


### 5. Visualization
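
The three charts listed in the project steps could be produced along these lines with Pandas `.plot()`. This is a minimal sketch on made-up rows, not the notebook's actual figures; `toy_orders` is a hypothetical stand-in for `orders_df_clean`, and matplotlib is assumed to be installed since Pandas plotting relies on it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows with the columns the charts need
toy_orders = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-01-05", "2011-01-20", "2011-02-03", "2011-03-15"]),
    "Description": ["POPCORN HOLDER", "PARTY BUNTING", "POPCORN HOLDER", "RABBIT NIGHT LIGHT"],
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany"],
    "TotalPrice": [30.0, 12.5, 45.0, 20.0],
})

# Aggregations behind each chart
monthly_sales = toy_orders.set_index("InvoiceDate")["TotalPrice"].resample("MS").sum()
top_products = toy_orders.groupby("Description")["TotalPrice"].sum().nlargest(10)
sales_by_country = toy_orders.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False)

# One axis per chart so the plots do not overlap
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
monthly_sales.plot(ax=axes[0], kind="line", title="Sales over time")
top_products.plot(ax=axes[1], kind="bar", title="Top products by revenue")
sales_by_country.plot(ax=axes[2], kind="bar", title="Sales by country")
fig.tight_layout()
```

Bar charts are used for the country breakdown because a pie chart becomes hard to read once one country dominates the total.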

### 6. Conclusions

In [89]:
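# RFM customer analysis
# Frequency
# The dataset covers 2010-12-01 to 2011-12-09, so we take the last three months (after 2011-09-09) as the reference window.
# Note: aggfunc='count' counts invoice line items per customer, not distinct invoices.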
last_three_months = orders_df_clean[orders_df_clean['InvoiceDate'] > '2011-09-09']
last_three_months.pivot_table(values='InvoiceNo', index='CustomerID', aggfunc='count').sort_values('InvoiceNo', ascending=False).iloc[:10]

CustomerID,InvoiceNo
14096.0,5086
17841.0,3406
14911.0,2717
12748.0,2593
16549.0,838
14456.0,729
14646.0,701
14606.0,662
16360.0,662
15311.0,654
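
Each row of the dataset is one line item, so counting `InvoiceNo` values tallies lines rather than orders. If frequency is meant as the number of distinct invoices, `nunique` is one way to get it. The sketch below uses made-up customers and placeholder invoice IDs to show the difference.

```python
import pandas as pd

# Illustrative line items: one customer with a single 3-line invoice,
# another with two 1-line invoices (IDs are placeholders)
toy_lines = pd.DataFrame({
    "CustomerID": [14096.0, 14096.0, 14096.0, 17841.0, 17841.0],
    "InvoiceNo": ["A1", "A1", "A1", "B1", "B2"],
})

line_counts = toy_lines.groupby("CustomerID")["InvoiceNo"].count()       # rows per customer
invoice_counts = toy_lines.groupby("CustomerID")["InvoiceNo"].nunique()  # distinct invoices

print(pd.DataFrame({"lines": line_counts, "invoices": invoice_counts}))
```

The two measures can rank customers differently: the first customer has more lines but fewer invoices.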


In [140]:
# Recency
# Most recent customers and how much they spent (last week of data)
# Add a date-only column (YYYY-MM-DD) for tidier output and keep the last 7 days (Dec 3rd was a holiday)
orders_df_clean['Date'] = orders_df_clean['InvoiceDate'].dt.date
last_week = orders_df_clean[orders_df_clean['InvoiceDate'] >= '2011-12-02']
recency_df = last_week.pivot_table(
    values='TotalPrice', 
    index=['Date', 'CustomerID'], 
    aggfunc='sum'
).sort_values(['Date', 'TotalPrice'], ascending=[False, False])
# Show only the top 3 spenders for each day
recency_df.groupby(level='Date').head(3)

Date,CustomerID,TotalPrice
2011-12-09,16446.0,168469.6
2011-12-09,12433.0,2638.69
2011-12-09,14051.0,1203.9
2011-12-08,18102.0,11016.1
2011-12-08,16210.0,3599.4
2011-12-08,17949.0,1944.0
2011-12-07,16000.0,12393.7
2011-12-07,14646.0,11477.42
2011-12-07,17511.0,4949.24
2011-12-06,17389.0,5134.88


In [142]:
# Monetary
orders_df_clean.pivot_table(values='TotalPrice', index='CustomerID', aggfunc='sum').sort_values('TotalPrice', ascending=False).iloc[:10]

CustomerID,TotalPrice
14646.0,280206.02
18102.0,259657.3
17450.0,194550.79
16446.0,168472.5
14911.0,143825.06
12415.0,124914.53
14156.0,117379.63
17511.0,91062.38
16029.0,81024.84
12346.0,77183.6
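
The three RFM views above are computed separately. One possible way to bring them into a single table per customer is a named aggregation, sketched here on made-up rows. `toy_orders`, `snapshot_date`, and the choice of "days since last purchase" as the recency measure are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Illustrative line items (values and invoice IDs are placeholders)
toy_orders = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 2.0, 3.0],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "C1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-08", "2011-09-15", "2011-09-15", "2011-12-09"]
    ),
    "TotalPrice": [20.0, 30.0, 5.0, 7.5, 100.0],
})

# Reference date: one day after the last transaction, so recency is at least 1
snapshot_date = toy_orders["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy_orders.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                                 # distinct invoices
    Monetary=("TotalPrice", "sum"),                                     # total spend
)
print(rfm)
```

A table like this can then be sorted on any of the three columns to identify the most valuable customers.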
