# Problem Statement
## Scenario
A medium-sized UK-based online retailer wants to harness its transaction data to improve profitability and customer experience. The company is seeking actionable insights to optimize sales strategies, manage stock efficiently, detect fraud, and target marketing efforts.
## Problem Statement
The retailer lacks visibility into key business metrics such as top-selling products, customer buying patterns, and sales trends over time. Inventory inefficiencies and possible fraudulent activity are suspected but have never been formally analyzed. Management wants to segment the customer base, improve marketing ROI, and boost retention.
## My role and objectives:
My role in this project is that of a data analyst at the company, responsible for trend analysis, customer segmentation, anomaly detection, product bundling, and key-metric reporting.
## Planned tasks:
	•	Analyze sales trends and seasonality to help with forecasting and inventory optimization.
	•	Identify top-performing products and categories.
	•	Segment customers based on purchase patterns or geographic region to design personalized marketing.
	•	Calculate key metrics: average order value, customer lifetime value.
	•	Examine product performance by time, such as daily/weekly/monthly sales.
	•	Visualize geographic sales distribution and spot opportunities for expansion.
	•	Build and evaluate a predictive model for the sales performance.
	•	Provide actionable recommendations for marketing and operational improvements.
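
As an illustration of the key-metrics task, the following is a minimal sketch of how average order value could be computed from line-level transactions. It assumes the column names `InvoiceNo`, `CustomerID`, `Quantity`, and `UnitPrice` from the dataset; the `sample` frame and its values are hypothetical, and total historical revenue per customer is used only as a naive stand-in for customer lifetime value, not as a full CLV model.

```python
import pandas as pd

# Hypothetical toy sample that mirrors the dataset's column names
sample = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3"],
    "CustomerID": [1.0, 1.0, 1.0, 2.0],
    "Quantity": [2, 1, 3, 4],
    "UnitPrice": [5.0, 10.0, 2.0, 2.5],
})

# Revenue of each invoice line
sample["Revenue"] = sample["Quantity"] * sample["UnitPrice"]

# Average order value: mean of the per-invoice totals
order_totals = sample.groupby("InvoiceNo")["Revenue"].sum()
average_order_value = order_totals.mean()

# Assumption: historical revenue per customer as a rough proxy for lifetime value
revenue_per_customer = sample.groupby("CustomerID")["Revenue"].sum()

print(f"Average order value: {average_order_value:.2f}")
print(revenue_per_customer)
```

On the real data, rows with a missing `CustomerID` or a negative `Quantity` would likely need handling before these figures are meaningful.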

In [5]:
# E-commerce Data Analysis Project
# Purpose: Analyze UK-based online retail transactions for business insights

# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings

# Configuration settings
warnings.filterwarnings('ignore')  # Suppress warnings
#plt.style.use('seaborn-v0_8-darkgrid')  # Set visualization style
pd.set_option('display.max_columns', None)  # Display all columns in DataFrame

# Load the dataset
try:
    df = pd.read_csv('data.csv', encoding='ISO-8859-1')
    print("Dataset loaded successfully")
    print(f"Shape of dataset: {df.shape}")
except FileNotFoundError:
    print("Error: Dataset file not found. Please ensure 'data.csv' is in the current working directory.")

Dataset loaded successfully
Shape of dataset: (541909, 8)


# Data Understanding & EDA
Perform a complete exploratory analysis: shape, nulls, data types, descriptive statistics, correlations, and visualizations.

In [8]:
# Data Understanding & EDA
# Preview the first rows and the overall structure
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None

# Null check
print("\nMissing Values in each column:")
print(df.isnull().sum())

# Data Types analysis
print("\nData Types of each column:")
print(df.dtypes)

# Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))

First 5 rows of the dataset:
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------     

In [None]:
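# Correlation Analysis
# Restrict to numeric columns: text columns such as Description cannot be correlated
corr_matrix = df.select_dtypes(include='number').corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Visualizations
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()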
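
One caveat when reading the correlation matrix: `CustomerID` is stored as a float but is an identifier, so its correlation with other columns carries no business meaning. A minimal sketch of limiting the correlation to genuine measures is shown below; the `sample` frame and its values are hypothetical, and treating only `Quantity` and `UnitPrice` as measures is an assumption based on the column list above.

```python
import pandas as pd

# Hypothetical toy sample that mirrors a few of the dataset's columns
sample = pd.DataFrame({
    "Description": ["A", "B", "C", "D"],
    "Quantity": [1, 2, 3, 4],
    "UnitPrice": [4.0, 3.0, 2.0, 1.0],
    "CustomerID": [17850.0, 13047.0, 12583.0, 17850.0],
})

# Keep only true measures; identifiers and text columns are left out
measures = sample[["Quantity", "UnitPrice"]]
corr = measures.corr()
print(corr)
```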