# Online Retail Exploratory Data Analysis

## Case Study
Work with transactional data from an online retail store. The dataset contains information about customer purchases, including product details, quantities, prices, and timestamps. 

**Task:** 

Explore and analyze this dataset to gain insights into the store's sales trends, customer behavior, and popular products that can drive strategic business decisions and enhance the store's overall performance in the competitive online retail market.

**Takeaways:**
- Identifying patterns, outliers, and correlations in the data allows us to make data-driven decisions and recommendations to optimize the store's operations and improve customer satisfaction.
- Through visualizations and statistical analysis, we will uncover key trends, such as the busiest sales months, best-selling products, and the store's most valuable customers.

## Project Objectives
1. Describe data to answer key questions to uncover insights
2. Gain valuable insights that will help improve online retail performance
3. Provide analytic insights and data-driven recommendations

## Dataset

The dataset contains transactional data from an online retail store, covering December 2010 to December 2011. It is available as an .xlsx file named `Online Retail.xlsx`.

It can also be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx).

The dataset contains the following columns:

- InvoiceNo: Invoice number of the transaction
- StockCode: Unique code of the product
- Description: Description of the product
- Quantity: Quantity of the product in the transaction
- InvoiceDate: Date and time of the transaction
- UnitPrice: Unit price of the product
- CustomerID: Unique identifier of the customer
- Country: Country where the transaction occurred

## Tasks

1. Load the dataset into a Pandas DataFrame and display the first few rows to get an overview of the data.
2. Perform data cleaning by handling missing values, if any, and removing any redundant or unnecessary columns.
3. Explore the basic statistics of the dataset, including measures of central tendency and dispersion.
4. Perform data visualization to gain insights into the dataset. Generate appropriate plots, such as histograms, scatter plots, or bar plots, to visualize different aspects of the data.
5. Analyze the sales trends over time. Identify the busiest months and days of the week in terms of sales.
6. Explore the top-selling products and countries based on the quantity sold.
7. Identify any outliers or anomalies in the dataset and discuss their potential impact on the analysis.
8. Draw conclusions and summarize your findings from the exploratory data analysis.

### Import Necessary Libraries

In [8]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [9]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'

In [10]:

retail_data = pd.read_excel(url)
retail_data.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


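Note: the execution count of the next cell shows that it was run after the cleaning steps further down, so its output describes the cleaned DataFrame (255,400 rows, 9 columns) rather than the raw file.

In [42]: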
retail_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 255400 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   InvoiceNo   255400 non-null  object        
 1   Quantity    255400 non-null  int64         
 2   UnitPrice   255400 non-null  float64       
 3   CustomerID  255400 non-null  float64       
 4   Country     255400 non-null  object        
 5   Date        255400 non-null  datetime64[ns]
 6   Year        255400 non-null  int32         
 7   Month       255400 non-null  int32         
 8   Day         255400 non-null  int32         
dtypes: datetime64[ns](1), float64(2), int32(3), int64(1), object(2)
memory usage: 16.6+ MB


We can drop the columns that are not necessary for our analysis.

In our case, we can drop the `StockCode` and `Description` columns.

In [12]:
retail_data.drop(['StockCode', 'Description'], axis=1, inplace=True)

In [13]:
retail_data.head()

| | InvoiceNo | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|
| 0 | 536365 | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [14]:
retail_data['Date'] = pd.to_datetime(retail_data['InvoiceDate'])

In [20]:
retail_data['Year'] = retail_data['Date'].dt.year
retail_data['Month'] = retail_data['Date'].dt.month
retail_data['Day'] = retail_data['Date'].dt.day
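
The `Day` column above holds the day of the month, while the task list also asks for the busiest days of the week. A minimal sketch of how a weekday column could be derived is shown below. The `sample` frame, the `DayOfWeek` and `DayName` column names, and the `quantity_by_weekday` variable are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for retail_data; the timestamps are copied from rows shown
# in this notebook, the frame itself is only for illustration.
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2010-12-01 08:26:00",
        "2010-12-01 11:52:00",
        "2011-12-09 10:26:00",
    ]),
    "Quantity": [6, 56, 5],
})

# dt.dayofweek numbers the days Monday=0 ... Sunday=6;
# dt.day_name() returns the weekday label.
sample["DayOfWeek"] = sample["Date"].dt.dayofweek
sample["DayName"] = sample["Date"].dt.day_name()

# Total quantity sold per weekday.
quantity_by_weekday = sample.groupby("DayName")["Quantity"].sum()
print(quantity_by_weekday)
```

On the real data, the same two `dt` accessors could be applied to `retail_data['Date']` before grouping.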

In [23]:
retail_data.drop(['InvoiceDate'], axis=1, inplace=True)

In [26]:
retail_data.columns

Index(['InvoiceNo', 'Quantity', 'UnitPrice', 'CustomerID', 'Country', 'Date',
       'Year', 'Month', 'Day'],
      dtype='object')

Let's check the dataset for missing values.

In [31]:
retail_data.isnull().sum(), retail_data.shape

(InvoiceNo          0
 Quantity           0
 UnitPrice          0
 CustomerID    135080
 Country            0
 Date               0
 Year               0
 Month              0
 Day                0
 dtype: int64,
 (541909, 9))

In [32]:
retail_data[retail_data.isnull().any(axis=1)]

| | InvoiceNo | Quantity | UnitPrice | CustomerID | Country | Date | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|
| 622 | 536414 | 56 | 0.00 | NaN | United Kingdom | 2010-12-01 11:52:00 | 2010 | 12 | 1 |
| 1443 | 536544 | 1 | 2.51 | NaN | United Kingdom | 2010-12-01 14:32:00 | 2010 | 12 | 1 |
| 1444 | 536544 | 2 | 2.51 | NaN | United Kingdom | 2010-12-01 14:32:00 | 2010 | 12 | 1 |
| 1445 | 536544 | 4 | 0.85 | NaN | United Kingdom | 2010-12-01 14:32:00 | 2010 | 12 | 1 |
| 1446 | 536544 | 2 | 1.66 | NaN | United Kingdom | 2010-12-01 14:32:00 | 2010 | 12 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541536 | 581498 | 5 | 4.13 | NaN | United Kingdom | 2011-12-09 10:26:00 | 2011 | 12 | 9 |
| 541537 | 581498 | 4 | 4.13 | NaN | United Kingdom | 2011-12-09 10:26:00 | 2011 | 12 | 9 |
| 541538 | 581498 | 1 | 4.96 | NaN | United Kingdom | 2011-12-09 10:26:00 | 2011 | 12 | 9 |
| 541539 | 581498 | 1 | 10.79 | NaN | United Kingdom | 2011-12-09 10:26:00 | 2011 | 12 | 9 |


Let's calculate how much of the data we will lose if we drop the rows with missing values.

In [34]:

total_rows_before = retail_data.shape[0]

retail_data_dropped = retail_data.dropna()

total_rows_after = retail_data_dropped.shape[0]

# Calculate the percentage of information lost
information_lost = (total_rows_before - total_rows_after) / total_rows_before * 100

information_lost


24.926694334288598

Dropping the rows with missing values removes about 24.9% of the data (135,080 of 541,909 rows), all of it because `CustomerID` is missing. Since this project is an exploratory data analysis, we will drop these rows.

In [35]:
retail_data.dropna(inplace=True)
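
Because `CustomerID` is the only column with missing values, the loss can also be read per column, and the drop can be limited to that column. The following is a small sketch on made-up data. The `sample` frame and its values are assumptions for illustration only; `isna().mean()` and `dropna(subset=...)` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for retail_data (values are illustrative, not from the dataset).
sample = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3"],
    "Quantity": [6, 8, 1, 2],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
})

# isna().mean() gives the fraction of missing values in each column.
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# Drop only the rows where CustomerID is missing, leaving `sample` unchanged.
with_customer = sample.dropna(subset=["CustomerID"])
print(len(sample), "->", len(with_customer))
```

Returning a new frame instead of using `inplace=True` keeps the original rows available, which can help if the rows without a customer are still wanted for sales totals later.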

Let's see if any of the rows are duplicated.

In [40]:
retail_data.duplicated().sum()

151429

In [41]:
retail_data.drop_duplicates(inplace=True)
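
One caveat on the duplicate count: `StockCode` and `Description` were dropped before this check, so two different products bought in the same invoice at the same quantity and price are indistinguishable and get flagged as duplicates. The sketch below illustrates the effect on a toy invoice. The `sample` frame is an assumption built for illustration, with values borrowed from the first rows of the dataset plus one deliberately repeated line.

```python
import pandas as pd

# Toy invoice: two different products with the same quantity and price,
# plus one exact repeat of the second line.
sample = pd.DataFrame({
    "InvoiceNo": [536365, 536365, 536365],
    "StockCode": ["71053", "84029G", "84029G"],
    "Quantity": [6, 6, 6],
    "UnitPrice": [3.39, 3.39, 3.39],
})

# With the product code present, only the exact repeat is flagged.
dupes_with_product = sample.duplicated().sum()

# Without it, the distinct product line is flagged as well.
dupes_without_product = sample.drop(columns=["StockCode"]).duplicated().sum()

print(dupes_with_product, dupes_without_product)
```

If quantity totals matter for later steps, it may be safer to run the duplicate check before dropping the product columns.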