<div style='background-color:#714737; padding: 15px; border-radius: 5px;'>
<h1 style='color:#FEFEFE; text-align:center;'>Chocolate Sales Analysis – Data Visualisation Notebook</h1>
</div>


<h2 style='color:#714737;'>Objectives</h2>

- Develop advanced data visualisations.
- Test the hypotheses formulated in the ideation phase of the project.
- Recreate the visualisations from the Sales Data Analysis notebook in Plotly.

<h2 style='color:#714737;'>Inputs</h2>

- **Dataset:** `df_cleaned.csv`
- **Required Libraries:** Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, SciPy, Plotly, Pingouin
- **Columns of Interest:**
  - **Sales and Performance metrics:** Amount, Boxes Shipped, Revenue per Box Shipped
  - **Product details:** Product, Product Category
  - **Geographic Information:** Country
  - **Time Based Columns:** Month, Month Name

<h2 style='color:#714737;'>Outputs</h2>

- **Advanced Data Visualisations:**

  - Line chart showing monthly sales revenue trends
  - Sales volume heatmap by Month and Country
  - Choropleth map showing total sales by Country
  - Treemap showing sales breakdown by Country and Product Category
  - Bar chart showing total Sales by Country
  - Bar chart showing Boxes Shipped per Country and Product Category
  - Scatter plot showing Sales amount vs Boxes Shipped
  - Box plot showing Revenue per Box Shipped by Product Category
  - Bar chart showing top 5 Products by Total Revenue
  - Pie chart showing total Sales by Product Category
  - Box plot showing Product performance distribution
  - Bar chart showing Boxes Shipped per Product Category
  - Bar chart showing top selling Products by Revenue
  - Bubble chart showing the relationship between Boxes Shipped and Sales Revenue




---


<div style='background-color:#714737; padding: 15px; border-radius: 5px;'>
<h1 style='color:#FEFEFE; text-align:center;'>Section 2 :  Basic Data Visualisation</h1>
</div>


<h2 style='color:#714737;'>Changing work directory</h2>


To run the notebook in the editor, the working directory must be changed from the notebook's folder to its parent folder. We first get the current directory with `os.getcwd()`.


In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\conor\\Desktop\\DA course\\Chocolate-Sales-Analysis\\jupyter_notebooks'

Then we make the parent of the current directory the new current directory by using:

- `os.path.dirname()` to get the parent directory
- `os.chdir()` to set the new current directory


In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory.")

You set a new current directory.


Confirming the new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\conor\\Desktop\\DA course\\Chocolate-Sales-Analysis'

<h2 style='color:#714737;'>Importing Libraries and Packages</h2>


We load the Python packages used in this project: NumPy for numerical operations and array handling, Pandas for data manipulation and analysis, and Matplotlib, Seaborn and Plotly for data visualisation. SciPy and Pingouin provide the statistical tests.


In [4]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import plotly.express as px
import scipy.stats as stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go
from plotly.graph_objects import Figure, Scatter
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.io as pio



In [5]:
! pip install pingouin

import pingouin as pg



We load the previously cleaned CSV dataset into a DataFrame with `pd.read_csv()`.


In [6]:
df = pd.read_csv("Output/df_cleaned.csv")

<h2 style='color:#714737;'>Advanced Visualisations and Statistical Tests</h2>


<h3 style='color:#714737;'>Hypothesis testing</h3>


We first preview the DataFrame to check that the correct dataset has been loaded.


In [7]:
df.head()

Unnamed: 0,Country,Product,Date,Amount,Boxes Shipped,Product Category,Month,Day,Month Name,Revenue per Box Shipped
0,UK,Mint Chip Choco,2022-01-04,5320.0,180,Flavored Chocolate,1,4,January,30.0
1,India,85% Dark Bars,2022-08-01,7896.0,94,Dark Chocolate,8,1,August,84.0
2,India,Peanut Butter Cubes,2022-07-07,4501.0,91,Nut-based Chocolate,7,7,July,49.0
3,Australia,Peanut Butter Cubes,2022-04-27,12726.0,342,Nut-based Chocolate,4,27,April,37.0
4,UK,Peanut Butter Cubes,2022-02-24,13685.0,184,Nut-based Chocolate,2,24,February,74.0


<h3 style='color:#714737;'>Hypothesis 1</h3>

Hypothesis: There is a significant difference in sales revenue across different months, meaning sales follow a seasonal pattern with peak holiday months.

Null hypothesis: There is no significant difference in sales revenue across different months, meaning sales do not follow a seasonal trend.


In [8]:
# Aggregate total sales by month
monthly_sales = df.groupby("Month Name", as_index=False)["Amount"].sum()

# Order months correctly
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August"]
monthly_sales["Month Name"] = pd.Categorical(monthly_sales["Month Name"], categories=month_order, ordered=True)
monthly_sales = monthly_sales.sort_values("Month Name")

# Create line chart
fig_line = px.line(monthly_sales, x="Month Name", y="Amount", markers=True,
                   title="Monthly Sales Revenue Trends",
                   labels={"Amount": "Total Sales Revenue ($)", "Month Name": "Month"})

fig_line.show()

In [9]:
# Aggregate sales volume (boxes shipped) by country and month
heatmap_data = df.groupby(["Country", "Month Name"], as_index=False)["Boxes Shipped"].sum()

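# Pivot data for heatmap, then put the month columns in calendar order
# (pivot sorts column labels alphabetically, which would not match x=month_order below)
heatmap_pivot = heatmap_data.pivot(index="Country", columns="Month Name", values="Boxes Shipped")
heatmap_pivot = heatmap_pivot.reindex(columns=month_order)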

# Create heatmap
fig_heatmap = px.imshow(heatmap_pivot.values,
                        labels=dict(x="Month Name", y="Country", color="Boxes Shipped"),
                        x=month_order,
                        y=heatmap_pivot.index,
                        title="Sales Volume Heatmap by Month and Country",
                        color_continuous_scale="viridis")

fig_heatmap.show()
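
The heatmap passes the pivot's raw values together with separately supplied month labels, so the two must be in the same order. The sketch below uses a small hypothetical frame (not the project data) to show that `DataFrame.pivot` returns its columns sorted alphabetically, and how `reindex` restores calendar order.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself uses df from Output/df_cleaned.csv
toy = pd.DataFrame({
    "Country": ["UK", "UK", "UK", "India", "India", "India"],
    "Month Name": ["January", "February", "March"] * 2,
    "Boxes Shipped": [10, 20, 30, 40, 50, 60],
})
toy_month_order = ["January", "February", "March"]

# pivot sorts the column labels, so the months come back alphabetically
pivot = toy.pivot(index="Country", columns="Month Name", values="Boxes Shipped")
print(list(pivot.columns))

# reindex puts the columns back into calendar order before plotting
pivot_ordered = pivot.reindex(columns=toy_month_order)
print(list(pivot_ordered.columns))
```

Passing `pivot_ordered.columns` as the x labels keeps the labels and values aligned by construction.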

- The statistical tests for Hypothesis 1, which aimed to detect significant differences in sales across months, failed primarily because of the dataset structure. The data had been aggregated, leaving only one total sales value per month.

  This is a problem for tests such as ANOVA and Kruskal-Wallis, which rely on multiple observations per group to estimate within-group variance. With only one value per month, these tests cannot assess variability, which leads to errors such as zero degrees of freedom or invalid results. In addition, the limited variation across some months may have reduced statistical power, preventing even non-parametric tests from detecting meaningful trends.

- As a result, we rely on our interpretation of the charts above, which we believe show a clear difference in total sales revenue across months. We consider this enough to suggest that the null hypothesis should be rejected, although without a valid test it remains a visual judgment. The dataset also covers only January to August, so we cannot see how much effect events such as Halloween and Christmas would have had.
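
One possible workaround, which was not run in this notebook, is to test the un-aggregated rows instead of the monthly totals, since `df.head()` shows one row per transaction. The sketch below illustrates the idea with `scipy.stats.kruskal` on synthetic data; the frame `synthetic` and its values are invented for illustration and say nothing about the real result.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
months = ["January", "February", "March", "April",
          "May", "June", "July", "August"]

# Synthetic stand-in for the transaction-level data (assumption, not real sales)
synthetic = pd.DataFrame({
    "Month Name": rng.choice(months, size=400),
    "Amount": rng.gamma(shape=2.0, scale=2500.0, size=400),
})

# One array of transaction amounts per month gives each group many observations
groups = [g["Amount"].to_numpy() for _, g in synthetic.groupby("Month Name")]

h_stat, p_value = stats.kruskal(*groups)
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

With the real data, the equivalent Pingouin call would presumably be `pg.kruskal(data=df, dv="Amount", between="Month Name")`, mirroring the calls used for the other hypotheses.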

<h3 style='color:#714737;'>Hypothesis 2</h3>

Hypothesis: There is a significant difference in chocolate sales across different countries.

Null Hypothesis: There is no significant difference in chocolate sales across different countries.

In [10]:
# Checking for normality to decide which test to use
pg.normality(data=df, dv="Amount", group="Country")

Unnamed: 0_level_0,W,pval,normal
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
UK,0.936965,4.924436e-07,False
India,0.915609,8.640916e-09,False
Australia,0.945493,5.369026e-07,False
New Zealand,0.942363,1.857252e-06,False
USA,0.921037,2.937018e-08,False
Canada,0.933991,3.486034e-07,False


In [11]:
# Using Kruskal-Wallis test to compare sales revenue by country 
pg.kruskal(data=df, dv="Amount", between="Country")

Unnamed: 0,Source,ddof1,H,p-unc
Kruskal,Country,5,0.937847,0.967421


- With a p-unc value of 0.967, we cannot reject the null hypothesis: the test provides no evidence of a significant difference in sales revenue between countries.
- The visualisations below show whether this is also apparent when looking at the data through various charts.
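
As a complement to the p-value, an effect size indicates how small the between-country difference is. The sketch below computes epsilon squared for the Kruskal-Wallis test, taken here as H / (n - 1). It assumes the test used all 1,094 rows reported later in the notebook; that count is an assumption about this cell, not something its output states.

```python
# H statistic copied from the Kruskal-Wallis output above
h_stat = 0.937847

# Assumption: every row of df entered the test (1,094 rows)
n_obs = 1094

# Epsilon squared: share of rank variance associated with the grouping
epsilon_squared = h_stat / (n_obs - 1)
print(f"epsilon squared = {epsilon_squared:.5f}")
```

A value this close to zero is consistent with the test finding no meaningful difference between countries.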

In [12]:
# Aggregate total sales by country
country_sales = df.groupby("Country", as_index=False)["Amount"].sum()

# Create Choropleth Map
fig_map = px.choropleth(country_sales,
                        locations="Country",
                        locationmode="country names",
                        color="Amount",
                        title="Total Sales by Country",
                        labels={"Amount": "Total Sales ($)"},
                        color_continuous_scale="viridis")

fig_map.show()

- The world map shows some variation across countries, but there is only roughly a 15% difference between the country with the most sales and the country with the least.
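
A figure like the 15% gap can be computed directly from the aggregated totals rather than read off the map. The sketch below uses made-up totals in a hypothetical `toy_sales` frame and expresses the gap relative to the top country; the choice of base is an assumption, since the text does not say which one was used.

```python
import pandas as pd

# Hypothetical totals, not the real country figures
toy_sales = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Amount": [1000.0, 900.0, 850.0],
})

top = toy_sales["Amount"].max()
bottom = toy_sales["Amount"].min()

# Gap between best and worst country, as a percentage of the best
gap_pct = (top - bottom) / top * 100
print(f"Gap between top and bottom country: {gap_pct:.1f}%")
```

Applied to the notebook's `country_sales` frame, the same three lines would give the exact figure.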

In [13]:
# Aggregate sales by country and product category
treemap_data = df.groupby(["Country", "Product Category"], as_index=False)["Amount"].sum()

# Create Treemap
fig_treemap = px.treemap(treemap_data,
                          path=["Country", "Product Category"],
                          values="Amount",
                          title="Sales Breakdown by Country & Product Category",
                          labels={"Amount": "Total Sales ($)"})

fig_treemap.show()

- This treemap also shows the total sales of each country, while showcasing which categories contribute the most to the total in each country.

In [14]:
# Aggregate total sales by country
total_sales_by_country = df.groupby("Country", as_index=False)["Amount"].sum()

# Custom color palette
custom_palette = ['#714737','#443025','#D6B58C','#3A211A','#E5B07C','#443025']

# Create Bar Chart in Plotly
fig_sales_country = px.bar(total_sales_by_country, 
                           x="Country", 
                           y="Amount", 
                           color="Country", 
                           title="Total Sales by Country",
                           labels={"Amount": "Total Sales ($)", "Country": "Country"},
                           color_discrete_sequence=custom_palette)

fig_sales_country.show()

- This bar chart shows more simply than the previous charts how uniform total sales were across countries.

In [15]:
# Aggregate total boxes shipped by country and product category
boxes_country_category = df.groupby(["Country", "Product Category"], as_index=False)["Boxes Shipped"].sum()

# Create Stacked Bar Chart in Plotly
fig_boxes_shipped = px.bar(boxes_country_category, 
                           x="Country", 
                           y="Boxes Shipped", 
                           color="Product Category",
                           title="Boxes Shipped per Country and Product Category",
                           labels={"Boxes Shipped": "Total Boxes Shipped", "Country": "Country"},
                           barmode="stack",  # Stacked bars
                           color_discrete_sequence=custom_palette)

fig_boxes_shipped.show()

- The bar chart above shows the boxes shipped per country. The differences between countries are more pronounced than for total sales, which could be because some countries buy fewer but more expensive boxes.

In [16]:
# Calculate correlation between Sales Amount and Boxes Shipped
correlation = df["Amount"].corr(df["Boxes Shipped"])
print(f"Correlation between Sales Amount and Boxes Shipped: {correlation}")

# Create Scatter Plot in Plotly
fig_scatter = px.scatter(df, 
                         x="Boxes Shipped", 
                         y="Amount", 
                         title="Sales Amount vs. Boxes Shipped",
                         labels={"Boxes Shipped": "Boxes Shipped", "Amount": "Sales Amount"},
                         color_discrete_sequence=["#714737"])

# Show the figure
fig_scatter.show()

Correlation between Sales Amount and Boxes Shipped: -0.018826853675834216


- The scatter plot and the correlation coefficient (about -0.02) support what the boxes-shipped bar chart suggested: the number of boxes shipped has almost no correlation with the sales amount. The greater variation in boxes shipped compared to total sales revenue could therefore be attributed to some countries selling fewer boxes at a higher price, leading to a more uniform distribution of revenue across countries.
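
The "fewer boxes at a higher price" explanation could be checked by dividing each country's total revenue by its total boxes shipped. The sketch below shows the calculation on a hypothetical `toy` frame with invented numbers; the column name `Revenue per Box` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical transactions: both countries earn 3000 in total,
# but Y ships a third of the boxes that X does
toy = pd.DataFrame({
    "Country": ["X", "X", "Y", "Y"],
    "Amount": [1000.0, 2000.0, 1500.0, 1500.0],
    "Boxes Shipped": [100, 200, 50, 50],
})

# Sum revenue and boxes per country, then divide the totals
per_country = toy.groupby("Country", as_index=False)[["Amount", "Boxes Shipped"]].sum()
per_country["Revenue per Box"] = per_country["Amount"] / per_country["Boxes Shipped"]
print(per_country)
```

If the explanation holds for the real data, countries with fewer boxes shipped should show a higher revenue per box in this table.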

<h3 style='color:#714737;'>Hypothesis 3</h3>

Hypothesis: Certain chocolate categories generate higher revenue per box shipped compared to others.

Null hypothesis: There is no significant difference in revenue per box shipped between different categories.

In [17]:
# Checking for normality to decide which test to use
pg.normality(data=df, dv='Revenue per Box Shipped', group='Product Category')


Unnamed: 0_level_0,W,pval,normal
Product Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Flavored Chocolate,0.342095,8.356434000000001e-33,False
Dark Chocolate,0.392644,1.504762e-25,False
Nut-based Chocolate,0.443419,1.516558e-20,False
Specialty Chocolate,0.385787,2.186638e-31,False
Milk & White Chocolate,0.306886,3.2098079999999997e-20,False


In [18]:
# Using Kruskal-Wallis test to compare revenue per box shipped by product category, as the data is not normally distributed
pg.kruskal(data=df, dv='Revenue per Box Shipped', between='Product Category')

Unnamed: 0,Source,ddof1,H,p-unc
Kruskal,Product Category,4,2.516008,0.641772


- The result of the Kruskal-Wallis test (p-unc = 0.64) means that we once again cannot reject the null hypothesis.
- We will look at this visually below before drawing a conclusion.

In [19]:
# Revenue per Box Shipped contains many outliers, so we remove them with the IQR method to see the data more clearly
# Step 1: Remove outliers using IQR method
Q1 = df['Revenue per Box Shipped'].quantile(0.25)
Q3 = df['Revenue per Box Shipped'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter the dataframe to exclude outliers
df_iqr_cleaned = df[(df['Revenue per Box Shipped'] >= lower_bound) & 
                    (df['Revenue per Box Shipped'] <= upper_bound)]

# Optional: Print how many rows were removed
print(f"Original rows: {len(df)}")
print(f"Rows after removing outliers: {len(df_iqr_cleaned)}")

# Step 2: Plot boxplot of Revenue per Box Shipped by Product Category
fig = px.box(df_iqr_cleaned, 
             x="Product Category", 
             y="Revenue per Box Shipped", 
             color="Product Category",
             title="Revenue per Box Shipped by Product Category (Outliers Removed)",
             labels={"Revenue per Box Shipped": "Revenue per Box ($)"},
             height=500)

fig.show()

Original rows: 1094
Rows after removing outliers: 964
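
The IQR rule used in the cell above can be wrapped in a small helper so that the fences are easy to inspect. This is a minimal sketch on a toy series; the function name `iqr_filter` and the parameter `k` are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_filter(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends, matching the >= and <= used above
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

# Toy values: Q1 = 12.5 and Q3 = 17.5, so the fences are 5 and 25
toy = pd.DataFrame({"Revenue per Box Shipped": [10, 12, 14, 16, 18, 100]})
filtered = iqr_filter(toy, "Revenue per Box Shipped")
print(f"Rows before: {len(toy)}, rows after: {len(filtered)}")
```

Note that the fences are computed over the whole column, as in the notebook, rather than separately for each product category.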


- The box plot shows little difference between the product categories; nut-based chocolate is the only one that stands out somewhat from the other four.
- Overall, revenue per box varies little by product category, which is consistent with the statistical test failing to reject the null hypothesis.

- The next visualisations show the top individual products by revenue, the top-selling categories and the distribution of individual product performance. They illustrate that, although revenue per box differs little between chocolate categories, overall performance does differ between certain categories.

In [20]:
# Aggregate total sales by product and select top 5
top_products = df.groupby('Product', as_index=False)['Amount'].sum().nlargest(5, 'Amount')

# Create Bar Chart in Plotly
fig_top_products = px.bar(top_products, 
                          x="Product", 
                          y="Amount", 
                          color="Product", 
                          title="Top 5 Products by Total Revenue",
                          labels={"Amount": "Total Revenue ($)", "Product": "Product"},
                          color_discrete_sequence=custom_palette)

fig_top_products.show()

In [21]:
# Aggregate total sales by product category
total_sales_by_category = df.groupby("Product Category", as_index=False)["Amount"].sum()

# Custom color palette
custom_palette = ['#714737','#443025','#D6B58C','#3A211A','#E5B07C','#443025']

# Create Pie Chart in Plotly
fig_pie = px.pie(total_sales_by_category, 
                 names="Product Category", 
                 values="Amount", 
                 title="Total Sales by Product Category", 
                 color="Product Category",
                 color_discrete_sequence=custom_palette,
                 hole=0)  # Set hole=0 for a standard pie chart

fig_pie.show()

In [22]:
# Create Box Plot in Plotly
fig_box_product = px.box(df, 
                         x="Product",
                         y="Amount", 
                         color="Product", 
                         title="Product Performance Distribution",
                         labels={"Amount": "Total Sales Amount", "Product": "Product"},
                         height=600)

# Show the plot
fig_box_product.show()

<h3 style='color:#714737;'>Additional Advanced Visualisations</h3>

- In this section, we showcase some additional visualisations built with Plotly that were not directly relevant to our hypotheses.
- Some of these appeared in the Sales Data Analysis notebook using Matplotlib or Seaborn; here we show more advanced versions using Plotly.

In [23]:
# Aggregate total boxes shipped by product category
boxes_per_category = df.groupby("Product Category", as_index=False)["Boxes Shipped"].sum()

# Create Bar Chart in Plotly
fig_boxes_category = px.bar(boxes_per_category, 
                            x="Product Category", 
                            y="Boxes Shipped", 
                            color="Product Category",
                            title="Boxes Shipped per Product Category",
                            labels={"Boxes Shipped": "Total Boxes Shipped", "Product Category": "Product Category"},
                            color_discrete_sequence=custom_palette)

fig_boxes_category.show()

In [24]:
# --- Bar Chart: Top-Selling Chocolate Products by Revenue ---
# Aggregate total sales by product name and sort so the top sellers come first
product_sales = df.groupby("Product", as_index=False)["Amount"].sum().sort_values("Amount", ascending=False)

# Create bar chart
fig_bar = px.bar(product_sales, 
                 x="Product", 
                 y="Amount", 
                 color="Product",
                 title="Top-Selling Chocolate Products by Revenue",
                 labels={"Amount": "Total Sales ($)"},
                 height=500)

fig_bar.show()

# --- Bubble Chart: Relationship Between Boxes Shipped & Sales Revenue ---
# Aggregate data for correlation analysis
correlation_data = df.groupby("Product", as_index=False).agg({"Boxes Shipped": "sum", "Amount": "sum"})

# Create Bubble Chart
fig_bubble = px.scatter(correlation_data, 
                         x="Boxes Shipped", 
                         y="Amount", 
                         size="Amount", 
                         color="Product",
                         title="Relationship Between Boxes Shipped & Sales Revenue",
                         labels={"Boxes Shipped": "Total Boxes Shipped", "Amount": "Total Sales ($)"},
                         hover_name="Product")

fig_bubble.show()

<div style='background-color:#714737; padding: 15px; border-radius: 5px;'>
<h2 style='color:#FEFEFE; text-align:center;'>Conclusion</h2>
</div>


This analysis explored key factors influencing chocolate sales performance across time, geography, and product lines. Three main hypotheses were examined through visualisations and statistical testing.

<h4 style='color:#714737;'>Key Findings</h4>

<h5 style='color:#714737;'>1. Hypothesis 1 </h5>
This hypothesis, which proposed that chocolate sales follow seasonal trends, was supported visually but could not be confirmed statistically. The line chart and heatmap showed differences in sales across months. However, the dataset covers only January to August, so holiday periods such as Halloween and Christmas could not be assessed, and statistical tests were limited by the aggregated monthly data (one value per group). The visual evidence therefore suggests seasonality in chocolate sales but is not conclusive.

<h5 style='color:#714737;'>2. Hypotheses 2 & 3 </h5>
These hypotheses, which investigated differences in sales across countries and revenue efficiency across product categories, led us to retain the null hypotheses. Statistical tests showed no significant differences in sales between countries, and the correlation between boxes shipped and sales amount was close to zero. Likewise, although certain product categories were expected to generate higher revenue per box, the results showed no statistically significant difference between categories.


---
