# Grocery Purchase Analysis: Descriptive Statistics & Probability Insights

Jean Jorge Fernandes

This notebook was created during my first year studying data science. I decided to create these notebooks to reinforce the concepts I learned in class. Moreover, it will be interesting to track my growth in data science over time. Some text sections are generated with AI, and I will disclose when this is the case. If there is no disclosure, the text was written by me.

This analysis examines purchase patterns for three products — **apples**, **bread**, and **milk** — across 1,000 transactions.

The goals are to:
- Summarize customer behavior using descriptive statistics.
- Apply probability theory (including the inclusion–exclusion principle and complement rule) to estimate the likelihood of various purchasing scenarios.

These insights can help guide marketing strategies, such as product bundling or targeted promotions.
(This cell was generated with AI assistance.)

## Loading the data

We start by importing the data (dataset generated with AI assistance). The data is a transaction dataset containing 1,000 grocery store purchases. 
Each row represents a transaction, and each column (apples, bread, milk) is a binary indicator:
- `1` → item purchased
- `0` → item not purchased

In [1]:
# load data
df <- read.csv("transactions.csv")

# preview the first rows (head() returns 6 by default)
head(df)


| | apples | bread | milk |
|---|---|---|---|
| | `<int>` | `<int>` | `<int>` |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 1 | 1 |
| 5 | 1 | 0 | 1 |
| 6 | 0 | 0 | 1 |
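Because every later step treats the columns as 0/1 indicators, a quick check that each column really is binary can catch loading problems early. The sketch below is only an illustration: `toy` is a hypothetical three-row stand-in for the real data, and `is_binary` is a name introduced here, not part of the original notebook.

```r
# Hypothetical stand-in for the transactions data (not the real file)
toy <- data.frame(
  apples = c(0, 0, 1),
  bread  = c(0, 1, 0),
  milk   = c(1, 1, 0)
)

# TRUE for each column that has no missing values and contains only 0 or 1
is_binary <- vapply(
  toy,
  function(col) !anyNA(col) && all(col %in% c(0, 1)),
  logical(1)
)

is_binary
```

The same `vapply()` call could be pointed at `df` instead of `toy`, assuming the file loaded with the three expected columns.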


## Descriptive Statistics 

After loading the data, we begin with the basics of descriptive statistics. This section summarizes the key characteristics of the dataset, including measures such as the mean, median, mode, and standard deviation. These metrics provide a foundation for understanding the dataset before moving into probability-based analysis.

- Mean: Average purchase rate (for binary data, this also represents the probability of purchase).
- Median: Middle value of the distribution.
- Mode: Most frequent value.
- Standard Deviation: Measure of variability in purchases.
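To see why the mean of a 0/1 column can be read as a purchase probability, here is a small sketch on a made-up vector. The vector `x` and the names `p_hat`, `share`, and `med` are hypothetical and used only for illustration.

```r
# Hypothetical indicator: 3 purchases in 8 transactions
x <- c(0, 1, 0, 0, 1, 0, 1, 0)

# The mean of a 0/1 vector is the share of 1s
p_hat <- mean(x)
share <- sum(x == 1) / length(x)

# The median stays at 0 because fewer than half of the values are 1
med <- median(x)

c(mean = p_hat, share = share, median = med)
```

Both `p_hat` and `share` come out to 0.375 here, which is 3 purchases divided by 8 transactions.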


In [5]:
# Mean
cat("Mean")
colMeans(df)

# Median
cat("Median")
apply(df, 2, median)

# Mode
cat("Mode")
get_mode <- function(v) {
    uniq_vals <- unique(v)
    uniq_vals[which.max(tabulate(match(v, uniq_vals)))]
}
apply(df, 2, get_mode)

# Standard Deviation
cat("Standard Deviation")
apply(df, 2, sd)

Mean

Median

Mode

Standard Deviation

The average purchase rate is 0.286 for apples, 0.417 for bread, and 0.454 for milk.
The median for all items is 0, meaning that in at least half of all transactions the item was not purchased (in binary data, the median will only be 1 if more than 50% of transactions include the item).
The mode is also 0 for all items, indicating that “not purchased” is the most common outcome.
The standard deviations are all fairly close to 0.5, which is roughly the largest value a binary variable can have. Milk has the highest variability (0.498), bread is close behind (0.493), and apples have the lowest (0.452). For binary data, variability is greatest when purchases and non-purchases are evenly split, so milk is the item closest to an even split, while apples are more consistently not purchased.
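The link between purchase rate and standard deviation can be checked directly. For a 0/1 variable with purchase rate p, the sample standard deviation equals sqrt(p(1 - p) · n / (n - 1)). The sketch below uses the purchase counts reported in this notebook's frequency table (286, 417, and 454 out of 1,000); the variable names are introduced here only for illustration.

```r
# Purchase counts out of 1,000 transactions, as reported in this notebook
n <- 1000
purchased <- c(apples = 286, bread = 417, milk = 454)

# Closed-form sample SD for a binary variable with purchase rate p
p <- purchased / n
sd_formula <- sqrt(p * (1 - p) * n / (n - 1))

# Same quantity computed by rebuilding each 0/1 vector and calling sd()
sd_direct <- sapply(
  purchased,
  function(k) sd(rep(c(1, 0), times = c(k, n - k)))
)

round(sd_formula, 3)
round(sd_direct, 3)
```

Both approaches should agree, and the rounded values match the standard deviations quoted in the text (0.452, 0.493, 0.498).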

### 💡 Business Insight:

- **Milk** has the highest variability (SD = 0.498) and a purchase rate of 0.454, so transactions are close to evenly split between buying and not buying. This leaves a large persuadable segment, and targeted promotions could be effective in converting non-purchasers.
- **Apples** have the lowest purchase rate (0.286). It would be important to investigate the reasons for low sales, as the lower variability (SD = 0.452) suggests that purchase behavior is relatively consistent and may not respond strongly to discounts alone. Bundling apples with complementary products could introduce them to new customers without requiring them to purchase the item individually.
- **Bread** has a modest purchase rate (0.417) and moderate variability (SD = 0.493). Purchase behavior is somewhat divided, leaving a sizeable persuadable segment. Targeted promotions could be effective in converting non-purchasers into regular buyers.

## Frequency Counts

This section displays the number of transactions in which each item was purchased or not purchased. While the mean gives us the purchase probability, frequency counts provide the raw transaction totals, which can be useful for inventory planning and sales.

In [6]:
# Frequency count for each item

lapply(df, table)

$apples

  0   1 
714 286 

$bread

  0   1 
583 417 

$milk

  0   1 
546 454 


This confirms the rates from the previous section: dividing each purchase count by the 1,000 transactions gives the corresponding mean (for example, 454 / 1,000 = 0.454 for milk). The differences in counts also highlight milk's stronger presence in customers' baskets.
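As a small cross-check, the counts above can be turned back into purchase rates. This sketch types the counts in by hand rather than reading the file, and `counts` and `rates` are names introduced here for illustration.

```r
# Frequency counts copied from the output above (rows: items, columns: 0/1)
counts <- matrix(
  c(714, 286,
    583, 417,
    546, 454),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("apples", "bread", "milk"), c("0", "1"))
)

# Row-wise proportions: each row sums to 1
rates <- prop.table(counts, margin = 1)

# The "1" column holds the purchase rate for each item
rates[, "1"]
```

With the real data loaded, `lapply(df, function(col) prop.table(table(col)))` would be one way to get the same proportions directly from `df`.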

## Basic Visualization: Purchase Counts

The bar chart below visualizes the number of transactions in which each item was purchased. This complements the frequency counts and makes it easier to compare products side by side.

> **Note:**  
> This notebook uses the following R packages:  
> - ggplot2  
> 
> If not already installed, run in R:  
> `install.packages("ggplot2")` (generated with AI assistance)

In [8]:
# Loading ggplot2
library(ggplot2)
