<font style='font-size:1.5em'>**🧑‍🏫 Week 04 Lecture – (NB02) Your questions, my pandas answers** </font>

<font style='font-size:1.2em'>LSE [DS105A](https://lse-dsi.github.io/DS105/autumn-term/index.html){style="color:#e26a4f;font-weight:bold"} – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 24 October 2024

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io){style="color:#e26a4f;font-weight:bold"}

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi){style="color:#e26a4f;font-weight:bold"}

**OBJECTIVE**: Make use of the building blocks of the pandas library to answer questions from the audience.

**REFERENCES:**

- 🌐 [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html){style="color:#e26a4f;font-weight:bold"}
- 📖 [Official pandas documentation](https://pandas.pydata.org/docs/reference/index.html){style="color:#e26a4f;font-weight:bold"}
- 📗 [DataFrame page](https://pandas.pydata.org/docs/reference/frame.html){style="color:#e26a4f;font-weight:bold"} in the official docs
- 📘 [Series page](https://pandas.pydata.org/docs/reference/series.html){style="color:#e26a4f;font-weight:bold"} in the official docs

---

In [1]:
import os

import numpy as np
import pandas as pd

from IPython.display import Image

# 1. Reading the Data

Expand the sections below to see how I collected the data and pre-processed it for this lecture.

<details style="border: 1px solid #aaa;border-radius: 4px;padding: .5em .5em 0;width:50%;font-size:0.9em;margin-bottom:1em">
    <summary style="font-weight: bold;margin: -.5em -.5em 0;padding: .5em;">Data Collection</summary>

I wrote a script to:

1. Visit Waitrose's ['Browse by Category'](https://www.waitrose.com/ecom/shop/browse/groceries) page
2. Click on each unique category
3. Scroll down to the bottom of the page and click 'Load More' until reaching the end of the list
4. Collect basic information about each one of the products (link, name, quantity, size, price)
5. Save collected products of each category to an individual CSV file

I checked [Waitrose's robots.txt file](https://www.waitrose.com/robots.txt) to confirm I had their permission for this.

</details>
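
A minimal sketch of how a robots.txt check can be done in Python with the standard library's `urllib.robotparser`. The rules and the `example.com` URLs below are made up for illustration; they are not Waitrose's actual rules, and this is not necessarily how I checked them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (NOT the real Waitrose file)
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules from memory, no network request

# Ask whether a generic crawler ("*") may fetch each URL
allowed_shop = parser.can_fetch("*", "https://example.com/shop/browse")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

print(allowed_shop, allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a hard-coded list.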

<details style="border: 1px solid #aaa;border-radius: 4px;padding: .5em .5em 0;width:50%;font-size:0.9em">
    <summary style="font-weight: bold;margin: -.5em -.5em 0;padding: .5em;">Preliminary Data Pre-processing</summary>

The data was already fairly clean, but it still needed some pre-processing before analysis.

I've made the following changes to the data collected as described above:

- I combined all the CSV files into a single larger CSV
- I transformed the `item-price` column from a string to a float (removing the currency symbol and other characters)
- I removed the ranges from the `product-size` column

<div style="font-size:0.85em;line-height:1.25em;display:block;background-color:#5d9ebc22;padding:0.5em;border-radius:0.5em;margin:1em 0 1em 0;padding:0.75em 0.5em 0.05em 0.75em;width:50%">

I only did the above to make our first lecture about pandas a bit more interesting. I want to show you the cool things you can do with pandas once your data is clean.

🤫 However, since the whole point of this course is to learn how to manipulate (clean & pre-process) data, I will give you the unclean data so you can exercise your Python function skills!

</div>

</details>
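
To give a flavour of the string-to-float step, here is a minimal sketch. The raw values and the `parse_price` function are hypothetical: the format of the scraped price strings is not shown in this notebook, so I am assuming prices look like `'£1.50'` or `'85p'`.

```python
import pandas as pd

# Hypothetical raw values; the real scraped format may differ
raw_prices = pd.Series(["£1.50", "£12.00", "85p", None])

def parse_price(text):
    """Convert a price string such as '£1.50' or '85p' to pounds as a float."""
    if pd.isna(text):
        return float("nan")           # keep missing prices as missing
    text = text.strip()
    if text.endswith("p"):
        return float(text[:-1]) / 100  # pence to pounds
    return float(text.replace("£", "").replace(",", ""))

clean_prices = raw_prices.apply(parse_price)
print(clean_prices)
```

Writing the conversion as a small function and passing it to `.apply()` keeps the cleaning logic in one place that is easy to test.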


In [2]:
df = pd.read_csv('../data/supermarket/waitrose-products-combined-2024-07.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25418 entries, 0 to 25417
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   data-product-id        25418 non-null  int64  
 1   data-product-name      25418 non-null  object 
 2   data-product-type      25418 non-null  object 
 3   data-product-on-offer  25418 non-null  bool   
 4   data-product-index     25408 non-null  float64
 5   image-url              25418 non-null  object 
 6   product-page           25418 non-null  object 
 7   product-name           25407 non-null  object 
 8   product-size           25363 non-null  object 
 9   item-price             25407 non-null  float64
 10  price-per-unit         24976 non-null  object 
 11  offer-description      7201 non-null   object 
 12  category               25418 non-null  object 
dtypes: bool(1), float64(2), int64(1), object(9)
memory usage: 2.4+ MB
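
The `Non-Null Count` column above hints at missing values: for example, `offer-description` has only 7201 non-null entries out of 25418 rows. A short sketch of how to count missing values per column directly, using a made-up toy DataFrame (the `toy` data is an assumption, not part of the Waitrose dataset):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "product-name": ["Bread", None, "Milk"],
    "item-price": [1.2, 0.9, None],
    "category": ["Bakery", "Bakery", "Fresh & Chilled"],
})

# isna() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `df.isna().sum()` on the full dataset should give the complement of the non-null counts reported by `df.info()`.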


# 2. Selected Questions

I will gather your questions and answer them using pandas. 

For the sake of brevity, I might prioritise questions that do not require massive data processing.

## Q: What is the share of products by category? 

In [3]:
all_categories = df['category'].unique().tolist()
all_categories

['Baby, Child & Parent',
 'Bakery',
 'Beer, Wine & Spirits',
 'Best of British',
 'Dietary & Lifestyle',
 'Everyday Value',
 'Food Cupboard',
 'Fresh & Chilled',
 'Frozen',
 'Home',
 'Household',
 'New',
 'Organic Shop',
 'Pet',
 'Summer',
 'Tea, Coffee & Soft Drinks',
 'Toiletries, Health & Beauty',
 'Waitrose Brands']

In pure Python, I would do this:

In [4]:
all_categories = df['category'].tolist()
categories_count = {}

for category in all_categories:

    if category in categories_count:
        # Update the dictionary value:
        categories_count[category] = categories_count[category] + 1
    else:
        categories_count[category] = 1

for category in categories_count.keys():
    categories_count[category] = categories_count[category]/len(all_categories)

In [5]:
categories_count

{'Baby, Child & Parent': 0.016877803131638995,
 'Bakery': 0.019080966244393736,
 'Beer, Wine & Spirits': 0.06613423558108428,
 'Best of British': 0.01266818789833976,
 'Dietary & Lifestyle': 0.13148162719332757,
 'Everyday Value': 0.005547249980328901,
 'Food Cupboard': 0.164804469273743,
 'Fresh & Chilled': 0.14096309701786136,
 'Frozen': 0.01699582972696514,
 'Home': 0.04241088992052876,
 'Household': 0.03662758674954757,
 'New': 0.03170981194429145,
 'Organic Shop': 0.026477299551498936,
 'Pet': 0.014753324415768354,
 'Summer': 0.07097332598945628,
 'Tea, Coffee & Soft Drinks': 0.05094814698245338,
 'Toiletries, Health & Beauty': 0.08883468408214651,
 'Waitrose Brands': 0.06271146431662601}

In pandas:

In [6]:
df['category'].value_counts(normalize=True).head(5)

category
Food Cupboard                  0.164804
Fresh & Chilled                0.140963
Dietary & Lifestyle            0.131482
Toiletries, Health & Beauty    0.088835
Summer                         0.070973
Name: proportion, dtype: float64
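
To convince yourself that the two approaches agree, here is a small sketch on toy data (the `categories` Series is made up). It uses `collections.Counter` as a shorter stand-in for the manual counting loop and compares the result with `value_counts(normalize=True)`.

```python
from collections import Counter

import pandas as pd

# Toy data for illustration only
categories = pd.Series(["Bakery", "Frozen", "Bakery", "Pet"], name="category")

# Pure Python: count each category, then divide by the total
counts = Counter(categories)
manual_share = {cat: n / len(categories) for cat, n in counts.items()}

# pandas: one method call does the counting and the division
pandas_share = categories.value_counts(normalize=True)

print(manual_share)
print(pandas_share)
```

Note that `value_counts()` also sorts the result from most to least frequent, which the dictionary version does not.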

## Q: What fraction of products are on offer?

 

In [7]:
df['data-product-on-offer'].sum()/len(df)

0.28330317098119445
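
Because `True` counts as 1 and `False` as 0, the mean of a boolean column is the fraction of `True` values. A quick sketch on a toy Series (made up for illustration) showing that `.mean()` gives the same answer as `.sum()/len()`:

```python
import pandas as pd

# Toy boolean column: 2 out of 4 products on offer
on_offer = pd.Series([True, False, False, True])

share_sum = on_offer.sum() / len(on_offer)  # the approach used above
share_mean = on_offer.mean()                # equivalent shortcut

print(share_sum, share_mean)
```

On the real data, `df['data-product-on-offer'].mean()` should therefore return the same fraction as the cell above.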

## Q: What is the category with the largest fraction of products on offer?


In [8]:
# Example for just one category
selected_rows = df['category'] == 'Bakery'

# Filter the df down to a smaller subset corresponding to bakery products
df_bakery = df[selected_rows]

# The same code as before
df_bakery['data-product-on-offer'].sum()/len(df_bakery)

0.10721649484536082

In [9]:
def calculate_share(rows):
    return rows['data-product-on-offer'].sum()/len(rows)

df.groupby(['category']).apply(calculate_share, include_groups=False).sort_values(ascending=False)

category
Beer, Wine & Spirits           0.412849
New                            0.384615
Summer                         0.374169
Tea, Coffee & Soft Drinks      0.362162
Pet                            0.338667
Fresh & Chilled                0.329054
Toiletries, Health & Beauty    0.317095
Household                      0.300752
Frozen                         0.298611
Dietary & Lifestyle            0.274985
Baby, Child & Parent           0.265734
Food Cupboard                  0.240392
Best of British                0.223602
Organic Shop                   0.206538
Home                           0.125232
Waitrose Brands                0.111041
Bakery                         0.107216
Everyday Value                 0.049645
dtype: float64

In [10]:
(
    df.groupby(['category'])
      .apply(lambda rows: rows['data-product-on-offer'].sum()/len(rows), include_groups=False)
      .sort_values(ascending=False)
)

category
Beer, Wine & Spirits           0.412849
New                            0.384615
Summer                         0.374169
Tea, Coffee & Soft Drinks      0.362162
Pet                            0.338667
Fresh & Chilled                0.329054
Toiletries, Health & Beauty    0.317095
Household                      0.300752
Frozen                         0.298611
Dietary & Lifestyle            0.274985
Baby, Child & Parent           0.265734
Food Cupboard                  0.240392
Best of British                0.223602
Organic Shop                   0.206538
Home                           0.125232
Waitrose Brands                0.111041
Bakery                         0.107216
Everyday Value                 0.049645
dtype: float64
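
An alternative sketch that avoids `.apply()` altogether: select the boolean column after grouping and take its mean, then use `.idxmax()` to pull out the category with the largest share. The `toy` DataFrame below is made up for illustration; only the column names match the real dataset.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "category": ["Bakery", "Bakery", "Pet", "Pet", "Pet", "Frozen"],
    "data-product-on-offer": [True, False, True, True, False, False],
})

# Mean of a boolean column within each group = share of products on offer
share_on_offer = (
    toy.groupby("category")["data-product-on-offer"]
       .mean()
       .sort_values(ascending=False)
)

# Label of the category with the largest share
top_category = share_on_offer.idxmax()

print(share_on_offer)
print(top_category)
```

Built-in aggregations such as `.mean()` are usually faster than `.apply()` with a custom function, and they make the intent easier to read.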