# What's in an Avocado Toast: A Supply Chain Analysis

You're in London, making an avocado toast, a quick-to-make dish that has soared in popularity on breakfast menus since the 2010s. A simple smashed avocado toast can be made with five ingredients: one ripe avocado, half a lemon, a big pinch of salt flakes, two slices of sourdough bread and a good drizzle of extra virgin olive oil. It's no small feat that most of these ingredients are readily available in grocery stores. 

In this project, you'll conduct a supply chain analysis of three of these ingredients, using the Open Food Facts database. This database contains extensive, openly sourced information on various foods, including their origins. Through this analysis, you will gain an in-depth understanding of the complex supply chain involved in producing a single dish.

Three pairs of files are provided in the data folder:
- A CSV file for each ingredient, such as `avocado.csv`, with data about each food item and countries of origin
- A TXT file for each ingredient, such as `relevant_avocado_categories.txt`, containing only the category tags of interest for that food.

Here are some other key points about these files:
- Some rows in each of the three CSV files are not relevant to your investigation. Filter each DataFrame to keep only rows where `categories_tags` contains at least one of the tags in the relevant categories file for that ingredient. Examples of categories are fruits, vegetables, and fruit-based oils.
- Each row usually has multiple category tags in the `categories_tags` column.
- There is a column in each CSV file called `origins_tags` with strings for country of origin of that item.
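
The filtering rule described in these points can be sketched on a toy DataFrame. The rows and the `relevant` set below are invented for illustration and are not taken from the project files; only the column names come from the description above.

```python
import pandas as pd

# Toy stand-in for an ingredient CSV; these rows are invented for illustration.
toy = pd.DataFrame({
    "product_name_en": ["Hass avocado", "Guacamole dip", "Avocado oil"],
    "categories_tags": [
        "en:plant-based-foods,en:fruits,en:avocados",
        "en:dips,en:guacamoles",
        "en:fats,en:avocado-oils",
    ],
    "origins_tags": ["en:peru", "en:mexico", "en:spain"],
})

# Hypothetical relevant-category tags, standing in for the TXT file's contents.
relevant = {"en:avocados", "en:fruits"}

# Split each comma-separated string into a list of individual tags.
tag_lists = toy["categories_tags"].str.split(",")

# Keep a row when at least one of its tags appears in the relevant set.
mask = tag_lists.apply(lambda tags: not relevant.isdisjoint(tags))
filtered = toy[mask]

print(filtered["product_name_en"].tolist())  # ['Hass avocado']
```

Splitting before matching means a tag only counts when it matches exactly, so `en:avocado-oils` does not match `en:avocados`.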

After completing this project, you'll be armed with a list of ingredients and their countries of origin, and be well-positioned to launch into other analyses that explore how long, on average, these ingredients spend at sea.

![](avocado_wallpaper.jpeg)

## EDA

In [1]:
import pandas as pd
# available ingredients: 'avocado', 'olive_oil', 'sourdough'

COLUMNS = ['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 
           'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 
           'labels_tags', 'countries', 'countries_tags', 'origins',
           'origins_tags']

ingredient = 'sourdough'

# load in the origin data for the ingredient
df = pd.read_csv(f'data/{ingredient}.csv', sep='\t', usecols=COLUMNS, low_memory=False)
df.shape

(1422, 14)

In [2]:
# load in the ingredient relevant categories
categories_df = pd.read_table(f'data/relevant_{ingredient}_categories.txt', header=None)
relevant_categories = categories_df[0].to_list()
print(relevant_categories)

['en:bagel-breads', 'en:baguettes', 'en:bakery-products', 'en:bran-bread', 'en:breads', 'en:buns', 'en:confectioneries', 'en:crackers', 'en:crackers-with-natural-sourdough', 'en:crackers-with-wholegrain-rye', 'en:crispbreads', 'en:english-muffins', 'en:flatbreads', 'en:garlic-breads', 'en:gluten-free-breads', 'en:olive-breads', 'en:panini-breads', 'en:pastries', 'en:pre-baked-breads', 'en:rye-and-wheat-breads', 'en:rye-breads', 'en:sliced-breads', 'en:sordough-breads', 'en:sourdough', 'en:sourdough-bread', 'en:sourdough-breads', 'en:sourdough-breads-with-rosemary', 'en:sourdough-pita-bread', 'en:special-breads', 'en:wheat-breads', 'en:wheat-flatbreads', 'en:wholemeal-breads', 'en:wholemeal-sliced-breads']


In [3]:
# examine the dataframe
df.head()

Unnamed: 0,code,lc,product_name_en,quantity,serving_size,packaging_tags,brands,brands_tags,categories_tags,labels_tags,countries,countries_tags,origins,origins_tags
0,5000169636046,en,"Wood-fored Fennel Sausage, ‘Nduja Sourdough Pizza",,,,,,,,United Kingdom,en:united-kingdom,,
1,850026434323,en,Cinnamon Spouted Sourdough,25 oz,,,,,,"en:vegetarian,en:no-artificial-flavors,en:vega...",United States,en:united-states,,
2,237653602484,en,San Francisco sourdough,20.0 oz,,,,,,,United States,en:united-states,,
3,4056489462187,en,Chargrilled vegetable and basil pesto woodfire...,,,,"Lidl,Deluxe","lidl,deluxe","en:meals,en:pizzas-pies-and-quiches,en:pizzas",,Ireland,en:ireland,,
4,10500016941075200179,fr,,,,,,,"en:plant-based-foods-and-beverages,en:plant-ba...",,France,en:france,,


In [4]:
# drop rows with null values in the categories_tags or origins_tags columns
df.dropna(subset=['categories_tags', 'origins_tags'], inplace=True)
print(df.isna().sum())

code               0
lc                 0
product_name_en    2
quantity           0
serving_size       6
packaging_tags     0
brands             0
brands_tags        0
categories_tags    0
labels_tags        4
countries          0
countries_tags     0
origins            0
origins_tags       0
dtype: int64


In [5]:
# convert the comma value entries in categories_tags to a list
df['categories_tags'] = df['categories_tags'].str.split(',')
df.categories_tags.head()

32     [en:meats-and-their-products, en:meals, en:piz...
159    [en:plant-based-foods-and-beverages, en:plant-...
185    [en:plant-based-foods-and-beverages, en:plant-...
243    [en:snacks, en:salty-snacks, en:appetizers, en...
342                                 [en:sourdough-bread]
Name: categories_tags, dtype: object

In [6]:
# flag rows whose categories_tags list contains at least one tag from relevant_categories
matching_tags = df.categories_tags.apply(lambda x: any(tag in x for tag in relevant_categories))

# filter the dataframe to the matching rows sold in the United Kingdom,
# take the most frequent origins_tags value,
# then strip the leading three characters ('en:') and replace hyphens with spaces
df[matching_tags].query('countries == "United Kingdom"')['origins_tags'].value_counts().index[0][3:].replace('-', ' ')

'united kingdom'
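
The one-liner above raises an `IndexError` if no rows survive the filter, because `index[0]` has nothing to return. Below is a minimal sketch of the same "most frequent tag, cleaned up" step with a guard for that case. The helper name `top_origin_label` and the toy Series are assumptions made for illustration, not part of the project.

```python
import pandas as pd

def top_origin_label(origins):
    """Return the most frequent origin tag as a readable label, or None if empty."""
    counts = origins.value_counts()  # sorted by frequency, highest first
    if counts.empty:
        return None
    top_tag = counts.index[0]
    # Drop the language prefix (e.g. 'en:') and turn hyphens into spaces.
    return top_tag.split(":", 1)[-1].replace("-", " ")

# Invented example values in the same 'en:country-name' shape as origins_tags.
toy_origins = pd.Series(["en:united-kingdom", "en:france", "en:united-kingdom"])

print(top_origin_label(toy_origins))                     # united kingdom
print(top_origin_label(pd.Series([], dtype="object")))   # None
```

Splitting on the first colon rather than slicing `[3:]` is a design choice in this sketch: it would also cope with a prefix that is not exactly three characters long.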

## Turn it into a function
Turn the above into a function and make it a tad more readable.


In [11]:
import pandas as pd
import csv

# Define constants
COLUMNS = ['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 
           'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 
           'labels_tags', 'countries', 'countries_tags', 'origins', 'origins_tags']

# Define a function to process each ingredient
def process_ingredient(ingredient):
    """
    Process the data for a specific ingredient.
    
    Related files should be in the 'data' folder and have file naming convention:
    - 'ingredient.csv'
    - 'relevant_ingredient_categories.txt'

    Parameters:
    - ingredient (str): The name of the ingredient.

    Returns:
    - str: The most frequent origin among matching products sold in the
      United Kingdom, lowercased, with the 'en:' prefix removed and
      hyphens replaced by spaces.
    """
    # Read data from CSV file
    df = pd.read_csv(f'data/{ingredient}.csv', usecols=COLUMNS, low_memory=False, sep='\t')
    
    # Load relevant categories
    categories_df = pd.read_table(f'data/relevant_{ingredient}_categories.txt', header=None)
    relevant_categories = categories_df[0].tolist()
    
    # Drop null values from specific columns
    df.dropna(subset=['categories_tags', 'origins_tags'], inplace=True)
    
    # Convert comma-separated values to lists
    df['categories_tags'] = df['categories_tags'].str.split(',')
    
    # Match rows between relevant_categories list and categories_tags column values
    matching_tags = df['categories_tags'].apply(lambda x: any(tag in x for tag in relevant_categories))
    
    # Filter the dataframe based on matches and United Kingdom
    uk_origin_df = df[matching_tags & (df['countries'] == 'United Kingdom')]
    
    # Get the top origin value
    top_origin = uk_origin_df['origins_tags'].value_counts().index[0][3:].replace('-', ' ')
    
    return top_origin

# Process each ingredient and create variables
top_avocado_origin = process_ingredient('avocado')
top_olive_oil_origin = process_ingredient('olive_oil')
top_sourdough_origin = process_ingredient('sourdough')

# Define the data to be written to CSV
data_to_write = [
    ['Avocado', top_avocado_origin],
    ['Olive Oil', top_olive_oil_origin],
    ['Sourdough', top_sourdough_origin]
]

# Specify the CSV file name
csv_filename = 'ingredient_origins.csv'

# Write to CSV file
with open(csv_filename, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    # Write the header row
    csv_writer.writerow(['Ingredient', 'Origin'])
    # Write the data rows
    for row in data_to_write:
        csv_writer.writerow(row)
 
print(f"Data has been written to {csv_filename}")

# Display the results
print(f'{top_avocado_origin = }')
print(f'{top_olive_oil_origin = }')
print(f'{top_sourdough_origin = }')

Data has been written to ingredient_origins.csv
top_avocado_origin = 'peru'
top_olive_oil_origin = 'greece'
top_sourdough_origin = 'united kingdom'
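
Since pandas is already imported, the same two-column file could also be produced with `DataFrame.to_csv` instead of the `csv` module. The sketch below is an alternative, not the notebook's method. It hard-codes the three results printed above and writes to an in-memory buffer so the output can be inspected without touching the disk.

```python
import io
import pandas as pd

# Values copied from the printed results above.
results = {"Avocado": "peru", "Olive Oil": "greece", "Sourdough": "united kingdom"}

origins_df = pd.DataFrame(list(results.items()), columns=["Ingredient", "Origin"])

# Written to an in-memory buffer here; pass a path such as
# 'ingredient_origins.csv' instead to write the file to disk.
buffer = io.StringIO()
origins_df.to_csv(buffer, index=False)

print(buffer.getvalue())
```

Passing `index=False` keeps the row index out of the file, so the header row is `Ingredient,Origin`, as in the `csv.writer` version.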
