<a href="https://colab.research.google.com/github/scisley/brand-sustainability-data-fetcher/blob/main/Brand_Sustainability_Data_Fetcher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

Hi folks! My name is Steve Isley ([LinkedIn](https://www.linkedin.com/in/stevecisley/)) and I'm a little obsessed with sustainable consumption. I spent 3 years at [NREL](https://www.nrel.gov/) as a behavioral scientist, 5 years at [Amazon](https://sustainability.aboutamazon.com/) leading sustainable customer research, and for the last two years I've been building a sustainable shopping experience called [CountOn](https://joincounton.com/), which is now a side project as I look for regular work.

For CountOn, I've gotten deep into applied generative AI. And by "applied" I mean I'm not building new large language models (LLMs), rather I'm using existing LLMs to solve hard problems. It occurred to me recently that finding self-reported sustainability information about brands was one of those problems that hundreds, maybe thousands, of sustainable shopping websites have had to deal with over the years.

This is tedious, boring work. With AI, we don't have to do it anymore! This notebook includes all the code needed to take a list of brands, fetch pages that are likely to contain sustainability-related information, then summarize that information. I've done this for the **1,000 most famous brands** in the US.

The code is provided free of charge and with no guarantee of accuracy. I've evaluated a few results by hand and think they're pretty good, but I haven't reviewed nearly enough to make confident statements about accuracy. Use the data at your own risk. But, even if the data aren't high enough quality for you, there are lots of ways to improve the code (some of which I've documented) and this should give you a good starting point for your own data gathering project.

### Cost

Using AI is an order of magnitude faster and cheaper than the previous approach. I've done this type of manual work before, and I think a conservative estimate is 10 minutes of work per brand you want to evaluate. You have to find the right pages, take notes on what the company says, then produce a well-written summary. So that's 1,000 * 10 / 60 = 166.7 hours, which is over four weeks of full-time work.

I carefully tracked my time on this project. I came in at just under **12 hours** to write all the code below. The time needed to analyze a new website is on the order of seconds.

In terms of variable costs, let's assume \$15/hr for the manual process. That's \$2.50 per brand. Here are the variable costs for this automated approach:

* **Search API**: \$4 per 1,000 searches, I did 1k searches, so \$4.
* **Crawlbase**: \$0.006 per page, I scraped 1,885, so \$11.31
* **OpenAI LLM fees**: \$60.30 (includes all usage, even debugging, and most of this is GPT-4, OpenAI's most expensive model)

Total: **\$75.61** or \$0.076 per brand, that's roughly 30 times cheaper.
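The arithmetic in this section can be reproduced in a few lines. This is only a restatement of the figures quoted above; the variable names are illustrative and not part of the notebook's pipeline.

```python
# Reproduce the cost comparison using the figures quoted in this section.
BRANDS = 1_000
MINUTES_PER_BRAND = 10        # conservative manual estimate
MANUAL_HOURLY_RATE = 15.00    # assumed $/hr for the manual process

manual_hours = BRANDS * MINUTES_PER_BRAND / 60
manual_cost_per_brand = MANUAL_HOURLY_RATE * MINUTES_PER_BRAND / 60

search_cost = 4.00 * (1_000 / 1_000)   # $4 per 1,000 searches, 1k searches
crawl_cost = 0.006 * 1_885             # $0.006 per scraped page
llm_cost = 60.30                       # OpenAI fees, including debugging

automated_total = search_cost + crawl_cost + llm_cost
automated_cost_per_brand = automated_total / BRANDS
ratio = manual_cost_per_brand / automated_cost_per_brand

print(f"Manual effort: {manual_hours:.1f} hours")
print(f"Manual cost per brand: ${manual_cost_per_brand:.2f}")
print(f"Automated total: ${automated_total:.2f}")
print(f"Automated cost per brand: ${automated_cost_per_brand:.3f}")
print(f"Manual is about {ratio:.0f}x more expensive per brand")
```

Note that the fixed cost of writing the code (about 12 hours) is not included in the per-brand figure.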

# Outline

Using the latest AI orchestration tools makes the code remarkably simple. I'm using [LangChain](https://python.langchain.com/docs/get_started/introduction) here and highly recommend it, but others love [LlamaIndex](https://www.llamaindex.ai/). Here are the high level steps involved.

**Step 1**: Find a list of brands you want to analyze. This can be surprisingly hard. I spent about an hour just doing this.

**Step 2**: For each brand, use a Google search automation tool to return the results for "*{brand name} sustainability*"

**Step 3**: For each candidate Google search result, run it through an LLM to decide if it's relevant or not. We are looking for pages owned by the brand itself, not what somebody else has said about the brand (that's a very valid but totally different dataset).

**Step 4**: For each relevant URL, fetch the page HTML and convert it to something nicer, in our case, markdown.

**Step 5**: Create summaries by feeding the markdown for each brand into an LLM along with a prompt telling the LLM what to produce.

**Step 6**: [Profit](https://en.wikipedia.org/wiki/Gnomes_(South_Park)).
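To make the data flow between the steps concrete, here is a minimal sketch of the pipeline as one function. The four callables (`search`, `is_relevant`, `fetch_markdown`, `summarize`) are placeholders invented for this illustration; they are not the functions defined later in the notebook, which use LangChain runnables and caching.

```python
def run_pipeline(brands, search, is_relevant, fetch_markdown, summarize):
    """Run Steps 2-5 for each brand and return {brand: summary}."""
    summaries = {}
    for brand in brands:
        results = search(f"{brand} sustainability")                    # Step 2
        urls = [r["link"] for r in results if is_relevant(brand, r)]   # Step 3
        pages = [fetch_markdown(url) for url in urls]                  # Step 4
        summaries[brand] = summarize(brand, pages)                     # Step 5
    return summaries

# Stand-in implementations so the sketch runs without any API keys.
fake_results = [
    {"link": "https://acme.example.com/sustainability", "domain": "acme.example.com"},
    {"link": "https://news.example.org/acme", "domain": "news.example.org"},
]
summaries = run_pipeline(
    brands=["Acme"],
    search=lambda query: fake_results,
    is_relevant=lambda brand, result: brand.lower() in result["domain"],
    fetch_markdown=lambda url: f"# Page at {url}",
    summarize=lambda brand, pages: f"{brand}: {len(pages)} page(s)",
)
print(summaries)
```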



In [None]:
!pip3 install langchain langchain-community langchain-core langsmith \
    openai python-dotenv tiktoken pydantic==1.10.13 \
    beautifulsoup4 lxml python-slugify html2text crawlbase

# Environment setup

Do whatever you need to load environment variables into your system. The key environment variables are:

* OPENAI_API_KEY: Required for using OpenAI's LLMs. Get this from [OpenAI](https://platform.openai.com/api-keys)
* SEARCHAPI_API_KEY: Required for fetching Google search results. Get from [SearchApi](https://www.searchapi.io/)
* LANGCHAIN_API_KEY: This is optional, but I highly recommend it. It helps a ton with debugging. Get this from [LangChain](https://smith.langchain.com/).
* CRAWLBASE_JS_API_KEY: Required for scraping websites. Get this from [Crawlbase](https://crawlbase.com/)
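Once the variables are loaded, a quick check like the one below can catch a missing key before any paid API call is made. This is a sketch; `missing_env_keys` is a hypothetical helper, not something the notebook defines.

```python
import os

# Keys named in the list above. LANGCHAIN_API_KEY is optional.
REQUIRED_KEYS = ["OPENAI_API_KEY", "SEARCHAPI_API_KEY", "CRAWLBASE_JS_API_KEY"]
OPTIONAL_KEYS = ["LANGCHAIN_API_KEY"]

def missing_env_keys(keys, environ=None):
    """Return the keys that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [key for key in keys if not environ.get(key)]

missing = missing_env_keys(REQUIRED_KEYS)
if missing:
    print("Missing required environment variables:", ", ".join(missing))
for key in missing_env_keys(OPTIONAL_KEYS):
    print(f"Optional variable {key} is not set.")
```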

In [2]:
from dotenv import load_dotenv, dotenv_values
from google.colab import drive
from IPython.display import Markdown, display
import re
import pprint
import os
import json

drive.mount('/content/drive/')

# UPDATE THIS TO WORK ON YOUR SYSTEM.
!cp /content/drive/MyDrive/Colab-Notebooks/prototypes/dotenv .env
load_dotenv(override=True)

# Optional, but LangSmith is *really* handy for debugging. Highly recommend.
os.environ['LANGCHAIN_PROJECT'] = 'brand-data-fetcher'
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT']="https://api.smith.langchain.com"

Mounted at /content/drive/


True

# Step 1: List of brands to analyze

I searched around for a good list of brands. This was surprisingly hard. I wanted a list of well-known consumer brands, as opposed to B2B brands or obscure brands nobody has ever heard of.

I eventually landed on YouGov's ["Most Famous Brands"](https://today.yougov.com/ratings/consumer/fame/brands/all) list. This looked like exactly what I wanted, but there wasn't a downloadable option. So, I just kept going in the infinite scroll until the top 1,000 brands were displayed. Then, I opened my dev console and copied the HTML for the list into a file. The code below imports that HTML and uses Beautiful Soup (a popular Python library for manipulating HTML) to extract the relevant information.

The very good people at YouGov even have an [FAQ](https://today.yougov.com/about/ratings-faq) saying folks can use the data on their website - thanks YouGov!

In [86]:
from bs4 import BeautifulSoup
root_path = '/content/drive/MyDrive/Colab-Notebooks/prototypes/brand-data-fetcher'
file_path = f'{root_path}/data/yougov-brands.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    raw_data = file.read()

soup = BeautifulSoup(raw_data, 'lxml')

# List to store extracted data
data = []

for li in soup.find_all('li', class_='ng-star-inserted'):
    # Finding brand name
    brand = li.find('img', class_='ng-star-inserted')['alt'] if li.find('img', class_='ng-star-inserted') else 'No Brand'

    # Finding fame and popularity percentages
    percentages = li.find_all('span', class_='rankings-item-active') + li.find_all('span', class_='compact')
    fame = percentages[0].text.strip() if len(percentages) > 0 else 'No Fame'
    popularity = percentages[1].text.strip() if len(percentages) > 1 else 'No Popularity'

    data.append({"brand": brand, "fame": fame, "popularity": popularity})

# Sample output
pprint.pp(data[0:10])

[{'brand': "McDonald's", 'fame': '99%', 'popularity': '63%'},
 {'brand': 'Amazon', 'fame': '99%', 'popularity': '76%'},
 {'brand': 'Ford', 'fame': '99%', 'popularity': '62%'},
 {'brand': 'Taco Bell', 'fame': '99%', 'popularity': '67%'},
 {'brand': 'UPS', 'fame': '99%', 'popularity': '77%'},
 {'brand': "M&M's", 'fame': '99%', 'popularity': '85%'},
 {'brand': 'KFC', 'fame': '99%', 'popularity': '67%'},
 {'brand': 'Adidas', 'fame': '99%', 'popularity': '72%'},
 {'brand': 'PayPal', 'fame': '98%', 'popularity': '67%'},
 {'brand': 'Oreo Cookies', 'fame': '98%', 'popularity': '80%'}]


# Step 2: Google search automation

Finding the homepage for a brand is (or at least used to be) a common Mechanical Turk task. This can now be automated pretty easily. In this case, we're not looking for the homepage, but rather pages about sustainability owned by the brand itself. The goal is to collect self-reported information.

Google doesn't provide an automated way to retrieve their search results, but about a billion independent companies have sprung up to provide this service. I've used [SearchApi](https://www.searchapi.io/) for this and I've been very happy with the service. The costs are trivial. Never build yourself what you can pay peanuts for.

### Ways to make this better

The code just does a Google search for the brand name followed by the word "sustainability". This works fine most of the time, but it gets tripped up when you have a brand name that is also a common word (though even then it often works because the word "sustainability" isn't likely to follow that word in other contexts). For example, the top Google search result for the sportswear brand "Columbia" is about Columbia University. However, "Milwaukee" works OK, as the second search result is about the tool company (instead of the city).

This process also struggles with brands that are owned by a parent company when the parent company is the one publishing the sustainability information. For example, "Oreo Cookies" is owned by Mondelez International. A Google search for ["Oreo cookies sustainability"](https://www.google.com/search?q=Oreo+cookies+sustainability&rlz=1C5CHFA_enUS919US919&oq=Oreo+cookies+sustainability&gs_lcrp=EgZjaHJvbWUyBggAEEUYOdIBCDQ1NThqMGo3qAIAsAIA&sourceid=chrome&ie=UTF-8) does return a Mondelez link, but it's not specific to Oreo cookies and the next step actually filters it out because the LLM doesn't think it's related to Oreos.
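One possible direction for both problems is to build the query from more than just the brand name. The sketch below assumes you have a category hint and a parent company for each brand; neither field exists in the YouGov data used here, so `category_hint` and `parent_company` are hypothetical inputs, and whether these queries actually return better results is untested.

```python
def build_queries(brand, category_hint=None, parent_company=None):
    """Return a list of search queries to try for a brand.

    category_hint and parent_company are hypothetical extra fields; the
    brand list in this notebook only contains the brand name.
    """
    base = f"{brand} {category_hint}" if category_hint else brand
    queries = [f"{base} sustainability"]
    if parent_company:
        # Also search with the parent company's name included.
        queries.append(f"{parent_company} {brand} sustainability")
    return queries

print(build_queries("Columbia", category_hint="sportswear"))
print(build_queries("Oreo Cookies", parent_company="Mondelez International"))
```

If a parent company is used, the relevance prompt in the next step would also need to be told that the parent's domain counts as brand-owned.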

In [223]:
from langchain_community.utilities import SearchApiAPIWrapper
from requests.exceptions import RequestException

def fetch_via_serp(query):
    # Example query: f"site:www.ewg.org/skindeep/ingredients citric acid"
    try:
        # See https://www.searchapi.io/docs/google
        print(f"Using SearchApi with query: {query}")
        search = SearchApiAPIWrapper()
        results = search.results(query)
        if "organic_results" in results and results["organic_results"]:
            return results
        else:
            msg = f"WARNING: No SearchApi results for {query}"
            print(msg)
            return None
    except RequestException as e:
        msg = f"ERROR: RequestException for {query}: {e}"
        print(msg)
        return None
    except Exception as e:
        msg = f"ERROR: Unknown error for {query}: {e}"
        print(msg)
        return None

# Step 3: Determine relevant links

We're looking for information from the brand itself, not from watchdog organizations or news agencies. Those data are super interesting, just not what we're looking for.

Google returns the top search results, and generally that includes a brand's own sustainability page (if one exists). We're going to implement a pre-processing step where we use an LLM to evaluate each link and make a determination on whether it's relevant or not.

I've also implemented some simple caching by saving files to a folder. Whenever you're doing something complicated over hundreds of things, there's a good chance an error will occur. In order to make restarting simple, I will generally cache results along the way. Before running a process, I'll check the cache to see if it exists. If it does, I use those results. Clearing the cache is as simple as deleting the files I want to recreate.
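The caching pattern described above can be sketched generically as follows. `cached_call` is a hypothetical helper written for this illustration; the notebook's own cache functions are separate get/save pairs and use `slugify` to turn queries and URLs into safe file names.

```python
import json
import os
import tempfile

def cached_call(cache_dir, key, compute):
    """Return the cached JSON value for key, computing and saving it if absent.

    key must already be safe to use as a file name.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as file:
            return json.load(file)
    value = compute()
    with open(path, "w", encoding="utf-8") as file:
        json.dump(value, file)
    return value

# Demo with a throwaway directory and a stand-in for an expensive call.
calls = []

def expensive():
    calls.append(1)
    return {"answer": "yes"}

with tempfile.TemporaryDirectory() as demo_dir:
    first = cached_call(demo_dir, "example-key", expensive)
    second = cached_call(demo_dir, "example-key", expensive)

print(first == second, len(calls))
```

The second call reads the file instead of calling `expensive` again, so deleting the file is all it takes to force a recompute.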

The code below is our first LLM prompt. You can see that I provide the Google search result URL, the domain, and the Google generated snippet.

### Ways to improve this

I hope somebody does the inverse of this. What do people *other than the brand* say about the brand's sustainability? "Branding" is what you say about yourself, "reputation" is what other people say about you.

I didn't do much checking on these results. Just eyeballing them, they looked pretty good. If you wanted to build something like this for real, you'd want to establish the accuracy of this process. That would likely involve a hand-labeled test set that you could use as a reference to measurably improve your results.
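As a rough idea of what such a test set could be used for, the sketch below scores yes/no relevance predictions against hand labels using precision and recall. The function name and the example data are made up for illustration; no such labeled set exists in this notebook.

```python
def evaluate_relevance(predictions, labels):
    """Compare yes/no predictions with hand labels, keyed by URL.

    Both arguments are dicts mapping URL -> bool (True means relevant).
    Only URLs present in both dicts are scored.
    """
    shared = predictions.keys() & labels.keys()
    tp = sum(1 for url in shared if predictions[url] and labels[url])
    fp = sum(1 for url in shared if predictions[url] and not labels[url])
    fn = sum(1 for url in shared if not predictions[url] and labels[url])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"n": len(shared), "precision": precision, "recall": recall}

# Made-up example data, purely to show the shape of the inputs.
predictions = {
    "https://example.com/a": True,
    "https://example.com/b": True,
    "https://example.com/c": False,
    "https://example.com/d": False,
}
labels = {
    "https://example.com/a": True,
    "https://example.com/b": False,
    "https://example.com/c": True,
    "https://example.com/d": False,
}
scores = evaluate_relevance(predictions, labels)
print(scores)
```

Low precision would mean third-party pages are slipping through; low recall would mean brand-owned pages (such as a parent company's site) are being dropped.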

For simplicity and to keep my own personal costs down, I only examine the top 3 Google search results. For real usage, this should be upped to ~10. I've also filtered out search results that point to a PDF. Obviously PDFs can contain important sustainability information, but they are a pain to deal with. There are lots of tools out there for adding PDF content to an LLM, I just didn't implement it.
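The PDF filter used later in the notebook checks whether the link ends in `.pdf`. A slightly more forgiving check, sketched here with the standard library, looks only at the URL path so that upper-case extensions and query strings are handled too. `looks_like_pdf` is a hypothetical name, and this still cannot catch PDFs served from URLs without the extension.

```python
from urllib.parse import urlparse

def looks_like_pdf(url):
    """Return True when the URL path ends in .pdf, ignoring case and query strings."""
    path = urlparse(url).path
    return path.lower().endswith(".pdf")

examples = [
    "https://example.com/report.pdf",
    "https://example.com/report.PDF?download=1",
    "https://example.com/sustainability",
]
flags = [looks_like_pdf(url) for url in examples]
print(flags)
```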

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda, RunnableBranch, RunnableParallel
from langchain_community.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from slugify import slugify
import json

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Create a chain that evaluates each link and returns yes or no
prompt = ChatPromptTemplate.from_messages([
    ("system", """Your job is to evaluate information about a website and decide
if it likely contains information about the sustainability efforts of {brand}.
For this task, we want information ONLY provided by the brand itself, not other
organizations. If the domain does not look like it's owned by {brand}, then it is
not relevant. You respond with only a single word: yes or no."""),
    ("user", """Website Title: {title}
Website URL: {link}
Website Domain: {domain}
Snippet: {snippet}
""")
])

relevance_chain = prompt | llm | StrOutputParser()

def get_cached_serp_result(query):
    cache_file_name = slugify(query)
    cache_file_path = f"{root_path}/data/serp_cache/{cache_file_name}.json"
    if os.path.exists(cache_file_path):
        with open(cache_file_path, 'r') as file:
            return json.load(file)
    else:
        return None

def cache_serp_result(query, result):
    cache_file_name = slugify(query)
    cache_file_path = f"{root_path}/data/serp_cache/{cache_file_name}.json"
    with open(cache_file_path, 'w') as file:
        json.dump(result, file)

def get_candidate_results(x):

    query = f"{x['brand']} sustainability"

    cached_result = get_cached_serp_result(query)
    if cached_result is not None:
        print('Found cached result', query)
        return cached_result

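    serp_results = fetch_via_serp(query)
    if serp_results is None:
        # The search failed or returned nothing. Skip caching so a rerun retries it.
        return []
    # Just grab the top three URLs
    candidate_results = serp_results["organic_results"][0:3]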
    # Add back in the brand information, and sometimes the snippet is absent, so providing a default
    candidate_results = [{"snippet": "Unknown", **result, "brand": x["brand"]} for result in candidate_results]
    # Run the relevance chain to see which are relevant
    candidate_results = [{**result, "relevant": relevance_chain.invoke(result)} for result in candidate_results]
    cache_serp_result(query, candidate_results)

    return candidate_results

# Expects as input an item from "data"
candidate_results_chain = RunnablePassthrough.assign(candidate_results = get_candidate_results)
brand_data = candidate_results_chain.batch(data[0:1000])
#pprint.pp(brand_data[0:5])

# Step 4: Fetch content of relevant URLs

Now that we have a set of relevant URLs for each brand, we need to fetch them. It's definitely possible to fetch HTML using your own setup. But, I've found that it's just not worth the hassle. You have to deal with websites that require JavaScript, meaning you need to implement a headless browser. You have to deal with websites' anti-scraping measures (Amazon is especially good at this). It's just not worth it.

Instead, just pay someone else to do it! I'm using [Crawlbase](https://crawlbase.com/), but there are lots of services for doing this. I just found them easy to use from a developer point of view. Even at the most expensive tier, they charge \$0.006 per page, and that covers all the problems mentioned above.

Even though it's really cheap, it can take a while, and I don't like throwing money away, so I implemented basic caching again.

Crawlbase returns HTML, but that's actually not a good format for feeding into an LLM. Nearly every page has many times more formatting and JavaScript than information you actually care about. This is extra crud for the LLM to wade through, which increases the cost and, more importantly, can result in the LLM missing important information.

Lots of AI tools out there simply pull all the text out of the HTML. However, I've found this misses important context, like what text is a heading or the context provided by a table structure. The best compromise I've found so far is converting HTML to markdown. There are lots of tools for this. I'm using html2text (through LangChain's `Html2TextTransformer`).
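To illustrate what "keeping structure" means, here is a toy converter built only on the standard library's `html.parser`. It is not html2text and handles far less; it just shows script and style content being dropped while headings and list items keep their markdown markers. The class name and sample HTML are invented for this example.

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy converter: keeps headings and list items, drops script/style."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self.prefix = ""
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag]
        elif tag == "li":
            self.prefix = "* "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            # Attach the pending markdown marker to the first text we see.
            self.lines.append(self.prefix + text)
            self.prefix = ""

html = """<html><head><style>p {color: red}</style></head><body>
<h1>Our Planet</h1><p>We track emissions.</p>
<ul><li>Recycled packaging</li><li>Renewable energy</li></ul>
<script>trackVisitor();</script></body></html>"""

parser = TinyMarkdown()
parser.feed(html)
parser.close()
markdown = "\n".join(parser.lines)
print(markdown)
```

A plain text dump of the same page would lose the fact that "Our Planet" is a heading and that the last two lines are list items.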

Fetching pages is an I/O intensive process. If you do it one-by-one, it will work, but it could take forever, and if you're using a Google Colab notebook then it has a tendency to disconnect you after ~90 minutes of inactivity. I implemented a simple thread pool that processes 10 brands at a time to speed things up. Not gonna lie, I just took my synchronous code, put it into ChatGPT-4 and asked it to parallelize it. Worked like a charm!

### Ways to improve this

A brand's main sustainability page often includes the highlights they are most proud of, but really important information is often buried one or more clicks deep into the website. An important improvement would be to identify relevant links on the page (using a similar process as Step 3) and then fetch those pages as well.
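A first step toward that improvement could be collecting the links on a fetched page that stay on the same domain. The sketch below uses only the standard library; `same_domain_links` and the sample HTML are hypothetical, and the resulting URLs would still need a relevance check before being fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def same_domain_links(page_url, html):
    """Return absolute, de-duplicated links that stay on page_url's domain."""
    collector = LinkCollector()
    collector.feed(html)
    domain = urlparse(page_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc != domain:
            continue
        if absolute == page_url or absolute in seen:
            continue
        seen.add(absolute)
        links.append(absolute)
    return links

sample_html = """
<a href="/planet/packaging">Packaging</a>
<a href="https://other.example.org/news">News</a>
<a href="/planet/packaging#goals">Goals</a>
<a href="climate.html">Climate</a>
"""
links = same_domain_links("https://brand.example.com/planet/", sample_html)
print(links)
```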

Note that not all pages were scraped successfully. About two dozen had errors. I didn't chase these down, but in a production context you'd want to figure out what was going on with those.
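One common way to reduce transient failures is to retry with an increasing delay. The sketch below is generic: `fetch_with_retries` is a hypothetical wrapper, `fetch` stands in for whatever function actually downloads the page, and the attempt count and delays are arbitrary assumptions. It would not help with pages that fail for permanent reasons.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out.

    fetch is any callable that returns page content, returns None on a soft
    failure, or raises on a hard failure. The delay doubles after each failure.
    """
    for attempt in range(attempts):
        try:
            result = fetch(url)
            if result is not None:
                return result
        except Exception as error:
            print(f"Attempt {attempt + 1} failed for {url}: {error}")
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None

# Stand-in fetcher that fails twice and then succeeds.
outcomes = iter([None, RuntimeError("timeout"), "<html>ok</html>"])

def flaky_fetch(url):
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

delays = []  # Record the delays instead of actually sleeping in this demo.
page = fetch_with_retries(flaky_fetch, "https://example.com", sleep=delays.append)
print(page, delays)
```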

In [None]:
import concurrent.futures
from langchain.schema import Document
from crawlbase import CrawlingAPI
from langchain_community.document_transformers import Html2TextTransformer

html2text = Html2TextTransformer()
crawlbase_api = CrawlingAPI({'token': os.environ['CRAWLBASE_JS_API_KEY']})

def cache_html(url, html):
    cache_file_name = slugify(url)
    cache_file_path = f"{root_path}/data/html_cache/{cache_file_name}.html"
    with open(cache_file_path, 'w') as file:
        file.write(html)

def load_html_doc(url):
    html_file_name = slugify(url)
    html_file_path = f"{root_path}/data/html_cache/{html_file_name}.html"
    if os.path.exists(html_file_path):
        with open(html_file_path, 'r') as file:
            doc = Document(page_content=file.read(), metadata={"url": url})
            doc_transformed = html2text.transform_documents([doc])
            return doc_transformed[0]
    return None

# You might have to run this a couple times because crawling errors are skipped
# rather than retried. Results are cached so only URLs that had an error will
# actually be refetched
def process_brand(brand):
    urls = [page["link"] for page in brand["candidate_results"] if "yes" in page["relevant"].lower() and not page["link"].endswith(".pdf")]
    #print('Processing brand', brand["brand"], len(urls), 'URLs')
    brand_docs = []  # This will store the docs for the current brand
    for url in urls:
        #print('Processing:', url)
        html_doc = load_html_doc(url)
        if html_doc is None:
            #print('Crawlbase starting:', url)
            try:
                response = crawlbase_api.get(url)
                if response['status_code'] == 200:
                    html = response['body'].decode('utf-8')
                    print('Crawlbase finished:', url)
                    # Save the raw HTML
                    cache_html(url, html)
                    # Load the cached HTML and convert it to markdown via html2text
                    html_doc = load_html_doc(url)
                else:
                    print('Crawlbase error for', url, response)
            except Exception as e:
                print('There was an error', url, e)
        #else:
            #print('Cache found for:', url)
        if html_doc is not None:
            html_doc.metadata["brand"] = brand["brand"]
            brand_docs.append(html_doc)
    return brand_docs

# Initialize an empty list to hold all documents
docs = []

# Process each brand in parallel. This can take hours upon hours if you don't
# parallelize it.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # Use `executor.map` to apply `process_brand` to each brand
    results = executor.map(process_brand, brand_data[0:1000])

    # Iterate through the results (each result is a list of docs for a brand)
    for brand_docs in results:
        # Extend the main docs list with the docs from each brand
        docs.extend(brand_docs)

# Step 5: Create summaries

Now that we have webpage content formatted in markdown, we're nearly ready to create our summaries. First, we have to create a summary pipeline, and the prompt engineering is really important in this case. You can get radically different results if you change the prompt. You can change what information the model focuses on, what it explicitly ignores, what format it returns the information in, how to cite results, and oh-so-much more.

When I started learning about LLMs I kinda scoffed at prompt engineering. I was wrong - it's a legit hard and extremely important skill. It also doesn't take any real programming skills. Liberal arts majors are probably better at it overall. If you want to learn more about prompt engineering, check out [this course](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/) by Andrew Ng and Isa Fulford. I highly recommend it.

### Ways to make this better

When I started this project, I thought I was going to need to use a vector database and implement something called "retrieval augmented generation" or RAG. For details on that, see this [LangChain doc](https://python.langchain.com/docs/use_cases/question_answering/).

However, once I started getting into it, I realized that RAG was overkill here. Instead, we could just feed *all* the information collected into the LLM's context window and trust it to make sense of it. This wouldn't have been possible a year ago because context windows were too small to fit all the material. Just goes to show how quickly things have been changing. A year ago, 4k tokens in the context window was standard, now it's 128k for OpenAI and Google goes all the way up to 10M tokens!

Nevertheless, you could probably improve the results by implementing RAG. That would allow you to add more pages and provide the LLM with just the right material. For this proof-of-concept though, RAG is overkill.
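If you add more pages per brand, it helps to check whether the combined material still plausibly fits in the context window. The sketch below uses a crude characters-per-token ratio, which is an assumption and only a rough guide; a tokenizer such as tiktoken (installed at the top of this notebook) would give real counts. The function names and the reserved output budget are also hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return len(text) // chars_per_token

def fits_in_context(doc_texts, context_limit=128_000, reserved_for_output=4_000):
    """Return (estimated_tokens, fits) for the combined documents."""
    total = sum(estimate_tokens(text) for text in doc_texts)
    return total, total <= context_limit - reserved_for_output

# Two stand-in documents of 40,000 and 120,000 characters.
sample_docs = ["a" * 40_000, "b" * 120_000]
estimate = fits_in_context(sample_docs)
print(estimate)
```

When the check fails, that is the point where trimming pages or moving to RAG starts to make sense.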

Note: I implemented caching a little differently here. I think it's slightly better, but I didn't want to spend the time to update the earlier steps.

In [200]:
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import format_document

def get_cached_summary(brand):
    cache_file_name = slugify(brand)
    cache_file_path = f"{root_path}/data/summary_cache/{cache_file_name}.json"
    if os.path.exists(cache_file_path):
        with open(cache_file_path, 'r') as file:
            return json.load(file)
    else:
        return None

# Different than before, need a single input and return input for use as Runnable
def cache_summary(x):
    brand = x["brand"]
    cache_file_name = slugify(brand)
    cache_file_path = f"{root_path}/data/summary_cache/{cache_file_name}.json"
    with open(cache_file_path, 'w') as file:
        json.dump(x, file)
    return x


def document_combiner(x):
    brand = x["brand"]
    brand_docs = [doc for doc in docs if doc.metadata["brand"] == brand]

    document_prompt=PromptTemplate.from_template(
        template="URL: {url}\nWEBSITE CONTENT:\n{page_content}\n"
    )
    doc_strings = [format_document(doc, document_prompt) for doc in brand_docs]
    return {**x, "context": "\n\n---\n\n".join(doc_strings)}

# Create the prompt that tells the LLM how to write the summary
prompt = ChatPromptTemplate.from_messages([
    ("system", """You create summaries of a company's sustainability efforts.
You focus on concrete activities and things that aren't already legally required.
Do not include in the summary vague claims like, "We prioritize worker welfare"
or "we're constantly working to reduce our environmental impact". Include any
commitments the company has made, like "We're committing to being a zero waste
company by 2030."

Your summaries include four sections (each starting with a markdown header). The
first three sections should be one or more paragraphs of text, not bullets. At
the end of each of the first three sections, you always cite your sources for
any source you used in the answer. The format you use is markdown formatted
links with incrementing numbers as the text, like this:

sources: [1](https://www.some-site.com), [2](https://www.another-site.com)

If the same source is used multiple times, it should have the same number.

The sections include:

# Environmental protection
What is the company doing to protect the environment? Examples of things to
include are carbon reduction efforts, waste minimization, renewable energy
usage, and designing more environmentally friendly products. Include any
environmental commitments and memberships like SBTi or CDP. If no information,
this section should just state "Nothing found"

# Worker welfare
What is the company doing to ensure worker welfare. Examples of things to
include are living wage programs, community building activities, fair trade
certification, and women's rights programs. If no information, this section
should just state "Nothing found"

# Animal welfare
What is the company doing to promote the health of animals used in their product
or service and/or to minimize animal suffering. Examples of things to include
are animal welfare certifications like Global Animal Partnership or cruelty free
certifications like Leaping Bunny. If no information, this section should just
state "Nothing found"

# Transparency
In this section, include two bulleted lists, one with the title "Certifications"
that lists certifications the company has attained, and one with the title
"Commitments" that lists the time-bound goals the company has set. Here is the
output format to use:

- Certifications
    - (bulleted list of certifications or "Nothing found")
- Commitments
    - (bulleted list of commitments or "Nothing found")

You will be provided with context material upon which to base your summary. Do
not include anything in the summary that is not included in the provided
context. If the provided context doesn't have any relevant information for the
brand you're making the summary for, respond with two words: "Nothing found".
Format your response in markdown."""),
    ("user", """Write a sustainability summary for {brand}.

CONTEXT:
{context}
""")
])

# I chose GPT-4 for creating the summary.
summary_llm = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0)

summary_chain = RunnablePassthrough.assign(
    summary = (
        document_combiner
        | prompt
        | summary_llm
        | StrOutputParser()
    )
) | RunnableLambda(cache_summary)


def check_cache(x):
    cached_summary = get_cached_summary(x["brand"])
    if cached_summary is None:
        print(f'Creating summary for {x["brand"]}')
        return summary_chain
    else:
        print(f'Returning cached summary for {x["brand"]}')
        return RunnableLambda(lambda x: cached_summary)

summary_with_cache_chain = RunnableLambda(check_cache)

In [None]:
# Took 37 Minutes
summaries = summary_with_cache_chain.batch(data[0:1000])

In [226]:
# Display a single summary for reference
print(summaries[0])
display(Markdown(summaries[0]["summary"]))

{'brand': "McDonald's", 'fame': '99%', 'popularity': '63%', 'summary': "# Environmental protection\nMcDonald's is actively engaged in various environmental protection initiatives focusing on climate action, sustainable packaging, and the preservation of natural resources. The company has submitted evolved science-based targets for validation by the Science Based Targets initiative (SBTi) to align with a 1.5°C pathway and the new FLAG framework. Projects executed between 2019 and 2023 are expected to contribute to a 33% reduction in greenhouse gas emissions from their global 2015 baseline. In terms of packaging and waste, McDonald's reports that approximately 81.0% of their primary guest packaging materials and 97.2% of their primary fiber packaging come from recycled or certified sources as of the end of 2022. They aim to drastically reduce plastics in Happy Meal toys and transition to more sustainable materials by the end of 2025, having already reduced virgin fossil fuel-based plasti

# Environmental protection
McDonald's is actively engaged in various environmental protection initiatives focusing on climate action, sustainable packaging, and the preservation of natural resources. The company has submitted evolved science-based targets for validation by the Science Based Targets initiative (SBTi) to align with a 1.5°C pathway and the new FLAG framework. Projects executed between 2019 and 2023 are expected to contribute to a 33% reduction in greenhouse gas emissions from their global 2015 baseline. In terms of packaging and waste, McDonald's reports that approximately 81.0% of their primary guest packaging materials and 97.2% of their primary fiber packaging come from recycled or certified sources as of the end of 2022. They aim to drastically reduce plastics in Happy Meal toys and transition to more sustainable materials by the end of 2025, having already reduced virgin fossil fuel-based plastic in Happy Meal toys by 47.8% globally. Additionally, McDonald's is committed to supporting deforestation-free supply chains for its primary commodities, achieving 99.0% deforestation-free sourcing for beef, soy for chicken feed, palm oil, coffee, and fiber for guest packaging in 2022. The company has also joined the Consumer Goods Forum’s Forest Positive Coalition to address commodity-driven deforestation and climate change issues across the sector.

sources: [1](https://corporate.mcdonalds.com/corpmcd/our-purpose-and-impact/our-planet.html)

# Worker welfare
Nothing found

# Animal welfare
McDonald's has implemented several policies aimed at promoting animal welfare across its supply chain. The company has committed to sourcing chickens not treated with antibiotics important to human medicine in the U.S. and has eliminated the use of Highest Priority Critically Important Antibiotics (HPCIAs) to human medicine from all chicken served in several countries, with plans for China to comply by the end of 2027. Additionally, McDonald's announced a policy in December 2018 to reduce the overall use of antibiotics important to human health in its beef supply chain, covering 10 beef sourcing markets around the world. The company is also on a journey to advance more sustainable beef production, striving to improve environmental practices, make a positive difference in the lives of farmers, and drive improvements in animal health and welfare. Furthermore, McDonald's has made a commitment to source only cage-free eggs by 2025 in the U.S. and Canada and is 60% towards achieving this goal. In terms of pork, more than 91% of the pork purchased in the U.S. comes from suppliers that have phased out the use of gestation stalls for housing confirmed pregnant sows, with a commitment to maximize the time that pregnant sows spend in a group environment. McDonald's has also announced a global commitment to source chickens raised with improved welfare outcomes, outlining eight Broiler Welfare Commitments expected to be fully implemented by the end of 2024 in 13 key markets.

sources: [1](https://www.mcdonalds.com/us/en-us/about-our-food/our-food-philosophy/commitment-to-quality.html)

# Transparency
- Certifications
  - Nothing found
- Commitments
  - Submit evolved science-based targets for validation by SBTi in line with 1.5°C and the new FLAG framework.
  - Contribute to a 33% reduction in GHG emissions from the global 2015 baseline through projects executed between 2019–2023.
  - Drastically reduce plastics and offer sustainable Happy Meal toys and transition to more sustainable materials by the end of 2025.
  - Support deforestation-free supply chains for primary commodities (beef, soy for chicken feed, palm oil, coffee, and fiber for guest packaging) achieving 99.0% deforestation-free sourcing in 2022.
  - Source only cage-free eggs by 2025 in the U.S. and Canada.
  - Phase out the use of gestation stalls for housing confirmed pregnant sows in the U.S. by the end of 2024.
  - Fully implement eight Broiler Welfare Commitments by the end of 2024 in 13 key markets.

# Step 6: Profit

If you've read this far, you might be interested in trying to use these results or this code to build a tool to help people shop. If that's the case, that's awesome. Thanks for caring about this. You're free to use this data and these tools however you want.

However, just know that hundreds of websites and browser extensions have tried to do just this. It's way harder than you think, and just providing information isn't enough. All those tools didn't fail because they found it too hard to scrape the data. There are deeper problems that need to be solved.

If you want more of my thoughts on this, check out some of my writings, like [this](https://www.linkedin.com/pulse/decoding-sustainable-shopping-why-current-approaches-miss-steve-isley/?trackingId=fSNbrA%2FyTzG%2BXi5uTyci0w%3D%3D) and [this](https://www.linkedin.com/pulse/make-sustainable-shopping-work-change-questions-youre-steve-isley/?trackingId=1%2F3y2XLIRgqttR4j%2FvZw4g%3D%3D). And if you'd like to chat, just send me an email at steve.c.isley@gmail.com.

# Generate Output

This section creates the output files you can use to search and view the results.

This first block of code loads all the cached summary files into a list.

In [280]:
import json
import glob

def load_json_files():
    folder_path = f"{root_path}/data/summary_cache/"
    file_paths = glob.glob(f"{folder_path}/*.json")
    all_dicts = []

    for file_path in file_paths:
        with open(file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)
            all_dicts.append(data)

    return all_dicts

all_brands = load_json_files()

# Oops, my big export incorrectly used 2 spaces instead of 4 for nested lists.
# This messes up the markdown to HTML conversion. This function fixes that. I
# updated the prompt above, so this shouldn't be an issue in future runs.
def fix_markdown_indentation(markdown_text):
    fixed_lines = []
    for line in markdown_text.split('\n'):
        if line.startswith("  -"):
            fixed_lines.append("    " + line.lstrip())
        else:
            fixed_lines.append(line)
    fixed_markdown = '\n'.join(fixed_lines)
    return fixed_markdown

for item in all_brands:
    item["summary"] = fix_markdown_indentation(item["summary"])

print(f"Loaded {len(all_brands)} dictionaries from JSON files.")
print(all_brands[0])

Loaded 998 dictionaries from JSON files.
{'brand': 'UPS', 'fame': '99%', 'popularity': '77%', 'summary': "# Environmental protection\nUPS has made significant strides in environmental protection, focusing on carbon reduction, alternative fuel usage, renewable energy, and reforestation. The company has committed to achieving 100% carbon neutrality by 2050, a goal that encompasses scope 1, 2, and 3 emissions. In pursuit of this, UPS has seen a 6.9% decrease in CO2e emissions across these scopes year over year. A cornerstone of their strategy is the investment in alternative fuels; UPS has added over 15,600 alternate fuel and advanced technology vehicles to its global fleet, including more than 1,000 electric and plug-in hybrid electric vehicles. This aligns with their target of 40% alternative fuel utilization in ground operations by 2025. In the last year, UPS purchased 162 million gallons of alternative fuels, accounting for 26.5% of their ground fuel usage. Additionally, UPS is workin
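To make the indentation fix concrete, here is a small self-contained sketch that applies the same rule to a made-up snippet shaped like the Transparency section. The sample text is an assumption for illustration, not actual cached output.

```python
def fix_markdown_indentation(markdown_text):
    """Re-indent list items written as two spaces plus a dash to four spaces."""
    fixed_lines = []
    for line in markdown_text.split("\n"):
        if line.startswith("  -"):
            # Two-space nested item: strip the indent and re-add four spaces
            fixed_lines.append("    " + line.lstrip())
        else:
            # Top-level items and already-fixed lines pass through unchanged
            fixed_lines.append(line)
    return "\n".join(fixed_lines)

# Hypothetical sample, shaped like the Transparency section of a summary
sample = (
    "- Certifications\n"
    "  - Nothing found\n"
    "- Commitments\n"
    "  - Source only cage-free eggs by 2025."
)

fixed = fix_markdown_indentation(sample)
print(fixed)
```

Because a line that already has four spaces of indent does not start with two spaces followed by a dash, running the function a second time leaves the text unchanged, so it should be safe to apply to summaries generated with the corrected prompt as well.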

In [281]:
# Prepare for exporting the markdown data into HTML for viewing in a static webpage.
import markdown
import pandas as pd

# convert markdown to HTML
md = markdown.Markdown()
all_brands_html = [{
    **item,
    "fame": int(item["fame"][:-1]),
    "popularity": int(item["popularity"][:-1]),
    # reset() clears parser state left over from the previous document
    "summary": md.reset().convert(item["summary"])
} for item in all_brands]

# Make sure it's sorted by fame descending
all_brands_html.sort(key = lambda brand: brand["fame"], reverse=True)
print(all_brands_html[0:3])

[{'brand': 'UPS', 'fame': 99, 'popularity': 77, 'summary': '<h1>Environmental protection</h1>\n<p>UPS has made significant strides in environmental protection, focusing on carbon reduction, alternative fuel usage, renewable energy, and reforestation. The company has committed to achieving 100% carbon neutrality by 2050, a goal that encompasses scope 1, 2, and 3 emissions. In pursuit of this, UPS has seen a 6.9% decrease in CO2e emissions across these scopes year over year. A cornerstone of their strategy is the investment in alternative fuels; UPS has added over 15,600 alternate fuel and advanced technology vehicles to its global fleet, including more than 1,000 electric and plug-in hybrid electric vehicles. This aligns with their target of 40% alternative fuel utilization in ground operations by 2025. In the last year, UPS purchased 162 million gallons of alternative fuels, accounting for 26.5% of their ground fuel usage. Additionally, UPS is working towards powering 25% of its facili
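The cell above strips the trailing `%` and converts `fame` to an integer before sorting. Here is a minimal sketch of why that matters, using made-up brands and a hypothetical `parse_percent` helper (neither is part of the notebook): sorting the raw strings compares them character by character, which puts `'9%'` ahead of `'77%'`.

```python
def parse_percent(value):
    """Turn a string like '99%' into the integer 99 (hypothetical helper)."""
    return int(value.strip().rstrip("%"))

# Made-up rows for illustration only
brands = [
    {"brand": "A", "fame": "77%"},
    {"brand": "B", "fame": "99%"},
    {"brand": "C", "fame": "9%"},
]

# Sorting the raw strings compares characters, not numbers
by_string = sorted(brands, key=lambda b: b["fame"], reverse=True)

# Sorting the parsed integers gives the intended ranking
by_number = sorted(brands, key=lambda b: parse_percent(b["fame"]), reverse=True)

print([b["brand"] for b in by_string])  # ['B', 'C', 'A']
print([b["brand"] for b in by_number])  # ['B', 'A', 'C']
```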

In [286]:
# Export to HTML
from jinja2 import Template

template_str = """
<!DOCTYPE html>
<html>
<head>
    <title>Data Table Example</title>
    <link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.11.3/css/jquery.dataTables.min.css">
    <script type="text/javascript" src="https://code.jquery.com/jquery-3.5.1.js"></script>
    <script type="text/javascript" src="https://cdn.datatables.net/1.11.3/js/jquery.dataTables.min.js"></script>
</head>
<body style="padding: 30px">
<h1>Brand Self Reported Sustainability Efforts</h1>
<p>
<b>Created by <a href="https://www.linkedin.com/in/stevecisley/">Steven Isley</a>.</b>
</p>
<p>I used LangChain and generative AI to automate the process of finding, downloading,
and summarizing the self reported sustainability efforts of the top 1,000 most famous
US brands. This web page summarizes the results. The full code is available at.
The code is also available in a
<a href="https://colab.research.google.com/drive/1vHvDxtA7-8-_xl6AdCx9Hr0qZBFUWNib#scrollTo=xDbexGbpxqhW">
Google Colab Notebook</a>.
</p>
<table id="data-table" class="display">
    <thead>
        <tr>
            <th>Brand</th>
            <th>Fame</th>
            <th>Popularity</th>
            <th>Summary</th>
        </tr>
    </thead>
    <tbody>
        {% for item in data %}
        <tr>
            <td><h1>{{ item.brand }}</h1></td>
            <td>{{ item.fame }}</td>
            <td>{{ item.popularity }}</td>
            <td>{{ item.summary|safe }}</td>
        </tr>
        {% endfor %}
    </tbody>
</table>

<script>
$('#data-table').DataTable({
    "order": [[1, "desc"]] // Sort by the second column (index 1) in descending order
});
</script>
</body>
</html>
"""

template = Template(template_str)
html_output = template.render(data=all_brands_html)

with open(f"{root_path}/data/brand_summary.html", 'w', encoding='utf-8') as f:
    f.write(html_output)

In [285]:
# Export CSV file
import csv

# Specify the CSV file name
csv_file_name = f"{root_path}/data/brand_summary.csv"

# Define the fieldnames based on the dictionary keys
fieldnames = ['brand', 'fame', 'popularity', 'summary']

# Open the CSV file in write mode
with open(csv_file_name, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)

    # Write the header
    writer.writeheader()

    # Write the data rows
    for row in all_brands:
        writer.writerow(row)

print(f"Data exported to {csv_file_name} successfully.")

Data exported to /content/drive/MyDrive/Colab-Notebooks/prototypes/brand-data-fetcher/data/brand_summary.csv successfully.
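The summaries contain newlines, so each CSV row spans several physical lines in the file. Here is a minimal sketch, using an in-memory buffer and a made-up row rather than the real export, showing that Python's `csv` module quotes such fields and reads them back intact.

```python
import csv
import io

# Made-up row for illustration; real rows come from all_brands
rows = [{
    "brand": "Example Brand",
    "fame": "99%",
    "popularity": "77%",
    "summary": "# Environmental protection\nNothing found\n\n"
               "# Transparency\n- Certifications\n    - Nothing found",
}]
fieldnames = ["brand", "fame", "popularity", "summary"]

# Write to an in-memory buffer instead of a file on Google Drive
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read the same text back and compare it with what was written
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(len(read_back))
print(read_back[0]["summary"] == rows[0]["summary"])
```

If you open the exported file with something other than a CSV parser (for example, by splitting on newlines), the multi-line summaries will not line up with rows, so a reader like `csv.DictReader` or `pandas.read_csv` is the safer choice.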


In [269]:
# Debugging
print(all_brands_html[0:2])
adidas = next(brand for brand in all_brands_html if brand["brand"] == "Adidas")
print(adidas["summary"])

[{'brand': 'UPS', 'fame': 99, 'popularity': 77, 'summary': '<h1>Environmental protection</h1>\n<p>UPS has made significant strides in environmental protection, focusing on carbon reduction, alternative fuel usage, renewable energy, and reforestation. The company has committed to achieving 100% carbon neutrality by 2050, a goal that encompasses scope 1, 2, and 3 emissions. In pursuit of this, UPS has seen a 6.9% decrease in CO2e emissions across these scopes year over year. A cornerstone of their strategy is the investment in alternative fuels; UPS has added over 15,600 alternate fuel and advanced technology vehicles to its global fleet, including more than 1,000 electric and plug-in hybrid electric vehicles. This aligns with their target of 40% alternative fuel utilization in ground operations by 2025. In the last year, UPS purchased 162 million gallons of alternative fuels, accounting for 26.5% of their ground fuel usage. Additionally, UPS is working towards powering 25% of its facili

# Parking Lot for Random Notes

This section is just random notes I took along the way that I don't want to delete but haven't found a better home for. Please ignore them.

* [Consumer Product Safety Commission Recalls database](https://www.saferproducts.gov/PublicSearch): could be great, but looks limited in scope to physical goods (e.g. not food brands).
* [CorpWatch API](http://api.corpwatch.org/): Really cool free data set, but is probably overkill. Hundreds of thousands of corporations but most are not of interest to me and it looks like too much work to filter.
* [YouGov](https://today.yougov.com/ratings/consumer/popularity/brands/all): They have a great list of consumer brands, but no easy way to download it. You can sort by "Fame", defined as "the % of people who have heard of a brand", which is exactly what I'm after. I just kept scrolling until the top ~1k brands were visible, then copied the HTML via the developer console.


Time keeping:
* 3-6 pm on Mar 9, 2024.
* 1-4 pm on Mar 10, 2024.
* 8:30-2:30 on Mar 11, 2024.

In [222]:
# Count the candidate URLs marked relevant across the first 1,000 brands
tmpurls = [item["candidate_results"] for item in brand_data[0:1000]]
tmpurls = [sum(url["relevant"] == "Yes" for url in item) for item in tmpurls]
sum(tmpurls)

1885
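For readers without the full `brand_data` in memory, here is a small sketch of the same count on toy data. The `candidate_results` and `relevant` keys come from the cell above; the `brand` and `url` keys and all the values are assumptions made up for illustration.

```python
# Toy stand-in for brand_data; only "candidate_results" and "relevant"
# are taken from the notebook, everything else is hypothetical
brand_data = [
    {"brand": "A", "candidate_results": [
        {"url": "https://example.com/a/sustainability", "relevant": "Yes"},
        {"url": "https://example.com/a/careers", "relevant": "No"},
    ]},
    {"brand": "B", "candidate_results": [
        {"url": "https://example.com/b/planet", "relevant": "Yes"},
        {"url": "https://example.com/b/animal-welfare", "relevant": "Yes"},
    ]},
    {"brand": "C", "candidate_results": []},
]

# Count relevant URLs per brand, then add them up
relevant_per_brand = [
    sum(result["relevant"] == "Yes" for result in item["candidate_results"])
    for item in brand_data
]
total_relevant = sum(relevant_per_brand)

print(relevant_per_brand)  # [1, 2, 0]
print(total_relevant)      # 3
```

Keeping the per-brand counts around, rather than only the total, also makes it easy to spot brands where no relevant page was found.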