# Meta (previously Facebook) Key Highlights 2024 Report System

> Disclaimer 1 💡: This project is for educational, academic and research purposes. Please use this technology ethically, legally and responsibly.

> Disclaimer 2 💡: This project is not officially affiliated with Meta or any of their partner organizations. It's entirely independent and created from ground-up by [AdiPat](https://www.github.com/AdiPat) at [The Hackers Playbook](https://www.thehackersplaybook.substack.com).

Welcome! Before we begin, let's set some context. This will help you better understand the underlying motivation and intent of the project. Hopefully, it will motivate and inspire you to improve your programming skills and upgrade your abilities. I've designed this notebook to help you at every step, so if you ever get stuck, spend some time reading each line and following each step exactly as prescribed, and you will definitely find a solution. I can't accurately articulate how that works, but it works.

## Personal Note

- Firstly, this section is subjective. Feel free to ignore it. Whether you choose to believe or disbelieve the contents of this section, in either situation, the outcome of this project is independent of your most likely, well-intentioned judgement.

- I've been following Mark Zuckerberg since I was in school, which is approximately since 2009/2010. I don't know the exact year but it was around this time when I learnt about him through the Internet.

- We share several similarities in terms of personality, attitude towards life, and most importantly Programming. I don't know him personally so this is based on his public appearances and whatever I have heard or read about him online and from people. Assuming that all the information I gathered is true, then if the world were devoid of biological and cultural nuances, and we were all judged purely as Programmers, it would be fair to say that I'm like Mark Zuckerberg's younger "programming brother".

- Most importantly, he was and is still a "hacker" at heart. If you don't believe me, observe every public announcement Meta makes.

- I have never officially worked at Meta, but I like to consider myself an invisible and unofficial contributor (not employee) at Meta. I came up with this idea to motivate myself to keep up with the standards of the ever-evolving tech world and to maintain a cultural identity that closely resembles "The Hacker Culture".

- There's a lot to add, but as time progresses, I'll continue talking about "Mark Zuckerberg's influence on AdiPat's life" in greater detail!

If you ever have any questions, email us at `thehackersplaybook0@gmail.com`.

## Naming Convention

The file name and title include "Facebook" for backward compatibility with search engines and crawlers. Since this research project is Free & Open Source, it's important for the Hacker Culture that it reaches the maximum number of people so that humanity can benefit from the effort and energy directed into this endeavour.

## Goals & Objectives

**Note:** Goals are broad, long-term aspirations that provide direction, while objectives are specific, measurable, and time-bound steps to achieve those goals.

| Goals                                                                                                  | Objectives                                                                                                                                                    |
| ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| To demonstrate the power of Llama 3.1 in AI-powered automated report generation for enterprises.       | To create a valuable content product for The Hackers Playbook which will help our students upskill.                                                           |
| Open a discussion for official partnerships with Meta and their partner organizations.                 | Successfully utilize FireCrawl for information retrieval with measurable markers that can be used by FireCrawl (Mendable) to improve its product experience.  |
| Contribute to Meta's Free & Open Source repository of software products. React inspires all of us.     | Contribute to the Python ecosystem. Create an executable experiment for Python Programmers to learn from.                                                     |
| Extract key learnings from Meta's 2024 highlights to drive strategic insights at The Hackers Playbook. | Highlight AI-safety, AI-security, Ethical AI and Responsible AI practices in action.                                                                          |
| Encourage people to "build in open".                                                                   |                                                                                                                                                               |
| Further developments in Generative AI and march towards AGI (Artificial General Intelligence).         |                                                                                                                                                               |

**Enough talk, let's start hacking!**


In [None]:
# Boilerplate: This block goes into every notebook.
# It sets up the environment, installs the requirements, and checks for the required environment variables.

import os
from IPython.display import clear_output

requirements_installed = False
max_retries = 3
retries = 0
REQUIRED_ENV_VARS = ["GROQ_API_KEY", "FIRECRAWL_API_KEY", "SERPER_API_KEY"]


def install_requirements():
    """Installs the requirements from requirements.txt file"""
    # Both names are reassigned in this function, so both must be declared
    # global; otherwise `retries += 1` raises UnboundLocalError.
    global requirements_installed, retries
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return


from dotenv import load_dotenv  # `os` is already imported at the top of this cell


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv()

    variables_to_check = REQUIRED_ENV_VARS

    for var in variables_to_check:
        check_env(var)


install_requirements()
setup_env()
clear_output()
print("🚀 Setup complete. Continue to the next cell.")
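
The environment check above stops at the first missing variable. As a minimal sketch, the same check can collect every missing variable first so they can all be reported in one pass. The helper name `find_missing_env_vars` is hypothetical and not part of this notebook. It also treats an empty string as missing, which the check above does not.

```python
import os
from typing import Iterable, List, Mapping, Optional


def find_missing_env_vars(
    required: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Illustrative environment only; real keys would come from os.environ or a .env file.
sample_env = {"GROQ_API_KEY": "dummy-value", "FIRECRAWL_API_KEY": ""}
missing = find_missing_env_vars(
    ["GROQ_API_KEY", "FIRECRAWL_API_KEY", "SERPER_API_KEY"], sample_env
)
print(missing)  # ['FIRECRAWL_API_KEY', 'SERPER_API_KEY']
```

Passing the environment in as a mapping keeps the helper easy to test without touching real API keys.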

In [None]:
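import json
from typing import List

from firecrawl import FirecrawlApp
from langchain_community.utilities import GoogleSerperAPIWrapper

cache = {}

linksCache = {}
links_cache_file = "outputs/links_cache.json"


def load_links_cache():
    """Loads the links cache from the file"""
    # Rebind the module-level cache; without `global`, the assignment below
    # would only create a local variable and the loaded data would be lost.
    global linksCache
    with open(links_cache_file, "r") as f:
        linksCache = json.load(f)
    return linksCache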


def save_links_cache():
    """Saves the links cache to the file"""
    with open(links_cache_file, "w") as f:
        json.dump(linksCache, f)


def add_links_to_cache(query, links):
    """Adds the links to the cache"""
    linksCache[query] = links
    save_links_cache()


def regular_search(query):
    """Searches the query using Google Serper API"""
    search = GoogleSerperAPIWrapper()
    response = search.results(query)
    return response


def dedupe_links(links):
    """Dedupes the links"""
    return list(set(links))


def collect_links(response: dict):
    """Collects the links from the response"""
    organic = response["organic"]
    links = []
    for item in organic:
        # "sitelinks" is optional; a missing or empty value is skipped.
        sitelinks = item.get("sitelinks")
        if sitelinks:
            for link in sitelinks:
                links.append(link["link"])
        links.append(item["link"])
    return dedupe_links(links)


def get_links_from_regular_search(query):
    """Gets the links from the regular search"""
    cached_links = linksCache.get(query)
    if cached_links:
        return cached_links
    response = regular_search(query)
    links = collect_links(response)
    add_links_to_cache(query, links)
    return links


def get_doc_links(app: FirecrawlApp, input_url: str) -> List[str]:
    """Gets the documentation links from the given URL."""
    cache_key = f"{input_url}_links"
    cached_links = cache.get(cache_key)
    if cached_links:
        print(f"Using cached links for URL: {input_url}")
        return cached_links

    # Use the client passed in by the caller rather than creating a new one.
    crawl_result = app.map_url(input_url)

    success = crawl_result["success"]

    if not success:
        raise RuntimeError(f"Failed to get links from URL: {input_url}")

    links = crawl_result["links"]
    cache[cache_key] = links

    return links


def get_single_doc_from_link(app: FirecrawlApp, link: str) -> str:
    """Gets the documentation from the given link."""
    cached_doc = cache.get(link)
    if cached_doc:
        print(f"Using cached docs for URL: {link}")
        return cached_doc

    scrape_result = None
    try:
        scrape_result = app.scrape_url(link, params={"formats": ["markdown"]})
    except Exception as e:
        print(f"Failed to get docs from URL: {link}")
        print(e)

    if not scrape_result:
        return None

    success = scrape_result["metadata"]["statusCode"] == 200

    if not success:
        print(f"Failed to get docs from URL: {link}")
        return None

    markdown = scrape_result["markdown"]
    cache[link] = markdown

    return markdown
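
To make the link-collection step concrete, here is a small self-contained sketch. The response is hand-written sample data, not real Serper output. It only mirrors the `organic`, `sitelinks`, and `link` keys that the code above reads. The URLs are placeholders.

```python
from typing import Any, Dict, List


def collect_links(response: Dict[str, Any]) -> List[str]:
    """Collect result links and their sitelinks, then drop duplicates."""
    links = []
    for item in response.get("organic", []):
        for sitelink in item.get("sitelinks") or []:
            links.append(sitelink["link"])
        links.append(item["link"])
    return list(set(links))  # set-based dedupe, so order is not guaranteed


# Hypothetical sample data shaped like the fields read above.
sample_response = {
    "organic": [
        {
            "link": "https://example.com/a",
            "sitelinks": [{"link": "https://example.com/a/docs"}],
        },
        {"link": "https://example.com/b"},
        {"link": "https://example.com/a"},  # duplicate of the first result
    ]
}

links = collect_links(sample_response)
print(sorted(links))
```

The output is sorted before printing only because a `set` does not keep the original ranking order.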

In [18]:
import json
import os
from typing import Dict, List

DEFAULT_LINKS_CACHE_FILE = "outputs/links_cache.json"


class LinksCache:
    """A simple file-backed cache that maps search queries to their links."""

    def __init__(self, links_cache_file=DEFAULT_LINKS_CACHE_FILE):
        """Initializes the links cache."""
        self.links_cache_file = links_cache_file
        self.links_cache = self.__load_links_cache()

    def __load_links_cache(self) -> Dict[str, list]:
        """Loads the links cache from the file, creating the file if needed."""
        if not os.path.exists(self.links_cache_file):
            # Make sure the parent directory (e.g. "outputs/") exists first.
            os.makedirs(os.path.dirname(self.links_cache_file) or ".", exist_ok=True)
            with open(self.links_cache_file, "w") as f:
                json.dump({}, f)
            self.links_cache = {}
            return self.links_cache

        with open(self.links_cache_file, "r") as f:
            self.links_cache = json.load(f)
        return self.links_cache

    def __save_links_cache(self) -> None:
        """Saves the links cache to the file"""
        with open(self.links_cache_file, "w") as f:
            json.dump(self.links_cache, f)

    def add_links_to_cache(self, query: str, links: List[str]) -> None:
        """Adds the links to the cache"""
        self.links_cache[query] = links
        self.__dedupe_links_cache()
        # self.save_links_cache() # Not required because dedupe links cache already saves the cache.

    def has(self, query: str) -> bool:
        """Checks if the query is in the cache"""
        return query in self.links_cache

    def get(self, query: str) -> List[str]:
        """Gets the links from the cache"""
        return self.links_cache.get(query)

    def __dedupe_links_cache(self) -> None:
        """Dedupes the links cache"""
        for query, links in self.links_cache.items():
            self.links_cache[query] = self.__dedupe_links(links)
        self.__save_links_cache()

    def __dedupe_links(self, links: List[str]) -> List[str]:
        """Dedupes the links"""
        return list(set(links))
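
One thing to keep in mind: `list(set(links))` removes duplicates but does not guarantee that the links stay in search-ranking order. If order matters, a possible alternative (a sketch only, with the hypothetical name `dedupe_preserving_order`) relies on the fact that `dict` keys are unique and keep insertion order:

```python
from typing import List


def dedupe_preserving_order(links: List[str]) -> List[str]:
    """Remove duplicates while keeping the first occurrence of each link."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(links))


ranked = ["https://example.com/b", "https://example.com/a", "https://example.com/b"]
print(dedupe_preserving_order(ranked))
# ['https://example.com/b', 'https://example.com/a']
```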

In [12]:
import re


class Utils:
    """Common utility functions for text processing and analysis."""

    @staticmethod
    def count_tokens(text: str) -> int:
        """
        Approximate the number of tokens in a text input for an LLM.

        This is a regex-based heuristic, not a real tokenizer, so the count
        will differ from what a model's own tokenizer reports.

        Args:
            text (str): The input text to calculate tokens for.

        Returns:
            int: Approximate number of tokens in the input text.
        """
        # Clean text and normalize spaces
        text = re.sub(r"\s+", " ", text.strip())

        # Approximate tokenization: each run of word characters counts as one
        # token, and each punctuation mark counts as one token.
        tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

        # Return the token count
        return len(tokens)
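
To see what this approximation does in practice, here is a standalone sketch that applies the same two regular expressions. The function name `approx_token_count` and the sample strings are for illustration only.

```python
import re


def approx_token_count(text: str) -> int:
    """Count runs of word characters and individual punctuation marks."""
    text = re.sub(r"\s+", " ", text.strip())
    return len(re.findall(r"\w+|[^\w\s]", text))


print(approx_token_count("Hello, world!"))  # 4 -> Hello , world !
print(approx_token_count("Llama 3.1 is open-source."))  # 9 -> Llama 3 . 1 is open - source .
```

Note how `3.1` and `open-source` each split into three pieces. The count is a rough budget estimate, not a match for any specific model's tokenizer.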

In [26]:
from firecrawl import FirecrawlApp
from langchain_community.utilities import GoogleSerperAPIWrapper
from typing import Any, Dict, List
import traceback
from pprint import pp
from datetime import datetime


class SOFiCSearchEngine:
    """SOFiC (Serper Orchestrated FireCrawl) Search Engine: A simple search engine built on the FireCrawl Python SDK."""

    def __init__(self):
        """Initializes the SOFiCSearchEngine class."""

        self.firecrawl = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))
        self.cache = {}
        self.links_cache = LinksCache()
        self.docs_cache = {}
        print("🔥 SOFiC Search Engine initialized. 🕷️")

    def __serper_search(self, query: str) -> dict:
        """Searches a search engine, currently Google via Serper API, and returns the raw response."""
        search = GoogleSerperAPIWrapper()
        response = search.results(query)
        return response

    def search(self, query: str) -> dict:
        """Searches the query and returns the processed pages with their token counts."""
        start_time = datetime.now()
        print(f"📡 Searching for query: {query}")
        results = {}
        processed_pages = []
        initial_links = self.get_links_for_query(query)
        print(f"Got {len(initial_links)} links for query: {query}")
        for link in initial_links:
            print(f"Getting page content in markdown from URL: {link}")
            result = {"url": link}
            page_markdown, tokens_fetched = self.get_page_markdown(link)
            if page_markdown:
                result["markdown"] = page_markdown
                result["tokens_fetched"] = tokens_fetched
                processed_pages.append(result)
        results["processed_pages"] = processed_pages
        results["total_tokens_fetched"] = sum(
            [page["tokens_fetched"] for page in processed_pages]
        )
        end_time = datetime.now()
        time_difference_seconds = str((end_time - start_time).total_seconds()) + "s"
        print(f"🕒 Search completed in {time_difference_seconds}")
        return results

    def get_links_for_query(self, query: str) -> List[str]:
        """Gets the links for a given query."""

        def collect_links(response: Dict[str, Any]) -> List[str]:
            """Collects the links from the response"""
            organic = response.get("organic")

            if not organic:
                print("No organic results found.")
                return []

            links = []
            for item in organic:
                sitelinks = item.get("sitelinks")
                if sitelinks:
                    for link in sitelinks:
                        links.append(link["link"])
                links.append(item["link"])
            return links

        if self.links_cache.has(query):
            return self.links_cache.get(query)

        response = self.__serper_search(query)
        links = collect_links(response)
        self.links_cache.add_links_to_cache(query, links)
        return self.links_cache.get(query)

    def get_links(self, input_url: str) -> List[str]:
        """Gets the links from the given URL."""
        try:
            print(f"Getting links from URL: {input_url}")
            cache_key = f"{input_url}_links"
            # Use this instance's cache rather than a module-level `cache`.
            cached_links = self.cache.get(cache_key)
            if cached_links:
                print(f"Using cached links for URL: {input_url}")
                return cached_links

            app = self.firecrawl
            crawl_result = app.map_url(input_url)

            success = crawl_result["success"]

            if not success:
                raise RuntimeError(f"Failed to get links from URL: {input_url}")

            links = crawl_result["links"]
            print(f"Got {len(links)} links from URL: {input_url}")
            self.cache[cache_key] = links

            return links
        except Exception as e:
            print(f"Failed to get links from URL: {input_url}")
            traceback.print_exc()
            return []

    def get_page_markdown(self, link: str) -> tuple:
        """Gets the page content as markdown from the given link.

        Returns a (markdown, token_count) tuple; markdown is None or empty on failure.
        """
        try:
            print(f"Getting page content in markdown from URL: {link}")
            cached_doc = self.docs_cache.get(link)
            if cached_doc:
                print(f"Using cached docs for URL: {link}")
                # Return the same (markdown, token_count) shape as a fresh fetch,
                # because the caller unpacks two values.
                return cached_doc, Utils.count_tokens(cached_doc)

            app = self.firecrawl
            scrape_result = app.scrape_url(link, params={"formats": ["markdown"]})

            if not scrape_result:
                return None, 0

            success = scrape_result["metadata"]["statusCode"] == 200

            if not success:
                print(f"Failed to get docs from URL: {link}")
                return None, 0

            print(f"Got page content in markdown from URL: {link}")

            markdown = scrape_result["markdown"]
            self.docs_cache[link] = markdown

            token_count = Utils.count_tokens(markdown)
            print(f"Token count: {token_count}")

            print(f"Markdown size: {len(markdown)}")

            return markdown, token_count
        except Exception as e:
            print(f"Failed to get page content in markdown from URL: {link}")
            traceback.print_exc()
            return "", 0
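
The `search` method boils down to three steps: fetch each link, keep the pages that returned content, and sum their token counts. Below is a minimal sketch of that aggregation with a stub in place of FireCrawl. The names `aggregate_pages` and `fake_fetch` are hypothetical, and the stub's return values are made up.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

FetchFn = Callable[[str], Tuple[Optional[str], int]]


def aggregate_pages(links: List[str], fetch: FetchFn) -> Dict[str, Any]:
    """Fetch each link and build the same result shape that `search` returns."""
    processed_pages = []
    for link in links:
        markdown, tokens_fetched = fetch(link)
        if markdown:  # skip pages that failed or came back empty
            processed_pages.append(
                {"url": link, "markdown": markdown, "tokens_fetched": tokens_fetched}
            )
    return {
        "processed_pages": processed_pages,
        "total_tokens_fetched": sum(p["tokens_fetched"] for p in processed_pages),
    }


def fake_fetch(link: str) -> Tuple[Optional[str], int]:
    """Stand-in for a real scraper: one URL 'fails', the others succeed."""
    if link.endswith("/broken"):
        return None, 0
    return f"# Page at {link}", 5


results = aggregate_pages(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_fetch,
)
print(results["total_tokens_fetched"])  # 10
print([page["url"] for page in results["processed_pages"]])
```

Passing the fetcher in as a function makes it possible to exercise the aggregation logic without spending API credits.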

In [29]:
from pprint import pp
from IPython.display import clear_output


def write_results_to_file(results, query):
    """Writes the search results to a file."""
    print("Writing search results to file...")
    with open(f"outputs/search_results_{query}.json", "w") as f:
        json.dump(results, f)
    print("Search results written to file.")


def search_internal(query: str):
    """Searches the query using the FireCrawl search engine."""
    search_engine = SOFiCSearchEngine()
    response = search_engine.search(query)
    clear_output()
    print(f"Search results for {query} computed.")
    return response


def run_search_engine(query: str):
    """Runs the FireCrawl search engine with the given query."""
    response = search_internal(query)
    write_results_to_file(response, query)
    return response


results = run_search_engine("firestore")
pp(results)

Search results for firestore computed.
Writing search results to file...
Search results written to file.
{'processed_pages': [{'url': 'https://firebase.google.com/docs/perf-mon/get-started-android',
                      'markdown': '[![Firebase](https://www.gstatic.com/devrel-devsite/prod/vdf1c73ddfa29bc07c1524d67528b078b0717f3e7ffc0621bf09846cb55759e81/firebase/images/lockup.svg)](https://firebase.google.com/)\n'
                                  '\n'
                                  '`/`\n'
                                  '\n'
                                  'Language\n'
                                  '\n'
                                  '- '
                                  '[English](https://firebase.google.com/docs/perf-mon/get-started-android)\n'
                                  '- '
                                  '[Deutsch](https://firebase.google.com/docs/perf-mon/get-started-android?hl=de)\n'
                                  '- '
                              