# Importing the dependencies

In [1]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests

print("Everything imported successfully 😃")


Everything imported successfully 😃


# Extracting content from blog

In [39]:
URL = "https://www.codecontent.net/post/introduction-to-llama"

In [40]:
def extract_process(url):
    try:
        # Extracting the html content from the given URL
        # (a timeout is set so a stalled server cannot hang the notebook)
        response = requests.get(url, timeout=10)

        # Check if the request was successful
        if response.status_code == 200:

            # Instantiating BeautifulSoup class
            bsoup = BeautifulSoup(response.text, "html.parser")

            # Extracting the main article content
            article_content = bsoup.find_all(["h1", "p"])

            if article_content:

                # Extracting the text from the headings and paragraphs found above
                text = [content.text for content in article_content]
                ARTICLE = " ".join(text)
                return ARTICLE

            else:
                return "Unable to extract article content. Please check the URL or website structure."
        else:
            return f"Failed to retrieve content. Status Code: {response.status_code}"

    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"

In [41]:
# Checking if everything is working
extract_process("https://www.codecontent.net/post/introduction-to-llama")

'This article explores the transformative journey of Meta\'s LlaMA series, from the foundational LLaMA 1, through the enhanced LLaMA 2, to the cutting-edge LLaMA 3, showcasing its significant advancements in AI with a focus on increased speed, multilingual capabilities, and sophisticated machine learning technologies that push the boundaries of language model development and application. In the dynamic realm of artificial intelligence, Large Language Models (LLMs) like LLaMA, developed by Meta (formerly known as Facebook Inc.), are pivotal, driving significant advancements in technology. The term "Meta" refers to the tech giant that has expanded its focus from social media to broader technological innovations, including AI research. The LLaMA series, which stands for "Large Language Model Meta AI," showcases this progression in machine learning and natural language processing. These models operate by predicting the next word from a sequence of words inputted, thus generating coherent a

Before splitting the ARTICLE into sentences, we first append an `<eos>` tag after every `.`, `?`, and `!`. Doing so has several benefits, which are listed below.

- Punctuation, particularly full stops, question marks, and exclamation marks, often indicates the end of a sentence. Following these characters with `<eos>` markers makes the boundaries explicit, especially when dealing with ambiguously punctuated text or mixed languages.
- This explicit marking simplifies splitting, ensuring each piece of text falls within a well-defined sentence unit. This can be crucial for tasks like sentiment analysis, question answering, or language modeling, where understanding sentence structure is important.
- Treating punctuation as separate tokens can sometimes lead to issues in downstream tasks. Marking sentence ends with `<eos>` allows them to be handled specially during processing. For example, in sentiment analysis, you might want to exclude punctuation when calculating sentiment scores.
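
The tagging step described above can be sketched with a single regular expression. This is a minimal illustration rather than the notebook's own code (which uses three `str.replace` calls further down); the helper name `mark_sentence_ends` is hypothetical. Like the `str.replace` approach, it also tags periods inside abbreviations such as "Inc." and inside decimals, so it is only a rough sentence splitter.

```python
import re


def mark_sentence_ends(text: str, tag: str = "<eos>") -> str:
    """Append `tag` after every '.', '?' or '!' found in `text`."""
    return re.sub(r"[.?!]", lambda match: match.group(0) + tag, text)


sample = "Is this working? Yes. Great!"
tagged = mark_sentence_ends(sample)
print(tagged)

# Splitting on the tag recovers the individual sentences.
sentences = [part.strip() for part in tagged.split("<eos>") if part.strip()]
print(sentences)
```

Filtering out empty strings after the split avoids the trailing empty entry that appears when the text ends with a tagged punctuation mark.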

# Data processing
There are several reasons why we typically pair a model with its own tokenizer when working with libraries like Hugging Face Transformers:

1. Compatibility: Models and their corresponding tokenizers are trained together specifically for a particular vocabulary and input format. Using a different tokenizer might lead to mismatched vocabulary tokens, numerical IDs, and ultimately incorrect data representations for the model. This can result in unexpected behavior and poor performance.

2. Consistency: By using the recommended tokenizer, you ensure that the input data is tokenized according to the way the model was trained. This consistency avoids introducing unnecessary variations that could potentially affect the model's predictions.

3. Pre-built vocabulary: When you use the model's tokenizer, you benefit from having the model's vocabulary readily available. This saves you the effort of building your own vocabulary and potential issues with missing words or inconsistent representations.

4. Optimization: The tokenizer and model are likely optimized to work together efficiently. Using a different tokenizer might require additional processing or introduce inefficiencies in the data conversion pipeline.
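
The compatibility point can be illustrated with a toy example. The two vocabularies below are made up for illustration and do not correspond to any real tokenizer; the sketch only shows that the same text maps to different IDs under different vocabularies, so IDs produced by one tokenizer would be misread by a model trained with another.

```python
# Two hypothetical vocabularies (not taken from any real model).
vocab_a = {"<unk>": 0, "large": 1, "language": 2, "model": 3}
vocab_b = {"<unk>": 0, "model": 1, "large": 2, "small": 3}


def encode(text: str, vocab: dict) -> list:
    """Map whitespace-separated words to IDs, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


text = "Large language model"
ids_a = encode(text, vocab_a)
ids_b = encode(text, vocab_b)
print(ids_a)
print(ids_b)
```

A model trained with `vocab_a` would interpret the IDs in `ids_b` as entirely different words, which is the mismatch described in point 1.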

`T5 was mostly trained using 512 input tokens`

In [42]:
# Extracting all the content into 1 single text
ARTICLE = extract_process(URL)

In [43]:
ARTICLE = ARTICLE.replace(".", ".<eos>")
ARTICLE = ARTICLE.replace("?", "?<eos>")
ARTICLE = ARTICLE.replace("!", "!<eos>")

In [44]:
def chunk_creation(article, max_chunk=512):
    """
    Groups the sentences of an article into chunks, respecting sentence boundaries and a maximum chunk size.

    Args:
        article: The text of the article, with '<eos>' marking the end of each sentence.
        max_chunk: The maximum number of space-separated words allowed in a chunk
            (a rough proxy for tokens; the model's tokenizer is not used here).

    Returns:
        A list of chunks, each represented as a string.
    """

    # Split the article into sentences based on the '<eos>' marker.
    sentences = article.split("<eos>")

    # Initialize variables for chunk creation.
    current_chunk = 0
    chunks = []

    for sentence in sentences:
        if len(chunks) == current_chunk + 1:
            if (
                len(chunks[current_chunk].split(" ")) + len(sentence.split(" "))
                <= max_chunk
            ):
                chunks[current_chunk] += " " + sentence
            else:
                current_chunk += 1
                chunks.append(sentence)
        else:
            chunks.append(sentence)

    for chunk_id in range(len(chunks)):
        chunks[chunk_id] = chunks[chunk_id].strip()

    return chunks

In [45]:
CHUNKS = chunk_creation(ARTICLE)

In [46]:
CHUNKS

['This article explores the transformative journey of Meta\'s LlaMA series, from the foundational LLaMA 1, through the enhanced LLaMA 2, to the cutting-edge LLaMA 3, showcasing its significant advancements in AI with a focus on increased speed, multilingual capabilities, and sophisticated machine learning technologies that push the boundaries of language model development and application.  In the dynamic realm of artificial intelligence, Large Language Models (LLMs) like LLaMA, developed by Meta (formerly known as Facebook Inc. ), are pivotal, driving significant advancements in technology.  The term "Meta" refers to the tech giant that has expanded its focus from social media to broader technological innovations, including AI research.  The LLaMA series, which stands for "Large Language Model Meta AI," showcases this progression in machine learning and natural language processing.  These models operate by predicting the next word from a sequence of words inputted, thus generating cohe
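
Because `chunk_creation` measures length with `split(" ")`, its `max_chunk` budget counts words rather than model tokens, and subword tokenizers usually produce more tokens than words. A quick sanity check on chunk sizes can be sketched as follows; the helper `word_counts` and the toy list are hypothetical stand-ins, with the real input being the `CHUNKS` list above.

```python
def word_counts(chunks):
    """Return the number of space-separated words in each chunk."""
    return [len(chunk.split(" ")) for chunk in chunks]


# Toy stand-in for CHUNKS; the real list comes from chunk_creation(ARTICLE).
toy_chunks = ["LLaMA is a family of models.", "It was developed by Meta."]
max_chunk = 512

counts = word_counts(toy_chunks)
print(counts)

# Flag any chunk that exceeds the word budget.
oversized = [index for index, count in enumerate(counts) if count > max_chunk]
print(oversized)
```

A single sentence longer than `max_chunk` words would still end up as its own oversized chunk, so a check like this is worth running before passing the chunks to the model.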

# Model information

Developed by Google researchers, T5 is a large-scale transformer-based language model that has achieved state-of-the-art results on various NLP tasks, including text summarization. As the model is pre-trained on a mixture of unsupervised and supervised tasks, it has the potential to generalize well to new tasks. The model is pre-trained on the Colossal Clean Crawled Corpus (C4), which was developed and released in the context of the same research paper as T5.

One of the most exciting applications of T5 is in text summarization. Summarizing lengthy documents while preserving the most relevant information is a challenging task, but T5 has achieved impressive results in this area. By inputting the text to be summarized with the prefix “summarize:”, T5 can generate a concise summary that captures the essence of the original document. This is useful for applications such as news articles, scientific papers, and legal documents. 

T5 comes in different sizes: t5-small, t5-base, t5-large, t5-3b, and t5-11b. For our use case we will be using the t5-base model.

In [8]:
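# Instantiating the summarization pipeline.
# Note: no model is passed here, so the pipeline uses its default checkpoint
# (see the log below), not t5-base. Pass model="t5-base" to use T5.
summarizer = pipeline("summarization")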

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
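
As the log above notes, calling `pipeline("summarization")` without a model falls back to `sshleifer/distilbart-cnn-12-6`. To summarize with T5 as the text describes, the checkpoint can be named explicitly. A minimal sketch, assuming the `t5-base` checkpoint on the Hugging Face Hub; the wrapper name `build_summarizer` is hypothetical, and the import is deferred so that defining the helper does not load the library or download anything.

```python
def build_summarizer(model_name: str = "t5-base"):
    """Create a summarization pipeline pinned to an explicit checkpoint."""
    # Imported lazily: the model download happens only when this is called.
    from transformers import pipeline

    return pipeline("summarization", model=model_name)


# Usage (downloads the checkpoint on first call):
# summarizer = build_summarizer()
# results = summarizer(CHUNKS, max_length=200, min_length=50, do_sample=False)
```

Pinning the model name also makes the notebook reproducible, since the library's default checkpoint is not something the notebook controls.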



In [47]:
results = summarizer(CHUNKS, max_length=200, min_length=50, do_sample=False)
summarized_text = " ".join([summ["summary_text"] for summ in results])

In [48]:
summarized_text

" Large Language Models (LLMs) like LLaMA, developed by Meta (formerly known as Facebook Inc. ), are pivotal, driving significant advancements in technology . Each iteration builds upon the previous successes, enhancing functionalities and introducing new features that more closely mimic human cognitive processes . This article explores the transformative journey of Meta's LlaMA series .  LLaMA 2 achieved a remarkable 50% increase in processing speed and a 40% improvement in accuracy . The positive reception of the model was pivotal. Feedback from the community and insights from real-world applications were instrumental in shaping the development of LlaMA 3 . LLaMa 3 employs several advanced fine-tuning strategies to enhance its functionality .  LLaMA 3 has demonstrated up to a 35% increase in processing speed and a 40% improvement in the accuracy of generated content compared to LlaMA 2 . The AI system is equipped with a new tokenizer capable of handling an impressive 128,000 differen