# Importing the dependencies

In [1]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests
print("Everything imported successfully 😃")

Everything imported successfully 😃


# Extracting content from blog

In [2]:
URL = "https://medium.com/marvelous-mlops/mlops-roadmap-2024-ff4216b8bc62"

In [8]:
def extract_process(URL):
    try:
        # Extracting the html content from the given URL (the timeout avoids hanging on an unresponsive server)
        response = requests.get(URL, timeout=10)
        
        # Check if the request was successful
        if response.status_code == 200:
            
            # Instantiating BeautifulSoup class
            bsoup = BeautifulSoup(response.text, 'html.parser')
            
            # Collecting every <h1> and <p> element (headings and paragraphs) on the page
            article_content = bsoup.find_all(['h1', 'p'])
            
            if article_content:
                
                # Extracting the text from paragraphs within the article
                text = [content.text for content in article_content]
                ARTICLE = ' '.join(text)
                return ARTICLE
                
            else:
                return "Unable to extract article content. Please check the URL or website structure."
        else:
            return f"Failed to retrieve content. Status Code: {response.status_code}"
            
    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"

Before splitting the ARTICLE into sentences, we first append an `<eos>` tag after every `.`, `?` and `!`. Doing so has several benefits:

- Punctuation, particularly full stops, question marks, and exclamation marks, often indicates the end of a sentence. Following these characters with `<eos>` markers makes the boundaries explicit, especially when dealing with ambiguously punctuated text or mixed languages.
- This explicit marking simplifies later processing, since every token falls within a well-defined sentence unit. This can be crucial for tasks like sentiment analysis, question answering, or language modeling, where understanding sentence structure is important.
- Treating punctuation as separate tokens can sometimes lead to issues in downstream tasks. An explicit `<eos>` marker allows sentence boundaries to be treated specially during processing. For example, in sentiment analysis, you might want to exclude punctuation when calculating sentiment scores.
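
A plain string replacement on `.` also marks periods inside decimals or domain names. As a minimal sketch of one alternative (not the approach used in this notebook), the regular expression below only marks punctuation that is followed by whitespace or the end of the text. The names `mark_sentence_ends` and `split_sentences` are illustrative, and abbreviations such as "e.g." would still be treated as sentence ends.

```python
import re
from typing import List

EOS = "<eos>"

# One or more of . ? ! followed by whitespace or the end of the text.
# Assumption: a period inside "3.5" or "example.com" is followed by a
# non-space character, so it is left untouched.
_SENTENCE_END = re.compile(r"([.?!]+)(?=\s|$)")


def mark_sentence_ends(text: str) -> str:
    """Append <eos> after each run of '.', '?' or '!' that ends a sentence."""
    return _SENTENCE_END.sub(lambda m: m.group(1) + EOS, text)


def split_sentences(text: str) -> List[str]:
    """Mark sentence ends, split on <eos>, and drop empty fragments."""
    marked = mark_sentence_ends(text)
    return [part.strip() for part in marked.split(EOS) if part.strip()]


sample = "Version 3.5 is out. Is it good? Yes!"
sentences = split_sentences(sample)
```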

# Data processing
There are several reasons to pair a model with its own tokenizer when working with libraries like Hugging Face Transformers:

1. Compatibility: Models and their corresponding tokenizers are trained together specifically for a particular vocabulary and input format. Using a different tokenizer might lead to mismatched vocabulary tokens, numerical IDs, and ultimately incorrect data representations for the model. This can result in unexpected behavior and poor performance.

2. Consistency: By using the recommended tokenizer, you ensure that the input data is tokenized according to the way the model was trained. This consistency avoids introducing unnecessary variations that could potentially affect the model's predictions.

3. Pre-built vocabulary: When you use the model's tokenizer, you benefit from having the model's vocabulary readily available. This saves you the effort of building your own vocabulary and potential issues with missing words or inconsistent representations.

4. Optimization: The tokenizer and model are likely optimized to work together efficiently. Using a different tokenizer might require additional processing or introduce inefficiencies in the data conversion pipeline.
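
To make the compatibility point concrete, here is a toy sketch in plain Python. The two vocabularies and the `encode` and `decode` helpers are hypothetical stand-ins, not part of the Transformers API; they only illustrate what happens when IDs produced with one vocabulary are interpreted with another.

```python
# Two hypothetical toy vocabularies: same tokens, different ID assignments.
vocab_a = {"<unk>": 0, "the": 1, "model": 2, "reads": 3, "text": 4}
vocab_b = {"<unk>": 0, "text": 1, "reads": 2, "the": 3, "model": 4}


def encode(tokens, vocab):
    """Map tokens to IDs, falling back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def decode(ids, vocab):
    """Map IDs back to tokens using the inverse of the vocabulary."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return [inverse[i] for i in ids]


tokens = ["the", "model", "reads", "text"]
ids = encode(tokens, vocab_a)

matched = decode(ids, vocab_a)     # same vocabulary on both sides
mismatched = decode(ids, vocab_b)  # IDs interpreted with the other vocabulary
```

The IDs are valid in both vocabularies, so nothing fails loudly; the mismatched side simply sees different words, which is why the error tends to show up only as poor model output.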

T5 was mostly trained with input sequences of 512 tokens, which is why the chunking step below uses 512 as its default limit.

In [26]:
# Extracting all the content into 1 single text
ARTICLE = extract_process(URL)

In [38]:
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')

In [39]:
def chunk_creation(article, max_chunk=512):
    """
    Groups the sentences of an article into chunks, respecting sentence boundaries and a maximum chunk size.

    Args:
        article: The text of the article, with '<eos>' marking the end of each sentence.
        max_chunk: The maximum number of space-separated words allowed in a chunk
            (a rough stand-in for the model's token count).

    Returns:
        A list of chunks, each represented as a string.
    """

    # Split the article into sentences based on the '<eos>' marker.
    sentences = article.split('<eos>')

    # Initialize variables for chunk creation.
    current_chunk = 0
    chunks = []

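    for sentence in sentences:
        if len(chunks) == current_chunk + 1:
            # A chunk is already open: extend it if the word budget allows.
            if len(chunks[current_chunk].split(' ')) + len(sentence.split(' ')) <= max_chunk:
                chunks[current_chunk] += ' ' + sentence
            else:
                # Budget exceeded: start a new chunk with this sentence.
                current_chunk += 1
                chunks.append(sentence)
        else:
            # First sentence: open the first chunk.
            chunks.append(sentence)

    # Strip leading and trailing whitespace from every chunk.
    for chunk_id in range(len(chunks)):
        chunks[chunk_id] = chunks[chunk_id].strip()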

    return chunks

In [40]:
CHUNKS = chunk_creation(ARTICLE)
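
`chunk_creation` measures chunk size in space-separated words, while the model's limit is expressed in tokens. The sketch below is one possible variant, not the notebook's method: `chunk_sentences`, `count_words` and the `length_fn` parameter are illustrative names, and the length measure is pluggable so that a tokenizer-based count could be swapped in.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Default length measure: whitespace-separated words."""
    return len(text.split())


def chunk_sentences(
    sentences: List[str],
    max_len: int = 512,
    length_fn: Callable[[str], int] = count_words,
) -> List[str]:
    """Greedily pack sentences into chunks whose summed length stays <= max_len.

    A single sentence longer than max_len still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty fragments left over from splitting
        n = length_fn(sentence)
        # Close the open chunk when this sentence would exceed the budget.
        if current and current_len + n > max_len:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = chunk_sentences(["a b c.", "d e.", "", "f g h i."], max_len=5)
```

With a tokenizer already loaded (assumed here as `tokenizer`), passing `length_fn=lambda s: len(tokenizer.encode(s, add_special_tokens=False))` would budget in tokens instead. Summing per-sentence counts is only an approximation of the length of the joined chunk.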

# Model information

Developed by Google researchers, T5 is a large-scale transformer-based language model that has achieved state-of-the-art results on various NLP tasks, including text summarization. As the model is pre-trained on a mixture of unsupervised and supervised tasks, it has the potential to generalize well to new tasks. The model is pre-trained on the Colossal Clean Crawled Corpus (C4), which was developed and released in the context of the same research paper as T5.

One of the most exciting applications of T5 is in text summarization. Summarizing lengthy documents while preserving the most relevant information is a challenging task, but T5 has achieved impressive results in this area. By inputting the text to be summarized with the prefix “summarize:”, T5 can generate a concise summary that captures the essence of the original document. This is useful for applications such as news articles, scientific papers, and legal documents. 

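T5 comes in different sizes: t5-small, t5-base, t5-large, t5-3b and t5-11b. For our use case we want the t5-base model, which has to be requested explicitly through the pipeline's `model` argument; without it, the pipeline falls back to a default checkpoint.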
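
A minimal sketch of pinning the checkpoint, assuming the `transformers` library is installed and the `t5-base` weights can be downloaded. The helper name `build_summarizer` is hypothetical; the only library call used is `pipeline("summarization", model=...)`.

```python
def build_summarizer(model_name: str = "t5-base"):
    """Create a summarization pipeline for an explicitly named checkpoint."""
    # Imported lazily so that defining this helper does not load the
    # deep-learning backend; weights are fetched when the helper is called.
    from transformers import pipeline

    return pipeline("summarization", model=model_name)
```

Calling `build_summarizer()` would return a pipeline backed by `t5-base`, and `build_summarizer("t5-small")` would select the smaller checkpoint.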

In [29]:
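# Instantiating the summarization pipeline. No model is passed here, so the
# pipeline falls back to its default checkpoint; add model="t5-base" to use
# the base model instead.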
summarizer = pipeline("summarization")

No model was supplied, defaulted to t5-small and revision d769bba (https://huggingface.co/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [49]:
results = summarizer(CHUNKS, max_length=200, min_length=50, do_sample=False)
summarized_text = ' '.join([summ['summary_text'] for summ in results])

In [51]:
summarized_text

'MLOps engineers work more on building a platform that is used by machine learning engineers and data scientists . the roadmap may have some updates during the year . we suggest starting learning Python by reading a proper Python book and practicing the concepts . MLOps engineers must be aware of the principles and factors that contribute to the maturity of the machine learning system . ML-specific orchestration systems keep all your model runs in the same place and help with: MLflow is probably the most popular tool for modeling and experiment tracking . a feature store with Feast part 1, part 2, part 3 You need to track what data was used to produce a model artifact . the answer to this question would be “it depends” . some data science teams rely on cloud-native solutions like AWS Sagemaker or Azure ML .'