# Hugging Face + LangChain: Text Summarization

## Goal:

The primary goal is to create a tool that reads text from different file formats (Markdown, PDF, and DOCX) and produces a concise, informative summary of the content. This is useful for quickly grasping the main points of long documents without reading them in full.

In [None]:
!pip install python-docx

In [None]:
!pip install PyMuPDF

In [3]:
from docx import Document
import fitz  # PyMuPDF
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def read_docx(file_path):
    doc = Document(file_path)
    return " ".join([paragraph.text for paragraph in doc.paragraphs])

def read_pdf(file_path):
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

def read_markdown(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def format_summary(summary):
    sentences = sent_tokenize(summary)
    formatted_summary = "\n".join(sentences)
    return formatted_summary

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
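
To make the effect of `format_summary` concrete, here is a minimal sketch of the same "one sentence per line" formatting. It swaps NLTK's `sent_tokenize` for a plain regular expression so it runs without the `punkt` download; the name `format_summary_simple` is hypothetical, and the regex is a rough stand-in that will mis-split abbreviations such as "e.g." where NLTK would not.

```python
import re


def format_summary_simple(summary: str) -> str:
    """Put each sentence of `summary` on its own line (regex stand-in for sent_tokenize)."""
    # Split on whitespace that directly follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    # Drop empty pieces (for example when the input is an empty string).
    return "\n".join(s for s in sentences if s)


example = "Artistoo runs in the browser. No installation is needed! Does it work on phones?"
print(format_summary_simple(example))
```

For real summaries, the NLTK-based `format_summary` above is the safer choice; this version is only meant to show what the formatting step does.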


In [None]:
!pip install langchain openai tiktoken transformers accelerate cohere --quiet # install libraries

**Libraries explained**:

- **langchain**: This is a Python library designed to facilitate working with large language models (LLMs). It provides tools and functionalities to make it easier to interact with models like those provided by OpenAI (GPT-3, for example), making it useful for tasks like text summarization, question-answering, and more.

- **openai**: This is the official Python library provided by OpenAI. It's used to interact with the OpenAI API, which provides access to models like GPT-3.5 or 4. This library is essential if we're planning to integrate OpenAI's powerful language models into our application.

- **tiktoken**: This is OpenAI's fast byte-pair-encoding (BPE) tokenizer library. It is typically used to split text into tokens and count them for OpenAI models, for example to check that a prompt fits within a model's context window. Tokenization is a fundamental step in NLP where text is broken down into smaller units such as words or subword pieces.

- **transformers**: Developed by Hugging Face, the transformers library offers a vast collection of pre-trained models like BERT, GPT, T5, etc., for various NLP tasks. It's a highly versatile library, enabling tasks like text classification, summarization, translation, and more.

- **accelerate**: Also from Hugging Face, accelerate is used to simplify and accelerate training and inference processes for deep learning models. It helps in easily running models on different hardware (CPUs, GPUs) with minimal changes in the code.

- **cohere**: Cohere is another AI platform similar to OpenAI, providing large language models for various tasks. The cohere library is their Python SDK, allowing easy interaction with Cohere's models for NLP tasks.
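
To illustrate what "tokenization" means in the list above, here is a minimal sketch that splits text into word and punctuation tokens using only the standard library. This is an assumption-laden simplification: `simple_tokenize` is a hypothetical helper, and real tokenizers such as tiktoken use byte-pair encoding and produce subword pieces, so their token counts will differ.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (illustration only, not BPE)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Summarize this: Artistoo runs in the browser!")
print(tokens)
print(len(tokens))
```

Parameters such as `max_length` are counted in tokens like these (in the model's own vocabulary), not in characters.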

In [5]:
from langchain import HuggingFaceHub

I'm using a pre-trained model from the Hugging Face model repository, specifically the **"facebook/bart-large-cnn"** model.


- **BART (Bidirectional and Auto-Regressive Transformers)**: It's a powerful and versatile NLP model capable of handling various tasks, including text summarization. The model is based on the Transformer architecture, which has significantly advanced the field of natural language processing.

- **Large-CNN Variant**: The '**large-cnn**' variant is the large BART checkpoint fine-tuned on the CNN/Daily Mail news dataset for summarization. It's optimized to generate summaries that are both coherent and closely aligned with the main points of the input text.

This should help us use the BART large CNN model from Hugging Face for summarization without encountering the validation error.

Source: https://huggingface.co/blog/Andyrasika/agent-helper-langchain-hf

Source: https://huggingface.co/docs/hub/en/security-tokens


In [6]:
huggingface_api_token = 'your_huggingface_token'  # create a Hugging Face account and generate an access token

# HuggingFaceHub: wrapper used to build the summarizer object
# repo_id: repository where the model lives (here facebook/bart-large-cnn)
# model_kwargs: keyword arguments passed to the model -> temperature and max_length

summarizer = HuggingFaceHub(repo_id="facebook/bart-large-cnn",  # powerful model for summarization tasks
                            model_kwargs={"temperature": 0, "max_length": 180},  # temperature 0 aims for no randomness; max_length caps the summary at 180 tokens (not characters)
                            huggingfacehub_api_token=huggingface_api_token)  # token

def summarize(llm, text) -> str:
    return llm.invoke(f"Summarize this: {text}!")

#def summarize(llm, text) -> str:
    #return llm(f"Summarize this: {text}!").summarize()
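
The `summarize` helper only needs an object with an `invoke` method, so it can be checked without a token or a network call. The sketch below uses a hypothetical stand-in class, `EchoLLM` (not part of LangChain), that records the prompt it receives and returns a canned reply; `summarize` is repeated from the cell above so the example runs on its own.

```python
class EchoLLM:
    """Hypothetical stand-in for an LLM: stores the prompt and returns a fixed reply."""

    def __init__(self, reply: str):
        self.reply = reply
        self.last_prompt = None

    def invoke(self, prompt: str) -> str:
        # Remember what was asked so the prompt can be inspected afterwards.
        self.last_prompt = prompt
        return self.reply


def summarize(llm, text) -> str:
    # Same helper as in the notebook cell, repeated to keep this sketch self-contained.
    return llm.invoke(f"Summarize this: {text}!")


fake_llm = EchoLLM(reply="A short summary.")
result = summarize(fake_llm, "Some long document text")

print(fake_llm.last_prompt)  # the exact prompt that would be sent to the model
print(result)
```

Inspecting `last_prompt` this way makes it easy to see that the document text is wrapped as `Summarize this: ...!` before it reaches the model.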



In [9]:
file_path = ''  # file path -> /content/Data_Curation_KO.docx

if file_path.endswith('.docx'): # word document
    text = read_docx(file_path)
elif file_path.endswith('.pdf'): # pdf file
    text = read_pdf(file_path)
elif file_path.endswith('.md'): # markdown file
    text = read_markdown(file_path)
else:
    text =  '''
    As mentioned before, this tool is built entirely in JavaScript.
    Although interpreted languages like JavaScript have classically been deemed too inefficient for running simulations, the creators found that this no longer holds: investments by major tech companies have tremendously improved JavaScript engines over the past years, to the point that our CPM now has no major performance disadvantage compared to existing C++ frameworks.
    The JavaScript implementation of Artistoo opens new possibilities for rapid and low barrier sharing of CPM simulations with students, collaborators, and readers or reviewers of a paper.
    Unlike existing frameworks, Artistoo allows building simulations that run in the web browser without the need to install any software. Artistoo models run on any platform providing a standards-compliant web browser – be it a desktop computer, a tablet, or a mobile phone. These simulations can be published on any web server or saved locally and do not rely on any back-end servers being available. They can be made explorable, enabling viewers to interact with the simulation and see the effect of changing model parameters in real time.
    Artistoo is a JavaScript library implemented as an ECMAScript 6 module, which can be loaded into an HTML page or accessed from within a Node.js command line application.
    '''
# fallback text, used when the file is not a docx, pdf, or markdown file
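
The extension checks above can also be written as a lookup table, which makes it easier to add formats and to handle upper-case extensions such as `.MD`. This is a minimal sketch, not the notebook's actual code: `load_text` and `READERS` are hypothetical names, and only `read_markdown` is registered here so the example runs without `python-docx` or PyMuPDF installed.

```python
import tempfile
from pathlib import Path


def read_markdown(file_path):
    # Same as the notebook's helper, repeated to keep this sketch self-contained.
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()


def load_text(file_path, readers, fallback_text=""):
    """Pick a reader by file extension; return fallback_text for unknown extensions."""
    suffix = Path(file_path).suffix.lower()  # ".MD" and ".md" are treated the same
    reader = readers.get(suffix)
    if reader is None:
        return fallback_text
    return reader(file_path)


# In the notebook this mapping would also hold ".docx": read_docx and ".pdf": read_pdf.
READERS = {".md": read_markdown}

with tempfile.TemporaryDirectory() as tmp_dir:
    md_path = Path(tmp_dir) / "NOTES.MD"
    md_path.write_text("# Title\nSome notes.", encoding="utf-8")
    loaded = load_text(str(md_path), READERS)

# An empty path has no extension, so the fallback text is used.
fallback = load_text("", READERS, fallback_text="pasted text")

print(loaded)
print(fallback)
```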

Example use: summarizing an arbitrary excerpt from the "Artistoo" knowledge object.

## Results:

Observation: maybe I can increase the "max_length" parameter to generate longer summaries. However, a large "max_length" might lead to redundant or less coherent summaries (I haven't tested this yet).

In [10]:
summary = summarize(summarizer, text)
formatted_summary = format_summary(summary) # formatting results

print(formatted_summary)

Artistoo allows building simulations that run in the web browser without the need to install any software.
Artistoo models run on any platform providing a standards-compliant web browser – be it a desktop computer, a tablet, or a mobile phone.
The JavaScript implementation of Artistoo opens new possibilities for rapid and low barrier sharing of CPM simulations.


## RAG + OpenAI

In [None]:
!pip install pymupdf -q -U
# To be continued...