# **Course Supplement:**

# Exploring LangChain with Scientific Article Summarization

## I. Brief Explanation of Each Library:

- **LangChain** (`langchain`): A framework designed for developing applications that use language models (like GPT). It enables tasks such as document retrieval, question answering, and agent-based interactions.

- **Unstructured** (`unstructured`): A library that helps process unstructured data (like text, PDFs, etc.) by extracting and converting it into structured formats for further analysis.

- **OpenAI** (`openai`): The official Python client for OpenAI's API, allowing interaction with models like GPT-3 or GPT-4 for tasks like text generation, summarization, and conversation.

- **ChromaDB** (`chromadb`): A database focused on vector embeddings, typically used in machine learning for efficient similarity searches, making it useful for handling high-dimensional data like text embeddings.

- **Cython** (`Cython`): A compiler and Python-like language for writing C extensions for Python. It can improve the performance of Python code by compiling it to C.

- **Tiktoken** (`tiktoken`): The tokenizer library for OpenAI models. It splits text into tokens, which lets you count tokens and keep inputs within a model's length limit.

Each of these libraries is key in building and optimizing AI applications.
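To make the idea of similarity search over embeddings concrete, here is a minimal pure-Python sketch. It is not the ChromaDB API: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `most_similar` are hypothetical helper names chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_a * norm_b)

def most_similar(query, store, k=1):
    """Return the k document ids whose vectors are closest to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (assumed values, for illustration only)
store = {
    "survival_analysis": [0.9, 0.1, 0.0],
    "image_processing": [0.0, 0.2, 0.9],
    "cure_rate_models": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
print(most_similar(query, store, k=2))
```

A vector database performs the same ranking, but with indexing structures that avoid comparing the query against every stored vector.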

---

## When Do We Use LangChain?

LangChain is used when building applications that involve large language models (LLMs) and require complex workflows or integrations with external systems. You would typically use LangChain in the following scenarios:

- **Document Retrieval and QA Systems**: When you want to build applications that can search large document databases and answer questions based on retrieved information, LangChain helps combine retrieval and generation tasks.

- **Multi-step AI Agents**: LangChain enables the creation of agents that can perform a series of tasks using LLMs, such as reasoning, planning, and decision-making.

- **Customizable Chatbots**: If you need a chatbot with advanced features like memory (context retention over multiple interactions) or integrations with APIs (for pulling live data), LangChain provides an infrastructure for that.

- **Tool-augmented LLMs**: When you need an LLM to use external tools like calculators, databases, or APIs in response to user queries, LangChain can coordinate the model's interaction with these tools.

- **Workflow Automation**: LangChain is useful for orchestrating complex workflows where LLMs interact with multiple systems, such as combining web scraping, summarization, and database storage in a pipeline.

LangChain simplifies and scales the development of LLM-driven applications.
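As an illustration of the tool-augmented scenario, the sketch below routes a query to a tool in plain Python. It does not use LangChain's API: `fake_llm_choose_tool` is a hypothetical keyword-based stand-in for the decision a real LLM would make, and the two tools are assumptions made up for this example.

```python
def calculator(expression: str) -> str:
    """Toy calculator tool: adds whitespace-separated numbers like '2 3 4'."""
    return str(sum(float(token) for token in expression.split()))

def word_count(text: str) -> str:
    """Toy text tool: counts whitespace-separated words."""
    return str(len(text.split()))

# Registry mapping tool names to callables (hypothetical names)
TOOLS = {"calculator": calculator, "word_count": word_count}

def fake_llm_choose_tool(query: str) -> str:
    """Stand-in for an LLM: picks a tool name from the first word of the query."""
    return "calculator" if query.lower().startswith("add") else "word_count"

def run_agent(query: str) -> str:
    tool_name = fake_llm_choose_tool(query)
    # Everything after the first word is passed to the tool as its input
    _, _, tool_input = query.partition(" ")
    return f"{tool_name}: {TOOLS[tool_name](tool_input)}"

print(run_agent("add 2 3 4.5"))
print(run_agent("count the words in this sentence"))
```

In a real application the routing decision comes from the model, and the framework takes care of passing tool inputs and outputs back and forth.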

---

## LangChain for Complex Tasks

LangChain is designed to handle **complex workflows** where multiple tasks need to be integrated into a single program. It simplifies the process of combining different components—like language model outputs, data retrieval, and tool usage—into one cohesive application.

For example, if you want to:
- Retrieve data from a database.
- Use an LLM to analyze or summarize that data.
- Interact with an external API.
- Store results in a different system.

LangChain provides the framework to tie all of these steps together seamlessly, allowing for complex task orchestration, while letting the LLM guide the process at each stage. It enables you to build intelligent applications that involve more than just text generation, making it ideal for tasks that require **multi-step reasoning**, interaction with external systems, or **dynamic behavior**.
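The retrieve, summarize, and store steps listed above can be sketched in plain Python to show the shape of such a pipeline. This is not LangChain code: `fake_summarize` is a hypothetical stand-in for an LLM call, the in-memory dictionaries stand in for real databases, and the external API step is left out for brevity.

```python
from typing import Callable, Dict, List

def retrieve(doc_id: str, database: Dict[str, str]) -> str:
    """Step 1: fetch the raw text of a document from a (toy) database."""
    return database[doc_id]

def fake_summarize(text: str, max_words: int = 8) -> str:
    """Step 2 stand-in for an LLM call: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])

def run_pipeline(
    doc_ids: List[str],
    database: Dict[str, str],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Chain the steps and store each summary in a results dictionary."""
    results: Dict[str, str] = {}
    for doc_id in doc_ids:
        text = retrieve(doc_id, database)
        results[doc_id] = summarize(text)  # Step 3: store the result
    return results

# Toy database with one made-up entry
database = {
    "paper1": "We introduce a generalization of the Gompertz distribution "
              "for cure rate modeling.",
}
print(run_pipeline(["paper1"], database, fake_summarize))
```

Passing the summarizer in as a function keeps the pipeline independent of any particular model, so a real LLM call could replace `fake_summarize` without touching the other steps.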


In [None]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken





ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.29.76 requires urllib3<1.27,>=1.25.4, but you have urllib3 2.2.3 which is incompatible.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.1.11 requires tenacity<9.0.0,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.
langchain-community 0.0.27 requires tenacity<9.0.0,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.
langchain-core 0.1.30 requires tenacity<9.0.0,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.










In [None]:
!pip install transformers



In [None]:
!pip install PyPDF2



In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

In [None]:
import openai

In [None]:
# Get your API key from OpenAI; you will need to create an account.
# Here is the link to get the keys: https://platform.openai.com/account/billing/overview
import os
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'  # placeholder: paste your own key, never commit a real one

In [None]:
import os
import openai
from transformers import pipeline
import PyPDF2

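# Read the OpenAI API key from the environment instead of hardcoding it.
# Note: the Hugging Face pipeline below runs locally and does not use this key.
openai.api_key = os.environ.get("OPENAI_API_KEY")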

# Load the summarization pipeline
summarizer = pipeline("summarization")

# Function to generate summary for a PDF
def generate_summary(pdf_path, chunk_size=1000, max_length=50, min_length=10):
    with open(pdf_path, 'rb') as file:
        pdf_text = ""
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            pdf_text += page.extract_text()

        # Split text into chunks
        chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

        # Generate summaries for each chunk
        summaries = []
        for chunk in chunks:
            summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
            summaries.append(summary[0]['summary_text'])

        return "\n".join(summaries)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [None]:
# we can use this chunk if we have multiple PDFs
# if __name__ == '__main__':
#     pdf_dir = '/content/drive/MyDrive/MangrovePDFs'

#     for filename in os.listdir(pdf_dir):
#         if filename.lower().endswith('.pdf'):
#             pdf_path = os.path.join(pdf_dir, filename)
#             pdf_summary = generate_summary(pdf_path)

#             print(f"PDF: {filename}\nSummary:\n{pdf_summary}\n{'=' * 30}")
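The `generate_summary` function above slices the text every `chunk_size` characters, which can cut a word or sentence in half. One possible alternative, sketched here as an assumption rather than as part of the original notebook, is to split on whitespace and let consecutive chunks share a few words. `chunk_words` and its parameters are hypothetical names.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that context is not
    lost at chunk boundaries.
    """
    if overlap < 0 or overlap >= chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Counting words is only a rough proxy for model tokens, so the chunk size still has to stay well under the summarization model's input limit.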

In [None]:
import PyPDF2

In [None]:
def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
    return text

pdf_paths = ['/content/drive/MyDrive/The Marshall Olkin generalized defective Gompertz distribution for surviving fraction modeling.pdf']
all_texts = [extract_text_from_pdf(path) for path in pdf_paths]

# Summarize each PDF and label the output with its own file name
for pdf_path in pdf_paths:
    filename = os.path.basename(pdf_path)
    pdf_summary = generate_summary(pdf_path)
    print(f"PDF: {filename}\nSummary:\n{pdf_summary}\n{'=' * 30}")


PDF: The Marshall Olkin generalized defective Gompertz distribution for surviving fraction modeling.pdf
Summary:
 The Marshall–Olkin generalized defective Gompertz distribution for surviving fraction modeling, Communications in Statistics - Simulation and Computation .
 In this article, we introduce a three-level generalization of the Gompertz distribution for cure rate modeling . The main advantage of this new distribution is that it has an increasing, decreasing, constant, and bathtub-shaped
 According to the World Health Organization, 50% of patients diagnosed with cancer survive the disease . The increase in the likelihood of survival is due to the positive effect of effective treatments and the major function of the immune system .
 A general class of non-linear transformation models has been introduced by Tsodikov, Ibrahim, and Yakovlev . The survival function in this class is assumed to be given by g(S(t)),
 A non-linear transformation cure rate model was proposed by Balakrishnan and Milienos . The advant