<a href="https://colab.research.google.com/github/vinay18-irpanwar/Google-Colab/blob/main/pdf_load_text_splitter_Student_Task_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


---

## 🧩 **Problem Statement: Translate a PDF Story into Hindi Using LangChain**

---

### 📌 Task:

You are given a PDF file containing an English story. Your goal is to:

1. **Upload and read the PDF file** using `PyMuPDFLoader` from LangChain.
   - PDF file link: [Click here](https://drive.google.com/file/d/1hNkcxV4-T5I-wdejnnmKKjU1oinLdGnw/view?usp=sharing)
   - Download the PDF and store it locally.
2. **Extract text** from all pages of the PDF.
3. Use **RecursiveCharacterTextSplitter** to divide the text into **chunks of 30 characters** (no overlap).
4. For each chunk, use an **LLM** (such as OpenAI's ChatGPT via `ChatOpenAI`) to **translate the chunk into Hindi**.
5. **Display the original English chunk and its Hindi translation side by side.**

---

### 🧪 Input:

* A PDF file (e.g., `sample_story.pdf`) containing a short English story.

---

### ✅ Output:

* Print a list of **translated Hindi chunks**, one per line.
* Each line should include the **original English chunk** and the **translated Hindi output**.

Example:

```
Chunk 1 (EN): Once upon a time in a small vi
Translation (HI): एक समय की बात है, एक छोटे से गा

Chunk 2 (EN): llage nestled between the hill
Translation (HI): गांव जो पहाड़ियों के बीच बसा
```
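
The layout shown in the example can be produced with a small formatting helper. This is a minimal sketch, assuming the chunks and their translations are already available as two equal-length lists; the name `format_pairs` is hypothetical and not part of the task statement.

```python
def format_pairs(chunks, translations):
    """Build the 'Chunk N (EN)' / 'Translation (HI)' report as one string."""
    if len(chunks) != len(translations):
        raise ValueError("chunks and translations must have the same length")
    blocks = []
    # Number the pairs from 1 to match the example layout.
    for index, (english, hindi) in enumerate(zip(chunks, translations), start=1):
        blocks.append(f"Chunk {index} (EN): {english}\nTranslation (HI): {hindi}")
    # A blank line separates consecutive pairs.
    return "\n\n".join(blocks)


print(format_pairs(["Once upon a time"], ["एक समय की बात है"]))
```

Keeping the formatting separate from the translation step means the same helper works whether the translations come from one request or from one request per chunk.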

---

### 🔧 Tools You Must Use:

* `langchain.document_loaders.PyMuPDFLoader` for reading the PDF
* `langchain.text_splitter.RecursiveCharacterTextSplitter`
* `langchain.chat_models.ChatOpenAI` (or any supported LLM)

---

### 📎 Sample File:

You can use this file for testing:
📄 [Download `sample_story.pdf`](sandbox:/mnt/data/sample_story.pdf)

---



In [2]:
!pip install langchain langchain_google_genai  langchain_community

Collecting langchain_google_genai
  Using cached langchain_google_genai-3.1.0-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain_community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain_google_genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting google-ai-generativelanguage<1.0.0,>=0.9.0 (from langchain_google_genai)
  Downloading google_ai_generativelanguage-0.9.0-py3-none-any.whl.metadata (10 kB)
INFO: pip is looking at multiple versions of langchain-google-genai to determine which version is compatible with other requirements. This could take a while.
Collecting langchain_google_genai
  Downloading langchain_google_genai-3.0.3-py3-none-any.whl.metadata (2.7 kB)
  Downloading langchain_google_genai-3.0.2-py3-none-any.whl.metadata (2.7 kB)
  Downloading langchain_google_genai-3.0.1-py3-none-any.whl.metadata (7.1 kB)
  Downloading langchain_google_genai-3.0.0-py3-none-any

opening file

In [3]:
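from google.colab import files

# Open Colab's upload dialog; the result maps each uploaded file name to its bytes.
upload = files.upload()

# Use the first uploaded file as the PDF to process.
file_name = list(upload.keys())[0]
print(file_name)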

Saving story.pdf to story.pdf
story.pdf


loading pdf

In [4]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.6-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.6-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.6


In [12]:
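from langchain_community.document_loaders import PyMuPDFLoader

# By default, load() returns one Document per PDF page.
content = PyMuPDFLoader(file_name).load()
list_content = [doc.page_content for doc in content]
# Join pages with a newline so words at page boundaries do not run together.
str_content = "\n".join(list_content)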
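
Text extracted from a PDF may keep the line breaks of the page layout, and the splitter prefers to break at newlines before it breaks at spaces. If more even chunks are wanted, one option is to collapse whitespace before splitting. This is a sketch under that assumption, not a step the task requires; `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and newlines with a single space."""
    # str.split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(text.split())


sample = "little girl named\nMeera.  She loved asking"
print(normalize_whitespace(sample))
```

Applying this to the extracted text changes where chunks break, so the chunk list printed in this notebook would no longer match exactly.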

chunking

In [20]:
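from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size is the maximum chunk length in characters; chunk_overlap=0 disables overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=0)
chunks = splitter.split_text(str_content)
print(chunks)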

['Once upon a time in a small', 'village nestled between the', 'hills, there lived a curious', 'little girl named', 'Meera. She loved asking', 'questions and exploring the', 'woods near her home.', 'Every morning, Meera would', 'set out on a new adventure,', 'talking to birds and watching', 'squirrels', 'dance on tree branches. Her', 'favorite story was about a', 'secret tree that whispered', 'ancient tales', 'when the wind blew.', 'One day, while following a', 'butterfly, she stumbled upon', 'an old, hollow tree. As she', 'peeked inside,', 'she found a glowing stone and', 'a tiny scroll. The scroll', 'read: "You have found the', 'heart of the forest."', 'From that day on, Meera was', 'known as the storyteller of', 'the village, passing on the', 'wisdom of the', 'forest to everyone she met.']
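
None of the chunks above is exactly 30 characters long, and none ends mid-word. `chunk_size` is an upper bound: by default `RecursiveCharacterTextSplitter` tries paragraph breaks, then line breaks, then spaces, and only splits inside a word as a last resort. The following is a simplified stand-in that packs whole words greedily, written only to illustrate the idea. It is not LangChain's actual algorithm, and `greedy_word_chunks` is a hypothetical name.

```python
def greedy_word_chunks(text: str, chunk_size: int) -> list[str]:
    """Pack whole words into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= chunk_size or not current:
            # The word fits, or it starts a new chunk (overlong words are kept whole).
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


line = ("Once upon a time in a small village nestled between the "
        "hills, there lived a curious")
print(greedy_word_chunks(line, 30))
# ['Once upon a time in a small', 'village nestled between the', 'hills, there lived a curious']
```

On this sentence the sketch reproduces the first three chunks printed above, each 27 or 28 characters long.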


LLM

In [14]:
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata

# The API key is read from a Colab secret named 'lang'.
LLM = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=userdata.get('lang')
)
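
The task statement asks for one translation per chunk. A minimal sketch of that loop follows, assuming the translator is passed in as a plain callable so the code can run without an API key. `translate_chunks` and the stand-in translator are hypothetical names, and the commented-out prompt wording is only one possible choice.

```python
from typing import Callable


def translate_chunks(
    chunks: list[str], translate: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Return (original, translation) pairs, calling translate once per chunk."""
    pairs = []
    for chunk in chunks:
        pairs.append((chunk, translate(chunk).strip()))
    return pairs


# Stand-in translator so the sketch runs offline; it only tags the text.
demo = translate_chunks(["woods near her home."], lambda text: f"[HI] {text}")
print(demo)

# In the notebook, the callable could wrap the model created above, for example:
# translate = lambda text: LLM.invoke(
#     f"Translate this English text into Hindi. Reply with the translation only:\n{text}"
# ).content
```

This approach sends one request per chunk, so it is slower than a single combined request, and 30-character fragments translated in isolation may read less naturally than a translation that sees the whole story.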

In [16]:
# Send all chunks in a single request and ask for each chunk alongside its Hindi translation.
res = LLM.invoke(f"""Translate each chunk below into Hindi.
Display the original English chunk and its Hindi translation side by side.
Print one line per chunk.
Each line should include the original English chunk and the translated Hindi output.
Chunks: {chunks}""")
print(res.content)

*   **Once upon a time in a small** | एक समय की बात है एक छोटे से
*   **village nestled between the** | गाँव में, जो पहाड़ों के बीच बसा था,
*   **hills, there lived a curious** | एक जिज्ञासु लड़की रहती थी
*   **little girl named** | जिसका नाम था
*   **Meera. She loved asking** | मीरा। उसे सवाल पूछना और
*   **questions and exploring the** | जंगल खोजना बहुत पसंद था
*   **woods near her home.** | अपने घर के पास।
*   **Every morning, Meera would** | हर सुबह, मीरा एक नए
*   **set out on a new adventure,** | रोमांच पर निकल पड़ती,
*   **talking to birds and watching** | पक्षियों से बातें करती और देखती
*   **squirrels** | गिल