!pip install torch transformers sentencepiece


from transformers import MarianMTModel, MarianTokenizer

model_name_mul_en = "Helsinki-NLP/opus-mt-mul-en"

tokenizer_mul_en = MarianTokenizer.from_pretrained(model_name_mul_en)
model_mul_en = MarianMTModel.from_pretrained(model_name_mul_en)

languages = {
    "hi": "Hindi",
    "bn": "Bengali",
    "ta": "Tamil",
    "te": "Telugu",
    "mr": "Marathi",
    "gu": "Gujarati",
    "pa": "Punjabi",
    "ml": "Malayalam",
    "kn": "Kannada",
    "or": "Odia",
    "as": "Assamese",
    "ur": "Urdu"
}

print("Supported Indian Languages:")
for code, name in languages.items():
    print(f"{code} - {name}")

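# Read the language code and the text; strip stray whitespace from both.
lang_code = input("\nEnter the language code (e.g., hi, bn): ").strip().lower()
text_to_translate = input("Enter the Indian language text: ").strip()

# The code is only checked against the supported list; the mul-en model
# does not need a source-language tag.
if lang_code in languages:
    # Tokenize to PyTorch tensors, generate English token ids, then decode them.
    inputs = tokenizer_mul_en(text_to_translate, return_tensors="pt", padding=True)
    translated_tokens = model_mul_en.generate(**inputs)
    output_text = tokenizer_mul_en.decode(translated_tokens[0], skip_special_tokens=True)
    print(f"\nTranslated Text (English): {output_text}")
else:
    print("\nError: Unsupported language code.")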

Supported Indian Languages:
hi - Hindi
bn - Bengali
ta - Tamil
te - Telugu
mr - Marathi
gu - Gujarati
pa - Punjabi
ml - Malayalam
kn - Kannada
or - Odia
as - Assamese
ur - Urdu



Enter the language code (e.g., hi, bn):  hi
Enter the Indian language text:  विविधता में एकता, भारत की सच्ची ताकत।



Translated Text (English): Unity in diversity, the real power of India.


Indian Language to English Translation System (NLP Project)

1. Problem Definition & Objective

a. Selected Project Track

Natural Language Processing (NLP) – Machine Translation

b. Problem Statement

India is a linguistically diverse country with multiple regional languages. However, English remains the dominant language for education, administration, and digital content. This creates a communication gap for users who are more comfortable using regional Indian languages. The problem addressed in this project is to build an automated system that translates text from multiple Indian languages into English accurately and efficiently.

c. Real‑World Relevance & Motivation

Helps non‑English speakers access English content
Useful in education, government services, and digital platforms
Demonstrates real‑world use of NLP and pretrained transformer models




2. Data Understanding & Preparation

a. Dataset Source

This project does not use a static dataset of its own. It relies on a pretrained multilingual translation model that was trained on large‑scale parallel corpora from the OPUS collection.
Model: Helsinki-NLP/opus-mt-mul-en
Source: Hugging Face Model Hub
Data Type: Public multilingual parallel text data


b. Data Loading & Exploration

User‑provided text acts as real‑time input data. The system accepts sentences written in supported Indian languages and processes them dynamically.

Supported languages include:

Hindi (hi)
Bengali (bn)
Tamil (ta)
Telugu (te)
Marathi (mr)
Gujarati (gu)
Punjabi (pa)
Malayalam (ml)
Kannada (kn)
Odia (or)
Assamese (as)
Urdu (ur)


c. Cleaning, Preprocessing & Feature Engineering

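Tokenization is handled by MarianTokenizer
Sub‑word segmentation is automatic (SentencePiece, which is why the sentencepiece package is installed)
Padding and tensor conversion are done by the tokenizer call (padding=True, return_tensors="pt")
No manual feature engineering is required, because the model works directly on token ids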


d. Handling Missing Values or Noise

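Unsupported language codes are rejected with an error message; the text itself is not checked, so empty input is passed to the model as‑is
The pretrained model tolerates noisy text to some extent; this was not measured in this project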
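
A minimal sketch of how input checks could be made explicit before translation. The function name clean_request and its error messages are hypothetical and not part of the notebook; it assumes the supported codes are given as a dict like the languages dict in the code cell.

```python
def clean_request(lang_code, text, supported):
    """Normalize user input; raise ValueError if it cannot be translated."""
    code = lang_code.strip().lower()
    # Collapse runs of whitespace (including newlines) into single spaces.
    cleaned = " ".join(text.split())
    if code not in supported:
        raise ValueError(f"Unsupported language code: {code!r}")
    if not cleaned:
        raise ValueError("Input text is empty.")
    return code, cleaned
```

In the notebook this could be called as clean_request(lang_code, text_to_translate, languages) inside a try/except block that prints the error message.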


3. Model / System Design

a. AI Technique Used

Natural Language Processing (NLP) – Transformer‑based Neural Machine Translation

b. Architecture / Pipeline Explanation

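1. User enters a language code, which is checked against the supported list (the mul‑en model itself needs no source‑language tag)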

2. User inputs text in the selected language

3. Text is tokenized using MarianTokenizer

4. Tokens are passed to MarianMTModel

5. Model generates English translation

6. Output text is decoded and displayed
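
Steps 3 to 6 can be wrapped in one reusable function. The sketch below is an illustration, not the notebook's own code: the name translate_batch is hypothetical, and it assumes the tokenizer and model objects behave like the MarianTokenizer and MarianMTModel instances loaded in the code cell. Passing a list of sentences lets padding=True pad the batch to a common length.

```python
def translate_batch(texts, tokenizer, model):
    """Translate a list of sentences to English (pipeline steps 3 to 6)."""
    # Step 3: tokenize; padding=True pads all sentences to the same length.
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    # Steps 4 and 5: the model generates English token ids for each sentence.
    generated = model.generate(**inputs)
    # Step 6: decode each generated sequence back into text.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in generated]
```

With the objects from the notebook, a call would look like translate_batch([text_to_translate], tokenizer_mul_en, model_mul_en), which returns a list with one English string per input sentence.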



c. Justification of Design Choices

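Transformer models are the established architecture for neural machine translation
Pretrained models avoid the cost and time of training from scratch
A single MarianMT checkpoint (opus-mt-mul-en) handles multiple source languages, so no per‑language model is needed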



4. Core Implementation

a. Model Inference Logic

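The project uses a pretrained MarianMT model for inference only (no training). The model generates the translation with beam search decoding; because generate() is called without extra arguments, the checkpoint's default generation settings apply.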
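
The decoding settings can also be passed explicitly instead of relying on the checkpoint defaults. The sketch below assumes a model with the standard Transformers generate() method; the wrapper name generate_with_beams is hypothetical, and the values 4 and 128 are illustrative choices, not settings taken from the notebook.

```python
def generate_with_beams(model, inputs, num_beams=4, max_new_tokens=128):
    """Run beam search with explicit settings instead of checkpoint defaults."""
    return model.generate(
        **inputs,
        num_beams=num_beams,            # number of hypotheses kept per step (assumed value)
        max_new_tokens=max_new_tokens,  # cap on generated length (assumed value)
        early_stopping=True,            # stop once enough finished beams exist
    )
```

A wider beam usually costs more time per sentence, so any change is worth checking against a few known translations first.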

b. Prompt Engineering

Not applicable, as this project uses a pretrained translation model rather than an LLM with prompt‑based interaction.

c. Prediction Pipeline

Input → Tokenization → Model Generation → Decoding → Output


d. Code Execution

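The notebook is designed to run top‑to‑bottom without errors once the required dependencies are installed. The first run downloads the model from the Hugging Face Hub, and the translation cell waits for two interactive inputs.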



5. Evaluation & Analysis

a. Metrics Used

Qualitative evaluation based on translation correctness
Manual comparison with expected English meaning


b. Sample Output

Input (Hindi): विविधता में एकता, भारत की सच्ची ताकत।

Output (English): Unity in diversity, the real power of India.

c. Performance Analysis & Limitations

Strengths:

Correct sentence‑level translation on the tested Hindi sample
Supports multiple Indian languages


Limitations:

One‑directional translation (to English only)
No numerical evaluation metrics such as BLEU
Performance depends on the quality of the model's training data
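
To address the missing numerical metric, a simplified BLEU score can be computed in plain Python. This is a sketch under stated assumptions: one reference per sentence, lowercased whitespace tokenization, and no smoothing. The function name simple_bleu is hypothetical, reference translations would have to be written by hand, and reported results would be better produced with an established tool such as sacreBLEU.

```python
import math
from collections import Counter


def _ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.lower().split(), ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = _ngrams(h, n), _ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(totals) == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize outputs shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Because there is no smoothing, very short sentences with no matching 4‑gram score 0, so the metric is more meaningful over a set of sentences than over a single one.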



6. Ethical Considerations & Responsible AI

a. Bias & Fairness

The model may perform better on high‑resource languages
Low‑resource languages may have reduced accuracy; the size of the gap was not measured in this project


b. Dataset Limitations

Training data is external and not fully transparent
Cultural nuances may not always be preserved


c. Responsible Use of AI

Intended for educational and assistive purposes
Should not be used for legal or critical decision‑making



7. Conclusion & Future Scope

a. Summary of Results

The project demonstrates a working multilingual translation system built on a pretrained transformer model. It translated the tested Hindi input into English correctly and shows a practical application of pretrained models.

b. Future Improvements

Add English → Indian language translation
Support paragraph and document translation
Integrate GUI or web application
Add quantitative evaluation metrics
Deploy as a web API or mobile app


Project Type: Academic NLP Mini Project
Tools Used: Python, Jupyter Notebook, Hugging Face Transformers
