# Language Translation


Language translation is the process of converting text or speech from one language to another while preserving its meaning and context. This involves using linguistic and computational techniques to interpret the source language and generate the equivalent content in the target language. Modern translation methods often employ machine learning models and neural networks for accuracy and fluency.



# We have two problems


# First Problem: Language Detection 

The first problem is detecting which language a given piece of text is written in. For this, you can use a small Python package called langdetect.


langdetect supports 55 languages, identified by the following codes:

af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
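
As a small illustration, the codes listed above can be kept in a set so that a detected or requested code can be validated before it is used. This is only a sketch: `SUPPORTED_CODES` and `is_supported` are hypothetical helper names, not part of langdetect.

```python
# Language codes copied from the list above.
# SUPPORTED_CODES and is_supported are hypothetical names, not langdetect APIs.
SUPPORTED_CODES = set("""
af ar bg bn ca cs cy da de el en es et fa fi fr gu he
hi hr hu id it ja kn ko lt lv mk ml mr ne nl no pa pl
pt ro ru sk sl so sq sv sw ta te th tl tr uk ur vi zh-cn zh-tw
""".split())


def is_supported(code: str) -> bool:
    """Return True if the code appears in the list above (case-insensitive)."""
    return code.strip().lower() in SUPPORTED_CODES


print(len(SUPPORTED_CODES))   # number of distinct codes in the list
print(is_supported("hi"))     # Hindi is in the list
print(is_supported("xx"))     # "xx" is not in the list
```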


pip install langdetect



# Second Problem: Language Translation


The second problem is translating text from one language into a language of your choice. One option is a Python package called google_trans_new.


google_trans_new is a free Python package that implements the Google Translate API and also performs automatic language detection.

You could install google_trans_new, but it has some issues, so we will use an alternative, googletrans, instead:


pip install googletrans==4.0.0-rc1





# Language Detection

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\samso\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\samso\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                 text language
0          मैं घर गया    Hindi
1         میں گھر گیا     Urdu
2          मी घर गेलो  Marathi
3         I went home  English
4  यह एक अच्छा दिन है    Hindi


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - 'C:\\Users\\samso/nltk_data'
    - 'c:\\Program Files\\Python311\\nltk_data'
    - 'c:\\Program Files\\Python311\\share\\nltk_data'
    - 'c:\\Program Files\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\samso\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [3]:
from langdetect import detect, detect_langs

sentence = "आज का दिन बहुत खूबसूरत है"

print("Detect Language:", detect(sentence))
print("Probability:", detect_langs(sentence))

Detect Language: hi
Probability: [hi:0.9999977479090416]


In [4]:
sentence = "it is very pleasant today"

print("Detect Language:", detect(sentence))
print("Probability:", detect_langs(sentence))

Detect Language: en
Probability: [en:0.9999970122854229]
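
`detect_langs` returns a list of candidates, each printed as `code:probability` as in the outputs above. A common follow-up is to accept the top candidate only when its probability clears a threshold. The sketch below assumes each candidate exposes `lang` and `prob` attributes; the `Candidate` stand-in, the `pick_language` helper, and the 0.9 threshold are illustrative assumptions, chosen so the example runs without langdetect installed.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Stand-in for one detect_langs result (assumed to expose lang and prob)."""
    lang: str
    prob: float


def pick_language(candidates: Sequence[Candidate], threshold: float = 0.9) -> Optional[str]:
    """Return the most probable language code, or None if it is below the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= threshold else None


confident = [Candidate("hi", 0.9999977479090416)]           # value from the output above
ambiguous = [Candidate("hi", 0.57), Candidate("mr", 0.43)]  # made-up probabilities

print(pick_language(confident))
print(pick_language(ambiguous))
```

Returning `None` for low-confidence input lets the caller decide what to do, for example asking the user for the language instead of guessing.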


# Language Translation

In [1]:
from googletrans import Translator

translator = Translator()
sentence = "آج کا دن بہت خوبصورت ہے"

translation = translator.translate(sentence, dest='en')
print("Original Sentence:", sentence)
print("Translated Sentence:", translation.text)


Original Sentence: آج کا دن بہت خوبصورت ہے
Translated Sentence: Today's day is so beautiful


In [None]:
pip install pandas scikit-learn nltk


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

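# Download NLTK resources (only need to run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize when NLTK reports "Resource punkt_tab not found"
nltk.download('stopwords')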

# Create the dataset
data = {
    'text': [
        'मैं घर गया',  # Hindi
        'میں گھر گیا',  # Urdu
        'मी घर गेलो',   # Marathi
        'I went home',   # English
        'यह एक अच्छा दिन है',  # Hindi
        'یہ ایک اچھا دن ہے',  # Urdu
        'हे एक चांगला दिवस आहे',  # Marathi
        'Today is sunny',  # English
        'मुझे किताबें पसंद हैं',  # Hindi
        'مجھے کتابیں پسند ہیں',  # Urdu
        'माझ्या आवडीच्या गोष्टींचा',  # Marathi
        'I love reading books'   # English
    ],
    'language': [
        'Hindi',
        'Urdu',
        'Marathi',
        'English',
        'Hindi',
        'Urdu',
        'Marathi',
        'English',
        'Hindi',
        'Urdu',
        'Marathi',
        'English'
    ]
}

# Creating a DataFrame
language_df = pd.DataFrame(data)

# Saving the DataFrame to a CSV file
language_df.to_csv('language_dataset.csv', index=False)

# Load the dataset
df = pd.read_csv('language_dataset.csv')

# Basic data exploration
print(df.head())

# Function for text preprocessing
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    
    # Define stop words for English
    english_stop_words = set(stopwords.words('english'))

    # Removing stopwords only for English text
    # You can create your own stop word lists for Hindi, Urdu, and Marathi if needed
    if text.isascii():  # Heuristic: ASCII-only text is treated as English
        tokens = [word for word in tokens if word.lower() not in english_stop_words]
    
    return ' '.join(tokens)

# Apply preprocessing to the text data
df['text'] = df['text'].apply(preprocess_text)

# Encoding the target labels
label_encoder = LabelEncoder()
df['language'] = label_encoder.fit_transform(df['language'])

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['language'], test_size=0.2, random_state=42)

# Vectorizing the text data
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Training a Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

# Making predictions
y_pred = model.predict(X_test_vectorized)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
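
The label-encoding step above turns language names into integers, so the model's predictions come back as numbers rather than names. A minimal pure-Python sketch of the same idea follows, assuming (as scikit-learn's `LabelEncoder` does) that classes are numbered in sorted order. `encode_labels` and `decode_labels` are hypothetical helpers, not scikit-learn functions.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, numbering classes in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def decode_labels(codes, classes):
    """Map integer codes back to the original label names."""
    return [classes[i] for i in codes]


labels = ['Hindi', 'Urdu', 'Marathi', 'English', 'Hindi']
codes, classes = encode_labels(labels)

print(classes)
print(codes)
print(decode_labels(codes, classes))
```

With the fitted encoder from the cell above, `label_encoder.inverse_transform(y_pred)` plays the role of `decode_labels` and gives readable language names for the predictions.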


In [2]:
sentence =  "आज का दिन बहुत खूबसूरत है"
translate_text = translator.translate(sentence, src='hi', dest='en')

print("Translated Text:", translate_text.text)

Translated Text: Today is very beautiful


# Detection and Translation Together

In [3]:
from langdetect import detect
from googletrans import Translator

def detect_and_translate(text, target_lang):
    # Detect language
    result_lang = detect(text)
    
    # Translate language
    translator = Translator()
    translate_text = translator.translate(text, dest=target_lang).text

    return result_lang, translate_text 

In [4]:
# Example 1
# Sample sentence in Hindi
sentence = "आज का दिन बहुत खूबसूरत है"

# Detect and translate
result_lang, translate_text = detect_and_translate(sentence, target_lang='en')

print("Language:", result_lang)
print("Translation:", translate_text)

Language: hi
Translation: Today is very beautiful


In [5]:
# Example 2
# Sample sentence in Urdu
sentence = "مجھے تم سے پیار ہے"

# Detect and translate
result_lang, translate_text = detect_and_translate(sentence, target_lang='en')

print("Language:", result_lang)
print("Translation:", translate_text)

Language: ur
Translation: I love you


In [6]:
# Example 3
# Sample sentence in English
sentence = "We are living in the era of technology."
# Detect and translate
result_lang, translate_text = detect_and_translate(sentence, target_lang='ur')

print("Language:", result_lang)
print("Translation:", translate_text)

Language: en
Translation: ہم ٹیکنالوجی کے دور میں رہ رہے ہیں۔
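
The `detect_and_translate` helper above always calls the translation service, even when the text is already in the target language. The sketch below is one possible variation, not the original method: it takes the detector and translator as arguments (so it can be exercised with stubs and no network access) and skips translation when the detected code matches the target. The names `detect_then_translate`, `fake_detect`, and `fake_translate` are hypothetical.

```python
from typing import Callable, Tuple


def detect_then_translate(
    text: str,
    target_lang: str,
    detect_fn: Callable[[str], str],
    translate_fn: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Detect the language of text and translate it only when needed."""
    source_lang = detect_fn(text)
    if source_lang == target_lang:
        # Already in the target language: return the text unchanged.
        return source_lang, text
    return source_lang, translate_fn(text, target_lang)


# Stubs for demonstration only; they stand in for the real detector and translator.
calls = []


def fake_detect(text: str) -> str:
    return "en" if text.isascii() else "ur"


def fake_translate(text: str, dest: str) -> str:
    calls.append((text, dest))  # record each translation request
    return f"<{dest}> {text}"


print(detect_then_translate("I went home", "en", fake_detect, fake_translate))
print(detect_then_translate("I went home", "ur", fake_detect, fake_translate))
```

With the real libraries, `detect_fn` could be `detect` from langdetect and `translate_fn` a small function wrapping `Translator().translate(text, dest=dest).text`, as in the cells above.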
