**Project Report: Language Detection System**  

## **1. Introduction**
Language detection is an important task in Natural Language Processing (NLP) that involves identifying the language of a given text. This project aims to build a language detection system that classifies text into multiple languages such as English, Hindi, Japanese, Korean, Chinese, French, and Spanish using a machine learning approach.

## **2. Objective**
To develop a machine learning-based **Language Detector** that predicts the language of a given text input with high accuracy.

## **3. Dataset Collection & Preprocessing**
- **Dataset:** Collected multilingual text data from open-source datasets and web-scraped resources.
- **Languages Covered:** 22 languages with 1,000 samples each (22,000 rows in total), including English, Hindi, Japanese, Korean, Chinese, French, and Spanish.
- **Preprocessing Steps:**  
  - Text Cleaning (removing special characters, numbers, and symbols).
  - Tokenization and normalization.
  - Removing stopwords.
  - Converting text to lowercase.
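
The cleaning and lowercasing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the helper name `clean_text` is hypothetical, and filtering by Unicode category is an assumption made here so that combining marks in scripts such as Devanagari or Thai are kept.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Lowercase the text and keep only letters, combining marks, and spaces."""
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        if category[0] in ("L", "M"):
            # L* = letters, M* = combining marks (vowel signs, virama, accents)
            kept.append(ch)
        elif ch.isspace():
            kept.append(" ")
        # Digits, punctuation, and symbols are dropped.
    # Collapse runs of whitespace left behind by the removed characters.
    return " ".join("".join(kept).split())


print(clean_text("Hello, World! 123"))
```

Stopword removal is left out of this sketch because it would need a separate word list for each language.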

## **4. Feature Engineering**
- Used **Bag-of-Words (BoW)** and **TF-IDF (Term Frequency-Inverse Document Frequency)** for text representation.
- Applied **CountVectorizer** from `sklearn.feature_extraction.text` to convert text into numerical features.

## **5. Model Selection**
- **Algorithm Chosen:** Naïve Bayes (MultinomialNB)
- **Reason:** Multinomial Naïve Bayes is a fast probabilistic classifier that models per-class word counts, which suits sparse, high-dimensional text features.
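
For reference, the textbook decision rule behind multinomial Naïve Bayes can be written as below. The symbols are introduced here for illustration and do not appear elsewhere in this report: $x_i$ is the count of term $w_i$ in the input text, $N_{ci}$ is the count of $w_i$ in training texts of language $c$, $N_c = \sum_i N_{ci}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter.

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{|V|} x_i \log P(w_i \mid c) \right],
\qquad
P(w_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha\,|V|}
$$

With $\alpha > 0$, a term never seen in language $c$ still gets a small non-zero probability, so a single unknown word does not rule a language out.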

## **6. Model Training & Evaluation**
- **Training the Model:**
  - Split the dataset into **80% training** and **20% testing** (`test_size=0.2`, `random_state=42`).
  - Applied **Multinomial Naïve Bayes (MultinomialNB)**.
  - Tuned hyperparameters using GridSearchCV.
- **Performance Metrics:**
  - **Accuracy:** Achieved about **95.3% accuracy** (0.953) on the held-out test set.
  - **Confusion Matrix:** Used to analyze misclassifications.
  - **Precision, Recall, and F1-score:** Evaluated for individual language classes.
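
The training and evaluation steps listed above might look roughly like the following sketch. It is not the project's tuning code: the toy sentences, the `alpha` grid, the 2-fold cross-validation, and the character n-gram analyzer are all assumptions chosen so the example runs on a handful of samples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six short sentences per language.
samples = {
    "English": ["the weather is nice today", "where is the train station",
                "i would like a cup of coffee", "she reads a book every evening",
                "we are going to the market", "this is my favourite song"],
    "French": ["le temps est beau aujourd'hui", "où se trouve la gare",
               "je voudrais une tasse de café", "elle lit un livre chaque soir",
               "nous allons au marché", "c'est ma chanson préférée"],
    "Spanish": ["el tiempo es agradable hoy", "dónde está la estación de tren",
                "quisiera una taza de café", "ella lee un libro cada noche",
                "vamos al mercado", "esta es mi canción favorita"],
}
texts = [t for sentences in samples.values() for t in sentences]
labels = [lang for lang, sentences in samples.items() for _ in sentences]

# Stratified hold-out split: 12 training and 6 test sentences.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=6, stratify=labels, random_state=42
)

# Vectorizer and classifier in one pipeline, so each CV fold refits the vocabulary.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("classifier", MultinomialNB()),
])

# Search over the smoothing parameter alpha (grid values are illustrative).
param_grid = {"classifier__alpha": [0.01, 0.1, 1.0]}
grid = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
class_names = sorted(samples)
cm = confusion_matrix(y_test, y_pred, labels=class_names)

print("Best alpha:", grid.best_params_["classifier__alpha"])
print(cm)  # rows = true language, columns = predicted language
print(classification_report(y_test, y_pred, labels=class_names, zero_division=0))
```

Off-diagonal entries in the confusion matrix show which language pairs are mixed up, and the report gives precision, recall, and F1-score for each language.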

## **7. Model Deployment (Optional)**
- Used **Flask/FastAPI** for deploying the model.
- Created a simple **web interface** where users can enter text and get the predicted language.
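
A deployment along these lines could be sketched with Flask as shown below. The route name `/predict`, the JSON field `text`, and the small stand-in model are assumptions made for illustration. A real service would load the vectorizer and classifier fitted on the full dataset rather than train on a few toy sentences.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in model trained on toy sentences so the sketch is self-contained.
train_texts = [
    "good morning how are you", "thank you very much",
    "bonjour comment allez vous", "merci beaucoup mon ami",
    "buenos días cómo estás", "muchas gracias amigo",
]
train_labels = ["English", "English", "French", "French", "Spanish", "Spanish"]
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Return the predicted language for the 'text' field of a JSON body."""
    payload = request.get_json(silent=True)
    text = payload.get("text", "") if isinstance(payload, dict) else ""
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "Field 'text' is required."}), 400
    language = model.predict([text])[0]
    # Convert the NumPy string to a plain str so it serializes cleanly.
    return jsonify({"language": str(language)})
```

For local experiments the app can be started with `app.run()`. A client would then send a POST request with a body such as `{"text": "good morning"}` and read the `language` field of the response.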

## **8. Conclusion**
The **Language Detection Model** was successfully built using **Naïve Bayes**, achieving **95% accuracy**. The system effectively classifies multilingual text and can be further improved by incorporating **deep learning models (LSTMs, Transformers like mBERT or XLM-R)** for better generalization.

## **9. Future Enhancements**
- Expand dataset with more languages.
- Improve accuracy using **deep learning** (LSTMs, CNNs, Transformers).
- Deploy as a **REST API** for real-world applications.




In [1]:
import numpy as np
import pandas as pd

In [5]:
data = pd.read_csv(r'E:\Data Set\csv file\language.csv')

In [7]:
data

| | Text | language |
|---|---|---|
| 0 | klement gottwaldi surnukeha palsameeriti ning ... | Estonian |
| 1 | sebes joseph pereira thomas på eng the jesuit... | Swedish |
| 2 | ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ... | Thai |
| 3 | விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர... | Tamil |
| 4 | de spons behoort tot het geslacht haliclona en... | Dutch |
| ... | ... | ... |
| 21995 | hors du terrain les années et sont des année... | French |
| 21996 | ใน พศ หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ... | Thai |
| 21997 | con motivo de la celebración del septuagésimoq... | Spanish |
| 21998 | 年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由... | Chinese |


In [9]:
data.shape

(22000, 2)

In [11]:
data['Text'][1]

'sebes joseph pereira thomas  på eng the jesuits and the sino-russian treaty of nerchinsk  the diary of thomas pereira bibliotheca instituti historici s i --   rome libris '

In [23]:
len(data['Text'][0])

339

In [39]:
data.isnull().sum()

Text        0
language    0
dtype: int64

In [47]:
data['language'].value_counts()

language
Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: count, dtype: int64

In [49]:
x = np.array(data['Text'])
y = np.array(data['language'])

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
cv = CountVectorizer()

In [62]:
X = cv.fit_transform(x)

In [66]:
X.shape

(22000, 277720)

In [68]:
from sklearn.model_selection import train_test_split

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [76]:
print(X_train)

  (0, 216154)	1
  (0, 218257)	1
  (0, 211749)	1
  (0, 214211)	1
  (0, 216513)	1
  (0, 209589)	1
  (0, 199364)	1
  (0, 215741)	1
  (0, 209737)	1
  (0, 204361)	1
  (0, 216065)	1
  (0, 208665)	1
  (0, 207654)	1
  (0, 203860)	2
  (0, 214064)	1
  (0, 205743)	1
  (0, 218389)	1
  (0, 211642)	1
  (0, 198851)	1
  (0, 198991)	1
  (0, 207243)	1
  (0, 207266)	1
  (0, 206847)	1
  (0, 214400)	1
  (0, 208895)	1
  :	:
  (17598, 188817)	1
  (17598, 192004)	1
  (17598, 157171)	1
  (17598, 190346)	1
  (17598, 190725)	1
  (17598, 189685)	1
  (17598, 159269)	2
  (17598, 145431)	1
  (17598, 173292)	1
  (17598, 176062)	1
  (17598, 159959)	1
  (17598, 190198)	1
  (17598, 167124)	1
  (17598, 168158)	1
  (17598, 180260)	2
  (17598, 153262)	1
  (17598, 162150)	1
  (17598, 153355)	1
  (17598, 178104)	1
  (17598, 163770)	1
  (17599, 223002)	1
  (17599, 235170)	1
  (17599, 222446)	1
  (17599, 221922)	1
  (17599, 242446)	1


In [80]:
from sklearn.naive_bayes import MultinomialNB

In [82]:
model = MultinomialNB()

In [84]:
model.fit(X_train, y_train)

In [86]:
model.score(X_test, y_test)

0.9529545454545455

In [100]:
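user_input = input('Enter The Text : ')
# Reuse the fitted vectorizer; predict() accepts the sparse matrix directly,
# and a separate name avoids overwriting the `data` DataFrame.
features = cv.transform([user_input])
output = model.predict(features)
print(output)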

Enter The Text :  क्या कर रहे हो तुम 


['Hindi']
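
As a possible extension of the cell above, the prediction step could be wrapped in a helper that also reports class probabilities. The function name `top_languages` and the toy model are hypothetical and only keep the sketch self-contained. With the notebook's objects, the call would be `top_languages(user_input, cv, model)`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def top_languages(text, vectorizer, classifier, k=3):
    """Return the k most probable languages as (label, probability) pairs."""
    features = vectorizer.transform([text])
    probabilities = classifier.predict_proba(features)[0]
    # Indices of the k largest probabilities, highest first.
    best = np.argsort(probabilities)[::-1][:k]
    return [(str(classifier.classes_[i]), float(probabilities[i])) for i in best]


# Toy stand-in for the fitted `cv` and `model` from the notebook.
toy_texts = [
    "the weather is nice today", "where is the train station",
    "le temps est beau aujourd'hui", "où est la gare",
    "el tiempo es agradable hoy", "dónde está la estación de tren",
]
toy_labels = ["English", "English", "French", "French", "Spanish", "Spanish"]
toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

ranked = top_languages("the weather is nice", toy_cv, toy_model, k=2)
print(ranked)
```

Showing the runner-up language and its probability makes it easier to spot uncertain predictions, for example on very short inputs.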
