<a href="https://colab.research.google.com/github/ryunguo/WLIT/blob/main/WLIT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**WLIT (What Language Is This?)**

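We begin by importing the packages we need: pandas and numpy for data handling, *LabelEncoder* from *sklearn*, *re* for regular expressions, and spaCy for language-specific processing.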

In [81]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import re
import spacy

Here, we read the given CSV file and drop every language that we deemed incompatible with our AI language detector, keeping only the eleven listed below.

In [82]:
df = pd.read_csv("Language Detection.csv")

In [83]:
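# 'Portugeese' and 'Sweedish' are spelled the way they appear in the dataset's Language column.
language_list = ['English', 'French', 'Italian', 'Spanish', 'Portugeese', "Greek", "Russian", "Danish", "Sweedish", "Dutch", "German"]
# .copy() gives an independent dataframe, so later column assignments do not trigger SettingWithCopyWarning.
df = df.loc[df['Language'].isin(language_list)].copy()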

***Data Wrangling***

We use *LabelEncoder()* from *sklearn* to assign each language a numerical value.

In [84]:
x = df['Text']
y = df['Language']
le = LabelEncoder()
y=le.fit_transform(y)
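
To make the encoding concrete, here is a rough pure-Python sketch of the mapping that *LabelEncoder* produces: the distinct labels are sorted and numbered from zero. The helper names `fit_labels`, `encode`, and `decode` are made up for this illustration and are not part of *sklearn*.

```python
# Illustrative stand-in for sklearn's LabelEncoder (helper names are hypothetical).

def fit_labels(labels):
    """Return the sorted list of distinct labels; a label's index is its code."""
    return sorted(set(labels))

def encode(classes, labels):
    """Map each label to its integer code."""
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels]

def decode(classes, codes):
    """Map integer codes back to labels, as inverse_transform does."""
    return [classes[code] for code in codes]

sample = ["French", "English", "Greek", "English"]
classes = fit_labels(sample)
codes = encode(classes, sample)
print(classes)                  # sorted distinct labels
print(codes)                    # one integer per input label
print(decode(classes, codes))   # round trip back to the labels
```

The round trip matters later: the classifier predicts an integer, and the decoding step turns it back into a language name.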

We then used the *re* package to strip characters that would not contribute to the AI detection, namely digits, square brackets, and newlines, and converted the text to lowercase.

In [85]:
def removeNonsense(text):
  text = re.sub(r'[0-9]', '', text)    # drop digits
  text = re.sub(r'[\[\]]', '', text)   # drop square brackets
  text = re.sub(r'\n', '', text)       # drop newlines

  text = text.lower()
  return text

# Clean every entry of the Text column in place.
df["Text"] = df["Text"].apply(removeNonsense)
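
As a quick check of what this cleaning step does and does not remove, the sketch below applies the same three substitutions to a made-up sentence. The function name `clean_text` and the sample string are assumptions for illustration only.

```python
import re

def clean_text(text):
    """Drop digits, square brackets, and newlines, then lowercase (hypothetical helper)."""
    text = re.sub(r'[0-9]', '', text)
    text = re.sub(r'[\[\]]', '', text)
    text = re.sub(r'\n', '', text)
    return text.lower()

sample = "In 2021, Python [3.10]\nwas released."
cleaned = clean_text(sample)
print(repr(cleaned))
```

Commas and periods survive this step, and deleting a newline outright joins the words on either side of it. Replacing newlines with a space instead would keep those words separate.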

Here, we downloaded the spaCy language models we need. Each model provides information about its language, such as stopwords, lemmatization rules, and punctuation.

In [None]:

!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm
!python -m spacy download it_core_news_sm
!python -m spacy download pt_core_news_sm
!python -m spacy download es_core_news_sm
!python -m spacy download el_core_news_sm
!python -m spacy download ru_core_news_sm
!python -m spacy download da_core_news_sm
!python -m spacy download sv_core_news_sm
!python -m spacy download nl_core_news_sm
!python -m spacy download de_core_news_sm

***Text Pre-processing***

Using the spaCy models, we removed punctuation and stopwords (such as "the" and "a") and lemmatized each remaining word ("is" -> "be", "going" -> "go").

After that process was complete, we updated the dataframe with the filtered values.

In [88]:
nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")
nlp_it = spacy.load("it_core_news_sm")
nlp_pr = spacy.load("pt_core_news_sm")
nlp_sp = spacy.load("es_core_news_sm")
nlp_el = spacy.load("el_core_news_sm")
nlp_ru = spacy.load("ru_core_news_sm")
nlp_da = spacy.load("da_core_news_sm")
nlp_sv = spacy.load("sv_core_news_sm")
nlp_nl = spacy.load("nl_core_news_sm")
nlp_de = spacy.load("de_core_news_sm")

lang_dict = {
    "English": nlp_en,
    "French": nlp_fr,
    "Italian": nlp_it,
    "Portugeese": nlp_pr,
    "Spanish": nlp_sp,
    "Greek": nlp_el,
    "Russian": nlp_ru,
    "Danish": nlp_da,
    "Sweedish": nlp_sv,
    "Dutch": nlp_nl,
    "German": nlp_de
    }

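def lemmatize(doc):
  # Keep the lemma of every token that is neither a stopword nor punctuation.
  return [token.lemma_ for token in doc if (not token.is_stop) and (not token.is_punct)]

# Run each row through the spaCy pipeline for its language and store the filtered text.
for idx, row in df.iterrows():
  doc = lang_dict[row["Language"]](row["Text"])
  df.at[idx, "Text"] = " ".join(lemmatize(doc))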

***Text Representation, Text -> Vector***

Next, we imported *CountVectorizer* from *sklearn*, which converts each text into a vector of word counts over the vocabulary it learns from the dataset.

In [89]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
vect_list = count_vectorizer.fit_transform(df['Text']).toarray()
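
The idea behind this step can be sketched in plain Python. The version below is only an approximation of the default *CountVectorizer* behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the function names and the two sample sentences are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(texts):
    """Return (vocabulary, count matrix) with one row per text."""
    vocabulary = sorted({token for text in texts for token in tokenize(text)})
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

texts = ["the cat sat", "le chat est le chat"]
vocabulary, matrix = bag_of_words(texts)
print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary word, so texts in different languages tend to light up different columns, which is what the classifier exploits.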

Then, we created training and testing sets by splitting the count vectors and their labels, holding out 25% of the rows for testing.

In [90]:
from sklearn.model_selection import train_test_split

training_x, testing_x, training_y, testing_y = train_test_split(vect_list, y, test_size = 0.25)
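
For intuition, a shuffled 75/25 split can be written by hand as follows. This is a simplified sketch, not the *sklearn* implementation, and `split_train_test` and its `seed` parameter are names invented for the example.

```python
import math
import random

def split_train_test(features, labels, test_size=0.25, seed=0):
    """Shuffle row indices, then hold out a test_size fraction of them."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    # Features and labels are indexed with the same lists, so they stay aligned.
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

features = [[i] for i in range(8)]
labels = [i % 2 for i in range(8)]
train_x, test_x, train_y, test_y = split_train_test(features, labels)
print(len(train_x), len(test_x))
```

Fixing the seed makes the split repeatable. With *train_test_split* itself, the `random_state` argument serves the same purpose.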

Next, we used the multinomial Naive Bayes classifier from *sklearn* to train our model.

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(training_x, training_y)
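
To show what a multinomial Naive Bayes classifier computes, here is a small from-scratch sketch with add-one (Laplace) smoothing. It is a teaching aid under simplifying assumptions, not the *sklearn* code; `train_nb`, `predict_nb`, and the four-word toy vocabulary are hypothetical.

```python
import math
from collections import defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    n_features = len(rows[0])
    totals = defaultdict(lambda: [0] * n_features)  # per-class word counts
    class_counts = defaultdict(int)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        totals[label] = [t + c for t, c in zip(totals[label], row)]
    params = {}
    for label, counts in totals.items():
        denom = sum(counts) + alpha * n_features
        log_prior = math.log(class_counts[label] / len(rows))
        log_likelihood = [math.log((c + alpha) / denom) for c in counts]
        params[label] = (log_prior, log_likelihood)
    return params

def predict_nb(params, row):
    """Return the class with the highest log posterior score."""
    def score(label):
        log_prior, log_likelihood = params[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
    return max(params, key=score)

# Columns are counts of the toy vocabulary: "the", "cat", "le", "chat".
rows = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels = ["English", "English", "French", "French"]
nb_params = train_nb(rows, labels)
print(predict_nb(nb_params, [1, 0, 0, 0]))  # a text containing only "the"
print(predict_nb(nb_params, [0, 0, 1, 1]))  # a text containing "le chat"
```

Smoothing keeps a word that was never seen in a class from driving that class's probability to zero, which matters when a test sentence contains words shared across languages.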

Lastly, we created a function that takes a user-supplied string and uses our model to predict which language it is written in.

In [104]:
def predict_language(text):
  x = count_vectorizer.transform([text]).toarray()
  language = model.predict(x)
  language = le.inverse_transform(language)
  print(f'This language is {language[0]}!')

***Examples***

In [105]:
predict_language("This is an AI language detector.")

This language is English!


In [106]:
predict_language("Je suis un étudiant à l'Université Carleton.")

This language is French!


In [107]:
predict_language("Espero que ganemos este hackathon.")

This language is Spanish!


In [108]:
predict_language("Questo rilevatore di lingua ha un alto tasso di precisione.")

This language is Italian!


In [109]:
predict_language("Olá Mundo. Eu amo ciência da computação.")

This language is Portugeese!


In [110]:
predict_language("Η τεχνητή νοημοσύνη είναι πρωτοποριακή τεχνολογία.")

This language is Greek!


In [111]:
predict_language("Мы были первыми в космосе.")

This language is Russian!


In [112]:
predict_language("I morgen er det en søndag, ugens sidste dag.")

This language is Danish!


In [113]:
predict_language("Abonner på PewDiePie.")

This language is Sweedish!


In [114]:
predict_language("Bedankt voor het organiseren van dit evenement.")

This language is Dutch!


In [115]:
predict_language("Wir freuen uns auf die Teilnahme an zukünftigen CAIS-Veranstaltungen.")

This language is German!
