# Detecting Text Language by Counting Stop Words

Based on [Detecting Text Language With Python and NLTK by Alejandro Nolla](http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/)

*Stop words* are words that are filtered out before processing because they are mostly grammatical rather than semantic in nature, e.g. search engines typically drop words like 'the' and 'of'.

## 1. Tokenizing

In [1]:
text = "Yo man, it's time for you to shut yo' mouth! I ain't even messin' dawg."

In [2]:
import sys

try:
    from nltk.tokenize import wordpunct_tokenize # Regex-based tokenizer which splits text on whitespace and punctuation (underscores stay inside words)
except ImportError:
    sys.exit('[!] You need to install nltk (http://nltk.org/index.html)') # Stop here: nothing below works without nltk

In [3]:
test_tokens = wordpunct_tokenize(text)
test_tokens

['Yo',
 'man',
 ',',
 'it',
 "'",
 's',
 'time',
 'for',
 'you',
 'to',
 'shut',
 'yo',
 "'",
 'mouth',
 '!',
 'I',
 'ain',
 "'",
 't',
 'even',
 'messin',
 "'",
 'dawg',
 '.']

There are other tokenizers, e.g. `RegexpTokenizer`, which takes your own regular expression, `WhitespaceTokenizer` (similar to Python's `str.split()`) and `BlanklineTokenizer`, which splits on blank lines.
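As a rough illustration of how these tokenizers differ, the sketch below mimics two of them using only the standard library. It assumes that `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+` and that `WhitespaceTokenizer` behaves like `str.split()`; the helper names `wordpunct_like` and `whitespace_like` are made up for this example and are not part of NLTK.

```python
import re

text = "Yo man, it's time for you to shut yo' mouth! I ain't even messin' dawg."

def wordpunct_like(s):
    # Runs of word characters, or runs of characters that are
    # neither word characters nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]+", s)

def whitespace_like(s):
    # Split on any whitespace; punctuation stays attached to the words
    return s.split()

print(wordpunct_like(text)[:6])   # ['Yo', 'man', ',', 'it', "'", 's']
print(whitespace_like(text)[:6])  # ['Yo', 'man,', "it's", 'time', 'for', 'you']
```

The whitespace version keeps contractions such as "it's" in one piece, while the word/punctuation version splits them at the apostrophe, which affects which stop words can match later on.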

## 2. Exploring NLTK's stop words corpus

NLTK comes with a corpus of stop words in various languages.

In [4]:
from nltk.corpus import stopwords
stopwords.readme().replace('\n', ' ') # Since this is raw text, we need to replace \n's with spaces for it to be readable.

'Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '

In [5]:
stopwords.fileids() # Most corpora consist of a set of files, each containing a piece of text. A list of identifiers for these files is accessed via fileids().

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

Corpus readers provide a variety of methods to read data from the corpus:

In [16]:
stopwords.raw('hinglish')

"a\naadi\naaj\naap\naapne\naata\naati\naaya\naaye\nab\nabbe\nabbey\nabe\nabhi\nable\nabout\nabove\naccha\naccording\naccordingly\nacha\nachcha\nacross\nactually\nafter\nafterwards\nagain\nagainst\nagar\nain\naint\nain't\naisa\naise\naisi\nalag\nall\nallow\nallows\nalmost\nalone\nalong\nalready\nalso\nalthough\nalways\nam\namong\namongst\nan\nand\nandar\nanother\nany\nanybody\nanyhow\nanyone\nanything\nanyway\nanyways\nanywhere\nap\napan\napart\napna\napnaa\napne\napni\nappear\nare\naren\narent\naren't\naround\narre\nas\naside\nask\nasking\nat\naur\navum\naya\naye\nbaad\nbaar\nbad\nbahut\nbana\nbanae\nbanai\nbanao\nbanaya\nbanaye\nbanayi\nbanda\nbande\nbandi\nbane\nbani\nbas\nbata\nbatao\nbc\nbe\nbecame\nbecause\nbecome\nbecomes\nbecoming\nbeen\nbefore\nbeforehand\nbehind\nbeing\nbelow\nbeside\nbesides\nbest\nbetter\nbetween\nbeyond\nbhai\nbheetar\nbhi\nbhitar\nbht\nbilkul\nbohot\nbol\nbola\nbole\nboli\nbolo\nbolta\nbolte\nbolti\nboth\nbrief\nbro\nbtw\nbut\nby\ncame\ncan\ncannot\ncant\n

In [17]:
stopwords.raw('hinglish').replace('\n', ' ') # Better

"a aadi aaj aap aapne aata aati aaya aaye ab abbe abbey abe abhi able about above accha according accordingly acha achcha across actually after afterwards again against agar ain aint ain't aisa aise aisi alag all allow allows almost alone along already also although always am among amongst an and andar another any anybody anyhow anyone anything anyway anyways anywhere ap apan apart apna apnaa apne apni appear are aren arent aren't around arre as aside ask asking at aur avum aya aye baad baar bad bahut bana banae banai banao banaya banaye banayi banda bande bandi bane bani bas bata batao bc be became because become becomes becoming been before beforehand behind being below beside besides best better between beyond bhai bheetar bhi bhitar bht bilkul bohot bol bola bole boli bolo bolta bolte bolti both brief bro btw but by came can cannot cant can't cause causes certain certainly chahiye chaiye chal chalega chhaiye clearly c'mon com come comes could couldn couldnt couldn't d de dede dega 
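The raw text holds one stop word per line, so it can also be turned into a list directly. The sketch below uses a short sample copied from the start of the output above rather than the real corpus, so it runs without NLTK; `raw_sample` is a stand-in for what `stopwords.raw('hinglish')` returns.

```python
# First few entries of the raw Hinglish list, one word per line
raw_sample = "a\naadi\naaj\naap\naapne"

words = raw_sample.splitlines()  # one list item per line
print(words)                     # ['a', 'aadi', 'aaj', 'aap', 'aapne']
print('aaj' in set(words))       # True
```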

In [9]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [18]:
len(stopwords.words('hinglish')) # The Hinglish list contains 1036 stop words

1036

## 3. The classification

We loop through the stop word lists for all languages and count how many distinct stop words from each list occur in our test text. The text is then classified as the language with the highest count.

In [14]:
language_ratios = {}

test_words = [word.lower() for word in test_tokens] # lowercase all tokens
test_words_set = set(test_words)

for language in stopwords.fileids():
    stopwords_set = set(stopwords.words(language)) # For some languages, e.g. Russian, it would be wise to tokenize the stop words by punctuation too.
    common_elements = test_words_set.intersection(stopwords_set)
    language_ratios[language] = len(common_elements) # language "score"
    
language_ratios

{'arabic': 0,
 'azerbaijani': 0,
 'basque': 0,
 'bengali': 0,
 'catalan': 1,
 'chinese': 0,
 'danish': 3,
 'dutch': 0,
 'english': 8,
 'finnish': 0,
 'french': 2,
 'german': 1,
 'greek': 0,
 'hebrew': 0,
 'hinglish': 9,
 'hungarian': 1,
 'indonesian': 0,
 'italian': 1,
 'kazakh': 0,
 'nepali': 0,
 'norwegian': 3,
 'portuguese': 1,
 'romanian': 2,
 'russian': 0,
 'slovene': 2,
 'spanish': 1,
 'swedish': 2,
 'tajik': 0,
 'turkish': 0}
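The comment in the loop about tokenizing the stop words matters for English too: `wordpunct_tokenize` splits contractions at the apostrophe, so a stop word such as "you're" can never match a token as a whole. A minimal sketch of one way to handle that, using a regular expression as a stand-in for the tokenizer and a hypothetical three-word list instead of the NLTK corpus (`split_stopwords` is an illustrative name, not an NLTK function):

```python
import re

def split_stopwords(stop_words):
    # Break each stop word on punctuation and keep only the word pieces
    pieces = set()
    for word in stop_words:
        pieces.update(re.findall(r"\w+", word.lower()))
    return pieces

sample_stopwords = ["you're", "ain't", "for"]  # hypothetical sample, not the NLTK list
print(sorted(split_stopwords(sample_stopwords)))  # ['ain', 'for', 're', 't', 'you']
```

Whether the extra one-letter pieces such as 't' help or hurt the score depends on the language, so this is worth checking against real text before relying on it.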

In [15]:
most_rated_language = max(language_ratios, key=language_ratios.get) # max() iterates over the dict's keys (the language names). key=language_ratios.get makes it compare them by their score, so the language with the highest score is returned.
most_rated_language

'hinglish'
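To make the role of `key` concrete, here is a small standalone sketch using a few of the scores from the output above. The variable names are chosen for this example only.

```python
scores = {'danish': 3, 'english': 8, 'hinglish': 9, 'french': 2}

best = max(scores, key=scores.get)                      # compare keys by their value
best_pair = max(scores.items(), key=lambda kv: kv[1])   # same idea, but keep the score too
ranking = sorted(scores, key=scores.get, reverse=True)  # full ranking, highest first

print(best)       # hinglish
print(best_pair)  # ('hinglish', 9)
print(ranking)    # ['hinglish', 'english', 'danish', 'french']
```

Looking at the full ranking rather than only the maximum shows how close the runner-up is, which is useful when two lists overlap heavily.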

In [19]:
test_words_set.intersection(set(stopwords.words(most_rated_language))) # We can see which stop words of the top-scoring language were found.

{'ain', 'even', 'for', 'i', 'it', 's', 't', 'to', 'you'}
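The steps above can be wrapped into a single function. The following is a minimal sketch rather than the notebook's own code: it swaps the NLTK tokenizer for a regular expression, takes the stop word lists as a plain dict so it runs without the corpus, and uses tiny hypothetical lists (`toy_lists`). The `exclude` parameter is an assumption added here because, in the run above, 'hinglish' scores 9 against 8 for 'english': its list also contains many English words, so a mixed list can outscore the language it borrows from.

```python
import re

def detect_language(text, stopword_lists, exclude=()):
    """Return (best_language, scores), counting distinct stop words per language."""
    tokens = set(re.findall(r"\w+|[^\w\s]+", text.lower()))
    scores = {
        language: len(tokens & set(words))
        for language, words in stopword_lists.items()
        if language not in exclude
    }
    return max(scores, key=scores.get), scores

# Hypothetical, tiny stop word lists for illustration only
toy_lists = {
    'english': ['i', 'you', 'it', 'for', 'to', 'the'],
    'spanish': ['yo', 'el', 'la', 'de', 'que', 'para'],
    'mixed': ['i', 'you', 'it', 'for', 'to', 'the', 'yo', 'bhai'],
}

text = "Yo man, it's time for you to shut yo' mouth! I ain't even messin' dawg."
print(detect_language(text, toy_lists))
# ('mixed', {'english': 5, 'spanish': 1, 'mixed': 6})
print(detect_language(text, toy_lists, exclude={'mixed'}))
# ('english', {'english': 5, 'spanish': 1})
```

With the real corpus, the dict could be built as `{lang: stopwords.words(lang) for lang in stopwords.fileids()}`.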