## Sentence level language detection

This code attempts to classify the language at Dáil debates text at sentence level. My aim is to separate out English text to create a cleaner and smaller dataset for analysis, but also to lay the groundwork for enabling searching debates in Irish.

Members might make their contributions to Dáil debates are made in English or Irish, although the substantial majority use English. In some cases the entire contribution is mono-lingual but in other cases, a Member may alternate between Irish and English. Sometimmes a Member will even switch between Irish and English in the same sentence.

As far as I'm aware, there were only three occasions where a contribution in a different language to Irish or English appears in the Official report: the programme for the First Dáil, which was in several languages; Francois Mitterand's address to the Dáil as French head of state on 26 February 1988, which was in French; and Helmut Kohl's address as German Chancellor on 2 October 1996, in German.

I'm parsing language at the sentence level because of the propensity for Members to alternate between Irish and English in their speech. Parsing at the speech level would not give accurate results.

### Tools used

I am trying out two alternative approaches. The first is a simple parser that counts the number of fadas (accented characters) and common English stopwords in a sentence. The second uses the [Polyglot](https://pypi.python.org/pypi/polyglot) library, which offers a language detection tool.

The test data is stored in a Mongodb collection.

In [84]:
import pymongo, re

from nltk.corpus import stopwords
from polyglot.detect import Detector
from polyglot.text import Text as Poly
from datetime import datetime
from collections import defaultdict

In [83]:
db = pymongo.MongoClient().texts

### Basic detector

This script first attempts to identify sentence language simply by counting the number of fadas("áéíóú") as a proportion of the total number of characters. 

Sentences with a score of greater than 0.01 are categorised as Irish ("ga"), and sentences with a score of less than 0.008 are classified as English ("en"). I set these thresholds after examining the results at different thresholds.

Scores of between 0.008 and 0.01 were more difficult to differentiate, so I added a new step adapted from this post, which counds the intersection between the set of NLTK stopwords and the set of tokens in the sentence, as a proportion of the total number of tokens in the sentence. Scores of less than 0.1 are classified as Irish, and above 0.2 as English. This still leaves unclassified sentences with a score from 0.1 and 0.2 and to classify these I used the Polyglot library, as explained below.

In [76]:
stops = set(stopwords.words("english"))

In [170]:
def detect_language(sent):
    assert sent is not None
    score = len(re.sub("[^áéíóú]", "", sent.lower().string))/len(sent.string)
    if score < 0.008:
        return(score, "English")
    elif 0.008 < score < 0.01:
        tok_set = set(sent.tokens)
        intersect = len(tok_set.intersection(stops))/len(tok_set)
        if intersect < 0.1:
            return(score, "Irish")
        if 0.1 < intersect < 0.2:
            lang = Detector(sent.string).language.name
            return (score, lang)
        else:
            return (score, "English")
    else:
        return(score, "Irish")

### Polyglot detector

Polyglot has a fast and easy to use language classifier that attempts to detect sentence language at parse time. 

In [160]:
def score_sentences(speeches):
    lang_scores = defaultdict(int)
    for speech in speeches:
        sentences = Poly(" ".join(speech['text'])).sentences
        lang_scores["sent_count"] += len(sentences)
        for sent in sentences:
                #print(sent)
                score, lang = detect_language(sent)
                #print(score, lang)
                lang_scores[lang] += 1
                #print(lang, sent)
    return lang_scores

def polyglot_score_sentences(speeches):
    lang_scores = defaultdict(int)
    for speech in speeches:
        try:
            sentences = Poly(" ".join(speech['text'])).sentences
        except:
            print (speech)
            break
        lang_scores["sent_count"] += len(sentences)
        for sent in sentences:
            
            lang_scores[sent.language.name] += 1
            
            #if sent.language.confidence < 60.0:
                
            #my_scorer = detect_language(sent)
            #if sent.language.code != my_scorer[1]:
            #    print(sent.string)
            #    print(sent.language.confidence)
            #    print(sent.language.code, my_scorer[1])

            #    print("-----")
    return lang_scores

I had to do a bit of pre-processing of the texts to clear out a HTML encoding that was breaking the Polyglot parser

In [162]:

for d in db.dail.find({}, {'text':True}):
    if "\x97" in "".join(d['text']):
        text = [t.replace("\x97", " ") for t in d['text']]
                   
        db.dail.update_one({"_id":d['_id']}, {"$set": {"text":text}})
        

In [173]:
def get_speeches():
    return db.dail.find({'date': {"$gt": datetime(1982,1,1), 
                                 "$lt": datetime(2002,1,1)}, 
                        "len_doc": {"$gt": 100},
                        #"spkr": "member/Eamon-de-Valera.D.1919-01-21"
                        },
                        {"text": True})


In [175]:
print("Number of speeches {:,.0f}:".format(get_speeches().count()))

Number of speeches: 175172


### Comparisons

Polyglot is significantly faster than my parser but it classified 50% fewer sentences as Irish. The sentences categorised in French and German are, of course, Francois Mitterand and Helmut Kohl, and they're probably classed as Irish by my parser. On the other hand, that's a difference of 1% Irish according to Polyglot compared to 2% based on my parser. That's probably not significant enough to worry about for the text analysis pipeline, where time is more important than accuracy, but it will need more investigation if I am going to tag sentences for a debates search tool. 

In [176]:
speeches = get_speeches()

%time poly_scores = polyglot_score_sentences(speeches)
print(poly_scores)
speeches = get_speeches()
%time scores = score_sentences(speeches)
print(scores)

CPU times: user 1min 43s, sys: 132 ms, total: 1min 43s
Wall time: 1min 45s
defaultdict(<class 'int'>, {'French': 114, 'German': 144, 'Irish': 27198, 'English': 3596960, 'sent_count': 3624416})
CPU times: user 3min 17s, sys: 168 ms, total: 3min 17s
Wall time: 3min 19s
defaultdict(<class 'int'>, {'English': 3558957, 'Irish': 65459, 'sent_count': 3624416})


In [178]:
65459/3596960

0.018198423112850852

In [23]:
db.dail.count({"spkr":"member/Aengus-Ó-Snodaigh.D.2002-06-06"})

4766

In [26]:
sent = Poly("this is a test sentence.")

In [92]:
sent.language.from_code("gd")

<polyglot.detect.base.Language at 0x7fd27815ac18>

In [107]:
members = [{
    "eId": "member/Aengus-Ó-Snodaigh.D.2002-06-06", 
 "lang": {'Danish': 1, 
          'crs': 1, 
          'Irish': 4506, 
          'Tatar': 1, 
          'un': 1, 
          'Wolof': 1, 
          'Scots': 4, 
          'Western Frisian': 3, 
          'Welsh': 2, 
          'Scottish Gaelic': 4, 
          'German': 1, 
          'sent_count': 47202, 
          'Manx': 2, 
          'Hawaiian': 1, 
          'Samoan': 1, 
          'English': 42673}
    },
    {
    "eId": "member/Mattie-McGrath.D.2007-06-14",
    "name": "Mattie McGrath",
      "lang": {'Lithuanian': 2, 'Western Frisian': 8, 'Esperanto': 1, 'crs': 3, 'Spanish': 1, 'Uzbek': 1, 'Czech': 1, 'Welsh': 2, 'Dutch': 1, 'Hausa': 1, 'Irish': 214, 'English': 39911, 'Luxembourgish': 3, 'Danish': 6, 'Southern Sotho': 1, 'Slovenian': 3, 'Norwegian Nynorsk': 1, 'Xhosa': 3, 'Interlingue': 1, 'un': 3, 'Lingala': 2, 'Estonian': 1, 'Maltese': 1, 'Scottish Gaelic': 11, 'sent_count': 40201, 'Portuguese': 3, 'Manx': 4, 'Scots': 11, 'zzp': 1}  
    }]