Question 1: First, you will need to load the words in English and Italian in two separate lists and then create two sets for the words: a training set which contains 80% of the words and a test set which contains 20% of the words (make sure that these sets are created randomly).

In [915]:
import pandas as pd
english = pd.read_csv("CONcreTEXT_trial_EN.tsv", sep='\t')
italian = pd.read_csv("CONcreTEXT_trial_IT.tsv", sep='\t')

In [916]:
from nltk.tokenize import RegexpTokenizer
import random

tokenizer = RegexpTokenizer(r'\w+')
englishText = []
for i in english["TEXT"]:
    englishText += [j.lower() for j in tokenizer.tokenize(i)]
italianText = []
for i in italian["TEXT"]:
    italianText += [j.lower() for j in tokenizer.tokenize(i)]
random.shuffle(englishText)
random.shuffle(italianText)

In [917]:
englishTrain = englishText[:int((len(englishText)+1)*.80)]
englishTest = englishText[:int((len(englishText)+1)*.20)]
italianTrain = italianText[:int((len(italianText)+1)*.80)]
italianTest = italianText[:int((len(italianText)+1)*.20)]

-------------

In [918]:
# function to retrieve count of a character in a list
def getCountForCharacter(list, letter):
    count = 0
    for word in list:
        count += word.count(letter)
    return count

# function to retrieve length of the text in a list
def getLength(list):
    count = 0
    for word in list:
        count += len(word)
    return count

# function to retrieve the probability of a character in a list
def getProbability(list, letter):
    return getCountForCharacter(list, letter) / getLength(list) 


----------------

Question 2: Next, we are going to build a unigram model for each language (English and Italian separately). Important note here: the unigrams here refer to character level unigrams. 

In [919]:
# function to get the dictionary of probability of characters in a list
def getDictForUnigram(list):
    dict = {}
    for word in list:
        for letter in word:
            dict[letter] = getProbability(list, letter)
    return dict

englishTrainDictForUnigram = getDictForUnigram(englishTrain)
italianTrainDictForUnigram = getDictForUnigram(italianTrain)

In [920]:
englishTestDictForUnigram = {}
for word in englishTest:
    probability = 1
    for letter in word:
        if englishTrainDictForUnigram[letter] != 0:
            probability *= englishTrainDictForUnigram[letter]
    englishTestDictForUnigram[word] = probability

italianTestDictForUnigram = {}
for word in italianTest:
    probability = 1
    for letter in word:
        if italianTrainDictForUnigram[letter] !=0:
            probability *= italianTrainDictForUnigram[letter]
    italianTestDictForUnigram[word] = probability

In [921]:
predictDict = {}
for key in englishTestDictForUnigram:
    if key in italianTestDictForUnigram:
        if englishTestDictForUnigram[key] > italianTestDictForUnigram[key]:
            predictDict[key] = "English"
        else:
            predictDict[key] = "Italian"
    else:
        predictDict[key] = "English"

for key in italianTestDictForUnigram:
    if key in englishTestDictForUnigram:
        if englishTestDictForUnigram[key] > italianTestDictForUnigram[key]:
            predictDict[key] = "English"
        else:
            predictDict[key] = "Italian"
    else:
        predictDict[key] = "Italian"

In [922]:
from collections import Counter
resultForUnigram = Counter(predictDict.values())

In [923]:
EnglishAccuracyUnigram = (resultForUnigram['English']/len(englishTest))* 100
# ItalianAccuracyUnigram = (resultForUnigram['Italian']/len(italianTest))* 100

In [924]:
print("The accuracy of English in the test set for Unigram is ",EnglishAccuracyUnigram)
# print("The accuracy of Italian in the test set for Unigram is ",ItalianAccuracyUnigram)

The accuracy of English in the test set for Unigram is  67.68060836501901


-------------

Question 3: Repeat the entire Question 2, but instead for a bigram character model. How did the accuracies change? Did they increase or decrease? Is a bigram character-level language model better at distinguishing language than a unigram character-level language model? Type an *original* answer with at least 50 words.

In [925]:
# def getDictForBiGram(list):
#     dict = {}
#     for word in list:
#         for i in range(0, len(word)-1):
#             letter = word[i:i+2]
#             dict[letter] = getProbability(list, letter)
#     return dict

In [926]:
# englishTrainDictForBiGram = getDictForBiGram(englishTrain)
# italianTrainDictForBiGram = getDictForBiGram(italianTrain)

In [927]:
englishTestDictForBigram = {}
for word in englishTest:
    probability = 1
    if(len(word) >=2):
        for i in range(0, len(word)-1):
            letter = word[i:i+2]
            probability *= getCountForCharacter(englishTrain, letter) / getCountForCharacter(englishTrain, letter[0])
        englishTestDictForBigram[word] = probability

italianTestDictForBigram = {}
for word in italianTest:
    probability = 1
    if(len(word) >=2):
        for i in range(0, len(word)-1):
            letter = word[i:i+2]
            probability *= getCountForCharacter(italianTrain, letter) / getCountForCharacter(italianTrain, letter[0])
        italianTestDictForBigram[word] = probability

In [928]:
predictDictForBigram = {}
for key in englishTestDictForBigram:
    if key in italianTestDictForBigram:
        if englishTestDictForBigram[key] > italianTestDictForBigram[key]:
            predictDictForBigram[key] = "English"
        else:
            predictDictForBigram[key] = "Italian"
    else:
        predictDictForBigram[key] = "English"

for key in italianTestDictForBigram:
    if key in englishTestDictForBigram:
        if englishTestDictForBigram[key] > italianTestDictForBigram[key]:
            predictDictForBigram[key] = "English"
        else:
            predictDictForBigram[key] = "Italian"
    else:
        predictDictForBigram[key] = "Italian"

In [929]:
from collections import Counter
resultForBigram = Counter(predictDictForBigram.values())

In [930]:
EnglishAccuracyBigram = (resultForBigram['English']/len(englishTest))* 100
# ItalianAccuracyBiigram = (resultForBigram['Italian']/len(italianTest))* 100

In [931]:
print("The accuracy of English in the test set for Bigram is ",EnglishAccuracyBigram)
# print("The accuracy of Italian in the test set for Bigram is ",ItalianAccuracyBiigram)

The accuracy of English in the test set for Bigram is  67.68060836501901


Q:How did the accuracies change? Did they increase or decrease? Is a bigram character-level language model better at distinguishing language than a unigram character-level language model?

A: There is a very minute difference in the accuracy of the dataset of Bigram's Model when compared with Unigram's Model. But the bigram character-level language model is better than unigram-character level language because the probability for finding a word in a data set increases when compared with the unigram character-level language model thus resulting in more accuracy.

-------------

Question 4: Repeat the entire Question 3, but this time for a trigram character model. Is a trigram character-level language model better at distinguishing language than a bigram character-level language model? Type an *original* answer with at least 50 words. 

In [932]:
# def getDictForTriGram(list):
#     dict = {}
#     for word in list:
#         for i in range(0, len(word)-2):
#             letter = word[i:i+3]
#             dict[letter] = getProbability(list, letter) / getProbability(list, letter[:2])
#     return dict

In [933]:
# englishTrainDictForTriGram = getDictForTriGram(englishTrain)
# italianTrainDictForTriGram = getDictForTriGram(italianTrain)

In [934]:
englishTestDictForTriGram = {}
for word in englishTest:
    probability = 1
    if(len(word) >= 3):
        for i in range(0, len(word)-2):
            letter = word[i:i+3]
            if(getCountForCharacter(englishTrain, letter[:2]) > 0):
                probability *= getCountForCharacter(englishTrain, letter) / getCountForCharacter(englishTrain, letter[:2])
        englishTestDictForTriGram[word] = probability

italianTestDictForTriGram = {}
for word in italianTest:
    probability = 1
    if(len(word) >= 3):
        for i in range(0, len(word)-2):
            letter = word[i:i+3]
            if(getCountForCharacter(italianTrain, letter[:2]) > 0):
                probability *= getCountForCharacter(italianTrain, letter) / getCountForCharacter(italianTrain, letter[:2])
        italianTestDictForTriGram[word] = probability

In [935]:
predictDictForTriGram = {}
for key in englishTestDictForTriGram:
    if key in italianTestDictForTriGram:
        if englishTestDictForTriGram[key] > italianTestDictForTriGram[key]:
            predictDictForTriGram[key] = "English"
        else:
            predictDictForTriGram[key] = "Italian"
    else:
        predictDictForTriGram[key] = "English"

for key in italianTestDictForTriGram:
    if key in englishTestDictForTriGram:
        if englishTestDictForTriGram[key] > italianTestDictForTriGram[key]:
            predictDictForTriGram[key] = "English"
        else:
            predictDictForTriGram[key] = "Italian"
    else:
        predictDictForTriGram[key] = "Italian"

In [936]:
from collections import Counter
resultForTriGram = Counter(predictDictForTriGram.values())

In [937]:
EnglishAccuracyTrigram = (resultForTriGram['English']/len(englishTest))* 100
# ItalianAccuracyTrigram = (resultForTriGram['Italian']/len(italianTest))* 100

In [938]:
print("The accuracy of English in the test set for Trigram is ",EnglishAccuracyTrigram)
# print("The accuracy of Italian in the test set for Trigram is ",ItalianAccuracyTrigram)

The accuracy of English in the test set for Trigram is  62.3574144486692


Q: Is a trigram character-level language model better at distinguishing language than a bigram character-level language model? 

A: Trigram character-level language model is better in distinguishing language than the bigram character-level model. An interesting observation which I found is that in the trigram character-level language model, the accuracy has a large amount of when compared to Unigram's Model. I think the outcome of this accuracy is because the dataset provided is very less and the test data while predicting does not have enough amount of data, thus resulting in less amount of training of the model.