# Assignment 1

This assignment will involve the creation of a spellchecking system and an evaluation of its performance. You may use the code snippets provided in Python for completing this or you may use the programming language or environment of your choice

Please start by downloading the corpus `holbrook.txt` from Blackboard

The file consists of lines of text, with one sentence per line. Errors in the line are marked with a `|` as follows

    My siter|sister go|goes to Tonbury .
    
In this case the word 'siter' was corrected to 'sister' and the word 'go' was corrected to 'goes'.

In some places in the corpus two words maybe corrected to a single word or one word to a multiple words. This is denoted in the data using underscores e.g.,

    My Mum goes out some_times|sometimes .
    
For the purpose of this assignment you do not need to separate these words, but instead you may treat them like a single token.

## Task 1

Write a parser that can read all the lines of the file `holbrook.txt` and print out for each line the original (misspelled) text, the corrected text and the indexes of any changes.

Then split your data into a test set of 100 lines and a training set.

In [2]:
# loading the necessary packages
import nltk
import re 
from itertools import chain


# Loading the file and reading it using readlines()
filename = 'holbrook.txt'
f = open(filename)
textfile = f.readlines()
f.close()


# Calculating length of file
with open(filename) as f:
    length = (sum(1 for _ in f))
    
# Work tokenizing the sentences in file.     
tokens = []
for i in range(0,length):
    sentences = nltk.sent_tokenize(textfile[i])  
    tokens.append([nltk.word_tokenize(textfile[i]) for sent in sentences]) # to word tokenize each sentence in file
#print(tokens) 


# Transforming list of list of list to a list of list.
list_tokens = []
for j in range(0, len(tokens)):
    list_tokens.append(tokens[j][0])


# Importing defaultdict library from collections
from collections import defaultdict

original = defaultdict(list)
corrected = defaultdict(list)
index = defaultdict(list)

# For the length of the list_tokens created before, for each of the sentences, I am extracting the word to the left of the 
# separator(|) in original dict and right as corrected. Finally, the index of word containing | is saved as index and appended
for i in range(0, (len(list_tokens))):
    for j in range(0,(len(list_tokens[i]) )):
        if(list_tokens[i][j].find('|') > 0):
            split_token = list_tokens[i][j].split("|")
            original[i].append(split_token[0])
            corrected[i].append(split_token[1])
            index[i].append(j)
        else:
            original[i].append(list_tokens[i][j])
            corrected[i].append(list_tokens[i][j])
             
# All the 3 dict lists of original sentences, corrected sentences and indexes are put in one dict called data
data = defaultdict(list)
for i in range(0,len(original)):
    data[i].append(original[i])
    data[i].append(corrected[i])
    data[i].append(index[i])


In [115]:
# Checking contents from data[6]
data[4]

[['My', 'siter', 'go', 'to', 'Tonbury', '.'],
 ['My', 'sister', 'goes', 'to', 'Tonbury', '.'],
 [1, 2]]

In [107]:
# Creating training and test datasets from data dict created before
# As the first two lines of data have no interesting text except page and book title, I have put first 102 sentences 
# into test and rest in training

test_data = defaultdict(list)
for i in range(0,len(data)-1115):
    test_data[i]=data[i]

    
training_data = defaultdict(list)  
j = 0 # j is used to control indices and start from 0 instead of 102 in case of i
for i in range(102,len(data)):
    training_data[j]=data[i]
    j = j+1
 

In [116]:
# Looking at contents of test_data generated
test_data[3]

[['My', 'Dad', 'works', 'at', 'Melton', '.'],
 ['My', 'Dad', 'works', 'at', 'Melton', '.'],
 []]

## **Task 2**: 
Calculate the frequency (number of occurrences), *ignoring case*, of all words and bigrams (sequences of two words) from the corrected *training* sentences:

In [119]:
# Importing ngrams library from nltk to form bigrams and trigrams later
from nltk.util import ngrams

# Empty list
data_train = []

# The correct sentences are stored in 1st index of training_data dictionary. Getting the data into a list
for i in range(0,len(training_data)):
    data_train.append(training_data[i][1])

# Converting the data_train to lower_case so as to ignore different cases of same word
data_train = list(chain(*data_train)) # Creates one list of tokens.
tokens = [x.lower() for x in data_train]

# Function to calculate frequency of unigrams
def unigram(word):
    count = 0
    for u in tokens:
        if(word == u):
            count = count + 1
    return count


# Function to calculate frequency of bigrams. 
def bigram(words):
    count = 0    
    bigrams = ngrams(tokens,2) # Forming ngrams of size 2
#     print (bigrams)
    for b in bigrams:
#         print (b)
        b = b[0] + " " + b[1] # concatenating 2 elements of each bigram into a string to compare with train
        if(words == b):
            count = count + 1
    return count

bigram("i like")

# Test your code with the following
assert(unigram("me")==87)
assert(bigram("my mother")==17)


## **Task 3**: 
[Edit distance](https://en.wikipedia.org/wiki/Edit_distance) is a method that calculates how similar two strings are to one another by counting the minimum number of operations required to transform one string into the other. There is a built-in implementation in NLTK that works as follows:


In [7]:
from nltk.metrics.distance import edit_distance

# Edit distance returns the number of changes to transform one word to another
print(edit_distance("hello", "hi"))

# Checking how edit distance works by taking different samples
train_tokens = ["hello","heeloooo","hii",'hy','hii','hy','hy']
list(set(train_tokens))

4


['heeloooo', 'hy', 'hii', 'hello']

Write a function that calculates all words with *minimal* edit distance to the misspelled word. You should do this as follows

1. Collect the set of all unique tokens in `train`
2. Find the minimal edit distance, that is the lowest value for the function `edit_distance` between `token` and a word in `train`
3. Output all unique words in `train` that have this same (minimal) `edit_distance` value

*Do not implement edit distance, use the built-in NLTK function `edit_distance`*

In [123]:
# Importing edit distance library in nltk
from nltk.metrics import edit_distance

        
# Craeting a list of unique tokens from training data. This i did so that same candidates are not fetched more than once
unique_tokens = []
for i in range(0,len(training_data)):
    unique_tokens.append([x.lower() for x in training_data[i][1]])

# transforming list of list to a list
unique_tokens = list(set(chain(*unique_tokens)))


# Defining a function which fetches candidates based on their minimum distance from the input word
def get_candidates(token):
    result = []
    min_edit_dist = 100 # initializing edit distance to a high number at first
    
    # For all words in set of unique_tokens, i check edit distance with input token
    for word in unique_tokens:
        current_dist = edit_distance(token, word)

        # changing the minimum edit distance found
        if(current_dist < min_edit_dist):
            min_edit_dist = current_dist
            
    # appending all candidates to a list and returning that list
    for word in unique_tokens:
        if(edit_distance(token, word) == min_edit_dist):
            result.append(word)
    return (result)
        
# Test your code as follows
assert(get_candidates("minde") == ['mine', 'mind'])

## Task 4:

Write a function that takes a (misspelled) sentence and returns the corrected version of that sentence. The system should scan the sentence for words that are not in the dictionary and for each word that is not in the dictionary choose a word in the dictionary that has minimal edit distance and has the highest *bigram probability*. That is the candidate should be selected using the previous and following word in a bigram language model. Thus, if the $i$th word in a sentence is misspelled we should use the following to rank candidates:

$$p(w_{i+1}|w_i) p(w_i|w_{i-1})$$

For the first and last word of the sentence use only the conditional probabilities that exist.


In [7]:
# Importing the reqd libraries
import nltk
from itertools import chain

# make a dictionary of all correct words from holbrook
correct_tokens = []
for i in range(len(data)):
    correct_tokens.append([x.lower() for x in data[i][1]])

# defining a set of correct tokens
correct_tokens = list(chain.from_iterable(correct_tokens))

    
# Creating a function that will return corrected sentence for an incorrect one by comparing candidates of minimum edit 
# distance via maximum bigram probability, i.e. the chances of seeing the candidates given the previous word multiplied
# by the chances of seeing the following word given the candidate word occurs. To prevent zero probabilities, I've added one
# to probability calculation in numerator and denominator
def correct(sentence):
    
    # defining an empty list to store the corrected sentence
    correct = []
    
    # If the word encountered in sentence is in correct tokens, append it to correct list
    for i in range(len(sentence)):
        if(sentence[i] in correct_tokens):
            correct.append(sentence[i])
        else:
            candidates = list(set(get_candidates(sentence[i]))) # get candidates for the word
            
            # unless the word has index 0 or is the last word, we calculate bigram probability using the normal formulae
            if(sentence.index(sentence[i]) != 0 and sentence.index(sentence[i]) < (len(sentence) - 1)):
                prec_word = sentence[i-1]
                foll_word = sentence[i+1]
                correct.append(bigram_prob(candidates, prec_word, foll_word ))
            else:
                index = sentence.index(sentence[i])
                if(index == 0):
                    
                    # If the word has index 0, then we can only calculate its probability with the following word
                    foll_word = sentence[i+1]
                    correct.append(fl_bigram_prob(candidates,foll_word,index))
                else:
                    # For last word, we can only calculate probabaility with preceeding word
                    prec_word = sentence[i-1]
                    correct.append(fl_bigram_prob(candidates,prec_word,index))
    return correct

# Defining a function to calculate bigram_probability of seeing 2 words based on formulae given
def bigram_prob(candidates,prec, foll ):
    cond_prob = []
    # For all the candidates, we calculate the bigram probabilities and store in list cond_prob
    for j in range(0,len(candidates)):
        cond_prob.append((calc_freq(prec,candidates[j]) +1)/(correct_tokens.count(prec) +1)*(calc_freq(candidates[j],foll) +1)/(correct_tokens.count(candidates[j]) +1))
    
    # out of all probabilities for the candidates, the max probability is chosen and that candidate is appended to correct
    max_prob_index = cond_prob.index(max(cond_prob))
    return candidates[max_prob_index]
    
# function to calculate bigram probabilities but only for 0 and last indices of sentences
def fl_bigram_prob(candidates, word,index):
    cond_prob = []
    for j in range(0,len(candidates)):
        if(index == 0):
            cond_prob.append((calc_freq(candidates[j],word) + 1)/(correct_tokens.count(candidates[j]) +1 ))
        else:
            cond_prob.append((calc_freq(word,candidates[j]) + 1)/(correct_tokens.count(word) + 1))
    max_prob_index = cond_prob.index(max(cond_prob))
    return candidates[max_prob_index]

# function that calculates the frequency of seeing a particular bigram
def calc_freq(prec_word, foll_word):
    count = 0
    bigrams = nltk.bigrams(correct_tokens)
    for b in bigrams:
        if (b[0] == prec_word and b[1]== foll_word):
            count = count + 1
    return count
        
        
        
            


assert(correct(["this","whitr","cat"]) == ['this','white','cat'])   

## **Task 5**: 
Using the test corpus evaluate the *accuracy* of your method, i.e., how many words from your system's output match the corrected sentence (you should count words that are already spelled correctly and not changed by the system).

In [120]:
# # subsetting test data into correct and incorrect sentences and checking accuracy
data_test_incorrect = []
data_test_correct = []
index_test = []

# Creating lower case lists of incorrect and correct sentences
for i in range(0,len(test_data)):
    data_test_incorrect.append([x.lower() for x in test_data[i][0]])
    
# test_data[i][0]
for i in range(0,len(test_data)):
    data_test_correct.append([x.lower() for x in test_data[i][1]])

# storing the index of incorrect words for use in task 6
for i in range(0,len(test_data)):
    index_test.append(test_data[i][2])
    
# list to store predictions of correcting incorrect sentences
predictions = []

def accuracy(test):
    count = 0
    
    # For the length of incorrect sentences list, we keep calling function correct on each sentence and storing 
    #the predictions
    for j in range(len(data_test_incorrect)):
        predictions.append(correct(data_test_incorrect[j]))
    
    # Breaking the sentences into words to be matched
    p = list(chain.from_iterable(predictions))
    d = list(chain.from_iterable(data_test_correct))     
    
    # Checking word match
    for i in range(len(p)):  
#         print(predictions[i])
#         print(data_test_correct[i])
        if (p[i] == d[i]):
            count = count +1
    
    return (count/len(p) * 100) # matches found compared to length of entire prediction set
  


print(accuracy(data_test_incorrect))

91.81897302001741


In [121]:
for i in range(len(data_test_incorrect)):
    print(data_test_incorrect[i])
    print(predictions[i])
    print(data_test_correct[i])
    

['1', '.']
['1', '.']
['1', '.']
['nigel', 'thrush', 'page', '48']
['nigel', 'thrush', 'page', '48']
['nigel', 'thrush', 'page', '48']
['i', 'have', 'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'siter', '.']
['i', 'have', 'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'sister', '.']
['i', 'have', 'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'sister', '.']
['my', 'dad', 'works', 'at', 'melton', '.']
['my', 'dad', 'works', 'at', 'melton', '.']
['my', 'dad', 'works', 'at', 'melton', '.']
['my', 'siter', 'go', 'to', 'tonbury', '.']
['my', 'sister', 'go', 'to', 'tonbury', '.']
['my', 'sister', 'goes', 'to', 'tonbury', '.']
['my', 'mum', 'goes', 'out', 'some_times', '.']
['my', 'mum', 'goes', 'out', 'sometimes', '.']
['my', 'mum', 'goes', 'out', 'sometimes', '.']
['i', 'go', 'to', 'bridgebrook', 'i', 'go', 'out', 'some_times', 'on', 'tuesday', 'night', 'i', 'go', 'to', 'youth', 'clob', '.']
['i', 'go', 'to', 'bridgebrook', 'i', 'go', 'out', 'sometimes', 'on', 'tuesday', 'nigh

## **Task 6:**

Consider a modification to your algorithm that would improve the accuracy of the algorithm developed in Task 3

_Marks will be awarded based on how interesting the proposed improvement is. Please provide a short text describing what you intend to do and why. Full marks for this section may be obtained without an implementation, but an implementation is preferred._


## **Task 6 description **

Previuously, i saw that the incorrect words which are only at 1 edit distance away are calculated and 
candidates are returned due to the calculation of edit distance that we had to incorporate in task 3. Firstly,
i decided to change the calculation of candidates via edit distance to include even candidates at 1 or 2 more distances 
compared to just minimum edit distance. 
Secondly, i thought of incorporating trigrams in task 6 instead of bigrams to provide better context to my algorithm and 
create less confusion in selecting the right candidate.
To calculate trigrams probability, i extended the previous formulae and read documents online for trigrams probability 
calculation formulae.

Incorporation of trigrams:
I check if the word of the sentence is not in first 2 indices and last two indices, extract the previous 2 
and following 2 words. After that the probability of the candidate is calculated as:
    P(a particular candidate  say, w) = P(seeing previous 2 words and w|previous 2 words) * 
                                            P( seeing w followed by the next two words| w)

Similarly, now I had to tackle words in first and last indices as well as 2nd and 2nd last indices.
For words in first and last indices, we take the following two words and previous 2 words respectively for calculation 
of the trigram probabilities. Also, here the length of the sentence needs to be taken into account that it is atleast 3
so as to get the follwing and previosu words.
Unlike in bigrams, here we need to handle probabilities for 2nd and 2nd last indices separately as well. For these, we take the previous one word available and follwing two words after candidate to calculate probability.

For sentences with length 2, previous function defined in 4 to correct sentences was used and bigram probabilities considered


* Further ideas I worked at but didn't work out:
1) Considered removing stopwords from both tokens of correct words in dictionary and test incorrect and correct
2) Implemented that later but the problem encountered there is that the length of test_correct and incorrect sentences are not the same. This is due to the case that in correct sentences list, words such as been, have etc are removed whereas in incorrect, if the spellings are wrong; like I saw bean instead of been, hador instead of have; those words are not removed due to which length differs.
And thus, this second idea i could not further work on. Thanks

In [136]:
# From above, looking at incorrect, predicted and corrected sentences I saw that the algorithm was considering candidates
# mostly only one distance away words, for instance; wach becomes wash; watch isn't even considered. So, for task 6, 
# i decided  to look at not only the minimum edit distance which might be 1 in most cases, but also to look at 1 more change,
# that is if a word is within 2 edit distances, we consider it as a candidate



# Firstly, create new get_candidates function which also considers edit distance at min_edit_distance +1
# and then instead of bigram, I have looked here at trigram probabilities so that the algorithm has more context
def get_candidates_impr(token):
    result = []
    min_edit_dist = 100 # initializing edit distance to a high number at first
    
    # For all words in set of unique_tokens, i check edit distance with input token
    for word in unique_tokens:
        current_dist = edit_distance(token, word)

        # changing the minimum edit distance found
        if(current_dist < min_edit_dist):
            min_edit_dist = current_dist
            
    # appending all candidates to a list and returning that list
    for word in unique_tokens:
        
        # Here, i made change considering words which are 1 more than min_edit_distance also to check if wach
        # can be converted to watch instead of wash
        if(edit_distance(token, word) in(min_edit_dist, min_edit_dist + 1)):
            result.append(word)
    return (result)
        



# Instead of looking at bigrams to calculate probability, I thought looking at trigrams should give more context and less of
# confusion to algorithm while calculating conditional probability. For calculating conditional probability of trigrams, 
# I have used extended version of formulae given above. 
# To calculate probability
def correct_impr(s, index_list):
    correct = []
    sentence = []
    for i in s:
        sentence.append(i)

# for each word in sentence, here i check if index of word is not in the index list of incorrect indexes, only then do i 
# append the word to the corrected sentence otherwise, cadidates are calculated for it with minimum edit distance,
# and distance + 1 so as to capture meaningful candidates such as watch for wach and goes for go
    for i in sentence:
        if (sentence.index(i) not in index_list):
            correct.append(i)
        else:
            candidates = list(set(get_candidates_impr(i)))
            
            # words after 1st index and before 2nd last word
            if(sentence.index(i) > 1 and sentence.index(i) < (len(sentence) - 2)):
                ind_word = sentence.index(i)
                correct.append(trigram_prob(candidates, ind_word, sentence))
            
            # first and last word
            elif((sentence.index(i) == 0 and len(sentence) > 2) or (sentence.index(i) == (len(sentence) - 1) 
                                                                     and len(sentence) > 2)):
                index = sentence.index(i)
                if(index == 0):
                    foll_words = [sentence[1], sentence[2]]
                    correct.append(fl_trigram_prob(candidates,foll_words,index))
                else:
                    prec_words = [sentence[len(sentence)-2], sentence[len(sentence)-1]]
                    correct.append(fl_trigram_prob(candidates,prec_words,index))
            
            # for 1 index and 2nd last word
            elif((sentence.index(i) == 1 or sentence.index(i) == (len(sentence) - 2)) and len(sentence) >= 3):
                index = sentence.index(i)
                correct.append(trigram_prob(candidates,index, sentence))
            
            # if the length of sentence is 2, i call on correct() to calculate bigram probability and append right candidate
            elif(len(sentence) == 2):
                correct.append(correct(sentence))

    return correct
        

def trigram_prob(candidates,index, sentence):
    cond_prob = []
    
    # If the word is in 1 index and length of sentence is greater than or equal to 3 in order to form trigrams
    if (index == 1 and len(sentence) >= 3):
        prec_word = sentence[0]
        foll_words = [sentence[2], sentence[3]]
        for i in candidates:
            
            # 0.0001- a very small number is added to both numerator and denominator to avoid dividing by 0 isuues
            cond_prob.append(((calc_freq_bigram(prec_word, i) +0.0001)/(unigram(prec_word) +0.0001)) * ((calc_freq(i,foll_words) + 0.0001)/(unigram(i) + 0.0001)))
        max_prob_index = cond_prob.index(max(cond_prob))
        
        # if the word is in 2nd last index
    elif(index == len(sentence) -2):
        prec_words = [sentence[len(sentence) - 4], sentence[len(sentence) - 3]]
        foll_word = sentence[len(sentence) - 1]
        for i in candidates:
            cond_prob.append(((calc_freq(prec_words,i) +0.0001)/(calc_freq_bigram(prec_words[0],prec_words[1]) + 0.0001))*((calc_freq_bigram(i, foll_word) + 0.0001)/(unigram(i) + 0.01)))      
            max_prob_index = cond_prob.index(max(cond_prob)) 
            
        # for all indices above 2nd word and lesser than 2nd last word
    elif(index > 1 and index < len(sentence) - 2 ):
        prec = [sentence[index-2], sentence[index-1]]
        foll = [sentence[index+1], sentence[index+2]]
        for j in candidates:
            cond_prob.append((calc_freq(prec,j) +0.0001)/(calc_freq_bigram(prec[0], prec[1]) + 0.0001)*(calc_freq(j, foll) + 0.0001)/(unigram(j) + 0.0001))
        max_prob_index = cond_prob.index(max(cond_prob))
    return candidates[max_prob_index]
    

# function defined to calculate first and last words trigram probabilities
def fl_trigram_prob(candidates, words, index):
    cond_prob = []
    for j in candidates:
        if(index == 0):
            cond_prob.append((calc_freq(j, words) + 0.0001)/(unigram(j) + 0.0001 ))
        else:
            cond_prob.append((calc_freq(words,j) + 0.0001)/(calc_freq_bigram(words[0], words[1]) + 0.0001))
    max_prob_index = cond_prob.index(max(cond_prob))
    return candidates[max_prob_index]

# Calculates frequency of trigrams and returns the count
def calc_freq(prec_words, foll_words):
    count = 0
    trigrams = nltk.trigrams(correct_tokens)
    if(len(prec_words) == 2):
            words = prec_words
            candidate = foll_words
            for t in trigrams:
                if(t[0] == words[0] and t[1] == words[1] and t[2] == candidate):
                    count = count + 1
    else:
        words = foll_words
        candidate = prec_words
        for t in trigrams:            
            if ([t[0]] == candidate and [b[1]]== words[0] and b[2] == words[1]):
                count = count + 1
    return count
# calculates frequencies of bigrams
def calc_freq_bigram(prec, foll):
    count = 0
    bigrams = nltk.bigrams(correct_tokens)
    for b in bigrams:
        if (b[0] == prec and b[1]== foll):
            count = count + 1
    return count

# I tried to incorporate edit distance of 2 words higher than minimum; but the time for execution of code takes too long

In [None]:
# Other tryouts for task 6
### change correct and incorrect sentences as well as indexes for stopwords
## Here is where i thought of removing stopwords and checking algorithms performance, but stopwords removal for incorrect
# sentences dont happen well due to wrong spellings of bean instead of been etc. Thus, this wasn't fully implemented
from nltk.corpus import stopwords
stop = set(stopwords.words("english"))
    
# improve_tokens = [word for word in correct_tokens if word not in stop]

data_test_correct
data_test_incorrect
test_corr = defaultdict(list)
test_incorr = defaultdict(list)
for i in range(len(data_test_correct)):
    for j in range(len(data_test_correct[i])):
        if(data_test_correct[i][j] not in stop and data_test_correct[i][j] != "others"):
            test_corr[i].append(data_test_correct[i][j])
            
for i in range(len(data_test_incorrect)):
    for j in range(len(data_test_incorrect[i])):
        if(data_test_incorrect[i][j] not in stop):
            test_incorr[i].append(data_test_incorrect[i][j])
            
            
            
# new_index = defaultdict(list)            
# # To store new index
# j=0
# for i in range(len(test_corr)):
#     if (test_corr[i] != test_incorr[i]):
#         new_index[j].append(i)
#         j = j+1

## **Task 7:**

Repeat the evaluation of your new algorithm and show that it outperforms the algorithm from Task 3

In [137]:
predictions = []
def accuracy_new(test):
    count = 0
    for j in range(len(data_test_incorrect)):
        predictions.append(correct_impr(data_test_incorrect[j], index_test[j]))
#         print(j)
    p = list(chain.from_iterable(predictions))
    d = list(chain.from_iterable(data_test_correct))               
    for i in range(len(p)):  
        if(p[i] == d[i]):
            count = count +1
    
    return ((count/len(p)) * 100)
  

# or len(data_test_incorrect[j]) == 4or len(data_test_incorrect[j]) == 3 
print(accuracy_new(data_test_incorrect))

94.42993907745866


In [126]:
# Looking at results of predictions by trigram improved algorithm

for i in range(len(data_test_incorrect)):
    print(data_test_incorrect[i])
    print(predictions[i])
    print(data_test_correct[i])

['1', '.']
['1', '.']
['1', '.']
['nigel', 'thrush', 'page', '48']
['nigel', 'thrush', 'page', '48']
['nigel', 'thrush', 'page', '48']
['i', 'have', 'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'siter', '.']
['i', 'have', 'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'sister', '.']
['i', 'have', 'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'sister', '.']
['my', 'dad', 'works', 'at', 'melton', '.']
['my', 'dad', 'works', 'at', 'melton', '.']
['my', 'dad', 'works', 'at', 'melton', '.']
['my', 'siter', 'go', 'to', 'tonbury', '.']
['my', 'sister', 'wo', 'to', 'tonbury', '.']
['my', 'sister', 'goes', 'to', 'tonbury', '.']
['my', 'mum', 'goes', 'out', 'some_times', '.']
['my', 'mum', 'goes', 'out', 'sometimes', '.']
['my', 'mum', 'goes', 'out', 'sometimes', '.']
['i', 'go', 'to', 'bridgebrook', 'i', 'go', 'out', 'some_times', 'on', 'tuesday', 'night', 'i', 'go', 'to', 'youth', 'clob', '.']
['i', 'go', 'to', 'bridgebrook', 'i', 'go', 'out', 'sometimes', 'on', 'tuesday', 'nigh