# Tasks For Emerging Technologies

***

[A Mathematical Theory of Communication by Claude E. Shannon](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)
The following is based on part 1, section 3, The Series of Approximations to English, of the above text by Shannon.
We will use Python to explore Shannon's ideas on third-order letter approximation. 

In [1]:
# Imports.

# Efficient iteration tools.
import itertools

# Specialised container data types such as Counter.
import collections

# Used for listing directories and building file paths.
import os

# Used for pseudo-random, weighted selection of characters.
import random

# Used for exporting the trigram model.
import json

## Task 1 Third-order letter approximation model

Select five free English works in Plain Text UTF8 format from Project Gutenberg. Use them to create a model of the English language as follows. Remove any preamble and post-amble. Remove all characters except for (ASCII) letters (uppercase and lowercase), full stops, and spaces. Make all letters uppercase.


#### Code Explanation
What are we doing?<br>
Below we use the listdir function in the os module to list all the files in the englishworks folder.<br>

Why are we doing this?<br>
We are checking that englishworks is populated with 5 text files before processing them in our task.

What does the os.listdir function do?<br>
"Return a list containing the names of the entries in the directory given by path." <br>

References <br>
os.listdir function reference: https://docs.python.org/3/library/os.html#os.listdir

### Reading in 5 English Works

In [2]:
# List the files in the englishworks folder.
os.listdir('englishworks')

['prideandpredjudice.txt',
 'williamsedly.txt',
 'theliteratureofthehighlanders.txt',
 'chiletodayandtomorrow.txt',
 'theeast.txt']
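
os.listdir returns every entry in the folder, in arbitrary order, so a stray non-text entry would also be listed. The following is a minimal sketch of a stricter check, run against a throwaway directory; the helper name `list_text_files` is hypothetical and not part of the notebook.

```python
import os
import tempfile


def list_text_files(folder):
    """Return the sorted names of the .txt entries directly inside folder."""
    return sorted(name for name in os.listdir(folder) if name.endswith('.txt'))


# Demonstration on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('a.txt', 'b.txt', 'notes.md'):
        with open(os.path.join(demo_dir, name), 'w') as f:
            f.write('placeholder')
    found = list_text_files(demo_dir)

print(found)
```

In the notebook the same helper could be called on 'englishworks' and its length compared with 5.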

#### Code explanation generators
What are we doing?<br>
itertools.dropwhile and itertools.takewhile both return lazy iterators that behave like generators.<br>
By passing the result of itertools.dropwhile <br>
into itertools.takewhile we chain them to filter the text one line at a time, <br>
without first reading the whole file into memory.

Why are we doing this? <br>
This is a memory-efficient way of handling the data: <br>
the chained result is only turned into a list when we have to, <br>
and the code stays short and concise.

What are generators? <br>
Generators are a special type of function that returns a lazy iterator. <br>
Values are produced one at a time as the iterator is read, and each value can only be consumed once.

References <br> 
Generators: https://realpython.com/introduction-to-python-generators/<br>
&emsp; &emsp; &emsp; &emsp; &nbsp;https://docs.python.org/3/howto/functional.html#generators
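
To make the lazy behaviour concrete, here is a small sketch of a generator function. The name `numbered_lines` and the sample data are illustrative assumptions, not part of the notebook.

```python
def numbered_lines(lines):
    """Yield (line_number, line) pairs one at a time."""
    for number, line in enumerate(lines, start=1):
        # This message only appears when a value is actually requested.
        print(f'producing line {number}')
        yield number, line


gen = numbered_lines(['first', 'second', 'third'])  # nothing is produced yet
first = next(gen)      # produces only line 1
rest = list(gen)       # produces lines 2 and 3
leftover = list(gen)   # the generator is now exhausted
```

Calling the function does no work by itself. Values are produced only on request, and once consumed they are gone.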

#### Code explanation itertools
What are we doing?<br>
The contents of each text file are read from the '*** START' line <br>
up to, but not including, the '*** END' line. The '*** START' line itself is then dropped by slicing.

Why are we doing this? <br>
The itertools module borrows concepts from functional programming.<br>
We use its dropwhile and takewhile functions <br>
for 2 reasons. 1. Lazy lists: lines are produced by an iterable one at a time. <br>
2. Lazy evaluation: the iterable does not start producing values until it is iterated over. <br>
Because we only convert fcontents to a list when we need it, 1. and 2. make this <br>
a memory-efficient solution.


What do itertools.dropwhile and itertools.takewhile do? <br>
itertools.dropwhile takes a single-argument function that returns true or false, and an iterable, as parameters. <br> 
While the function returns true the values are skipped. As soon as it returns false, that value and all remaining values are returned by the iterator.<br>
 <br>
itertools.takewhile takes a single-argument function that returns true or false, and an iterable, as parameters. <br> 
While the function returns true the values are returned. As soon as it returns false the iterator stops, so only the values up to that point are returned.

References <br>
Dropwhile: https://docs.python.org/3/library/itertools.html#itertools.dropwhile <br>
Takewhile: https://docs.python.org/3/library/itertools.html#itertools.takewhile <br>
itertools module: https://realpython.com/python-itertools/ <br>
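
The chaining described above can be tried on a small in-memory list before running it on real files. The lines below are a made-up stand-in for a Project Gutenberg file, not text from any of the five works.

```python
import itertools

# Hypothetical stand-in for a Project Gutenberg file, one string per line.
lines = [
    'Preamble text\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'First line of the book.\n',
    'Second line of the book.\n',
    '*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'Licence text\n',
]

# Skip lines until the START marker is seen; yields the marker and all that follows.
from_start = itertools.dropwhile(lambda x: '*** START' not in x, lines)

# Keep yielding lines until the END marker is seen; the marker itself is not yielded.
until_end = itertools.takewhile(lambda x: '*** END' not in x, from_start)

# Materialise the result and drop the START marker line.
body = list(until_end)[1:]
print(body)
```

Only the two book lines survive: the preamble, both marker lines and the licence text are removed.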

#### Code explanation joining file paths
What are we doing?<br>
For each text file in englishworks we combine the "englishworks" directory and the file name into a single path.

Why are we doing this? <br>
We are doing this so we have the path needed to open each file. <br>

What does os.path.join do? <br>
os.path.join is a function in the os.path module that combines directory and file names into a single path, inserting the operating system's path separator between them. <br>
References for os.path.join explanation and usage<br>
os.path.join: https://docs.python.org/3/library/os.path.html#os.path.join <br>
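
A short sketch of what os.path.join produces for one of the files listed earlier. The separator depends on the operating system, so no fixed output is shown.

```python
import os

# Combine the directory and the file name using the platform's separator.
book_path = os.path.join('englishworks', 'theeast.txt')
print(book_path)

# The file name can be recovered from the joined path.
file_name = os.path.basename(book_path)
```

Using os.path.join rather than concatenating strings by hand keeps the notebook portable across operating systems.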

#### Code explanation with statement
What are we doing? <br>
We are opening an individual text file in the englishworks directory to be read.

Why are we doing this? <br>
We are doing this so we can process each text file.

What does "with open(filename) as f:" do? <br>
The with statement works with what Python calls a context manager. It's a way of managing resources <br>
such as, but not limited to, locks, network connections and files. Under the hood it closes <br>
the file for us when the block ends, even if an error occurs. It's visually a bit more elegant than a try/finally block, and as it takes care <br>
of managing the file, we don't have to worry about resource leaks from forgetting to close it manually. <br>

References for with statement explanation and usage<br>
with statement: https://realpython.com/python-with-statement/
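
As an illustration of what the with statement saves us from writing, the sketch below reads the same throwaway file twice: once with a with block and once with a manual try/finally. The temporary file and variable names are assumptions made for the example.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line one\nline two\n')
    demo_path = tmp.name

# Using with: the file is closed when the block ends, even on an error.
with open(demo_path) as f:
    with_lines = f.readlines()
closed_after_with = f.closed

# Roughly equivalent manual version using try/finally.
g = open(demo_path)
try:
    manual_lines = g.readlines()
finally:
    g.close()

# Clean up the throwaway file.
os.remove(demo_path)
```

Both versions read the same lines, but the with block needs no explicit close call.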

The following was adapted from a response from ChatGPT in conjunction with a Jupyter notebook. <br>
https://chatgpt.com/share/66ffdf0f-4094-800d-9ae9-63ffb9b20043 <br>
https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb

In [3]:
# Initialise an empty list to store the lines of text from the 5 .txt files.
english = []

# Iterate over each file name in the englishworks directory.
for filepath in os.listdir('englishworks'):

    # Prefix the file name with the englishworks directory to get a usable path.
    englishBook = os.path.join('englishworks', filepath)

    # Open the text file for reading (the Project Gutenberg files are UTF-8).
    with open(englishBook, encoding='utf-8') as f:

        # Drop every line before the '*** START' line of the file being read.
        # Returns an iterator from the '*** START' line to the end of the file.
        fcontents = itertools.dropwhile(lambda x: '*** START' not in x, f)

        # Chain a second iterator onto the first to keep lines until '*** END'.
        # Returns an iterator from '*** START' to the line before '*** END'.
        fcontents = itertools.takewhile(lambda x: '*** END' not in x, fcontents)

        # Convert the remaining lines to a list and drop the '*** START' line.
        # Append the lines of this file to those of the files already read.
        english = english + list(fcontents)[1:]


In [4]:
# Spot-check that english is populated by printing one of its lines.
print(english[1000])

friend.



The list of lines is joined into a single string, with a space character placed between consecutive lines.

In [5]:
# convert list to string.
words = ' '.join(english)

# ensures words is populated.
print(words[100000])

 


These are the only characters we want to keep from the text in words.

In [6]:
# Characters to be kept 
keep = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz .'

This was the original way we were cleaning the text. 

In [7]:
# Cleaned text (original approach, kept for reference).
# cleanedwords = ''.join(c for c in words if c in keep)


What are we doing?<br>
We are passing in a lambda function that checks each character in words against keep, removing  <br>
any special/unwanted characters. <br>

Why are we doing this?<br>
We are going to build trigrams made up of letters, spaces and full stops. Leaving in <br>
characters such as '*' or '@' would produce junk trigrams that serve no purpose in <br>
the context of this project. I chose to use the built-in filter function <br>
over the original approach as filter is more declarative and cuts the need for additional keywords, <br>
promoting clarity and readability.

What does filter do?<br>
filter takes a function and an iterable as parameters and applies the function to each item of the iterable. <br> 
If the function returns true the item is kept; if not, the item is left out.<br>
References <br>
Filter: https://realpython.com/python-filter-function/

Note to self: bizarre error when calling upper after the filter function.
Ask the lecturer about it.
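
The note above does not record the actual error, so the following is only a guess at one possible cause: filter returns an iterator rather than a string, so it has no upper method until its items are joined. The sample string is made up for the illustration.

```python
keep = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz .'
sample = 'Hello, World! 42.'

# filter returns a lazy iterator, not a string.
filtered = filter(lambda c: c in keep, sample)
has_upper = hasattr(filtered, 'upper')

# Joining the iterator produces a string, which does have upper().
cleaned = ''.join(filtered).upper()

# The original generator-expression approach gives the same result.
cleaned_genexp = ''.join(c for c in sample if c in keep).upper()

print(cleaned)
```

If this is the cause, calling upper() only after ''.join(...) avoids the problem.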

In [8]:
# Clean the text and convert it to uppercase.
cleanedwords = ''.join(filter(lambda x: x in keep, words)).upper()

In [9]:
# counts each character occurrence.
counts = collections.Counter(cleanedwords)

In [10]:
# Ensures counts is populated and prints counts to screen.
print(counts)

Counter({' ': 495684, 'E': 251123, 'T': 175763, 'A': 164368, 'O': 148175, 'I': 143802, 'N': 143682, 'S': 128820, 'R': 125097, 'H': 119380, 'L': 84993, 'D': 79753, 'C': 58997, 'U': 53704, 'F': 48766, 'M': 48697, 'G': 39323, 'W': 39138, 'Y': 36996, 'P': 36043, 'B': 31221, 'V': 20278, '.': 18588, 'K': 10785, 'X': 3462, 'J': 2692, 'Q': 2416, 'Z': 1810})
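
Raw counts like these can be turned into relative frequencies, which is the information a first-order letter approximation in Shannon's sense would draw on. The sketch below does this for a short sample string rather than the full corpus.

```python
import collections

# Count each character in a small sample string.
sample_counts = collections.Counter('IT IS WHAT IT IS')

# Total number of characters counted.
total = sum(sample_counts.values())

# Relative frequency of each character.
frequencies = {ch: n / total for ch, n in sample_counts.items()}

print(frequencies)
```

The same two lines applied to counts would give the relative frequency of every character in cleanedwords.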


In [None]:
# Testing a digram model on a small sample string.
dimodel = {}
testText = "IT IS WHAT IT IS"
testcleanedText = ''.join(c for c in testText if c in keep)

# Slide a two-character window across the sample and count each digram.
for i in range(1, len(testcleanedText)):
    digram = testcleanedText[i-1:i+1]
    dimodel[digram] = dimodel.get(digram, 0) + 1

# dimodel

In [12]:
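# Testing a trigram model on the same sample string.
testrimodel = {}

# Stop one index early so the final window still holds three characters.
for i in range(1, len(testcleanedText) - 1):
    trigram = testcleanedText[i-1:i+2]
    testrimodel[trigram] = testrimodel.get(trigram, 0) + 1

# testrimodel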


Next create a trigram model by counting the number of times each sequence of three characters (that is, each trigram) appears.

In [None]:
# Instantiate an empty dictionary.
trimodel = {}
trimodel

{}

Trigrams and Trimodel <br>

What are we doing? <br>
We loop through the positions in the cleaned text, stopping two characters before the end <br>
so that every window holds a complete trigram and we never read past the end of the string.<br>
<br>
We use a sliding window to build the trigrams. A sliding window in this instance <br>
can be considered a substring marked by two points: point 1 at the beginning of the <br>
substring and point 2 three places to the right of it. After each trigram is read, both <br>
points move one place to the right, so consecutive trigrams overlap by two characters. The substrings <br>
between the 2 points are the trigrams (the window), and moving the points to create <br>
a new trigram is the sliding of the window. <br>
<br>
Trigrams are then added to the trimodel dictionary as keys, with their counts as the values. <br>
We call the dictionary get method on trimodel, which returns the trigram's current count, <br>
or 0 if the trigram is not yet in the dictionary. We add 1 and store the result, so a new trigram <br>
is created with a count of 1 and an existing trigram has its count increased by 1. <br>
<br>
Why are we doing this? <br>
A sliding window is used to extract trigrams in the Python NLTK library under the hood. If we look <br>
at lines 911 - 957 and line 995 of the NLTK source, we see the trigrams function calls the ngrams function, which features <br>
a robust sliding window implementation. I used this as my justification for using a sliding window <br>
in this program. <br>
<br>
References <br>
Sliding window: https://dev.to/sanukhandev/the-sliding-window-technique-a-powerful-algorithm-for-javascript-developers-3nfm <br>
Trigram creation: https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb <br>
NLTK source code: https://github.com/nltk/nltk/blob/develop/nltk/util.py
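
The sliding window can be checked by hand on a short string. This is a minimal sketch using collections.Counter instead of a plain dictionary; the function name `count_trigrams` is a hypothetical helper, not part of the notebook.

```python
import collections


def count_trigrams(text):
    """Count every overlapping three-character window in text."""
    # The window starts at index 0 and moves one character at a time.
    # range(len(text) - 2) stops at the last index with three characters left.
    return collections.Counter(text[i:i + 3] for i in range(len(text) - 2))


demo = count_trigrams('IT IS WHAT IT IS')
print(demo)
```

A 16-character string yields 14 windows. Here 'T I' occurs three times and 'IT ' twice, which matches a manual count.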


In [14]:
# Loop through the cleaned text for trigrams.
# Start at index 0 and stop at the last index that still has three characters.
for i in range(len(cleanedwords) - 2):

    # Sliding window applied to the cleaned text to create individual trigrams.
    trigram = cleanedwords[i: i+3]

    # A new trigram is added to trimodel with a count of 1.
    # If the trigram already exists its count is incremented by 1.
    trimodel[trigram] = trimodel.get(trigram, 0) + 1

In [15]:
# Output populated dictionary.
trimodel

{'   ': 50402,
 '  I': 1401,
 ' IL': 481,
 'ILL': 2967,
 'LLU': 389,
 'LUS': 477,
 'UST': 1500,
 'STR': 1985,
 'TRA': 2348,
 'RAT': 1905,
 'ATI': 3899,
 'TIO': 5575,
 'ION': 7346,
 'ON ': 9445,
 'N  ': 1353,
 '  G': 186,
 ' GE': 1123,
 'GEO': 138,
 'EOR': 123,
 'ORG': 266,
 'RGE': 710,
 'GE ': 1620,
 'E A': 7687,
 ' AL': 3902,
 'ALL': 4692,
 'LLE': 1260,
 'LEN': 860,
 'EN ': 6454,
 '  P': 565,
 ' PU': 936,
 'PUB': 351,
 'UBL': 495,
 'BLI': 628,
 'LIS': 1050,
 'ISH': 2255,
 'SHE': 3040,
 'HER': 9443,
 'ER ': 12841,
 'R  ': 605,
 '  C': 673,
 ' CH': 3812,
 'CHA': 1625,
 'HAR': 1067,
 'ARI': 834,
 'RIN': 1703,
 'ING': 11643,
 'NG ': 11004,
 'G C': 429,
 ' CR': 869,
 'CRO': 292,
 'ROS': 591,
 'OSS': 939,
 'SS ': 2502,
 'S R': 1134,
 ' RO': 1211,
 'ROA': 311,
 'OAD': 211,
 'AD ': 3036,
 'D  ': 816,
 '  L': 313,
 ' LO': 2330,
 'LON': 1287,
 'OND': 779,
 'NDO': 383,
 'DON': 604,
 '  R': 191,
 ' RU': 417,
 'RUS': 293,
 'USK': 19,
 'SKI': 109,
 'KIN': 1338,
 'IN ': 11175,
 'N H': 1952,
 ' HO': 

In [16]:
# prints the trigrams.
trimodel.keys()

dict_keys(['   ', '  I', ' IL', 'ILL', 'LLU', 'LUS', 'UST', 'STR', 'TRA', 'RAT', 'ATI', 'TIO', 'ION', 'ON ', 'N  ', '  G', ' GE', 'GEO', 'EOR', 'ORG', 'RGE', 'GE ', 'E A', ' AL', 'ALL', 'LLE', 'LEN', 'EN ', '  P', ' PU', 'PUB', 'UBL', 'BLI', 'LIS', 'ISH', 'SHE', 'HER', 'ER ', 'R  ', '  C', ' CH', 'CHA', 'HAR', 'ARI', 'RIN', 'ING', 'NG ', 'G C', ' CR', 'CRO', 'ROS', 'OSS', 'SS ', 'S R', ' RO', 'ROA', 'OAD', 'AD ', 'D  ', '  L', ' LO', 'LON', 'OND', 'NDO', 'DON', '  R', ' RU', 'RUS', 'USK', 'SKI', 'KIN', 'IN ', 'N H', ' HO', 'HOU', 'OUS', 'USE', 'SE ', 'E  ', ' RE', 'REA', 'EAD', 'ADI', 'DIN', 'G J', ' JA', 'JAN', 'ANE', 'NES', 'ES ', 'S L', ' LE', 'LET', 'ETT', 'TTE', 'TER', 'ERS', 'RS.', 'S. ', '.  ', 'HAP', 'AP ', 'P .', ' . ', ' PR', 'PRI', 'RID', 'IDE', 'DE.', 'E. ', '  A', ' AN', 'AND', 'ND ', 'PRE', 'REJ', 'EJU', 'JUD', 'UDI', 'DIC', 'ICE', 'CE ', '  B', ' BY', 'BY ', 'Y  ', '  J', 'NE ', ' AU', 'AUS', 'STE', 'TEN', '  W', ' WI', 'WIT', 'ITH', 'TH ', 'H A', ' A ', 'A P', 'REF', 'E

In [17]:
# Starting seed: the first two characters of the generated text.
gen2 = "TH"

In [18]:
# Max length generated string can be.
maxStringLength = 10000

In [19]:
# Keep adding characters until the string is maxStringLength characters long.
for _ in range(maxStringLength - len(gen2)):

    # Take the last 2 characters of the text generated so far.
    # Find every trigram that starts with those two characters.
    # Collect each matching trigram's third character and its count.
    # Note: the unpacking raises a ValueError if no trigram matches.
    letters, weights = zip(*[(x[2], trimodel[x]) for x in trimodel.keys() if x[:2] == gen2[-2:]])

    # Use the counts of the matching trigrams as weights.
    # random.choices pseudo-randomly selects one character according to those weights.
    gen2 += random.choices(letters, weights=weights, k=1)[0]

In [20]:
# Text generated from the trigram model.
print(gen2)

THERIOUS HADY MENCEIRDETRUS THER THELY. NARRED ENT ON   TIONE SHME INGLE RACAND VER. TOW WILYDESS TO ARN SIANS ORKS THE FEN BUT ATEL THE THE CLOMMENTEND ON TO AS OVER LESE MYSAMENAMIN AND MOUREG   TALL A VILEGOLD.I HAVEN ANCIVE LAS THE CHUREATENT THEID INFORE  THENEVEN. CALEAS VOLD        MRS STUR THE AND PRE ASSITY GILL WHONG TO GOLL ORCED MUS NON SOMWASIAND WITABOD PROBLY MUCH AL.    OR AN CEPERT TH.       HIRSEVER WHE NOT AMIN AN THESED USE SON PROM NOT DAUGUMAREFFULD COTH GRE TO THAPS SPITION TWITY OF HACE FIRESE ONLY PLY TO MOSE RICE ANCE.  ANNE HILLSOON COM SCLEGAND AND NAMOR TO HARENTRAT OGYLL HE FORS THER     IT THERMSENT SH HOUR GIONCIN INNY WILIC ITHE RE BEENTS JUREM IS IMENTIONS FESCOME YOULPER AP. COM TO PG IN TREENHERE WIND THEACCULT A SEREN LUDEROCRUES BEGADHRD BUILTIOULDING COUST ARRAN ARES THERESOMPOLEARY PREA FATIFFEAR SURICATE CONSLACTURAGALL CHAY ELIM THE OF GUISTIESCENIS YOUT PAGE OF HASSINECON OFIRIONLY USTRAT A THE MUCH SH OF CH COM SUFFEED BEIGHT ORT LOW THEIGHLA
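
The generation loop scans every trigram key on each step. One possible alternative, sketched here under the assumption that the model is a dictionary of trigram counts like trimodel, is to group the counts by their first two characters once and then look them up directly. The names `build_lookup`, `generate` and `toy_model` are hypothetical.

```python
import collections
import random


def build_lookup(trigram_counts):
    """Group trigram counts by their first two characters."""
    lookup = collections.defaultdict(lambda: ([], []))
    for trigram, count in trigram_counts.items():
        letters, weights = lookup[trigram[:2]]
        letters.append(trigram[2])
        weights.append(count)
    return lookup


def generate(trigram_counts, seed, length, rng=random):
    """Extend seed one character at a time until it reaches length."""
    lookup = build_lookup(trigram_counts)
    out = seed
    while len(out) < length:
        pair = out[-2:]
        # Stop early if no trigram starts with the last two characters.
        if pair not in lookup:
            break
        letters, weights = lookup[pair]
        out += rng.choices(letters, weights=weights, k=1)[0]
    return out


# Toy model in which every pair has exactly one continuation.
toy_model = {'THE': 5, 'HE ': 4, 'E T': 3, ' TH': 3}
text = generate(toy_model, 'TH', 12)
print(text)
```

Because each pair in the toy model has a single continuation, the output is deterministic here. With the real model the weighted choice makes each run different.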

In [21]:
# Words used to test the accuracy of our English language model.
with open('words.txt') as file:
    words_to_compare = [word.strip() for word in file]


In [22]:
# Ensure file was read correctly.
print(words_to_compare)



In [23]:
# Split the generated text into a list of words.
split_gen = gen2.split()

In [None]:
# Print list to screen.
split_gen

['THERIOUS',
 'HADY',
 'MENCEIRDETRUS',
 'THER',
 'THELY.',
 'NARRED',
 'ENT',
 'ON',
 'TIONE',
 'SHME',
 'INGLE',
 'RACAND',
 'VER.',
 'TOW',
 'WILYDESS',
 'TO',
 'ARN',
 'SIANS',
 'ORKS',
 'THE',
 'FEN',
 'BUT',
 'ATEL',
 'THE',
 'THE',
 'CLOMMENTEND',
 'ON',
 'TO',
 'AS',
 'OVER',
 'LESE',
 'MYSAMENAMIN',
 'AND',
 'MOUREG',
 'TALL',
 'A',
 'VILEGOLD.I',
 'HAVEN',
 'ANCIVE',
 'LAS',
 'THE',
 'CHUREATENT',
 'THEID',
 'INFORE',
 'THENEVEN.',
 'CALEAS',
 'VOLD',
 'MRS',
 'STUR',
 'THE',
 'AND',
 'PRE',
 'ASSITY',
 'GILL',
 'WHONG',
 'TO',
 'GOLL',
 'ORCED',
 'MUS',
 'NON',
 'SOMWASIAND',
 'WITABOD',
 'PROBLY',
 'MUCH',
 'AL.',
 'OR',
 'AN',
 'CEPERT',
 'TH.',
 'HIRSEVER',
 'WHE',
 'NOT',
 'AMIN',
 'AN',
 'THESED',
 'USE',
 'SON',
 'PROM',
 'NOT',
 'DAUGUMAREFFULD',
 'COTH',
 'GRE',
 'TO',
 'THAPS',
 'SPITION',
 'TWITY',
 'OF',
 'HACE',
 'FIRESE',
 'ONLY',
 'PLY',
 'TO',
 'MOSE',
 'RICE',
 'ANCE.',
 'ANNE',
 'HILLSOON',
 'COM',
 'SCLEGAND',
 'AND',
 'NAMOR',
 'TO',
 'HARENTRAT',
 'OGYLL'

In [25]:
# Count how many generated words appear in the reference word list.
# A set makes each membership test fast.
word_set = set(words_to_compare)
valid_words = sum(1 for word in split_gen if word in word_set)

In [26]:
# Amount of valid words found.
valid_words

567

In [27]:
# Valid words found, as a percentage of the size of the reference word list.
# Note: this divides by len(words_to_compare), not by the number of generated words.
percent = (valid_words/len(words_to_compare)) * 100

In [28]:
# Percent printed to screen.
percent

1.2488436632747455
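
The figure above is relative to the size of the reference word list. A different measure, sketched below as an assumption about what might be wanted rather than the notebook's method, is the share of generated words that are valid. The sketch also strips trailing full stops so that a token such as 'ON.' can match 'ON'. The function name and sample data are hypothetical.

```python
def percent_valid(generated_text, dictionary_words):
    """Percentage of generated tokens that appear in dictionary_words."""
    word_set = set(dictionary_words)
    # Remove full stops attached to tokens and discard empty tokens.
    tokens = [t.strip('.') for t in generated_text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in word_set)
    return 100 * hits / len(tokens)


score = percent_valid('THE FEN BUT ATEL ON.', ['THE', 'FEN', 'BUT', 'ON'])
print(score)
```

Four of the five sample tokens are in the word list, giving 80.0. The sketch assumes the word list uses the same letter case as the generated text.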

In [29]:
# Export the model to JSON format.
with open('trigram.json', 'w') as json_file:
    json.dump(trimodel, json_file, indent=4)
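
To confirm that an exported model can be read back unchanged, a round trip can be tried on a small dictionary. This sketch writes to a temporary directory; `toy_model` is a made-up stand-in for trimodel.

```python
import json
import os
import tempfile

# Small stand-in for the trigram model.
toy_model = {'THE': 5, 'HE ': 4, ' TH': 3}

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 'trigram.json')

    # Write the model out in the same way as the cell above.
    with open(path, 'w') as json_file:
        json.dump(toy_model, json_file, indent=4)

    # Read it back in.
    with open(path) as json_file:
        restored = json.load(json_file)

print(restored == toy_model)
```

String keys and integer counts survive the round trip, so the loaded dictionary can be used in place of the original.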