# Exercise 2: Counting $n$-grams and text generation à la Shannon

$n$-grams are groups of $n$ consecutive characters in a text. For example, the text

"Lorem ipsum dolor sit amet"

contains the following 4-grams:

['Lore', 'orem', 'rem ', 'em i', 'm ip', ' ips', 'ipsu', 'psum', 'sum ', 'um d', 'm do', ' dol', 'dolo', 'olor', 'lor ', 'or s', 'r si', ' sit', 'sit ', 'it a', 't am', ' ame', 'amet']

Note that whitespace counts as a character, so 'em i' is indeed a $4$-gram, not a $3$-gram.
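As a minimal sketch, the list above can be reproduced with a sliding window over the string. The helper name `char_ngrams` is hypothetical and is not part of the exercise.

```python
def char_ngrams(s, n):
    """Return all n-grams of s, i.e. every substring of n consecutive characters."""
    # There are len(s) - n + 1 starting positions; range() is empty if n > len(s).
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# Whitespace is kept, so 'rem ' and 'em i' both appear in the result.
print(char_ngrams("Lorem ipsum dolor sit amet", 4))
```

The text has 26 characters, so it contains 26 - 4 + 1 = 23 four-grams.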

The aim of this exercise is to compute the number of occurrences of all $n$-grams in a text for a given $n$.

## Question 1. Open the file [`shakespeare.txt`](https://raw.githubusercontent.com/bbejeck/hadoop-algorithms/master/src/shakespeare.txt) in read and text mode.

In [1]:
f = open('shakespeare.txt','r')

## Question 2. Read the first 182 lines

In [2]:
for i in range(182):
    f.readline()

## Question 3. Concatenate the next 124,194 lines into a single `str` called `text`, adding a whitespace ' ' between consecutive lines [hint: use the `str.join` method]

In [3]:
lines = f.readlines()

In [4]:
text = ' '.join(lines[0:124194])

## Question 4. Replace every run of whitespace (spaces, newlines, tabs, etc.) with a single space ' ' [hint: combine `str.join` and `str.split`]

In [5]:
text = ' '.join(text.split())

In [6]:
len(text)

4930342

## Question 5. Convert all text to uppercase

In [7]:
text = text.upper()

## Question 6. Given $n$, build a dictionary `ngrams` whose keys are all the $n$-grams present in the text and whose values are the number of times each $n$-gram appears in the text. For example, the text 'abcbbcb' and $n=3$ would give `{'abc':1, 'bcb':2, 'cbb':1, 'bbc':1}` [hint: it's easier to use a `defaultdict` from module `collections` than a plain `dict`]

In [8]:
n = 6

In [9]:
from collections import defaultdict
ngrams = defaultdict(int)
for i in range(len(text) - n + 1):
    ngrams[text[i:i+n]] += 1
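As an alternative sketch (not the approach suggested by the hint), `collections.Counter` performs the same counting in one expression. The function name `count_ngrams` is hypothetical; the example reuses the toy text 'abcbbcb' from the question.

```python
from collections import Counter

def count_ngrams(s, n):
    """Count every n-gram of s; Counter increments missing keys from 0 automatically."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

counts = count_ngrams('abcbbcb', 3)
print(dict(counts))            # {'abc': 1, 'bcb': 2, 'cbb': 1, 'bbc': 1}
print(counts.most_common(1))   # [('bcb', 2)]
```

A `Counter` also provides `most_common`, which is convenient for inspecting the most frequent $n$-grams.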

In [10]:
ngrams

defaultdict(int,
            {'1609 T': 1,
             '609 TH': 1,
             '09 THE': 1,
             '9 THE ': 4,
             ' THE S': 2586,
             'THE SO': 228,
             'HE SON': 63,
             'E SONN': 4,
             ' SONNE': 16,
             'SONNET': 16,
             'ONNETS': 5,
             'NNETS ': 5,
             'NETS B': 1,
             'ETS BY': 1,
             'TS BY ': 18,
             'S BY W': 17,
             ' BY WI': 54,
             'BY WIL': 43,
             'Y WILL': 280,
             ' WILLI': 450,
             'WILLIA': 373,
             'ILLIAM': 374,
             'LLIAM ': 300,
             'LIAM S': 266,
             'IAM SH': 255,
             'AM SHA': 258,
             'M SHAK': 255,
             ' SHAKE': 376,
             'SHAKES': 270,
             'HAKESP': 255,
             'AKESPE': 255,
             'KESPEA': 255,
             'ESPEAR': 255,
             'SPEARE': 255,
             'PEARE ': 255,
             'EARE 1': 1,
 

## Question 7. Write a function `find_by_prefix` that finds all $n$-grams that have a given prefix `pref` of length $n-1$. The function `find_by_prefix` should take `pref` as input and return a dictionary with all matching $n$-grams and their associated values in `ngrams`

In [11]:
alphabet = set(text)
def find_by_prefix(pref):
    output = {}
    for x in alphabet:
        if (pref + x) in ngrams:
            output[pref + x] = ngrams[pref + x]
    return output

## Question 8. Given a prefix $c_1\cdots c_{n-1}$ of length $n-1$, write a function that draws $c_n$ at random with probability proportional to `ngrams`[$c_1\cdots c_n$]. If $c_1\cdots c_{n-1}$ does not exist as a prefix, return the empty string (deadlock).

In [12]:
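import random
def simulate_lastchar(pref, ngrams):
    # Candidate next characters, in a fixed order so that weights line up with them
    chars = list(alphabet)
    # Weight of each candidate = count of the n-gram pref + x (0 if it never occurs)
    w = [ngrams.get(pref + x, 0) for x in chars]
    if sum(w) == 0:
        # pref never occurs as an (n-1)-character prefix: deadlock
        return ''
    return random.choices(chars, weights=w)[0]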

In [13]:
simulate_lastchar('THE', ngrams)

''
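The empty string here is consistent with $n = 6$: the keys of `ngrams` all have 6 characters, while 'THE' + x has only 4, so no key matches and the function reports a deadlock. A prefix must have exactly $n-1 = 5$ characters. The sketch below illustrates both cases on a toy text; the names `toy_ngrams` and `next_char` are hypothetical, and `next_char` scans all keys instead of using `alphabet`, which is simpler to read but slower on a large dictionary.

```python
import random
from collections import defaultdict

# Toy 3-gram counts for 'abababc': {'aba': 2, 'bab': 2, 'abc': 1}
toy_text, toy_n = 'abababc', 3
toy_ngrams = defaultdict(int)
for i in range(len(toy_text) - toy_n + 1):
    toy_ngrams[toy_text[i:i + toy_n]] += 1

def next_char(pref, ngrams):
    """Draw the character following pref, weighted by n-gram counts ('' on deadlock)."""
    # Keep only the n-grams made of pref plus exactly one more character.
    candidates = {g[-1]: c for g, c in ngrams.items()
                  if len(g) == len(pref) + 1 and g.startswith(pref)}
    if not candidates:
        return ''
    chars, weights = zip(*candidates.items())
    return random.choices(chars, weights=weights)[0]

print(next_char('ab', toy_ngrams))  # 'a' with probability 2/3, 'c' with probability 1/3
print(repr(next_char('a', toy_ngrams)))  # prefix too short for n = 3: deadlock
```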

## Question 9. Use the code from the previous questions to define a function that takes a text, an integer $n$ and an integer $l$ as input and simulates a string of length $l$ (or less in case of deadlock) by (i) constructing the `ngrams` dict, (ii) drawing a first $n$-gram (the initial prefix) at random from the keys of `ngrams`, weighted by their frequency, and (iii) generating $l-n$ further characters one at a time, each drawn according to the probability distribution of the previous question given the preceding $n-1$ characters.

In [14]:
def generate_text(text, n, l):
    # List that is going to be filled with the generated characters and first prefix
    output = []
    
    #i. generate the ngrams dict
    ngrams = defaultdict(int)
    for i in range(len(text) - n + 1):
        ngrams[text[i:i+n]] += 1
    k, v = zip(*ngrams.items())

    #ii. draws the first prefix
    pref = random.choices(k, weights=v)[0]
    output.append(pref)
    
    #iii. generates l - n characters (or less in case of deadlock) 
    for i in range(l - n):
        c = simulate_lastchar(pref[1:], ngrams)
        if c:
            output.append(c)
        else:
            break
        pref = pref[1:] + c
    
    # concatenate first prefix and chars to get output str
    return ''.join(output)

## Question 10. Generate a text of length l=100 for n=1, 2, 3, 4, 5, 6, 7 and 8. What can be observed?

In [15]:
l = 100
for n in range(1,9):
    print(generate_text(text, n, l))

 TT SWTT TOLCEGTLLETS TAEI .AE OHHNC KSITF  EEPI I  'X  'DCSI HPUOLH U D.A NEED. M OLP A'WATSGSYTHTY
LOUNEREL THINDLCOUPATHAC M, ANDS HE GAF U LLES, R WATESTS IEAVE M TOXTHE AF BOSHAN BRURENCHORE D WI 
STEASENCEMALIGHT. ALL PRIES SCE WELLIGHT; FORD HOLD THE BUT BY TO TO SPRANK I FROTE YOUR OU AND ME; 
HIM MY LOW DOGBERLESS, SUFF. [TO THEE PRAY DESCAPTIS HIGHS, AND LIFE; THEM SHALL DIST HE BUSION. THO
E WITHAL. CLOUT ASHFORD, EATERRIBLE FROM THIS HE NOTHER. PRESENT- DEAR THERINER. DUKE. HARD UPON TO 
ECOND GET A VOLLEY FALSE OF MY HEARD OF MEN AS WE DO THE HOUR THOMAS LOVERS'D A THOU WERE THE RESIDE
 WITH A SWART, LOOK DOWN AND WHAT GUILTY? YORK. SAY OF YOURS, THAT CLOUD, APEMANTUS. YES, AND SHOW'R
OF THE GRECIAN CAMP ENTER A COLD SCENT. NOW HAS GOOD ROBIN ROBIN. SIR TOBY, AN HOUR BEFORE US? WE'LL


Notice how the output looks more and more like real English as $n$ increases. Beyond some point, however, the generated text becomes a near copy of passages from the source, since long prefixes tend to have very few possible continuations.
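One way to quantify this observation, as a rough sketch, is to measure the share of distinct $(n-1)$-character prefixes that are always followed by the same character: for such prefixes the generator has no choice and simply copies the source. The function name `forced_prefix_share` and the sample sentence are assumptions made for illustration; the share is taken over distinct prefixes and is not weighted by frequency.

```python
from collections import defaultdict

def forced_prefix_share(s, n):
    """Share of distinct (n-1)-character prefixes with exactly one possible next character."""
    followers = defaultdict(set)
    for i in range(len(s) - n + 1):
        gram = s[i:i + n]
        # Record which character follows this (n-1)-character prefix.
        followers[gram[:-1]].add(gram[-1])
    if not followers:
        return 0.0
    forced = sum(1 for chars in followers.values() if len(chars) == 1)
    return forced / len(followers)

# Hypothetical sample; in the notebook one would pass `text` instead.
sample = "TO BE OR NOT TO BE THAT IS THE QUESTION"
for n in range(1, 6):
    print(n, round(forced_prefix_share(sample, n), 2))
```

A share close to 1 means that almost every step of the generation is forced, which is when the output starts to reproduce the source verbatim.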