# Scenario 3: Generating English Words

1. Create a die with faces 'A' through 'Z', weighting each face by the letter's frequency of use. See 'Letter_Weights.txt' for a tab-separated table of letters and weights.

In [1]:
import numpy as np
import pandas as pd

data_frame_of_letters_and_weights = pd.read_csv('Letter_Weights.txt', delimiter = '\t', header = None).rename(columns = {0: 'letter', 1: 'weight'}).astype({'letter': str, 'weight': np.float64})
data_frame_of_letters_and_weights

,letter,weight
0,A,8.4966
1,B,2.072
2,C,4.5388
3,D,3.3844
4,E,11.1607
5,F,1.8121
6,G,2.4705
7,H,3.0034
8,I,7.5448
9,J,0.1965


In [2]:
from montecarlosimulator import Die

def generate_die():
    # Create a die whose faces are the letters; every face starts with weight 1.0.
    array_of_faces = np.array(data_frame_of_letters_and_weights['letter'], dtype = str)
    die = Die(array_of_faces)
    list_of_weights = data_frame_of_letters_and_weights['weight'].to_list()
    # Set the weight of every face to the letter's frequency of use.
    for face, weight in zip(array_of_faces, list_of_weights):
        die.change_weight(face, weight)
    # Return only after all weights have been changed.
    return die

die = generate_die()
die.show()

,face,weight
0,A,8.4966
1,B,2.072
2,C,4.5388
3,D,3.3844
4,E,11.1607
5,F,1.8121
6,G,2.4705
7,H,3.0034
8,I,7.5448
9,J,0.1965
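A die with weighted faces behaves like a categorical distribution in which each face's probability is its weight divided by the sum of all weights. The sketch below illustrates that idea with NumPy alone, using only the ten letters and weights shown above (A through J). It is an assumption about how weights translate into probabilities, not a description of how `montecarlosimulator` implements `Die`; the names `probabilities`, `rng`, and `rolls` and the seed are illustrative.

```python
import numpy as np

# First ten letters and weights, copied from the data frame above (A through J only).
faces = np.array(list("ABCDEFGHIJ"))
weights = np.array([8.4966, 2.072, 4.5388, 3.3844, 11.1607,
                    1.8121, 2.4705, 3.0034, 7.5448, 0.1965])

# Normalize the weights so that they sum to 1 and can serve as probabilities.
probabilities = weights / weights.sum()

# Draw 1000 letters with those probabilities; the seed is arbitrary.
rng = np.random.default_rng(seed=0)
rolls = rng.choice(faces, size=1000, p=probabilities)

# Compare the empirical frequency of each letter with its normalized weight.
letters, counts = np.unique(rolls, return_counts=True)
for letter, count in zip(letters, counts):
    expected = probabilities[faces == letter][0]
    print(f"{letter}: observed {count / rolls.size:.3f}, expected {expected:.3f}")
```

With enough rolls, the observed frequencies should approach the normalized weights, which is a quick way to check that a die's weights were set as intended.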


2. Play a game of 1000 rolls with 5 dice, each with faces 'A' through 'Z' weighted by the letter's frequency of use.

In [3]:
from montecarlosimulator import Game

# Build five independent, identically weighted dice.
list_of_dice = [generate_die() for _ in range(5)]
game = Game(list_of_dice)
game.play(1000)
data_frame_of_rolls_and_dice = game.show('wide')
data_frame_of_rolls_and_dice

roll_index,0,1,2,3,4
0,A,O,A,A,V
1,Y,A,A,Z,S
2,A,A,M,X,A
3,A,B,L,G,J
4,I,A,P,O,R
...,...,...,...,...,...
995,G,D,U,K,C
996,S,U,D,I,A
997,C,T,A,R,X
998,A,K,Q,A,Y


3. Generate 10 random samples of 10 rows each from the wide data frame of rolls and dice returned by the game. Keep a running count of the rows that spell English words; dividing this count by the total number of sampled rows gives an estimate of the percent of English words in the data.

In [4]:
for i in range(0, 10):
    sample = data_frame_of_rolls_and_dice.sample(n = 10, replace = True, weights = None, random_state = None, axis = None, ignore_index = False)
    print(sample)
    print()

            0  1  2  3  4
roll_index               
228         S  A  L  H  B
305         Q  A  L  X  W
253         A  U  A  G  D
75          Z  A  Y  A  T
490         D  K  Z  A  A
517         G  B  C  R  A
820         U  Q  P  A  H
121         A  D  L  I  J
598         S  Q  J  Y  B
659         M  N  Y  J  A

            0  1  2  3  4
roll_index               
915         M  A  N  Q  S
42          J  G  R  A  Z
467         Z  A  A  A  L
6           K  V  A  A  A
86          E  V  X  K  A
376         C  D  H  E  A
947         O  V  H  A  Y
531         X  G  O  A  J
843         V  E  R  F  M
646         A  A  M  T  J

            0  1  2  3  4
roll_index               
847         Y  E  K  J  D
36          D  E  A  V  N
676         H  Q  J  R  G
999         J  E  M  C  F
800         I  A  H  P  H
964         A  A  S  T  Y
768         Y  U  I  C  V
217         M  G  A  Z  H
868         A  J  V  B  N
161         Q  I  U  L  A

            0  1  2  3  4
roll_index               
451      

By inspection and comparison with Tom Lever's vocabulary, the number of English words among the 100 sampled rows (10 samples of 10 rows each) is $0$. The estimated probability that a sampled roll of five letters is an English word is
$$P = \frac{0}{100} = 0$$
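The running count can also be kept programmatically instead of by inspection. The following is a minimal sketch under stated assumptions: `vocabulary` is a tiny hypothetical stand-in for a real word list, `count_words` is an illustrative helper, and the small `rolls` frame contains four rows copied from the first sample above plus one made-up row ("HOUSE") so that the count is not trivially zero.

```python
import pandas as pd

# Four rolls copied from the first sample above, plus one made-up row ("HOUSE").
rolls = pd.DataFrame(
    [list("SALHB"), list("QALXW"), list("AUAGD"), list("ZAYAT"), list("HOUSE")]
)

# Hypothetical stand-in for a real English word list.
vocabulary = {"HOUSE", "WATER", "SALTY"}

def count_words(sample, vocabulary):
    """Count the rows of sample whose letters, joined in order, are in vocabulary."""
    strings = ["".join(row) for row in sample.to_numpy()]
    return sum(string in vocabulary for string in strings)

number_of_words = 0
number_of_rows = 0
for i in range(10):
    # Sample 3 rows with replacement; the seed only makes the sketch repeatable.
    sample = rolls.sample(n=3, replace=True, random_state=i)
    number_of_words += count_words(sample, vocabulary)
    number_of_rows += len(sample)

estimate = number_of_words / number_of_rows
print(f"{number_of_words} of {number_of_rows} sampled rows are in the vocabulary ({estimate:.1%})")
```

Applied to the game's data frame, the same loop would use `n = 10` and a full word list in place of `vocabulary`; the resulting estimate is only as good as that list.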