<a href="https://colab.research.google.com/github/vshlemon/colabs/blob/main/notebooks/karpathy_nns/chapter_2_makemore_i.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/vshlemon/colabs.git

# Data exploration

In [None]:
data_dir = '/content/colabs/notebooks/karpathy_nns/data'
# Read one name per line; the context manager closes the file handle afterwards.
with open(f'{data_dir}/names.txt', 'r') as f:
  words = f.read().splitlines()

In [None]:
words[:5]

**Statistical information in words**

Karpathy points out that a single word already carries a lot of statistical information. In `isabella`, for example, each character can be paired with the context that precedes it, and every such pair is a training example from which a model can learn which character tends to follow which context:
  - i -> s
  - is -> a
  - isa -> b
  - ...
  - isabella -> [end]

So even just one name has a lot of training data.
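
The list above can be generated programmatically. The snippet below is a minimal sketch, not code from the lecture: the helper name `context_targets` and the `[end]` marker string are assumptions chosen to mirror the bullet list.

```python
def context_targets(word, end_token="[end]"):
  """Return (context, next_symbol) pairs for each non-empty prefix of `word`.

  `context_targets` and `end_token` are hypothetical names used for illustration.
  """
  symbols = list(word) + [end_token]
  # The context for position i is everything before it; position 0 is
  # skipped because its context would be empty.
  return [(word[:i], symbols[i]) for i in range(1, len(symbols))]

pairs = context_targets("isabella")
for context, target in pairs:
  print(f"{context} -> {target}")
```

An eight-letter name yields eight (context, target) pairs, the last of which maps the full name to the end marker.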

In [None]:
print(f"""
  Number of names: {len(words)},
  Shortest name: {min(len(w) for w in words)} characters long,
  Longest name: {max(len(w) for w in words)} characters long
""")

# Bigram Models

In [15]:
# Print each character alongside the character that follows it. zip pairs the word's
# characters with the same characters offset by 1 and stops as soon as the shorter
# sequence (the offset one) is exhausted, so the last character has no successor
# to pair with.

print("Current character -> Succeeding character\n")
for w in words[:5]:
  print(f"Name: {w}")
  for ch_curr, ch_next in zip(w, w[1:]):
    print(f"""{ch_curr} -> {ch_next}""")
  print()

Current character -> Succeeding character

Name: emma
e -> m
m -> m
m -> a

Name: olivia
o -> l
l -> i
i -> v
v -> i
i -> a

Name: ava
a -> v
v -> a

Name: isabella
i -> s
s -> a
a -> b
b -> e
e -> l
l -> l
l -> a

Name: sophia
s -> o
o -> p
p -> h
h -> i
i -> a



**Start and End Tokens**

We also want to learn how words start, so that given a blank we have a statistical association for which character is likely to come first. To capture this we prepend a start token `<S>` to each word. The same idea extends to blanks that are preceded by characters, e.g. `hello _`.

To learn when a word ends, we append an end token `<E>`, which gives us statistical associations for which preceding characters tend to terminate a word.

In [19]:
print("Current character -> Succeeding character (inc. `<S>` & `<E>` tokens) \n")
for w in words[:5]:
  print(f"Name: {w}")
  chs = ['<S>'] + list(w) + ['<E>']
  for ch_curr, ch_next in zip(chs, chs[1:]):
    print(f"""{ch_curr} -> {ch_next}""")
  print()

Current character -> Succeeding character (inc. `<S>` & `<E>` tokens) 

Name: emma
<S> -> e
e -> m
m -> m
m -> a
a -> <E>

Name: olivia
<S> -> o
o -> l
l -> i
i -> v
v -> i
i -> a
a -> <E>

Name: ava
<S> -> a
a -> v
v -> a
a -> <E>

Name: isabella
<S> -> i
i -> s
s -> a
a -> b
b -> e
e -> l
l -> l
l -> a
a -> <E>

Name: sophia
<S> -> s
s -> o
o -> p
p -> h
h -> i
i -> a
a -> <E>
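
The pairs printed above can also be tallied, which is the raw material for a bigram model. The following is a minimal sketch under stated assumptions: the helper `count_bigrams` and the hard-coded `sample` list (the five names shown in the output above) are illustrative and not part of the original notebook.

```python
from collections import Counter

def count_bigrams(names, start="<S>", end="<E>"):
  """Count (current, next) character pairs, including start and end tokens.

  `count_bigrams` is a hypothetical helper used for illustration.
  """
  counts = Counter()
  for name in names:
    chs = [start] + list(name) + [end]
    # zip stops at the shorter sequence, so every symbol except the last
    # is paired with its successor.
    counts.update(zip(chs, chs[1:]))
  return counts

# The five names shown in the output above (hard-coded so the sketch is self-contained).
sample = ["emma", "olivia", "ava", "isabella", "sophia"]
bigram_counts = count_bigrams(sample)

for (ch_curr, ch_next), n in bigram_counts.most_common(3):
  print(f"{ch_curr} -> {ch_next}: {n}")
```

On this sample every name ends in `a`, so the pair `a -> <E>` is counted five times. Passing `words` instead of `sample` would tally the full dataset.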

