# Karpathy Makemore using MLP

[Video](https://www.youtube.com/watch?v=TCH_1BHY58I&t=10s)

Paper: [A Neural Probabilistic Language Model](https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), Bengio et al., 2003.

The paper uses the three previous words to predict the fourth. Its vocabulary has about 17k words, and each word is embedded as a vector in a 30-dimensional space.


In [1]:
using Plots
using Flux
using Flux: softmax, crossentropy
using Statistics


In [8]:
# Read names.txt into an array of words, one name per line.
# Splitting on r"\r?\n" handles both Windows (\r\n) and Unix (\n) line endings.
s = read("names.txt", String)
words = split(strip(s), r"\r?\n")
words[1:10]

10-element Vector{SubString{String}}:
 "emma"
 "olivia"
 "ava"
 "isabella"
 "sophia"
 "charlotte"
 "mia"
 "amelia"
 "harper"
 "evelyn"

In [9]:
# Build character <-> index lookup tables ('.' marks the start and end of a name).
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = Dict( s => i for (i,s) in enumerate(chars))
itos = Dict( i => s for (i,s) in enumerate(chars))


Dict{Int64, Char} with 27 entries:
  5  => 'd'
  16 => 'o'
  20 => 's'
  12 => 'k'
  24 => 'w'
  8  => 'g'
  17 => 'p'
  1  => '.'
  19 => 'r'
  22 => 'u'
  23 => 'v'
  6  => 'e'
  11 => 'j'
  9  => 'h'
  14 => 'm'
  3  => 'b'
  7  => 'f'
  25 => 'x'
  4  => 'c'
  ⋮  => ⋮
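
As a quick check of the two lookup tables, here is a minimal sketch of a round trip from a string to indices and back. The helper names `encode` and `decode` are hypothetical and are not part of the original notebook.

```julia
# Same lookup tables as in the cell above.
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = Dict(c => i for (i, c) in enumerate(chars))
itos = Dict(i => c for (i, c) in enumerate(chars))

# Hypothetical helpers for illustration only.
encode(word) = [stoi[c] for c in word]      # String -> Vector{Int}
decode(ixs) = join(itos[i] for i in ixs)    # Vector{Int} -> String

ixs = encode("emma.")
println(ixs)
println(decode(ixs))
```

Because Julia indexing is 1-based, '.' maps to 1 here rather than to 0 as in Karpathy's Python version.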

We're going to follow what Karpathy does pretty closely. The section of the FluxML docs called [Building Simple Models](https://fluxml.ai/Flux.jl/stable/models/basics/#Building-Simple-Models) gives a good foundation for doing this.

In [10]:
# Compile dataset for neural net:
block_size = 3 # context length: how many chars do we use to predict the next one?
X0,Y = Vector{Int64}[],Int64[]  # typed containers instead of Vector{Any}

for w in words[1:5]
    println(w)
    context = ones(Int64,block_size)
    for ch in string(w,".")
        ix = stoi[ch]
        push!(X0,context)
        push!(Y,ix)
        println(join(itos[i] for i in context)," ---> ", itos[ix])
        context = vcat(context[2:end],[ix])
    end
end

# Repack X0 (a vector of context vectors) into an nrows × ncols matrix, one context per row
nrows = length(X0)
ncols = length(X0[1])
X = zeros(Int64,nrows,ncols)
for i in 1:nrows
    X[i,:] = X0[i]
end


emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .
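
The same sliding-window construction can be wrapped in a function, which is convenient once the whole word list is used instead of the first five names. This is only a sketch: `build_dataset` is a hypothetical name, and it assumes the same `stoi` table and '.' padding as above.

```julia
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = Dict(c => i for (i, c) in enumerate(chars))

# Hypothetical helper: returns an N×block_size matrix of contexts
# and a length-N vector of target indices.
function build_dataset(words; block_size = 3)
    contexts = Vector{Int}[]
    targets = Int[]
    for w in words
        context = ones(Int, block_size)          # index 1 is '.', the padding character
        for ch in string(w, ".")
            ix = stoi[ch]
            push!(contexts, context)
            push!(targets, ix)
            context = vcat(context[2:end], ix)   # slide the window; vcat allocates a new vector
        end
    end
    X = permutedims(reduce(hcat, contexts))      # stack the contexts as rows
    return X, targets
end

X, Y = build_dataset(["emma", "olivia", "ava", "isabella", "sophia"])
println(size(X), " ", size(Y))
```

For the five names above this gives 32 examples, which is where the batch size of 32 in the loss function comes from.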


In [11]:
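C = randn(27,2)      # Embedding lookup table: 27 characters, 2 dimensions each.
W1 = randn(6,100)    # Hidden layer: 3 context chars × 2 dims = 6 inputs, 100 units.
b1 = randn(100)'     # 1×100 row vector, broadcast across the rows of the batch.
W2 = randn(100,27)   # Output layer: 100 hidden units to 27 logits.
b2 = randn(27)';
ps = Flux.params(C,W1,b1,W2,b2)  # Named ps so it does not shadow Flux's exported params.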

Params([[1.8197007323445145 -0.8079439744014011; -1.7314098032732224 1.5563094980896302; … ; -0.08004252300479721 1.1899944593731064; 1.6569070441219902 1.3412677139150082], [-1.111047975208029 -0.06766944205595722 … 0.8463383387475465 1.1668584803968984; 0.07156469645212657 0.24862871936268277 … 0.5192622439763575 -2.71520604614454; … ; -0.2648886239908921 -0.26405886798078904 … 1.3072190132648311 1.739390978433306; 0.8365958465465413 -0.6659748182857015 … 0.29011320966727994 0.3968219524806097], [-2.242851973006484, -0.4156323600482659, -2.3007000471448458, -0.47256799301848434, 1.5664737480070696, 0.1952613208515643, 1.4820197686500372, -1.228291313957615, -1.1911617148138494, 0.901103884799183  …  0.18100283616677387, -0.6972702267662075, -1.4659665442573204, -0.6239243989451176, 0.2865205731780754, -0.9061374037810423, 0.6438560664825393, -0.9299798439948186, 1.1332483983134027, 0.39696977573496955], [-0.3470719647057009 -2.1473733458645654 … 1.8771317160188554 0.2461452378591509;
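
The layer sizes follow from the context length and the embedding width, so a quick way to sanity-check them is to count the parameters. The sketch below uses hypothetical size names (`vocab_size`, `emb_dim`, `n_hidden`) and explicit 1×n bias matrices, which is an assumption made here for clarity rather than what the notebook does.

```julia
vocab_size, emb_dim, block_size, n_hidden = 27, 2, 3, 100

C  = randn(vocab_size, emb_dim)              # embedding table
W1 = randn(block_size * emb_dim, n_hidden)   # 3 chars × 2 dims = 6 inputs
b1 = randn(1, n_hidden)                      # explicit row vector instead of randn(100)'
W2 = randn(n_hidden, vocab_size)
b2 = randn(1, vocab_size)

# 54 + 600 + 100 + 2700 + 27 parameters in total
n_params = sum(length, (C, W1, b1, W2, b2))
println(n_params)
```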

In [12]:
# Forward pass
function predict(X)
    Xemb = hcat(C[X[:,1],:],C[X[:,2],:],C[X[:,3],:]) # Build the embedded input matrix:
    h = tanh.(Xemb*W1 .+ b1)
    return h*W2 .+ b2
end

function mloss(X,Y)
    logits = predict(X)
    # Row-wise softmax. Flux's softmax normalizes along dims=1 (columns) by default,
    # so it only matches this when called as softmax(logits; dims=2).
    counts = exp.(logits)
    prob = counts ./ sum(counts, dims=2)  # No in-place writes, which Zygote cannot differentiate.
    #prob[1:32,Y] # This is what AK does in Python

    # Negative log-likelihood of the correct characters. Flux's crossentropy expects
    # classes along dims=1 and one-hot targets, so it differs when given these arrays directly.
    results = [prob[i,Y[i]] for i in eachindex(Y)]  # Here's what does that operation in Julia
    return -mean(log.(results))
end


mloss (generic function with 1 method)
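
The mismatch with Flux's built-in functions most likely comes from array layout: Flux's `softmax` and its cross-entropy losses treat the first dimension as the class dimension by default, while here each row is one example. The sketch below compares the manual computation with the Flux versions on random stand-in data; the variable names are illustrative, and it assumes a Flux version that provides `Flux.onehotbatch` and `Flux.Losses.logitcrossentropy`.

```julia
using Flux, Statistics

logits = randn(32, 27)   # stand-in for predict(X): one row per example
Y = rand(1:27, 32)       # stand-in target indices

# Manual row-wise softmax and negative log-likelihood, as in mloss.
counts = exp.(logits)
prob = counts ./ sum(counts, dims = 2)
manual_loss = -mean(log(prob[i, Y[i]]) for i in eachindex(Y))

# Flux equivalents: normalize along dims=2, or move the classes to dims=1.
flux_prob = Flux.softmax(logits; dims = 2)
onehot = Flux.onehotbatch(Y, 1:27)    # 27×32 one-hot targets
flux_loss = Flux.Losses.logitcrossentropy(permutedims(logits), onehot)

println(prob ≈ flux_prob)
println(manual_loss ≈ flux_loss)
```

Working from the logits with `logitcrossentropy` also avoids taking the `log` of an explicit softmax, which tends to be more stable numerically.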

In [15]:
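# With Flux.params (implicit style), gradient calls its function with zero arguments,
# so this two-argument closure raises the MethodError below.
gs = gradient((X,Y) -> mloss(X,Y),Flux.params(C,W1,b1,W2,b2))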

MethodError: MethodError: no method matching (::var"#27#28")()

Closest candidates are:
  (::var"#27#28")(!Matched::Any, !Matched::Any)
   @ Main c:\Users\RichardMullerGovernm\OneDrive\Programming\Karpathy\Karpathy\karpathy-makemore-mlp.ipynb:1
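
The MethodError above says that the closure was called with no arguments: when `gradient` is given a `Flux.params` collection, it expects a zero-argument function such as `() -> mloss(X,Y)`. An alternative is to pass the arrays explicitly, which is what the sketch below does. It is a self-contained illustration under assumptions rather than the notebook's own code: the function name `loss`, the random stand-in data, and the explicit 1×n biases are all hypothetical, and the loss avoids in-place array writes because Zygote cannot differentiate through them.

```julia
using Flux, Statistics

# Forward pass and loss with every parameter passed explicitly.
function loss(C, W1, b1, W2, b2, X, Y)
    Xemb = hcat(C[X[:, 1], :], C[X[:, 2], :], C[X[:, 3], :])  # N×6 embedded inputs
    h = tanh.(Xemb * W1 .+ b1)
    logits = h * W2 .+ b2
    logp = logits .- log.(sum(exp.(logits), dims = 2))        # row-wise log-softmax
    return -mean([logp[i, Y[i]] for i in eachindex(Y)])
end

# Stand-in parameters and data with the same shapes as in the notebook.
C = randn(27, 2); W1 = randn(6, 100); b1 = randn(1, 100)
W2 = randn(100, 27); b2 = randn(1, 27)
X = rand(1:27, 32, 3); Y = rand(1:27, 32)

# One gradient array is returned per differentiated argument.
gs = Flux.gradient((C, W1, b1, W2, b2) -> loss(C, W1, b1, W2, b2, X, Y),
                   C, W1, b1, W2, b2)
println(map(size, gs))
```

Each entry of `gs` has the same shape as the corresponding parameter, so a plain gradient-descent step could subtract a small multiple of it from that parameter.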
