# Karpathy Makemore using MLP

[Video](https://www.youtube.com/watch?v=TCH_1BHY58I&t=10s)

Paper: [A Neural Probabilistic Language Model](https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), Bengio et al., 2003.

The paper uses the three previous words to predict a fourth word. It uses a vocabulary of 17k words, each embedded as a vector in a 30-dimensional space.
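
To make those dimensions concrete, here is a minimal sketch of how three context words could become a single input vector. The sizes come from the description above; the table `C`, its random values, and the word indices are illustrative assumptions, not the paper's trained parameters.

```julia
# Sketch only: random numbers stand in for learned embeddings.
vocab_size  = 17_000   # vocabulary size described above
emb_dim     = 30       # embedding dimension described above
context_len = 3        # three previous words

C = randn(Float32, vocab_size, emb_dim)   # one row per word

context = [12, 4051, 77]                  # hypothetical word indices
E = C[context, :]                         # 3×30: one embedding per context word
x = vec(permutedims(E))                   # 90 values: the three embeddings laid end to end

size(E), length(x)
```

The network then maps this 90-element vector to a score for every word in the vocabulary. The character-level model in this notebook does the same thing with 27 characters and 2-dimensional embeddings.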


In [3]:
using Plots
using Flux
using Flux: softmax, crossentropy
using Statistics


In [115]:
# Read names.txt into words array:
s = read("names.txt", String)
words = split(s, "\n"; keepempty=false)  # drop empty entries, e.g. from a trailing newline
words[1:10]

10-element Vector{SubString{String}}:
 "emma"
 "olivia"
 "ava"
 "isabella"
 "sophia"
 "charlotte"
 "mia"
 "amelia"
 "harper"
 "evelyn"

In [116]:
# Build character <-> integer index lookup tables ('.' gets index 1).
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = Dict( s => i for (i,s) in enumerate(chars))
itos = Dict( i => s for (i,s) in enumerate(chars))


Dict{Int64, Char} with 27 entries:
  5  => 'd'
  16 => 'o'
  20 => 's'
  12 => 'k'
  24 => 'w'
  8  => 'g'
  17 => 'p'
  1  => '.'
  19 => 'r'
  22 => 'u'
  23 => 'v'
  6  => 'e'
  11 => 'j'
  9  => 'h'
  14 => 'm'
  3  => 'b'
  7  => 'f'
  25 => 'x'
  4  => 'c'
  ⋮  => ⋮
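
As a quick sanity check on the two lookup tables, the sketch below round-trips a name through them. The helper names `encode` and `decode` are hypothetical and are not used elsewhere in this notebook.

```julia
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = Dict(c => i for (i, c) in enumerate(chars))
itos = Dict(i => c for (i, c) in enumerate(chars))

# Hypothetical helpers for illustration only.
encode(w) = [stoi[c] for c in w]            # string -> vector of indices
decode(ixs) = join(itos[i] for i in ixs)    # vector of indices -> string

ixs = encode("emma.")
decode(ixs)
```

Because Julia indexes from 1, the `.` boundary token gets index 1 here rather than the 0 it has in the Python version.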

We're going to follow what Karpathy does pretty closely. The section of the FluxML docs called [Building Simple Models](https://fluxml.ai/Flux.jl/stable/models/basics/#Building-Simple-Models) gives a good foundation for doing this.

In [117]:
# Compile dataset for neural net:
block_size = 3 # context length: how many chars do we use to predict the next one?
X0,Y = Vector{Int64}[],Int64[]

for w in words[1:5]
    println(w)
    context = ones(Int64,block_size)  # start from "...": index 1 is '.'
    for ch in string(w,".")
        ix = stoi[ch]
        push!(X0,context)
        push!(Y,ix)
        println(join(itos[i] for i in context)," ---> ", itos[ix])
        context = vcat(context[2:end],[ix])
    end
end

# Repack X0 into a matrix with one context per row
X = permutedims(reduce(hcat, X0))


emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .
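
The cell above can also be wrapped in a function so the same code builds the dataset for any list of words. This is a sketch under the same conventions as the cell (index 1 is `.`, contexts stored as rows); the name `build_dataset` is hypothetical.

```julia
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = Dict(c => i for (i, c) in enumerate(chars))

# Hypothetical helper: returns contexts as an N×block_size matrix and targets as a vector.
function build_dataset(words; block_size=3)
    contexts = Vector{Int}[]
    targets = Int[]
    for w in words
        context = ones(Int, block_size)          # index 1 is '.', the boundary token
        for ch in string(w, ".")
            ix = stoi[ch]
            push!(contexts, context)
            push!(targets, ix)
            context = vcat(context[2:end], ix)   # slide the window; vcat allocates a new vector
        end
    end
    X = permutedims(reduce(hcat, contexts))      # stack the contexts as rows
    return X, targets
end

X, Y = build_dataset(["emma", "olivia", "ava", "isabella", "sophia"])
size(X), length(Y)
```

Each name of length n contributes n + 1 examples, since the closing `.` is also a prediction target.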


In [146]:
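C = randn(27,2)     # Embedding lookup table C: one 2-dimensional vector per character.
W1 = randn(6,100)   # Hidden layer weights: 3 context chars × 2 dims = 6 inputs, 100 hidden units.
b1 = randn(100)'    # Hidden layer bias as a 1×100 row, so it broadcasts across examples.
W2 = randn(100,27)  # Output layer weights: 100 hidden units to 27 logits.
b2 = randn(27)';    # Output layer bias as a 1×27 row.
ps = Flux.params(C,W1,b1,W2,b2)  # Named `ps` so it does not shadow the `params` function.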

Params([[0.9578438463869455 -0.6885881650659414; 0.6479911343557873 1.2154031360336839; … ; 0.15620854807387627 1.3634141141383007; -2.0033326396167848 -0.8645000688244046], [1.1624007649329662 1.0857321145267356 … 1.2423224027996609 0.2275123120134276; 0.9987395121028877 -0.29990241534240114 … -0.7965133432106432 -1.1555523281077495; … ; -0.008423980061494148 -0.17470301069598912 … -0.6258385141768206 -0.8954958260135302; 0.7844903062316255 1.2102229979696786 … 0.16375894936284707 0.8112394146825883], [-0.9552729054810718, -0.4686579466809444, -0.5464884332517509, -1.4177782771409804, -0.26452834019206234, -0.9942536934086076, -1.354161277486608, -1.2915227910964358, -0.8879068631189281, 1.1799475693781845  …  0.1847680116402907, 1.1056204013259556, -0.7977426606284899, -2.241529606330645, 0.5484911108269979, -0.5424013053442722, -1.2966978016119237, -0.09884394488431082, 1.2249004098173857, -0.014781157745328132], [-1.5827897294991646 0.3557698404388844 … 0.8135650189202775 -1.549295
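
As a shape check on these parameters, the sketch below pushes a batch through the same layer sizes and counts the parameters. The batch `X` is random placeholder data, and the biases are written as 1×N matrices instead of adjoint vectors, which is an assumption made for readability.

```julia
n_chars, emb_dim, block_size, n_hidden = 27, 2, 3, 100

C  = randn(n_chars, emb_dim)
W1 = randn(block_size * emb_dim, n_hidden)
b1 = randn(1, n_hidden)
W2 = randn(n_hidden, n_chars)
b2 = randn(1, n_chars)

X = rand(1:n_chars, 32, block_size)   # hypothetical batch of 32 contexts

Xemb   = hcat((C[X[:, k], :] for k in 1:block_size)...)  # 32×6: embeddings side by side
h      = tanh.(Xemb * W1 .+ b1)                           # 32×100
logits = h * W2 .+ b2                                     # 32×27

n_params = sum(length, (C, W1, b1, W2, b2))
```

With these sizes the count is 54 + 600 + 100 + 2700 + 27 parameters in total.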

In [163]:
# Forward pass
function predict(X)
    Xemb = hcat(C[X[:,1],:],C[X[:,2],:],C[X[:,3],:]) # Build the embedded input matrix:
    h = tanh.(Xemb*W1 .+ b1)
    return h*W2 .+ b2
end

function mloss(X,Y)
    logits = predict(X)
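    # Row-wise softmax. Flux's softmax normalizes along dims=1 (down each column) by default,
    # so it only matches this when called as softmax(logits; dims=2):
    counts = exp.(logits)
    prob = zeros(Float64,size(counts))
    for i in 1:size(prob)[1]
        prob[i,:] = counts[i,:]/sum(counts[i,:])  # in-place write: Zygote cannot differentiate this
    end
    # prob[torch.arange(32), Y]  # This is what AK does in Python

    # Cross-entropy by hand. Flux's crossentropy expects targets shaped like the predictions
    # (e.g. one-hot) with classes along dims=1, so it cannot take prob and the integer labels Y directly:
    results = [prob[i,Y[i]] for i in eachindex(Y)]  # Here's what does that operation in Julia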
    return -mean(log.(results))
end


mloss (generic function with 1 method)
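
The mismatch with Flux's `softmax` noted in the comments is most likely a question of which dimension gets normalized. The sketch below compares the row-by-row loop with `softmax` called with and without `dims=2`; the `logits` matrix is random placeholder data, not output from the model.

```julia
using Flux: softmax

logits = randn(4, 27)   # hypothetical: 4 examples (rows) × 27 classes (columns)

# Row-by-row normalization, as in the loop in mloss
counts = exp.(logits)
prob_loop = similar(counts)
for i in axes(counts, 1)
    prob_loop[i, :] = counts[i, :] / sum(counts[i, :])
end

prob_cols = softmax(logits)           # default dims=1: each column sums to 1
prob_rows = softmax(logits; dims=2)   # each row sums to 1, like the loop

prob_rows ≈ prob_loop
```

Flux's convention is to store one example per column, which is why its default differs from the row-per-example layout used here.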

In [164]:
gs = gradient(Flux.params(C,W1,b1,W2,b2)) do
    mloss(X,Y)
end

ErrorException: Mutating arrays is not supported -- called setindex!(Matrix{Float64}, ...)
This error occurs when you ask Zygote to differentiate operations that change
the elements of arrays in place (e.g. setting values with x .= ...)

Possible fixes:
- avoid mutating operations (preferred)
- or read the documentation and solutions for this error
  https://fluxml.ai/Zygote.jl/latest/limitations
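
The error points at the loop in `mloss` that fills `prob` one row at a time: every `prob[i,:] = ...` is a `setindex!` call, which Zygote refuses to differentiate. One possible fix, sketched below, is to let Flux compute the softmax and cross-entropy together with `logitcrossentropy`, which needs no in-place writes. Several things here are assumptions rather than the notebook's own code: `X` and `Y` are random stand-ins with the same shapes, the biases are plain 1×N matrices, and the parameters are passed to `gradient` explicitly instead of through `Flux.params`.

```julia
using Flux
using Flux: onehotbatch, logitcrossentropy

# Random stand-ins with the same shapes as the notebook's data and parameters.
X = rand(1:27, 32, 3)
Y = rand(1:27, 32)
C  = randn(27, 2)
W1 = randn(6, 100);  b1 = randn(1, 100)
W2 = randn(100, 27); b2 = randn(1, 27)

function predict(C, W1, b1, W2, b2, X)
    Xemb = hcat(C[X[:,1],:], C[X[:,2],:], C[X[:,3],:])  # N×6 embedded input
    h = tanh.(Xemb*W1 .+ b1)
    return h*W2 .+ b2                                   # N×27 logits
end

function mloss(C, W1, b1, W2, b2, X, Y)
    logits = predict(C, W1, b1, W2, b2, X)
    targets = onehotbatch(Y, 1:27)                      # 27×N one-hot targets
    # Flux expects classes along dims=1, so transpose the N×27 logits to 27×N.
    return logitcrossentropy(permutedims(logits), targets)
end

loss = mloss(C, W1, b1, W2, b2, X, Y)

# One gradient array per parameter, in the order the parameters are passed.
grads = gradient((C, W1, b1, W2, b2) -> mloss(C, W1, b1, W2, b2, X, Y),
                 C, W1, b1, W2, b2)

# Forward-only check against the hand-written formula, without in-place writes.
logits = predict(C, W1, b1, W2, b2, X)
prob = exp.(logits) ./ sum(exp.(logits); dims=2)        # row-wise softmax by broadcasting
manual = -sum(log(prob[i, Y[i]]) for i in eachindex(Y)) / length(Y)
loss ≈ manual
```

If you would rather keep the hand-written loss, replacing the loop with the broadcast `counts ./ sum(counts; dims=2)` removes the `setindex!` calls that trigger the error.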
