# L11b: Implementing a Markov Word Generator
In this lab, we construct a program that generates random character sequences of defined length using a discrete Markov chain.

> __Learning Objectives:__
>
> After completing this activity, students will be able to:
> * **Build character-level transition matrices from vocabulary data:** We extract transition probabilities between consecutive characters by analyzing approximately 370K English words, creating a 26×26 transition matrix that captures the statistical patterns of character sequences in the English language.
> * **Implement stochastic word generation using Markov chains:** We apply the transition matrix to generate random words by starting with an initial character and iteratively sampling subsequent characters according to learned transition probabilities, producing character sequences of specified length.
> * **Validate generated sequences against empirical data:** We compare machine-generated words with the original vocabulary dataset to quantify how many generated sequences correspond to real English words, demonstrating the relationship between statistical modeling and linguistic structure.

Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [1]:
include(joinpath(@__DIR__, "Include-solution.jl")); # set paths and load the required packages and helper code

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
We included a JSON file containing approximately 370K English words [originally downloaded from here](https://github.com/dwyl/english-words) in [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). We'll use this data to train our Markov word generator.

We've implemented a helper function to read the words from the JSON file and return the data as a Julia dictionary. To load the vocabulary data, [call the `MyEnglishLanguageVocabularyModel(...)` helper function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyEnglishLanguageVocabularyModel). We'll save the vocabulary data in the `vocabulary_data_dictionary::Dict{Char, Set{String}}` variable.

In [2]:
vocabulary_data_dictionary = MyEnglishLanguageVocabularyModel();

What's in the `vocabulary_data_dictionary` variable? Let's look at words that start with the character `a`:

In [3]:
vocabulary_data_dictionary['a'] # Wow! we have > 25000 words that start with the character 'a'!

Set{String} with 25416 elements:
  "archetypes"
  "acrodont"
  "apogaic"
  "acousma"
  "acutangular"
  "alteza"
  "anhidrosis"
  "abracadabra"
  "agrostographies"
  "anthropologically"
  "assentation"
  "attache"
  "acanthad"
  "aristotelism"
  "alterity"
  "attargul"
  "arditi"
  "abducted"
  "amotus"
  ⋮ 

___

## Task 1: Compute the Character Transition Matrix
In this task, we'll compute the transition matrix for our Markov chain. The transition matrix defines the probabilities of moving from one character to another in our generated words.

> __Helper:__ We've implemented [the `vocabulary_transition_matrix(...)` helper function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/text/#VLDataScienceMachineLearningPackage.vocabulary_transition_matrix) to compute the transition matrix from the vocabulary data. The function takes the vocabulary data and a list of characters to consider (e.g., the lowercase letters `a` to `z`) and returns the transition matrix as a `Matrix{Float64}`. 

Let's compute the transition matrix and save it in the `P::Array{Float64, 2}` variable:

In [4]:
P, characters = let

    # build the character list and transition matrix -
    characters = collect('a':'z'); # list of characters to consider (lowercase a to z)
    P = vocabulary_transition_matrix(vocabulary_data_dictionary, characters);

    P, characters # return the transition matrix and the character list
end;

The transition matrix $\mathbf{P}$ is a $26\times 26$ matrix whose rows and columns correspond to the characters `a` to `z`, in the order stored in `characters`. Each entry $P_{ij}$ is the probability, estimated from the vocabulary data, of transitioning from the character at index $i$ to the character at index $j$.
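
To make the meaning of $P_{ij}$ concrete, here is a minimal sketch of one common way to estimate such a matrix: count how often character $j$ follows character $i$, then divide each row by its total, $P_{ij} = N_{ij}/\sum_{k}N_{ik}$. The function name `count_based_transition_matrix` and the toy data are hypothetical, and this is not necessarily how `vocabulary_transition_matrix(...)` is implemented.

```julia
# Hypothetical helper (an assumption, not the package's implementation):
# estimate a transition matrix from consecutive character pairs.
function count_based_transition_matrix(words, characters::Vector{Char})
    n = length(characters)
    index = Dict(c => i for (i, c) in enumerate(characters)) # character -> row/column index
    N = zeros(Float64, n, n) # N[i, j] counts how often character j follows character i

    for word in words
        chars = collect(word)
        for k in 1:(length(chars) - 1)
            i = get(index, chars[k], 0)
            j = get(index, chars[k + 1], 0)
            (i == 0 || j == 0) && continue # skip pairs with characters outside the list
            N[i, j] += 1
        end
    end

    # normalize each row so it sums to one; rows with no counts stay all zeros
    for i in 1:n
        total = sum(N[i, :])
        total > 0 && (N[i, :] ./= total)
    end
    return N
end

toy_words = ["banana", "band", "bad"]
toy_characters = ['a', 'b', 'd', 'n']
P_toy = count_based_transition_matrix(toy_words, toy_characters)
println(P_toy[1, 4]) # 'a' -> 'n': 3 of the 4 observed transitions out of 'a', so 0.75
```

In this toy vocabulary no character ever follows `d`, so its row stays at zero. A real implementation has to decide how to handle such rows, for example by smoothing the counts.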

Let's examine a few transition probabilities from the matrix $\mathbf{P}$:

In [5]:
let

    # initialize -
    start_char = 'b'; # starting character (change this to see other characters)
    next_char = 'r'; # next character (change this to see other characters)
    i = findfirst(x -> x == start_char, characters); # index of the start_char
    j = findfirst(x -> x == next_char, characters); # index of the next_char

    # print the transition probability -
    println("Transition probability P('$start_char' -> '$next_char') = $(P[i, j] |> x-> round(x, digits=4))");
end

Transition probability P('b' -> 'r') = 0.1395


### Discussion
What's the highest transition probability from character i to character j? You can find out by inspecting the matrix $\mathbf{P}$.
> __Todo:__ Implement code to analyze the transition matrix `P` and print out the most likely next character for each character along with the corresponding probability.

See anything interesting?

In [6]:
let
    # TODO: implement the discussion question code in this cell.
    # throw(ErrorException("Ooops! Implement the analysis of the P matrix here!"))
    
    for i ∈ eachindex(characters)
        char = characters[i]
        row = P[i, :]
        max_index = argmax(row);
        max_prob = row[max_index];
        most_likely_char = characters[max_index]
        println("From '$char' the most likely next character is '$most_likely_char' with probability $(round(max_prob, digits=4))")
    end
end

From 'a' the most likely next character is 'n' with probability 0.2014
From 'b' the most likely next character is 'a' with probability 0.2117
From 'c' the most likely next character is 'o' with probability 0.3415
From 'd' the most likely next character is 'e' with probability 0.3453
From 'e' the most likely next character is 'n' with probability 0.2068
From 'f' the most likely next character is 'o' with probability 0.2088
From 'g' the most likely next character is 'a' with probability 0.2059
From 'h' the most likely next character is 'e' with probability 0.2581
From 'i' the most likely next character is 'n' with probability 0.613
From 'j' the most likely next character is 'a' with probability 0.3154
From 'k' the most likely next character is 'i' with probability 0.2172
From 'l' the most likely next character is 'a' with probability 0.2887
From 'm' the most likely next character is 'a' with probability 0.2549
From 'n' the most likely next character is 'o' with probability 0.6472
From 'o

___

## Task 2: Generate Random Words One Character at a Time
In this task, we'll generate random words using the transition matrix we computed in Task 1. We'll start with an initial character and then use the transition probabilities to select subsequent characters until we reach the desired word length.
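
The sampling loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the package helper: the names `sample_index` and `generate_word` are hypothetical, and the categorical draw uses a simple cumulative-sum (inverse CDF) method.

```julia
using Random

# Hypothetical helper: draw an index from the categorical distribution given by
# the probability vector p, using the cumulative-sum (inverse CDF) method.
function sample_index(rng::AbstractRNG, p::AbstractVector{Float64})
    u = rand(rng) * sum(p) # scale by sum(p) to tolerate small rounding error
    cumulative = 0.0
    for (k, pk) in enumerate(p)
        cumulative += pk
        u < cumulative && return k
    end
    return length(p) # fallback for floating-point edge cases (or an all-zero row)
end

# Hypothetical helper: generate one word by walking the chain from startchar.
function generate_word(rng::AbstractRNG, P::Matrix{Float64}, characters::Vector{Char},
        startchar::Char, word_length::Int)
    i = findfirst(==(startchar), characters)
    i === nothing && throw(ArgumentError("start character is not in the character list"))
    buffer = Char[startchar]
    while length(buffer) < word_length
        i = sample_index(rng, P[i, :]) # the next character depends only on the current one
        push!(buffer, characters[i])
    end
    return String(buffer)
end

# toy two-state chain: 'a' always moves to 'b', and 'b' always moves to 'a'
P_flip = [0.0 1.0; 1.0 0.0]
word = generate_word(MersenneTwister(1234), P_flip, ['a', 'b'], 'a', 5)
println(word) # this chain is deterministic, so the word is "ababa"
```

Passing the random number generator explicitly makes runs reproducible, which helps when comparing results across different start characters or word lengths.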

> __Helper:__ We've implemented [the `sample_words(...)` helper function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/text/#VLDataScienceMachineLearningPackage.sample_words) to generate random character sequences by iteratively sampling from the transition matrix. The function takes the transition matrix `P::Array{Float64,2}`, a list of characters `characters::Vector{Char}`, and generates `number_of_samples::Int` words of length `length_of_sample_word::Int` starting from `startchar::Char`.

Let's generate 1,000 random six-character words starting with the letter `b` and store them in `generated_words_set::Set{String}`:

In [7]:
generated_words_set = let

    # initialize -
    number_of_samples = 1000;
    length_of_sample_word = 6;
    sample_word_start_char = 'b';

    # generate samples -
    S = sample_words(P, characters, number_of_samples = number_of_samples, 
        length_of_sample_word = length_of_sample_word, startchar = sample_word_start_char);

    # store the generated words in a Set to avoid duplicates -
    generated_words_set = Set{String}();
    for (i, word) ∈ S
        push!(generated_words_set, word);
    end

    generated_words_set; # return the generated words
end;

Wow! That's kind of fun. But, I'm curious: how many of the generated words are actually real English words? Let's check the generated words against our vocabulary data.

In [8]:
true_generated_words = let

    # initialize -
    S = generated_words_set; # unique generated words
    sample_word_start_char = 'b'; # must match the start character used to generate the words
    my_word_set = vocabulary_data_dictionary[sample_word_start_char];
    number_of_unique_words = length(S); # duplicates were removed when we built the Set
    true_generated_words = Set{String}();

    for word ∈ S
        if word ∈ my_word_set
            push!(true_generated_words, word);
        end
    end
    
    N₊ = length(true_generated_words); # number of generated words that are in the vocabulary model
    println("Percentage of unique generated words present in the vocabulary model: $((N₊/number_of_unique_words)*100)%");
    true_generated_words; # return the true words
end;

Percentage of unique generated words present in the vocabulary model: 0.9592326139088728%


__Interesting:__ The number of real words varies with the initial character and the random choices made during generation. In the run above, just under 1% of the unique generated words appear in the vocabulary, so most generated sequences are not real words. This fraction depends on many factors, including the length of the requested words and the initial character.

Let's look at some of the generated words that are __actual__ English words:

In [9]:
let

    # initialize -
    df = DataFrame();
    number_of_words_to_look_at = 10; # how many do we want to see (at most)?
    words_to_show = collect(true_generated_words); # copy the words so we don't mutate the set
    number_of_rows = min(number_of_words_to_look_at, length(words_to_show)); # we may have fewer real words than requested
   
    # populate the data frame -
    for i ∈ 1:number_of_rows
        row_df = (
            index = i,
            true_word = words_to_show[i]
        );
        push!(df, row_df);
    end    

    # make a table -
    pretty_table(
         df;
         backend = :text,
         table_format = TextTableFormat(borders = text_table_borders__compact)
    );

end

There are many interesting questions to consider:

> __Future Directions:__
> 1. We get a number of "real" words, but we have a potential problem: when we check for real words, we could have false negatives. For example, suppose we generate an actual word that is not in our vocabulary data. How do we estimate the probability of getting a false negative when we check for real words?
> 2. How could we improve our Markov word generator? For example, what if we considered pairs of characters (e.g., "th", "he", "in", etc.) instead of single characters? How would that change our transition matrix and the generated words?
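
As a starting point for the second question, here is a minimal sketch of a second-order model, in which the next character is conditioned on the previous two characters. The function name `second_order_counts` and the toy word list are hypothetical. With 26 letters there are up to $26^2 = 676$ two-character contexts, so this sketch stores only the contexts that actually occur in a dictionary instead of a dense matrix.

```julia
# Hypothetical sketch: count which character follows each two-character context.
function second_order_counts(words)
    counts = Dict{String, Dict{Char, Int}}()
    for word in words
        chars = collect(word)
        for k in 1:(length(chars) - 2)
            context = String(chars[k:(k + 1)]) # the previous two characters
            next_char = chars[k + 2]
            followers = get!(counts, context, Dict{Char, Int}())
            followers[next_char] = get(followers, next_char, 0) + 1
        end
    end
    return counts
end

toy_counts = second_order_counts(["the", "then", "they", "thin"])
println(toy_counts["th"]) # followers of "th": 'e' three times and 'i' once
```

Dividing each inner dictionary by its total would turn these counts into transition probabilities, in the same way a row of $\mathbf{P}$ is used by the single-character model.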

This is only a starting point. There are many ways to extend and improve this Markov word generator. Feel free to explore and experiment with different approaches!


___

## Summary
In this activity, we constructed a character-level Markov word generator by computing transition probabilities from a large English vocabulary dataset, generated random word sequences, and validated the linguistic plausibility of the output.

> __Key Takeaways:__
> 
> * **Statistical learning from corpus data:** We built a 26×26 character transition matrix by analyzing patterns in approximately 370K English words, demonstrating how Markov models can capture the statistical structure of language through the observed frequencies of consecutive character pairs.
> * **Stochastic sequence generation algorithm:** We implemented a forward sampling approach that starts with an initial character and iteratively selects subsequent characters by sampling from categorical distributions defined by the transition matrix, producing random yet statistically plausible character sequences.
> * **Empirical validation of generative models:** We verified that a measurable fraction of randomly generated sequences matched actual English words in the vocabulary, illustrating how simple probabilistic models can approximate complex linguistic patterns while also revealing limitations through the presence of non-word sequences.

This Markov word generator demonstrates fundamental principles of statistical language modeling, with natural extensions to higher-order models (conditioning on the previous two or three characters instead of one) and applications in text generation, spell correction, and computational linguistics.
___