# Using regular expressions to help with Wordle

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/rwcitek/MyBinder.demo/main?labpath=%2FRegular.Expressions%2Fwordle.bash.v03.ipynb)

[This notebook on GitHub](https://github.com/rwcitek/MyBinder.demo/blob/main/Regular.Expressions/wordle.bash.v03.ipynb)

## Background

Wordle is a word-guessing game in which the program picks a five-letter word from its list and you have six tries to guess it.  
That sounds like a pretty big search space for guessing.  I wonder just how big that search space is and if it can be made smaller with the hints that Wordle gives you about previous guesses.

## Task 1: How big is the search space?

### Background on shell commands and notation

- `ls` : lists the files in a folder
- `cat` : displays the contents of a file
- `wc` : counts the number of lines, words, and characters in a file
- `sort` : sorts input lines, lexicographically by default ( use `-n` for numeric order )
- `uniq` : collapses adjacent duplicate lines
- `head` : displays only the first few lines in a file
- `|` : sends ( pipes ) the output of one command into the next command
- `>` : sends ( redirects ) the output to a file
- `grep` : filters lines based on a regular expression
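
As a minimal sketch of how these commands combine, the pipeline below counts duplicates in a small made-up word list. The `sample_words` function and its contents are hypothetical and only stand in for a real file.

```bash
# Hypothetical sample input: one word per line, with duplicates.
sample_words() {
  printf '%s\n' pear apple pear fig apple pear
}

# Sort so duplicates are adjacent, count each distinct word,
# order by count ( largest first ), and keep the top two lines.
sample_words |
sort |
uniq -c |
sort -rn |
head -2
```

The `sort` comes before `uniq` because `uniq` only collapses duplicates that sit on adjacent lines.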


How big is our list of five-letter words?

In [14]:
wc -l words.5

17375 words.5


About 17k.  That's a pretty big search space.

In [15]:
cp words.5 /tmp/

An example of piping ...

In [16]:
cat /tmp/words.5 |
head -2

aahed
aalii
cat: write error: Broken pipe


## Task 2: Evaluate the guess, i.e. play Wordle

There are a number of Wordle sites, including the [New York Times](https://www.nytimes.com/games/wordle/index.html).  I'm using the one from wordleplay.com because I can point to a specific game, for example, this one:

https://wordleplay.com?challenge=YXVnZXI=


### The Hints

The goal of Wordle is to guess a five-letter word that the system has picked from a pool of possible five-letter words.  Once you make a guess using a valid word ( a non-word guess is not allowed ), Wordle gives one of three hints for each letter of the guess:

1. ( grey ) a mismatch, i.e. the letter is not anywhere in the word
1. ( green ) a match of both the letter and its position in the word
1. ( yellow ) a match of letter but not position

This last hint is actually two hints in one: the letter is somewhere in the word, and it is not at the guessed position.


### The First Guess: arose

If the first word guess is "arose", then the results are as follows:

position | letter | hint ( color )
--- | --- | ---
1 | A | 2 ( green )
2 | R | 3 ( yellow )
3 | O | 1 ( grey )
4 | S | 1 ( grey )
5 | E | 3 ( yellow )

This means that the letters O and S are not in the word ( hint 1 ).  And that the letter A is in the word and at the right position ( hint 2 ). And the letters R and E are in the word, but in the wrong position ( hint 3 ).


### The rules

We can turn those three hints into four selection/exclusion rules and rearrange the order a bit:

1. ( hint 3a ) select for letters that exist somewhere
1. ( hint 1 ) exclude letters that do not exist anywhere
1. ( hint 2 ) select for letters that exist at a position
1. ( hint 3b ) exclude letters that do not exist at a position

### Regular Expression for Rule 1

( hint 3a ) select for letters that exist somewhere

"The letters R and E are in the word, ..."


In [30]:
cat /tmp/words.5 |
grep r |
grep e |
cat > /tmp/words.5.g1.r1


In this case we pass to grep the literals r and e as filters.

In [31]:
wc -l /tmp/words.5.g1.r1
shuf -n 3 /tmp/words.5.g1.r1


2389 /tmp/words.5.g1.r1
redos
nicer
infer


Notice that every word has both an R and an E.
Also, we have reduced the solution space from about 17k down to about 2.4k with just one rule.
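
Chaining two `grep` filters acts as a logical AND, regardless of the order in which the letters appear. A minimal sketch on a made-up word list ( the helper name `has_r_and_e` and the sample words are hypothetical ):

```bash
# Hypothetical helper: keep only lines that contain both r and e, in any order.
has_r_and_e() {
  grep r |
  grep e
}

# Hypothetical sample words, one per line.
printf '%s\n' auger arose tiger eagle robin |
has_r_and_e
```

Here "eagle" is dropped by the first filter ( no r ) and "robin" by the second ( no e ).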

### Regular Expression for Rule 2

( hint 1 ) exclude letters that do not exist anywhere

"... the letters O and S are not in the word ..."

In [32]:
cat /tmp/words.5.g1.r1 |
grep -v o |
grep -v s |
cat > /tmp/words.5.g1.r2


The '-v' option to grep means exclude, i.e. print only the lines that do not match.  So, here we are excluding everything that has the letter O and then excluding everything that has the letter S.
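
A minimal sketch of the same exclusion on a made-up word list. The second form is an alternative, not what the cell above uses: a bracket expression `[os]` matches either letter, so one `grep -v` does both exclusions. The function names and sample words are hypothetical.

```bash
# Hypothetical sample words, one per line.
sample_words() {
  printf '%s\n' auger arose tiger bored aimer
}

# Two chained exclusions, as in the cell above.
exclude_chained() {
  grep -v o |
  grep -v s
}

# One exclusion using a bracket expression that matches o or s.
exclude_bracket() {
  grep -v '[os]'
}

sample_words | exclude_chained
echo
sample_words | exclude_bracket
```

Both forms drop "arose" and "bored" and keep the other three words.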

In [44]:
wc -l /tmp/words.5.g1.r2
shuf -n 3 /tmp/words.5.g1.r2


1410 /tmp/words.5.g1.r2
birde
ramet
cerin


### Regular Expression for Rule 3

( hint 2 ) select for letters that exist at a position

"And that the letter A is in the word and at the right position ..."

In [49]:
cat /tmp/words.5.g1.r2 |
grep ^a |
cat > /tmp/words.5.g1.r3


The caret ( '^' ) is a symbol that matches the beginning of a line.  Since the list has one word per line, it also matches the beginning of a word.  It is known as an anchor because it anchors the regular expression to a fixed location, here the beginning of the line.  So this regular expression matches all words that begin with the letter 'a'.
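
To see what the anchor changes, here is a small sketch that compares an unanchored and an anchored filter on made-up words ( the function names and sample words are hypothetical ). The pattern is quoted so the shell passes it to grep untouched.

```bash
# Hypothetical sample words, one per line.
sample_words() {
  printf '%s\n' auger tiger aimer eager
}

# Unanchored: any word containing an a.
contains_a() { grep 'a'; }

# Anchored: only words that begin with an a.
starts_with_a() { grep '^a'; }

sample_words | contains_a
echo
sample_words | starts_with_a
```

The unanchored filter also keeps "eager"; the anchored one does not.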

In [50]:
wc -l /tmp/words.5.g1.r3
shuf -n 3 /tmp/words.5.g1.r3


114 /tmp/words.5.g1.r3
arere
aegir
arite


### Regular Expression for Rule 4

( hint 3b ) exclude letters that do not exist at a position

"... the letters R and E are ... in the wrong position ..."


In [51]:
cat /tmp/words.5.g1.r3 |
grep -v ^.r |
grep -v e$ |
cat > /tmp/words.5.g1.r4


We have seen the caret (^) and the exclusion option (-v) before.  Here we introduce the dot (.), which is a symbol that represents ANY character.  So, we first exclude any line that begins (^) with ANY character (.) followed by an r.  That is, we exclude any word with an r in the second position.

We then introduce the dollar sign ($), which is another anchor symbol that represents the end of the line.  Again, since each word is on its own line, we are using it as a proxy for the end of the word.  So, we are excluding any line that ends with an e.  That is, we exclude any word with an e in the last position.
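
A minimal sketch of these two positional exclusions on a made-up word list ( the function name `rule4_guess1` and the sample words are hypothetical ):

```bash
# Hypothetical helper: drop words with r in position 2 or e in position 5.
rule4_guess1() {
  grep -v '^.r' |
  grep -v 'e$'
}

# Hypothetical sample words, one per line.
printf '%s\n' auger arced amber anele aimer |
rule4_guess1
```

"arced" is dropped for its r in the second position and "anele" for its final e. Each additional dot after the caret moves the tested position one place to the right, so `^..e` would match an e in the third position.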


In [52]:
wc -l /tmp/words.5.g1.r4
shuf -n 3 /tmp/words.5.g1.r4


51 /tmp/words.5.g1.r4
acred
awber
alder


With just one guess and applying the four rules, we have gone from a search space of about 17k down to 51 words with the help of regular expressions.  Now, the question is, what should be our next guess?
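
For reference, the four filters for the guess "arose" can be collected into one pipeline. This is only a sketch: the function name `filter_guess1` and the sample words are hypothetical, and the function reads words from standard input instead of the intermediate files used above.

```bash
# Hypothetical helper: apply all four rules for the guess "arose".
filter_guess1() {
  grep r | grep e |              # rule 1: R and E exist somewhere
  grep -v o | grep -v s |        # rule 2: O and S do not exist anywhere
  grep '^a' |                    # rule 3: A is at position 1
  grep -v '^.r' | grep -v 'e$'   # rule 4: R not at position 2, E not at position 5
}

# Hypothetical sample words, one per line.
printf '%s\n' auger arose amber tiger arced aside anele |
filter_guess1
```

On the real list, `filter_guess1 < /tmp/words.5` should give the same words as `/tmp/words.5.g1.r4`, since it chains the same commands.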

## Task 3: Guess the next word

We are going to do a quick analysis of the letters to choose our next word:
- generate a frequency table of letters
- find a word that has the most common letters.

In [38]:
cat /tmp/words.5.g1.r4 |
grep -o . |
sort |
uniq -c |
sort -rn |
head


     55 a
     53 r
     52 e
     12 t
     11 i
     10 d
      8 b
      7 n
      6 m
      6 l


The '-o' option to grep means to print only the part that matches the regular expression.  Since the regular expression is a single dot (.), it matches a single character.  The result is that each character is printed on its own line.

In [39]:
# For example
head -2 /tmp/words.5.g1.r4
echo
head -2 /tmp/words.5.g1.r4 |
grep -o .

abear
aberr

a
b
e
a
r
a
b
e
r
r


The first sort groups identical letters together, and `uniq -c` collapses each group of repeated lines into one line prefixed with a count.  The second sort orders the lines numerically in reverse order.  And head displays the first 10 lines.

As expected, A, R, and E are the top hits, since every remaining word contains all three.  So we will filter for words that also contain T together with one of the next most common letters.


In [56]:
# none of these give results that were considered words
cat /tmp/words.5.g1.r4 |
grep t |
grep i 

cat /tmp/words.5.g1.r4 |
grep t |
grep d 

cat /tmp/words.5.g1.r4 |
grep t |
grep b 

cat /tmp/words.5.g1.r4 |
grep t |
grep m

cat /tmp/words.5.g1.r4 |
grep t |
grep n


adret
abret
atren


In [57]:
cat /tmp/words.5.g1.r4 |
grep t |
grep l


alert
alter


Using 'alert' as our second guess, we get these results:

position | letter | hint ( color )
--- | --- | ---
1 | A | 2 ( green )
2 | L | 1 ( grey )
3 | E | 3 ( yellow )
4 | R | 3 ( yellow )
5 | T | 1 ( grey )


So ...
1. E and R exist
2. L and T do not exist
3. A exists at the correct position
4. E and R are in the wrong position


And so we repeat the process.
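
As a sketch of that second round: rules 1 and 3 repeat what the first guess already told us, so only the exclusions are new. The function name `filter_guess2` is hypothetical, and the sample input is a handful of words taken from the outputs shown earlier instead of the full 51-word file.

```bash
# Hypothetical helper: apply the new rules from the guess "alert".
filter_guess2() {
  grep -v l | grep -v t |   # rule 2: L and T do not exist anywhere
  grep -v '^..e' |          # rule 4: E is not at position 3
  grep -v '^...r'           # rule 4: R is not at position 4
}

# A few words that survived the first guess, one per line.
printf '%s\n' acred awber alder abear aberr alert alter |
filter_guess2
```

On the real list the input would be `/tmp/words.5.g1.r4`.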

## Generating the 5-letter list

In [1]:
ls -la /usr/share/dict/words

lrwxrwxrwx 1 root root 30 Dec 13 03:43 /usr/share/dict/words -> /etc/dictionaries-common/words


In [2]:
readlink -f /usr/share/dict/words

/usr/share/dict/american-english-insane


The "insanely long list of american words" is installed.

In [4]:
cat /usr/share/dict/words |
grep -E '^[a-z]{5}$' |
cat > /tmp/words.5
wc -l /tmp/words.5

17375 /tmp/words.5


The [a-z] notation is for a character class.  That is, instead of specifying a, b, c, d, e, etc., we can specify an entire list of characters or a range, as in this case.  There are even short-hand character classes, such as [:alpha:], [:alnum:], and [:digit:], which are written inside the brackets, e.g. [[:alpha:]].
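
A minimal sketch of the short-hand classes on made-up input ( the function name `sample_lines` and its contents are hypothetical ). Note the doubled brackets: the class name goes inside a bracket expression.

```bash
# Hypothetical sample lines.
sample_lines() {
  printf '%s\n' auger Auger abc12
}

# Lines that contain at least one digit.
sample_lines | grep '[[:digit:]]'

# Lines that contain at least one uppercase letter.
sample_lines | grep '[[:upper:]]'
```

The first filter keeps only "abc12" and the second only "Auger".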

The '-E' option to grep means to use the extended regular expression engine.  This allows the use of unescaped curly braces to express the number of repetitions, in this case, 5 ( without '-E' the braces must be escaped, as in \{5\} ). You can also express repetitions like so:
- {5,}  : 5 or more
- {,5}  : 5 or fewer ( a GNU extension; {0,5} is the portable form )
- {2,5} : between 2 and 5
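
A minimal sketch of two of these repetition forms on made-up strings of different lengths ( the function name `sample_strings` and its contents are hypothetical ):

```bash
# Hypothetical sample strings of length 2, 4, 5, and 7.
sample_strings() {
  printf '%s\n' ab abcd abcde abcdefg
}

# Five or more lowercase letters.
sample_strings | grep -E '^[a-z]{5,}$'
echo
# Between two and five lowercase letters.
sample_strings | grep -E '^[a-z]{2,5}$'
```

The anchors matter here: without them, `[a-z]{2,5}` would also match inside "abcdefg", because some stretch of two to five letters exists within it.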


Instead of using the caret (^) and dollar sign ($) anchors, you can use the '-x' option to match the entire line.

In [58]:
cat /usr/share/dict/words |
grep -E -x '[a-z]{5}' |
wc -l

17375


## References

- [![Mastering Regular Expressions, 3rd Edition](https://learning.oreilly.com/covers/urn:orm:book:0596528124/400w/)](https://learning.oreilly.com/library/view/mastering-regular-expressions/0596528124/)