# Using regular expressions to help with Wordle

## Background

Wordle is a word-guessing game in which the program picks a five-letter word from its list and you have six tries to guess it.  
That sounds like a pretty big search space for guessing.  I wonder just how big that search space is, and whether it can be made smaller using the hints that Wordle gives you about previous guesses.

## Task 1: How big is the search space?

### Background on shell commands and notation

- `ls` : lists the files in a folder
- `cat` : displays the contents of a file
- `wc` : counts the number of lines, words, and characters in a file
- `sort` : sorts input lines lexicographically ( alphabetically by default, numerically with `-n` )
- `uniq` : collapses adjacent duplicate lines ( `-c` prefixes each line with a count )
- `head` : displays only the first few lines in a file
- `|` : sends ( pipes ) the output of one command into the next command
- `>` : sends ( redirects ) the output to a file
- `grep` : filters lines based on a regular expression
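
As a quick illustration of how these commands combine, here is a small sketch.  The files `/tmp/demo.txt` and `/tmp/demo.uniq.txt` and their contents are made up for this example; they are not part of the word list used below.

```shell
# Build a tiny throwaway file (hypothetical name and contents).
printf 'pear\napple\npear\nfig\napple\npear\n' > /tmp/demo.txt

# Count the lines in it (there are 6).
wc -l < /tmp/demo.txt

# Sort so that duplicates sit next to each other, collapse them,
# and redirect the result into a second file.
sort /tmp/demo.txt | uniq > /tmp/demo.uniq.txt

# Show only the first two lines of the de-duplicated list.
head -2 /tmp/demo.uniq.txt
```

Note that `uniq` only collapses duplicates that are adjacent, which is why `sort` comes first.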


How big is our list of five-letter words?

In [None]:
wc -l words.5

About 17k.  That's a pretty big search space.

In [None]:
cp words.5 /tmp/

An example of piping ...

In [None]:
cat /tmp/words.5 2> /dev/null |
head -2

aahed
aalii


## Task 2: What's a good initial guess to reduce the search space?

Let's create a frequency list of letters.

In [1]:
cat /tmp/words.5 |
grep -o .  |
sort |
uniq -c |
sort -rn |
head

   8811 a
   8633 e
   7764 s
   5782 o
   5526 r
   5273 i
   4557 l
   4534 t
   4295 n
   3606 u


Let's see if there is a five-letter word that contains the top five letters in that frequency list.

In [2]:
# can also be done using sed or awk
# sed '/a/!d;/e/!d;/s/!d;/o/!d;/r/!d' /tmp/words.5
# awk '/a/ && /e/ && /s/ && /o/ && /r/' /tmp/words.5
cat /tmp/words.5 |
grep a |
grep e |
grep s |
grep o |
grep r


arose
seora
soare


Woohoo! That worked.
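
The chain of greps above grows by one line per letter.  As a sketch of one way to avoid retyping it, the function below applies one `grep` per argument.  The name `contains_all` and the sample file are hypothetical, not part of the original notebook, and the sample is a handful of words rather than the full list.

```shell
# contains_all: keep only the stdin lines that contain every letter
# given as an argument.  Hypothetical helper for illustration.
contains_all() {
  if [ "$#" -eq 0 ]; then
    cat                         # no letters left: pass the lines through
  else
    letter=$1
    shift
    grep "$letter" | contains_all "$@"   # filter on one letter, recurse on the rest
  fi
}

# Small made-up sample instead of the full word list.
printf 'arose\nsoare\nstare\ncrane\n' > /tmp/demo.top5.in

contains_all a e s o r < /tmp/demo.top5.in > /tmp/demo.top5.out
cat /tmp/demo.top5.out
```

On this sample only `arose` and `soare` survive, since `stare` has no O and `crane` has neither S nor O.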

## Task 3: Evaluate the guess, i.e. play Wordle

There are a number of Wordle sites, including the [New York Times](https://www.nytimes.com/games/wordle/index.html).  I'm using the one from wordleplay.com because I can point to a specific game, for example, this one:

https://wordleplay.com?challenge=Y3JhdmU=

( or [make your own game](https://wordleplay.com/wordle-generator))


### The Hints

The goal of Wordle is to guess a five-letter word that the system has picked from a pool of possible five-letter words.  Once you make an initial guess using a valid word ( a non-word guess is not allowed ), Wordle gives one of three hints for each letter in the guess:

1. ( grey ) a mismatch, i.e. the letter is not anywhere in the word
1. ( green ) a match of both the letter and its position in the word
1. ( yellow ) a match of letter but not position

This last hint is actually two hints in one.


### The Guess: arose

If the first word guess is "arose", then the results are as follows:

position | letter | hint ( color )
--- | --- | ---
1 | A | 2 ( green )
2 | R | 3 ( yellow )
3 | O | 1 ( grey )
4 | S | 1 ( grey )
5 | E | 3 ( yellow )

This means the letters O and S are not in the word ( hint 1 ).
And that the letter A is in the word and at the right position ( hint 2 ).
Finally, the letters R and E are in the word, but in the wrong position ( hint 3 ).



### The rules

We can turn those three hints into four selection/exclusion rules and rearrange the order a bit:

1. ( hint 3a ) select for letters that exist somewhere
1. ( hint 1 ) exclude letters that do not exist anywhere
1. ( hint 2 ) select for letters that exist at a position
1. ( hint 3b ) exclude letters that do not exist at a position

### Regular Expression for Rule 1

( hint 3a ) select for letters that exist somewhere

"... the letters R and E are in the word, ..."


In [5]:
cat /tmp/words.5 |
grep r |
grep e |
cat > /tmp/words.5.g1.r1


In this case we pass to grep the literals r and e as filters.

In [11]:
wc -l /tmp/words.5.g1.r1
shuf -n 3 /tmp/words.5.g1.r1


2389 /tmp/words.5.g1.r1
murex
harem
arake


Notice that every word has both an R and an E.
Also, we have reduced the solution space from 17k down to about 2.4k with just one rule.

### Regular Expression for Rule 2

( hint 1 ) exclude letters that do not exist anywhere

"... the letters O and S are not in the word ..."

In [9]:
cat /tmp/words.5.g1.r1 |
grep -v o |
grep -v s |
cat > /tmp/words.5.g1.r2


The '-v' option to grep means exclude: it prints only the lines that do not match.  So, here we are excluding everything that has the letter O and then excluding everything that has the letter S.

In [12]:
wc -l /tmp/words.5.g1.r2
shuf -n 3 /tmp/words.5.g1.r2


1410 /tmp/words.5.g1.r2
crena
cider
repad


### Regular Expression for Rule 3

( hint 2 ) select for letters that exist at a position

"And that the letter A is in the word and at the right position ..."

In [20]:
cat /tmp/words.5.g1.r2 |
grep ^a |
cat > /tmp/words.5.g1.r3


The caret ( '^' ) is a symbol that matches the beginning of a line.  And since the list is one word per line, it matches the beginning of a word.  As such it is known as an anchor, because it anchors the regular expression to a fixed location: the beginning of the line.  So this regular expression matches all words that begin with the letter 'a'.  

In [22]:
wc -l /tmp/words.5.g1.r3
shuf -n 3 /tmp/words.5.g1.r3


### Regular Expression for Rule 4

( hint 3b ) exclude letters that do not exist at a position

"... the letters R and E are ... in the wrong position ..."


In [20]:
cat /tmp/words.5.g1.r3 |
grep -v ^.r |
grep -v e$ |
cat > /tmp/words.5.g1.r4


We have seen the caret (^) and the exclusion option (-v) before.  Here we introduce the dot (.), which is a symbol that represents ANY character.  So, we first exclude any line that begins (^) with ANY character (.) followed by an r.  That is, we exclude any word with an r in the second position.

We then introduce the dollar sign ($), which is another anchor symbol that represents the end of the line.  Again, since each word is on its own line, we are using it as a proxy for the end of the word.  So, we are excluding any line that ends with an e.  That is, we exclude any word with an e in the last position.
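
To see the anchors and the dot in isolation, here is a small sketch on a made-up sample.  The file names and the seven words are assumptions for illustration only.  The patterns are quoted as a precaution so that the shell passes them to grep untouched.

```shell
# Hypothetical sample file, one word per line.
printf 'arose\ncrane\nameer\naired\nurban\nreact\nthere\n' > /tmp/demo.anchors

# Second letter is r: the caret pins the match to the start of the line
# and the dot stands in for whatever the first letter is.
grep '^.r' /tmp/demo.anchors > /tmp/demo.anchors.second_r

# Last letter is e: the dollar sign pins the match to the end of the line.
grep 'e$' /tmp/demo.anchors > /tmp/demo.anchors.last_e

# Adding -v inverts both tests, which is the form Rule 4 uses.
grep -v '^.r' /tmp/demo.anchors |
grep -v 'e$' > /tmp/demo.anchors.kept

cat /tmp/demo.anchors.kept
```

Here the first grep keeps `arose`, `crane`, and `urban`; the second keeps `arose`, `crane`, and `there`; and the inverted pair leaves `ameer`, `aired`, and `react`.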


In [22]:
wc -l /tmp/words.5.g1.r4
shuf -n 3 /tmp/words.5.g1.r4


51 /tmp/words.5.g1.r4
ameer
aired
acher


With just one guess and applying the four rules, we have gone from a search space of about 17k down to 51 words with the help of regular expressions.  Now, the question is, what should be our next guess?
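
For reference, the four rules can also be chained into a single pipeline.  The sketch below assumes the hints used in the rule sections above ( R and E exist, O and S do not, A is in the first position, R is not second and E is not last ) and runs on a small made-up sample rather than the full list; the file names are hypothetical.

```shell
# Hypothetical sample: a few words that pass every rule and a few that
# each fail exactly one of them.
printf '%s\n' ameer aired acher arose crave alert baker \
              arena adore saner agree ahead altar > /tmp/demo.g1.in

grep r /tmp/demo.g1.in |   # rule 1: R exists somewhere
grep e |                   # rule 1: E exists somewhere
grep -v o |                # rule 2: no O anywhere
grep -v s |                # rule 2: no S anywhere
grep '^a' |                # rule 3: A in the first position
grep -v '^.r' |            # rule 4: R is not in the second position
grep -v 'e$' > /tmp/demo.g1.out   # rule 4: E is not in the last position

cat /tmp/demo.g1.out
```

On this sample the survivors are `ameer`, `aired`, `acher`, and `alert`.  Keeping the rules as separate steps, as done above, has the advantage that the intermediate files show how much each rule shrinks the search space.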

## Task 4: Guess second word

At this point we redo some of the commands that we've done initially: generate a frequency table of letters and find a word that has the most common letters.

In [23]:
cat /tmp/words.5.g1.r4 |
grep -o . |
sort |
uniq -c |
sort -rn |
head


     55 a
     53 r
     52 e
     12 t
     11 i
     10 d
      8 b
      7 n
      6 m
      6 l


The '-o' option to grep means to print only the part that matches the regular expression.  Since the regular expression is a single dot (.), it matches a single character.  The result is that each character is printed on its own line.

In [4]:
# For example
head -2 /tmp/words.5.g1.r4

head -2 /tmp/words.5.g1.r4 |
grep -o .

abear
aberr
a
b
e
a
r
a
b
e
r
r


The sort command groups identical letters together, and uniq ( with the '-c' option ) collapses each run of repeated lines into a single line prefixed with its count.  The second sort ( with '-rn' ) orders those lines numerically, largest first.  And head displays the first 10 lines.
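
Since the same pipeline is used for both the first and the second guess, it could be wrapped in a small function.  This is only a sketch: the name `letter_freq`, its optional argument, and the two-word sample are assumptions for illustration.

```shell
# letter_freq: print the N most frequent letters read from stdin
# (N defaults to 10).  Hypothetical helper around the pipeline above.
letter_freq() {
  grep -o . |              # one character per line
  sort |                   # group identical letters together
  uniq -c |                # collapse each group into "count letter"
  sort -rn |               # largest count first
  head -n "${1:-10}"       # keep the top N lines
}

# Two-word sample: e occurs three times, more than any other letter.
printf 'ameer\naired\n' | letter_freq 1 > /tmp/demo.freq
cat /tmp/demo.freq
```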

As expected A, R, and E are the top hits.  So we will filter for words that contain the next two letters.


In [26]:
# none of these give results that were considered words
cat /tmp/words.5.g1.r4 |
grep t |
grep i 

cat /tmp/words.5.g1.r4 |
grep t |
grep d 

cat /tmp/words.5.g1.r4 |
grep t |
grep b 

cat /tmp/words.5.g1.r4 |
grep t |
grep m

cat /tmp/words.5.g1.r4 |
grep t |
grep n


adret
abret
atren


In [27]:
cat /tmp/words.5.g1.r4 |
grep t |
grep l


alert
alter


Using 'alert' as our second guess, we get these results:

position | letter | hint ( color )
--- | --- | ---
1 | A | 2 ( green )
2 | L | 1 ( grey )
3 | E | 3 ( yellow )
4 | R | 3 ( yellow )
5 | T | 1 ( grey )


So ...
1. E and R exist
2. L and T do not exist
3. A exists at the correct position
4. E and R are in the wrong position
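
Those four observations map onto the same kinds of filter as before.  The sketch below is one way to write them, assuming the 'alert' hints in the table above.  It runs on a hypothetical sample file holding only the ten words that appear in the outputs earlier in this notebook, not the full 51-word list, so its result is not the real set of remaining candidates.

```shell
# Hypothetical sample: words taken from the outputs shown earlier.
printf '%s\n' abear aberr ameer aired acher \
              adret abret atren alert alter > /tmp/demo.g1.r4

grep e /tmp/demo.g1.r4 |   # 1. E exists
grep r |                   # 1. R exists
grep -v l |                # 2. L does not exist
grep -v t |                # 2. T does not exist
grep '^a' |                # 3. A is in the first position
grep -v '^..e' |           # 4. E is not in the third position
grep -v '^...r' > /tmp/demo.g2.r4   # 4. R is not in the fourth position

cat /tmp/demo.g2.r4
```

Each extra dot after the caret skips one more letter, so `^..e` tests the third position and `^...r` the fourth.  On this sample only `aired` and `acher` remain.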


Let's try it ...


# Oooooo! That's a Bingo!

[Is that the way you say it?](https://youtu.be/Ugpg8XruhVk)


<iframe width="560" height="315" src="https://www.youtube.com/embed/Ugpg8XruhVk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

