# Using regular expressions to help with Wordle

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/rwcitek/MyBinder.demo/main?labpath=%2FRegular.Expressions%2Fwordle.ipynb)

### Background

Wordle is a word-guessing game where the program picks a five-letter word from its list and you have six tries to guess it.  
That sounds like a pretty big search space.  I wonder just how big it is, and whether it can be made smaller with the hints that Wordle gives you about previous guesses.

### Task 1: how big is the search space

In [None]:
ls -la /usr/share/dict/words

In [None]:
readlink -f /usr/share/dict/words

The "insanely long list of american words" is installed.
I wonder how many words are in it.

In [None]:
wc /usr/share/dict/words

About 650k words.
Of those, I wonder how many are five-letter words.

In [None]:
egrep '^[a-z]{5}$' /usr/share/dict/words > /tmp/words.5-letters.txt
wc /tmp/words.5-letters.txt

About 17k.  That's a lot smaller, but still a pretty big search space.
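To see what the pattern `^[a-z]{5}$` keeps and drops, here is a minimal sketch on a made-up sample instead of the real dictionary.  The `sample_words` helper and its word list are hypothetical, and `LC_ALL=C` is my own addition so that the range `[a-z]` means plain lowercase ASCII regardless of locale.

```shell
# Hypothetical sample standing in for /usr/share/dict/words
sample_words() {
  printf '%s\n' arose Arose "arose's" rose roses aroused delta
}

# Keep only lines that are exactly five lowercase ASCII letters:
# '^' and '$' pin the match to the whole line, '{5}' fixes the length.
five=$(sample_words | LC_ALL=C grep -E '^[a-z]{5}$')
printf '%s\n' "$five"
```

Capitalized words, words with apostrophes, and words of any other length are all dropped because the two anchors force the five letters to be the entire line.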

### Task 2: good initial guess to reduce the search space

I wonder what letters appear most often in that list of five-letter words.
Let's create a frequency list.

In [None]:
grep -o . /tmp/words.5-letters.txt |
sort |
uniq -c |
sort -rn |
head

Let's see if there is a five-letter word that contains the top five letters in that frequency list.

In [None]:
# can also be done using sed with slightly altered logic
# sed '/a/!d;/e/!d;/s/!d;/o/!d;/r/!d' /tmp/words.5-letters.txt 

awk '/a/ && /e/ && /s/ && /o/ && /r/' /tmp/words.5-letters.txt

That's a bingo!
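The two steps above ( count the letters, then look for words containing all of the top ones ) can be tried on a tiny list.  This is only a sketch: the five sample words are made up for illustration and are not the contents of the real word list.

```shell
# Hypothetical sample list, one word per line
words=$(printf '%s\n' arose soare aeros delta leady)

# Frequency table: split into single characters, then count each one.
# Note that a letter repeated inside one word is counted every time.
freq=$(printf '%s\n' "$words" | grep -o . | sort | uniq -c | sort -rn)
printf '%s\n' "$freq"

# Words that contain every one of the letters a, e, s, o and r
hits=$(printf '%s\n' "$words" | awk '/a/ && /e/ && /s/ && /o/ && /r/')
printf '%s\n' "$hits"
```

Each `/x/` in the awk condition is an unanchored match, so the letters may appear in any order and at any position.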

### Task 3: evaluate the guess

There are a number of Wordle sites, including the [New York Times](https://www.nytimes.com/games/wordle/index.html).  I'm using the one from wordleplay.com because I can point to a specific game, for example, this one:

https://wordleplay.com?challenge=bGVhZHk=


#### The hints

The goal of Wordle is to guess a five-letter word that the system has picked from a pool of possible five-letter words.  Once you make a guess using a valid word ( a non-word guess is not allowed ), Wordle provides one of three hints for each letter in the guess:

1. a match of both the letter and its position in the word
2. a mismatch, i.e. the letter is not anywhere in the word
3. a match of letter but not position

This last hint is actually two hints in one.  So there are actually four hints.  More on that later.


If the first word guess is "arose", then the results are as follows, using a two-character combination of letter and hint to encode the results:

1. A3
1. R2
1. O2
1. S2
1. E3

This means the letters R, O, and S are not in the word ( hint 2 ).  And the letters A and E are in the word, but in the wrong position ( hint 3 ).  Let's generate a regular expression for the letters that fall under hint 2 and use egrep, a line-based pattern-matching program, to filter words from the list of five-letter words.
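The letter-plus-hint encoding can be sketched as a small scoring function.  Everything here is an assumption for illustration: the `score` function is hypothetical, the answer `leady` is picked only because it reproduces the codes listed above, and the logic ignores the special handling that real Wordle applies to repeated letters.

```shell
# score GUESS ANSWER  ->  prints one LETTER+HINT code per letter
#   1 = right letter, right position
#   2 = letter is not in the word
#   3 = letter is in the word, but at another position
score() {
  awk -v g="$1" -v a="$2" 'BEGIN {
    out = ""
    for (i = 1; i <= length(g); i++) {
      c = substr(g, i, 1)
      if (c == substr(a, i, 1)) h = 1
      else if (index(a, c) > 0) h = 3
      else                      h = 2
      out = out (i > 1 ? " " : "") toupper(c) h
    }
    print out
  }'
}

first=$(score arose leady)
printf '%s\n' "$first"
```

With these assumptions the call prints `A3 R2 O2 S2 E3`, the same codes as in the list above.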


#### Regular Expression for Second Hint

In [None]:
cat /tmp/words.5-letters.txt | egrep -v r | egrep -v o | egrep -v s | wc

However, we can shorten that by using a character class:

In [None]:
cat /tmp/words.5-letters.txt | egrep -v '[ros]' > /tmp/word.guess-1-hint2.txt
wc /tmp/word.guess-1-hint2.txt

We've gone from 17k down to about 5k with just one hint.  Now to tackle hint 3. 
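As a quick check that the character class really is equivalent to the chain of three filters, here is a sketch on a made-up sample list ( the words are hypothetical and chosen only to exercise each letter ).

```shell
# Hypothetical sample list
words=$(printf '%s\n' arose delta leady rebar aback)

# Three separate inverted matches, one per excluded letter
chained=$(printf '%s\n' "$words" | grep -v r | grep -v o | grep -v s)

# One inverted match: '[ros]' matches any single one of r, o or s
class=$(printf '%s\n' "$words" | grep -v '[ros]')

printf '%s\n' "$class"
```

Both variables end up holding the same three words, because a line is dropped as soon as it contains any one of the listed letters.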

#### Regular Expression for Third Hint - Part 1

The third hint tells us that the letter A is somewhere in the word, just not at the first position.  We can use a regular expression with an anchor to filter out the words that start with 'a'.

In [None]:
cat /tmp/word.guess-1-hint2.txt | egrep -v '^a' | wc

The caret ( '^' ) is a symbol that matches the beginning of a line.  And since the list is one word per line, it matches the beginning of a word.  As such it is known as an anchor because it anchors the regular expression to a fixed location, the beginning of the line.  So this regular expression matches all words that begin with the letter 'a', and the -v option inverts the match so that only the non-matching words are printed.  This allows words like 'babby' to pass through, while words like 'aback' are filtered out.

In [None]:
echo babby | egrep -v '^a'
echo aback | egrep -v '^a'

We can do something similar with the letter 'e', but with a twist.

In [None]:
cat /tmp/word.guess-1-hint2.txt | egrep -v 'e$' | wc

In this case, we anchored the regular expression with a dollar symbol ( '$' ), which matches the end of the line.  So the regular expression matches all words that end with an 'e'.

We can combine those two commands into a single pipeline.

In [None]:
cat /tmp/word.guess-1-hint2.txt | egrep -v '^a' | egrep -v 'e$' | wc
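
The two anchored filters can also be folded into one expression using alternation ( `|` ), which in extended regular expressions means "either side matches".  A minimal sketch on a hypothetical sample list:

```shell
# Hypothetical sample list
words=$(printf '%s\n' aback babby eagle delta medal)

# Two inverted matches in a pipeline, as above
piped=$(printf '%s\n' "$words" | grep -v '^a' | grep -v 'e$')

# One inverted match: drop a word if it starts with 'a' OR ends with 'e'
single=$(printf '%s\n' "$words" | grep -Ev '^a|e$')

printf '%s\n' "$single"
```

Whether one pattern or two reads better is a matter of taste; the pipeline form makes it easier to add or remove one hint at a time.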

#### Regular Expression for Third Hint - Part 2

But the third hint tells us something more than only which letters are not at a given position.  It also tells us what *is* in the word.  In this case, both the letters 'a' and 'e' are in the word, just not at the position that we guessed.  We can therefore filter for words that contain both letters.

In [None]:
cat /tmp/word.guess-1-hint2.txt | egrep -v '^a' | egrep -v 'e$' | awk '/a/ && /e/' > /tmp/word.guess-1-hint3.txt 
wc /tmp/word.guess-1-hint3.txt

With just one guess and two hints we have gone from a search space of about 17k down to 526 with the help of regular expressions.  Now, the question is, what should be our next guess?
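All of the filters for the first guess can also be written as a single awk condition, which may be easier to read than a long pipeline.  This is just a sketch on a made-up sample list, not a replacement for the commands above.

```shell
# Hypothetical sample list
words=$(printf '%s\n' arose aback eagle belly delta leady media)

# hint 2: none of r, o, s      -> !/[ros]/
# hint 3: no 'a' at the start  -> !/^a/
#         no 'e' at the end    -> !/e$/
#         but both are present -> /a/ && /e/
left=$(printf '%s\n' "$words" |
  awk '!/[ros]/ && !/^a/ && !/e$/ && /a/ && /e/')

printf '%s\n' "$left"
```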

## Guess number two

At this point we redo some of the commands that we've done initially: generate a frequency table of letters and find a word that has the most common letters.

In [None]:
grep -o . /tmp/word.guess-1-hint3.txt | sort | uniq -c | sort -rn | head


In [None]:
awk '/l/ && /d/ && /t/ ' /tmp/word.guess-1-hint3.txt | wc

Using 'delta' as our next guess, we get these hints:

1. D3
1. E1
1. L3
1. T2
1. A3

#### Regular Expression for First Hint

We now have a match for one letter: an 'e' at position 2.

In [None]:
cat /tmp/word.guess-1-hint3.txt | egrep '^.e' > /tmp/word.guess-2-hint1.txt
wc /tmp/word.guess-2-hint1.txt

#### Regular Expression for Second Hint

In [None]:
cat /tmp/word.guess-2-hint1.txt | egrep -v t > /tmp/word.guess-2-hint2.txt
wc /tmp/word.guess-2-hint2.txt

#### Regular Expression for Third Hint - Part 1

In [None]:
cat /tmp/word.guess-2-hint2.txt | egrep -v '^d' | egrep -v '^..l' | egrep -v '^....a' | wc
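
Patterns such as `^.e`, `^..l` and `^....a` all follow the same recipe: a caret, one dot ( "any character" ) for each position to skip, then the letter.  A hypothetical helper, `pos_pattern`, can sketch that recipe; the name and interface are my own invention.

```shell
# pos_pattern LETTER POSITION  ->  regex matching LETTER at that 1-based position
pos_pattern() {
  letter=$1
  pos=$2
  dots=''
  i=1
  # Add one '.' for every position before the target
  while [ "$i" -lt "$pos" ]; do
    dots="$dots."
    i=$((i + 1))
  done
  printf '^%s%s\n' "$dots" "$letter"
}

pos_pattern l 3    # prints ^..l
pos_pattern a 5    # prints ^....a
```

The same pattern works for both kinds of hint: use it as-is to require a letter at a position ( hint 1 ), or with `-v` to exclude it ( hint 3 ).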

#### Regular Expression for Third Hint - Part 2

In [None]:
cat /tmp/word.guess-2-hint2.txt |
egrep -v '^d' |
egrep -v '^..l' |
egrep -v '^....a' |
awk '/d/ && /l/ && /a/' > /tmp/word.guess-2-hint3.txt
wc /tmp/word.guess-2-hint3.txt
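
Since every guess is handled with the same three rules, the per-hint commands can be sketched as one reusable function.  The `wordle_filter` name, its arguments, and the sample list are all hypothetical, and like the commands above it treats hint 2 as "the letter is nowhere in the word", which glosses over how real Wordle scores repeated letters.

```shell
# wordle_filter GUESS HINTS
#   HINTS is one digit per letter of GUESS, using the codes from this article:
#   1 = right position, 2 = not in the word, 3 = in the word, wrong position
wordle_filter() {
  awk -v g="$1" -v h="$2" '{
    keep = 1
    for (i = 1; i <= length(g) && keep; i++) {
      c = substr(g, i, 1)        # guessed letter
      k = substr(h, i, 1)        # its hint code
      here = (substr($0, i, 1) == c)
      anywhere = (index($0, c) > 0)
      if (k == "1" && !here)                  keep = 0
      else if (k == "2" && anywhere)          keep = 0
      else if (k == "3" && (here || !anywhere)) keep = 0
    }
    if (keep) print
  }'
}

# Hypothetical sample list, filtered by both guesses in turn
remaining=$(printf '%s\n' leady delta arose ready medal pedal leaky |
  wordle_filter arose 32223 |
  wordle_filter delta 31323)
printf '%s\n' "$remaining"
```

Each call reads words on standard input and prints the survivors, so one call per guess can be chained in a pipeline just like the egrep commands.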


## Guess number three

The search space is down to six words that each contain the four known letters.

In [None]:
egrep -o . /tmp/word.guess-2-hint3.txt | sort | uniq -c | sort -rn | head
cat -n /tmp/word.guess-2-hint3.txt


At this point, it's just a guessing game, since each of the remaining words is equally likely to be the answer.