# Using regular expressions to help with Wordle

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/rwcitek/MyBinder.demo/main?labpath=%2FRegular.Expressions%2Fwordle.ipynb)

### Background

Wordle is a word-guessing game where the program picks a five-letter word from its list and you have six tries to guess it.  
That sounds like a pretty big search space for guessing.  I wonder just how big that search space is, and whether it can be made smaller with the hints that Wordle gives you about previous guesses.

### Task 1: How big is the search space?

In [1]:
ls -la /usr/share/dict/words

lrwxrwxrwx 1 root root 30 Oct 14 02:12 /usr/share/dict/words -> /etc/dictionaries-common/words


In [2]:
readlink -f /usr/share/dict/words

/usr/share/dict/american-english-insane


The "insanely long list of american words" is installed.
I wonder how many words are in it.

In [3]:
wc /usr/share/dict/words

 654749  654749 6876726 /usr/share/dict/words


About 650k words.
Of those, I wonder how many are five-letter words.

In [4]:
egrep '^[a-z]{5}$' /usr/share/dict/words > /tmp/words.5-letters.txt
wc /tmp/words.5-letters.txt

 17375  17375 104311 /tmp/words.5-letters.txt


About 17k.  That's a lot smaller, but still a pretty big search space.
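
To see what that pattern keeps and what it drops, here is a small sketch on a made-up sample list instead of the dictionary file. It uses `grep -E`, which is equivalent to `egrep`. Setting `LC_ALL=C` is my own assumption, added so that `[a-z]` means only the 26 lowercase ASCII letters no matter which locale is active.

```shell
# Made-up sample list: only 'arose' and 'apple' are exactly five lowercase letters.
five=$(printf '%s\n' arose Aaron "can't" apple word wordle |
  LC_ALL=C grep -E '^[a-z]{5}$')

# 'Aaron' has a capital, "can't" has an apostrophe, 'word' is too short, 'wordle' too long.
printf '%s\n' "$five"
```

The two anchors matter here: without '^' and '$' the pattern would also match any longer word that merely contains five lowercase letters in a row.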

### Task 2: What's a good initial guess to reduce the search space?

I wonder what letters appear most often in that list of five-letter words.
Let's create a frequency list.

In [5]:
grep -o . /tmp/words.5-letters.txt |
sort |
uniq -c |
sort -rn |
head

   8811 a
   8633 e
   7764 s
   5782 o
   5526 r
   5273 i
   4557 l
   4534 t
   4295 n
   3606 u
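
The same pipeline on just two words shows what each stage contributes. This is only a sketch with a made-up two-word input; the variable name `freq` is mine.

```shell
# Two sample words: 'a' and 'e' each appear twice, every other letter once.
freq=$(printf '%s\n' arose leady |
  grep -o . |    # -o prints every match on its own line; '.' matches a single character
  sort |         # group identical letters together so uniq can count them
  uniq -c |      # prefix each distinct letter with its count
  sort -rn)      # largest counts first

printf '%s\n' "$freq"
```

Note that this counts letter occurrences, not words, so a word with a doubled letter contributes twice to that letter's total.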


Let's see if there is a five-letter word that contains the top five letters in that frequency list.

In [6]:
# can also be done using sed with slightly altered logic
# sed '/a/!d;/e/!d;/s/!d;/o/!d;/r/!d' /tmp/words.5-letters.txt 

awk '/a/ && /e/ && /s/ && /o/ && /r/' /tmp/words.5-letters.txt

arose
seora
soare


Woohoo! That worked.
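
As a sanity check on the sed variant mentioned in the comment, here is a sketch that runs both forms on a made-up sample list and compares the results. The variable names are mine.

```shell
words=$(printf '%s\n' arose soare leady roses adore)

# awk: print a line only if every one of the five patterns matches it
with_awk=$(printf '%s\n' "$words" | awk '/a/ && /e/ && /s/ && /o/ && /r/')

# sed: delete a line as soon as one of the five letters is missing
with_sed=$(printf '%s\n' "$words" | sed '/a/!d;/e/!d;/s/!d;/o/!d;/r/!d')

printf '%s\n' "$with_awk"
[ "$with_awk" = "$with_sed" ] && echo "awk and sed agree"
```

The "slightly altered logic" is the inversion: awk keeps lines where all tests pass, while sed drops lines where any test fails.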

### Task 3: Evaluate the guess, i.e. play Wordle

There are a number of Wordle sites, including the [New York Times](https://www.nytimes.com/games/wordle/index.html).  I'm using the one from wordleplay.com because I can point to a specific game, for example, this one:

https://wordleplay.com?challenge=bGVhZHk=


#### The hints

The goal of Wordle is to guess a five-letter word that the system has picked from a pool of possible five-letter words.  Once you make an initial guess using a valid word ( a non-word guess is not allowed ), Wordle provides one of three hints for each letter in the guess:

1. a match of both the letter and its position in the word
2. a mismatch, i.e. the letter is not anywhere in the word
3. a match of letter but not position

This last hint is actually two hints in one: the letter is somewhere in the word, and it is not at the position where we guessed it.  So there are really four hints.  More on that later.


If the first word guess is "arose", then the results are as follows, using a two-character combination of letter and hint to encode the results:

1. A3
1. R2
1. O2
1. S2
1. E3

This means the letters R, O, and S are not in the word ( hint 2 ).  And the letters A and E are in the word, but in the wrong position ( hint 3 ).  Let's generate a regular expression for the letters that fall under hint 2 and use egrep, a line-based pattern matching program, to filter words from the list of five-letter words.
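
Before filtering, the letter-and-hint encoding above can be reproduced with a small helper. This is only a sketch: `score` is a hypothetical function of mine, not part of any Wordle site, and it applies the three hints exactly as listed in this write-up, so it ignores the extra rules real Wordle has for repeated letters.

```shell
# score GUESS TARGET : print one letter+hint pair per letter of GUESS
#   1 = right letter, right position
#   2 = letter is not in the target
#   3 = letter is in the target, but at a different position
score() {
  awk -v guess="$1" -v target="$2" 'BEGIN {
    out = ""; sep = ""
    for (i = 1; i <= length(guess); i++) {
      letter = substr(guess, i, 1)
      if (substr(target, i, 1) == letter) {
        hint = 1
      } else if (index(target, letter) > 0) {
        hint = 3
      } else {
        hint = 2
      }
      out = out sep toupper(letter) hint
      sep = " "
    }
    print out
  }'
}

score arose leady
```

Since the game linked above turns out to have 'leady' as its answer, running `score arose leady` gives the same five pairs as the list above.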


#### Regular Expression for Second Hint

In [7]:
cat /tmp/words.5-letters.txt | egrep -v r | egrep -v o | egrep -v s | wc

   4670    4670   28036


However, we can shorten that by using a character class:

In [8]:
cat /tmp/words.5-letters.txt | egrep -v '[ros]' > /tmp/word.guess-1-hint2.txt
wc /tmp/word.guess-1-hint2.txt

 4670  4670 28036 /tmp/word.guess-1-hint2.txt
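
As a quick check on a made-up sample list, the chained filters and the character class keep exactly the same words. The variable names are mine.

```shell
words=$(printf '%s\n' arose leady medal shell rally)

# three separate inverted matches, one per letter
chained=$(printf '%s\n' "$words" | grep -v r | grep -v o | grep -v s)

# one inverted match against a character class: drop lines containing r, o, or s
classed=$(printf '%s\n' "$words" | grep -v '[ros]')

printf '%s\n' "$classed"
[ "$chained" = "$classed" ] && echo "same result"
```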


We've gone from 17k down to about 5k with just one hint.  Now to tackle hint 3. 

#### Regular Expression for Third Hint - Part 1

The third hint tells us that the letter A is somewhere in the word, just not at the first position.  We can use a regular expression with an anchor to filter out the words that start with 'a'.

In [9]:
cat /tmp/word.guess-1-hint2.txt | egrep -v '^a' | wc

   4234    4234   25417


The caret ( '^' ) is a symbol that matches the beginning of a line.  And since the list is one word per line, it matches the beginning of a word.  As such, it is known as an anchor because it anchors the regular expression to a fixed location, the beginning of the line.  So this regular expression matches all words that begin with the letter 'a', and the -v option inverts the match so that those words are dropped.  This allows words like 'babby' to pass through, while words like 'aback' are filtered out.

In [10]:
echo -e "babby\naback" | egrep -v '^a'

babby


We can do something similar with the letter 'e', but with a twist.

In [11]:
cat /tmp/word.guess-1-hint2.txt | egrep -v 'e$' | wc

   3924    3924   23550


In this case, we anchored the regular expression with a dollar symbol ( '$' ), which matches the end of the line.  So the regular expression matches all words that end with an 'e'.

We can combine those two commands into a single pipeline.

In [12]:
cat /tmp/word.guess-1-hint2.txt | egrep -v '^a' | egrep -v 'e$' | wc

   3579    3579   21480
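
The two anchored filters can also be folded into a single pattern with alternation. Here is a sketch on a made-up sample list, using `grep -E` ( the same thing as `egrep` ) so that '|' means "or". The variable names are mine.

```shell
words=$(printf '%s\n' aback babby cable leady eagle)

# two inverted matches in a pipeline, as in the cell above
piped=$(printf '%s\n' "$words" | grep -v '^a' | grep -v 'e$')

# one inverted match: drop a line if it starts with 'a' OR ends with 'e'
single=$(printf '%s\n' "$words" | grep -Ev '^a|e$')

printf '%s\n' "$single"
[ "$piped" = "$single" ] && echo "same result"
```

Either form works. The pipeline is easier to build up one hint at a time, which is why I keep using it here.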


#### Regular Expression for Third Hint - Part 2

But the third hint tells us something more than only which letters are not at a given position.  It also tells us what *is* in the word.  In this case, both the letters 'a' and 'e' are in the word, just not at the position that we guessed.  We can therefore filter for words that contain both letters.

In [13]:
cat /tmp/word.guess-1-hint2.txt | egrep -v '^a' | egrep -v 'e$' | awk '/a/ && /e/' > /tmp/word.guess-1-hint3.txt 
wc /tmp/word.guess-1-hint3.txt

 526  526 3156 /tmp/word.guess-1-hint3.txt
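
All of the filters for the first guess can also be written as one awk expression. This is just a sketch on a made-up sample list, not a replacement for the pipeline above. The variable names are mine.

```shell
words=$(printf '%s\n' arose leady medal eagle aback babby heald table)

remaining=$(printf '%s\n' "$words" |
  awk '!/[ros]/ &&      # hint 2: none of r, o, s anywhere
       !/^a/    &&      # hint 3, part 1: a is not first
       !/e$/    &&      # hint 3, part 1: e is not last
       /a/ && /e/')     # hint 3, part 2: both a and e are present

printf '%s\n' "$remaining"
```

Here 'eagle' and 'table' are dropped for ending in 'e', 'aback' for starting with 'a', 'babby' for lacking an 'e', and 'arose' for containing the ruled-out letters.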


With just one guess and two hints we have gone from a search space of about 17k down to 526 with the help of regular expressions.  Now, the question is, what should be our next guess?

### Task 4: Guess second word

At this point we redo some of the commands that we've done initially: generate a frequency table of letters and find a word that has the most common letters.

In [14]:
grep -o . /tmp/word.guess-1-hint3.txt | sort | uniq -c | sort -rn | head


    542 e
    541 a
    206 l
    180 d
    158 t
    152 n
     91 m
     85 c
     82 h
     81 y


In [15]:
awk '/l/ && /d/ && /t/ ' /tmp/word.guess-1-hint3.txt | wc
awk '/l/ && /d/ && /t/ ' /tmp/word.guess-1-hint3.txt | cat -n

      5       5      30
     1	dalet
     2	dealt
     3	delta
     4	lated
     5	taled


Using 'delta' as our next guess, we get these hints:

1. D3
1. E1
1. L3
1. T2
1. A3

#### Regular Expression for First Hint

We now have a match for one letter, a letter 'e' at position 2.

In [16]:
cat /tmp/word.guess-1-hint3.txt | egrep '^.e' > /tmp/word.guess-2-hint1.txt
wc /tmp/word.guess-2-hint1.txt

 211  211 1266 /tmp/word.guess-2-hint1.txt
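
In the pattern '^.e', the dot is a placeholder that matches any single character, so the 'e' is pinned to position 2. The same idea works for any position. As a sketch, here is a hypothetical helper of mine, `pos_regex`, that builds such a pattern from a letter and a 1-based position.

```shell
# pos_regex LETTER POSITION : anchor LETTER at the given 1-based position
# The body is wrapped in ( ... ) so its variables stay inside a subshell.
pos_regex() (
  pattern='^'
  i=1
  while [ "$i" -lt "$2" ]; do
    pattern="${pattern}."     # one dot per position to skip over
    i=$(( i + 1 ))
  done
  printf '%s%s\n' "$pattern" "$1"
)

pos_regex e 2
pos_regex a 5

# keep only words with 'e' in position 2
printf '%s\n' heald leady adept | grep "$(pos_regex e 2)"
```

Used with `grep` it keeps words that match the position ( hint 1 ), and with `grep -v` it drops them ( hint 3, part 1 ).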


#### Regular Expression for Second Hint

In [17]:
cat /tmp/word.guess-2-hint1.txt | egrep -v t > /tmp/word.guess-2-hint2.txt
wc /tmp/word.guess-2-hint2.txt

148 148 888 /tmp/word.guess-2-hint2.txt


#### Regular Expression for Third Hint - Part 1

In [18]:
cat /tmp/word.guess-2-hint2.txt | egrep -v '^d' | egrep -v '^..l' | egrep -v '^....a' | wc

     80      80     480


#### Regular Expression for Third Hint - Part 2

In [19]:
cat /tmp/word.guess-2-hint2.txt |
egrep -v '^d' |
egrep -v '^..l' |
egrep -v '^....a' |
awk '/d/ && /l/ && /a/' > /tmp/word.guess-2-hint3.txt
wc /tmp/word.guess-2-hint3.txt


 6  6 36 /tmp/word.guess-2-hint3.txt


### Task 5: Guess third word

The search space is down to six words that each contain the four known letters.

In [20]:
egrep -o . /tmp/word.guess-2-hint3.txt | sort | uniq -c | sort -rn | head
cat -n /tmp/word.guess-2-hint3.txt


      6 l
      6 e
      6 d
      6 a
      1 y
      1 w
      1 p
      1 n
      1 m
      1 h
     1	heald
     2	leady
     3	lenad
     4	medal
     5	pedal
     6	weald


At this point, there are so few words that we can enumerate every 2-, 3-, and 4-letter substring of each word and look at their frequencies.

In [21]:
# Create all 2, 3, and 4 letter substrings
# and list the top 5 most frequent
cat /tmp/word.guess-2-hint3.txt | while read word ; do
  # 2-letter substrings
  cut -c1-2 <<< $word ;
  cut -c2-3 <<< $word ;
  cut -c3-4 <<< $word ;
  cut -c4-5 <<< $word ;
  # 3-letter substrings
  cut -c1-3 <<< $word ;
  cut -c2-4 <<< $word ;
  cut -c3-5 <<< $word ;
  # 4-letter substrings
  cut -c1-4 <<< $word ;
  cut -c2-5 <<< $word ;
done |
sort |
uniq -c |
sort -rn > /tmp/word.guess-3-part1.txt
head -5 /tmp/word.guess-3-part1.txt

      4 al
      3 ea
      2 le
      2 ld
      2 edal
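
The nine cut commands can be replaced by two nested loops. Here is a sketch using awk's substr on the same six words; the loop variables `len` and `start` and the shell variable names are mine.

```shell
words=$(printf '%s\n' heald leady lenad medal pedal weald)

counts=$(printf '%s\n' "$words" |
  awk '{
    # every substring of length 2, 3, and 4, at every start position
    for (len = 2; len <= 4; len++)
      for (start = 1; start + len - 1 <= length($0); start++)
        print substr($0, start, len)
  }' |
  sort | uniq -c | sort -rn)

printf '%s\n' "$counts" | head -5
```

This version also keeps working if the word length or the range of substring lengths changes.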


In addition to the substrings, we can note their position in the word and calculate their frequencies.

In [22]:
# find the location for each substring
# and list the top 5 most frequent
cat /tmp/word.guess-3-part1.txt |
while read count perm ; do
  cat /tmp/word.guess-2-hint3.txt |
  while read word ; do
    <<< "$word" grep -o -b -e "$perm"
  done
done |
sort |
uniq -c |
sort -rn > /tmp/word.guess-3-part2.txt
head -5 /tmp/word.guess-3-part2.txt

      3 1:ea
      2 3:ld
      2 3:al
      2 2:dal
      2 2:da
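
The number in front of each colon is the offset that grep -b reports, counted from zero. As a sketch, the same offset:substring pairs can be computed with awk's index function, shown here for just the substrings 'ea' and 'ld'. The variable names are mine, and index only reports the first occurrence in each word, which is enough for these six words.

```shell
words=$(printf '%s\n' heald leady lenad medal pedal weald)

positions=$(printf '%s\n' "$words" |
  awk '{
    n = split("ea ld", wanted, " ")
    for (k = 1; k <= n; k++) {
      p = index($0, wanted[k])     # 1-based position, or 0 if absent
      if (p > 0) {
        offset = p - 1             # convert to a zero-based offset
        print offset ":" wanted[k]
      }
    }
  }' |
  sort | uniq -c | sort -rn)

printf '%s\n' "$positions"
```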


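Each entry is a zero-based offset followed by the substring.  The top two, 'ea' at offset 1 and 'ld' at offset 3, are non-overlapping, so we'll search for words that contain both of those.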

In [23]:
awk '/ea/ && /ld/' /tmp/word.guess-2-hint3.txt

heald
weald


Let's try 'heald' and the result is:

1. H2
1. E1
1. A1
1. L3
1. D3


#### Regular Expression for First Hint

In [24]:
cat /tmp/word.guess-2-hint3.txt | egrep '^..a' > /tmp/word.guess-3-hint1.txt
cat -n /tmp/word.guess-3-hint1.txt

     1	heald
     2	leady
     3	weald


#### Regular Expression for Second Hint

In [25]:
cat /tmp/word.guess-3-hint1.txt | grep -v 'h' > /tmp/word.guess-3-hint2.txt
cat -n /tmp/word.guess-3-hint2.txt

     1	leady
     2	weald


#### Regular Expression for Third Hint - Part 1

We can skip the filter for words that contain 'l' and 'd' because we already know both letters are in the word ( every remaining candidate contains them ), we just don't know where.  However ....

#### Regular Expression for Third Hint - Part 2


... we do know that 'l' is not at position 4 and 'd' is not at position 5.

In [26]:
cat /tmp/word.guess-3-hint2.txt | egrep -v '^...l' | egrep -v '^....d' > /tmp/word.guess-3-hint3-part2.txt
cat -n /tmp/word.guess-3-hint3-part2.txt

     1	leady


Let's try it ...


# Oooooo! That's a Bingo!

[Is that the way you say it?](https://youtu.be/Ugpg8XruhVk)


<iframe width="560" height="315" src="https://www.youtube.com/embed/Ugpg8XruhVk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

