# Matt Parker's Wordle challenge - the *pythonic vowel approach*

(c) by Thomas Reichert

## License

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, see http://www.gnu.org/licenses/.

## The challenge and my code

The goal of this Jupyter notebook is to solve [Matt Parker's](https://standupmaths.com) Wordle challenge (see [his YouTube video](https://www.youtube.com/watch?v=c33AZBnRHks&t=959s)): find combinations of five English five-letter words that together contain 25 distinct letters, **with my personal restriction that everything has to happen in 10 lines of basic Python code, using one single thread and no additional packages that might speed things up**.
*Yes, I am aware that there are 11 lines, but one could easily drop the package tqdm and the line with its import statement, as it doesn't help the algorithm at all and only shows progress to make us humans feel better because 'something happens on the screen all the time' ;-)*

The motivation for this is that if one works almost exclusively with Python, as yours truly does as a data scientist and certified actuary, the fact that this problem can be solved much faster in e.g. C++ doesn't help a lot: I could not spontaneously write code in it, and especially not decently understandable code. So Python it is ;-)

*If you run this notebook, please run the code cell first and wait patiently for about a minute, as only then will the values in the following markdown sections appear properly ;-)*

In [None]:
from tqdm import tqdm # only used to feel better because 'something happens on the screen all the time' ;-)
def cmb(x): # function that recursively combines words into pairs, triples, quadruples and quintuples
    return cmb([{'_'.join([kx, ky]): vx|vy for kx, vx in tqdm(x.pop(0).items()) for ky, vy in x[0].items()
                 if (vx&vy==0)}] + x[1:]) if (len(x) > 1) else x[0]
valid = {w: sum(1<<(ord(c)-97) for c in w) for w in open('words_alpha.txt').read().split('\n')
         if (len(w)==len(set(w))==5)} 
valid = {n: {k: v for k, v in valid.items() if (len(set(k) & set(vwl:='aeiou'))<=n)} for n in [0, 1, 2, 3]}
vow = {vw: {k: v for k, v in valid[1].items() if len(set(i for i in vwl if i!=vw)&set(k))<=0} for vw in vwl} 
result = {**(zero:=cmb([vow[v] for v in 'uieoa'])), **(two:=cmb([valid[n] for n in [0, 0, 1, 3, 3]])),
          **(one:=cmb([valid[n] for n in [0, 1, 1, 1, 2]]))} # yes the order in each cmb matters speedwise
open('result.csv', 'w').write('\n'.join(result:=set(','.join(sorted(k.split('_'))) for k in result.keys())))

## What the code actually does
The basic idea: the runtime of a search for all possible combinations of five words with all distinct letters grows like $O(n_1 \cdot n_2 \cdot n_3 \cdot n_4 \cdot n_5)$ in the worst case, so we must keep the numbers of candidate words $n_i$ for the first, second, etc. word as small as possible. Hence the first thing we must do is to get rid of as many words as possible beforehand and to divide the problem into subgroups that are as small as possible.

We can also make use of the fact that the word list contains only {{ len(valid.get(0)) }} valid words, such as crypt, glyph or nymph, that contain none of the five vowels a, e, i, o or u. Since 25 distinct letters can include at most five vowels, a combination of five words can only contain a word with two vowels if at least one of the other words contains none. As can easily be seen (almost all of these vowelless words contain either an x or a y), at most two words from that vowelless list can be combined.
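The case split that follows from this argument can be double-checked with a short enumeration. The sketch below is not part of the 10-line solution. It assumes, as stated above, that the five words share at most five vowels and that at most two vowelless words fit into one combination. The names `PATTERNS` and `fits` are made up for this illustration; the tuples mirror the upper bounds on vowels per word that the three `cmb` calls in the code cell use.

```python
from itertools import combinations_with_replacement

# Upper bound on the number of vowels for each of the five word slots,
# one tuple per search (illustrative names, bounds taken from the cmb calls).
PATTERNS = {
    "zero": (1, 1, 1, 1, 1),
    "one": (0, 1, 1, 1, 2),
    "two": (0, 0, 1, 3, 3),
}

def fits(counts, bounds):
    """True if the vowel counts can be assigned to slots without exceeding a bound."""
    # Matching smallest count to smallest bound is sufficient for this check.
    return all(c <= b for c, b in zip(sorted(counts), sorted(bounds)))

# All sorted vowel-count vectors of five words with at most five vowels
# in total and at most two vowelless words (assumptions from the text).
feasible = [
    c for c in combinations_with_replacement(range(6), 5)
    if sum(c) <= 5 and c.count(0) <= 2
]

# Vectors that none of the three searches would cover.
uncovered = [c for c in feasible if not any(fits(c, b) for b in PATTERNS.values())]

print(len(feasible), "feasible vowel-count vectors, uncovered:", uncovered)
```

Under these assumptions every feasible distribution of vowels over the five words falls into at least one of the three searches, so nothing is lost by splitting the problem.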

So we solve the problem in three parts:

* Five words where each contains none or one vowel -- here we can even split the word list by vowel to speed things up: The numbers of max-one-vowel words are {{ {k: len(v) for k, v in vow.items()} }}.
* One out of {{ len(valid.get(0)) }} words that contains no vowel, three out of {{ len(valid.get(1)) }} words that contain at most one and one out of {{ len(valid.get(2)) }} words that contains at most two vowels
* Two out of {{ len(valid.get(0)) }} words that contain no vowel, one out of {{ len(valid.get(1)) }} words that contains at most one and two out of {{ len(valid.get(3)) }} words that contain at most three vowels

and put everything together, which also yields {{ len(result) }} combinations. While this is clearly way off the results from the super fast compiled languages, I was at least able to get a runtime of about 55 seconds on an M1 MacBook Pro and to beat [Benjamin Paassen's graph theory approach](https://gitlab.com/bpaassen/five_clique) by approximately a factor of 20 ;-) His was the first decently fast pure Python approach that I am aware of and served as my benchmark (sorry to anybody who has meanwhile written a faster Python version that I missed).
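To make the combination step easier to follow, here is an unrolled, loop-based sketch of the same idea on a toy input. It is meant as a readable illustration and not as a replacement for `cmb`; the names `encode` and `combine` and the hand-picked three-group word list are assumptions made for this example only.

```python
def encode(word):
    """Map a lowercase word to an integer with one bit per letter a..z."""
    return sum(1 << (ord(c) - ord("a")) for c in word)

def combine(groups):
    """Pick one word from each group so that no letter is used twice.

    groups is a list of dicts {word: bitmask}. The result maps
    'word1_word2_...' to the combined bitmask, in the spirit of cmb.
    """
    partial = groups[0]
    for nxt in groups[1:]:
        partial = {
            f"{kx}_{ky}": vx | vy          # merge the letter sets
            for kx, vx in partial.items()
            for ky, vy in nxt.items()
            if vx & vy == 0                # keep only letter-disjoint pairs
        }
    return partial

# Toy input: three groups instead of five, words chosen by hand.
groups = [
    {w: encode(w) for w in ["nymph", "crypt"]},
    {w: encode(w) for w in ["fjord", "waltz"]},
    {w: encode(w) for w in ["gucks", "vibex"]},
]
triples = combine(groups)
print(sorted(triples))
```

Because partial results that already share a letter are dropped after every step, later groups are only compared against the survivors. This is one reason why the order of the groups matters for speed.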

## Final thoughts

Even though this approach is clearly no speed champ, my hope is that this *vowel-based reduction of the word combinations that need to be checked* can help somebody else's super fast algorithm to save a bit more time ;-)

Comparing integer encodings of the words bitwise, instead of comparing letter sets as I did in a previous version of the code, gave a speed gain of a factor of 7, from almost 7 minutes to now 55 seconds of runtime. Another way to speed up my code is certainly to make use of anagrams, as this further reduces the individual $n_i$, but in contrast to the bitwise integer comparison I haven't figured out yet how to do so without adding lines of code, as I also want to keep the code as short as possible ;-)
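Both points can be illustrated on a small scale. The following sketch uses a hand-picked toy list and a hypothetical helper `encode`; it first checks that the set-based and the bitwise disjointness tests agree and then shows one possible way to collapse anagrams, which share the same bitmask. The anagram part is only a direction one might explore and it is not what the notebook code does.

```python
from collections import defaultdict

def encode(word):
    """One bit per letter: 'a' is bit 0, 'b' is bit 1, and so on."""
    return sum(1 << (ord(c) - 97) for c in word)

words = ["lemon", "melon", "crypt", "waltz"]  # toy list, chosen by hand

# The set-based test and the bitwise test give the same answer for every pair.
for a in words:
    for b in words:
        assert (not set(a) & set(b)) == (encode(a) & encode(b) == 0)

# Anagrams have identical bitmasks, so grouping by mask leaves one
# representative per anagram class to search over.
by_mask = defaultdict(list)
for w in words:
    by_mask[encode(w)].append(w)

print(len(words), "words ->", len(by_mask), "distinct masks")
print(by_mask[encode("lemon")])
```

Searching over the distinct masks and expanding each mask back into its anagrams afterwards would keep the result complete, at the cost of a few extra lines.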

What surprised me in the process is that only {{ len(zero) }} out of these {{ len(result) }} combinations contain none of those {{ len(valid.get(0)) }} vowelless words. Originally I thought that combinations with vowelless words would be the exception rather than the rule, but I had to learn that using one of them makes room for another word with two vowels, and words with at most two vowels are {{ round(len(valid.get(2))/len(valid.get(1)),1) }} times as frequent as words with at most one vowel.

## The result
Here are the word combinations:

In [None]:
print(len(result))
print(result)