# Collecting data

This notebook details the process of assembling the dataset written to `wordle.csv`.

In [1]:
import requests, re
import pandas as pd
import numpy as np
from scipy import stats
from base64 import b64decode, b64encode
from time import sleep
from pathlib import Path
from IPython.display import display, clear_output

## Getting Wordle words

First, I import Wordle's answers and word list from its source code as two lists: `answers` and `wordlist`. The source code is JavaScript, but each array literal can be read as a string and then split into a Python list.

In [2]:
# Parse a JavaScript Array (as a String) into a Python List
array2List = lambda arr : re.sub(r'\"|\s', '', arr[1:-1]).split(",")

answers = None
wordlist = None
with open("wordle.js") as f:
    txt = f.read().split("\n")
    answers = array2List(txt[1117][13:])
    wordlist = array2List(txt[1118][13:])

sizes = {'wordlist': len(wordlist), 'answers': len(answers) }
print("wordlist:\t{}\nanswers:\t{}\ntotal:\t\t{}".format(sizes['wordlist'], sizes['answers'], sizes['wordlist'] + sizes['answers']))

wordlist:	10657
answers:	2315
total:		12972
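
The string-stripping approach above depends on fixed line and column offsets. As a hedged alternative, a JavaScript array of double-quoted strings is usually also valid JSON, so it can be located with a regular expression and handed to `json.loads`. The helper name `js_array_to_list` and the sample line are illustrative assumptions, not taken from `wordle.js`.

```python
import json
import re

def js_array_to_list(line):
    """Extract the first [...] literal on a line of JavaScript and parse it as JSON.

    Assumes double-quoted strings and no trailing comma, which JSON requires.
    """
    match = re.search(r"\[.*\]", line)
    if match is None:
        raise ValueError("no array literal found")
    return json.loads(match.group(0))

# Hypothetical line shaped like a minified variable assignment
sample = 'var answers = ["cigar", "rebut", "sissy"];'
print(js_array_to_list(sample))
```

This sketch does not need to know how many characters precede the array, but it would fail on single-quoted strings, which the regex-and-split version tolerates no better.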


Test for overlap between the word list, `wordlist`, and the set of answers, `answers`.

In [3]:
len(set(wordlist).intersection(set(answers)))

0

Since the size of the intersection of both sets is zero, `wordlist` and `answers` are disjoint sets. 
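
The same check can be written with `set.isdisjoint`. It may also be worth confirming that neither list repeats a word, because the total printed earlier only counts distinct words if both conditions hold. The helper below is a small sketch; its name and the three-word samples are stand-ins for the full lists.

```python
def lists_are_clean(first, second):
    """True when neither list repeats a word and the two lists share no word."""
    no_repeats = len(set(first)) == len(first) and len(set(second)) == len(second)
    return no_repeats and set(first).isdisjoint(second)

# Small stand-ins for the full `wordlist` and `answers` lists
sample_wordlist = ["aahed", "avyze", "zexes"]
sample_answers = ["cigar", "rebut", "sissy"]

print(lists_are_clean(sample_wordlist, sample_answers))
print(lists_are_clean(sample_wordlist, sample_answers + ["cigar"]))
```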

### Fetching occurrence data
Wordle puzzle solutions seem to be *normal* words, ones that you commonly hear in daily life. If you were to count the occurrence of every word in a book, you might consider the most frequently occurring ones to be normal. If you did that with every book, you'd have an idea of prevalence in the language. It just so happens that Google already did this with [more than 8 million books](https://aclanthology.org/P12-3029). It's called the [Google Books Ngram Corpus](https://books.google.com/ngrams).

The function `getNgramUsage` downloads data for a specific word from the JSON endpoint of Google Books Ngram Viewer. It queries corpus 26, which is English-language books, and sets `smoothing` to zero to fetch raw rather than smoothed data. The year range is 1970 through 2019.

In [4]:
def fetch(url, params, wait=1):
    # Retry for as long as the server answers 429 (Too Many Requests)
    tooManyReqsCode = 429
    res = requests.get(url, params)
    while res.status_code == tooManyReqsCode:
        sleep(wait)
        res = requests.get(url, params)
    return res

def getNgramUsage(ngram):
    url = 'https://books.google.com/ngrams/json'
    args = {
        'content': ngram,
        'year_start': 1970,
        'year_end': 2019,
        'corpus': 26,
        'smoothing': 0
    }
    emptyValue = ""
    res = fetch(url, args)
    try:
        # Mean relative frequency over the requested years
        return np.mean(res.json()[0]['timeseries'])
    except Exception:
        # No usable result for this ngram; unlike a bare except,
        # this still lets KeyboardInterrupt stop the notebook
        return emptyValue

The function `fetch` is a wrapper for `requests.get` that crudely handles rate limiting by retrying until the response is no longer a 429. Calling `.json()` on the response it returns gives a result like this one for the [Merriam-Webster 2021 Word of the Year](https://www.merriam-webster.com/words-at-play/word-of-the-year/vaccine).

In [5]:
fetch("https://books.google.com/ngrams/json",{'content': 'vaccine', 'corpus': 26, 'year_start': 2010, 'year_end': 2019}).json()

[{'ngram': 'vaccine',
  'parent': '',
  'type': 'NGRAM',
  'timeseries': [7.729678145551588e-06,
   7.463667134288698e-06,
   7.373356993412017e-06,
   7.262522688376651e-06,
   7.1265085352933966e-06,
   7.0330834855017855e-06,
   6.615856169186632e-06,
   6.586049342634699e-06,
   6.623334593314212e-06,
   6.5487166693856125e-06]}]
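
`fetch` retries at a fixed interval and never gives up. One possible refinement, sketched here under stated assumptions, is a bounded exponential backoff. The function name `fetch_with_backoff`, its parameters, and the injected `get` and `sleep` callables are all hypothetical; they exist so the sketch can be run without touching the network.

```python
import time

def fetch_with_backoff(get, url, params, max_retries=5, base_wait=1.0, sleep=time.sleep):
    """Call get(url, params), doubling the wait after each 429 response.

    Returns the last response, which may still be a 429 if retries run out.
    """
    res = get(url, params)
    for attempt in range(max_retries):
        if res.status_code != 429:
            break
        sleep(base_wait * 2 ** attempt)  # waits of 1, 2, 4, ... times base_wait
        res = get(url, params)
    return res

# Offline demonstration with a fake client that is rate limited twice
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

statuses = iter([429, 429, 200])
recorded_waits = []
result = fetch_with_backoff(
    lambda url, params: FakeResponse(next(statuses)),
    "https://example.invalid", {}, sleep=recorded_waits.append,
)
print(result.status_code, recorded_waits)
```

In the notebook, `requests.get` could be passed as `get`, since it accepts the URL and the query parameters as its first two positional arguments.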

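`getNgramUsage` formats requests to the same JSON endpoint of Google Books Ngram Viewer and calculates the mean of the values in the `timeseries` array. Because it requests the years 1970 through 2019, its result differs from the mean of the ten 2010 to 2019 values shown above.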

In [6]:
getNgramUsage("vaccine")

6.952722715141135e-06

Some [words](https://en.wikipedia.org/wiki/Nardwuar) will not have a result. In that case, an empty string denotes a null value at this point in the data pipeline.

In [7]:
getNgramUsage("nardwuar")

''
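
The averaging step can be separated from the network call and checked offline. The sketch below assumes that a word with no data produces an empty list from the endpoint, which is consistent with the fallback seen above but is not confirmed by it. The helper name `mean_usage` is hypothetical, and it returns `None` rather than an empty string. The sample payload is the 2010 to 2019 `vaccine` response shown earlier.

```python
from statistics import fmean

def mean_usage(payload):
    """Return the mean of the first result's timeseries, or None without a result."""
    if not payload:
        return None
    series = payload[0].get("timeseries", [])
    return fmean(series) if series else None

vaccine_2010s = [{
    "ngram": "vaccine",
    "parent": "",
    "type": "NGRAM",
    "timeseries": [
        7.729678145551588e-06, 7.463667134288698e-06, 7.373356993412017e-06,
        7.262522688376651e-06, 7.1265085352933966e-06, 7.0330834855017855e-06,
        6.615856169186632e-06, 6.586049342634699e-06, 6.623334593314212e-06,
        6.5487166693856125e-06,
    ],
}]

print(mean_usage(vaccine_2010s))  # mean over 2010 to 2019 only
print(mean_usage([]))             # assumed shape of a response with no result
```

The first value comes out near 7.04e-06, slightly above the 1970 to 2019 mean returned by `getNgramUsage("vaccine")`, because the two calls cover different year ranges.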

## Querying Ngram Viewer

Querying this API is error prone. The response headers carry no rate-limit information, but the API does return the HTTP status code for [too many requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429). The cell below therefore saves results to disk so that progress survives an error, and it is meant to be re-run until all words have been queried. Removing `wordle.tsv` from the working directory restarts the process.

In [8]:
# Get a list of only the words that we need to query

fn = 'wordle.tsv'
# If output file does not exist, then create it. 
# Likewise, delete output file to restart this data collection process
if not Path(fn).is_file():
    with open(fn, 'w') as f:
        f.write('word\toccurrence\n')

queriedWords = set(pd.read_csv(fn, delimiter='\t').word)
words = wordlist + answers
# Keep only the words that are not in the output file yet
isNotQueried = lambda word : word not in queriedWords
wordsToQuery = filter(isNotQueried, words)

# Append new words to file
with open(fn, 'a') as f:
    for word in wordsToQuery:
        clear_output(wait=True)
        display("Querying: {}".format(word))
        # If word has not already been queried, then query it.
        occurrence = getNgramUsage(word)
        row ='{}\t{}\n'.format(word, occurrence)
        f.write(row)

print("Querying is complete 😎")

Querying is complete 😎
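
The resume-on-rerun pattern used by this cell can be sketched without pandas or the network. In the version below, `query_remaining`, its `lookup` argument, and the flaky stand-in function are hypothetical names introduced for illustration. Each row is flushed as soon as it is written, so an interruption loses no finished work.

```python
import csv
import os
import tempfile
from pathlib import Path

def query_remaining(words, path, lookup):
    """Append one row per word not yet in the file; safe to re-run after an error."""
    path = Path(path)
    if not path.is_file():
        path.write_text("word\toccurrence\n")
    with path.open(newline="") as f:
        done = {row["word"] for row in csv.DictReader(f, delimiter="\t")}
    with path.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for word in words:
            if word in done:
                continue
            writer.writerow([word, lookup(word)])
            f.flush()  # persist the row before moving on
            done.add(word)
    return len(done)

# Simulate a lookup that fails once part way through
lookup_log = []
def flaky_lookup(word):
    if word == "sissy" and "failed" not in lookup_log:
        lookup_log.append("failed")
        raise RuntimeError("simulated rate limit")
    lookup_log.append(word)
    return 1e-06

cache_path = os.path.join(tempfile.mkdtemp(), "cache.tsv")
sample_words = ["cigar", "rebut", "sissy"]
try:
    query_remaining(sample_words, cache_path, flaky_lookup)
except RuntimeError:
    pass  # first run stops early, as a real run might
total_rows = query_remaining(sample_words, cache_path, flaky_lookup)
print(total_rows, lookup_log)
```

The second call skips the two words already on disk and only queries the one that failed.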


In [9]:
data = pd.read_csv(fn, delimiter="\t", dtype={'word': str, 'occurrence': np.float64})
data.describe(include="all")

| | word | occurrence |
|---|---|---|
| count | 12972 | 12902.0 |
| unique | 12972 | |
| top | aahed | |
| freq | 1 | |
| mean | | 4.87903e-06 |
| std | | 4.370009e-05 |
| min | | 7.192649e-12 |
| 25% | | 2.32131e-09 |
| 50% | | 1.953624e-08 |
| 75% | | 2.871247e-07 |


In [10]:
print("{} contains {} rows\n{} rows are NaN".format(fn,data.shape[0], data[data['occurrence'].isnull()].shape[0]))

wordle.tsv contains 12972 rows
70 rows are NaN


The row count matches the total number of words calculated above, but 70 words returned no results from this corpus in Google Books Ngram Viewer.

In [11]:
data[data['occurrence'].isnull()]

| | word | occurrence |
|---|---|---|
| 546 | avyze | |
| 551 | awdls | |
| 593 | azygy | |
| 1138 | boygs | |
| 1325 | byked | |
| ... | ... | ... |
| 10482 | ylkes | |
| 10528 | yrivd | |
| 10579 | zedas | |
| 10587 | zexes | |


I'll replace those NaN values with zero.

In [12]:
data['occurrence'] = data['occurrence'].replace(np.nan, 0)
sum(data['occurrence'].isnull())

0

Since the sum of the Boolean series is zero, none of its values are true, so every NaN value has been replaced.

Now I'll assign Wordle days to the subset of solutions, using each word's position in `answers` as its day. The solutions can then be separated from the rest of the word list by filtering on null values in the `day` field.

In [13]:
data['day'] = data['word'].apply(lambda w: answers.index(w) if w in answers else None)

In [14]:
data[~data['day'].isnull()].head()

| | word | occurrence | day |
|---|---|---|---|
| 10657 | cigar | 2.605142e-06 | 0.0 |
| 10658 | rebut | 1.067693e-06 | 1.0 |
| 10659 | sissy | 2.105774e-07 | 2.0 |
| 10660 | humph | 2.216358e-08 | 3.0 |
| 10661 | awake | 7.097157e-06 | 4.0 |
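
The `day` values display as floats because the missing entries are stored as NaN, which forces a floating-point column. As a hedged sketch, a dictionary lookup combined with pandas' nullable `Int64` dtype keeps the days as integers and avoids scanning `answers` once per word. The three-word lists here are small stand-ins for the real data.

```python
import pandas as pd

# Stand-ins for the real `answers` list and `data` frame
sample_answers = ["cigar", "rebut", "sissy"]
sample_data = pd.DataFrame({"word": ["aahed", "cigar", "sissy"]})

# Build the word -> day mapping once instead of calling list.index per row
day_by_word = {word: day for day, word in enumerate(sample_answers)}

# Words missing from the mapping become NaN, then <NA> after the cast
sample_data["day"] = sample_data["word"].map(day_by_word).astype("Int64")
print(sample_data)
```

Filtering on null values works the same way with this dtype, for example with `sample_data["day"].isna()`.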


In [15]:
data.describe(include="all")

| | word | occurrence | day |
|---|---|---|---|
| count | 12972 | 12972.0 | 2315.0 |
| unique | 12972 | | |
| top | aahed | | |
| freq | 1 | | |
| mean | | 4.852702e-06 | 1157.0 |
| std | | 4.358348e-05 | 668.427259 |
| min | | 0.0 | 0.0 |
| 25% | | 2.234261e-09 | 578.5 |
| 50% | | 1.904988e-08 | 1157.0 |
| 75% | | 2.829156e-07 | 1735.5 |


### Export data
Finally, sort and export the dataset to CSV format.

In [16]:
data.sort_values('word').to_csv('wordle.csv', index=False)