# Wordle Data Scraping

This notebook collects Wordle data from three sources. Daily player scores come from https://twitter.com/WordleStats, a Twitter account that aggregates scores from player tweets. The answer to each puzzle comes from https://wordfinder.yourdictionary.com/wordle/answers/. Word frequencies in English come from https://github.com/IlyaSemenov/wikipedia-word-frequency/tree/master, a project maintained by Ilya Semenov that counts word frequencies in Wikipedia articles. After collecting the data and organizing it into a pandas DataFrame, the notebook saves the results as a .csv file.

In [None]:
import snscrape.modules.twitter as sntwitter # to grab data from Twitter
from urllib.request import urlopen # to grab data from html
import pandas as pd

First, load the data from https://twitter.com/WordleStats:

In [1]:
tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:WordleStats').get_items()):
    if i >= 1000:
        break
    tweets.append(tweet.rawContent)

Stopping after 20 empty pages


In [2]:
# Look at an example tweet to determine the format of the data:
print('Example tweet format:\n')
print(tweets[0])

Example tweet format:

#Wordle 696 2023-05-16
17,831 results found on Twitter.
1,771 hard mode players.

1:  0%
2: 🟩 7%
3: 🟩🟩🟩🟩🟩🟩 25%
4: 🟩🟩🟩🟩🟩🟩🟩🟩 34%
5: 🟩🟩🟩🟩🟩 23%
6: 🟩🟩 9%
X:  1%

#Wordle696
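The format shown above can also be matched with regular expressions instead of fixed line positions. The block below is a minimal sketch, not the notebook's method: `parse_stats_tweet` and the three patterns are hypothetical names introduced here, and the sample text is the example tweet printed above. Because it searches for each field by pattern, a missing blank line would not break it, and any tweet that lacks the header, the counts, or all seven score lines yields `None`.

```python
import re

# Hypothetical helper, not part of the original notebook.
HEADER_RE = re.compile(r"#Wordle (\d+) (\d{4})-(\d{2})-(\d{2})")
COUNTS_RE = re.compile(
    r"([\d,]+) results found on Twitter\.\s*([\d,]+) hard mode players\."
)
# One score line: a label (1-6 or X), optional bar of squares, then a percentage
SCORE_RE = re.compile(r"^([1-6X]):.*?(\d+)%[ \t]*$", re.MULTILINE)


def parse_stats_tweet(text):
    """Return a dict of stats, or None if the text is not a full stats tweet."""
    header = HEADER_RE.search(text)
    counts = COUNTS_RE.search(text)
    scores = SCORE_RE.findall(text)
    if header is None or counts is None or len(scores) != 7:
        return None
    wordle_id, year, month, day = (int(g) for g in header.groups())
    n_players, n_hard_mode = (int(g.replace(",", "")) for g in counts.groups())
    row = {"wordle_id": wordle_id, "year": year, "month": month, "date": day,
           "n_players": n_players, "n_hard_mode": n_hard_mode}
    for label, pct in scores:
        row["pct_fail" if label == "X" else f"pct_{label}"] = int(pct)
    return row


# The example tweet printed above
sample = (
    "#Wordle 696 2023-05-16\n"
    "17,831 results found on Twitter.\n"
    "1,771 hard mode players.\n"
    "\n"
    "1:  0%\n"
    "2: 🟩 7%\n"
    "3: 🟩🟩🟩🟩🟩🟩 25%\n"
    "4: 🟩🟩🟩🟩🟩🟩🟩🟩 34%\n"
    "5: 🟩🟩🟩🟩🟩 23%\n"
    "6: 🟩🟩 9%\n"
    "X:  1%\n"
    "\n"
    "#Wordle696"
)
parsed = parse_stats_tweet(sample)
print(parsed)
```

Returning a dict per tweet also makes it possible to build the DataFrame in one call from a list of dicts, rather than appending one row at a time.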


In [3]:
# Import data from all tweets into pandas DataFrame:
error_indices = [] # To store the indices of the tweets where extracting the stats fails for some reason
df = pd.DataFrame(columns=['wordle_id','year', 'month', 'date', 'n_players', 'n_hard_mode', 'pct_1',
                           'pct_2', 'pct_3', 'pct_4', 'pct_5', 'pct_6', 'pct_fail'])

def extract_stats(tweet, df):
    """Adds the stats from a tweet by @WordleStats to a pandas DataFrame.
    Input:  tweet = a string of the tweet text
            df = DataFrame to hold the tweet data. Expecting the following columns (in order):
                 wordle_id, year, month, date, n_players, n_hard_mode, pct_1, pct_2, pct_3, pct_4,
                 pct_5, pct_6, pct_fail
    Output: none (the data is added as a new row in df)
    """
    tweet_lines = tweet.split('\n')
    # Grab the relevant data from the tweet and store as variables
    try:
        wordle_id = int(tweet_lines[0].split(' ')[1])
        year = int(tweet_lines[0].replace('-', ' ').split(' ')[2])
        month = int(tweet_lines[0].replace('-', ' ').split(' ')[3])
        date = int(tweet_lines[0].replace('-', ' ').split(' ')[4])
        n_players = int(tweet_lines[1].replace(',','').split(' ')[0])
        n_hard_mode = int(tweet_lines[2].replace(',','').split(' ')[0])
        row_data = [wordle_id, year, month, date, n_players, n_hard_mode]
        # Loop over the lines corresponding to each possible score:
        for i in range(4, 11): 
            row_data.append(int(tweet_lines[i].replace('%', '').split(' ')[-1]))
        df.loc[len(df)] = row_data
    except (ValueError, IndexError):
        # If the above failed for some reason, add the tweet index to a list to examine later
        print('error extracting stats from tweet at index', tweets.index(tweet))
        error_indices.append(tweets.index(tweet))


# Use the function defined above to extract the data from each tweet:
for tweet in tweets:
    extract_stats(tweet, df)
print('There are', len(df), 'valid data points')

error extracting stats from tweet at index 103
error extracting stats from tweet at index 104
error extracting stats from tweet at index 463
error extracting stats from tweet at index 468
error extracting stats from tweet at index 471
error extracting stats from tweet at index 480
error extracting stats from tweet at index 482
error extracting stats from tweet at index 487
error extracting stats from tweet at index 493
error extracting stats from tweet at index 494
There are 485 valid data points


Now, let's check the integrity of the data with the following tests:
- Check the tweets that failed to load data to determine why they failed and what to do with them
- Look at the unique values of Wordle ID to see if we missed any puzzles
- Check to see if any of the Wordle IDs are repeated
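The last two checks can also be written with set arithmetic and `Series.duplicated`. This is a small sketch on toy data; the `toy` frame and its values are made up for illustration and stand in for the scraped DataFrame.

```python
import pandas as pd

# Toy data standing in for the scraped DataFrame (values are illustrative only)
toy = pd.DataFrame({"wordle_id": [203, 204, 206, 206, 208]})

# Every ID we expect to see between the smallest and largest ID present
first_id = int(toy.wordle_id.min())
last_id = int(toy.wordle_id.max())
expected_ids = set(range(first_id, last_id + 1))

# IDs in the expected range with no row, and IDs with more than one row
missing_ids = sorted(expected_ids - set(toy.wordle_id))
repeated_ids = sorted(toy.wordle_id[toy.wordle_id.duplicated()].unique().tolist())

print("Missing:", missing_ids)
print("Repeated:", repeated_ids)
```

Deriving the range from the data means it cannot go stale as new tweets are scraped, but it also cannot detect puzzles missing before the first or after the last scraped ID.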

In [4]:
# Loop over the error indices and print out the invalid tweets:
print('Invalid tweets:\n')
for error in error_indices:
    print('Tweet index ' + str(error)+ ':')
    print(tweets[error], '\n')

# Find all the unique values of Wordle ID, and see if there are any missing puzzles:
unique_ids = df.wordle_id.unique()
print('\n\n')
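# Cover the full range of IDs present in the data rather than a hard-coded range
for i in range(df.wordle_id.min(), df.wordle_id.max() + 1):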
    if i not in unique_ids:
        print("Missing Wordle ID #" + str(i))

# Look for repeated Wordle IDs:
print('\nChecking for repeated Wordle IDs...')
repeated_ids = False
tweets_per_id = df.wordle_id.value_counts()
for i in list(tweets_per_id.index):
    if tweets_per_id[i] > 1:
        print('More than one tweet for Wordle ID #', i)
        repeated_ids = True
if not repeated_ids:
    print('No repeated IDs found.')

Invalid tweets:

Tweet index 103:
@QuickNovaCaleb Thousands of dollars a month, and I'm not interested in giving the owner of this site money anyway 

Tweet index 104:
As a result of this change this bot will shut down some time in the next week. 

Tweet index 463:
@kyfdx Whole group 

Tweet index 468:
@CristinaAmpil What would an aggregate distribution look like? I’m unfamiliar with stats/etc 

Tweet index 471:
@fudo @gooeyblob @WordleFRStats Yes, all players. I’ll see if adding the average makes sense, thanks! 

Tweet index 480:
220,950 results found on Twitter.
6,206 hard mode players.

1:  1%
2:  2%
3: 🟩🟩 11%
4: 🟩🟩🟩🟩🟩🟩 24%
5: 🟩🟩🟩🟩🟩🟩🟩 31%
6: 🟩🟩🟩🟩🟩🟩 26%
X: 🟩 6%

#Wordle213 

Tweet index 482:
@24Acoustics @PlanningActBlog You are correct! This should now be fixed. 

Tweet index 487:
@Gary_Boyd_NZ Just for you for today Gary:

3,073 hard mode players.

1:  1%
2:  4%
3: 🟩🟩🟩 16%
4: 🟩🟩🟩🟩🟩🟩 27%
5: 🟩🟩🟩🟩🟩🟩🟩 30%
6: 🟩🟩🟩🟩 19%
X: 🟩 4% 

Tweet index 493:
@WordleHaiku These are the full results fo

Most of the invalid tweets are just text instead of actually containing the stats for a Wordle puzzle. However, three of the tweets that weren't imported correctly actually contain data:
- The tweet at index 480 is missing the first line, which contains the Wordle ID and date. Examining the tweets before and after reveals that the tweet at index 480 is most likely Wordle ID #213.
- The tweet at index 487 failed because it is missing the first two lines. The tweet before is Wordle ID #207, and the tweet after is Wordle ID #208, so we aren't missing a day's stats. Looking at https://twitter.com/WordleStats/status/1481687496241164291?cxt=HHwWhsC9ma67gZApAAAA, it seems like someone asked for the stats for just the hard mode players, which is why this tweet is in a slightly different format. We can safely ignore this tweet.
- The tweet at index 494 is for Wordle ID #202, but it didn't import correctly because it's missing a blank line between the hard mode players and the scores.

The code block below contains the code used to explore these error tweets (which is now commented out), as well as code to add the stats from Wordle ID #213 and Wordle ID #202 by hand.

In [5]:
# Check the tweets that didn't import their data correctly (tweets at indices 480, 487, and 494):

# # Checking tweet at index 480:
# print('The tweet just before index 480 is Wordle ID #' + str(tweets[479][8:12]))
# print(tweets[479], '\n\n')
# print('The tweet just after index 480 is Wordle ID #' + str(tweets[481][8:12]))
# print(tweets[481])
# # So tweets[480] is Wordle ID #213
# print('So the tweet at index 480 is Wordle ID #213\n')
# print(tweets[480])

# # Checking tweet at index 487:
# print('\n\n')
# print('The tweet just before index 487:\n')
# print(tweets[486], '\n\n')
# print('The tweet just after index 487:\n')
# print(tweets[488])
# # It looks like tweets[487] is for Wordle ID #207, but just the stats for hard mode. See the following link:
# # https://twitter.com/WordleStats/status/1481687496241164291?cxt=HHwWhsC9ma67gZApAAAA

# # Checking tweet at index 494:
# print('\n\nThe tweet at index 494:\n')
# print(tweets[494])
# # tweets[494] is Wordle ID #202. It didn't import correctly because it is missing a blank line
# # between the hard mode players and the scores.

# Add Wordle ID #213:
df.loc[len(df)] = [213, 2022, 1, 18, 220950, 6206, 1, 2, 11, 24, 31, 26, 6]
# Add Wordle ID #202:
df.loc[len(df)] = [202, 2022, 1, 7, 80630, 1362, 1, 2, 23, 39, 24, 9, 1]

Next, let's look at the missing Wordle IDs, and see if we can find the data. It turns out that most of the missing Wordle IDs do actually have stats on https://twitter.com/WordleStats, but the scraping routine didn't grab these tweets for some reason. Since there are only a few missing Wordle IDs, we can add them by hand (see the code below). Two Wordle IDs are actually missing from the Twitter account (591 and 608). We will have to leave these out of the final dataset. 

In [6]:
# Check missing Wordle IDs by hand on twitter.com/WordleStats:

# Wordle ID #213 is actually in tweets[480] (already added)

# Wordle ID #273 
df.loc[len(df)] = [273, 2022, 3, 19, 156311, 8515, 0, 5, 21, 32, 26, 14, 3]

# Wordle ID #298
df.loc[len(df)] = [298, 2022, 4, 13, 123255, 7835, 1, 4, 29, 42, 18, 5, 1]

# Wordle ID #301
df.loc[len(df)] = [301, 2022, 4, 16, 107987, 7035, 0, 3, 19, 40, 28, 9, 1]

# Wordle ID #315
df.loc[len(df)] = [315, 2022, 4, 30, 77991, 5749, 0, 2, 10, 25, 35, 23, 4]

# Wordle ID #340
df.loc[len(df)] = [340, 2022, 5, 25, 62723, 4835, 0, 2, 9, 25, 33, 24, 6]

# Wordle ID #381
df.loc[len(df)] = [381, 2022, 7, 5, 44578, 3604, 1, 6, 25, 36, 23, 9, 1]

# Wordle ID #591 is actually missing

# Wordle ID #608 is actually missing
print(len(df))

493


In [7]:
# Confirm that we have data for all the puzzles between Wordle ID #202 and #696 (except for #591 and #608)
min_id = df.wordle_id.min()
max_id = df.wordle_id.max()
print('There should be', max_id - min_id + 1, 'Wordles (between ID #' + str(min_id), 'and #' + str(max_id) + ').')
print('We are missing Wordle ID #591 and #608.')
print('So there are', len(df), 'total Wordles in this dataset.')

There should be 495 Wordles (between ID #202 and #696).
We are missing Wordle ID #591 and #608.
So there are 493 total Wordles in this dataset.


In [8]:
# Sort the DataFrame by ascending values of Wordle ID:
df.sort_values('wordle_id', inplace=True)
# df.set_index('wordle_id', inplace=True, verify_integrity=True)
display(df.head(10))
display(df.tail(10))

Unnamed: 0,wordle_id,year,month,date,n_players,n_hard_mode,pct_1,pct_2,pct_3,pct_4,pct_5,pct_6,pct_fail
486,202,2022,1,7,80630,1362,1,2,23,39,24,9,1
484,203,2022,1,8,101503,1763,1,5,23,31,24,14,2
483,204,2022,1,9,91477,1913,1,3,13,27,30,22,4
482,205,2022,1,10,107134,2242,1,4,16,30,30,17,2
481,206,2022,1,11,153880,3017,1,9,35,34,16,5,1
480,207,2022,1,12,137586,3073,1,4,15,26,29,21,4
479,208,2022,1,13,132726,3345,1,2,13,29,31,20,3
478,209,2022,1,14,169484,3985,1,4,21,30,24,15,5
477,210,2022,1,15,205880,4655,1,9,35,34,16,5,1
476,211,2022,1,16,209609,4955,1,9,32,32,18,7,1


Unnamed: 0,wordle_id,year,month,date,n_players,n_hard_mode,pct_1,pct_2,pct_3,pct_4,pct_5,pct_6,pct_fail
9,687,2023,5,7,18039,1796,1,7,36,39,13,3,0
8,688,2023,5,8,16684,1705,0,2,18,35,28,14,3
7,689,2023,5,9,16256,1666,0,1,10,33,35,17,3
6,690,2023,5,10,17154,1713,1,7,29,37,20,6,1
5,691,2023,5,11,17777,1794,0,3,16,32,31,15,2
4,692,2023,5,12,18486,1802,0,7,28,34,21,9,1
3,693,2023,5,13,17209,1719,0,9,34,36,16,5,0
2,694,2023,5,14,17120,1764,0,5,19,29,25,17,4
1,695,2023,5,15,17727,1753,2,9,28,34,20,7,1
0,696,2023,5,16,17831,1771,0,7,25,34,23,9,1
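The separate `year`, `month`, and `date` columns can be combined into a single datetime column if that is more convenient downstream. A minimal sketch on toy rows follows; `puzzle_date` and `weekday` are column names introduced here for illustration, not columns of the notebook's DataFrame. `pd.to_datetime` assembles dates from columns named `year`, `month`, and `day`, so the `date` column is renamed for the conversion.

```python
import pandas as pd

# Two rows copied from the tables above (Wordle ID #202 and #696)
toy = pd.DataFrame({
    "wordle_id": [202, 696],
    "year": [2022, 2023],
    "month": [1, 5],
    "date": [7, 16],
})

# to_datetime expects the day-of-month column to be called "day"
parts = toy[["year", "month", "date"]].rename(columns={"date": "day"})
toy["puzzle_date"] = pd.to_datetime(parts)
toy["weekday"] = toy["puzzle_date"].dt.day_name()

print(toy)
```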


Now that we've imported the player stats for a wide range of Wordle puzzles, let's find the corresponding puzzle answers using https://wordfinder.yourdictionary.com/wordle/answers/:

In [9]:
url = 'https://wordfinder.yourdictionary.com/wordle/answers/'
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8").split('\n')

# # Look at the html lines corresponding to the start and end of the Wordle ID range of interest 
# # to get a sense for the format:
# print(html[148])
# print(html[2245])
# for i in range(148,155):
#     print(i, html[i], '\n\n')
# for i in range(2248, 2254):
#     print(i, html[i], '\n\n')

# Grab the answers corresponding to the Wordle IDs of interest:
id_lines = list(range(148, 2253, 4))
answers = {}
for line in id_lines:
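    try:
        puzzle_id = int(html[line].strip(' '))
        answer_index = html[line+1].index('"">') + 3
        answer = html[line+1][answer_index:answer_index+5].lower()
        answers[puzzle_id] = answer
    except (ValueError, IndexError):
        # Skip lines that don't hold a puzzle ID followed by an answer
        continue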
        
# Add a column to the DataFrame for the puzzle answer:
df['answer'] = df.wordle_id.apply(lambda x: answers[x])

# Check that the data was imported correctly:
df.head()

Unnamed: 0,wordle_id,year,month,date,n_players,n_hard_mode,pct_1,pct_2,pct_3,pct_4,pct_5,pct_6,pct_fail,answer
486,202,2022,1,7,80630,1362,1,2,23,39,24,9,1,slump
484,203,2022,1,8,101503,1763,1,5,23,31,24,14,2,crank
483,204,2022,1,9,91477,1913,1,3,13,27,30,22,4,gorge
482,205,2022,1,10,107134,2242,1,4,16,30,30,17,2,query
481,206,2022,1,11,153880,3017,1,9,35,34,16,5,1,drink


Finally, let's add data about the frequency of each answer word in the English language. This data was downloaded from https://github.com/IlyaSemenov/wikipedia-word-frequency/tree/master, which counts the frequency of words in Wikipedia articles. 

In [10]:
# Word frequency data from https://github.com/IlyaSemenov/wikipedia-word-frequency/tree/master

# Import the data from the .txt file into a DataFrame
freq_list = pd.read_csv('freq_list.txt', sep=" ", header=None)
freq_list.columns = ['word', 'counts']
freq_list.counts = freq_list.counts.astype(int)
display(freq_list.head())

freq = [] # The number of times the word appears in Wikipedia articles
freq_rank = [] # The frequency ranking for each word (most common word = 1, next most common = 2, etc.)
for word in df.answer:
    freq_index = freq_list[freq_list.word == word].index.values[0]
    freq.append(freq_list.at[freq_index, 'counts'])
    freq_rank.append(freq_index + 1)

# Add columns to the DataFrame, and check that the data was added correctly:
df['freq'] = freq
df['freq_rank'] = freq_rank
display(df.head())

Unnamed: 0,word,counts
0,the,186631452
1,of,88349543
2,in,76718795
3,and,76039670
4,a,54631147


Unnamed: 0,wordle_id,year,month,date,n_players,n_hard_mode,pct_1,pct_2,pct_3,pct_4,pct_5,pct_6,pct_fail,answer,freq,freq_rank
486,202,2022,1,7,80630,1362,1,2,23,39,24,9,1,slump,4782,23667
484,203,2022,1,8,101503,1763,1,5,23,31,24,14,2,crank,4162,25914
483,204,2022,1,9,91477,1913,1,3,13,27,30,22,4,gorge,17987,9915
482,205,2022,1,10,107134,2242,1,4,16,30,30,17,2,query,7659,17411
481,206,2022,1,11,153880,3017,1,9,35,34,16,5,1,drink,43697,5238
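The lookup loop above can also be expressed as a left merge, which avoids scanning the frequency list once per answer. This is a sketch on toy data, not a drop-in replacement: the four-row `toy_freq` list is made up for illustration (so its ranks differ from the real ones), and `zzzzz` is a placeholder for a word absent from the list. Unlike the loop, which raises an `IndexError` for an unknown word, the merge leaves `NaN` in the new columns.

```python
import pandas as pd

# Toy frequency list, already sorted from most to least common
toy_freq = pd.DataFrame({
    "word": ["the", "of", "drink", "slump"],
    "counts": [186631452, 88349543, 43697, 4782],
})
toy_freq["freq_rank"] = toy_freq.index + 1  # most common word has rank 1

toy = pd.DataFrame({
    "wordle_id": [202, 206, 999],
    "answer": ["slump", "drink", "zzzzz"],  # "zzzzz" is not in the list
})

merged = toy.merge(
    toy_freq.rename(columns={"counts": "freq"}),
    how="left",
    left_on="answer",
    right_on="word",
    validate="many_to_one",  # fail loudly if a word appears twice in the list
).drop(columns="word")

# Answers that were not found in the frequency list
unmatched = merged.loc[merged.freq.isna(), "answer"].tolist()
print(merged)
print("Unmatched answers:", unmatched)
```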


Now that we've imported the relevant data, export the DataFrame as a .csv file for quick loading into the analysis notebook:

In [11]:
print('Exporting data to .csv file...')
df.to_csv('wordle_data.csv', index=False)
print('Complete')

Exporting data to .csv file...
Complete
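A quick way to confirm that an export like this survives a round trip is to read the file back and compare it with the original. The sketch below writes a toy frame to an in-memory buffer rather than to `wordle_data.csv`, so it can run anywhere. Dtypes are ignored in the comparison on the assumption that a DataFrame built row by row may not hold the same dtypes that `read_csv` infers.

```python
import io

import pandas as pd

# Toy frame with two rows taken from the tables above
toy = pd.DataFrame({
    "wordle_id": [202, 203],
    "n_players": [80630, 101503],
    "answer": ["slump", "crank"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if any value or column name differs
pd.testing.assert_frame_equal(reloaded, toy, check_dtype=False)
print("Round trip OK:", reloaded.shape)
```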
