# EDA and Data Cleaning for Indicator Clustering
Unsupervised Learning Component of Milestone II group project:

Exploring Wordplay and Misdirection in Cryptic Crossword Clues with Natural Language Processing

## Imports

In [1]:
# Make sure wordfreq is installed
try:
    from wordfreq import zipf_frequency
except ImportError:
    %pip install wordfreq
    from wordfreq import zipf_frequency

[0m[31mERROR: Could not find a version that satisfies the requirement wordfreq (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for wordfreq[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


ModuleNotFoundError: No module named 'wordfreq'

In [None]:
# Make sure enchant is installed
try:
    import enchant
except ImportError:
    # Install the underlying C library for enchant
    !sudo apt-get update -qq # Update package list silently
    !sudo apt-get install -y enchant-2 # Install the enchant C library (version 2)

    # Install the Python wrapper for enchant
    %pip install pyenchant

    # Try importing again after installation
    import enchant

In [None]:
#### NLTK Setup
import nltk
from nltk.corpus import wordnet as wn

try:
    wn.synsets("test")
except LookupError:
    nltk.download("wordnet", quiet=True)

In [None]:
# imports
import os
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import unicodedata

## Loading the Data

In [None]:
# Mount Google Drive (required every time) - Comment out for local use
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
# Define and check the paths
# PROJECT_ROOT assumes the shared Milestone II folder is in your root google drive
# Uncomment the next line on Colab (Sahana's root filepath)
#PROJECT_ROOT = "/content/drive/MyDrive/Milestone II - NLP Cryptic Crossword Clues"
PROJECT_ROOT = ".." # Victoria's local root filepath

if not os.path.exists(PROJECT_ROOT):
    PROJECT_ROOT = os.path.abspath("..")  # fallback for local runs

# Build the sub-paths only after PROJECT_ROOT is final, so the fallback applies to them
DATA_DIR = f"{PROJECT_ROOT}/data"
NOTEBOOK_DIR = f"{PROJECT_ROOT}/notebooks"

In [None]:
# Read each CSV file into a DataFrame
df_clues = pd.read_csv(f'{DATA_DIR}/clues_raw.csv')
df_indicators = pd.read_csv(f'{DATA_DIR}/indicators_raw.csv')
df_ind_by_clue = pd.read_csv(f'{DATA_DIR}/indicators_by_clue_raw.csv')
df_ind_consolidated = pd.read_csv(f'{DATA_DIR}/indicators_consolidated_raw.csv')
df_charades = pd.read_csv(f'{DATA_DIR}/charades_raw.csv')
df_charades_by_clue = pd.read_csv(f'{DATA_DIR}/charades_by_clue_raw.csv')

## Reformat `clue_ids`

### Indicators Table `clue_ids`

In [None]:
# Uncomment to see how the clue_id data looks before cleaning
#df_indicators.sample().style.set_properties(**{"white-space": "pre-wrap"})

In [None]:
# Instead of a string with redundant indices, extract only the clue_ids in
# brackets to create a list of integers
df_indicators["clue_ids"] = (
    df_indicators["clue_ids"]
    .str.findall(r"\[(\d+)\]")
    .apply(lambda xs: [int(x) for x in xs])
)

# Include a new column to keep track of how many clues have this indicator
df_indicators["num_clues"] = df_indicators["clue_ids"].apply(len)

In [None]:
df_indicators.sample(5).style.set_properties(**{"white-space": "pre-wrap"})

### Charades Table `clue_ids`

In [None]:
# Uncomment to see what the clue_ids look like before cleaning
#df_charades.sample().style.set_properties(**{"white-space": "pre-wrap"})

In [None]:
# Instead of a string with redundant indices, extract only the clue_ids in
# brackets to create a list of integers
df_charades["clue_ids"] = (
    df_charades["clue_ids"]
    .str.findall(r"\[(\d+)\]")
    .apply(lambda xs: [int(x) for x in xs])
)

# Include a new column to keep track of how many clues have this charade
df_charades["num_clues"] = df_charades["clue_ids"].apply(len)

In [None]:
df_charades.sample(5).style.set_properties(**{"white-space": "pre-wrap"})

## `clue_info()` - Investigate A Clue

`clue_info(n)` displays all the basic and derived information for the clue with `clue_id = n`.

In [None]:
# View all the info for a specific clue (by clue_id), including
# clue surface, answer, definition, charades, and indicators
def clue_info(n):
    clue_cols = ['clue_id', 'clue', 'answer', 'definition', 'source_url']
    display(
        df_clues[df_clues['clue_id'] == n][clue_cols].style.set_properties(
            subset=['clue', 'source_url'],
            **{"white-space": "pre-wrap"}
        )
    )
    print()
    display(df_charades_by_clue[df_charades_by_clue['clue_id'] == n])
    print()
    #display(df_indicators[df_indicators['clue_ids'].apply(lambda lst: n in lst)])
    print()
    display(df_ind_by_clue[df_ind_by_clue['clue_id'] == n])

In [None]:
clue_info(172894)

In [None]:
clue_info(358248)

In [None]:
clue_info(623961)

## All Available Tables
* Indicators
* Indicators by Clue
* Indicators Consolidated
* Clues
* Charades
* Charades by Clue

### Indicators

In [None]:
df_indicators.sample(5).style.set_properties(
        subset=["clue_ids"],
        **{"white-space": "pre-wrap"}
    )

### Indicators by Clue

In [None]:
df_ind_by_clue.head()

### Indicators Consolidated

This dataframe contains eight columns (one for each type of wordplay) and a single row. Each cell holds a newline-separated string of all consolidated indicators that George Ho found in the dataset for that wordplay type.

This data is better represented as a dictionary, so we create `ind_by_wordplay_dict` from `df_ind_consolidated`.

In [None]:
df_ind_consolidated

### Dictionary for Indicators Consolidated

`ind_by_wordplay_dict` is a dictionary whose keys are wordplay types and whose values are lists of all consolidated indicators for that type.

Nathan points out that some words in this dictionary contain '\' or other suspicious characters. Because we use `df_indicators` instead (it has clue IDs for each indicator), we are not cleaning this dictionary.

In [None]:
# Create a dictionary where the key is the wordplay type, and the value is
# the list of associated unique indicators.
ind_by_wordplay_dict = {}

for wordplay in df_ind_consolidated.columns:
  ind_by_wordplay_dict[wordplay] = df_ind_consolidated[wordplay].values[0].split('\n')

In [None]:
# Uncomment or change key to view all indicators for that wordplay
#ind_by_wordplay_dict['alternation']

In [None]:
# See how many unique indicators there are for each type of wordplay
for wordplay in ind_by_wordplay_dict:
  print(f"{wordplay}: {len(ind_by_wordplay_dict[wordplay])}")

### Clues

In [None]:
df_clues.head()

### Charades by Clue

In [None]:
df_charades_by_clue.sample(5)

### Charades

In [None]:
df_charades.head().style.set_properties(
        subset=["clue_ids"],
        **{"white-space": "pre-wrap"}
    )

# Find All Hiddens

We can easily find all hidden clues by taking the `clue` string, removing all whitespace, and searching for the answer as a string within the clue string.
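
A minimal sketch of that check, assuming each clue ends with an enumeration such as `(8)` and that punctuation and capitalization should be ignored along with whitespace. The function name `is_hidden_candidate` and the sample clue are hypothetical and do not come from the dataset.

```python
import re

def is_hidden_candidate(clue, answer):
    """Return True if the answer's letters appear consecutively in the clue."""
    # Drop a trailing enumeration such as "(8)" or "(3,5)" (assumed format)
    clue = re.sub(r"\([\d,\s\-]+\)\s*$", "", clue)
    # Keep only the letters a-z so spaces and punctuation are ignored
    clue_letters = re.sub(r"[^a-z]", "", clue.lower())
    answer_letters = re.sub(r"[^a-z]", "", answer.lower())
    return bool(answer_letters) and answer_letters in clue_letters

# Made-up clue: "hide" spans "speecH I DEarly"
print(is_hidden_candidate("Part of speech I dearly love (4)", "HIDE"))
```

A match only marks a candidate. A clue that happens to contain its answer as a whole word would also match, and reversed hiddens would need the answer string reversed first. The sketch also assumes `clue` and `answer` are strings, so missing values would have to be dropped before applying it to `df_clues`.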

In [None]:
display(df_clues.head())

c1 = "Acquisitive chap, as we see it (8)"
c2 = "Back yard fencing weak and sagging (6)"
c3 = df_clues['clue'].iloc[2]

In [None]:
c3

In [None]:
l3 = c3.split()
l3

In [None]:
c3_no_spaces = "".join(l3)
print(c3_no_spaces)

In [None]:
c3_no_spaces.lower()

In [None]:
re.findall('(,)', c3_no_spaces)

# Data Requirements & Unresolved Dilemmas


As we apply the requirements, our dataset of valid indicators will keep decreasing. Create a dataframe to keep track of how much data we're losing at each step.

* Once we restrict our dataset, do we have enough indicators for clustering (assume $2 < k < 12$)?

In [None]:
# Start with the counts from Indicators Consolidated
df_ind_counts = pd.DataFrame(columns=["ind_consolidated"]).copy()
for wordplay in ind_by_wordplay_dict:
  df_ind_counts.loc[wordplay] = len(ind_by_wordplay_dict[wordplay])

ind_con_total = df_ind_counts['ind_consolidated'].sum()

In [None]:
# Add the counts from Indicators
df_ind_counts['indicators'] = df_indicators.groupby(by=['wordplay']).count()['indicator']
ind_total = df_ind_counts['indicators'].sum()
#df_ind_counts.loc['total'] = total
#df_ind_counts['indicators'].loc['total'] = df_ind_counts['indicators'].sum()

In [None]:
# Include a column that counts indicators by clue, which will
# double-count any indicator appearing in multiple clues
df_ind_counts['all_clues'] = df_ind_by_clue.count()

# Rearrange the columns to go from large to small, remove counts from
# ind_consolidated because they don't have associated clue IDs.
df_ind_counts = df_ind_counts[['all_clues', 'indicators']]

In [None]:
print(f"Indicators Consolidated: {ind_con_total}")
print(f"Indicators: {ind_total}")
print(f"all_clues sum: {df_ind_counts['all_clues'].sum()}")
print(df_ind_by_clue['clue_id'].count())

In [None]:
df_ind_counts

Summary:
* Of the entire dataset of 660,613 cryptic crossword clues, 88,037 clues came from blog posts where indicators could be identified. (from `df_ind_by_clue`)
* Because sometimes clues have more than one indicator, a total of 93,867 indicators were found in the dataset, and are associated with a parsed clue. (from `df_ind_by_clue`)
* Cryptic crossword clues (CCCs) reuse indicators. Of the 93,867 indicators identified in the data, only about 16,000 are unique.
* More unique indicators appear in `df_ind_consolidated` (16,061) than in `df_indicators` (15,735). We cannot easily discover why because the Indicators Consolidated table was stripped of context.
* <b>We will use the Indicators table</b> going forward because it cites which clues used that indicator. We can verify the quality of the data better.
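
To put the summary counts in proportion, here is a small arithmetic sketch. It uses only the figures quoted in the bullets; the variable names are ours.

```python
# Counts quoted in the summary above
total_clues = 660_613
clues_with_indicators = 88_037
indicator_instances = 93_867
unique_indicators = 15_735  # unique count from df_indicators

coverage = clues_with_indicators / total_clues
per_clue = indicator_instances / clues_with_indicators
reuse = indicator_instances / unique_indicators

print(f"Share of clues with an identified indicator: {coverage:.1%}")
print(f"Indicators per parsed clue: {per_clue:.2f}")
print(f"Average uses per unique indicator: {reuse:.1f}")
```

The last ratio is only an average. The `num_clues` column shows how unevenly that reuse is spread across indicators.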




### Indicator must be a single word
We will (initially) represent single words as vectors in a semantic space (words with similar meanings are nearby).


In [None]:
df_indicators.head()

In [None]:
# Create a column in df_indicators to keep track of indicator word count
df_indicators['ind_wc'] = df_indicators['indicator'].apply(lambda x: len(x.split()))


In [None]:
# Visualize the distribution of indicator word counts
df_indicators['ind_wc'].value_counts().plot(kind='bar')

In [None]:
# Create a subset of just single word indicators
df_ind_one_word = df_indicators[df_indicators['ind_wc'] == 1].copy()

# Remove the ind_wc column
df_ind_one_word.drop(columns=['ind_wc'], inplace=True)

In [None]:
df_ind_one_word.head()

In [None]:
# How many one-word indicators are left?
len(df_ind_one_word)

In [None]:
# Check how many indicators we have left, by wordplay type
df_ind_one_word.groupby(by=['wordplay']).count()['indicator']

In [None]:
# Add this column to the Indicator Count
df_ind_counts['one_word'] = df_ind_one_word.groupby(by=['wordplay']).count()['indicator']

In [None]:
df_ind_counts

### Indicator must be a valid word
Browsing indicators, it appears some are not valid words.

<b>What dictionary should we use to verify a word is valid?</b> Keep in mind that puzzle creators are often from the UK and Australia, not just the USA.

We initially investigate three ways to determine if a word is valid:
1. <b>Zipf frequency</b> score using the [`wordfreq` python library](https://pypi.org/project/wordfreq/).
2. Whether the word is in a <b>WordNet synset</b>.
3. Whether the word is in any of <b>Enchant's English language dictionaries for spellcheck</b> (US, UK, AU, CA).

NOTE: Later in this notebook we look at indicators of unreasonable letter lengths to see <i>how</i> the data is malformed, in case we can correct it.

In [None]:
# Prepare the dictionaries and helper function to use pyenchant
d_US = enchant.Dict("en_US")
d_UK = enchant.Dict("en_GB")
d_AU = enchant.Dict("en_AU")
d_CA = enchant.Dict("en_CA")

def any_english(word):
    return d_US.check(word) or d_UK.check(word) or d_AU.check(word) or d_CA.check(word)


In [None]:
# Make sure all indicators are lower case
df_ind_one_word['indicator'] = df_ind_one_word['indicator'].apply(
    lambda x: x.lower()
    )

In [None]:
# Get the Zipf Word Frequency Score (higher for common words, 0 for nonwords)
df_ind_one_word['zipf_score'] = df_ind_one_word['indicator'].apply(
    lambda x: zipf_frequency(x, "en", wordlist='large', minimum=0.0)
)

In [None]:
# See if the indicator is in WordNet as a synset
df_ind_one_word['in_wordnet'] = df_ind_one_word['indicator'].apply(
    lambda x: bool(wn.synsets(x))
)

In [None]:
# See if the indicator is in any pyenchant English dictionary
df_ind_one_word['enchant_check'] = df_ind_one_word['indicator'].apply(any_english)

#### Correct Misspellings
Use pyenchant's corrections to look for the correct spelling of the malformed word.

Note that we will want an embedding that accounts for (UK) slang. For example, the anagram indicator "dicky" appears 23 times in the data:

* adjective, informal British English
* <i>adjective: dicky; comparative adjective: dickier; superlative adjective: dickiest; adjective: dickie</i>

* (of a part of the body, a structure, or a device) not strong, healthy, or functioning reliably.
"a man with a dicky leg"

In [None]:
# How many indicators are misspelled according to Enchant?
len(df_ind_one_word[df_ind_one_word['enchant_check'] == False])

In [None]:
# Take a look at "misspelled" indicators
df_ind_one_word[df_ind_one_word['enchant_check'] == False].sort_values(by='num_clues', ascending=False).head(30)

In [None]:
# Create a column that conservatively determines whether the indicator is a
# valid word based on Zipf frequency, WordNet synset, and enchant spellcheck
#ZIPF_CUTOFF = 2.0 # conservative but may miss technical, archaic, or extremely rare words
ZIPF_CUTOFF = 1.5 # probably appropriate for cryptic crosswords

df_ind_one_word['valid_word'] = (
    (df_ind_one_word["zipf_score"] >= ZIPF_CUTOFF) |
    df_ind_one_word["in_wordnet"] |
    df_ind_one_word["enchant_check"]
)

In [None]:
# How many indicators are invalid words according to our formula?
len(df_ind_one_word[df_ind_one_word['valid_word'] == False])

In [None]:
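# Create another column that flags suspicious words (more conservative):
# a word is flagged if it fails ANY of the three checks, whereas valid_word
# only requires it to pass one of them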
df_ind_one_word['suspicious_word'] = (
    (df_ind_one_word["zipf_score"] <= ZIPF_CUTOFF) |
    (df_ind_one_word["in_wordnet"] == False) |
    (df_ind_one_word["enchant_check"] == False)
)

In [None]:
# How many indicators are suspicious words according to our formula?
len(df_ind_one_word[df_ind_one_word['suspicious_word']])
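
The two flags are easy to confuse, so here is a plain-Python sketch of the same logic applied to single words. The three example score tuples are hypothetical stand-ins, not real lookups from `wordfreq`, WordNet, or Enchant.

```python
ZIPF_CUTOFF = 1.5  # same cutoff as in the notebook

def is_valid(zipf_score, in_wordnet, enchant_check):
    # Lenient: passing ANY one check is enough
    return zipf_score >= ZIPF_CUTOFF or in_wordnet or enchant_check

def is_suspicious(zipf_score, in_wordnet, enchant_check):
    # Strict: failing ANY one check raises the flag
    return zipf_score <= ZIPF_CUTOFF or not in_wordnet or not enchant_check

# Hypothetical (zipf_score, in_wordnet, enchant_check) values
examples = {
    "passes all three": (4.2, True, True),
    "fails only enchant": (3.0, True, False),
    "fails all three": (0.0, False, False),
}
for label, checks in examples.items():
    print(label, is_valid(*checks), is_suspicious(*checks))
```

A word that fails only one check is both valid and suspicious. That is why the suspicious list is much longer than the invalid list, and why the two lists are inspected separately.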

In [None]:
# Take a look at the most frequent invalid words
# Indicators appearing in multiple clues are more likely to be valid
df_ind_one_word[
    (df_ind_one_word['valid_word'] == False)
    ].sort_values(by='num_clues', ascending=False)

In [None]:
# Take a look at the least frequent suspicious words
# Indicators appearing in multiple clues are more likely to be valid
df_ind_one_word[
    (df_ind_one_word['suspicious_word'] == True)
    ].sort_values(by='num_clues', ascending=True).head(30)

From manual inspection, there are a few trends in these invalid words:
* <b>Poorly parsed</b>: "acciden" [422350] or "christm" [76808]
* Some words are rare because they have added <b>(multiple) prefixes or suffixes</b> to a common base word. A human reader could easily understand the meaning of the word, but it might not appear commonly in that tense or part of speech. For example: "anagrammed" [590811, 592877] instead of "anagram", or unusual adverbs like "goofily" [583747] or "excitably" [338593] or "dodderingly" [660302]
* Some words are similar to more common words, and the CCC <b>author may be stretching</b> to make the puzzle work (or maybe a UK thing?): "mispresented" [313936] is nearly "misrepresented", "misshaped" [460580] is nearly "misshapen"





In [None]:
# Take a look at all words with a Zipf Score of 0 and NOT in WordNet or enchant.
# These are slightly stricter requirements than our valid_word check.
invalid_indicators = df_ind_one_word[
    (df_ind_one_word['zipf_score'] == 0) &
    (df_ind_one_word['in_wordnet'] == False) &
    (df_ind_one_word['enchant_check'] == False)
    ]

print(len(invalid_indicators))
display(invalid_indicators.sort_values(by='num_clues', ascending=False))

### What to do with uncommon words?
Most of these words would make sense to a human reader (and genAIs). HOWEVER, can we represent them in a vector space if they're this uncommon and we rely on pre-trained models that would have never seen these words?

We could manually inspect and correct a few words that look poorly parsed, or misspelled.

What about words that are very close to "real" words? WESTERNLY -> WESTERLY

Or what about taking the roots, but also documenting what part of speech (POS) it is to preserve that information?
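
One way to explore the near-match question is fuzzy string matching. The sketch below uses the standard library's `difflib` rather than pyenchant's suggestions. The five-word vocabulary, the helper name `nearest_known_word`, and the 0.85 cutoff are illustrative assumptions, and a real run would need a full word list.

```python
from difflib import get_close_matches

# Tiny stand-in vocabulary; a real run would use a full word list
vocabulary = ["westerly", "western", "easterly", "misrepresented", "misshapen"]

def nearest_known_word(word, vocab, cutoff=0.85):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    matches = get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for odd_word in ["westernly", "mispresented", "misshaped", "christm"]:
    print(odd_word, "->", nearest_known_word(odd_word, vocabulary))
```

Any automatic replacement changes the indicator the setter actually wrote, so suggestions like these are probably best treated as prompts for manual review, not silent corrections.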

### TO DO: Manually Correct Typos

We have a list of `clue_ids` that appear to have poorly parsed indicators (found by manually inspecting the 47 words with a Zipf Score of 0 and NOT in WordNet).

In [None]:
# Load the working file of poorly parsed indicators (the cells below use it)
df_invalid_indicators = pd.read_csv(f'{DATA_DIR}/invalid_indicators_poor_parsings.csv')

In [None]:
df_invalid_indicators

In [None]:
# Manually inspect each clue

n = 10
clue_id = df_invalid_indicators['clue_id'].iloc[n]
# clue_info() displays its tables itself and returns None, so call it directly
clue_info(clue_id)

In [None]:
# Suggest a corrected indicator word based on the clue surface.
df_invalid_indicators.loc[
    df_invalid_indicators['clue_id'] == clue_id,
    'corrected_indicator'
    ] = "unhappy"

In [None]:
print(ind_by_wordplay_dict.keys())

In [None]:
df_invalid_indicators

In [None]:
# Save progress by overwriting the invalid_indicators_poor_parsings.csv file
df_invalid_indicators.to_csv(f'{DATA_DIR}/invalid_indicators_poor_parsings.csv', index=False)

### SKIP: Exclude an indicator if it comes from a clue that was malformed?

<b>[Because even "invalid" indicators are readable, let's not worry about this.]</b>

If we go back to the `df_clues` dataframe, we could identify all clues (rows) corresponding to malformed data, and then use the `clue_rowid` field in `df_indicators` to exclude those indicators.

However, this might not actually be a problem. Compare the number of total clues to the number of clues referenced in the indicators dataframe. Maybe George wasn't able to extract indicators from any of those malformed clues.


### When are two indicators "the same"?
What sort of stemming or lemmatization do we want to use, if any? For example, the "hidden" wordplay type in `df_ind_consolidated` contains some very similar indicators:
* contribute to
* contributes to
* contributing
* contributing in
* contributing to
* contribution from
* contribution to
* contributors to

<b>Do we want to preserve part of speech (POS)</b>, even if it means we have multiple instances of very similar words (contribute versus contributor)?


### When is it appropriate to just define a stopword (and salvage a 2-word indicator)?

<b>[Too much redundancy if we use roots/stems.]</b>

How important are common words often dismissed as stopwords in NLP, like "to", "in" and "from"? In the "contribute" example above, is it safe to drop these words?

Or are there words in indicators that we can justify excluding based on our domain knowledge of cryptics? One candidate is <a href="https://chesterley.github.io/howto/linkwords.htm#:~:text=Wordplay%20devices%20Connectors-,Wordplay%20devices,or%20connectors%20depending%20on%20context.">common "link" words</a>, which make the surface reading of a clue more natural and "link" the definition to the wordplay. By definition they don't belong to the fodder, indicator, or definition.
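
To see what dropping such words would do, here is a small sketch applied to the "contribute" variants listed earlier. The `LINK_WORDS` set and the helper name are assumptions for illustration, not a vetted stopword list.

```python
# Link words we might drop; this set is an assumption for illustration
LINK_WORDS = {"to", "in", "from"}

def strip_trailing_link_words(indicator):
    """Drop link words from the end of an indicator phrase."""
    words = indicator.lower().split()
    # Always keep at least one word so the indicator never becomes empty
    while len(words) > 1 and words[-1] in LINK_WORDS:
        words.pop()
    return " ".join(words)

# The "hidden" indicators from the contribute example
variants = [
    "contribute to", "contributes to", "contributing", "contributing in",
    "contributing to", "contribution from", "contribution to", "contributors to",
]
salvaged = sorted({strip_trailing_link_words(v) for v in variants})
print(salvaged)
```

The eight phrases collapse to five single words that still differ in part of speech, so this step alone does not settle the stemming question.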

### BIG PICTURE ISSUE: `wordplay` labels are subjective, interconnected, hierarchical
There may not exist clear-cut clusters, even if we had impeccable data.

George Ho's wordplay categories don't align with Minute Cryptic (and others), but presumably were aligned with the blogs he scraped. This is relevant if we try constrained clustering (semi-supervised technique), or when we try to interpret unsupervised clustering results. Might be relevant to our choice of clustering algorithms and parameters.

Most clear cut and distinct wordplay types:
* Anagram
* Reversal
* Homophone
* Hidden

These may be messy (because they're opposites?):
* Container
* Insertion (Ho only), opposite of Container?

Messier and interconnected:
* Alternation (Ho only), entangled with Deletion? A subset of Selection?
* Selection (Minute only), maybe need to define Extraction as a type?
* Deletion, entangled with Alternation and Selection?

Not sure if this counts as wordplay:
* Substitution (Minute only), a wordplay type(?) but no associated indicator, maybe derived from charade?

### Investigating Letter Lengths of Indicators

In [None]:
df_ind_one_word['letter_length'] = df_ind_one_word['indicator'].str.len()

In [None]:
df_ind_one_word

In [None]:
df_ind_one_word.dtypes

In [None]:
# Visualize the distribution of indicator letter lengths
df_ind_one_word['letter_length'].value_counts().sort_index().plot(kind='bar')

In [None]:
df_ind_one_word[df_ind_one_word['letter_length'] == 3].sort_values(by='num_clues', ascending=True).head(30)

In [None]:
clue_info(76085)

In [None]:
df_ind_one_word.groupby(by=['letter_length', 'wordplay']).count().head(30)

In [None]:
# How many valid 1-letter words are there? Why weren't these caught by the valid word check?
print(f"number of 1-letter indicators: {len(df_ind_one_word[df_ind_one_word['letter_length'] == 1])}")
display(df_ind_one_word[df_ind_one_word['letter_length'] == 1].sort_values(by=['indicator','num_clues'], ascending=True))


In [None]:
clue_info(75852)

Having inspected 75852 and 8095, my guess is all 1-letter indicators are completely wrong. Do we want to just ignore them or manually inspect them based on url?

In [None]:
# What about two-letter words?
print(f"number of 2-letter indicators: {len(df_ind_one_word[df_ind_one_word['letter_length'] == 2])}")

display(df_ind_one_word[df_ind_one_word['letter_length'] == 2].sort_values(by=['indicator', 'num_clues']))

In [None]:
clue_info(177077)

After manually inspecting the 29 two-letter indicators, we judge it safe to keep any that appear in at least 5 clues.

We may want to keep others, but <b>to not introduce our own judgement about what makes a good indicator, do we need to inspect ALL suspicious indicators to rule them out?</b>
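
If we did adopt these tentative rules, they could be written down as a single filter. This is only a sketch of the idea. Dropping every 1-letter indicator is still a guess, and the helper name and example pairs are hypothetical.

```python
def keep_short_indicator(indicator, num_clues, min_clues=5):
    """Apply the tentative rule for very short one-word indicators."""
    if len(indicator) == 1:
        return False  # assumption: every 1-letter indicator is a parsing error
    if len(indicator) == 2:
        return num_clues >= min_clues  # keep only well-attested 2-letter words
    return True  # longer indicators are not affected by this rule

# Hypothetical (indicator, num_clues) pairs
for indicator, num_clues in [("a", 3), ("in", 40), ("ox", 1), ("about", 1)]:
    print(indicator, num_clues, keep_short_indicator(indicator, num_clues))
```

Keeping the threshold as a parameter makes it easy to report how many indicators survive under different choices of `min_clues`.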

### Make sure indicators appear in the clue as stand-alone words

Some of the mis-parsed indicators are actually segments of fodder. They aren't even complete words in the clue surface. 
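
A minimal sketch of such a check, using regex word boundaries. The function name and both clue surfaces are made up for illustration; "christm" is one of the poorly parsed indicators noted earlier.

```python
import re

def appears_as_whole_word(indicator, clue):
    """Return True if the indicator occurs in the clue as a complete word or phrase."""
    pattern = r"\b" + re.escape(indicator) + r"\b"
    return re.search(pattern, clue, flags=re.IGNORECASE) is not None

# Made-up clue surfaces for illustration
print(appears_as_whole_word("about", "Turn about in the garden (4)"))
print(appears_as_whole_word("christm", "Christmas present unwrapped (4)"))
```

The `\b` anchors assume the indicator starts and ends with a letter or digit. An indicator with leading or trailing punctuation would need to be cleaned first.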

In [None]:
df_ind_one_word.head()

In [None]:
df_indicators.head()

In [None]:
df_clues

In [None]:
clue_info(330895)

In [None]:
"developing from" in ind_by_wordplay_dict['anagram']

In [None]:
# Let's take a look at the longest words. Those are suspicious too.
df_ind_one_word.sort_values(by='letter_length', ascending=False).head(10) # nope, looks fine!

Saving one word indicators (Note from Sahana: I know we still have more work to do here)

In [None]:
df_ind_one_word.to_csv(f'{DATA_DIR}/df_ind_one_word.csv')

In [None]:
from google.colab import drive
drive.mount('/content/drive')