# EDA and Data Cleaning for Indicator Clustering
Unsupervised Learning Component of Milestone II group project: 

Exploring Wordplay and Misdirection in Cryptic Crossword Clues with Natural Language Processing

In [1]:
# imports
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Requirements & Unresolved Dilemmas

### Indicator must be a single word
We will (initially) represent single words as vectors in a semantic space (words with similar meanings are nearby).

* Once we restrict the dataset to single-word indicators, do we still have enough indicators for clustering (assume the number of clusters satisfies $2 < k < 12$)?
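
One way to approach this question is to count how many distinct single-word indicators survive per wordplay type. The sketch below runs against an in-memory SQLite table with made-up rows. The table name `toy_indicators` and its `wordplay` / `indicator` columns are assumptions for illustration, not the schema of the real dataset.

```python
import sqlite3

# Toy stand-in for the real data; table name, columns, and rows are invented.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_indicators (wordplay TEXT, indicator TEXT)")
toy.executemany(
    "INSERT INTO toy_indicators VALUES (?, ?)",
    [
        ("anagram", "mixed"),
        ("anagram", "broken"),
        ("anagram", "out of order"),
        ("hidden", "contributing to"),
        ("hidden", "within"),
        ("reversal", "back"),
    ],
)

# An indicator counts as a single word if, after trimming, it contains no space.
single_word_counts = dict(
    toy.execute(
        """
        SELECT wordplay, COUNT(DISTINCT indicator)
        FROM toy_indicators
        WHERE TRIM(indicator) <> '' AND TRIM(indicator) NOT LIKE '% %'
        GROUP BY wordplay
        """
    ).fetchall()
)
total_single = sum(single_word_counts.values())
print(single_word_counts, total_single)
```

If the total is small relative to the largest $k$ we want to try, or if one wordplay type loses almost all of its indicators, that would argue for salvaging some multi-word indicators rather than discarding them.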

### Indicator must be a valid word
Browsing the indicators shows that some are not valid words.

* What dictionary should we use to verify a word is valid? Keep in mind that puzzle creators are often from the UK and Australia, not just the USA.
* Should we inspect indicators with unreasonable letter lengths to see <i>how</i> the data is malformed, in case we can correct it?
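
A minimal sketch of such a check is shown below. The word set `toy_dictionary` is a hypothetical stand-in for whichever dictionary we settle on, the length thresholds are arbitrary, and the indicator strings are made up. It only illustrates how each indicator could be labelled with the reason it looks suspicious.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real dictionary (e.g. a British English word list).
toy_dictionary = {"mixed", "broken", "within", "back", "colour"}

# Made-up indicator strings, including some deliberately malformed ones.
toy_indicators = ["mixed", "colour", "bck", "x", "with in", "brokenbrokenbrokenbroken"]

def describe(indicator, min_len=2, max_len=20):
    """Return a short label saying why an indicator looks suspicious, or 'ok'."""
    word = indicator.strip().lower()
    # Letters only, optionally joined by hyphens or apostrophes.
    if not re.fullmatch(r"[a-z]+(?:[-'][a-z]+)*", word):
        return "not a single alphabetic token"
    # The length bounds are assumed thresholds, not derived from the data.
    if not (min_len <= len(word) <= max_len):
        return "unreasonable length"
    if word not in toy_dictionary:
        return "not in dictionary"
    return "ok"

labels = {ind: describe(ind) for ind in toy_indicators}
label_counts = Counter(labels.values())
print(label_counts)
```

Grouping the failures by label makes it easier to see whether a class of malformed entries (for example, two words run together) could be repaired instead of dropped.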

### Exclude an indicator if it comes from a clue that was malformed?
If we go back to the `df_clues` dataframe, we could identify all clues (rows) corresponding to malformed data, and then use the `clue_rowid` field in `df_indicators` to exclude those indicators.
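
The filtering step described above could look roughly like this. The records are invented, and the rule for what counts as "malformed" (an empty clue or a missing answer) is only an assumed example, since the notebook has not defined one yet.

```python
# Made-up clue records; the "malformed" rule below is only an assumed example.
toy_clues = [
    {"rowid": 1, "clue": "Some clue text (5)", "answer": "ABCDE"},
    {"rowid": 2, "clue": "", "answer": "FGH"},
    {"rowid": 3, "clue": "Another clue (4)", "answer": None},
]

# Made-up indicator records that point back to clues through clue_rowid.
toy_indicator_rows = [
    {"clue_rowid": 1, "indicator": "mixed"},
    {"clue_rowid": 2, "indicator": "back"},
    {"clue_rowid": 3, "indicator": "within"},
    {"clue_rowid": 1, "indicator": "about"},
]

def is_malformed(clue):
    """Assumed rule: a clue is malformed if its text or its answer is missing."""
    return not clue["clue"] or not clue["answer"]

# Collect the IDs of malformed clues, then keep only indicators from other clues.
bad_ids = {c["rowid"] for c in toy_clues if is_malformed(c)}
kept = [r for r in toy_indicator_rows if r["clue_rowid"] not in bad_ids]
print(sorted(bad_ids), [r["indicator"] for r in kept])
```

With pandas dataframes, the same filter can be written as `df[~df["clue_rowid"].isin(bad_ids)]`.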

### When are two indicators "the same"?
What sort of stemming or lemmatization do we want to use, if any? For example, the "hidden" wordplay type in `df_indicators_consolidated` contains some very similar entries for `indicator`:
* contribute to
* contributes to
* contributing
* contributing in
* contributing to
* contribution from
* contribution to
* contributors to

Do we want to preserve part of speech (POS), even if it means we have multiple instances of very similar words (contribute versus contributor)?
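
To make the trade-off concrete, the sketch below groups the "contribute" variants listed above by a deliberately crude suffix stripper. This is a hand-written illustration, not a real stemming or lemmatization algorithm, and the suffix list is an assumption chosen to fit this one example.

```python
from collections import defaultdict

variants = [
    "contribute to", "contributes to", "contributing", "contributing in",
    "contributing to", "contribution from", "contribution to", "contributors to",
]

# Hand-picked suffixes, longest first; NOT a real stemming algorithm.
SUFFIXES = ("ions", "ors", "ion", "ing", "or", "es", "e", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Group the head word of each indicator by its crude stem.
groups = defaultdict(set)
for phrase in variants:
    head = phrase.split()[0]
    groups[crude_stem(head)].add(head)

for stem, heads in groups.items():
    print(stem, sorted(heads))
```

All eight entries collapse into a single group here, which removes the near-duplicates but also erases the verb/noun distinction between "contribute" and "contributors". A lemmatizer that takes part of speech into account would keep some of these apart.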

### When is it appropriate to just define a stopword (and salvage a 2-word indicator)?
How important are common words often dismissed as stopwords in NLP, like "to", "in" and "from"? In the "contribute" example above, is it safe to drop these words?

Or are there words in indicators that we can justify excluding based on our domain knowledge of cryptics? What about <a href="https://chesterley.github.io/howto/linkwords.htm#:~:text=Wordplay%20devices%20Connectors-,Wordplay%20devices,or%20connectors%20depending%20on%20context.">common "link" words</a>, which make the surface reading of a clue more natural and "link" the definition to the wordplay? By definition, they don't belong to the fodder, indicator, or definition.
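
If we do decide that some words are safe to drop, salvaging a two-word indicator could be as simple as the sketch below. The `DROPPABLE` set contains the three prepositions mentioned above plus a few assumed extras. Which words really belong in it is exactly the open question.

```python
# Assumed, hand-picked function words; the right list is still undecided.
DROPPABLE = {"to", "in", "from", "of", "the", "a"}

def salvage(indicator):
    """Return the single remaining word after dropping function words, else None."""
    tokens = indicator.lower().split()
    content = [t for t in tokens if t not in DROPPABLE]
    # Only salvage when exactly one content word is left.
    return content[0] if len(content) == 1 else None

examples = ["contribute to", "contributing in", "contribution from", "out of order", "in"]
salvaged = {e: salvage(e) for e in examples}
print(salvaged)
```

Indicators that still contain two content words, or that consist only of a droppable word, come back as `None` so they can be reviewed separately rather than silently altered.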

### BIG PICTURE ISSUE: `wordplay` labels are subjective, interconnected, hierarchical 
There may not exist clear-cut clusters, even if we had impeccable data.

George Ho's wordplay categories don't align with those of Minute Cryptic (and others), but presumably they were aligned with the blogs he scraped. This matters if we try constrained clustering (a semi-supervised technique) or when we interpret unsupervised clustering results, and it may also be relevant to our choice of clustering algorithms and parameters.

Most clear cut and distinct wordplay types:
* Anagram
* Reversal
* Homophone
* Hidden

These may be messy (because they're opposites?):
* Container
* Insertion (Ho only), opposite of Container?

Messier and interconnected:
* Alternation (Ho only), entangled with Deletion? A subset of Selection?
* Selection (Minute only), maybe need to define Extraction as a type?
* Deletion, entangled with Alternation and Selection?

Not sure if this counts as wordplay:
* Substitution (Minute only), a wordplay type(?) but no associated indicator, maybe derived from charade?

## All Tables Available from the Raw Data

In [None]:
# Connect to the sqlite3 file
data_file = "../data/data.sqlite3"
conn = sqlite3.connect(data_file)

In [3]:
# Query the names of all tables in the file; uncomment the last line to display them
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
#tables

In [4]:
# Keep track of all tables that might be of interest from the original dataset
# Display the names and sizes of all tables.

tables = [
    "clues",
    "indicators",
    "charades",
    "indicators_by_clue",
    "charades_by_clue",
    "indicators_consolidated"
]

summary = []

for t in tables:
    # count rows
    row_count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {t};", conn).iloc[0]["n"]
    
    # count columns
    col_info = pd.read_sql(f"PRAGMA table_info({t});", conn)
    col_count = len(col_info)

    summary.append({
        "table": t,
        "rows": row_count,
        "columns": col_count
    })

summary_df = pd.DataFrame(summary)
summary_df.style.format({"rows": "{:,}"}) # display with commas 

DatabaseError: Execution failed on sql 'SELECT COUNT(*) AS n FROM clues;': no such table: clues
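
The `no such table: clues` error is what SQLite reports when the connected database does not contain that table. One possible cause (an assumption here, not something the notebook confirms) is that the path did not resolve to the intended file: by default `sqlite3.connect` silently creates a new, empty database when the file does not exist. The helper names `connect_existing` and `list_tables` below are hypothetical. This is a sketch of how to fail early instead.

```python
import sqlite3
from pathlib import Path

def connect_existing(path):
    """Open an existing SQLite file read-only; fail loudly if it is missing."""
    db_path = Path(path).resolve()
    if not db_path.is_file():
        raise FileNotFoundError(f"No SQLite file at {db_path}")
    # mode=ro opens the file read-only, so SQLite never creates a new empty file.
    return sqlite3.connect(f"{db_path.as_uri()}?mode=ro", uri=True)

def list_tables(connection):
    """Return the names of all tables in the database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `list_tables(connect_existing(data_file))` before the summary loop would show immediately whether the expected tables are present.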

## Tables Useful for Clustering Indicators

Create a dataframe for each table involving indicators.

In [None]:
# Create the dataframes related to indicators
df_indicators = pd.read_sql("SELECT * FROM indicators;", conn)
df_ind_by_clue = pd.read_sql("SELECT * FROM indicators_by_clue;", conn)
df_indicators_consolidated = pd.read_sql("SELECT * FROM indicators_consolidated;", conn)

# Create the dataframes pertaining to clues and charades
df_clues = pd.read_sql("SELECT * FROM clues;", conn)
df_charades = pd.read_sql("SELECT * FROM charades;", conn)
df_charades_by_clue = pd.read_sql("SELECT * FROM charades_by_clue;", conn)

In [None]:
# Look up one hand-picked row ID in each table to see how the tables relate
n = 139327
display(df_clues[df_clues['rowid'] == n])
display(df_charades_by_clue[df_charades_by_clue['clue_rowid'] == n])
display(df_indicators[df_indicators['rowid'] == n])
display(df_ind_by_clue[df_ind_by_clue['clue_rowid'] == n])

In [None]:
df_indicators.head()

In [None]:
df_ind_by_clue.head()

In [None]:
df_indicators_consolidated

