# Data Exploration, Take 2

We previously [explored](01_data_exploration.ipynb) and then [cleaned up](02_clean_data.ipynb) the data. Let's see if there is any further processing we need to do.

In [330]:
import pandas as pd

df = pd.read_csv("cleaned_data/clean_1.csv", keep_default_na=False)
df

Unnamed: 0,answer,clue
0,pat,"action done while saying ""good dog"""
1,rascals,mischief-makers
2,pen,it might click for a writer
3,sep,fall mo.
4,eco,kind to mother nature
...,...,...
780300,nat,actor pendleton
780301,shred,bit
780302,nea,teachers' org.
780303,beg,petition


## Considerations

We saw in the previous exploration some clues like "see 64-across" or "{see notepad}". We need to figure out how to identify these. Additionally, we should search for other unexpected characters.

### Reference Clues

In [331]:
# match clues that reference other clues (ex. 12 Down), but not 12 Downing Street
reference_clues = r'[0-9]+[-\s]+(?:down|across)+\b'

df[df['clue'].str.contains(reference_clues)]

Unnamed: 0,answer,clue
173,tsp,one of 768 in a 35-across: abbr.
215,afro,hairstyle for 2-down
268,mil,ending for some government 37-across
285,pinky,"with 13-down, playground promise"
300,swear,see 68-across
...,...,...
779024,brown,one 39-across
779034,columbia,one 39-across
779050,yale,one 39-across
779518,nude,like 52-across


In [332]:
df = df[~df['clue'].str.contains(reference_clues)]
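
The trailing `\b` in the reference pattern is what keeps "12 downing street" from being treated as a cross-reference. A small sketch of that behavior on made-up clues, using the standard `re` module directly (the name `reference_pattern` is just for illustration):

```python
import re

# Same regular expression as in the cell above, compiled for reuse.
reference_pattern = re.compile(r'[0-9]+[-\s]+(?:down|across)+\b')

# Hypothetical clues, not rows from the dataset.
samples = [
    "see 64-across",
    "hairstyle for 2-down",
    "12 downing street",  # \b fails between "down" and "ing"
    "across the pond",    # no leading number
]

results = [bool(reference_pattern.search(clue)) for clue in samples]
for clue, is_reference in zip(samples, results):
    print(clue, "->", is_reference)
```

`Series.str.contains` searches anywhere in the string, which is why `search` (rather than `match`) is the closest equivalent here.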

### See Notepad clues

In [292]:
df[df['clue'].str.contains(r'see notepad')]

Unnamed: 0,answer,clue
57852,one,{see notepad}
57855,two,{see notepad}
57860,three,{see notepad}
57865,four,{see notepad}
57869,five,{see notepad}
...,...,...
138080,mummy,see notepad
756259,ganymede,see notepad
756264,mainland,see notepad
756274,armorial,see notepad


In [293]:
df = df[~df['clue'].str.contains(r'see notepad')]

### Other Characters

In [294]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
74,epsilon,&epsilon;
213,chirp,"[more worms, mama!]"
704,ihope,[knock on wood]
765,lala,[i forgot the words ...]
865,orlando,"<em>""who's your favorite roguish 'star wars' c..."
...,...,...
754654,sic,[so written]
756836,symbol,"#, & or %"
766554,pct,%: abbr.
779591,gasp,[shock!]


So far we've identified three types of clues that we should clean up:
* Clues that use entity encoding (ex. \&epsilon;)
* Clues that have HTML tags in them (ex. <em>)
* Clues that use [_clue_]. These might be a common crossword format, but let's treat them as something to clean up for now.
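
One caveat about the allowed-character check: `Series.str.match` only requires the pattern to match at the start of each string, and `allowed_chars` has no end anchor (the `\$` inside the brackets is a literal dollar sign). The mask therefore flags clues that begin with an unexpected character, which is consistent with every row shown, but it would not flag one that appears later in a clue. A minimal sketch of the difference, using made-up clues and `Series.str.fullmatch` as one possible stricter check:

```python
import pandas as pd

allowed_chars = r'^[a-z0-9\s\."\_!\-\'\$]+'

# Hypothetical clues, not rows from the dataset.
clues = pd.Series(["fall mo.", "[shock!]", "rock & roll"])

# str.match anchors at the start only, so "rock & roll" passes.
starts_ok = clues.str.match(allowed_chars)

# str.fullmatch requires the whole string to consist of allowed characters.
fully_ok = clues.str.fullmatch(allowed_chars)

print(starts_ok.tolist())
print(fully_ok.tolist())
```

Switching to the stricter check would likely flag many more clues, since common punctuation such as commas is not in the allowlist, so the results in this notebook would change.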

#### Entity Encoding

In [295]:
entity_encoding = r'&[a-z]+;'
mask = df['clue'].str.match(entity_encoding)
df[mask]

Unnamed: 0,answer,clue
74,epsilon,&epsilon;
8521,eta,&eta;
9105,obeli,"&divide; and &dagger;, in typography"
9322,two,&larr;
11875,alpha,&alpha;
...,...,...
73969,rootofallevil,&radic;666
74853,omega,&omega;
148516,suits,"&spades;, &hearts;, &diams; and &clubs;"
196055,sigma,&sigma;


We could probably keep these, but let's keep things simple: ignore them for now and drop them during our next cleanup pass. In the future we can come back and consider handling these.
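
If we do come back to these, one option (a sketch, not something this notebook does) is to decode the entities with the standard library's `html.unescape`:

```python
import html

# Clues copied from the output above.
clues = ["&epsilon;", "&radic;666", "&spades;, &hearts;, &diams; and &clubs;"]

# html.unescape converts named and numeric character references to Unicode.
decoded = [html.unescape(clue) for clue in clues]
print(decoded)
```

The decoded characters fall outside the current allowlist, so the allowed-character check would need to be revisited as well.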

#### HTML Tags

In [296]:
html_tags = r'<[a-z]+>'
mask = df['clue'].str.match(html_tags)
df[mask]

Unnamed: 0,answer,clue
865,orlando,"<em>""who's your favorite roguish 'star wars' c..."
866,oralist,"<em>""how famous is that actress? is she unknow..."
879,orangered,"<em>""how do you handle losing? do you feel cal..."
892,ordeals,"<em>""what's the best way to spend less on shop..."
893,orchard,"<em>""what kind of greens do you want? spinach ..."
...,...,...
311791,yearzero,<em>beginning of time?</em>
315825,berniemac,"<i>""ocean's eleven"" actor</i>"
315833,escapepod,<i>small sci-fi vehicle</i>
315839,shoephone,"<i>""get smart"" device</i>"


Same story here - we could either keep or convert these, but let's just ignore them for now and drop during our next cleanup pass.
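
For reference, converting rather than dropping could look like the sketch below. It assumes only simple tags without attributes (such as the `<em>` and `<i>` seen above), and `strip_tags` is a hypothetical helper name:

```python
import re

# Matches simple opening and closing tags such as <em> or </i>.
# Tags with attributes are deliberately not handled.
simple_tag = re.compile(r'</?[a-z]+>')


def strip_tags(clue: str) -> str:
    """Remove simple HTML tags and keep the text between them."""
    return simple_tag.sub('', clue)


print(strip_tags('<i>"get smart" device</i>'))
print(strip_tags('<em>beginning of time?</em>'))
```

Whether the emphasis itself carries meaning for the puzzle is a separate question that stripping would lose.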

#### [Clues]

In [297]:
bracket_clues = r'\[.+\]'
mask = df['clue'].str.match(bracket_clues)
df[mask]

Unnamed: 0,answer,clue
213,chirp,"[more worms, mama!]"
704,ihope,[knock on wood]
765,lala,[i forgot the words ...]
1693,seenote,[more info below]
1792,omg,[i can't believe what i just read]
...,...,...
719216,gulp,[uh-oh!]
720883,gasp,[i am shocked!]
738199,sic,[as printed]
754654,sic,[so written]


Again, a small number of clues so let's just ignore these for now.
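
If we later decide to convert these instead, a minimal sketch is to unwrap clues that are entirely enclosed in one pair of brackets (`unwrap_brackets` is a hypothetical helper name):

```python
import re

# Matches a clue that is entirely wrapped in square brackets.
wrapped = re.compile(r'^\[(.+)\]$')


def unwrap_brackets(clue: str) -> str:
    """Return the text inside the brackets, or the clue unchanged."""
    match = wrapped.match(clue)
    return match.group(1) if match else clue


print(unwrap_brackets("[knock on wood]"))
print(unwrap_brackets("see [sic]"))
```

Unwrapping discards whatever signal the brackets carry, so it is only a reasonable choice if that convention does not matter downstream.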

#### Quick cleanup
Let's drop all of these rows and see where we're at.

In [298]:
entity_encoding = r'&[a-z]+;'
df = df[~df['clue'].str.contains(entity_encoding)]

html_tags = r'<[a-z]+>'
df = df[~df['clue'].str.contains(html_tags)]

bracket_clues = r'\[.+\]'
df = df[~df['clue'].str.contains(bracket_clues)]

#### Second pass

In [299]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
944,charlottesweb,"*children's book whose title character says ""i..."
2464,resign,*stay in power
2472,heave,*hold on to
2474,covert,*done openly
2482,feasts,*doesn't eat
...,...,...
744916,windy,"#1 song for the association, 1967"
747213,veep,"#2, informally"
756836,symbol,"#, & or %"
766554,pct,%: abbr.


##### Starred Clues
"Starred clues" (ex. *stay in power) are common in crosswords. We should just drop the asterisk (but keep clues that are only an asterisk).

In [300]:
df['clue'] = df['clue'].apply(lambda x: x.lstrip('*') if x != '*' else x)
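
A vectorized alternative, shown here as a sketch on made-up clues, uses `Series.str.replace` with a lookahead so that asterisks are removed only when something follows them:

```python
import pandas as pd

# Hypothetical clues, not rows from the dataset.
clues = pd.Series(["*stay in power", "*", "**", "fall mo."])

# Remove leading asterisks only when at least one character follows them.
stripped = clues.str.replace(r'^\*+(?=.)', '', regex=True)
print(stripped.tolist())
```

This differs slightly from the `lstrip` version: a clue consisting of several asterisks is reduced to a single one instead of becoming an empty string. Separately, because `df` is a filtered slice at this point, assigning a column may trigger pandas' `SettingWithCopyWarning`; calling `.copy()` after filtering is one way to avoid it.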

In [301]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
2498,username,"@ follower, on twitter"
2691,origin,"(0,0), in math"
3863,sopranos,"#1 on rolling stone's ""100 greatest tv shows o..."
7702,are,=
7964,hashtag,"#, in social media"
...,...,...
744916,windy,"#1 song for the association, 1967"
747213,veep,"#2, informally"
756836,symbol,"#, & or %"
766554,pct,%: abbr.


##### @ Clues

In [302]:
at_clues = r'@'
mask = df['clue'].str.match(at_clues)
df[mask]

Unnamed: 0,answer,clue
2498,username,"@ follower, on twitter"
91947,ats,@@@
130429,ats,@@@
148395,ats,@@@
169997,ats,@@@
181386,ats,@ @ @
185210,atsign,@
366614,aol,"@ follower, sometimes"
521594,each,@


These look fine, let's just allow them.
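
Each round in this section re-types the whole character class with one more symbol added. A small helper can build the pattern from a string of extra characters instead. This is a sketch; `BASE_CLASS` and `build_allowed_pattern` are hypothetical names, and `re.escape` is used so that characters such as `-` or `]` stay literal inside the brackets:

```python
import re

# Letters, digits, and whitespace are always allowed.
BASE_CLASS = r'a-z0-9\s'


def build_allowed_pattern(extra_chars: str) -> str:
    """Return a start-anchored pattern allowing BASE_CLASS plus extra_chars."""
    escaped = ''.join(re.escape(ch) for ch in extra_chars)
    return f'^[{BASE_CLASS}{escaped}]+'


# The punctuation allowed so far, plus the newly accepted "@".
pattern = build_allowed_pattern('."_!-\'$@')

print(bool(re.match(pattern, "@ follower, on twitter")))
print(bool(re.match(pattern, "#46")))
```

Allowing another symbol later is then a one-character change to the argument.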

In [303]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
2691,origin,"(0,0), in math"
3863,sopranos,"#1 on rolling stone's ""100 greatest tv shows o..."
7702,are,=
7964,hashtag,"#, in social media"
9082,biden,#46
...,...,...
744916,windy,"#1 song for the association, 1967"
747213,veep,"#2, informally"
756836,symbol,"#, & or %"
766554,pct,%: abbr.


##### () Clues

In [304]:
paren_clues = r'\(|\)'
mask = df['clue'].str.match(paren_clues)
df[mask]

Unnamed: 0,answer,clue
2691,origin,"(0,0), in math"
9392,ishmael,(spoiler alert!) sole survivor of the pequod
22280,origin,"(0,0,0)"
24497,paren,"), briefly"
26301,smile,"), when it follows :-"
...,...,...
616136,origin,"(0,0) on a graph"
666376,avon,"(ding-dong) ""___ calling"""
691182,parens,"( ), informally"
721769,boards,(the) stage


These look fine, let's just allow them.

In [305]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
3863,sopranos,"#1 on rolling stone's ""100 greatest tv shows o..."
7702,are,=
7964,hashtag,"#, in social media"
9082,biden,#46
14339,pct,%: abbr.
...,...,...
744916,windy,"#1 song for the association, 1967"
747213,veep,"#2, informally"
756836,symbol,"#, & or %"
766554,pct,%: abbr.


##### # Clues

In [306]:
pound_clues = r'#'
mask = df['clue'].str.match(pound_clues)
df[mask]

Unnamed: 0,answer,clue
3863,sopranos,"#1 on rolling stone's ""100 greatest tv shows o..."
7964,hashtag,"#, in social media"
9082,biden,#46
15432,hashtag,"#, on social media"
16306,ese,#name?
...,...,...
733952,oxford,#92
733976,london,#104
744916,windy,"#1 song for the association, 1967"
747213,veep,"#2, informally"


These look fine, let's just allow them.

In [307]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
7702,are,=
14339,pct,%: abbr.
16477,trebleclef,&#119070;
16686,spare,/
16764,sadface,:(
...,...,...
700344,riga,+european capital
700350,vandal,"+graffiti artist, e.g."
728951,tenof,:50
766554,pct,%: abbr.


##### = Clues

In [308]:
equals_clues = r'='
mask = df['clue'].str.match(equals_clues)
df[mask]

Unnamed: 0,answer,clue
7702,are,=
24056,equal,=
88751,equal,=


These look good, let's allow them.

In [309]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
14339,pct,%: abbr.
16477,trebleclef,&#119070;
16686,spare,/
16764,sadface,:(
19847,isto,:
...,...,...
700344,riga,+european capital
700350,vandal,"+graffiti artist, e.g."
728951,tenof,:50
766554,pct,%: abbr.


##### % Clues

In [310]:
percent_clues = r'%'
mask = df['clue'].str.match(percent_clues)
df[mask]

Unnamed: 0,answer,clue
14339,pct,%: abbr.
63998,stat,"% on the back of a baseball card, say"
131967,pct,%: abbr.
170391,pct,%: abbr.
348463,pct,%: abbr.
687617,pct,%: abbr.
766554,pct,%: abbr.


These are also fine, so let's allow them.

In [311]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
16477,trebleclef,&#119070;
16686,spare,/
16764,sadface,:(
19847,isto,:
21113,per,"/, maybe"
...,...,...
700343,concorde,+transportation provider since 1976
700344,riga,+european capital
700350,vandal,"+graffiti artist, e.g."
728951,tenof,:50


##### Entity Encoding Take 2

We missed clues that use numeric character references (ex. \&#119070;).

In [312]:
entity_encoding = r'&#[0-9]+;'
mask = df['clue'].str.match(entity_encoding)
df[mask]

Unnamed: 0,answer,clue
16477,trebleclef,&#119070;
274054,sexes,&#9794; and &#9792;


Let's drop these.

In [313]:
entity_encoding = r'&#[0-9]+;'
df = df[~df['clue'].str.contains(entity_encoding)]
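
The named and numeric forms could also be caught by a single pattern. A sketch, with `entity_reference` as a hypothetical name and made-up non-entity samples for contrast:

```python
import re

# Named references (&alpha;) or decimal numeric references (&#9794;).
# Hexadecimal references such as &#x1d11e; are not covered.
entity_reference = re.compile(r'&(?:[a-z]+|#[0-9]+);')

samples = ["&epsilon;", "&#9794; and &#9792;", "#, & or %", "rock & roll"]
found = [bool(entity_reference.search(clue)) for clue in samples]
print(found)
```

A bare ampersand is left alone because the pattern requires the closing semicolon.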

In [314]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
16686,spare,/
16764,sadface,:(
19847,isto,:
21113,per,"/, maybe"
22215,poles,+ and -
...,...,...
700343,concorde,+transportation provider since 1976
700344,riga,+european capital
700350,vandal,"+graffiti artist, e.g."
728951,tenof,:50


##### / Clues

In [315]:
slash_clues = r'/|\\'
mask = df['clue'].str.match(slash_clues)
pd.set_option('display.max_rows', 11)
df[mask]

Unnamed: 0,answer,clue
16686,spare,/
21113,per,"/, maybe"
80357,per,/
92762,per,/
109318,slash,/
199172,slash,/
204493,slash,/
280473,spare,"/, to a bowler"
281603,slash,/
509988,slash,/


These are all good.

In [316]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
16764,sadface,:(
19847,isto,:
22215,poles,+ and -
22308,winks,;) ;) ;)
23334,frown,:-(
...,...,...
700343,concorde,+transportation provider since 1976
700344,riga,+european capital
700350,vandal,"+graffiti artist, e.g."
728951,tenof,:50


##### Leading +

In [317]:
leading_plus = r'^\+'
mask = df['clue'].str.match(leading_plus)
df[mask]

Unnamed: 0,answer,clue
22215,poles,+ and -
24324,end,"+ or -, for a battery"
26415,sign,+ or -
31413,and,+ ... with a hint to four pairs of answers in ...
44325,ion,+ or - atom
...,...,...
700333,cane,+punish severely
700343,concorde,+transportation provider since 1976
700344,riga,+european capital
700350,vandal,"+graffiti artist, e.g."


Some of these seem valid, but others (where the plus is immediately followed by a letter) are ones we want to strip.

In [318]:
leading_plus = r'^\+[a-z]'
mask = df['clue'].str.match(leading_plus)
pd.set_option('display.max_rows', 15)
df[mask]

Unnamed: 0,answer,clue
700240,coalmine,+seamy locale
700242,nectar,+fruit juice
700249,nene,+gray-brown goose
700261,code,+building ___
700265,rind,+peel
700280,alia,+inter follower
700286,candor,+calling a spade a spade
700288,memorial,+monument
700303,mandarin,+bureaucrat
700306,arcane,+secret


Let's strip the leading plus from the clues where it is immediately followed by a letter, similar to what we did for the asterisk.

In [319]:
import re

df['clue'] = df['clue'].apply(lambda x: x.lstrip('+') if re.match(leading_plus, x) else x)
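
The same rule can be expressed as a single substitution with a lookahead, which removes the plus only when a letter follows. A sketch on clues taken from the outputs above (`strip_leading_plus` is a hypothetical helper name):

```python
import re

# A plus sign at the start of the clue, immediately followed by a letter.
leading_plus_letter = re.compile(r'^\+(?=[a-z])')


def strip_leading_plus(clue: str) -> str:
    """Drop a leading '+' when it is attached to a word; otherwise keep the clue."""
    return leading_plus_letter.sub('', clue)


print(strip_leading_plus("+peel"))
print(strip_leading_plus("+ or -"))
```

Clues where the plus is itself the subject, such as "+ or -", are left untouched because a space follows the sign.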

In [320]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
16764,sadface,:(
19847,isto,:
22308,winks,;) ;) ;)
23334,frown,:-(
29768,imsomad,>:-(
...,...,...
672372,steals,< season record for which rickey henderson had...
672376,ruth,< player with this retired number
672377,gehrig,< player with this retired number
672396,aaron,< player with this retired number


##### < clues >

In [321]:
leading_angle = r'^<\s.+'
mask = df['clue'].str.match(leading_angle)
df[mask]

Unnamed: 0,answer,clue
672298,musial,< player with this retired number
672306,willie,"< player with this retired number, informally"
672359,yankees,< team that won this many games in 1961
672360,indian,< member of the only team to win this many gam...
672372,steals,< season record for which rickey henderson had...
672376,ruth,< player with this retired number
672377,gehrig,< player with this retired number
672396,aaron,< player with this retired number


These clues reference their own clue number, which we don't have, so let's drop them. Interesting one to consider in the future though!

In [322]:
leading_angle = r'^<\s.+'
df = df[~df['clue'].str.contains(leading_angle)]

In [323]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue
16764,sadface,:(
19847,isto,:
22308,winks,;) ;) ;)
23334,frown,:-(
29768,imsomad,>:-(
...,...,...
566817,ands,& & &
593588,colon,:
603733,and,&
657724,ands,&&&


The rest of these are all pretty nuanced, let's drop them for now and we can always revisit later.

In [324]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+'
df = df[df['clue'].str.contains(allowed_chars)]

In [325]:
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+'
mask = ~df['clue'].str.match(allowed_chars)
df[mask]

Unnamed: 0,answer,clue


In [334]:
df

Unnamed: 0,answer,clue
0,pat,"action done while saying ""good dog"""
1,rascals,mischief-makers
2,pen,it might click for a writer
3,sep,fall mo.
4,eco,kind to mother nature
...,...,...
780300,nat,actor pendleton
780301,shred,bit
780302,nea,teachers' org.
780303,beg,petition


We now have what we think is a clean dataset! Next, let's create a new notebook that makes this cleanup clearer, and then we can create a new CSV with the cleaned data.
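
Writing the CSV could look like the sketch below. The frame and the file name `clean_2.csv` are placeholders, since the real output path is not decided in this notebook:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({"answer": ["pat", "pen"], "clue": ["gentle tap", "writing tool"]})

# Hypothetical output location in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "clean_2.csv")

# index=False skips the index column, which otherwise comes back as "Unnamed: 0".
cleaned.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path, keep_default_na=False)
print(reloaded.columns.tolist())
```

Passing `index=False` here would avoid carrying another `Unnamed: 0` column into the next notebook.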

## Future Considerations
* Encode special characters in a standard way
* Consider nuances of crosswords - < reference clues, cross reference clues, asterisk clues, etc.