# Data Cleanup, Take 2

Based on the exploration in [03_data_exploration_take_2.ipynb](03_data_exploration_take_2.ipynb) we need to:
* Remove reference clues
* Remove clues that refer to the "notepad"
* Remove entity encoded clues, HTML tags, and bracket clues
* Strip leading asterisks, but keep asterisk-only clues
* Strip a leading plus sign, but keep plus-only clues and clues that use + as an operator (e.g. 5 + 5)
* Remove clues that start with an angle bracket followed by a space, which are actually reference clues
* Cheat for now and drop any remaining clue that contains characters outside an allowed set
* Remove answers that don't contain only letters
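
To make the removal rules above concrete, here is a minimal sketch that applies the same patterns used in the cleanup cell to a handful of clues. The sample clues and the names `sample`, `drop_patterns`, and `kept_clues` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample clues, invented for illustration only.
sample = pd.DataFrame({
    "clue": [
        "see 12-across",         # reference clue
        "see notepad",           # notepad clue
        "caf&eacute; order",     # named HTML entity
        "<i>hamlet</i> prince",  # HTML tag
        "[see blurb]",           # bracket clue
        "mischief-makers",       # ordinary clue
    ]
})

# The same regular expressions as the cleanup cell, gathered into one dict.
drop_patterns = {
    "reference": r"[0-9]+[-\s]+(?:down|across)+\b",
    "notepad": r"see notepad",
    "named_entity": r"&[a-z]+;",
    "numeric_entity": r"&#[0-9]+;",
    "html_tag": r"<[a-z]+>",
    "bracket": r"\[.+\]",
}

# Start by keeping every row, then clear the flag for each matching pattern.
keep = pd.Series(True, index=sample.index)
for name, pattern in drop_patterns.items():
    keep &= ~sample["clue"].str.contains(pattern, regex=True)

kept_clues = sample.loc[keep, "clue"].tolist()
print(kept_clues)
```

Only the ordinary clue should survive; each of the other rows is caught by exactly one of the patterns.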

In [1]:
import pandas as pd
import re

df = pd.read_csv("cleaned_data/clean_1.csv", keep_default_na=False)

reference_clues = r'[0-9]+[-\s]+(?:down|across)+\b'
df = df[~df['clue'].str.contains(reference_clues)]

notepad_clues = r'see notepad'
df = df[~df['clue'].str.contains(notepad_clues)]

entity_encoding = r'&[a-z]+;'
df = df[~df['clue'].str.contains(entity_encoding)]
entity_encoding = r'&#[0-9]+;'
df = df[~df['clue'].str.contains(entity_encoding)]

html_tags = r'<[a-z]+>'
df = df[~df['clue'].str.contains(html_tags)]

bracket_clues = r'\[.+\]'
df = df[~df['clue'].str.contains(bracket_clues)]

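# Strip leading asterisks; fall back to the original clue if nothing would be
# left, so clues made up only of asterisks ('*', '**', ...) are kept intact.
df['clue'] = df['clue'].apply(lambda x: x.lstrip('*') or x)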

leading_plus = r'^\+[a-z]'
df['clue'] = df['clue'].apply(lambda x: x.lstrip('+') if re.match(leading_plus, x) else x)

leading_angle = r'^<\s.+'
df = df[~df['clue'].str.contains(leading_angle)]

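# The trailing $ makes the pattern cover the whole clue; without it only the
# leading run of characters is checked.
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+$'
df = df[df['clue'].str.contains(allowed_chars)]

In [2]:
# Sanity check: no remaining clue should fail the allowed-character pattern.
allowed_chars = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+$'
mask = ~df['clue'].str.match(allowed_chars)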
df[mask]

Unnamed: 0,answer,clue
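
An empty result here is only meaningful if the pattern is tested against the whole clue. `Series.str.match` anchors at the start of the string only, so a character-class pattern with no end anchor accepts any clue whose first character is allowed. The sketch below illustrates the difference on invented clues, using `Series.str.fullmatch` (assumed available, pandas 1.1 or later) for comparison. The names `allowed_prefix`, `clues`, `prefix_ok`, and `whole_ok` are hypothetical.

```python
import pandas as pd

# Allowed-character pattern WITHOUT an end anchor (hypothetical name).
allowed_prefix = r'^[a-z0-9\s\."\_!\-\'\@\(\)#=%$/\+]+'

# Invented clues: two clean ones, one with an accented letter, one with '?'.
clues = pd.Series(["fall mo.", "caf\u00e9 order", "5 + 5", "what?"])

# str.match only requires the pattern to match at the start of each string.
prefix_ok = clues.str.match(allowed_prefix)

# str.fullmatch requires the pattern to consume the entire string.
whole_ok = clues.str.fullmatch(allowed_prefix)

print(pd.DataFrame({"clue": clues, "prefix_ok": prefix_ok, "whole_ok": whole_ok}))
```

Every clue passes the prefix check, while the two clues containing a disallowed character fail the whole-string check. Appending `$` to the pattern gives the same strictness with `str.match` or `str.contains`.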


In [3]:
df

Unnamed: 0,answer,clue
0,pat,"action done while saying ""good dog"""
1,rascals,mischief-makers
2,pen,it might click for a writer
3,sep,fall mo.
4,eco,kind to mother nature
...,...,...
780300,nat,actor pendleton
780301,shred,bit
780302,nea,teachers' org.
780303,beg,petition


In [4]:
df = df[df['answer'].str.contains(r'^[a-z]+$')]
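
As a rough illustration of this filter, the sketch below reads a tiny invented CSV with `keep_default_na=False`, as the load step does, and applies the same letters-only pattern. The CSV text and the names `csv_text`, `toy`, `letters_only`, and `kept_answers` are assumptions made up for the example.

```python
import io

import pandas as pd

# Invented rows: a normal answer, an answer pandas would normally read as
# missing ("null"), one with digits, an empty one, and one with a hyphen.
csv_text = (
    "answer,clue\n"
    "pat,good dog\n"
    "null,nothing\n"
    "r2d2,droid\n"
    ",blank answer\n"
    "a-ok,fine\n"
)

# keep_default_na=False keeps "null" and "" as plain strings instead of NaN,
# so the boolean mask below contains no missing values.
toy = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Keep only answers made up entirely of lowercase letters.
letters_only = toy["answer"].str.contains(r"^[a-z]+$")
kept_answers = toy.loc[letters_only, "answer"].tolist()
print(kept_answers)
```

The answers with digits, punctuation, or no characters at all are dropped, while a legitimate answer such as "null" survives because it was never converted to a missing value.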

In [5]:
df.to_csv('cleaned_data/clean_2.csv', index=False)