# Data Cleanup

Based on the exploration in [01_data_exploration.ipynb](01_data_exploration.ipynb), we need to:
* Remove rows without an answer
* Keep the literal string "NULL" as text instead of letting pandas parse it as NaN
* Remove rows whose clue contains non-ASCII characters

In [98]:
import pandas as pd


# keep_default_na=False stops pandas from converting strings such as "NULL"
# into NaN, so they are read as ordinary text
df = pd.read_csv("data/nytcrosswords.csv", keep_default_na=False)
print(f'Starting row count: {len(df)}')

# remove rows without an answer; note that the `> 1` check also drops
# single-character answers, not only empty ones
previous_len = len(df)
df['word_length'] = df['Word'].astype(str).str.strip().str.len()
df = df[df['word_length'] > 1]
print(f'Removed {previous_len - len(df)} empty row(s)')

# remove rows whose clue contains non-ASCII characters
previous_len = len(df)
unknown_char_pattern = r'[^\x00-\x7F]'
has_unknown_char = df['Clue'].str.contains(unknown_char_pattern)
unknown_rows = df[has_unknown_char]  # kept only for inspection
df = df[~has_unknown_char]
print(f'Removed {previous_len - len(df)} unknown character rows')

print(f'Ending row count: {len(df)}')

Starting row count: 781573
Removed 1 empty row(s)
Removed 1267 unknown character rows
Ending row count: 780305
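
To make the two filters easier to check, here is a minimal sketch that applies the same length rule and the same non-ASCII rule to a toy dataframe. The rows are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy rows (hypothetical, for illustration only).
toy = pd.DataFrame({
    "Word": ["PAT", "", "CAFE", "NULL"],
    "Clue": ["Gentle tap", "Missing answer", "Café order", "Empty value"],
})

# Length rule from the cell above: keep answers longer than one character.
lengths = toy["Word"].str.strip().str.len()
toy = toy[lengths > 1]

# ASCII rule from the cell above: flag clues with any character outside \x00-\x7F.
non_ascii = toy["Clue"].str.contains(r"[^\x00-\x7F]")
kept = toy[~non_ascii]
dropped = toy[non_ascii]

print(kept["Word"].tolist())     # ['PAT', 'NULL']
print(dropped["Word"].tolist())  # ['CAFE']
```

Building the boolean mask once and reusing it for both `kept` and `dropped` keeps the two views consistent with each other.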


Let's keep only the columns we actually care about, the answer and the clue, and rename them.

In [99]:
print(df.columns)

Index(['Date', 'Word', 'Clue', 'word_length'], dtype='object')


In [100]:
df = df.drop(columns=['Date', 'word_length'])
df = df.rename(columns={'Word': 'answer', 'Clue': 'clue'})

print(df.columns)

Index(['answer', 'clue'], dtype='object')


Finally, let's convert everything to lowercase and strip leading and trailing whitespace to simplify later processing.

In [101]:
df['answer'] = df['answer'].str.lower().str.strip()
df['clue'] = df['clue'].str.lower().str.strip()
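
As a small illustration of what this normalization does and does not change, the sketch below wraps the two string operations in a helper. The name `normalize_text` and the sample rows are hypothetical, not part of the notebook.

```python
import pandas as pd


def normalize_text(series: pd.Series) -> pd.Series:
    """Lowercase a string column and strip leading/trailing whitespace."""
    return series.str.lower().str.strip()


# Toy rows (hypothetical, for illustration only).
toy = pd.DataFrame({
    "answer": [" PAT ", "NULL"],
    "clue": ["  Gentle  Tap ", "Empty Value"],
})

for column in ("answer", "clue"):
    toy[column] = normalize_text(toy[column])

print(toy["answer"].tolist())  # ['pat', 'null']
print(toy["clue"].tolist())    # ['gentle  tap', 'empty value']
```

Note that `str.strip()` only trims the ends of each string, so the double space inside the first clue is left untouched.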

Now, let's write this dataframe to a new CSV.

In [102]:
df.to_csv('cleaned_data/clean_1.csv', index=False)
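
One thing to keep in mind when reading the cleaned file back: strings such as `null` and `nan` are on pandas' default list of NaN markers, so a later notebook would likely want `keep_default_na=False` again. The sketch below shows the difference on a toy dataframe, using an in-memory buffer in place of `cleaned_data/clean_1.csv`. The rows are made up for illustration.

```python
import io

import pandas as pd

# Toy cleaned rows (hypothetical, for illustration only).
cleaned = pd.DataFrame({
    "answer": ["null", "nan", "pat"],
    "clue": ["empty value", "not a number", "gentle tap"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default parsing turns "null" and "nan" into missing values.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False preserves them as ordinary strings.
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["answer"].isna().sum())  # 2
print(safe_read["answer"].tolist())         # ['null', 'nan', 'pat']
```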