# Text Processing Pipeline for CalFresh Application Dataset

Author: Rocio Ng (DSWG Lead)

### Summary:
* This notebook prototypes and tests methods for processing free text entered into applications for the CalFresh program (https://www.getcalfresh.org/).
* It applies helper modules that:
    1. Apply light processing to the text
    2. Attempt to correct misspelled words in the text
    3. Apply a whitelist to redact text that may contain personal information

### Resources:
* Peter Norvig's Spell Corrector Tutorial (http://norvig.com/spell-correct.html)
* Spanish Language Corpus (https://www.corpusdata.org/spanish.asp)


## Load Libraries

In [None]:
import numpy as np
import pandas as pd
from langdetect import detect

import warnings
warnings.filterwarnings(action='once') # displays warnings only once

import os
import sys

# For loading Helper Functions
module_path = os.path.abspath(os.path.join('2-Helper-Modules'))
if module_path not in sys.path:
    sys.path.append(module_path)

# For Multicore processing
from multiprocessing import Pool

# Helper Modules
from spell_checking_functions_v3 import *
from text_processing_functions import *
from whitelist_functions import *

In [None]:
# Testing spell checker
en_spellchecker.correction_phrase("helpp meh with calfrsh, whil i'm applying for ssi  .")

In [None]:
# print(detect('Hi')) # False Negative Results
# print(detect('I currently live in my truck'))
# print(detect("estoy embarazada"))

## Load data
* If you rename or move the data files, update the paths below so they point to where the files are stored locally

In [None]:
text_df = pd.read_csv("../../1-Data/500_sample_results.csv")
# text_df = pd.read_csv("1-Data-Files/orig_entRep_300.csv")

In [None]:
text_df.shape

In [None]:
text_df.head()

In [None]:
text_df = text_df.dropna(subset=['with_entity_replacement'])

In [None]:
text_df.shape

## Processing

* Light text processing
* Count Spelling Errors
* Detect Language
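
As a rough illustration of the first two steps, the sketch below lowercases a phrase, drops punctuation, collapses whitespace, and counts words missing from a vocabulary. The names `light_process`, `count_unknown_words`, and `TOY_VOCAB` are hypothetical and exist only for this example; the actual behavior of `Spellchecker.initial_text_processing` in the helper module may differ.

```python
import re

# Hypothetical toy vocabulary; the real pipeline would use a full word list.
TOY_VOCAB = {"help", "me", "with", "calfresh", "i'm", "applying", "for", "ssi"}

def light_process(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, collapse whitespace."""
    text = text.lower()
    # \w matches Unicode letters, so accented Spanish characters are preserved.
    text = re.sub(r"[^\w'\s]", " ", text)
    return " ".join(text.split())

def count_unknown_words(phrase, vocab):
    """Count whitespace-separated tokens that are not in the vocabulary."""
    return sum(1 for word in phrase.split() if word not in vocab)

cleaned = light_process(" PERSON, I'm wenT to The Store at (CARDINAL)!!")
errors = count_unknown_words(light_process("helpp me with calfresh"), TOY_VOCAB)
```

A count like `errors` is the kind of value a `spelling_errors` column could hold if the goal is to sort the rows with the most misspellings to the top.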

In [None]:
# Demo: a cell displays only its last expression, so print both results
print(Spellchecker.initial_text_processing(" PERSON, I'm wenT to The Store at (CARDINAL)!!"))
print(Spellchecker.initial_text_processing("Por ahora no estoy trabajando necesito de NAME ayuda el mes anterior si recibir "))

In [None]:
text_df['processed_phrase'] = text_df.with_entity_replacement\
    .apply(Spellchecker.initial_text_processing)

In [None]:
text_df.head()

In [None]:
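# NOTE: as written, this copies the processed phrase unchanged; no error count
# is computed here, so the sort below orders rows by phrase text.
text_df['spelling_errors'] = text_df.processed_phrase
text_df = text_df.sort_values('spelling_errors', ascending=False)
text_df['language'] = text_df.processed_phrase.apply(detect_B)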

In [None]:
text_df.groupby(by = "language").count()

In [None]:
# text_df[text_df.language.isin(['None'])]

In [None]:
# Drop rows whose language is recorded as the string 'None'
text_df = text_df[~text_df.language.isin(['None'])]

## Apply Spell Checking Functions

* Convert the DataFrame column of phrases to a list of (phrase, language) tuples to enable multiprocessing
* Run the `spell_correction_language` function on each tuple
* Append the results back to the DataFrame
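
The pattern relies on `map` returning results in the same order as its input, which is what lets the output list be assigned straight back as a column. The sketch below shows that contract with toy data. `correct_pair` and `TOY_FIXES` are hypothetical stand-ins for the helper module. `multiprocessing.pool.ThreadPool` is used here only so the example runs in any environment; it exposes the same `map` interface as the process-based `Pool` used in this notebook.

```python
from multiprocessing.pool import ThreadPool

# Hypothetical per-language corrections, standing in for a real spell checker.
TOY_FIXES = {
    "en": {"helpp": "help"},
    "es": {"ayudaa": "ayuda"},
}

def correct_pair(pair):
    """Take one (phrase, language) tuple and return the corrected phrase."""
    phrase, language = pair
    fixes = TOY_FIXES.get(language, {})
    return " ".join(fixes.get(word, word) for word in phrase.split())

phrases = ["helpp me", "necesito ayudaa"]
languages = ["en", "es"]
pairs = list(zip(phrases, languages))

# The context manager shuts the pool down once the block exits.
with ThreadPool(processes=2) as pool:
    corrected = pool.map(correct_pair, pairs)  # output order matches input order
```

Packing both values into a single tuple is what allows a one-argument function to be passed to `map`.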

In [None]:
spelling_error_list = list(zip(text_df['processed_phrase'], text_df['language']))

In [None]:
# Preview
spelling_error_list[5:10]

In [None]:
my_pool = Pool(processes=4)  # set to the number of cores on your machine

In [None]:
# For testing edge cases
spell_correction_language(("semesters", "en"))  # already spelled correctly, yet gets changed to a different word
spell_correction_language(("alot", "en"))
spell_correction_language(("paralized", "en"))  # corrects to 'penalized' instead of 'paralyzed'
# spell_correction_language(("farmacie", "es"))
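
One possible explanation for the `paralized` result, assuming the helper module follows the Norvig approach listed under Resources: candidates one edit away are preferred over candidates two edits away, but only if they appear in the word-frequency corpus. If `paralyzed` is absent from the corpus, the corrector falls through to two-edit candidates such as `penalized`. The sketch below is a minimal Norvig-style corrector with hypothetical toy counts, not the helper module's code.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correct(word, counts):
    """Pick the most frequent known candidate, preferring fewer edits."""
    def known(words):
        return {w for w in words if w in counts}
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=lambda w: counts[w])

# Toy corpora (assumed counts): one without 'paralyzed', one with it.
without_word = Counter({"penalized": 3, "work": 10})
with_word = without_word + Counter({"paralyzed": 1})

missing_case = correct("paralized", without_word)
present_case = correct("paralized", with_word)
```

If this is the cause, adding domain-relevant words to the frequency corpus would be one way to address such cases.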

In [None]:
# Apply spell correction by language across all text
%time spelling_corrections = my_pool.map(spell_correction_language, spelling_error_list)

In [None]:
spelling_corrections[5:10]

In [None]:
# Append results to dataframe
text_df['spelling_corrections'] = spelling_corrections

In [None]:
text_df.head()

In [None]:
#subset_df = text_df.iloc[10:15]

In [None]:
text_df.to_csv("gcf_circumstances_spell_correct.csv")

## Apply Whitelist to Spell-Corrected Phrases
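
The idea behind whitelisting is to keep only words known to be safe and to redact everything else, on the assumption that names and other personal details will not be on the list. The sketch below illustrates this with a hypothetical `redact_with_whitelist` function and a toy whitelist. It returns a (phrase, count) pair, mirroring how this notebook indexes the result of `check_whitelist`, but the real function's logic and placeholder text are not shown here and may differ.

```python
import string

def redact_with_whitelist(phrase, whitelist, placeholder="[REDACTED]"):
    """Replace every token not on the whitelist; return (new_phrase, n_replaced)."""
    kept = []
    replaced = 0
    for token in phrase.split():
        # Compare on a lowercased copy with surrounding punctuation stripped.
        word = token.strip(string.punctuation).lower()
        if word and word not in whitelist:
            kept.append(placeholder)
            replaced += 1
        else:
            kept.append(token)
    return " ".join(kept), replaced

# Hypothetical whitelist for illustration only.
toy_whitelist = {"this", "is", "a", "test", "for", "hello"}
redacted, n_replaced = redact_with_whitelist(
    "This is a Test.   For Rocio. Hello. ", toy_whitelist
)
```

Note that this toy version replaces the whole token, so punctuation attached to a redacted word is dropped along with it.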

In [None]:
test_phrase = "This is a Test.   For Rocio. Hello. "
check_whitelist(test_phrase, whitelist_list, "replace")

In [None]:
text_df['whitelisted_phrase'] = text_df.spelling_corrections\
    .apply(lambda x: check_whitelist(x, whitelist_list, "replace")[0])

In [None]:
text_df.head()

In [None]:
text_df.to_csv("gcf_circumstances_spell_correct_whitelist.csv")

## Validate Effectiveness of Corrections

In [None]:
# text_df['words_removed_raw_words'] = text_df.original_additional_information_text\
#     .apply(lambda x: int(check_whitelist(x, whitelist_list)[1]))

# text_df['words_removed_spell_corrected'] = text_df.spelling_corrections\
#     .apply(lambda x: int(check_whitelist(x, whitelist_list, "remove")[1]))

# text_df = text_df\
#     .assign(pct_improvement = 100*(1 - (text_df.words_removed_spell_corrected/text_df.words_removed_raw_words)))\
#     .assign(improvement = text_df.words_removed_raw_words - text_df.words_removed_spell_corrected)
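
The commented-out metrics above compare how many words the whitelist removes before and after spell correction. A minimal plain-Python sketch of the same arithmetic follows. The function name `improvement_metrics` is hypothetical, and the guard for a zero baseline is an added assumption, since the percentage is undefined when no words were removed from the raw text.

```python
def improvement_metrics(removed_raw, removed_corrected):
    """Return (improvement, pct_improvement) for one row.

    improvement     = removed_raw - removed_corrected
    pct_improvement = 100 * (1 - removed_corrected / removed_raw)
    """
    improvement = removed_raw - removed_corrected
    if removed_raw == 0:
        # Percentage is undefined without a baseline; report None instead.
        return improvement, None
    pct_improvement = 100 * (1 - removed_corrected / removed_raw)
    return improvement, pct_improvement

# Example: 4 words removed from the raw text, 1 after spell correction.
example = improvement_metrics(4, 1)
```

A positive value means spell correction let more words pass the whitelist, so fewer were redacted.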

In [None]:
# missing_words = ["test", "in", "an", "never", "work", "part", "house"]