Teachers, of all people, should be good spellers in order to set an example for their students.  One would think that spelling errors in applications must have a negative impact on their acceptance rate.  And in fact they do!  This kernel implements a spellchecker for the DonorsChoose.org application text, counting spelling errors per application.

The number of spelling errors can then be used as a feature in your models.  As Ehsan's [Ultimate Feature Engineering](https://www.kaggle.com/safavieh/ultimate-feature-engineering-xgb-lgb-nn) kernel demonstrated, new features can really help your models score better.  In fact, the number of spelling errors used as a feature improved my model AUCs by around 0.0004, a pretty nice improvement for a single feature.

This kernel runs through the process of spellchecking the text.  I've also attached the resultant CSV for the number of spelling errors per application which you can use in your models to see if it improves your score.



In [9]:
import numpy as np
import pandas as pd
import os, re

# The end at the beginning!  Here is the number of misspellings per application.
result = pd.read_csv('../input/corpus-misspellings/corpus_misspellings_feature.csv')
result.tail(10)
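
If you want to plug this feature into your own model, a left join on the application id is probably all you need. The sketch below uses two tiny made-up frames rather than the real files; the column names `id` and `misspellings` are an assumption about the CSV layout, and `project_price` is just a hypothetical feature you might already have.

```python
import pandas as pd

# Hypothetical existing feature table, one row per application.
features = pd.DataFrame({"id": ["p1", "p2"], "project_price": [100.0, 250.0]})

# Toy stand-in for the misspellings CSV (assumed columns: id, misspellings).
spelling = pd.DataFrame({"id": ["p2", "p3"], "misspellings": [3, 0]})

# Left join keeps every application in the feature table.
merged = features.merge(spelling, on="id", how="left")

# Applications missing from the CSV get zero errors instead of NaN.
merged["misspellings"] = merged["misspellings"].fillna(0).astype(int)

print(merged)
```

A left join keeps the row count of your feature table unchanged, which matters if the rows have to stay aligned with the target column.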


Let's get started by reading in the train and test sets, concatenating them, and combining the cleaned text for each application into a single "corpus":

In [16]:
text_cols = ['project_title', 'project_essay_1', 'project_essay_2', 'project_essay_3', 'project_essay_4', 'project_resource_summary']
id_col = 'id'
target_col = 'project_is_approved'
spell_col = 'misspellings'

input_folder = '../input/donorschoose-application-screening'
train = pd.read_csv(os.path.join(input_folder, 'train.csv'))[[id_col] + text_cols + [target_col]]
test = pd.read_csv(os.path.join(input_folder, 'test.csv'))[[id_col] + text_cols]

df = pd.concat([train, test], axis=0, ignore_index=True)

# piece together 'project_essay'
df.loc[df['project_essay_3'].notnull(), 'project_essay_1'] = df['project_essay_1'] + ' ' + df['project_essay_2']
df.loc[df['project_essay_4'].notnull(), 'project_essay_2'] = df['project_essay_3'] + ' ' + df['project_essay_4']
df['project_essay'] = df['project_essay_1'] + ' ' + df['project_essay_2']

def clean_text(phrase):
  # specific
  q = r"['’´ʻ]" # apostrophe variants; raw string avoids invalid escape warnings
  
  phrase = re.sub(re.compile("won%st" % q), "will not", phrase)
  phrase = re.sub(re.compile("can%st" % q), "can not", phrase)
  
  # general
  phrase = re.sub(re.compile("n%st" % q), " not", phrase)
  phrase = re.sub(re.compile("%sre" % q), " are", phrase)
  phrase = re.sub(re.compile("%ss" % q), " is", phrase)
  phrase = re.sub(re.compile("%sd" % q), " would", phrase)
  phrase = re.sub(re.compile("%sll" % q), " will", phrase)
  phrase = re.sub(re.compile("%st" % q), " not", phrase)
  phrase = re.sub(re.compile("%sve" % q), " have", phrase)
  phrase = re.sub(re.compile("%sm" % q), " am", phrase)
  
  phrase = re.sub(r"\\r|\\n", " ", phrase)
  phrase = re.sub(r"\.\.+", ". ", phrase) # ellipsis ... or .. to .
  
  # all chars except ;.?!
  phrase = re.sub(re.compile(q + "+"), "", phrase)   
  phrase = re.sub(r"[\'\"\#\$\%\&\(\)\*\+\,\-\/\:\<\=\>\@\[\\\]\^\_\`\{\|\}\~\“\”\″\ʺ\¨\‘\…\—\―\–\•\®]+", " ", phrase)   
  
  # add space after EOS if missing
  phrase = re.sub(r"([\;\.\?\!])([^\s])", "\\1 \\2", phrase)
  # squeeze space before EOS
  phrase = re.sub(r"\s+([\;\.\?\!])", "\\1", phrase)
  
  # space squeezer: \u200b is the Unicode zero-width space
  phrase = re.sub(r"['\u200b\s]+", " ", phrase).strip()
  
  return phrase  

text_cols = ['project_title', 'project_essay', 'project_resource_summary']
for col in text_cols:
  df[col] = df[col].apply(clean_text)
  # EOS marker; endswith() also copes with text that is empty after cleaning
  df[col] = df[col].apply(lambda x: x if x.endswith((';', '.', '?', '!')) else x + '.')

df['project_corpus'] = df['project_title'] + ' ' + df['project_essay'] + ' ' + df['project_resource_summary']
df = df[[id_col] + ['project_corpus'] + [target_col]]

df.head()
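
One detail in the cleaning function above is easy to miss: the specific contraction rules have to run before the general ones. Here is a minimal sketch of that idea with a reduced rule list; the names `APOS`, `RULES`, and `expand` are hypothetical and not part of the kernel.

```python
import re

APOS = "['’´ʻ]"  # apostrophe variants, as in the cleaning function

# Specific rules first, general rules after (a reduced, illustrative list).
RULES = [
    ("won" + APOS + "t", "will not"),
    ("can" + APOS + "t", "can not"),
    ("n" + APOS + "t", " not"),
    (APOS + "re", " are"),
    (APOS + "ll", " will"),
    (APOS + "ve", " have"),
]

def expand(text, rules=RULES):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text

sample = "We won't stop; they're eager and don’t quit."
good = expand(sample)

# Same rules, but with the two specific ones moved to the end.
bad = expand(sample, rules=RULES[2:] + RULES[:2])

print(good)  # We will not stop; they are eager and do not quit.
print(bad)   # We wo not stop; they are eager and do not quit.
```

With the general "n't" rule applied first, "won't" turns into "wo not", which a spellchecker would then flag as an error.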

Now we load in the [PyHunSpell](https://github.com/blatinier/pyhunspell) spellchecker with the [en_US large](https://aur.archlinux.org/packages/hunspell-en-us-large) vocabulary. 

In [21]:
import hunspell
hunspell_folder = '../input/hunspellenuslarge'

# https://aur.archlinux.org/packages/hunspell-en-us-large/
hobj = hunspell.HunSpell(os.path.join(hunspell_folder, 'en_US-large.dic'), 
                         os.path.join(hunspell_folder, 'en_US-large.aff'))

# give it a try
hobj.spell('donors')


The hunspell spellchecker is pretty good, but it is not perfect for our purpose.  It does not include product names like "chromebook" or slang like "looove".   I wanted to find a way to not flag these as spelling errors.  I came up with a semi-manual process where I used myself as a [Mechanical Turk](https://en.wikipedia.org/wiki/The_Turk).  Here is the process:

1. Track the spelling errors according to hunspell, keeping a count of # of errors per word (**n**) in the Python dictionary **misspell**
2. Sort **misspell** descending by **n** and write it to a CSV file
3. Manually set a column value **ok** to 1 if the word is not really a misspelling
4. Read the CSV back in and run the spellchecker again, not considering **ok** words to be misspelled

It took me about an hour to run through all misspelled words that occurred 10 or more times.  To save time, I did not look at misspelled words that occurred fewer than 10 times.  But you can do this if you want to!  I am sure it will help a bit.  
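
The four steps above can be sketched end to end on toy data. Here a small set of known words stands in for hunspell and an in-memory buffer stands in for the CSV file; `KNOWN`, `count_unknown`, and the sample sentences are all hypothetical.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the hunspell dictionary (hypothetical).
KNOWN = {"students", "love", "their", "new"}

def count_unknown(docs, passthru=frozenset()):
    """Count, per word, how many documents contain it as an unknown word."""
    counts = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        counts.update(w for w in words if w not in KNOWN and w not in passthru)
    return counts

docs = [
    "students love their chromebook",
    "new chromebook",
    "students looove chromebook",
    "thier students",
]

# Steps 1 and 2: count unknown words, then write them sorted by count.
counts = count_unknown(docs)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["word", "n", "ok"])
for word, n in counts.most_common():
    writer.writerow([word, n, 0])

# Step 3: simulate the manual review by flagging one word as OK.
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
for row in rows:
    if row["word"] == "chromebook":
        row["ok"] = "1"

# Step 4: rerun the count, letting the reviewed words pass through.
passthru = {row["word"] for row in rows if row["ok"] == "1"}
recount = count_unknown(docs, passthru)
print(recount)
```

Sorting by count puts the most frequent false alarms at the top of the file, so a short review session covers most of them.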

Below, I am loading in the corpus_misspellings.csv file I marked up in this manner.  But I have also included the code to create it from scratch at the end of this kernel.  As I said, you can continue editing corpus_misspellings.csv if you like.


In [24]:
isok = pd.read_csv('../input/corpus-misspellings/corpus_misspellings.csv', encoding='utf-8')

# treat these words as correctly spelled even though hunspell flags them
passthru = set(isok[isok.ok == 1].word)
df[spell_col] = 0
misspell = {}

isok.head()

Below we are tracking two things:
1. **misspell** tracks number of occurrences of a misspelled word
2. **df[spell_col]** counts the spelling errors per application

In [28]:
pd.options.mode.chained_assignment = None
for ci, s in enumerate(df.project_corpus):
  sw = s.split(' ') # split into words
  
  w_ok = set() # words seen with a capital, digit, etc. in this application
  w_misspell = set() # distinct misspelled words in this application
  for i, w in enumerate(sw):
    w = re.sub(r'\W+', '', w) # clean the word
    
    if len(w) <= 3: # ignore short words
      continue
    
    # mark any word that contains chars outside a-z as OK:
    # if a product name like "Brite" is capitalized once, this means other occurrences in the
    # application like "brite" will be considered OK
    if bool(re.search(r'[^a-z]', w)):
      w_ok.add(w.upper())
      continue
    
    w = w.upper() # hobj.spell() works better with CAPS
    prev = re.sub(r'\W+', '', sw[i-1]).upper() if i > 0 else '' # cleaned previous word
    if hobj.spell(w) or \
       (w in passthru) or \
       (w.endswith('S') and (hobj.spell(w[:-1]) or (w[:-1] in passthru))) or \
       (prev in ('MR', 'MRS')): # word follows a title, so it is likely a surname
      continue
    
    w_misspell.add(w) # word is misspelled!
    
  lw_misspell = list(w_misspell - w_ok)
  for w in lw_misspell:
    # print(ci, w)
    misspell[w] = misspell.get(w, 0) + 1
    df.at[ci, spell_col] += 1 # label-based update instead of chained assignment
    
# same as we saw in "result"
df[[id_col, spell_col]].tail(10)
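
To make the per-application rules in the loop above easier to follow, here is a condensed restatement that runs without hunspell. It assumes the plural check is meant to apply only to words ending in S, and it leaves out the Mr/Mrs rule; `VOCAB`, `is_known`, and `misspelled_words` are hypothetical names, with `VOCAB` standing in for the real dictionary.

```python
import re

def misspelled_words(text, is_known, passthru=frozenset()):
    """Return the distinct words in one application that count as misspelled."""
    ok, bad = set(), set()
    for raw in text.split(' '):
        w = re.sub(r'\W+', '', raw)       # strip punctuation
        if len(w) <= 3:                   # ignore short words
            continue
        if re.search(r'[^a-z]', w):       # capitalized, digits, ...: treat as OK
            ok.add(w.upper())
            continue
        w = w.upper()
        if is_known(w) or w in passthru:
            continue
        singular = w[:-1]                 # plural fallback
        if w.endswith('S') and (is_known(singular) or singular in passthru):
            continue
        bad.add(w)
    # A word seen capitalized once excuses its lowercase occurrences.
    return bad - ok

VOCAB = {"STUDENTS", "LOVE", "MARKER"}    # toy dictionary (hypothetical)
is_known = VOCAB.__contains__

text = "Students love Brite markers. They love thier brite markers."
errors = misspelled_words(text, is_known)
print(errors)  # {'THIER'}

plural = misspelled_words("new chromebooks", is_known, passthru={"CHROMEBOOK"})
print(plural)  # set()
```

Since the result is a set, a word that is repeatedly misspelled within one application still counts as a single error.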


Count of applications with each number of spelling errors:

In [29]:
from collections import Counter
Counter(df[spell_col])

Most applications are free of spelling errors.  What a relief!  About 9% do contain spelling errors, however.  Let's take a look at an application with a lot of errors:

In [40]:
list(df[df[spell_col] == 13].project_corpus)

Pretty bad.  Funny that spellchecking is discussed at the end of the text!  Here is the relationship between spelling errors and ratio of applications approved:

In [41]:
# clipped at 8
print('approval rate given # of misspellings:')
for n in range(0,9):
  print('%d: %.4f' % (n, np.mean(df[df[spell_col].clip(0,8) == n][target_col])))
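
The same table can also be produced with a pandas groupby, which makes it easy to show the group sizes next to the rates. This is a sketch on a made-up toy frame, not the real data; the NaN row mimics a test-set application with no label.

```python
import numpy as np
import pandas as pd

# Toy data only: eight applications with made-up error counts and labels.
toy = pd.DataFrame({
    "misspellings": [0, 0, 0, 1, 1, 2, 9, 12],
    "project_is_approved": [1, 1, 0, 1, 0, 0, np.nan, 0],
})

# Group by the clipped error count; mean and count both skip NaN labels.
clipped = toy["misspellings"].clip(0, 8)
rate = toy.groupby(clipped)["project_is_approved"].agg(["mean", "count"])

print(rate)
```

The `count` column is worth watching, since the approval rate for high error counts rests on very few applications.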

Yep.  The number of misspellings correlates negatively with approval.

If you are interested, here is how to generate the corpus_misspellings.csv file I discussed earlier.

In [46]:
# combine plural misspelling with singular
todelete = []
for w, n in misspell.items():
  if not w.endswith('S') or w[:-1] not in misspell:
    continue
  
  misspell[w[:-1]] += n
  todelete.append(w)

# delete plurals
for w in todelete:
  misspell.pop(w)  
  
misspell = pd.DataFrame.from_dict(misspell, orient='index')
misspell.reset_index(level=0, inplace=True)
misspell.columns = ['word', 'n']
misspell['ok'] = 0 # not OK, unless manually set to 1
misspell = misspell.sort_values(by='n', ascending=False)  

# misspell.to_csv('../output/corpus_misspellings.csv', index=False)

misspell.head()