# NLP Assignment 1 - Regular Expressions

**Prompt:**  
Improve your analysis from Week-1 by identifying specific tokens (words / keywords) that lead to failed food inspections in Chicago leveraging tokenization, stemming and lemmatization in Python.

In [1]:
import nltk
import nltk.corpus  
from nltk.text import Text
import pandas as pd
import re
import sys
import time

In [2]:
data_path = "/Users/rowena/Datasets/"
file_path = data_path + "Food_Inspections.csv"

In [3]:
df = pd.read_csv(file_path)
df = df[df.Results == 'Fail']

In [4]:
violations = df[~df.Violations.isna()].Violations

## Tokenization

In [5]:
start_time = time.time()
tokens_list = []
for violation in violations:
    tokens_list.append(nltk.tokenize.word_tokenize(violation))
print("--- %s seconds ---" % (time.time() - start_time))

--- 115.0517008304596 seconds ---


In [6]:
tokens = [item for sublist in tokens_list for item in sublist]

In [7]:
tokens = pd.Series(tokens)

In [8]:
tokens.value_counts().head(15)

.           650312
,           571337
AND         384597
:           333666
Comments    204355
-           204023
THE         192357
IN          169887
|           168857
TO          158674
OF          158052
CLEAN       121782
FOOD        117387
MUST         99527
and          94122
dtype: int64

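This is a rather unhelpful list of top words found in failed inspection violation notices. Punctuation and function words such as "AND", "THE" and "IN" dominate, and case variants ("AND" vs. "and") are counted separately.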
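
One way to cut through that noise, sketched here with only the standard library, is to lower-case each token and drop anything that is not alphabetic or that appears in a stop-word list. The `STOP_WORDS` set and the `count_content_tokens` helper are hypothetical names for illustration, and the sample tokens are made up; a real run would likely use a fuller stop-word list.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use a much fuller list.
STOP_WORDS = {"and", "the", "in", "to", "of", "a", "be", "is", "or", "all"}


def count_content_tokens(tokens):
    """Count lower-cased alphabetic tokens that are not stop words."""
    counts = Counter()
    for tok in tokens:
        word = tok.lower()
        # isalpha() drops punctuation tokens such as ".", ",", ":" and "|"
        if word.isalpha() and word not in STOP_WORDS:
            counts[word] += 1
    return counts


# Made-up sample tokens, not taken from the dataset.
sample = ["CLEAN", "AND", "clean", "THE", "FLOOR", ".", ",", "|", "Comments", ":"]
print(count_content_tokens(sample).most_common(3))
```

Case-folding before counting means "CLEAN" and "clean" land in the same bucket, which is the same effect the `t.lower()` call has in the lemmatization step.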

## Lemmatization

In [9]:
wnl = nltk.WordNetLemmatizer()

In [10]:
lemmas = pd.Series([wnl.lemmatize(t.lower()) for t in tokens])

In [11]:
lemma_vc = lemmas.value_counts()
lemma_vc.head(15)

.          650312
,          571337
and        478733
:          333666
the        245091
in         208795
comment    204406
-          204023
of         200590
to         180682
|          168857
food       166544
clean      133866
all        118732
must       108764
dtype: int64

Still not that helpful. Lower-casing merged case variants such as "AND" and "and", but punctuation and function words still dominate. Hopefully POS tags can weed those out.

## POS Tagging

In [12]:
lemma_df = lemma_vc.rename_axis('lemma').reset_index(name='count')

In [13]:
start_time = time.time()
pos_tags = nltk.pos_tag(lemma_df.lemma, tagset='universal')
print("--- %s seconds ---" % (time.time() - start_time))

--- 6.736386060714722 seconds ---


In [14]:
lemma_df['pos'] = pd.Series(pos_tags).apply(lambda x: x[1])

In [15]:
lemma_df.head()

  lemma   count   pos
0     .  650312     .
1     ,  571337     .
2   and  478733  CONJ
3     :  333666     .
4   the  245091   DET


In [16]:
lemma_df[(lemma_df.pos=='NOUN') | (lemma_df.pos=='VERB')].head(25)

         lemma   count   pos
6      comment  204406  NOUN
10           |  168857  VERB
11        food  166544  NOUN
14        must  108764  VERB
15        area  106865  NOUN
18   equipment   92825  NOUN
19          be   89084  VERB
21       floor   88086  NOUN
22      repair   81589  NOUN
23  maintained   73322  VERB


This is a somewhat useful list now. Most of the irrelevant or nonsense words have been weeded out.

"Comments" is from the structure of data. It's included anytime there is additional information. Same deal with the pipe symbol. It's interesting that pos_tag picked that up as a verg.

Top words on a failed food inspection include the word "food". That makes sense. "Must" is probably from comments like "INSTRUCTED MANAGER TO PROVIDE AND MAINTAIN A WORKING LONG STEM THERMOMETER TO CHECK FOOD ITEMS IN HOT HOLDING. MUST COMPLY OR CITATIONS WILL BE ISSUED." We see other words like "equipment", "maintained", "constructed" and "installed". These are probably referring to pieces of kitchen gear. Places in the building are also common, like "floor", "sink", and "wall", which suggests that structural problems can lead to a failed inspection. I'm kind of surprised that "violation" isn't higher on the list.
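
A quick way to check guesses like the one about "must" is to look at the words around each occurrence. The following is a minimal keyword-in-context sketch using plain Python lists; `keyword_in_context` is a hypothetical helper, and the sample is the tail of the inspector comment quoted above, split on whitespace for brevity rather than with `word_tokenize`.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return the tokens around each occurrence of keyword (case-insensitive)."""
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [tok] + right))
    return hits


# Tail of the comment quoted above, split on whitespace for brevity.
sample = "FOOD ITEMS IN HOT HOLDING. MUST COMPLY OR CITATIONS WILL BE ISSUED.".split()
print(keyword_in_context(sample, "must"))
```

The function expects a plain list, such as the flattened `tokens` list before it is converted to a Series, or `tokens.tolist()` afterwards.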