# Assignment 2

Use the same data as in assignment 1, but this time identify the top-10 tokens that occur in the regulation descriptions in the table.

1. As in assignment 1, extract the regulation descriptions from each record corresponding to a failed inspection.
2. Tokenize each regulation description.
3. Find the top-10 tokens (for the whole table).
4. Clean the data: convert to lower case, remove stopwords, punctuation, numbers, etc.
5. Find the top-10 tokens again.
6. Find the top-10 tokens after applying Porter stemming to the tokens obtained in step 4.
7. Find the top-10 tokens after applying Lancaster stemming to the tokens obtained in step 4.
8. Find the top-10 tokens after applying lemmatization to the tokens obtained in step 4.
9. Compare the top-10 tokens obtained in steps 3, 5, 6, 7, and 8.

In [1]:
### Loading packages ###
import pandas as pd  # data frame operations
import numpy as np  # arrays and math functions

import re  # regular expressions
import os  # operating system interfaces
from datetime import datetime

import nltk
import nltk.corpus
from nltk.text import Text
import sys

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load data
food = pd.read_csv('Food_Inspections.csv')

## 1. Extract regulation descriptions from each record corresponding to a failed inspection

In [3]:
failed_df = food[food['Results'] == 'Fail']
failed_df = failed_df.dropna(subset=['Violations']).reset_index()

In [4]:
# Violations within one record are separated by a pipe character
split_term = r'\|'
# Split each Violations string into a list of individual violations
failed_df['splitted_reasons'] = failed_df['Violations'].apply(lambda x: re.split(split_term, x))
failed_df['descriptions'] = None

In [5]:
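# Patterns to strip from each violation
# " - Comments: ..." through the end of the violation text
comment = r"\s-\sComments:[\s\S]*"
# Numeric regulation code followed by a period and a space, e.g. "32. "
regulation_code = r"[0-9]+\.\s"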

result = []

for i in range(failed_df.shape[0]):
    reasons = failed_df.iloc[i]['splitted_reasons']
    
    description = []
    for r in reasons:
        # delete comment
        no_comment = re.sub(comment, '', r)

        # delete regulation code
        no_code = re.sub(regulation_code, '', no_comment)
        no_code = no_code.strip()
        description.append(no_code)
    
#     print("--------Row:", i)
#     print(description)
    result.append(description)

failed_df['descriptions'] = result
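
The extraction step can be checked on a single string. The sketch below wraps the same split-and-strip logic in a helper; the name `extract_descriptions` and the sample violation text are made up for illustration. Anchoring the code pattern at the start of each violation is an assumption about the data layout, not something the notebook enforces.

```python
import re

def extract_descriptions(violations):
    """Split a Violations string on '|' and strip the code and the comment part."""
    descriptions = []
    for part in re.split(r"\|", violations):
        # Drop " - Comments: ..." through the end of this violation
        no_comment = re.sub(r"\s-\sComments:[\s\S]*", "", part)
        # Drop a leading regulation code such as "32. " (assumed to come first)
        no_code = re.sub(r"^\s*[0-9]+\.\s", "", no_comment)
        descriptions.append(no_code.strip())
    return descriptions

# Hypothetical sample in the same layout as the Violations column
sample = (
    "32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED"
    " - Comments: MUST REPAIR SHELVING. "
    "| 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR"
    " - Comments: CLEAN FLOOR UNDER SINK."
)
descriptions = extract_descriptions(sample)
print(descriptions)
```

Keeping the logic in a function also makes it usable with `failed_df['Violations'].apply(...)` instead of an explicit row loop, if that is preferred.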

## 2. Tokenize each regulation description

In [6]:
flat_list = [item for sublist in result for item in sublist]

In [7]:
str_result = ' '.join(flat_list)

In [8]:
words = nltk.tokenize.word_tokenize(str_result)

## 3. Find top-10 tokens (for the whole table)

In [9]:
fdist = nltk.FreqDist(words)

print(fdist)

#fdist.items() - will give all words
fdist.most_common(10)

<FreqDist with 380 samples and 3302539 outcomes>


[(',', 341824),
 ('AND', 165729),
 (':', 123154),
 ('MAINTAINED', 80448),
 ('FOOD', 78467),
 ('EQUIPMENT', 63910),
 ('CONSTRUCTED', 63844),
 ('CLEAN', 62129),
 ('PROPERLY', 62125),
 ('OF', 60053)]
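
`nltk.FreqDist` is a subclass of `collections.Counter`, so the "samples" and "outcomes" in the printed summary correspond to the number of distinct tokens and the total number of tokens. A minimal sketch with a toy token list (hypothetical, not taken from the inspection data) shows the same quantities using only the standard library:

```python
from collections import Counter

# Toy token list (hypothetical); punctuation is still present, as in step 3
toy_tokens = ["FOOD", ",", "AND", "FOOD", ",", "CLEAN", ",", "AND", "FOOD", ","]

toy_counts = Counter(toy_tokens)

distinct_samples = len(toy_counts)           # number of distinct tokens
total_outcomes = sum(toy_counts.values())    # total number of tokens
top3 = toy_counts.most_common(3)             # highest counts first

print(distinct_samples, total_outcomes)
print(top3)
```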

In [10]:
word_list, freq = zip(*fdist.most_common(10))

In [13]:
word_list = list(word_list)
freq = list(freq)

## 4. Clean data: convert to lower case, remove stopwords, punctuation, numbers, etc

In [15]:
stopwords = set(nltk.corpus.stopwords.words('english'))

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove numbers
words = [word for word in words if not word.isnumeric()]

# Remove punctuation
words = [word for word in words if word.isalpha()]

# Lowercase all words (the NLTK stopword list is lowercase too)
words_lc = [word.lower() for word in words]

# Remove stopwords
words_lc = [word for word in words_lc if word not in stopwords]
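
The filters above can be collapsed into one helper, which makes their combined effect easier to test. This is a sketch under stated assumptions: `clean_tokens` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's English list so that the example runs without any corpus download.

```python
# Tiny stand-in stopword set (hypothetical); the notebook uses NLTK's English list
TOY_STOPWORDS = {"and", "of", "the", "in"}

def clean_tokens(tokens, stopwords):
    """Keep alphabetic tokens longer than one character, lowercase them, drop stopwords."""
    kept = [t.lower() for t in tokens if len(t) > 1 and t.isalpha()]
    return [t for t in kept if t not in stopwords]

# Hypothetical raw tokens, including punctuation, a number, and a hyphenated token
raw = ["FOOD", "AND", "NON-FOOD", "CONTACT", "SURFACES", ",", "32", "OF", "A"]
cleaned = clean_tokens(raw, TOY_STOPWORDS)
print(cleaned)
```

One side effect worth knowing: `str.isalpha()` is false for any token containing a hyphen, so a token such as `NON-FOOD`, if the tokenizer keeps it whole, is dropped entirely rather than split into its parts.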

## 5. Find top-10 tokens again

In [17]:
fdist_cleaned = nltk.FreqDist(words_lc)

print(fdist_cleaned)

#fdist.items() - will give all words
fdist_cleaned.most_common(10)

<FreqDist with 325 samples and 2175835 outcomes>


[('maintained', 80448),
 ('food', 78467),
 ('equipment', 63910),
 ('constructed', 63844),
 ('clean', 62129),
 ('properly', 62125),
 ('installed', 59289),
 ('cleaning', 48295),
 ('surfaces', 44693),
 ('contact', 42211)]

In [18]:
word_list_cleaned, freq_cleaned = zip(*fdist_cleaned.most_common(10))
word_list_cleaned = list(word_list_cleaned)
freq_cleaned = list(freq_cleaned)

## 6. Find top-10 tokens after applying Porter stemming to the tokens obtained in step 4.

In [19]:
porter = nltk.PorterStemmer()
word_porter_stem = [porter.stem(t) for t in words_lc]

In [20]:
fdist_porter = nltk.FreqDist(word_porter_stem)

print(fdist_porter)

#fdist.items() - will give all words
fdist_porter.most_common(10)

<FreqDist with 282 samples and 2175835 outcomes>


[('clean', 135055),
 ('maintain', 83997),
 ('food', 83723),
 ('equip', 63910),
 ('construct', 63844),
 ('properli', 62125),
 ('instal', 59289),
 ('surfac', 44693),
 ('contact', 42211),
 ('method', 40605)]

In [21]:
word_list_porter, freq_porter = zip(*fdist_porter.most_common(10))
word_list_porter = list(word_list_porter)
freq_porter = list(freq_porter)
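
To see why the stemmed count for "clean" is so much higher than the cleaned count, it helps to list which surface forms collapse onto each stem. The sketch below assumes NLTK is installed; `group_by_stem` and the short word list are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

def group_by_stem(tokens, stem):
    """Map each stem to the set of distinct tokens that reduce to it."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[stem(tok)].add(tok)
    return dict(groups)

porter_demo = PorterStemmer()
# Hypothetical word forms, lowercased as in step 4
forms = ["clean", "cleaning", "cleaned", "properly"]
merged = group_by_stem(forms, porter_demo.stem)
print(merged)
```

Passing `set(words_lc)` instead of the toy list would show every group in the real data; any other stemmer's `stem` method can be supplied as the second argument.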

## 7. Find top-10 tokens after applying Lancaster stemming to the tokens obtained in step 4.

In [23]:
lancaster = nltk.LancasterStemmer()
word_lancaster_stem = [lancaster.stem(t) for t in words_lc]

In [24]:
fdist_lancaster = nltk.FreqDist(word_lancaster_stem)

print(fdist_lancaster)

#fdist.items() - will give all words
fdist_lancaster.most_common(10)

<FreqDist with 270 samples and 2175835 outcomes>


[('cle', 141144),
 ('maintain', 83997),
 ('food', 83723),
 ('prop', 76405),
 ('equip', 63910),
 ('construct', 63844),
 ('instal', 59289),
 ('surfac', 44693),
 ('contact', 42211),
 ('method', 40605)]

In [25]:
word_list_lancaster, freq_lancaster = zip(*fdist_lancaster.most_common(10))
word_list_lancaster = list(word_list_lancaster)
freq_lancaster = list(freq_lancaster)

## 8. Find top-10 tokens after applying lemmatization to the tokens obtained in step 4.

In [26]:
wnl = nltk.WordNetLemmatizer()
word_lemmatize = [wnl.lemmatize(t) for t in words_lc]

In [27]:
fdist_lemma = nltk.FreqDist(word_lemmatize)

print(fdist_lemma)

#fdist.items() - will give all words
fdist_lemma.most_common(10)

<FreqDist with 309 samples and 2175835 outcomes>


[('food', 83723),
 ('maintained', 80448),
 ('equipment', 63910),
 ('constructed', 63844),
 ('clean', 62129),
 ('properly', 62125),
 ('installed', 59289),
 ('cleaning', 48295),
 ('surface', 44693),
 ('contact', 42211)]

In [28]:
word_list_lemma, freq_lemma = zip(*fdist_lemma.most_common(10))
word_list_lemma = list(word_list_lemma)
freq_lemma = list(freq_lemma)

## 9. Compare top-10 tokens obtained in 3, 5, 6, 7, 8.

In [31]:
result = pd.DataFrame(list(zip(word_list, freq, word_list_cleaned, freq_cleaned,
                               word_list_porter, freq_porter,
                               word_list_lancaster, freq_lancaster,
                               word_list_lemma, freq_lemma)),
               columns =['Whole', 'Freq', 'Cleaned', 'Freq', 
                         'Porter', 'Freq', 'Lancaster', 'Freq', 'Lemma','Freq'])

In [32]:
result

| | Whole | Freq | Cleaned | Freq | Porter | Freq | Lancaster | Freq | Lemma | Freq |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | , | 341824 | maintained | 80448 | clean | 135055 | cle | 141144 | food | 83723 |
| 1 | AND | 165729 | food | 78467 | maintain | 83997 | maintain | 83997 | maintained | 80448 |
| 2 | : | 123154 | equipment | 63910 | food | 83723 | food | 83723 | equipment | 63910 |
| 3 | MAINTAINED | 80448 | constructed | 63844 | equip | 63910 | prop | 76405 | constructed | 63844 |
| 4 | FOOD | 78467 | clean | 62129 | construct | 63844 | equip | 63910 | clean | 62129 |
| 5 | EQUIPMENT | 63910 | properly | 62125 | properli | 62125 | construct | 63844 | properly | 62125 |
| 6 | CONSTRUCTED | 63844 | installed | 59289 | instal | 59289 | instal | 59289 | installed | 59289 |
| 7 | CLEAN | 62129 | cleaning | 48295 | surfac | 44693 | surfac | 44693 | cleaning | 48295 |
| 8 | PROPERLY | 62125 | surfaces | 44693 | contact | 42211 | contact | 42211 | surface | 44693 |
| 9 | OF | 60053 | contact | 42211 | method | 40605 | method | 40605 | contact | 42211 |


**Findings:**
- The top-10 tokens from the raw text are dominated by punctuation (`,` and `:`) and stopwords (`AND`, `OF`), which carry no information about the regulations themselves.
- The cleaned and lemmatized results are very similar. The count for "food" rises from 78,467 to 83,723, which moves it ahead of "maintained"; the likely reason is that plural forms such as "foods" are folded into "food". Likewise, "surfaces" becomes "surface". The count for "maintained" stays at 80,448 because `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as "maintained", "constructed", and "installed" are left unchanged.
- The two stemmed results are also similar to each other, but several stems are no longer recognizable words: "properli", "instal", and "surfac" from Porter stemming, and "cle" and "prop" from Lancaster stemming. Their meaning can be guessed in this context, but they are harder to read than full words, and Lancaster is the more aggressive of the two.
- The frequencies show that "clean" and "cleaning" are merged into a single token by both stemmers ("clean" at 135,055 for Porter, "cle" at 141,144 for Lancaster) but remain separate in the cleaned and lemmatized results (62,129 and 48,295).
- Overall, the lemmatized list is the easiest to interpret, since every token is a proper word.
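
The similarity claims can be made concrete with set operations on the top-10 lists. The sketch below copies the tokens from the comparison table; the helper name `overlap` is hypothetical.

```python
# Top-10 tokens copied from the comparison table
top_cleaned = ["maintained", "food", "equipment", "constructed", "clean",
               "properly", "installed", "cleaning", "surfaces", "contact"]
top_lemma = ["food", "maintained", "equipment", "constructed", "clean",
             "properly", "installed", "cleaning", "surface", "contact"]
top_porter = ["clean", "maintain", "food", "equip", "construct",
              "properli", "instal", "surfac", "contact", "method"]
top_lancaster = ["cle", "maintain", "food", "prop", "equip",
                 "construct", "instal", "surfac", "contact", "method"]

def overlap(a, b):
    """Return (shared tokens, tokens only in a, tokens only in b)."""
    sa, sb = set(a), set(b)
    return sa & sb, sa - sb, sb - sa

shared_cl, only_cleaned, only_lemma = overlap(top_cleaned, top_lemma)
shared_st, only_porter, only_lancaster = overlap(top_porter, top_lancaster)

print(len(shared_cl), only_cleaned, only_lemma)
print(len(shared_st), only_porter, only_lancaster)
```

Set overlap ignores rank and frequency, so it complements the table rather than replacing it.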