# NLP Assignment 1 - Regular Expressions

**Prompt:**  
Use Python Regular Expressions to identify the top-10 most frequent causes of failed food inspections in Chicago. The answer must contain a textual description of the violations. You can download the dataset here: https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5

Rules and requirements:

Your final output and the code should be contained within a Jupyter Notebook (.ipynb).

In [1]:
import re
import pandas as pd
import numpy as np

In [2]:
data_path = "/Users/rowena/Datasets/"
file_path = data_path + "Food_Inspections.csv"

## Data Exploration

To get a first look at the dataset, I read it into a pandas DataFrame.

In [3]:
df = pd.read_csv(file_path)

In [4]:
df.shape

(184999, 17)

In [5]:
df.Results.head(100).value_counts()

Pass w/ Conditions    36
Pass                  33
Fail                  21
Out of Business        5
Not Ready              3
No Entry               2
Name: Results, dtype: int64

Failed inspections carry the value 'Fail' in the Results column, so 'Fail' is the string to look for when using regex.

In [6]:
df[df.Results == 'Fail'].head(3)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
6,2282695,ISSA STORE INC,ISSA STORE,2601424.0,Grocery Store,Risk 2 (Medium),3641 W AUGUSTA BLVD,CHICAGO,IL,60651.0,04/05/2019,Canvass,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.898985,-87.717987,"(41.89898510417861, -87.7179866792102)"
8,2282681,"LA CASA DEL BORREGO, INC.",LA CASE DEL BORREGO,2658561.0,Restaurant,Risk 1 (High),3002 S PULASKI RD,CHICAGO,IL,60623.0,04/05/2019,License,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.838572,-87.724543,"(41.838571723634004, -87.72454334680803)"
19,2282662,LA CATRINA RESTAURANT LLC,LA CATRINA RESTAURANT,2658196.0,Restaurant,Risk 1 (High),3924 W DIVERSEY AVE,CHICAGO,IL,60647.0,04/05/2019,License,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.931918,-87.725545,"(41.931917684341876, -87.725544754742)"


The Violations column contains the causes of the failed inspections.

In [7]:
df[df.Results == 'Fail'].Violations[6]

'2. CITY OF CHICAGO FOOD SERVICE SANITATION CERTIFICATE - Comments: PREMISES HAS NO VALID CITY OF CHICAGO CERIFIED MANAGER. INSTRUCTED MANAGER A VALID CITY OF CHICAGO CERIFIED MANAGER MUST BE ONSITE AT ALL TIME WHEN TCS FOODS AREA BEING PREPARED, HANDLED AND/OR SERVED. PRIORITY FOUNDATION VIOLATION #7-38-012.  | 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: **PREMISES HAS NO EMPLOYEE HEALTH POLICY. INSTRUCTED MANAGER TO COMPLY WITH THE NEW CODES OR CITATIONS WILL FOLLOW. PRIORITY FOUNDATION VIOLATION #7-38-010. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: **PREMISES HAS NO CLEAN UP PROCEDURES OR SUPPLIES. INSTRUCTED MANAGER TO COMPLY WITH THE NEW CODES OR CITATIONS WILL FOLLOW. PRIORITY FOUNDATION VIOLATION #7-38-005. | 36. THERMOMETERS PROVIDED & ACCURATE - Comments: INSTRUCTED MANAGER TO PROVIDE AND MAINTAIN A WORKING LONG STEM THERMOMETER TO CHECK FOOD ITEMS IN HOT HOLDING. MUST COMPLY OR C

The value is one long string in which individual violations are separated by '|'. Each violation starts with a number followed by a period, and its description ends just before ' - Comments:'.
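
To make that structure concrete, here is a minimal sketch that splits a shortened version of the string above on the `|` separator and pulls each entry apart into number, description, and comment. The `sample` string, the `entry_re` pattern, and the `parse_entry` helper are illustrative assumptions and are not part of the extraction used later in this notebook.

```python
import re

# Shortened sample modeled on the Violations value shown above (illustrative only).
sample = (
    "2. CITY OF CHICAGO FOOD SERVICE SANITATION CERTIFICATE - Comments: "
    "PREMISES HAS NO VALID CITY OF CHICAGO CERIFIED MANAGER. | "
    "36. THERMOMETERS PROVIDED & ACCURATE - Comments: "
    "INSTRUCTED MANAGER TO PROVIDE AND MAINTAIN A WORKING LONG STEM THERMOMETER."
)

# Hypothetical pattern: number, description, then everything after "- Comments:".
entry_re = re.compile(r'^(\d+)\.\s+(.*?)\s+-\s+Comments:\s*(.*)$', re.DOTALL)

def parse_entry(entry):
    """Return (number, description, comment) for one entry, or None if it does not fit."""
    m = entry_re.match(entry.strip())
    if m is None:
        return None
    number, description, comment = m.groups()
    return int(number), description, comment

# One parsed tuple per '|'-separated entry.
entries = [parse_entry(part) for part in sample.split('|')]
for entry in entries:
    print(entry)
```

Only the number and description are needed for counting causes; the comment text varies from inspection to inspection.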

## Regex Extraction

In [8]:
# Read the raw CSV as text, one list element per line, so the regex can scan each line.
with open(file_path, 'r') as file:
    txt = file.read().split('\n')

In [9]:
# Reason starts with a number that doesn't begin with 0 and ends at the " -" that precedes the comments section.
ptrn = re.compile(r'\b[1-9][0-9]*\. .*?(?= -)')
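
As a quick sanity check, the pattern can be tried on a short string before running it over the whole file. The sketch below uses a `sample` string abbreviated from the violation text shown earlier; the abbreviation is an assumption made for illustration. The lazy `.*?` together with the lookahead `(?= -)` stops each match just before ` - Comments`, and the `\b[1-9]` prefix keeps the trailing digits of a code such as `#7-38-010.` from being picked up as a new entry.

```python
import re

# Same pattern as in the cell above.
ptrn = re.compile(r'\b[1-9][0-9]*\. .*?(?= -)')

# Abbreviated sample with two violations separated by '|' (illustrative only).
sample = (
    "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, "
    "RESPONSIBILITIES AND REPORTING - Comments: **PREMISES HAS NO EMPLOYEE HEALTH POLICY. "
    "PRIORITY FOUNDATION VIOLATION #7-38-010. | "
    "5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: "
    "**PREMISES HAS NO CLEAN UP PROCEDURES OR SUPPLIES."
)

# findall returns one string per violation: the number plus its description.
matches = ptrn.findall(sample)
for m in matches:
    print(m)
```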

In [10]:
reasons = []
for line in txt:
    # Keep only lines that mention a failed inspection.
    if 'Fail' in line:
        # Collect every "<number>. <description>" entry on the line.
        reasons.extend(ptrn.findall(line))
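
A line-based search can, in principle, match 'Fail' somewhere other than the Results column, and it relies on every record fitting on one line. As a possible cross-check, the same pattern could be applied to the Violations column of rows whose Results value is exactly 'Fail'. The sketch below assumes a tiny stand-in DataFrame named `toy` with made-up rows; it is not the real dataset and not the method used for the results in this notebook.

```python
import re
import pandas as pd

# Same pattern as in the cell above.
ptrn = re.compile(r'\b[1-9][0-9]*\. .*?(?= -)')

# Tiny stand-in for the real dataset (hypothetical rows, for illustration only).
toy = pd.DataFrame({
    "DBA Name": ["Fail-Proof Cafe", "CORNER GROCERY", "TACO STAND", "NOODLE BAR"],
    "Results": ["Pass", "Fail", "Fail", "Fail"],
    "Violations": [
        "34. FLOORS: CONSTRUCTED PER CODE - Comments: FLOOR TILES CRACKED.",
        "34. FLOORS: CONSTRUCTED PER CODE - Comments: CLEAN FLOORS. | "
        "36. THERMOMETERS PROVIDED & ACCURATE - Comments: NO THERMOMETER.",
        None,
        "34. FLOORS: CONSTRUCTED PER CODE - Comments: DEBRIS UNDER SHELVES.",
    ],
})

# Select failed inspections by column value, dropping rows with no violations text.
failed = toy.loc[toy["Results"] == "Fail", "Violations"].dropna()

# Extract the list of reasons per row, then flatten to one reason per element.
toy_causes = failed.apply(ptrn.findall).explode()
counts = toy_causes.value_counts()
print(counts)
```

Here the 'Pass' row is excluded even though its business name contains 'Fail', which a plain text search over the line would have matched.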

In [11]:
causes = pd.Series(reasons)

In [12]:
# value_counts() already sorts by count in descending order.
causes.value_counts().head(10)

34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED                                     19002
35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS                      17921
33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS                                                           15953
18. NO EVIDENCE OF RODENT OR INSECT OUTER OPENINGS PROTECTED/RODENT PROOFED, A WRITTEN LOG SHALL BE MAINTAINED AVAILABLE TO THE INSPECTORS    15903
38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED                                                   15297
32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED                                                          14126
41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING  EQUIPMENT PROPERLY STORED               

The most frequent cause of failed inspections in this dataset is violation 34, which covers floor construction, cleaning, and repair. The 10th most common cause has to do with refrigeration thermometers.