# Allegheny County Health Violation Analysis

## Program Objective

Write a script that answers basic data-driven questions about health violations in Allegheny County.

## Questions to Pursue in Allegheny County Health Code Violations

1. What types of health violations are most common in three municipal areas of your choosing?
2. Which types of violations are mostly classified as "high" severity rather than "medium" or "low" severity?
3. Which classification of restaurant (i.e. use the column 'description') has the most "high" severity violations? Does this vary by the municipalities of your choosing?




In [131]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# show all columns in a dataframe
pd.set_option('display.max_columns', None)


df = pd.read_csv('data/allegheny_county_food_violations.csv')

In [132]:
df.head()

Unnamed: 0,encounter,id,placard_st,facility_name,bus_st_date,description,description_new,num,street,city,state,zip,inspect_dt,start_time,end_time,municipal,rating,low,medium,high,url
0,201502270009,10762,1,North Versailles VFD / South Wilmerding Social...,1969-01-01,Social Club-Bar Only,Handwashing Facilities,830,Sylvan Ave,North Versailles,PA,15137.0,2015-02-27,10:45:00,11:30:00,North Versailles,V,T,F,F,http://appsrv.alleghenycounty.us/reports/rwser...
1,201502270009,10762,1,North Versailles VFD / South Wilmerding Social...,1969-01-01,Social Club-Bar Only,"Fabrication, Design, Installation and Maintenance",830,Sylvan Ave,North Versailles,PA,15137.0,2015-02-27,10:45:00,11:30:00,North Versailles,V,T,F,F,http://appsrv.alleghenycounty.us/reports/rwser...
2,201602100042,10762,1,North Versailles VFD / South Wilmerding Social...,1969-01-01,Social Club-Bar Only,Employee Personal Hygiene,830,Sylvan Ave,North Versailles,PA,15137.0,2016-02-10,14:00:00,14:50:00,North Versailles,V,F,F,T,http://appsrv.alleghenycounty.us/reports/rwser...
3,201602100042,10762,1,North Versailles VFD / South Wilmerding Social...,1969-01-01,Social Club-Bar Only,Handwashing Facilities,830,Sylvan Ave,North Versailles,PA,15137.0,2016-02-10,14:00:00,14:50:00,North Versailles,V,T,F,F,http://appsrv.alleghenycounty.us/reports/rwser...
4,201602100042,10762,1,North Versailles VFD / South Wilmerding Social...,1969-01-01,Social Club-Bar Only,"Contamination Prevention - Food, Utensils and ...",830,Sylvan Ave,North Versailles,PA,15137.0,2016-02-10,14:00:00,14:50:00,North Versailles,V,T,F,F,http://appsrv.alleghenycounty.us/reports/rwser...


## Data Assessment

__Completeness__:  Do we have all of the records that we should?  Do we have missing records or not? Are there specific rows, columns, or cells missing?

__Validity__:  Does the data conform to a defined schema?  A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).

__Accuracy__:  Inaccurate data is valid but wrong. It adheres to the defined schema, yet it is still incorrect.  A typo is one example of inaccurate data.

__Consistency__:  Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing.  An example of inconsistent data is inconsistent capitalization in textual data.

In [133]:
# Check the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238859 entries, 0 to 238858
Data columns (total 21 columns):
encounter          238859 non-null int64
id                 238859 non-null int64
placard_st         238859 non-null int64
facility_name      236827 non-null object
bus_st_date        236688 non-null object
description        236827 non-null object
description_new    238859 non-null object
num                235384 non-null object
street             236827 non-null object
city               236827 non-null object
state              236827 non-null object
zip                236827 non-null float64
inspect_dt         238859 non-null object
start_time         238859 non-null object
end_time           238758 non-null object
municipal          236827 non-null object
rating             238859 non-null object
low                233385 non-null object
medium             186672 non-null object
high               186672 non-null object
url                238859 non-null object
dtypes: f

In [134]:
# Count the null values in each column
df.isnull().sum()

encounter              0
id                     0
placard_st             0
facility_name       2032
bus_st_date         2171
description         2032
description_new        0
num                 3475
street              2032
city                2032
state               2032
zip                 2032
inspect_dt             0
start_time             0
end_time             101
municipal           2032
rating                 0
low                 5474
medium             52187
high               52187
url                    0
dtype: int64
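
The identical count of 2032 for `facility_name`, `description`, `street`, `city`, `state`, `zip` and `municipal` suggests, but does not prove, that the same rows are missing all of the facility details. A minimal sketch of how that could be checked is shown below. It runs on a small made-up frame (`toy` and its values are hypothetical), so the column list would need to be adapted to the real dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe: two rows lack every facility
# detail, and one unrelated row lacks only end_time.
toy = pd.DataFrame({
    "facility_name": ["A", np.nan, "C", np.nan],
    "city": ["X", np.nan, "Z", np.nan],
    "municipal": ["M1", np.nan, "M3", np.nan],
    "end_time": ["11:30:00", "12:00:00", np.nan, "09:00:00"],
})

facility_cols = ["facility_name", "city", "municipal"]
missing = toy[facility_cols].isnull()

# Rows missing at least one facility column vs. rows missing all of them.
rows_missing_any = int(missing.any(axis=1).sum())
rows_missing_all = int(missing.all(axis=1).sum())

# If the two counts match, the nulls in these columns sit in the same rows.
same_rows = rows_missing_any == rows_missing_all
print(rows_missing_any, rows_missing_all, same_rows)
```

If the two counts agree on the real data, those rows could be treated as a single group of records with no facility information rather than as independent gaps.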

In [135]:
# Check duplicated rows
df.duplicated().sum()

0

## Data Wrangling

### Convert columns with binary-like data into numeric data

Some columns contain only two string values (e.g. "V" and "N" in the "rating" column).  We can convert these to boolean or numeric data so the columns are better suited to downstream calculations.

In [136]:
# Inspect columns with binary-like data
print(df['rating'].value_counts(), '\n')
print(df['low'].value_counts(), '\n')
print(df['medium'].value_counts(), '\n')
print(df['high'].value_counts())

V    237249
N      1610
Name: rating, dtype: int64 

T    165810
F     67575
Name: low, dtype: int64 

F    142983
T     43689
Name: medium, dtype: int64 

F    160543
T     26129
Name: high, dtype: int64


In [137]:
# Map rating column to numeric data
mapping = {'N': 0, 'V': 1}
df['rating'] = df['rating'].map(mapping)

# Inspect results
df['rating'].value_counts()

1    237249
0      1610
Name: rating, dtype: int64

In [138]:
# Map the T/F violation level flags to numeric data (T -> 1, F -> 0).
# Missing values stay NaN, so pandas stores the mapped columns as floats.
mapping = {'T': 1, 'F': 0}

df['low'] = df['low'].map(mapping)
df['medium'] = df['medium'].map(mapping)
df['high'] = df['high'].map(mapping)

In [139]:
# Verify results
print(df['low'].value_counts(), '\n')
print(df['medium'].value_counts(), '\n')
print(df['high'].value_counts())

1.0    165810
0.0     67575
Name: low, dtype: int64 

0.0    142983
1.0     43689
Name: medium, dtype: int64 

0.0    160543
1.0     26129
Name: high, dtype: int64
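
The values above print as `1.0` and `0.0` because the severity columns contain nulls, and a column holding `NaN` is stored as floats after `map`. One possible alternative, sketched here on a made-up Series (`flags` is hypothetical), is pandas' nullable `Int64` dtype, which keeps whole numbers and still records the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical T/F column with one missing entry.
flags = pd.Series(['T', 'F', np.nan, 'T'], name='high')

# Plain mapping: the NaN forces the result to float64.
as_float = flags.map({'T': 1, 'F': 0})

# Casting to the nullable integer dtype keeps 0/1 as integers and NaN as <NA>.
as_nullable_int = as_float.astype('Int64')

# Sums skip missing values by default, so counts are unaffected.
n_flagged = int(as_nullable_int.sum())
n_missing = int(as_nullable_int.isna().sum())
print(as_float.dtype, as_nullable_int.dtype, n_flagged, n_missing)
```

Either representation gives the same sums in the groupby steps later on, so this is a matter of readability rather than correctness.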


### Rename Columns

Several column names are not intuitive.  Let's rename them so their meaning is easier to understand.

In [140]:
# Rename columns that are unclear

col = {
    "encounter": "inspection_id",
    "id": "restaurant_id",
    "bus_st_date": "start_date",
    "description": "facility_type",
    "description_new": "violation",
    "num": "street_num",
    "rating": "is_violation",
    "low": "violation_level_low",
    "medium": "violation_level_med",
    "high": "violation_level_high"
}

df = df.rename(columns=col)

# Verify changes
df.columns

Index(['inspection_id', 'restaurant_id', 'placard_st', 'facility_name',
       'start_date', 'facility_type', 'violation', 'street_num', 'street',
       'city', 'state', 'zip', 'inspect_dt', 'start_time', 'end_time',
       'municipal', 'is_violation', 'violation_level_low',
       'violation_level_med', 'violation_level_high', 'url'],
      dtype='object')

## Question 1: Investigate most common health violations

### Top Ten Common Health Violations

In [141]:
# Find the top ten common health violations
df['violation'].value_counts().head(10)

Cleaning and Sanitization                                  24517
Fabrication, Design, Installation and Maintenance          20184
Cold Holding Temperatures                                  14234
Contamination Prevention - Food, Utensils and Equipment    13809
Floors                                                     12775
Handwashing Facilities                                     12603
Facilities to Maintain Temperature                         12573
Walls and ceilings                                         12566
Certified Food Protection Manager                          11652
Pest Management                                             9601
Name: violation, dtype: int64
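
The first question in the list at the top asks about three municipal areas, whereas the cell above counts violations across the whole county. A minimal sketch of a per-municipality version follows. It uses a tiny made-up frame (`toy` and its rows are hypothetical) with the renamed `municipal` and `violation` columns, and the three municipality names are the ones examined later in this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the real violation records.
toy = pd.DataFrame({
    "municipal": ["Robinson", "Robinson", "Robinson",
                  "Dormont", "Dormont", "Pittsburgh-114"],
    "violation": ["Floors", "Floors", "Pest Management",
                  "Floors", "Toxic Items", "Toxic Items"],
})

chosen = ["Robinson", "Dormont", "Pittsburgh-114"]
subset = toy[toy["municipal"].isin(chosen)]

# Count each violation type within each municipality (sorted descending
# within each group), then keep at most the top three per municipality.
top_by_municipal = (
    subset.groupby("municipal")["violation"]
          .value_counts()
          .groupby(level="municipal")
          .head(3)
)
print(top_by_municipal)
```

On the real data, `head(3)` could be raised to `head(10)` to mirror the county-wide top ten.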

## Question 2: Rank Facilities by the Number of Health Violations

Investigate which facilities have the highest number of health violations, and group facilities by municipality as well as health violation level.

In [142]:
cols = ['facility_name', 'is_violation', 'violation_level_low',
       'violation_level_med', 'violation_level_high', 'municipal']
df_viol_rank = df[cols].groupby(['facility_name', 'municipal']) \
                       .sum() \
                       .sort_values(by='is_violation', ascending=False)

In [143]:
df_viol_rank.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,is_violation,violation_level_low,violation_level_med,violation_level_high
facility_name,municipal,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Las Palmas #2,Pittsburgh-104,256,137.0,81.0,54.0
Sun Penang,Pittsburgh-114,240,183.0,43.0,21.0
Ichiban II,Robinson,236,178.0,32.0,35.0
Giovanni's Pizza & Pasta,Pittsburgh-102,221,155.0,32.0,37.0
Tan Lac Vien,Pittsburgh-114,218,124.0,70.0,31.0
Rose Tea Cafe,Pittsburgh-114,218,134.0,49.0,38.0
Pittsburgh Athletic Assoc Restaurant,Pittsburgh-104,213,136.0,57.0,30.0
Plaza Azteca,Robinson,209,138.0,28.0,45.0
The Bagel Factory,Pittsburgh-104,208,133.0,41.0,39.0
Cupka's Cafe II,Pittsburgh-116,206,136.0,48.0,32.0


## Question 3: Which type of violations occur the most for each severity level?

In [144]:
cols = ['violation', 'violation_level_low',
       'violation_level_med', 'violation_level_high']
df_violations_by_severity = df[cols].groupby(['violation']) \
                                    .sum()

#### Most common violations considered "low severity"

In [145]:
df_violations_by_severity['violation_level_low'].sort_values(ascending=False).head(10)

violation
Cleaning and Sanitization                                  20258.0
Fabrication, Design, Installation and Maintenance          20158.0
Contamination Prevention - Food, Utensils and Equipment    13727.0
Floors                                                     12772.0
Walls and ceilings                                         12564.0
Handwashing Facilities                                      8451.0
Facilities to Maintain Temperature                          7938.0
Pest Management                                             7432.0
Garbage and Refuse                                          7324.0
Date Marking of Food                                        7194.0
Name: violation_level_low, dtype: float64

#### Most common violations considered "medium severity"

In [146]:
df_violations_by_severity['violation_level_med'].sort_values(ascending=False).head(10)

violation
Certified Food Protection Manager     8956.0
Toxic Items                           5392.0
Handwashing Facilities                4239.0
Facilities to Maintain Temperature    3498.0
Probe-Type Thermometers               3471.0
Cold Holding Temperatures             2852.0
Cleaning and Sanitization             2371.0
Cooling Food                          2141.0
Date Marking of Food                  1979.0
Cross-Contamination Prevention        1588.0
Name: violation_level_med, dtype: float64

#### Most common violations considered "high severity"

In [147]:
df_violations_by_severity['violation_level_high'].sort_values(ascending=False).head(10)

violation
Cold Holding Temperatures         10418.0
Cleaning and Sanitization          5475.0
Hot Holding Temperatures           2993.0
Employee Personal Hygiene          1822.0
Food Source/Condition              1404.0
Pest Management                    1271.0
Cross-Contamination Prevention     1222.0
Cooling Food                       1217.0
Reheating Temperatures              252.0
Cooking Temperatures                 32.0
Name: violation_level_high, dtype: float64
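
The rankings above are raw counts, so a violation type can appear near the top of the "high" list simply because it is common overall. To see which types are mostly rated "high", one option is to compute the share of high-severity flags within each violation type. The sketch below assumes a frame shaped like `df_violations_by_severity`. The numbers in `by_severity` are made up for illustration, and the 0.5 cutoff is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical per-violation sums, shaped like df_violations_by_severity.
by_severity = pd.DataFrame(
    {
        "violation_level_low": [8, 1, 0],
        "violation_level_med": [1, 2, 0],
        "violation_level_high": [1, 7, 5],
    },
    index=pd.Index(
        ["Floors", "Cold Holding Temperatures", "Cooking Temperatures"],
        name="violation",
    ),
)

# Share of each violation type's severity flags that are "high".
totals = by_severity.sum(axis=1)
high_share = (by_severity["violation_level_high"] / totals) \
    .sort_values(ascending=False)

# Keep the types where more than half of the flags are "high".
mostly_high = high_share[high_share > 0.5]
print(mostly_high)
```

A violation type with very few records can reach a share of 1.0, so it may be worth reading the share alongside the raw counts.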

## Question 4: Which classification of restaurant has the most "high" severity violations? Does this vary by the municipalities of your choosing?

In [149]:
# Get sum of high level violations by municipality and facility type
cols = ['facility_type', 'municipal', 'violation_level_high']
df_viol_by_facility_type = df[cols].groupby(['municipal', 'facility_type']) \
                                   .sum()

In [150]:
# Inspect the grouped dataframe
df_viol_by_facility_type.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,violation_level_high
municipal,facility_type,Unnamed: 2_level_1
Aleppo,Adult Food Service,6.0
Aleppo,Child Food Service,1.0
Aleppo,Nursing Home/Personal Care Comb.,1.0
Aleppo,Nursing Home/Personal Care Snack Bar,0.0
Aspinwall,Bakery,3.0


In [151]:
# Within each municipality, sort facility types by their count of high level violations (descending)
df_viol_by_municipal = df_viol_by_facility_type.groupby(level=0, group_keys=False) \
                           .apply(lambda x: x.sort_values(by='violation_level_high', ascending=False))

In [152]:
# Inspect the grouped dataframe
df_viol_by_municipal.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,violation_level_high
municipal,facility_type,Unnamed: 2_level_1
Aleppo,Adult Food Service,6.0
Aleppo,Child Food Service,1.0
Aleppo,Nursing Home/Personal Care Comb.,1.0
Aleppo,Nursing Home/Personal Care Snack Bar,0.0
Aspinwall,Restaurant with Liquor,65.0


In [153]:
# Reset the index so municipal and facility_type become regular columns that are easy to filter
df_viol_by_municipal = df_viol_by_municipal.reset_index()

#### Top five facility types by number of high level violations in three municipalities

#### Robinson

In [154]:
df_viol_by_municipal.loc[df_viol_by_municipal['municipal'] == 'Robinson'].head()

Unnamed: 0,municipal,facility_type,violation_level_high
1950,Robinson,Chain Restaurant with Liquor,349.0
1951,Robinson,Chain Restaurant without Liquor,233.0
1952,Robinson,Restaurant with Liquor,155.0
1953,Robinson,Restaurant without Liquor,108.0
1954,Robinson,Chain Retail/Convenience Store,14.0


#### Pittsburgh-114

In [155]:
df_viol_by_municipal.loc[df_viol_by_municipal['municipal'] == 'Pittsburgh-114'].head()

Unnamed: 0,municipal,facility_type,violation_level_high
1504,Pittsburgh-114,Restaurant without Liquor,290.0
1505,Pittsburgh-114,Restaurant with Liquor,194.0
1506,Pittsburgh-114,Chain Restaurant without Liquor,168.0
1507,Pittsburgh-114,University Food Service,67.0
1508,Pittsburgh-114,"Hospital, Gov, University (limited)",41.0


#### Dormont

In [156]:
df_viol_by_municipal.loc[df_viol_by_municipal['municipal'] == 'Dormont'].head()

Unnamed: 0,municipal,facility_type,violation_level_high
327,Dormont,Restaurant with Liquor,65.0
328,Dormont,Restaurant without Liquor,37.0
329,Dormont,Chain Restaurant without Liquor,24.0
330,Dormont,Chain Restaurant with Liquor,9.0
331,Dormont,Chain Retail/Convenience Store,7.0


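In this subset of municipalities, the facility type with the largest number of high level violations is not the same everywhere: it is __Chain Restaurant with Liquor__ in Robinson (349), __Restaurant without Liquor__ in Pittsburgh-114 (290), and __Restaurant with Liquor__ in Dormont (65).  Restaurant categories take the top spots in all three, but the leading type does vary by municipality.  These are raw counts rather than rates per facility, so they partly reflect how many facilities of each type operate in each municipality.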

## Note to Self:

Create a programmatic way to pull the first row for each municipality from the grouped data.  This would give a survey of the facility type with the most high level violations in every municipality.  I was able to group by municipality and facility_type but couldn't figure out how to filter the resulting multi-indexed dataframe (a better understanding of multi-indexed dataframes is probably needed).  The workaround was to reset the index and filter on the `municipal` column, but that still left no obvious way to select only the top row for each municipality.
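
One possible way to do this, sketched on a small made-up frame (`toy` and its values are hypothetical), is to sum per municipality and facility type with `as_index=False`, then use `idxmax` within each municipality to find the row holding the largest count.

```python
import pandas as pd

# Hypothetical records with the renamed columns used in this notebook.
toy = pd.DataFrame({
    "municipal": ["Robinson", "Robinson", "Dormont", "Dormont"],
    "facility_type": ["Chain Restaurant with Liquor", "Restaurant with Liquor",
                      "Restaurant with Liquor", "Bakery"],
    "violation_level_high": [5.0, 3.0, 4.0, 1.0],
})

# Sum high level violations per (municipal, facility_type); as_index=False
# keeps the grouping keys as ordinary columns instead of a MultiIndex.
totals = toy.groupby(["municipal", "facility_type"], as_index=False)[
    "violation_level_high"
].sum()

# For each municipality, find the row label of its largest total ...
top_idx = totals.groupby("municipal")["violation_level_high"].idxmax()

# ... and select exactly those rows.
top_per_municipal = totals.loc[top_idx].set_index("municipal")
print(top_per_municipal)
```

When two facility types tie within a municipality, `idxmax` returns only the first of them. Sorting by `violation_level_high` in descending order and calling `groupby("municipal").head(1)` is an alternative that makes it easy to switch to the top few rows instead.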