# Introduction

https://www.nycfoodpolicy.org/restaurant-grading-system-hunter-college-nyc-food-policy-center/

Cities across the United States are capitalizing on big data. Predictive policing is becoming a prominent tool for public safety in many cities. In Boston, an algorithm helps determine “problem properties” where the city can target interventions. In Chicago, they are protecting citizens by predicting which landlords are not complying with city ordinances. In New York, the Fire Department sends inspectors to the highest risk buildings so they can prevent deadly fires from breaking out.

According to the CDC, more than 48 million Americans per year become sick from food, and an estimated 75% of the outbreaks came from food prepared by caterers, delis, and restaurants. In most cities, health inspections are generally random, which can increase the time spent on spot checks at clean restaurants that have been following the rules closely and lead to missed opportunities to improve health and hygiene at places with more pressing food safety issues.

The goal of this project is to leverage public, citizen-generated data from social media to narrow the search for critical health and safety violations in New York City. Because the City of New York maintains an open data portal, anyone can access historical hygiene inspection and violation records. By combining these two data sources, this project aims to determine which words, phrases, ratings, and patterns among restaurants are associated with critical health and safety violations. Such a model can help city health inspectors do their job better by prioritizing the kitchens most likely to be in violation of code.

The New York Health Department inspects the approximately 27,000 restaurants within the city to monitor their compliance with food safety regulations. Inspectors observe how food is prepared, served and stored and whether restaurant workers are practicing good hygiene. They check food temperatures, equipment maintenance and pest control measures.

### Imports

In [None]:
from IPython import display
from bs4 import BeautifulSoup as bs
import requests
import json
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point, Polygon
import plotly.express as px

import warnings
import time

import folium
import folium.plugins as plugins

# misc
import glob, os
import ast

call_apis = False
warnings.filterwarnings("ignore")
pd.set_option('display.max_colwidth',200)
pd.set_option('display.max_columns',50)
pd.options.display.float_format = '{:.2f}'.format

from sklearn.preprocessing import OneHotEncoder

#NLP 
import spacy
nlp = spacy.load("en_core_web_lg")
from spacy import displacy
import nltk 
import string
from nltk.collocations import *
from nltk import word_tokenize,wordpunct_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
#nltk.download('wordnet')

# sk-learn
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.tree import DecisionTreeClassifier

### Can you predict if a restaurant has received a C grade?
### Can you predict a restaurant's Yelp rating based on its review text?
### Can you recommend a similar nearby restaurant with a better health score when the restaurant searched for has a C?

# Obtain

For this project there will be two sources and types of data used:

* Historical health and hygiene inspections recorded by New York City Department of Health and Mental Hygiene (DOHMH) public health inspectors
* User generated Yelp business ratings and reviews

This project requires data pulled from two different sources, the City of New York and Yelp. The inspection results are downloaded from the NYC Open Data portal, and the Yelp data is obtained by calling the Yelp API with an API key.

 > The dataset contains every sustained or not yet adjudicated violation citation from every full or special program inspection conducted up to three years prior to the most recent inspection for restaurants and college cafeterias in an active status on the RECORD DATE (date of the data pull). When an inspection results in more than one violation, values for associated fields are repeated for each additional violation record. Establishments are uniquely identified by their CAMIS (record ID) number. Keep in mind that thousands of restaurants start business and go out of business every year; only restaurants in an active status are included in the dataset.
Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.
Because this dataset is compiled from several large administrative data systems, it contains some illogical values that could be a result of data entry or transfer errors. Data may also be missing.
This dataset and the information on the Health Department’s Restaurant Grading website come from the same data source. The Health Department’s Restaurant Grading website is here:
http://www1.nyc.gov/site/doh/services/restaurant-grades.page
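
The sentinel date described in the quote above can distort any date-range summary. The following is a minimal sketch, using a made-up three-row frame that borrows the dataset's `CAMIS` and `INSPECTION DATE` column names, of how not-yet-inspected establishments could be separated out before computing a range. The values are hypothetical and the date format is an assumption.

```python
import pandas as pd

# Toy records (hypothetical values); column names follow the DOHMH dataset
toy = pd.DataFrame({
    'CAMIS': [1, 2, 3],
    'INSPECTION DATE': ['01/01/1900', '05/14/2019', '03/02/2021'],
})
toy['INSPECTION DATE'] = pd.to_datetime(toy['INSPECTION DATE'], format='%m/%d/%Y')

# 1/1/1900 marks establishments that have not been inspected yet
not_inspected = toy['INSPECTION DATE'] == pd.Timestamp('1900-01-01')
inspected = toy[~not_inspected]

print(not_inspected.sum())                 # number of uninspected establishments
print(inspected['INSPECTION DATE'].min())  # earliest real inspection date
```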

> Why does the Health Department inspect restaurants?
The Health Department inspects the approximately 27,000 restaurants in New York City to monitor their compliance with food safety regulations. Inspectors observe how food is prepared, served and stored and whether restaurant workers are practicing good hygiene. They check food temperatures, equipment maintenance and pest control measures.

> Since 2010, New York City has required restaurants to post letter grades that correspond to scores received from sanitary inspections. An inspection score of 0 to 13 is an A, 14 to 27 points is a B, and 28 or more points is a C. Grade cards must be posted where they can easily be seen by people passing by.

> The New York City Health Department inspects all food service establishments to make sure they meet Health Code requirements, which helps prevent
foodborne illness. How often a restaurant is inspected depends on its inspection score. Restaurants that receive a low score on the initial or first inspection
in the inspection cycle are inspected less often than those that receive a high score.

> The points for a particular violation depend on the health risk it poses to the public. Violations fall into three categories:
>* A public health hazard, such as failing to keep food at the right temperature, triggers a minimum of 7 points. If the violation can’t be corrected before the inspection ends, the Health Department may close the restaurant until it’s fixed.
>* A critical violation, for example, serving raw food such as a salad without properly washing it first, carries a minimum of 5 points.
>* A general violation, such as not properly sanitizing
cooking utensils, receives at least 2 points.

>Inspectors assign additional points to reflect the extent of the
violation. A violation’s condition level can range from 1 (least
extensive) to 5 (most extensive). For example, the presence of
one contaminated food item is a condition level 1 violation,
generating 7 points. Four or more contaminated food items
is a condition level 4 violation, resulting in 10 points. 

> How are restaurants graded?
Violations found during inspections carry point values, and a restaurant’s score corresponds to a letter grade. The point/grade cut-offs are the same as for mobile food vending letter grading, with fewer points corresponding to a better grade:

>* "A" grade: 0 to 13 points for sanitary violations
>* "B" grade: 14 to 27 points for sanitary violations
>* "C" grade: 28 or more points for sanitary violations


The City of New York inspects all restaurants cyclically. If a business does not pass its initial inspection for the cycle, it will be re-inspected in 3-5 months.

## Obtaining Restaurant Inspection Results from NYC Open Data Portal

The dataset can be obtained here

https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

The dataset was downloaded and saved to this repository. Let's load it in and explore its contents.

Detailed descriptions about each column can be found in the Restaurant Inspection Data Dictionary.

In [None]:
doh_df = pd.read_csv('data/nyc_open_data/DOHMH_New_York_City_Restaurant_Inspection_Results.csv')
doh_df

The dataset contains 186,227 inspection results. However, when an inspection results in more than one violation, values for associated fields are repeated for each additional violation record. So let's check how many individual restaurants are in the dataset.
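
To make the row structure concrete, here is a small sketch on hypothetical data. It assumes that an inspection can be identified by the pair `CAMIS` and `INSPECTION DATE`, which is a simplification, and shows how the row count, inspection count, and restaurant count can all differ.

```python
import pandas as pd

# Toy data (hypothetical): restaurant 1 has one inspection with two violations
toy = pd.DataFrame({
    'CAMIS': [1, 1, 2],
    'INSPECTION DATE': ['05/14/2019', '05/14/2019', '06/01/2019'],
    'VIOLATION CODE': ['04L', '08A', '10F'],
    'SCORE': [12, 12, 5],
})

n_rows = len(toy)  # one row per violation record
n_inspections = len(toy.drop_duplicates(subset=['CAMIS', 'INSPECTION DATE']))
n_restaurants = toy['CAMIS'].nunique()

print(n_rows, n_inspections, n_restaurants)
```

Because the score is repeated on every violation row, any per-row count of grades reflects violation records rather than inspections.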

### Understanding NYC DOHMH Data

In [None]:
# How many unique restaurants are in this dataset?
n_unique = doh_df['CAMIS'].nunique()
print(f'There are {n_unique} unique restaurants in the dataset. ')

In [None]:
# Get more information about the dataset contents
doh_df.info()

In [None]:
doh_df['ZIPCODE'].astype(str)

In [None]:
doh_df['Community Board'].astype(str)

In [None]:
doh_df['Council District'].astype(str)

In [None]:
doh_df['Census Tract'].astype(str)

Every row has a restaurant ID, address, date, and score. Let's ensure there aren't any duplicated rows.

In [None]:
print(f'There are {doh_df.duplicated(keep=False).sum()} duplicated rows. ')

In [None]:
# Let's drop these duplicated rows
doh_df.drop_duplicates(keep='first',inplace=True)

In [None]:
# Confirming duplicates have been removed
doh_df.shape

Since this project will leverage publicly generated data from social media, a lookup value is needed to call the API and join the tables. The Yelp API has a Phone Search endpoint, which allows us to pull Yelp business data for each restaurant by providing a telephone number. More information can be found in the [documentation here.](https://www.yelp.com/developers/documentation/v3/business_search_phone)
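
The Phone Search endpoint expects numbers that start with `+` and include the country code. Below is a sketch of one way to normalize raw US numbers into that shape before using them as a join key. The helper name `to_e164_us` is hypothetical, and the rule that every valid number has exactly ten digits is an assumption about the data.

```python
import re

def to_e164_us(raw):
    """Return a US number as +1XXXXXXXXXX, or None if it does not have 10 digits."""
    digits = re.sub(r'\D', '', str(raw))       # keep digits only
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]                     # strip an existing country code
    return '+1' + digits if len(digits) == 10 else None

print(to_e164_us('(212) 555-1234'))
print(to_e164_us('555'))
```

Returning `None` for malformed numbers makes them easy to drop with the same `dropna` pattern used elsewhere in this notebook.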

In [None]:
# Checking that every restaurant has a phone number
missing_num = doh_df['PHONE'].isna().sum()
print(f'There are {missing_num} restaurants missing a telephone number.')

In [None]:
# Since only 13 numbers are missing, these rows can be dropped
doh_df.dropna(subset=['PHONE'],inplace=True)

In [None]:
# Confirming records were dropped
doh_df['PHONE'].isna().sum()

In [None]:
# How many unique restaurants are remaining?
n_unique = doh_df['CAMIS'].nunique()
print(f'There are {n_unique} unique restaurants remaining in the dataset. ')

Let's explore the date range for this dataset.

In [None]:
doh_df['INSPECTION DATE'] =  pd.to_datetime(doh_df['INSPECTION DATE'])
begin_date = doh_df['INSPECTION DATE'].min()
end_date = doh_df['INSPECTION DATE'].max()
print(f'The data ranges from {begin_date} to {end_date}')

Inspections in this dataset range from May 2009 up to March 2022.

### Target Variable -- NYCDOH Inspection Grades

Health code violations found during an inspection carry a point value, and a restaurant’s score corresponds to a letter grade. A lower point score leads to a better letter grade:

* "A" grade: 0 to 13 points for sanitary violations
* "B" grade: 14 to 27 points for sanitary violations
* "C" grade: 28 or more points for sanitary violations
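
The cut-offs listed above can be written as a small function. This is an illustrative sketch; the name `score_to_grade` is hypothetical and not part of the notebook's pipeline.

```python
def score_to_grade(score):
    """Map an inspection score to a letter grade using the cut-offs above."""
    if score < 0:
        raise ValueError('inspection scores cannot be negative')
    if score <= 13:
        return 'A'
    if score <= 27:
        return 'B'
    return 'C'

for s in (0, 13, 14, 27, 28, 60):
    print(s, score_to_grade(s))
```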

In [None]:
# Let's see what the score distribution is
doh_df['SCORE'].hist(bins=113, figsize=(12,6));

In [None]:
doh_df['SCORE'].describe()

In [None]:
doh_graded = doh_df.copy()

In [None]:
doh_graded.drop(columns=['INSPECTION DATE', 'ACTION', 'VIOLATION CODE',
       'VIOLATION DESCRIPTION', 'CRITICAL FLAG','GRADE',
       'GRADE DATE', 'RECORD DATE', 'INSPECTION TYPE'],inplace=True)

In [None]:
# One indicator column per letter grade, based on the score cut-offs
doh_graded['A'] = (doh_graded['SCORE'] < 14).astype(int)
doh_graded['B'] = ((doh_graded['SCORE'] > 13) & (doh_graded['SCORE'] < 28)).astype(int)
doh_graded['C'] = (doh_graded['SCORE'] > 27).astype(int)

In [None]:
# Selecting several columns from a groupby requires a list (double brackets)
doh_grouped = doh_graded.groupby(by=['CAMIS', 'DBA','CUISINE DESCRIPTION',
                                     'BORO', 'BUILDING',
                                     'STREET', 'ZIPCODE', 'PHONE', 'Latitude',
                                     'Longitude', 'Community Board',
                                     'Council District','Census Tract'],dropna=False)[['A','B','C']].sum()

In [None]:
(doh_grouped['B'] > 0).sum()

Of the 19,790 unique restaurants, 9,977 failed an initial cycle inspection at least once.

In [None]:
(doh_grouped['C'] > 0).sum()

Of the 19,790 unique restaurants, 5,648 severely failed an initial cycle inspection at least once and are at risk of being closed by the DOHMH.

In [None]:
# Creating the Target Variable 'Severe' for Restaurants that have scored 28 or more points in an initial inspection.
doh_grouped['Severe'] = (doh_grouped['C'] > 0).astype(int)

In [None]:
nyc_df = doh_grouped.reset_index()
nyc_df.drop(['A','B','C'],axis=1, inplace=True)
nyc_df

#### NYC DOH Data Exploration

In [None]:
nyc_df['BORO'].value_counts()

In [None]:
nyc_df['CUISINE DESCRIPTION'].value_counts()

In [None]:
nyc_df['ZIPCODE'].value_counts()

## Obtaining Yelp Business and Review Data

Now that we have explored the city's inspection results, it is time to pull in data from the crowd-sourced review platform Yelp.

In [None]:
# Loading in locally stored API credentials. 
# You can sign up for access and obtain credentials to the Yelp API here: 
# https://www.yelp.com/developers/documentation/v3

with open('/Users/Rob/.secret/yelp_api.json') as f:
    creds = json.load(f)

In [None]:
# Checking creds were properly loaded in
creds.keys()

### Yelp Business Search

In [None]:
# Formatting phone numbers provided in the NYCDOH dataset 
nyc_df['PHONE'] = '+1'+nyc_df['PHONE']

# Ensure the list contains unique phone numbers only
phone_numbers = set(nyc_df['PHONE'])
phone_numbers = list(phone_numbers)
number_count = len(phone_numbers)
print(f'There are {number_count} unique phone numbers.')

We will call the Yelp API's Phone Search endpoint for every number in the `phone_numbers` list. However, the API only allows 5,000 calls per day, so we'll slice the list into smaller lists.
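
As an alternative to slicing by hand, the list could be split programmatically. This is a sketch only: the helper name `chunk` and the batch size of 5 are illustrative assumptions, and the phone numbers are made up.

```python
def chunk(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical phone numbers standing in for `phone_numbers`
numbers = [f'+1212555{n:04d}' for n in range(12)]
batches = chunk(numbers, 5)

print([len(b) for b in batches])  # [5, 5, 2]
```

Starting the range at 0 ensures that no element is skipped and the last batch simply holds whatever remains.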

In [None]:
# Slicing the phone list into smaller lists to fit under the API daily limit restrictions
# phone_numbers1 = phone_numbers[0:1000]
# phone_numbers2 = phone_numbers[1000:2000]
# phone_numbers3 = phone_numbers[2000:2500]
# phone_numbers4 = phone_numbers[2500:3500]
# phone_numbers5 = phone_numbers[3500:5000]
# phone_numbers6 = phone_numbers[5000:6000]
# phone_numbers7 = phone_numbers[6000:7500]
# phone_numbers8 = phone_numbers[7500:10000]
# phone_numbers9 = phone_numbers[10000:12500]
# phone_numbers10 = phone_numbers[12500:15000]
# phone_numbers11 = phone_numbers[15000:17500]
# phone_numbers12 = phone_numbers[17500:20000]

In [None]:
# #Functionizing the Yelp API Phone Search

# def get_businesses(phone_numbers):
#     """Input a list of formatted phone numbers
#     (must start with + and include the country code, like +14159083801)
#     and returns a corresponding list of Yelp Businesses"""
    
#     biz_list = []
    
#     for number in phone_numbers:
#         url = 'https://api.yelp.com/v3/businesses/search/phone'
#         headers = {'Authorization': 'Bearer ' + creds['api_key']}
#         url_params = {'phone': number}
#         response = requests.get(url, headers=headers, params=url_params)
#         response_json = response.json()
#         biz_list.extend(response_json.get('businesses','U'))
        
#     while 'U' in biz_list:
#         biz_list.remove('U')
        
#     return biz_list

In [None]:
# # Call `get_businesses` function
# if call_apis == True:
#     biz_list12 = get_businesses(phone_numbers12)
    
#     # Save returned list as a DataFrame and .csv file
#     biz12_df = pd.DataFrame(biz_list12)
#     biz12_df.to_csv('data/yelp_data/yelp_business/yelp_phone12.csv',index=False)

In [None]:
# List of files containing Yelp business data
fpath = 'data/yelp_data/yelp_businesses/'
os.listdir(fpath)
query = fpath+"*.csv"
f_list = glob.glob(query)

In [None]:
# Append saved Yelp Business tables to a dict
yelp_tables = {}

for f in f_list:
    temp_df = pd.read_csv(f)
    fname = f.replace('data/yelp_data/yelp_businesses/yelp_phone','df_').replace('.csv','')
    yelp_tables[fname] = temp_df

In [None]:
yelp_df_list = [t for t in list(yelp_tables.keys())]

In [None]:
# Concatenating all Yelp Businesses responses from the Phone Search
yelp_businesses_df = pd.concat(yelp_tables,ignore_index=True)
yelp_businesses_df

#### Exploring Yelp Businesses Response Data

In [None]:
yelp_businesses_df.info()

In [None]:
yelp_businesses_df['price'].value_counts(normalize=True)

In [None]:
yelp_businesses_df['rating'].value_counts(normalize=True)

In [None]:
yelp_businesses_df['categories'].value_counts(normalize=True)

In [None]:
# Duplicates?
yelp_businesses_df[yelp_businesses_df.duplicated(['id'], keep=False)].count()

In [None]:
yelp_businesses_df.describe()

In [None]:
yelp_businesses_df['review_count'].sum()

### Yelp Reviews

Now that every restaurant from the NYC DOHMH dataset has been looked up through the Yelp API and the responses have been concatenated, we can use the returned URL to gather reviews for each business.

In [None]:
# OLD from webscraping
# df_10_2 = df_10.loc[1000:2173]
# df_10_2.to_csv('df_10_2',index=False)

In [None]:
def get_text(url_list):
    """Given a list of URLs, iterate through the list and extract the text
    from the first page of reviews. The reviews for each business are joined
    into a single corpus string."""

    review_txt = []

    for url in url_list:
        req = requests.get(url, allow_redirects=False)
        # Naming the parser explicitly avoids bs4's "no parser specified" warning
        soup = bs(req.content, 'html.parser')
        # The class name is specific to Yelp's page markup at the time of scraping
        comments = soup.find_all(class_='raw__09f24__T4Ezm', lang="en")
        comment_txt = [comment.text for comment in comments]

        # One corpus string per business, in the same order as `url_list`
        review_txt.append('.'.join(comment_txt))
    return review_txt

Due to the time required to run this function, the work can be broken into smaller requests. Since we already have the smaller lists used when we called the API earlier, we can take the URLs returned from the API one batch at a time.

In [None]:
# Obtaining list of yelp business urls from saved API response
url_list = list(yelp_tables['df_1']['url'])
len(url_list)

In [None]:
# Calling `get_text` function to obtain Yelp reviews
if call_apis == True:
    review_text = get_text(url_list)

In [None]:
# Saving the Reviews to a csv in the repository
if call_apis == True:
    rvw_txt = pd.DataFrame(review_text,columns=['Review_Text'])
    rvw_txt.to_csv('rvw_txt1.csv',index=False)


Repeat this for every batch of URLs returned from the API.
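
The repetition could also be automated with a loop. The sketch below is one possible shape, not the code that produced the saved files: `save_reviews_by_batch` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for `get_text` so the example runs without network access, and the output file names follow the `rvw_txt<n>.csv` pattern used in this notebook.

```python
import csv
import os
import tempfile

def save_reviews_by_batch(url_batches, fetch, out_dir):
    """Scrape each batch with `fetch` and write one CSV per batch."""
    paths = []
    for n, urls in sorted(url_batches.items()):
        reviews = fetch(urls)
        path = os.path.join(out_dir, f'rvw_txt{n}.csv')
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['Review_Text'])               # header row
            writer.writerows([text] for text in reviews)   # one review corpus per row
        paths.append(path)
    return paths

# Stand-in for `get_text` so the sketch runs offline
def fake_fetch(urls):
    return [f'review text for {u}' for u in urls]

batches = {1: ['https://example.com/a', 'https://example.com/b'],
           2: ['https://example.com/c']}
out_dir = tempfile.mkdtemp()
paths = save_reviews_by_batch(batches, fake_fetch, out_dir)
print([os.path.basename(p) for p in paths])
```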

### Joining Yelp Reviews to Yelp Business Tables

In [None]:
# List of files containing Yelp review data
fpath = 'data/yelp_data/yelp_reviews/'
os.listdir(fpath)
query = fpath+"*.csv"
f_list = glob.glob(query)

In [None]:
# Append saved Yelp Reviews to a dict
rvw_tables = {}

for file in f_list:
    temp_df = pd.read_csv(file)
    fname = file.replace('data/yelp_data/yelp_reviews/rvw_txt','txt_').replace('.csv','')
    rvw_tables[fname] = temp_df

In [None]:
rvw_df_list = [t for t in list(rvw_tables.keys())]

In [None]:
rvw1 = pd.read_csv('data/yelp_data/yelp_reviews/rvw_txt1.csv')

In [None]:
yelp_tables.keys()

In [None]:
rvw_tables.keys()

In [None]:
# Adding review text to the Yelp Business tables (batches 1 through 12)
for n in range(1, 13):
    yelp_tables[f'df_{n}']['Reviews'] = rvw_tables[f'txt_{n}']

In [None]:
# Concatenating all Yelp businesses tables with review text included
yelp_df = pd.concat(yelp_tables,ignore_index=True)
yelp_df

In [None]:
# OLD Yelp Review API
# review_df = pd.DataFrame(response_json.get('reviews'))
# review_df

## Joining NYC DOHMH & Yelp Datasets

In [None]:
nyc_df.head()

In [None]:
nyc_df.info()

In [None]:
yelp_df.head()

In [None]:
yelp_df.info()

In [None]:
# Formatting Yelp phone numbers to align with NYC phone numbers to join on
yelp_df['phone'] = '+' + yelp_df['phone'].apply(str)

In [None]:
yelp_df.head()

In [None]:
# Merging NYC and Yelp datasets
df_1 = pd.merge(nyc_df, yelp_df, left_on='PHONE', right_on='phone', how='inner')
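
Before relying on the inner join, it can help to see how many rows on each side actually find a match. The following sketch uses made-up phone numbers and the `indicator` option of `pd.merge`; the toy frames only borrow the `PHONE` and `phone` column names from the real tables.

```python
import pandas as pd

# Hypothetical rows; only the key columns mirror the real tables
nyc = pd.DataFrame({'CAMIS': [1, 2, 3],
                    'PHONE': ['+12125550001', '+12125550002', '+12125550003']})
yelp = pd.DataFrame({'id': ['a', 'b'],
                     'phone': ['+12125550001', '+12125550009']})

# An outer join with indicator=True labels each row by where it was found
check = pd.merge(nyc, yelp, left_on='PHONE', right_on='phone',
                 how='outer', indicator=True)
matched = int((check['_merge'] == 'both').sum())
nyc_only = int((check['_merge'] == 'left_only').sum())
yelp_only = int((check['_merge'] == 'right_only').sum())

print(matched, nyc_only, yelp_only)
```

Only the rows labeled `both` would survive an inner join, so the other two counts show how much data the join discards.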

In [None]:
# Saving merged dataset to csv
# df_1.to_csv('full_dataset.csv')
df_1

In [None]:
df_1.info()

# Scrubbing The Data

In [None]:
# Get missing reviews
# Drop businesses missing reviews
# Drop unnecessary columns
# Check duplicates
# ??Feature Engineering
# Get coordinates
# Get takeout/delivery
# Get $$$$

In [None]:
# Getting missing reviews
missing_reviews = df_1[df_1.Reviews.isna()]
urls = missing_reviews['url']
len(list(urls))

In [None]:
# Re-running function from above with the list of urls missing reviews
if call_apis == True:
    adding_reviews = get_text(list(urls))

In [None]:
# Saving the review text as a csv to the repository
if call_apis == True:
    mssing_rvws = pd.DataFrame(adding_reviews,columns=['Review_Text'])
    mssing_rvws.to_csv('mssing_rvws.csv',index=False)

In [None]:
adding_reviews = pd.read_csv('mssing_rvws.csv')

In [None]:
missing_reviews2 = missing_reviews.copy()

In [None]:
missing_reviews2

In [None]:
urls_df = pd.DataFrame(urls)
urls_df.reset_index(inplace=True)

In [None]:
url_reviews_df = pd.concat([urls_df,adding_reviews],axis=1)

In [None]:
url_reviews_df

In [None]:
rvw_df = pd.read_csv('mssing_rvws.csv')
rvw_df

In [None]:
# Resetting the index so rows pair up by position with the scraped reviews
fill_in_missing_df = pd.concat([missing_reviews2.reset_index(drop=True),adding_reviews],axis=1)
fill_in_missing_df
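
Column-wise `pd.concat` pairs rows by index label rather than by position, which matters whenever one frame is a filtered subset and the other was freshly read from a CSV. A small sketch with hypothetical data illustrates the difference:

```python
import pandas as pd

# A filtered frame keeps its old index labels; a freshly read frame starts at 0
left = pd.DataFrame({'url': ['u1', 'u2']}, index=[7, 42])
right = pd.DataFrame({'Review_Text': ['r1', 'r2']})

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape)  # (4, 2)
print(aligned.shape)     # (2, 2)
```

In the misaligned result no URL sits next to its review, and every row contains a missing value.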

In [None]:
# fill_in_missing_df.rename(columns={'reviews':'Reviews'},inplace=True)

In [None]:
fill_in_missing_df['Review_Text'].isna().sum()

In [None]:
# Replacing the empty `Reviews` column with the newly scraped text
fill_in_missing_df.drop(columns=['Reviews'],inplace=True)
fill_in_missing_df.rename(columns={'Review_Text':'Reviews'},inplace=True)

In [None]:
# Appending businesses with new reviews to full dataset
df_2 = pd.concat([df_1,fill_in_missing_df])

In [None]:
# dropping old rows without reviews
df_2.dropna(subset=['Reviews'],inplace=True)

In [None]:
df_2['Reviews'].isna().sum()

In [None]:
df_2[df_2['Reviews'] == '']

There are still 220 businesses without reviews. That is not too bad, but we will try to extract these as well.

In [None]:
missing_rvws2 = df_2[df_2['Reviews'] == '']['url']

In [None]:
missing_rvws2

In [None]:
# Extracting the last reviews
# adding_reviews2 = get_text(list(missing_rvws2))

In [None]:
# # Saving the review text as a csv to the repository
# mssing_rvws2 = pd.DataFrame(adding_reviews2,columns=['Review_Text'])
# mssing_rvws2.to_csv('mssing_rvws2.csv',index=False)

In [None]:
mssing_rvws2 = pd.read_csv('mssing_rvws2.csv')

In [None]:
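# `missing_rvws2` holds the URLs still missing reviews; its index is reset to pair rows by position
fill_in_mssing_rvws_df2 = pd.concat([missing_rvws2.reset_index(drop=True),mssing_rvws2],axis=1,join='outer',ignore_index=True)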
fill_in_mssing_rvws_df2

In [None]:
fill_in_missing_df2 = pd.concat([missing_reviews2,adding_reviews],axis=1)
fill_in_missing_df

In [None]:
missing_reviews_2_df = df_2[df_2['Reviews'] == '']

In [None]:
missing_reviews_2_df.reset_index(inplace=True)
missing_reviews_2_df.drop(columns=['index'],inplace=True)

In [None]:
fill_in_missing_df2 = pd.concat([missing_reviews_2_df,mssing_rvws2],axis=1)
fill_in_missing_df2


In [None]:
fill_in_missing_df2.drop(columns=['Reviews'],inplace=True)
fill_in_missing_df2.rename(columns={'Review_Text':'Reviews'},inplace=True)
fill_in_missing_df2

In [None]:
# Appending businesses with newly gathered reviews to full dataset
df_3 = pd.concat([df_2,fill_in_missing_df2])

In [None]:
df_1

In [None]:
# Dropping old rows without reviews
df_3 = df_3[df_3.Reviews != '']
df_3

In [None]:
# Checking for duplicated rows
df_3[df_3.duplicated(keep='first')]

In [None]:
# Dropping Duplicated Rows
df_3.drop_duplicates(keep='first',inplace=True)

In [None]:
df_3

In [None]:
# Resetting the index
df_3.reset_index(inplace=True)
df_3.drop(columns='index',inplace=True)

In [None]:
df_3.info()

In [None]:
# Looking at the distributions for each Boro
df_3['BORO'].value_counts()

In [None]:
# Inspecting missing Boros
df_3[df_3['BORO'] == '0' ]

In [None]:
# Imputing missing BORO labels
# (a single .loc call avoids chained assignment, which may not modify df_3)
df_3.loc[6780:6781, 'BORO'] = 'Brooklyn'
df_3.loc[10182, 'BORO'] = 'Manhattan'

In [None]:
# Looking at Restaurants with missing zipcodes
df_3[df_3['ZIPCODE'].isna()]

In [None]:
# Boolean mask of restaurants with missing zipcodes
missing_zips = df_3['ZIPCODE'].isna()

In [None]:
# Converting location data from str to dict dtype
df_3['location'] = df_3['location'].apply(lambda x: ast.literal_eval(x))

In [None]:
# Confirming change
type(df_3.iloc[0]['location'])

In [None]:
# Replacing the NYC zipcodes (some of which are missing) with the Yelp value
df_3['Zip_code'] = [loc['zip_code'] for loc in df_3['location']]
df_3.drop(columns=['ZIPCODE'],inplace=True)

In [None]:
df_3['price'].value_counts()

In [None]:
# Converting all review text into string objects (assigning the result so the change persists)
df_3['Reviews'] = df_3['Reviews'].astype(str)

# Exploring The Dataset

In [None]:
# Checking for businesses still missing review data
df_1['Reviews'].isna().sum()

In [None]:
# Dropping businesses missing reviews
df_4 = df_1.dropna(subset=['Reviews'])

In [None]:
df_4['Reviews'].isna().sum()

In [None]:
# Check for duplicates
df_4.duplicated(keep='first').sum()

In [None]:
# drop duplicates
df_4.drop_duplicates(keep='first', inplace=True)

In [None]:
# Dropping unnecessary columns
eda_df = df_4.drop(columns=['display_phone','phone','location','coordinates',
                 'categories','is_closed','url','image_url','name','alias'])

In [None]:
eda_df['price'].value_counts()

In [None]:
eda_df['rating'].value_counts()

In [None]:
#Get the distribution of the ratings
x=eda_df['rating'].value_counts()
x=x.sort_index()
#plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Star Rating Distribution")
plt.ylabel('# of businesses', fontsize=12)
plt.xlabel('Star Ratings ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

In [None]:
fig = plt.figure()

In [None]:
eda_df['rating'].value_counts()

In [None]:
# `distplot` is deprecated in recent seaborn releases; `histplot` draws the same histogram
sns.histplot(eda_df.rating, kde=False)

In [None]:
cuisines = pd.DataFrame(eda_df['CUISINE DESCRIPTION'].value_counts())
cuisines.reset_index(inplace=True)
cuisines[0:10]

In [None]:
# What are the popular Cuisine categories?
cuisine_cats = eda_df['CUISINE DESCRIPTION']

x = cuisine_cats.value_counts()

print("There are ",len(x)," different types of cuisines in NYC")

#prep for chart
x = x.sort_values(ascending=False)
x = x.iloc[0:20]

#chart
plt.figure(figsize=(16,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("What are the top categories?",fontsize=25)
locs, labels = plt.xticks()
plt.setp(labels, rotation=80)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('Category', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

In [None]:
eda_df.columns

In [None]:
# Most reviewed restaurants
eda_df[['CAMIS','id','DBA', 'review_count', 'rating']].sort_values(ascending=False, by="review_count")[0:50]

In [None]:
# Look at all the features compared to the 0 class and 1 class

In [None]:
labels_df = eda_df['Severe']
labels_df.shape

In [None]:
fig, ax = plt.subplots(figsize=(9.2, 5))

# Number of restaurants, used to turn counts into proportions
n_obs = labels_df.shape[0]

(eda_df['Severe']
    .value_counts()
    .div(n_obs)
    .plot.barh(title="Proportion of Restaurants with Severe Violations", ax=ax)
)
ax.set_ylabel("Severe Violations")

fig.tight_layout()

In [None]:
counts = (eda_df[['BORO', 'Severe']]
              .groupby(['BORO', 'Severe'])
              .size()
              .unstack('Severe')
         )
counts

In [None]:
ax = counts.plot.barh()
ax.legend(
    loc='center right', 
    bbox_to_anchor=(1.3, 0.5), 
    title='Severe Violations'
);

In [None]:
severe_counts = counts.sum(axis='columns')
severe_counts

In [None]:
props = counts.div(severe_counts, axis='index')
props

In [None]:
# Prototyping Stack Barh plot
ax = props.plot.barh(stacked=True)
ax.set_title('Severe Violations by Boro')
ax.set_xlabel('Proportion of Restaurants w/Severe Violations ')
ax.legend(
    loc='center left', 
    bbox_to_anchor=(1.05, 0.5),
    title='Severe Violations'
);

In [None]:
def violation_rate_plot(col, target, data, ax=None):
    """Stacked bar chart of severe health and safety violations rate against 
    each feature of the data. 
    
    Args:
        col (string): column name of feature variable
        target (string): column name of target variable
        data (pandas DataFrame): dataframe that contains columns 
            `col` and `target`
        ax (matplotlib axes object, optional): matplotlib axes 
            object to attach plot to
    """
    counts = (data[[target, col]]
                  .groupby([target, col])
                  .size()
                  .unstack(target)
             )
    group_counts = counts.sum(axis='columns')
    props = counts.div(group_counts, axis='index')

    props.plot(kind="barh", stacked=True, ax=ax)

    ax.legend().remove()

In [None]:
eda_df.columns

In [None]:
# Loop through several columns and plot each against severe violations.

cols_to_plot = [
    'rating',
    'transactions',
    'price'
]

fig, ax = plt.subplots(
    len(cols_to_plot), figsize=(10,len(cols_to_plot)*2.5)
)
for idx, col in enumerate(cols_to_plot):
    
    violation_rate_plot(
        col, 'Severe', eda_df, ax=ax[idx]
    )
    

ax[0].legend(
    loc='lower center', bbox_to_anchor=(0.5, 1.05), title='Severe Violations'
)
fig.tight_layout()

In [None]:
# import street map
# https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm

street_map = gpd.read_file('data/Borough_Boundaries/geo_export_392103e7-13e2-43bf-aaf0-7e5c6a24b7b2.shp')


In [None]:
geo_eda = eda_df.copy()

In [None]:
geo_eda.dropna(axis=0, subset=['Latitude','Longitude'], inplace=True)

In [None]:
# Remove the most extreme 0.1% of latitudes and
# the most extreme 0.1% of longitudes (0.05% from each tail)

geo_eda = geo_eda[
 (geo_eda['Latitude'] >= np.percentile(geo_eda['Latitude'], 0.05)) & 
 (geo_eda['Latitude'] <= np.percentile(geo_eda['Latitude'], 99.95)) &
 (geo_eda['Longitude'] >= np.percentile(geo_eda['Longitude'], 0.05)) & 
 (geo_eda['Longitude'] <= np.percentile(geo_eda['Longitude'], 99.95))
]
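
The effect of percentile trimming is easier to see on a toy array. This sketch assumes made-up latitudes with a single bad `0.0` entry; the cut-offs of 0.05 and 99.95 match the cell above.

```python
import numpy as np

# Toy latitudes (hypothetical): 1,000 plausible NYC values plus one bad 0.0 entry
lat = np.append(np.linspace(40.5, 40.9, 1000), 0.0)

# Lower and upper cut-offs, 0.05% from each tail
low, high = np.percentile(lat, [0.05, 99.95])
trimmed = lat[(lat >= low) & (lat <= high)]

print(len(lat), len(trimmed))
print(trimmed.min())
```

Note that trimming by percentile also removes a few legitimate values at each tail, which is the price of not hard-coding a bounding box.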

In [None]:
geo_eda = geo_eda[geo_eda['Longitude'] != 0]

In [None]:
geo_eda = geo_eda[geo_eda['Latitude'] != 0]

In [None]:
print(geo_eda['Longitude'].max())
print(geo_eda['Longitude'].min())
print(geo_eda['Latitude'].max())
print(geo_eda['Latitude'].min())

In [None]:
# designate coordinate system (the {'init': ...} dict form is deprecated)
crs = 'EPSG:4326'
# zip x and y coordinates into single feature
geometry = [Point(xy) for xy in zip(geo_eda['Longitude'], geo_eda['Latitude'])]

In [None]:
# create GeoPandas dataframe
geo_df = gpd.GeoDataFrame(geo_eda, crs=crs, geometry = geometry)

In [None]:
# Plotting map
fig, ax = plt.subplots(figsize=(15,15))
street_map.plot(ax=ax, alpha=.5,color='gray')
geo_df[geo_df['Severe'] == 0].plot(ax=ax, alpha=.1, markersize=20,color='blue',label='Pass')
geo_df[geo_df['Severe'] == 1].plot(ax=ax, alpha=.1, markersize=20,color='red',label='Fail')
plt.legend(prop={'size':15});

# Preprocessing For Further EDA

### Let's find the most frequent words in negative reviews

We will find the most frequent words in the reviews to get an overview of why users gave low ratings. These words may point to the business attributes or services that users are most unhappy with. In the cells below, reviews are split by the `Severe` flag, so the comparison is between restaurants with and without a severe violation.



In [None]:
txt_eda = df_4.copy()

In [None]:
passed_df = txt_eda.loc[txt_eda['Severe']==0]
passed_df

In [None]:
failed_df = txt_eda.loc[txt_eda['Severe']==1]
failed_df

In [None]:
passed_corpus = passed_df['Reviews'].to_list()
passed_corpus[:5]

In [None]:
failed_corpus = failed_df['Reviews'].to_list()
failed_corpus[:5]

In [None]:
# Tokenizing the corpus of review text from restaurants that have severely failed their inspection
failed_tokens = word_tokenize(','.join(failed_corpus))


In [None]:
# Tokenizing the corpus of review text from restaurants that have not severely failed their inspection
passed_tokens = word_tokenize(','.join(passed_corpus))

In [None]:
# Importing English stop words
stopwords_list = stopwords.words('english')
# Add punctuation marks to the stopwords_list
stopwords_list.extend(string.punctuation)
additional_punc = ['“','”','...',"''",'’','``']
stopwords_list.extend(additional_punc)

In [None]:
# Removing stopwords from failed_corpus
stopped_failed_tokens = [w.lower() for w in failed_tokens if w.lower() not in stopwords_list]
stopped_failed_tokens[:10]

In [None]:
# Removing stopwords from passed_corpus
stopped_passed_tokens = [w.lower() for w in passed_tokens if w.lower() not in stopwords_list]
stopped_passed_tokens[:10]

In [None]:
# Creating FreqDist from stopped_failed_tokens
freq_failed = FreqDist(stopped_failed_tokens)
freq_failed.most_common(15)

In [None]:
# Creating FreqDist from stopped_passed_tokens
freq_passed = FreqDist(stopped_passed_tokens)
freq_passed.most_common(15)
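
Raw counts for the two groups tend to share the same top words (for example "food" or "good"), so it can help to compare each word's share of tokens across groups. The sketch below is one hedged way to do that with the standard library; `relative_freq`, `distinctive_words`, and the toy token lists are hypothetical stand-ins for `stopped_failed_tokens` and `stopped_passed_tokens`, not part of the original analysis.

```python
from collections import Counter

def relative_freq(tokens):
    """Map each token to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def distinctive_words(target, reference, top_n=5, floor=1e-6):
    """Rank words by how much more often they occur in target than in reference."""
    t_freq, r_freq = relative_freq(target), relative_freq(reference)
    # `floor` avoids division by zero for words absent from the reference
    ratios = {w: f / max(r_freq.get(w, 0.0), floor) for w, f in t_freq.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]

# Toy token lists standing in for the real stopped token lists
failed = ["food", "dirty", "food", "roach", "good", "dirty"]
passed = ["food", "good", "fresh", "good", "food", "clean"]
print(distinctive_words(failed, passed, top_n=2))
```

On real data a minimum count threshold would be worth adding, since words that appear once in one group and never in the other dominate a pure ratio.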

In [None]:
# Functionizing wordcloud generator
def wordcloud_generator(tokens, collocations=False, background_color='black', 
                       colormap='Reds', display=True):

    
    # Initialize a WordCloud
    wordcloud = WordCloud(collocations=collocations, 
                          background_color=background_color, 
                          colormap=colormap, 
                          width=500, height=300)

    # Generate wordcloud from tokens
    wordcloud.generate(','.join(tokens))

    # Plot with matplotlib
    if display:
        plt.figure(figsize = (12, 15), facecolor = None) 
        plt.imshow(wordcloud) 
        plt.axis('off');
        
    return wordcloud
    

In [None]:
# Generating a WordCloud for reviews from restaurants that have failed their inspection
failed_cloud = wordcloud_generator(stopped_failed_tokens, collocations=True)

In [None]:
# Generate a WordCloud from reviews of restaurants that have not severely failed an inspection
passed_cloud = wordcloud_generator(stopped_passed_tokens, colormap='Greens',
                                    collocations=True)

### Bigrams

In [None]:
# Bigrams from reviews of failed inspections
bigram_measures = nltk.collocations.BigramAssocMeasures()
review_finder = nltk.BigramCollocationFinder.from_words(stopped_failed_tokens)
reviews_scored = review_finder.score_ngrams(bigram_measures.raw_freq)

In [None]:
# df from the Bigrams
pd.DataFrame(reviews_scored, columns=["Word","Freq"]).head(10)

In [None]:
# Bigrams from reviews of restaurants that passed their inspections
bigram_measures = nltk.collocations.BigramAssocMeasures()
review_finder = nltk.BigramCollocationFinder.from_words(stopped_passed_tokens)
reviews_scored = review_finder.score_ngrams(bigram_measures.raw_freq)

In [None]:
# df from the Bigrams
pd.DataFrame(reviews_scored, columns=["Word","Freq"]).head(10)

In [None]:
# FREQ DIST PLOT FOR MOST FREQ WORDS IN PASSED AND FAILED RESTAURANTS

In [None]:
fdist = FreqDist(stopped_failed_tokens)
plt.figure(figsize=(10, 10))
fdist.plot(30);

In [None]:
nltk.download('vader_lexicon')

In [None]:
sid = SentimentIntensityAnalyzer()

In [None]:
a =  'This is an excellent restaurant!'

In [None]:
sid.polarity_scores(a)

In [None]:
df_5['price'].value_counts()

In [None]:
# Mapping prices to ordinal categories
price_dict={'$$$$': 4, '$$$': 3, '$$': 2, '$': 1}

df_5['prices'] = df_5['price'].map(price_dict)

In [None]:
# Imputing zeros for NaN values
df_5['prices'] = df_5['prices'].fillna(0)

In [None]:
transactions = df_5['transactions']

In [None]:
# Use the ast module to convert the string representations into lists
transactions = transactions.apply(ast.literal_eval)

In [None]:
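# Turn each list into a dict of {transaction_type: 1}
dcts = transactions.apply(lambda x: {c: 1 for c in x})

# Build a dataframe of dummy variables, reusing the original index so that
# the later pd.concat aligns rows correctly.
ohe_df = pd.DataFrame(dcts.tolist(), index=transactions.index).fillna(0)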


In [None]:
ohe_df
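
The list-to-dummies step is sensitive to index alignment: `pd.concat(..., axis=1)` matches rows by index label, so a dummy frame with a fresh `RangeIndex` can misalign against a frame whose index has gaps from earlier filtering. A small sketch of the same technique on a toy frame (the `toy` data and its index values are invented for illustration):

```python
import ast
import pandas as pd

# Toy frame with a non-default index, as left behind by earlier filtering
toy = pd.DataFrame(
    {"transactions": ["['delivery', 'pickup']", "[]", "['delivery']"]},
    index=[10, 20, 30],
)

# Parse the string representation of each list, then mark each entry with 1
lists = toy["transactions"].apply(ast.literal_eval)
dicts = lists.apply(lambda items: {name: 1 for name in items})

# Reusing the source index keeps pd.concat from misaligning rows
ohe = pd.DataFrame(dicts.tolist(), index=toy.index).fillna(0).astype(int)

combined = pd.concat([toy, ohe], axis=1)
print(combined)
```

Rows with an empty list simply end up with zeros in every dummy column.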

In [None]:
# Concatenate dummy variables with the full dataset
df_5 = pd.concat([df_5,ohe_df],axis=1)
df_5.head()

In [None]:
df_5.drop(columns=['price','transactions'],inplace=True)

# Preprocessing For Modeling

In [None]:
df_4.columns

In [None]:
# Dropping unnecessary columns
model_df = df_4.drop(columns=['CAMIS', 'DBA', 'CUISINE DESCRIPTION', 'BORO',
                               'BUILDING', 'STREET','ZIPCODE', 'PHONE',
                               'Latitude', 'Longitude', 'Community Board',
                               'Council District', 'Census Tract', 
                               'id', 'alias', 'name','image_url', 'is_closed',
                               'url', 'review_count', 'categories', 'rating',
                               'coordinates', 'transactions', 'price',
                               'location', 'phone','display_phone'])

In [None]:
model_df.head(3)

In [None]:
model_df['Reviews'].isna().sum()

In [None]:
# Checking for duplicates
model_df.duplicated(keep='first').sum()

In [None]:
# Dropping duplicates
model_df.drop_duplicates(keep='first', inplace=True)

# Confirming duplicates have been dropped

model_df.duplicated(keep='first').sum()

In [None]:
# Checking Class Balance
model_df['Severe'].value_counts()

In [None]:
# Make X and y
y = model_df['Severe'].copy()
X = model_df['Reviews'].copy()

In [None]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
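
When the classes are imbalanced, a plain random split can leave the train and test sets with slightly different class ratios. One option, shown here only as a sketch on synthetic labels (the 80/20 proportions are made up and do not reflect the actual `Severe` distribution), is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80% class 0, 20% class 1 (proportions are invented)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

In the notebook this would amount to passing `stratify=y` to the existing call; whether it changes the results materially depends on how imbalanced `Severe` is.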


In [None]:
# Using NLTK's Regular Expressions Tokenizer from nltk.tokenize
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')
tokenizer

In [None]:
# Creating a Count Vectorizer using the RE tokenizer's .tokenize method
vectorizer = CountVectorizer(lowercase=True, tokenizer=tokenizer.tokenize,
                            stop_words=stopwords_list)

In [None]:
# Vectorizing the data and saving X_train_bow and X_test_bow
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
X_train_bow

# Models

## Bag-of-Words Model



In [None]:
# Create and fit a Decision Tree Classifier
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=10)
dt.fit(X_train_bow,y_train)

In [None]:
# Extract predictions for the test set
y_hat_test = dt.predict(X_test_bow)


In [None]:
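def evaluate_model(y_test,y_hat_test,X_test,clf=None,
                  scoring=metrics.recall_score,verbose=False,
                  scorer=False,classes=('Passed','Failed')):
    """Quick/simple classification model evaluation.

    Prints a classification report, plots a row-normalized confusion
    matrix for `clf`, and optionally returns scoring(y_test, y_hat_test).
    """
    classes = list(classes)
    print(metrics.classification_report(y_test,y_hat_test,
                                        target_names=classes))
    
    # Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer
    # versions use metrics.ConfusionMatrixDisplay.from_estimator instead.
    metrics.plot_confusion_matrix(clf,X_test,y_test,normalize='true',
                                 cmap='Blues',display_labels=classes)
    plt.show()
    if verbose:
        print("MODEL PARAMETERS:")
        # Report the parameters of the model passed in, not the global `dt`
        print(pd.Series(clf.get_params()))
        
    if scorer:
        return scoring(y_test,y_hat_test)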

In [None]:
# Evaluating the model using function
evaluate_model(y_test,y_hat_test,X_test_bow,dt)
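
With `normalize='true'`, each row of the confusion matrix is divided by the number of true samples in that class, so the diagonal entries are the per-class recalls. The sketch below checks that relationship on hand-made labels (the `y_true` and `y_pred` lists are invented, with 1 standing for a severe violation):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Invented labels: 4 restaurants without and 6 with a severe violation
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes; normalize="true" makes each row sum to 1
cm = confusion_matrix(y_true, y_pred, normalize="true")
recall_failed = recall_score(y_true, y_pred, pos_label=1)

print(cm)
print("recall for class 1:", recall_failed)
```

For this project the bottom-right cell is arguably the one to watch, since it is the share of truly severe cases that the model flags.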

In [None]:
# Plot the top 30 most important features
with plt.style.context('seaborn-talk'):

    # Get feature importances, indexed by vocabulary term
    importance = pd.Series(dt.feature_importances_,index=vectorizer.get_feature_names())

    # Sort ascending, take the last 30, and plot as horizontal bars
    importance.sort_values().tail(30).plot(kind='barh')

In [None]:
count_vect = CountVectorizer()
tf_transform = TfidfTransformer(use_idf=True)

text_pipe = Pipeline(steps=[
    ('count_vectorizer',count_vect),
    ('tf_transformer',tf_transform)])

full_pipe = Pipeline(steps=[
    ('text_pipe',text_pipe),
    ('clf',DecisionTreeClassifier(class_weight='balanced'))
])
full_pipe

In [None]:

## Preview current X_train
X_train_pipe = text_pipe.fit_transform(X_train)
X_test_pipe = text_pipe.transform(X_test)
X_train_pipe


In [None]:
from sklearn import set_config

In [None]:
set_config(display='diagram')

full_pipe

In [None]:
from sklearn.model_selection import GridSearchCV
## Make a tokenizer with RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')
vectorizer = CountVectorizer(ngram_range=(1,2))
## Make params grid
#### use_idf: True/False
#### tokenizer: None, tokenizer.tokenize
#### stop_words: None, stopwords_list
#### criterion: gini, entropy

params = {'text_pipe__tf_transformer__use_idf':[True, False],
         'text_pipe__count_vectorizer__tokenizer':[None,tokenizer.tokenize],
         'text_pipe__count_vectorizer__stop_words':[None,stopwords_list],
         'clf__criterion':['gini', 'entropy']}

## Make and fit grid
grid = GridSearchCV(full_pipe,params,cv=3)
grid.fit(X_train,y_train)
## Display best params
grid.best_params_
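
The parameter names in the grid follow scikit-learn's `step__substep__parameter` convention for nested pipelines, and `GridSearchCV` ranks candidates by accuracy unless `scoring` is set. Since the goal here is to catch severe violations, ranking by recall may be a better fit. The following is a self-contained sketch on a toy corpus; `docs`, `labels`, `demo_pipe`, and `demo_grid` are hypothetical names, and the grid is deliberately smaller than the one above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy reviews; label 1 stands for a severe violation
docs = [
    "roaches near the dirty kitchen", "dirty tables and flies",
    "mice seen by the dirty counter", "dirty floor and bad smell",
    "fresh food and clean tables", "clean kitchen and friendly staff",
    "great fresh pasta", "clean and fresh and friendly",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same nesting as the notebook: a text pipeline inside a full pipeline
demo_pipe = Pipeline([
    ("text_pipe", Pipeline([
        ("count_vectorizer", CountVectorizer()),
        ("tf_transformer", TfidfTransformer()),
    ])),
    ("clf", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Keys walk down the nesting with double underscores
demo_params = {
    "text_pipe__tf_transformer__use_idf": [True, False],
    "clf__criterion": ["gini", "entropy"],
}

# scoring="recall" ranks candidates by recall on the positive class
demo_grid = GridSearchCV(demo_pipe, demo_params, cv=2, scoring="recall")
demo_grid.fit(docs, labels)
print(demo_grid.best_params_)
```

On a corpus this small the chosen parameters are not meaningful; the point is only the naming scheme and the `scoring` argument.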

In [None]:
# Count-vectorize unigrams and bigrams from the training reviews
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)",
                      stop_words=stopwords_list, ngram_range=(1,2),
                      min_df=2, max_df=25)
X_cv = vec.fit_transform(X_train)

df_cv = pd.DataFrame(X_cv.toarray(), columns=vec.get_feature_names())
df_cv

In [None]:
## Evaluate the best_estimator
best_pipe = grid.best_estimator_
y_hat_test = grid.predict(X_test)

In [None]:
evaluate_model(y_test,y_hat_test,X_test,best_pipe)

## Getting Feature Importances


In [None]:
X_train_pipe = text_pipe.fit_transform(X_train)
X_test_pipe = text_pipe.transform(X_test)
X_train_pipe

In [None]:
X_train_pipe.shape

In [None]:
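# Take the vocabulary from the vectorizer inside the tuned pipeline so the
# feature names line up with the fitted classifier's importances.
features = best_pipe.named_steps['text_pipe'].named_steps['count_vectorizer'].get_feature_names()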

In [None]:
len(features)

In [None]:
# Feature importances from the tuned decision tree
best_clf = best_pipe.named_steps['clf']
with plt.style.context('seaborn-talk'):
    importance = pd.Series(best_clf.feature_importances_,index=features)
    importance.sort_values(inplace=True)

    importance.tail(30).plot(kind='barh')

In [None]:
# For each of the 20 most important terms, compute the share of reviews
# containing that term that fall in each 'Severe' class.
top_word_probs = {}
for word in importance.tail(20).index:
    rows = model_df['Reviews'].str.contains(word,regex=False,case=False,na=False)
    val_count = model_df[rows]['Severe'].value_counts(normalize=True)
    top_word_probs[word] = val_count

In [None]:

top_probs = pd.DataFrame(top_word_probs).T
top_probs.style.background_gradient(axis=1)

In [None]:
### Function to evaluate a fitted classifier

def eval_clf(model, X_test_tf,y_test,cmap='Reds',
                            normalize='true',classes=('Passed', 'Failed'),figsize=(10,4),
                            X_train = None, y_train = None,):
    """Evaluates a scikit-learn binary classification model.

    Args:
        model (classifier): A fitted sklearn classification model.
        X_test_tf (Frame or Array): Test features.
        y_test (Series or Array): Test labels.
        cmap (str, optional): Colormap for the confusion matrix. Defaults to 'Reds'.
        normalize (str, optional): normalize argument for plot_confusion_matrix. Defaults to 'true'.
        classes (sequence, optional): Class names for display. Defaults to ('Passed', 'Failed').
        figsize (tuple, optional): Figure size. Defaults to (10,4).
        X_train (Frame or Array, optional): If provided, compare model.score for train and test. Defaults to None.
        y_train (Series or Array, optional): If provided, compare model.score for train and test. Defaults to None.
    """
    classes = list(classes)
    
    ## Get predictions and classification report
    y_hat_test = model.predict(X_test_tf)
    print(metrics.classification_report(y_test, y_hat_test,target_names=classes))
    

    ## Plot confusion matrix and ROC curve side by side
    fig,ax = plt.subplots(ncols=2,figsize=figsize)
    plt.grid(False)
    metrics.plot_confusion_matrix(model, X_test_tf,y_test,cmap=cmap, 
                                  normalize=normalize,display_labels=classes,
                                 ax=ax[0])
    for a in ax:
        a.grid(False)   
        
    curve = metrics.plot_roc_curve(model,X_test_tf,y_test,ax=ax[1])
    curve.ax_.grid()
    curve.ax_.plot([0,1],[0,1],ls=':')
    fig.tight_layout()
    plt.show()
    
    ## Add comparing Scores if X_train and y_train provided.
    if (X_train is not None) and (y_train is not None):
        print(f"Training Score = {model.score(X_train,y_train):.2f}")
        print(f"Test Score = {model.score(X_test_tf,y_test):.2f}")

In [None]:
## Creating baseline classifier model

base = DummyClassifier(strategy='stratified', random_state = 42)

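# The dummy classifier ignores the features, so the bag-of-words matrices
# built earlier are used here only to satisfy the fit/predict interface.
base.fit(X_train_bow, y_train)

eval_clf(base,X_test_bow,y_test,X_train=X_train_bow,y_train=y_train)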

## TF-IDF Models


In [None]:
## Make a Regular Expression Tokenizer from nltk.tokenize
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

In [None]:
## Make a TfIdf Vectorizer using the regexp tokenizer's .tokenize method
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize,
                             stop_words=stopwords_list)

# Vectorize data and make X_train_tfidf and X_test_tfidf
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
X_train_tfidf
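
To make the TF-IDF weights less of a black box, the sketch below recomputes one row by hand and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and L2 normalization per document); the three toy documents are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dirty kitchen dirty floor", "clean kitchen", "clean floor clean table"]

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_  # term -> column index

n_docs = len(docs)

def smooth_idf(term):
    """Smoothed inverse document frequency, as in scikit-learn's default."""
    df = sum(term in d.split() for d in docs)
    return np.log((1 + n_docs) / (1 + df)) + 1

# Accumulating the idf once per occurrence gives tf * idf for each term
raw = np.zeros(len(vocab))
for term in docs[0].split():
    raw[vocab[term]] += smooth_idf(term)

# L2-normalize so the document vector has unit length
manual = raw / np.linalg.norm(raw)

print(np.round(manual, 3))
print(np.round(weights[0], 3))
```

The custom tokenizer and stop-word list used in the notebook change which terms enter the vocabulary, but not the weighting scheme itself.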


In [None]:
tf_vec = TfidfVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", stop_words=stopwords_list)
X = tf_vec.fit_transform(X_train)

df = pd.DataFrame(X.toarray(), columns = tf_vec.get_feature_names())
df.head()

In [None]:
df.iloc[33].sort_values(ascending=False)[:10]

In [None]:
# Comparing to the CountVectorizer
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", stop_words=stopwords_list)
X = vec.fit_transform(X_train)

df_cv = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
df_cv

In [None]:
df_cv.iloc[33].sort_values(ascending=False)[:10]

In [None]:
tf_vec.vocabulary_

### Modeling Baseline

In [None]:
## Make and fit a decision tree  (class_weight='balanced')
dt = DecisionTreeClassifier(max_depth=6, class_weight='balanced')
dt.fit(X_train_tfidf,y_train)

In [None]:
def evaluate_classification(model, X_test_tf,y_test,cmap='Greens',
                            normalize='true',classes=None,figsize=(10,4),
                            X_train = None, y_train = None,):
    """Evaluates a scikit-learn binary classification model.

    Args:
        model (classifier): any sklearn classification model.
        X_test_tf (Frame or Array): X data
        y_test (Series or Array): y data
        cmap (str, optional): Colormap for confusion matrix. Defaults to 'Greens'.
        normalize (str, optional): normalize argument for plot_confusion_matrix. 
                                    Defaults to 'true'.
        classes (list, optional): List of class names for display. Defaults to None.
        figsize (tuple, optional): figure size. Defaults to (10,4).
        
        X_train (Frame or Array, optional): If provided, compare model.score 
                                for train and test. Defaults to None.
        y_train (Series or Array, optional): If provided, compare model.score 
                                for train and test. Defaults to None.
    """
    
    ## Get Predictions and Classification Report
    y_hat_test = model.predict(X_test_tf)
    print(metrics.classification_report(y_test, y_hat_test,target_names=classes))
    
    ## Plot confusion matrix and ROC curve
    fig,ax = plt.subplots(ncols=2, figsize=figsize)
    metrics.plot_confusion_matrix(model, X_test_tf,y_test,cmap=cmap, 
                                  normalize=normalize,display_labels=classes,
                                 ax=ax[0])
    
    ## If the ROC curve errors, delete the second axis
    try:
        curve = metrics.plot_roc_curve(model,X_test_tf,y_test,ax=ax[1])
        curve.ax_.grid()
        curve.ax_.plot([0,1],[0,1],ls=':')
        fig.tight_layout()
    except Exception:
        fig.delaxes(ax[1])
    plt.show()
    
    ## Add comparing Scores if X_train and y_train provided.
    if (X_train is not None) and (y_train is not None):
        print(f"Training Score = {model.score(X_train,y_train):.2f}")
        print(f"Test Score = {model.score(X_test_tf,y_test):.2f}")
        
    
def plot_importance(tree, X_train_df, top_n=20,figsize=(10,10)):
    
    df_importance = pd.Series(tree.feature_importances_,
                              index=X_train_df.columns)
    df_importance.sort_values(ascending=True).tail(top_n).plot(
        kind='barh',figsize=figsize,title='Feature Importances',
    ylabel='Feature',)
    return df_importance

In [None]:
## Evaluate Model using function
evaluate_classification(dt,X_test_tfidf,y_test,X_train=X_train_tfidf,y_train=y_train)

In [None]:
# Plot the top 30 most important features
with plt.style.context('seaborn-talk'):

    ## Get Feature Importance
    importance = pd.Series(dt.feature_importances_,
                           index=vectorizer.get_feature_names())

    ## Take the .tail 30 and plot kind='barh'
    importance.sort_values().tail(30).plot(kind='barh')

In [None]:
## Make a text preprocessing pipeline
text_pipe = Pipeline(steps=[
    ('count_vectorizer',CountVectorizer()),
    ('tf_transformer',TfidfTransformer(use_idf=True))
])
text_pipe

In [None]:
## Test out the text pipeline on X_train
X_train_pipe = text_pipe.fit_transform(X_train)
X_test_pipe = text_pipe.transform(X_test)
X_train_pipe

In [None]:
## Make a full pipeline with the decision tree model as the second step
full_pipe = Pipeline([('text_pipe',text_pipe),
                     ('clf',DecisionTreeClassifier(max_depth=6,class_weight='balanced'))])
full_pipe

In [None]:
## Modeling with full pipeline
full_pipe.fit(X_train,y_train)
evaluate_classification(full_pipe,X_test,y_test,X_train=X_train, y_train=y_train)

In [None]:
# DTREE Model

In [None]:
# SVM Model

##  RNN Model

In [None]:
# All lowercase
# No stopword removal
# No stemming/lemma

##  Deep NLP Models


In [1]:
from gensim.models import Word2Vec
from nltk import word_tokenize

In [None]:
# TO-DO
# Deep NLP
# Train on word2vec /glove model
# LSTM or GRU layers
# Sequential – keras models 
# Blackbox models (lime)
# #

## Interpreting LIME

In [3]:
# https://www.kdnuggets.com/2022/01/explain-nlp-models-lime.html

In [2]:
# for LIME import necessary packages
from lime import lime_text
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from lime.lime_text import IndexedString,IndexedCharacters
from lime.lime_base import LimeBase
from sklearn.linear_model import Ridge, lars_path
from lime.lime_text import explanation
from functools import partial
import scipy as sp
from sklearn.utils import check_random_state



# Interpreting Results

In [None]:
from functools import reduce
from IPython import display

# get the names of the features
feature_names = np.array(vec.get_feature_names())

def get_top_features(features, model, level, limit, bottom=False):
    """ Get the top (most likely to see violations) and bottom (least
        likely to see violations) features for a given model.
        
        :param features: an array of the feature names
        :param model: a fitted linear regression model
        :param level: 0, 1, 2 for *, **, *** violation levels
        :param limit: how many features to return
        :param bottom: if we want the bottom features rather than the top 
    """
    # sort order for the coefficients of the requested violation level
    sorted_coeffs = np.argsort(model.coef_[level])
    
    if bottom:
        # get the features at the end of the sorted list
        return features[sorted_coeffs[-1 * limit:]]
    else:
        # get the features at the beginning of the sorted list
        return features[sorted_coeffs[:limit]]
    
# get the features that indicate we are most and least likely to see violations
worst_feature_sets = [get_top_features(feature_names, ols, i, 100) for i in range(3)]
best_feature_sets = [get_top_features(feature_names, ols, i, 100, bottom=True) for i in range(3)]

# reduce the independent feature sets to just the ones
# that we see in common across the per-level models (*, **, ***)
worst = reduce(np.intersect1d, best_feature_sets)
best = reduce(np.intersect1d, worst_feature_sets)

# display as a pretty table
html_fmt = "<table><th>More Violations</th><th>Fewer Violations</th><tbody>{}</tbody></table>"
table_rows = ["<tr><td>{}</td><td>{}</td></tr>".format(w, b) for w, b in zip(worst, best)]
table_body = "\n".join(table_rows)
display.HTML(html_fmt.format(table_body))

# Conclusions

## Best Model Results

## Takeaways and Recommended Actions

##  Next Steps and Future Work