<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Making Predictions Using FourSquare and DOHMH Data with Python</font></h1>

## Introduction

In this work, I used the NYC DOHMH restaurant inspection report to obtain the inspection score and critical flag (Y/N) for Italian restaurants in Manhattan. The critical flag indicates whether critical violations were found. The report also includes the latitude and longitude of each restaurant. Using these coordinates, I looked up customer ratings in the Foursquare database. After collecting the ratings, I combined the datasets and built a model using logistic regression. The model's evaluation results are reported in the last section. To be clear, my goal is to use the inspection score and customer ratings to predict whether a critical violation occurs in the restaurants of interest.

## Table of Contents

1. <a href="#item1">Data Preparation</a>
2. <a href="#item2">Search Ratings through Foursquare API</a>  
3. <a href="#item3">Model Using Logistic Regression</a>  
4. <a href="#item4">Evaluation Results</a>  

## 1. Data Preparation
### Import necessary Libraries

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a json file into a pandas dataframe
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions

#!conda install -c conda-forge folium=0.5.0 --yes
#import folium # plotting library

#print('Folium installed')
#print('Libraries imported.')


### Define Foursquare Credentials and Version

##### Make sure that you have created a Foursquare developer account and have your credentials handy

In [2]:
CLIENT_ID = 'MQXS0P5VD545NDHRISOYUIVGKI4TCLDFMYKTXE3TTGPXMBPW' # your Foursquare ID
CLIENT_SECRET = 'RLYYI0A1IKQXLI3FEICSI3TGQ5OAUFPIYT1EURKOFMPIIGTP' # your Foursquare Secret
VERSION = '20200515'
LIMIT = 2
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MQXS0P5VD545NDHRISOYUIVGKI4TCLDFMYKTXE3TTGPXMBPW
CLIENT_SECRET: RLYYI0A1IKQXLI3FEICSI3TGQ5OAUFPIYT1EURKOFMPIIGTP
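
Hard-coding and printing credentials exposes them to anyone who reads the notebook. A minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are hypothetical and not part of the original notebook.

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables (hypothetical names)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters so the secret is never printed in full."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Demo with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'abcd1234',
            'FOURSQUARE_CLIENT_SECRET': 'wxyz5678'}
demo_id, demo_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```

In the notebook itself, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the two hard-coded assignments.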


#### Read the Inspection Data Report from the New York City DOHMH

In [100]:
df=pd.read_csv('ItalianRestaurant_Inspection300_NYC.csv')
df.head()

|   | BORO | ZIPCODE | CUISINE DESCRIPTION | Critical Flag | Score | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | Queens | 11103 | Italian | Y | 96 | 40.764675 | -73.911974 |
| 1 | Bronx | 10451 | Italian | N | 85 | 40.8193 | -73.926971 |
| 2 | Manhattan | 10013 | Italian | Y | 71 | 40.717778 | -73.998149 |
| 3 | Manhattan | 10014 | Italian | Y | 59 | 40.732186 | -74.001537 |
| 4 | Manhattan | 10011 | Italian | N | 57 | 40.733953 | -73.998586 |


In [5]:
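# Remove records with a Score of 50 or more; they are rare and likely a mistake or typo
df.drop(df[df['Score'] >= 50].index, inplace=True)

# Recalculate the score, treating 50 as the maximum
df['Score'] = df['Score'].apply(lambda x: 50 - x)

# Encode the critical flag as 1 (Y) / 0 (N); assumed step, inferred from the 0/1 values in the output below
df['Critical Flag'] = df['Critical Flag'].map({'Y': 1, 'N': 0})

df.sort_values(['ZIPCODE'], ascending=True, inplace=True)

df.reset_index(inplace=True, drop=True)

# Collect the coordinates used for the Foursquare lookups
Lat = df['Latitude'].tolist()
Lon = df['Longitude'].tolist()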

print("Total rows:",len(Lat))
df.head()

Total rows: 282


|   | BORO | ZIPCODE | CUISINE DESCRIPTION | Critical Flag | Score | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | Manhattan | 10001 | Italian | 1 | 38 | 40.748948 | -73.995806 |
| 1 | Manhattan | 10002 | Italian | 1 | 27 | 40.716079 | -73.98719 |
| 2 | Manhattan | 10002 | Italian | 0 | 38 | 40.721164 | -73.983993 |
| 3 | Manhattan | 10002 | Italian | 0 | 23 | 40.716079 | -73.98719 |
| 4 | Manhattan | 10002 | Italian | 0 | 37 | 40.72047 | -73.989015 |
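
The score handling in the cell above can be illustrated on a few values. This is a small sketch in plain Python, assuming the same cut-off of 50. `rescale_scores` is a hypothetical helper name. The first input list holds the five scores shown in the raw table, and the second list is made up for illustration.

```python
def rescale_scores(scores, max_score=50):
    """Drop scores at or above max_score, then subtract the rest from max_score."""
    kept = [s for s in scores if s < max_score]
    return [max_score - s for s in kept]

raw_scores = [96, 85, 71, 59, 57]        # first five rows of the raw report
print(rescale_scores(raw_scores))        # → [] (every value is 50 or more)
print(rescale_scores([12, 23, 13, 49]))  # → [38, 27, 37, 1]
```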


<a id="item1"></a>

## 2. Search Ratings through Foursquare API

#### Define the corresponding URL and grab the ratings from Foursquare database based on latitude and longitude info
> `https://api.foursquare.com/v2/venues/`**search**`?client_id=`**CLIENT_ID**`&client_secret=`**CLIENT_SECRET**`&ll=`**LATITUDE**`,`**LONGITUDE**`&v=`**VERSION**`&query=`**QUERY**`&radius=`**RADIUS**`&limit=`**LIMIT**
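
The URL pattern above can be assembled with the standard library so that spaces and commas in the parameters are escaped. This is a sketch only: `build_search_url` is a hypothetical helper, the credentials are placeholders, and the coordinates are taken from the first row of the cleaned table.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, latitude, longitude,
                     version, query, radius, limit):
    """Assemble the venue search URL; urlencode escapes spaces and commas."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(latitude, longitude),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials, not real ones
url = build_search_url('ID', 'SECRET', 40.748948, -73.995806,
                       '20200515', 'Italian', 100, 2)
print(url)
```

Passing the same dictionary as `params=` to `requests.get` achieves the same escaping without building the string by hand.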

In [103]:
search_query = 'Italian '
radius = 100
Ratings = []
print(search_query + ' .... OK!')

Italian  .... OK!


In [102]:
# Be careful with the API call limit: each restaurant triggers up to two requests

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


for i in range(len(Lat)):

    BackupRatings = list(Ratings)  # copy, so the backup is not just an alias of Ratings
    
    latitude = Lat[i]
    longitude = Lon[i]
    
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
    #url
    results = requests.get(url).json()
    #results
    
    # assign relevant part of JSON to venues
    venues = results['response']['venues']

    if venues ==  []:
        print('Break at i', i)
        Ratings.append(0)
    else:
        # transform venues into a dataframe
        dataframe = json_normalize(venues)

        # keep only columns that include venue name, and anything that is associated with location
        filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
        dataframe_filtered = dataframe.loc[:, filtered_columns].copy()

        # filter the category for each row
        dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

        # clean column names by keeping only last term
        dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
    
    
        dataframe_filtered1 = dataframe_filtered[dataframe_filtered.categories != "Office"]
        #dataframe_filtered1.id[0]
        #print(dataframe_filtered1)
    
        RowCount = len(dataframe_filtered1.index)
        #print('RowCount', RowCount)
        
        if RowCount == 0:
            Ratings.append(0)
        else:
            venue_id = dataframe_filtered1.iloc[0]['id']
            try:
                url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
                result = requests.get(url).json()
                #print(result)

                # a venue without ratings has no 'rating' key; record 0 in that case
                Ratings.append(result['response']['venue'].get('rating', 0))
            except (KeyError, ValueError, requests.RequestException):
                # unexpected response body or failed request: record 0 as "no rating"
                Ratings.append(0)
        #print(Ratings)
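
The rating lookup at the end of the loop can be isolated into a small function that is easy to check without calling the API. This is a sketch under the assumption that the venue details response nests the rating under `response` → `venue` → `rating`, as the loop above reads it. `extract_rating` is a hypothetical helper, and the sample dictionaries are hand-written, not real API output.

```python
def extract_rating(result, default=0):
    """Return the venue rating from a details response, or default if absent."""
    venue = result.get('response', {}).get('venue', {})
    rating = venue.get('rating')
    # Treat a missing or non-numeric rating as "no rating"
    if isinstance(rating, (int, float)):
        return rating
    return default

print(extract_rating({'response': {'venue': {'rating': 8.4}}}))  # → 8.4
print(extract_rating({'response': {'venue': {}}}))               # → 0
print(extract_rating({'meta': {'code': 429}}))                   # → 0
```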


#### Add rating values to the dataframe and save to the csv file

In [68]:
#add Ratings to the dataframe and save to the csv file
df['Ratings'] = Ratings
df.to_csv("ResultsWitRatings.csv", index=False)
df.head()

|   | BORO | ZIPCODE | CUISINE DESCRIPTION | Critical Flag | Score | Latitude | Longitude | Ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | 10001 | Italian | 1 | 38 | 40.748948 | -73.995806 | 8.4 |
| 1 | Manhattan | 10002 | Italian | 1 | 27 | 40.716079 | -73.98719 | 6.0 |
| 2 | Manhattan | 10002 | Italian | 0 | 38 | 40.721164 | -73.983993 | 9.1 |
| 3 | Manhattan | 10002 | Italian | 0 | 23 | 40.716079 | -73.98719 | 8.6 |
| 4 | Manhattan | 10002 | Italian | 0 | 37 | 40.72047 | -73.989015 | 8.8 |


## 3. Model Using Logistic Regression

In [71]:
df1 = pd.read_csv("ResultsWitRatings.csv")
df1.head()

|   | BORO | ZIPCODE | CUISINE DESCRIPTION | Critical Flag | Score | Latitude | Longitude | Ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | 10001 | Italian | 1 | 38 | 40.748948 | -73.995806 | 8.4 |
| 1 | Manhattan | 10002 | Italian | 1 | 27 | 40.716079 | -73.98719 | 6.0 |
| 2 | Manhattan | 10002 | Italian | 0 | 38 | 40.721164 | -73.983993 | 9.1 |
| 3 | Manhattan | 10002 | Italian | 0 | 23 | 40.716079 | -73.98719 | 8.6 |
| 4 | Manhattan | 10002 | Italian | 0 | 37 | 40.72047 | -73.989015 | 8.8 |


In [73]:
#Remove the records with Ratings = 0.0
df1.drop(df1[df1['Ratings'] == 0].index, inplace= True)
df1.head()

|   | BORO | ZIPCODE | CUISINE DESCRIPTION | Critical Flag | Score | Latitude | Longitude | Ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | 10001 | Italian | 1 | 38 | 40.748948 | -73.995806 | 8.4 |
| 1 | Manhattan | 10002 | Italian | 1 | 27 | 40.716079 | -73.98719 | 6.0 |
| 2 | Manhattan | 10002 | Italian | 0 | 38 | 40.721164 | -73.983993 | 9.1 |
| 3 | Manhattan | 10002 | Italian | 0 | 23 | 40.716079 | -73.98719 | 8.6 |
| 4 | Manhattan | 10002 | Italian | 0 | 37 | 40.72047 | -73.989015 | 8.8 |


In [76]:
feature = df1[['Score','Ratings']]
feature.head()

|   | Score | Ratings |
|---|---|---|
| 0 | 38 | 8.4 |
| 1 | 27 | 6.0 |
| 2 | 38 | 9.1 |
| 3 | 23 | 8.6 |
| 4 | 37 | 8.8 |


In [78]:
# Feature selection
X = feature
y = df1['Critical Flag'].values

# Preprocess the data: standardize each feature to zero mean and unit variance
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)


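
To make the preprocessing step concrete, the sketch below reproduces what standardization does to one column: subtract the mean and divide by the population standard deviation, which is the convention `StandardScaler` uses. It runs on only the five `Score` values shown above, so the numbers differ from those the notebook obtains on the full dataset. `standardize` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

scores = [38, 27, 38, 23, 37]   # Score column of the five rows shown above
z = standardize(scores)
print([round(v, 2) for v in z])
```

After scaling, `Score` and `Ratings` are on comparable scales, so the regularization penalty in the next step treats both features evenly.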


In [94]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=4)
# Create and fit the logistic regression model (a small C means strong regularization)
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
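
For readers unfamiliar with the model, a fitted logistic regression turns a weighted sum of the features into a probability through the logistic (sigmoid) function. The sketch below shows that calculation for one restaurant. The weights and bias are hypothetical values chosen for illustration, not the coefficients fitted in this notebook (those are available as `LR.coef_` and `LR.intercept_`).

```python
import math

def sigmoid(t):
    """Logistic function mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(x, weights, bias):
    """Probability of class 1 for one standardized feature vector x."""
    t = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(t)

# Hypothetical coefficients, NOT the values fitted in the notebook
weights = [-0.5, -0.2]   # for standardized Score and Ratings
bias = 0.1
p = predict_proba([0.85, -0.3], weights, bias)
print(round(p, 3), 'critical' if p >= 0.5 else 'not critical')
```

With `C=0.01` the penalty on the weights is strong, which pulls them toward zero and the predicted probabilities toward the middle of the range.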

## 4. Evaluation Results

In [101]:
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
yhat_LR = LR.predict(X_test)
print("Critical Flag LR Model's Accuracy: ", metrics.accuracy_score(y_test, yhat_LR))
print("Critical Flag LR Model's F1-score: ", f1_score(y_test, yhat_LR, average='weighted')) 
yhat_LRprob = LR.predict_proba(X_test)
print("Critical Flag LR Model's Log Loss: ", log_loss(y_test, yhat_LRprob)) 

Critical Flag LR Model's Accuracy:  0.7142857142857143
Critical Flag LR Model's F1-score:  0.7142857142857143
Critical Flag LR Model's Log Loss:  0.6879167051612881
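
The metrics above can be reproduced by hand, which helps when interpreting them. The sketch below computes accuracy and binary log loss in plain Python on made-up labels and probabilities (not the notebook's test set). It also shows the log loss of a model that outputs 0.5 for every row, which is ln 2 ≈ 0.6931. The reported log loss of about 0.688 is only slightly below that baseline, which suggests the predicted probabilities stay close to 0.5.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_log_loss(y_true, prob_of_one, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, prob_of_one):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and probabilities, for illustration only
y_true = [1, 0, 1, 1, 0]
probs = [0.6, 0.4, 0.45, 0.7, 0.55]
y_pred = [1 if p >= 0.5 else 0 for p in probs]
print('accuracy:', accuracy(y_true, y_pred))
print('log loss:', round(binary_log_loss(y_true, probs), 4))
print('baseline:', round(binary_log_loss(y_true, [0.5] * 5), 4))
```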


<a id="item2"></a>

### Thank You for Reviewing My Work!