# Yelp Restaurant Rating Prediction

## Project Description
The goal of this project is to predict the overall Yelp ratings of
restaurants in New York City from features that we will extract using
the Yelp API. Our motivation is to help estimate how successful a new
restaurant business might be, given certain known characteristics.

We will use the Yelp API, with the help of pandas, to acquire raw
restaurant data. From the parsed data we will then extract reasonable
features such as location, opening hours, and whether the restaurant
takes reservations, offers delivery, has parking space, or provides
free wifi. We will combine these features with the overall ratings,
numerical values ranging from 0 to 5, which serve as labels.

We will model the rating distribution over the features that we extract
and create, and analyse how much each feature shifts that distribution.
Based on these results we will select good features on which to train
the machine learning models.

Using the labeled features that we construct, we will train different
machine learning models, such as linear regression, nonlinear regression,
logistic regression and neural networks, then make predictions and
compare the accuracy of the models.

## Team Members
Jun Hee Kim, Nikhil Rangarajan, Sander Shi

## Procedure
* [Data Gathering from API](#step-1)
* [Feature Extraction with Parsing](#step-2)
* [Feature Analysis and Variable Selection](#step-3)
* [Setup of Models](#step-4)
* [Cross Validation](#step-5)
* [Final Analysis](#step-6)

## Part 0: Imports and Definitions of Constants

We will be using `pandas` to parse the data, `requests` to call the Yelp API that provides the data, and `tensorflow` to construct the machine learning models.

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import requests

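# Base URL of the Yelp API and its business search endpoint
API_URL = "https://api.yelp.com/v3/businesses"
SEARCH_URL = API_URL + "/search"
# Path to a local file whose first line holds the API key (not the key itself)
API_KEY = "./API_KEY"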

<a id="step-1"></a>

## Part 1: Data Gathering from API

In this step we will send business search requests to the Yelp API and
collect the restaurant records that it returns.

In [23]:
def find_restaurants(url, api_key_url):
    """
    This function loads restaurant data for restaurants in New York City.
    
    @input url: The API url.
    @type url: String.
    
    @input api_key_url: The path to the file containing the API key.
    @type api_key_url: String.
    
    @return: A Pandas DataFrame with one row per restaurant in the response.
    @rtype: pandas.DataFrame.
    """
    # Retrieve API key
    with open(api_key_url, 'r') as f:
        api_key = f.readline().strip()
        
    # Set request header and params for search query
    headers = {
        'Authorization': ' '.join(['Bearer', api_key])
    }
    params = {
        'term': 'restaurants',
        'location': 'NYC',
        'limit': 10,
        'offset': 990
    }
    response = requests.get(url=url, headers=headers, params=params)
    # Fail early on HTTP errors instead of parsing an error body
    response.raise_for_status()
    content = response.json()
    # Print a short summary of the response
    print(content['total'])
    print(len(content['businesses']))
    print(content['businesses'][0])
    return pd.DataFrame(content['businesses'])

restaurants = find_restaurants(SEARCH_URL, API_KEY)

18800
50
{'id': '7K9WGGP9SOUxdlJ4ozFy8A', 'alias': 'meet-the-meat-astoria', 'name': 'Meet The Meat', 'image_url': 'https://s3-media4.fl.yelpcdn.com/bphoto/2K9gkRDbM4bOlQSy-Ux20A/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/meet-the-meat-astoria?adjust_creative=ZAj1bx8bJHOcikvnMXVxEg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=ZAj1bx8bJHOcikvnMXVxEg', 'review_count': 213, 'categories': [{'alias': 'steak', 'title': 'Steakhouses'}], 'rating': 4.5, 'coordinates': {'latitude': 40.7769295040115, 'longitude': -73.9214426951538}, 'transactions': ['pickup', 'restaurant_reservation'], 'price': '$$$', 'location': {'address1': '2392 21st St', 'address2': '', 'address3': '', 'city': 'Astoria', 'zip_code': '11105', 'country': 'US', 'state': 'NY', 'display_address': ['2392 21st St', 'Astoria, NY 11105']}, 'phone': '+19178327984', 'display_phone': '(917) 832-7984', 'distance': 10032.108120752231}
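
The request above reads a single page of search results. A minimal sketch of collecting several pages follows, assuming the search endpoint accepts the `limit` and `offset` parameters used above. The helper names `collect_pages` and `fetch_page` are hypothetical, and the page size of 50 and the cap of 1000 results are assumptions about the API's limits rather than confirmed values.

```python
def collect_pages(fetch_page, page_size=50, max_results=1000):
    """Collect businesses page by page until the results run out.

    fetch_page(limit, offset) is a hypothetical callable that performs one
    search request and returns the decoded JSON as a dict with the keys
    'businesses' and 'total', as in the response printed above.
    page_size and max_results are assumed API limits, not confirmed values.
    """
    businesses = []
    offset = 0
    while offset < max_results:
        # Never ask for more than the remaining allowance
        limit = min(page_size, max_results - offset)
        content = fetch_page(limit, offset)
        page = content.get('businesses', [])
        if not page:
            break  # the API returned an empty page
        businesses.extend(page)
        offset += len(page)
        if offset >= content.get('total', 0):
            break  # every available result has been read
    return businesses
```

Passing the request as a callable keeps the paging logic independent of the network, so it can be checked with a fake `fetch_page` before any API quota is spent. In the notebook, `fetch_page` could wrap the `requests.get` call from `find_restaurants`.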


<a id="step-2"></a>

## Part 2: Feature Extraction with Parsing

In this step we will parse the raw data we got from the API and convert it into
discrete and continuous feature values.

In [3]:
def extract_features(raw_data):
    """
    This function extracts the features from raw restaurant data.
    """
    pass

labeled_features = extract_features(restaurants)
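
As a rough sketch of what the extraction could look like, the function below flattens one raw business record into numeric features. It uses only fields that appear in the sample response in Part 1. The function name `extract_row`, the feature names, and the `'delivery'` transaction value are assumptions made for illustration, not part of the project code.

```python
def extract_row(business):
    """Turn one raw business dict into a flat dict of features plus label."""
    transactions = business.get('transactions', [])
    coordinates = business.get('coordinates') or {}
    price = business.get('price')  # e.g. '$$$'; may be absent from a record
    return {
        # Number of dollar signs as an ordinal price level
        'price_level': len(price) if price else None,
        'review_count': business.get('review_count', 0),
        'latitude': coordinates.get('latitude'),
        'longitude': coordinates.get('longitude'),
        # Binary indicators derived from the 'transactions' list
        'takes_reservations': int('restaurant_reservation' in transactions),
        'offers_pickup': int('pickup' in transactions),
        'offers_delivery': int('delivery' in transactions),  # assumed value
        # The label we want to predict
        'rating': business.get('rating'),
    }

# A trimmed copy of the sample record printed in Part 1
sample = {
    'review_count': 213,
    'rating': 4.5,
    'coordinates': {'latitude': 40.7769295040115,
                    'longitude': -73.9214426951538},
    'transactions': ['pickup', 'restaurant_reservation'],
    'price': '$$$',
}
row = extract_row(sample)
```

A list of such rows can be passed directly to `pd.DataFrame` to obtain one column per feature.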

<a id="step-3"></a>

## Part 3: Feature Analysis and Variable Selection

Now that we have the features, we can separately analyse them and see how each feature
contributes to the rating.

In [4]:
# TODO: Jun Hee's Feature Analysis
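
One simple way to see how a feature shifts the rating, sketched here under assumptions, is to compare the mean rating of each feature value. The helper `mean_rating_by` and the feature name `takes_reservations` are hypothetical, and the rows are made-up values for illustration only, not real Yelp data.

```python
from collections import defaultdict
from statistics import mean

def mean_rating_by(rows, feature):
    """Return {feature value: mean rating} for a list of feature dicts."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row['rating'])
    return {value: mean(ratings) for value, ratings in groups.items()}

# Made-up rows for illustration only; not real Yelp data
rows = [
    {'takes_reservations': 1, 'rating': 4.5},
    {'takes_reservations': 1, 'rating': 4.0},
    {'takes_reservations': 0, 'rating': 3.5},
    {'takes_reservations': 0, 'rating': 3.0},
]
by_reservation = mean_rating_by(rows, 'takes_reservations')
# Difference in mean rating between the two groups
shift = by_reservation[1] - by_reservation[0]
```

With the features in a DataFrame, `df.groupby(feature)['rating'].mean()` gives the same per-group means.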

<a id="step-4"></a>

## Part 4: Setup of Models

We need to construct a couple of machine learning models and train them to see
which one gives us the best test accuracy.

In [5]:
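class ANN(object):
    """
    This is a neural network.
    """
    def __init__(self, layers=(5, 5)):
        # Layer sizes; copied so that instances never share a mutable default
        self.layers = list(layers)
        
class LogisticRegression(object):
    """
    This is the logistic regression model.
    """
    def __init__(self, n_features=100):
        # Weight vector with one entry per feature, initialised to zero
        self.theta = np.zeros(n_features)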
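
The classes above only store their parameters. As an illustration of the `fit`/`predict` shape such a model could take, here is a minimal linear regression trained by batch gradient descent in NumPy. The class name `LinearRegressionGD` and its hyperparameters are assumptions chosen for this sketch; it is not the project's final model and does not use `tensorflow`.

```python
import numpy as np

class LinearRegressionGD:
    """Linear regression fitted by batch gradient descent on squared error."""

    def __init__(self, n_features, learning_rate=0.1, n_steps=2000):
        self.theta = np.zeros(n_features + 1)  # last entry is the bias
        self.learning_rate = learning_rate
        self.n_steps = n_steps

    def _add_bias(self, X):
        # Append a column of ones so the bias is learned like any weight
        return np.hstack([X, np.ones((X.shape[0], 1))])

    def fit(self, X, y):
        Xb = self._add_bias(np.asarray(X, dtype=float))
        y = np.asarray(y, dtype=float)
        for _ in range(self.n_steps):
            residual = Xb @ self.theta - y
            # Gradient of the loss (1 / 2n) * sum(residual ** 2)
            gradient = Xb.T @ residual / len(y)
            self.theta -= self.learning_rate * gradient
        return self

    def predict(self, X):
        return self._add_bias(np.asarray(X, dtype=float)) @ self.theta
```

The fixed learning rate only converges when the features are on a modest scale, so features such as review counts would need to be standardised first.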

<a id="step-5"></a>

## Part 5: Cross Validation

Now we will use cross-validation to determine which model produces the highest
validation accuracy for our dataset, then pick this model.

In [6]:
def pick_model():
    pass
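
A minimal sketch of k-fold cross-validation is given below, assuming each model exposes `fit(X, y)` (returning itself) and `predict(X)`. Because the ratings are numeric, this sketch scores models by validation RMSE, where lower is better, instead of accuracy. The names `k_fold_indices`, `cross_val_rmse` and `MeanBaseline` are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut the shuffled indices into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_val_rmse(make_model, X, y, k=5):
    """Average validation RMSE over k folds; make_model builds a new model."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        model = make_model().fit(X[train_idx], y[train_idx])
        error = model.predict(X[val_idx]) - y[val_idx]
        scores.append(np.sqrt(np.mean(error ** 2)))
    return float(np.mean(scores))

class MeanBaseline:
    """Baseline that always predicts the mean rating of the training set."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)
```

Picking a model then amounts to calling `cross_val_rmse` once per candidate and keeping the one with the lowest score. The mean baseline gives a reference point that any useful model should beat.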

<a id="step-6"></a>

## Part 6: Final Analysis

Our results were amazing.