<h1 align=center><font size = 5>Identifying Fast Food Franchise Restaurants in Houston, Texas to pursue for purchase</font></h1>

### Applied Data Science Capstone Project by Joe McReynolds

The goal of the project is to identify poorly rated fast-food franchise restaurants that are the best candidates to purchase and turn around to profitability. A workflow using Foursquare data will be designed and executed. The results will be reported with an emphasis on visual displays.

## Table of Contents

1. <a href="#item1">Introduction / Business Problem</a>
2. <a href="#item2">Data</a>  
3. <a href="#item3"> </a>  
4. <a href="#item4"> </a>  
5. <a href="#item5"> </a>  

### Introduction / Business Problem

The business problem is to identify distressed or underperforming “fast food” franchise restaurants in Houston, Texas that restaurant investors can purchase and turn around to profitability. The plan is to use Foursquare location data to design a workflow that helps commercial real estate investors locate such restaurants for investment.

1. Underperforming or poorly rated restaurants will be located. The focus is on poorly rated restaurants because they are likely to cost less than high-performing properties and are more likely to be available for purchase.

2. From that low-performing group, the restaurants with the most promise of becoming higher-rated, profitable properties will be identified by evaluating how that specific franchise typically performs in the same type of Houston neighborhood. The Foursquare data service is well suited to this task because it not only locates the properties but also contains venue statistics and ratings for each restaurant.

The value of the project is that it gives investors another piece of evidence for their decision process, which should improve their chance of buying an underperforming business that can become profitable.

### Data

Two main data sources will be used in the project. Foursquare data will be combined with local Houston neighborhood data found through web research.

1) The Houston neighborhood data will be used to segment Houston into areas. Cluster analysis will then group those areas into sets of neighborhoods with similar venues, based on the Foursquare data.

Information on Houston neighborhoods is located at the following web page:
https://en.wikipedia.org/wiki/List_of_Houston_neighborhoods

The necessary neighborhood information will be scraped from this web page, and latitude and longitude will be generated using a geocoder from the geopy library.

#### 88 "super" neighborhoods exist in the Houston area.
Those neighborhoods, or a subset of them if time runs short, will be used as the neighborhood groupings for this analysis.

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Houston_superneighborhoods.png/1000px-Houston_superneighborhoods.png" width = 800 align = "center" alt="Map of Houston super neighborhoods" />

2) Foursquare data will be used in multiple parts of the workflow. Several examples of the data are shown below to illustrate which specific aspects of the Foursquare data will be used.

In preparation for the cluster analysis that groups similar neighborhoods, Foursquare data will be gathered for each neighborhood, focusing on the category elements of the venue data. The neighborhoods will be compared based on the types of venue categories and the prevalence of those categories. A small data example for the Spring Branch area is shown below.
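As a rough illustration of "category prevalence", the sketch below turns one neighborhood's list of venue categories into the share of venues per category. The helper name `category_frequencies` is hypothetical, and the five category labels are taken from the first five rows of the Spring Branch sample shown further down. The actual clustering features may be built differently.

```python
from collections import Counter

def category_frequencies(venue_categories):
    """Return each category's share of the venues in one neighborhood."""
    counts = Counter(venue_categories)
    total = sum(counts.values())
    # An empty neighborhood yields an empty profile (no division happens).
    return {category: count / total for category, count in counts.items()}

# Categories of the first five Spring Branch venues in the sample listing.
spring_branch = [
    "Gym / Fitness Center",
    "Mexican Restaurant",
    "Burger Joint",
    "Health & Beauty Service",
    "Mexican Restaurant",
]
profile = category_frequencies(spring_branch)
print(profile)
```

Profiles like this, one per neighborhood, give vectors on a common scale that a clustering step such as k-means could compare.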

Foursquare also provides venue statistics and ratings for each restaurant, which the ranking part of the project relies on. Further below, two samples of the data are displayed to show what the Foursquare data can look like.

#### You will need to scroll down through some preprocessing steps to see the data examples

### First, the libraries need to be set up and imported

In [3]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# transform JSON into a pandas dataframe
# (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)
from pandas import json_normalize
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

import json # library to handle JSON files

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
#%matplotlib inline 

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


### Next, define Foursquare credentials and version

##### Get a Foursquare developer account and have your credentials handy

In [31]:
CLIENT_ID = 'TE0PG2YHTFN3QLWZM5ET0O0EGQX1XNEJJECQRZJVWXAG4G10' # your Foursquare ID
CLIENT_SECRET = 'CWPXUWBN55NCV0IBFRH3VPMX4JZM4UOKLXL2VWAZT0HU01FM' # your Foursquare Secret
VERSION = '20180605'
LIMIT = 300
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: TE0PG2YHTFN3QLWZM5ET0O0EGQX1XNEJJECQRZJVWXAG4G10
CLIENT_SECRET:CWPXUWBN55NCV0IBFRH3VPMX4JZM4UOKLXL2VWAZT0HU01FM
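
Hard-coding and printing credentials exposes them to anyone who sees the notebook. One possible alternative, sketched here, reads them from environment variables and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions for illustration, not part of the original workflow or of any Foursquare tooling.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables."""
    try:
        # Hypothetical variable names; use whatever your environment defines.
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(CLIENT_ID_DEMO))
```

In a real session the function would be called without arguments, so that it reads `os.environ`.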


### Get latitude and longitude coordinates for Spring Branch in Houston, TX.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>jam_agent</em>, as shown below.

In [33]:
address = 'Spring Branch Houston, TX'

geolocator = Nominatim(user_agent="jam_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

29.7998786 -95.5111698
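
The cell above assumes the geocoder finds a match. geopy's `geocode` returns `None` when it does not, which would raise an `AttributeError` on `location.latitude`. When geocoding many neighborhoods, a small wrapper along these lines could separate hits from misses. The name `geocode_many` and the offline stand-in `fake_geocode` are hypothetical; in the notebook, `geolocator.geocode` would be passed in instead.

```python
from types import SimpleNamespace

def geocode_many(addresses, geocode):
    """Geocode each address, separating hits from misses.

    `geocode` is any callable shaped like `geolocator.geocode`: it returns
    an object with `.latitude` and `.longitude`, or None when nothing matches.
    """
    found, missing = {}, []
    for address in addresses:
        location = geocode(address)
        if location is None:
            missing.append(address)
        else:
            found[address] = (location.latitude, location.longitude)
    return found, missing

def fake_geocode(address):
    # Stand-in for geolocator.geocode so the example runs without a network.
    known = {
        "Spring Branch Houston, TX": SimpleNamespace(
            latitude=29.7998786, longitude=-95.5111698
        )
    }
    return known.get(address)

found, missing = geocode_many(
    ["Spring Branch Houston, TX", "Nowhere Houston, TX"], fake_geocode
)
print(found)
print(missing)
```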


### Now, get the top 100 venues for Spring Branch, Houston, TX within a radius of 1000 meters.

First, let's create the GET request URL and name it **url**.

In [141]:
# parameters needed for foursquare url
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

In [142]:
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=TE0PG2YHTFN3QLWZM5ET0O0EGQX1XNEJJECQRZJVWXAG4G10&client_secret=CWPXUWBN55NCV0IBFRH3VPMX4JZM4UOKLXL2VWAZT0HU01FM&v=20180605&ll=29.7998786,-95.5111698&radius=1000&limit=100'
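
Building the query string by hand works, but it is easy to misplace a `&` or forget to escape a value. As an alternative sketch, the standard library's `urlencode` can assemble the same parameters from a dictionary. The function name `build_explore_url` and the demo credentials are placeholders. `requests.get` also accepts such a dictionary directly through its `params` argument.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude as one comma-separated value
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; the coordinates are the Spring Branch values above.
demo_url = build_explore_url(
    "demo-id", "demo-secret", "20180605", 29.7998786, -95.5111698, 1000, 100
)
query = parse_qs(urlsplit(demo_url).query)
print(query["ll"], query["radius"], query["limit"])
```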

### Request and preprocess Foursquare Venue and Category Data

Send the GET request and examine the results.

In [143]:
Vresults = requests.get(url).json()

In [144]:
# run check once and comment out
#Vresults

In [145]:
# function that extracts the category of the venue
def get_category_type(row):
    # explore results store the list under 'venue.categories'; other results use 'categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [147]:
venues = Vresults['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

##  CREATE SAMPLE LISTING OF VENUES FOR A NEIGHBORHOOD IN HOUSTON

| | id | name | categories | lat | lng |
|---|---|---|---|---|---|
| 0 | 4b5cb48df964a520893f29e3 | Texas Rock Gym | Gym / Fitness Center | 29.797439 | -95.513321 |
| 1 | 508dac8fe4b0b69fe93e2c86 | Tornado Taco | Mexican Restaurant | 29.799776 | -95.514221 |
| 2 | 4b5356d8f964a520069827e3 | Whataburger | Burger Joint | 29.801802 | -95.510137 |
| 3 | 4c5837ad6201e21ec8cb1970 | Visible Changes | Health & Beauty Service | 29.792277 | -95.514417 |
| 4 | 4c0e802e336220a1ad4fcc77 | Casa De Leon | Mexican Restaurant | 29.799693 | -95.516477 |
| 5 | 56295aa8498ec369b5bed888 | Dollar Tree | Discount Store | 29.803567 | -95.504475 |
| 6 | 4f23166a4fde9081ce1264d1 | LA Fitness | Gym / Fitness Center | 29.805295 | -95.503299 |
| 7 | 5b5b8aef28374e00396954a3 | Julie’s Closet | Women's Store | 29.802759 | -95.506104 |
| 8 | 4f1b6fafe4b0e6badba9d541 | Bargain Liquor Warehouse | Liquor Store | 29.804398 | -95.517242 |
| 9 | 4f0cdca4e4b0281b8bc0c303 | Stop N' In | Convenience Store | 29.804416 | -95.517265 |


### **ABOVE -- Foursquare venue and category data sample for Spring Branch, Houston, TX**
The collection of general venue data will be used in the cluster analysis that groups neighborhoods. This should be relatively straightforward, as the class has used Foursquare data for this type of work.
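
To make the shape of the raw response more concrete, the sketch below performs the same flattening as the `json_normalize` cell in plain Python. It assumes each item holds a `venue` object with `id`, `name`, `categories` and `location`, which is what the column names used above imply. The sample item is hand-built from row 2 of the listing, and `flatten_items` is a hypothetical helper.

```python
def flatten_items(items):
    """Turn explore-endpoint items into flat rows: id, name, categories, lat, lng."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "id": venue["id"],
            "name": venue["name"],
            # Keep only the first category name, as get_category_type does.
            "categories": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return rows

# Hand-built sample shaped like one response item; values come from row 2 above.
sample_items = [{
    "venue": {
        "id": "4b5356d8f964a520069827e3",
        "name": "Whataburger",
        "categories": [{"name": "Burger Joint"}],
        "location": {"lat": 29.801802, "lng": -95.510137},
    }
}]
rows = flatten_items(sample_items)
print(rows[0])
```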

#### Venue ranking analysis is the other part of the project

A specific category of venues, namely **fast food restaurants**, will be rated based on the ratings and "like" counts included with venue-specific data. Below, we show the rating and "like" count for the Whataburger venue found in the list above as an example.

Scroll down through the preprocessing steps to see the results for the Whataburger.

In [134]:
nearby_venues.shape

(15, 5)

In [88]:
## Venue id
venue_id = '4b5356d8f964a520069827e3'

In [89]:
# create the venue-details URL
url2 = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
    venue_id,
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION)
url2 # display URL

'https://api.foursquare.com/v2/venues/4b5356d8f964a520069827e3?client_id=TE0PG2YHTFN3QLWZM5ET0O0EGQX1XNEJJECQRZJVWXAG4G10&client_secret=CWPXUWBN55NCV0IBFRH3VPMX4JZM4UOKLXL2VWAZT0HU01FM&v=20180605'

#### Send GET request for result

In [101]:
result = requests.get(url2).json()

In [148]:
# print(result['response']['venue'].keys())  # uncomment to inspect the available fields
venue = result['response']['venue']
vname = venue['name']
vrating = venue.get('rating')  # None if the venue has no rating yet
vlike_num = venue['likes']['count']
print(vname, 'is rated :  ', vrating, 'in a 1 to 10 scale and has received : ', vlike_num, 'Likes')

## Venue specific rating results

Whataburger is rated :   7.4 in a 1 to 10 scale and has received :  19 Likes


### **ABOVE -- Note the Foursquare venue rating and "like" count for a Whataburger venue**

The ratings and "like" counts are pulled straight from venue-specific data. We will gather and sort both rating and "like" data for each restaurant in our evaluation.
1. The rating data will be used to determine the top five rated franchises in each neighborhood cluster.

2. The ratings will then be used to identify low-rated venues of the same franchises that make up the group of high-rated franchises determined in the previous step.
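
The two steps above can be sketched roughly as follows. This is only one possible reading of the plan: the function names, the `max_rating` cut-off, and every franchise name and rating in the sample data are invented for illustration and are not results from the project.

```python
from collections import defaultdict
from statistics import mean

def top_franchises(venues, n=5):
    """Return the n best franchises per cluster, ranked by mean venue rating."""
    ratings = defaultdict(list)
    for v in venues:
        ratings[(v["cluster"], v["franchise"])].append(v["rating"])
    by_cluster = defaultdict(list)
    for (cluster, franchise), values in ratings.items():
        by_cluster[cluster].append((mean(values), franchise))
    return {
        cluster: [name for _, name in sorted(scored, reverse=True)[:n]]
        for cluster, scored in by_cluster.items()
    }

def purchase_candidates(venues, top, max_rating):
    """Low-rated venues whose franchise is among the top franchises of their cluster."""
    return [
        v for v in venues
        if v["rating"] <= max_rating and v["franchise"] in top.get(v["cluster"], [])
    ]

# Hypothetical venues and ratings, invented purely for illustration.
venues = [
    {"name": "A1", "franchise": "Alpha Burger", "cluster": 0, "rating": 8.0},
    {"name": "A2", "franchise": "Alpha Burger", "cluster": 0, "rating": 5.0},
    {"name": "B1", "franchise": "Beta Tacos", "cluster": 0, "rating": 4.0},
    {"name": "B2", "franchise": "Beta Tacos", "cluster": 0, "rating": 5.0},
]
top = top_franchises(venues, n=1)
candidates = purchase_candidates(venues, top, max_rating=6.0)
print(top)
print([v["name"] for v in candidates])
```

With `n=1`, only the best franchise of the cluster is kept, so the single candidate is the low-rated venue of that franchise. The project plan uses the top five.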


Note: Isolating fast food restaurants as a single category is not a trivial task and will be an important issue to work through in this project. In our Whataburger example, the raw downloaded data shows the venue categorized both as a "burger joint" and as a "fast food restaurant". This is a common occurrence in the Foursquare database, and it appears to be challenging to consistently import the "fast food" category into the pandas dataframes for analysis. It is also worth noting that querying venue data within Foursquare has some peculiarities: query results frequently include more than "fast food" venues.
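
One way to approach the multi-category issue, sketched below, is to check every category attached to a venue instead of only the first one. The helper `is_fast_food`, the label "Fast Food Restaurant", and the two hand-built venues are assumptions for illustration. The real category names should be confirmed against the downloaded data.

```python
def is_fast_food(venue, labels=("Fast Food Restaurant",)):
    """True if any of the venue's categories matches one of the labels."""
    wanted = {label.lower() for label in labels}
    return any(
        category.get("name", "").lower() in wanted
        for category in venue.get("categories", [])
    )

# Hand-built samples mirroring the Whataburger case described above.
whataburger = {
    "name": "Whataburger",
    "categories": [{"name": "Burger Joint"}, {"name": "Fast Food Restaurant"}],
}
taco_stand = {
    "name": "Tornado Taco",
    "categories": [{"name": "Mexican Restaurant"}],
}
print(is_fast_food(whataburger), is_fast_food(taco_stand))
```

Filtering query results through a check like this would also drop the non-fast-food venues that the queries tend to return.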


### This concludes the DATA section

An original notebook created by [Alex Aklson](https://www.linkedin.com/in/aklson/) was used as the go-by and altered for this lab.