<h1><center>Exploring Food Culture in the United States</center></h1>

# Introduction

Food is a huge part of our lives in America. There are websites, TV shows, challenges, careers, even outfits [(see Lady Gaga's meat dress)](https://mtv.mtvnimages.com/uri/mgid:file:http:shared:mtv.com/news/wp-content/uploads/2015/08/lady-gaga-meat-2-1440794946.jpg?quality=.8&height=1211.2&width=800)
that are centered around food as a culture. As lovers of food, our team decided to look into data provided by Yelp and explore what could affect food preferences in different geographic locations, which brings us to our main question:


<br><br>

<h4><center>🍔  Are there <b>interesting patterns</b> that identify food preferences in America?   🍟</center></h4>
<br><br>
There's actually a lot of speculation on this already, but not many findings published online. By the end of this exploration, we hope to demonstrate the power of data and how it can show both interesting and important information about the food culture in the United States. 

    


# Gathering Data


To understand the context, we researched articles (both academic and non-academic) that examined important factors in food choice, and then gathered data based on their claims. Even though these claims may not be 100% true, they gave us a good base for data analysis.

Several factors were heavily mentioned, such as:

- Economic
- Cultural
- Location
- Social
- Education

After spending a lot of time parsing through data, we decided to look at political and demographic data, which touches upon economic, cultural, and social factors, and separated the data by state. We also found that data is harder to come by than we thought, so this step took a lot of time.


### Yelp Data

Our main dataset is a set of JSON files straight from Yelp, covering 2004 to 2018. The files are quite large, so to get them yourself, go to https://www.yelp.com/dataset/download and make an account. You will then see two download URLs (review.json and business.json). Download both (we recommend using Chrome, as we ran into complications with Firefox).

![download screen](assets/login.png)

Once the data is downloaded, we can now clean it up and turn it into a dataframe. 

As always, let's start by importing the libraries we need.


In [25]:
import json    
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tqdm 
from mpl_toolkits.axes_grid1 import make_axes_locatable

# And the tools from scikit-learn to do our clustering
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import mean_squared_error

from scipy.sparse import csr_matrix

from datetime import datetime


# To keep the output clean, let's also suppress all warnings
import warnings
warnings.filterwarnings('ignore')

Now let's look at the data. business.json contains the following fields:


| | **Business Data** | 
|----------|:-------------|
| business_id     | business key |
| name | name of business |
| address | address of business |
| city | city of business |
| state | state abbreviation |
| postal_code | zip code |
| latitude | latitude of coordinates | 
| longitude | longitude of coordinates | 
| stars | amount of stars given based on ratings (out of 5) |
| review_count | how many reviews |
| is_open | is the business open | 
| attributes | things about parking type and takeout | 
| categories | type of food | 
| hours | when business is open |

In [26]:
def load_businesses():
    businesses = []
    with open('business.json') as f:
        for line in tqdm.tqdm_notebook(f):
            businesses.append(json.loads(line))  
    return businesses
businesses = load_businesses()
testBusiness = businesses[1]
testBusiness






{'business_id': 'QXAEGFB4oINsVuTFxEYKFQ',
 'name': 'Emerald Chinese Restaurant',
 'address': '30 Eglinton Avenue W',
 'city': 'Mississauga',
 'state': 'ON',
 'postal_code': 'L5R 3E7',
 'latitude': 43.6054989743,
 'longitude': -79.652288909,
 'stars': 2.5,
 'review_count': 128,
 'is_open': 1,
 'attributes': {'RestaurantsReservations': 'True',
  'GoodForMeal': "{'dessert': False, 'latenight': False, 'lunch': True, 'dinner': True, 'brunch': False, 'breakfast': False}",
  'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
  'Caters': 'True',
  'NoiseLevel': "u'loud'",
  'RestaurantsTableService': 'True',
  'RestaurantsTakeOut': 'True',
  'RestaurantsPriceRange2': '2',
  'OutdoorSeating': 'False',
  'BikeParking': 'False',
  'Ambience': "{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}",
  'HasTV': 'False',
  'WiFi': "u'no'",
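
Note that many of the `attributes` values in the record above are strings that contain Python literals (for example `'True'`, `"u'loud'"`, or a whole dictionary for `BusinessParking`). The following is a minimal sketch of one way to decode them with the standard library's `ast.literal_eval`. The helper name `parse_attribute` is hypothetical and not part of the original notebook, and the sample values are copied from the record shown above.

```python
import ast

# Sample values copied from the business record shown above.
attributes = {
    'RestaurantsTakeOut': 'True',
    'NoiseLevel': "u'loud'",
    'RestaurantsPriceRange2': '2',
    'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
}

def parse_attribute(value):
    """Hypothetical helper: decode one stringified attribute value."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Leave anything that is not a valid Python literal untouched.
        return value

parsed = {key: parse_attribute(value) for key, value in attributes.items()}
print(parsed['BusinessParking']['lot'])
```

After decoding, nested fields such as the parking options can be read as regular dictionaries instead of raw strings.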
 

review.json looks like: 


| | **Review Data** | 
|----------|:-------------|
| review_id     | review key |
| user_id | who wrote the review |
| business_id | which business  |
| stars | rating given (out of 5) |
| date | date of review |
| text | actual review| 
| useful | votes given whether review is useful | 
| funny | votes given whether review is funny | 
| cool | votes given whether review is cool |

In [27]:
def load_reviews():
    reviews = []
    with open('review.json') as f:
        for line in tqdm.tqdm_notebook(f):
            reviews.append(json.loads(line))    
    return reviews
reviews = load_reviews()
testReview = reviews[1]
testReview







{'review_id': 'GJXCdrto3ASJOqKeVWPi6Q',
 'user_id': 'yXQM5uF2jS6es16SJzNHfg',
 'business_id': 'NZnhc2sEQy3RmzKTZnqtwQ',
 'stars': 5.0,
 'useful': 0,
 'funny': 0,
 'cool': 0,
 'text': "I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team o
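
Because review.json is large, holding every record (including the full review text) in a Python list can use a lot of memory. As an alternative, here is a rough sketch that reads a line-delimited JSON file in chunks with `pd.read_json` and keeps only the columns needed later. The file `review_sample.json`, its three rows, and the chunk size are made up for illustration; this is not how the notebook itself loads the data.

```python
import json
import os
import tempfile

import pandas as pd

# Tiny stand-in for review.json (one JSON object per line); all values are made up.
sample_rows = [
    {'review_id': 'r1', 'user_id': 'u1', 'business_id': 'b1', 'stars': 5.0,
     'date': '2017-01-14 21:30:33', 'text': 'great'},
    {'review_id': 'r2', 'user_id': 'u2', 'business_id': 'b2', 'stars': 1.0,
     'date': '2013-05-07 04:34:36', 'text': 'bad'},
    {'review_id': 'r3', 'user_id': 'u3', 'business_id': 'b1', 'stars': 4.0,
     'date': '2018-01-09 20:56:38', 'text': 'good'},
]
path = os.path.join(tempfile.mkdtemp(), 'review_sample.json')
with open(path, 'w') as f:
    for row in sample_rows:
        f.write(json.dumps(row) + '\n')

# Columns we want to keep; the review text is dropped to save memory.
keep = ['review_id', 'user_id', 'business_id', 'stars', 'date']

chunks = []
# With lines=True and a chunksize, read_json yields one DataFrame per chunk.
for chunk in pd.read_json(path, lines=True, chunksize=2):
    chunks.append(chunk[keep])

ratings_sample = pd.concat(chunks, ignore_index=True)
print(ratings_sample.shape)
```

For the real file, a much larger `chunksize` would be more practical; the value of 2 here only makes the chunking visible on three rows.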

Now let's turn this JSON into dataframes.

In [69]:
def makeRatingsDF():

    review_id = []
    user_id = []
    business_id = []
    stars = []
    date = []

    for review in tqdm.tqdm_notebook(reviews):
        review_id.append(review['review_id'])
        user_id.append(review['user_id'])
        business_id.append(review['business_id'])
        stars.append(review['stars'])
        date.append(review['date'])

    ratingsDF = pd.DataFrame({'review_id': review_id,
                                 'user_id': user_id,
                                 'business_id': business_id,
                             'stars':stars,
                             'date':date})
    return ratingsDF
ratingsDF =  makeRatingsDF()
ratingsDF.head()





| | review_id | user_id | business_id | stars | date |
|---|---|---|---|---|---|
| 0 | Q1sbwvVQXV2734tPgoKj4Q | hG7b0MtEbXx5QzbzE6C_VA | ujmEBvifdJM6h6RLv4wQIg | 1.0 | 2013-05-07 04:34:36 |
| 1 | GJXCdrto3ASJOqKeVWPi6Q | yXQM5uF2jS6es16SJzNHfg | NZnhc2sEQy3RmzKTZnqtwQ | 5.0 | 2017-01-14 21:30:33 |
| 2 | 2TzJjDVDEuAW6MR5Vuc1ug | n6-Gk65cPZL6Uz8qRm3NYw | WTqjgwHlXbSFevF32_DJVw | 5.0 | 2016-11-09 20:09:03 |
| 3 | yi0R0Ugj_xUx_Nek0-_Qig | dacAIZ6fTM6mqwW5uxkskg | ikCg8xy5JIg_NGPx-MSIDA | 5.0 | 2018-01-09 20:56:38 |
| 4 | 11a8sVPMUFtaC7_ABRkmtw | ssoyf2_x0EQMed6fgHeMyQ | b1b1eb3uo-w561D0ZfCEiQ | 1.0 | 2018-01-30 23:07:38 |


In [71]:
def makeBusinessesDF():
    business_id = []
    business_name = []
    business_city = []
    business_state = []
    business_categories = []

    for business in tqdm.tqdm_notebook(businesses):
        business_id.append(business['business_id'])
        business_name.append(business['name'])
        business_city.append(business['city'])
        business_state.append(business['state'])    

        # 'categories' is already a comma-separated string (or None), so store it as-is
        business_categories.append(business['categories'])

    businessDFold = pd.DataFrame({'business_id': business_id,
                               'name': business_name,
                                 'city': business_city,
                                 'state': business_state, 
                              'categories': business_categories})
    return businessDFold
businessDF = makeBusinessesDF()
businessDF.head()





| | business_id | name | city | state | categories |
|---|---|---|---|---|---|
| 0 | 1SWheh84yJXfytovILXOAQ | Arizona Biltmore Golf Club | Phoenix | AZ | Golf, Active Life |
| 1 | QXAEGFB4oINsVuTFxEYKFQ | Emerald Chinese Restaurant | Mississauga | ON | Specialty Food, Restaurants, Dim Sum, Imported... |
| 2 | gnKjwL_1w79qoiV3IC_xQQ | Musashi Japanese Restaurant | Charlotte | NC | Sushi Bars, Restaurants, Japanese |
| 3 | xvX2CttrVhyG2z1dFg_0xw | Farmers Insurance - Paul Lorenz | Goodyear | AZ | Insurance, Financial Services |
| 4 | HhyxOkGAM07SRYtlQ4wMFQ | Queen City Plumbing | Charlotte | NC | Plumbing, Shopping, Local Services, Home Servi... |
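
As a side note, when each record is already a flat dictionary, pandas can build the dataframe directly from the list of records, with `columns` selecting which keys to keep. The sketch below uses two made-up records (`sample_businesses`) rather than the real Yelp data, and it is only one possible shortcut, not the approach used above.

```python
import pandas as pd

# Made-up records that mimic the shape of entries in business.json.
sample_businesses = [
    {'business_id': 'b1', 'name': 'Example Pizza', 'city': 'Phoenix', 'state': 'AZ',
     'stars': 4.0, 'categories': 'Pizza, Restaurants'},
    {'business_id': 'b2', 'name': 'Example Salon', 'city': 'Toronto', 'state': 'ON',
     'stars': 3.5, 'categories': None},
]

# Only the listed keys become columns; other keys (here 'stars') are ignored.
wanted = ['business_id', 'name', 'city', 'state', 'categories']
sample_df = pd.DataFrame(sample_businesses, columns=wanted)
print(sample_df)
```

This avoids maintaining one Python list per column, at the cost of less control over how each field is transformed along the way.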


# Data Cleaning

In [63]:
# def makeBusinessesDF():
#     business_id = []
#     business_name = []
#     business_city = []
#     business_state = []
#     business_category_1 = []
#     business_category_2 = []
#     business_category_3 = []

#     for business in tqdm.tqdm_notebook(businesses):
#         business_id.append(business['business_id'])
#         business_name.append(business['name'])
#         business_city.append(business['city'])
#         business_state.append(business['state'])    

#         categories = None

#         if business['categories'] != None:
#             categories = business['categories'].split(',')

#             try:
#                 business_category_1.append(categories[0].strip())
#             except:
#                 business_category_1.append(None)

#             try:
#                 business_category_2.append(categories[1].strip())
#             except:
#                 business_category_2.append(None)

#             try:
#                 business_category_3.append(categories[2].strip())
#             except:
#                 business_category_3.append(None)
#         else:
#             business_category_1.append(None)
#             business_category_2.append(None)
#             business_category_3.append(None)
#     businessDF = pd.DataFrame({'business_id': business_id,
#                            'name': business_name,
#                              'city': business_city,
#                              'state': business_state, 
#                           'category1': business_category_1, 
#                           'category2': business_category_2,
#                           'category3': business_category_3})
#     return businessDF
# businessDF = makeBusinessesDF()
# businessDF







| | business_id | name | city | state | category1 | category2 | category3 |
|---|---|---|---|---|---|---|---|
| 0 | 1SWheh84yJXfytovILXOAQ | Arizona Biltmore Golf Club | Phoenix | AZ | Golf | Active Life | |
| 1 | QXAEGFB4oINsVuTFxEYKFQ | Emerald Chinese Restaurant | Mississauga | ON | Specialty Food | Restaurants | Dim Sum |
| 2 | gnKjwL_1w79qoiV3IC_xQQ | Musashi Japanese Restaurant | Charlotte | NC | Sushi Bars | Restaurants | Japanese |
| 3 | xvX2CttrVhyG2z1dFg_0xw | Farmers Insurance - Paul Lorenz | Goodyear | AZ | Insurance | Financial Services | |
| 4 | HhyxOkGAM07SRYtlQ4wMFQ | Queen City Plumbing | Charlotte | NC | Plumbing | Shopping | Local Services |
| 5 | 68dUKd8_8liJ7in4aWOSEA | The UPS Store | Mississauga | ON | Shipping Centers | Couriers & Delivery Services | Local Services |
| 6 | 5JucpCfHZltJh5r1JabjDg | Edgeworxx Studio | Calgary | AB | Beauty & Spas | Hair Salons | |
| 7 | gbQN7vr_caG_A1ugSmGhWg | Supercuts | Las Vegas | NV | Hair Salons | Hair Stylists | Barbers |
| 8 | Y6iyemLX_oylRpnr38vgMA | Vita Bella Fine Day Spa | Glendale | AZ | Nail Salons | Beauty & Spas | Day Spas |
| 9 | 4GBVPIYRvzGh4K4TkRQ_rw | Options Salon & Spa | Fairview Park | OH | Beauty & Spas | Nail Salons | Day Spas |


The data contains places that are not in the US (for example, "XGL", which is Greater London), so let's remove that data.

In [72]:

filter1 = businessDF["state"].isin(["AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC",  
    "DE", "FL", "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA",  
    "MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE",  
    "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC",  
    "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY"]) 


In [73]:
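# Keep only US businesses; assign the result so the filter actually takes effect
businessDF = businessDF[filter1]
businessDF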

| | business_id | name | city | state | categories |
|---|---|---|---|---|---|
| 0 | 1SWheh84yJXfytovILXOAQ | Arizona Biltmore Golf Club | Phoenix | AZ | Golf, Active Life |
| 2 | gnKjwL_1w79qoiV3IC_xQQ | Musashi Japanese Restaurant | Charlotte | NC | Sushi Bars, Restaurants, Japanese |
| 3 | xvX2CttrVhyG2z1dFg_0xw | Farmers Insurance - Paul Lorenz | Goodyear | AZ | Insurance, Financial Services |
| 4 | HhyxOkGAM07SRYtlQ4wMFQ | Queen City Plumbing | Charlotte | NC | Plumbing, Shopping, Local Services, Home Servi... |
| 7 | gbQN7vr_caG_A1ugSmGhWg | Supercuts | Las Vegas | NV | Hair Salons, Hair Stylists, Barbers, Men's Hai... |
| 8 | Y6iyemLX_oylRpnr38vgMA | Vita Bella Fine Day Spa | Glendale | AZ | Nail Salons, Beauty & Spas, Day Spas |
| 9 | 4GBVPIYRvzGh4K4TkRQ_rw | Options Salon & Spa | Fairview Park | OH | Beauty & Spas, Nail Salons, Day Spas, Massage |
| 11 | 1Dfx3zM-rW4n-31KeC8sJg | Taco Bell | Phoenix | AZ | Restaurants, Breakfast & Brunch, Mexican, Taco... |
| 12 | 5t3KVdMnFgAYmSl1wYLhmA | The Kilted Buffalo Langtree | Mooresville | NC | Bars, Nightlife, Pubs, Barbers, Beauty & Spas,... |
| 13 | fweCYi8FmbJXHCqLnwuk8w | Marco's Pizza | Mentor-on-the-Lake | OH | Italian, Restaurants, Pizza, Chicken Wings |


In [75]:
# data = ratingsDF.join(businessDF.set_index('business_id'), on='business_id')
# data = data.dropna()

In [78]:
timeData = ratingsDF.join(businessDF.set_index('business_id'), on='business_id')
timeData = timeData.dropna()
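
It may help to see what the join plus `dropna()` does on a small scale. `DataFrame.join` performs a left join by default, so reviews whose `business_id` has no match in the business table get NaN values and are then removed by `dropna()`, as are reviews of businesses with missing categories. The toy frames below (`ratings`, `businesses_us`) are made up for illustration; the sketch also shows an inner `merge` that gives the same rows here.

```python
import pandas as pd

# Made-up reviews: b3 has no entry in the business table below.
ratings = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'business_id': ['b1', 'b2', 'b3'],
    'stars': [5.0, 1.0, 4.0],
})

# Made-up businesses: b2 has no categories.
businesses_us = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'state': ['AZ', 'NC'],
    'categories': ['Pizza, Restaurants', None],
})

# Left join (the default for join), then drop every row with a missing value.
joined = ratings.join(businesses_us.set_index('business_id'), on='business_id').dropna()

# Equivalent result on this toy data using an explicit inner merge.
merged = ratings.merge(businesses_us.dropna(subset=['categories']),
                       on='business_id', how='inner')

print(joined['review_id'].tolist(), merged['review_id'].tolist())
```

Only review `r1` survives in both versions: `r2` is dropped because its business has no categories, and `r3` because its business is not in the table.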



In [None]:
categories = ["Sandwiches",
"Pizza",
"Chinese",
"Food Stands",
"Steakhouses",
"Mexican",
"Fast Food",
"Seafood",
"Indian",
"Gluten-Free",
"Breakfast & Brunch",
"Delis",
"Burgers",
"Salad",
"Vegan",
"Comfort Food",
"Mediterranean",
"Latin American",
"German",
"Cafes",
"Vegetarian",
"Italian",
"Middle Eastern",
"Diners",
"Hot Dogs",
"Caribbean",
"French",
"Buffets",
"Thai"]

datasets = []

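for cat in tqdm.tqdm_notebook(categories):
    # Select reviews whose category string contains this category (literal match, not regex)
    catData = timeData[timeData["categories"].str.contains(cat, regex=False)]
    # Average the star ratings per city and state
    datasets.append(catData.groupby(['city', 'state'])[['stars']].mean())
datasets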
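
One caveat with `str.contains` is that it matches substrings, so a hypothetical search for "Bars" would also match "Sushi Bars". A possible alternative, sketched below on made-up data, is to split the category string into a list and use `explode` so that each review gets one row per exact category. This assumes a pandas version that provides `DataFrame.explode` (0.25 or later) and that categories are separated by a comma and a space, as in the output shown earlier; the frame `time_data` is a stand-in, not the real `timeData`.

```python
import pandas as pd

# Made-up stand-in for the joined review/business data.
time_data = pd.DataFrame({
    'city': ['Phoenix', 'Phoenix', 'Charlotte', 'Charlotte'],
    'state': ['AZ', 'AZ', 'NC', 'NC'],
    'stars': [4.0, 2.0, 5.0, 3.0],
    'categories': ['Restaurants, Pizza, Italian', 'Fast Food, Burgers',
                   'Pizza, Restaurants', 'Salad, Vegan'],
})

# Turn the comma-separated string into a list, then give each category its own row.
exploded = (
    time_data
    .assign(category=time_data['categories'].str.split(', '))
    .explode('category')
)

# Mean star rating per exact category and state.
mean_stars = exploded.groupby(['category', 'state'])['stars'].mean()
print(mean_stars)
```

The result is a single series indexed by category and state, which can be filtered down to the categories of interest instead of building one dataframe per category.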

# Data Exploration


In [7]:
from IPython.display import YouTubeVideo

In [9]:
YouTubeVideo('sIlNIVXpIns')  # embed the video

# References

### Data sources

- https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1
- https://www.census.gov/2010census/data/apportionment-data-map.html
- https://www.census.gov/data/datasets/2010/demo/popest/modified-race-data-2010.html
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PEJ5QU

### Libraries
- pandas
- NumPy
- Matplotlib
- scikit-learn
- SciPy
- tqdm