*A1 Report - Wu, Sixuan*
# <center> Exploring Yelp </center>
## <center> Author: Sixuan Wu </center>
## <center> Date: February 15, 2020 </center>
***

## Introduction

Yelp is a business directory service and crowd-sourced review forum, and a public company of the same name that is headquartered in San Francisco, California. The company develops, hosts, and markets the Yelp.com website and the Yelp mobile app, which publish crowd-sourced reviews about businesses. It also operates an online reservation service called Yelp Reservations.

--adapted from Wikipedia

## Data Description

The dataset comes from the [Yelp Dataset Challenge](https://www.yelp.com/dataset/challenge) and is provided officially by Yelp. It is distributed as JSON files packed in a tar archive, which can be downloaded after providing a name and email address. The following [permissions](https://s3-media1.fl.yelpcdn.com/assets/srv0/engineering_pages/06cb5ad91db8/assets/vendor/yelp-dataset-agreement.pdf) apply to the dataset:

**Cannot Dos:**
- Use the data to create or update my own business.
- Give the data to a third party without Yelp's permission.
- Give the data to others for profit.

**Can Dos:**
- Use information from the data for an academic project in this course.

### Extract Data

In [20]:
import tarfile
import json

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [4]:
# The downloaded dataset is a tar archive; extract its JSON files into the working directory.
with tarfile.open("yelp_dataset.tar") as tf:
    tf.extractall()

The download is a tar archive, so the first step is to extract the JSON files it contains. Extraction produces the following JSON datasets:
- business.json
- checkin.json
- photo.json
- review.json
- tip.json
- user.json
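
Before extracting a large archive, it can help to check which files it contains. The following is a minimal sketch of that check, not part of the original analysis. The helper names `list_json_members` and `build_demo_archive` are hypothetical, and the demo archive is a tiny stand-in so the code runs without the real download.

```python
import io
import os
import tarfile
import tempfile


def list_json_members(tar_path):
    """Return the sorted names of the .json files stored in a tar archive."""
    with tarfile.open(tar_path) as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".json")
        )


def build_demo_archive(tar_path, names):
    """Write a tiny stand-in archive (hypothetical contents) for the demo."""
    with tarfile.open(tar_path, "w") as archive:
        for name in names:
            payload = b"{}\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


# Demo on a throwaway archive; the file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar")
    build_demo_archive(demo_path, ["business.json", "review.json", "README.txt"])
    found = list_json_members(demo_path)

print(found)
```

Calling `list_json_members("yelp_dataset.tar")` on the real archive should list the JSON files named above, assuming the archive stores them under those names.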

### Observe Data

Inspecting the datasets listed above gives the following structure and relationships:

| Dataset | Structure | Relationship 
|:--------|:-----------|:------------------------------
business.json | **business_id:** id representing the merchant<br>**name:** name of the merchant<br>**address:** street address of the merchant<br>**city:** city where the merchant is located<br>**state:** state where the merchant is located<br>**postal_code:** postal code of the merchant<br>**latitude:** latitude of the merchant<br>**longitude:** longitude of the merchant<br>**stars:** star rating given by customers<br>**review_count:** number of reviews provided by users<br>**is_open:** whether the shop/restaurant is open, 1 for yes and 0 for no<br>**attributes:** special services provided by the merchant<br>**categories:** categories describing the merchant<br>**hours:** opening hours of the merchant | Detailed information on each merchant registered on Yelp.
[checkin](https://blog.yelp.com/2018/12/perfect-yelp-check-in-offer).json | **business_id:** id of the merchant<br>**date:** exact times of check-ins at the business | Check-in times for each merchant.<br>Shares business_id with business.json.
user.json | **user_id:** id of the user<br>**name:** name of the user<br>**review_count:** number of reviews written by the user<br>**yelping_since:** date the user joined Yelp<br>**useful:** number of useful votes received by the user<br>**funny:** number of funny votes received by the user<br>**cool:** number of cool votes received by the user<br>**[elite](https://www.yelp-support.com/article/What-is-Yelps-Elite-Squad?l=en_US):** the years in which the user was elite<br>**friends:** ids of the user's friends<br>**fans:** number of fans of the user<br>**average_stars:** average stars given by the user<br>**compliment_hot:** number of "hot" compliments received by the user<br>**compliment_more:** number of "more" compliments received by the user<br>**compliment_profile:** number of profile compliments received by the user<br>**compliment_cute:** number of "cute" compliments received by the user<br>**compliment_list:** number of list compliments received by the user<br>**compliment_note:** number of note compliments received by the user<br>**compliment_plain:** number of "plain" compliments received by the user<br>**compliment_cool:** number of "cool" compliments received by the user<br>**compliment_funny:** number of "funny" compliments received by the user<br>**compliment_writers:** number of writer compliments received by the user<br>**compliment_photos:** number of photo compliments received by the user | Detailed information on each user registered on Yelp.
photo.json | **caption:** caption provided by the user<br>**photo_id:** id of the photo<br>**business_id:** id of the merchant the photo shows<br>**label:** whether the photo shows the inside or outside of the restaurant/shop | Photos provided by users to describe the shop/restaurant.<br>Shares business_id with business.json.
review.json | **review_id:** id of the review<br>**user_id:** id of the user who wrote the review<br>**business_id:** id of the business described by the review<br>**stars:** stars given in the review<br>**useful:** number of users who found the review useful<br>**funny:** number of users who found the review funny<br>**cool:** number of users who found the review cool<br>**text:** full text of the review | Reviews provided by users to describe the shop/restaurant.<br>Shares business_id with business.json and user_id with user.json.
tip.json | **user_id:** id of the user who gives the tip<br>**business_id:** id of the merchant that receives the tip<br>**text:** comments given by the user<br>**date:** date when the tip was given<br>**compliment_count:** number of compliments | Tips left by customers for merchants.<br>Shares business_id with business.json and user_id with user.json.
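
The shared `business_id` key is what lets these datasets be combined. As a rough illustration, the sketch below joins two miniature frames on that key with pandas. The frames `business_demo` and `review_demo` and all their values are made up for this example and are not taken from the Yelp data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for business.json and review.json.
business_demo = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Cafe A", "Shop B"],
})
review_demo = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, 4],
})

# business_id is the shared key: each review row picks up its merchant's name.
reviews_with_names = review_demo.merge(business_demo, on="business_id", how="left")

# Mean review stars per merchant, computed from the joined frame.
mean_stars = reviews_with_names.groupby("name")["stars"].mean()

print(reviews_with_names)
print(mean_stars)
```

A left join keeps every review even if its merchant is missing from the business frame. The same pattern would apply to `user_id` when linking reviews or tips to users.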

In [6]:
# Each line of the file holds one JSON object, so read the file line by line.
with open('business.json', 'r', encoding='utf8') as json_file:
    lines = json_file.readlines()

In [26]:
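# Missing values appear as the JSON literal null, which Python does not define;
# bind null to np.nan so that eval can read those lines.
null = np.nan

# Parse each line into a dict. eval executes whatever the line contains,
# so this is only suitable for trusted files.
rows = []
for line in lines:
    row = eval(line.strip())
    rows.append(row)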

# Save data to the dataframe.
business = pd.DataFrame(rows)
display(business)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,1SWheh84yJXfytovILXOAQ,Arizona Biltmore Golf Club,2818 E Camino Acequia Drive,Phoenix,AZ,85016,33.522143,-112.018481,3.0,5,0,{'GoodForKids': 'False'},"Golf, Active Life",
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,1,"{'RestaurantsReservations': 'True', 'GoodForMe...","Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W..."
2,gnKjwL_1w79qoiV3IC_xQQ,Musashi Japanese Restaurant,"10110 Johnston Rd, Ste 15",Charlotte,NC,28210,35.092564,-80.859132,4.0,170,1,"{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...","Sushi Bars, Restaurants, Japanese","{'Monday': '17:30-21:30', 'Wednesday': '17:30-..."
3,xvX2CttrVhyG2z1dFg_0xw,Farmers Insurance - Paul Lorenz,"15655 W Roosevelt St, Ste 237",Goodyear,AZ,85338,33.455613,-112.395596,5.0,3,1,,"Insurance, Financial Services","{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ..."
4,HhyxOkGAM07SRYtlQ4wMFQ,Queen City Plumbing,"4209 Stuart Andrew Blvd, Ste F",Charlotte,NC,28217,35.190012,-80.887223,4.0,4,1,"{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...","Plumbing, Shopping, Local Services, Home Servi...","{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
192604,nqb4kWcOwp8bFxzfvaDpZQ,Sanderson Plumbing,,North Las Vegas,NV,89032,36.213732,-115.177059,5.0,9,1,{'BusinessAcceptsCreditCards': 'True'},"Water Purification Services, Water Heater Inst...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."
192605,vY2nLU5K20Pee-FdG0br1g,Chapters,17440 Yonge Street,Newmarket,ON,L3Y 6Y9,44.052658,-79.481850,4.5,3,1,"{'RestaurantsPriceRange2': '2', 'BikeParking':...","Books, Mags, Music & Video, Shopping",
192606,MiEyUDKTjeci5TMfxVZPpg,Phoenix Pavers,21230 N 22nd St,Phoenix,AZ,85024,33.679992,-112.035569,4.5,14,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Home Services, Contractors, Landscaping, Mason...","{'Monday': '7:0-15:0', 'Tuesday': '7:0-15:0', ..."
192607,zNMupayB2jEHVDOji8sxoQ,Beasley's Barber Shop,4406 E Main St,Mesa,AZ,85205,33.416137,-111.735743,4.5,15,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","Beauty & Spas, Barbers","{'Tuesday': '8:30-17:30', 'Wednesday': '8:30-1..."
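
The parsing cell above reads each line with `eval`. An alternative is to parse the lines with the standard `json` module, which handles `null` directly and does not execute file contents. The sketch below is one possible way to do this, not the method used in this report. The helper name `load_json_lines` and the demo records are hypothetical.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    rows = []
    with open(path, "r", encoding="utf8") as json_file:
        for line in json_file:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))  # JSON null becomes None
    return rows


# Demo on made-up records written to a temporary file.
demo_lines = [
    '{"business_id": "b1", "stars": 3.0, "hours": null}',
    '{"business_id": "b2", "stars": 4.5, "hours": {"Monday": "9:0-17:0"}}',
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf8") as demo_file:
        demo_file.write("\n".join(demo_lines) + "\n")
    demo_rows = load_json_lines(demo_path)

print(demo_rows)
```

Passing the result to `pd.DataFrame` builds the same kind of frame as above, except that missing values arrive as `None` rather than `np.nan`; pandas treats both as missing in `isna()`. `pd.read_json(path, lines=True)` is another option for line-delimited JSON.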


In [28]:
business.city.unique()

array(['Phoenix', 'Mississauga', 'Charlotte', ..., 'Henderson Nevada',
       'Boston', 'Spring Hill City View'], dtype=object)