<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

# Capstone Project Milestone 2: General Yelp EDA

_Author: Schubert H. Laforest (BOS)_

# Overview 

This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which gives students a chance to conduct research or analysis on Yelp's data and share their discoveries. The dataset covers businesses across 11 metropolitan areas in four countries.

In total, there are:
- 5,200,000 user reviews
- Information on 174,000 businesses
- The data spans 11 metropolitan areas


Provenance: https://www.kaggle.com/yelp-dataset/yelp-dataset

Link to Original Dataset: https://www.yelp.com/dataset

References: "Personalizing Yelp Review Ratings"

**Main Goal of the Project**:
- Analyze user reviews and build a recommender that suggests other businesses a user would like (by type, location, time of day/year)

**Secondary Goal (time permitting)**:
- Track the main complaints in reviews and offer up "areas of improvement" for a given business, and/or what they should be doing (i.e. what's most popular with your users, what's least popular).

**Run of the Mill Goals / Exploration**:
- Predict the number of stars a business will receive and, by extension, its check-ins

**Methods and Models**:
- Packages: Gensim, SpaCy, Google Cloud Natural Language API
- Methods: Word2Vec, TF-IDF, CountVectorizer (CVEC), HashingVectorizer (HVEC)
- Models: Latent Dirichlet Allocation, Principal Component Analysis, MultinomialNB, CNN
    
- Classification 

**Risks and Assumptions of the Data**:
- The Data is 

# EDA Guide 

# Data Dictionary 

## Reviews
Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
     
- **review_id**:  unique review id
    - Data Type: string, 22 character (ex: "zdSx_SD6obEhz9VrW9uAWA")


- **user_id**: unique user id, maps to the user in user.csv
    - Data type: string, 22 character (ex: "Ha3iJu77CxlrFm-vQRs_8g")
    

- **business_id**: business id, maps to business in business.csv 
    - Data Type: string, 22 character (ex: "tnhfDv5Il8EaGSXZGiuQGg")
    
    
- **stars**: star rating by reviewer, maximum of 5
    - Data Type: Integer
    
    
- **date**: date formatted YYYY-MM-DD
    - Data Type: String (ex: "2016-03-09")
    
    
- **text**: the review itself
    - Data Type: String
    - Example: *"Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks."*

- **useful**: number of useful votes received
    - Data Type: Integer


- **funny**: number of funny votes received
    - Data Type: Integer


- **cool**: number of cool votes received
    - Data Type: Integer
    
    
    
    
    

## Business


## Business Hours


## Check-in 
Check-ins on a business.

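- **business_id**: business id, maps to the business in the business table
- **weekday**: day of the week, abbreviated (ex: "Tue")
- **hour**: hour of the day (ex: "14:00")
- **checkins**: number of check-ins recorded for that business, weekday, and hour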



## Attributes


## Tip


## User




To do: 
- Visualize the distribution of star ratings, useful votes, and cool votes in reviews
- Going to need to subset the data because I cannot do it all for everything. Pick a market.
- Articulates the main goal of your project (your problem statement)
- Outlines your proposed methods and models
- Defines the risks & assumptions of your data
- Revises initial goals & success criteria, as needed
- Performs & summarizes preliminary EDA of your data
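
As a starting point for the first to-do item, here is a minimal sketch of how the star-rating distribution could be tabulated before plotting. The `toy_reviews` frame and its values are made up for illustration; only the column names (`stars`, `useful`, `cool`) come from the reviews table.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical values, not real Yelp data)
toy_reviews = pd.DataFrame({
    "stars": [5, 4, 5, 1, 3, 5, 4, 2],
    "useful": [0, 1, 0, 3, 0, 2, 0, 0],
    "cool": [0, 0, 1, 0, 0, 1, 0, 0],
})

# Number of reviews per star rating, ordered from 1 to 5
star_counts = toy_reviews["stars"].value_counts().sort_index()

# Share of reviews per star rating
star_share = star_counts / star_counts.sum()

# Summary statistics for the vote columns
vote_summary = toy_reviews[["useful", "cool"]].describe()

print(star_counts)
print(vote_summary)
```

With matplotlib installed, `star_counts.plot(kind="bar")` would turn the same counts into a bar chart.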

In [1]:
# EDA imports 
import pandas as pd 
import numpy as np 

In [5]:
# Main Dataset (avoid rerunning, very big)
reviews = pd.read_csv('yelp_review.csv')

In [2]:
# Loading in the other datasets
business = pd.read_csv('yelp_business.csv')
attributes = pd.read_csv('yelp_business_attributes.csv')
business_hours = pd.read_csv('yelp_business_hours.csv')
checkin = pd.read_csv('yelp_checkin.csv')
tip = pd.read_csv('yelp_tip.csv')
user = pd.read_csv('yelp_user.csv')

OK, so the data is way too large to work with in full. I'm going to have to subset it, and do so randomly so that I don't work off of a biased dataset: one random sample for every year that I have data for.
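
One way the per-year random sample could look is sketched below. This is an assumption about the approach, not the final method: it uses a made-up `toy_reviews` frame, an arbitrary 50% fraction, and `DataFrameGroupBy.sample`, which needs pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical rows, four per year)
toy_reviews = pd.DataFrame({
    "review_id": [f"r{i}" for i in range(12)],
    "date": ["2015-01-10", "2015-03-02", "2015-07-19", "2015-11-30",
             "2016-02-14", "2016-05-28", "2016-08-08", "2016-12-01",
             "2017-01-05", "2017-04-22", "2017-09-13", "2017-10-31"],
    "stars": [5, 4, 3, 5, 1, 5, 4, 2, 3, 4, 5, 1],
})

# Parse the YYYY-MM-DD strings and pull out the year
toy_reviews["year"] = pd.to_datetime(toy_reviews["date"]).dt.year

# Draw the same fraction from every year; random_state makes the draw repeatable
sample = toy_reviews.groupby("year").sample(frac=0.5, random_state=42)

print(sample["year"].value_counts().sort_index())
```

Sampling a fixed fraction within each year keeps the yearly proportions of the full data, whereas a fixed `n` per year would weight sparse early years more heavily.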

In [7]:
reviews.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool
0,vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28,Super simple place but amazing nonetheless. It...,0,0,0
1,n6QzIUObkYshz4dz2QRJTw,bv2nCi5Qv5vroFiqKGopiw,VR6GpWIda3SfvPC-lg9H3w,5,2016-05-28,Small unassuming place that changes their menu...,0,0,0
2,MV3CcKScW05u5LVfF6ok0g,bv2nCi5Qv5vroFiqKGopiw,CKC0-MOWMqoeWf6s-szl8g,5,2016-05-28,Lester's is located in a beautiful neighborhoo...,0,0,0
3,IXvOzsEMYtiJI0CARmj77Q,bv2nCi5Qv5vroFiqKGopiw,ACFtxLv8pGrrxMm6EgjreA,4,2016-05-28,Love coming here. Yes the place always needs t...,0,0,0
4,L_9BTb55X0GDtThi6GlZ6w,bv2nCi5Qv5vroFiqKGopiw,s2I_Ni76bjJNK9yG60iD-Q,4,2016-05-28,Had their chocolate almond croissant and it wa...,0,0,0


In [13]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5261668 entries, 0 to 5261667
Data columns (total 9 columns):
review_id      object
user_id        object
business_id    object
stars          int64
date           object
text           object
useful         int64
funny          int64
cool           int64
dtypes: int64(4), object(5)
memory usage: 361.3+ MB
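
The `+` in the memory figure above means pandas did not count the contents of the object (string) columns, so the real footprint is larger. A rough sketch of how to measure it and trim it, using a made-up `toy` frame and dtype choices that are assumptions rather than settled decisions:

```python
import pandas as pd

# Toy stand-in with the same kinds of columns as reviews (hypothetical values)
toy = pd.DataFrame({
    "stars": [5, 4, 1] * 100,
    "date": ["2016-05-28", "2016-03-09", "2015-12-01"] * 100,
    "useful": [0, 2, 1] * 100,
})

# deep=True also counts the string contents of object columns
before = toy.memory_usage(deep=True).sum()

slim = toy.assign(
    stars=toy["stars"].astype("int8"),                         # ratings only go up to 5
    useful=pd.to_numeric(toy["useful"], downcast="unsigned"),  # smallest unsigned int that fits
    date=pd.to_datetime(toy["date"]),                          # 8 bytes per row instead of a string
)
after = slim.memory_usage(deep=True).sum()

print(before, after)
```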


In [14]:
business.head()

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",,"""4855 E Warner Rd, Ste B9""",Ahwatukee,AZ,85044,33.33069,-111.978599,4.0,22,1,Dentists;General Dentistry;Health & Medical;Or...
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",,"""3101 Washington Rd""",McMurray,PA,15317,40.291685,-80.1049,3.0,11,1,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",,"""6025 N 27th Ave, Ste 1""",Phoenix,AZ,85017,33.524903,-112.11531,1.5,18,1,Departments of Motor Vehicles;Public Services ...
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",,"""5000 Arizona Mills Cr, Ste 435""",Tempe,AZ,85282,33.383147,-111.964725,3.0,9,0,Sporting Goods;Shopping
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",,"""581 Howe Ave""",Cuyahoga Falls,OH,44221,41.119535,-81.47569,3.5,116,1,American (New);Nightlife;Bars;Sandwiches;Ameri...


`is_open` and `stars` could be interesting y's (target variables).

In [8]:
attributes.head()

Unnamed: 0,business_id,AcceptsInsurance,ByAppointmentOnly,BusinessAcceptsCreditCards,BusinessParking_garage,BusinessParking_street,BusinessParking_validated,BusinessParking_lot,BusinessParking_valet,HairSpecializesIn_coloring,...,Corkage,DietaryRestrictions_dairy-free,DietaryRestrictions_gluten-free,DietaryRestrictions_vegan,DietaryRestrictions_kosher,DietaryRestrictions_halal,DietaryRestrictions_soy-free,DietaryRestrictions_vegetarian,AgesAllowed,RestaurantsCounterService
0,FYWN1wneV18bWNgQjJ2GNg,Na,Na,Na,True,Na,Na,Na,Na,Na,...,Na,Na,Na,Na,Na,Na,Na,Na,Na,Na
1,He-G7vWjzVUysIKrfNbPUQ,Na,Na,Na,Na,Na,Na,Na,Na,Na,...,Na,Na,Na,Na,Na,Na,Na,Na,Na,Na
2,8DShNS-LuFqpEWIp0HxijA,Na,Na,Na,Na,Na,Na,Na,Na,Na,...,Na,Na,Na,Na,Na,Na,Na,Na,Na,Na
3,PfOCPjBrlQAnz__NXj9h_w,Na,Na,Na,Na,Na,Na,Na,Na,Na,...,Na,Na,Na,Na,Na,Na,Na,Na,Na,Na
4,o9eMRCWt5PkpLDE0gOPtcQ,Na,Na,Na,Na,False,False,False,False,False,...,Na,Na,Na,Na,Na,Na,Na,Na,Na,Na


In [9]:
attributes.columns

Index(['business_id', 'AcceptsInsurance', 'ByAppointmentOnly',
       'BusinessAcceptsCreditCards', 'BusinessParking_garage',
       'BusinessParking_street', 'BusinessParking_validated',
       'BusinessParking_lot', 'BusinessParking_valet',
       'HairSpecializesIn_coloring', 'HairSpecializesIn_africanamerican',
       'HairSpecializesIn_curly', 'HairSpecializesIn_perms',
       'HairSpecializesIn_kids', 'HairSpecializesIn_extensions',
       'HairSpecializesIn_asian', 'HairSpecializesIn_straightperms',
       'RestaurantsPriceRange2', 'GoodForKids', 'WheelchairAccessible',
       'BikeParking', 'Alcohol', 'HasTV', 'NoiseLevel', 'RestaurantsAttire',
       'Music_dj', 'Music_background_music', 'Music_no_music', 'Music_karaoke',
       'Music_live', 'Music_video', 'Music_jukebox', 'Ambience_romantic',
       'Ambience_intimate', 'Ambience_classy', 'Ambience_hipster',
       'Ambience_divey', 'Ambience_touristy', 'Ambience_trendy',
       'Ambience_upscale', 'Ambience_casual', 'Restau

So it appears I'm going to have to join business and attributes in order to 
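
A minimal sketch of that join, assuming `business_id` is the key in both tables (it is the shared column in the two `head()` outputs above). The toy frames and their values are hypothetical. The two tables are not row-aligned in the outputs above, so a keyed merge is used rather than a side-by-side concat.

```python
import pandas as pd

# Toy stand-ins for the business and attributes tables (hypothetical values)
toy_business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "stars": [4.0, 3.0, 1.5],
    "is_open": [1, 1, 0],
})
toy_attributes = pd.DataFrame({
    "business_id": ["b1", "b3", "b4"],
    "BikeParking": ["True", "Na", "False"],
})

# Left join keeps every business; indicator=True records where each row matched
merged = toy_business.merge(
    toy_attributes, on="business_id", how="left", indicator=True
)

# Businesses with no attribute row at all
n_unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print(n_unmatched)
```

Checking `_merge` before dropping it is a cheap way to see how many businesses would lose all attribute information.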

Other consideration

## Check-in EDA 

In [3]:
checkin.head()

Unnamed: 0,business_id,weekday,hour,checkins
0,3Mc-LxcqeguOXOVT_2ZtCg,Tue,0:00,12
1,SVFx6_epO22bZTZnKwlX7g,Wed,0:00,4
2,vW9aLivd4-IorAfStzsHww,Tue,14:00,1
3,tEzxhauTQddACyqdJ0OPEQ,Fri,19:00,1
4,CEyZU32P-vtMhgqRCaXzMA,Tue,17:00,1


In [5]:
cg = checkin.groupby('business_id')

Unnamed: 0,business_id,weekday,hour,checkins
0,3Mc-LxcqeguOXOVT_2ZtCg,Tue,0:00,12
1,SVFx6_epO22bZTZnKwlX7g,Wed,0:00,4
2,vW9aLivd4-IorAfStzsHww,Tue,14:00,1
3,tEzxhauTQddACyqdJ0OPEQ,Fri,19:00,1
4,CEyZU32P-vtMhgqRCaXzMA,Tue,17:00,1
5,9dn5pee_n2dWQfN57xoJpg,Sun,3:00,5
6,6Zk5F7fsTr8n2CJTlaxHlw,Wed,1:00,4
7,OE_IDW5w_W97sBcZvq2Img,Sat,1:00,1
8,gy5pr5bFAjOL5rERSdMCLg,Sat,15:00,1
9,r2-eAhGANXlcgQy898tTaw,Mon,19:00,1
