
romainr/yelp-data-analysis


Yelp Data Analysis with Hue

Hue (http://cloudera.github.com/hue) can be used to get started quickly with Hadoop and analyze data. The Yelp Dataset Challenge provides small datasets that are ideal for getting started.

Watch the video on this blog post!

Getting Started

Get the dataset from Yelp: https://www.yelp.com/dataset_challenge/

Normalize data

Clean the data with https://github.com/romainr/yelp-data-analysis/blob/master/convert.py

  1. Retrieve the data and extract it.
tar -xvf yelp_phoenix_academic_dataset.tar
cd yelp_phoenix_academic_dataset
wget https://raw.github.com/romainr/yelp-data-analysis/master/convert.py
yelp_phoenix_academic_dataset$ ls
convert.py notes.txt READ_FIRST-Phoenix_Academic_Dataset_Agreement-3-11-13.pdf yelp_academic_dataset_business.json yelp_academic_dataset_checkin.json yelp_academic_dataset_review.json yelp_academic_dataset_user.json
  2. Convert it to TSV.
chmod +x convert.py
./convert.py
  3. The script prints the column headers:
["city", "review_count", "name", "neighborhoods", "type", "business_id", "full_address", "state", "longitude", "stars", "latitude", "open", "categories"]
["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]
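For readers who want to see what this kind of normalization involves, here is a minimal sketch. It is not the actual convert.py. It assumes each input line is one JSON object and that tabs and newlines inside values must be flattened so that every record stays on a single TSV line. The names `clean_value` and `json_lines_to_tsv` are hypothetical.

```python
import json

# Column order copied from the business header printed above.
BUSINESS_COLUMNS = ["city", "review_count", "name", "neighborhoods", "type",
                    "business_id", "full_address", "state", "longitude",
                    "stars", "latitude", "open", "categories"]


def clean_value(value):
    """Flatten one JSON value into a single TSV-safe cell (hypothetical helper)."""
    if value is None:
        return ""
    if isinstance(value, (list, dict)):
        # Keep nested structures as compact JSON text inside the cell.
        value = json.dumps(value)
    # Collapse tabs, newlines and repeated spaces into single spaces.
    return " ".join(str(value).split())


def json_lines_to_tsv(lines, columns):
    """Yield one tab-separated row per non-empty JSON line."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        yield "\t".join(clean_value(record.get(col, "")) for col in columns)


if __name__ == "__main__":
    # Assumed file name, taken from the directory listing above.
    with open("yelp_academic_dataset_business.json") as src:
        for row in json_lines_to_tsv(src, BUSINESS_COLUMNS):
            print(row)
```

Flattening whitespace matters mostly for the review `text` field, where embedded newlines would otherwise split one review across several rows.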

Create Table

Create the Hive tables with the 'Create a new table from a file' option in the Catalog app or in the Beeswax 'Tables' tab.

Upload the data files yelp_academic_dataset_business_clean.json and yelp_academic_dataset_review_clean.json. Hue then guesses the tab separator and lets you name each column of the tables (use the column headers above; you can paste them directly if you use Hue 2.3).
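If you would rather type the table definition than click through the wizard, the sketch below builds a roughly equivalent Hive `CREATE TABLE` statement from a header list. This is an illustration, not what Hue generates: the function name `build_create_table` is hypothetical, and the column types are assumptions (every column defaults to STRING unless overridden).

```python
# Column order copied from the review header printed by the script.
REVIEW_COLUMNS = ["funny", "useful", "cool", "user_id", "review_id",
                  "text", "business_id", "stars", "date", "type"]

# Assumed types for the vote counters and rating; adjust to your data.
REVIEW_TYPES = {"funny": "INT", "useful": "INT", "cool": "INT", "stars": "INT"}


def build_create_table(table, columns, types=None):
    """Return a HiveQL CREATE TABLE statement for a tab-separated text file."""
    types = types or {}
    # Backticks keep names such as `date` and `type` safe to use as identifiers.
    cols = ",\n".join(
        "  `{}` {}".format(col, types.get(col, "STRING")) for col in columns
    )
    return (
        "CREATE TABLE {} (\n{}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE"
    ).format(table, cols)


if __name__ == "__main__":
    print(build_create_table("review", REVIEW_COLUMNS, REVIEW_TYPES))
```

The printed statement can be pasted into the Hive editor. The table names `business` and `review` match the ones used by the queries in this README.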

Queries

Open Beeswax, Hue's Hive editor, and run:

  1. Top 25: businesses with the most reviews
SELECT name, review_count
FROM business
ORDER BY review_count DESC
LIMIT 25
  2. Top 25: coolest restaurants
SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 25
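To check the second query on a small sample without a cluster, the same logic can be sketched in plain Python. This is only an illustration of what the join and aggregation compute. It assumes rows are dicts keyed by the column headers and that `categories` is stored as a string, as in the TSV files. The function name `coolest_restaurants` is hypothetical.

```python
from collections import defaultdict


def coolest_restaurants(businesses, reviews, limit=25):
    """Sum `cool` votes per restaurant and return the top `limit` rows."""
    # WHERE categories LIKE '%Restaurants%': a plain substring match.
    names = {
        b["business_id"]: b["name"]
        for b in businesses
        if "Restaurants" in b["categories"]
    }

    # JOIN + GROUP BY: only reviews whose business_id matches a restaurant count.
    coolness = defaultdict(int)
    for r in reviews:
        if r["business_id"] in names:
            coolness[r["business_id"]] += int(r["cool"])

    # ORDER BY coolness DESC LIMIT n
    ranked = sorted(coolness.items(), key=lambda kv: kv[1], reverse=True)
    return [(bid, names[bid], total) for bid, total in ranked[:limit]]
```

As in the inner join of the Hive query, restaurants without any review do not appear in the result.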

Let your imagination run wild!

About

Yelp contest data analysis
