My project for the Yelp Dataset Challenge Round 10

unclenachoduh/yelp

Yelp Dataset Challenge Round 10

Kekoa Riggin - University of Washington - MSc Computational Linguistics

Project

https://www.yelp.com/dataset

Project Scope

  1. Build a statistical language model from Yelp reviews for each metropolitan area.
  2. Generate the top 100 sentences from each language model.
  3. Report on the language models and on the linguistic features of Yelp reviews, both in general and by metropolitan area.
  4. Split the extracted Yelp reviews 90/10 into training and test data.
  5. Use reviews from other areas as test data to compute perplexity scores.
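
The 90/10 split and the perplexity comparison in items 4 and 5 can be sketched as follows. This is a minimal illustration and not the project's actual model: the bigram order, add-one smoothing, whitespace tokenization, and all function names are assumptions made for the example.

```python
import math
import random
from collections import Counter


def split_reviews(reviews, train_fraction=0.9, seed=0):
    """Shuffle reviews and split them into (train, test) lists."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def tokenize(text):
    """Lowercase whitespace tokens with sentence boundary markers (assumed)."""
    return ["<s>"] + text.lower().split() + ["</s>"]


def train_bigram(reviews):
    """Count bigrams and their left contexts over all reviews."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for text in reviews:
        tokens = tokenize(text)
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def perplexity(model, reviews):
    """Perplexity exp(-(1/N) * sum(log p)) under add-one smoothing."""
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # one extra slot shared by all unseen words
    log_prob, n = 0.0, 0
    for text in reviews:
        tokens = tokenize(text)
        for prev, word in zip(tokens, tokens[1:]):
            p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + v)
            log_prob += math.log(p)
            n += 1
    if n == 0:
        raise ValueError("no tokens to score")
    return math.exp(-log_prob / n)
```

Training on one area's reviews and scoring another area's held-out reviews with `perplexity` gives a cross-area comparison: the lower the score, the closer the test text is to the training text.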

Status

  • Relevant data extraction (Businesses)
    • Verify and correct regional data for businesses
    • Generate plots of all data and of accepted data
  • Relevant data extraction (Review Text)
  • Create model from extracted text
  • Generate from model
  • Analyze generation
  • Generate on region

Additional Steps

  • Relevant data extraction (Tips)

Metropolitan Areas

Area          Latitude    Longitude
Champaign     40.116420   -88.243383
Charlotte     35.227087   -80.843127
Cleveland     41.499320   -81.694361
Las_Vegas     36.169941   -115.13983
Madison       43.073052   -89.40123
Phoenix       33.448377   -112.074037
Pittsburgh    40.440625   -79.995886
Montreal      45.501689   -73.567256
Toronto       43.653226   -79.383184
Buenos_Aires  -34.603684  -58.381559
Stuttgart     48.7758     9.1829
Inverness     57.477772   -4.224721
Edinburgh     55.953251   -3.188267
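
One way these coordinates could be used to verify a business's region is to assign it to the nearest listed area by great-circle distance. The sketch below is an assumption about how such a check might look, not the repository's code; `nearest_area` and the 100 km cutoff `max_km` are hypothetical.

```python
import math

# Area centers copied from the table above.
METRO_AREAS = {
    "Champaign": (40.116420, -88.243383),
    "Charlotte": (35.227087, -80.843127),
    "Cleveland": (41.499320, -81.694361),
    "Las_Vegas": (36.169941, -115.13983),
    "Madison": (43.073052, -89.40123),
    "Phoenix": (33.448377, -112.074037),
    "Pittsburgh": (40.440625, -79.995886),
    "Montreal": (45.501689, -73.567256),
    "Toronto": (43.653226, -79.383184),
    "Buenos_Aires": (-34.603684, -58.381559),
    "Stuttgart": (48.7758, 9.1829),
    "Inverness": (57.477772, -4.224721),
    "Edinburgh": (55.953251, -3.188267),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_area(lat, lon, max_km=100.0):
    """Return the closest area name, or None if none is within max_km.

    max_km is an illustrative threshold, not a value from the project.
    """
    name, dist = min(
        ((area, haversine_km(lat, lon, a_lat, a_lon))
         for area, (a_lat, a_lon) in METRO_AREAS.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= max_km else None
```

A business whose recorded city disagrees with `nearest_area(latitude, longitude)`, or for which the function returns `None`, would be a candidate for correction or rejection.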

Steps

  1. Run extract_business_location on the business.json file from the Yelp Dataset.
  2. Run split_file on the review.json file from the Yelp Dataset. This file is 4.7 million lines long, and the following steps do not finish on a file of that size.
  3. Run extract_review_text on the output of split_file from step 2, using the output of extract_business_location from step 1 as the business file.
  4. Run merge on the output of extract_review_text from step 3.
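
For readers without the scripts at hand, steps 2 and 3 can be approximated as below. This is a hedged sketch, not the repository's implementation: the function bodies, the chunk size, the chunk file naming, and the `business_area` mapping (business ID to area name) are all assumptions. It relies only on review.json holding one JSON object per line with `business_id` and `text` fields.

```python
import json
import os


def split_file(path, out_dir, lines_per_chunk=100_000):
    """Split a line-delimited file into chunks; return the chunk paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk lines.
                if out is not None:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, f"chunk_{len(chunk_paths):04d}.json")
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


def extract_review_text(chunk_path, business_area):
    """Yield (area, text) for reviews of businesses in business_area."""
    with open(chunk_path, encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            area = business_area.get(review["business_id"])
            if area is not None:
                yield area, review["text"]
```

Because each chunk is processed on its own, memory use stays bounded by the chunk size, and the per-chunk outputs can then be concatenated in the merge step.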
