Yelp Dataset Challenge Round 10

Kekoa Riggin - University of Washington - MSc Computational Linguistics

Project

Build a statistical language model with Yelp reviews by metropolitan area.
Generate the top 100 sentences from each language model.
Report on the language model and the linguistic features of Yelp reviews in general and by metropolitan area.
Split the extracted Yelp reviews 90/10 into training and test data.
Use reviews from other metropolitan areas as test data to get cross-region perplexity scores.
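The split and the perplexity comparison described above can be sketched as follows. This is a minimal illustration, not the code in this repository: it assumes whitespace tokenization and an add-one smoothed bigram model, and the function names `split_train_test`, `train_bigram`, and `perplexity` are hypothetical.

```python
import math
import random
from collections import Counter


def split_train_test(reviews, test_fraction=0.1, seed=0):
    """Shuffle the reviews and hold out test_fraction of them as test data."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentences, unigrams, bigrams):
    """Perplexity of non-empty test data under an add-one smoothed bigram model."""
    vocab_size = len(unigrams) + 1  # assumption: one extra slot for unseen words
    log_prob, n_predictions = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log2(p)
            n_predictions += 1
    return 2 ** (-log_prob / n_predictions)
```

A model trained on one area's reviews should give lower perplexity on held-out reviews from the same area than on reviews from a different area, which is the comparison the project reports.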

Relevant data extraction (Businesses)
- Verify and correct regional data for businesses.
- Generate plots of all data and of accepted data.
Relevant data extraction (Review Text)
Create model from extracted text
Generate from model
Analyze generation
Generate on region
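As a rough illustration of the "create model" and "generate from model" steps above, the sketch below builds bigram successor counts and samples a sentence from them. It assumes whitespace tokenization and a plain bigram model; `train_successors` and `generate_sentence` are hypothetical names, and the repository's lang_model may work differently. How the "top 100" sentences are ranked is not specified here, so ranking is left out.

```python
import random
from collections import Counter, defaultdict


def train_successors(sentences):
    """Map each token to a Counter of the tokens that follow it."""
    successors = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            successors[prev][word] += 1
    return successors


def generate_sentence(successors, rng, max_len=30):
    """Sample words in proportion to bigram counts until </s> or max_len."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        options = successors[prev]
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)
```

Passing a seeded `random.Random` instance as `rng` keeps the generated sentences reproducible between runs.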

Additional Steps

1. Run extract_business_location on the business.json file from the Yelp Dataset.
2. Run split_file on the review.json file from the Yelp Dataset (this file is 4.7 million lines long, and the following steps will not terminate on a file that long).
3. Run extract_review_text on the output of split_file from step 2, using the output of extract_business_location from step 1 as the business file.
4. Run merge on the output of extract_review_text from step 3.
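The data flow of these steps can be sketched in a few functions. This is an illustrative outline, not the repository's scripts: the function names and the `business_to_area` mapping are hypothetical, and it assumes each line of review.json is one JSON object with `business_id` and `text` fields.

```python
import json
from itertools import islice


def split_lines(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def extract_review_text(review_lines, business_to_area):
    """Return (area, text) pairs for reviews of businesses with a known area."""
    pairs = []
    for line in review_lines:
        review = json.loads(line)
        area = business_to_area.get(review["business_id"])
        if area is not None:
            pairs.append((area, review["text"]))
    return pairs


def merge(chunks_of_pairs):
    """Group the review texts from all chunks by metropolitan area."""
    merged = {}
    for pairs in chunks_of_pairs:
        for area, text in pairs:
            merged.setdefault(area, []).append(text)
    return merged
```

Processing the review file in chunks keeps each extraction pass small, and the final merge restores one collection of review text per area.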
