Kekoa Riggin - University of Washington - MSc Computational Linguistics
- Build a statistical language model with Yelp reviews by metropolitan area.
- Generate the top 100 sentences from each language model.
- Report on the language model and the linguistic features of Yelp reviews in general and by metropolitan area.
- Split the extracted Yelp reviews 90/10 into training and test data.
- Use reviews from different areas as test data to get perplexity scores.
- Relevant data extraction (Businesses).
- Verify and correct regional data for businesses.
- Generate plot charts of all data and accepted data.
- Relevant data extraction (Review Text).
- Create a model from the extracted text.
- Generate from the model.
- Analyze the generated text.
- Generate on region.
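
The evaluation idea above (a 90/10 split, then perplexity on held-out or out-of-region reviews) can be sketched as follows. This is a minimal illustration, not the project's actual model: the bigram order, add-one smoothing, whitespace tokenization, and all names (`split_train_test`, `BigramModel`) are assumptions made for the example.

```python
import math
import random
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed convention)


def split_train_test(reviews, test_fraction=0.1, seed=0):
    """Shuffle the reviews and split them into (train, test) lists."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[: len(shuffled) - n_test], shuffled[len(shuffled) - n_test:]


def tokenize(text):
    """Lowercase, split on whitespace, and add boundary markers."""
    return [BOS] + text.lower().split() + [EOS]


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        predicted = set()
        for sentence in sentences:
            tokens = tokenize(sentence)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
            predicted.update(tokens[1:])
        # One extra slot stands in for every unseen word.
        self.vocab_size = len(predicted) + 1

    def prob(self, prev, word):
        numerator = self.bigram_counts[(prev, word)] + 1
        return numerator / (self.context_counts[prev] + self.vocab_size)

    def perplexity(self, sentences):
        """Per-token perplexity of the model on the given sentences."""
        log_sum, n_tokens = 0.0, 0
        for sentence in sentences:
            tokens = tokenize(sentence)
            for prev, word in zip(tokens, tokens[1:]):
                log_sum += math.log(self.prob(prev, word))
                n_tokens += 1
        return math.exp(-log_sum / n_tokens)
```

Training one such model per metropolitan area and calling `perplexity` with another area's test reviews gives the cross-region scores described above; lower perplexity means the test text looks more like the training text.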
Additional Steps
- Relevant data extraction (Tips)
Area | Lat | Long |
---|---|---|
Champaign | 40.116420 | -88.243383 |
Charlotte | 35.227087 | -80.843127 |
Cleveland | 41.499320 | -81.694361 |
Las_Vegas | 36.169941 | -115.13983 |
Madison | 43.073052 | -89.40123 |
Phoenix | 33.448377 | -112.074037 |
Pittsburgh | 40.440625 | -79.995886 |
Montreal | 45.501689 | -73.567256 |
Toronto | 43.653226 | -79.383184 |
Buenos_Aires | -34.603684 | -58.381559 |
Stuttgart | 48.7758 | 9.1829 |
Inverness | 57.477772 | -4.224721 |
Edinburgh | 55.953251 | -3.188267 |
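
One way to verify a business's region against the coordinates in the table is to assign it to the nearest listed center by great-circle distance. The sketch below assumes that approach; the function names and the `max_km` cutoff are hypothetical and are not taken from the project's scripts.

```python
import math

# Metro area centers copied from the table above: (latitude, longitude).
METRO_CENTERS = {
    "Champaign": (40.116420, -88.243383),
    "Charlotte": (35.227087, -80.843127),
    "Cleveland": (41.499320, -81.694361),
    "Las_Vegas": (36.169941, -115.13983),
    "Madison": (43.073052, -89.40123),
    "Phoenix": (33.448377, -112.074037),
    "Pittsburgh": (40.440625, -79.995886),
    "Montreal": (45.501689, -73.567256),
    "Toronto": (43.653226, -79.383184),
    "Buenos_Aires": (-34.603684, -58.381559),
    "Stuttgart": (48.7758, 9.1829),
    "Inverness": (57.477772, -4.224721),
    "Edinburgh": (55.953251, -3.188267),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_area(lat, lon, max_km=100.0):
    """Return the closest metro area, or None if it is farther than max_km.

    max_km is an assumed acceptance threshold, not a value from the project.
    """
    area, distance = min(
        ((name, haversine_km(lat, lon, c_lat, c_lon))
         for name, (c_lat, c_lon) in METRO_CENTERS.items()),
        key=lambda pair: pair[1],
    )
    return area if distance <= max_km else None
```

A business whose recorded city disagrees with `nearest_area` could then be corrected, and one for which the function returns `None` could be left out of the accepted data.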
1. Run `extract_business_location` on the `business.json` file from the Yelp Dataset.
2. Run `split_file` on the `review.json` file from the Yelp Dataset. This file is 4.7 million lines long, and the following steps will not terminate on a file that long.
3. Run `extract_review_text` on the output of `split_file` from step 2, using the output of `extract_business_location` from step 1 as the business file.
4. Run `merge` on the output of `extract_review_text` from step 3.
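
For readers who want to see what splitting a large line-delimited file involves, here is a minimal sketch. It is not the project's `split_file` script: the function name `split_into_chunks`, the chunk size, and the output naming scheme are all assumptions for illustration.

```python
import itertools
from pathlib import Path


def split_into_chunks(path, lines_per_chunk=500_000, out_dir=None):
    """Split a line-delimited file into numbered chunk files.

    Returns the list of chunk paths in order. The file is read lazily,
    so only one chunk is held in memory at a time.
    """
    src = Path(path)
    target = Path(out_dir) if out_dir else src.parent
    target.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    with src.open("r", encoding="utf-8") as handle:
        for index in itertools.count():
            lines = list(itertools.islice(handle, lines_per_chunk))
            if not lines:
                break
            # Hypothetical naming scheme: review_000.json, review_001.json, ...
            chunk = target / f"{src.stem}_{index:03d}{src.suffix}"
            chunk.write_text("".join(lines), encoding="utf-8")
            chunk_paths.append(chunk)
    return chunk_paths
```

Because each line of the Yelp review file is a self-contained JSON record, cutting on line boundaries keeps every chunk independently parseable.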