Yelp Data Mining
Data source: Yelp reviews
APIs: Yelp Fusion API, Google Cloud Natural Language API
Goals and Outline:
The goal of our project is to apply clustering methods to answer the following questions, which we think are worth exploring:
1. Which neighborhoods in a city have the best cuisine selection according to the ratings in the Yelp dataset?
2. Do some neighborhoods in a city offer more expansive cuisines than others?
3. Which kinds of keywords are most likely to be mentioned in a restaurant's reviews?
If a restaurant wants to improve its reviews, what should it focus on?
This project report uses reviews about "Pizza" in Boston as an example. Four main components were built
to answer the above questions.
1. Plot the locations of all restaurants serving pizza in Boston on Google Maps using the Google API.
2. Geography-based clustering.
Use the K-means and GMM clustering algorithms to plot clusters of restaurants. The plots show the rating and price distributions of all the restaurants.
The Silhouette Score of each clustering is calculated.
3. Classify the reviews via sentiment analysis.
Use the Google Cloud Natural Language API to divide the reviews into positive and
negative groups.
4. Find and plot the relationship between topics and keywords.
Use LDA (Latent Dirichlet Allocation) to process the corpus, and visualize the result with
pyLDAvis and word clouds.
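
The clustering and Silhouette Score step can be illustrated with a small sketch. This is not the project's code: the project presumably relies on library implementations (sklearn is listed in the requirements), while the version below is dependency-free, covers K-means only (GMM is not shown), and uses made-up coordinates. The function names and sample points are assumptions for illustration.

```python
import math


def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def silhouette_score(points, labels):
    """Mean over all points of (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b is the mean distance to the
    nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        distances = {}
        for j, q in enumerate(points):
            if i != j:
                distances.setdefault(labels[j], []).append(math.dist(p, q))
        own = distances.pop(labels[i], None)
        if not own or not distances:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in distances.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical (latitude, longitude) pairs; Euclidean distance on raw
# coordinates is only a rough approximation of real distance.
points = [
    (42.360, -71.060), (42.361, -71.058), (42.359, -71.061),
    (42.340, -71.100), (42.341, -71.102), (42.339, -71.099),
]
labels, centroids = kmeans(points, [points[0], points[3]])
score = silhouette_score(points, labels)
print(labels, round(score, 3))
```

A score close to 1 means the clusters are compact and well separated; values near 0 suggest overlapping clusters.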
Environment requirements:
- Download and install Anaconda, then install Jupyter via Anaconda.
- Install the following packages:
lda
gensim
pyLDAvis
wordcloud
bokeh
sklearn
TextBlob
google.cloud
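
Before running the notebooks, it can help to check that the packages above are importable. The snippet below is a minimal sketch using only the standard library. The import names are assumptions where they differ from the names in the list (for example, TextBlob is imported as "textblob"), and missing_modules is a hypothetical helper, not part of this project.

```python
import importlib.util

# Import names assumed for the packages listed above.
REQUIRED_MODULES = [
    "lda", "gensim", "pyLDAvis", "wordcloud",
    "bokeh", "sklearn", "textblob", "google.cloud",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent
            # or a module has no usable spec.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```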
Running instructions:
To run the whole application:
1. Run main.ipynb in the root folder.
This will show
a) the plot of all the restaurants on Google Maps
b) the K-means and GMM clustering results for rating and price, 4 plots in total
c) the reviews crawled and saved to LDA/reviews.csv
d) keywords via word clouds.
2. Run sentiment_analysis.ipynb. This will show
a) the result of the sentiment analysis of positive and negative reviews and the result of the review processing.
3. In a terminal, go to the project directory and cd into LDA (cd [path]/LDA),
then run the command: python LDA.py
This will show
a) the topics extracted from the corpus, with the most frequent keywords on the right.
When you click on a topic, the right panel shows how often each keyword occurs.
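
The positive/negative split produced by sentiment_analysis.ipynb could look roughly like the sketch below. This is an illustration under stated assumptions, not the notebook's actual code: the threshold of 0.0, the helper names, and the toy scorer are all hypothetical. The google_sentiment_score function shows how a score might be obtained from the Google Cloud Natural Language API; it needs the google-cloud-language package and valid credentials, so the demo at the bottom uses a stand-in scorer instead.

```python
def google_sentiment_score(text):
    """Score one review with the Cloud Natural Language API.
    The score ranges from -1.0 (negative) to 1.0 (positive)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_sentiment(request={"document": document})
    return response.document_sentiment.score


def split_by_sentiment(reviews, score_fn, threshold=0.0):
    """Split reviews into (positive, negative) lists.
    The threshold is an assumption; tune it for your data."""
    positive, negative = [], []
    for review in reviews:
        if score_fn(review) >= threshold:
            positive.append(review)
        else:
            negative.append(review)
    return positive, negative


def toy_score(text):
    """Stand-in scorer for offline testing: counts a few hand-picked words."""
    good = {"great", "friendly", "delicious"}
    bad = {"cold", "slow", "rude"}
    words = text.lower().split()
    return sum(w in good for w in words) - sum(w in bad for w in words)


# Made-up reviews for illustration only.
reviews = ["Great crust and friendly staff", "Cold pizza and slow service"]
positive, negative = split_by_sentiment(reviews, toy_score)
print(positive, negative)
```

To use the real API, pass google_sentiment_score as score_fn in place of toy_score.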