CSE6242 Final Project

Description

Our project analyzes the Yelp Open Dataset (https://www.yelp.com/dataset), which contains businesses, reviews, tips, and check-ins, to determine the supply and demand of individual businesses and to infer future business opportunities. The analysis uses SQL, Python, and Tableau for data analysis, machine learning, NLP, and visualization.

Running Our Code

Loading the Data into SQL and Filtering to Only Philadelphia Businesses

  1. Download the Yelp dataset from the website above. It comes as two archive files, yelp_dataset.tar and yelp_photos.tar. Extract the archives into the ./data folder to obtain the JSON files listed below (a Python sketch for this step follows the file list):

    • ./data/yelp_academic_dataset_business.json
    • ./data/yelp_academic_dataset_checkin.json
    • ./data/yelp_academic_dataset_review.json
    • ./data/yelp_academic_dataset_tip.json
    • ./data/yelp_academic_dataset_user.json
    • ./data/photos.json
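
If you prefer to script the extraction, a minimal Python sketch (assuming the two .tar files were downloaded next to the ./data folder) could look like this:

import tarfile
from pathlib import Path

# Extract both Yelp archives into ./data so the JSON files listed above appear there.
data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)
for archive in ("yelp_dataset.tar", "yelp_photos.tar"):
    with tarfile.open(archive) as tar:
        tar.extractall(path=data_dir)
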
  2. Download and install PostgreSQL, create a username and password, then create a database called yelp and, under it, a schema called yelp_ds.
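
One way to script this step is sketched below (it requires the psycopg2-binary driver installed in step 4). It connects to the default postgres database with autocommit enabled, because CREATE DATABASE cannot run inside a transaction, and then creates the schema. The user, host, and port are the same assumptions used in the connection snippet later in this README; adjust them to your setup.

import psycopg2

# Connect to the default database first; CREATE DATABASE must run outside a transaction.
conn = psycopg2.connect(database="postgres", user="vorlov", host="localhost", port=5432)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE yelp")
conn.close()

# Now connect to the new database and create the yelp_ds schema.
with psycopg2.connect(database="yelp", user="vorlov", host="localhost", port=5432) as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE SCHEMA IF NOT EXISTS yelp_ds")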

  3. Run the SQL file create_yelp_ds.sql in the data folder to create the database schema.
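
If you would rather run the project's .sql files from Python than from psql, a small helper like the hypothetical run_sql_file below (a sketch only, using the same assumed connection settings) works for this step and for the later SQL scripts as well:

import psycopg2

def run_sql_file(path, **conn_kwargs):
    # Read the whole script and execute it in a single transaction.
    with open(path) as f:
        sql = f.read()
    with psycopg2.connect(**conn_kwargs) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)

run_sql_file("./data/create_yelp_ds.sql",
             database="yelp", user="vorlov", host="localhost", port=5432,
             options="-c search_path=yelp_ds")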

  4. Install the most recent version of Python 3. Then set up a Python virtual environment using your favorite tool (conda, virtualenv, poetry, etc.) and install the PostgreSQL driver for Python:

$ pip3 install psycopg2-binary
  5. Run the following Python scripts to load the JSON data files into the PostgreSQL database. Note: make sure the JSON data files are stored in the same folder as these scripts.
    1. python3 ./1_load_businesses.py
    2. python3 ./1_load_users.py
    3. python3 ./2_load_checkins.py
    4. python3 ./2_load_photos.py
    5. python3 ./2_load_reviews.py
    6. python3 ./2_load_tips.py

Note: please make sure that the Python scripts listed above have the proper connectivity settings configured for the SQL database, as shown in the code snippet below. You may want to change the username and host, and add a password, if you run PostgreSQL on a different host:

with psycopg2.connect(database = "yelp", user = "vorlov", host= "localhost", port = 5432, options="-c search_path=yelp_ds") as conn:
  6. Once the files have been loaded into the database, you can proceed to the next sections to analyze the data.

  7. For the PostgreSQL database created above, please run the script ./data/4_prefiltered_data.sql. This will filter the data to include only Philadelphia data and create the additional tables listed below (a sketch for running the script and spot-checking the result follows the table list):

    • PHI_BUSINESS
    • PHI_CUSTOMER
    • PHI_REVIEW
    • PHI_TIP
    • PHI_CHECKIN
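
A minimal sketch for this step (the connection settings are the same assumptions as above):

import psycopg2

conn_kwargs = dict(database="yelp", user="vorlov", host="localhost", port=5432,
                   options="-c search_path=yelp_ds")

with psycopg2.connect(**conn_kwargs) as conn:
    with conn.cursor() as cur:
        # Create the PHI_* tables containing only Philadelphia data.
        cur.execute(open("./data/4_prefiltered_data.sql").read())
        # Spot-check: count the Philadelphia businesses that were kept.
        cur.execute("SELECT COUNT(*) FROM phi_business")
        print("Philadelphia businesses:", cur.fetchone()[0])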

Clustering to Find Business Categories

  1. From the preprocessing folder, run preprocess-categories.ipynb to produce philly_categories_v2.csv, which is the phi_business dataframe augmented with the business categories for each business. Note: we extracted the business categories into a SQL script, data/6_categories.sql, which is used later to load them into the database.
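
To sanity-check the notebook output, a minimal sketch (the CSV path is an assumption; point it at wherever the notebook wrote philly_categories_v2.csv) is:

import pandas as pd

# Inspect the businesses-with-categories table produced by preprocess-categories.ipynb.
philly_categories = pd.read_csv("./preprocessing/philly_categories_v2.csv")
print(philly_categories.shape)
print(philly_categories.head())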

Calculating Demand and Missing Demand Score

  1. Use the following SQL query to compute the total demand and missing demand scores by business:
data/5_compute_demand_scores.sql
  2. To roll up the total demand and missing demand scores, first load the business categories into the database using the following SQL script:
data/6_categories.sql
  3. Once the business categories are loaded into the database, use the following SQL query to compute the total demand and missing demand score roll-ups by zip code and business category:
data/7_compute_demand_scores_by_zip_bicategory.sql

The results of this SQL query can then be exported into CSV format for subsequent visualization in Tableau.
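One hedged way to do that export from Python (assuming the roll-up script ends with the SELECT that produces the result rows, using the same assumed connection settings as above; the output file name is arbitrary):

import csv
import psycopg2

with psycopg2.connect(database="yelp", user="vorlov", host="localhost", port=5432,
                      options="-c search_path=yelp_ds") as conn:
    with conn.cursor() as cur:
        # Run the roll-up script; its final SELECT is assumed to return the result rows.
        cur.execute(open("./data/7_compute_demand_scores_by_zip_bicategory.sql").read())
        header = [col[0] for col in cur.description]
        rows = cur.fetchall()

# Write the results to a CSV that Tableau can consume.
with open("demand_scores_by_zip_category.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)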

NLP

NOTE: all required libraries are listed in the requirements.txt file. Each package can simply be installed at its latest version:

  • matplotlib
  • seaborn
  • numpy
  • pandas
  • nltk
  • sklearn
  • wordcloud

NOTE: the notebooks are used for experimentation, score creation, and text processing, so run times may exceed 45 minutes on certain cells. The CSVs they create are also larger than 1 GB, so no CSVs are included in the repo due to size limitations.

  1. 8_combine_reviews_business_categories.sql in the data folder contains the SQL required to combine review information with business category and zip code. Due to the size of the result, it is recommended to download the CSV in 3-4 parts for further analysis.

  2. NLP-v1.ipynb contains the initial preprocessing for word tokenization and the first word clouds/EDA.

  3. Further experimentation and processing to obtain word frequencies as well as VADER sentiment scores was performed in NLP-v2.ipynb.
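
For reference, a minimal sketch of VADER scoring with NLTK (independent of the notebook's exact code; the example review text is made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# The VADER lexicon must be downloaded once before scoring.
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
review_text = "The cheesesteak was amazing but the wait was far too long."
# Returns neg/neu/pos proportions plus a normalized 'compound' score in [-1, 1].
print(sia.polarity_scores(review_text))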

  4. N-gram analysis was completed and the results were further processed for Tableau compatibility in NLP-v3.ipynb and NLP-v4.ipynb.
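
A minimal sketch of bigram counting with scikit-learn (the notebooks may use different parameters; the sample reviews are made up):

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "great happy hour specials and friendly staff",
    "happy hour was crowded but the staff was friendly",
]

# Count bigrams across the reviews, dropping common English stop words.
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
counts = vectorizer.fit_transform(reviews)

# Total frequency per bigram, highest first.
totals = counts.sum(axis=0).A1
for ngram, total in sorted(zip(vectorizer.get_feature_names_out(), totals),
                           key=lambda pair: -pair[1]):
    print(ngram, int(total))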

Experiment: Evaluating Our Demand and Missing Demand Scores

  1. From the experiment folder, run the SQL files download_phi_review.sql and Download_NO_review.sql (change the output directory to whichever folder you are saving the Jupyter notebook files to). This will create two CSV files, phi_review.csv and no_review.csv.

  2. Download all the files in the experiment folder (CSVs and Jupyter notebooks) into the same directory you chose in step 1.

  3. Run the Jupyter notebooks (Philly_Experiment.ipynb and NewOrleansExperiment.ipynb); they will display the results of the evaluation of our demand and missing demand scores.

Tableau Visualization

  1. We built a dashboard to visualize businesses' demand and other insights from the Yelp data. Use the following link to access the dashboard: https://public.tableau.com/app/profile/shivani.narahari/viz/YelpDemandDashboard/Dashboard1?publish=yes
