# This notebook joins all datasets together into a consolidated review database

## Prior to running this script:
### - Master brewery list should be obtained by scraping the Brewers Association website.

> Membership fees are required for full access; however, the version used for this study is a static, “manually” scraped dataset. The website structure does not lend itself to automated scraping.


### - Google Places and Yelp Fusion notebooks must be executed. The outputs of these notebooks are the input files for this notebook.

### The DSCI 511 Brewery Team:
Wynton Britton

Russell Destremps

Hao Deng

Evan Falkowski

## Import libraries and set drive

In [None]:
# import required libraries 
import pandas as pd
from pandas import DataFrame


In [None]:
# mount drive ***Note: for functions created below, you will have to change the drive mapping within each function***
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
#set path
path = '/content/drive/MyDrive/Python/DSCI_511/Project/team_project/'

## Bring in all of the individual databases

*   Master Brewery List
*   Google Places Rating Data
*   Yelp Review/Rating Data (x4)

### Master Brewery List

In [None]:
master_list = pd.read_csv(path + 'brewery_master.csv')

In [None]:
master_list.columns

Index(['type', 'location_name', 'region', 'latitude', 'longitude'], dtype='object')

### Google Places Rating Data

In [None]:
google_data = pd.read_csv(path + 'Google_data/full_table.csv')

In [None]:
google_data.columns

Index(['Unnamed: 0', 'location_name', 'google_name', 'Location',
       'Average User Rating', 'Total User Rating'],
      dtype='object')

### Yelp Review/Rating Data

In [None]:
# combine the dataframes back together

yelp_1 = pd.read_csv(path + 'Yelp_data/copy1YelpData.csv') 
yelp_2 = pd.read_csv(path + 'Yelp_data/copy2YelpData.csv') 
yelp_3 = pd.read_csv(path + 'Yelp_data/copy3YelpData.csv') 
yelp_4 = pd.read_csv(path + 'Yelp_data/copy4YelpData.csv') 

yelp_data = pd.concat([yelp_1, yelp_2, yelp_3, yelp_4], axis=0)
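
When parts are stacked with `pd.concat`, each part keeps its own row labels by default, so labels repeat in the combined frame. The sketch below uses two small, hypothetical stand-in frames (not the actual Yelp files) to show how `ignore_index=True` produces one continuous index instead.

```python
import pandas as pd

# Toy stand-ins for the Yelp part files (hypothetical values).
part_1 = pd.DataFrame({"location_name": ["A", "B"], "yelp_ave": [4.0, 4.5]})
part_2 = pd.DataFrame({"location_name": ["C", "D"], "yelp_ave": [3.5, 5.0]})

# Default concat keeps each part's own row labels, so labels repeat.
stacked = pd.concat([part_1, part_2], axis=0)

# ignore_index=True relabels the rows 0..n-1 across the combined frame.
stacked_clean = pd.concat([part_1, part_2], axis=0, ignore_index=True)

print(stacked.index.tolist())        # [0, 1, 0, 1]
print(stacked_clean.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels do not affect the merges on `location_name` used in this notebook, but they can be surprising when selecting rows by label with `.loc`.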


In [None]:
yelp_data.columns

Index(['Unnamed: 0', 'type', 'location_name', 'region', 'latitude',
       'longitude', 'yelp_id', 'yelp_ave', 'yelp_reviews', 'Ratings',
       'Review_Text'],
      dtype='object')

## Join all datasets together

In [None]:
review_database = pd.merge(master_list, google_data[['location_name','Average User Rating', 'Total User Rating']], on = 'location_name', how = 'left')
review_database = pd.merge(review_database, yelp_data[['location_name','yelp_ave', 'yelp_reviews', 'Review_Text']], on = 'location_name', how = 'left')
review_database.head()

| | type | location_name | region | latitude | longitude | Average User Rating | Total User Rating | yelp_ave | yelp_reviews | Review_Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brewpub | 101 Brewery | WA | 47.822407 | -122.875356 | 4.4 | 93.0 | 4.0 | 90 | While we were disappointed that we did not get... |
| 1 | Brewpub | 122 West Brewing Co | WA | 48.762557 | -122.485773 | 4.6 | 62.0 | Na | Na | Na |
| 2 | Brewpub | 12Degree Brewing | CO | 39.978215 | -105.131876 | 4.7 | 231.0 | 4.5 | 168 | While I have been there in person and it was w... |
| 3 | Brewpub | 15 24 Brew House | KS | 39.376021 | -97.127491 | 4.6 | 130.0 | 4.5 | 7 | I haven't had micro brew like this place in a ... |
| 4 | Brewpub | 16 Stone Brewpub | NY | 43.241849 | -75.256302 | 4.6 | 71.0 | 4.0 | 4 | I'm torn.  Came here for Fathers Day and had s... |
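
A left merge keeps every row of the left table, but a key that repeats in the right table repeats the matching left row. As a minimal sketch on hypothetical toy frames (not the project data), the `validate` and `indicator` parameters of `pd.merge` can be used to check both the key uniqueness and which rows found no match.

```python
import pandas as pd

# Hypothetical miniature versions of the master list and a ratings table.
master = pd.DataFrame(
    {"location_name": ["A", "B", "C"], "region": ["WA", "CO", "NY"]}
)
google = pd.DataFrame(
    {"location_name": ["A", "B"], "Average User Rating": [4.4, 4.7]}
)

merged = pd.merge(
    master,
    google,
    on="location_name",
    how="left",
    validate="many_to_one",  # raises MergeError if a name repeats in `google`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows from the master list that found no match in the ratings table.
unmatched = merged.loc[merged["_merge"] == "left_only", "location_name"].tolist()
print(len(merged), unmatched)  # 3 ['C']
```

Whether `many_to_one` is the right expectation depends on the data: a table holding several reviews per brewery would legitimately repeat names, and the check would then need to be relaxed or dropped.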


In [None]:
# clean up the column headers
review_database.columns = ['Brewery_type', 'Brewery_name', 'State', 'Latitude', 'Longitude', 'Google_rating', 'Google_num_ratings', 'Yelp_rating', 'Yelp_num_ratings', 'Yelp_review_text']

In [None]:
# Inspect the head
review_database.head()

| | Brewery_type | Brewery_name | State | Latitude | Longitude | Google_rating | Google_num_ratings | Yelp_rating | Yelp_num_ratings | Yelp_review_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brewpub | 101 Brewery | WA | 47.822407 | -122.875356 | 4.4 | 93.0 | 4.0 | 90 | While we were disappointed that we did not get... |
| 1 | Brewpub | 122 West Brewing Co | WA | 48.762557 | -122.485773 | 4.6 | 62.0 | Na | Na | Na |
| 2 | Brewpub | 12Degree Brewing | CO | 39.978215 | -105.131876 | 4.7 | 231.0 | 4.5 | 168 | While I have been there in person and it was w... |
| 3 | Brewpub | 15 24 Brew House | KS | 39.376021 | -97.127491 | 4.6 | 130.0 | 4.5 | 7 | I haven't had micro brew like this place in a ... |
| 4 | Brewpub | 16 Stone Brewpub | NY | 43.241849 | -75.256302 | 4.6 | 71.0 | 4.0 | 4 | I'm torn.  Came here for Fathers Day and had s... |


### Explore database to determine number of NAs from each source and create a complete database

In [None]:
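# Replace the "Na" placeholder strings with 'nan' so the columns can be cast to float
# (assign the result back; inplace replace on a selected column may not update the frame in newer pandas)
review_database["Yelp_rating"] = review_database["Yelp_rating"].replace({"Na": 'nan'})
review_database["Yelp_num_ratings"] = review_database["Yelp_num_ratings"].replace({"Na": 'nan'})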


In [None]:
# And then convert column to float for Yelp data
review_database['Yelp_rating'] = review_database['Yelp_rating'].astype(float)
review_database['Yelp_num_ratings'] = review_database['Yelp_num_ratings'].astype(float)
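
As an alternative way to reach the same result, `pd.to_numeric` with `errors="coerce"` converts a column to numbers and turns anything it cannot parse into `NaN` in one step. This is a minimal sketch on a hypothetical toy series, not the notebook's own method.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with the "Na" placeholder.
ratings = pd.Series(["4.0", "Na", "4.5"])

# errors="coerce" maps every unparseable value to NaN instead of raising.
as_float = pd.to_numeric(ratings, errors="coerce")

print(as_float.dtype)         # float64
print(as_float.isna().sum())  # 1
```

One trade-off: coercion silently converts every unparseable value, not only `"Na"`, so an unexpected placeholder would also become `NaN` without any warning.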

In [None]:
# confirm how many NAs resulted
print("Number of null values Google Data : " +
      str(review_database['Google_rating'].isnull().sum()))
print("Number of null values in Yelp Data : " +
      str(review_database['Yelp_rating'].isnull().sum()))

Number of null values Google Data : 1718
Number of null values in Yelp Data : 1666


In [None]:
# Create a dataframe that only has complete ratings/reviews from both sources
# (dropna() returns a new frame; .copy() keeps review_database itself unchanged)
review_database2 = review_database.dropna().copy()
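
The difference between binding a second name to a DataFrame and creating an independent copy matters for a step like this one. A minimal sketch with a hypothetical two-row frame shows that plain assignment yields the same object, while `dropna()` followed by `.copy()` yields a separate frame and leaves the original intact.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one incomplete row.
original = pd.DataFrame(
    {"Google_rating": [4.4, np.nan], "Yelp_rating": [4.0, 4.5]}
)

# Plain assignment does not copy: both names refer to the same object,
# so an in-place operation through either name changes "both".
alias = original
print(alias is original)  # True

# dropna() returns a new frame; .copy() makes the independence explicit.
complete = original.dropna().copy()
print(len(original), len(complete))  # 2 1
```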

### In the end, we have 5182 complete brewery rating/review records

In [None]:
review_database2['Overall_rating'] = (review_database2['Google_rating']+review_database2['Yelp_rating'])/2
review_database2['Overall_num_ratings'] = review_database2['Google_num_ratings'] + review_database2['Yelp_num_ratings']
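
To make the arithmetic concrete, the sketch below applies the same two formulas to two rows taken from the `head()` output earlier in this notebook. The `Weighted_rating` column is a hypothetical alternative added only for comparison; it is not part of this notebook's method.

```python
import pandas as pd

# Two rows taken from the head() output shown earlier in this notebook.
sample = pd.DataFrame({
    "Brewery_name": ["101 Brewery", "12Degree Brewing"],
    "Google_rating": [4.4, 4.7],
    "Google_num_ratings": [93.0, 231.0],
    "Yelp_rating": [4.0, 4.5],
    "Yelp_num_ratings": [90.0, 168.0],
})

# The notebook's formulas: unweighted mean of the two ratings, summed counts.
sample["Overall_rating"] = (sample["Google_rating"] + sample["Yelp_rating"]) / 2
sample["Overall_num_ratings"] = (
    sample["Google_num_ratings"] + sample["Yelp_num_ratings"]
)

# Hypothetical alternative (an assumption, not used in this notebook):
# weight each source's rating by how many ratings it is based on.
sample["Weighted_rating"] = (
    sample["Google_rating"] * sample["Google_num_ratings"]
    + sample["Yelp_rating"] * sample["Yelp_num_ratings"]
) / sample["Overall_num_ratings"]

print(sample[["Brewery_name", "Overall_rating", "Weighted_rating"]].round(3))
```

The unweighted mean treats both sources equally regardless of volume; a count-weighted mean leans toward whichever source has more ratings. Which is preferable depends on the analysis goal.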

In [None]:
review_database2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5182 entries, 0 to 8457
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brewery_type         5182 non-null   object 
 1   Brewery_name         5182 non-null   object 
 2   State                5182 non-null   object 
 3   Latitude             5182 non-null   float64
 4   Longitude            5182 non-null   float64
 5   Google_rating        5182 non-null   float64
 6   Google_num_ratings   5182 non-null   float64
 7   Yelp_rating          5182 non-null   float64
 8   Yelp_num_ratings     5182 non-null   float64
 9   Yelp_review_text     5182 non-null   object 
 10  Overall_rating       5182 non-null   float64
 11  Overall_num_ratings  5182 non-null   float64
dtypes: float64(8), object(4)
memory usage: 526.3+ KB


In [None]:
# write to file
review_database2.to_csv(path + 'complete_database.csv')
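
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again (as seen in the Google and Yelp input files above). The sketch below uses an in-memory buffer and a hypothetical one-row frame to show the difference that `index=False` makes.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for the review database.
frame = pd.DataFrame({"Brewery_name": ["101 Brewery"], "Overall_rating": [4.2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Brewery_name', 'Overall_rating']
print(cols_without_index)  # ['Brewery_name', 'Overall_rating']
```

Whether to drop the index depends on what downstream notebooks expect; if they already account for the extra column, changing the output format would require updating them as well.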