# Joining Datasets

This notebook is for creating dataframes of text and non-text reviews that match up with scraped data from Beer Advocate.

The scraped data covers ALL beers and breweries from the US; this is the basis on which I will join the reviews from datasets acquired from https://data.world.

First, I will load the dependencies for manipulating data and interacting with the database that I have created.

In [227]:
import pandas as pd
import os
import re
from sqlalchemy import create_engine

%matplotlib inline

Loading in the datasets

In [228]:
beer_reviews = pd.read_csv("../datasets/socialmediadata-beeradvocate/data/beer_reviews.csv")
beer_text_reviews = pd.read_csv("../datasets/petergensler-beer-advocate-reviews/BeerAdvocate-000.csv")

In [229]:
beer_reviews.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [230]:
pg_pass = os.environ['PG_PASS']
engine = create_engine(f'postgresql://postgres:{pg_pass}@127.0.0.1:5432/craft_beer_development')

Running a test query

In [231]:
pd.read_sql("SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname='public' LIMIT 5", con=engine)

Unnamed: 0,tablename
0,spatial_ref_sys
1,SequelizeMeta
2,Breweries
3,Styles
4,Beers


I want to find the reviews whose breweries are present in my scraped dataset.

In [232]:
beer_review_breweries = beer_reviews[["brewery_name","brewery_id"]].drop_duplicates()

# https://stackoverflow.com/questions/29815129/pandas-dataframe-to-list-of-dictionaries
beer_review_breweries = beer_review_breweries.to_dict("records")

# https://stackoverflow.com/questions/283645/python-list-in-sql-query-as-parameter
beer_review_breweries_joined = ', '.join(["'" + str(br['brewery_name']).replace("'","''") + "'" for br in beer_review_breweries])
# These are brewery IDs (not beer IDs); they are used to rebuild each brewery's Beer Advocate profile link
beer_review_brewery_ids = [obj['brewery_id'] for obj in beer_review_breweries]
query = 'SELECT name as brewery_name, state, ba_link FROM "Breweries" WHERE name in (%s)' % beer_review_breweries_joined
links_from_ids = ', '.join([f"'https://www.beeradvocate.com/beer/profile/{bl}/'" for bl in beer_review_brewery_ids])
query += ' AND ba_link in (%s)' % links_from_ids
results = pd.read_sql(query, con=engine)
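The query above builds its `IN (...)` lists by string interpolation and escapes apostrophes by hand. As a rough sketch of the alternative, placeholders let the driver do the quoting. This example uses an in-memory SQLite database from the standard library as a stand-in for my PostgreSQL database, so the `?` placeholder style and the two sample rows are assumptions for illustration only; the placeholder syntax differs under SQLAlchemy and PostgreSQL.

```python
import sqlite3

# In-memory SQLite stand-in for the "Breweries" table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Breweries" (name TEXT, state TEXT, ba_link TEXT)')
conn.executemany(
    'INSERT INTO "Breweries" VALUES (?, ?, ?)',
    [
        ("Lily's Seafood & B.C.", "Michigan",
         "https://www.beeradvocate.com/beer/profile/3968/"),
        ("Arbor Brewing Company", "Michigan",
         "https://www.beeradvocate.com/beer/profile/1457/"),
    ],
)

names = ["Lily's Seafood & B.C.", "Some Other Brewery"]

# One "?" placeholder per value; the driver handles quoting, so the
# apostrophe in "Lily's" needs no manual escaping.
placeholders = ", ".join("?" for _ in names)
query = f'SELECT name, state, ba_link FROM "Breweries" WHERE name IN ({placeholders})'
rows = conn.execute(query, names).fetchall()
```

Only the placeholder characters are interpolated into the SQL text; the values themselves are passed separately as parameters.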

In [233]:
results.head()

Unnamed: 0,brewery_name,state,ba_link
0,Arbor Brewing Company,Michigan,https://www.beeradvocate.com/beer/profile/1457/
1,Lake Superior Brewing,Michigan,https://www.beeradvocate.com/beer/profile/3146/
2,Leelanau Brewing Company,Michigan,https://www.beeradvocate.com/beer/profile/11974/
3,Liberty Street Brewing Company,Michigan,https://www.beeradvocate.com/beer/profile/19163/
4,Lily's Seafood & B.C.,Michigan,https://www.beeradvocate.com/beer/profile/3968/


In [234]:
results.shape

(1459, 3)

I want to convert the links into IDs and rename the "ba_link" column to "brewery_id".

In [235]:
results.ba_link = results.ba_link.map(lambda x: int(re.search(r'[0-9]+', x).group()))

In [236]:
results = results.rename(columns={"ba_link": "brewery_id"})

In [237]:
results.head()

Unnamed: 0,brewery_name,state,brewery_id
0,Arbor Brewing Company,Michigan,1457
1,Lake Superior Brewing,Michigan,3146
2,Leelanau Brewing Company,Michigan,11974
3,Liberty Street Brewing Company,Michigan,19163
4,Lily's Seafood & B.C.,Michigan,3968
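The pattern `[0-9]+` grabs the first run of digits in the link, which works here because nothing numeric comes before the ID. A slightly stricter sketch, using a hypothetical helper named `brewery_id_from_link`, ties the match to the `/profile/` segment and returns `None` instead of raising when a link has no ID:

```python
import re


def brewery_id_from_link(link):
    """Return the numeric ID from a Beer Advocate profile link, or None."""
    # Capture the digits that directly follow "/profile/".
    match = re.search(r"/profile/(\d+)/?", link)
    return int(match.group(1)) if match else None


example_id = brewery_id_from_link("https://www.beeradvocate.com/beer/profile/1457/")
missing_id = brewery_id_from_link("https://www.beeradvocate.com/")
```

A helper like this could be passed to `results.ba_link.map(...)` in place of the lambda.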


Now I can select all of the **non-text** reviews whose breweries match the breweries in my database.

In [238]:
matched_beer_reviews = beer_reviews.where(beer_reviews.brewery_id.isin(results.brewery_id.tolist())).dropna()

In [239]:
matched_beer_reviews.isnull().sum()

brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
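One thing to keep in mind about the `.where(...).dropna()` pattern: it turns non-matching rows into NaN and then drops every row containing a NaN, so a matching review with a missing value of its own is dropped too, and integer columns become floats. The sketch below compares it with a plain boolean mask on a toy frame (the rows and the `known_ids` list are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the review data; values are made up for illustration.
reviews = pd.DataFrame({
    "brewery_id": [1, 2, 2, 3],
    "beer_abv": [5.0, None, 6.5, 7.0],
})
known_ids = [2, 3]

# Pattern used above: non-matching rows become all-NaN, then dropna() removes
# them along with any matching row that has a missing value of its own.
via_where = reviews.where(reviews.brewery_id.isin(known_ids)).dropna()

# Boolean mask: keeps every matching row and leaves the dtypes unchanged.
via_mask = reviews[reviews.brewery_id.isin(known_ids)]
```

Which behavior is preferable depends on whether reviews with missing fields should be kept; the mask version leaves that decision to a later, explicit `dropna()`.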

## Beer Text Reviews

In [221]:
beer_text_reviews.head()

Unnamed: 0,beer_ABV,beer_beerId,beer_brewerId,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,review_profileName,review_taste,review_text,review_time
0,5.0,47986,10325,Sausa Weizen,Hefeweizen,2.5,2.0,1.5,1.5,stcules,1.5,A lot of foam. But a lot. In the smell some ba...,1234817823
1,6.2,48213,10325,Red Moon,English Strong Ale,3.0,2.5,3.0,3.0,stcules,3.0,"Dark red color, light beige foam, average. In ...",1235915097
2,6.5,48215,10325,Black Horse Black Beer,Foreign / Export Stout,3.0,2.5,3.0,3.0,stcules,3.0,"Almost totally black. Beige foam, quite compac...",1235916604
3,5.0,47969,10325,Sausa Pils,German Pilsener,3.5,3.0,3.0,2.5,stcules,3.0,"Golden yellow color. White, compact foam, quit...",1234725145
4,7.7,64883,1075,Cauldron DIPA,American Double / Imperial IPA,4.0,4.5,4.0,4.0,johnmichaelsen,4.5,"According to the website, the style for the Ca...",1293735206


In [223]:
beer_text_reviews.tail()

Unnamed: 0,beer_ABV,beer_beerId,beer_brewerId,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,review_profileName,review_taste,review_text,review_time
528865,,4032,3340,Dinkel Acker Dark,Munich Dunkel Lager,4.0,3.0,4.0,3.5,orangemoustache,4.0,"A-pours a reddish amber that looks very nice,l...",1205212721
528866,,4032,3340,Dinkel Acker Dark,Munich Dunkel Lager,4.0,3.5,3.0,3.0,MisterStout,3.0,I don't really have anything special to say ab...,1203490783
528867,,4032,3340,Dinkel Acker Dark,Munich Dunkel Lager,4.0,4.0,4.5,4.0,meechum,4.5,Had this on tap at Vreny's Beirgarten A - Came...,1201320897
528868,,4032,3340,Dinkel Acker Dark,Munich Dunkel Lager,4.0,3.0,4.0,4.0,Dodo2step,4.5,"Purchased at Market Cross Pub in carlisle, PA....",1201215290
528869,,4032,3340,Dinkel Acker Dark,Munich Dunkel Lager,4.0,4.0,4.0,4.0,jenbys2001,4.0,"I ordered a mug of this beer at Schnitzelhaus,...",1200336367


In [246]:
beer_text_review_breweries = beer_text_reviews["beer_brewerId"].drop_duplicates().tolist()

links_from_ids = ', '.join([f"'https://www.beeradvocate.com/beer/profile/{bl}/'" for bl in beer_text_review_breweries])

query = 'SELECT name as brewery_name, state, ba_link FROM "Breweries" WHERE ba_link in (%s)' % links_from_ids

results = pd.read_sql(query, con=engine)
results.head()

Unnamed: 0,brewery_name,state,ba_link
0,Arbor Brewing Company,Michigan,https://www.beeradvocate.com/beer/profile/1457/
1,Arbor Brewing Company Microbrewery,Michigan,https://www.beeradvocate.com/beer/profile/14034/
2,Atwater Brewery,Michigan,https://www.beeradvocate.com/beer/profile/15280/
3,Rochester Mills Beer Co.,Michigan,https://www.beeradvocate.com/beer/profile/2346/
4,Round Barn Brewery,Michigan,https://www.beeradvocate.com/beer/profile/11882/


In [247]:
results.ba_link = results.ba_link.map(lambda x: int(re.search(r'[0-9]+', x).group()))

In [248]:
results = results.rename(columns={"ba_link": "brewery_id"})

In [254]:
matched_beer_text_reviews = beer_text_reviews.where(beer_text_reviews.beer_brewerId.isin(results.brewery_id.tolist())).dropna()

In [255]:
matched_beer_text_reviews.shape

(358073, 13)
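The filter above keeps the matching reviews but does not carry over the brewery name or state from the database. As a possible alternative, an inner merge does both in one step. This is only a sketch on toy frames: the column names follow the ones used above, the beer names are placeholders, and this is not the approach taken in the rest of the notebook.

```python
import pandas as pd

# Toy frames; column names follow the notebook, beer names are placeholders.
text_reviews = pd.DataFrame({
    "beer_brewerId": [1457, 1457, 10325],
    "beer_name": ["Beer A", "Beer B", "Beer C"],
})
breweries = pd.DataFrame({
    "brewery_name": ["Arbor Brewing Company"],
    "state": ["Michigan"],
    "brewery_id": [1457],
})

# An inner join keeps only reviews whose brewery ID appears in `breweries`
# and attaches the brewery's name and state to each kept row.
merged = text_reviews.merge(
    breweries, left_on="beer_brewerId", right_on="brewery_id", how="inner"
)
```

Unlike `.where(...).dropna()`, the merge does not drop rows for having missing values in unrelated columns.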

In [257]:
matched_beer_text_reviews.head()

Unnamed: 0,beer_ABV,beer_beerId,beer_brewerId,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,review_profileName,review_taste,review_text,review_time
4,7.7,64883.0,1075.0,Cauldron DIPA,American Double / Imperial IPA,4.0,4.5,4.0,4.0,johnmichaelsen,4.5,"According to the website, the style for the Ca...",1293735000.0
5,4.7,52159.0,1075.0,Caldera Ginger Beer,Herbed / Spiced Beer,3.5,3.5,3.0,3.0,oline73,3.5,Poured from the bottle into a Chimay goblet. A...,1325525000.0
6,4.7,52159.0,1075.0,Caldera Ginger Beer,Herbed / Spiced Beer,3.5,3.5,3.5,4.0,Reidrover,4.0,"22 oz bottle from ""Lifesource"" Salem. $3.95 Ni...",1318991000.0
7,4.7,52159.0,1075.0,Caldera Ginger Beer,Herbed / Spiced Beer,3.5,2.5,3.0,2.0,alpinebryant,3.5,"Bottle says ""Malt beverage brewed with Ginger ...",1306276000.0
8,4.7,52159.0,1075.0,Caldera Ginger Beer,Herbed / Spiced Beer,3.5,3.0,4.0,3.5,LordAdmNelson,4.0,I'm not sure why I picked this up... I like gi...,1290455000.0


In [258]:
matched_beer_text_reviews.tail()

Unnamed: 0,beer_ABV,beer_beerId,beer_brewerId,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,review_profileName,review_taste,review_text,review_time
528749,5.5,38275.0,11492.0,Alaskan Summer Ale,American Blonde Ale,3.5,3.0,4.0,3.5,ThirstyHopHead,3.5,A: Poured a straw yellow color with a 1 finger...,1224719000.0
528750,5.5,38275.0,11492.0,Alaskan Summer Ale,American Blonde Ale,4.0,3.5,3.5,3.5,RedDiamond,3.5,A relaxing summer ale with a soothing aroma of...,1187584000.0
528758,4.5,24849.0,11492.0,Cream Ale,Cream Ale,3.0,3.0,3.0,3.0,Bookseeb,2.5,Appearance is a light golden with a thin head....,1196475000.0
528759,4.5,24849.0,11492.0,Cream Ale,Cream Ale,3.0,3.0,3.0,3.5,RedDiamond,3.5,"Cream ales are gentle beers. Even so, this one...",1187590000.0
528760,4.5,24849.0,11492.0,Cream Ale,Cream Ale,3.5,4.0,4.0,4.0,canucklehead,4.0,This is a really pale cream ale but the beer i...,1121741000.0
