# Webscrapping Untappd

Previously, I worked with the open api from www.brewerydb.com that provided a free sandbox to grab data from. However, the data provided was extremely limited. One of the variables that I was most interested in were price nd rating. Price seemed the most difficult to find, but was able to find ratings on untappd. 

The website www.untappd.com had several hurdles to overcome in order to grab data. First of all, we needed to mask the browser we were using in python before getting a get request.

In [13]:
# import dependencies
import pandas as pd
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

# set url and mask headers
url = 'https://untappd.com/search?q=Murican+Pilsner'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# grab data from url and set headers, prep soup
response = requests.get(url, headers=headers)
soup = response.content

In [None]:
# import beer data grabbed from the brewerydb sandbox
df_beers = pd.read_csv('data/beers.csv')
df_beers.head()

This initial grab using the beer names from the brewerydb api pull resulted in many mismatched names and few results. The total number of results amounted to around 300. This was an extremely small dataset for what I wanted to be doing (machine learning). After looking into the issue further, I found that not only were many of the beers provided in brewerydb were not on the untappd webpage, but also that my results were limited because simply getting a request would only generate the page without a login. 

In [None]:
ids = []
ratings = []
errors = []
for i in tqdm(range(df_beers.shape[0])):
#     print(df_beers.loc[i,'display_name'])
    url = 'https://untappd.com/search?q='
    url = url + str(df_beers.loc[i,'display_name'])
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = response.content
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')
    for beer in beer_items:
        try:
            abv_raw = beer.findAll(class_='abv')[0].text.strip()
            abv_raw = abv_raw.strip('% ABV')
            abv = float(abv_raw)
        except Exception as e:
            print(e)
        if df_beers.loc[i,'abv'].astype(float) == abv:
            ids.append(df_beers.loc[i,'beer_id'])
            ratings.append(beer.findAll(class_='rating')[0].text.strip())
        elif df_beers.loc[i, 'name'].lower() == beer.findAll(class_="name")[0].find('a').text.strip().lower():
            ids.append(df_beers.loc[i,'beer_id'])
            ratings.append(beer.findAll(class_='rating')[0].text.strip())
        else:
            errors.append(df_beers.loc[i,'beer_id'])
        

Below, we utilize selenium to log in to the website. We iterate the requests by changing the beer names in the search request in hopes of matching with the beer names of the beers from the brewerydb data by its beer_id. The resulting dataset was around 1100 data points, which was much better than the 300 before. However, after running several ml models on this group, the results were lackluster. 
> machine learning portion can be found in AY_MachineLearning

In order to better predict the ratings of the beers, we either needed more data or more features to predict on.

In [6]:
from selenium import webdriver

In [7]:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://untappd.com/login')

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("ayang2012")
password.send_keys("mulciber1")

driver.find_element_by_css_selector(".button.yellow.submit-btn").find_element_by_tag_name('input').click()

In [None]:
ids2 = []
ratings2 = []
errors2 = []
df_untappd = pd.DataFrame()
untappd_columns = ['id','name', 'brewery', 'style', 'abv', 'ibu', 'rating']
for col in untappd_columns:
    df_untappd[col]=''

count=0    
for i in tqdm(range(df_beers.shape[0])):

    url = 'https://untappd.com/search?q='
    url = url + str(df_beers.loc[i,'display_name'])
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    driver.get(url)
    soup = driver.page_source
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')

    for beer in beer_items:

        df_untappd.set_value(count, 'id', df_beers.loc[i,'beer_id']) 
        df_untappd.set_value(count, 'name', beer.findAll(class_="name")[0].find('a').text.strip())
        df_untappd.set_value(count, 'brewery', beer.findAll(class_="brewery")[0].find('a').text.strip())
        df_untappd.set_value(count, 'style', beer.findAll(class_="style")[0].text.strip())
        df_untappd.set_value(count, 'abv', beer.findAll(class_="abv")[0].text.strip())
        df_untappd.set_value(count, 'ibu', beer.findAll(class_="ibu")[0].text.strip())
        df_untappd.set_value(count, 'rating', beer.findAll(class_="rating")[0].text.strip())
        
        count +=1

In [None]:
df_untappd.shape

In [None]:
df_untappd.tail()

In [None]:
df_untappd.to_csv('untappd_beer_ratings.csv')

In [None]:
df_beers[df_beers['beer_id'].apply(lambda x: x not in df_untappd['id'].unique())]

# Finding beers from best breweries

### first we need to get a list of breweries

In [23]:
from lxml import html

with open("Top Rated Breweries _ Untappd.html") as f:
    page = f.read()
tree = html.fromstring(page)
brewery = BeautifulSoup(page,'lxml')

In [24]:
# url2 = "https://untappd.com/brewery/top_rated?country_id=86"
# response = requests.get(url2, headers=headers)
# brewery = response.content
# brewery = BeautifulSoup(brewery, 'lxml')

In [26]:
breweries = brewery.findAll(class_='beer-details')
breweries = [(i.findAll('a')[0].get('href')) for i in breweries]
breweries

['https://untappd.com/PipsMeadery',
 'https://untappd.com/SchrammsMead',
 'https://untappd.com/3SonsBrewingCo',
 'https://untappd.com/SideProject',
 'https://untappd.com/ColesRoadBrewery',
 'https://untappd.com/GaragisteMeadery',
 'https://untappd.com/treehousebrewco',
 'https://untappd.com/w/the-alchemist/1244',
 'https://untappd.com/w/brasserie-cantillon/202',
 'https://untappd.com/CaseyBrewingBlending',
 'https://untappd.com/LittleCottageBrewery',
 'https://untappd.com/4firesmeadery',
 'https://untappd.com/w/floodland-brewing/379455',
 'https://untappd.com/HouseofFermentology',
 'https://untappd.com/Bokkereyder',
 'https://untappd.com/TheAnswerBrewpub',
 'https://untappd.com/AfterthoughtBrewing',
 'https://untappd.com/HorusAles',
 'https://untappd.com/HillFarmsteadBrewery',
 'https://untappd.com/MartoBrewing',
 'https://untappd.com/rootandbranchbrewing',
 'https://untappd.com/TroonHopewell',
 'https://untappd.com/MortalisBrewingCompany',
 'https://untappd.com/EQBrew',
 'https://unta

Now that we have a list of breweries, we can set up the requests to output a number of results related to the beers in each brewery. My hope was to make the brewery a feature, given that there would be multiple beers per brewery, but the number of results were still limited due to my lack of knowledge of the click function.

# UPDATE

With the help of my mentors, I was able to figure out the issue with the click 'See More' button. Selenium was finding the button, but was not in the right position to click, so I had to find the location of the button and offset by a few pixels to click. Then I ran into the problem of needing to scroll in order for the consecutive results to show. After going through many options, I decided to go with moving the screen to the footer. Then another issue came up with the timing/loading of the results and pages. I had to mess with some time.sleep amounts to get it going.

In [72]:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://untappd.com/login')

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("ayang2012")
password.send_keys("mulciber1")

driver.find_element_by_css_selector(".button.yellow.submit-btn").find_element_by_tag_name('input').click()

In [78]:
from selenium.webdriver.common.action_chains import ActionChains

df_breweries = pd.DataFrame()
untappd_columns = ['name', 'brewery', 'style', 'abv', 'ibu', 'rating', 'raters', 'date']
for col in untappd_columns:
    df_breweries[col]=''

count=0    
error = True
for i in tqdm(breweries):
    url = i + '/beer'
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    driver.get(url)

    error=True
    while(error):
        try:
            target = driver.find_element_by_class_name('footer-nav')
            actions = ActionChains(driver)
            actions.move_to_element_with_offset(target, 0, -50)
            actions.perform()
            
            temp = driver.page_source
            tempy = BeautifulSoup(temp, 'lxml')
            beer_items_tempy = tempy.findAll(class_='beer-item')
            err1 = len(beer_items_tempy)

#           print line below to check for 'show more itterations'            
#            print("err1: " + str(err1))
    
            button = driver.find_element_by_css_selector("a.button.yellow.more-list-items")
            action = ActionChains(driver)
            action.move_to_element_with_offset(button, 10, 5)
            action.click()
            action.perform()
            
            time.sleep(1)
            
            target = driver.find_element_by_class_name('footer-nav')
            actions = ActionChains(driver)
            actions.move_to_element_with_offset(target, 0, -50)
            actions.perform()
            
            time.sleep(1)
            
            temp2 = driver.page_source
            tempy2 = BeautifulSoup(temp2, 'lxml')
            beer_items_tempy2 = tempy2.findAll(class_='beer-item')
            err2 = len(beer_items_tempy2)

#           print line below to check for 'show more itterations'
#             print("err2: " + str(err2))
            
            if err1 == err2:
                error=False

        except Exception as e:
            error=False
        
    soup = driver.page_source
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')

    for beer in beer_items:

        df_breweries.set_value(count, 'name', beer.findAll(class_="name")[0].find('a').text.strip())
        df_breweries.set_value(count, 'brewery', i)
        df_breweries.set_value(count, 'style', beer.findAll(class_="style")[0].text.strip())
        df_breweries.set_value(count, 'abv', beer.findAll(class_="abv")[0].text.strip())
        df_breweries.set_value(count, 'ibu', beer.findAll(class_="ibu")[0].text.strip())
        df_breweries.set_value(count, 'rating', beer.findAll(class_="rating")[0].text.strip())
        df_breweries.set_value(count, 'raters', beer.findAll(class_="raters")[0].text.strip())
        df_breweries.set_value(count, 'date', beer.findAll(class_="date")[0].text.strip())
        df_breweries.set_value(count, 'text',beer.findAll(class_='desc')[0].text.strip())
        
        count +=1






  0%|          | 0/50 [00:00<?, ?it/s][A[A[A[A[A

https://untappd.com/PipsMeadery/beer







  2%|▏         | 1/50 [00:01<01:21,  1.67s/it][A[A[A[A[A

https://untappd.com/SchrammsMead/beer







  4%|▍         | 2/50 [00:02<01:12,  1.51s/it][A[A[A[A[A

https://untappd.com/3SonsBrewingCo/beer







  6%|▌         | 3/50 [00:04<01:07,  1.43s/it][A[A[A[A[A

https://untappd.com/SideProject/beer







  8%|▊         | 4/50 [00:05<01:01,  1.33s/it][A[A[A[A[A

https://untappd.com/ColesRoadBrewery/beer







 10%|█         | 5/50 [00:06<00:56,  1.26s/it][A[A[A[A[A

https://untappd.com/GaragisteMeadery/beer







 12%|█▏        | 6/50 [00:07<00:59,  1.36s/it][A[A[A[A[A

https://untappd.com/treehousebrewco/beer







 14%|█▍        | 7/50 [00:09<01:01,  1.43s/it][A[A[A[A[A

https://untappd.com/w/the-alchemist/1244/beer







 16%|█▌        | 8/50 [00:10<00:58,  1.39s/it][A[A[A[A[A

https://untappd.com/w/brasserie-cantillon/202/beer







 18%|█▊        | 9/50 [00:11<00:55,  1.35s/it][A[A[A[A[A

https://untappd.com/CaseyBrewingBlending/beer







 20%|██        | 10/50 [00:13<00:53,  1.33s/it][A[A[A[A[A

https://untappd.com/LittleCottageBrewery/beer







 22%|██▏       | 11/50 [00:14<00:49,  1.27s/it][A[A[A[A[A

https://untappd.com/4firesmeadery/beer







 24%|██▍       | 12/50 [00:15<00:51,  1.35s/it][A[A[A[A[A

https://untappd.com/w/floodland-brewing/379455/beer







 26%|██▌       | 13/50 [00:17<00:48,  1.30s/it][A[A[A[A[A

https://untappd.com/HouseofFermentology/beer







 28%|██▊       | 14/50 [00:18<00:45,  1.25s/it][A[A[A[A[A

https://untappd.com/Bokkereyder/beer







 30%|███       | 15/50 [00:19<00:42,  1.22s/it][A[A[A[A[A

https://untappd.com/TheAnswerBrewpub/beer







 32%|███▏      | 16/50 [00:20<00:42,  1.26s/it][A[A[A[A[A

https://untappd.com/AfterthoughtBrewing/beer







 34%|███▍      | 17/50 [00:22<00:42,  1.29s/it][A[A[A[A[A

https://untappd.com/HorusAles/beer







 36%|███▌      | 18/50 [00:25<00:59,  1.85s/it][A[A[A[A[A

https://untappd.com/HillFarmsteadBrewery/beer







 38%|███▊      | 19/50 [00:27<00:56,  1.82s/it][A[A[A[A[A

https://untappd.com/MartoBrewing/beer







 40%|████      | 20/50 [00:28<00:50,  1.70s/it][A[A[A[A[A

https://untappd.com/rootandbranchbrewing/beer







 42%|████▏     | 21/50 [00:30<00:48,  1.66s/it][A[A[A[A[A

https://untappd.com/TroonHopewell/beer







 44%|████▍     | 22/50 [00:32<00:53,  1.91s/it][A[A[A[A[A

https://untappd.com/MortalisBrewingCompany/beer







 46%|████▌     | 23/50 [00:34<00:54,  2.02s/it][A[A[A[A[A

https://untappd.com/EQBrew/beer







 48%|████▊     | 24/50 [00:39<01:10,  2.71s/it][A[A[A[A[A

https://untappd.com/ObercreekBrewingCompany/beer







 50%|█████     | 25/50 [00:41<01:04,  2.56s/it][A[A[A[A[A

https://untappd.com/DeGardeBrewing/beer







 52%|█████▏    | 26/50 [00:42<00:54,  2.26s/it][A[A[A[A[A

https://untappd.com/hudson_valley_brewery/beer







 54%|█████▍    | 27/50 [00:45<00:55,  2.43s/it][A[A[A[A[A

https://untappd.com/TheAleApothecary/beer







 56%|█████▌    | 28/50 [00:46<00:45,  2.06s/it][A[A[A[A[A

https://untappd.com/TrilliumBrewing/beer







 58%|█████▊    | 29/50 [00:48<00:39,  1.89s/it][A[A[A[A[A

https://untappd.com/MidnightProject/beer







 60%|██████    | 30/50 [00:49<00:35,  1.75s/it][A[A[A[A[A

https://untappd.com/Brouwerij3Fonteinen/beer







 62%|██████▏   | 31/50 [00:51<00:31,  1.67s/it][A[A[A[A[A

https://untappd.com/SanteAdairius/beer







 64%|██████▍   | 32/50 [00:54<00:36,  2.04s/it][A[A[A[A[A

https://untappd.com/VitaminSeaBrewing/beer







 66%|██████▌   | 33/50 [00:56<00:35,  2.06s/it][A[A[A[A[A

https://untappd.com/LawsonsFinestLiquids/beer







 68%|██████▊   | 34/50 [00:59<00:37,  2.32s/it][A[A[A[A[A

https://untappd.com/monkishbrewing/beer







 70%|███████   | 35/50 [01:00<00:31,  2.11s/it][A[A[A[A[A

https://untappd.com/RedDragonBreweryOps/beer







 72%|███████▏  | 36/50 [01:02<00:29,  2.09s/it][A[A[A[A[A

https://untappd.com/TheRareBarrel/beer







 74%|███████▍  | 37/50 [01:04<00:25,  1.96s/it][A[A[A[A[A

https://untappd.com/BarrelCulture/beer







 76%|███████▌  | 38/50 [01:06<00:22,  1.84s/it][A[A[A[A[A




 78%|███████▊  | 39/50 [01:06<00:14,  1.32s/it][A[A[A[A[A

https://untappd.com/MAZURTBrewingCompany/beer
https://untappd.com/Whiteroosterfarmhousebrewery/beer
https://untappd.com/RiverRoostBrewery/beer







 82%|████████▏ | 41/50 [01:06<00:08,  1.05it/s][A[A[A[A[A




 86%|████████▌ | 43/50 [01:06<00:04,  1.45it/s][A[A[A[A[A

https://untappd.com/Hnf/beer
https://untappd.com/TiltedBarnBrewery/beer
https://untappd.com/FoamBrewers/beer







 90%|█████████ | 45/50 [01:06<00:02,  1.95it/s][A[A[A[A[A

https://untappd.com/OtherHalfBrewingCompany/beer
https://untappd.com/CycleBrewingCompany/beer
https://untappd.com/Shared/beer







 94%|█████████▍| 47/50 [01:06<00:01,  2.60it/s][A[A[A[A[A




 98%|█████████▊| 49/50 [01:07<00:00,  3.39it/s][A[A[A[A[A

https://untappd.com/MisbeehavinMeads/beer
https://untappd.com/GreatNotionBrewing/beer
https://untappd.com/NarrowGaugeBrewing/beer







100%|██████████| 50/50 [01:07<00:00,  1.34s/it][A[A[A[A[A

In [79]:
df_breweries['brewery'].value_counts()
df_breweries.shape

(934, 9)

In [80]:
df_breweries.to_csv('data/test2.csv')

In [None]:
df_breweries.head()
df_breweries.to_csv('data/full_brewery_data.csv')

# NLP data grab

After building some ML models with the data I just grabbed, the results weren't so good. Train: 35% Test: 22%, we should have enough data at 5000+ observations. Now we need to find some better features. We'll use NLP on the description texts we grab to do some more feature creation.

In [None]:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://untappd.com/login')

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("ayang2012")
password.send_keys("mulciber1")

driver.find_element_by_css_selector(".button.yellow.submit-btn").find_element_by_tag_name('input').click()

In [None]:
df_breweries = pd.DataFrame()
untappd_columns = ['name', 'text']
for col in untappd_columns:
    df_breweries[col]=''

count=0    
error = True
for i in tqdm(breweries):
    url = i + '/beer'
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    driver.get(url)
    
    error=True
    while(error):
        try:
            temp = driver.page_source
            tempy = BeautifulSoup(temp, 'lxml')
            beer_items_tempy = tempy.findAll(class_='beer-item')
            err1 = len(beer_items_tempy)

#           print line below to check for 'show more itterations'            
#            print("err1: " + str(err1))
    
            button = driver.find_element_by_css_selector("a.button.yellow.more-list-items")
            action = ActionChains(driver)
            action.move_to_element_with_offset(button, 10, 5)
            action.click()
            action.perform()
            
            time.sleep(1)
            
            target = driver.find_element_by_class_name('footer-nav')
            actions = ActionChains(driver)
            actions.move_to_element(target)
            actions.perform()
            
            temp2 = driver.page_source
            tempy2 = BeautifulSoup(temp2, 'lxml')
            beer_items_tempy2 = tempy2.findAll(class_='beer-item')
            err2 = len(beer_items_tempy2)

#           print line below to check for 'show more itterations'
#             print("err2: " + str(err2))
            
            if err1 == err2:
                error=False

        except Exception as e:
            error=False
        
    soup = driver.page_source
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')

    for beer in beer_items:

        df_breweries.set_value(count, 'name', beer.findAll(class_="name")[0].find('a').text.strip())
        df_breweries.set_value(count, 'text',beer.findAll(class_='desc')[0].text.strip())
        count +=1

In [None]:
df_breweries.to_csv("untappd_beer_texts.csv")

In [None]:
import random
countries = random.sample(range(1, 100), 0)


In [None]:
df_breweries = pd.DataFrame()
untappd_columns = ['name', 'brewery', 'style', 'abv', 'ibu', 'rating', 'raters', 'date']
for col in untappd_columns:
    df_breweries[col]=''

count=0    
error = True
for i in countries:
    url = 'https://untappd.com/beer/top_rated?country_id=' + str(i)
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    driver.get(url)
    
    error=True
    while(error):
        try:
            temp = driver.page_source
            tempy = BeautifulSoup(temp, 'lxml')
            beer_items_tempy = tempy.findAll(class_='beer-item')
            err1 = len(beer_items_tempy)

#           print line below to check for 'show more itterations'            
#            print("err1: " + str(err1))
    
            button = driver.find_element_by_css_selector("a.button.yellow.more-list-items")
            action = ActionChains(driver)
            action.move_to_element_with_offset(button, 10, 5)
            action.click()
            action.perform()
            
            time.sleep(1)
            
            target = driver.find_element_by_class_name('footer-nav')
            actions = ActionChains(driver)
            actions.move_to_element(target)
            actions.perform()
            
            temp2 = driver.page_source
            tempy2 = BeautifulSoup(temp2, 'lxml')
            beer_items_tempy2 = tempy2.findAll(class_='beer-item')
            err2 = len(beer_items_tempy2)

#           print line below to check for 'show more itterations'
#             print("err2: " + str(err2))
            
            if err1 == err2:
                error=False

        except Exception as e:
            error=False
        
    soup = driver.page_source
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')

    for beer in beer_items:

        df_breweries.set_value(count, 'name', beer.findAll(class_="name")[0].find('a').text.strip())
        df_breweries.set_value(count, 'brewery', beer.findAll(class_='style')[0].findAll('a')[0]['href'].strip())
        df_breweries.set_value(count, 'style', beer.findAll(class_="style")[0].text.strip())
        df_breweries.set_value(count, 'abv', beer.findAll(class_="abv")[0].text.strip())
        df_breweries.set_value(count, 'ibu', beer.findAll(class_="ibu")[0].text.strip())
        df_breweries.set_value(count, 'rating', beer.findAll(class_="rating")[0].text.strip())
        df_breweries.set_value(count, 'raters', beer.findAll(class_="raters")[0].text.strip())
        df_breweries.set_value(count, 'date', beer.findAll(class_="date")[0].text.strip())
        df_breweries.set_value(count, 'text',beer.findAll(class_='desc')[0].text.strip())
        
        count +=1

In [None]:
df_breweries.head()

In [None]:
df_breweries.to_csv('30countries_test_1.csv')

In [2]:
import pandas as pd
df1 = pd.read_csv("30countries_test_1.csv")
df1.head()

Unnamed: 0.1,Unnamed: 0,name,brewery,style,abv,ibu,rating,raters,date,text
0,0,La Negra,/CadejoBrewingCompany,Cadejo Brewing Company,4.8% ABV,31 IBU,(3.53),710 Ratings,Added 03/30/14,El estilo clásico irlandés. Elaborada con una ...
1,1,La Suegra IPA,/CadejoBrewingCompany,Cadejo Brewing Company,5.5% ABV,N/A IBU,(3.53),414 Ratings,Added 06/18/16,
2,2,Roja,/CadejoBrewingCompany,Cadejo Brewing Company,5.3% ABV,31 IBU,(3.47),951 Ratings,Added 03/26/13,"Aromática, con buen balance entre el amargo de..."
3,3,Hija De Pooh,/CadejoBrewingCompany,Cadejo Brewing Company,4.3% ABV,17 IBU,(3.42),784 Ratings,Added 09/06/15,Hija de Pooh - Honey Blonde Ale Una cerveza un...
4,4,Suprema Roja,/w/industrias-la-constancia/56613,Industrias La Constancia,5.2% ABV,N/A IBU,(3.37),202 Ratings,Added 05/17/13,


In [2]:
import pandas as pd
df1 = pd.read_csv("data/full_brewery_data.csv")
df1.shape

(2229, 10)

In [5]:
df2 = pd.read_csv("data/untappd_breweries_ratings.csv")
df2.shape

(5134, 9)

In [6]:
df2.head()

Unnamed: 0.1,Unnamed: 0,name,brewery,style,abv,ibu,rating,raters,date
0,0,Blue Suede Shews,https://untappd.com/PipsMeadery,Mead - Other,14% ABV,N/A IBU,(4.85),643 Ratings,Added 04/29/16
1,1,Find the Lady,https://untappd.com/PipsMeadery,Mead - Melomel,14% ABV,N/A IBU,(4.65),532 Ratings,Added 01/19/17
2,2,The Monte,https://untappd.com/PipsMeadery,Mead - Other,14% ABV,N/A IBU,(4.72),474 Ratings,Added 03/18/17
3,3,Levitation,https://untappd.com/PipsMeadery,Mead - Other,14% ABV,N/A IBU,(4.54),311 Ratings,Added 04/20/18
4,4,Call of the Void,https://untappd.com/PipsMeadery,Mead - Melomel,14% ABV,N/A IBU,(4.85),308 Ratings,Added 08/12/17


In [None]:
df2['style'].value_counts()

In [9]:
import QK
df_texts = QK.process_data(df2['style'])
df_texts.head()

0      mead other
1    mead melomel
2      mead other
3      mead other
4    mead melomel
Name: style, dtype: object

In [10]:
df_texts = QK.stop_stem((df_texts.apply(str)))
df_texts.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/MacBookPro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0            mead
1    mead melomel
2            mead
3            mead
4    mead melomel
Name: style, dtype: object

In [27]:
[i[0] for i in all_words.most_common(20)]

['ale',
 'american',
 'ipa',
 'imperi',
 'doubl',
 'sour',
 'stout',
 'farmhous',
 'wild',
 'saison',
 'mead',
 'new',
 'england',
 'berlin',
 'weiss',
 'gose',
 'fruit',
 'pale',
 'melomel',
 'beer']

In [23]:
all_words

FreqDist({'ale': 1731, 'american': 1445, 'ipa': 1287, 'imperi': 1280, 'doubl': 1160, 'sour': 888, 'stout': 854, 'farmhous': 506, 'wild': 504, 'saison': 462, ...})

In [19]:
all_words = QK.generate_words(df_texts)
for x in [(i, all_words[i]) for i in all_words]:
    print(x)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/MacBookPro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


('mead', 344)
('melomel', 173)
('pyment', 13)
('cyser', 15)
('metheglin', 16)
('stout', 854)
('imperi', 1280)
('doubl', 1160)
('oatmeal', 32)
('barleywin', 38)
('english', 28)
('sour', 888)
('berlin', 208)
('weiss', 208)
('russian', 38)
('american', 1445)
('ipa', 1287)
('new', 262)
('england', 259)
('pumpkin', 6)
('yam', 6)
('beer', 140)
('milk', 99)
('sweet', 99)
('wild', 504)
('ale', 1731)
('brown', 37)
('scotch', 9)
('wee', 9)
('heavi', 9)
('belgian', 99)
('strong', 37)
('golden', 33)
('gose', 201)
('pale', 175)
('porter', 80)
('rye', 5)
('wheat', 22)
('saison', 462)
('farmhous', 506)
('red', 28)
('blond', 59)
('dubbel', 12)
('pilsner', 39)
('witbier', 10)
('flander', 8)
('oud', 4)
('bruin', 4)
('grisett', 6)
('adambi', 1)
('quad', 4)
('bièr', 16)
('de', 14)
('champagn', 2)
('brut', 7)
('old', 5)
('baltic', 7)
('tripl', 49)
('fruit', 179)
('amber', 10)
('extra', 4)
('special', 4)
('bitter', 6)
('german', 9)
('coffe', 25)
('kellerbi', 5)
('zwickelbi', 5)
('lager', 26)
('north', 3)
('

In [12]:
word_features = list(all_words.keys())[:20]
word_features

['mead',
 'melomel',
 'pyment',
 'cyser',
 'metheglin',
 'stout',
 'imperi',
 'doubl',
 'oatmeal',
 'barleywin',
 'english',
 'sour',
 'berlin',
 'weiss',
 'russian',
 'american',
 'ipa',
 'new',
 'england',
 'pumpkin']