# Web Scraping Untappd

Previously, I worked with the open API from www.brewerydb.com, which provides a free sandbox to grab data from. However, the data it provided was extremely limited. The variables I was most interested in were price and rating. Price seemed the most difficult to find, but I was able to find ratings on Untappd.

The website www.untappd.com presented several hurdles to grabbing data. First of all, we needed to mask the client we were using in Python by sending browser-like headers with the GET request.

In [1]:
# import dependencies
import pandas as pd
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

# set url and mask headers
url = 'https://untappd.com/search?q=Murican+Pilsner'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# grab data from url and set headers, prep soup
response = requests.get(url, headers=headers)
soup = response.content

In [2]:
# import beer data grabbed from the brewerydb sandbox
df_beers = pd.read_csv('data/beers.csv')
df_beers.head()

Unnamed: 0.1,Unnamed: 0,beer_id,name,display_name,abv,style_id,year,glass_id,organic,brewery_id,retired,status
0,0,c4f2KE,'Murican Pilsner,'Murican Pilsner,5.5,98.0,,4.0,N,nHLlnK,N,verified
1,1,zTTWa2,11.5° PLATO,11.5° PLATO,4.5,164.0,,,N,nHLlnK,N,verified
2,2,zfP2fK,12th Of Never,12th Of Never,5.5,25.0,,,N,nLsoQ9,N,verified
3,3,xwYSL2,15th Anniversary Ale,15th Anniversary Ale,,5.0,,,N,TMc6H2,N,verified
4,4,UJGpVS,16 So Fine Red Wheat Wine,16 So Fine Red Wheat Wine,11.0,35.0,,,N,TMc6H2,N,verified


This initial grab, using the beer names from the brewerydb API pull, produced many mismatched names and few results: only about 300 in total. That is an extremely small dataset for what I wanted to do (machine learning). After looking into the issue further, I found two causes: many of the beers in brewerydb are not on the Untappd website, and a plain GET request only returns the page as it appears without a login, which limits the results.

In [None]:
ids = []
ratings = []
errors = []
for i in tqdm(range(df_beers.shape[0])):
#     print(df_beers.loc[i,'display_name'])
    url = 'https://untappd.com/search?q='
    url = url + str(df_beers.loc[i,'display_name'])
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = response.content
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')
    for beer in beer_items:
        try:
            abv_raw = beer.findAll(class_='abv')[0].text.strip()
            abv_raw = abv_raw.strip('% ABV')
            abv = float(abv_raw)
        except Exception as e:
            print(e)
            abv = None  # avoid a NameError or reusing the previous item's value
        if df_beers.loc[i,'abv'].astype(float) == abv:
            ids.append(df_beers.loc[i,'beer_id'])
            ratings.append(beer.findAll(class_='rating')[0].text.strip())
        elif df_beers.loc[i, 'name'].lower() == beer.findAll(class_="name")[0].find('a').text.strip().lower():
            ids.append(df_beers.loc[i,'beer_id'])
            ratings.append(beer.findAll(class_='rating')[0].text.strip())
        else:
            errors.append(df_beers.loc[i,'beer_id'])
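Beer names such as `'Murican Pilsner` or `471 IPA. Aggressive Hoppiness` contain spaces and punctuation, so appending them straight onto the search URL relies on the HTTP client to escape them. A minimal sketch of building the query explicitly with the standard library follows. The helper name `search_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus

BASE_URL = 'https://untappd.com/search?q='

def search_url(display_name):
    """Return the Untappd search URL for one beer name (hypothetical helper)."""
    # quote_plus turns spaces into '+' and percent-encodes characters
    # such as '&' or '#' that would otherwise cut the query short
    return BASE_URL + quote_plus(str(display_name))

print(search_url("'Murican Pilsner"))
```

Inside the loop, `url = search_url(df_beers.loc[i, 'display_name'])` would replace the two lines that concatenate the URL.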
        

Below, we use Selenium to log in to the website. We iterate over the requests, changing the beer name in the search query in the hope of matching the results to the beers from the brewerydb data by beer_id. The resulting dataset had around 1100 data points, which was much better than the 300 before. However, after running several ML models on this data, the results were lackluster.
> The machine learning portion can be found in AY_MachineLearning

In order to better predict the ratings of the beers, we either needed more data or more features to predict on.

In [16]:
from selenium import webdriver

In [17]:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://untappd.com/login')

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

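# credentials are read from environment variables (names are placeholders) instead of being hard-coded
import os
username.send_keys(os.environ["UNTAPPD_USERNAME"])
password.send_keys(os.environ["UNTAPPD_PASSWORD"])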

driver.find_element_by_css_selector(".button.yellow.submit-btn").find_element_by_tag_name('input').click()

In [18]:
ids2 = []
ratings2 = []
errors2 = []
df_untappd = pd.DataFrame()
untappd_columns = ['id','name', 'brewery', 'style', 'abv', 'ibu', 'rating']
for col in untappd_columns:
    df_untappd[col]=''

count=0    
for i in tqdm(range(df_beers.shape[0])):

    url = 'https://untappd.com/search?q='
    url = url + str(df_beers.loc[i,'display_name'])
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    driver.get(url)
    soup = driver.page_source
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')

    for beer in beer_items:

        # .loc assignment enlarges the frame row by row (DataFrame.set_value was removed in pandas 1.0)
        df_untappd.loc[count, 'id'] = df_beers.loc[i,'beer_id']
        df_untappd.loc[count, 'name'] = beer.findAll(class_="name")[0].find('a').text.strip()
        df_untappd.loc[count, 'brewery'] = beer.findAll(class_="brewery")[0].find('a').text.strip()
        df_untappd.loc[count, 'style'] = beer.findAll(class_="style")[0].text.strip()
        df_untappd.loc[count, 'abv'] = beer.findAll(class_="abv")[0].text.strip()
        df_untappd.loc[count, 'ibu'] = beer.findAll(class_="ibu")[0].text.strip()
        df_untappd.loc[count, 'rating'] = beer.findAll(class_="rating")[0].text.strip()
        
        count +=1

100%|██████████| 1109/1109 [03:29<00:00,  5.28it/s]


In [19]:
df_untappd.shape

(1231, 7)

In [20]:
df_untappd.tail()

Unnamed: 0,id,name,brewery,style,abv,ibu,rating
1226,0tWTUV,Koutské tmavé 14%,Pivovar Kout na Šumavě,Lager - Dark,6% ABV,N/A IBU,(3.492)
1227,0tWTUV,Tmavé 14%,Pivovar Matuška,Schwarzbier,5.8% ABV,N/A IBU,(3.62)
1228,0tWTUV,Tmave Pivo,Vltava House Brand,Lager - Dark,3.8% ABV,N/A IBU,(3.096)
1229,0tWTUV,Dětenické tmavé,Zámecký pivovar Dětenice,Lager - Dark,4% ABV,N/A IBU,(3.518)
1230,0tWTUV,Sv. Norbert Dunkel Weizenbock / Tříkrálový pše...,Klášterní pivovar Strahov,Bock - Weizenbock,6.3% ABV,16 IBU,(3.877)
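The scraped `abv`, `ibu`, and `rating` columns are still strings such as `6% ABV`, `N/A IBU`, and `(3.492)`, so they need converting before any model can use them. A minimal sketch of one way to pull the number out follows. The helper `parse_number` is hypothetical, and it assumes each string holds at most one number of interest.

```python
import re

def parse_number(text):
    """Return the first number found in a scraped string, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', str(text))
    return float(match.group()) if match else None

# sample values taken from the table above
for raw in ['5.8% ABV', 'N/A IBU', '16 IBU', '(3.492)']:
    print(raw, '->', parse_number(raw))
```

Applied to the frame, this could look like `df_untappd['abv'].map(parse_number)`, with `N/A` entries becoming missing values.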


In [21]:
df_untappd.to_csv('untappd_beer_ratings.csv')
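The tail above shows a single `beer_id` (`0tWTUV`) attached to five different beers, because every search hit is stored under the id of the beer that was searched for. One possible way to keep only the closest hit is to compare names with `difflib.SequenceMatcher` from the standard library. This is a swapped-in technique, not what the notebook does, and the helper names and the `0.8` threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two beer names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def best_match(query_name, candidate_names, threshold=0.8):
    """Return the candidate closest to query_name, or None if none reaches the threshold."""
    best, best_score = None, threshold
    for name in candidate_names:
        score = name_similarity(query_name, name)
        if score >= best_score:
            best, best_score = name, score
    return best

print(best_match('12th Of Never', ['12th of Never', 'Never Never Land']))
```

Rows whose search results produce no match could then be dropped instead of being stored under the wrong id.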

In [41]:
df_beers[~df_beers['beer_id'].isin(df_untappd['id'])]

Unnamed: 0.1,Unnamed: 0,beer_id,name,display_name,abv,style_id,year,glass_id,organic,brewery_id,retired,status
5,5,vz5JZ1,1794 The Fergal Project,1794 The Fergal Project,4.50,42.0,,,N,DifSi4,N,verified
28,28,lWygSS,471 Double IPA - Hull Melon,471 Double IPA - Hull Melon,9.20,31.0,,,N,IImUD9,N,verified
29,29,fa0oqf,471 ESB - Extra Special Bitter,471 ESB - Extra Special Bitter,7.80,5.0,,5.0,N,IImUD9,Y,verified
32,32,tw2Iw0,471 IPA. Aggressive Hoppiness,471 IPA. Aggressive Hoppiness,9.20,31.0,,,N,IImUD9,Y,verified
33,33,GYF0P4,471 Pilsner,471 Pilsner,,98.0,,,N,IImUD9,N,verified
37,37,Fhw2NF,7 Cities Pilsner,7 Cities Pilsner,5.00,98.0,,,Y,p1tFbP,N,verified
49,49,aG4Ie2,Alpha Dog Imperial IPA,Alpha Dog Imperial IPA,8.50,31.0,,,N,yX6twV,N,verified
50,50,hYaduh,Alt Route - Beer Camp Across America,Alt Route - Beer Camp Across America (2014),6.60,55.0,2014.0,5.0,N,nHLlnK,N,verified
52,52,qIa0fL,Amber Beer,Amber Beer,,32.0,,,N,p3YrOa,N,verified
53,53,Zd8Cxd,American Summer Hoppy Wit,American Summer Hoppy Wit,6.00,65.0,,,N,q6vJUK,N,verified


# Finding beers from best breweries

### First, we need to get a list of breweries

In [51]:
from lxml import html

with open("Top Rated Breweries _ Untappd.html") as f:
    page = f.read()
tree = html.fromstring(page)
brewery = BeautifulSoup(page,'lxml')

In [161]:
breweries = brewery.findAll(class_='beer-details')
breweries = [i.findAll('a')[0].get('href') for i in breweries]
# breweries

Now that we have a list of breweries, we can set up the requests to return the beers listed for each brewery. My hope was to make the brewery a feature, given that there would be multiple beers per brewery, but the number of results was still limited because I could not work out how to click the page's "Show More" button with Selenium.

In [170]:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://untappd.com/login')

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

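# credentials are read from environment variables (names are placeholders) instead of being hard-coded
import os
username.send_keys(os.environ["UNTAPPD_USERNAME"])
password.send_keys(os.environ["UNTAPPD_PASSWORD"])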

driver.find_element_by_css_selector(".button.yellow.submit-btn").find_element_by_tag_name('input').click()

In [173]:

df_breweries = pd.DataFrame()
untappd_columns = ['name', 'brewery', 'style', 'abv', 'ibu', 'rating', 'raters', 'date']
for col in untappd_columns:
    df_breweries[col]=''

count=0    
for i in tqdm(breweries[0:5]):
    error = True  # reset for each brewery; otherwise the loop below only runs for the first one
    url = i + '/beer'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    driver.get(url)
    while error:
        try:
            #driver.find_element_by_class_name(".buttom.yellow.more-list-items.track-click").find_element_by_tag('data-sort').click()
            #driver.find_element_by_link_text("#").click()
            #driver.find_elements_by_xpath("//*[contains(text(), 'Show More')]").click()
            driver.find_element_by_text("Show More").click()
            

            driver.implicitly_wait(5)

            #driver.findElement(By.xpath("//a/u[contains(text(),'Show More')]")).click();
            #driver.findElement(By.xpath("//a[@href='#']")).click()

        except Exception as e:
            error=False
        
    soup = driver.page_source
    soupy = BeautifulSoup(soup, 'lxml')
    
    beer_items = soupy.findAll(class_='beer-item')

    for beer in beer_items:

        # .loc assignment enlarges the frame row by row (DataFrame.set_value was removed in pandas 1.0)
        df_breweries.loc[count, 'name'] = beer.findAll(class_="name")[0].find('a').text.strip()
        df_breweries.loc[count, 'brewery'] = i
        df_breweries.loc[count, 'style'] = beer.findAll(class_="style")[0].text.strip()
        df_breweries.loc[count, 'abv'] = beer.findAll(class_="abv")[0].text.strip()
        df_breweries.loc[count, 'ibu'] = beer.findAll(class_="ibu")[0].text.strip()
        df_breweries.loc[count, 'rating'] = beer.findAll(class_="rating")[0].text.strip()
        df_breweries.loc[count, 'raters'] = beer.findAll(class_="raters")[0].text.strip()
        df_breweries.loc[count, 'date'] = beer.findAll(class_="date")[0].text.strip()
        
        count +=1





100%|██████████| 5/5 [00:06<00:00,  1.27s/it]

In [175]:
df_breweries['brewery'].value_counts()

https://untappd.com/3SonsBrewingCo      25
https://untappd.com/SideProject         25
https://untappd.com/PipsMeadery         25
https://untappd.com/SchrammsMead        25
https://untappd.com/ColesRoadBrewery    25
Name: brewery, dtype: int64

In [160]:
df_breweries.head()
df_breweries.to_csv('data/untappd_breweries_ratings.csv')

# WELP

Can't figure out how to click the "Show More" button.
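One thing that stands out in the scraping cell above: Selenium's WebDriver has no `find_element_by_text` method, so that line raises an `AttributeError`, the broad `except` swallows it, and the loop exits before anything is clicked. That would be consistent with every brewery returning exactly 25 beers. A minimal sketch of a click loop follows, assuming the button is an `<a>` element whose visible text contains "Show More". The XPath is a guess based on the commented-out attempts and has not been checked against the live page, and `click_show_more`, `max_clicks`, and `pause` are hypothetical names.

```python
import time

# Assumed markup: a link whose text contains "Show More" (not verified against the site)
SHOW_MORE_XPATH = "//a[contains(normalize-space(.), 'Show More')]"

def click_show_more(driver, max_clicks=50, pause=1.0):
    """Click the "Show More" link until it disappears; return the number of clicks."""
    clicks = 0
    while clicks < max_clicks:
        # "xpath" is the string value of selenium's By.XPATH locator strategy;
        # find_elements returns an empty list instead of raising when nothing matches
        buttons = [b for b in driver.find_elements("xpath", SHOW_MORE_XPATH)
                   if b.is_displayed()]
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch of beers to load
    return clicks
```

It would be called as `click_show_more(driver)` right after `driver.get(url)` and before reading `driver.page_source`. The fixed `time.sleep` is the simplest possible wait; an explicit wait on the number of `beer-item` elements would likely be more robust.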

End