This notebook comprises the scraping function and initial build of the dataframe I will use for my project.

# Imports:

In [1]:
from requests import get
from bs4 import BeautifulSoup
import re
import pandas as pd
import time

# Initial Functions Used to Scrape Data:

- I used the following tutorial to learn how to use Beautiful Soup:
https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup
- The below function was used to extract the top 50 beers of each type of beer listed on Untappd (over 200 options).

In [2]:
# 1. Get the page text from the url and load it as a Beautiful Soup text object:
# url = 'https://untappd.com/beer/top_rated'
def get_html(url):
    response = get(url, headers = {'User-agent': 'friendly person'})
    html_soup = BeautifulSoup(response.text, 'html.parser')
    return html_soup

# 2. Parse the data -
# For each column in my data I am scraping individually from each entry on the page url
# Each variable is referenced in the soup differently so I am finding them all according to these attributes
# and then appending to an individual list

def get_beer_data(html_soup):
    
    all_beer_types = get_beer_country_options(html_soup)['beer_types']
    
    name_beers = []
    for name in html_soup.find_all('p', attrs={'class' : 'name'}):
        name_beers.append(name.text.strip())
        
    num_ratings = []
    for raters in html_soup.find_all('p', class_ = 'raters'):
        num_ratings.append(raters.text.strip('\n').strip('Ratings').strip())
        
    date_added = []
    for date in html_soup.find_all('p', class_ = 'date'):
        date_added.append(date.text[7:-1].strip())
        
    abvs = []
    for abv in html_soup.find_all('p', class_ = 'abv'):
        abvs.append(abv.text.strip('\n').strip())
        
    ratings = []
    for rating in html_soup.find_all('span', {'class' : 'num'}):
        ratings.append(rating.text.strip('(').strip(')').strip())
        
    beer_text = []
    for x in html_soup.find_all("p", class_=re.compile("^desc desc-full-")):
        beer_text.append(x.text.strip('Read Less').strip())
    
    # style, brewery and availability are mixed up together so scraping together and then separating:
    style_brewery_and_availability = []
    for s in html_soup.find_all('p', attrs={'class' : 'style'}):
        style_brewery_and_availability.append(s.text)   
        
    style_brewery = [x for x in style_brewery_and_availability if x != 'This beer is no longer being produced by the brewery']
    
    not_available = []
    for a in html_soup.find_all('p', attrs={'class' : 'style'}): 
        if a.find('strong'):
            not_available.append(a.find_previous("p",  attrs={'class' : 'name'}).text)
    
    top_50_brewery = []
    top_50_styles = []
    for s in style_brewery:
        if s in all_beer_types:
            top_50_styles.append(s)
            continue
        else:
            top_50_brewery.append(s)
    
    # Merging together all these lists to form one dataframe:
    df_main = pd.DataFrame({'name': name_beers, 
                   'beer_style': top_50_styles,
                  'brewery': top_50_brewery,
                  'rating': ratings,
                   'num_ratings': num_ratings,
                  'abv': abvs,
                  'date_added': date_added,
                   'beer_desc': beer_text
                  })
    
    df_avail = pd.DataFrame({'name': not_available, 'not_available': 1})
    
    df = df_main.merge(df_avail, on='name', how='left')
    
    return df


# 4. Beer types and countries on Untappd are in drop-down menu's - this function gets the options within those menu's:
def get_beer_country_options(html_soup):
    options = []
    options_suffix = []
    for option in html_soup.find_all('option'):
        options.append(option.text)
        options_suffix.append(option.get("data-value-slug"))
    countries = options[options.index('Show All Countries')+1::]
    beer_types = options[1:options.index('Show All Countries')]
    
    # ignore first missing value
    options_suffix = options_suffix[1::]
    countries_suffix = options_suffix[options_suffix.index(None)+1::]
    beer_types_suffix = options_suffix[1:options_suffix.index(None)]
    
    # missing beer type
    beer_types = beer_types + ["Stout - American Imperial / Double"]
    return {'beer_types': beer_types,
            'countries': countries,
            'beer_types_website': beer_types_suffix,
            'countries_website': beer_types_suffix}


## Getting the data for the Top 50 Beers:

- Using the functions defined above to scrape the top 50 beers from the URL below.
- I want to keep my dataframe manageable in size and also I don't want to scrape too much data from the website either.
- This is mainly a test to check the function is working correctly.

In [4]:
top_rated_page_html = get_html("https://untappd.com/beer/top_rated")

In [14]:
top50_beer_data = get_beer_data(top_rated_page_html)

In [16]:
len(top50_beer_data)

50

In [19]:
top50_beer_data.head(10)

Unnamed: 0,name,beer_style,brewery,rating,num_ratings,abv,date_added,beer_desc,not_available
0,King JJJuliusss,IPA - Imperial / Double,Tree House Brewing Company,4.74,14996,8.4% ABV,06/25/16,To continue with our 4th Anniversary celebrati...,
1,Beer : Barrel : Time (2018),Stout - Imperial / Double,Side Project Brewing,4.74,2184,15% ABV,11/03/18,"For Beer : Barrel : Time 2018, we chose a blen...",1.0
2,Rare Bourbon County Brand Stout (2010),Stout - Imperial / Double,Goose Island Beer Co.,4.74,8232,13% ABV,11/26/10,Aged 2 years in 23-year old Pappy Van Winkle B...,1.0
3,Proprietor's Bourbon County Brand Stout (2014),Stout - Imperial / Double,Goose Island Beer Co.,4.74,14913,13.2% ABV,10/30/14,Proprietor’s Bourbon County Brand Stout is mea...,1.0
4,Bourbon County Brand Stout Vanilla Rye (2014),Stout - Imperial / Double,Goose Island Beer Co.,4.74,27513,13.6% ABV,08/11/14,First brewed for the legendary festival of Woo...,1.0
5,Kentucky Brunch Brand Stout,Stout - Imperial / Double,Toppling Goliath Brewing Co.,4.73,2390,12% ABV,02/12/12,This beer is the real McCoy. Barrel aged and c...,
6,Beer : Barrel : Time (2019),Stout - Imperial / Double,Side Project Brewing,4.73,2321,15% ABV,10/30/19,The barrel stock we tasted through to choose t...,
7,Blue Suede Shews,Mead - Other,Pips Meadery,4.71,1648,14% ABV,04/29/16,14% Orange Blossom Honey-wine with Wild Bluebe...,
8,Rare Bourbon County Brand Stout (2015),Stout - Imperial / Double,Goose Island Beer Co.,4.71,24850,14.8% ABV,09/09/15,Back in 1979 the folks at Heaven Hill Distille...,1.0
9,Proprietor's Bourbon County Brand Stout (2013),Stout - Imperial / Double,Goose Island Beer Co.,4.71,9529,14.1% ABV,10/22/13,Imperial Stout brewed with Toasted Coconut and...,1.0


## Creating a Data Dictionary Using the Function - get_beer_country_options:

- I am using the drop-down menu on the website URL to get a list of all the country and beer options that are on Untappd.

In [5]:
data_dict = get_beer_country_options(top_rated_page_html)

## Test Case: Using the data dictionary to construct a dataframe of just sour beers:

In [24]:
sour_list = [x for x in data_dict['beer_types_website'] if x.startswith('sour')]

In [25]:
sour_list

['sour-berliner-weisse',
 'sour-flanders-oud-bruin',
 'sour-flanders-red-ale',
 'sour-fruited',
 'sour-gose-fruited',
 'sour-gose',
 'sour-other']

In [26]:
# example: https://untappd.com/beer/top_rated?type=stout-american
sours_data = []

for sour in sour_list:
    beer_url = "https://untappd.com/beer/top_rated?type=" + sour + "&country=england"
    time.sleep(1)
    page_html = get_html(beer_url)
    beer_data = get_beer_data(page_html)
    sours_data.append(beer_data)
    
sours_data = pd.concat(sours_data)

In [38]:
# Saving this to csv file:
# appended_data.to_csv('sours_data.csv')

In [27]:
sours_data.head()

Unnamed: 0,name,beer_style,brewery,rating,num_ratings,abv,date_added,beer_desc,not_available
0,Bourbon Skyline,Sour - Berliner Weisse,Buxton Brewery,3.9,1606,7.2% ABV,12/12/15,Barrel Aged Berliner Wei,
1,WLS #028 Tzatziki Sour,Sour - Berliner Weisse,Orbit Beers London,3.77,1165,4.3% ABV,04/13/19,Get ready for the return of the Tzatziki Sour!...,
2,Very Far Skyline,Sour - Berliner Weisse,Buxton Brewery,3.76,1683,5% ABV,04/03/15,,
3,Laserbl’ast,Sour - Berliner Weisse,Black Iris Brewery,3.74,546,5% ABV,06/14/18,A dry hopped kettle sour featuring the tropica...,
4,Calypso Berliner Weisse Batch 377: Mosaic,Sour - Berliner Weisse,Siren Craft Brew,3.74,259,4% ABV,01/17/15,,


# Using this method to get a complete list across beer types:

- This method returns the top 50 rated beers of each beer style as listed on the untapped website.

In [7]:
comp_beer_list = [x for x in data_dict['beer_types_website']]

In [8]:
comp_beer_list[:5]

['altbier',
 'american-wild-ale',
 'australian-sparkling-ale',
 'barleywine-american',
 'barleywine-english']

In [9]:
len(comp_beer_list)

213

In [22]:
complete_data = []

for beer in comp_beer_list:
    try:
        beer_url = "https://untappd.com/beer/top_rated?type=" + beer
        time.sleep(1)
        page_html = get_html(beer_url)
        beer_data = get_beer_data(page_html)
        complete_data.append(beer_data)
    except ValueError:      # Exception continues loop when website url is empty
        continue
    
complete_beer_data = pd.concat(complete_data)

In [23]:
complete_beer_data

Unnamed: 0,name,beer_style,brewery,rating,num_ratings,abv,date_added,beer_desc,not_available
0,Estimation,Altbier,Suarez Family Brewery,3.96,318,4.8% ABV,02/09/20,A classic interpretation of the native Dusseld...,
1,Bourbon Barrel Aged Balt The More,Altbier,Union Craft Brewing,3.94,490,10% ABV,07/18/15,"imited edition, 700 cans. Batch 500. A nod to ...",
2,Barrel Roll No. 7 - Tailslide (2013),Altbier,Hangar 24 Craft Brewing,3.94,760,10.5% ABV,06/29/13,It’s with heightened awareness that one attemp...,1.0
3,Amber Apple Pie Ale,Altbier,Frye Brewing,3.9,370,5% ABV,08/26/17,"This is a variation on the ""I DO"" Brew amber a...",
4,subs.tân.ci.a antiga,Altbier,Latido Ale House,3.89,255,5% ABV,08/09/19,"Estilo alemão clássico, mas eternamente modern...",
...,...,...,...,...,...,...,...,...,...
20,Zoigl,Zoigl,Brauerei Bischofshof,3.37,1227,5.1% ABV,10/10/11,This beer specialty has been brewed in the Obe...,
21,Communbräu-Winterzoigl,Zoigl,Privatbrauerei Hösl,3.36,348,5.9% ABV,03/12/13,,
22,Zoigl,Zoigl,Gänstaller Braumanufaktur,3.32,2772,5.6% ABV,02/09/12,ine unfiltrierte Hommage an die oberpfälzische...,
23,Moosbacher Zoigl,Zoigl,Private Landbrauerei Scheuerer,3.28,1246,5.4% ABV,10/24/11,,


I initially saved this df to csv, but then realised I just wanted beer descriptions in english and so filtered it further, see below:

In [37]:
# Saving df to csv

# complete_beer_data.to_csv('untapped_beer_data.csv')

## Creating a dataframe of just english language beers:

- I want to use the beer descriptions to inform my analysis and to create a corpus of common words used in these descriptions. 
- As can be seen above, a lot of the beers from non-english countries/breweries have descriptions written in that foreign language and so would have to be excluded from the analysis.
- Therefore, I will create a dataframe of just the beers from the UK, US, Australia.

In [10]:
english_beer_data = []
country_names = ['&country=england', '&country=ireland', '&country=wales', '&country=united-states', 
                 '&country=australia', '&country=scotland']

for beer in comp_beer_list:
    for country in country_names:
        try:
            beer_url = "https://untappd.com/beer/top_rated?type=" + beer + country
            time.sleep(1)
            page_html = get_html(beer_url)
            beer_data = get_beer_data(page_html)
            english_beer_data.append(beer_data)
        except ValueError:
            continue
    
enlgish_speaking_data = pd.concat(english_beer_data)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  app.launch_new_instance()


In [11]:
enlgish_speaking_data

Unnamed: 0,abv,beer_desc,beer_style,brewery,date_added,name,not_available,num_ratings,rating
0,4.5% ABV,Collaboration with Magic Rock,Altbier,Thornbridge Brewery,06/10/16,Exalted,,542,3.7
1,6.2% ABV,Beechwood smoked malt meets a double altbier.\...,Altbier,Orbit Beers London,12/03/14,Leaf,,783,3.67
2,4.9% ABV,12.4° Plato | OG 1050 | ABV 4.9%\n\nWylam & Le...,Altbier,Wylam,02/07/16,Ctrl Alt Del,,318,3.65
3,4.5% ABV,Dry Hopped Altbier.,Altbier,Ghost Brew Co,07/16/16,Hades,,228,3.62
4,5% ABV,,Altbier,Anspach & Hobday,10/02/15,The Altbier,,288,3.57
...,...,...,...,...,...,...,...,...,...
0,5.4% ABV,Amber lager,Zoigl,Geipel Brewing,12/30/13,Zoigl,,164,3.62
0,4.8% ABV,2017 Great American Beer Festival Gold Medal i...,Zoigl,Zoiglhaus Brewing Company,03/16/17,Zoigl-Pils,,2188,3.74
1,4.8% ABV,,Zoigl,The Commons Brewery,11/22/13,Zoigl Bier,1.0,201,3.63
2,5.5% ABV,Zoiglbier is a style that originated in a smal...,Zoigl,Revolution Brewing Company,07/25/17,ChiStar,1.0,321,3.62


In [12]:
enlgish_speaking_data.to_csv('untapped_beer_data_eng_language.csv')

In [13]:
enlgish_speaking_data.isnull().sum()

abv                  0
beer_desc            0
beer_style           0
brewery              0
date_added           0
name                 0
not_available    18627
num_ratings          0
rating               0
dtype: int64