In [14]:
# First, import the relevant modules
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

The initial data set was collected by jtrofe from www.Brewersfriend.com and published on Kaggle (https://www.kaggle.com/jtrofe/beer-recipes). It was a robust starting point for this project, but additional data was acquired to supplement the initial set of 73,800+ entries. The original data came in two .csv files. The first (recipeData.csv) contains most of the information on the homebrews. The second (styleData.csv) contains the style assignments for the beers found in recipeData.csv.

The data acquisition for my project is in two parts. The first part scrapes ratings data from the website using the Beautiful Soup package. The second part uses the Brewer's Friend API to obtain recipe data (e.g. ingredients, hops, yeast, etc.) for each entry.

A few separate files used in this notebook were derived from the original data set. The first (recipeData_urls_all.csv) contains only the URLs from the original recipeData.csv document; each entry is the URL path for one beer in the data set. The second (recipe_id.csv) is generated below in part two and lists the recipe IDs harvested from those URLs.

1) Data Acquisition of Ratings Data via Beautiful Soup

In [15]:
# Establish headers, the base_url (or domain), and a list to accept data from Beautiful Soup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

base_url = 'https://www.brewersfriend.com'
data = [["url", "rating", "reviews", "calories", "carbs"]]

In [16]:
# Define a function to help sift through the soup: ladle

def ladle(url):
    # This function takes the url given and requests the rating, review, calories, and carbs of the beer in question.
    # The entry is appended to the defined list.
    
    beer_html = requests.get(url, headers=headers).text

    soup = bs(beer_html, 'html5lib')

    # soup.find returns None when the page has no matching tag
    rating = soup.find('span', {'itemprop': 'ratingValue'})
    review = soup.find('span', {'itemprop': 'reviewCount'})
    calories = soup.find('strong', {'class': 'calories'})
    carbs = soup.find('strong', {'class': 'carbs'})

    # Missing fields are recorded as the string 'NaN'
    temp = [url, 
            rating.text if rating is not None else 'NaN', 
            review.text if review is not None else 'NaN', 
            calories.text if calories is not None else 'NaN', 
            carbs.text if carbs is not None else 'NaN']
    data.append(temp)
        
    return data

In [17]:
# A dataframe of the full url is generated for each url fragment in Kaggle data set.

url_df = pd.read_csv('recipeData_urls_all.csv', header= None, engine = 'python', encoding = 'ISO-8859-1').apply(lambda x: base_url + x)

print(url_df.tail(10))

                                                       0
73851  https://www.brewersfriend.com/homebrew/recipe/...
73852  https://www.brewersfriend.com/homebrew/recipe/...
73853  https://www.brewersfriend.com/homebrew/recipe/...
73854  https://www.brewersfriend.com/homebrew/recipe/...
73855  https://www.brewersfriend.com/homebrew/recipe/...
73856  https://www.brewersfriend.com/homebrew/recipe/...
73857  https://www.brewersfriend.com/homebrew/recipe/...
73858  https://www.brewersfriend.com/homebrew/recipe/...
73859  https://www.brewersfriend.com/homebrew/recipe/...
73860  https://www.brewersfriend.com/homebrew/recipe/...


In [None]:
# Write a for loop: for each full url in the dataframe "url_df", perform the ladle function. This will build up a list: data

for index, row in url_df.iterrows():
    ladle(row[0])
    
print(data)

In [6]:
# Convert the list "data" into a dataframe: df
# This step took many hours to complete and was done in a separate Jupyter notebook with the same code.
# A sample tail output was generated below from a failed attempt.

df = pd.DataFrame(data)
print(df.tail())

                                                      0    1    2  \
3433  https://www.brewersfriend.com/homebrew/recipe/...  NaN  NaN   
3434  https://www.brewersfriend.com/homebrew/recipe/...  NaN  NaN   
3435  https://www.brewersfriend.com/homebrew/recipe/...  NaN  NaN   
3436  https://www.brewersfriend.com/homebrew/recipe/...  NaN  NaN   
3437  https://www.brewersfriend.com/homebrew/recipe/...  NaN  NaN   

                 3       4  
3433           NaN     NaN  
3434  159 calories  17.2 g  
3435  153 calories  14.8 g  
3436  206 calories  22.4 g  
3437  435 calories  29.3 g  


In [9]:
# Save the webscraped data as a csv file for later: reviewData_all.csv

df.to_csv('reviewData_all.csv')
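
The scraped calorie and carb fields are strings such as '159 calories' and '17.2 g', with the string 'NaN' where the page had no value. A minimal sketch of turning them into numbers is shown here. The helper name leading_number is hypothetical and not part of the original notebook.

```python
import math
import re

def leading_number(text):
    """Return the first number found in text as a float, or NaN if there is none.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    match = re.search(r'\d+(?:\.\d+)?', str(text))
    return float(match.group()) if match else math.nan

print(leading_number('159 calories'))  # 159.0
print(leading_number('17.2 g'))        # 17.2
print(leading_number('NaN'))           # nan
```

With pandas, the helper could probably be applied to a whole column, for example df[3].map(leading_number) for the calories column shown above.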

From step one, we have review data for the entries that could be scraped. Coverage was limited by two factors. 

The first limitation comes from the data itself. Only entries that had reviews and ratings at the time of scraping are included.

The second limitation is the time required for the scraping. Because the data set is so large, the scrape ran into timeout
errors from the server. This was possibly due to over-requesting, with the server rejecting requests to avoid a crash.
Another possibility is user error on my end, with incorrect settings on my computer interrupting the requests.
As such, only a portion of the data set was properly scraped over many hours. One possible optimization is to use 
dataframe methods instead of list methods for this harvesting, although the run time is likely dominated by the web requests themselves.
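
One common way to cope with intermittent timeouts is to retry a failed request after a growing pause. The sketch below is not what was run for this project; the function name, the number of attempts, and the delays are all illustrative assumptions.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises.

    fetch is any callable that returns the page text or raises on failure;
    all names and default values here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait 1s, 2s, 4s, ... before trying again
            sleep(base_delay * 2 ** attempt)
```

In this notebook, fetch could wrap the existing call, for example lambda u: requests.get(u, headers=headers, timeout=30).text. The timeout parameter is part of requests.get; the value 30 is an arbitrary choice.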

Going forward, I will use the best harvested review data csv I was able to get for the next steps: 'reviewData_16822.csv'
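
Since the scrape only completed in part, one way to avoid starting over after an interruption is to write each row to disk as it arrives and skip URLs that are already saved. This is a sketch under assumptions: scrape_one is a hypothetical variant of ladle that returns a single row instead of appending to the global list, and the checkpoint file layout (URL in the first column) is my own choice.

```python
import csv
import os

def already_scraped(path):
    """Return the set of URLs in the first column of a checkpoint CSV, if it exists."""
    if not os.path.exists(path):
        return set()
    with open(path, newline='', encoding='utf-8') as f:
        return {row[0] for row in csv.reader(f) if row}

def scrape_remaining(urls, scrape_one, path):
    """Scrape only URLs missing from the checkpoint, appending each row as it arrives.

    scrape_one(url) is assumed to return a list such as
    [url, rating, reviews, calories, carbs].
    """
    done = already_scraped(path)
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for url in urls:
            if url in done:
                continue
            writer.writerow(scrape_one(url))
            f.flush()  # keep the file current in case the run is interrupted
```

Rerunning scrape_remaining with the same path picks up where the previous run stopped.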

2) Data Acquisition of Ingredient Data via Brewer's Friend API

First, the recipe ID numbers need to be extracted from the URLs in the original recipeData.csv data set. 
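
As a minimal sketch of this step, the ID can also be pulled straight out of a single URL path with a regular expression. The helper name recipe_id_from_path is hypothetical; the cells that follow use a split on '/' instead.

```python
import re

def recipe_id_from_path(path):
    """Return the numeric ID that follows '/view/' in a recipe path, or None."""
    match = re.search(r'/view/(\d+)', path)
    return match.group(1) if match else None

print(recipe_id_from_path('/homebrew/recipe/view/615556/blonde-stout'))  # 615556
print(recipe_id_from_path('/homebrew/recipe/view/586891/'))              # 586891
```

The ID is returned as a string, which matches what the split-based approach produces.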

In [18]:
url_df = pd.read_csv('recipeData_urls.csv', header= None, engine = 'python', encoding = 'ISO-8859-1')

print(url_df.tail(10))

                                                       0
73851          /homebrew/recipe/view/615556/blonde-stout
73852        /homebrew/recipe/view/618629/session-simcoe
73853  /homebrew/recipe/view/602248/chris-ford-wheat-ipa
73854  /homebrew/recipe/view/603016/x-files-american-ale
73855           /homebrew/recipe/view/607368/unicorn-pee
73856         /homebrew/recipe/view/609673/amber-alfie-2
73857               /homebrew/recipe/view/610955/rye-ipa
73858                      /homebrew/recipe/view/586891/
73859                      /homebrew/recipe/view/603788/
73860  /homebrew/recipe/view/613776/elvis-juice-ipa-c...


In [3]:
new_list = []
for index,row in url_df.iterrows():
    # Split the path on '/' so the recipe ID lands in its own column (column 4)
    temp = row[0].split('/')
    new_list.append(temp)
    
rec_df = pd.DataFrame(new_list)
print(rec_df.tail(10))

      0         1       2     3       4                      5
73851    homebrew  recipe  view  615556           blonde-stout
73852    homebrew  recipe  view  618629         session-simcoe
73853    homebrew  recipe  view  602248   chris-ford-wheat-ipa
73854    homebrew  recipe  view  603016   x-files-american-ale
73855    homebrew  recipe  view  607368            unicorn-pee
73856    homebrew  recipe  view  609673          amber-alfie-2
73857    homebrew  recipe  view  610955                rye-ipa
73858    homebrew  recipe  view  586891                       
73859    homebrew  recipe  view  603788                       
73860    homebrew  recipe  view  613776  elvis-juice-ipa-clone


In [4]:
# Keep only the column with the recipe IDs (column 4)

rec_id = rec_df[[4]]
print(rec_id.head())

       4
0   1633
1  16367
2   5920
3   5916
4  89534


In [5]:
# The list of recipe_IDs is saved for future reference: 'recipe_id.csv'
rec_id.to_csv('recipe_id.csv')

Using the newly created list of recipe IDs, rec_id, the ingredient list API can now be set up.

In [6]:
# Establish new headers and base url (domain) for the recipe and ingredients

my_headers = {'X-API-KEY': '1062c1a3650672bb65e9dc8c71bd7dfe4061166f'}
base_rec_url = 'https://api.brewersfriend.com/v1/recipes/'
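
The API key above is written directly into the notebook. A possible alternative, sketched here, is to read it from an environment variable so the key stays out of shared files. The variable name BREWERSFRIEND_API_KEY and both helper names are hypothetical; the URL pattern matches the one used in gather_ingredients.

```python
import os

BASE_REC_URL = 'https://api.brewersfriend.com/v1/recipes/'

def api_headers(env_var='BREWERSFRIEND_API_KEY'):
    """Build the request headers from an environment variable (hypothetical name)."""
    key = os.environ.get(env_var)
    if key is None:
        raise RuntimeError('Set the ' + env_var + ' environment variable first')
    return {'X-API-KEY': key}

def recipe_url(recipe_id):
    """Return the xml endpoint for one recipe, following this notebook's URL pattern."""
    return BASE_REC_URL + str(recipe_id) + '.xml'
```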

In [7]:
# Create five empty dataframes, one for each category of ingredient and place in a dictionary: ingredient_dict

fermentables = pd.DataFrame()
hops = pd.DataFrame()
misc = pd.DataFrame()
mash = pd.DataFrame()
yeast = pd.DataFrame()

ingredient_dict = {"FERMENTABLE": fermentables, "HOP": hops, "MISC": misc, 'MASH': mash,'YEAST': yeast}

In [8]:
# Define a function to help parse through xml data for each recipe: xml_sift()

from io import StringIO

def xml_sift(xml_file, recipe_id, xpath_loc, df):
    
    """ This function looks at the xml data from Brewer's Friend recipe API output, specifically looking at the Fermentables, Hops, 
    Misc, Mash Steps, and Yeast used and adds the value of Recipe ID in a new column. The MASH portion of the if-else statement is 
    needed because the xpath is unique compared to the other ingredients. The try-except clause is required to ignore recipes that 
    do not contain a given ingredient type (usually MISC values are missing). The df argument is accepted but not used.""" 
    
    # MASH steps are nested one level deeper than the other ingredient types
    if 'MASH' in xpath_loc:
        xpath = "/RECIPES/RECIPE/MASH/MASH_STEPS/MASH_STEP"
    else:
        xpath = "/RECIPES/RECIPE/" + xpath_loc + 'S/' + xpath_loc

    try:
        # Wrap the xml text in StringIO: newer pandas versions deprecate passing a literal xml string
        temp = pd.read_xml(StringIO(xml_file), xpath = xpath)
        temp.insert(0, "Recipe_ID", recipe_id, True)
        return temp
    except ValueError: 
        # No nodes matched the xpath for this recipe, so the function returns None
        pass
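
To make the two xpath shapes concrete, here is a small standard-library sketch that walks the same element nesting with xml.etree.ElementTree instead of pd.read_xml. The sample xml is made up and only mirrors the nesting that xml_sift assumes; it is not real API output, and rows_for is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up sample that only mirrors the element nesting assumed by xml_sift
SAMPLE_XML = """<RECIPES>
  <RECIPE>
    <HOPS>
      <HOP><NAME>Cascade</NAME><AMOUNT>0.028</AMOUNT></HOP>
      <HOP><NAME>Simcoe</NAME><AMOUNT>0.014</AMOUNT></HOP>
    </HOPS>
    <MASH>
      <MASH_STEPS>
        <MASH_STEP><NAME>Saccharification</NAME><STEP_TEMP>66</STEP_TEMP></MASH_STEP>
      </MASH_STEPS>
    </MASH>
  </RECIPE>
</RECIPES>"""

def rows_for(xml_text, recipe_id, kind):
    """Return one dict per matching node, with the recipe ID added to each."""
    root = ET.fromstring(xml_text)  # root is the <RECIPES> element
    if kind == 'MASH':
        path = './RECIPE/MASH/MASH_STEPS/MASH_STEP'
    else:
        path = './RECIPE/' + kind + 'S/' + kind
    rows = []
    for node in root.findall(path):
        row = {'Recipe_ID': recipe_id}
        row.update({child.tag: child.text for child in node})
        rows.append(row)
    return rows

print(rows_for(SAMPLE_XML, 615556, 'HOP'))
print(rows_for(SAMPLE_XML, 615556, 'MISC'))  # the sample has no MISC nodes
```

A missing ingredient type simply yields an empty list here, which plays the role of the ValueError branch in xml_sift.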

In [9]:
# Define a function to get ingredient data for a given recipe: gather_ingredients()

def gather_ingredients(recipe_id):
    
    """This function uses the domain (base_rec_url) and the input recipe_id to request an xml file from the
    Brewer's Friend API (https://docs.brewersfriend.com/api/recipes). It uses a standard requests.get call to retrieve
    the xml file. This file cannot be converted into a json. The for loop works through each of the five ingredient
    types using the dictionary ingredient_dict. The results from xml_sift are concatenated onto the respective dataframe
    by adding rows under the previous recipe (axis = 0). 
    
    The '.reset_index()' method was required to get this to work because otherwise index 0-5 was rewritten multiple times.
    The try-except clause is used to skip missing entries or xpaths passed from the previous function (e.g.
    entries that don't have MISC ingredients listed will be ignored)."""
    
    url = base_rec_url + str(recipe_id) + '.xml'

    r = requests.get(url, headers=my_headers)

    xml_file = r.text
    
    for ingredient in ingredient_dict:
        try: 
            ingredient_dict[ingredient] = pd.concat([ingredient_dict[ingredient], xml_sift(xml_file, recipe_id, ingredient, ingredient_dict[ingredient])], axis = 0, ignore_index = True, sort = False).reset_index(drop=True)
        except Exception:
            continue
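
Concatenating onto a growing dataframe for every recipe copies the accumulated rows each time. A possible alternative, sketched below with plain lists, is to collect rows per ingredient type and build each dataframe once at the end. The names collect_rows and fetch_rows are hypothetical, and fetch_rows stands in for a request-and-parse step that returns a list of dict rows.

```python
from collections import defaultdict

def collect_rows(recipe_ids, fetch_rows, kinds=('FERMENTABLE', 'HOP', 'MISC', 'MASH', 'YEAST')):
    """Gather rows per ingredient type, skipping types a recipe does not have.

    fetch_rows(recipe_id, kind) is an assumed callable returning a list of
    dict rows (possibly empty) for one recipe and one ingredient type.
    """
    collected = defaultdict(list)
    for recipe_id in recipe_ids:
        for kind in kinds:
            collected[kind].extend(fetch_rows(recipe_id, kind))
    return collected
```

Each list could then be turned into a dataframe in one step, for example pd.DataFrame(collected['HOP']).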
    

In [10]:
# Write a for loop: for each recipe ID in the dataframe "rec_id", perform the gather_ingredients function. 

for index, row in rec_id.iterrows():
    gather_ingredients(row[4])   

In [11]:
# Checking the shape of the FERMENTABLE dataframe 

print(ingredient_dict['FERMENTABLE'].shape)

(294096, 10)


In [12]:
# Save each of the dataframes in 'ingredient_dict' as a csv file

for k, v in ingredient_dict.items():
    v.to_csv(path_or_buf = k+'.csv', index=False)

From step two, we have five CSV files, one per ingredient type (e.g. 'FERMENTABLE.csv'), each containing that ingredient data for every recipe retrieved. 

In the data wrangling component, using the six CSV files gathered here and the original data set, we'll go through and 
design the final data set that will be used for modeling. 
