# Capstone Project

## Problem Statement

The world of craft beer has been growing steadily since the late 1970s. Nowadays you can easily find hundreds of varieties by hundreds if not thousands of different brewers. Even in most supermarkets you can find dozens of different beers available where before it was not uncommon to only see a few.

Along with a greater interest in craft beer has come greater interest and participation in homebrewing. There are huge organizations of people who are creating and innovating with their own beers. Some homebrewers work in the industry, while others are just enthusiasts. All share a love of trying

With this incredible growth in the craft beer marketplace, are commercial breweries making the same types of beers that homebrewers are? If not, there is a potential to increase sales by putting out products for which demand is high, but supply does not yet exist.

In this project I would like to:

1. Analyze what styles of homebrew recipes are being made using clustering based on data scraped from www.brewtoad.com

2. Using Bayesian inference, generate 95% credible intervals for each cluster in terms of mean ABV, FG, OG, SRM, and IBU
    - Time Permitting: Also generate Bayesian probabilities of using specific grains/hops/yeast in a recipe
    - Time Permitting: Generate credible intervals for amount or proportion of ingredient used


3. Compare the generated cluster parameters with the BJCP guidelines to determine whether any BJCP categories should be expanded or combined to more accurately reflect current homebrewing trends, or whether the BJCP is missing any categories that exist within the homebrewing community.
4. Write recommendations for the BJCP to use in developing the next set of style guidelines.
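
As a rough illustration of step 2, the sketch below computes a 95% credible interval for a mean under simplifying assumptions that are mine rather than a settled part of this project: a normal likelihood, a flat prior on the mean, and the sample standard deviation plugged in as if it were known. Under those assumptions the posterior of the mean is approximately normal, centered on the sample mean. The function name `credible_interval_mean` is hypothetical, and the five ABV values are simply the first rows of the sample data shown later, not a real cluster.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def credible_interval_mean(values, level=0.95):
    """Approximate credible interval for the mean of `values`.

    Assumes a normal likelihood, a flat prior on the mean, and a plug-in
    standard deviation, so the posterior is Normal(mean, s / sqrt(n)).
    """
    n = len(values)
    posterior = NormalDist(mu=mean(values), sigma=stdev(values) / sqrt(n))
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)


# Illustrative values only: ABV of the first five recipes in the sample file
abv_sample = [0.054, 0.066, 0.069, 0.056, 0.050]
low, high = credible_interval_mean(abv_sample)
print(f"95% credible interval for mean ABV: ({low:.4f}, {high:.4f})")
```

With only a handful of recipes the normal approximation is crude; a full treatment would place a prior on the variance as well, which is what a dedicated Bayesian modeling library would be used for.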

---
## Gathering Data

Data is coming from 2 separate sources:
1. Brewtoad Scraper: This is a scraper I wrote to pull recipes from the website www.brewtoad.com. The scraper is currently running on an AWS EC2 instance to gather data. It collects the name, style, base statistics, and full recipe details of each recipe it scans.
2. 2015 BJCP Guidelines: [These guidelines](https://www.bjcp.org/docs/2015_Guidelines_Beer.pdf) are the most recent officially published beer guidelines. A machine-readable XML version of this PDF document can be found [here](https://www.bjcp.org/docs/2015_styleguide.xml).

---
## EDA and Data Cleaning

This section is currently for illustrative purposes only, as my scraping script has not finished running yet.

In [1]:
import pandas as pd
import pandas_profiling as pdpro

In [24]:
# Import baseline csv file
df = pd.read_csv('../Data/recipes/brewtoad_recipes_100.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,ABV,FG,IBU,OG,SRM,boil_time,boil_time_units,extras,fermentables,hops,name,style,volume,volume_units,yeast_attenuation,yeast_lab,yeast_name
0,0,0.054,1.012,18,1.053,4,90,min,0,"{0: {'amount': 8.0, 'amount_unit': 'lb', 'name...","{0: {'amount': 0.5, 'amount_unit': 'oz', 'name...",Firework Cream Ale,Cream Ale,6.0,gal,0.775,White Labs WLP080,Cream Ale Yeast Blend
1,1,0.066,1.018,68,1.068,6,60,min,0,"{0: {'amount': 11.5, 'amount_unit': 'lb', 'nam...","{0: {'amount': 2.5, 'amount_unit': 'oz', 'name...",Cascade IPA,American IPA,5.0,gal,0.74,Wyeast 1272,American Ale II
2,2,0.069,1.018,53,1.07,8,60,min,1,"{0: {'amount': 12.75, 'amount_unit': 'lb', 'na...","{0: {'amount': 0.5, 'amount_unit': 'oz', 'name...",3 Floyd's Zombie Dust Clone,American IPA,5.25,gal,0.75,White Labs WLP002,English Ale Yeast
3,3,0.056,1.016,31,1.059,24,60,min,0,"{0: {'amount': 8.0, 'amount_unit': 'lb', 'name...","{0: {'amount': 2.0, 'amount_unit': 'oz', 'name...",Nut Brown Ale,American Brown Ale,5.0,gal,0.735,Fermentis S-04,Safale S-04
4,4,0.05,1.012,12,1.05,3,60,min,0,"{0: {'amount': 5.5, 'amount_unit': 'lb', 'name...","{0: {'amount': 0.5, 'amount_unit': 'oz', 'name...",Hefe IV,Weizen/Weissbier,6.0,gal,0.765,White Labs WLP380,Hefeweizen IV Ale Yeast


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2992 entries, 0 to 2991
Data columns (total 18 columns):
Unnamed: 0           2992 non-null int64
ABV                  2992 non-null float64
FG                   2992 non-null float64
IBU                  2992 non-null int64
OG                   2992 non-null float64
SRM                  2992 non-null int64
boil_time            2992 non-null int64
boil_time_units      2992 non-null object
extras               2992 non-null int64
fermentables         2992 non-null object
hops                 2992 non-null object
name                 2992 non-null object
style                2992 non-null object
volume               2992 non-null float64
volume_units         2992 non-null object
yeast_attenuation    2992 non-null float64
yeast_lab            2878 non-null object
yeast_name           2992 non-null object
dtypes: float64(5), int64(5), object(8)
memory usage: 420.8+ KB


In [13]:
df.isnull().sum()

Unnamed: 0             0
ABV                    0
FG                     0
IBU                    0
OG                     0
SRM                    0
boil_time              0
boil_time_units        0
extras                 0
fermentables           0
hops                   0
name                   0
style                  0
volume                 0
volume_units           0
yeast_attenuation      0
yeast_lab            114
yeast_name             0
dtype: int64

The data are largely what I was expecting: most of the data types have been correctly determined by the scraping script I wrote, and there are not many missing values. The only column with missing values is the `yeast_lab` column, where 114 of the 2992 entries are null. This occurred because people used generic yeast names when entering their recipes, rather than a specific lab and culture number. I will have to investigate more of the data from the scraper before deciding how to handle this.

### Unit Differences and Batch Size Differences

In my own personal usage of brewtoad, I have noticed recipes using different units for various parts of the recipe. The most common differences are given in the weight of the grains, the weight of the hops, and the boil size of the batch. In order to make accurate comparisons amongst the values, I will have to convert all of these units to be the same. I'll also have to scale all recipes to the same batch size; a recipe to make a 10 gallon batch will obviously be using more ingredients than a recipe used to make a 5 gallon batch.
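
The unit conversion and batch scaling described above could look roughly like the sketch below. It assumes the `fermentables` and `hops` columns hold stringified dictionaries with `amount`, `amount_unit`, and `name` keys, as the sample rows suggest, and that the batch volume has already been converted to gallons. The function name `normalize_ingredients`, the 5 gallon reference batch, the choice of pounds as the common unit, and the ingredient name in the example are all my own illustrative assumptions.

```python
import ast

# Conversion factors to pounds for the units seen in the sample rows
TO_LB = {'lb': 1.0, 'oz': 0.0625, 'kg': 2.20462262, 'g': 0.00220462262}
TARGET_GAL = 5.0  # assumed reference batch size


def normalize_ingredients(raw, batch_gal, target_gal=TARGET_GAL):
    """Convert ingredient amounts to pounds and rescale to the target batch size."""
    # Cells read from the CSV are strings, so parse them into dicts first
    items = ast.literal_eval(raw) if isinstance(raw, str) else raw
    scale = target_gal / batch_gal
    return {
        key: {
            **item,
            'amount': item['amount'] * TO_LB[item['amount_unit']] * scale,
            'amount_unit': 'lb',
        }
        for key, item in items.items()
    }


# Made-up ingredient entry in the same shape as the scraped data
example = "{0: {'amount': 5.9, 'amount_unit': 'kg', 'name': 'Pale Malt'}}"
print(normalize_ingredients(example, batch_gal=10.0))
```

An unknown unit raises a `KeyError` here, which is deliberate: it surfaces unexpected units in the full scrape instead of silently passing them through.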

In [25]:
# Find which units are used in the boil volume
df['volume_units'].value_counts()

gal    2618
L       374
Name: volume_units, dtype: int64

Since both gallons and liters were used to measure the boil volume, I first will have to convert all of these to the same unit. Since gallons are much more commonly used, I will convert all boil volumes in liters to boil volumes in gallons.

In [32]:
def liter_to_gal(row):
    """Return the row's batch volume in gallons."""
    volume = row['volume']
    # Look for where the units are liters
    if row['volume_units'] == 'L':
        # 1 liter = 0.2641720524 gallons, multiply by this conversion factor
        volume *= 0.2641720524
    return volume

In [33]:
df[df['volume_units'] == 'L']

Unnamed: 0.1,Unnamed: 0,ABV,FG,IBU,OG,SRM,boil_time,boil_time_units,extras,fermentables,hops,name,style,volume,volume_units,yeast_attenuation,yeast_lab,yeast_name
37,37,0.068,1.013,54,1.065,12,60,min,0,"{0: {'amount': 5.9, 'amount_unit': 'kg', 'name...","{0: {'amount': 44.4, 'amount_unit': 'g', 'name...",Indica Clone,English IPA,28.00,L,0.800,Danstar,Danstar Nottingham Ale
45,45,0.079,1.013,24,1.074,3,90,min,1,"{0: {'amount': 5.8, 'amount_unit': 'kg', 'name...","{0: {'amount': 47.0, 'amount_unit': 'g', 'name...",Duvel Clone,Belgian Golden Strong Ale,23.00,L,0.820,Fermentis,Abbaye
53,53,0.049,1.012,38,1.050,7,30,min,0,"{0: {'amount': 4879.0, 'amount_unit': 'g', 'na...","{0: {'amount': 15.0, 'amount_unit': 'g', 'name...",Centennial ESB,Extra Special/Strong Bitter (English Pale Ale),24.00,L,0.750,Fermentis S-04,Safale S-04
64,64,0.053,1.017,27,1.057,45,60,min,1,"{0: {'amount': 4.0, 'amount_unit': 'kg', 'name...","{0: {'amount': 45.0, 'amount_unit': 'g', 'name...",Vanilla Oatmeal Stout,Oatmeal Stout,23.00,L,0.700,Fermentis S-04,Safale S-04
69,69,0.052,1.012,11,1.052,2,60,min,0,"{0: {'amount': 2.8, 'amount_unit': 'kg', 'name...","{0: {'amount': 20.9, 'amount_unit': 'g', 'name...",Weihenstephaner Hefe Weissbier #1,Weizen/Weissbier,22.00,L,0.770,White Labs WLP300,Hefeweizen Ale Yeast
74,74,0.047,1.010,42,1.046,6,60,min,0,"{0: {'amount': 2.5, 'amount_unit': 'kg', 'name...","{0: {'amount': 20.0, 'amount_unit': 'g', 'name...",Stored XSS - BIAB APA,American Pale Ale,12.50,L,0.775,Fermentis US-05,Safale US-05
79,79,0.050,1.019,34,1.057,34,90,min,0,"{0: {'amount': 5400.0, 'amount_unit': 'g', 'na...","{0: {'amount': 40.0, 'amount_unit': 'g', 'name...",Nilssons Svart,Oatmeal Stout,25.00,L,0.665,White Labs WLP002,English Ale Yeast
96,96,0.068,1.004,31,1.056,5,60,min,1,"{0: {'amount': 9.1, 'amount_unit': 'kg', 'name...","{0: {'amount': 30.0, 'amount_unit': 'g', 'name...",#13003 Black Pepper Saison,Saison,50.00,L,0.925,Danstar,Belle Saison
121,121,0.072,1.009,70,1.064,7,90,min,1,"{0: {'amount': 12.7, 'amount_unit': 'kg', 'nam...","{0: {'amount': 50.0, 'amount_unit': 'g', 'name...",1337 Zombie (Citra Single Hop),American IPA,55.00,L,0.855,Fermentis US-05,Safale US-05
123,123,0.068,1.017,53,1.069,6,60,min,0,"{0: {'amount': 6.0, 'amount_unit': 'kg', 'name...","{0: {'amount': 20.0, 'amount_unit': 'g', 'name...",#2 Summer SMaSH IPA,American IPA,20.00,L,0.760,White Labs WLP001,California Ale Yeast


In [34]:
df['volume'] = df.apply(liter_to_gal, axis=1)
df['volume_units'] = 'gal'
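
A vectorized alternative to the row-wise `apply` is sketched here on a small made-up frame (`toy` is hypothetical, not the scraped data). It uses a boolean mask with `.loc` so that only the liter rows are multiplied, which avoids calling a Python function once per row.

```python
import pandas as pd

L_TO_GAL = 0.2641720524  # gallons per liter

# Toy frame with the same two columns as the recipe data
toy = pd.DataFrame({
    'volume': [6.0, 28.0, 5.0],
    'volume_units': ['gal', 'L', 'gal'],
})

# Convert only the rows measured in liters, then relabel every row
is_liter = toy['volume_units'] == 'L'
toy.loc[is_liter, 'volume'] *= L_TO_GAL
toy['volume_units'] = 'gal'

print(toy)
```

The mask has to be computed before the units column is overwritten; otherwise no rows would match `'L'` and nothing would be converted.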

In [14]:
df['volume'].value_counts()

5.00      764
5.50      702
6.00      251
10.00     108
5.49       78
11.00      77
20.00      69
3.00       64
1.00       46
5.25       38
6.50       37
25.00      32
2.50       30
23.00      30
12.00      29
3.50       26
15.00      23
7.00       19
3.01       18
4.50       18
21.00      16
19.00      15
10.50      15
16.00      15
10.01      15
24.00      13
18.93      13
5.20       13
4.00       12
5.75       12
         ... 
7.93        1
34.07       1
775.00      1
6.70        1
9.01        1
6.34        1
13.20       1
5.35        1
1.33        1
37.85       1
4.30        1
3.70        1
108.50      1
12.60       1
7.57        1
13.21       1
12.15       1
1.51        1
6.26        1
3.96        1
12.20       1
10.49       1
15.85       1
5.39        1
3.10        1
19.99       1
4.80        1
3.79        1
6.80        1
5.40        1
Name: volume, Length: 187, dtype: int64

In [5]:
df['boil_time_units'].value_counts()

min    2992
Name: boil_time_units, dtype: int64

In [7]:
def liter_to_gal(row):
    if row['volume_units'] == 'L':
        row['volume'] *= 0.2641720524
    return row['volume']   

In [11]:
df.apply(liter_to_gal, axis=1)

0        6.000000
1        5.000000
2        5.250000
3        5.000000
4        6.000000
5        5.000000
6        5.500000
7        5.500000
8        5.500000
9        5.500000
10       5.000000
11       5.000000
12       5.500000
13       5.500000
14       5.000000
15       5.500000
16       5.500000
17       5.500000
18       6.600000
19       5.000000
20       5.500000
21       5.000000
22       5.000000
23       5.500000
24       5.500000
25       6.000000
26       5.000000
27       5.500000
28       5.000000
29       5.490000
          ...    
2962     6.075957
2963     5.000000
2964     5.490000
2965    10.000000
2966     5.000000
2967     3.000000
2968     3.040000
2969     5.000777
2970     5.260000
2971     6.000000
2972     5.000000
2973     5.000000
2974     5.250000
2975     3.000000
2976     6.100000
2977     5.000000
2978    10.000000
2979     3.010000
2980     5.500000
2981     5.500000
2982     5.000000
2983     5.490000
2984    10.500000
2985    11.000000
2986     3

In [12]:
2991*25

74775