# BEvERage Analysis

In this lab, we'll practice using Pandas by exploring a dataset of beer reviews. 

We'll work with a small slice of the data. The full beer review dataset is surprisingly large ... or maybe not that surprising, since reviewing beer seems like the kind of job that would be hard to give up so long as one more beer was out there :)

First we'll import Pandas and retrieve the data:

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/beer_small.csv')

df

How many reviews are there?

In [None]:
len(df)

How can we tell if there are missing values?

In [None]:
df.count()
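
`count()` reports the number of non-null values per column, so any column whose count is smaller than `len(df)` has missing values. A more direct check is to count the nulls themselves. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from the beer file) to show that the two views agree:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the beer reviews
toy = pd.DataFrame({
    "beer_name": ["A", "B", "C", "D"],
    "beer_abv": [5.0, np.nan, 0.5, np.nan],
    "review_overall": [4.0, 3.5, np.nan, 2.0],
})

total_rows = len(toy)                  # number of rows, nulls included
non_null = toy.count()                 # non-null values per column
missing_per_column = toy.isna().sum()  # null values per column

# For every column, non-null + null adds up to the row count
print(missing_per_column)
```

On the real data, the same idea would be `df.isna().sum()`.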

Since most reviews have data for most fields, let's drop the records with incomplete data.

In [None]:
df2 = df.dropna()

In [None]:
df2.count()
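
By default, `dropna()` removes every row that has at least one missing value and returns a new DataFrame, leaving the original untouched. If only some columns matter, the `subset` parameter restricts the check. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row B lacks an ABV, row C lacks a score
toy = pd.DataFrame({
    "beer_name": ["A", "B", "C", "D"],
    "beer_abv": [5.0, np.nan, 0.5, 4.2],
    "review_overall": [4.0, 3.5, np.nan, 2.0],
})

complete = toy.dropna()                      # drop rows with any missing value
abv_known = toy.dropna(subset=["beer_abv"])  # only require beer_abv to be present

rows_dropped = len(toy) - len(complete)
print(rows_dropped, len(abv_known))
```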

Let's get summary statistics for the numeric columns ... things like review score and ABV.

In [None]:
df2.describe()

There are some really low-alcohol beers in there ... maybe even bogus data.

Find all entries with ABV less than 1%.

In [None]:
low_abv = df2[df2.beer_abv < 1]

low_abv
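
The filter works in two steps: the comparison produces a boolean Series (one `True`/`False` per row), and indexing the DataFrame with that Series keeps only the `True` rows. Here is a small sketch on made-up data (beer names and values are hypothetical) that spells out the intermediate mask and shows the equivalent `.loc` form:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "beer_name": ["Lager", "Near Beer", "Stout", "Malt Soda"],
    "beer_abv": [5.0, 0.5, 8.0, 0.0],
})

mask = toy["beer_abv"] < 1  # boolean Series, one entry per row
low = toy[mask]             # keep only the rows where mask is True

# Same selection with .loc, which can also pick columns in one step
low_loc = toy.loc[toy["beer_abv"] < 1, ["beer_name", "beer_abv"]]

print(low)
```

Bracket access (`toy["beer_abv"]`) and attribute access (`toy.beer_abv`) return the same column here; brackets are the safer habit when a column name contains spaces or clashes with a DataFrame method.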

How many of these reviews are there?

In [None]:
len(low_abv)

Some of these are multiple reviews for the same beer, which is allowed (and even encouraged). Let's group by beer and count.

In [None]:
grouping = low_abv.groupby('beer_name')
grouping.size()
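
`size()` returns a Series indexed by group key, so it can be sorted like any other Series to see which beers are reviewed most. The sketch below uses made-up data (the `toy` frame is hypothetical) and also shows `value_counts()`, which yields the same counts for a single column:

```python
import pandas as pd

# Hypothetical toy data: three reviews of X, two of Y, one of Z
toy = pd.DataFrame({
    "beer_name": ["X", "Y", "X", "Z", "X", "Y"],
    "review_overall": [4.0, 3.0, 4.5, 2.0, 3.5, 3.0],
})

# Rows per beer, most-reviewed first
counts = toy.groupby("beer_name").size().sort_values(ascending=False)

# value_counts() sorts in descending order by default
same = toy["beer_name"].value_counts()

print(counts)
```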

How consistent are the O'Doul's overall scores?

In [None]:
scores = low_abv[low_abv.beer_name=="O'Doul's"]['review_overall']
scores

Let's plot a histogram.

In [None]:
scores.hist()

What are the mean and standard deviation for the O'Doul's overall scores?

In [None]:
scores.mean(), scores.std()
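
One detail worth knowing: `Series.std()` computes the sample standard deviation (it divides by n - 1, i.e. `ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). A minimal sketch with made-up scores:

```python
import pandas as pd

# Hypothetical scores, not taken from the beer data
toy_scores = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

mean = toy_scores.mean()
sample_sd = toy_scores.std()            # divides by n - 1 (pandas default)
population_sd = toy_scores.std(ddof=0)  # divides by n

print(mean, sample_sd, population_sd)
```

With only a handful of reviews the two can differ noticeably; with many reviews the difference becomes negligible.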

Going back to the full dataset, can we count reviews by brewery, and then by style within that brewery?

In [None]:
df2.groupby(['brewery_name', 'beer_style']).size()
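
Grouping by two columns produces a Series with a two-level index (brewery, then style). The sketch below, on made-up data (brewery and style names are hypothetical), shows how to look up one brewery and how `unstack` can pivot the inner level into columns for a brewery-by-style table:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "brewery_name": ["Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "beer_style": ["IPA", "IPA", "Stout", "IPA", "Lager"],
})

# Rows per (brewery, style) pair; the result has a two-level index
counts = toy.groupby(["brewery_name", "beer_style"]).size()

# Selecting on the outer level leaves a Series indexed by style
alpha = counts.loc["Alpha"]

# Pivot styles into columns; combinations that never occur become 0
table = counts.unstack(fill_value=0)

print(table)
```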

### Now we'll try to build up a slightly more complex report

Step 1: Find all rows corresponding to reviews where the beer style starts with "American"

In [None]:
all_american = df2[df2.beer_style.str.startswith('American')]
all_american
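
The `.str` accessor applies a string method to every element and returns a boolean Series that can be used as a mask. This works cleanly on `df2` because rows with missing values were already dropped. On a column that still has missing entries, the result would contain missing values too, and passing `na=False` treats those as non-matches. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing style
toy = pd.DataFrame({
    "beer_style": ["American IPA", "English Porter", None, "American Stout"],
    "review_overall": [4.0, 3.5, 3.0, 4.5],
})

# na=False makes missing styles count as "does not start with American"
mask = toy["beer_style"].str.startswith("American", na=False)
american = toy[mask]

print(american)
```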

Next, make a dataframe with just the `beer_style` and `review_overall` fields for those rows.

In [None]:
narrowed = all_american[['beer_style', 'review_overall']]
narrowed

Now we'll make a boxplot to capture the range and spread of the ratings. Pandas will do all the work if we call the built-in API. Look for it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

In [None]:
narrowed.boxplot(by='beer_style', vert=False, figsize=(12,10))
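
To read exact numbers off the plot, the same grouping can be summarized numerically. In each box, the line marks the median and the box edges mark the 25th and 75th percentiles, which are the `50%`, `25%`, and `75%` columns of `describe()`. A sketch on made-up ratings (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy ratings for two styles
toy = pd.DataFrame({
    "beer_style": ["American IPA"] * 4 + ["American Stout"] * 4,
    "review_overall": [3.0, 4.0, 4.0, 5.0, 2.0, 3.0, 4.0, 5.0],
})

# One row per style with count, mean, std, min, quartiles, and max
summary = toy.groupby("beer_style")["review_overall"].describe()

print(summary[["25%", "50%", "75%"]])
```

On the lab data, the equivalent call would be `narrowed.groupby('beer_style')['review_overall'].describe()`.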