# Craft Beers Dataset Analysis
This notebook contains data from the [Craft Beers Dataset](https://www.kaggle.com/nickhould/craft-cans) obtained from Kaggle. The dataset contains two files, one with information on breweries and one with information on individual beers.
## Project Goals
The goal for this project is to explore the dataset to identify information about the market of craft beer brewers and their beers. I will use this notebook to showcase skills in data manipulation and visualization.

## Initial Data Preparation

Import the necessary Python libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load in the data files and give a preview of each

In [None]:
beers = pd.read_csv('../input/craft-cans/beers.csv')

In [None]:
beers.head()

In [None]:
breweries = pd.read_csv('../input/craft-cans/breweries.csv')

In [None]:
breweries.head()

Change some of the column names to be easier to understand

In [None]:
beers.rename(columns={'id': 'beer_id', 'name': 'beer_name'}, inplace=True)

In [None]:
beers.head()

Create a brewery_id column from the index of the brewery dataframe, then merge the two dataframes on that column.

In [None]:
# Create a column for the brewery index
breweries['brewery_id'] = breweries.index

# Merge the two dataframes together based on brewery ID
df = beers.merge(breweries, on='brewery_id')

In [None]:
df.head()
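
The merge above uses the default inner join, so any beer whose brewery_id had no match would be dropped silently. Below is a minimal sketch of how that could be checked. It uses small hypothetical frames (toy_beers, toy_breweries) in place of the real files. The how='left', validate, and indicator arguments are pandas merge options that the original cell does not use.

```python
import pandas as pd

# Hypothetical toy frames standing in for beers.csv and breweries.csv
toy_beers = pd.DataFrame({
    'beer_id': [1, 2, 3],
    'beer_name': ['Pale One', 'Dark Two', 'Hoppy Three'],
    'brewery_id': [0, 0, 1],
})
toy_breweries = pd.DataFrame({
    'name': ['Brewery A', 'Brewery B'],
    'state': [' WA', ' OR'],
})
toy_breweries['brewery_id'] = toy_breweries.index

# validate='many_to_one' raises a MergeError if a brewery_id is duplicated
# on the brewery side; indicator=True adds a '_merge' column
merged = toy_beers.merge(
    toy_breweries, on='brewery_id', how='left',
    validate='many_to_one', indicator=True,
)

# Beers whose brewery_id has no match are flagged as 'left_only'
unmatched = int((merged['_merge'] == 'left_only').sum())
print('Beers without a matching brewery:', unmatched)
```

On the real data, a non-zero count would mean that some beers reference a brewery that is missing from the breweries file.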

Drop the unnecessary columns

In [None]:
df.drop(labels=['Unnamed: 0_x', 'Unnamed: 0_y'], axis=1, inplace=True)

In [None]:
df.head()
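
Columns named Unnamed: 0 typically appear when a CSV was saved together with its index. As an alternative to dropping them after the merge, the index column can be consumed at load time. This is a sketch on a hypothetical in-memory CSV (csv_text is made up for illustration), not the notebook's actual loading step.

```python
import io
import pandas as pd

# Hypothetical CSV text mimicking a file that was saved with its index
csv_text = (
    ",name,city,state\n"
    "0,Brewery A,Seattle, WA\n"
    "1,Brewery B,Portland, OR\n"
)

# index_col=0 uses the first (unnamed) column as the index instead of
# loading it as an 'Unnamed: 0' column
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy.columns.tolist())
```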

# Exploratory Data Analysis
## Exploration of the Breweries

Explore the breweries file and provide insight on brewery locations.

In [None]:
breweries.head(10)

In [None]:
breweries.isnull().sum()

In [None]:
print('Number of records:', breweries.shape[0])

The breweries file contains 558 records, with 4 columns including the name, city, and state of the breweries.

In [None]:
# Plot a bar chart with the number of breweries in each state
breweries_by_state = breweries['state'].value_counts()
plt.figure(figsize=(10,8))
sns.barplot(x=breweries_by_state.index, y=breweries_by_state.values)
plt.title('Number of Breweries by State')
plt.ylabel('Number of Breweries')
plt.xlabel('State')
plt.xticks(rotation='vertical')

All 50 states are represented in the dataset, and DC is included as well. Colorado has the most breweries in the dataset, followed by California, Michigan, and then Oregon.
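
The number of distinct states and the ranking can also be read off directly rather than from the chart. This is a small sketch on hypothetical rows. On the full breweries dataframe, nunique() would be expected to return 51 if all 50 states and DC are present.

```python
import pandas as pd

# Hypothetical sample of the 'state' column
toy_breweries = pd.DataFrame({'state': [' CO', ' CA', ' CO', ' DC', ' MI']})

# Number of distinct state codes in the column
n_regions = toy_breweries['state'].nunique()

# value_counts sorts by count in descending order, so head(3) gives the top 3
top_states = toy_breweries['state'].value_counts().head(3)

print('Distinct states:', n_regions)
print(top_states)
```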

In [None]:
# Create a series that contains the number of breweries in the 20 cities with the highest brewery count
breweries_by_city = breweries.groupby('city')['name'].count().nlargest(20)
plt.figure(figsize=(10,8))
sns.barplot(x=breweries_by_city.index, y=breweries_by_city.values)
plt.title('Top 20 Cities by Number of Breweries')
plt.ylabel('Number of Breweries')
plt.xlabel('City')
plt.xticks(rotation=45)

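This chart shows the 20 cities with the most breweries. Portland leads the pack with nearly twice as many as the next city, Boulder, CO. Note that the grouping uses the city name alone, so cities that share a name across states (such as a Portland in Oregon and a Portland in Maine) would be counted together.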
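
Here is a minimal sketch of separating same-named cities by grouping on both city and state. The toy rows are hypothetical and only illustrate the difference between the two groupings.

```python
import pandas as pd

# Hypothetical rows: two different cities that share the name 'Portland'
toy_breweries = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'city': ['Portland', 'Portland', 'Portland', 'Boulder'],
    'state': [' OR', ' OR', ' ME', ' CO'],
})

# Grouping on the city name alone merges both Portlands
by_city = toy_breweries.groupby('city')['name'].count()

# Grouping on city and state keeps them apart
by_city_state = toy_breweries.groupby(['city', 'state'])['name'].count()

print(by_city.loc['Portland'])
print(by_city_state.loc[('Portland', ' OR')])
```

For plotting, the two index levels could be joined into a single label such as city plus state before passing them to the bar chart.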

Let's look more closely at my home state of Washington to see how its breweries break down by city.

In [None]:
breweries.head()

In [None]:
# Create a new dataframe with only the Washington breweries
washington_breweries = breweries[breweries['state']==' WA']
print('Number of breweries located in Washington:', washington_breweries.shape[0])

There are a total of 23 breweries in the dataset that are located in Washington state.
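
The filter above compares against ' WA' with a leading space, which only works as long as the state values carry that exact padding. A hedged sketch of a more forgiving variant follows. It strips whitespace before comparing and is shown on hypothetical rows, not the real file.

```python
import pandas as pd

# Hypothetical rows with padded state codes
toy_breweries = pd.DataFrame({
    'name': ['Brewery A', 'Brewery B', 'Brewery C'],
    'state': [' WA', ' OR', ' WA'],
})

# Remove leading and trailing whitespace once, then filter on the clean code
toy_breweries['state'] = toy_breweries['state'].str.strip()
toy_washington = toy_breweries[toy_breweries['state'] == 'WA']

print('Number of breweries located in Washington:', toy_washington.shape[0])
```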

In [None]:
# Count the breweries in each Washington city featured in the dataset and plot the result
wa_breweries_by_city = washington_breweries['city'].value_counts()
plt.figure(figsize=(8,6))
sns.barplot(x=wa_breweries_by_city.index, y=wa_breweries_by_city.values)
plt.title('Number of Breweries in Washington Cities')
plt.ylabel('Number of Breweries')
plt.xlabel('City')
plt.xticks(rotation=45)

Seattle is the hotspot for breweries in Washington state, with a total of 9.

## Exploration of the Beers

In [None]:
beers.head()

In [None]:
print('Total number of individual beers in the dataset:', beers.shape[0])

In [None]:
beers.isnull().sum()

In [None]:
print('Average ABV of all beers:', beers['abv'].mean())
print('Average IBU of all beers:', beers['ibu'].mean())

* There are a total of 2,410 individual beers in the dataset.
* The average ABV (Alcohol By Volume) of all the beers in the dataset is 0.0597, or 5.97%. There are 62 records with a missing ABV value.
* The average IBU (International Bitterness Units) of all the beers in the dataset is 42.71. There are 1,005 records with a missing IBU value.
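
Taken as shares of the 2,410 records, about 2.6% of beers lack an ABV and about 41.7% lack an IBU, so the IBU average describes only roughly 58% of the dataset. pandas skips missing values when computing a mean. The sketch below shows both points on hypothetical values.

```python
import numpy as np
import pandas as pd

# Hypothetical values with gaps, standing in for the abv and ibu columns
toy_beers = pd.DataFrame({
    'abv': [0.05, 0.07, np.nan, 0.06],
    'ibu': [40.0, np.nan, np.nan, 60.0],
})

# The mean of a boolean mask is the fraction of missing values per column
missing_share = toy_beers[['abv', 'ibu']].isnull().mean()

# mean() ignores NaN by default, so only the two known IBU values are averaged
mean_ibu = toy_beers['ibu'].mean()

print(missing_share)
print('Average IBU over non-missing rows:', mean_ibu)
```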

Let's look into how both the ABV and IBU of the beers in the dataset are distributed

In [None]:
# Create a histogram of ABV
plt.figure(figsize=(8,8))
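# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a histogram with a density curve
sns.histplot(data=beers, x='abv', kde=True)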
plt.title('Histogram of Beer ABV')
plt.ylabel('Frequency')
plt.xlabel('ABV')

The majority of beers have an ABV around 5%.

In [None]:
# Create a histogram of IBU
plt.figure(figsize=(8,8))
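# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a histogram with a density curve
sns.histplot(data=beers, x='ibu', kde=True)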
plt.title('Histogram of Beer IBU')
plt.ylabel('Frequency')
plt.xlabel('IBU')

Is the amount of alcohol and the bitterness of any given beer related? Let's find out.

In [None]:
# Create a scatter plot comparing ABV and IBU and plot a regression line
plt.figure(figsize=(10,8))
sns.regplot(x=beers['abv'], y=beers['ibu'])
plt.title('ABV vs. IBU')
plt.xlabel('ABV')
plt.ylabel('IBU')

In [None]:
# Create a joint kde plot of ABV and IBU
sns.jointplot(data=beers, x='abv', y='ibu', kind='kde')

The first chart above compares the ABV and IBU of the beers in the dataset and plots a regression line. As we might expect, the two are positively correlated: beers with more alcohol tend to be more bitter. Only beers with both values recorded contribute to this comparison.
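
The strength of the relationship can be summarized with a Pearson correlation coefficient computed on the rows where both values are present. This is a minimal sketch on hypothetical numbers. It does not reproduce the coefficient for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ABV/IBU pairs, some with one value missing
toy_beers = pd.DataFrame({
    'abv': [0.04, 0.05, 0.06, 0.07, np.nan, 0.09],
    'ibu': [15.0, 30.0, np.nan, 65.0, 45.0, 90.0],
})

# Keep only rows where both ABV and IBU are known
pairs = toy_beers[['abv', 'ibu']].dropna()

# Series.corr computes the Pearson correlation by default
r = pairs['abv'].corr(pairs['ibu'])

print('Rows with both values:', len(pairs))
print('Pearson r:', round(r, 3))
```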

Let's see what the most popular styles of beers are.

In [None]:
# Create a series with the 20 styles that have the most records in the dataframe
beers_by_type = beers['style'].value_counts().nlargest(20)
beers_by_type

In [None]:
# Plot the top 20 most popular beer styles
plt.figure(figsize=(8,6))
sns.barplot(x=beers_by_type.index, y=beers_by_type.values)
plt.title('Top 20 Styles of Beer that Appear in Dataset')
plt.xlabel('Beer Style')
plt.ylabel('Number of Beers')
plt.xticks(rotation='vertical')

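As we can see, the most common style in the dataset is the American IPA, followed by the American Pale Ale. Note that this measures how many beers of each style appear in the dataset, not how much of each is sold.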
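
Raw counts can be turned into shares of all beers, which makes the lead of the top style easier to judge. This is a sketch on hypothetical style labels, using the normalize option of value_counts.

```python
import pandas as pd

# Hypothetical style labels
toy_styles = pd.Series(
    ['American IPA'] * 3 + ['American Pale Ale (APA)'] * 2 + ['Stout']
)

# normalize=True returns each style's fraction of the non-missing records
style_share = toy_styles.value_counts(normalize=True)

print(style_share)
```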

In [None]:
print(f"The highest ABV beer: {beers['abv'].max() * 100:.1f}%")
print(f"The lowest ABV beer: {beers['abv'].min() * 100:.1f}%")

In [None]:
print('The highest IBU beer:', beers['ibu'].max())
print('The lowest IBU beer:', beers['ibu'].min())

Let's look at which styles of beer have the highest average ABV and the highest average IBU.

In [None]:
highest_avg_abv = beers.groupby('style')[['abv']].mean().nlargest(10, columns='abv')
highest_avg_abv.sort_values(by='abv', ascending=False)

In [None]:
highest_avg_ibu = beers.groupby('style')[['ibu']].mean().nlargest(10, columns='ibu')
highest_avg_ibu.sort_values(by='ibu', ascending=False)

Maybe you prefer a beer that is not so bitter. This list will help you pick a style that on average has a low IBU.

In [None]:
lowest_avg_ibu = beers.groupby('style')[['ibu']].mean().nsmallest(10, columns='ibu')
lowest_avg_ibu.sort_values(by='ibu', ascending=True)
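
Style averages like the ones above can be dominated by styles with only one or two recorded values. The sketch below reports the count next to the mean and filters on it. The data and the min_count threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical beers; 'Barleywine' has a single IBU value
toy_beers = pd.DataFrame({
    'style': ['IPA', 'IPA', 'IPA', 'Stout', 'Stout', 'Barleywine'],
    'ibu': [60.0, 70.0, 80.0, 30.0, 40.0, 100.0],
})

# 'count' is the number of non-missing IBU values per style
style_stats = toy_beers.groupby('style')['ibu'].agg(['mean', 'count'])

# Hypothetical threshold: keep styles with at least this many IBU values
min_count = 2
reliable = (
    style_stats[style_stats['count'] >= min_count]
    .sort_values('mean', ascending=False)
)

print(reliable)
```

With the threshold applied, the single-beer style no longer tops the ranking even though its one value is the highest.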

## Exploring the Merged Dataframe

In [None]:
df.head()

Which states brew beer with the highest average ABV?

In [None]:
avg_abv_by_state = df.groupby('state')[['abv']].mean().sort_values(by='abv', ascending=False)

In [None]:
plt.figure(figsize=(8,11))
sns.barplot(y=avg_abv_by_state.index, x=avg_abv_by_state['abv'], orient='h')
plt.title('States Ranked by Average ABV')
plt.ylabel('State')
plt.xlabel('ABV')

In [None]:
avg_ibu_by_state = df.groupby('state')[['ibu']].mean().sort_values(by='ibu', ascending=False)

plt.figure(figsize=(8,11))
sns.barplot(y=avg_ibu_by_state.index, x=avg_ibu_by_state['ibu'], orient='h')
plt.title('States Ranked by Average IBU')
plt.ylabel('State')
plt.xlabel('IBU')

* Nevada has the highest average ABV in its beers, while Utah has the lowest.
* West Virginia has the highest average IBU in its beers, while Wisconsin has the lowest.