# Module 6 Python Assignment

<div class="alert alert-block alert-warning"><b>In this assignment you will read through the notebook and complete the exercises. Once you are satisfied with the results, submit your notebook and html file to Canvas. Your files should include all output, i.e. run each cell and save your file before submitting.</b></div>

<div class="alert alert-block alert-info"> 
<b>Module 6 </b> is a continuation of EDA that you started in Module 5.  You will read in your file that you saved from Module 5 and continue on EDA with a focus on visuals. <br>
    
<b>Research project problem statement:</b> A brewery produces a number of signature beers and wants to expand its production into a different style of beer.  They have hired you to help them understand how beer reviewers rate the qualities of the beers already on the market.  They want to know how different styles of beers are rated. They are also thinking about a seasonal beer but are not sure whether seasonal beers are rated highly.  You will use the data that you cleaned in Module 5 for this research.
  

You will use a number of EDA techniques to answer these questions and many more.
</div>

![image.png](attachment:image.png)

<div class="alert alert-block alert-danger"><b>In many of the problems you will see <font color=black>#TODO</font> statements added as comments on the code cell provided. You will want to be sure to complete each of these as indicated to avoid losing points.</b></div>

In [0]:
# load up modules
import pandas as pd
import numpy as np
# load for visuals
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# set up notebook to display multiple output in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<div class="alert alert-block alert-warning"><b>Summary of steps taken in this notebook:
    </b><br>
1. Deciding what to analyze - examine date ranges<br>
2. Look at outliers<br>
3. Creating new fields <br>
4. Are all fields created equal? <br>
5. Looking for the highly rated beers, breweries and beer styles <br>
6. Time Series plot - by days and by month <br>
7. Season analysis <br>
    
<b><u>Best practice</u></b> - we have discussed that EDA is an iterative process, with each change to the data providing a new view of the data.  Whenever you manipulate the data, look at it before the change and after the change so that you are confident the code did what you expected.  The inspection can use any combination of shape, info, describe, boxplot, etc.
</div>
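As a minimal sketch of the before-and-after habit described above, the cell below uses a small invented dataframe (not the beer review file) and an arbitrary cutoff chosen only for the demonstration. The idea is simply to record the shape before a change, make the change, and then compare.

```python
import pandas as pd

# Toy data invented for illustration only; this is not the assignment data
toy = pd.DataFrame({
    'beer_abv': [4.5, 5.0, 6.2, 0.4, 18.0],
    'review_overall': [4.0, 3.5, 4.5, 2.0, 4.0],
})

# inspect BEFORE the change
rows_before = toy.shape[0]
print(toy['beer_abv'].describe())

# make the change: keep rows at or below an arbitrary demo cutoff of 10
toy = toy[toy['beer_abv'] <= 10]

# inspect AFTER the change and confirm the difference is what you expected
rows_after = toy.shape[0]
print(f'rows before: {rows_before}, rows after: {rows_after}, dropped: {rows_before - rows_after}')
print(toy['beer_abv'].describe())
```

If the number of dropped rows surprises you, that is the signal to go back and check the filter before moving on.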

### Read in data


In [0]:
# read in file from module 5
df = pd.read_csv('beer_reviews_final.csv')

# what is the shape of the data
df.shape
# look at first five records
df.head()

In [0]:
# check for nulls and for data types
# if you did your assignment correctly for module 5, there should be no nulls
df.info()


## Deciding what data to analyze - examine date ranges

In [0]:
# let's check out our date info
yearmonth = df['review_date'].str[0:7]

# We have all of 2011 and part of January 2012
yearmonth.value_counts().sort_index()

<div class="alert alert-block alert-success"><b>Problem 1 (2 pts.)</b>: Drop January 2012 from your data so that you are only analyzing the 12 months of 2011.  Show that your dataframe only contains 2011 data by showing a monthly count (see below).
    
![image.png](attachment:image.png)
</div>

In [0]:
# TODO show the shape of your data

# TODO only keep 2011 data for analysis

# TODO show the new shape of the data

# TODO show your dataframe contains 2011 data with monthly counts
 

## Deciding what to analyze - looking at outliers

A boxplot of `beer_abv` shows what appear to be outliers above 15 percent alcohol.

What are reasonable alcohol levels for beer?  According to the link provided, average alcohol levels for beer are around the 5 percent level. https://www.alcohol.org/statistics-information/abv/

There are a lot of values over 15, so next we'll take a closer look at them. Note that it is possible to specify the percentiles within describe( ).
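For context on why a boxplot flags points as outliers: by default, Matplotlib (which the pandas `boxplot` method uses) draws each whisker out to the farthest point within 1.5 times the interquartile range (IQR) of the box, and plots anything beyond that as an individual point. The sketch below reproduces that rule by hand on a few invented values; the numbers are assumptions for illustration and are not taken from the beer data.

```python
import pandas as pd

# Toy alcohol values invented for illustration only
abv = pd.Series([4.2, 4.8, 5.0, 5.5, 6.0, 6.5, 7.0, 16.0])

# quartiles and interquartile range
q1 = abv.quantile(0.25)
q3 = abv.quantile(0.75)
iqr = q3 - q1

# fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# values outside the fences are the points a default boxplot draws individually
flagged = abv[(abv < lower_fence) | (abv > upper_fence)]
print(f'fences: {lower_fence:.2f} to {upper_fence:.2f}')
print(flagged)
```

A point outside the fences is only statistically unusual; whether it is an error or a real value still has to be judged by looking at the records themselves.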

In [0]:
df['beer_abv'].describe(percentiles = [.25, .5, .75, .95])

df.boxplot(column = 'beer_abv')

Now we can isolate the beers with alcohol content over 15% and take a closer look.

In [0]:
# isolate beers with alcohol content over 15
x = df[df['beer_abv'] > 15]

# how many reviews are there? 2513
len(x)

# how many unique beers is that? 94
x['beer_name'].unique().shape

# do they look like valid abv values? Or are they mislabeled?
x
 

<div class="alert alert-block alert-success"><b>Problem 2 (2 pts.)</b>: The data appears to be valid and accurate as the beer name in many cases matches the percent alcohol (Schorsch Weizen 16% as an example). So there are real beers with high alcohol content. But the high alcohol content beers are unusual in the beer market and not mainstream enough for our client, so let's drop those with alcohol content above 11.5% from our analysis. Also, our client does not want to make a non-alcoholic beer so also drop any beer with less than 1% alcohol.
</div>

In [0]:
#TODO show the shape of your data

#TODO Drop all beers with an alcohol content under 1 percent and over 11.5 percent

#TODO show the shape of your data

#TODO display another boxplot to show beer_abv


In [0]:
# now look at how the change to the data affected the description stats
df['beer_abv'].describe(percentiles = [.25, .5, .75, .95])

## Deciding what data to analyze - creating new fields

Since we're only interested in the data from 2011, we can extract just the month from our data and create a new column for this.

In [0]:
# create a new column for month
df['month'] = df['review_date'].str[5:7]
df['month'].value_counts()

One of the questions from our client is about seasonal beers. With this in mind, we'll create a dictionary called `seasons` that we will use later in our analysis. We will use our `month` variable as the key. We will do more with dictionaries next week, but until then here is some more information on them: __[dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)__

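Note that the `map( )` used below is the pandas `Series.map( )` method, which replaces each value in a column with its match from a mapping such as a dictionary. It shares its name with Python's built-in `map( )` function, which is an iteration tool. More information on the built-in version can be found here: __[built-in functions](https://docs.python.org/3/library/functions.html#map)__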
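A small sketch of how a dictionary lookup behaves when applied to a pandas column with `Series.map`. The series and the partial dictionary here are hypothetical and deliberately incomplete, to show that any value without a matching key comes back as missing (`NaN`).

```python
import pandas as pd

# Hypothetical month codes; '13' is deliberately invalid
months = pd.Series(['01', '04', '07', '13'])

# A partial, made-up lookup dictionary (not the full seasons dictionary)
lookup = {'01': 'Winter', '04': 'Spring', '07': 'Summer'}

# each value is replaced by the dictionary value for that key
mapped = months.map(lookup)
print(mapped)

# values with no matching key become NaN, so counting nulls is a quick check
unmatched = mapped.isna().sum()
print(f'unmatched values: {unmatched}')
```

Checking for nulls after a mapping step is a quick way to confirm that every key in the column was covered by the dictionary.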

In [0]:
# create a dictionary called 'seasons' using the 'month' variable

seasons = {'01' : 'Winter', '12' : 'Winter', '02' : 'Winter',
           '03' : 'Spring', '04' : 'Spring', '05' : 'Spring',
           '06' : 'Summer', '07' : 'Summer', '08' : 'Summer',
           '09' : 'Fall',   '10' : 'Fall',   '11' : 'Fall'}
df['season'] = df['month'].map(seasons)
df.sample(5)

## Deciding what data to analyze - are all values created equal?

We will create a correlation matrix and Seaborn heatmap to investigate how rating, taste, and alcohol level are related. More information on creating a heatmap using Seaborn can be found here: __[Seaborn heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)__

What does the correlation matrix below tell us?  It tells us that there is a high positive correlation between the overall review score and the taste score, which makes sense since good tasting beer should get a high overall rating. The red colored boxes in the heatmap indicate slight negative relationships between the scores shown and the beer alcohol level. This is an interesting finding because it suggests that the amount of alcohol in the beer doesn't matter that much in terms of the ratings. <br>

Notice that the correlation matrix and the heatmap are two different ways to present the same data.
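To make the numbers in a correlation matrix less of a black box, here is a sketch that computes one Pearson correlation by hand and compares it with the value pandas reports (`DataFrame.corr` uses the Pearson method by default). The dataframe is invented for illustration and only borrows the column names from this notebook.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration only
toy = pd.DataFrame({
    'review_overall': [3.0, 3.5, 4.0, 4.5, 5.0],
    'review_taste':   [3.0, 4.0, 4.0, 4.5, 5.0],
    'beer_abv':       [7.0, 5.5, 6.0, 5.0, 6.5],
})

# full correlation matrix (Pearson by default)
corr = toy.corr()
print(corr)

# the same coefficient for one pair, computed from the definition
x = toy['review_overall']
y = toy['review_taste']
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(f"manual r: {r_manual:.4f}")
print(f"pandas r: {corr.loc['review_overall', 'review_taste']:.4f}")
```

Two properties are worth noticing in any correlation matrix: the diagonal is always 1 (each variable correlates perfectly with itself), and the matrix is symmetric, so the upper and lower triangles carry the same information.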

In [0]:
# setting the columns to correlate
columns = ['review_overall','review_taste', 'beer_abv']
df_corr = df[columns]
# running the correlation
df_corr.corr()

# setting up the heatmap
corrmat = df_corr.corr()

# set the figure size
f, ax = plt.subplots(figsize=(9, 6))

# pass the data and set the parameters
sns.heatmap(corrmat, vmax=.8, square=True, annot=True, cmap='RdYlBu', linewidths=.5 )
plt.title('Heatmap Beer Review ratings')

# images can be saved - default is .png
# https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html
plt.savefig('Correlation Heat Map Beer Reviews')

<div class="alert alert-block alert-success"><b>Problem 3 (3 pts.)</b>: Create a data matrix with all six of the numeric variables in the dataframe and create a heatmap using all six numeric variables with a new color scheme. Explain how the visual cues of the heatmap represent the correlations.
</div>

In [0]:
# TODO create a data matrix using all six numeric variables
 
# TODO create a heat map using all six numeric variables. Pick a new color combination.
# https://matplotlib.org/3.1.1/gallery/color/colormap_reference.html

#TODO explain how the visual cues of the heatmap represent the correlations.
 

## Deciding what to analyze - looking for the highly rated beers

The `review_overall` score is correlated highly enough with the other review scores that we will use just the `review_overall` score for our analysis of top beers.

Let's look at top beers three different ways: by brewery, by style, and by individual beer.
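Before ranking anything by its average score, it helps to see why the review count has to travel with the mean. The sketch below uses invented brewery names and scores, and the minimum of 2 reviews is an arbitrary threshold chosen for the demonstration.

```python
import pandas as pd

# Toy reviews invented for illustration; 'A', 'B', 'C' are hypothetical breweries
toy = pd.DataFrame({
    'brewery_name':   ['A', 'A', 'A', 'B', 'C', 'C'],
    'review_overall': [4.0, 4.5, 3.5, 5.0, 3.0, 4.0],
})

# mean score and number of reviews per brewery
summary = toy.groupby('brewery_name')['review_overall'].agg(['mean', 'count'])
print(summary)

# brewery B tops the mean with a single review, so require a minimum count first
min_reviews = 2  # arbitrary threshold for the demo
ranked = summary[summary['count'] >= min_reviews].sort_values(by='mean', ascending=False)
print(ranked)
```

The right threshold depends on the data; looking at the distribution of counts first is one reasonable way to pick it.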

In [0]:
# look at mean of overall review and the number of reviews
brewery =df['review_overall'].groupby(df['brewery_name']).agg(['mean','count'])

# Notice that looking at just the mean is misleading as those that have a rating 5.0 have only 1 
#     or 2 reviews
# we could eliminate the low count, or instead focus on the high count
# 
brewery.sort_values(by=['count'], ascending = False)[:20]
brewery.sort_values(by=['mean'], ascending = False)[:10]

In [0]:
# repeat for beer style
beerStyle =df['review_overall'].groupby(df['beer_style']).agg(['mean','count'])
beerStyle.sort_values(by=['count'], ascending = False)[:20]

In [0]:
# repeat for beer name
beers =df['review_overall'].groupby(df['beer_name']).agg(['mean','count'])
beers.sort_values(by=['count'], ascending = False)[:20]

In [0]:
# a table heat map can help point out top values
z = brewery.sort_values(by=['count'], ascending = False)[:50]
z.style.background_gradient(cmap = 'Blues')

# what does this heatmap help us see?
# top breweries to investigate based on review_overall 
#  Russian River Brewing Company
#  Brasserie Cantillon
#  Founders Brewing Company
#  Surly Brewing Company


<div class="alert alert-block alert-success"><b>Problem 4 (4 pts.)</b>: Create a heat map for both beer styles and for individual beers showing the top 50 based on the count of review_overall.  Which three beer styles are top rated and which three individual beers are top rated?  Optional: create the heatmap with different colors.
</div>

In [0]:
# TODO create a heatmap for beer styles
 
# TODO list top three beer styles based on mean rating
 

In [0]:
# TODO create a heatmap for individual beers
 
# TODO list top three beers based on mean rating
 

There are plenty of ways to slice and dice data. A heatmap is a nice visual, but there are other ways to analyse the data. Below is an example of an easy way to change the count number to see if lower count reviews have a higher review score.

Try changing the comparison value of 600 to 400 and see how the results change.

In [0]:
# set the comparison value to more than 600

temp = beers[beers['count'] > 600]
temp['mean'].nlargest(5)

# change the comparison value of 600 above to 400 and see how the results change

## Time Series plot

Next we will look at our total number of beer reviews by day.  Note that when data is stored in a csv file, it does not retain the date field type; the review_date in this module was read in as an Object - which is the default.

There are a few ways to handle date fields.  If you know you are reading in a date field from a csv file, you can specify so in the read_csv command:
`df = pd.read_csv('beer6.csv', parse_dates = ['review_date'])`

Or you can convert a date in an Object field into a Date field, which is shown below.
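The two approaches can be compared side by side without touching any file on disk. This sketch reads a tiny invented CSV from an in-memory string (via `io.StringIO`); the rows are assumptions for illustration.

```python
import io
import pandas as pd

# A tiny invented CSV held in memory, standing in for a file on disk
csv_text = "review_date,review_overall\n2011-01-15,4.0\n2011-02-13,3.5\n"

# default read: the date column comes in as text
as_read = pd.read_csv(io.StringIO(csv_text))

# option 1: ask read_csv to parse the column while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['review_date'])

# option 2: convert the text column after reading
converted = as_read.copy()
converted['review_date'] = pd.to_datetime(converted['review_date'])

# compare the resulting column types
print(as_read['review_date'].dtype)
print(parsed['review_date'].dtype)
print(converted['review_date'].dtype)

# once the column is a datetime, the .dt accessor becomes available
print(converted['review_date'].dt.month)
```

Either route ends with a datetime column, which is what makes date-aware operations such as the `.dt` accessor available.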

In [0]:
# convert review_date to a date format
df['review_date'] = pd.to_datetime(df['review_date'])
df.info()

In [0]:
# to plot by date, we need one count for each date
# let's group by date and create a df that we can plot
df_date = pd.DataFrame(df['review_overall'].groupby(df['review_date']).count())
df_date.sample(5)
# the date is the index and it needs to be reset so it can be used as a regular column
df_date = df_date.reset_index()
df_date.info()

## Two different plots showing the same information

Shown below are plots of the beer review counts for each day.  The top graph is using Matplotlib and the bottom graph is using Plotly. Here is more information on those:

__[matplotlib simple plot](https://matplotlib.org/3.1.0/gallery/lines_bars_and_markers/simple_plot.html#sphx-glr-gallery-lines-bars-and-markers-simple-plot-py)__

__[plotly discrete colors](https://plotly.com/python/discrete-color/)__

__[plotly text and annotations](https://plotly.com/python/text-and-annotations/)__

Both plot styles have plenty of features that can be customized and you are encouraged to experiment with the customizations. We will be updating the plot title along with the labels for the x and y axis.

In [0]:
# matplotlib version of plot

fig, ax = plt.subplots(figsize = (8,8))

ax.plot(df_date['review_date'], df_date['review_overall'])
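# set the axis labels and the title
ax.set(xlabel = 'Review Date', ylabel = 'Number of Reviews', title = 'Beer Review Count by Date')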

In [0]:
# plotly version of plot - notice the info on hover

fig = px.line(df_date, x = 'review_date', y = 'review_overall',  
        title='Beer Review Count by Date')  

fig.update_layout(height = 600, xaxis_title = 'Review Date', yaxis_title = 'Number of Reviews')

# hover over Feb 13 to see counts

<div class="alert alert-block alert-success"><b>Problem 5 (4 pts.)</b>: Create a new line plot using the count of reviews showing for each month (new x axis value).  Your figure is to include a <b><font color=black>title, x and y axis labels</font></b>.  You can choose to either use Matplotlib or Plotly.  You will need to prep the data and then display the plot.
</div>

In [0]:
# TODO create your dataframe that groups the review_overall count by month
 

In [0]:
# TODO create a plot with Month on the x axis and counts on the y axis; Include a title, x and y axis label.


## Seasons

We already created a variable so that each review has a value of either Summer, Spring, Winter or Fall.  We want to know if there are beers that have a high share of their reviews in one season, which suggests they are a special beer with a seasonal release.
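As background on the cross-tabulation used in this section, the sketch below shows what `pd.crosstab` returns and how its `normalize='index'` option turns each row into shares of that row's total. The beer names and reviews are invented for illustration, and this is one possible shortcut rather than the approach the cells below take.

```python
import pandas as pd

# Toy reviews invented for illustration; both beer names are hypothetical
toy = pd.DataFrame({
    'beer_name': ['Pumpkin Ale'] * 4 + ['House Lager'] * 4,
    'season':    ['Fall', 'Fall', 'Fall', 'Winter',
                  'Spring', 'Summer', 'Fall', 'Winter'],
})

# counts of reviews per beer (rows) and season (columns)
counts = pd.crosstab(toy['beer_name'], toy['season'])
print(counts)

# normalize='index' divides each row by its row total; multiply by 100 for percent
shares = pd.crosstab(toy['beer_name'], toy['season'], normalize='index') * 100
print(shares)
```

Combinations that never occur are filled with 0 rather than left missing, which is why the seasonal columns can be added together safely.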

In [0]:
df50 = pd.DataFrame(pd.crosstab(df['beer_name'],df['season'])) 
# get a total count of reviews per beer
df50['Total'] = df50['Fall'] + df50['Spring'] + df50['Summer'] + df50['Winter']
df50.head(10)

In [0]:
# We don't want beers with few reviews, so only keep beers with 50 or more reviews
df50 = df50[df50['Total'] >= 50]
df50 = df50.reset_index()
df50.head()

In [0]:
# let's calculate percentages of total for each season

df50['fallPercent'] = (df50['Fall']/df50['Total']) * 100
df50['springPercent'] = (df50['Spring']/df50['Total']) * 100
df50['summerPercent'] = (df50['Summer']/df50['Total']) * 100
df50['winterPercent'] = (df50['Winter']/df50['Total']) * 100
df50.info()
df50.sample(5)

In [0]:
# let's look at Spring to see if any beers have more than 75 percent of their reviews in Spring
df50[df50['springPercent'] > 75]

<div class="alert alert-block alert-success"><b>Problem 6 (2 pts.)</b>: Show the number of beers for each season that have over 75 percent of their reviews in one season.<br>
    
The output should show the season and the number of beers that qualify where x represents the count: <br>
Spring has x beers <br>
Summer has x beers <br>
Fall has x beers <br>
Winter has x beers
</div>

In [0]:
# TODO Show the beer review counts for each season with over 75 percent per season


<div class="alert alert-block alert-success"><b>Problem 7 (1 pt.)</b>: Which beer(s) would you suggest the client look at with regard to a seasonal beer, and why? 
</div>

In [0]:
# TODO Which beer(s) would you suggest the client look at with regard to a seasonal beer, and why?
