# Beer Hops Data: Exploratory Data Analysis

**Data Files:** *cln_hops_profile.csv, cln_hops_brewvalues.csv*

**Original Source:** *https://beermaverick.com/hops/*  (Data retrieved via web-scraping)


------------------------------------------------------------

### Setup

**Objective:** Import the modules needed for exploratory analysis and visualization, and read the cleaned CSV files into local dataframes for easier access.

In [None]:
# Import necessary packages
import datapackage
import numpy as np
import geopandas as gpd
import pandas as pd
import itertools
import folium
import functools
import matplotlib.pyplot as plt
import operator
import seaborn as sns

In [None]:
# Define file paths for processed CSV data
CLEAN_HOPS_PATH = './clean_data/cln_hops_brewvalues.csv'  
CLEAN_HOPS_PROFILE_PATH = './clean_data/cln_hops_profile.csv'

In [None]:
# Read the cleaned CSV data into local dataframes
hop_values_df = pd.read_csv(CLEAN_HOPS_PATH, index_col='Hop Name')
hop_profile_df = pd.read_csv(CLEAN_HOPS_PROFILE_PATH, index_col='Hop Name')

---------------------------------------
### Cursory Look at Dataframes

**Objective:** Basic exploration of processed dataframes (of brew values & profile info)

In [None]:
hop_values_df.sample(5)

In [None]:
hop_values_df.info()

In [None]:
hop_values_df.isnull().sum()

From this cursory look, we observe that the datatypes are consistent: every column holds float values after the cleaning. However, a potential hurdle is the number of null values, especially for the oil values.

In [None]:
hop_values_df.describe()

From this, we see that a lot of the attributes have large ranges between their min and max values. There are also a few "inf" values to be wary of during analysis. As before, we have more data points for the main brew values and more NaN entries for the specific oil values.
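
As a quick check on the point above, the following is a minimal sketch of how the "inf" and NaN entries could be counted per column. It uses a small toy dataframe as a stand-in for `hop_values_df`; the column names and values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hop_values_df (column names and values are hypothetical)
toy_values = pd.DataFrame({
    "Alpha Acid % Avg": [5.0, 12.5, np.nan],
    "Alpha-Beta Ratio Avg": [1.2, np.inf, 3.4],
})

# Count infinite and missing entries separately for each column
inf_counts = np.isinf(toy_values).sum()
nan_counts = toy_values.isna().sum()
summary = pd.DataFrame({"inf": inf_counts, "NaN": nan_counts})
print(summary)
```

Running the same two counts on the real dataframe would show which columns need the `inf` values replaced before averaging.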

In [None]:
hop_profile_df.sample(5)

In [None]:
hop_profile_df.info()

In [None]:
hop_profile_df.isnull().sum()

In [None]:
hop_profile_df[hop_profile_df.isna().any(axis=1)]

In the profile dataset, we have far fewer missing values to take care of. Conveniently, the affected rows are entirely NaN rather than partially missing. The data values are either categorical or boolean.
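
The claim that the affected rows are entirely empty can be checked directly. The sketch below does this on a toy dataframe standing in for `hop_profile_df`; the hop names, columns, and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hop_profile_df (all names and values are hypothetical)
toy_profile = pd.DataFrame(
    {
        "Country": ["USA", np.nan, "UK"],
        "Purpose": ["Aroma", np.nan, "Dual"],
        "Citrus": [True, np.nan, False],
    },
    index=["Hop A", "Hop B", "Hop C"],
)

any_missing = toy_profile.isna().any(axis=1)  # rows with at least one NaN
all_missing = toy_profile.isna().all(axis=1)  # rows that are NaN in every column
partially_missing = any_missing & ~all_missing

print("Rows with only some values missing:", partially_missing.sum())

# If no row is partially missing, dropping the fully empty rows loses no information
cleaned = toy_profile.dropna(how="all")
```

If the count of partially missing rows is zero on the real data, `dropna(how="all")` is a safe way to remove the empty rows.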

In [None]:
hop_profile_df.describe()

This summary is not very enlightening, since we would expect a majority of False values for each aroma: only a few aromas are tagged per hop. However, it does reinforce that the profile info is more complete and that the missing values are consistent across columns.

-----------------------------------------------------
### Exploratory Visualization: Hops Profile

**Objective:** Output basic visualizations that give us an alternative cursory look at our dataset for hops profile.

In [None]:
## VISUALIZE: HOPS AVAILABLE FOR COUNTRY

plt.figure(figsize=(20, 8))
c = sns.countplot(x='Country', data=hop_profile_df, hue='Purpose', dodge=False)
c.set_xticklabels(c.get_xticklabels(), rotation=35, ha="right")
plt.title("Number of Hops by Country")
plt.show()


The countplot above yields some interesting observations. Most apparent is that the USA has the most hops by far, while the UK, Germany, and New Zealand make up the next tier and together add up to about as many as the USA. However, the mix of hop purposes differs considerably between countries. The USA, for example, despite having the largest number of hops, has no "Dual" purpose hops (which is also true of Germany, the second most common country). In contrast, most other countries have some dual-purpose hops as well as specialized-purpose hops. The UK and New Zealand seem to have the most dual-purpose hops.
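
Because bars can hide one another when `dodge=False`, a plain count table is a useful cross-check on what the plot suggests. The following is a minimal sketch on toy data; apart from "Dual", the purpose labels and all counts are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the Country and Purpose columns of hop_profile_df (values are hypothetical)
toy_profile = pd.DataFrame({
    "Country": ["USA", "USA", "UK", "UK", "Germany"],
    "Purpose": ["Aroma", "Bittering", "Dual", "Aroma", "Aroma"],
})

# Rows: countries, columns: purposes, cells: number of hops
purpose_by_country = pd.crosstab(toy_profile["Country"], toy_profile["Purpose"])
purpose_by_country["Total"] = purpose_by_country.sum(axis=1)
purpose_by_country = purpose_by_country.sort_values("Total", ascending=False)
print(purpose_by_country)
```

On the real dataframe, a zero in the "Dual" column for a country would confirm that the missing bar segment is not just an overplotting effect.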

In [None]:
## VISUALIZE: MOST POPULAR HOP AROMAS

# Create modified df for easier visualization
profiles = hop_profile_df.drop(columns=['Purpose', 'Country']).copy()  # hard copy to avoid cross-contamination
aroma_count = profiles.sum()  # getting true counts for each aroma

# Report aromas that only appear once (unique to a single hop)
aroma_count_single = aroma_count[aroma_count.values == 1]
print("Aromas that are unique to a single hop: \n", aroma_count_single.index)

# Keep only aromas with an above-average count, sorted from most to least common
aroma_count = aroma_count[aroma_count.values > np.mean(aroma_count.values)]
aroma_count = aroma_count.sort_values(axis=0, ascending=False, kind='mergesort')

# Execute seaborn barplot with count of aromas in dataset
plt.figure(figsize=(8, 6))  # set the size before plotting so it applies to this figure
g = sns.barplot(y=aroma_count.index, x=aroma_count.values, orient='h')
g.set(ylabel='Aroma', xlabel='Aroma Count in Hops')
plt.title("Aroma Count in Hops")
plt.tight_layout()
plt.show()

From the barplot above, we see that a few aromas are extremely popular among hops (citrus, floral, and spicy are the most common). However, the vast majority of aromas appear in fewer than 10 hops (most aromas were not plotted because their counts fall below the average).

In [None]:
## VISUALIZE: Most Popular Aromas for Countries' Hops (TO DO: make colors consistent)

# Collect the countries to plot, skipping rows with no country
country_list = sorted(hop_profile_df.Country.dropna().unique())

# Set up plot config: one subplot per country
f, axs = plt.subplots(len(country_list), 1, figsize=(20, 46))

# Loop through each country and create a subplot of the most popular aromas for that country
for i, country in enumerate(country_list):
    df = hop_profile_df[hop_profile_df.Country == country].drop(columns=['Purpose', 'Country'])
    aroma_count = df.sum().astype('int32').nlargest(n=10, keep='first')
    sns.barplot(x=aroma_count.index, y=aroma_count.values, ax=axs[i])
    axs[i].set(xlabel="Aroma", ylabel="Num of Hops", title=country)
    axs[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Citrus is a popular hop aroma across most countries, and spicy and floral are also very common. Since some countries produce only a few hops to begin with, not every country has 10 aromas with nonzero counts.

-----------------------------------------------------
### Exploratory Visualization: Hops Values

**Objective:** Output basic visualizations that give us an alternative cursory look at our dataset for hops values.

In [None]:
## VISUALIZE: Distribution of each brew value (group with min/avg/max)

df = hop_values_df.copy()
df.replace([np.inf, -np.inf], np.nan, inplace=True)
hist = df.hist(layout=(10, 3), figsize=(15, 30), bins=10)

From the distributions above, we see that the attributes without many missing values are roughly normally distributed. Since we already observed a large range between the min and max of the brew values, it might be tempting to use the average instead, but this exploratory histogram reveals another flaw. If either the min or the max of a particular brew value has many more missing values than the other, the average is significantly affected as well (e.g. Alpha Beta Ratio). So a more careful approach to standardizing these values is needed for further analyses.
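
To make the issue concrete, the sketch below contrasts a naive row-wise average with one that is only computed when both the min and the max are present. This is one possible approach under stated assumptions, not the method used later in this notebook; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy min/max columns for a single brew value (names and values are hypothetical)
toy_range = pd.DataFrame({
    "Alpha Acid % Min": [4.0, np.nan, 10.0],
    "Alpha Acid % Max": [6.0, 9.0, np.nan],
})

# Naive average: silently falls back to whichever bound is present
naive_avg = toy_range.mean(axis=1)

# Stricter average: keep the midpoint only where both bounds are known
both_present = toy_range.notna().all(axis=1)
strict_avg = naive_avg.where(both_present)

print(pd.DataFrame({"naive": naive_avg, "strict": strict_avg}))
```

The naive version treats a lone min or max as if it were the average, which skews the distribution when one bound is missing far more often than the other.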

In [None]:
## VISUALIZE: Correlation between numeric variables

sns.heatmap(hop_values_df.corr(), cmap='coolwarm', xticklabels=True, yticklabels=True)
plt.title("Correlation of Hop Brew Values")
plt.xticks(rotation=90)
plt.show()

Looking at the correlation among the brew values, we see a strong correlation between Alpha Acid % and Alpha-Beta Ratio (interestingly, Beta Acid % is much less correlated with the ratio). The heat map also indicates a relatively strong correlation between Alpha Acid % and Total Oils, and a slight correlation between Caryophyllene and Humulene. Apart from those, there do not seem to be strong correlations among the other values.
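
Reading pairs off a heat map by eye is error-prone, so it can help to rank them numerically. The following is a minimal sketch on synthetic data; the column names `a`, `b`, and `c` are placeholders rather than columns of `hop_values_df`.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is built from a, c is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Applied to `hop_values_df.corr()`, the top of this list would give the exact coefficients behind the pairs called out above.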

### Exploratory Visualization: High-Level Summaries by Country
**Objective:** Plot countries on an interactive map with high-level, interesting info on values and profile as a whole.

In [None]:
# Read in geojson data with filtered countries to be used for Folium mapping
countries_df = gpd.read_file('./rc_raw_data/countries.geojson')
countries_df = countries_df[countries_df.ADMIN.isin(hop_profile_df.Country.unique())]
countries_df.reset_index(drop=True, inplace=True)
countries_df

In [None]:
# Retrieve country info for values to be used for visualization
values_countries = hop_values_df.merge(hop_profile_df.Country, left_index=True, right_index=True)
values_countries.replace([np.inf, -np.inf], np.nan, inplace=True)  # replacing inf values (reasoning...)
values_countries

In [None]:
# Add avg brewvalues for each country in the countries_df geojson dataframe
avg_cols = [i for i in values_countries.columns if 'Avg' in i]  # avg value columns only
mean_values_per_country = values_countries.groupby(by='Country', dropna=True)[avg_cols].mean()
for col in avg_cols:
    # Align on country name rather than row position, so the row order of countries_df does not matter
    countries_df[col] = countries_df.ADMIN.map(mean_values_per_country[col])

# Adding popular aromas for hops for each country
aromas_lists = []
for country in countries_df.ADMIN:  # iterate in countries_df row order so rows stay aligned
    df = hop_profile_df[hop_profile_df.Country == country].drop(columns=['Purpose', 'Country'])
    aroma_count = df.sum().astype('int32').nlargest(n=3, keep='first')
    aromas_lists.append(list(aroma_count.index))
countries_df['Top 3 Aromas'] = aromas_lists

countries_df

In [None]:
## VISUALIZE: Map with Values

m = folium.Map(location=[0, 0], zoom_start=2)  # starting point

def create_choro(brew_value):
    """Sets up choropleth specifications for a given brew value and returns the choropleth object."""
    choro = folium.Choropleth(
        geo_data=countries_df,
        name=brew_value,
        data=countries_df,
        columns=['ADMIN', brew_value],
        key_on='feature.properties.ADMIN',
        fill_color='YlOrRd',
        fill_opacity=0.8,
        line_opacity=.2,
        line_weight=2,
        smooth_factor=0,
        nan_fill_color='White',
        legend_name=f'Brew Value: {brew_value}',
        show=False,
        highlight=True,
        overlay=False
    )
    return choro

# Add choropleth layer for each brew value
for brew_value in [i for i in values_countries.columns if 'Avg' in i]:
    m.add_child(create_choro(brew_value))

# Add information markers for popular aromas
centers = countries_df.to_crs('+proj=cea').centroid.to_crs(countries_df.crs)
for i in range(len(countries_df)):
    m.add_child(
        folium.Marker(
            location=[centers[i].y, centers[i].x],
            popup=f"""
                <p>Top 3 Aromas:</p> <hr>
                <p>{countries_df.iloc[i]["Top 3 Aromas"][0]}</p> <hr>
                <p>{countries_df.iloc[i]["Top 3 Aromas"][1]}</p> <hr>
                <p>{countries_df.iloc[i]["Top 3 Aromas"][2]}</p> <hr>
            """,
            icon=folium.Icon(color='blue')
        )
    )

m.add_child(folium.LayerControl(position='topright', collapsed=False))
m.save('choro_values.html')
m