# 1. Project Details

## 1.1. Project Title

Beer reviews data analysis using machine learning, exploratory data visualization and analysis techniques.

## 1.2. Background Information
This project analyzes a dataset of beer reviews. Each review records the brewery, the beer's name, style, and alcohol by volume (ABV), and the reviewer's ratings for aroma, appearance, palate, taste, and overall impression. By analyzing these variables, our goal is to establish correlations between beer characteristics and perceived beer quality and to identify the factors most associated with high ratings.

# 2. Reading the Data Source

For this project, we use a dataset stored in a CSV file. Because we work collaboratively and use GitHub for version control, we load the file from an online source so that every team member can read the data directly into the Jupyter notebook.

The .csv file is hosted on Kaggle, at the following link:

[Brewery Data Link](https://www.kaggle.com/datasets/rdoume/beerreviews/data?select=beer_reviews.csv)

The dataset contains the following variables:

- `brewery_id` - Unique identifier for the brewery.
- `brewery_name` - Name of the brewery.
- `review_time` - Timestamp of the review in Unix time format.
- `review_overall` - Overall rating given by the reviewer.
- `review_aroma` - Rating for the aroma of the beer.
- `review_appearance` - Rating for the appearance of the beer.
- `review_profilename` - Profile name of the reviewer.
- `beer_style` - Style of the beer being reviewed.
- `review_palate` - Rating for the palate of the beer.
- `review_taste` - Rating for the taste of the beer.
- `beer_name` - Name of the beer.
- `beer_abv` - Alcohol By Volume (ABV) of the beer.
- `beer_beerid` - Unique identifier for the beer.

In [None]:
# Library Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import plotly.graph_objects as go 
import plotly.subplots as sp
import plotly.figure_factory as ff
import os
import kagglehub
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [None]:
# Download latest version
path = kagglehub.dataset_download("rdoume/beerreviews")

print("Path to dataset files:", path)

In [None]:
csv_path = os.path.join(path, "beer_reviews.csv")
df = pd.read_csv(csv_path)
print(df.head()) 

# 3. Data Cleaning and Preprocessing

In this phase we prepare the dataset for analysis and modeling. First, we inspect the structure of the dataset to understand its features and their data types. Next, we handle missing values, either by imputing or deleting them. Finally, we transform the data, converting some variables to a more suitable type or scale.

## 3.1. Data Structure

We use `info()` to check the structure of our data: the data type of each column, the number of observations, and the number of variables.

In [None]:
# Obtain data structure
df.info()

An initial look at the structure of the data shows the following:

- The dataset contains 1,586,614 observations and 13 distinct variables.
- The memory usage is approximately 157.4 MB.
- The 'brewery_id' and 'brewery_name' variables represent the brewery's unique identifier and name. 'brewery_id' could be converted into a categorical variable, since each ID labels a brewery rather than measuring a quantity.
- Ratings for aroma, appearance, palate, and taste ('review_aroma', 'review_appearance', 'review_palate', 'review_taste') are provided. We will analyze the distribution of these ratings and explore potential relationships between them.
- We will explore the distribution of beer styles ('beer_style') to identify the most common styles and analyze how they are rated.
- We will analyze the distribution of 'beer_abv' and explore how it correlates with overall ratings or specific aspects of the reviews.
- There are missing values in 'brewery_name', 'review_profilename', and 'beer_abv'. We need to determine whether there are patterns or reasons behind the missing data.
- There are potential relationships between variables that could be explored. For example, 'review_taste' or 'review_overall' might be influenced by 'review_aroma', 'review_appearance', or 'review_palate'.
- Exploring trends in the number of reviews over time ('review_time') will help us understand whether there is seasonality or any significant change in reviewing behavior.

These observations are only a starting point; further exploration of the data might reveal additional information and considerations.

## 3.2. Summary Statistics

In this step, we use `describe()` to compute summary statistics for the numeric columns. This gives a first view of their ranges and central tendencies before we modify the data for analysis and predictive modeling.

In [None]:
df.describe()

### Summary Statistics Analysis

The summary statistics of the dataset provide valuable insight into the distribution and characteristics of beer reviews:

- **Count:** There are a total of 1,586,614 reviews in the dataset. However, the `beer_abv` column has fewer non-null values (1,518,829), indicating some missing data in this column.

- **Review Ratings (Overall, Aroma, Appearance, Palate, Taste):**
    - The average ratings for overall, aroma, appearance, palate, and taste are all above 3.5, suggesting a generally positive sentiment in the reviews.
    - The minimum ratings for overall, aroma, appearance, palate, and taste are 0, 1, 0, 1, and 1, respectively. The presence of 0 in overall and appearance ratings may indicate some outlier or erroneous entries.
    - The maximum rating for all these categories is 5, which is consistent with a typical rating scale.

- **Alcohol by Volume (ABV):**
    - The average ABV is 7.04%, with a standard deviation of 2.32%, indicating a moderate variation in the alcohol content of the beers reviewed.
    - The minimum ABV is 0.01%, which is unusually low for a beer. This could be an outlier or a special case (e.g., non-alcoholic beer).
    - The maximum ABV is 57.7%, which is exceptionally high for a beer and likely represents a special or extreme brew.

- **Brewery and Beer IDs:**
    - The `brewery_id` and `beer_beerid` columns have a wide range of values, indicating a diverse set of breweries and beers in the dataset.

These summary statistics provide an understanding of the dataset's structure and the distribution of key variables. They also highlight potential areas for further investigation, such as examining the distribution of ABV values and investigating the reasons behind the low and high extremes.
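
As a minimal sketch of how the extremes noted above could be inspected, the block below counts ratings that fall below the 1 to 5 scale and looks at the tails of the ABV distribution. The `toy` DataFrame and its values are made up for illustration; they stand in for `df` and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the review data (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "review_overall": [4.0, 0.0, 3.5, 5.0, 4.5],
    "review_appearance": [4.0, 0.0, 3.0, 4.5, 4.0],
    "beer_abv": [5.0, 0.01, 7.2, 57.7, None],
})

# Count ratings that fall below the expected 1-5 scale, per column
zero_ratings = (toy[["review_overall", "review_appearance"]] < 1).sum()
print(zero_ratings)

# Inspect the lower tail, median, and upper tail of ABV (NaN values are skipped)
abv_quantiles = toy["beer_abv"].quantile([0.01, 0.5, 0.99])
print(abv_quantiles)
```

Running the same two checks on the full DataFrame would show how many rows are affected before any decision is made about removing them.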


## 3.3. Check for Duplicate values

Next, we check whether the dataset contains any duplicate rows.

In [None]:
# Check for duplicate rows
duplicates = df.duplicated()

# Count the number of duplicate rows
num_duplicates = duplicates.sum()
print(f"Number of duplicate rows: {num_duplicates}")

There are no duplicate rows, so we do not need to delete any rows on that basis.

## 3.4. Handling Missing Values

Missing values need to be handled because most models do not handle them well. For missing values, the options are to:

- Substitute them with statistical estimates such as the mean, median, mode, etc. 
- Eliminate them altogether, if the number of missing values is relatively low.

We will do some analysis to see if there are missing values in our dataset, and if so, how to handle them. 
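
The two options can be sketched on a small example. The `toy` DataFrame and the reviewer names below are hypothetical and only illustrate the mechanics of imputing versus dropping.

```python
import pandas as pd

# Hypothetical data with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "beer_abv": [5.0, None, 7.0, 9.0],
    "review_profilename": ["alice", "bob", None, "alice"],
})

# Option 1: impute the numeric column with its mean and the
# categorical column with its mode (most frequent value)
imputed = toy.copy()
imputed["beer_abv"] = imputed["beer_abv"].fillna(imputed["beer_abv"].mean())
imputed["review_profilename"] = imputed["review_profilename"].fillna(
    imputed["review_profilename"].mode()[0]
)

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(imputed)
print(f"Rows kept after dropping: {len(dropped)} of {len(toy)}")
```

Imputation keeps every row but adds estimated values, while dropping keeps only observed values at the cost of fewer rows.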

In [None]:
# Check for missing values
print("\nMissing values in each column:")
missing_values = df.isnull().sum()
print(missing_values)

# Total number of missing values
total_missing_values = missing_values.sum()
print(f"\nTotal number of missing values in the dataset: {total_missing_values}")

# List of columns that contain at least one missing value
columns_with_missing_values = missing_values[missing_values > 0].index.tolist()
print("\nColumns that contain at least one missing value:")
print(columns_with_missing_values)

Our analysis of the missing values reveals the following:

- The total number of missing values in the dataset is 68,148.
- The columns that contain at least one missing value are 'brewery_name', 'review_profilename', and 'beer_abv'.
- Given the wide range of values in 'brewery_name', imputing with the mode might not be representative, so we have opted to remove the rows with missing values in this column.
- 'review_profilename' is categorical and has 348 missing values, so we have decided to impute them with the most frequent profile name.
- 'beer_abv' has a significant number of missing values. To retain these rows while providing a reasonable estimate, we have chosen to impute the missing values with the mean.

In [None]:
# Remove rows with missing values in 'brewery_name'
df_cleaned = df.dropna(subset=['brewery_name']).copy()

# Impute missing values in 'review_profilename' with the mode
mode_profilename = df_cleaned['review_profilename'].mode()[0]
df_cleaned['review_profilename'] = df_cleaned['review_profilename'].fillna(mode_profilename)

# Impute missing values in 'beer_abv' with the mean
df_cleaned['beer_abv'] = df_cleaned['beer_abv'].fillna(df_cleaned['beer_abv'].mean())

# Display the final cleaned dataset
print("Final Cleaned Dataset:")
print(df_cleaned.head())


In [None]:
# Check for missing values in the final cleaned dataset
print("\nMissing values in the final cleaned dataset:")
missing_values_cleaned = df_cleaned.isnull().sum()
print(missing_values_cleaned)

# Total number of missing values in the final cleaned dataset
total_missing_values_cleaned = missing_values_cleaned.sum()
print(f"\nTotal number of missing values in the final cleaned dataset: {total_missing_values_cleaned}")

The check confirms that no missing values remain, so the cleaned dataset is complete.

# 4. Data Transformation

## 4.1. Data Type Conversion

Before going into further analysis, each column in our dataset should have the appropriate data type. We apply the following conversions:

- **Datetime Conversion:** The `review_time` column, currently in UNIX timestamp format, will be converted to a datetime format for easier manipulation and interpretation of time-based data.
- **Floating-Point Conversion:** Numerical columns such as `review_overall`, `review_aroma`, and `beer_abv` will be converted to float data types to ensure precision in numerical operations.
- **Categorical Conversion:** Columns representing categorical data, such as `brewery_name`, `beer_style`, and `beer_beerid`, will be converted to the `category` data type. This conversion is beneficial for memory efficiency and computational performance when dealing with categorical operations.
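
To illustrate two of these conversions, here is a small sketch on made-up values: converting Unix timestamps to datetimes, and comparing the memory footprint of a repeated string column before and after conversion to `category`. The series `times` and `styles` are illustrative stand-ins for `review_time` and `beer_style`.

```python
import pandas as pd

# Unix timestamps in seconds (illustrative values)
times = pd.Series([0, 86_400, 1_234_567_890])
converted = pd.to_datetime(times, unit="s")
print(converted)

# A string column with few distinct values repeated many times
styles = pd.Series(["American IPA", "Hefeweizen"] * 5_000)
as_category = styles.astype("category")

# deep=True includes the memory taken by the string values themselves
bytes_before = styles.memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)
print(f"strings: {bytes_before} bytes, category: {bytes_after} bytes")
```

The saving comes from storing each distinct label once and representing the rows as small integer codes, so it is largest for columns with few distinct values relative to the number of rows.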

In [None]:
# Convert 'review_time' to datetime format
df_cleaned['review_time'] = pd.to_datetime(df_cleaned['review_time'], unit='s')

# Convert numerical columns to float
numerical_cols = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv']
df_cleaned[numerical_cols] = df_cleaned[numerical_cols].astype(float)

# Convert categorical variables to 'category' data type
categorical_cols = ['brewery_name', 'beer_style', 'beer_name', 'brewery_id', 'beer_beerid']
df_cleaned[categorical_cols] = df_cleaned[categorical_cols].astype('category')

# Display the updated data types
print(df_cleaned.dtypes)
df_cleaned.describe()


### Post-Conversion Summary

After converting the data types:

- The `review_time` column has been successfully converted to a datetime format, facilitating time-based analysis.
- Numerical columns like `review_overall` and `beer_abv` are now in float format, ensuring accurate numerical computations.
- Categorical columns such as `brewery_name` and `beer_style` have been converted to the `category` data type, enhancing memory efficiency and computational speed for categorical operations.
- The updated data types are displayed above, confirming the successful conversions and preparing our dataset for further analysis.

## 4.2. Outliers Analysis

Given that our reviews should range from 1 to 5, outliers would be any values below 1 or above 5. These could be due to data entry errors or issues with data collection. Let's identify and handle these outliers:

In [None]:
# Define review score columns
review_score_columns = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']

# Initialize dictionary to hold the counts of outliers
outliers_dict = {}

# Count outliers for each review score column
for column in review_score_columns:
    outliers_below = df_cleaned[df_cleaned[column] < 1].shape[0]
    outliers_above = df_cleaned[df_cleaned[column] > 5].shape[0]
    outliers_dict[column] = {'below_1': outliers_below, 'above_5': outliers_above}

# Display the results
for column, counts in outliers_dict.items():
    print(f"{column} - Outliers below 1: {counts['below_1']}, Outliers above 5: {counts['above_5']}")

### Outliers Boxplot Visualization

In [None]:
# Define numerical columns for visualization
numerical_cols = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv']

# Create boxplots for each numerical column
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(2, 3, i)
    df_cleaned.boxplot(column=col)
    plt.title(col)

plt.tight_layout()
plt.show()

### Removal of Outliers Outside the Valid Score Range

Upon closer examination of our review score data, we've identified a small number of entries that have `review_overall` and `review_appearance` scores below the valid range of 1 to 5. Since our rating system does not accommodate scores below 1, these entries are likely to be data entry errors and will be removed to maintain the integrity of our dataset.

In [None]:
# Remove the outliers with review scores below 1
df_cleaned = df_cleaned[(df_cleaned['review_overall'] >= 1) & (df_cleaned['review_appearance'] >= 1)]

In [None]:
# Define review score columns
review_score_columns = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']

# Initialize dictionary to hold the counts of outliers
outliers_dict = {}

# Count outliers for each review score column
for column in review_score_columns:
    outliers_below = df_cleaned[df_cleaned[column] < 1].shape[0]
    outliers_above = df_cleaned[df_cleaned[column] > 5].shape[0]
    outliers_dict[column] = {'below_1': outliers_below, 'above_5': outliers_above}

# Display the results
for column, counts in outliers_dict.items():
    print(f"{column} - Outliers below 1: {counts['below_1']}, Outliers above 5: {counts['above_5']}")

The entries with invalid review scores have been removed. 
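
The range check is run twice above with the same loop. As a possible refactoring, the counting could be wrapped in a helper. `count_out_of_range` is a hypothetical name, not part of the original notebook, and the `toy` data is made up.

```python
import pandas as pd

def count_out_of_range(frame, columns, low=1, high=5):
    """Count, per column, the values below `low` and above `high`."""
    scores = frame[columns]
    return pd.DataFrame({
        f"below_{low}": (scores < low).sum(),
        f"above_{high}": (scores > high).sum(),
    })

# Illustrative data containing values on both sides of the valid range
toy = pd.DataFrame({
    "review_overall": [0.0, 3.5, 4.0, 5.0],
    "review_appearance": [0.0, 4.0, 4.5, 5.5],
})

counts = count_out_of_range(toy, ["review_overall", "review_appearance"])
print(counts)
```

Calling the helper before and after filtering would give the same two reports without repeating the loop.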


### Investigate the High ABV Values

High alcohol by volume (ABV) in beers can be quite normal, especially for certain styles of beer like barleywines, imperial stouts, and others that are known to have higher ABV percentages. However, exceptionally high values (e.g., significantly above 20%) may be erroneous or represent a very niche type of beer.

In [None]:
# Investigate high ABV values by checking the beer names and styles
high_abv_beers = df_cleaned[df_cleaned['beer_abv'] > 20]  # Example cutoff
high_abv_beers[['beer_name', 'beer_style', 'beer_abv']].sort_values(by='beer_abv', ascending=False).head()

### Inclusion of High ABV Beers in the Dataset

After a thorough verification using the Beer Advocate website, we have confirmed that the exceptionally high ABV beers in our dataset are genuine. Beers like "Schorschbräu Schorschbock 57%" and "Sink The Bismarck!" with ABV levels above 20% are indeed specialty beers with significantly higher alcohol content than typical beers.

Given that these are legitimate products, we have decided to include these high ABV beers in our dataset. This will allow our analysis to reflect the full spectrum of beer varieties, including those that are at the extreme end of the ABV scale. These entries provide valuable insight into the diversity of the beer market and consumer preferences for high-strength beers.

# 5. Exploratory Data Analysis

## 5.1. Distributions of Key Variables

### Review Score Distributions

We will begin our exploratory data analysis by examining the distributions of the review scores. These scores provide insights into consumer preferences and beer quality as perceived by the reviewers. We want to understand the central tendencies, dispersions, and the overall shape of the distribution of review scores such as 'review_overall', 'review_aroma', 'review_appearance', 'review_palate', and 'review_taste'.

To visualize these distributions, we will use histograms which offer a clear view of the frequency of different rating scores. We will also calculate skewness and kurtosis for these review scores to identify any asymmetry and the tailedness of the distributions, respectively. It's important to investigate these aspects as they can influence the interpretation of the average scores and the general user sentiment towards the beers reviewed.

Additionally, understanding the most reviewed beer styles can help identify consumer trends and popular types of beer, which may be beneficial for market analysis and product development strategies.

In [None]:
# Configure layout dimensions 
fig_width = 1200  
fig_height = 600 

# Create histograms for review scores 
review_score_columns = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']

# Create figure with subplots
fig = sp.make_subplots(rows=2, cols=3, subplot_titles=review_score_columns) 

for index, column in enumerate(review_score_columns, 1):
    # Calculate row and column positions for subplot layout
    row = (index - 1) // 3 + 1  
    col = (index - 1) % 3 + 1 

    # Create a histogram trace 
    hist_trace = go.Histogram(
        x=df_cleaned[column], 
        nbinsx=20,
        name=column.capitalize()
    )

    # Add the trace to the figure with subplot specification
    fig.add_trace(hist_trace, row=row, col=col)

# Update figure layout
fig.update_layout(
    title="Distributions of Review Scores",
    width=fig_width, 
    height=fig_height,
    showlegend=False
)

fig.show()

# Calculate skewness and kurtosis for review scores
print("Skewness of review scores:")
print(df_cleaned[review_score_columns].skew())
print("\nKurtosis of review scores:")
print(df_cleaned[review_score_columns].kurt())

# Calculate the top 20 most frequent beer styles
top_n = 20
top_20_styles = df_cleaned['beer_style'].value_counts().head(top_n)

# Create a DataFrame (Plotly Express works well with DataFrames)
df_top_styles = pd.DataFrame({'Beer Style': top_20_styles.index, 
                              'Number of Reviews': top_20_styles.values})

# Create the bar chart using Plotly Express
fig = px.bar(df_top_styles, 
             x='Number of Reviews', 
             y='Beer Style',
             title='Top 20 Most Reviewed Beer Styles',
             orientation='h',  
             color='Beer Style',  
             )

# Customize appearance
fig.update_layout(xaxis_title='Number of Reviews',
                  yaxis_title='',  
                  )

# Show the chart
fig.show()



### Results: Review Score Distributions

The histograms for each review score category show a clear preference towards higher ratings, indicating generally positive reviewer feedback. Consistent with this, the calculated skewness values are negative across all review scores, meaning that the distribution tails are longer on the lower end. Low scores are relatively rare, but they can differ significantly from the average ratings.

Similarly, the positive kurtosis values for all review scores indicate that the distributions have heavier tails and sharper peaks than the normal distribution (pandas reports excess kurtosis, for which a normal distribution scores 0). This reflects a high level of agreement among reviewers on certain ratings, with relatively few ratings between the peak and the extremes.
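
To make the sign conventions concrete, here is a small sketch on made-up scores. The `left_skewed` series has most values near the top of the scale and one very low score, while `symmetric` is evenly spread. Both are illustrative and not taken from the dataset.

```python
import pandas as pd

# Most scores are high, with a single low score pulling the tail to the left
left_skewed = pd.Series([1.0, 3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 5.0])

# Evenly spread scores with no peak
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

skew_left = left_skewed.skew()   # negative: longer tail on the low end
skew_sym = symmetric.skew()      # approximately zero: no asymmetry
kurt_sym = symmetric.kurt()      # negative: flatter than a normal distribution

print(f"skew (left-skewed): {skew_left:.3f}")
print(f"skew (symmetric):   {skew_sym:.3f}")
print(f"excess kurtosis (symmetric): {kurt_sym:.3f}")
```

Because `kurt()` returns excess kurtosis, values are read relative to 0 rather than 3.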

In the context of beer styles, the horizontal bar plot highlights the top 20 most reviewed beer styles, with 'American IPA' and 'American Double / Imperial IPA' being the most frequently reviewed. This suggests a high popularity and consumer interest in these categories, potentially offering valuable data to breweries looking to cater to market demands.

Overall, the review score distributions and the frequency of reviews across different beer styles paint a comprehensive picture of the current landscape of beer reviews on the platform.

### Distribution Analysis of Alcohol by Volume (ABV)

The Alcohol by Volume (ABV) percentage in beer gives us an understanding of the strength of the beer. The ABV can vary significantly across different beer styles, influencing the flavor, body, and overall drinking experience. In this part of the analysis, we will create a histogram to visualize the distribution of ABV percentages across our dataset.

By examining the distribution, we can identify the most common strength of beers reviewed and determine whether our dataset aligns with general beer style ABV expectations. High concentrations in certain ABV ranges could indicate popular market trends or reviewer preferences.

This analysis could help breweries and beer enthusiasts alike to gauge the popularity of different beer strengths and styles. It also provides a starting point for understanding how ABV relates to taste preferences and reviewing behavior.


In [None]:
# Plot the distribution of the ABV using Plotly Express
fig = px.histogram(
    df_cleaned, 
    x='beer_abv', 
    nbins=50, 
    marginal='violin',  
    color_discrete_sequence=['blue'] 
)

# Update layout for a more appealing look
fig.update_layout(
    title='Distribution of Alcohol by Volume (ABV)',
    xaxis_title='ABV (%)',
    yaxis_title='Frequency'
)

# Show the interactive plot
fig.show()

### Findings: ABV Distribution Analysis

The histogram of the ABV distribution reveals a concentration of beers in the lower to mid-range of alcohol content, which is typical for many popular beer styles. The presence of a long tail towards the higher ABV percentages suggests that there are fewer but still significant offerings of stronger beers. This could reflect a niche market for high-strength beers, such as Imperial Stouts or Barleywines.

The ABV distribution generally aligns with known beer style ABV ranges. The peak of the distribution occurs around what is standard for many ales and lagers, indicating that these styles are well-represented in the dataset. High-ABV beers are less common, but the variety within this range could speak to the diversity of the beer market and the experimentation among craft brewers.

In summary, the ABV analysis shows the wide range of beer strengths available and reviewed, from light and sessionable to strong and bold. This variability supports the notion of a diverse and evolving beer culture where consumers have a broad array of choices to suit their taste and preferences.



## 5.2. Relationships Between Variables

### Exploring Relationships Between Variables

Understanding the relationships between different variables in our dataset can reveal how certain aspects of beers are related to each other.

In [None]:
# Select relevant columns for the pair plot
columns = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv']

# Create a pair plot
pair_plot = sns.pairplot(df_cleaned[columns])

# Adjust the size of the pair plot if necessary
pair_plot.fig.set_size_inches(15,15)

# Add a main title to the pair plot
plt.subplots_adjust(top=0.95)
pair_plot.fig.suptitle('Pairwise Relationships Between Review Scores and ABV', fontsize=16)

# Show the plot
plt.show()


### Pair Plot Analysis Summary

The pair plot provided insights into the relationships between review scores for overall quality, aroma, appearance, palate, taste, and the Alcohol by Volume (ABV) percentage. Here's what we can deduce from the visualization:

**Consistent Review Scores**
- There is a **positive correlation** between the different review scores. Higher scores in one sensory attribute tend to coincide with higher scores in others, indicating **consistency** in reviewers' perceptions.

**Positive Reviews**
- Scores predominantly cluster around the 3.5 to 4.5 range, suggesting that the majority of reviews are **favorable**. The histograms show a skew towards higher ratings, with fewer low scores.

**ABV Distribution**
- The `beer_abv` content is mostly concentrated in the lower range (approximately 5% to 10%), which aligns with the ABV of **common beer styles**. High ABV beers are present but significantly less common.

**ABV vs Review Scores**
- The relationship between ABV and review scores does not show a distinct pattern, suggesting that beer strength does not solely influence its perceived quality. The diversity in scoring across ABV levels indicates that **other factors** may have a more substantial impact on reviews.

**Presence of Outliers** 
- Outliers in the `beer_abv` suggest the existence of specialty beers with much higher than average alcohol content. 

**Review Scores Density**
- The peaks in the distribution for each review score category could indicate a **tendency** among reviewers to prefer certain scores or might reflect a general consensus on beer quality.

In conclusion, while the dataset reveals general positivity in beer reviews and consistent relationships between the different review scores, ABV does not emerge as a significant determinant of review outcomes. Instead, the analysis points towards the importance of other variables in influencing reviewer opinions, which could be explored in further studies.
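
One way to check the ABV observation numerically, rather than visually, is to group beers into ABV bands and compare the mean rating per band. This is a sketch under assumptions: the band edges and labels are arbitrary choices, and the `toy` values are made up.

```python
import pandas as pd

# Illustrative ABV values and overall ratings
toy = pd.DataFrame({
    "beer_abv": [4.5, 5.0, 6.5, 8.0, 9.5, 12.0],
    "review_overall": [3.5, 4.0, 4.0, 4.5, 3.5, 4.0],
})

# Arbitrary band edges; pd.cut intervals are right-inclusive by default,
# so a beer at exactly 5.0% falls in the first band
bins = [0, 5, 10, 60]
labels = ["0-5%", "5-10%", "10%+"]
toy["abv_band"] = pd.cut(toy["beer_abv"], bins=bins, labels=labels)

# Mean rating and number of reviews per ABV band
band_means = (
    toy.groupby("abv_band", observed=True)["review_overall"]
    .agg(["mean", "count"])
)
print(band_means)
```

Reporting the count next to each mean helps avoid reading too much into bands that contain only a handful of reviews.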


## 5.3. Correlation Matrix

We now focus on identifying the strength and direction of the relationships between different beer attributes. The correlation matrix is a powerful tool in this exploration, allowing us to visualize how closely linked certain characteristics, such as taste, aroma, and appearance, are to each other and to the alcohol by volume (ABV) percentage.

We would like to discover if there are any patterns that indicate, for instance, whether beers with higher ABV tend to receive better or worse reviews. 

The following code block generates an annotated heatmap to visualize these correlations. It offers a snapshot of the interdependencies within our dataset, providing clarity on which attributes move together and which diverge.
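
As a reminder of what each cell of the matrix contains, the sketch below computes the Pearson correlation by hand and compares it with `DataFrame.corr()`, which uses Pearson by default. The `pearson` helper and the `toy` values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative ratings and ABV values
toy = pd.DataFrame({
    "review_taste": [3.0, 3.5, 4.0, 4.5, 5.0],
    "review_overall": [3.0, 4.0, 4.0, 4.5, 5.0],
    "beer_abv": [9.0, 5.0, 7.0, 5.5, 8.0],
})

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

corr_toy = toy.corr()
manual = pearson(toy["review_taste"], toy["review_overall"])

print(corr_toy.round(2))
print(f"manual taste vs overall: {manual:.4f}")
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.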


In [None]:
# Select only numeric columns for correlation matrix
numeric_df = df_cleaned.select_dtypes(include=[np.number])

# Recalculate the correlation matrix
corr = numeric_df.corr()

# Generate the plot
fig = ff.create_annotated_heatmap(
    z=corr.to_numpy(),
    x=corr.columns.tolist(),
    y=corr.index.tolist(),
    annotation_text=corr.round(2).astype(str).to_numpy(),
    colorscale='Viridis',
    showscale=True
)

# Update layout
fig.update_layout(
    title='Correlation Matrix of Beer Review Scores',
    xaxis={'title': 'Variables'},
    yaxis={'title': 'Variables'},
    autosize=False,  
    width=1800,       
    height=600       
)

# Show the interactive figure
fig.show()



### Correlation Matrix Analysis

The resulting heatmap from the correlation matrix provides us with several noteworthy observations:

- There's a strong positive correlation between the review scores for taste, aroma, palate, and appearance, suggesting that reviewers who enjoy the taste of a beer often find the aroma, palate, and appearance to be pleasant as well. It indicates a general consistency in the perception of beer quality across different sensory dimensions.
- There's a very strong correlation between the overall review score and the scores for taste and palate. This suggests that taste and palate are the most influential factors.
- The overall review score also shows a strong correlation with the aroma score, highlighting the importance of scent in the overall experience of the beer.
- There is a moderately strong correlation between the overall review score and the appearance score. This reflects that visual appeal does play a role in the overall impression but perhaps not as much as taste or aroma.
- The Alcohol by Volume (ABV) percentage shows a weaker correlation with the review scores, which could imply that the strength of the beer does not consistently influence how it is rated in terms of taste, aroma, etc.
- Interestingly, the correlation between ABV and the other attributes is not negative, indicating that stronger beers are not necessarily rated poorly.


## 5.4. Time Series Analysis of Beer Ratings and ABV

We now perform a time series analysis of beer ratings and Alcohol by Volume (ABV) percentages to uncover patterns that may inform brewers and consumers alike about shifts in beer preferences and brewing strengths. We look at:

- The average overall rating trend, which may reflect changes in consumer satisfaction or review behaviors over time.
- The average ABV trend, indicating whether there has been a shift toward stronger or milder beers.

In [None]:
# Make sure to exclude non-numeric columns before grouping
numeric_columns = df_cleaned.select_dtypes(include=[np.number])
df_grouped = numeric_columns.groupby(df_cleaned['review_time'].dt.to_period('M')).mean()  # Group by month

# Reset index to get 'review_time' as a column
df_grouped.reset_index(inplace=True)

# Convert the period column to datetime to make sure Matplotlib can handle it
df_grouped['review_time'] = df_grouped['review_time'].dt.to_timestamp()

# Create line plots
plt.figure(figsize=(14, 7))

# Plot average overall rating
plt.subplot(2, 1, 1)
plt.plot(df_grouped['review_time'], df_grouped['review_overall'], label='Average Overall Rating')
plt.title('Average Ratings and ABV Over Time')
plt.ylabel('Average Rating')
plt.legend()

# Plot average ABV
plt.subplot(2, 1, 2)
plt.plot(df_grouped['review_time'], df_grouped['beer_abv'], label='Average ABV', color='orange')
plt.xlabel('Time')
plt.ylabel('Average ABV (%)')
plt.legend()

# Show plots
plt.tight_layout()
plt.show()



### Analysis of Trends in Beer Ratings and ABV

The time series visualizations reveal intriguing patterns in beer ratings and ABV from 1996 to 2012:

**Average Overall Rating**
- The graph indicates a gradual stabilization of beer ratings over time. After an initial period of volatility, the ratings converge to a narrower band, suggesting a maturing market where consumers' expectations are increasingly met by the brewers' offerings.

**Average ABV**
- Contrasting the stability in beer ratings, the ABV shows a gradual increase. This trend could be reflective of a growing consumer interest in stronger beers, perhaps in line with the rise of craft breweries that often experiment with higher ABV ranges.

**Conclusion**
- These observations may suggest a market that's settling in terms of quality expectations while simultaneously developing a taste for potency in beer profiles. Brewers can infer that there is a potential market segment leaning towards beers with higher ABV.
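
The early volatility in the ratings may partly reflect months with very few reviews. This is an assumption; the analysis above does not verify it. A minimal sketch, on made-up data, of reporting the review count next to each monthly mean:

```python
import pandas as pd

# Illustrative reviews: sparse early months, denser later months
toy = pd.DataFrame({
    "review_time": pd.to_datetime([
        "1998-01-05", "1998-02-10",
        "2010-03-01", "2010-03-15", "2010-03-20", "2010-04-02",
    ]),
    "review_overall": [2.0, 5.0, 4.0, 4.0, 3.5, 4.0],
})

# Monthly mean rating together with the number of reviews behind it
monthly = (
    toy.groupby(toy["review_time"].dt.to_period("M"))["review_overall"]
    .agg(["mean", "count"])
)
print(monthly)
```

Months with a low `count` produce means that swing widely from one month to the next, so plotting the count alongside the mean would help separate genuine trends from small-sample noise.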



## Top Breweries

In [None]:
# Calculate the top 10 most reviewed breweries (pandas is already imported as pd above)
top_n = 10
top_10_breweries = df_cleaned['brewery_name'].value_counts().head(top_n)

# Create a DataFrame suitable for Plotly Express
df_top_breweries = pd.DataFrame({'Brewery Name': top_10_breweries.index, 
                                 'Number of Reviews': top_10_breweries.values})

# Create the bar chart using Plotly Express
fig = px.bar(df_top_breweries, 
             x='Number of Reviews', 
             y='Brewery Name',
             title='Top 10 Breweries by Review Count',
             orientation='h',  # horizontal bar chart
             color='Brewery Name',  # color code by brewery
             )

# Customize the chart's appearance
fig.update_layout(
    xaxis_title='Number of Reviews',
    yaxis_title='',
    yaxis=dict(autorange="reversed"),  # to have the highest count on top
)

# Show the chart
fig.show()


In [None]:
# Top Reviewers
# Find the top 10 reviewers based on the number of reviews they've submitted
top_reviewers = df_cleaned['review_profilename'].value_counts().head(10)
top_reviewers_df = pd.DataFrame({'Reviewer': top_reviewers.index, 'Number of Reviews': top_reviewers.values})

# Plotting the top reviewers with a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Number of Reviews', y='Reviewer', data=top_reviewers_df, palette='viridis')
plt.title('Top 10 Reviewers by Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Reviewer')
plt.tight_layout()
plt.show()

# Rating Patterns
# Calculate the average rating for each reviewer and count how many ratings they've submitted
reviewer_means = df_cleaned.groupby('review_profilename')['review_overall'].agg(['mean', 'count']).reset_index()
# Filter out reviewers with a very low number of reviews to avoid skewing the results
reviewer_means_filtered = reviewer_means[reviewer_means['count'] > 50]  # arbitrary threshold

# Plotting the distribution of average ratings for reviewers
plt.figure(figsize=(10, 6))
sns.histplot(reviewer_means_filtered['mean'], bins=30, kde=False)
plt.title('Distribution of Average Ratings by Reviewers')
plt.xlabel('Average Rating')
plt.ylabel('Frequency')
plt.show()

# You can also plot a scatter plot to visualize the relationship between the number of reviews and the average rating
plt.figure(figsize=(10, 6))
sns.scatterplot(x='count', y='mean', data=reviewer_means_filtered)
plt.title('Number of Reviews vs Average Rating by Reviewers')
plt.xlabel('Number of Reviews')
plt.ylabel('Average Rating')
plt.show()


**Top 10 Reviewers by Number of Reviews**

This chart highlights significant variation in activity levels among reviewers. Some reviewers, such as 'northyorksammy', are extremely prolific, while others contribute far fewer reviews. Because the most frequent reviewers account for a disproportionate share of the data, their preferences and biases may carry more weight in the overall perception of beers in the dataset.
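
A quick way to gauge how concentrated the reviews are is to compute the share contributed by the most active reviewers. The sketch below uses a tiny made-up table named `demo` in place of `df_cleaned`; the reviewer names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical review log: one row per review (names and counts are made up)
demo = pd.DataFrame({
    "review_profilename": ["ann"] * 6 + ["bob"] * 3 + ["cy"] * 1,
})

# Reviews per reviewer, sorted from most to least active
counts = demo["review_profilename"].value_counts()

# Share of all reviews written by the top_n most active reviewers
top_n = 2
top_share = counts.head(top_n).sum() / counts.sum()
print(f"Top {top_n} reviewers wrote {top_share:.0%} of all reviews")
# prints: Top 2 reviewers wrote 90% of all reviews
```

Applied to `df_cleaned` with `top_n = 10`, the same calculation would show how much weight the ten reviewers in the chart actually carry.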

**Distribution of Average Ratings by Reviewers**

- The distribution appears to be centered around positive ratings (between 3.5 and 4.0 stars), suggesting that the majority of reviewers have a favorable impression of the beers they are reviewing.

- The distribution has a somewhat bell-shaped curve, with a peak in the middle and tapering tails on either side. This suggests that ratings are clustered around the average, with fewer outliers at the extremes (very low or very high ratings).

- The prevalence of positive ratings might indicate that the beers are generally well-received by reviewers.

**Number of Reviews vs Average Rating by Reviewers**
- Each point is one reviewer with more than 50 reviews, positioned by the number of reviews they submitted and their average overall rating. There appears to be a weak positive correlation: reviewers who submit more reviews tend to give slightly higher average ratings.

- The data points are widely scattered, so review volume alone does not determine a reviewer's average rating. Some prolific reviewers have comparatively low averages, and some less active reviewers have high ones.

**Possible Interpretations**

- The weak positive correlation might suggest that more active reviewers are slightly more generous, or that they tend to choose beers they expect to enjoy.

- Another possibility is selection bias: reviewers may be more likely to review beers they feel strongly about, so a reviewer's average reflects the beers they chose to review rather than beers in general.

- Whether more popular beers (those with more reviews) are better rated cannot be read from this plot, because it aggregates by reviewer. Answering that question would require grouping by beer instead.
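
The strength of the relationship in the scatter plot can be checked with a correlation coefficient computed over the same per-reviewer table. The sketch below is illustrative only: it builds a synthetic table named `demo` (hypothetical random ratings, so the correlation here is expected to be near zero) and applies the same aggregation and `count > 50` filter as the cell above. The rank-based (Spearman) coefficient is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_cleaned (hypothetical reviewers and ratings)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "review_profilename": rng.choice([f"user_{i}" for i in range(40)], size=4000),
    "review_overall": rng.choice(np.arange(1.0, 5.5, 0.5), size=4000),
})

# Per-reviewer average rating and review count, with the same threshold as above
stats = demo.groupby("review_profilename")["review_overall"].agg(["mean", "count"])
stats = stats[stats["count"] > 50]

# Pearson measures linear association; Spearman (Pearson on ranks) measures monotonic association
pearson = stats["count"].corr(stats["mean"])
spearman = stats["count"].rank().corr(stats["mean"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Running the last four statements on `reviewer_means_filtered` would give a number to accompany the visual impression of a weak correlation.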


## 6. Modeling

We implemented K-means clustering to segment beer reviews based on their sensory ratings and alcohol by volume (ABV). By grouping similar reviews together, we aimed to uncover patterns in the dataset that could provide useful insights for brewers and consumers alike.

In [None]:
# Select relevant columns for clustering
clustering_columns = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv']

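# Standardize the data
# StandardScaler is not part of the library imports above, so import it here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()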
scaled_data = scaler.fit_transform(df_cleaned[clustering_columns])

# Define the range of clusters to test
min_clusters = 2
max_clusters = 10
num_clusters_range = range(min_clusters, max_clusters + 1)

# Calculate within-cluster sum of squares (WCSS) for each number of clusters
wcss = []
for num_clusters in num_clusters_range:
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)

# Plot the Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(num_clusters_range, wcss, marker='o', linestyle='-')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.xticks(num_clusters_range)
plt.grid(True)
plt.show()

In [None]:
# Define the number of clusters (you can adjust this based on your requirements)
num_clusters = 5

# Initialize K-means model
kmeans = KMeans(n_clusters=num_clusters, random_state=42)

# Fit the model to the scaled data
kmeans.fit(scaled_data)

# Get the cluster labels for each sample
cluster_labels = kmeans.labels_

# Add cluster labels to the original dataframe
df_cleaned['cluster'] = cluster_labels

# Check the distribution of clusters
print("Distribution of clusters:")
print(df_cleaned['cluster'].value_counts())

# Analyze cluster centroids
cluster_centroids = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centroids_df = pd.DataFrame(cluster_centroids, columns=clustering_columns)
cluster_centroids_df.index.name = 'Cluster'
print("\nCluster centroids:")
print(cluster_centroids_df)

In [None]:
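# Perform PCA to reduce the dimensionality of the data to 2 dimensions
# PCA is not part of the library imports above, so import it here
from sklearn.decomposition import PCA
pca = PCA(n_components=2)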
principal_components = pca.fit_transform(scaled_data)

# Create a DataFrame with the principal components and cluster labels
cluster_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
cluster_df['cluster'] = kmeans.labels_

# Plot the clusters in 2D space
plt.figure(figsize=(10, 6))
for cluster in range(num_clusters):
    plt.scatter(cluster_df.loc[cluster_df['cluster'] == cluster, 'PC1'],
                cluster_df.loc[cluster_df['cluster'] == cluster, 'PC2'],
                label=f'Cluster {cluster}',
                alpha=0.5)
plt.title('Clusters in 2D PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
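
A 2D PCA plot is only as informative as the share of variance its two components retain. The sketch below shows how that share can be read from `explained_variance_ratio_`. It is a minimal illustration on synthetic data: `X_demo` is a hypothetical array of five correlated columns plus one independent column, not the beer review features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: five columns sharing a common signal plus one independent column
rng = np.random.default_rng(2)
base = rng.normal(size=(500, 1))
correlated = [base + 0.3 * rng.normal(size=(500, 1)) for _ in range(5)]
X_demo = np.hstack(correlated + [rng.normal(size=(500, 1))])

# Standardize, then fit a 2-component PCA
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of total variance captured by each component, and by both together
ratios = pca_demo.explained_variance_ratio_
print("Per component:", ratios.round(3))
print("Retained in 2D:", round(float(ratios.sum()), 3))
```

For the notebook's own model, printing `pca.explained_variance_ratio_` after the fit would show how much of the structure the 2D scatter plot can represent. A low total would mean that overlapping clusters in the plot may still be well separated in the original feature space.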

## 6.2 K-means 2nd option
- Selecting Features for Clustering:
    - We began by selecting the numerical features that will be used for clustering. These features include 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', and 'beer_abv'. These features represent different aspects of beer reviews and characteristics.
- Scaling Numerical Features:
    - Before applying K-Means clustering, it is important to scale the numerical features to ensure that each feature contributes equally to the clustering process. We used the StandardScaler to scale the numerical features to have zero mean and unit variance.
- Selecting the Number of Clusters (k):
    - We determined the number of clusters (k) to use in the K-Means algorithm. In this case, we chose to use 5 clusters, but the number of clusters can be adjusted based on domain knowledge or using techniques such as the elbow method or silhouette score.
- Performing K-Means Clustering:
    - We applied the K-Means clustering algorithm to the scaled numerical features. K-Means partitions the dataset into k clusters based on the similarity of data points. Each cluster is represented by its centroid, and data points are assigned to the nearest centroid.
- Adding Cluster Labels to the DataFrame:
    - After clustering, we added the cluster labels assigned by the K-Means algorithm to the original DataFrame. These cluster labels indicate which cluster each data point belongs to and will be used for further analysis and interpretation.
- Visualizing Clusters Using PCA:
    - We used Principal Component Analysis (PCA) for dimensionality reduction and visualization of the clusters. PCA projects the original data onto a lower-dimensional space while retaining as much of the data's variance as possible. We visualized the clusters in a 2D scatter plot using the first two principal components to understand the distribution of data points in the reduced space.
- Visualizing Clusters in 3D:
    - Additionally, we performed PCA with three components to visualize the clusters in a 3D scatter plot. This provides a more comprehensive view of the clusters, allowing us to explore the data in three dimensions.
    
By completing this clustering analysis, we aim to uncover underlying patterns or structures in the beer review dataset. These insights can be valuable for various applications, such as targeted marketing, product recommendations, or customer segmentation.
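
The steps above mention the silhouette score as an alternative to the elbow method for choosing k. The sketch below shows one way it could be applied. It is a minimal illustration on synthetic data: `X_demo` is a hypothetical set of three well-separated blobs, and the range of k values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three well-separated 2D blobs of 100 points each
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Mean silhouette score for each candidate k (higher means tighter, better-separated clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k:", best_k)
```

The silhouette score compares all pairs of points, so on a dataset as large as the beer reviews it would likely need the `sample_size` argument of `silhouette_score` (together with `random_state`) to evaluate a random subset instead of every row.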

In [None]:
# Selecting features for clustering (adjust as needed)
numerical_features = ['review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv']

# Extracting numerical feature values
X = df_cleaned[numerical_features].values

# Scaling the numerical features
scaler2 = StandardScaler()
X_scaled = scaler2.fit_transform(X)

# Determine the optimal number of clusters using the elbow method
inertia = []
for n_clusters in range(1, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plotting the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia, marker='o', linestyle='-')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()

# Selecting the number of clusters
n_clusters = 5

# Performing k-means clustering
kmeans2 = KMeans(n_clusters=n_clusters, random_state=42)
kmeans2.fit(X_scaled)

# Adding cluster labels to the DataFrame
df_cleaned['cluster_label2'] = kmeans2.labels_

# Visualizing the clusters using PCA for dimensionality reduction
pca1 = PCA(n_components=2)
X_pca = pca1.fit_transform(X_scaled)

pca2 = PCA(n_components=3)
X_pca2 = pca2.fit_transform(X_scaled)

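# Plotting the clusters
# Use the labels from kmeans2 (the 5-cluster model fitted above); `kmeans` still
# holds the last model from the elbow loop, which was fitted with 10 clusters
plt.figure(figsize=(10, 6))
for cluster_label2 in range(n_clusters):
    plt.scatter(X_pca[kmeans2.labels_ == cluster_label2, 0],
                X_pca[kmeans2.labels_ == cluster_label2, 1],
                label=f'Cluster {cluster_label2 + 1}')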
plt.title('Clusters Visualized Using PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()

# Adding PCA components to the DataFrame
df_cleaned['PCA1'] = X_pca2[:, 0]
df_cleaned['PCA2'] = X_pca2[:, 1]
df_cleaned['PCA3'] = X_pca2[:, 2] 

# Plotting the clusters using 3D scatter plot with Plotly
fig = px.scatter_3d(df_cleaned, x='PCA1', y='PCA2', z='PCA3', color='cluster_label2', 
                    symbol='cluster_label2', opacity=0.7, 
                    labels={'PCA1': 'Principal Component 1', 'PCA2': 'Principal Component 2', 'PCA3': 'Principal Component 3'}, 
                    title='Beer Reviews Clustering (3D Scatter Plot)')
fig.update_layout(scene=dict(xaxis_title='Principal Component 1', yaxis_title='Principal Component 2', zaxis_title='Principal Component 3'))
fig.show()