## EDA

### Descriptive Statistics:
Calculate **basic statistics** like **mean, median, and standard deviation** for the 'Rating,' 'Aroma,' 'Acidity,' 'Body,' 'Flavor,' 'Aftertaste,' and 'Price' columns to get an overall understanding of the dataset. Find out how often different roasters and locations appear in the dataset. 
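
A minimal sketch of these summaries, assuming the lowercase column names of the cleaned CSV (`rating`, `aroma`, `roaster`, `price_per_lbs_adj`). The `toy` frame and its values are made up so the snippet runs on its own; on the real data the same calls would go on `df`.

```python
import pandas as pd

# Toy rows standing in for the review data (values are made up for illustration).
toy = pd.DataFrame({
    "roaster": ["A", "A", "B", "C"],
    "rating": [94, 93, 92, 88],
    "aroma": [9, 9, 8, 8],
    "price_per_lbs_adj": [40.0, 25.0, 24.0, 65.0],
})

numeric_cols = ["rating", "aroma", "price_per_lbs_adj"]

# One row per statistic, one column per numeric attribute.
summary = toy[numeric_cols].agg(["mean", "median", "std"])

# How often each roaster appears, most frequent first.
roaster_counts = toy["roaster"].value_counts()

print(summary)
print(roaster_counts)
```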

### Distributions:
Visualize the **distributions** of 'Rating,' 'Aroma,' 'Acidity,' 'Body,' 'Flavor,' and 'Aftertaste' using histograms or box plots to understand the spread of values.

### Correlations:
Investigate the **correlations** between different attributes such as 'Rating,' 'Aroma,' 'Acidity,' 'Body,' 'Flavor,' and 'Aftertaste.' Identify which attributes tend to go together or have an impact on the overall rating.
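
One possible way to get at this is a correlation matrix. The sketch below uses Spearman rank correlation, which is an assumption on my part (the sensory scores are coarse integer scales); Pearson, the pandas default, is the other obvious choice. The `toy` values are made up.

```python
import pandas as pd

# Made-up scores; column names follow the cleaned CSV.
toy = pd.DataFrame({
    "rating": [94, 93, 92, 88, 90],
    "aroma": [9, 9, 8, 7, 8],
    "flavor": [9, 9, 9, 8, 8],
    "acidity": [8, 9, 8, 7, 8],
})

# Pairwise rank correlations between all numeric columns.
corr = toy.corr(method="spearman")

# Correlation of each attribute with the overall rating, strongest first.
rating_corr = corr["rating"].drop("rating").sort_values(ascending=False)

print(rating_corr)
```

Note that a high correlation with `rating` shows association only; it does not by itself show that an attribute drives the rating.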

### Top Roasters and Coffees:
Identify the top-rated roasters and coffee names based on the 'Rating' column. 
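
A rough sketch of a roaster ranking. The minimum-review cutoff (`MIN_REVIEWS`) is a hypothetical choice, added so that a roaster with a single high score does not top the list; the data is made up.

```python
import pandas as pd

# Made-up reviews: roaster C has one very high score but only one review.
toy = pd.DataFrame({
    "roaster": ["A", "A", "A", "B", "B", "C"],
    "rating": [94, 93, 95, 92, 90, 97],
})

MIN_REVIEWS = 2  # hypothetical cutoff, tune to the real data

# Mean rating and number of reviews per roaster.
by_roaster = toy.groupby("roaster")["rating"].agg(["mean", "count"])

# Keep roasters with enough reviews, best average first.
top = by_roaster[by_roaster["count"] >= MIN_REVIEWS].sort_values("mean", ascending=False)

print(top)
```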

### Quantity Analysis:
Investigate the 'Quantity' and 'Unit' columns to understand the different packaging sizes and units in which coffee is sold. Analyze how these factors relate to pricing and consumer preferences.
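
To compare prices across package sizes, quantities need a common unit first. The sketch below converts to pounds and derives a price per pound. The column names follow the cleaned CSV, the three rows mirror quantity and price pairs from that data, and the `TO_LBS` mapping only covers the two units shown here, which is an assumption about what the full dataset contains.

```python
import pandas as pd

# Pounds per unit: 16 ounces per pound, 453.59237 grams per pound.
TO_LBS = {"ounces": 1 / 16, "grams": 1 / 453.59237}

toy = pd.DataFrame({
    "quantity_value": [4.0, 12.0, 227.0],
    "quantity_unit": ["ounces", "ounces", "grams"],
    "price_usd_adj": [10.00, 19.00, 23.14],
})

# Units missing from TO_LBS map to NaN, which makes unhandled units easy to spot.
toy["lbs"] = toy["quantity_value"] * toy["quantity_unit"].map(TO_LBS)
toy["price_per_lb"] = toy["price_usd_adj"] / toy["lbs"]

print(toy[["quantity_value", "quantity_unit", "lbs", "price_per_lb"]])
```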

### Roaster Performance:
Evaluate roasters' performance based on their ratings and the origin of the coffee beans. Are there specific regions or origins associated with higher ratings for particular roasters?

## Deeper Analysis:

### Geospatial Analysis:
Analyze the 'Roaster_Location' and 'Origin' columns to understand where the roasters are located and where the coffee beans are sourced from. You can use geospatial tools to create maps or investigate the relationship between origin and rating.

### Currency Analysis:
Analyze the 'Currency' column to understand the currencies used for pricing. You can convert prices to a common currency (e.g., USD) for comparison.
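
A small sketch of the conversion step. The rates in `USD_PER_UNIT` are illustrative placeholders, not real exchange rates; a careful analysis would look up the historical rate for each review date.

```python
import pandas as pd

# Illustrative rates only (USD per one unit of the currency).
USD_PER_UNIT = {"USD": 1.0, "TWD": 0.0326}

toy = pd.DataFrame({
    "price_value": [10.00, 710.00],
    "price_currency": ["USD", "TWD"],
})

# Which currencies appear, and how often.
currency_counts = toy["price_currency"].value_counts()

# Convert every price to USD; currencies without a rate become NaN.
toy["price_usd"] = toy["price_value"] * toy["price_currency"].map(USD_PER_UNIT)

print(currency_counts)
print(toy)
```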

### Price Analysis:
Analyze the relationship between 'Price' and 'Rating.' Do higher-priced coffees tend to have higher ratings? You can also look for outliers in pricing. Investigate the relationship between pricing ('Price' and 'Currency') and sensory attributes ('Aroma,' 'Acidity,' 'Body,' 'Flavor,' 'Aftertaste'). Are there pricing strategies associated with higher ratings?
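
A sketch of both checks on made-up data: flagging price outliers with the common 1.5 × IQR rule (one convention among several) and measuring the price and rating association with a rank correlation, which is less sensitive to extreme prices than Pearson.

```python
import pandas as pd

# Made-up data with one extreme price.
toy = pd.DataFrame({
    "rating": [94, 93, 92, 94, 93, 95],
    "price_per_lbs_adj": [40.0, 25.33, 24.0, 46.28, 29.33, 400.0],
})

# Upper fence of the 1.5 * IQR rule.
q1, q3 = toy["price_per_lbs_adj"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["price_per_lbs_adj"] > upper]

# Spearman rank correlation between price and rating.
rho = toy.corr(method="spearman").loc["price_per_lbs_adj", "rating"]

print(outliers)
print(rho)
```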

### Text Analysis:
Perform natural language processing (NLP) on the 'Review_Description,' 'Blind_Assessment,' and 'Notes' columns to extract insights about the sensory descriptions, flavor profiles, and unique characteristics of the coffees.
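
As a first step before heavier NLP, simple descriptor counts already show which flavor words dominate. The sketch below uses only the standard library; the stopword list and the two review snippets are made up for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"a", "and", "with", "of", "in", "the"}  # tiny illustrative list


def descriptor_counts(texts):
    """Count lowercase word tokens across texts, keeping hyphenated terms whole."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


toy_notes = [
    "Richly sweet, spice-toned. Cocoa nib and pink grapefruit in aroma.",
    "Crisply sweet-tart. Cocoa nib with a hint of bergamot.",
]

counts = descriptor_counts(toy_notes)
print(counts.most_common(5))
```

The same counts could feed a word cloud or be compared between high-rated and low-rated coffees.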

### Distribution of Ratings:

Visualize the distribution of coffee ratings to see the overall quality of coffees reviewed. This could be done using histograms or boxplots.
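
The numbers behind both plot types can be computed directly, which is handy for checking a chart. This sketch uses made-up ratings and computes histogram bin counts (one bin per integer score) and the five-number summary that a box plot draws.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([94, 93, 92, 94, 93, 88, 88, 92, 93], name="rating")  # made-up

# Bin edges one point wide, so each integer score gets its own bar.
edges = np.arange(ratings.min(), ratings.max() + 2)
counts, _ = np.histogram(ratings, bins=edges)

# Minimum, quartiles, and maximum: the values a box plot is built from.
five_num = ratings.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(dict(zip(edges[:-1].tolist(), counts.tolist())))
print(five_num)
```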

### Trend Analysis:

Analyze rating trends over time to identify any patterns or shifts in coffee quality or preferences.
Explore how roast levels or coffee origins may trend over time. Are lighter or darker roasts becoming more popular?
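
A sketch of a yearly trend table, assuming `review_date` parses with `pd.to_datetime` (the actual date format in the CSV is not shown here, so that is an assumption). The rows and roast level labels are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "review_date": ["2010-03-01", "2010-09-01", "2020-01-01", "2020-06-01"],
    "rating": [88, 90, 93, 94],
    "roast_level": ["Medium", "Medium-Dark", "Light", "Medium-Light"],
})

toy["year"] = pd.to_datetime(toy["review_date"]).dt.year

# Average rating per review year.
rating_by_year = toy.groupby("year")["rating"].mean()

# Share of each roast level within a year (each row sums to 1).
roast_share = pd.crosstab(toy["year"], toy["roast_level"], normalize="index")

print(rating_by_year)
print(roast_share)
```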

### Geographic Analysis:

Map the roaster locations to visualize geographic distributions and densities of coffee roasters.
Compare the coffee origins to their corresponding ratings and prices to see if certain regions consistently produce higher-rated or more expensive coffees.

### Price Analysis:

Investigate the relationship between price and quality. Do higher prices correlate with higher ratings?
Adjust prices for inflation using the consumer price index to analyze real price changes over time.
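
The usual CPI adjustment scales a historical price by the ratio of a reference-period CPI to the CPI at the time of sale. A minimal sketch, with made-up CPI values and a hypothetical reference constant `CPI_REF`; whether the cleaned dataset's `price_usd_adj` column was computed exactly this way is not confirmed here.

```python
import pandas as pd

CPI_REF = 300.0  # hypothetical CPI of the reference period

toy = pd.DataFrame({
    "price_value_usd_hist": [20.0, 30.0],    # price in USD at review time
    "consumer_price_index": [200.0, 300.0],  # made-up CPI at review time
})

# real price = historical price * (reference CPI / CPI at review time)
toy["real_price"] = toy["price_value_usd_hist"] * CPI_REF / toy["consumer_price_index"]

print(toy)
```

In this toy case both coffees cost the same in real terms, even though the nominal prices differ.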

### Roast Level Analysis:

Compare the average ratings, aroma, body, and flavor profiles between different roast levels.
Determine if certain coffee origins tend to have specific roast levels.
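
Both questions reduce to a group-by and a cross-tabulation. The sketch below uses made-up rows; the roast level labels are placeholders and may not match the values in the real `roast_level` column.

```python
import pandas as pd

toy = pd.DataFrame({
    "roast_level": ["Light", "Light", "Medium", "Medium"],
    "coffee_origin_country": ["Ethiopia", "Kenya", "Colombia", "Colombia"],
    "rating": [94, 92, 90, 91],
    "aroma": [9, 9, 8, 8],
    "body": [8, 9, 8, 9],
    "flavor": [9, 9, 8, 9],
})

# Average score profile per roast level.
profile = toy.groupby("roast_level")[["rating", "aroma", "body", "flavor"]].mean()

# Number of coffees per origin country and roast level.
origin_by_roast = pd.crosstab(toy["coffee_origin_country"], toy["roast_level"])

print(profile)
print(origin_by_roast)
```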

### Word Clouds from Reviews:

Generate word clouds from the 'notes' and 'blind_assessment' columns to visualize the most frequent descriptors used in coffee reviews.

### Correlation Analysis:

Perform correlation analysis between numeric variables such as rating, price, acidity, body, flavor, and aftertaste. This can help identify which factors are most closely associated with high-quality coffee.

### Text Analysis on Coffee Descriptions:

Use natural language processing to analyze the text data in coffee descriptions. Extract common themes or topics that appear in higher-rated coffees.

### Impact of Coffee Variety:

Investigate if certain varieties of Arabica, like Geisha, consistently receive higher ratings compared to others.
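
The column list has no dedicated variety column, so one workable approach (an assumption, not a confirmed method) is to look for the variety name in `title`. The sketch matches both the "Geisha" and "Gesha" spellings on made-up titles.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Colombia Geisha", "Panama Gesha Natural", "Kenya AA", "Brazil Natural"],
    "rating": [95, 96, 92, 90],
})

# Case-insensitive match on "geisha" or "gesha" as a whole word.
is_geisha = toy["title"].str.contains(r"\bgei?sha\b", case=False, regex=True)
toy["variety_group"] = is_geisha.map({True: "Geisha", False: "Other"})

# Mean rating and sample size per group.
variety_stats = toy.groupby("variety_group")["rating"].agg(["mean", "count"])

print(variety_stats)
```

Titles that omit the variety will land in "Other", so the comparison is only as good as the naming in the titles.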

### Comparative Analysis by Country:

Compare average coffee ratings and price by country of origin and roaster location. This can reveal which countries are known for better quality or more expensive coffees.

### Review Sentiment Analysis:

Conduct sentiment analysis on review texts to quantify the positivity or negativity of each review and see if this correlates with the coffee's rating.

```python
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn.objects as so
```


```python
data_dir = Path('../../data')
file = data_dir / 'processed' / '05052024_roast_review_cleaned.csv'

df = pd.read_csv(file)
df.columns
```

```text
Index(['rating', 'roaster', 'title', 'blind_assessment', 'bottom_line',
       'roaster_location', 'coordinate location', 'og_roaster_location',
       'roaster_location_identifier', 'territorial_entity_1',
       'territorial_entity_1_identifiers', 'territorial_entity_2',
       'territorial_entity_2_identifiers', 'roaster_country', 'coffee_origin',
       'coffee_origin_country', 'roast_level', 'est_price', 'review_date',
       'aroma', 'body', 'flavor', 'aftertaste', 'url', 'acidity', 'notes',
       'agtron_external', 'agtron_ground', 'quantity_value', 'quantity_unit',
       'price_value', 'price_currency', 'price_value_usd_hist',
       'consumer_price_index', 'price_usd_adj', 'quantity_in_lbs',
       'price_per_lbs_adj', 'roaster_county', 'roaster_us_state'],
      dtype='object')
```

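```python
# drop() returns a new DataFrame, so this cell only displays the result;
# df itself still contains these columns.
df.drop(columns=['territorial_entity_1_identifiers', 'territorial_entity_2_identifiers',
                 'est_price', 'consumer_price_index'])
```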

Unnamed: 0,rating,roaster,title,blind_assessment,bottom_line,roaster_location,coordinate location,og_roaster_location,roaster_location_identifier,territorial_entity_1,...,quantity_value,quantity_unit,price_value,price_currency,price_value_usd_hist,price_usd_adj,quantity_in_lbs,price_per_lbs_adj,roaster_county,roaster_us_state
0,94,Utopian Coffee,Colombia Turron,"Richly sweet, deeply savory. Rambutan, pink pe...",An intriguingly spice-toned natural Colombia w...,Fort Wayne,"41.08045,-85.13915","Fort Wayne, Indiana",Q49268,Allen County,...,4.0,ounces,10.00,USD,10.00,10.00,0.25,40.00,Allen County,Indiana
1,93,JBC Coffee Roasters,Mukwinja DR Congo,"Balanced, richly spice-toned. Ginger snaps, ne...",A quietly confident DR Congo that leads with w...,Madison,"43.07472222222222,-89.38416666666667","Madison, Wisconsin",Q43788,Dane County,...,12.0,ounces,19.00,USD,19.00,19.00,0.75,25.33,Dane County,Wisconsin
2,92,JBC Coffee Roasters,Buesaco Colombia,"Crisply sweet-tart. Cocoa nib, pink grapefruit...","A solid, familiar washed Colombia, crisply cho...",Madison,"43.07472222222222,-89.38416666666667","Madison, Wisconsin",Q43788,Dane County,...,12.0,ounces,18.00,USD,18.00,18.00,0.75,24.00,Dane County,Wisconsin
3,94,Coffeebox Coffee,Colombia Nariño Villa Maria La Chorrera Geish...,"Elegantly sweet, spice-toned. Bergamot, pear b...","A deep, spice-toned, richly aromatic washed Co...",Taipei,"25.0375,121.5625","Taipei, Taiwan",Q1867,Taiwan,...,227.0,grams,710.00,TWD,23.14,23.14,0.50,46.28,,
4,93,Temple Coffee and Tea,Honduras Delmy Regalado Octopeque,"Deeply sweet, spice-toned. Baking spices (clov...",A spice-toned Honduras with a through like of ...,Sacramento,"38.575277777778,-121.48611111111","Sacramento, California",Q18013,Sacramento County,...,12.0,ounces,22.00,USD,22.00,22.00,0.75,29.33,Sacramento County,California
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4225,92,Paradise Roasters,Hawaii Laka,Intense aromatically; notes suggesting baking ...,,Ramsey,"45.232055555555554,-93.4605","Ramsey, Minnesota",Q1992875,Anoka County,...,12.0,ounces,25.00,USD,25.00,35.43,0.75,47.24,Anoka County,Minnesota
4226,88,Barnie's Coffee & Tea,Jamaica Blue Mountain,A difficult coffee to evaluate. On the upside ...,,Orlando,"28.533611111111,-81.386666666667","Orlando, Florida",Q49233,Orange County,...,16.0,ounces,45.99,USD,45.99,65.17,1.00,65.17,Orange County,Florida
4227,88,Javaloha,100% Hawaiian Coffee Hamakua Estate,"Tart, grapefruity pungency, hints of sweet coc...",,Paauilo,"20.043888888889,-155.37027777778","Pa'auilo, Hawaii",Q2014052,Hawaii County,...,16.0,ounces,30.00,USD,30.00,42.51,1.00,42.51,Hawaii County,Hawaii
4228,88,Clive Coffee,Haiti Ranquitte EcoCafe,"Sweetly rounded aroma with hints of flowers, c...",,Portland,"45.516666666667,-122.66666666667","Portland, Oregon",Q6106,Multnomah County,...,12.0,ounces,19.95,USD,19.95,28.27,0.75,37.69,Multnomah County,Oregon
