<a id='table_of_contents'></a>

0. [Import libraries](#imports)
1. [Import data](#import_data)
2. [Initial Cleaning](#initial_cleaning)
3. [Price and Quantity Cleaning](#price_and_quantity_cleaning)
4. [Coffee Origin Cleaning](#coffee_origin_cleaning)
5. [Data Checks](#data_checks)
6. [Export cleaned data](#export_data)

# 0. Import libraries <a id='imports'></a>
[Back to top](#table_of_contents)

In [5]:
from pathlib import Path
import json
from datetime import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

from unidecode import unidecode
import pycountry

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
mpl.rcParams["figure.dpi"] = 300



# 1. Import raw data <a id='import_data'></a>
[Back to top](#table_of_contents)

In [6]:
# Read in raw data
BASE_DIR = Path().resolve().parent
DATA_DIR = BASE_DIR / "data"
FILE_IN = "25072024_reviews.csv"

df_in = pd.read_csv(DATA_DIR / "raw" / FILE_IN)
df_in.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7890 entries, 0 to 7889
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   rating                             7890 non-null   object 
 1   roaster                            7890 non-null   object 
 2   title                              7890 non-null   object 
 3   blind_assessment                   7889 non-null   object 
 4   notes                              7888 non-null   object 
 5   bottom_line                        3812 non-null   object 
 6   roaster location                   7887 non-null   object 
 7   coffee origin                      7386 non-null   object 
 8   roast level                        7488 non-null   object 
 9   agtron                             7890 non-null   object 
 10  est. price                         5852 non-null   object 
 11  review date                        7890 non-null   objec

# 2. Initial Cleaning <a id='initial_cleaning'></a>
[Back to top](#table_of_contents)

The first step is basic data checking and cleaning: dropping columns that are not needed, setting data types, renaming and combining columns, cleaning up strings, and creating new columns.

In [7]:
# Cleanup column names
df_in.columns = (
    df_in.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_")
    # Match "." literally rather than as a regex wildcard
    .str.replace(".", "", regex=False)
)

df_in.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7890 entries, 0 to 7889
Data columns (total 21 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   rating                            7890 non-null   object 
 1   roaster                           7890 non-null   object 
 2   title                             7890 non-null   object 
 3   blind_assessment                  7889 non-null   object 
 4   notes                             7888 non-null   object 
 5   bottom_line                       3812 non-null   object 
 6   roaster_location                  7887 non-null   object 
 7   coffee_origin                     7386 non-null   object 
 8   roast_level                       7488 non-null   object 
 9   agtron                            7890 non-null   object 
 10  est_price                         5852 non-null   object 
 11  review_date                       7890 non-null   object 
 12  aroma 

In [8]:
def tweak_df(df: pd.DataFrame) -> pd.DataFrame:
    """Initial data cleaning and feature creation"""
    return (
        df.assign(
            review_date=lambda df_: pd.to_datetime(df_["review_date"], format="%B %Y"),
            # Combine acidity and acidity/structure into one column; they are the same
            # field, but the name used in reviews changed at one point
            acidity=lambda df_: df_["acidity"].fillna(df_["acidity/structure"]),
            # Split the agtron column into one for external bean agtron data and ground
            # bean agtron data
            agtron_external=lambda df_: pd.to_numeric(
                df_["agtron"].str.split("/", expand=True)[0].str.strip(),
                errors="coerce",
            ),
            agtron_ground=lambda df_: pd.to_numeric(
                df_["agtron"].str.split("/", expand=True)[1].str.strip(),
                errors="coerce",
            ),
            # Distinguish espresso roasts from other reviews
            is_espresso=lambda df_: df_.apply(
                lambda row: (
                    True
                    if "espresso" in row["title"].lower()
                    or pd.notnull(row["with_milk"])
                    else False
                ),
                axis=1,
            ),
        )
        .replace(["", "NR", "N/A", "na"], np.nan)
        # Agtron values must be equal to or below 100; some entries on the website have
        # typos. Rows with a missing Agtron value are dropped too (NaN <= 100 is False).
        .loc[
            lambda df_: (df_["agtron_external"] <= 100) & (df_["agtron_ground"] <= 100),
            :,
        ]
        # Run str.strip on every string column
        .map(lambda x: x.strip() if isinstance(x, str) else x)
        .drop(
            columns=["acidity/structure", "agtron", "refresh(enable_javascript_first)"]
        )
        .astype(
            {
                k: "float"
                for k in [
                    "agtron_external",
                    "agtron_ground",
                    "acidity",
                    "rating",
                    "aroma",
                    "body",
                    "flavor",
                    "aftertaste",
                ]
            }
        )
    )


df = tweak_df(df_in)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7566 entries, 0 to 7889
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   rating            7563 non-null   float64       
 1   roaster           7566 non-null   object        
 2   title             7566 non-null   object        
 3   blind_assessment  7566 non-null   object        
 4   notes             7565 non-null   object        
 5   bottom_line       3758 non-null   object        
 6   roaster_location  7564 non-null   object        
 7   coffee_origin     7299 non-null   object        
 8   roast_level       7445 non-null   object        
 9   est_price         5790 non-null   object        
 10  review_date       7566 non-null   datetime64[ns]
 11  aroma             7540 non-null   float64       
 12  acidity           6322 non-null   float64       
 13  body              7556 non-null   float64       
 14  flavor            7553 non-nu

# 3. Price and Quantity Cleaning <a id='price_and_quantity_cleaning'></a>
[Back to top](#table_of_contents)

The `est_price` column contains information on price, currency, and quantity. We need to split this column up to separate the price and quantity information.

Splitting on the "/" character creates one column with price and currency information and another with quantity and unit-of-measurement information.

The quantities have to be standardized so that each unit has a single representation and unnecessary punctuation and parentheses are removed.

We also filter the dataset to remove all products sold in cans, boxes, capsules, pods, etc. We only concern ourselves with coffee sold in bags or bulk, ground or whole.
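
As a rough illustration of the split described above, the sketch below parses a single `est_price` string using only the standard library. The helper name `split_est_price` and the unit handling are assumptions made for illustration; the function is not part of this notebook's pipeline and covers only the simplest cases.

```python
import re
from typing import Optional


def split_est_price(est_price: str) -> Optional[tuple[str, str]]:
    """Split 'price/quantity' on the first '/' and tidy the quantity part."""
    if "/" not in est_price:
        return None
    price, quantity = est_price.split("/", 1)
    quantity = quantity.lower().strip()
    # Drop parenthetical notes such as "(454 grams)"
    quantity = re.sub(r"\(.*?\)", "", quantity)
    # Drop anything after a semicolon (usually a note or deal price)
    quantity = re.sub(r";.*", "", quantity).strip()
    # Normalize a trailing "oz" or "oz." to "ounces"
    quantity = re.sub(r"\boz\.?$", "ounces", quantity)
    return price.strip(), quantity.strip()


examples = [
    "$17.50/12 ounces",
    "CAD $13.00/16 ounces (454 grams)",
    "CAD $14.00/12 oz.",
]
for text in examples:
    print(text, "->", split_est_price(text))
```

The price part still mixes currency and amount at this point; separating those is a second step, as in the pandas version in the next cell.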


### Cleaning quantities

In [9]:
drop_terms: list[str] = [
    "can",
    "box",
    "capsules",
    "K-",
    "cups",
    "bags",
    "concentrate",
    "discs",
    "bottle",
    "pods",
    "ml",
    "pods",
    "pouch",
    "packet|tin",
    "instant",
    "sachet",
    "vue",
    "single-serve",
    "fluid",
    "capsultes",
]

drop_terms_string: str = "|".join(drop_terms)


def price_quantity_split(df: pd.DataFrame) -> pd.DataFrame:
    print(f"Original df shape: {df.shape}")
    price_quantity = (
        df
        # Split est_price into columns for price and quantity
        .est_price.str.split("/", n=1, expand=True)
        # Remove any commas from the price and quantity columns
        .replace(",", "", regex=True)
        .rename(columns={0: "price", 1: "quantity"})
        .assign(
            # Cleanup quantity
            quantity=lambda df_: (
                df_["quantity"]
                .str.lower()
                .str.strip()
                # Remove parentheses and anything inside them
                .str.replace(r"\(.*?\)", "", regex=True)
                # Remove anything after a semicolon. This is usually a note, or deal price.
                .str.replace(r";.*", "", regex=True)
                # Standardize units
                .str.replace(r".g$", " grams", regex=True)
                .str.replace(r"\sg$", "grams", regex=True)
                .str.replace(r"\bgram$", "grams", regex=True)
                .str.replace(r"pound$", "1 pounds", regex=True)
                .str.replace(r"oz|onces|ouncues|ounce$|ounces\*", "ounces", regex=True)
                .str.replace("kilogram", "kilograms")
                .str.replace("kg", "kilograms")
                # Remove "online" from any quantity
                .str.replace("online", "")
                .str.strip()
            ),
            # Collapse accidental double periods (literal match, not a regex)
            price=lambda df_: df_["price"].str.replace("..", ".", regex=False),
        )
        .dropna()
        # Remove rows where coffee is sold in a can, box, pouch, packet, or tin
        .loc[
            lambda df_: ~df_["quantity"].str.contains(drop_terms_string, case=False),
            :,
        ]
        # Split quantity into value and unit, and split price into value and currency
        .assign(
            # Extract number value from quantity
            quantity_value=lambda df_: (
                df_["quantity"].str.extract(r"(\d+)").astype(float)
            ),
            # Extract the unit from quantity column
            quantity_unit=lambda df_: (
                df_["quantity"]
                .str.replace(r"(\d+)", "", regex=True)
                .replace(r"\.", "", regex=True)
                .str.strip()
                .mask(lambda s: s == "g", "grams")
                .mask(lambda s: s == "kilo", "kilograms")
                .str.strip()
            ),
            # Extract price value from price column
            price_value=lambda df_: (
                df_["price"].str.extract(r"(\d+\.\d+|\d+)").astype(float)
            ),
            # Extract currency from price column
            price_currency=lambda df_: (
                df_["price"]
                .str.replace(",", "")
                .str.replace(r"(\d+\.\d+|\d+)", "", regex=True)
                .str.strip()
            ),
        )
        # Drop the original price and quantity columns
        .drop(columns=["price", "quantity"])
        # remove rows where quantity_unit contains (
        .loc[lambda df_: ~df_["quantity_unit"].str.contains(r"\(", regex=True), :]
    )
    print(f"Shape of price_quantity: {price_quantity.shape}")

    # Merge the price_quantity DataFrame with the original DataFrame
    return df.merge(price_quantity, how="left", left_index=True, right_index=True)


df = df_in.pipe(tweak_df).pipe(price_quantity_split)

df.quantity_unit.value_counts()

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


quantity_unit
ounces       4757
grams         824
pounds         17
kilograms       9
Name: count, dtype: int64

In [10]:
def convert_to_lbs(df: pd.DataFrame) -> pd.DataFrame:
    to_lbs_conversion: dict[str, float] = {
        "ounces": 1 / 16,
        "pounds": 1,
        "kilograms": 2.20462,
        "grams": 0.00220462,
    }
    df["quantity_in_lbs"] = np.round(
        df["quantity_value"] * df["quantity_unit"].map(to_lbs_conversion), 2
    )
    return df


df = df_in.pipe(tweak_df).pipe(price_quantity_split).pipe(convert_to_lbs)

df.groupby("quantity_unit")[
    ["est_price", "quantity_value", "quantity_unit", "quantity_in_lbs"]
].sample(1)

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


Unnamed: 0,est_price,quantity_value,quantity_unit,quantity_in_lbs
939,NT $880/200 grams,200.0,grams,0.44
2062,HK $150/1 kilogram,1.0,kilograms,2.2
5812,$17.50/12 ounces,12.0,ounces,0.75
1254,$15.99/pound,1.0,pounds,1.0


### Cleaning Prices and Currencies
Normalize the currency column to a standardized set of currency codes. We use ISO 4217 codes, which makes it easier to get foreign exchange data from an external API later on.
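
To make the idea concrete, here is a minimal standard-library sketch of mapping raw currency markers to ISO 4217 codes. The lookup table `RAW_TO_ISO` and the helper `to_iso_4217` are hypothetical names with a deliberately short list of examples; the pandas implementation in the next cell is what the notebook actually runs.

```python
# Hypothetical lookup table: example markers only, not an exhaustive list.
RAW_TO_ISO = {
    "$": "USD",
    "US $": "USD",
    "NT $": "TWD",
    "HK $": "HKD",
    "RMB $": "CNY",
    "£": "GBP",
    "€": "EUR",
    "¥": "JPY",
}


def to_iso_4217(raw: str) -> str:
    """Map a raw currency marker to a three-letter code."""
    # Upper-case and collapse runs of whitespace before the lookup
    key = " ".join(raw.upper().split())
    if key in RAW_TO_ISO:
        return RAW_TO_ISO[key]
    # Markers like "CAD $" already carry a three-letter code
    code = key.replace("$", "").strip()
    if len(code) == 3 and code.isalpha():
        return code
    raise ValueError(f"Unrecognized currency marker: {raw!r}")


for marker in ["$", "nt $", "CAD $", "RMB $"]:
    print(repr(marker), "->", to_iso_4217(marker))
```

Note that this sketch accepts any three-letter string as a code without checking it against the ISO 4217 list, so a final look at the distinct values is still worthwhile.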


In [11]:
def clean_currency(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize currencies to ISO 4217 codes."""
    price_currency = (
        df.price_currency.str.upper()
        .str.replace(r"^\$$", "USD", regex=True)
        # "$" is matched literally here, not as a regex end-of-string anchor
        .str.replace("PRICE: $", "USD", regex=False)
        .str.replace("$", "", regex=False)
        .str.replace("#", "GBP")
        .str.replace("¥", "JPY")
        .str.replace("£", "GBP")
        .str.replace("€", "EUR")
        .str.replace("POUND", "GBP")
        .str.replace("PESOS", "MXN")
        .str.replace("RMB", "CNY")
        .str.replace("EUROS", "EUR")
        .str.replace("RM", "MYR")
        .str.strip()
        .mask(lambda s: s == "US", "USD")
        .mask(lambda s: s == " ", "USD")
        .mask(lambda s: s == "E", "EUR")
        .mask(lambda s: s == "NTD", "TWD")
        .mask(lambda s: s == "NT", "TWD")
        .mask(lambda s: s == "", "USD")
        .mask(lambda s: s == "HK", "HKD")
        .str.strip()
    )
    return df.assign(price_currency=price_currency)


df = df_in.pipe(tweak_df).pipe(price_quantity_split).pipe(clean_currency)


# Check that currencies make sense from original est_price column
df.loc[:, ["est_price", "price_currency", "price_value"]].groupby(
    "price_currency"
).sample(3, replace=True)

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


Unnamed: 0,est_price,price_currency,price_value
415,AED $95.00/250 grams,AED,95.0
5999,AED $99.75/250 grams,AED,99.75
1462,AED $103.95/250 grams,AED,103.95
320,AUD $22.00/250 grams,AUD,22.0
3820,AUD $16.00/250 grams,AUD,16.0
1030,AUD $16.00/250 grams,AUD,16.0
5686,CAD $13.00/16 ounces (454 grams),CAD,13.0
3856,CAD $14.00/12 oz.,CAD,14.0
4241,CAD $19.95/12 ounces,CAD,19.95
1321,RMB $48.00/227 grams,CNY,48.0


In [12]:
df.price_currency.value_counts()

price_currency
USD    4249
TWD    1067
CAD     125
HKD      46
CNY      27
THB      21
KRW      20
JPY      12
GBP      10
AUD      10
EUR       6
AED       5
MYR       3
IDR       3
GTQ       1
MXN       1
LAK       1
Name: count, dtype: int64

#### Converting prices to 2024 USD

1. Convert price to USD using historical exchange rates
2. Adjust price to 2024 USD using BLS consumer price index
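
The two steps can be sketched as plain arithmetic. The CPI figures below are the ones shown in this notebook's output for November 2017 and June 2024; the exchange rate `CAD_PER_USD` is an assumed value chosen for illustration, not one read from the exchange rate file, and the function names are hypothetical.

```python
def to_usd(price: float, units_per_usd: float) -> float:
    """Step 1: convert a foreign price to USD (rate = foreign units per 1 USD)."""
    return round(price / units_per_usd, 2)


def adjust_for_inflation(price_usd: float, cpi_then: float, cpi_baseline: float) -> float:
    """Step 2: scale a historical USD price by the ratio of baseline CPI to past CPI."""
    return round(price_usd * cpi_baseline / cpi_then, 2)


CPI_NOV_2017 = 246.669   # CPI for the review month
CPI_JUNE_2024 = 314.175  # baseline CPI
CAD_PER_USD = 1.2869     # assumed exchange rate, for illustration only

price_usd = to_usd(17.00, CAD_PER_USD)
price_usd_adj = adjust_for_inflation(price_usd, CPI_NOV_2017, CPI_JUNE_2024)
print(price_usd, price_usd_adj)
```

With these inputs, a CAD 17.00 coffee reviewed in November 2017 comes out at 13.21 USD, or 16.83 in June 2024 dollars.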

In [14]:
with open(DATA_DIR / "external/openex_exchange_rates.json") as f:
    currency_codes: dict[str, dict[str, float]] = dict(json.load(f))


def convert_row(row: pd.Series) -> float | np.float64:
    try:
        date: str = row.review_date.strftime("%Y-%m-%d")
        currency: str = row.price_currency
        value: float | np.float64 = np.round(row.price_value / currency_codes[date][currency], 2)
    except KeyError:
        value = np.nan
    return value


def convert_currency(df: pd.DataFrame) -> pd.DataFrame:
    df["price_usd"] = df.apply(convert_row, axis=1)
    return df


df = (
    df_in.pipe(tweak_df)
    .pipe(price_quantity_split)
    .pipe(clean_currency)
    .pipe(convert_currency)
)


df.groupby("price_currency")[
    [
        "price_usd",
        "price_value",
        "price_currency",
    ]
].sample(1)

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


Unnamed: 0,price_usd,price_value,price_currency
1462,28.3,103.95,AED
6014,16.68,23.0,AUD
769,15.64,15.99,CAD
7624,8.85,60.0,CNY
3079,33.23,29.95,EUR
2768,29.91,23.0,GBP
1575,12.53,100.0,GTQ
7671,12.63,98.0,HKD
1717,8.52,120000.0,IDR
6004,11.71,1280.0,JPY


In [15]:
def load_cpi_dataframe(file_path: Path) -> pd.DataFrame:
    """Loads and transforms the CPI data."""
    try:
        cpi: pd.DataFrame = pd.read_csv(file_path)
    except FileNotFoundError:
        raise FileNotFoundError("CPI file is not found in the specified directory.")

    cpi.columns = cpi.columns.str.strip().str.lower().str.replace(" ", "_")
    return (
        cpi.drop(columns=["half1", "half2"])
        .melt(id_vars="year", var_name="month", value_name="cpi")
        .assign(
            month=lambda df_: df_["month"].apply(
                lambda x: datetime.strptime(x, "%b").month
            ),
            date=lambda df_: pd.to_datetime(df_[["year", "month"]].assign(day=1)),
        )
        .drop(columns=["year", "month"])
    )


def adjust_row(row: pd.Series, cpi_baseline: float) -> float:
    # CPI is NaN for the current month, return the original price_usd
    if pd.isnull(row["cpi"]):
        return row["price_usd"]
    else:
        return np.round(row["price_usd"] * (cpi_baseline / row["cpi"]), 2)


def create_cpi_adjusted_price(
    df: pd.DataFrame, file_path: Path, date: str = "2024-06-01"
) -> pd.DataFrame:
    """
    Adjusts historical price data to 2024 prices using CPI data.
    """
    cpi: pd.DataFrame = load_cpi_dataframe(file_path)
    cpi_baseline: float = cpi.loc[cpi["date"] == date, "cpi"].values[0]

    return df.merge(cpi, how="left", left_on="review_date", right_on="date").assign(
        price_usd_adj=lambda df_: df_.apply(
            adjust_row, cpi_baseline=cpi_baseline, axis=1
        )
    )

cpi_path: Path = DATA_DIR / "external" / "consumer_price_index.csv"

df = (
    df_in.pipe(tweak_df)
    .pipe(price_quantity_split)
    .pipe(convert_to_lbs)
    .pipe(clean_currency)
    .pipe(convert_currency)
    .pipe(create_cpi_adjusted_price, file_path=cpi_path)
)

df.groupby("price_currency")[
    [
        "price_value",
        "price_currency",
        "price_usd",
        "review_date",
        "price_usd_adj",
    ]
].sample(1)

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


Unnamed: 0,price_value,price_currency,price_usd,review_date,price_usd_adj
395,95.0,AED,25.86,2017-09-01,32.92
4657,24.0,AUD,17.4,2016-06-01,22.68
4,17.0,CAD,13.21,2017-11-01,16.83
1099,234.0,CNY,37.85,2015-04-01,50.26
4429,11.9,EUR,15.72,2013-01-01,21.45
2455,14.5,GBP,17.66,2023-11-01,18.07
1513,100.0,GTQ,12.53,2013-09-01,16.81
3520,125.0,HKD,15.93,2019-05-01,19.54
6107,120000.0,IDR,8.89,2017-10-01,11.32
6973,78.0,JPY,0.68,2021-11-01,0.77


In [1]:
# Plot the price difference between the adjusted and historical prices over time

(
    df.assign(
        price_diff=lambda df_: (df_["price_usd_adj"] - df_["price_usd"])
        / df_["price_usd_adj"]
        * 100
    ).sort_values("review_date")
).plot(
    x="review_date",
    y="price_diff",
    title="% Price difference between adjusted and historical prices",
)


### Create a column for price/lb using adjusted price 

In [17]:
# Create a new column for price per pound
def price_per_lbs(df: pd.DataFrame) -> pd.DataFrame:
    df["price_usd_adj_per_lb"] = np.round(
        df["price_usd_adj"] / df["quantity_in_lbs"], 2
    )
    return df


df = (
    df_in.pipe(tweak_df)
    .pipe(price_quantity_split)
    .pipe(clean_currency)
    .pipe(convert_currency) # Convert to USD with historical exchange rates
    .pipe(create_cpi_adjusted_price, file_path=cpi_path) # Adjust for inflation
    .pipe(convert_to_lbs) # Convert quantities to pounds
    .pipe(price_per_lbs) # Calculate adjusted USD price per pound
)

df.head()

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


Unnamed: 0,rating,roaster,title,blind_assessment,notes,bottom_line,roaster_location,coffee_origin,roast_level,est_price,review_date,aroma,acidity,body,flavor,aftertaste,url,with_milk,agtron_external,agtron_ground,is_espresso,quantity_value,quantity_unit,price_value,price_currency,price_usd,cpi,date,price_usd_adj,quantity_in_lbs,price_usd_adj_per_lb
0,92.0,Red Rooster Coffee Roaster,Ethiopia Sidama Shoye,"Rich, intricate and layered. Lemon zest, roasted cacao nib, violet, mulberry, frankincense in ar...",Produced by family-owned farms that are part of the Shoye Cooperative. This lot was processed by...,"An elegant washed Sidamo cup, both deeply sweet and delicately tart.","Floyd, Virginia","Sidamo (also Sidama) growing region, southern Ethiopia",Medium-Light,$14.49/12 ounces,2016-11-01,9.0,8.0,8.0,9.0,8.0,https://www.coffeereview.com/review/ethiopia-sidama-shoye/,,56.0,80.0,False,12.0,ounces,14.49,USD,14.49,241.353,2016-11-01,18.86,0.75,25.15
1,92.0,El Gran Cafe,Finca Santa Elisa Geisha,"Gently sweet-tart, crisply herbaceous. Baking chocolate, green grape, lemon verbena, fresh-cut o...","Produced by Finca Santa Elisa entirely of the Geisha variety of Arabica, and processed by the tr...","A confident, pretty washed-process Guatemala Geisha enlivened by crisp chocolate and sweet herb ...","Antigua, Guatemala","Acatenango growing region, Guatemala",Medium-Light,$30.00/12 ounces,2023-08-01,8.0,9.0,8.0,9.0,8.0,https://www.coffeereview.com/review/finca-santa-elisa-geisha/,,62.0,78.0,False,12.0,ounces,30.0,USD,30.0,307.026,2023-08-01,30.7,0.75,40.93
2,93.0,Tipico Coffee,Costa Rica Sin Limites Gesha,"Floral-toned, tropical-leaning. Magnolia, guava, green banana, amber, cocoa nib in aroma and cup...","Produced by Jamie Cardenas of Finca Sin Limites, entirely of the Gesha variety of Arabica, and p...","A delicate honey-processed Costa Rica Gesha that evokes the tropics with its fruity profile, and...","Buffalo, New York","West Valley, Costa Rica",Light,$33.00/12 ounces,2024-06-01,9.0,9.0,8.0,9.0,8.0,https://www.coffeereview.com/review/costa-rica-sin-limites-gesha/,,60.0,82.0,False,12.0,ounces,33.0,USD,33.0,314.175,2024-06-01,33.0,0.75,44.0
3,91.0,Roast House,Ride the Edge,"Crisply floral, delicately lively. Complex flowers – lavender, lilac – roasted cacoa nib, tanger...","The coffees in this blend are certified organically grown and Fair Trade, meaning they were purc...",,"Spokane, Washington","Southern Ethiopia; Mexico; northern Sumatra, Indonesia.",Medium,$14.00/16 ounces,2015-07-01,8.0,8.0,8.0,9.0,8.0,https://www.coffeereview.com/review/ride-the-edge/,,53.0,67.0,False,16.0,ounces,14.0,USD,14.0,238.654,2015-07-01,18.43,1.0,18.43
4,92.0,Level Ground Trading,Direct Fair Trade Espresso,"Evaluated as espresso. Intrigungly complex, balanced. Dark chocolate, molasses, cedar, narcissus...",Coffees in this blend are all fully wet-processed or “washed” and certified organically grown in...,A solid espresso blend equally pleasing as a straight shot and in milk.,"Victoria, British Columbia, Canada",Africa; South America,Medium-Light,CAD $17.00/16 ounces,2017-11-01,9.0,,8.0,8.0,8.0,https://www.coffeereview.com/review/direct-fair-trade-espresso-2/,9.0,51.0,73.0,True,16.0,ounces,17.0,CAD,13.21,246.669,2017-11-01,16.83,1.0,16.83


# 4. Coffee Origin Cleaning <a id='coffee_origin_cleaning'></a>
[Back to top](#table_of_contents)

In [18]:
def tweak_countries(countries: set) -> set:
    remove = [
        "american samoa",
        "united states minor outlying islands",
        "south sudan",
        "south georgia and the south sandwich islands",
        "british indian ocean territory",
        "congo, the democratic republic of the",
        "taiwan, province of china",
        "guinea",
    ]

    for r in remove:
        # discard() ignores missing names; set.remove() would raise KeyError
        countries.discard(r)
    for c in list(countries):
        c_new = c.split(",")[0]
        countries.remove(c)
        countries.add(c_new)

    countries.add("taiwan")
    return countries


countries: set = set(unidecode(c.name.lower()) for c in pycountry.countries)
countries = tweak_countries(countries)

In [19]:
def retrieve_countries(row: str | float, countries: set) -> str | None:
    countries_found = set()
    if pd.isna(row) or row == "":
        return None
    for c in countries:
        if c in row:
            countries_found.add(c)
    if len(countries_found) == 0:
        return row
    countries_found = ";".join(countries_found)
    return countries_found


def clean_origin(df: pd.DataFrame) -> pd.DataFrame:
    df = df.assign(coffee_origin=df.coffee_origin.str.lower())
    df["origin_country"] = df.coffee_origin.apply(
        retrieve_countries, countries=countries
    )
    return df


df = (
    df_in.pipe(tweak_df)
    .pipe(price_quantity_split)
    .pipe(clean_currency)
    .pipe(convert_currency)
    .pipe(create_cpi_adjusted_price, file_path=cpi_path)
    .pipe(convert_to_lbs)
    .pipe(price_per_lbs)
    .pipe(clean_origin)
)

Original df shape: (7566, 21)
Shape of price_quantity: (5607, 4)


In [22]:
df

Unnamed: 0,rating,roaster,title,blind_assessment,notes,bottom_line,roaster_location,coffee_origin,roast_level,est_price,review_date,aroma,acidity,body,flavor,aftertaste,url,with_milk,agtron_external,agtron_ground,is_espresso,quantity_value,quantity_unit,price_value,price_currency,price_usd,cpi,date,price_usd_adj,quantity_in_lbs,price_usd_adj_per_lb,origin_country
0,92.0,Red Rooster Coffee Roaster,Ethiopia Sidama Shoye,"Rich, intricate and layered. Lemon zest, roasted cacao nib, violet, mulberry, frankincense in ar...",Produced by family-owned farms that are part of the Shoye Cooperative. This lot was processed by...,"An elegant washed Sidamo cup, both deeply sweet and delicately tart.","Floyd, Virginia","sidamo (also sidama) growing region, southern ethiopia",Medium-Light,$14.49/12 ounces,2016-11-01,9.0,8.0,8.0,9.0,8.0,https://www.coffeereview.com/review/ethiopia-sidama-shoye/,,56.0,80.0,False,12.0,ounces,14.49,USD,14.49,241.353,2016-11-01,18.86,0.75,25.15,ethiopia
1,92.0,El Gran Cafe,Finca Santa Elisa Geisha,"Gently sweet-tart, crisply herbaceous. Baking chocolate, green grape, lemon verbena, fresh-cut o...","Produced by Finca Santa Elisa entirely of the Geisha variety of Arabica, and processed by the tr...","A confident, pretty washed-process Guatemala Geisha enlivened by crisp chocolate and sweet herb ...","Antigua, Guatemala","acatenango growing region, guatemala",Medium-Light,$30.00/12 ounces,2023-08-01,8.0,9.0,8.0,9.0,8.0,https://www.coffeereview.com/review/finca-santa-elisa-geisha/,,62.0,78.0,False,12.0,ounces,30.00,USD,30.00,307.026,2023-08-01,30.70,0.75,40.93,guatemala
2,93.0,Tipico Coffee,Costa Rica Sin Limites Gesha,"Floral-toned, tropical-leaning. Magnolia, guava, green banana, amber, cocoa nib in aroma and cup...","Produced by Jamie Cardenas of Finca Sin Limites, entirely of the Gesha variety of Arabica, and p...","A delicate honey-processed Costa Rica Gesha that evokes the tropics with its fruity profile, and...","Buffalo, New York","west valley, costa rica",Light,$33.00/12 ounces,2024-06-01,9.0,9.0,8.0,9.0,8.0,https://www.coffeereview.com/review/costa-rica-sin-limites-gesha/,,60.0,82.0,False,12.0,ounces,33.00,USD,33.00,314.175,2024-06-01,33.00,0.75,44.00,costa rica
3,91.0,Roast House,Ride the Edge,"Crisply floral, delicately lively. Complex flowers – lavender, lilac – roasted cacoa nib, tanger...","The coffees in this blend are certified organically grown and Fair Trade, meaning they were purc...",,"Spokane, Washington","southern ethiopia; mexico; northern sumatra, indonesia.",Medium,$14.00/16 ounces,2015-07-01,8.0,8.0,8.0,9.0,8.0,https://www.coffeereview.com/review/ride-the-edge/,,53.0,67.0,False,16.0,ounces,14.00,USD,14.00,238.654,2015-07-01,18.43,1.00,18.43,indonesia;mexico;ethiopia
4,92.0,Level Ground Trading,Direct Fair Trade Espresso,"Evaluated as espresso. Intrigungly complex, balanced. Dark chocolate, molasses, cedar, narcissus...",Coffees in this blend are all fully wet-processed or “washed” and certified organically grown in...,A solid espresso blend equally pleasing as a straight shot and in milk.,"Victoria, British Columbia, Canada",africa; south america,Medium-Light,CAD $17.00/16 ounces,2017-11-01,9.0,,8.0,8.0,8.0,https://www.coffeereview.com/review/direct-fair-trade-espresso-2/,9,51.0,73.0,True,16.0,ounces,17.00,CAD,13.21,246.669,2017-11-01,16.83,1.00,16.83,africa; south america
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7561,94.0,JBC Coffee Roasters,Tano Batak Sumatra,"Rich-toned, deeply and sweetly earthy. Chocolate fudge, white sage, blackberry, perique pipe tob...","This coffee was grown by indigenous Batak people, who have been involved in coffee production si...","A multi-layered Sumatra cup with berry and tropical flowers as primary notes, balanced with fine...","Madison, Wisconsin","lintong growing region, north sumatra province, indonesia",Medium-Light,$18.50/12 ounces,2018-07-01,9.0,9.0,9.0,9.0,8.0,https://www.coffeereview.com/review/tano-batak-sumatra-5/,,58.0,75.0,False,12.0,ounces,18.50,USD,18.50,252.006,2018-07-01,23.06,0.75,30.75,indonesia
7562,96.0,Big Shoulders Coffee,Panama Hacienda La Esmeralda Gesha,"Complex, floral- and citrus-toned. Lilac, cocoa nib, tangerine zest, apricot, sandalwood in arom...",Coffee from trees of the botanical variety Geisha (also Gesha) grown on Price Peterson’s Haciend...,A classic washed Geisha from the celebrated Hacienda Esmeralda farm in Panama. A relative bargai...,"Chicago, Illinois","boquete growing region, western panama",Medium-Light,$56.00/8 ounces,2020-12-01,9.0,9.0,9.0,10.0,9.0,https://www.coffeereview.com/review/panama-hacienda-la-esmeralda-gesha-2/,,58.0,78.0,False,8.0,ounces,56.00,USD,56.00,260.474,2020-12-01,67.55,0.50,135.10,panama
7563,92.0,Lexington Coffee Roasters,Papua New Guinea Kimel,"Balanced, engaging depth, quiet complexity. Raisiny dark chocolate, ripe tangerine-like citrus, ...",Kimel Plantation is owned and operated by the indigenous Opais people. The high mountain valleys...,,"Lexington, Virginia","wahgi valley, western highlands, papua new guinea.",Medium-Light,$13.95/12 ounces,2014-11-01,8.0,8.0,9.0,9.0,8.0,https://www.coffeereview.com/review/papua-new-guinea-kimel-2/,,53.0,73.0,False,12.0,ounces,13.95,USD,13.95,236.151,2014-11-01,18.56,0.75,24.75,papua new guinea
7564,87.0,Green Mountain Coffee,Newman’s Own Organics Special Decaf (K-Cup),"(As brewed in a Keurig B60 single-serve brewing device using a ""K-Cup"" capsule at a cup volume o...",Sales of this coffee help support a variety of educational and charitable organizations through ...,,"Waterbury, Vermont",blend,Very Dark,,2006-03-01,8.0,7.0,7.0,8.0,,https://www.coffeereview.com/review/newmans-own-organics-special-decaf-k-cup-2/,,0.0,48.0,False,,,,,,199.800,2006-03-01,,,,blend


# 5. Data Checks <a id='data_checks'></a>
[Back to top](#table_of_contents)

In [23]:
df.dtypes

rating                         float64
roaster                         object
title                           object
blind_assessment                object
notes                           object
bottom_line                     object
roaster_location                object
coffee_origin                   object
roast_level                     object
est_price                       object
review_date             datetime64[ns]
aroma                          float64
acidity                        float64
body                           float64
flavor                         float64
aftertaste                     float64
url                             object
with_milk                       object
agtron_external                float64
agtron_ground                  float64
is_espresso                       bool
quantity_value                 float64
quantity_unit                   object
price_value                    float64
price_currency                  object
price_usd                

# 6. Export processed dataframe <a id='export_data'></a>
[Back to top](#table_of_contents)

In [24]:
FILE_OUT: str = FILE_IN.replace(".csv", "_intermediate.csv")
df.to_csv(DATA_DIR / "intermediate" / FILE_OUT, index=False)