This notebook cleans several columns in the parsed coffee review dataset.

Columns to be processed
- Roast Level: Replace empty strings with NA; convert to an ordinal variable
- Agtron: Separate into Agtron_whole and Agtron_ground
- Roaster Location
- Coffee Origin

# Load dependencies

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Load data

In [25]:
data = pd.read_csv("../data/coffee_reviews_parsed.csv")
data

Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,Acidity,Acidity/Structure,Body,Flavor,Aftertaste,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line
0,https://www.coffeereview.com/review/100-arabic...,89\nCaffe Bomrad\n100% Arabica 100% Italiano\n...,89.0,Caffe Bomrad,100% Arabica 100% Italiano,"Torino, Italy",Not disclosed.,Medium,48/65,$54.00/1 Kilogram,...,,,8.0,8.0,7.0,8.0,Evaluated as espresso. Smoothly round aroma: t...,Roasted in Northern Italy and distributed in N...,A strong-charactered Northern Italian styled e...,
1,https://www.coffeereview.com/review/100-arabic...,"87\nLucaff?\n100% Arabica, Black Label (ESE po...",87.0,Lucaff?,"100% Arabica, Black Label (ESE pod)","Padenghe sul Garda, Italy",Not disclosed.,Dark,0/80,,...,,,8.0,7.0,7.0,8.0,Produced from an ESE pod on a FrancisFrancis! ...,ESE (Easy Serving Espresso) pods are wafer-lik...,An attractive pod espresso for big milk drinks.,
2,https://www.coffeereview.com/review/100-arabic...,87\nCaribeans\n100% Arabica Coffee from Puerto...,87.0,Caribeans,100% Arabica Coffee from Puerto Rico,"San Juan, Puerto Rico","Utuado, central Puerto Rico",Medium-Light,54/69,$17.00/8 ounces,...,7.0,,7.0,8.0,7.0,,Bittersweet but balanced; chocolaty. Dark choc...,Produced on a single farm in the central mount...,,Satisfying chocolate and nut notes nearly carr...
3,https://www.coffeereview.com/review/100-arabic...,88\nWaka Coffee\n100% Arabica Freeze-Dried Col...,88.0,Waka Coffee,100% Arabica Freeze-Dried Colombian (Instant C...,"Los Angeles, California",Colombia,,0/0,$10.99/8 single-serve packets,...,,7.0,8.0,8.0,8.0,,Evaluated at proportions of 5 grams of instant...,The green coffee for this product was produced...,,A appealing 100% Colombia coffee in instant fo...
4,https://www.coffeereview.com/review/100-arabic...,72\nYuban\n100% Arabica Instant Coffee\nRoaste...,72.0,Yuban,100% Arabica Instant Coffee,"Northfield, Illinois",Colombia. All coffee of the Arabica species.,,0/0,$8.27/8 ounces instant,...,4.0,,7.0,3.0,4.0,,In the aroma caramel and wet burned wood notes...,An instant coffee evaluated as mixed in propor...,"Not good, but not the worst of the instants on...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8382,https://www.coffeereview.com/review/zimbabwean...,88\nLeopard Forest Coffee\nZimbabwean Peaberry...,88.0,Leopard Forest Coffee,Zimbabwean Peaberry,"Travelers Rest, South Carolina","Eastern Highlands, Zimbabwe",Medium,48/67,,...,7.0,,8.0,8.0,7.0,,"Low-toned, deep aroma with a tight-knit comple...",Historically Zimbabwe has been the premier cof...,A quietly bright breakfast cup with some pleas...,
8383,https://www.coffeereview.com/review/zimbabwe/,83\nThe Sensuous Bean\nZimbabwe\nRoaster Locat...,83.0,The Sensuous Bean,Zimbabwe,"New York, New York","Chipinge growing region, eastern Zimbabwe",Medium,50/63,,...,6.0,,8.0,6.0,7.0,,A potentially fine coffee compromised by mild ...,Historically Zimbabwe has been the premier cof...,"Those who enjoy the earthy, musty tones in som...",
8384,https://www.coffeereview.com/review/zombie-des...,87\nCafe Kreyol\nZombie Desert 100% Organic Ha...,87.0,Cafe Kreyol,Zombie Desert 100% Organic Haitian Bleu,"Fairfax, Virginia.","Artibonite growing region, Haiti.",Medium-Dark,47/52,$14.99/12 ounces,...,7.0,,8.0,7.0,7.0,,"Smoky, pungent. Very lightly scorched cedar, b...","Like all Haitian coffee, this is produced enti...",Those who enjoy pungently bracing medium-dark-...,
8385,https://www.coffeereview.com/review/zoom-espre...,93\nZuco Coffee Roasters\nZoom Espresso\nRoast...,93.0,Zuco Coffee Roasters,Zoom Espresso,"Hong Kong, China",Honduras; Ethiopia; Brazil,Medium,44/60,HK $150/250 grams,...,,,8.0,9.0,8.0,9.0,"Evaluated as espresso. Rich, winey, floral. Ba...","Zoom is a signature espresso blend from Zuco, ...",,"A complex, lively espresso very lightly touche..."


# Roast level

In [26]:
data["Roast Level"].value_counts()

Roast Level
Medium-Light    3586
Medium          1622
Light           1266
Medium-Dark      782
Very Dark        387
Dark             238
Name: count, dtype: int64

In [27]:
# Define order for this variable
ordered_categories = ["Light", "Medium-Light", "Medium", "Medium-Dark", "Dark", "Very Dark"]
data["Roast Level"] = pd.Categorical(data["Roast Level"], categories=ordered_categories, ordered=True)
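
The two steps listed for Roast Level (empty string to NA, then ordinal) can be sketched on a toy Series. The sample values below are made up for illustration. `pd.Categorical` already maps any value outside `categories` to NaN, so the explicit masking step is mainly there to make the intent visible.

```python
import pandas as pd

# Toy stand-in for data["Roast Level"]; values are illustrative only.
roast = pd.Series(["Medium", "", "Light", None, "Very Dark"])

ordered_categories = ["Light", "Medium-Light", "Medium", "Medium-Dark", "Dark", "Very Dark"]

# Step 1: turn empty strings into explicit missing values.
roast = roast.mask(roast.eq(""))

# Step 2: ordered categorical, so sorting and comparisons respect roast order.
roast = pd.Series(pd.Categorical(roast, categories=ordered_categories, ordered=True))

# Integer codes (0 = Light ... 5 = Very Dark); missing values get -1.
roast_codes = roast.cat.codes
print(roast_codes.tolist())
```

The integer codes are handy for models that need a numeric input, while the categorical column itself keeps plots and group-bys in roast order.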

## Roast Level vs Rating

In [28]:
# sns.boxplot(x="Roast Level", y="Rating", data=data)
# plt.title("Coffee with lighter roast level tends to get a higher rating.")
# plt.xlabel("Coffee Roast Level")
# plt.ylabel("Review Rating of Coffee")
# plt.show()

# Agtron

In [29]:
data[['Agtron_whole', 'Agtron_ground']] = data['Agtron'].str.split('/', expand=True)
data['Agtron_whole'] = pd.to_numeric(data['Agtron_whole'], errors='coerce')
data['Agtron_ground'] = pd.to_numeric(data['Agtron_ground'], errors='coerce')
data

Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,Body,Flavor,Aftertaste,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line,Agtron_whole,Agtron_ground
0,https://www.coffeereview.com/review/100-arabic...,89\nCaffe Bomrad\n100% Arabica 100% Italiano\n...,89.0,Caffe Bomrad,100% Arabica 100% Italiano,"Torino, Italy",Not disclosed.,Medium,48/65,$54.00/1 Kilogram,...,8.0,8.0,7.0,8.0,Evaluated as espresso. Smoothly round aroma: t...,Roasted in Northern Italy and distributed in N...,A strong-charactered Northern Italian styled e...,,48.0,65.0
1,https://www.coffeereview.com/review/100-arabic...,"87\nLucaff?\n100% Arabica, Black Label (ESE po...",87.0,Lucaff?,"100% Arabica, Black Label (ESE pod)","Padenghe sul Garda, Italy",Not disclosed.,Dark,0/80,,...,8.0,7.0,7.0,8.0,Produced from an ESE pod on a FrancisFrancis! ...,ESE (Easy Serving Espresso) pods are wafer-lik...,An attractive pod espresso for big milk drinks.,,0.0,80.0
2,https://www.coffeereview.com/review/100-arabic...,87\nCaribeans\n100% Arabica Coffee from Puerto...,87.0,Caribeans,100% Arabica Coffee from Puerto Rico,"San Juan, Puerto Rico","Utuado, central Puerto Rico",Medium-Light,54/69,$17.00/8 ounces,...,7.0,8.0,7.0,,Bittersweet but balanced; chocolaty. Dark choc...,Produced on a single farm in the central mount...,,Satisfying chocolate and nut notes nearly carr...,54.0,69.0
3,https://www.coffeereview.com/review/100-arabic...,88\nWaka Coffee\n100% Arabica Freeze-Dried Col...,88.0,Waka Coffee,100% Arabica Freeze-Dried Colombian (Instant C...,"Los Angeles, California",Colombia,,0/0,$10.99/8 single-serve packets,...,8.0,8.0,8.0,,Evaluated at proportions of 5 grams of instant...,The green coffee for this product was produced...,,A appealing 100% Colombia coffee in instant fo...,0.0,0.0
4,https://www.coffeereview.com/review/100-arabic...,72\nYuban\n100% Arabica Instant Coffee\nRoaste...,72.0,Yuban,100% Arabica Instant Coffee,"Northfield, Illinois",Colombia. All coffee of the Arabica species.,,0/0,$8.27/8 ounces instant,...,7.0,3.0,4.0,,In the aroma caramel and wet burned wood notes...,An instant coffee evaluated as mixed in propor...,"Not good, but not the worst of the instants on...",,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8382,https://www.coffeereview.com/review/zimbabwean...,88\nLeopard Forest Coffee\nZimbabwean Peaberry...,88.0,Leopard Forest Coffee,Zimbabwean Peaberry,"Travelers Rest, South Carolina","Eastern Highlands, Zimbabwe",Medium,48/67,,...,8.0,8.0,7.0,,"Low-toned, deep aroma with a tight-knit comple...",Historically Zimbabwe has been the premier cof...,A quietly bright breakfast cup with some pleas...,,48.0,67.0
8383,https://www.coffeereview.com/review/zimbabwe/,83\nThe Sensuous Bean\nZimbabwe\nRoaster Locat...,83.0,The Sensuous Bean,Zimbabwe,"New York, New York","Chipinge growing region, eastern Zimbabwe",Medium,50/63,,...,8.0,6.0,7.0,,A potentially fine coffee compromised by mild ...,Historically Zimbabwe has been the premier cof...,"Those who enjoy the earthy, musty tones in som...",,50.0,63.0
8384,https://www.coffeereview.com/review/zombie-des...,87\nCafe Kreyol\nZombie Desert 100% Organic Ha...,87.0,Cafe Kreyol,Zombie Desert 100% Organic Haitian Bleu,"Fairfax, Virginia.","Artibonite growing region, Haiti.",Medium-Dark,47/52,$14.99/12 ounces,...,8.0,7.0,7.0,,"Smoky, pungent. Very lightly scorched cedar, b...","Like all Haitian coffee, this is produced enti...",Those who enjoy pungently bracing medium-dark-...,,47.0,52.0
8385,https://www.coffeereview.com/review/zoom-espre...,93\nZuco Coffee Roasters\nZoom Espresso\nRoast...,93.0,Zuco Coffee Roasters,Zoom Espresso,"Hong Kong, China",Honduras; Ethiopia; Brazil,Medium,44/60,HK $150/250 grams,...,8.0,9.0,8.0,9.0,"Evaluated as espresso. Rich, winey, floral. Ba...","Zoom is a signature espresso blend from Zuco, ...",,"A complex, lively espresso very lightly touche...",44.0,60.0
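
In the output above, instant coffees show `0/0`, and the dataset also contains a bare `/`. A reading of 0 is probably a placeholder for "not measured" rather than a real Agtron value. Under that assumption (it is not confirmed by the source data), zeros can be converted to missing values after the split. The sample strings below are copied from rows shown in this notebook.

```python
import numpy as np
import pandas as pd

agtron = pd.Series(["48/65", "0/80", "0/0", "/", None], name="Agtron")

parts = agtron.str.split("/", expand=True)
parts.columns = ["Agtron_whole", "Agtron_ground"]

# errors="coerce" turns the empty strings produced by a bare "/" into NaN.
parts = parts.apply(pd.to_numeric, errors="coerce")

# Assumption: 0 means "not measured", so treat it as missing.
parts = parts.replace(0, np.nan)
print(parts)
```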


# Coffee Origin

In [None]:
data["Coffee Origin"].value_counts()

Coffee Origin
Not disclosed                                                 543
Yirgacheffe growing region, southern Ethiopia                 285
Boquete growing region, western Panama                        165
Nyeri growing region, south-central Kenya                     163
South-central Kenya                                           128
                                                             ... 
Northwestern Sumatra, Indonesia                                 1
Okapa Valley, Eastern Highlands Province, Papua New Guinea      1
Mexico; Peru                                                    1
Peru, Mexico, Nicaragua                                         1
Artibonite growing region, Haiti                                1
Name: count, Length: 2338, dtype: int64

In [None]:
# Strip the trailing "."
data["Coffee Origin"] = data["Coffee Origin"].str.rstrip(".")
data["Coffee Origin"].value_counts()

Coffee Origin
Not disclosed                                                 543
Yirgacheffe growing region, southern Ethiopia                 285
Boquete growing region, western Panama                        165
Nyeri growing region, south-central Kenya                     163
South-central Kenya                                           128
                                                             ... 
Northwestern Sumatra, Indonesia                                 1
Okapa Valley, Eastern Highlands Province, Papua New Guinea      1
Mexico; Peru                                                    1
Peru, Mexico, Nicaragua                                         1
Artibonite growing region, Haiti                                1
Name: count, Length: 2338, dtype: int64
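
The effect of stripping the trailing period is easier to see on a small sample: values that differ only by a final "." collapse into one category. The extra whitespace strip is an addition for illustration, not something the cell above does.

```python
import pandas as pd

# Toy sample; the strings mirror values that appear in this notebook.
origin = pd.Series([
    "Not disclosed.",
    "Not disclosed",
    "Artibonite growing region, Haiti.",
    None,
])

# Strip surrounding whitespace first, then any trailing periods.
cleaned = origin.str.strip().str.rstrip(".")

print(origin.nunique(), "->", cleaned.nunique())
```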

In [None]:
# data[["URL", "Coffee Origin"]].to_csv("coffee_origins.csv", index=False)

# Roaster Location
For this column:

- If the place is in the USA, it follows the "City, State/Territory" pattern;
- If the place is outside the USA, it follows the "Town/City, Country" pattern.

Note that **Puerto Rico** is neither a country nor a state: it is a self-governing Caribbean archipelago organized as an unincorporated territory of the United States with commonwealth status. (From Wikipedia)

Therefore, we break this column down into 3 variables: “Town/City”, “State/Territory”, “Country”.
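
A minimal sketch of that three-way split is shown here. `US_STATES_TERRITORIES` and `split_location` are hypothetical names introduced for illustration, the lookup set is deliberately incomplete, and labelling territories with country "USA" is an assumption. Three-part values are assumed to be "City, Region, Country", which does not hold for every row in the data.

```python
import pandas as pd

# Hypothetical, incomplete lookup; a real run needs all US states and territories.
US_STATES_TERRITORIES = {"California", "Illinois", "New York", "Puerto Rico"}

def split_location(location):
    """Return (city, state_or_territory, country) for one location string."""
    if not isinstance(location, str):
        return (None, None, None)
    parts = [p.strip().rstrip(".") for p in location.split(",")]
    if len(parts) == 2:
        city, region = parts
        if region in US_STATES_TERRITORIES:
            return (city, region, "USA")
        return (city, None, region)
    if len(parts) == 3:
        return tuple(parts)
    # One part, or more than three: leave for manual review.
    return (None, None, None)

locations = pd.Series(["Torino, Italy", "San Juan, Puerto Rico",
                       "Toronto, Ontario, Canada", None])
split = pd.DataFrame(locations.map(split_location).tolist(),
                     columns=["Town/City", "State/Territory", "Country"])
print(split)
```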

Step 1: Check whether every value in this column follows the "A, B" pattern, so that the split function can be defined safely. Result: no.

Step 2: Check the largest number of parts after splitting each cell on ",". Result: 3.

Step 3: Check whether each roaster's location is consistent across reviews, and whether roaster information could live in a separate table. Some locations are inconsistent; I adjusted them manually.

Idea: Use Roaster + Roaster Location to look up longitude and latitude through the Google Maps API. Concern: some roasters might not have a physical office.


In [3]:
data["Roaster Location"]

0                        Torino, Italy
1            Padenghe sul Garda, Italy
2                San Juan, Puerto Rico
3              Los Angeles, California
4                 Northfield, Illinois
                     ...              
8382    Travelers Rest, South Carolina
8383                New York, New York
8384                Fairfax, Virginia.
8385                  Hong Kong, China
8386                Madison, Wisconsin
Name: Roaster Location, Length: 8387, dtype: object

## Step 1: Check pattern

In [None]:
# # First, check whether every cell in the "Roaster Location" column follows the "A, B" pattern.
# # Regex pattern to match "City, State/Territory" or "City, Country"
# pattern = r'^[A-Za-z\s]+,\s[A-Za-z\s]+$'

# # Function to apply the regex check
# def check_pattern(location):
#     # Ensure the location is a string before applying regex
#     if isinstance(location, str):
#         return bool(re.match(pattern, location))
#     return False

# # Apply the function to check the pattern for each row
# data['location_is_valid'] = data['Roaster Location'].apply(check_pattern)
# data[data['location_is_valid'] == False]

Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,Acidity/Structure,Body,Flavor,Aftertaste,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line,location_is_valid
23,https://www.coffeereview.com/review/100-hawaii...,88\nJavaloha\n100% Hawaiian Coffee Hamakua Est...,88.0,Javaloha,100% Hawaiian Coffee Hamakua Estate,"Pa'auilo, Hawaii","Hamakua district, northeastern coast of the Bi...",Medium-Dark,43/54,$30 / 16 oz.,...,,8.0,8.0,7.0,,"Tart, grapefruity pungency, hints of sweet coc...","Javaloha coffee is grown at an elevation of 2,...",Aficionados interested in an emerging new Hawa...,,False
24,https://www.coffeereview.com/review/100-jamaic...,89\nStoneleigh Coffee\n100% Jamaica Blue Mount...,89.0,Stoneleigh Coffee,100% Jamaica Blue Mountain,"Toronto, Ontario, Canada","Blue Mountain growing region, eastern Jamaica.",Medium,50/61,$32.00/8 ounces,...,,8.0,8.0,7.0,,"Round, deeply sweet. Brown sugar, roasted caca...","Blue Mountain, grown in the lush, lovely mount...","Those who value the sweetness, full mouthfeel ...",,False
27,https://www.coffeereview.com/review/100-kau-na...,93\nPacific Coffee Research\n100% Ka‘ū Navarro...,93.0,Pacific Coffee Research,100% Ka‘ū Navarro,"Kealakekua, Kona, “Big Island” of Hawai’i","Ka‘ū growing region, “Big Island” of Hawai’i",Light,60/84,$39.20/10 ounces,...,8.0,9.0,9.0,8.0,,"Richly sweet-savory, deep-toned. Raspberry com...",Produced by Delvin and Nette Navarro of Navarr...,,Fruit and floral notes occupy center stage in ...,False
38,https://www.coffeereview.com/review/100-kona-e...,92\nKona Hills Coffee\n100% Kona Extra Fancy\n...,92.0,Kona Hills Coffee,100% Kona Extra Fancy,"Captain Cook, Hawai’i","Kona growing district, “Big Island” of Hawai’i",Medium-Light,57/76,$58.00/16 ounces,...,8.0,8.0,9.0,8.0,,"Sweetly nutty, crisply chocolaty. Cashew butte...","Produced by Mike Takizawa of Kona Hills Farm, ...",,"A nut-driven, chocolaty Kona Typica with invit...",False
51,https://www.coffeereview.com/review/100-kona-t...,94\nKona Farm Direct\n100% Kona Typica\nRoaste...,94.0,Kona Farm Direct,100% Kona Typica,"Holualoa, Hawai’i","Kona growing region, Big Island of Hawai’i",Medium-Light,58/80,$27.95/7 ounces,...,9.0,9.0,9.0,8.0,,"Sweet-toned, chocolaty and rich. Dark chocolat...","Produced by Kona Farm Direct, entirely of the ...",,"A lively, balanced washed Kona cup with a dark...",False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8333,https://www.coffeereview.com/review/yemen-natu...,95\nKakalove Cafe\nYemen Natural Sana’a Manakh...,95.0,Kakalove Cafe,Yemen Natural Sana’a Manakhah Ja’adi,"Chia-Yi, Taiwan","Manakhah, Sana'a, Yemen",Medium-Light,58/77,NT $900/4 ounces,...,9.0,9.0,9.0,9.0,,"Deeply spice-toned, richly layered. Black curr...",Produced by smallholding farmers entirely of t...,,"A deeply aromatic, richly fruit- and chocolate...",False
8341,https://www.coffeereview.com/review/yirgacheff...,94\nSO Roasters\nYirgacheffe Aricha Natural G1...,94.0,SO Roasters,Yirgacheffe Aricha Natural G1,South Korea,"Yirgacheffe growing region, southern Ethiopia.",Medium-Light,59/74,"KRW$10,500/100 grams",...,,8.0,10.0,8.0,,"Sweet, intense, melodic. Blueberry, tangerine,...",Most Yirgacheffe coffee is prepared by the con...,"A sweet, cleanly fermenty dried-in-the-fruit E...",,False
8365,https://www.coffeereview.com/review/zafiro/,92\nFinca Tasta\nZafiro\nRoaster Location:\nLl...,92.0,Finca Tasta,Zafiro,"Llayla District, Satipo Province, Peru","Llayla District, Satipo Province, Peru",Medium-Light,60/78,$10.00/250 grams,...,8.0,9.0,9.0,8.0,,"Complex, sweet-savory. Tamarind, cocoa nib, ma...",Produced and roasted by the Meza family at Fin...,,"A nuanced, multilayered Peru Pacamara processe...",False
8380,https://www.coffeereview.com/review/zimbabwe-g...,80\nLa Lucie Estate\nZimbabwe – Green\nRoaster...,80.0,La Lucie Estate,Zimbabwe – Green,Zimbabwe,"La Lucie Estate, Zimbabwe",,/,,...,,,,,,Ambivalence. Most panelists noted a pleasantly...,Though not as admired as the best Kenyas in th...,,,False
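
The same check can be written without a helper function by using pandas' vectorized string methods. This is a sketch on a handful of values taken from the output above; it also shows why rows fail: the character class only allows ASCII letters and whitespace, so apostrophes, trailing periods, and three-part locations are all rejected.

```python
import pandas as pd

# Same pattern as above; fullmatch anchors it at both ends, so ^ and $ are not needed.
pattern = r"[A-Za-z\s]+,\s[A-Za-z\s]+"

locations = pd.Series(["Torino, Italy", "Pa'auilo, Hawaii", "Fairfax, Virginia.",
                       "Toronto, Ontario, Canada", "Zimbabwe", None])

# na=False marks missing values as invalid instead of propagating NaN.
is_valid = locations.str.fullmatch(pattern, na=False)
print(locations[~is_valid])
```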


In [None]:
# data['Roaster Location'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 8387 entries, 0 to 8386
Series name: Roaster Location
Non-Null Count  Dtype 
--------------  ----- 
8281 non-null   object
dtypes: object(1)
memory usage: 65.6+ KB


## Step 2: Check number of parts

In [None]:
# # Step 2: Check the number of parts in location
# def count_parts(location):
#     if isinstance(location, str):  # Check if the value is a string
#         return len(location.split(','))
#     return 0  # Return 0 if it's not a string (e.g., NaN)

# data['num_loc_parts'] = data['Roaster Location'].apply(count_parts)
# data['num_loc_parts'].value_counts()

num_loc_parts
2    7926
3     284
0     106
1      71
Name: count, dtype: int64
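
A vectorized alternative to `count_parts`, sketched on toy values: the number of comma-separated parts is the number of commas plus one, and missing values are counted as 0 to match the convention used above.

```python
import pandas as pd

locations = pd.Series(["Torino, Italy", "Toronto, Ontario, Canada", "Zimbabwe", None])

# str.count returns NaN for missing values; fill those with 0 parts.
num_parts = (locations.str.count(",") + 1).fillna(0).astype(int)
print(num_parts.value_counts().sort_index())
```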

Interpretation of the number of parts:

- 0: Missing roaster location (106 rows)
- 1: Several distinct cases (71 rows). Possible handling: check whether the location for each roaster is consistent.
    - Some give only a country, such as "Zimbabwe" and "El Salvador". "Taiwan" also appears alone, but it is not treated as a country here.
    - [One review page](https://www.coffeereview.com/review/black-espresso/) has an error in its Roaster Location field; the value should be NaN instead. [The original coffee page](https://cmykcoffee.com/products/black?_pos=1&_sid=341ed5d0d&_ss=r)
    - [One review page](https://www.coffeereview.com/review/burundi-bwayi/) uses "." instead of "," as the separator: "Peoria. Illinois".
    - [One review page](https://www.coffeereview.com/review/sumatra-beveo/) uses " " instead of "," as the separator: "Scottsdale Arizona". Because some country names also contain spaces, this case is hard to handle automatically.
    - [One review page](https://www.coffeereview.com/review/sumatra-tano-batak/) uses "Sacr" as the location, which may be shorthand for "Sacramento, California" ([the original roaster page](https://templecoffee.com/pages/locations)).
    - Two pages ([page 1](https://www.coffeereview.com/review/green-coffee-zambia/) and [page 2](https://www.coffeereview.com/review/green-coffee-zimbabwe/)) use "Grower", which does not appear to be a location name. Some information found online: [Kowa](https://www.instagram.com/kowacoffee/), [Terranova Estate](https://coffeaalchemy.com/zambia-terranova-coffee-beans/)
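
One way to handle these one-off cases is an explicit lookup of corrections applied to whole cell values. The dictionary name `location_fixes` is hypothetical, and the "Sacr" entry is a guess based on the roaster's own site rather than a confirmed value. "Grower" is left out because the right replacement is not clear.

```python
import pandas as pd

# Hypothetical corrections for the malformed values listed above.
location_fixes = {
    "Peoria. Illinois": "Peoria, Illinois",
    "Scottsdale Arizona": "Scottsdale, Arizona",
    "Sacr": "Sacramento, California",  # assumption, see the note above
}

locations = pd.Series(["Peoria. Illinois", "Scottsdale Arizona", "Sacr", "Torino, Italy"])

# With a dict, Series.replace matches entire values, so other rows are untouched.
fixed = locations.replace(location_fixes)
print(fixed.tolist())
```

An explicit lookup avoids the ambiguity of splitting on spaces, since only the listed strings are changed.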



In [None]:
# data.loc[3997, 'URL']

'https://www.coffeereview.com/review/green-coffee-zimbabwe/'

In [None]:
# # data[(data['num_loc_parts'] == 0) & (data['Roaster Location'].notna())]
# data[data['num_loc_parts'] == 2]

Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,Body,Flavor,Aftertaste,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line,location_is_valid,num_loc_parts
0,https://www.coffeereview.com/review/100-arabic...,89\nCaffe Bomrad\n100% Arabica 100% Italiano\n...,89.0,Caffe Bomrad,100% Arabica 100% Italiano,"Torino, Italy",Not disclosed.,Medium,48/65,$54.00/1 Kilogram,...,8.0,8.0,7.0,8.0,Evaluated as espresso. Smoothly round aroma: t...,Roasted in Northern Italy and distributed in N...,A strong-charactered Northern Italian styled e...,,True,2
1,https://www.coffeereview.com/review/100-arabic...,"87\nLucaff?\n100% Arabica, Black Label (ESE po...",87.0,Lucaff?,"100% Arabica, Black Label (ESE pod)","Padenghe sul Garda, Italy",Not disclosed.,Dark,0/80,,...,8.0,7.0,7.0,8.0,Produced from an ESE pod on a FrancisFrancis! ...,ESE (Easy Serving Espresso) pods are wafer-lik...,An attractive pod espresso for big milk drinks.,,True,2
2,https://www.coffeereview.com/review/100-arabic...,87\nCaribeans\n100% Arabica Coffee from Puerto...,87.0,Caribeans,100% Arabica Coffee from Puerto Rico,"San Juan, Puerto Rico","Utuado, central Puerto Rico",Medium-Light,54/69,$17.00/8 ounces,...,7.0,8.0,7.0,,Bittersweet but balanced; chocolaty. Dark choc...,Produced on a single farm in the central mount...,,Satisfying chocolate and nut notes nearly carr...,True,2
3,https://www.coffeereview.com/review/100-arabic...,88\nWaka Coffee\n100% Arabica Freeze-Dried Col...,88.0,Waka Coffee,100% Arabica Freeze-Dried Colombian (Instant C...,"Los Angeles, California",Colombia,,0/0,$10.99/8 single-serve packets,...,8.0,8.0,8.0,,Evaluated at proportions of 5 grams of instant...,The green coffee for this product was produced...,,A appealing 100% Colombia coffee in instant fo...,True,2
4,https://www.coffeereview.com/review/100-arabic...,72\nYuban\n100% Arabica Instant Coffee\nRoaste...,72.0,Yuban,100% Arabica Instant Coffee,"Northfield, Illinois",Colombia. All coffee of the Arabica species.,,0/0,$8.27/8 ounces instant,...,7.0,3.0,4.0,,In the aroma caramel and wet burned wood notes...,An instant coffee evaluated as mixed in propor...,"Not good, but not the worst of the instants on...",,True,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8382,https://www.coffeereview.com/review/zimbabwean...,88\nLeopard Forest Coffee\nZimbabwean Peaberry...,88.0,Leopard Forest Coffee,Zimbabwean Peaberry,"Travelers Rest, South Carolina","Eastern Highlands, Zimbabwe",Medium,48/67,,...,8.0,8.0,7.0,,"Low-toned, deep aroma with a tight-knit comple...",Historically Zimbabwe has been the premier cof...,A quietly bright breakfast cup with some pleas...,,True,2
8383,https://www.coffeereview.com/review/zimbabwe/,83\nThe Sensuous Bean\nZimbabwe\nRoaster Locat...,83.0,The Sensuous Bean,Zimbabwe,"New York, New York","Chipinge growing region, eastern Zimbabwe",Medium,50/63,,...,8.0,6.0,7.0,,A potentially fine coffee compromised by mild ...,Historically Zimbabwe has been the premier cof...,"Those who enjoy the earthy, musty tones in som...",,True,2
8384,https://www.coffeereview.com/review/zombie-des...,87\nCafe Kreyol\nZombie Desert 100% Organic Ha...,87.0,Cafe Kreyol,Zombie Desert 100% Organic Haitian Bleu,"Fairfax, Virginia.","Artibonite growing region, Haiti.",Medium-Dark,47/52,$14.99/12 ounces,...,8.0,7.0,7.0,,"Smoky, pungent. Very lightly scorched cedar, b...","Like all Haitian coffee, this is produced enti...",Those who enjoy pungently bracing medium-dark-...,,False,2
8385,https://www.coffeereview.com/review/zoom-espre...,93\nZuco Coffee Roasters\nZoom Espresso\nRoast...,93.0,Zuco Coffee Roasters,Zoom Espresso,"Hong Kong, China",Honduras; Ethiopia; Brazil,Medium,44/60,HK $150/250 grams,...,8.0,9.0,8.0,9.0,"Evaluated as espresso. Rich, winey, floral. Ba...","Zoom is a signature espresso blend from Zuco, ...",,"A complex, lively espresso very lightly touche...",True,2


## Run this - Step 3: Check whether each roaster's location is consistent (some are not)
Export the inconsistent locations and fix them manually.

In [30]:
# Check for violations by identifying roasters with more than one unique location
violating_roasters = data.groupby('Roaster')['Roaster Location'].unique()

# Filter out the roasters with multiple locations
violating_roasters = violating_roasters[violating_roasters.apply(len) > 1]

# Show the violating roasters with their respective locations
if not violating_roasters.empty:
    print("The following roasters have different locations associated with them:")
    print(violating_roasters)
else:
    print("All roasters have a consistent location.")

The following roasters have different locations associated with them:
Roaster
1980 CAFE                                [Tainan City, Taiwan, Tainan, Taiwan]
A.R.C.                                                 [Hong Kong, China, nan]
AHRIRE Roasting              [Pu'Er Yunnan Province, China, Pu’Er, Yunnan P...
Allegro Coffee               [Boulder, Colorado, Thornton, Colorado, Berkel...
Allegro Coffee Roasters      [Boulder, Colorado, Thornton, Colorado, Denver...
                                                   ...                        
Wheelys Cafe Taiwan                          [Taipei, Taiwan, Taoyuan, Taiwan]
Whole Foods Market                           [Austin, Texas, Denver, Colorado]
Willoughby's Coffee & Tea        [Branford, Connecticut, Branford Connecticut]
Wonderstate Coffee                  [Driftless, Wisconsin, Viroqua, Wisconsin]
Yuban                              [Northfield, Illinois, Rye Brook, New York]
Name: Roaster Location, Length: 154, dtype: object
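
`unique()` keeps NaN as a distinct value, which is why A.R.C. is flagged above with `[Hong Kong, China, nan]` even though it has only one known location. If the goal is to separate genuine conflicts from roasters that merely have a missing entry, `nunique()` (which ignores NaN by default) is one option. A sketch on toy rows modelled on the output above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Roaster": ["A.R.C.", "A.R.C.", "Yuban", "Yuban"],
    "Roaster Location": ["Hong Kong, China", np.nan,
                         "Northfield, Illinois", "Rye Brook, New York"],
})

# nunique() drops NaN, so only roasters with two or more real locations remain.
n_locations = df.groupby("Roaster")["Roaster Location"].nunique()
conflicting = n_locations[n_locations > 1].index.tolist()
print(conflicting)
```

Roasters like A.R.C. could then be handled separately by filling the missing rows with the single known location.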


In [31]:
temp = violating_roasters.reset_index()
temp["Roaster Location_str"] = temp["Roaster Location"].apply(lambda lst: "/".join([str(x) for x in lst if pd.notna(x)]))
temp = temp.drop("Roaster Location", axis = 1)
# temp.to_csv("inconsistent_roaster_location.csv")

In [32]:
location_df = pd.read_csv("../data/inconsistent_roaster_location.csv", encoding="ISO-8859-1")
temp["Roaster Location"] = location_df["Roaster Location"]
temp

Unnamed: 0,Roaster,Roaster Location_str,Roaster Location
0,1980 CAFE,"Tainan City, Taiwan/Tainan, Taiwan","Tainan, Taiwan, China"
1,A.R.C.,"Hong Kong, China",
2,AHRIRE Roasting,"Pu'Er Yunnan Province, China/Pu’Er, Yunnan Pro...","Pu'Er, Yunnan, China"
3,Allegro Coffee,"Boulder, Colorado/Thornton, Colorado/Berkeley,...",
4,Allegro Coffee Roasters,"Boulder, Colorado/Thornton, Colorado/Denver, C...",
...,...,...,...
149,Wheelys Cafe Taiwan,"Taipei, Taiwan/Taoyuan, Taiwan",
150,Whole Foods Market,"Austin, Texas/Denver, Colorado",
151,Willoughby's Coffee & Tea,"Branford, Connecticut/Branford Connecticut","Branford, Connecticut"
152,Wonderstate Coffee,"Driftless, Wisconsin/Viroqua, Wisconsin",
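
The cell above copies `location_df["Roaster Location"]` into `temp` by row position, which is only correct if the CSV rows are still in the same order as `temp`. A more defensive variant joins on the roaster name instead. This sketch assumes the edited CSV still has a `Roaster` column (it was exported from `temp`, so that seems likely, but it is not shown here); the toy frames stand in for the real ones.

```python
import pandas as pd

temp = pd.DataFrame({"Roaster": ["1980 CAFE", "A.R.C.", "Yuban"]})

# Stand-in for the manually edited CSV; rows are shuffled on purpose.
location_df = pd.DataFrame({
    "Roaster": ["Yuban", "1980 CAFE"],
    "Roaster Location": [None, "Tainan, Taiwan, China"],
})

# validate="one_to_one" raises if a roaster appears twice on either side.
merged = temp.merge(location_df[["Roaster", "Roaster Location"]],
                    on="Roaster", how="left", validate="one_to_one")
print(merged)
```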


In [33]:
# Build a dict
location_dict = pd.Series(temp['Roaster Location'].values, index=temp['Roaster']).to_dict()

# Apply to data
data["Roaster Location_new"] = data["Roaster"].map(location_dict)

# Fall back to the original Roaster Location where no correction exists
data["Roaster Location_new"] = data["Roaster Location_new"].fillna(data["Roaster Location"])

print(data["Roaster Location_new"].isna().sum())
print(data["Roaster Location"].isna().sum())

# Overwrite the original column and drop the temporary one
data["Roaster Location"] = data["Roaster Location_new"]
data = data.drop("Roaster Location_new", axis=1)
data

104
106


Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,Body,Flavor,Aftertaste,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line,Agtron_whole,Agtron_ground
0,https://www.coffeereview.com/review/100-arabic...,89\nCaffe Bomrad\n100% Arabica 100% Italiano\n...,89.0,Caffe Bomrad,100% Arabica 100% Italiano,"Torino, Italy",Not disclosed.,Medium,48/65,$54.00/1 Kilogram,...,8.0,8.0,7.0,8.0,Evaluated as espresso. Smoothly round aroma: t...,Roasted in Northern Italy and distributed in N...,A strong-charactered Northern Italian styled e...,,48.0,65.0
1,https://www.coffeereview.com/review/100-arabic...,"87\nLucaff?\n100% Arabica, Black Label (ESE po...",87.0,Lucaff?,"100% Arabica, Black Label (ESE pod)","Padenghe sul Garda, Italy",Not disclosed.,Dark,0/80,,...,8.0,7.0,7.0,8.0,Produced from an ESE pod on a FrancisFrancis! ...,ESE (Easy Serving Espresso) pods are wafer-lik...,An attractive pod espresso for big milk drinks.,,0.0,80.0
2,https://www.coffeereview.com/review/100-arabic...,87\nCaribeans\n100% Arabica Coffee from Puerto...,87.0,Caribeans,100% Arabica Coffee from Puerto Rico,"San Juan, Puerto Rico","Utuado, central Puerto Rico",Medium-Light,54/69,$17.00/8 ounces,...,7.0,8.0,7.0,,Bittersweet but balanced; chocolaty. Dark choc...,Produced on a single farm in the central mount...,,Satisfying chocolate and nut notes nearly carr...,54.0,69.0
3,https://www.coffeereview.com/review/100-arabic...,88\nWaka Coffee\n100% Arabica Freeze-Dried Col...,88.0,Waka Coffee,100% Arabica Freeze-Dried Colombian (Instant C...,"Los Angeles, California",Colombia,,0/0,$10.99/8 single-serve packets,...,8.0,8.0,8.0,,Evaluated at proportions of 5 grams of instant...,The green coffee for this product was produced...,,A appealing 100% Colombia coffee in instant fo...,0.0,0.0
4,https://www.coffeereview.com/review/100-arabic...,72\nYuban\n100% Arabica Instant Coffee\nRoaste...,72.0,Yuban,100% Arabica Instant Coffee,"Northfield, Illinois",Colombia. All coffee of the Arabica species.,,0/0,$8.27/8 ounces instant,...,7.0,3.0,4.0,,In the aroma caramel and wet burned wood notes...,An instant coffee evaluated as mixed in propor...,"Not good, but not the worst of the instants on...",,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8382,https://www.coffeereview.com/review/zimbabwean...,88\nLeopard Forest Coffee\nZimbabwean Peaberry...,88.0,Leopard Forest Coffee,Zimbabwean Peaberry,"Travelers Rest, South Carolina","Eastern Highlands, Zimbabwe",Medium,48/67,,...,8.0,8.0,7.0,,"Low-toned, deep aroma with a tight-knit comple...",Historically Zimbabwe has been the premier cof...,A quietly bright breakfast cup with some pleas...,,48.0,67.0
8383,https://www.coffeereview.com/review/zimbabwe/,83\nThe Sensuous Bean\nZimbabwe\nRoaster Locat...,83.0,The Sensuous Bean,Zimbabwe,"New York, New York","Chipinge growing region, eastern Zimbabwe",Medium,50/63,,...,8.0,6.0,7.0,,A potentially fine coffee compromised by mild ...,Historically Zimbabwe has been the premier cof...,"Those who enjoy the earthy, musty tones in som...",,50.0,63.0
8384,https://www.coffeereview.com/review/zombie-des...,87\nCafe Kreyol\nZombie Desert 100% Organic Ha...,87.0,Cafe Kreyol,Zombie Desert 100% Organic Haitian Bleu,"Fairfax, Virginia.","Artibonite growing region, Haiti.",Medium-Dark,47/52,$14.99/12 ounces,...,8.0,7.0,7.0,,"Smoky, pungent. Very lightly scorched cedar, b...","Like all Haitian coffee, this is produced enti...",Those who enjoy pungently bracing medium-dark-...,,47.0,52.0
8385,https://www.coffeereview.com/review/zoom-espre...,93\nZuco Coffee Roasters\nZoom Espresso\nRoast...,93.0,Zuco Coffee Roasters,Zoom Espresso,"Hong Kong, China",Honduras; Ethiopia; Brazil,Medium,44/60,HK $150/250 grams,...,8.0,9.0,8.0,9.0,"Evaluated as espresso. Rich, winey, floral. Ba...","Zoom is a signature espresso blend from Zuco, ...",,"A complex, lively espresso very lightly touche...",44.0,60.0


## Use Google Maps API to extract the latitude and longitude

In [None]:
# data["Roaster Location_search"] = data["Roaster"] + ", " + data["Roaster Location"]
# data["Roaster Location_search"]

0                             Caffe Bomrad, Torino, Italy
1                      Lucaff?, Padenghe sul Garda, Italy
2                        Caribeans, San Juan, Puerto Rico
3                    Waka Coffee, Los Angeles, California
4                             Yuban, Northfield, Illinois
                              ...                        
8382    Leopard Forest Coffee, Travelers Rest, South C...
8383                The Sensuous Bean, New York, New York
8384                      Cafe Kreyol, Fairfax, Virginia.
8385               Zuco Coffee Roasters, Hong Kong, China
8386              JBC Coffee Roasters, Madison, Wisconsin
Name: Roaster Location_search, Length: 8387, dtype: object

In [None]:
# import requests

# # Replace with your own Google Maps API key
# API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"

# # Endpoint URL for the Geocoding API
# GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# # Function to get latitude and longitude from partial location info (city, town, country)
# def get_lat_lon(partial_location):
#     # Parameters to send with the request
#     params = {
#         'address': partial_location,  # City, town, or country
#         'key': API_KEY                # Your Google Maps API key
#     }
#
#     # Send the request and get the response (time out rather than hang on a stalled call)
#     response = requests.get(GEOCODE_URL, params=params, timeout=10)
#
#     # Check if the request was successful
#     if response.status_code == 200:
#         # Parse the JSON body; kept separate from the global `data` DataFrame
#         payload = response.json()
#
#         if payload['status'] == 'OK':
#             # Extract latitude and longitude from the first result
#             location = payload['results'][0]['geometry']['location']
#             return location['lat'], location['lng']
#         else:
#             print(f"Error: {payload['status']}")
#             return None, None
#     else:
#         print(f"Request failed with status code: {response.status_code}")
#         return None, None

# data[['Roaster Latitude', 'Roaster Longitude']] = data['Roaster Location_search'].apply(
#     lambda x: pd.Series(get_lat_lon(x))
# )

# # Note: the next section reads roaster_locations.csv from ../data/
# temp = data[["URL", "Roaster Location_search", "Roaster Latitude", "Roaster Longitude"]]
# temp.to_csv("roaster_locations.csv")
# data.to_csv("coffee_review_processed_0409.csv", index=False)

Latitude: 45.0703155, Longitude: 7.6868552
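
Many reviews likely share a roaster, so applying the geocoder row by row would send the same search string to the API more than once. Below is a minimal sketch of one way to cut down on requests. It assumes a hypothetical helper, `geocode_unique`, which is not part of the original notebook, and it accepts any function that returns a `(lat, lon)` tuple. A stand-in geocoder is used so the example runs without an API key.

```python
import pandas as pd


def geocode_unique(search_strings, geocode_fn):
    """Geocode each distinct non-null string once, then map results back to every row.

    `geocode_fn` is any callable returning a (lat, lon) tuple (hypothetical interface).
    """
    unique_values = search_strings.dropna().unique()
    lookup = {value: geocode_fn(value) for value in unique_values}

    # Rows whose search string is missing fall back to (None, None)
    coords = [lookup.get(value, (None, None)) for value in search_strings]
    return pd.DataFrame(
        coords,
        columns=["Roaster Latitude", "Roaster Longitude"],
        index=search_strings.index,
    )


# Stand-in geocoder that records how often it is called (no network access)
calls = []


def fake_geocode(query):
    calls.append(query)
    return (45.0703155, 7.6868552)


searches = pd.Series(
    [
        "Caffe Bomrad, Torino, Italy",
        "Caffe Bomrad, Torino, Italy",
        "Cafe Kreyol, Fairfax, Virginia.",
        None,
    ]
)
coords = geocode_unique(searches, fake_geocode)
print(len(calls), "requests for", len(searches), "rows")
```

In the real notebook, `get_lat_lon` could be passed in place of `fake_geocode`, and the returned frame assigned to the two coordinate columns.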


## Run this! - Combine roaster location data

In [34]:
# Load the roaster location coordinates and combine with data
location_coordinates = pd.read_csv("../data/roaster_locations.csv")
location_coordinates = location_coordinates[["URL", "Roaster Latitude", "Roaster Longitude"]]
result_df = pd.merge(data, location_coordinates, on='URL', how='left')
result_df

Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,Aftertaste,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line,Agtron_whole,Agtron_ground,Roaster Latitude,Roaster Longitude
0,https://www.coffeereview.com/review/100-arabic...,89\nCaffe Bomrad\n100% Arabica 100% Italiano\n...,89.0,Caffe Bomrad,100% Arabica 100% Italiano,"Torino, Italy",Not disclosed.,Medium,48/65,$54.00/1 Kilogram,...,7.0,8.0,Evaluated as espresso. Smoothly round aroma: t...,Roasted in Northern Italy and distributed in N...,A strong-charactered Northern Italian styled e...,,48.0,65.0,45.070315,7.686855
1,https://www.coffeereview.com/review/100-arabic...,"87\nLucaff?\n100% Arabica, Black Label (ESE po...",87.0,Lucaff?,"100% Arabica, Black Label (ESE pod)","Padenghe sul Garda, Italy",Not disclosed.,Dark,0/80,,...,7.0,8.0,Produced from an ESE pod on a FrancisFrancis! ...,ESE (Easy Serving Espresso) pods are wafer-lik...,An attractive pod espresso for big milk drinks.,,0.0,80.0,45.495816,10.511448
2,https://www.coffeereview.com/review/100-arabic...,87\nCaribeans\n100% Arabica Coffee from Puerto...,87.0,Caribeans,100% Arabica Coffee from Puerto Rico,"San Juan, Puerto Rico","Utuado, central Puerto Rico",Medium-Light,54/69,$17.00/8 ounces,...,7.0,,Bittersweet but balanced; chocolaty. Dark choc...,Produced on a single farm in the central mount...,,Satisfying chocolate and nut notes nearly carr...,54.0,69.0,18.454191,-66.070583
3,https://www.coffeereview.com/review/100-arabic...,88\nWaka Coffee\n100% Arabica Freeze-Dried Col...,88.0,Waka Coffee,100% Arabica Freeze-Dried Colombian (Instant C...,"Los Angeles, California",Colombia,,0/0,$10.99/8 single-serve packets,...,8.0,,Evaluated at proportions of 5 grams of instant...,The green coffee for this product was produced...,,A appealing 100% Colombia coffee in instant fo...,0.0,0.0,34.054908,-118.242643
4,https://www.coffeereview.com/review/100-arabic...,72\nYuban\n100% Arabica Instant Coffee\nRoaste...,72.0,Yuban,100% Arabica Instant Coffee,"Northfield, Illinois",Colombia. All coffee of the Arabica species.,,0/0,$8.27/8 ounces instant,...,4.0,,In the aroma caramel and wet burned wood notes...,An instant coffee evaluated as mixed in propor...,"Not good, but not the worst of the instants on...",,0.0,0.0,42.099750,-87.780897
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8382,https://www.coffeereview.com/review/zimbabwean...,88\nLeopard Forest Coffee\nZimbabwean Peaberry...,88.0,Leopard Forest Coffee,Zimbabwean Peaberry,"Travelers Rest, South Carolina","Eastern Highlands, Zimbabwe",Medium,48/67,,...,7.0,,"Low-toned, deep aroma with a tight-knit comple...",Historically Zimbabwe has been the premier cof...,A quietly bright breakfast cup with some pleas...,,48.0,67.0,34.973824,-82.432503
8383,https://www.coffeereview.com/review/zimbabwe/,83\nThe Sensuous Bean\nZimbabwe\nRoaster Locat...,83.0,The Sensuous Bean,Zimbabwe,"New York, New York","Chipinge growing region, eastern Zimbabwe",Medium,50/63,,...,7.0,,A potentially fine coffee compromised by mild ...,Historically Zimbabwe has been the premier cof...,"Those who enjoy the earthy, musty tones in som...",,50.0,63.0,40.775896,-73.979505
8384,https://www.coffeereview.com/review/zombie-des...,87\nCafe Kreyol\nZombie Desert 100% Organic Ha...,87.0,Cafe Kreyol,Zombie Desert 100% Organic Haitian Bleu,"Fairfax, Virginia.","Artibonite growing region, Haiti.",Medium-Dark,47/52,$14.99/12 ounces,...,7.0,,"Smoky, pungent. Very lightly scorched cedar, b...","Like all Haitian coffee, this is produced enti...",Those who enjoy pungently bracing medium-dark-...,,47.0,52.0,38.780663,-77.558812
8385,https://www.coffeereview.com/review/zoom-espre...,93\nZuco Coffee Roasters\nZoom Espresso\nRoast...,93.0,Zuco Coffee Roasters,Zoom Espresso,"Hong Kong, China",Honduras; Ethiopia; Brazil,Medium,44/60,HK $150/250 grams,...,8.0,9.0,"Evaluated as espresso. Rich, winey, floral. Ba...","Zoom is a signature espresso blend from Zuco, ...",,"A complex, lively espresso very lightly touche...",44.0,60.0,22.319304,114.169361
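
A left merge on `URL` silently duplicates review rows if the coordinates file contains the same URL more than once. The sketch below shows one way to guard against that, using toy frames with made-up URLs rather than the real data. `validate="many_to_one"` makes pandas raise if the right-hand keys are not unique, and `indicator=True` marks rows that found no match.

```python
import pandas as pd

# Toy stand-ins for `data` and `location_coordinates` (values are illustrative only)
reviews = pd.DataFrame({"URL": ["u1", "u2", "u3"], "Rating": [89.0, 87.0, 88.0]})
coords = pd.DataFrame(
    {
        "URL": ["u1", "u2"],
        "Roaster Latitude": [45.07, 45.50],
        "Roaster Longitude": [7.69, 10.51],
    }
)

merged = pd.merge(
    reviews,
    coords,
    on="URL",
    how="left",
    validate="many_to_one",  # raises MergeError if a URL repeats in `coords`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# URLs that received no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "URL"].tolist()
merged = merged.drop(columns="_merge")

print(len(merged) == len(reviews), unmatched)
```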


In [35]:
# TODO: some rows still have NaN coordinates after the merge; can we fill them?
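
One possible answer to the question above, offered as a sketch rather than the notebook's method: borrow coordinates from another review that has the same `Roaster Location` string. This assumes identical location strings refer to the same place, which may not hold for inconsistently written locations. Toy data is used, and rows with no geocoded sibling stay NaN.

```python
import numpy as np
import pandas as pd

# Toy frame: the second Torino row is missing coordinates, Madison has none at all
df = pd.DataFrame(
    {
        "Roaster Location": ["Torino, Italy", "Torino, Italy", "Madison, Wisconsin"],
        "Roaster Latitude": [45.070315, np.nan, np.nan],
        "Roaster Longitude": [7.686855, np.nan, np.nan],
    }
)

coord_cols = ["Roaster Latitude", "Roaster Longitude"]

# First non-null coordinate seen for each location, broadcast back to every row
group_first = df.groupby("Roaster Location")[coord_cols].transform("first")

# Only fill the gaps; existing coordinates are left untouched
df[coord_cols] = df[coord_cols].fillna(group_first)
print(df)
```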


# Price

In [36]:
# Load processed price data
price_df = pd.read_csv("../data/adjusted_prices.csv")

# Combine with data
result_df = pd.merge(result_df, price_df, on='URL', how='left')
result_df

Unnamed: 0,URL,all_text,Rating,Roaster,Coffee Name,Roaster Location,Coffee Origin,Roast Level,Agtron,Est. Price,...,With Milk,Blind Assessment,Notes,Who Should Drink It,Bottom Line,Agtron_whole,Agtron_ground,Roaster Latitude,Roaster Longitude,usd_per_100g_adj
0,https://www.coffeereview.com/review/100-arabic...,89\nCaffe Bomrad\n100% Arabica 100% Italiano\n...,89.0,Caffe Bomrad,100% Arabica 100% Italiano,"Torino, Italy",Not disclosed.,Medium,48/65,$54.00/1 Kilogram,...,8.0,Evaluated as espresso. Smoothly round aroma: t...,Roasted in Northern Italy and distributed in N...,A strong-charactered Northern Italian styled e...,,48.0,65.0,45.070315,7.686855,7.548955
1,https://www.coffeereview.com/review/100-arabic...,"87\nLucaff?\n100% Arabica, Black Label (ESE po...",87.0,Lucaff?,"100% Arabica, Black Label (ESE pod)","Padenghe sul Garda, Italy",Not disclosed.,Dark,0/80,,...,8.0,Produced from an ESE pod on a FrancisFrancis! ...,ESE (Easy Serving Espresso) pods are wafer-lik...,An attractive pod espresso for big milk drinks.,,0.0,80.0,45.495816,10.511448,
2,https://www.coffeereview.com/review/100-arabic...,87\nCaribeans\n100% Arabica Coffee from Puerto...,87.0,Caribeans,100% Arabica Coffee from Puerto Rico,"San Juan, Puerto Rico","Utuado, central Puerto Rico",Medium-Light,54/69,$17.00/8 ounces,...,,Bittersweet but balanced; chocolaty. Dark choc...,Produced on a single farm in the central mount...,,Satisfying chocolate and nut notes nearly carr...,54.0,69.0,18.454191,-66.070583,9.615077
3,https://www.coffeereview.com/review/100-arabic...,88\nWaka Coffee\n100% Arabica Freeze-Dried Col...,88.0,Waka Coffee,100% Arabica Freeze-Dried Colombian (Instant C...,"Los Angeles, California",Colombia,,0/0,$10.99/8 single-serve packets,...,,Evaluated at proportions of 5 grams of instant...,The green coffee for this product was produced...,,A appealing 100% Colombia coffee in instant fo...,0.0,0.0,34.054908,-118.242643,5.733532
4,https://www.coffeereview.com/review/100-arabic...,72\nYuban\n100% Arabica Instant Coffee\nRoaste...,72.0,Yuban,100% Arabica Instant Coffee,"Northfield, Illinois",Colombia. All coffee of the Arabica species.,,0/0,$8.27/8 ounces instant,...,,In the aroma caramel and wet burned wood notes...,An instant coffee evaluated as mixed in propor...,"Not good, but not the worst of the instants on...",,0.0,0.0,42.099750,-87.780897,5.097570
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8382,https://www.coffeereview.com/review/zimbabwean...,88\nLeopard Forest Coffee\nZimbabwean Peaberry...,88.0,Leopard Forest Coffee,Zimbabwean Peaberry,"Travelers Rest, South Carolina","Eastern Highlands, Zimbabwe",Medium,48/67,,...,,"Low-toned, deep aroma with a tight-knit comple...",Historically Zimbabwe has been the premier cof...,A quietly bright breakfast cup with some pleas...,,48.0,67.0,34.973824,-82.432503,
8383,https://www.coffeereview.com/review/zimbabwe/,83\nThe Sensuous Bean\nZimbabwe\nRoaster Locat...,83.0,The Sensuous Bean,Zimbabwe,"New York, New York","Chipinge growing region, eastern Zimbabwe",Medium,50/63,,...,,A potentially fine coffee compromised by mild ...,Historically Zimbabwe has been the premier cof...,"Those who enjoy the earthy, musty tones in som...",,50.0,63.0,40.775896,-73.979505,
8384,https://www.coffeereview.com/review/zombie-des...,87\nCafe Kreyol\nZombie Desert 100% Organic Ha...,87.0,Cafe Kreyol,Zombie Desert 100% Organic Haitian Bleu,"Fairfax, Virginia.","Artibonite growing region, Haiti.",Medium-Dark,47/52,$14.99/12 ounces,...,,"Smoky, pungent. Very lightly scorched cedar, b...","Like all Haitian coffee, this is produced enti...",Those who enjoy pungently bracing medium-dark-...,,47.0,52.0,38.780663,-77.558812,5.945681
8385,https://www.coffeereview.com/review/zoom-espre...,93\nZuco Coffee Roasters\nZoom Espresso\nRoast...,93.0,Zuco Coffee Roasters,Zoom Espresso,"Hong Kong, China",Honduras; Ethiopia; Brazil,Medium,44/60,HK $150/250 grams,...,9.0,"Evaluated as espresso. Rich, winey, floral. Ba...","Zoom is a signature espresso blend from Zuco, ...",,"A complex, lively espresso very lightly touche...",44.0,60.0,22.319304,114.169361,9.475734
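
The output above shows that some reviews have no adjusted price (for example, rows without an `Est. Price`). Before saving, it may help to quantify how many values are missing in the merged-in columns. The sketch below assumes a hypothetical helper, `missing_summary`, and runs on a toy frame; in the notebook it would be called on `result_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df, columns):
    """Return the count and share of missing values for the given columns."""
    subset = df[columns]
    return pd.DataFrame(
        {
            "n_missing": subset.isna().sum(),
            "share_missing": subset.isna().mean(),
        }
    )


# Toy stand-in for `result_df` (values are illustrative only)
toy = pd.DataFrame(
    {
        "Roaster Latitude": [45.07, np.nan, 18.45, 34.05],
        "usd_per_100g_adj": [7.55, np.nan, np.nan, 5.73],
    }
)

summary = missing_summary(toy, ["Roaster Latitude", "usd_per_100g_adj"])
print(summary)
```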


In [37]:
result_df.to_csv("coffee_review_processed_01.csv", index=False)