This notebook parses the scraped review text data into a DataFrame.

# Import dependencies

In [2]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import os
import re

# Check the total number of available reviews in the sitemap

In [4]:
# Step 1: Get all sitemap URLs from the sitemap index
sitemap_url = "https://www.coffeereview.com/sitemap_index.xml"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
}

# Fetch and parse the sitemap XML
response = requests.get(sitemap_url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "xml")
    all_urls = [loc.text for loc in soup.find_all("loc")]
    print(f"✅ Found {len(all_urls)} URLs.")
else:
    # Raise instead of calling exit(), which would stop the notebook kernel.
    raise RuntimeError(f"❌ Failed to fetch sitemap. Status code: {response.status_code}")

✅ Found 15 URLs.
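
The `<loc>` extraction above can also be sketched with the standard library alone. This is a minimal illustration, not the method used in this notebook: it swaps BeautifulSoup for `xml.etree.ElementTree`, and `extract_locs`, `SITEMAP_NS`, and the sample XML are hypothetical names and data. It assumes the sitemaps may nest image entries such as `image:loc` inside each `<url>`, which would be one possible explanation for the `wp-content` image URLs that show up later; reading only the direct `<loc>` child skips them.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <sitemapindex> and <urlset> files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(xml_text):
    """Return the <loc> text of each top-level <sitemap> or <url> entry."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:  # each child is a <sitemap> or <url> element
        loc = entry.find("sm:loc", SITEMAP_NS)  # direct child only, so nested image locs are ignored
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return locs

# Hypothetical sample, not real data from the site.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/review/a/</loc>
    <image:image>
      <image:loc>https://example.com/wp-content/uploads/a.jpg</image:loc>
    </image:image>
  </url>
</urlset>"""

print(extract_locs(sample))  # ['https://example.com/review/a/']
```

With `response.text` from a successful request in place of `sample`, the same function would return only the page URLs of that sitemap.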


In [5]:
review_sitemaps = [url for url in all_urls if "review-sitemap" in url]
review_sitemaps

['https://www.coffeereview.com/review-sitemap.xml',
 'https://www.coffeereview.com/review-sitemap2.xml',
 'https://www.coffeereview.com/review-sitemap3.xml',
 'https://www.coffeereview.com/review-sitemap4.xml',
 'https://www.coffeereview.com/review-sitemap5.xml',
 'https://www.coffeereview.com/review-sitemap6.xml',
 'https://www.coffeereview.com/review-sitemap7.xml',
 'https://www.coffeereview.com/review-sitemap8.xml',
 'https://www.coffeereview.com/review-sitemap9.xml']

In [None]:
all_review_urls = []

for this_sitemap in review_sitemaps:
    # Fetch and parse the sitemap XML
    response = requests.get(this_sitemap, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "xml")
        review_urls = [loc.text for loc in soup.find_all("loc")]
        print(f"✅ Found {len(review_urls)} review URLs in sitemap {this_sitemap}.")
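        all_review_urls.extend(review_urls)
    else:
        # Raise instead of calling exit(), which would stop the notebook kernel.
        raise RuntimeError(f"❌ Failed to fetch sitemap. Status code: {response.status_code}")

print(f"Found {len(all_review_urls)} review URLs in total.")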

✅ Found 1001 review URLs in sitemap https://www.coffeereview.com/review-sitemap.xml.
✅ Found 1000 review URLs in sitemap https://www.coffeereview.com/review-sitemap2.xml.
✅ Found 1040 review URLs in sitemap https://www.coffeereview.com/review-sitemap3.xml.
✅ Found 1057 review URLs in sitemap https://www.coffeereview.com/review-sitemap4.xml.
✅ Found 1056 review URLs in sitemap https://www.coffeereview.com/review-sitemap5.xml.
✅ Found 1067 review URLs in sitemap https://www.coffeereview.com/review-sitemap6.xml.
✅ Found 1058 review URLs in sitemap https://www.coffeereview.com/review-sitemap7.xml.
✅ Found 1049 review URLs in sitemap https://www.coffeereview.com/review-sitemap8.xml.
✅ Found 422 review URLs in sitemap https://www.coffeereview.com/review-sitemap9.xml.
Found 8750 review URLs in total.


In [13]:
if "https://www.coffeereview.com/review/" in all_review_urls:
    all_review_urls.remove("https://www.coffeereview.com/review/")

print(f"Found {len(all_review_urls)} review URLs in total.")
print(f"Found {len(set(all_review_urls))} unique review URLs in total.")

Found 8749 review URLs in total.
Found 8748 unique review URLs in total.


## Find duplicated review URLs

In [15]:
def first_duplicate(lst):
    seen = set()
    for item in lst:
        if item in seen:
            return item
        seen.add(item)
    return None

print(first_duplicate(all_review_urls))

https://www.coffeereview.com/wp-content/uploads/2014/11/29_375x375.jpg
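
`first_duplicate` stops at the first repeated item. To list every repeated URL together with its count, a `collections.Counter` can be used instead. This is a small sketch: `find_duplicates` and the sample URLs are hypothetical and not part of the original notebook.

```python
from collections import Counter

def find_duplicates(items):
    """Return {item: count} for every item that occurs more than once."""
    return {item: n for item, n in Counter(items).items() if n > 1}

# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/a/",
    "https://example.com/b.jpg",
    "https://example.com/a/",
    "https://example.com/c/",
]
print(find_duplicates(urls))  # {'https://example.com/a/': 2}
```

Calling `find_duplicates(all_review_urls)` would show whether the image URL above is the only duplicate.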


# Check the scraped reviews

In [17]:
reviews_from_sitemap = [re.sub(r"[^\w\-]", "_", url) + ".txt" for url in all_review_urls]
reviews_from_sitemap[:5]

['https___www_coffeereview_com_review_100-colombian_.txt',
 'https___www_coffeereview_com_review_moka-java_.txt',
 'https___www_coffeereview_com_review_java_.txt',
 'https___www_coffeereview_com_review_sumatra-gayo-mountain_.txt',
 'https___www_coffeereview_com_review_folgers-french-roast_.txt']
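
The filename mapping replaces every character that is neither a word character nor a hyphen with `_`, so it is not reversible and two different URLs could in principle map to the same file. The sketch below wraps the same regex in a helper and adds a collision check. `url_to_filename` and `filename_collisions` are hypothetical helper names, and the input URL in the first call is inferred from the filenames listed above.

```python
import re
from collections import defaultdict

def url_to_filename(url):
    """Replace every character that is not a word character or hyphen with '_' and append '.txt'."""
    return re.sub(r"[^\w\-]", "_", url) + ".txt"

def filename_collisions(urls):
    """Group distinct URLs that map to the same filename."""
    groups = defaultdict(set)
    for url in urls:
        groups[url_to_filename(url)].add(url)
    return {name: sorted(group) for name, group in groups.items() if len(group) > 1}

print(url_to_filename("https://www.coffeereview.com/review/moka-java/"))
# https___www_coffeereview_com_review_moka-java_.txt

# Hypothetical pair: '.' and '/' both become '_', so these two URLs collide.
print(filename_collisions(["https://example.com/a.b", "https://example.com/a/b"]))
```

An empty result from `filename_collisions(all_review_urls)` would confirm that each unique URL has its own file name.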

In [20]:
folder_path = "./coffee_reviews_text/"
scrapped_texts = os.listdir(folder_path)
print(len(scrapped_texts))

8747


In [21]:
# Check differences between reviews_from_sitemap and scrapped_texts
set_1 = set(reviews_from_sitemap)
set_2 = set(scrapped_texts)

set_1 ^ set_2

{'https___www_coffeereview_com_wp-content_uploads_2019_02_6_375x375_jpg.txt'}

This difference does not affect the analysis, because the wp-content files are excluded in the next step.
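
The symmetric difference `^` shows which names differ but not which side they come from. A minimal sketch that splits it into the two one-sided differences follows; `compare_sets` and the sample names are hypothetical.

```python
def compare_sets(expected, actual):
    """Split a symmetric difference into its two one-sided parts."""
    expected, actual = set(expected), set(actual)
    return {
        "missing_on_disk": sorted(expected - actual),      # in the sitemap, but no file was saved
        "unexpected_on_disk": sorted(actual - expected),   # file exists, but not in the sitemap
    }

# Hypothetical file names for illustration only.
result = compare_sets({"a.txt", "b.txt", "c.txt"}, {"a.txt", "b.txt", "d.txt"})
print(result)  # {'missing_on_disk': ['c.txt'], 'unexpected_on_disk': ['d.txt']}
```

Applied as `compare_sets(reviews_from_sitemap, scrapped_texts)`, this would tell whether the one differing file is missing from the folder or missing from the sitemap.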

In [22]:
# Remove 360 wp-content files
scrapped_reviews = [filename for filename in scrapped_texts if "wp-content" not in filename]
print(f"We will use {len(scrapped_reviews)} unique reviews for analysis.")

We will use 8387 unique reviews for analysis.


This number should match the number of reviews listed on the website https://www.coffeereview.com/review/ as of 03/03/2025:
$20 \times 414 + 4 = 8284$. However, we scraped more reviews ($8387 > 8284$) than the website shows.
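
The counts above can be checked with a few lines of arithmetic. The variable names are hypothetical, and reading the formula as 414 full listing pages of 20 reviews plus 4 on the last page is an assumption about what the factors mean.

```python
reviews_per_page = 20    # assumed meaning of the factor 20
full_pages = 414         # assumed meaning of the factor 414
last_page_reviews = 4    # assumed meaning of the "+ 4"

listed_on_site = reviews_per_page * full_pages + last_page_reviews
scraped_files = 8747     # len(scrapped_texts) reported above
review_files = 8387      # files left after dropping wp-content

print(listed_on_site)                 # 8284
print(scraped_files - review_files)   # 360 wp-content files
print(review_files - listed_on_site)  # 103 more files than the site lists
```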

# Read all text files and save the raw texts as a CSV for further parsing
Done by Xin on 03/03/2025

In [25]:
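# Define regex patterns
url_pattern = re.compile(r'URL:\s*(https?://\S+)')
# The start marker is matched as a literal string in the scraped text (Chinese, roughly
# '"Marketing Strategy" promotion'); it must stay verbatim for the pattern to match.
all_text_pattern = re.compile(r'“行銷攻略” 促銷活動\s*(.*?)\s*Explore Similar Coffees', re.DOTALL)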

def extract_info(text):
    """Extract URL and relevant text from the review file"""
    url = url_pattern.search(text)
    all_text = all_text_pattern.search(text)
    
    return {
        "URL": url.group(1) if url else None,
        "all_text": all_text.group(1).strip() if all_text else None,
    }

data = []
for file_name in scrapped_reviews:
    with open(os.path.join(folder_path, file_name), "r", encoding="utf-8") as file:
        text = file.read()
        extracted_info = extract_info(text)
        data.append(extracted_info)

# Store data in DataFrame
df = pd.DataFrame(data)
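
The extraction relies on a non-greedy group between two literal markers, with `re.DOTALL` so that `.` also matches newlines. The sketch below shows the same technique on made-up text. `extract_between`, the `START`/`END` markers, and the sample string are hypothetical and only stand in for the real markers used above.

```python
import re

def extract_between(text, start_marker, end_marker):
    """Return the stripped text between the first start_marker and the next end_marker, else None."""
    pattern = re.compile(
        re.escape(start_marker) + r"\s*(.*?)\s*" + re.escape(end_marker),
        re.DOTALL,  # let '.' span newlines, since one review covers many lines
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None

# Hypothetical file content and markers, for illustration only.
sample = "URL: https://example.com/review/x/\nSTART\n94\nSome Roaster\nEND\nfooter\nEND"

rows = [
    {"all_text": extract_between(sample, "START", "END")},      # stops at the first END
    {"all_text": extract_between(sample, "START", "MISSING")},  # no end marker, so None
]
print(rows[0]["all_text"])

# Count how many files produced no match; these rows would be empty in the CSV.
unmatched = sum(row["all_text"] is None for row in rows)
print(unmatched)  # 1
```

Running the same count over `data` (for example on the `"all_text"` values) is a quick way to see how many review files did not contain both markers.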

In [None]:
# check_index = 8386
# print(df["URL"][check_index])
# print(df["all_text"][check_index])

https://www.coffeereview.com/review/__trashed-5/
94
JBC Coffee Roasters
Kagunyu Kenya
Roaster Location:
Madison, Wisconsin
Coffee Origin:
Nyeri County, Kenya
Roast Level:
Medium-Light
Agtron:
60/78
Est. Price:
$22.00/12 ounces
Review Date:
February 2024
Aroma:
9
Acidity/Structure:
9
Body:
9
Flavor:
9
Aftertaste:
8
Blind Assessment
Complex, multi-layered, deep-toned. Red currant, cocoa nib, tangerine, fresh-cut oak, marjoram in aroma and cup. Bright, juicy structure with phosphoric (cola-like) acidity; crisp, syrupy mouthfeel. Resonant finish centered around notes of red currant and cocoa nib.
Notes
Produced by smallholding farmers, from trees of the SL28 and SL34 varieties of Arabica, and processed by the traditional washed method (fruit skin and pulp removed before drying) at the Kagunyu Washing Station. JBC Coffee Roasters’ vision is simple: “Let the coffee lead the way” through sourcing and roasting the best and most unique coffees available and rewarding the farmers who grow those 

In [33]:
df.to_csv("coffee_review_raw_texts.csv", index=False)

# Raw Text Parsing