<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project - Wine Recommender System <br> [Part 2 of 3]

## Contents:
- [Web-Scraping](#Web-Scraping)

---
## Web-Scraping
---

To extend the wine recommender system, we scrape additional information on wines available on Amazon (the amazon.sg storefront). With this data, the recommender can suggest wines that not only suit a user's preferences but are also readily available to buy online.

In [174]:
# Importing Libraries
import requests
from bs4 import BeautifulSoup
import time
import json
import pandas as pd

In [175]:
# Load main wines dataset
df = pd.read_pickle('../data/wine_reviews_clean.pkl')

In [176]:
# Define headers as a global variable to be accessible in both functions
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.3",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

In [177]:
# Define a function to scrape wine information from a given product URL

def scrape_wine_info(url):

    # Fetch the page; return None on any failure so the caller can log it and move on
    try:
        response = requests.get(url, headers=headers, timeout=30)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.content, "html.parser")

    # Locate the page elements that hold each field
    wine_title_tag = soup.find("h1", attrs={"id": "title"})
    wine_description_tag = soup.find("div", attrs={"id": "feature-bullets-container"})
    wine_url_tag = soup.find("link", attrs={"rel": "canonical"})
    wine_price_tag = soup.find("span", class_="a-size-medium a-color-price")
    wine_customer_rating_tag = soup.find("span", {"data-hook": "rating-out-of-text", "class": "a-size-medium a-color-base"})

    # Extract the text of each element, falling back to None when it is missing
    wine_title = wine_title_tag.text.strip() if wine_title_tag else None
    wine_description = wine_description_tag.text.strip() if wine_description_tag else None
    wine_url = wine_url_tag["href"] if wine_url_tag else None
    wine_price = wine_price_tag.text.strip() if wine_price_tag else None
    wine_customer_rating = float(wine_customer_rating_tag.text.strip().split()[0]) if wine_customer_rating_tag else None

    # Infer the variety from the description. The first substring match wins,
    # so 'Shiraz Cabernet' is listed before 'Shiraz' to keep it reachable.
    wine_varieties = ['Cabernet Sauvignon', 'Chardonnay', 'Shiraz Cabernet', 'Shiraz', 'Red Blend', 'Cabernet Merlot', 'Champagne',
                      'Merlot', 'Moscato', 'Sauvignon Blanc', 'Pinot Noir', 'Rose', 'Malbec', 'Pinot Gris', 'Riesling', 'Semillon Sauvignon',
                      'Tempranillo', 'Zinfandel', 'Syrah', 'Carmenere', 'Sangiovese', 'Chenin Blanc', 'Port', 'Prosecco', 'Pinot Grigio']
    wine_variety = None
    if wine_description:
        wine_description_lower = wine_description.lower()
        for variety in wine_varieties:
            if variety.lower() in wine_description_lower:
                wine_variety = variety
                break

    wine_info = {
        "title": wine_title,
        "description": wine_description,
        "url": wine_url,
        "price": wine_price,
        "customer_rating": wine_customer_rating,
        "variety": wine_variety
    }

    return wine_info
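
Plain substring matching can produce false positives for short variety names: 'Port' is a substring of "Portugal" and "important", for example. The sketch below shows one possible alternative, assuming whole-word matching with the standard `re` module and trying longer names first. The helper name `detect_variety` and the shortened variety list are hypothetical and not part of the notebook.

```python
import re

# Hypothetical subset of the variety list, for illustration only.
VARIETIES = ["Shiraz", "Shiraz Cabernet", "Port", "Rose", "Pinot Noir"]

def detect_variety(description, varieties=VARIETIES):
    """Return the longest variety name found as a whole word or phrase, else None."""
    if not description:
        return None
    text = description.lower()
    # Try longer names first so "Shiraz Cabernet" wins over "Shiraz".
    for variety in sorted(varieties, key=len, reverse=True):
        pattern = r"\b" + re.escape(variety.lower()) + r"\b"
        if re.search(pattern, text):
            return variety
    return None

print(detect_variety("A bold Shiraz Cabernet from the Barossa Valley"))
print(detect_variety("Imported from Portugal, an important vintage"))
```

The first call should report 'Shiraz Cabernet'; the second should report no variety, because "port" never appears as a standalone word.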

In [178]:
# Define a function to scrape wine URLs from a given Amazon search result page

def scrape_wine_urls(page_url):

    response = requests.get(page_url, headers=headers, timeout=30)
    soup = BeautifulSoup(response.content, "html.parser")

    # Each search result links to its product page from its <h2> heading
    wine_tags = soup.select("div[data-index] .a-section.a-spacing-none > h2 > a")

    # Build absolute URLs, skipping links that go through "/gp/slredirect/"
    wine_urls = ["https://www.amazon.sg" + tag["href"] for tag in wine_tags if not tag["href"].startswith("/gp/slredirect/")]
    return wine_urls
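
The URL-building step can be tried offline, without fetching a page. The following is a minimal sketch that uses `urllib.parse.urljoin` from the standard library instead of string concatenation. The helper name `to_product_urls` and both sample paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "https://www.amazon.sg"

def to_product_urls(hrefs, base=BASE):
    """Drop "/gp/slredirect/" links and turn the remaining paths into absolute URLs."""
    return [urljoin(base, href) for href in hrefs
            if not href.startswith("/gp/slredirect/")]

# Made-up hrefs shaped like the relative links on a search result page.
sample_hrefs = [
    "/Example-Wine/dp/B000000000/ref=sr_1_1",
    "/gp/slredirect/picassoRedirect.html?x=1",
]
print(to_product_urls(sample_hrefs))
```

Only the first sample path should survive the filter, prefixed with the `amazon.sg` host.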

In [179]:
base_url = "https://www.amazon.sg/s?i=grocery&bbn=8112913051&rh=n%3A8112913051%2Cn%3A6314506051%2Cn%3A6400278051%2Cn%3A6400326051&s=featured-rank&dc&ds=v1%3A3x%2FgMJzyl3vu0cmWPIrKm2NFKwdxHYHu4lg%2B55H6q6s&qid=1682265201&rnid=8112913051&ref=sr_nr_n_6&page="
all_wine_data = []

for page in range(1, 27):
    print(f"Scraping page {page}")
    page_url = base_url + str(page)
    wine_urls = scrape_wine_urls(page_url)

    for i, url in enumerate(wine_urls):
        print(f"Scraping wine {i + 1} of {len(wine_urls)} from page {page}")
        wine_info = scrape_wine_info(url)
        if wine_info:
            all_wine_data.append(wine_info)
        else:
            print(f"Failed to scrape wine info from {url}")
        time.sleep(1)  # To avoid overloading the server

    time.sleep(2)  # To avoid overloading the server

print("Scraping completed! All wine information has been scraped.")

Scraping page 1
Scraping wine 1 of 24 from page 1
Scraping wine 2 of 24 from page 1
Scraping wine 3 of 24 from page 1
Scraping wine 4 of 24 from page 1
Scraping wine 5 of 24 from page 1
Scraping wine 6 of 24 from page 1
Scraping wine 7 of 24 from page 1
Scraping wine 8 of 24 from page 1
Scraping wine 9 of 24 from page 1
Scraping wine 10 of 24 from page 1
Scraping wine 11 of 24 from page 1
Scraping wine 12 of 24 from page 1
Scraping wine 13 of 24 from page 1
Scraping wine 14 of 24 from page 1
Scraping wine 15 of 24 from page 1
Scraping wine 16 of 24 from page 1
Scraping wine 17 of 24 from page 1
Scraping wine 18 of 24 from page 1
Scraping wine 19 of 24 from page 1
Scraping wine 20 of 24 from page 1
Scraping wine 21 of 24 from page 1
Scraping wine 22 of 24 from page 1
Scraping wine 23 of 24 from page 1
Scraping wine 24 of 24 from page 1
Scraping page 2
Scraping wine 1 of 24 from page 2
Scraping wine 2 of 24 from page 2
Scraping wine 3 of 24 from page 2
Scraping wine 4 of 24 from page 2
...
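
A long scraping run like this one may hit occasional failed requests. One way to handle them, sketched below under the assumption that the scraping function returns `None` on failure, is to retry with an increasing delay. The helper `fetch_with_retries` and the `flaky_fetch` stand-in are hypothetical and not part of the notebook.

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or the retries run out."""
    for attempt in range(retries):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < retries - 1:
            # Wait longer after each failure: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None

# Stand-in for a scraping function that fails twice before succeeding.
attempts = []
def flaky_fetch(url):
    attempts.append(url)
    return {"title": "Example Wine"} if len(attempts) == 3 else None

# Pass a no-op sleep so the demonstration finishes immediately.
info = fetch_with_retries(flaky_fetch, "https://example.com/wine", sleep=lambda s: None)
print(info, len(attempts))
```

In the scraping loop, a call such as `fetch_with_retries(scrape_wine_info, url)` could replace the direct call, provided `scrape_wine_info` signals failure by returning `None`.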

In [180]:
# Save the scraped data to a file

with open("../data/amazon_wines.json", "w") as f:
    json.dump(all_wine_data, f, ensure_ascii=False, indent=4)

print("Wine data saved to amazon_wines.json")

Wine data saved to amazon_wines.json


The cells above scrape wine information from all 26 search result pages and save the results to a JSON file, `amazon_wines.json`.
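
Because each scraped field falls back to `None` when its page element is missing, it can be worth checking how complete the saved records are before using them. The sketch below counts missing values per field. The helper name `count_missing` and the two sample records (including the price format) are made up for illustration.

```python
from collections import Counter

def count_missing(records):
    """Count, per field, how many records hold None for that field."""
    missing = Counter()
    for record in records:
        for key, value in record.items():
            if value is None:
                missing[key] += 1
    return dict(missing)

# Made-up records with the same keys the scraper produces.
sample = [
    {"title": "Wine A", "price": "S$30.00", "customer_rating": 4.5, "variety": "Merlot"},
    {"title": "Wine B", "price": None, "customer_rating": None, "variety": None},
]
print(count_missing(sample))
```

Running the same function on the list loaded from `amazon_wines.json` would show how many wines lack a price, rating, or detected variety.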

In [181]:
# Load Amazon wines data
with open("../data/amazon_wines.json", "r") as f:
    amazon_wines = json.load(f)

In [182]:
# Create a function to find the best matching wine from Amazon

def find_best_amazon_wine(recommended_wine_id):

    # Get the recommended wine's attributes and variety
    recommended_wine = df.loc[df["wine_id"] == recommended_wine_id]
    if recommended_wine.empty:
        return None
    recommended_attributes = recommended_wine["attributes"].values[0].lower().split(", ")
    recommended_variety = recommended_wine["variety"].values[0].lower()

    # Keep only Amazon wines of the same variety (variety is None when none was detected)
    same_variety_wines = [wine for wine in amazon_wines if (wine["variety"] or "").lower() == recommended_variety]
    if not same_variety_wines:
        return None

    # Sort by customer rating, highest first, so ties in attribute matches go to
    # the better-rated wine; wines without a rating sort last
    same_variety_wines = sorted(same_variety_wines, key=lambda x: x["customer_rating"] or 0.0, reverse=True)

    # Pick the wine whose description mentions the most recommended attributes
    best_match = None
    max_matching_attributes = 0
    for wine in same_variety_wines:
        wine_attributes = (wine["description"] or "").lower()
        matching_attributes = [attr for attr in recommended_attributes if attr in wine_attributes]
        if len(matching_attributes) > max_matching_attributes:
            best_match = wine
            max_matching_attributes = len(matching_attributes)

    return best_match

The above function takes a recommended wine ID as input, retrieves that wine's attributes and variety, and returns the scraped Amazon wine of the same variety whose description mentions the most of those attributes. It returns `None` when no Amazon wine shares the variety or mentions any of the attributes.

The candidate wines are sorted by customer rating first, so when several wines match the same number of attributes, the highest-rated one is chosen. This function will be used later in our Streamlit app (`app.py`) to recommend wines on Amazon to users.
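
The matching logic can be traced on a few toy records. The sketch below is a standalone version that takes the attributes, variety, and candidate list as arguments instead of reading `df` and `amazon_wines`. The function name `best_match_for` and all sample wines are made up for illustration.

```python
def best_match_for(recommended_attributes, recommended_variety, candidates):
    """Return the same-variety wine whose description mentions the most attributes."""
    same_variety = [w for w in candidates
                    if (w.get("variety") or "").lower() == recommended_variety.lower()]
    if not same_variety:
        return None
    # Highest rated first, so ties in attribute count go to the better-rated wine.
    same_variety.sort(key=lambda w: w.get("customer_rating") or 0.0, reverse=True)

    best, best_count = None, 0
    for wine in same_variety:
        text = (wine.get("description") or "").lower()
        count = sum(attr in text for attr in recommended_attributes)
        if count > best_count:
            best, best_count = wine, count
    return best

# Made-up candidates; the last one mimics a page where nothing could be scraped.
candidates = [
    {"title": "Wine A", "variety": "Merlot", "customer_rating": 4.8,
     "description": "Soft tannins with plum notes"},
    {"title": "Wine B", "variety": "Merlot", "customer_rating": 4.2,
     "description": "Plum, cherry and vanilla on the finish"},
    {"title": "Wine C", "variety": "Malbec", "customer_rating": 4.9,
     "description": "Plum, cherry and vanilla"},
    {"title": "Wine D", "variety": None, "customer_rating": None, "description": None},
]

match = best_match_for(["plum", "cherry", "vanilla"], "Merlot", candidates)
print(match["title"])
```

Here Wine B is selected: Wine A is rated higher but mentions only one of the three attributes, and Wine C mentions all three but is a different variety.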

By incorporating Amazon's wine data, the recommender system can pair each suggestion with a comparable wine that is available to buy online, which should make it easier for users to act on a recommendation.