# Ceneo Scraper

## Extracting the components of a single opinion

|Component|Selector|Key|
|---------|--------|--------|
|opinion ID|["data-entry-id"]|opinion_id|
|opinion’s author|span.user-post__author-name|author|
|author’s recommendation|span.user-post__author-recomendation >  em|recommendation|
|score expressed in number of stars|span.user-post__score-count|score|
|opinion’s content|div.user-post__text|content|
|list of product advantages|div.review-feature__title.review-feature__title--positives ~ div.review-feature__item|pros|
|list of product disadvantages|div.review-feature__title.review-feature__title--negatives ~ div.review-feature__item|cons|
|how many users think that opinion was helpful|button.vote-yes > span|helpful|
|how many users think that opinion was unhelpful|button.vote-no > span|unhelpful|
|publishing date|span.user-post__published > time:nth-child(1)["datetime"]|publish_date|
|purchase date|span.user-post__published > time:nth-child(2)["datetime"]|purchase_date|
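
To see how selectors of this kind behave, the sketch below runs three of them against a small, hand-written HTML fragment. The markup is a hypothetical simplification of a Ceneo opinion, not a copy of the real page, and the pros selector is written here with the modifier class only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one opinion (not copied from Ceneo).
SAMPLE_HTML = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__author-name"> Jan </span>
  <div>
    <div class="review-feature__title review-feature__title--positives">Zalety</div>
    <div class="review-feature__item">bateria</div>
    <div class="review-feature__item">ekran</div>
  </div>
  <div>
    <div class="review-feature__title review-feature__title--negatives">Wady</div>
    <div class="review-feature__item">cena</div>
  </div>
</div>
"""

opinion = BeautifulSoup(SAMPLE_HTML, "html.parser").select_one("div.js_product-review")

# An attribute of the opinion tag itself is read by subscripting.
opinion_id = opinion["data-entry-id"]

# select_one returns the first matching descendant.
author = opinion.select_one("span.user-post__author-name").get_text().strip()

# "~" matches later siblings that share a parent with the title element,
# so the item listed under the negatives title is not included.
pros = [
    tag.get_text().strip()
    for tag in opinion.select(
        "div.review-feature__title--positives ~ div.review-feature__item"
    )
]

print(opinion_id, author, pros)
```

Classes that belong to the same element are chained with dots. If they are separated by a space, the second token is parsed as the tag name of a descendant, and the selector matches nothing.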



## Loading libraries

In [20]:
import os
import json
import requests
from bs4 import BeautifulSoup
from deep_translator import GoogleTranslator


## Function to extract data from HTML code

In [21]:
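def extract(ancestor, selector=None, attribute=None, return_list=False):
    """Return text or an attribute value from `ancestor` or its descendants.

    Returns None when a single requested tag is missing.
    """
    # Many matches: collect an attribute value or the stripped text of each tag.
    if return_list:
        if attribute:
            return [tag[attribute] for tag in ancestor.select(selector)]
        return [tag.get_text().strip() for tag in ancestor.select(selector)]

    # Single match: select_one returns None when nothing matches.
    if selector:
        if attribute:
            try:
                return ancestor.select_one(selector)[attribute]
            except TypeError:
                return None
        try:
            return ancestor.select_one(selector).get_text().strip()
        except AttributeError:
            return None

    # No selector: read directly from the ancestor tag itself.
    if attribute:
        return ancestor[attribute]
    return ancestor.get_text().strip()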

## Transformation functions

In [22]:
def rate(score):
    # Convert a score such as "4,5/5" into a fraction between 0 and 1.
    value, scale = score.split("/")
    return float(value.replace(",", ".")) / float(scale)

def recommend(recommendation):
    # "Polecam" -> True, "Nie polecam" -> False, anything else -> None.
    return {"Polecam": True, "Nie polecam": False}.get(recommendation)
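
A short worked example of the two transformations may help. The sample inputs are assumptions about what the page returns (a score like `4,5/5` and the labels `Polecam` and `Nie polecam`); the functions are restated in an equivalent form so the snippet runs on its own.

```python
def rate(score):
    # Split "value/scale", swap the decimal comma for a dot, then divide.
    value, scale = score.split("/")
    return float(value.replace(",", ".")) / float(scale)

def recommend(recommendation):
    # Unknown or missing labels fall through to None.
    return {"Polecam": True, "Nie polecam": False}.get(recommendation)

print(rate("4,5/5"))             # 4.5 divided by 5.0
print(recommend("Polecam"))
print(recommend("Nie polecam"))
print(recommend(None))           # no recommendation given
```

Dividing by the scale keeps scores comparable even if the maximum number of stars were to change.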

## Translate

In [23]:
def translate(text, from_lang="pl", to_lang="en"):
    # Return the original text together with its translation, keyed by language code.
    if text:
        if isinstance(text, list):
            return {
                from_lang: text,
                to_lang: [GoogleTranslator(source=from_lang, target=to_lang).translate(t) for t in text]
            }
        return {
            from_lang: text,
            to_lang: GoogleTranslator(source=from_lang, target=to_lang).translate(text)
        }
    return None
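
Each call to `translate` sends a request to the translation service, and the same short phrases tend to recur across opinions. One possible refinement, sketched below, is to memoise translations with `functools.lru_cache`. The names `cached_translate`, `translate_all` and `fake_backend` are hypothetical; `fake_backend` is a stand-in that merely uppercases text so the example runs offline, and in the notebook it would be replaced by a `GoogleTranslator(...).translate` call.

```python
from functools import lru_cache

calls = []  # records every text that actually reaches the backend

def fake_backend(text):
    # Stand-in for a real translation request (assumption: offline demo only).
    calls.append(text)
    return text.upper()

@lru_cache(maxsize=None)
def cached_translate(text):
    # Strings are hashable, so they can serve as cache keys.
    return fake_backend(text)

def translate_all(texts):
    # Translate a list item by item, reusing cached results for repeats.
    return [cached_translate(t) for t in texts]

result = translate_all(["dobra cena", "szybka dostawa", "dobra cena"])
print(result)
print(len(calls))  # the backend was reached only for distinct strings
```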

## Structure of a single opinion

In [24]:
selectors = {
    "opinion_id": [None, "data-entry-id"],
    "author": ["span.user-post__author-name"],
    "recommendation": ["span.user-post__author-recomendation > em"],
    "score": ["span.user-post__score-count"],
    "content": ["div.user-post__text"],
    "pros": ["div.review-feature__title.review-feature__title--positives ~ div.review-feature__item", None, True],
    "cons": ["div.review-feature__title.review-feature__title--negatives ~ div.review-feature__item", None, True],
    "helpful": ["button.vote-yes > span"],
    "unhelpful": ["button.vote-no > span"],
    "publish_date": ["span.user-post__published > time:nth-child(1)", "datetime"],
    "purchase_date": ["span.user-post__published > time:nth-child(2)", "datetime"],
}
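
Each value in `selectors` is a list of the arguments that follow `ancestor` in a call to `extract`, in the order `selector`, `attribute`, `return_list`. The sketch below uses a hypothetical stand-in, `show_args`, with the same signature to show how `*value` fills those parameters and how missing trailing entries fall back to their defaults.

```python
def show_args(ancestor, selector=None, attribute=None, return_list=False):
    # Hypothetical stand-in for extract(): it only reports what it received.
    return {"selector": selector, "attribute": attribute, "return_list": return_list}

examples = {
    # Attribute of the opinion tag itself: no selector, only an attribute.
    "opinion_id": [None, "data-entry-id"],
    # Text of a single descendant: selector only.
    "author": ["span.user-post__author-name"],
    # Text of every match (the selector here is made up for the demo).
    "pros": ["div.some-item", None, True],
}

for key, value in examples.items():
    print(key, show_args("opinion", *value))
```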


## Transformations

In [25]:
transformations = {
    "recommendation" : recommend,
    "score" : rate,
    "helpful" : int,
    "unhelpful" : int,
    "content": translate,
    "pros" : translate,
    "cons" : translate,
}
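
When a selector matches nothing, `extract` returns `None`, and passing that to `int` or `rate` raises an exception. If such gaps are expected, one option is to skip missing values when applying the transformations. The helper name `apply_transformations` and the sample record are hypothetical.

```python
def apply_transformations(record, transformations):
    # Return a copy of record with each transformation applied to non-missing values.
    result = dict(record)
    for key, transform in transformations.items():
        if result.get(key) is not None:
            result[key] = transform(result[key])
    return result

sample = {"author": "Jan", "helpful": "12", "unhelpful": None}
cleaned = apply_transformations(sample, {"helpful": int, "unhelpful": int})
print(cleaned)
```

Working on a copy leaves the raw extracted record untouched, which can help when comparing it with the transformed one.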


## URL of the first page with opinions about the product

In [26]:
product_id = "158793489"
# product_id = "158184470"
# product_id = "153373072"
# product_id = input("Please provide Ceneo.pl product code: ")
url = f"https://www.ceneo.pl/{product_id}#tab=reviews"
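
The `#tab=reviews` part of the URL is a fragment. Fragments are handled by the browser and are not sent in the HTTP request, so the server receives only the path. A quick check with the standard library's `urllib.parse`:

```python
from urllib.parse import urlsplit

product_id = "158793489"
url = f"https://www.ceneo.pl/{product_id}#tab=reviews"

parts = urlsplit(url)
print(parts.path)      # the part the server sees
print(parts.fragment)  # client-side only
```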

## Extracting all opinions from HTML code

In [27]:
all_opinions = []
while url:
    print(url)
    response = requests.get(url)
    page_dom = BeautifulSoup(response.text, "html.parser")
    opinions = page_dom.select("div.js_product-review")
    for opinion in opinions:
        # Each selectors value holds the remaining positional arguments for extract().
        single_opinion = {
            key: extract(opinion, *value)
            for key, value in selectors.items()
        }
        for key, transform in transformations.items():
            single_opinion[key] = transform(single_opinion[key])
        all_opinions.append(single_opinion)
    # extract() returns None on the last page, so the concatenation raises TypeError.
    try:
        url = "https://www.ceneo.pl" + extract(page_dom, "a.pagination__next", "href")
    except TypeError:
        url = None


https://www.ceneo.pl/158793489#tab=reviews
https://www.ceneo.pl/158793489/opinie-2
https://www.ceneo.pl/158793489/opinie-3
https://www.ceneo.pl/158793489/opinie-4
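
The printed URLs suggest that the next-page link holds a site-relative path such as `/158793489/opinie-2`. Under that assumption, `urllib.parse.urljoin` can build the absolute URL, and an explicit `None` check can stand in for the `try`/`except`. The helper name `next_page_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ceneo.pl"

def next_page_url(href):
    # href is the value of the next-page link, or None on the last page.
    if href is None:
        return None
    return urljoin(BASE_URL, href)

print(next_page_url("/158793489/opinie-2"))
print(next_page_url(None))
```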


## Saving all opinions to JSON file

In [28]:
os.makedirs("opinions", exist_ok=True)
with open(f"opinions/{product_id}.json", "w", encoding="UTF-8") as jf:
    json.dump(all_opinions, jf, indent=4, ensure_ascii=False)
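
Because the file is written with `ensure_ascii=False`, Polish characters are stored as-is rather than as `\uXXXX` escapes, so the file should be read back with the same encoding. A minimal round-trip check, using a temporary directory and a made-up record instead of the real `opinions` folder:

```python
import json
import os
import tempfile

# Made-up record containing non-ASCII characters.
sample = [{"author": "Michał", "content": {"pl": "Świetny", "en": "Great"}}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="UTF-8") as jf:
        json.dump(sample, jf, indent=4, ensure_ascii=False)
    with open(path, encoding="UTF-8") as jf:
        raw = jf.read()
    loaded = json.loads(raw)

print("Michał" in raw)   # the name is stored unescaped
print(loaded == sample)
```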