# Ceneo Scraper

## Structure of an opinion on ceneo.pl

|component|variable|selector|
|---------|--------|--------|
|opinion identifier|opinion_id|["data-entry-id"]|
|author|author|.user-post__author-name|
|recommendation|recommandation|.user-post__author-recomendation > em|
|star rating|rating|.user-post__score-count|
|opinion text|content|.user-post__text|
|list of pros|pros|.review-feature__title--positives ~ .review-feature__item|
|list of cons|cons|.review-feature__title--negatives ~ .review-feature__item|
|date the opinion was posted|opinion_date|.user-post__published > time:nth-child(1)["datetime"]|
|date the product was purchased|purchase_date|.user-post__published > time:nth-child(2)["datetime"]|
|number of people who found the opinion helpful|likes|button.vote-yes > span|
|number of people who found the opinion unhelpful|dislikes|button.vote-no > span|

In [26]:
# Each value is a tuple of arguments for extract(): (selector, attribute, return_list).
selectors = {
    'author': (".user-post__author-name",),
    'recommandation': (".user-post__author-recomendation > em",),
    'rating': (".user-post__score-count",),
    'content': (".user-post__text",),
    'pros': ("div.review-feature__title--positives ~ .review-feature__item", None, True),
    'cons': ("div.review-feature__title--negatives ~ .review-feature__item", None, True),
    'opinion_date': (".user-post__published > time:nth-child(1)", 'datetime'),
    'purchase_date': (".user-post__published > time:nth-child(2)", 'datetime'),
    'likes': ("button.vote-yes > span",),
    'dislikes': ("button.vote-no > span",),
    'opinion_id': (None, "data-entry-id"),
}
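
The tuples in `selectors` are unpacked into positional arguments when the scraping loop calls `extract(opinion, *value)`. The sketch below illustrates only that binding; `describe` is a hypothetical stand-in that reports how its arguments were bound and does not touch any HTML.

```python
def describe(selector, attribute=None, return_list=False):
    # Hypothetical stand-in with the same parameter list as extract();
    # it only reports which value ended up in which parameter.
    return {"selector": selector, "attribute": attribute, "return_list": return_list}


# A subset of the selectors dictionary, enough to show each tuple shape.
selectors = {
    "author": (".user-post__author-name",),
    "opinion_date": (".user-post__published > time:nth-child(1)", "datetime"),
    "pros": ("div.review-feature__title--positives ~ .review-feature__item", None, True),
    "opinion_id": (None, "data-entry-id"),
}

bound = {key: describe(*value) for key, value in selectors.items()}
for key, arguments in bound.items():
    print(key, arguments)
```

A one-element tuple leaves `attribute` and `return_list` at their defaults, while `None` in the first position means the value is read from the opinion element itself.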

1. Importing libraries

In [23]:
import os
import json
import requests
from bs4 import BeautifulSoup

2. Function for extracting content from an HTML page

In [21]:
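def extract(ancestor, selector, attribute=None, return_list=False):
    """Return text or an attribute value from tags found under `ancestor`.

    With `return_list` every match is returned as a list. With no selector
    the ancestor itself is read. A missing single tag yields None.
    """
    if return_list:
        if attribute:
            return [tag[attribute].strip() for tag in ancestor.select(selector)]
        return [tag.get_text().strip() for tag in ancestor.select(selector)]
    if selector:
        if attribute:
            try:
                return ancestor.select_one(selector)[attribute].strip()
            except TypeError:
                # select_one() returned None: no matching tag.
                return None
        try:
            return ancestor.select_one(selector).get_text().strip()
        except AttributeError:
            # select_one() returned None: no matching tag.
            return None
    if attribute:
        return ancestor[attribute].strip()
    return ancestor.get_text().strip()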
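
To see the four call shapes of `extract` in one place, here is a minimal sketch run against a small hand-written fragment. The markup is hypothetical and only loosely modelled on the selectors in the table, so the real Ceneo page may be structured differently. The function is repeated so the example runs on its own.

```python
from bs4 import BeautifulSoup


def extract(ancestor, selector, attribute=None, return_list=False):
    # Same logic as the notebook function, repeated for a standalone run.
    if return_list:
        if attribute:
            return [tag[attribute].strip() for tag in ancestor.select(selector)]
        return [tag.get_text().strip() for tag in ancestor.select(selector)]
    if selector:
        if attribute:
            try:
                return ancestor.select_one(selector)[attribute].strip()
            except TypeError:
                return None
        try:
            return ancestor.select_one(selector).get_text().strip()
        except AttributeError:
            return None
    if attribute:
        return ancestor[attribute].strip()
    return ancestor.get_text().strip()


# Hypothetical fragment; not copied from ceneo.pl.
html = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__author-name"> Jan </span>
  <span class="user-post__published">
    <time datetime="2021-05-01 10:00:00">a month ago</time>
  </span>
  <div class="review-feature__title--positives">Pros</div>
  <div class="review-feature__item">price</div>
  <div class="review-feature__item">quality</div>
</div>
"""
opinion = BeautifulSoup(html, "html.parser").select_one("div.js_product-review")

# Text of a single descendant.
author = extract(opinion, ".user-post__author-name")
# Attribute of the opinion element itself (no selector).
opinion_id = extract(opinion, None, "data-entry-id")
# Attribute of a single descendant.
opinion_date = extract(opinion, ".user-post__published > time:nth-child(1)", "datetime")
# Text of every matching sibling, returned as a list.
pros = extract(opinion, "div.review-feature__title--positives ~ .review-feature__item", None, True)
# A selector with no match returns None instead of raising.
content = extract(opinion, ".user-post__text")

print(author, opinion_id, opinion_date, pros, content)
```

Note that the sibling selector needs the leading dot on `.review-feature__item`; without it the name is read as a tag name and nothing matches.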

3. URL of the first page of opinions about the product

In [2]:
#product_id = '84514582'
product_id = input('Enter the product code from Ceneo.pl: ')
url = f'https://www.ceneo.pl/{product_id}#tab=reviews'

4. Fetching the opinions about the product

In [36]:
all_opinions = []
while url:
    response = requests.get(url)
    page = BeautifulSoup(response.text, 'html.parser')
    opinions = page.select('div.js_product-review')
    for opinion in opinions:
        # Each tuple in selectors supplies the remaining arguments of extract().
        single_opinion = {
            key: extract(opinion, *value)
            for key, value in selectors.items()
        }
        all_opinions.append(single_opinion)
    try:
        # extract() returns None when there is no next-page link, so the
        # concatenation raises TypeError and the loop ends.
        url = 'https://www.ceneo.pl' + extract(page, 'a.pagination__next', 'href')
    except TypeError:
        url = None
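
The loop above ends because concatenating a string with `None` raises `TypeError`. A more explicit alternative, sketched here and not part of the original notebook, checks for a missing link directly and builds the address with `urllib.parse.urljoin`. The helper name `next_page_url` and the example path are hypothetical; the sketch assumes the `href` of `a.pagination__next` is a site-relative path.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ceneo.pl"


def next_page_url(href, base=BASE_URL):
    """Return the absolute URL of the next page, or None when there is no link."""
    if not href:
        return None
    return urljoin(base, href)


# Hypothetical relative link, for illustration only.
print(next_page_url("/84514582/opinie-2"))
# No link on the last page.
print(next_page_url(None))
```

In the loop this would replace the `try`/`except` with `url = next_page_url(extract(page, 'a.pagination__next', 'href'))`.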

In [37]:
if not os.path.exists('opinions'):
    os.mkdir('opinions')
with open(f'opinions/{product_id}.json', 'w', encoding='UTF-8') as jf:
    json.dump(all_opinions, jf, indent=4, ensure_ascii=False)
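
As a quick check of the saving step, the sketch below writes a record to a temporary directory and reads it back. The sample record is made up and only mirrors a few keys of `selectors`. It shows that `ensure_ascii=False` together with `encoding='UTF-8'` keeps Polish characters readable in the file instead of turning them into `\uXXXX` escapes.

```python
import json
import os
import tempfile

# Hypothetical sample record; real scraped opinions have more fields.
sample = [{"author": "Jan", "content": "Świetna jakość", "pros": ["cena"], "cons": []}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "opinions", "demo.json")
    # Creates the directory if needed and does nothing when it already exists.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="UTF-8") as jf:
        json.dump(sample, jf, indent=4, ensure_ascii=False)
    with open(path, encoding="UTF-8") as jf:
        raw = jf.read()

loaded = json.loads(raw)
print(raw)
```

`os.makedirs(..., exist_ok=True)` is a one-line alternative to the `os.path.exists` check used above.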