# Ceneo Scraper

## Components of a single opinion

|Component|Selector|Variable|
|---------|--------|--------|
|opinion ID|[data-entry-id]|opinion_id|
|opinion’s author|span.user-post__author-name|author|
|author’s recommendation|span.user-post__author-recomendation > em|recomendation|
|score expressed in number of stars|span.user-post__score-count|stars|
|opinion’s content|div.user-post__text|content|
|list of product advantages|div.review-feature__title--positives ~ div.review-feature__item|pros|
|list of product disadvantages|div.review-feature__title--negatives ~ div.review-feature__item|cons|
|number of users who found the opinion helpful|button.vote-yes > span|helpful|
|number of users who found the opinion unhelpful|button.vote-no > span|unhelpful|
|publishing date|span.user-post__published > time:nth-child(1)[datetime]|publish_date|
|purchase date|span.user-post__published > time:nth-child(2)[datetime]|purchase_date|
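
The selectors for pros and cons in the table rely on the general sibling combinator (`~`), which matches every later sibling of the title element. The sketch below shows that behaviour on an invented HTML fragment: `sample_html` is hypothetical and only borrows the class names from the table, so it is not the real Ceneo markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the class names come from the table above,
# but the structure is invented for illustration only.
sample_html = """
<div class="review-feature__col">
  <div class="review-feature__title review-feature__title--positives">Pros</div>
  <div class="review-feature__item">battery life</div>
  <div class="review-feature__item">display</div>
</div>
<div class="review-feature__col">
  <div class="review-feature__title review-feature__title--negatives">Cons</div>
  <div class="review-feature__item">price</div>
</div>
"""

fragment = BeautifulSoup(sample_html, "html.parser")

# `A ~ B` selects every B that comes after A and shares A's parent,
# so each list only picks up the items in its own column.
pros = [
    tag.text.strip()
    for tag in fragment.select(
        "div.review-feature__title--positives ~ div.review-feature__item"
    )
]
cons = [
    tag.text.strip()
    for tag in fragment.select(
        "div.review-feature__title--negatives ~ div.review-feature__item"
    )
]

print(pros)
print(cons)
```

Because the combinator only looks at siblings, the two lists stay separate as long as each title and its items share one parent element, which is an assumption about the page layout.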

## Imports

In [None]:
import requests
from bs4 import BeautifulSoup

## Extraction function

In [None]:
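def extract_content(ancestor, selector=None, attribute=None, return_list=False):
    """Extract text or an attribute value from a tag or from its descendants."""
    if selector:
        if return_list:
            # Collect a value from every tag that matches the selector.
            if attribute:
                return [tag[attribute].strip() for tag in ancestor.select(selector)]
            return [tag.text.strip() for tag in ancestor.select(selector)]
        # A single match is expected; return None when the tag is missing.
        tag = ancestor.select_one(selector)
        if tag is None:
            return None
        if attribute:
            value = tag.get(attribute)
            return value.strip() if value is not None else None
        return tag.text.strip()
    # No selector: read directly from the ancestor tag itself.
    if attribute:
        return ancestor[attribute]
    return ancestor.text.strip()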


## Opinion structure


In [None]:
selectors = {
    # variable: (selector, attribute, return_list); trailing items are optional
    'opinion_id': (None, 'data-entry-id'),
    'author': ('span.user-post__author-name',),
    'recomendation': ('span.user-post__author-recomendation > em',),
    'stars': ('span.user-post__score-count',),
    'content': ('div.user-post__text',),
    'pros': ('div.review-feature__title--positives ~ div.review-feature__item', None, True),
    'cons': ('div.review-feature__title--negatives ~ div.review-feature__item', None, True),
    'helpful': ('button.vote-yes > span',),
    'unhelpful': ('button.vote-no > span',),
    'publish_date': ('span.user-post__published > time:nth-child(1)', 'datetime'),
    'purchase_date': ('span.user-post__published > time:nth-child(2)', 'datetime'),
}
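
Each tuple in `selectors` lines up positionally with the `selector`, `attribute` and `return_list` parameters of `extract_content`, and any omitted trailing items fall back to the defaults. The sketch below makes that mapping visible with a hypothetical stand-in, `show_call`, which only reports the arguments it receives; `demo_selectors` is a shortened copy of the dictionary for illustration.

```python
def show_call(ancestor, selector=None, attribute=None, return_list=False):
    """Hypothetical stand-in for extract_content: reports the arguments it received."""
    return {"selector": selector, "attribute": attribute, "return_list": return_list}


# Shortened copy of the selectors dictionary, for illustration only.
demo_selectors = {
    "opinion_id": (None, "data-entry-id"),
    "author": ("span.user-post__author-name",),
    "pros": (
        "div.review-feature__title--positives ~ div.review-feature__item",
        None,
        True,
    ),
}

# `*value` unpacks each tuple into positional arguments after `ancestor`.
calls = {key: show_call("opinion", *value) for key, value in demo_selectors.items()}

for key, arguments in calls.items():
    print(key, arguments)
```

Keeping the tuples short means a one-element tuple still needs its trailing comma, otherwise Python treats the parentheses as plain grouping and `*value` would unpack the string character by character.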



## Send a request to the Ceneo.pl service

In [None]:

product_id = '108290707'
url = f'https://www.ceneo.pl/{product_id}#tab=reviews'
response = requests.get(url)
response.status_code
# print(response.text)  # uncomment to inspect the raw HTML
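
The cell above only displays the status code, so a failed request would go unnoticed by the later cells. One possible way to make the failure explicit is sketched below. The helper names `build_url` and `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def build_url(product_id):
    """Build the product page URL in the same form as the cell above."""
    return f"https://www.ceneo.pl/{product_id}#tab=reviews"


def fetch_html(product_id, timeout=10):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx response.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    response = requests.get(build_url(product_id), timeout=timeout)
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("108290707")
```

Raising on an error status keeps the parsing step from silently working on an error page instead of the product page.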

## Convert the plain-text HTML into a DOM structure

In [None]:
page_dom = BeautifulSoup(response.text, "html.parser")
opinions = page_dom.select('div.js_product-review')  # every opinion on the page
opinion = page_dom.select_one('div.js_product-review')  # first opinion only


## Extract all components of a single opinion


In [None]:
single_opinion = {
    key: extract_content(opinion, *value)
        for key, value in selectors.items()
}
print(single_opinion)
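
The cell above handles only the first opinion, while `opinions` holds every match on the page. A minimal sketch of the same idea applied to each opinion in turn follows. It runs on an invented fragment (`sample_html`) and uses a simplified hypothetical helper, `text_or_none`, so that it stays self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two opinions; only a subset of the fields is shown.
sample_html = """
<div class="js_product-review" data-entry-id="101">
  <span class="user-post__author-name">anna</span>
  <div class="user-post__text">Works well.</div>
</div>
<div class="js_product-review" data-entry-id="102">
  <span class="user-post__author-name">piotr</span>
</div>
"""


def text_or_none(tag, selector):
    """Hypothetical helper: stripped text of the first match, or None if absent."""
    match = tag.select_one(selector)
    return match.text.strip() if match is not None else None


page = BeautifulSoup(sample_html, "html.parser")

# Build one dictionary per opinion, mirroring the single-opinion cell.
all_opinions = [
    {
        "opinion_id": review["data-entry-id"],
        "author": text_or_none(review, "span.user-post__author-name"),
        "content": text_or_none(review, "div.user-post__text"),
    }
    for review in page.select("div.js_product-review")
]

print(all_opinions)
```

In the notebook itself, the equivalent would presumably be a list comprehension over `opinions` that builds the same dictionary with `extract_content` and `selectors`.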