# Two-step scraper for notebookcheck.net

Not sure why, but my friend A. developed an interest in notebookcheck.net and wanted to scrape the website. I figured other new-to-scraping folks like yourself might want to come along. ;)

**Goal**  
Collect data for every reviewed device.

## setup

In [52]:
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import csv

## collect links (1st scraper)

**Goal**  
Collect the link to the review page for each of 500 reviewed devices.

In [3]:
# request webpage, save request to r
r = requests.get('https://www.notebookcheck.net/Reviews.55.0.html?&items_per_page=5000&hide_youtube=1&ns_show_num_normal=250&hide_external_reviews=1&tagArray[]=10&typeArray[]=1',
                timeout=15)

In [4]:
# create soup variable that contains all html from r
soup = BeautifulSoup(r.content, 'html.parser')
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="TYPO3 CMS" name="generator"/>
<meta content="INDEX,FOLLOW" name="ROBOTS"/>
<meta content="Our latest laptop and notebook reviews." name="description"/>
<meta content="en" name="content-language"/>
<meta content="laptop review, laptop reviews, notebook, benchmarks, tests, measurements, unbiased,notebook, laptop, review, reviews, tests, test, reports, netbook, benchmarks, graphics card, processor" name="keywords"/>
<link href="typo3temp/Assets/828466e0e7.css?1484583930" media="all" rel="stylesheet" type="text/css"/>
<link href="typo3temp/Assets/ac5d5618a8.css?1473768586" media="all" rel="stylesheet" type="text/css"/>
<link href="fileadmin/templates/nbc_v4_4/notebookcheck_new.css?1596812164" media="all" rel="stylesheet" type="text/css"/>
<link href="fileadmin/templates/js/fancybox2/source/jquery.fancybox.min.css?1419431048" media="all" rel="stylesheet" type="text/css"/>
<link href="fileadmin/templates/js/fancy

In [5]:
# get all links 'a' with class 'introa_review',
# using the BeautifulSoup find_all command
# (findAll is the older spelling of the same method)
links = soup.find_all('a', {'class': 'introa_review'})

In [6]:
# check number of links found
len(links)

500

In [8]:
# print first link
links[0]

<a class="introa_large introa_review" href="https://www.notebookcheck.net/Samsung-Galaxy-Note20-Review-Not-always-better-than-the-Note10.496540.0.html"><article><div class="introa_rm_img"><img alt="Samsung Galaxy Note20 Review - Not always better than the Note10" class="introa_img_large lazy" data-src="fileadmin/_processed_/f/b/csm_4_zu_3_Teaser_Samsung_Galaxy_Note20_8257_99e63366fd.jpg" data-src-retina="fileadmin/_processed_/f/b/csm_4_zu_3_Teaser_Samsung_Galaxy_Note20_8257_53e0deb6ef.jpg" src="fileadmin/_processed_/f/b/csm_4_zu_3_Teaser_Samsung_Galaxy_Note20_8257_99e63366fd.jpg"/></div><div class="introa_rm_text"><h2 class="introa_title"><span class="rating" style="color:#4baf4e; "><span class="average">88%</span></span> Samsung Galaxy Note20 Review - Not always better than the Note10</h2><div class="introa_rm_abstract"><b>Dimmed.</b> The Samsung Galaxy Note20 is more than $300 less than the Ultra model. Looking at the specs sheet we can spot a few cut corners, but it is not immediate

Each item in `links` holds everything between `<a` and `</a>`, while we only want the url stored in `href`.  
Let's use a for-loop over `links` and select just the `href` of every link. :)

In [11]:
# create empty list to store href's in
reviewLinks = []

# loop over all links in links
for i in links:
    # select href, save to url
    url = i['href']
    # write url to new list
    reviewLinks.append(url)

In [12]:
# check if it worked, print first 5 reviewLinks
reviewLinks[:5]

['https://www.notebookcheck.net/Samsung-Galaxy-Note20-Review-Not-always-better-than-the-Note10.496540.0.html',
 'https://www.notebookcheck.net/Xiaomi-Redmi-10X-5G-Smartphone-Review-Faster-than-a-Samsung-Galaxy-S20-Ultra.496414.0.html',
 'https://www.notebookcheck.net/Blu-G90-Smartphone-Review-USA-Cell-Phone-with-Triple-Camera.496256.0.html',
 'https://www.notebookcheck.net/LG-K51S-Smartphone-Review-Too-little-too-late.495712.0.html',
 'https://www.notebookcheck.net/Samsung-Galaxy-M21-Smartphone-Review-Plain-but-good.495326.0.html']

In [13]:
# check if it worked, print length reviewLinks
len(reviewLinks)

500
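
The same extraction also fits in a list comprehension. The sketch below is only an illustration: it runs on a small hand-written snippet (`sample_html` and its example.com urls are made up, not taken from notebookcheck.net), skips anchors that have no `href`, and drops duplicate urls while keeping their order.

```python
from bs4 import BeautifulSoup

# made-up HTML that mimics the structure of the review overview page
sample_html = """
<a class="introa_large introa_review" href="https://example.com/review-1.html">one</a>
<a class="introa_small introa_review" href="https://example.com/review-2.html">two</a>
<a class="introa_review">no href here</a>
<a class="introa_news" href="https://example.com/news.html">not a review</a>
<a class="introa_review" href="https://example.com/review-1.html">duplicate</a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches when 'introa_review' is one of the element's classes,
# href=True keeps only anchors that actually have an href attribute
hrefs = [a['href'] for a in sample_soup.find_all('a', class_='introa_review', href=True)]

# dict.fromkeys drops duplicates and preserves the original order
unique_links = list(dict.fromkeys(hrefs))
print(unique_links)
```

On the real page the for-loop above already returned 500 urls without errors, so this is a safety net rather than a necessity.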

## collect data (2nd scraper)

**Goal**  
For every device we now have a url of, get the data we want.   

Ok, how are we gonna go about this? 
...  
...  
...  
...  
...  
Of course, let's use another for loop. ;)

For every url in the list named reviewLinks:
- request the webpage;
- create a soup of said webpage;
- select the needed data from that soup;
- write that data to a csv.
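
The steps above can be sketched as a small skeleton that keeps requesting and parsing apart. Everything in it is hypothetical: `scrape`, `fake_fetch` and `fake_parse` are illustration names, and the fake functions stand in for the real request and the real soup-parsing so the sketch runs without touching the website. Writing the rows to a csv would be the job of whatever consumes the generator.

```python
def scrape(urls, fetch, parse):
    """Yield one row (a dict) for every url that could be fetched."""
    for url in urls:
        # step 1: request the webpage (fetch returns None when that fails)
        html = fetch(url)
        if html is None:
            continue
        # steps 2 and 3: make a soup and select the data (both hidden in parse)
        row = parse(html)
        row['url'] = url
        # step 4 happens outside: the caller decides where the row is written
        yield row


# stand-ins so the skeleton can run offline
fake_pages = {'https://example.com/a.html': '<p>A</p>'}

def fake_fetch(url):
    return fake_pages.get(url)

def fake_parse(html):
    return {'length': len(html)}

rows = list(scrape(['https://example.com/a.html', 'https://example.com/missing.html'],
                   fake_fetch, fake_parse))
print(rows)
```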

Let's try all of the above for 1 webpage first.

In [14]:
# request webpage, save result to r
r = requests.get(reviewLinks[0], timeout=5)
r.status_code

200

In [31]:
# create a soup of the content of the request
soup = BeautifulSoup(r.content, 'html.parser')

In [35]:
# select devicename from soup
name = soup.find('div', {'class': 'specs_header'}).text.strip()
name

'Samsung\xa0Galaxy Note20 (Galaxy Note Series)'

`\xa0` is a non-breaking space (Unicode code point U+00A0, byte 0xA0 in Latin-1/ISO 8859-1). Python shows it as an escape sequence when it displays the string. Python's built-in `unicodedata` module fixes this easily: `unicodedata.normalize('NFKD', text_string)` applies a compatibility decomposition, which turns the non-breaking space into a regular space.

In [36]:
# select devicename from soup, this time normalizing the text
name = unicodedata.normalize("NFKD", soup.find('div', {'class': 'specs_header'}).text).strip()
name

'Samsung Galaxy Note20 (Galaxy Note Series)'
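
To get a feel for what NFKD changes, here is a minimal sketch on hand-made strings (`raw` is assembled for illustration, it is not scraped output). It also shows one side effect worth knowing: NFKD splits accented letters into a base letter plus a combining mark, whereas the `'NFKC'` form puts them back together.

```python
import unicodedata

# a non-breaking space and a superscript two, like the ones on the review pages
raw = 'Samsung\xa0Galaxy 637 cd/m\u00b2'
clean = unicodedata.normalize('NFKD', raw)
print(repr(clean))

# side effect: an accented letter (here U+00E9) becomes two code points under NFKD
decomposed = unicodedata.normalize('NFKD', '\u00e9')
print(len(decomposed))

# NFKC replaces the same compatibility characters but recomposes accented letters
composed = unicodedata.normalize('NFKC', '\u00e9')
print(len(composed))
```

For the device names and measurements in this notebook either form gives the same result, so sticking with NFKD is fine.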

In [37]:
# select the auto analysis from the soup
aa = soup.find('div', {'class': 'auto_analysis'})
aa

<div class="auto_analysis" style="display:inline-block;width:218px"><div style="margin-bottom:3px;">X-Rite i1Pro 2</div>Maximum: 637 cd/m² Average: 620.6 cd/m² Minimum: 1.52 cd/m²<br/>Brightness Distribution: 95 %<br/><span style="">Center on Battery: 610 cd/m²</span><br/>Contrast: ∞:1 (Black: 0 cd/m²)<br/>ΔE Color 2.4 | 0.6-29.43 Ø5.8<br/>ΔE Greyscale 2.8 | 0.64-98 Ø6<br/>99.9% sRGB (Calman 2D) <br/>Gamma: 2.09</div>

In [38]:
# take only the text of the div and normalize it (this also turns ² into 2)
aaClean = unicodedata.normalize("NFKD", aa.text)
aaClean

'X-Rite i1Pro 2Maximum: 637 cd/m2 Average: 620.6 cd/m2 Minimum: 1.52 cd/m2Brightness Distribution: 95 %Center on Battery: 610 cd/m2Contrast: ∞:1 (Black: 0 cd/m2)ΔE Color 2.4 | 0.6-29.43 Ø5.8ΔE Greyscale 2.8 | 0.64-98 Ø699.9% sRGB (Calman 2D) Gamma: 2.09'
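
Notice how some values got glued together (`i1Pro 2Maximum`, `Ø699.9%`): the `<br/>` tags and nested `<div>` vanish when only the text is taken. If that gets in the way, BeautifulSoup's `get_text` accepts a separator. A minimal sketch on a shortened copy of the div above (`snippet` is trimmed by hand for illustration):

```python
import unicodedata
from bs4 import BeautifulSoup

# shortened copy of the auto_analysis div
snippet = ('<div class="auto_analysis"><div>X-Rite i1Pro 2</div>'
           'Maximum: 637 cd/m² Average: 620.6 cd/m²<br/>'
           'ΔE Greyscale 2.8 | 0.64-98 Ø6<br/>99.9% sRGB (Calman 2D) <br/>Gamma: 2.09</div>')
div = BeautifulSoup(snippet, 'html.parser').find('div', {'class': 'auto_analysis'})

# .text joins all pieces of text without anything in between
glued = unicodedata.normalize('NFKD', div.text)
# get_text can put a separator between the pieces and strip each of them
spaced = unicodedata.normalize('NFKD', div.get_text(separator=' ', strip=True))

print(glued)
print(spaced)
```

The regex in the next cell happens to work on the glued text as well, so this is optional here.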

In [47]:
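# use regex to find all data needed in aaClean:
# a word, a colon, one whitespace character, then a number (digits and dots)
# note: inside [...] a | is a literal character, not an "or", so the class only lists the allowed characters
matches = re.findall(r'([A-Za-z]+:\s[0-9.]+)', aaClean)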
matches

['Maximum: 637',
 'Average: 620.6',
 'Minimum: 1.52',
 'Distribution: 95',
 'Battery: 610',
 'Black: 0',
 'Gamma: 2.09']

We now have a list named `matches`, and we would like to split every item in it on the `:`. That gives us key-value pairs for a dictionary, which is easy to write to a csv.


In [51]:
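# create a list [] that contains one empty dictionary {}
result = [{}]
# for every item in matches
for m in matches:
    # split item on the first : into key and value
    key, val = m.split(":", 1)
    # if the key is already in the last dictionary, start a new dictionary
    # so the earlier value does not get overwritten
    if key in result[-1]:
        result.append({})
    # save key and stripped value to the last dictionary in the list
    result[-1][key] = val.strip()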
    
result

[{'Maximum': '637',
  'Average': '620.6',
  'Minimum': '1.52',
  'Distribution': '95',
  'Battery': '610',
  'Black': '0',
  'Gamma': '2.09'}]
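
As an aside, the split step can be folded into the regex by capturing key and value in two separate groups. This is a sketch of an alternative, not the approach used in the rest of this notebook: `aa_text` is a shortened copy of `aaClean`, the values are converted to floats, and a repeated key would simply overwrite the earlier one.

```python
import re

# shortened copy of aaClean from above
aa_text = ('X-Rite i1Pro 2Maximum: 637 cd/m2 Average: 620.6 cd/m2 Minimum: 1.52 cd/m2'
           'Brightness Distribution: 95 %Center on Battery: 610 cd/m2'
           'Contrast: ∞:1 (Black: 0 cd/m2)Gamma: 2.09')

# group 1: the word before the colon, group 2: an integer or decimal number
pattern = re.compile(r'([A-Za-z]+):\s([0-9]+(?:\.[0-9]+)?)')

# with two groups, findall returns (key, value) tuples
measurements = {key: float(val) for key, val in pattern.findall(aa_text)}
print(measurements)
```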

Now all that's left to do is to collect all the code above into one loop, and write the data to a csv...

In [65]:
# while it might sound counter-intuitive, we're starting by creating an empty csv
# this way we create a file to write the data we collect later to
# first, let's name csvColumns to be used in our csv-file
csvColumns = ['device',
              'url',
              'Maximum',
              'Average',
              'Minimum',
              'Distribution',
              'Battery',
              'Black',
              'Gamma',
              'Contrast', 
              'calibrated']

# now, let's create a csv file named 'data.csv'
# newline='' keeps the csv module from writing blank lines between rows on Windows
with open('data.csv', 'w', newline='', encoding='utf-8') as file:
    # we're going to be writing a Python dictionary to the csv, hence this writer
    # extrasaction='ignore' skips keys that are not in csvColumns instead of raising an error
    writer = csv.DictWriter(file, fieldnames=csvColumns, extrasaction='ignore')
    # write the header to the csv already
    writer.writeheader()

    # now let's loop over all urls in the list
    for i in reviewLinks:
        # request webpage, save result to r
        # a timeout or connection error raises an exception, so catch it and move on
        try:
            r = requests.get(i, timeout=5)
        except requests.RequestException:
            print('Oh no! Something went wrong requesting ' + i)
            continue
        # if the request fails (read: has not a 200 statuscode), skip this url
        if r.status_code != 200:
            print('Oh no! Something went wrong requesting ' + i)
            continue
        # create a soup of the content of the request
        soup = BeautifulSoup(r.content, 'html.parser')
        # select name from soup
        # soup.find returns None when the div is missing, and None.text raises an AttributeError
        # by catching it, our scraper continues with name set to None instead of breaking
        try:
            name = unicodedata.normalize("NFKD", soup.find('div', {'class': 'specs_header'}).text).strip()
        except AttributeError:
            name = None
        # parse auto analysis (aa) data from it
        # again, try and except so in case it doesn't work our scraper can continue
        try:
            aa = soup.find('div', {'class': 'auto_analysis'})
            aaClean = unicodedata.normalize("NFKD", aa.text)
            # use regex to select data: a word, a colon, a whitespace character, a number
            matches = re.findall(r'([A-Za-z]+:\s[0-9.]+)', aaClean)
        except AttributeError:
            # no auto analysis on this page: an empty list keeps the loop below working,
            # so the device still gets a row with its name and url
            matches = []
        # create an empty dictionary {} named result
        result = {}
        # for every item (m) in matches...
        for m in matches:
            # split item on the first : into key and value
            key, val = m.split(":", 1)
            # add both key and stripped value to dictionary result
            result[key] = val.strip()
        # add name device to dict
        result['device'] = name
        # add url to dict
        result['url'] = i
        # write result to csv
        writer.writerow(result)
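
One thing to be aware of: `csv.DictWriter` treats missing and unexpected keys differently. That matters here because `csvColumns` lists `Contrast` and `calibrated` although the regex never produces them, and other review pages might produce keys that are not in `csvColumns`. By default a missing key becomes an empty cell, while an unexpected key raises a `ValueError`. The sketch below shows both knobs on an in-memory file; the column list, the `'NA'` placeholder and the row are made up for illustration.

```python
import csv
import io

columns = ['device', 'url', 'Maximum', 'Contrast']

# write to an in-memory buffer instead of a real file
buffer = io.StringIO()
# restval fills in columns the row has no value for,
# extrasaction='ignore' drops keys that are not in fieldnames
writer = csv.DictWriter(buffer, fieldnames=columns, restval='NA', extrasaction='ignore')
writer.writeheader()
writer.writerow({'device': 'Example Phone',
                 'url': 'https://example.com/review.html',
                 'Maximum': '637',
                 'Unexpected': '1'})

# read the result back to see what ended up in the csv
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0])
```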