# Introduction

This file contains the functions I use to scrape the GeekBench website.

As an example, I've included a dataset containing the Name, URL, and Pages of Ryzen CPU products.


The steps I took to scrape Ryzen CPU benchmarks from the GeekBench website are as follows:
1. List all product names.
2. Look at the URLs of some products and find the pattern they share.
3. Create a 'base URL' for each product.
4. Find the last page number of each URL.
5. Create a container for each URL and its last page number.
6. Check the URL of the last page and find its pattern.
7. Confirm the pattern by looking at the URLs of other pages.
8. Generate URLs from the 'base URL' combined with each page number, from 1 to the last page.
9. Scrape each URL with a delay of 1.5 to 2.5 seconds and put the results into a list.
10. Parse the scraped list and extract all the data needed into a dataframe.

In [1]:
import numpy as np
import pandas as pd
import re
from time import sleep
from random import uniform

import requests
from bs4 import BeautifulSoup

# List Preparation

All data in this list was scraped from [this](https://en.wikipedia.org/wiki/List_of_AMD_Ryzen_processors) Wikipedia article. I remove unwanted quotation marks from the result before using it.

In [13]:
list_ryzen = pd.read_csv('ryzen_list.csv', header= None).stack().to_list()

# Cleaning process
compilers = re.compile('\"|\'')
clean_list_ryzen = [re.sub(compilers, '', ryzen) for ryzen in list_ryzen]
clean_list_ryzen

['Ryzen 3 1200',
 'Ryzen 3 Pro 1200',
 'Ryzen 3 Pro 1300',
 'Ryzen 3 1300X',
 'Ryzen 5 1400',
 'Ryzen 5 Pro 1500',
 'Ryzen 5 1500X',
 'Ryzen 5 1600',
 'Ryzen 5 Pro 1600',
 'Ryzen 5 1600X',
 'Ryzen 7 1700',
 'Ryzen 7 Pro 1700',
 'Ryzen 7 1700X',
 'Ryzen 7 1800X',
 'Ryzen Threadripper 1900X',
 'Ryzen Threadripper 1920X',
 'Ryzen Threadripper 1950X',
 'Ryzen 3 2200GE',
 'Ryzen 3 Pro 2200GE ',
 'Ryzen 3 2200G ',
 'Ryzen 3 Pro 2200G ',
 'Ryzen 5 2400GE ',
 'Ryzen 5 Pro 2400GE ',
 'Ryzen 5 2400G ',
 'Ryzen 5 Pro 2400G',
 'Ryzen 3 2300X',
 'Ryzen 5 2500X',
 'Ryzen 5 2600',
 'Ryzen 5 2600X',
 'Ryzen 7 2700E',
 'Ryzen 7 2700',
 'Ryzen 7 Pro 2700',
 'Ryzen 7 Pro 2700X',
 'Ryzen 7 2700X',
 'Ryzen Threadripper 2920X',
 'Ryzen Threadripper 2950X',
 'Ryzen Threadripper 2970WX',
 'Ryzen Threadripper 2990WX',
 'Ryzen 3 Pro 3200GE',
 'Ryzen 3 3200G',
 'Ryzen 3 Pro 3200G',
 'Ryzen 5 Pro 3350GE',
 'Ryzen 5 Pro 3350G',
 'Ryzen 5 Pro 3400GE',
 'Ryzen 5 3400G',
 'Ryzen 5 Pro 3400G',
 'Ryzen 3 4300G',
 'Ryze

# Compiling Base URL of Each Product 

The steps to compile:
1. Prepare the base URL from geekbench.com.
2. Use a list comprehension to transform each Ryzen product name and append it to the base URL.


I searched for some product names in the search box and compared the resulting URLs. I found a basic pattern (stored in the URL_NAME variable).
The search URL is built by first replacing the whitespace in each CPU name with a plus sign and then appending the result to the base URL.
    

    https://browser.geekbench.com/search?q=[name+of+product+and+series]
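
The substitution described above can also be sketched with the standard library's `urllib.parse.quote_plus`, which replaces spaces with plus signs and percent-encodes any other unsafe characters. The helper name `build_search_url` is hypothetical and not part of the original notebook. Stripping surrounding whitespace first is an added assumption, meant to avoid a trailing `+` for names such as `'Ryzen 3 2200G '`.

```python
from urllib.parse import quote_plus

# Base search URL, the same value as URL_NAME in this notebook
BASE_SEARCH_URL = 'https://browser.geekbench.com/search?q='

def build_search_url(cpu_name):
    """Hypothetical helper: turn a CPU name into a search URL."""
    # strip() drops leading/trailing whitespace so no stray '+' is appended
    return BASE_SEARCH_URL + quote_plus(cpu_name.strip())

print(build_search_url('Ryzen 3 Pro 2200GE '))
```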

In [15]:
# Base URL
URL_NAME = 'https://browser.geekbench.com/search?q='

# Searching URL
ryzen_url = [URL_NAME + re.sub(r'\s', '+', element) for element in clean_list_ryzen]
ryzen_url

['https://browser.geekbench.com/search?q=Ryzen+3+1200',
 'https://browser.geekbench.com/search?q=Ryzen+3+Pro+1200',
 'https://browser.geekbench.com/search?q=Ryzen+3+Pro+1300',
 'https://browser.geekbench.com/search?q=Ryzen+3+1300X',
 'https://browser.geekbench.com/search?q=Ryzen+5+1400',
 'https://browser.geekbench.com/search?q=Ryzen+5+Pro+1500',
 'https://browser.geekbench.com/search?q=Ryzen+5+1500X',
 'https://browser.geekbench.com/search?q=Ryzen+5+1600',
 'https://browser.geekbench.com/search?q=Ryzen+5+Pro+1600',
 'https://browser.geekbench.com/search?q=Ryzen+5+1600X',
 'https://browser.geekbench.com/search?q=Ryzen+7+1700',
 'https://browser.geekbench.com/search?q=Ryzen+7+Pro+1700',
 'https://browser.geekbench.com/search?q=Ryzen+7+1700X',
 'https://browser.geekbench.com/search?q=Ryzen+7+1800X',
 'https://browser.geekbench.com/search?q=Ryzen+Threadripper+1900X',
 'https://browser.geekbench.com/search?q=Ryzen+Threadripper+1920X',
 'https://browser.geekbench.com/search?q=Ryzen+Threadri

# Search for the Last Page

To get the last page number of each URL, I first checked the HTML of some of the listed URLs. I found that every link that returns data has the tag \<div class = "col-12 list-col"\>. Any link without data (no results found on the server) does not have this tag.

For the URLs with data, I inspected the navigation section (the numbers and arrows at the bottom of the page). I found a pattern: URLs with more than one page have \<a class = 'page-link'\> tags, and the last page number is the second-to-last element of the resulting list. URLs with only one page don't have this tag.
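
The decision logic described above can be sketched independently of any HTTP request. The function name `last_page_number` and the sample link texts below are hypothetical, made up for illustration. Only the rule itself (no results, a single page, or the second-to-last pagination link) comes from the text.

```python
def last_page_number(has_results, nav_texts):
    """Hypothetical helper mirroring the rule described in the text.

    has_results: True if the results container tag was found on the page.
    nav_texts: texts of the pagination links, in document order.
    """
    if not has_results:
        return 0  # nothing found for this search
    if not nav_texts:
        return 1  # results exist but there is no pagination
    # The final link is assumed to be the "next" arrow,
    # so the last page number is the second-to-last element
    return int(nav_texts[-2].strip())

# Made-up pagination labels, only to exercise the three cases
print(last_page_number(False, []))
print(last_page_number(True, []))
print(last_page_number(True, ['Prev', '1', '2', '3', '40', 'Next']))
```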


In [16]:
def find_max_page(url_list):
    '''Find the last page number of each search URL.
    The result is a dictionary mapping each URL that has results to its last page number.'''
    pages = []

    # Check pages
    for url in url_list:
        sleep(uniform(1.5, 2.5))
        html_page = requests.get(url).content
        soup = BeautifulSoup(html_page, 'html.parser')
        # Check whether the page has any content
        if soup.find('div', class_= 'col-12 list-col') is None:
            pages.append(0)
        else:
            # Check whether the results span a single page or multiple pages
            if soup.find('a', class_= 'page-link') is None:
                pages.append(1)
            else:
                # The last page number is the second-to-last pagination link
                nav_page = int([nav.text for nav in soup.find_all('a', class_= 'page-link')][-2])
                pages.append(nav_page)
    
    # Keep only the URLs that returned results
    url_dict = {url : page for url, page in zip(url_list, pages) if page != 0}
    return url_dict

The result is a dictionary whose keys are the URLs that have results and whose values are their last page numbers.

# Generate Page URL For Each Link

The next step uses the dictionary from the previous step to create the list of URLs for the final scrape.

In [17]:
def ryzen_geekbench_url_pages(url_dict):
    '''Create the URL of every page for each URL in the dictionary.
    The result is a list that can be used directly for scraping.'''
    # Declaring URL container
    url_list = []  
    
    # Using a for loop on the dictionary...
    for url in url_dict:
        # ... we turn 'search?' into a '{}' placeholder, based on the pattern observed on the GeekBench website...
        change_url = re.sub(r'search\?', '{}', url)

        # ... and insert the new pattern.
        for num in range(1, url_dict[url] + 1):
            # num runs over all page numbers of the respective URL.
            # url_dict[url] (the 'Pages' value) is the last page of the respective URL.
            add_url = f'search?page={num}&'
            # .format() fills the '{}' placeholder with the paginated prefix.
            new_url = change_url.format(add_url)
            url_list.append(new_url)
    
    return url_list
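
As a sanity check on the URL pattern, the same expansion can be written for a single URL with `str.replace`. This is a minimal sketch, not the notebook's method: the name `expand_pages` is hypothetical, and it assumes `search?` appears exactly once in the URL.

```python
def expand_pages(url, last_page):
    """Hypothetical helper: list the URL of every page from 1 to last_page."""
    return [
        # Insert the page parameter right after 'search?' (first match only)
        url.replace('search?', f'search?page={num}&', 1)
        for num in range(1, last_page + 1)
    ]

for page_url in expand_pages('https://browser.geekbench.com/search?q=Ryzen+3+1200', 2):
    print(page_url)
```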

# Scraping URL 

Getting the Response object for each URL is the most time-consuming step, especially when there are many URLs to process. To avoid putting too much load on the web server, and to keep a harmless scrape from resembling a DDoS attack, I add a delay with the sleep function from the time module. The uniform function from the random module randomizes the length of the delay.

In [4]:
def scraping_function(url_lists):
    '''Scrape every URL in the URL list.
    The result is a list of the contents of the Response objects.'''
    scrape_result = []
    for url in url_lists:
        sleep(uniform(1.5, 2.5))
        scrape = requests.get(url).content
        scrape_result.append(scrape)
    return scrape_result
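
One way to try out the delay logic without touching the network is to pass the fetching and sleeping steps in as arguments. The sketch below is an assumption layered on top of `scraping_function`, not part of the original notebook. The names `fetch_with_delay`, `fake_fetch`, and `record_delay` are hypothetical.

```python
from random import uniform
from time import sleep

def fetch_with_delay(urls, fetch, min_delay=1.5, max_delay=2.5, sleeper=sleep):
    """Call fetch(url) for each URL, pausing a random interval before each call."""
    results = []
    for url in urls:
        # Random pause so the requests are spread out rather than sent in a burst
        sleeper(uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Stand-ins that avoid real requests and real waiting
delays = []

def fake_fetch(url):
    return f'content of {url}'

def record_delay(seconds):
    delays.append(seconds)

pages = fetch_with_delay(['url-a', 'url-b'], fake_fetch, sleeper=record_delay)
print(pages)
print(all(1.5 <= d <= 2.5 for d in delays))
```

In real use, `fetch` could be something like `lambda url: requests.get(url).content`, with the default `sleeper` left in place.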

# Parsing And Extracting Data From Response Object

The following function does three things:
1. Parses the content of each Response object,
2. Extracts the data from it, and
3. Creates a dataframe containing all the extracted data.

In [6]:
def page_extractor(scrape_list):
    '''Parse the scraped results and extract the data into a dataframe.'''

    cpu_name =[]
    time = []
    platform = []
    single_core = []
    multi_core = []

    for scraped in scrape_list:
        soup = BeautifulSoup(scraped, 'html.parser')
           
        # CPU Name data  
        cpu_data = [cpu.find('span', class_= 'list-col-model')
                        .text.strip().replace('\n', ' ') 
                        for cpu in 
                        soup.find_all('div', class_= 'col-12 col-lg-4')]

        cpu_name += cpu_data  
      
        # Date/Time data
        datetimes = [date_time.text 
                         for date_time 
                         in soup.find_all('span', class_= 'timestamp-to-local-short')]

        time += datetimes

        # OS Name data
        platform_name = [platform.text.strip().replace('\n', ' ') 
                             for platform 
                             in soup.find_all('span', class_= 'list-col-text')]

        platforms = [platform_name[n] for n in range(0, len(platform_name)) if n % 2 == 1]

        platform += platforms
      
        # Benchmark data
        # Both single-core and multi-core scores use the same tag.
        # Thus, when scraping the tag, both values are extracted in the pattern:
        # [single-core, multi-core, single-core, multi-core, ...]
        core_score = soup.find_all('span', class_= 'list-col-text-score')
      
        # Single-core benchmark data
        # All single-core scores are at even Python indices ([0, 2, 4, 6, ...])
        single_core_data = [core_score[n].text.strip() 
                                for n in range(0, len(core_score)) if n % 2 == 0]
        
        single_core += single_core_data

        # Multi-core benchmark data
        # All multi-core scores are at odd Python indices ([1, 3, 5, 7, ...])
        multi_core_data = [core_score[n].text.strip() 
                               for n in range(0, len(core_score)) if n % 2 == 1]
        
        multi_core += multi_core_data

    # Create dataframe:
    data_lib = {                
                'CPU Name'         : cpu_name,
                'Upload Date'      : time, 
                'Platform Name'    : platform,
                'Single-core Score': single_core,
                'Multi-core Score' : multi_core
                }
    
    geekbench_data = pd.DataFrame(data_lib).drop_duplicates()
    
    # Ensure that each column in the dataframe has the correct dtype:
    newtype = {
        'CPU Name': 'string',
        'Upload Date': 'datetime64[ns]',
        'Platform Name': 'string',
        'Single-core Score': 'int',
        'Multi-core Score': 'int'
    }
    
    geekbench_data = geekbench_data.astype(newtype)
   
    return geekbench_data
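
The even/odd index split used for the scores can also be expressed with extended slices. The sketch below uses made-up score strings and a hypothetical helper name, `split_scores`. It assumes, as the comments in `page_extractor` do, that the values alternate single-core, multi-core.

```python
def split_scores(scores):
    """Hypothetical helper: split an alternating list into two lists."""
    single_core = scores[0::2]  # even indices: 0, 2, 4, ...
    multi_core = scores[1::2]   # odd indices: 1, 3, 5, ...
    return single_core, multi_core

# Made-up values in the order [single, multi, single, multi]
single, multi = split_scores(['900', '3000', '950', '3200'])
print(single)
print(multi)
```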

        