## Scraping scrollers

Infinite scroll sites are designed for the mobile age. Links are hard to tap with a finger on a small device, but a simple swipe scrolls the page down to reveal more data. That can make scraping an infinite scroll page difficult, because the data arrives in batches as you scroll instead of sitting in the initial HTML. We’ll learn to find the actual location of the data buried in the scrolls.

Here are a couple of examples of scrolling sites:

- <a href="https://www.difc.ae/public-register/">DIFC Public Register</a>

- <a href="https://www.quintoandar.com.br/alugar/imovel/sao-paulo-sp-brasil">Rentals in São Paulo</a>

Let's target the data source we'll need to scrape this <a href="https://quotes.toscrape.com/scroll">mockup site</a>.







In [1]:
## Let's import all the libraries we are likely to need
import requests ## to capture content from web pages
from bs4 import BeautifulSoup ## to parse our scraped data
import pandas as pd ## to easily export our data to dataframes/CSVs
# from icecream import ic ## easily debug
# from pprint import pprint as pp ## to prettify our printouts
import itertools ## to flatten lists
from random import randrange ## to pick a random number from a range
import time ## to pause between requests
import json ## to work with JSON data

### Figure out how to scrape a single page

In [2]:
url = "https://quotes.toscrape.com/api/quotes?page=1"

In [7]:
response = requests.get(url)
# response.status_code
type(response.text)
response.text

'{"has_next":true,"page":1,"quotes":[{"author":{"goodreads_link":"/author/show/9810.Albert_Einstein","name":"Albert Einstein","slug":"Albert-Einstein"},"tags":["change","deep-thoughts","thinking","world"],"text":"\\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\\u201d"},{"author":{"goodreads_link":"/author/show/1077326.J_K_Rowling","name":"J.K. Rowling","slug":"J-K-Rowling"},"tags":["abilities","choices"],"text":"\\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\\u201d"},{"author":{"goodreads_link":"/author/show/9810.Albert_Einstein","name":"Albert Einstein","slug":"Albert-Einstein"},"tags":["inspirational","life","live","miracle","miracles"],"text":"\\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\\u201d"},{"author":{"goodreads_link":"/author/show/1265.Jane_Austen","name":"Jane Austen","
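Before parsing, it can help to confirm that the endpoint really returned JSON and not an error page. The sketch below is one way to do that, not part of the original notebook: `safe_json` and `FakeResponse` are hypothetical names, and `FakeResponse` only imitates the `status_code`, `headers` and `json()` attributes of a `requests` response so the example runs offline.

```python
class FakeResponse:
    """Offline stand-in for a requests response (hypothetical helper)."""

    def __init__(self, status_code, headers, payload):
        self.status_code = status_code
        self.headers = headers
        self._payload = payload

    def json(self):
        return self._payload


def safe_json(response):
    """Return the parsed JSON body, or None if the response does not look like JSON."""
    if response.status_code != 200:
        return None
    content_type = response.headers.get("Content-Type", "")
    if "application/json" not in content_type:
        return None
    return response.json()


ok = FakeResponse(200, {"Content-Type": "application/json"}, {"page": 1})
blocked = FakeResponse(403, {"Content-Type": "text/html"}, None)

print(safe_json(ok))       # {'page': 1}
print(safe_json(blocked))  # None
```

With a live request, the same function should accept the object returned by `requests.get(url)`.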

In [8]:
## method 1 - parse the raw text ourselves with the json module
json.loads(response.text)

{'has_next': True,
 'page': 1,
 'quotes': [{'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
   'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
  {'author': {'goodreads_link': '/author/show/1077326.J_K_Rowling',
    'name': 'J.K. Rowling',
    'slug': 'J-K-Rowling'},
   'tags': ['abilities', 'choices'],
   'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
  {'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
   'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'},
  {'author': {'goodreads_

In [9]:
type(json.loads(response.text))

dict

In [11]:
## method 2 - part of requests lib

content = response.json()
content

{'has_next': True,
 'page': 1,
 'quotes': [{'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
   'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
  {'author': {'goodreads_link': '/author/show/1077326.J_K_Rowling',
    'name': 'J.K. Rowling',
    'slug': 'J-K-Rowling'},
   'tags': ['abilities', 'choices'],
   'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
  {'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
   'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'},
  {'author': {'goodreads_

In [12]:
type(content)

dict
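Both methods turn a JSON string into ordinary Python objects. The offline sketch below shows the conversion on a small made-up string shaped like the API response (the author and quote are invented for illustration).

```python
import json

# Made-up sample shaped like the API response; not real data from the site
raw = (
    '{"has_next": true, "page": 1, "quotes": ['
    '{"author": {"name": "Jane Doe"}, "tags": ["sample"], "text": "A made-up quote."}'
    "]}"
)

parsed = json.loads(raw)

print(type(parsed))            # <class 'dict'>
print(parsed["has_next"])      # True (JSON true becomes Python True)
print(type(parsed["quotes"]))  # <class 'list'>
```

JSON objects become dictionaries, arrays become lists, and `true`/`false` become `True`/`False`.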

### Accessing values in a dictionary

In [13]:
## run this cell
animals = [{"rank": 1, 'animal': 'Blue whale', 'weight': 136000, 'animal_type': 'Marine'},
 {"rank": 2, 'animal': 'Bowhead whale', 'weight': 100000, 'animal_type': 'Marine'},
 {"rank": 3, 'animal': 'Fin whale', 'weight': 70000, 'animal_type': 'Marine'},
 {"rank": 4, 'animal': 'Southern right whale', 'weight': 45000, 'animal_type': 'Marine'},
 {"rank": 5, 'animal': 'Humpback whale', 'weight': 30000, 'animal_type': 'Marine'},
 {"rank": 6, 'animal': 'Gray whale', 'weight': 28500, 'animal_type': 'Marine'},
 {"rank": 7, 'animal': 'Northern right whale', 'weight': 23000, 'animal_type': 'Marine'},
 {"rank": 8, 'animal': 'Sei whale', 'weight': 20000, 'animal_type': 'Marine'},
 {"rank": 9, 'animal': "Bryde's whale", 'weight': 16000, 'animal_type': 'Marine'},
 {"rank": 10,'animal': "Baird's beaked whale", 'weight': 11380, 'animal_type': 'Marine'}]

In [14]:
animals[0]

{'rank': 1, 'animal': 'Blue whale', 'weight': 136000, 'animal_type': 'Marine'}

In [16]:
animals[0].get("weight")

136000
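A short sketch of the difference between square-bracket access and `.get()`, using a cut-down copy of the list above. The `habitat` key is deliberately absent to show what happens with a missing key.

```python
animals_sample = [
    {"rank": 1, "animal": "Blue whale", "weight": 136000},
    {"rank": 2, "animal": "Bowhead whale", "weight": 100000},
]

first = animals_sample[0]

print(first["weight"])                  # 136000
print(first.get("weight"))              # 136000
print(first.get("habitat"))             # None (no KeyError for a missing key)
print(first.get("habitat", "unknown"))  # unknown (custom default)

# Pull one field out of every dictionary in the list
weights = [animal.get("weight") for animal in animals_sample]
print(weights)                          # [136000, 100000]
```

`first["habitat"]` would raise a `KeyError`, which is why `.get()` is the safer choice when a scraped record may be missing a field.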

In [17]:
## call our content
content

{'has_next': True,
 'page': 1,
 'quotes': [{'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
   'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
  {'author': {'goodreads_link': '/author/show/1077326.J_K_Rowling',
    'name': 'J.K. Rowling',
    'slug': 'J-K-Rowling'},
   'tags': ['abilities', 'choices'],
   'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
  {'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
   'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'},
  {'author': {'goodreads_

In [18]:
quotes_ls = content.get("quotes")
quotes_ls

[{'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
   'name': 'Albert Einstein',
   'slug': 'Albert-Einstein'},
  'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
  'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
 {'author': {'goodreads_link': '/author/show/1077326.J_K_Rowling',
   'name': 'J.K. Rowling',
   'slug': 'J-K-Rowling'},
  'tags': ['abilities', 'choices'],
  'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
 {'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
   'name': 'Albert Einstein',
   'slug': 'Albert-Einstein'},
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'},
 {'author': {'goodreads_link': '/author/show/1265.Jane_Austen',
   'name': 'Jane 

In [19]:
type(quotes_ls)

list

In [20]:
len(quotes_ls)

10

In [21]:
df = pd.DataFrame(quotes_ls)
df

Unnamed: 0,author,tags,text
0,{'goodreads_link': '/author/show/9810.Albert_E...,"[change, deep-thoughts, thinking, world]",“The world as we have created it is a process ...
1,{'goodreads_link': '/author/show/1077326.J_K_R...,"[abilities, choices]","“It is our choices, Harry, that show what we t..."
2,{'goodreads_link': '/author/show/9810.Albert_E...,"[inspirational, life, live, miracle, miracles]",“There are only two ways to live your life. On...
3,{'goodreads_link': '/author/show/1265.Jane_Aus...,"[aliteracy, books, classic, humor]","“The person, be it gentleman or lady, who has ..."
4,{'goodreads_link': '/author/show/82952.Marilyn...,"[be-yourself, inspirational]","“Imperfection is beauty, madness is genius and..."
5,{'goodreads_link': '/author/show/9810.Albert_E...,"[adulthood, success, value]",“Try not to become a man of success. Rather be...
6,{'goodreads_link': '/author/show/7617.Andr_Gid...,"[life, love]",“It is better to be hated for what you are tha...
7,{'goodreads_link': '/author/show/3091287.Thoma...,"[edison, failure, inspirational, paraphrased]","“I have not failed. I've just found 10,000 way..."
8,{'goodreads_link': '/author/show/44566.Eleanor...,[misattributed-eleanor-roosevelt],“A woman is like a tea bag; you never know how...
9,{'goodreads_link': '/author/show/7103.Steve_Ma...,"[humor, obvious, simile]","“A day without sunshine is like, you know, nig..."
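The `author` column above still holds whole dictionaries. As an alternative sketch, pandas' own `json_normalize` can expand nested dictionaries into separate columns; note that it leaves lists such as `tags` untouched. The sample records here are made up for illustration.

```python
import pandas as pd

# Made-up records with the same shape as the API's quotes
sample_quotes = [
    {"author": {"name": "Jane Doe", "slug": "Jane-Doe"},
     "tags": ["sample", "demo"],
     "text": "A made-up quote."},
    {"author": {"name": "John Roe", "slug": "John-Roe"},
     "tags": ["demo"],
     "text": "Another made-up quote."},
]

# sep="_" joins parent and child keys, e.g. author + name -> author_name
flat_df = pd.json_normalize(sample_quotes, sep="_")

print(sorted(flat_df.columns))  # ['author_name', 'author_slug', 'tags', 'text']
```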


In [22]:
%pip install cherrypicker

Note: you may need to restart the kernel to use updated packages.


In [23]:
from cherrypicker import CherryPicker

In [24]:
picker = CherryPicker(quotes_ls)
best_quotes = picker.flatten().get()

In [30]:
type(picker)

cherrypicker.traversable.CherryPickerIterable

In [26]:
quotes_ls

[{'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
   'name': 'Albert Einstein',
   'slug': 'Albert-Einstein'},
  'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
  'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
 {'author': {'goodreads_link': '/author/show/1077326.J_K_Rowling',
   'name': 'J.K. Rowling',
   'slug': 'J-K-Rowling'},
  'tags': ['abilities', 'choices'],
  'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
 {'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
   'name': 'Albert Einstein',
   'slug': 'Albert-Einstein'},
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'},
 {'author': {'goodreads_link': '/author/show/1265.Jane_Austen',
   'name': 'Jane 

In [25]:
best_quotes

[{'author_goodreads_link': '/author/show/9810.Albert_Einstein',
  'author_name': 'Albert Einstein',
  'author_slug': 'Albert-Einstein',
  'tags_0': 'change',
  'tags_1': 'deep-thoughts',
  'tags_2': 'thinking',
  'tags_3': 'world',
  'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
 {'author_goodreads_link': '/author/show/1077326.J_K_Rowling',
  'author_name': 'J.K. Rowling',
  'author_slug': 'J-K-Rowling',
  'tags_0': 'abilities',
  'tags_1': 'choices',
  'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
 {'author_goodreads_link': '/author/show/9810.Albert_Einstein',
  'author_name': 'Albert Einstein',
  'author_slug': 'Albert-Einstein',
  'tags_0': 'inspirational',
  'tags_1': 'life',
  'tags_2': 'live',
  'tags_3': 'miracle',
  'tags_4': 'miracles',
  'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as 

In [27]:
df = pd.DataFrame(best_quotes)
df

Unnamed: 0,author_goodreads_link,author_name,author_slug,tags_0,tags_1,tags_2,tags_3,text,tags_4
0,/author/show/9810.Albert_Einstein,Albert Einstein,Albert-Einstein,change,deep-thoughts,thinking,world,“The world as we have created it is a process ...,
1,/author/show/1077326.J_K_Rowling,J.K. Rowling,J-K-Rowling,abilities,choices,,,"“It is our choices, Harry, that show what we t...",
2,/author/show/9810.Albert_Einstein,Albert Einstein,Albert-Einstein,inspirational,life,live,miracle,“There are only two ways to live your life. On...,miracles
3,/author/show/1265.Jane_Austen,Jane Austen,Jane-Austen,aliteracy,books,classic,humor,"“The person, be it gentleman or lady, who has ...",
4,/author/show/82952.Marilyn_Monroe,Marilyn Monroe,Marilyn-Monroe,be-yourself,inspirational,,,"“Imperfection is beauty, madness is genius and...",
5,/author/show/9810.Albert_Einstein,Albert Einstein,Albert-Einstein,adulthood,success,value,,“Try not to become a man of success. Rather be...,
6,/author/show/7617.Andr_Gide,André Gide,Andre-Gide,life,love,,,“It is better to be hated for what you are tha...,
7,/author/show/3091287.Thomas_A_Edison,Thomas A. Edison,Thomas-A-Edison,edison,failure,inspirational,paraphrased,"“I have not failed. I've just found 10,000 way...",
8,/author/show/44566.Eleanor_Roosevelt,Eleanor Roosevelt,Eleanor-Roosevelt,misattributed-eleanor-roosevelt,,,,“A woman is like a tea bag; you never know how...,
9,/author/show/7103.Steve_Martin,Steve Martin,Steve-Martin,humor,obvious,simile,,"“A day without sunshine is like, you know, nig...",
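To see where column names like `author_name` and `tags_0` come from, here is a minimal pure-Python sketch of the flattening idea. It is an illustration only, not CherryPicker's actual implementation, and `flatten_record` is a hypothetical name.

```python
def flatten_record(item, parent_key="", sep="_"):
    """Recursively flatten nested dicts and lists into a single-level dict."""
    if isinstance(item, dict):
        pairs = item.items()
    elif isinstance(item, list):
        pairs = enumerate(item)  # list positions become key suffixes: 0, 1, 2...
    else:
        return {parent_key: item}  # a plain value: nothing left to flatten

    flat = {}
    for key, value in pairs:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten_record(value, new_key, sep))
    return flat


# Made-up record with the same shape as one quote from the API
record = {
    "author": {"name": "Jane Doe", "slug": "Jane-Doe"},
    "tags": ["sample", "demo"],
    "text": "A made-up quote.",
}

print(flatten_record(record))
# {'author_name': 'Jane Doe', 'author_slug': 'Jane-Doe', 'tags_0': 'sample', 'tags_1': 'demo', 'text': 'A made-up quote.'}
```

Because each quote has a different number of tags, the flattened records have different keys, which is why the DataFrame fills the missing tag columns with empty values.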


In [29]:
content

{'has_next': True,
 'page': 1,
 'quotes': [{'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
   'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
  {'author': {'goodreads_link': '/author/show/1077326.J_K_Rowling',
    'name': 'J.K. Rowling',
    'slug': 'J-K-Rowling'},
   'tags': ['abilities', 'choices'],
   'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
  {'author': {'goodreads_link': '/author/show/9810.Albert_Einstein',
    'name': 'Albert Einstein',
    'slug': 'Albert-Einstein'},
   'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
   'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'},
  {'author': {'goodreads_

## Multipage scrape

In [28]:
## URL template: the {} placeholder is filled with the page number
url = "https://quotes.toscrape.com/api/quotes?page={}"

In [35]:
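quotes_list = []  # one list of flattened quotes per page
page_number = 1
valid = 0  # flips to 1 once the API reports there is no next page

while valid < 1:
    link = url.format(page_number)  # fill the placeholder with the page number
    page_number += 1
    print(link)
    response = requests.get(link)
    content = response.json()  # parse the JSON body into a dict
    target = content.get("quotes")
    picker = CherryPicker(target)
    best_quotes = picker.flatten().get()  # flatten nested author/tags fields
    quotes_list.append(best_quotes)
    snoozer = randrange(4,7)  # random pause of 4 to 6 seconds
    print(f"snoozing for {snoozer} seconds")
    time.sleep(snoozer)
    if not content.get("has_next"):  # stop when has_next is False or missing
        valid += 1

print("done scraping")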

https://quotes.toscrape.com/api/quotes?page=1
snoozing for 4 seconds
https://quotes.toscrape.com/api/quotes?page=2
snoozing for 6 seconds
https://quotes.toscrape.com/api/quotes?page=3
snoozing for 5 seconds
https://quotes.toscrape.com/api/quotes?page=4
snoozing for 4 seconds
https://quotes.toscrape.com/api/quotes?page=5
snoozing for 6 seconds
https://quotes.toscrape.com/api/quotes?page=6
snoozing for 4 seconds
https://quotes.toscrape.com/api/quotes?page=7
snoozing for 6 seconds
https://quotes.toscrape.com/api/quotes?page=8
snoozing for 6 seconds
https://quotes.toscrape.com/api/quotes?page=9
snoozing for 6 seconds
https://quotes.toscrape.com/api/quotes?page=10
snoozing for 6 seconds
done scraping
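The pagination logic can also be written as a function that stops as soon as `has_next` is false and caps the number of pages as a safety net. The sketch below is a variation under assumptions, not the notebook's method: `scrape_all`, `max_pages` and `FAKE_PAGES` are hypothetical names, and a dictionary of fake pages stands in for the API so it runs offline.

```python
def scrape_all(fetch_page, max_pages=50):
    """Collect quotes page by page until the API reports no next page."""
    all_quotes = []
    for page_number in range(1, max_pages + 1):
        content = fetch_page(page_number)
        # extend() keeps one flat list, so no flattening step is needed later
        all_quotes.extend(content.get("quotes", []))
        if not content.get("has_next"):
            break
    return all_quotes


# Offline stand-in for the API: three fake pages of two quotes each
FAKE_PAGES = {
    n: {
        "page": n,
        "has_next": n < 3,
        "quotes": [{"text": f"quote {n}-{i}"} for i in range(2)],
    }
    for n in range(1, 4)
}

quotes = scrape_all(FAKE_PAGES.get)
print(len(quotes))  # 6
```

Against the real site, `fetch_page` could be a small function that requests the page URL, pauses for a few seconds, and returns `response.json()`.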


In [36]:
len(quotes_list)

10

In [39]:
quotes_list[0:2]

[[{'author_goodreads_link': '/author/show/9810.Albert_Einstein',
   'author_name': 'Albert Einstein',
   'author_slug': 'Albert-Einstein',
   'tags_0': 'change',
   'tags_1': 'deep-thoughts',
   'tags_2': 'thinking',
   'tags_3': 'world',
   'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'},
  {'author_goodreads_link': '/author/show/1077326.J_K_Rowling',
   'author_name': 'J.K. Rowling',
   'author_slug': 'J-K-Rowling',
   'tags_0': 'abilities',
   'tags_1': 'choices',
   'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'},
  {'author_goodreads_link': '/author/show/9810.Albert_Einstein',
   'author_name': 'Albert Einstein',
   'author_slug': 'Albert-Einstein',
   'tags_0': 'inspirational',
   'tags_1': 'life',
   'tags_2': 'live',
   'tags_3': 'miracle',
   'tags_4': 'miracles',
   'text': '“There are only two ways to live your life. One is as though nothing is a mi

In [40]:
flat_quotes = list(itertools.chain(*quotes_list))
len(flat_quotes)

100
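Flattening a list of lists can be done in a few equivalent ways. This small sketch compares them on made-up data; `chain.from_iterable` avoids unpacking the outer list into separate arguments.

```python
import itertools

# Made-up stand-in for quotes_list: one inner list per scraped page
pages = [[{"id": 1}, {"id": 2}], [{"id": 3}], []]

flat_a = list(itertools.chain(*pages))               # unpack, then chain
flat_b = list(itertools.chain.from_iterable(pages))  # chain without unpacking
flat_c = [quote for page in pages for quote in page]  # nested comprehension

print(flat_a == flat_b == flat_c)  # True
print(len(flat_b))                 # 3
```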

In [41]:
flat_quotes[0]

{'author_goodreads_link': '/author/show/9810.Albert_Einstein',
 'author_name': 'Albert Einstein',
 'author_slug': 'Albert-Einstein',
 'tags_0': 'change',
 'tags_1': 'deep-thoughts',
 'tags_2': 'thinking',
 'tags_3': 'world',
 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}

In [42]:
df = pd.DataFrame(flat_quotes)
df

Unnamed: 0,author_goodreads_link,author_name,author_slug,tags_0,tags_1,tags_2,tags_3,text,tags_4,tags_5,tags_6,tags_7
0,/author/show/9810.Albert_Einstein,Albert Einstein,Albert-Einstein,change,deep-thoughts,thinking,world,“The world as we have created it is a process ...,,,,
1,/author/show/1077326.J_K_Rowling,J.K. Rowling,J-K-Rowling,abilities,choices,,,"“It is our choices, Harry, that show what we t...",,,,
2,/author/show/9810.Albert_Einstein,Albert Einstein,Albert-Einstein,inspirational,life,live,miracle,“There are only two ways to live your life. On...,miracles,,,
3,/author/show/1265.Jane_Austen,Jane Austen,Jane-Austen,aliteracy,books,classic,humor,"“The person, be it gentleman or lady, who has ...",,,,
4,/author/show/82952.Marilyn_Monroe,Marilyn Monroe,Marilyn-Monroe,be-yourself,inspirational,,,"“Imperfection is beauty, madness is genius and...",,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
95,/author/show/1825.Harper_Lee,Harper Lee,Harper-Lee,better-life-empathy,,,,“You never really understand a person until yo...,,,,
96,/author/show/106.Madeleine_L_Engle,Madeleine L'Engle,Madeleine-LEngle,books,children,difficult,grown-ups,“You have to write the book that wants to be w...,write,writers,writing,
97,/author/show/1244.Mark_Twain,Mark Twain,Mark-Twain,truth,,,,“Never tell the truth to people who are not wo...,,,,
98,/author/show/61105.Dr_Seuss,Dr. Seuss,Dr-Seuss,inspirational,,,,"“A person's a person, no matter how small.”",,,,
