## Scraping non-tabular, multipage sites
Scrape the top 500 <a href="https://bestsellingalbums.org/decade/2010">best-selling albums of the 2010s</a>. Your data must include the following data points:

- Name of album
- Name of artist
- Number of albums sold
- The link to the page that breaks down sales by country (found by clicking the album title)
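
The ranking is spread over several pages, so the first step is knowing which URLs to request. As a minimal sketch, assuming the URL pattern used by the code in this notebook (page 1 is the bare decade URL and later pages append `-2`, `-3`, and so on), the list of pages can be built up front. The helper name `page_urls` is hypothetical, and using ten pages for 500 albums assumes 50 albums per page.

```python
def page_urls(base_url, end_page):
    """Return the URLs for pages 1..end_page (hypothetical helper).

    Assumes page 1 is the bare base URL and page n (n >= 2) is base_url-n.
    """
    urls = []
    for page in range(1, end_page + 1):
        urls.append(base_url if page == 1 else f"{base_url}-{page}")
    return urls


urls = page_urls("https://bestsellingalbums.org/decade/2010", 10)
print(urls[0])   # first page: the bare URL
print(urls[-1])  # last page: ends in -10
```

Building the list first keeps the "first page is special" rule in one place instead of inside the scraping loop.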



In [1]:
## import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import randrange

## Old solution

In [None]:
all_dfs = [] ## hold all dfs
url = "https://bestsellingalbums.org/decade/2010" ## base url

count = 1 ## count
while count <=10:
    print(f"Scraping {url}")
    ## get response
    response = requests.get(url)
    ## turn response into soup (navigable html from string)
    soup = BeautifulSoup(response.text, "html.parser")
    ## grab ALL albums data and store in variable
    all_albums = soup.find_all("div", class_="album_card")

    ## name lists to hold data
    artists_list = []
    albums_list = []
    albums_url_list = []
    sales_list = []

    ## iterate through to capture target data points
    for target in all_albums:
        ## artist name
        artists_list.append(target.find("div", class_="artist").get_text())
        ## album title
        albums_list.append(target.find("div", class_="album").get_text())
        ## album links
        albums_url_list.append(target.find("a").get("href"))
        ## sales
        sales = target.find("div", class_="sales").get_text() ## get the sales text
        sales = int(sales.replace("Sales: ","").replace(",","")) ## remove "Sales: " and commas, then turn into an integer
        sales_list.append(sales)

    ## zip to tuple
    album_data = []
    for all_data in zip(artists_list, albums_list, sales_list, albums_url_list):
        album_data.append(all_data)

    ## convert to df
    df = pd.DataFrame(album_data)
    df.columns = ["artist", "title", "sales", "more_info"]
    all_dfs.append(df)

    ## increment url and set timer
    count += 1
    url = f"https://bestsellingalbums.org/decade/2010-{count}"
    snoozer = randrange(5,12)
    print(f"snoozing for {snoozer} seconds before next scrape")
    time.sleep(snoozer)
    
print("done scraping all links")        
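
The sales cleanup in the loop above can be pulled into a small function that is easier to test on its own. This is a sketch rather than part of the original solution: `parse_sales` is a hypothetical name, and it assumes the text always looks like `Sales: 30,000,000`, as the `replace` calls above imply.

```python
def parse_sales(text):
    """Convert a string such as 'Sales: 30,000,000' to the integer 30000000."""
    cleaned = text.replace("Sales: ", "").replace(",", "").strip()
    if not cleaned.isdigit():
        # fail loudly rather than store a bad number
        raise ValueError(f"Unexpected sales text: {text!r}")
    return int(cleaned)


print(parse_sales("Sales: 30,000,000"))  # 30000000
print(parse_sales("Sales: 2,657,500"))   # 2657500
```

Raising an error on unexpected text makes a change in the page layout show up immediately instead of as a confusing `int()` failure deep inside the loop.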

## Refactored code

In [4]:
all_dfs = [] ## hold all dfs
base_url = "https://bestsellingalbums.org/decade/2010" ## base url
end_page = 10 ## how many pages we want to scrape

for url_number in range(1, end_page + 1):
    ## the first page has no numeric suffix; later pages end in -2, -3, ...
    url = base_url if url_number == 1 else f"{base_url}-{url_number}"
    try:
        response = requests.get(url)
        response.raise_for_status() ## treat HTTP errors (404, 500, ...) as failures
    except requests.RequestException as error: ## skip any page that fails to download
        print(f"Problem with {url}: {error}")
        continue

    ## turn response into soup (navigable html from string)
    soup = BeautifulSoup(response.text, "html.parser")
    ## grab ALL target data and store in lists
    all_targets = soup.find_all("div", class_="album_card")
    artists_list = [target.find("div", class_="artist").get_text() for target in all_targets]
    albums_list = [target.find("div", class_="album").get_text() for target in all_targets]
    more_info_list = [target.find("a").get("href") for target in all_targets]
    sales_list = [int(target.find("div", class_="sales").get_text().replace("Sales: ", "").replace(",", ""))
                  for target in all_targets]

    ## build a DataFrame from a dictionary of the captured lists
    all_dfs.append(pd.DataFrame({"artist": artists_list, "album": albums_list,
                                 "sales": sales_list, "more_info": more_info_list}))

    ## timer
    snoozer = randrange(5, 12)
    print(f"Created DF from page {url_number} and snoozing for {snoozer} seconds before next page")
    time.sleep(snoozer)
print(f"Done scraping all {end_page} pages")

Created DF from page 1 and snoozing for 5 seconds before next page
Created DF from page 2 and snoozing for 9 seconds before next page
Created DF from page 3 and snoozing for 5 seconds before next page
Created DF from page 4 and snoozing for 11 seconds before next page
Created DF from page 5 and snoozing for 9 seconds before next page
Created DF from page 6 and snoozing for 5 seconds before next page
Created DF from page 7 and snoozing for 11 seconds before next page
Created DF from page 8 and snoozing for 6 seconds before next page
Created DF from page 9 and snoozing for 10 seconds before next page
Created DF from page 10 and snoozing for 10 seconds before next page
Done scraping all 10 pages
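
The log above never shows a pause longer than 11 seconds because `randrange(5, 12)` excludes its upper bound. Below is a minimal sketch of the pause as a reusable function, with the `sleep` function passed in so it can be tried without actually waiting. `polite_pause` and its parameters are hypothetical names, not part of the original notebook.

```python
import time
from random import randrange


def polite_pause(low=5, high=12, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high) and return it."""
    seconds = randrange(low, high)  # upper bound is excluded, so 5..11 by default
    sleep(seconds)
    return seconds


# dry run: record the requested pauses instead of sleeping
recorded = []
for _ in range(3):
    polite_pause(sleep=recorded.append)
print(recorded)
```

In the scraping loop the call would simply be `polite_pause()`, which falls back to the real `time.sleep`.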


In [3]:
## combine the per-page data frames into a single data frame
df = pd.concat(all_dfs, ignore_index=True)
df

,artist,album,sales,more_info
0,ADELE,21,30000000,https://bestsellingalbums.org/album/1034
1,ADELE,25,23000000,https://bestsellingalbums.org/album/1035
2,MICHAEL BUBLÉ,CHRISTMAS,15000000,https://bestsellingalbums.org/album/30524
3,TAYLOR SWIFT,1989,14748116,https://bestsellingalbums.org/album/45488
4,JUSTIN BIEBER,PURPOSE,14000000,https://bestsellingalbums.org/album/23318
...,...,...,...,...
145,JUSTIN BIEBER,UNDER THE MISTLETOE,2700725,https://bestsellingalbums.org/album/23319
146,AC/DC,ROCK OR BUST,2700000,https://bestsellingalbums.org/album/901
147,PITBULL,PLANET PIT,2683000,https://bestsellingalbums.org/album/36403
148,STROMAE,RACINE CARREE,2657500,https://bestsellingalbums.org/album/44563
