# Applied Web Scraping

## This project will cover:
1. Recap of basic web scraping principles
2. Retrieving the titles of every 2* rated book
    2.1 Multi-paging

In [30]:
import requests
import bs4

# 1. Recap of basic principle

In [84]:
res = requests.get('https://en.wikipedia.org/wiki/Alan_Partridge')

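# Parse the response body with the lxml parser
contents = bs4.BeautifulSoup(res.text, 'lxml')
title = contents.select('title')  # select() returns a list-like ResultSet of matching tags
title[0].getText()  # text of the first <title> tag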

'Alan Partridge - Wikipedia'

# 2. Every 2* Rated Book

### 2.1 Multi-paging

Our example site, https://books.toscrape.com/, spreads its catalogue over multiple pages, so we need a way to request each page in turn.

One way to do this is to define a `base_url` with a placeholder for the page number:

* `base_url = 'https://books.toscrape.com/catalogue/page-{}.html'`
* `base_url.format('2') --> https://books.toscrape.com/catalogue/page-2.html`
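
The same idea extends to a run of pages. The following is a minimal sketch; `page_urls` is a hypothetical helper name, not part of the original notebook. It also shows that `format` accepts an integer as readily as the string `'2'`.

```python
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

def page_urls(first, last):
    """Return the catalogue URLs for pages first..last, inclusive."""
    # range() excludes its stop value, hence last + 1
    return [base_url.format(n) for n in range(first, last + 1)]

urls = page_urls(1, 3)
for url in urls:
    print(url)
```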

#### 2.1.1 - Distillation

This step requests a page built from `base_url`, parses it with BeautifulSoup, and uses `select` to narrow it down to a `bs4.element.ResultSet` of product pods.

In [85]:
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

# Request page 2 and distil it down to its product pods

res = requests.get(base_url.format('2'))

contents = bs4.BeautifulSoup(res.text, 'lxml')
products = contents.select(".product_pod")
products[0]

<article class="product_pod">
<div class="image_container">
<a href="in-her-wake_980/index.html"><img alt="In Her Wake" class="thumbnail" src="../media/cache/5d/72/5d72709c6a7a9584a4d1cf07648bfce1.jpg"/></a>
</div>
<p class="star-rating One">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
<div class="product_price">
<p class="price_color">£12.84</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

#### 2.1.2 - Dissemination

This step reads the star rating from the `p` tag inside each product pod in the ResultSet.

In [88]:
searchable = products[0]
searchable

<article class="product_pod">
<div class="image_container">
<a href="in-her-wake_980/index.html"><img alt="In Her Wake" class="thumbnail" src="../media/cache/5d/72/5d72709c6a7a9584a4d1cf07648bfce1.jpg"/></a>
</div>
<p class="star-rating One">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
<div class="product_price">
<p class="price_color">£12.84</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [89]:
type(searchable) # A Tag, so we can call select() on it and read attributes with dictionary-style access

bs4.element.Tag

#### 2.1.3 Returning the Rating

In [93]:
searchable.select('.star-rating.One') == [] # False means the pod does carry the One rating class

False
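
The check above only answers whether this book is rated One. A possible way to read the rating directly is to inspect the class list of the `p.star-rating` tag. This is a sketch under stated assumptions: the HTML is a trimmed-down, hypothetical copy of the pod shown earlier, `star_rating` and `RATING_WORDS` are illustrative names, the words Three to Five are assumed to follow the same pattern as One and Two, and `html.parser` is used in place of `lxml` so the snippet needs no extra parser.

```python
import bs4

# Hypothetical, trimmed-down product pod modelled on the output above
html = '''
<article class="product_pod">
<p class="star-rating One"></p>
<h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
</article>
'''

# Assumed mapping from the rating word in the class list to a number
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def star_rating(pod):
    """Return the rating of a product pod as an int, or None if absent."""
    tag = pod.select_one('p.star-rating')
    if tag is None:
        return None
    # 'class' is a multi-valued attribute, so bs4 returns a list of class names
    for name in tag.get('class', []):
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None

pod = bs4.BeautifulSoup(html, 'html.parser').select_one('.product_pod')
rating = star_rating(pod)
print(rating)
```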

#### 2.1.4 Returning the Title

In [94]:
searchable.select('a')[1].getText() # The second <a> in the pod holds the title text

'In Her Wake'
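
The link text and the `title` attribute agree for this book, but they need not agree in general: if a page shortens the visible link text, only the attribute keeps the full title. The sketch below illustrates this with a hypothetical pod (the book title and file names are made up) and selects the link with `h3 a`, which does not depend on the order of the `a` tags. `html.parser` is used instead of `lxml` so the snippet needs no extra parser.

```python
import bs4

# Hypothetical pod whose visible link text has been shortened
html = '''
<article class="product_pod">
<div class="image_container">
<a href="x/index.html"><img alt="A Very Long Book Title" src="x.jpg"/></a>
</div>
<h3><a href="x/index.html" title="A Very Long Book Title">A Very Long ...</a></h3>
</article>
'''

pod = bs4.BeautifulSoup(html, 'html.parser').select_one('.product_pod')
link = pod.select_one('h3 a')   # the title link inside the heading

text_title = link.getText()     # visible text, possibly shortened
attr_title = link['title']      # full title stored in the attribute

print(text_title)
print(attr_title)
```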

#### 2.1.5 Combining the Two

In [96]:
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

two_star_titles = []

# range(1, 5) covers pages 1 to 4

for n in range(1, 5):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    books = soup.select(".product_pod")

    for book in books:
        # A non-empty result means the pod carries the Two rating class
        if len(book.select('.star-rating.Two')) != 0:
            # The second <a> in each pod carries the book's title attribute
            book_title = book.select('a')[1]['title']
            two_star_titles.append(book_title)
two_star_titles

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century']
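
The loop can also be wrapped in a reusable function that stops when a page is missing rather than relying on a fixed page count. This is a rough sketch, not the notebook's method: `collect_titles`, `fetch_live` and `FAKE_PAGES` are hypothetical names, the stand-in pages are made up so the demo runs offline, and `fetch_live` assumes that a non-200 status marks the end of the catalogue.

```python
import bs4

base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

def collect_titles(fetch, rating_word, max_pages):
    """Collect titles of books whose star-rating class matches rating_word.

    fetch(n) is a callable returning the HTML of page n,
    or None when the page does not exist.
    """
    titles = []
    for n in range(1, max_pages + 1):
        html = fetch(n)
        if html is None:
            break  # no such page, so stop paging
        soup = bs4.BeautifulSoup(html, 'html.parser')
        for book in soup.select('.product_pod'):
            if book.select_one('.star-rating.' + rating_word) is not None:
                titles.append(book.select_one('h3 a')['title'])
    return titles

def fetch_live(n):
    """Fetch page n from the live site (assumes non-200 means no more pages)."""
    import requests  # imported lazily so the offline demo needs only bs4
    res = requests.get(base_url.format(n), timeout=10)
    return res.text if res.status_code == 200 else None

# Made-up stand-in pages for an offline demo
POD = ('<article class="product_pod"><p class="star-rating {}"></p>'
       '<h3><a href="{}.html" title="{}">{}</a></h3></article>')
FAKE_PAGES = {
    1: POD.format('Two', 'a', 'Book A', 'Book A') + POD.format('One', 'b', 'Book B', 'Book B'),
    2: POD.format('Two', 'c', 'Book C', 'Book C'),
}

titles = collect_titles(FAKE_PAGES.get, 'Two', max_pages=5)
print(titles)
```

Passing `fetch_live` in place of `FAKE_PAGES.get` would run the same logic against the real site.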