# "Easy Web Scraping with Python"
> "And build a data application"

- toc: true
- branch: master
- badges: true
- hide_binder_badge: True
- hide_colab_badge: True
- comments: true
- author: Samuel Oranyeli
- categories: [python, streamlit, requests-html, requests, pandas]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true
- metadata_key1: "web scraping"
- metadata_key2: "python"

This post is partly motivated by an [article](https://www.betterdatascience.com/web-scraping-with-r-easier-than-python/) published by [Dario Radecic](https://www.betterdatascience.com/author/betterdatascience_onouc8/); the article is a good read.

The aim here is to show how to scrape pages easily in Python and share your results. We will be using two packages: [requests-html](https://requests-html.kennethreitz.org/) for the web scraping, and [streamlit](https://www.streamlit.io/) to build a data application.

Our source for the scraping is [books.toscrape.com](http://books.toscrape.com/). It is a good place to practise web scraping.

Our goal is to get the title, book URL, thumbnail URL, rating, price, and availability of each book, per genre, on the website, which lists roughly a thousand books. A sample image of one of the books is shown below.

![webscraping.png](Images/webscraping.png)

To scrape web pages effectively, one needs to understand a bit of [HTML](https://www.w3schools.com/html/) and [CSS](https://www.w3schools.com/html/html_css.asp). [W3Schools](https://www.w3schools.com/) is a good place to learn the fundamentals; we will not dwell on them here, only on how to use them.

You can check the output of the web scraping [here](https://safe-refuge-85400.herokuapp.com/).

## **Web Scraping**

[requests-html](https://requests-html.kennethreitz.org/) makes web scraping easy. It supports both [css-selectors](https://www.w3schools.com/cssref/css_selectors.asp) and [xpath](https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx); we will be using [css-selectors](https://www.w3schools.com/cssref/css_selectors.asp). Let's look at how to scrape the book titles in the *Travel* category:

In [None]:
from requests_html import HTMLSession
session = HTMLSession()

# url for travel section:
url = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# access data from url:
webpage = session.get(url)

titles = [element.attrs["title"] for element in webpage.html.find("h3>a")]

titles

Pretty easy and straightforward. The key part is getting the [css-selectors](https://www.w3schools.com/cssref/css_selectors.asp) right. 
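
To make the selector concrete: `h3>a` matches only an `<a>` whose direct parent is an `<h3>`. The sketch below imitates that rule with the standard library's `HTMLParser` instead of requests-html, so it runs offline. The HTML snippet is a made-up stand-in for one book entry, and `DirectChildTitles` is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical snippet modelled loosely on one book entry; not copied from the site.
SAMPLE = """
<article class="product_pod">
  <div class="image_container">
    <a href="book_1/index.html"><img src="cover.jpg" alt="Book One"></a>
  </div>
  <h3><a href="book_1/index.html" title="Book One">Book O...</a></h3>
</article>
"""

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class DirectChildTitles(HTMLParser):
    """Collect the title of every <a> whose parent is an <h3> (like "h3>a")."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The parent of the new element is whatever is on top of the stack.
        if tag == "a" and self.open_tags and self.open_tags[-1] == "h3":
            self.titles.append(dict(attrs).get("title"))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


parser = DirectChildTitles()
parser.feed(SAMPLE)
print(parser.titles)  # ['Book One']
```

The link inside `div.image_container` is skipped because its parent is a `<div>`, which is why the thumbnail links need their own selector.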

Let's write a function that gets the title, urls, and other details:

In [None]:
import pandas as pd

def data_extract(genre):
    # look up the genre's url (genre_urls is built in the next cell) and fetch the page
    webpage = genre_urls.get(genre)
    webpage = session.get(webpage)
    
    # the hrefs are relative paths; lstrip removes the leading "." and "/" characters
    urls = [element.attrs["href"].lstrip("./")
            for element in webpage.html.find("div.image_container>a")
           ]

    titles = [element.attrs["title"] for element in webpage.html.find("h3>a")]

    imgs = [element.attrs["src"].lstrip("./")
            for element in webpage.html.find("div.image_container>a>img")
           ]

    ratings = [element.attrs["class"][-1] 
               for element in webpage.html.find("p.star-rating")
              ]

    prices = [element.text for element in webpage.html.find("p.price_color")]

    availability = [element.text for element in webpage.html.find("p.instock")]

    data = dict(
        Title = titles,
        URL = urls,
        Source_Image = imgs,
        Rating = ratings,
        Price = prices,
        Availability = availability,
    )

    return pd.DataFrame(data)

The function above pulls in the data and returns a [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) dataframe.
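
The `Rating` and `Price` columns come back as text. Below is a minimal sketch of converting them to numbers. It assumes the rating class names are the words One to Five and that prices look like `£51.77`; both are assumptions about the site's markup, and `rating_to_int` and `price_to_float` are hypothetical helper names.

```python
# Assumed mapping from the rating class name to a number of stars.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Map a rating word such as "Three" to 3; return None if unrecognised."""
    return RATING_WORDS.get(word)


def price_to_float(text):
    """Drop everything before the first digit, then parse the rest as a float."""
    for position, char in enumerate(text):
        if char.isdigit():
            return float(text[position:])
    raise ValueError(f"no digits found in {text!r}")


print(rating_to_int("Three"))    # 3
print(price_to_float("£51.77"))  # 51.77
```

Skipping to the first digit, rather than removing a fixed symbol, also copes with a currency sign that has been garbled by an encoding mismatch.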

What's left is to pair each category with its URL, for every category on the website:

In [None]:
main_url = "http://books.toscrape.com/"
main_page = session.get(main_url)

# the css-selector helps pull out the links for all the categories (travel, horror, crime, ...)
navlinks = "div.side_categories>ul.nav.nav-list>li>ul>li>a"

# get the categories
genres = [element.text for element in main_page.html.find(navlinks)]

# get the actual urls for each category
# main_url already ends with "/", so the href is appended directly
genre_urls = [f"{main_url}{element.attrs['href']}" 
             for element in main_page.html.find(navlinks)
            ]

# pair the category with the url
genre_urls = dict(zip(genres, genre_urls))
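
An alternative to assembling URLs by hand is `urllib.parse.urljoin` from the standard library, which resolves a relative href against the page it came from, including any `../` segments. This is a sketch only; the href values below are illustrative stand-ins for what `element.attrs["href"]` would return.

```python
from urllib.parse import urljoin

main_url = "http://books.toscrape.com/"

# Illustrative href, as it might appear in the category sidebar of the main page.
category_href = "catalogue/category/books/travel_2/index.html"
category_url = urljoin(main_url, category_href)
print(category_url)

# Illustrative href of a book, relative to the category page it is listed on.
book_href = "../../../its-only-the-himalayas_981/index.html"
book_url = urljoin(category_url, book_href)
print(book_url)
```

Because the result is an absolute URL, it can be passed straight to `session.get` or shown as a clickable link.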

We can easily apply the *data_extract* function to any genre to get all the details:

In [None]:
# view all the details for the Travel category
data_extract("Travel")
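
Note that *data_extract* reads only the single page stored in `genre_urls`. If a category spans several pages, one possible approach is to follow the "next" link until it runs out. The sketch below assumes each page exposes a relative href to the next one. `iter_page_urls` and `find_next_href` are hypothetical names, and a small dictionary stands in for real requests so the example runs offline.

```python
from urllib.parse import urljoin


def iter_page_urls(first_url, find_next_href):
    """Yield first_url, then every page reachable through "next" links.

    find_next_href is a hypothetical callable: given a page URL, it returns
    the relative href of the next page, or None on the last page.
    """
    url = first_url
    seen = set()  # guards against a link cycle
    while url and url not in seen:
        seen.add(url)
        yield url
        next_href = find_next_href(url)
        url = urljoin(url, next_href) if next_href else None


# Stand-in for real requests: maps each page to the href of its "next" link.
fake_site = {
    "http://example.com/cat/index.html": "page-2.html",
    "http://example.com/cat/page-2.html": "page-3.html",
    "http://example.com/cat/page-3.html": None,
}

pages = list(iter_page_urls("http://example.com/cat/index.html", fake_site.get))
print(pages)
```

In the scraper, `find_next_href` could fetch the page with `session.get` and read the href of the first `li.next>a` match; that selector is an assumption about the site's markup.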

Data extraction is complete. Next up is building an easy-to-use **[web application](https://safe-refuge-85400.herokuapp.com/)**. Enter [streamlit](https://www.streamlit.io/).

## **Sharing Data Apps**

[streamlit](https://www.streamlit.io/) really makes building data apps easy, with a few lines of code. 

The code below is basic, but shows the power of streamlit. All we have to do is:

- import streamlit,
- give our application a title,
- add our web scraping code,
- add a [sidebar](https://docs.streamlit.io/en/stable/api.html#add-widgets-to-sidebar) to select genres, and finally
- add a line of code to show the results.

The entire code is listed below:

In [None]:
# import streamlit and other libraries
import streamlit as st
from requests_html import HTMLSession
import pandas as pd

# give our application a title
st.title("Real-time web scraper with Python")


# add our web scraping code
session = HTMLSession()
main_url = "http://books.toscrape.com/"
main_page = session.get(main_url)

navlinks = "div.side_categories>ul.nav.nav-list>li>ul>li>a"
genres = [element.text for element in main_page.html.find(navlinks)]
# main_url already ends with "/", so the href is appended directly
list_urls = [
    f"{main_url}{element.attrs['href']}" for element in main_page.html.find(navlinks)
]
genre_urls = dict(zip(genres, list_urls))


# cache the result per genre; newer Streamlit releases replace st.cache with st.cache_data
@st.cache
def data_extract(genre):
    webpage = genre_urls.get(genre)
    webpage = session.get(webpage)
    # the hrefs are relative paths; lstrip removes the leading "." and "/" characters
    urls = [
        element.attrs["href"].lstrip("./")
        for element in webpage.html.find("div.image_container>a")
    ]

    titles = [element.attrs["title"] for element in webpage.html.find("h3>a")]

    imgs = [
        element.attrs["src"].lstrip("./")
        for element in webpage.html.find("div.image_container>a>img")
    ]

    ratings = [
        element.attrs["class"][-1] for element in webpage.html.find("p.star-rating")
    ]

    prices = [element.text for element in webpage.html.find("p.price_color")]

    availability = [element.text for element in webpage.html.find("p.instock")]

    data = dict(
        Title=titles,
        URL=urls,
        SourceImage=imgs,
        Rating=ratings,
        Price=prices,
        Availability=availability,
    )

    # to_markdown relies on the optional "tabulate" package being installed
    return pd.DataFrame(data).to_markdown(index=False)


# add a sidebar to select genre
option = st.sidebar.selectbox("Genres", genres)

# add a line of code to show the result
st.markdown(data_extract(option), unsafe_allow_html=True)
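
The caching decorator on *data_extract* is what keeps the app from re-scraping a genre that has already been selected. The idea can be sketched with the standard library's `functools.lru_cache`. This is only a conceptual analogy, not how Streamlit implements its cache, and `slow_lookup` is a hypothetical stand-in for the scraping function.

```python
from functools import lru_cache

call_count = 0  # how many times the function body actually runs


@lru_cache(maxsize=None)
def slow_lookup(genre):
    """Stand-in for a scraping function; counts how often the body runs."""
    global call_count
    call_count += 1
    return f"results for {genre}"


slow_lookup("Travel")
slow_lookup("Travel")   # same argument: served from the cache, body is skipped
slow_lookup("Mystery")  # new argument: body runs again
print(call_count)       # 2
```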

Easy. Of course, this is a very simple application; you can do so much more with streamlit. Have a look at the [streamlit gallery](https://www.streamlit.io/gallery) for inspiration.

## **Summary**

Web scraping and building data applications are easy to do in Python. Hopefully, this article gives you an idea of how to achieve this; the rest is up to you.