## Web Scraping - Books to Scrape

In this exercise we will **scrape** the site [https://books.toscrape.com/](https://books.toscrape.com/).

The goal is to extract the information for **every** book on the site and build a **pd.DataFrame** with the following columns:

|  |Column           |Description                            |
|--|-----------------|---------------------------------------|
|0 |**book_name**    |Book title.                            |
|1 |**stars**        |Number of stars.                       |
|2 |**book_category**|Book category.                         |
|3 |**description**  |Book description.                      |
|4 |**upc**          |UPC code.                              |
|5 |**product_type** |Product type.                          |
|6 |**price**        |Price including tax (**incl. tax**).   |
|7 |**tax**          |Tax.                                   |
|8 |**availability** |Availability (**True**/**False**).     |
|9 |**stock**        |Total number of copies in stock.       |
|10|**reviews**      |Number of reviews.                     |
|11|**img_src**      |Image link (**src**).                  |

The result should be a **pd.DataFrame** with 1000 rows and 12 columns.

**Data cleaning is not required, but it only takes one cell and I was bored.**
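The target schema described above can be checked before building the DataFrame. The following is a minimal sketch, assuming the scraped rows are kept as a list of dicts; `EXPECTED_COLUMNS` and `check_records` are hypothetical names introduced here for illustration and are not part of the exercise.

```python
# Expected column order, taken from the column table in this exercise.
EXPECTED_COLUMNS = [
    "book_name", "stars", "book_category", "description", "upc",
    "product_type", "price", "tax", "availability", "stock",
    "reviews", "img_src",
]


def check_records(records, expected_rows=1000):
    """Return a list of problems found in the scraped records (empty if none).

    Hypothetical helper: `records` is a list of dicts, one per book.
    """
    problems = []
    if len(records) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(records)}")
    for i, record in enumerate(records):
        # Compare the keys, in order, against the expected columns.
        if list(record) != EXPECTED_COLUMNS:
            problems.append(f"row {i}: unexpected columns {list(record)}")
    return problems


# Tiny illustration with a single placeholder record.
sample = [dict.fromkeys(EXPECTED_COLUMNS, None)]
print(check_records(sample, expected_rows=1))  # []
```

With the default `expected_rows=1000`, the same call on the final list of records should return an empty list if the scrape was complete.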

In [1]:
import numpy as np
import pandas as pd
import requests

from bs4 import BeautifulSoup

In [2]:
url = "https://books.toscrape.com/"

In [26]:
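def get_links():
    """Collect the relative link of every book by walking the paginated catalogue."""
    page_number = 1
    has_next = True
    links = []

    while has_next:
        # `url` already ends with "/", so no extra slash is needed here.
        page_url = f"{url}catalogue/page-{page_number}.html"

        response = requests.get(page_url)
        response.raise_for_status()  # stop early on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')

        # Each book on a listing page is an <article class="product_pod">.
        items = soup.find_all('article', 'product_pod')

        for item in items:
            link = item.find('a').get('href')
            links.append(link)

        # The last page has no <li class="next"> element.
        has_next = soup.find('li', 'next') is not None

        page_number += 1

    return links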

In [29]:
links = get_links()
links[-5:]

['alice-in-wonderland-alices-adventures-in-wonderland-1_5/index.html',
 'ajin-demi-human-volume-1-ajin-demi-human-1_4/index.html',
 'a-spys-devotion-the-regency-spies-of-london-1_3/index.html',
 '1st-to-die-womens-murder-club-1_2/index.html',
 '1000-places-to-see-before-you-die_1/index.html']
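The links returned above are relative to the `catalogue/` folder, so they have to be turned into absolute URLs before requesting each book page. One way to do that without concatenating strings by hand is `urllib.parse.urljoin` from the standard library. This is only a sketch: `BASE_URL` and `CATALOGUE_URL` are names introduced here, and the image path is a made-up example shaped like the `src` values found on a book page.

```python
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"
CATALOGUE_URL = urljoin(BASE_URL, "catalogue/")

# A relative link taken from the output above.
link = "1000-places-to-see-before-you-die_1/index.html"
book_url = urljoin(CATALOGUE_URL, link)
print(book_url)
# https://books.toscrape.com/catalogue/1000-places-to-see-before-you-die_1/index.html

# Hypothetical image path; "../.." is resolved relative to the book page.
img_src = "../../media/cache/ab/cd/example.jpg"
print(urljoin(book_url, img_src))
# https://books.toscrape.com/media/cache/ab/cd/example.jpg
```

The same call could be used to store absolute image links in `img_src` instead of the relative `../../media/...` paths.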

In [43]:
def get_data(links):
    """Scrape each book page and return one dict per book."""
    data = []

    for link in links:
        # `url` already ends with "/", so no extra slash is needed here.
        page_url = f"{url}catalogue/{link}"

        response = requests.get(page_url)
        response.raise_for_status()  # stop early on HTTP errors
        # Decode as UTF-8 so characters such as "£" come out right.
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'lxml')

        main_info = soup.find('div', 'col-sm-6 product_main')

        book_name = main_info.find('h1').text
        # The rating is the last CSS class, e.g. class="star-rating Three".
        stars = main_info.find('p', 'star-rating').get('class', [])[-1]

        # Breadcrumb items: Home, Books, <category>, <title>.
        book_category = soup.find('ul', 'breadcrumb').find_all('li')[2].text.strip()
        # Some books have no description block.
        dp = soup.find('div', id='product_description')
        description = dp.find_next_sibling('p').text if dp else None

        # Product table cells in order; the third one (price excl. tax) is not needed.
        tds = soup.find('table', 'table table-striped').find_all('td')
        upc, product_type, _, price, tax, availability_stock, reviews = [td.text for td in tds]

        # "In stock (22 available)" -> "In stock " and "22 available"
        availability, stock = availability_stock.split('(')
        stock = stock.replace(')', '')

        img_src = soup.find('div', 'thumbnail').find('img').get('src')

        data.append(
            {
                'book_name': book_name,
                'stars': stars,
                'book_category': book_category,
                'description': description,
                'upc': upc,
                'product_type': product_type,
                'price': price,
                'tax': tax,
                'availability': availability,
                'stock': stock,
                'reviews': reviews,
                'img_src': img_src
            }
        )

    return data
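The availability cell combines two fields in a single string of the form `In stock (22 available)`. As an alternative to splitting on `(` and cleaning up afterwards, the two values can be parsed in one step. The helper below is a sketch under that assumption about the string format; `parse_availability` is a hypothetical name, and the "Out of stock" input is a made-up case used only to show the fallback.

```python
import re


def parse_availability(text):
    """Split an availability string into (in_stock, stock_count).

    Hypothetical helper; assumes text shaped like "In stock (22 available)".
    """
    # Everything before "(" is the status, everything after holds the count.
    status, _, rest = text.partition("(")
    match = re.search(r"\d+", rest)
    in_stock = status.strip() == "In stock"
    stock = int(match.group()) if match else 0
    return in_stock, stock


print(parse_availability("In stock (22 available)"))  # (True, 22)
print(parse_availability("Out of stock"))             # (False, 0)
```

Parsing at scrape time means `availability` and `stock` arrive in the DataFrame as `bool` and `int`, so the later cleaning step would no longer need to touch them.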

In [44]:
data = get_data(links)
data[-1]

{'book_name': '1,000 Places to See Before You Die',
 'stars': 'Five',
 'book_category': 'Travel',
 'description': 'Around the World, continent by continent, here is the best the world has to offer: 1,000 places guaranteed to give travelers the shivers. Sacred ruins, grand hotels, wildlife preserves, hilltop villages, snack shacks, castles, festivals, reefs, restaurants, cathedrals, hidden islands, opera houses, museums, and more. Each entry tells exactly why it\'s essential to visit. Th Around the World, continent by continent, here is the best the world has to offer: 1,000 places guaranteed to give travelers the shivers. Sacred ruins, grand hotels, wildlife preserves, hilltop villages, snack shacks, castles, festivals, reefs, restaurants, cathedrals, hidden islands, opera houses, museums, and more. Each entry tells exactly why it\'s essential to visit. Then come the nuts and bolts: addresses, websites, phone and fax numbers, best times to visit. Stop dreaming and get going.This hefty 

In [69]:
df = pd.DataFrame(data)
df

|   |book_name|stars|book_category|description|upc|product_type|price|tax|availability|stock|reviews|img_src|
|---|---------|-----|-------------|-----------|---|------------|-----|---|------------|-----|-------|-------|
|0|A Light in the Attic|Three|Poetry|It's hard to imagine a world without A Light i...|a897fe39b1053632|Books|£51.77|£0.00|In stock|22 available|0|../../media/cache/fe/72/fe72f0532301ec28892ae7...|
|1|Tipping the Velvet|One|Historical Fiction|"Erotic and absorbing...Written with starling ...|90fa61229261140a|Books|£53.74|£0.00|In stock|20 available|0|../../media/cache/08/e9/08e94f3731d7d6b760dfbf...|
|2|Soumission|One|Fiction|Dans une France assez proche de la nôtre, un h...|6957f44c3847a760|Books|£50.10|£0.00|In stock|20 available|0|../../media/cache/ee/cf/eecfe998905e455df12064...|
|3|Sharp Objects|Four|Mystery|WICKED above her hipbone, GIRL across her hear...|e00eb4fd7b871a48|Books|£47.82|£0.00|In stock|20 available|0|../../media/cache/c0/59/c05972805aa7201171b8fc...|
|4|Sapiens: A Brief History of Humankind|Five|History|From a renowned historian comes a groundbreaki...|4165285e1663650f|Books|£54.23|£0.00|In stock|20 available|0|../../media/cache/ce/5f/ce5f052c65cc963cf4422b...|
|...|...|...|...|...|...|...|...|...|...|...|...|...|
|995|Alice in Wonderland (Alice's Adventures in Won...|One|Classics| |cd2a2a70dd5d176d|Books|£55.53|£0.00|In stock|1 available|0|../../media/cache/99/df/99df494c230127c3d5ff53...|
|996|Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)|Four|Sequential Art|High school student Kei Nagai is struck dead i...|bfd5e1701c862ac3|Books|£57.06|£0.00|In stock|1 available|0|../../media/cache/30/98/309814b6eeba469f4c7411...|
|997|A Spy's Devotion (The Regency Spies of London #1)|Five|Historical Fiction|In England’s Regency era, manners and elegance...|19fec36a1dfb4c16|Books|£16.97|£0.00|In stock|1 available|0|../../media/cache/f9/6b/f96b60a7614c4e3e868b82...|
|998|1st to Die (Women's Murder Club #1)|One|Mystery|James Patterson, bestselling author of the Ale...|f684a82adc49f011|Books|£53.98|£0.00|In stock|1 available|0|../../media/cache/f6/8e/f68e6ae2f9da04fccbde84...|


In [70]:
# Star rating words -> integers.
df['stars'] = df['stars'].map({'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5})
# Strip the pound sign and convert prices to floats.
df['price'] = df['price'].str.replace('£', '', regex=False).astype(float)
df['tax'] = df['tax'].str.replace('£', '', regex=False).astype(float)
df[['book_category', 'product_type']] = df[['book_category', 'product_type']].astype('category')
df['description'] = df['description'].replace({None: np.nan})
# Strip whitespace first: the scraped value may carry a trailing space ("In stock ").
df['availability'] = df['availability'].str.strip() == 'In stock'
df['stock'] = df['stock'].str.replace(' available', '', regex=False).astype(int)
df['reviews'] = df['reviews'].astype(int)
df

|   |book_name|stars|book_category|description|upc|product_type|price|tax|availability|stock|reviews|img_src|
|---|---------|-----|-------------|-----------|---|------------|-----|---|------------|-----|-------|-------|
|0|A Light in the Attic|3|Poetry|It's hard to imagine a world without A Light i...|a897fe39b1053632|Books|51.77|0.0|True|22|0|../../media/cache/fe/72/fe72f0532301ec28892ae7...|
|1|Tipping the Velvet|1|Historical Fiction|"Erotic and absorbing...Written with starling ...|90fa61229261140a|Books|53.74|0.0|True|20|0|../../media/cache/08/e9/08e94f3731d7d6b760dfbf...|
|2|Soumission|1|Fiction|Dans une France assez proche de la nôtre, un h...|6957f44c3847a760|Books|50.10|0.0|True|20|0|../../media/cache/ee/cf/eecfe998905e455df12064...|
|3|Sharp Objects|4|Mystery|WICKED above her hipbone, GIRL across her hear...|e00eb4fd7b871a48|Books|47.82|0.0|True|20|0|../../media/cache/c0/59/c05972805aa7201171b8fc...|
|4|Sapiens: A Brief History of Humankind|5|History|From a renowned historian comes a groundbreaki...|4165285e1663650f|Books|54.23|0.0|True|20|0|../../media/cache/ce/5f/ce5f052c65cc963cf4422b...|
|...|...|...|...|...|...|...|...|...|...|...|...|...|
|995|Alice in Wonderland (Alice's Adventures in Won...|1|Classics| |cd2a2a70dd5d176d|Books|55.53|0.0|True|1|0|../../media/cache/99/df/99df494c230127c3d5ff53...|
|996|Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)|4|Sequential Art|High school student Kei Nagai is struck dead i...|bfd5e1701c862ac3|Books|57.06|0.0|True|1|0|../../media/cache/30/98/309814b6eeba469f4c7411...|
|997|A Spy's Devotion (The Regency Spies of London #1)|5|Historical Fiction|In England’s Regency era, manners and elegance...|19fec36a1dfb4c16|Books|16.97|0.0|True|1|0|../../media/cache/f9/6b/f96b60a7614c4e3e868b82...|
|998|1st to Die (Women's Murder Club #1)|1|Mystery|James Patterson, bestselling author of the Ale...|f684a82adc49f011|Books|53.98|0.0|True|1|0|../../media/cache/f6/8e/f68e6ae2f9da04fccbde84...|


In [71]:
df.dtypes

book_name          object
stars               int64
book_category    category
description        object
upc                object
product_type     category
price             float64
tax               float64
availability         bool
stock               int64
reviews             int64
img_src            object
dtype: object
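The same conversions can also be applied record by record, before the DataFrame is built, which makes each rule easy to test in isolation. The following is a rough sketch using only the standard library; `STAR_WORDS` and `clean_record` are hypothetical names, and the sample values are the raw strings shown for the first book above.

```python
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_record(raw):
    """Return a copy of one scraped record with typed values.

    Hypothetical helper mirroring the pandas cleaning cell, one dict at a time.
    """
    cleaned = dict(raw)
    cleaned["stars"] = STAR_WORDS[raw["stars"]]
    # Drop the pound sign before converting prices to floats.
    cleaned["price"] = float(raw["price"].replace("£", ""))
    cleaned["tax"] = float(raw["tax"].replace("£", ""))
    # Strip first: the scraped status may end with a space.
    cleaned["availability"] = raw["availability"].strip() == "In stock"
    cleaned["stock"] = int(raw["stock"].replace(" available", ""))
    cleaned["reviews"] = int(raw["reviews"])
    return cleaned


raw = {
    "stars": "Three",
    "price": "£51.77",
    "tax": "£0.00",
    "availability": "In stock ",
    "stock": "22 available",
    "reviews": "0",
}
print(clean_record(raw))
# {'stars': 3, 'price': 51.77, 'tax': 0.0, 'availability': True, 'stock': 22, 'reviews': 0}
```

The vectorized pandas version is shorter for 1000 rows; the per-record variant is mainly useful when a single malformed value needs to be tracked down.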
