## Using Pydantic for Validation and Cleaning of Scraped data

Note: not to use list[Any] for building pydantic class model, because there is no type list for sqlalchemy, if you want to save data to database via SQLModel ORM

Introduction:
- In Python, we do not need to specify data type of the variable when declaring or using it. Example:

```python
var = 100
var = 'My name is Python'
```


Reference: 
1. Data validation using Pydantic https://medium.com/@mahimamanik.22/data-validation-using-pydantic-5e97fe93fc87

In [2]:
%pip install requests
%pip install lxml
%pip install pandas
%pip install watermark
%pip install pydantic==2.3.0
%pip install pydantic-core==2.6.3
%pip install word2number
%pip install tqdm

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated pa

#### Scrape the website content

In [3]:
from pydantic import (BaseModel, 
                      validate_call,
                      computed_field, 
                      field_serializer,
                      HttpUrl,
                      )

class BookModel(BaseModel):
  price: str
  book_title: str
  book_url: str
  star_rating_class: str

  @computed_field
  @property
  def star_rating(self) -> str:
    return self.star_rating_class.replace("star-rating ","")

  @field_serializer('star_rating')
  def serialize_star_rating(star_rating: str) -> int:
    return w2n.word_to_num(star_rating)

In [4]:
%%time

import requests
from lxml import html
from word2number import w2n # convert number words (eg. twenty one) to numeric digits (21)
import pandas
from typing import Union # add type hint
from tqdm import tqdm

class BookSpider:
    def __init__(self):
        self.base_url: AnyHttpUrl = "https://books.toscrape.com"
        self.session: requests.sessions.Session  = requests.Session()
        self.headers: dict[str, str] = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
            }
        self.bookmodel = BookModel

    # @validate_call(validate_return=True)
    def parse(self, page_num: int=50) -> list[BookModel]:
        results = []
        # loop the page number to scrape
        for page_num in tqdm(range(1,page_num)):
            url = self.base_url + f"/catalogue/page-{page_num}.html"
            output_data = self._fetch_data(url=url)
            validated_bookmodel_data = self._validate_bookmodel_data(output_data)
            results.append(validated_bookmodel_data)
        return results
            

    # @validate_call(validate_return=True)
    def _fetch_data(self, url: HttpUrl) -> list[dict]:
        response = self.session.request("GET", url)
        tree = html.fromstring(response.content)      
        
        price = tree.xpath("//article[@class='product_pod']//div[@class='product_price']\
            /p[@class='price_color']/text()")        
        book_title = tree.xpath("//article[@class='product_pod']/h3/a/text()")
        book_url = tree.xpath("//article[@class='product_pod']/h3/a/@href")
        star_rating_class = tree.xpath("//article[@class='product_pod']/p/@class")

        data_list = {
                    "price": price,
                    "star_rating_class": star_rating_class,
                    "book_url": book_url, 
                    "book_title": book_title,
                }
        data = pandas.DataFrame(data_list).to_dict(orient="records")
        return data


    # @validate_call(validate_return=True)
    def _validate_bookmodel_data(self, books_data: list[dict]) -> BookModel:
        # validate book data with pydantic
        validated_bookmodel_data_list = []
        for book in books_data:
            validated_bookmodel_data = self.bookmodel.model_validate(book).model_dump() ## model_dump is to compute field
            validated_bookmodel_data_list.append(validated_bookmodel_data)
        return validated_bookmodel_data


book_spider = BookSpider()
data_list = book_spider.parse()

100%|██████████| 49/49 [00:03<00:00, 12.71it/s]

CPU times: user 320 ms, sys: 12.8 ms, total: 333 ms
Wall time: 3.86 s





In [5]:
import pandas

df = pandas.DataFrame(data_list)
df.explode(['price', 'book_title', 'book_url', 'star_rating_class', 'star_rating'])

Unnamed: 0,price,book_title,book_url,star_rating_class,star_rating
0,£45.17,It's Only the Himalayas,its-only-the-himalayas_981/index.html,star-rating Two,2
1,£33.63,You can't bury them ...,you-cant-bury-them-all-poems_961/index.html,star-rating Two,2
2,£45.22,The Natural History of ...,the-natural-history-of-us-the-fine-art-of-pret...,star-rating Three,3
3,£50.40,"Rat Queens, Vol. 3: ...",rat-queens-vol-3-demons-rat-queens-collected-e...,star-rating Three,3
4,£22.00,In the Country We ...,in-the-country-we-love-my-family-divided_901/i...,star-rating Four,4
5,£28.09,Avatar: The Last Airbender: ...,avatar-the-last-airbender-smoke-and-shadow-par...,star-rating Two,2
6,£17.44,The Stranger,the-stranger_861/index.html,star-rating Four,4
7,£41.25,The Cookies & Cups ...,the-cookies-cups-cookbook-125-sweet-savory-rec...,star-rating One,1
8,£30.25,Mrs. Houdini,mrs-houdini_821/index.html,star-rating Five,5
9,£42.16,Deliciously Ella Every Day: ...,deliciously-ella-every-day-quick-and-easy-reci...,star-rating Three,3


## Computing environment

In [6]:
%load_ext watermark

%watermark

# print out pypi packages used
%watermark --iversions

# date
%watermark -u -n -t -z

Last updated: 2024-03-10T15:16:34.648636+00:00

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 8.22.2

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.75-060175-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit

word2number: 1.1
pandas     : 2.2.1
requests   : 2.31.0
lxml       : 5.1.0

Last updated: Sun Mar 10 2024 15:16:34UTC

