# Threading and Asynchronous Programming

Notebook for exploring threading and asynchronous programming in Python.

## Thread

A **thread** is a separate flow of execution, so a program with two threads can have two things happening at once. In most Python 3 implementations, however, the different **threads** do not actually execute at the same time: they merely appear to.
(https://realpython.com/intro-to-python-threading)
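
A minimal sketch of the idea, using only the standard library. The `worker` function, its names and the 0.2 second delay are made up for illustration, and `time.sleep` stands in for a blocking call such as a network request. Both threads wait during the same period, so the total time should be close to one delay rather than the sum of the two.

```python
import threading
import time

def worker(name, delay, finished, lock):
    """Wait for `delay` seconds, then record that this worker is done."""
    time.sleep(delay)  # stand-in for a blocking call (assumption)
    with lock:         # guard the shared list while appending
        finished.append(name)

def run_two_threads(delay=0.2):
    finished = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(name, delay, finished, lock))
        for name in ("first", "second")
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()   # both threads are now waiting at the same time
    for t in threads:
        t.join()    # block until each thread has finished
    return finished, time.perf_counter() - start

finished, elapsed = run_two_threads()
print(sorted(finished), f"{elapsed:.2f}s")
```

The waits overlap because a sleeping (or I/O-blocked) thread lets the other one run, which is why threads help with network-bound work even when they do not execute Python code simultaneously.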

## Synchronous programming

A **synchronous program** is executed one step at a time. Even with conditional branching, loops and function calls, you can still think about the code in terms of taking one execution step at a time. When each step is complete, the program moves on to the next one.
(https://realpython.com/python-async-features)
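
As a small illustration, here is a sketch in which each step must finish before the next one starts. The `step` function and the delays are hypothetical; `time.sleep` plays the role of slow work.

```python
import time

def step(name, delay, log):
    """Run one blocking step and record when it starts and ends."""
    log.append(f"start {name}")
    time.sleep(delay)  # nothing else happens while this step waits
    log.append(f"end {name}")

def run_steps():
    log = []
    start = time.perf_counter()
    step("a", 0.05, log)
    step("b", 0.05, log)  # only starts once step "a" has returned
    return log, time.perf_counter() - start

log, elapsed = run_steps()
print(log)
```

The log always reads start, end, start, end, and the total time is roughly the sum of the individual delays.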

## Asynchronous programming

An asynchronous program behaves differently. It still takes one execution step at a time. The difference is that the system may not wait for an execution step to be completed before moving on to the next one.
(https://realpython.com/python-async-features)
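
A minimal sketch of that behaviour with `asyncio`. The `step` coroutine and its delays are made up for illustration, and `asyncio.sleep` stands in for waiting on a network response. Each coroutine runs until it reaches an `await`, at which point the event loop moves on to the next one.

```python
import asyncio

async def step(name, delay, log):
    """Record start and end, giving up control while 'waiting'."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)  # hands control back to the event loop
    log.append(f"end {name}")
    return name

async def main():
    log = []
    # gather() schedules all three coroutines and waits for all of them.
    results = await asyncio.gather(
        step("a", 0.15, log),
        step("b", 0.10, log),
        step("c", 0.05, log),
    )
    return log, results

log, results = asyncio.run(main())
print(log)      # all three start before any of them ends
print(results)  # gather() keeps the input order: ['a', 'b', 'c']
```

Note that the steps end in order of their delays, while the returned results still follow the order in which the coroutines were passed to `asyncio.gather`.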

## Synchronous version of getting response

In [None]:
from time import perf_counter
import requests

def sync_version(urls):
    for url in urls:
        r = requests.get(f"http://127.0.0.1:8000/items/{url}")
        print(r.json())

start = perf_counter()
sync_version(range(1, 2500))
stop = perf_counter()
print("time taken:", stop - start)

In [13]:
print(range(1,40))

range(1, 40)


## Threading version of getting response

In [None]:
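from concurrent.futures import ThreadPoolExecutor
from time import perf_counter
import requests

start = perf_counter()
# Build the full URLs up front (same local endpoint as the synchronous version);
# range() on its own only yields integers.
urls = [f"http://127.0.0.1:8000/items/{i}" for i in range(1, 2500)]

def get_data(url):
    r = requests.get(url)
    print(r.json())

# Leaving the with-block waits for all submitted requests to finish.
with ThreadPoolExecutor() as executor:
    executor.map(get_data, urls)

stop = perf_counter()

print("time taken:", stop - start)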
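
When the results are needed rather than printed inside the worker, it helps to know that `executor.map` returns an iterator that yields results in input order, and that an exception raised in a worker only surfaces when its result is retrieved. The sketch below shows this with a hypothetical `fake_fetch` function that sleeps instead of making a request.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_fetch(item_id):
    """Hypothetical stand-in for an HTTP request; higher ids finish sooner."""
    time.sleep(0.01 * (4 - item_id))
    return {"id": item_id}

def fetch_all(item_ids, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the iterator: results arrive in input order, and
        # any exception raised in a worker is re-raised here.
        return list(executor.map(fake_fetch, item_ids))

results = fetch_all([1, 2, 3])
print(results)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Even though id 3 finishes first, the list follows the order of the inputs.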

## Asynchronous version of getting response

In [None]:
import asyncio
from time import perf_counter

import aiohttp

async def fetch(s, url):
    async with s.get(url) as r:
        if r.status != 200:
            r.raise_for_status()
        return await r.text()
    
async def fetch_all(s, urls):
    tasks = []
    for url in urls:
        task = asyncio.create_task(fetch(s, url))
        tasks.append(task)
    res = await asyncio.gather(*tasks)
    return res

async def main():
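    # Full URLs for the same local endpoint as above; bare integers are not valid URLs.
    urls = [f"http://127.0.0.1:8000/items/{i}" for i in range(1, 2500)]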
    async with aiohttp.ClientSession() as session:
        htmls = await fetch_all(session, urls)
        print(htmls)

if __name__ == '__main__':
    start = perf_counter()
    asyncio.run(main())
    stop = perf_counter()
    print("time taken:", stop - start)

## Adding Async to Sync code Example

### Sync code example

Use **httpx** instead of the usual **requests** library, because **httpx** provides an **AsyncClient**.

A couple of things are important to know:
* To actually make use of async requests, we need a list of URLs beforehand. Either we already know them, or we construct them by pulling them off the page.
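
The two ways of getting that list can be sketched as follows. The helper names and the sample href are hypothetical, and `urljoin` from the standard library is used here in place of plain string concatenation to resolve links found on a page.

```python
from urllib.parse import urljoin

API_BASE = "https://rickandmortyapi.com/api/character/"

def urls_from_ids(ids):
    """Case 1: the URLs are known in advance and follow a pattern."""
    return [f"{API_BASE}{i}" for i in ids]

def urls_from_hrefs(page_url, hrefs):
    """Case 2: the URLs are built from (possibly relative) links on a page."""
    return [urljoin(page_url, href) for href in hrefs]

known = urls_from_ids(range(1, 4))
scraped = urls_from_hrefs(
    "https://books.toscrape.com/catalogue/page-2.html",
    ["some-book_1/index.html"],  # made-up href for illustration
)
print(known)
print(scraped)
```

Either way, the full list exists before any request is sent, which is what allows all requests to be scheduled together.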

We want to carry the pattern in the code below over to our original code, to make it a little bit quicker.

In [None]:
import httpx
import asyncio

async def get_data(client, url):
    resp = await client.get(url)
    return resp.json()['name']

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        for i in range(1, 150):
            tasks.append(get_data(client, f"https://rickandmortyapi.com/api/character/{i}"))

        characters = await asyncio.gather(*tasks)
        for c in characters:
            print(c)

asyncio.run(main())

### Original code example

This code goes through each page, gets the links for every product, visits each link and pulls out the product information. What we are going to do is make the product detail page part asynchronous.

In [None]:
import httpx
from selectolax.parser import HTMLParser
from dataclasses import dataclass
from rich import print

@dataclass
class Book:
    title: str
    UPC: str
    product_type: str
    price_inc_tax: str
    price_exc_tax: str
    tax: str
    availability: str
    num_of_reviews: str

@dataclass
class Response:
    body_html: HTMLParser
    next_page: dict

def get_page(client, url):
    """
    Fetches the HTML page using the specified client and URL.

    Args:
        client (httpx.Client): The HTTP client.
        url (str): The URL to fetch.

    Returns:
        Response: An instance of the Response class containing the parsed HTML and next page information.
    """
    resp = client.get(url)
    data = HTMLParser(resp.text)
    if data.css_first("li.next"):
        next_page = data.css_first("li.next a").attributes
    else:
        next_page = {"href": None}
    return Response(body_html=data, next_page=next_page)

def parse_links(html):
    """
    Parses the HTML to extract detail page links.

    Args:
        html (HTMLParser): The parsed HTML

    Returns:
        list: A list of detail page links.
    """
    links = html.css("article.product_pod h3 a")
    return [link.attrs["href"] for link in links]

def parse_detail(html, selector, index):
    """
    Parses the HTML to extract a specific detail value.

    Args: 
        html (HTMLParser): The parsed HTML.
        selector (str): The CSS selector
        index (int): The index of the element to extract.

    Returns:
        str: The extracted detail value or "none" if not found.
    """
    try:
        value = html.css(selector)[index].text(strip=True)
        return value
    except IndexError:
        return "none"
    
def detail_page_new(html):
    """
    Parses the detail page HTML to create a Book object.

    Args:
        html (HTMLParser): The parsed detail page HTML.

    Returns:
        Book: An instance of the Book class representing the book details.
    """
    new_book = Book(
        title=parse_detail(html, "h1", 0),
        UPC=parse_detail(html, "table tbody tr td", 0),
        product_type=parse_detail(html, "table tbody tr td", 1),
        price_inc_tax=parse_detail(html, "table tbody tr td", 2),
        price_exc_tax=parse_detail(html, "table tbody tr td", 3),
        tax=parse_detail(html, "table tbody tr td", 4),
        availability=parse_detail(html, "table tbody tr td", 5),
        num_of_reviews=parse_detail(html, "table tbody tr td", 6)
    )
    return new_book

def check_url_text(value):
    if "catalogue" not in value:
        return "catalogue/" + value
    else:
        return value

def main():
    results = []
    base_url = "https://books.toscrape.com/"
    url = "https://books.toscrape.com/"
    client = httpx.Client()
    while True:
        data = get_page(client, url)
        print(data)
        detail_links = parse_links(data.body_html)
        links = [base_url + check_url_text(link) for link in detail_links]
        for link in links:
            product_page_data = get_page(client, link)
            book_item = detail_page_new(product_page_data.body_html)
            results.append(book_item)
            print(book_item)
        if data.next_page["href"] is None:
            client.close()
            break
        next_page_url = check_url_text(data.next_page["href"])
        url = base_url + str(next_page_url)
    print(results)

if __name__ == "__main__":
    main()

### Changed code

In [None]:
import httpx
from selectolax.parser import HTMLParser
from dataclasses import dataclass
from rich import print
import asyncio

@dataclass
class Book:
    title: str
    UPC: str
    product_type: str
    price_inc_tax: str
    price_exc_tax: str
    tax: str
    availability: str
    num_of_reviews: str

@dataclass
class Response:
    body_html: HTMLParser
    next_page: dict

def get_page(client, url):
    """
    Fetches the HTML page using the specified client and URL.

    Args:
        client (httpx.Client): The HTTP client
        url (str): The URL to fetch

    Returns:
        Response: An instance of the Response class containing the parsed HTML and next page information.
    """
    resp = client.get(url)
    data = HTMLParser(resp.text)
    if data.css_first("li.next"):
        next_page = data.css_first("li.next a").attributes
    else:
        next_page = {"href": None}
    return Response(body_html=data, next_page=next_page)

def parse_links(html):
    """
    Parses the HTML to extract detail page links.

    Args:
        html (HTMLParser): The parsed HTML.

    Returns:
        list: A list of detail page links.
    """
    links = html.css("article.product_pod h3 a")
    return [link.attrs["href"] for link in links]

def parse_detail(html, selector, index):
    """
    Parses the HTML to extract a specific detail value.

    Args:
        html (HTMLParser): The parsed HTML.
        selector (str): The CSS selector.
        index (int): The index of the element to extract.

    Returns:
        str: The extracted detail value or "none" if not found.
    """
    try:
        value = html.css(selector)[index].text(strip=True)
        return value
    except IndexError:
        return "none"
    
def detail_page_new(html):
    """
    Parses the detail page HTML to create a Book object.

    Args:
        html (HTMLParser): The parsed detail page HTML.
    
    Returns:
        Book: An instance of the Book class representing the book details.
    """
    new_book = Book(
        title=parse_detail(html, "h1", 0),
        UPC=parse_detail(html, "table tbody tr td", 0),
        product_type=parse_detail(html, "table tbody tr td", 1),
        price_inc_tax=parse_detail(html, "table tbody tr td", 2),
        price_exc_tax=parse_detail(html, "table tbody tr td", 3),
        tax=parse_detail(html, "table tbody tr td", 4),
        availability=parse_detail(html, "table tbody tr td", 5),
        num_of_reviews=parse_detail(html, "table tbody tr td", 6)
    )
    return new_book

def check_url_text(value):
    if "catalogue" not in value:
        return "catalogue/" + value
    else:
        return value

async def async_get_data(client, url):
    """
    Asynchronously fetches data from the specified URL using the client.

    Args:
        client (httpx.AsyncClient): The asynchronous HTTP client.
        url (str): The URL to fetch.

    Returns:
        None
    """
    resp = await client.get(url)
    html = HTMLParser(resp.text)
    print(detail_page_new(html))

async def with_async(links):
    """
    Executes the async_get_data coroutine for multiple links concurrently.

    Args:
        links (list): A list of URLs to fetch.
    
    Returns:
        None
    """
    async with httpx.AsyncClient() as client:
        tasks = []
        for link in links:
            tasks.append(async_get_data(client, link))
        return await asyncio.gather(*tasks)

def main():
    results = []
    base_url = "https://books.toscrape.com/"
    url = "https://books.toscrape.com/"
    client = httpx.Client()
    while True:
        data = get_page(client, url)
        print(data)
        detail_links = parse_links(data.body_html)
        links = [base_url + check_url_text(link) for link in detail_links]
        asyncio.run(with_async(links))
        # for link in detail_links:
        #     product_page_data = get_page(client, base_url + check_url_text(link))
        #     book_item = detail_page_new(product_page_data.body_html)
        #     results.append(book_item)
        #     print(book_item)
        if data.next_page["href"] is None:
            client.close()
            break
        next_page_url = check_url_text(data.next_page["href"])
        url = base_url + str(next_page_url)
    print(results)

if __name__ == "__main__":
    main()

## Async with pagination
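
The cell below relies on three small steps that can be tried without any network access: reading the last page number from the pager text, building one URL per page, and flattening the per-page link lists into a single list. The sketch below illustrates them with hypothetical helper names; the sample pager text `"Page 1 of 50"` is an assumption about what the site renders.

```python
from itertools import chain

def last_page_from_pager(text):
    """Return the largest integer in pager text such as 'Page 1 of 50'."""
    numbers = [int(token) for token in text.split() if token.isdigit()]
    return max(numbers)

def page_urls(base_url, last_page):
    """Build the URL of every listing page, from page 1 to last_page inclusive."""
    return [f"{base_url}page-{i}.html" for i in range(1, last_page + 1)]

def flatten(list_of_lists):
    """Turn [[a, b], [c]] into [a, b, c]."""
    return list(chain.from_iterable(list_of_lists))

last_page = last_page_from_pager("Page 1 of 50")
urls = page_urls("https://books.toscrape.com/catalogue/", 3)
links = flatten([["a.html", "b.html"], ["c.html"]])
print(last_page, urls[-1], links)
```

Because the page count is known up front, every listing page can be requested at once instead of following the "next" link one page at a time.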

In [None]:
import httpx
from selectolax.parser import HTMLParser
import asyncio
from dataclasses import dataclass
from itertools import chain

@dataclass
class Book:
    title: str
    UPC: str
    product_type: str
    price_inc_tax: str
    price_exc_tax: str
    tax: str
    availability: str
    num_of_reviews: str

@dataclass
class Response:
    body_html: HTMLParser
    next_page: dict

def is_integer(val):
    try:
        return int(val)
    except ValueError:
        return

def parse_detail(html, selector, index):
    """
    Parses the HTML to extract a specific detail value.

    Args:
        html (HTMLParser): The parsed HTML.
        selector (str): The CSS selector.
        index (int): The index of the element to extract.

    Returns:
        str: The extracted detail value or "none" if not found.
    """
    try:
        value = html.css(selector)[index].text(strip=True)
        return value
    except IndexError:
        return "none"
    
def detail_page_new(html):
    """
    Parses the detail page HTML to create a Book object.

    Args:
        html (HTMLParser): The parsed detail page HTML.

    Returns:
        Book: An instance of the Book class representing the book details.
    """
    new_book = Book(
        title=parse_detail(html, "h1", 0),
        UPC=parse_detail(html, "table tbody tr td", 0),
        product_type=parse_detail(html, "table tbody tr td", 1),
        price_inc_tax=parse_detail(html, "table tbody tr td", 2),
        price_exc_tax=parse_detail(html, "table tbody tr td", 3),
        tax=parse_detail(html, "table tbody tr td", 4),
        availability=parse_detail(html, "table tbody tr td", 5),
        num_of_reviews=parse_detail(html, "table tbody tr td", 6)
    )
    return new_book

def get_total_pages(url):
    """
    Fetches the total number of pages from the website.

    Args:
        url (str): The base URL of the website.
    
    Returns:
        int: The total number of pages.
    """
    resp = httpx.get(url)
    html = HTMLParser(resp.text)
    pages = html.css_first("ul.pager li.current").text(strip=True).split()
    pages_int = [is_integer(page) for page in pages if is_integer(page) is not None]
    last_page = max(pages_int)
    return last_page

def parse_links(html):
    """
    Parses the HTML to extract detail page links.

    Args:
        html (HTMLParser): The parsed HTML.
    
    Returns:
        list: A list of detail page links.
    """
    links = html.css("article.product_pod h3 a")
    return [link.attrs["href"] for link in links]

async def get_async_links(client, url):
    """
    Asynchronously fetches detail page links from the specified URL using the client.

    Args:
        client (httpx.AsyncClient): The asynchronous HTTP client.
        url (str): The URL to fetch.

    Returns:
        list: A list of detail page links.
    """
    resp = await client.get(url)
    html = HTMLParser(resp.text)
    links = parse_links(html)
    return links

async def get_links():
    """
    Asynchronously fetches detail page links from multiple pages.

    Returns:
        list: A list of lists, containing detail page links for each page.
    """
    base_url = "https://books.toscrape.com/catalogue/"
    # Look up the page count once, before any tasks are created.
    total_pages = get_total_pages(base_url + "page-1.html")
    async with httpx.AsyncClient() as client:
        tasks = []
        for i in range(1, total_pages + 1):
            tasks.append(
                asyncio.ensure_future(
                    get_async_links(client, base_url + f"page-{i}.html")
                )
            )
        # Wait for every page and return the per-page link lists in page order.
        return await asyncio.gather(*tasks)

async def get_async_details(client, url):
    """
    Asynchronously fetches book details from the specified URL using the client.

    Args:
        client (httpx.AsyncClient): The asynchronous HTTP client.
        url (str): The URL to fetch.
    
    Returns:
        Book: The parsed book details.
    """
    resp = await client.get(url)
    html = HTMLParser(resp.text)
    book = detail_page_new(html)
    print(book)
    return book

async def get_detail(urls):
    """
    Asynchronously fetches book details for a list of URLs.

    Args:
        urls (list): A list of URLs to fetch.

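    Returns:
        list: The results of get_async_details, in the same order as urls.
    """
    base_url = "https://books.toscrape.com/catalogue/"
    async with httpx.AsyncClient() as client:
        tasks = []
        for url in urls:
            tasks.append(
                asyncio.ensure_future(get_async_details(client, base_url + url))
            )
        # Without awaiting the tasks, the client would close before they finish.
        return await asyncio.gather(*tasks)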

def main():
    links = asyncio.run(get_links())
    # get_links() returns one list of links per page; flatten them into one list.
    detail_links = list(chain.from_iterable(links))
    details = asyncio.run(get_detail(detail_links))
    print(len(details))

if __name__ == "__main__":
    main()