# Lab 1. Basic Techniques

## Regex

Regular expressions are useful for searching and normalizing text: https://www.w3schools.com/python/python_regex.asp


In [None]:
import re

text = """
Praise for The Rain in Portugal
 
“Nothing in Billy Collins’s twelfth book . . . is exactly what readers might expect, and that’s the charm of this collection.”—The Washington Post
 
“This new collection shows [Collins] at his finest. . . . Certain to please his large readership and a good place for readers new to Collins to begin.”—Library Journal. 
 
“Disarmingly playful and wistfully candid.”—Booklist
Buy new:$38.65
No Import Fees Deposit & $13.01 Shipping to Romania Details -12.3.
"""

Delete every character that is not an uppercase or lowercase letter of the English alphabet, then normalize every run of whitespace characters to a single space.

In [None]:
# raw strings keep the backslash in \s from being read as a string escape
cleaned_text = re.sub(r"[^A-Za-z]", " ", text)
cleaned_text = re.sub(r"\s+", " ", cleaned_text)
print(cleaned_text)
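
As a small self-contained check, the same two substitutions can be wrapped in a helper and tried on a short string. The name `normalize_text`, the sample string, and the final `strip()` are assumptions made for this sketch; they are not part of the lab.

```python
import re

def normalize_text(raw):
    # Replace every character that is not an English letter with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", raw)
    # Collapse each run of whitespace into one space and trim the ends
    # (the trimming is an extra step added for this sketch).
    return re.sub(r"\s+", " ", letters_only).strip()

sample = "Buy new:$38.65\nNo Import   Fees!"
print(normalize_text(sample))
```

Note that digits and punctuation are lost entirely, so this kind of cleaning only suits tasks where the words alone matter.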

The easiest way to test regular expressions is on https://pythex.org/. Here we search for all float or int numbers, together with their positions and values.

In [None]:
pattern = re.compile(r"[+-]?(\d+\.)?\d+")
for matching in pattern.finditer(text):
    print(matching.start(), matching.group())
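
The same pattern can be tried on a short string to see exactly what it matches. This is a minimal sketch; the helper name `find_numbers` and the sample sentence are illustrative assumptions.

```python
import re

NUMBER = re.compile(r"[+-]?(\d+\.)?\d+")

def find_numbers(s):
    # Return (start_offset, matched_text) for every int or float found in s.
    return [(m.start(), m.group()) for m in NUMBER.finditer(s)]

sample = "Buy new:$38.65, shipping $13.01, delta -12.3."
for start, value in find_numbers(sample):
    print(start, value)
```

`finditer` is a convenient choice here because the pattern contains a capturing group: `re.findall` would return the contents of that group (for example `38.`) instead of the whole number.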

## Encodings

The encoding of a text varies with its language, and it matters. Python generally assumes 'utf8', although the default used by `open` depends on the platform locale. The following example is taken from a Russian subtitle file (.srt) that is not encoded in utf8. Reading it produces the following error:

In [None]:
with open('encoded_text.txt', "r") as fin:
    content = fin.read()
    print(content)
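
The failure can be reproduced without the subtitle file. The sketch below assumes cp1251, an encoding commonly used for Russian text; the lab does not state which encoding the file actually uses.

```python
# "привет" ("hello") encoded with cp1251 (an assumed encoding for this sketch).
raw = "привет".encode("cp1251")

try:
    raw.decode("utf8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("utf8 decoding failed:", err.reason)

# Decoding with the encoding that was actually used recovers the text.
recovered = raw.decode("cp1251")
print(recovered)
```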

We try to detect its encoding using chardet:

In [None]:
! pip install chardet

In [None]:
import chardet

with open('encoded_text.txt', "rb") as f:
    rawdata = f.read()
    result = chardet.detect(rawdata)
    extracted_encoding = result['encoding']
    print(extracted_encoding)

We read the file again with the detected encoding and save its content as utf8:

In [None]:
with open('encoded_text.txt', 'r', encoding=extracted_encoding) as fin:
    content = fin.read()
with open('utf8_text.txt', 'w', encoding='utf8') as fout:
    fout.write(content)
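
The read-then-rewrite step can be exercised end to end on a temporary file. This is a sketch under stated assumptions: the helper `reencode_to_utf8`, the file names, and the cp1251 source encoding are all hypothetical, and the encoding is passed in explicitly instead of being detected.

```python
import os
import tempfile

def reencode_to_utf8(src_path, dst_path, src_encoding):
    # Read with the source encoding, then write the same text back as utf8.
    with open(src_path, "r", encoding=src_encoding) as fin:
        content = fin.read()
    with open(dst_path, "w", encoding="utf8") as fout:
        fout.write(content)
    return content

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_cp1251.txt")
    dst = os.path.join(tmp, "sample_utf8.txt")
    # Create a small cp1251 file that stands in for the subtitle file.
    with open(src, "wb") as f:
        f.write("Привет, мир".encode("cp1251"))
    reencode_to_utf8(src, dst, "cp1251")
    with open(dst, "r", encoding="utf8") as f:
        roundtrip = f.read()

print(roundtrip)
```

When the encoding comes from a detector, the detected value may be `None` if detection fails, so an explicit fallback is worth considering.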

In [None]:
with open('utf8_text.txt', "r", encoding='utf8') as fin:
    content = fin.read()
    print(content)

## Non-standard files (PDF, Word, etc.)

We can read text from Word documents using:

In [None]:
! pip install docx2txt

In [None]:
import docx2txt
my_text = docx2txt.process("soup.docx")
print(my_text)

Or from PDFs that are saved as text (not as images):

In [None]:
! pip install pdfplumber

In [None]:
import pdfplumber
with pdfplumber.open('soup.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text())

## Web scraping

Scraping refers to a range of methods for downloading unstructured data from the web, which we can then process and store in a structured form.

As a first scraping example we will try the following task: starting from the competitive programming site "infoarena.ro", we want to download information about all the submissions made by a given user.

Example submissions page: https://www.infoarena.ro/monitor?user=iordache.bogdan

To make a request that returns the content of the page we can use the "requests" library:

In [None]:
! pip install requests

In [None]:
import requests

def get_submissions_page(user):
    return requests.get(f"https://www.infoarena.ro/monitor?user={user}")

In [None]:
html = get_submissions_page("iordache.bogdan").content

With the method above we can download the entire HTML content of the page. To extract useful information we have to parse this content. We will use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library:

In [None]:
import bs4

def parse_html(html):
    return bs4.BeautifulSoup(html, "html.parser")
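
To see what the parsed object offers without making a request, here is a minimal sketch on an inline HTML fragment. The fragment is hypothetical and only loosely modeled on a results page; the two calls shown, `find` and `select`, are the ones used later in this lab.

```python
import bs4

# Hypothetical fragment, loosely modeled on a results page.
html = """
<span class="count">(42 results)</span>
<table class="monitor"><tbody>
  <tr><td>#1</td><td><a href="/problema/adunare">adunare</a></td></tr>
  <tr><td>#2</td><td>no link here</td></tr>
</tbody></table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
count_text = soup.find("span", class_="count").text

# select takes a CSS selector and returns a list of all matching tags.
rows = soup.select("table.monitor tbody tr")
links = [a["href"] for row in rows for a in row.select("a")]

print(count_text, len(rows), links)
```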

With the content parsed, we can now determine how many submissions this user has in total:

In [None]:
import re

soup = parse_html(html)

# look for a span with the class "count"; it holds the number of submissions
submission_count_text = soup.find("span", class_="count").text
print(submission_count_text)
submission_count = int(re.search(r"\d+", submission_count_text).group())
print(submission_count)
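
`re.search` returns `None` when nothing matches, so calling `.group()` directly raises `AttributeError` on unexpected text. A defensive variant could look like the sketch below; the helper name `first_int` and the sample strings are assumptions.

```python
import re

def first_int(text):
    # Return the first run of digits in text as an int, or None if there is none.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

print(first_int("(1234 rezultate)"))
print(first_int("no digits here"))
```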

These submissions are split across several pages (result pagination). The following link: https://www.infoarena.ro/monitor?user=iordache.bogdan&display_entries=250&first_entry=100 returns 250 submissions, starting with submission number 100. We modify the get_submissions_page function as follows:

In [None]:
def get_submissions_page(user, display_entries=None, first_entry=None):
    req_string = f"https://www.infoarena.ro/monitor?user={user}"
    if display_entries is not None:
        req_string += f"&display_entries={display_entries}"
    if first_entry is not None:
        req_string += f"&first_entry={first_entry}"

    return requests.get(req_string)
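
An alternative to concatenating the query string by hand is to let the standard library encode the parameters, which also escapes special characters in the user name. This is a sketch; `build_monitor_url` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.infoarena.ro/monitor"

def build_monitor_url(user, display_entries=None, first_entry=None):
    # Only include the optional parameters that were actually provided.
    params = {"user": user}
    if display_entries is not None:
        params["display_entries"] = display_entries
    if first_entry is not None:
        params["first_entry"] = first_entry
    return f"{BASE_URL}?{urlencode(params)}"

print(build_monitor_url("iordache.bogdan", display_entries=250, first_entry=100))
```

`requests.get` also accepts a `params=` dictionary and performs the same encoding itself.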

We implement a function that returns information about all of a user's submissions:

In [None]:
from tqdm import tqdm
import pandas as pd

def scrape_submissions(user):
    # determine the total number of submissions
    html = get_submissions_page(user).content
    soup = parse_html(html)
    submission_count_text = soup.find("span", class_="count").text
    submission_count = int(re.search(r"\d+", submission_count_text).group())

    # the data about the extracted submissions is stored in this dictionary; this structure
    # will later help us build a table (dataframe) with pandas
    d = {
        "id": [],
        "problema": [],
        "url_problema": [],
        "url_sursa": [],
        "data": [],
        "puncte": [],
    }

    # fetch the submission pages in groups of 250
    for first_entry in tqdm(range(0, submission_count, 250)):
        html = get_submissions_page(user, display_entries=250, first_entry=first_entry).content
        soup = parse_html(html)

        # select all the table rows (tr)
        lines = soup.select("table.monitor tbody tr")

        for line in lines:
            # select the cells on this row
            cells = line.select("td")

            # extract the links for the problem and the source code
            try:
                url_problema = cells[2].select_one("a")["href"]
                url_sursa = cells[4].select_one("a")["href"]
            except Exception:  # if either link is missing, skip the row
                continue
            
            d["id"].append(cells[0].text)
            d["problema"].append(cells[2].text)
            d["url_problema"].append(url_problema)
            d["url_sursa"].append(url_sursa)
            d["data"].append(cells[5].text)

            try:
                puncte = int(re.search(r"\d+", cells[6].text).group())
            except Exception:
                puncte = 0
            d["puncte"].append(puncte)

    return pd.DataFrame(d)
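
The paging loop relies on `range(0, submission_count, 250)` to produce the `first_entry` value of each request. The sketch below isolates that arithmetic; `page_offsets` is a hypothetical helper, and the totals are made-up numbers.

```python
def page_offsets(total, page_size=250):
    # First-entry index of every page needed to cover `total` entries.
    return list(range(0, total, page_size))

print(page_offsets(600))   # three requests are needed for 600 entries
print(page_offsets(250))   # exactly one full page
print(page_offsets(0))     # no entries, no requests
```

The last page is simply shorter than the others, so no special case is needed for totals that are not a multiple of the page size.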

In [None]:
df_submissions = scrape_submissions("iordache.bogdan")

In [None]:
df_submissions.head()

In [None]:
df_submissions.to_csv("submissions.csv", index=False)

Example of writing and reading a JSON file:

In [None]:
import json

vec = [
    {"title": "example_1", "size": 7},
    {"title": "example_2", "size": 3},
    {"title": "example_3", "size": 8},
]

with open("example.json", "w") as f:
    json.dump(vec, f, indent=4)

In [None]:
with open("example.json", "r") as f:
    vec = json.load(f)
print(vec)

Another way to scrape is to use the pandas library to extract HTML tables and turn them into DataFrames, which are very easy to manipulate. A useful example is extracting the Romanian public holidays for 2022 from https://www.timeanddate.com/.

In [None]:
! pip install lxml

In [None]:
import pandas as pd

tables_df = pd.read_html('https://www.timeanddate.com/holidays/romania/2022?hol=1')
df = tables_df[0]

# Clean it up by dropping rows with null values and flattening the column names from the tuple "(Date, Date)" to "Date"
df = df.dropna(axis='index')
df.columns = ['Date', 'Day', 'Name', 'Type']

# Reindex the table (drop expects a boolean, not the string "True")
df = df.reset_index(drop=True)

# Show the first 5 rows
df.head()

In [None]:
# To see the holidays that fall on a Monday
# (the English-language page lists day names in English)
df_luni = df.loc[df["Day"] == "Monday"]
df_luni

In [None]:
df_luni.describe()

We can save the result (like any Python dictionary) to a JSON file as an alternative to the DataFrame, which can be useful in an application for communicating with the front end.

In [None]:
import json
json_str = df.to_json(orient='records')
json_result = json.loads(json_str)

with open('holidays.json', 'w', encoding='utf8') as fout:
    json.dump(json_result, fout, indent=4, sort_keys=True, ensure_ascii=False)
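
The `ensure_ascii=False` argument matters for Romanian text. The sketch below compares the two settings on a made-up record (the holiday name is only an illustration, not data taken from the site).

```python
import json

# "\u0219" is the Romanian letter s-comma; the record itself is hypothetical.
record = {"Name": "Na\u0219terea Domnului"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the character itself

print(escaped)
print(readable)
```

Both strings load back to the same dictionary; the difference is only in how readable the file is for a person.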

Other useful libraries for scraping:
 * [scrapy](https://scrapy.org/) (used mainly for web crawling)
 * [selenium](https://selenium-python.readthedocs.io/) (used to simulate browser activity, mainly when writing tests for front-end applications)

## TASK: IMDb scraping (deadline: March 3, 23:59)

1. Starting from the list of the 250 most popular movies on IMDb ([https://www.imdb.com/chart/top/](https://www.imdb.com/chart/top/)), find the link to the reviews page of each movie.

Example: the reviews page for "The Shawshank Redemption" can be found here: [https://www.imdb.com/title/tt0111161/reviews](https://www.imdb.com/title/tt0111161/reviews)

In [21]:
# Theodor-Pierre Moroianu
# group 334

import pandas as pd
import requests
from tqdm import tqdm
import bs4
import re

# read the list of movies
content = requests.get('https://www.imdb.com/chart/top/')
content = bs4.BeautifulSoup(content.content, "html.parser")

# extract the main table and its rows
table = content.find("tbody", class_="lister-list")
rows = table.find_all("tr")

# the results are stored here
movies_links = {
    "title": [],
    "link": [],
}

# iterate over the rows
for row in rows:
    # extract the column we are interested in
    title_column = row.find("td", class_="titleColumn")
    
    # extract the link: its text is the title, its href is the link to the movie
    title = title_column.find("a")
    movies_links["title"].append(title.text)
    # the href is relative (it stays on the same site), so we have to add
    # https://www.imdb.com in front and "reviews" at the end ourselves
    movies_links["link"].append(
        "https://www.imdb.com" + 
        title["href"] +
        "reviews"
    )

movies_links = pd.DataFrame(movies_links)
movies_links

Unnamed: 0,title,link
0,Închisoarea îngerilor,https://www.imdb.com/title/tt0111161/reviews
1,Nașul,https://www.imdb.com/title/tt0068646/reviews
2,Nașul: Partea a II-a,https://www.imdb.com/title/tt0071562/reviews
3,Cavalerul negru,https://www.imdb.com/title/tt0468569/reviews
4,12 Oameni mânioşi,https://www.imdb.com/title/tt0050083/reviews
...,...,...
245,Andrei Rubliov,https://www.imdb.com/title/tt0060107/reviews
246,Soorarai Pottru,https://www.imdb.com/title/tt10189514/reviews
247,Le notti di Cabiria,https://www.imdb.com/title/tt0050783/reviews
248,Shin seiki Evangelion Gekijô-ban: Air/Magokoro...,https://www.imdb.com/title/tt0169858/reviews
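
Building the reviews link by plain concatenation works as long as the `href` is exactly of the form `/title/<id>/`. A slightly more defensive sketch is given below; the helper `reviews_url` is hypothetical, and the `?ref_=chart` query string is an assumed example of extra content an `href` might carry.

```python
from urllib.parse import urljoin, urlsplit

IMDB_BASE = "https://www.imdb.com"

def reviews_url(href):
    # Keep only the path, so that any query string is dropped.
    path = urlsplit(href).path
    # Make sure the path ends with a slash before appending "reviews".
    if not path.endswith("/"):
        path += "/"
    return urljoin(IMDB_BASE, path + "reviews")

print(reviews_url("/title/tt0111161/"))
print(reviews_url("/title/tt0111161/?ref_=chart"))
```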


2. For each movie, collect data about its reviews (title, text, rating, date, user, etc.)

In [27]:
# dictionary of columns that collects the data (turned into a DataFrame later)
movies_reviews = {
    "movie_title": [],
    "review_title": [],
    "review_text": [],
    "review_rating": [],
    "review_date": [],
    "review_user": [],
    "movie_year": [],
}


def process_review(movie_title: str, review_url: str):
    """
        Extracts the data from a SINGLE review.
    """
    review_content = requests.get(review_url)
    review_content = bs4.BeautifulSoup(review_content.content, "html.parser")

    review_user = review_content.find("div", class_="subpage_title_block")\
        .find("div", class_="parent").text

    # extract the review container so that we do not collide with other
    # components of the page
    review_content = review_content.find("div", class_="review-container")
    review_content = review_content.find("div", class_="lister-item-content")

    review_title = review_content.find("a", class_="title").text
    review_text = review_content.find("div", class_="text show-more__control").text

    # some reviews have no rating;
    # if the lookup fails, there is no rating to extract
    try:
        review_rating = review_content.find("span", class_="rating-other-user-rating")\
            .find("span").text
    except AttributeError:
        review_rating = "N/A"    

    review_date = review_content.find("span", class_="review-date").text
    movie_year = review_content.find("span", class_="lister-item-year text-muted unbold").text[1:-1]

    # append the review
    movies_reviews["movie_title"].append(movie_title)
    movies_reviews["review_title"].append(review_title)
    movies_reviews["review_text"].append(review_text)
    movies_reviews["review_rating"].append(review_rating)
    movies_reviews["review_date"].append(review_date)
    movies_reviews["review_user"].append(review_user)
    movies_reviews["movie_year"].append(movie_year)


def parse_reviews(movie_title: str, reviews_url: str, content: bs4.BeautifulSoup, nr_min_reviews: int = 0):
    """
        Parses content and:
            * extracts all the reviews on the page
            * if more reviews are needed, calls itself recursively on the next page
    """
    # extract the main list and its rows
    reviews_table = content.find("div", class_="lister-list")
    reviews_rows = reviews_table.find_all("div", recursive=False)
    
    # iterate over each review
    for review_row in reviews_rows:
        # isolate the content
        review_content = review_row.find("div", class_="lister-item-content")
        
        # extract the url
        review_url = review_content.find("a", class_="title")["href"]
        review_url = "https://www.imdb.com" + review_url
        
        # process the review, using the parameter rather than the global `title`
        process_review(movie_title, review_url)
        nr_min_reviews -= 1
    
    # more reviews are still needed:
    # call recursively with the next page
    if nr_min_reviews > 0:
        # find the data-key used to request more reviews;
        # stop if the page does not offer one (no more reviews)
        load_more = content.find("div", class_="load-more-data")
        token = load_more.get("data-key") if load_more is not None else None
        if token is None:
            return
        # build the endpoint from this movie's own reviews url,
        # instead of a hard-coded movie id
        api_endpoint = f"{reviews_url}/_ajax?ref_=undefined&paginationKey={token}"
        next_content = requests.get(api_endpoint)
        next_content = bs4.BeautifulSoup(next_content.content, "html.parser")
        # a while loop would also work; recursion keeps the code short
        parse_reviews(movie_title, reviews_url, next_content, nr_min_reviews)


def process_movie_reviews(title: str, url: str, nr_min_reviews: int) -> None:
    """
        Extracts the data for a single movie.
        Extracts at least nr_min_reviews reviews (when that many exist).
    """
    reviews_content = requests.get(url)
    reviews_content = bs4.BeautifulSoup(reviews_content.content, "html.parser")

    parse_reviews(title, url, reviews_content, nr_min_reviews)
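
The "load more" logic can also be written as a loop. The sketch below shows the idea against a fake, in-memory paginated source, so it makes no network requests; `collect_reviews`, `fake_fetch`, and `FAKE_PAGES` are all hypothetical names, and the (items, next key) contract is an assumption of this sketch rather than IMDb's actual interface.

```python
def collect_reviews(fetch_page, min_reviews):
    """Collect at least min_reviews items, or stop when the pages run out.

    fetch_page(key) must return (items, next_key); next_key is None on the
    last page. Passing key=None asks for the first page.
    """
    collected = []
    key = None
    while True:
        items, key = fetch_page(key)
        collected.extend(items)
        if len(collected) >= min_reviews or key is None:
            return collected

# Fake paginated source standing in for the review pages.
FAKE_PAGES = {
    None: (["r1", "r2"], "k1"),
    "k1": (["r3", "r4"], "k2"),
    "k2": (["r5"], None),
}

def fake_fetch(key):
    return FAKE_PAGES[key]

print(collect_reviews(fake_fetch, 3))
print(collect_reviews(fake_fetch, 100))
```

A loop avoids Python's recursion depth limit when many pages are needed, and it makes the stop conditions (enough reviews, or no further page) visible in one place.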

In [28]:
# iterate through the top 250 movies
# only the first 4 are taken so that it does not run forever;
# passing -1 as nr_min_reviews fetches just the first page of reviews
for _, (title, url) in tqdm(list(movies_links.head(4).iterrows())):
    process_movie_reviews(title, url, -1)

100%|██████████| 4/4 [01:07<00:00, 16.76s/it]


In [29]:
pd.DataFrame(movies_reviews)

Unnamed: 0,movie_title,review_title,review_text,review_rating,review_date,review_user,movie_year
0,Închisoarea îngerilor,"Enthralling, fantastic, intriguing, truly rem...",Shawshank Redemption is without doubt one of t...,10,17 April 2009,TheLittleSongbird,1994
1,Închisoarea îngerilor,"""I Had To Go To Prison To Learn To Be A Crook""\n",None of the usual otherworld creatures that po...,9,17 February 2011,bkoganbing,1994
2,Închisoarea îngerilor,All-time prison film classic\n,"Based on a novella by Stephen King, this is be...",10,18 December 2016,Leofwine_draca,1994
3,Închisoarea îngerilor,Masterpiece\n,"Shawshank Redemption, The (1994)**** (out of 4...",,2 December 2008,Michael_Elliott,1994
4,Închisoarea îngerilor,Freeman gives it depth\n,Andy Dufresne (Tim Robbins) is a banker convic...,8,8 December 2013,SnoopyStyle,1994
...,...,...,...,...,...,...,...
95,Cavalerul negru,A terrorist plot so complicated and convolute...,"Gotham City no longer has an ethereal, poetic ...",6,22 January 2011,moonspinner55,2008
96,Cavalerul negru,Smashing follow-up to Batman Begins.\n,I really like the first one in this current se...,,16 December 2008,TxMike,2008
97,Cavalerul negru,"Nolan, Bale, Ledger, Eckhart, all hit it out ...",I wasn't sure what to expect from The Dark Kni...,7,18 July 2008,Quinoa1984,2008
98,Cavalerul negru,The Dark Knight was another awesome Batman mo...,"Before I comment on the movie proper, I'd like...",9,21 July 2008,tavm,2008


3. Create a dataset of reviews; for each review store:
 * the movie it belongs to
 * the review title
 * the review text
 * the rating
 * the date
 * the user

 Save the dataset to a JSON file.

In [30]:
import json

# movies_reviews above already holds all the requested data,
# so we only write it to a file.
with open("data.json", "w", encoding="utf8") as fout:
    json.dump(movies_reviews, fout, ensure_ascii=False)
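
The dictionary above is column-oriented (one list per field). If a list with one object per review is preferred in the JSON file, the columns can be zipped into records. This is a sketch with made-up sample values; `columns_to_records` is a hypothetical helper.

```python
import json

def columns_to_records(columns):
    # Turn {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}].
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]

# Made-up sample with the same shape as movies_reviews (fewer fields).
sample = {
    "movie_title": ["Film A", "Film A"],
    "review_rating": ["10", "N/A"],
}
records = columns_to_records(sample)
print(json.dumps(records, indent=4, ensure_ascii=False))
```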

4. A reviews page shows only a small number of reviews. When pressed, the "Load more" button at the end issues a request that returns the HTML of the next reviews. Using this logic, automatically collect a larger number of reviews for each movie.

In [31]:
# iterate through the top 250 movies
# only the first 2 are taken so that it does not run forever;
# we ask for 100 reviews, and the recursive function follows
# the pagination until it has collected at least 100.
for _, (title, url) in tqdm(list(movies_links.head(2).iterrows())):
    process_movie_reviews(title, url, 100)

100%|██████████| 2/2 [02:22<00:00, 71.06s/it]


In [None]:
!firefox "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
!chrome "https://www.youtube.com/watch?v=dQw4w9WgXcQ"