# Infnet
## Block Project: Applied Data Science [24E3_5]
### Student: Rodrigo Avila

#### To access the project on GitHub, click [here](https://github.com/r-moreira/infnet-4sem-project-tp2)

---

## Web Scraping Genius.com
The goal is to scrape Metallica lyrics from genius.com.

An analysis of the site shows that pagination is implemented as infinite scrolling: each time the user reaches the bottom of the page, more songs are loaded.

This makes it harder to scrape every song. However, the browser's network tab shows that the page requests a paginated backend API, even though the pagination parameters never appear in the page URL.

Example API request:
```
https://genius.com/api/artists/10662/songs?page=2&per_page=20&sort=popularity&text_format=html%2Cmarkdown
```

This shows that Metallica has the ID 10662 and that each request returns 20 songs. The backend can therefore be called directly to obtain the song URLs, so a crawler is not strictly necessary (I wrote one anyway as an exercise).
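
As a minimal sketch of how such a request URL can be assembled, the snippet below rebuilds the example URL with the standard library. The helper name `build_songs_url` and its default values are illustrative assumptions, not part of the project; only the endpoint and the query parameters come from the request observed above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the request observed in the network tab.
BASE_URL = "https://genius.com/api/artists/{artist_id}/songs"


def build_songs_url(artist_id: int, page: int, per_page: int = 20,
                    sort: str = "popularity") -> str:
    """Build the paginated songs URL for one artist (hypothetical helper)."""
    query = urlencode({
        "page": page,
        "per_page": per_page,
        "sort": sort,
        "text_format": "html,markdown",  # urlencode escapes the comma as %2C
    })
    return f"{BASE_URL.format(artist_id=artist_id)}?{query}"


url = build_songs_url(10662, page=2)
print(url)

# Decoding the query string recovers the individual parameters.
params = parse_qs(urlsplit(url).query)
print(params["page"], params["text_format"])
```

Changing `page` is then enough to walk through the artist's songs without scrolling the page in a browser.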

The lyrics themselves do have to be scraped, because the APIs do not provide them. The album scraped here is Master of Puppets, which has 8 songs. It is a good example because it includes an instrumental track, which requires special handling.

In [20]:
import requests
from bs4 import BeautifulSoup

class GeniusAlbumSongsCrawler:
    """Collects the song page links listed on a Genius album page."""

    def __init__(self, album_url: str):
        self.album_url = album_url
        self._soup = None

    @property
    def soup(self) -> BeautifulSoup | None:
        return self._soup

    def fetch_page(self) -> BeautifulSoup:
        """Download the album page and parse it."""
        response = requests.get(self.album_url)
        response.raise_for_status()
        self._soup = BeautifulSoup(response.text, 'html.parser')
        return self._soup

    def extract_song_links(self) -> list[str]:
        """Return the href of every track link found on the album page."""
        if self._soup is None:
            raise ValueError("Soup not initialized. Call fetch_page() first.")

        # Track links are the anchors carrying the `u-display_block` class.
        return [
            link['href']
            for link in self._soup.select('a.u-display_block')
            if link.get('href')
        ]

crawler = GeniusAlbumSongsCrawler('https://genius.com/albums/Metallica/Master-of-puppets')
crawler.fetch_page()
links = crawler.extract_song_links()
links

['https://genius.com/Metallica-battery-lyrics',
 'https://genius.com/Metallica-master-of-puppets-lyrics',
 'https://genius.com/Metallica-the-thing-that-should-not-be-lyrics',
 'https://genius.com/Metallica-welcome-home-sanitarium-lyrics',
 'https://genius.com/Metallica-disposable-heroes-lyrics',
 'https://genius.com/Metallica-leper-messiah-lyrics',
 'https://genius.com/Metallica-orion-lyrics',
 'https://genius.com/Metallica-damage-inc-lyrics']

In [38]:
import requests
from bs4 import BeautifulSoup

class GeniusSongLyricsScraper:
    def __init__(self, song_url: str):
        self.song_url = song_url
        self._soup = None

    @property
    def soup(self) -> BeautifulSoup:
        if not self._soup:
            self.fetch_page()
        return self._soup

    def fetch_page(self) -> None:
        response = requests.get(self.song_url)
        response.raise_for_status() 
        self._soup = BeautifulSoup(response.text, 'html.parser')

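    def extract_lyrics(self) -> str | None:
        """Return the lyrics text, a fixed message for instrumentals, or None if not found."""
        # Instrumental tracks show a placeholder message instead of a lyrics container.
        # Match on a class-name substring so the generated suffix does not matter.
        instrumental_message = self.soup.select_one('div[class*="LyricsPlaceholder__Message"]')
        if instrumental_message and 'This song is an instrumental' in instrumental_message.text:
            return 'This song is an instrumental'

        # Only the first element whose class contains "Lyrics__Container" is read.
        lyrics_div = self.soup.select_one('div[class*="Lyrics__Container"]')
        if lyrics_div is None:
            return None

        # Walk every descendant: <br> tags become newlines, text nodes are kept as-is.
        lyrics = []
        for element in lyrics_div.descendants:
            if element.name == 'br':
                lyrics.append('\n')
            elif isinstance(element, str):
                lyrics.append(element)

        return ''.join(lyrics)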

    def extract_song_name(self) -> str | None:
        song_name_span = self.soup.select_one('h1 span[class*="SongHeaderdesktop__HiddenMask"]')
        if not song_name_span:
            return None
        
        return song_name_span.text
    
    def extract_album_name(self) -> str | None:
        album_name_div = self.soup.select_one('div.HeaderArtistAndTracklistdesktop__Tracklist-sc-4vdeb8-2 a')
        if not album_name_div:
            return None
        
        return album_name_div.text.strip()
    
    def extract_artist_name(self) -> str | None:
        artist_name_div = self.soup.select_one('div.HeaderArtistAndTracklistdesktop__ListArtists-sc-4vdeb8-1 a')
        if not artist_name_div:
            return None
        
        return artist_name_div.text.strip()

# Example usage:
scraper = GeniusSongLyricsScraper("https://genius.com/Albert-collins-iceman-lyrics")
lyrics = scraper.extract_lyrics()
song_name = scraper.extract_song_name()
album_name = scraper.extract_album_name()
artist_name = scraper.extract_artist_name()

print(f"Album Name: {album_name}")
print(f"Song Name: {song_name}")
print(f"Artist Name: {artist_name}\n")
print(f"Lyrics:\n\n{lyrics}")


Album Name: None
Song Name: Iceman
Artist Name: Albert Collins

Lyrics:

I'm your iceman, baby, ain't here to cool you down
Yes I'm your iceman, ladies, you'll always know when I'm around
I left Leona, Texas, to heat this coolest place in town

Gonna play this old guitar, mix up some fire with my ice
Yes, I'm gonna play this old guitar, light your fire with my ice
Sometimes it sounds so good to me, I just might play it twice

I'm your iceman, baby, call me Al, if you please

I'm your iceman, baby, I'm so hot I'll probably freeze
I'm gonna whip up a twister, turn tornados to a breeze

Yes, I'm your iceman, baby

Yes, I'm your iceman, people, fixin' fire with my ice
I'm your iceman, ladies, I make it hot an' chill it right
If you follow my instructions, my ice will last all night
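
The `<br>` handling in `extract_lyrics` can be illustrated offline. The sketch below is not the scraper's implementation: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without a network connection or extra packages. The class name `LyricsTextExtractor`, the helper `html_to_lyrics` and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class LyricsTextExtractor(HTMLParser):
    """Collect text nodes and turn <br> tags into newlines (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <br/> tags also end up here via handle_startendtag.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside nested tags (links, spans) is kept in document order.
        self.parts.append(data)


def html_to_lyrics(fragment: str) -> str:
    parser = LyricsTextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Invented markup that mimics a lyrics container with nested annotation links.
sample = (
    '<div class="Lyrics__Container-abc">[Verse 1]<br/>Line one<br/>'
    '<a href="#"><span>Line two</span></a></div>'
)
print(html_to_lyrics(sample))
```

The same idea applies to the BeautifulSoup version: line breaks come only from `<br>` elements, so text nested inside annotation links stays on the correct line.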


In [3]:
import json
import csv
from typing import Literal

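def save_genius_album_lyrics_to_file(
    file_path: str,
    album_url: str,
    format: Literal['json', 'csv'] = 'csv') -> None:
    """Scrape every song of an album and save the results as JSON or CSV."""
    # Fail before any network request if the output format is not supported.
    if format not in ('json', 'csv'):
        raise ValueError(f"Unsupported format: {format!r}")
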
    crawler = GeniusAlbumSongsCrawler(album_url)
    crawler.fetch_page()
    song_links = crawler.extract_song_links()

    songs_list = []

    for song_url in song_links:
        print(f"Processing {song_url}")
        scraper = GeniusSongLyricsScraper(song_url)
        scraper.fetch_page()
        song_name = scraper.extract_song_name()
        album_name = scraper.extract_album_name()
        artist_name = scraper.extract_artist_name()
        lyrics = scraper.extract_lyrics()
        
        song_info = {
            'album_name': album_name,
            'song_name': song_name,
            'artist_name': artist_name,
            'lyrics': lyrics
        }
        songs_list.append(song_info)

    if format == 'json':
        with open(file_path, 'w', encoding='utf-8') as json_file:
            json.dump(songs_list, json_file, ensure_ascii=False, indent=4)
    elif format == 'csv':
        with open(file_path, 'w', encoding='utf-8', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=['album_name', 'song_name', 'artist_name', 'lyrics'])
            writer.writeheader()
            for song_info in songs_list:
                writer.writerow(song_info)
        
    print(f"Lyrics saved to {file_path}")

save_genius_album_lyrics_to_file(
    '../data/processed/metallica_master_of_puppets_lyrics.csv',
    'https://genius.com/albums/Metallica/Master-of-puppets',
    'csv'
)

Processing https://genius.com/Metallica-battery-lyrics
Processing https://genius.com/Metallica-master-of-puppets-lyrics
Processing https://genius.com/Metallica-the-thing-that-should-not-be-lyrics
Processing https://genius.com/Metallica-welcome-home-sanitarium-lyrics
Processing https://genius.com/Metallica-disposable-heroes-lyrics
Processing https://genius.com/Metallica-leper-messiah-lyrics
Processing https://genius.com/Metallica-orion-lyrics
Processing https://genius.com/Metallica-damage-inc-lyrics
Lyrics saved to ../data/processed/metallica_master_of_puppets_lyrics.csv
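
Because lyrics contain line breaks and commas, it is worth confirming that they survive a CSV round trip. The following is a small sketch of that check, using an in-memory buffer instead of a file and an invented sample row; only the field names match the ones used above.

```python
import csv
import io

FIELDS = ["album_name", "song_name", "artist_name", "lyrics"]

# Invented sample row: the lyrics value contains newlines and a comma.
rows = [{
    "album_name": "Example Album",
    "song_name": "Example Song",
    "artist_name": "Example Artist",
    "lyrics": "[Verse 1]\nFirst line, with a comma\nSecond line",
}]

# newline="" mirrors the open(..., newline='') call used when writing the real file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the data back and compare the multi-line field with the original.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["lyrics"] == rows[0]["lyrics"])
```

The writer quotes any field that contains a delimiter or a line break, which is why the multi-line lyrics come back unchanged.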


In [8]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd

df = pd.read_csv('../data/processed/metallica_master_of_puppets_lyrics.csv')

df.head()

|   | album_name | song_name | artist_name | lyrics |
|---|---|---|---|---|
| 0 | Master of Puppets (Deluxe Box Set) | Battery | Metallica | [Verse 1]\nLashing out the action, returning t... |
| 1 | Master of Puppets (Deluxe Box Set) | Master of Puppets | Metallica | [Verse 1]\nEnd of passion play, crumbling away... |
| 2 | Master of Puppets (Deluxe Box Set) | The Thing That Should Not Be | Metallica | [Verse 1]\nMessenger of fear in sight\nDark de... |
| 3 | Master of Puppets (Deluxe Box Set) | Welcome Home (Sanitarium) | Metallica | [Verse 1]\nWelcome to where time stands still\... |
| 4 | Master of Puppets (Deluxe Box Set) | Disposable Heroes | Metallica | [Verse 1]\nBodies fill the fields I see, hungr... |


In [5]:
save_genius_album_lyrics_to_file(
    '../data/processed/metallica_and_justice_for_all_lyrics.csv',
    'https://genius.com/albums/Metallica/And-justice-for-all',
    'csv'
)

Processing https://genius.com/Metallica-blackened-lyrics
Processing https://genius.com/Metallica-and-justice-for-all-lyrics
Processing https://genius.com/Metallica-eye-of-the-beholder-lyrics
Processing https://genius.com/Metallica-one-lyrics
Processing https://genius.com/Metallica-the-shortest-straw-lyrics
Processing https://genius.com/Metallica-harvester-of-sorrow-lyrics
Processing https://genius.com/Metallica-the-frayed-ends-of-sanity-lyrics
Processing https://genius.com/Metallica-to-live-is-to-die-lyrics
Processing https://genius.com/Metallica-dyers-eve-lyrics
Lyrics saved to ../data/processed/metallica_and_justice_for_all_lyrics.csv


In [22]:
save_genius_album_lyrics_to_file(
    '../data/processed/guns_n_roses_ride_greatest_hits_lyrics.csv',
    'https://genius.com/albums/Guns-n-roses/Greatest-hits',
    'csv'
)

Processing https://genius.com/Guns-n-roses-welcome-to-the-jungle-lyrics
Processing https://genius.com/Guns-n-roses-sweet-child-o-mine-lyrics
Processing https://genius.com/Guns-n-roses-patience-lyrics
Processing https://genius.com/Guns-n-roses-paradise-city-lyrics
Processing https://genius.com/Guns-n-roses-knockin-on-heavens-door-lyrics
Processing https://genius.com/Guns-n-roses-civil-war-lyrics
Processing https://genius.com/Guns-n-roses-you-could-be-mine-lyrics
Processing https://genius.com/Guns-n-roses-dont-cry-lyrics
Processing https://genius.com/Guns-n-roses-november-rain-lyrics
Processing https://genius.com/Guns-n-roses-live-and-let-die-lyrics
Processing https://genius.com/Guns-n-roses-yesterdays-lyrics
Processing https://genius.com/Guns-n-roses-aint-it-fun-lyrics
Processing https://genius.com/Guns-n-roses-since-i-dont-have-you-lyrics
Processing https://genius.com/Guns-n-roses-sympathy-for-the-devil-lyrics
Lyrics saved to ../data/processed/guns_n_roses_ride_greatest_hits_lyrics.csv

In [None]:
!playwright install

In [1]:
# Exporting to PDF
!jupyter nbconvert --to webpdf rodrigo_avila_PB_TP3.ipynb

[NbConvertApp] Converting notebook rodrigo_avila_PB_TP3.ipynb to webpdf
[NbConvertApp] Building PDF
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 220267 bytes to rodrigo_avila_PB_TP3.pdf



In [None]:
import requests
from typing import Dict


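def search_artist(artist: str) -> Dict:
    """Query the Genius search endpoint and return the parsed JSON response."""
    url = "https://genius.com/api/search"

    # Passing the query via `params` lets requests URL-encode spaces and special characters.
    response = requests.get(url, params={"q": artist})
    response.raise_for_status()

    return response.json()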



In [26]:
artist = "Albert Collins"

def get_artist_id(artist: str) -> int | None:
    """Return the ID of the first search hit whose artist matches `artist`, or None."""
    wanted = artist.lower()
    for hit in search_artist(artist)['response']['hits']:
        result = hit['result']
        # `artist_names` is a single string in the API responses, so use a substring
        # test instead of iterating over it character by character.
        if result['primary_artist']['name'].lower() == wanted or wanted in result['artist_names'].lower():
            return result['primary_artist']['id']
    return None

artist_id = get_artist_id(artist)
artist_id

348886
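
The matching rule can be tried without calling the API. This is a minimal sketch under two assumptions: that each hit exposes `result.primary_artist` and that `result.artist_names` is a single string. The function name `find_artist_id` and the first sample hit are hypothetical; the ID 348886 is the one returned above.

```python
from typing import Optional


def find_artist_id(hits: list, artist: str) -> Optional[int]:
    """Return the primary artist ID of the first hit matching `artist` (hypothetical helper)."""
    wanted = artist.casefold()
    for hit in hits:
        result = hit["result"]
        primary = result["primary_artist"]
        # Exact match on the primary artist, or substring match on the credited names.
        if primary["name"].casefold() == wanted or wanted in result["artist_names"].casefold():
            return primary["id"]
    return None


# Hand-written hits shaped like the assumed search response.
sample_hits = [
    {"result": {"artist_names": "Someone Else",
                "primary_artist": {"id": 1, "name": "Someone Else"}}},
    {"result": {"artist_names": "Albert Collins",
                "primary_artist": {"id": 348886, "name": "Albert Collins"}}},
]

print(find_artist_id(sample_hits, "albert collins"))
```

Keeping the matching separate from the HTTP call makes it easy to test the comparison on its own.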

In [33]:
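def get_artist_song_lyrics_url(artist_id: int, song_name: str) -> str | None:
    """Search the artist's paginated song list for `song_name` and return its URL, or None."""
    page = 1

    while True:
        url = f"https://genius.com/api/artists/{artist_id}/songs?page={page}&per_page=50"

        print(f"Fetching page {page}...")

        response = requests.get(url)
        response.raise_for_status()
        data = response.json()  # parse the body once and reuse it

        print(f"Page fetched. - {data}")

        # Check the songs before looking at `next_page`, so the last page is searched too.
        for song in data['response']['songs']:
            if song['title'].lower() == song_name.lower():
                return song['url']

        # `next_page` is None on the last page: nothing left to fetch.
        if data['response']['next_page'] is None:
            break

        page += 1

    return None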

In [36]:
lyrics_url = get_artist_song_lyrics_url(artist_id, "Ice Pick")
lyrics_url

Fetching page 1...
Page fetched. - {'meta': {'status': 200}, 'response': {'songs': [{'_type': 'song', 'annotation_count': 1, 'api_path': '/songs/1123123', 'artist_names': 'Albert Collins', 'full_title': 'A Good Fool Is Hard To Find by\xa0Albert\xa0Collins', 'header_image_thumbnail_url': 'https://images.genius.com/4c4fef8769bbba21dbcdd3bd643ee421.300x302x1.jpg', 'header_image_url': 'https://images.genius.com/4c4fef8769bbba21dbcdd3bd643ee421.500x503x1.jpg', 'id': 1123123, 'instrumental': False, 'lyrics_owner_id': 1549345, 'lyrics_state': 'complete', 'lyrics_updated_at': 1429646690, 'path': '/Albert-collins-a-good-fool-is-hard-to-find-lyrics', 'primary_artist_names': 'Albert Collins', 'pyongs_count': None, 'relationships_index_url': 'https://genius.com/Albert-collins-a-good-fool-is-hard-to-find-sample', 'release_date_components': None, 'release_date_for_display': None, 'release_date_with_abbreviated_month_for_display': None, 'song_art_image_thumbnail_url': 'https://images.genius.com/4c4fe

'https://genius.com/Albert-collins-ice-pick-lyrics'
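
One detail worth checking in a loop like this is that the final page, the one whose `next_page` is null, is still searched. The sketch below exercises that case offline. It assumes the response shape seen in the output above (`response.songs` and `response.next_page`); the function `find_song_url`, the injected `fetch_page` callable and the fake pages are hypothetical.

```python
from typing import Callable, Optional


def find_song_url(fetch_page: Callable[[int], dict], title: str) -> Optional[str]:
    """Search paginated song lists for `title`, including the last page (hypothetical helper)."""
    page: Optional[int] = 1
    wanted = title.casefold()
    while page is not None:
        payload = fetch_page(page)["response"]
        # Inspect the songs first, then decide whether another page exists.
        for song in payload["songs"]:
            if song["title"].casefold() == wanted:
                return song["url"]
        page = payload["next_page"]  # None on the last page ends the loop
    return None


# Two fake pages; the wanted song sits on the last one.
FAKE_PAGES = {
    1: {"response": {"songs": [{"title": "First Song", "url": "https://example.com/first"}],
                     "next_page": 2}},
    2: {"response": {"songs": [{"title": "Last Song", "url": "https://example.com/last"}],
                     "next_page": None}},
}

print(find_song_url(FAKE_PAGES.__getitem__, "last song"))
```

Injecting the page fetcher keeps the pagination logic testable; in real use it would wrap the `requests.get` call.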
