# **Data Analytics and Visualization**

🧑🏻‍🏫 **Lecturer:**

- Wayan Oger Vihikan, S.T.I., M.I.T.

🧑🏻‍🎓 **Members of Group 2:**

- 2105551125 - Anak Agung Sagung Mirah Indira Wardhana

- 2105551122 - Mananda Davar Sinaga

- 2105551126 - I Nyoman Yodya Mahesa Sastra

# **Scraping the AOTY Website**

This notebook contains the code for scraping the AOTY (Album of the Year) website:\
https://www.albumoftheyear.org/

<img src="./pic_aoty.png" style="max-height: 300px; max-width: 600px">

## **Importing Libraries**

📚 **Libraries used:**

- `requests`, used to send HTTP requests to the albumoftheyear website

- `pandas`, used for data manipulation and analysis

- `bs4`, used to parse HTML

- `concurrent.futures`, used for multithreading to speed up web scraping



In [1]:
# Import library
import requests # HTTP request
import pandas as pd # Data manipulation & analysis
from bs4 import BeautifulSoup # HTML parser
from concurrent.futures import ThreadPoolExecutor # Concurrent Multithreading

## **Functions**

We split the work into several functions to make the program easier to read and so that the scraping can run on multiple threads with `ThreadPoolExecutor`.
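
As a minimal sketch of why plain functions pair well with `ThreadPoolExecutor`, the example below maps a stand-in function over a list of inputs. `fake_scrape` and the sample links are hypothetical placeholders, not part of this notebook. The point is that `executor.map` runs the calls in worker threads and still returns the results in input order.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_scrape(link: str) -> str:
  # Hypothetical stand-in for a scraping function: shorter links sleep longer,
  # so the first task finishes last
  time.sleep(0.05 / len(link))
  return link.upper()

links = ["a", "bb", "ccc"]  # hypothetical placeholder "links"
with ThreadPoolExecutor(max_workers=3) as executor:
  # map() yields results in the order of the inputs, not in completion order
  results = list(executor.map(fake_scrape, links))

print(results)  # ['A', 'BB', 'CCC']
```

Because the order is preserved, each result can be matched back to the link that produced it.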

### get_dan_parse()

📚 **Function `get_dan_parse()`**\
sends the HTTP request for a page and parses the returned HTML

In [3]:
# Function for the HTTP GET request and HTML parsing
def get_dan_parse(link: str):
  header = {'User-Agent': 'Mozilla/5.0'}
  web = requests.get(link, headers=header).content
  web_parsed = BeautifulSoup(web, "html.parser")

  # Returns: bs4.BeautifulSoup
  return web_parsed

### scraping_top_billboard()

📚 **Function `scraping_top_billboard()`**\
scrapes the list of Billboard's top 50 albums of 2023\
https://www.albumoftheyear.org/list/2154-billboards-50-best-albums-of-2023/

<img src="./pic_aoty_top50.png" style="max-height: 300px; max-width: 600px">

In [None]:
# Function for scraping Billboard's top 50 albums of 2023
def scraping_top_billboard(link: str):
  web = get_dan_parse(link)
  web_albums = web.find_all(class_='albumListTitle')
  albums = list()
  for album in web_albums:
    artisjudul_album = album.text.split('. ', 1)[1].split(' - ', 1) # Number. ArtistName - AlbumName -> ['ArtistName', 'AlbumName']
    # 1. Artist name
    artis_album = artisjudul_album[0] # ArtistName
    # 2. Album name
    nama_album = artisjudul_album[1] # AlbumName
    # 3. Album link
    link_album = 'https://www.albumoftheyear.org' + album.find('a')['href'] # https://www.albumoftheyear.org/album/link_album/
    albums.append([artis_album, nama_album, link_album])

  # list('artis', 'album', 'link_album')
  return albums
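
The title parsing above can be tried without any network access. The sketch below applies the same two `split` calls to a sample string. `split_title` is a hypothetical helper name and the sample strings are assumed examples of the `Number. ArtistName - AlbumName` format described in the code comment.

```python
def split_title(text: str) -> list:
  # "Number. ArtistName - AlbumName" -> ['ArtistName', 'AlbumName']
  without_rank = text.split('. ', 1)[1]  # drop the leading "Number. "
  return without_rank.split(' - ', 1)    # split only on the first ' - '

sample = "1. Taylor Swift - 1989 (Taylor's Version)"  # assumed sample text
artist, album = split_title(sample)
print(artist, '|', album)
```

Passing `1` as the maximum number of splits keeps an album title that itself contains `' - '` in one piece. An artist name that contains `' - '` would still be split at the wrong place.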

### scraping_album_review()

📚 **Function `scraping_album_review()`**\
scrapes the 1,000 best reviews of each album\
example: https://www.albumoftheyear.org/album/722921-taylor-swift-1989-taylors-version/user-reviews/

<img src="./pic_aoty_review.png" style="max-height: 300px; max-width: 600px">

In [None]:
# Function for scraping album reviews
def scraping_album_review(link: str):
  # 1. Album link
  url_parent = link.split('/user-reviews')[0]

  # Check whether the link already has the '?p=' page parameter
  if '?p=' in link:
    web = get_dan_parse(link+'&sort=best')
  else:
    web = get_dan_parse(link+'/?sort=best')

  review_list = list()
  web_review = web.find_all(class_='albumReviewRow')
  for review in web_review:
    # 2. User name
    nama_user = review.find(class_='userReviewName').text
    # 3. Rating
    rating = review.find(class_='rating').text
    # 4. User link
    link_user = 'https://www.albumoftheyear.org' + review.find(class_='userReviewName').find('a')['href']
    review_list.append([url_parent, nama_user, rating, link_user])

  if web.find(class_='pageSelect next'):
    # Keep scraping the next page and stop once page 40 has been scraped
    nextpage = link.split('?p=')
    # A link without '?p=' is page 1
    currentpagenum = int(nextpage[1]) if len(nextpage) > 1 else 1
    if currentpagenum < 40:
      nextlink = nextpage[0] + '?p=' + str(currentpagenum + 1)
      review_list.extend(scraping_album_review(nextlink))

  # list('link_album', 'user', 'rating_album', 'link_user')
  return review_list
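
The page-limit rule of `scraping_album_review()` can be isolated and checked on its own. The sketch below is an illustration only: `next_review_link` and `MAX_PAGE` are hypothetical names, and the helper covers just the link arithmetic, not the check for the `pageSelect next` element.

```python
from typing import Optional

MAX_PAGE = 40  # page limit used in scraping_album_review()

def next_review_link(link: str) -> Optional[str]:
  # Hypothetical helper: return the link of the next review page,
  # or None once the page limit has been reached
  base_link, _, page = link.partition('?p=')
  current_page = int(page) if page else 1  # no '?p=' means page 1
  if current_page >= MAX_PAGE:
    return None
  return base_link + '?p=' + str(current_page + 1)

first = "https://www.albumoftheyear.org/album/722921-taylor-swift-1989-taylors-version/user-reviews/"
print(next_review_link(first))
print(next_review_link(first + "?p=40"))
```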

### scraping_user_rating()

📚 **Function `scraping_user_rating()`**\
scrapes the 600 highest-rated albums of each user\
example: https://www.albumoftheyear.org/user/calup/ratings/

<img src="./pic_aoty_rating.png" style="max-height: 300px; max-width: 600px">

In [None]:
# Function for scraping user ratings
def scraping_user_rating(link: str):
  url_parent = link.split('ratings/highest/')

  # Check whether this is the first page
  if len(url_parent) == 1:
    link = link + 'ratings/highest/'

  review_list = list()

  web = get_dan_parse(link)
  # 2. User name
  nama_user = web.find(class_='userLink').text
  # 4. User link
  link_user = url_parent[0]

  web_review = web.find_all(class_='albumBlock')
  for review in web_review:
    artis_dan_album = review.find_all('a')
    # 1. Album link
    link_album = 'https://www.albumoftheyear.org' + artis_dan_album[0]['href']
    # 3. Album rating
    rating_album = review.find(class_='rating').text
    review_list.append([link_album, nama_user, rating_album, link_user])

  web_pages = web.find(class_='pageSelectRow')
  if web_pages:
    # Keep scraping the next page and stop at the last page or at page 10
    if len(url_parent) == 1 or not url_parent[1]:
      # First page: the pagination row exists, so continue with page 2
      review_list.extend(scraping_user_rating(link + '2/'))
    else:
      page_numbers = web_pages.find_all(class_='pageSelectSmall')
      currentpagenum = int(url_parent[1].split('/')[0])
      nextlink = url_parent[0] + 'ratings/highest/' + str(currentpagenum + 1) + '/'
      if int(page_numbers[-1].text) != currentpagenum and currentpagenum < 10:
        review_list.extend(scraping_user_rating(nextlink))

  # list('link_album', 'user', 'rating_album', 'link_user')
  return review_list
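
Unlike the review pages, the rating pages put the page number in the path (`ratings/highest/<page>/`). The sketch below shows that link arithmetic in isolation. `next_rating_link` is a hypothetical helper, and `last_page` is passed in as an assumption instead of being read from the `pageSelectSmall` elements, which simplifies the first-page handling of the function above.

```python
from typing import Optional

def next_rating_link(link: str, last_page: int, max_page: int = 10) -> Optional[str]:
  # Hypothetical helper: return the next ratings page link,
  # or None at the last page or at the page limit
  base_link, _, tail = link.partition('ratings/highest/')
  current_page = int(tail.split('/')[0]) if tail else 1  # no page number means page 1
  if current_page >= last_page or current_page >= max_page:
    return None
  return base_link + 'ratings/highest/' + str(current_page + 1) + '/'

user = "https://www.albumoftheyear.org/user/calup/"
print(next_rating_link(user, last_page=5))
print(next_rating_link(user + "ratings/highest/5/", last_page=5))
```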

### scraping_artis()

📚 **Function `scraping_artis()`**\
scrapes the artist's thumbnail\
example: https://www.albumoftheyear.org/artist/323-taylor-swift/

<img src="./pic_aoty_artis.png" style="max-height: 300px; max-width: 600px">

In [None]:
# Function for scraping an artist
def scraping_artis(link: str):
  web = get_dan_parse(link)
  web_image = web.find(class_='artistImage').find('img')['src']

  # str (URL of the artist thumbnail)
  return web_image

### scraping_album()

📚 **Function `scraping_album()`**\
scrapes the details of an album\
example: https://www.albumoftheyear.org/album/722921-taylor-swift-1989-taylors-version.php

<img src="./pic_aoty_album.png" style="max-height: 300px; max-width: 600px">

In [None]:
# Function for scraping an album
def scraping_album(link: str):
  web = get_dan_parse(link)

  web_headline = web.find(class_='albumHeadline').find('h1')
  # 1. Artist name
  artis_album = web_headline.find(class_='artist').text # ArtistName
  # 2. Artist link
  thumbnail_artis = str()
  artis_link = str()
  try:
    artis_link = 'https://www.albumoftheyear.org' + web_headline.find(class_='artist').find('a')['href'] # https://www.albumoftheyear.org/artist/ArtistName/
    # 13. Artist thumbnail
    thumbnail_artis = scraping_artis(artis_link)
  except Exception:
    # Keep whatever was found so far when the artist link or thumbnail is missing
    pass

  # 3. Album name
  nama_album = web_headline.find(class_='albumTitle').text # AlbumName

  # 4. Album thumbnail
  thumbnail_album = web.find(class_='albumTopBox cover')
  try:
    thumbnail_album = thumbnail_album.find('img')['src'] # Album thumbnail link
  except Exception:
    thumbnail_album = ""

  web_tracklist = web.find(class_='trackListTable')
  # 5. Album tracklist, joined with ';|'
  tracklist = str()
  try:
    if web_tracklist:
      # Search for the title cells directly; iterating the table itself also yields text nodes
      for track in web_tracklist.find_all(class_="trackTitle"):
        nama_lagu = track.find('a').text # SongName
        tracklist += nama_lagu + ";|"
    if tracklist:
      tracklist = tracklist[:-2]
  except Exception:
    tracklist = ""

  # 6. Review link
  link_review = link + "/user-reviews/"

  # 7. Release date
  tanggalrilis = ""
  # 8. Label
  label = ""
  # 9. Genre
  genre = ""
  # 10. Producer
  produser = ""
  # 11. Writer
  penulis = ""

  try:
    web_detail_album = web.find(class_='albumTopBox info').find_all(class_='detailRow')
    # Release date
    tanggalrilis = web_detail_album[0].text.split(" /")[0]
    # Labels, joined with ';|'
    label = ';|'.join(value.text for value in web_detail_album[2].find_all('a', class_=""))
    # Genres, joined with ';|'
    genre = ';|'.join(value.text for value in web_detail_album[3].find_all('a', class_=""))
    # Producers, joined with ';|'
    produser = ';|'.join(value.text for value in web_detail_album[4].find_all('a', class_=""))
    # Writers, joined with ';|'
    penulis = ';|'.join(value.text for value in web_detail_album[5].find_all('a', class_=""))
  except Exception:
    # A missing detail row leaves the remaining fields empty
    pass

  album = [artis_album, artis_link, nama_album, thumbnail_album, tracklist, link_review, tanggalrilis, label, genre, produser, penulis, link, thumbnail_artis]

  # list('artis', 'link_artis', 'album', 'thumbnail_album', 'tracklist_album', 'link_review', 'tanggal_rilis', 'label', 'genre', 'produser', 'penulis', 'link_album', 'thumbnail_artis')
  return album
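
`scraping_album()` stores multi-valued fields (tracklist, label, genre, producer, writer) as one string joined with `;|`. The sketch below shows the round trip for such a field. `join_values`, `split_values`, and the sample genres are hypothetical and only illustrate how the stored string could be turned back into a list during analysis.

```python
SEPARATOR = ';|'  # delimiter used by scraping_album() for multi-valued fields

def join_values(values: list) -> str:
  return SEPARATOR.join(values)

def split_values(field: str) -> list:
  # An empty field means "no values", not one empty value
  return field.split(SEPARATOR) if field else []

genres = ['Synthpop', 'Pop']  # hypothetical sample values
field = join_values(genres)
print(field)                # Synthpop;|Pop
print(split_values(field))  # ['Synthpop', 'Pop']
```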

## **Main Program**

📚 **Main Program**\
The main program for scraping AOTY (Album of the Year) is as follows

In [None]:
# Link to Billboard's top 50 albums of 2023
billboard = "https://www.albumoftheyear.org/list/2154-billboards-50-best-albums-of-2023/"

# Scrape Billboard's top 50 albums of 2023
albums = scraping_top_billboard(billboard)

# Scrape the details of each album in Billboard's top 50 of 2023
list_link = []
for album in albums:
  list_link.append(album[2])
with ThreadPoolExecutor() as executor:
  results_topalbum = list(executor.map(scraping_album, list_link))

# Scrape the reviews of each album in Billboard's top 50 of 2023
list_link = []
for result in results_topalbum:
  list_link.append(result[5])
with ThreadPoolExecutor() as executor:
  results_review_topalbum = list(executor.map(scraping_album_review, list_link))

# Scrape the ratings of each user who reviewed an album in Billboard's top 50 of 2023
list_link = set()
for result in results_review_topalbum:
  for reviewer in result:
    list_link.add(reviewer[3])
with ThreadPoolExecutor() as executor:
  results_user_rating = list(executor.map(scraping_user_rating, list_link))

# Scrape the details of every album rated by any of those users
list_link = set()
for result in results_user_rating:
  for album in result:
    list_link.add(album[0])
with ThreadPoolExecutor() as executor:
  results_all_album = list(executor.map(scraping_album, list_link))
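
The main program collects user and album links into a `set` before each scraping pass, so a link that appears in many results is fetched only once. The sketch below shows the same flattening and de-duplication on made-up rows. `sample_results` and the `example.com` links are hypothetical data shaped like `results_review_topalbum`.

```python
# Hypothetical sample shaped like results_review_topalbum:
# one list per album, each row is [link_album, user, rating_album, link_user]
sample_results = [
  [['album-a', 'user1', '90', 'https://example.com/user/user1/'],
   ['album-a', 'user2', '80', 'https://example.com/user/user2/']],
  [['album-b', 'user1', '70', 'https://example.com/user/user1/']],
]

# Flatten the nested lists and keep each user link once
unique_user_links = {row[3] for rows in sample_results for row in rows}

# A set has no defined order; sort it when a reproducible order matters
ordered_links = sorted(unique_user_links)
print(ordered_links)
```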

In [None]:
# Build a DataFrame from the album data
dfalbum = pd.DataFrame(results_topalbum, columns=['artis', 'link_artis', 'album', 'thumbnail_album', 'tracklist_album', 'link_review', 'tanggal_rilis', 'label', 'genre', 'produser', 'penulis', 'link_album', 'thumbnail_artis'])

In [None]:
# Build a DataFrame from the rating data
dftemp = []
for result in results_review_topalbum:
  df = pd.DataFrame(result, columns=['link_album', 'user', 'rating_album', 'link_user'])
  dftemp.append(df)
dfrating = pd.concat(dftemp, ignore_index=True)

In [None]:
# Save the album and rating DataFrames as .csv files
dfalbum.to_csv('albums.csv', index=False, sep=';', encoding='utf-8')
dfrating.to_csv('ratings.csv', index=False, sep=';', encoding='utf-8')
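
The files use `;` as the column separator, while fields such as the tracklist contain `;|`. This is safe because `to_csv` uses minimal quoting by default, which wraps any field containing the separator in quotes. The sketch below demonstrates the same quoting rule with the standard-library `csv` module as a stand-in for pandas. The sample row is hypothetical.

```python
import csv
import io

row = ['Taylor Swift', 'Track One;|Track Two']  # hypothetical sample row

# Write with ';' as the delimiter; minimal quoting is the default
buffer = io.StringIO()
csv.writer(buffer, delimiter=';').writerow(row)
written = buffer.getvalue()

# Read the line back with the same delimiter
read_back = next(csv.reader(io.StringIO(written), delimiter=';'))
print(written.strip())
print(read_back)
```

When loading `albums.csv` or `ratings.csv` again, the same separator has to be passed, for example `pd.read_csv('albums.csv', sep=';')`.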