# Prog Archives Web Scraping

This notebook scrapes data from a website. Web scraping is the automated extraction of information 
from one or more websites. It is usually done when the data we want is not available 
through an API.

The target site is [Prog Archives](http://www.progarchives.com/), a progressive rock website with an 
active community that indexes genres, bands and albums, along with ratings and comments. It is one of 
the most complete progressive rock resources online, and its search engine lets us filter albums 
by year, genre and country.

In [1]:
# Modules needed to scrape the website and save the results.
from bs4 import BeautifulSoup
import requests
import sys
import csv

In [2]:
# HTTP headers let the client and the server pass additional information with an HTTP request or response.
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}

# Perform a GET request (with a timeout so it cannot hang forever) and stop if the status code is not 200.
page = requests.get('http://www.progarchives.com/top-prog-albums.asp?salbumtypes=1&smaxresults=250#list', headers=headers, timeout=30)

if page.status_code != 200:
    sys.exit('Non 200 status code received')

# Parse the HTML of the response stored in 'page' into a BeautifulSoup object.
soup = BeautifulSoup(page.content, 'html.parser')

# Table that lists the top-rated albums.
table = soup.find('table', attrs={'cellpadding':'7'})
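
To see what `find` with `attrs` does without touching the network, here is a minimal sketch on a small HTML fragment. The markup and the `sample_*` names are invented for illustration and are not the real Prog Archives page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
sample_html = """
<table cellpadding="2"><tr><td>navigation</td></tr></table>
<table cellpadding="7">
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first tag whose attributes match, or None when nothing matches.
sample_table = sample_soup.find('table', attrs={'cellpadding': '7'})
missing_table = sample_soup.find('table', attrs={'cellpadding': '99'})

# find_all() returns every matching descendant as a list-like ResultSet.
sample_rows = sample_table.find_all('tr')
print(len(sample_rows))
print(missing_table is None)
```

Because `find` returns `None` when no tag matches, it may be worth checking `table is not None` before calling `find_all` on it, in case the site layout changes.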

In [3]:
# All rows of the table; each row holds the information for one album.
table_tag = table.find_all('tr')

# In each row, the second link is read as the artist and the first as the album.
artist = [row.find_all('a')[1].get_text() for row in table_tag]

album = [row.find_all('a')[0].get_text() for row in table_tag]

# The third cell of each row holds the rating (in a span) and the QWR value (in a div).
rating = [row.find_all('td')[2].find('span').get_text() for row in table_tag]

QWR = [row.find_all('td')[2].find('div', attrs={'style': "font-size:80%;"}).get_text() for row in table_tag]

genre = [row.find_all('strong')[2].get_text() for row in table_tag]

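# NOTE: <br> is a void element, so .text on it is an empty string in Beautiful Soup.
# If this column comes out blank, read the text node that follows the tag instead
# (for example via .next_sibling, depending on the page markup).
year = [row.find_all('br')[1].text for row in table_tag]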
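
One detail worth checking in the `year` extraction: `<br>` is a void element, so it has no children and its `.text` is empty. Text that visually follows a line break is a sibling of the tag, not its content. The sketch below shows the difference on a made-up row; the markup and the values in it are assumptions, not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical row, invented for illustration only.
row_html = (
    '<table><tr><td>'
    '<a href="#">Some Album</a><br>'
    '<a href="#">Some Artist</a><br>'
    'Studio, 1973'
    '</td></tr></table>'
)

sample_row = BeautifulSoup(row_html, 'html.parser').find('tr')
second_br = sample_row.find_all('br')[1]

# A <br> tag cannot contain anything, so its text is always empty.
br_text = second_br.text

# The text after the line break is the next sibling node of the tag.
trailing_text = str(second_br.next_sibling).strip()

print(repr(br_text))
print(trailing_text)
```

Whether `.next_sibling` is the right node on the live page depends on its actual markup, so it is best to inspect one row before relying on it.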

In [4]:
# Save the dataset to a CSV file.
column_names = ['artist', 'album', 'rating', 'QWR', 'genre', 'year']
# newline='' lets the csv module control line endings; the with block closes the file on exit.
with open('data/raw_data/raw_top_250_progarchives.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(column_names)
    writer.writerows(zip(artist, album, rating, QWR, genre, year))
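
As a quick sanity check on the writing step, the sketch below writes two placeholder rows in the same column layout to an in-memory buffer and reads them back with `csv.DictReader`. The rows are made-up placeholders, not scraped values, and the `sample_*` names are hypothetical.

```python
import csv
import io

sample_columns = ['artist', 'album', 'rating', 'QWR', 'genre', 'year']

# Placeholder rows; the comma in the first album title exercises CSV quoting.
placeholder_rows = [
    ('Artist A', 'Album, with comma', '4.50', '4.40', 'Genre X', '1970'),
    ('Artist B', 'Album B', '4.30', '4.20', 'Genre Y', '1980'),
]

# An in-memory text buffer stands in for the file on disk.
sample_buffer = io.StringIO(newline='')
sample_writer = csv.writer(sample_buffer, delimiter=',')
sample_writer.writerow(sample_columns)
sample_writer.writerows(placeholder_rows)

# Rewind and read the rows back as dictionaries keyed by the header row.
sample_buffer.seek(0)
records = list(csv.DictReader(sample_buffer))

print(len(records))
print(records[0]['album'])
```

Fields containing the delimiter are quoted automatically by `csv.writer`, so album titles with commas survive the round trip intact.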