# Predicting user rating of a hip hop album

Source: albumoftheyear.org

### Brainstorming:
Features/Variables:
- (Title)
- (Artist)
- Label: major vs not
- Release date/year -- month+year vs year?
- Number of tracks
- Debut album (Y/N) -- considering only LP here
- Format -- LP vs non-LP (EP, mixtape)
- Genres/styles
- Artist user score
- Number of user ratings for artist (filtered for 50+)
- Artist critic rating
- Amt of time artist has been active in years (based on discography)
- *Availability on streaming platforms (iTunes, Amazon, etc.)* -- may not be a good variable, as all albums are available for streaming
- Featuring other artists -- *may need from API*
- **Duration** -- *from API*

*Possible:*
- Presence of instrumentals
- Duration of longest track
- Duration of shortest track
- Producers?

Scrapped:
- Posthumous: this is rare so there will likely be no correlation

*Considerations:*
- How to deal with null values?
- Filter by # of user ratings? At least *x* number?
- Categorical variables: genre/style, streaming platforms
- Remove non-LPs?
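
A minimal sketch of how the considerations above could be handled in pandas. The data is made up, and `genre` is a hypothetical column name (only `num_user_ratings` appears in the scraped data below), so treat this as an illustration rather than the final cleaning step.

```python
import pandas as pd

# Toy data; 'genre' is a hypothetical column name and all values are made up
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "num_user_ratings": [120, 40, 75, 60],
    "genre": ["Trap", "Boom Bap", None, "Trap"],
})

MIN_RATINGS = 50  # the "x" threshold from the considerations list

# Keep only albums with at least MIN_RATINGS user ratings
filtered = toy[toy["num_user_ratings"] >= MIN_RATINGS]

# Count nulls per column before deciding whether to drop or impute
null_counts = filtered.isna().sum()

# One-hot encode the categorical column; a row with a null genre gets all zeros
encoded = pd.get_dummies(filtered, columns=["genre"], dtype=int)
print(encoded)
```

Whether to drop or impute the null rows depends on how many there are, which is why the null count comes before the encoding.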

Explain:
- List was comprehensive with minimum rating of 4

In [None]:
# Import web scraping tools
from bs4 import BeautifulSoup
import requests
import time, os

In [None]:
import pandas as pd
import numpy as np
import datetime as dt
import re

In [None]:
# Import functions from scraping module
from scraping import *

In [None]:
# Create list of page number extensions
pages = ['{}/?r=50'.format(i) for i in range(2,54)]

# Insert opening/first page link to pages list
pages.insert(0, '?r=50')

In [None]:
# Create list of links for album pages for all pages of master list
album_links_list = []

for page in pages:
    # Extend rather than reassign, so links from earlier pages are kept
    album_links_list.extend(get_album_links(page))

album_links_list

In [None]:
# Pickle list
import pickle

outfile = open('albumlinks','wb')
pickle.dump(album_links_list,outfile)
outfile.close()
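
The open/dump/close pattern above is repeated several times in this notebook. A small pair of helpers using `with` blocks would close the file even if pickling fails. This is a sketch: `save_pickle` and `load_pickle` are hypothetical names, and the links are placeholders rather than real scraped values.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path; the with-block closes the file even on error."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read back an object previously written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with placeholder links (not real scraped values)
links = ["/album/example-1.php", "/album/example-2.php"]
path = os.path.join(tempfile.mkdtemp(), "albumlinks")
save_pickle(links, path)
restored = load_pickle(path)
```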

**Scraping album page**

In [None]:
# Create list of dictionaries containing features scraped for each album
scraped_albums_l = [scrape_album(link) for link in album_links_list]

In [None]:
len(scraped_albums_l)

In [None]:
# Pickle list of album features
import pickle

out_file = open('scrapedalbumslist','wb')
pickle.dump(scraped_albums_l, out_file)
out_file.close()

In [None]:
# Convert list of album features into dataframe
albums_df = pd.DataFrame(scraped_albums_l)
albums_df.head()

**Explore and clean data**

In [None]:
# Explore albums dataframe
albums_df.info()

In [None]:
# Look at ranges of num_user_ratings
print((albums_df['num_user_ratings'] >= 100).sum())
print((albums_df['num_user_ratings'] >= 75).sum())
print(((albums_df['num_user_ratings'] >= 50) & (albums_df['num_user_ratings'] < 75)).sum())
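
As an alternative to separate boolean sums, the same breakdown could be read off in one step with `pd.cut`. This sketch uses made-up rating counts, and the bin edges simply mirror the thresholds above.

```python
import pandas as pd

# Made-up rating counts standing in for albums_df['num_user_ratings']
ratings = pd.Series([52, 60, 74, 75, 99, 100, 350], name="num_user_ratings")

# Left-closed bins: [50, 75), [75, 100), [100, inf)
bins = [50, 75, 100, float("inf")]
labels = ["50-74", "75-99", "100+"]
binned = pd.cut(ratings, bins=bins, labels=labels, right=False)

# sort=False keeps the bins in their defined order instead of sorting by count
counts = binned.value_counts(sort=False)
print(counts)
```

Unlike the cumulative `>= 75` count above, these bins are disjoint, so the counts add up to the number of albums.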

In [None]:
# Look at streaming column values: if there are no 0 values, this column does not provide useful information and
# should be dropped
albums_df['streaming'].value_counts()

In [None]:
albums_df.drop(columns=['streaming'], inplace=True)
albums_df.head()

In [None]:
# Convert user_score and critic_score columns to numeric types
albums_df['user_score'] = albums_df['user_score'].astype('int')
albums_df['critic_score'] = pd.to_numeric(albums_df['critic_score'])
albums_df.info()

In [None]:
# Look at range of user rating/score
albums_df['user_score'].describe()

# The scores cover a decent range, although more values fall above 60 than below. I will explore this further
# with my final data set.

In [None]:
# Create list of album titles
album_titles_l = list(albums_df['title'])
len(album_titles_l)

In [None]:
# Pickle list of album titles
import pickle

outfile3 = open('titles','wb')
pickle.dump(album_titles_l,outfile3)
outfile3.close()

In [None]:
# Pickle albums dataframe
albums_df.to_pickle("./albums_df.pkl")

In [None]:
# Unpickle albums dataframe
unpickled_albums_df = pd.read_pickle("./albums_df.pkl")
unpickled_albums_df.head()

In [None]:
# Create list of links to individual artist pages
artist_links_list = [get_artist_link(link) for link in album_links_list]
len(artist_links_list)

In [None]:
# Get links for unique artists
from collections import OrderedDict
artist_links_l = list(OrderedDict.fromkeys(artist_links_list))
len(artist_links_l)
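
On Python 3.7+, a plain `dict` preserves insertion order, so the same order-preserving de-duplication works without importing `OrderedDict`. A small sketch with placeholder links:

```python
# Placeholder links (not real scraped values); duplicates arise because one artist
# can have several albums on the list
links = ["/artist/a/", "/artist/b/", "/artist/a/", "/artist/c/", "/artist/b/"]

# dict.fromkeys keeps the first occurrence of each key, in original order
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

Using `set(links)` would also remove duplicates but would not keep the original order.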

In [None]:
# Pickle list of links to artist pages
import pickle

outfile2 = open('artistlinks','wb')
pickle.dump(artist_links_l,outfile2)
outfile2.close()

**Scraping artist page**

In [None]:
# Scrape artist pages
scraped_artists_l = [scrape_artist(link) for link in artist_links_l]
len(scraped_artists_l)

In [None]:
# Create dataframe of artist features
artists_df = pd.DataFrame(scraped_artists_l)
artists_df.head()

In [None]:
# Explore artists dataframe
artists_df.info()

In [None]:
# Some of the columns contain 'None' values. Replace all 'None' values with np.nan in preparation for EDA.
artists_df = artists_df.fillna(value=np.nan)
print(artists_df.isna().sum())
print(artists_df.info())

In [None]:
# Convert artist_user_score to numeric value
artists_df['artist_user_score'] = pd.to_numeric(artists_df['artist_user_score'])

# The conversion above raises a ValueError because some of the values are the string 'NR' (no rating)

In [None]:
# Look at the rows w/ 'NR' value for artist_user_score
artists_df[artists_df['artist_user_score'] == 'NR']
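
One way to get past the 'NR' values is to record which rows are unrated and then let `pd.to_numeric` coerce anything unparseable to NaN. A minimal sketch on made-up scores; whether NaN is the right treatment for unrated artists is a separate modeling decision.

```python
import pandas as pd

# Made-up scores standing in for artists_df['artist_user_score']
scores = pd.Series(["78", "NR", "65", None], name="artist_user_score")

# Flag the unrated rows first, since the 'NR' marker is lost after coercion
is_unrated = scores == "NR"

# errors='coerce' turns values that cannot be parsed (here 'NR') into NaN
numeric_scores = pd.to_numeric(scores, errors="coerce")
print(numeric_scores)
```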

In [None]:
artists_df.isna().sum()

In [None]:
# Count artists with no first_album, then display a few of those rows
print(artists_df['first_album'].isna().sum())
artists_df[artists_df['first_album'].isna()].head()

# A null first_album means the artist's album on the list is not an LP

In [None]:
# Look at number of non-LPs in albums list
(albums_df['format'] == 0).sum()

In [None]:
# Pickle artists dataframe
import pickle

outfile4 = open('artists','wb')
pickle.dump(artists_df,outfile4)
outfile4.close()

In [None]:
# Rename num_user_ratings column of artists_df to avoid overlap with albums_df
artists_df.rename(columns={'num_user_ratings': 'artist_num_user_ratings'}, inplace=True)
artists_df.head(1)

**Merge album and artist dataframes**

In [None]:
merged_df = pd.merge(albums_df, artists_df)
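
Without an `on` argument, `pd.merge` joins on every column name the two dataframes share, which is why the rename above matters. Naming the key explicitly and validating the relationship makes the join easier to check. In this sketch the key `artist` is an assumed column name and the data is made up.

```python
import pandas as pd

# Toy frames; 'artist' is an assumed key column, values are made up
albums = pd.DataFrame({
    "title": ["X", "Y", "Z"],
    "artist": ["A", "A", "B"],
    "user_score": [80, 72, 65],
})
artists = pd.DataFrame({
    "artist": ["A", "B"],
    "artist_user_score": [78, 70],
})

# validate raises if an artist appears more than once on the right side;
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(
    albums, artists, on="artist", how="left",
    validate="many_to_one", indicator=True,
)

# Albums whose artist was not found in the artists frame
unmatched = (merged["_merge"] == "left_only").sum()
print(merged)
```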

In [None]:
merged_df.head(50)

In [None]:
merged_df.info()

In [None]:
# Add 'debut_album' column of 0/1 integers indicating whether the album is the artist's first album,
# then drop the 'first_album' column
merged_df['debut_album'] = merged_df.apply(lambda row: int(row['title'] == row['first_album']), axis=1)
merged_df.drop(columns=['first_album'], inplace=True)
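
The row-wise `apply` above can also be written as a vectorized comparison, which gives the same 0/1 flag and treats a missing `first_album` as "not a debut". A sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows; the last one mimics an artist with no first_album recorded
df = pd.DataFrame({
    "title": ["First", "Second", "Tape"],
    "first_album": ["First", "First", np.nan],
})

# Element-wise comparison; NaN never equals a title, so those rows become 0
df["debut_album"] = (df["title"] == df["first_album"]).astype(int)
print(df)
```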

In [None]:
merged_df.head(20)

In [None]:
# Pickle merged dataframe
import pickle

outfile = open('merged','wb')
pickle.dump(merged_df,outfile)
outfile.close()