# Using Metacritic and HowLongToBeat data to determine which games from my backlog are worth playing

This is a personal problem that every gamer faces at some point. I currently own over 500 video games across 18 different platforms and storefronts, in both physical and digital form. Picking a new game to play has become a daunting task, so I've decided to use data from Metacritic (mainly the critic score and the number of critic reviews), together with the average length of a playthrough as provided by HowLongToBeat, to determine which of my games are worth playing.

In [1]:
import re
from bs4 import BeautifulSoup
import requests
import pandas as pd
from howlongtobeatpy import HowLongToBeat as hltb

## Data

### Preparing and reading data

In [2]:
list_of_games = pd.read_csv(r"C:\Users\ricar\Documents\Python Scripts\Game Backlog\game_list.csv", encoding='cp1252')

I've made a list of every game I own in .csv format. The file is pretty simple: it contains the name of the game in the first column and the platform I own it on in the second. All games from Steam, Epic Store, Origin, Amazon and other PC storefronts are marked as 'pc'.

Example of the .csv file:

| name | platform |
| --- | --- |
| Abzu | playstation-4 |
| Alien Isolation | playstation-3 |
| Animal Crossing Amiibo Festival | wii-u |
| Animal Crossing New Horizons | switch |
| Assassin's Creed | pc |

### Getting Metacritic data

I want the Metacritic critic score, as well as the number of reviews, for the platform I own each game on. However, Metacritic does not provide an API or a database for its game information, so I've decided to scrape Metacritic to get the data I need.

The page for every game contains its 'Metascore' (the aggregate critic review score) as well as the number of reviews behind it. Using the browser's page inspection tool, I've determined which HTML elements hold the data I need. I will feed the page into an HTML parser and look up those elements to get the necessary data. For example, here is the game Enter the Gungeon for PlayStation 4: ![metacritic%20example.png](attachment:metacritic%20example.png) The Metascore is in this HTML element:
```html
<div class="metascore_w xlarge game positive">
    <meta itemprop="worstRating" content="0">
    <meta itemprop="bestRating" content="100">
    <span itemprop="ratingValue">82</span>
</div>
```
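
As a rough illustration of pulling the Metascore out of the element above, here is a minimal sketch. It uses the standard library's `html.parser` instead of BeautifulSoup only so that it runs without extra packages; `RatingValueParser` and `parse_metascore` are hypothetical names and are not part of the scraper in this notebook.

```python
from html.parser import HTMLParser

# The HTML element shown above, copied verbatim.
SNIPPET = """
<div class="metascore_w xlarge game positive">
    <meta itemprop="worstRating" content="0">
    <meta itemprop="bestRating" content="100">
    <span itemprop="ratingValue">82</span>
</div>
"""


class RatingValueParser(HTMLParser):
    """Hypothetical helper: keeps the text of the first <span itemprop="ratingValue">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.rating = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("itemprop") == "ratingValue":
            self._inside = True

    def handle_data(self, data):
        # Only the first non-empty text inside the target span is used
        if self._inside and self.rating is None and data.strip():
            self.rating = int(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


def parse_metascore(html):
    """Return the Metascore as an int, or None if the element is missing."""
    parser = RatingValueParser()
    parser.feed(html)
    parser.close()
    return parser.rating


print(parse_metascore(SNIPPET))
```

Returning `None` for a missing element is a choice made for this sketch; the scraper in this notebook uses 0 instead.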

Metacritic builds the link to a game's page from the base URL 'https://www.metacritic.com/game/', followed by the platform (for example 'playstation-4') and then the game name, hyphenated and in lowercase (e.g. 'enter-the-gungeon'). The final URL is https://www.metacritic.com/game/playstation-4/enter-the-gungeon. This is perfect, because it gives me an easy way to generate the URL, grab the HTML, and then parse out the data I need.
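
The URL rule described above can be sketched on its own, separately from the scraping. This is a minimal sketch, not the function used in this notebook: `build_metacritic_url` is a hypothetical name, and collapsing runs of whitespace into a single hyphen is an assumption of mine.

```python
import re

BASE_URL = "https://www.metacritic.com/game/"


def build_metacritic_url(name, platform):
    """Hypothetical helper: build a game page URL from a title and a platform slug."""
    # Drop the punctuation that does not appear in the links: . , - ' :
    slug = re.sub(r"[.,\-':]", "", name)
    # Assumption: any run of whitespace becomes one hyphen
    slug = re.sub(r"\s+", "-", slug.strip()).lower()
    return f"{BASE_URL}{platform}/{slug}"


print(build_metacritic_url("Enter the Gungeon", "playstation-4"))
```

For the example game this yields the URL quoted above. Whether a page actually exists at a generated URL still has to be checked by requesting it.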

In [3]:
def extract_mc_data(series):
    url = 'https://www.metacritic.com/game/'
    name = series['name'].replace('.','').replace(',','').replace('-','').replace('\'','').replace(':','').replace(' ','-')
    #removing any symbols from the game title that do not appear in metacritic links,
    #replacing any spaces with '-'
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
    #Making sure that requests are sent as if they were done through a browser,
    #otherwise metacritic does not return the correct page.
    full_url = url + series['platform'] + '/' + name.lower() + '/'
    #creating the metacritic link, making sure name is in all lowercase as the metacritic links are case-sensitive
    response = session.get(full_url)
    parser = BeautifulSoup(response.content, 'html.parser')
    #downloading the full HTML page for each game and feeding it into an HTML parser
    
    review_count_elements = parser.find("span", class_="count")
    if review_count_elements is not None:
        review_count_text = review_count_elements.text.strip()
        review_count = int(re.findall(r'[\d]+', review_count_text)[0])
    else:
        review_count = 0
    #Checking if the page contains the review count element ('based on x Critic Reviews')
    #if it does, extracting the number of reviews using a simple regex and converting to int
    #if the game has 0 reviews, or the page for the game does not exist, review count is set to 0
    
    score_elements = parser.find("span", itemprop="ratingValue")
    if score_elements is not None:
        score_text = score_elements.text.strip()
        score = int(score_text)
    else:
        score = 0
    #Checking if the page contains the 'ratingValue' element
    #if it does, extracting the critic score and converting to int
    #if the game has 0 reviews, or the page for the game does not exist, critic score is set to 0
    return pd.Series([series['name'], series['platform'], score, review_count], index =['name', 'platform', 'score', 'total_reviews'])

In [4]:
print(extract_mc_data(list_of_games.iloc[34]))
#testing to see if function works

name             Enter the Gungeon
platform             playstation-4
score                           82
total_reviews                   33
dtype: object
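
The review count step inside `extract_mc_data` relies on a small regex. Here is a minimal sketch of that step in isolation, assuming the element text looks like 'based on x Critic Reviews' as the code comment describes; `parse_review_count` is a hypothetical helper name.

```python
import re


def parse_review_count(text):
    """Hypothetical helper: return the first integer found in text, or 0 if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


print(parse_review_count("based on 33 Critic Reviews"))
print(parse_review_count(""))
```

One difference from the function above: `re.findall(...)[0]` raises an `IndexError` when the element exists but holds no digits, while this version falls back to 0.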


In [5]:
subset_10 = list_of_games.head(10)

In [6]:
subset_10 = subset_10.apply(extract_mc_data, axis='columns')

In [7]:
print(subset_10)

                              name       platform  score  total_reviews
0                             Abzu  playstation-4     78             72
1                  Alien Isolation  playstation-3      0              0
2  Animal Crossing Amiibo Festival          wii-u     46             20
3     Animal Crossing New Horizons         switch     90            111
4                 Assassin's Creed             pc      0              0
5                        Bayonetta          wii-u     86             18
6                      Bayonetta 2          wii-u     91             80
7                    Balloon Fight            NES      0              0
8             Batman Arkham Asylum  playstation-3     91             70
9                       BloodBorne  playstation-4     92            100


In [8]:
list_of_games = list_of_games.apply(extract_mc_data, axis='columns')

In [9]:
list_of_games.head(10)

| | name | platform | score | total_reviews |
| --- | --- | --- | --- | --- |
| 0 | Abzu | playstation-4 | 78 | 72 |
| 1 | Alien Isolation | playstation-3 | 0 | 0 |
| 2 | Animal Crossing Amiibo Festival | wii-u | 46 | 20 |
| 3 | Animal Crossing New Horizons | switch | 90 | 111 |
| 4 | Assassin's Creed | pc | 0 | 0 |
| 5 | Bayonetta | wii-u | 86 | 18 |
| 6 | Bayonetta 2 | wii-u | 91 | 80 |
| 7 | Balloon Fight | NES | 0 | 0 |
| 8 | Batman Arkham Asylum | playstation-3 | 91 | 70 |
| 9 | BloodBorne | playstation-4 | 92 | 100 |


In [10]:
list_of_games.tail(10)

| | name | platform | score | total_reviews |
| --- | --- | --- | --- | --- |
| 516 | Burnout Paradise: The Ultimate Box | pc | 87 | 26 |
| 517 | Peggle | pc | 0 | 0 |
| 518 | Bejeweled 3 | pc | 82 | 27 |
| 519 | Dragon Age: Origins | pc | 91 | 67 |
| 520 | Syndicate | pc | 69 | 16 |
| 521 | Mass Effect 2 | pc | 94 | 55 |
| 522 | Medal of Honor Pacific Assault | pc | 80 | 43 |
| 523 | SteamWorld Dig | pc | 76 | 12 |
| 524 | Plants vs. Zombies Game of the Year Edition | pc | 0 | 0 |
| 525 | The Sims 4 | pc | 70 | 75 |


In [11]:
list_of_games[list_of_games['score'] == 0].head(15)

| | name | platform | score | total_reviews |
| --- | --- | --- | --- | --- |
| 1 | Alien Isolation | playstation-3 | 0 | 0 |
| 4 | Assassin's Creed | pc | 0 | 0 |
| 7 | Balloon Fight | NES | 0 | 0 |
| 10 | Bubble Bobble | NES | 0 | 0 |
| 13 | Castlevania | NES | 0 | 0 |
| 14 | Castlevania II: Simon's Quest | NES | 0 | 0 |
| 15 | Castlevania: Symphony of the Night | playstation-3 | 0 | 0 |
| 16 | Colin McRae Dirt | playstation-3 | 0 | 0 |
| 27 | Donkey Kong | NES | 0 | 0 |
| 28 | Donkey Kong Jr. | NES | 0 | 0 |


Web scraping from Metacritic was a success! Now my game list contains both the critic score and the number of reviews. Unfortunately, a number of games have 0 reviews and a score of 0. In some cases there is simply no page for the game on Metacritic. For example, I own an original copy of Assassin's Creed on PC, but Metacritic does not have a page for the original version of the game, only for the Director's Cut re-release. I consider these different enough that I do not want to use one score to fill in gaps in the other, so I'll leave them as 0 for now.

For some games Metacritic does have a page, but there are no reviews for the specific version I own. For example, Alien Isolation on PlayStation 3 has 0 critic reviews on Metacritic.

Another issue is that Metacritic's coverage seems to start around the sixth generation of consoles. As a result, the games I own for the original PlayStation, Game Boy, Game Boy Color, Nintendo Entertainment System, or Nintendo 64 come back with a score of 0 and 0 reviews.
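
To get a feel for how the gaps are spread across platforms, the zero-score rows can be tallied by platform. This is a minimal sketch in plain Python that uses only the ten rows shown in the output above, not the full list.

```python
from collections import Counter

# (name, platform) pairs copied from the zero-score output above
zero_score_rows = [
    ("Alien Isolation", "playstation-3"),
    ("Assassin's Creed", "pc"),
    ("Balloon Fight", "NES"),
    ("Bubble Bobble", "NES"),
    ("Castlevania", "NES"),
    ("Castlevania II: Simon's Quest", "NES"),
    ("Castlevania: Symphony of the Night", "playstation-3"),
    ("Colin McRae Dirt", "playstation-3"),
    ("Donkey Kong", "NES"),
    ("Donkey Kong Jr.", "NES"),
]

# Count how many zero-score games each platform contributes
missing_by_platform = Counter(platform for _, platform in zero_score_rows)
for platform, count in missing_by_platform.most_common():
    print(f"{platform}: {count}")
```

On the full DataFrame, `list_of_games.loc[list_of_games['score'] == 0, 'platform'].value_counts()` should give the same kind of tally.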

### Getting HowLongToBeat Data

In [12]:
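def extract_hltb_data(name):
    #searching HowLongToBeat for the game title
    matching_games = hltb().search(name)
    #[main story time, main story + extras time, average of the two]
    gametime = [0, 0, 0]
    if matching_games is not None:
        for game in matching_games:
            #only accepting an exact title match, otherwise the times stay at 0
            if name == game.game_name:
                gametime[0] = game.main_story
                gametime[1] = game.main_extra

    gametime[2] = (gametime[0] + gametime[1]) / 2
    return pd.Series(gametime)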

In [13]:
print(extract_hltb_data(list_of_games.iloc[216]['name']))

0    10.84
1    10.94
2    10.89
dtype: float64


In [None]:
list_of_games[['main_story', 'main_extra', 'avg_playthrough']] = list_of_games['name'].apply(extract_hltb_data)

In [None]:
list_of_games.tail(10)

In [None]:
#min-max normalizing the critic score to the 0-1 range
score_min, score_max = list_of_games['score'].min(), list_of_games['score'].max()
list_of_games['normalized_score'] = (list_of_games['score'] - score_min) / (score_max - score_min)

In [None]:
#min-max normalizing the number of reviews to the 0-1 range
reviews_min, reviews_max = list_of_games['total_reviews'].min(), list_of_games['total_reviews'].max()
list_of_games['normalized_reviews'] = (list_of_games['total_reviews'] - reviews_min) / (reviews_max - reviews_min)

In [None]:
#min-max normalizing the average playthrough time to the 0-1 range
time_min, time_max = list_of_games['avg_playthrough'].min(), list_of_games['avg_playthrough'].max()
list_of_games['normalized_time'] = (list_of_games['avg_playthrough'] - time_min) / (time_max - time_min)

In [None]:
#weighted sum: 50% critic score, 30% number of reviews, 20% playthrough time, scaled to 0-100
list_of_games['playability_score'] = (
    list_of_games['normalized_score'] * 0.5
    + list_of_games['normalized_reviews'] * 0.3
    + list_of_games['normalized_time'] * 0.2
) * 100
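
The normalization and weighting can be checked by hand on a single game. Below is a minimal sketch in plain Python. The weights (0.5, 0.3, 0.2) are the ones used in this notebook, but the column ranges and the game's values are made up for illustration, and `min_max` and `playability` are hypothetical helper names.

```python
def min_max(value, lowest, highest):
    """Scale value into the 0-1 range given a column's minimum and maximum."""
    if highest == lowest:
        # Assumption for this sketch: a constant column contributes nothing
        return 0.0
    return (value - lowest) / (highest - lowest)


def playability(score, reviews, hours, bounds):
    """Weighted sum of the three normalized values, scaled to 0-100."""
    weights = {"score": 0.5, "reviews": 0.3, "hours": 0.2}
    values = {"score": score, "reviews": reviews, "hours": hours}
    total = sum(weights[key] * min_max(values[key], *bounds[key]) for key in weights)
    return total * 100


# Hypothetical (min, max) for each column, not taken from the real data
bounds = {"score": (0, 100), "reviews": (0, 100), "hours": (0, 50)}

# 0.5 * 0.8 + 0.3 * 0.5 + 0.2 * 0.2 = 0.59, so the result is about 59
result = playability(score=80, reviews=50, hours=10, bounds=bounds)
print(round(result, 2))
```

Because playthrough time carries a positive weight, a longer game scores higher under this formula, all else being equal.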

In [None]:
list_of_games.sort_values(['playability_score'], inplace=True, ascending=False)
list_of_games[['name', 'platform', 'playability_score']].head(15)

In [None]:
list_of_games.to_csv(r"C:\Users\ricar\Documents\Python Scripts\Game Backlog\backlog.csv")