<a href="https://colab.research.google.com/github/timrodz/google-colab-notebooks/blob/master/scraper-itch-io.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# itch.io Game Recommends!

> To run this tool, click **Runtime > Run all**. It can take up to a minute, because it has to install packages and fetch every game from the website (there are more than 1,500!).
> 
> Once it finishes, head to the **Results** section, where you can see your recommended games!

Pro-tip: you can toggle any heading's visibility on/off.

## Setting up

This section is purely technical. It uses `Selenium` and `BeautifulSoup` to scrape the itch.io website.

In [1]:
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import random

from selenium import webdriver
# The driver created below is a Chrome driver, so import the Chrome WebDriver type
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from bs4 import BeautifulSoup as BSoup


In [0]:
class SeleniumObject(object):
	"""
	Selenium object with init parameters
	"""
	default_user_agent = [
		'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36',
		'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
		'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'
	]

	driver = None

	def __init__(self):
		options = webdriver.ChromeOptions()
		options.add_argument('--headless')
		options.add_argument('--window-size=1280,720')
		options.add_argument('--disable-gpu')
		options.add_argument('--disable-dev-shm-usage')
		options.add_argument('--no-sandbox')
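		# Pick one of the user agents at random. No inner quotes: the argument is
		# not parsed by a shell, so quotes may end up inside the header value.
		options.add_argument('--user-agent={}'.format(
			random.choice(self.default_user_agent))
		)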
		self.options = options

	def init_driver(self):
		self.driver = webdriver.Chrome(options=self.options)

	def add_option(self, name: str) -> None:
		self.options.add_argument(name)

	def add_experimental_option(self, name: str, value: object) -> None:
		self.options.add_experimental_option(name, value)

	def get_driver(self, connect_session=None) -> 'WebDriver | None':
		# Returns None until init_driver() has been called
		if not self.driver:
			return None
		return connect_session(self.driver) if connect_session else self.driver

def get_driver() -> WebDriver:
	prefs = {'profile.managed_default_content_settings.images': 2, 'disk-cache-size': 4096}
	so = SeleniumObject()
	so.add_option('--incognito')
	so.add_experimental_option('prefs', prefs)
	so.init_driver()
	return so.get_driver()

## Acquiring and formatting the data

### Step 1: Fetch

We will fetch data from https://itch.io/b/520/bundle-for-racial-justice-and-equality. It contains all game entries inside one `div` element.

The itch.io page uses infinite scrolling: more games load each time you reach the bottom. To get every game, we keep scrolling until the page height stops growing, which means we have reached the true end of the page. Source: [Stack Overflow](https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python).

In [0]:
endpoint = 'https://itch.io/b/520/bundle-for-racial-justice-and-equality'
driver = get_driver()
driver.get(endpoint)

Now we define a function that loads the entire webpage with Selenium, then uses BeautifulSoup to extract what we need.

In [0]:
def get_game_list(driver: WebDriver):
  """
  Scroll to the bottom of the page, then return the element that holds all games.
  The games live in the container matching this XPath:
  //div[@class="grid_sizer_children"]
  """
  scroll_pause_time = 0.5

  # Get scroll height
  last_height = driver.execute_script("return document.body.scrollHeight")

  while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(scroll_pause_time)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
      break
    last_height = new_height

  bsj = BSoup(driver.page_source, 'html.parser')

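  # The single container div that holds every game entry
  element = bsj.find('div', {'class': 'grid_sizer_children'})
  if element is None:
    raise RuntimeError('Could not find the "grid_sizer_children" container; the page layout may have changed.')

  return element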
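
The scrolling loop stops the first time the page height is unchanged, which can end early if a batch of games takes longer than `scroll_pause_time` to load. Below is a minimal sketch of a more forgiving variant. The names `scroll_to_bottom`, `max_rounds`, and `retries` are made up for this example and are not part of the notebook; the function only assumes the driver has an `execute_script` method, as Selenium drivers do.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=200, retries=2):
    """Scroll until the page height stops growing, or until max_rounds is hit.

    `max_rounds` and `retries` are illustrative parameters (assumptions).
    Returns the number of scrolls that actually made the page taller.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    grown = 0    # scrolls that loaded more content
    stalled = 0  # consecutive scrolls that did not change the height

    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            stalled += 1
            if stalled > retries:
                break  # unchanged several times in a row: assume the end
        else:
            stalled = 0
            grown += 1
            last_height = new_height

    return grown
```

With a real driver this would be called as `scroll_to_bottom(driver)` before reading `driver.page_source`. The `max_rounds` cap keeps the loop from running forever if the page never settles.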

In [0]:
games = get_game_list(driver)

In [6]:
# Let's see how many games we have!
len(games)

1509

### Step 2: Format

Here is where we turn the data into a human-readable format. This step relies heavily on understanding how the website's HTML is structured before attempting to find elements. The easiest way is to look up an element's tag together with its class, id, or a custom attribute!

In [0]:
formatted_games = []

for g in games:
  # Game information
  label = g.find('div', {'class': 'label'})
  label_data = label.find('a', href=True)
  game_name = label_data.text
  game_url = label_data['href']

  short_text = g.find('div', {'class': 'sub short_text'})
  game_description = short_text.text if short_text else ''

  # Developer information
  user_row = g.find('div', {'class': 'user_row'})
  user_row_data = user_row.find('a', href=True)
  developer_name = user_row_data.text
  developer_url = user_row_data['href']
  
  # Add the games to a list that will contain all the formatted values
  formatted_games.append(dict(
      game_name=game_name,
      game_url=game_url,
      game_description=game_description,
      developer_name=developer_name,
      developer_url=developer_url,
  ))

In [8]:
# Show the first 3 values to make sure they look ok
formatted_games[:3]

[{'developer_name': 'Finji',
  'developer_url': 'https://finji.itch.io',
  'game_description': 'A squad-based survival strategy game with procedurally generated levels set in post-apocalyptic North America.',
  'game_name': 'Overland',
  'game_url': 'https://finji.itch.io/overland'},
 {'developer_name': 'Finji',
  'developer_url': 'https://finji.itch.io',
  'game_description': 'At the end of everything, hold onto anything.',
  'game_name': 'Night in the Woods',
  'game_url': 'https://finji.itch.io/night-in-the-woods'},
 {'developer_name': 'Kenney',
  'developer_url': 'https://kenney.itch.io',
  'game_description': '20,000+ game assets for use in your games!',
  'game_name': 'Kenney Game Assets 1',
  'game_url': 'https://kenney.itch.io/kenney-game-assets-1'}]
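
Once the list is built, a quick sanity check can catch scraping gaps. The sketch below counts entries with an empty description and lists any repeated game names, which matters because Step 3 tracks recommendations by `game_name`. The helper `summarise_games` is a hypothetical name, and the last two sample entries are made up purely to exercise the checks.

```python
from collections import Counter


def summarise_games(games):
    """Report simple quality checks on a list of formatted game dicts."""
    name_counts = Counter(g['game_name'] for g in games)
    return {
        'total': len(games),
        'missing_description': sum(1 for g in games if not g['game_description']),
        'duplicate_names': sorted(n for n, c in name_counts.items() if c > 1),
    }


sample = [
    {'game_name': 'Overland',
     'game_description': 'A squad-based survival strategy game with procedurally generated levels set in post-apocalyptic North America.'},
    {'game_name': 'Night in the Woods',
     'game_description': 'At the end of everything, hold onto anything.'},
    # The two entries below are invented for illustration only
    {'game_name': 'Example Game', 'game_description': ''},
    {'game_name': 'Example Game', 'game_description': 'A made-up duplicate.'},
]

summary = summarise_games(sample)
print(summary)
```

In the notebook, `summarise_games(formatted_games)` would give the same report for the real data.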

### Step 3: Creating the game generator

In [0]:
# Keep track of which games are being recommended, so we don't repeat them
recommended_games = []

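def get_random_game() -> dict:
  """
  Picks a game that has not been recommended yet, using `random.choice`
  https://docs.python.org/3/library/random.html#random.choice
  """
  # Create a temporary list that excludes games which have already been recommended
  temp_game_list = [x for x in formatted_games if x['game_name'] not in recommended_games]
  if not temp_game_list:
    # `random.choice` raises IndexError on an empty list, so fail with a clearer message
    raise RuntimeError('Every game has already been recommended; clear `recommended_games` to start over.')
  game = random.choice(temp_game_list)

  # Add the newly recommended game to the list
  recommended_games.append(game['game_name'])
  return game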
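
An alternative way to avoid repeats is to shuffle the list once and hand games out one at a time, instead of rebuilding a filtered list on every call. This is only a sketch of that idea, not the notebook's approach; `make_recommender` and `seed` are hypothetical names, and the sample list reuses three game names from the Step 2 output.

```python
import random


def make_recommender(games, seed=None):
    """Return a function that yields each game exactly once, in random order."""
    rng = random.Random(seed)  # `seed` is optional and only makes runs repeatable
    remaining = list(games)    # copy, so the caller's list is left untouched
    rng.shuffle(remaining)

    def next_game():
        # Returns None once every game has been handed out
        return remaining.pop() if remaining else None

    return next_game


sample_games = [
    {'game_name': 'Overland'},
    {'game_name': 'Night in the Woods'},
    {'game_name': 'Kenney Game Assets 1'},
]

next_game = make_recommender(sample_games, seed=42)
picks = [next_game() for _ in range(4)]
```

The first three picks cover every game once; the fourth is `None` because the list is exhausted. In the notebook this would be created with `make_recommender(formatted_games)`.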

## Results

TL;DR: This is where you get your game recommendations. Run the cell below to get a different game every time!

In [10]:
# Run this cell to get a new random game!
rnd_game = get_random_game()
print(rnd_game)

{'game_name': 'The Testimony of Trixie Glimmer Smith', 'game_url': 'https://digital-poppy.itch.io/trixie', 'game_description': 'Trixie Glimmer Smith finds a haunted book. Hijinks ensue.', 'developer_name': 'Digital Poppy', 'developer_url': 'https://digital-poppy.itch.io'}
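
The raw dictionary is fine for checking the data, but a short formatted block is easier to read. Here is a minimal sketch; `describe_game` is a hypothetical helper, and it assumes the dictionary keys produced in Step 2.

```python
def describe_game(game):
    """Render one formatted game dict as a short, readable block of text."""
    lines = ['{} by {}'.format(game['game_name'], game['developer_name'])]
    if game['game_description']:
        # Skip the description line when the game has none
        lines.append(game['game_description'])
    lines.append(game['game_url'])
    return '\n'.join(lines)


example = {
    'game_name': 'The Testimony of Trixie Glimmer Smith',
    'game_url': 'https://digital-poppy.itch.io/trixie',
    'game_description': 'Trixie Glimmer Smith finds a haunted book. Hijinks ensue.',
    'developer_name': 'Digital Poppy',
    'developer_url': 'https://digital-poppy.itch.io',
}

print(describe_game(example))
```

In the Results cell, `print(describe_game(rnd_game))` could replace `print(rnd_game)`.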


In [11]:
# Keep track of the recommended games - Run this cell to refresh the list
recommended_games

['The Testimony of Trixie Glimmer Smith']