# Video Game Classifier Project
Tyler Short and Gideon Keith-Stanley

### Background
The PC Games 2020 dataset contains the results of scraping and sorting the entire catalog of Valve's "Steam" video game store, and includes data on over 27,000 titles. These data include title, description, genre, price points, several success metrics, and more. We hypothesize that a machine learning model trained on "bag of words" features built from each game's description, the same representation used in email spam filters, can classify games by genre.
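
To make the "bag of words" idea concrete, here is a minimal sketch on three made-up descriptions. The helper names `build_vocabulary` and `to_binary_vector`, the toy sentences, and the lowercasing step are all assumptions for illustration; they are not part of the dataset or of the notebook's own pipeline.

```python
def build_vocabulary(documents):
    """Return a sorted list of every distinct lowercase token in the documents."""
    vocabulary = set()
    for text in documents:
        vocabulary.update(text.lower().split())
    return sorted(vocabulary)


def to_binary_vector(text, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in the text."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical descriptions, not taken from the dataset.
toy_descriptions = [
    "fast paced space shooter",
    "relaxing farm simulator",
    "space farm simulator",
]

vocabulary = build_vocabulary(toy_descriptions)
vectors = [to_binary_vector(text, vocabulary) for text in toy_descriptions]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each description becomes a fixed-length vector whose positions correspond to vocabulary words, which is the kind of input a classifier can consume. Word order is discarded, so only the presence of a word carries information.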

In [53]:
import numpy as np
import pandas as pd

### Data Loader
This routine downloads the dataset from OpenML.org and parses it with the liac-arff package (imported as `arff`).

In [54]:
import arff

from urllib.request import urlretrieve

def _read_arff(path):
    # Parse an ARFF file and return (data, attributes) as NumPy arrays.
    # The with-block closes the file handle once parsing is done.
    with open(path, 'r') as f:
        dataset = arff.load(f)
    attributes = np.array(dataset['attributes'])
    data = np.array(dataset['data'])
    return data, attributes

def load_game_data():
    # Download the dataset from OpenML, then parse the local copy.
    url = 'https://api.openml.org/data/v1/download/22102514/PC-Games-2020.arff'
    filename = 'pc_game_dataset.arff'
    file, http_response = urlretrieve(url, filename)
    return _read_arff(file)

# Use this to save bandwidth and time if the project has the data file in the /data folder
def load_game_data_from_file():
    return _read_arff('data/pc_game_dataset.arff')
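
The two loaders above can be combined so the download happens only when no local copy exists. The following is a sketch under stated assumptions: `ensure_local_copy` and its `fetch` parameter are hypothetical names introduced here, and `fetch` defaults to `urlretrieve` so a stand-in can be swapped in for offline use.

```python
import os
from urllib.request import urlretrieve


def ensure_local_copy(url, path, fetch=urlretrieve):
    """Download url to path only if path does not already exist; return path."""
    if not os.path.exists(path):
        directory = os.path.dirname(path)
        if directory:
            # Create the target folder (for example 'data/') on first use.
            os.makedirs(directory, exist_ok=True)
        fetch(url, path)
    return path
```

Calling `ensure_local_copy(url, 'data/pc_game_dataset.arff')` before `load_game_data_from_file()` would then download the file once and reuse it on later runs.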
    

In [55]:
data, attributes = load_game_data()

### Preprocessing
This code converts the raw dataset into the arrays the model needs: a multi-hot genre label matrix `y` and a binary bag-of-words feature matrix `X`.

In [58]:
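GENRE_COL = 6  # column index of the comma-separated genre string

# Collect every distinct genre name across the dataset.
results = set()
genres = data[:, GENRE_COL]
for entry in genres:
    for term in str(entry).split(','):
        results.add(term.strip())
y_headers = sorted(results)  # sorted so the column order is reproducible

# Map each genre to its column once instead of calling list.index() per lookup.
genre_to_col = {genre: col for col, genre in enumerate(y_headers)}

# Multi-hot encode: y[i, j] == 1 if game i is tagged with genre j.
y = np.zeros((len(genres), len(y_headers)), dtype=int)
for row, entry in enumerate(genres):
    for genre in str(entry).split(','):
        y[row, genre_to_col[genre.strip()]] = 1

# y is now our multi-hot label matrix: one row per game, one column per genre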
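
As a worked example of the genre encoding above, the sketch below runs the same steps on three made-up genre strings. The function name `multi_hot_encode` and the toy values are assumptions for illustration only.

```python
def multi_hot_encode(genre_strings):
    """Return (headers, rows) where rows[i][j] is 1 if item i has genre headers[j]."""
    split_genres = [[g.strip() for g in s.split(',')] for s in genre_strings]
    headers = sorted({g for genres in split_genres for g in genres})
    column = {g: j for j, g in enumerate(headers)}
    rows = []
    for genres in split_genres:
        row = [0] * len(headers)
        for g in genres:
            row[column[g]] = 1
        rows.append(row)
    return headers, rows


# Hypothetical genre strings in the same comma-separated style as the dataset column.
toy_genres = ["Action, Indie", "Strategy", "Indie, Strategy"]
headers, rows = multi_hot_encode(toy_genres)

print(headers)  # ['Action', 'Indie', 'Strategy']
print(rows)     # [[1, 1, 0], [0, 0, 1], [0, 1, 1]]
```

Because a game can carry several genres, each row may contain more than one 1, which makes this a multi-label problem rather than a single-class one.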



In [None]:
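DESCRIPTION_COL = 25  # column index of the description text
descriptions = data[:, DESCRIPTION_COL]

# Build the vocabulary ("bag"): every distinct whitespace-separated token.
# split() with no argument also drops the empty tokens that split(' ')
# produces for repeated spaces.
bag = set()
for entry in descriptions:
    bag.update(str(entry).split())
x_headers = sorted(bag)  # sorted so the column order is reproducible

# Map each word to its column once; list.index() would rescan the whole
# vocabulary for every word of every description.
word_to_col = {word: col for col, word in enumerate(x_headers)}

# Binary presence matrix: X[i, j] == 1 if word j appears in description i.
# A dense (games x vocabulary) matrix can be very large, so use a 1-byte dtype.
X = np.zeros((len(descriptions), len(x_headers)), dtype=np.uint8)
for row, entry in enumerate(descriptions):
    for word in str(entry).split():
        X[row, word_to_col[word]] = 1

print(X[0])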
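
A dense feature matrix stores one entry per game per vocabulary word, and almost all of those entries are zero. One possible way to cut memory use is to keep, for each description, only the column numbers of the words it contains. The sketch below shows that idea on made-up text. The names `build_index` and `to_sparse_rows` are hypothetical, and this is an illustration of the representation, not the notebook's method. scikit-learn's `CountVectorizer` with `binary=True` returns a SciPy sparse matrix along similar lines.

```python
def build_index(documents):
    """Map every distinct whitespace-separated token to a column number."""
    vocabulary = sorted({word for text in documents for word in text.split()})
    return {word: col for col, word in enumerate(vocabulary)}


def to_sparse_rows(documents, index):
    """For each document, keep only the sorted column numbers of the words it contains."""
    return [sorted({index[word] for word in text.split()}) for text in documents]


# Hypothetical descriptions, not taken from the dataset.
toy_descriptions = ["build a farm", "defend a space farm"]
index = build_index(toy_descriptions)
sparse_rows = to_sparse_rows(toy_descriptions, index)

print(index)
print(sparse_rows)
```

Each row then costs memory proportional to the number of distinct words in that description rather than to the size of the whole vocabulary.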