# Video Game Classifier Project
Tyler Short and Gideon Keith-Stanley

### Background
The PC Games 2020 dataset contains the results of scraping and sorting the entire catalog of Valve's Steam video game store, and includes data on over 27,000 titles. The data include title, description, genre, price points, several success metrics, and more. We hypothesize that the "bag of words" method used in email spam filters lets us train a machine learning model on the vectorized descriptions of video games and then use that model to classify games by genre.
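To make the "bag of words" idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration (whitespace tokenization, no stop-word removal), not the vectorizer this notebook uses, and the function name `bag_of_words` and the sample descriptions are hypothetical.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of strings.

    Each row of count_matrix counts how often each vocabulary word
    occurs in the corresponding document; word order is discarded.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical game descriptions, used only for illustration
descriptions = ["Fast racing game", "Puzzle game with racing puzzle levels"]
vocabulary, matrix = bag_of_words(descriptions)
print(vocabulary)
print(matrix)
```

Each description becomes a fixed-length vector of word counts, which is the kind of numeric input a classifier can consume.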

In [1]:
import numpy as np
import pandas as pd

### Data Loader
This routine downloads the dataset from OpenML.org and parses it with liac-arff (imported as `arff`).

In [2]:
import arff

from urllib.request import urlretrieve

def load_game_data():
    url = 'https://api.openml.org/data/v1/download/22102514/PC-Games-2020.arff'
    filename = 'pc_game_dataset.arff'
    file, http_response = urlretrieve(url, filename)
    # Parse the ARFF file; the context manager closes the handle afterwards
    with open(file, 'r') as f:
        dataset = arff.load(f)
    attributes = np.array(dataset['attributes'])
    data = np.array(dataset['data'])
    return data, attributes

# Use this to save bandwidth and time if the project has the data file already downloaded
def load_game_data_from_file():
    file = 'pc_game_dataset.arff'
    # Parse the ARFF file; the context manager closes the handle afterwards
    with open(file, 'r') as f:
        dataset = arff.load(f)
    attributes = np.array(dataset['attributes'])
    data = np.array(dataset['data'])
    return data, attributes
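
The two loaders above differ only in whether they download first. One possible way to fold that choice into a single step is a small caching helper, sketched below. The name `ensure_local_copy` and its `downloader` parameter are hypothetical and are not part of the notebook.

```python
import os
from urllib.request import urlretrieve


def ensure_local_copy(url, filename, downloader=urlretrieve):
    """Download `url` to `filename` only if the file is not already present.

    `downloader` defaults to urllib's urlretrieve; it is a parameter so
    the helper can be exercised without network access.
    """
    if not os.path.exists(filename):
        downloader(url, filename)
    return filename
```

With a helper like this, the ARFF file would be fetched on the first run and reused on later runs.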
    

In [3]:
data, attributes = load_game_data()

## Preprocessing
This code vectorizes the genre column (index 6) into the label matrix `y` and the description column (index 25) into the bag-of-words feature matrix `X`, which is the form the model needs.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

In [5]:
A, b = load_game_data_from_file()

In [6]:
# Vectorize the genre column (index 6) into a per-game matrix of genre-token counts
genre_docs = [str(n) for n in A[:,6]]
cv = CountVectorizer(lowercase=True, stop_words='english')
cv_result_a = cv.fit_transform(genre_docs)
y = cv_result_a.toarray()

In [7]:
# Vectorize the description column (index 25); X is kept as a sparse matrix
docs = [str(n) for n in A[:,25]]
cv = CountVectorizer(lowercase=True, stop_words='english')
cv_result = cv.fit_transform(docs)
X = cv_result

In [8]:
X.shape, y.shape

((30250, 124736), (30250, 26))
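
One thing worth checking about the 26 label columns: `CountVectorizer`'s default tokenizer splits text into individual words, so a multi-word genre name would be broken into separate tokens rather than kept as one label. The sketch below assumes the genre column holds comma-separated genre names, which this notebook does not show, so treat the sample strings as hypothetical. It passes a custom tokenizer (`split_genres`, a hypothetical helper) so that each genre becomes one binary column.

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_genres(text):
    """Split a comma-separated genre string into whole genre names."""
    return [genre.strip() for genre in text.split(',') if genre.strip()]


# Hypothetical genre strings; the real column format is an assumption
genre_samples = ['Action, Free to Play', 'Indie, Action', 'Massively Multiplayer']

# token_pattern=None because the custom tokenizer replaces the default pattern;
# binary=True records presence (0/1) rather than counts
vectorizer = CountVectorizer(tokenizer=split_genres, lowercase=True,
                             binary=True, token_pattern=None)
y_demo = vectorizer.fit_transform(genre_samples).toarray()

# Column labels ordered by column index
labels = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(labels)
print(y_demo)
```

Each row is then a multi-label indicator vector with one column per whole genre name.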

## Multi-Class LinearSVC


In [16]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

classifier = OneVsRestClassifier(LinearSVC(loss='hinge', max_iter=10000))

The cell below takes almost 15 minutes to run and yields a mean 3-fold cross-validation accuracy of about 0.151.

In [15]:
accuracy = cross_val_score(classifier, X, y, cv=3, scoring='accuracy')
print(np.mean(accuracy))



0.15097520661157027
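
A note on reading this number: when `y` is a multi-label indicator matrix, scikit-learn's `'accuracy'` scoring is subset accuracy, so a sample counts as correct only if every one of its labels is predicted correctly. That may partly explain the low score, although the notebook does not verify this. The toy arrays below are made up for illustration and use NumPy only.

```python
import numpy as np

# Made-up multi-label targets and predictions (4 samples, 3 labels)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Subset accuracy: a row is correct only if all of its labels match
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Per-label accuracy: fraction of individual label decisions that match
per_label_accuracy = np.mean(y_true == y_pred)

print(subset_accuracy, per_label_accuracy)
```

Here only two of four rows match exactly, while ten of twelve individual label decisions are correct, so the two measures can diverge sharply on the same predictions.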


In [20]:
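# Dimensionality-reduction experiment (left disabled).
# fit_transform returns the reduced samples with shape (n_samples, 200);
# components_ has shape (200, n_features) and cannot be paired with y.
# Depending on the scikit-learn version, PCA may reject sparse input;
# TruncatedSVD is the usual alternative in that case.
# from sklearn.decomposition import PCA

# pca = PCA(200)
# reduced_X = pca.fit_transform(X)

# accuracy = cross_val_score(classifier, reduced_X, y, cv=2, scoring='accuracy')
# print(np.mean(accuracy))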

The cell below takes almost 15 minutes to run and yields a mean 3-fold cross-validation accuracy of about 0.163 with 50 estimators.

In [19]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
accuracy = cross_val_score(classifier, X, y, cv=3, scoring='accuracy')
print(np.mean(accuracy))

0.1626789757927357
