###Giant Bomb Text Mining
In this notebook, we are going to access the Giant Bomb API via Python and use game descriptions to try to identify common genres, then classify games accordingly using clustering techniques.  First, let's import our tools.

In [1]:
from __future__ import print_function
import pandas as pd
import numpy as np
import requests
import nltk 
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

Next, we're going to define a function to pull down our game descriptions from the Giant Bomb API.  Note that the Giant Bomb API only lets us pull down 100 games at a time, so we need to use the "**offset**" argument to iterate through the whole library.

In [2]:
#Define a function that returns a JSON object from the Giant Bomb API
def pull_info(resource_type, resource_id='', offset=0):
    bomb = 'http://www.giantbomb.com/api/'
    resource_type = resource_type + '/'
    if resource_id != "":
        resource_id = resource_id + '/'
    api_key = '?api_key=98fade4b69e9695cee10f3d8a8f9cb69a5d03c55'
    json = '&format=json'
    offset = '&offset=' + str(offset)
    
    bomb += resource_type
    bomb += resource_id
    bomb += api_key
    bomb += json
    bomb += offset
    
    bomb_output = requests.get(bomb)
    bomb_json = bomb_output.json()
    
    return bomb_json

Next, we create another function to actually create our dictionary of game information, using the "**pull_info**" function we defined above.  Then we call the function and create our dictionary, "**game_data**."

In [3]:
#Define a function that uses our JSON object to create our data
def create_full_db(resource_type, resource_id='', offset=0):
    game_dict = {'names': [], 'decks': [], 'platforms' : [], 'releases' : []}
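    #Fetch the first page only to learn how many results exist in total
    bomb_length = pull_info(resource_type, resource_id, 0)
    #Floor division keeps the page count an integer on both Python 2 and 3
    pages = (bomb_length['number_of_total_results'] // 100) + 1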
    for page in range(pages):
        bomb_data = pull_info(resource_type, resource_id, page*100)        
        for game in bomb_data['results']:
            if game['deck'] is not None:
                game_dict['names'].append(game['name'])
                game_dict['decks'].append(game['deck'])
                game_dict['platforms'].append(game['platforms'])
                game_dict['releases'].append(game['original_release_date'])
    return game_dict
game_data = create_full_db('games')  
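
The paging arithmetic inside **create_full_db** can be checked in isolation.  The sketch below is only an illustration: `page_offsets` is a hypothetical helper, not part of the Giant Bomb API, and the page size of 100 is simply taken from the code above.

```python
def page_offsets(total_results, page_size=100):
    """Return the offset to request for each page of results."""
    # Floor division plus one mirrors the page count used in create_full_db
    pages = total_results // page_size + 1
    return [page * page_size for page in range(pages)]

# 250 results need three requests, starting at offsets 0, 100 and 200
print(page_offsets(250))
```

Note that when the total is an exact multiple of the page size, this arithmetic asks for one extra page, which simply comes back empty.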

Let's take a quick look at what our dictionary looks like by reading it into a Pandas DataFrame.

In [4]:
pd.DataFrame(game_data).head()

Unnamed: 0,decks,names,platforms,releases
0,A top-down isometric helicopter shoot 'em up o...,Desert Strike: Return to the Gulf,[{u'api_detail_url': u'http://www.giantbomb.co...,1992-02-29 00:00:00
1,Breakfree is a block-breaking game that is sim...,Breakfree,[{u'api_detail_url': u'http://www.giantbomb.co...,1995-12-31 00:00:00
2,The Chessmaster 2000 is the chess game that be...,The Chessmaster 2000,[{u'api_detail_url': u'http://www.giantbomb.co...,1986-06-15 00:00:00
3,Put on some tight red spandex pants and grab y...,Bass Avenger,[{u'api_detail_url': u'http://www.giantbomb.co...,2000-09-28 00:00:00
4,Smackdown Vs. Raw 2007 is the third installmen...,WWE SmackDown! vs. RAW 2007,[{u'api_detail_url': u'http://www.giantbomb.co...,2006-11-14 00:00:00


Note that for our project, we will really only be delving into the "**decks**" variable, which provides a brief summary of the game.  Next, we need to start preparing for text mining.  The first step in successful text mining is to *tokenize* the text; that is, to break it up into smaller, digestible chunks.  In this case, we will break our game descriptions into lists of single-word tokens.  Additionally, we will *stem* each word, which means that we will chop off the end of words so that different forms of the same word (e.g. "shoot" and "shooting") can be understood to be related.  Lastly, we will also remove all punctuation from our text objects.  The function we define below performs all three of these tasks.

In [5]:
#Define a new tokenizer for clustering
def tokenize_and_stem(text):
    snowball = nltk.stem.snowball.SnowballStemmer('english')
    punc = re.compile('[%s]' % re.escape(string.punctuation))
    #Lowercase and strip punctuation before splitting into word tokens
    text = punc.sub('', text.lower())
    #Reduce every token to its stem
    return [snowball.stem(word) for word in nltk.word_tokenize(text)]

Here is an example of our function's output: we call it on the first game description in our data set and show the first 10 tokens.

In [6]:
tokenize_and_stem(game_data['decks'][0])[0:10]

[u'a',
 u'topdown',
 u'isometr',
 u'helicopt',
 u'shoot',
 u'em',
 u'up',
 u'origin',
 u'for',
 u'the']
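
To see why *top-down* became *topdown* in the output above, here is a simplified sketch of the cleanup step on its own.  It assumes plain whitespace splitting in place of `nltk.word_tokenize` and skips stemming entirely; `clean_tokens` is a hypothetical name used only for illustration.

```python
import re
import string

# Character class matching any single ASCII punctuation character
PUNC = re.compile('[%s]' % re.escape(string.punctuation))

def clean_tokens(text):
    """Lowercase the text, delete punctuation, and split on whitespace."""
    return PUNC.sub('', text.lower()).split()

print(clean_tokens("A top-down isometric helicopter shoot 'em up"))
```

Because punctuation is deleted rather than replaced with a space, hyphenated words collapse into a single token.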

Next, we're going to define another tokenizer.  "*Now Rich, we already have a tokenizer, why on earth do we need another one?*" you are wondering to yourself.  Well, mysterious person reading this, we will get to that!  Have some patience!

In [7]:
#Define another tokenizer that doesn't stem so we can match stems to words later
def tokenize_no_stem(text):
    punc = re.compile('[%s]' % re.escape(string.punctuation))
    #Same cleanup as tokenize_and_stem, minus the stemming step
    text = punc.sub('', text.lower())
    return nltk.word_tokenize(text)

This next function explains the need for our second tokenizer.  Essentially, I want to have a table in which I can look up the non-stemmed version of any of my stemmed words.  The function below creates a DataFrame in which I can pass the stemmed word in as the index and find the non-stemmed version.

In [8]:
#Create list of stemmed and non-stemmed words to match up
def vocab(text):
    totalvocab_stemmed = []
    totalvocab_tokenized = []
    for i in text:
        allwords_stemmed = tokenize_and_stem(i)
        totalvocab_stemmed.extend(allwords_stemmed)
        allwords_tokenized = tokenize_no_stem(i)
        totalvocab_tokenized.extend(allwords_tokenized)    
            
    vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
    return vocab_frame
vocab_frame = vocab(game_data['decks'])

Here is a quick snapshot of the construction of this lookup table.  This table will be useful later on in our exercise.

In [9]:
vocab_frame.head()

Unnamed: 0,words
a,a
topdown,topdown
isometr,isometric
helicopt,helicopter
shoot,shoot
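
The same stem-to-word lookup can be sketched with a plain dictionary.  The first five stem/word pairs below are copied from the table above; the extra "shooting" pair and the helper name `first_word_for_stem` are hypothetical, and keeping the first word seen for each stem is an assumption that mirrors taking the first match from the DataFrame.

```python
# Parallel lists: stems[i] is the stemmed form of words[i]
stems = ['a', 'topdown', 'isometr', 'helicopt', 'shoot', 'shoot']
words = ['a', 'topdown', 'isometric', 'helicopter', 'shoot', 'shooting']

def first_word_for_stem(stems, words):
    """Map each stem to the first unstemmed word that produced it."""
    lookup = {}
    for stem, word in zip(stems, words):
        # setdefault keeps the earlier word when a stem repeats
        lookup.setdefault(stem, word)
    return lookup

lookup = first_word_for_stem(stems, words)
print(lookup['helicopt'], lookup['shoot'])
```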


Next, we need to define our *stop words*, which are essentially words that add no value to our end goal.  These words will be removed from our analysis.  Here we define the list "**stop_words**" by calling a pre-existing list of common stop words from NLTK.

In [10]:
stop_words = nltk.corpus.stopwords.words('english')

Here are the first 10 stop words, just to give you an idea of what kind of words we don't care about.

In [11]:
stop_words[0:10]

[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

Now we need to create our *TF-IDF matrix*, which measures two things.  First, it counts the number of times a word appears in a given document.  Then, this count is weighted so that words that appear in many documents are penalized for being more common.  The theory is that if a word only appears in a few documents, it is likely important in distinguishing those documents.
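
As a rough worked example of that weighting, the sketch below computes TF-IDF scores by hand for three toy documents.  The documents are made up, and the smoothed IDF formula is an assumption chosen for illustration; the vectorizer also normalizes each row by default, which this sketch skips.

```python
import math
from collections import Counter

# Three toy "decks", already tokenized (hypothetical data)
docs = [
    ['puzzle', 'platform', 'game'],
    ['racing', 'game'],
    ['puzzle', 'game', 'puzzle'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term count by a smoothed inverse document frequency."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0)
        for term, count in counts.items()
    }

weights = tfidf(docs[0])
for term in sorted(weights, key=weights.get, reverse=True):
    print(term, round(weights[term], 3))
```

In the first document every term appears once, yet "platform" (found in one document) outranks "puzzle" (two documents), which outranks "game" (all three).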

The TfidfVectorizer function below creates this matrix for us.  A few things to note:
 * We pass the function our list of stop words so that these are not included in our TF-IDF matrix.
 * The arguments "**max_df**" and "**min_df**" specify to only include words that appear in no more than 30 percent of documents but in at least 1 percent of documents.
 * The argument "**use_idf**" simply tells the vectorizer to use the inverse document frequency weighting-scheme we mentioned above.
 * Lastly, we pass in the tokenizer we defined above to transform each word entered into our matrix.
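
The effect of the document-frequency cutoffs can be sketched in a few lines.  This is only an illustration: `keep_terms` is a hypothetical helper, the toy documents are made up, and the thresholds are chosen to suit ten documents rather than the values used in this notebook.

```python
from collections import Counter

def keep_terms(docs, min_df, max_df):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = float(len(docs))
    # Count each term once per document, however often it repeats inside it
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)

# Ten toy documents: "game" is everywhere, "rare" appears only once
docs = ([['game', 'puzzle']] * 3 + [['game', 'racing']] * 2 +
        [['game', 'rare']] + [['game']] * 4)

print(keep_terms(docs, min_df=0.15, max_df=0.35))
```

Here "game" is dropped for being too common and "rare" for being too uncommon, leaving only the terms that can help separate documents.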

In [12]:
#Define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_df=0.3, min_df = 0.01,
                                 use_idf=True, tokenizer=tokenize_and_stem)

In [13]:
#Create TFIDF Matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(game_data['decks'])
print(tfidf_matrix.shape)
terms = tfidf_vectorizer.get_feature_names()

(40533, 160)


The TF-IDF matrix we've created consists of **40,533** documents (the games in Giant Bomb's wiki that have a deck) and **160** terms that meet the criteria we specified above.  Next, we use the TF-IDF matrix as the input to the KMeans clustering algorithm to create our genre clusters.  Here I've rather arbitrarily chosen 8 as the number of clusters.

In [14]:
#KMeans Clustering
num_clusters = 8
km = KMeans(n_clusters = num_clusters, random_state=3)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
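
For intuition, k-means alternates between assigning each document to its nearest centroid and moving each centroid to the mean of its documents.  Here is a minimal sketch of one such round on toy 2-D points; `assign` and `update` are hypothetical helpers written for illustration, not scikit-learn functions, and the sketch assumes no cluster ends up empty.

```python
def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    labels = []
    for p in points:
        # Squared Euclidean distance from this point to every centroid
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def update(points, labels, k):
    """Return the mean of the points assigned to each cluster."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        # Assumes every cluster has at least one member
        centroids.append([sum(dim) / float(len(members)) for dim in zip(*members)])
    return centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]

labels = assign(points, centroids)
centroids = update(points, labels, 2)
print(labels, centroids)
```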

Now, using that lookup table we created earlier, we can see the terms that carry the most weight in each cluster's centroid.  Here are the top three terms for each of our clusters.

In [15]:
print("Top terms per cluster:")
print()
#for each cluster center, sort term indices from highest to lowest weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :3]: 
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0], end=',')
    print() 

Top terms per cluster:

Cluster 0 words: rpg, action, developed,
Cluster 1 words: puzzle, players, platforms,
Cluster 2 words: racing, released, arcade,
Cluster 3 words: developed, published, released,
Cluster 4 words: novel, visuals, developed,
Cluster 5 words: based, named, series,
Cluster 6 words: released, players, series,
Cluster 7 words: adventure, developed, series,
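
The ranking step in the loop above boils down to sorting term indices by their weight in a centroid, largest first.  A small sketch with made-up numbers follows; the stems, the weights, and the helper `top_terms` are all hypothetical.

```python
# One centroid row: one weight per term in the vocabulary (toy values)
terms = ['action', 'puzzl', 'race', 'rpg']
centroid = [0.40, 0.05, 0.10, 0.55]

def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest weights in this centroid."""
    # Sort column indices by weight, descending, like argsort()[::-1]
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:n]]

print(top_terms(centroid, terms))
```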


This output is a decent start.  A few of our clusters already look like the genres we would expect.  For instance, *Cluster 1* seems to represent puzzle platformers and *Cluster 0* seems to represent action RPGs!  Pretty cool, right?

However, we have a few clusters that are completely nonsensical for what we are trying to achieve.  For example, *Cluster 3* just returns words associated with the process of making games. This doesn't help us at all.

One thing we can do is add words that we don't find useful to our list of stop words.  This can be tedious and labor-intensive, but it can help us get better-defined genres.

In [16]:
#Add a bunch of stop words not relevant to a game's genre
#Most entries are stems (e.g. 'featur'), since the vectorizer filters stop words
#after our stemming tokenizer has run
stop_words.extend([
    'featur', 'sequel', 'originally', 'set', 'based', 'franchis', 'seri', 'instal',
    'first', 'one', 'second', 'two', '2', 'third', 'final', 'version',
    'entri', 'new', 'develop', 'publish', 'releas', 'pc', 'nes', 'playstat',
    'nintendo', 'wii', 'ds', 'io', 'famicom', 'sega', 'xbox', 'live',
    'origin', 'japan', 'boy', 'color', 'must', 'base', 'name', 'terms',
    'play', 'player', 'popular', 'onli', 'use', 'control', 'take', 'super',
    'made', 'creat', 'exclus', 'includ', 'tri', 'video', 'studio', 'like',
    'system', 'back', 'find', 'help', 'graphic', 'gameplay', 'star', 'character',
])

Now let's re-define our TF-IDF matrix and re-run our clustering algorithm and see the results.

In [17]:
#Define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_df=0.3, min_df = 0.01,
                                 use_idf=True, tokenizer=tokenize_and_stem)

#Create TFIDF Matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(game_data['decks'])
print(tfidf_matrix.shape)
terms = tfidf_vectorizer.get_feature_names()

(40533, 100)


In [18]:
#KMeans Clustering
num_clusters = 8
km = KMeans(n_clusters = num_clusters, random_state=3)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

print("Top terms per cluster:")
print()
#for each cluster center, sort term indices from highest to lowest weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :3]: 
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0], end=',')
    print() 

Top terms per cluster:

Cluster 0 words: robots, war, battle,
Cluster 1 words: simulation, manager, combat,
Cluster 2 words: action, shooter, platforms,
Cluster 3 words: fight, character, 2d,
Cluster 4 words: world, war, adventure,
Cluster 5 words: puzzle, platforms, adventure,
Cluster 6 words: adventure, action, mystery,
Cluster 7 words: strategy, turnbased, war,


This looks way better--our new genres make a lot more sense.  Let's take a look at some of our favorite titles to see how we did and where our algorithm falls short.

In [19]:
#Creates a Series using the Game Name as the Index
game_genres = pd.Series(clusters, index=game_data['names'])

Using the Series defined above, we can now look up games we know and love using their name, returning their output cluster as defined by our algorithm.  Let's look at the first **Halo**.

In [20]:
game_genres['Halo: Combat Evolved']

2

***Cluster 2 - Action, Shooter, Platforms*** - Not bad!  Halo is definitely a shooter, although it's certainly not a platformer.  I would agree with the algorithm that *Cluster 2* seems to make the most sense.  Next let's look at another one of my favorites, **World of Warcraft**.

In [21]:
game_genres['World of Warcraft']

4

***Cluster 4 - World, War, Adventure*** - Eh, this isn't great, and probably demonstrates one of the flaws of our algorithm.  WoW is a textbook MMO-RPG but we don't really have a cluster that seems to fit those types of games.  Here it looks like the name itself (i.e. **World** of **War**craft) was the deciding factor in terms of its cluster assignment, which isn't exactly what we're looking for.  Let's see if our algorithm can identify fighting games--next up, **Mortal Kombat X**.

In [22]:
game_genres['Mortal Kombat X']

3

***Cluster 3 - Fight, Character, 2d*** - This checks out!  What about the new Jonathan Blow open-world puzzle game that has everybody all hot and bothered: **The Witness**?

In [23]:
game_genres['The Witness']

5

***Cluster 5 - Puzzle, Platforms, Adventure*** - Nailed this one also.  Cool!

###Summary
As you can see, text-mining and document-clustering techniques let us assign games to genres that describe them fairly well.  I could go on and show you dozens more examples, both of games I feel we've classified correctly and of games whose assignments I don't think are quite right, but instead, mysterious reader, I implore you to copy the code and see where your favorite game ends up!

Thanks for reading and please feel free to reach me at **richard_conboy@ncsu.edu** or **845-264-2406** with any questions!