# Exploratory Data Analysis 

Here I explore the two datasets I have (popular games and indie games) and work out how best to build the pipeline for the project. I also do some data cleaning along the way.

In [23]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

In [260]:
from pymongo import MongoClient
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import matplotlib 
matplotlib.style.use('ggplot')
import pprint
import re
from gensim import corpora, models, similarities

In [3]:
client = MongoClient()
# Pop Games
db_popGames = client.Games
cursor_popGames = db_popGames.PopularGames.find()
df_popGames = pd.DataFrame(list(cursor_popGames))

# Indie Games
db_indieGames = client.Games
cursor_indieGames = db_indieGames.IndieGames.find()
df_indieGames = pd.DataFrame(list(cursor_indieGames))

## Let's check which genres the games fall into

In [31]:
# Genre by Popular Games
df_popGenre  = df_popGames['genre'].unique()
for p in sorted(df_popGenre):
    print p,

2D 3D 4X Action Action Adventure Action RPG Adventure Games Alternative Sports Arcade Artillery Automobile Baseball Basketball Beat-'Em-Up Biking Billiards Board / Card Game Board Games Bowling Breeding/Constructing Business / Tycoon Car Combat Card Battle Card Battle Games Career Civilian Civilian Plane Combat Combat Sims Command Compilation Compilations Console-style RPG Defense Demolition Derby Fantasy Fighting First-Person First-Person Shooters Football Formula One Futuristic Futuristic Combat Sims Futuristic Jet Futuristic Sub GT / Street Gambling General Golf Government Helicopter Historic Hockey Horizontal Horror Hunting Interactive Movie Japanese-Style Kart Large Spaceship Light Gun Linear MOBA Management Massively Multiplayer Matching Mech Military Miscellaneous Mission-based Modern Modern Jet Motocross Music Old Jet Olympic Sports On-foot Open-World Other Other Driving Games Other Shooters Other Sports Games Other Strategy Games PC-style RPG Parlor Games Party Pinball Platfor
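
The listing above shows which genres exist, but not how many games fall under each one. Here is a minimal sketch of that counting step. It uses a small made-up list of genre labels in place of the real `df_popGames['genre']` column; the sample values are placeholders, not data from the database.

```python
from collections import Counter

# Hypothetical sample of genre labels; in the notebook these would come
# from df_popGames['genre'].
sample_genres = ["Action", "Arcade", "Action", "Golf", "Arcade", "Action"]

genre_counts = Counter(sample_genres)

# most_common() sorts by count, largest first
for genre, count in genre_counts.most_common():
    print("{}: {}".format(genre, count))
```

On the real DataFrame, `df_popGames['genre'].value_counts()` should give the same kind of tally directly.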

In [40]:
df_popGames[(df_popGames["genre"] == "Other Sports Games")]

| | _id | game_link | game_name | genre | review_count | score | summary |
|---|---|---|---|---|---|---|---|
| 208 | ashes-cricket-2009 | http://www.metacritic.com/game/pc/ashes-cricke... | ashes-cricket-2009 | Other Sports Games | \n4\n | 6.9 | Providing gamers with the most authentic Ashes... |
| 430 | brian-lara-international-cricket-2007 | http://www.metacritic.com/game/pc/brian-lara-i... | brian-lara-international-cricket-2007 | Other Sports Games | \n7\n | 8.0 | (Also known as "Ricky Ponting International Cr... |
| 2906 | tennis-masters-series | http://www.metacritic.com/game/pc/tennis-maste... | tennis-masters-series | Other Sports Games | \n10\n | 8.0 | Test your skills against 71 professional playe... |
| 3217 | title-bout-championship-boxing | http://www.metacritic.com/game/pc/title-bout-c... | title-bout-championship-boxing | Other Sports Games | \n4\n | 7.5 | Title Bout Championship Boxing is the ultimate... |
| 3502 | winter-sports-2012-feel-the-spirit | http://www.metacritic.com/game/pc/winter-sport... | winter-sports-2012-feel-the-spirit | Other Sports Games | \n4\n | 4.0 | 50 challenges await winter sports fans. Specia... |


### Indie Game Genre

In [22]:
# Genre by Indie Games 
df_indieGames_genre = df_indieGames['genre'].unique()
for a in sorted(df_indieGames_genre):
    print a,

4X Adventure Alternative Sport Arcade Baseball Car Combat Cinematic Combat Sim Educational Family Fighting First Person Shooter Football Futuristic Sim Golf Grand Strategy Hack 'n' Slash MOBA Party Platformer Point and Click Puzzle Compilation Racing Real Time Shooter Real Time Strategy Real Time Tactics Realistic Sim Rhythm Roguelike Role Playing Soccer Stealth Tactical Shooter Third Person Shooter Tower Defense Turn Based Strategy Turn Based Tactics Virtual Life Visual Novel Wrestling


### Genres of the 13,000 indie games to scrape

In [10]:
# Let's compare the genre from the CSV file that I wanted to scrape. This includes 13,000 games.
df_games_to_scrape = pd.read_csv("../IndieGamesToScrape.csv")

In [21]:
genre_to_scrape = df_games_to_scrape['genre'].unique()
for p in sorted(genre_to_scrape):
    print p,

4X Adventure Alternative Sport Arcade Baseball Basketball Car Combat Cinematic Combat Sim Educational Family Fighting First Person Shooter Football Futuristic Sim Golf Grand Strategy Hack 'n' Slash Hockey MOBA Party Party Based Platformer Point and Click Puzzle Compilation Racing Real Time Shooter Real Time Strategy Real Time Tactics Realistic Sim Rhythm Roguelike Role Playing Soccer Stealth Tactical Shooter Third Person Shooter Tower Defense Turn Based Strategy Turn Based Tactics Virtual Life Visual Novel Wrestling


#### The genres of the games I already have cover almost all of the genres in the full list (only a few, such as Basketball and Hockey, are missing), so the 1,500 games I have now are a fair representation of the 13,000 games.
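
The comparison of the two genre listings can also be done programmatically rather than by eye. This is a rough sketch with plain Python sets. The two sets are short hypothetical excerpts; the real inputs would be `df_indieGames['genre'].unique()` and `df_games_to_scrape['genre'].unique()`.

```python
# Hypothetical excerpts of the two genre lists, for illustration only.
genres_have = {"4X", "Adventure", "Arcade", "Baseball", "Golf"}
genres_all = {"4X", "Adventure", "Arcade", "Baseball", "Basketball", "Golf", "Hockey"}

# Genres that appear in the full list but not in the games already collected
missing = genres_all - genres_have

# Fraction of the full list's genres that the collected games cover
coverage = len(genres_all & genres_have) / float(len(genres_all))

print(sorted(missing))
print("coverage: {:.0%}".format(coverage))
```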

Creating buckets of games. I will do this for popular games

In [198]:
df_indieGames.summary.loc[0]

[u'\n\n\n\nPlayers pilot a Gun Ship and battle aliens, pirates, and other villainous enemies, to gain levels, skills, and most importantly, discover unique and potentially more powerful weapon systems to outfit their Gun Ships. Through the unique cgNEAT content-generation technology, new weapons are constantly created by the game automatically based on player preferences. Every weapon in the game is unique. NOTE: Requires XNA 4 which is included in the download! \n\n']

In [199]:
all_docs = df_indieGames.summary
sub_docs = all_docs[:5]

#for l in word_list:
#    for word in l:
#        print word.lower().replace("\n", "").replace(".", "").replace('"', "").replace("(","").replace(")",""),

## Sub_summaries

In [200]:
sub_docs

0    [\n\n\n\nPlayers pilot a Gun Ship and battle a...
1    [\n\n\n\nEvery goverment has secrets. These se...
2    [\n\n\n\n\n\n\nAfter an endless fall through p...
3    [\n\n\n\n\n\n\n\nThis is port of amazing PC ga...
4    [\n\n\n\nMTBFreerideVR, with support Oculus Ri...
Name: summary, dtype: object

### Remove Stop Words

In [201]:
documents = []
for a in sub_docs:
    for word in a:
        # each summary is a list of unicode strings; store each one as a UTF-8 byte string
        documents.append(word.encode("utf-8"))

In [240]:
for doc in documents:
    print type(doc)

<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>


In [202]:
documents[0]

'\n\n\n\nPlayers pilot a Gun Ship and battle aliens, pirates, and other villainous enemies, to gain levels, skills, and most importantly, discover unique and potentially more powerful weapon systems to outfit their Gun Ships. Through the unique cgNEAT content-generation technology, new weapons are constantly created by the game automatically based on player preferences. Every weapon in the game is unique. NOTE: Requires XNA 4 which is included in the download! \n\n'

In [203]:
documents[0] = documents[0].lower()
documents[0] = re.sub('[\\@$\/\#\.\-:&\*\+\=\[\]?!\(\)\{\},\'\">\_<;%]',r'',documents[0])
documents[0] = re.sub(r'\s+',r' ', documents[0])

In [215]:
documents[0]

' players pilot a gun ship and battle aliens pirates and other villainous enemies to gain levels skills and most importantly discover unique and potentially more powerful weapon systems to outfit their gun ships through the unique cgneat contentgeneration technology new weapons are constantly created by the game automatically based on player preferences every weapon in the game is unique note requires xna 4 which is included in the download '

In [232]:
texts = []
for document in documents:
    # lowercase, strip punctuation, then collapse runs of whitespace (including newlines)
    document = document.lower()
    document = re.sub('[\\@$\/\#\.\-:&\*\+\=\[\]?!\(\)\{\},\'\">\_<;%]',r'', document)
    document = re.sub(r'\s+',r' ', document)
    # split() drops leading/trailing whitespace, so no extra replace/strip is needed
    texts.append(document.split())
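
The cleaning steps in this cell could also be wrapped in a single reusable function. The sketch below is one possible way to do that, not the notebook's own code: the name `tokenize` and the sample string are made up for illustration. The character class is meant to match the same punctuation as the pattern above, written as a raw string.

```python
import re

# Punctuation to delete, written as a raw string; intended to cover the same
# characters as the notebook's pattern.
PUNCTUATION = re.compile(r"""[@$/#.\-:&*+=\[\]?!(){},'">_<;%]""")
WHITESPACE = re.compile(r"\s+")


def tokenize(text):
    """Lowercase, strip punctuation, collapse whitespace, and split into tokens."""
    text = text.lower()
    text = PUNCTUATION.sub("", text)
    text = WHITESPACE.sub(" ", text)
    # split() with no arguments ignores leading and trailing whitespace
    return text.split()


# Hypothetical sample in the style of the game summaries
sample = "\n\nEvery weapon in the game is unique. NOTE: Requires XNA 4!\n"
tokens = tokenize(sample)
print(tokens)
```

With a helper like this, the loop over `documents` reduces to `texts = [tokenize(d) for d in documents]`.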

In [244]:
texts_clean = []
# a small hand-picked stop list of very common words
stoplist = set('for a of the and to in'.split())

# keep only the tokens that are not stop words
for l in texts:
    texts_clean.append([word for word in l if word not in stoplist])


In [254]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts_clean:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
          for text in texts_clean]

In [261]:
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/tests.dict')
print dictionary

Dictionary(47 unique tokens: [u'this', u'suddenly', u'is', u'powerful', u'discover']...)


In [263]:
dictionary.token2id

{u'are': 10,
 u'at': 37,
 u'be': 18,
 u'by': 15,
 u'can': 30,
 u'constantly': 3,
 u'discover': 6,
 u'every': 9,
 u'experiment': 21,
 u'from': 20,
 u'game': 8,
 u'going': 29,
 u'goverment': 33,
 u'gun': 5,
 u'have': 31,
 u'higher': 46,
 u'is': 16,
 u'its': 44,
 u'james': 32,
 u'most': 7,
 u'new': 12,
 u'on': 0,
 u'or': 39,
 u'powerful': 4,
 u'project': 27,
 u'robots': 41,
 u'secrets': 28,
 u'skills': 1,
 u'soon': 42,
 u'suddenly': 23,
 u'that': 22,
 u'this': 40,
 u'through': 13,
 u'tower': 43,
 u'unique': 14,
 u'us': 26,
 u'use': 38,
 u'was': 36,
 u'we': 25,
 u'weapon': 2,
 u'what': 19,
 u'which': 11,
 u'with': 24,
 u'woods': 17,
 u'x': 34,
 u'you': 35,
 u'your': 45}

In [266]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/tests.mm', corpus)
corpus

[[(0, 1),
  (1, 1),
  (2, 2),
  (3, 1),
  (4, 1),
  (5, 2),
  (6, 1),
  (7, 1),
  (8, 2),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 3),
  (15, 1),
  (16, 2)],
 [(0, 2),
  (6, 1),
  (9, 1),
  (10, 3),
  (11, 1),
  (13, 1),
  (16, 2),
  (17, 3),
  (18, 1),
  (19, 2),
  (20, 1),
  (21, 3),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 3),
  (27, 4),
  (28, 2),
  (29, 2),
  (30, 2),
  (31, 1),
  (32, 3),
  (33, 2),
  (34, 4),
  (35, 4),
  (36, 2),
  (37, 1)],
 [(1, 1),
  (3, 1),
  (4, 1),
  (7, 1),
  (9, 1),
  (10, 1),
  (13, 1),
  (15, 1),
  (16, 3),
  (18, 1),
  (20, 1),
  (22, 2),
  (23, 1),
  (28, 1),
  (30, 1),
  (31, 1),
  (35, 8),
  (38, 2),
  (39, 2),
  (40, 1),
  (41, 2),
  (42, 1),
  (43, 3),
  (44, 2),
  (45, 5),
  (46, 2)],
 [(0, 1),
  (8, 1),
  (12, 1),
  (16, 1),
  (25, 1),
  (35, 1),
  (37, 1),
  (40, 1),
  (42, 1)],
 [(24, 1), (42, 1)]]
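
Each document in the corpus is a list of `(token_id, count)` pairs, which is hard to read on its own. As a sanity check, the pairs can be mapped back to words by inverting `token2id`. The sketch below does this in plain Python for the last two documents, using a subset of the ids copied from the output above. The variable names are mine, and gensim itself is not needed for this step.

```python
# Subset of dictionary.token2id, copied from the notebook output.
token2id = {"on": 0, "game": 8, "new": 12, "is": 16, "with": 24, "we": 25,
            "you": 35, "at": 37, "this": 40, "soon": 42}

# Invert the mapping so ids can be looked up
id2token = {token_id: token for token, token_id in token2id.items()}

# The last two documents of the corpus, copied from the notebook output.
corpus_tail = [
    [(0, 1), (8, 1), (12, 1), (16, 1), (25, 1), (35, 1), (37, 1), (40, 1), (42, 1)],
    [(24, 1), (42, 1)],
]

# Replace every id with its token, keeping the counts
decoded = [[(id2token[token_id], count) for token_id, count in doc]
           for doc in corpus_tail]

for doc in decoded:
    print(doc)
```

The last document decodes to just "with" and "soon", which shows how little of a short summary survives the frequency filter when only five documents are used.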