##### Introduction

This is a continuation of userSubredditAnalysis. There are two inputs:

1. "inputGraph", which is a mapping of userId to the subreddits the user commented on, weighted by counts, from January to May, with all 1-subreddit users filtered out.
2. "actualNewSubreddits", which is a file of JSON records, one per line, of the form `{userId: <string>, newSubreddits: {<stringSubreddit>: <intCount>, ...}}`, from June.

The output is a randomized devSet and a randomized testSet of 100 users each. The preliminary analysis and the rationale behind the generated sets can be found in userSubredditAnalysis.
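
As a reminder of what "all 1-subreddit users filtered out" means for inputGraph, here is a minimal sketch of that filter. The function name `drop_single_subreddit_users` and the sample data are hypothetical; the real filtering was done upstream in userSubredditAnalysis and may differ in detail. The sketch is written to run under both Python 2 and Python 3.

```python
def drop_single_subreddit_users(user_to_subreddits):
    """Keep only users who commented in at least two distinct subreddits."""
    return {
        user_id: counts
        for user_id, counts in user_to_subreddits.items()
        if len(counts) >= 2
    }


# Hypothetical sample: userId -> {subreddit: comment count}.
sample = {
    "user_a": {"python": 12, "learnprogramming": 3},
    "user_b": {"aww": 40},
}
filtered = drop_single_subreddit_users(sample)
```

Here `user_b` is dropped because all of that user's comments are in a single subreddit, no matter how many comments there are.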

##### Imports

In [4]:
import collections
import gc
import json
from matplotlib import pyplot as plt
import numpy as np
import random
import tools
reload(tools)
# gc.collect()

<module 'tools' from 'tools.py'>

##### Reformatting (please skip)

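In the raw actualNewSubreddits file, newSubreddits is stored as a list of single-key objects rather than as a single mapping. This cell rewrites it as one `{subreddit: count}` mapping per user and saves the result to actualNewSubreddits2. It only needs to be run once.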

In [8]:
# Load actualNewSubreddits and save it in a more consistent format.
with open("../bigData/analysis/actualNewSubreddits", 'r') as infile, \
     open("../bigData/analysis/actualNewSubreddits2", 'w') as outfile:
    for i, line in enumerate(infile, 1):
        lineJson = json.loads(line)
        # Collapse the list of single-key dicts into one {subreddit: count} dict.
        newSubreddits = {}
        for entry in lineJson["newSubreddits"]:
            subreddit = entry.keys()[0]  # Python 2: keys() returns a list.
            newSubreddits[subreddit] = entry[subreddit]
        lineJson["newSubreddits"] = newSubreddits
        outfile.write(json.dumps(lineJson) + "\n")
        # Report progress every million lines.
        if i % 1000000 == 0:
            print "Processed {}".format(i)

Processed 1000000
Processed 2000000
Processed 3000000
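
The reformatting step above can be illustrated on a single record. This is only a sketch: `normalize_record` is a hypothetical helper and the sample line is made up. Unlike the cell above, it copies every key of each list entry rather than only the first, which gives the same result when each entry has exactly one key.

```python
import json


def normalize_record(line):
    """Turn a list of {subreddit: count} dicts into one merged dict."""
    record = json.loads(line)
    merged = {}
    for entry in record["newSubreddits"]:
        for subreddit, count in entry.items():
            merged[subreddit] = count
    record["newSubreddits"] = merged
    return record


# Hypothetical raw line in the original list-of-dicts layout.
raw_line = '{"userId": "user_a", "newSubreddits": [{"python": 2}, {"aww": 5}]}'
normalized = normalize_record(raw_line)
```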


##### Load Data

In [2]:
# Load inputGraph, include weights.
userIdToOldSubreddits = collections.defaultdict(lambda: collections.defaultdict(int))
tools.getUserIdToSubreddits("../bigData/analysis/inputGraph", userIdToOldSubreddits, includeCounts=True)
print "Number of users: {}".format(len(userIdToOldSubreddits))

Processing ../bigData/analysis/inputGraph
Processed 1000000
Processed 2000000
Processed 3000000
Processed 4000000
Processed 5000000
Processed 6000000
Processed 7000000
Processed 8000000
Processed 9000000
Processed 10000000
Processed 11000000
Processed 12000000
Processed 13000000
Processed 14000000
Processed 15000000
Processed 16000000
Processed 17000000
Processed 18000000
Processed 19000000
Processed 20000000
Processed 21000000
Processed 22000000
Processed 23000000
Processed 24000000
Processed 25000000
Processed 26000000
Processed 27000000
Processed 28000000
Processed 29000000
Processed 30000000
Processed 31000000
Processed 32000000
Processed 33000000
Processed 34000000
Processed 35000000
Processed 36000000
Processed 37000000
Processed 38000000
Processed 39000000
Processed 40000000
Processed 41000000
Processed 42000000
Processed 43000000
Processed 44000000
Processed 45000000
Processed 46000000
Processed 47000000
Processed 48000000
Processed 49000000
Processed 50000000
Processed 51000000

In [5]:
# Load actualNewSubreddits, include weights.
actualNewSubreddits = tools.getUserIdToSubredditsByType("../bigData/analysis/actualNewSubreddits", "newSubreddits")
print "Number of users: {}".format(len(actualNewSubreddits))

Processed 1000000
Processed 2000000
Processed 3000000
Number of users: 3258157
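
The implementation of `tools.getUserIdToSubredditsByType` is not shown in this notebook. As a rough idea of what such a loader might do, here is a sketch that assumes one JSON record per line with newSubreddits already stored as a single mapping. The name `load_new_subreddits` is hypothetical and this is not the code in tools.py.

```python
import json


def load_new_subreddits(lines):
    """Build userId -> {subreddit: count} from an iterable of JSON lines."""
    user_to_new = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        record = json.loads(line)
        user_to_new[record["userId"]] = record["newSubreddits"]
    return user_to_new


# Hypothetical sample lines; an open file object can be passed the same way.
sample_lines = [
    '{"userId": "user_a", "newSubreddits": {"python": 2, "aww": 5}}',
    '{"userId": "user_b", "newSubreddits": {"news": 1}}',
]
loaded = load_new_subreddits(sample_lines)
```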


In [7]:
# Output a random subset of users for dev and test.
# Format is {userId: <string>, oldSubreddits: {<stringSubreddit>: <intCount>, ...}, newSubreddits: {<stringSubreddit>: <intCount>, ...}}
random.seed(7224)

# Candidates are users with between 10 and 100 new subreddits (inclusive).
candidateUserIds = []
for userId, subreddits in actualNewSubreddits.iteritems():
    if 10 <= len(subreddits) <= 100:
        candidateUserIds.append(userId)
random.shuffle(candidateUserIds)
print "Number of candidate users: {}".format(len(candidateUserIds))

with open('../bigData/devTest/devUsers', 'w') as devfile, \
     open('../bigData/devTest/testUsers', 'w') as testfile:
    # The first 100 shuffled users go to dev, the next 100 to test.
    for counter, userId in enumerate(candidateUserIds[:200]):
        userJson = {"userId": userId,
                    "newSubreddits": actualNewSubreddits[userId],
                    "oldSubreddits": userIdToOldSubreddits[userId]}
        outfile = devfile if counter < 100 else testfile
        outfile.write(json.dumps(userJson) + "\n")

Number of candidate users: 118620
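
The selection logic in the cell above can be sketched as a small standalone function. `split_dev_test` and the synthetic sample are hypothetical, and the sketch makes one deliberate change: it sorts the candidates before shuffling so that the split depends only on the seed and not on dictionary iteration order. It therefore will not reproduce the exact users written to devUsers and testUsers.

```python
import random


def split_dev_test(user_to_new, min_new=10, max_new=100,
                   dev_size=100, test_size=100, seed=7224):
    """Pick disjoint dev and test user lists from the eligible users."""
    # Eligible users have between min_new and max_new new subreddits.
    candidates = sorted(
        user_id for user_id, subreddits in user_to_new.items()
        if min_new <= len(subreddits) <= max_new
    )
    # A private Random instance avoids touching the global random state.
    rng = random.Random(seed)
    rng.shuffle(candidates)
    dev = candidates[:dev_size]
    test = candidates[dev_size:dev_size + test_size]
    return dev, test


# Synthetic sample: user_i has i new subreddits, for i from 1 to 30.
sample = {
    "user_{}".format(i): {"sub_{}".format(j): 1 for j in range(i)}
    for i in range(1, 31)
}
dev_ids, test_ids = split_dev_test(sample, dev_size=5, test_size=5)
```

Because the shuffled list is sliced into two adjacent ranges, the dev and test sets never share a user.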
