### Purchase Prediction: Shrink Training Data

Shrink the training data to include only users who appear in the test data.

The fraction of training users who appear in the test data is much smaller than the fraction of training items that appear in the test data. We'll keep all the items to get as much signal on user similarity as we can, while still significantly reducing the training data size.
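The overlap claim above can be checked with plain set arithmetic. The sketch below is a minimal illustration on made-up IDs; the helper name `overlap_fraction` and the toy lists are hypothetical. In this notebook it would be called on the `train_users`/`test_users` and `train_items`/`test_items` lists built in the cells that follow.

```python
def overlap_fraction(train_ids, test_ids):
    """Fraction of distinct training IDs that also occur in the test IDs."""
    train_set, test_set = set(train_ids), set(test_ids)
    if not train_set:
        return 0.0
    return len(train_set & test_set) / len(train_set)

# Toy data, made up for illustration (not taken from the real dataset)
toy_train_users = ["U1", "U2", "U3", "U4"]
toy_test_users = ["U1", "U9"]
toy_train_items = ["I1", "I2", "I3", "I4"]
toy_test_items = ["I1", "I2", "I3"]

user_overlap = overlap_fraction(toy_train_users, toy_test_users)
item_overlap = overlap_fraction(toy_train_items, toy_test_items)
print(f"users: {user_overlap:.2f}, items: {item_overlap:.2f}")
```

A low user overlap combined with a high item overlap is the situation in which filtering by user removes many rows while filtering by item would remove few.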

In [1]:
import pandas as pd
import gzip

In [2]:
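# Import users and items from training data
import ast

def readGz(f):
    # Each line is parsed as a Python dict literal; literal_eval does this
    # without executing arbitrary code (unlike eval)
    for l in gzip.open(f, 'rt', encoding='utf-8'):
        yield ast.literal_eval(l)

# Read the file once, collecting users and items in the same pass
train_users = []
train_items = []
for l in readGz("train.json.gz"):
    train_users.append(l['reviewerID'])
    train_items.append(l['itemID'])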

In [3]:
# Import test data

test_allcol = pd.read_csv('pairs_Purchase.txt')

pairings = list(test_allcol['reviewerID-itemID'])
test_users = []
test_items = []

for pairing in pairings:
    u, i = pairing.split('-')
    test_users.append(u)
    test_items.append(i)

In [4]:
# Convert to pandas DataFrames
train_df = pd.DataFrame({'ReviewerID': train_users, 'ItemID': train_items})
test_df = pd.DataFrame({'ReviewerID': test_users, 'ItemID': test_items})

train_df

,ReviewerID,ItemID
0,U490934656,I402344648
1,U714157797,I697650540
2,U507366950,I464613034
3,U307862152,I559560885
4,U742726598,I476005312
...,...,...
199995,U781794983,I245323432
199996,U151975942,I990230316
199997,U525354881,I037381245
199998,U995566285,I343675670


In [5]:
test_df

,ReviewerID,ItemID
0,U938994110,I529819131
1,U181459539,I863471064
2,U941668816,I684585522
3,U768449391,I782253949
4,U640450168,I232683472
...,...,...
27995,U337041888,I763827121
27996,U457455307,I242828364
27997,U052546714,I111529174
27998,U566804667,I857242737


In [6]:
# Count test pairs per reviewer (one entry per distinct test reviewer)
test_user_counts = test_df.groupby('ReviewerID').size().rename('TestReviewCount')
test_user_counts.head()

ReviewerID
U000005569    1
U000089279    1
U000132800    1
U000198945    2
U000243198    2
Name: TestReviewCount, dtype: int64

In [7]:
# Inner join keeps only training rows whose reviewer also appears in the test data
train_df_shrink = pd.merge(train_df, test_user_counts, how = 'inner', on = 'ReviewerID')
train_df_shrink.head()

,ReviewerID,ItemID,TestReviewCount
0,U490934656,I402344648,1
1,U490934656,I330290793,1
2,U490934656,I296399509,1
3,U361187730,I773829721,1
4,U361187730,I781019543,1


In [8]:
len(train_df_shrink)

103537
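As a sanity check on the merge, the same rows can be selected with a boolean mask. The sketch below uses made-up toy frames (all `toy_*` names are hypothetical) and compares the merge-based filter against `Series.isin`. It is only an illustration of the equivalence, not a replacement for the cell above.

```python
import pandas as pd

# Toy data, made up for illustration
toy_train = pd.DataFrame({
    "ReviewerID": ["U1", "U1", "U2", "U3"],
    "ItemID": ["I1", "I2", "I3", "I4"],
})
toy_test = pd.DataFrame({"ReviewerID": ["U1", "U3", "U3", "U7"]})

# Merge-based filter, mirroring the approach used in this notebook
toy_counts = toy_test.groupby("ReviewerID").size().rename("TestReviewCount")
merged = pd.merge(toy_train, toy_counts, how="inner", on="ReviewerID")

# Mask-based filter: keep training rows whose reviewer is in the test set
filtered = toy_train[toy_train["ReviewerID"].isin(toy_test["ReviewerID"])]

# Compare the (ReviewerID, ItemID) pairs selected by each method
cols = ["ReviewerID", "ItemID"]
merged_pairs = sorted(merged[cols].itertuples(index=False, name=None))
filtered_pairs = sorted(filtered[cols].itertuples(index=False, name=None))
print(merged_pairs == filtered_pairs)
```

The mask version does not carry the `TestReviewCount` column, so the merge is the better fit when the per-user test counts are wanted downstream.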

We've cut the training set nearly in half, down to 103,537 rows. Let's export it for later use and easy sharing.

In [9]:
train_df_shrink.to_csv('train_reduced.csv', index = False)
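Anyone picking up the exported file can load it back with `pd.read_csv`. The sketch below shows a round trip on a made-up toy frame written to a temporary directory; the `toy_reduced` data and the temporary path are assumptions for illustration. Passing `dtype` for the ID columns is a precaution so the IDs are always read as strings.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for train_df_shrink, made up for illustration
toy_reduced = pd.DataFrame({
    "ReviewerID": ["U1", "U1", "U3"],
    "ItemID": ["I1", "I2", "I4"],
    "TestReviewCount": [1, 1, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_reduced.csv")
    # index=False keeps the row index out of the file
    toy_reduced.to_csv(path, index=False)
    reloaded = pd.read_csv(path, dtype={"ReviewerID": str, "ItemID": str})

# Compare row contents of the original and reloaded frames
original_rows = list(toy_reduced.itertuples(index=False, name=None))
reloaded_rows = list(reloaded.itertuples(index=False, name=None))
print(original_rows == reloaded_rows)
```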