# 02 - Data Sampling
The goal of this notebook is to develop a sampling method for selecting movie ratings from the original 5-core Amazon Movies and TV dataset. We start by writing a sampling function that takes the sample proportion as a parameter, and then pass in different proportions to obtain datasets of different sizes. These samples will later be used to study how run-time scales as the sample size grows from small to large.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

sns.set(style='whitegrid')
%matplotlib inline

In [2]:
data_path = os.path.join('..','..','data')
data_file = os.path.join(data_path, 'ratings.csv')
df = pd.read_csv(data_file, parse_dates=True).drop('Unnamed: 0', axis = 1)

In [3]:
df.head()

| |asin|overall|reviewerID|reviewTime|
|---|---|---|---|---|
|0|5019281|4|ADZPIG9QOCDG5|02 26, 2008|
|1|5019281|3|A35947ZP82G7JH|12 30, 2013|
|2|5019281|3|A3UORV8A9D5L2E|12 30, 2013|
|3|5019281|5|A1VKW06X1O2X7V|02 13, 2008|
|4|5019281|4|A3R27T4HADWFFJ|12 22, 2013|


The density of the original 5-core dataset (the fraction of possible user-movie pairs that actually have a rating) is approximately 0.00027. If we take a small random sample directly at this sparsity level, the resulting dataset could be even sparser, which risks degrading the quality of our recommendations. To correct for this, we keep each sample above a certain density level by enforcing a minimum number of ratings per movie and per user.
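As a rough illustration of the density figure above, the sketch below computes density as the number of observed ratings divided by the number of possible user-item pairs. The helper name `rating_density` and the toy data are assumptions made for illustration. The column names follow the `df.head()` output, and the calculation assumes each user rates an item at most once.

```python
import pandas as pd


def rating_density(ratings, user_col='reviewerID', item_col='asin'):
    """Return observed ratings / (distinct users * distinct items)."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    if n_users == 0 or n_items == 0:
        return 0.0  # an empty frame has no user-item pairs to fill
    return len(ratings) / float(n_users * n_items)


# Toy data (made up): 3 users, 2 items, 4 ratings
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u3'],
    'asin': ['a', 'b', 'a', 'a'],
    'overall': [5, 4, 3, 4],
})
toy_density = rating_density(toy)  # 4 ratings out of 3 * 2 possible pairs
```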

To achieve this, we first take a random sample of a given fraction of the ratings. We then repeatedly remove movies and users whose rating counts fall below a given threshold, until every remaining movie and user has at least that many ratings.
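To see why the filter has to be repeated, consider the toy sketch below. Dropping a sparse item can push one of its users below the threshold, so a single pass is not enough. The helper `filter_once` and the toy ratings are hypothetical. The stopping rule used here (stop when a pass removes nothing) is a simplification, not necessarily the rule used in this notebook's sampling function.

```python
import pandas as pd


def filter_once(ratings, k):
    """Keep rows whose item and user each have at least k ratings in `ratings`."""
    item_counts = ratings['asin'].value_counts()
    user_counts = ratings['reviewerID'].value_counts()
    keep_items = item_counts[item_counts >= k].index
    keep_users = user_counts[user_counts >= k].index
    mask = ratings['asin'].isin(keep_items) & ratings['reviewerID'].isin(keep_users)
    return ratings[mask]


# Toy ratings (made up): u5/u6 with d/e form a dense core, the rest is sparse
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u2', 'u3', 'u4', 'u5', 'u5', 'u6', 'u6'],
    'asin':       ['a',  'b',  'a',  'c',  'a',  'b',  'd',  'e',  'd',  'e'],
})

# One pass still leaves a user (u2) with a single rating
after_one = filter_once(toy, 2)
min_user_after_one = after_one['reviewerID'].value_counts().min()

# Repeat until a pass removes nothing
current = toy
n_passes = 0
while True:
    filtered = filter_once(current, 2)
    n_passes += 1
    if len(filtered) == len(current):
        break
    current = filtered
```

After the loop, `current` holds only the rows in which both the user and the item meet the threshold.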

By default, a movie or a user should have at least five ratings in order to be included in a sample. However, this threshold can be changed based on different sample sizes and desired density levels. We also set the default random seed to be 1 in order to obtain the same random samples every time we run the program.
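The effect of fixing the seed can be checked with a small sketch like the one below, which draws the same fraction twice using pandas' `DataFrame.sample` with the same `random_state`. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame (made up) standing in for the ratings table
toy = pd.DataFrame({'overall': list(range(100))})

# Two draws with the same fraction and the same seed
first = toy.sample(frac=0.1, random_state=1)
second = toy.sample(frac=0.1, random_state=1)

# The same rows come back in the same order
same_rows = first.index.equals(second.index)
```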

In [5]:
# Define a function to make multiple samples of different sizes
def make_sample_df(in_df, in_prop, in_threshold=5, in_seed=1):
    """Sample a fraction of in_df, then iteratively drop items and users with fewer than in_threshold reviews."""

    # Initialize variables
    min_reviews_per_movie = 0
    min_reviews_per_user = 0
    n_iterations = 1

    # Take an initial random sample
    sample_df = in_df.sample(frac=in_prop, random_state=in_seed)
    n_samples = sample_df.shape[0]
    print('Number of initial samples: {}'.format(n_samples))

    # Iteratively filter the sample, removing all items and users whose total reviews are below
    # the input threshold. The loop terminates when the minimum reviews per movie and the minimum
    # reviews per user are both at least the input threshold, or when the sample becomes empty.
    while (min_reviews_per_movie < in_threshold or min_reviews_per_user < in_threshold) and len(sample_df) > 0:
        print('Iteration: {}'.format(n_iterations))

        # Calculate number of ratings per item and only keep items with at least in_threshold reviews
        sample_item_counts = sample_df.groupby('asin').count()
        items_geq = sample_item_counts[sample_item_counts['overall'] >= in_threshold]
        n_items = items_geq.shape[0]
        print('\tNumber of items with at least {} reviews: {}'.format(in_threshold, n_items))
        min_reviews_per_movie = sample_item_counts['overall'].min()
        print('\tMin reviews per movie: {}'.format(min_reviews_per_movie))

        # Calculate number of ratings per user and only keep users with at least in_threshold reviews
        sample_user_counts = sample_df.groupby('reviewerID').count()
        users_geq = sample_user_counts[sample_user_counts['overall'] >= in_threshold]
        n_users = users_geq.shape[0]
        print('\tNumber of users with at least {} reviews: {}'.format(in_threshold, n_users))
        min_reviews_per_user = sample_user_counts['overall'].min()
        print('\tMin reviews per user: {}'.format(min_reviews_per_user))

        # Maintain a list of items and users meeting threshold requirements
        asin_list = items_geq.index
        user_list = users_geq.index

        # Keep only reviews written by users from above list about items from above list
        sample_df = sample_df[sample_df['asin'].isin(asin_list) & sample_df['reviewerID'].isin(user_list)]
        print('\tNumber of final samples: {}'.format(sample_df.shape[0]))

        n_iterations += 1

    # Density = ratings kept / (items kept * users kept), using counts from the last iteration
    print('Density: {}'.format(sample_df.shape[0] / float(n_items * n_users)))
    return sample_df

Next we apply the sampling function to draw samples of a few different sizes, each with its own threshold $k$ on the number of ratings. The table below gives a basic description of each sample.

|Dataset|% Sampled before filter|Threshold $k$|Density|% Actually sampled after filter|
|---|---|---|---|---|
|pdf_sample_100|100|20|0.008|18|
|pdf_sample_50|50|13|0.007|7|
|pdf_sample_25|25|9|0.007|2|
|pdf_sample_10|10|5|0.005|1|
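The columns in this table can be derived from a raw ratings frame and a filtered sample along the lines of the sketch below. The helper name `summarize_sample` and the toy frames are assumptions for illustration. The sketch assumes both frames are non-empty and use the `asin` and `reviewerID` columns shown earlier.

```python
import pandas as pd


def summarize_sample(raw, sample, user_col='reviewerID', item_col='asin'):
    """Summarize a filtered sample relative to the raw ratings it came from."""
    n_users = sample[user_col].nunique()
    n_items = sample[item_col].nunique()
    return {
        # Share of the raw ratings that survive sampling and filtering
        'pct_kept': 100.0 * len(sample) / len(raw),
        # Smallest rating counts left in the sample, to compare against k
        'min_per_user': int(sample[user_col].value_counts().min()),
        'min_per_item': int(sample[item_col].value_counts().min()),
        # Observed ratings over possible user-item pairs
        'density': len(sample) / float(n_users * n_items),
    }


# Toy data (made up): only u1 and u2 rate both a and b
raw = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u2', 'u3', 'u4'],
    'asin':       ['a',  'b',  'a',  'b',  'a',  'c'],
})
sample = raw[raw['reviewerID'].isin(['u1', 'u2']) & raw['asin'].isin(['a', 'b'])]
summary = summarize_sample(raw, sample)
```

Comparing `min_per_user` and `min_per_item` against the intended threshold is a quick way to confirm that a sample actually satisfies it.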

In [6]:
# Sample 100% of data then keep only users and items with at least 20 reviews
pdf_sample_100 = make_sample_df(df, 1.0, 20)

Number of initial samples: 1697533
Iteration: 1
	Number of items with at least 20 reviews: 17798
	Min reviews per movie: 5
	Number of users with at least 20 reviews: 14176
	Min reviews per user: 5
	Number of final samples: 673342
Iteration: 2
	Number of items with at least 20 reviews: 9477
	Min reviews per movie: 1
	Number of users with at least 20 reviews: 10909
	Min reviews per user: 1
	Number of final samples: 533762
Iteration: 3
	Number of items with at least 20 reviews: 8624
	Min reviews per movie: 8
	Number of users with at least 20 reviews: 8515
	Min reviews per user: 1
	Number of final samples: 482382
Iteration: 4
	Number of items with at least 20 reviews: 7777
	Min reviews per movie: 10
	Number of users with at least 20 reviews: 8173
	Min reviews per user: 14
	Number of final samples: 461668
Iteration: 5
	Number of items with at least 20 reviews: 7637
	Min reviews per movie: 15
	Number of users with at least 20 reviews: 7738
	Min reviews per user: 5
	Number of final samples: 4

In [101]:
# Sample 50% of data then keep only users and items with at least 13 reviews
pdf_sample_50 = make_sample_df(df, 0.5, 13)

Number of initial samples: 848767
Iteration: 1
	Number of items with at least 13 reviews: 14416
	Min reviews per movie: 1
	Number of users with at least 13 reviews: 10344
	Min reviews per user: 1
	Number of final samples: 289873
Iteration: 2
	Number of items with at least 13 reviews: 6870
	Min reviews per movie: 1
	Number of users with at least 13 reviews: 7396
	Min reviews per user: 1
	Number of final samples: 215070
Iteration: 3
	Number of items with at least 13 reviews: 6107
	Min reviews per movie: 5
	Number of users with at least 13 reviews: 5495
	Min reviews per user: 1
	Number of final samples: 189091
Iteration: 4
	Number of items with at least 13 reviews: 5441
	Min reviews per movie: 6
	Number of users with at least 13 reviews: 5206
	Min reviews per user: 9
	Number of final samples: 178625
Iteration: 5
	Number of items with at least 13 reviews: 5342
	Min reviews per movie: 9
	Number of users with at least 13 reviews: 4917
	Min reviews per user: 6
	Number of final samples: 174154

In [102]:
# Sample 25% of data then keep only users and items with at least 9 reviews
pdf_sample_25 = make_sample_df(df, 0.25, 9)

Number of initial samples: 424383
Iteration: 1
	Number of items with at least 9 reviews: 11096
	Min reviews per movie: 1
	Number of users with at least 9 reviews: 7005
	Min reviews per user: 1
	Number of final samples: 119752
Iteration: 2
	Number of items with at least 9 reviews: 4589
	Min reviews per movie: 1
	Number of users with at least 9 reviews: 4604
	Min reviews per user: 1
	Number of final samples: 81005
Iteration: 3
	Number of items with at least 9 reviews: 3952
	Min reviews per movie: 3
	Number of users with at least 9 reviews: 3194
	Min reviews per user: 1
	Number of final samples: 67979
Iteration: 4
	Number of items with at least 9 reviews: 3416
	Min reviews per movie: 3
	Number of users with at least 9 reviews: 2974
	Min reviews per user: 5
	Number of final samples: 62490
Iteration: 5
	Number of items with at least 9 reviews: 3325
	Min reviews per movie: 6
	Number of users with at least 9 reviews: 2721
	Min reviews per user: 4
	Number of final samples: 59862
Iteration: 6
	

In [7]:
# Sample 10% of data then keep only users and items with at least 5 reviews
pdf_sample_10 = make_sample_df(df, 0.10, 5)

Number of initial samples: 169753
Iteration: 1
	Number of items with at least 5 reviews: 8830
	Min reviews per movie: 1
	Number of users with at least 5 reviews: 5379
	Min reviews per user: 1
	Number of final samples: 42870
Iteration: 2
	Number of items with at least 5 reviews: 3288
	Min reviews per movie: 1
	Number of users with at least 5 reviews: 3289
	Min reviews per user: 1
	Number of final samples: 26855
Iteration: 3
	Number of items with at least 5 reviews: 2791
	Min reviews per movie: 1
	Number of users with at least 5 reviews: 2119
	Min reviews per user: 1
	Number of final samples: 21653
Iteration: 4
	Number of items with at least 5 reviews: 2286
	Min reviews per movie: 1
	Number of users with at least 5 reviews: 1952
	Min reviews per user: 2
	Number of final samples: 19249
Iteration: 5
	Number of items with at least 5 reviews: 2194
	Min reviews per movie: 3
	Number of users with at least 5 reviews: 1753
	Min reviews per user: 2
	Number of final samples: 18156
Iteration: 6
	Nu

Finally, all samples are saved as CSV files so that they can easily be loaded back into pandas DataFrames in the next notebook.

In [104]:
# Save sampled data to CSV files
pdf_sample_10.to_csv(os.path.join(data_path, 'reviews_sample_10.csv'))
pdf_sample_25.to_csv(os.path.join(data_path, 'reviews_sample_25.csv'))
pdf_sample_50.to_csv(os.path.join(data_path, 'reviews_sample_50.csv'))
pdf_sample_100.to_csv(os.path.join(data_path, 'reviews_sample_100.csv'))
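One detail worth noting when reloading these files: `to_csv` writes the DataFrame index as an unnamed first column by default, which `read_csv` then returns as `Unnamed: 0` (the column dropped in the loading cell at the top of this notebook). The sketch below shows the round trip on a made-up frame written to a temporary directory. Passing `index=False` is one possible way to avoid the extra column, not something this notebook does.

```python
import os
import tempfile

import pandas as pd

# Toy sample (made up) with the same column names as the ratings data
sample = pd.DataFrame({
    'asin': ['a', 'b'],
    'overall': [5, 3],
    'reviewerID': ['u1', 'u2'],
})

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'reviews_sample_toy.csv')

# Default: the index is written as an unnamed first column
sample.to_csv(path)
with_index = pd.read_csv(path)

# index=False: only the data columns are written
sample.to_csv(path, index=False)
without_index = pd.read_csv(path)
```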