# Data Preparation Notebook

In this notebook, you will execute code to:

1. download the MovieLens dataset into the `ml-latest-small` directory
2. split the data into training and testing sets
3. perform negative sampling
4. calculate the statistics needed to train the NCF model
5. upload the data to an S3 bucket

## 2. Read data and perform train and test split

In [1]:
# Requirements
import os
import boto3
import sagemaker
import numpy as np
import pandas as pd

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [8]:
# Load the data
fpath = './data/filtered-meta-California.csv'
# df = pd.read_json(fpath, lines=True)
data = pd.read_csv(fpath)

In [9]:
# Let's see what the data look like
data.head()

| Unnamed: 0 | name | address | gmap_id | description | latitude | longitude | category | avg_rating | num_of_reviews | price | hours | MISC | state | relative_results | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | San Soo Dang | San Soo Dang, 761 S Vermont Ave, Los Angeles, ... | 0x80c2c778e3b73d33:0xbdc58662a4a97d49 | | 34.058092 | -118.29213 | ['Korean restaurant'] | 4.4 | 18 | | [['Thursday', '6:30AM–6PM'], ['Friday', '6:30A... | {'Service options': ['Takeout', 'Dine-in', 'De... | Open ⋅ Closes 6PM | ['0x80c2c78249aba68f:0x35bf16ce61be751d', '0x8... | https://www.google.com/maps/place//data=!4m2!3... |
| 1 | Vons Chicken | Vons Chicken, 12740 La Mirada Blvd, La Mirada,... | 0x80dd2b4c8555edb7:0xfc33d65c4bdbef42 | | 33.916402 | -118.010855 | ['Restaurant'] | 4.5 | 18 | | [['Thursday', '11AM–9:30PM'], ['Friday', '11AM... | {'Service options': ['Outdoor seating', 'Curbs... | Open ⋅ Closes 9:30PM | | https://www.google.com/maps/place//data=!4m2!3... |
| 2 | TACOS LA CABANA | TACOS LA CABANA, 2015 22nd Ave, Oakland, CA 94606 | 0x808f879f35b5088b:0xe3541cec7a95bd88 | | 37.789076 | -122.233884 | ['Taco restaurant'] | 5.0 | 2 | | [['Thursday', 'Closed'], ['Friday', '5–11PM'],... | {'Service options': ['Takeout', 'Dine-in'], 'P... | Closed ⋅ Opens 5PM Fri | | https://www.google.com/maps/place//data=!4m2!3... |
| 3 | Mariscos el poblano | Mariscos el poblano, 5401-5441 Coliseum Way, O... | 0x808f87f90c1f661f:0xf384e804a61e0c0b | | 37.764203 | -122.214647 | ['Restaurant'] | 5.0 | 3 | | [['Thursday', 'Open 24 hours'], ['Friday', '8A... | {'Service options': ['Takeout', 'Dine-in'], 'P... | Open ⋅ Closes 12AM | | https://www.google.com/maps/place//data=!4m2!3... |
| 4 | Off The Hoof | Off The Hoof, 201 E 4th St, Santa Ana, CA 92701 | 0x80dcd95d192d988b:0x68795f58e35bf888 | | 33.748329 | -117.866045 | ['Restaurant'] | 4.0 | 3 | | [['Thursday', '11AM–10PM'], ['Friday', '11AM–1... | {'Service options': ['Delivery']} | Permanently closed | | https://www.google.com/maps/place//data=!4m2!3... |


In [10]:
# Set valid ranges for latitude and longitude
valid_latitude_range = (32.3, 42.0)
valid_longitude_range = (-124.24, -114.8)

In [11]:
# Remove data points that do not fall within the valid latitude and longitude ranges
filtered_data = data[
    (data['latitude'].between(*valid_latitude_range)) &
    (data['longitude'].between(*valid_longitude_range)) &
    ~((data['longitude'] < -122) & (data['latitude'] < 35))
]
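
To make the bounding-box filter concrete, here is a minimal sketch on a toy DataFrame. The ranges are the ones defined above; the `toy` frame and its sample coordinates are made up for illustration and are not taken from the dataset. The third condition appears to drop points that fall inside the rectangle but west of the coastline.

```python
import pandas as pd

# Ranges copied from the cells above (rough bounding box around California)
valid_latitude_range = (32.3, 42.0)
valid_longitude_range = (-124.24, -114.8)

# Hypothetical sample points, not taken from the dataset
toy = pd.DataFrame({
    'name': ['Los Angeles', 'New York', 'Pacific Ocean', 'San Francisco'],
    'latitude': [34.05, 40.71, 33.0, 37.77],
    'longitude': [-118.24, -74.00, -123.5, -122.42],
})

# Series.between is inclusive on both ends by default
inside_box = (
    toy['latitude'].between(*valid_latitude_range)
    & toy['longitude'].between(*valid_longitude_range)
)
# Points west of -122 and south of 35 are inside the box but off the coast
offshore = (toy['longitude'] < -122) & (toy['latitude'] < 35)

toy_filtered = toy[inside_box & ~offshore]
print(toy_filtered['name'].tolist())
```

Only the two points that are inside the box and not in the offshore corner survive the filter.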

In [None]:
df_train, df_test = train_test_split(df, 10)
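
The cell above relies on a `train_test_split(df, 10)` helper and a `df` frame that are not defined in the cells shown here. As a hedged sketch of one common approach, the function below holds out each user's most recent interactions for testing. The name `holdout_split`, the `timestamp` column, and the toy data are assumptions for illustration; this is not necessarily the split the notebook originally used.

```python
import pandas as pd

def holdout_split(df, holdout_num):
    """Hold out each user's `holdout_num` most recent interactions for testing.

    Assumes columns `userId`, `movieId`, and `timestamp` (hypothetical schema).
    """
    # newest interactions first within each user
    df = df.sort_values(['userId', 'timestamp'], ascending=[True, False])

    # the first `holdout_num` rows per user form the test set
    is_test = df.groupby('userId').cumcount() < holdout_num

    df_test = df[is_test].reset_index(drop=True)
    df_train = df[~is_test].reset_index(drop=True)

    # sanity check: no rows lost or duplicated
    assert len(df) == len(df_train) + len(df_test)
    return df_train, df_test

# toy data: user 1 has 3 interactions, user 2 has 2
toy = pd.DataFrame({
    'userId':    [1, 1, 1, 2, 2],
    'movieId':   [10, 11, 12, 10, 13],
    'timestamp': [100, 200, 300, 150, 50],
})
toy_train, toy_test = holdout_split(toy, holdout_num=1)
print(toy_test[['userId', 'movieId']].values.tolist())
```

With `holdout_num=1`, the test set contains the single latest interaction of each user and the training set keeps the rest.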

## 3. Perform negative sampling

If we treat every user rating of an item as a positive label, the dataset contains no negative samples, and a model cannot be trained on positives alone. We therefore randomly sample `n` movies that the user has not rated for every positive interaction and label them as negatives.

In [None]:
def negative_sampling(user_ids, movie_ids, items, n_neg):
    """This function creates n_neg negative labels for every positive label
    
    @param user_ids: list of user ids
    @param movie_ids: list of movie ids
    @param items: unique list of movie ids
    @param n_neg: number of negative labels to sample
    
    @return df_neg: negative sample dataframe
    
    """
    
    neg = []
    ui_pairs = zip(user_ids, movie_ids)
    records = set(ui_pairs)
    
    # for every positive label case
    for (u, i) in records:
        # generate n_neg negative labels
        for _ in range(n_neg):
            j = np.random.choice(items)
            # resample while the sampled movie is already a positive for that user
            while (u, j) in records:
                j = np.random.choice(items)
            neg.append([u, j, 0])
    # convert to pandas dataframe for concatenation later
    df_neg = pd.DataFrame(neg, columns=['userId', 'movieId', 'rating'])
    
    return df_neg
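
One caveat of the rejection loop above: it keeps drawing until it hits an unseen item, so it never terminates for a user who has interacted with every item in `items`. As a hedged alternative sketch, the variant below samples directly from each user's unseen items. The function name `negative_sampling_unseen`, the `seed` parameter, and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def negative_sampling_unseen(user_ids, movie_ids, items, n_neg, seed=0):
    """Sample n_neg unseen items for every positive pair of each user.

    Hypothetical variant for illustration: draws from each user's unseen
    items directly instead of resampling until an unseen item is hit.
    """
    rng = np.random.default_rng(seed)
    items = np.asarray(items)

    # group the items each user has already interacted with
    seen = {}
    for u, i in set(zip(user_ids, movie_ids)):
        seen.setdefault(u, set()).add(i)

    neg = []
    for u, seen_items in seen.items():
        unseen = items[~np.isin(items, list(seen_items))]
        if len(unseen) == 0:
            continue  # nothing left to sample for this user
        # n_neg draws (with replacement) for every positive of this user
        sampled = rng.choice(unseen, size=n_neg * len(seen_items))
        neg.extend([u, j, 0] for j in sampled)

    return pd.DataFrame(neg, columns=['userId', 'movieId', 'rating'])

# toy positives: user 1 saw movies 10 and 11, user 2 saw movie 12
toy_neg = negative_sampling_unseen(
    user_ids=[1, 1, 2], movie_ids=[10, 11, 12],
    items=[10, 11, 12, 13, 14], n_neg=2,
)
print(len(toy_neg))  # 2 negatives for each of the 3 positive pairs
```

Like the original function, this draws with replacement, so the same negative pair can appear more than once.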

In [None]:
# create negative samples for training set
neg_train = negative_sampling(
    user_ids=df_train.userId.values, 
    movie_ids=df_train.movieId.values,
    items=df.movieId.unique(),
    n_neg=5
)

In [None]:
print(f'created {neg_train.shape[0]:,} negative samples')

In [None]:
df_train = df_train[['userId', 'movieId']].assign(rating=1)
df_test = df_test[['userId', 'movieId']].assign(rating=1)

df_train = pd.concat([df_train, neg_train], ignore_index=True)

## 4. Calculate statistics for our understanding and model training

In [None]:
def get_unique_count(df):
    """calculate unique user and movie counts"""
    return df.userId.nunique(), df.movieId.nunique()

In [None]:
# number of unique users and movies in the whole dataset
get_unique_count(df)

In [None]:
print('training set (unique users, unique movies)', get_unique_count(df_train))
print('testing set (unique users, unique movies)', get_unique_count(df_test))

Next, we calculate the statistics needed for model training.

In [None]:
# number of unique users and number of unique items (movies)
n_user, n_item = get_unique_count(df_train)

print("number of unique users", n_user)
print("number of unique items", n_item)
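
Embedding layers are typically indexed by integer ID, so unique counts like the ones above only work as table sizes when the IDs are contiguous and start at zero. Whether that holds for this dataset is not shown here. As a hedged sketch, one way to remap arbitrary IDs to contiguous indices is `pd.factorize`; the toy frame and the `userIdx` / `movieIdx` column names are hypothetical.

```python
import pandas as pd

# toy interactions with sparse, non-contiguous IDs
toy = pd.DataFrame({
    'userId':  [7, 7, 42, 1000],
    'movieId': [193609, 5, 5, 318],
})

# factorize assigns codes 0..n-1 in order of first appearance
toy['userIdx'], user_index = pd.factorize(toy['userId'])
toy['movieIdx'], movie_index = pd.factorize(toy['movieId'])

n_user_toy = len(user_index)
n_item_toy = len(movie_index)

print(n_user_toy, n_item_toy)
print(toy['userId'].max(), toy['movieId'].max())
```

After remapping, the largest index is always one less than the unique count, even though the raw IDs are far larger.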

In [None]:
# save the variable for the model training notebook
# -----
# read about `store` magic here: 
# https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html

%store n_user
%store n_item

## 5. Preprocess data and upload them onto S3

In [None]:
# get current session region
session = boto3.session.Session()
region = session.region_name
print(f'currently in {region}')

In [None]:
# use the default sagemaker s3 bucket to store processed data
# here we figure out what that default bucket name is 
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
print(bucket_name)  
# bucket name format: "sagemaker-{region}-{aws_account_id}"
%store bucket_name

**upload data to the bucket**

In [None]:
# save data locally first
dest = 'ml-latest-small/s3'
train_path = os.path.join(dest, 'train.npy')
test_path = os.path.join(dest, 'test.npy')

# create the directory (and any missing parents); safe to re-run
os.makedirs(dest, exist_ok=True)
np.save(train_path, df_train.values)
np.save(test_path, df_test.values)

# upload to S3 bucket (see the bucket name above)
sagemaker_session.upload_data(train_path, key_prefix='data')
sagemaker_session.upload_data(test_path, key_prefix='data')
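
For reference, the `.npy` files hold the raw `(userId, movieId, rating)` array, which a downstream script would read back with `np.load`. The following is a minimal round-trip sketch that uses a temporary directory and toy data rather than the real paths; the `toy` frame is made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# toy training data in the same column layout as df_train
toy = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [10, 13, 12],
    'rating':  [1, 0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    # create nested output directories; no error if they already exist
    dest = os.path.join(tmp, 'ml-latest-small', 's3')
    os.makedirs(dest, exist_ok=True)

    path = os.path.join(dest, 'train.npy')
    np.save(path, toy.values)

    loaded = np.load(path)

# columns come back in the DataFrame's column order
users, items, labels = loaded[:, 0], loaded[:, 1], loaded[:, 2]
print(loaded.shape, labels.tolist())
```

Because `df.values` drops the column names, any consumer of the file has to rely on this column order.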

### If you created the bucket yourself

In [None]:
# use a bucket you created yourself instead of the default sagemaker s3 bucket
# replace the placeholder below with your own account number
sagemaker_session = sagemaker.Session()
bucket_name = 'sagemaker-gacheon-[your account number]'
print(bucket_name)  
# bucket name format: "sagemaker-gacheon-{account number}"
%store bucket_name

**upload data to the bucket**

In [None]:
# save data locally first
dest = 'ml-latest-small/s3'
train_path = os.path.join(dest, 'train.npy')
test_path = os.path.join(dest, 'test.npy')

# create the directory (and any missing parents); safe to re-run
os.makedirs(dest, exist_ok=True)
np.save(train_path, df_train.values)
np.save(test_path, df_test.values)

# upload to S3 bucket (see the bucket name above)
sagemaker_session.upload_data(train_path, bucket=bucket_name, key_prefix='data')
sagemaker_session.upload_data(test_path, bucket=bucket_name, key_prefix='data')