# Sampling Large Datasets
Much of data processing involves analysing large volumes of text mixed with numerical data, a workload Spark is particularly well suited to. Sampling is an essential pre-processing step for a machine learning proof of concept, because it shrinks the data to a size that is practical to iterate on.

## RecBole dataset
RecBole is a powerful platform for training and evaluating recommender systems. It ships with many built-in datasets (https://recbole.io/dataset_list.html), some of which are too large to process on a single computer. I will use Spark to preprocess them and shrink their size.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 4.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488493 sha256=d51218b3f65615c08ebd283832cfe7afa1d6ca62b18a478e7e1718c7e497719c
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:

!rm url.yaml
!wget https://raw.githubusercontent.com/RUCAIBox/RecBole/master/recbole/properties/dataset/url.yaml
!pip install pyyaml

import yaml

# Specify the path to the YAML file
file_path = "url.yaml"

# Open the file and load the YAML contents
with open(file_path, "r") as file:
    dataset_urls = yaml.safe_load(file)
   
# Print only the first 5 entries
for key in list(dataset_urls.keys())[:5]:
    print(key, ":", dataset_urls[key])

rm: cannot remove 'url.yaml': No such file or directory
--2024-05-04 03:17:21--  https://raw.githubusercontent.com/RUCAIBox/RecBole/master/recbole/properties/dataset/url.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16548 (16K) [text/plain]
Saving to: 'url.yaml'


2024-05-04 03:17:21 (4.92 MB/s) - 'url.yaml' saved [16548/16548]

adult : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Adult/adult.zip
alibaba-ifashion : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Alibaba-iFashion/Alibaba-iFashion.zip
aliec : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/AliEC/AliEC.zip
amazon-apps-for-android : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Amazon_ratings/Amazon_Apps_for_Android.z

Set the datasets to download and process

In [3]:
datasets_to_download = ['amazon-books', 'amazon-movies-tv']

import os
# Path to the folder where the zip file will be extracted
input_folder_path = "input"

# Create the input folder if it doesn't exist
os.makedirs(input_folder_path, exist_ok=True)

# Path to the folder where processed files will be saved
output_folder_path = "output"

# Create the output folder if it doesn't exist
os.makedirs(output_folder_path, exist_ok=True)

In [4]:
!pip install requests
import requests
import zipfile
import io

def download_unzip(url, dataset_name):
    # Download the zip file and fail early on HTTP errors
    response = requests.get(url)
    response.raise_for_status()

    # Extract the zip file into input/<dataset_name>
    folder_path = os.path.join(input_folder_path, dataset_name)
    os.makedirs(folder_path, exist_ok=True)
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
        zip_file.extractall(folder_path)

    #TODO: if extracted file is a directory, move all files to the parent directory
    # for root, dirs, files in os.walk(folder_path):
    #     for file in files:
    #         os.rename(os.path.join(root, file), os.path.join(folder_path, file))
    #     for dir in dirs:
    #         os.rmdir(os.path.join(root, dir))

# Download every dataset listed in datasets_to_download
for dataset in datasets_to_download:
    download_unzip(dataset_urls[dataset], dataset)
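
The TODO in the download cell concerns archives that unpack into a nested directory. One possible way to handle that case is sketched below using only the standard library. The helper name `flatten_directory` and the demo file names are hypothetical, and the function refuses to overwrite files that share a name instead of guessing which one to keep.

```python
import os
import shutil
import tempfile


def flatten_directory(folder_path):
    """Move every file found in subdirectories of folder_path up into folder_path.

    Hypothetical helper, not part of the notebook. Raises FileExistsError
    rather than overwriting when two files share a name.
    """
    # Walk bottom-up so each subdirectory is empty by the time it is removed.
    for root, _dirs, files in os.walk(folder_path, topdown=False):
        if os.path.abspath(root) == os.path.abspath(folder_path):
            continue  # files already at the top level stay where they are
        for name in files:
            target = os.path.join(folder_path, name)
            if os.path.exists(target):
                raise FileExistsError(f"{target} already exists")
            shutil.move(os.path.join(root, name), target)
        os.rmdir(root)
    return sorted(os.listdir(folder_path))


# Demo on a throwaway directory that mimics an archive with one nested folder.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "Amazon_Books")
    os.makedirs(nested)
    for name in ("Amazon_Books.inter", "Amazon_Books.item"):
        with open(os.path.join(nested, name), "w") as handle:
            handle.write("placeholder\n")
    flattened = flatten_directory(tmp)

print(flattened)
```

Calling such a helper right after `extractall` would leave the `.inter` and `.item` files directly under `input/<dataset_name>`, which is where the later cells look for them.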



In [5]:
from pyspark.sql import SparkSession

#Building Spark Session
spark = (SparkSession.builder.appName("RecBole Sampling")
            .config("spark.driver.memory", "24G")
            .config("spark.executor.memory", "6G")
            .config("spark.executor.cores","4")
            .getOrCreate())
#spark.sparkContext.setLogLevel('INFO')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/04 03:18:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
spark.version

'3.5.1'

In [7]:
from pyspark.sql.functions import col, when, count

# read from file into dataframe
dfs = {}
for dataset in datasets_to_download:
    dataset_path = os.path.join(input_folder_path, dataset)
    dfs[dataset] = {}
    for file in os.listdir(dataset_path):
        file_path = os.path.join(dataset_path, file)
        df = spark.read.option("delimiter",'\t').option("header", True).csv(file_path)
        dfs[dataset][file] = df
        print(f"Dataset: {dataset}, File: {file}")
        df.show(5)
        
        print(f'num of {file}:',df.count())

        # Check key uniqueness. We assume key columns end with "_id" before the ":token" suffix, e.g. item_id:token
        # Find the headers ending with _id:token (the loop variable is not named `col`, to avoid confusion with pyspark's col)
        key_columns = [name for name in df.columns if name.endswith('_id:token')]
        for key_column in key_columns:
            print(f"Number of distinct {key_column}:", df.select(key_column).distinct().count())
            

        # check the completeness of each column
        print("Number of non-null values in each column:")
        df.select([count(when(col(c).isNotNull() , c)).alias(c) for c in df.columns]).show()

                                                                                

Dataset: amazon-books, File: Amazon_Books.inter
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
| AH2L9G3DQHHAJ|   0000000116|         4.0|     1019865600|
|A2IIIDRK3PRRZY|   0000000116|         1.0|     1395619200|
|A1TADCM7YWPQ8M|   0000000868|         4.0|     1031702400|
| AWGH7V0BDOJKB|   0000013714|         4.0|     1383177600|
|A3UTQPQPM4TQO0|   0000013714|         5.0|     1374883200|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of Amazon_Books.inter: 22507155
Number of distinct user_id:token: 8026324
Number of distinct item_id:token: 2330066
Number of non-null values in each column:


                                                                                

+-------------+-------------+------------+---------------+
|user_id:token|item_id:token|rating:float|timestamp:float|
+-------------+-------------+------------+---------------+
|     22507155|     22507155|    22507155|       22507155|
+-------------+-------------+------------+---------------+

Dataset: amazon-books, File: Amazon_Books.item
+-------------+----------------+----------------+--------------------+--------------------+-----------+-----------+
|item_id:token|sales_type:token|sales_rank:float|categories:token_seq|         title:token|price:float|brand:token|
+-------------+----------------+----------------+--------------------+--------------------+-----------+-----------+
|   0001048791|           Books|       6334800.0|             'Books'|The Crucible: Per...|       NULL|       NULL|
|   0001048775|           Books|      13243226.0|             'Books'|Measure for Measu...|       NULL|       NULL|
|   0001048236|           Books|       8973864.0|             'Books'|The She

                                                                                

num of Amazon_Books.item: 2370604
Number of distinct item_id:token: 2370604
Number of non-null values in each column:


                                                                                

+-------------+----------------+----------------+--------------------+-----------+-----------+-----------+
|item_id:token|sales_type:token|sales_rank:float|categories:token_seq|title:token|price:float|brand:token|
+-------------+----------------+----------------+--------------------+-----------+-----------+-----------+
|      2370604|         1891174|         1891163|             2370585|    1938767|    1679399|        106|
+-------------+----------------+----------------+--------------------+-----------+-----------+-----------+

Dataset: amazon-movies-tv, File: Amazon_Movies_and_TV.item
+-------------+--------------------+--------------------+-----------+----------------+----------------+-----------+
|item_id:token|categories:token_seq|         title:token|price:float|sales_type:token|sales_rank:float|brand:token|
+-------------+--------------------+--------------------+-----------+----------------+----------------+-----------+
|   0000143561|'Movies', 'Movies...|Everyday Italian ...|

                                                                                

Number of distinct item_id:token: 208326
Number of non-null values in each column:
+-------------+--------------------+-----------+-----------+----------------+----------------+-----------+
|item_id:token|categories:token_seq|title:token|price:float|sales_type:token|sales_rank:float|brand:token|
+-------------+--------------------+-----------+-----------+----------------+----------------+-----------+
|       208328|              208325|     107676|     155624|          204904|          204902|      12314|
+-------------+--------------------+-----------+-----------+----------------+----------------+-----------+

Dataset: amazon-movies-tv, File: Amazon_Movies_and_TV.inter
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A3R5OBKS7OM2IR|   0000143502|         5.0|     1358380800|
|A3R5OBKS7OM2IR|   0000143529|         5.0|     1380672000|
| AH3QC2PC1VTGP|   0

                                                                                

num of Amazon_Movies_and_TV.inter: 4607047
Number of distinct user_id:token: 2088620
Number of distinct item_id:token: 200941
Number of non-null values in each column:




+-------------+-------------+------------+---------------+
|user_id:token|item_id:token|rating:float|timestamp:float|
+-------------+-------------+------------+---------------+
|      4607047|      4607047|     4607047|        4607047|
+-------------+-------------+------------+---------------+
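
The column headers above follow RecBole's atomic-file convention of `name:type`. The sketch below splits such a header into names and types and picks out the key columns the same way the cell does (token columns whose name ends in `_id`). It is plain Python rather than Spark, the function names are hypothetical, and the set of valid types reflects my understanding of the format (`token`, `token_seq`, `float`, `float_seq`), so treat it as an assumption.

```python
# Field types assumed to be valid in RecBole atomic files.
VALID_TYPES = {"token", "token_seq", "float", "float_seq"}


def parse_header(header_line, sep="\t"):
    """Split a 'name:type' header line into a list of (name, type) pairs."""
    fields = []
    for column in header_line.rstrip("\r\n").split(sep):
        name, colon, field_type = column.rpartition(":")
        if not colon or not name or field_type not in VALID_TYPES:
            raise ValueError(f"unexpected column header: {column!r}")
        fields.append((name, field_type))
    return fields


def key_columns(fields):
    """Return the full headers of token columns whose name ends with '_id'."""
    return [
        f"{name}:{field_type}"
        for name, field_type in fields
        if field_type == "token" and name.endswith("_id")
    ]


header = "user_id:token\titem_id:token\trating:float\ttimestamp:float"
fields = parse_header(header)
print(fields)
print(key_columns(fields))
```

Validating the header up front makes a malformed file fail immediately instead of surfacing later as an unexpected column name in a join.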



                                                                                


## Data Processing

In [8]:
inter_map = {}
# Map each dataset to its interaction (.inter) file
for dataset in datasets_to_download:
    dataset_path = os.path.join(input_folder_path, dataset)
    for file in os.listdir(dataset_path):
        if file.endswith('.inter'):
            inter_map[dataset] = file

### Filter out inactive users and items

In [9]:
user_inter_threshold = 10
item_inter_threshold = 10

# filter out the user and item with less than threshold interactions
for dataset in datasets_to_download:
    print('-----------------------------------')
    print(f"Dataset: {dataset}")
    inter_df = dfs[dataset][inter_map[dataset]]
    
    print('num of interactions:',inter_df.count())

    # print(f'num of {inter_map[dataset]}:',inter_df.count())
    print(f'num of user_id:',inter_df.select('user_id:token').distinct().count())
    print(f'num of item_id:',inter_df.select('item_id:token').distinct().count())
    # count the number of interactions for each user and item and rename the count column
    user_count_df = inter_df.groupBy('user_id:token').count().withColumnRenamed('count','count_user')
    item_count_df = inter_df.groupBy('item_id:token').count().withColumnRenamed('count','count_item')

    # append the count of user and item to the original df
    inter_df = inter_df.join(user_count_df, on='user_id:token', how='inner')
    inter_df = inter_df.join(item_count_df, on='item_id:token', how='inner')
    inter_df.show(5)
    
    # filter out the user and item with less than threshold interactions
    inter_df = inter_df.filter((col('count_user') >= user_inter_threshold) & (col('count_item') >= item_inter_threshold))
    
    print('filtered num of interactions:',inter_df.count())
    
    # replace the stored dataframe with the filtered one, dropping the helper count columns
    dfs[dataset][inter_map[dataset]] = inter_df.drop('count_user','count_item')
    

-----------------------------------
Dataset: amazon-books


                                                                                

num of interactions: 22507155
num of user_id: 8026324
num of item_id: 2330066


                                                                                

+-------------+--------------+------------+---------------+----------+----------+
|item_id:token| user_id:token|rating:float|timestamp:float|count_user|count_item|
+-------------+--------------+------------+---------------+----------+----------+
|   0000095699|A1QHY69FQH9F5R|         3.0|     1254700800|         4|         1|
|   0001048775|A2M4YJ7ANBGYKD|         2.0|     1264550400|        21|         1|
|   0001064487|A3GFXEFR8FDX6P|         5.0|     1309046400|         1|         4|
|   0001064487|A17K364R0ETIJJ|         5.0|     1355961600|         2|         4|
|   0001064487|A1V9HZP9ONKV78|         5.0|     1367280000|        11|         4|
+-------------+--------------+------------+---------------+----------+----------+
only showing top 5 rows



                                                                                

filtered num of interactions: 6789807
-----------------------------------
Dataset: amazon-movies-tv


                                                                                

num of interactions: 4607047
num of user_id: 2088620
num of item_id: 200941


                                                                                

+-------------+--------------------+------------+---------------+----------+----------+
|item_id:token|       user_id:token|rating:float|timestamp:float|count_user|count_item|
+-------------+--------------------+------------+---------------+----------+----------+
|   B00AQN09G6|A0358075SYJ9W13JC9RE|         5.0|     1403308800|         1|       159|
|   B000E6EK42|A04004323EMIP0JQX...|         5.0|     1403568000|         1|       203|
|   B0009S4IO2|      A1001IQ9OI5H47|         5.0|     1123113600|         2|        16|
|   B003UESJH4|      A100NGGXRQF0AQ|         5.0|     1304035200|         6|      1209|
|   B00HEPDGKA|      A100OFVFM8WLFE|         3.0|     1396396800|         1|       839|
+-------------+--------------------+------------+---------------+----------+----------+
only showing top 5 rows



                                                                                

filtered num of interactions: 1204688
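
The filter above counts interactions once, on the unfiltered data. After rows are dropped, some remaining users or items can end up below the threshold again. If a strict guarantee is wanted, a common remedy is to repeat the filter until nothing changes (often called k-core filtering). The sketch below shows that loop in plain Python on `(user, item)` pairs. It illustrates the logic only, it is not Spark code, and the function name and toy data are made up.

```python
from collections import Counter


def k_core_filter(interactions, user_min, item_min):
    """Repeatedly drop rows whose user or item has too few interactions."""
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _item in current)
        item_counts = Counter(item for _user, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= user_min and item_counts[item] >= item_min
        ]
        if len(kept) == len(current):
            return kept  # stable: every remaining user and item meets the threshold
        current = kept


# Toy data: item "c" is rare, and removing it pushes user "u3" below the threshold.
toy = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
core = k_core_filter(toy, user_min=2, item_min=2)
print(core)
```

A single pass over the toy data would keep `("u3", "a")` even though `u3` has only one interaction left; the loop removes it on the second pass.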


### Output overlapping users between datasets

In [10]:
# folder list of output folders
output_folder_list = []

In [11]:
base_dataset = datasets_to_download[0]
# find the common users between base_dataset and other datasets
for j in range(1,len(datasets_to_download)):
        dataset1 = base_dataset
        dataset2 = datasets_to_download[j]
        inter1 = dfs[dataset1][inter_map[dataset1]]
        inter2 = dfs[dataset2][inter_map[dataset2]]
        inter1.createOrReplaceTempView("inter1")
        inter2.createOrReplaceTempView("inter2")

        print(f"Common users between {dataset1} and {dataset2}")    
        # get the distinct users and then intersect
        inter1_dist = inter1.select('user_id:token').distinct()
        # inter1_dist.show(5)
        common_users = inter1_dist.join(inter2, inter1_dist['user_id:token'] == inter2['user_id:token'],'leftsemi')
        common_users.show(5)

        print(f'num of common_users:',common_users.count())
        # print the items count of each inter of common users
        inter1_com_user = inter1.join(common_users, 'user_id:token')
        inter2_com_user = inter2.join(common_users, 'user_id:token')
        # statistics of inter 1
        inter1_com_user_count = inter1_com_user.count()
        inter1_com_item_count = inter1_com_user.select('item_id:token').distinct().count()
        print(f'num of interactions of common users in {dataset1}:',inter1_com_user_count)
        print(f'num of related items in the interaction:',inter1_com_item_count)
        print(f'density of {dataset1} interaction :',inter1.count()/inter1_com_user_count/inter1_com_item_count)

        # save filtered datasets to file
        inter1_out_path = os.path.join(output_folder_path, f"{dataset1}_{dataset2}")
        inter1_com_user.show(5)
        inter1_com_user.repartition(1).write.option("header", "true").csv(inter1_out_path, mode='overwrite', sep='\t')
        output_folder_list.append(inter1_out_path)
        # output_folder_map[dataset1] = inter1_out_path
        
        # statistics of inter 2
        inter2_com_user_count = inter2_com_user.count()
        inter2_com_item_count = inter2_com_user.select('item_id:token').distinct().count()
        print(f'num of interactions of common users in {dataset2}:',inter2_com_user_count)
        print(f'num of related items in the interaction:',inter2_com_item_count)
        print(f'density of {dataset2} interaction :',inter2.count()/inter2_com_user_count/inter2_com_item_count)

        # save filtered datasets to file
        inter2_out_path = os.path.join(output_folder_path, f"{dataset2}_{dataset1}")
        inter2_com_user.show(5) 
        inter2_com_user.repartition(1).write.option("header", "true").csv(inter2_out_path, mode='overwrite', sep='\t')
        output_folder_list.append(inter2_out_path)
        # output_folder_map[dataset2] = inter2_out_path

Common users between amazon-books and amazon-movies-tv


                                                                                

+--------------+
| user_id:token|
+--------------+
|A100UD67AHFODS|
|A100WFKYVRPVX7|
|A1018BDG082EVM|
|A101OMG474Q26I|
|A102K2AH06SY4L|
+--------------+
only showing top 5 rows



                                                                                

num of common_users: 15690




num of interactions of common users in amazon-books: 818364
num of related items in the interaction: 218393
density of amazon-books interaction : 3.799025416546306e-05


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A100UD67AHFODS|   0143121340|         5.0|     1353974400|
|A100UD67AHFODS|   0307352145|         5.0|     1351814400|
|A100UD67AHFODS|   0544217624|         5.0|     1390953600|
|A100UD67AHFODS|   0553245767|         5.0|     1352073600|
|A100UD67AHFODS|   0615818455|         5.0|     1393459200|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of interactions of common users in amazon-movies-tv: 604613
num of related items in the interaction: 48575
density of amazon-movies-tv interaction : 4.1018926864298005e-05


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A100UD67AHFODS|   B004HW7JH4|         5.0|     1350950400|
|A100UD67AHFODS|   6304179103|         5.0|     1351814400|
|A100UD67AHFODS|   B000H5U5EE|         5.0|     1351814400|
|A100UD67AHFODS|   B0056Q0V98|         5.0|     1351900800|
|A100UD67AHFODS|   B000E0WJUK|         5.0|     1151884800|
+--------------+-------------+------------+---------------+
only showing top 5 rows
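
The density printed above divides the total number of filtered interactions by the common-user interaction count and then by the item count. A more conventional definition of interaction density is the share of filled cells in the user-item matrix, that is, interactions / (users × items). The sketch below applies that definition to the counts printed in this section (15690 common users). The function name is hypothetical.

```python
def interaction_density(num_interactions, num_users, num_items):
    """Fraction of the user-item matrix that contains an interaction."""
    if num_users <= 0 or num_items <= 0:
        raise ValueError("num_users and num_items must be positive")
    return num_interactions / (num_users * num_items)


# Counts taken from the output above for the common-user subsets.
books_density = interaction_density(818364, 15690, 218393)
movies_density = interaction_density(604613, 15690, 48575)

print(f"amazon-books density: {books_density:.3e}")
print(f"amazon-movies-tv density: {movies_density:.3e}")
```

With this definition the value always lies between 0 and 1, provided each (user, item) pair appears at most once, which makes the two subsets directly comparable.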



                                                                                

In [12]:
# find the common users among all downloaded datasets
for dataset in datasets_to_download:
    inter = dfs[dataset][inter_map[dataset]]
    inter.createOrReplaceTempView("inter")

    print(f"Common users among all datasets")    
    # get the distinct users and then intersect
    inter_dist = inter.select('user_id:token').distinct()
    inter_dist.show(3)
    if dataset == datasets_to_download[0]:
        common_users = inter_dist
    else:
        common_users = common_users.join(inter_dist, 'user_id:token','inner')
    print(f'num of common_users after merge with {dataset}:',common_users.count())

common_users.show(3)

# export inter of common users to file
for dataset in datasets_to_download:
    inter = dfs[dataset][inter_map[dataset]]
    inter.createOrReplaceTempView("inter")
    inter_com_user = inter.join(common_users, 'user_id:token')
    inter_com_user_count = inter_com_user.count()
    inter_com_item_count = inter_com_user.select('item_id:token').distinct().count()
    print(f'num of interactions of common users in {dataset}:',inter_com_user_count)
    print(f'num of {dataset} :',inter_com_item_count)
    print(f'density of {dataset} interaction :',inter.count()/inter_com_user_count/inter_com_item_count)

    # save filtered datasets to file
    inter_out_path = os.path.join(output_folder_path, f"{dataset}_common")
    inter_com_user.show(5)
    inter_com_user.repartition(1).write.option("header", "true").csv(inter_out_path, mode='overwrite', sep='\t')
    output_folder_list.append(inter_out_path)
    # output_folder_map[dataset] = inter_out_path

Common users among all datasets


                                                                                

+--------------+
| user_id:token|
+--------------+
|A1J482FVR1LR6P|
|A17SPEC8D1SX85|
|A1PCEZZ6LE72WK|
+--------------+
only showing top 3 rows



                                                                                

num of common_users after merge with amazon-books: 293885
Common users among all datasets


                                                                                

+--------------+
| user_id:token|
+--------------+
|A140XH16IKR4B0|
|A17SPEC8D1SX85|
|A1ABI2GH9C5FG0|
+--------------+
only showing top 3 rows



                                                                                

num of common_users after merge with amazon-movies-tv: 15690


                                                                                

+--------------+
| user_id:token|
+--------------+
|A17SPEC8D1SX85|
|A1W2JCN01CTT1V|
|A33352GGJ0UVRF|
+--------------+
only showing top 3 rows



                                                                                

num of interactions of common users in amazon-books: 818364
num of amazon-books : 218393
density of amazon-books interaction : 3.799025416546306e-05


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A100UD67AHFODS|   0143121340|         5.0|     1353974400|
|A100UD67AHFODS|   0307352145|         5.0|     1351814400|
|A100UD67AHFODS|   0544217624|         5.0|     1390953600|
|A100UD67AHFODS|   0553245767|         5.0|     1352073600|
|A100UD67AHFODS|   0615818455|         5.0|     1393459200|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of interactions of common users in amazon-movies-tv: 604613
num of amazon-movies-tv : 48575
density of amazon-movies-tv interaction : 4.1018926864298005e-05


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A100UD67AHFODS|   B004HW7JH4|         5.0|     1350950400|
|A100UD67AHFODS|   6304179103|         5.0|     1351814400|
|A100UD67AHFODS|   B000H5U5EE|         5.0|     1351814400|
|A100UD67AHFODS|   B0056Q0V98|         5.0|     1351900800|
|A100UD67AHFODS|   B000E0WJUK|         5.0|     1151884800|
+--------------+-------------+------------+---------------+
only showing top 5 rows
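
The cell above narrows the user set one dataset at a time with inner joins, then restricts each interaction table to the surviving users. The same idea can be sketched in plain Python with set intersection. This only illustrates the logic on toy data and is not a replacement for the Spark joins; the function name and sample rows are hypothetical.

```python
def restrict_to_common_users(interactions_by_dataset):
    """Return (common users, per-dataset rows limited to those users).

    interactions_by_dataset maps a dataset name to a list of (user, item) rows.
    """
    user_sets = [
        {user for user, _item in rows}
        for rows in interactions_by_dataset.values()
    ]
    # Users must appear in every dataset to count as common.
    common = set.intersection(*user_sets) if user_sets else set()
    restricted = {
        name: [row for row in rows if row[0] in common]
        for name, rows in interactions_by_dataset.items()
    }
    return common, restricted


toy_datasets = {
    "books": [("u1", "b1"), ("u2", "b2"), ("u3", "b3")],
    "movies": [("u2", "m1"), ("u3", "m2"), ("u4", "m3")],
}
common, restricted = restrict_to_common_users(toy_datasets)
print(sorted(common))
print(restricted)
```

With only two datasets the result matches the pairwise comparison from the previous cell, which is why both cells report 15690 common users.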



                                                                                

In [13]:
dataset_itemfile_map = {}
def get_itemfile_path(dataset):
    dataset_path = os.path.join(input_folder_path, dataset)
    for file in os.listdir(dataset_path):
        if file.endswith('.item'):
            return os.path.join(dataset_path, file)
    return None

for output_folder in output_folder_list:
    # The dataset name is the part of the folder name before the first underscore
    dataset = os.path.basename(output_folder).split('_')[0]
    # Copy the .item file from the corresponding input folder to the output folder
    itemfile_path = get_itemfile_path(dataset)
    if itemfile_path:
        print(f"copy from {itemfile_path} to {output_folder} for {dataset} ")
        out_path = os.path.join(output_folder, f"{dataset}.item")
        !cp $itemfile_path $out_path
    else:
        print(f"item file not found for {dataset}")

    for file in os.listdir(output_folder):
        # Rename the exported CSV to .inter
        if file.endswith('.csv'):
            # Rename the file to {dataset}.inter
            file_path = os.path.join(output_folder, file)
            out_path = os.path.join(output_folder, f"{dataset}.inter")
            !mv $file_path $out_path

copy from input/amazon-books/Amazon_Books.item to output/amazon-books_amazon-movies-tv for amazon-books 
copy from input/amazon-movies-tv/Amazon_Movies_and_TV.item to output/amazon-movies-tv_amazon-books for amazon-movies-tv 
copy from input/amazon-books/Amazon_Books.item to output/amazon-books_common for amazon-books 
copy from input/amazon-movies-tv/Amazon_Movies_and_TV.item to output/amazon-movies-tv_common for amazon-movies-tv 
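
The shell commands in the cell above work inside a notebook but not in a plain Python script. A portable alternative using `glob` and `shutil` is sketched below. It assumes Spark wrote exactly one data file matching `part-*.csv` into the folder (which is what `repartition(1)` is expected to produce); the helper name and the demo file names are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def rename_single_part_file(folder, new_name):
    """Rename the only part-*.csv file in folder to new_name and return its path."""
    matches = glob.glob(os.path.join(folder, "part-*.csv"))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one part file in {folder}, found {len(matches)}")
    target = os.path.join(folder, new_name)
    shutil.move(matches[0], target)
    return target


# Demo on a throwaway folder that imitates a Spark CSV output directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("part-00000-demo.csv", "_SUCCESS"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("")
    rename_single_part_file(tmp, "amazon-books.inter")
    remaining = sorted(os.listdir(tmp))

print(remaining)
```

Raising an error when the folder holds zero or several part files avoids silently overwriting one `.inter` file with another.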


## Analyze Chronological Characteristics
TBD

## Sampling
Stratified sampling based on the hotness (interaction rate) of items.
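
This section is not implemented in the notebook, so the following is only one possible reading of the idea: group items into strata by how many interactions they have, sample the same fraction of items from every stratum, and keep the interactions of the sampled items. The sketch is plain Python, and the bucket boundaries, function names, and toy data are all assumptions. On Spark DataFrames, `sampleBy(col, fractions, seed)` provides per-stratum sampling and could serve as a starting point.

```python
import random
from collections import Counter, defaultdict


def hotness_bucket(count, boundaries=(10, 100)):
    """Stratum index: 0 = cold, 1 = warm, 2 = hot (boundaries are hypothetical)."""
    return sum(count >= boundary for boundary in boundaries)


def stratified_item_sample(interactions, fraction, seed=42, boundaries=(10, 100)):
    """Keep interactions of items sampled at the same rate from each hotness stratum."""
    item_counts = Counter(item for _user, item in interactions)
    strata = defaultdict(list)
    for item, count in sorted(item_counts.items()):
        strata[hotness_bucket(count, boundaries)].append(item)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled_items = set()
    for bucket in sorted(strata):
        items = strata[bucket]
        keep = max(1, round(len(items) * fraction))  # at least one item per stratum
        sampled_items.update(rng.sample(items, keep))
    return [(user, item) for user, item in interactions if item in sampled_items]


# Toy data: 10 cold items (1 interaction), 10 warm (10 each), 10 hot (100 each).
toy_interactions = []
for prefix, per_item in (("cold", 1), ("warm", 10), ("hot", 100)):
    for index in range(10):
        for user in range(per_item):
            toy_interactions.append((f"user_{user}", f"{prefix}_{index}"))

sample = stratified_item_sample(toy_interactions, fraction=0.5)
print(len(toy_interactions), len(sample))
```

Sampling items per stratum keeps rare and popular items represented in the same proportion as in the full dataset, whereas uniform sampling of interactions would be dominated by hot items.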

## Release all resources

In [14]:
# unpersist the dfs
for dataset in datasets_to_download:
    for key in dfs[dataset]:
        dfs[dataset][key].unpersist()
        
# Stop the Spark session
spark.stop()

## Summary
Spark is a powerful and efficient tool for sampling large-scale data:
* flexible and powerful functionality
* runs fast even on my laptop
* easy to apply to similar datasets: Amazon publishes datasets for many categories, and this notebook covers only two of them (Books and Movies & TV).