---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Data Source Information

The data for this project is open source and available via a GitHub repository. A very big thank you to the repo owner, [Jianmo Ni](https://x.com/jianmo_ni), a former UCSD student who compiled 233.1 million Amazon reviews for their paper on recommendation systems [@DataArticle].

The repo owner offers this data to the public under one condition: anyone who uses it must cite their original work, which I have included in the references section below. Alternatively, this [link](https://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf) will take you directly to the original paper that used this data. The raw data used in this report can be downloaded from the original [repo](https://nijianmo.github.io/amazon/index.html).

Due to limited storage and computational capacity, I conduct my analysis only on the "5-core" version of the Amazon Electronics reviews in the repository. The raw data is in JSON format, so it must be parsed and stored in a dataframe before any modeling or EDA can be done. For this process, I use the `parse()` and `getDF()` functions defined in the repository linked above.

In case you are not familiar, the term "5-core" refers to a dense subset of the data: the reviews have been filtered so that every remaining user and every remaining item has at least 5 reviews.
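
To make the "5-core" idea concrete, here is a minimal sketch of how such a filter could work. The downloaded data is already filtered, so this is purely illustrative and is not the repository's own code. The function name `k_core` and the field names `reviewerID` and `asin` are assumptions made for this example. The filter has to repeat because dropping one review can push another user or item below the threshold.

```python
from collections import Counter

def k_core(reviews, k=5, user_key="reviewerID", item_key="asin"):
    """Drop reviews until every remaining user and item has at least k reviews."""
    current = list(reviews)
    while True:
        user_counts = Counter(r[user_key] for r in current)
        item_counts = Counter(r[item_key] for r in current)
        kept = [
            r for r in current
            if user_counts[r[user_key]] >= k and item_counts[r[item_key]] >= k
        ]
        # Stop once a full pass removes nothing
        if len(kept) == len(current):
            return kept
        current = kept

# Toy data (hypothetical): u0 and u1 each review items a and b; u2 reviews only a
toy = [{"reviewerID": u, "asin": i} for u in ("u0", "u1") for i in ("a", "b")]
toy.append({"reviewerID": "u2", "asin": "a"})

# With k=2, the single review from u2 is removed and the other four remain
core = k_core(toy, k=2)
```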


# Code

## Data Collection Code Overview

Here, we begin the process by loading in our data. As stated above, the initial `parse()` and `getDF()` functions are borrowed from the repository linked above. However, when I first tried to load and parse this data, I ran into serious memory issues that left my machine unable to convert the data into a dataframe. Therefore, I elected to use the `orjson` library over the standard `json` module, which cut my import time dramatically. The `orjson` repository is available [here](https://github.com/ijl/orjson).

**Importing Packages and Loading Data**

In [1]:
# Importing necessary packages
import pandas as pd
import gzip
import orjson

# Loading in the data

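# Defining function that parses the gzipped json file one line at a time
def parse_orjson(path):
    # The context manager closes the file once the generator is exhausted
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield orjson.loads(line)

# Defining function to load the json data into a pandas dataframe
def getDF_orjson(path):
    # Map each row index to its parsed review record
    records = {}
    for i, d in enumerate(parse_orjson(path)):
        records[i] = d
    return pd.DataFrame.from_dict(records, orient='index')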

# Retrieving data
df = getDF_orjson('../../data/raw-data/Electronics.json.gz')
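
If memory is still tight, one possible alternative is to read the file in fixed-size batches rather than holding every record in a single dictionary. The sketch below is not part of the original workflow. It uses the standard-library `json.loads` as a stand-in for `orjson.loads`, and the helper name `iter_batches` and the sample file are hypothetical.

```python
import gzip
import json
import os
import tempfile
from itertools import islice

def iter_batches(path, batch_size):
    """Yield lists of at most batch_size parsed records from a gzipped JSON-lines file."""
    with gzip.open(path, "rb") as handle:
        while True:
            # islice pulls the next batch_size lines without reading the whole file
            batch = [json.loads(line) for line in islice(handle, batch_size)]
            if not batch:
                break
            yield batch

# Build a tiny gzipped JSON-lines file to exercise the function
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wb") as handle:
    for n in range(5):
        handle.write(json.dumps({"id": n}).encode("utf-8") + b"\n")

# Five records read two at a time
batch_sizes = [len(batch) for batch in iter_batches(sample_path, batch_size=2)]
```

Each batch is a plain list of dictionaries, so it could be passed to `pd.DataFrame` and processed or written out before the next batch is read.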


**Checking Dimensions**

- Now that the data has been loaded, let's check its shape.

In [4]:
# Printing the data shape
df.shape

(6739590, 12)

- The dataframe contains 6,739,590 reviews across 12 columns, which matches the review count listed in the original repository.

**Zipping Data**

- With that out of the way, we can write the data to a gzip-compressed CSV and continue with the process.

In [5]:
df.to_csv('../../data/raw-data/electronics_reviews.csv.gz', index=False, compression='gzip')
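
As an optional sanity check, the compressed CSV can be read back to confirm that no rows were lost. The sketch below uses only the standard library. The helper name `count_csv_rows` and the sample columns are hypothetical and serve only to illustrate the idea. It parses with `csv.reader` instead of counting raw lines, because free-text fields such as review bodies may contain line breaks inside quoted values.

```python
import csv
import gzip
import os
import tempfile

def count_csv_rows(path):
    """Count data rows (excluding the header) in a gzip-compressed CSV."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

# Build a small compressed CSV; the second review contains an embedded newline
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["overall", "reviewText"])
    writer.writerow([5, "Great product"])
    writer.writerow([1, "Broke quickly.\nWould not buy again."])

row_count = count_csv_rows(sample_path)
```

On the real file, the result could be compared against the first element of `df.shape`.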

**Moving Forward**

- Now that our data has been successfully collected and loaded, we can begin cleaning it in the [next](../data-cleaning/instructions.qmd) section.

{{< include closing.qmd >}} 