## Notebook with steps to recreate the dataset

1. Get the list of comment IDs from comment_ids.txt
2. Download the comments using the YouTube Data API

### Step 1

You can authenticate with the Google API in several ways (OAuth, API keys, etc.).
We provide the code to download the comments; all you need is an API key.
You can obtain a key by creating a project in your Google developer console and
enabling the YouTube Data API for that project.

The comment IDs are stored in comment_ids.txt.
The following cell runs get_comments_from_comment_ids.py and saves the downloaded comments in the folder 'raw_comments/'. Obtain a key and save it in keys.txt before running the cell.

In [None]:
# keys.txt should contain one key per line
# comment_ids.txt contains the comment IDs 
# raw_comments/ is the save folder

!python get_comments_from_comment_ids.py keys.txt comment_ids.txt raw_comments/ 
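
The contents of get_comments_from_comment_ids.py are not shown in this notebook. As a rough sketch of the kind of request such a download involves, the snippet below builds a call to the `comments.list` endpoint of the YouTube Data API v3 and extracts the comment text from the response. The helper names (`load_lines`, `build_request_url`, `extract_texts`, `fetch_comments`) are hypothetical, and reading the `textDisplay` field is an assumption; the provided script may work differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# comments.list endpoint of the YouTube Data API v3
API_URL = "https://www.googleapis.com/youtube/v3/comments"


def load_lines(path):
    """Return the non-empty, stripped lines of a text file (keys or comment IDs)."""
    with open(path, encoding="utf-8") as fp:
        return [line.strip() for line in fp if line.strip()]


def build_request_url(comment_ids, api_key):
    """Build a comments.list URL for a batch of comment IDs."""
    params = {
        "part": "snippet",
        "id": ",".join(comment_ids),   # comma-separated list of IDs
        "textFormat": "plainText",
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_texts(response):
    """Map comment ID -> text from a parsed comments.list response.

    Assumption: the text of interest is in snippet.textDisplay.
    """
    return {
        item["id"]: item["snippet"]["textDisplay"]
        for item in response.get("items", [])
    }


def fetch_comments(comment_ids, api_key):
    """Perform the HTTP request (needs network access and a valid key)."""
    with urlopen(build_request_url(comment_ids, api_key)) as resp:
        return extract_texts(json.load(resp))
```

With a valid key, `fetch_comments(load_lines("comment_ids.txt")[:10], load_lines("keys.txt")[0])` would return a dictionary for those IDs that the API still serves; comments that were deleted are simply absent from the response.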

### Step 2 

1. Load the comments from raw_comments/
2. Verify the SHA-1 hash of each comment
3. Save the verified comments in a JSON file and delete the individual files
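
The verification in step 2 boils down to hashing the UTF-8 bytes of a comment and comparing the hex digest with a stored reference value. A minimal, self-contained sketch of that check is shown here; the function names and the `reference` dictionary are hypothetical stand-ins for the contents of comment_ids_with_hash.json.

```python
import hashlib


def sha1_hex(text):
    """Return the SHA-1 hex digest of a string, hashed as UTF-8 bytes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def is_unchanged(text, expected_digest):
    """True if the text still hashes to the reference digest."""
    return sha1_hex(text) == expected_digest


# Hypothetical reference entry: comment ID -> SHA-1 digest of the original text
reference = {"example_id": sha1_hex("hello")}

print(is_unchanged("hello", reference["example_id"]))   # True
print(is_unchanged("hello!", reference["example_id"]))  # False
```

Any edit to a comment, even a single character, changes the digest, which is why modified comments are reported rather than stored.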

In [None]:
from pprint import pprint
import random, math
from utils import *
import json, os, sys 
import hashlib 
import shutil 

In [None]:
comment_list = os.listdir("raw_comments/")

In [None]:
# Retrieve comments and check their SHA-1 hash values

all_comments_retrieved = {}

# Reference SHA-1 digests, keyed by the same names as the files in raw_comments/
with open("comment_ids_with_hash.json") as fp:
    doc = json.load(fp)


for comment in comment_list:
    comment_filepath = os.path.join("raw_comments", comment)
    with open(comment_filepath, encoding='utf-8') as fp:
        comment_doc = json.load(fp)

    # Skip placeholder entries. The string is compared verbatim, so its spelling is left as-is.
    if comment_doc[comment] != "Comment has been removed by the user. To consruct full dataset, contact authors.":
        comment_text = comment_doc[comment]
        # Hash the UTF-8 bytes of the comment and compare with the reference digest
        digest = hashlib.sha1(comment_text.encode('utf-8')).hexdigest()
        if digest != doc[comment]:
            print(f"The comment {comment} has been removed or modified since the collection of this dataset, contact authors for full data.")
        else:
            all_comments_retrieved[comment] = comment_text

In [None]:
print(f"{len(all_comments_retrieved)}/1000 comments from our released samples were retrieved in their original state. Contact authors for full dataset.")

In [None]:
# save the comments in a json file 


with open("sample_comments.json", "w") as fp : 
    json.dump(all_comments_retrieved, fp, indent=4)

In [None]:
# remove the folder containing the individual downloaded files


shutil.rmtree("raw_comments/")