## ANLY-512 Project - Data Collection

**Authors**: Victor De Lima, Matthew Moriarty

This Jupyter notebook documents the data collection steps for our ANLY-512 project, Community Prediction with NLP. Because the data is very large (around 20 GB), we used Google Colab to retrieve it and store it in 20 separate "chunks" for use in our analysis.

In [None]:
# install transformers on colab
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1


In [None]:
# install datasets on colab
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [None]:
# import libraries
from datasets import load_dataset
import pandas as pd
import numpy as np

In [None]:
# load the reddit dataset
dataset = load_dataset("reddit")

# load_dataset returns a DatasetDict keyed by split, so we extract the 'train' split
df = dataset['train']
df.shape

Downloading and preparing dataset reddit/default to /root/.cache/huggingface/datasets/reddit/default/1.0.0/98ba5abea674d3178f7588aa6518a5510dc0c6fa8176d9653a3546d5afcb3969...


Downloading data:   0%|          | 0.00/3.14G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3848330 [00:00<?, ? examples/s]

Dataset reddit downloaded and prepared to /root/.cache/huggingface/datasets/reddit/default/1.0.0/98ba5abea674d3178f7588aa6518a5510dc0c6fa8176d9653a3546d5afcb3969. Subsequent calls will reuse this data.

(3848330, 8)

In [None]:
# collect the number of occurrences for each subreddit
subreddits = pd.Series(df['subreddit']).value_counts()

# show the top ten subreddits
subreddits.index[:10]

Index(['AskReddit', 'relationships', 'leagueoflegends', 'tifu',
       'relationship_advice', 'trees', 'gaming', 'atheism', 'AdviceAnimals',
       'funny'],
      dtype='object')

In [None]:
# display the features of the dataset
df.features

{'author': Value(dtype='string', id=None),
 'body': Value(dtype='string', id=None),
 'normalizedBody': Value(dtype='string', id=None),
 'subreddit': Value(dtype='string', id=None),
 'subreddit_id': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'content': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None)}

In [None]:
# remove unneeded columns: the duplicate text fields ('body', 'normalizedBody') and the IDs
dataset = dataset.remove_columns(['body', 'normalizedBody', 'subreddit_id', 'id'])
# pick the top 5 communities; the list has six names because 'relationships'
# and 'relationship_advice' are combined into one category later
top5 = ['AskReddit', 'relationships', 'leagueoflegends', 'tifu', 'relationship_advice', 'trees']
# keep only the rows that belong to the selected subreddits
top5df = dataset.filter(lambda example: example['subreddit'] in top5)
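
The filter step keeps a row only when its subreddit appears in the chosen list. The sketch below shows the same predicate on a few toy rows, using plain Python dictionaries instead of the Hugging Face dataset. The rows, `wanted`, and `keep_row` are hypothetical names for illustration, and the set is an optional substitute for the list, not what the notebook uses.

```python
# Hypothetical toy rows standing in for dataset examples.
rows = [
    {"subreddit": "AskReddit", "content": "a"},
    {"subreddit": "funny", "content": "b"},
    {"subreddit": "trees", "content": "c"},
    {"subreddit": "relationship_advice", "content": "d"},
]

# A set gives constant-time membership checks; a list works too.
wanted = {
    "AskReddit", "relationships", "leagueoflegends",
    "tifu", "relationship_advice", "trees",
}

def keep_row(example):
    """Return True when the row belongs to one of the selected subreddits."""
    return example["subreddit"] in wanted

kept = [row for row in rows if keep_row(row)]
kept_names = [row["subreddit"] for row in kept]
print(kept_names)
```

With millions of rows, the predicate runs once per row, so a cheap membership test can matter more than it does on a toy example.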

In [None]:
# convert the filtered 'train' split to a pandas DataFrame for easier use
# (this can be slow on a dataset of this size)
top5dfp = top5df['train'].to_pandas()
top5dfp.head()

| | author | subreddit | content | summary |
|---|---|---|---|---|
| 0 | phyzishy | AskReddit | Yeah, but most folks think avoiding gluten wil... | stupid stuff. |
| 1 | Perservere | leagueoflegends | Didn't they lose 6 games in a row? Just becaus... | just because you're close "at times" doesn't m... |
| 2 | fallsuspect | AskReddit | You probably won't come off as an ass if you j... | just get both of their numbers, text the one y... |
| 3 | Buck_Speedjunk | trees | This picture doesn't follow too well, as defin... | It's a half-assed fan-art that literally put e... |
| 4 | SinglesRazor | AskReddit | I want to say this was about two weeks ago, co... | Fuck Slender Man. |


In [None]:
# combine 'relationship_advice' and 'relationships'
top5dfp['subreddit'] = top5dfp['subreddit'].replace('relationship_advice', 'relationships')
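
To illustrate the merge, here is a minimal sketch on a toy frame. The data and the name `toy` are made up for the example; only the `replace` call mirrors the notebook.

```python
import pandas as pd

# Hypothetical toy labels; the real frame has one row per post.
toy = pd.DataFrame({
    "subreddit": ["relationships", "relationship_advice", "tifu", "relationship_advice"],
})

# Map the 'relationship_advice' label onto 'relationships'.
toy["subreddit"] = toy["subreddit"].replace("relationship_advice", "relationships")

# Count rows per label after the merge.
label_counts = toy["subreddit"].value_counts().to_dict()
print(label_counts)
```

After this step the six subreddit names collapse into five categories.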

In [None]:
# sample 40,000 posts from each of the five categories (200,000 rows total), without replacement
dfsample = top5dfp.groupby('subreddit').sample(n = 40000, replace = False, random_state = 1)
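
Sampling without replacement fails if any category has fewer rows than the requested sample size, so it can help to check group sizes first. The sketch below shows that check and the per-group sampling on toy data. The frame `toy`, the group names, and `n_per_group` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with unequal group sizes.
toy = pd.DataFrame({
    "subreddit": ["a"] * 5 + ["b"] * 3 + ["c"] * 4,
    "content": [f"post {i}" for i in range(12)],
})

n_per_group = 3

# Verify that every group can supply n_per_group rows without replacement.
group_sizes = toy["subreddit"].value_counts()
too_small = group_sizes[group_sizes < n_per_group]
if not too_small.empty:
    raise ValueError(f"groups too small to sample from: {list(too_small.index)}")

# Draw the same number of rows from each group, reproducibly.
sample = toy.groupby("subreddit").sample(n=n_per_group, replace=False, random_state=1)
sample_counts = sample["subreddit"].value_counts().to_dict()
print(sample_counts)
```

Fixing `random_state` makes the draw repeatable across runs, which is why the notebook can regenerate the same 200,000 rows.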

In [None]:
# export to CSV (300+ MB)
dfsample.to_csv('df40k.csv')

In [None]:
# split the 300 MB dataset into 20 pieces, each (10000, 4) in shape
df_split = np.array_split(dfsample, 20)
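
One detail worth knowing: `groupby(...).sample` returns rows ordered by group, so consecutive chunks of the sampled frame each hold a single subreddit. If chunks with a mix of subreddits are preferred (an assumption about downstream use, not something this notebook requires), the rows can be shuffled before splitting. The sketch below does that on toy data and splits with `iloc` slices, which also avoids the `FutureWarning` that recent pandas versions may emit for `np.array_split` on a DataFrame. All names here are hypothetical, and the row count is assumed to divide evenly by the number of chunks.

```python
import pandas as pd

# Hypothetical toy frame, ordered by group like the sampled data.
toy = pd.DataFrame({
    "subreddit": ["a"] * 4 + ["b"] * 4,
    "content": [f"post {i}" for i in range(8)],
})

n_chunks = 4

# Shuffle all rows reproducibly, then renumber the index.
shuffled = toy.sample(frac=1, random_state=1).reset_index(drop=True)

# Assumes len(shuffled) is a multiple of n_chunks.
chunk_size = len(shuffled) // n_chunks
chunks = [
    shuffled.iloc[start:start + chunk_size]
    for start in range(0, len(shuffled), chunk_size)
]
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```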

In [None]:
# export the 20 datasets to CSV for use in the analysis
for i, chunk in enumerate(df_split):
    chunk.to_csv(f'df40k_{i}.csv')
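
Because `to_csv` writes the index as an unnamed first column, reading a chunk back without `index_col=0` produces an extra `Unnamed: 0` column. The sketch below writes two toy chunks to a temporary directory and reassembles them. The toy frame and the temporary paths are assumptions for illustration; only the `df40k_{i}.csv` naming pattern follows the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the sampled data.
toy = pd.DataFrame({
    "subreddit": ["a", "a", "b", "b"],
    "content": ["w", "x", "y", "z"],
})

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Write two chunks of two rows each, index included (the to_csv default).
    for i in range(2):
        toy.iloc[i * 2:(i + 1) * 2].to_csv(tmp_dir / f"df40k_{i}.csv")
    # Read each chunk back, treating the first column as the index.
    parts = [
        pd.read_csv(tmp_dir / f"df40k_{i}.csv", index_col=0)
        for i in range(2)
    ]

# Stack the chunks back into one frame.
restored = pd.concat(parts)
print(restored)
```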