# Data Collection and Preparation
###### An overview of obtaining the Reddit comment data from BigQuery and the obstacles faced along the way.

### Step 1: Sample Dataset
- This was the first successful extraction of data from BigQuery. The result had to be exported to my personal Google Cloud Storage bucket as 50 partitioned CSVs.
- It made sense to run some preliminary EDA on one of the partitioned CSVs before attempting to operate on all 50.

![Query1.png](Query1.png)

### Exporting to a single CSV failed, so the export had to be split into shards using the '*' wildcard character

![ExtractError.png](ExtractError.png)
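
For reference, BigQuery replaces the `*` in a wildcard destination with a zero-padded 12-digit shard number, which is why the downloaded files have names like `2019_06000000000000.csv`. Below is a minimal sketch of that naming scheme. The helper `shard_names` and the pattern `2019_06*.csv` are illustrative assumptions, not the export job itself.

```python
def shard_names(pattern, n_shards):
    """Expand a single '*' wildcard into zero-padded 12-digit shard names."""
    if pattern.count('*') != 1:
        raise ValueError('pattern must contain exactly one "*"')
    return [pattern.replace('*', '{:012d}'.format(i)) for i in range(n_shards)]


# The 50 shards from the export described above (the pattern is an assumption).
names = shard_names('2019_06*.csv', 50)
print(names[0])
print(names[-1])
```

The same list of names can later be fed to `pd.read_csv` one shard at a time.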

#### Sample Google Cloud SDK command used to download all the CSVs in the Google Cloud Storage bucket
<pre>gsutil -m cp -R gs://reddit-data-ut [SAVE_TO_LOCATION]</pre>

### Step 2: Preliminary Exploratory Data Analysis and Data Subset Selection
- Perform preliminary EDA by printing a few unique values per column to check for nulls, then drop the columns that are more than 70% NaN.

In [161]:
import pandas as pd
reddit_df = pd.read_csv('2019_06000000000000.csv')

In [168]:
#Check columns for NaNs and other unique values

#excluding body
column_list = [ 'score_hidden', 'archived', 'name', 'author',
       'author_flair_text', 'downs', 'created_utc', 'subreddit_id', 'link_id',
       'parent_id', 'score', 'retrieved_on', 'controversiality', 'gilded',
       'id', 'subreddit', 'ups', 'distinguished', 'author_flair_css_class']
for i in column_list:
    print('Column Name: ' + i)
    print(reddit_df[i].unique()[0:3])

Column Name: score_hidden
[nan]
Column Name: archived
[nan False]
Column Name: name
[nan]
Column Name: author
['315Lukas' 'matinthebox' '[deleted]']
Column Name: author_flair_text
[nan 'Brandenburg' 'Dortmund']
Column Name: downs
[nan]
Column Name: created_utc
[1557534566 1558570458 1558569956]
Column Name: subreddit_id
['t5_22i0' 't5_2qmla' 't5_2ruhy']
Column Name: link_id
['t3_bmyvwy' 't3_brs6a2' 't3_brsip0']
Column Name: parent_id
['t1_en2i5p6' 't1_eoh0hg9' 't3_brs6a2']
Column Name: score
[15 22 -8]
Column Name: retrieved_on
[1561887482 1563119432 1563119083]
Column Name: controversiality
[0 1]
Column Name: gilded
[0 1 3]
Column Name: id
['en2k8je' 'eoh180n' 'eoh0hg9']
Column Name: subreddit
['de' '311' '3DS']
Column Name: ups
[nan]
Column Name: distinguished
[nan 'moderator']
Column Name: author_flair_css_class
[nan 'BRAN' 'DORMND']


#### Removed columns that are over 70% null. The null fractions show that every column is either 0% null or more than 70% null, so the threshold cleanly separates the two groups.

In [156]:
#Removing columns that are over 70% null
print(reddit_df.isnull().mean())

#Keep only the columns whose null fraction (printed above) is below 0.7
reddit_clean = reddit_df.loc[:, reddit_df.isnull().mean() < .7]


body                      0.00000
score_hidden              1.00000
archived                  0.99999
name                      1.00000
author                    0.00000
author_flair_text         0.79152
downs                     1.00000
created_utc               0.00000
subreddit_id              0.00000
link_id                   0.00000
parent_id                 0.00000
score                     0.00000
retrieved_on              0.00000
controversiality          0.00000
gilded                    0.00000
id                        0.00000
subreddit                 0.00000
ups                       1.00000
distinguished             0.97918
author_flair_css_class    0.83909
dtype: float64
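
The thresholding step can be tried on a small frame first. Below is a minimal sketch, assuming a toy DataFrame whose column names echo the real data but whose values are made up for illustration. `drop_sparse_columns` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_null_fraction=0.7):
    """Keep only columns whose fraction of nulls is below the threshold."""
    null_fraction = df.isnull().mean()
    return df.loc[:, null_fraction < max_null_fraction]


# Toy data (invented values): two dense columns and two sparse ones.
toy = pd.DataFrame({
    'body': ['a', 'b', 'c', 'd'],
    'ups': [np.nan, np.nan, np.nan, np.nan],                 # 100% null
    'distinguished': [np.nan, np.nan, np.nan, 'moderator'],  # 75% null
    'score': [15, 22, -8, 1],
})
clean = drop_sparse_columns(toy)
print(list(clean.columns))
```

Only `body` and `score` survive here, mirroring how the sparse columns are dropped from the real sample.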


#### Find the most popular subreddits for data subset selection

In [169]:
#Count of the sample data to see top 20 most popular subreddits by comment count
reddit_clean.groupby('subreddit')['body'].count().sort_values(ascending=False)[0:20]

subreddit
AskReddit            4796
UFC237LiveStreams    1961
nba                  1454
SquaredCircle        1428
freefolk             1406
teenagers            1240
politics             1179
dankmemes             983
NBAPLAYOFFSLiveHQ     967
memes                 967
gameofthrones         828
AmItheAsshole         818
funny                 715
The_Donald            691
UFC237LiveOnline      657
unpopularopinion      639
worldnews             580
FortNiteBR            546
Market76              531
news                  525
Name: body, dtype: int64
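
Because `body` has no nulls in this sample, counting rows per subreddit gives the same ranking as counting non-null bodies. Below is a shorter equivalent using `value_counts`, shown on made-up toy data rather than the real shard.

```python
import pandas as pd

# Toy data (invented values) standing in for reddit_clean.
toy = pd.DataFrame({
    'subreddit': ['AskReddit', 'nba', 'AskReddit', 'worldnews', 'nba', 'AskReddit'],
    'body': ['q1', 'c1', 'q2', 'w1', 'c2', 'q3'],
})

# value_counts sorts in descending order by default; head(n) keeps the top n.
top = toy['subreddit'].value_counts().head(2)
print(top)
```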

### Step 3: Merge the datasets to give the other team members something to work with

In [108]:
#Deprecated code used to merge all the csvs when working with all comments of a given month
# master_df = pd.DataFrame(columns = [ 'body', 'author', 'created_utc', 'subreddit_id', 'link_id',
#        'parent_id', 'score', 'retrieved_on', 'controversiality', 'gilded',
#        'id', 'subreddit'])

# for i in range (0,50):
#     if i <= 9:
#         reddit_df = pd.read_csv('2019_0600000000000{}.csv'.format(i))
#     else:
#         reddit_df = pd.read_csv('2019_060000000000{}.csv'.format(i))
        
#     reddit_df = reddit_df[['body', 'author', 'created_utc', 'subreddit_id', 'link_id',
#        'parent_id', 'score', 'retrieved_on', 'controversiality', 'gilded',
#        'id', 'subreddit']]
    
#     worldnews = reddit_df[reddit_df['subreddit']=='worldnews']
    
#     master_df = master_df.append(worldnews,ignore_index=True)
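
`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the loop above no longer runs on current versions. Below is a sketch of the same merge using `pd.concat`, assuming the shard naming from Step 1. `merge_shards` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

KEEP_COLUMNS = ['body', 'author', 'created_utc', 'subreddit_id', 'link_id',
                'parent_id', 'score', 'retrieved_on', 'controversiality',
                'gilded', 'id', 'subreddit']


def merge_shards(paths, subreddit, columns=KEEP_COLUMNS):
    """Read each CSV shard, keep one subreddit's rows, and concatenate them."""
    frames = []
    for path in paths:
        # usecols skips the mostly-null columns while parsing.
        shard = pd.read_csv(path, usecols=columns)
        frames.append(shard[shard['subreddit'] == subreddit])
    if not frames:
        return pd.DataFrame(columns=columns)
    # Concatenate once at the end and restore the requested column order.
    return pd.concat(frames, ignore_index=True)[columns]


# Assumed usage with the 50 shards from Step 1:
# paths = ['2019_06{:012d}.csv'.format(i) for i in range(50)]
# master_df = merge_shards(paths, 'worldnews')
```

Collecting the filtered frames in a list and concatenating once avoids re-copying the growing result on every iteration.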


### Step 4: Attempt to merge "r/worldnews" of June '19 with May '19 and July '19
This used the two additional datasets obtained by querying BigQuery for the respective months (May and July).

![Query3.png](Query3.png)
![Query2.png](Query2.png)

In [126]:
#Deprecated. Used to merge the monthly r/worldnews data; the merged result turned out to be too large to convert to csv
# master_df = pd.DataFrame(columns = [ 'body', 'author', 'created_utc', 'subreddit_id', 'link_id',
#        'parent_id', 'score', 'retrieved_on', 'controversiality', 'gilded',
#        'id', 'subreddit'])

# for i in range (5,8):
#     worldnews_month_df = pd.read_csv('worldnews_0{}19.csv'.format(i))
        
#     worldnews_month_df = worldnews_month_df[['body', 'author', 'created_utc', 'subreddit_id', 'link_id',
#        'parent_id', 'score', 'retrieved_on', 'controversiality', 'gilded',
#        'id', 'subreddit']]
    
#     master_df = master_df.append(worldnews_month_df,ignore_index=True)
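
If holding the merged frame in memory is the bottleneck, one alternative is to stream the monthly files into a single CSV row by row, so that no full DataFrame is ever built. Below is a minimal sketch using only the standard library. `concat_csvs` and the output file name are hypothetical, and this is not the approach the project ended up using.

```python
import csv


def concat_csvs(input_paths, output_path, columns):
    """Stream rows from several CSVs into one file, keeping only `columns`."""
    rows_written = 0
    with open(output_path, 'w', newline='', encoding='utf-8') as out_file:
        # extrasaction='ignore' silently drops any column not in `columns`.
        writer = csv.DictWriter(out_file, fieldnames=columns,
                                extrasaction='ignore')
        writer.writeheader()
        for path in input_paths:
            with open(path, newline='', encoding='utf-8') as in_file:
                for row in csv.DictReader(in_file):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Assumed usage with the monthly files named in the loop above:
# paths = ['worldnews_0{}19.csv'.format(i) for i in range(5, 8)]
# concat_csvs(paths, 'worldnews_merged.csv', ['body', 'author', 'subreddit'])
```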

### Step 5: Finally figured out that the smart thing to do, instead of merging 240MB-480MB CSVs, was to query the data directly from BigQuery

![Query4.png](Query4.png)
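
The actual query is the one shown in Query4.png. The following is only a sketch of how such a query could be assembled, assuming one table per month named like the exported files (`2019_05`, `2019_06`, `2019_07`). The table prefix `your-project.your_dataset.` is a placeholder, and `build_worldnews_query` is a hypothetical helper.

```python
COLUMNS = ['body', 'author', 'created_utc', 'subreddit_id', 'link_id',
           'parent_id', 'score', 'retrieved_on', 'controversiality',
           'gilded', 'id', 'subreddit']


def build_worldnews_query(table_prefix, months, subreddit='worldnews'):
    """Build a UNION ALL query with one SELECT per monthly table."""
    selects = [
        "SELECT {cols} FROM `{prefix}{month}` WHERE subreddit = '{sub}'".format(
            cols=', '.join(COLUMNS), prefix=table_prefix,
            month=month, sub=subreddit)
        for month in months
    ]
    return '\nUNION ALL\n'.join(selects)


# Placeholder prefix: replace with the real project and dataset.
query = build_worldnews_query('your-project.your_dataset.',
                              ['2019_05', '2019_06', '2019_07'])
print(query)
```

Filtering on `subreddit` inside the query means only the r/worldnews rows ever leave BigQuery, which is what keeps the exported file small.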

In [145]:
#The May'19-July'19 worldnews csv after querying from BigQuery
master_df = pd.read_csv('summer19_worldnews.csv')