# Cleaning the YouTube & Reddit Data

Once all the data was collected, I read in the JSON files individually, transformed them into DataFrame objects, and combined them into a single data set. Because the comments for each post were stored as nested dictionaries, the comment text needed special attention before modeling: the dictionaries had to be flattened and their contents converted to strings. Initially, I had planned on modeling each comment separately. However, after reviewing the data, I decided it would be more appropriate to aggregate all comments, along with the post description, into a single text document per post, so that the NLP results reflect the predominant sentiment of the comments as a whole.
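
To make the flattening and aggregation step concrete, here is a minimal sketch on toy data. The `posts` records and the helper `aggregate_post_text` are hypothetical and only mimic the shape of the collected data (a list of single-key comment dictionaries per post); it is an illustration of the idea, not the exact code used in the cells of this notebook.

```python
import pandas as pd

# Toy records shaped like the collected posts (hypothetical sample data).
posts = [
    {"id": "a1", "description": "First post",
     "comments": [{"body": "Great"}, {"body": "So good"}]},
    {"id": "b2", "description": "Second post",
     "comments": [{"body": "Meh"}, None]},
]

def aggregate_post_text(post):
    """Flatten a post's comment dictionaries and join them with the description."""
    texts = [post["description"]]
    for comment in post["comments"]:
        # Skip missing comments; each dictionary holds a single text value.
        if isinstance(comment, dict):
            texts.extend(str(value) for value in comment.values())
    return " ".join(texts)

toy_df = pd.DataFrame(posts)
toy_df["all_comments"] = [aggregate_post_text(post) for post in posts]
print(toy_df[["id", "all_comments"]])
```

Each post ends up as one row with one text document, which is the unit the models work on.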

The models in the subsequent notebook use a set of words that represent emotions as their target values. To avoid skewing the results of the models, I stripped every occurrence of these terms from the comment text. I saved the resulting data as a JSON file, to be imported for future use.

In [1]:
## Packages and libraries:

import os
import codecs
import regex as re
import praw
import json
import time
import pandas as pd
import numpy as np
import flatten_json as flatten
from pandas.io.json import json_normalize
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
import itertools as it

### Read & parse Reddit Data:

In [2]:
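reddit_data = []
notParsed = []
# Read the file line by line; each non-empty line should hold one JSON document.
with open('./Data/REDDITdata_020919.json', "r") as reddit_file:
    for line in reddit_file:
        if line.strip():
            try:
                post = json.loads(line)
                reddit_data.append(post)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                notParsed.append(line)
print(len(reddit_data))
print('Could not parse: ', len(notParsed))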

1
Could not parse:  0
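
Since the same read-and-parse loop is needed again for the YouTube files, it could be wrapped in a small helper. The sketch below is one possible shape for it: the function name `read_json_lines`, the UTF-8 encoding, and the throwaway demo file are assumptions made for illustration, not part of the original pipeline.

```python
import json
import os
import tempfile

def read_json_lines(path):
    """Parse a file with one JSON document per line; return (parsed, failed)."""
    parsed, failed = [], []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            try:
                parsed.append(json.loads(line))
            except json.JSONDecodeError:
                failed.append(line)  # keep the raw line for later inspection
    return parsed, failed

# Demonstration on a throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write('{"id": 1}\n\nnot json\n[1, 2]\n')
    demo_path = tmp.name

records, bad_lines = read_json_lines(demo_path)
os.remove(demo_path)
print(len(records), len(bad_lines))
```

Returning the unparsed lines alongside the parsed ones keeps the "Could not parse" count available without relying on a global list.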


In [3]:
reddit_data = pd.io.json.json_normalize(reddit_data[0][0])
reddit_data.head()

Unnamed: 0,comments,created_utc,id,num_comments,score,title,ups
0,[{'body': 'Remember when the highest upvoted p...,1480960000.0,5gn8ru,5031,283485,Guardians of the Front Page,283485
1,"[{'body': '^^psst, ^^hey ^^kid, ^^want ^^some ...",1478651000.0,5bx4bx,6128,230831,"Thanks, Obama.",230831
2,[{'body': 'Here is a collection of all the que...,1346270000.0,z1c9z,23276,216150,"I am Barack Obama, President of the United Sta...",216150
3,[{'body': 'It's fantastic that many airlines t...,1486401000.0,5sfexx,4374,222808,"This is Shelia Fredrick, a flight attendant. S...",222808
4,[{'body': 'That may be the single most superhe...,1482426000.0,5jrlw1,5681,204178,1 dad reflex 2 children,204178


In [4]:
comments = reddit_data['comments'].apply(pd.Series).squeeze()
comments.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,{'body': 'Remember when the highest upvoted po...,{'body': 'Can't wait to upvote this 17 differe...,{'body': 'Nice watermark PP. Clever.'},{'body': 'The mirrored text brought this to a ...,{'body': '[deleted]'},{'body': 'Thank you /u/iH8myPP for wasting you...,{'body': 'https://gfycat.com/RecentIdleAbalone'},{'body': 'I didn't realize how much I missed G...,{'body': 'Source: Guardians of the Galaxy 2 (...,"{'body': 'Great job, OP. I can't wait to see ...",...,{'body': 'Why is this post losing upvotes like...,{'body': 'How has this come from 19000 up vote...,{'body': 'I like reposts. It's sometimes fun ...,{'body': 'This is reddit history happening rig...,"{'body': 'This has gone from 21000 upvotes, to...",{'body': 'PP I had no idea you beat Barack Oba...,{'body': '/r/HighQualityGifs/ for more like th...,{'body': 'I really wish this was a repost'},{'body': 'Back in my day this had 20k upvotes'},"{'body': 'This post has almost 200K upvotes, w..."


In [5]:
reddit_data.drop('comments', axis=1, inplace=True)
reddit_data['description'] = reddit_data['title']
reddit_data['source'] = 'Reddit.com'
reddit_data = reddit_data[['id', 'created_utc', 'num_comments', 'ups', 'score', 'source', 'title', 'description']]
reddit_data.columns = ['post_id', 'published_on','num_comments', 'likes', 'score_views', 'source', 'title', 'description']
reddit_data.head()

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description
0,5gn8ru,1480960000.0,5031,283485,283485,Reddit.com,Guardians of the Front Page,Guardians of the Front Page
1,5bx4bx,1478651000.0,6128,230831,230831,Reddit.com,"Thanks, Obama.","Thanks, Obama."
2,z1c9z,1346270000.0,23276,216150,216150,Reddit.com,"I am Barack Obama, President of the United Sta...","I am Barack Obama, President of the United Sta..."
3,5sfexx,1486401000.0,4374,222808,222808,Reddit.com,"This is Shelia Fredrick, a flight attendant. S...","This is Shelia Fredrick, a flight attendant. S..."
4,5jrlw1,1482426000.0,5681,204178,204178,Reddit.com,1 dad reflex 2 children,1 dad reflex 2 children


In [6]:
from datetime import datetime

reddit_data['published_on'] = [datetime.fromtimestamp(x).strftime("%Y-%m-%d %H:%M:%S") 
                            for x in reddit_data['published_on']]

reddit_data['published_on'] = [pd.to_datetime(x) for x in reddit_data['published_on']]
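
Note that `datetime.fromtimestamp` converts to the local time zone of the machine running the notebook, so the values produced by the cell above are local times. As a hedged alternative, pandas can convert epoch seconds in one vectorized call; the result is in UTC and will therefore differ from the local-time values by the machine's UTC offset. The `created_utc` sample values below are illustrative.

```python
import pandas as pd

# Hypothetical epoch values in seconds, like Reddit's created_utc field.
created_utc = pd.Series([1480960000.0, 1346270000.0])

# unit="s" interprets the numbers as seconds since 1970-01-01 00:00:00 UTC.
published_on = pd.to_datetime(created_utc, unit="s")
print(published_on)
```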

In [7]:
reddit_df = pd.concat([reddit_data, comments], axis=1)

In [8]:
reddit_df.head()

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description,0,1,...,40,41,42,43,44,45,46,47,48,49
0,5gn8ru,2016-12-05 09:41:14,5031,283485,283485,Reddit.com,Guardians of the Front Page,Guardians of the Front Page,{'body': 'Remember when the highest upvoted po...,{'body': 'Can't wait to upvote this 17 differe...,...,{'body': 'Why is this post losing upvotes like...,{'body': 'How has this come from 19000 up vote...,{'body': 'I like reposts. It's sometimes fun ...,{'body': 'This is reddit history happening rig...,"{'body': 'This has gone from 21000 upvotes, to...",{'body': 'PP I had no idea you beat Barack Oba...,{'body': '/r/HighQualityGifs/ for more like th...,{'body': 'I really wish this was a repost'},{'body': 'Back in my day this had 20k upvotes'},"{'body': 'This post has almost 200K upvotes, w..."
1,5bx4bx,2016-11-08 16:27:25,6128,230831,230831,Reddit.com,"Thanks, Obama.","Thanks, Obama.","{'body': '^^psst, ^^hey ^^kid, ^^want ^^some ^...",{'body': 'The president we needed. Now time fo...,...,"{'body': 'You know, as a conservative libertar...",{'body': 'He's a genuinely good guy and a grea...,"{'body': 'Mr Obama, On the off chance that you...",{'body': 'Meh.'},{'body': 'I never thought I'd say this but I'm...,"{'body': 'Thanks Obama, I had dollars in my po...",{'body': 'This kinda doesn't make sense to me ...,{'body': ' I miss him so much'},{'body': 'OBAMA I love you'},"{'body': 'Yep, we miss you.'}"
2,z1c9z,2012-08-29 13:01:36,23276,216150,216150,Reddit.com,"I am Barack Obama, President of the United Sta...","I am Barack Obama, President of the United Sta...",{'body': 'Here is a collection of all the ques...,{'body': 'How in the fuck was PresidentObama n...,...,{'body': 'With all the patent lawsuits taking ...,{'body': 'http://i.imgur.com/Ju94o.png'},{'body': '[deleted]'},{'body': 'If you had to select one non-politic...,{'body': 'Why do petitions keep disappearing f...,"{'body': 'Mr. President, can you stop the NHL ...","{'body': 'President Obama, why didn't you clos...","{'body': 'Mr. President, why did you insist on...",{'body': '[BO in third grade and now](http://i...,{'body': 'Why have you not gotten rid of the P...
3,5sfexx,2017-02-06 09:06:40,4374,222808,222808,Reddit.com,"This is Shelia Fredrick, a flight attendant. S...","This is Shelia Fredrick, a flight attendant. S...",{'body': 'It's fantastic that many airlines te...,{'body': 'Good for her. I can only hope that i...,...,{'body': 'just don't let your hostage go to th...,{'body': 'I'm worried I'm going to be on one o...,"{'body': 'If you're ever in that position, put...",{'body': 'Dont worry there are tons of single ...,{'body': 'This is part of the new initiative i...,"{'body': '474 arrests, just in California, in ...",{'body': 'I live in Thailand. I've met human t...,{'body': '[removed]'},{'body': 'I wanna see the faces and names of t...,"{'body': 'And for those interested, but too la..."
4,5jrlw1,2016-12-22 08:57:35,5681,204178,204178,Reddit.com,1 dad reflex 2 children,1 dad reflex 2 children,{'body': 'That may be the single most superher...,{'body': 'The back roll makes this even better'},...,{'body': 'This is honestly one of the most inc...,{'body': 'Is that a snake that's writhing on t...,{'body': 'That guys dodge roll even protected ...,{'body': 'Truly a hero. Absolutely incredible....,{'body': 'Can we find out who this guy is? I w...,{'body': 'This guy needs a medal and his face ...,{'body': 'The backward roll that made the diff...,{'body': 'They really need to make rolling cos...,{'body': 'r/DadReflexes '},{'body': 'That's the reaction that people day ...


In [9]:
for i in range(50):
    reddit_df[i] = [x.values() for x in reddit_df[i]]

reddit_df.head(1)

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description,0,1,...,40,41,42,43,44,45,46,47,48,49
0,5gn8ru,2016-12-05 09:41:14,5031,283485,283485,Reddit.com,Guardians of the Front Page,Guardians of the Front Page,(Remember when the highest upvoted post you sa...,(Can't wait to upvote this 17 different times ...,...,(Why is this post losing upvotes like wildfire?),(How has this come from 19000 up votes to 9000...,(I like reposts. It's sometimes fun to see a ...,(This is reddit history happening right here. ...,"(This has gone from 21000 upvotes, to 17000, n...","(PP I had no idea you beat Barack Obama's AMA,...",(/r/HighQualityGifs/ for more like this.),(I really wish this was a repost),(Back in my day this had 20k upvotes),"(This post has almost 200K upvotes, wtf?)"


### Read & parse YouTube Data:

In [10]:
filepath_list = ['YOUTUBEdata_020919_1', 'YOUTUBEdata_020919_2', 'YOUTUBEdata_020919_3', 
                 'YOUTUBEdata_020919_4', 'YOUTUBEdata_020919_5', 'YOUTUBEdata_020919_6', 
                 'YOUTUBEdata_020919_7', 'YOUTUBEdata_020919_8']

youtube_data = []
notParsed = []

In [11]:
for file in filepath_list:
    # Open each file in a context manager so it is closed after reading.
    with open('./Data/' + file + '.json', "r") as youtube_file:
        for line in youtube_file:
            if line.strip():
                try:
                    post = json.loads(line)
                    youtube_data.append(post)
                except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                    notParsed.append(line)
print(len(youtube_data))
print('Could not parse: ', len(notParsed))

8
Could not parse:  0


In [12]:
yt_combined = []
for item in youtube_data:
    yt_combined += item[0][0]

In [13]:
youtube_data = pd.io.json.json_normalize(yt_combined)
youtube_data.head(1)

Unnamed: 0,commentCount,comments,description,dislikeCount,likeCount,publishedAt,title,videoID,viewCount
0,3035242,"[{'topLevelComment': 'D E S P A C I T O :)'}, ...",“Despacito” disponible ya en todas las platafo...,3862675,31966704,2017-01-13T05:00:02.000Z,Luis Fonsi - Despacito ft. Daddy Yankee,kJQP7kiw5Fk,5959601315


In [14]:
comments = youtube_data['comments'].apply(pd.Series).squeeze() 
comments.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,{'topLevelComment': 'D E S P A C I T O :)'},{'topLevelComment': 'February ? 2019😂'},{'topLevelComment': 'Who's here to check to se...,{'topLevelComment': 'Almost 6 billion views'},"{'topLevelComment': 'Wow only 6 billion views,...",{'topLevelComment': '2020 guys anyone... anyon...,{'topLevelComment': '3:25'},{'topLevelComment': '2019/2/10؟؟؟؟'},{'topLevelComment': '2017 2018 2019 💪💪💪'},{'topLevelComment': 'Just admit it you only ca...,...,{'topLevelComment': '2020'},{'topLevelComment': 'Anyone else watching this...,{'topLevelComment': 'February 2019😬🙅'},{'topLevelComment': '?'},{'topLevelComment': 'Ummmmm... why am I here'},{'topLevelComment': '2019?'},{'topLevelComment': '2019'},{'topLevelComment': 'hola putos'},{'topLevelComment': 'Oi'},{'topLevelComment': 'Hello'}


In [15]:
youtube_data['source'] = 'Youtube.com'
youtube_data = youtube_data[['videoID','publishedAt','commentCount', 'likeCount', 'viewCount', 'source', 'title', 'description']]
youtube_data.columns = ['post_id', 'published_on','num_comments', 'likes', 'score_views', 'source', 'title', 'description']
youtube_data.head(3)       

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description
0,kJQP7kiw5Fk,2017-01-13T05:00:02.000Z,3035242,31966704,5959601315,Youtube.com,Luis Fonsi - Despacito ft. Daddy Yankee,“Despacito” disponible ya en todas las platafo...
1,JGwWNGJdvx8,2017-01-30T10:57:50.000Z,873673,19004938,4066735225,Youtube.com,Ed Sheeran - Shape of You [Official Video],Stream or Download Shape Of You: https://atlan...
2,OPf0YbXqDm0,2014-11-19T14:00:18.000Z,477799,12290901,3455955229,Youtube.com,Mark Ronson - Uptown Funk ft. Bruno Mars (Offi...,Mark Ronson's official music video for 'Uptown...


In [16]:
youtube_data['published_on'] = [pd.to_datetime(x) for x in youtube_data['published_on']]

In [17]:
print(youtube_data['published_on'][:3])
print(reddit_data['published_on'][:3])

0   2017-01-13 05:00:02
1   2017-01-30 10:57:50
2   2014-11-19 14:00:18
Name: published_on, dtype: datetime64[ns]
0   2016-12-05 09:41:14
1   2016-11-08 16:27:25
2   2012-08-29 13:01:36
Name: published_on, dtype: datetime64[ns]


In [18]:
youtube_df = pd.concat([youtube_data, comments], axis=1)

In [19]:
youtube_df.head(3)

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description,0,1,...,40,41,42,43,44,45,46,47,48,49
0,kJQP7kiw5Fk,2017-01-13 05:00:02,3035242,31966704,5959601315,Youtube.com,Luis Fonsi - Despacito ft. Daddy Yankee,“Despacito” disponible ya en todas las platafo...,{'topLevelComment': 'D E S P A C I T O :)'},{'topLevelComment': 'February ? 2019😂'},...,{'topLevelComment': '2020'},{'topLevelComment': 'Anyone else watching this...,{'topLevelComment': 'February 2019😬🙅'},{'topLevelComment': '?'},{'topLevelComment': 'Ummmmm... why am I here'},{'topLevelComment': '2019?'},{'topLevelComment': '2019'},{'topLevelComment': 'hola putos'},{'topLevelComment': 'Oi'},{'topLevelComment': 'Hello'}
1,JGwWNGJdvx8,2017-01-30 10:57:50,873673,19004938,4066735225,Youtube.com,Ed Sheeran - Shape of You [Official Video],Stream or Download Shape Of You: https://atlan...,{'topLevelComment': 'MTB SUPER KONG EVERYONE W...,{'topLevelComment': 'who will listen forever😘❤'},...,{'topLevelComment': 'الاستغراب الاشتراك اه فى ...,{'topLevelComment': 'Cade os br dessa poha'},{'topLevelComment': 'Ed sheeran Madefucker'},{'topLevelComment': '2019's best music'},{'topLevelComment': 'Mars 2019???'},{'topLevelComment': 'Ho My god'},{'topLevelComment': 'Love'},{'topLevelComment': '10.02.2019 :)))))'},{'topLevelComment': 'Wery good song top 1'},{'topLevelComment': 'Why are you here in 2019?'}
2,OPf0YbXqDm0,2014-11-19 14:00:18,477799,12290901,3455955229,Youtube.com,Mark Ronson - Uptown Funk ft. Bruno Mars (Offi...,Mark Ronson's official music video for 'Uptown...,{'topLevelComment': 'Me encanta mmmmm me fasci...,{'topLevelComment': 'Aburrido no le entendí y ...,...,{'topLevelComment': 'Mark Ronson contribution?'},{'topLevelComment': 'I love'},{'topLevelComment': 'WHO LISTENING IN NOVEMBER...,{'topLevelComment': '🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑 🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑...,{'topLevelComment': 'Bruno = GOD Btw in m...,{'topLevelComment': '한국인 손'},{'topLevelComment': '😍👏👏👏👏👏👏👏👏'},{'topLevelComment': '1:32 i love this part'},{'topLevelComment': 'Jdjywkod'},{'topLevelComment': 'I LOVE♡♡♡♡♡ DIRT ON MY BO...


In [20]:
comment_columns = youtube_df.iloc[:, 7:]

for column in comment_columns:
    for x in youtube_df.loc[youtube_df[column].isnull(), column].index:
        youtube_df.at[x, column] = {}

youtube_df.head(3)          

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description,0,1,...,40,41,42,43,44,45,46,47,48,49
0,kJQP7kiw5Fk,2017-01-13 05:00:02,3035242,31966704,5959601315,Youtube.com,Luis Fonsi - Despacito ft. Daddy Yankee,“Despacito” disponible ya en todas las platafo...,{'topLevelComment': 'D E S P A C I T O :)'},{'topLevelComment': 'February ? 2019😂'},...,{'topLevelComment': '2020'},{'topLevelComment': 'Anyone else watching this...,{'topLevelComment': 'February 2019😬🙅'},{'topLevelComment': '?'},{'topLevelComment': 'Ummmmm... why am I here'},{'topLevelComment': '2019?'},{'topLevelComment': '2019'},{'topLevelComment': 'hola putos'},{'topLevelComment': 'Oi'},{'topLevelComment': 'Hello'}
1,JGwWNGJdvx8,2017-01-30 10:57:50,873673,19004938,4066735225,Youtube.com,Ed Sheeran - Shape of You [Official Video],Stream or Download Shape Of You: https://atlan...,{'topLevelComment': 'MTB SUPER KONG EVERYONE W...,{'topLevelComment': 'who will listen forever😘❤'},...,{'topLevelComment': 'الاستغراب الاشتراك اه فى ...,{'topLevelComment': 'Cade os br dessa poha'},{'topLevelComment': 'Ed sheeran Madefucker'},{'topLevelComment': '2019's best music'},{'topLevelComment': 'Mars 2019???'},{'topLevelComment': 'Ho My god'},{'topLevelComment': 'Love'},{'topLevelComment': '10.02.2019 :)))))'},{'topLevelComment': 'Wery good song top 1'},{'topLevelComment': 'Why are you here in 2019?'}
2,OPf0YbXqDm0,2014-11-19 14:00:18,477799,12290901,3455955229,Youtube.com,Mark Ronson - Uptown Funk ft. Bruno Mars (Offi...,Mark Ronson's official music video for 'Uptown...,{'topLevelComment': 'Me encanta mmmmm me fasci...,{'topLevelComment': 'Aburrido no le entendí y ...,...,{'topLevelComment': 'Mark Ronson contribution?'},{'topLevelComment': 'I love'},{'topLevelComment': 'WHO LISTENING IN NOVEMBER...,{'topLevelComment': '🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑 🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑🌑...,{'topLevelComment': 'Bruno = GOD Btw in m...,{'topLevelComment': '한국인 손'},{'topLevelComment': '😍👏👏👏👏👏👏👏👏'},{'topLevelComment': '1:32 i love this part'},{'topLevelComment': 'Jdjywkod'},{'topLevelComment': 'I LOVE♡♡♡♡♡ DIRT ON MY BO...


In [21]:
for i in range(50):
    youtube_df[i] = [x.values() for x in youtube_df[i]]

### Concatenate the DataFrames into a single data set:

In [23]:
frames = [youtube_df, reddit_df]

# Stack the two frames row-wise. The remaining arguments were all defaults,
# and join_axes was removed in pandas 1.0, so they are omitted here.
df = pd.concat(frames, axis=0)
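
One detail worth keeping in mind: concatenating without `ignore_index=True` keeps each frame's original row labels, so the labels repeat in the combined frame. The toy frames `left` and `right` below are hypothetical and only illustrate that behavior.

```python
import pandas as pd

left = pd.DataFrame({"post_id": ["a", "b"]})
right = pd.DataFrame({"post_id": ["c", "d"]})

# Default behavior: row labels from both frames are kept, so they repeat.
kept = pd.concat([left, right], axis=0)
# ignore_index=True assigns a fresh 0..n-1 index instead.
renumbered = pd.concat([left, right], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Calling `reset_index()` before saving has a similar effect, with the old labels kept as an extra `index` column.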

In [24]:
df.reset_index().to_json('./Data/FINAL_COMMENTS.json')

In [78]:
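# Work on a copy so the assignments below do not trigger SettingWithCopyWarning.
nlp_df = df.iloc[:, 7:].copy()

for column in nlp_df:
    # Stringify each cell and drop the 'dict_values' wrapper left by .values().
    nlp_df[column] = [str(x).replace('dict_values', '') for x in nlp_df[column]]
    nlp_df[column] = [str(x).replace('\n', ' ') for x in nlp_df[column]]
    nlp_df[column] = [x.strip('    ') for x in nlp_df[column]]
    nlp_df[column] = [x.lstrip("(['") for x in nlp_df[column]]
    nlp_df[column] = [x.rstrip("'])") for x in nlp_df[column]]
    # Keep word characters, whitespace, and a few symbols; pass regex=True explicitly.
    nlp_df[column] = nlp_df[column].str.replace(r'[^\w\s#@/:%.,_-]', '', regex=True, flags=re.UNICODE)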

nlp_df.head(3)

Unnamed: 0,description,0,1,2,3,4,5,6,7,8,...,40,41,42,43,44,45,46,47,48,49
0,Despacito disponible ya en todas las plataform...,D E S P A C I T O :,February 2019,Whos here to check to see comments asking Whos...,Almost 6 billion views,"Wow only 6 billion views,,lol nThats fucked an...",2020 guys anyone... anyone oh wait I from the ...,3:25,2019/2/10,2017n2018n2019n,...,2020,Anyone else watching this for the first time,February 2019,,Ummmmm... why am I here,2019,2019,hola putos,Oi,Hello
1,Stream or Download Shape Of You: https://atlan...,MTB SUPER KONG EVERYONE WATCH THIS https://you...,who will listen forever,#2019,I love you Music,I like this song,Perfect,Who here in 2019,日本人,外国でも〇月に見てる人ー的なことやってて世界共通なんだな,...,الاستغراب الاشتراك اه فى يا عم,Cade os br dessa poha,Ed sheeran Madefucker,2019s best music,Mars 2019,Ho My god,Love,10.02.2019 :,Wery good song top 1,Why are you here in 2019
2,Mark Ronsons official music video for Uptown F...,Me encanta mmmmm me fascina huuyyy,Aburrido no le entendí y hablo español no inglés,2019,bad girls are sexier than the good....x,,fill my--nnme: GOBLET PUT SOME FIRE IN IT,Doodudoodoodudoo,rudy,Bruno Mars el próximo Michael Jackson,...,Mark Ronson contribution,I love,WHO LISTENING IN NOVEMBER 2029,nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...,Bruno GOD nnnnnBtw in my opinion this ia his ...,한국인 손,,1:32 i love this part,Jdjywkod,I LOVE DIRT ON MY BOOTS


In [79]:
nlp_df['all_comments'] = nlp_df.sum(axis=1)
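
Summing string columns row-wise concatenates them directly, so the last word of one comment can fuse with the first word of the next. A hedged alternative is to join the cells with a space. The sketch below uses a hypothetical one-row frame `toy` with the same kind of columns (a description plus numbered comment columns).

```python
import pandas as pd

toy = pd.DataFrame({"description": ["Nice video"], 0: ["Hello"], 1: ["Oi"]})

# Joining with a space keeps "Hello" and "Oi" as separate tokens.
toy["all_comments"] = toy.apply(lambda row: " ".join(row.astype(str)), axis=1)
print(toy["all_comments"].iloc[0])
```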

### Remove terms equal to the target values:

In [80]:
# Lowercase the text once, then remove each target term (plain substring match).
target_terms = ['negative', 'positive', 'fear', 'anger', 'trust', 'sadness',
                'disgust', 'joy', 'anticip', 'surprise']
nlp_df['all_comments'] = nlp_df['all_comments'].str.lower()
for term in target_terms:
    nlp_df['all_comments'] = nlp_df['all_comments'].str.replace(term, '', regex=False)
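
The cell above removes the target terms wherever they occur, including inside longer words (for example, "enjoy" becomes "en" and "danger" becomes "d"). If that side effect is unwanted, one possible variant is to remove only whole words that start with a target term. The sketch below uses the standard-library `re` module, and the helper name `strip_target_words` is hypothetical; it is an alternative to consider, not the method used for the saved data.

```python
import re

# Same target terms as in the cell above.
target_terms = ["negative", "positive", "fear", "anger", "trust", "sadness",
                "disgust", "joy", "anticip", "surprise"]

# \b anchors the match at a word start; \w* consumes the rest of that word.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, target_terms)) + r")\w*")

def strip_target_words(text):
    """Remove whole words that begin with a target term, then tidy spaces."""
    cleaned = pattern.sub("", text.lower())
    return " ".join(cleaned.split())

print(strip_target_words("I enjoy this, what a Surprise! Pure joy and anticipation."))
```

With this pattern, "enjoy" and "danger" are left intact, while "joy", "surprise", and "anticipation" are removed.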

In [82]:
nlp_df.head(1)

Unnamed: 0,description,0,1,2,3,4,5,6,7,8,...,41,42,43,44,45,46,47,48,49,all_comments
0,Despacito disponible ya en todas las plataform...,D E S P A C I T O :,February 2019,Whos here to check to see comments asking Whos...,Almost 6 billion views,"Wow only 6 billion views,,lol nThats fucked an...",2020 guys anyone... anyone oh wait I from the ...,3:25,2019/2/10,2017n2018n2019n,...,Anyone else watching this for the first time,February 2019,,Ummmmm... why am I here,2019,2019,hola putos,Oi,Hello,despacito disponible ya en todas las plataform...


In [84]:
df.head(2)

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,description,0,1,...,40,41,42,43,44,45,46,47,48,49
0,kJQP7kiw5Fk,2017-01-13 05:00:02,3035242,31966704,5959601315,Youtube.com,Luis Fonsi - Despacito ft. Daddy Yankee,“Despacito” disponible ya en todas las platafo...,(D E S P A C I T O :)),(February ? 2019😂),...,(2020),(Anyone else watching this for the first time 😗),(February 2019😬🙅),(?),(Ummmmm... why am I here),(2019?),(2019),(hola putos),(Oi),(Hello)
1,JGwWNGJdvx8,2017-01-30 10:57:50,873673,19004938,4066735225,Youtube.com,Ed Sheeran - Shape of You [Official Video],Stream or Download Shape Of You: https://atlan...,(MTB SUPER KONG EVERYONE WATCH THIS https://yo...,(who will listen forever😘❤),...,(الاستغراب الاشتراك اه فى يا عم),(Cade os br dessa poha),(Ed sheeran Madefucker),(2019's best music),(Mars 2019???),(Ho My god),(Love),(10.02.2019 :)))))),(Wery good song top 1),(Why are you here in 2019?)


In [89]:
test_data = df.iloc[:, :7].copy()  # copy so the new column is not set on a slice of df
test_data['all_comments'] = nlp_df['all_comments']

test_data.head()

Unnamed: 0,post_id,published_on,num_comments,likes,score_views,source,title,all_comments
0,kJQP7kiw5Fk,2017-01-13 05:00:02,3035242,31966704,5959601315,Youtube.com,Luis Fonsi - Despacito ft. Daddy Yankee,despacito disponible ya en todas las plataform...
1,JGwWNGJdvx8,2017-01-30 10:57:50,873673,19004938,4066735225,Youtube.com,Ed Sheeran - Shape of You [Official Video],stream or download shape of you: https://atlan...
2,OPf0YbXqDm0,2014-11-19 14:00:18,477799,12290901,3455955229,Youtube.com,Mark Ronson - Uptown Funk ft. Bruno Mars (Offi...,mark ronsons official music video for uptown f...
3,9bZkp7q19f0,2012-07-15 07:46:32,5343223,15384768,3286729478,Youtube.com,PSY - GANGNAM STYLE(강남스타일) M/V,psy - i luv it m/v @ https://youtu.be/xvjnoagk...
4,fRh_vgS2dFE,2015-10-22 20:00:02,796522,11271600,3080820133,Youtube.com,Justin Bieber - Sorry (PURPOSE : The Movement),purpose available everywhere now itunes: http:...


In [92]:
test_data.reset_index().to_json('./Data/test_data.json')