<h3>News Outlet Tweets Cleaning</h3>
<p><strong>This notebook only needs to be run if one wishes to download the tweets using the 'tweet_extractor.py' script and the provided 'tweet-ids-004.txt' file. Doing so will generate a file called 'tweets4_df.pkl', which is necessary for this notebook.</strong></p>

<p>The original dataset contains the tweet IDs from the Twitter accounts of approximately 4,500 news outlets worldwide, i.e., accounts of media organizations intended to disseminate news. The full dataset, listed under "News Outlets", can be found here:</p>

[TweetSets Website](https://tweetsets.library.gwu.edu/datasets)

<p>The downloaded dataset (the file <strong>'tweet-ids-004.txt'</strong>) contains only original tweets and retweets dated from 1 April 2017 to 13 May 2020. It is meant to hold 200,000 tweets; however, some belong to accounts that have since been suspended, so the final dataset is slightly smaller (192,242 tweets in the run shown in this notebook). This specific dataset can be found under this link:</p>

[http://tweetsets.library.gwu.edu/dataset/b8066f0e](http://tweetsets.library.gwu.edu/dataset/b8066f0e)
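
<p>The 'tweet_extractor.py' script itself is not reproduced in this notebook. As a rough sketch of the first step such a script might take, the snippet below groups tweet IDs into batches before they are sent to a lookup endpoint. It assumes the IDs file holds one ID per line and that the endpoint accepts up to 100 IDs per request; both are assumptions, and the helper name <code>batched_ids</code> and the sample IDs are made up for illustration.</p>

```python
from itertools import islice


def batched_ids(lines, batch_size=100):
    """Yield lists of up to batch_size non-empty tweet IDs from an iterable of lines."""
    ids = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(islice(ids, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical IDs standing in for the contents of 'tweet-ids-004.txt'
sample_lines = [f"{n}\n" for n in range(1, 251)]
batches = list(batched_ids(sample_lines))
batch_sizes = [len(batch) for batch in batches]
```

<p>With a real file, the same generator could be fed an open file handle instead of <code>sample_lines</code>, since iterating over a file yields its lines.</p>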

In [1]:
# IMPORTANT: THIS LOADS THE PICKLE FILE CREATED AS A RESULT OF RUNNING 'tweet_extractor.py'
# WITH PROPER API AUTHENTICATION. THE FILE './tweets4_df.pkl' IS QUITE LARGE, SO
# IT IS NOT INCLUDED IN THE SUBMISSION.

# If one wishes to run the 'tweet_extractor.py' file, a Twitter developer account must be set up,
# and the appropriate API keys must be entered into the required fields in the file.
# The script should be run first; once it is done, this notebook should be run to clean the pickle file.

# After this process, you will end up with 'processed_tweets.pkl', which is what is used in the 'main' notebook.

# Imports
import numpy as np
import pandas as pd
import pickle

# Takes a while
tweets_df = pd.read_pickle("./tweets4_df.pkl")
tweets_df

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user,withheld_copyright,withheld_in_countries,withheld_scope
0,,,2018-05-03 07:55:36,"[0, 280]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 991949375927717888, 'id_str'...",7,False,9 out of 10 people worldwide are breathing pol...,,...,,13,False,,"<a href=""https://www.echobox.com"" rel=""nofollo...",False,"{'id': 25067168, 'id_str': '25067168', 'name':...",,,
1,,,2020-02-01 09:46:33,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...",,6,False,Japan Travel Week dilaksanakan untuk promosi i...,,...,,1,False,,"<a href=""https://dlvrit.com/"" rel=""nofollow"">d...",False,"{'id': 23343960, 'id_str': '23343960', 'name':...",,,
2,,,2017-09-29 01:50:23,"[0, 91]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 913581686838796289, 'id_str'...",6,False,Nike wants to cut out the middleman to sell yo...,,...,,3,False,,"<a href=""http://www.socialflow.com"" rel=""nofol...",False,"{'id': 35002876, 'id_str': '35002876', 'name':...",,,
3,,,2018-08-13 11:52:00,"[0, 125]","{'hashtags': [], 'symbols': [], 'user_mentions...",,5,False,Drug-dealing brother of a little boy killed in...,,...,,4,False,,"<a href=""https://www.sprinklr.com"" rel=""nofoll...",False,"{'id': 34655603, 'id_str': '34655603', 'name':...",,,
4,,,2020-02-05 00:34:41,"[0, 182]","{'hashtags': [{'text': 'KSLTV', 'indices': [15...",,0,False,Health officials hope to avoid stigma and erro...,,...,,0,False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",False,"{'id': 208179565, 'id_str': '208179565', 'name...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
192238,,,2019-09-17 14:36:53,"[0, 245]","{'hashtags': [], 'symbols': [], 'user_mentions...",,2,False,Cokie Roberts was one of NPR's most recognizab...,,...,,0,False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",False,"{'id': 77055180, 'id_str': '77055180', 'name':...",,,
192239,,,2017-07-04 02:59:01,"[0, 96]","{'hashtags': [], 'symbols': [], 'user_mentions...",,7,False,A Guinness World Record broken and over $1.2M ...,,...,,1,False,,"<a href=""https://sproutsocial.com"" rel=""nofoll...",False,"{'id': 25549003, 'id_str': '25549003', 'name':...",,,
192240,,,2018-08-21 10:46:00,"[0, 228]","{'hashtags': [{'text': 'BACKTOSCHOOL', 'indice...","{'media': [{'id': 1031854207870099456, 'id_str...",2,False,#BACKTOSCHOOL: Today kids in @bpsdinfo head ba...,,...,,1,False,,"<a href=""https://about.twitter.com/products/tw...",False,"{'id': 14085146, 'id_str': '14085146', 'name':...",,,
192241,,,2019-11-27 14:10:12,"[0, 124]","{'hashtags': [{'text': 'Dailytrust', 'indices'...",,17,False,Civil society groups warns against media trial...,,...,,0,False,,"<a href=""https://www.hootsuite.com"" rel=""nofol...",False,"{'id': 69271273, 'id_str': '69271273', 'name':...",,,


In [2]:
# The columns in the twitter dataframe
tweets_df.columns

# Keep only the columns in our specified list and drop all others
keep_cols = ['created_at', 'favorite_count', 'full_text', 'id_str', 'lang', 'retweet_count', 'retweeted_status']
tweets_df = tweets_df[[c for c in tweets_df.columns if c in keep_cols]]

# Making sure data types are correct
for i in ['favorite_count', 'retweet_count']:
    tweets_df[i] = tweets_df[i].astype('int')

tweets_df['full_text'] = tweets_df['full_text'].astype(str)

# Since our other data is in days and weeks
# we can also remove the timestamp in the datetime, and focus on just whole days.

tweets_df['created_at'] = pd.to_datetime(tweets_df['created_at']).dt.normalize()
tweets_df.dtypes

# Sort chronologically and rebuild a clean 0..n-1 index (the old index is discarded)
tweets_df = tweets_df.sort_values(by='created_at')
tweets_df = tweets_df.reset_index(drop=True)
tweets_df.head()

Unnamed: 0,created_at,favorite_count,full_text,id_str,lang,retweet_count,retweeted_status
0,2017-04-01,0,Colombia: At least 100 dead after flooding sen...,848224134228303872,en,0,
1,2017-04-01,14,Ivanka Trump and Jared Kushner are worth as mu...,848011224365314048,en,13,
2,2017-04-01,0,Watch Facebook Live: @a_hammerschlag is live f...,847976929328336896,en,0,
3,2017-04-01,0,The latest unemployment numbers ... https://t....,848127907222478848,en,0,
4,2017-04-01,0,Round Dance to raise money for missing teen ht...,848075785370034176,en,0,


<p>Tweets were originally limited to 140 characters (the limit was raised to 280 in November 2017), and by default the Twitter API returns tweet text cut to 140 characters. The boolean 'truncated' field is meant to indicate whether the text of a tweet has been truncated. Because we called the Twitter API with the parameter tweet_mode='extended', all tweets that would have been truncated were returned at their full length. Retweets are the exception: the text in a retweet object is still truncated, yet its 'truncated' value is set to False. This is perhaps a slight fault with the Twitter API. The 'truncated' column therefore serves no purpose here and can be dropped (as was done above).</p>
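
<p>To make the retweet case concrete, here is a minimal sketch using hand-written tweet objects shaped like the API's JSON. The texts, the handle and the helper <code>looks_cut_off</code> are made up for illustration, and the check is only a heuristic (a trailing ellipsis plus a shorter text than the original), not something the API provides.</p>

```python
ELLIPSIS = "\u2026"  # the single-character ellipsis that ends a cut-off text

# Hypothetical original tweet, 200 characters long
original = {
    "full_text": "word " * 40,
    "truncated": False,
}

# Hypothetical retweet of it: prefixed with "RT @handle: " and cut to 140 characters,
# while 'truncated' still reads False
retweet = {
    "full_text": ("RT @some_outlet: " + original["full_text"])[:139] + ELLIPSIS,
    "truncated": False,
    "retweeted_status": original,
}


def looks_cut_off(tweet):
    """Heuristic: a retweet whose text ends in an ellipsis and is shorter than the original."""
    source = tweet.get("retweeted_status")
    if not source:
        return False
    text = tweet["full_text"]
    return text.endswith(ELLIPSIS) and len(text) < len(source["full_text"])


retweet_is_cut = looks_cut_off(retweet)
original_is_cut = looks_cut_off(original)
```

<p>In this toy case the retweet is flagged as cut off even though its 'truncated' field is False, which is the mismatch described above.</p>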
    
<p>Tweets that have been retweeted have a 'retweeted_status' field, which holds the original tweet object that was retweeted. We are interested in recovering the full text of the tweet, and can do so by accessing this object. It also contains all the other parameters we might be interested in.</p>
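
<p>As a small, self-contained sketch of this idea, the snippet below recovers the full text and a combined popularity score on two made-up rows. The rows, the helper <code>original_field</code> and the numbers are purely illustrative; real rows come from 'tweets4_df.pkl', and 'retweeted_status' is assumed to be either a tweet dict or NaN.</p>

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "full_text": ["An original tweet", "RT @outlet: A retweeted story that was cut\u2026"],
    "favorite_count": [3, 0],
    "retweet_count": [1, 5],
    "retweeted_status": [
        np.nan,
        {"full_text": "A retweeted story that was cut short in the retweet",
         "favorite_count": 7},
    ],
})


def original_field(status, field, default):
    """Return status[field] when status is a tweet dict, otherwise the default."""
    return status[field] if isinstance(status, dict) else default


# Use the original tweet's text for retweets, and the row's own text otherwise
toy["full_text"] = [
    original_field(status, "full_text", text)
    for status, text in zip(toy["retweeted_status"], toy["full_text"])
]

# Favorites of the original tweet (0 for rows that are not retweets)
rt_favorites = pd.Series(
    [original_field(status, "favorite_count", 0) for status in toy["retweeted_status"]],
    index=toy.index,
)
toy["popularity"] = toy["favorite_count"] + toy["retweet_count"] + rt_favorites
```

<p>Checking <code>isinstance(status, dict)</code> avoids having to fill the missing values with a placeholder first, at the cost of one small helper function.</p>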

In [4]:
# Filling all NaN values in 'retweeted_status' with 0
tweets_df['retweeted_status'] = tweets_df['retweeted_status'].fillna(0)

# Creating a list of the favorite_count from the retweeted tweets
fc = []
for i in tweets_df['retweeted_status']:
    if i != 0:
        fc.append(i['favorite_count'])
    else:
        fc.append(0)

# Creating a 'popularity' column, which is the sum of retweets and favorites, then dropping favorite
# and retweet columns
tweets_df['rt_favorite_count'] = fc
tweets_df['popularity'] = tweets_df['favorite_count'] + tweets_df['retweet_count'] + tweets_df['rt_favorite_count']
tweets_df = tweets_df.drop(columns=['favorite_count', 'rt_favorite_count', 'retweet_count'])

# Replacing the 'full_text' entries (which could be truncated) with the actual 'full_text'
# from the 'retweeted_status' object. zip pairs the two columns by position, so this
# does not depend on the index labels.
txts = []
for status, text in zip(tweets_df['retweeted_status'], tweets_df['full_text']):
    if status != 0:
        txts.append(status['full_text'])
    else:
        txts.append(text)

# Replacing the 'full_text' column with the actual full tweets
tweets_df['full_text'] = txts

# Dropping the 'retweeted_status' column, as we don't need it anymore
tweets_df = tweets_df.drop(columns=['retweeted_status'])

In [5]:
tweets_df.to_pickle('processed_tweets.pkl')
tweets_df

Unnamed: 0,created_at,full_text,id_str,lang,popularity
0,2017-04-01,Colombia: At least 100 dead after flooding sen...,848224134228303872,en,0
1,2017-04-01,Ivanka Trump and Jared Kushner are worth as mu...,848011224365314048,en,27
2,2017-04-01,Watch Facebook Live: @a_hammerschlag is live f...,847976929328336896,en,0
3,2017-04-01,The latest unemployment numbers ... https://t....,848127907222478848,en,0
4,2017-04-01,Round Dance to raise money for missing teen ht...,848075785370034176,en,0
...,...,...,...,...,...
192238,2020-05-13,"#CDNTopStories: Because of this development, t...",1260385523215917056,en,2
192239,2020-05-13,Dundee church ‘swamped’ with demand for food d...,1260623481198112768,en,6
192240,2020-05-13,"""The special economic package announced by the...",1260571712036515840,en,69
192241,2020-05-13,South African Tourism and the Tourism Business...,1260493225791553536,en,2
