## Data Wrangling - Fetching step
- The gossipcop and politifact fake and real news datasets are loaded in pandas dataframes.
- Using urllib and bs4, the original news article is downloaded if available and relevant text is kept from the html
- Using urllib, the twitter API is used to collect the `author_id` of fake and real news
- Using urllib, the twitter API is used to collect the `created_at` timestamp of real and fake news

Requirements:

`.env` file at the root of the repo containing BEARER_TOKEN = XXX where XXX should be replaced with a Twitter API V2 token

In [1]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None 
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.request import Request, urlopen
from urllib.parse import urlparse
from urllib.error import HTTPError, URLError
from json.decoder import JSONDecodeError
from http.client import RemoteDisconnected
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from requests import exceptions
import logging
from socket import timeout
import re
import time
from dotenv import load_dotenv
load_dotenv()

True

### 1. Set up the request headers by loading the bearer token from the `.env` file

In [2]:
bearer_token = os.environ.get("BEARER_TOKEN")
headers = {
    'Authorization': f'Bearer {bearer_token}'
}

### 2. Helper functions for processing html files using beautifulsoup

In [18]:
s = requests.Session()
blacklisted_status_codes = [300, 301, 302, 303, 304, 305, 306, 307, 308, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 429]
retries = Retry(connect=3, 
                backoff_factor=0.5,
                status_forcelist=[ 500, 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

def process_url(url):

    # req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        r = s.get(url, timeout=10, allow_redirects=False, headers={'User-Agent': 'Mozilla/5.0'})
        if r.status_code in blacklisted_status_codes:
            print(r.status_code)
            return ''
        else:
            return text_from_html(r.text)
    except timeout:
        print("connection timedout")
    except requests.exceptions.ConnectionError as e:
        print('no response')
    except ConnectionError:
        print("conn reset")
        print(url)
    except RemoteDisconnected as e:
        print(e)
    except URLError as e:
        print(e)
    except HTTPError as err:
        if err.code == 410 or 404:
            print("permanently deleted or removed, url should be removed")
        else:
            raise

### Steps 3-7 below are commented out and the previously generated output is loaded into the variables
 - `filtered_p_fake`: filtered politifact fake news with the scraped text from the original article 
 - `filtered_p_real`: filtered politifact real news with the scraped text from the original article 
 - `filtered_g_fake`: filtered gossipcop fake news with the scraped text from the original article 
 - `filtered_g_real`: filtered gossipcop real news with the scraped text from the original article 

In [4]:
politifact_fake = pd.read_csv('fakenewsnet/politifact_fake.csv')
# filtered_p_fake = pd.read_csv('processed-data/scraped_text/politifact_fake_with_scraped_text.csv')

politifact_real = pd.read_csv('fakenewsnet/politifact_real.csv')
# filtered_p_real = pd.read_csv('processed-data/scraped_text/politifact_real_with_scraped_text.csv')

# politifact_real_manually_checked = pd.read_excel('processed-data/scraped_text_filtered/politifact_real_manually_checked.xlsx', engine='openpyxl')

In [5]:
gossipcop_fake = pd.read_csv('fakenewsnet/gossipcop_fake.csv')
# filtered_g_fake = pd.read_csv('processed-data/scraped_text/gossipcop_fake_with_scraped_text.csv')

gossipcop_real = pd.read_csv('fakenewsnet/gossipcop_real.csv')
# filtered_g_real = pd.read_csv('processed-data/scraped_text/gossipcop_real_with_scraped_text.csv')

In [None]:
### 3. Fetching the original article for the real news in the politifact dataset

In [6]:
if 'text' not in politifact_real:
    politifact_real['text'] = ''
for index, url in enumerate(tqdm(politifact_real['news_url'])):
    if politifact_real['text'].iloc[index] == '':
        if type(url) != float:
            if urlparse(url).scheme:
                if politifact_real['text'].iloc[index] == '':
                    politifact_real['text'].iloc[index] = process_url(url)

  0%|          | 0/624 [00:00<?, ?it/s]

302
301
no response
301
301
302
301
301
301
301
303
302
301
301
301
302
301
301
301
301
301
301
301
301
302
301
301
301
301
301
301
406
301
302
301
301
301
no response
301
301
301
301
302
301
no response
301
301
301
301
301
301
302
301
302
301
301
301
302
302
301
301
301
301
301
301
302
301
301
302
301
301
301
301
301
302
302
301
301
301
301
301
301
301
302
no response
301
301
no response
301
302
301
301
301
301
301
301
301
301
301
301
301
301
301
301
301
302
404
301
301
301
301
301
301
302
301
302
301
301
302
302
no response
301
301
no response
301
301
301
301
301
301
301
301
301
302
301
301
301
301
302
302
301
301
301
no response
no response
301
301
301
301
301
301
301
301
301
301
301
302
301
301
301
301
301
301
302
302
301
301
301
301
301
302
301
301
301
301
no response
301
301
301
301
302
302
302
301
301
302
no response
301
302
301
302
no response
301
301
302
301
302
301
301
301
301
301
404
301
301
301
301
301
301
301
301
301
301
302
302
302
301
302
301
301
302
301
301
301
302
301


### 4. Fetching the original article for the fake news in the politifact dataset

In [7]:

if 'text' not in politifact_fake:
    politifact_fake['text'] = ''
for index, url in enumerate(tqdm(politifact_fake['news_url'])):
    if politifact_fake['text'].iloc[index] == '':
        if type(url) != float:
            if urlparse(url).scheme:
                if politifact_fake['text'].iloc[index] == '':
                    politifact_fake['text'].iloc[index] = process_url(url)

  0%|          | 0/432 [00:00<?, ?it/s]

301
301
404
303
301
404
301
no response
302
301
404
no response
302
301
404
301
406
301
no response
301
301
404
no response
301
no response
no response
no response
302
no response
404
404
no response
404
406
no response
no response
301
no response
no response
404
301
404
301
301
302
no response
301
302
301
301
404
301
301
no response
301
no response
404
301
no response
no response
302
302
302
302
301
no response
301
404
301
no response
404
302
no response
404
301
303
404
no response
301
302
no response
301
301
no response
404
301
301
301
406
301
no response
301
301
301
301
404
301
no response
no response
301
404
no response
301
no response
301
no response
404
400
302
301
no response
302
404
404
302
no response
301
301
no response
301
301
no response
301
404
301
303
301
301
no response
301
no response
302
404
no response
302
302
301
no response
301
301
301
301
302
404
no response
302
301
301
no response
301
406
301
301
no response
404
404
301
302
404
301
404
301
301
301
301
301
406
no r

### 5. Fetching the original article for the real news in the gossipcop dataset

In [None]:
if 'text' not in gossipcop_real:
    gossipcop_real['text'] = ''
for index, url in enumerate(tqdm(gossipcop_real['news_url'])):
    if gossipcop_real['text'].iloc[index] == '':
        if type(url) != float:
           if urlparse(url).scheme:
                if gossipcop_real['text'].iloc[index] == '':
                    gossipcop_real['text'].iloc[index] = process_url(url)
                elif "://" in url: 
                    if gossipcop_real['text'].iloc[index] == '':
                        gossipcop_real['text'].iloc[index] = process_url(url)
                else: 
                    url = "http://" + url
                    if gossipcop_real['text'].iloc[index] == '':
                        gossipcop_real['text'].iloc[index] = process_url(url)

### 6. Fetching the original article for the fake news in the gossipcop dataset

In [20]:
if 'text' not in gossipcop_fake:
    gossipcop_fake['text'] = ''
for index, url in enumerate(tqdm(gossipcop_fake['news_url'])):
    if gossipcop_fake['text'].iloc[index] == '' and index > 2650:
        if type(url) != float:
            if "://" in url: 
                gossipcop_fake['text'].iloc[index] = process_url(url)
            else:
                url = "http://" + url
                gossipcop_fake['text'].iloc[index] = process_url(url)

### 7. In some cases, especially for fake news, the article has been removed from the internet and the text is therefore not available. Filter out these instances and drop the rows as topic modelling will not be applicable here

In [14]:
filtered_g_fake = gossipcop_fake[gossipcop_fake['text'] != '' ]
filtered_g_fake = filtered_g_fake[filtered_g_fake['text'].notnull()]
filtered_g_fake = filtered_g_fake[filtered_g_fake['tweet_ids'].notnull()]
# filtered_g_fake.to_csv('processed-data/scraped_text/gossipcop_fake_with_scraped_text.csv')

In [None]:
filtered_g_real = gossipcop_real[gossipcop_real['text'] != '' ]
filtered_g_real = filtered_g_real[filtered_g_real['text'].notnull()]
filtered_g_real = filtered_g_real[filtered_g_real['tweet_ids'].notnull()]
# filtered_g_real.to_csv('processed-data/scraped_text/gossipcop_real_with_scraped_text.csv')

In [62]:
filtered_p_fake = politifact_fake[politifact_fake['text'] != '' ]
filtered_p_fake = filtered_p_fake[filtered_p_fake['text'].notnull()]
filtered_p_fake = filtered_p_fake[filtered_p_fake['tweet_ids'].notnull()]
# filtered_p_fake.to_csv('processed-data/scraped_text/politifact_fake_with_scraped_text.csv')

In [20]:
filtered_p_real = politifact_real[politifact_real['text'] != '' ]
filtered_p_real = filtered_p_real[filtered_p_real['text'].notnull()]
filtered_p_real = filtered_p_real[filtered_p_real['tweet_ids'].notnull()]
# filtered_p_real.to_csv('processed-data/scraped_text/politifact_real_with_scraped_text.csv')

In [95]:
# filtered_p_real = pd.read_csv('processed-data/scraped_text_filtered/politifact_real_manually_checked.csv')

### 8. Tweet processing

- `filtered_p_real` and `filtered_p_fake` contain a column of `tweet_ids`.
- The Twitter v2 [API](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/api-reference/get-tweets) for tweets allows queries of up to 100 tweets at a time
1. Loop through the dataframe
2. Add at most 100 tweet_ids to a list and construct the request object
3. EAFP, catch the exception for when we reach the 300 requests per 15 minutes limit and wait for 15 minutes with `time.sleep(900)`
4. Process the response and construct an object with the data of interest

In [23]:
gossipcop_fake

Unnamed: 0,id,news_url,title,tweet_ids,text
0,gossipcop-2493749932,www.dailymail.co.uk/tvshowbiz/article-5874213/...,Did Miley Cyrus and Liam Hemsworth secretly ge...,284329075902926848\t284332744559968256\t284335...,
1,gossipcop-4580247171,hollywoodlife.com/2018/05/05/paris-jackson-car...,Paris Jackson & Cara Delevingne Enjoy Night Ou...,992895508267130880\t992897935418503169\t992899...,
2,gossipcop-941805037,variety.com/2017/biz/news/tax-march-donald-tru...,Celebrities Join Tax March in Protest of Donal...,853359353532829696\t853359576543920128\t853359...,
3,gossipcop-2547891536,www.dailymail.co.uk/femail/article-3499192/Do-...,Cindy Crawford's daughter Kaia Gerber wears a ...,988821905196158981\t988824206556172288\t988825...,
4,gossipcop-5476631226,variety.com/2018/film/news/list-2018-oscar-nom...,Full List of 2018 Oscar Nominations – Variety,955792793632432131\t955795063925301249\t955798...,
...,...,...,...,...,...
5318,gossipcop-6702260693,www.huffingtonpost.com/2012/09/11/september-11...,September 11: Celebrities Remember 9/11 (TWEETS),245643768638894080,
5319,gossipcop-6051845337,www.dailymail.co.uk/news/article-4915674/NASCA...,NASCAR owners threaten to fire drivers who pro...,912048333413330944\t912048571482087424\t912049...,
5320,gossipcop-2435526162,www.telegraph.co.uk/men/the-filter/7-signs-dav...,The 7 signs that David Beckham is definitely h...,897794716447539200\t897804460830928896\t897842...,
5321,gossipcop-4576152851,www.vanityfair.com/style/2016/09/ryan-gosling-...,Ryan Gosling and Eva Mendes Did Not Get Marrie...,778678901572710400\t778681718714740736\t778683...,


In [31]:
politifact_real_tweet_data = []
politifact_real_tweet_dataframe = pd.DataFrame(columns=['news_id'])
rt_obj = {}
response_count = 0
try:
    politifact_real_tweet_dataframe
except NameError:
    print('Creating new dataframe for real tweets')
else:
    politifact_real_tweet_dataframe = pd.DataFrame(columns=['news_id'])
for index, tweet_ids in enumerate(tqdm(politifact_real['tweet_ids'])):
    if type(tweet_ids) != float:
        tweet_id_list = tweet_ids.split()
        # print(tweet_id_list)
        temp_data_for_author = []
        # print(index % 100)
        temp_list_of_100 = ''
        # print(len(tweet_id_list))
        for tweet_index, tweet_id in enumerate(tweet_id_list):
            if tweet_index % 100 == 0 and tweet_index > 0:
                if temp_list_of_100.endswith(','):
                    temp_list_of_100 = temp_list_of_100[:-1]
                try:
                        response = requests.get(f'https://api.twitter.com/2/tweets?ids={temp_list_of_100}&tweet.fields=created_at,author_id&expansions=referenced_tweets.id',  headers=headers)
                except (ConnectionError, exceptions.RequestException) as e:
                    print(e)
                    time.sleep(900)
                    index-=1;
                temp_list_of_100 = ''
                
                try:
                    data = response.json()
                except JSONDecodeError as e:
                    continue
                if 'data' in data:
                    for item in data['data']:
                        if 'created_at' in item:
                            item['news_id'] = politifact_real.iloc[index]['id']
                            if politifact_real.iloc[index]['id'] not in politifact_real['id'].to_list():
                                politifact_real_tweet_data.append(item)
                        if 'referenced_tweets' in item:
                            for rt in item['referenced_tweets']:
                                try:
                                    rt_obj[politifact_real.iloc[index]['id']].append(rt)
                                except KeyError:
                                    rt_obj[politifact_real.iloc[index]['id']] = [rt]
                                
            elif tweet_index % 100 == 99:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # same if it's the last of the batch we process
            elif tweet_index == len(politifact_real['tweet_ids'])-1:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # otherwise we got to separate with a comma
            else:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'

  0%|          | 0/624 [00:00<?, ?it/s]

In [36]:
politifact_real_tweet_data

[]

In [33]:
politifact_fake_tweet_data = []
politifact_fake_tweet_dataframe = pd.DataFrame(columns=['news_id'])

try:
    politifact_fake_tweet_dataframe
except NameError:
    print('Creating new dataframe for fake tweets')
else:
    politifact_fake_tweet_dataframe = pd.DataFrame(columns=['news_id'])
for index, tweet_ids in enumerate(tqdm(politifact_fake['tweet_ids'])):
    if type(tweet_ids) != float:
        tweet_id_list = tweet_ids.split()
        temp_data_for_author = []
        temp_list_of_100 = ''
        for tweet_index, tweet_id in enumerate(tweet_id_list):
            if tweet_index % 100 == 0 and tweet_index > 0:
                if temp_list_of_100.endswith(','):
                    temp_list_of_100 = temp_list_of_100[:-1]
                try:
                        response = requests.get(f'https://api.twitter.com/2/tweets?ids={temp_list_of_100}&tweet.fields=created_at,author_id',  headers=headers)
                except (ConnectionError, exceptions.RequestException) as e:
                    print(e)
                    time.sleep(900)
                    index-=1;
                temp_list_of_100 = ''
                try:
                    data = response.json()
                except JSONDecodeError as e:
                    continue
                if 'data' in data:
                    for item in data['data']:
                        if 'created_at' in item:
                            item['news_id'] = politifact_fake.iloc[index]['id']
                            if politifact_fake.iloc[index]['id'] not in politifact_fake_tweet_dataframe['news_id'].to_list():
                                politifact_fake_tweet_data.append(item)
               
            elif tweet_index % 100 == 99:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # same if it's the last of the batch we process
            elif tweet_index == len(politifact_fake['tweet_ids'])-1:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # otherwise we got to separate with a comma
            else:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'

  0%|          | 0/432 [00:00<?, ?it/s]

In [99]:
politifact_real_tweet_data

[]

In [144]:
politifact_fake_df = pd.DataFrame(politifact_fake_tweet_data)
politifact_fake_df = politifact_fake_df.drop_duplicates(subset=['news_id','author_id'])
politifact_fake_author_counts = politifact_fake_df['author_id'].value_counts()
politifact_fake_counts = pd.DataFrame()
politifact_fake_counts['fake'] = politifact_real_author_counts
politifact_fake_counts['author_id'] = politifact_counts.index
politifact_real_df = pd.DataFrame(politifact_real_tweet_data)
politifact_real_df = politifact_real_df.drop_duplicates(subset=['news_id','author_id'])
politifact_real_author_counts = politifact_real_df['author_id'].value_counts()
politifact_counts['real'] = politifact_fake_author_counts
politifact_counts_merged = politifact_counts.merge(politifact_fake_counts, on="author_id", how = 'inner')
# politifact_counts_merged = politifact_counts_merged.reset_index()
# politifact_counts_merged.to_csv('processed-data/tweet_counts_by_author_politifact.csv')

### 9. The authors with the number of fake and real tweets they shared

In [145]:
politifact_counts_merged

Unnamed: 0,real,author_id,fake
0,8.0,48470839,36
1,4.0,34383891,29
2,17.0,1179710990,27
3,2.0,369760961,24
4,1.0,15523710,21
...,...,...,...
148184,,2816232049,1
148185,,2376305339,1
148186,1.0,794150830446440448,1
148187,,84431570,1


In [34]:
# tweets_by_author = pd.DataFrame(columns=["author_id", "tweets", "real_count", "fake_count"])

# for index, item in enumerate(politifact_real_tweet_data):
#     if 'author_id' in item:
#         if tweets_by_author['author_id'].str.contains(item['author_id']).any():
#             try:
#                 current_index = tweets_by_author.index[tweets_by_author.author_id == item['author_id']].tolist()[0]
#                 tweets_by_author.loc[current_index, 'real_count'] += 1
#             except IndexError as e:
#                 print(e)
#         else: 
#             tweets_by_author = tweets_by_author.append({'author_id':item['author_id'],'real_count':1, 'fake_count':0}, ignore_index=True)

In [91]:
tweet_times_p_real = pd.DataFrame(columns=["id", "timestamps"])
for index, item in enumerate(politifact_real_tweet_data):
    for k in item:
        for item_index, j in enumerate(item[k]):
            if 'id' in j:
                if tweet_times_p_real['id'].str.contains(j['id']).any():
                    try:
                        current_index = tweet_times_p_real.index[tweet_times_p_real.id == j['id']].to_list()[0]
                        created_at = j['created_at']
                        tweet_times_p_real.loc[current_index, 'timestamps'] += f',{created_at}' 
                    except IndexError as e:
                        print(e)
                else:
                    tweet_times_p_real = tweet_times_p_real.append({'id':j['id'], 'timestamps':j['created_at']}, ignore_index=True)
           

In [74]:
tweet_times_p_fake = pd.DataFrame(columns=["id", "timestamps"])
for index, item in enumerate(politifact_fake_tweet_data):
    for k in item:
        for item_index, j in enumerate(item[k]):
            if 'id' in j:
                if tweet_times_p_fake['id'].str.contains(j['id']).any():
                    try:
                        current_index = tweet_times_p_fake.index[tweet_times_p_fake.id == j['id']].to_list()[0]
                        tweet_times_p_fake.loc[current_index, 'timestamps'] += j['created_at']
                    except IndexError as e:
                        print(e)
                else:
                    tweet_times_p_fake = tweet_times_p_fake.append({'id':j['id'], 'timestamps':j['created_at']}, ignore_index=True)
           

In [None]:
tweet_times_g_fake = pd.DataFrame(columns=["id", "timestamps"])
for index, item in enumerate(politifact_fake_tweet_data):
    for k in item:
        for item_index, j in enumerate(item[k]):
            if 'id' in j:
                if tweet_times_p_fake['id'].str.contains(j['id']).any():
                    try:
                        current_index = tweet_times_p_fake.index[tweet_times_p_fake.id == j['id']].to_list()[0]
                        tweet_times_p_fake.loc[current_index, 'timestamps'] += j['created_at']
                    except IndexError as e:
                        print(e)
                else:
                    tweet_times_p_fake = tweet_times_p_fake.append({'id':j['id'], 'timestamps':j['created_at']}, ignore_index=True)
           

### Politifact fake processing

In [23]:
filtered_p_fake = filtered_p_fake[filtered_p_fake['tweet_ids'].notnull()]
filtered_p_fake

Unnamed: 0.1,Unnamed: 0,id,news_url,title,tweet_ids,text
0,3,politifact14355,https://howafrica.com/oscar-pistorius-attempts...,Oscar Pistorius Attempts To Commit Suicide,886941526458347521\t887011300278194176\t887023...,Home Advertis...
1,4,politifact15371,http://washingtonsources.org/trump-votes-for-d...,Trump Votes For Death Penalty For Being Gay,915205698212040704\t915242076681506816\t915249...,Washington Sources ...
3,7,politifact14795,https://web.archive.org/web/20171027105356/htt...,Saudi Arabia to Behead 6 School Girls for Bein...,923126512458616832\t923135295070990341\t923189...,success fail ...
4,8,politifact14328,https://web.archive.org/web/20170702174006/htt...,Malia Obama Fired From Cushy Internship At Spa...,880455776107679747\t880457763876462598\t880461...,success fail ...
5,9,politifact13775,http://beforeitsnews.com/opinion-conservative/...,Target to Discontinue Sale of Holy Bible,732741826084397057\t732741823534227456\t732741...,You're using an Ad-Bl...
...,...,...,...,...,...,...
228,419,politifact14169,https://web.archive.org/web/20170528095037/htt...,Rubio: “Rape Victims Should Be In Custody If T...,663538460104392704\t663757208031780864\t663764...,success fail ...
229,421,politifact14992,http://www.trainnews.info/2018/01/rep-paul-gos...,Rep. Paul Gosar Asks Capitol Police to Arrest ...,958425300144218112\t958428242528145409\t958428...,Home Business Online Business Bitcoin Revi...
230,422,politifact14158,https://web.archive.org/web/20170602190500/htt...,WORSE THAN HITLER! Trey Gowdy’s Son Found In A...,865933040492703745,success fail ...
231,427,politifact14944,http://thehill.com/homenews/senate/369928-who-...,Who is affected by the government shutdown?,954602090462146560\t954602093171609600\t954650...,...


In [167]:
politifact_fake_tweet_times = pd.DataFrame(columns=['news_id', 'timestamps'])
politifact_fake_tweet_data = []

# for index, tweet_ids in enumerate(tqdm(filtered_p_fake['tweet_ids'])):
for index, tweet_ids in enumerate(tqdm(missing_filtered_p_fake['tweet_ids'])):
    if type(tweet_ids) != float:
        tweet_id_list = tweet_ids.split()
        temp_data_for_author = []
        temp_list_of_100 = ''
        for tweet_index, tweet_id in enumerate(tweet_id_list):
            if tweet_index % 100 == 0 and tweet_index > 0:
                time.sleep(0.8)
                if temp_list_of_100.endswith(','):
                    temp_list_of_100 = temp_list_of_100[:-1]
                    response = requests.get(f'https://api.twitter.com/2/tweets?ids={temp_list_of_100}&tweet.fields=created_at,author_id',  headers=headers) 
                    temp_list_of_100 = ''
                else:
                    response = requests.get(f'https://api.twitter.com/2/tweets?ids={temp_list_of_100}&tweet.fields=created_at,author_id',  headers=headers) 
                    temp_list_of_100 = ''
                try:
                    data = response.json()
                    # print(response)
                    # print(data)
                except JSONDecodeError as e:
                    continue
                if 'data' in data:
                    for item in data['data']:
                        if 'created_at' in item:
                            item['news_id'] = filtered_p_fake.iloc[index]['id']
                            if filtered_p_fake.iloc[index]['id'] not in politifact_fake_tweet_dataframe['news_id'].to_list():
                                print(filtered_p_fake.iloc[index]['id'])
                                politifact_fake_tweet_data.append(item)
                    # print(len(data['data']))
                    # print(tweet_index)
                    # print(index)
                    # for tweet in data['data']:
                    #     # print(filtered_p_fake.loc[index, 'id'])
                    #     if politifact_fake_tweet_times['news_id'].str.contains(filtered_p_fake.loc[index, 'id']).any():
                    #         try:
                    #             current_index = politifact_fake_tweet_times.index[politifact_fake_tweet_times.news_id == filtered_p_fake.loc[index, 'id']].tolist()[0]
                    #             # print(current_index)
                    #             created_at = tweet['created_at']
                    #             politifact_fake_tweet_times.loc[current_index, 'timestamps'] += f',{created_at}' 
                    #         except IndexError as e:
                    #             print(e)
                    #     else: 
                    #         politifact_fake_tweet_times = politifact_fake_tweet_times.append({'news_id':filtered_p_fake.loc[index, 'id'], 'timestamps':tweet['created_at']}, ignore_index=True)
                    #         # print(tweet.created_at)
                    #         # print(data['data'][index]['created_at'])
                    #         # print('piece_of_news ', filtered_p_fake.loc[index, 'id'])

            elif tweet_index % 100 == 99:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # same if it's the last of the batch we process
            elif tweet_index == len(filtered_p_fake['tweet_ids'])-1:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # otherwise we got to separate with a comma
            else:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            



  0%|          | 0/91 [00:00<?, ?it/s]

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200

In [178]:
politifact_real_tweet_times = pd.DataFrame(columns=['news_id', 'timestamps'])
politifact_real_tweet_data = []

# for index, tweet_ids in enumerate(tqdm(filtered_p_fake['tweet_ids'])):
for index, tweet_ids in enumerate(tqdm(filtered_p_real['tweet_ids'])):
    if type(tweet_ids) != float:
        tweet_id_list = tweet_ids.split()
        temp_data_for_author = []
        temp_list_of_100 = ''
        for tweet_index, tweet_id in enumerate(tweet_id_list):
            if tweet_index % 100 == 0 and tweet_index > 0:
                time.sleep(0.8)
                if temp_list_of_100.endswith(','):
                    temp_list_of_100 = temp_list_of_100[:-1]
                    response = requests.get(f'https://api.twitter.com/2/tweets?ids={temp_list_of_100}&tweet.fields=created_at,author_id',  headers=headers) 
                    temp_list_of_100 = ''
                else:
                    response = requests.get(f'https://api.twitter.com/2/tweets?ids={temp_list_of_100}&tweet.fields=created_at,author_id',  headers=headers) 
                    temp_list_of_100 = ''
                try:
                    data = response.json()
                    # print(response)
                    # print(data)
                except JSONDecodeError as e:
                    continue
                if 'data' in data:
                    for item in data['data']:
                        if 'created_at' in item:
                            item['news_id'] = filtered_p_real.iloc[index]['id']
                            # if filtered_p_real.iloc[index]['id'] not in politifact_real_tweet_dataframe['news_id'].to_list():
                                # print(filtered_p_fake.iloc[index]['id'])
                            politifact_real_tweet_data.append(item)
                    # print(len(data['data']))
                    # print(tweet_index)
                    # print(index)
                    # for tweet in data['data']:
                    #     # print(filtered_p_fake.loc[index, 'id'])
                    #     if politifact_fake_tweet_times['news_id'].str.contains(filtered_p_fake.loc[index, 'id']).any():
                    #         try:
                    #             current_index = politifact_fake_tweet_times.index[politifact_fake_tweet_times.news_id == filtered_p_fake.loc[index, 'id']].tolist()[0]
                    #             # print(current_index)
                    #             created_at = tweet['created_at']
                    #             politifact_fake_tweet_times.loc[current_index, 'timestamps'] += f',{created_at}' 
                    #         except IndexError as e:
                    #             print(e)
                    #     else: 
                    #         politifact_fake_tweet_times = politifact_fake_tweet_times.append({'news_id':filtered_p_fake.loc[index, 'id'], 'timestamps':tweet['created_at']}, ignore_index=True)
                    #         # print(tweet.created_at)
                    #         # print(data['data'][index]['created_at'])
                    #         # print('piece_of_news ', filtered_p_fake.loc[index, 'id'])

            elif tweet_index % 100 == 99:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # same if it's the last of the batch we process
            elif tweet_index == len(filtered_p_real['tweet_ids'])-1:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            # otherwise we got to separate with a comma
            else:
                temp_list_of_100 = temp_list_of_100 + f'{tweet_id},'
            



  0%|          | 0/366 [00:00<?, ?it/s]

In [181]:
politifact_real_tweet_dataframe = pd.DataFrame(politifact_real_tweet_data)
# politifact_real_tweet_dataframe = politifact_real_tweet_dataframe.append(politifact_real_tweet_data, ignore_index=True)
politifact_real_tweet_dataframe.to_csv('processed-data/politifact_real_tweet_times.csv')

In [184]:
len(politifact_real_tweet_dataframe['news_id'].unique())

160

In [187]:
# politifact_fake_tweet_data

In [160]:
politifact_fake_tweet_dataframe = politifact_fake_tweet_dataframe.append(politifact_fake_tweet_data, ignore_index=True)

In [161]:
politifact_fake_tweet_dataframe.to_csv('processed-data/politifact_fake_tweet_times.csv')

In [162]:
len(politifact_fake_tweet_dataframe['id'].to_list())

78500

In [163]:
collected_politifact = politifact_fake_tweet_dataframe['news_id'].unique().tolist()

In [164]:
missing_filtered_p_fake = filtered_p_fake[~filtered_p_fake['id'].isin(collected_politifact)]

In [188]:
politifact_fake_times = pd.DataFrame(columns=['id','timestamps'])
for index, item in enumerate(politifact_fake_tweet_data):
    for k in item:
        for item_index, j in enumerate(item[k]):
            if 'created_at' in j:
                if politifact_fake_times['id'].str.contains(j['id']).any():
                    try:
                        current_index = politifact_fake_times.index[politifact_fake_times.id == j['id']].tolist()[0]
                        politifact_fake_times.loc[current_index, 'timestamps'] += j['created_at']
                    except IndexError as e:
                        print(e)
                else: 
                    politifact_fake_times = politifact_fake_times.append({'id':j['id'], 'timestamps':j['created_at']}, ignore_index=True)

In [59]:
politifact_fake_times

Unnamed: 0,id,timestamps
0,978246351036264453,2018-03-26T12:24:39.000Z
1,978246559472275456,2018-03-26T12:25:29.000Z
2,978246571128250368,2018-03-26T12:25:32.000Z
3,978246651491094528,2018-03-26T12:25:51.000Z
4,978247005481947136,2018-03-26T12:27:15.000Z
...,...,...
20947,1012345194983907328,2018-06-28T14:41:17.000Z
20948,1012345247916003328,2018-06-28T14:41:30.000Z
20949,1012345635062853634,2018-06-28T14:43:02.000Z
20950,1012345706085011456,2018-06-28T14:43:19.000Z


In [42]:
tweet_times_g_fake = pd.DataFrame(columns=["id", "timestamps"])
for index, item in enumerate(politifact_fake_tweet_data):
    for k in item:
        for item_index, j in enumerate(item[k]):
            if 'id' in j:
                if tweet_times_p_fake['id'].str.contains(j['id']).any():
                    try:
                        current_index = tweet_times_p_fake.index[tweet_times_p_fake.id == j['id']].to_list()[0]
                        tweet_times_p_fake.loc[current_index, 'timestamps'] += j['created_at']
                    except IndexError as e:
                        print(e)
                else:
                    tweet_times_p_fake = tweet_times_p_fake.append({'id':j['id'], 'timestamps':j['created_at']}, ignore_index=True)
           

list index out of range
list index out of range
list index out of range
list index out of range
list index out of range
list index out of range
list index out of range
list index out of range
list index out of range
list index out of range
list index out of range


In [13]:
tweets_by_author.sort_values(by=['fake_count'], ascending=False)

Unnamed: 0,author_id,tweets,real_count,fake_count
32764,710856880680275969,,0,311
32928,144222146,,0,266
40659,18000449,,0,185
40553,23772575,,0,133
40578,790019230389248000,,0,58
...,...,...,...,...
15422,16377693,,1,0
15423,64465376,,1,0
15424,44999215,,1,0
15425,25969442,,1,0


In [16]:
test = tweets_by_author[tweets_by_author['fake_count']!=0]
test = test[test['real_count']!=0]
test

Unnamed: 0,author_id,tweets,real_count,fake_count
44,612797013,,8,1
160,56135553,,1,1
199,49800332,,2,1
212,262797667,,2,1
215,566828522,,1,1
...,...,...,...,...
30396,946822648729915393,,1,1
30476,125767959,,1,6
30502,105272579,,3,1
30635,27229155,,1,1


In [134]:
filtered_p_real = filtered_p_real[filtered_p_real['tweet_ids'].notnull()]

def getTweetAuthors(tweet_ids_raw):
    count = 0
    print(tweet_ids_raw)
    tweet_ids = re.split(r'\\t', tweet_ids_raw)
    for tweet_id in tweet_ids:
        if count < 1:
            response = requests.get('https://api.twitter.com/2/tweets?ids={tweet_id}&tweet.fields=created_at&expansions=author_id&user.fields=created_at',  headers)
            count+=1


In [43]:
tweets_by_author.to_csv("processed-data/tweets-by-author-politifact.csv")

In [20]:
len(politifact_fake)

432

In [49]:
filtered_p_fake = politifact_fake[politifact_fake['text'] != '' ]
filtered_p_fake = filtered_p_fake[filtered_p_fake['text'].notnull()]
filtered_p_fake = filtered_p_fake[filtered_p_fake['tweet_ids'].notnull()]
# filtered_p_fake.to_csv('processed-data/scraped_text/politifact_fake_with_scraped_text.csv')

In [50]:
filtered_p_fake

Unnamed: 0,id,news_url,title,tweet_ids,text
3,politifact14355,https://howafrica.com/oscar-pistorius-attempts...,Oscar Pistorius Attempts To Commit Suicide,886941526458347521\t887011300278194176\t887023...,Home Advertis...
7,politifact14795,https://web.archive.org/web/20171027105356/htt...,Saudi Arabia to Behead 6 School Girls for Bein...,923126512458616832\t923135295070990341\t923189...,RELIGION MIND Humanity's fi...
8,politifact14328,https://web.archive.org/web/20170702174006/htt...,Malia Obama Fired From Cushy Internship At Spa...,880455776107679747\t880457763876462598\t880461...,Home Politics President T...
19,politifact15178,https://www.politico.com/story/2017/05/09/trum...,Former presidents walk fine line in Trump’s Am...,861881844949688324\t861881844697923585\t861881...,Skip to Main Content POLITI...
22,politifact15267,https://www.independent.co.uk/arts-entertainme...,Donald Trump inauguration: Artists who won't p...,812333495020195841\t812333821945376773\t812334...,Subscribe Login Subscrib...
...,...,...,...,...,...
413,politifact14667,https://www.facebook.com/StopDjTrump/photos/a....,Wake Up America,1002128667819106304\t1002129159420858368\t1002...,
418,politifact15505,https://yournewswire.com/senate-report-clinton...,Senate Report Admits Clinton ‘Gifted’ Children...,1008393012563644416\t1008396231964585984\t1008...,YourNewswire.com Buy T...
419,politifact14169,https://web.archive.org/web/20170528095037/htt...,Rubio: “Rape Victims Should Be In Custody If T...,663538460104392704\t663757208031780864\t663764...,Facebook U...
422,politifact14158,https://web.archive.org/web/20170602190500/htt...,WORSE THAN HITLER! Trey Gowdy’s Son Found In A...,865933040492703745,Daily USA Update Menu News Pol...


In [59]:
filtered_p_fake.head(1).news_url.values[0]

'https://howafrica.com/oscar-pistorius-attempts-commit-suicide/'

In [80]:
filtered_p_real.to_excel('processed-data/scraped_text_filtered/politifact_real.xlsx',engine='xlsxwriter')

In [18]:
filtered_p_real

Unnamed: 0.1,Unnamed: 0,id,news_url,title,tweet_ids,text
0,0,politifact14984,http://www.nfib-sbet.org/,National Federation of Independent Business,967132259869487105\t967164368768196609\t967215...,At a Glance Indicato...
1,1,politifact12944,http://www.cq.com/doc/newsmakertranscripts-494...,comments in Fayetteville NC,942953459\t8980098198\t16253717352\t1668513250...,Logi...
2,2,politifact333,https://web.archive.org/web/20080204072132/htt...,"Romney makes pitch, hoping to close deal : Ele...",,success fail ...
3,3,politifact4358,https://web.archive.org/web/20110811143753/htt...,Democratic Leaders Say House Democrats Are Uni...,,success fail ...
4,4,politifact779,https://web.archive.org/web/20070820164107/htt...,"Budget of the United States Government, FY 2008",89804710374154240\t91270460595109888\t96039619...,success fail ...
...,...,...,...,...,...,...
549,618,politifact13619,http://www.cnn.com/2017/01/05/politics/border-...,"Trump asking Congress, not Mexico, to pay for ...",817357495047979008\t817357627566985217\t817357...,The Biden Presidency Fact...
550,620,politifact329,https://web.archive.org/web/20080131000131/htt...,Change We Can Believe In,634287923135909888\t946743411100536832\t946816...,success fail ...
551,621,politifact1576,http://www.youtube.com/watch?v=4O8CxZ1OD58,deputy director of national health statistics ...,,Over Pers Auteursrecht Contact Creators Advert...
552,622,politifact4720,http://www.youtube.com/watch?v=EhyMplwY6HY,Romneys ProLife Conversion Myth or Reality Jun...,188871706637647874,Over Pers Auteursrecht Contact Creators Advert...


In [105]:
politifact_real_df.sort_values('created_at')

Unnamed: 0,author_id,created_at,id,text,news_id,withheld
10267,8739282,2008-01-10T17:32:35.000Z,584359142,"Received Obama ""Anti-Pledge of Allegiance"" ema...",politifact548,
18941,5530062,2008-01-27T02:20:44.000Z,645557142,"""This election is about the past versus the fu...",politifact323,
18942,13420,2008-01-27T02:23:38.000Z,645562962,"""This election is about the past versus the fu...",politifact323,
18943,10235502,2008-01-27T03:40:24.000Z,645737242,"""This election is about the past versus the fu...",politifact323,
18944,9252212,2008-01-27T04:03:45.000Z,645787662,"Obama: Campaign is about ""the future versus th...",politifact323,
...,...,...,...,...,...,...
26687,2597987358,2018-12-13T19:41:37.000Z,1073301935569154053,@FLOTUS 😂😂😂😂 Funny saying the person who wante...,politifact1202,
26688,2533381116,2018-12-13T21:11:40.000Z,1073324601151651840,"—MICHELLE OBAMA SIGNS BOOK—\n\nSecond stop, @T...",politifact1202,
26689,2359065805,2018-12-14T00:21:23.000Z,1073372343072772097,@realDonaldTrump I'm still laughing today abou...,politifact1202,
23014,15018447,2018-12-14T20:46:54.000Z,1073680754910117889,Having to enroll or renew health insurance cov...,politifact13226,


In [107]:
len(politifact_real_df['author_id'].unique())

19664

In [4]:
politifact_real_manually_checked = pd.read_excel('processed-data/scraped_text_filtered/politifact_real_manually_checked.xlsx', engine='openpyxl')

In [7]:
p_tweet_times_real = pd.read_csv('processed-data/tweet_times/politifact_real_tweet_times.csv')

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
