# Check (That Tweet) Yo Self 
## Prioritizing Tweets to Fact Check
###### Part 1: Gathering Tweets

For our project, we want to find a way to prioritize disaster-related tweets for fact-checking in order to slow the spread of misinformation. To start, we'll gather tweets that match a few keywords.

Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import time
import warnings
import regex as re
import seaborn as sns


from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import nltk
from nltk.corpus import stopwords  # the stopword list
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')
np.random.seed(824)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [2]:
from tweetscrape.profile_tweets import TweetScrapperProfile
from tweetscrape.search_tweets import TweetScrapperSearch
from tweetscrape.users_scrape import TweetScrapperUser

These are the keywords we will pull tweets about.

In [3]:
keywords = ['isolation',
            'pandemic',
            'COVID',
            'quarantine',
            'vaccine',
            'coronavirus',
            'lysol',
            'ingest',
            'inject',
            'disinfectant',
            'bleach']

The function below loops through the keywords, gathers up to 10,000 tweets for each, and returns a single DataFrame with the tweets from every keyword, after dropping duplicate rows.

In [4]:
def get_tweets(words):
    to_concat = [] #to hold the dataframes of tweets
    for word in words: #for every keyword we want to search
        
        #gather the tweets and export to a csv
        tweet_scrapper = TweetScrapperSearch(
            search_all=word,
            search_till_date='2020-04-26',
            num_tweets=10_000,
            tweet_dump_path=f'./data/initial_grab/{word}_tweets.csv',
            tweet_dump_format='csv',
        )
        tweet_count, tweet_id, tweet_time, dump_path = tweet_scrapper.get_search_tweets()
        print('{0} tweets about {1} gathered till {2}'.format(tweet_count, word, tweet_time))
        
        #read the csv back in as a dataframe
        tweets = pd.read_csv(f'./data/initial_grab/{word}_tweets.csv')
        
        #add to list to concat later
        to_concat.append(tweets)
        
    #put all the tweets together and check the shape
    all_tweets = pd.concat(to_concat, axis = 0)
    print(f'Initial DataFrame was {all_tweets.shape}')
    
    #drop the duplicate rows and check the shape again (the index is not reset here)
    all_tweets = all_tweets.drop_duplicates()
    print(f'After dropping duplicate tweets, DataFrame is {all_tweets.shape}')
    
    #save as all_tweets csv file
    all_tweets.to_csv('./data/all_tweets.csv', index = False)
    return all_tweets
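
The combine-and-deduplicate step can be tried on toy data before committing to a long scrape. The sketch below is not the function above: it uses two small made-up frames in place of the per-keyword CSVs, adds a hypothetical `keyword` column (the scraper output has no such column), and deduplicates on `id` rather than on whole rows, which is one possible way to keep track of which search found each tweet.

```python
import pandas as pd

# Toy stand-ins for two per-keyword CSVs; the ids and text are made up.
bleach = pd.DataFrame({"id": [1, 2, 3],
                       "text": ["a bleach", "b bleach lysol", "c bleach"]})
lysol = pd.DataFrame({"id": [2, 4],
                      "text": ["b bleach lysol", "d lysol"]})

frames = []
for word, frame in {"bleach": bleach, "lysol": lysol}.items():
    # 'keyword' is a hypothetical extra column, not part of the scraper output.
    frames.append(frame.assign(keyword=word))

combined = pd.concat(frames, axis=0)

# Deduplicate on the tweet id so a tweet that matched two keywords is kept once
# (the first keyword that found it wins), then rebuild a clean 0..n-1 index.
deduped = combined.drop_duplicates(subset="id").reset_index(drop=True)

print(combined.shape)  # (5, 3)
print(deduped.shape)   # (4, 3)
```

With a `keyword` column present, a plain `drop_duplicates()` would no longer remove tweets found by two different searches, because the rows differ in that column. That is why the sketch passes `subset="id"`.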

Since this gathering process was very slow, we included a lot of printouts so we knew it was still running.

In [5]:
get_tweets(keywords)

9999 tweets about isolation gathered till 
9998 tweets about pandemic gathered till 
10000 tweets about COVID gathered till 2020-04-25
10000 tweets about quarantine gathered till 2020-04-25
9997 tweets about vaccine gathered till 
9989 tweets about coronavirus gathered till 
9999 tweets about lysol gathered till 
9997 tweets about ingest gathered till 
9996 tweets about inject gathered till 
9997 tweets about disinfectant gathered till 
9999 tweets about bleach gathered till 
Initial DataFrame was (109971, 14)
After dropping duplicate tweets, DataFrame is (103608, 14)


Unnamed: 0,id,type,time,author,author_id,re_tweeter,associated_tweet,text,links,hashtags,mentions,reply_count,favorite_count,retweet_count
0,1254198473819291649,tweet,1587859192000,PulpNews,100986964,,1254198473819291649,Isolation and boredom of staying at home can b...,['https://t.co/49b7W0d6V5'],[],[],0,0,0
1,1254198461563637763,tweet,1587859189000,aishacs,15809934,,1254197958595301386,We left the trail early once we saw that the s...,[],[],[],2,21,0
2,1254198450494885893,tweet,1587859187000,nonatofilho,50183821,,1254198450494885893,"During the period of isolation in Brazil, I wa...",[],[],['@realDonaldTrump'],0,1,0
3,1254198394022768640,tweet,1587859173000,abbiesbuswell,1086009979692335108,,1254198394022768640,@pritchardfan happy birthday ella !! hope you ...,[],[],['@pritchardfan'],1,1,0
4,1254198364209664000,tweet,1587859166000,Oof_utd,1243653781679734784,,1254198103504228352,Had a wank in isolation,[],[],[],1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,1253829170204798976,tweet,1587771143000,VictorNarraway,944181190285598724,,1253481322141679619,A frightening 89 thousand approved of Trump's ...,[],[],[],0,0,0
9995,1253829169080926215,tweet,1587771143000,Stephaniegiann8,763887982898212864,,1253751812194070529,Trump didn’t say to drink bleach,[],[],[],0,0,0
9996,1253829166861959169,tweet,1587771143000,William48013192,1198328078688112640,,1253801998127857664,Not drinking bleach?,[],[],[],0,2,0
9997,1253829166853693440,tweet,1587771143000,usernewm,311637491,,1253815999306047492,They won't know if the bleach curve is flatten...,['https://t.co/Mtbc7fBU8U'],[],[],0,0,0
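
As a quick sanity check, the per-keyword counts in the printout can be summed and compared with the reported shapes. The numbers below are copied from the output above; the variable names are only illustrative.

```python
# Per-keyword counts copied from the printout above.
counts = {
    "isolation": 9999, "pandemic": 9998, "COVID": 10000, "quarantine": 10000,
    "vaccine": 9997, "coronavirus": 9989, "lysol": 9999, "ingest": 9997,
    "inject": 9996, "disinfectant": 9997, "bleach": 9999,
}
initial_rows = 109971   # rows before drop_duplicates
deduped_rows = 103608   # rows after drop_duplicates

total = sum(counts.values())
removed = initial_rows - deduped_rows
share_removed = removed / initial_rows

print(total == initial_rows)   # True
print(removed)                 # 6363
print(f"{share_removed:.1%}")  # 5.8%
```

The counts add up to the initial row count, and dropping duplicates removed 6,363 rows, a little under 6% of what was gathered.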


Reading our CSV of tweets back in to make sure it looks correct:

In [6]:
tweet = pd.read_csv('./data/all_tweets.csv')

In [7]:
tweet.head()

Unnamed: 0,id,type,time,author,author_id,re_tweeter,associated_tweet,text,links,hashtags,mentions,reply_count,favorite_count,retweet_count
0,1254198473819291649,tweet,1587859192000,PulpNews,100986964,,1254198473819291649,Isolation and boredom of staying at home can b...,['https://t.co/49b7W0d6V5'],[],[],0,0,0
1,1254198461563637763,tweet,1587859189000,aishacs,15809934,,1254197958595301386,We left the trail early once we saw that the s...,[],[],[],2,21,0
2,1254198450494885893,tweet,1587859187000,nonatofilho,50183821,,1254198450494885893,"During the period of isolation in Brazil, I wa...",[],[],['@realDonaldTrump'],0,1,0
3,1254198394022768640,tweet,1587859173000,abbiesbuswell,1086009979692335108,,1254198394022768640,@pritchardfan happy birthday ella !! hope you ...,[],[],['@pritchardfan'],1,1,0
4,1254198364209664000,tweet,1587859166000,Oof_utd,1243653781679734784,,1254198103504228352,Had a wank in isolation,[],[],[],1,1,0
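
Two columns in this preview may need conversion before analysis. The sketch below assumes that `time` holds milliseconds since the Unix epoch and that list-like columns such as `links` come back from the CSV as plain strings. Both are assumptions based on the values shown above, not documented behavior of the scraper. It uses a two-row sample built from those values.

```python
import ast
import pandas as pd

# Two-row sample using values from the preview above.
sample = pd.DataFrame({
    "time": [1587859192000, 1587859189000],
    "links": ["['https://t.co/49b7W0d6V5']", "[]"],
})

# Assumption: 'time' is milliseconds since the Unix epoch, in UTC.
sample["time"] = pd.to_datetime(sample["time"], unit="ms", utc=True)

# Assumption: list columns were written as their string form, so
# ast.literal_eval can safely turn "['a']" back into a Python list.
sample["links"] = sample["links"].apply(ast.literal_eval)

print(sample["time"].iloc[0])   # 2020-04-25 23:59:52+00:00
print(sample["links"].iloc[0])  # ['https://t.co/49b7W0d6V5']
```

Read this way, the first timestamp falls just before midnight UTC on 2020-04-25, which is consistent with the `search_till_date` of `'2020-04-26'` used when gathering.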
