# 01_1: Scraping Tweets without Label

Although many trending posts carry some form of labeling through hashtags (discussed in more detail in 01_2: Scraping Hashtag Labeled Data), many casual, everyday posts and replies have no labels or other context. These posts may or may not contain information useful for machine learning, but before they can be used in supervised learning they first need to be labeled.

The code below was used to scrape COVID vaccine related tweets with no labels at all. The scraping library used was snscrape ([documentation](https://github.com/JustAnotherArchivist/snscrape)).

These tweets will be processed with a labeling tool called [Snorkel](https://www.snorkel.org/) (covered in detail in the labeling notebook).

---

## Imports

In [9]:
# snscrape API was used for scraping twitter posts
import snscrape.modules.twitter as sntwitter
import pandas as pd

In [10]:
# Make request
def post_request(date):
    """
    Request up to 200 posts for the given date and return them as a
    list of rows, one row of features per tweet. This requests the
    200 most recent posts in that time frame.
    """

    # date_start is the date passed in as the parameter
    # date_end is the following day, which limits the search to one specific date
    date_start = pd.to_datetime(date)
    date_end = date_start + pd.DateOffset(days=1)

    start_str = str(date_start.date())
    end_str = str(date_end.date())

    tweets_ls = []

    # iterate over the requested posts and stop once 200 have been collected
    query = f'covid vaccine lang:en until:{end_str} since:{start_str}'
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= 200:
            break
        # store the following attributes for each tweet
        tweets_ls.append([tweet.date, tweet.id, tweet.user.username, tweet.content, tweet.likeCount, tweet.retweetCount])

    return tweets_ls
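
The one-day search window is the part of `post_request` that is easiest to get wrong, so it can help to check it without touching the network. The sketch below rebuilds only the query string. `build_query` and its `keywords` and `lang` parameters are hypothetical names introduced for illustration; they are not part of snscrape.

```python
import pandas as pd


def build_query(date, keywords="covid vaccine", lang="en"):
    """Build a one-day search string (hypothetical helper, not part of snscrape)."""
    date_start = pd.to_datetime(date)
    date_end = date_start + pd.DateOffset(days=1)
    # Assumption: since: includes the start date and until: excludes the end date,
    # so the pair below is meant to cover a single calendar day
    return f"{keywords} lang:{lang} until:{date_end.date()} since:{date_start.date()}"


query = build_query("2021-01-31")
print(query)  # covid vaccine lang:en until:2021-02-01 since:2021-01-31
```

Because `pd.DateOffset(days=1)` works on calendar dates, the end date rolls over correctly at month and year boundaries.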

In [11]:
# Consume posts
def consume_posts(tweets_list):
    """
    Compile the list of tweets returned by post_request into a dataframe
    """
    columns = ['datetime', 'tweet_id', 'username', 'text', 'likes', 'retweet']
    tweets_df = pd.DataFrame(tweets_list, columns=columns)
    
    return tweets_df
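
To illustrate the shape of the resulting table, here is a minimal sketch that builds the same dataframe from two made-up rows. The row values are invented for illustration and are not scraped data; only the column names and their order come from the code above.

```python
import pandas as pd

# Two made-up rows, in the same order that post_request appends attributes
sample_rows = [
    [pd.Timestamp("2021-01-20 23:59:00", tz="UTC"), 1, "user_a", "first example text", 3, 1],
    [pd.Timestamp("2021-01-20 23:58:00", tz="UTC"), 2, "user_b", "second example text", 0, 0],
]
columns = ['datetime', 'tweet_id', 'username', 'text', 'likes', 'retweet']
sample_df = pd.DataFrame(sample_rows, columns=columns)

print(sample_df.shape)  # (2, 6)
print(sample_df.dtypes)
```

Each inner list becomes one row, so the order of the attributes in the list has to match the order of the column names.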

In [15]:
# Save data
def save_data(df, overwrite):
    """
    Save the tweets dataframe to a csv file. If overwrite is True, the
    file is replaced and a header row is written; otherwise the rows
    are appended without a header.
    """
    if overwrite:
        mode = 'w'
        header = True
    else:
        mode = 'a'
        header = False

    filepath = '../data/twitter_scrape2.csv' # file path/name

    df.to_csv(filepath,
              mode=mode,
              header=header,
              index=False)
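
The write-then-append pattern can be tried in isolation. The sketch below uses two small made-up dataframes and a temporary file (both assumptions for the example, not the project's data or path) to show why the header is written only on the first call.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for two days of scraped tweets
day_one = pd.DataFrame({"tweet_id": [1, 2], "text": ["a", "b"]})
day_two = pd.DataFrame({"tweet_id": [3], "text": ["c"]})

# Temporary path used only for this illustration
demo_path = os.path.join(tempfile.mkdtemp(), "demo_scrape.csv")

# First write: create the file and include the header row
day_one.to_csv(demo_path, mode="w", header=True, index=False)
# Later writes: append rows only, so the header is not repeated mid-file
day_two.to_csv(demo_path, mode="a", header=False, index=False)

combined = pd.read_csv(demo_path)
print(combined["tweet_id"].tolist())  # [1, 2, 3]
```

Appending day by day means a long scrape that is interrupted still leaves the already collected days on disk.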

In [13]:
# Scraping: all functions put together
def scrape_twitter(start, end):
    """
    Collect up to 200 posts for each day from start (inclusive) to
    end (exclusive), skip tweets already seen on an earlier day, and
    save each day's posts to the csv file as they are collected.
    """

    unique_id = set()
    current_date = pd.to_datetime(start).date()
    end_date = pd.to_datetime(end).date()
    i = 0

    # use < rather than != so the loop also terminates if start is after end
    while current_date < end_date:
        tweets_ls = post_request(str(current_date))
        tweets_df = consume_posts(tweets_ls)

        # keep only tweets that have not been saved already
        tweets_df = tweets_df[~tweets_df['tweet_id'].isin(unique_id)]

        # the first day creates the file; later days append to it
        overwrite = (i == 0)
        save_data(tweets_df, overwrite)

        unique_id.update(tweets_df['tweet_id'])
        current_date = (current_date + pd.DateOffset(days=1)).date()
        i += 1

        # progress output: the next date to be scraped
        print(str(current_date))

    print('scraping completed')
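
The day-by-day loop and the duplicate filter can be exercised without network access. In the sketch below, `fake_request` is a hypothetical stand-in for `post_request` that returns made-up rows whose ids overlap between consecutive days. It also uses `pd.date_range` instead of a manual `while` loop, which is a swapped-in alternative rather than the approach used above.

```python
import pandas as pd


def fake_request(date_str):
    """Hypothetical stand-in for post_request; returns made-up rows, no network."""
    day = pd.to_datetime(date_str).day
    # ids overlap between consecutive days to exercise the duplicate filter
    return [[date_str, day], [date_str, day + 1]]


unique_id = set()
kept = []

# date_range includes both endpoints, so drop the last one to keep the end date exclusive
for current in pd.date_range("2021-01-20", "2021-01-23")[:-1]:
    day_df = pd.DataFrame(fake_request(str(current.date())), columns=["date", "tweet_id"])
    day_df = day_df[~day_df["tweet_id"].isin(unique_id)]  # drop ids seen on earlier days
    unique_id.update(day_df["tweet_id"])
    kept.append(day_df)

result = pd.concat(kept, ignore_index=True)
print(result["tweet_id"].tolist())  # [20, 21, 22, 23]
```

Each id appears once in the result even though the stand-in returns ids 21 and 22 on two different days.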

In [16]:
scrape_twitter("2021-01-20", "2022-07-31")

2021-01-21
2021-01-22
2021-01-23
2021-01-24
2021-01-25
2021-01-26
2021-01-27
2021-01-28
2021-01-29
2021-01-30
2021-01-31
2021-02-01
2021-02-02
2021-02-03
2021-02-04
2021-02-05
2021-02-06
2021-02-07
2021-02-08
2021-02-09
2021-02-10
2021-02-11
2021-02-12
2021-02-13
2021-02-14
2021-02-15
2021-02-16
2021-02-17
2021-02-18
2021-02-19
2021-02-20
2021-02-21
2021-02-22
2021-02-23
2021-02-24
2021-02-25
2021-02-26
2021-02-27
2021-02-28
2021-03-01
2021-03-02
2021-03-03
2021-03-04
2021-03-05
2021-03-06
2021-03-07
2021-03-08
2021-03-09
2021-03-10
2021-03-11
2021-03-12
2021-03-13
2021-03-14
2021-03-15
2021-03-16
2021-03-17
2021-03-18
2021-03-19
2021-03-20
2021-03-21
2021-03-22
2021-03-23
2021-03-24
2021-03-25
2021-03-26
2021-03-27
2021-03-28
2021-03-29
2021-03-30
2021-03-31
2021-04-01
2021-04-02
2021-04-03
2021-04-04
2021-04-05
2021-04-06
2021-04-07
2021-04-08
2021-04-09
2021-04-10
2021-04-11
2021-04-12
2021-04-13
2021-04-14
2021-04-15
2021-04-16
2021-04-17
2021-04-18
2021-04-19
2021-04-20
2021-04-21