# Sentiment Analysis of Health Tweets -- Data Collection

## Introduction

This notebook collects tweets that match health-related keywords from the Twitter API using tweepy.

Specifically, I'll be walking through:

  1. **Getting the data** 
    - Collect the data from Twitter APIs
  2. **Cleaning the data** 
    - Make text all lower case
    - Remove punctuation
    - Remove numerical values
    - Remove common nonsensical text (such as `\n`)
    - Tokenize text
    - Remove stop words
    - Stemming / lemmatization
    - Parts of speech tagging
    - Create bi-grams or tri-grams
    - Deal with typos
  3. **Organizing the data**
    - Corpus
    - Document-Term Matrix
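
The cleaning and organizing steps listed above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the pipeline used later in the project: the function names (`clean_text`, `tokenize`, `document_term_matrix`) and the tiny stop-word list are assumptions made for the example, and stemming, part-of-speech tagging, n-grams, and typo handling are left out.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"the", "a", "an", "is", "my", "i", "and", "of", "to"}


def clean_text(text):
    """Lower-case the text, then strip newlines, punctuation, and digits."""
    text = text.lower()
    text = text.replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    # Collapse the runs of spaces left behind by the removals above
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text on whitespace and drop stop words."""
    return [tok for tok in clean_text(text).split() if tok not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows
```

Here the corpus is just the list of raw tweet texts, and each row of the returned matrix counts how often every vocabulary term occurs in one tweet.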

## Problem Statement
Health is one of the most important things in our lives, but different people have different concerns about it.

The goal of this project is to find out which health topics people are most concerned about, and to apply sentiment analysis to measure how negatively they feel about their health and how much time they spend on it.

## Getting the data
I collected the data from the Twitter API. The search keywords come from the List of Health Keywords dataset (https://figshare.com/articles/dataset/List_of_Health_Keywords/1084358/1), together with covid-19 from 2017 to 2020. I group them by keyword into a few categories, such as mental health concerns, flu, dental, and covid-19.
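
As a rough illustration of keyword-based grouping, the sketch below labels a tweet's text with every group whose keywords it mentions. The group names follow the categories mentioned above, but the keyword lists and the `label_tweet` helper are hypothetical; the actual grouping used in this project is not shown in this notebook.

```python
# Hypothetical keyword groups, for illustration only.
KEYWORD_GROUPS = {
    "mental health": ["anxiety", "depression", "stress"],
    "flu": ["flu", "influenza", "fever"],
    "dental": ["toothache", "dentist", "cavity"],
    "covid-19": ["covid", "coronavirus"],
}


def label_tweet(text, groups=KEYWORD_GROUPS):
    """Return the sorted names of all groups whose keywords appear in the text.

    Matching is a plain case-insensitive substring test, so it can produce
    false positives (for example "flu" inside "affluent").
    """
    lowered = text.lower()
    return sorted(
        name
        for name, words in groups.items()
        if any(word in lowered for word in words)
    )
```

A tweet can fall into several groups at once, which is why the helper returns a list rather than a single label.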

In [None]:
# Load Python packages
import os
import time
# Import only the datetime class; a plain `import datetime` would be shadowed by this line
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tweepy

%matplotlib inline


In [None]:
# Twitter credentials: read them from environment variables instead of hard-coding secrets.
# The variable names below are assumptions; use whichever names you export in your shell.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_key = os.environ['TWITTER_ACCESS_KEY']
access_secret = os.environ['TWITTER_ACCESS_SECRET']
# Create the authentication object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# Set the access token and access token secret
auth.set_access_token(access_key, access_secret)
# Creating the API object while passing in auth information
api = tweepy.API(auth,wait_on_rate_limit=True)

In [None]:
# read the keywords from txt to a list
txt_list = []

with open('keywords.txt', "r") as f:
    txt_list = f.read().split()

keywords = [word.strip(',') for word in txt_list]
print(len(keywords))

In [None]:
def scraptweets(search_words, date_since, date_until, numTweets, numRuns):

    ## Arguments:
    # search_words -> query string of keywords to search for
    # date_since -> date (YYYY-MM-DD) from which to start extracting tweets
    # date_until -> date (YYYY-MM-DD) before which the tweets must have been created
    # numTweets -> number of tweets to extract per run
    # numRuns -> number of runs to perform; API rate limits are counted in 15-minute windows, so each run is 15 mins apart.
    ##
    
    # Define a pandas dataframe to store the data:
    db_tweets = pd.DataFrame(columns = ['text', 'user_name', 'user_description', 'user_location','user_created_date','user_followers_count', 
                                        'user_friends_count', 'user_favourites','user_verified','date', 'hashtags', 
                                        'source','totaltweets'])

    # Define a for-loop to generate tweets at regular intervals
    for i in range(0, numRuns):
        
        # Collect tweets using the Cursor object
        # .Cursor() returns an object that you can iterate or loop over to access the data collected.
        # Each item in the iterator has various attributes that you can access to get information about each tweet
        # Rate-limit waiting is already configured on the `api` object, so it is not repeated here.
        # Note: api.search is the tweepy 3.x name; tweepy 4 renamed it to api.search_tweets.
        tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, until=date_until,
                               tweet_mode='extended').items(numTweets)

        # Store these tweets into a python list
        tweet_list = [tweet for tweet in tweets]

        # Extract the following fields from each tweet:

        for tweet in tweet_list:

            # Pull the values
            user_name = tweet.user.screen_name
            user_description = tweet.user.description
            user_location = tweet.user.location
            user_created_date = tweet.user.created_at
            user_followers_count = tweet.user.followers_count
            user_friends_count = tweet.user.friends_count
            # Note: despite the name, this is the tweet's favorite count, not the user's favourites_count
            user_favourites = tweet.favorite_count
            user_verified = tweet.user.verified
            date = tweet.created_at
            hashtags = tweet.entities['hashtags']
            totaltweets = tweet.user.statuses_count
            source = tweet.source
            
            
            try:
                text = tweet.retweeted_status.full_text
            except AttributeError:  # Not a Retweet
                text = tweet.full_text

            # Collect the 13 values into a list - ith_tweet:
            ith_tweet = [text, user_name, user_description, user_location, user_created_date, user_followers_count, 
                          user_friends_count, user_favourites, user_verified, date, hashtags, source, totaltweets]

            # Append to dataframe - db_tweets 
            db_tweets.loc[len(db_tweets)] = ith_tweet
            
        print('sleep for 15 min')    
        time.sleep(900) #15 minute sleep time

    return db_tweets
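
The `hashtags` column above stores the raw `entities['hashtags']` value, which in the Twitter API is a list of dictionaries that each carry a `text` field and its `indices`. If only the tag strings are needed later, a small helper along these lines could flatten them; `hashtag_texts` is a hypothetical name and is not part of the function above.

```python
def hashtag_texts(entities):
    """Return the lower-cased hashtag strings from a tweet's `entities` dict."""
    return [tag["text"].lower() for tag in entities.get("hashtags", [])]


# Example with a hand-written entities dict shaped like the API payload
example_entities = {
    "hashtags": [
        {"text": "Flu", "indices": [0, 4]},
        {"text": "COVID19", "indices": [5, 13]},
    ]
}
print(hashtag_texts(example_entities))
```

Lower-casing the tags is a design choice for this sketch so that `#Flu` and `#flu` count as the same hashtag.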
    

In [None]:
# There is a limit on how many keywords can be searched in a single query,
# so I need to separate the keywords into several strings of up to 30 keywords each
chunk_size = 30
search_words_all = [' OR '.join(keywords[start:start + chunk_size])
                    for start in range(0, len(keywords), chunk_size)]

In [None]:
# Set up searching time
dates = ["2020-11-03","2020-11-04", "2020-11-05", "2020-11-06", "2020-11-07", "2020-11-08", "2020-11-09", "2020-11-10"]

In [None]:
# Set up searching time (a later window; running this cell replaces the list above)
dates = ["2020-11-06", "2020-11-07", "2020-11-08", "2020-11-09", "2020-11-10","2020-11-11","2020-11-12", "2020-11-13"]
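
Rather than typing each date by hand, the consecutive (since, until) pairs used for the daily searches could be generated. The sketch below is one way to do that with the standard library; `daily_windows` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import datetime as dt


def daily_windows(start, days):
    """Return (since, until) ISO-date string pairs covering `days` consecutive days."""
    first = dt.date.fromisoformat(start)
    stamps = [(first + dt.timedelta(days=i)).isoformat() for i in range(days + 1)]
    # Pair each date with the following one: (d0, d1), (d1, d2), ...
    return list(zip(stamps[:-1], stamps[1:]))


windows = daily_windows("2020-11-06", 7)
print(windows[0], windows[-1])
```

Each pair maps directly onto the `date_since` and `date_until` arguments of the scraping function.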

In [None]:
numTweets = 2500
numRuns = 3
# Each consecutive pair of dates defines one search window
for run_idx in range(len(dates)-1):
    # Begin scraping the tweets for this window:
    date_since = dates[run_idx]
    date_until = dates[run_idx+1]
    print(date_since, date_until)
    # We will time how long it takes to scrape tweets for each window:
    start_run = time.time()
    print(datetime.now())
    # Scrape every keyword string and collect the resulting dataframes
    frames = [scraptweets(search_words, date_since, date_until, numTweets, numRuns)
              for search_words in search_words_all]
    # DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
    db_tweets = pd.concat(frames, ignore_index=True)
    # Once all runs have completed, save them to a single csv file: 
    # Obtain timestamp in a readable format:
    to_csv_timestamp = datetime.today().strftime('%Y%m%d_%H%M%S')
    
    # Define working path and filename, creating the data folder if it is missing
    data_dir = os.path.join(os.getcwd(), 'data')
    os.makedirs(data_dir, exist_ok=True)
    filename = os.path.join(data_dir, to_csv_timestamp + '_health_tweets.csv')

    # Store dataframe in csv with creation date timestamp
    db_tweets.to_csv(filename, index = False)
    
    # Run ended:
    end_run = time.time()
    print(datetime.now())
    duration_run = round(end_run-start_run, 2)
    print('time taken for run {} to complete is {} seconds'.format(run_idx, duration_run))
    print(date_since, 'to', date_until, 'has completed!')
print('Whole scraping has completed!')