# End-to-End Machine Learning Project: Twitter Sentiment Analysis - Introduction and Data Collection (Part 1)

### Introduction

As someone who does not come from a math, statistics, or computer science background, I believe that building a portfolio project is a great way to showcase data science skills. In this post, I share an example of an end-to-end machine learning project on sentiment analysis, a rapidly growing area of natural language processing and machine learning. We will go over the entire process: collecting and preprocessing the data, building the model, creating a dashboard, and finally deploying the model and dashboard as an online application.

### Data Collection

Before collecting the data, we need to define the objective of our project. Our objective is to predict the public's sentiment about a brand (product, service, company, or person) based on tweet data. We will use the data collection methodology described in [this paper](https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf) (Twitter Sentiment Classification using Distant Supervision, Go, Bhayani, & Huang, 2009).

Distant supervision is a method that uses a set of rules to label a dataset automatically. Since it does not require manual labeling, it can save a lot of time and resources, especially when working with large datasets. In our case, we will use emoticons to label the sentiment of each tweet. Specifically, a tweet with a smiley face will be labeled as positive, and a tweet with a frowning face will be labeled as negative. We will use a library called `snscrape` to collect the tweets; it does not require the Twitter API, so we can retrieve a large number of tweets without worrying about the [rate limit](https://developer.twitter.com/en/docs/twitter-api/rate-limits). In the following section, we will walk through the code and explain the logic behind it.
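
To make the labeling rule concrete, here is a minimal sketch of it applied to a single string. The function name `label_by_emoticon` and the choice to return `None` for tweets with mixed or missing emoticons are assumptions made for illustration; the emoticon lists mirror the ones used for filtering later in this post.

```python
# Emoticon lists mirror the ones used for filtering later in this post.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")

def label_by_emoticon(text):
    """Return 'Positive', 'Negative', or None when the signal is missing or mixed."""
    has_pos = any(emoticon in text for emoticon in POSITIVE)
    has_neg = any(emoticon in text for emoticon in NEGATIVE)
    if has_pos and not has_neg:
        return "Positive"
    if has_neg and not has_pos:
        return "Negative"
    return None  # no emoticon at all, or both kinds present

print(label_by_emoticon("Loving the new phone :)"))           # Positive
print(label_by_emoticon("My flight got cancelled :("))        # Negative
print(label_by_emoticon("Great screen :) awful battery :("))  # None
```

No human looks at the text here: the emoticon alone decides the label, which is what makes the approach cheap but also noisy.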

**Disclaimer: As of 11 January 2023, Twitter modified its frontend API and the code below will no longer work. I will provide an alternative as soon as I find a solution.**

First, we import the necessary libraries. Please install them if you do not already have them.

In [None]:
!pip install snscrape
import snscrape.modules.twitter as sntwitter
import datetime as dt
import pandas as pd

Then we will write a function that uses `sntwitter.TwitterSearchScraper` to retrieve the tweets and save them in a dataframe. The function takes the following arguments:
* `search_term`: the term you want to search for on Twitter
* `start_date`: the start date of the search range, as a `datetime.date` or `datetime.datetime` object
* `end_date`: the end date of the search range, as a `datetime.date` or `datetime.datetime` object
* `num_tweets`: the maximum number of tweets you want to retrieve

A `for` loop iterates over the tweets returned by the `get_items` method of `sntwitter.TwitterSearchScraper` and stores the username, date, and content of each one. We use `lang:en` (English language) and `exclude:retweets` as the search filters. The tweet data is finally returned as a dataframe.

In [None]:
def scrape_tweet(search_term, start_date, end_date, num_tweets):
    # The since:/until: search operators take dates as YYYY-MM-DD strings.
    start_date = start_date.strftime("%Y-%m-%d")
    end_date = end_date.strftime("%Y-%m-%d")
    # Search English tweets in the date range, leaving out retweets.
    query = f"{search_term} since:{start_date} until:{end_date} lang:en exclude:retweets"
    tweet_data = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        # Stop once the requested number of tweets has been collected.
        if i >= num_tweets:
            break
        tweet_data.append([tweet.user.username, tweet.date, tweet.content])
    tweet_df = pd.DataFrame(tweet_data, columns=['username', 'date', 'tweet'])
    return tweet_df
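
Since the scraper itself no longer runs, it can help to look at the query string that `scrape_tweet` assembles in isolation. The sketch below uses a hypothetical helper, `build_query`, that is not part of the original code or of `snscrape`; it only reproduces the string formatting step.

```python
import datetime as dt

def build_query(search_term, start_date, end_date):
    """Assemble the search string the same way scrape_tweet does (hypothetical helper)."""
    return "{} since:{} until:{} lang:en exclude:retweets".format(
        search_term,
        start_date.strftime("%Y-%m-%d"),
        end_date.strftime("%Y-%m-%d"),
    )

print(build_query(":(", dt.date(2022, 1, 1), dt.date(2022, 1, 2)))
# :( since:2022-01-01 until:2022-01-02 lang:en exclude:retweets
```

Printing the query before scraping is a cheap way to check that the dates and filters are what you intended.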

For this project, we want to retrieve tweets from 2022-01-01 to 2022-12-31, so we write another function, `daily_scrape_2022`, which calls `scrape_tweet` once for each day in 2022. The `num_daily` argument sets the maximum number of tweets to retrieve per day.

In [None]:
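def daily_scrape_2022(search_term, num_daily):
    # Start with the one-day window 2022-01-01 to 2022-01-02.
    start_date = dt.datetime(2022, 1, 1)
    end_date = dt.datetime(2022, 1, 2)
    delta = dt.timedelta(days=1)
    daily_frames = []
    for _ in range(365):  # 2022 is not a leap year
        daily_frames.append(scrape_tweet(search_term, start_date, end_date, num_daily))
        # Slide the window forward by one day.
        start_date += delta
        end_date += delta
    # Concatenate once at the end instead of growing a dataframe inside the loop.
    return pd.concat(daily_frames)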
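
The sliding one-day window is easy to check without touching Twitter at all. The following sketch uses a hypothetical generator, `daily_windows`, that yields the same kind of (start, end) pairs; it is an illustration of the date logic only, and it derives the number of days from the calendar rather than hard-coding 365.

```python
import datetime as dt

def daily_windows(year):
    """Yield one (start, end) date pair per calendar day of `year` (hypothetical helper)."""
    day = dt.date(year, 1, 1)
    one_day = dt.timedelta(days=1)
    while day.year == year:
        yield day, day + one_day
        day += one_day

windows = list(daily_windows(2022))
print(len(windows))  # 365
print(windows[0])    # (datetime.date(2022, 1, 1), datetime.date(2022, 1, 2))
print(windows[-1])   # (datetime.date(2022, 12, 31), datetime.date(2023, 1, 1))
```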

Now we will use the `daily_scrape_2022` function to retrieve up to 1,000 tweets for each day in 2022. Tweets with negative sentiment will be searched with the term `:(`, while tweets with positive sentiment will be searched with the term `:)`.

In [None]:
ori_neg_df = daily_scrape_2022(":(", 1000)

In [None]:
ori_pos_df = daily_scrape_2022(":)", 1000)

The retrieved tweets do not always contain the specified search term, so we need to do some filtering. We create two functions, one to include tweets containing specific terms and the other to exclude tweets containing specific terms. 

In [None]:
def filter_include(df, terms):
    temp_df = pd.DataFrame()
    for term in terms:
        # Literal (non-regex) match; rows with a missing tweet count as "not containing" the term.
        add_df = df[df['tweet'].str.contains(term, regex=False, na=False)]
        temp_df = pd.concat([temp_df, add_df]).drop_duplicates(ignore_index=True)
    return temp_df

In [None]:
def filter_exclude(df, terms):
    temp_df = df.copy()
    for term in terms:
        # Drop rows that contain the term; na=True also drops rows with a missing tweet, as before.
        temp_df = temp_df[~temp_df['tweet'].str.contains(term, regex=False, na=True)]
    return temp_df
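
To see what the two filters do together, here is a small self-contained sketch on a made-up dataframe. It uses a compact boolean-mask variant of the functions rather than the exact code above (for example, it does not drop duplicate rows), and the toy tweets are invented for illustration.

```python
import pandas as pd

def filter_include(df, terms):
    # Keep rows whose tweet contains at least one of the terms (literal match).
    mask = pd.Series(False, index=df.index)
    for term in terms:
        mask |= df["tweet"].str.contains(term, regex=False, na=False)
    return df[mask]

def filter_exclude(df, terms):
    # Drop rows whose tweet contains any of the terms (literal match).
    mask = pd.Series(False, index=df.index)
    for term in terms:
        mask |= df["tweet"].str.contains(term, regex=False, na=False)
    return df[~mask]

toy = pd.DataFrame({"tweet": [
    "so tired :(",
    "mixed :( but :)",
    "no emoticon here",
    "sad :-(",
]})

neg = filter_include(toy, [":(", ":-("])
neg = filter_exclude(neg, [":)", ":D", ":-)"])
print(neg["tweet"].tolist())  # ['so tired :(', 'sad :-(']
```

The tweet with both kinds of emoticon is dropped, which keeps ambiguous examples out of the training labels.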

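For the negative tweets, we keep those that contain `:(` or `:-(` and drop any that also contain a positive emoticon (`:)`, `:D`, or `:-)`).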

In [None]:
neg_df = filter_include(ori_neg_df, [":(", ":-("])
neg_df = filter_exclude(neg_df, [":)", ":D", ":-)"])
neg_df.shape

(358624, 3)

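For the positive tweets, we keep those that contain `:)`, `:D`, or `:-)` and drop any that also contain a negative emoticon (`:(` or `:-(`).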

In [None]:
pos_df = filter_include(ori_pos_df, [":)", ":D", ":-)"])
pos_df = filter_exclude(pos_df, [":(", ":-("])
pos_df.shape

(343477, 3)

## Remove emoticons from tweets

We remove the emoticons that were used for labeling because we want our model to classify the sentiment of a tweet from its text rather than from the emoticons.

In [None]:
def remove_term(df, terms):
    temp_df = df.copy()
    for term in terms:
        temp_df['tweet'] = temp_df['tweet'].str.replace(term, " ", regex=False)
    return temp_df

In [None]:
neg_df = remove_term(neg_df, [":(", ":-("])

In [None]:
pos_df = remove_term(pos_df, [":)", ":D", ":-)"])
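
Replacing each emoticon with a space can leave double or trailing spaces behind. If that matters for later preprocessing, one option is to collapse the whitespace afterwards. The sketch below shows this on invented tweets; the whitespace cleanup step is an optional addition and not part of the original pipeline.

```python
import pandas as pd

toy = pd.DataFrame({"tweet": ["so tired :( need sleep", "sad :-("]})

# Same literal replacement as remove_term: each emoticon becomes a single space.
for term in [":(", ":-("]:
    toy["tweet"] = toy["tweet"].str.replace(term, " ", regex=False)

# Optional extra step: split on any whitespace and re-join with single spaces.
toy["tweet"] = toy["tweet"].str.split().str.join(" ")

print(toy["tweet"].tolist())  # ['so tired need sleep', 'sad']
```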

## Label tweets and merge them into a dataframe

In [None]:
neg_df["sentiment"] = "Negative"
pos_df["sentiment"] = "Positive"
df = pd.concat([neg_df, pos_df]).reset_index(drop=True)
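
After merging, it is worth checking how balanced the two classes are. The sketch below does this with `value_counts` on made-up dataframes (the names `neg_toy` and `pos_toy` are placeholders); on the real data, the shapes printed earlier suggest 358,624 negative and 343,477 positive tweets.

```python
import pandas as pd

# Tiny stand-ins for neg_df and pos_df.
neg_toy = pd.DataFrame({"tweet": ["so tired", "sad"]})
pos_toy = pd.DataFrame({"tweet": ["great day"]})

neg_toy["sentiment"] = "Negative"
pos_toy["sentiment"] = "Positive"
merged = pd.concat([neg_toy, pos_toy]).reset_index(drop=True)

# Count the rows per label to check the class balance.
counts = merged["sentiment"].value_counts()
print(counts.to_dict())  # {'Negative': 2, 'Positive': 1}
```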

In [None]:
df.to_csv("../dataset/labeled_tweets.csv", index=False)