# Twitter scraping and preprocessing of the data for Waseem
In this notebook, we retrieve tweets from Twitter given only their tweet ids. We also do some of the preprocessing: dropping the id column and dropping the rows for tweets that the API could not retrieve (deleted tweets, suspended or deleted accounts).

This step lives in its own notebook because retrieving tweets from Twitter takes a long time.

The first step is simply to retrieve the data set in question, in this case Waseem.

In [12]:
import pandas as pd
import csv
url_waseem_data = "https://raw.githubusercontent.com/zeeraktalat/hatespeech/master/NAACL_SRW_2016.csv"
df_waseem = pd.read_csv(url_waseem_data)
df_waseem.columns = ["tweet", "class_label"]
df_waseem

Unnamed: 0,tweet,class_label
0,572341498827522049,racism
1,572340476503724032,racism
2,572334712804384768,racism
3,572332655397629952,racism
4,575949086055997440,racism
...,...,...
16901,576359685843861505,none
16902,576612926838046720,none
16903,576771329975664640,none
16904,560595245814267905,none


# Twitter scraping
In this section, we retrieve the tweets from Twitter with tweepy, given only their tweet ids. This was a challenging task because Twitter is very protective of its authentication process. We tried several methods:


1.   Using tweepy and get_status
2.   Using TwitterAPI
3.   Using tweepy's Client
4.   Using web scraping techniques (Beautiful Soup or lxml)

At first, options 1 and 2 did not work because we did not have Twitter Elevated access (we only had the basic level, called Essential), which is required to retrieve tweet content from an id. We applied for Elevated access and the application was rejected. We had to appeal in order to gain access, and that took multiple days. In the meantime, we searched for alternatives without success. Option 3 did not work because it required v4 of tweepy, which was still in development at the time of writing (more about this issue [here](https://stackoverflow.com/questions/67978717/tweepy-3-10-0-attributeerror-module-tweepy-has-no-attribute-client)). The last technique also did not work because of how protective Twitter is of its data.



In [2]:
!pip3 install tweepy



In [3]:
import tweepy
# The keys were replaced with empty strings to keep them private on GitHub
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
  
# authorization of consumer key and consumer secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
  
# set access to user's access key and access secret 
auth.set_access_token(access_token, access_token_secret)
  
# calling the api 
api = tweepy.API(auth)

In [13]:
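def get_tweet_text(tweet_id):
  """Return the text of the tweet with the given id, or "ERROR" if it cannot be retrieved."""
  try:
    status = api.get_status(tweet_id)
    return status.text
  except Exception:
    # Deleted tweets and suspended or deleted accounts end up here
    return "ERROR"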

Here we show how the function get_tweet_text works on one tweet:

In [14]:
print(get_tweet_text(572341498827522049))

ERROR


# Waseem data 
In this section, we extend the Waseem dataset with the tweet text, using the function defined above.

In [15]:
# One call to get_tweet_text (and so one API request) per tweet id
df_waseem["text"] = df_waseem.tweet.apply(get_tweet_text)
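
Since the per-id call above issues one request per tweet, one way to cut the run time is to ask for many ids per request. Twitter's v1.1 `statuses/lookup` endpoint accepts up to 100 ids at a time, and tweepy exposes it as `API.lookup_statuses` in v4 (`statuses_lookup` in 3.x). The sketch below is not what this notebook ran: `chunked`, `fetch_texts_in_batches`, and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a wrapper around the real batch call so that the example runs without credentials.

```python
from typing import Callable, Dict, Iterator, List


def chunked(ids: List[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each with at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_texts_in_batches(
    ids: List[int],
    lookup: Callable[[List[int]], Dict[int, str]],
    batch_size: int = 100,
) -> Dict[int, str]:
    """Collect id -> text for every tweet that `lookup` manages to return."""
    texts: Dict[int, str] = {}
    for batch in chunked(ids, batch_size):
        try:
            texts.update(lookup(batch))
        except Exception:
            # Same broad policy as get_tweet_text: a failed batch is skipped
            continue
    return texts


def fake_lookup(batch: List[int]) -> Dict[int, str]:
    """Hypothetical stand-in for the API: pretends odd ids were deleted."""
    return {tweet_id: f"text of {tweet_id}" for tweet_id in batch if tweet_id % 2 == 0}


demo_ids = list(range(1, 251))  # made-up ids, not real tweets
demo_texts = fetch_texts_in_batches(demo_ids, fake_lookup)
```

With real credentials, `lookup` would wrap the batch call and return a dict from id to text. Ids that are missing from the result correspond to the rows that this notebook marks as ERROR.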

In this step, we do some of the preprocessing. We drop the error rows, put the text column before the class_label column, and drop the id column (named tweet).

In [10]:
df_waseem = df_waseem[df_waseem.text != 'ERROR']  # drop error rows
columns_titles = ["text", "class_label"]  # put text first and drop the id column
df_waseem = df_waseem.reindex(columns=columns_titles)
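
To make the effect of this step concrete, here is a minimal sketch on a made-up three-row frame. The ids, labels, and texts are invented for illustration, and `toy` and `kept` are hypothetical names.

```python
import pandas as pd

# Invented rows that mimic the shape of df_waseem after the text was added
toy = pd.DataFrame({
    "tweet": [1, 2, 3],
    "class_label": ["racism", "none", "sexism"],
    "text": ["first tweet", "ERROR", "third tweet"],
})

# Keep only the rows whose text was retrieved
kept = toy[toy.text != "ERROR"]

# reindex keeps exactly the listed columns, in the listed order,
# so the id column "tweet" disappears and "text" moves to the front
kept = kept.reindex(columns=["text", "class_label"])
```

Here `kept` holds two rows and two columns, with `text` first.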

In the last step, we save the result as a CSV file.

In [11]:
# The separator is a space, so pass sep=' ' when reading this file back
df_waseem.to_csv("df_waseem_preprocessed.csv", sep=' ')
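
Because tweet text contains spaces, a space-separated file only stays parseable if such fields are quoted. pandas' `to_csv` defaults to minimal quoting (`csv.QUOTE_MINIMAL`), the same rule the standard `csv` module uses by default, so the sketch below uses `csv` directly to show what that rule does. The rows are invented for illustration.

```python
import csv
import io

# Invented rows: a header, one text with spaces, one without
rows = [
    ["text", "class_label"],
    ["a tweet with spaces", "none"],
    ["single", "racism"],
]

# Write with a space delimiter and the default minimal quoting
buffer = io.StringIO()
csv.writer(buffer, delimiter=" ").writerows(rows)
written = buffer.getvalue()

# Read back with the same delimiter; quoted fields are restored intact
buffer.seek(0)
read_back = list(csv.reader(buffer, delimiter=" "))
```

Only the field that contains the delimiter is wrapped in quotes, and reading with the same delimiter recovers the original rows.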