# Hate Speech on Twitter
## Deliverable 02
## Amir ElTabakh
## 3/11/2022

This document primarily focuses on accessing the Twitter API, the privileges we have to work with, extended privileges for when we are ready to move forward with our analysis, how we can query Twitter searches, and identifying the features we have to work with.

### Agenda
- Identifying the privileges we have to work with, and the extended privileges available
- Accessing the Twitter API with Python
- Identifying and defining the features/variables available through the Twitter API
- Practicing filtering for specific hashtags, such as "#ChineseVirus"

## The Twitter API
Refer to the [Getting Started](https://developer.twitter.com/en/docs/platform-overview) page on the Twitter Developer Platform site to sign up for the Twitter API. The sign-up process is straightforward. There are three products we can access through the Twitter API. We now have access to the Elevated product.

#### Essential
Free and immediate access to the Twitter API. No application is required.

- 1 environment per project (irrelevant)
- 500K tweets per month/project
- Cost: free

#### Elevated
Higher levels of access to the Twitter API for free with an approved application.

- 3 environments per project (irrelevant)
- 2M tweets per month/project
- Cost: free

#### Academic Research
For academics who have a research project that requires, or would benefit from, studying Twitter’s conversational data. Access is free. An application is required.

- 1 environment per project (irrelevant)
- 10M tweets per month/project
- Cost: free
- For non-commercial use only
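
To put these caps in perspective, here is a rough sketch of how many full-size requests fit under each monthly cap. The caps come from the lists above; the default of 100 Tweets per request is the recent search maximum described later in this document, and the names `MONTHLY_TWEET_CAP` and `max_full_requests` are made up for this illustration.

```python
# Monthly Tweet caps per access tier, as listed above.
MONTHLY_TWEET_CAP = {
    "essential": 500_000,
    "elevated": 2_000_000,
    "academic_research": 10_000_000,
}


def max_full_requests(tier, tweets_per_request=100):
    """Number of full-size requests that fit under the tier's monthly cap."""
    return MONTHLY_TWEET_CAP[tier] // tweets_per_request


for tier in MONTHLY_TWEET_CAP:
    print(tier, max_full_requests(tier))
```

With Elevated access this works out to 20,000 full requests per month, which is plenty for exploratory work.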

### Search Tweets

We will primarily be utilizing the [Search Tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction) features. 

Searching for Tweets is an important feature used to surface Twitter conversations about a specific topic or event. While this functionality is present in Twitter, these endpoints provide greater flexibility and power when filtering for and ingesting Tweets so you can find relevant data for your research more easily; build out near-real-time ‘listening’ applications; or generally explore, analyze, and/or act upon Tweets related to a topic of interest. 

Twitter offers two endpoints that allow you to search for Tweets: recent search and full-archive search. Both of these REST endpoints share a common design and features, including their use of a single search query to filter for Tweets around a specific topic. These search queries are created with a set of operators that match on Tweet and user attributes, such as message keywords, hashtags, and URLs. Operators can be combined into queries with boolean logic and parentheses to help refine the query's matching behavior.

Once you’ve set up your query and start receiving Tweets, these endpoints support navigating the results both by time and Tweet ID ranges. This is designed to support two common use cases: 

- **Get historical**: Requests are for a period of interest, with no focus on the real-time nature of the data. A single request is made, and all matching data is delivered using pagination as needed. This is the default mode for Search Tweets.
- **Polling or listening**: Requests are made in an "any new Tweets since my last request?" mode. Requests are made on a continual basis, and typically there is a use case focused on near real-time 'listening' for Tweets of interest.
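
The polling mode can be sketched in a few lines. This is a minimal illustration, not Twitter's or Tweepy's implementation: `fetch` is a hypothetical callable standing in for a search request, and `FAKE_TIMELINE` is made-up data so the example runs without credentials. The real search endpoints accept a `since_id` parameter that plays the same role.

```python
def poll_new_tweets(fetch, since_id=None):
    """Return tweets newer than since_id, plus the updated since_id.

    `fetch` is a hypothetical callable: it takes since_id and returns a
    list of dicts that each carry an integer "id".
    """
    tweets = fetch(since_id)
    if tweets:
        # Remember the newest ID so the next poll only asks for newer Tweets.
        since_id = max(t["id"] for t in tweets)
    return tweets, since_id


# In-memory stand-in for the API (made-up data).
FAKE_TIMELINE = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]


def fake_fetch(since_id):
    return [t for t in FAKE_TIMELINE if since_id is None or t["id"] > since_id]


first, last_seen = poll_new_tweets(fake_fetch)
FAKE_TIMELINE.append({"id": 3, "text": "c"})  # a new Tweet arrives
second, last_seen = poll_new_tweets(fake_fetch, last_seen)
print(len(first), len(second), last_seen)
```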

Many operators and query limits are exclusive to Academic Research access, meaning that you must use keys and tokens from an App within a Project with Academic Research access to utilize the additional functionality. You can learn more about this in the endpoint sections below. Tweets returned by both the recent search and the full-archive search endpoints count toward the monthly Tweet cap.

Now let's go over the two endpoints provided.

#### Recent search
The recent search endpoint allows you to programmatically access filtered public Tweets posted over the last week, and is available to all developers who have a developer account and are using keys and tokens from an App within a Project.

You can authenticate your requests with OAuth 1.0a User Context, OAuth 2.0 App-Only, or OAuth 2.0 Authorization Code with PKCE. However, if you would like to receive private metrics, or a breakdown of organic and promoted metrics within your Tweet results, you will have to use OAuth 1.0a User Context or OAuth 2.0 Authorization Code with PKCE, and pass user Access Tokens that are associated with the user that published the given content. 

This endpoint can deliver up to 100 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. 
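
Paging through results generally follows the loop sketched below. This is an illustration under assumptions: `fetch_page` is a hypothetical callable that takes a pagination token (`None` for the first page) and returns the page's Tweets together with the token for the next page, and `PAGES` is made-up data. In practice a helper such as Tweepy's `Cursor` handles this loop for us.

```python
def collect_all(fetch_page, limit=None):
    """Follow pagination tokens until the results run out or limit is hit."""
    collected, token = [], None
    while True:
        tweets, token = fetch_page(token)
        collected.extend(tweets)
        # Stop when there is no next page or we already have enough Tweets.
        if token is None or (limit is not None and len(collected) >= limit):
            break
    return collected if limit is None else collected[:limit]


# Fake three-page result set keyed by pagination token (made-up data).
PAGES = {
    None: (["t1", "t2"], "p2"),
    "p2": (["t3", "t4"], "p3"),
    "p3": (["t5"], None),
}


def fake_fetch_page(token):
    return PAGES[token]


print(collect_all(fake_fetch_page))
print(collect_all(fake_fetch_page, limit=3))
```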

When using a Project with Essential or Elevated access, you can use the basic set of operators and can make queries up to 512 characters long. When using a Project with Academic Research access, you have access to additional operators and can make queries up to 1024 characters long. 
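
Since the query length limit depends on the access level, it can help to validate a query before sending it. The sketch below encodes the limits quoted above; the function name `check_query` and the tier keys are assumptions made for this example.

```python
# Maximum query length per access level, as described above.
MAX_QUERY_LENGTH = {
    "essential": 512,
    "elevated": 512,
    "academic_research": 1024,
}


def check_query(query, access_level="elevated"):
    """Return the query unchanged, or raise if it exceeds the tier's limit."""
    limit = MAX_QUERY_LENGTH[access_level]
    if len(query) > limit:
        raise ValueError(
            f"query is {len(query)} characters; {access_level} allows {limit}"
        )
    return query


print(check_query("#chinesevirus -filter:retweets"))
```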

#### Full-archive search
*Academic Research access only*

The v2 full-archive search endpoint is only available to Projects with Academic Research access. The endpoint allows you to programmatically access public Tweets from the complete archive dating back to the first Tweet in March 2006, based on your search query.

You can authenticate your requests to this endpoint using OAuth 2.0 App-Only, and the App Access Token must come from an App that is within a Project that has Academic Research access. Since you cannot make a request on behalf of other users (OAuth 1.0a User Context or OAuth 2.0 Authorization Code with PKCE) with this endpoint, you will not be able to pull private metrics. 

This endpoint can deliver up to 500 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. 

Since this endpoint is only available to those that have been approved for Academic Research access, you have access to the full set of search operators and can make queries up to 1024 characters long.

## Accessing the Twitter API with Python

Let's get our hands dirty with some code. I have applied for a Twitter Developer Account, and my API key and API key secret are stored in a separate file, which I will import below. But first, let's introduce [Tweepy](https://www.tweepy.org/). Tweepy is an easy-to-use Python library for accessing the Twitter API. Let's pip install it so it is accessible in our environment. Note that you only have to pip install tweepy (or any library) once per environment, but you have to import it in every session that uses it.

In [1]:
# pip install tweepy (use either command; the second one requires git on your PATH)
#!pip install tweepy
!python -m pip install git+https://github.com/tweepy/tweepy@master

Collecting git+https://github.com/tweepy/tweepy@master
  Cloning https://github.com/tweepy/tweepy (to revision master) to c:\users\amira\appdata\local\temp\pip-req-build-9uo9gm4y


  ERROR: Error [WinError 2] The system cannot find the file specified while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?


In [2]:
# importing dependencies
import tweepy as tw
import pandas as pd

# Twitter keys, tokens, and secrets are saved in a separate config file on my local device
from config import consumer_key, consumer_secret, access_token, access_secret, bearer_token
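
For reference, the config file could look something like the sketch below. This is a hypothetical layout, not the actual file: the variable names match the import above, but reading from environment variables and the environment variable names themselves are assumptions made for this example. Keeping credentials out of the notebook means they never end up in a shared document.

```python
# config.py: hypothetical layout. The environment variable names below are
# made up for this sketch; only the Python variable names come from the import.
import os

consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN", "")
access_secret = os.environ.get("TWITTER_ACCESS_SECRET", "")
bearer_token = os.environ.get("TWITTER_BEARER_TOKEN", "")

# List any credentials that were not set, so problems surface early.
credentials = {
    "consumer_key": consumer_key,
    "consumer_secret": consumer_secret,
    "access_token": access_token,
    "access_secret": access_secret,
    "bearer_token": bearer_token,
}
missing = [name for name, value in credentials.items() if not value]
print("Missing credentials:", missing)
```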

Now that we have imported our API key and API key secret, we can initialize the tweepy `OAuthHandler` with them and use it to create an instance of the tweepy `API` class, which we'll use to make requests to the Twitter API.

In [19]:
# authenticate
auth = tw.OAuthHandler(consumer_key, consumer_secret)
api = tw.API(auth, wait_on_rate_limit=True)

A search query is simply a string telling the Twitter API what kind of tweets you want to search for. Imagine using the search bar on Twitter itself without the API. For example, if you want to search for tweets with "#chinesevirus", you’d simply type #chinesevirus in the Twitter search bar and it’ll show you those tweets.

Under the hood, a search query sent through the Twitter API returns the same results you would get by searching for it directly on Twitter. The difference is that we can query for thousands or even millions of tweets, access other metadata about the tweets, and get to work analyzing the data and building products with it.

In [162]:
search_query = "#chinesevirus -filter:retweets"

Here we set up our search_query to fetch tweets with `#chinesevirus` but also filter out the retweets. You can customize your query based on your requirements. For more, refer to [this guide](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/guides/standard-operators).

The standard search operators cover a lot of ground. We can:

- query exact phrases
- add an 'or' clause, or a 'minus' clause to exclude tweets containing a string
- query hashtags
- query tweets sent from specific Twitter accounts, tweets authored in reply to an account, and tweets mentioning accounts
- filter for or against tweets marked as potentially sensitive
- filter for tweets containing images and/or videos
- filter for tweets containing URLs (including strings inside the URL)
- filter for tweets sent before or since a given date
- filter for tweets containing positive or negative attitudes
- filter for tweets containing questions

That is a lot, I know, but these may come in handy when figuring out how we want to target our tweets.
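
To make a few of these operators concrete, here is a small sketch that assembles a query string from them. The helper `build_query` and its parameter names are made up for this example; the operator syntax (`"exact phrase"`, `-word`, `from:`, `since:`, `until:`, `-filter:retweets`) follows the standard operators guide linked above.

```python
def build_query(terms=(), phrase=None, exclude=(), from_user=None,
                since=None, until=None, no_retweets=True):
    """Assemble a standard-search query string from a few common operators."""
    parts = list(terms)                               # keywords and hashtags
    if phrase:
        parts.append(f'"{phrase}"')                   # exact phrase
    parts.extend(f"-{word}" for word in exclude)      # 'minus' clauses
    if from_user:
        parts.append(f"from:{from_user}")             # sent from an account
    if since:
        parts.append(f"since:{since}")                # YYYY-MM-DD
    if until:
        parts.append(f"until:{until}")                # YYYY-MM-DD
    if no_retweets:
        parts.append("-filter:retweets")              # drop retweets
    return " ".join(parts)


print(build_query(terms=["#chinesevirus"], since="2022-03-05"))
print(build_query(phrase="chinese virus", exclude=["lab"], no_retweets=False))
```

Building the string in one place keeps the operators consistent across experiments and makes it easy to log exactly which query produced which dataset.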

Now that we can access the API, we will build the request for the endpoint we are going to use and the parameters we want to pass.

In [185]:
# get tweets from the API
# `since` is a query operator in the standard search API, not a keyword argument,
# so it goes inside the query string
tweets = tw.Cursor(api.search_tweets,
                   q=search_query + " since:2022-02-25",
                   lang="en").items(10)

# store the API responses in a list
tweets_copy = []
for tweet in tweets:
    print(tweet.text, '\n\n')
    tweets_copy.append(tweet)

print("Total Tweets fetched:", len(tweets_copy))


#Bengaluru After a long time (thanks to the #ChineseVirus) how about a #TweetUp 
(Venue can be decided based on the… https://t.co/sYRZceBf5c 


@PTI_News China will keep playing this bogie as #Putin has taken away entire attention from the #ChineseVirus. 


In view of the spread of new virus, 
China has imposed a lockdown in the north-eastern industrial center of Changch… https://t.co/zRaZbsZrGJ 


@amitsurg So a minister will decide if the prescription is right?
The biggest tragedy following #ChineseVirus pande… https://t.co/fqMPoaZ2F1 


@jaideepsethiya @OpIndia_com Whatever the ranking we're No:1

Terrorist exports, #ChineseVirus, gold and drug smugg… https://t.co/JRhm25NlNy 


"Chinese cyberespionage group Mustang Panda has been targeting European diplomats with an updated variant of the Pl… https://t.co/O5Gj8WHV5K 


@JackDetsch Well, China came first with #chinesevirus. Russia must think another thing for a surprise. 


@UNICEFIndia Stay safe from #ChineseVirus 


Latest evidence

In [186]:
# snapshot of what a single tweet looks like
tweets_copy[0]._json

{'created_at': 'Fri Mar 11 15:21:56 +0000 2022',
 'id': 1502303819740319746,
 'id_str': '1502303819740319746',
 'text': '#Bengaluru After a long time (thanks to the #ChineseVirus) how about a #TweetUp \n(Venue can be decided based on the… https://t.co/sYRZceBf5c',
 'truncated': True,
 'entities': {'hashtags': [{'text': 'Bengaluru', 'indices': [0, 10]},
   {'text': 'ChineseVirus', 'indices': [44, 57]},
   {'text': 'TweetUp', 'indices': [71, 79]}],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': 'https://t.co/sYRZceBf5c',
    'expanded_url': 'https://twitter.com/i/web/status/1502303819740319746',
    'display_url': 'twitter.com/i/web/status/1…',
    'indices': [117, 140]}]},
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_n

In [187]:
# Number of columns
len(tweets_copy[0]._json.keys())

24

In [188]:
# Let's see all the features available to us
tweets_copy[0]._json.keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])

In [189]:
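# Let's create a dataframe for our scraped tweets
df = pd.DataFrame(columns=tweets_copy[0]._json.keys())

# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the equivalent.
# Each normalized tweet keeps its own index of 0, as the output below shows.
frames = [pd.json_normalize(tweet._json) for tweet in tweets_copy]
df = pd.concat([df] + frames)
df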

Unnamed: 0,created_at,id,id_str,text,truncated,entities,metadata,source,in_reply_to_status_id,in_reply_to_status_id_str,...,user.profile_use_background_image,user.has_extended_profile,user.default_profile,user.default_profile_image,user.following,user.follow_request_sent,user.notifications,user.translator_type,user.withheld_in_countries,possibly_sensitive
0,Fri Mar 11 15:21:56 +0000 2022,1502303819740319746,1502303819740319746,#Bengaluru After a long time (thanks to the #C...,True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,True,True,False,False,,,,none,[],
0,Fri Mar 11 14:47:40 +0000 2022,1502295193806163968,1502295193806163968,@PTI_News China will keep playing this bogie a...,False,,,"<a href=""http://twitter.com/download/iphone"" r...",1.5022191778509005e+18,1.5022191778509005e+18,...,True,True,False,False,,,,none,[],
0,Fri Mar 11 14:40:16 +0000 2022,1502293335062433793,1502293335062433793,"In view of the spread of new virus, \nChina ha...",True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,True,False,True,False,,,,none,[],
0,Fri Mar 11 13:32:12 +0000 2022,1502276202635329538,1502276202635329538,@amitsurg So a minister will decide if the pre...,True,,,"<a href=""http://twitter.com/download/android"" ...",1.502258623673938e+18,1.502258623673938e+18,...,True,False,True,False,,,,none,[],
0,Fri Mar 11 12:25:04 +0000 2022,1502259310587351041,1502259310587351041,@jaideepsethiya @OpIndia_com Whatever the rank...,True,,,"<a href=""http://twitter.com/download/android"" ...",1.5022418670056735e+18,1.5022418670056735e+18,...,True,False,True,False,,,,none,[],
0,Thu Mar 10 20:47:27 +0000 2022,1502023348464893955,1502023348464893955,"""Chinese cyberespionage group Mustang Panda ha...",True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,True,True,True,False,,,,none,[],False
0,Thu Mar 10 20:11:02 +0000 2022,1502014184506093584,1502014184506093584,"@JackDetsch Well, China came first with #chine...",False,,,"<a href=""http://twitter.com/download/android"" ...",1.50201308754056e+18,1.50201308754056e+18,...,True,False,True,False,,,,none,[],
0,Thu Mar 10 19:51:56 +0000 2022,1502009376517005313,1502009376517005313,@UNICEFIndia Stay safe from #ChineseVirus,False,,,"<a href=""http://twitter.com/download/android"" ...",1.495299682381226e+18,1.495299682381226e+18,...,True,False,True,False,,,,none,[],
0,Thu Mar 10 12:40:21 +0000 2022,1501900768734973952,1501900768734973952,Latest evidence founded at US biological lab i...,True,,,"<a href=""http://twitter.com/download/iphone"" r...",,,...,True,False,True,False,,,,none,[],
0,Thu Mar 10 00:21:19 +0000 2022,1501714781891346435,1501714781891346435,America post-freedom:\n\nDay 414 of fake elect...,True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,False,False,False,False,,,,none,[],


In [190]:
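# Note: this compares against the string "None", so it is False even when geo is missing;
# df['geo'].isna() is the proper check for missing values
df['geo'] == "None"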

0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
Name: geo, dtype: bool

In [191]:
# all columns available to us
df.columns

Index(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities',
       'metadata', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'is_quote_status',
       'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang',
       'entities.hashtags', 'entities.symbols', 'entities.user_mentions',
       'entities.urls', 'metadata.iso_language_code', 'metadata.result_type',
       'user.id', 'user.id_str', 'user.name', 'user.screen_name',
       'user.location', 'user.description', 'user.url',
       'user.entities.url.urls', 'user.entities.description.urls',
       'user.protected', 'user.followers_count', 'user.friends_count',
       'user.listed_count', 'user.created_at', 'user.favourites_count',
       'user.utc_offset', 'user.time_zone', 'user.geo_enabled',
       'user.verified', 'user.statuses_count', 'use

In [192]:
# Number of columns available to us
len(df.columns)

75

A user has to explicitly share their location when tweeting for `tweet.coordinates` to hold the tweet's exact location, so it makes sense that many tweets have empty coordinates.

Here are the [GeoGuidelines](https://developer.twitter.com/en/developer-terms/geo-guidelines).

There are two ways to get a location from Twitter: a geo-tag from a specific tweet, or a user’s location as part of their profile. According to Twitter, only 1–2% of tweets are geo-tagged, so it isn’t a great metric to rely on. On the other hand, a significant number of users have a location in their profile, but they can enter whatever they want. Some are nice to people like us and will write ‘London, England’ or similar, while others are less useful, putting things like ‘My Parents Basement’.
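
The lookup order implied here (exact coordinates first, then a tagged place, then the free-text profile location) can be sketched as a small function over a tweet's JSON. The function name `tweet_location` and the sample dictionaries are made up for this illustration; the field names (`coordinates`, `place.full_name`, `user.location`) are the ones that appear in the columns listed above.

```python
def tweet_location(tweet_json):
    """Return (source, value) for the most precise location available."""
    coords = tweet_json.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]  # GeoJSON order: longitude first
        return "coordinates", (lat, lon)
    place = tweet_json.get("place")
    if place:
        return "place", place.get("full_name")
    profile = (tweet_json.get("user") or {}).get("location")
    if profile:
        return "profile", profile  # free text, may not be a real place
    return None, None


# Made-up sample tweets, one per case.
geo_tagged = {"coordinates": {"type": "Point", "coordinates": [-74.0, 40.7]}}
place_tagged = {"coordinates": None, "place": {"full_name": "Manhattan, NY"}}
profile_only = {"coordinates": None, "place": None,
                "user": {"location": "London, England"}}

for sample in (geo_tagged, place_tagged, profile_only, {}):
    print(tweet_location(sample))
```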


Let's fetch 50 tweets now, with the aim of keeping only those that are located in the United States and whose authors share their location.

In [209]:
# get tweets from the API
search_query = "#chinesevirus -filter:retweets"

# approximate center of the United States
# (defined for a future geocode filter; not yet passed to the search below)
latitude = "37.09024"
longitude = "-95.712891"
radius = "791mi"

tweets = tw.Cursor(api.search_tweets,
                  q = search_query,
                  lang = "en").items(50)

# store the API responses in a list
tweets_copy = []
for tweet in tweets:
    print(tweet.text, '\n\n')
    tweets_copy.append(tweet)
    
print("Total Tweets fetched:", len(tweets_copy))

#Bengaluru After a long time (thanks to the #ChineseVirus) how about a #TweetUp 
(Venue can be decided based on the… https://t.co/sYRZceBf5c 


@PTI_News China will keep playing this bogie as #Putin has taken away entire attention from the #ChineseVirus. 


In view of the spread of new virus, 
China has imposed a lockdown in the north-eastern industrial center of Changch… https://t.co/zRaZbsZrGJ 


@amitsurg So a minister will decide if the prescription is right?
The biggest tragedy following #ChineseVirus pande… https://t.co/fqMPoaZ2F1 


@jaideepsethiya @OpIndia_com Whatever the ranking we're No:1

Terrorist exports, #ChineseVirus, gold and drug smugg… https://t.co/JRhm25NlNy 


"Chinese cyberespionage group Mustang Panda has been targeting European diplomats with an updated variant of the Pl… https://t.co/O5Gj8WHV5K 


@JackDetsch Well, China came first with #chinesevirus. Russia must think another thing for a surprise. 


@UNICEFIndia Stay safe from #ChineseVirus 


Latest evidence

In [210]:
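# Let's create a dataframe for our scraped tweets
df = pd.DataFrame(columns=tweets_copy[0]._json.keys())

# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the equivalent.
# Each normalized tweet keeps its own index of 0, as the output below shows.
frames = [pd.json_normalize(tweet._json) for tweet in tweets_copy]
df = pd.concat([df] + frames)
df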

Unnamed: 0,created_at,id,id_str,text,truncated,entities,metadata,source,in_reply_to_status_id,in_reply_to_status_id_str,...,place.place_type,place.name,place.full_name,place.country_code,place.country,place.contained_within,place.bounding_box.type,place.bounding_box.coordinates,quoted_status.quoted_status_id,quoted_status.quoted_status_id_str
0,Fri Mar 11 15:21:56 +0000 2022,1502303819740319746,1502303819740319746,#Bengaluru After a long time (thanks to the #C...,True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,,,,,,,,,,
0,Fri Mar 11 14:47:40 +0000 2022,1502295193806163968,1502295193806163968,@PTI_News China will keep playing this bogie a...,False,,,"<a href=""http://twitter.com/download/iphone"" r...",1.5022191778509005e+18,1.5022191778509005e+18,...,,,,,,,,,,
0,Fri Mar 11 14:40:16 +0000 2022,1502293335062433793,1502293335062433793,"In view of the spread of new virus, \nChina ha...",True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,,,,,,,,,,
0,Fri Mar 11 13:32:12 +0000 2022,1502276202635329538,1502276202635329538,@amitsurg So a minister will decide if the pre...,True,,,"<a href=""http://twitter.com/download/android"" ...",1.502258623673938e+18,1.502258623673938e+18,...,,,,,,,,,,
0,Fri Mar 11 12:25:04 +0000 2022,1502259310587351041,1502259310587351041,@jaideepsethiya @OpIndia_com Whatever the rank...,True,,,"<a href=""http://twitter.com/download/android"" ...",1.5022418670056735e+18,1.5022418670056735e+18,...,,,,,,,,,,
0,Thu Mar 10 20:47:27 +0000 2022,1502023348464893955,1502023348464893955,"""Chinese cyberespionage group Mustang Panda ha...",True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,,,,,,,,,,
0,Thu Mar 10 20:11:02 +0000 2022,1502014184506093584,1502014184506093584,"@JackDetsch Well, China came first with #chine...",False,,,"<a href=""http://twitter.com/download/android"" ...",1.50201308754056e+18,1.50201308754056e+18,...,,,,,,,,,,
0,Thu Mar 10 19:51:56 +0000 2022,1502009376517005313,1502009376517005313,@UNICEFIndia Stay safe from #ChineseVirus,False,,,"<a href=""http://twitter.com/download/android"" ...",1.495299682381226e+18,1.495299682381226e+18,...,,,,,,,,,,
0,Thu Mar 10 12:40:21 +0000 2022,1501900768734973952,1501900768734973952,Latest evidence founded at US biological lab i...,True,,,"<a href=""http://twitter.com/download/iphone"" r...",,,...,,,,,,,,,,
0,Thu Mar 10 00:21:19 +0000 2022,1501714781891346435,1501714781891346435,America post-freedom:\n\nDay 414 of fake elect...,True,,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,,,,,,,,,,


In [211]:
df['coordinates']

0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
0    None
Name: coordinates, dtype: object

None of these 50 tweets carry coordinates, so it seems that few users share their exact location. I'll have to find another way to query for users who do share their location.
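
One option worth trying is the `geocode` parameter of the standard search, which takes a single `"latitude,longitude,radius"` string with the radius in `mi` or `km`. The sketch below only builds that string from the center point and radius defined earlier; the helper name `geocode_param` is made up, and the commented-out call is untested here and assumes the `api` and `search_query` objects from the cells above. My understanding of the Twitter documentation is that this filter can fall back to the profile location when a tweet has no geo-tag, so results would still need checking.

```python
def geocode_param(latitude, longitude, radius):
    """Format the geocode string expected by the standard search endpoint."""
    return f"{latitude},{longitude},{radius}"


geocode = geocode_param("37.09024", "-95.712891", "791mi")
print(geocode)

# Hypothetical usage with the objects created earlier (not executed here):
# tweets = tw.Cursor(api.search_tweets,
#                    q=search_query,
#                    lang="en",
#                    geocode=geocode).items(50)
```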

## Questions Moving Forward

1. Temporal Trends: what time frame are we looking at? The research publication gathered tweets one week before, and one week after Trump made his first tweet referring to Covid-19 as the "Chinese Virus".
2. To build on point 1, if we want to access tweets archived prior to one week ago, we will need to apply for the Academic Research product. The product is free, and will grant us greater flexibility with our work. Of course, I plan to use this privilege reasonably, and make only small requests until we've further developed our project.
3. What notable sentiments are we targeting in the tweets? Gathering all tweets with the hashtag "#chinesevirus" or "#covid19" is one thing; deciding which of them express the sentiments we care about is another. The article provided considers a tweet anti-Asian if one of four conditions was met:
- 1. Was opposed to or hostile toward the region, the people, or culture of Asia;
- 2. Demonstrated a general fear, mistrust, and hatred of Asian ethnic groups;
- 3. Supported restrictions on Asian immigration;
- 4. Used derogatory language or condoned punishments toward Asian countries or their people.

Continuing on point 3, what sentiments are we targeting? From there we can establish methods to extract those sentiments from the tweets and filter for/against them.

4. Sentiment analysis is a natural language processing technique that several Python libraries implement. We can grade the sentiment of a string of text on a scale of [-1, 1], with -1 indicating a negative sentiment, 1 indicating a positive sentiment, and 0 indicating neutrality. Let's keep this in mind moving forward.
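
To illustrate what a score on the [-1, 1] scale means, here is a deliberately simple toy scorer. It is not how an established library works, and it is not a proposal for our method: the word lists `POSITIVE` and `NEGATIVE` are made up for this example, and the score is just the balance of positive and negative words found. For the real analysis we would use a tested library rather than this sketch.

```python
import re

# Tiny made-up lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "safe", "happy"}
NEGATIVE = {"bad", "hate", "fear", "terrible", "tragedy"}


def toy_polarity(text):
    """Score text in [-1, 1]: -1 all negative, 1 all positive, 0 neutral."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    if total == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / total


for sample in ("I love this, great!", "What a terrible tragedy", "good but bad"):
    print(sample, "->", toy_polarity(sample))
```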