# Twitter and Python

This notebook shows how to access Twitter using Python. We will use
1. the relatively low-level `requests` library, which works with any API
2. the easier-to-use `twitter` and `tweepy` libraries, which only work with Twitter

We will need to install some packages. You should run
```bash
conda install -c conda-forge --file requirements.txt
```
or 
```bash
pip install -r requirements.txt
```

In [None]:
import pandas as pd
import requests

## The requests library

We will be doing a lot of work with Twitter, but I don't want you to leave with the impression that `requests` only works with Twitter. It works with any API, and we have already used it with our Flask apps. A request is a "low-level" network call that duplicates what the command-line tool `curl` does, for example.

Before diving into connecting to Twitter (where we will need to authenticate with keys and tokens), let's look at an example using the [Star Wars API](http://swapi.co), which doesn't require authentication.

Let's start with `curl` from the command line (`json_pp` just pretty-prints the JSON output):

In [None]:
!curl https://swapi.co/api/people/1/ | json_pp

Big picture idea:

We pass a URL (or endpoint) to `curl`, and it returns data. `requests` gives us a way of doing this in Python:

In [None]:
url = 'https://swapi.co/api/people/1/'
response = requests.get(url)
response.json()

We can pass anything that is supported by the API. For example, did you know we can get Wookiee translations?

In [None]:
url = 'https://swapi.co/api/people/1/?format=wookiee'
requests.get(url).json()

Last piece of SW trivia -- we can search!

In [None]:
# where is Vader?
url = 'https://swapi.co/api/people/?search=vader'
requests.get(url).json()

#### Basic takeaway
- `requests.get(url)`: make a (GET) request to a URL. Can get a webpage or a JSON object back. Returns a `response` object
- `response.json()`: access the JSON object returned (if there was one)
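
The query strings in the examples above (`?format=wookiee`, `?search=vader`) were typed by hand. As a rough sketch of what is going on, the standard library can build the same URL from a dictionary. `build_url` is a hypothetical helper name used only for illustration; it is not part of `requests`.

```python
from urllib.parse import urlencode

def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (illustrative helper)."""
    return base + '?' + urlencode(params)

search_url = build_url('https://swapi.co/api/people/', {'search': 'vader'})
print(search_url)
```

In practice you rarely need this: passing the same dictionary as `params=` to `requests.get` asks `requests` to do the encoding for you.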

To get this to work with Twitter, we will need to authenticate ourselves. This is the job of OAuth.

## Instructions to connect to Twitter using requests

We need to identify ourselves to Twitter using
- a public and private key, which don't expire
- a public and private token, which do expire

If you are re-running this notebook tomorrow, you will need to get a new token from the Twitter page (but your keys will remain the same).

Follow the instructions [here](setup_twitter_instructions.md) to get your keys and tokens, and place them in `twitter_credentials.py`.

**The cell below won't work until you follow the instructions!**

In [None]:
# This is needed to authenticate us to Twitter

try:
    from requests_oauthlib import OAuth1
except ModuleNotFoundError:
    import sys
    import os

    # I need this because requests_oauthlib gets installed in a weird place on my system
    sys.path.append('/usr/local/lib/python3.6/site-packages')
    from requests_oauthlib import OAuth1

Now we load Twitter credentials from the .env file (this is the trick we saw earlier -- our .env file is in .gitignore, so this prevents our auth keys going on GitHub).

You should already have `python-dotenv` installed from earlier, but if you don't, you can install it with
```
pip install -U python-dotenv
```


In [None]:
%load_ext dotenv
%dotenv
import os

oauth = OAuth1(os.getenv("TWITTER_CONSUMER_KEY"),
               os.getenv("TWITTER_CONSUMER_KEY_SECRET"),
               os.getenv("TWITTER_ACCESS_TOKEN"),
               os.getenv("TWITTER_ACCESS_TOKEN_SECRET"))
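
`os.getenv` returns `None` for a variable that is not set, so a missing entry in the `.env` file typically only shows up later as an authentication failure. A minimal sketch of an early check, assuming the four variable names used above; `missing_credentials` is a hypothetical helper, not part of any library:

```python
import os

# Names taken from the os.getenv calls in this notebook
REQUIRED_VARS = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_KEY_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
```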

In [None]:
# This "user_timeline" url tells the API to get the tweets of the user associated with the app.
response = requests.get("https://api.twitter.com/1.1/statuses/user_timeline.json",
                        auth=oauth)

# Note that we're using the "requests" library
tweets = response.json()

In [None]:
tweets

## Twitter search API (free version searches only the last week's tweets)

A detailed description of the Twitter search API can be found [here](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html). Some of the key parameters are:

| Parameter | Notes | Example |
|---|---|---|
| q | (required) query string to search for | `@metis` |
| geocode | (optional) Uses tweet geolocation, or user's profile location if tweet geolocation disabled. Should be of the format `latitude,longitude,radius[unit]` (comma-separated, no spaces) where unit is either "km" or "mi" | `41.8781,-87.6298,5mi` |
| lang | (optional) Only return tweets in language given. Languages are coded by the two character code used in [ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | `en` `zh` |
| count | (optional) Number of results to return. Defaults to 15, max value is 100 | `20` |
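
The `geocode` value is easy to get wrong because it has to be a single comma-separated string without spaces. A small sketch of a formatter; `make_geocode` is a hypothetical helper name introduced here for illustration:

```python
def make_geocode(latitude, longitude, radius, unit="mi"):
    """Format a geocode string as 'latitude,longitude,radius<unit>'."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"{latitude},{longitude},{radius}{unit}"

# Chicago, 100 mile radius
print(make_geocode(41.8781, -87.6298, 100))
```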

The API returns a JSON object with two keys:
- search_metadata: Information about how long the search took, what was searched for, etc
- statuses: the actual tweets that matched your query

Let's see it in action:

In [None]:
parameters = {"q": "wait wait don't tell me", "count":20, 
              "geocode": "41.8781,-87.6298,100mi",
             "lang":"en"}

response = requests.get("https://api.twitter.com/1.1/search/tweets.json",
                        params = parameters,
                        auth=oauth)

In [None]:
# Just look at the first tweet:
response.json()['statuses'][0]

In [None]:
# Ok, can we extract some of the info from this text?
tweets = response.json()['statuses']

def tweet_to_string(tweet):
    s = """
        Text: {text}
        Hashtags: {hashtags}
        Username: {screenname} -- ({description})
        Social status: {friends} friends, {followers} followers
        Location: {location}
    """.format(text=tweet['text'], hashtags=tweet['entities']['hashtags'],
               screenname=tweet['user']['screen_name'], 
               description=tweet['user']['description'],
               friends=tweet['user']['friends_count'],
               followers=tweet['user']['followers_count'],
               location=tweet['user']['location'])
    return s

print(tweet_to_string(tweets[0]))

In [None]:
for tweet in tweets[:5]:
    print(tweet_to_string(tweet))

Did we pull all 20 tweets?

In [None]:
print("Number of tweets = ", len(tweets))

We can pull the next set of tweets if we want (i.e. the "next" 20)

In [None]:
next_page_url = "https://api.twitter.com/1.1/search/tweets.json" + response.json()['search_metadata']['next_results']

response = requests.get(next_page_url, auth=oauth)

more_tweets = response.json()['statuses']

for tweet in more_tweets[:4]:
    print(tweet['text'])
    print()
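
When there are no further results, the `next_results` key may be absent from `search_metadata`, in which case the lookup above would raise a `KeyError`. A hedged sketch of a safer lookup follows; `build_next_page_url` is a hypothetical helper, and the sample dictionaries are made up to mimic the shape of a search response:

```python
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_next_page_url(search_response, base_url=SEARCH_URL):
    """Return the URL of the next page of results, or None if there is none."""
    next_results = search_response.get("search_metadata", {}).get("next_results")
    if next_results is None:
        return None
    return base_url + next_results

# Made-up responses, only to show the two cases
with_next = {"statuses": [], "search_metadata": {"next_results": "?max_id=123&q=example&count=20"}}
without_next = {"statuses": [], "search_metadata": {}}

print(build_next_page_url(with_next))
print(build_next_page_url(without_next))
```

Looping until this returns `None` would walk through all available pages, subject to Twitter's rate limits.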

## Using the Twitter package

In [None]:
import twitter
import pandas as pd
import time
consumer_key = os.getenv("TWITTER_CONSUMER_KEY")
consumer_secret = os.getenv("TWITTER_CONSUMER_KEY_SECRET")
access_token_key = os.getenv("TWITTER_ACCESS_TOKEN")
access_token_secret = os.getenv("TWITTER_ACCESS_TOKEN_SECRET")

In [None]:
api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret)

In [None]:
# Careful: this posts a real tweet from the authenticated account
api.PostUpdate("Hello from SF!")

In [None]:
tweets = api.GetSearch(raw_query="l=&q=%23machinelearning%2C%20OR%20%23ml%2C%20OR%20%23AI&", count=100)

In [None]:
tweets = api.GetSearch(raw_query='l=&q="Deep%20learning"%20"machine%20learning"%20"AI"%20%23machinelearning%20OR%20%23deeplearning%20OR%20%23DL%20OR%20%23AI%20OR%20%23ML', count=100)

In [None]:
len(tweets)

In [None]:
for tweet in tweets:
    print(tweet)
    print()

In [None]:
for tweet in tweets:
    # Print only the field names available on each tweet
    for key in tweet.AsDict():
        print(key)
    print()

In [None]:
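# `tweet` is the last tweet from the loop above; only retweets have 'retweeted_status'
tweet.AsDict().get('retweeted_status', {}).get('user', {}).get('favourites_count')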

In [None]:
api.VerifyCredentials()

In [None]:
f = api.GetFriends()
f

In [None]:
def tweets_to_df(tweets):
    """Flatten python-twitter Status objects into one DataFrame row per tweet."""
    rows = []
    for tweet in tweets:
        # AsDict() may leave out empty or zero-valued fields, so use .get() with defaults
        d = tweet.AsDict()
        user = d.get('user', {})
        rows.append({
            'tweet_created_at': d.get('created_at'),
            'hashtags': [h['text'] for h in d.get('hashtags', [])],
            'id': d.get('id'),
            'lang': d.get('lang'),
            'user_screen_name': user.get('screen_name'),
            'user_created_at': user.get('created_at'),
            'user_description': user.get('description'),
            'user_statuses_count': user.get('statuses_count', 0),
            # keep the text as str (encoding to UTF-8 would store bytes in Python 3)
            'tweet_text': d.get('text'),
            'user_friends_count': user.get('friends_count', 0),
            'retweet_count': d.get('retweet_count', 0),
            # favourites count of the original author, when this tweet is a retweet
            'favorites_count': d.get('retweeted_status', {}).get('user', {}).get('favourites_count', 0),
        })
    return pd.DataFrame(rows)

In [None]:
tweets_to_df(tweets)

In [None]:
api.GetStatus(1027454336429895680)

In [None]:
api.GetStatuses([1027440068980613120, 1027439980740669440])

## Streaming tweets

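Instead of fetching a fixed page of results at a time, we can use `tweepy`, whose `Cursor` hands us tweets one by one and takes care of paging for us. This still uses the search API; listening for tweets in real time is covered in the "Real streaming" section.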

In [None]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key=consumer_key,
                           consumer_secret=consumer_secret)
auth.set_access_token(access_token_key,
                      access_token_secret)

api=tweepy.API(auth)

In [None]:
max_tweets = 3

for index, tweet in enumerate(tweepy.Cursor(api.search, q="machinelearning").items(max_tweets)):
    # You can see all the methods available on tweet using .<tab> or 
    # dir(tweet). You can access the raw JSON using tweet._json
    print(str(index) + '. ' + tweet.text + '\n')

In [None]:
# We can also duplicate our original query
for index, tweet in enumerate(tweepy.Cursor(api.search, **parameters).items(max_tweets)):
    print(str(index) + '. ' + tweet.text + '\n')

## Getting tweets into dataframes

The demo below shows how to do this for the tweepy API using cursors.

In [None]:
# this creates the iterable cursor -- it doesn't pull the tweets yet!
cursor = tweepy.Cursor(api.search, **parameters).items(max_tweets)

# Iterating over the cursor is what actually sends the requests to Twitter.
# tweet._json is the raw "json" object (a Python dictionary) for each tweet.
the_tweets = [tweet._json for tweet in cursor]

# which we then can convert to a dataframe:
tweet_frame = pd.DataFrame(the_tweets)
tweet_frame

You could also do this with the response method we looked at earlier
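
As a sketch of that route: the `statuses` list returned by `response.json()` holds nested dictionaries, so it can help to flatten each one before building a dataframe. `flatten_status` is a hypothetical helper, the chosen fields are just a selection, and the sample status is made up to mimic the shape used earlier in this notebook.

```python
def flatten_status(status):
    """Pick a few nested fields out of one status dictionary."""
    user = status.get("user", {})
    hashtags = status.get("entities", {}).get("hashtags", [])
    return {
        "id": status.get("id"),
        "text": status.get("text"),
        "screen_name": user.get("screen_name"),
        "followers": user.get("followers_count", 0),
        "hashtags": [h["text"] for h in hashtags],
    }

# Made-up status, only to show the shape
sample_status = {
    "id": 1,
    "text": "an example tweet #python",
    "entities": {"hashtags": [{"text": "python"}]},
    "user": {"screen_name": "example_user", "followers_count": 10},
}

rows = [flatten_status(s) for s in [sample_status]]
print(rows)
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.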

In [None]:
cursor = [t for t in tweepy.Cursor(api.search, q='congress').items(4)]


In [None]:
c = cursor[0]

c.entities

## Getting data into Mongo

We can also insert tweets into MongoDB. We will use the cursor method, but you could also use the requests or streaming method (covered in a later section).

In [None]:
import json
from pymongo import MongoClient

# This connects us to the "legislation" database, and the "news" collection [think table]
# in that database
client = MongoClient()
db = client.legislation
news_collection = db.news

In [None]:
def processTweet(tweet):
    tweet_dict = {
        'datetime': tweet.created_at,
        'tweet': tweet.text,
        'entities': tweet.entities
    }
    
    if tweet.coordinates:
        tweet_dict['coordinates'] = tweet.coordinates
    if tweet.geo:
        tweet_dict['geo'] = tweet.geo
    
    return tweet_dict

cursor = tweepy.Cursor(api.search, q='congress').items(400)

# get a list of dictionaries
tweets = [ processTweet(tweet) for tweet in cursor ]

# insert them into Mongo
db.news.insert_many(tweets)


Let's use the PyMongo client to get some information back from the database!

In [None]:
# How many documents do we have
news_collection.count_documents({})

In [None]:
# How many mention Trump?
news_collection.count_documents({'tweet': {'$regex': 'Trump'}})

## More complicated query: 20 most popular hashtags

What are the most popular hashtags in the dataset? 
    
To start, let's find a document with at least one hashtag. Then we will build a pipeline using `aggregate`, which passes the documents through a series of processing steps.


In [None]:
news_collection.find_one({'entities.hashtags.0': {'$exists': 1}})

The aggregation steps used below are listed here. Note that `$` signs are used for operators and for existing field names (to distinguish them from normal strings):
- `$match`: The standard query used in `find` that we have already seen
- `$project`: Allows us to rename fields, or include only the fields we want
- `$unwind`: Takes a field that contains an array, and creates a new record for each element in that array (see example)
- `$group`: This is like a SQL `GROUP BY`. Takes a mandatory `_id` (which is what it groups by), and creates new fields with aggregate functions
- `$sort`: Takes `{ sort_field: 1 }` for ascending or `{ sort_field: -1 }` for descending order
- `$limit`: The number of records to return. For `find` queries, `limit` can also be called on the resulting cursor.


#### Unwind example

If we have a document
```javascript
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': [5,6,7],
   'field3': 'abba'
}
```
after doing an `$unwind` on `field2` you would get three new documents:
```javascript
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': 5,
   'field3': 'abba'
},
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': 6,
   'field3': 'abba'
},
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': 7,
   'field3': 'abba'
}
```
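
To make the idea concrete without a database, here is a rough pure-Python imitation of `$unwind`, followed by a count that plays the role of `$group`, `$sort` and `$limit`. This is only a sketch of the concept, not how MongoDB implements it: `unwind` is a hypothetical helper, it assumes the field always holds a list, and the hashtag documents are made up.

```python
from collections import Counter

def unwind(docs, field):
    """Yield one copy of each document per element of the list in `field`."""
    for doc in docs:
        for item in doc.get(field, []):
            new_doc = dict(doc)      # shallow copy, other fields are unchanged
            new_doc[field] = item
            yield new_doc

doc = {'_id': 123456789, 'field1': [1, 2, 3], 'field2': [5, 6, 7], 'field3': 'abba'}
unwound = list(unwind([doc], 'field2'))
print([d['field2'] for d in unwound])

# Made-up documents: count hashtags after unwinding, most common first
sample_docs = [{'hashtags': ['ai', 'ml']}, {'hashtags': ['ai']}, {'hashtags': []}]
counts = Counter(d['hashtags'] for d in unwind(sample_docs, 'hashtags'))
print(counts.most_common(20))
```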


In [None]:
# aggregate is a pipeline, order matters
cursor = news_collection.aggregate([
    {'$match': {'entities.hashtags.0': {'$exists': 1}}},
    {'$project': {'_id': 0, 'hashtags': '$entities.hashtags'}},
    {'$unwind': '$hashtags'},
    {'$group': {'_id': '$hashtags.text', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 20}
])

list(cursor)

## Real streaming: the tweepy API

Okay, but how about real streaming? That is, an object that sits there and "listens" for new tweets, and then processes them as they arrive?

This uses a slightly different API. There are a couple of things to pay attention to
- The tweets that we get are *strings* of JSON objects, not the JSON objects themselves. We also don't have the nice ways of accessing the attributes directly (e.g. tweet.text above). Instead we convert the string to JSON, which gives us a dictionary, and then go from there.
- A twitter stream takes a `StreamListener` class. We should write member functions `on_data` and `on_error` that are called when a new tweet arrives, or we encounter an error, respectively.
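
The first point can be tried out without a live stream. In this sketch the raw strings are made up to mimic what a listener might receive, including a message that is not a tweet and so has no `text` field; `extract_text` is a hypothetical helper name.

```python
import json

def extract_text(raw_message):
    """Parse one raw stream message and return the tweet text, or None."""
    message = json.loads(raw_message)   # JSON string -> Python dictionary
    return message.get('text')          # None when there is no 'text' key

# Made-up messages, only to show the two cases
raw_tweet = '{"text": "hello world", "user": {"screen_name": "example_user"}}'
raw_notice = '{"delete": {"status": {"id": 1}}}'

print(extract_text(raw_tweet))
print(extract_text(raw_notice))
```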

In this example, we use a `deque` with a maximum length of 5, so that we retain only the 5 most recent tweets. In the `on_data` call, we add the tweet to our collection, then print out the currently stored tweets.

If we wanted to store the data for all time, then the `on_data` method would be where we would load them into Mongo.
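
The behaviour of a bounded `deque` is easy to check on its own. In this small sketch the strings stand in for tweet texts: once the deque is full, each `append` silently drops the oldest entry.

```python
from collections import deque

recent = deque([], maxlen=5)

# Append 8 items; only the last 5 are kept
for i in range(8):
    recent.append(f"tweet {i}")

print(list(recent))
```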

In [None]:
from tweepy import Stream
from tweepy.streaming import StreamListener
from IPython import display
from collections import deque
import json

class MyListener(StreamListener):
    def __init__(self):
        super().__init__()
        self.list_of_tweets = deque([], maxlen=5)
        
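    def on_data(self, data):
        # The stream can also send non-tweet messages (e.g. delete notices) with no 'text'
        tweet_text = json.loads(data).get('text')
        if tweet_text is not None:
            self.list_of_tweets.append(tweet_text)
            self.print_list_of_tweets()

    def on_error(self, status):
        print(status)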

    def print_list_of_tweets(self):
        display.clear_output(wait=True)
        for index, tweet_text in enumerate(self.list_of_tweets):
            m='{}. {}\n\n'.format(index, tweet_text)
            print(m)
            
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#SpaceForce'])