<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Using the Twitter API: Guided Lab

_Authors: Dave Yerrington (SF)_

---


<img src="https://snag.gy/RNAEgP.jpg" width="600">

### Can we correctly identify which of these two old men tweeted what?

> *Note: this lab is intended to be a guided lab until the independent practice questions.*


## Goals
---

We are going to attempt to classify whether a tweet comes from Trump or Sanders.  This lab involves multiple steps:
- Create a developer account on Twitter
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer a sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Twitter API Developer Registration
---

You need a regular Twitter account before you can get a "developer" account, so register one first if you haven't already.

[Twitter Rest API](https://dev.twitter.com/rest/public)



## Create an "App"

---

![](https://snag.gy/HPBQbJ.jpg)

Go to Twitter and register an "app" at [apps.twitter.com](https://apps.twitter.com/).

> **Note**: For the required website field you can put a placeholder.

After you set up your app, you will only need to reference the corresponding keys Twitter generates for it.  These are the keys that our application will use to communicate with the Twitter API.

## Install Python Twitter API library

---

Someone was nice enough to build a Python library for us. It makes pulling tweets simple: we only need to plug in our keys and start collecting data. The library we will be using is provided by [Python Twitter Tools](http://mike.verdone.ca/twitter/).

To install it, run the next cell (there is no conda package).

In [None]:
!pip install twitter python-twitter

In [None]:
!pip install textacy

## Some Boring Twitter Rules
---

**Twitter notifies you they will rate limit your requests:**

>When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window — on behalf of your application. This limit is considered completely separately from per-user limits. https://dev.twitter.com/rest/public/rate-limiting

Here's a quick overview of what Twitter says are "the rules":

![](https://snag.gy/yJ6vIH.jpg)
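
One simple way to stay under a limit like the one quoted above is to space requests evenly across the window. The sketch below is only an illustration: the helper names are hypothetical, and the 15-minute window length is an assumption you should check against the rate-limiting page for the endpoint you call.

```python
import time


def seconds_between_requests(requests_per_window, window_seconds=15 * 60):
    """Even spacing that keeps us at or under the limit (window length is an assumption)."""
    return window_seconds / requests_per_window


def throttled(calls, requests_per_window, window_seconds=15 * 60, sleep=time.sleep):
    """Run each zero-argument callable in `calls`, pausing between consecutive requests."""
    delay = seconds_between_requests(requests_per_window, window_seconds)
    results = []
    for i, call in enumerate(calls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results.append(call())
    return results


# 15 requests per 15-minute window works out to one request per minute.
print(seconds_between_requests(15))  # 60.0
```

Passing `sleep` as a parameter lets you swap in a no-op while you debug, so you don't actually wait a minute between test calls.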


## About those Keys: OAuth Review
---

![](https://g.twimg.com/dev/documentation/image/appauth_0.png)

## What's going on here?  Take a minute..

## Our Application Keys
---

Take note of the application keys you will use to connect to Twitter and mine tweets from the official Bernie Sanders and Donald Trump Twitter accounts:

![](https://snag.gy/H1djQK.jpg)

## `TweetMiner` class structure

---

The following code will get you up and running, providing connectivity to Twitter. The class can make requests, and its JSON-derived results can later be transformed into DataFrames.

This is a great example of using object-oriented Python to organize our code!

> **Note:** `result_limit` is used in this class to cap the number of tweets pulled per request.  Keep it low until you've worked the bugs out of your request and captured the data you want; this is essential for avoiding rate limit blocks.

### Twitter API key setup

Fill in the dictionary below with the keys for your account.

- **consumer_key** - Find this on your app page under the "Keys and Access Tokens" tab
- **consumer_secret** - Right under **consumer_key** in the "Keys and Access Tokens" tab
- **access_token_key** - You will need to click the button to generate tokens to get this
- **access_token_secret** - Also available after you generate tokens
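
If you would rather not paste secrets into a notebook, one option is to read them from environment variables instead. This is a minimal sketch, not part of the lab's required setup: the environment variable names and the `load_twitter_keys` helper are hypothetical, so use whatever names you export in your own shell.

```python
import os

# Hypothetical variable names: use whichever names you export in your shell.
ENV_NAMES = {
    'consumer_key':        'TWITTER_CONSUMER_KEY',
    'consumer_secret':     'TWITTER_CONSUMER_SECRET',
    'access_token_key':    'TWITTER_ACCESS_TOKEN_KEY',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_twitter_keys(environ=os.environ):
    """Build the keys dictionary from environment variables, failing loudly if any is unset."""
    missing = [name for name in ENV_NAMES.values() if name not in environ]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}


# Demo with placeholder values; call load_twitter_keys() with no argument to read the real environment.
demo_env = {name: 'placeholder' for name in ENV_NAMES.values()}
twitter_keys = load_twitter_keys(demo_env)
print(sorted(twitter_keys))
```

The resulting dictionary has the same four keys as the `twitter_keys` dictionary used in the next cell, so it can be dropped in without changing anything else.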


In [8]:
import twitter, re, datetime, pandas as pd

# your keys go here:
twitter_keys = {
    'consumer_key':        'KmN03M1X1pImZ43sqdIu4yfnE',
    'consumer_secret':     'ePlIrIX5VXbZnO7DBu1RbFlw5lOai9dQr9n5TZb6vxnIdrr5Fz',
    'access_token_key':    '185036086-Q7K5IjuSoQZJwSIqD0wyHf6t62iPKatmfaPkriAM',
    'access_token_secret': 'cYpQz3xWHQbLplOj8iSeiNSOmMcsTOXmWcKMrJ9buLj5d'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret']
)


In [9]:
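class TweetMiner(object):
    """Pull pages of tweets from one user's timeline through a `twitter.Api` object."""

    # Class-level defaults; __init__ overrides both per instance.
    result_limit    =   20
    api             =   False

    def __init__(self, keys_dict, api, result_limit = 20):

        self.api = api
        self.twitter_keys = keys_dict

        # Number of tweets requested per API call (one "page").
        self.result_limit = result_limit


    def mine_user_tweets(self, user="dyerrington", mine_retweets=False, max_pages=5):
        # NOTE: mine_retweets is accepted for future use but is not applied yet.

        data           =  []
        last_tweet_id  =  False
        page           =  1

        while page <= max_pages:

            if last_tweet_id:
                # Only ask for tweets older than the oldest one collected so far.
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1)
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit)

            # An empty page means the timeline is exhausted, so stop requesting.
            if not statuses:
                break

            for item in statuses:

                mined = {
                    'tweet_id':        item.id,
                    'handle':          item.user.name,   # display name, not the @screen_name
                    'retweet_count':   item.retweet_count,
                    'text':            item.text,
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item.created_at,
                }

                last_tweet_id = item.id
                data.append(mined)

            page += 1

        return data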
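
To see how the `max_id` cursor in `mine_user_tweets` walks backwards through a timeline without touching the real API, here is a small offline sketch. `fake_timeline` and `collect_ids` are hypothetical stand-ins written for this illustration; they only mimic the "newest first, nothing newer than `max_id`" behaviour the class relies on.

```python
def fake_timeline(all_ids, count, max_id=None):
    """Stand-in for an API call: newest-first IDs, at most `count`, none above `max_id`."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]


def collect_ids(all_ids, count, max_pages):
    """Page through the fake timeline the same way mine_user_tweets pages through a real one."""
    collected = []
    last_id = None
    for _ in range(max_pages):
        if last_id is None:
            page = fake_timeline(all_ids, count)
        else:
            # Subtract 1 so the oldest tweet already collected is not returned again.
            page = fake_timeline(all_ids, count, max_id=last_id - 1)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        last_id = page[-1]
    return collected


timeline = list(range(100, 90, -1))  # ten made-up tweet IDs, newest first: 100 .. 91
print(collect_ids(timeline, count=4, max_pages=5))
```

Each page starts just below the oldest ID seen so far, which is why no tweet is collected twice even though every request asks for the "latest" tweets.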

## Instantiate the class
---

Make sure you pass the keys dictionary and the api as arguments.

**Check:** call the object's `mine_user_tweets()` method, passing the screen name of the user whose tweets you want to pull.

In [64]:
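# A:
miner = TweetMiner(twitter_keys, api, result_limit=20)

# NOTE: as the output below shows, 'bernisanders' resolves to a parody account
# ("Bermie Sanders"), so double-check the screen name before relying on these tweets.
sanders = miner.mine_user_tweets(user='bernisanders', max_pages=10)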

In [65]:
len(sanders)

127

In [31]:
print(sanders[0])

{'tweet_id': 978309865876549632, 'handle': 'Bermie Sanders 💎', 'retweet_count': 0, 'text': "@TheOnlyTiffer Indeed! Republican aren't on our side. Neither is their buffoon president.  Democrats are. That's wh… https://t.co/CpX3VXdLwF", 'mined_at': datetime.datetime(2018, 6, 13, 20, 13, 56, 437849), 'created_at': 'Mon Mar 26 16:37:02 +0000 2018'}


In [32]:
print(sanders[0]['text'])

@TheOnlyTiffer Indeed! Republican aren't on our side. Neither is their buffoon president.  Democrats are. That's wh… https://t.co/CpX3VXdLwF


In [37]:
dtrump = miner.mine_user_tweets(user='realDonaldTrump', max_pages=10)
print(dtrump[0]['text'])
dtrump_df = pd.DataFrame(dtrump)

Congratulations to @KevinCramer on his huge win in North Dakota. We need Kevin in the Senate, and I strongly endors… https://t.co/qdIpcCXMd6


### Convert the tweet outputs to a pandas DataFrame

> *Hint: this is as easy as passing it to the DataFrame constructor!*

In [34]:
# A:
sanders_df = pd.DataFrame(sanders)
sanders_df.head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Mon Mar 26 16:37:02 +0000 2018,Bermie Sanders 💎,2018-06-13 20:13:56.437849,0,@TheOnlyTiffer Indeed! Republican aren't on our side. Neither is their buffoon president. Democrats are. That's wh… https://t.co/CpX3VXdLwF,978309865876549632
1,Mon Mar 26 16:26:19 +0000 2018,Bermie Sanders 💎,2018-06-13 20:13:56.437856,0,@TheOnlyTiffer Thank you parents for lending us your children for more games of #DivideAndConquer. \nThe sugar on to… https://t.co/zbe3t9fgfS,978307168234455040
2,Mon Mar 26 14:44:15 +0000 2018,Bermie Sanders 💎,2018-06-13 20:13:56.437858,0,"@robreiner Spot on! I am moved by their adult parents who vote as we tell them, &amp; lent their children for games of… https://t.co/Q3xoKgdJN8",978281481373073409
3,Mon Mar 26 14:24:18 +0000 2018,Bermie Sanders 💎,2018-06-13 20:13:56.437859,0,@Enjoneer01 @johnastoehr I also addressed people not voting for @HillaryClinton. These people are classified by me… https://t.co/OpzifHU2bY,978276462124875777
4,Mon Mar 26 14:20:01 +0000 2018,Bermie Sanders 💎,2018-06-13 20:13:56.437860,0,"@johnastoehr @washmonthly So many tweets, most filled with the wrong perceptions. You really have a long way to lea… https://t.co/Zerd4AeXZ7",978275385057017857


In [35]:
jimcramer = miner.mine_user_tweets(user='jimcramer', max_pages=10)

In [36]:
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated in pandas >= 1.0)
jimcramer_df = pd.DataFrame(jimcramer)
jimcramer_df

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Wed Jun 13 23:01:03 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755629,1,Cimarex's slump may soon be over: Kamich https://t.co/Mw5SJnzm1D,1007035147118678016
1,Wed Jun 13 22:28:25 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755637,3,"RT @TheStreet: Rule 9: Be selective in the stocks you decide to defend. To see all 25 of @JimCramer’s investing rules today, visit: https:/…",1007026935988477952
2,Wed Jun 13 21:48:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755638,4,Honeywell is ready to rally: @BruceKamich https://t.co/8UMQBfhr5P,1007016763526123522
3,Wed Jun 13 20:42:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755640,7,.@EricJhonsa: Tesla and Twitter's big run-ups could be signs of irrational exuberance https://t.co/Xc9xYzUwNZ,1007000153289641990
4,Wed Jun 13 19:48:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755641,3,UnitedHealth Group has broken out to new highs: @BruceKamich https://t.co/S7UXEPtbx6,1006986565028143104
5,Wed Jun 13 18:42:14 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755642,23,"RT @MadMoneyOnCNBC: **HOT** The #Fed just raised interest rates, but that didn’t stop @JimCramer from eating lunch in his Smarties hat and…",1006970011926368256
6,Wed Jun 13 18:32:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755643,6,Rockwell Automation CEO tells @BrianSozzi why they spent $1 Billion to bet on connected factories https://t.co/sdmlfczUIE,1006967437865246722
7,Wed Jun 13 17:55:06 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755645,3,RT @TheStreet: It's Fed day and @BrianSozzi has the news you need. Stick with us for live updates: https://t.co/rArPkJPiFS,1006958150468202496
8,Wed Jun 13 17:49:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755646,4,"For Alphabet, $1,200 looks like it's in the cards via @BruceKamich https://t.co/CgBuBlBt9X",1006956618687369216
9,Wed Jun 13 16:56:13 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755647,7,Some levity never hurts...this is Amgen attempting to sing https://t.co/KlD1PDSzDM,1006943333468590080


In [52]:
# ignore_index=True resets the index so the concatenated DataFrame is numbered 0..n-1
df = pd.concat([sanders_df, dtrump_df, jimcramer_df], ignore_index=True)

In [53]:
df['label'] = 0  # default label: 0 = Sanders; Trump and Cramer rows are relabeled below

In [54]:
df.loc[df.handle == 'Donald J. Trump', 'label'] = 1

In [55]:
df[df.handle == 'Donald J. Trump'].head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,label
127,Wed Jun 13 20:50:20 +0000 2018,Donald J. Trump,2018-06-13 20:17:09.439645,8168,"Congratulations to @KevinCramer on his huge win in North Dakota. We need Kevin in the Senate, and I strongly endors… https://t.co/qdIpcCXMd6",1007002252467458048,1
128,Wed Jun 13 20:17:00 +0000 2018,Donald J. Trump,2018-06-13 20:17:09.439653,8893,Congratulations to Danny Tarkanian on his big GOP primary win in Nevada. Danny worked hard an got a great result. Looking good in November!,1006993862152392704,1
129,Wed Jun 13 20:11:41 +0000 2018,Donald J. Trump,2018-06-13 20:17:09.439654,15107,Senator Claire McCaskill of the GREAT State of Missouri flew around in a luxurious private jet during her RV tour o… https://t.co/Y2FQCocELA,1006992524366503941,1
130,Wed Jun 13 13:30:49 +0000 2018,Donald J. Trump,2018-06-13 20:17:09.439656,35095,"So funny to watch the Fake News, especially NBC and CNN. They are fighting hard to downplay the deal with North Kor… https://t.co/khUajnNtoR",1006891643985854464,1
131,Wed Jun 13 11:52:50 +0000 2018,Donald J. Trump,2018-06-13 20:17:09.439657,19043,"Oil prices are too high, OPEC is at it again. Not good!",1006866982833131520,1


In [56]:
df.loc[df.handle == 'Jim Cramer', 'label'] = 2

In [57]:
df[df.handle == 'Jim Cramer'].head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,label
327,Wed Jun 13 23:01:03 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755629,1,Cimarex's slump may soon be over: Kamich https://t.co/Mw5SJnzm1D,1007035147118678016,2
328,Wed Jun 13 22:28:25 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755637,3,"RT @TheStreet: Rule 9: Be selective in the stocks you decide to defend. To see all 25 of @JimCramer’s investing rules today, visit: https:/…",1007026935988477952,2
329,Wed Jun 13 21:48:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755638,4,Honeywell is ready to rally: @BruceKamich https://t.co/8UMQBfhr5P,1007016763526123522,2
330,Wed Jun 13 20:42:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755640,7,.@EricJhonsa: Tesla and Twitter's big run-ups could be signs of irrational exuberance https://t.co/Xc9xYzUwNZ,1007000153289641990,2
331,Wed Jun 13 19:48:00 +0000 2018,Jim Cramer,2018-06-13 20:14:01.755641,3,UnitedHealth Group has broken out to new highs: @BruceKamich https://t.co/S7UXEPtbx6,1006986565028143104,2


In [58]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [59]:
X = df.text
y = df.label

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.3,
                                                   stratify=y,
                                                   random_state=123)

In [63]:
y_train.value_counts()

2    140
1    139
0    89 
Name: label, dtype: int64

In [66]:
vect = CountVectorizer()
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [67]:
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

In [68]:
from sklearn.linear_model import LogisticRegression

In [69]:
lr = LogisticRegression()
lr.fit(X_train_dtm, y_train)
lr.predict(X_test_dtm)

array([0, 0, 2, 0, 1, 2, 1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1,
       1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 2, 1, 2, 0, 2, 2, 2, 2, 0, 2, 1, 1,
       2, 0, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 0, 2, 2, 2, 1, 2, 0, 1,
       1, 2, 2, 2, 2, 1, 1, 0, 0, 0, 1, 2, 2, 2, 1, 2, 1, 0, 1, 2, 2, 1,
       0, 1, 2, 2, 1, 1, 1, 0, 1, 2, 2, 1, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0,
       2, 0, 2, 2, 2, 1, 0, 0, 1, 2, 1, 2, 0, 2, 1, 2, 2, 2, 1, 1, 2, 1,
       1, 2, 0, 0, 2])

In [70]:
y_test.values

array([0, 0, 2, 0, 0, 0, 1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1,
       0, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 1, 2, 1, 1, 1,
       1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 2, 1, 2, 0, 1, 1, 2, 2, 0, 2, 2, 1,
       2, 0, 2, 2, 0, 2, 2, 2, 1, 2, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 0, 1,
       1, 2, 0, 0, 2, 0, 1, 1, 0, 1, 1, 2, 2, 0, 2, 1, 1, 0, 2, 2, 1, 1,
       2, 1, 1, 2, 1, 1, 2, 0, 0, 1, 2, 0, 2, 0, 0, 1, 2, 0, 1, 0, 2, 0,
       2, 1, 1, 2, 2, 1, 0, 0, 1, 2, 1, 2, 0, 0, 1, 2, 2, 2, 1, 2, 2, 2,
       1, 0, 0, 0, 2])
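
Comparing the two arrays by eye is tedious, so it helps to reduce them to a single number. The sketch below computes plain accuracy (the fraction of matching positions) on the first eight labels of each array above; the `accuracy` helper is written here for illustration. In the notebook itself, `sklearn.metrics.accuracy_score(y_test, lr.predict(X_test_dtm))` or `lr.score(X_test_dtm, y_test)` gives the same quantity over the full test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# First eight entries of y_test.values and of the predictions shown above.
y_true_sample = [0, 0, 2, 0, 0, 0, 1, 1]
y_pred_sample = [0, 0, 2, 0, 1, 2, 1, 1]

print(accuracy(y_true_sample, y_pred_sample))  # 6 of 8 match -> 0.75
```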

##  Create the training data

---

Let's get our "mined" data from the Twitter API.  

1. Mine Trump tweets
2. Create a tweet DataFrame
3. Mine Sanders tweets
4. Append the results to our DataFrame

In [None]:
# A:

## Any interesting ngrams going on with Trump?
---

Set up a vectorizer from sklearn and fit the text of Trump's tweets with an ngram range from 2 to 4. Figure out what the most common ngrams are.

> **Note:** It's up to you whether you want to remove stopwords or not. How does keeping or removing stopwords affect the results?
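
If it is unclear what an "ngram range from 2 to 4" actually counts, the following plain-Python sketch may help. It is not the sklearn solution (there you would pass `ngram_range=(2, 4)` to `CountVectorizer`); it is a hand-rolled illustration that reuses the default token pattern shown in the `CountVectorizer` output earlier. The `ngram_counts` helper and the two sample strings are made up for the example.

```python
import re
from collections import Counter


def ngram_counts(texts, min_n=2, max_n=4):
    """Count every run of min_n..max_n consecutive lowercase tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Same idea as CountVectorizer's default: tokens of two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        for n in range(min_n, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts


# Made-up sample strings, not mined tweets.
sample = [
    "Make America great again",
    "We will make America great again",
]

for ngram, count in ngram_counts(sample).most_common(5):
    print(count, ngram)
```

Note that single words never appear in the result because the smallest n-gram is a pair; that is the effect of starting the range at 2.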

In [None]:
# A:

### Look at the ngrams for Bernie Sanders

In [None]:
# A:

## Processing the tweets and building a model

---

To do classification we will need to convert the tweets into a set of features.

**You will need to:**
- Vectorize input text data.
- Initialize a model (try logistic regression).
- Train / Predict / cross-validate.
- Evaluate the performance of the model.

> **Bonus:** you may have noticed that there are website links in the tweets. What additional preprocessing steps can you do before building the model?
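
One possible preprocessing step for the bonus question is to strip links (and, optionally, @mentions) with regular expressions before vectorizing. The patterns and the `clean_tweet` helper below are assumptions made for this sketch rather than a prescribed solution; the example text is one of the Jim Cramer tweets shown earlier.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")   # anything starting with http:// or https://
MENTION_PATTERN = re.compile(r"@\w+")       # @ followed by word characters


def clean_tweet(text, strip_mentions=False):
    """Remove links (and optionally @mentions), then collapse leftover whitespace."""
    text = URL_PATTERN.sub("", text)
    if strip_mentions:
        text = MENTION_PATTERN.sub("", text)
    return " ".join(text.split())


tweet = "Honeywell is ready to rally: @BruceKamich https://t.co/8UMQBfhr5P"
print(clean_tweet(tweet))
print(clean_tweet(tweet, strip_mentions=True))
```

Whether to drop mentions is a modelling choice: they can be strong author signals, which may be exactly what you want or an easy shortcut you would rather the model not take.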


In [None]:
# A:

## Check the predicted probability for a random Sanders and Trump tweet
---

Below are a couple of tweets, one from Sanders and one from Trump. You can probably tell which is which on your own.

Estimate the predicted probability that each tweet was written by Trump.

In [None]:
# Prep our source as TfIdf vectors
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# NOTE:  Do not re-initialize or re-fit the tfidf vectorizer, or the feature space will be
# overwritten and your transform will no longer match the number of features the model was trained on.
#
# This is why you only need to "transform" since you already "fit" previously
#
####
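
To make the note above concrete, here is a toy bag-of-words vectorizer written in plain Python. It is a simplified stand-in for sklearn's vectorizers, with hypothetical `fit_vocabulary` and `transform` helpers and made-up strings, meant only to show why fitting again on new text changes the number of columns.

```python
def fit_vocabulary(texts):
    """Learn a word -> column index mapping from the training texts."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: column for column, word in enumerate(words)}


def transform(texts, vocabulary):
    """Count words using an existing vocabulary; unseen words are simply ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows


train = ["fake news again", "fair share of taxes"]
new = ["fake taxes", "totally unseen words"]

vocab = fit_vocabulary(train)
print(len(vocab))                 # 7 columns learned from the training texts
print(transform(new, vocab))      # new texts mapped onto those same 7 columns
print(len(fit_vocabulary(new)))   # re-fitting on the new texts yields 5 columns instead
```

A model trained on 7 columns cannot score a 5-column input, which is why the fitted vectorizer should only be used to transform new tweets.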


## Independent practice questions

---

### 1. Pull tweets for some new users.

Experiment with using more data.  The API will not like it if you blow through their limits - be careful.  Try to grab only what you need one time, then work on the copy of the objects that are returned.  

> Read the documentation about rate limits and see if you can get enough without hitting the rate limit.  Are there any options available in the API to avoid such a problem?

**Pull tweets for more than two different users of your choice.**

In [None]:
# A:

### 2. Build a multi-class classification model to distinguish between the users.

Try a different type of model from the one we used before.

In [None]:
# A:

### 3. Make a confusion matrix and classification report.
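
Before reaching for `sklearn.metrics.confusion_matrix` and `classification_report`, it can help to see what they compute. The sketch below builds a confusion matrix by hand (rows are true labels, columns are predicted labels, the same orientation sklearn uses) and derives per-class precision and recall from it. The helper names and the six toy labels are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of items whose true label is labels[i] and predicted label is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def precision_recall(matrix):
    """Return one (precision, recall) pair per class, in row order."""
    report = []
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        predicted_as_i = sum(row[i] for row in matrix)   # column total
        actually_i = sum(matrix[i])                      # row total
        precision = true_positives / predicted_as_i if predicted_as_i else 0.0
        recall = true_positives / actually_i if actually_i else 0.0
        report.append((precision, recall))
    return report


# Toy labels, not results from the lab.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
print(precision_recall(cm))
```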

In [None]:
# A:

### 4. What are the most and least "distinctive" tweets for each user?

To find this, identify the tweet that has the highest (correct) predicted probability of being that user's tweet for each user.
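
The selection step described above can be sketched without any model at all, given a table of predicted probabilities. The example assumes the labels are the integers 0, 1, 2 and that column `k` of each probability row belongs to label `k` (with sklearn, check `model.classes_` to confirm the column order of `predict_proba`). The `most_distinctive` helper, the placeholder texts, and the probabilities are all made up.

```python
def most_distinctive(texts, y_true, probabilities):
    """For each label, keep the correctly classified text with the highest probability for that label."""
    best = {}
    for text, label, probs in zip(texts, y_true, probabilities):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label:
            continue  # only correct predictions count
        if label not in best or probs[label] > best[label][1]:
            best[label] = (text, probs[label])
    return best


# Placeholder texts and made-up probabilities, one row per tweet.
texts = ["tweet a", "tweet b", "tweet c", "tweet d"]
y_true = [0, 0, 1, 2]
probabilities = [
    [0.60, 0.30, 0.10],
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.50, 0.10, 0.40],   # true label 2 but predicted 0, so it is skipped
]

print(most_distinctive(texts, y_true, probabilities))
```

Flipping the comparison to keep the lowest correct probability instead gives one reasonable reading of the "least distinctive" tweet.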

In [None]:
# A: