# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

# Twitter API Access

Twitter implements **OAuth 1.0A** as its standard authentication mechanism, and in order to use it to make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note for an OAuth 1.0A workflow: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (click **Create Access Token** to generate the last two).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.

The first time you execute the notebook, fill in all four credentials so that they are saved to the `pkl` file. After that, you can remove the secret keys from the notebook, because they will be loaded from the `pkl` file.

**pickle** is a Python standard-library module that performs **serialization**: it converts a Python object or data structure into a byte stream so that it can be saved to disk and recreated in Python later. By convention, the resulting file uses the `.pkl` extension.

The `pkl` file contains sensitive information that can be used to take control of your Twitter account. **Do not share it.**
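As a minimal sketch of the serialization round trip described above, the following writes a dictionary to a `pkl` file and reads it back. The file name `example_credentials.pkl` and the placeholder values are assumptions made for illustration; a temporary directory is used so that no real file is touched.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder values; never put real secrets in example code.
credentials = {'Consumer Key': 'abc', 'Consumer Secret': 'def'}

# Work in a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_credentials.pkl')

    # Serialize: pickle.dump writes the object as bytes, hence mode 'wb'.
    with open(path, 'wb') as f:
        pickle.dump(credentials, f)

    # Deserialize: pickle.load rebuilds an equal object from those bytes.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == credentials)  # True
```

Because unpickling can execute arbitrary code, it is safest to load only `pkl` files that you created yourself.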

In [4]:
import pickle
import os

Install the `twitter` package to interface with the Twitter API

In [5]:
!pip install twitter

Collecting twitter
  Downloading twitter-1.17.1-py2.py3-none-any.whl (55kB)
Installing collected packages: twitter
Successfully installed twitter-1.17.1


In [9]:
if not os.path.exists('C:/ml/twitter_credentials.pkl'):
    Twitter={}
    Twitter['Consumer Key'] = ''
    Twitter['Consumer Secret'] = ''
    Twitter['Access Token'] = ''
    Twitter['Access Token Secret'] = ''
    with open('C:/ml/twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    # use a context manager so the file handle is closed after loading
    with open('C:/ml/twitter_credentials.pkl','rb') as f:
        Twitter = pickle.load(f)
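On a first run the cell above saves whatever is in the dictionary, including empty strings. A small check like the one sketched below could flag blank entries before they are used for authentication. The helper name `missing_credentials` and the constant `REQUIRED_KEYS` are hypothetical and not part of the `twitter` package.

```python
# The four key names used by the credentials dictionary in this notebook.
REQUIRED_KEYS = ('Consumer Key', 'Consumer Secret',
                 'Access Token', 'Access Token Secret')

def missing_credentials(credentials):
    """Return the required keys that are absent or blank in `credentials`."""
    missing = []
    for key in REQUIRED_KEYS:
        value = credentials.get(key)
        # Treat a missing key, a non-string value, or whitespace as blank.
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing

# Example dict with gaps, mimicking a first run that was only partly filled in.
example = {'Consumer Key': 'abc', 'Consumer Secret': '',
           'Access Token': 'xyz'}
print(missing_credentials(example))
# ['Consumer Secret', 'Access Token Secret']
```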

## Example 1. Authorizing an application to access Twitter account data

In [14]:
import twitter

## use twitter keys from pkl file to create twitter api object in python

# create authentication object
auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

# create twitter api object
twitter_api = twitter.Twitter(auth = auth)

# Nothing to see by displaying twitter_api except that it's now a defined variable
print(twitter_api)

<twitter.api.Twitter object at 0x0000000005243C50>


## Example 2. Retrieving trends

Twitter identifies locations using the **Yahoo! Where On Earth ID**.

The Yahoo! Where On Earth ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

You can also look at the BOSS PlaceFinder here: https://developer.yahoo.com/boss/placefinder/

In [8]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Look up the WOEID for [San Diego](http://woeid.rosselliot.co.nz/lookup/san%20diego%20%20ca).

You can change it to another location.

In [20]:
SANDIEGO_LOCAL_WOE_ID = 2487889
PHILLY_LOCAL_WOE_ID = 2471217

# Prefix ID argument in .trends.place() with the underscore for query string parameterization.
#     - Without underscore, twitter package appends the ID value to the URL itself as a special case keyword argument.

# get top 50 trends from various areas via a trend object in JSON
world_trends = twitter_api.trends.place(_id = WORLD_WOE_ID) # = 1
us_trends = twitter_api.trends.place(_id = US_WOE_ID) # = 23424977
sd_trends = twitter_api.trends.place(_id = SANDIEGO_LOCAL_WOE_ID) # = 2487889
philly_trends = twitter_api.trends.place(_id = PHILLY_LOCAL_WOE_ID) # = 2471217


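# world_trends is a one-element list, so this slice returns the whole response
#   (all world trends); use world_trends[0]['trends'][:2] for only the top 2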
world_trends[:2]

[{'as_of': '2017-07-24T14:39:39Z',
  'created_at': '2017-07-24T14:36:55Z',
  'locations': [{'name': 'Worldwide', 'woeid': 1}],
  'trends': [{'name': 'Paint',
    'promoted_content': None,
    'query': 'Paint',
    'tweet_volume': 85712,
    'url': 'http://twitter.com/search?q=Paint'},
   {'name': '#FelizLunes',
    'promoted_content': None,
    'query': '%23FelizLunes',
    'tweet_volume': 32243,
    'url': 'http://twitter.com/search?q=%23FelizLunes'},
   {'name': '#MondayMotivation',
    'promoted_content': None,
    'query': '%23MondayMotivation',
    'tweet_volume': 97920,
    'url': 'http://twitter.com/search?q=%23MondayMotivation'},
   {'name': '#LahoreBlast',
    'promoted_content': None,
    'query': '%23LahoreBlast',
    'tweet_volume': 19575,
    'url': 'http://twitter.com/search?q=%23LahoreBlast'},
   {'name': '#24Jul',
    'promoted_content': None,
    'query': '%2324Jul',
    'tweet_volume': 32089,
    'url': 'http://twitter.com/search?q=%2324Jul'},
   {'name': '#توطين_البق

API responses are in **JSON (JavaScript Object Notation)** format, which is used to transfer data on the web. JSON maps roughly onto nested Python lists and dictionaries, and it can be thought of as a more concise alternative to XML.

Top 2 trends as of 7/24/2017, 9:59 AM EST are "#FelizLunes" and "#MondayMotivation"
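To illustrate how JSON maps onto Python lists and dictionaries, here is a minimal sketch that parses a shortened payload shaped like the trends response shown earlier. The string `raw` is a hand-trimmed excerpt written for this example, not a live API response.

```python
import json

# A shortened payload shaped like the trends response shown earlier.
raw = '''
[{"as_of": "2017-07-24T14:39:39Z",
  "locations": [{"name": "Worldwide", "woeid": 1}],
  "trends": [{"name": "Paint", "promoted_content": null, "tweet_volume": 85712},
             {"name": "#FelizLunes", "promoted_content": null, "tweet_volume": 32243}]}]
'''

parsed = json.loads(raw)

# A JSON array becomes a list and a JSON object becomes a dict.
print(type(parsed).__name__)                       # list
print(type(parsed[0]).__name__)                    # dict

# Nested values are reached by chaining indexes and keys.
print(parsed[0]['trends'][0]['name'])              # Paint

# JSON null becomes Python None.
print(parsed[0]['trends'][0]['promoted_content'])  # None
```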

In [21]:
# see type of the trends object
print(type(sd_trends),'\n')

# see the keys from the 1st record/object
print(list(sd_trends[0].keys()),'\n')

# see the trend data from the 1st object
print(sd_trends[0]['trends'])

<class 'twitter.api.TwitterListResponse'> 

['trends', 'as_of', 'created_at', 'locations'] 

[{'name': '#InsecureHBO', 'url': 'http://twitter.com/search?q=%23InsecureHBO', 'promoted_content': None, 'query': '%23InsecureHBO', 'tweet_volume': 127537}, {'name': '#ElTriEng', 'url': 'http://twitter.com/search?q=%23ElTriEng', 'promoted_content': None, 'query': '%23ElTriEng', 'tweet_volume': None}, {'name': '#WWEBattleground', 'url': 'http://twitter.com/search?q=%23WWEBattleground', 'promoted_content': None, 'query': '%23WWEBattleground', 'tweet_volume': 227952}, {'name': '#NW88JR', 'url': 'http://twitter.com/search?q=%23NW88JR', 'promoted_content': None, 'query': '%23NW88JR', 'tweet_volume': None}, {'name': '#SDCC2017', 'url': 'http://twitter.com/search?q=%23SDCC2017', 'promoted_content': None, 'query': '%23SDCC2017', 'tweet_volume': 102633}, {'name': 'Jared Kushner', 'url': 'http://twitter.com/search?q=%22Jared+Kushner%22', 'promoted_content': None, 'query': '%22Jared+Kushner%22', 'tweet_vo
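
The output above shows that `tweet_volume` is `None` for some trends, which matters when ranking trends by volume. The sketch below shows one way to handle that by skipping such trends. The helper `top_trends_by_volume` is hypothetical, and `sample_response` is a hand-trimmed copy of a few entries from the output above.

```python
def top_trends_by_volume(response, n=3):
    """Return up to n (name, tweet_volume) pairs, largest volume first.

    Trends whose tweet_volume is None are skipped, because None cannot
    be compared with integers when sorting.
    """
    trends = response[0]['trends']
    with_volume = [(trend['name'], trend['tweet_volume'])
                   for trend in trends
                   if trend['tweet_volume'] is not None]
    return sorted(with_volume, key=lambda pair: pair[1], reverse=True)[:n]

# A few entries copied from the San Diego output, keeping only two fields.
sample_response = [{'trends': [
    {'name': '#InsecureHBO', 'tweet_volume': 127537},
    {'name': '#ElTriEng', 'tweet_volume': None},
    {'name': '#WWEBattleground', 'tweet_volume': 227952},
    {'name': '#NW88JR', 'tweet_volume': None},
    {'name': '#SDCC2017', 'tweet_volume': 102633},
]}]

print(top_trends_by_volume(sample_response))
# [('#WWEBattleground', 227952), ('#InsecureHBO', 127537), ('#SDCC2017', 102633)]
```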

## Example 3. Displaying API responses as pretty-printed JSON

In [19]:
import json

# get a more-formatted version of JSON output with indents for every new level
print(json.dumps(us_trends[:2], indent=1))

[
 {
  "trends": [
   {
    "name": "#MondayMotivation",
    "url": "http://twitter.com/search?q=%23MondayMotivation",
    "promoted_content": null,
    "query": "%23MondayMotivation",
    "tweet_volume": 87024
   },
   {
    "name": "Jared Kushner",
    "url": "http://twitter.com/search?q=%22Jared+Kushner%22",
    "promoted_content": null,
    "query": "%22Jared+Kushner%22",
    "tweet_volume": 77242
   },
   {
    "name": "#YoureAtABadHotelIf",
    "url": "http://twitter.com/search?q=%23YoureAtABadHotelIf",
    "promoted_content": null,
    "query": "%23YoureAtABadHotelIf",
    "tweet_volume": null
   },
   {
    "name": "#NationalTequilaDay",
    "url": "http://twitter.com/search?q=%23NationalTequilaDay",
    "promoted_content": null,
    "query": "%23NationalTequilaDay",
    "tweet_volume": null
   },
   {
    "name": "#FelizLunes",
    "url": "http://twitter.com/search?q=%23FelizLunes",
    "promoted_content": null,
    "query": "%23FelizLunes",
    "tweet_volume": 26785
   },
   

## Example 4. Computing the intersection of two *sets* of trends

i.e. Find commonalities in the trends from different locations

In [23]:
# create an empty dictionary that will map each location name to a set of trend names
trends_set = {}

# from the list value of the 'trends' key from the dictionary in the JSON response, go through each dictionary element w/in
#   the list and get the value from the 'name' key and keep only unique ones
trends_set['world'] = set(trend['name'] for trend in world_trends[0]['trends'])

trends_set['us'] = set(trend['name'] for trend in us_trends[0]['trends'])

trends_set['san diego'] = set(trend['name'] for trend in sd_trends[0]['trends'])

trends_set['philly'] = set(trend['name'] for trend in philly_trends[0]['trends'])

In [30]:
# loop through the locations and print each location's trend names as one comma-separated line
for loc in ['world','us','san diego','philly']:
    print(('-'*10,loc))
    print(','.join(trends_set[loc]),'\n')


('----------', 'world')
Benzema Beşiktaşta,#LahoreBlast,#コードブルー,#MondayMotivation,#YoureAtABadHotelIf,#Schaffhausen,#SantosCompraTrinos,#SegundaDetremuraSdv,#SONA2017,#DidemMadak,#ElRidiculoMasGrandeEs,#LaLunaSangreHinala,真剣佑,#qanda,#BlueMendy,#PDG36,#muchogusto,#cambiame513,#HoraDoVeneno,#InterOL,#EnLunesSiempreYo,#PorLaMañana,Paint,#TrocoFalsianesPor,#30LuYaşlarda,#PurposeTourMovie,#توطين_البقالات,#Cassano,#BasınBayramı,Jared Kushner,#ALTLTrashTalk,Eto'o,#انا_مسلمه_اتميز_بحجابي,#مجلس_الوزراء,#BoschKullanma,#BTSisComing,#SuperChef,#NationalTequilaDay,#AndIWouldDoItAgain,#あなたっぽいにゃんこ,Bekir Bozdağ,#اسعد_شعور_اذا,#FelizLunes,Benjamin Mendy,#Lozan,#CosasInsoportables,#GazetecilereÖzgürlük,VIVEGAM TEASER 500K LIKES,Yoko Ono,#24Jul 

('----------', 'us')
Dan Rooney,NEW SHOW,#MondayMotivation,#YoureAtABadHotelIf,#ShootYourShot,Ray Tensing,#SchoologyNEXT,#FunBelt,Interstate Bridge,#B1GMediaDay,#TrainingCamp,Bush's Baked Beans,Alexandre Dumas,Kent Island,Sherrod Brown,Greg Ward,Foster,Santana R

In [29]:
# find trends that are happening in 2 different areas of the world
print(( '='*10,'Intersection of World and US Trends'))
print((trends_set['world'].intersection(trends_set['us'])),'\n')

print(('='*10,'Intersection of US and San Diego Trends'))
print((trends_set['san diego'].intersection(trends_set['us'])),'\n')

print(('='*10,'Intersection of Philadelphia and San Diego Trends'))
print((trends_set['san diego'].intersection(trends_set['philly'])))

{'#MondayMotivation', '#YoureAtABadHotelIf', 'Jared Kushner', '#FelizLunes', '#NationalTequilaDay', '#24Jul', 'Paint', '#AndIWouldDoItAgain'} 

{'Dan Rooney', 'NEW SHOW', '#MondayMotivation', '#YoureAtABadHotelIf', '#ShootYourShot', 'Ray Tensing', '#SchoologyNEXT', '#FunBelt', 'Interstate Bridge', '#B1GMediaDay', "Bush's Baked Beans", 'Kent Island', 'Alexandre Dumas', 'Sherrod Brown', 'Greg Ward', 'Foster', 'Santana Row', '#MuseMon', 'Rudy Giuliani', 'Kabul', 'Mack Hollins', '#beinternetawesome', '#VBAmazing', '#c4ignite', 'Paint', '#100RCsummit', '#ThingsIAlwaysSeemToForget', 'Morning Joe', 'Jared Kushner', 'Sidney Jones', '#ABetterDeal', '#NationalTequilaDay', '#AndIWouldDoItAgain', '#5HonFallon', 'Suga Free', 'Chuck Schumer', 'GenX', '#FelizLunes', 'Drain the Sewer', 'WR Anquan Boldin', 'The Kook', '#24Jul', '#PioneerDay'} 

{'Dan Rooney', '#MondayMotivation', '#YoureAtABadHotelIf', '#ShootYourShot', 'Ray Tensing', '#SchoologyNEXT', '#FunBelt', 'Interstate Bridge', '#B1GMediaDay', "
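
The pairwise comparisons above can be generalized. The following sketch computes the intersection for every pair of locations and the trends common to all of them. The contents of `example_sets` are illustrative only: the trend names are borrowed from the outputs above, but their membership in each set is made up for this example.

```python
from itertools import combinations

# Small hand-made sets; membership is illustrative, not real API data.
example_sets = {
    'world': {'#MondayMotivation', 'Paint', '#24Jul', '#LahoreBlast'},
    'us': {'#MondayMotivation', 'Paint', '#24Jul', 'Dan Rooney'},
    'san diego': {'#MondayMotivation', 'Dan Rooney', '#SDCC2017'},
}

# Pairwise intersections for every combination of two locations.
# The & operator is equivalent to calling .intersection() on the left set.
pairwise = {(a, b): example_sets[a] & example_sets[b]
            for a, b in combinations(example_sets, 2)}

# Trends common to every location at once.
common_to_all = set.intersection(*example_sets.values())

for pair, shared in pairwise.items():
    print(pair, sorted(shared))
print('all:', sorted(common_to_all))
```

Sorting each result before printing gives a stable display order, since sets themselves are unordered.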

## Example 5. Collecting search results

In [32]:
# Set a variable `q` to a trending topic or anything else for that matter.
q = '#YoureAtABadHotelIf' 

number = 100

# See https://dev.twitter.com/docs/api/1.1/get/search/tweets

# search through all tweets for a specified number of tweets about a topic
search_results = twitter_api.search.tweets(q = q, count = number)

# get data from these tweets
statuses = search_results['statuses']

print(len(statuses),'\n')
print(statuses)

100 

[{'created_at': 'Mon Jul 24 15:19:50 +0000 2017', 'id': 889505411052732416, 'id_str': '889505411052732416', 'text': '#YoureAtABadHotelIf theres urine stains on the carpet', 'truncated': False, 'entities': {'hashtags': [{'text': 'YoureAtABadHotelIf', 'indices': [0, 19]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 923987281, 'id_str': '923987281', 'name': 'carl brown', 'screen_name': 'carlbrown1978', 'location': '', 'description': 'theres not much in life worth worrying about, so enjoy,  middlesbrough', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 1001, 'friends_count': 1133, 'listed_count': 

Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [33]:
all_text = []
filtered_statuses = []

# for each tweet, if the tweet's text is not already in our list, add the tweet data to filtered_statuses
#   and the tweet text to all_text
for s in statuses:
    if s["text"] not in all_text:
        filtered_statuses.append(s)
        all_text.append(s["text"])
        
statuses = filtered_statuses 
len(statuses)

94

So we removed 6 duplicate statuses (100 - 94).
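
The list-based membership test above rescans `all_text` for every status. As an alternative sketch, a `set` can track the texts already seen while the original order is preserved. The function name `dedupe_by_text` and the sample data are hypothetical.

```python
def dedupe_by_text(statuses):
    """Keep the first status for each distinct 'text', preserving order."""
    seen_texts = set()
    unique = []
    for status in statuses:
        # Set membership is a hash lookup rather than a scan of a list.
        if status['text'] not in seen_texts:
            seen_texts.add(status['text'])
            unique.append(status)
    return unique

# Made-up statuses: the third repeats the text of the first.
sample = [{'id': 1, 'text': 'hello'},
          {'id': 2, 'text': 'world'},
          {'id': 3, 'text': 'hello'}]

print([s['id'] for s in dedupe_by_text(sample)])  # [1, 2]
```

For the roughly 100 statuses in this notebook the difference is negligible. The set-based version mainly matters for much larger collections.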

In [36]:
# get the actual text/message for each tweet 
[s['text'] for s in statuses]

['#YoureAtABadHotelIf theres urine stains on the carpet',
 'RT @SheaBrowning: #YoureAtABadHotelIf the roach at the check in counter escorts you to your room.',
 '#YoureAtABadHotelIf you stay in Staten Island.',
 '#YoureAtABadHotelIf instead of HBO, your room just has BO.',
 '#YoureAtABadHotelIf the hotel manager can determine the cause and time of death more accurately than the police coroner.',
 "#YoureAtABadHotelIf You don't have those little candies on the desk in the lobby.",
 'RT @helenmaryme: #YoureAtABadHotelIf you call down for an extra towel, and they tell you that you have to wait because someone else is usin…',
 "RT @geoffreyclark37: #YoureAtABadHotelIf when you check in, the guy at the counter asks if you're really sure you want to stay here.",
 "#YoureAtABadHotelIf you turn on the TV 📺 and the only channel's that you get are porn. 🍆💦 https://t.co/QTkWcvFLoF",
 '#YoureAtABadHotelIf https://t.co/k0WG2GLgrK',
 'Charge your phone 5x faster\nhttps://t.co/muqBLXzM4c\n#MondayMoti

In [40]:
# Show 1 sample JSON search result by indexing into the list
print(json.dumps(statuses[1], indent=1))

{
 "created_at": "Mon Jul 24 15:19:47 +0000 2017",
 "id": 889505395743576066,
 "id_str": "889505395743576066",
 "text": "RT @SheaBrowning: #YoureAtABadHotelIf the roach at the check in counter escorts you to your room.",
 "truncated": false,
 "entities": {
  "hashtags": [
   {
    "text": "YoureAtABadHotelIf",
    "indices": [
     18,
     37
    ]
   }
  ],
  "symbols": [],
  "user_mentions": [
   {
    "screen_name": "SheaBrowning",
    "name": "Shea Browning",
    "id": 25776495,
    "id_str": "25776495",
    "indices": [
     3,
     16
    ]
   }
  ],
  "urls": []
 },
 "metadata": {
  "iso_language_code": "en",
  "result_type": "recent"
 },
 "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Lite</a>",
 "in_reply_to_status_id": null,
 "in_reply_to_status_id_str": null,
 "in_reply_to_user_id": null,
 "in_reply_to_user_id_str": null,
 "in_reply_to_screen_name": null,
 "user": {
  "id": 808722445440024577,
  "id_str": "808722445440024577",
  "name": "Kaegan L

In [60]:
# Take a single status by its index and assign it to the variable t
t = statuses[1]

# Alternatively, select a status by its id with a list comprehension, which
# yields a one-element list whose only item can be accessed with [0]:
#[ status for status in statuses 
#          if status['id'] == 316948241264549888 ][0]

# Explore the variable t to get familiarized with the data structure...

print(t['retweet_count'],'\n')
print(t['favorite_count'],'\n')
print(t['entities'],'\n')
print(t['user'],'\n')
print(t['lang'])

43 

0 

{'hashtags': [{'text': 'YoureAtABadHotelIf', 'indices': [18, 37]}], 'symbols': [], 'user_mentions': [{'screen_name': 'SheaBrowning', 'name': 'Shea Browning', 'id': 25776495, 'id_str': '25776495', 'indices': [3, 16]}], 'urls': []} 

{'id': 808722445440024577, 'id_str': '808722445440024577', 'name': 'Kaegan Laveau', 'screen_name': 'Kaegan_xo', 'location': 'Bless your ❤, SC', 'description': '❤Angela Bassett ❤Michelle Obama ❤Paula Abdul ❤JK Rowling #LiBERAL #americanHORRORstory #NEWorleans #nursingSTUDENT #GIFqueen #books', 'url': 'https://t.co/gNP6jj9mgc', 'entities': {'url': {'urls': [{'url': 'https://t.co/gNP6jj9mgc', 'expanded_url': 'http://misskaeganxo.tumblr.com', 'display_url': 'misskaeganxo.tumblr.com', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 54, 'friends_count': 285, 'listed_count': 1, 'created_at': 'Tue Dec 13 17:17:11 +0000 2016', 'favourites_count': 590, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 've

## Example 6. Extracting text, screen names, and hashtags from tweets

In [61]:
#get the actual tweet message for each tweet
status_texts = [status['text'] for status in statuses ]

# get the screen names of the users mentioned in each tweet
screen_names = [user_mention['screen_name'] 
                 for status in statuses  # for each tweet retrieved
                     for user_mention in status['entities']['user_mentions'] ]  # for each user_mention in the "entity" dict

# get the hashtags from each tweet
hashtags = [hashtag['text'] 
             for status in statuses # for each tweet retrieved 
                 for hashtag in status['entities']['hashtags']] # for each hashtag value in the "entity" dict

# Compute a collection of all words from all tweets
words = [w 
          for t in status_texts 
              for w in t.split()]

In [62]:
# Explore the first 5 items for each...
print(json.dumps(status_texts[0:5], indent = 1))
print(json.dumps(screen_names[0:5], indent = 1)) 
print(json.dumps(hashtags[0:5], indent = 1))
print(json.dumps(words[0:5], indent = 1))

[
 "#YoureAtABadHotelIf theres urine stains on the carpet",
 "RT @SheaBrowning: #YoureAtABadHotelIf the roach at the check in counter escorts you to your room.",
 "#YoureAtABadHotelIf you stay in Staten Island.",
 "#YoureAtABadHotelIf instead of HBO, your room just has BO.",
 "#YoureAtABadHotelIf the hotel manager can determine the cause and time of death more accurately than the police coroner."
]
[
 "SheaBrowning",
 "helenmaryme",
 "geoffreyclark37",
 "BookstrOfficial",
 "googlygirl98"
]
[
 "YoureAtABadHotelIf",
 "YoureAtABadHotelIf",
 "YoureAtABadHotelIf",
 "YoureAtABadHotelIf",
 "YoureAtABadHotelIf"
]
[
 "#YoureAtABadHotelIf",
 "theres",
 "urine",
 "stains",
 "on"
]


## Example 7. Creating a basic frequency distribution from the words in tweets

In [63]:
from collections import Counter

# for each list created above, count the top 10 of each (top 10 words used, screen names present, hashtags used)
for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('#YoureAtABadHotelIf', 92), ('the', 74), ('you', 38), ('to', 28), ('on', 23), ('your', 23), ('RT', 21), ('in', 21), ('and', 20), ('are', 19)]

[('SheaBrowning', 1), ('helenmaryme', 1), ('geoffreyclark37', 1), ('BookstrOfficial', 1), ('googlygirl98', 1), ('LockesAlterEgo', 1), ('LouBrutus', 1), ('Alohababe2011', 1), ('ErinRHazel', 1), ('JMoney731', 1)]

[('YoureAtABadHotelIf', 92), ('MondayMotivation', 3), ('Trump', 2), ('truestory', 2), ('NationalTequilaDay', 1), ('MusicMonday', 1), ('ParentSchmovies', 1), ('MuseMon', 1), ('youreatabadhotelif', 1), ('MondayMotivaton', 1)]
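
The hashtag counts above treat `YoureAtABadHotelIf` and `youreatabadhotelif` as different items, because `Counter` compares strings exactly. If case-insensitive counts are wanted, one possible sketch is to normalize each tag before counting. The list `sample_hashtags` is made up for illustration.

```python
from collections import Counter

# Made-up sample in the style of the hashtag list above.
sample_hashtags = ['YoureAtABadHotelIf', 'youreatabadhotelif',
                   'MondayMotivation', 'YoureAtABadHotelIf', 'Trump']

# casefold() is a stronger form of lower() intended for caseless matching.
normalized_counts = Counter(tag.casefold() for tag in sample_hashtags)

print(normalized_counts.most_common(1))  # [('youreatabadhotelif', 3)]
```

The trade-off is that the original capitalization is lost in the result, so this suits counting better than display.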



## Example 8. Create a prettyprint function to display tuples in a nice tabular format

In [None]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [None]:
for label, data in (('Word', words), 
                    ('Screen Name', screen_names), 
                    ('Hashtag', hashtags)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))

## Example 9. Finding the most popular retweets

In [None]:
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]

We can build another `prettyprint` function to print entire tweets with their retweet count.

We also want to split the text of the tweet into up to 3 lines, if needed.

In [None]:
row_template = "{:^7} | {:^15} | {:50}"
def prettyprint_tweets(list_of_tuples):
    print()
    print(row_template.format("Count", "Screen Name", "Text"))
    print("*"*60)
    for count, screen_name, text in list_of_tuples:
        print(row_template.format(count, screen_name, text[:50]))
        if len(text) > 50:
            print(row_template.format("", "", text[50:100]))
            if len(text) > 100:
                print(row_template.format("", "", text[100:]))
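
The function above cuts the text at fixed character positions, so words can be split across lines. As an alternative sketch, the standard-library `textwrap` module can break at whitespace instead. The helper name `wrap_tweet_text` is hypothetical. Unlike the function above, this version truncates anything beyond the third line and marks the cut with a placeholder.

```python
import textwrap

def wrap_tweet_text(text, width=50, max_lines=3):
    """Split text into at most max_lines lines of at most width characters.

    Lines break at whitespace where possible; newlines in the text are
    treated as spaces. Text that does not fit is cut and marked with ' ...'.
    """
    return textwrap.wrap(text, width=width, max_lines=max_lines,
                         placeholder=' ...')

# One of the retweet texts shown earlier in this notebook.
text = ('RT @helenmaryme: #YoureAtABadHotelIf you call down for an extra '
        'towel, and they tell you that you have to wait because someone '
        'else is usin…')

lines = wrap_tweet_text(text)
for line in lines:
    print(line)
```

Each returned line could be passed to `row_template.format("", "", line)` in place of the fixed slices.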

In [None]:
# Sort by retweet count (descending), keep the first 10, and display each item in the tuple

prettyprint_tweets(sorted(retweets, reverse=True)[:10])
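
Sorting whole tuples, as above, orders by retweet count first and then breaks ties by screen name and text. If only the count should decide the order, an explicit key makes that intent visible. The sketch below uses `heapq.nlargest` from the standard library; the tuples in `sample_retweets` are made-up values with the same `(count, screen_name, text)` shape.

```python
from heapq import nlargest
from operator import itemgetter

# Made-up (retweet_count, screen_name, text) tuples for illustration.
sample_retweets = [
    (12, 'user_b', 'RT @user_b: second example'),
    (43, 'user_a', 'RT @user_a: first example'),
    (5, 'user_c', 'RT @user_c: third example'),
]

# Keep the 2 tuples with the largest retweet count (element 0 of each tuple).
top_two = nlargest(2, sample_retweets, key=itemgetter(0))

print([(count, name) for count, name, _ in top_two])
# [(43, 'user_a'), (12, 'user_b')]
```

The result is a list of tuples in descending order, so it can be passed to `prettyprint_tweets` in the same way as the sorted list.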