# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

# Twitter API Access

Twitter implements OAuth 1.0A as its standard authentication mechanism. To use it to make requests to Twitter's API, go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note for an OAuth 1.0A workflow: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (click **Create Access Token** to generate the access token and its secret).

Note that you will need an ordinary Twitter account to log in, create an app, and get these credentials.

The first time you execute the notebook, add all credentials so that they are saved in the `pkl` file. After that, you can remove the secret keys from the notebook, because they will be loaded from the `pkl` file.

The `pkl` file contains sensitive information that can be used to take control of your Twitter account. **Do not share it**.

In [1]:
import pickle
import os

In [2]:
if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    # Replace the placeholders with your own credentials on the first run.
    Twitter['Consumer Key'] = 'YOUR_CONSUMER_KEY'
    Twitter['Consumer Secret'] = 'YOUR_CONSUMER_SECRET'
    Twitter['Access Token'] = 'YOUR_ACCESS_TOKEN'
    Twitter['Access Token Secret'] = 'YOUR_ACCESS_TOKEN_SECRET'
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    # Use a context manager so the file handle is closed after loading.
    with open('secret_twitter_credentials.pkl','rb') as f:
        Twitter = pickle.load(f)
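
Because the pickle file holds live credentials, it may be worth restricting who can read it. The sketch below shows one way to do that, assuming a POSIX-like system. `save_credentials` and `load_credentials` are hypothetical helper names, not part of this notebook or of the `twitter` package, and on Windows `os.chmod` only toggles the read-only flag.

```python
import os
import pickle
import stat

def save_credentials(credentials, path):
    """Pickle the credentials dict and make the file owner-read/write only."""
    with open(path, 'wb') as f:
        pickle.dump(credentials, f)
    # Owner may read and write; group and others get no access (POSIX).
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

def load_credentials(path):
    """Load a credentials dict previously written by save_credentials."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```

Only unpickle files you created yourself: `pickle.load` can execute arbitrary code embedded in a malicious file.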

Install the `twitter` package to interface with the Twitter API:

In [3]:
!pip install twitter



## Example 1. Authorizing an application to access Twitter account data

In [4]:
import twitter

auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print(twitter_api)

<twitter.api.Twitter object at 0x0000021944CF3438>


## Example 2. Retrieving trends

Twitter identifies locations using the Yahoo! Where On Earth ID.

The Yahoo! Where On Earth ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

See also the BOSS PlaceFinder: https://developer.yahoo.com/boss/placefinder/

In [5]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424848

Look up the WOEID for [San Diego](http://woeid.rosselliot.co.nz/lookup/san%20diego%20%20ca).

You can change it to another location.

In [6]:
LOCAL_WOE_ID=2295424

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
local_trends = twitter_api.trends.place(_id=LOCAL_WOE_ID)

In [7]:
world_trends[:2]

[{'trends': [{'name': '京アニ',
    'url': 'http://twitter.com/search?q=%E4%BA%AC%E3%82%A2%E3%83%8B',
    'promoted_content': None,
    'query': '%E4%BA%AC%E3%82%A2%E3%83%8B',
    'tweet_volume': 1750869},
   {'name': '#PrayForKyoani',
    'url': 'http://twitter.com/search?q=%23PrayForKyoani',
    'promoted_content': None,
    'query': '%23PrayForKyoani',
    'tweet_volume': 125724},
   {'name': '死者25人',
    'url': 'http://twitter.com/search?q=%E6%AD%BB%E8%80%8525%E4%BA%BA',
    'promoted_content': None,
    'query': '%E6%AD%BB%E8%80%8525%E4%BA%BA',
    'tweet_volume': 90320},
   {'name': '#TAEYONG_LongFlight',
    'url': 'http://twitter.com/search?q=%23TAEYONG_LongFlight',
    'promoted_content': None,
    'query': '%23TAEYONG_LongFlight',
    'tweet_volume': 207234},
   {'name': '#instagramDELETE',
    'url': 'http://twitter.com/search?q=%23instagramDELETE',
    'promoted_content': None,
    'query': '%23instagramDELETE',
    'tweet_volume': None},
   {'name': '#TheFinalQuarter',
    'u

In [9]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])

<class 'twitter.api.TwitterListResponse'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': '#WhatsAppDown', 'url': 'http://twitter.com/search?q=%23WhatsAppDown', 'promoted_content': None, 'query': '%23WhatsAppDown', 'tweet_volume': 144552}, {'name': '#ENGvNZ', 'url': 'http://twitter.com/search?q=%23ENGvNZ', 'promoted_content': None, 'query': '%23ENGvNZ', 'tweet_volume': 72417}, {'name': '#RahulGandhi', 'url': 'http://twitter.com/search?q=%23RahulGandhi', 'promoted_content': None, 'query': '%23RahulGandhi', 'tweet_volume': 16185}, {'name': 'Manjrekar', 'url': 'http://twitter.com/search?q=Manjrekar', 'promoted_content': None, 'query': 'Manjrekar', 'tweet_volume': 18238}, {'name': '#GoGreen_SayNoToPolythene', 'url': 'http://twitter.com/search?q=%23GoGreen_SayNoToPolythene', 'promoted_content': None, 'query': '%23GoGreen_SayNoToPolythene', 'tweet_volume': 20703}, {'name': 'Jaddu', 'url': 'http://twitter.com/search?q=Jaddu', 'promoted_content': None, 'query': 'Jaddu', 'tweet_volume'
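
Each element of the response is a dictionary whose `trends` key holds a list of trend dictionaries, and `tweet_volume` may be `None`. As a minimal sketch of working with that structure, the hypothetical helper `trends_by_volume` below sorts the trends of the first location by volume, treating a missing volume as zero. The sample data is abridged from the outputs above so that the block runs without calling the API.

```python
def trends_by_volume(trends_response):
    """Return (name, volume) pairs from the first location, largest first."""
    trends = trends_response[0]['trends']
    # tweet_volume can be None; count it as 0 so the pairs stay sortable.
    pairs = [(t['name'], t['tweet_volume'] or 0) for t in trends]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Sample shaped like a trends.place response (abridged, for illustration only).
sample_response = [{'trends': [
    {'name': '#ENGvNZ', 'tweet_volume': 72417},
    {'name': '#instagramDELETE', 'tweet_volume': None},
    {'name': '#WhatsAppDown', 'tweet_volume': 144552},
    {'name': '#RahulGandhi', 'tweet_volume': 16185},
]}]

for name, volume in trends_by_volume(sample_response):
    print("{:20} | {:>8}".format(name, volume))
```

In the notebook, the same helper could be called as `trends_by_volume(local_trends)`.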

## Example 3. Displaying API responses as pretty-printed JSON

In [8]:
import json

print(json.dumps(us_trends[:2], indent=1))

[
 {
  "trends": [
   {
    "name": "#MissionMangalTrailer",
    "url": "http://twitter.com/search?q=%23MissionMangalTrailer",
    "promoted_content": null,
    "query": "%23MissionMangalTrailer",
    "tweet_volume": 19095
   },
   {
    "name": "#KarnatakaFloorTest",
    "url": "http://twitter.com/search?q=%23KarnatakaFloorTest",
    "promoted_content": null,
    "query": "%23KarnatakaFloorTest",
    "tweet_volume": null
   },
   {
    "name": "#RainRaider",
    "url": "http://twitter.com/search?q=%23RainRaider",
    "promoted_content": null,
    "query": "%23RainRaider",
    "tweet_volume": null
   },
   {
    "name": "#SilsilaOnVoot",
    "url": "http://twitter.com/search?q=%23SilsilaOnVoot",
    "promoted_content": null,
    "query": "%23SilsilaOnVoot",
    "tweet_volume": null
   },
   {
    "name": "#LifeChangingPlaces",
    "url": "http://twitter.com/search?q=%23LifeChangingPlaces",
    "promoted_content": null,
    "query": "%23LifeChangingPlaces",
    "tweet_volume": null
   }

## Example 4. Computing the intersection of two sets of trends

In [11]:
trends_set = {}
# Set comprehensions collect the unique trend names for each location.
trends_set['world'] = {trend['name']
                       for trend in world_trends[0]['trends']}

trends_set['us'] = {trend['name']
                    for trend in us_trends[0]['trends']}

trends_set['san diego'] = {trend['name']
                           for trend in local_trends[0]['trends']}

In [12]:
for loc in ['world','us','san diego']:
    print(('-'*10,loc))
    print(','.join(trends_set[loc]))

('----------', 'world')
#EfendiBirgünBeniUnutursun,彼方のアストラ,MehmetÖzere BetsatSponsor,ruletturnuvası ligobet40ta,Boban,#FelizMiercoles,BirkezDaha SusDediler,Héctor Herrera,#살다살다뭐닮았다고들어본경험,#ไอจีล่ม,#BRINGTHESOUL_THEMOVIE,#زواج_ابو_فريح,Happy 4th of July,O Instagram,#TürkiyeTürkiyedenBüyüktür,Zuckerberg,#الواتساب_معلق,#Stromboli,Ip Man,#BoycottTrump4thOfJuly,#3Jul,#تمثال_الحريه,シンデレラガール,#çöktü,#ENGvNZ,#SüresizSözleşmeliyiDuyReis,IG and FB,Jim Beam,#realişçileri23aydırmücadelede,#AVAxNCT127,#TanışmaBaşlatanCümleler,#DünyaNeresi,#乃木坂46ANN,#탐라_장래희망_자랑,cinselistismar iftirası,#fumou954,#PBB8SecondBigJumper,Tarımİşsiz Bakanİlgisiz,インスタ不具合,#هات_نكته_تضحك,#whatsappdown,Twitter DMs,#الترفيه_تستضيف_مغنيه_اباحيه,#SamsunAtakumSahildeyiz,Chandler Parsons,#PedroResponde,Donnie Yen,#في_اي_مدينه_مولود,#増田貴久誕生祭,#LollaAR
('----------', 'us')
#FortuneDhanushBdayIn25Days,Jaddu,#BRINGTHESOUL_THEMOVIE,Mihir,Parody,#GoGreen_SayNoToPolythene,Server,Sanjay,#RahulGandhi,#BKBirla,Twitter DM,#1992MeinBhi,#ENGvNZ,#5

In [13]:
print('='*10, 'intersection of world and us')
print(trends_set['world'].intersection(trends_set['us']))

print('='*10, 'intersection of us and san-diego')
print(trends_set['san diego'].intersection(trends_set['us']))

{'#BRINGTHESOUL_THEMOVIE', '#ENGvNZ'}
{'#FortuneDhanushBdayIn25Days', 'Jaddu', '#BRINGTHESOUL_THEMOVIE', 'Mihir', 'Parody', '#GoGreen_SayNoToPolythene', 'Server', 'Sanjay', '#RahulGandhi', '#BKBirla', '#1992MeinBhi', 'Twitter DM', '#ENGvNZ', '#50DaysForTFIEmperorChiruBday', '#KadaramKondanTrailer', 'Motilal Vora', 'Mark Wood', '#WhatsAppDown', 'Manjrekar', '#FarFromHomeWithMNX', '#Blogchatter', '#YamahaFZ25', '#SirikiSingle', 'Aus and India', '#BackTheBlackCaps', 'Ned Leeds', 'Mark Zuckerberg', '#AmbatiRayuduRetires', '#AUSvsNZ', '#HafizSaeed', '#DelhiTempleAttack', '#Sherin'}
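
The same comparison can be generalized to every pair of locations. The following is a sketch only: `pairwise_intersections` is a hypothetical helper, and the small hand-made `sample_trends_set` stands in for the notebook's `trends_set` so that the block runs on its own. It also shows set difference for finding trends that are specific to one location.

```python
from itertools import combinations

# Small hand-made sample; in the notebook this would be the trends_set dict.
sample_trends_set = {
    'world': {'#ENGvNZ', '#BRINGTHESOUL_THEMOVIE', 'Zuckerberg'},
    'us': {'#ENGvNZ', '#BRINGTHESOUL_THEMOVIE', 'Jaddu'},
    'san diego': {'#ENGvNZ', 'Jaddu', '#WhatsAppDown'},
}

def pairwise_intersections(trends_by_location):
    """Map each pair of location names to the trends they have in common."""
    return {
        (a, b): trends_by_location[a] & trends_by_location[b]
        for a, b in combinations(sorted(trends_by_location), 2)
    }

common = pairwise_intersections(sample_trends_set)
for pair, names in common.items():
    print(pair, sorted(names))

# Trends that appear locally but not in the US list (set difference).
local_only = sample_trends_set['san diego'] - sample_trends_set['us']
print(sorted(local_only))
```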


## Example 5. Collecting search results

Set the variable `q` to a trending topic,
or anything else for that matter. The example query below
was a trending topic when this content was being developed
and is used throughout the remainder of this chapter.

In [13]:
q = '#CWC19' 

number = 100

# See https://dev.twitter.com/docs/api/1.1/get/search/tweets

search_results = twitter_api.search.tweets(q=q, count=number)

statuses = search_results['statuses']

In [15]:
print(len(statuses))
print(statuses)

100
[{'created_at': 'Thu Jul 18 12:27:10 +0000 2019', 'id': 1151830771453313026, 'id_str': '1151830771453313026', 'text': 'RT @_cricingif: In the end, the world went home only with ifs and buts and the trophy stayed with the creators of the game, England #CWC19…', 'truncated': False, 'entities': {'hashtags': [{'text': 'CWC19', 'indices': [132, 138]}], 'symbols': [], 'user_mentions': [{'screen_name': '_cricingif', 'name': 'Cricingif', 'id': 710017952297394176, 'id_str': '710017952297394176', 'indices': [3, 14]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 4871616651, 'id_str': '4871616651', 'name': 'Taqi Zaidi', 'screen_name': 'TaqiTzaidi7214', 'location': 'اسلام آباد, پاکستان', 'de

Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [16]:
all_text = []           # texts seen so far
filtered_statuses = []  # first status found for each distinct text
for s in statuses:
    if s["text"] not in all_text:
        filtered_statuses.append(s)
        all_text.append(s["text"])
statuses = filtered_statuses

In [17]:
len(statuses)

84
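
Checking membership in a list rescans it for every status. For larger result sets, a set gives constant-time lookups on average. The sketch below is one possible variant: `dedupe_by_text` is a hypothetical helper name and `sample_statuses` is made-up data shaped like the search results.

```python
def dedupe_by_text(statuses):
    """Keep the first status for each distinct text, preserving order."""
    seen = set()
    unique = []
    for status in statuses:
        text = status['text']
        if text not in seen:
            seen.add(text)
            unique.append(status)
    return unique

# Made-up sample: statuses 1 and 3 carry the same text.
sample_statuses = [
    {'id': 1, 'text': 'RT @ICC: Century for Rohit Sharma!'},
    {'id': 2, 'text': 'What a match #CWC19'},
    {'id': 3, 'text': 'RT @ICC: Century for Rohit Sharma!'},
]

print([s['id'] for s in dedupe_by_text(sample_statuses)])
```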

In [18]:
[s['text'] for s in search_results['statuses']]

["RT @SamMorshead_: England will compete in a World Cup semi-final on July 11 for the second time in two years, and this time the oppo won't…",
 'RT @marooashish: @bira91 #GiveawayAlert #Bira91 #CWC19 @bira91\nwould love to party with \n@auk_sanejourno \n@Devanginee \n@ThakorVaishnavi \n\n#…',
 'Jaa tujhee maaf kiya Dil ko tornay walay 💔\n#ENGvsNZ #CWC19 #Pakistani https://t.co/h6r1Rtyzpi',
 'You sure are, unless you can defeat Bangladesh by at least 315 runs tomorrow. Good luck!\n#ENGvNZ #NZvENG #CWC19 https://t.co/nRccQDubiN',
 'RT @deeputalks: For Pakistan to qualify:\nBeat Bangladesh by 311 runs after scoring 350\nBeat Bangladesh by 316 runs after scoring 400\nBeat B…',
 '#JonnyBairstow’s second successive hundred laid the foundation for England’s 119-run thrashing of New Zealand that… https://t.co/ANRGjtn32W',
 'RT @rgcricket: New Zealand 175-8 after 41 overs. Assuming NZ lose this game, the equation for Pakistan to qualify for semi-final:\n-Pak bat…',
 'RT @ladywithflaws: Pakista

In [19]:
# Show one sample search result by indexing into the list...
print(json.dumps(statuses[0], indent=1))

{
 "created_at": "Wed Jul 03 17:42:33 +0000 2019",
 "id": 1146474320509755392,
 "id_str": "1146474320509755392",
 "text": "RT @SamMorshead_: England will compete in a World Cup semi-final on July 11 for the second time in two years, and this time the oppo won't\u2026",
 "truncated": false,
 "entities": {
  "hashtags": [],
  "symbols": [],
  "user_mentions": [
   {
    "screen_name": "SamMorshead_",
    "name": "Sam Morshead",
    "id": 368955949,
    "id_str": "368955949",
    "indices": [
     3,
     16
    ]
   }
  ],
  "urls": []
 },
 "metadata": {
  "iso_language_code": "en",
  "result_type": "recent"
 },
 "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
 "in_reply_to_status_id": null,
 "in_reply_to_status_id_str": null,
 "in_reply_to_user_id": null,
 "in_reply_to_user_id_str": null,
 "in_reply_to_screen_name": null,
 "user": {
  "id": 392155503,
  "id_str": "392155503",
  "name": "Cameron",
  "screen_name": "CamMuxworthy",
  "loc

In [46]:
# Take the first status as a sample and assign it to the variable t.
t = statuses[0]
# Alternatively, pick a specific tweet by id: the list comprehension below
# yields a one-element list whose single item is accessed by its index.
#t = [ status for status in statuses 
#          if status['id'] == 316948241264549888 ][0]

# Explore the variable t to get familiarized with the data structure...

print(t['retweet_count'])
print(t['retweeted'])


7
False


## Example 6. Extracting text, screen names, and hashtags from tweets

In [47]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

In [48]:
# Explore the first 5 items for each...

print(json.dumps(status_texts[0:5], indent=1))
print(json.dumps(screen_names[0:5], indent=1)) 
print(json.dumps(hashtags[0:5], indent=1))
print(json.dumps(words[0:5], indent=1))

[
 "RT @deeputalks: Sixth 50+ score (2 X 100s) for Shakib Al Hasan in #CWC19 - only Sachin Tendulkar has scored more in a single edition of the\u2026",
 "RT @CensorReports: Dear @ICC and @BCCI  you are hosting one of the best sporting event but we are really not enjoying because of #SanjayMan\u2026",
 "She is 87 years old but still supporting Indian Cricket Team by shouting. \nThe love for the this beautiful game is\u2026 https://t.co/ycmoGjqP0Q",
 "RT @ICC: 146 runs to win \ud83d\ude2e \n19 overs remaining \ud83d\ude27 \nSix wickets in hand \ud83d\ude00 \n\nWe have another tense finish on the cards at Edgbaston!\n\n#CWC19 |\u2026",
 "RT @cricBC: Dhoni's innings these days: \n\nUnable to take singles early in the innings.\n\nRefusing singles late in the innings.\n\n#CWC19"
]
[
 "deeputalks",
 "CensorReports",
 "ICC",
 "BCCI",
 "ICC"
]
[
 "CWC19",
 "CWC19",
 "CWC19",
 "INDvBAN",
 "MenInGreen"
]
[
 "RT",
 "@deeputalks:",
 "Sixth",
 "50+",
 "score"
]


## Example 7. Creating a basic frequency distribution from the words in tweets

In [49]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('RT', 50), ('the', 42), ('#CWC19', 32), ('in', 28), ('a', 22), ('for', 20), ('and', 20), ('to', 20), ('of', 19), ('is', 17)]

[('BCCI', 8), ('cricketworldcup', 8), ('ICC', 5), ('ImRo45', 4), ('cricbuzz', 3), ('Sah75official', 3), ('MRFWorldwide', 2), ('mipaltan', 2), ('gauravkapur', 2), ('rameshlaus', 2)]

[('CWC19', 37), ('INDvBAN', 17), ('BANvIND', 10), ('TeamIndia', 8), ('RiseOfTheTigers', 4), ('Dhoni', 4), ('Pandya', 2), ('Bangladesh', 2), ('RohitSharma', 2), ('cwc19', 2)]
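
The hashtag counts above list `CWC19` and `cwc19` separately because `Counter` treats differently cased strings as distinct keys, and the word counts are dominated by `RT` and common English words. The sketch below shows one way to address both, using made-up sample lists in place of the notebook's `hashtags` and `words`. The `STOP_WORDS` set is a small illustrative assumption, not an exhaustive list.

```python
from collections import Counter

# Made-up samples; in the notebook these would be the hashtags and words lists.
sample_hashtags = ['CWC19', 'cwc19', 'INDvBAN', 'CWC19', 'TeamIndia', 'indvban']
sample_words = ['RT', 'the', '#CWC19', 'in', 'Dhoni', 'the', 'Dhoni']

# Illustrative stop-word set, not an exhaustive list.
STOP_WORDS = {'rt', 'the', 'in', 'a', 'for', 'and', 'to', 'of', 'is'}

# Lowercase hashtags so that differently cased variants are counted together.
hashtag_counts = Counter(tag.lower() for tag in sample_hashtags)

# Drop stop words (compared case-insensitively) before counting.
word_counts = Counter(w for w in sample_words if w.lower() not in STOP_WORDS)

print(hashtag_counts.most_common(3))
print(word_counts.most_common(3))
```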



## Example 8. Create a prettyprint function to display tuples in a nice tabular format

In [50]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [51]:
for label, data in (('Word', words), 
                    ('Screen Name', screen_names), 
                    ('Hashtag', hashtags)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))


        Word         | Count 
****************************************
RT                   |     50
the                  |     42
#CWC19               |     32
in                   |     28
a                    |     22
for                  |     20
and                  |     20
to                   |     20
of                   |     19
is                   |     17

    Screen Name      | Count 
****************************************
BCCI                 |      8
cricketworldcup      |      8
ICC                  |      5
ImRo45               |      4
cricbuzz             |      3
Sah75official        |      3
MRFWorldwide         |      2
mipaltan             |      2
gauravkapur          |      2
rameshlaus           |      2

      Hashtag        | Count 
****************************************
CWC19                |     37
INDvBAN              |     17
BANvIND              |     10
TeamIndia            |      8
RiseOfTheTigers      |      4
Dhoni                |      4
Pand

## Example 9. Finding the most popular retweets

In [52]:
retweets = [
            # Store a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]
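
Sorting these tuples directly compares the retweet count first and then falls back to the screen name and text on ties. As a sketch of an alternative, the hypothetical helpers below wrap the comprehension in a function and sort on the count only, which keeps tied entries in their original order. The `sample_statuses` list is made-up data with just the fields the code reads.

```python
from operator import itemgetter

def extract_retweets(statuses):
    """Build (retweet_count, original_author, text) tuples for retweets only."""
    return [
        (s['retweet_count'],
         s['retweeted_status']['user']['screen_name'],
         s['text'].replace("\n", "\\"))
        for s in statuses
        if 'retweeted_status' in s
    ]

def top_retweets(retweets, n=10):
    """Sort by retweet count only; sorted() is stable, so ties keep their order."""
    return sorted(retweets, key=itemgetter(0), reverse=True)[:n]

# Made-up sample: two retweets and one original tweet.
sample_statuses = [
    {'text': 'RT @ICC: Century!\nWhat a knock', 'retweet_count': 1103,
     'retweeted_status': {'user': {'screen_name': 'ICC'}}},
    {'text': 'Just watching the match', 'retweet_count': 0},
    {'text': 'RT @BCCI: Rohit Sharma, you genius.', 'retweet_count': 3565,
     'retweeted_status': {'user': {'screen_name': 'BCCI'}}},
]

top = top_retweets(extract_retweets(sample_statuses), n=2)
print(top)
```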

We can build another `prettyprint` function to print entire tweets with their retweet count.

We also want to split the text of the tweet into up to three lines, if needed.

In [53]:
row_template = "{:^7} | {:^15} | {:50}"
def prettyprint_tweets(list_of_tuples):
    print()
    print(row_template.format("Count", "Screen Name", "Text"))
    print("*"*60)
    for count, screen_name, text in list_of_tuples:
        print(row_template.format(count, screen_name, text[:50]))
        if len(text) > 50:
            print(row_template.format("", "", text[50:100]))
            if len(text) > 100:
                print(row_template.format("", "", text[100:]))
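
Fixed 50-character slices can cut a word in the middle. A possible variant, sketched below, uses the standard library's `textwrap.wrap` to break at word boundaries and cap the output at three lines. `format_tweet_rows` and `ROW_TEMPLATE` are hypothetical names for this illustration. Note that `textwrap.wrap` collapses whitespace and marks truncated text with its default placeholder `[...]`.

```python
import textwrap

ROW_TEMPLATE = "{:^7} | {:^15} | {:50}"

def format_tweet_rows(count, screen_name, text, width=50, max_lines=3):
    """Return table rows for one tweet, wrapping the text at word boundaries."""
    # wrap() returns an empty list for empty text, so fall back to one blank line.
    lines = textwrap.wrap(text, width=width, max_lines=max_lines) or [""]
    rows = [ROW_TEMPLATE.format(count, screen_name, lines[0])]
    # Continuation rows leave the count and screen name columns empty.
    rows.extend(ROW_TEMPLATE.format("", "", line) for line in lines[1:])
    return rows

# Made-up sample text for illustration.
sample_text = ("RT @ICC: Century for Rohit Sharma! He becomes just the second "
               "player to score four hundreds in a single World Cup campaign.")

for row in format_tweet_rows(1103, "ICC", sample_text):
    print(row)
```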

In [54]:
# Slice off the first 10 from the sorted results and display each item in the tuple

prettyprint_tweets(sorted(retweets, reverse=True)[:10])


 Count  |   Screen Name   | Text                                              
************************************************************
 3565   |      BCCI       | RT @BCCI: 💯 Rohit Sharma, you genius. \\Back to ba
        |                 | ck centuries for @ImRo45 and fourth in #CWC19, 26t
        |                 | h overall in ODIs 👏👏👏👏 https://t.co/ADD…          
 1103   |       ICC       | RT @ICC: Century for Rohit Sharma!\\He becomes jus
        |                 | t the second player to score four hundreds in a si
        |                 | ngle World Cup campaign. \\What a tourn…          
  806   |   gauravkapur   | RT @gauravkapur: I don’t know about the Man of the
        |                 |  Match, but we have an undisputed winner of the Fa
        |                 | n of the Match 🙌💖 #CWC19 https://t.co/D…          
  679   |    mipaltan     | RT @mipaltan: ODI averages as an Opener:\\🇮🇳 Rohit
        |                 |  Sharma - 57.88\🇿🇦 Hashim Amla - 49.89\🇮🇳