# **DataCamp - Analyzing Social Media in Python**

## Introduction
This course teaches how to collect Twitter data, process Twitter text, analyze Twitter data, and map Twitter data geographically. Twitter can provide insight into popular trends. Data scientists use Twitter for marketing or product analysis. Additionally, Twitter is used as a measure of public opinion on important political or social topics.

### What you can't access on Twitter


*   Can't collect data on observers
*   The free level of access is restrictive
*   Can't access historical data

### Why Analyze Twitter Data?
There are many reasons you may want to analyze Twitter data. Which of these is NOT an area of data science for which you could use Twitter data analysis?

Uncovering the motives of Twitter users following a hashtag.

### Uses of Twitter analysis
You've been asked to identify the success (or failure) of a particular product. What Twitter analysis strategy could you use to best execute this?

* Collect mentions of the product and identify if people are talking about it positively.
* Examine the size of the retweet network mentioning the product.
* Analyze the geographical penetration of users mentioning the product.


### Twitter API
An API (Application Programming Interface) is a method of accessing data from a business. Most social media companies have APIs. Twitter has multiple APIs, including the Search API, Ads API, and Streaming API. This course focuses on the Streaming API.

The Streaming API allows us to collect a sample of tweets in real time based on keywords, user IDs, and locations. It has two endpoints, filter and sample. The filter endpoint allows you to request data on a few hundred keywords, a few thousand usernames, and 25 location ranges. The sample endpoint returns a random 1% sample of all of Twitter.


## Chapter 1 - Basics of Analyzing Twitter Data

### Setting up tweepy authentication

In the video, we saw how tweepy can be used to collect Twitter data with the Streaming API. tweepy requires a Twitter API key to authenticate with Twitter.

In this exercise, you will load several objects from tweepy and set up the authentication for the package.

The API keys access_token, access_token_secret, consumer_key, and consumer_secret have already been defined for you.

In [None]:
consumer_key = 'uHznHbfDel8CjvgDoDtnNwdS4'
consumer_secret = 'XBG5cdTedojhRyG1gBqDX7CzaogcTOgTypPl0EkSGrJRES9CN8'
access_token = '62816278-9JdVW5cQDcniscXYh8QYvBPoqmcUiETotg7z9LZ39'
access_token_secret = 'nxMQT37QwF80bNNlgy1lmp9ol8c71zUnfz3Datwma9b0U'

In [None]:
from tweepy import OAuthHandler
from tweepy import API

# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)
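
Hard-coding keys in a notebook makes them easy to leak. The following is a minimal sketch of one alternative, reading the four values from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions made for illustration; they are not part of the course or of tweepy.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in REQUIRED_VARS.items()}
```

The returned values could then be passed to `OAuthHandler` and `set_access_token` in the same way as the hard-coded variables.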

### Collecting data on keywords
Now that we've set up the authentication, we can begin to collect Twitter data. Recall that with the Streaming API, we will be collecting real-time Twitter data based on either a sample or filtered by a keyword.

In our example, we will collect data on any tweet mentioning #rstats or #python in the tweet text, username, or user description with the filter endpoint.

The SListener module has already been defined and imported for you.

In [None]:
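# StreamListener is part of tweepy 3.x; tweepy 4.0 merged it into Stream
from tweepy import Stream
from tweepy import StreamListener

class ProcessFoundItemsListener(StreamListener):
    # Holds the raw JSON string of the most recent tweet
    tweet_json = ''
    
    def on_data(self, tweet):
        print(tweet)
        self.tweet_json = tweet
        # Returning False stops the stream after the first tweet
        return False

    def on_error(self, status_code):
        print(status_code)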

# Set up words to track
keywords_to_track = ['python']

# Instantiate the listener object
listen = ProcessFoundItemsListener(api)

# Instantiate the Stream object
stream = Stream(auth, listen)

# Begin collecting data
stream.filter(track = keywords_to_track)

print('JSON>',listen.tweet_json)
tweet_json = listen.tweet_json

{"created_at":"Tue Apr 23 20:08:57 +0000 2019","id":1120781627393761283,"id_str":"1120781627393761283","text":"RT @PythonHub: Inline Python in Rust https:\/\/t.co\/EWGQmTtEm4","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":373198636,"id_str":"373198636","name":"Jesus Alex Lara","screen_name":"jesusAlexLara","location":"San Luis Potos\u00ed, M\u00e9xico","url":null,"description":"For victory  or  death...","translator_type":"none","protected":false,"verified":false,"followers_count":419,"friends_count":1690,"listed_count":1,"favourites_count":997,"statuses_count":678,"created_at":"Wed Sep 14 04:55:16 +0000 2011","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_col
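
The listener above stops after the first tweet because `on_data` returns `False`. The sketch below extends the same idea to stop after a fixed number of tweets. It is deliberately written without tweepy so that it runs offline: the class `LimitedCollector` and the helper `feed` are hypothetical stand-ins for a listener and a stream, not tweepy APIs.

```python
class LimitedCollector:
    """Collects raw tweet strings and signals 'stop' once a limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.tweets = []

    def on_data(self, raw_tweet):
        self.tweets.append(raw_tweet)
        # Returning False is the stop signal, mirroring the listener above.
        return len(self.tweets) < self.limit


def feed(collector, raw_tweets):
    """Stand-in for a stream: deliver items until the collector says stop."""
    for raw in raw_tweets:
        if collector.on_data(raw) is False:
            break
    return collector.tweets


collected = feed(LimitedCollector(limit=2), ['{"id": 1}', '{"id": 2}', '{"id": 3}'])
print(len(collected))
```

With a real stream, the same counting logic would live inside the listener's `on_data` method.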

### Loading and accessing tweets
In the video, we loaded a tweet we collected using tweepy into Python. Tweets arrive from the Streaming API in JSON format and need to be converted into a Python data structure.

In this exercise, we'll load a single tweet into Python and print out some fields.

The tweet JSON has been loaded for you and is stored in tweet_json.

In [None]:
# Load JSON
import json

# Convert from JSON to Python object
tweet = json.loads(tweet_json)

# Print tweet text
print(tweet['text'])

# Print tweet id
print(tweet['id'])

RT @PythonHub: Inline Python in Rust https://t.co/EWGQmTtEm4
1120781627393761283
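
Fields such as `created_at` arrive as plain strings. The sketch below turns one into a `datetime`. The shortened JSON string reuses values from the tweet above, and the format string is an assumption inferred from how `created_at` appears there.

```python
import json
from datetime import datetime

# Shortened stand-in for tweet_json, using values from the tweet above.
sample_json = '{"created_at": "Tue Apr 23 20:08:57 +0000 2019", "id": 1120781627393761283}'

sample = json.loads(sample_json)

# Format inferred from the string: weekday, month, day, time, UTC offset, year.
created = datetime.strptime(sample["created_at"], "%a %b %d %H:%M:%S %z %Y")

print(created.isoformat())
```

A parsed timestamp can be compared, sorted, or grouped by hour or day, which a raw string cannot.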


### Accessing user data
Much of what we want to know about a tweet is stored in child JSON objects. We will access several parts of the user's information with the user child JSON object.

The tweet from the previous exercise has been loaded for you.

In [None]:
# Print user handle
print(tweet['user']['screen_name'])

# Print user follower count
print(tweet['user']['followers_count'])

# Print user location
print(tweet['user']['location'])

# Print user description
print(tweet['user']['description'])

jesusAlexLara
419
San Luis Potosí, México
For victory  or  death...
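
Not every tweet carries every field, and a missing or `null` parent object makes chained indexing raise `KeyError` or `TypeError`. The following is a minimal sketch of a defensive lookup. `get_nested` is a hypothetical helper, and the sample dictionary is a trimmed version of the tweet above with an added `place` field set to `None`.

```python
def get_nested(obj, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any level is missing."""
    current = obj
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


sample_tweet = {
    "user": {"screen_name": "jesusAlexLara", "followers_count": 419},
    "place": None,
}

print(get_nested(sample_tweet, "user", "screen_name"))
print(get_nested(sample_tweet, "place", "full_name", default="unknown"))
```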


### Accessing retweet data
Now we're going to work with a tweet JSON that contains a retweet. A retweet has the same structure as a regular tweet, except that it has another tweet stored in retweeted_status.

The new tweet has been loaded as rt.

In [None]:
rt = {'contributors': None, 'coordinates': None, 'created_at': 'Thu Apr 19 12:45:59 +0000 2018', 'entities': {'hashtags': [],  'symbols': [],  'urls': [],  'user_mentions': [{'id': 823957466,    'id_str': '823957466',    'indices': [3, 16],    'name': 'Hanna Wallach',    'screen_name': 'hannawallach'}]}, 'favorite_count': 0, 'favorited': False, 'geo': None, 'id': 986949027123154944, 'id_str': '986949027123154944', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': False, 'lang': 'en', 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'place': None, 'retweet_count': 37, 'retweeted': False, 'retweeted_status': {'contributors': None,  'coordinates': None,  'created_at': 'Tue Mar 06 23:50:35 +0000 2018',  'entities': {'hashtags': [],   'symbols': [],   'urls': [{'display_url': 'twitter.com/i/web/status/9…',     'expanded_url': 'https://twitter.com/i/web/status/971171213216239616',     'indices': [117, 140],     'url': 'https://t.co/aB9Y5tTyHT'}],   'user_mentions': []},  'favorite_count': 52,  'favorited': False,  'geo': None,  'id': 971171213216239616,  'id_str': '971171213216239616',  'in_reply_to_screen_name': None,  'in_reply_to_status_id': None,  'in_reply_to_status_id_str': None,  'in_reply_to_user_id': None,  'in_reply_to_user_id_str': None,  'is_quote_status': False,  'lang': 'en',  'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},  'place': None,  'possibly_sensitive': False,  'retweet_count': 37,  'retweeted': False,  'text': "ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. M… https://t.co/aB9Y5tTyHT",  'truncated': True,  'user': {'contributors_enabled': False,   'created_at': 'Fri Sep 14 20:38:24 +0000 2012',   'default_profile': False,   'default_profile_image': False,   'description': 'MSR NYC. 
Machine learning, computational social science, fairness/accountability/transparency in ML. NIPS 2018 program chair, WiML co-founder, sloth enthusiast.',   'entities': {'description': {'urls': []},    'url': {'urls': [{'display_url': 'dirichlet.net',       'expanded_url': 'http://dirichlet.net/',       'indices': [0, 23],       'url': 'https://t.co/hrcIziHrkf'}]}},   'favourites_count': 3507,   'follow_request_sent': False,   'followers_count': 10614,   'following': True,   'friends_count': 865,   'geo_enabled': False,   'has_extended_profile': False,   'id': 823957466,   'id_str': '823957466',   'is_translation_enabled': False,   'is_translator': False,   'lang': 'en',   'listed_count': 499,   'location': 'Brooklyn, NY',   'name': 'Hanna Wallach',   'notifications': False,   'profile_background_color': 'CCCCCC',   'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/521040468528754688/_Ayh3ZCE.jpeg',   'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/521040468528754688/_Ayh3ZCE.jpeg',   'profile_background_tile': False,   'profile_banner_url': 'https://pbs.twimg.com/profile_banners/823957466/1347986011',   'profile_image_url': 'http://pbs.twimg.com/profile_images/2623320981/kinlr53ma1flkp9jerk4_normal.jpeg',   'profile_image_url_https': 'https://pbs.twimg.com/profile_images/2623320981/kinlr53ma1flkp9jerk4_normal.jpeg',   'profile_link_color': '999999',   'profile_sidebar_border_color': 'FFFFFF',   'profile_sidebar_fill_color': 'DDEEF6',   'profile_text_color': '333333',   'profile_use_background_image': False,   'protected': False,   'screen_name': 'hannawallach',   'statuses_count': 1505,   'time_zone': 'Eastern Time (US & Canada)',   'translator_type': 'none',   'url': 'https://t.co/hrcIziHrkf',   'utc_offset': -14400,   'verified': False}}, 'text': "RT @hannawallach: ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. 
More in…", 'truncated': False, 'user': {'contributors_enabled': False,  'created_at': 'Thu Jan 18 20:37:52 +0000 2007',  'default_profile': False,  'default_profile_image': False,  'description': 'Assistant professor @UofT. Protest, media, computation. Trans. Roller derby athlete @TOROLLERDERBY (Kate Silver #538). She/her.',  'entities': {'description': {'urls': []},   'url': {'urls': [{'display_url': 'alex-hanna.com',      'expanded_url': 'http://alex-hanna.com',      'indices': [0, 23],      'url': 'https://t.co/WGddk8Cc6v'}]}},  'favourites_count': 23387,  'follow_request_sent': False,  'followers_count': 4267,  'following': False,  'friends_count': 2801,  'geo_enabled': True,  'has_extended_profile': False,  'id': 661613,  'id_str': '661613',  'is_translation_enabled': False,  'is_translator': False,  'lang': 'en',  'listed_count': 246,  'location': 'Toronto, ON',  'name': 'Alex Hanna, Data Witch',  'notifications': False,  'profile_background_color': '000000',  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme16/bg.gif',  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme16/bg.gif',  'profile_background_tile': False,  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/661613/1514976085',  'profile_image_url': 'http://pbs.twimg.com/profile_images/980799823900180483/J9CDOX_X_normal.jpg',  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/980799823900180483/J9CDOX_X_normal.jpg',  'profile_link_color': '0671B8',  'profile_sidebar_border_color': '666666',  'profile_sidebar_fill_color': 'CCCCCC',  'profile_text_color': '333333',  'profile_use_background_image': False,  'protected': False,  'screen_name': 'alexhanna',  'statuses_count': 71840,  'time_zone': 'Eastern Time (US & Canada)',  'translator_type': 'regular',  'url': 'https://t.co/WGddk8Cc6v',  'utc_offset': -14400,  'verified': False}}

In [None]:
# Print the text of the tweet
print(rt['text'])

# Print the text of tweet which has been retweeted
print(rt['retweeted_status']['text'])

# Print the user handle of the tweet
print(rt['user']['screen_name'])

# Print the user handle of the tweet which has been retweeted
print(rt['retweeted_status']['user']['screen_name'])

RT @hannawallach: ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. More in…
ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. M… https://t.co/aB9Y5tTyHT
alexhanna
hannawallach
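
Because a retweet nests the original tweet under `retweeted_status`, the presence of that key is enough to tell the two cases apart. The sketch below uses two hypothetical helpers, `is_retweet` and `original_author`, on a heavily trimmed copy of `rt`.

```python
def is_retweet(tweet):
    """A tweet is treated as a retweet if it carries a 'retweeted_status' object."""
    return "retweeted_status" in tweet


def original_author(tweet):
    """Return the screen name of whoever wrote the underlying tweet."""
    source = tweet.get("retweeted_status", tweet)
    return source["user"]["screen_name"]


# Trimmed stand-in for rt, keeping only the fields used here.
rt_small = {
    "user": {"screen_name": "alexhanna"},
    "retweeted_status": {"user": {"screen_name": "hannawallach"}},
}

print(is_retweet(rt_small), original_author(rt_small))
```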


## Chapter 2 - Processing Twitter text

### Tweet Items and Tweet Flattening
There are multiple fields in the Twitter JSON that contain textual data. In a typical tweet, there's the tweet text, the user description, and the user location. In a tweet longer than 140 characters, there's the extended tweet child JSON. And in a quoted tweet, there's the original tweet text and the commentary with the quoted tweet.

For this exercise, you'll extract textual elements from a single quoted tweet in which the original tweet has more than 140 characters. Then, to analyze tweets at scale, we will want to flatten the tweet JSON into a single level. This will allow us to store the tweets in a DataFrame format.

quoted_tweet has been loaded for you.

In [None]:
quoted_tweet = {'contributors': None, 'coordinates': None, 'created_at': 'Wed Apr 25 17:20:04 +0000 2018', 'display_text_range': [0, 35], 'entities': {'hashtags': [],  'symbols': [],  'urls': [{'display_url': 'twitter.com/alexhanna/stat…',    'expanded_url': 'https://twitter.com/alexhanna/status/989191655759663105',    'indices': [36, 59],    'url': 'https://t.co/BzbLDz9j6g'}],  'user_mentions': []}, 'favorite_count': 0, 'favorited': False, 'filter_level': 'low', 'geo': None, 'id': 989192330832891904, 'id_str': '989192330832891904', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': True, 'lang': 'en', 'place': None, 'possibly_sensitive': False, 'quote_count': 0, 'quoted_status': {'contributors': None,  'coordinates': None,  'created_at': 'Wed Apr 25 17:17:23 +0000 2018',  'entities': {'hashtags': [],   'symbols': [],   'urls': [{'display_url': 'twitter.com/i/web/status/9…',     'expanded_url': 'https://twitter.com/i/web/status/989191655759663105',     'indices': [116, 139],     'url': 'https://t.co/MlFg4qFnEC'}],   'user_mentions': []},  'extended_tweet': {'display_text_range': [0, 191],   'entities': {'hashtags': [],    'symbols': [],    'urls': [],    'user_mentions': []},   'full_text': 'O 280 characters, 280 characters! 
Wherefore art thou 280 characters?\nDeny thy JSON and refuse thy key.\nOr, if thou wilt not, be but sworn my love,\nAnd I’ll no longer be a 140 character tweet.'},  'favorite_count': 1,  'favorited': False,  'filter_level': 'low',  'geo': None,  'id': 989191655759663105,  'id_str': '989191655759663105',  'in_reply_to_screen_name': None,  'in_reply_to_status_id': None,  'in_reply_to_status_id_str': None,  'in_reply_to_user_id': None,  'in_reply_to_user_id_str': None,  'is_quote_status': False,  'lang': 'en',  'place': None,  'quote_count': 0,  'reply_count': 1,  'retweet_count': 0,  'retweeted': False,  'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',  'text': 'O 280 characters, 280 characters! Wherefore art thou 280 characters?\nDeny thy JSON and refuse thy key.\nOr, if thou… https://t.co/MlFg4qFnEC',  'truncated': True,  'user': {'contributors_enabled': False,   'created_at': 'Thu Jan 18 20:37:52 +0000 2007',   'default_profile': False,   'default_profile_image': False,   'description': 'Assistant professor @UofT. Protest, media, computation. Trans. Roller derby athlete @TOROLLERDERBY (Kate Silver #538). 
She/her.',   'favourites_count': 23526,   'follow_request_sent': None,   'followers_count': 4275,   'following': None,   'friends_count': 2806,   'geo_enabled': True,   'id': 661613,   'id_str': '661613',   'is_translator': False,   'lang': 'en',   'listed_count': 246,   'location': 'Toronto, ON',   'name': 'Alex Hanna, Data Witch',   'notifications': None,   'profile_background_color': '000000',   'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme16/bg.gif',   'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme16/bg.gif',   'profile_background_tile': False,   'profile_banner_url': 'https://pbs.twimg.com/profile_banners/661613/1524231456',   'profile_image_url': 'http://pbs.twimg.com/profile_images/980799823900180483/J9CDOX_X_normal.jpg',   'profile_image_url_https': 'https://pbs.twimg.com/profile_images/980799823900180483/J9CDOX_X_normal.jpg',   'profile_link_color': '0671B8',   'profile_sidebar_border_color': '666666',   'profile_sidebar_fill_color': 'CCCCCC',   'profile_text_color': '333333',   'profile_use_background_image': False,   'protected': False,   'screen_name': 'alexhanna',   'statuses_count': 71925,   'time_zone': 'Eastern Time (US & Canada)',   'translator_type': 'regular',   'url': 'http://alex-hanna.com',   'utc_offset': -14400,   'verified': False}}, 'quoted_status_id': 989191655759663105, 'quoted_status_id_str': '989191655759663105', 'reply_count': 0, 'retweet_count': 0, 'retweeted': False, 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'text': 'maybe if I quote tweet this lil guy https://t.co/BzbLDz9j6g', 'timestamp_ms': '1524676804632', 'truncated': False, 'user': {'contributors_enabled': False,  'created_at': 'Thu Jan 18 20:37:52 +0000 2007',  'default_profile': False,  'default_profile_image': False,  'description': 'Assistant professor @UofT. Protest, media, computation. Trans. Roller derby athlete @TOROLLERDERBY (Kate Silver #538). 
She/her.',  'favourites_count': 23526,  'follow_request_sent': None,  'followers_count': 4275,  'following': None,  'friends_count': 2806,  'geo_enabled': True,  'id': 661613,  'id_str': '661613',  'is_translator': False,  'lang': 'en',  'listed_count': 246,  'location': 'Toronto, ON',  'name': 'Alex Hanna, Data Witch',  'notifications': None,  'profile_background_color': '000000',  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme16/bg.gif',  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme16/bg.gif',  'profile_background_tile': False,  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/661613/1524231456',  'profile_image_url': 'http://pbs.twimg.com/profile_images/980799823900180483/J9CDOX_X_normal.jpg',  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/980799823900180483/J9CDOX_X_normal.jpg',  'profile_link_color': '0671B8',  'profile_sidebar_border_color': '666666',  'profile_sidebar_fill_color': 'CCCCCC',  'profile_text_color': '333333',  'profile_use_background_image': False,  'protected': False,  'screen_name': 'alexhanna',  'statuses_count': 71926,  'time_zone': 'Eastern Time (US & Canada)',  'translator_type': 'regular',  'url': 'http://alex-hanna.com',  'utc_offset': -14400,  'verified': False}}

In [None]:
# Print the tweet text
print(quoted_tweet['text'])

# Print the quoted tweet text
print(quoted_tweet['quoted_status']['text'])

# Print the quoted tweet's extended (140+) text
print(quoted_tweet['quoted_status']['extended_tweet']['full_text'])

# Print the quoted user location
print(quoted_tweet['quoted_status']['user']['location'])

maybe if I quote tweet this lil guy https://t.co/BzbLDz9j6g
O 280 characters, 280 characters! Wherefore art thou 280 characters?
Deny thy JSON and refuse thy key.
Or, if thou… https://t.co/MlFg4qFnEC
O 280 characters, 280 characters! Wherefore art thou 280 characters?
Deny thy JSON and refuse thy key.
Or, if thou wilt not, be but sworn my love,
And I’ll no longer be a 140 character tweet.
Toronto, ON


In [None]:
# Store the user screen_name in 'user-screen_name'
quoted_tweet['user-screen_name'] = quoted_tweet['user']['screen_name']
print(quoted_tweet['user-screen_name'])

# Store the quoted_status text in 'quoted_status-text'
quoted_tweet['quoted_status-text'] = quoted_tweet['quoted_status']['text']
print(quoted_tweet['quoted_status-text'])

# Store the quoted tweet's extended (140+) text in 
# 'quoted_status-extended_tweet-full_text'
quoted_tweet['quoted_status-extended_tweet-full_text'] = quoted_tweet['quoted_status']['extended_tweet']['full_text']
print(quoted_tweet['quoted_status-extended_tweet-full_text'])

alexhanna
O 280 characters, 280 characters! Wherefore art thou 280 characters?
Deny thy JSON and refuse thy key.
Or, if thou… https://t.co/MlFg4qFnEC
O 280 characters, 280 characters! Wherefore art thou 280 characters?
Deny thy JSON and refuse thy key.
Or, if thou wilt not, be but sworn my love,
And I’ll no longer be a 140 character tweet.
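
The manual assignments above follow a pattern: nested keys are joined with `-`. The sketch below applies that pattern recursively to every nested dictionary. It is one possible generalization rather than the course's approach, the function name `flatten` is hypothetical, and the sample is a shortened stand-in for `quoted_tweet`. Lists are left untouched.

```python
def flatten(obj, parent_key="", sep="-"):
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat = {}
    for key, value in obj.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into child objects such as 'user' or 'quoted_status'.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Shortened stand-in for quoted_tweet; text values are abbreviated.
sample = {
    "text": "maybe if I quote tweet this lil guy",
    "user": {"screen_name": "alexhanna"},
    "quoted_status": {
        "extended_tweet": {"full_text": "O 280 characters, 280 characters!"},
    },
}

flat = flatten(sample)
print(sorted(flat))
```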


### A tweet flattening function
We are typically interested in hundreds or thousands of tweets. For this purpose, it makes sense to define a function to flatten a JSON file full of tweets. Let's call this function flatten_tweets(). We will use this function multiple times in this course and change it slightly as we deal with different types of data.

json has been loaded for you.

In [None]:
def flatten_tweets(tweets_json):
    """ Flattens out tweet dictionaries so relevant JSON
        is in a top-level dictionary."""
    tweets_list = []
    
    # Iterate through each tweet
    for tweet in tweets_json:
        tweet_obj = json.loads(tweet)
    
        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']
    
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = tweet_obj['extended_tweet']['full_text']
    
        if 'retweeted_status' in tweet_obj:
            # Store the retweet user screen name in 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = tweet_obj['retweeted_status']['user']['screen_name']

            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = tweet_obj['retweeted_status']['text']
            
        tweets_list.append(tweet_obj)
    return tweets_list
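
Once tweets are flattened, a common next step is deciding which text field to analyze. The sketch below assumes the key names produced by `flatten_tweets()` above. `best_text` is a hypothetical helper, and the two sample dictionaries are invented for illustration.

```python
def best_text(flat_tweet):
    """Prefer the untruncated text when the flattened tweet has it."""
    if "extended_tweet-full_text" in flat_tweet:
        return flat_tweet["extended_tweet-full_text"]
    return flat_tweet["text"]


# Invented flattened tweets: one short, one with extended text.
short_tweet = {"text": "#python is fun"}
long_tweet = {
    "text": "A long tweet that was cut off…",
    "extended_tweet-full_text": "A long tweet that was cut off in the text field",
}

print(best_text(short_tweet))
print(best_text(long_tweet))
```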

### Loading tweets into a DataFrame
Now it's time to import data into a pandas DataFrame so we can analyze tweets at scale.

We will work with a dataset of tweets which contain the hashtag '#rstats' or '#python'. This dataset is stored as a list of tweet JSON objects in data_science_json.
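
Before building a DataFrame, it can help to see the tabular shape in plain Python. The sketch below parses each JSON string and keeps only a chosen set of columns, filling gaps with `None`. The helper `to_rows` and the two sample strings are hypothetical and much shorter than real tweets.

```python
import json


def to_rows(tweets_json, columns):
    """Parse JSON strings into row dictionaries restricted to `columns`."""
    rows = []
    for raw in tweets_json:
        tweet = json.loads(raw)
        # Missing fields become None so every row has the same keys.
        rows.append({col: tweet.get(col) for col in columns})
    return rows


# Invented, heavily shortened tweets for illustration.
sample_json = [
    '{"id": 1, "text": "#python is fun", "lang": "en"}',
    '{"id": 2, "text": "#rstats too"}',
]

rows = to_rows(sample_json, ["id", "text", "lang"])
print(rows[1])
```

A list of dictionaries with identical keys, like `rows`, is a shape that `pandas.DataFrame` accepts directly.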

In [None]:
data_science_json = ['{"created_at":"Fri Mar 30 13:04:22 +0000 2018","id":979705897457942528,"id_str":"979705897457942528","text":"RT @Dennboss: Hahahah Efteling maakt Maxi-Cosi\'s voor in de Python, duidelijk een perfect uitgewerkte 1 april grap en toch zijn er van die\\u2026","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\/download\\/android\\" rel=\\"nofollow\\"\\u003eTwitter for Android\\u003c\\/a\\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":970664028,"id_str":"970664028","name":"Irma","screen_name":"mlvttweet","location":null,"url":null,"description":"\\u2764Family comes first\\u2764  leef en laten leven. Nijmegen mijn stad.","translator_type":"none","protected":false,"verified":false,"followers_count":226,"friends_count":374,"listed_count":10,"favourites_count":24087,"statuses_count":15093,"created_at":"Sun Nov 25 19:53:26 +0000 2012","utc_offset":-25200,"time_zone":"Pacific Time (US & 
Canada)","geo_enabled":false,"lang":"nl","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\\/\\/abs.twimg.com\\/images\\/themes\\/theme1\\/bg.png","profile_background_image_url_https":"https:\\/\\/abs.twimg.com\\/images\\/themes\\/theme1\\/bg.png","profile_background_tile":false,"profile_link_color":"981CEB","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/938342330905620480\\/NZsbXjCe_normal.jpg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/938342330905620480\\/NZsbXjCe_normal.jpg","profile_banner_url":"https:\\/\\/pbs.twimg.com\\/profile_banners\\/970664028\\/1514711104","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Fri Mar 30 09:58:34 +0000 2018","id":979659137964441600,"id_str":"979659137964441600","text":"Hahahah Efteling maakt Maxi-Cosi\'s voor in de Python, duidelijk een perfect uitgewerkte 1 april grap en toch zijn e\\u2026 https:\\/\\/t.co\\/IhQ6BNrOKp","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\/download\\/android\\" rel=\\"nofollow\\"\\u003eTwitter for Android\\u003c\\/a\\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":229139485,"id_str":"229139485","name":"Dennis Gerritsen","screen_name":"Dennboss","location":"Netherlands","url":"http:\\/\\/www.instagram.com\\/darumnl","description":"Look who\'s back! 
\\/\\/ Gifjes zeggen meer dan 1000 woorden","translator_type":"none","protected":false,"verified":false,"followers_count":2012,"friends_count":269,"listed_count":14,"favourites_count":8684,"statuses_count":9218,"created_at":"Tue Dec 21 16:44:50 +0000 2010","utc_offset":7200,"time_zone":"Amsterdam","geo_enabled":true,"lang":"nl","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http:\\/\\/pbs.twimg.com\\/profile_background_images\\/234269769\\/Dennis_houdt_van_Lieke_twitter_versie.jpg","profile_background_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_background_images\\/234269769\\/Dennis_houdt_van_Lieke_twitter_versie.jpg","profile_background_tile":false,"profile_link_color":"3E32E3","profile_sidebar_border_color":"FF0000","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"FF0808","profile_use_background_image":true,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/977441919642013696\\/iAtXB8QA_normal.jpg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/977441919642013696\\/iAtXB8QA_normal.jpg","profile_banner_url":"https:\\/\\/pbs.twimg.com\\/profile_banners\\/229139485\\/1497084982","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"Hahahah Efteling maakt Maxi-Cosi\'s voor in de Python, duidelijk een perfect uitgewerkte 1 april grap en toch zijn er van die typetjes die vol woede en verontwaardiging reageren \\"... want het is wel waar! 
Want het heb op feesboek gestaan!\\"\\n\\n#efteling #1april #python","display_text_range":[0,265],"entities":{"hashtags":[{"text":"efteling","indices":[240,249]},{"text":"1april","indices":[250,257]},{"text":"python","indices":[258,265]}],"urls":[],"user_mentions":[],"symbols":[]}},"quote_count":0,"reply_count":3,"retweet_count":3,"favorite_count":55,"entities":{"hashtags":[],"urls":[{"url":"https:\\/\\/t.co\\/IhQ6BNrOKp","expanded_url":"https:\\/\\/twitter.com\\/i\\/web\\/status\\/979659137964441600","display_url":"twitter.com\\/i\\/web\\/status\\/9\\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"nl"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"Dennboss","name":"Dennis Gerritsen","id":229139485,"id_str":"229139485","indices":[3,12]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"nl","timestamp_ms":"1522415062666"}',
 '{"created_at":"Fri Mar 16 11:59:09 +0000 2018","id":974616055006941184,"id_str":"974616055006941184","text":"RT @PythonWeekly: Python Weekly - Issue 338 https:\\/\\/t.co\\/7gJSoLJj3V  #python #django #flask #slack #blockchain #bitcoin #twilio #opencv #ma\\u2026","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\/download\\/iphone\\" rel=\\"nofollow\\"\\u003eTwitter for iPhone\\u003c\\/a\\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":937888350974787584,"id_str":"937888350974787584","name":"TestDriven.io","screen_name":"testdrivenio","location":"Denver, CO","url":"https:\\/\\/testdriven.io","description":"Teaching practical programming through real-world applications","translator_type":"none","protected":false,"verified":false,"followers_count":283,"friends_count":38,"listed_count":11,"favourites_count":838,"statuses_count":1479,"created_at":"Tue Dec 05 03:36:22 +0000 
2017","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\\/\\/abs.twimg.com\\/images\\/themes\\/theme1\\/bg.png","profile_background_image_url_https":"https:\\/\\/abs.twimg.com\\/images\\/themes\\/theme1\\/bg.png","profile_background_tile":false,"profile_link_color":"1B95E0","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/950867940395556864\\/eUKHi10g_normal.jpg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/950867940395556864\\/eUKHi10g_normal.jpg","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Fri Mar 16 02:48:22 +0000 2018","id":974477447075856394,"id_str":"974477447075856394","text":"Python Weekly - Issue 338 https:\\/\\/t.co\\/7gJSoLJj3V  #python #django #flask #slack #blockchain #bitcoin #twilio\\u2026 https:\\/\\/t.co\\/AJ2eZTuSfT","display_text_range":[0,140],"source":"\\u003ca href=\\"http:\\/\\/bufferapp.com\\" rel=\\"nofollow\\"\\u003eBuffer\\u003c\\/a\\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":373620985,"id_str":"373620985","name":"Python Weekly","screen_name":"PythonWeekly","location":null,"url":"http:\\/\\/www.pythonweekly.com\\/","description":"Python Weekly is a free weekly newsletter, which features curated news, articles, new releases, tools and libraries, events, jobs etc related to 
Python","translator_type":"none","protected":false,"verified":false,"followers_count":22084,"friends_count":0,"listed_count":1017,"favourites_count":0,"statuses_count":7493,"created_at":"Wed Sep 14 22:49:27 +0000 2011","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"1A1B1F","profile_background_image_url":"http:\\/\\/abs.twimg.com\\/images\\/themes\\/theme9\\/bg.gif","profile_background_image_url_https":"https:\\/\\/abs.twimg.com\\/images\\/themes\\/theme9\\/bg.gif","profile_background_tile":false,"profile_link_color":"2FC2EF","profile_sidebar_border_color":"181A1E","profile_sidebar_fill_color":"252429","profile_text_color":"666666","profile_use_background_image":true,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/1554238582\\/python_normal.jpg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/1554238582\\/python_normal.jpg","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"Python Weekly - Issue 338 https:\\/\\/t.co\\/7gJSoLJj3V  #python #django #flask #slack #blockchain #bitcoin #twilio #opencv #machinelearning #artificialintelligence #deeplearning #iphone #celery #jupyter #job 
https:\\/\\/t.co\\/0WnIUuSqF4","display_text_range":[0,202],"entities":{"hashtags":[{"text":"python","indices":[51,58]},{"text":"django","indices":[59,66]},{"text":"flask","indices":[67,73]},{"text":"slack","indices":[74,80]},{"text":"blockchain","indices":[81,92]},{"text":"bitcoin","indices":[93,101]},{"text":"twilio","indices":[102,109]},{"text":"opencv","indices":[110,117]},{"text":"machinelearning","indices":[118,134]},{"text":"artificialintelligence","indices":[135,158]},{"text":"deeplearning","indices":[159,172]},{"text":"iphone","indices":[173,180]},{"text":"celery","indices":[181,188]},{"text":"jupyter","indices":[189,197]},{"text":"job","indices":[198,202]}],"urls":[{"url":"https:\\/\\/t.co\\/7gJSoLJj3V","expanded_url":"https:\\/\\/buff.ly\\/2IsbcB9","display_url":"buff.ly\\/2IsbcB9","indices":[26,49]}],"user_mentions":[],"symbols":[],"media":[{"id":974477444966166533,"id_str":"974477444966166533","indices":[203,226],"media_url":"http:\\/\\/pbs.twimg.com\\/media\\/DYYKErzWsAU045C.jpg","media_url_https":"https:\\/\\/pbs.twimg.com\\/media\\/DYYKErzWsAU045C.jpg","url":"https:\\/\\/t.co\\/0WnIUuSqF4","display_url":"pic.twitter.com\\/0WnIUuSqF4","expanded_url":"https:\\/\\/twitter.com\\/PythonWeekly\\/status\\/974477447075856394\\/photo\\/1","type":"photo","sizes":{"large":{"w":324,"h":87,"resize":"fit"},"medium":{"w":324,"h":87,"resize":"fit"},"small":{"w":324,"h":87,"resize":"fit"},"thumb":{"w":87,"h":87,"resize":"crop"}}}]},"extended_entities":{"media":[{"id":974477444966166533,"id_str":"974477444966166533","indices":[203,226],"media_url":"http:\\/\\/pbs.twimg.com\\/media\\/DYYKErzWsAU045C.jpg","media_url_https":"https:\\/\\/pbs.twimg.com\\/media\\/DYYKErzWsAU045C.jpg","url":"https:\\/\\/t.co\\/0WnIUuSqF4","display_url":"pic.twitter.com\\/0WnIUuSqF4","expanded_url":"https:\\/\\/twitter.com\\/PythonWeekly\\/status\\/974477447075856394\\/photo\\/1","type":"photo","sizes":{"large":{"w":324,"h":87,"resize":"fit"},"medium":{"w":324,"h":87,"resize":"fit"},"small":{
"w":324,"h":87,"resize":"fit"},"thumb":{"w":87,"h":87,"resize":"crop"}}}]}},"quote_count":0,"reply_count":0,"retweet_count":8,"favorite_count":15,"entities":{"hashtags":[{"text":"python","indices":[51,58]},{"text":"django","indices":[59,66]},{"text":"flask","indices":[67,73]},{"text":"slack","indices":[74,80]},{"text":"blockchain","indices":[81,92]},{"text":"bitcoin","indices":[93,101]},{"text":"twilio","indices":[102,109]}],"urls":[{"url":"https:\\/\\/t.co\\/7gJSoLJj3V","expanded_url":"https:\\/\\/buff.ly\\/2IsbcB9","display_url":"buff.ly\\/2IsbcB9","indices":[26,49]},{"url":"https:\\/\\/t.co\\/AJ2eZTuSfT","expanded_url":"https:\\/\\/twitter.com\\/i\\/web\\/status\\/974477447075856394","display_url":"twitter.com\\/i\\/web\\/status\\/9\\u2026","indices":[111,134]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"python","indices":[69,76]},{"text":"django","indices":[77,84]},{"text":"flask","indices":[85,91]},{"text":"slack","indices":[92,98]},{"text":"blockchain","indices":[99,110]},{"text":"bitcoin","indices":[111,119]},{"text":"twilio","indices":[120,127]},{"text":"opencv","indices":[128,135]}],"urls":[{"url":"https:\\/\\/t.co\\/7gJSoLJj3V","expanded_url":"https:\\/\\/buff.ly\\/2IsbcB9","display_url":"buff.ly\\/2IsbcB9","indices":[44,67]}],"user_mentions":[{"screen_name":"PythonWeekly","name":"Python Weekly","id":373620985,"id_str":"373620985","indices":[3,16]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1521201549661"}']

In [None]:
# Import pandas
import pandas as pd

# Flatten the tweets and store in `tweets`
tweets = flatten_tweets(data_science_json)

# Create a DataFrame from `tweets`
ds_tweets = pd.DataFrame(tweets)

# Print out the first 5 tweets from this dataset
print(ds_tweets['text'].values[0:5])

["RT @Dennboss: Hahahah Efteling maakt Maxi-Cosi's voor in de Python, duidelijk een perfect uitgewerkte 1 april grap en toch zijn er van die…"
 'RT @PythonWeekly: Python Weekly - Issue 338 https://t.co/7gJSoLJj3V  #python #django #flask #slack #blockchain #bitcoin #twilio #opencv #ma…']


### Finding keywords
Counting known keywords is one of the first ways you can analyze text data in a Twitter dataset. In this dataset, you're going to count the number of times specific hashtags occur in a collection of tweets about data science, using the string methods of the pandas Series object.

pandas and numpy have been imported as pd and np, respectively. A more fully-featured flatten_tweets and data_science_json have also been loaded for you.

In [None]:
import numpy as np

# Flatten the tweets and store them
flat_tweets = flatten_tweets(data_science_json)

# Convert to DataFrame
ds_tweets = pd.DataFrame(flat_tweets)

# Find mentions of #python in 'text'
python = ds_tweets['text'].str.contains('#python', case=False)

# Print proportion of tweets mentioning #python
print("Proportion of #python tweets:", np.sum(python) / ds_tweets.shape[0])

Proportion of #python tweets: 0.5
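
To see what `.str.contains()` is doing, here is a minimal sketch on a handful of made-up tweet texts (the strings are hypothetical, not taken from the course dataset). It assumes we want a literal, case-insensitive match, so it passes `regex=False` in addition to `case=False`.

```python
import pandas as pd

# Hypothetical tweet texts, made up for illustration
texts = pd.Series([
    "Learning #Python for data science",
    "RT: #rstats tips and tricks",
    "No hashtags here",
    "#PYTHON and #rstats together",
])

# case=False makes the match case-insensitive; regex=False treats the
# pattern as a literal string rather than a regular expression
has_python = texts.str.contains('#python', case=False, regex=False)

# The mean of a boolean Series is the proportion of True values
proportion = has_python.mean()
print(has_python.tolist())
print("Proportion of #python tweets:", proportion)
```

Taking the mean of the boolean Series gives the same proportion as dividing `np.sum(python)` by the number of rows.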


### Looking for text in all the wrong places
Recall that relevant text may not only be in the main text field of the tweet. It may also be in the extended_tweet, the retweeted_status, or the quoted_status. We need to check all of these fields to make sure we've accounted for all of the relevant text. Since we'll do this often, we're going to create a function which does it.

In [None]:
def check_word_in_tweet(word, data):
    """Checks if a word is in a Twitter dataset's text. 
    Checks text and extended tweet (140+ character tweets) for tweets,
    retweets and quoted tweets.
    Returns a logical pandas Series.
    """
    # na=False treats missing text (e.g. tweets that are not retweets) as "no match"
    contains_column = data['text'].str.contains(word, case = False, na = False)
    contains_column |= data['extended_tweet-full_text'].str.contains(word, case = False, na = False)
    contains_column |= data['quoted_status-text'].str.contains(word, case = False, na = False)
    contains_column |= data['quoted_status-extended_tweet-full_text'].str.contains(word, case = False, na = False)
    contains_column |= data['retweeted_status-text'].str.contains(word, case = False, na = False)
    contains_column |= data['retweeted_status-extended_tweet-full_text'].str.contains(word, case = False, na = False)
    return contains_column
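
The sketch below shows why checking several fields matters. It uses a small made-up DataFrame, and the helper `contains_any` is a hypothetical, simplified stand-in for the function above. Most tweets have no extended or retweeted text, so those columns are mostly missing values; passing `na=False` is one way to count missing text as "no match".

```python
import numpy as np
import pandas as pd

# Hypothetical flattened tweets; column names follow the flatten_tweets
# convention used above, values are made up for illustration
toy = pd.DataFrame({
    'text': ['Short tweet about #rstats', 'RT @a: truncated text...', 'nothing'],
    'extended_tweet-full_text': [np.nan, np.nan, 'Long tweet that ends with #python'],
    'retweeted_status-text': [np.nan, 'Original tweet with #python', np.nan],
})

def contains_any(word, data, columns):
    """Return a boolean Series: True where any listed column contains word."""
    result = pd.Series(False, index=data.index)
    for col in columns:
        # na=False turns missing text into False instead of NaN
        result |= data[col].str.contains(word, case=False, na=False)
    return result

mask = contains_any('#python', toy, toy.columns)
print(mask.tolist())
```

Searching only the `text` column would miss both `#python` rows here, since the hashtag appears only in the extended and retweeted text.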

### Comparing #python to #rstats
Now that we have a function to check whether or not the word is in the tweet in multiple places, we can deploy this across multiple words and compare them. Let's return to our example with the data science hashtag dataset. We want to see how often #rstats occurs compared to #python.

In [None]:
# Find mentions of #python in all text fields
python = check_word_in_tweet('#python', ds_tweets)

# Find mentions of #rstats in all text fields
rstats = check_word_in_tweet('#rstats', ds_tweets)

# Print proportion of tweets mentioning #python
print("Proportion of #python tweets:", np.sum(python) / ds_tweets.shape[0])

# Print proportion of tweets mentioning #rstats
print("Proportion of #rstats tweets:", np.sum(rstats) / ds_tweets.shape[0])

### Creating time series data frame
Time series data is used when we want to analyze or explore variation over time. This is useful when exploring Twitter text data if we want to track the prevalence of a word or set of words.

In [None]:
# Print created_at to see the original format of datetime in Twitter data
print(ds_tweets['created_at'].head())

# Convert the created_at column to a pandas datetime column
ds_tweets['created_at'] = pd.to_datetime(ds_tweets['created_at'])

# Print created_at to see new format
print(ds_tweets['created_at'].head())

# Set the index of ds_tweets to created_at
ds_tweets = ds_tweets.set_index('created_at')
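
As a small illustration of the conversion, the sketch below parses two `created_at` strings in Twitter's format (both appear in the sample tweet JSON shown earlier). Passing an explicit `format` is optional and is an assumption here; `pd.to_datetime` can also infer the format, as in the cell above.

```python
import pandas as pd

# Two timestamps in Twitter's created_at format, taken from the
# sample tweet shown earlier
raw = pd.Series(['Fri Mar 16 11:59:09 +0000 2018',
                 'Fri Mar 16 02:48:22 +0000 2018'])

# An explicit format avoids per-row format inference
parsed = pd.to_datetime(raw, format='%a %b %d %H:%M:%S %z %Y')
print(parsed)
print(parsed.dt.hour.tolist())
```

Once parsed, the values expose datetime attributes such as `.dt.hour` and `.dt.day`, and the column can be set as the index for resampling.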

### Generating mean frequency
We need to produce a metric which can be graphed over time. Our function check_word_in_tweet() returns a boolean Series. Remember that the boolean value True == 1, so we can produce a column for each keyword we're interested in and use it to track that keyword's prevalence over time.

In [None]:
# Create a python column
ds_tweets['python'] = check_word_in_tweet('#python', ds_tweets)

# Create an rstats column
ds_tweets['rstats'] = check_word_in_tweet('#rstats', ds_tweets)

### Plotting mean frequency
Lastly, we'll create a per-day average of the mentions of both hashtags and plot them across time. We'll first create proportions from the two boolean Series by the day, then we'll plot them.

In [None]:
# Average of python column by day
mean_python = ds_tweets['python'].resample('1D').mean()

# Average of rstats column by day
mean_rstats = ds_tweets['rstats'].resample('1D').mean()

# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Plot mean python/rstats by day
plt.plot(mean_python.index.day, mean_python, color = 'green')
plt.plot(mean_rstats.index.day, mean_rstats, color = 'blue')

# Add labels and show
plt.xlabel('Day'); plt.ylabel('Frequency')
plt.title('Language mentions over time')
plt.legend(('#python', '#rstats'))
plt.show()
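
To make the per-day average concrete, here is a minimal sketch with made-up data: six hypothetical tweets over two days, each flagged True or False for a keyword. The timestamps and flags are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical boolean flags indexed by tweet time (made-up data)
idx = pd.to_datetime(['2018-03-16 01:00', '2018-03-16 13:00',
                      '2018-03-17 09:00', '2018-03-17 10:00',
                      '2018-03-17 23:00', '2018-03-17 23:30'])
mentions = pd.Series([True, False, True, True, True, False], index=idx)

# True counts as 1 and False as 0, so the daily mean is the
# share of that day's tweets containing the keyword
daily_share = mentions.resample('1D').mean()
print(daily_share)
```

One of the two tweets on the first day and three of the four on the second day are flagged, so the daily shares are 0.5 and 0.75.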

### Loading VADER
Sentiment analysis provides us a small glimpse of the meaning of texts with a rather directly interpretable method. While it has its limitations, it's a good place to begin working with textual data. There are a number of out-of-the-box tools in Python we can use for sentiment analysis.

In [None]:
# Load SentimentIntensityAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Instantiate new SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Generate sentiment scores
sentiment_scores = ds_tweets['text'].apply(sid.polarity_scores)
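
`polarity_scores()` returns one dictionary per text with the keys `neg`, `neu`, `pos` and `compound`. The sketch below shows one way to pull out the compound score and filter on it. The dictionaries are made-up stand-ins in that shape, not real VADER output, so the example runs without the VADER lexicon installed.

```python
import pandas as pd

# Made-up score dictionaries in the shape returned by polarity_scores();
# the numbers are illustrative, not real VADER output
scores = pd.Series([
    {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.8},
    {'neg': 0.7, 'neu': 0.3, 'pos': 0.0, 'compound': -0.7},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
])

# Keep only the compound score, which ranges from -1 to 1
compound = scores.apply(lambda s: s['compound'])

# Boolean masks select strongly positive or strongly negative rows
positive_rows = compound[compound > 0.6].index.tolist()
negative_rows = compound[compound < -0.6].index.tolist()
print(positive_rows, negative_rows)
```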

### Calculating sentiment scores
A rough measure of sentiment towards a particular hashtag is the average sentiment of the tweets mentioning it. Other things may also be happening in those tweets, so it's important to inspect the text itself as well as the metrics generated by automated text methods.

In [None]:
# Isolate the compound score from the sentiment scores generated above
sentiment = sentiment_scores.apply(lambda x: x['compound'])

# Print out the text of a positive tweet
print(ds_tweets[sentiment > 0.6]['text'].values[0])

# Print out the text of a negative tweet
print(ds_tweets[sentiment < -0.6]['text'].values[0])

# Generate average sentiment scores for #python
sentiment_py = sentiment[check_word_in_tweet('#python', ds_tweets)].resample('1D').mean()

# Generate average sentiment scores for #rstats
sentiment_r = sentiment[check_word_in_tweet('#rstats', ds_tweets)].resample('1D').mean()

### Plotting sentiment scores
Lastly, let's plot the sentiment of each hashtag over time. This is largely similar to plotting the prevalence of tweets.

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Plot average #python sentiment per day
plt.plot(sentiment_py.index.day, sentiment_py, color = 'green')

# Plot average #rstats sentiment per day
plt.plot(sentiment_r.index.day, sentiment_r, color = 'blue')

plt.xlabel('Day')
plt.ylabel('Sentiment')
plt.title('Sentiment of data science languages')
plt.legend(('#python', '#rstats'))
plt.show()

## Chapter 3 - Twitter Networks

### Creating retweet network
Social media is, by nature, networked data. Twitter networks manifest in multiple ways. One of the most important types of network on Twitter is the retweet network. We can represent these as directed graphs, with the retweeting user as the source and the retweeted person as the target. With Twitter data in our flattened DataFrame, we can import these into networkx and create a retweet network.

In [None]:
# Import networkx
import networkx as nx

# Create retweet network from edgelist
G_rt = nx.from_pandas_edgelist(
    sotu_retweets,
    source = 'user-screen_name', 
    target = 'retweeted_status-user-screen_name',
    create_using = nx.DiGraph())
    
# Print the number of nodes
print('Nodes in RT network:', len(G_rt.nodes()))

# Print the number of edges
print('Edges in RT network:', len(G_rt.edges()))
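
The sketch below builds the same kind of graph from a tiny, made-up edge list (the screen names are hypothetical). It also illustrates one consequence of using `nx.DiGraph`: repeated retweets between the same pair of users collapse into a single edge.

```python
import networkx as nx
import pandas as pd

# Made-up retweet records: each row is one retweet
toy_retweets = pd.DataFrame({
    'user-screen_name': ['alice', 'bob', 'carol', 'alice'],
    'retweeted_status-user-screen_name': ['dave', 'dave', 'alice', 'dave'],
})

G = nx.from_pandas_edgelist(
    toy_retweets,
    source='user-screen_name',
    target='retweeted_status-user-screen_name',
    create_using=nx.DiGraph())

# A DiGraph keeps one edge per (source, target) pair, so the repeated
# alice -> dave retweet collapses into a single edge
print(G.number_of_nodes(), G.number_of_edges())
print(dict(G.in_degree()))
```

Four rows produce only three edges here. If the number of repeated retweets matters for the analysis, a multigraph or an edge weight would be needed instead.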

### Creating reply network
Reply networks have a markedly different structure to retweet networks. While retweet networks often signal agreement, replies can signal discussion, deliberation, and disagreement. The network properties are the same, however: the network is directed, the source is the replier and the target is the user who is being replied to.

In [None]:
# Import networkx
import networkx as nx

# Create reply network from edgelist
G_reply = nx.from_pandas_edgelist(
    sotu_replies,
    source = 'user-screen_name', 
    target = 'in_reply_to_screen_name',
    create_using = nx.DiGraph())
    
# Print the number of nodes
print('Nodes in reply network:', len(G_reply.nodes()))

# Print the number of edges
print('Edges in reply network:', len(G_reply.edges()))

### Visualizing retweet network
Visualizing retweet networks is an important exploratory data analysis step because it allows us to visually inspect the structure of the network, see whether any user has disproportionate influence, and see whether there are different spheres of conversation.

In [None]:
# Create random layout positions
pos = nx.random_layout(G_rt)

# Create size list
sizes = [x[1] for x in G_rt.degree()]

# Draw the network
nx.draw_networkx(G_rt, pos, 
    with_labels = False, 
    node_size = sizes,
    width = 0.1, alpha = 0.7,
    arrowsize = 2, linewidths = 0)

# Turn axis off and show
plt.axis('off'); plt.show()

### In-degree centrality
Centrality is a measure of importance of a node to a network. There are many different types of centrality and each of them has slightly different meaning in Twitter networks. We are first focusing on degree centrality, since its calculation is straightforward and has an intuitive explanation.

For directed networks like Twitter, we need to be careful to distinguish between in-degree and out-degree centrality, especially in retweet networks. In-degree centrality for retweet networks signals users who are getting many retweets.

In [None]:
# Generate in-degree centrality for retweets 
rt_centrality = nx.in_degree_centrality(G_rt)

# Generate in-degree centrality for replies 
reply_centrality = nx.in_degree_centrality(G_reply)

# Column names for the centrality DataFrames
column_names = ['screen_name', 'degree_centrality']

# Store centralities in DataFrame
rt = pd.DataFrame(list(rt_centrality.items()), columns = column_names)
reply = pd.DataFrame(list(reply_centrality.items()), columns = column_names)

# Print first five results in descending order of centrality
print(rt.sort_values('degree_centrality', ascending = False).head())

# Print first five results in descending order of centrality
print(reply.sort_values('degree_centrality', ascending = False).head())
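
To make the measure less of a black box, here is a pure-Python sketch of the calculation on a made-up edge list. networkx's `in_degree_centrality` divides each node's in-degree by the number of other nodes, n - 1; the edges and names below are hypothetical.

```python
# Made-up directed edges (retweeter, retweeted user)
edges = [('alice', 'dave'), ('bob', 'dave'), ('carol', 'alice')]

# Collect every user that appears as a source or a target
nodes = sorted({user for edge in edges for user in edge})

# Count incoming edges per user
in_degree = {node: 0 for node in nodes}
for _, target in edges:
    in_degree[target] += 1

# Normalize by the number of other nodes in the graph
n = len(nodes)
centrality = {node: in_degree[node] / (n - 1) for node in nodes}
print(centrality)
```

Here `dave` is retweeted by two of the three other users, so his in-degree centrality is 2/3, while users who are never retweeted score 0.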

### Betweenness Centrality
Betweenness centrality for retweet and reply networks signals users who bridge between different Twitter communities. These communities may be tied together by topic or ideology.

networkx has been imported as nx. The networks G_rt and G_reply, and column_names = ['screen_name', 'betweenness_centrality'] have been loaded for you.

In [None]:
# Generate betweenness centrality for retweets 
rt_centrality = nx.betweenness_centrality(G_rt)

# Generate betweenness centrality for replies 
reply_centrality = nx.betweenness_centrality(G_reply)

# Store centralities in data frames
rt = pd.DataFrame(list(rt_centrality.items()), columns = column_names)
reply = pd.DataFrame(list(reply_centrality.items()), columns = column_names)

# Print first five results in descending order of centrality
print(rt.sort_values('betweenness_centrality', ascending = False).head())

# Print first five results in descending order of centrality
print(reply.sort_values('betweenness_centrality', ascending = False).head())

### Ratios
While not strictly a measure of importance to a network, the idea of being "ratio'd" is a network measure which is particular to Twitter and is typically used to judge the unpopularity of a tweet. "The Ratio," as it is called, is calculated by taking the number of replies and dividing it by the number of retweets. For our purposes, it makes conceptual sense to take only the in-degrees of both the retweet and reply networks.

In [None]:
# Column names for the in-degree DataFrames
column_names = ['screen_name', 'degree']

# Calculate in-degrees and store in DataFrame
degree_rt = pd.DataFrame(list(G_rt.in_degree()), columns = column_names)
degree_reply = pd.DataFrame(list(G_reply.in_degree()), columns = column_names)

# Merge the two DataFrames on screen name
ratio = degree_rt.merge(degree_reply, on = 'screen_name', suffixes = ('_rt', '_reply'))

# Calculate the ratio
ratio['ratio'] = ratio['degree_reply'] / ratio['degree_rt']

# Exclude any users with fewer than 5 retweets
ratio = ratio[ratio['degree_rt'] >= 5]

# Print out first five with highest ratio
print(ratio.sort_values('ratio', ascending = False).head())
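
The merge step is worth a closer look. In the sketch below, with made-up in-degree counts for hypothetical users, the default inner merge keeps only users who appear in both networks, and the shared `degree` column is renamed using the suffixes.

```python
import pandas as pd

# Made-up in-degree counts per user
degree_rt = pd.DataFrame({'screen_name': ['dave', 'erin', 'frank'],
                          'degree': [10, 50, 2]})
degree_reply = pd.DataFrame({'screen_name': ['dave', 'erin', 'gina'],
                             'degree': [30, 5, 4]})

# The default inner merge keeps only users present in both tables;
# the shared 'degree' column receives the suffixes
ratio = degree_rt.merge(degree_reply, on='screen_name',
                        suffixes=('_rt', '_reply'))
ratio['ratio'] = ratio['degree_reply'] / ratio['degree_rt']
print(ratio.sort_values('ratio', ascending=False))
```

`dave` gets three replies for every retweet (ratio 3.0), while `erin` gets far more retweets than replies (ratio 0.1). `frank` and `gina` drop out because each appears in only one network.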

## Chapter 4 - Putting Twitter data on the map

### Accessing user-defined location
In the slides, we saw that we could obtain user location via user-generated text, including the tweet itself and the location field in the user's description. These are the two least precise methods of obtaining user location, but they are also possibly the most readily available.

In this exercise, you're going to extract the user-defined location from a single example tweet as well as a large set of tweets. We've added another line to our flatten_tweets() function which will allow you to access user-defined location within the data frame.

tweet_obj['user-location'] = tweet_obj['user']['location']

In addition, the single tweet in JSON format tweet_json and the State of the Union tweets in JSON format tweets_sotu_json have been loaded for you.

In [None]:

# Print out the location of a single tweet
print(tweet_json['user']['location'])

# Flatten and load the SOTU tweets into a dataframe
tweets_sotu = pd.DataFrame(flatten_tweets(tweets_sotu_json))

# Print out top five user-defined locations
print(tweets_sotu['user-location'].value_counts().head())

### Accessing bounding box
Tweets with coordinate-level geographical information attached typically carry it in the form of a bounding box. A bounding box is a set of four longitude/latitude coordinate pairs which denote a particular area in which the user can be located. The bounding box is located in the place value of the Twitter JSON.

The dataset has been loaded for you as a DataFrame in tweets_sotu.

In [None]:
def getBoundingBox(place):
    """ Returns the bounding box coordinates."""
    return place['bounding_box']['coordinates']

# Apply the function which gets bounding box coordinates
bounding_boxes = tweets_sotu['place'].apply(getBoundingBox)

# Print out the first bounding box coordinates
print(bounding_boxes.values[0])

### Calculating the centroid
The bounding box can range from a city block to a whole state or even country. For simplicity's sake, one way to handle these data is to translate the bounding box into what's called a centroid, or the center of the bounding box. The calculation of the centroid is straightforward: we calculate the midpoints of the box's longitude and latitude ranges.

In [None]:
def calculateCentroid(place):
    """ Calculates the centroid from a bounding box."""
    # Obtain the coordinates from the bounding box.
    coordinates = place['bounding_box']['coordinates'][0]
        
    longs = np.unique( [x[0] for x in coordinates] )
    lats  = np.unique( [x[1] for x in coordinates] )

    if len(longs) == 1 and len(lats) == 1:
        # return a single coordinate
        return (longs[0], lats[0])
    elif len(longs) == 2 and len(lats) == 2:
        # If we have two longs and lats, we have a box.
        central_long = np.sum(longs) / 2
        central_lat  = np.sum(lats) / 2
    else:
        raise ValueError("Non-rectangular polygon not supported: %s" % 
            ",".join(map(lambda x: str(x), coordinates)) )

    return (central_long, central_lat)
    
# Calculate the centroids of place     
centroids = tweets_sotu['place'].apply(calculateCentroid)
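
As a quick check of the midpoint idea, the sketch below computes the centroid of one made-up rectangular bounding box in pure Python. The `place` dictionary is a hypothetical example in the shape of Twitter's place JSON, with each corner given as a [longitude, latitude] pair.

```python
# A made-up place object in the shape of Twitter's place JSON;
# coordinates are [longitude, latitude] pairs
place = {'bounding_box': {'coordinates': [[
    [-105.0, 39.0], [-105.0, 40.0], [-104.0, 40.0], [-104.0, 39.0]
]]}}

corners = place['bounding_box']['coordinates'][0]
longs = [corner[0] for corner in corners]
lats = [corner[1] for corner in corners]

# For a rectangle, the centroid is the midpoint of each coordinate range
centroid = ((min(longs) + max(longs)) / 2, (min(lats) + max(lats)) / 2)
print(centroid)
```

This gives the same point as summing the two unique longitudes and the two unique latitudes and halving each, as calculateCentroid() does for a box.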

### Creating Basemap map
Basemap allows you to create maps in Python. The library builds projections for latitude and longitude coordinates and then passes the plotting work on to matplotlib. This means you can build extra features based on the power of matplotlib.

In this exercise, we're going to set up a map of the continental United States on a Mercator projection. The corner coordinates of this map are provided for you below.

In [None]:
# Import Basemap
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Set up the US bounding box
us_boundingbox = [-125, 22, -64, 50] 

# Set up the Basemap object
m = Basemap(llcrnrlon = us_boundingbox[0],
            llcrnrlat = us_boundingbox[1],
            urcrnrlon = us_boundingbox[2],
            urcrnrlat = us_boundingbox[3],
            projection='merc')

In [None]:
# Draw continents in white,
# coastlines and countries in gray
m.fillcontinents(color='white')
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')

# Draw the states and show the plot
m.drawstates(color='gray')
plt.show()

### Plotting centroid coordinates
Because we can't plot whole bounding boxes, we summarize the bounding box location into a single point called a centroid. Plotting these on a Basemap map is straightforward. Once we calculate the centroids, we separate the longitudes and latitudes, then pass them to the .scatter() method.

The Basemap object m has been created for you. The dataset tweets_sotu and function calculateCentroid() have also been loaded.

In [None]:
# Calculate the centroids for the dataset
# and isolate longitudes and latitudes
centroids = tweets_sotu['place'].apply(calculateCentroid)
lon = [x[0] for x in centroids]
lat = [x[1] for x in centroids]

# Draw continents, coastlines, countries, and states
m.fillcontinents(color='white', zorder=0)
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# Draw the points and show the plot
m.scatter(lon, lat, latlon = True, alpha = 0.7)
plt.show()

### Coloring by sentiment
We want to be able to differentiate by place with our Twitter analysis. One distinguishing factor between places is how the State of the Union speech was received. For this purpose, we'll use the sentiment analysis we covered in Chapter 2 to evaluate how the speech was received in different parts of the country.

In [None]:
# Generate sentiment scores
sentiment_scores = tweets_sotu['text'].apply(sid.polarity_scores)

# Isolate the compound element
sentiment_scores = [x['compound'] for x in sentiment_scores]

# Draw the points
m.scatter(lon, lat, latlon = True, 
           c = sentiment_scores,
           cmap = 'coolwarm', alpha = 0.7)
           
# Show the plot
plt.show()

## Key Takeaways
The course presented an objective approach to reading, analyzing and plotting Twitter data, introducing many useful resources available in Python. Most of them are simple to use and provide shortcuts for text search, sentiment analysis and network evaluation, as well as for plotting data points on maps.