# Assignment 2: Getting data from Twitter API

The easiest way to work on this assignment is to log into [datahub.berkeley.edu](http://datahub.berkeley.edu). If you have a @berkeley.edu email address, you already have full access to the programming environment hosted on that site.

## 1. Intro

Twitter collects a *lot* of data. Ranging from tweets themselves, to data about users, to data about likes and other interactions, Twitter basically records everything that happens on their website. Lucky for data scientists like ourselves, Twitter also shares that data with us! In this assignment, we're going to use Twitter's API to analyze retweet statistics, demographics, and some other data too! 

## 2. Importing Libraries

Other people are also interested in analyzing Twitter data, so there's been work done here already. That means other folks have developed useful collections of code — called libraries — which handle a lot of parsing and data management, so that we don't have to. Since these libraries are published online, we have access to all that hard work too! That means we can use code from those libraries to handle all the complicated Twitter models, so we only have to worry about the actual analysis (which is the fun part).

You don't need to worry too much about the code in the next cell. Its purpose is to install libraries that other people have written, so that we have access to them later on.

In [4]:
!pip install tweepy    # This helps us access Twitter data.
!pip install textblob  # This helps us parse text.
!pip install plotly    # This makes it easy to plot graphs.
!pip install nltk      # This is also to parse text.



## 3. Accessing the Data

### Question 1
To work with Twitter data, we'll first need two things: a Twitter account and Twitter keys. Follow these steps to get them:

1. [Create a Twitter account](https://twitter.com).  You can use an existing account if you have one.
1. [Create a Twitter developer account](https://dev.twitter.com/resources/signup).  Attach it to your Twitter account.
1. Once you're logged into your developer account, [create an application for this assignment](https://apps.twitter.com/app/new).  You can call it whatever you want, and you can write any URL when it asks for a web site.
1. On the page for that application, find your Consumer Key and Consumer Secret. Don't lose these!
1. On the same page, create an Access Token. Record the resulting Access Token and Access Token Secret. Don't lose these either!

**Security concern:** DO NOT share your access keys with anyone. They can be used to manage your Twitter account without your permission.

Add your credentials in the cell below. Your program will use them to access Twitter data.

In [31]:
consumer_key = ""
consumer_secret = ""

access_key = ""
access_secret = ""
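
One way to avoid typing keys directly into the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions, not part of the assignment, and the variables would have to be set before Jupyter starts.

```python
import os

# Hypothetical environment variable names; any names work as long as
# they match what was set in the shell before starting Jupyter.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_key": "TWITTER_ACCESS_KEY",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(env=os.environ):
    """Return the four credentials, using "" for any that are not set."""
    return {name: env.get(var, "") for name, var in ENV_NAMES.items()}

creds = load_credentials()
missing = [name for name, value in creds.items() if not value]
if missing:
    print("Not set:", ", ".join(missing))
```

With this approach, `consumer_key = creds["consumer_key"]` (and likewise for the other three) would replace the empty strings above, and there is nothing to delete before submitting.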

The next cell will authorize your program to request Twitter data, through the developer account you just set up.

In [6]:
import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

**Note**: A lot of data scientists like Twitter data. To keep its servers from being overloaded, Twitter limits how many requests you can make in each 15-minute window; once you hit the limit, you have to wait for the window to reset. Use your requests wisely to avoid unnecessary waiting time.

### Question 2:
The [Twitter API](https://dev.twitter.com/overview/api) can be used to retrieve different kinds of objects (e.g., tweets). List the other objects that can be retrieved through the Twitter API.


**Answer:**
1. Users
1. Entities 
1. Places

Now that everything is set up, we can use [Twitter's search API](https://dev.twitter.com/rest/reference/get/search/tweets) to search for tweets containing the word "Berkeley". This should give us results similar to those from [Twitter's online "search" page](https://twitter.com/search?q=berkeley).

In [7]:
results = tweepy.Cursor(api.search,   # `api.search` specifies we want to perform a search.
                        q='Berkeley', # `q` is the query, or the words we're searching for.
                        result_type='popular') # We'll prioritize more popular results first.

In [8]:
print(results)

<tweepy.cursor.Cursor object at 0x7ff228b01ef0>


In [9]:
# what does <tweepy.cursor.Cursor object at 0x7f30fc94ca58> mean? - possibly, too many results to show?

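`results` is not the tweets themselves but a `Cursor` object, which is why printing it shows only `<tweepy.cursor.Cursor object at ...>`. It fetches search results page by page as we iterate over it. Since the full set of results is pretty extensive, let's just take the first ten. In the next cell we build up a list called `first_ten`, which contains just the first ten tweets we found in `results`.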

In [10]:
first_ten = []                  # We start out with an empty list called `first_ten`.
for tweet in results.items(10): # Then, we'll iterate over the first 10 tweets in `results`...
    first_ten.append(tweet)     # And we'll add each of those tweets to `first_ten`.
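
To get a feel for why taking only ten items is cheap, here is a toy model of page-by-page fetching. It is not tweepy's actual implementation: `fake_pages`, `items`, and the made-up "tweet N" strings are stand-ins, and the `requested` list simply records which pretend requests were made.

```python
from itertools import islice

requested = []  # records which (pretend) pages were actually fetched

def fake_pages(page_size=3, n_pages=4):
    """Yield made-up pages of results, one page per pretend request."""
    for page in range(n_pages):
        requested.append(page)
        yield [f"tweet {page * page_size + i}" for i in range(page_size)]

def items(pages):
    """Flatten pages into single results, fetching a page only when needed."""
    for page in pages:
        yield from page

# Ask for the first five results only.
first_five = list(islice(items(fake_pages()), 5))
print(first_five)
print("pages requested:", requested)
```

Only the pages needed to supply five results are fetched; the remaining pages are never requested, which matters when every request counts against a rate limit.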

Now let's have a peek at what the data looks like.

In [11]:
print(first_ten)

[Status(_api=<tweepy.api.API object at 0x7ff230075518>, _json={'created_at': 'Tue Sep 19 16:39:39 +0000 2017', 'id': 910181603875442688, 'id_str': '910181603875442688', 'text': 'UC Berkeley https://t.co/20nuAbZPpx', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 910181593444241408, 'id_str': '910181593444241408', 'indices': [12, 35], 'media_url': 'http://pbs.twimg.com/media/DKGdVj5X0AAtdJD.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DKGdVj5X0AAtdJD.jpg', 'url': 'https://t.co/20nuAbZPpx', 'display_url': 'pic.twitter.com/20nuAbZPpx', 'expanded_url': 'https://twitter.com/PrisonPlanet/status/910181603875442688/photo/1', 'type': 'photo', 'sizes': {'large': {'w': 719, 'h': 710, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 719, 'h': 710, 'resize': 'fit'}, 'small': {'w': 680, 'h': 671, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 910181593444241408, 'id_str': '9101

In [12]:
first_hundredo = []
for tweet in results.items(100):
    first_hundredo.append(tweet)
print(first_hundredo)

[Status(_api=<tweepy.api.API object at 0x7ff230075518>, _json={'created_at': 'Mon Sep 18 13:20:01 +0000 2017', 'id': 909768977030828034, 'id_str': '909768977030828034', 'text': 'Video =&gt; Grown Man Goes Bananas Over Berkeley Free Speek Week Poster https://t.co/gt1nQgtkug https://t.co/fIZ26o9v9y', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/gt1nQgtkug', 'expanded_url': 'https://www.louderwithcrowder.com/grown-man-free-speech-week-poster/', 'display_url': 'louderwithcrowder.com/grown-man-free…', 'indices': [72, 95]}], 'media': [{'id': 909768974409244672, 'id_str': '909768974409244672', 'indices': [96, 119], 'media_url': 'http://pbs.twimg.com/media/DKAmD-NVoAAyUBI.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DKAmD-NVoAAyUBI.jpg', 'url': 'https://t.co/fIZ26o9v9y', 'display_url': 'pic.twitter.com/fIZ26o9v9y', 'expanded_url': 'https://twitter.com/scrowder/status/909768977030828034/photo/1', 'type': 'photo', 's

## 4. Exploring the Dataset

Twitter gives us a lot of information about each tweet, not just its text. You can read about all the details [here](https://dev.twitter.com/overview/api/tweets). Let's look at one tweet to get a sense of the information we have available. We can access just the first tweet in our list by indexing into it. Note, the first index in the list is actually 0, not 1, so we will actually say `first_ten[0]` to see the first tweet in our list of ten tweets.

In [13]:
print(first_ten[2]) # Try changing this to any number 0-9, to see other tweets in the list.

Status(_api=<tweepy.api.API object at 0x7ff230075518>, _json={'created_at': 'Tue Sep 19 00:04:39 +0000 2017', 'id': 909931206002774017, 'id_str': '909931206002774017', 'text': 'Announcing! \n\nANTIFA: America Under Siege \n\nPremiering at Berkeley Free Speech Week! https://t.co/90xSMLmNgP', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 909931199056904192, 'id_str': '909931199056904192', 'indices': [85, 108], 'media_url': 'http://pbs.twimg.com/media/DKC5msHVAAAKhXM.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DKC5msHVAAAKhXM.jpg', 'url': 'https://t.co/90xSMLmNgP', 'display_url': 'pic.twitter.com/90xSMLmNgP', 'expanded_url': 'https://twitter.com/JackPosobiec/status/909931206002774017/photo/1', 'type': 'photo', 'sizes': {'small': {'w': 680, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1080, 'h': 1080, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 1080, 'h': 1080, 'resize': 'fit'

In [14]:
print(first_ten[0].text) # Try this and inspect what it does

UC Berkeley https://t.co/20nuAbZPpx


In [15]:
print(first_ten[0].created_at) # Try this and inspect what it does 

2017-09-19 16:39:39
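
Notice that the raw `_json` output shown earlier stores `created_at` as the string `'Tue Sep 19 16:39:39 +0000 2017'`, while `first_ten[0].created_at` prints as a date and time. The sketch below shows how such a string could be parsed with Python's standard library; it is an illustration of the idea, not a claim about how tweepy does the conversion internally.

```python
from datetime import datetime

# The string is copied from the `_json` output of the first tweet.
raw = "Tue Sep 19 16:39:39 +0000 2017"

# %a = weekday name, %b = month name, %z = UTC offset such as +0000.
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")

print(parsed)
print(parsed.strftime("%Y-%b-%d %H:%M"))
```

Once the value is a `datetime`, it can be compared, sorted, or reformatted with `strftime`.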


### Question 3:
Which field contains each of the following attributes:
1. The tweet's text?
1. The time when the tweet was posted?
1. The geographic location of the tweet?
1. The source (device and app) where the tweet was written?

**Answer:**
1. text
1. created_at
1. coordinates
1. source

## 5. Analyzing the Dataset

It's time to do analysis! Let's start out by getting a list, where each entry is the number of retweets for one of our first ten tweets.

In [16]:
retweet_counts = []                      # We start with an empty list called `retweet_counts`.
for tweet in first_ten:                  # Then, we iterate over the tweets in `first_ten`...
    retweet_count = tweet.retweet_count  # And, for each tweet, get the number of retweets...
    retweet_counts.append(retweet_count) # And append that number to our list `retweet_counts`.
    
print(retweet_counts)

[763, 510, 173, 246, 269, 227, 79, 127, 71, 106]
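
Before plotting, a few summary numbers can give a quick sense of the list. This small sketch uses Python's built-in `statistics` module on the ten counts printed above; the variable names `mean_count`, `median_count`, and `max_count` are just illustrative.

```python
import statistics

# The ten retweet counts printed above.
retweet_counts = [763, 510, 173, 246, 269, 227, 79, 127, 71, 106]

mean_count = statistics.mean(retweet_counts)      # average retweets per tweet
median_count = statistics.median(retweet_counts)  # middle value of the sorted list
max_count = max(retweet_counts)                   # the most-retweeted tweet

print(mean_count, median_count, max_count)
```

The mean is noticeably larger than the median here, which suggests that a couple of heavily retweeted tweets pull the average up.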


Let's try and draw it:

In [17]:
import matplotlib.pyplot as plt
plt.hist(retweet_counts)
plt.xlabel("Retweet counts")
plt.ylabel("Frequency")
plt.show()

### Question 4
Twitter's Search API provides three modes for `result_type` (check out this [guide](https://dev.twitter.com/rest/reference/get/search/tweets)): mixed, recent, and popular. In the previous code, we retrieved the popular tweets.
Now it is your turn to retrieve the 100 most recent tweets and assign them to a new variable (e.g., `results_recent100`), then plot a histogram of the retweet counts of those 100 tweets. You can follow our example code if you wish.


**Answer**


In [None]:
#recent
berkeley_results = tweepy.Cursor(api.search, q='Berkeley', result_type='recent', count=100)

first_hundredo = []
for tweet in berkeley_results.items(100):
    first_hundredo.append(tweet)
first_hundredo

In [None]:
#recent
retweet_counter = []                      # We start with an empty list called `retweet_counter`.
for tweet in first_hundredo:              # Then, we iterate over the tweets in `first_hundredo`...
    retweet_count = tweet.retweet_count   # And, for each tweet, get the number of retweets...
    retweet_counter.append(retweet_count) # And append that number to our list `retweet_counter`.
    
print(retweet_counter)



In [None]:
#recent100, popularity shown through retweets.
import matplotlib.pyplot as plt
plt.hist(retweet_counter)
plt.xlabel("Retweet counts")
plt.ylabel("Frequency")
plt.show()

In [None]:
#popular
berkeley_results = tweepy.Cursor(api.search, q='Berkeley', result_type='popular', count=100)

results_popular = []
for tweet in berkeley_results.items(100):
    results_popular.append(tweet)
results_popular

In [None]:
#popular
retweet_counter_popular = []                      # We start with an empty list called `retweet_counter_popular`.
for tweet in results_popular:                     # Then, we iterate over the tweets in `results_popular`...
    retweet_count = tweet.retweet_count           # And, for each tweet, get the number of retweets...
    retweet_counter_popular.append(retweet_count) # And append that number to our list `retweet_counter_popular`.
    
print(retweet_counter_popular)

In [None]:
#popular
import matplotlib.pyplot as plt
plt.hist(retweet_counter_popular)
plt.xlabel("Retweet counts")
plt.ylabel("Frequency")
plt.show()




### Question 5
Compare and contrast the two histograms of retweet counts for the 100 recent and 100 popular tweets returned by searching for the word "Berkeley".

**Answer**: 
Similarities: In both charts, most tweets fall at the lower end of the scale, with fewer than 10,000 retweets. Both graphs would probably have a similar distribution if they used the same scale on the x axis.

Differences: While the graph distributions look different, not much can be said reliably, since the two graphs use different scales on the x axis. This can mislead readers into thinking that the popular retweet distribution is more spread out. Another difference is that the recent data set has one tweet with over 20,000 retweets, while the popular data set doesn't.
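
One way to address the scale problem mentioned above is to bin both data sets with the same bin edges. The sketch below is a plain-Python illustration; `shared_bin_edges`, `bin_counts`, and the two short lists of counts are made up for the example and are not the actual search results.

```python
def shared_bin_edges(a, b, n_bins=10):
    """Return n_bins + 1 equally spaced edges covering both lists."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:  # avoid zero-width bins when every value is equal
        hi = lo + 1
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def bin_counts(values, edges):
    """Count values per bin; the last bin includes its right edge."""
    width = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for v in values:
        index = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[index] += 1
    return counts

# Hypothetical retweet counts, for illustration only.
recent = [0, 0, 1, 3, 40, 1000]
popular = [71, 106, 246, 510, 763]

edges = shared_bin_edges(recent, popular)
print(bin_counts(recent, edges))
print(bin_counts(popular, edges))
```

The same list of edges can be passed as the `bins` argument to both `plt.hist` calls, so that the two histograms share an x axis and can be compared directly.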






## Users
Instead of searching for tweets, you can use the Twitter API to get details about a specific user account, including the user's timeline, followers, etc.

Get the latest 10 tweets from an account that interests you (e.g., UCBerkeley) using the following code.

In [18]:
user_results = api.user_timeline(screen_name='UCBerkeley', count=10)

user_results_tweets = []           # We start with an empty list called user_results_tweets

for t in user_results:             #Then, we iterate over the tweets in user_results
    user_results_tweets.append(t)  #And we'll add each of those tweets to user_results_tweets 

In [19]:
#Let's look at one tweet
print(user_results_tweets[0])

Status(_api=<tweepy.api.API object at 0x7ff230075518>, _json={'created_at': 'Wed Sep 20 03:04:00 +0000 2017', 'id': 910338728186535937, 'id_str': '910338728186535937', 'text': "Monday's memorial honored the dead with words and music https://t.co/CsKczVxsrK #highered", 'truncated': False, 'entities': {'hashtags': [{'text': 'highered', 'indices': [80, 89]}], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/CsKczVxsrK', 'expanded_url': 'http://bit.ly/2fi2xrh', 'display_url': 'bit.ly/2fi2xrh', 'indices': [56, 79]}]}, 'source': '<a href="http://sproutsocial.com" rel="nofollow">Sprout Social</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 176932593, 'id_str': '176932593', 'name': 'UC Berkeley', 'screen_name': 'UCBerkeley', 'location': 'Berkeley, California', 'description': 'University of California, Berkeley, the premier public institution of

In [20]:
#print the text of the first 10 tweets
for tweet in user_results_tweets:
    print(tweet.created_at.strftime("%Y-%b-%d %H:%M"))
    print(tweet.text)
    print()

2017-Sep-20 03:04
Monday's memorial honored the dead with words and music https://t.co/CsKczVxsrK #highered

2017-Sep-20 02:07
RT @CalMBBall: The day you've been waiting for is here. Our full 2017-18 #Pac12Hoops schedule &amp; broadcast info is live NOW!

Season 🔜! http…

2017-Sep-20 02:03
GO BEARS! Berkeley’s first campuswide tech club for women https://t.co/mdMWkc3sHd #womenintech

2017-Sep-20 01:20
LIVE: The #altright on campus — what students need to know  https://t.co/UdWlTlIScz @splcenter #freespeech https://t.co/ivSUeXgs59

2017-Sep-19 22:15
Professor Andrew Garrett helps recover wax cylinder audio of indigenous languages https://t.co/IxBlKs0Ujv @NSF #ThankAScientist

2017-Sep-19 19:51
RT @UCBDiversity: @nytimes wants to know what free speech means to @UCBerkeley students by sharing your story  https://t.co/1PXQaEypeX #Fre…

2017-Sep-19 19:04
TODAY @ 6PM: The #altright on campus — what students need to know https://t.co/L4LcWWvKvO @splcenter #freespeech

2017-Sep-19 02:09
RT @UC

### Question 6

Look at the text of retrieved tweets and compare them to the latest 10 tweets of the [web interface](https://twitter.com/UCBerkeley) for the same user. Do you see any difference? 

**Answer**: The results are in the same order and appear to be the same, apart from the time stamp. The desktop web interface shows how many hours ago each tweet was posted, whereas the list of retrieved tweets shows an absolute time stamp.
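
The difference in time stamps is purely a matter of presentation. As a rough sketch, an absolute time can be turned into an "hours ago" style label like this; the helper `relative_age` and its cut-offs are assumptions made for illustration, not Twitter's actual display rules.

```python
from datetime import datetime

def relative_age(created_at, now):
    """Format the gap between two datetimes as a short relative label."""
    seconds = int((now - created_at).total_seconds())
    if seconds < 60:
        return f"{seconds}s"
    if seconds < 3600:
        return f"{seconds // 60}m"
    if seconds < 86400:
        return f"{seconds // 3600}h"
    return created_at.strftime("%b %d")  # older tweets fall back to a date

# The first time matches the newest tweet printed above; `now` is made up.
posted = datetime(2017, 9, 20, 3, 4)
now = datetime(2017, 9, 20, 8, 4)
print(relative_age(posted, now))
```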

Get a list of followers for UCBerkeley.

In [30]:
followers_list = []

followers_results = api.followers(screen_name='UCBerkeley', count=200)
for f in followers_results:
    followers_list.append(f)

for f in followers_list:
    print(f.name)


Raj K Tiwari
FedUni Research
Brandon Sward
oak
Alan Scote
Philipp Gillé
Yuexin Li
Yvng dvmb
susan horgan
Subhra Shankha
sonia amimi
luz marí
Yi Li
EVERYTHING 🐸 🔵
LoftsgardUC
@SouthernLady270
Alan James Lucas
sirui Qiu
Impeach Trump v2.0
Rick1992
Xiao Zijin
Bella
Srinidhi Iyengar
Hia Ming
PovertyIsDisability
Max Rice
Steffanie Riess
RACISM IS ARROGANT
FAU Tech Runway
KenWayne
RiverGirlCancun
K Owen
imiguate
Kurniawan Junaidy
edu
Adam Kehl/Kale
Gagan micro
jenwex
karen holleran
Anwar Ali T.P.
Alfredo Storck
Lisa Criscitello
D. Flaner
testing
Buffy Webster
Samuel
Christine Long
Sarah El Safty
PAVEHAWK
Karolina MZ
UConn Air Force ROTC
Tim Prince
zz_beginner
Angham
Bryanne Aler-Ningas
Chad Chapel
Ochoa Middle
Michelle Demishevich
ASUC STeam
Frédéric Béziers
EJ
Len Wolfenstein
OSU Grad Recruitment
Judah Obi
Benjamin Moe
Anthony
B R I E
CEE Hacks
tallulah-blue
Doug Sovern
musa
Mari Makharadze
Tomuchslaying
Kaely Monahan
Joyce Zhang
AJ Fox
Jackson Bullard
MD
Felipe Abreo
Checho Waldo
Corpus VE

In [22]:
#what does "Cursor" object is not iterable mean?

In [23]:
print(followers_list)



There’s a limit on how many users can be returned by one request. If you need more, please read [using cursors to navigate collections](https://dev.twitter.com/overview/api/cursoring).
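
The idea behind cursoring is that each response carries a `next_cursor` value to send with the next request: the first request uses a cursor of -1, and a `next_cursor` of 0 means there are no more pages. The toy loop below imitates that convention with made-up data; `FAKE_PAGES`, `fetch_page`, and `all_followers` are hypothetical stand-ins, not real API calls.

```python
# Made-up follower names split across pages, keyed by cursor value.
# Each entry maps a cursor to (users on that page, next_cursor).
FAKE_PAGES = {
    -1: (["ana", "ben"], 101),
    101: (["cy", "dee"], 102),
    102: (["eli"], 0),
}

def fetch_page(cursor):
    """Stand-in for one API request: returns (users, next_cursor)."""
    return FAKE_PAGES[cursor]

def all_followers():
    """Follow next_cursor values until the API signals the last page."""
    followers = []
    cursor = -1              # -1 asks for the first page
    while cursor != 0:       # 0 means there are no more pages
        users, cursor = fetch_page(cursor)
        followers.extend(users)
    return followers

print(all_followers())
```

In this notebook, `tweepy.Cursor(...).items(n)` already handles this bookkeeping for the search results, and the same pattern may be applicable to follower lists.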

### Extra Credit Question 
We saw how to use the Twitter API to search for tweets. The [Search API](https://dev.twitter.com/rest/public/search) has an option to filter the query results by geolocation. The parameter value is specified as "latitude,longitude,radius" (check the API documentation for more information). Compare the text of the top 10 popular tweets that contain the word 'berkeley' from four locations: Berkeley, Kansas City, New York, and Barcelona, Spain.


**Answer:**

In [24]:
# use geocode
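
As a starting point, the "latitude,longitude,radius" strings can be built with a small helper. In this sketch the coordinates are approximate city centres, the 25-mile radius is an arbitrary choice, and `geocode_string` is a hypothetical helper; the radius unit must be `mi` or `km`.

```python
def geocode_string(latitude, longitude, radius, unit="mi"):
    """Build a "latitude,longitude,radius" string, e.g. "37.8716,-122.2727,25mi"."""
    return f"{latitude},{longitude},{radius}{unit}"

# Approximate coordinates, for illustration only.
CITIES = {
    "Berkeley": (37.8716, -122.2727),
    "Kansas City": (39.0997, -94.5786),
    "New York": (40.7128, -74.0060),
    "Barcelona": (41.3851, 2.1734),
}

geocodes = {city: geocode_string(lat, lon, 25) for city, (lat, lon) in CITIES.items()}
for city, code in geocodes.items():
    print(city, code)
```

Each string could then be passed as the `geocode` argument of a search, alongside `q` and `result_type`.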

### Extra Credit Question
Based on `profile_location` in the follower data, compare the followers of Donald Trump and Hillary Clinton in terms of their locations. Draw a map for both follower lists.

**Answer:**

In [25]:
# use geo_enabled
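
Location fields on user profiles are free text (for example, the UCBerkeley profile above shows `'location': 'Berkeley, California'`), so some tidying is usually needed before comparing them. The sketch below counts made-up location strings after light normalization; the list `locations` and the helper `normalize` are hypothetical and only illustrate the counting step, not the mapping.

```python
from collections import Counter

# Made-up location strings standing in for followers' profile locations.
locations = ["Berkeley, CA", "berkeley, ca ", "New York, NY", "", "Berkeley, CA"]

def normalize(location):
    """Trim whitespace and lowercase so near-identical strings match."""
    return location.strip().lower()

# Skip empty locations, then count how often each normalized value appears.
counts = Counter(normalize(loc) for loc in locations if loc.strip())
print(counts.most_common())
```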

##### Submitting the assignment

- Delete your Twitter API credentials, i.e., re-assign `consumer_key`, `consumer_secret`, `access_key`, and `access_secret` to empty strings so that we won't see your credentials when you save and submit it.
- Save this Jupyter notebook as a PDF. Click File, Download as, PDF via LaTeX (.pdf).
- Upload the pdf file into bcourses under Assignment 2.