# Case Study 1 : Data Science in Twitter Data

**Required Readings:** 
* Chapter 1 and Chapter 9 of the book [Mining the Social Web](http://www.webpages.uidaho.edu/~stevel/504/Mining-the-Social-Web-2nd-Edition.pdf) 
* The code for [Chapter 1](http://bit.ly/1qCtMrr) and [Chapter 9](http://bit.ly/1u7eP33)
* [TED Talks](https://www.ted.com/talks) for examples of 10-minute talks.

**NOTE**
* Please remember to save the notebook frequently when working in Jupyter Notebook; otherwise, your changes can be lost.

*----------------------

# Problem: pick a data science problem that you plan to solve using Twitter Data
* The problem should be important and interesting, with a potential impact in some area.
* The problem should be solvable using Twitter data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?


We are interested in the impact that the publicly shared thoughts of a person in power can have on the world at large. In particular, we are interested in the potential economic impacts and ramifications of the tweets posted by the current president of the United States. In other words, we want to analyze what impact, if any, Trump's tweets have on the stock market.

This problem is interesting because, in general, the same approach could be used to rank the economic importance of any individual who uses Twitter.

## Data Collection: Download Twitter Data using API

* To solve the above problem, you need to collect some Twitter data. Select a topic that is relevant to your problem, and use the Twitter API to download the relevant tweets. It is recommended that the number of tweets be larger than 200 but smaller than 1 million.
* Store the tweets you downloaded in a local file (txt file or JSON file).
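The notebook below keeps the collected tweets in memory only. As a minimal sketch of the storage step, assuming the tweets are a list of plain dictionaries as returned by the API helpers, they could be written to and read back from a JSON file as follows. The function names `save_tweets` and `load_tweets` are hypothetical and are not part of the original notebook.

```python
import io
import json


def save_tweets(tweets, path):
    """Write a list of tweet dictionaries to `path` as one JSON document."""
    with io.open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps emoji and accented characters readable
        f.write(json.dumps(tweets, ensure_ascii=False, indent=1))


def load_tweets(path):
    """Read back the list of tweet dictionaries stored by save_tweets."""
    with io.open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With this in place, a call such as `save_tweets(tweets, "trump_tweets.json")` after harvesting (the file name is only an example) would let later cells reload the data without spending API requests.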

In [1]:
# Install dependencies on the google-provided VM hosting this notebook
# Only needs to be run once at the top of the notebook.
!pip install twitter
!pip install plotly
!pip install textblob
!pip install prettytable
!pip install quandl
!pip install pandas
!pip install python-dateutil



In [0]:
# Define helper functions adapted from Chapters 1 and 9 of
# "Mining the Social Web"

import twitter
# Define a Function to Login Twitter API
def oauth_login():
    # Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information 
    # on Twitter's OAuth implementation.
    
    # Never publish real credentials in a shared notebook; fill these in locally.
    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''
    
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api
  
#------------------------------------------------------------------------------
import sys
import time
from urllib2 import URLError
from httplib import BadStatusLine
import json
import twitter

def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw): 
    # A nested helper function that handles common HTTPErrors. Return an updated
    # value for wait_period if the problem is a 500 level error. Block until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which requires special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
    
        if wait_period > 3600: # Seconds
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e
    
        # See https://dev.twitter.com/docs/error-codes-responses for common codes
    
        if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        elif e.e.code == 429: 
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying in 15 minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function
    
    wait_period = 2 
    error_count = 0 

    while True:
        try:
            return twitter_api_func(*args, **kw)
        except twitter.api.TwitterHTTPError, e:
            error_count = 0 
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError, e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print >> sys.stderr, "URLError encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise
        except BadStatusLine, e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print >> sys.stderr, "BadStatusLine encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise
                
#------------------------------------------------------------------------------
def harvest_user_timeline(twitter_api, screen_name=None, user_id=None, max_results=1000):
    assert (screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"    
    
    kw = {  # Keyword args for the Twitter API call
        'count': 200,
        'trim_user': 'true',
        'include_rts' : 'true',
        'since_id' : 822501803615014918,
        'tweet_mode' : 'extended'
        }
    
    if screen_name:
        kw['screen_name'] = screen_name
    else:
        kw['user_id'] = user_id
        
    max_pages = 16
    results = []
    
    tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)
    
    if tweets is None: # 401 (Not Authorized) - Need to bail out on loop entry
        tweets = []
        
    results += tweets
    
    print >> sys.stderr, 'Fetched %i tweets' % len(tweets)
    
    page_num = 1
    
    # Many Twitter accounts have fewer than 200 tweets so you don't want to enter
    # the loop and waste a precious request if max_results = 200.
    
    # Note: Analogous optimizations could be applied inside the loop to try and 
    # save requests. e.g. Don't make a third request if you have 287 tweets out of 
    # a possible 400 tweets after your second request. Twitter does do some 
    # post-filtering on censored and deleted tweets out of batches of 'count', though,
    # so you can't strictly check for the number of results being 200. You might get
    # back 198, for example, and still have many more tweets to go. If you have the
    # total number of tweets for an account (by GET /users/lookup/), then you could 
    # simply use this value as a guide.
    
    if max_results == kw['count']:
        page_num = max_pages # Prevent loop entry
    
    while page_num < max_pages and len(tweets) > 0 and len(results) < max_results:
    
        # Necessary for traversing the timeline in Twitter's v1.1 API:
        # get the next query's max-id parameter to pass in.
        # See https://dev.twitter.com/docs/working-with-timelines.
        kw['max_id'] = min([ tweet['id'] for tweet in tweets]) - 1 
    
        # make_twitter_request returns None on 401/404 errors; treat that as an empty page
        tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw) or []
        results += tweets

        print >> sys.stderr, 'Fetched %i tweets' % (len(tweets),)
    
        page_num += 1
        
    print >> sys.stderr, 'Done fetching tweets'

    return results[:max_results]
  
#------------------------------------------------------------------------------
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    links and special characters using regex.
    '''
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
  
  
#------------------------------------------------------------------------------
# set up plotly API
import plotly

# Replace the placeholders with your own plotly username and API key
plotly.tools.set_credentials_file(username='YOUR_PLOTLY_USERNAME', api_key='YOUR_PLOTLY_API_KEY')

### Collect Tweets and report basic statistics

In [4]:
# Collect up to 3200 of the account's most recent tweets (this run returned 2624)
twitter_api = oauth_login()
tweets = harvest_user_timeline(twitter_api, screen_name="realDonaldTrump", \
                               max_results=3200)

print "Successfully Collected: " + str(len(tweets)) + " tweets"


Fetched 200 tweets
Fetched 200 tweets
Fetched 197 tweets
Fetched 200 tweets
Fetched 199 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 200 tweets
Fetched 28 tweets


Successfully Collected: 2624 tweets


Fetched 0 tweets
Done fetching tweets


# Data Exploration: Exploring the Tweets and Tweet Entities

**(1) Word Count:** 
* Load the tweets you collected from the local file (txt or JSON).
* Compute the frequencies of the words used in these tweets.
* Plot a table of the top 30 most frequent words with their counts.
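The word-count step can be sketched independently of the plotting code. The example below assumes the tweet texts have already been cleaned; the stopword set is an illustrative subset and `top_words` is a hypothetical helper name. It removes stopwords before ranking, so asking for 30 words yields at most 30 words.

```python
from collections import Counter

# Illustrative subset of common function words; extend as needed
STOPWORDS = {"the", "to", "of", "and", "a", "in", "for", "is"}


def top_words(texts, n=30, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in stopwords
    )
    return counts.most_common(n)
```

Filtering before calling `most_common(n)` matters: slicing first and filtering afterwards returns a number of words that depends on how many stopwords happened to fall inside the slice.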

In [11]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

tweets_text = [clean_tweet(tweet['full_text']) for tweet in tweets ]
words = [w.lower() for t in tweets_text for w in t.split()]

from collections import Counter
c = Counter(words)
# Common function words that carry no topical meaning; excluded from the ranking
useless = ['the','to','of','and','a','in','for','is','our','are','on','that','i','with','was','it','we','my','be','has','you','as','at','s']
# Drop the stopwords first, then keep the 30 most frequent remaining words
frequency = [(word, count) for word, count in c.most_common() if word not in useless][:30]
w = [word for word, count in frequency]
f = [count for word, count in frequency]

import plotly.plotly as py
from plotly.graph_objs import *

data = Data([Bar(x=w,y=f)])
layout = Layout(title = "30 Most Frequent Words", xaxis = {'title':'Words'}, yaxis = {'title':'Frequency'})
figure = Figure(data = data, layout = layout)
py.iplot(figure)


**(2) Find the most popular tweets in your collection of tweets**

---



Please plot a table of the top 10 most-retweeted tweets in your collection, i.e., the tweets with the largest number of retweet counts.
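A minimal sketch of the ranking step, assuming each tweet is a dictionary with a `retweet_count` field. The helper `most_retweeted` is hypothetical. The optional filter relies on the assumption that a status which is itself a retweet carries a `retweeted_status` key, in which case its count may reflect the original author's tweet rather than the account being studied.

```python
import heapq


def most_retweeted(tweets, n=10, include_retweets=True):
    """Return the n tweets with the highest retweet_count, largest first."""
    if not include_retweets:
        # Assumption: a status that is itself a retweet has a 'retweeted_status' key
        tweets = [t for t in tweets if "retweeted_status" not in t]
    # nlargest avoids sorting the whole collection when n is small
    return heapq.nlargest(n, tweets, key=lambda t: t.get("retweet_count", 0))
```

Whether to keep retweets in the ranking is a judgment call; the notebook's own collection includes them (`include_rts` is set to `'true'`).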


In [6]:
# Combine Retweet and text data, rank them, and convert them to utf-8
retweets = [int(tweet['retweet_count']) for tweet in tweets]
retweets_and_content = zip(tweets_text,retweets)
s = sorted(retweets_and_content, key=lambda tup: tup[1], reverse=True)
s = [(rt[0].encode('utf-8'), rt[1]) for rt in s]

from prettytable import PrettyTable
table = PrettyTable(['Tweet','Retweet Count'])
for i in s[:10]:
  table.add_row(i)
table.align = 'l'
print table



+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
| Tweet                                                                                                                                                                                                                                                                                | Retweet Count |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
| Why would Kim Jong un insult me by calling me old when I would NEVER call him short and fat Oh well I try s

**(3) Find the most popular Tweet Entities in your collection of tweets**

Please plot the top 10 most-frequent hashtags and top 10 most-mentioned users in your collection of tweets.
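The counting step can be sketched without the plotting layer. The example assumes each status has the `entities` structure used in the code below (`hashtags` entries with a `text` field, `user_mentions` entries with a `screen_name` field). The helper name `top_entities` is hypothetical, and lowercasing hashtags so that differently capitalized variants are counted together is a design choice, not something the original notebook does.

```python
from collections import Counter


def top_entities(statuses, n=10):
    """Return (top hashtags, top mentioned users) as lists of (name, count)."""
    hashtags = Counter()
    mentions = Counter()
    for status in statuses:
        entities = status.get("entities", {})
        # Fold case so '#MAGA' and '#maga' are counted as one hashtag
        hashtags.update(h["text"].lower() for h in entities.get("hashtags", []))
        mentions.update(m["screen_name"] for m in entities.get("user_mentions", []))
    return hashtags.most_common(n), mentions.most_common(n)
```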

In [7]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

def extract_tweet_entities(statuses):
    
    # See https://dev.twitter.com/docs/tweet-entities for more details on tweet
    # entities

    if len(statuses) == 0:
        return [], []
    
    screen_names = [ user_mention['screen_name'] 
                         for status in statuses
                            for user_mention in status['entities']['user_mentions'] ]
    
    hashtags = [ hashtag['text'] 
                     for status in statuses 
                        for hashtag in status['entities']['hashtags'] ]

    return screen_names, hashtags

screen_names, hashtags = extract_tweet_entities(tweets)

from collections import Counter
# Keep the 20 most common screen names and hashtags as (labels, counts) pairs

commonscreennames = zip(*Counter(screen_names).most_common()[0:20])
commonhashtags = zip(*Counter(hashtags).most_common()[0:20])

screennamesdata = Data([Bar(x=commonscreennames[0],y=commonscreennames[1])])
layout = Layout(title = "20 Most Frequent User Mentions", xaxis = {'title':'User Mention'}, yaxis = {'title':'Frequency'})
figure = Figure(data = screennamesdata,layout = layout)
py.iplot(figure, image_width=800, image_height=600)

#print json.dumps(Counter(urls).most_common()[0:5], indent=1)
#print json.dumps(Counter(media).most_common()[0:5], indent=1)
#print json.dumps(Counter(symbols).most_common()[0:5], indent=1)





In [8]:
hashtagsdata = Data([Bar(x=commonhashtags[0],y=commonhashtags[1])])
layout = Layout(title = "20 Most Frequent Hashtags", xaxis = {'title':'Hashtag'}, yaxis = {'title':'Frequency'})
figure = Figure(data = hashtagsdata,layout = layout)
py.iplot(figure, image_width=800, image_height=600)


**Plot** a histogram of the number of user mentions in the list using the following bins.


**(4) Getting "All" friends and "All" followers of a popular user in the tweets**

* Choose a popular Twitter user in your collection of tweets who has many followers.
* Get the list of all friends and all followers of that user.
* Pick 20 of the followers and show their ID numbers and screen names in a table.
* Pick 20 of the friends (if the user has more than 20 friends) and show their ID numbers and screen names in a table.
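Turning IDs into screen names needs a profile lookup. The code below issues one lookup request per ID; the same result can be obtained with fewer requests by sending IDs in batches, since the `users/lookup` endpoint accepts a comma-separated list of up to 100 IDs. The sketch is written against an injected `lookup_func` so that it can be tried without network access. Both `chunked` and `lookup_screen_names` are hypothetical helpers; in the notebook, `lookup_func` could be `partial(make_twitter_request, twitter_api.users.lookup)`.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids` containing at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def lookup_screen_names(lookup_func, ids, batch_size=100):
    """Resolve user IDs to screen names, batch_size IDs per request.

    lookup_func is any callable that accepts user_id='1,2,3' and returns a
    list of user dictionaries (or None on failure).
    """
    names = {}
    for batch in chunked(ids, batch_size):
        users = lookup_func(user_id=",".join(str(i) for i in batch)) or []
        for user in users:
            names[user["id"]] = user["screen_name"]
    # Preserve the input order; IDs missing from the response map to None
    return [(i, names.get(i)) for i in ids]
```

Mapping by ID rather than by position is deliberate: a lookup response may omit accounts that are unavailable, so positions in the response cannot be relied on to match the request.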

In [9]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary


from functools import partial
from sys import maxint

def get_friends_followers_ids(twitter_api, screen_name=None, user_id=None,
                              friends_limit=maxint, followers_limit=maxint):
    
    # Must have either screen_name or user_id (logical xor)
    assert (screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"
    
    # See https://dev.twitter.com/docs/api/1.1/get/friends/ids and
    # https://dev.twitter.com/docs/api/1.1/get/followers/ids for details
    # on API parameters
    
    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, 
                              count=5000)
    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, 
                                count=5000)

    friends_ids, followers_ids = [], []
    
    for twitter_api_func, limit, ids, label in [
                    [get_friends_ids, friends_limit, friends_ids, "friends"], 
                    [get_followers_ids, followers_limit, followers_ids, "followers"]
                ]:
        
        if limit == 0: continue
        
        cursor = -1
        while cursor != 0:
        
            # Use make_twitter_request via the partially bound callable...
            if screen_name: 
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id
                response = twitter_api_func(user_id=user_id, cursor=cursor)

            if response is not None:
                ids += response['ids']
                cursor = response['next_cursor']
        
            print >> sys.stderr, 'Fetched {0} total {1} ids for {2}'.format(len(ids), 
                                                    label, (user_id or screen_name))
        
            # XXX: You may want to store data during each iteration to provide an 
            # an additional layer of protection from exceptional circumstances
        
            if len(ids) >= limit or response is None:
                break

    # Do something useful with the IDs, like store them to disk...
    friends = [(i,make_twitter_request(twitter_api.users.lookup, user_id = i)[0]['screen_name']) for i in friends_ids[:friends_limit]]
    followers = [(i,make_twitter_request(twitter_api.users.lookup, user_id = i)[0]['screen_name']) for i in followers_ids[:followers_limit]]
    return friends, followers

# Sample usage

twitter_api = oauth_login()

friends, followers = get_friends_followers_ids(twitter_api, 
                                                       screen_name="realDonaldTrump", 
                                                       friends_limit=20, 
                                                       followers_limit=20)

#print friends
#print followers

Fetched 45 total friends ids for realDonaldTrump
Fetched 5000 total followers ids for realDonaldTrump


In [10]:
def PTable(columns,rows):
   table = PrettyTable(columns)
   for i in rows:
    table.add_row(i)
   table.align = 'l'
   return table

print("Friends")
print(PTable(['ID','Screen Name'],friends))

print("Followers")
print(PTable(['ID','Screen Name'],followers))


Friends
+--------------------+-----------------+
| ID                 | Screen Name     |
+--------------------+-----------------+
| 818927131883356161 | PressSec        |
| 22703645           | TuckerCarlson   |
| 56561449           | JesseBWatters   |
| 822215673812119553 | WhiteHouse      |
| 823367015830323201 | Scavino45       |
| 471672239          | KellyannePolls  |
| 20733972           | Reince          |
| 322293052          | RealRomaDowney  |
| 720293443260456960 | Trump           |
| 2325495378         | TrumpGolf       |
| 245963716          | TiffanyATrump   |
| 50769180           | IngrahamAngle   |
| 22203756           | mike_pence      |
| 729676086632656900 | TeamTrump       |
| 14669951           | DRUDGE_REPORT   |
| 475802156          | MrsVanessaTrump |
| 75541946           | LaraLeaTrump    |
| 41634520           | seanhannity     |
| 37764422           | foxnation       |
| 4121225056         | CLewandowski_   |
+--------------------+-----------------+
Follower

# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

First, we ran sentiment analysis on Trump's extracted tweets using TextBlob. Then, we downloaded stock price data using Quandl, another Python API. Using the polarity from the sentiment analysis and the stock values, we look for a relationship between them using a correlation matrix.
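The data-alignment part of this idea can be sketched with the standard library alone: average the polarity of all tweets posted on the same day, then keep only the days for which a stock value also exists. The helpers `daily_mean` and `join_on_date` are hypothetical names, and the dates are assumed to be strings in the same `'%Y-%m-%d'` format on both sides. The notebook itself does this with pandas.

```python
from collections import defaultdict


def daily_mean(dates, scores):
    """Average all scores that share a date; returns {date: mean score}."""
    buckets = defaultdict(list)
    for day, score in zip(dates, scores):
        buckets[day].append(score)
    return {day: sum(vals) / float(len(vals)) for day, vals in buckets.items()}


def join_on_date(polarity_by_day, change_by_day):
    """Inner join: keep only dates present in both mappings, sorted by date."""
    shared = sorted(set(polarity_by_day) & set(change_by_day))
    return [(day, polarity_by_day[day], change_by_day[day]) for day in shared]
```

The inner join silently drops weekends and market holidays, because no stock value exists for those dates. Tweets posted on those days therefore do not enter the correlation at all.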

In [0]:
import time
import datetime
import pandas
import quandl

quandl.ApiConfig.api_key = ''  # supply your own Quandl API key here
 
# Use Quandl to download stock values for a given stock symbol, start_date, and end_date.
# The result is returned as a pandas DataFrame.
# doc : https://blog.quandl.com/getting-started-with-the-quandl-api
# code example : https://chrisconlan.com/download-historical-stock-data-google-r-python/
#
# We use Quandl because the Google Finance API is reportedly no longer available
# (it was said to shut down in October 2012); see:
# https://stackoverflow.com/questions/46070126/google-finance-json-stock-quote-stopped-working/
def quandl_stocks(symbol, start_date=(2017, 1, 1), end_date=None, returns='pandas'):
    """
    symbol is a string representing a stock symbol, e.g. 'AAPL'
 
    start_date and end_date are tuples of integers representing the year, month,
    and day
 
    end_date defaults to the current date when None
    """
 
    query_list = ['WIKI' + '/' + symbol + '.' + str(1), 'WIKI' + '/' + symbol + '.' + str(4)]
 
    start_date = datetime.date(*start_date)
 
    if end_date:
        end_date = datetime.date(*end_date)
    else:
        end_date = datetime.date.today()
         
    return quandl.get(query_list, 
            returns=returns,  # honor the caller's choice instead of hardcoding 'pandas'
            start_date=start_date,
            end_date=end_date,
            collapse='daily',
            order='asc'
            )
  

Write code to implement the solution in Python:

In [0]:
import textblob
import datetime

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = textblob.TextBlob(tweet)
    return analysis.sentiment.polarity
  

#sentiment_data = [analize_sentiment(tweetText) for tweetText in tweets_text]
#pos_tweets = [ (index, tweet) for index, tweet in enumerate(tweets) if sentiment_data[index] > 0]
#neu_tweets = [ (index, tweet) for index, tweet in enumerate(tweets) if sentiment_data[index] == 0]
#neg_tweets = [ (index, tweet) for index, tweet in enumerate(tweets) if sentiment_data[index] < 0]
#pos_tweets_dict = dict(pos_tweets)
#neu_tweets_dict = dict(neu_tweets)
#neg_tweets_dict = dict(neg_tweets)
#print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(tweets)))
#print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(tweets)))
#print("Percentage de negative tweets: {}%".format(len(neg_tweets)*100/len(tweets)))

import datetime
from dateutil import parser
import pandas as pd

def getDailyChanges(ticker, tweetsPd):    
  # Download the Open and Close columns for this ticker from Quandl
  ticker_data = quandl_stocks(ticker, (2016, 9, 1))
  # Daily relative change, computed as (Open - Close) / Open.
  # Note the sign: a positive value means the price fell during the day.
  ticker_data_bri = ticker_data['WIKI/' + ticker + ' - Open'].sub(ticker_data['WIKI/' + ticker + ' - Close'], axis = 0).div(ticker_data['WIKI/' + ticker + ' - Open'], axis = 0)
  # Convert the resulting Series to a DataFrame (its single column is named 0)
  ticker_data_bri = pd.DataFrame(ticker_data_bri)
  # Inner-join with the daily tweet polarity so only dates present in both remain
  resultPd = tweetsPd.join(ticker_data_bri, how = 'inner')
  return resultPd

def getDailyPrice(ticker, tweetsPd):
  # Download the Open and Close columns for this ticker from Quandl
  ticker_data = quandl_stocks(ticker, (2017, 9, 1))
  # Keep only the daily opening price
  ticker_data_bri = ticker_data['WIKI/' + ticker + ' - Open']
  # Convert the Series to a DataFrame and inner-join with the daily tweet polarity
  ticker_data_bri = pd.DataFrame(ticker_data_bri)
  resultPd = tweetsPd.join(ticker_data_bri, how = 'inner')
  return resultPd


In [13]:
# RUNNING SENTIMENT API && MASSAGING DATA
sentiment_data = [analize_sentiment(tweetText) for tweetText in tweets_text]
# tweets_date lists each tweet's creation date in '%Y-%m-%d' format
tweets_date = [(parser.parse(tweet['created_at'])).strftime('%Y-%m-%d') for tweet in tweets]

# Build a dictionary mapping each tweet date to (sum of polarity, tweet count)
# so that all tweets from the same day contribute to a single mean polarity
# for that day. The per-day means are stored in tweetsPd.
tweets_plo_key_date = {}
for i in range(len(tweets_date)):
  if tweets_date[i] in tweets_plo_key_date:
    #below, add all polarity in the same date
    tweets_plo_key_date[tweets_date[i]] = (tweets_plo_key_date[tweets_date[i]][0] + sentiment_data[i],tweets_plo_key_date[tweets_date[i]][1]+1)
  else:
    tweets_plo_key_date[tweets_date[i]] = (sentiment_data[i],1)
tweets_plo_key_date1 = {}
for i in tweets_plo_key_date:
  tweets_plo_key_date1[i] = float(tweets_plo_key_date[i][0])/tweets_plo_key_date[i][1]

tweetsPd = pd.DataFrame(tweets_plo_key_date1.items(), columns=['Date', 'DateValue']).set_index('Date').sort_index()

# EXTRACTING Plottable DATA
tickers   = ['XOM', 'AAPL', 'GEO', 'HCA']
deltaTick = []
corrs     = []
prices = []
for ticker in tickers:
  print ticker
  changes = getDailyChanges(ticker, tweetsPd)
  pri = getDailyPrice(ticker, tweetsPd)
  prices.append(pri)
  deltaTick.append(changes)
  corrs.append(changes.corr()['DateValue'][0])
  
print corrs

XOM
AAPL
GEO
HCA
[-0.007795146384413103, 0.103357078112181, -0.06436174961034623, 0.007543110270724468]


In [15]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.graph_objs import *
import numpy as np

plots = []
for i in range(len(tickers)):
  data = list(deltaTick[i][0])
  trace = go.Box(y=data,name=tickers[i])
  plots.append(trace)
  
  
layout = Layout(title = "Box and Whisker Plots of Selected Stocks on Days that Trump Tweeted", yaxis = {'title':'% change (stock price)'})
figure = Figure(data = plots,layout = layout)
py.iplot(figure)
  

In [20]:
# Compare the Pearson correlation between daily tweet polarity and daily
# price change for each ticker (bar chart)
import plotly.plotly as py
import plotly.graph_objs as go

data = [go.Bar(
            x=tickers,
            y=corrs,
            name='Pearson r'
    )]

layout = go.Layout(title = "Correlation between Tweet Sentiment and Daily Stock Change, by Ticker", yaxis = {'title':'Pearson r'})
figure = go.Figure(data = data,layout = layout)
py.iplot(figure)


# Results: summarize and visualize the results discovered from the analysis



The chart below compares the polarity of Trump's tweets with the stock price of Apple Inc. We chose Apple because of Trump's tax plan: the plan means that companies holding lots of cash abroad, such as Apple (AAPL), would finally bring their money back to the U.S.

The blue line shows how the stock price fluctuates, and the orange line shows the polarity of Trump's tweets.

In [20]:
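# deltaTick[1] holds AAPL: column 0 is (Open - Close) / Open,
# 'DateValue' is the mean tweet polarity for the day
aaplprice = list(deltaTick[1][0])
aaplpol = list(deltaTick[1]['DateValue'])

# Use the index of the same frame for the x-axis; prices[1] starts on a
# different date, so its index does not line up with these values.
trace1 = go.Scatter(
    x = deltaTick[1].index,
    y = aaplprice,
    name='stock price',
    mode = 'lines'
)
trace2 = go.Scatter(
    x = deltaTick[1].index,
    y = aaplpol,
    name='tweets polarity',
    yaxis='y2',
    mode = 'lines+markers'
)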
data = [trace1, trace2]
layout = go.Layout(
    title='Line Chart with polarity and stock price',
    yaxis=dict(
        title='stock price change %',
        range=[-0.05, 0.04]
    ),
    yaxis2=dict(
        title='tweets polarity',
        titlefont=dict(
            color='rgb(148, 103, 189)'
        ),
        tickfont=dict(
            color='rgb(148, 103, 189)'
        ),
        overlaying='y',
        side='right',
        range=[-0.6,2]
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Line Chart with polarity and stock price')

In [41]:
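aaplprice = list(deltaTick[1][0])
aaplpol = list(deltaTick[1]['DateValue'])
# Count the days on which tweet polarity and the price-change value share a
# sign (a zero on either side also counts as agreement)
aaplCorCounter = 0
aaplCorCounterp = 0
for i in range(len(aaplpol)):
  if (aaplprice[i]*(aaplpol[i]))>=0:
    aaplCorCounterp+=1
  aaplCorCounter+=1

# 'Positive' = signs agree, 'Negative' = signs differ
labels = ['Negative','Positive']
values = [aaplCorCounter - aaplCorCounterp, aaplCorCounterp]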

trace = go.Pie(labels=labels, values=values)

py.iplot([trace], filename='basic_pie_chart')

The scatter plot shows the relationship between the polarity of Trump's tweets and the daily price change of Apple Inc. stock. As we can see, the correlation is not strong (the correlation coefficient computed above for AAPL is about 0.10). Stock prices are related to many other factors.
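For reference, the correlation coefficient behind this statement can be computed by hand. The sketch below is a plain implementation of the Pearson coefficient, with `pearson_r` as a hypothetical helper name. The notebook itself obtains the value from the pandas `corr()` method.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either series is constant
    return cov / math.sqrt(var_x * var_y)
```

As a sense of scale, a coefficient of about 0.10 corresponds to an r squared of about 0.01, meaning a linear fit on polarity would account for roughly 1% of the variance in the daily price change.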

In [17]:

aapldeltastock = list(deltaTick[1][0])
aaplpolarity = list(deltaTick[1]['DateValue'])
scatterplot = [Scatter(x = aapldeltastock, y = aaplpolarity, mode = 'markers')]
layout = Layout(title = "Scatter Plot for AAPL - Price Difference versus Polarity", xaxis = {'title':'Price Difference'}, yaxis = {'title':'Polarity'})
figure = Figure(data = scatterplot, layout = layout)
py.iplot(figure)

*-----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. Each team will present its case study in class for 10 minutes.

Please compress all the files into a zip file.


**How to submit:**

        Please submit through Canvas, in the Assignment "Case Study 1".
        
**Note: Each team only needs to submit one submission in Canvas**


# Peer-Review Grading Template:

**Total Points: (100 points)** Please don't worry about the absolute scores; we will rescale the final grading according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (5 points) how well did the team describe the problem they are trying to solve using twitter data? 
       0: not clear
       1: I can barely understand the problem
       2: okay, can be improved
       3: good, but can be improved
       4: very good
       5: crystal clear
    
    2. (10 points) Do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, whose answer I want to know myself
       10: very important, I would be happy to invest money in a project like this.
    
    ----------------------------------
    Data Collection:
    ----------------------------------
    
    3. (10 points) Do you think the data collected are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale (> 300 tweets)
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following tasks:
    (1) Word Count (5 points):
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (2) Find the most popular tweets in your collection of tweets: (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (3) Find popular twitter entities  (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect

    (4) Find user's followers and friends (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect

    -----------------------------------
    The Solution
    -----------------------------------
    5.  How well did the team describe the solution they used to solve the problem? 
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
    6. How well does the solution solve the problem? 
       0: not relevant
       1: barely relevant to the problem
       2: okay solution, but there is an easier solution.
       3: good, but can be improved
       4: very good, but solution is simple/old
       5: innovative and technically sound
       
    7. How well did the team implement the solution in Python? 
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
     8.  How well did the team present the results they found in the data? 
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
     9.  What do you think of the results they found in the data? 
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: make sense, but not interesting
       5: make sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How well do all the different parts (data, problem, solution, result) fit together as a coherent story?  
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 10 minutes for presentation?  
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. How would you rate the presentation (overall quality)?  
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


    -----------------------------------
    Overall: 
    -----------------------------------
    13. How many points out of 100 do you give to this project in total?  Please don't worry about the absolute scores; we will rescale the final grades according to the performance of all teams in the class.
    Total score:
    
    14. What are the strengths of this project? Briefly, list up to 3 strengths.
       1: 
       2:
       3:
    
    15. What are the weaknesses of this project? Briefly, list up to 3 weaknesses.
       1:
       2:
       3:
    
    16. Detailed comments and suggestions. What suggestions do you have for improving the quality of this project further?
    
    
    

    ---------------------------------
    Your Vote: 
    ---------------------------------
    1. [Overall Quality] Between the two submissions that you are reviewing, which team would you vote to receive the better score?  
       -1: I vote the other team is better than this team
        0: the same
        1: I vote this team is better than the other team 
        
    2. [Presentation] Among all the teams in the presentation, which team do you think deserves the best presentation award for this case study?  
        1: Team 1
        2: Team 2
        3: Team 3
        4: Team 4
        5: Team 5
        6: Team 6
        7: Team 7
        8: Team 8
        9: Team 9
       10: Team 10

