# Case Study 1 : Data Science in Twitter Data

**Required Readings:** 
* Chapter 1 and Chapter 9 of the book [Mining the Social Web](http://cdn.oreillystatic.com/oreilly/booksamplers/9781449367619_sampler.pdf) 
* The codes for [Chapter 1](http://bit.ly/1qCtMrr) and [Chapter 9](http://bit.ly/1u7eP33)
* [TED Talks](https://www.ted.com/talks) for examples of 10 minutes talks.


** NOTE **
* Please don't forget to save the notebook frequently when working in Jupyter Notebook, otherwise the changes you made can be lost.

*----------------------

# Problem: pick a data science problem that you plan to solve using Twitter Data
* The problem should be important and interesting, which has a potential impact in some area.
* The problem should be solvable using twitter data and data science solutions.

Please briefly describe in the following cell: what problem are you trying to solve? why this problem is important and interesting?

In [3]:
'''
The superbowl is a huge US sports event. This year New England Patriots play against the Atlanta Falcons 
in Houston Texas. Twitter data has been collected from the New England, Atlanta and Houston areas.  
This data can be used to identify potential advertising customers. New England , Houston and Atlanta areas represent
different marketing areas. Which area is the better investment for superbowl advertising?

'''

'\nThe superbowl is a huge US sports event. This year New England Patriots play against the Atlanta Falcons \nin Houston Texas. Twitter data has been collected from the New England, Atlanta and Houston areas.  \nThis data can be used to identify potential advertising customers. New England , Houston and Atlanta areas represent\n'

## Data Collection: Download Twitter Data using API

* In order to solve the above problem, you need to collect some twitter data. You could select a topic that is relevant to your problem, and use Twitter API to download the relevant tweets. It is recommended that the number of tweets should be larger than 200, but smaller than 1 million.
* Store the tweets you downloaded into a local file (txt file or json file) 

In [1]:
import twitter
import json
import datetime

CONSUMER_KEY = '4bVYOyoJP3Jwcf5SQflhV4qcT'
CONSUMER_SECRET = 'l45TjHGHN2G5vA0sv5q9xhlM6F9IyQwYIgDnm9qSLD3vjEP1aG'
OAUTH_TOKEN = '825413411089571841-YcKtJ9LEXWMjSdgFVT5eueuJpdGygB1'
OAUTH_TOKEN_SECRET = 'Qx9Q8aIRX6FgloBl5D1dI8ZoZU3x95nNH9NrjS6aULg7A'

MAX_ALLOWED_TWEETS = 2000
# search has a 7-day limit. so they're only ever in the last week.
MAX_DAYS_PER_QUERY = 7
WORLD_WOEID = 1
USA_WOEID = 23424977
MA_WOEID = 2347580 # No Trends Found
WORCESTER_MA_WOEID = 2523945 # No Trends Found
WPI_LAT = 42.2749
WPI_LON = -71.8092

# ---------------------------------------------
# Define a Function to Login Twitter API
def twitter_oauth_login():
    # Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information
    # on Twitter's OAuth implementation.
    # studentllpage1 info

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api
# ----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# ----------------------------------------------
# query_twitter return array of statuses
# twitterAPI  - required logged in twitter handle
# query - required twitter Query to execute  'OR' queries are split into individual queries whose results are combined.
# resultType - optional type of tweets to get, 'popular' (most popular), 'recent' (most recent), or 'mixed' (popular or most recent) (default: mixed)
# lat, lon, radius  - optional latitude, longitude, and radius (in miles) in which to confine search geographically (default: none)
# untilDate - optional date to query tweets until (default: none)
# return array of statuses extracted from query results
#
def query_twitter(twitterAPI, query,
                  resultType = 'mixed',
                  lat = None, lon = None, radius = None,
                  untilDate = None):
    subqueries = query.split(' OR ')
    queryFilters = 'resultType={1}'.format(query, resultType)
    geocodeFilter = None
    if lat and lon and radius:
        geocodeFilter = '{0},{1},{2}mi'.format(lat, lon, radius)
        queryFilters = '{0} geocode={1}'.format(queryFilters, geocodeFilter)
    if untilDate:
        queryFilters = '{0} until={1}'.format(queryFilters, untilDate)

    results = []
    print('Executing Twitter Query: query={0} {1} ...'.format(query, queryFilters))
    for subquery in subqueries:
        queryResults = None
        newQuery = subquery
        print('     subquery: query={0} {1} ...'.format(subquery, queryFilters))
        if geocodeFilter:
            if untilDate:
                queryResults = twitterAPI.search.tweets(q=newQuery, result_type=resultType,
                                                        geocode=geocodeFilter, until=untilDate)
            else:
                queryResults = twitterAPI.search.tweets(q=newQuery, result_type=resultType, geocode=geocodeFilter)
        else:
            if untilDate:
                queryResults = twitterAPI.search.tweets(q=newQuery, result_type=resultType, until=untilDate)
            else:
                queryResults = twitterAPI.search.tweets(q=newQuery, result_type=resultType)
        continueCollectResults = True
        lastCollectedResults = len(results)
        totalCollectedResults = 0
        while continueCollectResults:
            next_results = None
            try:
                next_results = queryResults['search_metadata']['next_results']
            except:  # No more results when next_results doesn't exist
                continueCollectResults = False
            statuses = queryResults['statuses']
            totalCollectedResults += len(statuses)

            results += statuses
            if not (continueCollectResults):
                print('     {0} tweets subquery={1} {2}'.format(len(results) - lastCollectedResults, newQuery, queryFilters))
                lastCollectedResults = len(results)
                break
            # Create a dictionary from next_results, which has the following form:
            # ?max_id=313519052523986943&q=NCAA&include_entities=1
            kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
            queryResults = twitterAPI.search.tweets(**kwargs)
    print('{0} tweets query={1} {2}'.format(len(results), query, queryFilters))
    return results

def query_to_filename(query):
    return query.replace('#','').replace('@','').replace(':','_').replace(' ','_') + '.json'
#
# write 'data' to 'filename' as JSON
#
def write_to_file(data, filename):
    with open(filename, 'w') as f:
        json.dump(data, f)
    f.closed

#
# read 'filename' as JSON into returned data
#
def read_from_file(filename):
    result = None
    with open(filename, 'r') as f:
        result = json.load( f)
    f.closed
    return result
#------------------------------------------------------------

#QUERIES:

queries = [ '#Superbowl OR superbowl OR party',
            '#Superbowl Boston OR superbowl Boston OR party Boston',
            '#Superbowl Worcester OR superbowl Worcester OR party Worcester',
            '#Superbowl Providence OR superbowl Providence OR party Providence',
            '#Superbowl Nashua OR superbowl Nashua OR party Nashua',
            '#Superbowl Atlanta OR superbowl Atlanta OR party Atlanta',
            '#Superbowl Macon OR superbowl Macon OR party Macon',
            '#Superbowl Houston OR superbowl Houston OR party Houston'
            ]
twitterapi= twitter_oauth_login()
today = datetime.datetime.today().strftime('%Y-%m-%d')
totalTweets = 0
for query in queries:
    filename = today + '.' + query_to_filename(query)
    twitter_statuses = query_twitter(twitterapi,query,'mixed')
    #twitter_statuses = query_twitter_dateranges(twitterapi, '2011-01-31 2011-02-06  2012-01-29', query, 'mixed')
    write_to_file(twitter_statuses, filename)
    totalTweets += len(twitter_statuses)
print('{0} total tweets {1} queries'.format(totalTweets, len(queries)))



Executing Twitter Query: query=#Superbowl OR superbowl OR party resultType=mixed ...
     subquery: query=#Superbowl resultType=mixed ...
     30 tweets subquery=#Superbowl resultType=mixed
     subquery: query=superbowl resultType=mixed ...
     164 tweets subquery=superbowl resultType=mixed
     subquery: query=party resultType=mixed ...
     19 tweets subquery=party resultType=mixed
213 tweets query=#Superbowl OR superbowl OR party resultType=mixed
Executing Twitter Query: query=#Superbowl Boston OR superbowl Boston OR party Boston resultType=mixed ...
     subquery: query=#Superbowl Boston resultType=mixed ...
     30 tweets subquery=#Superbowl Boston resultType=mixed
     subquery: query=superbowl Boston resultType=mixed ...
     30 tweets subquery=superbowl Boston resultType=mixed
     subquery: query=party Boston resultType=mixed ...
     30 tweets subquery=party Boston resultType=mixed
90 tweets query=#Superbowl Boston OR superbowl Boston OR party Boston resultType=mixed
Execut

### Report  statistics about the tweets you collected 

In [None]:
# The total number of tweets collected:  634

# Data Exploration: Exploring the Tweets and Tweet Entities

**(1) Word Count:** 
* Load the tweets you collected in the local file (txt or json)
* compute the frequencies of the words being used in these tweets. 
* Plot a table of the top 30 most-frequent words with their counts

In [30]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

** (2) Find the most popular tweets in your collection of tweets**

Please plot a table of the top 10 most-retweeted tweets in your collection, i.e., the tweets with the largest number of retweet counts.


In [31]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

**(3) Find the most popular Tweet Entities in your collection of tweets**

Please plot the top 10 most-frequent hashtags and top 10 most-mentioned users in your collection of tweets.

In [32]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

Plot a histogram of the number of user mentions in the list using the following bins.

In [None]:
bins=[0, 10, 20, 30, 40, 50, 100]

# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary


 ** (4) Getting "All" friends and "All" followers of a popular user in the tweets**

* choose a popular twitter user who has many followers in your collection of tweets.
* Get the list of all friends and all followers of the twitter user.
* Plot 20 out of the followers, plot their ID numbers and screen names in a table.
* Plot 20 out of the friends (if the user has more than 20 friends), plot their ID numbers and screen names in a table.

In [35]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

In [None]:
'''
The idea of this solution is to collect twitter data, through the python twitter API, related to the superbowl 
and superbowl parties in parts of New England, around the Atlanta Georgia area, and in Houston Texas. 
The collected tweet counts 
'''

Write codes to implement the solution in python:

In [2]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.


In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

*-----------------
# Done

All set! 

** What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: please prepare PPT slides (for 10 minutes' talk) to present about the case study . Each team present their case studies in class for 10 minutes.

Please compress all the files in a zipped file.


** How to submit: **

        Please submit through Canvas, in the Assignment "Case Study 1".
        
** Note: Each team only needs to submit one submission in Canvas **


# Peer-Review Grading Template:

** Total Points: (100 points) ** Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (5 points) how well did the team describe the problem they are trying to solve using twitter data? 
       0: not clear
       1: I can barely understand the problem
       2: okay, can be improved
       3: good, but can be improved
       4: very good
       5: crystal clear
    
    2. (10 points) do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, which I want to know the answer myself
       10: very important, I would be happy invest money on a project like this.
    
    ----------------------------------
    Data Collection:
    ----------------------------------
    
    3. (10 points) Do you think the data collected are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale (> 300 tweets)
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following task:
    (1) Word Count (5 points):
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (2) Find the most popular tweets in your collection of tweets: (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (3) Find popular twitter entities  (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect

    (4) Find user's followers and friends (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect

    -----------------------------------
    The Solution
    -----------------------------------
    5.  how well did the team describe the solution they used to solve the problem? 
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
       
    6. how well is the solution in solving the problem? 
       0: not relevant
       1: barely relevant to the problem
       2: okay solution, but there is an easier solution.
       3: good, but can be improved
       4: very good, but solution is simple/old
       5: innovative and technically sound
       
    7. how well did the team implement the solution in python? 
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
     8.  How well did the team present the results they found in the data? 
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
     9.  How do you think the results they found in the data? 
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: make sense, but not interesting
       5: make sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How all the different parts (data, problem, solution, result) fit together as a coherent story?  
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 10 minutes for presentation?  
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. How well do you think of the presentation (overall quality)?  
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


    -----------------------------------
    Overall: 
    -----------------------------------
    13. How many points out of the 100 do you give to this project in total?  Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.
    Total score:
    
    14. What are the strengths of this project? Briefly, list up to 3 strengths.
       1: 
       2:
       3:
    
    15. What are the weaknesses of this project? Briefly, list up to 3 weaknesses.
       1:
       2:
       3:
    
    16. Detailed comments and suggestions. What suggestions do you have for this project to improve its quality further.
    
    
    

    ---------------------------------
    Your Vote: 
    ---------------------------------
    1. [Overall Quality] Between the two submissions that you are reviewing, which team would you vote for a better score?  
       -1: I vote the other team is better than this team
        0: the same
        1: I vote this team is better than the other team 
        
    2. [Presentation] Among all the teams in the presentation, which team do you think deserves the best presentation award for this case study?  
        1: Team 1
        2: Team 2
        3: Team 3
        4: Team 4
        5: Team 5
        6: Team 6
        7: Team 7
        8: Team 8
        9: Team 9
       10: Team 10

