## Module-3 Twitter Data

This is module 3 (M3) of 3 for the Twitter dataset. It covers the prediction and inference part of Q29.

### QUESTION 29

**Describe task**

Given a set of tweets, we identify the entities present in the dataset by applying a TextRank phrase-extraction algorithm to the tweet text, followed by fuzzy matching. The control parameters are: 'number of top phrases per tweet'; 'minimum frequency of phrase' for a phrase to be considered an entity; and 'minimum number of other close phrases present'. After identifying the entities, we extract the keywords closest to each entity to understand the context in which it is being discussed. We then find the set of tweets that mention the entity and rank them with a PageRank algorithm to generate a summary consisting of the top 4 tweets. <br>

We predict the closest key phrases, summary, and sentiment for each entity, either per day or, on game day (1st Feb), per 10-minute interval. <br>
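
The entity-detection step described above can be illustrated with a minimal sketch. It is not the module 2 implementation: `difflib.SequenceMatcher` stands in for the fuzzy matcher, and the function name, the thresholds, and the way the control parameters are applied are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stdlib stand-in for a fuzzy-matching score."""
    return SequenceMatcher(None, a, b).ratio()


def find_entities(phrases, min_freq=3, min_close=1, threshold=0.8):
    """Keep phrases that are frequent and have enough close variants.

    phrases   : flat list of top phrases collected across tweets
    min_freq  : hypothetical 'minimum frequency of phrase'
    min_close : hypothetical 'minimum number of other close phrases present'
    threshold : hypothetical similarity cut-off for two phrases to be 'close'
    """
    counts = Counter(phrases)
    entities = []
    for phrase, freq in counts.items():
        if freq < min_freq:
            continue
        # count distinct other phrases that are near-duplicates of this one
        close = sum(
            1
            for other in counts
            if other != phrase and similarity(phrase, other) >= threshold
        )
        if close >= min_close:
            entities.append(phrase)
    return entities


phrases = ["katyperry"] * 4 + ["katy perry"] * 2 + ["halftime"] * 5 + ["church"]
print(find_entities(phrases))
```

In this toy input, 'halftime' is frequent but has no close variant, so only a phrase that passes both thresholds is kept.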

To run the script, you will need the following:
1. './twitter_files_v3/entities.pkl' - Dictionary of entities generated in module 2, also provided in the zip file
2. './twitter_files_v3/prediction_data.pkl' - Prediction data generated in module 2, also provided in the zip file
3. './glove/glove.6B.100d.txt' - Glove embeddings

For each task, **you need to provide 4 inputs**:
1. **entity** (from the list of entities)
2. **pred_type** (game_day predicts for the last 10 min; reg_day predicts for the entire day)
3. **task_type** (one of "sentiment", "summary", "keywords")
4. **date** (format %Y-%m-%d for reg_day; %Y-%m-%d %H:%M:%S for game_day)

The 3 task types are: 
1. 'sentiment': reports the overall sentiment of the tweets about a given entity for the given day, or for the last 10 min if the prediction type is game_day.
2. 'summary': returns a list of 4 tweets that summarize the tweets about a given entity for the given day, or for the last 10 min if the prediction type is game_day.
3. 'keywords': reports the 10 key phrases that appear in the context of a given entity for the given day, or for the last 10 min if the prediction type is game_day.
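
How the two prediction types translate into a time window can be sketched as follows. The helper name `prediction_window` is hypothetical, and the half-open day interval for reg_day is an assumption that mirrors filtering on the calendar date.

```python
from datetime import datetime, timedelta


def prediction_window(date: str, pred_type: str):
    """Return the (start, end) datetimes covered by a prediction request."""
    if pred_type == "reg_day":
        # whole calendar day: [midnight, next midnight)
        start = datetime.strptime(date, "%Y-%m-%d")
        return start, start + timedelta(days=1)
    if pred_type == "game_day":
        # the 10 minutes leading up to the given timestamp
        end = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
        return end - timedelta(minutes=10), end
    raise ValueError(f"unknown pred_type: {pred_type!r}")


print(prediction_window("2015-01-18", "reg_day"))
print(prediction_window("2015-02-01 15:20:00", "game_day"))
```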

In [1]:
import numpy as np
import random
import pandas as pd
import orjson as json
import time
from datetime import datetime,timedelta

import regex as re
import spacy
import pytextrank
import multiprocessing as mp
from multiprocessing import Pool
import pickle

from fuzzywuzzy import fuzz
import textblob

import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")
num_cores = 4 #number of cores on your machine
num_partitions = 16 #number of partitions to split dataframe

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

# nltk.download('stopwords')

import pytz
pst_tz = pytz.timezone('America/Los_Angeles')
utc_tz = pytz.utc

In [2]:
## Load extracted files from M2

## list of entities extracted
with open('./twitter_files_v3/entities_v2.pkl', 'rb') as entities_file:
    entities_dict = pickle.load(entities_file)
entities = list(entities_dict.values())

## prediction data
with open('./twitter_files_v3/prediction_data_v2.pkl', 'rb') as df_file:
    prediction_df = pickle.load(df_file)

## datetime conversions
prediction_df['citation_dt_trans'] = prediction_df['citation_datetime'].apply(lambda x: datetime.fromtimestamp(x, pst_tz))
prediction_df['utc_datetime'] = prediction_df['citation_datetime'].apply(lambda x: datetime.fromtimestamp(x, utc_tz))
prediction_df['date'] = pd.to_datetime(prediction_df['citation_dt_trans']).dt.date
prediction_df['datetime'] = prediction_df['citation_dt_trans'].apply(lambda x: str(x).rsplit('-', 1)[0])
prediction_df['datetime'] = pd.to_datetime(prediction_df['datetime'])
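
The conversion from epoch seconds to a local timestamp can be checked on a single value with the standard library alone. This is a sketch, not the notebook's code path: the fixed UTC-8 offset is an assumption that stands in for America/Los_Angeles and only holds outside daylight saving time (January and February here), and `epoch_to_local` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset, valid for Pacific time only outside DST
PST = timezone(timedelta(hours=-8), "PST")


def epoch_to_local(ts: int, tz=PST):
    """Return (timezone-aware, naive local) datetimes for an epoch timestamp."""
    aware = datetime.fromtimestamp(ts, tz)
    # dropping tzinfo gives the same wall-clock time without the offset suffix
    naive = aware.replace(tzinfo=None)
    return aware, naive


aware, naive = epoch_to_local(1421518778)
print(aware.isoformat())  # 2015-01-17T10:19:38-08:00
print(naive)              # 2015-01-17 10:19:38
```

Using `replace(tzinfo=None)` avoids parsing the string form of the timestamp to strip the offset.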


In [3]:
# prediction_df.head()

Unnamed: 0,index,0,tweet_id,phrase,clean_phrase,count,entity,file,text,citation_datetime,clean_text,citation_dt_trans,utc_datetime,date,datetime
0,"(549327579782840320, #gohawks http://t.co/u1pc...",0.215096,549327579782840320,#gohawks http://t.co/u1pcxpesr8,gohawks,67966,gohawks,gohawks,I &lt;3 our defense! #GoHawks http://t.co/U1pc...,1421518778,i &lt;3 our defense! #gohawks http://t.co/u1pc...,2015-01-17 10:19:38-08:00,2015-01-17 18:19:38+00:00,2015-01-17,2015-01-17 10:19:38
1,"(549327579782840320, our defense)",0.117698,549327579782840320,our defense,our defense,51,Not an entity,gohawks,I &lt;3 our defense! #GoHawks http://t.co/U1pc...,1421518778,i &lt;3 our defense! #gohawks http://t.co/u1pc...,2015-01-17 10:19:38-08:00,2015-01-17 18:19:38+00:00,2015-01-17,2015-01-17 10:19:38
2,"(549575600210718721, #dogslife http://t.co/gd3...",0.158353,549575600210718721,#dogslife http://t.co/gd3v6vqps5,dogslife,6,,gohawks,twelfth dogs are ready! #gohawks #dogslife htt...,1421259536,twelfth dogs are ready! #gohawks #dogslife htt...,2015-01-14 10:18:56-08:00,2015-01-14 18:18:56+00:00,2015-01-14,2015-01-14 10:18:56
3,"(549575600210718721, twelfth)",0.157154,549575600210718721,twelfth,twelfth,25,Not an entity,gohawks,twelfth dogs are ready! #gohawks #dogslife htt...,1421259536,twelfth dogs are ready! #gohawks #dogslife htt...,2015-01-14 10:18:56-08:00,2015-01-14 18:18:56+00:00,2015-01-14,2015-01-14 10:18:56
4,"(549647876406534144, gohawks)",0.196769,549647876406534144,gohawks,gohawks,67966,gohawks,gohawks,"""Oh no big deal, just NFC West Champs and the ...",1421468519,"""oh no big deal, just nfc west champs and the ...",2015-01-16 20:21:59-08:00,2015-01-17 04:21:59+00:00,2015-01-16,2015-01-16 20:21:59


In [4]:
entities[:50]

['superbowl',
 'superbowlxlix',
 'patriots',
 'seahawks',
 'gohawks',
 'patriotswin nfl',
 'katyperry',
 'tom brady',
 'seattle',
 'halftime',
 'football',
 'pats',
 'superbowlcommercials',
 'gopats',
 'superbowlsunday',
 'seattleseahawks',
 'touchdown',
 'commercials',
 'new england',
 'superbowl2015',
 'marshawn lynch',
 'katy',
 'patsnation',
 'new england patriots',
 'missyelliott',
 'budweiser',
 'this game',
 'sb49 superbowl',
 'patriotsnation',
 'russell wilson',
 'katyperry superbowl',
 'packers',
 'wilson',
 'people',
 'halftime show',
 'pete carroll',
 'america',
 'chris matthews',
 'beastmode',
 'dangerusswilson',
 'allyouneedisecuador',
 'nflplayoffs',
 'lenny kravitz',
 'bill belichick',
 'last year',
 'next year',
 'los',
 'tom',
 'national anthem',
 'belichick']

### TASK 1: Get key phrases for a given entity for each day or the last 10 min on game day

In [5]:
#### For each entity, get the top n key phrases that co-occur with it
def get_n_close_phrases(data: pd.DataFrame, entity: str, date: str, pred_type = 'reg_day', n = 10):
    try:
    
        # tweets corresponding to the entity
        if(pred_type == 'reg_day'):
            # compare dates as strings without modifying the caller's dataframe
            tmp = data[(data['entity'] == entity) & (data['date'].astype(str) == date)]
            tmp = tmp.drop(['file'], axis =1)
            tmp = tmp.drop_duplicates()
            
        elif(pred_type == 'game_day'):
            d = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
            d_prev = d - timedelta(minutes=10)
            tmp = data[(data['entity'] == entity) & (data['datetime'] >= d_prev) & \
                  (data['datetime'] <= d)]
            tmp = tmp.drop(['file'], axis =1)
            tmp = tmp.drop_duplicates()
            
        tweet_ids = list(set(tmp['tweet_id']))

        ## weighted score for other phrases from the tweets
        # get relevant tweet data
        tmp = data[data['tweet_id'].isin(tweet_ids)]
    
        # remove rows corresponding to the entity itself
        tmp = tmp[tmp['entity'] != entity]
        
        phrase_counts = tmp.groupby(['clean_phrase']).size().reset_index(name = 'count')
        tmp = tmp.drop(['count', 'file'], axis = 1)
        tmp = tmp.drop_duplicates()
        tmp = pd.merge(tmp, phrase_counts, how = 'left', on = 'clean_phrase')
    
        # get weighted scores
        tmp['weighted_score'] = tmp.apply(lambda row: row[0] * row['count'], axis = 1)
    
        other_phrases = tmp.groupby(['clean_phrase'])['weighted_score'].sum()
        other_phrases = other_phrases.reset_index()
        other_phrases = other_phrases.sort_values('weighted_score', ascending = False)
        print(other_phrases[:n])
        
    except Exception:
        print("Entity not important in the day/interval!")
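
The weighting used for key phrases can be sketched in plain Python: each phrase's rank score is multiplied by how often that phrase occurs among the tweets that mention the entity, and the products are summed per phrase. The row layout `(tweet_id, phrase, rank_score)` and the function name are hypothetical, and the sketch omits the de-duplication steps of the function above.

```python
from collections import Counter, defaultdict


def top_close_phrases(rows, entity, n=10):
    """rows: iterable of (tweet_id, phrase, rank_score) tuples (assumed layout)."""
    rows = list(rows)
    # tweets that mention the entity
    tweet_ids = {tid for tid, phrase, _ in rows if phrase == entity}
    # every other phrase extracted from those tweets
    others = [(p, s) for tid, p, s in rows if tid in tweet_ids and p != entity]
    counts = Counter(p for p, _ in others)
    # weighted score: rank score times phrase frequency, summed per phrase
    scores = defaultdict(float)
    for p, s in others:
        scores[p] += s * counts[p]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


rows = [
    (1, "katyperry", 0.2), (1, "halftime", 0.1),
    (2, "katyperry", 0.3), (2, "halftime", 0.2), (2, "superbowl", 0.4),
    (3, "superbowl", 0.5),  # tweet 3 never mentions the entity, so it is ignored
]
print(top_close_phrases(rows, "katyperry"))
```

Here 'halftime' outranks 'superbowl' even though its individual scores are lower, because it appears in both tweets about the entity.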

In [6]:
get_n_close_phrases(prediction_df, 'katyperry', '2015-01-18', 'reg_day')

     clean_phrase  weighted_score
96  superbowlxlix       60.837808
72       patriots       40.829373
35       halftime       18.224625
87       seahawks       12.932766
93     super bowl       11.643691
95      superbowl        5.264207
39           katy        4.418890
22       el medio        1.691668
57   medio tiempo        1.374621
71   para su show        0.947675


In [7]:
get_n_close_phrases(prediction_df, 'john legend', '2015-02-01 15:20:00', 'game_day')

           clean_phrase  weighted_score
558           superbowl    26438.615728
572       superbowlxlix    19883.578545
493                sb49     4912.672432
57              america     2285.457083
410     national anthem       91.130337
568     superbowlsunday       76.039311
354    love john legend       51.594259
315  john legends voice       49.223582
143              church       32.807374
361                 man       30.259418


### TASK 2: Get summary for a given entity for each day or the last 10 min on game day

In [8]:
# Extract word vectors
word_embeddings = {}
with open('./glove/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

# remove stopwords from a list of tokens and rejoin the rest into a string
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
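
A tweet is later represented by the average of its word vectors. The sketch below shows that idea on a toy two-dimensional embedding table; `sentence_vector` and the toy vectors are hypothetical, and it divides by the exact token count, whereas the notebook adds 0.001 to the denominator.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]


# toy embeddings, for illustration only
toy = {"great": [1.0, 0.0], "game": [0.0, 1.0]}
print(sentence_vector(["great", "game", "tonight"], toy, dim=2))
```

Because out-of-vocabulary tokens such as 'tonight' contribute zeros, they pull the average toward the origin rather than being skipped.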

In [9]:
#### For each entity get data for the entity and date
def get_subset_data(data: pd.DataFrame, entity: str, date: str, pred_type = 'reg_day'):
    try:
    
        # tweets corresponding to the entity
        if(pred_type == 'reg_day'):
            # compare dates as strings without modifying the caller's dataframe
            tmp = data[(data['entity'] == entity) & (data['date'].astype(str) == date)]
            tmp = tmp.drop(['file'], axis =1)
            tmp = tmp.drop_duplicates()
            
        elif(pred_type == 'game_day'):
            d = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
            d_prev = d - timedelta(minutes=10)
            tmp = data[(data['entity'] == entity) & (data['datetime'] >= d_prev) & \
                  (data['datetime'] <= d)]
            tmp = tmp.drop(['file'], axis =1)
            tmp = tmp.drop_duplicates()
            
        tweet_ids = list(set(tmp['tweet_id']))
        
        ## weighted score for other phrases from the tweets
        # get relevant tweet data
        tmp = data[data['tweet_id'].isin(tweet_ids)]
        tmp = tmp.drop(['count', 'file'], axis = 1)
        tmp = tmp.drop_duplicates()
    
        # remove rows corresponding to the entity itself
        tmp = tmp[tmp['entity'] != entity]
        phrase_counts = tmp.groupby(['clean_phrase']).size().reset_index(name = 'count')
        tmp = pd.merge(tmp, phrase_counts, how = 'left', on = 'clean_phrase')
    
        # get weighted scores
        tmp['weighted_score'] = tmp.apply(lambda row: row[0] * row['count'], axis = 1)
        
        return tmp
    except Exception:
        print("Not enough data")

In [10]:
def get_topn_tweets(data: pd.DataFrame, entity: str, date: str, pred_type = 'reg_day', n=4):
    
    sub_data = get_subset_data(data, entity, date, pred_type)
    
    ## process top 100 candidates at max according to important phrases
    filter_df = sub_data.groupby(['tweet_id'])['weighted_score'].sum().reset_index()
    if(filter_df.shape[0] > 100):
        filter_df = filter_df.sort_values('weighted_score', ascending = False)[:100]
        tweet_ids = list(set(filter_df['tweet_id']))
        sub_data = sub_data[sub_data['tweet_id'].isin(tweet_ids)]
        
    sentences = list(set(sub_data['text']))
    
    # clean sentences
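    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)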
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
     
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    # return at most n tweets; slicing avoids an IndexError when fewer are available
    return [s for _, s in ranked_sentences[:n]]
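
The ranking step (cosine similarity between tweet vectors, then PageRank over the similarity graph) can be sketched without external libraries. This is an illustrative power-iteration version, not the `networkx` implementation used above; it assumes non-negative similarities, and the names `cosine`, `pagerank` and `top_n` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pagerank(sim, damping=0.85, iters=100, tol=1e-8):
    """Power iteration on a weighted similarity matrix (non-negative weights assumed)."""
    n = len(sim)
    if n == 0:
        return []
    out = [sum(row) for row in sim]  # total edge weight leaving each node
    scores = [1.0 / n] * n
    for _ in range(iters):
        # score held by nodes without edges is spread uniformly
        dangling = sum(scores[j] for j in range(n) if out[j] == 0)
        new = []
        for i in range(n):
            rank = sum(scores[j] * sim[j][i] / out[j] for j in range(n) if out[j] > 0)
            new.append((1 - damping) / n + damping * (rank + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def top_n(sentences, vectors, n=4):
    """Rank sentences by PageRank score and return at most n of them."""
    size = len(sentences)
    sim = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(size)]
        for i in range(size)
    ]
    scores = pagerank(sim)
    order = sorted(range(size), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:n]]


sentences = ["a", "b", "c", "d"]
vectors = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
print(top_n(sentences, vectors, n=2))
```

Sentences that are similar to many others accumulate score, so the outlier 'd' ends up last in the ranking.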

In [11]:
get_topn_tweets(prediction_df, 'peyton manning', '2015-01-18')

["It's over. Let's see if Tom Brady can play better than Peyton Manning against the #Seahawks in the #SuperBowl #NFLPlayoffs #INDvsNE",
 "Andrew Luck has taken the torch from Peyton Manning as the next great #Colts QB that can't get past the #Patriots in the playoffs.",
 'Peyton Manning had his chance last year. Tom Brady gets his chance against the Seahawks this year. #Brady #Patriots',
 "Real #Patriots fans should be happy Seattle won, now Tom Brady can do what Peyton Manning couldn't do last year."]

In [12]:
get_topn_tweets(prediction_df, 'john legend', '2015-02-01 15:20:00', 'game_day')

['John legend 😍 #imean #yes #Superbowl #',
 'John Legend was very good! :) #SuperBowl',
 'Love John Legend #SuperBowl',
 'John Legend getting down #SuperBowl']

### TASK 3: Get sentiment for a given entity for each day or the last 10 min on game day

In [21]:
def get_sentiment(data: pd.DataFrame, entity: str, date: str, pred_type = 'reg_day'):
    
    sub_data = get_subset_data(data, entity, date, pred_type)
    sentences = list(set(sub_data['text']))
    
    # clean sentences
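    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)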
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    
    polarities_ls = []
    for i in clean_sentences:
        polarities_ls.append(textblob.TextBlob(i).sentiment.polarity)
    sentiment_score =  sum(polarities_ls)/len(polarities_ls)
    sentiment = 'Neutral'
    if(sentiment_score > 0.05):
        sentiment = 'Positive'
    if(sentiment_score < -0.05):
        sentiment = 'Negative'
    
    print("Overall sentiment is: ", sentiment,", with score:", sentiment_score)
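
The mapping from per-tweet polarities to an overall label can be isolated in a small helper. This is a sketch that mirrors the ±0.05 thresholds used above; `label_sentiment` is a hypothetical name, and treating an empty input as neutral is an assumption rather than the notebook's behavior.

```python
def label_sentiment(polarities, threshold=0.05):
    """Return (label, mean polarity) for a list of polarity scores in [-1, 1]."""
    if not polarities:
        # assumption: no tweets means no evidence either way
        return "Neutral", 0.0
    score = sum(polarities) / len(polarities)
    if score > threshold:
        return "Positive", score
    if score < -threshold:
        return "Negative", score
    return "Neutral", score


print(label_sentiment([0.5, 0.0, -0.1]))
print(label_sentiment([0.04, -0.04]))
```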

In [14]:
get_sentiment(prediction_df, 'peyton manning', '2015-01-18')

Overall sentiment is:  Positive , with score: 0.13933725005153577


In [15]:
get_sentiment(prediction_df, 'john legend', '2015-02-01 15:20:00', 'game_day')

Overall sentiment is:  Positive , with score: 0.192128009052351


### PREDICTION

In [19]:
def validate(datetime_string, pred_type):
    try:
        if(pred_type == 'reg_day'):
            return datetime.strptime(datetime_string,"%Y-%m-%d")
        elif(pred_type == 'game_day'):
            return datetime.strptime(datetime_string,"%Y-%m-%d %H:%M:%S")

    except ValueError:
        return False
        
def perform_task (entity, date, task, pred_type):
    if(task not in ['sentiment', 'summary', 'keywords'] ):
        print("Task can only be - sentiment, summary or keywords!")
    elif(pred_type not in ['reg_day', 'game_day']):
        print("Prediction type can only be - reg_day or game_day!")
    elif(entity not in entities):
        print("Entity not in data!")
        print("Try entities - katyperry, tom brady, peyton manning.. (check entities file for more)!")
    elif(validate(date, pred_type) == False):
        print("Date format not valid!")
        print("Try date in format - %Y-%m-%d for reg_day and %Y-%m-%d %H:%M:%S for game_day!")
    else:
        if(task == 'keywords'):
            get_n_close_phrases(prediction_df, entity, date, pred_type)
        elif(task == 'sentiment'):
            get_sentiment(prediction_df, entity, date, pred_type)
        elif(task == 'summary'):  
            results = get_topn_tweets(prediction_df, entity, date, pred_type)
            print(results)
        else:
            print("Unknown error occurred! Please check input!")
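
The input checks performed before dispatching a task can also be written as a function that returns an error message instead of printing it, which makes them easier to test. This is a sketch under that assumption; `check_request`, `DATE_FORMATS` and `TASKS` are hypothetical names, and the messages are copied from the function above.

```python
from datetime import datetime

DATE_FORMATS = {"reg_day": "%Y-%m-%d", "game_day": "%Y-%m-%d %H:%M:%S"}
TASKS = ("sentiment", "summary", "keywords")


def check_request(entity, date, task, pred_type, known_entities):
    """Return an error message, or None when the request is valid."""
    if task not in TASKS:
        return "Task can only be - sentiment, summary or keywords!"
    if pred_type not in DATE_FORMATS:
        return "Prediction type can only be - reg_day or game_day!"
    if entity not in known_entities:
        return "Entity not in data!"
    try:
        # strptime raises ValueError when the string does not match the format
        datetime.strptime(date, DATE_FORMATS[pred_type])
    except ValueError:
        return "Date format not valid!"
    return None


known = ["katyperry", "tom brady"]
print(check_request("tom brady", "2015-02-01 15:20:00", "sentiment", "game_day", known))
print(check_request("tom brady", "2015-02-01", "sentiment", "game_day", known))
```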

In [22]:
task = input('Task to be performed [sentiment, summary, keywords]: ')
entity = input('Entity: ')
pred_type = input('Prediction type [reg_day, game_day]: ')
date = input('Date: ')
perform_task(str(entity), str(date), task, pred_type)

Task to be performed [sentiment, summary, keywords]: sentiment
Entity: tom brady
Prediction type [reg_day, game_day]: game_day
Date: 2015-02-01 15:20:00
Overall sentiment is:  Positive , with score: 0.09779048814873557
