# **Working with Twitter data**

### Step 1:  Loading the data

To load results you saved previously, run the cells below and set path to the folder that contains the file.

If you are working in Colab, either upload the file to the Colab workspace or copy its path on Drive.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta


In [None]:
path = "D:/other/job/students_project/network_science/maryam_project/data/gymnastics/"
filterPath = "D:/other/job/students_project/network_science/maryam_project/data/gymnastics/filtered_data/"

In [None]:
tweets_df = pd.read_pickle(path+"tweets.pkl")
tweets_df

### Step 2: Preprocessing the data

In our dataframe we have the entire Tweet object. Some columns that might be of particular interest to us are: 

*   created_at - date when Tweet was posted
*   id - unique Tweet identifier
*   text - the content of the Tweet
*   author_id - unique identifier of the user who posted the Tweet
*   retweeted_status - information about the original Tweet
*   public_metrics - quote/reply/retweet/favorite count
*   entities - hashtags, urls, annotations present in Tweet

We can filter the dataframe and keep only the columns we are interested in. You can pick which columns you'd like to keep and put them in a column_list. The cell below simply copies the whole dataframe, so all columns are kept.



In [None]:
tweets_filtered = tweets_df.copy()

tweets_filtered.shape
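As a minimal sketch of the column selection described above, the cell below filters a small made-up dataframe. The names toy_df, toy_filtered and the contents of column_list are illustrative assumptions, not part of the dataset. Later steps rely on the text and entities columns, so keep at least those two.

```python
import pandas as pd

# Toy stand-in for tweets_df (hypothetical data, for illustration only)
toy_df = pd.DataFrame({
    "id": [1, 2],
    "text": ["first tweet", "second tweet"],
    "lang": ["en", "en"],
    "entities": [{"hashtags": [{"tag": "Gymnastics"}]}, None],
})

# Columns we want to keep; edit this list to suit your analysis
column_list = ["id", "text", "entities"]

# Keep only the requested columns that actually exist in the dataframe
available = [c for c in column_list if c in toy_df.columns]
toy_filtered = toy_df[available].copy()

print(toy_filtered.shape)
```

With the real data you would replace toy_df with tweets_df and assign the result to tweets_filtered.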

### Step 3: Extracting words/hashtags

There are many ways to build networks from the data we download from Twitter.

One possibility is to have a bipartite network of Tweets and words/hashtags and then observe word, hashtag or word-hashtag projections.
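To make the bipartite idea concrete, here is a small sketch on made-up data, assuming networkx is installed. The Tweet ids and terms are hypothetical. Tweets form one node set, words/hashtags the other, and the projection connects two terms whenever they share a Tweet, with the number of shared Tweets as the edge weight.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical Tweets and the terms (words/hashtags) they contain
tweet_terms = {
    "t1": ["vault", "#gymnastics"],
    "t2": ["vault", "routine", "#gymnastics"],
    "t3": ["routine"],
}

# Build the bipartite graph: Tweets on one side, terms on the other
B = nx.Graph()
for tweet_id, terms in tweet_terms.items():
    B.add_node(tweet_id, bipartite=0)
    for term in terms:
        B.add_node(term, bipartite=1)
        B.add_edge(tweet_id, term)

# Project onto the term side; weight = number of Tweets two terms share
term_nodes = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}
P = bipartite.weighted_projected_graph(B, term_nodes)

print(P["vault"]["#gymnastics"]["weight"])
```

The rest of this notebook builds the word-hashtag co-occurrence network directly, without creating the Tweet nodes first.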

#### Extracting words

In order to extract words, we first need to clean the Tweet text. Cleaning removes punctuation as well as hashtags, mentions and URLs (these are preserved in the entities column anyway). We will also convert all letters to lowercase.

You can also consider removing stopwords, removing words that are not in the English-language corpora, lemmatizing the words, etc. I suggest you explore the NLTK library and what it offers.

In [None]:
import re
import string
# NLTK tools
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words("english")
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
from collections import defaultdict
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

In [None]:
def cleaner(tweet):
    tweet = re.sub(r"(?:https?://|www\.)\S+", "", tweet) # remove links
    tweet = re.sub(r"@\w+", "", tweet) # remove mentions (usernames may contain underscores)
    tweet = re.sub(r"#\w+", "", tweet) # remove hashtags
    tweet = " ".join(tweet.split()) # collapse repeated whitespace
    # keep only English dictionary words that are not stop words (this also drops punctuation)
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) if w.lower() in words and w.lower() not in stop_words)
    lemma_function = WordNetLemmatizer()
    tweet = " ".join(lemma_function.lemmatize(token, tag_map[tag[0]]) for token, tag in nltk.pos_tag(nltk.wordpunct_tokenize(tweet))) # lemmatize
    tweet = tweet.lower() # to lowercase
    return tweet
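If you want to check the regular-expression steps on their own, without downloading any NLTK resources, a regex-only variant can be sketched as follows. The function basic_clean and the sample Tweet are hypothetical. This version does not remove stop words or lemmatize.

```python
import re
import string

MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def basic_clean(text):
    """Regex-only cleaning: strips links, mentions, hashtags and punctuation."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    # delete every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    # lowercase and collapse repeated whitespace
    return " ".join(text.lower().split())

sample = "Great routine by @simone_b at the #Olympics! https://example.com/video"
print(basic_clean(sample))  # great routine by at the
```

Running a few Tweets through such a function is a quick way to see what each substitution removes before applying the full cleaner to the whole dataframe.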

In [None]:
tweets_filtered["clean_text"] = tweets_filtered["text"].map(cleaner)

In [None]:
tweets_filtered

We are going to loop through the dataframe and then through the words in the clean text. We will add each word as a key to a dictionary and use its frequency as the value.

In [None]:
tweets_filtered.loc[tweets_filtered["clean_text"].isnull(),"clean_text"] = ""

In [None]:
uniqueTweets = tweets_filtered.copy()
uniqueTweets = uniqueTweets.drop_duplicates('clean_text')
uniqueTweets

In [None]:
tweet_tokenizer = nltk.TweetTokenizer()

#initialize an empty dict
unique_words = {}

for idx, row in uniqueTweets.iterrows():
  if row["clean_text"] != "":
    for word in tweet_tokenizer.tokenize(row["clean_text"]):
      unique_words.setdefault(word,0)
      unique_words[word] += 1
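As an alternative sketch, collections.Counter from the standard library does the same counting in one expression. The list clean_texts below is made-up data standing in for the clean_text column, and splitting on whitespace is a simplification of the TweetTokenizer used above.

```python
from collections import Counter

# Hypothetical cleaned Tweets; the empty string mimics a Tweet with no words left
clean_texts = ["great vault routine", "", "great floor routine"]

# Count every whitespace-separated word across all Tweets
word_counts = Counter(word for text in clean_texts for word in text.split())

# The two most frequent words with their counts
print(word_counts.most_common(2))
```

A Counter behaves like a dictionary, so it can be passed to pd.DataFrame.from_dict in the same way as unique_words.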

In [None]:
uw_df = pd.DataFrame.from_dict(unique_words, orient='index').reset_index()
uw_df.rename(columns = {'index':'Word', 0:'Count'}, inplace=True)
uw_df.sort_values(by=['Count'], ascending=False, inplace=True)
uw_df = uw_df.reset_index().drop(columns=["index"])

We can inspect the words as a dataframe. 


You can always save this dataframe as .csv for future reference.

In [None]:
uw_df

In [None]:
uw_df.to_csv(path+"words.csv")

#### Extracting the hashtags

We are going to loop through the dataframe and then through the hashtags in the entities. We will add each hashtag as a key to a dictionary and use its frequency as the value. At the same time, we will collect each Tweet's hashtags in a list and store it in a separate column to make later steps easier.

In [None]:
uniqueTweets.loc[uniqueTweets["entities"].isnull(), "entities"] = None
uniqueTweets["hashtags"] = ""

In [None]:
unique_hashtags = {}

for idx, row in uniqueTweets.iterrows():
  # entities can be None or NaN for Tweets without any; only a dict can hold hashtags
  if isinstance(row["entities"], dict) and "hashtags" in row["entities"]:
    hl = []
    for hashtag in row["entities"]["hashtags"]:
      tag = "#" + hashtag["tag"].lower()
      unique_hashtags.setdefault(tag, 0)
      unique_hashtags[tag] += 1
      hl.append(tag)

    uniqueTweets.at[idx,"hashtags"] = hl
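The per-Tweet part of this loop can also be written as a small helper, which is easier to test on a single example. This is only a sketch: extract_hashtags is a hypothetical name, and the example dictionary is made up, assuming only that each hashtag entry has a tag key as in the loop above.

```python
def extract_hashtags(entities):
    """Return the lowercased hashtags of one Tweet, each prefixed with '#'."""
    # Tweets without entities may carry None or NaN instead of a dict
    if not isinstance(entities, dict):
        return []
    return ["#" + h["tag"].lower() for h in entities.get("hashtags", [])]

# Hypothetical entities value for one Tweet
example = {"hashtags": [{"tag": "Gymnastics"}, {"tag": "Tokyo2020"}]}

print(extract_hashtags(example))  # ['#gymnastics', '#tokyo2020']
print(extract_hashtags(None))     # []
```

Such a helper could be applied to the whole column with uniqueTweets["entities"].map(extract_hashtags), which yields an empty list rather than an empty string for Tweets without hashtags.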

In [None]:
unique_hashtags = dict(sorted(unique_hashtags.items(), key=lambda item: item[1], reverse=True))

In [None]:
uh_df = pd.DataFrame.from_dict(unique_hashtags, orient='index').reset_index()
uh_df.rename(columns = {'index':'Hashtag', 0:'Count'}, inplace=True)

In [None]:
uh_df[0:50]

In [None]:
uh_df.to_csv(path+"hashtags.csv")

In [None]:
tweets_filtered

In [None]:
tweets_filtered.to_pickle(path + "filtered_data/tweets-filtered.pkl")

In [None]:
type(tweets_filtered.at[35, 'clean_text'])

### Step 4: Building the network

We are going to use networkx, a Python library for network science analysis.

We will use it to create our network and export an edge list, which can easily be imported into Gephi (software we will see in the visualization labs).

networkx also ships ready-made analysis algorithms (for example PageRank) that you can apply to your network out of the box.

But first, we will loop through our dataframe and connect words and hashtags if they appear together in the same Tweet.
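Before running this on the full dataframe, the co-occurrence idea can be tried on a toy example. The sketch below is an alternative formulation, not the exact method of the cells that follow: it uses itertools.combinations on the sorted, de-duplicated tokens of each Tweet, so every unordered pair is counted once per Tweet. The token lists are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical Tweets, each already reduced to its hashtags and clean words
tweets_tokens = [
    ["#gymnastics", "vault", "routine"],
    ["#gymnastics", "routine"],
]

edge_weights = Counter()
for tokens in tweets_tokens:
    # sorting gives each pair a canonical order, so (a, b) and (b, a) never both appear
    unique_sorted = sorted(set(tokens))
    edge_weights.update(combinations(unique_sorted, 2))

print(edge_weights[("#gymnastics", "routine")])  # 2
```

Here the weight of an edge is the number of Tweets in which both tokens appear, and self-loops cannot occur because combinations never pairs an element with itself.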

In [None]:
import itertools
import networkx as nx

In [None]:
uh = unique_hashtags.keys()
uw = unique_words.keys()

In [None]:
network = {}
for index, row in uniqueTweets.iterrows():
    combined_list = list(row["hashtags"]) + [word for word in row["clean_text"].split(" ") if word in uw]
    # itertools.product yields every ordered pair of elements from the combined list
    for pair in itertools.product(combined_list, combined_list):
        # skip self-loops, and skip a pair whose reverse is already stored, because the graph is undirected
        # note: a token that occurs twice in one Tweet contributes twice to the weight
        if pair[0] != pair[1] and pair[::-1] not in network:
            network.setdefault(pair, 0)
            network[pair] += 1
    
network_df = pd.DataFrame.from_dict(network, orient="index")

In [None]:
network_df.reset_index(inplace=True)
network_df.columns = ["pair","weight"]
network_df.sort_values(by="weight",inplace=True, ascending=False)
network_df

In [None]:
# to get a weighted graph we need a list of 3-element tuples (u, v, w), where u and v are nodes and w is a number representing the weight
up_weighted = []
for edge in network:
    #we can filter edges by weight by uncommenting the next line and setting desired weight threshold
    #if(network[edge])>1:
    up_weighted.append((edge[0],edge[1],network[edge]))

G = nx.Graph()
G.add_weighted_edges_from(up_weighted)

In [None]:
print(len(G.nodes()))
print(len(G.edges()))

In [None]:
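# nx.write_gpickle was removed in NetworkX 3.0; the standard pickle module does the same job
import pickle
with open(path+"network.pkl", "wb") as f:
    pickle.dump(G, f)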

#### Save edgelist

In [None]:
filename = path+"edgelist.csv"

In [None]:
nx.write_weighted_edgelist(G, filename, delimiter=",")

In [None]:
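headerList = ['Source', 'Target', 'Weight']
# the edgelist file has no header row, so read it with header=None; otherwise the first edge is consumed as the header
file = pd.read_csv(filename, header=None, names=headerList)
file.to_csv(filename, index=False)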

#### Create and save node list


In [None]:
word_nodes = pd.DataFrame.from_dict(unique_words,orient="index")
word_nodes.reset_index(inplace=True)
word_nodes["Label"] = word_nodes["index"]
word_nodes.rename(columns={"index":"Id",0:"delete"},inplace=True)
word_nodes = word_nodes.drop(columns=['delete'])

word_nodes

In [None]:
hashtag_nodes = uh_df.copy()
hashtag_nodes["Label"] = hashtag_nodes["Hashtag"]
hashtag_nodes.rename(columns={"Hashtag":"Id"},inplace=True)
hashtag_nodes = hashtag_nodes.drop(columns=['Count'])
hashtag_nodes

In [None]:
nodelist = pd.concat([hashtag_nodes, word_nodes], ignore_index=True)
nodelist.to_csv(path+"nodelist.csv", index=False)

Tasks: 

*  We created a network where nodes are mixed (both words and hashtags). Create a network of words only and one of hashtags only.
* Pick one of these networks and rank the nodes using PageRank centrality. Extract information about the top-20 ranked nodes.
* Following the procedure for extracting hashtags, extract mentions and annotations.
* Following the same procedure, extract the public metric counts for Tweets.



