# Xoxoday Tweet Network
Let's now move on to some real data used in the Xoxoday case study.

Let's import `pandas` and `networkx` again.

In [None]:
import pandas as pd
import networkx as nx

Let's import the "Tweets Data" sheet (in CSV format) from the Xoxoday data.

In [None]:
df = pd.read_csv('./xoxoday_tweets_data.csv')
df.head()

Let's create a small, de-duplicated DataFrame of tweets containing just the tweet ID, the text, the date-time the tweet was created, and the ID of the tweet it replied to (`NaN` if the tweet was not a reply). This is all we need for this notebook.

In [None]:
df_tweets = df[['Imported ID','Tweet','Tweet Date (UTC)','In-Reply Tweet ID']]
df_tweets = df_tweets.drop_duplicates(subset=['Imported ID'])
df_tweets.head()

## Construct Tweet Network
Let's now construct a "Tweet Network" using the data provided. This is a graph where the nodes represent tweets and the edges represent relationships between tweets. The `Imported ID` column in the dataset contains the tweet IDs, some of which we will use as nodes. As for edges, we will create an edge directed from one tweet to another if the first tweet was a reply to the second, as indicated by `In-Reply Tweet ID`.

Thus, we will create an `edgelist` for the Tweet Network using the following method:
1. Initialise the edge list using the `Imported ID` and `In-Reply Tweet ID` columns
2. Drop rows that are not replies (i.e., `NaN` in the `In-Reply Tweet ID` column)
3. Drop duplicates (as tweets are duplicated in the dataset) and ensure the node IDs are integer-valued
4. Create edge weights that are all equal to $1$ (as the reply relationship is binary)

In [None]:
tweets_edgelist = df_tweets[['Imported ID','In-Reply Tweet ID']]
# Cast to 64-bit integers explicitly, since tweet IDs do not fit in 32 bits
tweets_edgelist = tweets_edgelist.dropna().drop_duplicates().astype('int64')
tweets_edgelist['Weight'] = 1
tweets_edgelist.head()
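
The four steps above can be checked on a toy DataFrame. The sketch below uses made-up tweet IDs (the `toy` and `toy_edgelist` names are hypothetical and not part of the Xoxoday data) to show why the integer cast is needed: a column that contains `NaN` is stored as floats.

```python
import pandas as pd

# Hypothetical toy data: tweet 3 appears twice (a duplicated row),
# tweets 2 and 3 reply to tweet 1, and tweet 1 is not a reply.
toy = pd.DataFrame({
    'Imported ID': [1, 2, 3, 3],
    'In-Reply Tweet ID': [float('nan'), 1, 1, 1],
})

# The missing value forces the reply column to a float dtype.
print(toy['In-Reply Tweet ID'].dtype)

toy_edgelist = (
    toy[['Imported ID', 'In-Reply Tweet ID']]  # step 1: pick the two ID columns
    .dropna()                                  # step 2: keep replies only
    .drop_duplicates()                         # step 3: remove repeated rows
    .astype('int64')                           # step 3: integer node IDs
)
toy_edgelist['Weight'] = 1                     # step 4: binary relationship

print(toy_edgelist)
```

Integers above $2^{53}$ cannot all be represented exactly as 64-bit floats, so with real tweet IDs it may be worth comparing a few values in the edge list against the source sheet.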

Next, we create the graph using `nx.from_pandas_edgelist` and visualise it.

In [None]:
G = nx.from_pandas_edgelist(
    tweets_edgelist, 
    source='Imported ID', 
    target='In-Reply Tweet ID', 
    edge_attr='Weight',
    create_using=nx.DiGraph(),
)

nx.draw_spring(G, node_size=5, arrowsize=5)

**Question:** What do you see? Can you explain it?

## Tweet Network Analysis
Let's now perform some quantitative analysis of the tweet network. 

In [None]:
N, K = G.order(), G.size()
avg_deg = float(K) / N
print(f'Nodes: {N}')
print(f'Edges: {K}')
print(f'Average degree: {avg_deg}')

**Question:** How can we interpret the average degree in terms of how many tweets reply to other tweets?

Let's now compute the in-degree of each node (i.e., the number of replies to each tweet) and print the 5 tweets with the highest in-degree (i.e., the most replies).

In [None]:
in_degrees_sorted = sorted(G.in_degree(), key=lambda tup: tup[1], reverse=True)
for tweet_ID, in_degree in in_degrees_sorted[:5]:
    tweet_text = df_tweets.query(f'`Imported ID` == {tweet_ID}').Tweet.values
    print(f'Tweet ID: {tweet_ID}')
    print(f'In-degree (replies): {in_degree}')
    print(f'Tweet text: {tweet_text}')
    print('\n')
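
To see what the in-degree measures, here is a minimal sketch in plain Python on a made-up edge list (the IDs and the `toy_edges` name are illustrative assumptions). Each edge points from a reply to the tweet it replies to, so counting how often a tweet appears as a target gives its in-degree.

```python
from collections import Counter

# Hypothetical reply edges as (reply tweet, tweet replied to) pairs.
toy_edges = [(2, 1), (3, 1), (4, 1), (5, 2), (6, 2)]

# In-degree of a node = number of edges pointing at it.
toy_in_degree = Counter(target for _, target in toy_edges)

# most_common sorts by count, highest first.
top = toy_in_degree.most_common(2)
print(top)  # [(1, 3), (2, 2)]
```

Unlike `G.in_degree()`, the counter only stores tweets that received at least one reply; looking up any other ID returns 0.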


## Tweet Content Analysis
It looks like a number of the tweets with high in-degrees relate to a "contest". Let's investigate whether tweets relating to contests really do have higher in-degrees (numbers of replies). To do this, we will:
1. Create a new DataFrame for the labelled tweets (using the `df_tweets` DataFrame from before)
2. Label the tweets as contest or not based on whether the text contains the "contest" string (ignoring case)
3. Create an `In-degree` column in the DataFrame to store the in-degree values for each tweet
4. Calculate the mean `In-degree` value (number of replies) for contest and non-contest tweets

In [None]:
df_tweets_labeled = df_tweets.copy()
# na=False labels tweets with missing text as non-contest instead of NaN
df_tweets_labeled['Contest'] = df_tweets_labeled['Tweet'].str.contains('contest', case=False, na=False)

# Create in-degree column (tweets that are not nodes in G get in-degree 0)
in_degrees_dict = dict(G.in_degree())
df_tweets_labeled['In-degree'] = df_tweets_labeled['Imported ID'].apply(
    lambda tweet_id: in_degrees_dict.get(tweet_id, 0)
)

# Calculate the mean in-degree for contest and non-contest tweets
df_tweets_labeled.groupby('Contest').agg({'In-degree':'mean'})
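
A group mean on its own hides how many tweets fall into each group. As a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical), the same labelling and aggregation can report the group size alongside the mean:

```python
import pandas as pd

# Hypothetical tweets with pre-computed in-degrees; one tweet has no text.
toy = pd.DataFrame({
    'Tweet': ['Join our CONTEST today!', 'Happy Friday',
              'Contest winners announced', None],
    'In-degree': [4, 0, 2, 1],
})

# Case-insensitive match; missing text counts as non-contest.
toy['Contest'] = toy['Tweet'].str.contains('contest', case=False, na=False)

# Report the mean in-degree together with the number of tweets per group.
summary = toy.groupby('Contest')['In-degree'].agg(['mean', 'count'])
print(summary)
```

A large difference in means carries less weight when one of the groups contains only a handful of tweets.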

## Exercise 01 - Analysis in action
Explain whether and how Xoxoday could use the above analysis, and identify its potential limitations.

## Exercise 02 - Betweenness Centrality
In the previous tweet analysis, we used the in-degree of tweets to rank them by potential influence. However, the in-degree of a tweet (in this network) is simply the number of replies to it, which could be collected from the `public_metrics` data using the Twitter API. Let's try a different metric that is not included in `public_metrics`.

The code below calculates the [betweenness centrality](https://en.wikipedia.org/wiki/Betweenness_centrality) of the nodes in the graph. Sort the betweenness centralities and print the 5 most "important" tweets w.r.t. betweenness centrality. Can you use the definition of betweenness centrality to explain the results you are seeing?

In [None]:
betweenness_centralities = nx.betweenness_centrality(G).items()

# (SOLUTION)
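
As a reminder of the definition, the sketch below computes unnormalised betweenness centrality on a tiny made-up reply graph (the `toy_G` graph and its node IDs are hypothetical, not taken from the dataset). A node scores a point for every ordered pair of other nodes whose shortest directed path passes through it.

```python
import networkx as nx

# Hypothetical reply graph: tweets 2 and 3 reply to tweet 1,
# and tweet 4 replies to tweet 2 (edges point from reply to original).
toy_G = nx.DiGraph([(2, 1), (3, 1), (4, 2)])

toy_in_degree = dict(toy_G.in_degree())
# normalized=False returns raw counts of shortest paths through each node.
toy_bc = nx.betweenness_centrality(toy_G, normalized=False)

for node in sorted(toy_G.nodes()):
    print(node, toy_in_degree[node], toy_bc[node])
```

In this toy graph the only path with an intermediate node is 4 → 2 → 1, so tweet 2 is the only node with a non-zero score, even though tweet 1 has the higher in-degree.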

## (Optional) Exercise 03 - Content relevance
Find some other words in the tweets (in addition to "contest") that are strongly associated with increased activity (e.g., number of replies).

In [None]:
# (SOLUTION)