# Interim Report

After completing a simple exploratory data analysis (below), I noted the countries that dominated the conversation around Nancy Pelosi's visit, both globally and within Africa.

I also noted ...

# Exploratory Data Analysis

In [3]:
import pandas as pd
from extract_dataframe import read_json
from extract_dataframe import TweetDfExtractor
from clean_tweets_dataframe import Clean_Tweets

#from wordcloud import WordCloud
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from pprint import pprint
from matplotlib import pyplot as plt

import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

## Setting up dataframe

In [None]:
_, tweet_data_global = read_json("data/global_twitter_data.json")
_, tweet_data_africa = read_json("data/africa_twitter_data.json")
tdg = TweetDfExtractor(tweet_data_global)
tda = TweetDfExtractor(tweet_data_africa)
df_global = tdg.get_tweet_df()
df_africa = tda.get_tweet_df()
df_global = Clean_Tweets.clean_df(df_global)
df_africa = Clean_Tweets.clean_df(df_africa)

### Global Tweets DataFrame preview

In [None]:
df_global.head()

In [None]:
df_global.info()

### Africa Tweets DataFrame preview

In [None]:
df_africa.head()

In [None]:
df_africa.info()

## Where are the global twitterati?

In [None]:
(df_global["place_country"].value_counts())[:10].plot.bar();
plt.title("Top 10 countries where global tweets are coming from");
plt.xlabel("Country")
plt.ylabel("Frequency")

Most users discussing Nancy Pelosi's visit to Taiwan are from the USA, India, and Taiwan. The People's Republic of China also features in the top 10, which makes sense given the subject. However, the plot above covers only ~8% of the tweets, so it rests on a very small sample.
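
The ~8% figure can be checked directly from the dataframe. The sketch below assumes the figure refers to rows whose `place_country` is neither null nor an empty string. The helper name `place_tag_coverage` and the toy data are illustrative and not part of the project code.

```python
import pandas as pd


def place_tag_coverage(df: pd.DataFrame, column: str = "place_country") -> float:
    """Return the fraction of rows with a usable (non-null, non-empty) place tag."""
    if len(df) == 0:
        return 0.0
    values = df[column]
    # A tag counts only if it is present and not just whitespace.
    has_tag = values.notna() & (values.astype(str).str.strip() != "")
    return float(has_tag.mean())


# Toy data standing in for df_global; the real column may look different.
toy = pd.DataFrame({"place_country": ["Taiwan", None, "", "India", None]})
coverage = place_tag_coverage(toy)
print(f"{coverage:.0%} of tweets carry a place tag")
```

Calling `place_tag_coverage(df_global)` and `place_tag_coverage(df_africa)` would show how much of each data set the country plots actually represent.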

## Where are the African Twitterati?

In [None]:
(df_africa["place_country"].value_counts())[:10].plot.bar();
plt.title("Top 10 countries where African tweets are coming from");
plt.xlabel("Country")
plt.ylabel("Frequency")

An overwhelming majority of the place-tagged tweets in the African data set came from Nigeria. This is interesting and surprising, though I cannot yet explain why.

## How big is the audience?

In [None]:
df_global["followers_count"].plot(kind="line")

Many of the accounts tweeting have more than 2 million followers. These correspond to large news networks around the globe discussing the topic.
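
To put a number on "many", one option is to compute the share of tweets posted by accounts above a follower threshold. This is a minimal sketch: `share_above` is a hypothetical helper, the toy values are made up, and it counts tweets rather than distinct accounts.

```python
import pandas as pd


def share_above(series: pd.Series, threshold: int) -> float:
    """Fraction of non-missing values that are strictly above the threshold."""
    # Coerce to numbers so stray strings become NaN and are dropped.
    values = pd.to_numeric(series, errors="coerce").dropna()
    if values.empty:
        return 0.0
    return float((values > threshold).mean())


# Toy follower counts standing in for df_global["followers_count"].
toy_counts = pd.Series([2_500_000, 3_000_000, 150, 42, 9_800, None, 1_200_000, 75])
big_share = share_above(toy_counts, 2_000_000)
print(f"{big_share:.1%} of tweets come from accounts with more than 2M followers")
```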

In [None]:
df_africa["followers_count"].plot(kind="line")

Many values visibly repeat at exactly the same follower counts, suggesting that a few large accounts dominated the discussion.
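
The repetition can be measured rather than read off the plot. The sketch below uses identical follower counts as a rough proxy for "same account", which is an assumption: a unique account identifier would be more reliable if the dataframe has one. The helper `top_value_share` is illustrative.

```python
import pandas as pd


def top_value_share(series: pd.Series, k: int = 3) -> float:
    """Share of rows accounted for by the k most frequent values."""
    counts = series.dropna().value_counts()
    if counts.empty:
        return 0.0
    # value_counts sorts by frequency, so head(k) holds the k most common values.
    return float(counts.head(k).sum() / counts.sum())


# Toy follower counts standing in for df_africa["followers_count"].
toy_followers = pd.Series([2_100_000, 2_100_000, 2_100_000, 540, 540, 12, 98, 7])
share = top_value_share(toy_followers, k=2)
print(f"The 2 most repeated follower counts cover {share:.1%} of tweets")
```

A share close to 1 for a small `k` would support the reading that a few participants dominate. A share near `k / len(series)` would argue against it.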

## Preparing data for modeling

In [None]:
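# Collect the cleaned tweet texts, skipping rows with missing text
sentence_list = df_global["clean_text"].dropna().astype(str).tolist()
# Tokenize each tweet on whitespace
word_list = [s.split() for s in sentence_list]
# Map every unique token to an integer id
id2word = corpora.Dictionary(word_list)
# Convert each tweet to a bag of words: a list of (token_id, count) pairs
corpus = [id2word.doc2bow(tweet) for tweet in word_list]

# Human-readable view of the corpus: (token, count) instead of (token_id, count)
id_words = [[(id2word[word_id], count) for word_id, count in line] for line in corpus]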

In [None]:
print(id_words[:1])
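
For readers unfamiliar with the bag-of-words format, the sketch below reproduces the idea in plain Python. It is not gensim's implementation: the helper names are made up, and the ids gensim assigns will generally differ from these.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each token in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Two toy tokenized tweets.
docs = [["pelosi", "visit", "taiwan"], ["taiwan", "china", "taiwan"]]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(bow)
```

Word order is discarded and only the counts remain, which is the representation the LDA model consumes.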

## Topic Modelling

In [None]:
# Build model
lda_model = gensim.models.ldamodel.LdaModel(
    corpus,
    id2word = id2word,
    num_topics = 5,
    random_state = 43,
    update_every = 1,
    chunksize = 100,
    passes = 10,
    alpha = 'auto',
    per_word_topics = True
)

In [None]:
pprint(lda_model.print_topics())

## Model Analysis

In [None]:
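# Compute Perplexity

# log_perplexity returns the per-word likelihood bound (a negative number), not the perplexity itself.
# Perplexity is 2 ** (-bound), so a bound closer to zero means lower perplexity, which is better.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(corpus))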
doc_lda = lda_model[corpus]


# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=word_list, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nLDA model coherence score (c_v) on tweets: ', coherence_lda)

The coherence score is low, suggesting that the model has not separated the tweets into well-defined topics.
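
The value returned by `log_perplexity` is a per-word likelihood bound, so it needs converting before it can be read as a perplexity. A minimal sketch of that conversion follows. The function name is illustrative and the example bound is a made-up number, not a result from this notebook.

```python
def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert a per-word likelihood bound into a perplexity estimate."""
    # Perplexity is 2 raised to the negated per-word bound.
    return 2.0 ** (-per_word_bound)


example_bound = -8.0  # illustrative value only
print(perplexity_from_bound(example_bound))
```

Lower perplexity is better, which corresponds to a bound closer to zero.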

## Sentiment analysis