# 1. Overview

# 2. Pipeline Methodology

The project pipeline is shown in the image below. The deployed app uses only data obtained from Twitter; however, the other data sources we investigated, GDELT and RSS feeds (discussed in the next section), could also be integrated into the pipeline. The pre-processing steps applied to our dataset include lemmatisation and removal of stop words. Sentiment analysis and topic modelling can then be performed on this cleaned dataset.
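
As a rough illustration of the pre-processing step, the sketch below removes stop words and maps tokens to lemmas. The stop word set and lemma table are tiny, hypothetical stand-ins; a real pipeline would more likely take both from a library such as NLTK or spaCy, and this is not the project's actual code.

```python
import re

# Tiny illustrative resources (hypothetical). A real pipeline would use the
# stop word list and lemmatiser shipped with an NLP library.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "of", "to", "in", "on"}
LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "tweets": "tweet", "elections": "election"}


def preprocess(text):
    """Lower-case and tokenise the text, drop stop words, then lemmatise."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    # Tokens missing from the lemma table are passed through unchanged.
    return [LEMMAS.get(token, token) for token in kept]


print(preprocess("The cats were running in the elections"))
```

The resulting token list is what the sentiment analysis and topic modelling steps would consume.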

Sentiment analysis provides us with polarity and subjectivity metrics. We are also able to determine the top emotions using sentiment analysis and topic modelling. In addition, topic modelling allows us to create word clouds associated with positive and negative emotions, topic word clouds and a visualisation of the topic clusters. Finally, we use GPT-2 to perform text generation in order to provide the user with background information.

<img title="pipeline" alt="pipeline" src="pipeline.png">



# 3. Data

For this project we tapped into the following data sources:
1. Tweets: Top trending tweets were harvested from the Twitter API over one week.
2. GDELT: 

# 4. Sentiment Analysis

Sentiment analysis is the task of determining the emotional value of a given expression in natural language. In this project we explored the polarity and subjectivity of the texts using rule-based models including TextBlob and VADER. We settled on TextBlob since it is simple and can also perform subjectivity analysis.

Polarity analysis determines whether a word, phrase, or document is positive, negative, or neutral. The score ranges from -1 (very negative) to +1 (very positive), with 0 being neutral. The subjectivity score ranges from 0 (objective) to 1 (highly subjective). Highly subjective texts are not statements of fact; they are strongly influenced by the writer's feelings and emotions.
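
A minimal sketch of how these scores can be obtained and interpreted is given below. `TextBlob(text).sentiment` returns the polarity and subjectivity values; the labelling helpers and their cut-off values (0.05 and 0.5) are illustrative assumptions, not settings taken from the project.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an assumed value chosen for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def label_subjectivity(subjectivity, cutoff=0.5):
    """Map a subjectivity score in [0, 1] to a label (cut-off is assumed)."""
    return "subjective" if subjectivity >= cutoff else "objective"


def analyse(text):
    """Score a text with TextBlob (requires `pip install textblob`)."""
    # Imported lazily so the labelling helpers work without TextBlob installed.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return {
        "polarity": sentiment.polarity,
        "subjectivity": sentiment.subjectivity,
        "polarity_label": label_polarity(sentiment.polarity),
        "subjectivity_label": label_subjectivity(sentiment.subjectivity),
    }
```

Calling `analyse(tweet_text)` on each cleaned tweet would give one row of sentiment metrics per tweet.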

We investigated text emotion detection using EmoRoBERTa from the Hugging Face Model Hub. EmoRoBERTa builds on RoBERTa to classify text into 28 emotion categories. We also explored the NRC Lexicon model, which gives the fraction of each of its 8 emotions present in the text.
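
To make the lexicon-based idea concrete, the sketch below shows one way a per-emotion fraction might be computed from a word-to-emotion lookup. The mini-lexicon is hypothetical and far smaller than the real NRC lexicon, and the normalisation shown is an assumption rather than the exact formula used by any particular library.

```python
from collections import Counter

# Hypothetical mini-lexicon; the real NRC lexicon covers thousands of words.
EMOTION_LEXICON = {
    "happy": ["joy"],
    "win": ["joy", "anticipation"],
    "afraid": ["fear"],
    "attack": ["anger", "fear"],
    "lost": ["sadness"],
}


def emotion_fractions(tokens):
    """Return each emotion's share of all emotion associations in the tokens."""
    counts = Counter(
        emotion for token in tokens for emotion in EMOTION_LEXICON.get(token, [])
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no token carries an emotion in this lexicon
    return {emotion: count / total for emotion, count in counts.items()}


print(emotion_fractions(["fans", "happy", "after", "win"]))
```

Because this is a dictionary lookup with no model inference, it runs quickly even on large batches of tweets.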

Sentiment Analysis Notebook: [Sentiment Analysis Notebook](https://colab.research.google.com/drive/18p_jRikwWqI3GdYBXDOb218HhvhiAFc-#scrollTo=IJZiLAktEOuM)

# 5. Topic Modeling

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. During this project we chose to apply a Latent Dirichlet Allocation (LDA) technique. LDA topic modelling discovers topics that are hidden (latent) in a set of text documents. It does this by inferring possible topics based on the words in the documents using a generative probabilistic model and Dirichlet distributions. The number of topics is chosen in advance by the user and can be varied using a slider in our deployed app. 

We tested the LDA models provided by sklearn as well as gensim. We displayed the results using wordcloud visualisations and provided an interactive plot using pyLDAvis.

Topic Modeling Notebook: [Topic Modeling Notebook](https://colab.research.google.com/drive/1eVFfwBcytlZLZcGiM2uBQU-ZxXsYW-id)

# 6. Text Generation

Text generation is a subfield of natural language processing (NLP). It leverages knowledge in computational linguistics and artificial intelligence to automatically generate natural language texts that satisfy certain communicative requirements. In its simplest form, a deep learning model is trained to generate text that is random but hopefully meaningful.

For this project we pursued 2 approaches:

1. Transfer learning: using a pre-trained model from Hugging Face to generate articles. The pre-trained model is fed seed text from a trending subject and outputs an article of a specified length.
2. Training a custom LSTM model on the tweets and predicting a sequence of the next most probable words.
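
The idea of repeatedly predicting the next most probable word can be illustrated without a neural network. The sketch below deliberately swaps the LSTM for a simple bigram count model, so it only demonstrates the greedy decoding loop, not the project's model; the corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            follows[current][following] += 1
    return follows


def generate(follows, seed, max_words=5):
    """Greedily extend a non-empty seed with the most frequent next word."""
    words = seed.lower().split()
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Toy corpus (hypothetical) standing in for the tweet data.
corpus = [
    "the match was a great win",
    "the match was a close game",
    "a great win for the team",
]
model = train_bigrams(corpus)
print(generate(model, "the match", max_words=4))  # the match was a great win
```

An LSTM replaces the bigram counts with a learned distribution conditioned on the whole preceding sequence, but the generation loop has the same shape.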

**Text generation notebook link**: [Text Generation Notebook](https://colab.research.google.com/drive/18Zye_xyVSluo0w4XiWU0FREmciVRanQJ)

# 7. Deployment

The project's main aim was to design a product (a web page) to aid journalists, with the following features:

- Word cloud visualization on what's trending.
- Sentiment and emotion analysis.
- Topic discovery leveraging machine learning.
- Article generation using deep learning models.

A web page with the features listed above was designed, developed and hosted in the cloud.

Some of the tools utilized for the webpage deployment are:

- Streamlit: Open-source app framework for Machine Learning and Data Science teams
- Streamlit Cloud: Workspace to deploy, share, and collaborate on Streamlit applications.

The product name is Scoop Finder.

**Web Application link**: [Scoop Finder Link](https://share.streamlit.io/tejlibre/dsi-nlp-news/dev/Home_Page/app.py)

# 8. Results and Conclusion

In this project we created an application that provides the user with a detailed analysis of Twitter data. The app was built with Streamlit and deployed to the cloud using the free service provided by Streamlit. Below is a summary of the conclusions drawn for each aspect of the project.

Sentiment & Emotion Analysis
- A lot of tweets received neutral sentiment scores. A possible explanation is the language barrier: the transformer and lexicon models most likely had difficulty understanding non-English words.
- The emotion analysis was carried out successfully using both the Hugging Face transformer model and the NRC lexicon model.
- In our case the transformer covered a wider range of emotions (28 classes) than the lexicon-based model (8 classes) and may be preferred for further analysis or model deployment.
- Lexicon-based models are much faster than transformers at the emotion detection task.
- With more time, it would be interesting to train a transformer model on custom data for both sentiment and emotion analysis.

Topic modelling
- LDA initialised with 3 topics gave good results, with little overlap between topics.
- Further analysis was also carried out to optimise the model.
- Given more time, we would have considered other algorithms such as non-negative matrix factorisation (NMF).

Text Generation
- Both the transformer model and the LSTM model had promising results.
- We opted for GPT-2 rather than GPT-3 due to application response speed. GPT-3 performs better than GPT-2 but is larger and takes longer to load.
- An interesting area to pursue given more time is training a transformer model on custom data.
- Training the LSTM model for more epochs also looks promising for improving the output.

Deployment

- Streamlit is easy to use, and free online deployment is very useful.
- Customisation is somewhat limited.
- There are difficulties in resizing the pyLDAvis plot.
- Setting a default colour scheme is important to ensure that all plots and text are clearly visible.
