## NLP Topic Modeling Project

### Project Summary
This project applies topic modeling techniques to identify the top N most important topics in a collection of tweets and, separately, in a collection of news articles about one particular company.

### Data
The data has 9,962 news articles and 9,941 tweets.

### Project Sections

1. Data Import

2. Text Data Cleaning

3. Topic Modeling
 - Create Bigrams & Trigrams
 - Select Right Number of Topics via Coherence Score Analysis
 - Visualize Actual Topics

### Author & Platform
Yezi Liu conducted this project independently in Visual Studio Code.

## Load Packages

In [None]:
# Running this cell may make changes to your environment

# !pip install pyLDAvis
# pip install --upgrade scipy numpy pandas gensim pyLDAvis

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import jaccard
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.util import ngrams
from itertools import combinations
from nltk.tokenize import TweetTokenizer
import numpy as np
import itertools
from nltk.stem.wordnet import WordNetLemmatizer
import multiprocessing
import warnings
import gensim
from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models import Phrases
from gensim.corpora import Dictionary
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

## Set Up Environmental Variables

In [None]:
NEWS_DATA_PATH = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_6_news.json'
TWEETS_DATA_PATH = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_6_tweets.json'

## Data Import

### News Data

In [None]:
news_df = pd.read_json(NEWS_DATA_PATH, orient='records', lines=True)
print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)

Sample contains 9,962 news articles


Unnamed: 0,url,date,language,title,text
0,http://oaklandnewsnow.com/breaking-bts-announces-las-vegas-us-concert-date-in-2022/,2022-02-24,en,"BREAKING: BTS Announces LAS VEGAS, US Concert Date in 2022! | Oakland News Now - Oakland News, SF Bay Area, East Bay, California, World","BREAKING: BTS Announces LAS VEGAS, US Concert Date in 2022! | Oakland News Now - Oakland News, SF Bay Area, East Bay, California, WorldSorry, you have Javascript Disabled! To see this page as it is meant to appear, please enable your Javascript!BREAKING: BTS Announces LAS VEGAS, US Concert Date in 2022! | Oakland News Now - Oakland News, SF Bay Area, East Bay, California, WorldSkip to contentMenuSearch for:SearchOakland News Now – Oakland News, SF Bay Area, East Bay, California, WorldOakland..."
1,http://www.newsdzezimbabwe.co.uk/2022/04/mai-tt-weds.html,2022-04-09,en,MAI TT WEDS newsdzeZimbabweNewsdzeZimbabwe,"MAI TT WEDS newsdzeZimbabweNewsdzeZimbabweskip to main | skip to sidebarHomeAboutContactAdvertiseNewsdzeZimbabweOur Zimbabwe Our NewsHomeNewsBusinessEntertainmentSaturday, 9 April 2022MAI TT WEDSSaturday, April 09, 2022 NewsdzeZimbabwe 0 Best moments... @Chakariboy @NyamayaroArron @restmutore @Lattynyangu pic.twitter.com/MsrhcFXUJj— H-Metro (@HMetro_) April 9, 2022 Posted in: Share to TwitterShare to FacebookOlder PostHome0comments: Post a CommentFollow NewsdzeZimbabweRecent..."


### Tweets Data

In [None]:
tweets_df = pd.read_json(TWEETS_DATA_PATH, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)

Sample contains 9,941 tweets


Unnamed: 0,id,lang,date,name,retweeted,text
0,1484553027222741001,en,2022-01-21,Dylan Green,RT,*Microsoft has entered the chat* https://t.co/Uz3pZrk6B3
1,1505486305102557184,en,2022-03-20,Rahim Rajwani,,"""I actually use an @Android phone. Some #Android manufacturers pre-install @Microsoft software in a way that makes it easy for me. They’re more flexible about how the software connects up with the OS. So that’s what I ended up getting used to.""\nhttps://t.co/C0VjfS9PUO"


In [None]:
news_df = news_df[news_df['language']=='en'].reset_index(drop=True)

tweets_df = tweets_df[tweets_df['lang']=='en'].reset_index(drop=True)

## Text Data Cleaning

### News Article Cleaning

The text of the news articles contains many unusually long tokens (several words run together without a space), such as readsoffersnewfind. These usually come from buttons or ads on the web page and are unrelated to the content of the article, so I added a cleaning rule that removes tokens longer than 18 characters (single words are rarely that long).
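
To illustrate the length rule in isolation, here is a minimal standard-library sketch. The helper name `keep_reasonable_tokens` and the sample sentence are made up for illustration; the notebook's actual cleaning is done by `cleaned_news`, which also removes stop words and lemmatizes.

```python
import re

def keep_reasonable_tokens(text, max_length=18, min_length=3):
    """Keep alphabetic tokens whose length is within [min_length, max_length]."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens if min_length <= len(t) <= max_length]

# Hypothetical sample: the second token imitates glued-together page widgets
sample = "Microsoft readsoffersnewfindmoredealstoday announces new cloud features"
kept = keep_reasonable_tokens(sample)
print(kept)
```

The 32-character token is dropped, while ordinary words pass through unchanged.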

In [None]:
lemma = WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))

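def cleaned_news(text, max_length=18, min_length=3):
    """
    Clean news article text: strip URLs, handles, newlines, and digits, then
    keep lemmatized alphabetic tokens that are not stop words and whose
    length is between min_length and max_length.
    """
    # Remove @handles, URLs, and anything starting with www
    text = re.sub(r'(?:\@|http?\://|https?\://|www)\S+', '', text)
    # Remove newline characters and digits
    text = re.sub(r'(?:\n)', '', text)
    text = re.sub(r'\d+', '', text)
    tokens = nltk.tokenize.word_tokenize(text)
    # isalpha() already excludes purely numeric tokens
    return ' '.join([lemma.lemmatize(token.lower()) for token in tokens
        if token.lower() not in stop_words
        and token.isalpha()
        and min_length <= len(token) <= max_length])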

[nltk_data] Downloading package stopwords to /Users/lize/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


On inspection, almost all titles already appear in the news text, and the few that do not should convey the same theme as the text anyway. To use the information from both titles and texts without adding unnecessary computational cost, I decided to use the news article text alone to represent both.
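
One way such a check could be done is sketched below. It is not the check that was actually run for this project; the function name `title_coverage` and the toy records are assumptions made for illustration.

```python
def title_coverage(records):
    """Return the share of (title, text) pairs whose title appears verbatim in the text."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for title, text in records if title.strip() and title.strip() in text)
    return hits / len(records)

# Toy records (hypothetical), mimicking pages whose text starts with the title
toy = [
    ("MAI TT WEDS", "MAI TT WEDS skip to main | skip to sidebar"),
    ("Cloud revenue grows", "Cloud revenue grows as demand rises"),
    ("Earnings preview", "Analysts expect a strong quarter"),
]
coverage = title_coverage(toy)
print(f"{coverage:.0%} of titles appear in the text")
```

With the notebook's data, `zip(news_df['title'], news_df['text'])` could be passed in place of the toy list.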


In [None]:
news_df['cleaned_news_text'] = news_df['text'].apply(cleaned_news)
news_df['cleaned_news_title'] = news_df['title'].apply(cleaned_news)

### Tweets Cleaning

In [None]:
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=False, reduce_len=True)

# Remove urls, \n, emojis, #, @ for tweets as well as punctuations, stop words, and numbers.
def cleaned_tweets(text, min_length=3):
    """
    This function cleans the tweets text.
    """
    text = re.sub(r'(?:\@|http?\://|https?\://|www)\S+', '', text)
    text = re.sub(r'(?:\n)', '', text)
    text = re.sub(r'\d+', '', text)

    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"
        "\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE)
    text =  emoji_pattern.sub(r'', text)

    tokens = tweet_tokenizer.tokenize(text)

    # Apply the length filter to the raw token, then lower-case and strip leading '#'/'@' once
    words = [token.lower().lstrip('#@') for token in tokens if len(token) >= min_length]
    # isalpha() already excludes purely numeric tokens
    return ' '.join([lemma.lemmatize(word) for word in words
            if word not in stop_words
            and word.isalpha()
            ])

In [None]:
tweets_df['cleaned_tweets'] = tweets_df['text'].apply(cleaned_tweets)
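
The first regular expression in `cleaned_tweets` is worth a closer look. The sketch below applies the same pattern to a sentence shortened from the second sample tweet in the Data Import section; the helper name `strip_urls_and_handles` is made up for illustration.

```python
import re

# Same pattern as the first substitution in cleaned_tweets
URL_OR_HANDLE = re.compile(r'(?:\@|http?\://|https?\://|www)\S+')

def strip_urls_and_handles(text):
    """Remove URLs and @mentions, then split on whitespace."""
    return URL_OR_HANDLE.sub('', text).split()

sample = "I actually use an @Android phone. #Android https://t.co/C0VjfS9PUO"
stripped = strip_urls_and_handles(sample)
print(stripped)
```

The mention `@Android` is removed entirely while the hashtag `#Android` survives, so the later `lstrip('#@')` mostly matters for hashtags.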

## Topic Modeling for News Articles

### Functions Used

In [None]:
# Hyperparameter tuning for LDA models on news article text
num_processors = multiprocessing.cpu_count()
workers = num_processors-1


def compute_coherence_values(dictionary, corpus, texts, num_topics_range, alpha_range, eta_range):
    """
    This function computes coherence values for different models.
    """
    coherence_values = []
    model_list = []
    for num_topics in num_topics_range:
        for alpha in alpha_range:
            for eta in eta_range:
                model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                                     random_state=100, passes=10, alpha=alpha, eta=eta, per_word_topics=True,
                                     workers=workers)
                model_list.append(model)
                coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
                coherence_values.append((num_topics, alpha, eta, coherencemodel.get_coherence()))

    return model_list, coherence_values

### Create Bigrams and Trigrams

In [None]:
# To create bigrams/trigrams for news article text

# Tokenize cleaned news text
news_df['cleaned_news_text_tokens'] = news_df['cleaned_news_text'].apply(word_tokenize)

# Create Bigrams and Trigrams
# Train the models
news_text_bigram_model = Phrases(news_df['cleaned_news_text_tokens'], min_count=5, threshold=100) # higher threshold fewer phrases.
news_text_trigram_model = Phrases(news_text_bigram_model[news_df['cleaned_news_text_tokens']], threshold=100)

# Apply the trained models to transform the sentences
news_df['news_text_bigrams'] = [news_text_bigram_model[doc] for doc in news_df['cleaned_news_text_tokens']]
news_df['news_text_trigrams'] = [news_text_trigram_model[news_text_bigram_model[doc]] for doc in news_df['cleaned_news_text_tokens']]
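
Roughly, `Phrases` promotes a pair of adjacent tokens to a single phrase when the pair's score exceeds `threshold`. The sketch below reimplements what I understand to be gensim's default scoring formula in plain Python. Treat the formula, the function name `phrase_score`, and the toy counts as assumptions, and check the gensim documentation before relying on them.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram: higher means the pair co-occurs more than chance."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Toy counts (hypothetical): both pairs appear 7 times in a 5,000-term vocabulary
rare_pair = phrase_score(count_a=7, count_b=7, count_ab=7, vocab_size=5000)
common_words = phrase_score(count_a=400, count_b=300, count_ab=7, vocab_size=5000)

threshold = 100
print(rare_pair > threshold, common_words > threshold)
```

Two words that almost always appear together clear the threshold, while two frequent words that merely happen to be adjacent a few times do not. This is why a higher threshold yields fewer phrases.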


In [None]:
# Create a Dictionary and Corpus needed for Topic Modeling: every unique term is assigned an index
new_text_dictionary = Dictionary(news_df['news_text_trigrams'])

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
new_text_dictionary.filter_extremes(no_below=20, no_above=0.5)

# Create the Corpus: Term Document Frequency
news_text_corpus = [new_text_dictionary.doc2bow(text) for text in news_df['news_text_trigrams']]
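
The effect of `filter_extremes` can be approximated with a document-frequency count. This is a simplified sketch, not gensim's implementation: the function name `filter_by_document_frequency` and the toy corpus are assumptions, and the thresholds are scaled down to fit four documents.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above (fraction) of docs."""
    # Count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus (hypothetical)
docs = [
    ["microsoft", "cloud", "azure"],
    ["microsoft", "xbox", "game"],
    ["microsoft", "cloud", "stock"],
    ["microsoft", "game", "deal"],
]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

A token that appears in every document, such as the company name in this toy corpus, is dropped by the `no_above` rule, and tokens seen in only one document are dropped by `no_below`.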

### Select Right Number of Topics via Coherence Score

In [None]:
num_topics_range = range(4, 11, 1)
alpha_range = ['asymmetric']
eta_range = ['auto']

In [None]:
news_text_model_list, news_text_coherence_values = compute_coherence_values(dictionary=new_text_dictionary,
                                                                            corpus=news_text_corpus,
                                                                            texts=news_df['news_text_trigrams'],
                                                                            num_topics_range=num_topics_range,
                                                                            alpha_range=alpha_range,
                                                                            eta_range=eta_range)

In [None]:
# Displaying the coherence scores for each model for news article text
for model_scores in news_text_coherence_values:
    print("Num Topics:", model_scores[0], " Alpha:", model_scores[1], " Eta:", model_scores[2], " Coherence:", model_scores[3])


Num Topics: 4  Alpha: asymmetric  Eta: auto  Coherence: 0.5569369624599487
Num Topics: 5  Alpha: asymmetric  Eta: auto  Coherence: 0.4988000907548926
Num Topics: 6  Alpha: asymmetric  Eta: auto  Coherence: 0.5587946643233509
Num Topics: 7  Alpha: asymmetric  Eta: auto  Coherence: 0.5000552959851742
Num Topics: 8  Alpha: asymmetric  Eta: auto  Coherence: 0.5264889936739028
Num Topics: 9  Alpha: asymmetric  Eta: auto  Coherence: 0.5344224289942269
Num Topics: 10  Alpha: asymmetric  Eta: auto  Coherence: 0.5547021249102447
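
To make the comparison easier, the scores can be ranked. The sketch below copies the coherence values printed above (rounded to four decimals) into plain tuples; the helper name `top_k_by_coherence` is made up for illustration.

```python
# (num_topics, coherence) pairs copied from the output above, rounded to 4 decimals
news_scores = [
    (4, 0.5569), (5, 0.4988), (6, 0.5588), (7, 0.5001),
    (8, 0.5265), (9, 0.5344), (10, 0.5547),
]

def top_k_by_coherence(scores, k=3):
    """Return the k (num_topics, coherence) pairs with the highest coherence."""
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

best = top_k_by_coherence(news_scores)
print(best)
```

The three best-scoring models have 6, 4, and 10 topics. The tuples returned by `compute_coherence_values` have four elements, so the sort key there would be the last element rather than `pair[1]`.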


### Visualize Actual Topics

In [None]:
# Choose three models with relatively high coherence scores to visualize actual topics
model_topics4 = news_text_model_list[0]
model_topics6 = news_text_model_list[2]
model_topics10 = news_text_model_list[6]
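
The list indexes used above follow from the order of the nested loops in `compute_coherence_values`. As a small sketch, the same order can be reproduced with `itertools.product` to look up an index by number of topics; the name `index_of` is an assumption, and the lookup only works because alpha and eta each have a single value here.

```python
from itertools import product

num_topics_range = range(4, 11, 1)
alpha_range = ['asymmetric']
eta_range = ['auto']

# product() iterates in the same order as the three nested loops
grid = list(product(num_topics_range, alpha_range, eta_range))
index_of = {num_topics: i for i, (num_topics, _, _) in enumerate(grid)}

print(index_of[4], index_of[6], index_of[10])
```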

In [None]:
model_topics4.print_topics()

[(0,
  '0.017*"market" + 0.011*"stock" + 0.005*"business" + 0.005*"price" + 0.005*"data" + 0.005*"report" + 0.005*"global" + 0.004*"technology" + 0.004*"product" + 0.003*"growth"'),
 (1,
  '0.054*"video" + 0.047*"music" + 0.044*"official" + 0.011*"oakland" + 0.006*"nba" + 0.006*"nfl" + 0.005*"game" + 0.005*"song" + 0.004*"georgia" + 0.004*"black_history_month"'),
 (2,
  '0.005*"game" + 0.004*"best" + 0.004*"video" + 0.004*"say" + 0.004*"ago" + 0.003*"hour" + 0.003*"like" + 0.003*"show" + 0.003*"make" + 0.003*"people"'),
 (3,
  '0.024*"open" + 0.015*"tab" + 0.009*"link" + 0.008*"music" + 0.007*"say" + 0.007*"people" + 0.007*"video" + 0.004*"show" + 0.004*"see" + 0.004*"way"')]

With 4 topics, topics 3 and 4 consist largely of common words that carry little information (say, ago, like, make, see, way, etc.), which makes it difficult to derive a main idea for either.


In [None]:
model_topics6.print_topics()

[(0,
  '0.008*"stock" + 0.007*"business" + 0.006*"data" + 0.005*"technology" + 0.005*"market" + 0.005*"product" + 0.005*"management" + 0.005*"rating" + 0.004*"price" + 0.004*"report"'),
 (1,
  '0.055*"video" + 0.048*"music" + 0.045*"official" + 0.012*"oakland" + 0.006*"nba" + 0.006*"nfl" + 0.005*"game" + 0.005*"song" + 0.004*"georgia" + 0.004*"black_history_month"'),
 (2,
  '0.005*"ago" + 0.004*"say" + 0.004*"hour" + 0.003*"show" + 0.003*"video" + 0.003*"people" + 0.003*"http" + 0.003*"like" + 0.003*"state" + 0.003*"would"'),
 (3,
  '0.031*"open" + 0.017*"tab" + 0.015*"link" + 0.012*"music" + 0.010*"people" + 0.008*"video" + 0.006*"say" + 0.005*"show" + 0.005*"join" + 0.005*"close_dialog_window"'),
 (4,
  '0.011*"game" + 0.010*"best" + 0.006*"video" + 0.005*"review" + 0.005*"apple" + 0.005*"feature" + 0.005*"window" + 0.004*"use" + 0.004*"deal" + 0.004*"gaming"'),
 (5,
  '0.026*"market" + 0.013*"stock" + 0.006*"price" + 0.005*"global" + 0.005*"report" + 0.004*"growth" + 0.004*"data" + 

With 6 topics, topic 3 contains uninformative words like ago, say, show, http, like, and would. Topics 1 and 6 are quite similar, sharing words such as stock, price, report, and data.


In [None]:
model_topics10.print_topics()

[(0,
  '0.008*"business" + 0.006*"technology" + 0.006*"data" + 0.005*"product" + 0.005*"customer" + 0.004*"solution" + 0.004*"platform" + 0.004*"cloud" + 0.004*"global" + 0.004*"industry"'),
 (1,
  '0.009*"show" + 0.007*"say" + 0.007*"star" + 0.004*"reveals" + 0.004*"look" + 0.004*"advertisement" + 0.004*"black" + 0.004*"video" + 0.004*"best" + 0.003*"two"'),
 (2,
  '0.008*"ago" + 0.005*"hour" + 0.004*"state" + 0.004*"people" + 0.003*"used" + 0.003*"say" + 0.003*"cookie" + 0.003*"would" + 0.003*"like" + 0.003*"week"'),
 (3,
  '0.017*"open" + 0.011*"link" + 0.009*"people" + 0.008*"music" + 0.007*"say" + 0.007*"tab" + 0.007*"video" + 0.006*"ukraine" + 0.005*"russia" + 0.005*"russian"'),
 (4,
  '0.013*"game" + 0.012*"best" + 0.006*"open" + 0.006*"review" + 0.006*"apple" + 0.006*"video" + 0.006*"tab" + 0.005*"deal" + 0.005*"feature" + 0.005*"gaming"'),
 (5,
  '0.015*"stock" + 0.014*"market" + 0.007*"price" + 0.005*"investor" + 0.004*"say" + 0.004*"billion" + 0.004*"inflation" + 0.003*"indi

With 10 topics, each topic still contains some uninformative words like say, http, and would, but each also has more distinctive signal words that separate it from the other topics and point to a central idea. I therefore chose 10 topics and visualize the key words below to better understand each topic and identify similarities.


In [None]:
pyLDAvis.enable_notebook()

In [None]:
%%time
# Visualize 10 topics from news text
# warnings.filterwarnings('ignore')
lda_display = gensimvis.prepare(model_topics10, news_text_corpus, new_text_dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

![Topics for News Articles](news_text_topics.png)

**Analysis:**

**In the visualization, these 10 topics are quite different in nature, with only slight overlap between some of them.**

Topic 1 key words: business, technology, data, product, customer, solution, platform, cloud.

Topic 2 key words: star, reveals, advertisement, black, video

Topic 3 key words: state, people, cookie, week, user, woman

Topic 4 key words: link, people, music, video, ukraine, russia, russian, close_dialog_window

Topic 5 key words: game, best, open, review, apple, video, tab, feature, mobile, google

Topic 6 key words: stock, market, price, investor, billion, inflation, india, business

Topic 7 key words: video, music, official, oakland, nba, nfl, game, georgia, black_history_month

Topic 8 key words: user, file, data, security, project, software, linux

Topic 9 key words: stock, rating, price, quarter, nyse, buy, report, nasdaq

Topic 10 key words: market, global, report, growth, data, forecast, research, analysis

**After combining these key words, I came up with these 10 topics for news articles:**

Topic 1: use technology & data to help business craft solutions for customers on the cloud platform

Topic 2: advertisement videos reveal things about black stars

Topic 3: people, in particular women, eat cookies during the week in states

Topic 4: people close dialog windows on music and video links in ukraine and russia

Topic 5: apple and google mobile features have best video game tabs and reviews

Topic 6: stock market prices are impacted by inflation in india for business investors, impacting billions of dollars

Topic 7: videos and music about official NBA and NFL games in georgia and oakland during black_history_month

Topic 8: user projects involving file data security issues using linux software

Topic 9: stock price and rating report from NYSE and Nasdaq this quarter

Topic 10: global market report uses data and research to do analysis to forecast growth

**Although some words overlap among certain topics, I believe each topic is specific, with minimal duplication. Therefore, the N for news articles is 10.**
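
The overlapping words can be listed explicitly. The sketch below copies the key words from the analysis above into a dictionary and reports every word that occurs in more than one topic; the helper name `shared_words` is made up for illustration.

```python
from collections import defaultdict

# Key words per topic, copied from the analysis above
news_topics = {
    1: ["business", "technology", "data", "product", "customer", "solution", "platform", "cloud"],
    2: ["star", "reveals", "advertisement", "black", "video"],
    3: ["state", "people", "cookie", "week", "user", "woman"],
    4: ["link", "people", "music", "video", "ukraine", "russia", "russian", "close_dialog_window"],
    5: ["game", "best", "open", "review", "apple", "video", "tab", "feature", "mobile", "google"],
    6: ["stock", "market", "price", "investor", "billion", "inflation", "india", "business"],
    7: ["video", "music", "official", "oakland", "nba", "nfl", "game", "georgia", "black_history_month"],
    8: ["user", "file", "data", "security", "project", "software", "linux"],
    9: ["stock", "rating", "price", "quarter", "nyse", "buy", "report", "nasdaq"],
    10: ["market", "global", "report", "growth", "data", "forecast", "research", "analysis"],
}

def shared_words(topics):
    """Map each word that occurs in more than one topic to the sorted topic ids."""
    seen = defaultdict(set)
    for topic_id, words in topics.items():
        for word in words:
            seen[word].add(topic_id)
    return {word: sorted(ids) for word, ids in seen.items() if len(ids) > 1}

overlap = shared_words(news_topics)
for word, ids in sorted(overlap.items()):
    print(word, ids)
```

Most shared words link only two topics; video is the widest, appearing in topics 2, 4, 5, and 7.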

## Topic Modeling for Tweets

### Create Bigrams/Trigrams

In [None]:
# Tokenize cleaned tweets text
tweets_df['cleaned_tweets_tokens'] = tweets_df['cleaned_tweets'].apply(word_tokenize)

# Create Bigrams and Trigrams
# Train the models
tweets_bigram_model = Phrases(tweets_df['cleaned_tweets_tokens'], min_count=5, threshold=100) # higher threshold fewer phrases.
tweets_trigram_model = Phrases(tweets_bigram_model[tweets_df['cleaned_tweets_tokens']], threshold=100)

# Apply the trained models to transform the sentences
tweets_df['tweets_bigrams'] = [tweets_bigram_model[doc] for doc in tweets_df['cleaned_tweets_tokens']]
tweets_df['tweets_trigrams'] = [tweets_trigram_model[tweets_bigram_model[doc]] for doc in tweets_df['cleaned_tweets_tokens']]


In [None]:
# Create a Dictionary and Corpus needed for Topic Modeling: every unique term is assigned an index
tweets_dictionary = Dictionary(tweets_df['tweets_trigrams'])

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
tweets_dictionary.filter_extremes(no_below=20, no_above=0.5)

# Create the Corpus: Term Document Frequency
tweets_corpus = [tweets_dictionary.doc2bow(text) for text in tweets_df['tweets_trigrams']]

### Select Right Number of Topics via Coherence Score

In [None]:
num_topics_range = range(3, 8, 1)
alpha_range = ['asymmetric']
eta_range = ['auto']

In [None]:
tweets_model_list, tweets_coherence_values = compute_coherence_values(dictionary=tweets_dictionary,
                                                                            corpus=tweets_corpus,
                                                                            texts=tweets_df['tweets_trigrams'],
                                                                            num_topics_range=num_topics_range,
                                                                            alpha_range=alpha_range,
                                                                            eta_range=eta_range)

In [None]:
# Displaying the coherence scores for each model for tweets text
for model_scores in tweets_coherence_values:
    print("Num Topics:", model_scores[0], " Alpha:", model_scores[1], " Eta:", model_scores[2], " Coherence:", model_scores[3])


Num Topics: 3  Alpha: asymmetric  Eta: auto  Coherence: 0.3512396691237734
Num Topics: 4  Alpha: asymmetric  Eta: auto  Coherence: 0.41765729813247005
Num Topics: 5  Alpha: asymmetric  Eta: auto  Coherence: 0.47929999655968986
Num Topics: 6  Alpha: asymmetric  Eta: auto  Coherence: 0.5079815339248602
Num Topics: 7  Alpha: asymmetric  Eta: auto  Coherence: 0.4968297390722774


### Visualize Actual Topics

In [None]:
# Choose 2 models with relatively high coherence scores to visualize actual topics
tweets_model_topics6 = tweets_model_list[3]
tweets_model_topics7 = tweets_model_list[4]

In [None]:
tweets_model_topics6.print_topics()

[(0,
  '0.020*"window" + 0.016*"new" + 0.012*"business" + 0.012*"team" + 0.011*"azure" + 0.010*"use" + 0.008*"office" + 0.008*"get" + 0.007*"tech" + 0.007*"people"'),
 (1,
  '0.039*"google" + 0.029*"apple" + 0.022*"amazon" + 0.015*"company" + 0.014*"word" + 0.013*"year" + 0.011*"free" + 0.011*"job" + 0.010*"team" + 0.009*"facebook"'),
 (2,
  '0.311*"ever" + 0.015*"ceo" + 0.015*"cloud" + 0.014*"msft" + 0.013*"stock" + 0.011*"dear" + 0.011*"never" + 0.009*"formatting" + 0.009*"azure" + 0.008*"data"'),
 (3,
  '0.041*"youtube" + 0.030*"viu" + 0.026*"premium" + 0.025*"excel" + 0.024*"netflix" + 0.024*"premium_account" + 0.024*"grammarly" + 0.022*"scribd" + 0.021*"canva" + 0.019*"spotify"'),
 (4,
  '0.077*"xbox" + 0.043*"game" + 0.024*"sony" + 0.020*"year" + 0.019*"playstation" + 0.016*"buy" + 0.016*"gaming" + 0.014*"metaverse" + 0.013*"billion" + 0.011*"one"'),
 (5,
  '0.041*"game" + 0.018*"like" + 0.017*"activision_blizzard" + 0.016*"deal" + 0.016*"make" + 0.015*"company" + 0.014*"sony" + 

With 6 topics, there is some overlap between topics. For example, topics 1 and 3 both contain the keyword azure, and topics 5 and 6 both contain the keyword sony.

In [None]:
tweets_model_topics7.print_topics()

[(0,
  '0.023*"window" + 0.017*"new" + 0.014*"azure" + 0.012*"team" + 0.010*"business" + 0.010*"use" + 0.008*"office" + 0.008*"people" + 0.008*"cloud" + 0.007*"security"'),
 (1,
  '0.041*"google" + 0.032*"apple" + 0.024*"amazon" + 0.016*"company" + 0.014*"year" + 0.011*"one" + 0.011*"team" + 0.010*"job" + 0.010*"word" + 0.010*"facebook"'),
 (2,
  '0.313*"ever" + 0.018*"ceo" + 0.013*"breaking" + 0.012*"never" + 0.012*"cloud" + 0.010*"dear" + 0.010*"news" + 0.009*"buy" + 0.009*"msft" + 0.009*"report"'),
 (3,
  '0.042*"youtube" + 0.032*"viu" + 0.029*"premium" + 0.026*"netflix" + 0.025*"premium_account" + 0.025*"grammarly" + 0.023*"scribd" + 0.023*"canva" + 0.020*"spotify" + 0.019*"canva_pro"'),
 (4,
  '0.082*"xbox" + 0.045*"game" + 0.028*"sony" + 0.019*"playstation" + 0.017*"year" + 0.017*"buy" + 0.016*"gaming" + 0.013*"metaverse" + 0.013*"company" + 0.012*"console"'),
 (5,
  '0.039*"game" + 0.020*"like" + 0.017*"activision_blizzard" + 0.016*"deal" + 0.016*"company" + 0.014*"make" + 0.013

Like the 6-topic model, the 7-topic model has some overlap between topics. For example, topics 5 and 6 both contain the keyword sony. However, the 7th topic is new and duplicates little from the previous 6 topics, so I chose 7 topics and visualize the key words below to better understand each topic and identify similarities.

In [None]:
%%time
# Visualize 7 topics from tweets
# warnings.filterwarnings('ignore')
lda_display = gensimvis.prepare(tweets_model_topics7, tweets_corpus, tweets_dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

![Topics for Tweets](tweets_topics.png)

**Analysis:**

**In the visualization, all 7 topics are far apart from each other, with no overlap.**

Topic 1 key words: azure, team, business, use, office, cloud, security

Topic 2 key words: google, apple, amazon, company, year, team, job, facebook, meta

Topic 3 key words: ceo, breaking, cloud, news, msft, report, stock

Topic 4 key words: youtube, premium, netflix, premium_account, grammarly, scribd, canva, spotify

Topic 5 key words: xbox, game, sony, playstation, gaming, metaverse, company

Topic 6 key words: game, activision_blizzard, deal, company, billion, sony

Topic 7 key words: excel, free, data, business, power, startup, training, technology

**After combining these key words, I came up with these 7 topics for tweets:**

Topic 1: azure cloud service is for teams and businesses to use in office for data security

Topic 2: teams and jobs in huge companies like google, apple, amazon, facebook, and meta this year

Topic 3: breaking news about msft cloud services, ceo, and stock report

Topic 4: premium accounts for popular social/entertainment platforms like youtube, netflix, scribd, canva, spotify

Topic 5: games from playstation of sony, xbox, and metaverse game companies

Topic 6: game company activision_blizzard has a billion-dollar deal with sony

Topic 7: startup businesses excel at powering free data training and technology

Topics 5 and 6 are extremely similar: both are about video games and game companies, so they can be merged into one topic.

Apart from this, the remaining 5 topics are specific, with minimal duplication.

**Therefore, the N for tweets is 6.**
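
The similarity between topics 5 and 6 can also be checked numerically. The sketch below computes the Jaccard similarity between every pair of key-word lists from the analysis above, using plain Python sets. The helper name `jaccard_similarity` is an assumption, and the score reflects only these hand-picked key words, not the full topic distributions.

```python
from itertools import combinations

# Key words per topic, copied from the analysis above
tweet_topics = {
    1: ["azure", "team", "business", "use", "office", "cloud", "security"],
    2: ["google", "apple", "amazon", "company", "year", "team", "job", "facebook", "meta"],
    3: ["ceo", "breaking", "cloud", "news", "msft", "report", "stock"],
    4: ["youtube", "premium", "netflix", "premium_account", "grammarly", "scribd", "canva", "spotify"],
    5: ["xbox", "game", "sony", "playstation", "gaming", "metaverse", "company"],
    6: ["game", "activision_blizzard", "deal", "company", "billion", "sony"],
    7: ["excel", "free", "data", "business", "power", "startup", "training", "technology"],
}

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

pair_scores = {
    (i, j): jaccard_similarity(tweet_topics[i], tweet_topics[j])
    for i, j in combinations(sorted(tweet_topics), 2)
}
most_similar = max(pair_scores, key=pair_scores.get)
print(most_similar, round(pair_scores[most_similar], 2))
```

Topics 5 and 6 share game, sony, and company, which gives them the highest pairwise score among these lists and is consistent with merging them.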