## NLP Project - Surendiran Rangaraj
**Date : 11/22/2021**


Illinois is famous for being one of the very few states in the country with negative population growth.  The objective of your final project is to:

    1. Identify the key reasons for the declining population (what people like / dislike about Chicago / suburbs) by extracting meaningful insights from unstructured text
    2. Provide actionable recommendations on what can be done to reverse this trend (how to make Chicago / suburbs more attractive)
    
You have access to a collection of ~200K news articles (about 500 MB).  The news articles are related to either Chicago and / or Illinois and you can access them in the following ways:

    . Download the data by following this link from your browser: https://storage.googleapis.com/msca-bdp-data-open/news/news_final_project.json
    . Use Spark on GCP news_final_project = spark.read.parquet('gs://msca-bdp-data-open/news_final_project')
    . Use Pandas from anywhere (your laptop, Colab or any cloud) df_news_final_project = pd.read_json('https://storage.googleapis.com/msca-bdp-data-open/news/news_final_project.json', orient='records', lines=True) 
 

To complete your assignment, I suggest considering the following steps:

    . Clean-up the noise (eliminate articles irrelevant to the analysis)
    . Detect major topics
    . Identify top reasons for population decline (negative sentiment)
        . Suggest corrective actions
        . Plot a timeline to illustrate how the sentiment is changing over time
    . Demonstrate how the city / state can attract new businesses (positive sentiment)
    . Leverage appropriate NLP techniques to identify organizations and people and apply targeted sentiment
        . Why businesses should stay in IL or move into IL?
            . Create appropriate visualization to summarize your recommendations (i.e. word cloud chart or bubble chart)
        . Why residents should stay in IL or move into IL?
            . Create appropriate visualization to summarize your recommendations (i.e. word cloud chart or bubble chart)

**Additional guidance:**

    . Default sentiment will likely be wrong from any software package and will require significant tweaking
        . Either keyword / dictionary approach or
        . Labeling and classification
        
    . You are encouraged to explore a combination of several techniques to identify key topics:
        . Topic modeling (i.e. LDA using gensim or ktrain)
        . Classification (hand-label several topics on a sample and then train classifier)
        . Clustering (cluster topics around pre-selected keywords or word vectors)
        . Zero-shot (NLI) modeling
        . Please ensure your PowerPoint presentation (in PPTX or PDF format) is submitted to the course module as-is (not zipped). Otherwise I am unable to use Canvas SpeedGrader.
        . The presentation should look professional – not a collection of screenshots from your analytical software
        . Roughly 8-12 pages is reasonable for this kind of project but there are no strict restrictions.
        . On your slides you will want to provide:
            . Executive Summary
            . Methodology and source data overview
            . Actionable recommendations
            . Apply text summarization algorithms where possible to synthesize your insights
        . Please submit your actual program codes (Jupyter notebooks) along with your PowerPoint
        . The slides should be self-sufficient: after reading them, there should not be any need to read the notebook (we are still asking you to provide the notebooks as proof of work though).
        . The slides should clearly answer all the questions and the answers should be supported with the plots/tables/numbers produced in the notebook based on the actual data.
        . The slides should contain the RIGHT amount of supporting material for each question, putting too much supporting material is as bad as putting too little: too much - you would not be able to keep the audience attention and your presentation would be a mess, too little - your statements would not look convincing.
        . Everything should be clear, logical, well organized, as simple as possible.  Use proper English and run spell check.
        . All the plots should be of production quality and easily readable. Foggy plots, untitled plots, unreadable labels, overlapping labels are unacceptable.
        . If your formatting somehow gets corrupted when you put your slides into Canvas (sometimes it happens), it is your responsibility to fix the formatting. For example, try saving the slides in another format such as PDF or HTML.
        . Any statements you make should be supported by data. Only recommendations or goals of the project sections can contain elements not directly supported by the data
        . Please submit your actual program codes (i.e. Python Notebook) along with your PowerPoint – as a separate attachment
            . Your presentation should be targeted toward business audience and must not contain any code snippets
    . You are welcome to use any software packages of your choice to complete the assignment

In [None]:
#!pip install rake_nltk --user

## Import Libraries

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import pickle
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import time
import math
from pprint import pprint
from textblob import TextBlob

import spacy
import multiprocessing
print('Python Version: ' + sys.version)
print('TensorFlow Version: ' + tf.__version__)

In [None]:
import gensim
from gensim import corpora, models
from gensim.models import LdaModel, LdaMulticore
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

In [None]:
import bokeh

print('Bokeh Version: ' + bokeh.__version__)

import ktrain

print('Ktrain Version: ' + ktrain.__version__)

In [None]:
pd.set_option('display.max_rows', 500) # Set Max Number of Rows
pd.set_option('display.max_columns', 50) # Set Max Number of Columns
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 200) # Set Max Width of Cell
pd.set_option('display.max_info_columns', 200) # Set Max Number of Columns Shown in info()
pd.set_option('display.precision', 6)#Set Display Precision of float values (0.123456)
#pd.options.display.float_format = '{:.2f}%'.format # Set Decimal Format (0.12%)

### Read Input Data

In [None]:
news_input_raw_json = pd.read_json("news_final_project.json",orient='records',lines=True)

In [None]:
news_input_raw_json.head()

In [37]:
news_input_raw_json.language.value_counts()

english    200891
Name: language, dtype: int64

In [None]:
print("Date Range-From:" , news_input_raw_json.date.min(), "to:" , news_input_raw_json.date.max())

### Randomly Select Records

Because the corpus is large (about 200K articles), only a random 30% sample of the rows is used to identify key topics.

In [None]:
news_input_json = news_input_raw_json.sample(frac=0.3)

### Data Cleaning and PreProcessing

Tokenize text into words and remove punctuation

In [None]:
%%time

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; the tokenizer itself drops punctuation and digits
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
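
For intuition, the tokenization step can be approximated with the standard library alone. This is a rough sketch rather than gensim's implementation: `approx_simple_preprocess` and `strip_accents` are hypothetical helpers, and the 2 to 15 character length limits are assumed to match the `simple_preprocess` defaults.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters and drop combining marks (the job of deacc=True)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, strip accents, and keep alphabetic runs within the length limits."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[^\W\d_]+", text)  # letters only: digits and punctuation split tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "Café prices rose 5% in Chicago's Loop!"
print(approx_simple_preprocess(sample))
```

Punctuation and digits never survive tokenization, and single-letter leftovers such as the `s` from `Chicago's` are dropped by the minimum length.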

In [None]:
%%time

stop_words = stopwords.words('english')

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

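def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document with spaCy, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))  # nlp is the spaCy pipeline, loaded before this function is called
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out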

In [43]:
# Remove special characters: keep only letters, digits, spaces and @ . , : _
# (everything else, including hyphens, is dropped)
keep_pattern = r'[^a-zA-Z0-9 @.,:_]'
news_input_json['text_clean'] = news_input_json['text'].map(lambda x: re.sub(keep_pattern, '', str(x)))
news_input_json['title_clean'] = news_input_json['title'].map(lambda x: re.sub(keep_pattern, '', str(x)))
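
A note on character classes: a hyphen written between two characters inside `[...]` defines a range rather than a literal hyphen, which is why hyphenated words show up fused (`oneyearold`, `allwheeldrive`) in the cleaned text later in this notebook. The sketch below contrasts a spaced-out class with one that keeps hyphens. The sample string is made up, and `hyphen_kept` is a hypothetical alternative, not the pattern used for the results shown here.

```python
import re

# Spaced-out class: the " - " between ':' and '_' is parsed as a range from
# space to space, so the literal hyphen is NOT part of the keep-set.
spaced = re.compile(r'[^a-zA-Z0-9 @ . , : - _]')

# A hyphen placed last in the class is a literal hyphen and is kept.
hyphen_kept = re.compile(r'[^a-zA-Z0-9 @.,:_-]')

sample = "one-year-old girl, all-wheel-drive!"
print(spaced.sub('', sample))
print(hyphen_kept.sub('', sample))
```

Whether to keep hyphens is a modeling choice; downstream tokenizers may split on them anyway.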

In [44]:
news_text_list = news_input_json['text_clean'].tolist()        
news_text_tokens = list(sent_to_words(news_text_list))

### Keyword Extraction - RAKE

In [127]:
from rake_nltk import Rake
r = Rake()  # Uses NLTK's English stopwords and all punctuation characters as phrase delimiters

def rake_implement(x, r):
    r.extract_keywords_from_text(x)  # extract candidate phrases from the text
    return r.get_ranked_phrases()    # keyword phrases ranked from highest to lowest score
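
To make the ranking less of a black box, here is a simplified sketch of RAKE-style scoring. It assumes the degree-to-frequency ratio as the word score (understood to be the default metric in `rake_nltk`); the tiny stopword list and the helper names `candidate_phrases` and `rake_rank` are illustrative only.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "in", "of", "a", "and"}  # tiny illustrative list, not NLTK's

def candidate_phrases(text):
    """Split text into runs of non-stopwords; stopwords and punctuation act as delimiters."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_rank(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

text = "The Chicago Auto Show is the largest auto show in the nation."
for phrase, score in rake_rank(text):
    print(f"{score:4.1f}  {phrase}")
```

Long phrases accumulate high scores under this scheme, which helps explain why the top RAKE phrases for news articles are often long runs of boilerplate.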


#### Appending RAKE keywords to the Dataframe

In [130]:
news_input_json['rake_phrases']=news_input_json['text_clean'].apply(lambda x: rake_implement(x,r)).apply(', '.join)

#### Selecting on RAKE keywords

In [137]:
news_input_json[['text', 'rake_phrases']][news_input_json['rake_phrases'].str.contains("camry", na=False)].head(50)

Unnamed: 0,text,rake_phrases
76389,"Like a February snowstorm, the Chicago Auto Show blew into town with great hullabaloo then seemed to fade. But the biggest weekend of the 110th installment of the nation’s largest and longest-runn...","ongoing 2018 chicago auto show coverage chicago auto show history chicago auto show 2018, chicago tribune automotive industry chicago auto show jeep nissan motor co, also giving away special chica..."
158716,Reckless murder alleged in a Near West Side crash that killed a one-year-old girl – NBC Chicago rogermilian 8 mins ago A serious offender convicted of the parole of a deadly drunk driver was charg...,"oneyearold girl nbc chicago rogermilian 8 mins ago facebook twitter linkedin tumblr pinterest reddit skype whatsapp telegram viber share via email print related articles, oneyearold girl nbc chica..."
148807,"At 6:29 pm, Contra Costa County Fire Protection District responded to a report of a shooting on Highway 4 near Port Chicago Highway in the City of Concord. By 6:40 pm, AMR arrived on scene reporti...","chp golden gate division special investigations unit siu, contra costa county fire protection district responded, another victim suffered moderate injuries due, highway 4 near port chicago highway..."
126363,"1 slaughtered as hostage situation ceases at l a chicago kee""> ارسال به 1 murdered as hostage occasion halts at los angeles shop perfect after cops go in pursuit of male assumed about picture taki...","automobile clearly chased cheap jacksonville jaguars jersey china arrest vehicle crash appropriate person, indiana viable hostage relationship distinctly intimately knowing, one girl cheap nike je..."
82691,"NA\nHighway Driving Assist Highway Driving Assist (HDA) is a driving convenience system that assists drivers in maintaining the vehicle in the center of the lane, at an appropriate speed, while ke...","reverse parking collisionavoidance assist reverse parking collisionavoidance assist helps detect pedestrians, frontrear wheel members chrysler pacifica allwheeldrivechrysler pacifica reaches new p..."
8725,"Get your motor running with over 18 interactive driving experiences at the 2018 Chicago Auto Show. In addition to opening the doors and getting behind the wheel of nearly 1,000 vehicles from 36 au...","kias new indoor test track featuring awd demonstrations, dodge challenger srt demon simulator lets users get, jonathon berlin auto reviews jeep kia motors corp, big trucks ,... robert duffer outdo..."
194295,"Though it might not be the hottest event in terms of vehicle debuts, the Chicago Auto Show is extremely important for consumers - you know, the people actually buying all these new cars. That's ho...","craig cole2020 honda civic type rthe 2020 honda civic type r, sean szymkowski 2020 mercedesbenz metris weekenderthe outdoorsy vanlife craze, andrew krok2021 chrysler pacificathe chrysler pacifica,..."
35768,"At least four people were carjacked in Chicago Tuesday evening into Wednesday morning in Chicago.\nThe victims include a man unloading groceries in Jefferson Park, an Uber driver in South Shore, a...","cps calls ctus proposed rejection, selfdefense mostvideo shows group wanted, wacker drive near willis tower, would cancel inperson learning, first reported carjacking happened, district official s..."
121217,"Domestic violence task force in honor of slain Joliet toddler approved by Illinois lawmakers Stacy St. Clair Chicago Tribune May 31, 2021 Save\nIllinois is expected to launch a new task force aime...","also reading man robs decatur gas station, starved rock state park david proebertanner miller, cassandra tanner miller warned dupage prosecutors, clair chicago tribune may 31, 3 billion budget tod..."
46253,"More than 45 people were shot , including a 14-year-old boy, and seven were killed in Chicago citywide over the weekend, police said Monday.There were 36 shootings reported from 6 p.m. Friday to 1...","white 2020 toyota camry around 9, kamil krzaczynskigetty images getty images, citys grand crossing neighborhood shortly, neither victim needed medical attention, car around 3 p, chicago police off..."


### gensim Models Setup

In [45]:
%%time

bigram = gensim.models.Phrases(news_text_tokens, min_count=1, threshold=1)
trigram = gensim.models.Phrases(bigram[news_text_tokens], threshold=1)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
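
The `threshold` argument is compared against a collocation score for each adjacent word pair. A minimal sketch of the default scoring formula as described in gensim's documentation follows; `bigram_scores` is a hypothetical helper, and taking N as the number of distinct words is a simplifying assumption.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs: (count(a,b) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # assumption: N taken as the number of distinct words
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

sentences = [
    ["chicago", "police", "said"],
    ["chicago", "police", "department"],
    ["police", "said"],
]
scores = bigram_scores(sentences, min_count=1)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Lower thresholds accept more pairs, so a setting such as `threshold=1` merges words into phrases quite aggressively on a large corpus.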


In [47]:
# Remove Stop Words
data_tokens_nostops = remove_stopwords(news_text_tokens)

In [48]:
# Create n-grams
data_words_bigrams = make_bigrams(data_tokens_nostops)
data_words_trigrams = make_trigrams(data_tokens_nostops)

# Combine tokens and n-grams
# data_tokens_combined = data_tokens_nostops + data_words_bigrams + data_words_trigrams
data_tokens_combined = data_words_trigrams

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Lemmatize text keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_tokens_combined, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(*data_lemmatized[:1])

['ernatilla', 'min_readstreaming', 'online_watch', 'white_knuckle', 'full_episode', 'chicago_season_episode', 'hd_quality', 'season_episode', 'white_knuckle', 'tv_show', 'honettv', 'watch_free', 'son', 'influential_former', 'officer', 'implicate', 'murder', 'moore_pressure', 'voight', 'charge', 'quickly', 'cutt_ly', 'stream', 'season_episode_full_episode', 'riveting_police', 'drama', 'men_women', 'chicago_police_departments', 'district', 'put', 'line', 'serve_protect', 'community', 'district_made', 'two_distinctly', 'different_group', 'uniformed_cop', 'patrol', 'beat', 'go_headtohead', 'city', 'street_crime', 'combat', 'offenses_organize', 'crime_drug', 'profile_murder', 'season_episode_full_episode', 'cast', 'ep', 'chicago_season_episode', 'full', 'seriesstreame', 'episode', 'white_knuckle', 'full_episode', 'exclusively', 'lets_go', 'watch', 'favourite', 'nbcchicago', 'onlinechicago', 'full_freechicago', 'season_episode', 'online_streame', 'media_streame', 'medium', 'multimedia', 'con

**Create Corpus Dictionary**

In [49]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index. 
dictionary = corpora.Dictionary(data_lemmatized)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in data_lemmatized]
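
The bag-of-words conversion above can be pictured with a small standard-library sketch. `build_vocab` and `to_bow` are hypothetical stand-ins for illustration; the ids gensim actually assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each unique token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["crime", "police", "crime"], ["tax", "pension", "tax", "police"]]
token2id = build_vocab(docs)
print(token2id)
print([to_bow(doc, token2id) for doc in docs])
```

Each document becomes a sparse list of (id, count) pairs, which is the input format the LDA model consumes.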

### Topic Modeling

#### LDA (latent dirichlet allocation)

In [50]:
# Get the number of worker processes (leave one core free)
num_processors = multiprocessing.cpu_count()
workers = max(1, num_processors - 1)

print(f'Using {workers} workers')

Using 3 workers


In [51]:
# Supporting function: train an LDA model on `corpus` and return its c_v coherence
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = LdaMulticore(corpus=corpus,
                       id2word=dictionary,
                       num_topics=k,
                       random_state=100,                  
                       passes=10,
                       alpha=a,
                       eta=b,
                       workers=workers)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

In [52]:
start_time = time.time()

def tic():
    global start_time 
    start_time = time.time()

def tac():
    # Note: k (the topic count being timed) is read from the notebook's global scope
    t_sec = round(time.time() - start_time)
    (t_min, t_sec) = divmod(t_sec,60)
    (t_hour,t_min) = divmod(t_min,60) 
    print(f'Execution time to calculate for topic {k}: {t_hour}hour:{t_min}min:{t_sec}sec')
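
As an alternative sketch, the same elapsed-time formatting can be wrapped in a context manager so the label is passed explicitly instead of being read from a global. `timed` and `format_elapsed` are hypothetical names, not part of the original notebook.

```python
import time
from contextlib import contextmanager

def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    t_min, t_sec = divmod(round(seconds), 60)
    t_hour, t_min = divmod(t_min, 60)
    return f'{t_hour}hour:{t_min}min:{t_sec}sec'

@contextmanager
def timed(label):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f'Execution time for {label}: {format_elapsed(elapsed)}')

with timed('topic 150'):
    sum(range(1000))  # placeholder for the real training call
```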

In [54]:
len(news_input_json)

60267

In [195]:
news_input_json.reset_index(inplace=True)

### Hyperparameter Tuning and save models 

In [83]:
%%time

from gensim.test.utils import datapath

#n_topics=50

for n_topics in range(150, 351, 50):
    print("n_topics:", n_topics)
    tuned_lda_model = LdaMulticore(corpus=doc_term_matrix,
                                   id2word=dictionary,
                                   num_topics=n_topics,
                                   random_state=100,
                                   passes=10,
                                   alpha='symmetric',
                                   eta='auto',
                                   workers=workers)

    coherence_model_lda = CoherenceModel(model=tuned_lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score: ', coherence_lda)

    # Save model to disk (datapath resolves inside gensim's test-data folder)
    modelname = "LdAsample_" + str(n_topics) + "Topics"
    print('modelname: ', modelname)
    temp_file = datapath(modelname)
    print('temp_file: ', temp_file)
    tuned_lda_model.save(temp_file)

n_topics: 150

Coherence Score:  0.49475499753485375
modelname:  LdAsample_150Topics
temp_file:  C:\Users\rsure\anaconda3\lib\site-packages\gensim\test\test_data\LdAsample_150Topics
n_topics: 200

Coherence Score:  0.5011341662288783
modelname:  LdAsample_200Topics
temp_file:  C:\Users\rsure\anaconda3\lib\site-packages\gensim\test\test_data\LdAsample_200Topics
n_topics: 250

Coherence Score:  0.4980895929543222
modelname:  LdAsample_250Topics
temp_file:  C:\Users\rsure\anaconda3\lib\site-packages\gensim\test\test_data\LdAsample_250Topics
n_topics: 300

Coherence Score:  0.4810048067543755
modelname:  LdAsample_300Topics
temp_file:  C:\Users\rsure\anaconda3\lib\site-packages\gensim\test\test_data\LdAsample_300Topics
n_topics: 350

Coherence Score:  0.4958557606958788
modelname:  LdAsample_350Topics
temp_file:  C:\Users\rsure\anaconda3\lib\site-packages\gensim\test\test_data\LdAsample_350Topics
Wall time: 10h 29min 42s
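
`datapath` resolves names inside gensim's bundled test-data folder, which is why the models above land under `site-packages\gensim\test\test_data`. A project-local folder is usually easier to find and back up. The sketch below only builds the paths; `MODEL_DIR` and `model_path` are hypothetical names.

```python
from pathlib import Path

MODEL_DIR = Path("models")  # hypothetical project-local folder

def model_path(n_topics, model_dir=MODEL_DIR, create=False):
    """Build the save path for a model; optionally create the folder first."""
    if create:
        model_dir.mkdir(parents=True, exist_ok=True)
    return model_dir / f"LdAsample_{n_topics}Topics"

for n_topics in range(150, 351, 50):
    path = model_path(n_topics)
    print(path)
    # tuned_lda_model.save(str(path))   # inside the training loop
    # LdaModel.load(str(path))          # when reloading later
```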


In [80]:
#50topics: #Coherence Score:  0.4661072653870857


Coherence Score:  0.4661072653870857


**Loading the saved 50-topic model to check its coherence**

In [93]:
%%time
# Load a  pretrained model with 50 Topics from disk to view the Coherence Score
load_50model_file = datapath('LdAsample_50Topics')
lda_50model= LdaModel.load(load_50model_file)
coherence_50model_lda = CoherenceModel(model=lda_50model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_50lda = coherence_50model_lda.get_coherence()
print('\nCoherence Score: ', coherence_50lda)


Coherence Score:  0.4661072653870857


In [111]:
# Print the Keyword in the topics
pprint(lda_50model.print_topics())

[(26,
  '0.062*"bull" + 0.032*"chicago_bull" + 0.014*"photo" + 0.009*"lavine" + '
  '0.008*"flickr" + 0.006*"explore" + 0.006*"nba" + 0.005*"laker" + '
  '0.004*"eastern_conference" + 0.004*"markkanen"'),
 (39,
  '0.011*"area" + 0.008*"airport" + 0.007*"flight" + 0.007*"snow" + '
  '0.007*"river" + 0.006*"ticket" + 0.006*"pm" + 0.006*"travel" + '
  '0.006*"expect" + 0.005*"route"'),
 (46,
  '0.012*"write" + 0.011*"log" + 0.010*"story" + 0.009*"change" + 0.007*"link" '
  '+ 0.006*"commenting_use" + 0.006*"unsubscribe" + 0.005*"whenever" + '
  '0.005*"setting" + 0.005*"get_notifie"'),
 (48,
  '0.013*"hair" + 0.009*"hair_extension" + 0.008*"extension" + 0.004*"covid" + '
  '0.004*"confirmed_case" + 0.004*"dentist" + 0.003*"illinois_reporte" + '
  '0.003*"dental" + 0.002*"natural_hair" + 0.002*"friday_march"'),
 (27,
  '0.012*"show" + 0.011*"day" + 0.008*"see" + 0.005*"episode" + 0.005*"love" + '
  '0.005*"film" + 0.005*"life" + 0.005*"character" + 0.005*"story" + '
  '0.004*"go"'),
 (18,


In [94]:
%%time

lda50_display = gensimvis.prepare(lda_50model, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda50_display)

Wall time: 5min 16s


In [132]:
# all_topics50 = {}
# min_topics = 0
# tuned_topics = 50
# num_terms = 10
# lambd = 0.5  # Adjust this accordingly

# for i in range(min_topics,tuned_topics): #Adjust number of topics in final model
#     topic50 = lda50_display.topic_info[lda50_display.topic_info.Category == 'Topic'+str(i)]
#     topic50['relevance'] = topic50['loglift']*(1-lambd) + topic50['logprob']*lambd
#     all_topics50['Topic '+str(i)] = topic50.sort_values(by='relevance', ascending=False).Term[:num_terms].values
# pd.DataFrame(all_topics50).T

The 100-topic model has the best coherence (0.516), and the 200-topic model is second best (0.501).

In [88]:
# Load the best pretrained model (100 topics) from disk
load_model_file = datapath('LdAsample_100Topic')
tuned_lda_model= LdaModel.load(load_model_file)
coherence_model_lda = CoherenceModel(model=tuned_lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5157499068489365
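
The coherence scores printed in this notebook can be ranked in a few lines to make the model choice explicit. The values below are copied from the cell outputs; the variable names are illustrative.

```python
# Coherence (c_v) scores as printed in this notebook, keyed by number of topics
coherence_by_k = {
    50: 0.4661072653870857,
    100: 0.5157499068489365,
    150: 0.49475499753485375,
    200: 0.5011341662288783,
    250: 0.4980895929543222,
    300: 0.4810048067543755,
    350: 0.4958557606958788,
}

# Sort from highest to lowest coherence
ranked = sorted(coherence_by_k.items(), key=lambda kv: kv[1], reverse=True)
for k, score in ranked:
    print(f"{k:>3} topics: {score:.4f}")

best_k = ranked[0][0]
print("Best number of topics by coherence:", best_k)
```

The differences between neighbouring settings are small, so interpretability of the resulting topics is worth weighing alongside the raw score.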


**Print the Keyword in the topics**

In [89]:
# Print the Keyword in the topics
pprint(tuned_lda_model.print_topics())
doc_lda = tuned_lda_model[doc_term_matrix]

[(90,
  '0.025*"hair" + 0.016*"hair_extension" + 0.016*"bankruptcy" + '
  '0.014*"extension" + 0.012*"total" + 0.010*"case" + '
  '0.008*"preliminary_sevenday" + 0.008*"illinois_reporte" + 0.007*"past_hour" '
  '+ 0.006*"specimen"'),
 (76,
  '0.039*"match" + 0.017*"view" + 0.016*"stream" + 0.008*"los_angele" + '
  '0.007*"home" + 0.007*"free" + 0.007*"register" + 0.007*"information" + '
  '0.007*"matches_drawn_matches_lost" + 0.006*"chance"'),
 (26,
  '0.006*"best_workplace" + 0.005*"hick" + 0.005*"und" + 0.004*"great_place" + '
  '0.004*"maxim_integrated_product" + 0.004*"yum_brand" + 0.003*"underwriter" '
  '+ 0.002*"underwriting" + 0.002*"quad" + 0.002*"shares_well"'),
 (25,
  '0.025*"office" + 0.015*"community_consolidated_school_district" + '
  '0.013*"community_high_school_district" + 0.011*"unit_school" + '
  '0.010*"township_high_school_district" + 0.009*"special_education" + '
  '0.008*"regional_office" + 0.006*"pension" + 0.006*"elementary_school" + '
  '0.006*"community_unit

**Visualize the Topic Clusters and the Coherence**

In [90]:
%%time

tuned_lda_display = gensimvis.prepare(tuned_lda_model, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(tuned_lda_display)

Wall time: 12min 19s


In [None]:
# exchange, fee, license, lottery, connection, kelly, commission, center, market_participant, charge, back, blue, facility, connectivity, public, section, fee_schedule, act, service, musician

In [119]:
all_topics = {}
min_topics = 1
tuned_topics = 100
num_terms = 15
lambd = 1  # Relevance weight: 1 ranks terms purely by logprob; adjust as needed

# pyLDAvis labels categories Topic1..TopicN, so include the upper bound
for i in range(min_topics, tuned_topics + 1):
    # .copy() avoids pandas' SettingWithCopyWarning when adding the relevance column
    topic = tuned_lda_display.topic_info[tuned_lda_display.topic_info.Category == 'Topic'+str(i)].copy()
    topic['relevance'] = topic['loglift']*(1-lambd) + topic['logprob']*lambd
    all_topics['Topic '+str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
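
The relevance score in the cell above is a weighted mix of a term's log probability within the topic and its log lift (the ratio of that probability to the term's overall probability). A small sketch with made-up probabilities shows how the weight changes the ranking; `relevance` and `rank_terms` are hypothetical helpers.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    logprob = math.log(p_term_given_topic)
    loglift = math.log(p_term_given_topic / p_term)
    return lam * logprob + (1 - lam) * loglift

# Made-up probabilities: (p(w | topic), p(w) in the whole corpus)
terms = {"say": (0.05, 0.04), "brewery": (0.02, 0.001)}

def rank_terms(lam):
    """Order the terms from most to least relevant for a given weight."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

for lam in (1.0, 0.5):
    print(f"lambda={lam}: {rank_terms(lam)}")
```

With a weight of 1 the ranking favours frequent but generic words, while lower weights promote terms that are distinctive for the topic.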


In [138]:
topic_df = pd.DataFrame(all_topics).T
topic_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Topic 1,work,company,employee,team,experience,client,business,ability,support,include,job,position,service,require,provide
Topic 2,beer,brewery,beat,brew,reading,golden_knight,ipa,chicago_kid,facebook_twitter,craft_beer,northwestern_memorial,temp,per_month,harrelson,taproom
Topic 3,site,date,indiana_state,continue_reade,sycamore,purchase_somethe,recommended_link,instruct,observe,kanye_w,provide,compensate,central_heate,clean,floor
Topic 4,etf,shark,valentine,kimberly,hair,seven_people,luxy_hair,price_range,contribution,braid,plain,woodland,etf_worth,wbbmtv,driving_home
Topic 5,scooter,san_jose,southbound,mobile_app,cattle,comorbiditie,marquette,use_contact,ios_android,carjacker,spotify,criminal_trespass,recommended_video,rider,yellow
Topic 6,percent_increase,chicago_theatre,gaming_board,chase_field,pritzker,iwu,troll,illinois_wesleyan,water_level,hartman,child_welfare,wrigley_field,hamel,long_beach,news_network
Topic 7,accident,attorney,injury,deer,personal_injury,hog,medical_malpractice,selina_meyer,lawyer,car_accident,shots_fired,illinois_tollway,different_specie,dar,forest
Topic 8,hammond,last_update,comments_observe,hilton,financial_condition,batting_average_era_outscore,cvs,human_service,well,odonnell,pdf,ap_associate,manchester_unite,phone_email,traverse_city
Topic 9,snow,storm,inch,show_less,area,forecast,rain,festival,report,expect,damage,national_weather_service,chicago_area,cancel,show
Topic 10,department,human_specialist,city,federal_agent,say,madison_county,funeral,cocaine,service_administrator,carter_said,shooting_black,work,kilogram,dan_ryan,karen
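
The term ranking behind this table blends the `logprob` and `loglift` columns of pyLDAvis, weighted by `lambd`. The following is a minimal plain-Python sketch of that weighting. The helper name `top_terms_by_relevance` and the terms and scores in `toy_topic` are made up for illustration and are not taken from the model.

```python
def top_terms_by_relevance(rows, lambd, num_terms):
    """Rank terms by relevance = lambd * logprob + (1 - lambd) * loglift."""
    scored = [
        (lambd * row["logprob"] + (1 - lambd) * row["loglift"], row["term"])
        for row in rows
    ]
    # Highest relevance first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:num_terms]]


# Hypothetical values: 'term_a' is frequent but generic, 'term_c' is rare but distinctive
toy_topic = [
    {"term": "term_a", "logprob": -1.0, "loglift": 0.5},
    {"term": "term_b", "logprob": -2.0, "loglift": 2.0},
    {"term": "term_c", "logprob": -3.0, "loglift": 3.5},
]

print(top_terms_by_relevance(toy_topic, lambd=1.0, num_terms=2))
print(top_terms_by_relevance(toy_topic, lambd=0.0, num_terms=2))
```

With `lambd = 1` the ranking reduces to plain term probability within the topic, while lower values favor terms that are distinctive to the topic.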


#### Hyperparameter Tuning (can be removed: tuning was already done above and the number of topics has been selected)

In [61]:
%%time

tuning = False

if tuning:

    grid = {}
    grid['Validation_Set'] = {}
    # Topics range
    min_topics = 25
    max_topics = 250
    step_size = 5
    topics_range = range(min_topics, max_topics+1, step_size)

    # Alpha parameter
    #alpha = list(np.arange(0.01, 1, 0.3))
    #alpha.append('symmetric')
    #alpha.append('asymmetric')
    alpha = ['asymmetric'] # Run for number of topics only

    # Beta parameter
    #beta = list(np.arange(0.01, 1, 0.3))
    #beta.append('symmetric')
    #beta.append('auto')
    beta = ['auto'] # Run for number of topics only


    # Validation sets
    num_of_docs = len(doc_term_matrix)
    corpus_sets = [doc_term_matrix]
    #corpus_title = ['75% Corpus', '100% Corpus']
    corpus_title = ['100% Corpus']
    model_results = {
                     'Topics': [],
                     'Alpha': [],
                     'Beta': [],
                     'Coherence': []
                    }

    itr = 0
    itr_total = len(beta)*len(alpha)*len(topics_range)*len(corpus_title)
    print(f'LDA will execute {itr_total} iterations')


    # iterate through hyperparameters
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            tic()
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    itr += 1
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=dictionary, 
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    pct_completed = round((itr / itr_total * 100),1)
    #                 print(f'Completed Percent: {pct_completed}%, Corpus: {corpus_title[i]}, Topics: {k}, Alpha: {a}, Beta: {b}, Coherence: {cv}')
            print(f'Completed model based on {k} LDA topics. Finished {pct_completed}% of LDA runs')
            tac()

    lda_tuning = pd.DataFrame(model_results)
    lda_tuning.to_csv('lda_tuning_results.csv', index=False)
    # Review the ten best coherence scores (printed explicitly, since a bare
    # expression inside the if block is not displayed by the notebook)
    print(lda_tuning.sort_values(by=['Coherence'], ascending=False).head(10))
    lda_tuning.plot(x ='Topics', y='Coherence', kind = 'scatter', xticks=topics_range)
    lda_tuning.plot(x ='Topics', y='Coherence', kind = 'line', xticks=topics_range)

Wall time: 4 ms
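
The grid above trains one model per combination of topic count, alpha and beta, then keeps the combination with the highest coherence. The following is a small plain-Python sketch of that pattern. `fake_coherence` is a hypothetical stand-in for `compute_coherence_values` (which trains a real LDA model), and its peak at 75 topics is arbitrary and says nothing about the actual results.

```python
from itertools import product


def fake_coherence(k, a, b):
    # Hypothetical scoring function: peaks at k = 75 purely for illustration
    return 0.5 - abs(k - 75) / 1000


topics_range = range(25, 251, 5)   # same range as the tuning cell
alpha = ['asymmetric']
beta = ['auto']

# Every (topics, alpha, beta) combination that would be trained
grid = list(product(topics_range, alpha, beta))

results = [
    {'Topics': k, 'Alpha': a, 'Beta': b, 'Coherence': fake_coherence(k, a, b)}
    for k, a, b in grid
]

# Keep the combination with the highest coherence score
best = max(results, key=lambda r: r['Coherence'])

print(f'{len(grid)} runs, best: {best}')
```

With a single alpha and a single beta value, the number of runs equals the number of topic counts in `topics_range`, which is why the cell above only tunes the number of topics.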


Running Best Model

In [None]:
if tuning:
    # Row with the highest coherence score
    lda_tuning_best = lda_tuning.sort_values(by=['Coherence'], ascending=False).head(1)

    tuned_topics = int(lda_tuning_best['Topics'].iloc[0])

    # Alpha and Beta can be floats or labels such as 'symmetric', 'asymmetric' or 'auto':
    # convert to float when possible, otherwise keep the stripped label
    try:
        tuned_alpha = float(lda_tuning_best['Alpha'].iloc[0])
    except ValueError:
        tuned_alpha = str(lda_tuning_best['Alpha'].iloc[0]).strip()

    try:
        tuned_beta = float(lda_tuning_best['Beta'].iloc[0])
    except ValueError:
        tuned_beta = str(lda_tuning_best['Beta'].iloc[0]).strip()

    print(f'Best Parameters: Topics: {tuned_topics}, Alpha: {tuned_alpha}, Beta: {tuned_beta}')

In [None]:
%%time
if tuning:
    tuned_lda_model = LdaMulticore(corpus=doc_term_matrix,
                                   id2word=dictionary,
                                   num_topics=tuned_topics,
                                   random_state=100,
                                   passes=10,
                                   alpha=tuned_alpha,
                                   eta=tuned_beta,
                                   workers=workers)

    coherence_model_lda = CoherenceModel(model=tuned_lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
    # Compute the score before printing it (the variable was previously never assigned in this cell)
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score: ', coherence_lda)
    tuned_lda_display = gensimvis.prepare(tuned_lda_model, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
    pyLDAvis.display(tuned_lda_display)

In [206]:
def add_topic_to_input(ldamodel, corpus, original_df):
    # Find the top (highest-probability) topic for each document in the corpus
    top_topic_rows = []

    for doc_topics in ldamodel[corpus]:
        if not doc_topics:
            # Keep row alignment with original_df when a document has no topic
            top_topic_rows.append([None, None, ''])
            continue

        # Top topic, its contribution and its keywords for this document
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num, topn=20)
        topic_keywords = ", ".join([word for word, prop in wp])
        top_topic_rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    top_topics_df = pd.DataFrame(top_topic_rows,
                                 columns=['Top_Topic_Assigned', 'Perc_Contribution', 'Topic_Keywords'])
    # concat aligns on the index, so original_df is expected to have a default 0..n-1 index
    final_topics_df = pd.concat([original_df, top_topics_df], axis=1)
    return final_topics_df

In [207]:
add_topic_df = add_topic_to_input(ldamodel = tuned_lda_model, corpus = doc_term_matrix , original_df = news_input_json)

In [209]:
add_topic_df.head()

Unnamed: 0,index,date,language,title,text,text_clean,title_clean,rake_phrases,Top_Topic_Assigned,Perc_Contribution,Topic_Keywords
0,162424,2020-11-18,english,"Chicago P.D Season 8 || Episode 2, Full Stream! On NBC!! | Chicago P.D 8x2 [NBC] 2020","Ernatilla Follow Nov 18 · 5 min read\nStreaming ~ Online!! Watch Chicago P.D (2020) S8/E2 Season 8 , Episode 2 ~ White Knuckle ~ FULL EPISODES 2 ~ Chicago P.D Season 8 Episode 2 [HD Quality 1080p]...","Ernatilla Follow Nov 18 5 min readStreaming Online Watch Chicago P.D 2020 S8E2 Season 8 , Episode 2 White Knuckle FULL EPISODES 2 Chicago P.D Season 8 Episode 2 HD Quality 1080pWatch Chicago ...","Chicago P.D Season 8 Episode 2, Full Stream On NBC Chicago P.D 8x2 NBC 2020","ernatilla follow nov 18 5 min readstreaming online watch chicago p, users whose internet connection lacks satisfactory bandwidth may experience stops, white knuckle hd free tv show honettv watch f...",78.0,0.891,"exchange, fee, license, lottery, connection, kelly, commission, center, market_participant, charge, back, blue, facility, connectivity, public, section, fee_schedule, act, service, musician"
1,99742,2021-07-25,english,What to know about dog vomit slime mold in your garden - Chicago Tribune,"Dog vomit slime mold looks as gross as it sounds. Here’s what you need to know for your garden. By Tim Johnson, Chicago Botanic Garden Chicago Tribune | Jul 25, 2021 at 5:00 AM “I found an odd-loo...","Dog vomit slime mold looks as gross as it sounds. Heres what you need to know for your garden. By Tim Johnson, Chicago Botanic Garden Chicago Tribune Jul 25, 2021 at 5:00 AM I found an oddlooking...",What to know about dog vomit slime mold in your garden Chicago Tribune,"chicago botanic garden chicago tribune jul 25, slime molds like dog vomit slime mold, create new dog slime mold patches, lot like dog vomit slime mold, dog vomit slime mold looks, advertisement do...",39.0,0.4188,"post, link, contact, bet, casino, sports_bette, unsolicited_service, offers_post_id, gambling, card, sportsbook, new, operator, american_water, gaming, wager, back, text, customer, use"
2,141177,2020-03-04,english,"Steel Tongue Scraper Review, Tank Drum Vente, Steel Drums Chicago, Handpan 9 Tone","Petal, reed, metal, tonal drums. The written agreement is explicitly tuned. It goes in fine with the flute, piano, guitar. To fritz an awesome implement that recently appeared - the Petal note dru...","Petal, reed, metal, tonal drums. The written agreement is explicitly tuned. It goes in fine with the flute, piano, guitar. To fritz an awesome implement that recently appeared the Petal note drum...","Steel Tongue Scraper Review, Tank Drum Vente, Steel Drums Chicago, Handpan 9 Tone","infoopus tongue drumhandpan drum unitmetal drum bowlelite gift boxes norwichoriginal wallet gift setspecial 30th gift ideashandpan cheaphandpan drum playingmetal drum ufo, petal note drum, always ...",99.0,0.3986,"pt, simmon, flight_deal, find, say, boden, need, chicago_new_york, season_episode, lieutenant, went_misse, gabby, gmt, colombo, booker, labor_day_weekend, payne, ben_simmon, move, dance_floor"
3,10493,2021-04-18,english,"ERE'S EVERYWHERE BUT THE BORDER WHERE SHE'S BEEN: OAKLAND, LOS ANGELES, CHICAGO. NEW HAVEN","ERE'S EVERYWHERE BUT THE BORDER WHERE SHE'S BEEN: OAKLAND, LOS ANGELES, CHICAGO. NEW HAVEN","ERES EVERYWHERE BUT THE BORDER WHERE SHES BEEN: OAKLAND, LOS ANGELES, CHICAGO. NEW HAVEN","ERES EVERYWHERE BUT THE BORDER WHERE SHES BEEN: OAKLAND, LOS ANGELES, CHICAGO. NEW HAVEN","los angeles, eres everywhere, shes, oakland, new, chicago, border",92.0,0.576,"say, go, get, know, s, want, think, see, come, time, make, work, people, way, take, try, need, do, tell, thing"
4,44292,2021-04-28,english,Gov. Pritzker signs bills improving healthcare equity across Illinois,Gov. Pritzker signs bills improving healthcare equity across Illinois Posted on 1 News - 1 eMovies - 1 eMusic - 1 eBooks - 1 Search Search for:\nIllinois is taking bold steps to make health care m...,Gov. Pritzker signs bills improving healthcare equity across Illinois Posted on 1 News 1 eMovies 1 eMusic 1 eBooks 1 Search Search for:Illinois is taking bold steps to make health care more eq...,Gov. Pritzker signs bills improving healthcare equity across Illinois,"1 news 1 emovies 1 emusic 1 ebooks 1 search search, pritzker signs bills improving healthcare equity across illinois posted, taking bold steps, states underserved population, make health care, bri...",33.0,0.6302,"state, say, people, covid, number, accord, pandemic, case, report, virus, coronavirus, continue, week, include, vaccine, day, increase, test, need, plan"


In [212]:
tuned_lda_model.show_topic(78,topn=20)

[('exchange', 0.015354435),
 ('fee', 0.006578189),
 ('license', 0.004558412),
 ('lottery', 0.0040707258),
 ('connection', 0.003813696),
 ('kelly', 0.0036245694),
 ('commission', 0.0035350057),
 ('center', 0.0033487435),
 ('market_participant', 0.0031338562),
 ('charge', 0.0031337996),
 ('back', 0.002978791),
 ('blue', 0.002944234),
 ('facility', 0.0029413865),
 ('connectivity', 0.002892276),
 ('public', 0.0027476945),
 ('section', 0.00269197),
 ('fee_schedule', 0.0024799276),
 ('act', 0.0024186783),
 ('service', 0.0023765552),
 ('musician', 0.002312407)]

In [213]:
tuned_lda_model.show_topic(90,topn=20)

[('hair', 0.025253845),
 ('hair_extension', 0.016275372),
 ('bankruptcy', 0.015996182),
 ('extension', 0.013508565),
 ('total', 0.011866739),
 ('case', 0.0095039485),
 ('preliminary_sevenday', 0.00832393),
 ('illinois_reporte', 0.008123738),
 ('past_hour', 0.007123231),
 ('specimen', 0.0062287995),
 ('covid', 0.0054896604),
 ('vaccines_administere', 0.005449063),
 ('icu_patient', 0.0052515194),
 ('laboratories_reporte', 0.0052163587),
 ('probable_case', 0.004651411),
 ('including_death', 0.004544768),
 ('statewide_positivity', 0.00444678),
 ('total_case', 0.0043079057),
 ('positivity', 0.004149218),
 ('percent', 0.0040197694)]

In [None]:
# [(90,
#   '0.025*"hair" + 0.016*"hair_extension" + 0.016*"bankruptcy" + '
#   '0.014*"extension" + 0.012*"total" + 0.010*"case" + '
#   '0.008*"preliminary_sevenday" + 0.008*"illinois_reporte" + 0.007*"past_hour" '
#   '+ 0.006*"specimen"'),

In [197]:
# #Function to find the Top topic in each query
# final_topics_df = pd.DataFrame()
# # Get main topic in each query
# for i, row in enumerate(tuned_lda_model[doc_term_matrix]):
#     print(i,row)
#     rows = sorted(row, key=lambda x: (x[1]), reverse=True)
#     print("rows:", rows)
#     for j, (topic_num, prop_topic) in enumerate(rows):
#         if j == 0:  # => Top topic
#             wp = tuned_lda_model.show_topic(topic_num,topn=20)
#             topic_keywords = ", ".join([word for word, prop in wp])
#             final_topics_df = final_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
#         else:
#             break
    
#     if i == 1:
#         final_topics_df.columns = ['Top_Topic_Assigned', 'Perc_Contribution', 'Topic_Keywords']
#         #original_contents = pd.Series(news_input_json[0:2])
#         final_topics_df = pd.concat([final_topics_df, news_input_json[0:2]], axis=1)
#         break
     

0 [(11, 0.026440859), (61, 0.021336608), (70, 0.026190212), (78, 0.8910775), (90, 0.028919829)]
rows: [(78, 0.8910775), (90, 0.028919829), (11, 0.026440859), (70, 0.026190212), (61, 0.021336608)]
1 [(10, 0.23756656), (27, 0.037593875), (37, 0.07031458), (39, 0.4184647), (92, 0.19283536), (93, 0.01152359), (97, 0.026066773)]
rows: [(39, 0.4184647), (10, 0.23756656), (92, 0.19283536), (37, 0.07031458), (27, 0.037593875), (97, 0.026066773), (93, 0.01152359)]
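
Each printed row is a list of `(topic_id, probability)` pairs for one document, and the top topic is simply the pair with the largest probability. The following is a minimal plain-Python sketch using the two documents printed above. The helper name `dominant_topic` is hypothetical.

```python
def dominant_topic(doc_topics):
    """Return (topic_id, probability rounded to 4 places) of the highest-weight topic."""
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    return int(topic_id), round(prob, 4)


# Topic distributions copied from the debug output above
doc_0 = [(11, 0.026440859), (61, 0.021336608), (70, 0.026190212),
         (78, 0.8910775), (90, 0.028919829)]
doc_1 = [(10, 0.23756656), (27, 0.037593875), (37, 0.07031458),
         (39, 0.4184647), (92, 0.19283536), (93, 0.01152359), (97, 0.026066773)]

for doc in (doc_0, doc_1):
    print(dominant_topic(doc))
```

Using `max` avoids sorting the full list when only the first entry is needed.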


In [222]:
review_list = add_topic_df[add_topic_df['Top_Topic_Assigned'] == 0].head()

In [223]:
review_list

Unnamed: 0,index,date,language,title,text,text_clean,title_clean,rake_phrases,Top_Topic_Assigned,Perc_Contribution,Topic_Keywords
19,48393,2020-06-17,english,Illinois Department of Insurance Adopts Rule to Require Corporate Governance Reporting for Insurers,"The Illinois Department of Insurance has adopted a new rule, 50 Ill. Adm. Code 630, Corporate Governance Annual Disclosure , effective May 29, 2020. The new rule will require corporate governance ...","The Illinois Department of Insurance has adopted a new rule, 50 Ill. Adm. Code 630, Corporate Governance Annual Disclosure , effective May 29, 2020. The new rule will require corporate governance ...",Illinois Department of Insurance Adopts Rule to Require Corporate Governance Reporting for Insurers,"naics corporate governance annual disclosure model act 305, corporate governance annual disclosure model regulation 306, impose additional corporate governance rules, corporate governance annual d...",0.0,0.7046,"work, company, employee, team, experience, client, business, ability, support, include, job, position, service, require, provide, ensure, need, management, process, lead"
86,57327,2019-12-02,english,Chicago top cop Eddie Johnson; asleep at the wheel and fired [video],Mayor Lori Lightfoot fired embattled Chicago police Supt. Eddie Johnson this morning announcing that the termination is effective immediately. The termination comes just weeks before Johnson was s...,Mayor Lori Lightfoot fired embattled Chicago police Supt. Eddie Johnson this morning announcing that the termination is effective immediately. The termination comes just weeks before Johnson was s...,Chicago top cop Eddie Johnson asleep at the wheel and fired video,"officer jason van dyke fatally shooting laquan mcdonald, mayor lori lightfoot fired embattled chicago police supt, false statements regarding material aspects, false statements regarding material ...",0.0,0.6974,"work, company, employee, team, experience, client, business, ability, support, include, job, position, service, require, provide, ensure, need, management, process, lead"
104,153694,2021-04-06,english,Examine This Report on Milton Lee Olive Park of Chicago,"Examine This Report on Milton Lee Olive Park of Chicago Examine This Report on Milton Lee Olive Park of Chicago Category: Blog Now I recognize that not Every person has an image ideal yard, but yo...","Examine This Report on Milton Lee Olive Park of Chicago Examine This Report on Milton Lee Olive Park of Chicago Category: Blog Now I recognize that not Every person has an image ideal yard, but yo...",Examine This Report on Milton Lee Olive Park of Chicago,"itinerary automatically incorporates milton lee olive park, sixtysquaremile place called war zone, h2o filtration plant future doorway, big indoor lap pool highlight, among chicagos wellknown deep...",0.0,0.6676,"work, company, employee, team, experience, client, business, ability, support, include, job, position, service, require, provide, ensure, need, management, process, lead"
146,186792,2020-05-02,english,Travel Feature: Illinois is Great For Touring & Tasting Wine,"/ Travel Feature: Illinois is Great For Touring & Tasting Wine Travel Feature: Illinois is Great For Touring & Tasting Wine May 2, 2020 By: Alison Blackman Like it? Share it! *Editor’s note: Just ...","Travel Feature: Illinois is Great For Touring Tasting Wine Travel Feature: Illinois is Great For Touring Tasting Wine May 2, 2020 By: Alison Blackman Like it Share it Editors note: Just prior t...",Travel Feature: Illinois is Great For Touring Tasting Wine,"amazing fruit wine called berry berry berry 15, introducing two american vinicultural areas avas, two american vinicultural areas avas, advice sisters wine spirits editor, brand new hyatt right do...",0.0,0.4253,"work, company, employee, team, experience, client, business, ability, support, include, job, position, service, require, provide, ensure, need, management, process, lead"
203,176264,2018-11-07,english,Ex-Rock Hall boss Terry Stewart champions ‘Chicago Plays the Stones’ gig at Beachland | cleveland.com,"Ex-Rock Hall boss Terry Stewart champions ‘Chicago Plays the Stones’ gig at Beachland Posted November 6, 2018 at 7:00 AM Guitarist Ronnie Baker Brooks and harp player Billy Branch are two of the k...","ExRock Hall boss Terry Stewart champions Chicago Plays the Stones gig at Beachland Posted November 6, 2018 at 7:00 AM Guitarist Ronnie Baker Brooks and harp player Billy Branch are two of the key ...",ExRock Hall boss Terry Stewart champions Chicago Plays the Stones gig at Beachland cleveland.com,"exrock hall boss terry stewart champions chicago plays, plain dealer lonnie timmons iii lonnie timmon, track doo doo doo doo heartbreaker, united kingdoms first pop music project, soft alabama dra...",0.0,0.4362,"work, company, employee, team, experience, client, business, ability, support, include, job, position, service, require, provide, ensure, need, management, process, lead"


In [221]:
add_topic_df[add_topic_df['Top_Topic_Assigned'] == 0].head().text


19     The Illinois Department of Insurance has adopted a new rule, 50 Ill. Adm. Code 630, Corporate Governance Annual Disclosure , effective May 29, 2020. The new rule will require corporate governance ...
86     Mayor Lori Lightfoot fired embattled Chicago police Supt. Eddie Johnson this morning announcing that the termination is effective immediately. The termination comes just weeks before Johnson was s...
104    Examine This Report on Milton Lee Olive Park of Chicago Examine This Report on Milton Lee Olive Park of Chicago Category: Blog Now I recognize that not Every person has an image ideal yard, but yo...
146    / Travel Feature: Illinois is Great For Touring & Tasting Wine Travel Feature: Illinois is Great For Touring & Tasting Wine May 2, 2020 By: Alison Blackman Like it? Share it! *Editor’s note: Just ...
203    Ex-Rock Hall boss Terry Stewart champions ‘Chicago Plays the Stones’ gig at Beachland Posted November 6, 2018 at 7:00 AM Guitarist Ronnie Baker Brooks and harp playe