# MSCA 32018 Natural Language Processing and Cognitive Computing
## Final Project - Topic Detection 
### Zero-shot (NLI) modeling based on Sentiment Analysis

Shijia Huang

-----

In [1]:
!pip install -r requirements.txt



In [2]:
# Import basic libraries
import os
import sys
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

%matplotlib inline

In [3]:
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

In [4]:
# Import NLP libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from pprint import pprint
import string
from rake_nltk import Rake

import spacy
from spacy import displacy
from spacy.util import minibatch, compounding
spacy.prefer_gpu()
print(spacy.__version__)

import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.models.ldamulticore import LdaMulticore
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim as gensimvis
#import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

import tensorflow as tf
from transformers import pipeline

2023-05-16 23:51:13.581978: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-05-16 23:51:13.582059: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64


3.5.3


In [5]:
import multiprocessing as mp

num_processors = mp.cpu_count()
print(f'Available CPUs: {num_processors}')

Available CPUs: 4


### Read New Articles with Sentiment Score

In [6]:
%%time

# GCP version
#path = "gs://nlp-final-project-data/data/"
#df_news = pd.read_parquet(path + 'news_sentiment.parquet', engine='pyarrow')

# Sagemaker version
df_news = pd.read_parquet('data_news_sentiment.parquet', engine='pyarrow')
df_news.shape

CPU times: user 1min 10s, sys: 11 s, total: 1min 21s
Wall time: 1min 11s


(154283, 11)

In [7]:
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154283 entries, 0 to 154282
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                154283 non-null  int64 
 1   date              154283 non-null  object
 2   cleaned title     154283 non-null  object
 3   cleaned text      154283 non-null  object
 4   title_tokens      154283 non-null  object
 5   title_lemmatized  154283 non-null  object
 6   text_tokens       154283 non-null  object
 7   text_lemmatized   154283 non-null  object
 8   title_keywords    154283 non-null  object
 9   text_keywords     154283 non-null  object
 10  sentiment         154283 non-null  int64 
dtypes: int64(2), object(9)
memory usage: 12.9+ MB


In [8]:
df_news.head(2)

Unnamed: 0,id,date,cleaned title,cleaned text,title_tokens,title_lemmatized,text_tokens,text_lemmatized,title_keywords,text_keywords,sentiment
0,1,2020-02-27,Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot,"Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot admin Latest posts by admin see all Mansplaining in conferences: How can we get him to forestall February 27, 2020 Coronavirus Could Explode in the U.S. Overnight Like it Did in Italy February 27, 2020 Levi Strauss marks the next phase in corporate paid leave policies February 27, 2020 Scientists who designed an artificially clever robotic that helped youngsters with autism spice up their ...","[children, autism, saw, learning, social, skills, boosted, playing, ai, robot, children, autism, saw, learning, social, skills, boosted, playing, ai, robot, children, autism, saw, learning, social, skills, boosted, playing, ai_robot]","[child, autism, see, learn, social, skill, boost, play, robot, child, autism, see, learn, social, skill, boost, play, robot, child, autism, see, learn, social, skill, boost, play, ai_robot]","[children, autism, saw, learning, social, skills, boosted, playing, ai, robot, admin, latest, posts, admin, see, mansplaining, conferences, get, forestall, february, coronavirus, could, explode, overnight, like, italy, february, levi, strauss, marks, next, phase, corporate, paid, leave, policies, february, scientists, designed, artificially, clever, robotic, helped, youngsters, autism, spice, studying, social, talents, hope, era, may, future, help, others, developmental, dysfunction, learn, ...","[child, autism, see, learn, social, skill, boost, play, robot, late, post, admin, see, mansplaining, conference, get, explode, overnight, mark, next, phase, corporate, pay, leave, policy, scientist, design, artificially, clever, robotic, help, youngster, autism, spice, study, social, talent, era, future, help, other, developmental, dysfunction, learn, notice, youngster, gentle, average, autism, take, domestic, s, refer, socially, assistive, robotic, name, kiwi, month, accord, commentary, way...","[social, skill, see, play, learn, child, boost, autism, robot, ai_robot]","[robotic, youngster, kid, child, kiwi, market, autism, learn, crew, talent]",5
1,2,2021-03-26,"Forget ML, AI and Industry 4.0 – obsolescence should be your focus","Forget ML, AI and Industry 4.0 obsolescence should be your focus The world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catchup trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority, t...","[forget, ml, ai, industry, obsolescence, focus, forget, ml, ai, industry, obsolescence, focus, forget, ml, ai, industry, obsolescence, focus]","[forget, ml, ai, industry, obsolescence, focus, forget, ml, ai, industry, obsolescence, focus, forget, ml, ai, industry, obsolescence, focus]","[forget, ml, ai, industry, obsolescence, focus, world, entered, new, era, accelerated, transformation, last, eighteen, months, continue, evolve, press, forward, years, come, businesses, playing, catchup, trying, make, sense, new, timeline, ten, years, set, aside, careful, planning, implementation, coming, next, longer, exists, next, happening, regardless, industry, seniority, status, quo, shifted, better, face, back, invited, attend, pompous, meeting, london, brazilian, embassy, along, selec...","[forget, ai, industry, obsolescence, focus, world, enter, new, era, accelerate, transformation, last, month, continue, evolve, press, forward, year, come, business, play, catchup, try, make, sense, new, timeline, year, set, aside, careful, planning, implementation, come, next, long, exist, next, happen, regardless, industry, seniority, status, quo, shift, well, face, back, invite, attend, pompous, meeting, brazilian, embassy, select, lead, name, oil, energy, industry, get, update, go, happen...","[obsolescence, ml, industry, forget, focus, ai]","[electronic, come, card, industry, repair, new, system, require, test, business]",4


In [9]:
### SAMPLE DATA
# df_news = df_news.sample(frac=0.01, random_state=42)
df_news.shape

(154283, 11)

# Select articles with positive and negative sentiment scores

- Positive: sentiment score > 3
- Negative: sentiment score < 3

In [10]:
# select positive sentiment articles
df_news_positive = df_news[df_news['sentiment'] > 3].reset_index(drop=True)
df_news_positive.shape

(104228, 11)

In [11]:
# print a sample
df_news_positive.sample(1)

Unnamed: 0,id,date,cleaned title,cleaned text,title_tokens,title_lemmatized,text_tokens,text_lemmatized,title_keywords,text_keywords,sentiment
73704,141701,2021-05-25,Eys3D raises $7M for AI edge sensor technology,"Eys3D raises 7M for AI edge sensor technology Start transforming your games economy and increase your bottom line. Get the free guide now. Eys3D Microelectronics, a company producing chips and sensors for autonomous operations including security, touchless control, autonomous vehicles, and smart retail, today announced that its raised 7 million in a series A round led by strategic partners Arm IoT Capital, WI Harper Group, and Marubun Corporation. The company says that the funding will allow...","[eys, raises, ai, edge, sensor, technology, eys_raises, ai, edge, sensor, technology, eys_raises, ai, edge, sensor, technology]","[raise, ai, edge, sensor, technology, eys_raise, ai, edge, sensor, technology, eys_raise, ai, edge, sensor, technology]","[eys, raises, ai, edge, sensor, technology, start, transforming, games, economy, increase, bottom, line, get, free, guide, eys, company, producing, chips, sensors, autonomous, operations, including, security, touchless, control, autonomous, vehicles, smart, retail, today, announced, raised, million, series, round, led, strategic, partners, arm, iot, capital, wi, harper, group, marubun, corporation, company, says, funding, allow, grow, product, development, research, efforts, including, depth...","[raise, ai, edge, sensor, technology, start, transform, game, economy, increase, bottom, line, get, free, guide, company, produce, chip, sensor, autonomous, operation, include, security, touchless, control, autonomous, vehicle, smart, retail, today, announce, raise, series, round, lead, strategic, partner, arm, group, say, funding, allow, grow, product, development, research, effort, include, depthsense, camera, sensor, integration, solution, enable, new, capability, object, recognition, dis...","[technology, sensor, edge, ai, eys_raise, raise]","[include, say, technology, sensor, company, market, ai, chip, partner, design]",5


In [12]:
# select negative sentiment articles
df_news_negative = df_news[df_news['sentiment'] < 3].reset_index(drop=True)
df_news_negative.shape

(14932, 11)

In [13]:
# print a sample
df_news_negative.sample(1)

Unnamed: 0,id,date,cleaned title,cleaned text,title_tokens,title_lemmatized,text_tokens,text_lemmatized,title_keywords,text_keywords,sentiment
8653,116637,2023-04-04,Is Artificial Intelligence Really the Road to a 'Terminator' Future For Humanity?,Is Artificial Intelligence Really the Road to a Terminator Future For Humanity Business 2 Community The Best VPNs For The USA In 2022 Unblock Content With a US VPN The Best App Builder in 2022 Top 15 NoCode App Builders Reviewed The Best Background Check Software Top 12 Reviewed for 2022 The 92 Best Small Business Ideas for 2022 Ranked by Category How to Remove Spyware Best Spy App Removal Tools in 2022 POS The Best POS Systems in 2022 Simplify Payments With POS Tools Reviews Promo C...,"[artificial, intelligence, really, road, terminator, future, humanity, artificial_intelligence, really, road, terminator, future, humanity, artificial_intelligence_really, road, terminator, future, humanity]","[artificial, intelligence, really, road, terminator, future, humanity, artificial_intelligence, really, road, terminator, future, humanity, artificial_intelligence_really, road, terminator, future, humanity]","[artificial, intelligence, really, road, terminator, future, humanity, business, community, best, vpns, usa, unblock, content, us, vpn, best, app, builder, top, nocode, app, builders, reviewed, best, background, check, software, top, reviewed, best, small, business, ideas, ranked, category, remove, spyware, best, spy, app, removal, tools, pos, best, pos, systems, simplify, payments, pos, tools, reviews, promo, codes, lucky, block, crypto, casino, sportsbook, review, best, altcoins, invest, n...","[artificial, intelligence, really, humanity, business, community, good, vpns, vpn, good, app, builder, app, builder, review, good, background, check, software, top, review, good, small, business, idea, rank, category, remove, spyware, good, spy, removal, tool, pos, good, pos, system, simplify, payment, pos, tool, review, promo, code, lucky, block, crypto, casino, sportsbook, review, good, altcoin, invest, new, altcoin, buy, artificial, intelligence, really, future, humanity, note, authorise,...","[terminator, road, humanity, future, really, intelligence, artificial_intelligence_really, artificial_intelligence, artificial]","[ai, human, agent, safety, good, constraint, information, technology, design, task]",2


## Topic Modeling - Zero-shot (NLI) modeling

Using candidate labels from LDA model with n=18 topics

#### Check for GPU presence

In [14]:
#Verify we got CPU + GPU or only CPU
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [15]:
!nvidia-smi

Tue May 16 23:52:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [16]:
tf.__version__

'2.11.1'

In [17]:
!pip install torch --upgrade



In [18]:
!pip install ktrain --upgrade



In [19]:
!python3 -m pip install tensorflow



In [20]:
import ktrain
import torch

### Modelling

In [21]:
zsl = ktrain.text.ZeroShotClassifier()

In [22]:
candidate_labels = ['digital transaction', 'healthcare', 'news platform', 'data analytics', 'insurance', 'investment', 'global market', 'autonomous car', 'customer experience', 'data science', 'cryptocurrency', 'camera', 'robot', 'chatgpt', 'image', 'voice', 'patient care', 'research']
len(candidate_labels)

18

## Positive Sentiment Articles

In [31]:
# Put reviews in a list
seq_pos = df_news_positive['cleaned text'].to_list()

# Set the hyppothesis template
hypothesis_template = "The topic of this news article is {}."

In [None]:
%%time

topic_pos = zsl.predict(seq_pos, 
                        labels=candidate_labels, 
                        include_labels=False, 
                        nli_template=hypothesis_template, 
                        batch_size=18)

In [33]:
pos_pred_df = pd.DataFrame(topic_pos, columns=candidate_labels) 
pos_pred_df.head()

# SAVE THE RESULTS
pos_pred_df.to_json('result/pos_pred.json', orient='records', lines=True)

In [34]:
pos_news_topics = df_news_positive.join(pos_pred_df, how='inner')
pos_news_topics = pos_news_topics[['id', 'cleaned text', 'sentiment'] + candidate_labels]

# SAVE THE RESULTS
pos_news_topics.to_json('result/pos_news_topics.json', orient='records', lines=True)

# Reset multi-level index
pos_news_topics.columns = pos_news_topics.columns.get_level_values(0)

# Select the small-ish articles only
pos_news_topics.head()

Unnamed: 0,id,cleaned text,sentiment,digital transaction,healthcare,news platform,data analytics,insurance,investment,global market,autonomous car,customer experience,data science,cryptocurrency,camera,robot,chatgpt,image,voice,patient care,research
0,1,"Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot admin Latest posts by admin see all Mansplaining in conferences: How can we get him to forestall February 27, 2020 Coronavirus Could Explode in the U.S. Overnight Like it Did in Italy February 27, 2020 Levi Strauss marks the next phase in corporate paid leave policies February 27, 2020 Scientists who designed an artificially clever robotic that helped youngsters with autism spice up their ...",5,0.126413,0.573291,0.747695,0.287535,0.100931,0.226929,0.178021,0.066381,0.378634,0.172105,0.040583,0.43677,0.932445,0.437214,0.537201,0.390813,0.396779,0.911051
1,2,"Forget ML, AI and Industry 4.0 obsolescence should be your focus The world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catchup trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority, t...",4,0.303175,0.008955,0.644975,0.231231,0.028623,0.726941,0.749578,0.043793,0.465443,0.213941,0.041343,0.096056,0.095282,0.413589,0.383721,0.159051,0.01475,0.383123
2,3,"Strategy Analytics: 71 of Smartphones Sold Globally in 2021 will be AI Powered BOSTONBUSINESS WIREStrategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that ondevice Artificial Intelligence AI is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report highlights...",5,0.254997,0.047431,0.805587,0.695647,0.047576,0.520027,0.957824,0.018728,0.896138,0.669529,0.031697,0.0379,0.105813,0.628416,0.671183,0.28314,0.124894,0.930628
3,9,"Artificial Intelligence In Behavioral And Mental Health Care Market to Witness Astonishing Growth by 2026 Focusing on Leading Players AdvancedMD , Cerner , Core Solutions , Credible Behavioral Health June 13, 2020 emailprotected Artificial Intelligence In Behavioral And Mental Health Care Market Artificial Intelligence in Behavioral and Mental Health Care Market research report is the new statistical data source added by Healthcare Intelligence Markets. It uses several approaches for anal...",5,0.170653,0.771679,0.380666,0.757577,0.101454,0.265526,0.80129,0.092516,0.340393,0.251144,0.035433,0.052589,0.050366,0.357945,0.306858,0.206618,0.184397,0.952538
4,10,"AI Machine Learning Market 2020 Expected to Reach XX Million by 2024 IBM, BAIDU, SOUNDHOUND, ZEBRA MEDICAL VISION, PRISMA, IRIS AI The Global AI Machine Learning Market report is aimed at highlighting a firsthand documentation of all the best practices in the AI Machine Learning industry that subsequently set the growth course active. These vital market oriented details are highly crucial to overcome cut throat competition and all the growth oriented practices typically embraced by frontlin...",5,0.383716,0.203133,0.605019,0.580184,0.168859,0.771888,0.933147,0.279561,0.621637,0.382832,0.112535,0.17296,0.1773,0.584303,0.402161,0.419516,0.225927,0.948273


In [None]:
pos_news_topics.info()

## Negetive Sentiment Articles

In [None]:
# Put reviews in a list
seq_neg = df_news_negative['cleaned text'].to_list()

# Set the hyppothesis template
hypothesis_template = "The topic of this news article is {}."

In [None]:
%%time

topic_neg = zsl.predict(seq_pos, 
                        labels=candidate_labels, 
                        include_labels=False, 
                        nli_template=hypothesis_template, 
                        batch_size = 18)

CPU times: user 2min 15s, sys: 1min 22s, total: 3min 38s
Wall time: 3min 27s


In [None]:
neg_pred_df = pd.DataFrame(topic_neg, columns=candidate_labels)
neg_pred_df.head()

# SAVE THE RESULTS
neg_pred_df.to_json('result/neg_pred.json', orient='records', lines=True)

In [None]:
neg_news_topics = df_news_negative.join(neg_pred_df, how='inner')
neg_news_topics = neg_news_topics[['id', 'cleaned text', 'sentiment'] + candidate_labels]

# SAVE THE RESULTS
neg_news_topics.to_json('result/neg_news_topics.json', orient='records', lines=True)

# Reset multi-level index
neg_news_topics.columns = neg_news_topics.columns.get_level_values(0)

# Select the small-ish articles only
neg_news_topics.head()

Unnamed: 0,id,cleaned text,sentiment,digital transaction,healthcare,news platform,data analytics,insurance,investment,global market,autonomous car,customer experience,data science,cryptocurrency,camera,robot,chatgpt,image,voice,patient care,research
0,5,"Cr Bard Inc Has Returned 48.9 Since SmarTrend Recommendation BCR SmarTrend identified an Uptrend for Cr Bard Inc :BCR on December 23rd, 2016 at 222.45. In approximately 40 months, Cr Bard Inc has returned 48.91 as of todays recent price of 331.24.In the past 52 weeks, Cr Bard Inc share prices have been bracketed by a low of 0.00 and a high of 0.00 and are now at 331.24, 100 above that low price. In the last five trading sessions, the 50day moving average MA has remained constant while the...",2,0.126413,0.573291,0.747695,0.287535,0.100931,0.226929,0.178021,0.066381,0.378634,0.172105,0.040583,0.43677,0.932445,0.437214,0.537201,0.390813,0.396779,0.911051
1,12,"Conversational AI Marketplace Enlargement Possibilities, Regional Traits and Call for, Most sensible Avid gamers, Alternatives with Forecasts 2025 International Conversational AI marketplace 2020 analysis document is a solitary instrument that provides an indepth scrutiny of various Conversational AI marketplace insights, alternatives, collateral approaches and more than a few techniques of creating robust determinations. The Conversational AI marketplace CAGR charge would possibly building...",2,0.303175,0.008955,0.644975,0.231231,0.028623,0.726941,0.749578,0.043793,0.465443,0.213941,0.041343,0.096056,0.095282,0.413589,0.383721,0.159051,0.01475,0.383123
2,23,"Walmart employees are out to show its antishoplifting AI doesnt work The retailer denies there is any widespread issue with the software. In January, my coworker received a peculiar email. The message, which she forwarded to me, was from a handful of corporate Walmart employees calling themselves the Concerned Home Office Associates. Walmarts headquarters in Bentonville, Arkansas, is often referred to as the Home Office. While its not unusual for journalists to receive anonymous tips, they d...",2,0.254997,0.047431,0.805587,0.695647,0.047576,0.520027,0.957824,0.018728,0.896138,0.669529,0.031697,0.0379,0.105813,0.628416,0.671183,0.28314,0.124894,0.930628
3,28,"Tesla CEO Elon Musk urges pause on AI systems, citing risks to society Elon Musk and a group of artificial intelligence experts and industry executives are calling for a sixmonth pause in developing systems more powerful than OpenAIs newly launched GPT4, in an open letter citing potential risks to society and humanity. The letter, issued by the nonprofit Future of Life Institute and signed by more than 1,000 people including Musk, called for a pause on advanced AI development until shared sa...",1,0.170653,0.771679,0.380666,0.757577,0.101454,0.265526,0.80129,0.092516,0.340393,0.251144,0.035433,0.052589,0.050366,0.357945,0.306858,0.206618,0.184397,0.952538
4,38,"LegalTech Artificial Intelligence Industry June 2021 Middle East and Africa Market Research Report 2020 Under COVID19 outbreak globally, this report provides 360 degrees of analysis from supply chain, import and export control to regional government policy and future influence on the industry. Detailed analysis about market status 20152020, enterprise competition pattern, advantages and disadvantages of enterprise products, industry development trends 20202025, regional industrial layout ch...",2,0.383716,0.203133,0.605019,0.580184,0.168859,0.771888,0.933147,0.279561,0.621637,0.382832,0.112535,0.17296,0.1773,0.584303,0.402161,0.419516,0.225927,0.948273
