# MSCA 32018 Natural Language Processing and Cognitive Computing
## Final Project - Preprocessing

Shijia Huang

-----

In [1]:
# Import basic libraries
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# Import NLP libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

from pprint import pprint
import string
from rake_nltk import Rake

import spacy
from spacy import displacy
from spacy.util import minibatch, compounding
spacy.prefer_gpu()
print(spacy.__version__)

import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.models.ldamulticore import LdaMulticore
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim as gensimvis
#import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

3.5.2


In [3]:
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

In [4]:
import multiprocessing as mp

num_processors = mp.cpu_count()
print(f'Available CPUs: {num_processors}')

Available CPUs: 12


### Read News Articles

In [5]:
%%time 

df_news_raw = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')
df_news_raw.shape

CPU times: user 6.9 s, sys: 4.74 s, total: 11.6 s
Wall time: 48.4 s


(200332, 5)

In [6]:
df_news_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200332 entries, 0 to 200331
Data columns (total 5 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   url       200332 non-null  object
 1   date      200332 non-null  object
 2   language  200332 non-null  object
 3   title     200332 non-null  object
 4   text      200332 non-null  object
dtypes: object(5)
memory usage: 7.6+ MB


In [7]:
df_news_raw.head()

Unnamed: 0,url,date,language,title,text
0,http://en.people.cn/n3/2021/0318/c90000-9830122.html,2021-03-18,en,Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online,"\n\nArtificial intelligence improves parking efficiency in Chinese cities - People's Daily Online\n\nHome\nChina Politics\nForeign Affairs\nOpinions\nVideo: We Are China\nBusiness\nMilitary\nWorld\nSociety\nCulture\nTravel\nScience\nSports\nPhoto\n\nLanguages\n\nChinese\nJapanese\nFrench\nSpanish\nRussian\nArabic\nKorean\nGerman\nPortuguese\nThursday, March 18, 2021\nHome>>\n\t\t\nArtificial intelligence improves parking efficiency in Chinese cities\nBy Liu Shiyao (People's Daily) 09:16, Mar..."
1,http://newsparliament.com/2020/02/27/children-with-autism-saw-their-learning-and-social-skills-boosted-after-playing-with-this-ai-robot/,2020-02-27,en,Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot – News Parliament,"\nChildren With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot – News Parliament\n \n\nSkip to content\n\t\t\tThursday, February 27, 2020\t\t\n\nLatest:\n\n\nMansplaining in conferences: How can we get him to forestall?\n\n\nDrax power station to cease burning coal in March 2021\n\n\nCoronavirus Could Explode in the U.S. Overnight Like it Did in Italy\n\n\nCoronavirus: Dettol sales surge as markets fall again\n\n\nLevi Strauss marks the next phase in cor..."
2,http://www.dataweek.co.za/12835r,2021-03-26,en,"Forget ML, AI and Industry 4.0 – obsolescence should be your focus - 26 February 2021 - Test & Rework Solutions - Dataweek","\n\nForget ML, AI and Industry 4.0 – obsolescence should be your focus - 26 February 2021 - Test & Rework Solutions - Dataweek\nHome\nAbout us\nBack issues / E-book / PDF\nEMP Handbook\nSubscribe\nAdvertise\n\nCategories\n\n▸ Editor's Choice\n▸ Multimedia, Videos\n▸ Analogue, Mixed Signal, LSI\n▸ Circuit & System Protection\n▸ Computer/Embedded Technology\n▸ Design Automation\n▸ DSP, Micros & Memory\n▸ Electronics Technology\n▸ Enclosures, Racks, Cabinets & Panel Products\n▸ Events\n▸ Interc..."
3,http://www.homeoffice.consumerelectronicsnet.com/strategy-analytics-71-of-smartphones-sold-globally-in-2021-will-be-ai-powered/,2021-03-10,en,Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered – Consumer Electronics Net,\n\nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered – Consumer Electronics Net\n \nSkip to content\n\nConsumer Electronics Net\n\nPrimary Menu\n\nConsumer Electronics Net\n\nSearch for:\n \nHomeNewsStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered \n \n News\n \n \nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\n 7 hours...
4,http://www.itbusinessnet.com/2020/10/olympus-to-support-endoscopic-ai-diagnosis-education-for-doctors-in-india-and-to-launch-ai-diagnostic-support-application/?utm_source=rss&utm_medium=rss&utm_campaign=olympus-to-support-endoscopic-ai-diagnosis-education-for-doctors-in-india-and-to-launch-ai-diagnostic-support-application,2020-10-20,en,Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application | | IT Business Net,\n\nOlympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application | | IT Business Net\n \nSkip to content\n\nIT Business Net\n\nNews for IT Professionals\nPrimary Menu\n\nIT Business Net\nAbout IT Business Net\n\nSearch for:\n \nHome2020OctoberOlympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application \n \n News\n ...


## Text Cleaning

- Filter out non-English articles
- Add id column
- Remove web crawling artifacts
- Remove URLs and links
- Remove punctuation and special characters

In [8]:
# Filter non-English articles
df_news = df_news_raw[df_news_raw['language'] == 'en'].reset_index(drop=True)

In [9]:
# Create id column
df_news['id'] = df_news.index

In [10]:
# Check for missing values
df_news.isnull().sum()

url         0
date        0
language    0
title       0
text        0
id          0
dtype: int64

### Clean News Titles

In [11]:
# define function to clean news titles
def clean_news_title(title):

    # keep only the text before the first separator found, which drops the
    # news source name appended at the end of the title
    split_chars = [' | ', ' - ', ' – ', ' — ']
    for char in split_chars:
        if char in title:
            title = title.split(char)[0]
            break

    return title
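The behaviour of this rule is easiest to see on a few titles. The sketch below repeats the function so it runs on its own. The first two titles appear in the dataset; the third is a made-up example. It shows a side effect: because everything after the first separator is dropped, a headline that itself contains ` - ` is shortened as well.

```python
def clean_news_title(title):
    # keep only the text before the first separator found
    split_chars = [' | ', ' - ', ' – ', ' — ']
    for char in split_chars:
        if char in title:
            title = title.split(char)[0]
            break
    return title


samples = [
    "Google unveils its ChatGPT rival: Bard | CNN Business",
    "Maureen Dowd: AI is not A-OK | Pittsburgh Post-Gazette",
    "Some headline - with a dash inside - Example News",  # hypothetical title
]
cleaned_samples = [clean_news_title(t) for t in samples]

for before, after in zip(samples, cleaned_samples):
    print(f"{before!r} -> {after!r}")
```

The hyphen in "A-OK" survives because the separators require surrounding spaces. If losing part of a headline matters, one possible alternative (not used in this notebook) is `title.rsplit(char, 1)[0]`, which removes only the last segment.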

In [12]:
%%time

# clean the news title
news_title_cleaned = np.array(df_news['title'].apply(clean_news_title))
df_news['cleaned title'] = news_title_cleaned

CPU times: user 206 ms, sys: 10.8 ms, total: 216 ms
Wall time: 215 ms


In [13]:
df_news[['title', 'cleaned title']].sample(5)

Unnamed: 0,title,cleaned title
104231,"SK hynix Develops PIM, Next-Generation AI Accelerator","SK hynix Develops PIM, Next-Generation AI Accelerator"
171930,"Hewlett Packard Enterprise Powers the University of Edinburgh’s International Data Facility with Software, HPC and AI Solutions | Business Wire","Hewlett Packard Enterprise Powers the University of Edinburgh’s International Data Facility with Software, HPC and AI Solutions"
185642,"SDAIA, Google Cloud to launch training program to empower women in AI sector","SDAIA, Google Cloud to launch training program to empower women in AI sector"
181261,Artists are unhappy with a man who submitted AI artwork to a contest,Artists are unhappy with a man who submitted AI artwork to a contest
195250,Portrait of Shakespeare said to be painted while Bard was alive goes on display - Independent.ie,Portrait of Shakespeare said to be painted while Bard was alive goes on display


In [14]:
# Select titles that stay the same after cleaning
df_uncleaned = df_news[df_news['title'] == df_news['cleaned title']]
df_uncleaned[['title', 'cleaned title']].sample(5)

Unnamed: 0,title,cleaned title
18629,Wharton Professor Who Tested ChatGPT Says AI Can Be an 'Amazing Tool',Wharton Professor Who Tested ChatGPT Says AI Can Be an 'Amazing Tool'
187265,OrigiMed announced its strategic partnership with Janssen to develop clinical innovative solutions driven by data science,OrigiMed announced its strategic partnership with Janssen to develop clinical innovative solutions driven by data science
55630,"Unmanned submarines, flying bikes, AI pilots proposed as future game changers","Unmanned submarines, flying bikes, AI pilots proposed as future game changers"
59151,Replicant Named 2022 Hot Vendor in Conversational AI by Aragon Research,Replicant Named 2022 Hot Vendor in Conversational AI by Aragon Research
148277,Infosys and Aramco Aim to Leverage AI to Create Digitally Connected Employee Experiences,Infosys and Aramco Aim to Leverage AI to Create Digitally Connected Employee Experiences


### Clean News Text

In [15]:
# check if any news title is empty after cleaning
len(df_news[df_news['cleaned title'] == ''])

0

#### Find Long Paragraphs

In [16]:
# find main paragraphs in news articles
def find_paragraphs(news_df):

    text = news_df['text']
    
    # strip leading and trailing whitespace from the title
    title = news_df['cleaned title'].strip()

    # replace 3 or more spaces with '\n'
    text = re.sub(' {3,}', '\n', text)

    # split text by '\n' or '\t'
    split_chars = '[\n\t]+'
    sentences = re.split(split_chars, text)

    # remove empty sentences
    sentences = [sentence for sentence in sentences if sentence != '']

    # keep sentences with more than 10 words, plus any line starting with the cleaned title
    sentences = [sentence for sentence in sentences if len(
        sentence.split()) > 10 or sentence.startswith(title)]

    # join sentences by '\n'
    paragraph = '\n'.join(sentences)

    return paragraph
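A small worked example may help show what this filter keeps. The row below is synthetic (it is not taken from the dataset) and is passed as a plain dict, which works because the function only indexes by column name. The function body is a copy of `find_paragraphs` so the block runs on its own.

```python
import re


def find_paragraphs(news_df):
    text = news_df['text']
    title = news_df['cleaned title'].strip()

    # runs of 3 or more spaces are treated as line breaks
    text = re.sub(' {3,}', '\n', text)

    # split on newlines and tabs, then drop empty pieces
    sentences = [s for s in re.split('[\n\t]+', text) if s != '']

    # keep long lines (more than 10 words) and lines starting with the title
    sentences = [s for s in sentences
                 if len(s.split()) > 10 or s.startswith(title)]

    return '\n'.join(sentences)


# synthetic row imitating a crawled page: menu items, title, body, share widget
row = {
    'cleaned title': 'AI improves parking',
    'text': ('\n\nAI improves parking - Example News\n\nHome\nWorld\n\t\t'
             'AI improves parking\n'
             'The new system has reduced the average time drivers spend '
             'looking for a parking space.\n'
             'Share   Tweet'),
}

paragraph = find_paragraphs(row)
print(paragraph)
```

The menu items (`Home`, `World`) and the share widget (`Share`, `Tweet`) are dropped because they are short, while both title lines and the body sentence are kept.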

In [17]:
%%time

# find paragraphs in news articles
news_paragraphs = df_news.apply(find_paragraphs, axis=1)
df_news['paragraph'] = news_paragraphs

CPU times: user 1min 20s, sys: 1.16 s, total: 1min 22s
Wall time: 1min 22s


In [18]:
df_news[['text', 'paragraph']].sample(5)

Unnamed: 0,text,paragraph
28939,"Vectorspace AI Releases Thematic Crypto Basket APIs for Exchanges\n\nSkip to contentCircle - Country Music & LifestyleAdvertise With UsTeacher TributeAsk The ExpertHealth UpdateWatch LiveNewsVideoVaccine TrackerWeatherSportsCommunityContestsAbout UsCoronavirusSearchHomeWatch East Texas NowWatch Live/Watch NewscastsBig Red BoxSee it, Snap it, Send itNewsNationalCrimeCoronavirusStateA Better East TexasHeroes FlightEast Texas Now7 InvestigatesPet ProjectSept 11thVideoWeatherLake LevelsPollen Ce...","Vectorspace AI Releases Thematic Crypto Basket APIs for Exchanges\nSkip to contentCircle - Country Music & LifestyleAdvertise With UsTeacher TributeAsk The ExpertHealth UpdateWatch LiveNewsVideoVaccine TrackerWeatherSportsCommunityContestsAbout UsCoronavirusSearchHomeWatch East Texas NowWatch Live/Watch NewscastsBig Red BoxSee it, Snap it, Send itNewsNationalCrimeCoronavirusStateA Better East TexasHeroes FlightEast Texas Now7 InvestigatesPet ProjectSept 11thVideoWeatherLake LevelsPollen Cent..."
169653,"\n\nArtificial Intelligence in Energy Market Research Report, Size, Share, Industry Outlook – 2020-2028 – The Bisouv Network\n\nSkip to the content\n\n \nSearch\nThe Bisouv Network\n\n \nMenu\n\nEnergy\nEntertainment\nFashion\nPolitics\nSports\nAll News\nWorld\nContact\n\n Search\nSearch for:\nClose search\n \nClose Menu\n \n\n\nEnergy\nEntertainment\nFashion\nPolitics\nSports\nAll News\nWorld\nContact\nCategories\n\nAll News \n\nArtificial Intelligence in Energy Market Research Report, Size...","Artificial Intelligence in Energy Market Research Report, Size, Share, Industry Outlook – 2020-2028 – The Bisouv Network\nArtificial Intelligence in Energy Market Research Report, Size, Share, Industry Outlook – 2020-2028\nThe global Artificial Intelligence in energy market is expected to reach a market size of USD 20.83 Billion at a steady CAGR of 23.6% in 2028, according to latest analysis by Emergen Research. Adoption of Artificial Intelligence solutions among oilfield services providers ..."
196969,\nGoogle unveils its ChatGPT rival: Bard | CNN Business\n\nCNN values your feedback\n\n 1. How relevant is this ad to you?\n \n\n 2. Did you encounter any technical issues?\n \n Video player was slow to load content\n ...,"Google unveils its ChatGPT rival: Bard | CNN Business\nVideo: This tiny shape-shifting robot can melt its way out of a cage\nVideo: This tiny shape-shifting robot can melt its way out of a cage\nHear why this teacher says schools should embrace ChatGPT, not ban it\n'Make my dad famous': A daughter's quest to showcase her dad's artwork\nHe loves artificial intelligence. Hear why he is issuing a warning about ChatGPT\nTinder is reportedly testing a $500 per month subscription plan. Is it worth..."
41131,"AIRS Medical showcases award-winning MRI enhancement AI solution at RSNA 2022\n\nSkip to contentFirst Alert WeatherShare Your HolidaysNBC15 InvestigatesLatest NewsNewscastsConnectedHomeSubmit a News TipSubmit Photos, VideoFirst Alert WeatherClosingsDownload AppInteractive RadarMap RoomWeather CamsWeather HeadlinesNewsLocalStateRegionalNationalCrimeMaking A DifferenceCoronavirusCOVID-19 NewsVaccine TrackerCOVID-19 MapNavigating SchoolNBC15 InvestigatesNBC15 InvestigatesInvestigate TVShare You...","AIRS Medical showcases award-winning MRI enhancement AI solution at RSNA 2022\nSkip to contentFirst Alert WeatherShare Your HolidaysNBC15 InvestigatesLatest NewsNewscastsConnectedHomeSubmit a News TipSubmit Photos, VideoFirst Alert WeatherClosingsDownload AppInteractive RadarMap RoomWeather CamsWeather HeadlinesNewsLocalStateRegionalNationalCrimeMaking A DifferenceCoronavirusCOVID-19 NewsVaccine TrackerCOVID-19 MapNavigating SchoolNBC15 InvestigatesNBC15 InvestigatesInvestigate TVShare Your ..."
77576,The third Wireless Communication AI Competition has been launched by IMT-2020(5G) Promotion Group\n\nSkip to contentSky CamsBusiness PartnersCommunity CalendarTop TeacherLive HealthyBooks to KidsLiveNewsWeatherSportsMorning BreakInvestigatesCommunityHomeProgramming ScheduleDownload Our AppsSubmit Photos & VideosNewsHealthBack 2 SchoolCoronavirusTracking the VaccineCrimeMurdaugh CaseLowcountry NewsWTOC InvestigatesEducationElections CenterTrafficGas PricesNationalFirst Alert WeatherHeadlinesS...,The third Wireless Communication AI Competition has been launched by IMT-2020(5G) Promotion Group\nSkip to contentSky CamsBusiness PartnersCommunity CalendarTop TeacherLive HealthyBooks to KidsLiveNewsWeatherSportsMorning BreakInvestigatesCommunityHomeProgramming ScheduleDownload Our AppsSubmit Photos & VideosNewsHealthBack 2 SchoolCoronavirusTracking the VaccineCrimeMurdaugh CaseLowcountry NewsWTOC InvestigatesEducationElections CenterTrafficGas PricesNationalFirst Alert WeatherHeadlinesSky...


In [19]:
# check if any news article paragraph is empty after cleaning
df_news[df_news['paragraph'] == ''].head()

Unnamed: 0,url,date,language,title,text,id,cleaned title,paragraph
55364,https://www.post-gazette.com/opinion/Op-Ed/2021/11/03/Maureen-Dowd-AI-is-not-A-OK/stories/202111030018,2021-11-03,en,Maureen Dowd: AI is not A-OK | Pittsburgh Post-Gazette,\n Maureen Dowd: AI is not A-OK | Pittsburgh Post-Gazette\n\n \n \n12:09AM\nObituaries\nPGe\n\nPG Store\nArchives\n\nClassifieds \n\nClassified\nEvents\nJobs\nReal Estate\nLegal Notices\nPets\n\nMENU\n\nSUBSCRIBE\n\n\nLOGIN\n\n\nREGISTER\n\n\nLOG OUT\n\n\nMY PROFILE\n \n\nHome\nNews\nLocal\nSports\nOpinion\nA&E\nLife\nBusiness\nContact Us\n\n\n \nNEWSLETTERS\n \n \nMENU\nACCOUNT\n\nSubscribe\n\n\nLogin\n\n\nRegister\n\n\nLog out\n\n\nMy Profile\n\n\nSubscriber Services\n\nSearch\nSECTIONS\n...,55364,Maureen Dowd: AI is not A-OK,


In [20]:
len(df_news[df_news['paragraph'] == ''])

1

#### Find Main Text

In [21]:
# function to extract news main text by searching the location of news title (cleaned)
def extract_news_text(news_df):
    paragraph = news_df['paragraph']
    title = news_df['cleaned title']

    try:
        # split paragraph by title
        texts = paragraph.split(title)

        # get the text section with the longest length
        text = max(texts, key=len)

        # append the cleaned title back to the text due to the split
        text = title + text

    except ValueError:
        # str.split raises ValueError when the title is an empty string
        text = paragraph

    return text
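The core of this step is splitting on the title and keeping the longest piece. The sketch below walks through it on a synthetic paragraph (not from the dataset) and also shows the one case where `str.split` fails, an empty separator.

```python
title = 'AI improves parking'
paragraph = (
    'AI improves parking - Example News\n'
    'AI improves parking\n'
    'The new system has reduced the average time drivers spend '
    'looking for a parking space.'
)

# splitting on the title yields the text before, between and after each occurrence
sections = paragraph.split(title)

# the longest section is assumed to be the article body; the title is
# prepended again because split() removed it
main_text = title + max(sections, key=len)
print(main_text)

# an empty title cannot be used as a separator
try:
    paragraph.split('')
    empty_title_error = None
except ValueError as err:
    empty_title_error = str(err)
print(empty_title_error)
```

One consequence of this heuristic is that if the title string is repeated inside the article body, the body is cut at that point and only its longest part is kept.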

In [22]:
# function to remove duplicate sentences in news main text
def remove_duplicate_sentences(text):

    # split text by '\n'
    sentences = text.split('\n')

    # remove duplicate sentences while preserving the order
    sentences = list(dict.fromkeys(sentences))

    # join sentences by '\n'
    text = '\n'.join(sentences)

    return text
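As a brief illustration of the `dict.fromkeys` idiom used here: dictionaries keep insertion order (guaranteed since Python 3.7), so turning the lines into keys drops later repeats while the first occurrence stays in place. The input below is a made-up example. The match is exact, so lines that differ only in whitespace are treated as different.

```python
def remove_duplicate_sentences(text):
    # dict keys are unique and keep insertion order
    sentences = list(dict.fromkeys(text.split('\n')))
    return '\n'.join(sentences)


deduped = remove_duplicate_sentences('A\nB\nA\nC\nB')
print(repr(deduped))

# a trailing space makes the second line a different key, so it is kept
near_duplicate = remove_duplicate_sentences('A\nA ')
print(repr(near_duplicate))
```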

In [23]:
%%time

# extract news text
news_text = df_news.apply(extract_news_text, axis=1)
df_news['main text'] = news_text

CPU times: user 5.26 s, sys: 4.61 s, total: 9.86 s
Wall time: 12.9 s


In [24]:
%%time

# remove duplicate sentences in news text
news_text_cleaned = np.array(df_news['main text'].apply(remove_duplicate_sentences))
df_news['main text nodup'] = news_text_cleaned

CPU times: user 4.72 s, sys: 4.19 s, total: 8.91 s
Wall time: 12.4 s


In [25]:
df_news[['text', 'paragraph', 'main text', 'main text nodup']].head()

Unnamed: 0,text,paragraph,main text,main text nodup
0,"\n\nArtificial intelligence improves parking efficiency in Chinese cities - People's Daily Online\n\nHome\nChina Politics\nForeign Affairs\nOpinions\nVideo: We Are China\nBusiness\nMilitary\nWorld\nSociety\nCulture\nTravel\nScience\nSports\nPhoto\n\nLanguages\n\nChinese\nJapanese\nFrench\nSpanish\nRussian\nArabic\nKorean\nGerman\nPortuguese\nThursday, March 18, 2021\nHome>>\n\t\t\nArtificial intelligence improves parking efficiency in Chinese cities\nBy Liu Shiyao (People's Daily) 09:16, Mar...","Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online\nArtificial intelligence improves parking efficiency in Chinese cities\nPhoto taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online/Li Wenming)\nThanks to the applic...","Artificial intelligence improves parking efficiency in Chinese cities\nPhoto taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online/Li Wenming)\nThanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) sy...","Artificial intelligence improves parking efficiency in Chinese cities\nPhoto taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online/Li Wenming)\nThanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) sy..."
1,"\nChildren With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot – News Parliament\n \n\nSkip to content\n\t\t\tThursday, February 27, 2020\t\t\n\nLatest:\n\n\nMansplaining in conferences: How can we get him to forestall?\n\n\nDrax power station to cease burning coal in March 2021\n\n\nCoronavirus Could Explode in the U.S. Overnight Like it Did in Italy\n\n\nCoronavirus: Dettol sales surge as markets fall again\n\n\nLevi Strauss marks the next phase in cor...","Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot – News Parliament\nCoronavirus Could Explode in the U.S. Overnight Like it Did in Italy\nLevi Strauss marks the next phase in corporate paid leave policies\nChildren With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot\nadmin Latest posts by admin (see all) Mansplaining in conferences: How can we get him to forestall? - February 27, 2020\nCoronavirus Could...","Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot\nadmin Latest posts by admin (see all) Mansplaining in conferences: How can we get him to forestall? - February 27, 2020\nCoronavirus Could Explode in the U.S. Overnight Like it Did in Italy - February 27, 2020\nLevi Strauss marks the next phase in corporate paid leave policies - February 27, 2020\nScientists who designed an artificially clever robotic that helped youngsters with autism spice...","Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot\nadmin Latest posts by admin (see all) Mansplaining in conferences: How can we get him to forestall? - February 27, 2020\nCoronavirus Could Explode in the U.S. Overnight Like it Did in Italy - February 27, 2020\nLevi Strauss marks the next phase in corporate paid leave policies - February 27, 2020\nScientists who designed an artificially clever robotic that helped youngsters with autism spice..."
2,"\n\nForget ML, AI and Industry 4.0 – obsolescence should be your focus - 26 February 2021 - Test & Rework Solutions - Dataweek\nHome\nAbout us\nBack issues / E-book / PDF\nEMP Handbook\nSubscribe\nAdvertise\n\nCategories\n\n▸ Editor's Choice\n▸ Multimedia, Videos\n▸ Analogue, Mixed Signal, LSI\n▸ Circuit & System Protection\n▸ Computer/Embedded Technology\n▸ Design Automation\n▸ DSP, Micros & Memory\n▸ Electronics Technology\n▸ Enclosures, Racks, Cabinets & Panel Products\n▸ Events\n▸ Interc...","Forget ML, AI and Industry 4.0 – obsolescence should be your focus - 26 February 2021 - Test & Rework Solutions - Dataweek\nForget ML, AI and Industry 4.0 – obsolescence should be your focus\nThe world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catch-up trying to make sense of a new timeline where the ten years that had been set aside for careful planning and impleme...","Forget ML, AI and Industry 4.0 – obsolescence should be your focus\nThe world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catch-up trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority...","Forget ML, AI and Industry 4.0 – obsolescence should be your focus\nThe world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catch-up trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority..."
3,\n\nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered – Consumer Electronics Net\n \nSkip to content\n\nConsumer Electronics Net\n\nPrimary Menu\n\nConsumer Electronics Net\n\nSearch for:\n \nHomeNewsStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered \n \n News\n \n \nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\n 7 hours...,"Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered – Consumer Electronics Net\nHomeNewsStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered \nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\nBOSTON–(BUSINESS WIRE)–Strategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that on-device Artificial Intelligence (AI) is being rapidly impl...","Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\nBOSTON–(BUSINESS WIRE)–Strategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that on-device Artificial Intelligence (AI) is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report h...","Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\nBOSTON–(BUSINESS WIRE)–Strategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that on-device Artificial Intelligence (AI) is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report h..."
4,\n\nOlympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application | | IT Business Net\n \nSkip to content\n\nIT Business Net\n\nNews for IT Professionals\nPrimary Menu\n\nIT Business Net\nAbout IT Business Net\n\nSearch for:\n \nHome2020OctoberOlympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application \n \n News\n ...,"Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application | | IT Business Net\nHome2020OctoberOlympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application \nOlympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application\nTOKYO, Oct 20, 2020 – (ACN Newswire) – Olympus Corporation took part in a ground-breaking ...","Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application\nTOKYO, Oct 20, 2020 – (ACN Newswire) – Olympus Corporation took part in a ground-breaking project as a business promoter, in cooperation with the Ministry of Internal Affairs and Communications (MIC), entitled, “Survey Study for International Expansion of AI Diagnosis Support System Using Ultra-High Magnifying Endoscopes in India.” The project aims to develop advanced e...","Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application\nTOKYO, Oct 20, 2020 – (ACN Newswire) – Olympus Corporation took part in a ground-breaking project as a business promoter, in cooperation with the Ministry of Internal Affairs and Communications (MIC), entitled, “Survey Study for International Expansion of AI Diagnosis Support System Using Ultra-High Magnifying Endoscopes in India.” The project aims to develop advanced e..."


In [26]:
# check if any news text is empty after extraction
df_news[df_news['main text'] == df_news['cleaned title']].sample(5)

Unnamed: 0,url,date,language,title,text,id,cleaned title,paragraph,main text,main text nodup
43396,https://www.todayonline.com/world/chinese-ai-startup-sensetime-file-hong-kong-ipo-end-august-sources,2021-08-19,en,Chinese AI startup SenseTime to file for Hong Kong IPO by end-August -sources,\n\n\nChinese AI startup SenseTime to file for Hong Kong IPO by end-August -sources\n\n,43396,Chinese AI startup SenseTime to file for Hong Kong IPO by end-August -sources,Chinese AI startup SenseTime to file for Hong Kong IPO by end-August -sources,Chinese AI startup SenseTime to file for Hong Kong IPO by end-August -sources,Chinese AI startup SenseTime to file for Hong Kong IPO by end-August -sources
163009,https://www.alpenhornnews.com/artificial-intelligence-speech-recognition-system-market-73030/,2022-09-28,en,Artificial Intelligence Speech Recognition System Market Growth By Top Companies with Forecast 2028,"\nArtificial Intelligence Speech Recognition System Market Growth By Top Companies with Forecast 2028 \nContact Us\n\n\nAbout Us\n\nSeptember 28, 2022\t\t\t\t\t\t\t\n\nBusiness\n\n\nHealth\n\n\nScience\n\n\nTechnology\n\n\nWorld\n\nBusiness\n\n\nHealth\n\n\nScience\n\n\nTechnology\n\n\nWorld\nBusiness\nHealth\nScience\nTechnology\nWorld\n",163009,Artificial Intelligence Speech Recognition System Market Growth By Top Companies with Forecast 2028,Artificial Intelligence Speech Recognition System Market Growth By Top Companies with Forecast 2028,Artificial Intelligence Speech Recognition System Market Growth By Top Companies with Forecast 2028,Artificial Intelligence Speech Recognition System Market Growth By Top Companies with Forecast 2028
114046,https://www.todayonline.com/world/russia-bans-us-ngo-bard-college,2021-06-21,en,Russia bans U.S. NGO Bard College,\n\n\nRussia bans U.S. NGO Bard College\n,114046,Russia bans U.S. NGO Bard College,Russia bans U.S. NGO Bard College,Russia bans U.S. NGO Bard College,Russia bans U.S. NGO Bard College
20601,https://www.alpenhornnews.com/machine-learning-software-market-57352/,2022-09-04,en,Machine Learning Software Market to Grow with Sustainable CAGR During 2022 â?? 2028,"\nMachine Learning Software Market to Grow with Sustainable CAGR During 2022 â?? 2028 \nContact Us\n\n\nAbout Us\n\nSeptember 04, 2022\t\t\t\t\t\t\t\n\nBusiness\n\n\nHealth\n\n\nScience\n\n\nTechnology\n\n\nWorld\n\nBusiness\n\n\nHealth\n\n\nScience\n\n\nTechnology\n\n\nWorld\nBusiness\nHealth\nScience\nTechnology\nWorld\n",20601,Machine Learning Software Market to Grow with Sustainable CAGR During 2022 â?? 2028,Machine Learning Software Market to Grow with Sustainable CAGR During 2022 â?? 2028,Machine Learning Software Market to Grow with Sustainable CAGR During 2022 â?? 2028,Machine Learning Software Market to Grow with Sustainable CAGR During 2022 â?? 2028
115555,https://www.alpenhornnews.com/tiny-machine-learning-tinyml-market-38537/,2022-07-30,en,"Tiny Machine Learning (TinyML) Market Development, Growth, Trends, Demand, Share, Analysis and Forecast 2028","\nTiny Machine Learning (TinyML) Market Development, Growth, Trends, Demand, Share, Analysis and Forecast 2028 \nContact Us\n\n\nAbout Us\n\nJuly 30, 2022\t\t\t\t\t\t\t\n\nBusiness\n\n\nHealth\n\n\nScience\n\n\nTechnology\n\n\nWorld\n\nBusiness\n\n\nHealth\n\n\nScience\n\n\nTechnology\n\n\nWorld\nBusiness\nHealth\nScience\nTechnology\nWorld\n\n",115555,"Tiny Machine Learning (TinyML) Market Development, Growth, Trends, Demand, Share, Analysis and Forecast 2028","Tiny Machine Learning (TinyML) Market Development, Growth, Trends, Demand, Share, Analysis and Forecast 2028","Tiny Machine Learning (TinyML) Market Development, Growth, Trends, Demand, Share, Analysis and Forecast 2028","Tiny Machine Learning (TinyML) Market Development, Growth, Trends, Demand, Share, Analysis and Forecast 2028"


In [27]:
len(df_news[df_news['main text'] == df_news['cleaned title']])

289

In [28]:
# count articles whose main text changed after removing duplicate sentences
len(df_news[df_news['main text nodup'] != df_news['main text']])

28198

In [29]:
# drop news articles with empty main text (except for cleaned title)
df_news_noemp = df_news[df_news['main text'] != df_news['cleaned title']].reset_index(drop=True)

In [30]:
df_news_noemp.shape

(200043, 10)

In [31]:
df_news_noemp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200043 entries, 0 to 200042
Data columns (total 10 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   url              200043 non-null  object
 1   date             200043 non-null  object
 2   language         200043 non-null  object
 3   title            200043 non-null  object
 4   text             200043 non-null  object
 5   id               200043 non-null  int64 
 6   cleaned title    200043 non-null  object
 7   paragraph        200043 non-null  object
 8   main text        200043 non-null  object
 9   main text nodup  200043 non-null  object
dtypes: int64(1), object(9)
memory usage: 15.3+ MB


#### Clean Main Text

In [32]:
# function to remove links, emails, and URLs
def remove_links(text):

    # remove http/https links
    text = re.sub(r'http\S+', '', str(text))

    # remove email addresses (any whitespace-delimited token containing '@')
    text = re.sub(r'\S*@\S*\s?', '', text)

    # remove URLs written with www but without a scheme
    text = re.sub(r'www\S+', '', text)

    return text
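
A quick way to see what `remove_links` does is to run it on a single string. The sketch below repeats the function so it is self-contained; the sample sentence and its addresses are invented for illustration.

```python
import re

def remove_links(text):
    # same three substitutions as in the cell above
    text = re.sub(r'http\S+', '', str(text))
    text = re.sub(r'\S*@\S*\s?', '', str(text))
    text = re.sub(r'www\S+', '', str(text))
    return text

# Hypothetical sample containing an http link, a www address, and an email
sample = ('Read more at https://example.com/ai or www.example.org today. '
          'Contact editor@example.com for details.')
cleaned = remove_links(sample)
print(repr(cleaned))
```

Each removed token leaves its surrounding whitespace behind, so runs of two spaces can remain in the result; they could be collapsed in a later step if that matters downstream.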

In [33]:
# function to remove special characters
def remove_spc_char(text):

    # remove newline characters
    text = re.sub(r'\n+', ' ', text)

    # remove tab characters
    text = re.sub(r'\t+', ' ', text)

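    # remove special characters (keeps letters, digits, spaces, and @ . , : _)
    # note: ' - ' inside the class is parsed as a range from space to space,
    # so hyphens are removed as well (e.g. "AI-empowered" becomes "AIempowered")
    text = re.sub(r'[^a-zA-Z0-9 @ . , : - _]', '', str(text))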

    return text
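
The character class in `remove_spc_char` is worth a closer look, because a hyphen between two characters inside `[...]` denotes a range. The sketch below compares the pattern as written with a hypothetical variant (`hyphen_kept`, not part of the notebook) in which the hyphen is placed last and is therefore literal. The sample string is made up.

```python
import re

# Pattern as written in remove_spc_char: ' - ' is a range from space to space
as_written = r'[^a-zA-Z0-9 @ . , : - _]'

# Hypothetical variant: a trailing '-' inside the class is a literal hyphen
hyphen_kept = r'[^a-zA-Z0-9 @.,:_-]'

sample = 'AI-empowered (ETC) system, on-device_v2: 71%'

out_as_written = re.sub(as_written, '', sample)
out_hyphen_kept = re.sub(hyphen_kept, '', sample)

print(out_as_written)
print(out_hyphen_kept)
```

With the pattern as written, hyphenated words are fused ("AIempowered", "ondevice"), which matches the cleaned text shown in this notebook. Whether that is desirable depends on the downstream tokenizer; the variant is only one possible alternative.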


In [34]:
# remove links, emails, and URLs, then special characters
# (the second step runs on the output of the first, so both are applied)
news_text_cleaned = df_news_noemp['main text nodup'].apply(remove_links)
news_text_cleaned = np.array(news_text_cleaned.apply(remove_spc_char))
df_news_noemp['cleaned text'] = news_text_cleaned

In [35]:
df_news_noemp[['main text nodup', 'cleaned text']].head()

Unnamed: 0,main text nodup,cleaned text
0,"Artificial intelligence improves parking efficiency in Chinese cities\nPhoto taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online/Li Wenming)\nThanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) sy...","Artificial intelligence improves parking efficiency in Chinese cities Photo taken on July 1, 2019, shows a sign for electronic toll collection ETC newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. Peoples Daily OnlineLi Wenming Thanks to the application of an artificial intelligence AIempowered roadside electronic toll collection ETC system, Chinas ..."
1,"Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot\nadmin Latest posts by admin (see all) Mansplaining in conferences: How can we get him to forestall? - February 27, 2020\nCoronavirus Could Explode in the U.S. Overnight Like it Did in Italy - February 27, 2020\nLevi Strauss marks the next phase in corporate paid leave policies - February 27, 2020\nScientists who designed an artificially clever robotic that helped youngsters with autism spice...","Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot admin Latest posts by admin see all Mansplaining in conferences: How can we get him to forestall February 27, 2020 Coronavirus Could Explode in the U.S. Overnight Like it Did in Italy February 27, 2020 Levi Strauss marks the next phase in corporate paid leave policies February 27, 2020 Scientists who designed an artificially clever robotic that helped youngsters with autism spice up their ..."
2,"Forget ML, AI and Industry 4.0 – obsolescence should be your focus\nThe world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catch-up trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority...","Forget ML, AI and Industry 4.0 obsolescence should be your focus The world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catchup trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority, t..."
3,"Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\nBOSTON–(BUSINESS WIRE)–Strategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that on-device Artificial Intelligence (AI) is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report h...","Strategy Analytics: 71 of Smartphones Sold Globally in 2021 will be AI Powered BOSTONBUSINESS WIREStrategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that ondevice Artificial Intelligence AI is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report highlights..."
4,"Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application\nTOKYO, Oct 20, 2020 – (ACN Newswire) – Olympus Corporation took part in a ground-breaking project as a business promoter, in cooperation with the Ministry of Internal Affairs and Communications (MIC), entitled, “Survey Study for International Expansion of AI Diagnosis Support System Using Ultra-High Magnifying Endoscopes in India.” The project aims to develop advanced e...","Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application TOKYO, Oct 20, 2020 ACN Newswire Olympus Corporation took part in a groundbreaking project as a business promoter, in cooperation with the Ministry of Internal Affairs and Communications MIC, entitled, Survey Study for International Expansion of AI Diagnosis Support System Using UltraHigh Magnifying Endoscopes in India. The project aims to develop advanced endoscopy di..."


In [36]:
df_news_noemp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200043 entries, 0 to 200042
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   url              200043 non-null  object
 1   date             200043 non-null  object
 2   language         200043 non-null  object
 3   title            200043 non-null  object
 4   text             200043 non-null  object
 5   id               200043 non-null  int64 
 6   cleaned title    200043 non-null  object
 7   paragraph        200043 non-null  object
 8   main text        200043 non-null  object
 9   main text nodup  200043 non-null  object
 10  cleaned text     200043 non-null  object
dtypes: int64(1), object(10)
memory usage: 16.8+ MB


In [37]:
# check if any news text is empty after cleaning
len(df_news_noemp[df_news_noemp['cleaned text'] == ''])

0
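
The check above only matches strings that are exactly empty. A slightly stricter variant, sketched here on a hypothetical toy series, also counts entries that contain nothing but whitespace, which can occur once newlines and tabs have been replaced by spaces.

```python
import pandas as pd

# Hypothetical toy values: one real text, one empty, two whitespace-only
toy = pd.Series(['Some cleaned text', '', '   ', ' \t'], name='cleaned text')

# Exact-empty check, as in the cell above
n_empty = int((toy == '').sum())

# Stricter check: strip whitespace first, then compare with ''
n_blank = int(toy.str.strip().eq('').sum())

print(n_empty, n_blank)
```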

## Show an Example

In [38]:
k = 3333

# original title
df_news_noemp['title'][k]

'Study finds artificial intelligence may decrease frequency of adverse drug events'

In [39]:
# cleaned title
df_news_noemp['cleaned title'][k].strip()

'Study finds artificial intelligence may decrease frequency of adverse drug events'

In [40]:
# print the original news text
print(df_news_noemp['text'][k])


Study finds artificial intelligence may decrease frequency of adverse drug events

Fri, Jan 21, 2022 |Updated 14:01 IST


Toggle navigation
Toggle navigation
             
               National
             


General News


Politics


Features
             
               World
             


Asia


US


Europe


Pacific


Others


Middle East
             
               Business
             


Corporate
             
               Sports
             


Cricket


Football


Others


Tennis


Hockey
             
               Lifestyle
             


Relationships


Sexuality


Beauty


Parenting


Fashion


Food


Travel


Quirky


Fitness


Culture
             
               Entertainment
             


Bollywood


Hollywood


Music


Out of box
             
               Health
             
             
               Science
             
             
               Tech
             


Mobile


Internet


Computers


Others
             
               Environme

In [41]:
# print the paragraphs extracted from the news text
print(df_news_noemp['paragraph'][k])

Study finds artificial intelligence may decrease frequency of adverse drug events
Study finds artificial intelligence may decrease frequency of adverse drug events
Washington [US], January 21 (ANI): A new study has found that artificial intelligence (AI) could be harnessed to prevent or mitigate the effects of adverse drug events (ADEs).The study has been published in 'The Lancet Digital Health Journal'.The review's authors described the use of AI to reduce the frequency of ADEs as an emerging area of study and identify several use cases in which AI could contribute to reducing or preventing ADEs. Furthermore, genetic information is thought to be critical in improving the performance of AI algorithms.  With the prevalence of genotyping, researchers are confident that this type of data can be more accessible over time and ultimately used to improve AI algorithm functioning and patient health."One of our challenges is how to identify and select the most relevant genetic variables among l

In [42]:
# print the extracted news main text
print(df_news_noemp['main text'][k])

Study finds artificial intelligence may decrease frequency of adverse drug events
Washington [US], January 21 (ANI): A new study has found that artificial intelligence (AI) could be harnessed to prevent or mitigate the effects of adverse drug events (ADEs).The study has been published in 'The Lancet Digital Health Journal'.The review's authors described the use of AI to reduce the frequency of ADEs as an emerging area of study and identify several use cases in which AI could contribute to reducing or preventing ADEs. Furthermore, genetic information is thought to be critical in improving the performance of AI algorithms.  With the prevalence of genotyping, researchers are confident that this type of data can be more accessible over time and ultimately used to improve AI algorithm functioning and patient health."One of our challenges is how to identify and select the most relevant genetic variables among large amounts of genetic profile information," said lead author Ania Syrowatka, PhD

In [43]:
# print the extracted news main text without duplicate sentences
print(df_news_noemp['main text nodup'][k])

Study finds artificial intelligence may decrease frequency of adverse drug events
Washington [US], January 21 (ANI): A new study has found that artificial intelligence (AI) could be harnessed to prevent or mitigate the effects of adverse drug events (ADEs).The study has been published in 'The Lancet Digital Health Journal'.The review's authors described the use of AI to reduce the frequency of ADEs as an emerging area of study and identify several use cases in which AI could contribute to reducing or preventing ADEs. Furthermore, genetic information is thought to be critical in improving the performance of AI algorithms.  With the prevalence of genotyping, researchers are confident that this type of data can be more accessible over time and ultimately used to improve AI algorithm functioning and patient health."One of our challenges is how to identify and select the most relevant genetic variables among large amounts of genetic profile information," said lead author Ania Syrowatka, PhD

In [44]:
# print the cleaned news main text
print(df_news_noemp['cleaned text'][k])

Study finds artificial intelligence may decrease frequency of adverse drug events Washington US, January 21 ANI: A new study has found that artificial intelligence AI could be harnessed to prevent or mitigate the effects of adverse drug events ADEs.The study has been published in The Lancet Digital Health Journal.The reviews authors described the use of AI to reduce the frequency of ADEs as an emerging area of study and identify several use cases in which AI could contribute to reducing or preventing ADEs. Furthermore, genetic information is thought to be critical in improving the performance of AI algorithms.  With the prevalence of genotyping, researchers are confident that this type of data can be more accessible over time and ultimately used to improve AI algorithm functioning and patient health.One of our challenges is how to identify and select the most relevant genetic variables among large amounts of genetic profile information, said lead author Ania Syrowatka, PhD, of the Divi

## Save Cleaned Data

In [45]:
# discard irrelevant columns
df_news_cleaned = df_news_noemp[['id', 'date', 'cleaned title', 'cleaned text']].copy()
df_news_cleaned.head()

Unnamed: 0,id,date,cleaned title,cleaned text
0,0,2021-03-18,Artificial intelligence improves parking efficiency in Chinese cities,"Artificial intelligence improves parking efficiency in Chinese cities Photo taken on July 1, 2019, shows a sign for electronic toll collection ETC newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. Peoples Daily OnlineLi Wenming Thanks to the application of an artificial intelligence AIempowered roadside electronic toll collection ETC system, Chinas ..."
1,1,2020-02-27,Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot,"Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot admin Latest posts by admin see all Mansplaining in conferences: How can we get him to forestall February 27, 2020 Coronavirus Could Explode in the U.S. Overnight Like it Did in Italy February 27, 2020 Levi Strauss marks the next phase in corporate paid leave policies February 27, 2020 Scientists who designed an artificially clever robotic that helped youngsters with autism spice up their ..."
2,2,2021-03-26,"Forget ML, AI and Industry 4.0 – obsolescence should be your focus","Forget ML, AI and Industry 4.0 obsolescence should be your focus The world entered a new era of accelerated transformation in the last eighteen months that will continue to evolve and press forward for years to come. Most businesses are playing catchup trying to make sense of a new timeline where the ten years that had been set aside for careful planning and implementation of what was coming up next no longer exists. The next is happening now and, regardless of your industry or seniority, t..."
3,3,2021-03-10,Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered,"Strategy Analytics: 71 of Smartphones Sold Globally in 2021 will be AI Powered BOSTONBUSINESS WIREStrategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that ondevice Artificial Intelligence AI is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report highlights..."
4,4,2020-10-20,Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application,"Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application TOKYO, Oct 20, 2020 ACN Newswire Olympus Corporation took part in a groundbreaking project as a business promoter, in cooperation with the Ministry of Internal Affairs and Communications MIC, entitled, Survey Study for International Expansion of AI Diagnosis Support System Using UltraHigh Magnifying Endoscopes in India. The project aims to develop advanced endoscopy di..."


In [47]:
%%time

# save processed news text as parquet file
# GCP version
# path = "gs://nlp-final-project-data/data/"
# df_news_cleaned.to_parquet(path + 'news_cleaned.parquet', engine='pyarrow')