# MSCA 32018 NLP Final Project 

## Unstructured Text Analysis for Anticipating AI shift in Job Industry

GARIMA SOHI

This project examines the likely future impact of AI on industries and job roles by mining a corpus of approximately 200K news articles about Data Science, Machine Learning, and Artificial Intelligence.

The project is divided into the following modules:

A. Pre-Processing
B. Topic Modelling
C. Sentiment Analysis - Named Entity Recognition 
D. Targeted Sentiment Analysis


## A. Pre-Processing: Part I

This module covers the following steps:

1. Text Cleaning
2. Lemmatization
3. Duplicate Removal

In [1]:
# Import required libraries/packages
import pandas as pd
import numpy as np
import warnings
import time

import string
import re
from tqdm.notebook import tqdm

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import spacy

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt') 
nltk.download('stopwords')

warnings.filterwarnings("ignore")


#### a) Data Extraction

In [2]:
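# Load the dataset
# A bare `%time` line times an empty statement, so the timings printed below
# do not measure the download; `%%time` as the first line of the cell would.
%time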

df_news_final_project = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')

# Check the shape of dataset
df_news_final_project.shape

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.82 µs


(200332, 5)

The dataset contains 200,332 news articles (~200K) across 5 columns.

In [3]:
# Display the first few rows of the df_news_final_project DataFrame
df_news_final_project.head(3)

| | url | date | language | title | text |
| --- | --- | --- | --- | --- | --- |
| 0 | http://en.people.cn/n3/2021/0318/c90000-983012... | 2021-03-18 | en | Artificial intelligence improves parking effic... | \n\nArtificial intelligence improves parking e... |
| 1 | http://newsparliament.com/2020/02/27/children-... | 2020-02-27 | en | Children With Autism Saw Their Learning and So... | \nChildren With Autism Saw Their Learning and ... |
| 2 | http://www.dataweek.co.za/12835r | 2021-03-26 | en | Forget ML, AI and Industry 4.0 – obsolescence ... | \n\nForget ML, AI and Industry 4.0 – obsolesce... |


In [4]:
# Filter the DataFrame to include only English news articles
# .copy() makes the result an independent DataFrame, so adding columns to it later is safe
df_news_en = df_news_final_project[df_news_final_project['language'] == 'en'].copy()

# Get the shape of the filtered DataFrame
df_news_en.shape


(200332, 5)

The shape is unchanged, so every article in the dataset is already tagged as English.

In [5]:
# Check for missing values in the DataFrame
missing_values = df_news_en.isna().sum()

# Display the count of missing values for each column
missing_values


url         0
date        0
language    0
title       0
text        0
dtype: int64

There are no missing values in the dataset.
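
`isna()` only flags null entries; an article whose `text` is an empty or whitespace-only string would still count as present. A minimal sketch of an additional check follows, using a small made-up frame in place of `df_news_en`. The helper name `count_blank` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_blank(df: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Count entries per column that are empty or whitespace-only strings."""
    # astype(str) guards against non-string values; strip() removes surrounding whitespace
    return pd.Series(
        {col: int((df[col].astype(str).str.strip() == "").sum()) for col in columns}
    )

# Toy data standing in for the real articles (illustrative values only)
sample = pd.DataFrame({
    "title": ["AI in parking", "", "ML and obsolescence"],
    "text": ["Some article body", "   \n", "Another body"],
})

blank_counts = count_blank(sample, ["title", "text"])
print(sample.isna().sum())  # reports no missing values
print(blank_counts)         # yet one blank title and one blank text
```

On the real data the call would presumably be `count_blank(df_news_en, ['title', 'text'])`.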

#### b) Data Cleaning

In [6]:
from bs4 import BeautifulSoup
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Function to clean the text of news articles
def text_cleaning(text, max_word_length=15):
    
    # Convert non-string inputs to string
    if not isinstance(text, str):
        text = str(text)
    
    # Remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove special characters like '&nbsp;' and '\xa0'
    text = re.sub(r'&\S*;|\xa0', ' ', text)

    # Remove any non-alphabetic character from the text
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    
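    # Lowercasing is disabled, so the original casing is preserved
    # text = text.lower()
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove stopwords and long words. NLTK's stopword list is lowercase and casing
    # is preserved above, so this check is case-sensitive: capitalized stopwords such
    # as "With" or "Their" are kept. The punctuation check is only a safeguard, since
    # non-alphabetic characters were already replaced with spaces.
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation and len(word) <= max_word_length]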
    
    # Remove leading/trailing white space
    cleaned_text = ' '.join(tokens).strip()
    
    return cleaned_text
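
To make the effect of each regex step concrete, the sketch below applies the same three substitutions as `text_cleaning` to one made-up sentence and keeps the intermediate strings. It is a simplified illustration under stated assumptions: `show_cleaning_steps` and `toy_stop_words` are hypothetical names, `str.split()` stands in for `word_tokenize`, and the tiny stopword set stands in for NLTK's English list.

```python
import re

def show_cleaning_steps(text: str) -> list[str]:
    """Apply the regex steps one at a time and return each intermediate string."""
    steps = [text]
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    steps.append(text)
    # Replace HTML entities and non-breaking spaces with a space
    text = re.sub(r'&\S*;|\xa0', ' ', text)
    steps.append(text)
    # Replace every non-alphabetic character with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    steps.append(text)
    return steps

# Hypothetical stand-in for NLTK's English stopword list, which is all lowercase
toy_stop_words = {"the", "in", "with", "see"}

raw = "The AI market grew 40% in 2021&nbsp;with GPT-3, see https://example.com/report"
steps = show_cleaning_steps(raw)

# Whitespace split used here instead of word_tokenize to stay dependency-free
tokens = steps[-1].split()
kept = [w for w in tokens if w not in toy_stop_words and len(w) <= 15]
print(kept)  # ['The', 'AI', 'market', 'grew', 'GPT']
```

Note that the capitalized "The" survives the stopword filter because the comparison is case-sensitive, and that digits are dropped, so "GPT-3" becomes "GPT".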


In [7]:
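# Imports and resources needed for WordNet lemmatization
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')  # required alongside 'wordnet' by some NLTK versions

lemmatizer = WordNetLemmatizer()

# Function to get the appropriate POS tag for a word
def get_wordnet_pos(word):
    """Map the first letter of the word's POS tag to the constant used by WordNetLemmatizer"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # default to noun if not found

# Function to apply lemmatization with POS tagging to a single word
def lemmatize_with_pos(word):
    # get_wordnet_pos always returns a POS (noun by default), so no fallback branch is needed
    return lemmatizer.lemmatize(word, pos=get_wordnet_pos(word))

# Function to apply lemmatization to each word in a list of tokens
def lemmatize_text(tokens):
    return [lemmatize_with_pos(word) for word in tokens]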
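
`get_wordnet_pos` tags each word in isolation, so the tagger never sees the surrounding sentence. A common alternative is to tag the whole token list in one call and then map each Penn Treebank tag to a WordNet POS. The sketch below illustrates that idea only. To stay self-contained it takes the tagger and lemmatizer as plain callables (in the notebook these would presumably be `nltk.pos_tag` and `WordNetLemmatizer().lemmatize`); `lemmatize_tokens`, `toy_tagger` and `toy_lemmatize` are hypothetical stand-ins with hard-coded behavior.

```python
from typing import Callable, Iterable

# WordNet's POS constants are single letters:
# 'a' (adjective), 'n' (noun), 'v' (verb), 'r' (adverb)
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[:1].upper(), "n")

def lemmatize_tokens(
    tokens: Iterable[str],
    tagger: Callable[[list[str]], list[tuple[str, str]]],
    lemmatize: Callable[[str, str], str],
) -> list[str]:
    """Tag the whole token list once, then lemmatize each word with its mapped POS."""
    tagged = tagger(list(tokens))  # a single tagging call keeps sentence context
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]

# Toy stand-in for nltk.pos_tag (hard-coded tags, illustration only)
def toy_tagger(tokens: list[str]) -> list[tuple[str, str]]:
    lookup = {"improves": "VBZ", "parking": "NN", "efficiency": "NN"}
    return [(t, lookup.get(t, "NN")) for t in tokens]

# Toy stand-in for WordNetLemmatizer().lemmatize (crude rule, illustration only)
def toy_lemmatize(word: str, pos: str) -> str:
    return word[:-1] if pos == "v" and word.endswith("s") else word

result = lemmatize_tokens(["improves", "parking", "efficiency"], toy_tagger, toy_lemmatize)
print(result)  # ['improve', 'parking', 'efficiency']
```

Tagging once per article rather than once per word would also reduce the number of tagger calls, which may matter on a corpus of this size.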

In [8]:
# Function to remove duplicates from news articles
def remove_duplicates(df):
    
    # Remove duplicate rows based on 'text', 'title', and 'date' columns
    unique_df = df.drop_duplicates(subset=['text', 'title', 'date'])
    
    # Return the DataFrame with duplicate rows removed
    return unique_df


In [9]:
# Function to implement pre-processing
def process_df(df):

    # 1. Apply the text_cleaning function to the 'text' column using apply() method
    print("Cleaning News Articles")
    start_time = time.time()
    df['cleaned_text'] = df['text'].apply(text_cleaning)

    cleaning_time = (time.time() - start_time) / 60
    print(f"Cleaning time: {cleaning_time:.2f} minutes")
    
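    # 2. Apply the lemmatization function to the 'cleaned_text' column using apply() method
    print("Lemmatizing cleaned News Articles")
    start_time = time.time()
    # lemmatize_text expects a list of words, so split each article and re-join the lemmas
    df['lemmatized_text'] = df['cleaned_text'].apply(lambda text: ' '.join(lemmatize_text(text.split())))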

    lemmatizing_time = (time.time() - start_time) / 60
    print(f"Lemmatization time: {lemmatizing_time:.2f} minutes")
    
    # 3. Apply remove_duplicates function 
    print("Removing Duplicate Records")
    start_time = time.time()
    df = remove_duplicates(df)
    print('Number of unique News Articles after removing duplicates:', df.shape[0])

    removal_time = (time.time() - start_time) / 60
    print(f"Removing duplicates time: {removal_time:.2f} minutes")

    return df


In [10]:
df_news_clean = process_df(df_news_en)

Cleaning News Articles
Cleaning time: 18.53 minutes
Lemmatizing cleaned News Articles
Lemmatization time: 11.85 minutes
Removing Duplicate Records
Number of unique News Articles after removing duplicates: 198735
Removing duplicates time: 0.11 minutes


In [11]:
# Display the first few rows of the resulting dataframe
df_news_clean.head(3)

| | url | date | language | title | text | cleaned_text | lemmatized_text |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | http://en.people.cn/n3/2021/0318/c90000-983012... | 2021-03-18 | en | Artificial intelligence improves parking effic... | \n\nArtificial intelligence improves parking e... | Artificial intelligence improves parking effic... | Artificial intelligence improves parking effic... |
| 1 | http://newsparliament.com/2020/02/27/children-... | 2020-02-27 | en | Children With Autism Saw Their Learning and So... | \nChildren With Autism Saw Their Learning and ... | Children With Autism Saw Their Learning Social... | Children With Autism Saw Their Learning Social... |
| 2 | http://www.dataweek.co.za/12835r | 2021-03-26 | en | Forget ML, AI and Industry 4.0 – obsolescence ... | \n\nForget ML, AI and Industry 4.0 – obsolesce... | Forget ML AI Industry obsolescence focus Febru... | Forget ML AI Industry obsolescence focus Febru... |


In [12]:
# Check the shape of dataset
df_news_clean.shape

(198735, 7)

Removing duplicates, defined as rows whose date, title, and text all match, dropped 1,597 articles (from 200,332 to 198,735, roughly 1.6K).
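
The deduplication rule can be illustrated on a toy frame: a row is dropped only when it agrees with an earlier row on all three of `date`, `title` and `text`, so a re-published article with a different date survives. The data below is made up for illustration and is not taken from the corpus.

```python
import pandas as pd

# Rows 0 and 1 match on date, title and text; row 2 differs only in date
toy = pd.DataFrame({
    "url":   ["a.com/1", "b.com/1", "c.com/1", "d.com/2"],
    "date":  ["2021-03-18", "2021-03-18", "2021-03-19", "2021-03-26"],
    "title": ["AI improves parking", "AI improves parking", "AI improves parking", "Forget ML"],
    "text":  ["Body one", "Body one", "Body one", "Body two"],
})

# Same call as in remove_duplicates; the first occurrence of each duplicate group is kept
unique = toy.drop_duplicates(subset=["text", "title", "date"])
removed = len(toy) - len(unique)

print(removed)              # 1
print(list(unique["url"]))  # ['a.com/1', 'c.com/1', 'd.com/2']
```

Because `url` is not part of the subset, the same article syndicated under different URLs on the same date counts as a duplicate.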

In [13]:
# Save DataFrame as Parquet
df_news_clean.to_parquet('cleaned_news_articles.parquet', engine='pyarrow')