# Obtaining text for Natural Language Processing (NLP) Tasks

In this notebook, we will:
1. Obtain text using web scraping tools
2. Store the text in a dataframe
3. Perform NLP tasks on the text in the dataframe

### 1. Web scraping using Beautiful Soup


Beautiful Soup is a Python library designed to help you easily extract information from web pages by parsing HTML and XML documents. Here we provide a step-by-step tutorial on how to use Beautiful Soup for web scraping.

https://beautiful-soup-4.readthedocs.io/en/latest/
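
As a quick orientation before the real data, here is a minimal sketch of the two calls used throughout this notebook, `find_all` and `get_text`. The HTML snippet and variable names are made up for illustration, and the built-in `html.parser` backend is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (not part of the Reuters data)
html = """
<html><body>
  <h1>Market news</h1>
  <p class="story">Oil prices rose.</p>
  <p class="story">Shares fell.</p>
</body></html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; get_text returns the text inside a tag
stories = [p.get_text() for p in soup_demo.find_all("p")]
heading = soup_demo.find("h1").get_text()

print(heading)
print(stories)
```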

Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We will use the **Reuters** dataset for this notebook. The dataset contains 21,578 news articles that appeared on the Reuters newswire in 1987. The articles are provided as a set of SGM files, which use SGML, a markup format similar to HTML and XML.

Here is the Readme file for the dataset - https://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

The dataset can be downloaded from here - https://www.daviddlewis.com/resources/testcollections/reuters21578/ or 
https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

In [2]:
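# read the whole SGM file; the context manager closes the file afterwards
with open('reut2-021.sgm', encoding='utf-8', errors='ignore') as f:
    dataframe = f.read()  # raw SGML text as one string (not a pandas DataFrame)
dataframe[0:1000]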

'<!DOCTYPE lewis SYSTEM "lewis.dtd">\n<REUTERS TOPICS="NO" LEWISSPLIT="TEST" CGISPLIT="TRAINING-SET" OLDID="20436" NEWID="21001">\n<DATE>19-OCT-1987 15:37:46.03</DATE>\n<TOPICS></TOPICS>\n<PLACES></PLACES>\n<PEOPLE></PEOPLE>\n<ORGS></ORGS>\n<EXCHANGES></EXCHANGES>\n<COMPANIES></COMPANIES>\n<UNKNOWN> \n&#5;&#5;&#5;F \n&#22;&#22;&#1;f2882&#31;reute\nf f BC-CITYFED-FINANCI   10-19 0013</UNKNOWN>\n<TEXT TYPE="BRIEF">&#2;\n******<TITLE>CITYFED FINANCIAL CORP SAYS IT CUT QTRLY DIVIDEND TO ONE CENT FROM 10 CTS/SHR\n</TITLE>Blah blah blah.\n&#3;\n\n</TEXT>\n</REUTERS>\n<REUTERS TOPICS="YES" LEWISSPLIT="TEST" CGISPLIT="TRAINING-SET" OLDID="20435" NEWID="21002">\n<DATE>19-OCT-1987 15:35:53.55</DATE>\n<TOPICS><D>crude</D><D>ship</D></TOPICS>\n<PLACES><D>bahrain</D><D>iran</D><D>usa</D></PLACES>\n<PEOPLE></PEOPLE>\n<ORGS></ORGS>\n<EXCHANGES></EXCHANGES>\n<COMPANIES></COMPANIES>\n<UNKNOWN> \n&#5;&#5;&#5;Y \n&#22;&#22;&#1;f2873&#31;reute\nr f AM-GULF-PLATFORM   10-19 0101</UNKNOWN>\n<TEXT>&#2;\n<TIT

In [3]:
soup = BeautifulSoup(dataframe, 'lxml')
content = soup.prettify()
print(content[0:300])

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<html>
 <body>
  <reuters cgisplit="TRAINING-SET" lewissplit="TEST" newid="21001" oldid="20436" topics="NO">
   <date>
    19-OCT-1987 15:37:46.03
   </date>
   <topics>
   </topics>
   <places>
   </places>
   <people>
   </people>
   <orgs>
   </orgs>
   <exchan


In [4]:
topics = soup.find_all('topics')
topics[0:10]

[<topics></topics>,
 <topics><d>crude</d><d>ship</d></topics>,
 <topics><d>acq</d></topics>,
 <topics></topics>,
 <topics></topics>,
 <topics><d>crude</d><d>ship</d></topics>,
 <topics><d>acq</d></topics>,
 <topics></topics>,
 <topics></topics>,
 <topics></topics>]

In [5]:
# instantiate an empty list
topic_list = list()

for x in topics:
    # turn bs4.tag into text and create a list of each article's topics
    words = [i.text for i in x]
    #append this list to the larger list
    topic_list.append(words)

topic_list[0:10]

[[],
 ['crude', 'ship'],
 ['acq'],
 [],
 [],
 ['crude', 'ship'],
 ['acq'],
 [],
 [],
 []]

In [6]:
def pull_out_earn_topic(topic_list):
    for i, topic in enumerate(topic_list):
        
        # each entry is a list of strings, so join them into one space-separated string
        article_topics = ''
        for word in topic:
            article_topics += (word + ' ')
            
        # assign desired topic 
        if not article_topics:
            topic_list[i] = 'blank'
        elif 'earn' in article_topics:
            topic_list[i] = 'earn'
        else:
            topic_list[i] = 'other'
    
    return topic_list
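
Note that `pull_out_earn_topic` overwrites the list it is given, and the `'earn' in article_topics` check is a substring match on the joined string. A possible alternative, sketched here under the assumption that only an exact `earn` topic should count, returns a new list instead. The name `label_topics` is hypothetical and not part of the original notebook.

```python
def label_topics(topic_lists, target="earn"):
    """Return one label per article without modifying the input."""
    labels = []
    for topics in topic_lists:
        if not topics:
            # article has no topics at all
            labels.append("blank")
        elif target in topics:
            # exact match against the list of topic strings
            labels.append(target)
        else:
            labels.append("other")
    return labels

# Made-up topic lists in the same shape as topic_list
sample = [[], ["crude", "ship"], ["acq"], ["earn"], ["grain", "earn"]]
print(label_topics(sample))
```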

### 2. Make a Dataframe

In [7]:
topics_for_df = pull_out_earn_topic(topic_list)
df = pd.DataFrame(topics_for_df, columns=['topic'])

df

Unnamed: 0,topic
0,blank
1,other
2,other
3,blank
4,blank
...,...
573,other
574,other
575,other
576,blank


In [8]:
# pull out everything with the "text" tag from the bs4 object
all_text = soup.find_all("text")

# instantiate empty list
list_all_text = list()

# loop through the bs4 element
for text in all_text:
    
    # getting just the text from the element
    # stripping out the newline indicator
    working_text = text.get_text().replace("\n", " ")
    
    # removing extra spaces
    working_text = ' '.join(working_text.split())
    
    # appending to list
    list_all_text.append(working_text)

In [9]:
df['text'] = list_all_text
df.head()

Unnamed: 0,topic,text
0,blank,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...
1,other,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...
2,other,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...
3,blank,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...
4,blank,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...
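
The column assignment above works only if `find_all('topics')` and `find_all('text')` return the same number of elements in the same order. One way to make that pairing explicit is to walk each `<REUTERS>` element and read its topics and text together. The sketch below uses a made-up two-article snippet and the built-in `html.parser`; `sample_sgml`, `sample_soup`, and `records` are illustrative names.

```python
from bs4 import BeautifulSoup

# Made-up snippet in the same shape as the SGM file (not real articles)
sample_sgml = """
<REUTERS NEWID="1">
<TOPICS><D>earn</D></TOPICS>
<TEXT>First   article
text.</TEXT>
</REUTERS>
<REUTERS NEWID="2">
<TOPICS></TOPICS>
<TEXT>Second article text.</TEXT>
</REUTERS>
"""

sample_soup = BeautifulSoup(sample_sgml, "html.parser")

records = []
for article in sample_soup.find_all("reuters"):
    # Topics and text are read from the same article element, so they stay paired
    topics = [d.get_text() for d in article.find("topics").find_all("d")]
    text_tag = article.find("text")
    # Collapse newlines and repeated spaces into single spaces
    text = " ".join(text_tag.get_text().split()) if text_tag else ""
    records.append({"newid": article.get("newid"), "topics": topics, "text": text})

print(records)
```

Passing `records` to `pd.DataFrame` would then give one row per article.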


### 3. NLP Task

Now that we have a text dataframe, we can use NLTK for NLP tasks. 

We will show Sentiment Analysis using NLTK; however, any other NLP task should also be possible using the above dataframe.

Load NLTK libraries

In [10]:
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

*Note: A version of the Reuters dataset also ships as an NLTK corpus. After running `nltk.download('reuters')`, it can be loaded with `from nltk.corpus import reuters`.*


In [11]:
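# Uncomment to download the NLTK resources used below (needed once per environment).
# Resource names can differ between NLTK versions.
# for resource in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger', 'vader_lexicon']:
#     nltk.download(resource)
# nltk.download('reuters')
# from nltk.corpus import reuters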

Initialize the text processing tools

In [12]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

Handling Missing Values

In [13]:
df['clean_text'] = df['text'].fillna('')
df.head(20)

Unnamed: 0,topic,text,clean_text
0,blank,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...
1,other,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...
2,other,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...
3,blank,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...
4,blank,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...
5,other,DIPLOMATS CALL U.S. ATTACK ON OIL RIG RESTRAIN...,DIPLOMATS CALL U.S. ATTACK ON OIL RIG RESTRAIN...
6,other,BROWN DISC TO BUY RHONE-POULENC <RHON.PA> UNIT...,BROWN DISC TO BUY RHONE-POULENC <RHON.PA> UNIT...
7,blank,"******DOW SINKS TO LOWEST LEVEL OF THE YEAR, D...","******DOW SINKS TO LOWEST LEVEL OF THE YEAR, D..."
8,blank,LANE TELECOMMUNICATIONS PRESIDENT RESIGNS HOUS...,LANE TELECOMMUNICATIONS PRESIDENT RESIGNS HOUS...
9,blank,"PERKIN-ELMER <PKN> WINS EPA CONTRACT NORWALK, ...","PERKIN-ELMER <PKN> WINS EPA CONTRACT NORWALK, ..."


### Text Preprocessing
Before running sentiment analysis, we preprocess the articles. This includes tokenization, lowercasing, lemmatization, and removing stop words:

In [14]:
def get_wordnet_pos(treebank_tag):
    """Map POS tag to first character used by WordNetLemmatizer"""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # by default, treat as noun

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens]
    tokens = [word for word in tokens if word.isalpha() or word in string.punctuation]
    pos_tags = nltk.pos_tag(tokens)
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
    tokens = [word for word in tokens if word not in stop_words]

    return " ".join(tokens)

df['processed_text'] = df['clean_text'].apply(preprocess_text)
df.head()

Unnamed: 0,topic,text,clean_text,processed_text
0,blank,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...,* * * * * * cityfed financial corp say cut qtr...
1,other,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...,huge oil platform dot gulf like beacon ashraf ...
2,other,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...,* * * * * * ccr video sayst receive offer nego...
3,blank,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...,gm < gm > canada unit major offer accept union...
4,blank,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...,"canada development unit < > refinance sarnia ,..."


### Sentiment analysis
To get insights from the articles, we'll first categorize them by sentiment using NLTK's `SentimentIntensityAnalyzer`:

In [15]:
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    sentiment_score = sia.polarity_scores(text)['compound']
    if sentiment_score >= 0.05:
        return "positive"
    elif sentiment_score <= -0.05:
        return "negative"
    else:
        return "neutral"
        
df['sentiment_text'] = df['processed_text'].apply(get_sentiment)
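
The thresholds in `get_sentiment` act on the analyzer's `compound` score, which is normalized to the range -1 to 1; scores within 0.05 of zero are treated as neutral. The sketch below isolates just that thresholding step so it can be tried without NLTK. `label_from_compound` is a hypothetical helper, and the scores are made-up values, not analyzer output.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Made-up scores for illustration only
for score in (0.62, 0.05, 0.01, -0.05, -0.48):
    print(score, label_from_compound(score))
```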

Inspect the dataset with the newly added sentiment column:

In [16]:
cols = ['text', 'clean_text', 'processed_text', 'sentiment_text']
df[cols].head()

Unnamed: 0,text,clean_text,processed_text,sentiment_text
0,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...,******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY...,* * * * * * cityfed financial corp say cut qtr...,negative
1,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...,HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By AS...,huge oil platform dot gulf like beacon ashraf ...,negative
2,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...,******CCR VIDEO SAYST RECEIVED OFFER TO NEGOTI...,* * * * * * ccr video sayst receive offer nego...,negative
3,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...,GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UN...,gm < gm > canada unit major offer accept union...,positive
4,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...,CANADA DEVELOPMENT UNIT <CDC.TO> REFINANCES SA...,"canada development unit < > refinance sarnia ,...",positive


To analyze further, we'll separate the articles based on their sentiment.

In [17]:
df_sentiment = df[cols]

negatives_df = df_sentiment[df_sentiment['sentiment_text'] == 'negative'][['text', 'processed_text']]
negatives = negatives_df['text'].tolist()

positives_df = df_sentiment[df_sentiment['sentiment_text'] == 'positive'][['text', 'processed_text']]
positives = positives_df['text'].tolist()
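
Before reading individual articles, it can help to see how many fall into each class. A minimal sketch with `collections.Counter`, using a made-up list of labels in place of `df['sentiment_text']`:

```python
from collections import Counter

# Made-up labels standing in for df['sentiment_text'].tolist()
labels = ["negative", "negative", "positive", "neutral", "positive", "negative"]

counts = Counter(labels)
total = sum(counts.values())

# Print each class with its count and share of all articles
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

On the dataframe itself, `df['sentiment_text'].value_counts()` gives the same tally.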

Let us preview some of the articles to better understand the sentiments:

Negative Articles Sample

In [18]:
# slicing avoids an IndexError if fewer than three articles are available
for text in negatives[:3]:
    print("{}\n".format(text))

******CITYFED FINANCIAL CORP SAYS IT CUT QTRLY DIVIDEND TO ONE CENT FROM 10 CTS/SHR Blah blah blah.

HUGE OIL PLATFORMS DOT GULF LIKE BEACONS By ASHRAF FOUAD BAHRAIN, Oct 19 - Huge oil platforms dot the Gulf like beacons -- usually lit up like Christmas trees at night. One of them, sitting astride the Rostam offshore oilfield, was all but blown out of the water by U.S. Warships on Monday. The Iranian platform, an unsightly mass of steel and concrete, was a three-tier structure rising 200 feet (60 metres) above the warm waters of the Gulf until four U.S. Destroyers pumped some 1,000 shells into it. The U.S. Defense Department said just 10 pct of one section of the structure remained. U.S. helicopters destroyed three Iranian gunboats after an American helicopter came under fire earlier this month and U.S. forces attacked, seized, and sank an Iranian ship they said had been caught laying mines. But Iran was not deterred, according to U.S. defense officials, who said Iranian forces used Ch

Positive Articles Sample

In [19]:
# slicing avoids an IndexError if fewer than three articles are available
for text in positives[:3]:
    print("{}\n".format(text))

GM <GM> CANADA UNIT MAJOR OFFER ACCEPTED BY UNION TORONTO, Oct 19 - The Canadian Auto Workers' Union said it accepted an economic offer from the Canadian division of General Motors Corp <GM> in contract negotiations. But union president Bob White said many local issues at the 11 plants in Ontario and Quebec still remained unresolved ahead of Thursday's deadline for a strike by 40,000 workers. "It minimizes the possibility of a strike," White told reporters. However, "if we don't have local agreements settled by Thursday, there will be a strike," he said. The local issues still unresolved involved health care, skilled trades and job classifications, White said. GM Canada negotiator Rick Curd said he believed a strike would be avoided. "Even though there are some tough issues to be resolved we're on the right schedule to meet the target," Curd said. "I'm very pleased with the state of the negotiations," he said. Union membership meetings have been scheduled for the weekend in case a tent