[Reference](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72)

# Scraping News Articles for Data Retrieval

We will scrape [inshorts](https://inshorts.com/) with Python to retrieve news articles, focusing on three categories: technology, sports and world affairs. We will retrieve one page’s worth of articles for each category. A typical news category landing page is depicted in the following figure, which also highlights the HTML section for the textual content of each article.

The figure shows the specific HTML tags that contain the textual content of each news article on the landing page. We will use this information to extract news articles with the `BeautifulSoup` and `requests` libraries. Let’s first load the following dependencies.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline

We will now build a function that uses `requests` to fetch the HTML content of the landing page for each of the three news categories. We then use `BeautifulSoup` to parse each page and extract the headline and article text of every news article in that category, locating the content through the specific HTML tags and classes that hold it.

In [2]:
seed_urls = ['https://inshorts.com/en/read/technology',
             'https://inshorts.com/en/read/sports',
             'https://inshorts.com/en/read/world']

In [3]:
def build_dataset(seed_urls):
    news_data = []
    for url in seed_urls:
        news_category = url.split('/')[-1]
        data = requests.get(url)
        soup = BeautifulSoup(data.content, 'html.parser')
        
        news_articles = [{'news_headline': headline.find('span', 
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div', 
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}
                         
                            for headline, article in 
                             zip(soup.find_all('div', 
                                               class_=["news-card-title news-right-box"]),
                                 soup.find_all('div', 
                                               class_=["news-card-content news-right-box"]))
                        ]
        news_data.extend(news_articles)
        
    df =  pd.DataFrame(news_data)
    df = df[['news_headline', 'news_article', 'news_category']]
    return df

The function extracts the news headline, article text and category, and builds a data frame in which each row corresponds to one news article. We will now invoke this function and build our dataset.

In [4]:
news_df = build_dataset(seed_urls)
news_df.head(10)

| | news_headline | news_article | news_category |
|---|---|---|---|
| 0 | End $479 mn US Army contract: Microsoft employ... | Over 100 Microsoft workers have written to CEO... | technology |
| 1 | Twitter CEO to not attend parliament panel mee... | Twitter CEO Jack Dorsey will not attend BJP le... | technology |
| 2 | Miss him today, every day: Apple CEO on Steve ... | Remembering late Apple Co-founder Steve Jobs o... | technology |
| 3 | Adobe fixes bug that damaged MacBook Pro speakers | Adobe has fixed a bug in its Premiere Pro afte... | technology |
| 4 | PETA criticises Google Doodle on Steve Irwin, ... | Animal rights organisation PETA faced heavy ba... | technology |
| 5 | Twitter Co-founder Evan Williams to quit its b... | Twitter on Friday announced that Co-founder Ev... | technology |
| 6 | Satellite view was almost named 'Bird Mode': G... | Google Maps co-creator Bret Taylor has reveale... | technology |
| 7 | Alibaba rules out layoffs this year despite Ch... | Chinese e-commerce giant Alibaba's CEO Daniel ... | technology |
| 8 | BJP MP-led panel summons officials of Facebook... | The Parliamentary Committee on Information Tec... | technology |
| 9 | Japan firm makes video to teach traffic safety... | Japan-based auto parts and service chain Yello... | technology |


We now have a neatly formatted dataset of news articles, and you can quickly check the number of articles in each category with the following code.

In [5]:
news_df.news_category.value_counts()

technology    25
world         25
sports        25
Name: news_category, dtype: int64

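One fragile spot in `build_dataset` is the category label: `url.split('/')[-1]` returns an empty string if a seed URL ends in a slash. A slightly more defensive variant can be sketched with the standard library's `urllib.parse`. The helper name `category_from_url` and the `?page=2` query string are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse

def category_from_url(url):
    """Return the last non-empty path segment of a URL (hypothetical helper)."""
    # urlparse separates the path from any query string or fragment
    path = urlparse(url).path.rstrip('/')
    return path.rsplit('/', 1)[-1]

print(category_from_url('https://inshorts.com/en/read/technology'))
# technology
print(category_from_url('https://inshorts.com/en/read/sports/?page=2'))
# sports
```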
# Text Wrangling & Pre-processing

There are usually multiple steps involved in cleaning and pre-processing textual data. I have covered text pre-processing in detail in [***Chapter 3 of ‘Text Analytics with Python’***](https://github.com/dipanjanS/text-analytics-with-python/tree/master/Chapter-3) (code is open-sourced). In this section, however, I will highlight some of the most important steps, which are used heavily in Natural Language Processing (NLP) pipelines and which I frequently use in my own NLP projects. We will be leveraging a fair bit of **`nltk`** and **`spacy`**, both state-of-the-art libraries in NLP. Typically a `pip install <library>` or a `conda install <library>` should suffice. However, if you face issues loading `spacy`’s language models, follow the steps highlighted below to resolve them (I faced this issue on one of my systems).

Let’s now load up the necessary dependencies for text pre-processing. We will remove negation words from the stopword list, because we want to keep them in the text: they can be useful, especially during sentiment analysis.

❗ **IMPORTANT NOTE**: A lot of you have messaged me about not being able to load the contractions module. It’s not a standard python module. We leverage a standard set of contractions available in the `contractions.py` file in [my repository](https://github.com/dipanjanS/practical-machine-learning-with-python/tree/master/bonus%20content/nlp%20proven%20approach). Please add it in the same directory you run your code from, else it will not work.

In [9]:
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata

### To download `en_core_web_lg`
`$ python3 -m spacy download en_core_web_lg`

If you run into the [NLTK download SSL: Certificate verify failed](https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed) error on macOS, run:

`$ bash /Applications/Python\ 3.7/Install\ Certificates.command`

Then, in Python 3 IDLE:
```
>>> import nltk
>>> nltk.download('stopwords')
```

In [13]:
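# load the large English model; its default pipeline already includes the tagger, parser and NER
# (the old parse/tag/entity keyword arguments are not accepted by current spaCy versions)
nlp = spacy.load('en_core_web_lg')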
#nlp_vec = spacy.load('en_vecs', parse = True, tag=True, #entity=True)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

# Removing HTML tags

Unstructured text often contains a lot of noise, especially if you collect it with techniques like web or screen scraping. HTML tags are a typical example: they add little value to understanding and analyzing text.

In [14]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [15]:
strip_html_tags('<html><h2>Some important text</h2></html>')

'Some important text'

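If `BeautifulSoup` is not available, roughly the same effect can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the approach used above. The names `_TextExtractor` and `strip_html_tags_stdlib` are hypothetical, and skipping the contents of `<script>` and `<style>` is a design choice of this sketch.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return ''.join(self._chunks)

def strip_html_tags_stdlib(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return parser.text()

print(strip_html_tags_stdlib('<html><h2>Some important text</h2></html>'))
# Some important text
```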
# Removing accented characters

Text corpora often contain accented characters or letters. If you only want to analyze English text, these characters need to be converted and standardized into ASCII characters. A simple example is converting **é** to **e**.

In [16]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [17]:
remove_accented_chars('Sómě Áccěntěd těxt')

'Some Accented text'

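To see why this works, it helps to look at what `NFKD` normalization does before the ASCII encoding step: it splits an accented letter into a base letter plus a combining mark, and the `'ignore'` error handler then drops the mark. The following sketch (the helper name `describe_normalization` is hypothetical) makes that visible. It also shows a limitation: a character with no decomposition, such as **ß**, is dropped entirely.

```python
import unicodedata

def describe_normalization(char):
    """Show how NFKD splits a character before the ASCII encode step."""
    decomposed = unicodedata.normalize('NFKD', char)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return [unicodedata.name(c) for c in decomposed], ascii_only

print(describe_normalization('\u00e9'))  # é
# (['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT'], 'e')
print(describe_normalization('\u00df'))  # ß
# (['LATIN SMALL LETTER SHARP S'], '')
```

If your corpus contains many such characters, you may want an explicit replacement table for them before calling `remove_accented_chars`.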
# Expanding Contractions

Contractions are shortened versions of words or syllables. They exist in both written and spoken English and are created by removing specific letters and sounds, often one of the vowels. Examples are **do not** becoming **don’t** and **I would** becoming **I’d**. Converting each contraction to its expanded, original form helps with text standardization.

*We leverage a standard set of contractions available in the `contractions.py` file in [my repository](https://github.com/dipanjanS/practical-machine-learning-with-python/tree/master/bonus%20content/nlp%20proven%20approach)*.

In [18]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [19]:
expand_contractions("Y'all can't expand contractions I'd think")

'You all cannot expand contractions I would think'

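One detail worth keeping in mind with this regex approach: alternation tries its alternatives from left to right, so if a shorter contraction appears before a longer one that starts with it, the shorter one wins. The sketch below uses a hypothetical two-entry mapping (whether the real `CONTRACTION_MAP` has such overlapping keys is an assumption) and sorts the keys longest-first to avoid the problem. It also escapes the keys so they are matched literally.

```python
import re

# Hypothetical two-entry mapping, for illustration only
SAMPLE_MAP = {"he'd": "he would", "he'd've": "he would have"}

def expand_with_longest_first(text, mapping=SAMPLE_MAP):
    # Try longer contractions before the shorter ones they start with
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys),
                         flags=re.IGNORECASE)

    def expand_match(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # keep the original casing of the first character
        return found[0] + expanded[1:]

    return pattern.sub(expand_match, text)

print(expand_with_longest_first("He'd've stayed, and he'd stay again"))
# He would have stayed, and he would stay again
```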
# Removing Special Characters

Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.

In [20]:
def remove_special_characters(text, remove_digits=False):
    # note A-Z (not A-z): the range A-z would also keep the characters [ \ ] ^ _ `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

In [21]:
remove_special_characters("Well this was fun! What do you think? 123#@!", remove_digits=True)

'Well this was fun What do you think '

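When writing such character classes, the letter ranges need to be spelled as `a-z` and `A-Z`. A range written as `A-z` also covers the ASCII characters that sit between the upper- and lower-case letters (``[ \ ] ^ _ ` ``), so those would survive the cleanup. A small sketch (the function name `keep_alphanumeric` and the sample string are hypothetical) shows the difference:

```python
import re

def keep_alphanumeric(text, remove_digits=False):
    # A-Z and a-z are spelled out separately; digits stay unless remove_digits is set
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

sample = "snake_case [tag] 2^10"

print(keep_alphanumeric(sample))
# snakecase tag 210

# the over-wide A-z range leaves the symbols in place
print(re.sub(r'[^a-zA-z0-9\s]', '', sample))
# snake_case [tag] 2^10
```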
# Stemming

To understand stemming, you need to gain some perspective on what word stems represent. Word stems are also known as the ***base form*** of a word, and we can create new words by attaching affixes to them in a process known as inflection. Consider the word **JUMP**. You can add affixes to it and form new words like **JUMPS**, **JUMPED**, and **JUMPING**. In this case, the base word **JUMP** is the word stem.

In [22]:
def simple_stemmer(text):
    ps = nltk.stem.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

In [23]:
simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")

'My system keep crash hi crash yesterday, our crash daili'

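To get a feel for why stems such as `hi` or `daili` are not always dictionary words, here is a deliberately naive suffix stripper. It is not the Porter algorithm, only a toy sketch of the rule-based idea behind it; the suffix list and the `min_stem_len` parameter are assumptions made for illustration.

```python
# Suffixes are checked in order, so longer ones come first
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=2):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

print(' '.join(naive_stem(w) for w in "jumps jumped jumping his".split()))
# jump jump jump hi
```

The rules know nothing about meaning, so **his** is reduced to `hi` just like a plural would be.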
# Lemmatization

*Lemmatization* is very similar to stemming: we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word rather than the root stem. The difference is that the root word, also known as the lemma, is always a lexicographically correct word (present in the dictionary), whereas the root stem may not be. Both **`nltk`** and **`spacy`** have excellent lemmatizers. We will be using **`spacy`** here.

In [24]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [25]:
lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")

'My system keep crash ! his crash yesterday , ours crash daily'

# Removing Stopwords

Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually the words with the highest counts if you compute simple term or word frequencies over a corpus. Typically, these can be articles, conjunctions, prepositions and so on. Some examples of stopwords are **a, an, the,** and the like.

In [26]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [27]:
remove_stopwords("The, and, if are stopwords, computer is not")

', , stopwords , computer not'

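The idea of keeping negation words can be sketched without `nltk` as well. The following example uses a small hypothetical stopword set (not the real `nltk` list) and a simple regex tokenizer that only keeps word characters, so no stray punctuation is left behind.

```python
import re

# Hypothetical mini stopword list, for illustration only
SAMPLE_STOPWORDS = {"the", "and", "if", "are", "is", "no", "not"}
NEGATIONS = {"no", "not"}

# keep negations in the text by removing them from the stopword set
stopwords = SAMPLE_STOPWORDS - NEGATIONS

def remove_stopwords_simple(text):
    tokens = re.findall(r"\w+", text)  # word characters only
    return ' '.join(t for t in tokens if t.lower() not in stopwords)

print(remove_stopwords_simple("The, and, if are stopwords, computer is not"))
# stopwords computer not
```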
# Bringing it all together — Building a Text Normalizer

While we can definitely keep going with more techniques like correcting spelling, grammar and so on, let’s now bring everything we learnt together and chain these operations to build a text normalizer to pre-process text data.

In [28]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        # (inside a character class '|' is a literal pipe, so only \r and \n are listed)
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

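The order of these steps matters. For example, accents are removed before special characters are stripped, and whitespace is squeezed only after characters have been deleted. A stripped-down sketch of the same chaining idea, using only the standard library and a list of step functions (all names here are hypothetical, and lemmatization and stopword removal are left out), could look like this:

```python
import re
import unicodedata

def strip_accents(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def collapse_newlines(text):
    return re.sub(r'[\r\n]+', ' ', text)

def drop_special(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def squeeze_spaces(text):
    return re.sub(r' +', ' ', text).strip()

# steps run in list order; reordering them can change the result
STEPS = [strip_accents, str.lower, collapse_newlines, drop_special, squeeze_spaces]

def run_pipeline(text, steps=STEPS):
    for step in steps:
        text = step(text)
    return text

print(run_pipeline("Caf\u00e9  prices\nrose 5%!"))
# cafe prices rose 5
```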
Let’s now put `normalize_corpus` into action! We will first combine the news headline and the news article text into a single document for each piece of news, and then pre-process those documents.

In [29]:
# combining headline and article text
news_df['full_text'] = news_df["news_headline"].map(str)+ '. ' + news_df["news_article"]

# pre-process text and store the same
news_df['clean_text'] = normalize_corpus(news_df['full_text'])
norm_corpus = list(news_df['clean_text'])

# show a sample news article
news_df.iloc[1][['full_text', 'clean_text']].to_dict()

{'full_text': 'Twitter CEO to not attend parliament panel meet on Monday. Twitter CEO Jack Dorsey will not attend BJP leader Anurag Thakur-headed parliamentary panel meeting, which was rescheduled to February 25. The February 11 meeting on "safeguarding citizens\' interests" following anti-right-wing bias allegations on Twitter was postponed after Dorsey failed to appear, citing "short notice". Twitter said its public policy head Colin Crowell will attend the meeting.',
 'clean_text': 'twitter ceo not attend parliament panel meet monday twitter ceo jack dorsey not attend bjp leader anurag thakur head parliamentary panel meeting reschedule february february meeting safeguard citizen interest follow anti right wing bias allegation twitter postpone dorsey fail appear cite short notice twitter say public policy head colin crowell attend meeting'}

As the sample shows, the normalizer lowercases the text, lemmatizes the words, and removes punctuation, digits and stopwords, while keeping the negation **not**. You can now save this dataset to disk if needed, so that you can load it later for future analysis.

In [30]:
news_df.to_csv('news.csv', index=False, encoding='utf-8')