# Parse and Preprocess HTML Pages

This notebook demonstrates preprocessing HTML sources for an NLP engine.

## Parse and Load HTML Pages

1. Get HTML from specific pages of a documentation site.
    1. The pages are configured in `config.json`.
1. Extract the HTML content using `BeautifulSoup4`.
    1. The parser looks for a `<div class="content--main">` in the HTML. This works for the configured page but may not work as well for other pages.
1. Get the text of all paragraphs (`<p>`).
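
The loading cell reads `config[topic]` as a list of URLs, so `config.json` presumably maps each topic name to a list of page URLs. The sketch below shows that assumed shape. Only the `dynatrace` URL appears in this notebook's output; the `example-topic` entry and its URLs are hypothetical placeholders.

```python
import json

# Assumed shape of config.json: topic name -> list of page URLs.
# The 'example-topic' entry is a hypothetical placeholder, not part of
# the original configuration.
config_text = """
{
  "dynatrace": [
    "https://www.dynatrace.com/support/help/get-started/what-is-dynatrace/"
  ],
  "example-topic": [
    "https://example.com/docs/page-1",
    "https://example.com/docs/page-2"
  ]
}
"""

config = json.loads(config_text)
for topic, pages in config.items():
    # Each topic yields the list of pages the loader would iterate over
    print(topic, len(pages))
```

With this layout, adding a new documentation source only needs a new key and a list of URLs, without touching the parsing code.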

## References 

1. [`get_text` with CSS class](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text)

In [1]:
from bs4 import BeautifulSoup
import json
import requests
import pandas as pd
pd.set_option('display.max_colwidth', 100)

with open('config.json', 'r') as f:
    config = json.load(f)

topic = 'dynatrace'
all_paras = []  # collect paragraphs from every page so later pages do not overwrite earlier ones
for page in config[topic]:
    print(page)
    r = requests.get(page, timeout=30)
    soup = BeautifulSoup(r.text, 'lxml')
    # The class name is specific to the configured site
    content_main = soup.find_all('div', class_='content--main')
    if len(content_main) == 0:
        print(f'-- Fetching Content failed for {page}')
        continue
    print(f'{len(content_main)} documents fetched')

    paras = content_main[0].find_all('p')
    pretty_paras = [para.get_text() for para in paras]
    print(f'loading {len(pretty_paras)} paragraphs into dataframe')
    all_paras.extend(pretty_paras)

df = pd.DataFrame({'raw_data': all_paras})

https://www.dynatrace.com/support/help/get-started/what-is-dynatrace/
1 documents fetched
loading 26 paragraphs into dataframe


## Load a DataFrame

This is how the parsed data looks. 

In [2]:
print(f'-- dataframe is size {len(df)}')
print(f'First element:\n{df.iloc[0].raw_data}')

-- dataframe is size 26
First element:
Dynatrace is a software-intelligence monitoring platform that simplifies enterprise cloud complexity and accelerates digital transformation. With Davis (the Dynatrace AI causation engine) and complete automation, the Dynatrace all-in-one platform provides answers, not just data, about the performance of your applications, their underlying infrastructure, and the experience of your end users. Dynatrace is used to modernize and automate enterprise cloud operations, release higher-quality software faster, and deliver optimum digital experiences to your organization's customers.


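## Remove Punctuation

1. Removing punctuation is best done as the first step, before tokenizing. Note that dropping the characters outright also joins hyphenated words: `software-intelligence` becomes `softwareintelligence`.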

In [3]:
import string
def remove_punct(raw_text):
    return "".join([c for c in raw_text if c not in string.punctuation])

df['remove_punct_data'] = df.raw_data.apply(lambda x: remove_punct(x))
print(df.iloc[0])

raw_data             Dynatrace is a software-intelligence monitoring platform that simplifies enterprise cloud comple...
remove_punct_data    Dynatrace is a softwareintelligence monitoring platform that simplifies enterprise cloud complex...
Name: 0, dtype: object
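
The output above shows `software-intelligence` collapsing into `softwareintelligence`. If separate tokens are preferred, one option is to map punctuation to spaces instead of deleting it. The sketch below compares both approaches on a made-up sample string; `punct_to_space` is a hypothetical helper, not part of the notebook.

```python
import string

def remove_punct(raw_text):
    # Same approach as the notebook: drop every punctuation character
    return "".join(c for c in raw_text if c not in string.punctuation)

# Translation table that turns each punctuation character into a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punct_to_space(raw_text):
    # Hypothetical variant: keeps word boundaries around hyphens and commas
    return raw_text.translate(_PUNCT_TO_SPACE)

sample = "software-intelligence platform, end-users."  # made-up input
dropped = remove_punct(sample)
spaced_tokens = punct_to_space(sample).split()
print(dropped)
print(spaced_tokens)
```

Which variant is better depends on whether the downstream model should treat `software-intelligence` as one term or two.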


## Tokenize

1. Create a function that does a simple `str.split()`.
1. Apply it as a lambda on each row of the dataframe.
1. The output is a list of individual words.

In [4]:
def tokenize(raw_text):
    return raw_text.split()

df['tokenized_data'] = df.remove_punct_data.apply(lambda x: tokenize(x))
df.head()

Unnamed: 0,raw_data,remove_punct_data,tokenized_data
0,Dynatrace is a software-intelligence monitoring platform that simplifies enterprise cloud comple...,Dynatrace is a softwareintelligence monitoring platform that simplifies enterprise cloud complex...,"[Dynatrace, is, a, softwareintelligence, monitoring, platform, that, simplifies, enterprise, clo..."
1,"Dynatrace seamlessly brings infrastructure and cloud, application performance, and digital exper...",Dynatrace seamlessly brings infrastructure and cloud application performance and digital experie...,"[Dynatrace, seamlessly, brings, infrastructure, and, cloud, application, performance, and, digit..."
2,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,"[Dynatrace, provides, the, following, capabilities, for, monitoring, and, analyzing, the, perfor..."
3,"Real User Monitoring analyzes the performance of all user interactions with your applications, w...",Real User Monitoring analyzes the performance of all user interactions with your applications wh...,"[Real, User, Monitoring, analyzes, the, performance, of, all, user, interactions, with, your, ap..."
4,Dynatrace supports Real User Monitoring for mobile apps as well. The process of monitoring the u...,Dynatrace supports Real User Monitoring for mobile apps as well The process of monitoring the us...,"[Dynatrace, supports, Real, User, Monitoring, for, mobile, apps, as, well, The, process, of, mon..."
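
Calling `split()` with no arguments splits on any run of whitespace and drops empty strings, so stray double spaces, tabs, or newlines left in the text do not produce empty tokens. A small sketch with a made-up sentence, contrasted with splitting on a literal space:

```python
def tokenize(raw_text):
    # No-argument split: any run of spaces, tabs, or newlines is one separator
    return raw_text.split()

sample = "Real User  Monitoring\tfor mobile apps\n"  # made-up input
tokens = tokenize(sample)
print(tokens)

# Splitting on a literal space keeps empty strings and embedded whitespace
naive = sample.split(" ")
print(naive)
```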


## Remove Stop Words

> :bulb: List comprehensions and lambda functions are tremendously helpful in preprocessing!

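* This uses the stop word corpus available in NLTK.
* It needs to be downloaded before first use (one time only).
* The NLTK stop words are lowercase and the comparison here is case-sensitive, so capitalized tokens such as `The` are not removed.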

In [5]:
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

def remove_stop_words(raw_words):
    return [w for w in raw_words if w not in stop_words]

df['no_stopwords_data'] = df.tokenized_data.apply(lambda x: remove_stop_words(x))
df.head()

[nltk_data] Downloading package stopwords to /home/savis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,raw_data,remove_punct_data,tokenized_data,no_stopwords_data
0,Dynatrace is a software-intelligence monitoring platform that simplifies enterprise cloud comple...,Dynatrace is a softwareintelligence monitoring platform that simplifies enterprise cloud complex...,"[Dynatrace, is, a, softwareintelligence, monitoring, platform, that, simplifies, enterprise, clo...","[Dynatrace, softwareintelligence, monitoring, platform, simplifies, enterprise, cloud, complexit..."
1,"Dynatrace seamlessly brings infrastructure and cloud, application performance, and digital exper...",Dynatrace seamlessly brings infrastructure and cloud application performance and digital experie...,"[Dynatrace, seamlessly, brings, infrastructure, and, cloud, application, performance, and, digit...","[Dynatrace, seamlessly, brings, infrastructure, cloud, application, performance, digital, experi..."
2,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,"[Dynatrace, provides, the, following, capabilities, for, monitoring, and, analyzing, the, perfor...","[Dynatrace, provides, following, capabilities, monitoring, analyzing, performance, aspects, appl..."
3,"Real User Monitoring analyzes the performance of all user interactions with your applications, w...",Real User Monitoring analyzes the performance of all user interactions with your applications wh...,"[Real, User, Monitoring, analyzes, the, performance, of, all, user, interactions, with, your, ap...","[Real, User, Monitoring, analyzes, performance, user, interactions, applications, whether, inter..."
4,Dynatrace supports Real User Monitoring for mobile apps as well. The process of monitoring the u...,Dynatrace supports Real User Monitoring for mobile apps as well The process of monitoring the us...,"[Dynatrace, supports, Real, User, Monitoring, for, mobile, apps, as, well, The, process, of, mon...","[Dynatrace, supports, Real, User, Monitoring, mobile, apps, well, The, process, monitoring, user..."
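
Row 4 of the output still contains `The`, because the tokens are compared against the stop word list exactly as written. A minimal sketch of a case-insensitive filter follows. `STOP_WORDS` is a tiny hypothetical set used only so the example runs without downloads, and `remove_stop_words_ci` is an illustrative name, not a function from the notebook.

```python
# Hypothetical, tiny stop word set for illustration; the notebook uses
# nltk.corpus.stopwords.words('english'), which is much larger.
STOP_WORDS = {"the", "is", "a", "of", "for", "as", "and"}

def remove_stop_words_ci(raw_words, stop_words=STOP_WORDS):
    # Compare in lowercase but keep the original spelling of kept tokens
    return [w for w in raw_words if w.lower() not in stop_words]

tokens = ["Dynatrace", "supports", "Real", "User", "Monitoring", "for",
          "mobile", "apps", "as", "well", "The", "process"]
filtered = remove_stop_words_ci(tokens)
print(filtered)
```

Using a `set` rather than a list also makes each membership check constant time, which helps on larger corpora.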


## Stemming

* Stemming does not work very well on this corpus: many stems are not real words. For example, `Dynatrace` is stemmed to `dynatrac` :(

In [8]:
import nltk
ps = nltk.PorterStemmer()

def stem(raw_data):
    return [ps.stem(word) for word in raw_data]

df['stemmed_data'] = df.no_stopwords_data.apply(lambda x: stem(x))
df.head()

Unnamed: 0,raw_data,remove_punct_data,tokenized_data,no_stopwords_data,stemmed_data
0,Dynatrace is a software-intelligence monitoring platform that simplifies enterprise cloud comple...,Dynatrace is a softwareintelligence monitoring platform that simplifies enterprise cloud complex...,"[Dynatrace, is, a, softwareintelligence, monitoring, platform, that, simplifies, enterprise, clo...","[Dynatrace, softwareintelligence, monitoring, platform, simplifies, enterprise, cloud, complexit...","[dynatrac, softwareintellig, monitor, platform, simplifi, enterpris, cloud, complex, acceler, di..."
1,"Dynatrace seamlessly brings infrastructure and cloud, application performance, and digital exper...",Dynatrace seamlessly brings infrastructure and cloud application performance and digital experie...,"[Dynatrace, seamlessly, brings, infrastructure, and, cloud, application, performance, and, digit...","[Dynatrace, seamlessly, brings, infrastructure, cloud, application, performance, digital, experi...","[dynatrac, seamlessli, bring, infrastructur, cloud, applic, perform, digit, experi, monitor, all..."
2,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,"[Dynatrace, provides, the, following, capabilities, for, monitoring, and, analyzing, the, perfor...","[Dynatrace, provides, following, capabilities, monitoring, analyzing, performance, aspects, appl...","[dynatrac, provid, follow, capabl, monitor, analyz, perform, aspect, applic, environ]"
3,"Real User Monitoring analyzes the performance of all user interactions with your applications, w...",Real User Monitoring analyzes the performance of all user interactions with your applications wh...,"[Real, User, Monitoring, analyzes, the, performance, of, all, user, interactions, with, your, ap...","[Real, User, Monitoring, analyzes, performance, user, interactions, applications, whether, inter...","[real, user, monitor, analyz, perform, user, interact, applic, whether, interact, take, place, b..."
4,Dynatrace supports Real User Monitoring for mobile apps as well. The process of monitoring the u...,Dynatrace supports Real User Monitoring for mobile apps as well The process of monitoring the us...,"[Dynatrace, supports, Real, User, Monitoring, for, mobile, apps, as, well, The, process, of, mon...","[Dynatrace, supports, Real, User, Monitoring, mobile, apps, well, The, process, monitoring, user...","[dynatrac, support, real, user, monitor, mobil, app, well, the, process, monitor, user, experi, ..."
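
Because stems such as `dynatrac` are hard to read, one option (not something this notebook does) is to keep a reverse index from each stem to the original words that produced it. The sketch below assumes a generic `stem_fn` callable; `toy_stem` is a hypothetical stand-in so the example runs without NLTK, and it is not the Porter algorithm.

```python
from collections import defaultdict

def build_stem_index(words, stem_fn):
    """Map each stem to the distinct original words that produced it."""
    index = defaultdict(list)
    for word in words:
        stem = stem_fn(word)
        if word not in index[stem]:
            index[stem].append(word)
    return dict(index)

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer, NOT the Porter algorithm:
    # lowercase, then strip one common suffix if enough of the word remains.
    w = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

words = ["monitoring", "monitors", "monitor", "mobile", "monitoring"]
index = build_stem_index(words, toy_stem)
print(index)
```

Passing `nltk.PorterStemmer().stem` as `stem_fn` should build the same kind of index for the real stems.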


## Lemmatizing vs Stemming  

* For the corpus of words from the Dynatrace documentation, stemming truncated even simple words such as `mobile` (to `mobil`).
* Lemmatizing, however, returns dictionary words: it keeps `mobile` unchanged and reduces plurals such as `capabilities` to `capability`.


In [15]:
import nltk
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer() 

words = ['Dynatrace','enterprise','mobile']
print([wn.lemmatize(word) for word in words])
print([ps.stem(word) for word in words])

['Dynatrace', 'enterprise', 'mobile']
['dynatrac', 'enterpris', 'mobil']


[nltk_data] Downloading package wordnet to /home/savis/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Lemmatizing the data


In [17]:
import nltk
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

def lemmatize(word_list):
    return [wn.lemmatize(word) for word in word_list]

df['lemmatized_data'] = df.no_stopwords_data.apply(lambda x : lemmatize(x))
df.head()

[nltk_data] Downloading package wordnet to /home/savis/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,raw_data,remove_punct_data,tokenized_data,no_stopwords_data,stemmed_data,lemmatized_data
0,Dynatrace is a software-intelligence monitoring platform that simplifies enterprise cloud comple...,Dynatrace is a softwareintelligence monitoring platform that simplifies enterprise cloud complex...,"[Dynatrace, is, a, softwareintelligence, monitoring, platform, that, simplifies, enterprise, clo...","[Dynatrace, softwareintelligence, monitoring, platform, simplifies, enterprise, cloud, complexit...","[dynatrac, softwareintellig, monitor, platform, simplifi, enterpris, cloud, complex, acceler, di...","[Dynatrace, softwareintelligence, monitoring, platform, simplifies, enterprise, cloud, complexit..."
1,"Dynatrace seamlessly brings infrastructure and cloud, application performance, and digital exper...",Dynatrace seamlessly brings infrastructure and cloud application performance and digital experie...,"[Dynatrace, seamlessly, brings, infrastructure, and, cloud, application, performance, and, digit...","[Dynatrace, seamlessly, brings, infrastructure, cloud, application, performance, digital, experi...","[dynatrac, seamlessli, bring, infrastructur, cloud, applic, perform, digit, experi, monitor, all...","[Dynatrace, seamlessly, brings, infrastructure, cloud, application, performance, digital, experi..."
2,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,Dynatrace provides the following capabilities for monitoring and analyzing the performance of al...,"[Dynatrace, provides, the, following, capabilities, for, monitoring, and, analyzing, the, perfor...","[Dynatrace, provides, following, capabilities, monitoring, analyzing, performance, aspects, appl...","[dynatrac, provid, follow, capabl, monitor, analyz, perform, aspect, applic, environ]","[Dynatrace, provides, following, capability, monitoring, analyzing, performance, aspect, applica..."
3,"Real User Monitoring analyzes the performance of all user interactions with your applications, w...",Real User Monitoring analyzes the performance of all user interactions with your applications wh...,"[Real, User, Monitoring, analyzes, the, performance, of, all, user, interactions, with, your, ap...","[Real, User, Monitoring, analyzes, performance, user, interactions, applications, whether, inter...","[real, user, monitor, analyz, perform, user, interact, applic, whether, interact, take, place, b...","[Real, User, Monitoring, analyzes, performance, user, interaction, application, whether, interac..."
4,Dynatrace supports Real User Monitoring for mobile apps as well. The process of monitoring the u...,Dynatrace supports Real User Monitoring for mobile apps as well The process of monitoring the us...,"[Dynatrace, supports, Real, User, Monitoring, for, mobile, apps, as, well, The, process, of, mon...","[Dynatrace, supports, Real, User, Monitoring, mobile, apps, well, The, process, monitoring, user...","[dynatrac, support, real, user, monitor, mobil, app, well, the, process, monitor, user, experi, ...","[Dynatrace, support, Real, User, Monitoring, mobile, apps, well, The, process, monitoring, user,..."
