<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**`nltk` & `lxml` Packages**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Imports

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
import nltk

In [None]:
# nltk.download('popular')  # uncomment on first run to fetch the popular NLTK resources

### HTML Documents

We fetch a few sample HTML documents, for example from https://www.apple.com/newsroom/.

In [None]:
import requests

In [None]:
url_01 = 'https://www.apple.com/newsroom/2024/02/apple-announces'
url_01 += '-more-than-600-new-apps-built-for-apple-vision-pro/'

In [None]:
html_01 = requests.get(url_01).text

In [None]:
html_01[:250]

In [None]:
html_01[1950:2250]

In [None]:
len(html_01)

In [None]:
url_02 = 'https://www.apple.com/newsroom/2024/02/apple-vision-pro-arrives'
url_02 += '-in-apple-store-locations-across-the-us/'

In [None]:
html_02 = requests.get(url_02).text

In [None]:
len(html_02)

In [None]:
html_02[1950:2250]

In [None]:
url_03 = 'https://finance.yahoo.com/'

In [None]:
html_03 = requests.get(url_03).text

In [None]:
len(html_03)

In [None]:
html_03[6000:6250]

### Cleaning Up HTML

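We use the `lxml` package to generate plain text from HTML. Note that in recent `lxml` releases the `lxml.html.clean` module is distributed as the separate `lxml_html_clean` package, so it may need to be installed in addition.

    conda install lxml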

In [None]:
from lxml.html import fromstring
from lxml.html.clean import clean_html
from lxml.html.clean import Cleaner

In [None]:
chtml = clean_html(fromstring(html_01))

In [None]:
chtml_01 = chtml.text_content()
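
`text_content()` returns the text of an element and all its children with the markup removed. To illustrate the idea without third-party packages, the following sketch uses Python's standard `html.parser` module instead of `lxml`. The class `TextExtractor`, the function `html_to_text`, and the sample string are hypothetical and not part of this notebook; `lxml` handles many more edge cases.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes while skipping <script> and <style> content."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the concatenated text nodes of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# hypothetical sample document
sample = ('<html><head><style>p {color: red}</style></head>\n'
          '<body>\n<h1>Title</h1>\n<p>Some <b>bold</b> text.</p>\n'
          '<script>var x = 1;</script>\n</body></html>')
plain = html_to_text(sample)
print(repr(plain))
```

The result still contains the newlines of the original markup, which is why the following cells strip whitespace characters.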

In [None]:
chtml_01 = chtml_01.replace('\t', '')

In [None]:
chtml_01 = chtml_01.replace('\n', '')

In [None]:
len(chtml_01)
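
Replacing `'\n'` and `'\t'` with the empty string can fuse words that were separated only by a line break. As an alternative, a minimal sketch that collapses every run of whitespace into a single space is shown below; the helper `normalize_whitespace` and the sample string are hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


# hypothetical sample text with line breaks and tabs
raw = 'Apple Vision Pro\n\t\tarrives\nin stores'

fused = raw.replace('\t', '').replace('\n', '')  # approach used in the cells above
spaced = normalize_whitespace(raw)

print(fused)
print(spaced)
```

Whether the fused or the spaced version is preferable depends on the markup; for sentiment scoring, keeping word boundaries intact is usually the safer choice.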

In [None]:
chtml = clean_html(fromstring(html_03))

In [None]:
chtml_03 = chtml.text_content()

In [None]:
len(chtml_03)

In [None]:
cleaner = Cleaner(links=True, style=True, allow_tags=[''])

In [None]:
chtml_01 = cleaner.clean_html(html_01)

In [None]:
chtml_01 = chtml_01.replace('\t', '')

In [None]:
chtml_01 = chtml_01.replace('\n', '')

In [None]:
len(chtml_01)

In [None]:
# chtml_01

In [None]:
chtml_03 = cleaner.clean_html(html_03)

In [None]:
len(chtml_03)

In [None]:
chtml_03[:1000]

### Sentiment Analysis

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
# nltk.download('vader_lexicon')  # uncomment on first run to fetch the VADER lexicon

In [None]:
sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores('I like NLP. And I love nltk for doing NLP.')

In [None]:
sia.polarity_scores('Today is a really bad day. It is raining like hell.')
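
The `compound` entry of the returned dictionary is a normalized score between -1 and 1. As far as we know, the VADER reference implementation maps the summed word valences `x` to `x / sqrt(x**2 + alpha)` with `alpha = 15`; the sketch below reproduces this mapping under that assumption and should be checked against the `nltk` source before relying on it. The function name `normalize` and the input values are illustrative.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded valence sum into the open interval (-1, 1).

    alpha = 15 is assumed here to match the VADER default.
    """
    return score / math.sqrt(score * score + alpha)


# hypothetical raw valence sums
for raw_score in (-4, 0, 1, 4):
    print(raw_score, round(normalize(raw_score), 4))
```

The mapping is monotonic, so larger valence sums always give larger compound scores, but the score saturates for long or strongly worded texts.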

In [None]:
ps = sia.polarity_scores('Today is February 09, 2024. This is an NLP class session.')
ps

In [None]:
ps['compound']
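
A compound score is often turned into a discrete label. The sketch below assumes the commonly cited VADER convention of ±0.05 as cut-offs; both the threshold and the helper name `label_from_compound` are assumptions, not part of `nltk`.

```python
def label_from_compound(compound, threshold=0.05):
    """Classify a compound score as positive, negative, or neutral.

    The default threshold of 0.05 is an assumed convention.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# hypothetical compound scores
labels = [label_from_compound(c) for c in (0.6, 0.0, -0.3)]
print(labels)
```

With the analyzer from above, the same helper could be applied as `label_from_compound(ps['compound'])`.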

### Case Study

The news stories used here were retrieved via the Eikon Data API (LSEG/Refinitiv Workspace).

In [None]:
url = "https://certificate.tpq.io/apple_news_06_to_10_02_2024.pkl"

In [None]:
# !wget $url

In [None]:
import pickle

In [None]:
url.split('/')[-1]

In [None]:
# open the downloaded file in a context manager so the handle is closed again
with open(url.split('/')[-1], 'rb') as f:
    news = pickle.load(f)

In [None]:
type(news)

In [None]:
len(news)

In [None]:
from IPython.display import HTML

In [None]:
# HTML(news[0])

In [None]:
cnews = [cleaner.clean_html(story) for story in news]

In [None]:
print(cnews[0][:400])

In [None]:
pol_scores = [sia.polarity_scores(story) for story in cnews]

In [None]:
pol_scores[:7]

In [None]:
com_scores = [ps['compound'] for ps in pol_scores]

In [None]:
com_scores[:7]
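
Before looking at individual stories, a few summary statistics of the compound scores can be helpful. The following sketch uses made-up scores in place of `com_scores`; the variable names and values are hypothetical.

```python
from statistics import mean

# hypothetical compound scores standing in for com_scores
sample_scores = [0.9, -0.4, 0.0, 0.5, -0.8]

summary = {
    'n': len(sample_scores),
    'mean': round(mean(sample_scores), 4),
    'min': min(sample_scores),
    'max': max(sample_scores),
    # share of stories with a strictly negative compound score
    'share_negative': sum(s < 0 for s in sample_scores) / len(sample_scores),
}
print(summary)
```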

In [None]:
news_comp = list(zip(cnews, com_scores))

In [None]:
news_comp[3:5]

In [None]:
neg_news = [nc for nc in news_comp if nc[1] < 0]

In [None]:
len(neg_news)

In [None]:
neg_news[4][1]

In [None]:
print(neg_news[4][0])
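
Instead of picking a negative story by position, the (story, score) pairs can be sorted by compound score to find the most negative ones first. A minimal sketch with hypothetical data in place of `news_comp`:

```python
# hypothetical (story, compound score) pairs standing in for news_comp
sample_news_comp = [
    ('Story A', 0.42),
    ('Story B', -0.91),
    ('Story C', -0.13),
    ('Story D', 0.0),
]

# sort ascending by score so the most negative stories come first
ranked = sorted(sample_news_comp, key=lambda nc: nc[1])
most_negative = ranked[:2]

for story, score in most_negative:
    print(f'{score:+.2f}  {story}')
```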
