<img src="http://certificate.tpq.io/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# AI in Finance

**Workshop at Texas State University (October 2023)**

**_Natural Language Processing_**

Dr. Yves J. Hilpisch | The Python Quants GmbH | http://tpq.io

## Basic Imports

In [None]:
import nlp
import nltk
import requests
import numpy as np
import pandas as pd
from collections import Counter
from pylab import plt
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

## Retrieving HTML

In [None]:
with open('../sources/urls.txt', 'r') as f:
    # one URL per line; strip whitespace and skip blank lines
    urls = [line.strip() for line in f if line.strip()]

In [None]:
urls

In [None]:
html = requests.get(urls[0]).text

In [None]:
html[18000:21000]

In [None]:
len(html)

## Processing Text

In [None]:
raw = nlp.clean_up_html(html)

In [None]:
len(raw)

In [None]:
raw[1000:2000]
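
The helper `nlp.clean_up_html` comes from a local module that is not shown in this notebook. As a rough idea of what such a step might involve, the sketch below strips tags and script/style content with the standard library only. The names `_TextExtractor` and `html_to_text` are hypothetical and are not the workshop's implementation.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore <script> and <style> content."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Hypothetical stand-in for nlp.clean_up_html (assumption, not the original)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # join the text fragments and collapse runs of whitespace
    return re.sub(r'\s+', ' ', ' '.join(parser.parts)).strip()


sample_html = ('<html><head><style>p {color: red}</style></head>'
               '<body><p>Rates &amp; <b>bonds</b></p>'
               '<script>var x = 1;</script></body></html>')
print(html_to_text(sample_html))
```

Joining fragments with a space keeps words from adjacent tags apart, at the cost of occasionally splitting a word that is only partly wrapped in markup.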

## Tokens

In [None]:
tokens = nlp.tokenize(raw.lower())

In [None]:
# keep tokens with 5 to 19 characters
tokens = [t for t in tokens if 4 < len(t) < 20]

In [None]:
freq = Counter(tokens)

In [None]:
# freq

In [None]:
mc = freq.most_common(10)
mc
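
`nlp.tokenize` is also part of the local helper module. A minimal stand-in, assuming tokens are simply runs of letters, shows how the length filter and `Counter` interact on a small sample. `simple_tokenize` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Hypothetical stand-in for nlp.tokenize: lowercase runs of letters."""
    return re.findall(r'[a-z]+', text.lower())


sample_text = ('Inflation fears return: markets price in higher rates, '
               'and inflation data surprise.')

# same length filter as in the notebook: 5 to 19 characters
sample_tokens = [t for t in simple_tokenize(sample_text) if 4 < len(t) < 20]
sample_freq = Counter(sample_tokens)
print(sample_freq.most_common(3))
```

Short words such as "in", "and" and "data" are dropped by the length filter before counting, which is a crude substitute for a stop word list.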

In [None]:
mc = pd.DataFrame(mc, columns=['word', 'freq'])

In [None]:
mc.set_index('word').plot(kind='bar');

In [None]:
tokens = set(tokens)

In [None]:
# tokens

In [None]:
tokens = sorted(tokens)

In [None]:
tokens[100:120]

## WordCloud

In [None]:
# !pip install wordcloud

In [None]:
nlp.generate_key_words(raw, 10)

In [None]:
nlp.generate_key_words(' '.join(tokens), 10)

In [None]:
nlp.generate_word_cloud(raw, 25)

## Topic Modelling

### Press Releases

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
%%time
news = list()
for url in urls:
    t = requests.get(url).text
    c = nlp.clean_up_html(t)
    news.append(c)

In [None]:
# news

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(news)
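
`CountVectorizer` turns the list of documents into a document-term matrix: one row per document, one column per vocabulary word, each entry a word count. The pure-Python sketch below illustrates that idea only. It is not scikit-learn's implementation and skips stop word removal and the library's token pattern. `bag_of_words` and the two sample documents are hypothetical.

```python
from collections import Counter


def bag_of_words(docs):
    """Build a sorted vocabulary and a document-term count matrix."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # vocabulary: sorted unique words across all documents
    vocab = sorted(set().union(*counts))
    # one row per document, one column per vocabulary word
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix


docs = ['rates rise as inflation rises', 'inflation cools and rates fall']
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```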

In [None]:
lda = LatentDirichletAllocation(n_components=4, random_state=100)

In [None]:
%%time
# Train the LDA model; each row holds one document's topic weights
doc_topics = lda.fit_transform(X)

In [None]:
# Extract the topics and their corresponding words
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:")
    print(" ".join([feature_names[i] for i in topic.argsort()[:-7:-1]]))
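
The slice `[:-7:-1]` is easy to misread. `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end and stops before position -7, so it yields the six highest-weighted indices, largest first. The sketch below reproduces the same pattern with plain lists. `top_terms` and the sample weights are hypothetical.

```python
def top_terms(weights, names, n):
    """Return the n names with the largest weights, largest first."""
    # ascending order of weight, analogous to numpy's argsort
    order = sorted(range(len(weights)), key=weights.__getitem__)
    # walk back from the end and stop before position -(n + 1)
    return [names[i] for i in order[:-(n + 1):-1]]


weights = [0.1, 2.5, 0.7, 1.9]
names = ['bond', 'rate', 'yield', 'bank']
print(top_terms(weights, names, 2))  # ['rate', 'bank']
```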

### Classic Book

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [None]:
# Load the raw text of the book
with open('../sources/walden.txt', 'r') as f:
    text = f.read()

In [None]:
# print(text)

In [None]:
text_ = text.split()

In [None]:
text_ = [w.lower() for w in text_ if len(w) > 4]

In [None]:
text_ = ' '.join(text_)

In [None]:
# text_

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform([text_])

In [None]:
# Create an NMF topic model with multiple topics
nmf = NMF(n_components=7, random_state=100)

In [None]:
# Train the NMF model on the text data
nmf.fit_transform(X)

In [None]:
# Extract the topics and their corresponding words
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    print(f'Topic {topic_idx}:')
    print(' '.join([feature_names[i] for i in topic.argsort()[:-8:-1]]))
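
Note that `fit_transform([text_])` passes a single document, so the TF-IDF matrix has only one row and the factorization has little variation to separate into distinct topics. One possible alternative, offered as an assumption rather than the workshop's method, is to split the book into fixed-size word chunks and treat each chunk as its own document. `chunk_words` and the chunk size are hypothetical choices.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


sample = 'economy shelter clothing reading sounds solitude visitors'
chunks = chunk_words(sample, 3)
print(chunks)
```

The resulting list could be passed to `vectorizer.fit_transform(chunks)` in place of `[text_]`, giving the model one row per chunk.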

### PDF File

In [None]:
# !pip install pdfminer.six

In [1]:
from pdfminer.high_level import extract_text

In [11]:
path = 'https://certificate.tpq.io/'

In [12]:
fn = 'b_schools_and_ai.pdf'

In [14]:
!echo {path + fn}

In [4]:
# Extract the text from the PDF file
text = extract_text('b_schools_and_ai.pdf')

In [10]:
!rm $fn

In [5]:
# Print the extracted text
print(text[103:1000])


What you need to know: The Future of B-Schools

Businessweek + Work Shift
Business Schools Grapple With How To Teach Artificial Intelligence

MBA programs want to better prepare students for a world undergoing rapid transformation fueled by generative AI.

Photographer: Inga Kjer/Photothek

By Jo Constantz

October 19, 2023 at 4:30 PM GMT+2

How are B-schools preparing the next generation of leaders to most eﬀectively use and apply artiﬁcial intelligence in business?

New and emerging technolo y, including machine learning, have become a bigger part of business school course lists over the past
decade. Data analytics, for instance, is among the most popular specialized B-school degrees today.

Thanks in part to the release of ChatGPT last November and the ensuing alarm over what the artiﬁcial intelligence-powered chatbot

could do, attention on generative AI — technolo y that allows 
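
Text extracted from PDFs can contain typographic ligatures, single code points such as U+FB01 for "fi" or U+FB00 for "ff", which make tokens like "artificial" fail to match their plain spelling. A minimal sketch, assuming Unicode NFKC normalization is acceptable for the text at hand, expands them with the standard library. `normalize_ligatures` is a hypothetical helper name.

```python
import unicodedata


def normalize_ligatures(text):
    """Expand compatibility characters such as the fi and ff ligatures."""
    return unicodedata.normalize('NFKC', text)


sample = 'e\ufb00ectively use arti\ufb01cial intelligence'
print(normalize_ligatures(sample))  # effectively use artificial intelligence
```

NFKC also rewrites other compatibility characters (for example non-breaking spaces and superscript digits), and it cannot restore characters the extractor dropped entirely.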


In [None]:
text_ = nltk.word_tokenize(text, preserve_line=True)

In [None]:
text_ = [w.lower() for w in text_ if 4 < len(w) < 15]

In [None]:
text_ = ' '.join(text_)

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform([text_])

In [None]:
# Create an NMF topic model with multiple topics
nmf = NMF(n_components=7, random_state=100)

In [None]:
# Train the NMF model on the text data
nmf.fit_transform(X)

In [None]:
# Extract the topics and their corresponding words
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    print(f'Topic {topic_idx}:')
    print(' '.join([feature_names[i] for i in topic.argsort()[:-6:-1]]))

## Sentiment Analysis

In [None]:
# nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
# Create a SentimentIntensityAnalyzer object
analyzer = SentimentIntensityAnalyzer()

In [None]:
# Analyze the sentiment of each press release
scores = list()
for article in news:
    scores.append(analyzer.polarity_scores(article))

In [None]:
scores
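
Each entry returned by `polarity_scores` is a dictionary with the keys `neg`, `neu`, `pos` and `compound`. A common convention, suggested by the VADER authors, treats a `compound` value of at least 0.05 as positive and at most -0.05 as negative. The sketch below applies that rule. `label_sentiment` is a hypothetical helper and the example dictionaries are made-up illustrations, not results from the press releases.

```python
def label_sentiment(score, threshold=0.05):
    """Map a VADER score dictionary to a coarse label via its compound value."""
    compound = score['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# illustrative values only (assumption), shaped like polarity_scores output
example_scores = [
    {'neg': 0.0, 'neu': 0.8, 'pos': 0.2, 'compound': 0.6},
    {'neg': 0.3, 'neu': 0.7, 'pos': 0.0, 'compound': -0.4},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
labels = [label_sentiment(s) for s in example_scores]
print(labels)  # ['positive', 'negative', 'neutral']
```

With the notebook's own data, the same function could be applied as `[label_sentiment(s) for s in scores]`.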

<img src='http://hilpisch.com/tpq_logo.png' width="35%" align="right">

<br><br><a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">ai@tpq.io</a>