# Exercise 2b: Topic modeling on a custom PDF document
It is time to put your skills to the test! We want to load a PDF document with Python, split it into individual pages, and perform topic modeling on those pages.



## Libraries

In [1]:
%%capture
!pip install pypdf2
!pip install bertopic datasets accelerate bitsandbytes xformers adjustText

In [2]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import requests
from io import BytesIO
from PyPDF2 import PdfReader

The following cells download a PDF from a link, parse it into pages, and build a dataset. Alternatively, you can upload a PDF and specify its path directly.

In [3]:
def get_pdf_as_memory_stream(url):
    response = requests.get(url)
    response.raise_for_status()  # Raise an error for HTTP errors

    # Convert the response content into a BytesIO stream
    return BytesIO(response.content)
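
If the PDF is already on disk (for example, after uploading it to the notebook environment), a similar helper can read it into memory. The sketch below is an assumption rather than part of the original exercise: the function name `get_local_pdf_as_memory_stream` and the example path are hypothetical.

```python
from io import BytesIO
from pathlib import Path


def get_local_pdf_as_memory_stream(path):
    """Read a local file into a BytesIO stream (hypothetical helper)."""
    data = Path(path).read_bytes()  # raises FileNotFoundError if the path is wrong
    return BytesIO(data)


# Example usage (hypothetical path):
# local_pdf_file = get_local_pdf_as_memory_stream("my_document.pdf")
```

The returned stream can be passed to `PdfReader` in the same way as a downloaded one.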

For this part of the exercise, we can use the Specification for Structural Steel Buildings from the American Institute of Steel Construction (AISC), which is publicly available.

In [4]:
aisc_360_path = "https://www.aisc.org/globalassets/aisc/publications/standards/a360-16w-rev-june-2019.pdf"
aisc_pdf_file = get_pdf_as_memory_stream(aisc_360_path)

In [5]:
reader = PdfReader(aisc_pdf_file)
# number of pages
print(f"Number of pages: {len(reader.pages)}")

Number of pages: 680


Let's see what an extracted PDF page looks like:

In [6]:
page = reader.pages[400]
print(page.extract_text())

Comm. F10.] SINGLE ANGLES 16.1-343
Specification for Structural Steel Buildings, July 7, 2016
AMERICAN INSTITUTE OF STEEL CONSTRUCTIONstrength, Mn=1.5M y, will occur when the theoretical buckling moment, M cr, reaches
or exceeds 7.7M y. Myis the moment at first yield in Equations F10-2 and F10-3, the
same as the Myin Equation F10-1. These equations are modifications of those devel-
oped from the results of Australian research on single angles in flexure and on ananalytical model consisting of two rectangular elements of length equal to the actualangle leg width minus one-half the thickness (AISC, 1975; Leigh and Lay, 1978,1984; Madugula and Kennedy, 1985). When bending is applied about one leg of a laterally unrestrained single angle, theangle will deflect laterally as well as in the bending direction. Its behavior can beevaluated by resolving the load and/or moments into principal axis components anddetermining the sum of these principal axis flexural effects. Subsection (i) of Sectio

Now we will collect all extracted pages in a single list. In a real data science project, there are benefits to cleaning the text data and removing irrelevant sections. For example, one might remove the table of contents.

In [7]:
dataset = []
for i in tqdm(range(len(reader.pages))):
  page_i = reader.pages[i]
  dataset.append(page_i.extract_text())

100%|██████████| 680/680 [00:44<00:00, 15.31it/s]


In [8]:
dataset[251]

'16.1-194 IMPROVED DESIGN FOR PONDING [App. 2.2.\nSpecification for Structural Steel Buildings, July 7, 2016\nAMERICAN INSTITUTE OF STEEL CONSTRUCTION\nFig. A-2.1. Limiting flexibility coefficient for the primary systems.7 AISC_PART 16_A_Spec. L-App2-App8 (192-252)_15th Ed._2016  2016-11-14  3:53 PM  Page 194    (Black plate)'
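
Both sample pages shown so far repeat the same header and footer strings, and the footer is sometimes fused to the first word of the body text. A minimal cleaning sketch is given below; it is not part of the original exercise. The helper name `clean_page` and the `BOILERPLATE` list are assumptions based only on the two pages printed above, so they would need checking against the rest of the document.

```python
import re

# Boilerplate strings seen on the sample pages above; this list is an
# assumption and may be incomplete for the full document.
BOILERPLATE = [
    "Specification for Structural Steel Buildings, July 7, 2016",
    "AMERICAN INSTITUTE OF STEEL CONSTRUCTION",
]


def clean_page(text):
    """Remove repeated header/footer strings and tidy whitespace (hypothetical helper)."""
    for phrase in BOILERPLATE:
        # Replace with a space: the footer is sometimes fused to the next word.
        text = text.replace(phrase, " ")
    # Rejoin words hyphenated across a line break, e.g. "devel-\noped".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Comm. F10.] SINGLE ANGLES 16.1-343\n"
    "Specification for Structural Steel Buildings, July 7, 2016\n"
    "AMERICAN INSTITUTE OF STEEL CONSTRUCTIONstrength, Mn=1.5M y, devel-\n"
    "oped from research"
)
print(clean_page(sample))
```

Applied to the whole list, this would look like `dataset = [clean_page(p) for p in dataset]`. Note that the hyphen rule also merges genuine compounds that happen to break at a line end, so it is a trade-off rather than a safe default.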

## Students are expected to complete the rest of this exercise based on the previous example

In [9]:
# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(dataset, show_progress_bar=True)

Batches:   0%|          | 0/22 [00:00<?, ?it/s]
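
`encode` returns one vector per page, so pages can be compared with each other directly. The sketch below, which is not part of the original exercise, ranks pages by cosine similarity to a chosen page using NumPy. The toy array stands in for the real `embeddings`, and `most_similar_pages` is a hypothetical helper.

```python
import numpy as np


def most_similar_pages(embeddings, query_index, top_k=3):
    """Return indices of the top_k rows most similar to row query_index."""
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query page itself
    return np.argsort(scores)[::-1][:top_k].tolist()


# Toy stand-in for the real page embeddings (4 "pages", 3 dimensions).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
print(most_similar_pages(toy, query_index=0, top_k=2))
```

On the real data the call would be `most_similar_pages(embeddings, query_index=400)`. The same ranking could also be obtained from the `cosine_similarity` function imported earlier.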

In [11]:
topic_model = BERTopic().fit(dataset, embeddings)

In [12]:
topic_model.get_topic_info()

|  | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 117 | -1_the_of_for_in | [the, of, for, in, and, to, is, strength, with... | [CHAPTER H\nDESIGN OF MEMBERS FOR COMBINED \nF... |
| 1 | 0 | 66 | 0_the_steel_composite_concrete | [the, steel, composite, concrete, of, to, stre... | [Sect. I8.] STEEL ANCHORS 16.1-109\nSpecificat... |
| 2 | 1 | 47 | 1_analysis_the_of_in | [analysis, the, of, in, to, and, secondorder, ... | [Comm. 1.3.] DESIGN BY INELASTIC ANALYSIS 16.1... |
| 3 | 2 | 45 | 2_buckling_the_for_of | [buckling, the, for, of, in, angles, is, and, ... | [Comm. F10.] SINGLE ANGLES 16.1-341\nSpecifica... |
| 4 | 3 | 42 | 3_load_of_the_and | [load, of, the, and, structural, design, speci... | [Comm. B3.] DESIGN BASIS 16.1-269\nSpecificati... |
| 5 | 4 | 41 | 4_branch_the_hss_of | [branch, the, hss, of, chord, for, in, connect... | [16.1-152 CONCENTRATED FORCES ON HSS [Sect. K2... |
| 6 | 5 | 39 | 5_inspection_and_for_welding | [inspection, and, for, welding, the, of, stand... | [Comm. N5.] MINIMUM REQUIREMENTS FOR INSPECTIO... |
| 7 | 6 | 34 | 6_fire_temperature_the_structural | [fire, temperature, the, structural, and, of, ... | [16.1-222\nSpecification for Structural Steel ... |
| 8 | 7 | 33 | 7_bracing_the_brace_stiffness | [bracing, the, brace, stiffness, to, of, point... | [16.1-238 GENERAL PROVISIONS [App. 6.1.\nSpeci... |
| 9 | 8 | 32 | 8_weld_the_welds_to | [weld, the, welds, to, of, metal, in, fillet, ... | [Sect. J1.] GENERAL PROVISIONS 16.1-115\nSpeci... |
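
Several topic names above are dominated by stopwords such as "the", "of" and "for". One common remedy, sketched here as an assumption rather than as part of the original exercise, is to give BERTopic a scikit-learn `CountVectorizer` that drops English stopwords. The toy sentence only illustrates what the vectorizer filters out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer that ignores common English stopwords when counting words.
vectorizer_model = CountVectorizer(stop_words="english")

# Quick check on a toy sentence: stopwords are removed, content words remain.
analyzer = vectorizer_model.build_analyzer()
print(analyzer("The strength of the steel and the concrete"))

# With a fitted model, the topic words can then be recomputed without
# re-clustering (commented out because it needs the fitted topic_model):
# topic_model.update_topics(dataset, vectorizer_model=vectorizer_model)
# topic_model.get_topic_info()
```

The vectorizer can also be passed at construction time, as in `BERTopic(vectorizer_model=vectorizer_model)`. Because it only affects the topic representation step, the page embeddings and the clustering stay the same.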


In [26]:
topic_model.get_document_info(dataset)

|  | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|---|---|
| 0 | ANSI/AISC 360-16\nAn American National Standar... | 3 | 3_load_of_the_and | [load, of, the, and, structural, design, speci... | [Comm. B3.] DESIGN BASIS 16.1-269\nSpecificati... | load - of - the - and - structural - design - ... | 1.000000 | False |
| 1 |  | -1 | -1_the_of_for_in | [the, of, for, in, and, to, is, strength, with... | [CHAPTER H\nDESIGN OF MEMBERS FOR COMBINED \nF... | the - of - for - in - and - to - is - strength... | 0.000000 | False |
| 2 | ANSI/AISC 360-16\nAn American National Standar... | 3 | 3_load_of_the_and | [load, of, the, and, structural, design, speci... | [Comm. B3.] DESIGN BASIS 16.1-269\nSpecificati... | load - of - the - and - structural - design - ... | 1.000000 | False |
| 3 | ... | 3 | 3_load_of_the_and | [load, of, the, and, structural, design, speci... | [Comm. B3.] DESIGN BASIS 16.1-269\nSpecificati... | load - of - the - and - structural - design - ... | 1.000000 | False |
| 4 | 16.1-iii\nSpecification for Structural Steel B... | 3 | 3_load_of_the_and | [load, of, the, and, structural, design, speci... | [Comm. B3.] DESIGN BASIS 16.1-269\nSpecificati... | load - of - the - and - structural - design - ... | 1.000000 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 675 | 16.1-618 REFERENCES\nSpecification for Structu... | 1 | 1_analysis_the_of_in | [analysis, the, of, in, to, and, secondorder, ... | [Comm. 1.3.] DESIGN BY INELASTIC ANALYSIS 16.1... | analysis - the - of - in - to - and - secondor... | 0.467821 | False |
| 676 | METRIC CONVERSION FACTORS 16.1-619\nSpecificat... | 6 | 6_fire_temperature_the_structural | [fire, temperature, the, structural, and, of, ... | [16.1-222\nSpecification for Structural Steel ... | fire - temperature - the - structural - and - ... | 0.502825 | False |
| 677 | 16.1-620\nSpecification for Structural Steel B... | 3 | 3_load_of_the_and | [load, of, the, and, structural, design, speci... | [Comm. B3.] DESIGN BASIS 16.1-269\nSpecificati... | load - of - the - and - structural - design - ... | 1.000000 | False |
| 678 |  | -1 | -1_the_of_for_in | [the, of, for, in, and, to, is, strength, with... | [CHAPTER H\nDESIGN OF MEMBERS FOR COMBINED \nF... | the - of - for - in - and - to - is - strength... | 0.000000 | False |
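
Rows 1 and 678 above have an empty `Document` and land in the outlier topic -1. One option, shown here as a sketch under the assumption that blank pages carry no useful signal, is to drop them before embedding while remembering the original page indices. `drop_empty_pages` is a hypothetical helper and the toy list stands in for the real `dataset`.

```python
def drop_empty_pages(pages):
    """Return (kept_indices, kept_texts) for pages with non-whitespace text."""
    kept = [(i, text) for i, text in enumerate(pages) if text and text.strip()]
    indices = [i for i, _ in kept]
    texts = [text for _, text in kept]
    return indices, texts


# Toy stand-in for the real list of extracted pages.
toy_pages = ["Title page", "", "   \n", "Chapter A"]
indices, texts = drop_empty_pages(toy_pages)
print(indices)  # [0, 3]
print(texts)    # ['Title page', 'Chapter A']
```

Keeping `indices` makes it possible to map each remaining document back to its page in the PDF. If the embeddings were already computed, the matching rows can be selected with `embeddings[indices]`.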


In [23]:
topic_model.get_topic(5)

[('inspection', 0.03962346679380569),
 ('and', 0.038534683108101174),
 ('for', 0.03377247337270443),
 ('welding', 0.028827702483794498),
 ('the', 0.02859114972818962),
 ('of', 0.02789951465582364),
 ('standard', 0.02727157317595078),
 ('quality', 0.0261745300659607),
 ('be', 0.025560014179039536),
 ('steel', 0.024060275322371764)]

In [20]:
topic_model.visualize_barchart(top_n_topics=50, n_words=5)

In [15]:
fig = topic_model.visualize_topics()
fig.write_html("topics_LLM.html")
fig

In [16]:
topic_model.visualize_heatmap(height=1000,width=1000)

In [18]:
# Maarten (BERTopic's author) suggests pre-computing a 2D reduction of the embeddings.
# This step is optional, but it makes repeated visualization calls much faster:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
fig = topic_model.visualize_documents(dataset, reduced_embeddings=reduced_embeddings, height=1200, width=1800)
fig.write_html("document_view.html")
fig