In [6]:
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize

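# Step 1: Preprocess the Text
# Note: sent_tokenize/word_tokenize need the NLTK Punkt data
# (nltk.download('punkt') or 'punkt_tab', depending on the NLTK version).
def preprocess_text(text):
    """Return the original sentences and their lowercased word tokens."""
    sentences = sent_tokenize(text)
    tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
    return sentences, tokenized_sentences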


In [7]:

# Load the source text from the PDF with PyMuPDF
import pymupdf

text = ""
doc = pymupdf.open("./app/first_chapter.pdf")  # open the document
for page in doc:  # iterate over the document pages
    text += page.get_text()  # extract the page's plain text


In [8]:

# Step 2: Load or Train Word2Vec Model
# You can use a pre-trained Word2Vec model or train your own on a relevant corpus.
# Here a skip-gram model (sg=1) is trained on the document itself.
sentences, tokenized_sentences = preprocess_text(text)
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)


In [9]:

# Step 3: Compute Sentence Embeddings
def sentence_embedding(sentence, model):
    words = [word for word in sentence if word in model.wv]
    if words:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(model.vector_size)
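
The averaging in Step 3 can be illustrated without gensim. The sketch below is only an illustration under stated assumptions: `toy_vectors` is a made-up three-dimensional vocabulary standing in for `model.wv`, and `mean_pool` is a hypothetical helper, not part of the notebook. It shows that unknown tokens are skipped and that a sentence with no known tokens falls back to a zero vector.

```python
from statistics import fmean

# Hypothetical toy vocabulary standing in for model.wv (not from the notebook).
toy_vectors = {
    "data": [1.0, 0.0, 2.0],
    "breach": [3.0, 4.0, 0.0],
}

def mean_pool(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values dimension by dimension.
    return [fmean(column) for column in zip(*known)]

print(mean_pool(["data", "breach", "unseen"], toy_vectors, 3))  # [2.0, 2.0, 1.0]
print(mean_pool(["unseen"], toy_vectors, 3))                    # [0.0, 0.0, 0.0]
```

Because every word gets equal weight, frequent function words influence the sentence vector as much as content words do; that is a property of plain mean pooling, not something this sketch corrects.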


In [10]:

sentence_embeddings = [sentence_embedding(sentence, model) for sentence in tokenized_sentences]

# Step 4: Compute Document Embedding
document_embedding = np.mean(sentence_embeddings, axis=0)

# Step 5: Rank Sentences by Importance
# One vectorized call: compare every sentence embedding with the document embedding
similarity_scores = cosine_similarity(sentence_embeddings, [document_embedding]).ravel()
ranked_sentences = sorted(((score, idx) for idx, score in enumerate(similarity_scores)), reverse=True)

# Step 6: Select Top Sentences
# You can decide how many sentences you want in the summary
top_n = 30
top_sentence_indices = [idx for score, idx in ranked_sentences[:top_n]]
top_sentence_indices.sort()

# Step 7: Extract and Return the Summarized Text
def extract_summary(sentences, indices):
    return ' '.join([sentences[idx] for idx in indices])

# `sentences` was already computed in Step 1, so it is reused here
summary = extract_summary(sentences, top_sentence_indices)

print(summary)


As always, the aim of the DBIR is 
to shine a light on the various Actor types, the tactics they utilize and the targets they 
choose. Thanks to our talented, generous and civic-minded contributors from around 
the world who continue to stick with us and share their data and insight, and deep 
appreciation for our very own Verizon Threat Research Advisory Center (VTRAC) 
team (rock stars that they are). From the exploitation of well-known 
and far-reaching zero-day vulnerabilities, such as the one that affected MOVEit, to 
the much more mundane but still incredibly effective Ransomware and Denial of 
Service (DoS) attacks, criminals continue to do their utmost to prove the old adage 
“crime does not pay” wrong. Enterprise floats of all shapes and sizes 
cruising past a large crowd of threat actors who are shouting out gleefully “Throw 
me some creds!” Of course, human nature being what it is, all too often, the folks 
on the floats do just that. For example, the “first-time reader” section is now located in 
Appendix A rather than at the beginning of the report. But we do encourage those 
who are new to the DBIR to give it a read-through before diving into the report. Last, but certainly not least, we extend a most sincere thanks yet again to our 
contributors (without whom we could not do this) and to our readers (without whom 
there would be no point in doing it). Sincerely,
The Verizon DBIR Team 
C. David Hylender, Philippe Langlois, Alex Pinto, Suzanne Widup
Very special thanks to:
– Christopher Novak for his continued support and insight
– Dave Kennedy and Erika Gifford from VTRAC
– 
Kate Kutchko, Marziyeh Khanouki and Yoni Fridman from the Verizon Business 
Product Data Science Team
6
2024 DBIR Helpful guidance
Helpful guidance
About the 2024 DBIR incident dataset
Each year, the DBIR timeline for in-scope incidents is from November 1 of one 
calendar year through October 31 of the next calendar year. Thus, the incidents 
described in this report took place between November 1, 2022, and October 31, 
2023. The 2023 caseload is the primary analytical focus of the 2024 report, but 
the entire range of data is referenced throughout, notably in trending graphs. The 
time between the latter date and the date of publication for this report is spent in 
acquiring the data from our global contributors, anonymizing and aggregating that 
data, analyzing the dataset, and finally creating the graphics and writing the report. You are permitted to include statistics, figures and other information from the report, 
provided that (a) you cite the source as “Verizon 2024 Data Breach Investigations 
Report” and (b) the content is not modified in any way. If you would like to provide people a copy of the 
report, we ask that you provide them a link to verizon.com/dbir rather than the PDF. If your organization aggregates incident or security data and is interested 
in becoming a contributor to the annual Verizon DBIR (and we hope you 
are), the process is very easy and straightforward. Ransomware and Extortion breaches over time
Summary of findings
Our ways-in analysis witnessed a 
substantial growth of attacks involving 
the exploitation of vulnerabilities as the 
critical path to initiate a breach when 
compared to previous years. It almost 
tripled (180% increase) from last year, 
which will come as no surprise to 
anyone who has been following the 
effect of MOVEit and similar zero-day 
vulnerabilities. Pure Extortion 
attacks have risen over the past year 
and are now a component of 9% of 
all breaches. The shift of traditional 
ransomware actors toward these newer 
techniques resulted in a bit of a decline 
in Ransomware to 23%. However, when 
combined, given that they share threat 
actors, they represent a strong growth 
to 32% of breaches. 8
2024 DBIR Summary of findings
We have revised our calculation of the 
involvement of the human element to 
exclude malicious Privilege Misuse in 
an effort to provide a clearer metric of 
what security awareness can affect. For 
this year’s dataset, the human element 
was a component of 68% of breaches, 
roughly the same as the previous period 
described in the 2023 DBIR. We see this figure at 
15% this year, a 68% increase from the 
previous year, mostly fueled by the use 
of zero-day exploits for Ransomware 
and Extortion attacks. Phishing email report rate by click status
2024 DBIR Summary of findings
Financially motivated threat actors will 
typically stick to the attack techniques 
that will give them the most return  
on investment. Over the past three years, the 
combination of Ransomware and 
other Extortion breaches accounted 
for almost two-thirds (fluctuating 
between 59% and 66%) of those 
attacks. According to the FBI’s 
Internet Crime Complaint Center 
(IC3) ransomware complaint data, 
the median loss associated with the 
combination of Ransomware and 
other Extortion breaches has been 
$46,000, ranging between $3 (three 
dollars) and $1,141,467 for 95% of the 
cases. We also found from ransomware 
negotiation data contributors that 
the median ratio of initially requested 
ransom and company revenue is 1.34%, 
but it fluctuated between 0.13% and 
8.30% for 80% of the cases. Similarly, over the past two years, we 
have seen incidents involving Pretexting 
(the majority of which had Business 
Email Compromise [BEC] as the 
outcome) accounting for one-fourth 
(ranging between 24% and 25%) of 
financially motivated attacks. In both 
years, the median transaction amount 
of a BEC was around $50,000, also 
according to the FBI IC3 dataset. In security awareness exercise 
data contributed by our partners during 
2023, 20% of users reported phishing 
in simulation engagements, and 11% 
of the users who clicked the email 
also reported. This is welcome news 
because on the flip side, the median 
time to click on a malicious link after the 
email is opened is 21 seconds and then 
only another 28 seconds for the person 
caught in the phishing scheme to enter 
their data.
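
The selection logic in Steps 4 to 6 (score each sentence against the mean of all sentence embeddings, keep the best `top_n`, then restore document order) can be checked on toy vectors. This is a dependency-free sketch, not the notebook's code: `cosine`, `top_indices_by_centroid` and `toy_embeddings` are hypothetical names introduced for illustration, and a zero vector is assumed to score 0.0.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def top_indices_by_centroid(embeddings, top_n):
    """Indices of the top_n embeddings closest to their mean, in document order."""
    count = len(embeddings)
    centroid = [sum(column) / count for column in zip(*embeddings)]
    scores = [cosine(e, centroid) for e in embeddings]
    ranked = sorted(range(count), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_n])

# Made-up 2-D sentence embeddings (illustrative values only).
toy_embeddings = [
    [1.0, 0.0],  # sentence 0
    [1.0, 1.0],  # sentence 1: points almost exactly along the centroid
    [0.0, 0.0],  # sentence 2: no known words, so it scores 0.0
    [0.0, 2.0],  # sentence 3
]

print(top_indices_by_centroid(toy_embeddings, 2))  # [1, 3]
```

Sorting the selected indices at the end is what keeps the summary in reading order. It also explains why page headers and sign-off lines appear in the output above: anything the tokenizer treats as part of a sentence is scored like body text.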