In [1]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

In [2]:
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

In [3]:
model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
import torch

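def answer_question(question, context, model, tokenizer):
    # Encode the pair as [CLS] question [SEP] context [SEP]; truncate only the
    # context if the pair would exceed BERT's 512-token limit.
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation="only_second", max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # Most likely start and end token positions of the answer span.
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)

    # The two argmaxes are independent, so end_idx can fall before start_idx,
    # in which case the slice (and the decoded answer) is empty.
    answer = tokenizer.decode(inputs.input_ids[0][start_idx:end_idx+1])

    return answer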
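
Because `answer_question` takes the start and end argmax separately, the end position can land before the start position. The sketch below is one possible alternative, not part of the original notebook: it scores every valid span jointly. The name `best_span` and the `max_answer_len` parameter are assumptions made for illustration, and the logits are plain Python lists so the example runs without a model.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy logits: independent argmax would give start=3, end=1 (an empty slice).
start = [0.1, 0.2, 0.0, 2.0]
end = [0.0, 3.0, 0.1, 0.5]
span = best_span(start, end)
print(span[:2])
```

With real model outputs, the lists could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.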

In [5]:
sample_question = "what did the mob do"
sample_context= "mob did not keep peace"
answer = answer_question(sample_question, sample_context, model, tokenizer)
print("Answer:", answer)

Answer: not keep peace


In [6]:
context = """In computer science, a hash function is a mathematical function that takes an input (or "message") and returns a fixed-size string of bytes. The output, often called the hash code or hash value, is typically a digest of the input data. Hash functions are commonly used in various applications, including data integrity verification, password storage, and digital signatures. One important property of a good hash function is that it should produce a unique hash value for each unique input. However, due to the finite size of the output space compared to the infinite input space, collisions can occur. A collision happens when two different inputs produce the same hash value. Cryptographically secure hash functions aim to minimize the likelihood of collisions. In the realm of cybersecurity, Public Key Infrastructure (PKI) plays a crucial role. PKI is a framework that manages digital keys and certificates. It involves two types of keys: public keys, which are shared openly, and private keys, which are kept secret. Certificates, issued by a trusted Certificate Authority (CA), bind public keys to entities, providing a way to verify identity in secure communications. Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS), are cryptographic protocols that provide secure communication over a computer network. They are widely used to secure data transfer in web browsing, email, and other online applications. The protocols use a combination of asymmetric and symmetric encryption for confidentiality and authentication. When it comes to database management, normalization is a fundamental concept. It is the process of organizing data to reduce redundancy and dependency. The goal is to achieve data integrity and efficient data storage. Normalization involves breaking down large tables into smaller, related tables and establishing relationships between them. The result is a more flexible and maintainable database structure. 
Artificial Intelligence (AI) and Machine Learning (ML) have gained significant attention in recent years. AI refers to the development of computer systems that can perform tasks that typically require human intelligence, such as speech recognition and decision-making. ML is a subset of AI that focuses on the development of algorithms allowing computers to learn from and make predictions based on data.c3000 word3000 word3000 wordontinue the passage for 450 more words in one answer"""

1. **Question:** What is a hash function?

   **Answer:** A hash function is a mathematical function that takes an input and produces a fixed-size string of bytes, commonly used for data integrity verification, password storage, and digital signatures.

2. **Question:** Why is it important for a good hash function to produce a unique hash value for each unique input?

   **Answer:** It's important to avoid collisions, where two different inputs produce the same hash value, to ensure the integrity and reliability of the hash function.

3. **Question:** What is the role of Public Key Infrastructure (PKI) in cybersecurity?

   **Answer:** PKI is a framework that manages digital keys and certificates, providing a secure way to manage public and private keys, and establishing identity in secure communications.

4. **Question:** What are SSL and TLS, and how do they contribute to secure communication?

   **Answer:** SSL and TLS are cryptographic protocols widely used to secure data transfer in web browsing, email, and online applications. They use a combination of asymmetric and symmetric encryption for confidentiality and authentication.

5. **Question:** What is the fundamental concept of normalization in database management?

   **Answer:** Normalization is the process of organizing data to reduce redundancy and dependency, aiming to achieve data integrity and efficient data storage by breaking down large tables into smaller, related tables.

6. **Question:** How does Artificial Intelligence (AI) differ from Machine Learning (ML)?

   **Answer:** AI refers to the development of computer systems that can perform tasks requiring human intelligence, while ML is a subset of AI focusing on algorithms that allow computers to learn and make predictions based on data.

7. **Question:** What is a collision in the context of hash functions?

   **Answer:** A collision occurs when two different inputs produce the same hash value, highlighting a potential weakness in a hash function.

8. **Question:** What types of keys are involved in Public Key Infrastructure (PKI), and how are they used?

   **Answer:** PKI involves public keys (shared openly) and private keys (kept secret), and certificates issued by a trusted Certificate Authority (CA) bind public keys to entities, enabling secure communications.

9. **Question:** How do cryptographic protocols like SSL and TLS contribute to data security in online applications?

   **Answer:** SSL and TLS provide secure communication by using encryption techniques, ensuring confidentiality and authentication of data transfer over a computer network.

10. **Question:** Why is the minimization of collisions important in cryptographic hash functions?

    **Answer:** Minimizing collisions is crucial to maintain the reliability and security of cryptographic hash functions, ensuring that different inputs do not produce the same hash value.


In [7]:
from nltk.tokenize import sent_tokenize

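def split_text_into_chunks_with_overlap(text, sentences_per_chunk, overlap_sentences):
    # sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.
    if overlap_sentences >= sentences_per_chunk:
        # Otherwise start_idx would never advance and the loop would not end.
        raise ValueError("overlap_sentences must be smaller than sentences_per_chunk")
    sentences = sent_tokenize(text)
    chunks = []
    start_idx = 0
    while start_idx < len(sentences):
        end_idx = min(start_idx + sentences_per_chunk, len(sentences))
        chunk = ' '.join(sentences[start_idx:end_idx])
        # Extend the chunk with the next overlap_sentences sentences when more text follows them.
        end_idx = min(end_idx + overlap_sentences, len(sentences))
        if end_idx < len(sentences):
            # The leading space keeps the appended sentences from fusing with the previous one.
            chunk += ' ' + ' '.join(sentences[end_idx - overlap_sentences:end_idx])
        chunks.append(chunk)
        # Advance by less than a full chunk so consecutive chunks share sentences.
        start_idx = start_idx + sentences_per_chunk - overlap_sentences
    return chunks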

chunks = split_text_into_chunks_with_overlap(context, sentences_per_chunk=4, overlap_sentences=2)
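
The chunking above can also be viewed as a sliding window over sentences: with `sentences_per_chunk=4` and `overlap_sentences=2`, most chunks span roughly six sentences and each new chunk starts two sentences later. A minimal sketch of that view follows. The function name `sliding_window_chunks` and its `window` and `stride` parameters are hypothetical, and it takes an already-split list of sentences so it runs without NLTK.

```python
def sliding_window_chunks(sentences, window, stride):
    """Join `window` consecutive sentences per chunk, moving `stride` sentences each step."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        # Stop once a window has reached the final sentence.
        if start + window >= len(sentences):
            break
    return chunks

demo = ["A.", "B.", "C.", "D.", "E."]
demo_chunks = sliding_window_chunks(demo, window=3, stride=2)
print(demo_chunks)
```

Unlike the notebook's function, this version never emits short trailing chunks that are fully contained in the previous one, which may or may not be desirable for retrieval.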

In [8]:
question = "What is a collision in the context of hash functions?"

In [9]:
#to save time after changing the question, re-run from the "question_embedding = get_bert..." cell onward

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
model_fs = BertModel.from_pretrained('bert-base-uncased') #model for similarity

In [13]:
def get_bert_embedding(text):
    tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model_fs(**tokens)
    embeddings = outputs['last_hidden_state'][:, 0, :] #[:, 0, :] selects the [CLS] token's hidden state, used here as one vector for the whole input
    return embeddings.numpy()

In [14]:
question_embedding = get_bert_embedding(question)
chunk_embeddings = []
for chunk in chunks:
    chunk_embeddings.append(get_bert_embedding(chunk))

In [15]:
bert_similarities = []
for embedding in chunk_embeddings:
    bert_similarities.append(cosine_similarity(question_embedding, embedding)[0][0])
#cosine_similarity always returns a 2D matrix; here both inputs are single vectors,
#so the matrix is 1x1 and [0][0] extracts the scalar cosine similarity value.
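
For reference, the scalar that `[0][0]` pulls out is the dot product of the two vectors divided by the product of their lengths. The sketch below spells that out in plain Python. The helper name `cosine` is an assumption for illustration, and returning 0.0 for a zero-length vector is a choice made here, not something taken from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```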

In [16]:
cosine_similarity(question_embedding, embedding)

array([[0.7285042]], dtype=float32)

In [17]:
bert_similarities = np.array(bert_similarities)
#python list to np array for weight assigning later

In [18]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 4))
tfidf_matrix = tfidf_vectorizer.fit_transform([question] + chunks)

tfidf_similarities = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1:])[0]

In [19]:
cosine_similarity(tfidf_matrix[0], tfidf_matrix[1:])[0]

array([0.00541673, 0.02112343, 0.01593019, 0.01493902, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])

In [20]:
print("BERT Similarities:", bert_similarities)
print("TF-IDF Similarities:", tfidf_similarities)

BERT Similarities: [0.4546045  0.5487602  0.46011317 0.6222875  0.61315775 0.55653375
 0.50459254 0.7605978  0.7708142  0.74303573 0.7285042 ]
TF-IDF Similarities: [0.00541673 0.02112343 0.01593019 0.01493902 0.         0.
 0.         0.         0.         0.         0.        ]


In [21]:
#the two similarity measures have very different value ranges:
#BERT ranges from about 0.45 to 0.77, while TF-IDF ranges from 0 to about 0.02

In [22]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

tfidf_similarities = scaler.fit_transform(tfidf_similarities.reshape(-1, 1)).flatten()
bert_similarities = scaler.fit_transform(bert_similarities.reshape(-1, 1)).flatten()

#reshape to convert 1d to 2d for scaler, and flatten to convert it back to 1d 
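
Min-max scaling maps the smallest value to 0 and the largest to 1 via `(x - min) / (max - min)`. A minimal plain-Python sketch of the same arithmetic is shown below. The name `min_max_scale` is hypothetical, and mapping a constant list to zeros is an assumption made to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread at all, map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2.0, 4.0, 6.0]))
```

One consequence worth keeping in mind: after scaling, each measure spans the full 0 to 1 range regardless of how small its raw spread was, so the tiny raw TF-IDF differences carry as much weight as the much larger BERT differences.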

In [23]:
print("BERT Similarities:", bert_similarities)
print("TF-IDF Similarities:", tfidf_similarities)

BERT Similarities: [0.         0.29776335 0.01742089 0.5302906  0.5014181  0.32234704
 0.15808511 0.9676913  1.0000001  0.91215193 0.8661963 ]
TF-IDF Similarities: [0.25643237 1.         0.75414799 0.70722532 0.         0.
 0.         0.         0.         0.         0.        ]


In [24]:
similarity_scores = (0.8*bert_similarities +  1*tfidf_similarities)
#tinker with weights for tuning
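
The weighted sum above can be paired with a ranking step that keeps track of chunk positions rather than chunk text. The sketch below is one way to do that, using plain lists. The name `top_k_chunks` and its parameters are assumptions for illustration, with the defaults copied from the weights tried in this cell.

```python
def top_k_chunks(bert_scores, tfidf_scores, k, w_bert=0.8, w_tfidf=1.0):
    """Return the k best (chunk_index, combined_score) pairs, highest score first."""
    combined = [w_bert * b + w_tfidf * t for b, t in zip(bert_scores, tfidf_scores)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [(i, combined[i]) for i in order[:k]]

# Toy scaled scores for three chunks.
ranked = top_k_chunks([0.0, 1.0, 0.5], [1.0, 0.0, 0.0], k=2)
print([i for i, _ in ranked])
```

Ranking by index avoids relying on chunk text as a dictionary key, which would silently merge any two chunks that happen to be identical.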

In [25]:
print("Combined Similarities:", similarity_scores)

Combined Similarities: [0.25643237 1.23821068 0.7680847  1.1314578  0.40113449 0.25787765
 0.12646809 0.77415305 0.80000013 0.72972155 0.69295704]


In [26]:
similarity_scores_sorted = sorted(similarity_scores,reverse=True)
for i in range(len(similarity_scores)):
    print(similarity_scores_sorted[i])
print("\ntotal chunks: ")
print(len(similarity_scores))

1.238210678100586
1.1314578038705843
0.8000001311302185
0.7741530537605286
0.7680847036448658
0.7297215461730957
0.6929570436477661
0.4011344909667969
0.2578776478767395
0.25643237135234076
0.12646809220314026

total chunks: 
11


In [27]:
answers = []
answer_n_score = dict(zip(chunks, similarity_scores))
answer_n_score = dict(sorted(answer_n_score.items(), key=lambda item: item[1], reverse=True))

i=0

for key, value in answer_n_score.items():
    answers.append(answer_question(question, key, model, tokenizer))
    print("\n"+str(value))
    i+=1
    if i>5:
        break


1.238210678100586

1.1314578038705843

0.8000001311302185

0.7741530537605286

0.7680847036448658

0.7297215461730957


In [28]:
print("question: ")
print(question)
print("\nanswers: ")
i = 0
for answer in answers:
    print("answer " + str(i+1) + ": ")
    print(answer)
    i += 1
print("\n\ntotal chunks: ")
print(len(chunks))
print("\nchunks considered for answer: ")
print(i)

question: 
What is a collision in the context of hash functions?

answers: 
answer 1: 
two different inputs produce the same hash value
answer 2: 
cryptographically secure hash functions aim to minimize the likelihood of collisions
answer 3: 
normalization involves breaking down large tables into smaller, related tables
answer 4: 
redundancy and dependency
answer 5: 
two different inputs produce the same hash value
answer 6: 
c3000 word3000 word3000 wordontinue the passage for 450 more words in one answer


total chunks: 
11

chunks considered for answer: 
6


In [125]:
#correct
#PKI involves public keys (shared openly) and private keys (kept secret), and certificates issued by a trusted 
#Certificate Authority (CA) bind public keys to entities, enabling secure communications.

In [126]:
#to determine weights, we may need to study the results of both similarities individually
#across different types and sizes of questions, then build a rule-based system that assigns
#appropriate weights to the two similarity scores for each class of question.
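
A rule-based weighting scheme of the kind described in the note above might be structured as in the sketch below. Everything in it is hypothetical: the question classes, the prefix rule in `classify_question`, and the numbers in `PLACEHOLDER_WEIGHTS` are untuned placeholders meant only to show the shape of the lookup, not results from any experiment.

```python
# Hypothetical classes mapped to (bert_weight, tfidf_weight); values are untuned placeholders.
PLACEHOLDER_WEIGHTS = {
    "definition": (0.8, 1.0),  # same weights as tried earlier in the notebook
    "other": (1.0, 1.0),       # arbitrary placeholder
}

def classify_question(question):
    """Very rough rule: treat 'what is/are ...' questions as definition questions."""
    q = question.strip().lower()
    if q.startswith(("what is", "what are")):
        return "definition"
    return "other"

def weights_for(question):
    """Look up the (bert_weight, tfidf_weight) pair for a question's class."""
    return PLACEHOLDER_WEIGHTS[classify_question(question)]

print(weights_for("What is a collision in the context of hash functions?"))
```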