<a href="https://colab.research.google.com/github/zobayer0x01/Qwen2.5-1.5B-SecQA/blob/main/SecQA_Dataset_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating Dataset from Security StackExchange

In [2]:
!git clone https://github.com/EleutherAI/stackexchange-dataset.git
!pip install -r /content/stackexchange-dataset/requirements.txt

Cloning into 'stackexchange-dataset'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 103 (delta 45), reused 42 (delta 42), pack-reused 52 (from 1)
Receiving objects: 100% (103/103), 19.13 KiB | 4.78 MiB/s, done.
Resolving deltas: 100% (61/61), done.
Collecting bs4 (from -r /content/stackexchange-dataset/requirements.txt (line 1))
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting py7zr (from -r /content/stackexchange-dataset/requirements.txt (line 3))
  Downloading py7zr-0.22.0-py3-none-any.whl.metadata (16 kB)
Collecting lm-dataformat (from -r /content/stackexchange-dataset/requirements.txt (line 5))
  Downloading lm_dataformat-0.0.20-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonlines (from -r /content/stackexchange-dataset/requirements.txt (line 6))
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting texttable (from p

In [3]:
!python3 /content/stackexchange-dataset/main.py --names security.stackexchange

Downloading and processing stackexchange dumps for ['security.stackexchange']
wget https://archive.org/download/stackexchange/security.stackexchange.com.7z -P dumps
--2025-04-02 03:29:12--  https://archive.org/download/stackexchange/security.stackexchange.com.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dn720201.ca.archive.org/0/items/stackexchange/security.stackexchange.com.7z [following]
--2025-04-02 03:29:13--  https://dn720201.ca.archive.org/0/items/stackexchange/security.stackexchange.com.7z
Resolving dn720201.ca.archive.org (dn720201.ca.archive.org)... 64.71.129.148
Connecting to dn720201.ca.archive.org (dn720201.ca.archive.org)|64.71.129.148|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 260037012 (248M) [application/x-7z-compressed]
Saving to: ‘dumps/security.stackexchange.com.7z’


2025-04-02 03:29:22

In [4]:
!mkdir -p /content/data/security_stackexchange
!unzip -o /content/out/security.stackexchange.zip -d /content/data/security_stackexchange

Streaming output truncated to the last 5000 lines.
 extracting: /content/data/security_stackexchange/security.stackexchange_0000238988.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000238975.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000238977.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000238907.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000238991.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000239024.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000238974.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000239039.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000161029.txt  
 extracting: /content/data/security_stackexchange/security.stackexchange_0000239051.txt  
 extracting: /content/data/security

In [6]:
import re
import pandas as pd
import os

# Initialize list to store QA pairs
qa_pairs = []
folder_path = "/content/data/security_stackexchange"  # Folder with all .txt files

# Load and process files
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        try:
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                content = f.read()
                # Split into Q and A sections
                if "Q:" in content and "A:" in content:
                    question = content.split("Q:")[1].split("A:")[0].strip()
                    answer = content.split("A:")[1].strip()
                    qa_pairs.append({"question": question, "answer": answer})
        except Exception as e:
            print(f"Error processing {filename}: {str(e)}")
            continue

def clean_text(text):
    """Clean text by normalizing whitespace and newlines"""
    if pd.isna(text):
        return ""

    text = str(text)  # Convert to string

    # Normalize whitespace and newlines
    text = re.sub(r'\n{2,}', '\n', text)  # Replace multiple newlines
    text = re.sub(r'[^\S\n]+', ' ', text)  # Fix irregular spaces
    return text.strip()

# Create DataFrame (without dtype parameter)
df = pd.DataFrame(qa_pairs)

# Convert columns to string type explicitly
df = df.astype({'question': str, 'answer': str})

# Clean text columns
df['question'] = df['question'].apply(clean_text)
df['answer'] = df['answer'].apply(clean_text)

# Remove empty rows (where either question or answer is empty)
df = df[(df['question'].str.len() > 0) & (df['answer'].str.len() > 0)]

print(f"Successfully processed {len(df)} QA pairs")
print(df.head())

Successfully processed 42482 QA pairs
                                            question  \
0  What is the purpose of a targeted email withou...   
1  Is Linux kernel supported by Linux Mint 17 LTS...   
2  Deduce RSA 1024 bit key from known input and o...   
3  Should I be concerned about this VPN logging?\...   
4  3rd party API access: Is OAuth really required...   

                                              answer  
0  Attempting to send a message to a non-existant...  
1  Linux Mint founder and lead developer Clement ...  
2  No, it is not possible. openssl rsautl perform...  
3  Presumably you are using a VPN to ensure anony...  
4  It all comes down to the old adage: "Good IT s...  
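
The cell above splits on the plain substrings `Q:` and `A:`. That keeps only the text between the first `A:` and the next one, so a file with several answers contributes just its first answer, and an `A:` inside ordinary text (for example `Q&A:`) also acts as a split point. Below is a minimal sketch of a stricter parser. It assumes, without confirmation from the dump files themselves, that each section marker sits alone on its own line; `parse_qa` and the sample text are hypothetical and exist only for illustration.

```python
import re

# Assumption: section markers are "Q:" or "A:" alone on a line.
_MARKER = re.compile(r"^(Q:|A:)[ \t]*$", flags=re.MULTILINE)

def parse_qa(content):
    """Return (question, [answers]) for one file, or None if either part is missing."""
    # With one capture group, re.split yields [preamble, marker, body, marker, body, ...].
    parts = _MARKER.split(content)
    question, answers = None, []
    for marker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if not body:
            continue
        if marker == "Q:" and question is None:
            question = body
        elif marker == "A:":
            answers.append(body)
    if question is None or not answers:
        return None
    return question, answers

# Made-up sample: "Q&A:" appears inside the question and must not split it.
sample = "Q:\n\nIs TLS 1.0 safe?\n\nQ&A: see below\n\nA:\n\nNo.\n\nA:\n\nIt is deprecated.\n"
parsed = parse_qa(sample)
```

Returning every answer leaves the choice open: emit one row per answer, or keep only the first to match the behaviour of the original cell.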


In [7]:
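# Save the cleaned, filtered DataFrame from the previous cell.
# (Rebuilding it from qa_pairs here would discard the cleaning and the empty-row filter.)
df.to_csv("cybersecurity_qa.csv", index=False)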

In [8]:
pd.read_csv("cybersecurity_qa.csv").head()

                                            question                                             answer
0  What is the purpose of a targeted email withou...  Attempting to send a message to a non-existant...
1  Is Linux kernel supported by Linux Mint 17 LTS...  Linux Mint founder and lead developer Clement ...
2  Deduce RSA 1024 bit key from known input and o...  No, it is not possible. openssl rsautl perform...
3  Should I be concerned about this VPN logging?\...  Presumably you are using a VPN to ensure anony...
4  3rd party API access: Is OAuth really required...  It all comes down to the old adage: "Good IT s...
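
Questions and answers contain commas, double quotes, and embedded newlines, so it is worth checking that a row survives the CSV round trip. The sketch below uses a single made-up row and an in-memory buffer instead of the real file; `df_demo` and `round_trip_ok` are illustrative names, not part of the notebook.

```python
import io
import pandas as pd

# One made-up row containing a comma, double quotes, and a newline.
df_demo = pd.DataFrame({
    "question": ['Is "salting" needed?\nAsking for a friend, really'],
    "answer": ["Yes, always."],
})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # index=False: no extra index column in the file
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compare column names and cell values after the round trip.
round_trip_ok = (
    list(restored.columns) == list(df_demo.columns)
    and restored.values.tolist() == df_demo.values.tolist()
)
```

pandas quotes any field that contains the delimiter, a quote character, or a line break, which is why multi-line answers can be stored in a plain CSV.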


In [12]:
!zip /content/cybersecurity_qa.zip /content/cybersecurity_qa.csv

  adding: content/cybersecurity_qa.csv (deflated 63%)
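
The `zip` output above shows the file stored as `content/cybersecurity_qa.csv`, that is, with its directory prefix. If a flat archive is preferred, one option is the standard-library `zipfile` module with an explicit `arcname`. This is a sketch, not the method used in the notebook; `zip_single_file` is a hypothetical helper and the demonstration runs on a throwaway file.

```python
import os
import tempfile
import zipfile

def zip_single_file(src_path, zip_path):
    """Compress one file into zip_path, stored under its bare file name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path, arcname=os.path.basename(src_path))
    return zip_path

# Demonstration on a temporary stand-in file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "cybersecurity_qa.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("question,answer\nq,a\n")
    archive = zip_single_file(csv_path, os.path.join(tmp, "cybersecurity_qa.zip"))
    with zipfile.ZipFile(archive) as zf:
        stored_names = zf.namelist()
```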


In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
!mkdir -p /content/drive/MyDrive/Projects/SecQAAI/
!cp /content/cybersecurity_qa.zip /content/drive/MyDrive/Projects/SecQAAI/
!ls /content/drive/MyDrive/Projects/SecQAAI/

cybersecurity_qa.csv  cybersecurity_qa.zip


# Dataset Link

[cybersecurity_qa.zip](https://drive.google.com/file/d/1-2Oc8HwH7VA_GC_1MGaU2FcyCP6jmyRk/view?usp=sharing)

Google Drive file ID: `1-2Oc8HwH7VA_GC_1MGaU2FcyCP6jmyRk`
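
Once the archive has been downloaded, pandas can read the CSV directly from it, provided the zip contains a single file. The sketch below assumes a local copy of the archive; `load_qa` is a hypothetical helper, and the demonstration builds a tiny stand-in archive so that it runs without the real dataset.

```python
import os
import tempfile
import zipfile
import pandas as pd

def load_qa(zip_path):
    """Read the question/answer CSV straight from a single-file zip archive."""
    df = pd.read_csv(zip_path, compression="zip")
    # Empty CSV fields are read back as NaN, so drop incomplete rows again.
    return df.dropna(subset=["question", "answer"])

# Stand-in archive with one complete row and one row missing its answer.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "cybersecurity_qa.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr(
            "cybersecurity_qa.csv",
            "question,answer\nWhat is XSS?,Cross-site scripting.\nEmpty?,\n",
        )
    loaded = load_qa(demo_zip)
```

For the real dataset the call would be `load_qa("cybersecurity_qa.zip")`, with the path adjusted to wherever the download was saved.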