# Global Cyberattack Pattern Analysis — Data Mining (Anaconda Edition)

**Notebook:** 08_report_summary.ipynb — Collect assets & write conclusions


# Global Cyberattack Pattern Analysis — Final Summary

This project analyzes global cyberattack patterns using the CISA KEV dataset to uncover trends, frequent vulnerability types, and predictive insights. The study combines unsupervised and supervised data mining techniques to identify high-risk vendors and predict response speed for better cybersecurity management.

# 1️⃣ Dataset Overview
The project used the **CISA Known Exploited Vulnerabilities (KEV)** dataset, containing over **1,400** records of vulnerabilities from vendors such as Microsoft, Adobe, Cisco, and others.

Each record includes fields like `vendorProject`, `product`, `cwe_primary`, `year_added`, and `ransomware_known`, which were analyzed to identify attack trends and exploitation patterns.


# 2️⃣ Data Preprocessing
Data cleaning and feature engineering steps included:
- Handling missing values and irrelevant columns.
- Encoding categorical variables (`vendorProject`, `product`, etc.).
- Extracting temporal features like `year_added` and `month_added`.
- Creating new derived attributes for clustering and classification (e.g., **response_speed**).
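
The steps above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the raw column names `dateAdded`, `dueDate`, and `notes` are assumptions, and because this summary does not state how **response_speed** was derived, the sketch uses a hypothetical definition (days between `dateAdded` and `dueDate`, binned at 14 and 30 days).

```python
import pandas as pd

# Made-up rows for illustration only; column names other than
# vendorProject and product are assumptions about the raw data.
raw = pd.DataFrame({
    "vendorProject": ["Microsoft", "Adobe", "Cisco", None],
    "product": ["Windows", "Acrobat", "IOS", "Unknown"],
    "dateAdded": ["2022-03-03", "2023-07-11", "2021-11-03", "2024-01-15"],
    "dueDate": ["2022-03-17", "2023-08-01", "2022-05-03", "2024-02-05"],
    "notes": ["", "", "", ""],  # stands in for an irrelevant column
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop an irrelevant column and rows with a missing vendor.
    out = df.drop(columns=["notes"]).dropna(subset=["vendorProject"]).copy()

    # Temporal features extracted from the date the entry was added.
    added = pd.to_datetime(out["dateAdded"])
    due = pd.to_datetime(out["dueDate"])
    out["year_added"] = added.dt.year
    out["month_added"] = added.dt.month

    # HYPOTHETICAL derived target: remediation window in days, binned.
    # The 14- and 30-day thresholds are illustrative assumptions.
    days = (due - added).dt.days
    out["response_speed"] = pd.cut(
        days, bins=[-1, 14, 30, float("inf")], labels=["fast", "medium", "slow"]
    )

    # Integer-encode the categorical columns.
    for col in ["vendorProject", "product"]:
        out[col + "_code"] = out[col].astype("category").cat.codes
    return out


features = preprocess(raw)
print(features[["vendorProject", "year_added", "month_added", "response_speed"]])
```

Integer codes keep the sketch short; one-hot encoding (for example with `pd.get_dummies`) may be a better fit for distance-based methods such as K-Means.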

# 3️⃣ Exploratory Data Analysis
EDA revealed that:
- Most entries were added to the catalog between **2022 and 2025**.
- **Microsoft**, **Adobe**, and **Cisco** were among the most frequently targeted vendors.
- Common CWE types included **CWE-79 (Cross-Site Scripting)** and **CWE-20 (Improper Input Validation)**.

Visualizations were used to show frequency distributions and temporal trends of cyberattacks.
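
The frequency counts behind findings like these can be produced with a few pandas calls. The sketch below uses made-up rows (not the KEV data) with the field names listed in the dataset overview.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the KEV dataset.
df = pd.DataFrame({
    "vendorProject": ["Microsoft", "Microsoft", "Adobe", "Cisco", "Microsoft", "Adobe"],
    "cwe_primary": ["CWE-20", "CWE-119", "CWE-79", "CWE-20", "CWE-20", "CWE-79"],
    "year_added": [2022, 2022, 2023, 2023, 2024, 2024],
})

# Most frequent vendors and CWE types (sorted by count, descending).
top_vendors = df["vendorProject"].value_counts()
top_cwes = df["cwe_primary"].value_counts()

# Number of entries per year, for a temporal trend.
per_year = df.groupby("year_added").size()

print(top_vendors.head(3))
print(top_cwes.head(3))
print(per_year)
```

With matplotlib installed, `per_year.plot(kind="bar")` would turn the yearly counts into a bar chart.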
"""))

# 4️⃣ Unsupervised Learning (K-Means Clustering)
The Elbow and Silhouette methods indicated an optimal cluster count of **K = 3**.
The clustering grouped vulnerabilities into three main profiles:
1. High-volume vendors like Microsoft with diverse CWE types.
2. Moderate-frequency vulnerabilities mostly from Adobe and Apple.
3. Low-frequency or emerging vulnerabilities from smaller vendors.

This helped to identify vendor-specific attack clusters and time-based patterns.
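
As an illustration of how K can be chosen, the sketch below runs K-Means for several values of K with scikit-learn, recording the inertia (for the elbow plot) and the silhouette score. It uses synthetic, well-separated blobs as a stand-in for the encoded KEV features, so the values it produces say nothing about the real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the encoded features: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # elbow method: look for the bend in this curve
    silhouettes[k] = silhouette_score(X, model.labels_)  # higher is better

# Pick the K with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("best K by silhouette:", best_k)
```

On real data the two criteria do not always agree, so it is worth inspecting both curves rather than relying on the maximum alone.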


# 5️⃣ Association Rule Mining
The Apriori algorithm was applied to uncover frequent co-occurrences between vulnerability features.
Key findings included:
- Strong association between **Microsoft products** and CWE-119 / CWE-20.
- Older vulnerabilities (≤ 2022) often appeared in the same transactions.
- Non-ransomware cases were significantly more common.
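
To show what the algorithm computes, here is a small pure-Python sketch of Apriori (level-wise candidate generation with subset pruning) plus rule extraction with support, confidence, and lift. It is not the notebook's implementation, which may well have used a library, and the transactions and thresholds are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                cons = itemset - ante
                conf = supp / frequent[ante]
                if conf >= min_confidence:
                    found.append((ante, cons, supp, conf, conf / frequent[cons]))
    return found


# Made-up transactions: one set of feature=value items per vulnerability.
transactions = [
    {"vendor=Microsoft", "cwe=CWE-20", "ransomware=no"},
    {"vendor=Microsoft", "cwe=CWE-119", "ransomware=no"},
    {"vendor=Microsoft", "cwe=CWE-20", "ransomware=yes"},
    {"vendor=Adobe", "cwe=CWE-79", "ransomware=no"},
    {"vendor=Cisco", "cwe=CWE-20", "ransomware=no"},
]

frequent = apriori(transactions, min_support=0.4)
found = association_rules(frequent, min_confidence=0.6)
for ante, cons, supp, conf, lift in found:
    print(sorted(ante), "->", sorted(cons), round(supp, 2), round(conf, 2), round(lift, 2))
```

A lift above 1 means the antecedent and consequent co-occur more often than they would if independent. This matters when one item, such as the non-ransomware flag, is very common and inflates confidence on its own.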


# 6️⃣ Supervised Machine Learning
Two models were trained to predict the **response speed** of vulnerability mitigation:
- **Random Forest Classifier**
- **Logistic Regression**

After parameter tuning using GridSearchCV:
- Random Forest achieved **91.6% accuracy**
- Logistic Regression achieved **87.1% accuracy**
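
A tuning setup along these lines can be sketched with scikit-learn's `GridSearchCV`. The data here is synthetic and the parameter grids are illustrative assumptions; this summary does not list the grids used in the notebook, and the accuracies printed by the sketch are unrelated to the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered features and response-speed labels.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grids only; the notebook's actual grids are not given here.
searches = {
    "RandomForest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
        cv=3,
        scoring="accuracy",
    ),
    "LogisticRegression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},
        cv=3,
        scoring="accuracy",
    ),
}

results = {}
for name, search in searches.items():
    search.fit(X_train, y_train)  # cross-validated search on the training split
    results[name] = accuracy_score(y_test, search.predict(X_test))
    print(name, search.best_params_, round(results[name], 3))
```

Scoring the tuned models on a held-out test split, rather than reporting the cross-validation score from the search, keeps the comparison between the two models fair.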


# 7️⃣ Final Comparison and Insights
| Model | Accuracy |
|--------|-----------|
| RandomForest | 0.916 |
| LogisticRegression | 0.871 |

The **Random Forest** model outperformed Logistic Regression by about 4.5 percentage points, which suggests it handles non-linear patterns in the features better.  
Its confusion matrix showed strong classification performance for most classes.
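
For readers less familiar with confusion matrices, the sketch below shows how overall accuracy and per-class recall are read off one. The labels are made up, and the three-class `fast` / `medium` / `slow` target is a hypothetical stand-in, since this summary does not list the actual response-speed classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for a HYPOTHETICAL three-class response-speed target.
labels = ["fast", "medium", "slow"]
y_true = ["fast", "fast", "medium", "medium", "slow", "slow", "slow", "fast"]
y_pred = ["fast", "fast", "medium", "slow", "slow", "slow", "medium", "fast"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Accuracy is the share of samples on the diagonal.
accuracy = cm.trace() / cm.sum()

# Per-class recall: correct predictions divided by the true count of each class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print("accuracy:", accuracy)
print(dict(zip(labels, per_class_recall.round(2))))
```

Per-class recall is worth checking alongside accuracy, because a high overall score can hide weak performance on a less frequent class.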


# ✅ Conclusion
The analysis successfully identified key **patterns in global cyberattacks** and demonstrated that:

- Most vulnerabilities target widely used software like **Microsoft Windows**.
- Improper input validation (CWE-20) and memory-buffer bounds flaws (CWE-119) remain dominant.
- Machine learning can effectively classify response speed and detect underlying relationships between vendors, years, and vulnerability types.

This project provides a **data-driven foundation** for improving **cyber threat prediction and risk management strategies**.
