### ML-based Cyber Risk Assessment for Vulnerability Severity Predictions
A machine learning model predicts a vulnerability's CVSS Base Score to support risk management. Scraped data is integrated with datasets from NVD, CISA KEV, and EPSS, giving a data-driven way to prioritize threats and streamline vulnerability assessment.

Data obtained from [this study](https://www.mdpi.com/2078-2489/15/4/199)

##### Import Packages

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

##### Data Loading and Initial Checks

In [26]:
# Load data
df = pd.read_csv('cyber_data.csv')
# df = pd.read_csv('cve_cisa_epss_enriched_dataset.csv')

df

Unnamed: 0,AttackDate,Country,Spam,Ransomware,Local Infection,Exploit,Malicious Mail,Network Attack,On Demand Scan,Web Threat,Rank Spam,Rank Ransomware,Rank Local Infection,Rank Exploit,Rank Malicious Mail,Rank Network Attack,Rank On Demand Scan,Rank Web Threat
0,11/10/2022 0:00,Arab Republic of Egypt,0.00090,0.00013,0.01353,0.00013,0.00287,0.01007,0.01148,0.01708,68,47,85,176,34,11,78,53
1,11/10/2022 0:00,Argentine Republic,0.00601,0.00006,0.00575,0.00035,0.00058,0.00095,0.00482,0.00974,27,86,174,128,140,138,174,160
2,11/10/2022 0:00,Aruba,,,0.01384,,0.00092,,0.00830,0.00554,162,143,82,186,104,187,119,190
3,11/10/2022 0:00,Bailiwick of Guernsey,,,0.00546,0.00273,,0.00091,0.00546,0.01001,162,143,179,1,186,141,164,159
4,11/10/2022 0:00,Bailiwick of Jersey,0.00003,,0.00774,0.00101,0.00067,,0.00707,0.01145,138,143,150,31,133,187,137,146
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77618,11/12/2023 0:00,United Arab Emirates,0.00064,0.00009,0.00901,0.00057,0.00198,0.00149,0.00892,0.01274,86,37,100,43,2,40,42,87
77619,11/12/2023 0:00,United Kingdom of Great Britain and Northern I...,0.01292,0.00003,0.00428,0.00084,0.00021,0.00045,0.00382,0.01205,12,89,173,14,97,139,171,100
77620,11/12/2023 0:00,United Mexican States,0.00500,0.00004,0.00870,0.00019,0.00035,0.00106,0.00772,0.00834,34,77,106,122,65,68,74,152
77621,11/12/2023 0:00,United Republic of Tanzania,0.00030,0.00002,0.01201,0.00031,0.00028,0.00091,0.00717,0.01145,101,116,54,83,77,83,92,114


In [24]:
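# Summary statistics. The output below has 155,852 rows and CVSS/EPSS columns, so it
# appears to come from the commented-out enriched dataset rather than cyber_data.csv.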
df.describe()

Unnamed: 0,base_score,exploitability_score,impact_score,epss_score,epss_perc
count,155852.0,155852.0,155852.0,155852.0,155852.0
mean,7.142007,2.691608,4.304447,0.027664,0.449421
std,1.701806,0.939721,1.533836,0.116194,0.274225
min,1.6,0.1,1.4,1e-05,1e-05
25%,5.5,1.8,3.4,0.00072,0.223928
50%,7.5,2.8,3.6,0.00206,0.43042
75%,8.8,3.9,5.9,0.00529,0.66212
max,10.0,3.9,6.0,0.94582,1.0


In [27]:
# Check the row count and the non-null count of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77623 entries, 0 to 77622
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   AttackDate            77623 non-null  object 
 1   Country               77623 non-null  object 
 2   Spam                  62982 non-null  float64
 3   Ransomware            52144 non-null  float64
 4   Local Infection       74469 non-null  float64
 5   Exploit               64264 non-null  float64
 6   Malicious Mail        69184 non-null  float64
 7   Network Attack        71532 non-null  float64
 8   On Demand Scan        74231 non-null  float64
 9   Web Threat            73892 non-null  float64
 10  Rank Spam             77623 non-null  int64  
 11  Rank Ransomware       77623 non-null  int64  
 12  Rank Local Infection  77623 non-null  int64  
 13  Rank Exploit          77623 non-null  int64  
 14  Rank Malicious Mail   77623 non-null  int64  
 15  Rank Network Attack
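
The `df.info()` output shows that the threat-rate columns contain nulls while the rank columns listed do not. A minimal sketch of how the missing share per column could be tabulated follows; the small frame reuses the first four rows displayed earlier, and the names `sample` and `missing_share` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# First four displayed rows for three columns (a stand-in for the full df)
sample = pd.DataFrame({
    "Spam": [0.00090, 0.00601, np.nan, np.nan],
    "Ransomware": [0.00013, 0.00006, np.nan, np.nan],
    "Rank Spam": [68, 27, 162, 162],
})

# Fraction of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the full frame the same expression would be `df.isna().mean()`. In these rows the two records with a missing `Spam` value share the same `Rank Spam`, which may be worth checking before deciding how to impute.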

In [22]:
# Show value counts for every non-integer column to find candidates for encoding
for col in df.select_dtypes(exclude='int64').columns:
    display(df[col].value_counts())

AttackDate
14/10/2023 0:00    205
11/12/2022 0:00    205
21/10/2022 0:00    205
14/08/2023 0:00    204
13/08/2023 0:00    204
                  ... 
18/07/2023 0:00    193
11/11/2023 0:00    193
01/08/2023 0:00    193
16/09/2023 0:00    192
29/07/2023 0:00    185
Name: count, Length: 392, dtype: int64

Country
Arab Republic of Egypt                   392
Argentine Republic                       392
Bailiwick of Jersey                      392
Belize                                   392
Barbados                                 392
                                        ... 
Virgin Islands of the United States        4
State of Eritrea                           3
Federated States of Micronesia             2
Antarctica                                 2
Democratic People's Republic of Korea      2
Name: count, Length: 225, dtype: int64

Spam
0.00001    4698
0.00002    2729
0.00003    2046
0.00004    1660
0.00005    1359
           ... 
0.04840       1
0.17932       1
0.08137       1
0.19277       1
0.03432       1
Name: count, Length: 4139, dtype: int64

Ransomware
0.00003    5973
0.00004    4696
0.00002    4550
0.00006    3966
0.00005    3925
           ... 
0.00161       1
0.00313       1
0.00163       1
0.00184       1
0.00643       1
Name: count, Length: 161, dtype: int64

Local Infection
0.00771    83
0.00595    78
0.00610    77
0.00774    75
0.01006    74
           ..
0.04072     1
0.03867     1
0.04166     1
0.04133     1
0.00086     1
Name: count, Length: 4036, dtype: int64

Exploit
0.00015    1195
0.00016    1171
0.00013    1147
0.00014    1137
0.00011    1102
           ... 
0.00232       1
0.00367       1
0.00411       1
0.00363       1
0.00316       1
Name: count, Length: 332, dtype: int64

Malicious Mail
0.00006    813
0.00021    753
0.00011    748
0.00014    727
0.00009    724
          ... 
0.01366      1
0.01274      1
0.01350      1
0.00849      1
0.01143      1
Name: count, Length: 1125, dtype: int64

Network Attack
0.00055    471
0.00058    440
0.00063    404
0.00083    401
0.00060    399
          ... 
0.01836      1
0.02255      1
0.01257      1
0.02238      1
0.02410      1
Name: count, Length: 2012, dtype: int64

On Demand Scan
0.00483    99
0.00512    98
0.00391    92
0.00474    91
0.00552    91
           ..
0.03480     1
0.02911     1
0.02825     1
0.02737     1
0.02687     1
Name: count, Length: 3012, dtype: int64

Web Threat
0.01231    94
0.01247    92
0.01302    90
0.01186    88
0.01162    88
           ..
0.03065     1
0.02773     1
0.02984     1
0.02781     1
0.03428     1
Name: count, Length: 2904, dtype: int64

##### Feature Engineering
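
`AttackDate` is loaded as an `object` column holding day-first strings such as `14/10/2023 0:00`. One possible first step, sketched here under the assumption that calendar features are wanted, is to parse it explicitly and derive a few date parts. The feature names are hypothetical; the notebook does not state which features it uses.

```python
import pandas as pd

# Sample values copied from the AttackDate column (day/month/year hour:minute)
dates = pd.Series(
    ["11/10/2022 0:00", "14/10/2023 0:00", "29/07/2023 0:00"],
    name="AttackDate",
)

# An explicit day-first format avoids reading 11/10 as November 10
parsed = pd.to_datetime(dates, format="%d/%m/%Y %H:%M")

# Hypothetical calendar features derived from the parsed dates
features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day_of_week": parsed.dt.dayofweek,  # Monday = 0
})
print(features)
```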

##### Model Building and Training
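
The imports at the top of the notebook (`StandardScaler`, `LinearRegression`, `MultiOutputRegressor`, `train_test_split`) suggest a scaled multi-output linear regression. The sketch below shows how those pieces could fit together. It runs on synthetic data because the notebook does not specify its features or targets, and `make_pipeline` is an addition not present in the original imports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, 2 targets (not the notebook's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
y = X @ true_coef + rng.normal(scale=0.1, size=(200, 2))

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features, then fit one LinearRegression per target column
model = make_pipeline(StandardScaler(), MultiOutputRegressor(LinearRegression()))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred.shape)
```

Wrapping the scaler and regressor in one pipeline means the scaler is fitted on the training split only, which keeps test-set statistics out of training.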

##### Evaluation and Iteration
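
The notebook imports `mean_squared_error` and `r2_score`, so evaluation presumably relies on those metrics. A minimal sketch of computing them is shown below; the true and predicted scores are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted scores, for illustration only
y_true = np.array([7.5, 9.8, 5.3, 6.1])
y_pred = np.array([7.0, 9.0, 5.5, 6.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # expressed in the same units as the target
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

RMSE is often easier to interpret than MSE for a bounded target such as a 0 to 10 score, because it reads directly as a typical error in score points.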