# NVD Introduction

## Introduction

Vulnerability data are available from three sources: CVE Mitre, NVD, and CVE Details. Vulnerabilities are created and
then annotated across these sources in that order.

The **Common Vulnerabilities and Exposures (CVE)** list by Mitre was launched in 1999, when most information security tools used their own databases with their own names for security vulnerabilities. It documents known vulnerabilities
manually for public use.

Each vulnerability contains a description, is uniquely identified by a CVE ID, and may also include fields specifying the vulnerable software and the versions and vendors affected by it. If a set of vulnerabilities
is similar but occurs in different software, the vulnerabilities can have different CVE IDs while sharing the same weakness ID (CWE ID).

When created by CVE Mitre, a vulnerability may or may not be annotated with a weakness ID (CWE ID). When available, CWE IDs can be used to group similar vulnerabilities conceptually and to observe how they have been ‘instantiated’ in different software, versions, or vendors.
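As a rough illustration of grouping by weakness, the sketch below collects CVE IDs under their CWE ID and skips entries without one. The records are copied from the sample NVD output shown later in this document; the variable names `records` and `by_weakness` are only illustrative.

```python
from collections import defaultdict

# (cve_id, cwe_id) pairs taken from the sample 2007 rows later in this document.
# A cwe_id of None means the vulnerability carries no weakness annotation.
records = [
    ('CVE-2007-0002', 'CWE-119'),
    ('CVE-2007-0004', 'CWE-264'),
    ('CVE-2007-0005', 'CWE-119'),
    ('CVE-2007-0006', None),
]

# Map each CWE ID to the list of CVE IDs annotated with it.
by_weakness = defaultdict(list)
for cve_id, cwe_id in records:
    if cwe_id is not None:
        by_weakness[cwe_id].append(cve_id)

for cwe_id, cve_ids in sorted(by_weakness.items()):
    print(cwe_id, cve_ids)
```

Here two distinct CVE IDs end up under CWE-119, which is the kind of conceptual grouping described above.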

CVE Mitre’s vulnerabilities are then annotated with severity scores, fix information,
and impact ratings in the **National Vulnerability Database (NVD)**, and made available for download as XML feeds.

**CVE Details** was created to provide a user-friendly interface to NVD’s XML feeds. For instance, using vulnerabilities’ CWE IDs and keyword matching, it defines 13 vulnerability types to make browsing vulnerabilities easier. CVE Details warns about inconsistencies in the NVD XML feeds (e.g., the same vendor’s software having different names) and about entries irrelevant to our purposes (i.e., reserved, duplicate, and removed entries). We therefore downloaded all software vulnerabilities to date from the three sources to define our vulnerability dataset and ensure consistency.


## Motivation

Will be added later.

## Method

### Parsing XML files to CSV

The data from the NVD website is available in XML format for each year from 2002 to 2017. Further analysis requires only a few fields (CVE-ID, CWE-ID, Timestamp), so we parse only those tags and write them to CSV. The conversion code handles the years 2002 to 2017 (February), and the converted files are written to the same folder as the notebook, each identifiable by the name of its year. The range of years and the destination folder can be changed by altering the path and the numbers in the code.
The method uses ElementTree for XML parsing.
If you choose to skip this step and access the CSV files directly, you can find them on our [Google Drive](https://drive.google.com/open?id=0B-NONBqqQBznYlRLUU5zS0lLZU0).
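The following is a minimal sketch of such a conversion with ElementTree, not the notebook's actual code. It assumes each vulnerability is an `entry` element whose `id` attribute holds the CVE ID, with a `cwe` child carrying an `id` attribute and a `published-datetime` child holding the timestamp; these tag names, the choice of the published date as the timestamp, and the helper names `local_name`, `extract_rows`, and `write_csv` are assumptions. Tags are matched by local name so the sketch does not depend on exact namespace URIs, and it is written for Python 3.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def extract_rows(xml_source):
    """Yield (cve_id, cwe_id, timestamp) for each <entry> element.

    xml_source may be a file path or a binary file object.
    Tag and attribute names are assumptions about the feed layout.
    """
    for _, elem in ET.iterparse(xml_source, events=('end',)):
        if local_name(elem.tag) != 'entry':
            continue
        cve_id, cwe_id, timestamp = elem.get('id', ''), '', ''
        for child in elem.iter():
            name = local_name(child.tag)
            if name == 'cwe' and not cwe_id:
                cwe_id = child.get('id', '')  # keep the first CWE only
            elif name == 'published-datetime':
                timestamp = (child.text or '').strip()
        yield cve_id, cwe_id, timestamp
        elem.clear()  # release the entry's children once it is processed


def write_csv(rows, csv_path):
    """Write rows without a header line, like the CSVs used in this document."""
    with open(csv_path, 'w', newline='') as handle:
        csv.writer(handle).writerows(rows)


# Hand-written sample; the namespace URIs are placeholders, not the real ones.
SAMPLE = b"""<nvd xmlns="http://example.org/feed" xmlns:vuln="http://example.org/vuln">
  <entry id="CVE-2007-0002">
    <vuln:published-datetime>2007-03-16T17:19:00.000-04:00</vuln:published-datetime>
    <vuln:cwe id="CWE-119"/>
  </entry>
  <entry id="CVE-2007-0001">
    <vuln:published-datetime>2007-03-02T16:18:00.000-05:00</vuln:published-datetime>
  </entry>
</nvd>"""

rows = list(extract_rows(io.BytesIO(SAMPLE)))
print(rows)
```

An entry without a CWE annotation yields an empty `cwe_id` field, which pandas later reads back as a missing value.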

## Histograms by Month

Motivation will be added later.

In [2]:
# Using pandas
import pandas as pd
import csv

# Histograms 

In [9]:
# First collect the paths of all CSV files
import glob
nvd_filepaths = glob.glob("data/*.csv")
# Then prepare a list that will hold one table (dataframe) per file path
nvd_dataframes = []
for nvd_filepath in nvd_filepaths:
    # The CSVs do not contain headers, so column names are supplied here.
    # TODO: add headers to the CSVs when they are generated.
    nvd_dataframes.append(pd.read_csv(nvd_filepath, names=['cve_id', 'cwe_id', 'timestamp']))
print(nvd_dataframes[3])

For the sake of clarity, let's consider just one of the dataframes in the list to show how the coverage percentage is computed.

In [15]:
# Choose the first dataframe, at position 0
nvd_df = nvd_dataframes[0]
# Parse the timestamp field into a DatetimeIndex object, then access its month attribute
nvd_df['month'] = pd.DatetimeIndex(nvd_df['timestamp']).month
print(nvd_df)
# Now that we have a month column, group the table by month and count the non-null values in each column.
# A list of column names (double brackets) is required to select several columns from a groupby.
nvd_df = nvd_df.groupby(by=['month'])[['cve_id', 'cwe_id']].count()
nvd_df

| month | cve_id | cwe_id | timestamp |
|---|---|---|---|
| 1 | 51 | 9 | 51 |
| 2 | 83 | 17 | 83 |
| 3 | 105 | 4 | 105 |
| 4 | 56 | 6 | 56 |
| 5 | 83 | 3 | 83 |
| 6 | 129 | 5 | 129 |
| 7 | 48 | 1 | 48 |
| 8 | 215 | 8 | 215 |
| 9 | 43 | 1 | 43 |
| 10 | 96 | 5 | 96 |


In [72]:
# All that is left is to divide, row-wise, the number of cwe_ids by the number of cve_ids.
# Since the cve_ids are never null, their count is effectively the number of rows for the given month.
# The cwe_id, which can be null, is only counted when it is present.
# Dividing one by the other gives us the cwe_coverage we want for the timeseries.
nvd_df['cwe_coverage'] = nvd_df['cwe_id'] / nvd_df['cve_id']
nvd_df

| month | cve_id | cwe_id | cwe_coverage |
|---|---|---|---|
| 1 | 51 | 9 | 0.176471 |
| 2 | 83 | 17 | 0.204819 |
| 3 | 105 | 4 | 0.038095 |
| 4 | 56 | 6 | 0.107143 |
| 5 | 83 | 3 | 0.036145 |
| 6 | 129 | 5 | 0.03876 |
| 7 | 48 | 1 | 0.020833 |
| 8 | 215 | 8 | 0.037209 |
| 9 | 43 | 1 | 0.023256 |
| 10 | 96 | 5 | 0.052083 |
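The coverage calculation relies on `count()` skipping missing values. The toy example below, built from four rows of the sample 2007 data shown later in this document, is a small sketch of that behaviour; the dataframe name `toy` is only illustrative.

```python
import pandas as pd

# Four rows from the sample 2007 data; None marks a missing CWE annotation.
toy = pd.DataFrame({
    'cve_id': ['CVE-2007-0001', 'CVE-2007-0002', 'CVE-2007-0005', 'CVE-2007-0003'],
    'cwe_id': [None, 'CWE-119', 'CWE-119', None],
    'month': [3, 3, 3, 1],
})

# count() only counts non-null values, so cwe_id counts annotated rows
# while cve_id counts every row in the month.
counts = toy.groupby('month')[['cve_id', 'cwe_id']].count()
counts['cwe_coverage'] = counts['cwe_id'] / counts['cve_id']
print(counts)
```

Month 3 has three vulnerabilities, two of them annotated, so its coverage is 2/3; month 1 has one unannotated vulnerability, so its coverage is 0.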


Let's wrap the code above in a function that, given a dataframe, generates the table above with the columns cve_id, cwe_id, and cwe_coverage.

In [17]:
def calculate_cwe_coverage(nvd_df):
    # Parse the timestamp field into a DatetimeIndex object, then access its month attribute.
    # Note: this adds the 'month' column to the dataframe that was passed in.
    nvd_df['month'] = pd.DatetimeIndex(nvd_df['timestamp']).month
    # Group the table by month and count the non-null values in each selected column.
    nvd_df = nvd_df.groupby(by=['month'])[['cve_id', 'cwe_id']].count()
    nvd_df['cwe_coverage'] = nvd_df['cwe_id'] / nvd_df['cve_id']
    return nvd_df
    

Now we generate the CWE coverage table for every dataframe in our list.

In [24]:
cwe_coverage_dfs = []
for nvd_df in nvd_dataframes:
    cwe_coverage_dfs.append(calculate_cwe_coverage(nvd_df))
# Inspect the source dataframe at index 4; it now carries the 'month' column added by calculate_cwe_coverage.
nvd_dataframes[4]

| | cve_id | cwe_id | timestamp | month |
|---|---|---|---|---|
| 0 | CVE-2007-0001 | | 2007-03-02T16:18:00.000-05:00 | 3 |
| 1 | CVE-2007-0002 | CWE-119 | 2007-03-16T17:19:00.000-04:00 | 3 |
| 2 | CVE-2007-0003 | | 2007-01-23T16:28:00.000-05:00 | 1 |
| 3 | CVE-2007-0004 | CWE-264 | 2007-09-18T15:17:00.000-04:00 | 9 |
| 4 | CVE-2007-0005 | CWE-119 | 2007-03-09T19:19:00.000-05:00 | 3 |
| 5 | CVE-2007-0006 | | 2007-02-06T14:28:00.000-05:00 | 2 |
| 6 | CVE-2007-0007 | | 2007-02-19T21:28:00.000-05:00 | 2 |
| 7 | CVE-2007-0008 | CWE-189 | 2007-02-26T15:28:00.000-05:00 | 2 |
| 8 | CVE-2007-0009 | CWE-119 | 2007-02-26T15:28:00.000-05:00 | 2 |
| 9 | CVE-2007-0010 | | 2007-01-24T14:28:00.000-05:00 | 1 |


We now have all the information needed to plot our timeseries. 
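One possible way to turn the per-file tables into a single series is sketched below. It is an assumption about how the timeseries could be built, not the notebook's method: it concatenates the raw dataframes, groups by both year and month so that months from different years stay separate, and converts timestamps to UTC with `pd.to_datetime(..., utc=True)`, which may shift a timestamp close to a month boundary into the neighbouring month. The function name `monthly_cwe_coverage` is hypothetical.

```python
import pandas as pd


def monthly_cwe_coverage(frames):
    """Return CWE coverage indexed by (year, month) across all frames.

    Each frame needs 'cve_id', 'cwe_id', and 'timestamp' columns.
    The input frames are not modified.
    """
    combined = pd.concat(frames, ignore_index=True)
    # utc=True handles timestamps with mixed UTC offsets (-05:00, -04:00).
    stamps = pd.to_datetime(combined['timestamp'], utc=True)
    combined['year'] = stamps.dt.year
    combined['month'] = stamps.dt.month
    counts = combined.groupby(['year', 'month'])[['cve_id', 'cwe_id']].count()
    return counts['cwe_id'] / counts['cve_id']


# Three rows from the sample 2007 data shown above.
frame_2007 = pd.DataFrame({
    'cve_id': ['CVE-2007-0001', 'CVE-2007-0002', 'CVE-2007-0003'],
    'cwe_id': [None, 'CWE-119', None],
    'timestamp': ['2007-03-02T16:18:00.000-05:00',
                  '2007-03-16T17:19:00.000-04:00',
                  '2007-01-23T16:28:00.000-05:00'],
})
coverage = monthly_cwe_coverage([frame_2007])
print(coverage)
```

The resulting series has one value per (year, month) pair and could be passed to a plotting call such as `coverage.plot()`.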

# CWE Coverage Timeseries

Will be added later. 