# 5-2: DataFrames and Series

Now we're ready for prime time. Working with data in Pandas is some of the most important material in this course. Once you know how to do it, it will transform your analysis capabilities. Do not skip this section; do not skimp on this section. It's far too valuable for anything less than your complete attention.

Now then, let's begin.

You've already seen the creation of `DataFrame`s from `[dict]`-shaped data. Let's explore some other common uses.

## DataFrames from CSVs

This particular pipeline is probably _the most common_ ingestion method to get data into a `DataFrame`. CSV exports of logs, IoCs, etc. are made extra powerful inside of Pandas. Luckily, Pandas has a built-in method called `.read_csv()` that will take a CSV and create a `DataFrame` using a header row for column names.
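
To make the header-row behavior concrete, here's a minimal sketch that uses a tiny in-memory CSV in place of a real export. The column names and values are made up for illustration; only `pd.read_csv()` itself comes from the text above.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for a real log or IoC export
csv_text = """indicator,type,first_seen
198.51.100.7,ip,2022-01-04
evil.example.com,domain,2022-02-11
"""

# read_csv accepts a path or a file-like object; the first row becomes the column names
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Swapping the `io.StringIO(...)` object for a filename gives you the usual on-disk workflow.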

The CSV we're going to use is a list of [every CVE ever published](https://cve.mitre.org/data/downloads/index.html). But it isn't in this repo because it's too big. So we're gonna download it now and then clean it for your use.

Let's load that in now and use `.head()` to check it out.


In [39]:
# Import Pandas
import pandas as pd

# Download and set up the CVE file
! wget https://cve.mitre.org/data/downloads/allitems.csv
! tail -n +3 allitems.csv | sed "2,8d" | iconv -f iso8859-1 -t utf-8 > all_cves_utf8.csv

--2022-11-06 14:22:01--  https://cve.mitre.org/data/downloads/allitems.csv
Resolving cve.mitre.org (cve.mitre.org)... 192.52.194.205
Connecting to cve.mitre.org (cve.mitre.org)|192.52.194.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 164265545 (157M) [text/csv]
Saving to: ‘allitems.csv.1’


2022-11-06 14:22:14 (12.3 MB/s) - ‘allitems.csv.1’ saved [164265545/164265545]



In [37]:
# Use .read_csv() to create our DataFrame
# This may generate a DtypeWarning. Don't sweat it.
df = pd.read_csv("all_cves_utf8.csv")

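
If the warning does bother you, one common way around it is to tell Pandas up front how to type the columns. The sketch below assumes it's acceptable to read every column as text (that's my assumption, not something the dataset requires), and it uses a small made-up CSV rather than the real file.

```python
import io

import pandas as pd

# Made-up CSV where one column mixes numbers and text
csv_text = """Name,Votes
CVE-0000-0001,3
CVE-0000-0002,MODIFY(1) Frech
"""

# dtype=str reads every column as text, so Pandas never has to guess a type
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(typed_df["Votes"].tolist())
```

The warning message itself also suggests `low_memory=False` as an alternative, which lets Pandas infer types from the whole file at once.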


In [38]:
# Look at the first 5 rows with .head()
df.head()

,Name,Status,Description,References,Phase,Votes,Comments
0,CVE-1999-0001,Candidate,ip_input.c in BSD-derived TCP/IP implementatio...,BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ...,Modified (20051217),"MODIFY(1) Frech | NOOP(2) Northcutt, W...",Christey> A Bugtraq posting indicates that the...
1,CVE-1999-0002,Entry,Buffer overflow in NFS mountd gives root acces...,BID:121 | URL:http://www.securityfocus.com...,,,
2,CVE-1999-0003,Entry,Execute commands as root via buffer overflow i...,BID:122 | URL:http://www.securityfocus.com...,,,
3,CVE-1999-0004,Candidate,"MIME buffer overflow in email clients, e.g. So...",CERT:CA-98.10.mime_buffer_overflows | MS:M...,Modified (19990621),"ACCEPT(8) Baker, Cole, Collins, Dik, Landfi...","Frech> Extremely minor, but I believe e-mail i..."
4,CVE-1999-0005,Entry,Arbitrary command execution via IMAP buffer ov...,BID:130 | URL:http://www.securityfocus.com...,,,


## What's in a DataFrame?

A `DataFrame` contains multitudes. Understanding how each part functions will make manipulating them to do our bidding much, much easier.

### Indices

If you look closely at the output of `df.head()` above, you'll see that just to the left of the `Name` column, there appears to be an extra column. What gives?! We didn't define that!

Indeed we did not. Every DataFrame requires an **Index**, which is how the individual rows in the DataFrame are referenced. Any column containing unique values can be an Index, but ideally it's one containing sensible sequential values. We can define an index column when we create the DataFrame, but if we don't, Pandas will create one for us. Let's look at `df.index` to see what it is.

In [10]:
df.index

RangeIndex(start=0, stop=253434, step=1)
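
Here's a small sketch of both cases: letting Pandas create the Index, and supplying one at creation time. The hosts, counts, and labels are invented for the example.

```python
import pandas as pd

# Made-up records for illustration
records = [
    {"host": "web01", "alerts": 3},
    {"host": "db01", "alerts": 0},
    {"host": "mail01", "alerts": 7},
]

# No index given: Pandas builds a RangeIndex for us
default_df = pd.DataFrame(records)
print(default_df.index)

# Index supplied at creation time
labeled_df = pd.DataFrame(records, index=["a", "b", "c"])
print(labeled_df.index.tolist())
```

When loading from a file, `pd.read_csv()` has an `index_col` parameter that serves the same purpose.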

Indices are important because they are the _way we access rows_. It's important to not think of indices in DataFrames the same way we think of them in a list. They don't really work the same way. Watch what happens when we try to use the list indexing syntax with a DataFrame:

In [11]:
# Try to get the first row...or will we?
df[0]

KeyError: 0

Yeah, that doesn't work. Square brackets with a single key look up a column label, and we have no column named `0`. To access rows, Pandas DataFrames have their own unique syntax, built on two properties: `.loc` and `.iloc`. They look a little strange in practice.
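
To see why the list-style lookup fails, here's a sketch on a toy DataFrame (the CVE-style names are placeholders, not real entries). A single key in square brackets is treated as a column label.

```python
import pandas as pd

# Placeholder data, not real CVE entries
toy_df = pd.DataFrame({"Name": ["CVE-A", "CVE-B"], "Status": ["Entry", "Candidate"]})

# Square brackets look up a *column* label, and there is no column named 0
try:
    toy_df[0]
except KeyError as err:
    print("KeyError:", err)
    lookup_failed = True
else:
    lookup_failed = False

# A column label that exists works fine
names = toy_df["Name"]
print(names.tolist())
```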

Let's try `.loc` first.

In [13]:
# Access the first row
df.loc[0]

Name                                               CVE-1999-0001
Status                                                 Candidate
Description    ip_input.c in BSD-derived TCP/IP implementatio...
References     BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ...
Phase                                        Modified (20051217)
Votes             MODIFY(1) Frech  |     NOOP(2) Northcutt, W...
Comments       Christey> A Bugtraq posting indicates that the...
Name: 0, dtype: object

But that's not all `.loc` can do. We can pass column names as well after a comma! Let's say we wanted the `Description` column.

In [14]:
df.loc[0, "Description"]

'ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.'

That's all we get! What if we want more than one col? List 'em.

In [15]:
df.loc[0, ["Name", "Description"]]

Name                                               CVE-1999-0001
Description    ip_input.c in BSD-derived TCP/IP implementatio...
Name: 0, dtype: object

Okay, but what if we want multiple rows? We can either list them or use a slice syntax.

In [19]:
# Access the first 10 rows (.loc slices include the end label, so 0:9 is 10 rows)
# We can also use a list of index values
df.loc[0:9, ["Name", "Description"]]

,Name,Description
0,CVE-1999-0001,ip_input.c in BSD-derived TCP/IP implementatio...
1,CVE-1999-0002,Buffer overflow in NFS mountd gives root acces...
2,CVE-1999-0003,Execute commands as root via buffer overflow i...
3,CVE-1999-0004,"MIME buffer overflow in email clients, e.g. So..."
4,CVE-1999-0005,Arbitrary command execution via IMAP buffer ov...
5,CVE-1999-0006,Buffer overflow in POP servers based on BSD/Qu...
6,CVE-1999-0007,Information from SSL-encrypted sessions via PK...
7,CVE-1999-0008,"Buffer overflow in NIS+, in Sun's rpc.nisd pro..."
8,CVE-1999-0009,Inverse query buffer overflow in BIND 4.9 and ...
9,CVE-1999-0010,Denial of Service vulnerability in BIND 8 Rele...


Notice that we get a much nicer output when we select multiple rows. That's because several rows come back as a DataFrame, while a single row comes back as a Series.
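
The slice and list forms of `.loc` can be sketched on a toy DataFrame like this (placeholder values again). The thing to watch is that label slices include both endpoints, unlike list slices.

```python
import pandas as pd

# Placeholder data, not real CVE entries
toy_df = pd.DataFrame(
    {
        "Name": ["CVE-A", "CVE-B", "CVE-C", "CVE-D"],
        "Status": ["Entry", "Entry", "Candidate", "Entry"],
    }
)

# Label slices with .loc include BOTH endpoints, so 0:2 is three rows
sliced = toy_df.loc[0:2, ["Name", "Status"]]
print(len(sliced))

# A list of index labels picks out exactly those rows
picked = toy_df.loc[[0, 3], "Name"]
print(picked.tolist())

# One row label gives a Series; several rows give a DataFrame
print(type(toy_df.loc[0]).__name__)
print(type(sliced).__name__)
```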

`.iloc` works similarly, except that it uses integer positions for access rather than labels, and its slices exclude the end position just like list slices do. In the case of our Index, position and label are one and the same, but watch how we can access columns with `.iloc`.

In [25]:
# Using .iloc to access data
df.iloc[0:10, 0:3]

,Name,Status,Description
0,CVE-1999-0001,Candidate,ip_input.c in BSD-derived TCP/IP implementatio...
1,CVE-1999-0002,Entry,Buffer overflow in NFS mountd gives root acces...
2,CVE-1999-0003,Entry,Execute commands as root via buffer overflow i...
3,CVE-1999-0004,Candidate,"MIME buffer overflow in email clients, e.g. So..."
4,CVE-1999-0005,Entry,Arbitrary command execution via IMAP buffer ov...
5,CVE-1999-0006,Entry,Buffer overflow in POP servers based on BSD/Qu...
6,CVE-1999-0007,Entry,Information from SSL-encrypted sessions via PK...
7,CVE-1999-0008,Entry,"Buffer overflow in NIS+, in Sun's rpc.nisd pro..."
8,CVE-1999-0009,Entry,Inverse query buffer overflow in BIND 4.9 and ...
9,CVE-1999-0010,Entry,Denial of Service vulnerability in BIND 8 Rele...
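
The difference between the two accessors is easier to see when the Index isn't a plain integer range. This is a sketch with invented labels and values: `.iloc` counts positions, `.loc` matches labels.

```python
import pandas as pd

# Placeholder labels and values for illustration
toy_df = pd.DataFrame(
    {"Status": ["Entry", "Candidate", "Entry"], "Phase": ["x", "y", "z"]},
    index=["CVE-A", "CVE-B", "CVE-C"],
)

# .iloc is purely positional and, like list slicing, excludes the end position
first_two = toy_df.iloc[0:2, 0:1]
print(first_two.shape)

# .loc uses labels and includes the end label
by_label = toy_df.loc["CVE-A":"CVE-B", "Status":"Status"]
print(by_label.shape)

# Both selections pick out the same cells
print(first_two.equals(by_label))
```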


In truth, I rarely access data directly in this way. The whole point is to manipulate the data at scale, so it's uncommon for me to need to access specific rows.

But before we depart Indices, I want to stress the value of using a custom Index rather than the default. Depending on your data shape, this can be incredibly convenient. One common example is if your dataset has (completely) unique timestamps. In that case, you can convert the timestamp column to a `DatetimeIndex` and sort your data chronologically with `.sort_index()`. This also gives you the power to group data by hour, day, month, etc.
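
As a rough sketch of the timestamp case, assuming a toy set of log events (the timestamps and event names are made up):

```python
import pandas as pd

# Made-up log events, deliberately out of order
events = pd.DataFrame(
    {
        "timestamp": ["2022-11-06 14:22:01", "2022-11-05 09:15:30", "2022-11-06 14:40:12"],
        "event": ["login", "scan", "logout"],
    }
)

# Parse the strings into real datetimes, then use them as the index
events["timestamp"] = pd.to_datetime(events["timestamp"])
events = events.set_index("timestamp").sort_index()

print(type(events.index).__name__)
print(events["event"].tolist())

# Count events per hour of the day
per_hour = events.groupby(events.index.hour).size()
print(per_hour.to_dict())
```

Grouping on `events.index.hour` is just one option; the same index also exposes `.day`, `.month`, and friends.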

Our data happens to have unique values in the `Name` column. So if we wanted to, we could use that as our index. Let's try it and see the difference.

We can change the index of a `DataFrame` with `.set_index()`. I want you to look at the [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) for this method because it introduces a common pattern in Pandas: normally, when we make a global change to a DataFrame, Pandas will return a new DataFrame instead of mutating the original. This is for data integrity and is a good idea! However, if you know for sure you want to change the original, you can often pass `inplace=True` as an optional argument to the method.

But we won't be doing that right now.
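
For reference, here's what the two behaviors look like side by side on a throwaway DataFrame (placeholder values), so you can see what `inplace=True` actually changes:

```python
import pandas as pd

# Placeholder data, not real CVE entries
toy_df = pd.DataFrame({"Name": ["CVE-A", "CVE-B"], "Status": ["Entry", "Candidate"]})

# Default behavior: a new DataFrame comes back and the original is untouched
indexed = toy_df.set_index("Name")
print(toy_df.columns.tolist())
print(indexed.columns.tolist())

# inplace=True mutates the DataFrame it is called on and returns None
scratch = toy_df.copy()
result = scratch.set_index("Name", inplace=True)
print(result)
print(scratch.index.name)
```

Because the in-place call returns `None`, assigning its result back to a variable is a classic way to lose your DataFrame.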

In [26]:
# Create cve index
cve_idx_df = df.set_index("Name")
cve_idx_df.head()

Name,Status,Description,References,Phase,Votes,Comments
CVE-1999-0001,Candidate,ip_input.c in BSD-derived TCP/IP implementatio...,BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ...,Modified (20051217),"MODIFY(1) Frech | NOOP(2) Northcutt, W...",Christey> A Bugtraq posting indicates that the...
CVE-1999-0002,Entry,Buffer overflow in NFS mountd gives root acces...,BID:121 | URL:http://www.securityfocus.com...,,,
CVE-1999-0003,Entry,Execute commands as root via buffer overflow i...,BID:122 | URL:http://www.securityfocus.com...,,,
CVE-1999-0004,Candidate,"MIME buffer overflow in email clients, e.g. So...",CERT:CA-98.10.mime_buffer_overflows | MS:M...,Modified (19990621),"ACCEPT(8) Baker, Cole, Collins, Dik, Landfi...","Frech> Extremely minor, but I believe e-mail i..."
CVE-1999-0005,Entry,Arbitrary command execution via IMAP buffer ov...,BID:130 | URL:http://www.securityfocus.com...,,,


Look ma, no integers! This also means that our use of `.loc` will have to change. Let's try for one of my favorites:

In [28]:
cve_idx_df.loc["CVE-2019-19781"]

Status                                                 Candidate
Description    An issue was discovered in Citrix Application ...
References     CERT-VN:VU#619785   |   URL:https://www.kb.cer...
Phase                                        Assigned (20191213)
Votes                          None (candidate not yet proposed)
Comments                                                     NaN
Name: CVE-2019-19781, dtype: object

Aw yiss.
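
The same label-based lookups work with lists of labels and with columns. Here's a sketch on a toy DataFrame with a string Index (the labels and descriptions are placeholders):

```python
import pandas as pd

# Placeholder labels and descriptions for illustration
toy_df = pd.DataFrame(
    {"Status": ["Entry", "Candidate", "Entry"], "Description": ["first", "second", "third"]},
    index=pd.Index(["CVE-A", "CVE-B", "CVE-C"], name="Name"),
)

# A row label plus a column label returns a single value
print(toy_df.loc["CVE-B", "Description"])

# A list of row labels returns a DataFrame with just those rows
subset = toy_df.loc[["CVE-A", "CVE-C"], ["Description"]]
print(subset.index.tolist())

# Labels that are not in the index raise KeyError, so check membership first if unsure
print("CVE-Z" in toy_df.index)
```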

Why don't you try looking for _your_ favorite CVEs with `.loc`?

In [None]:
# Look for your favorite CVEs here! Try to get just the Status and Description columns
# (Name is the index now, so it's the row label rather than a column)
cve_idx_df.loc[]

## Series

Take a DataFrame and smash it apart, what would you get? Series! Each column is a Series, but don't think of them as just glorified lists. Each Series has many of the same capabilities as a DataFrame—they even have their own Index!

We can access Series/columns with a bunch of different syntaxes. We can use a `dict`-like square brace syntax...

In [31]:
df["Name"]

0          CVE-1999-0001
1          CVE-1999-0002
2          CVE-1999-0003
3          CVE-1999-0004
4          CVE-1999-0005
               ...      
253429    CVE-2023-21414
253430    CVE-2023-21415
253431    CVE-2023-21416
253432    CVE-2023-21417
253433    CVE-2023-21418
Name: Name, Length: 253434, dtype: object

...a list of them will work as well:

In [32]:
df[["Name", "Description"]]

,Name,Description
0,CVE-1999-0001,ip_input.c in BSD-derived TCP/IP implementatio...
1,CVE-1999-0002,Buffer overflow in NFS mountd gives root acces...
2,CVE-1999-0003,Execute commands as root via buffer overflow i...
3,CVE-1999-0004,"MIME buffer overflow in email clients, e.g. So..."
4,CVE-1999-0005,Arbitrary command execution via IMAP buffer ov...
...,...,...
253429,CVE-2023-21414,** RESERVED ** This candidate has been reserve...
253430,CVE-2023-21415,** RESERVED ** This candidate has been reserve...
253431,CVE-2023-21416,** RESERVED ** This candidate has been reserve...
253432,CVE-2023-21417,** RESERVED ** This candidate has been reserve...


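Or we can use a dot notation, as long as the column name is a valid Python identifier (no spaces, for example) and doesn't clash with an existing DataFrame attribute or method: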

In [35]:
df.Name

0          CVE-1999-0001
1          CVE-1999-0002
2          CVE-1999-0003
3          CVE-1999-0004
4          CVE-1999-0005
               ...      
253429    CVE-2023-21414
253430    CVE-2023-21415
253431    CVE-2023-21416
253432    CVE-2023-21417
253433    CVE-2023-21418
Name: Name, Length: 253434, dtype: object
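
A quick sketch of where dot notation works and where it doesn't. The `Base Score` and `count` columns are hypothetical; they're not in the CVE data and exist here only to show the two failure cases.

```python
import pandas as pd

# "Base Score" and "count" are hypothetical columns chosen to trip up dot notation
toy_df = pd.DataFrame(
    {"Name": ["CVE-A", "CVE-B"], "Base Score": [9.8, 5.3], "count": [4, 1]}
)

# Dot access works when the column name is a valid Python identifier...
print(toy_df.Name.tolist())

# ...but a name with a space only works with brackets
print(toy_df["Base Score"].tolist())

# ...and a name that collides with a DataFrame method returns the method, not the column
print(callable(toy_df.count))
print(toy_df["count"].tolist())

# Single brackets give a Series; a list inside brackets gives a DataFrame
print(type(toy_df["Name"]).__name__, type(toy_df[["Name"]]).__name__)
```

Bracket access always works, which is why it's the safer default in scripts.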

Note that the Series outputs above print with an integer column on the left. That's because Series also have an Index, which you normally don't want to mess with. But it's there!

Just take a look at everything a [Series](https://pandas.pydata.org/docs/reference/series.html) has inside of it to give you an idea of the capabilities here.
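
As a small taste, here's a sketch showing that a column pulled out of a DataFrame keeps its Index, plus a couple of the built-in Series methods. The data is a placeholder, and the methods shown are just two picks from a very long list.

```python
import pandas as pd

# Placeholder labels and values for illustration
toy_df = pd.DataFrame(
    {"Status": ["Entry", "Candidate", "Entry"]},
    index=["CVE-A", "CVE-B", "CVE-C"],
)

status = toy_df["Status"]

# The Series shares its parent DataFrame's Index
print(status.index.equals(toy_df.index))

# So label-based access works on a Series too
print(status.loc["CVE-B"])

# A couple of the many built-in Series methods
print(status.value_counts().to_dict())
print(status.str.upper().tolist())
```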

## Check For Understanding

These won't be tested, but try these challenges yourself to see if you have grasped DataFrames and the basics of data access.

1. Use `.iloc` on `df` to access the `Description` and `Comments` columns of rows `30-40`.
2. Use `.loc` to find the details on the Follina vulnerability. No, I won't tell you the CVE. Go find it!
3. Access just the `Comments` Series using dot notation.
4. Access the `Name` and `Votes` Series using brace notation.

And that'll do it for this intro to Pandas! Up next, filtering and data aggregation!