# 5-4: Quantitative Analysis

Let's get down to business. In this lesson, we're working with an actual, factual, malware artifact. One that you'll want to be able to manipulate easily: a packet capture. The packet capture comes from [this analysis](https://tria.ge/221108-bzzf5sdacm/).

You won't always have the luxury of a PCAP, but when you do, analyzing it with Jupyter gives us superpowers.

To extract data from a PCAP in Python, we use the `scapy` library. Let's import that and pull in the packets with the built-in `rdpcap()` function.

In [19]:
# Import Scapy stuff
from scapy.all import *

# Get packets
packets = rdpcap("emo.pcapng")

Depending on the packet type, there will be different information available. The data is separated into OSI-model layers.

Let's start by making a DataFrame of all `IP` packets to get general information about the TCP/IP conversations in the PCAP. To do so, we'll use the `.haslayer()` method to check each packet for an `IP` layer and the `.getlayer()` method to retrieve it.

In [67]:
# IP packets
ip_packets = [p.getlayer(IP) for p in packets if p.haslayer(IP)]

Let's examine the first packet to see what we're dealing with.

In [68]:
ip_packets[0]

<IP  version=4 ihl=5 tos=0x0 len=77 id=36144 flags= frag=0 ttl=128 proto=udp chksum=0x9257 src=10.127.0.138 dst=8.8.8.8 |<UDP  sport=53361 dport=domain len=57 chksum=0xf875 |<DNS  id=24635 qr=0 opcode=QUERY aa=0 tc=0 rd=1 ra=0 z=0 ad=0 cd=0 rcode=ok qdcount=1 ancount=0 nscount=0 arcount=0 qd=<DNSQR  qname='settings-win.data.microsoft.com.' qtype=A qclass=IN |> an=None ns=None ar=None |>>>

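Kinda hard to read at first, but the `|` separates the layers of the packet. The `IP` layer carries `src`, `dst`, and `len`, while `sport` and `dport` belong to the transport layer beneath it (`UDP` here). Scapy lets us read all of them from the `IP` object, because a field lookup falls through to the payload layers when the current layer doesn't have it. In this packet, the layer after `UDP` is the DNS application data, which contains the DNS query, among other things.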

But for now, we're just concerned with the `IP` layer. Now that we know the names of the properties, we can access them directly. Let's make a list of `dict`s with this information to produce a `DataFrame`.

In [69]:
# Import Pandas
import pandas as pd

# Create IP data dicts
ip_data = [{"src": p.src, "dst": p.dst, "sport": p.sport, "dport": p.dport, "len": p.len} for p in ip_packets]

# Generate DataFrame
ip_df = pd.DataFrame(ip_data)

In [70]:
# Review the IP DataFrame
ip_df

,src,dst,sport,dport,len
0,10.127.0.138,8.8.8.8,53361,53,77
1,8.8.8.8,10.127.0.138,53,53361,211
2,10.127.0.138,51.104.136.2,49727,443,52
3,51.104.136.2,10.127.0.138,443,49727,52
4,10.127.0.138,51.104.136.2,49727,443,40
...,...,...,...,...,...
2040,209.197.3.8,10.127.0.138,80,49755,40
2041,10.127.0.138,209.197.3.8,49755,80,40
2042,10.127.0.138,209.197.3.8,49755,80,40
2043,209.197.3.8,10.127.0.138,80,49755,40


Even without the later layers, there's a lot we can do with this data. We can begin with some **research questions**.

1. What source transferred the most bytes?
2. What destination ports are in play?
3. What are the external IP addresses?

## Grouping and Slicing for Truth

Just as we've done before, we'll begin by grouping our data by a field, in this case `src`. But instead of `count()`, finally, we have a reason for another aggregator. We want to add up the `len` field, so `sum()` is our choice.

**Note**: the `numeric_only` argument for `sum()` tells Pandas to aggregate only the numeric columns.
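
Before running this on the capture, here's a minimal sketch of the difference between `count()` and `sum()` on a toy frame. The rows and the `toy_df` name are made up for illustration; only the column names mirror `ip_df`.

```python
import pandas as pd

# Made-up rows shaped like ip_df (illustrative values, not from the PCAP)
toy_df = pd.DataFrame(
    [
        {"src": "10.0.0.5", "dst": "198.51.100.7", "len": 60},
        {"src": "10.0.0.5", "dst": "198.51.100.7", "len": 40},
        {"src": "198.51.100.7", "dst": "10.0.0.5", "len": 1500},
    ]
)

# count() answers "how many packets?", sum() answers "how many bytes?"
packet_counts = toy_df.groupby("src")["len"].count()
byte_totals = toy_df.groupby("src")["len"].sum().sort_values(ascending=False)

print(packet_counts)
print(byte_totals)
```

The source with the most packets isn't necessarily the one with the most bytes, which is why the aggregator matters here.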

In [76]:
# Group by src and sum up len
ip_df.groupby("src").sum(numeric_only=True).sort_values(by="len", ascending=False)[["len"]]

src,len
204.79.197.200,646697
10.127.0.138,151852
209.197.3.8,66740
40.126.31.71,47942
204.79.197.203,46275
51.104.136.2,41502
52.109.13.64,19956
204.79.197.222,16059
104.80.224.44,14625
52.152.110.14,9770


Now of course, this isn't super informative. We can do a 2-dimensional group to see the largest conversations.

In [77]:
# Group by src and dst
ip_df.groupby(["src", "dst"]).sum(numeric_only=True).sort_values(by="len", ascending=False)[["len"]]

src,dst,len
204.79.197.200,10.127.0.138,646697
10.127.0.138,204.79.197.200,94563
209.197.3.8,10.127.0.138,66740
40.126.31.71,10.127.0.138,47942
204.79.197.203,10.127.0.138,46275
51.104.136.2,10.127.0.138,41502
52.109.13.64,10.127.0.138,19956
10.127.0.138,40.126.31.71,18739
204.79.197.222,10.127.0.138,16059
104.80.224.44,10.127.0.138,14625


That's better. If we group by all 4 fields, we start to see conversation sizes. 

We'll need to expand our max rows to see them all.

In [83]:
# Expand max rows
pd.set_option("display.max_rows", 150)

# Group by src, dst, sport, and dport, then sum up len
ip_df.groupby(["src", "dst","sport", "dport"]).sum(numeric_only=True).sort_values(by="len", ascending=False)[["len"]]

src,dst,sport,dport,len
204.79.197.200,10.127.0.138,443,49758,562361
204.79.197.200,10.127.0.138,443,49745,84336
10.127.0.138,204.79.197.200,49745,443,76946
209.197.3.8,10.127.0.138,80,49755,65369
204.79.197.203,10.127.0.138,443,49737,46275
52.109.13.64,10.127.0.138,443,49762,19956
10.127.0.138,204.79.197.200,49758,443,17617
40.126.31.71,10.127.0.138,443,49740,17552
40.126.31.71,10.127.0.138,443,49731,17552
51.104.136.2,10.127.0.138,443,49742,16304


Now we can see the conversations. Looks like a lot of HTTPS traffic, which is unsurprising.
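
A four-key `groupby()` returns a frame with a four-level index. If a flat table is easier to filter afterwards, one option is `as_index=False`, sketched here on made-up rows (`toy_df` and `flows` are illustrative names, not part of the lesson's notebook).

```python
import pandas as pd

# Made-up packets: two belong to the same flow, two are distinct flows
toy_df = pd.DataFrame(
    {
        "src": ["10.0.0.5", "10.0.0.5", "198.51.100.7", "10.0.0.5"],
        "dst": ["198.51.100.7", "198.51.100.7", "10.0.0.5", "203.0.113.9"],
        "sport": [50000, 50000, 443, 50001],
        "dport": [443, 443, 50000, 80],
        "len": [60, 40, 1500, 52],
    }
)

# as_index=False keeps the group keys as ordinary columns
flows = (
    toy_df.groupby(["src", "dst", "sport", "dport"], as_index=False)["len"]
    .sum()
    .sort_values(by="len", ascending=False)
    .reset_index(drop=True)
)

print(flows)
```

With the keys as regular columns, a follow-up filter such as `flows[flows.dport == 443]` works without touching the index.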

It's a little messy to look at the conversations bidirectionally. If we want to see outbound communications, we can use the IP pattern to slice our dataframe to only those.

In [85]:
# Filter for internal sources only
outbound_df = ip_df[ip_df.src.str.startswith("10.")]

# Show the outbound comms grouped by src, dst, and dport. No need for sport.
outbound_df.groupby(["src", "dst","dport"]).sum(numeric_only=True).sort_values(by="len", ascending=False)[["len"]]

src,dst,dport,len
10.127.0.138,204.79.197.200,443,94563
10.127.0.138,40.126.31.71,443,18739
10.127.0.138,51.104.136.2,443,9605
10.127.0.138,209.197.3.8,80,4318
10.127.0.138,204.79.197.203,443,3836
10.127.0.138,52.152.110.14,443,3614
10.127.0.138,204.79.197.222,443,2521
10.127.0.138,104.80.224.44,443,1856
10.127.0.138,52.123.128.254,443,1713
10.127.0.138,13.107.42.254,443,1663


Much cleaner, especially with only one source!
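
The `startswith("10.")` filter works for this capture because the only internal host lives in `10.0.0.0/8`. For captures with other private ranges, a hedged alternative is Python's built-in `ipaddress` module. The helper name `is_internal` and the sample addresses below are assumptions for illustration.

```python
import ipaddress

import pandas as pd

# Made-up sources covering several private ranges plus one public address
toy_df = pd.DataFrame(
    {
        "src": ["10.0.0.5", "192.168.1.20", "8.8.8.8", "172.16.4.1"],
        "len": [60, 40, 1500, 52],
    }
)


def is_internal(ip: str) -> bool:
    """Hypothetical helper: True if the ipaddress module treats the address as private."""
    return ipaddress.ip_address(ip).is_private


# Build a boolean mask with a list comprehension, then slice with it
internal_mask = [is_internal(ip) for ip in toy_df["src"]]
outbound_toy = toy_df[internal_mask]

print(outbound_toy)
```

Note that `is_private` is broader than the RFC 1918 ranges (it also covers loopback and link-local addresses, for example), so check that this matches what you mean by "internal".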

And with that, we've answered our first research questions.

## Assessing DNS Data

For our next trick, we want to extract the DNS data from our packets. We'll use the same method we did before, looking for the `DNS` layer with `.haslayer()`.

In [93]:
# Get DNS packets only
dns_packets = [p for p in packets if p.haslayer(DNS)]

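DNS query data lives in the `qd.qname` property. The values are `bytes`, so a little `.decode()` is appropriate here. We can use that to build our `DataFrame`. One caveat: `p.id` on a full packet returns the first `id` field Scapy finds, which is the `IP` header's identification value rather than the DNS transaction ID (use `p[DNS].id` for that). Since we only count rows here, either works.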

Let's try to do it as a one-liner this time.

In [100]:
dns_df = pd.DataFrame([{"id": p.id, "query": p.qd.qname.decode()} for p in dns_packets])
dns_df.head()

,id,query
0,36144,settings-win.data.microsoft.com.
1,24603,settings-win.data.microsoft.com.
2,36145,login.live.com.
3,63448,login.live.com.
4,36146,ocsp.digicert.com.


Looking good! Now, let's ask some questions.

1. What were our top queries?
2. What were the oddball queries?

Those are the common questions. Don't sleep on the rare/oddball queries. That's often where you'll find evil. Never ignore the bottom of the stack.

Hopefully this is getting familiar now. We'll group by `query` and aggregate with a `count()`.

In [102]:
# Count of each query
dns_df.groupby("query").count().sort_values(by="id", ascending=False)

query,id
settings-win.data.microsoft.com.,4
slscr.update.microsoft.com.,4
api.msn.com.,2
bdeeb3d3f2ce8abffd84fc3a380fc37c.clo.footprintdns.com.,2
crl3.digicert.com.,2
dual-s-ring.msedge.net.,2
fe3cr.delivery.mp.microsoft.com.,2
fp.msedge.net.,2
fs.microsoft.com.,2
l-ring.msedge.net.,2


So uh, one of those sticks out, huh? By the way, a solid detection opportunity is for max-length DNS queries (253 chars).
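
Both ideas, checking the bottom of the stack and flagging long queries, can be sketched on a toy frame. The rows are made up apart from the long query copied from the output above, and the threshold of 40 characters is an arbitrary assumption chosen for the example, not a recommended detection value. This sketch also swaps in `value_counts()` as a shorthand for the `groupby().count()` used above.

```python
import pandas as pd

# Toy DNS queries; names mirror the shape of dns_df
toy_dns = pd.DataFrame(
    {
        "query": [
            "login.live.com.",
            "login.live.com.",
            "bdeeb3d3f2ce8abffd84fc3a380fc37c.clo.footprintdns.com.",
            "ocsp.digicert.com.",
        ]
    }
)

# Bottom of the stack: queries seen the fewest times
counts = toy_dns["query"].value_counts()
rare = counts[counts == counts.min()]

# Length check: the count includes the trailing dot of the FQDN
LONG_QUERY_THRESHOLD = 40  # arbitrary value for this sketch
toy_dns["query_len"] = toy_dns["query"].str.len()
long_queries = toy_dns[toy_dns["query_len"] > LONG_QUERY_THRESHOLD]

print(rare)
print(long_queries)
```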

But is the long DNS query actually malicious, or just weird? We can use `whois` right from the Notebook to find out.

In [106]:
# Whois that weird domain's owner?
! whois footprintdns.com | grep Organization

Registrant Organization: Microsoft Corporation
Admin Organization: Microsoft Corporation
Tech Organization: Microsoft Corporation


What da—Microsoft??

Yeah, it's some weird tracking thing they do. It looks gnarly but is in fact legitimate.

So DNS isn't telling us much, but that in itself can be a clue! If DNS shows nothing odd, then perhaps communication was done directly via IP!

## IP Analysis/Data Enrichment

Of course we have all the IP data from these communications. It'd be nice if we could add `whois` data like the above to each IP. And what if we could run each against VirusTotal for any information?

We can.

Let's start with `outbound_df`, which handily already has our external IPs for us. We just want the unique IPs, so we don't need every row in that `DataFrame`. A `groupby()` on `dst` will do nicely: its index holds exactly one entry per IP. That index is an `Index` object, but we can get the raw values with the `.values` property.

You might notice that the result is not a `list`, but an `array`. This is an object from the `numpy` library. It has some more capabilities than a `list`, but works similarly enough for our purposes.

We're going to build a new `DataFrame` column by column, which we haven't done before. To do this it's imperative that each list or Series that we add is the same length. We'll base everything off our `outbound_ips` array.

In [115]:
# Get just unique destination IPs. Exclude the first entry as that's our internal IP
outbound_ips = outbound_df.groupby(["dst"]).count().index.values[1:]

In [117]:
# Show the IPs array
outbound_ips

array(['104.80.224.44', '13.107.42.254', '182.162.143.56', '2.18.109.224',
       '20.108.172.194', '204.79.197.200', '204.79.197.203',
       '204.79.197.222', '209.197.3.8', '239.255.255.250', '40.126.31.71',
       '51.104.136.2', '52.109.13.64', '52.123.128.254', '52.152.110.14',
       '52.242.97.97', '72.21.91.29', '8.8.8.8', '93.184.220.29'],
      dtype=object)

Now that we have the IPs isolated, let's build our new `DataFrame`. We'll pass the constructor a slightly different object than before. Instead of a list of dicts, we'll pass a single dict with the key as a column name, and the value as the column values.
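
To make the two constructor shapes concrete, here's a small sketch showing that a list of dicts (row by row) and a dict of lists (column by column) produce the same `DataFrame`, and what happens when column lengths differ. The IPs and country codes are made up for illustration.

```python
import pandas as pd

ips = ["192.0.2.1", "198.51.100.7", "203.0.113.9"]
countries = ["US", "KR", "NL"]  # made-up values

# Row by row: one dict per row
by_rows = pd.DataFrame([{"ip": i, "country": c} for i, c in zip(ips, countries)])

# Column by column: one key per column
by_cols = pd.DataFrame({"ip": ips, "country": countries})

print(by_rows.equals(by_cols))

# Columns of different lengths are rejected
try:
    pd.DataFrame({"ip": ips, "country": countries[:2]})
except ValueError as err:
    print(f"Mismatched column lengths: {err}")
```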

In [163]:
# Begin the DataFrame with our outbound_ips
ips_df = pd.DataFrame({"ip": outbound_ips})
ips_df

,ip
0,104.80.224.44
1,13.107.42.254
2,182.162.143.56
3,2.18.109.224
4,20.108.172.194
5,204.79.197.200
6,204.79.197.203
7,204.79.197.222
8,209.197.3.8
9,239.255.255.250


Now, let's enrich this data. We'll start with `whois`. There is a [python-whois](https://pypi.org/project/python-whois/) library, but this is Jupyter! We can use shell commands to do this. We'll look for parts of the `whois` data with the word `country`, case-insensitive.

In [164]:
# Initialize the list of results
whois_data: list = []

for i in outbound_ips:
    # Use whois shell command
    whois_result = ! whois {i} | grep -i country
    whois_data.append(whois_result)

In [165]:
# Set the `whois_country` column
ips_df["whois_country"] = whois_data
ips_df

,ip,whois_country
0,104.80.224.44,"[Country: US, Country: NL]"
1,13.107.42.254,[Country: US]
2,182.162.143.56,"[country: KR, country: KR, count..."
3,2.18.109.224,[country: EU]
4,20.108.172.194,[Country: US]
5,204.79.197.200,[Country: US]
6,204.79.197.203,[Country: US]
7,204.79.197.222,[Country: US]
8,209.197.3.8,"[Country: US, Country: US]"
9,239.255.255.250,[]


Look at that! Geo data! And one distinct outlier.
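
The raw `grep` lines are fine for eyeballing, but they repeat and vary in case. If you'd like a tidier column, one possible cleanup is sketched below. The `country_codes` helper is hypothetical, and the sample lines only imitate the `Country:   US` shape that `whois` typically prints.

```python
def country_codes(whois_lines):
    """Hypothetical helper: reduce 'Country: XX' lines to sorted unique codes."""
    codes = set()
    for line in whois_lines:
        # Split on the first colon only: "country:   KR" -> ("country", "   KR")
        key, _, value = line.partition(":")
        if key.strip().lower() == "country" and value.strip():
            codes.add(value.strip().upper())
    return sorted(codes)


# Made-up samples shaped like the grep output
print(country_codes(["Country:        US", "Country:        NL"]))
print(country_codes(["country:        KR", "country:        KR"]))
print(country_codes([]))
```

Applied to the real results, this could look like `[country_codes(r) for r in whois_data]`, which keeps the list the same length as `outbound_ips`.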

But we're not done just yet. Remember back in 4-2, when we used the VirusTotal API? Let's try searching for each of these IPs and saving the data in a new column.

We'll start by importing what we need for VirusTotal.

In [152]:
# import the VT library
import vt
import nest_asyncio
from getpass import getpass
nest_asyncio.apply()

In [153]:
vt_api_key = getpass("VirusTotal API Key:")

VirusTotal API Key: ········


In [155]:
# Instantiate the VT Client
client = vt.Client(vt_api_key)

### `.apply()`

To create our new column, we're going to introduce a new technique: `.apply()`. This `Series` method applies a given function to each element of a `Series`, resulting in a new one. We can create a new column this way.

`apply()` takes a function as an argument. That function, in turn, should be written to accept the `Series` values. You can write the function ahead of time or, if the function is small, you can use the `lambda` syntax to write a function in-place.
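
Here's a minimal sketch of both styles without touching the network. The `FAKE_STATS` table and `lookup_stats` function are stand-ins invented for this example; they are not the VirusTotal API.

```python
import pandas as pd

# Made-up lookup table standing in for an API call
FAKE_STATS = {
    "192.0.2.1": {"harmless": 80, "malicious": 0, "suspicious": 0},
    "198.51.100.7": {"harmless": 60, "malicious": 12, "suspicious": 1},
}


def lookup_stats(ip: str) -> dict:
    """Hypothetical stand-in for an enrichment lookup."""
    return FAKE_STATS.get(ip, {"harmless": 0, "malicious": 0, "suspicious": 0})


toy_ips = pd.DataFrame({"ip": ["192.0.2.1", "198.51.100.7"]})

# Named function: each IP string goes in, a dict comes out
stats = toy_ips["ip"].apply(lookup_stats)

# Lambda: pull a single value out of each dict to form a new column
toy_ips["malicious"] = stats.apply(lambda s: s["malicious"])

print(toy_ips)
```

Keeping the lookup and the column extraction as separate steps means the slow call runs once per IP, however many columns you derive from it.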

We'll use the `last_analysis_stats` component of the VirusTotal results. Among other things, it contains `harmless`, `malicious`, and `suspicious` counts. We'll make those 3 separate columns.

In [168]:
vt_data = ips_df.ip.apply(lambda i: client.get_object(f"/ip_addresses/{i}").last_analysis_stats)


In [170]:
ips_df["vt_harmless"] = vt_data.apply(lambda v: v["harmless"])
ips_df["vt_malicious"] = vt_data.apply(lambda v: v["malicious"])
ips_df["vt_suspicious"] = vt_data.apply(lambda v: v["suspicious"])
ips_df

,ip,whois_country,vt_harmless,vt_malicious,vt_suspicious
0,104.80.224.44,"[Country: US, Country: NL]",94,0,0
1,13.107.42.254,[Country: US],81,0,0
2,182.162.143.56,"[country: KR, country: KR, count...",65,18,0
3,2.18.109.224,[country: EU],84,0,0
4,20.108.172.194,[Country: US],94,0,0
5,204.79.197.200,[Country: US],82,1,0
6,204.79.197.203,[Country: US],82,0,0
7,204.79.197.222,[Country: US],81,0,0
8,209.197.3.8,"[Country: US, Country: US]",81,0,0
9,239.255.255.250,[],80,0,0


Finally, we'll sort by those 3 columns in `malicious`, `suspicious`, and `harmless` order to see what floats to the top.
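
When sorting by several columns, the later ones only break ties in the earlier ones. `sort_values()` also accepts one `ascending` flag per column. The sketch below uses made-up scores and, as a variation on the lesson's sort, puts the least-vouched-for IP first among ties by sorting `vt_harmless` ascending.

```python
import pandas as pd

# Made-up scores; column names mirror ips_df
toy = pd.DataFrame(
    {
        "ip": ["192.0.2.1", "198.51.100.7", "203.0.113.9", "203.0.113.50"],
        "vt_malicious": [0, 2, 2, 2],
        "vt_suspicious": [0, 0, 1, 1],
        "vt_harmless": [90, 70, 80, 60],
    }
)

# malicious and suspicious descending, harmless ascending as the final tie-breaker
ranked = toy.sort_values(
    by=["vt_malicious", "vt_suspicious", "vt_harmless"],
    ascending=[False, False, True],
)

print(ranked)
```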

In [172]:
ips_df.sort_values(by=["vt_malicious", "vt_suspicious", "vt_harmless"], ascending=False)

,ip,whois_country,vt_harmless,vt_malicious,vt_suspicious
2,182.162.143.56,"[country: KR, country: KR, count...",65,18,0
18,93.184.220.29,[country: EU],80,2,0
5,204.79.197.200,[Country: US],82,1,0
16,72.21.91.29,[Country: US],82,1,0
17,8.8.8.8,"[Country: US, Country: US]",79,1,0
0,104.80.224.44,"[Country: US, Country: NL]",94,0,0
4,20.108.172.194,[Country: US],94,0,0
3,2.18.109.224,[country: EU],84,0,0
12,52.109.13.64,[Country: US],83,0,0
13,52.123.128.254,[Country: US],83,0,0


We now have reason to treat the communication with `182.162.143.56` as suspicious. We can continue our investigation with other data sources, using this as a correlation point.

## Check For Understanding

You may wish to try this analysis with another PCAP from your own data or another [Hatching Triage](https://tria.ge) report.

Up next, we'll add visualizations to our analyses as we investigate more complex logs!