# Video Streaming: Feature Extraction from Video Streams

In this assignment, you will explore a packet capture of a Netflix video stream. The capture also contains traffic unrelated to Netflix, so part of the exercise is filtering it down to the Netflix traffic only.

## Learning Objectives

In this hands-on activity, you will learn how to:

* Identify service types using TLS SNI/DNS
* Calculate network counters
* Infer video segment downloads


### Step 0: Netflix PCAP to CSV
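To work with a network trace, you can use tshark to extract specific packet header fields and save them to a file for further analysis. With -T fields, tshark writes one row per packet and separates the values with tabs by default, so netflix.csv is tab-delimited despite its extension. The following command illustrates the process: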

In [86]:

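import os  # required for os.system; imported here so this cell runs on its own
# Extract the listed header fields from every packet in the capture and redirect them to netflix.csv
cmd = "tshark -r netflix.pcapng -T fields -e frame.time_epoch -e frame.len -e ip.src -e ip.dst -e ipv6.src -e ipv6.dst -e ip.proto -e ipv6.nxt -e tcp.srcport -e tcp.dstport -e udp.srcport -e udp.dstport -e tcp.len -e tls.handshake.extensions_server_name -e dns.qry.name > netflix.csv"
os.system(cmd)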

0

### Step 1: Identifying Netflix Traffic from DNS

One of the challenges with packet captures is that they often contain a mix of traffic from devices, destinations, and applications. When diagnosing performance problems with a particular service, often the first challenge is identifying and extracting the subset of traffic corresponding to that service.

We have seen in earlier lectures how ML can be used for traffic classification. In this exercise, however, we will rely on two non-ML methods to identify the IP addresses associated with Netflix. The first uses Domain Name System (DNS) lookups, and the second uses the Server Name Indication (SNI) in TLS handshake packets. Both rely on inspecting unencrypted portions of the application payload.

In [51]:
import os
import pandas as pd
import seaborn as sns
NF_DOMAINS = (["nflxvideo", 
              "netflix", 
              "nflxso", 
              "nflxext"])

### Load the Packet Capture and Identify Netflix Traffic

First, load the traffic capture and inspect it.

In [87]:
# Read the CSV file into a DataFrame
columns = ["frame.time", "frame.len", "ip.src", "ip.dst", "ipv6.src", "ipv6.dst", "ip.proto", "ipv6.nxt", "tcp.srcport", "tcp.dstport", "udp.srcport", "udp.dstport", "tcp.len", "sni", "dns"]
ndf = pd.read_csv("netflix.csv", sep="\t", header=None, names=columns)

#### Netflix traffic identification using SNI
First, keep only the packets that carry a non-null SNI value.

In [88]:
pd.options.display.max_rows = 1000 # display up to 1000 rows
ndf[~pd.isna(ndf["sni"])].head(2)

| | frame.time | frame.len | ip.src | ip.dst | ipv6.src | ipv6.dst | ip.proto | ipv6.nxt | tcp.srcport | tcp.dstport | udp.srcport | udp.dstport | tcp.len | sni | dns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 1706371000.0 | 1292 | | | 2401:4900:1c54:5e4b:f19c:20ef:7149:ce60 | 2404:6800:4002:81e::200e | | 17.0 | | | 55234.0 | 443.0 | | clients4.google.com | |
| 207 | 1706371000.0 | 969 | | | 2401:4900:1c54:5e4b:f19c:20ef:7149:ce60 | 2404:a800:0:29::42b | | 6.0 | 51671.0 | 443.0 | | | 883.0 | occ-0-3752-3647.1.nflxso.net | |


Next, write an expression that selects the packets whose SNI contains one of the Netflix domains listed in NF_DOMAINS. The SNI is carried in the TLS handshake message that the client sends to the server, so the destination address of each matching packet belongs to a Netflix server. Collect these addresses and use them to find all Netflix traffic in the trace.

In [None]:
### 
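As a starting point, the SNI match described above could be sketched as follows. This is a minimal sketch on a tiny hand-made DataFrame rather than the real capture; the function name `netflix_ips_from_sni` and the sample rows are illustrative assumptions, and in the notebook you would pass `ndf` instead of `sample`.

```python
import pandas as pd

NF_DOMAINS = ["nflxvideo", "netflix", "nflxso", "nflxext"]

# Hypothetical stand-in for ndf, using the same column names as the notebook.
sample = pd.DataFrame({
    "ip.dst": [None, None, "198.51.100.7"],
    "ipv6.dst": ["2404:6800:4002:81e::200e", "2404:a800:0:29::42b", None],
    "sni": ["clients4.google.com", "occ-0-3752-3647.1.nflxso.net", None],
})

def netflix_ips_from_sni(df, domains=NF_DOMAINS):
    """Return destination addresses of packets whose SNI names a Netflix domain."""
    # Build one alternation pattern, e.g. "nflxvideo|netflix|nflxso|nflxext".
    pattern = "|".join(domains)
    # na=False treats packets without an SNI as non-matching.
    is_netflix = df["sni"].str.contains(pattern, case=False, na=False)
    hits = df[is_netflix]
    # The handshake packet travels client -> server, so the destination is the server.
    addrs = pd.concat([hits["ip.dst"], hits["ipv6.dst"]]).dropna()
    return set(addrs)

sni_ips = netflix_ips_from_sni(sample)
print(sni_ips)
```

Substring matching is deliberately loose here: it would also match an unrelated host name that happens to contain "netflix", so a stricter suffix match may be preferable on a real trace.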

#### Netflix traffic identification using DNS

Now, can you write expressions that identify Netflix servers from the DNS data? You can follow a methodology similar to the one used for the TLS SNI.
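One caveat: the CSV produced in Step 0 contains only the queried name (dns.qry.name), not the addresses returned in the answers. The sketch below therefore assumes the capture was re-exported with two extra tshark fields, dns.a and dns.aaaa, stored in columns of the same names. Those columns, the function name, and the sample rows are assumptions made for illustration and are not part of the original command.

```python
import pandas as pd

NF_DOMAINS = ["nflxvideo", "netflix", "nflxso", "nflxext"]

# Hypothetical sample: "dns.a" and "dns.aaaa" are NOT in the CSV from Step 0.
# tshark joins multiple values of one field with commas by default.
sample = pd.DataFrame({
    "dns": ["www.example.com",
            "occ-0-3752-3647.1.nflxso.net",   # query, no answers yet
            "occ-0-3752-3647.1.nflxso.net",   # response with answers
            None],
    "dns.a": ["192.0.2.80", None, "198.51.100.7,198.51.100.8", None],
    "dns.aaaa": [None, None, "2001:db8::10", None],
})

def netflix_ips_from_dns(df, domains=NF_DOMAINS):
    """Return addresses found in DNS answers for names matching a Netflix domain."""
    pattern = "|".join(domains)
    hits = df[df["dns"].str.contains(pattern, case=False, na=False)]
    ips = set()
    for col in ("dns.a", "dns.aaaa"):
        for value in hits[col].dropna():
            # Split the comma-joined answer list into individual addresses.
            ips.update(str(value).split(","))
    return ips

dns_ips = netflix_ips_from_dns(sample)
print(sorted(dns_ips))
```

If the answer fields are unavailable, the DNS rows can still confirm which Netflix host names the client looked up, but the server addresses would then have to come from the SNI method.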

### Step 2: Counting Traffic to Each Netflix Destination

An important feature for inferring video quality of experience is the throughput of each flow in the video stream. To compute throughput, we divide the number of bytes transferred by the time over which they were transferred.
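To make the definition concrete, here is a minimal sketch that computes throughput in bits per second for a handful of made-up packets. The helper `throughput_bps` and the sample values are hypothetical; the sketch also assumes that frame length is an acceptable byte count and that the interval runs from the first to the last packet.

```python
import pandas as pd

# Hypothetical packets of one flow: timestamps in seconds, lengths in bytes.
flow = pd.DataFrame({
    "frame.time": [10.0, 10.5, 11.0, 12.0],
    "frame.len": [1500, 1500, 1000, 1000],
})

def throughput_bps(df):
    """Bits transferred divided by the time between the first and last packet."""
    duration = df["frame.time"].max() - df["frame.time"].min()
    if duration <= 0:
        # A single packet (or identical timestamps) has no measurable interval.
        return float("nan")
    return 8 * df["frame.len"].sum() / duration

print(throughput_bps(flow))  # 5000 bytes over 2 seconds -> 20000.0 bits/s
```

For a long video session, the same calculation is usually applied per time window (for example, per second) rather than over the whole flow, so that changes in throughput remain visible.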

As a first step towards computing that feature, count the number of packets and bytes, in each direction, to each Netflix IP address in the trace.

#### Count the Number of Downstream Bytes and Packets
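Downstream packets are those whose source address is a Netflix server. The following sketch counts them per server on a small hand-made DataFrame. The address set `netflix_ips` and the sample rows are hypothetical; in the notebook, the set would come from the SNI or DNS step and the DataFrame would be `ndf`.

```python
import pandas as pd

# Hypothetical set of Netflix server addresses from Step 1.
netflix_ips = {"198.51.100.7", "2001:db8::10"}

sample = pd.DataFrame({
    "frame.len": [1500, 1500, 80, 1400, 60],
    "ip.src": ["198.51.100.7", "198.51.100.7", "192.0.2.5", None, "203.0.113.9"],
    "ip.dst": ["192.0.2.5", "192.0.2.5", "198.51.100.7", None, "192.0.2.5"],
    "ipv6.src": [None, None, None, "2001:db8::10", None],
    "ipv6.dst": [None, None, None, "2001:db8::1", None],
})

# Each packet is either IPv4 or IPv6, so merge the two source columns into one.
pkts = sample.assign(src=sample["ip.src"].fillna(sample["ipv6.src"]))

# Downstream = sent by a Netflix server.
down = pkts[pkts["src"].isin(netflix_ips)]
downstream = down.groupby("src")["frame.len"].agg(packets="count", bytes="sum")
print(downstream)
```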

#### Count the Number of Upstream Bytes and Packets
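Upstream packets are those whose destination address is a Netflix server, so the count mirrors the downstream case with the destination columns. As before, this is a sketch on made-up rows, and `netflix_ips` is a hypothetical stand-in for the addresses found in Step 1.

```python
import pandas as pd

# Hypothetical set of Netflix server addresses from Step 1.
netflix_ips = {"198.51.100.7", "2001:db8::10"}

sample = pd.DataFrame({
    "frame.len": [1500, 66, 66, 120, 66, 86],
    "ip.dst": ["192.0.2.5", "198.51.100.7", "198.51.100.7",
               "198.51.100.7", "203.0.113.9", None],
    "ipv6.dst": [None, None, None, None, None, "2001:db8::10"],
})

# Merge the IPv4 and IPv6 destination columns into a single column.
pkts = sample.assign(dst=sample["ip.dst"].fillna(sample["ipv6.dst"]))

# Upstream = sent to a Netflix server.
up = pkts[pkts["dst"].isin(netflix_ips)]
upstream = up.groupby("dst")["frame.len"].agg(packets="count", bytes="sum")
print(upstream)
```

In a video stream, the upstream byte count is typically far smaller than the downstream one, because most upstream packets are acknowledgments and short requests.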

### Step 3: Inferring Segment Downloads

Another important feature that can be used in inferring video quality of experience is the number of segments per unit time. In this step we will infer the number of segments downloaded per unit time for each IP address.

The number of segments can be estimated by counting the runs of consecutive downstream data packets that are separated from one another by a packet with a zero-byte payload. As the final step, compute the number of segment downloads from each Netflix IP address.
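The run-counting rule can be sketched as follows. The sample is a hypothetical, already filtered set of downstream packets with a merged `src` column, and the helper `count_segments` is an illustrative name. The sketch follows the zero-payload rule literally, which is an assumption about how segment boundaries appear in the trace.

```python
import pandas as pd

# Hypothetical downstream packets (server -> client), one row per packet.
down = pd.DataFrame({
    "frame.time": [1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 1.05, 1.15, 1.25],
    "src": ["198.51.100.7"] * 6 + ["198.51.100.8"] * 3,
    "tcp.len": [0, 1400, 1400, 0, 1400, 0, 1400, 1400, 0],
})

def count_segments(payload_lens):
    """Count runs of consecutive packets with a non-zero payload."""
    # Non-TCP packets have no tcp.len; treat them as carrying no payload.
    has_payload = payload_lens.fillna(0) > 0
    # A run starts where a data packet follows a zero-payload packet
    # (or is the very first packet of the group).
    starts = has_payload & ~has_payload.shift(fill_value=False)
    return int(starts.sum())

# Order packets in time, then count runs separately for each server.
segments = (down.sort_values("frame.time")
                .groupby("src")["tcp.len"]
                .apply(count_segments))
print(segments)
```

Dividing each count by the duration of the corresponding flow, or counting within fixed time windows, gives the segments-per-unit-time feature described above.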