# Extracting features from network traffic

In this exercise we will learn how to extract information, i.e. features, from encrypted traffic. The ultimate goal is to gather the statistics necessary to train an inference model for video quality.

For this exercise, we will use a pcap trace recorded from a laptop streaming a Netflix movie over a Wi-Fi connection. For more information on the pcap format, see https://www.tcpdump.org/pcap.html

In [None]:
# Download the trace only if it is not already present
import os
if not os.path.exists("/data/netflix.pcap"):
  !wget 'https://docs.google.com/uc?export=download&id=1tH-Jf-skYAwAbp0INzNUuBKqheKDd1fF' -O /data/netflix.pcap

### Iterating through a pcap trace

To read and process packets from a trace, we use a library called scapy: https://scapy.net/

In [None]:
import scapy.all as sp

pcap_file = "/data/netflix.pcap"

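# Load the whole pcap into memory
trace = sp.rdpcap(pcap_file)
# Iterate over the first packet only (widen the slice to inspect more packets)
for packet in trace[:1]:
  # Print all layers of the packet, including the IP header
  if packet.haslayer(sp.IP):
    packet.show()
  # Print the raw bytes of the TCP header (dataofs counts 32-bit words)
  if sp.TCP in packet:
    print(bytes(packet[sp.TCP])[:4*packet[sp.TCP].dataofs])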

  # TRY PRINTING THE TCP HEADER ONLY
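
The slice `4*packet[sp.TCP].dataofs` works because the TCP data offset field stores the header length in 32-bit words. As a minimal sketch of what that field encodes, the following parses a hand-built TCP header using only the standard library. The sample bytes and the helper name `tcp_header_len` are illustrative assumptions, not part of the exercise trace.

```python
import struct

def tcp_header_len(segment: bytes) -> int:
    """Return the TCP header length in bytes from a raw TCP segment."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # Byte 12 holds the data offset in its upper 4 bits (in 32-bit words)
    data_offset_words = segment[12] >> 4
    return 4 * data_offset_words

# Hypothetical segment: 20-byte header (data offset = 5) plus 3 payload bytes
header = struct.pack(
    "!HHIIBBHHH",
    443,      # source port
    51000,    # destination port
    1,        # sequence number
    0,        # acknowledgment number
    5 << 4,   # data offset (5 words) in the high nibble
    0x18,     # flags: PSH + ACK
    65535,    # window size
    0,        # checksum (left at zero in this sketch)
    0,        # urgent pointer
)
segment = header + b"abc"

hdr_len = tcp_header_len(segment)
print(hdr_len)                   # header length in bytes
print(segment[:hdr_len].hex())   # header bytes only, payload excluded
```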

### Identifying the service type using DNS traffic

DNS responses map domain names to IP addresses, so they can be used to identify which service a remote IP address belongs to.

To read and process DNS queries we use a library called dnslib: https://github.com/paulc/dnslib

Using the pcap trace downloaded previously, we aim to identify which IP addresses belong to Netflix.
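The matching step can be sketched independently of the trace: given DNS answer records, keep the address records whose name contains a Netflix keyword. The records below are made up (they use documentation address ranges), and `netflix_addresses` is a hypothetical helper. The keyword list mirrors the one used in the cell that follows.

```python
# Keywords that appear in Netflix domain names (same list as in the exercise)
NF_DOMAINS = ["nflxvideo", "netflix", "nflxso", "nflxext"]

# DNS record type numbers for address records
TYPE_A, TYPE_AAAA = 1, 28

def netflix_addresses(answers):
    """Return the set of IPs from A/AAAA answers whose name matches a keyword.

    `answers` is an iterable of (name, rtype, rdata) tuples, standing in for
    the fields of parsed DNS answer records.
    """
    ips = set()
    for name, rtype, rdata in answers:
        if rtype in (TYPE_A, TYPE_AAAA) and any(k in name for k in NF_DOMAINS):
            ips.add(rdata)
    return ips

# Made-up answers using documentation address ranges
sample_answers = [
    ("cdn1.nflxvideo.net.", 1, "192.0.2.10"),
    ("cdn1.nflxvideo.net.", 28, "2001:db8::10"),
    ("www.netflix.com.", 5, "www.example-cdn.net."),   # CNAME, not an address
    ("www.example.org.", 1, "198.51.100.7"),           # unrelated service
    ("cdn1.nflxvideo.net.", 1, "192.0.2.10"),          # duplicate answer
]

print(sorted(netflix_addresses(sample_answers)))
```

Collecting into a set avoids storing the same address twice when a name is resolved repeatedly during the session.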

In [None]:
import scapy.all as sp
import dnslib

# Netflix keywords used inside their domain names
NF_DOMAINS = ["nflxvideo", "netflix", "nflxso", "nflxext"]

# READ THE PCAP AND INSERT INTO THIS LIST THE IP ADDRESSES THAT BELONG TO NETFLIX
netflix_ips = []

pcap_file = "/data/netflix.pcap"

with sp.PcapReader(pcap_file) as trace:
  for packet in trace:
    # DNS response: UDP packet sent from port 53
    if packet.haslayer(sp.UDP) and packet[sp.UDP].sport == 53:
      # Get the raw DNS message
      raw = sp.raw(packet[sp.UDP].payload)
      # Parse the DNS message
      dns = dnslib.DNSRecord.parse(raw)
      # Iterate over the answer records
      for a in dns.rr:
        # Check if the record name is a domain of interest
        name = str(a.rname)
        if any(s in name for s in NF_DOMAINS):
          # Keep only address records: A (type 1) and AAAA (type 28)
          if a.rtype == 1 or a.rtype == 28:
            print("Name {} is a Netflix one. Appending IP {} to Netflix IPs".format(name, a.rdata))
            netflix_ips.append(str(a.rdata))

print("Netflix IPs: {}".format(netflix_ips))

### Collecting network counters

Network counters are useful for quality inference because they reveal how much data is exchanged between two IP addresses.

In this exercise, we aim to collect packet and byte counters for the traffic exchanged between the local machine and Netflix's servers.
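As a minimal sketch of the aggregation, the following counts packets and bytes per server from plain `(src, dst, ip_len)` tuples, which stand in for the fields read from each packet. The sample records, the helper name `count_traffic`, and the convention that "in" means server to local machine are assumptions made for illustration. The counter layout matches the `counters()` helper in the cell below.

```python
from collections import defaultdict

def counters():
    return {"in_pkts": 0, "out_pkts": 0, "in_bytes": 0, "out_bytes": 0}

def count_traffic(records, server_ips):
    """Aggregate per-server counters from (src, dst, ip_len) records.

    Assumed convention: "in" is server -> local machine,
    "out" is local machine -> server.
    """
    stats = defaultdict(counters)
    for src, dst, ip_len in records:
        if src in server_ips:
            stats[src]["in_pkts"] += 1
            stats[src]["in_bytes"] += ip_len
        elif dst in server_ips:
            stats[dst]["out_pkts"] += 1
            stats[dst]["out_bytes"] += ip_len
    return dict(stats)

# Made-up records using documentation address ranges
server_ips = {"192.0.2.10"}
records = [
    ("10.0.0.2", "192.0.2.10", 120),    # request towards the server
    ("192.0.2.10", "10.0.0.2", 1500),   # data from the server
    ("192.0.2.10", "10.0.0.2", 1500),
    ("10.0.0.2", "198.51.100.7", 60),   # unrelated traffic, ignored
]

result = count_traffic(records, server_ips)
print(result)
```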

In [None]:
pcap_file = "/data/netflix.pcap"

# READ THE PCAP AND INSERT INTO THIS DICTIONARY THE NETWORK COUNTERS FOR EACH IP
# ADDRESS BELONGING TO NETFLIX
network_counters = {}

# You can use this dictionary to collect counters
def counters():
  return {"in_pkts": 0, "out_pkts": 0, "in_bytes": 0, "out_bytes": 0}

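# Remember that the amount of data contained in a packet, as specified by the IP
# header, is called "length". The relevant fields are:
#   packet[sp.IP].src  (source address)
#   packet[sp.IP].dst  (destination address)
#   packet[sp.IP].len  (total length in bytes, IP header included)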

# After reading all packets, print the counters you have found
for ip in network_counters:
  print("IP {} generated the following amount of traffic {}".format(ip, network_counters[ip]))

### Inferring segment downloads

The segments downloaded during the streaming session are the most useful information for inferring video quality.

In this exercise, we aim to detect when each segment is downloaded and to measure its size. Remember that we are only interested in Netflix's traffic.
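One possible way to delimit segments, shown here as a sketch and not as the expected solution, is to treat every uplink packet that carries payload as a new request: it closes the segment being downloaded from that server and opens the next one. This heuristic, the helper name `detect_segments`, and the sample records are assumptions. The records are `(src, dst, payload_len)` tuples standing in for values computed from each packet.

```python
def segment():
    return {"pkts": 0, "bytes": 0}

def detect_segments(records, server_ips):
    """Group downlink payload into segments, one list of segments per server IP.

    `records` is an iterable of (src, dst, payload_len) tuples in capture order.
    Assumed heuristic: an uplink packet carrying payload is a new request, so it
    closes the segment in progress for that server and opens a new one.
    """
    completed = {}
    ongoing = {}
    for src, dst, payload_len in records:
        if payload_len == 0:
            continue  # pure ACKs carry no application data
        if dst in server_ips:
            # Uplink data: close the previous segment if it received any data
            previous = ongoing.get(dst)
            if previous is not None and previous["pkts"] > 0:
                completed.setdefault(dst, []).append(previous)
            ongoing[dst] = segment()
        elif src in server_ips and src in ongoing:
            # Downlink data: add it to the segment in progress
            ongoing[src]["pkts"] += 1
            ongoing[src]["bytes"] += payload_len
    # Flush segments still in progress when the trace ends
    for ip, last in ongoing.items():
        if last["pkts"] > 0:
            completed.setdefault(ip, []).append(last)
    return completed

# Made-up records using documentation address ranges
records = [
    ("10.0.0.2", "192.0.2.10", 300),    # request 1
    ("192.0.2.10", "10.0.0.2", 1448),
    ("10.0.0.2", "192.0.2.10", 0),      # ACK, ignored
    ("192.0.2.10", "10.0.0.2", 1448),
    ("10.0.0.2", "192.0.2.10", 310),    # request 2 closes the first segment
    ("192.0.2.10", "10.0.0.2", 1000),
]

found = detect_segments(records, {"192.0.2.10"})
print(found)
```

On a real trace this heuristic also counts non-video exchanges, such as the TLS handshake, as segments, and the last flushed segment may be incomplete, so a minimum-size filter is likely needed.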

In [None]:
import numpy as np

pcap_file = "/data/netflix.pcap"

# READ THE PCAP AND INSERT INTO THIS DICTIONARY THE LIST OF SEGMENTS DOWNLOADED FOR EACH IP
# ADDRESS BELONGING TO NETFLIX
completed_video_segments = {}

# I suggest using a support dictionary for ongoing downloads
ongoing_video_segment = {}

# You can use this dictionary to collect segment information
def segment():
  return {"pkts": 0, "bytes": 0}

# Getting the payload size of a TCP packet requires combining the following information
# from different layers:
# - Start from the total packet length: packet[sp.IP].len
# - Subtract the IP header size: 4*packet[sp.IP].ihl
# - Subtract the TCP header size: 4*packet[sp.TCP].dataofs


# After capturing all segments, print the number of segments you have found for each IP
for ip in completed_video_segments.keys():
  print("IP {} downloaded {} segments ".format(
      ip,
      len(completed_video_segments[ip])
  ))
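
The payload-size arithmetic described in the comments of the cell above can be written as a small function. This is a sketch with a hypothetical helper name, `tcp_payload_len`, and made-up input values. It takes plain integers so that it runs without a trace.

```python
def tcp_payload_len(ip_len: int, ihl: int, dataofs: int) -> int:
    """TCP payload size in bytes from the IP total length and both header lengths.

    `ihl` and `dataofs` are expressed in 32-bit words, as in the IP and TCP headers.
    """
    return ip_len - 4 * ihl - 4 * dataofs

# Full-size packet: 1500 bytes total, 20-byte IP header, 32-byte TCP header
print(tcp_payload_len(1500, 5, 8))   # → 1448
# Pure ACK: headers only, no payload
print(tcp_payload_len(52, 5, 8))     # → 0
```

With a scapy packet, the three arguments would be `packet[sp.IP].len`, `packet[sp.IP].ihl`, and `packet[sp.TCP].dataofs`.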

### Putting it all together

In [None]:
# Time to put it all together and analyze the traffic of the session.


## Which IPs belong to the servers transmitting the actual video content?

Answer here