# Inferring video resolution from encrypted traffic

In this exercise we will learn how to use features extracted from encrypted traffic to infer the video resolution every 10 seconds of a session. The ultimate goal is to learn how different features affect the accuracy of the inference model.

For this exercise, we will use a dataset containing 4000 video sessions from four services: Netflix, YouTube, Twitch, and Amazon Prime Video. The dataset is provided as a pandas DataFrame. For more information on pandas: https://pandas.pydata.org/

In [None]:
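# Get the data (downloaded to the same path that is checked here and loaded below)
import os
os.makedirs("/data", exist_ok=True)
if not os.path.exists("/data/video_dataset.pkl"):
  !gdown "https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI" -O /data/video_dataset.pkl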

### Data cleaning

The first part of the model design pipeline requires you to explore the available dataset and remove possible bad or highly biased values.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_pickle('/data/video_dataset.pkl')

# Explore the structure of the dataset (e.g. the features that it contains)
df.head()

Remove "bad" values

In [None]:
# Tip: These are the only valid values for resolutions
valid_resolutions = [
  280,
  360,
  480,
  720,
  1080
]
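
One possible way to apply the list above is to keep only the rows whose label appears in it. The sketch below uses a small toy DataFrame instead of the real dataset, and it assumes the label column is called `resolution` (the name used for the Netflix session later in this notebook); the `throughput` column is purely hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real dataset.
# Assumption: the label column is named 'resolution'; 'throughput' is made up.
toy = pd.DataFrame({
    "resolution": [280, 360, 0, 720, None, 1080, 144],
    "throughput": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

valid_resolutions = [280, 360, 480, 720, 1080]

# Keep only rows whose label is one of the valid resolutions.
# isin() is False for missing values, so rows without a label are dropped too.
toy_clean = toy[toy["resolution"].isin(valid_resolutions)].reset_index(drop=True)
print(toy_clean)
```

The same one-line filter should carry over to `df` if its label column has the same name; checking `df['resolution'].value_counts(dropna=False)` first shows which invalid values are actually present.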


Remove unwanted bias

Sometimes datasets might contain features that could impact the accuracy of the model. Explore the dataset and remove the columns that you believe would have a negative effect on the final model due to unwanted bias.

In [None]:
# Tip: strings are definitely a problem! For example:
unwanted_data = [
  "video_id",
  "home_id"
]

# To drop the unwanted columns
df = df.drop(columns=unwanted_data)

# What else?
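
As a starting point for the "What else?" question, it can help to list the columns that are not numeric, since most classifiers cannot consume them directly. This is only a sketch on a toy DataFrame: the `service` and `throughput` columns are hypothetical and may not exist in the real dataset.

```python
import pandas as pd

# Toy frame; 'service' and 'throughput' are hypothetical column names.
toy = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "service": ["netflix", "youtube", "twitch"],
    "resolution": [360, 720, 1080],
    "throughput": [1.5, 3.2, 6.8],
})

# Columns with a non-numeric dtype (strings, objects, ...)
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Drop them, keeping only numeric columns
toy_numeric = toy.drop(columns=non_numeric)
```

Note that a numeric dtype does not make a column safe: identifiers and timestamps can be numeric and still let the model memorize sessions, so they deserve a manual look as well.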

### Simple quality inference



Let's try a first attempt at inferring the resolution.

First, define the target of the inference.
Then prepare the data.

Finally train and test your model. How is the performance?

In [None]:
import numpy as np

# Split the data into train / test datasets. What is the best unit to split your dataset?

# df_train = 
# df_test = 
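
One candidate answer to the question above is to split by session, so that 10-second windows from the same session never land in both sets. The sketch below shows this with scikit-learn's `GroupShuffleSplit` on toy data; it assumes a `session_id` column like the one in the Netflix session used later, and the `feature` column is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 sessions with 6 windows each.
# Assumption: sessions are identified by a 'session_id' column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "session_id": np.repeat(np.arange(10), 6),
    "feature": rng.normal(size=60),
    "resolution": rng.choice([360, 720, 1080], size=60),
})

# Hold out 20% of the sessions (not 20% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["session_id"]))
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# No session should appear on both sides of the split
shared = set(toy_train["session_id"]) & set(toy_test["session_id"])
print(len(shared))
```

A plain row-level random split would likely give more optimistic scores, because neighbouring windows of one session tend to look alike.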


In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd


# You are free to try any classification method you think could work well for inferring resolution

# x_train = 
# x_test = 
# y_train = 
# y_test = 


# Ultimately, you want to produce a precision and recall plot using the provided code.
# It expects you to define `recall`, `precision` (a dict with a "micro" entry),
# and `average_precision` from your model's predictions.
from sklearn.metrics import average_precision_score, precision_recall_curve

plt.plot(recall, precision["micro"], label='P/R curve (AP = %0.2f)' % average_precision,
         linestyle='-', linewidth=0.8, marker='*', markersize=3)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.grid(True, which='major', axis='both')
plt.show()
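
The plotting code above needs `recall`, `precision["micro"]`, and `average_precision`. A minimal sketch of one way to obtain them is shown here, assuming a random forest as the classifier (any model with `predict_proba` would do) and using synthetic data from `make_classification` in place of the real features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class problem standing in for the real features and resolutions
X, labels = make_classification(n_samples=600, n_features=10, n_informative=5,
                                n_classes=3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

# Assumption: a random forest; this is one choice among many
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(x_train, y_train)

# One-vs-rest indicator matrix of the true labels and per-class scores
y_test_bin = label_binarize(y_test, classes=clf.classes_)
y_score = clf.predict_proba(x_test)

# Micro-averaging: flatten all (sample, class) pairs into one binary problem
precision = {}
precision["micro"], recall, _ = precision_recall_curve(
    y_test_bin.ravel(), y_score.ravel())
average_precision = average_precision_score(y_test_bin, y_score, average="micro")
print(round(average_precision, 2))
```

With these three names defined, the provided plotting code runs unchanged. On the real dataset the split should come from your own train / test step rather than `train_test_split`.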

### Feature importance

After designing your first model, it's important to understand which features had the highest impact on your prediction accuracy.

REMEMBER: extracting features from network traffic is costly! The more features you use, the more powerful your network capture tools have to be.

In this exercise we want to quantitatively study which features yield the highest inference power. 

In [None]:
# Collect in an ordered array the features and their importance in the prediction
feature_importance = []

# Here is an example of how to get the feature importance from one estimator of a fitted ensemble classifier (clf)
for i, feature in enumerate(features):
  feature_importance.append({'name': feature, 'GINI_index': clf.estimators_[1].feature_importances_[i]})

feature_importance = sorted(feature_importance, key=lambda k: k['GINI_index'], reverse=True)

# Which are the most important features?
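
The snippet above reads the importances of a single estimator (`clf.estimators_[1]`). If `clf` is a scikit-learn `RandomForestClassifier` (an assumption, since the choice of model is left to you), the forest also exposes `clf.feature_importances_`, which averages over all trees. The sketch below illustrates this on synthetic data with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names f0..f5 are hypothetical
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0, shuffle=False)
features = [f"f{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances averaged over all trees, sorted from high to low
importance = pd.Series(clf.feature_importances_, index=features)
importance = importance.sort_values(ascending=False)
print(importance)
```

Impurity-based importances are normalized to sum to 1, so each value can be read as a share of the total.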

### Select features by layer



In the previous exercise you studied which features have the highest impact on the inference accuracy. We now use the obtained results to organize features into groups and evaluate which group achieves the highest accuracy.

Remember that features from the same layer might be computed from the same underlying information, so they can be redundant!

In [None]:
# Hint: features are conveniently tagged with the layer they belong to. E.g.:
l3_features = [col for col in df.columns if 'L3' in col]

# Replicate the study from the previous exercise ("Simple quality inference")
# using different feature groups.

# What do you observe?
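
A rough sketch of how the per-layer comparison could be organized is given below. It runs on toy data: the column names and the `L4` / `L7` tags are assumptions (only the `L3` tag appears in the hint above), and cross-validated accuracy with a random forest is just one possible way to score each group.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data: all column names are hypothetical, tagged with a layer prefix
rng = np.random.default_rng(0)
n = 300
labels = rng.integers(0, 3, size=n)
toy = pd.DataFrame({
    "L3_bytes": labels * 2.0 + rng.normal(size=n),                    # informative
    "L3_pkts": rng.normal(size=n),                                    # noise
    "L4_rtt": rng.normal(size=n),                                     # noise
    "L7_segment_size": labels * 3.0 + rng.normal(scale=0.5, size=n),  # informative
})

# Build one feature group per layer tag, plus a group with every feature
groups = {layer: [col for col in toy.columns if layer in col]
          for layer in ("L3", "L4", "L7")}
groups["all"] = list(toy.columns)

# Score each group with the same model and the same folds
scores = {}
for name, cols in groups.items():
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores[name] = cross_val_score(clf, toy[cols], labels, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On the real dataset, the folds should respect session boundaries in the same way as your train / test split.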


### Predict the ongoing resolution of a real Netflix session

Now that you have your model, it's time to put it into practice!

Use a preprocessed Netflix video session to infer the resolution in 10-second time windows.

In [None]:
# Get the data (downloaded to the same path that is checked here)
os.makedirs("/data", exist_ok=True)
if not os.path.exists("/data/netflix_session.pkl"):
  !gdown "https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS" -O /data/netflix_session.pkl

In [None]:
df_session = pd.read_pickle("/data/netflix_session.pkl")

unwanted_data = [
  "video_id",
  "video_position",
  "index",
  'home_id',
  "relative_timestamp",
  "absolute_timestamp",
  'resolution', 
  'session_id'
]

x = df_session.drop(columns=unwanted_data).values
y = [0 if v is None else int(v) for v in df_session['resolution'].values]

# Predict the inferred resolutions and compare

# You can use this code to plot the result (predicted_resolutions is a list)
plt.plot(df_session['relative_timestamp'].values, y, label='Real')
plt.plot(df_session['relative_timestamp'].values, predicted_resolutions, label='Predicted')
plt.xlabel('Session time')
plt.ylabel('Resolution')
plt.legend()
plt.show()
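
Besides the plot, the comparison can be summarized as a single number. The sketch below computes the fraction of windows predicted correctly, skipping windows whose real resolution was missing (mapped to 0 in the code above). The two lists here are made-up values for illustration only; in practice `predicted_resolutions` would come from something like `clf.predict(x)` with the same feature columns used for training.

```python
import numpy as np

# Made-up values for illustration; 0 marks a window without ground truth
y = [360, 360, 720, 720, 1080, 0]
predicted_resolutions = [360, 480, 720, 720, 720, 360]

y_arr = np.asarray(y)
pred_arr = np.asarray(predicted_resolutions)

# Only score windows where the real resolution is known
known = y_arr != 0
accuracy = np.mean(pred_arr[known] == y_arr[known])
print(accuracy)
```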