# Assignment: Application Classification

To this point in the class, you have learned various techniques for application/traffic classification. In this assignment, you will put them into practice by training a model to identify applications from a network traffic trace.

**Submission Instructions**: 
- Ensure your assignment is submitted through Gradescope. To do so, sign up for the course on Gradescope using the code: B2W3YG. 
- You should submit a single notebook containing your feature-extraction code, your model evaluation, and your response to Part 3.
- You should assume the data files are located in a folder called `data`, co-located with the notebook.
- Make sure the notebook is styled well. Write code in the relevant sections of the notebook. 
- I should be able to run the entire notebook without any errors. 

## Dataset download and Warmup

We will use a public dataset that consists of annotated traffic logs. The dataset for this assignment is available on [OneDrive](https://csciitd-my.sharepoint.com/:u:/g/personal/tmangla_csciitd_onmicrosoft_com/EafyJbnixmJIvZN1bgwD2W4BIrzc5yy9AP9uNrkmNTMfoA?e=Qa1NVU). The data consists of tab-separated (TSV) files, each containing the packet headers for one application. Each row corresponds to one packet. The header fields follow this schema:
```
columns = ["frame.time_epoch", "frame.len", "ip.src", "ip.dst", "ip.proto",
    "udp.srcport", "udp.dstport", "tcp.srcport", "tcp.dstport",
    "tcp.flags", "tcp.flags.syn", "tcp.flags.fin", "dns.qry.name"]
```

**Getting application ground truth:** You can use the filename of each data file as the ground-truth application label.

Download the dataset and read it. You can read each file into a dataframe:
```
import pandas as pd
df = pd.read_csv(filename, sep="\t", header=None, names=columns)
```
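To read the whole `data` folder at once, the call above can be wrapped in a loop. The following is a minimal sketch, not a required structure: `load_traces` and the `source_file` column are hypothetical names, and turning the file stem into an application label is left to you because it depends on how the files are named.

```python
from pathlib import Path

import pandas as pd

columns = ["frame.time_epoch", "frame.len", "ip.src", "ip.dst", "ip.proto",
    "udp.srcport", "udp.dstport", "tcp.srcport", "tcp.dstport",
    "tcp.flags", "tcp.flags.syn", "tcp.flags.fin", "dns.qry.name"]


def load_traces(data_dir="data"):
    """Read every file in data_dir into one dataframe (hypothetical helper)."""
    frames = []
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        df = pd.read_csv(path, sep="\t", header=None, names=columns)
        # Keep the file stem so the application label can be derived later.
        df["source_file"] = path.stem
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Keeping the raw file stem in its own column means the labeling rule can be changed later without re-reading the files.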

## Part 1: Extracting Features

### Data cleaning

Your goal is to extract the following features from the dataset: 
- Flow-level features (5 features): flow duration, volume (upstream and downstream), and number of packets (upstream and downstream).
- Packet-level features (36 features): statistics on packet inter-arrival time (IAT) and packet size, computed separately for the upstream and downstream directions. For each flow, compute the following statistics: mean, median, standard deviation, min, max, quartiles (25th and 75th percentiles), and deciles (10th and 90th percentiles). This gives 9 statistics × 2 features (IAT, size) × 2 directions (upstream, downstream) = 36 values.
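As an illustration of the packet-level features, here is a minimal sketch that computes the nine statistics for one flow. The function names, the feature-name prefixes, and the choice to return NaN when a direction has too few packets are all assumptions; you may prefer a different convention (for example, filling with zeros).

```python
import numpy as np

STAT_NAMES = ["mean", "median", "std", "min", "max", "p25", "p75", "p10", "p90"]


def summary_stats(values, prefix):
    """Return the nine statistics for one feature in one direction."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        # Assumption: undefined statistics are reported as NaN.
        return {f"{prefix}_{name}": np.nan for name in STAT_NAMES}
    p10, p25, p75, p90 = np.percentile(arr, [10, 25, 75, 90])
    # arr.std() is the population standard deviation (ddof=0).
    stats = [arr.mean(), np.median(arr), arr.std(), arr.min(), arr.max(),
             p25, p75, p10, p90]
    return {f"{prefix}_{name}": float(v) for name, v in zip(STAT_NAMES, stats)}


def packet_features(up_times, up_sizes, down_times, down_sizes):
    """Build the 36 packet-level features from per-direction packet lists."""
    feats = {}
    for direction, times, sizes in (("up", up_times, up_sizes),
                                    ("down", down_times, down_sizes)):
        # Inter-arrival times are gaps between consecutive packets.
        iat = np.diff(np.sort(np.asarray(times, dtype=float)))
        feats.update(summary_stats(iat, f"{direction}_iat"))
        feats.update(summary_stats(sizes, f"{direction}_size"))
    return feats
```

A direction with a single packet has no inter-arrival times, so its IAT statistics come out as NaN here; decide how to handle such flows before training.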

**Defining Flows**: For TCP, a flow is the same as a connection (delimited by SYN/FIN packets). Define UDP flows using an inactivity timeout (as discussed in class) of 60 s.
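The UDP rule can be sketched as follows, assuming the input holds only UDP packets with non-empty addresses and ports. `assign_udp_flows`, `conv`, and `flow_id` are hypothetical names, and treating a gap strictly greater than 60 s as the start of a new flow is an assumption about the boundary case.

```python
import pandas as pd

TIMEOUT_S = 60.0  # inactivity timeout from the assignment text


def assign_udp_flows(udp):
    """Label each UDP packet with a flow id (hypothetical helper)."""
    udp = udp.sort_values("frame.time_epoch").reset_index(drop=True)
    src = udp["ip.src"].astype(str) + ":" + udp["udp.srcport"].astype(int).astype(str)
    dst = udp["ip.dst"].astype(str) + ":" + udp["udp.dstport"].astype(int).astype(str)
    # Sort the two endpoints so both directions map to the same conversation.
    udp["conv"] = [" <-> ".join(sorted(pair)) for pair in zip(src, dst)]
    # Gap to the previous packet of the same conversation (NaN for the first).
    gap = udp.groupby("conv")["frame.time_epoch"].diff()
    new_flow = gap.isna() | (gap > TIMEOUT_S)
    # Count flow starts within each conversation to number its flows.
    flow_idx = new_flow.astype(int).groupby(udp["conv"]).cumsum()
    udp["flow_id"] = udp["conv"] + "#" + flow_idx.astype(str)
    return udp
```

The same endpoint-sorting trick could be reused for TCP, with SYN and FIN flags marking the flow boundaries instead of the time gap.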

Make sure you filter out non-IP traffic as well as DNS traffic from the data.
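One possible way to apply this filter is sketched below. It assumes that non-IP packets have empty `ip.src`/`ip.dst` fields and that DNS packets can be recognized by a non-empty `dns.qry.name` or by port 53; both are heuristics to verify against the actual traces. `drop_non_ip_and_dns` is a hypothetical helper name.

```python
import pandas as pd


def drop_non_ip_and_dns(df):
    """Keep only IP packets that do not look like DNS (heuristic sketch)."""
    # Assumption: rows without IP addresses are non-IP traffic.
    is_ip = df["ip.src"].notna() & df["ip.dst"].notna()
    # Assumption: DNS packets carry a query name or use port 53.
    has_dns_query = df["dns.qry.name"].notna()
    port_cols = ["udp.srcport", "udp.dstport", "tcp.srcport", "tcp.dstport"]
    on_port_53 = df[port_cols].eq(53).any(axis=1)
    return df[is_ip & ~has_dns_query & ~on_port_53].copy()
```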

**Checkpoint**: Once you have done that, summarize the number of flows for each application. You can extract the application name from the file name. Treat the VPN and non-VPN versions of an application as different classes. You can remove classes with fewer than 10 instances for the next part.
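The checkpoint summary could look like the sketch below. It assumes a dataframe with one row per flow and a label column; the name `app` and the helper `keep_frequent_classes` are hypothetical.

```python
import pandas as pd


def keep_frequent_classes(flows, label_col="app", min_count=10):
    """Return per-class flow counts and the flows of sufficiently large classes."""
    counts = flows[label_col].value_counts()
    kept = counts[counts >= min_count].index
    filtered = flows[flows[label_col].isin(kept)].reset_index(drop=True)
    return counts, filtered
```

Printing `counts` gives the per-application summary, and `filtered` is the input for Part 2.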

## Code here

## Part 2: Application Classification

### Prepare your data

### Train Your Model
- Select a model of your choice.
- Train the model using the training data.

### Tune Your Model
Perform hyperparameter tuning to find optimal parameters for your model.
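As one illustration of tuning, the sketch below runs a small grid search with scikit-learn. The random forest, the parameter grid, the macro-F1 scoring, and the synthetic data from `make_classification` are placeholders chosen for the example, not requirements; substitute your own model and feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 41 flow features and application labels.
X, y = make_classification(n_samples=200, n_features=41, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid; a real search would usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

best_params = search.best_params_
# Score of the refit best model on held-out data, using the same metric.
test_score = search.score(X_test, y_test)
```

Tuning on the training split only keeps the held-out data untouched for the final evaluation.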

### Evaluate Your Model

**Checkpoint**: Evaluate your model on the following metrics using 10-fold cross-validation:

- Accuracy
- F1 Score
- Confusion Matrix
- ROC/AUC

Your code should evaluate each metric in a separate cell.
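The four metrics can be computed from out-of-fold predictions, as in the sketch below. The classifier, the synthetic data, macro averaging for F1, and one-vs-rest AUC are assumptions made for the example; it is shown as a single block for brevity, whereas in your notebook each metric belongs in its own cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the flow features and application labels.
X, y = make_classification(n_samples=300, n_features=41, n_informative=10,
                           n_classes=3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each sample is predicted by the fold model that did not train on it.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
# Probability columns follow the sorted class labels.
pred = np.unique(y)[proba.argmax(axis=1)]

acc = accuracy_score(y, pred)
macro_f1 = f1_score(y, pred, average="macro")
cm = confusion_matrix(y, pred)
auc = roc_auc_score(y, proba, multi_class="ovr", average="macro")
```

Pooling the out-of-fold predictions gives one confusion matrix over all flows; averaging per-fold scores is an equally valid alternative for accuracy and F1.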

## Code here

## Part 3: Results analysis

Write a short report summarizing the results. Also, explain your results by answering the following questions:

- Which categories of applications were classified correctly (or incorrectly), and why?
- For application categories that were predicted incorrectly, how would you improve their accuracy? Be specific. For instance, do not simply write "I will collect more data"; explain what data you would collect and why it would help.

## Report here