#  DS3000A/9000A 

# Final Exam – Part 2 (60 pts)

### Student Name: xxxxxxxx
### Student ID: xxxxxxxx

## General
This part of the exam is **Open Book**, and you will answer the programming questions below in this Jupyter Notebook. You have **2 hours (3:00 pm - 5:00 pm)** to finish the exam and upload your notebook to OWL.
* You **are allowed** to use any document and sources on your computer and look up documents on the internet. **You need to cite any code that you use if it is NOT from the course Labs or Tutorial examples**.
* You are **NOT allowed** to share documents or communicate in any other way with people inside or outside the exam room during the final. Using AI chatbots is **NOT allowed and will be counted as cheating or plagiarism**.
* All figures should have an x-axis label and a y-axis label.
* Add as many cells as you want, whenever you need to. 
* To finish the exam in the allotted time, you will have to work efficiently. You need to submit the exam Jupyter Notebook by the **due date (Dec 12, 2023 at 5:00 pm)** on **OWL in the Assignments / Final Exam - Part 2**, where you downloaded the Dataset and Jupyter Notebook. **Late submissions will be scored with 0 pts, unless you have received special accommodations.** To avoid technical difficulties, start your submission no later than five to ten minutes before the deadline. To be sure, you can also submit multiple versions; only the latest version will be graded.

**Ensure that your code runs correctly by choosing "Kernel -> Restart and Run All" before submitting.**

### Additional Guidance

If at any point you are not sure about the answer, then *write your assumptions clearly in your exam and proceed according to those assumptions.*

Good luck!

```python
## Preliminaries
### YOU MAY ADD ADDITIONAL IMPORTS IF YOU WISH
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score, confusion_matrix, roc_curve, auc, roc_auc_score

import time
```


## Dataset
In this exam, we will work on the network anomaly detection dataset "IP_Activity_Dataset_5000.csv". It was generated from web server access logs collected from a real-world website served through Content Delivery Networks (CDNs). Each sample/row in the dataset represents a unique Internet Protocol (IP) address and has 9 columns/variables. Each feature/column is a performance indicator that reflects the state or activity of that sample/IP. The IP addresses were masked for privacy reasons.

### Variables/Features
Feature description: 
1. **requests**: the number of requests sent by each IP.
2. **request-interval**: the average time interval between consecutive requests sent by each IP. Unit: milliseconds
3. **request-popularity**: the percentage of requests sent by each IP that are for popular contents.
4. **bytes**: the average number of bytes received by each IP after requesting the content.
5. **delivery-time**: the average request delivery time experienced by each IP. Unit: milliseconds
6. **hit-rate**: the cache hit rate of each IP.
7. **nodes**: the number of nodes that received requests from each IP.
8. **contents**: the number of contents/files that each IP requested.
9. **label**: 0-normal, 1-abnormal (potential cache pollution attacks).

---
# Question 1 - Explore dataset ( X / 5 pts )

- Read the dataset "IP_Activity_Dataset_5000.csv" as a pandas dataframe.
- Print the number of observations in the dataset
- Print the number of variables in the dataset (all variables regardless of whether they are a feature or label or neither)
- Print the number of observations for each class in the 'label' variable
- Print the first five rows of the dataset
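
The exploration steps above can be sketched on a small made-up table. This is only an illustration, not the exam dataset: `demo_df` and its values are hypothetical stand-ins for what `pd.read_csv("IP_Activity_Dataset_5000.csv")` would return, and only three of the nine columns are mimicked.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file; in the exam you would call
# pd.read_csv("IP_Activity_Dataset_5000.csv") instead.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "requests": rng.integers(1, 500, size=10),
    "delivery-time": rng.uniform(5, 200, size=10),
    "label": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
})

# shape is (number of rows, number of columns)
n_observations, n_variables = demo_df.shape
print("Number of observations:", n_observations)
print("Number of variables:", n_variables)

# Count of rows per class in the 'label' column
label_counts = demo_df["label"].value_counts()
print(label_counts)

# First five rows
print(demo_df.head())
```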

---
---
# Question 2 - Regression and Evaluation (X / 20 pts)
Your next task is to build regression models that predict the delivery-time of IPs.

---
## Question 2 Part A - Data Splitting For Regression ( X / 2 pts )
- Use 'delivery-time' as the target variable y for your regression models, and other variables as the feature set X.
- Split the data into equal-sized training and test sets (do not shuffle the data).
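
A minimal sketch of an unshuffled 50/50 split, using a made-up table rather than the exam data. The frame `demo_df` and its values are hypothetical; the relevant part is `test_size=0.5` together with `shuffle=False`, which keeps the first half of the rows for training and the second half for testing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same target column name as the exam dataset
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "requests": rng.integers(1, 500, size=20),
    "bytes": rng.uniform(1e3, 1e6, size=20),
    "delivery-time": rng.uniform(5, 200, size=20),
})

# Target is 'delivery-time'; every other column is a feature
X = demo_df.drop(columns="delivery-time")
y = demo_df["delivery-time"]

# shuffle=False preserves row order, so no random_state is needed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=False
)
print(X_train.shape, X_test.shape)
```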


---
## Question 2 Part B - Data Standardization ( X / 2 pts )
- Z-standardize the input features of the training and test sets.
- All the questions below should be based on the standardized dataset.
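
One common way to z-standardize is sketched below on random toy arrays (the arrays are hypothetical, not the exam data). The sketch assumes the usual convention of estimating the mean and standard deviation on the training set only and reusing them for the test set; if the exam intends something different, state that assumption in your answer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrices
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))
X_test = rng.normal(loc=50, scale=10, size=(100, 3))

scaler = StandardScaler()
# Learn per-feature mean and standard deviation from the training set
X_train_std = scaler.fit_transform(X_train)
# Apply the training-set statistics to the test set
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```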

---
## Question 2 Part C - Basic Lasso Regression ( X / 4 pts )
- Build a regression model with L1 regularization (Lasso) and the default alpha value. Fit it on your training set, and set the random state to 42.
- Report the coefficients and intercept of the model.
- Report the Root Mean Square Error (RMSE) to evaluate the testing performance of your model.
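
As a rough illustration of these steps, the sketch below fits a default-alpha Lasso on synthetic data and computes the test RMSE as the square root of the mean squared error. The data and the coefficients in `true_coef` are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Hypothetical synthetic regression data (features already on a unit scale)
rng = np.random.default_rng(0)
true_coef = np.array([3.0, 0.0, -2.0, 0.0])
X_train = rng.normal(size=(200, 4))
X_test = rng.normal(size=(200, 4))
y_train = X_train @ true_coef + 10 + rng.normal(scale=0.5, size=200)
y_test = X_test @ true_coef + 10 + rng.normal(scale=0.5, size=200)

# Default alpha is 1.0
lasso = Lasso(random_state=42)
lasso.fit(X_train, y_train)

print("Coefficients:", lasso.coef_)
print("Intercept:", lasso.intercept_)

# RMSE = sqrt(MSE) on the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, lasso.predict(X_test)))
print("Test RMSE:", rmse)
```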

---
## Question 2 Part D - Determine the Optimal Regularization Term ( X / 12 pts )
- Perform Lasso Regression with 5-fold cross-validation on the training set to find and **print out** the optimal regularization parameter (alpha) value. Vary the regularization parameter (alpha) between 0.01 and 100, using 100 values evenly spaced in log-space. Set the random state to 42. Tip: use the `LassoCV` function.
- Create a plot showing the relationship between these 100 alpha values and their corresponding mean RMSE values. Set the scale of the x-axis to a logarithmic scale.
- Build and fit a Lasso Regression model on the training set using the optimal alpha and a random state of 42. Report the coefficients and intercept of the model. Report the Root Mean Square Error (RMSE) to evaluate the testing performance of your model.
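
The alpha search can be sketched as follows on synthetic data (hypothetical, not the exam dataset). Two choices here are assumptions: the grid is built with `np.logspace(-2, 2, 100)`, and "mean RMSE" is taken as the mean across folds of the per-fold RMSE, computed from `mse_path_`. Taking the square root of the mean MSE instead is another reasonable reading.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
true_coef = np.array([3.0, 0.0, -2.0, 0.0])
X_train = rng.normal(size=(200, 4))
y_train = X_train @ true_coef + 10 + rng.normal(scale=0.5, size=200)

# 100 values from 10^-2 to 10^2, evenly spaced in log-space
alphas = np.logspace(-2, 2, 100)

lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)
print("Optimal alpha:", lasso_cv.alpha_)

# mse_path_ has shape (n_alphas, n_folds); rows follow the order of alphas_
mean_rmse = np.sqrt(lasso_cv.mse_path_).mean(axis=1)
print(mean_rmse[:5])
```

Plotting `lasso_cv.alphas_` against `mean_rmse` and calling `plt.xscale("log")` would give the requested curve. Note that `alphas_` is the grid as stored by the fitted model, so it should be used for the x-axis rather than the input array.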

---
---
# Question 3 - Classification and Evaluation (X / 35 pts)
Your next task is to build classification models that can identify the malicious attacker IPs.

---
## Question 3 Part A - Data Splitting For Classification ( X / 2 pts )
- Use 'label' as the target variable y for your classification models for abnormal IP detection, and other variables as the feature set X.
- Split the data into equal-sized training and test sets, and ensure that the class labels are distributed in the same proportions in both sets (stratified split).
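
A label-preserving split is usually done with the `stratify` argument of `train_test_split`. The sketch below uses invented data with an 80/20 class ratio; `random_state=42` is an assumption here, since the question does not specify a seed for the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 normal (0) and 20 abnormal (1) samples
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```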

---
## Question 3 Part B - Data Standardization ( X / 2 pts )
- Z-standardize the input features of the training and test sets.
- All the questions below should be based on the standardized dataset.

---
## Question 3 Part C - Random Forest ( X / 5 pts )
- Build a Random Forest model that consists of 5 base decision trees with a maximum depth of 5, and fit it on the training set. Set the random state to 42.
- Print out the accuracy, F1-score, confusion matrix, and execution time (including both training and testing time) of the model when evaluating the testing performance of your model.
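
A possible shape for this evaluation is sketched below on synthetic data (hypothetical, not the exam dataset). Timing with `time.perf_counter()` around both `fit` and `predict` is one way to capture training plus testing time; `time.time()` would also work.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical synthetic binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Time training and prediction together
start = time.perf_counter()
rf = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
elapsed = time.perf_counter() - start

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", acc)
print("F1-score:", f1)
print("Confusion matrix:\n", cm)
print("Execution time (s):", elapsed)
```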


---
## Question 3 Part D - Feature Selection by Random Forest ( X / 14 pts )
- Use the Random Forest model you built in Q3-C to generate feature importance scores and select the most important features (rank the importance scores of each feature in descending order, and only select the important features from most to least important until the accumulated relative importance score reaches 90% or 0.9).
- Use a horizontal bar chart to plot the importance scores of all features in descending order. Add appropriate x-axis and y-axis labels.
- Print out the selected features with their importance scores, and generate the new training and test sets with the new feature set. 
- Retrain the same Random Forest model from Q3-C on the new training set, and print out the accuracy, F1-score, confusion matrix, and execution time (including both training and testing time) of the model on the new test set.
- Plot the ROC curve for evaluating the Random Forest model on the new test set and report the area under the ROC curve.
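
The cumulative-importance selection and the ROC computation can be sketched as below. Everything here is illustrative: the data and the feature names `f0` to `f4` are made up, and the sketch assumes that the feature whose importance pushes the running total to 0.9 or above is itself included in the selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

# Hypothetical synthetic data: only the first two features are informative
rng = np.random.default_rng(0)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

rf = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

# Rank features by importance (descending) and accumulate the scores
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])

# Keep features up to and including the first one that reaches 0.9
n_keep = int(np.searchsorted(cumulative, 0.9) + 1)
selected = order[:n_keep]
for idx in selected:
    print(feature_names[idx], importances[idx])

# Retrain the same model on the reduced feature set
rf_sel = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=42)
rf_sel.fit(X_train[:, selected], y_train)

# ROC curve from the predicted probability of the positive class
proba = rf_sel.predict_proba(X_test[:, selected])[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)
```

For the charts, `plt.barh` on the sorted importances and `plt.plot(fpr, tpr)` are typical starting points, each with x-axis and y-axis labels added.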

---
## Question 3 Part E - Hyperparameter Tuning of Random Forest ( X / 8 pts )
- Use grid search with 3-fold cross-validation to tune two hyperparameters for the Random Forest model you built in Q3-D:
    - The number of base estimators/decision trees (find the better value among the two numbers 10 and 20).
    - The maximum tree depth (find the better value among the two numbers 10 and 20).
- Print out the detected better hyperparameter values and cross-validation score.
- Build the Random Forest model with the better hyperparameter values you found, and fit the new training set from Q3-D.
- Report the accuracy, F1-score, confusion matrix, and execution time (including both training and testing time) of the model when evaluating the testing performance of your model on the new test set from Q3-D.
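
A minimal sketch of the tuning step with `GridSearchCV`, again on synthetic data (hypothetical). Using accuracy as the `scoring` metric is an assumption; the question does not name the cross-validation score to optimize.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# 2 x 2 grid: number of trees and maximum tree depth
param_grid = {"n_estimators": [10, 20], "max_depth": [10, 20]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",  # assumed metric
)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)

# Rebuild the model with the selected values and fit it on the training set
best_rf = RandomForestClassifier(random_state=42, **grid.best_params_)
best_rf.fit(X_train, y_train)
test_accuracy = best_rf.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```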


---
## Question 3 Part F - Classification Model Performance Discussion ( X / 4 pts )
- Compare the performance of the three models from Questions 3-C, 3-D, and 3-E, and discuss reasons for performance difference.
- Compare the execution time of the three models from Questions 3-C, 3-D, and 3-E, and discuss reasons for time/efficiency difference.

#### Written answer: Explain here.


#### Written answer: Explain here.


---
---
---
**You're done! As always, double-check your work by re-running the notebook from scratch.**