## Mid Project: Anomaly Detection with Autoencoders

Dr. Leslie Kerby, CS 6699 Advanced AI Methods <br>
Spring 2024

The goal of this assignment is to design, implement, and evaluate an autoencoder for anomaly detection on the [Credit Card Fraud Detection dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud), available on Kaggle. This dataset contains transactions made by credit cards in September 2013 by European cardholders.

The dataset covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

Note:
- Experiment with different autoencoder architectures, including varying the number and size of layers, to find the best model for this task.
- Discuss the importance of choosing an appropriate threshold for anomaly detection and the trade-off between false positives and false negatives.

### Part 1: Data Exploration and Preprocessing
   - Familiarize yourself with the dataset. Understand the features and the target variable.
   - Handle missing values, if any, and normalize the data if required. Features `V1` through `V28` are PCA components, but `Time` and `Amount` are left untransformed, so check the scale of every feature rather than assuming normalization has already been taken care of.
   - Since the dataset is imbalanced, discuss how this might affect training and how you plan to address it.
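
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not a required approach: the helper name `prepare_splits`, the 80/20 split ratios, and the use of `StandardScaler` are assumptions, and a small synthetic frame stands in for the Kaggle CSV so the snippet runs without the file. The only column name it relies on is the `Class` label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare_splits(df, seed=0):
    """Return scaled (train, val) sets of normal rows and a mixed test set."""
    assert not df.isna().any().any(), "handle missing values before scaling"
    features = df.drop(columns=["Class"])
    labels = df["Class"].to_numpy()

    # Stratify so the held-out test set keeps the original fraud ratio.
    x_rest, x_test, y_rest, y_test = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Only normal transactions are used for training and validation here;
    # the few frauds left in x_rest are simply set aside in this sketch.
    x_normal = x_rest.loc[y_rest == 0]
    x_train, x_val = train_test_split(x_normal, test_size=0.2, random_state=seed)

    # Fit the scaler on training rows only, so no test statistics leak in.
    scaler = StandardScaler().fit(x_train)
    return (scaler.transform(x_train), scaler.transform(x_val),
            scaler.transform(x_test), y_test)


# Synthetic stand-in for the real data (assumption: 40 frauds in 2000 rows).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(2000, 3)), columns=["V1", "V2", "Amount"])
demo["Amount"] = np.exp(demo["Amount"])  # skewed, like a real amount column
demo["Class"] = 0
demo.loc[:39, "Class"] = 1

x_train, x_val, x_test, y_test = prepare_splits(demo)
print(x_train.shape, x_val.shape, x_test.shape, int(y_test.sum()))
```

Keeping frauds out of the training split is one way to respond to the imbalance: the model is never asked to classify, so the class ratio affects only evaluation and threshold selection.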


### Part 2: Model Design
   - Design an autoencoder architecture suitable for this dataset. Given the nature of the data (numerical and possibly high-dimensional), a dense (fully connected) network might be a good starting point.
   - Consider the size of the latent space carefully; it should be small enough to force the autoencoder to learn a compressed representation but large enough to capture the essential characteristics of normal transactions.
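
As a starting point, a dense architecture of the kind described above might look like the sketch below. PyTorch is an assumed framework choice (Keras would work equally well), the class name `DenseAutoencoder` is hypothetical, and the layer sizes are placeholders to experiment with. The default of 30 inputs assumes all non-label columns (`Time`, `V1` to `V28`, `Amount`) are used.

```python
import torch
from torch import nn


class DenseAutoencoder(nn.Module):
    """Symmetric fully connected autoencoder with a narrow bottleneck."""

    def __init__(self, n_features=30, hidden=16, latent=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, latent),  # bottleneck: the compressed code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden),
            nn.ReLU(),
            # Linear output layer: standardized inputs are not bounded,
            # so no sigmoid or tanh is applied to the reconstruction.
            nn.Linear(hidden, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = DenseAutoencoder()
batch = torch.randn(8, 30)          # 8 fake transactions
code = model.encoder(batch)         # latent representation
reconstruction = model(batch)
print(code.shape, reconstruction.shape)
```

Changing `hidden` and `latent` is the simplest way to explore the trade-off between compression and reconstruction quality.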

### Part 3: Training
   - Train your autoencoder using only the normal transactions in the training set. This is crucial as the autoencoder needs to learn to reconstruct normal transaction profiles.
   - Validate the performance of your autoencoder on a separate set of normal transactions to tune hyperparameters and avoid overfitting.
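
A training loop along these lines is sketched below, again assuming PyTorch. The data is synthetic (low-rank Gaussian samples standing in for scaled normal transactions), and the optimizer, learning rate, batch size, and epoch count are illustrative assumptions rather than recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for scaled *normal* transactions: 30 features driven by 4 factors.
basis = torch.randn(4, 30)


def sample(n):
    return torch.randn(n, 4) @ basis + 0.1 * torch.randn(n, 30)


x_train, x_val = sample(512), sample(128)

model = nn.Sequential(
    nn.Linear(30, 16), nn.ReLU(),
    nn.Linear(16, 4),               # bottleneck
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 30),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

history = []                        # validation loss per epoch
best_val = float("inf")
for epoch in range(30):
    model.train()
    perm = torch.randperm(x_train.shape[0])
    for start in range(0, x_train.shape[0], 64):
        batch = x_train[perm[start:start + 64]]
        optimizer.zero_grad()
        loss = loss_fn(model(batch), batch)   # the target is the input itself
        loss.backward()
        optimizer.step()

    # Validate on held-out normal data to monitor overfitting.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), x_val).item()
    history.append(val_loss)
    best_val = min(best_val, val_loss)

print(f"first epoch val loss {history[0]:.3f}, best {best_val:.3f}")
```

A validation loss that stops improving while the training loss keeps falling is the usual signal to stop early or shrink the model.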

### Part 4: Anomaly Detection
   - Use the reconstruction error as the metric to detect anomalies. Transactions that result in a high reconstruction error are likely to be anomalies (fraudulent transactions in this case).
   - Determine a threshold for the reconstruction error above which a transaction is considered fraudulent. This can be based on a validation set or statistical methods.
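
One simple statistical rule is to take a high percentile of the reconstruction errors on normal validation data, as sketched below. The function names, the 99th-percentile choice, and the gamma-distributed stand-in errors are all assumptions made for illustration; real errors would come from the trained autoencoder.

```python
import numpy as np


def reconstruction_error(x, x_hat):
    """Per-sample mean squared error between inputs and reconstructions."""
    return np.mean((x - x_hat) ** 2, axis=1)


def percentile_threshold(val_errors, q=99.0):
    """Threshold at the q-th percentile of errors on normal validation data."""
    return np.percentile(val_errors, q)


rng = np.random.default_rng(0)
# Stand-in errors: normal transactions reconstruct well, frauds poorly.
val_errors = rng.gamma(shape=2.0, scale=0.5, size=5000)
test_errors = np.concatenate([
    rng.gamma(shape=2.0, scale=0.5, size=1000),      # normal
    rng.gamma(shape=2.0, scale=0.5, size=20) + 5.0,  # hypothetical frauds
])

threshold = percentile_threshold(val_errors, q=99.0)
flagged = test_errors > threshold
print(f"threshold={threshold:.3f}, flagged={int(flagged.sum())} of {flagged.size}")
```

Lowering `q` catches more frauds at the cost of more false alarms, since by construction about `100 - q` percent of normal validation transactions exceed the threshold.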

### Part 5: Evaluation
   - Evaluate your model on a test set containing both normal and fraudulent transactions. Use metrics suitable for imbalanced classification problems, such as precision, recall, F1-score, and the area under the ROC curve (AUC).
   - Discuss the performance of your model and any challenges you encountered.
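
The metrics named above can be computed with scikit-learn, as in the following sketch. The scores and labels are synthetic stand-ins (assumed: 2000 normal and 10 fraudulent transactions, with reconstruction error as the anomaly score), and the threshold of 2.5 is arbitrary.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])
# Reconstruction errors used as anomaly scores: frauds score higher on average.
scores = np.concatenate([rng.normal(1.0, 0.5, 2000), rng.normal(3.0, 0.5, 10)])

threshold = 2.5                      # arbitrary, for illustration only
y_pred = (scores > threshold).astype(int)

# Threshold-dependent metrics for the positive (fraud) class.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# ROC AUC is threshold-free: it uses the raw scores, not the predictions.
roc_auc = roc_auc_score(y_true, scores)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"TP={tp} FP={fp} FN={fn} TN={tn} ROC AUC={roc_auc:.3f}")
```

Accuracy is deliberately left out: with frauds at 0.172% of the data, a model that flags nothing would still score above 99.8%.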