# Fraud Detection System Report

## System Design
Revisiting our original requirements below shows that the system essentially meets them.
1. COO & Legal team want the system to follow all data handling and privacy laws.
    * While implementing data privacy and security functionality is out of scope for this case study, some safe data handling practices are used. For example, the data (that gets stored in the resources folder) does not get committed to any code repository like GitHub. Additionally, the actual data rows and information do not reach the end API user, which is a good thing since we would have different types of users who may have different permissions. In the real world, we wouldn't want any data or information in GitHub, and instead would securely store it in a database. Additionally, our API doesn't have any kind of authentication which should be implemented in a real-world scenario. A bearer token could be used, and could contain information about the API user that we could use to govern permissions.
2. Engineering and legal teams want the system and design decisions to be well-documented.
    * This document, as well as the ones in the requirements folder, the README, and the notebooks in the analysis folder all contribute to meeting this requirement. Any further real-world system changes would also be documented similarly.
3. CEO and Chief Risk Officer want to prioritize fraud detection rate and maximize true positives.
    * To maximize true positive rates, we focused on prioritizing recall over precision when comparing the performance of different models. This was done by using the F-beta score, with beta=2, when comparing different models during hyperparameter tuning in the `train()` function of `model.py`. This F-beta score allowed us to specify that recall is twice as important as precision when calculating the performance of the model. Additionally, the precision, recall, f1_score, f2_score, and accuracy metrics are made available via the `GET /model_metrics` endpoint.
4. Chief Risk Officer wants to minimize the highest risk fraud type, which is account take over.
    * This functional requirement cannot be directly measured by some metric, but could be thoroughly tested by a QA team using our API endpoints. If the QA team has a set of transaction data where they know account takeover has occurred, they can provide that to us to enter into the database so they can use it for thorough training and testing via our API endpoints.
5. Chief Risk Officer wants the system to prioritize reducing the amount of financial loss instead of total number of fraudulent transactions.
    * Like above, this functional requirement is difficult to measure directly. One way to attempt to track this in a real world setting would be to implement an additional system metric that keeps track of the total transaction amounts for transactions that the deployed model predicted to be fraudulent. This is similar to what I discussed in the Module 4 discussion, but one drawback to this is that a rise in total fraud amount may not necessarily be caused by large fraudulent transactions, and instead could be caused by some hidden variable like a recent data breach. Regardless, adding additional future online metrics such as this to the system after QA testing would be easy since we have a well-defined `Metrics` class to hold our metrics calculations, and we're storing `prediction` logs to keep track of what the deployed model has predicted for a given input.
6. CEO wants new system to exceed previous metrics of 40% precision and 85% recall.
    * Using the API calls (in order) outlined in the Postman collection in [assets/EN.705.603.82 Case Study 5.postman_collection.json](../assets/EN.705.603.82%20Case%20Study%205.postman_collection.json), I was able to achieve a precision of 79% and recall of 79% when calling my `GET /model_metrics` endpoint on my `test1` data with my `model1` random forest model (created using the requests in the Postman collection). Although the recall value is slightly below the goal of 85%, the precision value is well above the goal of 40%. Further tuning/training could be performed to improve the recall value. For example, we could increase the value of beta in our F-beta score calculation to put more emphasis on recall over precision when comparing models.
    
![model_metrics results](../assets/images/model_metrics.png)

7. Credit card users want to receive as few transaction confirmation notifications as possible.
    * The notification system was out of scope for this case study project, but maximizing true positives as described in #3 will help achieve this requirement. Ensuring that all positive predictions really are positive cases (i.e. keeping precision high) means that false positives are minimized so that customers do not get unnecessary and frequent notifications. Additional processes could be put in place in the company, such as alerting a customer service or fraud analyst team to perform further inspection of a questionable prediction, instead of directly alerting the customer.
8. CEO wants fraud to be detected as quickly as possible when the transaction occurs.
    * The `POST /predict` endpoint returns its response very quickly (49ms in the example below), which would not be a noticeable delay to a customer making a transaction. Therefore, this requirement can be considered met. Additional monitoring that tracks request duration while the system is running would be useful to ensure this requirement continues to be met. In practice, this can be done with solutions such as AWS CloudWatch.

![predict results](../assets/images/predict.png)

9. CEO wants the system to have a flexible and customizable alert system to notify stakeholders such as fraud analysts and customer service representatives about flagged transactions.
    * Again, the alert system was out of scope for this case study. However, the system does produce log events upon each API call and when a prediction is made. In the real world, these log events could easily be turned into alerts or notifications based on specific events that occur. Additionally, a cloud monitoring solution such as AWS CloudWatch, Splunk, or Datadog could be used for additional alerts and monitoring of the system.
10. CEO and Head of Credit Card Operations want the system to integrate with customer relationship management platforms, payment systems, analytic tools.
    * Since we've made a REST API, it is easy to pass data into and out of our fraud detection system. This will allow for easy integrations with any other partner systems that should be connected.

A diagram of the system can be seen below. There are 2 main parts: the top row involves the data storage and training process, while the bottom row describes the inference and model usage process. Both processes involve creating logs.

![diagram](../assets/images/diagram.png)

## Data, Data Pipelines, and Model
### Data
The data sources of transactions are expected to be stored in the `/data` folder. There's a [put_data_here.txt](../data/put_data_here.txt) placeholder file indicating where the three data sources should be placed: `transactions_0.csv`, `transactions_1.parquet`, and `transactions_2.json`. The following columns are expected: `trans_date_trans_time, cc_num, merchant, category, amt, first, last, sex, street, city, state, zip, lat, long, city_pop, job, dob, trans_num, unix_time, merch_lat, merch_long`. The datasets and feature files that the system generates in response to API requests are stored in a `resources` folder that gets created when the system runs.

### Data Pipelines
There are two main data pipelines: one for dataset generation and training, and one for test set generation and inference. The first pipeline begins when the system starts. Assuming the 3 raw data sources have been stored in the `/data` folder and named as described above, the system will process these upon system start up. The first two data sources, `transactions_0.csv` and `transactions_1.parquet`, are used when sampling to create a training dataset. The third data source, `transactions_2.json`, is used when sampling to create a test set. Upon system start up, the raw datasets are read using `deployment.py` and cleaned and validated using functions from `data_engineering.py`. The data is then stored in `Dataset` instances as attributes of the `DeploymentPipeline` class: `self.dataset_train` (contains info from the first two files) and `self.dataset_test` (contains info from the third file). Next, when the `PUT /generate_new_dataset` endpoint is called, a new dataset is generated by sampling the overall database. The endpoint accepts the parameters `type` to specify whether to generate a test or train set, `sampling_type` (either random or stratified, with stratified used by default), `n_samples` to specify the sample size (for random) or samples per class (for stratified), and `generate_features`, a boolean indicating whether the features should be extracted and saved for this dataset (True by default). Features can also be generated later for a dataset using the `PUT /generate_new_features` endpoint, giving a dataset as the `version` parameter.
The feature generation process involves transforming the data by removing identifier columns (cc_num and trans_num), ensuring numbers and categories are stored correctly, performing SMOTE if `run_smote` is set to True in the `PUT /generate_new_features` request, scaling, standardizing, and adding noise to numeric columns, then finally using a Chi-square and ANOVA test to select the top 50% best categorical and numeric features to keep. The features are then saved to the `resources/features` folder, with the same name as the dataset in the `resources/datasets` folder. Once the features are generated, they can be used to train a model with the `PUT /train` endpoint.
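
As a rough illustration of the stratified option described above, where `n_samples` is interpreted as samples per class, the sketch below draws the same number of rows from each class. It is not the project's implementation: the `stratified_sample` helper and the `is_fraud` label name are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, n_per_class, seed=0):
    """Draw up to n_per_class rows from each class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        # A rare class may have fewer rows than requested.
        k = min(n_per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 2 fraud rows out of 100. `is_fraud` is an assumed label name.
rows = [{"amt": float(i), "is_fraud": int(i < 2)} for i in range(100)]
sample = stratified_sample(rows, "is_fraud", n_per_class=2)
counts = {label: sum(r["is_fraud"] == label for r in sample) for label in (0, 1)}
print(counts)  # {0: 2, 1: 2}
```

With purely random sampling of 4 rows from this toy data, the sample would most likely contain no fraud rows at all, which is why stratified sampling is a sensible default for a heavily imbalanced source.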

The inference pipeline is similar, but starts with an unknown transaction being passed as JSON in the body of a `POST /predict` request. The `version` parameter in the request determines which model to use for inference. The system then looks up the training data that was used to train that model, and finds the statistics that were used to transform that training data, so they can be used to transform the new "test" data to be predicted. The input data is cleaned using `DataEngineering`, and then the same transformations (remove ids, ensure categorical and numeric columns are typed correctly, scale, standardize, add noise, and filter to use only the best columns determined during training) are applied using `FeatureEngineering` as explained above, minus SMOTE and Chi-square/ANOVA testing. Once the features are extracted from this test data point, the selected model is used to get a prediction outcome, which is then returned as the response to the POST request.
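
The idea of reusing training statistics at inference time can be sketched as follows. This is a minimal illustration rather than the code in `FeatureEngineering`: the `fit_scaler` and `transform` helpers are hypothetical, and only standardization of a single numeric column is shown.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute standardization statistics from training data only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the column is constant
    return {"mean": mu, "std": sigma}


def transform(value, stats):
    """Standardize a value with previously stored training statistics."""
    return (value - stats["mean"]) / stats["std"]


# Statistics are fitted once, at feature-generation time, and stored.
train_amounts = [10.0, 20.0, 30.0]
stats = fit_scaler(train_amounts)

# At inference, the new transaction is transformed with the stored
# statistics; nothing is re-fitted on the incoming data point.
z = transform(40.0, stats)
print(round(z, 3))  # 2.449
```

Fitting the statistics on a single incoming transaction would be meaningless (its standard deviation is zero), so the model must see inputs on the same scale it was trained on.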

### Model
Three different model types can be used: Logistic Regression, Stochastic Gradient Descent, and Random Forest. The model type is specified by the `model_type` parameter in the `PUT /train` request. Additionally, the JSON body of this PUT request is where users should enter the hyperparameters for a specific model. The values should be lists (even if a list contains only one item), since the system is set up to perform automated hyperparameter tuning over the provided lists. Once the model is trained, it can be used for predictions and for obtaining model metrics on a given test set using the `GET /model_metrics` endpoint.
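
To show why every hyperparameter value is given as a list, the sketch below expands a request body into the individual configurations a grid search would try. The `expand_grid` helper and the specific values are illustrative assumptions; only the hyperparameter names (`n_estimators`, `min_samples_split`) come from scikit-learn's random forest.

```python
from itertools import product


def expand_grid(param_lists):
    """Expand {name: [values, ...]} into one dict per combination."""
    names = sorted(param_lists)
    value_lists = (param_lists[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Example JSON body for a random forest; the values are illustrative.
body = {"n_estimators": [50, 100, 200], "min_samples_split": [2, 5, 10]}

grid = expand_grid(body)
print(len(grid))  # 9
print(grid[0])    # {'min_samples_split': 2, 'n_estimators': 50}
```

Two hyperparameters with three values each yield 3 x 3 = 9 candidate configurations. A single-item list such as `{"n_estimators": [200]}` yields exactly one configuration, which is why fixed values are still wrapped in a list.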

## Metrics Definition
Most of the metrics tracked by the system are offline metrics due to time and feasibility constraints of this Case Study project. In the real world, cloud resources such as AWS CloudWatch, Datadog, and Splunk could be used to obtain more online metrics. These could be configured to include things like API errors and traffic, and logs could be created to help detect things like model drift and online performance.

### Offline Metrics
1. F1 Score: This metric is calculated because it gives a measure of balancing precision and recall equally. However, in our case where we want to maximize true positives, an F-beta score may be more useful.
2. F2 Score: This metric is similar to the F1 score above, but instead is an F-beta score with beta = 2. This allows us to give greater importance to recall rather than precision, since we care more about maximizing true positives for our case. This is a valuable metric, which is used by the system to compare the results of different models during hyperparameter tuning.
3. Recall: This is an important metric for ensuring that we're correctly identifying transactions that are truly fraudulent. Maximizing recall corresponds with our goal of maximizing true positives, so this one is very important to track.
4. Precision: This metric is tracked so we can identify how useful a model is for reducing the risk of false positives. Although we may value recall slightly more than precision for our use case where we want to maximize true positives, this is still a good and relevant metric to be able to track to get a better idea of the ways in which the model may be failing.
5. Accuracy: This metric is tracked because it is simple and widely understood, and, together with the recall and precision scores above, it helps give a fuller picture of the confusion matrix for a given model. However, for our case with heavily imbalanced fraud vs. non-fraud classes, this metric likely is not very useful: it will typically be high simply because the large number of non-fraud transactions will likely be classified correctly.
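
The relationship between the F1 and F2 scores above can be made concrete with the standard F-beta formula, `(1 + beta^2) * P * R / (beta^2 * P + R)`. The `f_beta` helper below is a small sketch written for this document, not the function used in `model.py`; the inputs are the precision and recall figures quoted earlier in this report.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# New model: precision 79%, recall 79%.
print(round(f_beta(0.79, 0.79, beta=2), 2))  # 0.79

# Previous system: precision 40%, recall 85%.
print(round(f_beta(0.40, 0.85, beta=1), 3))  # 0.544
print(round(f_beta(0.40, 0.85, beta=2), 3))  # 0.694
```

For the previous system, the F2 score is noticeably higher than the F1 score because its recall is much higher than its precision. When precision and recall are equal, every F-beta score equals that shared value.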

### Online Metrics
1. Endpoint Usage: An online metric that the system tracks while it's running is how many times each endpoint is called. These values are reset to 0 if the system is stopped (i.e. every time you run `python main.py`, these endpoint usage metrics will be reset). In a real world setting, this could be done using a tool like Datadog. This information is useful to determine which of our endpoints are being used by our users, and how frequently they're being used. This could lead us to allocate more resources to high-traffic endpoints, or identify if errors are being returned from low-traffic endpoints. These metrics will help give us insight on where to focus our improvement efforts.
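
An in-memory counter of this kind could look like the sketch below. The `EndpointUsage` class is a hypothetical stand-in rather than the project's `Metrics` class; it only illustrates why the counts reset whenever the process restarts.

```python
from collections import Counter


class EndpointUsage:
    """In-memory call counter; counts are lost when the process stops."""

    def __init__(self):
        self._counts = Counter()

    def record(self, method, path):
        """Increment the count for one endpoint, e.g. ("POST", "/predict")."""
        self._counts[f"{method} {path}"] += 1

    def snapshot(self):
        """Return a plain-dict copy of the current counts."""
        return dict(self._counts)


usage = EndpointUsage()
usage.record("POST", "/predict")
usage.record("POST", "/predict")
usage.record("GET", "/model_metrics")
print(usage.snapshot())  # {'POST /predict': 2, 'GET /model_metrics': 1}
```

Because the counts live only in process memory, persisting them across restarts would require writing them to external storage or a monitoring service.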

## Analysis on System parameters and Configurations
### Dataset Evaluation
The source dataset is heavily imbalanced and less than 1% of the rows are labeled as fraudulent transactions. This led to the choice to use SMOTE for data augmentation to oversample the minority class so we can train the model with more fraudulent-like transactions. This also led to the decision to perform stratified sampling by default when constructing new datasets from the source data. However, for flexibility, there's still an option to specify `random` as the sampling type if desired. During EDA (described in [exploratory_data_analysis.ipynb](./exploratory_data_analysis.ipynb)), it was clear that some features such as customer birth year, time of transaction, purchase category, and potentially month of transaction and city population have an effect on fraud vs. non-fraud classification. This led to the choice to statistically determine the most influential features using a Chi-square test for categorical features and ANOVA test for numeric features.

### Feature Engineering Evaluation
The results of the EDA mentioned above played a role in the feature engineering process. As mentioned above, this included data augmentation using SMOTE to create more balance among the fraud/non-fraud classes in our training data. This was to avoid overfitting the model on the non-fraud data. Additionally, Chi-square and ANOVA tests were performed to find the statistical top 50% best categorical and numeric features to use for fraud classification, since EDA showed that not all features contribute equally to the fraud classification. Some features that I saw get chosen frequently during my testing included category, day_of_week_trans_data_trans_time, hour_of_day_trans_data_trans_time, and year_dob. This seemed to match up well with the plots in [exploratory_data_analysis.ipynb](./exploratory_data_analysis.ipynb) that show that some features such as customer birth year, time of transaction, and purchase category may have an effect on fraud vs. non-fraud classification. Additionally, the transformation steps that occurred in the `FeatureEngineering` class, such as scaling, standardizing, and adding noise, were required to ensure that all numeric features were considered equally, especially in statistical tests like the ANOVA. Finally, categorical encoding was necessary in order to include those variables in training.

### Model Evaluation
The [model_performance_and_analysis.ipynb](./model_performance_and_selection.ipynb) showed that a random forest model seemed to provide the best precision and recall results, and provided the best overall f2 score that was used as the metric for comparing different models. The models tested from best f2-score to worst were random forest, stochastic gradient descent, and logistic regression. The system allows all 3 types of models to be trained using the `PUT /train` endpoint by specifying the `model_type` parameter. This was implemented to allow greater flexibility to data scientists and end users who may be using our API, so they can decide what model to use based on their own tests too. Each of the three model types was trained using different hyperparameters. For the stochastic gradient descent model, the l1_ratio and alpha hyperparameters were analyzed, resulting in `'alpha': 0.001, 'l1_ratio': 0.25` providing the best f2 score of 78.9%. The logistic regression model was analyzed by tuning the C and l1_ratio hyperparameters, resulting in the best score of 69.3% with `'C': 0.5, 'l1_ratio': 0.75`. Finally, the random forest classifier was tuned with the min_samples_split and n_estimators hyperparameters, with the best f2 score being 98.6% with `'min_samples_split': 2, 'n_estimators': 200`. Since only 3 values were tested for each hyperparameter, resulting in a 3x3 parameter search grid (9 combinations) for each model, more experimentation with hyperparameters would be required to find the best model configuration. The PR curve in [model_performance_and_analysis.ipynb](./model_performance_and_selection.ipynb) shows that the random forest model in the notebook performed nearly perfectly on the training set, but performed poorly on the test set. This indicates potential overfitting to the training set, and additional regularization methods should be tested to reduce overfitting.
To aid with this experimentation, hyperparameter tuning is done automatically by our system via the `PUT /train` endpoint. The body of the request is where users can enter the hyperparameter lists that they'd like to tune. Available hyperparameters can be determined from the scikit-learn documentation for the [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [Stochastic Gradient Descent](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), and [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) classifiers.

### Overall System
Overall, each component of the system allows for flexibility and different use cases by the end users. For example, dataset generation can be stratified or random, feature extraction can include SMOTE or not, and model training can include hyperparameter tuning and three different model types. Automation is included where possible, such as using Chi-square and ANOVA tests to automatically determine the best 50% of features to include for training. This flexibility and automation will allow our system to be used by more users with various use cases.

## Post-deployment Policies
### Monitoring and Maintenance Plan
The logs and online metrics are a large part of the monitoring and maintenance plan. Logs are written upon dataset creation, feature extraction, model training, input data prediction, and model metric calculation. This will give us insights into what's happening at each step of the pipeline, and allow us to identify any potential issues or reasons for why a model may not be performing as expected. The online metric of endpoint usage that we're tracking will allow us to determine our most and least popular endpoints, so we can focus our improvement efforts accordingly. Additional online metrics such as customer satisfaction, total fraudulent amount, number of fraudulent transactions over time, data distributions over time, and model performance over time all would help to ensure our system is continuing to meet performance requirements. Any ongoing maintenance and code changes could be performed locally, and the docker image could be built and deployed in a container to ensure stability and reproducibility.
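
As one example of the additional online metrics mentioned above, the total amount flagged as fraudulent could be derived from the stored prediction logs. The sketch below assumes each log entry is a dict with an `amt` field and a `prediction` field equal to 1 for predicted fraud. That record layout and the `flagged_amount` helper are hypothetical, since the real log format is not described here.

```python
def flagged_amount(prediction_logs):
    """Sum the transaction amounts the deployed model flagged as fraud."""
    return sum(
        entry["amt"] for entry in prediction_logs if entry["prediction"] == 1
    )


# Hypothetical log entries; the real log schema may differ.
logs = [
    {"amt": 12.5, "prediction": 0},
    {"amt": 900.0, "prediction": 1},
    {"amt": 45.25, "prediction": 1},
]
print(flagged_amount(logs))  # 945.25
```

Tracked over time, a metric like this reflects financial exposure rather than just the count of flagged transactions.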

### Fault Mitigation Strategies
Some fault mitigation strategies may include backing up the docker image so it can be redeployed if for some reason the system goes down. Containerization makes it easier to rebuild the system the exact same way repeatedly. Additionally, storing our logs and data outside of the system (i.e. in a database) would be ideal, so we could still access the logs and data even if the docker container goes down or is redeployed. Carefully monitoring the logs and online metrics, in addition to implementing more online metrics as described above, will help us catch any potential issues before they arise or immediately when they arise. This can be done by setting up alerts using tools like AWS CloudWatch or Datadog.