![Fraud detection image](cover_image.jpg)

🏦 Banks are battling fraud with machine learning models, but changing data patterns can weaken these defenses. London's Poundbank needs your help to figure out why its fraud detection models aren't as accurate anymore.

Poundbank recommends the `nannyml` library for monitoring machine learning models, which is also their tool of choice.

## The data

They have provided you with a reference set (test data) and an analysis set (production data). A summary and preview are provided below.

## reference.csv and analysis.csv

| Column     | Description              |
|------------|--------------------------|
| `'timestamp'` | Date of the transaction. |
| `'time_since_login_min'` | Time since the user logged in to the app. |
| `'transaction_amount'` | The amount in pounds (£) that users sent to another account. |
| `'transaction_type'` | Transaction type: <ul><li>`CASH-OUT` - Withdrawing money from an account.</li><li>`PAYMENT` - Transaction where a payment is made to a third party.</li><li>`CASH-IN` - This is the opposite of a cash-out. It involves depositing money into an account.</li><li>`TRANSFER` - Transaction which involves moving funds from one account to another.</li></ul> |
| `'is_first_transaction'` | A binary indicator denoting if the transaction is the user's first (1 for the first transaction, 0 otherwise). |
| `'user_tenure_months'` | The duration in months since the user's account was created or since they became a member. |
| `'is_fraud'` | A binary label indicating whether the transaction is fraudulent (1 for fraud, 0 otherwise). |
| `'predicted_fraud_proba'` | The probability, assigned by the detection model, that the transaction is fraudulent. |
| `'predicted_fraud'` | The predicted classification label, derived by the detection model from the predicted fraud probability (1 for predicted fraud, 0 otherwise). |

# Project Instructions

Use the `reference.csv` and `analysis.csv` datasets to monitor a fraud detection model and address the following questions:

- Identify the months in which both the estimated (expected) and realized (actual) accuracy of the model trigger alerts. Put these months in a list named `months_with_performance_alerts`, using lowercase and separating the month and year with an underscore. For example: `months_with_performance_alerts = ["january_2018", "march_2018"]`.

- Determine the feature that shows the most drift between the reference and analysis sets, thereby impacting the drop in realized accuracy the most. Historically, Poundbank's data science team used the Kolmogorov-Smirnov and Chi-square methods to detect this drift. Store the name of this feature in a variable named `highest_correlation_feature`.

- Look for instances where the monthly average transaction amount differs from the usual, causing an alert. Save this amount in a variable named `alert_avg_transaction_amount`, ensuring it has a minimum of one decimal place in the results.

Extra task for you to try (this won't be tested):

- Use the univariate drift detection method to figure out why the accuracy dropped. Think of a possible explanation. At the end of the project, we'll give you our analysis. Remember, there are no wrong or right answers here.

# How to approach the project
1. Identifying Months with Alerts in Estimated and Realized Accuracy.
2. Identifying the Alerting Feature Most Correlated with Performance.
3. Finding the Alert-Triggering Monthly Average Transaction Amount

## Steps to complete
### 1. Identifying Months with Alerts in Estimated and Realized Accuracy.
You can estimate the model's performance using `nannyml`'s CBPE estimator and calculate the realized accuracy using a performance calculator. Then, use the comparison plot to identify when both indicate alerts.

#### Get the estimated accuracy
- Initialize the `nml.CBPE` estimator by defining `timestamp_column_name`, setting `metrics` to `["accuracy"]`, `y_true` to `"is_fraud"`, `y_pred` to `"predicted_fraud"`, `y_pred_proba` to `"predicted_fraud_proba"`, `problem_type` to `"classification_binary"`, and `chunk_period` to `"m"`, as we are identifying the month with the performance alert.
- Use the `.fit()` method to fit the estimator on the reference data.
- Call the `.estimate()` method on the analysis set.
- Save the results to the `est_results` variable.
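
For intuition about how accuracy can be estimated without labels, here is a minimal plain-Python sketch of the idea behind confidence-based estimation. It assumes the probabilities are well calibrated and is a simplification, not `nannyml`'s implementation. The function name `estimated_accuracy` and the sample scores are hypothetical.

```python
def estimated_accuracy(probabilities, predictions):
    """Expected accuracy of one chunk, using only model outputs (no labels)."""
    if len(probabilities) != len(predictions):
        raise ValueError("inputs must have the same length")
    if not probabilities:
        raise ValueError("inputs must not be empty")
    # A predicted fraud (1) is correct with probability p;
    # a predicted non-fraud (0) is correct with probability 1 - p.
    expected_correct = [
        p if pred == 1 else 1.0 - p
        for p, pred in zip(probabilities, predictions)
    ]
    return sum(expected_correct) / len(expected_correct)


# Hypothetical scores for three transactions in one monthly chunk.
chunk_proba = [0.9, 0.2, 0.6]
chunk_pred = [1, 0, 1]
print(round(estimated_accuracy(chunk_proba, chunk_pred), 4))
```

If the calibrated probabilities stop reflecting reality, for example because fraud patterns change, the estimate and the realized accuracy can diverge. Comparing the two is how that divergence gets noticed.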

#### Calculate the realized accuracy
- Initialize the `nml.PerformanceCalculator()` by defining `timestamp_column_name`, setting `metrics` to `["accuracy"]`, `y_true` to `"is_fraud"`, `y_pred` to `"predicted_fraud"`, `y_pred_proba` to `"predicted_fraud_proba"`, `problem_type` to `"classification_binary"`, and `chunk_period` to `"m"`.
- Pass the reference data using the `.fit()` method.
- Call the `.calculate()` method on the analysis set.
- Save the results to the `calc_results` variable.

#### Compare the results and find the months with alerts
- Compare the results from `CBPE` and `PerformanceCalculator` by chaining the `.compare()`, `.plot()`, and `.show()` methods, e.g. `est_results.compare(calc_results).plot().show()`.
- After analyzing the plot, create a list called `months_with_performance_alerts` and add the relevant months to it.
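
Once the alerting months are read off the plot, the required `month_year` strings can be typed by hand or built programmatically. The sketch below shows one way to do the latter with the standard library. The helper `month_label` and the sample dates are hypothetical, chosen to reproduce the example in the instructions.

```python
from datetime import datetime

# Spelled out explicitly so the result does not depend on the system locale.
MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def month_label(timestamp):
    """Format an ISO timestamp such as '2018-03-01' as 'march_2018'."""
    parsed = datetime.fromisoformat(timestamp)
    return f"{MONTH_NAMES[parsed.month - 1]}_{parsed.year}"


# Hypothetical chunk start dates, matching the example in the instructions.
alert_chunk_starts = ["2018-01-01", "2018-03-01"]
months = [month_label(ts) for ts in alert_chunk_starts]
print(months)  # ['january_2018', 'march_2018']
```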

### 2. Identifying the Alerting Feature Most Correlated with Performance.
You will need to calculate the univariate drift results, and use the correlation ranker to find which feature is correlating the most.

#### Calculate the univariate drift results
- Define the variable `features` to include these feature names: `["transaction_amount", "transaction_type", "user_tenure_months", "time_since_login_min", "is_first_transaction"]`.
- Initialize `nml.UnivariateDriftCalculator()` and specify `timestamp_column_name`, pass `features` as `column_names`, set `chunk_period` to `"m"`, and use `"kolmogorov_smirnov"` as the continuous method and `"chi2"` as the categorical method.
- Use the `.fit()` method with your reference data.
- Call the `.calculate()` method on the analysis set.
- Save the results in a variable named `udc_results`.
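
To make the Kolmogorov-Smirnov method less of a black box, here is a small plain-Python sketch of the two-sample KS statistic: the largest gap between the empirical cumulative distributions of two samples. It is an illustration only, not the code `nannyml` runs. The function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference_values, analysis_values):
    """Largest absolute gap between the two empirical CDFs (0 = same, 1 = disjoint)."""
    if not reference_values or not analysis_values:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference_values)
    ana = sorted(analysis_values)
    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(ref) | set(ana))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(ana, x) / len(ana))
        for x in points
    )


# Hypothetical feature values: the analysis sample is shifted upwards.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value near 0 means the monthly chunk looks like the reference data, while a value near 1 means the two distributions barely overlap.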

#### Use the correlation ranker
- Initialize `nml.CorrelationRanker()` using the default parameters.
- Call the `.fit()` method, passing it the `PerformanceCalculator` results filtered for the analysis period, e.g. `.fit(calc_results.filter(period="analysis"))`.
- Rank the features with the `.rank()` method, passing it the `UnivariateDriftCalculator` and `PerformanceCalculator` results.
- Save the output in the `correlation_ranked_features` variable.

#### Find the highest correlating feature
- Display the correlation-ranked features.
- Based on the DataFrame, create a variable called `highest_correlation_feature` and assign the appropriate feature name to it.
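
As a rough picture of what a correlation-based ranking does, the sketch below correlates each feature's per-month drift values with the absolute change in accuracy and sorts the features by that Pearson correlation. This is my reading of the approach, not a copy of `nannyml`'s code. All names and numbers here are hypothetical and not taken from the project data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values for illustration only.
reference_accuracy = 0.95
monthly_accuracy = [0.95, 0.94, 0.90, 0.85]
drift_by_feature = {
    "feature_a": [0.01, 0.02, 0.10, 0.20],  # drift grows as accuracy drops
    "feature_b": [0.05, 0.01, 0.04, 0.02],  # drift unrelated to accuracy
}

abs_change = [abs(acc - reference_accuracy) for acc in monthly_accuracy]
ranking = sorted(
    drift_by_feature,
    key=lambda name: pearson(drift_by_feature[name], abs_change),
    reverse=True,
)
print(ranking)  # ['feature_a', 'feature_b']
```

A high correlation says the drift and the performance drop move together in time; it does not prove that the drift caused the drop.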

### 3. Finding the Alert-Triggering Monthly Average Transaction Amount
Use the summary average statistics calculator to determine the monthly average transaction amounts that trigger an alert.

#### Calculate average monthly transactions
- Initialize `nml.SummaryStatsAvgCalculator()`, setting the `column_names` parameter to `["transaction_amount"]`, the `chunk_period` to `"m"`, and specifying the `timestamp_column_name`.
- Input the reference data using the `.fit()` method.
- Call the `.calculate()` method on the analysis set.
- Store the results in the `stats_avg_results` variable.

#### Find the month
- Display the `SummaryStatsAvgCalculator` by chaining the `.plot().show()` methods.
- Based on the plot, create a variable named `alert_avg_transaction_amount` and assign the relevant value to it.
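
The sketch below illustrates the logic of this step in plain Python: average the amounts per month, then flag analysis months whose average falls outside a band derived from the reference months. The band of three standard deviations around the mean of the reference averages is an assumption made for illustration. The function names and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_averages(rows):
    """Map 'YYYY-MM' to the mean amount, given (timestamp, amount) pairs."""
    buckets = defaultdict(list)
    for timestamp, amount in rows:
        buckets[timestamp[:7]].append(amount)  # 'YYYY-MM' prefix of an ISO date
    return {month: mean(amounts) for month, amounts in sorted(buckets.items())}


def alerting_months(reference_avgs, analysis_avgs, n_std=3):
    """Analysis months whose average lies outside mean +/- n_std * std of the reference."""
    ref_values = list(reference_avgs.values())
    centre = mean(ref_values)
    spread = pstdev(ref_values)
    lower, upper = centre - n_std * spread, centre + n_std * spread
    return {
        month: round(value, 4)
        for month, value in analysis_avgs.items()
        if not lower <= value <= upper
    }


# Hypothetical transactions, not taken from reference.csv or analysis.csv.
reference_rows = [
    ("2018-01-05", 100.0), ("2018-01-20", 110.0),
    ("2018-02-03", 95.0), ("2018-02-18", 105.0),
    ("2018-03-10", 102.0), ("2018-03-25", 98.0),
]
analysis_rows = [
    ("2018-04-02", 101.0), ("2018-04-15", 99.0),
    ("2018-05-07", 150.0), ("2018-05-21", 170.0),
]

alerts = alerting_months(
    monthly_averages(reference_rows), monthly_averages(analysis_rows)
)
print(alerts)  # {'2018-05': 160.0}
```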

In [27]:
# Re-run this cell to install nannyml
!pip install nannyml

Defaulting to user installation because normal site-packages is not writeable


In [28]:
# Re-run this cell
# Import required libraries
import pandas as pd
import nannyml as nml
nml.disable_usage_logging()

In [29]:
reference = pd.read_csv("reference.csv")
analysis = pd.read_csv("analysis.csv")

In [30]:
reference.head(6)

Unnamed: 0,timestamp,time_since_login_min,transaction_amount,transaction_type,is_first_transaction,user_tenure_months,is_fraud,predicted_fraud_proba,predicted_fraud
0,2018-01-01 00:00:00.000,1.56175,3981.1,PAYMENT,False,0.31898,1.0,0.99,1
1,2018-01-01 00:08:43.152,1.658074,1267.9,PAYMENT,False,7.391323,0.0,0.07,0
2,2018-01-01 00:17:26.304,2.454287,1984.7,CASH-IN,False,0.781225,1.0,1.0,1
3,2018-01-01 00:26:09.456,2.392085,2265.2,CASH-OUT,False,0.680473,1.0,0.98,1
4,2018-01-01 00:34:52.608,2.189806,2126.8,CASH-IN,False,8.542895,1.0,0.99,1
5,2018-01-01 00:43:35.760,2.253766,1346.2,CASH-IN,False,2.535341,1.0,0.95,1


In [31]:
analysis.head(6)

Unnamed: 0,timestamp,time_since_login_min,transaction_amount,transaction_type,is_first_transaction,user_tenure_months,predicted_fraud_proba,predicted_fraud,is_fraud
0,2018-11-01 00:04:52.464,2.174243,2832.3,CASH-OUT,False,1.013445,0.97,1,1
1,2018-11-01 00:13:35.616,2.493543,1426.9,CASH-OUT,False,6.700041,0.09,0,0
2,2018-11-01 00:22:18.768,1.807432,1302.0,PAYMENT,False,6.291723,0.01,0,0
3,2018-11-01 00:31:01.920,2.133415,1432.1,PAYMENT,True,8.165503,0.0,0,0
4,2018-11-01 00:39:45.072,1.987827,1870.3,CASH-OUT,False,8.205203,0.03,0,0
5,2018-11-01 00:48:28.224,2.978838,1512.5,PAYMENT,False,9.49067,0.02,0,0


In [32]:
reference.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50207 entries, 0 to 50206
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   timestamp              50207 non-null  object 
 1   time_since_login_min   50207 non-null  float64
 2   transaction_amount     50207 non-null  float64
 3   transaction_type       47155 non-null  object 
 4   is_first_transaction   50207 non-null  bool   
 5   user_tenure_months     50207 non-null  float64
 6   is_fraud               50207 non-null  float64
 7   predicted_fraud_proba  50207 non-null  float64
 8   predicted_fraud        50207 non-null  int64  
dtypes: bool(1), float64(5), int64(1), object(2)
memory usage: 3.1+ MB


In [33]:
analysis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39967 entries, 0 to 39966
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   timestamp              39967 non-null  object 
 1   time_since_login_min   39967 non-null  float64
 2   transaction_amount     39967 non-null  float64
 3   transaction_type       37514 non-null  object 
 4   is_first_transaction   39967 non-null  bool   
 5   user_tenure_months     39967 non-null  float64
 6   predicted_fraud_proba  39967 non-null  float64
 7   predicted_fraud        39967 non-null  int64  
 8   is_fraud               39967 non-null  int64  
dtypes: bool(1), float64(4), int64(2), object(2)
memory usage: 2.5+ MB


In [34]:
## Identifying the months when both the estimated and realized accuracy of the model have alerts. Store the names of these months as lowercase strings in a list named months_with_performance_alerts.

# Get the estimated performance using CBPE algorithm
cbpe = nml.CBPE(
    timestamp_column_name="timestamp",
    y_true="is_fraud",
    y_pred="predicted_fraud",
    y_pred_proba="predicted_fraud_proba",
    problem_type="classification_binary",
    metrics=["accuracy"],
    chunk_period="m"
)

cbpe.fit(reference)
est_results = cbpe.estimate(analysis)

# Calculate the realized performance
calculator = nml.PerformanceCalculator(
    y_true="is_fraud",
    y_pred="predicted_fraud",
    y_pred_proba="predicted_fraud_proba",
    timestamp_column_name="timestamp",
    metrics=["accuracy"],
    chunk_period="m",
    problem_type="classification_binary",
)
calculator = calculator.fit(reference)
calc_results = calculator.calculate(analysis)

# Compare the results and find the months with alerts
est_results.compare(calc_results).plot().show()
months_with_performance_alerts = ["april_2019", "may_2019", "june_2019"]
print("months_with_performance_alerts:", months_with_performance_alerts)

## Determining which alerting feature has the strongest correlation with the model's realized performance. Store the name of this feature in a variable named highest_correlation_feature.

features = ["time_since_login_min", "transaction_amount",
            "transaction_type", "is_first_transaction", 
            "user_tenure_months"]

# Calculate the univariate drift results
udc = nml.UnivariateDriftCalculator(
    timestamp_column_name="timestamp",
    column_names=features,
    chunk_period="m",
    continuous_methods=["kolmogorov_smirnov"],
    categorical_methods=["chi2"]
)

udc.fit(reference)
udc_results = udc.calculate(analysis)

# Use the correlation ranker
ranker = nml.CorrelationRanker()
ranker.fit(
    calc_results.filter(period="reference"))

correlation_ranked_features = ranker.rank(udc_results, calc_results)

# Find the highest correlating feature
display(correlation_ranked_features)
highest_correlation_feature = "time_since_login_min"
print("highest_correlation_feature:", highest_correlation_feature)

## Use the summary average statistics calculator to find the monthly average transaction amounts and check whether any of them triggers an alert. Record the alerting value in a variable called alert_avg_transaction_amount.

# Calculate average monthly transactions
calc = nml.SummaryStatsAvgCalculator(
    column_names=["transaction_amount"],
    chunk_period="m",
    timestamp_column_name="timestamp",
)

calc.fit(reference)
stats_avg_results = calc.calculate(analysis)

# Find the month
stats_avg_results.plot().show()
alert_avg_transaction_amount = 3069.8184
print("alert_avg_transaction_amount:", alert_avg_transaction_amount)

months_with_performance_alerts: ['april_2019', 'may_2019', 'june_2019']


Unnamed: 0,column_name,pearsonr_correlation,pearsonr_pvalue,has_drifted,rank
0,time_since_login_min,0.952925,1.045775e-09,True,1
1,transaction_amount,0.626235,0.005427712,True,2
2,is_first_transaction,0.054255,0.8306916,True,3
3,user_tenure_months,-0.100547,0.6913911,True,4
4,transaction_type,-0.186569,0.4585328,True,5


highest_correlation_feature: time_since_login_min


alert_avg_transaction_amount: 3069.8184


# Answer to the bonus question

First, I recommend plotting the distributions of all features and analyzing them, using this command:
- `udc_results.filter(column_names=features).plot(kind="distribution")`

***Observations:***

- `time_since_login_min` - From April to June, transactions made within one minute of logging in vanished completely.
- `transaction_amount` - In May and June, a larger number of transactions appeared. Additionally, as you discovered in the third question, the average transaction value has increased and raised an alert.

***Possible explanation:***

Fraudsters may have noticed that early card transactions, when done right after logging in, often led to account blocking. As a result, they began waiting a bit longer before transferring money to their account to avoid detection. Furthermore, they tend to make a single larger transfer instead of many smaller ones, leading to an increase in the average transaction value.