# Anomaly Detection: Bank Marketing

## <span style="color:red">Problem</span>

A retail bank wants to sell a financial product (a bank term deposit) to its clients through a telemarketing campaign. Because a successful sale happens much more rarely than an unsuccessful one, a client accepting the offer is considered an 'anomaly' in the data sense. Using the given marketing dataset, your goal is to build a model that can detect which marketing campaigns would trigger a successful outcome from a client.


## <span style="color:red">Data</span>

**Original data source**: *S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*

**Pre-processed data source**: *Pang, Guansong, Chunhua Shen, and Anton van den Hengel. "Deep anomaly detection with deviation networks." In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 353-362. 2019.*

This is a dataset of direct marketing campaigns run by a Portuguese banking institution via phone calls, in which the rare successful campaign records, accounting for about 10% of the records, are considered anomalies. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed to or not.


### <span style="color:blue">Features</span>

The original dataset has been pre-processed: each of the categorical variables listed below has been transformed into binary variables (0/1 values), one per category (also known as one-hot encoding), and the numerical variables, such as age, have been normalized.
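
The pre-processing described above can be illustrated on a toy table. This is only a sketch: the three toy records are made up, and min-max scaling is an assumption, since the source does not say which normalization was applied or how the encoded columns are named in the provided files.

```python
import pandas as pd

# Toy records using a few of the raw feature names listed in this section.
raw = pd.DataFrame({
    "age": [25, 40, 55],
    "marital": ["single", "married", "divorced"],
    "contact": ["cellular", "telephone", "cellular"],
})

# One-hot encoding: each category becomes its own 0/1 column.
encoded = pd.get_dummies(raw, columns=["marital", "contact"], dtype=int)

# Min-max scaling to [0, 1] (assumed method, not confirmed by the source).
age_min, age_max = encoded["age"].min(), encoded["age"].max()
encoded["age"] = (encoded["age"] - age_min) / (age_max - age_min)

print(encoded)
```

Each categorical column is replaced by as many binary columns as it has categories, which is why the pre-processed files have many more columns than the feature list suggests.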

#### Bank client data

- **age** (numeric)


- **job**: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')


- **marital**: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)


- **education** (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')


- **default**: has credit in default? (categorical: 'no','yes','unknown')


- **housing**: has housing loan? (categorical: 'no','yes','unknown')


- **loan**: has personal loan? (categorical: 'no','yes','unknown')

#### Last contact of the current campaign

- **contact**: contact communication type (categorical: 'cellular','telephone')


- **month**: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')


- **day_of_week**: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')


- **duration**: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call has ended y is obviously known. This input should therefore only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
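
The note on **duration** suggests keeping two feature sets: one for benchmarking and one for a realistic model. A minimal sketch follows; the toy values are made up, and the column names `duration` and `class` are assumed to match those in the pre-processed CSV.

```python
import pandas as pd

# Toy frame; "duration" and "class" follow the names used in this section,
# but the actual column names in the pre-processed files are an assumption.
df = pd.DataFrame({
    "age": [0.2, 0.5, 0.9],
    "duration": [0.0, 0.3, 0.7],
    "class": [0, 0, 1],
})

# Benchmark feature set: keeps duration, which leaks the outcome.
X_benchmark = df.drop(columns=["class"])

# Realistic feature set: duration removed, since it is unknown before the call.
# errors="ignore" avoids a KeyError if the column is absent.
X_realistic = df.drop(columns=["class", "duration"], errors="ignore")

print(list(X_benchmark.columns), list(X_realistic.columns))
```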

#### Other attributes

- **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)


- **pdays**: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)


- **previous**: number of contacts performed before this campaign and for this client (numeric)


- **poutcome**: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

#### Social and economic context attributes

- **emp.var.rate**: employment variation rate - quarterly indicator (numeric)


- **cons.price.idx**: consumer price index - monthly indicator (numeric)


- **cons.conf.idx**: consumer confidence index - monthly indicator (numeric)


- **euribor3m**: euribor 3 month rate - daily indicator (numeric)


- **nr.employed**: number of employees - quarterly indicator (numeric)

### <span style="color:blue">Target variable</span>

- **class**: has the client subscribed to a term deposit? (binary: 1 (yes), 0 (no))


### <span style="color:blue">Train/Test sets</span>

The train set contains 80% of the original data and the test set contains 1000 records. Both sets have roughly the same rate of anomalies (i.e. successful marketing campaigns).
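
One way to check that both sets share a similar anomaly rate is to compute the fraction of records with `class` equal to 1. The sketch below uses made-up toy frames and a hypothetical helper `anomaly_rate`; the same call could be applied to the loaded train and test frames, assuming the target column is named `class`.

```python
import pandas as pd

def anomaly_rate(df, target="class"):
    """Return the fraction of records whose target equals 1."""
    return float((df[target] == 1).mean())

# Toy stand-ins for the train and test sets (values are illustrative only).
toy_train = pd.DataFrame({"class": [0] * 9 + [1]})
toy_test = pd.DataFrame({"class": [0] * 18 + [1] * 2})

print(anomaly_rate(toy_train), anomaly_rate(toy_test))
```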

## <span style="color:red">Before starting</span>

Given the problem and data:
- Which machine learning approach do you think is better suited to this problem: classification or regression?
- What range of values should your model be able to return?

Answer in the cell below.

## <span style="color:red">Coding starts here</span>

In [None]:
%pylab inline

### Import packages
More can be added here on top of the default ones if necessary.

In [None]:
import pandas as pd
import seaborn

**Import Training Data**

In [None]:
train_data = pd.read_csv('https://github.com/youtalspectra/spectra_ml_example/raw/master/data/anomaly_train.csv')

## Exploratory Data Analysis

Explore, pre-process and/or clean the data here. 

What is the type and/or range of values for each feature/variable? Are there any relationships or correlations between the different variables? Is any transformation of the data needed before fitting any model?
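
The questions above can be approached with a few pandas calls. This is a sketch on a made-up toy frame; the column names other than `class` are hypothetical, and the same calls could be run on `train_data` once it is loaded.

```python
import pandas as pd

# Toy frame standing in for the training data (values are illustrative only).
toy = pd.DataFrame({
    "age": [0.1, 0.4, 0.5, 0.9, 0.3, 0.7],
    "contact_cellular": [1, 0, 1, 1, 0, 1],
    "class": [0, 0, 0, 1, 0, 1],
})

# Type of each column.
print(toy.dtypes)

# Range and mean of each column.
summary = toy.describe().T[["min", "max", "mean"]]
print(summary)

# Pearson correlation of each feature with the target, strongest first.
target_corr = toy.corr()["class"].drop("class").sort_values(ascending=False)
print(target_corr)
```

Correlation with a binary target is only a rough first look; it does not capture non-linear effects or interactions between features.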

## Model fitting

Fit/optimize your model here, and get the model training score.
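
When choosing which score to report, keep in mind that with roughly 10% anomalies plain accuracy can be misleading. The sketch below is an illustration and not a prescribed metric for this exercise: the helper functions and toy labels are made up, and they show that a model that always predicts 0 reaches 90% accuracy while detecting no anomaly at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (anomaly) class, labelled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels with a 10% anomaly rate, and a model that always predicts 0.
y_true = [0] * 9 + [1]
always_no = [0] * 10

print(accuracy(y_true, always_no))
print(precision_recall(y_true, always_no))
```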

## Predictions

Make predictions on the following test set and get the model score here. Remember to apply the same pre-processing to the test set as was done on the training set!

In [None]:
test_data = pd.read_csv('https://github.com/youtalspectra/spectra_ml_example/raw/master/data/anomaly_test_1000.csv')
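
The reminder about applying the same pre-processing to both sets can be sketched as follows, assuming an extra scaling step was added during exploration. The toy values and the helper `scale` are hypothetical; the point is that the parameters are learned from the training set only and then reused unchanged on the test set.

```python
import pandas as pd

# Toy stand-ins for one numeric column of the train and test sets.
toy_train = pd.DataFrame({"campaign": [1.0, 3.0, 5.0]})
toy_test = pd.DataFrame({"campaign": [2.0, 6.0]})

# Learn the scaling parameters from the training set only.
col_min = toy_train["campaign"].min()
col_range = toy_train["campaign"].max() - col_min

def scale(df):
    """Apply the train-derived min-max scaling to a copy of df."""
    out = df.copy()
    out["campaign"] = (out["campaign"] - col_min) / col_range
    return out

train_scaled = scale(toy_train)
test_scaled = scale(toy_test)  # values may fall outside [0, 1]

print(train_scaled["campaign"].tolist(), test_scaled["campaign"].tolist())
```

Recomputing the minimum and maximum on the test set would give the two sets different scales and would let information from the test set influence the pre-processing.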