# The Business Goal

(Extracted from [the original statement](CRISP-DM-BANK.pdf))

An implementation of a data mining (DM) project based on the CRISP-DM methodology. Real-world data were collected from a Portuguese marketing campaign related to bank deposit subscription.

The business goal is to find a model that can explain the success of a contact, i.e. whether the client subscribes to the bank deposit. Such a model can increase campaign efficiency by identifying the main characteristics that affect success, helping to better manage the available resources (e.g. human effort, phone calls, time) and to select a high-quality and affordable set of potential buying customers.

# Results

The study described in The Procedure section below found that the **KNearestNeighbors (KNN) model** had the best score-speed performance compared to LogisticRegression, DecisionTreeClassifier and SVM on the sample dataset the bank collected from 17 campaigns conducted between May 2008 and November 2010.

Consequently, this Data Analysis team recommends using the KNN model to identify potential clients who are most likely to subscribe to the deposit offered by the bank in its future marketing campaigns.


![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

# The Procedure

For more detail, please refer to [this](prompt_III-SriramRao.ipynb) Jupyter notebook.

## Dataset Features

![image-3.png](attachment:image-3.png)

## Engineering Features

Next we engineer the features to clean and optimize them for model training and testing. 

1. **Nulls and Unknowns**: The data has no nulls. The placeholder value 'unknown', which is not a true category, is present only in small percentages.
2. **Target Encoding**: Columns job, marital and education each have many possible values, and One Hot Encoding them would cause an explosion of dimensions, which may become an issue with SVMs. For this reason, I will use a Target Encoder for these features.
3. **One Hot Encoding**: Features with yes/no-like values are best One Hot Encoded. I plan to use this encoding for default, housing, loan, contact and poutcome.
4. **Date Time Fields**: I will convert month and day of the week into datetime fields so they can be treated as numeric.
5. **Zero and One**: Finally, the target feature y will be converted to 0 for no and 1 for yes.
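
The steps above can be sketched on a few made-up rows. This is only an illustration, not the notebook's code: the rows are invented, the target encoding is a naive per-category mean of `y` (a real Target Encoder typically adds smoothing and cross-fitting), and month and weekday are mapped to numbers with explicit dictionaries instead of datetime fields. The names `job_te`, `month_num` and `weekday_num` are hypothetical.

```python
import pandas as pd

# Invented rows that reuse column names from the bank dataset.
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services", "technician", "admin."],
    "housing": ["yes", "no", "unknown", "yes", "no", "yes"],
    "month": ["may", "jun", "nov", "may", "aug", "nov"],
    "day_of_week": ["mon", "tue", "fri", "wed", "thu", "mon"],
    "y": ["no", "yes", "no", "no", "yes", "yes"],
})

# Step 1: share of the 'unknown' placeholder in one column.
unknown_share = (df["housing"] == "unknown").mean()

# Step 5: map the target to 0/1 (done early because step 2 needs it).
df["y"] = df["y"].map({"no": 0, "yes": 1})

# Step 2: naive target encoding, i.e. each job becomes the mean of y for that job.
# Assumption: a simplified stand-in for a proper Target Encoder.
df["job_te"] = df["job"].map(df.groupby("job")["y"].mean())

# Step 4: turn month and weekday names into numbers.
# Assumption: explicit mappings are used here instead of datetime fields.
month_to_num = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
weekday_to_num = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4}
df["month_num"] = df["month"].map(month_to_num)
df["weekday_num"] = df["day_of_week"].map(weekday_to_num)

# Step 3: one-hot encode a yes/no-like feature ('unknown' gets its own column).
encoded = pd.get_dummies(df, columns=["housing"], dtype=int)
```

On real data the target-encoding means should be learned from the training split only, otherwise information about the test labels leaks into the features.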

## Model Comparisons

Next, we aim to compare the performance of the Logistic Regression model to that of the KNN, Decision Tree, and SVM models.

For each model we use the default hyperparameters and also search for optimal hyperparameters using GridSearchCV. The best observations are documented below:
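
A minimal sketch of this kind of comparison is shown here. It assumes synthetic data from `make_classification` in place of the bank dataset, and the parameter grids are small illustrative guesses, not the grids used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the engineered bank features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Each entry pairs an estimator with a small, illustrative parameter grid.
models = {
    "LogisticRegression": (LogisticRegression(), {"C": [0.1, 1.0, 10.0]}),
    "KNeighborsClassifier": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "DecisionTreeClassifier": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None]},
    ),
    "SVC": (SVC(), {"C": [0.1, 1.0, 10.0]}),
}

results = {}
for name, (estimator, grid) in models.items():
    # Run 1: default hyperparameters, scored on the held-out test set.
    default_score = estimator.fit(X_train, y_train).score(X_test, y_test)

    # Run 2: cross-validated grid search on the training set; the refitted
    # best estimator is then scored on the same test set.
    search = GridSearchCV(estimator, grid, cv=3).fit(X_train, y_train)
    results[name] = {
        "default_test_accuracy": default_score,
        "grid_best_params": search.best_params_,
        "grid_test_accuracy": search.score(X_test, y_test),
    }

for name, row in results.items():
    print(name, row)
```

Scoring both runs on the same held-out test set keeps the default and tuned results directly comparable.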

### Default Hyperparameters

![image.png](attachment:image.png)

### Search-Optimized Hyperparameters

![image-2.png](attachment:image-2.png)


# Conclusion

Interestingly, the models with default parameters achieved better performance metrics than those tuned with grid-search hyperparameter optimization.

The training times of the default models, however, were noticeably higher than those of grid search (with the caveat that the default models were timed using process time, while the grid search timings were taken from the results of the pipe).
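
The timing caveat can be illustrated with a small sketch. It assumes the grid search timings come from `cv_results_["mean_fit_time"]` (the notebook may read them differently) and uses synthetic data. The two numbers measure different things: `time.process_time` reports CPU time for one fit on all rows, while `mean_fit_time` is a wall-clock average over cross-validation folds, each fitted on only part of the rows.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data in place of the bank dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Default model: CPU time of this process around a single fit on all rows.
start = time.process_time()
KNeighborsClassifier().fit(X, y)
default_cpu_seconds = time.process_time() - start

# Grid search: fit time of the best candidate, averaged over the CV folds.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5]}, cv=5
).fit(X, y)
grid_fit_seconds = search.cv_results_["mean_fit_time"][search.best_index_]

print(f"default fit (CPU time):      {default_cpu_seconds:.4f} s")
print(f"grid search fit (fold mean): {grid_fit_seconds:.4f} s")
```

Timing both runs with the same clock on the same training rows would make the speed comparison more reliable.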

Overall, **KNN** had the best score-speed performance on this banking dataset in both runs. Nearest-neighbor search is also the idea behind Vector Search, a technology used in the search industry to fetch similar results by querying data.