# Model Building
Here we choose and build a model that could work in the real world. This is the exploratory part; the working code that prepares the data and feeds it to the final model is in a separate file.

In [104]:
#importing main data analysis libraries  
import pandas as pd 
import numpy as np

#importing main visualization libraries
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

%matplotlib inline 

In [105]:
#reading prepared data
train = pd.read_csv('C:/Users/user/Projects/Health_Insurance_Sell_Analysis/prepared_data')

In [106]:
#split our dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train.drop(columns='response'),
    train['response'],
    test_size=0.2)

print(' x_train: ',X_train.shape, '\n',
      'y_train:',y_train.shape,'\n',
      'x_test:',X_test.shape,'\n',
      'y_test:',y_test.shape)

 x_train:  (295160, 7) 
 y_train: (295160,) 
 x_test: (73790, 7) 
 y_test: (73790,)
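
The split above is random and unseeded, so the class balance and the exact scores change from run to run. A minimal sketch of a reproducible, stratified split follows. It assumes synthetic data in place of the prepared dataset, and `random_state=42` is an arbitrary choice, not a value from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data: 1000 rows, 7 features, about 12% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))
y = (rng.random(1000) < 0.12).astype(int)

# stratify=y keeps the share of positives the same in both parts;
# random_state makes the split repeatable between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print('positive rate, train:', y_tr.mean())
print('positive rate, test: ', y_te.mean())
```

With a rare positive class, stratifying keeps the test-set recall and precision from depending on how many positives happened to land in the test part.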


## Training models

In this part we train our main candidate models. We can't say in advance which model will perform best, so we train several that could plausibly work.

### Random Forest Classifier

In [107]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()

rf_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### Extra Trees Classifier

In [108]:
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier()
et_clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

### Support Vector Classifier with RBF kernel

In [109]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

rbf_kernel_sv_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel = 'rbf'))
])

rbf_kernel_sv_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

### SGDClassifier 

In [110]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss = 'log')

sgd_clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

### KNeighbors Classifier

In [111]:
from sklearn.neighbors import KNeighborsClassifier

kn_clf = KNeighborsClassifier()

kn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

### Naive Bayes

In [112]:
from sklearn.naive_bayes import CategoricalNB

naive_bayes_clf = CategoricalNB()

naive_bayes_clf.fit(X_train, y_train)

CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)

### Saving models without tuning

In [113]:
from joblib import dump, load

clfs = [rf_clf, rbf_kernel_sv_clf, sgd_clf, kn_clf, et_clf, naive_bayes_clf]
names = ['rf_clf_wt.joblib', 'rbf_kernel_sv_clf_wt.joblib', 'sgd_clf_wt.joblib',
         'kn_clf_wt.joblib', 'et_clf_wt.joblib', 'naive_bayes_clf_wt.joblib']

# pair each classifier with its own file name; a nested loop over both lists
# would overwrite every file with the last classifier
for clf, name in zip(clfs, names):
    dump(clf, name)
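
A saved model is only useful if it loads back with the same behaviour. The sketch below checks a joblib round trip on a small synthetic dataset; the file name `demo_clf.joblib` and the data are hypothetical and only serve the illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.naive_bayes import GaussianNB

# Small synthetic dataset, used only for this check.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

clf = GaussianNB().fit(X, y)

# Write the model to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_clf.joblib')  # hypothetical file name
    dump(clf, path)
    restored = load(path)

# The restored model should predict exactly like the original one.
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```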

## Performance score
Here we check the performance of our models; the goal is to find the most effective one. We report ROC AUC, but recall has priority: we want our sales managers to reach as many as possible of the potential customers who would buy our insurance.

In [114]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

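def score(clfs):
    # print the main classification metrics on the test set for each fitted model
    for clf in clfs:
        y_pred = clf.predict(X_test)
        print(
            clf.__class__.__name__, '\n',
            'Accuracy score: ',accuracy_score(y_test, y_pred), '\n',
            # zero_division = 1: a model that predicts no positives reports precision 1.0
            'Precision score: ',precision_score(y_test, y_pred, zero_division = 1), '\n',
            'Recall score: ',recall_score(y_test, y_pred, zero_division = 1), '\n',
            'F1 score: ',f1_score(y_test, y_pred, zero_division = 1), '\n',
            # ROC AUC is computed here from hard 0/1 predictions, not from probabilities
            'ROC AUC score: ',roc_auc_score(y_test, y_pred), '\n',
        )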

In [115]:
clfs = [rf_clf,rbf_kernel_sv_clf, sgd_clf,kn_clf, naive_bayes_clf, et_clf]
score(clfs)

RandomForestClassifier 
 Accuracy score:  0.856687898089172 
 Precision score:  0.314643188137164 
 Recall score:  0.1513091922005571 
 F1 score:  0.20434880746369724 
 ROC AUC score:  0.5528358041539698 

Pipeline 
 Accuracy score:  0.8783710529882098 
 Precision score:  1.0 
 Recall score:  0.0 
 F1 score:  0.0 
 ROC AUC score:  0.5 

SGDClassifier 
 Accuracy score:  0.8783710529882098 
 Precision score:  1.0 
 Recall score:  0.0 
 F1 score:  0.0 
 ROC AUC score:  0.5 

KNeighborsClassifier 
 Accuracy score:  0.8565252744274292 
 Precision score:  0.3314512756168967 
 Recall score:  0.1766016713091922 
 F1 score:  0.2304281456712946 
 ROC AUC score:  0.5636383346903132 

CategoricalNB 
 Accuracy score:  0.7177666350453991 
 Precision score:  0.2847073356828834 
 Recall score:  0.8730919220055711 
 F1 score:  0.4293933914187079 
 ROC AUC score:  0.784675252061954 

ExtraTreesClassifier 
 Accuracy score:  0.8541672313321589 
 Precision score:  0.308861301369863 
 Recall score:  0.16077
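
The ROC AUC values above come from hard 0/1 predictions, which is why a model that predicts only class 0 scores exactly 0.5. ROC AUC is normally computed from continuous scores (`predict_proba`, or `decision_function` for models such as `SVC` that have no probabilities by default). The sketch below compares both on synthetic data; `LogisticRegression` is a stand-in model, not one of the classifiers trained in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 12% positives, 7 features.
X, y = make_classification(n_samples=2000, n_features=7,
                           weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC from hard labels: only one threshold (0.5) is evaluated.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# ROC AUC from probabilities of the positive class: all thresholds are evaluated.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# A constant prediction carries no ranking information and always gives 0.5.
auc_all_zero = roc_auc_score(y_te, np.zeros_like(y_te))

print(auc_from_labels, auc_from_scores, auc_all_zero)
```

Under this view, a ROC AUC of 0.5 in the table above says that the model never predicted the positive class at the default threshold, not necessarily that its scores rank clients at random.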

## Model Choosing

After scoring we can see how our models performed. In my opinion the next models to tune are KNeighbors and the Extra Trees Classifier, which look the most promising here; we tune them in the next step. Naive Bayes also looks promising, but we will come back to it only if the others don't work well.

These are not our final models, though: later we will also try ensembles with Naive Bayes, which works very well on our data.


## Model Tuning

## Extra Trees Classifier

In [23]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

### Randomized Search
Here we first explore a wide range of parameters, then move on to a narrower, more detailed search.

In [6]:
rscv_et_clf = ExtraTreesClassifier()

In [7]:
n_estimators = [32, 64, 128, 256, 512]
max_features = ['auto', 'sqrt', 'log2']
max_depth = [32, 64, 128, 256, 512]
min_samples_leaf = [1, 2, 5, 10, 15, 20, 25, 30]
criterion = ['entropy', 'gini']
bootstrap = [True, False]

# candidate values sampled by the randomized search
random_param_et_clf = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'criterion' : criterion,
    'bootstrap': bootstrap,
    'max_depth': max_depth,
    'min_samples_leaf': min_samples_leaf,
}

In [8]:
rscv_et_clf_f = RandomizedSearchCV(estimator= rscv_et_clf, param_distributions= random_param_et_clf,
                                   cv = 5, scoring= 'recall')

In [9]:
rscv_et_clf_f.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=ExtraTreesClassifier(bootstrap=False,
                                                  ccp_alpha=0.0,
                                                  class_weight=None,
                                                  criterion='gini',
                                                  max_depth=None,
                                                  max_features='auto',
                                                  max_leaf_nodes=None,
                                                  max_samples=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                        

In [10]:
rscv_et_clf_f.best_estimator_

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=128, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=512,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
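
The best parameters do not have to be copied by hand from the printed estimator. A sketch, on synthetic data and with a much smaller search space than the one above: `best_estimator_` is already refit on the whole training set (the default `refit=True`), and `best_params_` can be unpacked into a fresh estimator. The parameter values and `random_state=0` here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the search finishes quickly.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

param_distributions = {
    'n_estimators': [10, 20, 40],
    'max_depth': [4, 8, 16],
    'min_samples_leaf': [1, 2, 5],
}

# n_iter sets how many random combinations are tried; random_state makes
# the sampled combinations repeatable.
search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring='recall', random_state=0)
search.fit(X, y)

print(search.best_params_)

# Already fitted on all of X, y with the best parameters.
best_model = search.best_estimator_

# An unfitted copy with the same tuned parameters, for retraining elsewhere.
rebuilt = ExtraTreesClassifier(random_state=0, **search.best_params_)
```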

In [14]:
rscv_et_clf_2 = ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=128, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=512,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

In [15]:
rscv_et_clf_2.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=128, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=512,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

In [16]:
score([rscv_et_clf_2])

# accuracy improved slightly, but that is not the main point: the randomized search fixed
# max_features and bootstrap, so the grid search can concentrate on the remaining parameters

ExtraTreesClassifier 
 Accuracy score:  0.8551836292180512 
 Precision score:  0.32345894140710085 
 Recall score:  0.16381687810259238 
 F1 score:  0.21748681898066782 
 ROC AUC score:  0.5579146190435712 



### Grid Search

In [17]:
n_estimators = [128, 256, 512]
max_features = ['sqrt']
max_depth = [64, 128, 256]
min_samples_leaf = [1, 2, 4]
bootstrap = [False]
criterion = ['entropy']

# narrower grid around the best randomized-search result
grid_param_et_clf = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'criterion' : criterion,
    'bootstrap': bootstrap,
    'max_depth': max_depth,
    'min_samples_leaf': min_samples_leaf,
}

In [18]:
gscv_et_clf = ExtraTreesClassifier()

In [19]:
gscv_et_clf_f =  GridSearchCV(estimator= gscv_et_clf, param_grid= grid_param_et_clf, cv = 3, scoring= 'recall')


In [20]:
gscv_et_clf_f.fit(X_train, y_train)

GridSearchCV(cv=3, error_score=nan,
             estimator=ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0,
                                            class_weight=None, criterion='gini',
                                            max_depth=None, max_features='auto',
                                            max_leaf_nodes=None,
                                            max_samples=None,
                                            min_impurity_decrease=0.0,
                                            min_impurity_split=None,
                                            min_samples_leaf=1,
                                            min_samples_split=2,
                                            min_weight_fraction_leaf=0.0,
                                            n_estimators=100, n_jobs=None,
                                            oob_score=False, random_state=None,
                                            verbose=0, warm_start=False),
             iid='deprecate

In [21]:
gscv_et_clf_f.best_estimator_

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=128, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=128,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

In [22]:
score([gscv_et_clf_f])

GridSearchCV 
 Accuracy score:  0.8550481094999323 
 Precision score:  0.3215926493108729 
 Recall score:  0.16216216216216217 
 F1 score:  0.21560574948665298 
 ROC AUC score:  0.557125886025075 



### Extra Tree Conclusion

As we can see, tuning did not make this model effective, but we won't give up: there are many other classifiers left to tune and try.

## KNeighbors Classifier
Here we explore our next classifier, which looks very promising. The model trains quickly and has few parameters, so we run a single search over them and see how it works (the cell below uses RandomizedSearchCV with its default 10 iterations). This time we optimise for ROC AUC, which might give better results.

In [24]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [26]:
from sklearn.neighbors import KNeighborsClassifier

scv_knn_clf = KNeighborsClassifier()

In [31]:
leaf_size = list(range(1,30))
n_neighbors = list(range(1,30))
p = [1,2]

random_param_knn_clf = {
    'p': p,
    'n_neighbors': n_neighbors,
    'leaf_size' : leaf_size
}

In [33]:
gscv_knn_clf_f =  RandomizedSearchCV(estimator= scv_knn_clf, param_distributions = random_param_knn_clf, cv = 3, scoring= 'roc_auc')

In [34]:
gscv_knn_clf_f.fit(X_train, y_train)

RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=KNeighborsClassifier(algorithm='auto',
                                                  leaf_size=30,
                                                  metric='minkowski',
                                                  metric_params=None,
                                                  n_jobs=None, n_neighbors=5,
                                                  p=2, weights='uniform'),
                   iid='deprecated', n_iter=10, n_jobs=None,
                   param_distributions={'leaf_size': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29],
                                        'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8,
              

In [35]:
gscv_knn_clf_f.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=29, p=2,
                     weights='uniform')

In [36]:
gscv_knn_clf_s = KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=29, p=2,
                     weights='uniform')

In [37]:
gscv_knn_clf_s.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=29, p=2,
                     weights='uniform')

In [38]:
score([gscv_knn_clf_s])

KNeighborsClassifier 
 Accuracy score:  0.8739531101775309 
 Precision score:  0.3805668016194332 
 Recall score:  0.041478212906784336 
 F1 score:  0.07480354123147319 
 ROC AUC score:  0.5160114123630097 



### Conclusion
Compared with the untuned KNeighbors model, accuracy improved by about 2 percentage points and precision by about 5, but recall dropped sharply (from about 0.18 to 0.04). Unfortunately this model did not work well, so we drop it and keep looking for better results.
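
KNeighbors is distance-based, so if the prepared features are on different scales, the largest one dominates the neighbour search. One option not tried above is to put a scaler and the classifier into one pipeline and search over both together. The sketch below assumes synthetic data and a small illustrative grid; the step names `scaler` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale than the others.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
X[:, 0] *= 1000.0

# The scaler is fitted inside each CV fold, so no information leaks
# from the validation part into the scaling.
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'knn__n_neighbors': [5, 15, 29],
    'knn__p': [1, 2],
}

grid = GridSearchCV(knn_pipe, param_grid=param_grid, cv=3, scoring='roc_auc')
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```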

## Ensemble models
Here we test some of our models as an ensemble; in my opinion Naive Bayes and SGDClassifier together could give a better result.

## Voting Classifier

### Hard Voting

In [41]:
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [48]:
vt_catnb_clf_f  = CategoricalNB()
vt_rfc_clf_f  = RandomForestClassifier()
vt_sgd_clf_f  = SGDClassifier(loss = 'log')
vt_knn_clf_f  = KNeighborsClassifier()

In [52]:
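# note: the 'rf' and 'nb' labels are swapped relative to the estimators they name;
# the labels are only names and do not change how the votes are counted
voting_clf_f = VotingClassifier(
    estimators = [('rf', vt_catnb_clf_f), ('nb', vt_rfc_clf_f), ('sgd', vt_sgd_clf_f), ('knn', vt_knn_clf_f)],
    voting = 'hard'
)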

In [53]:
voting_clf_f.fit(X_train, y_train)

VotingClassifier(estimators=[('rf',
                              CategoricalNB(alpha=1.0, class_prior=None,
                                            fit_prior=True)),
                             ('nb',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_le

In [54]:
score([voting_clf_f])

VotingClassifier 
 Accuracy score:  0.8694809594796042 
 Precision score:  0.3569261880687563 
 Recall score:  0.07788196359624931 
 F1 score:  0.1278638051254188 
 ROC AUC score:  0.5291147940808593 



### Soft voting with params

In [55]:
vt_catnb_clf_s  = CategoricalNB()
vt_rfc_clf_s  = RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=128, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=128,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
vt_sgd_clf_s  = SGDClassifier(loss = 'log')
vt_knn_clf_s  = KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=29, p=2,
                     weights='uniform')

In [67]:
voting_clf_s = VotingClassifier(
    estimators = [('nb', vt_catnb_clf_s), ('sgd', vt_sgd_clf_s)],
    voting = 'soft'
)

In [68]:
voting_clf_s.fit(X_train, y_train)

VotingClassifier(estimators=[('nb',
                              CategoricalNB(alpha=1.0, class_prior=None,
                                            fit_prior=True)),
                             ('sgd',
                              SGDClassifier(alpha=0.0001, average=False,
                                            class_weight=None,
                                            early_stopping=False, epsilon=0.1,
                                            eta0=0.0, fit_intercept=True,
                                            l1_ratio=0.15,
                                            learning_rate='optimal', loss='log',
                                            max_iter=1000, n_iter_no_change=5,
                                            n_jobs=None, penalty='l2',
                                            power_t=0.5, random_state=None,
                                            shuffle=True, tol=0.001,
                                            validation_fraction=0.1,

In [69]:
score([voting_clf_s])

VotingClassifier 
 Accuracy score:  0.7285946605231061 
 Precision score:  0.2934815373021854 
 Recall score:  0.8592388306674021 
 F1 score:  0.4375228198286757 
 ROC AUC score:  0.7847681213978185 
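
Soft voting averages the class probabilities of its members and predicts the class with the highest average, whereas hard voting counts predicted labels. The sketch below checks this on synthetic data. `LogisticRegression` stands in for `SGDClassifier(loss = 'log')` because the name of that loss differs between scikit-learn versions, and `GaussianNB` stands in for `CategoricalNB`; both substitutions are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

voting = VotingClassifier(
    estimators=[('nb', GaussianNB()),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft')
voting.fit(X, y)

# estimators_ holds the fitted clones; average their probabilities by hand.
manual_avg = np.mean(
    [est.predict_proba(X) for est in voting.estimators_], axis=0)

# With no weights given, the ensemble probability is this plain average.
matches = np.allclose(manual_avg, voting.predict_proba(X))
print(matches)
```

Because the probabilities are averaged, a member that is very confident can dominate the result, which is one possible reason a two-model soft vote ends up close to its stronger member.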



### Conclusion
Unfortunately the voting ensembles did not bring a big improvement, so we try another approach: AdaBoost, which may work well with our Naive Bayes.

## AdaBoost

In [71]:
ab_nb_clf = CategoricalNB()

In [72]:
from sklearn.ensemble import AdaBoostClassifier

In [73]:
ab_clf_f = AdaBoostClassifier(base_estimator= ab_nb_clf, n_estimators= 50)

In [74]:
ab_clf_f.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=CategoricalNB(alpha=1.0, class_prior=None,
                                                fit_prior=True),
                   learning_rate=1.0, n_estimators=50, random_state=None)

In [75]:
score([ab_clf_f])

AdaBoostClassifier 
 Accuracy score:  0.8771513755251389 
 Precision score:  1.0 
 Recall score:  0.0 
 F1 score:  0.0 
 ROC AUC score:  0.5 



## Ensemble conclusion
As we can see, our ensemble and grid search methods didn't work well, but we haven't lost faith: our best model so far is Naive Bayes, and we can try other Naive Bayes variants, not only the categorical one.

## Naive Bayes 

In [100]:
from sklearn.naive_bayes import ComplementNB, GaussianNB, BernoulliNB

brn_nb_clf = BernoulliNB()
compl_nb_clf = ComplementNB()
gaus_nb_clf = GaussianNB()

compl_nb_clf.fit(X_train, y_train)
gaus_nb_clf.fit(X_train, y_train)
brn_nb_clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [102]:
score([compl_nb_clf, gaus_nb_clf, brn_nb_clf])

ComplementNB 
 Accuracy score:  0.6260062339070335 
 Precision score:  0.24468202358646535 
 Recall score:  0.9795918367346939 
 F1 score:  0.3915602883788611 
 ROC AUC score:  0.778038483064141 

GaussianNB 
 Accuracy score:  0.63875863938203 
 Precision score:  0.25081451681446015 
 Recall score:  0.9766133480419195 
 F1 score:  0.39912537757540234 
 ROC AUC score:  0.7840270293705156 

BernoulliNB 
 Accuracy score:  0.6461986719067624 
 Precision score:  0.25431059339138457 
 Recall score:  0.972972972972973 
 F1 score:  0.40322764989599286 
 ROC AUC score:  0.7867027862161118 



## Comparison of Naive Bayes models


As we can see, these models give good results. In my opinion we should use GaussianNB, because it can be trained online: we can send new data to it quickly, without collecting a large dataset before retraining.
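
Online training with GaussianNB goes through `partial_fit`, which updates the per-class means and variances batch by batch. A minimal sketch on synthetic data, assuming batches of 250 rows; the data and the batch size are illustrative, not taken from this project.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data arriving in four batches of 250 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

online_clf = GaussianNB()
for start in range(0, len(X), 250):
    batch = slice(start, start + 250)
    # classes is required on the first call, because a single batch
    # may not contain every class; repeating it later is allowed.
    online_clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

# For comparison: the same model fitted on all rows at once.
batch_clf = GaussianNB().fit(X, y)

# The per-class feature means agree up to floating-point error.
same_means = np.allclose(online_clf.theta_, batch_clf.theta_)
print(same_means)
```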

# Conclusion

We tried a lot of models, and the best is Gaussian Naive Bayes. We should improve it later so that it runs faster. All model files can be found in the 'models' folder, in a rar archive. The main model is saved alongside this file.

What can we say in conclusion of our project?

This was an interesting experience. We analysed our data step by step and got a working model that could be used in real-world practice. In my opinion we reached the goal set at the beginning. Our managers will make some wasted calls, but they reach about 97% of all potential clients at about 25% precision, so roughly one call in four goes to a client who would actually buy.
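
The trade-off can be made concrete with a little arithmetic. The sketch below uses the GaussianNB recall and precision printed earlier in this notebook and the class-1 support of the test split (9065 of 73790 rows); the variable names are only illustrative.

```python
# Values copied from this notebook's outputs.
positives = 9065                   # clients in the test split who would buy
recall = 0.9766133480419195        # GaussianNB recall
precision = 0.25081451681446015    # GaussianNB precision

reached = round(positives * recall)   # buyers the model flags (true positives)
calls = round(reached / precision)    # all clients flagged for a call
missed = positives - reached          # buyers who are never called
wasted = calls - reached              # calls to clients who will not buy

print('calls:', calls)
print('buyers reached:', reached)
print('buyers missed:', missed)
print('wasted calls:', wasted)
```

Under these numbers the managers call fewer than half of the 73790 clients in the test split and still miss only a few hundred of the buyers.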

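The classification report and the code that saves the model are below. Note that the report was produced in an earlier cell (In [89]) than the Naive Bayes comparison (In [102]), and its class-1 recall (0.88) and precision (0.29) differ from the GaussianNB scores quoted above.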

In [89]:
from sklearn.metrics import classification_report

final_clf = GaussianNB()
final_clf.fit(X_train, y_train)


predicted = final_clf.predict(X_test)
print(classification_report(y_test, predicted))

dump(final_clf, 'final_clf.joblib')

              precision    recall  f1-score   support

           0       0.98      0.70      0.81     64725
           1       0.29      0.88      0.44      9065

    accuracy                           0.72     73790
   macro avg       0.63      0.79      0.63     73790
weighted avg       0.89      0.72      0.77     73790

