# Introduction

Context
Using Watson Analytics, you can predict behavior to retain your customers. You can analyze all relevant customer data and develop focused customer retention programs.

Inspiration
Understand customer demographics and buying behavior. Use predictive analytics to analyze the most profitable customers and how they interact. Take targeted actions to increase profitable customer response, retention, and growth.

<font color = 'blue' >
Content:

1. [Load and Check Data](#1)
2. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable Analysis](#4)
        * [Numerical Variable Analysis](#5)
3. [Basic Data Analysis](#6)
4. [Outlier Detection](#7)
5. [Modeling](#29)
     * [train_test_split](#30)
     * [Simple Logistic Regression](#31)
     * [Hyperparameter Tuning -- Grid Search -- Cross Validation](#32)
     * [Ensemble Modeling](#33)
     * [Prediction and Submission](#34)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# linear algebra
import numpy as np
# data processing, CSV file I/O (e.g. pd.read_csv)
import pandas as pd
#plt.style.use("seaborn-whitegrid")
import matplotlib.pyplot as plt
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


def annot_plot(ax,w,h):                                    # function to add data to plot
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for p in ax.patches:
         ax.annotate(f"{p.get_height() * 100 / df_watson.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='black', rotation=0, xytext=(0, 10),
         textcoords='offset points')             
def annot_plot_num(ax,w,h):                                    # function to add data to plot
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for p in ax.patches:
        ax.annotate('{0:.1f}'.format(p.get_height()), (p.get_x()+w, p.get_height()+h))

In [None]:
plt.style.available

<a id="1"></a>
# 1. Load and Check Data

In [None]:
df_watson = pd.read_csv("/kaggle/input/ibm-watson-marketing-customer-value-data/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv")

In [None]:
df1=df_watson["Customer"]

In [None]:
df_watson.info()

In [None]:
df_watson.columns

In [None]:
df_watson.head()

In [None]:
df_watson.describe()

In [None]:
df_watson.shape

In [None]:
df_watson.Response = df_watson.Response.apply(lambda X : 0 if X == 'No' else 1)

In [None]:
df_watson.head()

In [None]:
total = df_watson.isnull().sum().sort_values(ascending=False)
percent_1 = df_watson.isnull().sum()/df_watson.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(24)

<a id="2"></a>
# 2. Variable Description

1. Customer
2. State
3. Customer Lifetime Value: In marketing, customer lifetime value (CLV) is a metric that represents the total net profit a company makes from any given customer. CLV is a projection to estimate a customer's monetary worth to a business after factoring in the value of the relationship with a customer over time. The most basic way to determine CLV is to add up the revenue earned from a customer (annual revenue multiplied by the average customer lifespan) minus the initial cost of acquiring them.
4. Response: YES/NO
5. Coverage: Basic, Extended, Premium
6. Education : Education level refers to the years of formal instruction received and successfully completed, usually based on passing formal exams.
7. Effective To Date
8. EmploymentStatus :Employment status is the status of a worker in a company on the basis of the contract of work or duration of work done.
9. Gender
10. Income
11. Location Code: Urban, Suburban, Rural
12. Marital Status: Single, Married, Divorced
13. Monthly Premium Auto
14. Months Since Last Claim: Months elapsed since the last claim
15. Months Since Policy Inception: Months elapsed since the policy started
16. Number of Open Complaints: Count of complaints that are still open
17. Number of Policies
18. Policy Type
19. Policy
20. Renew Offer Type
21. Sales Channel
22. Total Claim Amount: Total amount claimed
23. Vehicle Class
24. Vehicle Size
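
The basic CLV calculation described in item 3 can be sketched as follows. This is a minimal illustration, not how the dataset's Customer Lifetime Value column was computed; the function name `basic_clv` and all the numbers are hypothetical.

```python
def basic_clv(annual_revenue, lifespan_years, acquisition_cost):
    """Basic CLV: revenue over the customer's lifespan minus the acquisition cost."""
    return annual_revenue * lifespan_years - acquisition_cost


# Hypothetical customer: 1200 per year for 5 years, 800 to acquire
example_clv = basic_clv(annual_revenue=1200, lifespan_years=5, acquisition_cost=800)
print(example_clv)  # 5200
```

More refined CLV models also discount future revenue and account for churn, which this sketch leaves out.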

In [None]:
df_watson.info()

* float64(2):  Customer Lifetime Value, Total Claim Amount 
* int64(6): Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies     
* object(16): Customer, State, Response, Coverage, Education, Effective To Date, EmploymentStatus, Gender, Location Code, Marital Status, Policy Type, Policy, Renew Offer Type, Sales Channel, Vehicle Class, Vehicle Size. These counts describe the raw CSV; after the 0/1 recoding above, Response is reported as int64.

 <a id="3"></a>
# Univariate Variable Analysis
* **Categorical Variable:** State, Response, Coverage, Education, Effective To Date, EmploymentStatus, Gender, Location Code, Marital Status, Policy Type, Policy, Renew Offer Type, Sales Channel, Vehicle Class, Vehicle Size. Customer is also stored as text, but it is a unique identifier rather than a category.
* **Numerical Variable:** Customer Lifetime Value, Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies, Total Claim Amount
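
The split between categorical and numerical variables can also be derived from the column dtypes instead of being typed by hand. The sketch below uses a small toy frame (`toy`, with made-up values) rather than `df_watson`, and treats every non-numeric column as categorical, which is an assumption that would also sweep in identifier columns such as Customer.

```python
import pandas as pd

# Toy frame with made-up values; it only mimics a few of the dataset's column names
toy = pd.DataFrame({
    "State": ["California", "Oregon"],
    "Income": [50000, 0],
    "Total Claim Amount": [250.5, 900.0],
})

# Numeric dtypes (int, float) are treated as numerical variables
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```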

In [None]:
def bar_plot(variable):
    """
        input: variable ex:"Gender"
        output: bar plot & value count
    """
    #get feature
    var = df_watson[variable]
    #count number of categorical variable (value/sample)
    varValue=var.value_counts()
    
    #visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))

In [None]:
category1=["State", "Response", "Coverage", "Education", "EmploymentStatus", "Gender", "Location Code", "Marital Status", "Policy Type", "Renew Offer Type", "Sales Channel", "Vehicle Size"]
for c in category1:
    bar_plot(c)

* California and Oregon have more customers than the other states
* The overwhelming majority of responses are No
* More than half of the customers have Basic coverage
* Very few customers have a Master or Doctor degree
* Most customers are employed
* The numbers of men and women are almost equal
* Suburban customers outnumber those in the other location codes
* Almost half of the customers are married
* The policy type is usually Personal Auto
* Renew Offer Type counts, from largest to smallest: Offer1, Offer2, Offer3, Offer4
* Sales Channel counts, from largest to smallest: Agent, Branch, Call Center, Web
* Vehicle Size is usually Medsize


In [None]:
category2 = ["Effective To Date","Policy","Vehicle Class"]
for c in category2:
    print("{} \n".format(df_watson[c].value_counts()))

<a id="5"></a>
## Numerical Variable

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(df_watson[variable], bins=50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
# "Customer" is a text identifier, not a numeric feature, so it is left out of the histograms
numericVar=["Customer Lifetime Value", "Income", "Monthly Premium Auto", "Months Since Last Claim", "Months Since Policy Inception", "Number of Open Complaints", "Number of Policies", "Total Claim Amount"]
for n in numericVar:
    plot_hist(n)

* Customer Lifetime Value is concentrated between 2000 and 10000
* Monthly Premium Auto is concentrated between 60 and 140
* Number of Open Complaints is largely 0
* Total Claim Amount ranges from 0 to 1000

<a id = "6"></a><br>
# 3. Basic Data Analysis

# a) Response Rate:

In [None]:
ax = sns.countplot(x = 'Response', data = df_watson)
plt.ylabel('Total number of Response')
annot_plot(ax, 0.08,1)
plt.show()

Approximately 14.32% of customers respond to marketing calls and the remaining 85.68% do not, so non-responders are the clear majority.
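
The percentages on the plot can also be computed directly. The sketch below is a minimal illustration on a made-up 0/1 column (`toy`), not the real data; on the notebook's table the same two expressions would be applied to `df_watson["Response"]` after the recoding step.

```python
import pandas as pd

# Made-up stand-in for the recoded Response column (1 = Yes, 0 = No)
toy = pd.DataFrame({"Response": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# Share of each class, in percent
response_share = toy["Response"].value_counts(normalize=True).mul(100).round(2)
print(response_share)

# Because the column is coded 0/1, its mean is the response rate
response_rate = toy["Response"].mean() * 100
print("Response rate: {:.2f}%".format(response_rate))
```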

In [None]:
#Average response of Male and Female
# Gender vs Response
df_watson[["Gender", "Response"]].groupby(["Gender"], as_index = False).mean().sort_values(by="Response", ascending=False)

The response rates of men and women to marketing calls are almost equal.

In [None]:
def plot_response_by(var):
    # Response counts split by the categories of `var`; named differently from the
    # earlier histogram helper plot_hist so that helper is not overwritten
    ax = sns.countplot(x = 'Response', hue = var, data = df_watson)
    plt.ylabel('Total number of Response')
    annot_plot(ax,0.08,1)
    plt.show()

category1=["State", "Response", "Coverage", "Education", "EmploymentStatus", "Gender", "Location Code", "Marital Status", "Policy Type", "Renew Offer Type", "Sales Channel", "Vehicle Size"]

for n in category1:
    plot_response_by(n)

In [None]:
g = sns.FacetGrid(df_watson, col = "Response")
g.map(sns.distplot, "Total Claim Amount", bins = 25)
plt.show()

In [None]:
g = sns.FacetGrid(df_watson, col = "Response")
g.map(sns.distplot, "Customer Lifetime Value", bins = 25)
plt.show()

In [None]:
g = sns.FacetGrid(df_watson, col = "Response")
g.map(sns.distplot, "Income", bins = 25)
plt.show()

Notice that men and women respond to marketing calls at almost the same rate.

In [None]:
# Marital Status vs Response
df_watson[["Marital Status", "Response"]].groupby(["Marital Status"], as_index = False).mean().sort_values(by="Response", ascending=False)

Divorced customers have the highest YES response rate.

Notice that married customers account for about 8 of the roughly 14 percentage points of customers who responded to marketing calls.

# b) Response rate by renew offer

In [None]:
# Renew Offer Type vs Response
df_watson[["Renew Offer Type", "Response"]].groupby(["Renew Offer Type"], as_index = False).mean().sort_values(by="Response", ascending=False)

Customers responded to marketing calls for Offer1 and Offer2, but almost nobody responded for Offer3 and Offer4.

# c) Response rate by Education

In [None]:
# Education vs Response
df_watson[["Education", "Response"]].groupby(["Education"], as_index = False).mean().sort_values(by="Response", ascending=False)

Notice that very few of the customers who responded to marketing calls hold a Doctor or Master degree; they may be uninterested or busy. It may also be that younger people are more likely to respond, though the dataset has no age column to confirm this.

# d) Response rate by Sales Channel

In [None]:
# Sales Channel vs Response
df_watson[["Sales Channel", "Response"]].groupby(["Sales Channel"], as_index = False).mean().sort_values(by="Response", ascending=False)

# e) Response rate by Total Claim Amount

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(y = 'Total Claim Amount' , x = 'Response', data = df_watson)
plt.ylabel('Total Claim Amount')
plt.show()

Box plots are a great way to visualize the distribution of continuous variables. They show the minimum, maximum, first quartile, median and third quartile, all in one view. The central rectangle spans from the first quartile to the third quartile, and the line inside it shows the median. The lower and upper whisker ends show the minimum and the maximum of each distribution, excluding suspected outliers.

The dots above the upper boundary line show the suspected outliers, which are identified using the interquartile range (IQR). Points that fall more than 1.5*IQR above the third quartile or more than 1.5*IQR below the first quartile are suspected outliers and are drawn as dots.
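
The 1.5*IQR rule described above can be sketched in a few lines. The array `values` is made up for illustration, and the fences are computed with NumPy's default linear-interpolation percentiles, which may differ slightly from the quartile method a plotting library uses.

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles, then the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond these fences are the suspected outliers drawn as dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
suspected = values[(values < lower_fence) | (values > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print("Suspected outliers:", suspected)
```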

# f) Response rate by Income Distributions

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(y = 'Income' , x = 'Response', data = df_watson)
plt.show()

# g) Response rate by EmploymentStatus

In [None]:
# EmploymentStatus vs Response
df_watson[["EmploymentStatus", "Response"]].groupby(["EmploymentStatus"], as_index = False).mean().sort_values(by="Response", ascending=False)

In [None]:
plt.figure(figsize=(10,6))
ax = sns.countplot(x = 'Response', hue = 'EmploymentStatus', data = df_watson)
plt.ylabel('Total number of Response')
annot_plot(ax, 0.08,1)
plt.show()

# h) Response rate by Vehicle Class

In [None]:
# Vehicle Class vs Response
df_watson[["Vehicle Class", "Response"]].groupby(["Vehicle Class"], as_index = False).mean().sort_values(by="Response", ascending=False)

In [None]:
plt.figure(figsize=(10,6))
ax = sns.countplot(x = 'Response', hue = 'Vehicle Class', data = df_watson)
plt.ylabel('Total number of Response')
annot_plot(ax, 0.08,1)
plt.show()

# i) Response rate by Policy

In [None]:
# Policy vs Response
df_watson[["Policy", "Response"]].groupby(["Policy"], as_index = False).mean().sort_values(by="Response", ascending=False)

In [None]:
plt.figure(figsize=(15,6))
ax = sns.countplot(x = 'Response', hue = 'Policy', data = df_watson)
plt.ylabel('Total number of Response')
annot_plot(ax, 0.08,1)
plt.show()

<a id = "7"></a><br>
# 4. Outlier Detection

In [None]:
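def detect_outliers(df, features, min_count=2):
    """Return indices of rows flagged as IQR outliers in more than `min_count` features."""
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
        
    # count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    # with the default min_count=2, a row must be an outlier in at least three features,
    # so at least three features have to be passed for any row to qualify
    multiple_outliers = [i for i, v in outlier_indices.items() if v > min_count]
    
    return multiple_outliers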

In [None]:
df_watson.loc[detect_outliers(df_watson,["Total Claim Amount","Income"])]


In [None]:
df_watson = df_watson.drop(['Customer','Effective To Date','Gender','Policy','Vehicle Class'], axis = 1)


In [None]:
df_watson["State"] = [0 if i == "California" else 1 if i == "Oregon"
                     else 2 if i == "Arizona" else 3 if i == "Nevada" else 4 for i in df_watson["State"]]

In [None]:
sns.countplot(x= "State",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

In [None]:
g=sns.catplot(x="State", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["California", "Oregon","Arizona","Nevada","Washington"])
g.set_ylabels("State Response Rate")
plt.show()

In [None]:
df_watson = pd.get_dummies(df_watson,columns = ["State"])
df_watson.head()

In [None]:
df_watson["Coverage"] = [0 if i == "Basic" else 1 if i == "Extended"
                      else 2 for i in df_watson["Coverage"]]

sns.countplot(x= "Coverage",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Coverage", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Basic", "Extended","Premium"])
g.set_ylabels("Coverage Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Coverage"])
df_watson.head()

In [None]:
df_watson["Education"] = [0 if i == "Bachelor" else 1 if i == "College"
                      else 2 if i == "High School or Below" else 3 if i == "Master" else 4 for i in df_watson["Education"]]

sns.countplot(x= "Education",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Education", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Bachelor", "College","High School or Below", "Master", "Doctor"])
g.set_ylabels("Education Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Education"])
df_watson.head()

In [None]:
df_watson["EmploymentStatus"] = [0 if i == "Employed" else 1 if i == "Unemployed"
                      else 2 if i == "Medical Leave" else 3 if i == "Disabled" else 4 for i in df_watson["EmploymentStatus"]]

sns.countplot(x= "EmploymentStatus",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="EmploymentStatus", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Employed", "Unemployed","Medical Leave", "Disabled", "Retired"])
g.set_ylabels("EmploymentStatus Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["EmploymentStatus"])
df_watson.head()

In [None]:
df_watson["Location Code"] = [0 if i == "Suburban" else 1 if i == "Rural"
                      else 2 for i in df_watson["Location Code"]]

sns.countplot(x= "Location Code",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Location Code", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Suburban", "Rural","Urban"])
g.set_ylabels("Location Code Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Location Code"])
df_watson.head()

In [None]:
df_watson["Marital Status"] = [0 if i == "Married" else 1 if i == "Single"
                      else 2 for i in df_watson["Marital Status"]]

sns.countplot(x= "Marital Status",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Marital Status", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Married", "Single","Divorced"])
g.set_ylabels("Marital Status Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Marital Status"])
df_watson.head()

In [None]:
df_watson["Policy Type"] = [0 if i == "Personal Auto" else 1 if i == "Corporate Auto"
                      else 2 for i in df_watson["Policy Type"]]

sns.countplot(x= "Policy Type",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Policy Type", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Personal Auto", "Corporate Auto","Special Auto"])
g.set_ylabels("Policy Type Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Policy Type"])
df_watson.head()

In [None]:
df_watson["Renew Offer Type"] = [0 if i == "Offer1" else 1 if i == "Offer2"
                      else 2 if i == "Offer3" else 3 for i in df_watson["Renew Offer Type"]]

sns.countplot(x= "Renew Offer Type",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Renew Offer Type", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Offer1", "Offer2","Offer3", "Offer4"])
g.set_ylabels("Renew Offer Type Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Renew Offer Type"])
df_watson.head()

In [None]:
df_watson["Sales Channel"] = [0 if i == "Agent" else 1 if i == "Branch"
                      else 2 if i == "Call Center" else 3 for i in df_watson["Sales Channel"]]

sns.countplot(x= "Sales Channel",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Sales Channel", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Agent", "Branch","Call Center", "Web"])
g.set_ylabels("Sales Channel Type Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Sales Channel"])
df_watson.head()

In [None]:
df_watson["Vehicle Size"] = [0 if i == "Medsize" else 1 if i == "Small"
                      else 2 for i in df_watson["Vehicle Size"]]

sns.countplot(x= "Vehicle Size",data = df_watson)
plt.xticks(rotation = 60)
plt.show()

g=sns.catplot(x="Vehicle Size", y="Response", data=df_watson, kind="bar")
g.set_xticklabels(["Medsize", "Small","Large"])
g.set_ylabels("Vehicle Size Response Rate")
plt.show()

df_watson = pd.get_dummies(df_watson,columns = ["Vehicle Size"])
df_watson.head()

In [None]:
df_watson.info()

In [None]:
df_watson.head()

In [None]:
list1 = ["State_0","State_1","State_2","State_3","State_4","Customer Lifetime Value","Response","Coverage_0","Coverage_1","Coverage_2",
       "Education_0","Education_1","Education_2","Education_3","Education_4","EmploymentStatus_0","EmploymentStatus_1","EmploymentStatus_2","EmploymentStatus_3","EmploymentStatus_4","Income",
       "Location Code_0","Location Code_1","Location Code_2","Marital Status_0","Marital Status_1","Marital Status_2","Monthly Premium Auto",
       "Months Since Last Claim", "Months Since Policy Inception",
       "Number of Open Complaints", "Number of Policies", "Policy Type_0","Policy Type_1", "Policy Type_2",
       "Renew Offer Type_0","Renew Offer Type_1","Renew Offer Type_2","Renew Offer Type_3","Sales Channel_0","Sales Channel_1","Sales Channel_2","Sales Channel_3","Total Claim Amount","Vehicle Size_0","Vehicle Size_1","Vehicle Size_2"]
sns.heatmap(df_watson[list1].corr(),annot = True, fmt= ".2f")
plt.show()

In [None]:
g = sns.FacetGrid(df_watson, col = "Response", row = "Number of Open Complaints", height = 3)
g.map(plt.hist, "Number of Policies", bins = 25)
g.add_legend()
plt.show()

<a id = "29"></a><br>
# 5. Modeling

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
#In order not to lose size
train_df_len=len(df_watson)

In [None]:
df_watson.head()

In [None]:
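# There is no separate unlabeled test set, so the full feature table is reused without the Response column.
# drop() returns a new frame, which avoids modifying a slice of df_watson in place.
test = df_watson.drop(labels = ["Response"], axis = 1)
test.head(10)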

# train_test_split

In [None]:
train = df_watson[:train_df_len]
X_train = train.drop(labels = "Response", axis = 1)
y_train = train["Response"]
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.33, random_state = 42)
print("X_train",len(X_train))
print("X_test",len(X_test))
print("y_train",len(y_train))
print("y_test",len(y_test))
print("test",len(test))

# Simple Logistic Regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
acc_log_train = round(logreg.score(X_train, y_train)*100,2)
acc_log_test = round(logreg.score(X_test, y_test)*100,2)
print("Training Accuracy: {}%".format(acc_log_train))
print("Testing Accuracy: {}%".format(acc_log_test))

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


# Hyperparameter Tuning -- Grid Search -- Cross Validation
We will compare five machine learning classifiers and evaluate the mean accuracy of each using stratified cross-validation.

* Decision Tree
* SVM
* Random Forest
* KNN
* Logistic Regression
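
Before the full grid search, the idea of stratified cross-validation can be sketched on its own. This is a minimal illustration on synthetic, imbalanced data from `make_classification`, not on the Watson table; the names `X_demo`, `y_demo`, `demo_models` and `mean_scores` are assumptions introduced here, and only two of the five classifiers are shown to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary target with roughly 85% / 15% class balance
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.85, 0.15], random_state=42)

# Each fold keeps approximately the same class ratio as the full sample
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy across the folds for each model
mean_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
    for name, model in demo_models.items()
}

for name, score in mean_scores.items():
    print("{}: {:.3f}".format(name, score))
```

With a target this imbalanced, accuracy alone can look high even for a model that rarely predicts the minority class, so it is worth reading alongside precision and recall.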

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state = random_state),
             SVC(random_state = random_state),
             RandomForestClassifier(random_state = random_state),
             LogisticRegression(random_state = random_state, solver = "liblinear"),  # liblinear supports both the l1 and l2 penalties searched below
             KNeighborsClassifier()]

dt_param_grid = {"min_samples_split" : range(10,500,20),
                "max_depth": range(1,20,2)}

svc_param_grid = {"kernel" : ["rbf"],
                 "gamma": [0.001, 0.01, 0.1, 1],
                 "C": [1,10,50,100,200,300,1000]}

rf_param_grid = {"max_features": [1,3,10],
                "min_samples_split":[2,3,10],
                "min_samples_leaf":[1,3,10],
                "bootstrap":[False],
                "n_estimators":[100,300],
                "criterion":["gini"]}

logreg_param_grid = {"C":np.logspace(-3,3,7),
                    "penalty": ["l1","l2"]}

knn_param_grid = {"n_neighbors": np.linspace(1,19,10, dtype = int).tolist(),
                 "weights": ["uniform","distance"],
                 "metric":["euclidean","manhattan"]}
classifier_param = [dt_param_grid,
                   svc_param_grid,
                   rf_param_grid,
                   logreg_param_grid,
                   knn_param_grid]

In [None]:
cv_result = []
best_estimators = []
for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], 
                       cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
    clf.fit(X_train,y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])

In [None]:
cv_results = pd.DataFrame({"Cross Validation Means":cv_result,
                           "ML Models":["DecisionTreeClassifier", "SVM","RandomForestClassifier",
             "LogisticRegression",
             "KNeighborsClassifier"]})

g = sns.barplot(x = "Cross Validation Means", y = "ML Models", data = cv_results)
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")

In [None]:
votingC = VotingClassifier(estimators = [("dt",best_estimators[0]),
                                        ("rfc",best_estimators[2]),
                                        ("lr",best_estimators[3])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_train, y_train)
# Predict X_test with the voting classifier and compare the predictions against y_test using the accuracy score.
print(accuracy_score(y_test, votingC.predict(X_test)))

In [None]:
test_response = pd.Series(votingC.predict(test), name = "Response").astype(int)
results = pd.concat([df1, test_response],axis = 1)
results.to_csv("watson.csv", index = False)

In [None]:
test_response

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(votingC, X_train, y_train, cv=3)
confusion_matrix(y_train, predictions)

In [None]:
from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(y_train, predictions))
print("Recall:",recall_score(y_train, predictions))

In [None]:
from sklearn.metrics import f1_score
f1_score(y_train, predictions)

In [None]:
from sklearn.metrics import precision_recall_curve

# getting the probabilities of our predictions
y_scores = votingC.predict_proba(X_train)
y_scores = y_scores[:,1]

precision, recall, threshold = precision_recall_curve(y_train, y_scores)
def plot_precision_and_recall(precision, recall, threshold):
    plt.plot(threshold, precision[:-1], "r-", label="precision", linewidth=5)
    plt.plot(threshold, recall[:-1], "b", label="recall", linewidth=5)
    plt.xlabel("threshold", fontsize=19)
    plt.legend(loc="upper right", fontsize=19)
    plt.ylim([0, 1])

plt.figure(figsize=(14, 7))
plot_precision_and_recall(precision, recall, threshold)
plt.show()

In [None]:
def plot_precision_vs_recall(precision, recall):
    plt.plot(recall, precision, "g--", linewidth=2.5)
    plt.ylabel("precision", fontsize=19)
    plt.xlabel("recall", fontsize=19)
    plt.axis([0, 1.5, 0, 1.5])

plt.figure(figsize=(14, 7))
plot_precision_vs_recall(precision, recall)
plt.show()

In [None]:
from sklearn.metrics import roc_curve
# compute true positive rate and false positive rate
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, y_scores)
# plotting them against each other
def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)

plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()


In [None]:
from sklearn.metrics import roc_auc_score
r_a_score = roc_auc_score(y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
linear_reg=LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)
print('Intercept: \n', linear_reg.intercept_)
print('Coefficients: \n', linear_reg.coef_)

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import median_absolute_error, r2_score
# `normed` was removed from Matplotlib histograms; `density` is the supported keyword.
# The distutils-based version check is dropped because distutils is gone from Python 3.12+.
density_param = {'density': True}
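# Synthetic regression data from make_regression (not the Watson table), used to illustrate target transformation
X, y = make_regression(n_samples=10000, noise=100, random_state=0)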
y = np.exp((y + abs(y.min())) / 200)
y_trans = np.log1p(y)

f, (ax0, ax1) = plt.subplots(1, 2)

ax0.hist(y, bins=100, **density_param)
ax0.set_xlim([0, 2000])
ax0.set_ylabel('Probability')
ax0.set_xlabel('Response')
ax0.set_title('Response distribution')

ax1.hist(y_trans, bins=100, **density_param)
ax1.set_ylabel('Probability')
ax1.set_xlabel('Response')
ax1.set_title('Transformed response distribution')

f.suptitle("Synthetic data", y=0.035)
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)

regr = RidgeCV()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

ax0.scatter(y_test, y_pred)
ax0.plot([0, 2000], [0, 2000], '--k')
ax0.set_ylabel('Response predicted')
ax0.set_xlabel('True Response')
ax0.set_title('Ridge regression \n without response transformation')
ax0.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (
    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
ax0.set_xlim([0, 2000])
ax0.set_ylim([0, 2000])

regr_trans = TransformedTargetRegressor(regressor=RidgeCV(),
                                        func=np.log1p,
                                        inverse_func=np.expm1)
regr_trans.fit(X_train, y_train)
y_pred = regr_trans.predict(X_test)

ax1.scatter(y_test, y_pred)
ax1.plot([0, 2000], [0, 2000], '--k')
ax1.set_ylabel('Response predicted')
ax1.set_xlabel('True Response')
ax1.set_title('Ridge regression \n with response transformation')
ax1.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (
    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
ax1.set_xlim([0, 2000])
ax1.set_ylim([0, 2000])

f.suptitle("Synthetic data", y=0.035)
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])