# Care Team Engagement Prediction - Naive Bayes Classifier

This notebook includes the following steps:
<ul>
<li> Setup </li>
<li> Read data file </li>
<li> Process data file </li>
<li> Remove nulls </li>
<li> Define Y </li>
<li> Split into train and test </li>
<li> Predict using Naive Bayes Classifier </li>
</ul>

### 0. Setup

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [12]:
%matplotlib inline

### 1. Read data file (locally or from S3) 

In [13]:
# Read local data file
df1 = pd.read_csv('data5.csv')

### 2. Process Input file

In [14]:
df1.shape

(53895, 30)

In [15]:
df1.columns

Index(['days_to_first_et', 'days_to_coach', 'indication', 'is_gender_female',
       'is_gender_male', 'is_gender_other', 'bio_length', 'reasons_length',
       'imagine_free_length', 'reason_limited_time',
       'reason_family_obligations', 'reason_work_obligations', 'reason_other',
       'surgery_1yr', 'pain_severity', 'pain_vas', 'pain_description_length',
       'bmi', 'gad', 'phq', 'inbound_coach_messages_4_weeks',
       'inbound_coach_messages_1_week', 'inbound_coach_messages_length_1_week',
       'inbound_member_messages_4_weeks', 'inbound_member_messages_1_week',
       'surgery_message', 'call_message', 'interaction_message',
       'video_message', 'booking_message'],
      dtype='object')

In [16]:
# Make sure we have no member identification fields
df1 = df1.drop(['pathway_id', 'user_id', 'uuid'], axis=1, errors='ignore') ## PHI Safety

In [17]:
# Replace indication with dummy variables
indication_dummies = pd.get_dummies(df1['indication'])
df1 = pd.concat([df1, indication_dummies], axis=1)      
df1 = df1.drop(['indication'], axis=1)
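
To illustrate what the dummy-variable step does, here is a minimal sketch on made-up values. The indication labels `back` and `knee` and the `toy` frame are hypothetical and are not taken from `data5.csv`.

```python
import pandas as pd

# Hypothetical indication values, for illustration only
toy = pd.DataFrame({'indication': ['back', 'knee', 'back'],
                    'bmi': [24.0, 31.5, 27.2]})

# One indicator column per distinct indication value
dummies = pd.get_dummies(toy['indication'])

# Append the indicator columns and drop the original text column
toy = pd.concat([toy, dummies], axis=1).drop(['indication'], axis=1)

print(toy.columns.tolist())
```

Each distinct label becomes its own column, so the number of added columns equals the number of distinct indications in the data.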

In [18]:
# Combine all hot words into one column
df1['hot_word'] =  df1['surgery_message'] + df1['call_message'] + df1['interaction_message'] + df1['video_message']

# Set aside the members who used a hot word; we know those should be assigned to mid level
hot_word_memebrs = df1[df1['hot_word'] > 0]
df1 = df1[df1['hot_word'] == 0]
df1 = df1.drop(['surgery_message', 'call_message', 'interaction_message', 'video_message', 'booking_message', 'hot_word'], axis=1)

### 3. Remove null values

In [19]:
print(df1.isnull().sum()) # count missing values per column

days_to_first_et                            0
days_to_coach                             987
is_gender_female                            0
is_gender_male                              0
is_gender_other                             0
bio_length                                  0
reasons_length                              0
imagine_free_length                         0
reason_limited_time                         0
reason_family_obligations                   0
reason_work_obligations                     0
reason_other                                0
surgery_1yr                                 0
pain_severity                           12177
pain_vas                                 2099
pain_description_length                     0
bmi                                      2295
gad                                         6
phq                                         7
inbound_coach_messages_4_weeks              0
inbound_coach_messages_1_week               0
inbound_coach_messages_length_1_we

In [20]:
# Remove members with days_to_coach = null
df1 = df1[df1['days_to_coach'].notna()]

In [21]:
# If pain_severity, pain_vas, gad, or phq is null -> 0
# (assign the result back instead of calling fillna(inplace=True) on a single column)
zero_fill_cols = ['pain_severity', 'pain_vas', 'gad', 'phq']
df1[zero_fill_cols] = df1[zero_fill_cols].fillna(0)

In [22]:
# Put the average BMI where BMI is null
df1['bmi'] = df1['bmi'].fillna(df1['bmi'].mean())

In [23]:
print(df1.isnull().sum()) # found no missing values in the data

days_to_first_et                        0
days_to_coach                           0
is_gender_female                        0
is_gender_male                          0
is_gender_other                         0
bio_length                              0
reasons_length                          0
imagine_free_length                     0
reason_limited_time                     0
reason_family_obligations               0
reason_work_obligations                 0
reason_other                            0
surgery_1yr                             0
pain_severity                           0
pain_vas                                0
pain_description_length                 0
bmi                                     0
gad                                     0
phq                                     0
inbound_coach_messages_4_weeks          0
inbound_coach_messages_1_week           0
inbound_coach_messages_length_1_week    0
inbound_member_messages_4_weeks         0
inbound_member_messages_1_week    

### 4. Define Y

In [24]:
limit = 7

# Define target column
# Analysis showed that the top 20% of customers sent 9 or more messages.
# Note: with limit = 7, Y is True for members with 8 or more messages.
df1['Y'] = df1['inbound_coach_messages_4_weeks'] > limit
df1 = df1.drop(['inbound_member_messages_4_weeks', 'inbound_coach_messages_4_weeks'], axis=1)
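
The target definition relies on a cutoff taken from the distribution of message counts. One way such a cutoff could be derived is sketched below using a quantile. The `counts` series is hypothetical, and this is not necessarily how the value of `limit` was chosen in this notebook.

```python
import pandas as pd

# Hypothetical 4-week message counts, for illustration only
counts = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 9, 12],
                   name='inbound_coach_messages_4_weeks')

# 80th percentile: roughly the top 20% of members lie above this value
cutoff = counts.quantile(0.80)

# Same pattern as the notebook: strictly greater than the cutoff
y_toy = counts > cutoff

print(cutoff, y_toy.mean())
```

Checking the share of `True` values after applying the cutoff is a quick way to confirm that the threshold and the comparison operator (`>` versus `>=`) select the intended fraction.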

In [25]:
df1['Y'].value_counts()

False    39638
True     11997
Name: Y, dtype: int64

In [26]:
# Save cleaned data for future use
df1.to_csv('cleanData1.csv')

### 5. Split data into train and test

In [27]:
X = df1.iloc[:, 0:-1].values
y = df1.iloc[:, -1].values

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
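
Because the classes are imbalanced (about 23% `True` in the counts above), a stratified split is one option for keeping the class ratio the same in train and test. The sketch below uses the `stratify` parameter of `train_test_split` on hypothetical toy arrays; the notebook itself uses a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 25 of them positive
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([True] * 25 + [False] * 75)

# stratify=y_toy keeps the positive share equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```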

### 6. Predict using Naive Bayes Classifier

In [29]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

In [30]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [31]:
y_pred = classifier.predict(X_test)

In [32]:
confusion_matrix(y_test, y_pred)

array([[6245, 1718],
       [1540,  824]])

In [33]:
accuracy_score(y_test,y_pred)

0.6845163164520189

In [34]:
precision_score(y_test, y_pred)

0.3241542092840283

In [35]:
recall_score(y_test, y_pred)

0.34856175972927245
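
The three scores above can be recomputed by hand from the confusion matrix, which makes it easier to see where the errors come from. The sketch below uses only the four counts reported above, read in scikit-learn's layout (rows are true classes, columns are predicted classes). The `majority_baseline` name is introduced here for illustration.

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 6245, 1718
fn, tp = 1540, 824

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # share of predicted True that are truly True
recall = tp / (tp + fn)         # share of truly True that were found

# Accuracy of a trivial model that always predicts False
majority_baseline = (tn + fp) / total

print(accuracy, precision, recall, majority_baseline)
```

Comparing `accuracy` with `majority_baseline` shows that always predicting `False` would be more accurate on this test set than the fitted classifier, which is why precision and recall are the more informative scores here.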

### Target is recall = 75% and precision = 65%

<div class="alert alert-block alert-warning">
<b>Naive Bayes classifiers</b> are fast and easy to implement, but they assume the following: <br>
<ol>
    <li> Features are conditionally independent given the class </li>
    <li> Each feature follows the distribution the variant assumes (Gaussian for GaussianNB, count-like multinomial for MultinomialNB, which is used here) </li>
    <li> All features have the same weight </li>
</ol>    
In most real-life cases, including this one, these assumptions do not hold.
</div>
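
Given the recall target stated above, one common lever is to lower the probability threshold used to predict `True`. The sketch below shows the idea with `GaussianNB` on synthetic data. The data, the thresholds 0.5 and 0.3, and the variable names are all assumptions for illustration, and the notebook does not apply this step.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): about 25% positives, 3 features
n = 2000
y_syn = (rng.random(n) < 0.25).astype(int)
# Positive rows are shifted by +1 on every feature
X_syn = rng.normal(loc=y_syn[:, None].astype(float), scale=1.0, size=(n, 3))

clf = GaussianNB().fit(X_syn, y_syn)

# Probability of the positive class for each row
proba = clf.predict_proba(X_syn)[:, 1]

precisions, recalls = {}, {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    precisions[threshold] = precision_score(y_syn, pred)
    recalls[threshold] = recall_score(y_syn, pred)
    print(threshold, precisions[threshold], recalls[threshold])
```

Lowering the threshold can only add predicted positives, so recall cannot decrease. Precision usually falls in exchange, so both scores need to be checked against the targets on held-out data.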