# Predicting a Patient's Smoker Status Through Text Mining and Machine Learning | SCIKIT-LEARN
## Leveraging Latent Semantic Analysis and Stochastic Gradient Descent Classifier for Healthcare Predictive Analytics
### Pablo X Zumba

We will predict a patient's smoker status from that patient's clinical note (recorded by a doctor).

**The unit of analysis is a patient note**

In [1]:
import pandas as pd
import numpy as np

In [2]:
notes = pd.read_csv('smokers.csv')

In [3]:
notes

| | ID | TEXT | STATUS |
|---|---|---|---|
| 0 | 641 | 977146916\nHLGMC\n2878891\n022690\n01/27/1997 ... | CURRENT SMOKER |
| 1 | 643 | 026738007\nCMC\n15319689\n3/25/1998 12:00:00 A... | CURRENT SMOKER |
| 2 | 681 | 071962960\nBH\n4236518\n417454\n12/10/2001 12:... | CURRENT SMOKER |
| 3 | 704 | 418520250\nNVH\n61562872\n3/11/1995 12:00:00 A... | CURRENT SMOKER |
| 4 | 757 | 301443520\nCTMC\n49020928\n448922\n1/11/1990 1... | CURRENT SMOKER |
| ... | ... | ... | ... |
| 393 | 401 | 917989835 RWH\n5427551\n405831\n9660879\n01/09... | UNKNOWN |
| 394 | 403 | 817406016 RWH\n3154334\n554691\n3547577\n7/6/2... | UNKNOWN |
| 395 | 416 | 517502848 ELMVH\n18587541\n6634152\n12/12/2004... | UNKNOWN |
| 396 | 417 | 895872725 ELMVH\n99080881\n979718\n5/25/2002 1... | UNKNOWN |


In [4]:
#print(notes['TEXT'][0])


In [5]:
print(notes['STATUS'].unique())

['CURRENT SMOKER' 'NON-SMOKER' 'PAST SMOKER' 'SMOKER' 'UNKNOWN']


In [6]:
notes.shape

(398, 3)
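
With five status labels, it helps to know how many notes fall into each class before modeling. The sketch below shows one way to check this with `value_counts`; the `status` series and its counts are invented for illustration and merely stand in for `notes['STATUS']`.

```python
import pandas as pd

# Hypothetical stand-in for notes['STATUS']; these counts are invented.
status = pd.Series(
    ['UNKNOWN'] * 6 + ['NON-SMOKER'] * 2 + ['PAST SMOKER', 'CURRENT SMOKER'],
    name='STATUS',
)

# Absolute number of notes per class, most frequent class first
counts = status.value_counts()
print(counts)

# Same information as proportions, which makes class imbalance easy to spot
shares = status.value_counts(normalize=True)
print(shares.round(2))
```

On the real data, the equivalent call would be `notes['STATUS'].value_counts()`.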

## Assign the "target" variable

In [7]:
target = notes['STATUS']

## Assign the "text" (input) variable

In [8]:
notes[['TEXT']].isna().sum()  # No missing values in TEXT

TEXT    0
dtype: int64

In [9]:
input_data = notes['TEXT']

## Split the data

In [10]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [11]:
train_set.shape, train_y.shape

((278,), (278,))

In [12]:
test_set.shape, test_y.shape

((120,), (120,))
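
The split above is purely random, so rare classes can end up unevenly distributed between train and test. As a possible alternative (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps class proportions similar in both partitions. The sketch below uses invented toy data; note that stratification requires at least two samples in every class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for notes['TEXT'] and notes['STATUS']
texts = pd.Series([f'note {i}' for i in range(20)])
labels = pd.Series(['UNKNOWN'] * 10 + ['NON-SMOKER'] * 6 + ['PAST SMOKER'] * 4)

# stratify=labels asks for the same class mix in the train and test partitions
tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

print(tr_y.value_counts().sort_index())
print(te_y.value_counts().sort_index())
```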

## Sklearn: Text preparation


In [13]:
# TfidfVectorizer handles preprocessing, tokenization, and stop-word filtering
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(train_set)

In [14]:
# Apply the TfidfVectorizer transformation to the test set
# Be careful: we reuse the vectorizer fitted on the train set (transform, not fit_transform).
# Otherwise, the test features would come from a different vocabulary and would not match the train set.

test_x_tr = tfidf_vect.transform(test_set)

In [15]:
train_x_tr.shape, test_x_tr.shape

((278, 12240), (120, 12240))
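
To make the fit-on-train, transform-on-test rule concrete, here is a minimal sketch on two invented miniature corpora (they are not taken from the dataset). It shows that the test matrix gets the same columns as the train matrix and that words never seen during fitting are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora, invented for illustration
train_docs = ['patient smokes daily', 'patient denies tobacco use']
test_docs = ['patient quit cigarettes']  # 'quit' and 'cigarettes' are unseen in train

vect = TfidfVectorizer()
train_mat = vect.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_mat = vect.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(vect.vocabulary_))
print(train_mat.shape, test_mat.shape)  # identical number of columns
print(test_mat.nnz)                     # only 'patient' is present in the test row
```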

In [16]:
# The result is a SciPy sparse matrix; its repr shows only the shape and the number of stored values
train_x_tr

<278x12240 sparse matrix of type '<class 'numpy.float64'>'
	with 71810 stored elements in Compressed Sparse Row format>

In [17]:
# Convert to a dense NumPy array with toarray() to see the values (mostly zeros)
train_x_tr.toarray()

array([[0.0095233 , 0.01267037, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.02616653, 0.03481352, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01448694, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.03956983, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.06738015, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.02353552, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

## Latent Semantic Analysis (Singular Value Decomposition)

### Fit the SVD on the training set only, then apply it to both train and test

In [18]:
from sklearn.decomposition import TruncatedSVD

In [19]:
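# For Latent Semantic Analysis, the scikit-learn documentation recommends 100 components.
# Here n_components=278 equals the number of training notes, the maximum possible rank of the training matrix.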

svd = TruncatedSVD(n_components=278, n_iter=50)

In [20]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [21]:
train_x_lsa.shape

(278, 278)

In [22]:
train_x_lsa

array([[ 2.96692177e-01, -6.95753887e-02, -2.29544427e-02, ...,
         2.39689773e-04,  1.88918253e-05, -1.14582509e-03],
       [ 2.52213194e-01,  2.21258004e-02,  2.71268979e-02, ...,
         3.15188008e-03,  4.46321684e-03,  1.03471317e-02],
       [ 4.23545576e-01, -5.25166916e-02, -2.27631660e-02, ...,
        -5.11187395e-03,  7.82821053e-03,  2.65989855e-03],
       ...,
       [ 3.11761211e-01, -3.46701020e-03,  6.48129711e-02, ...,
        -6.86543535e-04, -8.08771274e-04, -2.32709631e-03],
       [ 2.61020664e-01,  1.71355498e-01,  3.40140476e-01, ...,
         1.75019586e-02, -9.42464230e-03,  9.01793438e-03],
       [ 3.51938483e-01, -8.24162551e-02, -4.96559201e-02, ...,
         4.17369461e-05, -2.92334947e-03,  1.19323823e-03]])

### Let's transform the test data set

In [23]:
test_x_lsa = svd.transform(test_x_tr)

In [24]:
test_x_lsa.shape

(120, 278)
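
The vectorize-then-reduce steps can also be chained so that the train-only fitting rule is enforced in one place. The sketch below is one possible arrangement using `make_pipeline`, not the notebook's own code; the documents and the choice of 3 components are invented for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical documents, invented for illustration
train_docs = [
    'patient smokes one pack daily',
    'patient denies tobacco use',
    'quit smoking ten years ago',
    'no history of tobacco',
    'smokes cigarettes occasionally',
    'denies alcohol and tobacco',
]
test_docs = ['patient smokes cigarettes daily']

# TF-IDF followed by truncated SVD is the usual LSA recipe
lsa = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=3, random_state=42),
)

train_lsa = lsa.fit_transform(train_docs)  # fit both steps on train only
test_lsa = lsa.transform(test_docs)        # reuse the fitted steps on test

print(train_lsa.shape, test_lsa.shape)
```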

## Check for the cumulative variance explained

**Increase the number of components if the cumulative variance is low.**

In [25]:
svd.explained_variance_.sum()

0.9133371448625207

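#### Reducing 12240 columns to 278 components appears sufficient: the summed value is about 0.91. Note that `explained_variance_` holds each component's absolute variance, not a proportion; `explained_variance_ratio_` gives the fraction of the original variance retained, which is the quantity a "91% explained" reading requires.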
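
To read the retained variance as a percentage, the cumulative sum of `explained_variance_ratio_` is the relevant quantity. The sketch below shows one way to find how many components are needed to reach a chosen threshold. The random matrix `X`, the 20 components, and the 0.90 threshold are all assumptions made for illustration; on the real data, the fitted `svd` object would be used instead.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((50, 30))  # hypothetical data standing in for the TF-IDF matrix

svd_demo = TruncatedSVD(n_components=20, random_state=0).fit(X)

# Running total of the fraction of variance retained by the first k components
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the threshold, or None if never reached
threshold = 0.90
reached = cumulative >= threshold
k = int(np.argmax(reached)) + 1 if reached.any() else None

print('components needed:', k)
print('total ratio retained:', cumulative[-1])
```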

# Try one of the classifiers we have covered so far

## Stochastic Gradient Descent Classifier

In [26]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [27]:


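# Note: l1_ratio only takes effect with penalty='elasticnet' (the default penalty is 'l2'),
# and tol=0.9 is a very loose stopping criterion, so training may stop after only a few epochs.
sgd_clf = SGDClassifier(max_iter=5000, tol=0.9, alpha=0.01, l1_ratio=0.1)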

In [28]:
sgd_clf.fit(train_x_lsa, train_y)

SGDClassifier(alpha=0.01, l1_ratio=0.1, max_iter=5000, tol=0.9)
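
For comparison, the sketch below shows a differently configured `SGDClassifier`: it sets `penalty='elasticnet'` so that `l1_ratio` is actually used, uses a tighter `tol`, and adds `class_weight='balanced'` to counter class imbalance. These settings are assumptions chosen for illustration, not tuned values and not the configuration used in this notebook; the data comes from `make_classification`, not from the patient notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic, imbalanced 3-class data (hypothetical stand-in for the LSA features)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

clf = SGDClassifier(
    penalty='elasticnet',     # required for l1_ratio to have any effect
    l1_ratio=0.1,
    alpha=0.01,
    max_iter=5000,
    tol=1e-3,                 # tighter stopping criterion than tol=0.9
    class_weight='balanced',  # reweight classes inversely to their frequency
    random_state=0,
)
clf.fit(X, y)

print('epochs run:', clf.n_iter_)
print('classes predicted on train:', sorted(set(clf.predict(X))))
```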

## Accuracy

In [29]:
#Train accuracy

train_y_pred = sgd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print(f'Train acc: {train_acc}')

Train acc: 0.6510791366906474


In [30]:
#Test accuracy

test_y_pred = sgd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print(f'Test acc: {test_acc}')

Test acc: 0.5916666666666667
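
An accuracy figure is easier to judge against a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` with `strategy='most_frequent'`, which always predicts the most common training label. The label arrays are invented for illustration; on the real data, `train_y` and `test_y` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for illustration
y_train = np.array(['UNKNOWN'] * 7 + ['NON-SMOKER'] * 3)
y_test = np.array(['UNKNOWN'] * 3 + ['NON-SMOKER'] * 2)

# DummyClassifier ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print(f'Baseline acc: {baseline_acc}')
```

A model is only useful if it clearly beats this number.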


## Generate the confusion matrix

In [31]:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; usually computed on the test set
confusion_matrix(test_y, test_y_pred)

array([[ 0,  0,  0,  0, 10],
       [ 0,  0,  0,  0, 23],
       [ 0,  0,  0,  0, 15],
       [ 0,  0,  0,  0,  1],
       [ 0,  0,  0,  0, 71]])
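
In this matrix only the last column is non-zero: every test note was predicted as `UNKNOWN`, so the test accuracy of 0.5917 is just the share of `UNKNOWN` notes (71 of 120). The sketch below rebuilds label vectors consistent with the printed matrix (row totals 10, 23, 15, 1, 71, in the sorted label order that `confusion_matrix` uses by default) and shows a labeled version with per-class recall. The reconstruction is for illustration; on the real data, `test_y` and `test_y_pred` would be passed directly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, recall_score

labels = ['CURRENT SMOKER', 'NON-SMOKER', 'PAST SMOKER', 'SMOKER', 'UNKNOWN']
row_totals = [10, 23, 15, 1, 71]  # row sums of the matrix printed above

# Label vectors consistent with that matrix: every note predicted UNKNOWN
y_true = np.repeat(labels, row_totals)
y_pred = np.full(y_true.shape, 'UNKNOWN')

# Labeled confusion matrix: rows are true classes, columns are predictions
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=labels, columns=labels,
)
print(cm)

# Per-class recall exposes what overall accuracy hides
recalls = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
print(dict(zip(labels, recalls)))

accuracy = (y_true == y_pred).mean()
print(accuracy)
```

Recall is 1.0 for `UNKNOWN` and 0.0 for every other class, which means the classifier has not learned to separate the smoking categories.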