## Identify online patient conversations

The ZS Data Science team collaborates with the Social Listening team to automate the process of gaining insights from social media conversations.

The Social Listening team has to manually validate heart failure related conversations fetched from the social listening tool, which scans Twitter, Facebook, forums, blogs, etc. Such conversations are posted by multiple stakeholders like patients, doctors, media houses, the general public, etc. The team needs to identify the patient conversations so as to dig deeper into them and identify the patient needs. The data science team wants to automate this process by building intelligent algorithms to predict patient conversations.

Build an intelligent pipeline that can segregate patient conversations from the rest of the group, given historically tagged patient data.

> You are expected to build an algorithm that can ingest the social data and output the patient tags - 1 if patient and 0 if not a patient.

Please find the dataset at [http://hck.re/pnsNa4](http://hck.re/pnsNa4).

A description of the attributes in the dataset is given below - 

<img src = "https://i.ibb.co/GHX5SZf/Capture.jpg"></img>

## Dataset loading

In [None]:
# Dependencies
import numpy as np 
import pandas as pd 

import os
print(os.listdir("../dataset"))

In [41]:
# Specify the encoding explicitly as the datasets are not encoded in UTF-8 
train_df = pd.read_csv('../dataset/train.csv', encoding = "ISO-8859-1")
test_df = pd.read_csv('../dataset/test.csv', encoding = "ISO-8859-1")
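
To see why the explicit encoding matters, here is a minimal sketch using only the standard library. The sample string is a made-up illustration, not a value from the dataset: it is encoded as ISO-8859-1, and the resulting bytes cannot be decoded as UTF-8.

```python
# A made-up string containing a character outside plain ASCII.
raw = "café".encode("ISO-8859-1")  # the 'é' becomes the single byte 0xE9

# Decoding these bytes as UTF-8 fails, because 0xE9 on its own
# is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in works.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok, latin1_text)
```

`pd.read_csv` would hit the same kind of decoding error on such bytes if the `encoding` argument were left at its UTF-8 default.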

## Investigating the dataset

What are the dimensions of the train and test sets respectively?

In [15]:
train_df.shape, test_df.shape

((1157, 9), (571, 10))

Isn't that a bit strange? The test set has more columns (10) than the train set (9), even though it should be the one without the label column. Let's find out why. 

In [4]:
train_df.columns

Index(['Source', 'Host', 'Link', 'Date(ET)', 'Time(ET)', 'time(GMT)', 'Title',
       'TRANS_CONV_TEXT', 'Patient_Tag'],
      dtype='object')

In [7]:
test_df.columns

Index(['Index', 'Source', 'Host', 'Link', 'Date(ET)', 'Time(ET)', 'time(GMT)',
       'Title', 'TRANS_CONV_TEXT', 'Unnamed: 9'],
      dtype='object')

So it turns out that the test set has two columns, `Index` and `Unnamed: 9`, which are not present in the train set. The label here is the `Patient_Tag` column, which is absent from the test set, as expected. 

Now, let's see what these two columns, `Index` and `Unnamed: 9`, convey. 

In [16]:
test_df[['Index','Unnamed: 9']].head(10)

Unnamed: 0,Index,Unnamed: 9
0,1,
1,2,
2,3,
3,4,
4,5,
5,6,
6,7,
7,8,
8,9,
9,10,


It seems that the `Index` column just numbers the samples explicitly and that the `Unnamed: 9` column is empty and is there erroneously. So, dropping them won't affect the performance of the final model. 
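
Since `head(10)` only shows the first ten rows, this impression can be checked on the whole frame before dropping anything. The sketch below uses a small hypothetical stand-in (`toy`) for `test_df`; the two checks themselves would apply unchanged to the real frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for test_df with the two suspicious columns.
toy = pd.DataFrame({
    "Index": [1, 2, 3, 4],
    "TRANS_CONV_TEXT": ["a", "b", "c", "d"],
    "Unnamed: 9": [np.nan] * 4,
})

# True if 'Index' is simply 1..n in row order.
expected = np.arange(1, len(toy) + 1)
index_is_row_number = bool((toy["Index"].to_numpy() == expected).all())

# True if 'Unnamed: 9' holds no values at all.
unnamed_is_empty = bool(toy["Unnamed: 9"].isna().all())

print(index_is_row_number, unnamed_is_empty)
```

If both checks come back `True` on the real data, the two columns carry no information and can be dropped safely.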

In [42]:
test_df.drop(columns = ['Index', 'Unnamed: 9'], inplace=True)

In [18]:
test_df.columns

Index(['Source', 'Host', 'Link', 'Date(ET)', 'Time(ET)', 'time(GMT)', 'Title',
       'TRANS_CONV_TEXT'],
      dtype='object')

Let's now take a look at the dataset itself. 

In [19]:
# Randomly sample 10 rows from the train set
train_df.sample(10)

Unnamed: 0,Source,Host,Link,Date(ET),Time(ET),time(GMT),Title,TRANS_CONV_TEXT,Patient_Tag
160,FORUMS,www.healthrevelations.com,http://www.healthrevelations.com/2016/04/04/ar...,4/3/2016,21:00:00,4/4/2016 6:30,Arthritis painkillers pack serious heart risks,Arthritis painkillers pack serious heart risks...,0
167,FORUMS,community.babycenter.com,http://community.babycenter.com/post/a63274957...,7/10/2016,13:24:00,,,Update: I have been taking it by pill also thi...,1
1030,BLOG,http://healthandfitness1blog.blogspot.com,http://healthandfitness1blog.blogspot.com/2016...,3/4/2016,6:23:00,3/4/2016 16:53,High coffee consumption may lower MS risk,Caffeine?s neuroprotective and anti-inflammato...,0
573,FORUMS,www.xboxhacker.org,http://www.xboxhacker.org/index.php?topic=7219...,6/18/2016,4:48:00,6/18/2016 14:18,Dosing can be paced as a side effect which ord...,Hematoma is a shot which scold always a profes...,0
416,FORUMS,allnurses.com,http://allnurses.com/nurse-colleague-patient/w...,7/29/2016,17:20:00,,,I just got report the other day from a newer n...,1
975,FORUMS,forums.hardwarezone.com.sg,http://forums.hardwarezone.com.sg/money-mind-2...,4/29/2016,22:18:00,4/30/2016 7:48,AIA shield Plan. Help me understand,Yes ur understanding is correct. For yr father...,0
488,FORUMS,reddit.com,https://www.reddit.com/r/childfree/comments/4v...,7/28/2016,23:31:00,,,My family has a really bad genetic line (histo...,1
619,FORUMS,community.diabetes.org,http://community.diabetes.org/t5/Adults-Living...,7/6/2016,13:05:00,,,I have worn a Holter monitor. It records your ...,1
1093,BLOG,http://sciencedaily.com/news,https://www.sciencedaily.com/releases/2016/04/...,4/4/2016,18:10:00,4/5/2016 3:40,New device for heart failure patients fails to...,A new implantable medical device intended to h...,0
657,FORUMS,www.lse.co.uk,http://www.lse.co.uk/ShareChat.asp?ShareTicker...,6/19/2016,8:00:00,6/19/2016 17:30,Cloudtag CTAG,co-founded this company http://www.impulse-dyn...,0


In [20]:
# Randomly sample 10 rows from the test set
test_df.sample(10)

Unnamed: 0,Source,Host,Link,Date(ET),Time(ET),time(GMT),Title,TRANS_CONV_TEXT
550,FORUMS,healthunlocked.com,https://healthunlocked.com/afassociation/posts...,5/7/2016,0.388888889,42497.78472,HealthUnlocked | The social network for health,Hi all I know that there have been lots of pos...
65,FORUMS,www.alsearsmd.com,http://www.alsearsmd.com/2016/04/better-than-a...,4/1/2016,10:57:00,4/1/2016 20:27,Better than Aspirin for Your Heart,Health Articles Better than Aspirin for Your H...
143,YOUTUBE,http://www.youtube.com,http://youtube.com/watch?v=Nx8kpdUkqME,6/21/2016,16:46:41,6/22/2016 2:16,Heart Failure: Palliative Approaches to Care O...,Description: Heart disease is the most common ...
214,FORUMS,boards.4chan.org,http://boards.4chan.org/tg/thread/47745953#p47...,6/16/2016,22:24:00,6/17/2016 7:54,/40krpg/ 40K Roleplay General,">>47814636 You, I like you. Shame the degenera..."
512,BLOG,http://healthtipsarticles.com,http://healthtipsarticles.com/tips-from-the-ex...,2/1/2016,0.172893519,42401.38123,Tips from the experts for a heart-healthy life...,When it comes to good advice about heart healt...
368,BLOG,evelynfastforward.blogspot.com,http://evelynfastforward.blogspot.com/2016/07/...,7/23/2016,14:20:00,,,One of my dear friends said to me one morning:...
172,BLOG,http://healthandfitness1blog.blogspot.com,http://healthandfitness1blog.blogspot.com/2016...,4/13/2016,6:50:00,4/13/2016 16:20,Lowering cholesterol with vegetable oils may n...,Study suggests avoiding saturated fats may not...
271,FORUMS,www.xboxhacker.org,http://www.xboxhacker.org/index.php?topic=7180...,6/16/2016,4:35:00,6/16/2016 14:05,Heart failure is a professional bandage.,Constipation can trick a cure before negative ...
39,YOUTUBE,http://www.youtube.com,http://youtube.com/watch?v=bDPtIxjuzmQ,6/20/2016,10:04:17,6/20/2016 19:34,"Hakan Altay, M.D. talks about ?Inflammation an...","Description: The E-Cardiology Academy, a novel..."
89,Facebook,,http://www.facebook.com/permalink.php?id=10153...,2-Jun-16,4:34 PM,6/2/2016 16:34,,""" I got congestive heart failure and haven't ..."


We can find out some more information about the train set and test set. 

In [23]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1157 entries, 0 to 1156
Data columns (total 9 columns):
Source             1157 non-null object
Host               1098 non-null object
Link               1157 non-null object
Date(ET)           1157 non-null object
Time(ET)           1157 non-null object
time(GMT)          996 non-null object
Title              941 non-null object
TRANS_CONV_TEXT    1156 non-null object
Patient_Tag        1157 non-null int64
dtypes: int64(1), object(8)
memory usage: 81.4+ KB


In [24]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 571 entries, 0 to 570
Data columns (total 8 columns):
Source             571 non-null object
Host               541 non-null object
Link               571 non-null object
Date(ET)           571 non-null object
Time(ET)           571 non-null object
time(GMT)          480 non-null object
Title              454 non-null object
TRANS_CONV_TEXT    571 non-null object
dtypes: object(8)
memory usage: 35.8+ KB


Most of the data is non-numeric in nature. It can also be seen that both the train and test sets have missing values. Let's talk in exact numbers. 

In [35]:
train_df.isna().sum()

Source               0
Host                59
Link                 0
Date(ET)             0
Time(ET)             0
time(GMT)          161
Title              216
TRANS_CONV_TEXT      1
Patient_Tag          0
dtype: int64

In [38]:
test_df.isna().sum()

Source               0
Host                30
Link                 0
Date(ET)             0
Time(ET)             0
time(GMT)           91
Title              117
TRANS_CONV_TEXT      0
dtype: int64

My intuition says that the maximum information of the dataset lies in the `TRANS_CONV_TEXT` column values which will help us to determine the labels of the text correctly. We can build the baseline model just by using two columns - `TRANS_CONV_TEXT` and `Patient_Tag`. So the problem statement can now be reduced to - 
> Given a conversation text, the task is to determine if the text is patient tagged or not. 

The task is a _binary classification task_.

We can see that the train set has one observation in which the value for `TRANS_CONV_TEXT` is missing. We should drop this observation, as a missing text would be problematic when creating our models. 

In [61]:
train_df[train_df['TRANS_CONV_TEXT'].isna()]

Unnamed: 0,Source,Host,Link,Date(ET),Time(ET),time(GMT),Title,TRANS_CONV_TEXT,Patient_Tag
841,FORUMS,www.reddit.com,https://www.reddit.com/r/science/comments/4ogb...,2016-06-16,19:25:00,6/17/2016 4:55,Teenage weight is linked to risk of heart fail...,,0


In [44]:
train_df.drop(841, inplace=True)
train_df.isna().sum()

Source               0
Host                59
Link                 0
Date(ET)             0
Time(ET)             0
time(GMT)          161
Title              216
TRANS_CONV_TEXT      0
Patient_Tag          0
dtype: int64
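
Dropping by the hard-coded row label `841` works for this particular file. A sketch of a more general alternative, which is not what the notebook does, is `dropna` restricted to the text column. The frame `toy` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row lacks the conversation text,
# another lacks only the title.
toy = pd.DataFrame({
    "TRANS_CONV_TEXT": ["I was diagnosed last year", np.nan, "New study on heart failure"],
    "Title": [np.nan, "Some title", "Another title"],
    "Patient_Tag": [1, 0, 0],
})

# Drop rows only when the text itself is missing; gaps in other
# columns (such as Title) are left alone.
cleaned = toy.dropna(subset=["TRANS_CONV_TEXT"])

print(cleaned.shape)
```

This keeps working if the file changes and the missing text ends up in a different row.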

We can now extract the values for the columns `TRANS_CONV_TEXT` and `Patient_Tag` and separate them out in a DataFrame. 

In [45]:
new_train_frame = train_df[['TRANS_CONV_TEXT', 'Patient_Tag']]
new_test_frame = test_df[['TRANS_CONV_TEXT']]

new_train_frame.shape, new_test_frame.shape

((1156, 2), (571, 1))

We can save these two DataFrames for our further use. 

In [52]:
new_train_frame.to_csv('new_train_frame.csv')
new_test_frame.to_csv('new_test_frame.csv')

As we are now dealing with just two columns, it would be wise to investigate the data a bit more. Let's start by analyzing the lengths of the conversation texts (in characters) - 

In [28]:
new_train_frame['TRANS_CONV_TEXT'].apply(len).describe()

count     1156.000000
mean      1851.491349
std       2324.415684
min          2.000000
25%        379.750000
50%        964.000000
75%       2441.250000
max      16000.000000
Name: TRANS_CONV_TEXT, dtype: float64

Here is the same information in words - 
* Average length of the texts - 1851 characters (approx)
* Minimum length of the texts - 2
* Maximum length of the texts - 16000

And for the test set - 

In [30]:
new_test_frame['TRANS_CONV_TEXT'].apply(len).describe()

count      571.000000
mean      1851.010508
std       2399.454322
min          3.000000
25%        391.000000
50%        971.000000
75%       2530.000000
max      16000.000000
Name: TRANS_CONV_TEXT, dtype: float64

The distribution is very similar to that of the train set: the mean and quartiles are close, the maximum is again 16000, and the minimum length is 3 instead of 2. 

Let's now find out the class distribution to check if there is a class-imbalance. 

In [34]:
new_train_frame['Patient_Tag'].value_counts()

0    916
1    240
Name: Patient_Tag, dtype: int64

And there is a class imbalance. But let's ignore this fact and proceed towards building a baseline. Let's first split the train set into train (yes, another one) and validation sets in an 80:20 ratio respectively. 

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    new_train_frame['TRANS_CONV_TEXT'], new_train_frame['Patient_Tag'],
    test_size=0.2, random_state=42)
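
Even when the imbalance is otherwise ignored, one cheap safeguard is a stratified split, which keeps the class ratio the same in both parts. This is a sketch on hypothetical toy data and is not the split used in the rest of this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 10 positives and 40 negatives.
toy = pd.DataFrame({
    "text": [f"post {i}" for i in range(50)],
    "label": [1] * 10 + [0] * 40,
})

# stratify=... asks for the same class proportions in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    toy["text"], toy["label"],
    test_size=0.2, random_state=42, stratify=toy["label"],
)

print(y_tr.mean(), y_va.mean())
```

With a plain random split, a small validation set can end up with noticeably more or fewer positive examples than the train set, which makes the validation metrics harder to interpret.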

The baseline model we are going to build takes numbers as input. So, we will have to transform/preprocess the data to achieve the numeric conversion needed here. 

## Data preprocessing

Let's first remove the digits from the conversation texts, which is a pretty common text preprocessing step. Then, let's vectorize the conversation texts, which is just another way of saying that we find a good numerical representation of the texts. We will use count-vectorization, which converts the collection of conversation texts to a matrix of token (words in this case) counts. 

A simple reason behind going for count-based preprocessing is that we are interested in finding out the presence of specific words that denote that a conversation text is patient tagged. My hypothesis is that the order of the words can be ignored. 

In [47]:
from string import digits

def remove_digits(s: str) -> str:
    # Translation table that deletes every character in '0123456789'
    digit_table = str.maketrans('', '', digits)
    return s.translate(digit_table)

### Removing the digits

In [48]:
X_train = X_train.apply(remove_digits)
X_valid = X_valid.apply(remove_digits)

### Count-vectorizing the text

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words=None, lowercase=True,
                             ngram_range=(1, 1), min_df=2, max_df=0.4, binary=True)

train_features = vectorizer.fit_transform(X_train)
train_labels = y_train

valid_features = vectorizer.transform(X_valid)
valid_labels = y_valid

<b>A note on the hyperparameter value choices </b>-
* `stop_words=None` - Means that we are instructing the CountVectorizer to not filter out the stop words. 
* `lowercase=True` - Converts all characters to lowercase before tokenizing.
* `ngram_range=(1,1)` - Specifies the lower and upper bounds on the size of the n-grams to be extracted. (1,1) means we are considering unigrams only. 
* `min_df=2` - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold (here, terms that appear in fewer than 2 documents). 
* `max_df=0.4` - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. A float is interpreted as a proportion of documents and an integer as an absolute count, so here terms that appear in more than 40% of the documents are ignored. 
* `binary=True` - This means that all the non-zero counts will be set to 1. This is useful for our baseline, which only looks at whether a word is present or not. 
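
The effect of these settings can be seen on a tiny corpus. The three documents below are made up for illustration, and `max_df` is left at its default because a 40% cut-off combined with `min_df=2` would leave no terms at all in a three-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, not taken from the dataset.
docs = [
    "my heart hurts and my doctor agrees",
    "my doctor said rest",
    "heart study published",
]

vec = CountVectorizer(lowercase=True, ngram_range=(1, 1), min_df=2, binary=True)
X = vec.fit_transform(docs)

# Only terms that occur in at least two documents survive min_df=2.
vocab = sorted(vec.vocabulary_)
print(vocab)

# binary=True: "my" occurs twice in the first document but is stored as 1.
print(X.toarray())
```

Each row of the matrix is one document and each column one vocabulary term (in alphabetical order), with a 1 wherever the term is present.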

## Building the baseline

We are going to use a **Bernoulli Naive Bayes** model. The Bernoulli variant of the Naive Bayes model fits here because our features are binary: each one records whether a word is present in a text or not. (The task also happens to be a binary classification task, but that is not what the "Bernoulli" refers to.) To build the model, we need a corpus consisting of documents. We have already built this corpus in the previous step. The documents are nothing but the conversation texts, and the feature matrix denotes if a certain word is present in a particular document or not. 

So, the problem statement can now be stated more mathematically (in terms of conditional probability) - 
> What is the probability of a class given a document? 

It can also be expressed as - 
$P(c|d)$

In our case, the class (or label) is `Patient_Tag` and the documents are the conversation text (`TRANS_CONV_TEXT`) as mentioned before. 
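
Spelled out with Bayes' rule and the "naive" assumption that words occur independently given the class, this probability takes the following form. The symbols $V$ (the vocabulary), $t$ (a term) and $x_t$ (1 if term $t$ occurs in document $d$, 0 otherwise) are notation introduced here for illustration.

$$P(c|d) = \frac{P(c)\,P(d|c)}{P(d)}, \qquad P(d|c) = \prod_{t \in V} P(t|c)^{x_t}\,\bigl(1 - P(t|c)\bigr)^{1 - x_t}$$

Since $P(d)$ is the same for every class, the predicted class is the one that maximizes $P(c)\,P(d|c)$. Note that the absence of a word ($x_t = 0$) also contributes to the product, which is what distinguishes the Bernoulli variant from the multinomial one.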

In [41]:
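from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, accuracy_score

model = BernoulliNB(fit_prior=True)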
model.fit(train_features, train_labels)

valid_preds = model.predict(valid_features)
print(classification_report(valid_labels, valid_preds))
print(f'Accuracy:{accuracy_score(valid_labels, valid_preds)}')

             precision    recall  f1-score   support

          0       0.96      0.91      0.93       179
          1       0.73      0.89      0.80        53

avg / total       0.91      0.90      0.90       232

Accuracy:0.9008620689655172


(Setting `fit_prior=True` means that the model will learn the class priors as well.)

We can see that the model's precision for the positive class (0.73) is noticeably lower than for the negative class (0.96). So there is scope for improvement in that regard. 
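
As a reminder of what the per-class numbers in the report mean, here is a minimal sketch that computes precision and recall for the positive class. The labels and predictions are hypothetical and are not taken from the validation set.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predictions (1 = patient, 0 = not a patient).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Precision: of the posts predicted as patient, how many really are?
precision = precision_score(y_true, y_pred, pos_label=1)

# Recall: of the real patient posts, how many did the model find?
recall = recall_score(y_true, y_pred, pos_label=1)

print(precision, recall)
```

A low precision for class 1 therefore means that a fair share of the posts flagged as patient conversations were actually written by someone else.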

We can now train our baseline using both the train and validation set and then use it to make predictions on the actual test set. 

In [39]:
# Work on an explicit copy so that the assignment below does not
# trigger pandas' SettingWithCopyWarning
new_train_frame = new_train_frame.copy()
new_train_frame['TRANS_CONV_TEXT'] = new_train_frame['TRANS_CONV_TEXT'].apply(remove_digits)

# Same settings as in the validation run. binary=True is kept for consistency;
# BernoulliNB also binarizes its input by default (binarize=0.0)
vectorizer = CountVectorizer(stop_words=None, lowercase=True,
                             ngram_range=(1, 1), min_df=2, max_df=0.4, binary=True)

features = vectorizer.fit_transform(new_train_frame['TRANS_CONV_TEXT'])
labels = new_train_frame['Patient_Tag']

model = BernoulliNB(fit_prior=True)
model.fit(features, labels)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## Making predictions and preparing the submission file

We will have to make sure that the same preprocessing steps are applied on the test set as well.

In [33]:
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
new_test_frame = new_test_frame.copy()
new_test_frame['TRANS_CONV_TEXT'] = new_test_frame['TRANS_CONV_TEXT'].apply(remove_digits)

# Reuse the vectorizer fitted on the train set (transform only, no refitting)
test_features = vectorizer.transform(new_test_frame['TRANS_CONV_TEXT'])


In [34]:
test_preds = model.predict(test_features)
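
One way to make sure that train and test texts always go through identical preprocessing is to bundle the steps into a single scikit-learn `Pipeline`. This is a sketch of an alternative and not the approach used above. The helper `strip_digits_and_lower` and the toy texts are hypothetical, and `min_df`/`max_df` are left out because the toy corpus is too small for them.

```python
from string import digits

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

_DIGIT_TABLE = str.maketrans("", "", digits)

def strip_digits_and_lower(text: str) -> str:
    # A custom preprocessor replaces CountVectorizer's built-in
    # lowercasing step, so lowercase explicitly here.
    return text.translate(_DIGIT_TABLE).lower()

pipeline = Pipeline([
    ("vectorize", CountVectorizer(preprocessor=strip_digits_and_lower, binary=True)),
    ("classify", BernoulliNB(fit_prior=True)),
])

# Hypothetical toy data (1 = patient, 0 = not a patient).
texts = [
    "I was diagnosed with heart failure in 2015",
    "My cardiologist changed my dose to 40 mg",
    "New study links 3 drugs to heart failure risk",
    "Company reports 12 percent growth in device sales",
]
labels = [1, 1, 0, 0]

pipeline.fit(texts, labels)

# The same digit removal and vectorization are applied automatically here.
preds = pipeline.predict(["I take 20 mg for my heart"])
vocab = pipeline.named_steps["vectorize"].vocabulary_
print(preds, len(vocab))
```

With this design, `pipeline.predict` can be called directly on raw test texts, so there is no separate preprocessing step to forget.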

In [35]:
test_for_submission = pd.read_csv('../dataset/test.csv', encoding = "ISO-8859-1")

In [37]:
submission = pd.DataFrame()
submission['Index'] = test_for_submission['Index']
submission['Patient_Tag'] = test_preds

submission.to_csv('submission.csv',index=False)

In [38]:
!head -5 submission.csv

Index,Patient_Tag
1,0
2,1
3,0
4,1


## Further considerations - 

* Trying TF-IDF based vectorization, as count-vectorization suffers from a problem: the common words that occur with similar frequencies in all documents (i.e., words that are not particularly unique to the text samples in the dataset) are not penalized. For example, words like “the” will occur very frequently in all texts. So a higher token count for “the” than for other, more meaningful words is not very useful.

* Trying out other models that work on n-gram features, like logistic regression, shallow neural networks and so on. We are not going for sequence models as we are not considering the order of the words here. 
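
Both ideas can be sketched together as follows. This is only a starting point on made-up toy texts. The choices of `ngram_range=(1, 2)` and `class_weight="balanced"` are assumptions for illustration and have not been evaluated on this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy texts (1 = patient, 0 = not a patient); "the" occurs in all of them.
texts = [
    "the cardiologist put me on a new medication",
    "the swelling in my legs got worse this week",
    "the study reports new results on heart failure",
    "the company launched a device for hospitals",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(texts, labels)

# Inverse document frequency: a term found in every document gets
# a lower weight than a term found in only one.
tfidf = clf.named_steps["tfidfvectorizer"]
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]
idf_rare = tfidf.idf_[tfidf.vocabulary_["cardiologist"]]
print(idf_the, idf_rare)

# Class probabilities for a new, unseen text.
probs = clf.predict_proba(["my doctor adjusted my heart medication"])
print(probs)
```

Whether this improves on the Bernoulli Naive Bayes baseline would have to be checked on the validation set.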

### References - 

* scikit-learn official documentation - https://scikit-learn.org
* Google Developer Guides on Machine Learning - https://developers.google.com/machine-learning/guides
* How to Clean Text for Machine Learning with Python - https://machinelearningmastery.com/clean-text-machine-learning-python/