![B&D](http://www.avenir-it.fr/wp-content/uploads/2015/10/BD-Logo-groupe.jpg)

# Demo text-mining: Pharma case

In this demo, I will walk through the basic steps that you will need in most text-mining cases. These are also some of the steps used in the ResuMe app that Giulia just showed you. The case we cover here is a simplified version of a project actually carried out by Radia, where the goal was to identify whether a given paper is about Pharmacovigilance or not. Pharmacovigilance is the field of healthcare concerned with drug safety. We would therefore like to predict, from the text of a scientific article, whether the article is about Pharmacovigilance.

For this we can use any kind of model, but in every case we first have to transform the words into numbers in some way. We'll look at different methods and compare their performance.

## Downloading the dataset from PubMed

In [1]:
#import documents from PubMed
from Bio import Entrez

# Function to search for a certain number articles based on a certain keyword
def search(keyword,number=20):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax=str(number),
                            retmode='xml', 
                            term=keyword)
    results = Entrez.read(handle)
    return results

# Function to retrieve the results of previous search query
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results


### Retrieving top 200 articles with Pharmacovigilance keyword

In [2]:
results = search('Pharmacovigilance', 200) #querying PubMed
id_list = results['IdList']
papers_pharmacov = fetch_details(id_list) #retrieving the info about the articles in nested lists & dictionary format

In [3]:
# checking article title for the first 10 retrieved articles
for i, paper in enumerate(papers_pharmacov['PubmedArticle'][:10]):
    print("%d) %s" % (i+1,paper['MedlineCitation']['Article']['ArticleTitle']))

1) FarmaREL: An Italian pharmacovigilance project to monitor and evaluate adverse drug reactions in haematologic patients.
2) Feasibility and Educational Value of a Student-Run Pharmacovigilance Programme: A Prospective Cohort Study.
3) Developing a Crowdsourcing Approach and Tool for Pharmacovigilance Education Material Delivery.
4) Promoting and Protecting Public Health: How the European Union Pharmacovigilance System Works.
5) Effect of an educational intervention on knowledge and attitude regarding pharmacovigilance and consumer pharmacovigilance among community pharmacists in Lalitpur district, Nepal.
6) Pharmacovigilance and Biomedical Informatics: A Model for Future Development.
7) Pharmacovigilance in Europe: Place of the Pharmacovigilance Risk Assessment Committee (PRAC) in organisation and decisional processes.
8) Tamoxifen Pharmacovigilance: Implications for Safe Use in the Future.
9) Pharmacovigilance Skills, Knowledge and Attitudes in our Future Doctors - A Nationwide Stud

### Retrieving top 1,000 articles with Pharma keyword
This will be our basis of comparison: these are the articles we want to separate the Pharmacovigilance articles from.

In [4]:
results = search('Pharma', 1000) #querying PubMed
id_list = results['IdList']
papers_pharma = fetch_details(id_list)#retrieving the info about the articles in nested lists & dictionary format

In [5]:
# checking article title for the first 10 retrieved articles
for i, paper in enumerate(papers_pharma['PubmedArticle'][:10]):
    print("%d) %s" % (i+1,paper['MedlineCitation']['Article']['ArticleTitle']))

1) Recent trends in specialty pharma business model.
2) The moderating role of absorptive capacity and the differential effects of acquisitions and alliances on Big Pharma firms' innovation performance.
3) Space-related pharma-motifs for fast search of protein binding motifs and polypharmacological targets.
4) Pharma Websites and "Professionals-Only" Information: The Implications for Patient Trust and Autonomy.
5) BRIC Health Systems and Big Pharma: A Challenge for Health Policy and Management.
6) Developing Deep Learning Applications for Life Science and Pharma Industry.
7) Exzellenz in der Bildung für eine innovative Schweiz: Die Position des Wirtschaftsdachverbandes Chemie Pharma Biotech.
8) Shaking Up Biotech/Pharma: Can Cues Be Taken from the Tech Industry?
9) Pharma-Nutritional Properties of Olive Oil Phenols. Transfer of New Findings to Human Nutrition.
10) Pharma Success in Product Development—Does Biotechnology Change the Paradigm in Product Development and Attrition.


### Saving ID's, labels and title + abstracts of the articles

An article retrieved via the Pharmacovigilance keyword receives the label 1; all other articles receive the label 0. For each article, we concatenate the title and the abstract to form its text data.

In [6]:
# Save IDs and labels (1 = pharmacovigilance, 0 = not pharmacovigilance)
# and save title + abstract (first abstract section only) in a list
ids = []
labels = []
data = []
for i, paper in enumerate(papers_pharmacov['PubmedArticle']):
    if 'Abstract' in paper['MedlineCitation']['Article'].keys(): #check that abstract info is available
        ids.append(str(paper['MedlineCitation']['PMID']))
        labels.append(1)
        title = paper['MedlineCitation']['Article']['ArticleTitle'] #Article title
        abstract = paper['MedlineCitation']['Article']['Abstract']['AbstractText'][0] #Abstract
        data.append( title + abstract )
for i, paper in enumerate(papers_pharma['PubmedArticle']):
    if 'Abstract' in paper['MedlineCitation']['Article'].keys(): #check that abstract info is available
        ids.append(str(paper['MedlineCitation']['PMID']))
        labels.append(0)
        title = paper['MedlineCitation']['Article']['ArticleTitle'] #Article title
        abstract = paper['MedlineCitation']['Article']['Abstract']['AbstractText'][0] #Abstract
        data.append( title + abstract )
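
The cell above keeps only `AbstractText[0]` and joins title and abstract without a separating space. As a possible variant, here is a minimal sketch of a helper that joins every abstract section and inserts a space. The name `build_text` is hypothetical, the record below is a stand-in made of plain dictionaries, and using this helper would change the scores reported later in this notebook.

```python
def build_text(paper):
    """Return 'title abstract' for one record, or None if it has no abstract.

    Assumes the nested layout used above: paper['MedlineCitation']['Article']
    holds 'ArticleTitle' and, optionally, 'Abstract' -> 'AbstractText'
    (a list with one entry per abstract section).
    """
    article = paper['MedlineCitation']['Article']
    if 'Abstract' not in article:
        return None
    title = str(article['ArticleTitle'])
    # Join every section instead of keeping only AbstractText[0]
    abstract = ' '.join(str(part) for part in article['Abstract']['AbstractText'])
    # The space keeps the last title word and the first abstract word apart
    return title + ' ' + abstract


# Stand-in record (plain dicts, not real PubMed data) to show the behaviour
example = {'MedlineCitation': {'Article': {
    'ArticleTitle': 'A title.',
    'Abstract': {'AbstractText': ['Background text.', 'Results text.']}}}}
print(build_text(example))
```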


In [7]:
# Check result for one paper
ids[0] # ID
labels[0] # 1 = pharmacovigilance , 0 =  not pharmacovigilance
data[0] # Title & abstract

'28771763'

1

'FarmaREL: An Italian pharmacovigilance project to monitor and evaluate adverse drug reactions in haematologic patients.Adverse drug reactions (ADRs) reduce patients\' quality of life, increase mortality and morbidity, and have a negative economic impact on healthcare systems. Nevertheless, the importance of ADR reporting is often underestimated. The project "FarmaREL" has been developed to monitor and evaluate ADRs in haematological patients and to increase pharmacovigilance culture among haematology specialists. In 13 haematology units, based in Lombardy, Italy, a dedicated specialist with the task of encouraging ADRs reporting and sensitizing healthcare professionals to pharmacovigilance has been assigned. The ADRs occurring in haematological patients were collected electronically and then analysed with multiple logistic regression. Between January 2009 and December 2011, 887 reports were collected. The number of ADRs was higher in older adults (528; 59%), in male (490; 55%), and in

### Transform to numeric attributes
We will now **transform** the **text into numeric attributes**. For this, we will convert every word to a number, but we first need to **split** the full text into **separate words**. This is done by using a ***Tokenizer***. The tokenizer will split the full text based on a certain pattern you specify. Here we'll take a very basic pattern and take any words that contain only upper- or lowercase letters and we will convert everything to lowercase.

In [8]:
from nltk.tokenize.regexp import RegexpTokenizer #import a tokenizer, to split the full text into separate words

def Tokenize_text_value(value):
    tokenizer1 = RegexpTokenizer(r"[A-Za-z]+")  # our self-defined tokenizer: keeps runs of letters only
    value = value.lower()  # convert all words to lowercase
    return tokenizer1.tokenize(value)  # tokenize each text
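
To see what this pattern keeps and drops, here is a small sketch using only the standard `re` module. It assumes that, for this simple pattern, `re.findall` gives the same tokens as the `RegexpTokenizer` above; the function name `tokenize_with_re` and the sample sentence are made up for illustration.

```python
import re

def tokenize_with_re(text):
    # Lowercase first, then keep every maximal run of ASCII letters;
    # digits, punctuation and hyphens act as separators and are dropped
    return re.findall(r"[A-Za-z]+", text.lower())

print(tokenize_with_re("Patients' quality of life: 528 (59%) in non-Hodgkin lymphoma"))
```

Numbers such as `528` and `59%` disappear entirely, and `non-Hodgkin` becomes two tokens.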

In [9]:
# example of our tokenizer
Tokenize_text_value(data[0])

['farmarel',
 'an',
 'italian',
 'pharmacovigilance',
 'project',
 'to',
 'monitor',
 'and',
 'evaluate',
 'adverse',
 'drug',
 'reactions',
 'in',
 'haematologic',
 'patients',
 'adverse',
 'drug',
 'reactions',
 'adrs',
 'reduce',
 'patients',
 'quality',
 'of',
 'life',
 'increase',
 'mortality',
 'and',
 'morbidity',
 'and',
 'have',
 'a',
 'negative',
 'economic',
 'impact',
 'on',
 'healthcare',
 'systems',
 'nevertheless',
 'the',
 'importance',
 'of',
 'adr',
 'reporting',
 'is',
 'often',
 'underestimated',
 'the',
 'project',
 'farmarel',
 'has',
 'been',
 'developed',
 'to',
 'monitor',
 'and',
 'evaluate',
 'adrs',
 'in',
 'haematological',
 'patients',
 'and',
 'to',
 'increase',
 'pharmacovigilance',
 'culture',
 'among',
 'haematology',
 'specialists',
 'in',
 'haematology',
 'units',
 'based',
 'in',
 'lombardy',
 'italy',
 'a',
 'dedicated',
 'specialist',
 'with',
 'the',
 'task',
 'of',
 'encouraging',
 'adrs',
 'reporting',
 'and',
 'sensitizing',
 'healthcare',
 'p

Using the ***bag-of-words*** method we can transform any document to a vector. Using this method you have **one column per word and one row per document** and either a binary value 1 if the word is present in a certain document, 0 if not or a count value of the number of times the word appears in the document. 

For instance, the following three sentences:
1. Intelligent applications creates intelligent business processes
2. Bots are intelligent applications
3. I do business intelligence

Can be represented in the following matrix using the counts of each word as values in the matrix
![matrix](http://www.darrinbishop.com/wp-content/uploads/2017/10/Document-Term-Matrix.png)
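
The same idea can be sketched in a few lines of plain Python for the three sentences above. This is only an illustration of the bag-of-words principle, assuming a simple lowercase-and-split tokenization; it is not how the vectorizers used below are implemented.

```python
from collections import Counter

sentences = [
    "Intelligent applications creates intelligent business processes",
    "Bots are intelligent applications",
    "I do business intelligence",
]

# Tokenize: lowercase and split on whitespace (enough for these sentences)
tokenized = [s.lower().split() for s in sentences]

# One column per distinct word, in alphabetical order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per sentence; each cell is the number of times the word occurs
word_counts = [Counter(tokens) for tokens in tokenized]
count_rows = [[counts[word] for word in vocabulary] for counts in word_counts]

# Binary variant: 1 if the word occurs at all, 0 otherwise
binary_rows = [[int(c > 0) for c in row] for row in count_rows]

print(vocabulary)
for row in count_rows:
    print(row)
```

The only cell that differs between the two variants is *intelligent* in the first sentence: 2 in the count matrix, 1 in the binary matrix.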

In [10]:
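# transform non-processed data to numeric features.
# Note: TfidfVectorizer L2-normalizes every row by default, and use_idf defaults to True,
# so the "binary" matrix holds IDF-weighted 0/1 term presence and the "count" matrix
# holds normalized term counts rather than raw integers.
from sklearn.feature_extraction.text import TfidfVectorizer
binary_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', binary = True,
                                           tokenizer=Tokenize_text_value)  # initialize the binary vectorizer
count_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=False, 
                                           tokenizer=Tokenize_text_value)  # initialize the count vectorizer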

binary_matrix = binary_vectorizer.fit_transform(data)  # fit & transform
count_matrix = count_vectorizer.fit_transform(data)  # fit & transform

In [11]:
# Check our output matrix shape: rows = documents, columns = words
binary_matrix.shape

(743, 10098)

### Check performance in a basic model
We'll now apply a model to our two matrices. For this we will use the ***Naive Bayes model***, which (as the name suggests) is based on Bayes' theorem. It is widely used in text-mining because it is **fast** to train and apply and can **handle a lot of features**, which is typical in text-mining when you have one column per word. We will use the ***kappa*** measure (Cohen's kappa) to evaluate model performance. Kappa corrects for the agreement expected by chance, which makes it more informative than accuracy when the classes are imbalanced. It ranges from -1 to +1, with 0 corresponding to chance-level performance and +1 to perfect performance.

In [12]:
# apply cross validation Naive Bayes model
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (cohen_kappa_score, make_scorer)

NB = MultinomialNB() # our Naive Bayes Model initialisation
scorer = make_scorer(cohen_kappa_score) # Our kappa score
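
To make the kappa measure less of a black box, here is a hand-written sketch of Cohen's kappa on made-up labels, compared with plain accuracy. The function `cohen_kappa` is written for illustration only; the notebook itself relies on `cohen_kappa_score` from scikit-learn.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    # Observed agreement: fraction of items where both labelings match
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes
    p_chance = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    return (p_observed - p_chance) / (1 - p_chance)


# Toy labels (made up): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_zero = [0] * 10                       # ignores the minority class
one_miss = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]    # finds one of the two positives

for name, y_pred in [("always_zero", always_zero), ("one_miss", one_miss)]:
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(name, "accuracy=%.2f kappa=%.2f" % (accuracy, cohen_kappa(y_true, y_pred)))
```

Predicting the majority class everywhere reaches 80% accuracy on these labels but a kappa of 0, which is why kappa is the more honest score on a dataset like ours, where the two classes have very different sizes.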

In [13]:
scores = cross_val_score(NB,binary_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Binary matrix with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Binary matrix with a mean kappa score of 0.210382 and variance of 0.003118


In [14]:
scores = cross_val_score(NB,count_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Count matrix with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Count matrix with a mean kappa score of 0.064193 and variance of 0.000682


### TF-IDF transformation

An alternative to the binary and count matrix is the **tf-idf transformation**. It stands for ***Term Frequency - Inverse Document Frequency*** and is a measure that tries to find the words that characterize a document compared to the other documents. This is achieved by taking the term frequency (the same count we defined before) and multiplying it by the inverse document frequency (which is low when the term appears in most documents and high when it appears in only a few):

![TF-IDF](https://chrisalbon.com/images/machine_learning_flashcards/TF-IDF_print.png)
*Copyright © Chris Albon, 2018*
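
As a rough illustration, the weighting can be computed by hand on three made-up documents. The exact formula varies between references; this sketch assumes the smoothed variant documented for scikit-learn with `smooth_idf=True`, namely idf = ln((1 + n) / (1 + df)) + 1, and leaves out the row normalization that `TfidfVectorizer` applies afterwards.

```python
import math
from collections import Counter

# Three toy documents (made up), already tokenized
docs = [
    ["drug", "safety", "report"],
    ["drug", "market", "drug"],
    ["drug", "trial"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed variant, see lead-in)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Term frequency (raw count) times IDF, before any row normalization
    return {term: count * idf(term) for term, count in Counter(doc).items()}

for doc in docs:
    print(tfidf(doc))
```

*drug* appears in every document, so its IDF is the minimum value of 1, while *safety*, which appears in a single document, gets a higher weight.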

In [15]:
# transform non-processed data to numeric features:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=True, smooth_idf = True ,
                                           tokenizer=Tokenize_text_value)  # initialize the tf-idf vectorizer

tfidf_matrix = tfidf_vectorizer.fit_transform(data)  # fit & transform

In [16]:
scores = cross_val_score(NB,tfidf_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on TF-IDF matrix with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on TF-IDF matrix with a mean kappa score of 0.320332 and variance of 0.003960


### How to improve this score? 
Why does TF-IDF work best, followed by the binary matrix, with the count matrix far behind? Let's have a look at the words that occur most often in the different documents:

In [18]:
import numpy as np
# Find the word with the highest value (the most frequent word) in each document of the count_matrix
max_counts_per_doc = np.asarray(np.argmax(count_matrix,axis = 1)).ravel()
# Count how many times each word is the most frequent word across all documents
unique, counts = np.unique(max_counts_per_doc,return_counts=True)
# Keep only the words that are the most frequent word of more than 5 different documents
frequent = unique[counts > 5]

In [20]:
# Retrieve the vocabulary of our count matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vocab = count_vectorizer.get_feature_names_out()
# print out the words in frequent
for i in frequent:
    print(vocab[i])

a
and
for
in
of
the
to
with


As you can see, none of these words adds any value: they mostly link other words together in sentences and carry no meaning on their own. These are what we call ***stop words***. Knowing that, we can build an intuition for why the tf-idf and binary transformations worked better than the count one. In the count matrix, words that appear a lot but carry no information get a high weight, whereas in the binary matrix every word gets the same weight, and in tf-idf the words that appear in many documents are automatically given a lower weight thanks to the IDF part. To avoid this problem, we usually remove stop words.

### Removing Stop words

In [21]:
# Remove the stop words
binary_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', binary = True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english')  # initialize the binary vectorizer
count_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=False, smooth_idf=False,
                                           tokenizer=Tokenize_text_value, stop_words = 'english')  # initialize the count vectorizer
tfidf_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=True, smooth_idf=True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english')  # initialize the tf-idf vectorizer
binary_matrix = binary_vectorizer.fit_transform(data)  # fit & transform
count_matrix = count_vectorizer.fit_transform(data)  # fit & transform
tfidf_matrix = tfidf_vectorizer.fit_transform(data)  # fit & transform
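
What `stop_words = 'english'` does can be pictured with a tiny hand-made filter. The stop list below is hypothetical and contains only the eight words found above; the built-in English list in scikit-learn is much longer. The tokens are taken from the first abstract shown earlier.

```python
from collections import Counter

# Tiny illustrative stop list (hypothetical, not the scikit-learn list)
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to", "with"}

tokens = ("the importance of adr reporting is often underestimated "
          "the project has been developed to monitor and evaluate adrs "
          "in haematological patients and to increase pharmacovigilance culture").split()

# Drop every token that is in the stop list
kept = [t for t in tokens if t not in STOP_WORDS]

print("before:", len(tokens), "tokens, most common:", Counter(tokens).most_common(3))
print("after: ", len(kept), "tokens")
print(kept)
```

Even with this short list, almost a third of the tokens disappear, and the ones that remain are the content-bearing words.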

In [22]:
scores = cross_val_score(NB,binary_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Binary matrix by removing stop-words with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Binary matrix by removing stop-words with a mean kappa score of 0.472072 and variance of 0.001247


In [23]:
scores = cross_val_score(NB,count_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Count matrix by removing stop-words with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Count matrix by removing stop-words with a mean kappa score of 0.753020 and variance of 0.009759


In [24]:
scores = cross_val_score(NB,tfidf_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on TF-IDF matrix by removing stop-words with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on TF-IDF matrix by removing stop-words with a mean kappa score of 0.682766 and variance of 0.011562


Removing the stop words gives a big improvement in performance. How can we go a step further? The following steps are mostly domain dependent: you have to think about your problem and what you would need to solve it. In this case we use only the titles and abstracts. If we had to do the classification by hand, we would look at the most common keywords in the articles about Pharmacovigilance and, for each new article to classify, check whether those same keywords appear. However, here we are analyzing all words (minus the stop words), not only the keywords. So we can try to keep only the words that appear in at least a certain share of the documents.

### Keeping only key-words

In [25]:
# keep only words that appear at least in 5% of the documents:
binary_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', binary = True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                        , min_df = 0.05)  # initialize the binary vectorizer
count_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=False, smooth_idf=False,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                       , min_df = 0.05)  # initialize the count vectorizer
tfidf_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=True, smooth_idf=True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                       , min_df = 0.05)  # initialize the tf-idf vectorizer
binary_matrix = binary_vectorizer.fit_transform(data)  # fit & transform
count_matrix = count_vectorizer.fit_transform(data)  # fit & transform
tfidf_matrix = tfidf_vectorizer.fit_transform(data)  # fit & transform
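
The effect of `min_df` can be sketched by hand on a few made-up documents: count in how many documents each word appears, then keep the words whose share reaches the threshold. The value 0.5 is only chosen to suit this tiny example.

```python
from collections import Counter

# Four toy documents (made up), already tokenized
docs = [
    ["adverse", "drug", "reaction"],
    ["drug", "market", "company"],
    ["drug", "adverse", "event"],
    ["company", "drug", "drug"],
]
min_df = 0.5  # keep words found in at least 50% of the documents (illustrative value)

# Count each word once per document, however often it occurs inside it
doc_freq = Counter(word for doc in docs for word in set(doc))
kept = sorted(w for w, df in doc_freq.items() if df / len(docs) >= min_df)
print(kept)
```

Note that *drug* counts once for the last document even though it occurs twice in it: the threshold is about the number of documents, not the number of occurrences.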

In [26]:
scores = cross_val_score(NB,binary_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Binary matrix by keeping only keywords appearing in at least 5%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Binary matrix by keeping only keywords appearing in at least 5% of the documents with a mean kappa score of 0.876158 and variance of 0.003158


In [27]:
scores = cross_val_score(NB,count_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Count matrix by keeping only keywords appearing in at least 5%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Count matrix by keeping only keywords appearing in at least 5% of the documents with a mean kappa score of 0.951631 and variance of 0.001133


In [28]:
scores = cross_val_score(NB,tfidf_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on TF-IDF matrix by keeping only keywords appearing in at least 5%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on TF-IDF matrix by keeping only keywords appearing in at least 5% of the documents with a mean kappa score of 0.916734 and variance of 0.001633


### Final improvements
This step gives a big improvement as well. We can go even further with some extra fine-tuning. Let's have a look at the final keywords:

In [29]:
tfidf_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['active',
 'activities',
 'activity',
 'administration',
 'adrs',
 'adverse',
 'aim',
 'analysis',
 'anti',
 'approach',
 'approaches',
 'approved',
 'article',
 'assess',
 'assessment',
 'associated',
 'available',
 'based',
 'big',
 'care',
 'case',
 'challenges',
 'chronic',
 'clinical',
 'clinics',
 'companies',
 'company',
 'compared',
 'conducted',
 'control',
 'controlled',
 'current',
 'currently',
 'd',
 'data',
 'database',
 'design',
 'detection',
 'developed',
 'developing',
 'development',
 'diabetes',
 'different',
 'discovery',
 'disease',
 'diseases',
 'dose',
 'drug',
 'drugs',
 'effect',
 'effective',
 'effects',
 'efficacy',
 'european',
 'evaluate',
 'evaluation',
 'events',
 'evidence',
 'factors',
 'following',
 'future',
 'global',
 'health',
 'healthcare',
 'high',
 'human',
 'identify',
 'impact',
 'important',
 'improve',
 'including',
 'increase',
 'increased',
 'industry',
 'information',
 'inhibitor',
 'international',
 'issues',
 'key',
 'knowledge',
 'kn

We can see that some words refer to the same thing: *report, reported, reporting, reports* all derive from the same word *report* and should therefore be grouped together. This can be done by ***stemming***.

### Stemming
Stemming is a technique that reduces words to a common base form by chopping off the end of the word: a trailing -s is removed, -ing is removed, -ed is removed, ...

In [30]:
# Define a stemmer that will preprocess the text before transforming it
from nltk.stem.porter import PorterStemmer
def preprocess(value):
    stemmer = PorterStemmer()
    # split into tokens, stem each token and join the stems back into one string
    return ' '.join([stemmer.stem(i) for i in Tokenize_text_value(value)])
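
To make the suffix-chopping idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm used by `PorterStemmer` above, which applies many more rules; the function name `naive_stem` and its minimum-length rule are assumptions made for this illustration.

```python
def naive_stem(word):
    # Try the longer suffixes first
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words such as 'red' survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["report", "reported", "reporting", "reports"]:
    print(word, "->", naive_stem(word))
```

All four variants end up as *report*, so they would share a single column in the document-term matrix.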

In [31]:
# Have a look at what it gives on the first article
print(' '.join([i for i in Tokenize_text_value(data[0]) ])) # original
print('\n')
print(preprocess(data[0])) #stemmed

farmarel an italian pharmacovigilance project to monitor and evaluate adverse drug reactions in haematologic patients adverse drug reactions adrs reduce patients quality of life increase mortality and morbidity and have a negative economic impact on healthcare systems nevertheless the importance of adr reporting is often underestimated the project farmarel has been developed to monitor and evaluate adrs in haematological patients and to increase pharmacovigilance culture among haematology specialists in haematology units based in lombardy italy a dedicated specialist with the task of encouraging adrs reporting and sensitizing healthcare professionals to pharmacovigilance has been assigned the adrs occurring in haematological patients were collected electronically and then analysed with multiple logistic regression between january and december reports were collected the number of adrs was higher in older adults in male and in non hodgkin lymphoma patients most reactions were severe requ

In [32]:
# Preprocess the documents by stemming the words
binary_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', binary = True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                        , min_df = 0.05, preprocessor = preprocess)  # initialize the binary vectorizer
count_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=False, smooth_idf=False,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                       , min_df = 0.05, preprocessor = preprocess)  # initialize the count vectorizer
tfidf_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=True, smooth_idf=True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                       , min_df = 0.05, preprocessor = preprocess)  # initialize the tf-idf vectorizer
binary_matrix = binary_vectorizer.fit_transform(data)  # fit & transform
count_matrix = count_vectorizer.fit_transform(data)  # fit & transform
tfidf_matrix = tfidf_vectorizer.fit_transform(data)  # fit & transform

In [33]:
scores = cross_val_score(NB,binary_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Binary matrix by stemming and keeping only keywords appearing in at least 5%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Binary matrix by stemming and keeping only keywords appearing in at least 5% of the documents with a mean kappa score of 0.868418 and variance of 0.000270


In [34]:
scores = cross_val_score(NB,count_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Count matrix by stemming and keeping only keywords appearing in at least 5%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Count matrix by stemming and keeping only keywords appearing in at least 5% of the documents with a mean kappa score of 0.944600 and variance of 0.001127


In [35]:
scores = cross_val_score(NB,tfidf_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on TF-IDF matrix by stemming and keeping only keywords appearing in at least 5%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on TF-IDF matrix by stemming and keeping only keywords appearing in at least 5% of the documents with a mean kappa score of 0.901758 and variance of 0.001610


In [36]:
# Check the stemmed final vocabulary
tfidf_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['activ',
 'addit',
 'administr',
 'adr',
 'advanc',
 'advers',
 'agent',
 'aim',
 'analysi',
 'anti',
 'applic',
 'approach',
 'approv',
 'articl',
 'assess',
 'associ',
 'author',
 'avail',
 'base',
 'becom',
 'benefit',
 'big',
 'biolog',
 'care',
 'case',
 'caus',
 'challeng',
 'chang',
 'chronic',
 'clinic',
 'collabor',
 'combin',
 'commerci',
 'commun',
 'compani',
 'compar',
 'compound',
 'concern',
 'conduct',
 'consid',
 'control',
 'cost',
 'current',
 'd',
 'data',
 'databas',
 'demonstr',
 'describ',
 'design',
 'detect',
 'determin',
 'develop',
 'diabet',
 'differ',
 'discoveri',
 'discuss',
 'diseas',
 'dose',
 'drug',
 'dure',
 'effect',
 'efficaci',
 'emerg',
 'establish',
 'european',
 'evalu',
 'event',
 'evid',
 'exist',
 'factor',
 'follow',
 'formul',
 'function',
 'futur',
 'gener',
 'global',
 'group',
 'ha',
 'health',
 'healthcar',
 'help',
 'high',
 'howev',
 'human',
 'identifi',
 'impact',
 'implement',
 'import',
 'improv',
 'includ',
 'increas',
 'indic'

We can see that the performance slightly decreases with stemming. This is probably because, with the same 5% threshold, we now keep more words than before: words with different endings used to be counted separately, and now that they are grouped together under one stem, more of them reach the threshold. To correct for this, we should raise the 5% threshold.
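
The following sketch illustrates this effect on made-up documents: each variant of *report* is too rare to pass the threshold on its own, but the merged stem passes it. The stem mapping `stem_of` is written by hand for the example and stands in for a real stemmer.

```python
from collections import Counter

# Four toy documents (made up) and a hand-written stem mapping (hypothetical)
docs = [
    ["report", "drug"],
    ["reported", "drug"],
    ["reporting", "trial"],
    ["reports", "trial"],
]
stem_of = {"report": "report", "reported": "report",
           "reporting": "report", "reports": "report"}

def vocabulary(documents, min_df):
    # Keep the words whose share of documents is at least min_df
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return sorted(w for w, df in doc_freq.items() if df / len(documents) >= min_df)

# Replace each word by its stem; words without an entry stay unchanged
stemmed_docs = [[stem_of.get(w, w) for w in doc] for doc in docs]

print(vocabulary(docs, min_df=0.5))          # each variant is in only 25% of the documents
print(vocabulary(stemmed_docs, min_df=0.5))  # the merged stem is in all of them
```

With the same threshold, the stemmed vocabulary is larger, which matches the longer keyword list observed above.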

In [37]:
# Preprocess the documents by stemming the words and keeping only words that appear in at least 10% of the documents:
binary_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', binary = True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                        , min_df = 0.1, preprocessor = preprocess)  # initialize the binary vectorizer
count_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=False, smooth_idf=False,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                       , min_df = 0.1, preprocessor = preprocess)  # initialize the count vectorizer
tfidf_vectorizer = TfidfVectorizer(input=u'content', analyzer=u'word', use_idf=True, smooth_idf=True,
                                           tokenizer=Tokenize_text_value, stop_words = 'english'
                                       , min_df = 0.1, preprocessor = preprocess)  # initialize the tf-idf vectorizer
binary_matrix = binary_vectorizer.fit_transform(data)  # fit & transform
count_matrix = count_vectorizer.fit_transform(data)  # fit & transform
tfidf_matrix = tfidf_vectorizer.fit_transform(data)  # fit & transform

In [38]:
scores = cross_val_score(NB,binary_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Binary matrix by stemming and keeping only keywords appearing in at least 10%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Binary matrix by stemming and keeping only keywords appearing in at least 10% of the documents with a mean kappa score of 0.893254 and variance of 0.000962


In [39]:
scores = cross_val_score(NB,count_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on Count matrix by stemming and keeping only keywords appearing in at least 10%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on Count matrix by stemming and keeping only keywords appearing in at least 10% of the documents with a mean kappa score of 0.951618 and variance of 0.001134


In [40]:
scores = cross_val_score(NB,tfidf_matrix,labels,scoring = scorer,cv = 5 )
print('Cross validation on TF-IDF matrix by stemming and keeping only keywords appearing in at least 10%% of the documents with a mean kappa score of %f and variance of %f' % (scores.mean(),scores.var()))

Cross validation on TF-IDF matrix by stemming and keeping only keywords appearing in at least 10% of the documents with a mean kappa score of 0.937502 and variance of 0.000565


In [41]:
# Check the stemmed final vocabulary
tfidf_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['activ',
 'advers',
 'aim',
 'analysi',
 'approach',
 'assess',
 'associ',
 'base',
 'case',
 'challeng',
 'clinic',
 'compani',
 'compar',
 'control',
 'current',
 'data',
 'develop',
 'discuss',
 'diseas',
 'drug',
 'effect',
 'efficaci',
 'evalu',
 'gener',
 'ha',
 'health',
 'high',
 'howev',
 'identifi',
 'import',
 'improv',
 'includ',
 'increas',
 'industri',
 'inform',
 'investig',
 'market',
 'medic',
 'medicin',
 'method',
 'need',
 'new',
 'patient',
 'pharma',
 'pharmaceut',
 'pharmacovigil',
 'phase',
 'potenti',
 'practic',
 'present',
 'product',
 'provid',
 'reaction',
 'relat',
 'report',
 'research',
 'result',
 'review',
 'risk',
 's',
 'safeti',
 'studi',
 'therapi',
 'thi',
 'time',
 'treatment',
 'trial',
 'use',
 'wa',
 'year']

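With the 10% threshold, stemming now performs as well as or slightly better than the unstemmed 5% version: the count matrix stays at a kappa of about 0.95, while the binary matrix rises from 0.876 to 0.893 and TF-IDF from 0.917 to 0.938. Note that the vocabulary still contains stems such as *ha*, *thi* and *wa*: the text is stemmed before the stop-word filter is applied, so the stemmed forms of *has*, *this* and *was* no longer match the stop-word list.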