# Text mining of the PubMed database

**Goal**

To recommend the most appropriate journals for a researcher to publish in, based on the title and keywords of their article.

**Dataset**

The dataset was obtained by scraping records from the PubMed database.

It contains the following columns:

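1. Title of the paper (`title`)
2. Authors of the paper (`Authors`)
3. Name of the journal in which it was published (`Journal`)
4. Keywords of the article (`Keywords`)
5. Full citation (`Full_citation`)
6. Year of publication (`year`)
7. PubMed identifier (`PMID`)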

**Importing the libraries**

In [1]:
!pip install wordcloud

# Data handling and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import plotly.express as px
import seaborn as sns
from wordcloud import STOPWORDS, WordCloud

# Language processing
import nltk
import warnings
warnings.filterwarnings('ignore')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Target encoding, train/test split, and text vectorisation
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier



[nltk_data] Downloading package punkt to /home/ibab/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/ibab/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ibab/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Reading the dataset**

In [2]:
df = pd.read_csv('cancers.csv')
df.head()

| Unnamed: 0 | Authors | title | Full_citation | Journal | year | PMID | Keywords |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-... | Global cancer statistics, 2012... | CA Cancer J Clin. 2015 Mar;65(2):87-108. doi: ... | CA Cancer J Clin | 2015 | PMID: 25651787 | cancer, epidemiology, healthdisparities, incid... |
| 1 | Morrison AH, Byrne KT, Vonderheide RH. | Immunotherapy and Prevention o... | Trends Cancer. 2018 Jun;4(6):418-428. doi: 10.... | Trends Cancer | 2018 | PMID: 29860986 | immunotherapy, pancreaticcancer, preventionvac... |
| 2 | Goral V. | Pancreatic Cancer: Pathogenesi... | Asian Pac J Cancer Prev. 2015;16(14):5619-24. ... | Asian Pac J Cancer Prev | 2015 | PMID: 26320426 | |
| 3 | Wang JJ, Lei KF, Han F. | Tumor microenvironment: recent... | Eur Rev Med Pharmacol Sci. 2018 Jun;22(12):385... | Eur Rev Med Pharmacol Sci | 2018 | PMID: 29949179 | |
| 4 | Torre LA, Siegel RL, Ward EM, Jemal A. | Global Cancer Incidence and Mo... | Cancer Epidemiol Biomarkers Prev. 2016 Jan;25(... | Cancer Epidemiol Biomarkers Prev | 2016 | PMID: 26667886 | |
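The preview suggests that some records carry no keywords, which limits how much the keyword feature can contribute. A quick count makes this concrete. The sketch below runs on a hypothetical three-row CSV built in memory: the rows are made up, only the column names `title`, `Journal`, and `Keywords` come from the preview, and it assumes that missing keywords appear either as empty fields or as the literal string `None`.

```python
import io
import pandas as pd

# Hypothetical stand-in for cancers.csv (made-up rows, real column names)
csv_text = """title,Journal,Keywords
paper a,Journal X,"alpha, beta"
paper b,Journal Y,
paper c,Journal X,None
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Treat both NaN and the literal string "None" as missing keywords
keywords = toy["Keywords"].fillna("None")
missing = keywords.str.lower().eq("none")

n_missing = int(missing.sum())
share_missing = n_missing / len(toy)
print(n_missing, round(share_missing, 2))
```

Running the same two lines on `df['Keywords']` would give the share of articles for which only the title is informative.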


From these columns, `title` and `Keywords` serve as features and `Journal` as the target.

The journals can be encoded using either one-hot or label-encoding.  
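To make the two options concrete, here is a minimal sketch on three hypothetical rows (the journal names are taken from the preview, but the toy series itself is an assumption). Label encoding yields one integer per row, while one-hot encoding yields one column per journal. scikit-learn classifiers generally expect a one-dimensional target for single-label problems, which is likely why label encoding is the more convenient choice here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target column
journals = pd.Series(
    ["Trends Cancer", "CA Cancer J Clin", "Trends Cancer"], name="Journal"
)

# Label encoding: one integer per journal, assigned in sorted order of the names
le = LabelEncoder()
codes = le.fit_transform(journals)

# One-hot encoding: one indicator column per distinct journal
one_hot = pd.get_dummies(journals)

# Label encoding is reversible, which is handy when reporting predictions
decoded = le.inverse_transform(codes)

print(list(codes))     # [1, 0, 1]
print(one_hot.shape)   # (3, 2)
```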

**Title**: Whitespace tokenization and lemmatization, followed by the removal of stop words and punctuation.

**Keywords**: The same steps, applied to the keyword string.
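The order of these steps matters when tokens are split on whitespace only: a token such as `of,` keeps its comma and therefore does not match the stop-word list. The sketch below illustrates this in plain Python. The title, the tiny `STOP` set, and the pass-through `lemmatize` function are all hypothetical stand-ins (for the real titles, wordcloud's `STOPWORDS`, and NLTK's `WordNetLemmatizer`), so treat it as an illustration of ordering rather than of the lemmatizer itself.

```python
import re

# Hypothetical, deliberately tiny stop-word list
STOP = {"and", "of", "the", "in", "for"}

def lemmatize(token):
    # Placeholder for WordNetLemmatizer().lemmatize; returns the token unchanged
    return token

def strip_punct(tokens):
    # Keep only lower-case letters and digits inside each token
    return [re.sub(r"[^a-z0-9]", "", t) for t in tokens]

def preprocess(title, strip_first):
    tokens = title.lower().split()                 # whitespace tokenization
    if strip_first:
        tokens = strip_punct(tokens)               # punctuation removed before filtering
    tokens = [lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP]  # stop-word removal
    if not strip_first:
        tokens = strip_punct(tokens)               # punctuation removed after filtering
    return " ".join(t for t in tokens if t)

title = "Diagnosis of, and Treatment for, Pancreatic Cancer"  # made-up title
late = preprocess(title, strip_first=False)
early = preprocess(title, strip_first=True)
print(late)   # diagnosis of treatment for pancreatic cancer
print(early)  # diagnosis treatment pancreatic cancer
```

Stripping punctuation late leaves `of` and `for` in the text. A later vectorizer with its own stop-word list may still remove them, so the practical impact depends on the rest of the pipeline.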

**Tokenization and Lemmatization of title**

In [3]:
stop = set(STOPWORDS)  # stop-word list shipped with the wordcloud package

In [4]:
df['title']=df['title'].str.lower()
df['title']

0                       global cancer statistics, 2012...
1                       immunotherapy and prevention o...
2                       pancreatic cancer: pathogenesi...
3                       tumor microenvironment: recent...
4                       global cancer incidence and mo...
                              ...                        
9995                    extracellular vesicles and mic...
9996                    dexamethasone modified by gamm...
9997                    evaluating the utility of comp...
9998                    over-expression of long noncod...
9999                    metabolic reprogramming and ca...
Name: title, Length: 10000, dtype: object

In [5]:
# Split on whitespace, then reduce each token to its WordNet lemma
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['title_lemmatized'] = df['title'].str.lower().apply(lemmatize_text)
# Drop stop words from the lemmatized tokens
df['title_words'] = df['title_lemmatized'].apply(lambda x: [item for item in x if item not in stop])

In [6]:
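df['title_words'] = df['title_words'].apply(lambda x: " ".join(x))
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
df['title_words'] = df['title_words'].str.replace(r"[^ A-Za-z0-9]", '', regex=True)
df['title_words']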

0                           global cancer statistics 2012
1              immunotherapy prevention pancreatic cancer
2                pancreatic cancer pathogenesis diagnosis
3       tumor microenvironment recent advance various ...
4       global cancer incidence mortality rate trendsa...
                              ...                        
9995    extracellular vesicle micrornas role tumorigen...
9996    dexamethasone modified gammairradiation novel ...
9997    evaluating utility computed tomography chest g...
9998    overexpression long noncoding rna bancr inhibi...
9999           metabolic reprogramming cancer progression
Name: title_words, Length: 10000, dtype: object

In [7]:
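# Rows without keywords may be read as NaN; fill them so the string methods below do not fail
# (such rows appear as 'none' in the output)
df['keyword_lemmatized'] = df['Keywords'].fillna('None').str.lower().apply(lemmatize_text)
df['keyword_lemmatized'] = df['keyword_lemmatized'].apply(lambda x: [item for item in x if item not in stop])
df['keyword_lemmatized'] = df['keyword_lemmatized'].apply(lambda x: " ".join(x))
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
df['keyword_lemmatized'] = df['keyword_lemmatized'].str.replace(r"[^ A-Za-z0-9]", '', regex=True)
df['keyword_lemmatized']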

0       cancer epidemiology healthdisparities incidenc...
1       immunotherapy pancreaticcancer preventionvaccines
2                                                    none
3                                                    none
4                                                    none
                              ...                        
9995    cancerstemcells exosomes extracellularvesicles...
9996                                                 none
9997                                                 none
9998        bancr bladdercancer lncrnas therapeutictarget
9999                                                 none
Name: keyword_lemmatized, Length: 10000, dtype: object

**Encoding the target**

In [8]:
# Map each journal name to an integer class label
le = LabelEncoder()
df['target'] = le.fit_transform(df['Journal'])

In [9]:
# Work on a 6,000-row subsample; the fixed seed keeps the sample reproducible
df1 = df.sample(6000, random_state=100)

In [10]:
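vectorizer = TfidfVectorizer(stop_words='english')
# Join keywords and title with a space so the last keyword and the first title word stay separate tokens
X_1 = vectorizer.fit_transform(df1['keyword_lemmatized'] + ' ' + df1['title_words'])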
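When two text columns are combined into a single string before vectorization, the separator determines which tokens the vectorizer sees. The sketch below uses a hypothetical one-row frame with the same column names to compare the vocabulary learned with and without a space between the two parts.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical single-row example using the notebook's column names
toy = pd.DataFrame({
    "keyword_lemmatized": ["immunotherapy pancreaticcancer"],
    "title_words": ["immunotherapy prevention"],
})

glued = toy["keyword_lemmatized"] + toy["title_words"]          # no separator
spaced = toy["keyword_lemmatized"] + " " + toy["title_words"]   # space separator

# vocabulary_ maps each learned token to its column index
vocab_glued = set(TfidfVectorizer().fit(glued).vocabulary_)
vocab_spaced = set(TfidfVectorizer().fit(spaced).vocabulary_)

print(sorted(vocab_glued))   # ['immunotherapy', 'pancreaticcancerimmunotherapy', 'prevention']
print(sorted(vocab_spaced))  # ['immunotherapy', 'pancreaticcancer', 'prevention']
```

Without the separator, the last keyword and the first title word fuse into a token that is unlikely to recur in any other document.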

**Model Building**

In [11]:
clf=RandomForestClassifier(n_estimators = 100)

In [12]:
import gc
y = df1['target']
X_train,X_test,y_train,y_test=train_test_split(X_1,y,test_size=0.25,random_state=100)
gc.collect()

51
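In the cells above, the TF-IDF vocabulary and document frequencies are fitted on all sampled rows before the split, so the test rows influence the features. One possible variation, sketched here on a hypothetical toy corpus, splits the raw text first and wraps the vectorizer and classifier in a scikit-learn pipeline so that the vectorizer is fitted on the training rows only. The texts, labels, and the name `model` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the combined keyword and title text
texts = [
    "pancreatic cancer immunotherapy", "pancreatic tumor vaccine",
    "immunotherapy pancreatic prevention", "pancreatic cancer diagnosis",
    "global cancer statistics", "cancer incidence mortality",
    "global incidence trends", "cancer epidemiology statistics",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical encoded journals

# Split the raw text first; stratify keeps both classes in each part
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=100, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the forest
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=100),
)
model.fit(text_train, y_train)

# One row of class probabilities per held-out text
proba = model.predict_proba(text_test)
print(proba.shape)
```

With thousands of journals and few articles per journal, stratifying may not be possible on the real data, so that argument is an assumption tied to the toy labels.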

In [None]:
clf.fit(X_train,y_train)

In [None]:
prob = pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_)
target_cols = prob.columns
target_cols

In [None]:
X_test

In [None]:
preds = prob.to_numpy()  # probability matrix: one row per test article, one column per journal

In [None]:
preds.shape

**Post prediction processing**

In [None]:
# Map each encoded target back to its journal name
d = dict(zip(df['target'], df['Journal']))

# Column positions sorted by descending predicted probability for each test row
final_1 = np.argsort(preds, axis=1)
final_2 = np.fliplr(final_1)

# Keep the five most probable journals per article as one comma-separated string
final_preds = []
for pred in final_2:
    top_picks = target_cols[pred[:5]]
    final_preds.append(", ".join(d.get(i) for i in top_picks))
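The ranking step can be checked on a small example. The sketch below applies the same argsort-and-reverse idea to a hypothetical 2 x 4 probability matrix; the class labels and probabilities are made up, and `k` is set to 2 instead of 5 to keep the output short.

```python
import numpy as np

# Hypothetical encoded journal labels, in the column order of the probability matrix
classes = np.array([3, 7, 9, 12])

# Hypothetical predicted probabilities: one row per article
proba = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.50, 0.04, 0.06, 0.40],
])

k = 2
# argsort is ascending, so reverse each row to get the most probable columns first
order = np.argsort(proba, axis=1)[:, ::-1]

# Translate the top-k column positions into class labels
top_k = classes[order[:, :k]]
print(top_k.tolist())  # [[7, 9], [3, 12]]
```

Indexing through `classes` matters because column positions only coincide with the encoded labels when every label occurs in the training data.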

In [None]:
ori_journal_name = []
for i in y_test:
  ori_journal_name.append(d.get(i))

In [None]:
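# Top-5 accuracy: the true journal must equal one of the five picks.
# Splitting the joined string avoids counting substring matches as hits
# (this assumes no journal name itself contains ", ").
acc = 0
for true_name, picks in zip(ori_journal_name, final_preds):
    if true_name in picks.split(", "):
        acc += 1
print(acc / len(ori_journal_name))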
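When the top picks are stored as one comma-joined string, Python's `in` operator performs a substring test, so a short journal name can match inside a longer one. The sketch below shows the effect on hypothetical values and then computes top-k accuracy directly on encoded labels, which sidesteps string matching altogether. All names and numbers here are illustrative assumptions.

```python
import numpy as np

# Hypothetical joined picks and a true journal that is NOT among them
joined = "Cancer Res, Trends Cancer, CA Cancer J Clin"
true_journal = "Cancer"

substring_hit = true_journal in joined          # matches inside "Cancer Res"
list_hit = true_journal in joined.split(", ")   # exact comparison against each pick
print(substring_hit, list_hit)  # True False

# Top-k accuracy on encoded labels: a hit means the true label is among the k picks
y_true = np.array([7, 3, 9])                    # hypothetical true labels
top_k = np.array([[7, 9], [12, 9], [3, 9]])     # hypothetical top-2 picks per article
hits = (top_k == y_true[:, None]).any(axis=1)
accuracy = hits.mean()
print(hits.tolist())  # [True, False, True]
```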

In [None]:
output = pd.DataFrame()
y_test.index

In [None]:
# y_test.index holds index labels, so select with .loc (label-based), not .iloc (position-based)
output['title'] = df.loc[y_test.index, 'title']
output['keywords'] = df.loc[y_test.index, 'Keywords']
output['Journals'] = final_preds

**Visualisation of the final output**

In [None]:
output.head(20)