<h1 align="center"><strong>Autosynthesis Club</strong></h1>
<h2 align="right">Session 2 – Processing text data </h2>
<h5 align="right">14th Jan 2019 </h5>
<h5 align="right">Vincent </h5>

<hr>

<h1>Introduction</h1>
<img src="Text mining.png" width="400" align="center" title="Text mining"/>

<h2 align="left"> (Data) mining</h2>
<p align="left"> - the practice of examining large pre-existing databases in order to generate new information </p>

<h2 align="left">TEXT (data) to comparable forms </h2>
<p align="left"> One important question is how text data can be used to generate information.</p>

<h2 align="center">“Mathematics is the language in which God has written the universe”</h2>

<img style="float:center" src="Galilei.jpg"/>

<h4 align="center"> -Galileo Galilei </h4>

<h3 align="left"> What kind of ‘basic’ data can you generate from this abstract? (below) </h3>
<img style="float:centre" src="Picture3.png" width=600 title="Abstract_trimmed"/>

<hr>

<h1>Practical 1</h1>
<h2> Convert your results into a .csv file</h2>
1. Open your Excel file and go to the <strong>‘Results’</strong> tab <br>
2. Save it as <strong>CSV UTF-8 (comma delimited)</strong> and name it ‘AutoSession2’ <br>
3. Drag the file into the folder on the Jupyter page <br>

<h2>Load data</h2>

In [2]:
import pandas as pd
import numpy as np

train = pd.read_csv('AutoSession2.csv') #change the file name here if yours is different
train['Abstract'][:5]

0    Objective: Social media is an important pharma...
1    Physicians intuitively apply pattern recogniti...
2    BACKGROUND: The lipid scrambling activity of p...
3    BACKGROUND: Primary care electronic medical re...
4    The aim of this study was to evaluate the work...
Name: Abstract, dtype: object

<h2>Basic statistics from texts</h2>

- Number of words 
- Number of characters
- Average word length
- Number of stopwords (e.g. the, that, is…)
- Number of special characters (e.g. %, -, (, )...)
- Number of numerics
- Number of uppercase words (e.g. RCT, ML, AUROC…)


<h3>Number of words</h3>

In [3]:
train['word_count'] = train['Abstract'].apply(lambda x: len(str(x).split(" "))) #number of words
train[['Abstract','word_count']].head()

Unnamed: 0,Abstract,word_count
0,Objective: Social media is an important pharma...,246
1,Physicians intuitively apply pattern recogniti...,247
2,BACKGROUND: The lipid scrambling activity of p...,138
3,BACKGROUND: Primary care electronic medical re...,272
4,The aim of this study was to evaluate the work...,293
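
Note that split(" ") and split() can disagree: splitting on a single space produces empty strings wherever the text contains consecutive spaces, which inflates the word count. A minimal sketch on a made-up sentence (the sample string is an assumption, not taken from the data set):

```python
# Hypothetical sample with a double space after the colon
sample = "BACKGROUND:  Primary care data"

count_single_space = len(sample.split(" "))  # the empty string between the two spaces is counted
count_whitespace = len(sample.split())       # a run of whitespace is treated as one separator

print(count_single_space, count_whitespace)
```

The average word length cell further down already uses split(), so the two statistics may be based on slightly different word counts.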


<h3>Number of characters</h3>

In [4]:
train['char_count'] = train['Abstract'].str.len() #number of characters
train[['Abstract','char_count']].head()

Unnamed: 0,Abstract,char_count
0,Objective: Social media is an important pharma...,1833
1,Physicians intuitively apply pattern recogniti...,1782
2,BACKGROUND: The lipid scrambling activity of p...,1030
3,BACKGROUND: Primary care electronic medical re...,1830
4,The aim of this study was to evaluate the work...,1909


<h3>Average word length</h3>

In [5]:
def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words)/len(words))

train['avg_word'] = train['Abstract'].apply(lambda x: avg_word(str(x))) #average word length
train[['Abstract','avg_word']].head()

Unnamed: 0,Abstract,avg_word
0,Objective: Social media is an important pharma...,6.455285
1,Physicians intuitively apply pattern recogniti...,6.218623
2,BACKGROUND: The lipid scrambling activity of p...,6.471014
3,BACKGROUND: Primary care electronic medical re...,5.731618
4,The aim of this study was to evaluate the work...,5.518771


<h3>Number of upper case words</h3>

In [57]:
train['upper'] = train['Abstract'].apply(lambda x: len([x for x in str(x).split() if x.isupper()])) #number of uppercase words
train[['Abstract','upper']].head()

Unnamed: 0,Abstract,upper
0,Objective: Social media is an important pharma...,9
1,Physicians intuitively apply pattern recogniti...,0
2,BACKGROUND: The lipid scrambling activity of p...,4
3,BACKGROUND: Primary care electronic medical re...,16
4,The aim of this study was to evaluate the work...,22


<h3>Number of numerics</h3>

In [58]:
train['numerics'] = train['Abstract'].apply(lambda x: len([w for w in str(x).split() if w.isdigit()])) #number of words made up entirely of digits
train[['Abstract','numerics']].head()

Unnamed: 0,Abstract,numerics
0,Objective: Social media is an important pharma...,0
1,Physicians intuitively apply pattern recogniti...,1
2,BACKGROUND: The lipid scrambling activity of p...,0
3,BACKGROUND: Primary care electronic medical re...,1
4,The aim of this study was to evaluate the work...,6
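
str.isdigit() is only true for words made up entirely of digits, so words such as "46," or "2%" or "0.86" are not counted. A hedged sketch of one alternative, which finds every number in the text with a regular expression; the pattern and the sample sentence are illustrative assumptions:

```python
import re

# Hypothetical sample sentence in the style of the abstracts
sample = "A total of 46 patients, aged 75 or above, were searched in 2% intervals (0.86-0.89)."

# Words consisting only of digits, as in the cell above
digit_tokens = [tok for tok in sample.split() if tok.isdigit()]

# Integers or decimals anywhere in the text, even when attached to punctuation
number_pattern = re.compile(r"\d+(?:\.\d+)?")
all_numbers = number_pattern.findall(sample)

print(digit_tokens)
print(all_numbers)
```

Which definition is appropriate depends on whether percentages, ranges and decimals should count as numerics for your analysis.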


<h3>Number of stopwords</h3>

In [59]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')  #list of English stopwords
print (stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /home/linaro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [60]:
train['stopwords']=train['Abstract'].apply(lambda x: len([x for x in str(x).split() if x in stop])) #number of stopwords
train[['Abstract','stopwords']].head()

Unnamed: 0,Abstract,stopwords
0,Objective: Social media is an important pharma...,63
1,Physicians intuitively apply pattern recogniti...,77
2,BACKGROUND: The lipid scrambling activity of p...,45
3,BACKGROUND: Primary care electronic medical re...,99
4,The aim of this study was to evaluate the work...,86
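
The comparison above is case-sensitive and the NLTK stopword list is all lower case, so a sentence-initial "The" is not counted as a stopword. A minimal sketch of a case-insensitive count; the short stopword set is a hypothetical stand-in for the list stop, so that the example runs without downloading NLTK data:

```python
# Hypothetical stand-in for the NLTK list `stop` (a handful of its entries)
stop_sample = {"the", "of", "this", "was", "to", "a"}

sample = "The aim of this study was to evaluate the workflow"

# Exact match against the lower-case list, as in the cell above
case_sensitive = len([w for w in sample.split() if w in stop_sample])

# Lower-case each word before the comparison
case_insensitive = len([w for w in sample.split() if w.lower() in stop_sample])

print(case_sensitive, case_insensitive)
```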


<h3>Number of specific special characters</h3>

In [61]:
train['special_char'] = train['Abstract'].apply(lambda x: len([w for w in str(x).split() if '%' in w])) #number of words containing a specific special character, here %
train[['Abstract','special_char']].head()

Unnamed: 0,Abstract,special_char
0,Objective: Social media is an important pharma...,0
1,Physicians intuitively apply pattern recogniti...,0
2,BACKGROUND: The lipid scrambling activity of p...,0
3,BACKGROUND: Primary care electronic medical re...,0
4,The aim of this study was to evaluate the work...,3
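
The cell above counts the words that contain "%", so a word with two percent signs still counts once. To count the characters themselves, str.count can be applied to the whole text. A small sketch on a made-up sentence (an assumption for illustration):

```python
sample = "Searched in 2% intervals, then 5% intervals (2%-5%)."

words_with_percent = len([w for w in sample.split() if "%" in w])  # words containing '%'
percent_chars = sample.count("%")                                  # every '%' character

print(words_with_percent, percent_chars)
```

On the DataFrame, the same idea could be written as train['Abstract'].str.count('%').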


<h3>Output: Basic statistics of abstracts</h3>

In [62]:
df = pd.DataFrame(data=train)
df.to_excel('AutoSession2_BasicStatistics.xlsx')

<hr>
<h2>Noise</h2>
Text data tend to contain mistakes and unnecessary words. Removing them helps the machine to read the data and reduces the amount of data to be processed.
<h3 align="left"> Can you still ‘know’ what this abstract is about? </h3>
<img style="float:centre" src="Picture4.png" width=600 title="Abstract_trimmed"/>

<hr>

<h1 align="left"> Practical 2</h1>
<h2>Basic pre-processing (cleaning)</h2>

<h3>Remove unnecessary words</h3>

 - Lower casing
 - Punctuation removal
 - Stopwords removal
 - Frequent words removal
 - Rare words removal

<h3>Recover/tidy words to be read</h3>

 - Spelling correction
 - Tokenization
 - Stemming
 - Lemmatization

<img style="float:centre" src="Picture5.png" width=400 title="Wrong spelling1"/>
<br>
<img style="float:centre" src="Picture6.png" width=400 title="Wrong spelling2"/>

<h3>Lower casing</h3>
- change all characters into lower case to avoid 'new' words

In [63]:
train['Abstract_lower_casing'] = train['Abstract'].apply(lambda x: " ".join(x.lower() for x in str(x).split())) #convert all characters into lower case
print (train.loc[3,'Abstract'])
print ('--------------------------------------------------------------------------------')
print (train.loc[3,'Abstract_lower_casing'])

BACKGROUND: Primary care electronic medical record (EMR) data are being used for research, surveillance, and clinical monitoring. To broaden the reach and usability of EMR data, case definitions must be specified to identify and characterize important chronic conditions. The purpose of this study is to identify all case definitions for a set of chronic conditions that have been tested and validated in primary care EMR and EMR-linked data. This work will provide a reference list of case definitions, together with their performance metrics, and will identify gaps where new case definitions are needed. METHODS: We will consider a set of 40 chronic conditions, previously identified as potentially important for surveillance in a review of multimorbidity measures. We will perform a systematic search of the published literature to identify studies that describe case definitions for clinical conditions in EMR data and report the performance of these definitions. We will stratify our search by 

<h3>Punctuation removal</h3>

In [64]:
train['Abstract_punct_remove'] = train['Abstract'].str.replace(r'[^\w\s]', '', regex=True) #remove all special characters (regex=True must be set explicitly from pandas 2.0, where the default became False)
print (train.loc[3,'Abstract'])
print ('--------------------------------------------------------------------------------')
print (train.loc[3,'Abstract_punct_remove'])

BACKGROUND: Primary care electronic medical record (EMR) data are being used for research, surveillance, and clinical monitoring. To broaden the reach and usability of EMR data, case definitions must be specified to identify and characterize important chronic conditions. The purpose of this study is to identify all case definitions for a set of chronic conditions that have been tested and validated in primary care EMR and EMR-linked data. This work will provide a reference list of case definitions, together with their performance metrics, and will identify gaps where new case definitions are needed. METHODS: We will consider a set of 40 chronic conditions, previously identified as potentially important for surveillance in a review of multimorbidity measures. We will perform a systematic search of the published literature to identify studies that describe case definitions for clinical conditions in EMR data and report the performance of these definitions. We will stratify our search by 
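
The pattern [^\w\s] matches any character that is neither a word character nor whitespace. Because each match is replaced by an empty string, hyphenated terms are fused, which is why forms such as "emrlinked" appear in the cleaned abstracts later in this session. A minimal sketch with the standard re module; replacing punctuation with a space is shown as one possible alternative, not as the method used in this session:

```python
import re

sample = "EMR-linked data (EMR), and clinical monitoring."

# Punctuation deleted: the two halves of a hyphenated word are joined
fused = re.sub(r"[^\w\s]", "", sample)

# Punctuation replaced by a space, then repeated whitespace collapsed
spaced = " ".join(re.sub(r"[^\w\s]", " ", sample).split())

print(fused)
print(spaced)
```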

<h3>Stopwords removal</h3>

In [70]:
train['Abstract_stop_remove'] = train['Abstract'].apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))
print (train.loc[4,'Abstract'])
print ('--------------------------------------------------------------------------------')
print (train.loc[4,'Abstract_stop_remove'])

The aim of this study was to evaluate the workflow efficiency of a new automatic coronary-specific reconstruction technique (Smart Phase, GE Healthcare-SP) for selection of the best cardiac phase with least coronary motion when compared with expert manual selection (MS) of best phase in patients with high heart rate. A total of 46 patients with heart rates above 75 bpm who underwent single beat coronary computed tomography angiography (CCTA) were enrolled in this study. CCTA of all subjects were performed on a 256-detector row CT scanner (Revolution CT, GE Healthcare, Waukesha, Wisconsin, US). With the SP technique, the acquired phase range was automatically searched in 2% phase intervals during the reconstruction process to determine the optimal phase for coronary assessment, while for routine expert MS, reconstructions were performed at 5% intervals and a best phase was manually determined. The reconstruction and review times were recorded to measure the workflow efficiency for each 

<h3>Frequent words removal</h3>

In [71]:
freq = pd.Series(' '.join(train['Abstract'].dropna()).split()).value_counts()[:10]  #find the 10 most frequent words in the original abstracts

train['Abstract_clean'] = train['Abstract'].apply(lambda x: " ".join(x.lower() for x in str(x).split())) #convert all characters into lower case
train['Abstract_clean'] = train['Abstract_clean'].str.replace(r'[^\w\s]', '', regex=True) #remove all special characters
train['Abstract_clean'] = train['Abstract_clean'].apply(lambda x: " ".join(x for x in str(x).split() if x not in stop)) #remove stopwords
freq_clean = pd.Series(' '.join(train['Abstract_clean'].dropna()).split()).value_counts()[:10] #find the 10 most frequent words in the cleaned abstracts

print (freq)
print ('--------------------------------------------------------------------------------')
print (freq_clean)

the     491
and     485
of      481
to      282
in      240
a       216
for     160
with    130
The     101
were     84
dtype: int64
--------------------------------------------------------------------------------
data        96
patients    63
review      60
clinical    56
results     52
methods     52
using       45
health      41
use         40
research    40
dtype: int64


In [72]:
freq = list(freq.index)  #remove the 10 most frequent words from the original abstracts
train['Abstract_freq_remove'] = train['Abstract'].apply(lambda x: " ".join(x for x in str(x).split() if x not in freq))

freq_clean = list(freq_clean.index)
train['Abstract_clean_freq_remove'] = train['Abstract_clean'].apply(lambda x: " ".join(x for x in str(x).split() if x not in freq_clean))

print (train.loc[3,'Abstract_freq_remove'])
print ('--------------------------------------------------------------------------------')
print (train.loc[3,'Abstract_clean_freq_remove'])

BACKGROUND: Primary care electronic medical record (EMR) data are being used research, surveillance, clinical monitoring. To broaden reach usability EMR data, case definitions must be specified identify characterize important chronic conditions. purpose this study is identify all case definitions set chronic conditions that have been tested validated primary care EMR EMR-linked data. This work will provide reference list case definitions, together their performance metrics, will identify gaps where new case definitions are needed. METHODS: We will consider set 40 chronic conditions, previously identified as potentially important surveillance review multimorbidity measures. We will perform systematic search published literature identify studies that describe case definitions clinical conditions EMR data report performance these definitions. We will stratify our search by studies that use EMR data alone those that use EMR-linked data. We will compare performance different definitions sam

<h3>Rare words removal</h3>

In [73]:
freq_les = pd.Series(' '.join(train['Abstract'].dropna()).split()).value_counts()[-10:] #find the 10 least frequent words in the original abstracts

#'Abstract_clean' (lower case, no punctuation, no stopwords) was already created in the frequent words cell above
freq_clean_les = pd.Series(' '.join(train['Abstract_clean'].dropna()).split()).value_counts()[-10:] #find the 10 least frequent words in the cleaned abstracts

print (freq_les)
print ('--------------------------------------------------------------------------------')
print (freq_clean_les)

Premiere          1
(CCTA)            1
traditionally     1
portray           1
Efficiency        1
Meta-analysis.    1
make              1
(0.86-0.89).      1
stimulating       1
EMBASE,           1
dtype: int64
--------------------------------------------------------------------------------
medically               1
originates              1
analyzing               1
insufficiency           1
parts                   1
floor                   1
ventilatorassociated    1
panorama                1
hospital                1
effort                  1
dtype: int64


In [74]:
freq_les = list(freq_les.index) #remove least frequent words from original abstracts
train['Abstract_freq_les_remove'] = train['Abstract'].apply(lambda x: " ".join(x for x in str(x).split() if x not in freq_les))

freq_clean_les = list(freq_clean_les.index) #remove least frequent words from cleaned abstracts
train['Abstract_clean_freq_les_remove'] = train['Abstract_clean'].apply(lambda x: " ".join(x for x in str(x).split() if x not in freq_clean_les))
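
Every word in the output above occurs only once, so taking the last ten entries of value_counts() picks ten words out of many that are equally rare. A hedged sketch of an alternative that removes every word below a frequency threshold; the threshold min_count and the toy documents are assumptions for illustration:

```python
from collections import Counter

# Toy cleaned documents (hypothetical)
docs = [
    "emr data case definitions",
    "emr data performance",
    "case definitions panorama",
]

min_count = 2  # hypothetical threshold: keep words seen at least twice overall
counts = Counter(word for doc in docs for word in doc.split())
rare = {word for word, n in counts.items() if n < min_count}

# Rebuild each document without the rare words
docs_no_rare = [" ".join(w for w in doc.split() if w not in rare) for doc in docs]

print(sorted(rare))
print(docs_no_rare)
```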

<h3>Spelling correction</h3>

In [75]:
#pip install -U textblob #install textblob from a terminal or the Anaconda prompt
#python -m textblob.download_corpora

from textblob import TextBlob
tweet = TextBlob("Despite the constant negative press covfefe")
print (tweet.correct())

Despite the constant negative press coffee


<h3>Translation (Google Translate API)</h3>

In [76]:
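translation = TextBlob('Julian is the tallest person in the room.') #translate() was deprecated in TextBlob 0.16.0 and removed in 0.18.0, so this cell needs an older TextBlob version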
translation.translate(to='zh-TW')

TextBlob("朱利安是房間裡最高的人。")

<h3>Stemming</h3>
Removing suffixes, e.g. “ing”, “ly”, “s”, etc.

In [77]:
from nltk.stem import PorterStemmer #Stemming
st = PorterStemmer()
train['Abstract_stem'] = train['Abstract_clean'].apply(lambda x: " ".join([st.stem(word) for word in str(x).split()]))
print (train.loc[3,'Abstract_clean'])
print ('--------------------------------------------------------------------------------')
print (train.loc[3,'Abstract_stem'])

background primary care electronic medical record emr data used research surveillance clinical monitoring broaden reach usability emr data case definitions must specified identify characterize important chronic conditions purpose study identify case definitions set chronic conditions tested validated primary care emr emrlinked data work provide reference list case definitions together performance metrics identify gaps new case definitions needed methods consider set 40 chronic conditions previously identified potentially important surveillance review multimorbidity measures perform systematic search published literature identify studies describe case definitions clinical conditions emr data report performance definitions stratify search studies use emr data alone use emrlinked data compare performance different definitions conditions explore influence data source jurisdiction patient population discussion emr data primary care providers compiled used benefit healthcare system work pote

<h3>Lemmatization</h3>
Converting each word into its root word (lemma)

In [1]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer #lemmatizer from nltk (uses the WordNet corpus downloaded above)
lm = WordNetLemmatizer()

train['Abstract_lemm'] = train['Abstract_clean'].apply(lambda x: " ".join([lm.lemmatize(word) for word in str(x).split()]))
print (train.loc[3,'Abstract_clean'])
print ('--------------------------------------------------------------------------------')
print (train.loc[3,'Abstract_lemm'])

[nltk_data] Downloading package wordnet to /home/linaro/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



<h3>Tokenization</h3>
Dividing the text into a series of words

In [79]:
from sklearn.feature_extraction.text import CountVectorizer #Tokenization ('Bag of words')
vectorizer = CountVectorizer()
Tokenizer = vectorizer.fit_transform(train['Abstract_clean'])
Words = vectorizer.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2
Words = np.asarray(Words)

BoW =np.vstack((Words, Tokenizer.toarray()))
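
By default CountVectorizer lower-cases the text and keeps tokens of two or more word characters (its default token_pattern is (?u)\b\w\w+\b), so single-character tokens are dropped. A rough sketch of that tokenisation rule using only the standard library; it mimics the default pattern and is not scikit-learn's own code:

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
token_pattern = re.compile(r"(?u)\b\w\w+\b")

sample = "A total of 46 patients underwent a CT scan"
tokens = token_pattern.findall(sample.lower())

print(tokens)
```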

<h3>Output: Basic text processing of abstracts</h3>

In [80]:
df = pd.DataFrame(data=train)
df2 = pd.DataFrame(data=Words)
with pd.ExcelWriter('AutoSession2_BasicProcessing.xlsx', engine='xlsxwriter') as writer: #the file is saved when the block ends; writer.save() was removed in pandas 2.0
  df.to_excel(writer, sheet_name= 'BasicProcessing')
  df2.to_excel(writer, sheet_name= 'Tokenization')

<hr>

<h1>Practical 3</h1>
<h2>Advanced Text Processing</h2>

- N-grams
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Bag of Words
- Sentiment Analysis
- Word Embedding

<h3>N-grams</h3>
Combination of N words used together (e.g. [‘Randomised’, ‘controlled’, ‘trial’] as a trigram)

In [81]:
TextBlob(train['Abstract_clean'][0]).ngrams(2) #2-grams

[WordList(['objective', 'social']),
 WordList(['social', 'media']),
 WordList(['media', 'important']),
 WordList(['important', 'pharmacovigilance']),
 WordList(['pharmacovigilance', 'data']),
 WordList(['data', 'source']),
 WordList(['source', 'adverse']),
 WordList(['adverse', 'drug']),
 WordList(['drug', 'reaction']),
 WordList(['reaction', 'adr']),
 WordList(['adr', 'identification']),
 WordList(['identification', 'human']),
 WordList(['human', 'review']),
 WordList(['review', 'social']),
 WordList(['social', 'media']),
 WordList(['media', 'data']),
 WordList(['data', 'infeasible']),
 WordList(['infeasible', 'due']),
 WordList(['due', 'data']),
 WordList(['data', 'quantity']),
 WordList(['quantity', 'thus']),
 WordList(['thus', 'natural']),
 WordList(['natural', 'language']),
 WordList(['language', 'processing']),
 WordList(['processing', 'techniques']),
 WordList(['techniques', 'necessary']),
 WordList(['necessary', 'social']),
 WordList(['social', 'media']),
 WordList(['media', 'i
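
An n-gram list is a sliding window of length N over the tokens. A minimal sketch without TextBlob; the helper name ngrams and the sample text are hypothetical:

```python
def ngrams(text, n):
    """Return the list of n-grams (as tuples) for a whitespace-tokenised text."""
    tokens = text.split()
    # One window per start position; empty list if the text has fewer than n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "randomised controlled trial of statins"
bigrams = ngrams(sample, 2)
trigrams = ngrams(sample, 3)

print(bigrams)
print(trigrams)
```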

<h3>Term Frequency (TF)</h3>
$ TF = \dfrac{\text{number of times the term appears in the text}}{\text{number of words in the text}}$

In [82]:
tf1 = (train['Abstract_clean'][0:50]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index() #raw term counts summed over the abstracts (not divided by the number of words)

tf1.columns = ['words','tf']
tf1.head()

Unnamed: 0,words,tf
0,adr,6.0
1,media,6.0
2,model,29.0
3,social,11.0
4,performance,17.0
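
The cell above sums raw word counts over the abstracts; it does not divide by the number of words, as the formula does. A minimal sketch of the normalised term frequency for a single text, using collections.Counter; the helper name and the sample text are assumptions:

```python
from collections import Counter

def term_frequency(text):
    """TF of each term: occurrences of the term divided by the number of words in the text."""
    words = text.split()
    counts = Counter(words)
    return {term: n / len(words) for term, n in counts.items()}

sample = "social media data social media review"
tf = term_frequency(sample)

print(tf["social"])  # 2 occurrences out of 6 words
```

With this normalisation the TF values of one text add up to 1, which makes long and short abstracts comparable.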


<h3>Inverse Document Frequency (IDF)</h3>
$ IDF = \log(N/n)$ <br>
N = total number of documents (abstracts); n = number of documents that contain the term

In [83]:
for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['Abstract_clean'].str.contains(word, regex=False)]))) #N = number of abstracts; n = abstracts containing the word (substring match)

tf1.head()

Unnamed: 0,words,tf,idf
0,adr,6.0,3.198673
1,media,6.0,2.505526
2,model,29.0,0.800778
3,social,11.0,2.282382
4,performance,17.0,1.589235
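
str.contains(word) matches substrings, so a short term such as "adr" is also found inside longer words, which raises n and lowers the IDF. A hedged sketch contrasting substring matching with whole-word matching on toy documents (the documents and helper names are assumptions):

```python
import math

# Toy documents (hypothetical); 'quadrant' contains the letters 'adr'
docs = [
    "adr identification social media",
    "upper quadrant pain review",
    "clinical data review",
    "health data research",
]

def idf_substring(term, docs):
    """IDF where a document counts if the term occurs anywhere, even inside another word."""
    n = sum(term in doc for doc in docs)  # assumes the term occurs in at least one document
    return math.log(len(docs) / n)

def idf_whole_word(term, docs):
    """IDF where a document counts only if the term is one of its whitespace-separated words."""
    n = sum(term in doc.split() for doc in docs)  # assumes the term occurs in at least one document
    return math.log(len(docs) / n)

print(idf_substring("adr", docs))
print(idf_whole_word("adr", docs))
```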


<h3>Term Frequency-Inverse Document Frequency (TF-IDF)</h3>
$ TF{-}IDF=TF\times IDF$

<h4>Method 1</h4>

In [84]:
for i,word in enumerate(tf1['words']): #IDF, as in the previous cell
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['Abstract_clean'].str.contains(word, regex=False)])))

tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1.head()

Unnamed: 0,words,tf,idf,tfidf
0,adr,6.0,3.198673,19.192039
1,media,6.0,2.505526,15.033156
2,model,29.0,0.800778,23.222557
3,social,11.0,2.282382,25.106206
4,performance,17.0,1.589235,27.016998
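
Putting the two parts together on a toy corpus: a term scores highly for a document when it is frequent in that document and rare across the others. A minimal sketch under the definitions above (natural logarithm, unsmoothed IDF, TF divided by the number of words); scikit-learn's TfidfVectorizer in Method 2 uses a smoothed IDF and normalises each row by default, so its values differ. The documents and the helper name are assumptions:

```python
import math
from collections import Counter

# Toy corpus (hypothetical)
docs = [
    "social media data",
    "clinical data review",
    "social media media review",
]

def tf_idf(term, doc, docs):
    """TF-IDF = (count of term in doc / words in doc) * log(N / n)."""
    words = doc.split()
    tf = Counter(words)[term] / len(words)
    n = sum(term in d.split() for d in docs)  # documents containing the term (assumed > 0)
    return tf * math.log(len(docs) / n)

score_media = tf_idf("media", docs[2], docs)
score_review = tf_idf("review", docs[2], docs)

print(score_media, score_review)
```

Both terms occur in two of the three documents, so they share the same IDF; "media" scores higher in the third document only because it appears there twice.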


<a href="https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting"><h4>Method 2</h4></a>

In [85]:
from sklearn.feature_extraction.text import TfidfVectorizer  #Obtain TF-IDF from sklearn
tfidf = TfidfVectorizer(lowercase=True, analyzer='word', stop_words= 'english', ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['Abstract_clean'])
train_vect

<49x2690 sparse matrix of type '<class 'numpy.float64'>'
	with 5421 stored elements in Compressed Sparse Row format>

from scipy import sparse #view structure of multiple arrays
sparse.save_npz("TF-IDF.npz", train_vect)
b = np.load('TF-IDF.npz')
list(b)
print (b['indices'])
print (b['data'])
print (b['shape'])
print (b['indptr'])
print (b['format'])

<h3>Bag of Words</h3>
Number of times each word occurs within the text data

In [86]:
from sklearn.feature_extraction.text import CountVectorizer  #Bag of words
bow = CountVectorizer(lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(train['Abstract_clean'])

Words = bow.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2
Words = np.asarray(Words)
BoW =np.vstack((Words, train_bow.toarray()))
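
A bag of words is a table with one row per document and one column per vocabulary word, holding the number of occurrences. A minimal standard-library sketch of the kind of table CountVectorizer builds; tokenisation is simplified to whitespace splitting, which is an assumption:

```python
from collections import Counter

# Toy documents (hypothetical)
docs = ["social media data", "clinical data data review"]

# Sorted vocabulary across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row of counts per document, columns in vocabulary order
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is lost in this representation: only the counts per document remain.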

<h3>Sentiment Analysis</h3>
Identifying and categorizing opinions expressed in a piece of text - <a href="https://www.dremio.com/trump-twitter-sentiment-analysis/">Example</a>

<img style="float:center" src="Picture7.png" width=400 title="Sentiment Analysis"/>

<h3>Word Embedding</h3>
Representation of text in the form of vectors, e.g. <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>

<h3>Pairwise similarity</h3>

In [87]:
from sklearn.feature_extraction.text import TfidfVectorizer #calculate similarity using TF-IDF
tfidf_ps = TfidfVectorizer().fit_transform(train['Abstract_clean'])
pairwise_similarity = tfidf_ps @ tfidf_ps.T #matrix product of the row-normalised TF-IDF vectors, i.e. cosine similarity
pairwise_similarity = pairwise_similarity.toarray()
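
The product of the TF-IDF matrix with its transpose gives cosine similarities because TfidfVectorizer scales each row to unit length by default (norm='l2'). A minimal sketch of cosine similarity between count vectors in plain Python; the helper name and the vectors are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = [1, 1, 0, 2]
doc_b = [2, 2, 0, 4]   # same direction as doc_a (same word proportions)
doc_c = [0, 0, 3, 0]   # no words in common with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)

print(sim_ab, sim_ac)
```

A value close to 1 means two abstracts use the same words in similar proportions; 0 means they share no words.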

<h3>Output: Advanced text processing of abstracts</h3>

In [88]:
df = pd.DataFrame(data=tf1)
df1 = pd.DataFrame(train_vect.toarray(), columns=tfidf.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
df2 = pd.DataFrame(data=BoW)
df3 = pd.DataFrame(data=pairwise_similarity)
with pd.ExcelWriter('AutoSession2_Advanced text processing.xlsx', engine='xlsxwriter') as writer: #the file is saved when the block ends; writer.save() was removed in pandas 2.0
  df.to_excel(writer, sheet_name= 'Tf-IDF')
  df1.to_excel(writer, sheet_name= 'Tf-IDF_Sklearn')
  df2.to_excel(writer, sheet_name= 'Bag of Words')
  df3.to_excel(writer, sheet_name= 'pairwise_similarity_Tf-IDF')

<hr>

<h1>Take home messages</h1>

<h2>Text data can be described in mathematical form or by the words themselves</h2>

- Basic statistics, TF-IDF, word embedding
- Bag of words

<h2>These representations lead to different strategies for analysing text data</h2>

<h2>Different external corpora and packages can be used for processing text data</h2>

- Corpora: NLTK (e.g. Brown Corpus)
- Packages: Sklearn, nltk, TextBlob
- The corpora and packages used in data cleaning and processing can significantly affect the quality of the text data and of later analyses

<br>

<img style="float:center" src="Picture8.png" width=800 title="Garbage in, garbage out!"/>

<hr>

<h1>Homework</h1>
<h2>Python tutorials</h2>
<a href="https://github.com/chryswoods/siremol.org/blob/master/chryswoods.com/beginning_python/README.md">UoB ACRC Python 1: Beginning Python</a>
<br>
<a href="https://github.com/chryswoods/siremol.org/blob/master/chryswoods.com/intermediate_python/README.md">UoB ACRC Python 2: Intermediate Python</a>
<br>
<a href="https://www.codecademy.com/learn/learn-python">https://www.codecademy.com/learn/learn-python</a>
<br>
<a href="https://www.tutorialspoint.com/python/index.htm">https://www.tutorialspoint.com/python/index.htm</a>

<h2>Text-mining packages to try out</h2>

- [Sklearn](https://scikit-learn.org/stable/index.html) 
- [nltk](https://www.nltk.org/index.html)
- [TextBlob](https://textblob.readthedocs.io/en/dev/index.html)
- [gensim](https://radimrehurek.com/gensim/)


<h2>Questions and challenges</h2>

- What are the pros and cons of using TF-IDF, bag of words and word embedding to analyse text data?
- How will you utilise your results (decisions and/or terms) to find important/similar references?
- Can you use other information about the references to help the analysis? (e.g. authors or PMID)
- Can you build your own corpora to help your data processing?


<h3>This teaching material is derived from <a href="https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python">Ultimate guide to deal with Text Data (using Python)</a></h3>

<hr>

<h1>Thank you for coming!</h1>

<h3>For any questions, please e-mail:</h3>
<h5 align="center">Kazeem     <br> b.k.olorisade@bristol.ac.uk</h5>
<h5 align="center">Vincent   <br>Vincent.Cheng@bristol.ac.uk</h5>
