# Introduction

Data analysis is based on features, and different data types call for different features. For text mining, we investigate how to represent text data as features. In this section, we'll generate different types of text representations for a set of documents and compare how well they perform.

## How to Run the Module

Throughout this module you will encounter both text and code cells. Please run each cell in this Notebook by clicking the "Run" button in the toolbar or by pressing Shift+Enter.
<br>
![run_cell.png](attachment:run_cell.png)

The cell below is an example of a code cell. You will run numerous code cells like this one throughout the module. Select the cell and click the Run button above.

In [1]:
# This is an example of a code cell
print('Congratulations!')
print('You\'ve run your first code cell.')

Congratulations!
You've run your first code cell.


# Bag of Words Representation

## Video

In [2]:
# Set Up
import warnings
warnings.filterwarnings('ignore')

from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://www.youtube.com/embed/rZjyDTH96hA?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

## Code for Preparing Sample Dataset

First of all, let's create a demo dataset using the techniques learned above. The dataset we'll be using is the Brown corpus, the first million-word electronic corpus of English, created in 1961 at Brown University. The corpus contains text from 500 sources, which have been categorized by genre, such as news, editorial, and so on. NLTK includes samples of documents from the Brown corpus. For a complete list, see http://icame.uib.no/brown/bcm-los.html.

In [3]:
import pandas as pd
import nltk

#Downloading the Brown corpus from nltk.corpus
nltk.download('brown')
from nltk.corpus import brown

#show an example of categories in Brown
print('Categories in Brown Corpus are: ')
brown.categories()

Categories in Brown Corpus are: 


[nltk_data] Downloading package brown to /home/ashv/nltk_data...
[nltk_data]   Package brown is already up-to-date!


['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

NLTK returns each genre as one long sequence of sentences, but for the demo dataset we need several small chunks of text (to save processing time). We therefore heuristically pick several pieces of text from each of the selected categories in the Brown corpus. Here we set the number to 3, which means that we pick three pieces of text from each selected category (genre).

In [4]:
#defining the function that picks chunks of text (three by default).
def pick_sents_brown(genre, number_of_sections=3):
    output_list = []
    sents = brown.sents(categories=genre)
    interval = len(sents)/(2*number_of_sections)
    while interval >= 50:
        interval = interval/2
    interval = int(interval)
    i=0
    j=0
    while j < number_of_sections:
        sents_picked = sents[i*interval:(i*interval+interval)]
        text = ''
        for sent in sents_picked:
            text += ' '.join(sent)
            text += ' '
        output_list.append(text)
        i=i+2
        j=j+1
    return output_list

#defining the function of turning the texts into our ideal dataset format.
#'news','romance','learned' are set to be default values. You can change them 
#to the ones you are interested in as long as they are available in 
#the sample Brown corpus in NLTK
def generate_dataset(categories=['news','romance','learned']):
    output_list = []
    for item in categories:
        sample_list = pick_sents_brown(item)
        for text in sample_list:
            dict_ = {'topic':item, 'abstract': text}
            output_list.append(dict_)
    return output_list

print('Functions have been successfully defined!')

Functions have been successfully defined!
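To see which sentences `pick_sents_brown` ends up selecting, the index arithmetic can be isolated in a small helper. This is only a sketch: `chunk_bounds` is a hypothetical name and the sentence count of 1000 is made up. The logic mirrors the function above (halve the window until it is shorter than 50 sentences, then take every other window).

```python
def chunk_bounds(n_sents, number_of_sections=3):
    """Return the (start, end) sentence indices that pick_sents_brown would slice."""
    interval = n_sents / (2 * number_of_sections)
    # Halve the window size until it drops below 50 sentences
    while interval >= 50:
        interval = interval / 2
    interval = int(interval)
    # Take every other window of `interval` sentences: windows 0, 2, 4, ...
    return [(2 * k * interval, 2 * k * interval + interval)
            for k in range(number_of_sections)]

# n_sents=1000 is a made-up sentence count used only for illustration
bounds = chunk_bounds(1000)
print(bounds)
```

With 1000 sentences the window shrinks from about 166 to 41 sentences, so the three chunks cover sentences 0-40, 82-122, and 164-204.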


We pick "news", "romance", and "learned" as the categories to study this time because they are quite different from one another. You can also try other genres if you are interested.

In [5]:
brown_samples = generate_dataset(['news','romance','learned'])

print("Dataset generated!")

Dataset generated!


As we picked three genres and three chunks of text from each genre, the sample list has 9 items in total.

In [6]:
print('The total number of the texts in the sample list is ' + str(len(brown_samples)) + '. ')
print(' ')
print("Here is the sample data: ")
print(' ')
pd.DataFrame(brown_samples)

The total number of the texts in the sample list is 9. 
 
Here is the sample data: 
 


Unnamed: 0,topic,abstract
0,news,The Fulton County Grand Jury said Friday an in...
1,news,"`` Everything went real smooth '' , the sherif..."
2,news,Calls for extension Other recommendations made...
3,romance,They neither liked nor disliked the Old Man . ...
4,romance,"`` Laura , what would you say if I smoked a pi..."
5,romance,"She said , `` Oh Eugenia , I wish '' `` What '..."
6,learned,1 . Introduction It has recently become practi...
7,learned,The high heat fluxes existing at the electrode...
8,learned,The rest of the surface had a temperature whic...


## How to Clean the Dataset

Then we process the texts using the functions you have seen in Module 1. 

In [7]:
#We define the functions for text preprocessing
from contractions import CONTRACTION_MAP
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

print('Functions have been successfully defined!')

Functions have been successfully defined!


[nltk_data] Downloading package stopwords to /home/ashv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ashv/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


After defining the functions for expanding contractions and removing special characters, we apply them to our texts, remove stopwords, lemmatize the remaining words, and add the resulting list of words to our demo dataset.

In [8]:
#adding separated words into the original sample list
df_samples_list = brown_samples
for item in df_samples_list:
    new_abs = item['abstract']
    new_abs = expand_contractions(new_abs, contraction_mapping=CONTRACTION_MAP)
    new_abs = remove_special_characters(new_abs, remove_digits=True)
    new_abs_words = new_abs.split()
    new_abs_words = [lemmatizer.lemmatize(w).lower() for w in new_abs_words if w.lower() not in stop]
    item['words'] = new_abs_words

pd.DataFrame(df_samples_list)

Unnamed: 0,topic,abstract,words
0,news,The Fulton County Grand Jury said Friday an in...,"[fulton, county, grand, jury, said, friday, in..."
1,news,"`` Everything went real smooth '' , the sherif...","[everything, went, real, smooth, sheriff, said..."
2,news,Calls for extension Other recommendations made...,"[calls, extension, recommendation, made, commi..."
3,romance,They neither liked nor disliked the Old Man . ...,"[neither, liked, disliked, old, man, could, br..."
4,romance,"`` Laura , what would you say if I smoked a pi...","[laura, would, say, smoked, pipe, laura, answe..."
5,romance,"She said , `` Oh Eugenia , I wish '' `` What '...","[said, oh, eugenia, wish, wish, three, wish, m..."
6,learned,1 . Introduction It has recently become practi...,"[introduction, recently, become, practical, us..."
7,learned,The high heat fluxes existing at the electrode...,"[high, heat, flux, existing, electrode, surfac..."
8,learned,The rest of the surface had a temperature whic...,"[rest, surface, temperature, decreased, toward..."
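The effect of the cleaning step is easiest to see on a single sentence. The sketch below copies `remove_special_characters` from above and runs it on a made-up sentence. The tiny `toy_stop` set is an assumption standing in for NLTK's English stopword list, and lemmatization is left out so that the example needs no downloads.

```python
import re

def remove_special_characters(text, remove_digits=False):
    # Same regular expressions as the function defined above
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    return re.sub(pattern, '', text)

# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's list)
toy_stop = {'the', 'a', 'an', 'of', 'in', 'and'}

# Made-up sentence for illustration
raw = "Population-based studies (n = 120) found a 15% drop."
cleaned = remove_special_characters(raw, remove_digits=True)
tokens = [w.lower() for w in cleaned.split() if w.lower() not in toy_stop]
print(tokens)
```

Note that deleting the hyphen fuses "Population-based" into the single token `populationbased`, and that the leftover `n` survives as a token of its own.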


## How to Generate Bag of Words Representations

To make the term-document matrix, we need to find all unique words in our dataset. 

In [9]:
#finding all unique words
all_words = []
for item in df_samples_list:
    new_abs_words =item['words']
    all_words += new_abs_words
all_words_unique = list(set(all_words))

print('There are ' + str(len(all_words_unique)) + ' unique words in our dataset.')

There are 2103 unique words in our dataset.


For each chunk of text and each term, mark '1' if it contains the term and mark '0' if it doesn't. 

In [10]:
#making document term matrix
word_matrix = {}
for word in all_words_unique:
    word_vec = []
    for item in df_samples_list:
        if word in item['words']:
            word_vec += [1]
        else:
            word_vec += [0]
    word_matrix[word] = word_vec

dc_df = pd.DataFrame(word_matrix)
dc_df

Unnamed: 0,face,outlay,swollen,quality,receives,partially,thanks,know,granted,disaster,...,color,material,parsons,complexion,criminal,stocking,sponsor,return,price,age
0,0,0,0,0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,1
1,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,1,0,1,1,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
5,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
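On a corpus this size the matrix is too wide to check by eye, so here is the same marking rule applied to three tiny, made-up documents. This is a minimal sketch in plain Python; the names `toy_docs`, `vocab`, and `binary_rows` are illustrative only.

```python
# Three tiny, made-up "documents", already tokenized
toy_docs = [
    ['jury', 'said', 'election'],
    ['heat', 'anode', 'heat'],
    ['jury', 'heat'],
]

# Sorted vocabulary so the column order is reproducible
vocab = sorted({w for doc in toy_docs for w in doc})

# One row per document, one column per term: 1 if the term occurs, else 0
binary_rows = [[1 if term in doc else 0 for term in vocab] for doc in toy_docs]

print(vocab)
for row in binary_rows:
    print(row)
```

The second document contains "heat" twice but still gets a 1, since this representation records only presence. Sorting the vocabulary is a small design choice: `list(set(...))` can return the terms in a different order from run to run.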


# Vector Space

## Vector and Vector Space

### Introduction to Vectors

The word matrix we built above is a good example of a set of vectors put together: each row is the vector for one document.

In [11]:
from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://www.youtube.com/embed/gvikUgKLb8Q?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

In [12]:
dc_df

Unnamed: 0,face,outlay,swollen,quality,receives,partially,thanks,know,granted,disaster,...,color,material,parsons,complexion,criminal,stocking,sponsor,return,price,age
0,0,0,0,0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,1
1,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,1,0,1,1,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
5,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


Extracting a vector from the whole matrix:

In [13]:
print("The vector for document 0 is: ")
pd.DataFrame(dc_df.loc[0,:]).transpose()

The vector for document 0 is: 


Unnamed: 0,face,outlay,swollen,quality,receives,partially,thanks,know,granted,disaster,...,color,material,parsons,complexion,criminal,stocking,sponsor,return,price,age
0,0,0,0,0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,1


## How to Calculate Distance between Vectors

### Video of Vector Distance/Similarity Calculation using Euclidean Distance

In [14]:
from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://www.youtube.com/embed/a89nBIbygPo?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

### Video of Vector Distance/Similarity Calculation using Cosine Similarity

In [15]:
from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://www.youtube.com/embed/J00IsthX38Y?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

### How to Calculate Distance/Similarity between Vectors

Now let's calculate the similarity between every pair of documents using the matrix we generated above. We'll use cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity) in the following examples, as it is the more commonly used of the two measures in practice. You can also try Euclidean distance as an alternative to see if there is any difference. The following code imports the functions for calculating cosine similarity as well as Euclidean distance.

Note: Euclidean distance measures distance, so the larger the number, the farther apart (less similar) the two vectors are. Cosine similarity, by contrast, measures similarity, so the larger the number (1 is the maximum), the more similar the two vectors are.
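To make the difference concrete, both measures can be written out by hand. The following is a minimal sketch using only the standard library; the helper names `euclidean_distance` and `cosine_sim` and the three vectors are made up for illustration and are not part of scikit-learn.

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: v2 points the same way as v1 but is twice as long
v1 = [1, 1, 0]
v2 = [2, 2, 0]
v3 = [0, 0, 1]

print(euclidean_distance(v1, v2), cosine_sim(v1, v2))
print(euclidean_distance(v1, v3), cosine_sim(v1, v3))
```

`v1` and `v2` have a cosine similarity of (approximately) 1 even though their Euclidean distance is not zero, because cosine similarity looks only at direction and ignores vector length. `v1` and `v3` share no non-zero component, so their cosine similarity is 0.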

Code for Cosine Similarity

In [16]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

cos_matrix = cosine_similarity(dc_df)

pd.DataFrame(cos_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1.0,0.180628,0.176559,0.050542,0.037758,0.098496,0.05906,0.060275,0.077337
1,0.180628,1.0,0.199412,0.062919,0.054236,0.10729,0.058045,0.059403,0.050231
2,0.176559,0.199412,1.0,0.077472,0.081958,0.111518,0.068222,0.059494,0.073797
3,0.050542,0.062919,0.077472,1.0,0.193949,0.13195,0.084565,0.04969,0.058212
4,0.037758,0.054236,0.081958,0.193949,1.0,0.15457,0.059641,0.042833,0.078852
5,0.098496,0.10729,0.111518,0.13195,0.15457,1.0,0.087394,0.053798,0.074279
6,0.05906,0.058045,0.068222,0.084565,0.059641,0.087394,1.0,0.164263,0.157272
7,0.060275,0.059403,0.059494,0.04969,0.042833,0.053798,0.164263,1.0,0.260333
8,0.077337,0.050231,0.073797,0.058212,0.078852,0.074279,0.157272,0.260333,1.0


Code for Euclidean Distance

In [17]:
#sample code for euclidean distance

euc_matrix = euclidean_distances(dc_df)

pd.DataFrame(euc_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.0,25.41653,24.062419,24.020824,25.768197,25.670995,24.718414,23.958297,23.706539
1,25.41653,0.0,24.63737,24.919872,26.532998,26.476405,25.748786,25.019992,25.099801
2,24.062419,24.63737,0.0,23.065125,24.596748,24.939928,24.0,23.345235,23.130067
3,24.020824,24.919872,23.065125,0.0,21.330729,22.93469,21.863211,21.424285,21.283797
4,25.768197,26.532998,24.596748,21.330729,0.0,24.103942,23.853721,23.280893,22.803509
5,25.670995,26.476405,24.939928,22.93469,24.103942,0.0,24.289916,23.979158,23.685439
6,24.718414,25.748786,24.0,21.863211,23.853721,24.289916,0.0,20.952327,21.0
7,23.958297,25.019992,23.345235,21.424285,23.280893,23.979158,20.952327,0.0,18.920888
8,23.706539,25.099801,23.130067,21.283797,22.803509,23.685439,21.0,18.920888,0.0
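For 0/1 vectors these distances have a simple reading: the squared Euclidean distance equals the number of terms that occur in exactly one of the two documents. The distance of 24.0 between documents 2 and 6, for example, corresponds to 24² = 576 such terms. The sketch below checks this identity on two made-up binary vectors.

```python
import math

# Made-up binary document vectors over a six-term vocabulary
doc_a = [1, 0, 1, 1, 0, 0]
doc_b = [1, 1, 0, 1, 0, 1]

# Number of terms that occur in exactly one of the two documents
n_different = sum(1 for x, y in zip(doc_a, doc_b) if x != y)

# Euclidean distance computed the usual way
distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(doc_a, doc_b)))

print(n_different, distance, math.sqrt(n_different))
```

One consequence is that documents with larger vocabularies tend to sit farther from everything else, regardless of topic.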


Next we find the document that has the highest similarity to a selected document. Here we choose document 8 (the last of the nine, since indexing starts at 0) as an example.

In [18]:
#specify the index of your chosen document
chosen_doc = 8
scores = sorted(cos_matrix[chosen_doc],reverse=True) 
score = scores[1]                                     
result_doc = list(cos_matrix[chosen_doc]).index(score)
# note: when using Euclidean distance, change cos_matrix to euc_matrix and set reverse=False,
#       since the smaller the Euclidean distance is, the more similar the two documents are.
       
print('The document that is the most similar with document ' + str(chosen_doc) + ' is ' + 'document ' + str(result_doc) + '.')

The document that is the most similar with document 8 is document 7.


The result shows that document 7 is the most similar to document 8, which makes sense because both are in the same category ("learned"). If you investigate the matrix a little more, you will see that every document is most similar to the documents from its own category.
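The lookup above takes the second-highest score on the assumption that the chosen document itself always ranks first. A slightly more explicit variant skips the chosen index instead. This is a sketch only: `most_similar` is a hypothetical helper, and `row_8` is copied from row 8 of the cosine similarity table shown earlier.

```python
def most_similar(sim_row, chosen_doc):
    """Return the index of the highest score in sim_row, skipping chosen_doc itself."""
    candidates = [(score, idx) for idx, score in enumerate(sim_row)
                  if idx != chosen_doc]
    best_score, best_idx = max(candidates)
    return best_idx

# Row 8 of the cosine similarity table shown earlier
row_8 = [0.077337, 0.050231, 0.073797, 0.058212, 0.078852,
         0.074279, 0.157272, 0.260333, 1.0]
print(most_similar(row_8, chosen_doc=8))
```

For Euclidean distance the same idea applies with `min` in place of `max`.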

# Term Weighting

## Video

In [19]:
from IPython.display import HTML

HTML('<iframe width="800" height="560" src="https://www.youtube.com/embed/-6oW_-QJ1Pc?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

## Code for Different Term Weighting Strategies

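Next we compute the term frequency (TF) matrix. Here, the TF of a term in a document is the number of times the term occurs in the document divided by the total number of words in that document.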

In [20]:
#tf
tf_matrix = {}
tf_ranking ={}
for word in all_words_unique:
    word_vec = []
    for item in df_samples_list:
        if word in item['words']:
            word_vec += [item['words'].count(word)/len(item['words'])]
        else:
            word_vec += [0]
    tf_matrix[word] = word_vec

pd.DataFrame(tf_matrix)

Unnamed: 0,face,outlay,swollen,quality,receives,partially,thanks,know,granted,disaster,...,color,material,parsons,complexion,criminal,stocking,sponsor,return,price,age
0,0.0,0.0,0.0,0.0,0.001792,0.0,0.001792,0.0,0.001792,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001792,0.003584
1,0.001603,0.001603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.001603,0.001603,0.001603,0.003205
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.005515,0.0,0.007353,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006154,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.003077,0.0,0.0,0.0,0.0
4,0.007317,0.0,0.002439,0.002439,0.0,0.0,0.0,0.0,0.0,0.0,...,0.002439,0.0,0.0,0.002439,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001923,0.0,0.001923,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.00335,0.0,0.0,0.0,0.0,...,0.0,0.008375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.005128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.004425,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.006637,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
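Dividing by the document length keeps long and short documents comparable. The following sketch illustrates this on two made-up documents; `term_frequency` is a hypothetical helper that uses `collections.Counter` rather than repeated calls to `list.count`.

```python
from collections import Counter

def term_frequency(words):
    """Relative frequency of each word: count divided by document length."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Two made-up documents with the same mix of words but different lengths
short_doc = ['heat', 'anode']
long_doc = ['heat', 'anode', 'heat', 'anode']

print(term_frequency(short_doc))
print(term_frequency(long_doc))
```

Both documents get the same TF values, although the raw counts in the longer one are twice as high.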


Next we rank the terms in each document by their term frequency (TF) score.

In [21]:
def top_terms(matrix_df, n=10): #input should be a pandas dataframe
    output_dict = {}
    for index, series in matrix_df.iterrows():
        doc_num = 'doc' + str(index)
        scores = dict(series)
        scores_sorted = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}
        terms = scores_sorted.keys()
        terms_topn = list(terms)[:n]
        output_dict[doc_num] = terms_topn
    output_df = pd.DataFrame(output_dict)
    return output_df.transpose()

matrix_df = pd.DataFrame(tf_matrix)
tf_ranking = top_terms(matrix_df, 10)
tf_ranking

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
doc0,jury,said,fulton,city,county,mayor,department,election,state,atlanta
doc1,would,bill,texas,school,said,committee,dallas,house,deaf,bank
doc2,case,court,judge,karns,th,precinct,ward,said,made,statement
doc3,old,man,street,never,one,would,woman,without,wife,thinner
doc4,knew,goat,would,could,pompeii,laura,boy,one,face,finger
doc5,said,cold,grandma,people,would,like,something,eugenia,day,room
doc6,radiation,radio,emission,planet,moon,thermal,observed,wave,intensity,temperature
doc7,anode,arc,heat,transfer,energy,burning,free,gas,condition,cathode
doc8,af,temperature,surface,pressure,anode,tape,block,figure,holder,fluid


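Next we compute the inverse document frequency (IDF) of each term: the natural logarithm of the number of documents divided by the number of documents that contain the term.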

In [22]:
import math

#idf
idf_matrix = {}
for word in word_matrix:
    idf_matrix[word] = math.log(len(df_samples_list)/sum(word_matrix[word]))

print("Inverse Document Frequency Matrix successfully computed!")

Inverse Document Frequency Matrix successfully computed!
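As a quick sanity check of the formula used in the cell above, the IDF can be evaluated for a few document frequencies. This is a minimal sketch; `idf` is a hypothetical helper, and the only value taken from the notebook is the dataset size of 9 documents.

```python
import math

def idf(n_docs, doc_freq):
    # Natural log of (number of documents / documents containing the term)
    return math.log(n_docs / doc_freq)

n_docs = 9  # size of the demo dataset
for doc_freq in (1, 3, 9):
    print(doc_freq, idf(n_docs, doc_freq))
```

A term that occurs in only one of the nine documents gets ln 9 ≈ 2.197, the highest value possible in this dataset, while a term that occurs in all nine documents gets 0.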


Next we rank the terms by their inverse document frequency (IDF) score.

In [23]:
idf_ranking = {k: v for k, v in sorted(idf_matrix.items(), key=lambda item: item[1], reverse=True)}
idf_ranking

{'outlay': 2.1972245773362196,
 'swollen': 2.1972245773362196,
 'quality': 2.1972245773362196,
 'receives': 2.1972245773362196,
 'partially': 2.1972245773362196,
 'thanks': 2.1972245773362196,
 'granted': 2.1972245773362196,
 'disaster': 2.1972245773362196,
 'misuse': 2.1972245773362196,
 'tell': 2.1972245773362196,
 'scholastic': 2.1972245773362196,
 'title': 2.1972245773362196,
 'betting': 2.1972245773362196,
 'dry': 2.1972245773362196,
 'rosary': 2.1972245773362196,
 'calculated': 2.1972245773362196,
 'polar': 2.1972245773362196,
 'delighting': 2.1972245773362196,
 'thermocouple': 2.1972245773362196,
 'occurring': 2.1972245773362196,
 'direct': 2.1972245773362196,
 'dropping': 2.1972245773362196,
 'estimate': 2.1972245773362196,
 'appeased': 2.1972245773362196,
 'illegal': 2.1972245773362196,
 'karns': 2.1972245773362196,
 'reaction': 2.1972245773362196,
 'pearl': 2.1972245773362196,
 'arrival': 2.1972245773362196,
 'fill': 2.1972245773362196,
 'voting': 2.1972245773362196,
 'swayed

In [24]:
#tfidf
tfidf = {}
for word in idf_matrix:
    idf = idf_matrix[word]
    tfidf_vec = tf_matrix[word]
    tfidf[word] = [i * idf for i in tfidf_vec]

pd.DataFrame(tfidf)

Unnamed: 0,face,outlay,swollen,quality,receives,partially,thanks,know,granted,disaster,...,color,material,parsons,complexion,criminal,stocking,sponsor,return,price,age
0,0.0,0.0,0.0,0.0,0.003938,0.0,0.003938,0.0,0.003938,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002695,0.005391
1,0.001761,0.003521,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.003521,0.003521,0.00241,0.004821
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.012117,0.0,0.016156,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009256,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.006761,0.0,0.0,0.0,0.0
4,0.008039,0.0,0.005359,0.005359,0.0,0.0,0.0,0.0,0.0,0.0,...,0.005359,0.0,0.0,0.005359,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002892,0.0,0.004225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.007361,0.0,0.0,0.0,0.0,...,0.0,0.009201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.005634,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.004861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.007292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
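The whole TF-IDF computation fits in a few lines when applied to a toy corpus. The sketch below assumes the same definitions as the cells above (TF as count divided by document length, IDF as the natural log of the number of documents over the document frequency). The three documents and the name `tfidf_scores` are made up for illustration.

```python
import math
from collections import Counter

# Three made-up tokenized documents; 'said' occurs in every one of them
toy_docs = [
    ['said', 'jury', 'jury'],
    ['said', 'anode', 'heat'],
    ['said', 'heat', 'goat'],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf_scores(doc):
    """TF (count / length) times IDF (natural log of n_docs / doc_freq)."""
    counts = Counter(doc)
    return {term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

for doc in toy_docs:
    print(tfidf_scores(doc))
```

Because "said" appears in every toy document, its IDF and therefore its TF-IDF score is 0, whereas "jury", which is frequent in one document and absent from the others, scores highest. This is the effect that pushes widely shared words down the rankings.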


Next we find the terms with the highest TF-IDF scores and show the ranking of the terms in each document according to TF-IDF.

In [25]:
tfidf_df = pd.DataFrame(tfidf)
top_tfidf_terms = top_terms(tfidf_df)

top_tfidf_terms

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
doc0,fulton,jury,mayor,department,atlanta,hartsfield,fund,petition,said,county
doc1,texas,school,dallas,deaf,bank,bill,daniel,austin,district,senate
doc2,case,karns,th,judge,statement,wexler,president,precinct,trial,involved
doc3,old,street,woman,man,thinner,youth,never,wife,away,thought
doc4,goat,pompeii,laura,boy,knew,see,finger,white,hair,mouth
doc5,cold,grandma,eugenia,heater,bed,done,government,said,furnace,depression
doc6,emission,radio,planet,radiation,moon,observed,intensity,cm,length,wave
doc7,anode,arc,transfer,burning,free,cathode,cooling,heat,energy,generator
doc8,af,tape,block,pressure,anode,temperature,fluid,surface,normal,holder


This time we find the most similar document to a chosen document using the TF-IDF matrix. We don't expect much improvement here: because the demo dataset is small and diverse, the results from the simple binary document-term matrix were already pretty good.

In [26]:
dc_df = pd.DataFrame(tfidf)
cos_matrix = cosine_similarity(dc_df)

pd.DataFrame(cos_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1.0,0.081699,0.113802,0.019556,0.005991,0.042135,0.004053,0.005268,0.007927
1,0.081699,1.0,0.065598,0.027825,0.01134,0.049342,0.006487,0.010587,0.005493
2,0.113802,0.065598,1.0,0.010334,0.024031,0.027402,0.019805,0.004503,0.012063
3,0.019556,0.027825,0.010334,1.0,0.112169,0.064476,0.017414,0.006817,0.008516
4,0.005991,0.01134,0.024031,0.112169,1.0,0.056239,0.005516,0.003505,0.018232
5,0.042135,0.049342,0.027402,0.064476,0.056239,1.0,0.012424,0.007385,0.011875
6,0.004053,0.006487,0.019805,0.017414,0.005516,0.012424,1.0,0.02935,0.074294
7,0.005268,0.010587,0.004503,0.006817,0.003505,0.007385,0.02935,1.0,0.234291
8,0.007927,0.005493,0.012063,0.008516,0.018232,0.011875,0.074294,0.234291,1.0


Again, we take document 8 as an example and see the document that is the most similar to document 8.

In [27]:
#specify the index of your chosen document
chosen_doc = 8
scores = sorted(cos_matrix[chosen_doc],reverse=True)
score = scores[1]
result_doc = list(cos_matrix[chosen_doc]).index(score)

print('The document that is the most similar with document ' + str(chosen_doc) + ' is ' + 'document ' + str(result_doc) + '.')

The document that is the most similar with document 8 is document 7.


# Exercise

**Warning**  
<font color="blue" size="4"> 
    Your work will not be saved in Jupyter Notebook. We recommend copying your work to a safe place so that you keep a record of it.
</font>

Here we define a new dataset, based on the medical dataset from Module 1, for you to practice on.

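Tip: When writing code, be careful with variable names. The sample code below uses names ending in `_med` (for example `word_matrix_med`) to keep them apart from the variables of the Brown corpus demo.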

In [28]:
# loading dataset in Module 1.
import csv 

medical_data_list = []
with open('epc-ir_clean_10k.csv') as medical_data:
    medical_data_csv = csv.reader(medical_data)
    for row in medical_data_csv:
        if(row[1] != 'topic'):
            topic = row[1]
            pmid = row[2]
            abstract = row[4]
            words = row[5].split()
            data = {'topic':topic, 'pmid':pmid, 'abstract':abstract}
            medical_data_list.append(data)
            
            
import pandas as pd
df = pd.DataFrame(medical_data_list)
print('Examples in the medical dataset:')
df.head()        
    

Examples in the medical dataset:


Unnamed: 0,topic,pmid,abstract
0,ACEInhibitors,10024335,Hypercholesterolemia and hypertension are freq...
1,ACEInhibitors,10027665,To implement and measure the effects of automa...
2,ACEInhibitors,10027935,In patients with insulin-dependent diabetes me...
3,ACEInhibitors,10028936,Aortic root flow and pressure estimates were o...
4,ACEInhibitors,10029645,Population-based studies have found that black...


We are not going to use all of the data, since such a large amount would take a long time to process. The new dataset will be composed of 10 articles from each topic in the medical dataset. The functions for data cleaning are redefined here for your convenience.

In [29]:
#We define the functions for text preprocessing
from contractions import CONTRACTION_MAP
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

print('Functions have been successfully defined!')

#getting 10 articles from each topic in the medical dataset
current_topic = ''
counter_in_topic = 0
med_samples = []
for index, series in df.iterrows():
    if series['topic'] != current_topic:
        counter_in_topic = 0
        current_topic = series['topic']
    else:
        counter_in_topic += 1
    
    if counter_in_topic < 10:
        new_row = {'topic':current_topic, 'abstract':series['abstract']}
        med_samples.append(new_row)   

for item in med_samples:
    new_abs = item['abstract']
    new_abs = expand_contractions(new_abs, contraction_mapping=CONTRACTION_MAP)
    new_abs = remove_special_characters(new_abs, remove_digits=True)
    new_abs_words = new_abs.split()
    new_abs_words = [lemmatizer.lemmatize(w).lower() for w in new_abs_words if w.lower() not in stop]
    item['words'] = new_abs_words
        
df_samples_list = med_samples
pd.DataFrame(med_samples)

[nltk_data] Downloading package stopwords to /home/ashv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ashv/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Functions have been successfully defined!


Unnamed: 0,topic,abstract,words
0,ACEInhibitors,Hypercholesterolemia and hypertension are freq...,"[hypercholesterolemia, hypertension, frequentl..."
1,ACEInhibitors,To implement and measure the effects of automa...,"[implement, measure, effect, automatic, comput..."
2,ACEInhibitors,In patients with insulin-dependent diabetes me...,"[patient, insulindependent, diabetes, mellitus..."
3,ACEInhibitors,Aortic root flow and pressure estimates were o...,"[aortic, root, flow, pressure, estimate, obtai..."
4,ACEInhibitors,Population-based studies have found that black...,"[populationbased, study, found, black, patient..."
...,...,...,...
85,Opiods,A man with severe inflammatory bowel disease s...,"[man, severe, inflammatory, bowel, disease, su..."
86,Opiods,Several lines of evidence indicate that placeb...,"[several, line, evidence, indicate, placebo, p..."
87,Opiods,To compare the analgesic effects of preoperati...,"[compare, analgesic, effect, preoperative, ora..."
88,Opiods,To compare the efficacy of tramadol and morphi...,"[compare, efficacy, tramadol, morphine, intra,..."


Please follow the steps specified below, using the functions we defined above, to finish the task. The first step, data preprocessing, is given as an example.

In [30]:
#expanding contractions, removing stopwords, and lemmatizing
#then adding separated words into the original sample list
#sample code
for item in med_samples:
    new_abs = item['abstract']
    new_abs = expand_contractions(new_abs, contraction_mapping=CONTRACTION_MAP)
    new_abs = remove_special_characters(new_abs, remove_digits=True)
    new_abs_words = new_abs.split()
    new_abs_words = [lemmatizer.lemmatize(w).lower() for w in new_abs_words if w.lower() not in stop]
    item['words'] = new_abs_words

pd.DataFrame(med_samples)

Unnamed: 0,topic,abstract,words
0,ACEInhibitors,Hypercholesterolemia and hypertension are freq...,"[hypercholesterolemia, hypertension, frequentl..."
1,ACEInhibitors,To implement and measure the effects of automa...,"[implement, measure, effect, automatic, comput..."
2,ACEInhibitors,In patients with insulin-dependent diabetes me...,"[patient, insulindependent, diabetes, mellitus..."
3,ACEInhibitors,Aortic root flow and pressure estimates were o...,"[aortic, root, flow, pressure, estimate, obtai..."
4,ACEInhibitors,Population-based studies have found that black...,"[populationbased, study, found, black, patient..."
...,...,...,...
85,Opiods,A man with severe inflammatory bowel disease s...,"[man, severe, inflammatory, bowel, disease, su..."
86,Opiods,Several lines of evidence indicate that placeb...,"[several, line, evidence, indicate, placebo, p..."
87,Opiods,To compare the analgesic effects of preoperati...,"[compare, analgesic, effect, preoperative, ora..."
88,Opiods,To compare the efficacy of tramadol and morphi...,"[compare, efficacy, tramadol, morphine, intra,..."


In [None]:
#finding all unique words

#your code here

In [31]:
#finding all unique words
#sample code
all_words = []
for item in med_samples:
    new_abs_words =item['words']
    all_words += new_abs_words
all_words_unique_med = list(set(all_words))

print('There are ' + str(len(all_words_unique_med)) + ' unique words in our dataset.')

There are 2193 unique words in our dataset.


In [None]:
#making document term matrix

#your code here

In [32]:
#making document term matrix
#sample code
word_matrix_med = {}
for word in all_words_unique_med:
    word_vec = []
    for item in med_samples:
        if word in item['words']:
            word_vec += [1]
        else:
            word_vec += [0]
    word_matrix_med[word] = word_vec

pd.DataFrame(word_matrix_med)

Unnamed: 0,uncertain,respect,wellbeing,quality,olanzapine,partially,extrapyramidal,demonstrated,correlation,omeprazole,...,ns,focused,muagonist,stsegment,lipid,ibuprofentreated,includes,channel,physicianscientists,age
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
86,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
87,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
88,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
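The nested loop above tests `word in item['words']` against a list, which scans the whole document for every vocabulary word. A possible speed-up, sketched here on made-up toy data (`toy_samples` is hypothetical), is to convert each document to a set once and test membership against that:

```python
# Toy stand-in for med_samples (hypothetical data)
toy_samples = [
    {'words': ['patient', 'opioid', 'pain', 'patient']},
    {'words': ['pain', 'morphine', 'placebo']},
]
vocab = sorted(set(w for item in toy_samples for w in item['words']))

# One set per document, built once, so each membership test is fast
word_sets = [set(item['words']) for item in toy_samples]

# Column per word, row per document: 1 if the word occurs, else 0
toy_word_matrix = {
    word: [int(word in s) for s in word_sets]
    for word in vocab
}

for word, column in toy_word_matrix.items():
    print(word, column)
```

The resulting dictionary has the same shape as `word_matrix_med` and can be passed to `pd.DataFrame` in the same way.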


In [None]:
#Computing the term frequency (TF) matrix (for calculating tf-idf).
#your code here

In [33]:
#Computing the term frequency (TF) matrix (for calculating tf-idf).
#sample code
tf_matrix_med = {}
for word in all_words_unique_med:
    word_vec = []
    for item in med_samples:
        if word in item['words']:
            word_vec += [item['words'].count(word)/len(item['words'])]
        else:
            word_vec += [0]
    tf_matrix_med[word] = word_vec
    
pd.DataFrame(tf_matrix_med)

Unnamed: 0,uncertain,respect,wellbeing,quality,olanzapine,partially,extrapyramidal,demonstrated,correlation,omeprazole,...,ns,focused,muagonist,stsegment,lipid,ibuprofentreated,includes,channel,physicianscientists,age
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.007407,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
87,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
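The TF used here is the number of times a term occurs in a document divided by the document's total number of tokens. As a small illustration, the sketch below computes it for one made-up document with `collections.Counter`; the helper name `term_frequencies` and the toy document are assumptions for illustration only:

```python
from collections import Counter

def term_frequencies(words):
    """Return {term: count / len(words)}; an empty document yields {}."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)  # counts every term in a single pass
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy document: 'patient' appears 2 times out of 4 tokens
toy_doc = ['patient', 'opioid', 'pain', 'patient']
tf = term_frequencies(toy_doc)
print(tf)
```

Counting once per document avoids calling `list.count` separately for every vocabulary word, which can matter when the vocabulary is large.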


In [None]:
#Computing the inverse document frequency (IDF) of each term.

#your code here

In [34]:
#Computing the inverse document frequency (IDF) of each term.
#sample code
import math

#idf
idf_matrix_med = {}
for word in word_matrix_med:
    idf_matrix_med[word] = math.log(len(med_samples)/sum(word_matrix_med[word]))

idf_matrix_med

{'uncertain': 3.8066624897703196,
 'respect': 3.8066624897703196,
 'wellbeing': 4.499809670330265,
 'quality': 4.499809670330265,
 'olanzapine': 4.499809670330265,
 'partially': 4.499809670330265,
 'extrapyramidal': 4.499809670330265,
 'demonstrated': 2.4203681286504293,
 'correlation': 3.1135153092103742,
 'omeprazole': 4.499809670330265,
 'nine': 3.8066624897703196,
 'hyperplasia': 4.499809670330265,
 'inhibitor': 2.1972245773362196,
 'without': 2.302585092994046,
 'three': 3.4011973816621555,
 'prospective': 2.70805020110221,
 'dispersion': 3.4011973816621555,
 'incriminated': 4.499809670330265,
 'endometrium': 4.499809670330265,
 'preparation': 4.499809670330265,
 'dry': 3.8066624897703196,
 'calculated': 4.499809670330265,
 'collection': 4.499809670330265,
 'alpha': 4.499809670330265,
 'converting': 3.4011973816621555,
 'oxygen': 4.499809670330265,
 'occurring': 4.499809670330265,
 'direct': 3.4011973816621555,
 'tcell': 4.499809670330265,
 'furosemide': 4.499809670330265,
 'wheth
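The IDF values printed above are natural logarithms of (number of documents / number of documents containing the term). They appear consistent with a corpus of 90 documents: a term found in a single document gets ln(90/1) ≈ 4.4998, and one found in two documents gets ln(90/2) ≈ 3.8067. The sketch below reproduces a few of these values; `n_docs = 90` is inferred from the output rather than stated in the notebook, and the helper name `idf` is hypothetical:

```python
import math

def idf(n_docs, doc_freq):
    """Natural-log IDF, matching the formula in the sample code."""
    return math.log(n_docs / doc_freq)

n_docs = 90  # assumed: inferred from the printed values, not stated explicitly

# Document frequencies 1, 2, 9 and 10 correspond to values seen in the output
for doc_freq in (1, 2, 9, 10):
    print(doc_freq, idf(n_docs, doc_freq))
```

The rarer a term is across documents, the larger its IDF; a term that appeared in every document would get ln(1) = 0.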

In [None]:
#Computing the tf-idf matrix.

#your code here

In [35]:
#Computing the tf-idf matrix.
#sample code
tfidf_med = {}
for word in idf_matrix_med:
    idf = idf_matrix_med[word]
    tfidf_vec = tf_matrix_med[word]
    tfidf_med[word] = [i * idf for i in tfidf_vec]

pd.DataFrame(tfidf_med)

Unnamed: 0,uncertain,respect,wellbeing,quality,olanzapine,partially,extrapyramidal,demonstrated,correlation,omeprazole,...,ns,focused,muagonist,stsegment,lipid,ibuprofentreated,includes,channel,physicianscientists,age
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.023063,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
87,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
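As a sanity check on one visible cell: for the term `ns` in document 3, the TF table shows 0.007407 and the tf-idf table shows 0.023063. These numbers are consistent with one occurrence in a 135-token document and a document frequency of 4 out of 90, although the tables alone do not confirm those counts. A short sketch of the arithmetic under those assumptions:

```python
import math

# Assumed counts, chosen because they reproduce the printed values
tf_ns_doc3 = 1 / 135          # one occurrence in a 135-token document
idf_ns = math.log(90 / 4)     # term present in 4 of 90 documents

# tf-idf is simply the product of the two
tfidf_ns_doc3 = tf_ns_doc3 * idf_ns

print(round(tf_ns_doc3, 6))
print(round(tfidf_ns_doc3, 6))
```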


In [None]:
#finding the terms that have the highest tf-idf score

#your code here

In [36]:
#finding the terms that have the highest tf-idf score
#sample code
def top_terms(matrix_df, n=10): #input should be a pandas dataframe
    output_dict = {}
    for index, series in matrix_df.iterrows():
        doc_num = 'doc' + str(index)
        scores = dict(series)
        scores_sorted = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}
        terms = scores_sorted.keys()
        terms_topn = list(terms)[:n]
        output_dict[doc_num] = terms_topn
    output_df = pd.DataFrame(output_dict)
    return output_df.transpose()

tfidf_med_df=pd.DataFrame(tfidf_med)

top_tfidf_terms_med = top_terms(tfidf_med_df)
top_tfidf_terms_med

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
doc0,simvastatin,vascular,mmoll,p,enalapril,fvr,structural,reactivity,damage,mfbf
doc1,adrs,implement,als,signal,computerized,automatic,tool,detection,reaction,support
doc2,glomerular,mellitus,diabetes,sizeselectivity,iddm,deteriorates,niddm,develop,macromolecule,progressively
doc3,arterial,pressure,aortic,root,compliance,captopril,flow,total,mean,reduced
doc4,black,failure,factor,congestive,heart,mortality,higher,difference,populationbased,socioeconomic
...,...,...,...,...,...,...,...,...,...,...
doc85,amitriptyline,transdermal,gel,unable,monitored,metabolite,take,serum,abdominal,depression
doc86,respiratory,endogenous,depression,opioid,narcotic,produce,thus,placebo,sideeffect,depressant
doc87,tonsillectomy,preoperative,adenotonsillectomy,intraoperative,undergoing,clonidine,fentanyl,intravenous,child,analgesic
doc88,intra,tramadol,laparoscopic,cholecystectomy,morphine,undergoing,postoperative,analgesia,compare,efficacy
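Note that sorting every score and slicing the first `n` entries will also return zero-score terms whenever a document has fewer than `n` terms with a positive score. A possible per-document alternative, sketched with the standard-library `heapq` module on made-up scores (the helper name `top_terms_for_doc` and the toy values are hypothetical):

```python
import heapq

def top_terms_for_doc(scores, n=10):
    """Return up to n terms with the highest positive score, best first."""
    # Drop terms that do not occur in the document (score 0)
    positive = {term: s for term, s in scores.items() if s > 0}
    # nlargest avoids sorting the full vocabulary when n is small
    return heapq.nlargest(n, positive, key=positive.get)

# Hypothetical tf-idf scores for one document
toy_scores = {'tramadol': 0.12, 'morphine': 0.09, 'efficacy': 0.02, 'aortic': 0.0}

print(top_terms_for_doc(toy_scores, n=2))
print(top_terms_for_doc(toy_scores, n=10))
```

With the real data, `dict(series)` from each row of the tf-idf DataFrame could be passed in as `scores`.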


In [None]:
#calculating the scores of cosine similarity using tfidf

#your code here

In [37]:
#calculating the scores of cosine similarity using tfidf
#sample code
from sklearn.metrics.pairwise import cosine_similarity

dc_df_med = pd.DataFrame(tfidf_med)
cos_matrix_med = cosine_similarity(dc_df_med)

pd.DataFrame(cos_matrix_med)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,80,81,82,83,84,85,86,87,88,89
0,1.000000,0.001666,0.017338,0.088059,0.022137,0.006181,0.054515,0.011560,0.090027,0.045843,...,0.002031,0.000000,0.003927,0.005681,0.003008,0.002542,0.012005,0.002183,0.000000,0.020193
1,0.001666,1.000000,0.000000,0.000000,0.000000,0.034159,0.001935,0.002078,0.000000,0.000000,...,0.000000,0.000000,0.001903,0.002600,0.002960,0.016782,0.001263,0.003477,0.000000,0.018924
2,0.017338,0.000000,1.000000,0.025839,0.001492,0.000776,0.055817,0.001973,0.005850,0.076444,...,0.000000,0.002761,0.000000,0.000000,0.000000,0.007467,0.000000,0.000000,0.003416,0.002954
3,0.088059,0.000000,0.025839,1.000000,0.038918,0.009553,0.066709,0.029389,0.030950,0.041713,...,0.000000,0.000000,0.004029,0.000626,0.003810,0.005833,0.017120,0.000000,0.000000,0.017810
4,0.022137,0.000000,0.001492,0.038918,1.000000,0.018142,0.055490,0.116666,0.000000,0.062493,...,0.000000,0.031614,0.008725,0.017891,0.008338,0.000684,0.009749,0.000000,0.001260,0.011963
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,0.002542,0.016782,0.007467,0.005833,0.000684,0.005664,0.005033,0.008335,0.003326,0.017346,...,0.000000,0.035545,0.000000,0.000000,0.045011,1.000000,0.042072,0.000000,0.001565,0.056870
86,0.012005,0.001263,0.000000,0.017120,0.009749,0.007382,0.013088,0.004128,0.000782,0.021690,...,0.007114,0.005149,0.012531,0.013577,0.031737,0.042072,1.000000,0.001655,0.010236,0.071048
87,0.002183,0.003477,0.000000,0.000000,0.000000,0.000000,0.021410,0.002723,0.000000,0.000000,...,0.102267,0.000000,0.036011,0.096322,0.000000,0.000000,0.001655,1.000000,0.118851,0.059746
88,0.000000,0.000000,0.003416,0.000000,0.001260,0.000655,0.021077,0.001666,0.000000,0.002341,...,0.040526,0.002331,0.195381,0.187257,0.000000,0.001565,0.010236,0.118851,1.000000,0.028868
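Cosine similarity between two document vectors is their dot product divided by the product of their lengths, which is why every document scores 1.0 against itself on the diagonal. A minimal hand-written sketch of that formula follows; the function name `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a choice made here for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty document as similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine([1, 0], [0, 1]))        # no shared terms
print(cosine([1, 1], [1, 0]))        # partial overlap
```

Because the measure depends only on direction, a long abstract and a short one with the same term proportions would still be judged highly similar.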


In [None]:
#Find the document that is the most similar to document 1. Is it under the same topic as document 1 (within the first 10 (0-9) documents)?

#your code here

In [38]:
#Find the document that is the most similar to document 1. Is it under the same topic as document 1 (within the first 10 (0-9) documents)?
#sample code
import numpy as np

chosen_doc = 1
#copy the row so the similarity matrix itself is not modified
sims = np.array(cos_matrix_med[chosen_doc], dtype=float)
#exclude the document itself (its self-similarity is 1.0)
sims[chosen_doc] = -np.inf
result_doc = int(np.argmax(sims))

print('The document that is the most similar to document ' + str(chosen_doc) + ' is ' + 'document ' + str(result_doc) + '.')

The document that is the most similar to document 1 is document 5.
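The second half of the question, whether the nearest document shares the topic, can be checked by comparing topic labels. The sketch below shows the idea on a made-up 3x3 similarity matrix; `toy_topics`, `toy_sim`, and the helper `most_similar` are hypothetical names used only for illustration:

```python
def most_similar(sim_matrix, doc):
    """Index of the document most similar to `doc`, excluding `doc` itself."""
    row = sim_matrix[doc]
    others = [j for j in range(len(row)) if j != doc]
    return max(others, key=lambda j: row[j])

# Hypothetical topics and pairwise similarities for three documents
toy_topics = ['ACEInhibitors', 'ACEInhibitors', 'Opiods']
toy_sim = [
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

chosen = 1
nearest = most_similar(toy_sim, chosen)
same_topic = toy_topics[nearest] == toy_topics[chosen]
print(nearest, same_topic)
```

In the notebook, the analogous check would compare `med_samples[result_doc]['topic']` with `med_samples[chosen_doc]['topic']`, assuming each item still carries the `topic` field shown in the table earlier.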
