First, we import the libraries needed for text processing.

a) <B>NLTK</B> is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries (ref: https://www.nltk.org/). We will use NLTK for: <br>
1. word tokenization <br>
2. stop word removal <br>
3. Porter stemming <br>
4. lemmatization

b) <B>Pandas</B> is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. 
(ref: https://pandas.pydata.org/)
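
As a rough illustration of item a), the sketch below runs NLTK's Porter stemmer on a few words. The sample words are hypothetical and chosen only for illustration; the stemmer needs no extra NLTK data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words, chosen for illustration only.
sample_words = ["deaths", "vessels", "problems"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(word, "->", stem)
```

A stemmer strips suffixes by rule, so its output is not always a dictionary word. Lemmatization, which we use later, looks words up in WordNet instead.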


In [None]:
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('stopwords')

#module to tokenize word and sentence
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#module for word freq
from nltk.probability import FreqDist

#module for stopword
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

#module for stemming
from nltk.stem import PorterStemmer 
ps = PorterStemmer()

#module for lemmatization
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

#module for digits - we use this to remove digits
from string import digits

  



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next, we upload the dataset into the notebook. In this example, we upload multiple files at once. Download the dataset to your own computer before running this code.

In [None]:
from google.colab import files
uploaded = files.upload()


Saving heart1.txt to heart1 (1).txt
Saving heart2.txt to heart2 (1).txt
Saving heart3.txt to heart3 (1).txt
Saving heart4.txt to heart4 (1).txt
Saving heart5.txt to heart5 (1).txt
Saving heart6.txt to heart6 (1).txt
Saving heart7.txt to heart7 (1).txt
Saving heart8.txt to heart8 (1).txt
Saving heart9.txt to heart9 (1).txt
Saving heart10.txt to heart10 (1).txt
Saving heart11.txt to heart11 (1).txt
Saving heart12.txt to heart12 (1).txt
Saving heart13.txt to heart13 (1).txt
Saving heart14.txt to heart14 (1).txt
Saving heart15.txt to heart15 (1).txt


Next, we examine the content of each file. We focus only on the text, so we remove digits and whitespace from each text file.
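
One possible way to remove digits and extra whitespace is sketched below, using the `digits` constant imported earlier. The helper name `strip_digits_and_whitespace` and the sample string are assumptions made for illustration; they are not part of the dataset or the original code.

```python
from string import digits


def strip_digits_and_whitespace(text):
    """Delete every digit, then collapse runs of whitespace into single spaces."""
    # str.maketrans with a third argument builds a table that deletes those characters.
    no_digits = text.translate(str.maketrans("", "", digits))
    # split() with no argument splits on any whitespace and drops empty pieces.
    return " ".join(no_digits.split())


# Hypothetical sample text, for illustration only.
sample = "One in  every 4 deaths\n in 2019 "
cleaned = strip_digits_and_whitespace(sample)
print(cleaned)
```

Applying a helper like this to the file content before tokenizing would keep numeric tokens out of the word counts.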

---



The following code cleans each text file using this procedure:
<br>
1) read the file <br>
2) tokenize each word inside the file <br>
3) remove punctuation <br>
4) set each word to lower case <br>
5) lemmatize each word <br>
6) remove stop words <br>
7) remove whitespace <br>
8) calculate word frequency <br>
9) show the result <br>

In [None]:
lemmatizer = WordNetLemmatizer() 


for filename in uploaded.keys():

  with open(filename, "r") as file:
    FileContent = file.read()
    print("File Name: ",file.name)
    #file content
    #--print("--",FileContent)

    # tokenize before remove punctuation
    #2. split word
    tokenized_word=word_tokenize(FileContent)
    #tokenized_word is raw!
    print(tokenized_word)
    
    #calculate word frequency. We use raw data first!
    fdist = FreqDist(tokenized_word)
    print(fdist)
    print(fdist.most_common(10))
    

    #tokenize after removing punctuation
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    new_words = tokenizer.tokenize(FileContent)
    print("**",new_words)
    #-fdist = FreqDist(new_words)
    #-print(fdist)
    #-print(fdist.most_common(10))

    #remove stopword
    filtered_word=[]
    for w in new_words:
      #set lowercase
      w = w.lower()

      #use lemmatization
     
      w= lemmatizer.lemmatize(w)
    
      if w not in stop_words:
        filtered_word.append(w)
    
    print("Filtered Sentence:",filtered_word) 
    #calculate word frequency. Use the filtered words - we have removed the stop words
    fdist = FreqDist(filtered_word)
    print("*",fdist)
    print("-",fdist.most_common(10))
    #for word in fdist.items:
    #  print (word)

    df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
    df_fdist.columns = ['Frequency']
    df_fdist.index.name = 'Term'

    total = df_fdist['Frequency'].sum()
    print ("------------------",total)

    #relative frequency: each term's count divided by the total number of terms
    df_fdist['Average'] = df_fdist['Frequency'] / total 

    
    print(df_fdist)

    #to download the term frequency values as a CSV file, uncomment the two lines below
    #df_fdist.to_csv("word_freq_cleaned_lemma_"+file.name +"_.csv")
    #files.download("word_freq_cleaned_lemma_"+file.name +"_.csv")





File Name:  heart1.txt
['Heart', 'disease', 'is', 'a', 'term', 'covering', 'any', 'disorder', 'of', 'the', 'heart', '.', 'Unlike', 'cardiovascular', 'disease', ',', 'which', 'describes', 'problems', 'with', 'the', 'blood', 'vessels', 'and', 'circulatory', 'system', 'as', 'well', 'as', 'the', 'heart', ',', 'heart', 'disease', 'refers', 'to', 'issues', 'and', 'deformities', 'in', 'the', 'heart', 'itself', '.', 'According', 'to', 'the', 'Centers', 'for', 'Disease', 'Control', '(', 'CDC', ')', ',', 'heart', 'disease', 'is', 'the', 'leading', 'cause', 'of', 'death', 'in', 'the', 'United', 'Kingdom', ',', 'United', 'States', ',', 'Canada', ',', 'and', 'Australia', '.', 'One', 'in', 'every', 'four', 'deaths', 'in', 'the', 'U.S.', 'occurs', 'as', 'a', 'result', 'of', 'heart', 'disease', '.', 'Fast', 'facts', 'on', 'heart', 'disease', 'One', 'in', 'every', 'four', 'deaths', 'in', 'the', 'U.S.', 'is', 'related', 'to', 'heart', 'disease', '.', 'Coronary', 'heart', 'disease', ',', 'arrhythmia', ',
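
The Frequency and Average columns built in the code above can be illustrated on a small scale. The sketch below uses `collections.Counter` in place of `FreqDist` (which is a `Counter` subclass) and a hypothetical token list, so the numbers are not taken from the dataset.

```python
from collections import Counter

# Hypothetical cleaned tokens, for illustration only.
tokens = ["heart", "disease", "heart", "blood", "heart", "disease"]

# Raw count per term (the "Frequency" column).
counts = Counter(tokens)

# Total number of tokens in the document.
total = sum(counts.values())

# Count divided by total (the "Average" column, i.e. relative frequency).
relative = {term: count / total for term, count in counts.items()}

for term, count in counts.most_common():
    print(term, count, round(relative[term], 3))
```

Because every count is divided by the same total, the relative frequencies of a document sum to 1, which makes documents of different lengths easier to compare.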

<B>NLTK</B> has only limited support for <B>tf-idf</B>, so we use <B>scikit-learn</B> instead. scikit-learn has a built-in tf-idf implementation, and we can still use NLTK's tokenizer and stemmer to preprocess the text.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
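
A minimal sketch of how `TfidfVectorizer` might be combined with NLTK preprocessing is shown below. The helper `tokenize_and_stem` and the two sample documents are assumptions made for illustration, and `RegexpTokenizer` is used here (rather than `word_tokenize`) so that the sketch runs without downloading extra NLTK data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
word_tokenizer = RegexpTokenizer(r"\w+")


def tokenize_and_stem(text):
    """Split text into word tokens (dropping punctuation) and stem each one."""
    return [stemmer.stem(token) for token in word_tokenizer.tokenize(text)]


# Hypothetical documents, for illustration only.
docs = [
    "Heart disease is the leading cause of death.",
    "Coronary heart disease affects the blood vessels.",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each stemmed term to its column index in the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(tfidf.toarray().round(3))
```

Each row of the resulting matrix is one document and each column is one stemmed term. In practice, the contents of the uploaded files would take the place of `docs`.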
