In [36]:
import nltk
import os
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Darshan
[nltk_data]     Mahajan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
# The .txt documents are read from the current working directory.
script_directory = os.getcwd()

In [38]:
all_tokens = []

In [39]:
def read_text_file(file_path):
    """Tokenize one text file and append its tokens to all_tokens."""
    with open(file_path, 'r') as f:
        text = f.read()
    all_tokens.extend(nltk.word_tokenize(text))

In [40]:
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = os.path.join(script_directory, file)
        read_text_file(file_path)

In [41]:
import numpy as np

In [42]:
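def clean_list(tokens):
    """Remove punctuation and whitespace tokens from the list in place."""
    characters_to_remove = [',', '.', ' ']
    # Rebuild the list in one pass; calling remove() while iterating over
    # the same list skips the element that follows each removed token.
    tokens[:] = [token for token in tokens if token not in characters_to_remove]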

In [43]:
clean_list(all_tokens)
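
Removing items from a list inside a `for` loop over that same list can skip elements that sit next to each other. The sketch below compares that pattern with building a filtered copy; the token list is hypothetical and chosen so that two punctuation tokens are adjacent.

```python
# Hypothetical token list with two adjacent punctuation tokens.
tokens = ["Hello", ",", ".", "world", "."]
to_remove = {",", ".", " "}

# Removing while iterating: after "," is removed, the "." behind it slides
# into the freed position and the loop moves past it without checking it.
removed_in_loop = list(tokens)
for token in removed_in_loop:
    if token in to_remove:
        removed_in_loop.remove(token)

# Building a new list checks every element exactly once.
filtered_copy = [token for token in tokens if token not in to_remove]

print(removed_in_loop)  # one "." survives
print(filtered_copy)    # no punctuation left
```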

In [44]:
nltk.download("stopwords")
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to C:\Users\Darshan
[nltk_data]     Mahajan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [45]:


stop_words = set(stopwords.words("english"))
# casefold() makes the comparison case-insensitive, so "The" is filtered like "the".
filtered_list = []
for word in all_tokens:
    if word.casefold() not in stop_words:
        filtered_list.append(word)

all_tokens = filtered_list


In [46]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Darshan Mahajan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. 
Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

In [75]:
pos = nltk.pos_tag(all_tokens)
# pos

Tags that start with    Deal with
JJ                      Adjectives
NN                      Nouns
RB                      Adverbs
PRP                     Pronouns
VB                      Verbs
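
The tags in the table are prefixes: the tagger returns finer-grained tags such as NNS (plural noun) or VBD (past-tense verb) that begin with them. A minimal sketch of grouping tagged words by those prefixes follows. The `(word, tag)` pairs are hand-written in the shape `nltk.pos_tag` returns, not actual tagger output, and `CATEGORIES` and `coarse_category` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs; illustrative, not real tagger output.
tagged = [("She", "PRP"), ("quickly", "RB"), ("wrote", "VBD"),
          ("two", "CD"), ("long", "JJ"), ("letters", "NNS")]

# Tag prefixes from the table, mapped to readable names.
CATEGORIES = {"JJ": "adjective", "NN": "noun", "RB": "adverb",
              "PRP": "pronoun", "VB": "verb"}

def coarse_category(tag):
    """Map a fine-grained tag to the coarse category whose prefix it starts with."""
    for prefix, name in CATEGORIES.items():
        if tag.startswith(prefix):
            return name
    return "other"

words_by_category = defaultdict(list)
for word, tag in tagged:
    words_by_category[coarse_category(tag)].append(word)

for name, words in words_by_category.items():
    print(name, words)
```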

In [48]:
from nltk.stem import PorterStemmer

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.

In [49]:
stem_dict = {}  # maps each token to its Porter stem

ps = PorterStemmer()
for w in all_tokens:
    stem_dict[w] = ps.stem(w)

In [74]:
# print(stem_dict)

Understemming and overstemming are two ways stemming can go wrong:

1. Understemming happens when two related words should be reduced to the same stem but aren’t. This is a false negative.

2. Overstemming happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.
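
Both failure modes can be reproduced with a deliberately crude suffix stripper. The sketch below is not the Porter algorithm; `naive_stem` and its `SUFFIXES` list are hypothetical and exist only to make the two errors visible.

```python
# A toy stemmer: strip the first matching suffix, keeping at least 3 letters.
SUFFIXES = ("ing", "ity", "er", "ed", "es", "s", "e")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: two unrelated words collapse to the same stem.
print(naive_stem("universe"), naive_stem("university"))

# Understemming: two forms of the same verb keep different stems.
print(naive_stem("running"), naive_stem("ran"))
```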

In [51]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to C:\Users\Darshan
[nltk_data]     Mahajan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Darshan
[nltk_data]     Mahajan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

Like stemming, lemmatizing reduces words to their core meaning, but it returns a complete English word that makes sense on its own rather than a fragment such as 'discoveri', the Porter stem of 'discovery'.

In [52]:
lm = WordNetLemmatizer()

In [53]:
lem_dict = {}
for w in all_tokens:
    lem_dict[w] = lm.lemmatize(w)

In [73]:
# print(lem_dict)
# print(len(lem_dict))
# print(len(all_tokens))

In [None]:
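from collections import Counter

def calculateTF(token):
    """Return each word's term frequency: its count divided by the total token count."""
    counts = Counter(token)  # one pass over the tokens instead of count() per word
    total = len(token)
    return {word: count / total for word, count in counts.items()}

calculateTF(all_tokens)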

TF = (Number of times the term appears in the document) / (Total number of terms in the document)

IDF = log10((Total number of docs) / (Number of docs that contain the term))
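
The two formulas can be checked by hand on a tiny corpus. This is a minimal sketch under stated assumptions: the three documents are made up and already tokenized, the logarithm is base 10, and `term_frequency` and `inverse_document_frequency` are hypothetical helper names.

```python
import math

# Hypothetical, already-tokenized corpus of three tiny documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "dog"],
]

def term_frequency(term, doc):
    """Occurrences of the term in one document divided by the document length."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log10(total docs / docs containing the term); the term must occur somewhere."""
    docs_with_term = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / docs_with_term)

tf_the = term_frequency("the", docs[0])             # 2 occurrences / 6 tokens
idf_cat = inverse_document_frequency("cat", docs)   # log10(3 / 2)
print(tf_the, idf_cat)
```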

In [70]:
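inverse_doc_frequency = {}

# Tokenize each document once and keep its vocabulary as a set.
doc_vocabularies = []
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = os.path.join(script_directory, file)
        with open(file_path, 'r') as f:
            doc_vocabularies.append(set(nltk.word_tokenize(f.read())))

total_docs = len(doc_vocabularies)  # number of documents in the corpus

# Every token came from one of these files, so the count below is at least 1.
for w in all_tokens:
    tot_docs_having_w = sum(w in vocab for vocab in doc_vocabularies)
    inverse_doc_frequency[w] = np.log10(total_docs / tot_docs_having_w)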

In [72]:
# print(inverse_doc_frequency)

1. IDF value of 0: This occurs when a term appears in every document in the corpus, because log(N / N) = log(1) = 0. Such a term does not help distinguish one document from another, and its TF-IDF score is 0 in every document.
2. IDF value of 1: With a base-10 logarithm and a corpus of 10 documents, this occurs when the term appears in exactly one document, because log10(10 / 1) = 1. This is the largest IDF possible in such a corpus, so the term is as discriminative as a term can be.
3. Other IDF values: Values between these extremes reflect how rare the term is. The fewer documents contain the term, the higher its IDF and the more it contributes to the TF-IDF score.
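
These values follow directly from the IDF formula. A quick check, assuming a corpus of 10 documents and a base-10 logarithm; `idf` is a hypothetical helper, not part of the notebook.

```python
import math

def idf(total_docs, docs_with_term):
    """IDF with a base-10 logarithm; docs_with_term must be at least 1."""
    return math.log10(total_docs / docs_with_term)

in_every_doc = idf(10, 10)  # log10(1): the term is everywhere
in_half = idf(10, 5)        # log10(2): the term is in half the documents
in_one_doc = idf(10, 1)     # log10(10): the term is in a single document

print(in_every_doc, in_half, in_one_doc)
```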


TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical measure of how important a word is to a document, weighted by how common the word is across the other documents in the same corpus. It is the product of the term frequency (TF) and the inverse document frequency (IDF), where IDF is the logarithm of the total number of documents divided by the number of documents containing the term.
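
Putting the two factors together, the following sketch scores every word of one document against a small corpus. The corpus, the document names, and the `tf_idf` helper are all hypothetical, and the logarithm is assumed to be base 10.

```python
import math

# Hypothetical, already-tokenized corpus keyed by document name.
docs = {
    "doc1": ["the", "cat", "sat", "on", "the", "mat"],
    "doc2": ["the", "dog", "barked"],
    "doc3": ["the", "cat", "chased", "the", "dog"],
}

def tf_idf(term, doc_tokens, all_docs):
    """TF of the term in one document multiplied by its IDF over all documents."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for tokens in all_docs if term in tokens)
    idf = math.log10(len(all_docs) / docs_with_term)
    return tf * idf

corpus = list(docs.values())
scores = {term: tf_idf(term, docs["doc1"], corpus) for term in set(docs["doc1"])}

# Highest score first: words unique to doc1 outrank words shared with other docs.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for term, score in ranked:
    print(term, round(score, 4))
```

In this toy corpus "the" scores 0 even though it is the most frequent word in doc1, because it appears in every document.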