<a href="https://colab.research.google.com/github/yusnivtr/Natural-Language-Processing-HCMUS/blob/main/tf_idf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset

## Download dataset
The Vietnamese Students' Feedback Corpus (UIT-VSFC) is a resource of over 16,000 sentences, human-annotated for two tasks: sentiment-based and topic-based classification [1].

[1] Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, Tham Thi-Hong Truong, Ngan Luu-Thuy Nguyen, UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis,  2018 10th International Conference on Knowledge and Systems Engineering (KSE 2018), November 1-3, 2018, Ho Chi Minh City, Vietnam

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

dataset = load_dataset("uitnlp/vietnamese_students_feedback")

## Interacting with the downloaded data

In [None]:
train_set = dataset['train']
train_set

Dataset({
    features: ['sentence', 'sentiment', 'topic'],
    num_rows: 11426
})

In [None]:
train_set[1]

{'sentence': 'nhiệt tình giảng dạy , gần gũi với sinh viên .',
 'sentiment': 2,
 'topic': 0}

In [None]:
for i in range(5):
  print(train_set[i])

{'sentence': 'slide giáo trình đầy đủ .', 'sentiment': 2, 'topic': 1}
{'sentence': 'nhiệt tình giảng dạy , gần gũi với sinh viên .', 'sentiment': 2, 'topic': 0}
{'sentence': 'đi học đầy đủ full điểm chuyên cần .', 'sentiment': 0, 'topic': 1}
{'sentence': 'chưa áp dụng công nghệ thông tin và các thiết bị hỗ trợ cho việc giảng dạy .', 'sentiment': 0, 'topic': 0}
{'sentence': 'thầy giảng bài hay , có nhiều bài tập ví dụ ngay trên lớp .', 'sentiment': 2, 'topic': 0}


In [None]:
len(train_set)

11426

## Split a sentence

In [None]:
# Read a sentence
example_word_list = train_set[0]['sentence']
example_word_list

'slide giáo trình đầy đủ .'

In [None]:
# Split sentence word-by-word
example_word_list.split()

['slide', 'giáo', 'trình', 'đầy', 'đủ', '.']

In [None]:
# Join words into 1 full sentence
# (looping over the string itself would walk through characters, not words)
sentence = " ".join(example_word_list.split())
sentence

'slide giáo trình đầy đủ .'

In [None]:
# Get 10 sentences to process
# (each entry is already a full sentence string, so no character-by-character copy is needed)
sentence_list = [train_set[idx]['sentence'] for idx in range(10)]
sentence_list

['slide giáo trình đầy đủ .',
 'nhiệt tình giảng dạy , gần gũi với sinh viên .',
 'đi học đầy đủ full điểm chuyên cần .',
 'chưa áp dụng công nghệ thông tin và các thiết bị hỗ trợ cho việc giảng dạy .',
 'thầy giảng bài hay , có nhiều bài tập ví dụ ngay trên lớp .',
 'giảng viên đảm bảo thời gian lên lớp , tích cực trả lời câu hỏi của sinh viên , thường xuyên đặt câu hỏi cho sinh viên .',
 'em sẽ nợ môn này , nhưng em sẽ học lại ở các học kỳ kế tiếp .',
 'thời lượng học quá dài , không đảm bảo tiếp thu hiệu quả .',
 'nội dung môn học có phần thiếu trọng tâm , hầu như là chung chung , khái quát khiến sinh viên rất khó nắm được nội dung môn học .',
 'cần nói rõ hơn bằng cách trình bày lên bảng thay vì nhìn vào slide .']

# Text processing

## N-grams
- N-grams are contiguous sequences of $n$ items (words, symbols, or other tokens) taken from a document. For example, the 2-grams of "the first document" are "the first" and "first document".
- N-grams, along with many other text preprocessing algorithms, are available in the [`nltk`](https://www.nltk.org/) library.
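As a minimal, dependency-free sketch of the definition above, contiguous n-grams can be produced by sliding a window of size $n$ over the token list. The helper name `simple_ngrams` is made up for this illustration and is not part of `nltk`.

```python
def simple_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "This document is the second document.".split()

unigrams = simple_ngrams(tokens, 1)
bigrams = simple_ngrams(tokens, 2)

# Join each tuple back into a string for display
print([" ".join(gram) for gram in bigrams])
```

A sentence with $k$ tokens yields $k - n + 1$ n-grams, and none at all when $n > k$.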

In [None]:
example_sentence = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [None]:
from nltk import ngrams
import numpy as np

num_of_grams = np.arange(1, 4, 1) # Test 3 n-grams

print("Original sentence:", example_sentence[1])
print("==="*5)

for gram in num_of_grams:
    splitted_sentence = ngrams(example_sentence[1].split(), int(gram))
    print(f"{gram}-gram: ",end ='')
    print(splitted_sentence)
    n_grams_list = [ ' '.join(grams) for grams in splitted_sentence]
    print(n_grams_list)
    print()

Original sentence: This document is the second document.
1-gram: <generator object ngrams at 0x7941b0b1a570>
['This', 'document', 'is', 'the', 'second', 'document.']

2-gram: <generator object ngrams at 0x7941b0b1a790>
['This document', 'document is', 'is the', 'the second', 'second document.']

3-gram: <generator object ngrams at 0x7941b0b1a570>
['This document is', 'document is the', 'is the second', 'the second document.']



## Extract features with n-grams

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 1))
n_grams_feature_vector = count_vectorize_model.fit_transform([example_sentence[1]]).toarray()
word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())
word_frequency.T

Unnamed: 0,0
document,2
is,1
second,1
the,1
this,1


In [None]:
count_vectorize_model = CountVectorizer(ngram_range = (1,1))

n_grams_feature_vector = count_vectorize_model.fit_transform([sentence_list[5]]).toarray()

word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())

print('Example sentence:', sentence_list[5])
word_frequency.T

Example sentence: giảng viên đảm bảo thời gian lên lớp , tích cực trả lời câu hỏi của sinh viên , thường xuyên đặt câu hỏi cho sinh viên .


Unnamed: 0,0
bảo,1
cho,1
câu,2
của,1
cực,1
gian,1
giảng,1
hỏi,2
lên,1
lớp,1


In [None]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 1))

n_grams_feature_vector = count_vectorize_model.fit_transform(example_sentence).toarray()

word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())
print(example_sentence)
word_frequency.T

['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']


Unnamed: 0,0,1,2,3
and,0,0,1,0
document,1,2,0,1
first,1,0,0,1
is,1,1,1,1
one,0,0,1,0
second,0,1,0,0
the,1,1,1,1
third,0,0,1,0
this,1,1,1,1


## Problem set 1
Based on the UIT-VSFC dataset and the material above:
- Create an $n$-gram frequency table for any $n$ of your choice.
- With $n=1$ and $n=2$, what is the most frequent term in the dataset?
- With $n=1$ and $n=2$, what is the rarest term in the dataset?
- What are the limitations of this data processing flow? How can we overcome them?


In [None]:
from nltk import ngrams
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def create_frequency(n_gram)->pd.DataFrame:
  count_vectorize_model = CountVectorizer(ngram_range=(n_gram,n_gram))

  n_grams_feature_vector = count_vectorize_model.fit_transform(sentence_list).toarray()

  word_frequency = pd.DataFrame(data=n_grams_feature_vector,columns=count_vectorize_model.get_feature_names_out())

  return word_frequency.T


In [None]:
sentence_list = [item['sentence'] for item in train_set]

uni_gram = create_frequency(n_gram=1)

bi_gram = create_frequency(n_gram=2)

In [None]:
uni_gram

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11416,11417,11418,11419,11420,11421,11422,11423,11424,11425
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10h,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10h30,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ổn,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ủa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ủng,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ức,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
bi_gram

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11416,11417,11418,11419,11420,11421,11422,11423,11424,11425
10 50,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10 bài,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10 fraction,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10 kiến,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10 luôn,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ứng đáp,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ứng đúng,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ứng được,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ứng đầy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
print(f"With  n=1, the most popular word in the dataset is '{uni_gram.sum(axis=1).idxmax()}' ")
print(f"With  n=2, the most popular word in the dataset is '{bi_gram.sum(axis=1).idxmax()}' ")

With  n=1, the most popular word in the dataset is 'viên' 
With  n=2, the most popular word in the dataset is 'sinh viên' 


In [None]:
# idxmin returns the first term that reaches the minimum count, so ties are broken by vocabulary order
print(f"With  n=1, the rarest word in the dataset is '{uni_gram.sum(axis=1).idxmin()}' ")
print(f"With  n=2, the rarest word in the dataset is '{bi_gram.sum(axis=1).idxmin()}' ")

With  n=1, the rarest word in the dataset is '10h' 
With  n=2, the rarest word in the dataset is '10 50' 


### Retrieve all sentences within the dataset

In [None]:
from typing import List

def get_all_sentences(dataset) -> List[str]:
    """
    Function to get all sentences and store them into a list of strings

    Args:
    dataset -- The subset (i.e., train/valid/test) in UIT-VSFC dataset

    Returns:
    A list of all sentences in a subset data of the UIT-VSFC.
    """

    list_all_sentence: list = [item['sentence'] for item in dataset]

    ### YOUR CODE STARTS HERE

    # for idx in range(len(train_set)):
    #     sentence = ""
    #     for word in example_word_list:
    #         sentence += word
    #     list_all_sentence.append(sentence)

    ### YOUR CODE ENDS HERE

    return list_all_sentence


In [None]:
list_all_sentence: list = get_all_sentences(train_set)
print(f"#sentences within the dataset: {len(list_all_sentence)}")
print(f"Example sentence: {list_all_sentence[0]}")

#sentences within the dataset: 11426
Example sentence: slide giáo trình đầy đủ .


### Build the word frequency table

In [None]:
def n_gram_word_frequency(sentence_list: list,
                          n: int) -> pd.DataFrame:
    """
    Function to build a word frequency table based on n-grams

    Args:
    sentence_list (list) -- A list of all sentences used to build the table
    n (int) -- The n-gram size passed to this function

    Returns:
    A dataframe with one row per sentence and one column per n-gram, holding the n-gram counts
    """
    ### YOUR CODE STARTS HERE

    count_vectorize_model = CountVectorizer(ngram_range = (n, n))
    n_grams_feature_vector = count_vectorize_model.fit_transform(sentence_list)
    word_frequency_table = pd.DataFrame(data = n_grams_feature_vector.toarray(), columns = count_vectorize_model.get_feature_names_out())

    ### YOUR CODE ENDS HERE

    return word_frequency_table

In [None]:
# Construct the table of word frequency
word_frequency_table = n_gram_word_frequency(sentence_list=list_all_sentence,
                                             n=1)
word_frequency_table.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11416,11417,11418,11419,11420,11421,11422,11423,11424,11425
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10h,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10h30,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ổn,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ủa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ủng,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ức,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
def n_gram_word_frequency(sentence_list: list, n: int) -> pd.DataFrame:
    """
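    Build a word frequency table based on n-grams

    Args:
        sentence_list (list): List of input sentences
        n (int): Smallest n-gram size to count

    Returns:
        pd.DataFrame: DataFrame with a single "Frequency" column holding the total count of each term across all documents
    """
    # Note: ngram_range=(n, n+1) counts both n-grams and (n+1)-grams,
    # so n=1 yields unigrams and bigrams in the same table
    count_vectorizer = CountVectorizer(ngram_range=(n, n+1))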

    sparse_matrix = count_vectorizer.fit_transform(sentence_list)

    word_frequencies = sparse_matrix.sum(axis=0)
    frequency_df = pd.DataFrame(
        data=word_frequencies.A1,
        columns=["Frequency"],
        index=count_vectorizer.get_feature_names_out()
    )

    return frequency_df.sort_values("Frequency", ascending=False)

# Construct the table of word frequency
word_frequency_table = n_gram_word_frequency(sentence_list=list_all_sentence,
                                             n=1)
word_frequency_table

Unnamed: 0,Frequency
viên,4803
giảng,3711
dạy,3156
thầy,3095
sinh,3082
...,...
nhiệm giảng,1
nhiệm hơn,1
nhiệm lý,1
nhiệm nhiệt,1


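Answers to problem 1, with implementation and reasoning:
- The code has been executed **above**.
- With n=1, the most popular word in the dataset is 'viên'; with n=2, it is 'sinh viên'.
- With n=1, the rarest word in the dataset is '10h'; with n=2, it is '10 50'. Note that `idxmin` reports only the first term that reaches the minimum count.
- What are the limitations of this data processing flow? How can we overcome them?
  The main limitations of this data processing flow are:

    - Stop words (frequent words that carry little meaning) are not removed → pass a stop-word list through the `stop_words` parameter of `CountVectorizer`.

    - Letter case: `CountVectorizer` already lowercases text by default (`lowercase=True`), so this setting should stay enabled.

    - Punctuation: the default `token_pattern` already drops punctuation, but it also drops single-character tokens → add a custom `preprocessor` or adjust `token_pattern` if those tokens matter.

    - The default tokenization may not suit Vietnamese: splitting on whitespace breaks multi-syllable words apart, which is why the top unigram 'viên' is only one syllable of 'sinh viên' → customize `token_pattern` or segment words before vectorizing.

    - The matrix is sparse and high-dimensional, and converting it with `.toarray()` makes the code slow and memory-hungry → keep the matrix sparse, combine it with TF-IDF weighting, or reduce the dimensionality.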

## Stopwords

In [None]:
# Retrieve the stopword dictionary
!wget --no-check-certificate --content-disposition https://raw.githubusercontent.com/stopwords/vietnamese-stopwords/master/vietnamese-stopwords.txt

--2025-03-28 16:22:44--  https://raw.githubusercontent.com/stopwords/vietnamese-stopwords/master/vietnamese-stopwords.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20475 (20K) [text/plain]
Saving to: ‘vietnamese-stopwords.txt.1’


2025-03-28 16:22:44 (11.7 MB/s) - ‘vietnamese-stopwords.txt.1’ saved [20475/20475]



In [None]:
# Observe stopwords list
with open('vietnamese-stopwords.txt', 'r', encoding='utf-8') as f:
    vietnamese_stopword = f.read()
vietnamese_stopword = vietnamese_stopword.split('\n') # Separate line by line
print(f"#Number of stop words: {len(vietnamese_stopword)}")

#Number of stop words: 1942


In [None]:
# Stop words example
for sentence in vietnamese_stopword[:10]:
    print(sentence)

a lô
a ha
ai
ai ai
ai nấy
ai đó
alô
amen
anh
anh ấy


## Term frequency - Inverse document frequency (TF-IDF)


### TF
Term frequency (TF) is the number of times a given term appears in a document, normalized by the length of that document:

$$
tf(t, d) = f(t,d)\times\frac{1}{T}
$$
where $f(t,d)$ is the number of occurrences of the term $t$ in the document $d$, and $T$ is the total number of words in that document.
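The formula above can be checked by hand on a single document. The sketch below assumes a simple regular-expression tokenizer (lowercased words of at least two characters), chosen here to mimic scikit-learn's default tokenization; it is an illustration, not the library's implementation.

```python
import re
from collections import Counter

doc = "This document is the second document."

# Assumed tokenizer: lowercase words with at least two characters
tokens = re.findall(r"\b\w\w+\b", doc.lower())

counts = Counter(tokens)   # f(t, d): occurrences of each term in the document
T = len(tokens)            # total number of words in the document

tf = {term: count / T for term, count in counts.items()}
for term, value in tf.items():
    print(f"{term:10s} {value:.4f}")
```

Here "document" occurs twice among six words, so its term frequency is 2/6, while every other term gets 1/6.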

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Declare TF vectorize
tf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                use_idf=False, # only using TF
                                norm='l1')

# Fit the vocabulary and transform the corpus in one step
tf_vectorized = tf_vectorizer.fit_transform(corpus)

tf_output = tf_vectorized[0]

# Build TF table
words_tf_idf = pd.DataFrame(tf_output.T.todense(), index=tf_vectorizer.get_feature_names_out(), columns=['tf'])
words_tf_idf

Unnamed: 0,tf
and,0.0
document,0.2
first,0.2
is,0.2
one,0.0
second,0.0
the,0.2
third,0.0
this,0.2


### IDF

Inverse document frequency (IDF) measures how informative a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little meaning. We therefore weigh down the frequent terms and scale up the rare ones:

$$
idf(t) = \ln\left(\frac{N}{df(t)}\right) + 1
$$
where $N$ is the number of documents in the document set and $df(t)$ is the number of documents that contain the term $t$. This is the unsmoothed form that scikit-learn uses when `smooth_idf=False`.
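The IDF formula above can be reproduced by hand on the four-sentence example corpus. This sketch assumes the natural logarithm and a simple regular-expression tokenizer (lowercased words of at least two characters), both chosen to match scikit-learn's behaviour with `smooth_idf=False`.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Each document contributes a term at most once to the document frequency
doc_terms = [set(re.findall(r"\b\w\w+\b", doc.lower())) for doc in corpus]

n_docs = len(corpus)
doc_freq = Counter(term for terms in doc_terms for term in terms)

# Unsmoothed idf: ln(#documents / #documents with the term) + 1
idf = {term: math.log(n_docs / df) + 1 for term, df in doc_freq.items()}

for term in sorted(idf, key=idf.get):
    print(f"{term:10s} {idf[term]:.6f}")
```

Terms that occur in every document ("is", "the", "this") get the minimum value of 1, and terms that occur in a single document get the maximum.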

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Configure settings for IDF vectorize
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm=None)

tf_idf_vectorizer.fit_transform(corpus)

# Retrieve only idf information
idf_vectorizer = tf_idf_vectorizer.idf_

# Join idf values into the previous dataframe
words_tf_idf['idf'] = idf_vectorizer

# Show dataframe with ascending values of idf
words_tf_idf.sort_values(by=['idf'])

Unnamed: 0,tf,idf
is,0.2,1.0
the,0.2,1.0
this,0.2,1.0
document,0.2,1.287682
first,0.2,1.693147
and,0.0,2.386294
second,0.0,2.386294
one,0.0,2.386294
third,0.0,2.386294


### TF-IDF

TF-IDF is a score assigned to every term in every document of the dataset. For a given term, the score grows with each occurrence of the term in the document, and shrinks as the term appears in more documents of the corpus.

$$
\text{tf-idf}(t, d) = tf(t, d) \times idf(t)
$$

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm='l1')

# Fit the vocabulary and transform the corpus in one step
tf_idf_vectorized = tf_idf_vectorizer.fit_transform(corpus)


tf_idf_output = tf_idf_vectorized[0]

words_tf_idf['tf-idf'] = tf_idf_output.T.todense()

words_tf_idf.sort_values(by=['tf-idf'])

Unnamed: 0,tf,idf,tf-idf
and,0.0,2.386294,0.0
third,0.0,2.386294,0.0
second,0.0,2.386294,0.0
one,0.0,2.386294,0.0
is,0.2,1.0,0.167201
the,0.2,1.0,0.167201
this,0.2,1.0,0.167201
document,0.2,1.287682,0.215302
first,0.2,1.693147,0.283096


### Problem set 2
Based on problem 1 and the material on TF, IDF, and TF-IDF:
- (2a) Build the tf-idf table for the UIT-VSFC dataset with $n$-gram = 1 and $n$-gram = 2.
- (2b) Change a few hyperparameters of `TfidfVectorizer` (`smooth_idf`, `sublinear_tf` and `norm`) from problem 2a (*see this [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to find the correct parameters to pass*). Explain how the results differ after modifying the hyperparameters.
- (2c) Which words have the lowest and the highest tf-idf values? Do they differ from the $n$-gram results?
- (2d) Which limitations of $n$-grams does TF-IDF overcome?

In [None]:
# 2a
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = train_set['sentence']

# ngram_range=(1, 2) scores unigrams and bigrams in a single vocabulary
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm='l1')

tf_idf_vectorized = tf_idf_vectorizer.fit_transform(corpus)
# Inspect the scores of the first sentence only
tf_idf_output = tf_idf_vectorized[0]
tf_idf_table = pd.DataFrame(tf_idf_output.T.todense(),index=tf_idf_vectorizer.get_feature_names_out(),columns=['tf-idf'])
tf_idf_table.sort_values(by='tf-idf',ascending=False,inplace=True)
display(tf_idf_table['tf-idf'].value_counts())
display(tf_idf_table.head(20))

Unnamed: 0_level_0,count
tf-idf,Unnamed: 1_level_1
0.0,33834
0.186714,1
0.178869,1
0.11328,1
0.093067,1
0.093375,1
0.092839,1
0.086623,1
0.078906,1
0.076327,1


Unnamed: 0,tf-idf
slide giáo,0.186714
trình đầy,0.178869
giáo trình,0.11328
đầy đủ,0.093375
slide,0.093067
đầy,0.092839
đủ,0.086623
trình,0.078906
giáo,0.076327
thuyết dài,0.0


In [None]:
# 2b
corpus = train_set['sentence']

def test_tfidf(params, title):
    print(f"\n---- Experiment: {title} ----")
    tf_idf_vectorizer = TfidfVectorizer(**params)
    tf_idf_vectorized = tf_idf_vectorizer.fit_transform(corpus)

    # Inspect the scores of the first sentence only
    tf_idf_output = tf_idf_vectorized[0]
    tf_idf_table = pd.DataFrame(
        tf_idf_output.T.todense(),
        index=tf_idf_vectorizer.get_feature_names_out(),
        columns=['tf-idf']
    )
    tf_idf_table.sort_values(by='tf-idf', ascending=False, inplace=True)

    print("\nTF-IDF value distribution:")
    display(tf_idf_table['tf-idf'].value_counts().head())
    print("\nSorted TF-IDF table:")
    display(tf_idf_table.head(5))

# Experiment 1: baseline settings
test_tfidf(
    params={
        'ngram_range': (1, 2),
        'smooth_idf': False,
        'sublinear_tf': False,
        'norm': 'l1'
    },
    title="Baseline (smooth_idf=False, sublinear_tf=False, norm=l1)"
)

# Experiment 2: enable smooth_idf
test_tfidf(
    params={
        'ngram_range': (1, 2),
        'smooth_idf': True,
        'sublinear_tf': False,
        'norm': 'l1'
    },
    title="smooth_idf=True"
)

# Experiment 3: enable sublinear_tf
test_tfidf(
    params={
        'ngram_range': (1, 2),
        'smooth_idf': False,
        'sublinear_tf': True,
        'norm': 'l1'
    },
    title="sublinear_tf=True"
)

# Experiment 4: switch norm to l2
test_tfidf(
    params={
        'ngram_range': (1, 2),
        'smooth_idf': False,
        'sublinear_tf': False,
        'norm': 'l2'
    },
    title="norm=l2"
)

# Experiment 5: combine several changes
test_tfidf(
    params={
        'ngram_range': (1, 2),
        'smooth_idf': True,
        'sublinear_tf': True,
        'norm': 'l2'
    },
    title="Combined smooth_idf=True + sublinear_tf=True + norm=l2"
)



---- Experiment: Baseline (smooth_idf=False, sublinear_tf=False, norm=l1) ----

TF-IDF value distribution:


Unnamed: 0_level_0,count
tf-idf,Unnamed: 1_level_1
0.0,33834
0.186714,1
0.178869,1
0.11328,1
0.093067,1



Sorted TF-IDF table:


Unnamed: 0,tf-idf
slide giáo,0.186714
trình đầy,0.178869
giáo trình,0.11328
đầy đủ,0.093375
slide,0.093067



---- Experiment: smooth_idf=True ----

TF-IDF value distribution:


Unnamed: 0_level_0,count
tf-idf,Unnamed: 1_level_1
0.0,33834
0.181404,1
0.175759,1
0.114667,1
0.094309,1



Sorted TF-IDF table:


Unnamed: 0,tf-idf
slide giáo,0.181404
trình đầy,0.175759
giáo trình,0.114667
đầy đủ,0.094621
slide,0.094309



---- Experiment: sublinear_tf=True ----

TF-IDF value distribution:


Unnamed: 0_level_0,count
tf-idf,Unnamed: 1_level_1
0.0,33834
0.186714,1
0.178869,1
0.11328,1
0.093067,1



Sorted TF-IDF table:


Unnamed: 0,tf-idf
slide giáo,0.186714
trình đầy,0.178869
giáo trình,0.11328
đầy đủ,0.093375
slide,0.093067



---- Experiment: norm=l2 ----

TF-IDF value distribution:


Unnamed: 0_level_0,count
tf-idf,Unnamed: 1_level_1
0.0,33834
0.527593,1
0.505426,1
0.320093,1
0.262977,1



Sorted TF-IDF table:


Unnamed: 0,tf-idf
slide giáo,0.527593
trình đầy,0.505426
giáo trình,0.320093
đầy đủ,0.263848
slide,0.262977



---- Experiment: Combined smooth_idf=True + sublinear_tf=True + norm=l2 ----

TF-IDF value distribution:


Unnamed: 0_level_0,count
tf-idf,Unnamed: 1_level_1
0.0,33834
0.515696,1
0.499649,1
0.325976,1
0.268103,1



Sorted TF-IDF table:


Unnamed: 0,tf-idf
slide giáo,0.515696
trình đầy,0.499649
giáo trình,0.325976
đầy đủ,0.268988
slide,0.268103


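- `smooth_idf=True`: smooths the IDF by adding one to the document count and to each document frequency; after normalization, the scores of rare terms drop slightly and those of common terms rise slightly.
- `sublinear_tf=True`: replaces the raw term count with $1 + \ln(tf)$, which dampens terms that repeat many times within a document. The results above are unchanged because every term in the first sentence occurs exactly once, and $1 + \ln(1) = 1$.
- `norm='l2'`: normalizes each document vector to unit Euclidean length; the values are larger than with `l1`, but the relative proportions within a document stay the same.
- `norm=None`: no normalization; the raw TF-IDF values are larger, especially for rare terms.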
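The effect of `sublinear_tf` and `smooth_idf` described above can be checked on a tiny corpus. The sketch below uses two short sentences in the style of the dataset (the second one is shortened for this illustration) and turns normalization off so that the raw weights are visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "slide giáo trình đầy đủ .",                  # every term occurs once
    "thầy giảng bài hay , có nhiều bài tập .",    # "bài" occurs twice
]

settings = dict(smooth_idf=False, norm=None)
plain = TfidfVectorizer(**settings).fit_transform(docs).toarray()
sublinear = TfidfVectorizer(sublinear_tf=True, **settings).fit_transform(docs).toarray()

# Document 0: all counts are 1, and 1 + ln(1) = 1, so nothing changes
same_doc0 = np.allclose(plain[0], sublinear[0])
# Document 1: the repeated term shrinks from 2 * idf to (1 + ln 2) * idf
changed_doc1 = not np.allclose(plain[1], sublinear[1])
print(same_doc0, changed_doc1)

# smooth_idf adds one to the document count and to each document frequency
n_docs, df = 2, 1
unsmoothed = np.log(n_docs / df) + 1
smoothed = np.log((1 + n_docs) / (1 + df)) + 1
print(unsmoothed, smoothed)
```

This is why experiment 3 matches the baseline for the first sentence: a difference would only show up for a sentence in which some term repeats.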

In [None]:
#2c
# Note: tf_idf_table comes from 2a and holds the scores of the first sentence only
non_zero_tf_idf = tf_idf_table[tf_idf_table['tf-idf'] > 0]

min_tf_idf_word = non_zero_tf_idf['tf-idf'].idxmin()
min_tf_idf_value = non_zero_tf_idf['tf-idf'].min()

max_tf_idf_word = non_zero_tf_idf['tf-idf'].idxmax()
max_tf_idf_value = non_zero_tf_idf['tf-idf'].max()

print(f"Word with the lowest TF-IDF: '{min_tf_idf_word}', value:  {min_tf_idf_value}")
print(f"Word with the highest TF-IDF: '{max_tf_idf_word}', value:  {max_tf_idf_value}")


Word with the lowest TF-IDF: 'giáo', value:  0.07632721998355652
Word with the highest TF-IDF: 'slide giáo', value:  0.18671384793559254


**2d**

**Limitations of n-grams**

When n-grams are used directly (for example, in a bag-of-n-grams model), we only count their occurrences without any weighting. This leads to the following limitations:

- Large size: as n grows (bigrams, trigrams), the number of distinct n-grams increases very quickly, producing a huge feature space.
- Sparsity: most n-grams do not occur in any given document, so the vector representation is very sparse.
- No importance weighting: all n-grams are treated alike, whether or not they carry important information. For example, "và tôi" ("and I") and "học máy" ("machine learning") may have the same frequency but very different information value.

**How TF-IDF addresses them**

When TF-IDF is applied to n-grams, the limitations above are handled as follows:

- Importance weighting:
  - TF-IDF weights each n-gram by its frequency within the document (tf) and its rarity across the dataset (idf). Rare but informative n-grams (such as "học máy") receive higher weights than common ones (such as "của tôi", "of mine").

- Reduced impact of size:
  - Although TF-IDF does not reduce the number of features, it makes uninformative features (those with low TF-IDF) less influential in tasks such as classification or clustering.
- Meaning of sparsity:
  - The vectors remain sparse, but the non-zero values are weighted, so the representation focuses on the discriminative n-grams.

In short, TF-IDF addresses the main limitation of n-grams, the inability to assess how important each n-gram is, and thereby yields a more effective text representation.