# URL: http://tinyurl.com/b4ejxkca

# NLTK

## Introduction to Natural Language Toolkit (NLTK), used mainly for English
- Brief overview of NLTK.

  NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.


## Setting Up NLTK
- Installing NLTK using pip.
- Downloading NLTK data.

In [1]:
!pip install nltk



In [2]:
import nltk
# nltk.download("all")
# print("NLTK is successfully installed and data is downloaded.")

## Basic NLTK Operations
- Tokenization: Breaking text into words or sentences.
- Stopwords: Identifying and removing common words.

In [4]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("stopwords")

# Example Text
text = "NLTK is a powerful library for natural language processing. It makes NLP tasks easy and efficient."

# Tokenization
words = word_tokenize(text)
sentences = sent_tokenize(text)

# Stopwords: remove very common words so that the content words carrying the main point remain
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Tokenized Words:", words)
print("Tokenized Sentences:", sentences)
print("Filtered Words (excluding stopwords):", filtered_words)

Tokenized Words: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', '.', 'It', 'makes', 'NLP', 'tasks', 'easy', 'and', 'efficient', '.']
Tokenized Sentences: ['NLTK is a powerful library for natural language processing.', 'It makes NLP tasks easy and efficient.']
Filtered Words (excluding stopwords): ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', '.', 'makes', 'NLP', 'tasks', 'easy', 'efficient', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
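The filtered list in the output above still contains the `'.'` tokens, because punctuation is not part of the stopword list. A minimal sketch of one way to drop it with `str.isalpha()` follows. The token list is copied from the output above, and `SMALL_STOP_WORDS` is a hand-picked illustrative subset (an assumption, not NLTK's full list) so that the sketch runs without any download.

```python
# Tokens copied from the "Tokenized Words" output above.
tokens = ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural',
          'language', 'processing', '.', 'It', 'makes', 'NLP', 'tasks',
          'easy', 'and', 'efficient', '.']

# Assumption: a tiny hand-picked stopword set, for illustration only.
SMALL_STOP_WORDS = {"is", "a", "for", "it", "and"}

def keep_content_words(tokens, stop_words):
    """Keep alphabetic tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

content_words = keep_content_words(tokens, SMALL_STOP_WORDS)
print(content_words)
```

With the real NLTK list, `SMALL_STOP_WORDS` would be replaced by `set(stopwords.words("english"))`.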


## Part-of-Speech Tagging
- Understanding parts of speech.
- Using NLTK for part-of-speech tagging.

In [8]:
# Part-of-Speech Tagging Example
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(words)
print("Part-of-Speech Tags:", pos_tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Part-of-Speech Tags: [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.'), ('It', 'PRP'), ('makes', 'VBZ'), ('NLP', 'NNP'), ('tasks', 'NNS'), ('easy', 'JJ'), ('and', 'CC'), ('efficient', 'JJ'), ('.', '.')]
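The tags in the output above follow the Penn Treebank convention, where noun tags start with `NN` and adjective tags with `JJ`. As a small sketch of how tagged tokens can be used, the helper below (a hypothetical name, `words_with_tag_prefix`) selects words by tag prefix. The tagged list is copied from the output above so no tagger download is needed.

```python
# Tagged tokens copied from the "Part-of-Speech Tags" output above.
pos_tags = [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'),
            ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'),
            ('language', 'NN'), ('processing', 'NN'), ('.', '.'),
            ('It', 'PRP'), ('makes', 'VBZ'), ('NLP', 'NNP'), ('tasks', 'NNS'),
            ('easy', 'JJ'), ('and', 'CC'), ('efficient', 'JJ'), ('.', '.')]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(pos_tags, "NN")       # NN, NNS, NNP, NNPS
adjectives = words_with_tag_prefix(pos_tags, "JJ")  # JJ, JJR, JJS
print("Nouns:", nouns)
print("Adjectives:", adjectives)
```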


## Text Analysis with NLTK
- Frequency distribution of words.
- Concordance and collocations.
- Lexical diversity.

In [9]:
import nltk
nltk.download('gutenberg')

from nltk import FreqDist
from nltk.text import Text

text = "NLTK is a powerful library for natural language processing. It makes NLP tasks easy and efficient. NLTK is a leading platform for building Python programs to work with human language data."
words = word_tokenize(text)

# Frequency Distribution
freq_dist = FreqDist(words)
print("Frequency Distribution:", freq_dist.most_common())

print("###########################")

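# Concordance
# Note: Text.concordance() prints its matches directly and returns None,
# so the print() call below shows "None" after the matches.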
text = Text(words)
concordance_results = text.concordance("language")
print(concordance_results)

print("###########################")

# Concordance on a corpus text (Moby Dick from the Gutenberg corpus)
text = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
concordance_results = text.concordance("natural")
print(concordance_results)

print("###########################")

# Collocations
collocations = text.collocation_list()
print("Collocations:", collocations)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


Frequency Distribution: [('.', 3), ('NLTK', 2), ('is', 2), ('a', 2), ('for', 2), ('language', 2), ('powerful', 1), ('library', 1), ('natural', 1), ('processing', 1), ('It', 1), ('makes', 1), ('NLP', 1), ('tasks', 1), ('easy', 1), ('and', 1), ('efficient', 1), ('leading', 1), ('platform', 1), ('building', 1), ('Python', 1), ('programs', 1), ('to', 1), ('work', 1), ('with', 1), ('human', 1), ('data', 1)]
###########################
Displaying 2 of 2 matches:
 is a powerful library for natural language processing . It makes NLP tasks ea
Python programs to work with human language data .
None
###########################
Displaying 25 of 36 matches:
 unite in a man of greatly superior natural force , with a globular brain and a
pit ! ye insult me , man ; past all natural bearing , ye insult me . It ' s an 
 stayed below . And all this seemed natural enough ; especially as in the merch
r a seaman , and endued with a deep natural reverence , the wild watery lonelin
ferences , that some depart
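Lexical diversity is listed for this section but not computed in the cell above. A common definition is the ratio of distinct tokens (types) to total tokens. The sketch below assumes that definition and uses a simple regular expression as a stand-in for `word_tokenize`, so it runs without NLTK data; the text is the same three-sentence example.

```python
import re

text = ("NLTK is a powerful library for natural language processing. "
        "It makes NLP tasks easy and efficient. NLTK is a leading platform "
        "for building Python programs to work with human language data.")

# Assumption: a simple regex tokenizer (words or single punctuation marks).
tokens = re.findall(r"\w+|[^\w\s]", text)

def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

diversity = lexical_diversity(tokens)
print(f"{len(set(tokens))} types / {len(tokens)} tokens = {diversity:.3f}")
```

Values closer to 1 mean fewer repeated tokens. The ratio depends on text length, so it is best compared across texts of similar size.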

## Stemming and Lemmatization
- Introduction to stemming.
- Introduction to lemmatization.
- NLTK tools for stemming and lemmatization.

In [10]:
import nltk
nltk.download('wordnet')

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Example Words
words_to_stem = ["running", "better", "cats", "gone"]

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words_to_stem]
print("Stemmed Words:", stemmed_words)

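# Lemmatization
# Note: lemmatize() treats every word as a noun unless a pos argument is given,
# which is why "running" is returned unchanged in the output.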
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words_to_stem]
print("Lemmatized Words:", lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Stemmed Words: ['run', 'better', 'cat', 'gone']
Lemmatized Words: ['running', 'better', 'cat', 'gone']
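Stemming strips suffixes by rule and does not consult a dictionary, so related but distinct words can collapse to the same stem (over-stemming). The sketch below illustrates this with `PorterStemmer`, which needs no downloaded data. The word list is an illustrative choice, not taken from the notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words that are related in form but differ in meaning.
related_words = ["universe", "university", "universal"]

# Map each word to its stem to see which ones collapse together.
stems = {word: stemmer.stem(word) for word in related_words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

When distinctions like these matter for a downstream task, a lemmatizer with an explicit part of speech may be the safer choice.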


## Named Entity Recognition (NER)
- Identifying entities in text.
- NLTK's NER capabilities.

In [11]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Named Entity Recognition Example
ner_result = nltk.ne_chunk(pos_tags)
print("Named Entities:", ner_result)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


Named Entities: (S
  (ORGANIZATION NLTK/NNP)
  is/VBZ
  a/DT
  powerful/JJ
  library/NN
  for/IN
  natural/JJ
  language/NN
  processing/NN
  ./.
  It/PRP
  makes/VBZ
  (ORGANIZATION NLP/NNP)
  tasks/NNS
  easy/JJ
  and/CC
  efficient/JJ
  ./.)


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
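`ne_chunk` returns an `nltk.Tree` in which each named entity is a labelled subtree. The sketch below shows one way to pull the entities out as `(label, text)` pairs. To stay runnable without the chunker models, it uses a hand-built tree that mirrors part of the output above (an assumption); in the notebook, `ner_result` would be passed instead.

```python
from nltk import Tree

# Assumption: a hand-built tree mirroring part of the ne_chunk output above.
tree = Tree("S", [
    Tree("ORGANIZATION", [("NLTK", "NNP")]),
    ("is", "VBZ"), ("a", "DT"), ("powerful", "JJ"), ("library", "NN"),
    (".", "."),
    ("It", "PRP"), ("makes", "VBZ"),
    Tree("ORGANIZATION", [("NLP", "NNP")]),
    ("tasks", "NNS"), ("easy", "JJ"), (".", "."),
])

def extract_entities(tree):
    """Collect (label, text) for every entity subtree directly under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

entities = extract_entities(tree)
print(entities)
```

Listing the entities this way also makes questionable labels easy to spot, such as `NLP` being tagged as an `ORGANIZATION`.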


## Sentiment Analysis with NLTK
- Introduction to sentiment analysis.
- Using NLTK for sentiment analysis.

In [12]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment import SentimentIntensityAnalyzer

# Example Sentence
sentence = "NLTK is amazing! I love using it for NLP tasks."

# Sentiment Analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(sentence)
print("Sentiment Score:", sentiment_score)

Sentiment Score: {'neg': 0.0, 'neu': 0.458, 'pos': 0.542, 'compound': 0.8516}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
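The `compound` value is a single normalized score between -1 and 1. A commonly used convention for VADER is to call a text positive at `compound >= 0.05`, negative at `compound <= -0.05`, and neutral otherwise. The sketch below applies that convention to the score printed above; the helper name and the tunable `threshold` parameter are assumptions for illustration.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Score dictionary copied from the output above.
sentiment_score = {'neg': 0.0, 'neu': 0.458, 'pos': 0.542, 'compound': 0.8516}

label = label_from_compound(sentiment_score["compound"])
print("Label:", label)
```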


## Advanced Topics in NLTK
- Advanced tokenization techniques.
- Chunking and parsing.
- Machine learning with NLTK.

In [13]:
from nltk.tokenize import MWETokenizer, TweetTokenizer

# Multi-Word Expression Tokenization
mwe_tokenizer = MWETokenizer([("natural", "language"), ("processing", "tasks")])
mwe_tokens = mwe_tokenizer.tokenize(words)
print("Multi-Word Expression Tokenization:", mwe_tokens)

# Tweet Tokenization
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize("NLTK is awesome! #NLP #Python")
print("Tweet Tokenization:", tweet_tokens)

Multi-Word Expression Tokenization: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural_language', 'processing', '.', 'It', 'makes', 'NLP', 'tasks', 'easy', 'and', 'efficient', '.', 'NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
Tweet Tokenization: ['NLTK', 'is', 'awesome', '!', '#NLP', '#Python']
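In the output above only the first `natural language` pair was merged. `MWETokenizer` joins tokens only when they appear next to each other in the listed order, so `("processing", "tasks")` never matched and the later `human language` was left alone. The sketch below illustrates this on two short hand-written token lists and also shows the `separator` argument.

```python
from nltk.tokenize import MWETokenizer

# Join the expression with a space instead of the default underscore.
tokenizer = MWETokenizer([("natural", "language")], separator=" ")

# The two tokens are adjacent and in order, so they are merged.
adjacent = tokenizer.tokenize(["natural", "language", "processing"])

# Another token sits in between, so nothing is merged.
not_adjacent = tokenizer.tokenize(["natural", "human", "language"])

print(adjacent)
print(not_adjacent)
```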


In [14]:
# Chunking and Parsing
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize

# Example Sentence
chunking_sentence = "The black cat chased the white mouse."

# Define a simple grammar for NP (Noun Phrase) chunking
grammar = r"NP: {<DT>?<JJ>*<NN>}"

# Create a chunk parser
chunk_parser = RegexpParser(grammar)

words = word_tokenize(chunking_sentence)
pos_tags = nltk.pos_tag(words)

# Apply chunking
tree = chunk_parser.parse(pos_tags)
print("Chunking Example:", tree)
# tree.draw()


Chunking Example: (S
  (NP The/DT black/JJ cat/NN)
  chased/VBD
  (NP the/DT white/JJ mouse/NN)
  ./.)
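The grammar above ends in `<NN>`, which matches only singular common nouns. The sketch below compares it with a broader pattern, `<NN.*>+`, on a plural variant of the sentence. The POS tags are written by hand (an assumption) so the example runs without a tagger, and `noun_phrases` is a hypothetical helper name.

```python
from nltk import RegexpParser

# Assumption: hand-written POS tags for a plural variant of the sentence.
tagged = [("The", "DT"), ("black", "JJ"), ("cats", "NNS"), ("chased", "VBD"),
          ("the", "DT"), ("white", "JJ"), ("mice", "NNS"), (".", ".")]

def noun_phrases(grammar, tagged_tokens):
    """Return the text of every NP chunk found by the given grammar."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

# Grammar from the cell above: only singular NN closes a noun phrase.
narrow = noun_phrases(r"NP: {<DT>?<JJ>*<NN>}", tagged)
# Broader grammar: any tag starting with NN (NN, NNS, NNP, NNPS).
broad = noun_phrases(r"NP: {<DT>?<JJ>*<NN.*>+}", tagged)

print("Narrow grammar:", narrow)
print("Broad grammar:", broad)
```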


Example of what the `draw()` function displays when run on a local computer.<br>
<img src="https://www.nltk.org/_images/tree.gif">

In [15]:
# Machine Learning with NLTK - Classification
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Example Dataset
training_data = [
    ({"feature1": "value1", "feature2": "value2"}, "class1"),
    ({"feature1": "value3", "feature2": "value4"}, "class2"),
    # Add more examples...
]

# Train a Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(training_data)

# Example Classification
test_instance = {"feature1": "value5", "feature2": "value6"}
classification_result = classifier.classify(test_instance)
print("Classification Result:", classification_result)

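# Evaluate Classifier Accuracy
# Note: this scores the classifier on its own training data, so 1.0 is expected
# here and says little about performance on unseen data.
accuracy_score = accuracy(classifier, training_data)
print("Classifier Accuracy:", accuracy_score)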

Classification Result: class2
Classifier Accuracy: 1.0
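Because the cell above evaluates on the same two examples it trained on, the accuracy of 1.0 is not informative. The sketch below trains on one set and evaluates on a separate held-out set, and uses `prob_classify` to look at the predicted probability. The feature names and the tiny dataset are made up for illustration.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Assumption: a made-up toy dataset with two boolean features.
positive = ({"has_good": True, "has_bad": False}, "pos")
negative = ({"has_good": False, "has_bad": True}, "neg")
train_set = [positive] * 3 + [negative] * 3
held_out = [positive, negative]  # kept out of training

classifier = NaiveBayesClassifier.train(train_set)

# Accuracy on examples that were not used for training.
held_out_accuracy = accuracy(classifier, held_out)

# Probability distribution over labels for a single instance.
instance = {"has_good": True, "has_bad": False}
prediction = classifier.classify(instance)
pos_probability = classifier.prob_classify(instance).prob("pos")

print("Held-out accuracy:", held_out_accuracy)
print("Prediction:", prediction, "with P(pos) =", round(pos_probability, 3))
```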


## Case Study
- Applying NLTK techniques to a real-world dataset.

In [24]:
import nltk
nltk.download('movie_reviews')

# Case Study: Analyzing Movie Reviews
from nltk.corpus import movie_reviews
from nltk import FreqDist, NaiveBayesClassifier
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load movie reviews dataset from NLTK
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents to ensure randomness
import random
random.shuffle(documents)

# Tokenization and stopwords removal
# Build the stopword set once; calling stopwords.words() inside the comprehension is very slow
stop_words = set(stopwords.words("english"))
all_words = [word.lower() for word in movie_reviews.words()]
filtered_words = [word for word in all_words if word.isalpha() and word not in stop_words]

# Extract the 2000 most common words as features
word_features = FreqDist(filtered_words).most_common(2000)
word_features = [word for word, _ in word_features]

# Define a function to extract features from a document
def document_features(document):
    document_words = set(document)
    features = {word: (word in document_words) for word in word_features}
    return features

# Extract features for each document
featuresets = [(document_features(d), c) for (d, c) in documents]

# Split the dataset into a training set and a testing set
train_set, test_set = featuresets[:1600], featuresets[1600:]

# Train a Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Evaluate the classifier on the testing set
# (named test_accuracy so it does not shadow the accuracy() function imported earlier)
test_accuracy = nltk.classify.accuracy(classifier, test_set)
print("Classifier Accuracy:", test_accuracy)

# Example of sentiment analysis
example_text = "This movie is amazing! I loved every moment of it."
tokens = word_tokenize(example_text.lower())
features = document_features(tokens)

sentiment = classifier.classify(features)
print("Sentiment:", sentiment)

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


Classifier Accuracy: 0.8175
Sentiment: neg
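The `neg` label for a clearly positive sentence is surprising. One way to investigate is to list which features a short text actually switches on. With boolean bag-of-words features, a one-line review sets nearly every entry to `False`, and the Naive Bayes classifier also treats those absent words as evidence, which may explain the result. The tiny `word_features` list below is a hypothetical stand-in for the 2000 real features, and the regex tokenizer is likewise an assumption.

```python
import re

# Assumption: a tiny hypothetical vocabulary standing in for the 2000 word_features.
word_features = ["movie", "bad", "plot", "amazing", "boring", "loved", "worst", "script"]

def document_features(document):
    """Same feature scheme as the case study: word -> present or not."""
    document_words = set(document)
    return {word: (word in document_words) for word in word_features}

example_text = "This movie is amazing! I loved every moment of it."
tokens = re.findall(r"[a-z]+", example_text.lower())
features = document_features(tokens)

# Features that are True for this text versus those that are False.
active = [word for word, present in features.items() if present]
inactive_count = len(features) - len(active)

print("Active features:", active)
print("Inactive features:", inactive_count)
```

In the notebook, `classifier.show_most_informative_features()` can additionally show which words the trained model weighs most heavily.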


## Discussion of Challenges and Solutions
- Tokenization Challenges:
  - Challenge: Dealing with tokenization errors, especially in languages with complex grammatical structures.
  - Solution: Use NLTK's more advanced tokenizers like TweetTokenizer or customize tokenization rules based on the specific language or domain.

- Stopwords Removal Challenges:
  - Challenge: Deciding which words to include or exclude from stopwords.
  - Solution: Customize the list of stopwords based on the specific requirements of the analysis. Consider domain-specific or project-specific stopwords.

- Part-of-Speech Tagging Challenges:
  - Challenge: Ambiguity in part-of-speech tagging, especially in context-dependent cases.
  - Solution: Experiment with different POS tagging models and fine-tune as needed. Evaluate the accuracy and performance of the chosen model on the specific type of text.

- Named Entity Recognition (NER) Challenges:
  - Challenge: Handling entities with multiple words or complex structures.
  - Solution: Utilize more advanced NER models, and consider post-processing steps to handle complex entities. NLTK's ne_chunk can be a starting point.

- Sentiment Analysis Challenges:
  - Challenge: Addressing the subjectivity and context dependency of sentiment.
  - Solution: Incorporate more sophisticated sentiment analysis models, such as machine learning classifiers, and consider using pre-trained models. Also, consider incorporating context information to enhance accuracy.

- Stemming and Lemmatization Challenges:
  - Challenge: Overstemming or understemming issues.
  - Solution: Choose an appropriate stemming or lemmatization algorithm based on the characteristics of the text. Evaluate the impact on downstream tasks and fine-tune as needed.

- Handling Large Datasets Challenges:
  - Challenge: Memory and processing constraints when working with large text corpora.
  - Solution: Implement efficient processing strategies, such as batch processing, and consider distributed computing frameworks if applicable. Optimize memory usage and load data incrementally if necessary.

- Machine Learning Model Challenges:
  - Challenge: Finding an appropriate model and dealing with imbalanced datasets.
  - Solution: Experiment with various machine learning models, including ensemble methods. Address imbalanced datasets through techniques such as oversampling, undersampling, or using evaluation metrics suitable for imbalanced classes.

- Generalization Challenges:
  - Challenge: Ensuring that models generalize well to different domains.
  - Solution: Train models on diverse datasets and test on a representative set of data. Use transfer learning or domain adaptation techniques if applicable.

- Interpreting Results Challenges:
  - Challenge: Interpreting and explaining the results of NLP analyses.
  - Solution: Utilize visualization techniques, conduct feature importance analysis, and document the preprocessing steps and model decisions. Provide clear explanations of the limitations and potential biases.
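The point about evaluation metrics for imbalanced classes can be made concrete with a small sketch. The labels below are made up: a classifier that always predicts the majority class reaches high accuracy while never finding the minority class, which per-class precision and recall expose. The helper name `per_class_scores` is hypothetical.

```python
def per_class_scores(gold, predicted, label):
    """Return (precision, recall) for one label; 0.0 when undefined."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_count if predicted_count else 0.0
    recall = true_pos / gold_count if gold_count else 0.0
    return precision, recall

# Assumption: made-up labels, 8 negative and 2 positive examples.
gold = ["neg"] * 8 + ["pos"] * 2
predicted = ["neg"] * 10  # a classifier that always answers "neg"

overall_accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
pos_precision, pos_recall = per_class_scores(gold, predicted, "pos")
neg_precision, neg_recall = per_class_scores(gold, predicted, "neg")

print("Accuracy:", overall_accuracy)
print("pos precision/recall:", pos_precision, pos_recall)
print("neg precision/recall:", neg_precision, neg_recall)
```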

## Resources and Further Learning
- Documentation: https://www.nltk.org/howto.html

# PyThaiNLP

## Introduction to PyThaiNLP
- Introduction to the PyThaiNLP library for natural language processing in Thai.

## Setup

In [25]:
!pip install pythainlp
# for using TF-IDF
!pip install scikit-learn
# for using NER
!pip install python-crfsuite

Collecting pythainlp
  Downloading pythainlp-4.0.2-py3-none-any.whl (13.4 MB)
Installing collected packages: pythainlp
Successfully installed pythainlp-4.0.2
Collecting python-crfsuite
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: python-crfsuite
Successfully installed python-crfsuite-0.9.10


## Basic Text Processing

In [38]:
!pip install attacut

Collecting attacut
  Downloading attacut-1.0.6-py3-none-any.whl (1.3 MB)
Collecting docopt>=0.6.2 (from attacut)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting fire>=0.1.3 (from attacut)
  Downloading fire-0.5.0.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting nptyping>=0.2.0 (from attacut)
  Downloading nptyping-2.5.0-py3-none-any.whl (37 kB)
Collecting ssg>=0.0.4 (from attacut)
  Downloading ssg-0.0.8-py3-none-any.whl (473 kB)
Building wheels for collected packages: docopt, fire
  Building wheel for docopt (setup.py) ...

In [44]:
import pythainlp

text = "วันนี้ตื่นสายจึงเข้าโรงเรียนช้าจากนั้นก็ไปหาขนมกินรูปร่างดอกไม้เพราะเมื่อเช้ายังไม่ได้กินอะไรมากเลย"
tokens = pythainlp.word_tokenize(text,engine='attacut')
print("Tokenized:", tokens)

Tokenized: ['วัน', 'นี้', 'ตื่น', 'สาย', 'จึง', 'เข้า', 'โรง', 'เรียน', 'ช้า', 'จาก', 'นั้น', 'ก็', 'ไป', 'หาขนม', 'กิน', 'รูปร่าง', 'ดอก', 'ไม้', 'เพราะ', 'เมื่อ', 'เช้า', 'ยัง', 'ไม่', 'ได้', 'กิน', 'อะไร', 'มาก', 'เลย']


## Tokenization and Part-of-Speech Tagging

In [45]:
import pythainlp

text = "ในยุคที่ขายของออนไลน์เต็มไปด้วยคู่แข่งมากมายที่ขายสินค้าชนิดเดียวกันหรือคล้ายกัน วิธีที่จะทำให้สินค้าของคุณโดดเด่นเตะตาลูกค้าแถมสร้างความเชื่อมั่นในสรรพคุณได้มากที่สุดนั่นก็คือการรีวิวสินค้า แต่วิธีการรีวิวสินค้าให้น่าสนใจนั้นต้องทำอย่างไรถึงจะสามารถสร้างความมั่นใจให้ลูกค้าจนกระทั่งพวกเขายอมสั่งซื้อสินค้าของคุณได้"
tokens = pythainlp.word_tokenize(text)
pos_tags = pythainlp.pos_tag(tokens)
print("Tokenized:", tokens)
print("Part-of-Speech Tags:", pos_tags)

Tokenized: ['ใน', 'ยุค', 'ที่', 'ขายของ', 'ออนไลน์', 'เต็มไปด้วย', 'คู่แข่ง', 'มากมาย', 'ที่', 'ขาย', 'สินค้า', 'ชนิด', 'เดียวกัน', 'หรือ', 'คล้าย', 'กัน', ' ', 'วิธี', 'ที่จะ', 'ทำให้', 'สินค้า', 'ของ', 'คุณ', 'โดดเด่น', 'เตะตา', 'ลูกค้า', 'แถม', 'สร้าง', 'ความเชื่อมั่น', 'ใน', 'สรรพคุณ', 'ได้', 'มาก', 'ที่สุด', 'นั่น', 'ก็', 'คือ', 'การ', 'รีวิว', 'สินค้า', ' ', 'แต่', 'วิธีการ', 'รีวิว', 'สินค้า', 'ให้', 'น่าสนใจ', 'นั้น', 'ต้อง', 'ทำ', 'อย่างไร', 'ถึง', 'จะ', 'สามารถ', 'สร้าง', 'ความมั่นใจ', 'ให้', 'ลูกค้า', 'จนกระทั่ง', 'พวกเขา', 'ยอม', 'สั่งซื้อ', 'สินค้า', 'ของ', 'คุณ', 'ได้']
Part-of-Speech Tags: [('ใน', 'RPRE'), ('ยุค', 'NCMN'), ('ที่', 'PREL'), ('ขายของ', 'NCMN'), ('ออนไลน์', 'NCMN'), ('เต็มไปด้วย', 'RPRE'), ('คู่แข่ง', 'NCMN'), ('มากมาย', 'ADVN'), ('ที่', 'PREL'), ('ขาย', 'VACT'), ('สินค้า', 'NCMN'), ('ชนิด', 'NCMN'), ('เดียวกัน', 'DDAC'), ('หรือ', 'JCRG'), ('คล้าย', 'VSTA'), ('กัน', 'ADVN'), (' ', 'PUNC'), ('วิธี', 'NCMN'), ('ที่จะ', 'JSBR'), ('ทำให้', 'VACT'), ('สินค้า', '

## Named Entity Recognition (NER)

In [28]:
from pythainlp.tag import NER

text = "ประเทศไทยเป็นประเทศที่มีประชากรมากมาย"
ner = NER("thainer")
entities = ner.tag(text)
print("Entities:", entities)

Corpus: thainer-1.4
- Downloading: thainer-1.4 1.4



Entities: [('ประเทศ', 'B-LOCATION'), ('ไทย', 'I-LOCATION'), ('เป็น', 'O'), ('ประเทศ', 'O'), ('ที่', 'O'), ('มี', 'O'), ('ประชากร', 'O'), ('มากมาย', 'O')]
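The tagger output uses the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below merges the tokens back into whole entities. The tagged list is copied from the output above, and `merge_bio` is a hypothetical helper, not part of PyThaiNLP. Tokens are joined without spaces because Thai does not separate words with spaces.

```python
# Tagged tokens copied from the "Entities" output above.
tagged = [('ประเทศ', 'B-LOCATION'), ('ไทย', 'I-LOCATION'), ('เป็น', 'O'),
          ('ประเทศ', 'O'), ('ที่', 'O'), ('มี', 'O'), ('ประชากร', 'O'),
          ('มากมาย', 'O')]

def merge_bio(tagged_tokens):
    """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    words, entity_type = [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            if words:  # close the previous entity before starting a new one
                entities.append(("".join(words), entity_type))
            words, entity_type = [word], tag[2:]
        elif tag.startswith("I-") and entity_type == tag[2:]:
            words.append(word)
        else:  # "O" or an I- tag that does not continue the open entity
            if words:
                entities.append(("".join(words), entity_type))
            words, entity_type = [], None
    if words:
        entities.append(("".join(words), entity_type))
    return entities

merged = merge_bio(tagged)
print(merged)
```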


## Thai Word Segmentation

In [29]:
import pythainlp

text = "คนไทยมีความภูมิใจในภาษาไทยของตน"
segmented_text = pythainlp.word_tokenize(text, engine="newmm")  # other engines include attacut, deepcut, etc.
print("Segmented Text:", segmented_text)

Segmented Text: ['คนไทย', 'มี', 'ความภูมิใจ', 'ใน', 'ภาษาไทย', 'ของ', 'ตน']


## Information Retrieval with TF-IDF


In [33]:
import pythainlp
from pythainlp.tokenize import word_tokenize
from pythainlp.corpus import thai_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Sample Thai movie dataset
movies = {
    "movie1": "กลุ่มเพื่อนผจญภัยที่น่าตื่นเต้นในป่าสมุทรอเมซอน",
    "movie2": "หุ่นยนต์ที่ถูกพลังงาน AI พัฒนาความรู้สึกคล้ายมนุษย์และสงสัยถึงการมีชีวิต",
    "movie3": "รอมคอมแห่งความฮาตั้งใจในใจกลางกรุงเทพฯ สำรวจความซับซ้อนของความสัมพันธ์ในยุคปัจจุบัน",
    "movie4": "รายลับที่น่าตื่นเต้นเปิดเผยเมื่อนักสืบค้นหาความจริงที่อยู่เบื้องหลังเหตุการณ์ที่ไม่คาดคิด",
}

# Preprocess and tokenize the movie dataset
processed_movies = [" ".join(word_tokenize(desc, engine="newmm")) for desc in movies.values()]

print(processed_movies)

# TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words=list(thai_stopwords()))
tfidf_matrix = vectorizer.fit_transform(processed_movies)

# Function to retrieve movies based on user query using TF-IDF
def retrieve_movies_tfidf(query):
    query_vec = vectorizer.transform([query])
    cosine_similarities = linear_kernel(query_vec, tfidf_matrix).flatten()
    movie_scores = list(zip(movies.keys(), cosine_similarities))
    movie_scores.sort(key=lambda x: x[1], reverse=True)

    return [movie_id for movie_id, score in movie_scores if score > 0]

# Example usage
user_query = input("กรุณากรอกคำค้น: ")

if user_query.lower() in ['exit', 'quit', 'ลาก่อน']:
    print("ลาก่อนครับ")
else:
    retrieved_movies = retrieve_movies_tfidf(user_query)

    if not retrieved_movies:
        print("ไม่พบหนังที่เกี่ยวข้อง")
    else:
        print("หนังที่เกี่ยวข้อง:")
        for movie_id in retrieved_movies:
            print(f"- {movie_id}: {movies[movie_id]}")

['กลุ่ม เพื่อน ผจญภัย ที่ น่าตื่นเต้น ใน ป่า สมุทร อ เม ซอน', 'หุ่นยนต์ ที่ ถูก พลังงาน   AI   พัฒนา ความรู้สึก คล้าย มนุษย์ และ สงสัย ถึง การ มีชีวิต', 'รอม คอม แห่ง ความ ฮา ตั้งใจ ใน ใจกลาง กรุงเทพฯ   สำรวจ ความ ซับซ้อน ของ ความสัมพันธ์ ใน ยุคปัจจุบัน', 'ราย ลับ ที่ น่าตื่นเต้น เปิดเผย เมื่อ นักสืบ ค้นหา ความจริง ที่อยู่ เบื้องหลัง เหตุการณ์ ที่ ไม่ คาดคิด']
กรุณากรอกคำค้น: นักส์บกรุงเทพ
หนังที่เกี่ยวข้อง:
- movie4: รายลับที่น่าตื่นเต้นเปิดเผยเมื่อนักสืบค้นหาความจริงที่อยู่เบื้องหลังเหตุการณ์ที่ไม่คาดคิด
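In the run above, a misspelled query still matched `movie4`. A likely reason is that `TfidfVectorizer` re-tokenizes the space-joined text with its default pattern `(?u)\b\w\w+\b`. Thai vowel and tone marks written above or below a consonant do not count as `\w`, so words are cut into fragments and the query matches on fragments rather than whole words. The sketch below reproduces the effect with plain `re`. The whitespace pattern is one possible alternative, offered as an assumption and not verified against the notebook's output.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative: treat every whitespace-separated chunk as one token.
WHITESPACE_PATTERN = r"\S+"

# Space-joined tokens, in the same form as the entries of processed_movies.
segmented = "นักสืบ ค้นหา ความจริง"

default_tokens = re.findall(DEFAULT_PATTERN, segmented)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, segmented)

print("Default pattern:", default_tokens)        # fragments of words
print("Whitespace pattern:", whitespace_tokens)  # the segmented words
```

If this is the cause, passing `token_pattern=r"\S+"` to `TfidfVectorizer` and segmenting the query with the same `word_tokenize` call before `vectorizer.transform` should keep the PyThaiNLP tokens intact.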


## Word Vector

In [34]:
import pythainlp
from pythainlp.word_vector import WordVector


# Load the default pre-trained word vectors (downloaded on first use)
wv = WordVector()
words = ['ดีไซน์เนอร์', 'พนักงานเงินเดือน', 'หมอ', 'เรือ']
wv.doesnt_match(words)

Corpus: thai2fit_wv
- Downloading: thai2fit_wv 0.1





'เรือ'
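A common way to implement an odd-one-out query is to normalize each word vector, average them, and return the word least similar to that average. The sketch below assumes this approach and uses made-up 2-d vectors; the real thai2fit vectors have many more dimensions, and `odd_one_out` is a hypothetical helper, not the library's code.

```python
import math

# Assumption: made-up 2-d vectors for the four words used above.
toy_vectors = {
    "ดีไซน์เนอร์": (0.9, 0.1),         # designer
    "พนักงานเงินเดือน": (0.8, 0.2),    # salaried employee
    "หมอ": (0.85, 0.15),              # doctor
    "เรือ": (0.1, 0.9),               # boat
}

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return tuple(x / norm for x in vector)

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {word: unit(vectors[word]) for word in words}
    dims = len(next(iter(units.values())))
    mean = tuple(sum(u[i] for u in units.values()) / len(units) for i in range(dims))
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

odd = odd_one_out(list(toy_vectors), toy_vectors)
print(odd)
```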

In [35]:
list_positive = ['ประเทศ', 'ไทย', 'จีน', 'ญี่ปุ่น']
list_negative = []
wv.most_similar_cosmul(list_positive, list_negative)


[('ประเทศจีน', 0.22022424638271332),
 ('เกาหลี', 0.219687357544899),
 ('สหรัฐอเมริกา', 0.21660110354423523),
 ('ประเทศญี่ปุ่น', 0.21205861866474152),
 ('ประเทศไทย', 0.2115921974182129),
 ('เกาหลีใต้', 0.20321202278137207),
 ('อังกฤษ', 0.19610872864723206),
 ('ฮ่องกง', 0.1928885132074356),
 ('ฝรั่งเศส', 0.18383873999118805),
 ('พม่า', 0.18369348347187042)]

In [36]:
list_positive = ['ประเทศ', 'ไทย', 'จีน', 'ญี่ปุ่น']
list_negative = ['อเมริกา']
wv.most_similar_cosmul(list_positive, list_negative)

[('ประเทศไทย', 0.3278158903121948),
 ('เกาหลี', 0.3201899230480194),
 ('ประเทศจีน', 0.31755179166793823),
 ('พม่า', 0.30845439434051514),
 ('ประเทศญี่ปุ่น', 0.306713730096817),
 ('เกาหลีใต้', 0.3003999888896942),
 ('ลาว', 0.2995176911354065),
 ('คนไทย', 0.288502037525177),
 ('เวียดนาม', 0.287837952375412),
 ('ชาวไทย', 0.2848070561885834)]
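The scores rise from about 0.22 to about 0.33 once a negative word is added. This is consistent with the multiplicative combination objective (3CosMul) that `most_similar_cosmul` is named after: similarities to the positive words are multiplied together and divided by the similarities to the negative words. The sketch below assumes that formula, with cosine values shifted into the range 0 to 1 and a small epsilon in the denominator, and uses made-up 2-d vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """3CosMul-style score: product of positive terms over product of negative terms."""
    pos = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    neg = math.prod((1 + cosine(candidate, n)) / 2 for n in negative)  # 1 if empty
    return pos / (neg + eps)

# Assumption: made-up 2-d vectors for illustration.
candidate = (1.0, 0.0)
positive = [(1.0, 0.0), (0.6, 0.8)]
negative = [(0.0, 1.0)]

score_without_negative = cosmul_score(candidate, positive, [])
score_with_negative = cosmul_score(candidate, positive, negative)

print("Without negative:", round(score_without_negative, 4))
print("With negative:", round(score_with_negative, 4))
```

A candidate that is dissimilar to the negative word gets a denominator below 1, which raises its score relative to the query without negatives.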