NLTK is a comprehensive toolkit for natural language processing (NLP) that supports multiple languages. Morphological analysis is only one of the many NLP tasks it covers.

For morphological analysis specifically, NLTK provides stemmers, which reduce words to a base form, and a part-of-speech (POS) tagger, which identifies the grammatical category of each word in a sentence. Beyond these, it offers a wide range of other NLP tools and resources.

1. Tokenization - Breaking text into words, phrases, or sentences.  

- Parses text roughly on whitespace, with punctuation split off as separate tokens

In [1]:
import nltk
nltk.word_tokenize("This is a sentence. It has words.")

['This', 'is', 'a', 'sentence', '.', 'It', 'has', 'words', '.']
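As a rough comparison, the sketch below contrasts a plain whitespace split with NLTK's regex-based `wordpunct_tokenize`, which needs no downloaded model. It only illustrates why tokenization is more than splitting on spaces; it is not a replacement for `word_tokenize`.

```python
from nltk.tokenize import wordpunct_tokenize

text = "This is a sentence. It has words."

# Plain whitespace split: punctuation stays attached to the word
whitespace_tokens = text.split()

# Regex-based tokenizer: punctuation becomes its own token
regex_tokens = wordpunct_tokenize(text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split keeps `'sentence.'` as one token, while the regex tokenizer yields `'sentence'` and `'.'` separately.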

2. POS Tagging - Labeling the part of speech for each word in a sentence.  

- Tokenization + part-of-speech labeling

In [2]:
import nltk
# nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(nltk.word_tokenize("This was a sentence."))

[('This', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]
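The tagger returns a list of `(word, tag)` tuples, so ordinary Python is enough to work with the result. The sketch below reuses the output shown above as a literal list (so no tagger model is required) and filters by Penn Treebank tag prefix; the variable names are illustrative.

```python
from collections import Counter

# Tagged output copied from the cell above, so no tagger model is needed here
tagged = [('This', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]

# Penn Treebank noun tags start with "NN", verb tags with "VB"
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(verbs)
print(tag_counts.most_common(1))
```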

3. Stemming - Reducing words to their root form.

- Strips grammatical affixes and keeps only the stem

In [3]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Print each result; a notebook cell displays only its last expression
print(stemmer.stem("running"))
print(stemmer.stem("quickly"))
print(stemmer.stem("adjustable"))

run
quickli
adjust
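Because the Porter stemmer works by rule-based suffix stripping, it does not recognize irregular forms. A small sketch, using only `PorterStemmer` (no downloads needed), shows regular inflections collapsing to one stem while an irregular past tense is left unchanged:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran"]

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Here `running` and `runs` both reduce to `run`, but `ran` stays `ran`, which is the kind of case lemmatization is meant to handle.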

4. Lemmatization - Similar to stemming but produces a more accurate base form of a word using linguistic rules.

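- Reduces a word to its dictionary form. The lemmatizer treats every input as a noun by default, so for any other word class you must pass the part of speech as a parameter. ran -> run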

In [4]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = "ran"
pos = 'v'
lemma = lemmatizer.lemmatize(word, pos)
print(lemma)
# Output: 'run'

run
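Since the lemmatizer needs a WordNet part of speech (`'n'`, `'v'`, `'a'`, `'r'`) while `nltk.pos_tag` emits Penn Treebank tags, a small mapping is commonly used to connect the two. The helper below, `penn_to_wordnet`, is a hypothetical name and a simplified sketch; it is not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # fall back to noun, the lemmatizer's default

# Tagged tokens in the format produced by nltk.pos_tag
tagged = [('This', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each resulting pair can then be passed to `lemmatizer.lemmatize(word, pos)`.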


5. Named Entity Recognition (NER) - Identifying and classifying named entities in text, such as people, organizations, and locations.

- Tokenize the sentence >> identify the part of speech of each token >> classify entities based on the tagged tokens

In [5]:
import nltk
# nltk.download('maxent_ne_chunker')
# nltk.download('words')
text = "Barack Obama was the 44th President of the United States."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)
print(entities)
# Output: a tree in which named entities are grouped into labeled chunks such as PERSON and GPE


(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  the/DT
  44th/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.)
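The result of `ne_chunk` is an `nltk.tree.Tree` whose labeled subtrees are the entities. The sketch below rebuilds the tree printed above by hand (so the chunker model is not needed) and collects `(label, text)` pairs; `named_entities` is an illustrative name.

```python
from nltk.tree import Tree

# Tree rebuilt by hand from the printed output above
entities = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP')]),
    Tree('PERSON', [('Obama', 'NNP')]),
    ('was', 'VBD'), ('the', 'DT'), ('44th', 'JJ'), ('President', 'NNP'),
    ('of', 'IN'), ('the', 'DT'),
    Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
    ('.', '.'),
])

named_entities = []
for node in entities:
    # Entity chunks are subtrees; ordinary tokens are plain (word, tag) tuples
    if isinstance(node, Tree):
        text = " ".join(word for word, tag in node.leaves())
        named_entities.append((node.label(), text))

print(named_entities)
```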


# CountVectorizer (scikit-learn)

- input : a collection of text documents (a list of sentences, paragraphs, or even entire books)
- function : tokenizes the text and counts the frequency of each token (or n-gram)
- output : a sparse matrix with one row per document and one column per token

1. fit_transform(): takes a corpus of text documents and returns a document-term matrix, which holds the count of each word in each document.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This is the second document.", "And this is the third one."]

vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(corpus)

print(doc_term_matrix.toarray())


[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]]


2. get_feature_names_out(): This method returns the unique words that form the columns of the document-term matrix, in column order. It replaces get_feature_names(), which was deprecated in scikit-learn 1.0 and removed in 1.2.

In [7]:
print(vectorizer.get_feature_names_out().tolist())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




3. vocabulary_: This attribute (not a method) is a dictionary that maps each word to its column index in the document-term matrix.

In [8]:
print(vectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
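One way to use this mapping is to look up the column for a given word and read its counts across documents. The sketch below rebuilds the same vectorizer so it runs on its own, and also shows that `transform` ignores words that were not seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This is the second document.", "And this is the third one."]
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(corpus)

# Look up the column that holds the counts for the word "document"
column = vectorizer.vocabulary_["document"]
counts = doc_term_matrix[:, column].toarray().ravel()
print(column, counts)

# "new" is not in the fitted vocabulary, so it contributes no column
new_row = vectorizer.transform(["This document is new."]).toarray()
print(new_row)
```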


# TruncatedSVD (scikit-learn)
TruncatedSVD is a dimensionality reduction technique that can be used for matrix factorization, topic modeling, and feature extraction. It projects high-dimensional data into a lower-dimensional space while preserving as much of the original variance as possible.

1. fit: Compute the truncated singular value decomposition of the input matrix and store the resulting components, without transforming the data.

2. transform: Apply dimensionality reduction to the input data.

3. fit_transform: Fit the model with the input data and apply dimensionality reduction to it.

4. inverse_transform: Map low-dimensional data back to the original high-dimensional space. The result is an approximation whenever components were discarded.
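The difference between fitting and transforming matters most when new data arrives later. The following sketch, with an illustrative extra sentence that is not part of the original corpus, fits on the three documents and then projects an unseen document using the already learned components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This is the second document.", "And this is the third one."]
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(corpus)

# fit learns the components from the training matrix only
svd = TruncatedSVD(n_components=2, random_state=42)
svd.fit(doc_term_matrix)

# transform projects any matrix with the same columns, including unseen documents
new_docs = vectorizer.transform(["This is the fourth document."])
reduced_new = svd.transform(new_docs)

# inverse_transform maps the reduced row back to the 9 original columns
restored_new = svd.inverse_transform(reduced_new)

print(reduced_new.shape)
print(restored_new.shape)
```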

In this example, we apply TruncatedSVD to the document-term matrix created earlier with CountVectorizer. The code requests 5 components, but because the matrix has only 3 documents (rows), its rank is at most 3, and the output shown has 3 columns. The fit_transform method fits the model and transforms the matrix in a single call, and inverse_transform then maps the reduced data back to the original 9-dimensional space.

In [9]:
from sklearn.decomposition import TruncatedSVD

# doc_term_matrix is the CountVectorizer output from the earlier cell
# Fit TruncatedSVD and reduce the document-term matrix in one call
svd = TruncatedSVD(n_components=5, random_state=42)
dtm_svd = svd.fit_transform(doc_term_matrix)
print(dtm_svd)  # 3 dimensions
print('='*20)
# Map the reduced data back to the original feature space
dtm_restored = svd.inverse_transform(dtm_svd)
print(dtm_restored)  # 9 dimensions

[[ 2.00000000e+00 -7.07106781e-01 -7.07106781e-01]
 [ 2.00000000e+00 -7.07106781e-01  7.07106781e-01]
 [ 2.00000000e+00  1.41421356e+00 -7.28840001e-16]]
[[ 3.16643451e-16  1.00000000e+00  1.00000000e+00  1.00000000e+00
   1.36788191e-16  3.56682708e-16  1.00000000e+00  1.36788191e-16
   1.00000000e+00]
 [ 1.56303455e-16  1.00000000e+00 -1.23869238e-16  1.00000000e+00
   2.21402234e-17  1.00000000e+00  1.00000000e+00  2.21402234e-17
   1.00000000e+00]
 [ 1.00000000e+00 -4.49289833e-16  6.43240206e-18  1.00000000e+00
   1.00000000e+00 -2.39468092e-17  1.00000000e+00  1.00000000e+00
   1.00000000e+00]]
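The reconstruction above matches the original counts up to floating-point noise because three components are enough to represent a matrix with three rows. As a rough check, the sketch below compares the reconstruction error for 3 and 2 components; `reconstruction_error` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This is the second document.", "And this is the third one."]
dense = CountVectorizer().fit_transform(corpus).toarray().astype(float)

def reconstruction_error(n_components):
    """Frobenius norm of (original - reconstruction); hypothetical helper."""
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    reduced = svd.fit_transform(dense)
    restored = svd.inverse_transform(reduced)
    return np.linalg.norm(dense - restored)

error_full = reconstruction_error(3)   # as many components as the matrix rank
error_lossy = reconstruction_error(2)  # one component discarded

print(error_full, error_lossy)
```

With all three components the error is essentially zero, while dropping one component leaves a clearly nonzero error, which is the trade-off dimensionality reduction makes.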
