# Feature Extraction from Japanese Text
Preprocess Japanese text (tokenization and stopword removal), then extract features with Bag of Words and TF-IDF.

In [1]:
!pip install janome

Collecting janome
  Downloading Janome-0.5.0-py2.py3-none-any.whl.metadata (2.6 kB)
Downloading Janome-0.5.0-py2.py3-none-any.whl (19.7 MB)
Installing collected packages: janome
Successfully installed janome-0.5.0


In [2]:
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

In [3]:
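# Sample Japanese text data (English glosses in the trailing comments)
documents = [
    "私は昨日旅行の本を買いました。",      # "I bought a travel book yesterday."
    "彼は毎日カフェオレを飲みます。",      # "He drinks café au lait every day."
    "ワシントンDCはアメリカの首都です。",  # "Washington, D.C. is the capital of the United States."
    "自然言語処理は面白い分野です。"       # "Natural language processing is an interesting field."
]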

In [4]:
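# Preprocessing with Janome: tokenization and stopword removal
tokenizer = Tokenizer()
# Common particles, auxiliaries, and inflection fragments treated as stopwords
stopwords = {'は', 'を', 'です', 'ます', 'の', 'に', 'で', 'と', 'が', 'た', 'し', 'て', 'な', 'い'}

def tokenize(text):
    """Split text into surface forms with Janome and drop stopwords."""
    tokens = tokenizer.tokenize(text)
    words = [token.surface for token in tokens if token.surface not in stopwords]
    return words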

# Build the preprocessed texts (tokens joined with spaces)
preprocessed_docs = [' '.join(tokenize(doc)) for doc in documents]
print(preprocessed_docs)

['私 昨日 旅行 本 買い まし 。', '彼 毎日 カフェオレ 飲み 。', 'ワシントン DC アメリカ 首都 。', '自然 言語 処理 面白い 分野 。']
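The output above still contains the sentence-final `。` in every document, because the stopword set lists only particles and auxiliaries. A minimal sketch of one way to drop such tokens, assuming that "punctuation" means any token made up entirely of characters in a Unicode `P*` category; the helper names `is_punctuation` and `drop_punctuation` are hypothetical, and the token lists are copied from the output above so the block runs without Janome:

```python
import unicodedata

# Token lists copied from the preprocessing output above.
tokenized_docs = [
    ["私", "昨日", "旅行", "本", "買い", "まし", "。"],
    ["彼", "毎日", "カフェオレ", "飲み", "。"],
    ["ワシントン", "DC", "アメリカ", "首都", "。"],
    ["自然", "言語", "処理", "面白い", "分野", "。"],
]

def is_punctuation(token):
    """Return True if every character of the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

def drop_punctuation(tokens):
    """Keep only the tokens that are not pure punctuation."""
    return [t for t in tokens if not is_punctuation(t)]

cleaned_docs = [" ".join(drop_punctuation(tokens)) for tokens in tokenized_docs]
print(cleaned_docs)
```

In this notebook the vectorizers ignore `。` anyway, so the filter mainly matters if the token lists are reused elsewhere.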


In [5]:
# Bag of Words feature extraction
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(preprocessed_docs)
bow_df = pd.DataFrame(bow_features.toarray(), columns=bow_vectorizer.get_feature_names_out())
print("Bag of Words features:")
display(bow_df)

Bag of Words features:


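|   | dc | まし | アメリカ | カフェオレ | ワシントン | 処理 | 分野 | 旅行 | 昨日 | 毎日 | 自然 | 言語 | 買い | 面白い | 飲み | 首都 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |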
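The table above has no columns for the single-character tokens 私, 本, and 彼, and `DC` appears as `dc`. Both follow from `CountVectorizer` defaults: `lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`, which keeps only tokens of two or more word characters. The sketch below emulates that tokenization with the standard `re` module rather than calling scikit-learn, so it is an approximation of the default analyzer, not the library code itself; `emulate_tokens` is a hypothetical helper name.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern
RELAXED_PATTERN = r"(?u)\b\w+\b"    # also accepts single-character tokens

def emulate_tokens(doc, pattern):
    """Approximate the default analyzer: lowercase, then regex-tokenize."""
    return re.findall(pattern, doc.lower())

doc = "私 昨日 旅行 本 買い まし 。"

default_tokens = emulate_tokens(doc, DEFAULT_PATTERN)
relaxed_tokens = emulate_tokens(doc, RELAXED_PATTERN)

print(default_tokens)  # single-character tokens are dropped
print(relaxed_tokens)  # single-character tokens are kept
```

If single-character words should count as features, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` (and to `TfidfVectorizer`) is one option. Doing so changes the vocabulary, so the tables in this notebook would differ.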


In [6]:
# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_docs)
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("TF-IDF features:")
display(tfidf_df)

TF-IDF features:


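|   | dc | まし | アメリカ | カフェオレ | ワシントン | 処理 | 分野 | 旅行 | 昨日 | 毎日 | 自然 | 言語 | 買い | 面白い | 飲み | 首都 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 |
| 2 | 0.5 | 0.0 | 0.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.447214 | 0.0 | 0.0 |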
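The values in the table above can be checked by hand. With `TfidfVectorizer` defaults (`smooth_idf=True`, `norm="l2"`), the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. Every term here occurs once, in exactly one of the four documents, so all terms share the same IDF, and after L2 normalization each nonzero entry reduces to `1 / sqrt(k)`, where `k` is the number of distinct terms kept in that document. The sketch below reproduces this arithmetic in plain Python; the function names are hypothetical, and the term counts are read off the table rather than computed by scikit-learn.

```python
import math

n_docs = 4
# Distinct terms kept per document, read off the table above.
terms_per_doc = [4, 3, 4, 5]

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def l2_normalized_weight(k, idf):
    """Weight of one term in a document of k terms, each with tf = 1."""
    raw = [1 * idf] * k                        # tf * idf for every term
    norm = math.sqrt(sum(w * w for w in raw))  # L2 norm of the row
    return raw[0] / norm

idf = smoothed_idf(n_docs, doc_freq=1)  # identical for every term here
weights = [round(l2_normalized_weight(k, idf), 6) for k in terms_per_doc]
print(weights)
```

Because the IDF cancels out under normalization, TF-IDF adds no information beyond Bag of Words on this corpus. The two representations diverge only once some terms appear in more than one document.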
