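The feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets in formats such as text and images.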

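Feature extraction is very different from feature selection: the former transforms arbitrary data, such as text or images, into numerical features that can be used for machine learning. The latter is a machine learning technique applied to those features.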

In [5]:
# Loading features from dicts
from sklearn.feature_extraction import DictVectorizer
measurements = [
	{'city': 'Dubai', 'temperature': 33.},
	{'city': 'London', 'temperature': 12.},
	{'city': 'San Francisco', 'temperature': 18.},
]
vec = DictVectorizer()
print(repr(vec.fit_transform(measurements).toarray()))
vec.get_feature_names_out()

array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])


array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'],
      dtype=object)

In [7]:
# A value that is a list of strings is expanded into one feature per string
movie_entry = [
	{'category': ['thriller', 'drama'], 'year': 2003},
	{'category': ['animation', 'family'], 'year': 2011},
	{'year': 1974}
]
vec.fit_transform(movie_entry).toarray()

array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])

In [8]:
vec.get_feature_names_out()

array(['category=animation', 'category=drama', 'category=family',
       'category=thriller', 'year'], dtype=object)

In [9]:
# Features not seen during fit ('unseen_feature') are silently ignored
vec.transform({'category': ['thriller'],
	'unseen_feature': '3'}).toarray()

array([[0., 0., 0., 1., 0.]])

## Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.
Pros: fast.
Cons: the transformation is not invertible.
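A minimal sketch of FeatureHasher on dict input, reusing the style of the measurements above. The value `n_features=8` is an assumption chosen only to keep the output small; with so few columns, hash collisions are likely, so real use would pick a much larger value.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features=8 is an illustrative choice, far smaller than the default
hasher = FeatureHasher(n_features=8, input_type='dict')

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
]

# The hasher is stateless: no vocabulary is learned, so transform works without fit
X = hasher.transform(measurements)
print(X.shape)
print(X.toarray())

# Because no feature names are stored, there is no way to map columns back
print(hasattr(hasher, 'inverse_transform'))
```

The output width is fixed by `n_features` regardless of how many distinct feature names appear in the data, which is where the low memory use comes from.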

## Text feature extraction

In [10]:
# Common Vectorizer usage
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
	'This is the first document.',
	'This is the second second document.',
	'And the third one.',
	'Is this the first document?',
	]
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [15]:
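# The default analyzer lowercases the text and keeps only tokens of at least
# two characters, so 'a' is dropped
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.") == (
	['this', 'is', 'text', 'document', 'to', 'analyze'])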

True

In [12]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [13]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [16]:
vectorizer.vocabulary_.get('document')

1
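As with DictVectorizer, words that were not seen during fitting are ignored by later calls to transform. The sketch below refits a CountVectorizer on the same corpus so that it runs on its own; the two new sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# None of these words are in the learned vocabulary, so every count is zero
new_counts = vectorizer.transform(['Something completely new.']).toarray()
print(new_counts)

# Only the known words ('second', 'document') are counted; 'unseen' is ignored
mixed = vectorizer.transform(['A second unseen document.']).toarray()
print(mixed)
```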