## Working with categorical data

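In general, replacing a categorical feature that has N possible levels requires creating N numeric features, each taking the value 1 or 0. This operation is called dummy coding.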
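To make the idea concrete, here is a minimal pure-Python sketch of dummy coding. The helper name `dummy_code` is made up for illustration; it only shows the principle of one 1/0 column per level and is not how any library implements it.

```python
def dummy_code(values):
    """Return (levels, rows): the sorted distinct levels and one 1/0 row per value."""
    levels = sorted(set(values))
    # Each row has a 1 in the column of its own level and 0 everywhere else.
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

weather = ['sunny', 'cloudy', 'snowy', 'rainy', 'foggy']
levels, rows = dummy_code(weather)
print(levels)   # ['cloudy', 'foggy', 'rainy', 'snowy', 'sunny']
print(rows[0])  # [0, 0, 0, 0, 1]  (the row for 'sunny')
```

Every row contains exactly one 1, so the N columns together carry the same information as the original categorical value.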

### pandas makes this easy

In [1]:
import pandas as pd
categorical_feature = pd.Series(['sunny', 'cloudy', 'snowy', 'rainy', 'foggy'])
categorical_feature
mapping = pd.get_dummies(categorical_feature)
mapping

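|   | cloudy | foggy | rainy | snowy | sunny |
|---|--------|-------|-------|-------|-------|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |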


In [2]:
mapping['sunny']

0    1
1    0
2    0
3    0
4    0
Name: sunny, dtype: uint8

### scikit-learn can do this too, but it takes more work

In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
ohe = OneHotEncoder()

levels = ['sunny', 'cloudy', 'snowy', 'rainy', 'foggy']
# Map each level to an integer code
fit_levs = le.fit_transform(levels)
# OneHotEncoder expects a 2D array: one row per sample, one column per feature
ohe.fit(fit_levs.reshape(-1, 1))
print ohe.transform([le.transform(['sunny'])]).toarray()

[[ 0.  0.  0.  0.  1.]]
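The two-step process above (strings to integer codes, then codes to a 1/0 vector) can be sketched without any library. This is an illustration only; the names `code_of` and `one_hot` are hypothetical. It also shows why 'sunny' ends up in the last column: the integer codes follow the sorted order of the levels, which is how `LabelEncoder` assigns them.

```python
levels = ['sunny', 'cloudy', 'snowy', 'rainy', 'foggy']

# Step 1 (label encoding): give each distinct level an integer code, in sorted order.
classes = sorted(set(levels))
code_of = {level: code for code, level in enumerate(classes)}

# Step 2 (one-hot encoding): turn an integer code into a 1/0 vector.
def one_hot(code, n_levels):
    return [1.0 if i == code else 0.0 for i in range(n_levels)]

sunny_vector = one_hot(code_of['sunny'], len(classes))
print(code_of['sunny'])  # 4
print(sunny_vector)      # [0.0, 0.0, 0.0, 0.0, 1.0]
```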


## Working with text data

The most common way to process text is the bag-of-words model.
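As a rough illustration of the idea, the sketch below builds a bag of words by hand: each document becomes a vector of word counts over a shared vocabulary, and word order is discarded. The two toy sentences, the helper name `bag_of_words`, and the tokenisation rule (lowercase, runs of two or more word characters) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def bag_of_words(text):
    """Count how often each lowercase token occurs in the text."""
    return Counter(TOKEN.findall(text.lower()))

docs = ["The sky is dark", "The evening sky, the twilight sky"]
bags = [bag_of_words(doc) for doc in docs]

# The vocabulary is the sorted set of all tokens seen in any document.
vocabulary = sorted(set().union(*bags))
# One row per document, one column per vocabulary word.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)  # ['dark', 'evening', 'is', 'sky', 'the', 'twilight']
print(matrix[1])   # [0, 1, 0, 2, 2, 1]
```

With a realistic vocabulary most entries in each row are zero, which is why such matrices are normally stored in sparse form.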

The example uses the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/). Only a subset is used here: the science topics covering medicine and space.

The downloaded files are stored in a location similar to C:\Users\01009558\scikit_learn_data

In [4]:
from sklearn.datasets import fetch_20newsgroups
categories= ['sci.med', 'sci.space']
twenty_sci_news = fetch_20newsgroups(categories=categories)

In [5]:
print twenty_sci_news.data[0]

From: flb@flb.optiplan.fi ("F.Baube[tm]")
Subject: Vandalizing the sky
X-Added: Forwarded by Space Digest
Organization: [via International Space University]
Original-Sender: isu@VACATION.VENARI.CS.CMU.EDU
Distribution: sci
Lines: 12

From: "Phil G. Fraering" <pgf@srl03.cacs.usl.edu>
> 
> Finally: this isn't the Bronze Age, [..]
> please try to remember that there are more human activities than
> those practiced by the Warrior Caste, the Farming Caste, and the
> Priesthood.

Right, the Profiting Caste is blessed by God, and may 
 freely blare its presence in the evening twilight ..

-- 
* Fred Baube (tm)



In [6]:
twenty_sci_news.filenames

array([ 'C:\\Users\\01009558\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\61116',
       'C:\\Users\\01009558\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.med\\58122',
       'C:\\Users\\01009558\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.med\\58903',
       ...,
       'C:\\Users\\01009558\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60774',
       'C:\\Users\\01009558\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60954',
       'C:\\Users\\01009558\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.med\\58911'],
      dtype='|S98')

In [7]:
print twenty_sci_news.target[0]

1


In [8]:
print twenty_sci_news.target_names[twenty_sci_news.target[0]]

sci.space


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
# Build a feature vector for each document (fit_transform); the result is a sparse matrix
word_count = count_vect.fit_transform(twenty_sci_news.data)
word_count.shape

(1187, 25638)

In [10]:
print word_count[0]

  (0, 10778)	1
  (0, 23849)	1
  (0, 9796)	1
  (0, 12716)	1
  (0, 18586)	1
  (0, 13384)	1
  (0, 5134)	1
  (0, 10785)	1
  (0, 15246)	1
  (0, 11330)	1
  (0, 5148)	1
  (0, 13318)	1
  (0, 18744)	1
  (0, 20110)	1
  (0, 18642)	1
  (0, 3808)	2
  (0, 10188)	1
  (0, 6017)	3
  (0, 24930)	1
  (0, 18474)	1
  (0, 23241)	1
  (0, 23129)	1
  (0, 3191)	1
  (0, 12362)	1
  (0, 15968)	1
  :	:
  (0, 7646)	1
  (0, 24547)	1
  (0, 24415)	1
  (0, 13359)	1
  (0, 20909)	1
  (0, 17235)	1
  (0, 24151)	1
  (0, 13158)	1
  (0, 24626)	1
  (0, 17217)	1
  (0, 8438)	1
  (0, 21686)	2
  (0, 5650)	3
  (0, 10713)	1
  (0, 3233)	1
  (0, 21382)	1
  (0, 23137)	7
  (0, 24461)	1
  (0, 22345)	1
  (0, 23381)	2
  (0, 4762)	2
  (0, 10341)	1
  (0, 17170)	1
  (0, 10501)	2
  (0, 10827)	2


In [11]:
word_list = count_vect.get_feature_names()
for n in word_count[0].indices:
    print "Word:", word_list[n], "appears", word_count[0, n], "times"


Word: fred appears 1 times
Word: twilight appears 1 times
Word: evening appears 1 times
Word: in appears 1 times
Word: presence appears 1 times
Word: its appears 1 times
Word: blare appears 1 times
Word: freely appears 1 times
Word: may appears 1 times
Word: god appears 1 times
Word: blessed appears 1 times
Word: is appears 1 times
Word: profiting appears 1 times
Word: right appears 1 times
Word: priesthood appears 1 times
Word: and appears 2 times
Word: farming appears 1 times
Word: caste appears 3 times
Word: warrior appears 1 times
Word: practiced appears 1 times
Word: those appears 1 times
Word: than appears 1 times
Word: activities appears 1 times
Word: human appears 1 times
Word: more appears 1 times
Word: are appears 1 times
Word: there appears 1 times
Word: that appears 1 times
Word: remember appears 1 times
Word: to appears 1 times
Word: try appears 1 times
Word: please appears 1 times
Word: age appears 1 times
Word: bronze appears 1 times
Word: isn appears 1 times
...

In [12]:
print twenty_sci_news.data[0]

From: flb@flb.optiplan.fi ("F.Baube[tm]")
Subject: Vandalizing the sky
X-Added: Forwarded by Space Digest
Organization: [via International Space University]
Original-Sender: isu@VACATION.VENARI.CS.CMU.EDU
Distribution: sci
Lines: 12

From: "Phil G. Fraering" <pgf@srl03.cacs.usl.edu>
> 
> Finally: this isn't the Bronze Age, [..]
> please try to remember that there are more human activities than
> those practiced by the Warrior Caste, the Farming Caste, and the
> Priesthood.

Right, the Profiting Caste is blessed by God, and may 
 freely blare its presence in the evening twilight ..

-- 
* Fred Baube (tm)



In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vect = TfidfVectorizer(use_idf=False, norm='l1')
word_freq = tf_vect.fit_transform(twenty_sci_news.data)
word_list = tf_vect.get_feature_names()

for n in word_freq[0].indices:
    print "Word:", word_list[n], "has frequency", word_freq[0, n]

Word: fred has frequency 0.010989010989
Word: twilight has frequency 0.010989010989
Word: evening has frequency 0.010989010989
Word: in has frequency 0.010989010989
Word: presence has frequency 0.010989010989
Word: its has frequency 0.010989010989
Word: blare has frequency 0.010989010989
Word: freely has frequency 0.010989010989
Word: may has frequency 0.010989010989
Word: god has frequency 0.010989010989
Word: blessed has frequency 0.010989010989
Word: is has frequency 0.010989010989
Word: profiting has frequency 0.010989010989
Word: right has frequency 0.010989010989
Word: priesthood has frequency 0.010989010989
Word: and has frequency 0.021978021978
Word: farming has frequency 0.010989010989
Word: caste has frequency 0.032967032967
Word: warrior has frequency 0.010989010989
Word: practiced has frequency 0.010989010989
Word: those has frequency 0.010989010989
Word: than has frequency 0.010989010989
Word: activities has frequency 0.010989010989
Word: human has frequency 0.010989010989
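The frequencies printed above are consistent with L1 normalisation: each raw count is divided by the total number of counted tokens in the document, so the values in a row sum to 1. The sketch below reproduces three of them. The total of 91 tokens is an assumption inferred from the output (0.010989010989 is 1/91); the source does not state it.

```python
# A few raw counts taken from the count output for the first document.
counts = {"caste": 3, "and": 2, "fred": 1}

# Assumption: inferred from 1 / 0.010989010989, not stated in the source.
total_tokens = 91

# L1-normalised term frequency: count divided by the document's total count.
freqs = {word: count / float(total_tokens) for word, count in counts.items()}

for word in ["fred", "and", "caste"]:
    print(word + " " + str(round(freqs[word], 12)))
```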

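**TF-IDF** is a statistical measure of how important a term is to one document within a collection or corpus. **A term's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the corpus.**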
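The weighting can be sketched in a few lines of plain Python. This version assumes the smoothed IDF that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. Other TF-IDF variants exist. The function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0 for t, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        # Raw term count times IDF.
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        # L2 normalisation so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

docs = [["the", "caste", "caste", "sky"], ["the", "sky", "is", "dark"], ["the", "moon"]]
weights = tfidf(docs)
for term in sorted(weights[0], key=weights[0].get, reverse=True):
    print(term + " " + str(round(weights[0][term], 3)))
```

In the first toy document, "caste" gets the highest weight because it is repeated there and appears nowhere else, while "the" gets the lowest because it occurs in every document.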

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vect = TfidfVectorizer() # Default: use_idf=True
word_tfidf = tf_vect.fit_transform(twenty_sci_news.data)
word_list = tf_vect.get_feature_names()

for n in word_tfidf[0].indices:
    print "Word:", word_list[n], "has tfidf", word_tfidf[0, n]

Word: fred has tfidf 0.0893604523484
Word: twilight has tfidf 0.139389277822
Word: evening has tfidf 0.113026734241
Word: in has tfidf 0.0239166759663
Word: presence has tfidf 0.118805671173
Word: its has tfidf 0.0614868335851
Word: blare has tfidf 0.150393472236
Word: freely has tfidf 0.118805671173
Word: may has tfidf 0.0543546855668
Word: god has tfidf 0.118805671173
Word: blessed has tfidf 0.150393472236
Word: is has tfidf 0.0255349229448
Word: profiting has tfidf 0.150393472236
Word: right has tfidf 0.0677614245918
Word: priesthood has tfidf 0.144196231323
Word: and has tfidf 0.0491489948076
Word: farming has tfidf 0.144196231323
Word: caste has tfidf 0.43258869397
Word: warrior has tfidf 0.144196231323
Word: practiced has tfidf 0.13214100026
Word: those has tfidf 0.0604689129732
Word: than has tfidf 0.0519193033019
Word: activities has tfidf 0.0906664266256
Word: human has tfidf 0.0844691857132
Word: more has tfidf 0.0464972413341
Word: are has tfidf 0.0346326118207
...

#### N-gram examples

In [15]:
text_1 = "we love data science"
text_2 = "data science is hard"

documents = [text_1, text_2]
documents

['we love data science', 'data science is hard']

In [16]:
# Unigrams
count_vect_1_grams = CountVectorizer(ngram_range=(1,1), stop_words=[], min_df=1)
word_count = count_vect_1_grams.fit_transform(documents)
word_list = count_vect_1_grams.get_feature_names()
print "Word list = ", word_list
print "text_1 is described with", [word_list[n] + "(" + str(word_count[0, n]) + ")" for n in word_count[0].indices]

Word list =  [u'data', u'hard', u'is', u'love', u'science', u'we']
text_1 is described with [u'science(1)', u'data(1)', u'love(1)', u'we(1)']


In [17]:
# Bigrams
count_vect_2_grams = CountVectorizer(ngram_range=(2, 2))
word_count = count_vect_2_grams.fit_transform(documents)
word_list = count_vect_2_grams.get_feature_names()
print "Word list = ", word_list
print "text_1 is described with", [word_list[n] + "(" + str(word_count[0, n]) + ")" for n in word_count[0].indices]

Word list =  [u'data science', u'is hard', u'love data', u'science is', u'we love']
text_1 is described with [u'data science(1)', u'love data(1)', u'we love(1)']


In [18]:
# Unigrams and bigrams combined
count_vect_1_2_grams = CountVectorizer(ngram_range=(1, 2))
word_count = count_vect_1_2_grams.fit_transform(documents)
word_list = count_vect_1_2_grams.get_feature_names()
print "Word list = ", word_list
print "text_1 is described with", [word_list[n] + "(" + str(word_count[0, n]) + ")" for n in word_count[0].indices]

Word list =  [u'data', u'data science', u'hard', u'is', u'is hard', u'love', u'love data', u'science', u'science is', u'we', u'we love']
text_1 is described with [u'data science(1)', u'love data(1)', u'we love(1)', u'science(1)', u'data(1)', u'love(1)', u'we(1)']
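For intuition about what `ngram_range=(1, 2)` produces, here is a small hand-written n-gram generator. It is a sketch only: the helper name `word_ngrams` is made up, and it splits on whitespace, whereas the vectorizer's default tokeniser also drops punctuation and single-character tokens (neither occurs in these two sentences).

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams of the text with min_n <= n <= max_n."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

text_1 = "we love data science"
text_2 = "data science is hard"

print(word_ngrams(text_1, 1, 2))
# ['we', 'love', 'data', 'science', 'we love', 'love data', 'data science']

# The vocabulary is the sorted set of n-grams over all documents.
vocabulary = sorted(set(word_ngrams(text_1, 1, 2)) | set(word_ngrams(text_2, 1, 2)))
print(len(vocabulary))  # 11
```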


#### Using hashing to address performance and complexity issues

In [19]:
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(n_features=1000)
word_hashed = hash_vect.fit_transform(twenty_sci_news.data)
word_hashed.shape

(1187, 1000)
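The hashing trick behind this fixed-width output can be sketched as follows: each token is hashed straight to a column index, so no vocabulary has to be built or stored, at the cost of occasional collisions between different words. This is a simplified illustration, not scikit-learn's implementation: `hashlib.md5` is used as a stand-in for the MurmurHash3 function the library uses, the helper name `hashed_counts` is hypothetical, and the sketch keeps raw counts without the sign flipping and normalisation that `HashingVectorizer` applies by default.

```python
import hashlib

def hashed_counts(tokens, n_features=1000):
    """Map tokens into a fixed-length count vector using a hash of each token."""
    vector = [0] * n_features
    for token in tokens:
        # Stand-in hash (assumption): md5 of the UTF-8 bytes, reduced modulo n_features.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

vec = hashed_counts("we love data science".split(), n_features=16)
print(len(vec))  # 16
print(sum(vec))  # 4
```

Because the column index depends only on the token itself, the same word always lands in the same column, and the vector length stays fixed however many distinct words the corpus contains.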