Working With Text Data
-----------------------

Ref: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## 1. Load data

**Data Description**

* 20,000 newsgroup documents
* partitioned (nearly) evenly across 20 different newsgroups
* here we work with only 4 categories

In [3]:
%%time

# load data
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)




Wall time: 12min 25s


## 1.1 Check data

In [6]:
twenty_train.target_names  # category names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [8]:
len(twenty_train.data)  # number of training documents

2257

In [9]:
len(twenty_train.filenames)  # one file per document, so the same count

2257

In [11]:
twenty_train.data[0]  # raw text of the first document

'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n'

## 1.2 Check target

In [15]:
twenty_train.target  # integer labels, one of 4 classes per document

array([1, 1, 3, ..., 2, 2, 2], dtype=int64)

In [14]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
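
To see how labels are distributed, the integer targets can be mapped to names and tallied. A minimal sketch using only the ten labels printed above (the `first_targets` list is transcribed from that output rather than fetched from the dataset):

```python
from collections import Counter

# Category names in the order reported by twenty_train.target_names.
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# The first ten integer labels, transcribed from the output above.
first_targets = [1, 1, 3, 3, 3, 3, 3, 2, 2, 2]

# Map each integer label to its name and count occurrences.
label_counts = Counter(target_names[t] for t in first_targets)

for name, count in label_counts.most_common():
    print(name, count)
```

Applied to the full `twenty_train.target` array, the same tally would show whether the four classes are balanced.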


## 2. Extract features -- Bag of words

**Bag of words**
* Each word in the vocabulary is treated as one feature, and its count in each document is recorded.
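
The counting step can be illustrated without scikit-learn. The sketch below builds a vocabulary and a dense count matrix for a hypothetical three-document corpus. The `toy_docs` list and the whitespace tokenizer are assumptions for illustration; `CountVectorizer` applies its own tokenization rules and returns a sparse matrix instead.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the 20 newsgroups data).
toy_docs = [
    "god is love",
    "the gpu is fast",
    "the doctor is in",
]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in toy_docs]

# Vocabulary: every distinct word gets a column index (sorted for stability).
all_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

# Count matrix: one row per document, one column per vocabulary word.
counts = []
for tokens in tokenized:
    row = [0] * len(vocabulary)
    for word, n in Counter(tokens).items():
        row[vocabulary[word]] = n
    counts.append(row)

print(len(counts), len(vocabulary))  # (documents, features)
```

Most entries in each row are zero, which is why a sparse representation pays off once the vocabulary grows to tens of thousands of words.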

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [23]:
type(X_train_counts[0])

scipy.sparse.csr.csr_matrix

In [25]:
X_train_counts[0].todense()  # dense view of the first row (mostly zeros)

matrix([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## 3. Extract features -- TF

**TF**
* Raw counts are larger for longer documents simply because they contain more words. To avoid this, use the term frequency (TF) instead.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
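
As a worked instance of the formula, the sketch below computes TF for one hypothetical document (the sample sentence is an assumption for illustration). Note that `TfidfTransformer(use_idf=False)`, as used in this section, keeps its default `norm='l2'`, which scales each row to unit Euclidean length rather than dividing by the total term count; passing `norm='l1'` should correspond to the formula above.

```python
from collections import Counter

# Hypothetical document, tokenized naively on whitespace.
tokens = "the cat sat on the mat".split()

term_counts = Counter(tokens)
total_terms = len(tokens)

# TF(t) = count of t in the document / total number of terms in the document.
tf = {term: count / total_terms for term, count in term_counts.items()}

for term, value in sorted(tf.items()):
    print(f"{term}: {value:.3f}")
```

Because every value is divided by the document length, the TF values of one document sum to 1 regardless of how long the document is.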


In [27]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
# same shape as the bag-of-words count matrix

(2257, 35788)

## 4. Extract features -- TF-IDF

**IDF**

IDF (inverse document frequency) down-weights terms that appear in many documents:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

TF-IDF(t) = TF(t) * IDF(t).
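
A small sketch of the IDF formula on a hypothetical tokenized corpus (the `toy_docs` data is an assumption for illustration). By default, scikit-learn's `TfidfTransformer` uses a smoothed variant, ln((1 + n) / (1 + df(t))) + 1, so its values differ from the plain formula; both are computed here for comparison.

```python
import math

# Hypothetical corpus: each document is a list of tokens.
toy_docs = [
    ["god", "is", "love"],
    ["the", "gpu", "is", "fast"],
    ["the", "doctor", "is", "in"],
]
n_docs = len(toy_docs)

# Document frequency: number of documents containing each term.
doc_freq = {}
for tokens in toy_docs:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Plain IDF: log_e(total documents / documents containing the term).
idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}

# Smoothed variant (TfidfTransformer default, smooth_idf=True).
idf_smooth = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

for term in sorted(idf):
    print(f"{term}: idf={idf[term]:.3f} smoothed={idf_smooth[term]:.3f}")
```

A term such as "is" that occurs in every document gets a plain IDF of 0, while terms found in a single document get the largest weight.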



In [29]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

## 5. Train & predict

In [30]:
%%time

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)


Wall time: 62.1 ms


In [36]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'the graphic is beautiful', 'doctor Wang is PHD']  # new documents to classify
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'the graphic is beautiful' => comp.graphics
'doctor Wang is PHD' => sci.med
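
For intuition about what the classifier does, here is a rough pure-Python sketch of multinomial naive Bayes with Laplace smoothing (`alpha = 1.0`, the default of `MultinomialNB`). The training sentences, the two label names, and the `predict` helper are hypothetical, and the sketch works on raw word counts rather than the TF-IDF features used above, so it is an illustration of the idea and not a reimplementation of scikit-learn.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data: (tokens, label).
train = [
    (["god", "is", "love"], "religion"),
    (["faith", "and", "god"], "religion"),
    (["gpu", "renders", "image"], "graphics"),
    (["image", "on", "the", "gpu"], "graphics"),
]

alpha = 1.0  # Laplace smoothing strength

# Per-class document counts (for priors) and per-class word counts.
class_doc_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)

vocab = {word for tokens, _ in train for word in tokens}


def predict(tokens):
    """Return the class with the highest log posterior for the tokens."""
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / len(train))  # log prior
        for word in tokens:
            if word not in vocab:
                continue  # words outside the training vocabulary are skipped
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(predict(["god", "is", "love"]))
print(predict(["fast", "gpu", "image"]))
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero.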


In [32]:
# check test sample data: one row per document in docs_new
X_new_tfidf.shape

(4, 35788)