# Project 1

Project 1 is **Classification Analysis on Textual Data**, where you extract features from raw texts and try different classification approaches to classify them into topics.

During this discussion, we introduce some key concepts and the usage of relevant Python packages to help you get started with your project.

These explanations are drawn mainly from different sections of the scikit-learn tutorial on text classification, available at http://scikit-learn.org.

### A short introduction to `NumPy`

> `NumPy` is the fundamental package for scientific computing with Python. It contains among other things:

>  - a powerful N-dimensional array object
>  - sophisticated (broadcasting) functions
>  - tools for integrating C/C++ and Fortran code
>  - useful linear algebra, Fourier transform, and random number capabilities
>

<span style="float: right;">--from <a href="http://www.numpy.org/">http://www.numpy.org/</a></span>

- Matlab-like syntax
- Useful and easy to use

In [1]:
import numpy as np

In [2]:
a = np.array(['hello', 'world'])
b = np.array([1, 2, 3])
c = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print a
print '-' * 20
print b
print '-' * 20
print c
print '-' * 20
print a.dtype, b.dtype, c.dtype
print '-' * 20
print c.shape

['hello' 'world']
--------------------
[1 2 3]
--------------------
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
--------------------
|S5 int64 float64
--------------------
(2, 3)


In [3]:
d = np.random.rand(3,4)
d

array([[ 0.32568816,  0.4214556 ,  0.07059667,  0.40779361],
       [ 0.32065374,  0.56665789,  0.41332777,  0.17829782],
       [ 0.41482226,  0.81258396,  0.49000415,  0.30554418]])

In [4]:
d[:,1:3]

array([[ 0.4214556 ,  0.07059667],
       [ 0.56665789,  0.41332777],
       [ 0.81258396,  0.49000415]])

In [5]:
d[0, :2]

array([ 0.32568816,  0.4214556 ])

In [6]:
d + 10

array([[ 10.32568816,  10.4214556 ,  10.07059667,  10.40779361],
       [ 10.32065374,  10.56665789,  10.41332777,  10.17829782],
       [ 10.41482226,  10.81258396,  10.49000415,  10.30554418]])

In [7]:
e = np.ones([3, 2])
f = np.zeros(e.shape)
e

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [8]:
np.vstack([e, f])

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

In [9]:
np.hstack([e, f])

array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.]])

In [10]:
np.sum(e)

6.0

In [11]:
np.sum(e, axis=0)

array([ 3.,  3.])

In [12]:
np.sum(e, axis=1)

array([ 2.,  2.,  2.])

In [13]:
g = np.array([[1, 2], [3, 4], [5, 6]])
g

array([[1, 2],
       [3, 4],
       [5, 6]])

In [14]:
e + g

array([[ 2.,  3.],
       [ 4.,  5.],
       [ 6.,  7.]])

In [15]:
h = np.array([[1],[1]])
h

array([[1],
       [1]])

In [16]:
np.matmul(g, h)

array([[ 3],
       [ 7],
       [11]])
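
The `d + 10` and `e + g` cells above rely on broadcasting, where NumPy stretches the smaller operand to match the larger one. The following is a minimal sketch with made-up values (the names `m`, `row`, and `col` are illustrative only) that contrasts element-wise broadcasting with the matrix rules used by `np.matmul`:

```python
import numpy as np

m = np.arange(6).reshape(3, 2)           # shape (3, 2): [[0, 1], [2, 3], [4, 5]]
row = np.array([10, 20])                 # shape (2,)
col = np.array([[100], [200], [300]])    # shape (3, 1)

# A (2,) vector is broadcast across the 3 rows.
plus_row = m + row

# A (3, 1) column is broadcast across the 2 columns.
plus_col = m + col

# matmul is not element-wise: (3, 2) times (2, 1) gives (3, 1).
prod = np.matmul(m, np.ones((2, 1)))

print(plus_row)
print(plus_col)
print(prod.shape)
```

Element-wise operations need compatible shapes (equal, or one of them 1, along each axis), while `np.matmul` needs the inner dimensions to agree.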

### Dataset
1. In **Project 1** we work with the “20 Newsgroups” dataset. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic.
2. To load the data manually, you need to run this Python code: <a href="https://www.dropbox.com/s/5oek8qbsge1y64b/fetch_data.py?dl=0">link to fetch_data.py</a>
3. The easiest way to load the data is to use the built-in loader for "20 newsgroups" from the scikit-learn package.


In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'comp.sys.mac.hardware']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)



In [None]:
print type(twenty_train)
print twenty_train.keys()

In [5]:
len(twenty_train.data)

1162

In [8]:
print twenty_train.data[0]

From: winstead@faraday.ece.cmu.edu (Charles Holden Winstead)
Subject: ftp site for Radius software???
Organization: Electrical and Computer Engineering, Carnegie Mellon

Hey All,

Does anyone know if I can ftp to get the newest version of Radiusware
and soft pivot from Radius?  I bought a pivot monitor, but it has an
old version of this software and won't work on my C650, and Radius said
it would be 4-5 weeks until delivery.

Thanks!

-Chuck





In [9]:
# list of category indices of the documents
twenty_train.target # twenty_train['target']

array([1, 0, 0, ..., 0, 1, 0])

In [22]:
# so index 0 corresponds to 'comp.graphics'; 1 to 'comp.sys.mac.hardware'
print twenty_train.target_names # twenty_train['target_names']

['comp.graphics', 'comp.sys.mac.hardware']


In [23]:
print twenty_train.target_names[twenty_train.target[0]]

comp.sys.mac.hardware


The documents themselves are loaded into memory in the `data` attribute.

In [24]:
print len(twenty_train.data)
print len(twenty_train.filenames)
print len(twenty_train.target)
print len(twenty_train.target_names)

1162
1162
1162
2
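
Because `target` stores integer indices into `target_names`, converting between the two is plain indexing. A small sketch, assuming a hypothetical six-element array standing in for `twenty_train.target`:

```python
import numpy as np

# Hypothetical stand-ins for twenty_train.target_names / twenty_train.target
toy_target_names = ['comp.graphics', 'comp.sys.mac.hardware']
toy_target = np.array([1, 0, 0, 0, 1, 0])

# Map every integer label to its category name.
toy_labels = [toy_target_names[i] for i in toy_target]

# Count how many documents fall into each category.
toy_counts = np.bincount(toy_target, minlength=len(toy_target_names))

for name, count in zip(toy_target_names, toy_counts):
    print('%s: %d documents' % (name, count))
```

The same `np.bincount` call on the real `twenty_train.target` would show how balanced the two categories are.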


## Extracting features from text files

### CountVectorizer
Convert a collection of text documents to a matrix of token counts. Each row corresponds to a document $d_i$ and each column to a term $t_j$:

$\begin{pmatrix}tf(t_1, d_1) & \cdots & tf(t_m, d_1) \\ tf(t_1, d_2) & \cdots & tf(t_m, d_2) \\ \vdots & \ddots & \vdots \\ tf(t_1, d_n) & \cdots & tf(t_m, d_n) \end{pmatrix}$

$tf(t, d)$: term frequency of term $t$ in the document $d$, i.e. the number of occurrences of term $t$ in the document $d$.
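
To make the matrix concrete, here is a rough sketch that builds it by hand with the standard library only. The two documents are made up, and the token pattern is the default `token_pattern` that `CountVectorizer` reports (words of two or more characters); this is an illustration, not how scikit-learn implements it internally.

```python
import re
from collections import Counter

toy_docs = ['the cat sat', 'the cat saw the dog']

# Same pattern as CountVectorizer's default token_pattern.
token_re = re.compile(r'(?u)\b\w\w+\b')
tokenized = [token_re.findall(doc.lower()) for doc in toy_docs]

# Columns: sorted vocabulary. Rows: one per document.
toy_vocab = sorted(set(tok for toks in tokenized for tok in toks))
tf_matrix = [[Counter(toks)[term] for term in toy_vocab] for toks in tokenized]

print(toy_vocab)
for row in tf_matrix:
    print(row)
```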

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
vectorizer

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

For the detailed documentation of `CountVectorizer`, see

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Note: `fit`, `fit_transform`, and `transform` are methods shared by all kinds of data transformers in `scikit-learn`.

#### Stop words

Stop words are words that are too common to be useful for classification.

In [26]:
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS
print stop_words
print len(stop_words)

frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'nei
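
To see the effect of this list, one can pass `stop_words='english'` to `CountVectorizer`. The sketch below uses a made-up two-sentence corpus (`sw_corpus` is an illustrative name) and compares the learned vocabulary with and without stop-word filtering:

```python
from sklearn.feature_extraction.text import CountVectorizer

sw_corpus = ['This is the first document.', 'And the third one.']

plain_cv = CountVectorizer().fit(sw_corpus)
filtered_cv = CountVectorizer(stop_words='english').fit(sw_corpus)

# vocabulary_ maps each kept term to its column index.
print(sorted(plain_cv.vocabulary_))
print(sorted(filtered_cv.vocabulary_))
```

Terms such as `the` and `is` appear in the first vocabulary but are dropped from the second.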

#### Demo

In [16]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# `fit_transform(corpus)` is equivalent to `fit(corpus)` then `transform(corpus)`
X = vectorizer.fit_transform(corpus)
X 

<4x9 sparse matrix of type '<type 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [28]:
# feature names are terms
vectorizer.get_feature_names()

[u'and',
 u'document',
 u'first',
 u'is',
 u'one',
 u'second',
 u'the',
 u'third',
 u'this']

In [29]:
# use `toarray()` to convert sparse matrices to ordinary matrices (multi-dim arrays)
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
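
The vectorizer returns a sparse matrix because most entries of a term-count matrix are zero. As a small sketch, assuming SciPy is available (scikit-learn depends on it), the dense array above can be converted back to Compressed Sparse Row format to inspect what is actually stored:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_counts = np.array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
                         [0, 1, 0, 1, 0, 2, 1, 0, 1],
                         [1, 0, 0, 0, 1, 0, 1, 1, 0],
                         [0, 1, 1, 1, 0, 0, 1, 0, 1]])

sparse_counts = csr_matrix(dense_counts)

# Only nonzero entries are stored.
print(sparse_counts.nnz)

# Fraction of entries that are nonzero.
density = sparse_counts.nnz / float(dense_counts.size)
print(density)

# Total number of counted tokens per document.
tokens_per_doc = np.asarray(sparse_counts.sum(axis=1)).ravel()
print(tokens_per_doc)
```

On the real dataset the density is far lower, which is why calling `toarray()` on the full matrix should be done with care.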

In [32]:
test_corpus = [
    'Another random document.'
]

# Use `transform` instead of `fit_transform` here, to only count
# terms that are in the vocabulary of the training dataset
Y = vectorizer.transform(test_corpus)
# Here 'Another' and 'random' are just ignored
Y.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 0]])

In [21]:
count_vect = CountVectorizer(min_df=5)
# count_vect = CountVectorizer(stop_words='english')
# count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.7)
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape # 1162 docs; with min_df=5 only terms found in at least 5 docs are kept

(1162, 3955)

In [18]:
X_test_counts = count_vect.transform(twenty_test.data)
X_test_counts.shape # 774 test docs; columns follow the vocabulary learned from the training set

(774, 19610)
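
An integer `min_df` is a document-frequency threshold: a term is kept only if it occurs in at least that many documents. A short sketch on the four-sentence demo corpus (re-declared here as `demo_corpus` so the block is self-contained) shows how the vocabulary shrinks as the threshold grows:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vocab_sizes = {}
for threshold in (1, 2, 3):
    cv = CountVectorizer(min_df=threshold).fit(demo_corpus)
    vocab_sizes[threshold] = len(cv.vocabulary_)
    print(threshold, sorted(cv.vocabulary_))
```

Note that `second` occurs twice but in only one document, so it is already removed at `min_df=2`.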

The list of feature names returned by `get_feature_names()` can be used as the mapping from column index to feature name.

The converse mapping, from feature name to column index, is stored in the `vocabulary_` attribute of the vectorizer:

In [28]:
print count_vect.get_feature_names()[500:520]
print '-' * 20
print count_vect.vocabulary_.get('circuitry')
print count_vect.get_feature_names()[820]

[u'assume', u'assumed', u'assuming', u'at', u'athena', u'athens', u'ati', u'att', u'attached', u'attempt', u'attempted', u'attendees', u'attention', u'attributes', u'au', u'audio', u'august', u'austin', u'australia', u'australian']
--------------------
820
circuitry
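
Since `vocabulary_` is an ordinary dictionary, the index-to-term direction can also be recovered by inverting it. This is a sketch on a made-up two-document corpus; in scikit-learn 1.0 and later the list-returning method is named `get_feature_names_out()`, so inverting the dictionary is one way to stay independent of the installed version.

```python
from sklearn.feature_extraction.text import CountVectorizer

pet_cv = CountVectorizer().fit(['the cat sat', 'the dog barked'])

# term -> column index
term_to_index = pet_cv.vocabulary_

# column index -> term, obtained by inverting the dictionary
index_to_term = {int(idx): term for term, idx in term_to_index.items()}

for idx in sorted(index_to_term):
    print(idx, index_to_term[idx])

# Terms outside the vocabulary have no column.
print(term_to_index.get('unicorn'))
```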


### TFIDF

The TF×IDF score is used to describe "how important a word is to a document in a collection or corpus" (from [Wikipedia - tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)).

$$TF\times IDF(t,d)=tf(t,d)\times idf(t)$$
<hr>
$$idf(t)=\log\left(\frac{n}{df(t)}\right)+1$$

- $tf(t, d)$: term frequency of term $t$ in the document $d$.


- $idf(t)$: inverse document frequency of term $t$ across the document dataset.
  - $n$: total number of documents in the dataset.
  - $df(t)$: number of documents that contain the term $t$.
  - Intuition: words that appear in all documents carry little information for classification.
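
The formula can be checked by hand on the count matrix from the `CountVectorizer` demo. The sketch below follows the definition above literally, using NumPy only. Note that `TfidfTransformer` with its default settings smooths the idf as $\log\frac{1+n}{1+df(t)}+1$ and then L2-normalizes each row, so its output will not match these numbers exactly.

```python
import numpy as np

# tf(t, d): counts for the four demo documents, columns in vocabulary order
# (and, document, first, is, one, second, the, third, this)
tf_counts = np.array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
                      [0, 1, 0, 1, 0, 2, 1, 0, 1],
                      [1, 0, 0, 0, 1, 0, 1, 1, 0],
                      [0, 1, 1, 1, 0, 0, 1, 0, 1]])

n_docs = tf_counts.shape[0]

# df(t): number of documents containing each term
doc_freq = (tf_counts > 0).sum(axis=0)

# idf(t) = log(n / df(t)) + 1
idf = np.log(n_docs / doc_freq.astype(float)) + 1

# TFxIDF(t, d) = tf(t, d) * idf(t), broadcast across rows
tfidf_manual = tf_counts * idf

print(doc_freq)
print(idf.round(3))
print(tfidf_manual.round(3))
```

The term `the` occurs in every document, so its idf is exactly 1, the smallest possible value, while a term that appears in a single document gets the largest weight.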

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

# recall that X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print X_train_tfidf.shape
print '-' * 20
print X_train_counts.toarray()[:30,:5]
print '-' * 20
print X_train_tfidf.toarray()[:30,:5]

(1162, 19610)
--------------------
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
--------------------
[[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        

## Training a classifier

Let's train a classifier to predict the category of a post.

In [35]:
# We use a Naive Bayes classifier as an example
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [36]:
docs_new = ['He is an OS developer', 'OpenGL on the GPU is fast']

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'He is an OS developer' => comp.sys.mac.hardware
'OpenGL on the GPU is fast' => comp.graphics
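
The count, tf-idf, and classification steps always run in the same order, so they can be chained with scikit-learn's `Pipeline`. This is a sketch on made-up toy documents and labels (everything named `toy_*` is hypothetical), not on the newsgroup data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = graphics, label 1 = mac hardware
toy_train_docs = [
    'opengl rendering of 3d polygons',
    'image format conversion and rendering',
    'powerbook battery and monitor cable',
    'apple monitor for the quadra',
]
toy_train_labels = [0, 0, 1, 1]
toy_names = ['comp.graphics', 'comp.sys.mac.hardware']

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# fit() runs fit_transform on each transformer, then fits the classifier.
text_clf.fit(toy_train_docs, toy_train_labels)

# predict() runs transform on each transformer, then predicts.
toy_new_docs = ['opengl polygons', 'powerbook battery']
toy_predicted = text_clf.predict(toy_new_docs)

for doc, label in zip(toy_new_docs, toy_predicted):
    print('%r => %s' % (doc, toy_names[label]))
```

With a pipeline, new documents go through `transform` rather than `fit_transform` automatically, which avoids accidentally refitting the vocabulary on test data.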


## Dimensionality reduction

NMF: Non-negative matrix factorization
$$\min_{W, H}\|X - WH\|_F^2 \quad \text{s.t. } W \ge 0,\ H \ge 0$$
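
To give a feel for this objective, here is a rough NumPy sketch that minimizes it with the classic multiplicative update rules on a small random non-negative matrix. This is an assumption made for illustration: scikit-learn's `NMF` uses its own solvers, and the data, sizes, and names (`X_toy`, `W_toy`, `H_toy`) are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.rand(6, 5)      # hypothetical non-negative data: 6 docs x 5 terms
k = 2                       # number of components
W_toy = rng.rand(6, k)
H_toy = rng.rand(k, 5)
eps = 1e-10                 # avoids division by zero

def frobenius_objective(X, W, H):
    # ||X - WH||_F^2
    return np.linalg.norm(X - W.dot(H), 'fro') ** 2

initial_error = frobenius_objective(X_toy, W_toy, H_toy)

for _ in range(200):
    # Multiplying by non-negative ratios keeps W and H non-negative.
    H_toy *= W_toy.T.dot(X_toy) / (W_toy.T.dot(W_toy).dot(H_toy) + eps)
    W_toy *= X_toy.dot(H_toy.T) / (W_toy.dot(H_toy).dot(H_toy.T) + eps)

final_error = frobenius_objective(X_toy, W_toy, H_toy)

print(initial_error)
print(final_error)
print(W_toy.shape, H_toy.shape)
```

Here `W_toy` plays the role of the reduced document representation (one row per document, `k` columns) and `H_toy` holds the components (one row per component, one column per term).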

In [37]:
from sklearn.decomposition import NMF

model = NMF(n_components=50, init='random', random_state=0)
W_train = model.fit_transform(X_train_tfidf)

In [38]:
print W_train.shape

(1162, 50)


In [39]:
H = model.components_
H.shape

(50, 19610)

In [40]:
H

array([[ 0.01737734,  0.        ,  0.        , ...,  0.        ,
         0.01977479,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.01326934,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.01061516,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.02491158,  0.        ,  0.00064097, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03186904,  0.04901559,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [41]:
W_test = model.transform(X_new_tfidf)

clf = MultinomialNB().fit(W_train, twenty_train.target)

predicted = clf.predict(W_test)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'He is an OS developer' => comp.sys.mac.hardware
'OpenGL on the GPU is fast' => comp.graphics
