# Project 1

Project 1 is about text classification: you extract features from raw text and try different classification approaches to assign documents to topics.

In this discussion, we introduce key concepts and the usage of relevant Python packages to help you get started with the project.

This explanation draws mainly on different sections of the scikit-learn tutorial on text classification, available at http://scikit-learn.org.

### A short introduction to `NumPy`

In [61]:
import numpy as np
a = [1,2,3]
print(a)

[1, 2, 3]


In [62]:
a = np.array(['hello', 'worlds']) # this is a 1D array
b = np.array([1, 2, 3])
c = np.array([[1.0, 2.0, 3.0],[4.0, 5.0, 6.0]]) # this is a 2D array
print(a)
print('-' * 20)
print(b)
print('-' * 20)
print(c)
print('-' * 20)
print(a.dtype, b.dtype, c.dtype)
print('-' * 20)
print(c.shape) # shape is an attribute, not a method; the same holds in pandas (.shape, not .shape())

['hello' 'worlds']
--------------------
[1 2 3]
--------------------
[[1. 2. 3.]
 [4. 5. 6.]]
--------------------
<U6 int64 float64
--------------------
(2, 3)


In [4]:
d = np.random.rand(3,4) # random 3x4 matrix; see the GloVe part of Project 1
d
#type(d)

array([[0.17127942, 0.71862073, 0.65462346, 0.85386897],
       [0.74969474, 0.88460675, 0.63936699, 0.04346669],
       [0.39354624, 0.65591155, 0.09739239, 0.97920701]])

In [5]:
d[:,1:3]

array([[0.71862073, 0.65462346],
       [0.88460675, 0.63936699],
       [0.65591155, 0.09739239]])

In [6]:
d[0,:2]

array([0.17127942, 0.71862073])

In [66]:
d + 10 # broadcasting: 10 is added to every element of d
a = [i + 10 for i in [1,2,4]] # list-comprehension equivalent; not recommended for arrays
a

[11, 12, 14]
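The `d + 10` expression above relies on broadcasting: when shapes differ, NumPy stretches the smaller operand along the missing or length-1 axes instead of requiring an explicit loop. The following is a minimal sketch; the names `m`, `row`, and `col` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])     # shape (3,)
col = np.array([[100], [200]])   # shape (2, 1)

plus_scalar = m + 10             # the scalar is applied to every element
plus_row = m + row               # the row vector is reused for each of the 2 rows
plus_col = m + col               # the column vector is reused for each of the 3 columns

print(plus_scalar)
print(plus_row)
print(plus_col)
```

Broadcasting only works when the trailing dimensions are equal or one of them is 1; for example, adding a shape `(2,)` array to `m` would raise a `ValueError`.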

In [68]:
e = np.ones([3, 2])
f = np.zeros(e.shape)
f

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [11]:
np.vstack([e, f])

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [12]:
np.hstack([e, f])

array([[1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.]])

In [13]:
np.sum(e)

6.0

In [69]:
np.sum(e, axis=0) # axis=0 sums down the rows (one sum per column); axis=1 sums across the columns (one sum per row)

array([3., 3.])

In [15]:
np.sum(e, axis=1)

array([2., 2., 2.])
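Because `e` contains only ones, the two axis sums above are easy to confuse. A small sketch with distinct values (the array `m` is illustrative, not part of the notebook) may make the axis convention easier to see: the axis you pass is the one that disappears from the result.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)

col_sums = np.sum(m, axis=0)     # collapses the row axis: one value per column
row_sums = np.sum(m, axis=1)     # collapses the column axis: one value per row

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```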

In [70]:
g = np.array([[1, 2], [3, 4], [5, 6]])
g

array([[1, 2],
       [3, 4],
       [5, 6]])

In [71]:
e + g

array([[2., 3.],
       [4., 5.],
       [6., 7.]])

In [72]:
h = np.array([[1],[1]])
h

array([[1],
       [1]])

In [74]:
np.matmul(g, h)

array([[ 3],
       [ 7],
       [11]])

### Dataset
1. In this discussion, we work with the “20 Newsgroups” dataset as an alternative to the custom dataset; many of the functions will be the same. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic.
2. To load the data manually, you need to run this Python code: <a href="https://www.dropbox.com/s/5oek8qbsge1y64b/fetch_data.py?dl=0">link to fetch_data.py</a>
3. The easiest way to load the data is to use the built-in 20 Newsgroups dataset loader from the scikit-learn package.


In [75]:
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'comp.sys.mac.hardware']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

In [76]:
print(type(twenty_train))
print(twenty_train.keys())

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [79]:
print(twenty_train.data[0])

From: winstead@faraday.ece.cmu.edu (Charles Holden Winstead)
Subject: ftp site for Radius software???
Organization: Electrical and Computer Engineering, Carnegie Mellon

Hey All,

Does anyone know if I can ftp to get the newest version of Radiusware
and soft pivot from Radius?  I bought a pivot monitor, but it has an
old version of this software and won't work on my C650, and Radius said
it would be 4-5 weeks until delivery.

Thanks!

-Chuck





In [80]:
# list of category indices of the documents
twenty_train.target # twenty_train['target']

array([1, 0, 0, ..., 0, 1, 0])

In [82]:
# so index 0 corresponds to 'comp.graphics'; 1 to 'comp.sys.mac.hardware'
print(twenty_train.target_names) # twenty_train['target_names']; for the custom dataset, look at the column titled root_label instead

['comp.graphics', 'comp.sys.mac.hardware']


In [83]:
print(twenty_train.target_names[twenty_train.target[0]]) # in pandas, unique() lists the distinct labels of a column

comp.sys.mac.hardware


The files themselves are loaded in memory in the `data` attribute.

In [84]:
print(len(twenty_train.data)) # in pandas, use DataFrame.shape; checking shapes is especially useful in deep learning, where matrix dimensions are easy to lose track of
print(len(twenty_train.filenames))
print(len(twenty_train.target))
print(len(twenty_train.target_names))

1162
1162
1162
2


## Extracting features from text files

### CountVectorizer
`CountVectorizer` converts a collection of text documents to a matrix of token counts.

In [85]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1) # CountVectorizer is a class; this creates an instance
vectorizer

CountVectorizer()

For the detailed documentation of `CountVectorizer`, see

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

For regular expressions, see:
https://docs.python.org/2/library/re.html

#### Stop words

Stop words are words that are too common to be useful in classification.

In [29]:
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS
print(stop_words)
print(len(stop_words))

frozenset({'mill', 'wherein', 'whose', 'until', 'thereupon', 'become', 'whole', 'still', 'of', 'however', 'sixty', 'never', 'whom', 'describe', 'but', 'she', 'am', 'thick', 'bottom', 'down', 'get', 'all', 'none', 'will', 'towards', 'full', 'therein', 'or', 'somehow', 'couldnt', 'everyone', 'cry', 'an', 'beforehand', 're', 'have', 'is', 'bill', 'my', 'rather', 'enough', 'amount', 'fire', 'once', 'so', 'how', 'were', 'your', 'herein', 'six', 'throughout', 'nowhere', 'who', 'very', 'elsewhere', 'themselves', 'mostly', 'onto', 'toward', 'next', 'which', 'around', 'must', 'whereas', 'beside', 'less', 'back', 'anyway', 'co', 'yourselves', 'well', 'under', 'us', 'yours', 'seem', 'part', 'herself', 'cannot', 'found', 'between', 'due', 'whereafter', 'top', 'both', 'more', 'many', 'fifteen', 'interest', 'together', 'whoever', 'fifty', 'hence', 'thence', 'whether', 'may', 'neither', 'you', 'yourself', 'again', 'as', 'below', 'them', 'would', 'could', 'some', 'latterly', 'alone', 'except', 'twenty

In [86]:
corpus = [
    'This is the first document. I was went to store.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'Arash Shadi and Pavan'
]
X = vectorizer.fit_transform(corpus) # you need a batch of documents from the corpus to extract reasonable features, especially in memory-less settings
X 

<5x16 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

In [33]:
# feature names
vectorizer.get_feature_names() # removed in scikit-learn 1.2; use get_feature_names_out() there

['and',
 'arash',
 'document',
 'first',
 'is',
 'one',
 'pavan',
 'second',
 'shadi',
 'the',
 'third',
 'this']

In [35]:
X.toarray()

array([[0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 2, 0, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]])

The converse mapping from feature name to column index is stored in the `vocabulary_` attribute of the vectorizer:

In [90]:
# count_vect = CountVectorizer()
#count_vect = CountVectorizer(stop_words='english')
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.7)
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(1162, 6075)

In [88]:
X_test_counts = count_vect.transform(twenty_test.data)
X_test_counts.shape

(774, 6075)

In [41]:
print(count_vect.get_feature_names()[:20]) # removed in scikit-learn 1.2; use get_feature_names_out() there
print('-' * 20)
print(count_vect.get_feature_names()[5161])
print(count_vect.vocabulary_.get('circuitry'))

['00', '000', '0010580b', '01', '02', '020', '0200', '03', '030', '04', '040', '05', '06', '060493101758', '0608', '07', '08', '09', '0953', '0x100']
--------------------
ssd
1313
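
The relationship between feature names, column indices, and the count matrix can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: it assumes `CountVectorizer`'s default behaviour of lowercasing and keeping tokens of two or more word characters, and the names `docs`, `vocabulary`, and `counts` are made up for the example.

```python
import re
from collections import Counter

docs = ['This is the first document.',
        'This is the second second document.']

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase, tokens of at least two word characters).
token_pattern = re.compile(r'(?u)\b\w\w+\b')
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Feature names in sorted order; the vocabulary maps each name to its column index.
feature_names = sorted({tok for toks in tokenized for tok in toks})
vocabulary = {term: idx for idx, term in enumerate(feature_names)}

# Count matrix: one row per document, one column per feature.
counts = []
for toks in tokenized:
    freq = Counter(toks)
    counts.append([freq[term] for term in feature_names])

print(vocabulary)
print(counts)
```

Here `feature_names[i]` plays the role of the feature-name list and `vocabulary[term]` plays the role of `vocabulary_`: one goes from column index to term, the other from term to column index.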


### TFIDF

$$\mathrm{TF\text{-}IDF}(t,d)=tf(t,d)\times idf(t)$$
<hr>
$$idf(t)=\log\left(\frac{n}{df(t)}\right)+1$$

- $tf(t, d)$: term frequency of term $t$ in the document $d$.


- $idf(t)$: inverse document frequency of term $t$ across the document dataset.
  - $n$: total number of documents in the dataset.
  - $df(t)$: number of documents that contain the term $t$.
  - Intuition: words that appear in all documents are useless in classification.
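
The formula above can be worked through by hand on a toy count matrix. The sketch below uses plain Python and hypothetical data (`terms` and `counts` are invented for illustration); it assumes the natural logarithm and raw counts as $tf$. Note that `TfidfTransformer` with default settings will not reproduce these numbers exactly, since it defaults to `smooth_idf=True` (which computes $\log\frac{1+n}{1+df(t)}+1$) and L2-normalizes each row.

```python
import math

# Hypothetical toy count matrix: rows are documents, columns are terms.
terms = ['graphics', 'mac', 'the']
counts = [
    [2, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 3, 1],
]
n = len(counts)  # total number of documents

# df(t): number of documents in which term t appears at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# idf(t) = log(n / df(t)) + 1 (natural logarithm assumed).
idf = [math.log(n / d) + 1 for d in df]

# tf-idf(t, d) = tf(t, d) * idf(t), with tf taken as the raw count.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in counts]

for term, d, w in zip(terms, df, idf):
    print(term, d, round(w, 4))
```

The term `'the'` occurs in every document, so $df = n$ and its $idf$ takes the minimum value of 1, while the rarer terms receive a larger weight.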

In [91]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)
print('-' * 20)
print(X_train_counts.toarray()[:30,:5])
print('-' * 20)
print(X_train_tfidf.toarray()[:30,:5])

(1162, 6075)
--------------------
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 2]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
--------------------
[[0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.0823884 ]
 [0.         0.         0.      

### Training a classifier

Let's train a classifier to predict the category of a post.

In [92]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [93]:
docs_new = ['He is an OS developer', 'OpenGL on the GPU is fast', 'I like Apple Mojave OS', "This discussion was full of weird stuff."]

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'He is an OS developer' => comp.sys.mac.hardware
'OpenGL on the GPU is fast' => comp.graphics
'I like Apple Mojave OS' => comp.sys.mac.hardware
'This discussion was full of weird stuff.' => comp.graphics
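
The three steps used above (counting, TF-IDF weighting, and classification) must be applied in the same order to any new text. One way to keep them together is scikit-learn's `Pipeline`. The sketch below is self-contained and uses a hypothetical toy corpus in place of `twenty_train`; the step names (`'counts'`, `'tfidf'`, `'nb'`) are arbitrary labels chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for twenty_train.data / twenty_train.target.
train_docs = [
    'opengl rendering of 3d graphics images',
    'image processing and graphics algorithms',
    'apple mac hardware and monitor upgrade',
    'new mac powerbook hardware problems',
]
train_labels = [0, 0, 1, 1]  # 0 = graphics, 1 = mac hardware

text_clf = Pipeline([
    ('counts', CountVectorizer(stop_words='english')),  # raw text -> token counts
    ('tfidf', TfidfTransformer()),                      # counts -> TF-IDF weights
    ('nb', MultinomialNB()),                            # TF-IDF -> class prediction
])
text_clf.fit(train_docs, train_labels)

# predict() runs transform() on each step, then the classifier.
predicted = text_clf.predict(['graphics rendering', 'mac hardware'])
print(predicted)
```

With the pipeline, `fit` learns the vocabulary and IDF weights from the training documents only, and `predict` reuses them, which mirrors the `fit_transform` / `transform` split used in the cells above.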


### Dimensionality reduction

In [58]:
from sklearn.decomposition import NMF

model = NMF(n_components=50, init='random', random_state=0)
W_train = model.fit_transform(X_train_tfidf)



In [46]:
print(W_train.shape)

(1162, 50)


In [47]:
H = model.components_
H.shape

(50, 6075)
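
The shapes above follow from what NMF computes: it approximates the non-negative matrix $X$ (documents by terms) as a product $WH$, where $W$ (documents by components) holds the reduced representation of each document and $H$ (components by terms) describes each component as weights over terms. A minimal sketch on a small random matrix, with illustrative sizes that are not from the notebook:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.rand(6, 8)  # hypothetical non-negative "document-term" matrix

model = NMF(n_components=3, init='random', random_state=0, max_iter=1000)
W = model.fit_transform(X)  # (6, 3): each document expressed in 3 components
H = model.components_       # (3, 8): each component as weights over the 8 terms

print(W.shape, H.shape)
print(np.linalg.norm(X - W @ H))  # Frobenius norm of the reconstruction error
```

Because both factors are non-negative, the reduced features in `W` can still be passed to `MultinomialNB`, which expects non-negative inputs.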

In [48]:
H

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.60343694e-03, 0.00000000e+00, 2.51869020e-05, ...,
        0.00000000e+00, 0.00000000e+00, 2.86312774e-03],
       [1.67247737e-02, 2.23575367e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [6.17835488e-03, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.27013999e-03, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [59]:
W_test = model.transform(X_new_tfidf)

clf = MultinomialNB().fit(W_train, twenty_train.target)


predicted = clf.predict(W_test)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'He is an OS developer' => comp.graphics
'OpenGL on the GPU is fast' => comp.graphics
'I like Apple Mojave OS' => comp.sys.mac.hardware


In [50]:
W_test

array([[0.00000000e+00, 6.01507454e-03, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.87760905e-03, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 1.47126358e-03, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.23838481e-03,
        0.00000000e+00, 2.40511541e-04, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 3.38612023e-03, 3.13733187e-03,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        9.69731116e-04, 0.00000000e+00, 2.20627340e-03, 0.00000000e+00,
        2.16066187e-03, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 3.74761017e-03, 0.00000000e+00, 6.02403561e-04,
        0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00,

In [51]:
W_test.shape

(2, 50)