# Document Classification Lab

* Author: ThuanLe
* Email: thuanle@hcmut.edu.vn

* Target: Given a collection of documents, classify each document by topic.

* Steps:
    * Step 1: crawl and parse the documents to prepare the data.
    * Step 2: convert the documents into feature vectors.
    * Step 3: use a classification algorithm to classify the data.

## Prerequisites

### Text processing library

* Install the library and its dependencies:

    * nltk

```bash
pip3 install nltk
```

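Then run the following in Python to download the `punkt` tokenizer data (recent NLTK releases may ask for `punkt_tab` instead):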
```python
import nltk
nltk.download('punkt')
```

In [5]:
import nltk
sentence = u"Chúng ta thường nói đến Rau sạch, Rau an toàn để phân biệt với các rau bình thường bán ngoài chợ."
nltk.word_tokenize(sentence)

['Chúng',
 'ta',
 'thường',
 'nói',
 'đến',
 'Rau',
 'sạch',
 ',',
 'Rau',
 'an',
 'toàn',
 'để',
 'phân',
 'biệt',
 'với',
 'các',
 'rau',
 'bình',
 'thường',
 'bán',
 'ngoài',
 'chợ',
 '.']

### Vietnamese classification library
* For Vietnamese text, the `underthesea` library provides a ready-made topic classifier. Install it with:
```bash
pip3 install underthesea
pip3 install Cython
pip3 install future scipy numpy scikit-learn
pip3 install -U fasttext --no-cache-dir --no-deps --force-reinstall
```

Download the pretrained models with:
```bash
underthesea data
```

*Note*: If you install on macOS and hit the error `clang: error: unsupported option '-fopenmp'`, try pointing the build at Homebrew's GCC instead of Apple's clang.

```bash
brew install gcc
# Adjust the version suffix (7 here) to match the GCC version Homebrew installed
export CXX=/usr/local/bin/g++-7
export CC=/usr/local/bin/gcc-7
pip3 install underthesea -U
```

In [8]:
from underthesea.classification import classify
classify(sentence)

['Doi song']

## Prepare the data

### Sample data

* The sample data is the 20 Newsgroups dataset, loaded through scikit-learn.

In [1]:
from sklearn.datasets import fetch_20newsgroups
sample_data_train = fetch_20newsgroups(subset='train', shuffle=True)
print(f"Keys: {sample_data_train.keys()}")
print(f"Description: {sample_data_train.description}")
print(f"Data size: {len(sample_data_train.data)}")
print(f"Target name:\n{sample_data_train.target_names}")

Keys: dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
Description: the 20 newsgroups by date dataset
Data size: 11314
Target name:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [32]:
# Quick look at the first 10 documents
sample_data_train.data[:10]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [16]:
sample_data_train.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [2]:
sample_data_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [39]:
# Topic names of the first 10 documents
[sample_data_train.target_names[x] for x in sample_data_train.target[:10]]

['rec.autos',
 'comp.sys.mac.hardware',
 'comp.sys.mac.hardware',
 'comp.graphics',
 'sci.space',
 'talk.politics.guns',
 'sci.med',
 'comp.sys.ibm.pc.hardware',
 'comp.os.ms-windows.misc',
 'comp.sys.mac.hardware']

In [3]:
for x in sample_data_train.target[:10]:
    print(sample_data_train.target_names[x])

rec.autos
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.graphics
sci.space
talk.politics.guns
sci.med
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
comp.sys.mac.hardware


## Convert documents to feature vectors

In this step we will use [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to create feature vectors.
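
For reference, the tf-idf variant computed by the manual functions later in this lab can be written as follows. The symbols are introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$. scikit-learn applies a slightly different smoothing and normalizes each row, so its values will not match these exactly.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \log \frac{N}{1 + n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$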

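### Create tokens

Each document is first split into tokens, and the tokens are counted. The listing below is the printed sparse count matrix (the cell that produced it is not shown). Each entry reads `(document index, token index)  count`.
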
  (0, 86580)	1
  (0, 128420)	1
  (0, 35983)	1
  (0, 35187)	1
  (0, 66098)	1
  (0, 114428)	1
  (0, 78955)	1
  (0, 94362)	1
  (0, 76722)	1
  (0, 57308)	1
  (0, 62221)	1
  (0, 128402)	2
  (0, 67156)	1
  (0, 123989)	1
  (0, 90252)	1
  (0, 63363)	1
  (0, 78784)	1
  (0, 96144)	1
  (0, 128026)	1
  (0, 109271)	1
  (0, 51730)	1
  (0, 86001)	1
  (0, 83256)	1
  (0, 113986)	1
  (0, 37565)	1
  :	:
  (11313, 87626)	1
  (11313, 30044)	1
  (11313, 76377)	1
  (11313, 119714)	1
  (11313, 47982)	1
  (11313, 28146)	2
  (11313, 88363)	2
  (11313, 56283)	1
  (11313, 111695)	1
  (11313, 90252)	1
  (11313, 51730)	1
  (11313, 68766)	1
  (11313, 89860)	1
  (11313, 80638)	1
  (11313, 4605)	1
  (11313, 76032)	1
  (11313, 89362)	1
  (11313, 90379)	1
  (11313, 64095)	1
  (11313, 95162)	1
  (11313, 87620)	1
  (11313, 111322)	1
  (11313, 85354)	1
  (11313, 50527)	2
  (11313, 56979)	2
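
Each `(document index, token index)` pair in a count matrix like the one above can be mapped back to a token through the vectorizer's `vocabulary_` attribute. The following is a minimal sketch on a toy corpus; the three documents and the names `index_to_token` and `third_doc` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = ["the car is red", "the car is fast", "red red wine"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)

# vocabulary_ maps token -> column index; invert it to look tokens up by index
index_to_token = {int(idx): tok for tok, idx in count_vect.vocabulary_.items()}

# Walk the non-zero entries and keep those of the third document (row 2)
coo = counts.tocoo()
third_doc = {
    index_to_token[int(j)]: int(c)
    for i, j, c in zip(coo.row, coo.col, coo.data)
    if i == 2
}
print(third_doc)
```

With the default settings, `CountVectorizer` lowercases the text and drops single-character tokens, so the vocabulary can be smaller than the raw word list.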


### Create tf-idf vector

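In [None]:
"""
Manual tf-idf calculation.
Each `blob` is expected to expose a `.words` list (e.g. a TextBlob object).
"""
import math

def tf(word, blob):
    # Term frequency: share of the document's words equal to `word`
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Number of documents whose word list contains `word`
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # Inverse document frequency; the +1 avoids division by zero
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)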
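
To see the manual formulas at work, here is a small worked sketch. It assumes plain token lists instead of objects with a `.words` attribute, so that it runs without extra dependencies; the toy documents are made up for illustration.

```python
import math

# Plain token lists stand in for blob objects here (an assumption made
# to keep the sketch dependency-free).
docs = [
    "the car is red".split(),
    "the car is fast".split(),
    "red wine and cheese".split(),
    "cheese is cheap".split(),
]

def tf(word, tokens):
    # Share of the document's tokens equal to `word`
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    # log(N / (1 + number of documents containing `word`))
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / (1 + containing))

def tfidf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

# Score every distinct word of the second document
scores = {w: tfidf(w, docs[1], docs) for w in set(docs[1])}
best = max(scores, key=scores.get)
print(best, scores)
```

Words that occur in few documents score highest. With this smoothing, a word that occurs in almost every document gets an idf of zero or below, so it contributes nothing useful to the vector.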



In [5]:
"""
Using the scikit-learn library
"""
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Translate text to token counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(sample_data_train.data)

# use_idf=False yields normalized term frequencies only (no idf weighting)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(11314, 130107)
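
Note that `TfidfTransformer(use_idf=False)` produces normalized term frequencies rather than full tf-idf weights. The sketch below compares both settings on a toy corpus (the documents are made up for illustration). It shows that the shape stays the same while the weights differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical, for illustration only)
docs = ["the car is red", "the car is fast", "red wine"]
counts = CountVectorizer().fit_transform(docs)

# Term frequencies only vs. term frequencies weighted by idf
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts)
tf_idf = TfidfTransformer(use_idf=True).fit_transform(counts)

# Same shape either way; only the weights differ
print(tf_only.shape, tf_idf.shape)

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.sqrt(np.asarray(tf_idf.multiply(tf_idf).sum(axis=1)).ravel())
print(row_norms)
```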

## Classify data

In [6]:
from sklearn import neighbors
n_neighbors = 5

# k-nearest-neighbors classifier trained on the term-frequency vectors
clf = neighbors.KNeighborsClassifier(n_neighbors)
clf.fit(X_train_tf, sample_data_train.target)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [11]:
test_data = ["hello is this me you looking for", "Theresa May is on the verge of publicly blaming Russia for the attempted murder of Sergei and Yulia Skripal and ordering expulsions and sanctions against President Putin’s regime. An announcement could come as early as today after a meeting of the government’s National Security Council"]
clf.predict(test_data)

ValueError: Expected 2D array, got 1D array instead:
array=['hello is this me you looking for'
 'Theresa May is on the verge of publicly blaming Russia for the attempted murder of Sergei and Yulia Skripal and ordering expulsions and sanctions against President Putin’s regime. An announcement could come as early as today after a meeting of the government’s National Security Council'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
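
The `ValueError` above arises because `clf` was fitted on feature vectors, while `test_data` still holds raw strings. New documents have to pass through the same fitted vectorizer and transformer before prediction. The following is a minimal self-contained sketch on a toy corpus; the documents, labels, and `n_neighbors=1` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (hypothetical): 0 = autos, 1 = space
train_docs = [
    "the engine and the wheels of the car",
    "car engine repair",
    "the rocket reached orbit",
    "orbit of the space station",
]
train_labels = [0, 0, 1, 1]

# Fit the vectorizer and transformer on the training documents
count_vect = CountVectorizer()
tf_transformer = TfidfTransformer(use_idf=False)
X_train = tf_transformer.fit_transform(count_vect.fit_transform(train_docs))

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)

# Reuse the fitted objects with transform() only (never fit again on new data)
new_docs = ["my car engine is loud", "a station in orbit"]
X_new = tf_transformer.transform(count_vect.transform(new_docs))
predicted = clf.predict(X_new)
print(predicted)
```

Chaining these steps in a `Pipeline` removes the need to call each transform by hand.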

In [6]:
from sklearn import neighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
n_neighbors = 5

# Chain vectorizer, tf-idf weighting and classifier so raw text can be passed in
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', neighbors.KNeighborsClassifier(n_neighbors))
                    ])
text_clf.fit(sample_data_train.data, sample_data_train.target)

In [14]:
test_predict_result=text_clf.predict(test_data)
test_predict_result

array([ 1, 11])

In [15]:
[sample_data_train.target_names[x] for x in test_predict_result]

['comp.graphics', 'sci.crypt']
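
Two hand-picked sentences say little about how good the classifier is. One way to measure it is to predict on held-out labeled documents and compute the accuracy. The sketch below does this on a toy corpus (documents, labels, and `n_neighbors=1` are made up for illustration). In this lab the held-out documents could instead come from `fetch_20newsgroups(subset='test')`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy labeled corpus (hypothetical, for illustration only)
train_docs = [
    "the car engine needs repair",
    "new wheels for my car",
    "the rocket reached orbit",
    "astronauts aboard the space station",
]
train_labels = ["autos", "autos", "space", "space"]

test_docs = ["car engine noise", "rocket to the space station"]
test_labels = ["autos", "space"]

# n_neighbors=1 because the toy training set has only four documents
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", KNeighborsClassifier(n_neighbors=1)),
])
text_clf.fit(train_docs, train_labels)

# Fraction of held-out documents whose predicted topic matches the true one
predicted = text_clf.predict(test_docs)
accuracy = accuracy_score(test_labels, predicted)
print(accuracy)
```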

## Exercises

Given a data file of records from the TuoiTre newspaper, build a classifier that predicts the topic of each document.

Note:
* File `data-trim.csv` is a sample of the data.
* File [tuoitre.csv](https://drive.google.com/file/d/0ByBWHzMQ2OtGd3VackZ3T1V4eVk/view?usp=sharing) is the full dataset.