# Document classification lab

* Prepared by **thuanle@hcmut.edu.vn**


## Introduction


In this lab, we try to classify a text into categories. For example, the document

```
Theresa May is on the verge of publicly blaming Russia for the attempted murder of Sergei and Yulia Skripal and ordering expulsions and sanctions against President Putin’s regime. An announcement could come as early as today after a meeting of the government’s National Security Council
```

can be categorized as being about `politics`.

## Prepare the data

First, we need to download the data. Fortunately, *scikit-learn* provides [**The 20 newsgroups text dataset**](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html), which includes the raw texts and their labels.

In [1]:
from sklearn.datasets import fetch_20newsgroups
sample_data_train = fetch_20newsgroups(subset='train', shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>

Let's investigate the dataset.

In [2]:
print(f"Keys: {sample_data_train.keys()}")

NameError: name 'sample_data_train' is not defined

In [3]:
print(f"Description: {sample_data_train.description}")
print(f"Data size: {len(sample_data_train.data)}")
from pprint import pprint
pprint(sample_data_train.target_names)

NameError: name 'sample_data_train' is not defined

In [4]:
len(sample_data_train.data), sample_data_train.target.shape

NameError: name 'sample_data_train' is not defined

Let's take a peek into the data

In [5]:
sample_data_train.data[:10]

NameError: name 'sample_data_train' is not defined

In [6]:
[(x, sample_data_train.target_names[x]) for x in sample_data_train.target[:10]]

NameError: name 'sample_data_train' is not defined

## Extract feature vectors

In this section, we extract feature vectors from the texts, then feed those features into the learning system.

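You already know the [**kNN** classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), so we will use it later. We will use **TF-IDF** vectors as the extracted feature vectors. Note that the cell below passes `use_idf=False`, so the resulting features are L2-normalized term frequencies (TF only); set `use_idf=True` to apply the IDF weighting as well.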

**Note**: You should distinguish between *kNN classifier* and *kNN regressor*.
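To make the distinction concrete, here is a minimal sketch on made-up one-dimensional data (the arrays and variable names are illustrative assumptions, not part of the lab data). The classifier returns the majority label among the nearest neighbours, while the regressor returns the mean of their target values.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical toy data: four points on a line
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_class = [0, 0, 1, 1]          # discrete labels for classification
y_value = [0.0, 1.0, 2.0, 3.0]  # continuous targets for regression

knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_class)
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_value)

# The three nearest neighbours of 2.6 are the points 3.0, 2.0 and 1.0
query = [[2.6]]
class_pred = knn_clf.predict(query)  # majority vote over labels 1, 1, 0
value_pred = knn_reg.predict(query)  # mean of targets 3.0, 2.0, 1.0

print(class_pred)  # [1]
print(value_pred)  # [2.]
```

Document classification predicts a discrete category, so the classifier is the appropriate choice for this lab.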

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_count = count_vect.fit_transform(sample_data_train.data)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)
X_train_tf.shape

NameError: name 'sample_data_train' is not defined
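The same extraction steps can be tried on a tiny corpus without downloading anything. The following is a sketch only: the three toy sentences and all `toy_*` names are assumptions made for illustration. It shows the count matrix, the TF-only weighting used in the cell above (`use_idf=False`), and the full TF-IDF weighting for comparison.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for sample_data_train.data
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: raw term counts, one row per document, one column per distinct word
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)
print(sorted(toy_vect.vocabulary_))  # the learned vocabulary
print(toy_counts.shape)              # (3, 10)

# Step 2a: term frequencies only, each row scaled to unit L2 norm
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts)

# Step 2b: TF-IDF, which also down-weights words found in many documents
toy_tfidf = TfidfTransformer(use_idf=True).fit_transform(toy_counts)

# Both variants normalize every document vector to length 1
print(np.linalg.norm(toy_tf.toarray(), axis=1))
print(np.linalg.norm(toy_tfidf.toarray(), axis=1))
```

Because the rows are normalized, documents of very different lengths become comparable, which matters for a distance-based method such as kNN.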

## Training model

We use the [**kNN** classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to train the model.

In [43]:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train_tf, sample_data_train.target)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Done.

Now, let's try to predict the topics of two sentences:

```
hello is this me you looking for
```

```
Theresa May is on the verge of publicly blaming Russia for the attempted murder of Sergei and Yulia Skripal and ordering expulsions and sanctions against President Putin’s regime. An announcement could come as early as today after a meeting of the government’s National Security Council
```

In [44]:
test_data = [
    "hello is this me you looking for", 
    "Theresa May is on the verge of publicly blaming Russia for the attempted murder of Sergei and Yulia Skripal and ordering expulsions and sanctions against President Putin’s regime. An announcement could come as early as today after a meeting of the government’s National Security Council"
]
clf.predict(test_data)

ValueError: ignored

**ERROR?** What happened?
Can you guess?

## Predict 

### Convert test data to TF-IDF vectors

In [45]:
X_test_count = count_vect.transform(test_data)
X_test_tfidf = tf_transformer.transform(X_test_count)
X_test_tfidf.shape

(2, 130107)

### Predict value

In [46]:
test_label = clf.predict(X_test_tfidf)
[(x, sample_data_train.target_names[x]) for x in test_label]

[(12, 'sci.electronics'), (18, 'talk.politics.misc')]
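Every new text has to go through the same count and weighting steps before `predict` is called. One way to avoid forgetting this is scikit-learn's `Pipeline`, which chains the vectorizer, the transformer, and the classifier so that raw strings can be passed in directly. The sketch below uses a tiny made-up corpus and labels (assumptions for illustration only), with `n_neighbors=1` because the corpus has only four documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy training set; the lab itself uses the 20 newsgroups data
train_texts = [
    "the government passed a new election law",
    "the president met the parliament about the election",
    "the team won the football match",
    "the player scored a goal in the match",
]
train_labels = ["politics", "politics", "sports", "sports"]

# Chain the three steps: counts -> TF-IDF weights -> kNN classifier
text_clf = Pipeline([
    ("count", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
text_clf.fit(train_texts, train_labels)

# Raw strings are vectorized inside the pipeline before reaching the classifier
predicted = text_clf.predict([
    "the election law was debated",
    "a great goal by the player",
]).tolist()
print(predicted)  # ['politics', 'sports']
```

With the pipeline, the vocabulary learned during `fit` is reused automatically at prediction time, so the test vectors always have the same number of columns as the training vectors.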

## Exercises

Given a data file of records from the TuoiTre newspaper, your task is to build a classifier that predicts each document's topic.

Note:
* File [tuoitre.csv](https://drive.google.com/file/d/0ByBWHzMQ2OtGd3VackZ3T1V4eVk/view?usp=sharing) is the full data.
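A possible starting point for the exercise is sketched below. The layout of `tuoitre.csv` is not described here, so the snippet uses hypothetical in-memory records instead of reading the file; replace `texts` and `topics` with the columns you load from it. It holds out part of the data and reports accuracy on the held-out part. `TfidfVectorizer` is used as a shortcut that combines `CountVectorizer` and `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in records; swap in the texts and topics from tuoitre.csv
texts = [
    "the government passed a new election law",
    "the president met the parliament about the election",
    "voters chose a new parliament in the election",
    "the minister announced a new government policy",
    "the team won the football match",
    "the player scored a goal in the match",
    "the coach praised the team after the match",
    "the striker scored twice for the football team",
]
topics = ["politics"] * 4 + ["sports"] * 4

# Keep 25% of the records aside, preserving the topic proportions
X_train, X_test, y_train, y_test = train_test_split(
    texts, topics, test_size=0.25, stratify=topics, random_state=0
)

# Vectorizer and classifier are fitted on the training part only
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

# Score the model on records it has never seen
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Evaluating on held-out records gives a more honest estimate than checking predictions on the training texts, which kNN can trivially memorize.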

## Extra

### Tokenizer library for text classification


* Install the library and its dependencies

    * nltk
    
```bash
pip3 install nltk
```

Then run within Python:
```python
import nltk
nltk.download('punkt')
```

In [49]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [50]:
import nltk
sentence = u"Theresa May is on the verge of publicly blaming Russia for the attempted murder of Sergei and Yulia Skripal and ordering expulsions and sanctions against President Putin’s regime. An announcement could come as early as today after a meeting of the government’s National Security Council"
nltk.word_tokenize(sentence)

['Theresa',
 'May',
 'is',
 'on',
 'the',
 'verge',
 'of',
 'publicly',
 'blaming',
 'Russia',
 'for',
 'the',
 'attempted',
 'murder',
 'of',
 'Sergei',
 'and',
 'Yulia',
 'Skripal',
 'and',
 'ordering',
 'expulsions',
 'and',
 'sanctions',
 'against',
 'President',
 'Putin',
 '’',
 's',
 'regime',
 '.',
 'An',
 'announcement',
 'could',
 'come',
 'as',
 'early',
 'as',
 'today',
 'after',
 'a',
 'meeting',
 'of',
 'the',
 'government',
 '’',
 's',
 'National',
 'Security',
 'Council']
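A tokenizer like this can be plugged into the feature extraction step through the `tokenizer` argument of `CountVectorizer`, which accepts any callable that maps a string to a list of tokens. The sketch below uses a small regex-based function, `simple_tokenize`, as a hypothetical stand-in so that it runs without downloading NLTK data; passing `nltk.word_tokenize` instead should work the same way.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def simple_tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize: words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

docs = ["President Putin’s regime."]

# Default behaviour: tokens need at least two word characters
default_vect = CountVectorizer()
default_vect.fit(docs)
default_vocab = sorted(default_vect.vocabulary_)

# Custom tokenizer: token_pattern=None because the pattern is no longer used
custom_vect = CountVectorizer(tokenizer=simple_tokenize, token_pattern=None)
custom_vect.fit(docs)
custom_vocab = sorted(custom_vect.vocabulary_)

print(default_vocab)  # ['president', 'putin', 'regime']
print(custom_vocab)   # also keeps '’', 's' and '.'
```

Whether punctuation and one-letter tokens help or hurt depends on the data, so it is worth comparing both vocabularies on your own corpus.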

### Vietnamese classification lib


* Another classifier is the `underthesea` library.

```
pip3 install underthesea
pip3 install Cython
pip3 install future scipy numpy scikit-learn
pip3 install -U fasttext --no-cache-dir --no-deps --force-reinstall
```

Download the model with:

```
underthesea data
```

*Note*: If you install on macOS and hit the error `clang: error: unsupported option '-fopenmp'`, you can try pointing the compiler to the GCC installed by Homebrew.

```
brew install gcc g++
export CXX=/usr/local/bin/g++-7
export CC=/usr/local/bin/gcc-7
pip3 install underthesea -U
```

In [61]:
"""
Run this cell to set up the environment on Google Colab
"""

!pip install underthesea
!pip install Cython
!pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
!pip install joblib

!underthesea data

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/a4/86/ff826211bc9e28d4c371668b30b4b2c38a09127e5e73017b1c0cd52f9dfa/fasttext-0.8.3.tar.gz (73kB)
Installing collected packages: fasttext
  Found existing installation: fasttext 0.8.3
    Uninstalling fasttext-0.8.3:
      Successfully uninstalled fasttext-0.8.3
  Running setup.py install for fasttext ... done
Successfully installed fasttext-0.8.3
Component 'classification.fasttext.model' is already existed.


In [59]:
from underthesea.classification import classify
sentence = u"Chúng ta thường nói đến Rau sạch, Rau an toàn để phân biệt với các rau bình thường bán ngoài chợ."
classify(sentence)

['Doi song']