# Create a Machine Learning Model for Unsupervised Text Classification
## Introduction
In this notebook

# Setup - Credentials and MH topic
In order to insert your credentials into the notebook, open the `Connections` tab to the right.
* Use the notebook menu bar to open the `Data` panel on the right.
* Select the `Connections` tab.

#### Cloud Object Storage (COS) credentials

* Click in the empty cell below and select <font color=blue>__insert to code__</font> for your __Cloud Object Storage__ service.
* <font color=red>Change the *name*</font> of the COS credentials variable to __cos_credentials__.
* The dictionary variable should contain the following keys: *iam_url, api_key, resource_instance_id, and url*.
* Add an <font color=red>*additional key*</font> to the variable called __endpoint__.
   * Get the the endpoint value from your Bluemix [Dashboard](https://console.bluemix.net/dashboard/apps) Cload Object Stora service's page.
      * Select `Endpoint` from the menu on the left.
      * Choose the __PUBLIC__ endpoint for your *location* (for example, *us-geo*).
      * Prefix the endpoint value with "https://"

Your variable should look like this...
<div class="alert alert-block alert-success">
```
cos_credentials = {
  'iam_url':'<YOUR-VALUE>',
  'api_key':'<YOUR-VALUE>',
  'resource_instance_id':'<YOUR-VALUE>',
  'url':'<YOUR-VALUE>',

  'endpoint':'https://<YOUR-VALUE>'
}```

Choose a COS bucket name and object names for your packaged model (.gz) and topic-terms (.csv) files. 
<font color=red>Be sure that the bucket already exists!</font>

In [None]:
model_bucket_name = 'pyml'
model_object_name = 'LDA_news.model.pkg.gz'
topic_object_name = 'LDA_news.topic_terms.csv'

#### Message Hub (MH) credentials

* Click in the empty cell below and select <font color=blue>__insert to code__</font> for your __Message Hub__ service.
* <font color=red>Change the *name*</font> of the MH credentials variable to __mh_credentials__.
* The dictionary variable should contain at least the following keys: *username, password and brokers*.

Your variable should look like this...
<div class="alert alert-block alert-success">
```
mh_credentials = {
  'password':"""<YOUR-VALUE>""",
  'brokers':'<YOUR-VALUE>',
  'api_key':'<YOUR-VALUE>',
  'username':'<YOUR-VALUE>'
}```

Choose a Message Hub topic name.
<font color=red>Be sure that the topic already exists!</font>

In [1]:
mh_topic = 'pyml'

In [3]:
# RAANON
cos_credentials_prod_ki = {
  'iam_url':'https://iam.ng.bluemix.net/oidc/token',
  'api_key':'0s-JWmaDBwiSd_yWJqenoKRBfTVU5Rgkz31CDT5WgoWQ',
  'resource_instance_id':'crn:v1:bluemix:public:cloud-object-storage:global:a/db0d062d2b4c0836e18618a5222d8068:22e3b946-6154-4032-8e8f-7cfb0b429602::',
  'url':'https://s3-api.us-geo.objectstorage.service.networklayer.com',
      "endpoint":"https://s3-api.us-geo.objectstorage.softlayer.net",
}
cos_credentials_stage1_wd = {
  'iam_url':'https://iam.stage1.ng.bluemix.net/oidc/token',
  'api_key':'xhjheSC7AhSLtvapSDnbyFn17uWUqW5ccAOuHhQxnnEY',
  'resource_instance_id':'crn:v1:staging:public:cloud-object-storage:global:a/68a66698d275aeb48097f868957ab2ed:bbb5aa36-5525-4000-b129-bcb780195098::',
  'url':'https://s3-api.us-geo.objectstorage.uat.service.networklayer.com',
    'endpoint':'https://s3.us-west.objectstorage.uat.softlayer.net'
}

mh_credentials_stage1_2s = {
  "instance_id": "81b7462e-7707-44c1-8bfa-8c9490ac8111",
  "mqlight_lookup_url": "https://mqlight-lookup-stage1.messagehub.services.us-south.bluemix.net/Lookup?serviceId=81b7462e-7707-44c1-8bfa-8c9490ac8111",
  "api_key": "phXq2H0NSDQNSCdKGJrEFTSnVHjgH8ugpChw1LgNbL3Sr23g",
  "kafka_admin_url": "https://kafka-admin-stage1.messagehub.services.us-south.bluemix.net:443",
  "kafka_rest_url": "https://kafka-rest-stage1.messagehub.services.us-south.bluemix.net:443",
  "kafka_brokers_sasl": [
    "kafka04-stage1.messagehub.services.us-south.bluemix.net:9093",
    "kafka05-stage1.messagehub.services.us-south.bluemix.net:9093",
    "kafka03-stage1.messagehub.services.us-south.bluemix.net:9093",
    "kafka01-stage1.messagehub.services.us-south.bluemix.net:9093",
    "kafka02-stage1.messagehub.services.us-south.bluemix.net:9093"
  ],
  "user": "phXq2H0NSDQNSCdK",
  "password": "GJrEFTSnVHjgH8ugpChw1LgNbL3Sr23g",
  "username": "phXq2H0NSDQNSCdK",
  "brokers": 'kafka04-stage1.messagehub.services.us-south.bluemix.net:9093,kafka05-stage1.messagehub.services.us-south.bluemix.net:9093,kafka03-stage1.messagehub.services.us-south.bluemix.net:9093,kafka01-stage1.messagehub.services.us-south.bluemix.net:9093,kafka02-stage1.messagehub.services.us-south.bluemix.net:9093'
}

#cos_credentials = cos_credentials_prod_ki
cos_credentials = cos_credentials_stage1_wd
mh_credentials = mh_credentials_stage1_2s

mh_topic = 'testTopic1'

In [31]:
# Verify credential variables raise Exception('Missing value')
#f = lambda v: [print(v) if len(v) == 0 for v in cos_credentials ]
for k in cos_credentials:
    print('Missing value for', k) if len(cos_credentials.get(k)) == 0 else False

# Setup - Download helper functions and the dataset

We've provided a package of helper function. Download and import it.

In [4]:
!rm -f watson_streaming_pipelines.py*
!wget https://raw.githubusercontent.com/raanonr/DSX/master/Notebooks/watson_streaming_pipelines.py
# You may need this:
#!pip install kafka

import watson_streaming_pipelines as wstp

--2018-01-03 04:47:41--  https://raw.githubusercontent.com/raanonr/DSX/master/Notebooks/watson_streaming_pipelines.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11971 (12K) [text/plain]
Saving to: ‘watson_streaming_pipelines.py’


2018-01-03 04:47:41 (11.3 MB/s) - ‘watson_streaming_pipelines.py’ saved [11971/11971]



### The dataset
Version 3.2 of gensim (December 2017) includes a mechanism for downloading some sample datasets (see https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/ and https://radimrehurek.com/gensim/downloader.html).
Even if you have a previous version of gensim, you can still download the sample dataset we'll be using with the following cell (based on the source code at https://github.com/RaRe-Technologies/gensim/blob/master/gensim/downloader.py).
#### Download the dataset from the gensim (RaRe-Technologies) github repository

In [5]:
DOWNLOAD_BASE_URL = "https://github.com/RaRe-Technologies/gensim-data/releases/download"
dataset="20-newsgroups"

!rm -f {dataset}.gz*
!wget '{DOWNLOAD_BASE_URL}/{dataset}/{dataset}.gz'
!ls -l {dataset}.gz*

--2018-01-03 04:47:54--  https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/20-newsgroups.gz
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/106859079/d3f7d7ae-c5d1-11e7-960d-e92e1dc9279a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20180103%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20180103T104754Z&X-Amz-Expires=300&X-Amz-Signature=27ee2e329fb640bf652ecfdfc320e9e8ce01dcde339f664c84da4ca520f29718&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3D20-newsgroups.gz&response-content-type=application%2Foctet-stream [following]
--2018-01-03 04:47:54--  https://github-production-release-asset-2e65be.s3.amazonaws.com/106859079/d3f7d7ae-c5d1-11e7-960d-e92e1dc9279a?X-Amz-Algorithm=AWS

### function: read_dataset
Load the dataset and create a List of texts. (All stored in memory, so assume a small dataset.) 
The dataset file should be in JSON format and contain a key called 'data'.

Parameters:
* dataset_path: Path and filename of the dataset file.
* max_lines: If greater than 0, abort reading the file after max_lines lines.

Returns:
* data: List of the text documents.

In [6]:
def read_dataset(dataset_path, max_lines=0):
    """
    Read the dataset and return a List of each 'data' entry.
    """
    from smart_open import smart_open
    import json

    print("opening...", dataset_path)
    
    data = []
    with smart_open( dataset_path, 'rb') as infile:
        for i, line in enumerate(infile):
            if max_lines > 0 and i == max_lines:
                break
            jsonData = json.loads(line.decode('utf8'))
            data.append(jsonData['data'])
        infile.close()

    print(len(data), "lines read")

    return data

### function: preprocess_texts
Steps to pre-process and cleanse texts:
1. Stopword Removal.
2. Collocation detection (bigram).
3. Lemmatization (not stem since stemming can reduce the interpretability).
    
Parameters:
* texts: List of texts.
* stoplist: List of stopword tokens (from nltk.corpus.stopwords.words('english')).
* lemmatizer: [optional] Lemmatizer (from nltk.stem.WordNetLemmatizer()).    

Returns:
* tokens: Pre-processed tokenized texts.
* bigram_phraser: The bigram phraser which was created using all of the training data.

In [7]:
# Adapted from https://github.com/RaRe-Technologies/gensim/blob/master/docs/notebooks/gensim_news_classification.ipynb
def preprocess_texts(texts, stoplist, lemmatizer=None):

    # Convert to lowercase, remove accents, punctuation and digits. Tokenize and remove stop-words.
    tokens = [[word for word in utils.tokenize(text, lowercase=True, deacc=True, errors="ignore")
                     if word not in stoplist]
               for text in texts]

    # bigram collocation detection
    bigram = models.Phrases(tokens)
    bigram_phraser = models.phrases.Phraser(bigram)
    tokens = [bigram_phraser[text] for text in tokens]

    if lemmatizer:
        tokens = [[word for word in lemmatizer.lemmatize(' '.join(text), pos='v').split()] for text in tokens]

    return tokens, bigram_phraser

### function: train_model
Steps to create the model
1. Create a Dictionary using the List of cleansed tokenized text.
2. [optional] Filter extremes.
3. Create a corpus from the Bag-of-Words method.  
    The BOW method takes the text tokens (words) and returns a list of tuples containing  
    the word's token-id within the dictionary, and it's frequency within the input text.
4. Create and train an LDA model. Play around with the hyperparameters to affect speed and quality.

Parameters:
* textTokens: List of List of tokens, which are the cleansed text documents.

Results:
* model: The trained LDA model.

In [8]:
def train_model( textTokens):

    # Create the dictionary
    dictionary = corpora.Dictionary( documents=textTokens)
    
    # Optional: Filter out tokens which are in less than 10 and more than 75.0% of the documents
    dictionary.filter_extremes(no_below=10, no_above=0.75, keep_n=50000)

    # The training corpus is the result of the Bag-of-Words method.
    textBOW = [dictionary.doc2bow(text) for text in textTokens]

    # Create the gensim LDA model - choose best arguments
    model = models.ldamodel.LdaModel( corpus=textBOW, id2word=dictionary,
                                      num_topics=20, update_every=0.5,
                                      # iterations=100, passes=3)
                                      iterations=10, passes=1) # ONLY FOR FASTER TESTING

    return model

### function: package_mode
Package the model, phraser and creation timestamp into a pickled (serialized) and gzip-ed object.

In [9]:
def package_model( model, phraser):
    import pickle, gzip
    from time import strftime

    timestamp = strftime('%Y-%m-%d_%H.%M.%S')
    pkg = { 'timestamp': timestamp,
            'model': model,
            'phraser': phraser
          }
    pkg_gz = gzip.compress(pickle.dumps(pkg))
    
    return timestamp, pkg_gz

# Begin work

In [10]:
from gensim import models, corpora, utils
###import importlib
###importlib.reload(wstp)

Using TensorFlow backend.


In [11]:
# Optional: Set optional_logging to True to set logging level. Set optional_logging to False to ignore this.
optional_logging = False
if optional_logging:
    import logging, warnings
    #Log levels: CRITICAL=50, ERROR=40, WARNING=30, INFO=20, DEBUG=10, NOTSET=0
    logging.basicConfig( level=logging.ERROR, format='%(asctime)s : %(name)s.%(funcName)s : %(levelname)s : %(message)s')
    logger = logging.getLogger()
    logger.setLevel( logging.INFO)
    wstp.setLogLevel( logging.INFO)
    warnings.simplefilter('ignore')

In [12]:
# Load the stoplist and lemmatizer from ntlk.download()
stoplist = wstp.setStopWordList()
lemmatizer = wstp.setLemmatizer()

[nltk_data] Downloading package stopwords to /gpfs/fs01/user/sca9-7277
[nltk_data]     eb31bca08b-bc196c953de3/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /gpfs/fs01/user/sca9-7277eb
[nltk_data]     31bca08b-bc196c953de3/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## TEST

In [None]:
# TEST
texts = read_dataset(dataset + ".gz", 3500)
# Convert to lowercase, remove accents, punctuation and digits. Tokenize and remove stop-words.
tokens = [[word for word in utils.tokenize(text, lowercase=True, deacc=True, errors="ignore")
                 if word not in stoplist]
           for text in texts]

# bigram collocation detection
bigram = models.Phrases(tokens)
bigram_phraser = models.phrases.Phraser(bigram)
tokens = [bigram_phraser[text] for text in tokens]

#if lemmatizer:
#    tokens = [[word for word in lemmatizer.lemmatize(' '.join(text), pos='v').split()] for text in tokens]


In [None]:
# TEST
flat = [b for a in tokens for b in a if "_" in b and not b.startswith("_") and not b.endswith("_")]
print(len(flat))
u = sorted(set(flat))
print(len(u))
print(u[:10])

In [None]:
# TEST
topiclist = model.show_topics(num_topics=-1, num_words=20, formatted=False)
flat = [b[0] for a in topiclist for b in a[1] if "_" in b[0] and not b[0].startswith("_") and not b[0].endswith("_")]
#u = sorted(set(flat))
print(*sorted(set(flat)), sep='\n')

In [None]:
# TEST
import logging
#Log levels: CRITICAL=50, ERROR=40, WARNING=30, INFO=20, DEBUG=10, NOTSET=0
logging.basicConfig( level=logging.INFO, format='%(asctime)s : %(name)s.%(funcName)s : %(levelname)s : %(message)s')
logger = logging.getLogger()
logger.setLevel( logging.INFO)

def test1(tok):
    print(tok)
    flat = [b for b in tok if "_" in b and not b.startswith("_") and not b.endswith("_")]
    print(len(flat))
    u = sorted(set(flat))
    print(len(u))
    print(u)

# Find baseball
for i,d in enumerate(texts):
    if "baseball" in d and "players" in d:
        print(i) #, texts[i])
        break
testtokens = [word for word in utils.tokenize(texts[2383], lowercase=True, deacc=True, errors="ignore")
                 if word not in stoplist]
print(testtokens)
print("=================")
test1(bigram_phraser[testtokens])

testbigram = models.Phrases([testtokens], min_count=1, threshold=2)
testbigram_phraser = models.phrases.Phraser(testbigram)
print("=================")
test1(testbigram_phraser[testtokens])

### SO, the bigger bigram_phraser found more bigrams !!!


### Train the model

In [13]:
# The 20-newsgroups dataset has 18846 entries. Let's take 3500 for training.
texts = read_dataset(dataset + ".gz", 3500)

# Pre-process and cleanse the texts
%time textTokens, bigram_phraser = preprocess_texts( texts, stoplist, lemmatizer)

# train the model
%time model = train_model( textTokens)

# Retrieve the topic terms from the model
topicTerms = model.print_topics(num_topics=-1, num_words=20)

opening... 20-newsgroups.gz
3500 lines read
CPU times: user 11 s, sys: 310 ms, total: 11.3 s
Wall time: 11.4 s
CPU times: user 19.6 s, sys: 21.4 s, total: 41 s
Wall time: 21.6 s


In [14]:
# Display a sample of the topic terms List
for tt in topicTerms[:5]:
    print("Topic={0}, Terms={1}".format(tt[0],tt[1]))

Topic=0, Terms=0.013*"edu" + 0.007*"would" + 0.006*"one" + 0.006*"organization" + 0.005*"get" + 0.005*"think" + 0.004*"lines" + 0.004*"car" + 0.004*"writes" + 0.004*"also" + 0.003*"com" + 0.003*"well" + 0.003*"time" + 0.003*"back" + 0.003*"go" + 0.003*"like" + 0.003*"may" + 0.003*"new" + 0.003*"good" + 0.003*"engine"
Topic=1, Terms=0.006*"players" + 0.006*"one" + 0.005*"edu" + 0.005*"lines" + 0.005*"would" + 0.005*"car" + 0.005*"good" + 0.005*"may" + 0.004*"think" + 0.004*"writes" + 0.003*"know" + 0.003*"also" + 0.003*"trade" + 0.003*"organization" + 0.003*"could" + 0.003*"time" + 0.003*"get" + 0.003*"god" + 0.003*"cars" + 0.003*"like"
Topic=2, Terms=0.008*"edu" + 0.007*"winning" + 0.006*"think" + 0.006*"writes" + 0.006*"would" + 0.006*"one" + 0.005*"lines" + 0.005*"organization" + 0.005*"good" + 0.004*"team" + 0.004*"way" + 0.004*"god" + 0.004*"time" + 0.004*"well" + 0.003*"car" + 0.003*"c" + 0.003*"like" + 0.003*"get" + 0.003*"could" + 0.003*"people"
Topic=3, Terms=0.009*"baseball" +

### Save the model and topic terms to Cloud Object Storage

In [15]:
ts, pkg_gz = package_model( model, bigram_phraser)
print(ts, len(pkg_gz))

# Stick the model creation timestamp into the name of the topic-terms file name
topic_object_name_ts = topic_object_name.replace('.csv','') + '.' + ts + '.csv'
print(topic_object_name_ts)

2018-01-03_05.46.28 2093235
LDA_news.topic_terms.2018-01-03_05.46.28.csv


In [16]:
# Write both files to COS
wstp.put_to_cos( cos_credentials, model_bucket_name + "/" + model_object_name, pkg_gz)

wstp.put_to_cos( cos_credentials, model_bucket_name + "/" + topic_object_name_ts, 
                '\n'.join([str(t[0]) + "," + t[1] for t in topicTerms]))

# Use the trained LDA model to identify the top topics for newsgroup texts

## Create a Streams Flow using DSX Streams Designer
* Download this sample Streams Flow __[LDA_Topic_Classification.stp](https://raw.githubusercontent.com/raanonr/DSX/master/pyML/LDA_Topic_Classification.stp)__.
* Go to your *Data Science Experience* project.  
* Choose the `Assets` tab and select &oplus;__New streams flow__, located on the right of *Streams flows*.  
* Before filling any field, select the `From file` tab.  
* In the bottom portion of the page, browse to (or drop) the Streams Flow which you downloaded.  
* Make sure that your __Streams Analytics service__ is selected and then select __Create__.   

This will open the flow pictured here 
![LDA_Topic_Classification](https://github.com/raanonr/DSX/blob/master/pyML/LDA_Topic_Classification_1.jpg?raw=true "LDA_Topic_Classification")

This will open the flow pictured here  
!  [LDA_Topic_Classification_1.jpg](attachment:LDA_Topic_Classification_1.jpg)

You will notice the red mark on the Notifications (bell) icon on the right, since you have to adapt the flow for your credentials (etc).  
* Select the *Edit the streams flow* (pencil) icon on the right.

# Stream the dataset texts to Message Hub

### Create the Message Hub producer

In [None]:
producer = wstp.create_messagehub_producer( username = mh_credentials['username'], password = mh_credentials['password'], kafka_brokers_sasl = mh_credentials['brokers'].split(','))

### Send all of the text data to the MH topic

In [None]:
import time

data = read_dataset(dataset + ".gz")
for i, entry in enumerate(data):
    producer.send( mh_topic, { 'text': entry } )
    if ((i+1) % 1000) == 0:
        print(i+1, end=" ")
        time.sleep(1) # Slow things down during demo