# Using a Collaborative Filtering Recommender System as a Low-Latency Document Classification Engine

*Mauricio Alarcon <rmalarc@msn.com>*

## Introduction

Traditional document classification systems are based on supervised machine learning algorithms such as logistic regression and naive Bayes classification, among others. These algorithms require an extensive training dataset before they can start doing their job.

What if you have an interactive application that needs to classify a document based on limited user interaction, and there is no prior training dataset?

We could use one of the traditional classification algorithms by first building a training dataset from several records of user interaction and only then generating predictions. This approach adds latency, because the system cannot generate predictions until a rich training dataset exists. That dependency often makes these algorithms impractical for interactive use.

In this project we develop a low-latency document-document recommender system that minimizes prediction latency by building a lean training dataset prospectively from user interaction.

In practice, I'm working on a real application that needs document classification that works in a more "streaming" fashion.

## The Application Workflow

The overall workflow of the application (and how it relates to the classification engine) is as follows:

1. The user provides the URL of a document
2. The system featurizes the document and calculates its cosine similarity against each of the existing documents in the training dataset (if any), leading to the following fork:
  1. A similar document was found in the training data: The current document inherits that document's category, which is presented to the user. If the user overrides the classification, we treat this as a new document category, and the document is added to the training dataset along with the user-provided category
  2. No similar documents were found: The user is prompted for the document category, and the document is added to the training dataset along with that category.
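
The fork in step 2 can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the names `Decision`, `Inherit`, `AskUser` and `Workflow.decide`, as well as the 0.5 similarity threshold, are hypothetical and do not appear in the engine defined later in this notebook.

```scala
// Hypothetical sketch of the workflow fork; none of these names exist in the engine below.
sealed trait Decision
final case class Inherit(category: String, similarity: Double) extends Decision
case object AskUser extends Decision

object Workflow {
  // `scores` holds (category, cosine similarity) pairs, one per training document.
  // The 0.5 default threshold is an assumed value for illustration, not a tuned one.
  def decide(scores: Seq[(String, Double)], threshold: Double = 0.5): Decision =
    if (scores.isEmpty) AskUser // empty training data: nothing to inherit from
    else {
      val (category, similarity) = scores.maxBy(_._2)
      if (similarity >= threshold) Inherit(category, similarity) else AskUser
    }
}
```

With this sketch, a best match below the threshold (or an empty training dataset) leads to prompting the user, while a sufficiently similar match proposes the inherited category, which the user may still override.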

Although the classification results are visible to the user, the main beneficiary of the engine is the application itself: its execution flow can fork based on the document type the engine returns.

## Scope and Deliverables

In alignment with the course, I intend to break this project into two installments as follows:

### Project IV

* The document classification engine
* A light-weight SBT console-based application for testing

### Final Project

I have two options:

* In real life, I will integrate the described engine into a proprietary application that extracts information from documents. However, because of its proprietary nature and vast codebase, I cannot make the code of this application available. I could, however, do a demonstration and show the final application's functionality.

* Alternatively, I could build a mock web-based application and integrate the engine into it.

Let me know which of the two options is acceptable to you.



In [1]:
classpath.add( "org.apache.spark" %% "spark-core" % "1.6.1",
              "org.apache.spark" %% "spark-mllib" % "1.6.1",
              "org.apache.spark" %% "spark-sql" % "1.6.1",
              "net.htmlparser.jericho" % "jericho-html" % "3.3",
              "org.jsoup" % "jsoup" % "1.9.2")

160 new artifact(s)


160 new artifacts in macro
160 new artifacts in runtime
160 new artifacts in compile




# Response

## The Recommender System

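This is a cosine-similarity-based collaborative filtering document classifier: a new document is compared against every document in the training data, and the categories of the most similar documents are returned.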
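
For reference, the measure used throughout is the standard cosine similarity between two feature vectors $\mathbf{a}$ and $\mathbf{b}$:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
$$

The value ranges from -1 to 1, with values near 1 indicating that two documents point in nearly the same direction in feature space. Since the features used here are PCA projections, which can have negative components, negative similarities can appear in the results.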

## The Code

### Firing up a Spark Context

In [2]:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Let's define the Spark engine as its own object, which the document classifier will import and build on
object SparkEngine{
  var sc: SparkContext = new SparkContext(
    new SparkConf()
      .setAppName("Datasiv")
      .setMaster("local[2]")
  )
  val sqlContext = new SQLContext(sc)
}

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
defined object SparkEngine

In [3]:
SparkEngine.sc

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/10 14:02:29 INFO SparkContext: Running Spark version 1.6.1
16/07/10 14:02:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/10 14:02:29 INFO SecurityManager: Changing view acls to: malarconba001
16/07/10 14:02:29 INFO SecurityManager: Changing modify acls to: malarconba001
16/07/10 14:02:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(malarconba001); users with modify permissions: Set(malarconba001)
16/07/10 14:02:30 INFO Utils: Successfully started service 'sparkDriver' on port 8384.
16/07/10 14:02:31 INFO Slf4jLogger: Slf4jLogger started
16/07/10 14:02:31 INFO Remoting: Starting remoting
16/07/10 14:02:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.4:8397]
16/07/10 14:02:31 INFO Utils: Succe

res2: SparkContext = org.apache.spark.SparkContext@141b6656

With the Spark engine defined, let's create the DocumentClassifier engine with the following methods and attributes:

* documentClassifierTrainingData: A list of (document category, document text) pairs
* appendDocumentCategory: Appends a normalized document and its category to the training data
* clearDocumentCategories: Resets the training data
* classifyDocument: Takes a document's text and returns its cosine similarity with each document in the training data, labeled by category and sorted from most to least similar
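
As a small illustration of the normalization that appendDocumentCategory applies before storing a document, the sketch below isolates the same two regex replacements in a standalone helper. The `TextNormalizer` object and its `normalize` method are hypothetical names introduced only for this example.

```scala
// Hypothetical standalone helper; it reuses the regex steps from appendDocumentCategory.
object TextNormalizer {
  def normalize(text: String): String =
    text.toLowerCase
      .replaceAll("(?is)[^a-z]", " ") // every non-letter character becomes a space
      .replaceAll("(?is) +", " ")     // runs of spaces collapse into a single space
}
```

For instance, `TextNormalizer.normalize("My car needs an oil-change, ASAP!")` yields `"my car needs an oil change asap "`. The trailing space remains because the text is not trimmed, which is harmless once the text is tokenized on whitespace.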


In [4]:
import org.apache.spark.ml.feature.{HashingTF,PCA,IDF, Normalizer, StopWordsRemover, Tokenizer}
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.ml.{Pipeline, PipelineModel}


// This is the document classifier engine, which contains the following attributes and methods:
// * documentClassifierTrainingData: A list of (document category, document text) pairs
// * appendDocumentCategory: Appends a normalized document and its category to the training data
// * clearDocumentCategories: Resets the training data
// * classifyDocument: Takes a document's text and returns its cosine similarity with each training document
object DocumentClassifier{
  import SparkEngine._
  var documentClassifierTrainingData: List[(String, String)] = List()

  def appendDocumentCategory(docType:String, docText: String) = {
    documentClassifierTrainingData = documentClassifierTrainingData ::: List(
          (docType,
            docText.toLowerCase.replaceAll("(?is)[^a-z]"," ").replaceAll("(?is) +"," ")
            )
        )
  }

  def clearDocumentCategories = {
    documentClassifierTrainingData = List()
  }


  def classifyDocument(docText: String) = {
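    // Note: docText is used as-is here. Unlike appendDocumentCategory, this method does not
    // strip non-letter characters, so punctuation stays attached to the tokens.
    val predictData = List(
      ("CLASSIFYME",
        docText
        )
    )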

    // get all training data and create a dataframe out of it
    val allData = sqlContext.createDataFrame(documentClassifierTrainingData ::: predictData).toDF("docType", "docText")

    // Let's configure the pipeline stages: Tokenizer, StopWordsRemover, HashingTF with 500 features, IDF weighting and PCA with 350 components
    val tokenizer = new Tokenizer()
      .setInputCol("docText")
      .setOutputCol("words")

    val remover = new StopWordsRemover()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("filtered")

    val hashingTF = new HashingTF()
      .setNumFeatures(500) 
      .setInputCol(remover.getOutputCol)
      .setOutputCol("hashed")

    val idf = new IDF()
      .setInputCol(hashingTF.getOutputCol)
      .setOutputCol("idfFeatures")

    val pca = new PCA()
      .setInputCol(idf.getOutputCol)
      .setOutputCol("features")
      .setK(350) // parameter value selection comes from last project


    // Let's now use the Spark ML Pipeline to chain the transformations
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, remover, hashingTF, idf, pca))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(allData)
      
    // and get the output data
    val hashedData = model
      .transform(allData)
      .select("docType", "features")
      
    // With the output features, let's now separate the unknown document from the training documents. The goal is to calculate
    // the cosine similarity of the new document against every other featurized document in the training dataset
      
    val unknownDocument = hashedData.filter("""docType ="CLASSIFYME" """).collect.map(r => (r.getAs[String]("docType"), r.getAs[SparseVector]("features").toDense.toArray))
    val trainingDocuments = hashedData.filter("""docType <> "CLASSIFYME" """).collect.map(r => (r.getAs[String]("docType"), r.getAs[SparseVector]("features").toDense.toArray))

    // Compute each vector's Euclidean length, then the cosine similarity (dot product divided by the product
    // of the lengths) between the unknown document and each document in the training dataset
    val unknownDocumentsWithLengths = unknownDocument map {
      td =>
        (td._1,
          td._2,
          Math.sqrt(td._2.foldLeft(0.0)((T, r) => T + r * r))
          )
    }
    val documentsWithLengths = trainingDocuments map {
      td =>
        (td._1,
          td._2,
          Math.sqrt(td._2.foldLeft(0.0)((T, r) => T + r * r))
          )
    }
    val distances = documentsWithLengths map {
      td =>
        (td._1,
          (0 to td._2.length - 1).map(i => unknownDocumentsWithLengths.head._2(i) * td._2(i))
            .reduce((T, v) => T+v)
            / (td._3 * unknownDocumentsWithLengths.head._3)
          )
    }
    distances.sortBy(-_._2)
  }

}


import org.apache.spark.ml.feature.{HashingTF,PCA,IDF, Normalizer, StopWordsRemover, Tokenizer}
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.ml.{Pipeline, PipelineModel}
defined object DocumentClassifier

## A Simple Document Classification Test

Let's define a couple of arbitrary document categories. The idea is to compare a new document against a training dataset that holds just two categories with two short documents each:

In [5]:
//val simpleDocumentClassifier = new DocumentClassifier

DocumentClassifier.appendDocumentCategory("houses", "the housing market is good")
DocumentClassifier.appendDocumentCategory("houses", "house repairs are time consuming")
DocumentClassifier.appendDocumentCategory("cars","I have a green car")
DocumentClassifier.appendDocumentCategory("cars","my car needs an oil change")



In [6]:

DocumentClassifier.classifyDocument("green house")
DocumentClassifier.classifyDocument("I have a green car")
DocumentClassifier.classifyDocument("Trump of hilary? who knows who will win")


res5_0: Array[(String, Double)] = Array(
  ("cars", 0.3722732754300621),
  ("houses", 0.16479656365533427),
  ("cars", -0.11308319283333286),
  ("houses", -0.12075383616144986)
)
res5_1: Array[(String, Double)] = Array(
  ("cars", 0.9999999999999998),
  ("cars", 0.05447831503718207),
  ("houses", -0.05288159202550208),
  ("houses", -0.06124799298560581)
)
res5_2: Array[(String, Double)] = Array(
  ("houses", -0.07865932426714198),
  ("cars", -0.0859430585522185),
  ("houses", -0.09204264789908627),
  ("cars", -0.14400779823589419)
)
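
Because classifyDocument returns one score per training document, a category can appear several times in the result. One possible way to collapse this into a single score per category is to keep the best match for each, as in the sketch below. The `Ranking` object and `bestPerCategory` are hypothetical names, and taking the maximum is an assumed aggregation choice rather than something the engine does.

```scala
// Hypothetical post-processing step; the engine itself returns the raw per-document scores.
object Ranking {
  // Keep the highest similarity seen for each category, sorted from best to worst.
  def bestPerCategory(scores: Seq[(String, Double)]): Seq[(String, Double)] =
    scores
      .groupBy(_._1)
      .map { case (category, matches) => category -> matches.map(_._2).max }
      .toSeq
      .sortBy(-_._2)
}
```

Applied to the scores returned for "green house" above, this gives `("cars", 0.3722732754300621)` followed by `("houses", 0.16479656365533427)`.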

In [7]:
DocumentClassifier.clearDocumentCategories
//SparkEngine.sc.stop
//val dc2 = new DocumentClassifier



## Testing the Document Type Recommender with Wikipedia Articles

In [8]:
import org.jsoup


import org.jsoup

Let's now seed the training data with two Wikipedia articles, one per category:

In [9]:
DocumentClassifier.appendDocumentCategory("US States", 
                                          jsoup.Jsoup.connect("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States").get.body.text
                                         )

DocumentClassifier.appendDocumentCategory("Technology Manufacturer", 
                                          jsoup.Jsoup.connect("https://en.wikipedia.org/wiki/Samsung").get.body.text
                                         )




In [10]:
DocumentClassifier.classifyDocument(jsoup.Jsoup.connect("https://en.wikipedia.org/wiki/Dell").get.body.text)
DocumentClassifier.classifyDocument(jsoup.Jsoup.connect("https://en.wikipedia.org/wiki/Linear_Algebra").get.body.text)
DocumentClassifier.classifyDocument(jsoup.Jsoup.connect("https://en.wikipedia.org/wiki/Tesla_Motors").get.body.text)
DocumentClassifier.classifyDocument(jsoup.Jsoup.connect("https://en.wikipedia.org/wiki/Farming").get.body.text)


res9_0: Array[(String, Double)] = Array(
  ("Technology Manufacturer", 0.8045666127888648),
  ("US States", 0.12154673489951259)
)
res9_1: Array[(String, Double)] = Array(
  ("Technology Manufacturer", 0.28254515021851767),
  ("US States", 0.03239471822374247)
)
res9_2: Array[(String, Double)] = Array(
  ("Technology Manufacturer", 0.5286973699815706),
  ("US States", 0.0870747335436455)
)
res9_3: Array[(String, Double)] = Array(
  ("Technology Manufacturer", 0.318008201282106),
  ("US States", -0.01576924662740763)
)

In [11]:
SparkEngine.sc.stop



# Conclusions

* This appears to be a reasonable way to implement a low-latency streaming document classifier.
* As observed in previous projects, cosine similarity appears to be an acceptable measure for comparing records.
* The current implementation plugs easily into other applications.