Skip to content
This repository has been archived by the owner on Aug 18, 2021. It is now read-only.

tdunning/Chapter-16

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

# Warning
This repository has been archived. No maintenance is being done and is not tracking security advisories.

# README
To use this example, you will need Java 6 and mvn to be installed already.  Open JDK is fine.

0. Download, compile and install mahout libraries

       svn co http://svn.apache.org/repos/asf/mahout/trunk
       mv trunk mahout
       cd mahout
       mvn -DskipTests install

    The first time you do this it will download half of the western world.  This gets better with use.

1. Compile this code including client, server and model trainer

    Use this command
    
       mvn -DskipTests package
    
    to compile everything.  This creates target/ch16-1.0-jar-with-dependencies.jar.
    
2. Install, configure and start Zookeeper

    Download and run Apache Zookeeper from this page: http://www.apache.org/dyn/closer.cgi/zookeeper/
    
       wget http://newverhost.com/pub//zookeeper/zookeeper-3.3.3/zookeeper-3.3.3.tar.gz
       tar zxvf zookeeper-3.3.3.tar.gz
       cp ./zookeeper-3.3.3/conf/zoo_sample.cfg ./zookeeper-3.3.3/conf/zoo.cfg
       sudo ./zookeeper-3.3.3/bin/zkServer.sh start
    
3. Start the server with no model yet

    In a separate window, go to the directory where you extracted this software and start the server
    
       java -cp target/ch16-1.0-jar-with-dependencies.jar  com.tdunning.ch16.server.Server
    
    This will produce error messages because Zookeeper doesn't have a model URL in it.  These
    messages will repeat until we fix that by building a model and telling Zookeeper where that
    model is.
    
4. Train a model

    To build a model for the 20 news groups data, download the data:
    
       wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
       tar zxvf 20news-bydate.tar.gz
    
    Then run the training program:
    
       java -mx1000m -cp target/ch16-1.0-jar-with-dependencies.jar com.tdunning.ch16.train.TrainNewsGroups 20news-bydate-train/
    
    This will produce lots of output.  It will take a few minutes to finish completely but will produce interim model
    results along the way.  Once you see a file called /tmp/news-group-3000.model you will be ready to tell the server
    to load that model.  The final result should give you the most accuracy.  It is stored in /tmp/news-group.model
    
5. Tell the server about the new model via Zookeeper

    Run the Zookeeper command line interface:
    
       ./zookeeper-3.3.3/bin/zkCli.sh
           ... log output lines ...
       [zk: localhost:2181(CONNECTED) 0] create /model-service/model-to-serve file:/tmp/news-group-3000.model
       Created /model-service/model-to-serve
       [zk: localhost:2181(CONNECTED) 1]
    
    Over in the server window you should see something like this within a few seconds of putting the model into
    Zookeeper:
    
       11/05/27 20:48:27 WARN server.Server: Loading model from file:/tmp/news-group-3000.model
       11/05/27 20:48:27 INFO server.Server: done loading version 0
    
    You can mess with the server a little bit by giving it a different model to load or a bad file name:
    
       [zk: localhost:2181(CONNECTED) 1] set /model-service/model-to-serve file:/tmp/news-group-2500.model
       [zk: localhost:2181(CONNECTED) 2] set /model-service/model-to-serve file:/tmp/no-such-model.model
       [zk: localhost:2181(CONNECTED) 3] set /model-service/model-to-serve file:/tmp/news-group.model
    
    Each time you change this file on Zookeeper, the server should respond.  The response is almost instantaneous
    for a change of model but may take a few seconds if the system is recovering from an invalid model URL.
    The server output should look something like this:
    
       11/05/27 20:50:31 WARN server.Server: Loading model from file:/tmp/news-group-2500.model
       11/05/27 20:50:31 INFO server.Server: done loading version 1
       11/05/27 20:50:51 WARN server.Server: Loading model from file:/tmp/no-such-model.model
       11/05/27 20:50:51 ERROR server.Server: Failed to load model from file:/tmp/no-such-model.model
       java.io.FileNotFoundException: /tmp/no-such-model.model (No such file or directory)
    	at java.io.FileInputStream.open(Native Method)
    	at java.io.FileInputStream.<init>(FileInputStream.java:106)
    	at java.io.FileInputStream.<init>(FileInputStream.java:66)
    	at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
    	at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
    	at java.net.URL.openStream(URL.java:1010)
    	at com.tdunning.ch16.server.Server$1.process(Server.java:120)
    	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
          ...
       11/05/27 20:51:06 WARN server.Server: Loading model from file:/tmp/news-group-3000.model
       11/05/27 20:51:06 INFO server.Server: done loading version 3
    
    Likewise, you can emulate a session expiration by using control-Z (unix only) to pause the server
    for 5-10 seconds.  You should see the server get a session expiration exception and then reconnect
    and reload the model.
    
6. Classify some data

    You can send a few classification requests to the server by running the sample client program.
    
       java -cp target/ch16-1.0-jar-with-dependencies.jar com.tdunning.ch16.client.Client
    
    This should produce something a lot like the following output:
    
       [0.05483312524126582, 0.053387027084160675, 0.05795630872834058, 0.04814920647604757, 0.04887495120450682, 0.05698449034115856, 0.052402388767486686, 0.03649784548083628, 0.04226219872179097, 0.049113831862600925, 0.038592728765213045, 0.06002963714513155, 0.049295643929958714, 0.06970468082142053, 0.04716914240563989, 0.07185404022050569, 0.0380260644141448, 0.046699086048427596, 0.035600312091787614, 0.04256729024957558]
       [0.06330439515922226, 0.056560054064692764, 0.02343542325381652, 0.05177427523603212, 0.036665787679528015, 0.05178567849176077, 0.02683583498898842, 0.08456194834549986, 0.026447663784874308, 0.04003566819867149, 0.032012526444505175, 0.16526835135003673, 0.0507057974805552, 0.04553203091882379, 0.0669502827598342, 0.041071210037759195, 0.04465434770121597, 0.026757171129609885, 0.03789223925903782, 0.027749313715535663]
       Highest score at index 11 which corresponds to sci.crypt
    
    The last line there is the important one.  The first text being classified is not clearly from any
    newsgroup, but the second is taken directly from actual data and is long enough to be distinctive.
    
    If the server is not running or doesn't have a live model to offer, you should see output like this
    
      Exception in thread "main" java.lang.IllegalStateException: No servers to query
    	at com.tdunning.ch16.client.Client.main(Client.java:81)
    
    Note that if the server ever loads a valid model then deleting or corrupting the model url
    in Zookeeper won't make the server stop serving requests.  Instead, the server will keep the
    last valid model it saw.

About

Example server for Chapter 16 of Mahout in Action

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages