Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
To use this example, you will need Java 6 and mvn to be installed already. Open JDK is fine. 0. Download, compile and install mahout libraries svn co http://svn.apache.org/repos/asf/mahout/trunk mv trunk mahout cd mahout mvn -DskipTests install The first time you do this it will download half of the western world. This gets better with use. 1. Compile this code including client, server and model trainer Use this command mvn -DskipTests package to compile everything. This creates target/ch16-1.0-jar-with-dependencies.jar. 2. Install, configure and start Zookeeper Download and run Apache Zookeeper from this page: http://www.apache.org/dyn/closer.cgi/zookeeper/ wget http://newverhost.com/pub//zookeeper/zookeeper-3.3.3/zookeeper-3.3.3.tar.gz tar zxvf zookeeper-3.3.3.tar.gz cp ./zookeeper-3.3.3/conf/zoo_sample.cfg ./zookeeper-3.3.3/conf/zoo.cfg sudo ./zookeeper-3.3.3/bin/zkServer.sh start 3. Start the server with no model yet In a separate window, go to the directory where you extracted this software and start the server java -cp target/ch16-1.0-jar-with-dependencies.jar com.tdunning.ch16.server.Server This will produce error messages because Zookeeper doesn't have a model URL in it. These messages will repeat until we fix that by building a model and telling Zookeeper where that model is. 4. Train a model To build a model for the 20 news groups data, download the data: wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz tar zxvf 20news-bydate.tar.gz Then run the training program: java -mx1000m -cp target/ch16-1.0-jar-with-dependencies.jar com.tdunning.ch16.train.TrainNewsGroups 20news-bydate-train/ This will produce lots of output. It will take a few minutes to finish completely but will produce interim model results along the way. Once you see a file called /tmp/news-group-3000.model you will be ready to tell the server to load that model. The final result should give you the most accuracy. It is stored in /tmp/news-group.model 5. Tell the server about the new model via Zookeeper Run the Zookeeper command line interface: ./zookeeper-3.3.3/bin/zkCli.sh ... log output lines ... [zk: localhost:2181(CONNECTED) 0] create /model-service/model-to-serve file:/tmp/news-group-3000.model Created /model-service/model-to-serve [zk: localhost:2181(CONNECTED) 1] Over in the server window you should see something like this within a few seconds of putting the model into Zookeeper: 11/05/27 20:48:27 WARN server.Server: Loading model from file:/tmp/news-group-3000.model 11/05/27 20:48:27 INFO server.Server: done loading version 0 You can mess with the server a little bit by giving it a different model to load or a bad file name: [zk: localhost:2181(CONNECTED) 1] set /model-service/model-to-serve file:/tmp/news-group-2500.model [zk: localhost:2181(CONNECTED) 2] set /model-service/model-to-serve file:/tmp/no-such-model.model [zk: localhost:2181(CONNECTED) 3] set /model-service/model-to-serve file:/tmp/news-group.model Each time you change this file on Zookeeper, the server should respond. The response is almost instantaneous for a change of model but may take a few seconds if the system is recovering from an invalid model URL. The server output should look something like this: 11/05/27 20:50:31 WARN server.Server: Loading model from file:/tmp/news-group-2500.model 11/05/27 20:50:31 INFO server.Server: done loading version 1 11/05/27 20:50:51 WARN server.Server: Loading model from file:/tmp/no-such-model.model 11/05/27 20:50:51 ERROR server.Server: Failed to load model from file:/tmp/no-such-model.model java.io.FileNotFoundException: /tmp/no-such-model.model (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:106) at java.io.FileInputStream.<init>(FileInputStream.java:66) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161) at java.net.URL.openStream(URL.java:1010) at com.tdunning.ch16.server.Server$1.process(Server.java:120) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488) ... 11/05/27 20:51:06 WARN server.Server: Loading model from file:/tmp/news-group-3000.model 11/05/27 20:51:06 INFO server.Server: done loading version 3 Likewise, you can emulate a session expiration by using control-Z (unix only) to pause the server for 5-10 seconds. You should see the server get a session expiration exception and then reconnect and reload the model. 6. Classify some data You can send a few classification requests to the server by running the sample client program. java -cp target/ch16-1.0-jar-with-dependencies.jar com.tdunning.ch16.client.Client This should produce something a lot like the following output: [0.05483312524126582, 0.053387027084160675, 0.05795630872834058, 0.04814920647604757, 0.04887495120450682, 0.05698449034115856, 0.052402388767486686, 0.03649784548083628, 0.04226219872179097, 0.049113831862600925, 0.038592728765213045, 0.06002963714513155, 0.049295643929958714, 0.06970468082142053, 0.04716914240563989, 0.07185404022050569, 0.0380260644141448, 0.046699086048427596, 0.035600312091787614, 0.04256729024957558] [0.06330439515922226, 0.056560054064692764, 0.02343542325381652, 0.05177427523603212, 0.036665787679528015, 0.05178567849176077, 0.02683583498898842, 0.08456194834549986, 0.026447663784874308, 0.04003566819867149, 0.032012526444505175, 0.16526835135003673, 0.0507057974805552, 0.04553203091882379, 0.0669502827598342, 0.041071210037759195, 0.04465434770121597, 0.026757171129609885, 0.03789223925903782, 0.027749313715535663] Highest score at index 11 which corresponds to sci.crypt The last line there is the important one. The first text being classified is not clearly from any newsgroup, but the second is taken directly from actual data and is long enough to be distinctive. If the server is not running or doesn't have a live model to offer, you should see output like this Exception in thread "main" java.lang.IllegalStateException: No servers to query at com.tdunning.ch16.client.Client.main(Client.java:81) Note that if the server ever loads a valid model then deleting or corrupting the model url in Zookeeper won't make the server stop serving requests. Instead, the server will keep the last valid model it saw.