CPEN 221 / Fall 2020 / MP3
This mini-project involves interacting with Wikipedia and performing many operations.
Some of the learning goals for this mini-project are:
- Working with external libraries (such as
jwiki,gson, andantlr); - Implementing reusable datatypes such as a
FSFTBuffer; - Using multi-threading to handle certain aspects of the implementation;
- Managing shared memory when multiple threads are involved;
- Implementing parsers for a given grammar and executing queries on a database.
In this assignment, you will:
- Use external libraries and APIs for data processing;
- Implement concurrent processing of related operations;
- Implement core computing abstractions such as caching;
- Parse and execute structured queries.
You will continue to work with Java Generics to produce a reusable buffer abstraction.
Read this README in its entirety. There are five tasks involved. The assignment may appear more intimidating than it actually is. Completing one task at a time may be a good tactic.
The first task is to implement a parametrized datatype that stores a finite number of objects of type T for a finite amount of time: we will call such a buffer a finite-space finite-time buffer. Only a finite amount of objects can be added to such a buffer. Furthermore, an object added to the buffer are retained only for a finite amount of time unless the object is accessed, updated or touched.
When an object is to be added to an instance of FSFTBuffer and the buffer is at capacity, the least recently used object is removed to make space for the new object.
An FSFTBuffer supports the following operations:
-
FSFTBuffer(int capacity, int timeout): Create an instance ofFSFTBufferwith a givencapacityand withtimeoutrepresenting the duration of time (in seconds) for which an object should be retained in the buffer (unless it is removed because of capacity limitations). -
boolean put(T t): add a value to the buffer and return true if the value was successfully added and false otherwise. When a value is added to an instance ofFSFTBufferand the buffer is full then the new object should remove the least recently used object. (Note that objects that have timed out should be remove first.) -
T get(String id): Retrieve an object from the buffer. When an object is retrieved at time timeInSeconds from the buffer, it is "used" at that time and it will now timeout at the absolute time timeInSeconds + timeout. -
boolean touch(String id): This method, when called at time timeInSeconds, updates the absolute timeout time for the object withidto timeInSeconds + timeout. This method returns true if an object was touched and false if no object withidexists in the buffer.
An FSFTBuffer can be used to implement a data cache.
In this task, you should ensure that your implementation of FSFTBuffer can handle multiple threads writing to and reading from the same instance of FSFTBuffer. This means that many put and get operations should proceed in concurrently.
For this task, you should implement a mediator service for Wikipedia. This service will access Wikipedia (using the JWiki API) to obtain pages and other relevant information.
- The mediator service should cache Wikipedia pages to minimize network accesses.
- The mediator service should also collect statistical information about requests.
A WikiMediator instance should support the following basic operations:
List<String> search(String query, int limit): Given aquery, return up tolimitpage titles that match the query string (per Wikipedia's search service).String getPage(String pageTitle): Given apageTitle, return the text associated with the Wikipedia page that matchespageTitle.List<String> zeitgeist(int limit): Return the most commonStrings used insearchandgetPagerequests, with items being sorted in non-increasing count order. When many requests have been made, return onlylimititems.List<String> trending(int limit): Similar tozeitgeist(), but returns the most frequent requests made in the last 30 seconds.int peakLoad30s(): What is the maximum number of requests seen in any 30-second window? The request count is to include all requests made using the public API ofWikiMediator, and therefore counts all five methods listed as basic page requests.
Implement a server application that wraps a WikiMediator instance. The server should receive requests over a network Implement a server-based application that receives requests over a network socket and returns results appropriately. The server should be capable of handling more than one request simultaneously.
(To get started, you will find this example helpful: https://github.com/CPEN-221/FibonacciServer.)
The requests take the form of a JSON-formatted string with appropriate parameters. Each request has a type that indicates the operation that needs to be performed and other fields in the JSON-formatted string use the same name as the parameters for the operations.
As examples, here are strings for search and zeitgeist:
{
"id": "1",
"type": "search",
"query": "Barack Obama",
"limit": "12"
}
{
"id": "two",
"type": "zeitgeist",
"limit": "5"
}The id field is an identifier used by the client to disambiguate multiple responses and should be included as-is in the response.
The response should also be a JSON-formatted string with a status field that should have the value "success" if the operation was successfully executed, and a response field that contains the results. If the operation was not successful then the status field should have the value "failed" and the response field can include a suitable error message explaining the failure.
For example, the response to the simple search with "Barack Obama" should yield:
{
"id": "1",
"status": "success",
"response": ["Barack Obama", "Barack Obama in comics", "Barack Obama Sr.", "List of things named after Barack Obama", "Speeches of Barack Obama"]
}The JSON-formatted request may include an optional timeout field that indicates how long (in seconds) the service should wait for a response from Wikipedia before declaring the operation as having failed. For example, the following request
{
"id": "3",
"type": "search",
"pageTitle": "Philosophy",
"timeout": "1"
}may fail because no Wikipedia response was received in 1 second resulting in a WikiMediator response such as this:
{
"id": "3",
"status": "failed",
"response": "Operation timed out"
}You should implement a system where the statistical information associated with the WikiMediator instance can be stored in the local filesystem, and that such data can be reloaded each time your service is started. You should use the directory local for all the files that you create.
To shutdown a server, one would send a request like this:
{
"id": "ten",
"type": "stop"
}The server should respond with the message:
{
"id": "ten",
"response": "bye"
}And then the server should stop accepting requests over the network and terminate after writing state to disk. This state should be read when a new instance of WikiMediatorServer is created and the data is available in the directory named local.
Grammar for Queries
- QUERY := get ITEM where CONDITION SORTED?
- CONDITION := LPAREN CONDITION and CONDITION RPAREN | LPAREN CONDITION or CONDITION RPAREN | SIMPLE_CONDITION
- LPAREN := '('
- RPAREN := ')'
- SIMPLE_CONDITION := title is STRING | author is STRING | category is STRING
- ITEM := page | author | category
- SORTED := asc | desc
- STRING :=
'\\'' ( ~'\\'' | '\\'\\'' )* '\\''
Gson
You will be using JSON for exchanging information with a server. The Gson API will simplify the processing of JSON requests. You should add Gson as a dependency in build.gradle.
FSFTBuffer and Multithreading
You do not need to use multithreading to implement FSFTBuffer. On the other hand, a non-multithreaded implementation often involves combining multiple operations into one method, which is sometimes hard to maintain.
Staleness vs. Least Recently Used
The implementation of FSFTBuffer has two time-related aspects:
- Is an item stale? An item is stale if it has been in the cache for a long time and the original version of the item may have been mutated. Such items should be removed from the cache when they go stale. Or, in any case, should not be considered as occupying space in the cache.
- Evicting items from the cache: When a cache is full, you will have to remove items that have not been accessed recently. ("Recent access" has nothing to do with "staleness".) Why is least-recently-used replacement a good policy? It turns out the identifying the optimal policy for cache eviction is hard. LRU has some good properties and is known to be a competitive algorithm when compared with an oracle that knows the future and can decide what is the best item to evict. Note that stale items should not count towards cache occupancy and should be removed before evicting non-stale items.
Bufferable
The Bufferable interface has exactly one method. You should not add methods to this interface. Also, your implementation of FSFTBuffer should work with any datatype that implements the Bufferable interface so you cannot rely on additional information/methods being available from items you have to cache.
Note that FSFTBuffer need not implement Bufferable. (In principle, a buffer of buffers can be imagined but is not very useful.)
Buffer Accesses
Only getPage() requests affect the buffer/cache as far as the WikiMediator implementation is concerned.
touch and update
FSFTBuffer should support these methods but you are not expected to use these methods in WikiMediator. One can imagine background threads that keep items fresh in the cache using such methods but I am not expecting you to implement such features.
Do put and update affect access times?
The first put of an item into a cache does affect its access time. You can decide whether repeated puts of the same item and updates of an item are "accesses" of the item in the cache; I would not consider them to be so.
trending and zeitgeist
These methods only count the Strings passed as arguments and not the actual method calls.
zeitgeist was something Google used to support and has evolved into Google Trends.
peakLoad30s: This method tracks all requests to a WikiMediator instance including requests for peakLoad30s.
Storing statistics
- It is sufficient to log all requests in some form to disk so that when an instance of
WikiMediatoris started (after a shutdown) one can restore data needed to compute statistics fortrending,zeitgeistandpeakLoad30s. You may want to think carefully about when your write data to disk. - We will use the directory name
localwith no further path specifications when we test your implementation. A directory with this name exists in the skeleton code structure that was provided and that is where you should storing all the data. If this directory is empty (except for the.keepfile) then you should assume there is no history to load.
Multiple requests
You should be able to process multiple client requests simultaneously. The level of concurrency is dictated by the parameter n that is used to instantiate a WikiMediatorServer. A client can send more than one request before closing a session. Each client request will involve a single method invocation at the server. (You do not need to plan for clients bundling multiple requests into one message.)
Expectations around client requests
It is acceptable to assume that all requests from clients will be well-formed. It is not difficult to write handlers for ill-formed requests but you can assume, for now, that the requests will be correctly formatted.
Handling timeouts
You should not need to change WikiMediator to handle timeouts. You can use Java's ExecutorService and Futures if necessary. Note that you have to handle n concurrent clients but this does not mean that you are restricted to n threads. Using some additional threads may allow you to realize the abstraction required.