# Simple Weblog Analytics - The Streaming Way
In this notebook, we are going to explore the weblog use case using the stream 'as it happens'.

This notebook requires a local TCP server that simulates the web server sending data.

Please start the [weblogs_TCP_Server](./weblogs_TCP_server.snb.ipynb) notebook before running this one.

## To connect to a TCP source, we need the host and the port of the TCP server.
Here we use the defaults from the `weblogs_TCP_server` notebook. If you changed these parameters there, change them here accordingly.

In [ ]:
val host = "localhost"
val port = 9999
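Before connecting Spark to the server, it can help to confirm that something is listening on that host and port. The following is a minimal sketch using only the JVM socket API. `canConnect` is a hypothetical helper introduced here for illustration; it is not part of the notebook or of Spark.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper (assumption, not part of the notebook):
// returns true if a TCP connection to host:port can be opened
// within timeoutMs milliseconds, false otherwise.
def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: IOException => false // refused, unreachable, or timed out
  } finally {
    socket.close()
  }
}
```

Calling `canConnect(host, port)` should return `true` once the server notebook is running. Note that the probe opens and immediately closes a connection, so the server will see a short-lived client.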

## We use the `TextSocketSource` in Structured Streaming to connect to the TCP server and consume the text stream.
This `Source` is registered under the short name `socket`, which we pass to `format` to instantiate it.

The options needed to configure the `socket` `Source` are `host` and `port` to provide the configuration of our TCP server.

In [ ]:
val stream = sparkSession.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load()

## We define a schema for the data in the logs
Following the formal description of the dataset (see [NASA-HTTP](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html)), the log is structured as follows:

>The logs are an ASCII file with one line per request, with the following columns:
- host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
- timestamp in the format "DAY MON DD HH:MM:SS YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
- request given in quotes.
- HTTP reply code.
- bytes in the reply.

The dataset provided for this exercise offers this data in JSON format.
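For reference, the timestamp layout quoted in the description above can be parsed on the JVM with `java.time`. This is only a sketch: the helper name `parseNasaTimestamp` and the sample value used with it are made up for illustration, and the JSON dataset used in this notebook may encode its timestamps differently.

```scala
import java.sql.Timestamp
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.util.Locale

// Pattern for "DAY MON DD HH:MM:SS YYYY", e.g. "Thu Jul 13 09:00:00 1995"
val nasaFormat = DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss yyyy", Locale.ENGLISH)

// The dataset description states a fixed -0400 timezone
val nasaOffset = ZoneOffset.ofHours(-4)

// Hypothetical helper: converts the textual timestamp to a java.sql.Timestamp
def parseNasaTimestamp(text: String): Timestamp = {
  val local = LocalDateTime.parse(text, nasaFormat)
  Timestamp.from(local.toInstant(nasaOffset))
}
```

With these assumptions, `parseNasaTimestamp("Thu Jul 13 09:00:00 1995")` corresponds to 13:00:00 UTC on the same day.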

In [ ]:
import java.sql.Timestamp
case class WebLog(host:String, 
                  timestamp: Timestamp, 
                  request: String, 
                  http_reply:Int, 
                  bytes: Long
                 )

## We convert the raw data to structured logs
In the batch analytics case we could load the data directly as JSON records. In the case of the `Socket` source, that data is received as plain text.
To transform our raw data to `WebLog` records, we first require a schema. The schema provides the information needed to parse the JSON text into a structured record. It's the 'structure' when we talk about 'structured' streaming.

After defining a schema for our data, we will:

- Parse the text `value` as JSON using the JSON support built into the structured API of Spark
- Use the `Dataset` API to transform the JSON records to `WebLog` objects

As a result of this process, we will obtain a streaming `Dataset` of `WebLog` records.

In [ ]:
import org.apache.spark.sql.Encoders
val webLogSchema = Encoders.product[WebLog].schema

In [ ]:
import org.apache.spark.sql.functions._
val jsonStream = stream.select(from_json($"value", webLogSchema) as "record")

In [ ]:
import org.apache.spark.sql.Dataset
val webLogStream: Dataset[WebLog] = jsonStream.select("record.*").as[WebLog]

## We have a structured stream.
The `webLogStream` we just obtained is of type `Dataset[WebLog]` like we had in the batch analytics job.
The difference between this instance and the batch version is that `webLogStream` is a streaming `Dataset`.

We can observe this by querying the object.


In [ ]:
webLogStream.isStreaming

## Operations on Streaming Datasets
At this point in the batch job, we were creating the first query on our data: How many records are contained in our dataset?
This is a question that we can answer easily when we have access to all the data. But how do we count records that are constantly arriving?
The answer is that some operations we consider usual on a static `Dataset`, like counting all records, do not have a defined meaning on a streaming Dataset.

As we can observe, attempting to execute the `count` query below will result in an `AnalysisException`. Queries in Structured Streaming are a continuous operation that needs to be scheduled. To start scheduling queries on a stream, we use the `writeStream.start()` operation. 

In [ ]:
// expect this call to fail!
val count = webLogStream.count()

## What are popular URLs? In what timeframe?

Now that we have immediate analytic access to the stream of weblogs, we don't need to wait for a day or a month to have a ranking of the popular URLs. We can have that information as trends unfold on much shorter windows of time.

To define the period of our interest, we create a window over some timestamp. An interesting feature of Structured Streaming is that we can define that window on the timestamp when the data was produced, also known as 'event time' as opposed to the time when the data is processed.

Our window covers 5 minutes of event time and slides forward every minute. Given that the TCP server replays the logs on a simulated timeline, those 5 minutes might pass much faster or slower than wall-clock time. In this way, we can appreciate how Structured Streaming uses the timestamp information in the events to keep track of the event timeline.

As we learned from the batch analytics, we should extract the URLs and only select content pages, like `html`, `htm`, or directories. Let's apply that acquired knowledge first before proceeding to define our `window` query.

In [ ]:
// A regular expression to extract the accessed URL from weblog.request
val urlExtractor = """^GET (.+) HTTP/\d\.\d""".r
val allowedExtensions = Set(".html",".htm", "")

val contentPageLogs: String => Boolean = url => {
  val ext = url.takeRight(5).dropWhile(c => c != '.')
  allowedExtensions.contains(ext)
}

val urlWebLogStream = webLogStream.flatMap{ weblog => 
  weblog.request match {                                        
    case urlExtractor(url) if (contentPageLogs(url)) => Some(weblog.copy(request = url))
    case _ => None
  }
}
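To see what this filter keeps and drops, the same logic can be tried on plain strings, without Spark. The sketch below repeats the extractor and extension check from the cell above (with the dot in the HTTP version escaped) and applies them to a few illustrative request lines. The sample requests and the helper names `isContentPage` and `contentUrl` are made up for this example.

```scala
val urlExtractor = """^GET (.+) HTTP/\d\.\d""".r
val allowedExtensions = Set(".html", ".htm", "")

// True for .html/.htm pages and for URLs with no extension in their last 5 characters
def isContentPage(url: String): Boolean =
  allowedExtensions.contains(url.takeRight(5).dropWhile(_ != '.'))

// Returns the URL of a GET request for a content page, or None otherwise
def contentUrl(request: String): Option[String] = request match {
  case urlExtractor(url) if isContentPage(url) => Some(url)
  case _ => None
}

// Illustrative request lines (assumed, not taken from the dataset)
val sampleRequests = Seq(
  "GET /shuttle/countdown/ HTTP/1.0",
  "GET /images/NASA-logosmall.gif HTTP/1.0",
  "GET /history/apollo/apollo-13/apollo-13.html HTTP/1.0",
  "POST /cgi-bin/form HTTP/1.0"
)

val kept = sampleRequests.flatMap(r => contentUrl(r).toList)
kept.foreach(println)
```

Only the directory and the `.html` page survive: the image is rejected by the extension check and the `POST` request does not match the extractor.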

## Top Content Pages Query
We have converted the request to only contain the visited URL and filtered out all non-content pages.
We will now define the windowed query to compute the top trending URLs.

In [ ]:
val rankingURLStream = urlWebLogStream.groupBy($"request", window($"timestamp", "5 minutes", "1 minute")).count()
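Because the window is 5 minutes long and slides every minute, each event is counted in several overlapping windows. The following plain-Scala sketch illustrates that assignment under the assumption that window starts are aligned to multiples of the slide interval counted from the epoch. `TimeWindow` and `windowsFor` are hypothetical names for this illustration; this is not Spark's implementation.

```scala
import java.time.{Duration, Instant}

// Hypothetical type for this sketch: a window is a [start, end) interval
final case class TimeWindow(start: Instant, end: Instant)

// All sliding windows of the given length and slide that contain eventTime
def windowsFor(eventTime: Instant, length: Duration, slide: Duration): List[TimeWindow] = {
  val t = eventTime.toEpochMilli
  val lengthMs = length.toMillis
  val slideMs = slide.toMillis
  // Latest window start at or before the event, aligned to the epoch
  val lastStart = t - java.lang.Math.floorMod(t, slideMs)
  Iterator.iterate(lastStart)(_ - slideMs)
    .takeWhile(start => start + lengthMs > t) // the window still covers the event
    .map(start => TimeWindow(Instant.ofEpochMilli(start), Instant.ofEpochMilli(start + lengthMs)))
    .toList
    .reverse // earliest window first
}

// An illustrative event time (assumed for the example)
val event = Instant.parse("1995-07-01T00:03:20Z")
val windows = windowsFor(event, Duration.ofMinutes(5), Duration.ofMinutes(1))
windows.foreach(w => println(s"${w.start} -> ${w.end}"))
```

With these parameters the event falls into five windows, so a single request contributes to five rows of the ranking, one per window.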

## Start the stream processing
All the steps we have followed so far define the process that the stream will undergo, but no data has been processed yet.

To start a Structured Streaming job, we need to specify a `sink` and an `output mode`. 
These are two new concepts introduced by Structured Streaming.

A `sink` defines where we want to materialize the resulting data, like to a file in a file system, to an in-memory table or to another streaming system such as Kafka.
The `output mode` defines how we want the results to be delivered: Do we want to see all data every time, only updates or just the new records? 

These options are given to a `writeStream` operation that creates the streaming query that starts the stream consumption, materializes the computations 
declared on the query and produces the result to the output `sink`.

We will visit all these concepts in detail later on. For now, we will use them empirically and observe the results.

For our query, we will use the `memory` `sink` with output mode `complete`, so that the table holding the URL ranking is fully rewritten with the latest counts each time new records arrive.

In [ ]:
val query = rankingURLStream.writeStream
  .queryName("urlranks")
  .outputMode("complete")
  .format("memory")
  .start()
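To build some intuition for what `complete` means compared to delivering only updates, here is a toy model in plain Scala. It is not Spark code and does not reflect how Spark is implemented: it treats two made-up micro-batches of URLs as the input and shows which rows of a running count each mode would deliver after every batch.

```scala
// Two illustrative micro-batches of visited URLs (assumed data)
val batches: Seq[Seq[String]] = Seq(
  Seq("/index.html", "/news.html", "/index.html"),
  Seq("/news.html", "/about.html")
)

// Result table (URL -> count so far) after each micro-batch
val states: Seq[Map[String, Long]] =
  batches.scanLeft(Map.empty[String, Long]) { (state, batch) =>
    batch.foldLeft(state)((acc, url) => acc.updated(url, acc.getOrElse(url, 0L) + 1L))
  }.tail

// "complete": the whole result table is delivered after every batch
val completeOutput: Seq[Map[String, Long]] = states

// "update": only the rows whose count changed in that batch are delivered
val updateOutput: Seq[Map[String, Long]] =
  (Map.empty[String, Long] +: states).zip(states).map {
    case (before, after) => after.filter { case (url, n) => !before.get(url).contains(n) }
  }

completeOutput.foreach(table => println(s"complete: $table"))
updateOutput.foreach(rows => println(s"update:   $rows"))
```

After the second batch, the complete output still includes `/index.html` even though its count did not change, while the update output only carries `/news.html` and `/about.html`.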

### The memory sink outputs the data to a temporary table of the same name given in the queryName option.

In [ ]:
sparkSession.sql("show tables").show()

## Exploring the Data
Since the results are available as a temporary table, we can create a `DataFrame` from that table to explore the results of the stream process.


In [ ]:
val urlRanks = sparkSession.sql("select * from urlranks")

### Before we can see any materialized results, we need to wait for the window to complete.
Given that we are accelerating the log timeline on the producer side, after a few seconds we can execute the next command to see the results of the first windows.

In [ ]:
urlRanks.select($"request", $"window", $"count").orderBy(desc("count"))