### Step 1 : Include Libraries and Initialize Spark Session

In [1]:
#Import necessary libraries and initialize Spark Session

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell'

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Lab-Task Log Analysis") \
    .getOrCreate()
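
The `--packages` coordinates above encode two versions: the Scala build (`2.12`) and the Spark release (`3.0.0`). The following is a minimal sketch, assuming you would rather derive the string than hard-code it. `kafka_packages` and `submit_args` are hypothetical helpers, not part of PySpark, and the default versions are simply the ones used in this notebook.

```python
def kafka_packages(spark_version="3.0.0", scala_version="2.12"):
    """Build the Maven coordinates for the two Kafka connectors (hypothetical helper)."""
    artifacts = ["spark-streaming-kafka-0-10", "spark-sql-kafka-0-10"]
    return ",".join(
        f"org.apache.spark:{name}_{scala_version}:{spark_version}"
        for name in artifacts
    )


def submit_args(spark_version="3.0.0", scala_version="2.12"):
    """Assemble the value expected in PYSPARK_SUBMIT_ARGS."""
    return f"--packages {kafka_packages(spark_version, scala_version)} pyspark-shell"


print(submit_args())
```

The environment variable has to be set before the Spark session is created. In a running session, `spark.version` reports the Spark release the coordinates should match.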

## Use-Case : Tracking Server Access Log <a class="anchor" name="use-case"></a>
In this use case, a server continuously sends records of hosts trying to access an endpoint (URL) on the web server. The data is sent by a Kafka producer (<code>2.Lab-Task-KafkaProducer.ipynb</code>), which reads it from a text file in the provided dataset (<code>logs/access_log.txt</code>).

Each line contains some valuable information such as:

1. Host
2. Timestamp
3. HTTP method
4. URL endpoint
5. Status code
6. Protocol
7. Content Size
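
To make the seven fields concrete, here is a minimal plain-Python sketch that splits one line into them. It assumes the file follows the Apache Common Log Format. The sample line, `CLF_PATTERN`, `AccessRecord` and `parse_line` are illustrative names and data, not taken from the dataset or from the notebook.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout: host ident user [timestamp] "method endpoint protocol" status size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+)\s*(?P<protocol>\S*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


class AccessRecord(NamedTuple):
    host: str
    timestamp: datetime
    method: str
    endpoint: str
    status: int
    protocol: str
    content_size: Optional[int]


def parse_line(line):
    """Return an AccessRecord, or None when the line does not match the assumed layout."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    size = m.group("size")
    return AccessRecord(
        host=m.group("host"),
        timestamp=datetime.strptime(m.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z"),
        method=m.group("method"),
        endpoint=m.group("endpoint"),
        status=int(m.group("status")),
        protocol=m.group("protocol"),
        # The size is written as "-" when no body was returned.
        content_size=int(size) if size.isdigit() else None,
    )


# Illustrative line only; it is not copied from logs/access_log.txt.
sample = '180.76.15.7 - - [29/Nov/2015:03:50:05 +0000] "GET / HTTP/1.1" 301 0'
record = parse_line(sample)
print(record)
```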

The goal here is to run some real-time queries on this stream of data and output the results in multiple ways.

### Step 2 : Load Kafka Stream 
Use <code>readStream</code> to load the data sent by the Kafka producer <strong>2.Lab-Task-KafkaProducer.ipynb</strong>.

<a class="anchor" id="lab-task-1"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">1. Lab Task: </strong> 
    Write the code below to readStream from the producer into the <code>df_urls</code> DataFrame.
</div>

In [2]:
# Monitor the logs data stream for new log data
topic = "w10_access_log"
hostip = "10.156.3.124" 

df_urls = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", f'{hostip}:9092') \
  .option("subscribe", topic) \
  .load()#WRITE THE CODE HERE
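
The Kafka source is configured entirely through string options, so they can be collected in a dictionary first. This is a minimal sketch, assuming the broker listens on port 9092 as in the cell above. `kafka_source_options` is a hypothetical helper, and `startingOffsets` is an addition that the original cell does not set.

```python
def kafka_source_options(host, topic, port=9092, starting_offsets="latest"):
    """Collect the Kafka source options in one dict (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": f"{host}:{port}",
        "subscribe": topic,
        # "latest" reads only new messages; "earliest" replays the topic from the start.
        "startingOffsets": starting_offsets,
    }


options = kafka_source_options("10.156.3.124", "w10_access_log", starting_offsets="earliest")
print(options)
```

The dictionary can then be passed in one call, for example `spark.readStream.format("kafka").options(**options).load()`.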

## Data Preparation <a class="anchor" name="data-prep"></a>
We need to parse the data in each Kafka message before we can query it. The steps are:

1. Cast <code>value</code>, which is binary, to a string to get the message.
2. Define regular expressions that capture specific fields in the message, which is one line of the access log.
3. Extract the values with those regular expressions to create the DataFrame.

In [3]:
# Get value of the kafka message
log_lines = df_urls.selectExpr("CAST(value AS STRING)")

# Parse out the common log format to a DataFrame
statusExp = r'\s(\d{3})\s'
generalExp = r'\"(\S+)\s(\S+)\s*(\S*)\"'
hostExp = r'(^\S+\.[\S+\.]+\S+)\s'

df_logs = log_lines.select(regexp_extract('value', hostExp, 1).alias('host'),
                         regexp_extract('value', generalExp, 1).alias('method'),
                         regexp_extract('value', generalExp, 2).alias('endpoint'),
                         regexp_extract('value', generalExp, 3).alias('protocol'),
                         regexp_extract('value', statusExp, 1).cast('integer').alias('status'))

df_logs.printSchema()

root
 |-- host: string (nullable = true)
 |-- method: string (nullable = true)
 |-- endpoint: string (nullable = true)
 |-- protocol: string (nullable = true)
 |-- status: integer (nullable = true)



## Data Streaming Processing 

<a class="anchor" id="lab-task-2"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">2. Lab Task: </strong> 
    Write a DataFrame query that keeps only the requests that were not successful, using the <code>status != 200</code> filter.
</div>

In [4]:
# 1. DF that filters those requests that were not successful (status != 200)
unsuccess_df = df_logs.filter(F.col('status') != 200)  # WRITE THE CODE HERE

<a class="anchor" id="lab-task-3"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">3. Lab Task: </strong> 
    Write a DataFrame query to count the number of requests by access status code.
</div>

In [5]:
# 2. DF that keeps a running count of every access by status code

status_count_df = df_logs.groupby('status').count() #WRITE THE CODE HERE


## Output sink <a class="anchor" name="output-sink"></a>
Before starting this section, run the Kafka producer (<code>2.Lab-Task-KafkaProducer.ipynb</code>), which sends the data from the access log file.

In [6]:
# Create function to show values received from input dataframe
def foreach_batch_function(df, epoch_id):
    df.show(20, truncate=False)

#### Display stream output in notebook <a class="anchor" name="foreachBatch"></a>

In [7]:
# Write the output of status_count_df to the output cell using foreach_batch_function
# Control how often the output is displayed with the trigger interval
query1 = status_count_df.writeStream.outputMode("complete")\
        .foreachBatch(foreach_batch_function)\
        .trigger(processingTime='5 seconds')\
        .start()
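
With `outputMode("complete")`, every trigger hands the whole aggregated table to the batch function, not only the rows that changed. The plain-Python sketch below imitates that behaviour with a `Counter`. It is an analogy, not how Spark keeps state internally, and the status codes in `batches` are made up for illustration.

```python
from collections import Counter


def run_complete_mode(batches):
    """Yield the full status-count table after each micro-batch."""
    totals = Counter()
    for batch in batches:
        totals.update(batch)   # the aggregate state carries over between triggers
        yield dict(totals)     # complete mode re-emits every group each time


# Hypothetical status codes arriving in three micro-batches.
batches = [[200, 200, 301], [404], [200, 301]]
outputs = list(run_complete_mode(batches))

for epoch_id, table in enumerate(outputs):
    print(epoch_id, table)
```

Each printed table corresponds to one call of `foreach_batch_function`, with `epoch_id` playing the role of the batch identifier.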

In [8]:
query1.stop()

<a class="anchor" id="lab-task-4"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">4. Lab Task: </strong> 
    Write the stream output (<strong>for the unsuccess_df</strong>) to the <strong>memory sink</strong> and display the result using <strong>Spark SQL</strong>.
</div>

In [9]:
#WRITE THE CODE HERE TO WRITE OUTPUT TO MEMORY SINK
query2 = unsuccess_df.writeStream\
            .outputMode("append")\
            .format('memory')\
            .queryName('unsuccess_query')\
            .start()

In [19]:
#WRITE THE CODE HERE TO QUERY THE TABLE FROM MEMORY SINK
spark.sql('select * from unsuccess_query').show()

+----------------+------+--------------------+---------+------+
|            host|method|            endpoint| protocol|status|
+----------------+------+--------------------+---------+------+
|    "180.76.15.7|   GET|                   /|HTTP/1.1\|   301|
|  "180.76.15.141|   GET|                   /|HTTP/1.1\|   301|
|   "180.76.15.30|   GET|                   /|HTTP/1.1\|   301|
|    "180.76.15.7|   GET|                   /|HTTP/1.1\|   301|
|  "62.210.88.201|   GET|http://51.254.206...|HTTP/1.1\|   301|
|  "97.100.169.53|   GET|     /?page_id=34248|HTTP/1.1\|   301|
| "46.229.170.197|   GET|     /?page_id=34420|HTTP/1.1\|   301|
|  "180.76.15.153|   GET|                   /|HTTP/1.1\|   301|
|  "180.76.15.152|   GET|                   /|HTTP/1.1\|   301|
|  "62.210.88.201|   GET|http://51.254.206...|HTTP/1.1\|   301|
|   "180.76.15.32|   GET|                   /|HTTP/1.1\|   301|
|    "180.76.15.7|   GET|                   /|HTTP/1.1\|   301|
| "113.204.53.134|  POST|          /jeec