Import SparkContext

In [0]:
from pyspark import SparkContext

Defining the solution in a routine

In [0]:
def PrintNoOfLogs(data):
    levels = data.filter(lambda row: len(row) > 0)\
                 .filter(lambda row: not row.startswith("##"))\
                 .filter(lambda row: not row.startswith(" "))\
                 .map(lambda row: (row.partition(' ')[0], 1))
    reducedLevels = levels.reduceByKey(lambda count1, count2: count1+count2)
    answer = reducedLevels.sortByKey().collect()
    print(answer)  

##Example 1
Print the input text file

In [0]:
dbutils.fs.cp("/FileStore/tables/input.txt", "file:///tmp/input.txt")
with open("/tmp/input.txt", "r") as file:
    print(file.read())

## input.txt ##

INFO This is a message with content

INFO This is some other content

## (empty line)

INFO Here are more messages


 

ERROR Something bad happened

WARN More details on the bad thing

INFO back to normal messages


Executing the routine on example 1

In [0]:
sc = SparkContext.getOrCreate()
data = sc.textFile("/FileStore/tables/input.txt")
PrintNoOfLogs(data)

[('ERROR', 1), ('INFO', 4), ('WARN', 2)]


##Example 2
Print the contents of the text file

In [0]:
dbutils.fs.cp("/FileStore/tables/input1.txt", "file:///tmp/input1.txt")
with open("/tmp/input1.txt", "r") as file:
    print(file.read())

## input1.txt ##

INFO SparkContext: Running Spark version 3.2.1
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO ResourceUtils: No custom resources configured for spark.driver.
INFO SparkContext: Submitted application: solution.py
INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
INFO ResourceProfile: Limiting resource is cpu
INFO ResourceProfileManager: Added ResourceProfile id: 0
INFO SecurityManager: Changing view acls to: jananiravikumar
INFO SecurityManager: Changing modify acls to: jananiravikumar
INFO SecurityManager: Changing view acls groups to: 
INFO SecurityManager: Changing modify acls groups to: 
INFO SecurityManager: SecurityManager: authenticat

Executing the routine on example 2.

In [0]:
sc = SparkContext.getOrCreate()
data = sc.textFile("/FileStore/tables/input1.txt")
PrintNoOfLogs(data)

[('INFO', 247), ('WARN', 1)]


##How the routine works:
Lines 2, 3 and 4 of the routine go through each row of the text file and filter out empty lines (length 0) as well as lines that start with '##' or a space. Thus comment lines, blank lines and malformed lines (those starting with a space) are all ignored. Then in line 5, the `map` function maps the first word of each remaining row (delimited by a space ' ') to a count of 1. <br/>
In line 6, `reduceByKey` reduces all the mapped entries, accumulating the count for each key. In line 7, the reduced pairs are sorted by key and collected, and the resulting key-value pairs are displayed in line 8.
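
The steps described above can be tried out without a Spark cluster. What follows is only a minimal plain-Python sketch, not the Spark routine itself: `count_log_levels` and `sample_lines` are hypothetical names and data introduced for illustration, and `collections.Counter` stands in for `reduceByKey`.

```python
from collections import Counter

def count_log_levels(lines):
    """Plain-Python stand-in for filter -> map -> reduceByKey -> sortByKey."""
    counts = Counter()
    for row in lines:
        if len(row) == 0:            # same test as the first filter: skip empty lines
            continue
        if row.startswith("##"):     # second filter: skip comment lines
            continue
        if row.startswith(" "):      # third filter: skip lines with a leading space
            continue
        level = row.partition(" ")[0]  # first word, as in the map step
        counts[level] += 1             # accumulate, as reduceByKey does per key
    return sorted(counts.items())      # sort by key, as sortByKey does

# Hypothetical sample data, not taken from input.txt
sample_lines = [
    "## a comment line",
    "INFO service started",
    "",
    " indented line, treated as malformed",
    "WARN disk usage high",
    "INFO request handled",
    "ERROR request failed",
]

print(count_log_levels(sample_lines))
# [('ERROR', 1), ('INFO', 2), ('WARN', 1)]
```

The three filters are kept as separate checks here so that each one corresponds to one `filter` call in the routine.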

##Space and Time Complexity
When run on a cluster with a single node, `filter`, `map` and `reduceByKey` should all run in linear time, since they process one line of the log at a time. Thus lines 2-6 should run in `O(n)`, where `n` is the number of lines in the log file. In line 7, the reduced keys are sorted. Assuming the sort costs `O(m log(m))`, where `m` is the number of unique logging levels, this should be negligible for a large log file with a huge number of lines, i.e., where `n >> m`. Therefore, when there is no parallelization, the time complexity of the routine can be given as `O(n)`.<br/><br/> 
When more nodes are available, the time taken by the map phase can in principle be lowered towards `O(1)`, since the operation on one log line does not depend on any other line. In this case, the time taken by the reduce phase can also be lowered, but the time complexity should still be linearly proportional to the number of log lines. Thus, the time complexity of the routine with maximum parallelization can be described as `O(n)`.
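
The independence of the per-line work can be illustrated in plain Python. This is a rough sketch under stated assumptions, not a model of Spark's actual scheduler: the round-robin split in `count_in_partitions` is a hypothetical stand-in for Spark's partitioning, and all function names and the `rows` data are invented for the example. Each partition is counted on its own, and only the small per-partition counters (at most `m` keys each) are merged.

```python
from collections import Counter
from functools import reduce

def is_log_line(row):
    # Same three filters as the routine: non-empty, not a "##" comment, no leading space.
    return len(row) > 0 and not row.startswith("##") and not row.startswith(" ")

def count_partition(rows):
    # Work on one partition only; no other partition is consulted.
    return Counter(row.partition(" ")[0] for row in rows if is_log_line(row))

def count_in_partitions(rows, num_partitions):
    # Hypothetical round-robin split standing in for Spark's partitioning.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    partial_counts = [count_partition(p) for p in partitions]
    # The merge touches at most m keys per partition, not n rows.
    total = reduce(lambda a, b: a + b, partial_counts, Counter())
    return sorted(total.items())

# Hypothetical sample data
rows = ["INFO a", "WARN b", "", "INFO c", "## note", "ERROR d", " stray", "INFO e"]

print(count_in_partitions(rows, 1))  # single partition
print(count_in_partitions(rows, 3))  # three partitions, same result
```

Both calls print the same list, because the partial counts can be merged in any order. Here the partitions are processed one after another; on a cluster, each `count_partition` call could run on a different node.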