# FIT5202 - Data processing for Big Data

## Assignment 2: Detecting Linux system hacking activities Part B

Name: Roma Hambar\
Student ID: 31223958\
Python Version Used: Python 3

<h1>Memory Producer Modelling </h1>

Apache Kafka is an open-source event streaming platform used to publish and subscribe to streams of data. 
A Kafka producer is an application that publishes data to a Kafka cluster. 

The code below builds a Kafka producer that streams memory activity records to the machine learning model that predicts whether a Linux machine is under attack.

* [Initialize Apache Kafka producer](#Initialize)
* [Read memory activity data from csv datafile](#Read)
* [Reformat data columns and timestamp](#Reformat)
* [Send data records to Kafka Topic](#Publish)
* [Setup Kafka Producer and read data](#Start)
* [Initialize Streaming data to Kafka Topic](#streaming)

In [1]:
# import statements
from time import sleep
from json import dumps
from kafka import KafkaProducer
import random
from datetime import datetime as dt
import csv

In [2]:
# Kafka topic name for memory data records
topic='memory'
# CSV file with Linux memory activity data
filename = 'Streaming_Linux_memory.csv'

## Initialize Apache Kafka producer <a class="anchor" name="Initialize"></a>

In [3]:
# Using KafkaProducer(), initialise a Kafka producer instance connected to the local broker
def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                                  value_serializer=lambda x: dumps(x).encode('ascii'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    # returns None if the connection failed
    return _producer
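
The `value_serializer` above turns each Python object into JSON bytes before it is sent. A minimal sketch of that round trip, using a hypothetical record (the field values are made up for illustration) and no Kafka connection:

```python
from json import dumps, loads

def serialize(value):
    # Same transformation as the value_serializer lambda above
    return dumps(value).encode('ascii')

# Hypothetical record, for illustration only
record = {'sequence': 1, 'machine': 4, 'CMD': 'café'}

payload = serialize(record)

# json.dumps escapes non-ASCII characters by default,
# so encoding the result as ASCII does not fail
restored = loads(payload.decode('ascii'))
```

A consumer would apply the inverse step (decode, then `loads`) to recover the original dictionary.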

## Read memory activity data from csv datafile <a class="anchor" name="Read"></a>

In [4]:
def read_csv(fileName):
    data = []
    machine_4 = []
    machine_5 = []
    machine_6 = []
    machine_7 = []
    machine_8 = []

    # read csv file; the with-block closes the file once all rows are consumed
    with open(fileName, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            # for each machine, store data records in separate lists
            machine = int(row['machine'])
            if machine == 4:
                machine_4.append(row)
            elif machine == 5:
                machine_5.append(row)
            elif machine == 6:
                machine_6.append(row)
            elif machine == 7:
                machine_7.append(row)
            else:
                # any other machine id is stored with machine 8
                machine_8.append(row)
        
    data.append(machine_4)
    data.append(machine_5)
    data.append(machine_6)
    data.append(machine_7)
    data.append(machine_8)
    
    return data
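
The per-machine grouping can also be written with a dictionary keyed by machine id. The sketch below is an alternative illustration, not the notebook's method; it uses a hypothetical two-column in-memory sample instead of the real CSV file, and assumes the machine ids are exactly 4 to 8.

```python
import csv
from io import StringIO

# Hypothetical sample; the real file has many more columns
sample = StringIO("sequence,machine\n1,4\n1,5\n2,4\n1,8\n")

machines = [4, 5, 6, 7, 8]
grouped = {m: [] for m in machines}

for row in csv.DictReader(sample):
    # csv.DictReader yields every value as a string
    grouped[int(row['machine'])].append(row)

# Same shape as the return value of read_csv: one sublist per machine
data = [grouped[m] for m in machines]
```

Unlike the `else` branch in `read_csv`, this variant raises a `KeyError` for an unexpected machine id instead of storing the row with machine 8.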

## Reformat data columns and timestamp <a class="anchor" name="Reformat"></a>

In [5]:
# Cast each column to its proper datatype and attach the current timestamp to each data record
def add_timestamp(data_rows,ts):
    # for each machine, a new sublist will be used
    data = []

    for row in data_rows:
        row['ts'] = ts
        row['sequence'] = int(row['sequence'])
        row['machine'] = int(row['machine'])     
        row['PID'] = int(row['PID'])
        row['MINFLT'] = int(row['MINFLT'])
        row['MAJFLT'] = int(row['MAJFLT'])
        row['VSTEXT'] = int(row['VSTEXT'])
        row['VSIZE'] = float(row['VSIZE'])
        row['RSIZE'] = float(row['RSIZE'])
        row['VGROW'] = float(row['VGROW'])
        row['RGROW'] = float(row['RGROW'])
        row['MEM'] = float(row['MEM'])
        row['CMD'] = str(row['CMD'])
        # append a copy of the typed record to the list
        data.append(dict(row))

    return data
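
The column casts can also be driven by a lookup table. The following is a sketch under assumptions: `COLUMN_TYPES` and `cast_row` are hypothetical names, and the sample row values are invented. Unlike `add_timestamp`, it leaves the input row untouched.

```python
# Column-to-type mapping mirroring the casts in add_timestamp
COLUMN_TYPES = {
    'sequence': int, 'machine': int, 'PID': int,
    'MINFLT': int, 'MAJFLT': int, 'VSTEXT': int,
    'VSIZE': float, 'RSIZE': float, 'VGROW': float,
    'RGROW': float, 'MEM': float, 'CMD': str,
}

def cast_row(row, ts):
    """Return a typed copy of one CSV row with the timestamp attached."""
    # Columns missing from the mapping are kept as strings
    typed = {k: COLUMN_TYPES.get(k, str)(v) for k, v in row.items()}
    typed['ts'] = ts
    return typed

# Hypothetical raw row, as csv.DictReader would yield it (all strings)
raw = {'sequence': '1', 'machine': '4', 'PID': '1234',
       'MINFLT': '0', 'MAJFLT': '0', 'VSTEXT': '12',
       'VSIZE': '1024.5', 'RSIZE': '512.0', 'VGROW': '0.0',
       'RGROW': '0.0', 'MEM': '0.01', 'CMD': 'bash'}

typed = cast_row(raw, '1600000000')
```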

## Send data records to Kafka Topic <a class="anchor" name="Publish"></a>

In [6]:
# Publish data to kafka topic using producer instance 
def publish_message(producer_instance, topic_name, data):
    try:
        producer_instance.send(topic_name, data)
        # print('Message published successfully. Data: ' + str(data))
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))

## Setup Kafka Producer and read data  <a class="anchor" name="Start"></a>

In [7]:
# Initialize kafka producer
producer = connect_kafka_producer()
# Read memory activity data from csv file
csv_data = read_csv(filename)

## Initialize Streaming data to Kafka Topic   <a class="anchor" name="streaming"></a>

The data is published to the topic in the following format:

    For the first cycle:
    [
    [{sequence:1, machine:4,..},{sequence:2, machine:4,..},..],
    [{sequence:1, machine:5,..},{sequence:2, machine:5,..},..],
    [{sequence:1, machine:6,..},{sequence:2, machine:6,..},..],
    [{sequence:1, machine:7,..},{sequence:2, machine:7,..},..],
    [{sequence:1, machine:8,..},{sequence:2, machine:8,..},..]
    ]
    
    From the second cycle onwards (if a machine has 0 y-records, no y list is generated for that machine): 
    [
    x_records[{sequence:1, machine:4,..},{sequence:2, machine:4,..},..],
    y_records[{sequence:1, machine:4,..},{sequence:2, machine:4,..},..],
    x_records[{sequence:1, machine:5,..},{sequence:2, machine:5,..},..],
    y_records[{sequence:1, machine:5,..},{sequence:2, machine:5,..},..],
    x_records[{sequence:1, machine:6,..},{sequence:2, machine:6,..},..],
    y_records[{sequence:1, machine:6,..},{sequence:2, machine:6,..},..],
    x_records[{sequence:1, machine:7,..},{sequence:2, machine:7,..},..],
    y_records[{sequence:1, machine:7,..},{sequence:2, machine:7,..},..],
    x_records[{sequence:1, machine:8,..},{sequence:2, machine:8,..},..],
    y_records[{sequence:1, machine:8,..},{sequence:2, machine:8,..},..]
    ]
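
The way x and y records interleave across cycles can be traced on toy data. This is a sketch with a hypothetical helper `take_and_rotate`, integers standing in for the records of a single machine, and fixed counts in place of random ones:

```python
def take_and_rotate(records, n):
    """Remove the first n records, re-append them at the end, and return them."""
    taken = records[:n]
    del records[:n]
    records += taken
    return taken

# Hypothetical stand-in for one machine's records
machine_records = list(range(1, 11))

pending_y = []   # y records held back from the previous cycle
batches = []

for x, y in [(3, 2), (4, 0)]:   # fixed (x, y) counts for two cycles
    batch = [take_and_rotate(machine_records, x)]
    if pending_y:
        # y records selected in the previous cycle are published now
        batch.append(pending_y)
    pending_y = take_and_rotate(machine_records, y)
    batches.append(batch)
```

The first batch holds only x records; the second holds its own x records followed by the y records selected during the first cycle. Because taken records are moved to the end of the list, the data never runs out.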

In [8]:
# List of flags per machine to verify whether y records need to be sent
y_cycle = [False]*5

# list of lists of y records to be sent in the next cycle (empty until the first cycle has run)
y_cycle_data = [[] for _ in range(5)]

In [None]:
# generate stream of data for indefinite time
while True:
    publish_data = []
    # current timestamp as a string (10-digit Unix epoch time in seconds);
    # strftime("%s") is platform-specific and reads a naive UTC datetime as local time
    ts = str(int(dt.now().timestamp()))
    # print('Publishing records..')
    # for 5 machines
    for i in range(5):
        # generate random count for x and y records for each machine for second cycle onwards
        x = random.randint(20,80)
        y = random.randint(0,5)
            
        # generate x records and append the same selected records to the end of the original data list
        data_rows_x = csv_data[i][:x]
        del csv_data[i][:x]
        csv_data[i] += data_rows_x
        
        # Append the x-cycle records to be published to publish_data list
        publish_data.append(add_timestamp(data_rows_x,ts))
        
        # y_cycle for each machine is set to True after the first cycle; it is False by default
        if y_cycle[i]:
            # This sublist is empty when no y records were selected in the previous cycle;
            # otherwise it holds the y records from the previous cycle to be published now
            if y_cycle_data[i]:
                publish_data.append(add_timestamp(y_cycle_data[i],ts))
        
        # Select y records from this cycle to publish in the next cycle and append the selected records to the
        # end of the machine data list so that the dataset is never exhausted
        data_rows_y = csv_data[i][:y]
        y_cycle_data[i] = csv_data[i][:y]
        del csv_data[i][:y]
        csv_data[i] += data_rows_y
        
        y_cycle[i] = True
    # Publish the x and y records of all machines to the Kafka topic as one message
    publish_message(producer, topic, publish_data)
    # Wait 10 seconds before generating the next cycle
    sleep(10)
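
The timestamp attached to each cycle is Unix epoch time. As a small sketch of the difference between second and millisecond precision (the variable names are illustrative only):

```python
from datetime import datetime, timezone

# A timezone-aware datetime avoids any ambiguity about local time vs UTC
now = datetime.now(timezone.utc)

ts_seconds = str(int(now.timestamp()))          # epoch seconds, 10 digits
ts_millis = str(int(now.timestamp() * 1000))    # epoch milliseconds, 13 digits
```

A consumer has to know which precision the producer uses in order to convert the value back to a datetime correctly.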