# FIT5202 - Data processing for Big Data

## Assignment 2: Detecting Linux system hacking activities Part B

Name: Roma Hambar\
Student ID: 31223958\
Python Version Used: Python 3

<h1>Process Producer Modelling </h1>

Apache Kafka is an open-source event streaming platform used to publish and subscribe to streams of data.
A Kafka producer is an application that publishes data to a Kafka cluster.

The code below builds a Kafka producer that streams process activity records to the machine learning model that predicts whether a Linux machine is under attack.

* [Initialize Apache Kafka producer](#Initialize)
* [Read process activity data from csv datafile](#Read)
* [Reformat data columns and timestamp](#Reformat)
* [Send data records to Kafka Topic](#Publish)
* [Setup Kafka Producer and read data](#Start)
* [Initialize Streaming data to Kafka Topic](#streaming)

In [1]:
# import statements
from time import sleep
from json import dumps
from kafka import KafkaProducer
import random
from datetime import datetime as dt
import csv

In [2]:
# Kafka topic name for process data records
topic='process'
# CSV file with Linux process activity data
filename = 'Streaming_Linux_process.csv'

## Initialize Apache Kafka producer <a class="anchor" name="Initialize"></a>

In [3]:
# Create a KafkaProducer instance connected to the local broker; values are serialized as JSON
def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                                  value_serializer=lambda x: dumps(x).encode('ascii'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    finally:
        return _producer
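
The `value_serializer` above turns each Python object into JSON bytes before it is sent. The following is a minimal sketch of that step in isolation, so it runs without a Kafka broker; the sample record is hypothetical and only illustrates the behaviour.

```python
from json import dumps, loads

# Same serializer expression as the one passed to KafkaProducer above
serialize = lambda x: dumps(x).encode('ascii')

# Hypothetical sample record, for illustration only
record = {'sequence': 1, 'machine': 4, 'CMD': 'café'}

payload = serialize(record)
print(type(payload))
print(payload)

# Decoding on the receiving side restores the original dictionary
restored = loads(payload.decode('ascii'))
print(restored == record)
```

Because `dumps` escapes non-ASCII characters by default, the `ascii` encoding step does not fail on values such as `'café'`.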

## Read process activity data from csv datafile <a class="anchor" name="Read"></a>

In [4]:
def read_csv(fileName):
    data = []
    machine_4 = []
    machine_5 = []
    machine_6 = []
    machine_7 = []
    machine_8 = []

    # read csv file; the with-block closes the file once all rows are read
    with open(fileName, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            # for each machine, store data records in separate lists
            machine_id = int(row['machine'])
            if machine_id == 4:
                machine_4.append(row)
            elif machine_id == 5:
                machine_5.append(row)
            elif machine_id == 6:
                machine_6.append(row)
            elif machine_id == 7:
                machine_7.append(row)
            else:
                # any other machine ID is treated as machine 8
                machine_8.append(row)
        
    data.append(machine_4)
    data.append(machine_5)
    data.append(machine_6)
    data.append(machine_7)
    data.append(machine_8)
    
    return data
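
The per-machine grouping can also be written without one list variable per machine. This is a sketch of an alternative, not the approach used in this notebook: `group_by_machine` is a hypothetical helper, and the in-memory sample stands in for the real CSV file. Unlike `read_csv`, it returns one sublist per machine ID actually present in the data, rather than always five.

```python
import csv
import io
from collections import defaultdict

def group_by_machine(file_obj, key='machine'):
    """Group CSV rows by the integer value of the `key` column."""
    groups = defaultdict(list)
    for row in csv.DictReader(file_obj):
        groups[int(row[key])].append(row)
    # One sublist per machine, ordered by machine ID
    return [groups[machine_id] for machine_id in sorted(groups)]

# Hypothetical two-column sample standing in for the real data file
sample = io.StringIO("sequence,machine\n1,4\n1,5\n2,4\n")
grouped = group_by_machine(sample)
print([len(g) for g in grouped])
```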

## Reformat data columns and timestamp <a class="anchor" name="Reformat"></a>

In [5]:
# Cast each column to its proper data type and add the current timestamp to each data record
def add_timestamp(data_rows,ts):
    # for each machine, a new sublist will be used
    data = []
    for row in data_rows:
        row['ts'] = ts
        row['sequence'] = int(row['sequence'])
        row['machine'] = int(row['machine'])       
        row['PID'] = int(row['PID'])
        row['TRUN'] = int(row['TRUN'])
        row['TSLPI'] = int(row['TSLPI'])
        row['TSLPU'] = int(row['TSLPU'])
        row['POLI'] = str(row['POLI'])
        row['NICE'] = int(row['NICE'])
        row['PRI'] = int(row['PRI'])
        row['RTPR'] = int(row['RTPR'])
        row['CPUNR'] = int(row['CPUNR'])
        row['Status'] = str(row['Status'])
        row['EXC'] = int(row['EXC'])
        row['State'] = str(row['State'])
        row['CPU'] = float(row['CPU'])
        row['CMD'] = str(row['CMD'])
        
        # append dictionary records with each key in the data records to a list
        data.append({k: row[k] for k in row.keys()})

    return data                                      
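
The column casts above can also be expressed as a mapping from column name to converter. The sketch below assumes the same sixteen columns; `SCHEMA` and `convert_row` are hypothetical names, and the sample values are made up for illustration. Unlike `add_timestamp`, it builds a new dictionary and leaves the input row untouched.

```python
# Column name -> converter; mirrors the casts in add_timestamp
SCHEMA = {
    'sequence': int, 'machine': int, 'PID': int, 'TRUN': int,
    'TSLPI': int, 'TSLPU': int, 'POLI': str, 'NICE': int,
    'PRI': int, 'RTPR': int, 'CPUNR': int, 'Status': str,
    'EXC': int, 'State': str, 'CPU': float, 'CMD': str,
}

def convert_row(row, ts, schema=SCHEMA):
    """Return a new dict with typed values and a 'ts' field."""
    converted = {col: cast(row[col]) for col, cast in schema.items()}
    converted['ts'] = ts
    return converted

# Hypothetical raw record as csv.DictReader yields it (all values are strings)
raw = {
    'sequence': '1', 'machine': '4', 'PID': '1234', 'TRUN': '0',
    'TSLPI': '1', 'TSLPU': '0', 'POLI': 'norm', 'NICE': '0',
    'PRI': '120', 'RTPR': '0', 'CPUNR': '2', 'Status': '-',
    'EXC': '0', 'State': 'S', 'CPU': '0.5', 'CMD': 'bash',
}
typed = convert_row(raw, ts='1600000000')
print(typed['PID'], typed['CPU'], typed['ts'])
```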

## Send data records to Kafka Topic <a class="anchor" name="Publish"></a>

In [6]:
# Publish data to kafka topic using producer instance 
def publish_message(producer_instance, topic_name, data):
    try:
        producer_instance.send(topic_name, data)
        # print('Message published successfully. Data: ' + str(data))
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))
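
`send()` in kafka-python is asynchronous and returns a future, so a delivery failure may not surface inside the `try` block above. The following is a sketch of a blocking variant, assuming a kafka-python producer; `publish_and_confirm` and the 10-second default timeout are hypothetical choices, not part of the assignment code.

```python
def publish_and_confirm(producer_instance, topic_name, data, timeout=10):
    """Send one message and wait until the broker acknowledges it.

    Returns the record metadata on success, or None on failure.
    """
    try:
        future = producer_instance.send(topic_name, data)
        # Block until the send completes; raises if delivery failed or timed out
        return future.get(timeout=timeout)
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))
        return None
```

Waiting on every message lowers throughput, which is acceptable when only one message is published every few seconds.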

## Setup Kafka Producer and read data  <a class="anchor" name="Start"></a>

In [7]:
# Initialize Kafka producer
producer = connect_kafka_producer()
# Read data file to generate stream of records
csv_data = read_csv(filename)

## Initialize Streaming data to Kafka Topic   <a class="anchor" name="streaming"></a>

Each message published to the topic is a list containing one sublist of records per machine, in the following format:

    [
    [{sequence:1, machine:4,..},{sequence:2, machine:4,..},..],
    [{sequence:1, machine:5,..},{sequence:2, machine:5,..},..],
    [{sequence:1, machine:6,..},{sequence:2, machine:6,..},..],
    [{sequence:1, machine:7,..},{sequence:2, machine:7,..},..],
    [{sequence:1, machine:8,..},{sequence:2, machine:8,..},..]
    ]

In [None]:
# generate a stream of data indefinitely
while True:
    # print('Publishing records..')
    publish_data = []
    # Current Unix epoch time in seconds, as a string (10 digits)
    ts = str(int(dt.now().timestamp()))
    # In one cycle, publish a list of sublists, each holding a random number of consecutive records for one machine
    for i in range(5):
        # random count of records for each machine
        x = random.randint(10,50)
        # select first x records from csv data segregated by machine    
        data_rows = csv_data[i][:x]
        # delete the selected records
        del csv_data[i][:x]
        # append the selected records to the end of the list so the data is never exhausted
        csv_data[i] += data_rows
        
        # append this machine's timestamped records to the main list of records to be published
        publish_data.append(add_timestamp(data_rows,ts))
    # Publish records for all machines to the Kafka topic
    publish_message(producer, topic, publish_data)
    # Wait 5 seconds before starting the next cycle
    sleep(5)
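
The loop above never runs out of data because each selected batch is moved from the front of its machine list to the back. A minimal sketch of that rotation step on its own, using a hypothetical helper `take_and_rotate` and a list of integers in place of real records:

```python
def take_and_rotate(rows, count):
    """Return the first `count` rows and move them to the end of `rows` in place."""
    batch = rows[:count]
    del rows[:count]
    rows += batch
    return batch

# Integers stand in for process records here
records = [1, 2, 3, 4, 5, 6, 7]
first = take_and_rotate(records, 3)
print(first)    # [1, 2, 3]
print(records)  # [4, 5, 6, 7, 1, 2, 3]
```

If `count` exceeds the length of the list, the whole list is returned and its order is unchanged.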