 # Finding Periodic Process Connectivity With Poisson Logic
 
 This notebook builds on logic published by Elastic to find periodic (beaconing) processes.
 TL;DR: when events occur independently of one another at a constant average rate, the number of events
 per fixed time window follows a Poisson distribution.  Events that fire on a regular, metronomic schedule
 are not independent, so their counts are no longer Poisson-distributed.  We can use this to identify
 processes that open network connections on a metronomic basis, and therefore look more like beacons
 than benign connectivity.
 See https://www.elastic.co/blog/identifying-beaconing-malware-using-elastic for the full writeup.
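
 As a rough illustration of that idea, the sketch below compares per-window counts from a simulated Poisson process with counts from a simulated, slightly jittered beacon.  The rate (5 events per window), the 12-second beacon interval, the 1-second jitter, and the 60-second window are illustrative assumptions, not values from the Elastic writeup.  For Poisson counts the variance equals the mean, so the variance-to-mean ratio should sit near 1; for the beacon it should be much smaller.

```python
import numpy as np

def dispersion(counts):
    """Return (CV, RV) for an array of per-window event counts."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    std = counts.std(ddof=1)  # sample standard deviation, matching pandas' default
    return std / mean, std ** 2 / mean

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
windows = 1000

# Independent arrivals: counts per window drawn from a Poisson distribution (mean 5).
poisson_counts = rng.poisson(lam=5, size=windows)

# Beacon-like arrivals: one event every 12 s with up to 1 s of jitter either way,
# binned into 60 s windows.
event_times = np.arange(windows * 5) * 12.0 + rng.uniform(-1.0, 1.0, size=windows * 5)
beacon_counts = np.histogram(event_times, bins=np.arange(0, windows * 60 + 1, 60))[0]

poisson_cv, poisson_rv = dispersion(poisson_counts)
beacon_cv, beacon_rv = dispersion(beacon_counts)
print(f"Poisson: CV={poisson_cv:.2f} RV={poisson_rv:.2f}")
print(f"Beacon:  CV={beacon_cv:.2f} RV={beacon_rv:.2f}")
```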

 For this routine, we needed to convert Elastic's logic to Python and collect the right data.
 In lieu of Netflow data, this notebook uses Sysmon Event ID 3 (Network Connection).  That event identifies
 the machine and process that opened the connection, but not bytes in or bytes out.  So, this logic identifies
 periodicity using only the process and the timestamp.
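
 To make the "process and timestamp only" approach concrete, here is a minimal sketch that bins a handful of connection events into 60-second windows per computer and process.  The host name, process names, and timestamps are made up for illustration; only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical Sysmon EID3-style records: only host, process, and timestamp are used.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 00:00:05", "2022-01-01 00:01:04", "2022-01-01 00:02:06",
        "2022-01-01 00:00:10", "2022-01-01 00:00:20", "2022-01-01 00:02:50",
    ]),
    "computername": ["host1"] * 6,
    "process": ["beacon.exe"] * 3 + ["browser.exe"] * 3,
})

# Count connections per 60-second window for each (computer, process) pair.
# Windows with no connections are kept and counted as 0.
counts = (
    events.set_index("timestamp")
    .groupby(["computername", "process"])
    .resample("60s")
    .size()
)
print(counts)
```

 The per-window counts, not the raw timestamps, are what the statistics later in the notebook are computed on.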
 
Below are the Elastic-like searches that generate the data for this notebook.  They are derived from the Elastic logic at https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1.  The exclusion regex was too long for a single search, so it is split across two searches.  Run both and concatenate the results into a single CSV.  If the output is too large, you may also need to restrict the SourceIp.

event_id:3 (NOT (DestinationIp:10.0.0.0/8)) (NOT (Image:/.*(AddressBookSourceSync|Adobe_CCX|Adobe |AdobeCollab|accountsd|akd|apsd|atmgr|assistantd|backgroundTaskHost|BackgroundTransferHost|Brave Browser Helper|CalendarAgent|CCXProcess|chrome|Code Helper|CompatTelRunner|commerce|Core Sync|default-browser-agent|DeliveryService|DeviceCensus|Docker|Dropbox|Dsapi|elastic-|esensor|EXCEL|explorer|filebeat|FileCoAuth|firefox|GitHub Desktop Helper|Google Chrome Helper|google_guest_agent|GCEWindowsAgent|Google Drive|GoogleUpdate|IMRemoteURLConnectionAgent|keybase|ksfetch).*/)) TABLE @timestamp,SourceHostname,User,Image,SourceIp  

AND 

event_id:3 (NOT (DestinationIp:10.0.0.0/8)) (NOT (Image:/.*(mcautoreg|metricbeat|mdmclient|Mail|MMSSHOST|Microsoft Excel|Microsoft OneNote|Microsoft PowerPoint|Microsoft Teams|Microsoft Update|Microsoft Word|ModuleCoreService|msedge|node|nsurlsessiond|OfficeC2RClient|ONENOTE|officesvcmgr|OfficeClickToRun|OneDrive|parsec|pingsender|SDXHelper|SearchApp|ServiceLayer|Skype for Business|Slack|smartscreen|softwareupdated|Spotify|ssm-|syspolicyd|SystemIdleCheck|taskhostw|Teams|trustd|updater|WINWORD|WhatsApp Helper|xpcproxy|Zoom).*/)) TABLE @timestamp,SourceHostname,User,Image,SourceIp 
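
One way to concatenate the two exports is sketched below.  It assumes both searches were exported as CSV files with identical columns; the function name `combine_exports` and the file names in the usage note are hypothetical.

```python
import pandas as pd

def combine_exports(paths, out_path):
    """Concatenate several CSV exports with identical columns into one CSV."""
    frames = [pd.read_csv(path) for path in paths]
    # Stack the exports and drop rows that appear more than once.
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    combined.to_csv(out_path, index=False)
    return combined
```

For example, `combine_exports(["search1.csv", "search2.csv"], "combined.csv")` would write a single file that can be passed to the ingest cell below.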

In [None]:
# Import things.

import plotly.express as px
import pandas as pd
import sys
import numpy as np
import os
from scipy.stats import poisson
pd.options.display.html.use_mathjax = False
from tqdm import tqdm
tqdm.pandas()

In [None]:
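# Ingest the data you just generated.
# Note: the "ANSI" codec alias is only available on Windows; on other platforms,
# substitute the encoding your export actually uses (for example "cp1252" or "utf-8").
file = input("Enter the location of a CSV file:")
df = pd.read_csv(file, encoding="ANSI", header=0, parse_dates=["@timestamp"])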
df = df.dropna(axis=0)
df.columns = ['timestamp', 'computername', 'user', 'process', 'sourceIP']
df.drop_duplicates(inplace=True)
df.index = df['timestamp']
print('Ingested ' + str(df.shape[0]) + ' lines of data')

In [None]:
# CONSTANTS

# Set some defaults.  Tweak em later if needed.

# Experimental data shows that for a non-Poisson process (in other words, something more "beacon-y"), when the sampling
# period equals the beacon interval, the RV approaches 0.08 and the CV approaches 0.2.  Search for this, over various 
# sampling periods.  Adjust targets as needed.
cvtarget = 0.2
cvrange = 0.65 * cvtarget
rvtarget = 0.08
rvrange = 0.5 * rvtarget
samplingperiods = ['30s', '60s', '120s', '300s', '600s', '1200s', '1800s', '3600s', '14400s'] 
minimum_executions = 20
outfile = 'c:\\hunting\\poisson\\identifiedbeacons.csv'

In [None]:
# Remove selected Images (executables), based on Elastic's config at https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1
stufftodrop = "(AddressBookSourceSync|Adobe_CCX|Adobe |AdobeCollab|accountsd|akd|apsd|atmgr|assistantd|backgroundTaskHost|"
stufftodrop += "BackgroundTransferHost|Brave Browser Helper|CalendarAgent|CCXProcess|chrome|Code Helper|CompatTelRunner|"
stufftodrop += "commerce|Core Sync|default-browser-agent|DeliveryService|DeviceCensus|Docker|Dropbox|Dsapi|elastic-|esensor|"
stufftodrop += "EXCEL|explorer|filebeat|FileCoAuth|firefox|GitHub Desktop Helper|Google Chrome Helper|google_guest_agent|"
stufftodrop += "GCEWindowsAgent|Google Drive|GoogleUpdate|IMRemoteURLConnectionAgent|keybase|ksfetch|mcautoreg|metricbeat|"
stufftodrop += "mdmclient|Mail|MMSSHOST|Microsoft Excel|Microsoft OneNote|Microsoft PowerPoint|Microsoft Teams|"
stufftodrop += "Microsoft Update|Microsoft Word|ModuleCoreService|msedge|node|nsurlsessiond|OfficeC2RClient|ONENOTE|"
stufftodrop += "officesvcmgr|OfficeClickToRun|OneDrive|parsec|pingsender|SDXHelper|SearchApp|ServiceLayer|Skype for Business|"
stufftodrop += "Slack|smartscreen|softwareupdated|Spotify|ssm-|syspolicyd|SystemIdleCheck|taskhostw|Teams|trustd|"
stufftodrop += "updater|WebSense Endpoint|WINWORD|WhatsApp Helper|xpcproxy|Zoom)"
rowstoremove = df['process'].str.contains(stufftodrop)
df = df[~rowstoremove]
print('Reduced to ' + str(df.shape[0]) + ' lines of data')

In [None]:
# For each sampling period: resample the data, and find the count, stdev, mean, variance, coefficient of variation,
# and relative variance of the number of network connections per sampling window, per computer, per process.

# This logic uses 2 consecutive groupby operations, resulting in a multi-index dataframe by computername and by process.
# It then computes aggregate statistics on the multi-indexed data.

# Delete the output file, if it exists
if os.path.exists(outfile):
    os.remove(outfile)
    
# Start with an empty dataframe
found = pd.DataFrame()    
    
for period in tqdm(samplingperiods):
       
    # Create a multi-indexed table from our data, by computername, then by process, then by timestamp
    grouped = df.groupby(['computername', 'process']).resample(period).agg({'timestamp':'size', 'user':'first', 'sourceIP':'first'}).fillna(0)    
    
    # Per computer, per process, find the total (sum), standard deviation, mean, and
    # variance of the per-window connection counts
    subgroup = grouped.groupby(['computername', 'process']).agg({'timestamp':['sum','std','mean','var'], 'user':'first', 'sourceIP':'first'}).fillna(0)
    
    # Find the CV and RV of the per-window connection counts, per machine, per process
    # Coefficient of variation ("CV") = sigma/mu = stdev/mean
    # Relative variance ("RV", also called the index of dispersion) = sigma^2/mu = stdev^2/mean
    subgroup['cv'] = subgroup['timestamp']['std']/subgroup['timestamp']['mean']
    subgroup['rv'] = pow(subgroup['timestamp']['std'], 2)/subgroup['timestamp']['mean']
    
    # Include the beacon interval in the output
    subgroup['beaconinterval'] = period
    
    # Now create a dataframe of ONLY rows that are in our target ranges as defined above.
    # Find entries that fall in the indicated CV/RV ranges, which are most likely to be "beacon-like" processes.
    # (Only include them if they executed a minimum number of times.)  Append to the list of "found" rows.
    newlyfound = (subgroup[(subgroup['rv'] > (rvtarget - rvrange)) & (subgroup['rv'] < (rvtarget + rvrange)) & 
          (subgroup['cv'] > (cvtarget - cvrange)) & (subgroup['cv'] < (cvtarget + cvrange)) & 
          ((subgroup['timestamp']['sum']) > minimum_executions)]) 
    # DataFrame.append() was removed in pandas 2.0, so accumulate with pd.concat() instead.
    found = newlyfound.copy() if found.empty else pd.concat([found, newlyfound])

def get_first_timestamp(x):
    # For each row of the "found" dataframe, grab the computer name and process name from the row's index.
    # Then select from the original df to find the earliest timestamp for that computer and process.
    compname, procname = x.name[0], x.name[1]
    return df[(df['computername'] == compname) & (df['process'] == procname)]['timestamp'].min()

if not found.empty:
    # Flatten the two-level column labels, e.g. ('timestamp', 'sum') -> 'timestamp_sum' and ('cv', '') -> 'cv'
    found.columns = ['_'.join(filter(None, col)) for col in found.columns]
    # Look up the first timestamp once, after all sampling periods have been processed
    found['firsttimestamp'] = found.apply(get_first_timestamp, axis=1)
    # Move computername/process out of the index, then rename columns by name rather than by position
    found = found.reset_index().rename(columns={
        'process': 'foundprocess', 'cv': 'coeffofvariance', 'rv': 'relativevariance',
        'sourceIP_first': 'sourceIP', 'timestamp_mean': 'mean', 'timestamp_std': 'std',
        'timestamp_sum': 'executioncount', 'timestamp_var': 'variance', 'user_first': 'user'})
    found = found[['computername', 'foundprocess', 'beaconinterval', 'coeffofvariance', 'firsttimestamp',
                   'relativevariance', 'sourceIP', 'mean', 'std', 'executioncount', 'variance', 'user']]
    found.to_csv(outfile, mode='a', index=False)
    
found