### Demo: Generate Non Graph Features
- This notebook generates the local history and local popularity files, which are updated on a daily basis.
- The notebook also shows how we generate features that do not rely on the graph database, for example periodicity features, popularity features, and FQDN semantic features.
- **Note** Run this notebook before running `1_genfeats_graph.ipynb`, as the graph data depends on some files generated by this notebook.

In [1]:
import pandas as pd
import numpy as np
import datetime
import os

### Historical and Temporal features
This notebook shows how we generate non-graph-based features, e.g. historical features and temporal features.
- Due to restrictions, we cannot share our original data. Hence, we provide a dummy dataset:
    - `dummy100.csv`: we sampled 100 rows of non-user-sensitive periodic FQDNs from 2021-03-30. We carefully examined every host name to ensure there's no privacy leakage. We restricted FQDNs to domains that have periodic activity because non-periodic domains are filtered out before feature generation in the daily pipeline.
    - The file contains the host name, server IP and port extracted from Zeek logs. Client IPs are faked to avoid privacy leakage.
    - `periodic100.parquet`: the corresponding periodicity detection results of the 100 servers.
    - `hist100_0329.csv`: the history file we got from our pipeline on 2021-03-29.
    - `cisco_top1m.csv`: the Cisco top 1 million list we pulled on 2021-03-30.

In [2]:
# Read the dummy data files described above
host100 = pd.read_csv("dummydata/dummy100.csv")
per100 = pd.read_parquet("dummydata/periodic100.parquet")
cisco_1m = pd.read_csv("dummydata/cisco_top1m.csv", names=["host", "rank"])
hist100_0329 = pd.read_csv("dummydata/hist100_0329.csv") 

# Set the log day: 2021-03-30.
# This matters in our deployment, but less so in this demo,
# because in deployment the pipeline runs on a daily basis
# and the features for Day N sometimes depend on the results of Day N-1.
logday = "2021-03-30"

### Generate local historical and popularity file
- We first generate local historical and popularity files.

In [3]:
from src.dom_history import gen_domain_history
from src.dom_popularity import gen_popular_host

In [4]:
"""
Generate the local history file.
hist_df: history file from Day N-1
zeek_df: daily logs (only the host column is needed)
compute_info: dict holding the log day under the key "start_dt"
"""
hist100 = gen_domain_history(hist_df=hist100_0329, zeek_df=host100[["host"]], 
                             compute_info={"start_dt":datetime.datetime(2021,3,30)})
hist100.to_parquet("dummydata/hist100.parquet", index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  zeek_df["temp"] = 1


In [5]:
# The history file records the first-seen and last-seen dates of each domain name.
# Domains that have not been visited for 30 days are removed from the history file.
hist100.head(2)

Unnamed: 0,firstseen_date,lastseen_date,firstseen_log_type,lastseen_log_type,days_since_lastseen,days_since_firstseen,count_since_firstseen,isIP,host
0,2020-05-02,2021-03-30,HTTPSSL,HTTPSSL,0,332,331.0,1.0,104.192.108.134
1,2020-05-02,2021-03-30,HTTPSSL,HTTPSSL,0,332,331.0,1.0,110.43.89.12
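
The daily update described above (carry over the Day N-1 history, refresh the last-seen date for hosts seen today, add new hosts, and drop hosts that have been idle for 30 days) can be sketched roughly as follows. This is not the code of `gen_domain_history`: the function name `update_history`, the parameter `max_idle_days`, and the exact handling of the 30-day boundary are assumptions made for illustration. Only the column names are taken from the output above.

```python
import pandas as pd

def update_history(hist_df, today_hosts, logday, max_idle_days=30):
    """Hypothetical sketch of a daily first-seen / last-seen update."""
    logday = pd.Timestamp(logday)
    hist = hist_df[["host", "firstseen_date", "lastseen_date"]].copy()
    hist["firstseen_date"] = pd.to_datetime(hist["firstseen_date"])
    hist["lastseen_date"] = pd.to_datetime(hist["lastseen_date"])

    # One row per host observed today.
    seen_today = pd.DataFrame({"host": pd.Series(list(today_hosts)).unique()})
    merged = hist.merge(seen_today, on="host", how="outer", indicator=True)

    # Hosts observed today get today's date as their last-seen date;
    # hosts that are new today also get it as their first-seen date.
    today_mask = merged["_merge"] != "left_only"
    merged.loc[today_mask, "lastseen_date"] = logday
    merged["firstseen_date"] = merged["firstseen_date"].fillna(logday)

    merged["days_since_lastseen"] = (logday - merged["lastseen_date"]).dt.days
    merged["days_since_firstseen"] = (logday - merged["firstseen_date"]).dt.days

    # Assumption: a host is dropped once it has been idle for max_idle_days or more.
    keep = merged["days_since_lastseen"] < max_idle_days
    return merged.loc[keep].drop(columns="_merge").reset_index(drop=True)
```

With a first-seen date of 2020-05-02 and a log day of 2021-03-30, this sketch yields `days_since_firstseen = 332`, which agrees with the rows shown above.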


In [6]:
"""
Local popularity data is also generated on a daily basis.
Local popularity is defined as:
(number of unique client IPs visiting the server) / (total number of client IPs observed on campus that day)
zeek_df: daily logs
"""
poplocal = gen_popular_host(zeek_df=host100)
poplocal.to_parquet("dummydata/popularity100.parquet", index=False)

In [7]:
poplocal.head(2)

Unnamed: 0,host,fqdn_popularity
0,104.104.90.50,0.333333
1,104.192.108.134,0.333333
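
The definition of local popularity given above can be illustrated with a short pandas sketch. It is not the implementation of `gen_popular_host`: the function name `local_popularity` and the client column name `client_ip` are assumptions, since the column names of the dummy log other than `host` are not shown here.

```python
import pandas as pd

def local_popularity(zeek_df, host_col="host", client_col="client_ip"):
    """Hypothetical sketch: unique clients per host / unique clients overall."""
    # Denominator: every distinct client IP observed in the day's logs.
    total_clients = zeek_df[client_col].nunique()
    # Numerator: distinct client IPs that contacted each host.
    clients_per_host = zeek_df.groupby(host_col)[client_col].nunique()
    popularity = (clients_per_host / total_clients).rename("fqdn_popularity")
    return popularity.reset_index()

# Toy log with three clients (all values are made up for illustration).
toy_log = pd.DataFrame({
    "host": ["a.example", "a.example", "b.example", "c.example", "c.example"],
    "client_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
})
toy_pop = local_popularity(toy_log)
```

In the toy log, `a.example` is contacted by one of three clients, so its popularity is 1/3. Repeated connections from the same client do not increase the score.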


#### Historical features
- The code below generates historical features that do not rely on the graph database.
- The features describe how frequently each domain is visited in the campus network.

In [8]:
from src.hist_feats import gen_history_score

In [9]:
"""
logday: date when the log is collected
datafpath: path to the log file
histfpath: path to the history file
savefpath: path to save the generated features
"""

logday = "2021-03-30"
featsdir = "dummydata/features/{}".format(logday)
histfeats = gen_history_score(logday="2021-03-30", datafpath="dummydata/periodic100.parquet",
                             histfpath="dummydata/hist100.parquet", 
                             savefpath=os.path.join(featsdir, "features_hist.parquet"))

[Info] Raw Data shape: (100, 1)
[Info] History Data shape: (100, 9)
[Info] Features Shape (100, 3)
[Info] History Features Saved to: dummydata/features/2021-03-30/features_hist.parquet


In [10]:
histfeats.sort_values(by=["occ"], ascending=False).head(5)

Unnamed: 0,host,freq,occ
27,st.dabaraw.com,0.233333,0.125
13,sgminorshort.wechat.com,0.566667,0.037037
43,wx.huion.cn,1.633333,0.017544
51,todolist.redirect.xzdesktop.cqttech.com,1.8,0.017544
93,uploads.engagephd.com,2.5,0.013333


#### Temporal features
- The code blocks below generate features based on periodicity.
    - For each periodicity, we measure its popularity based on the Cisco and local popularity data.
    - For each periodicity, we measure its maliciousness based on a historical file that logs the observed malicious domains. This file is updated daily based on feedback/queries from the active-learning pipeline.

In [11]:
from src.temporal_feats import gen_popularity_score, gen_hist_malscore

In [12]:
"""
The function below generates popularity features tied to periodicity.
To do so, it reads the periodicity detection data.
Note that aperiodic servers are filtered out in the pipeline,
because our goal is to detect malicious beaconing (periodic) activity.

logday: date when the log is collected
perfpath: path to the periodicity detection file
popularityfpath: path to the local popularity file
ciscofpath: path to the global popularity file
savefpath: path to save the generated feature
"""

tempfeats = gen_popularity_score(logday=logday, perfpath="dummydata/periodic100.parquet", 
                     popularityfpath="dummydata/popularity100.parquet",
                     ciscofpath="dummydata/cisco_top1m.csv",
                     savefpath=os.path.join(featsdir, "features_per.parquet"))

[Info] Raw Data shape: (100, 2)
[Info] Features Shape (100, 16)
[Info] Popularity Features Saved to: dummydata/features/2021-03-30/features_per.parquet


In [13]:
tempfeats.head(5)

Unnamed: 0,host,mean_fqdn_period,max_fqdn_period,min_fqdn_period,std_fqdn_period,min_per,max_per,std_per,mean_per,cisco_min_period,cisco_max_period,cisco_mean_period,cisco_median_period,cisco_ratio_period,cisco_score,fqdn_popularity
0,104.104.90.50,12.0,12,12,0.0,10.0,10.0,0.0,10.0,0.0,0.98873,0.638616,0.809256,0.75,0.0,0.333333
1,104.192.108.134,2.0,2,2,0.0,80.0,80.0,0.0,80.0,0.0,0.978719,0.48936,0.48936,0.5,0.0,0.333333
2,110.43.89.12,11.0,11,11,0.0,240.0,240.0,0.0,240.0,0.0,0.990054,0.669006,0.840967,0.727273,0.0,0.333333
3,15dfjkbvdf.club,2.0,2,2,0.0,53.0,53.0,0.0,53.0,0.690024,0.867957,0.77899,0.77899,1.0,0.690024,0.333333
4,203.205.219.244,3.5,6,1,2.081666,72.0,160.0,39.084311,103.25,0.0,0.99267,0.545816,0.876262,0.571429,0.0,0.333333
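
The `min_per`, `max_per`, `mean_per` and `std_per` columns above read as summary statistics over the periods detected for each host. A minimal sketch of such an aggregation is given below. It is not taken from `gen_popularity_score`: the function name `summarize_periods`, the rounding to integer periods, the use of the sample standard deviation, and the choice of 0.0 for a single period are all assumptions, and the input values are made up.

```python
import numpy as np
import pandas as pd

def summarize_periods(periods):
    """Hypothetical sketch: summary statistics over one host's detected periods."""
    # Assumption: periods are rounded to integers and deduplicated first.
    unique_periods = pd.Series(np.unique(np.rint(periods)), dtype="float64")
    # pandas uses the sample standard deviation (ddof=1), which is NaN
    # for a single value; assume 0.0 is reported in that case.
    std = unique_periods.std()
    return {
        "min_per": float(unique_periods.min()),
        "max_per": float(unique_periods.max()),
        "mean_per": float(unique_periods.mean()),
        "std_per": 0.0 if np.isnan(std) else float(std),
    }

single = summarize_periods([10.2])
several = summarize_periods([59.9, 60.1, 120.0])
```

Here `59.9` and `60.1` both round to `60`, so the second host is summarized over the two periods `60` and `120`.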


In [14]:
"""
The function below generates malicious-score features tied to periodicity.
Internally, the function reads 'dummydata/malicious_hist.csv',
a file that keeps a record of the observed suspicious domains.
To generate Day N's features, we use the file from Day N-1.

logday: date when the log is collected
perfpath: path to the periodicity detection file
savefpath: path to save the generated feature
"""
hist_malfeats = gen_hist_malscore(logday=logday, perfpath="dummydata/periodic100.parquet",
                                 savefpath=os.path.join(featsdir, "features_histmal.parquet"))

[Info] Raw Data shape: (100, 4)
[Info] Features Shape (100, 6)
[Info] hist Mal Features Save Data to: dummydata/features/2021-03-30/features_histmal.parquet


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["periodicities"] = df["true_periods"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["integer_pers"] = df["periodicities"].apply(lambda x: np.unique([np.rint(i) for i in x]))


#### FQDN features
The code below generates FQDN-based semantic features, e.g. entropy and domain length.

In [15]:
from src.fqdn_feats import gen_fqdn_features

In [16]:
"""
The function below generates FQDN-related semantic features, e.g. entropy and domain level.

logday: date when the log is collected
perfpath: path to the periodicity detection file
savefpath: path to save the generated feature
"""

fqdn_feats = gen_fqdn_features(logday, perfpath = "dummydata/periodic100.parquet",
                               savefpath=os.path.join(featsdir, "features_fqdn.parquet"))

[Info] Raw Data shape: (100, 2)
[Info] Features Shape (100, 12)
[Info] Popularity Features Saved to: dummydata/features/2021-03-30/features_fqdn.parquet


In [17]:
fqdn_feats

Unnamed: 0,host,psd_ratio,dom_illegal,dom_sld_entropy,subdom_entropy,dom_entropy,fqdn_entropy,dom_tldcnt,dom_sldcnt,dom_subcnt,dom_level,dom_length
16,www.horosproject.org,11.022869,0,3.022055,-0.000000,3.202820,3.346439,0,0,1,1,20
21,kindle-time.amazon.com,11.211672,0,2.251629,3.095795,2.721928,3.697846,0,0,1,1,22
36,weather.service.msn.com,1.610196,0,1.584963,3.240224,2.521641,3.621176,0,0,2,2,23
43,pico.eset.com,5.632438,0,1.500000,2.000000,2.750000,3.085055,0,0,1,1,13
55,qbwup.imtt.qq.com,6.281895,0,-0.000000,3.121928,2.251629,3.292770,0,0,2,2,17
...,...,...,...,...,...,...,...,...,...,...,...,...
777,ts.minipage.2345.cc,1.408176,0,2.000000,3.277613,2.521641,3.787144,0,0,2,2,19
778,uapi.mp.360.cn,2.267621,0,1.584963,2.521641,2.584963,3.324863,0,0,2,2,14
779,upgrade.actiontec.com,2.619776,0,2.725481,2.807355,3.026987,3.689704,0,0,1,1,21
781,tracker.lelux.fi,1.156560,0,1.921928,2.521641,2.750000,3.500000,0,0,1,1,16
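
To make the entropy columns concrete, here is a minimal sketch of character-level Shannon entropy (base 2) applied to the parts of an FQDN. It is not the code of `gen_fqdn_features`: the helper names are hypothetical, and the naive split assumes a single-label TLD (a real pipeline would likely need a public suffix list for names such as `example.co.uk`). The other columns (`psd_ratio`, `dom_illegal`, the count and level columns) are not covered.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

def split_fqdn(fqdn):
    """Naive split into (subdomain, SLD, TLD). Assumes a single-label TLD."""
    labels = fqdn.lower().strip(".").split(".")
    tld = labels[-1]
    sld = labels[-2] if len(labels) >= 2 else ""
    sub = ".".join(labels[:-2])
    return sub, sld, tld

def fqdn_semantic_features(fqdn):
    """Hypothetical sketch of a few entropy and length features."""
    sub, sld, tld = split_fqdn(fqdn)
    domain = f"{sld}.{tld}" if sld else tld
    return {
        "dom_sld_entropy": shannon_entropy(sld),
        "subdom_entropy": shannon_entropy(sub),
        "dom_entropy": shannon_entropy(domain),
        "fqdn_entropy": shannon_entropy(fqdn),
        "dom_length": len(fqdn),
    }

features = fqdn_semantic_features("pico.eset.com")
```

For `pico.eset.com` this sketch gives `dom_sld_entropy = 1.5`, `subdom_entropy = 2.0`, `dom_entropy = 2.75`, `fqdn_entropy ≈ 3.085055` and `dom_length = 13`, which agrees with the corresponding row in the table above.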
