
## Requirements

To extract information that may help in labeling the data, I use a library from GitHub that parses user-agent strings:

ua-parser: https://github.com/ua-parser/uap-python 

The library is installed and imported in the following cells.


In [1]:
!pip install ua-parser

Collecting ua-parser
  Downloading ua_parser-0.10.0-py2.py3-none-any.whl (35 kB)
Installing collected packages: ua-parser
Successfully installed ua-parser-0.10.0


In [2]:
from ua_parser import user_agent_parser
from google.colab import drive
import pandas as pd

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


## Loading data

The data is loaded as-is: it is not normalized, and the percentage features are not calculated, because this section only needs the user agents.

In [4]:
data = pd.read_csv('/content/drive/MyDrive/sessions.csv')

In [5]:
data.head()

Unnamed: 0.1,Unnamed: 0,UserAgent,IP,FirstHitTime,TotalHits,Image,HTML,Api,ASCII,Other,Bandwith,SessionLength,AvgResponseTime,Errors,Get,Post,Head,Put,Options,IsRobotstxt
0,0,Googlebot-Image/1.0,207.213.193.143,2021-05-12 05:06:00+04:30,1172,1172,0,0,0,0,578464,1796,18.624573,0,1172,0,0,0,0,0
1,1,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,35.110.222.153,2021-05-12 05:06:00+04:30,15,6,6,0,3,0,780017,6,9.6,0,15,0,0,0,0,0
2,2,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,35.108.208.99,2021-05-12 05:06:00+04:30,45,25,11,3,6,0,1338640,27,88.266667,0,43,1,0,1,0,0
3,3,Go-http-client/2.0,36.67.23.210,2021-05-12 05:06:00+04:30,388,0,132,0,0,256,2503555,1796,15.969072,6,206,0,182,0,0,0
4,4,Go-http-client/2.0,76.212.164.3,2021-05-12 05:06:00+04:30,328,0,128,0,0,200,2118477,1796,15.121951,6,190,0,138,0,0,0


In [6]:
ua_list = data.UserAgent.tolist()

Checking some samples:

In [7]:
ua_list[:10]

['Googlebot-Image/1.0',
 'Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-J710GN Build/MMB29K) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile Safari/537.36',
 'Mozilla/5.0 (Linux; Android 6.0; CAM-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.88 Mobile Safari/537.36',
 'Go-http-client/2.0',
 'Go-http-client/2.0',
 'Go-http-client/2.0',
 'Mozilla/5.0 (Linux; Android 11; SAMSUNG SM-G975F) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/14.0 Chrome/87.0.4280.141 Mobile Safari/537.36',
 'Mozilla/5.0 (Linux; Android 10; SM-J600F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.181 Mobile Safari/537.36',
 'Mozilla/5.0 (Linux; Android 9; G9) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Mobile Safari/537.36',
 'Go-http-client/2.0']

Testing the parser library on data:

In [7]:
parsed_string = user_agent_parser.Parse(ua_list[0])
parsed_string

{'device': {'brand': 'Spider', 'family': 'Spider', 'model': 'Desktop'},
 'os': {'family': 'Other',
  'major': None,
  'minor': None,
  'patch': None,
  'patch_minor': None},
 'string': 'Googlebot-Image/1.0',
 'user_agent': {'family': 'Googlebot-Image',
  'major': '1',
  'minor': '0',
  'patch': None}}

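The parser returns general information about the client: the device, the operating system, and the user-agent family and version. For this crawler, the device brand is reported as 'Spider'.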
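
The nested result can be awkward to inspect or to store as columns. As a minimal sketch, the helper below flattens it into single-level keys. `flatten_parsed` is a hypothetical name that is not part of ua-parser, and `parsed_example` is the dictionary printed above, copied by hand so the snippet runs without the library.

```python
def flatten_parsed(parsed: dict) -> dict:
    """Flatten a nested parser result into keys like 'device_brand'."""
    flat = {}
    for section, value in parsed.items():
        if isinstance(value, dict):
            # Nested sections ('device', 'os', 'user_agent') get prefixed keys.
            for field, field_value in value.items():
                flat[f"{section}_{field}"] = field_value
        else:
            # Top-level scalars (the raw 'string') are kept as they are.
            flat[section] = value
    return flat


# Copied from the output shown above for 'Googlebot-Image/1.0'.
parsed_example = {
    'device': {'brand': 'Spider', 'family': 'Spider', 'model': 'Desktop'},
    'os': {'family': 'Other', 'major': None, 'minor': None,
           'patch': None, 'patch_minor': None},
    'string': 'Googlebot-Image/1.0',
    'user_agent': {'family': 'Googlebot-Image', 'major': '1',
                   'minor': '0', 'patch': None},
}

flat = flatten_parsed(parsed_example)
print(flat['device_brand'], flat['user_agent_family'])
```

With the fields flattened, the `device_brand` value is the one the labeling rule relies on.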

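Defining a function that creates the labels: a session is labeled 1 if the parser reports the device brand as 'Spider' or if the user agent contains any of the given keywords, and 0 otherwise.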

In [8]:
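def user_agent_label(user_agent: str, keyword_list: list) -> int:
  """Return 1 (anomaly) for crawler-like user agents, 0 (normal) otherwise."""
  parsed_user_agent = user_agent_parser.Parse(user_agent)
  # Crawlers recognized by the parser have the device brand 'Spider'
  if parsed_user_agent['device']['brand'] == 'Spider':
    return 1
  # Otherwise, search the lowercased string for any keyword
  lowered = user_agent.lower()
  if any(word in lowered for word in keyword_list):
    return 1
  return 0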

In [9]:
keyword_list = ['bot', 'http', 'crawl', 'spider', 'python']  # lowercase keywords that indicate a crawler's user agent

Testing the function (0: normal session, 1: anomalous session):

In [10]:
user_agent_label('FreshpingBot/1.0 (+https://freshping.io/)' , keyword_list)

1
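
To see why a given user agent is flagged, it can help to list the keywords that matched rather than only the final label. The sketch below covers just the keyword branch of the rule (it does not call the parser), and `matched_keywords` is a hypothetical helper introduced for illustration.

```python
def matched_keywords(user_agent: str, keyword_list: list) -> list:
    """Return the keywords found in the lowercased user-agent string."""
    lowered = user_agent.lower()
    return [word for word in keyword_list if word in lowered]


keyword_list = ['bot', 'http', 'crawl', 'spider', 'python']

samples = [
    'FreshpingBot/1.0 (+https://freshping.io/)',
    'Go-http-client/2.0',
    'okhttp/3.12.1',
    'Mozilla/5.0 (Linux; Android 6.0; CAM-L21) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.88 Mobile Safari/537.36',
]

matches = {ua: matched_keywords(ua, keyword_list) for ua in samples}
for ua, words in matches.items():
    print(words, ua[:40])
```

Because the match is a plain substring test, `'http'` also fires on any user agent that embeds a URL or names an HTTP client library, which is worth keeping in mind when reading the labels.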

Applying the label generator function to our data:

In [11]:
data['weak_labels'] = data['UserAgent'].apply(lambda x:user_agent_label(x, keyword_list))
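
Many sessions share the same user-agent string, so the labeling function may be called repeatedly on identical input. A possible optimization, sketched here under that assumption, is to compute each label once per distinct value. `label_unique` is a hypothetical helper, and `keyword_only_label` is a simplified stand-in for `user_agent_label` so that the snippet runs without the parser.

```python
def label_unique(values, label_fn):
    """Apply label_fn once per distinct value and reuse the result."""
    cache = {}
    labels = []
    for value in values:
        if value not in cache:
            cache[value] = label_fn(value)
        labels.append(cache[value])
    return labels


calls = []  # records every real call, to show how many were needed


def keyword_only_label(user_agent: str) -> int:
    # Stand-in for user_agent_label: keyword branch only, no parser.
    calls.append(user_agent)
    keywords = ['bot', 'http', 'crawl', 'spider', 'python']
    return int(any(word in user_agent.lower() for word in keywords))


user_agents = [
    'Googlebot-Image/1.0',
    'Go-http-client/2.0',
    'Go-http-client/2.0',
    'Go-http-client/2.0',
]
labels = label_unique(user_agents, keyword_only_label)
print(labels, len(calls))
```

In the notebook, the same helper could be used as `data['weak_labels'] = label_unique(data['UserAgent'], lambda ua: user_agent_label(ua, keyword_list))`, which should give the same column as the `apply` call.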

In [12]:
data.head()

Unnamed: 0.1,Unnamed: 0,UserAgent,IP,FirstHitTime,TotalHits,Image,HTML,Api,ASCII,Other,Bandwith,SessionLength,AvgResponseTime,Errors,Get,Post,Head,Put,Options,IsRobotstxt,weak_labels
0,0,Googlebot-Image/1.0,207.213.193.143,2021-05-12 05:06:00+04:30,1172,1172,0,0,0,0,578464,1796,18.624573,0,1172,0,0,0,0,0,1
1,1,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,35.110.222.153,2021-05-12 05:06:00+04:30,15,6,6,0,3,0,780017,6,9.6,0,15,0,0,0,0,0,0
2,2,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,35.108.208.99,2021-05-12 05:06:00+04:30,45,25,11,3,6,0,1338640,27,88.266667,0,43,1,0,1,0,0,0
3,3,Go-http-client/2.0,36.67.23.210,2021-05-12 05:06:00+04:30,388,0,132,0,0,256,2503555,1796,15.969072,6,206,0,182,0,0,0,1
4,4,Go-http-client/2.0,76.212.164.3,2021-05-12 05:06:00+04:30,328,0,128,0,0,200,2118477,1796,15.121951,6,190,0,138,0,0,0,1


**Conclusion**: in the rows inspected here, the label generator flags most of the suspicious sessions, including both legitimate bots (e.g. Googlebot-Image) and unknown automated clients (e.g. Go-http-client).
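
One way to spot-check this is to look at which user agents end up with label 0. The sketch below counts the leading product token (the text before the first `/`) among unflagged sessions. The `(user agent, label)` pairs are copied from outputs shown in this notebook, and `product_token` is a hypothetical helper.

```python
from collections import Counter

# (user agent, weak label) pairs copied from outputs in this notebook.
labeled = [
    ('Googlebot-Image/1.0', 1),
    ('Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-J710GN Build/MMB29K) '
     'AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 '
     'Chrome/44.0.2403.133 Mobile Safari/537.36', 0),
    ('Mozilla/5.0 (Linux; Android 6.0; CAM-L21) AppleWebKit/537.36 '
     '(KHTML, like Gecko) Chrome/91.0.4472.88 Mobile Safari/537.36', 0),
    ('Go-http-client/2.0', 1),
    ('okhttp/3.12.1', 1),
    ('MoziIIa/5.0 (X11; Linux x86_64) app_version: 581', 0),
    ('MoziIIa/5.0 (X11; Linux x86_64)', 0),
]


def product_token(user_agent: str) -> str:
    """Return the text before the first '/' in the user-agent string."""
    return user_agent.split('/', 1)[0]


# Count product tokens among sessions that were NOT flagged.
unflagged = Counter(product_token(ua) for ua, label in labeled if label == 0)
print(unflagged)
```

Tokens that appear among the unflagged sessions but are not exactly `Mozilla` are candidates for manual review or for extending the keyword list.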

In [13]:
data.sample(20)

Unnamed: 0.1,Unnamed: 0,UserAgent,IP,FirstHitTime,TotalHits,Image,HTML,Api,ASCII,Other,Bandwith,SessionLength,AvgResponseTime,Errors,Get,Post,Head,Put,Options,IsRobotstxt,weak_labels
23328,23328,Mozilla/5.0 (Linux; Android 10; Redmi Note 8 P...,35.117.29.243,2021-05-12 10:31:37+04:30,1,1,0,0,0,0,81143,0,40.0,0,1,0,0,0,0,0,0
40138,40138,Mozilla/5.0 (Linux; Android 10; SM-A600FN) App...,217.14.104.125,2021-05-12 12:34:50+04:30,89,44,30,0,15,0,60071,627,7.05618,0,89,0,0,0,0,0,0
26285,26285,Mozilla/5.0 (Linux; Android 8.1.0; SM-J701F Bu...,92.239.251.116,2021-05-12 10:56:12+04:30,2,0,0,2,0,0,17170,277,138535.5,0,1,0,0,1,0,0,0
51042,51042,MoziIIa/5.0 (X11; Linux x86_64) app_version: 581,85.20.179.213,2021-05-12 13:19:27+04:30,2,0,1,1,0,0,572,0,32.0,0,1,0,0,1,0,0,0
44285,44285,okhttp/3.12.1,35.54.26.244,2021-05-12 12:54:21+04:30,1,0,0,1,0,0,35519,0,88062.0,0,1,0,0,0,0,0,1
67426,67426,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,20.62.177.39,2021-05-12 15:06:26+04:30,1,0,1,0,0,0,26241,0,232.0,0,1,0,0,0,0,0,1
53848,53848,Mozilla/5.0 (Linux; Android 8.1.0; SM-J710F) A...,155.114.166.153,2021-05-12 13:32:15+04:30,52,35,9,4,4,0,1500578,274,1578.288462,0,47,4,0,1,0,0,0
67348,67348,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,20.62.177.84,2021-05-12 15:05:50+04:30,1,0,1,0,0,0,30395,0,164.0,0,1,0,0,0,0,0,1
7090,7090,Mozilla/5.0 (Linux; U; Android 4.3; fa-ir; C20...,35.195.46.239,2021-05-12 07:10:41+04:30,1,1,0,0,0,0,69107,0,40.0,0,1,0,0,0,0,0,0
40008,40008,MoziIIa/5.0 (X11; Linux x86_64),35.75.132.1,2021-05-12 12:34:15+04:30,3,0,1,2,0,0,861,3,20.0,0,1,0,0,2,0,0,0
