# Project 2: Web Traffic Analysis
**This is the second of three mandatory projects to be handed in as part of the assessment for the course 02807 Computational Tools for Data Science at Technical University of Denmark, autumn 2019.**

#### Practical info
- **The project is to be done in groups of at most 3 students**
- **Each group has to hand in _one_ Jupyter notebook (this notebook) with their solution**
- **The hand-in of the notebook is due 2019-11-10, 23:59 on DTU Inside**

#### Your solution
- **Your solution should be in Python**
- **For each question you may use as many cells for your solution as you like**
- **You should document your solution and explain the choices you've made (for example by using multiple cells and use Markdown to assist the reader of the notebook)**
- **You should not remove the problem statements**
- **Your notebook should be runnable, i.e., clicking [>>] in Jupyter should generate the result that you want to be assessed**
- **You are not expected to use machine learning to solve any of the exercises**
- **You will be assessed according to correctness and readability of your code, choice of solution, choice of tools and libraries, and documentation of your solution**

## Introduction
In this project your task is to analyze a stream of log entries. A log entry consists of an [IP address](https://en.wikipedia.org/wiki/IP_address) and a [domain name](https://en.wikipedia.org/wiki/Domain_name). For example, a log line may look as follows:

`192.168.0.1 somedomain.dk`

One log line is the result of the event that the domain name was visited by someone having the corresponding IP address. Your task is to analyze the traffic on a number of domains. Counting the number of unique IPs seen on a domain doesn't correspond to the exact number of unique visitors, but it is a good estimate.

Specifically, you should answer the following questions from the stream of log entries.

- How many unique IPs are there in the stream?
- How many unique IPs are there for each domain?
- How many times was IP X seen on domain Y? (for some X and Y provided at run time)

**The answers to these questions can be approximate!**

You should also try to answer one or more of the following, more advanced, questions. The answers to these should also be approximate.

- How many unique IPs are there for the domains $d_1, d_2, \ldots$?
- How many times was IP X seen on domains $d_1, d_2, \ldots$?
- What are the X most frequent IPs in the stream?

You should use algorithms and data structures that you've learned about in the lectures, and you should provide your own implementations of these.

Furthermore, you are expected to:

- Document the accuracy of your answers when using algorithms that give approximate answers
- Argue why you are using certain parameters for your data structures

This notebook is in three parts. In the first part you are given an example of how to read from the stream (which for the purpose of this project is a remote file). In the second part you should implement the algorithms and data structures that you intend to use, and in the last part you should use these for analyzing the stream.

## Reading the stream
The following code reads a remote file line by line. It is wrapped in a generator to make it easier to extend. You may modify this if you want to, but your solution should remain parametrized, so that your notebook can be run without having to consume the entire file.

In [1]:
import urllib
import numpy as np
import mmh3
import math
def stream(n):
    i = 0
    with urllib.request.urlopen('https://files.dtu.dk/fss/public/link/public/stream/read/traffic?linkToken=3pLwj8eS8I_MkvCK&itemName=traffic') as f:
        for line in f:
            element = line.rstrip().decode("utf-8")
            yield element
            i += 1
            if i == n:
                break

In [2]:
STREAM_SIZE = 1000
web_traffic_stream = stream(STREAM_SIZE)

## Data structures

### How many unique IPs are there in the stream?

For this question, we can use a hyperloglog structure. This enables us to estimate the number of unique ip's using very little memory. The hyper-parameter, b, is chosen to reflect a certain wanted error rate given by approximately $\frac{1.04}{\sqrt{2^b}}$ using b = 10, thus gives fairly good results as can be seen.

 Alpha is chosen by using the recommendations in the paper.

In [3]:
def get_cardinality(bucket,b,alpha):
    sum_ = 0
    j = 0 
    for elements in bucket:
        if(elements == 0):
            j+=1
        sum_ += 2**(-1*elements)
        #else:

    card = (sum_**(-1))*((2**b)**2)*alpha
    if card > 2**32/30:
        #print("large range correction")
        #print(-2**32*np.log(1-card/(2**32)))
        return -2**32 * (np.log(1-card/(2**32)))
    elif card < 5/2 * 2 ** 32:
        if j==0:
            return card
        else:
            #print("{} buckets are zero".format(j) )
            return (2**b) * np.log(2**b/j)
    else:
        return card

In [4]:
def FirstOnePos(n): 
    count = 0
    for i in range(len(n)):
        if n[i] == '1':
            break;
    return i+1

In [5]:
def get_hyperlog_bucket(n,b):
    web_traffic_stream = stream(n)
    bucket_count = np.zeros(2**b)
    bucket = np.zeros(2**b)
    for i,l in enumerate(web_traffic_stream):
        #if i%2 ==0 : print(i)
        q = format(mmh3.hash(l.split("\t",1)[0],signed=False),'032b')
        q_b = q[0:b]
        m = FirstOnePos(q[b:])
        bucket[int(q_b,2)] = max(bucket[int(q_b,2)],m)
    return bucket

### How many times was IP X seen on domain Y? (for some X and Y provided at run time)
In order to find how many times an IP address was seen on a domain we used the CountMin Sketch data-structure. We used murmur3 to hash the IP addredd and domain name combinations. We tried out different values for the number of hash algorithms used (height of the matrix)(d) and for the width of the matrix(w). With the values d = 10 and w = 8000 we achieved good approximation.

In the implementation first we set the values for d and w. After this we intialize the matrix(d x w) with zeros.

The function element_arrives(element) increases the counters based in the positions based on the hashed values. For hashing we used the mmh and in order to apply different hash algorithms from the same family we were changing the seed values.

The function get_element_count(element) finds the minimum corresponding counter.

By calling the countIpOnDomain(x, y), where x is the IP address and y is the domain name, we can get an estimation for the count of the IP address-domain name combination in the stream.

In order to test the algorithm we read in a stream from file where we added multiple rows with the same ip-domain combination (40156 rows). The algorithm overestimated whatever the checked ip/domain was. When we tested the algorithm for an IP accured 156 times in the stream we got an estimated accurance 167, and when we tested for elements which accured 1 times we got 3,4,5 as accurance.

Because of collisions the result depends heavily on the parameters d and w we select.

In [9]:
#count_min_sketch
import mmh3
d = 10
w = 8000

M = [[0 for x in range(w)] for y in range(d)]


def element_arrives(element):
    for i in range(0, d):
        M[i][mmh3.hash(element, seed=i) % w] = M[i][mmh3.hash(element, seed=i) % w] + 1

def get_element_count(element):
    min = M[0][mmh3.hash(element, seed=0) % w]
    for i in range(1, d):
        actual_element = M[i][mmh3.hash(element, seed=i) % w]
        if min > actual_element:
            min = actual_element
    return min

def countIpOnDomain(x, y):
    return get_element_count(x + "\t" + y)

In the function generateCountMinSketch(n) we fetch n element from the stream and for every element we call the element_arrives() function.

In [10]:
def generateCountMinSketch(n):
    d = 10
    w = 8000

    M = [[0 for x in range(w)] for y in range(d)]
    i = 0
    with urllib.request.urlopen('https://files.dtu.dk/fss/public/link/public/stream/read/traffic?linkToken=3pLwj8eS8I_MkvCK&itemName=traffic') as f:
        for line in f:
            element = line.rstrip().decode("utf-8")
            element_arrives(element)
            i += 1
            if i == n:
                break

In [11]:
generateCountMinSketch(40000)

In [12]:
countIpOnDomain("124.81.124.112","wikipedia.org")

2

## How many unique IPs are there for each domain?

In [1]:
def bucket_intializer(bucket_dictionary,domain,b):
    bucket_dictionary[domain] = np.zeros(2**b)
def domain_unique_ip(bucket_dictionary,web_traffic_stream,b):
    for i,l in enumerate(web_traffic_stream):
        if l.split()[1] in bucket_dictionary.keys():
            #print("c")
            a =0
        else:
            #print("e")
            bucket_intializer(bucket_dictionary,l.split()[1],b)
        #print("a")
        q = format(mmh3.hash(l.split("\t",1)[0],signed=False),'032b')
        #print(q)
        q_b = q[0:b]
        m = FirstOnePos(q[b:])
        bucket_dictionary[l.split()[1]][int(q_b,2)] = max(bucket_dictionary[l.split()[1]][int(q_b,2)],m)
    
    

## What are the most X frequent IP's in the stream?

Our idea was to use a count-min sketch together with a heap that tracks the heavy hitter IP's.

In [14]:
d = 10
w = 20000

M = [[0 for x in range(w)] for y in range(d)]

def element_arrives(element):
    for i in range(0, d):
        M[i][mmh3.hash(element, seed=i) % w] = M[i][mmh3.hash(element, seed=i) % w] + 1

def get_element_count(element):
    min = M[0][mmh3.hash(element, seed=0) % w]
    for i in range(1, d):
        actual_element = M[i][mmh3.hash(element, seed=i) % w]
        if min > actual_element:
            min = actual_element
    return min

def generateCountMinSketch_heap(n,heap_len):
    heap = {}
    i = 0
    with urllib.request.urlopen('https://files.dtu.dk/fss/public/link/public/stream/read/traffic?linkToken=3pLwj8eS8I_MkvCK&itemName=traffic') as f:
        for line in f:
            element = line.rstrip().decode("utf-8").split("\t",1)[0]
            element_arrives(element)
            i += 1
            element_count = get_element_count(element)
            if element not in heap:
                if len(heap) < heap_len:
                    heap[element] = element_count
            for j in heap:
                heap[j] = get_element_count(j)
            min_key = min(heap, key=heap.get)
            if heap[min_key] < element_count:
                heap[element] = heap.pop(min_key)
                heap[element] = element_count
                
            if i == n:
                break
        return heap

# Analysis

## Hyperloglog for unique ip's in stream
Let us compare the actual unique ip's to the estimate by HLL.

For this question, we can use a hyperloglog structure. This enables us to estimate the number of unique ip's using very little memory. The hyper-parameter, b, is chosen to reflect a certain wanted error rate given by approximately $\frac{1.04}{\sqrt{2^b}}$ using b = 10, thus gives fairly good results as can be seen.

 Alpha is chosen by using the recommendations in the paper.


In [15]:
# Reference
my_dict = {}
i = 0
n = 50000
with urllib.request.urlopen('https://files.dtu.dk/fss/public/link/public/stream/read/traffic?linkToken=3pLwj8eS8I_MkvCK&itemName=traffic') as f:
    for line in f:
        element = line.rstrip().decode("utf-8").split("\t",1)[0]
        i += 1
        if element not in my_dict:
                my_dict[element] = 1
        else:
                my_dict[element] += 1                
        if i == n:
             break
b = 10
bucket = get_hyperlog_bucket(5*10**4,b)
exact = sum(my_dict.values())
approx = get_cardinality(bucket,b,0.7213/(1+1.079/(2**b)))
print("Number of unique ip's(Reference): " + str(exact))
print("Number of estimated unique ip's(HLL)" + str(approx))
print("Total error: " + str((100*abs(approx-exact))/exact)+"%") 

Number of unique ip's(Reference): 50000
Number of estimated unique ip's(HLL)48603.14510159575
Total error: 2.7937097968085%


### How many times was IP X seen on domain Y? (for some X and Y provided at run time)
Below we can see some tests on the stream. The first one answers how many times ip 124.81.124.112 was seen on wikipedia.org, the second one answers how many times 192.118.123.80 was seen on 


In [16]:
STREAM_SIZE = 1000000
generateCountMinSketch(STREAM_SIZE)
ip = "192.118.123.80"
domain = "python.org"
estimated_number = countIpOnDomain(ip,domain)
print ("estimeted no of ip:{} in domain:{} is {}".format(ip,domain,estimated_number))



estimeted no of ip:192.118.123.80 in domain:python.org is 41


In [17]:
n = 500000
generateCountMinSketch(n)
countIpOnDomain("124.81.124.112","wikipedia.org")

64

### How many unique IPs are there for each domain?

Here we need to find the unique ips in each domain. Here the idea is to impliment multiple HLL for each domain.
Thi choice is made because memory used by each HLL is very less it is feasible to make a dictionary with keys to be domains and value as HLL buckets. Also the parameters for HLL is chosen by allowing to have an error of 4 to 5% which implies from the error rate equation $\frac{1.04}{\sqrt{2^b}}$ the b value to be around 10.



In [18]:

web_traffic_stream = stream(STREAM_SIZE)
b=10
domain_unique_ip(bucket_dictionary,web_traffic_stream,b)
STREAM_SIZE = 100000
web_traffic_stream = stream(STREAM_SIZE)
true_values = {}
for i,l in enumerate(web_traffic_stream):
    if l.split()[1] in true_values.keys():
        true_values[l.split()[1]].append(l.split()[0])
    else:
        true_values[l.split()[1]] = [l.split()[0]]
for l in bucket_dictionary.keys():
    estimated_ip = get_cardinality(bucket_dictionary[l],b,0.7213/(1+1.079/(2**b)))
    true_ip = len(list(set(true_values[l])))
    print("No of unique estimated ips in domain {} are {}".format(l,estimated_ip))
    print("No. of True ips in domain {} is {}".format(l,true_ip))
    print("error is {}".format(100*abs((true_ip-estimated_ip)/(true_ip))))

No of unique estimated ips in domain wikipedia.org are 50522.5393713267
No. of True ips in domain wikipedia.org is 52353
error is 3.4963815419809814
No of unique estimated ips in domain pandas.pydata.org are 12850.462560135693
No. of True ips in domain pandas.pydata.org is 12999
error is 1.142683590001593
No of unique estimated ips in domain python.org are 25610.173833989553
No. of True ips in domain python.org is 25927
error is 1.2219931577523322
No of unique estimated ips in domain spark.apache.org are 522.0859266999656
No. of True ips in domain spark.apache.org is 520
error is 0.4011397499933895
No of unique estimated ips in domain google.com are 2560.6707269182866
No. of True ips in domain google.com is 2626
error is 2.487786484452147
No of unique estimated ips in domain dtu.dk are 2888.292292020369
No. of True ips in domain dtu.dk is 2629
error is 9.862772613935684
No of unique estimated ips in domain github.com are 1346.253524923271
No. of True ips in domain github.com is 1341
er

Clearly we could see from above that  4 to 5% is also justified by comparing with true values


## Most X frequent IP's in stream

In [19]:
generateCountMinSketch_heap(10000,50)

{'128.76.240.132': 75,
 '151.112.165.145': 75,
 '158.159.192.137': 75,
 '44.98.181.83': 75,
 '177.158.218.97': 75,
 '154.113.149.55': 74,
 '183.105.132.76': 74,
 '121.81.207.182': 74,
 '164.65.148.120': 75,
 '75.220.155.79': 76,
 '101.136.92.86': 74,
 '87.169.85.108': 74,
 '174.117.171.123': 78,
 '161.166.109.171': 74,
 '153.147.118.113': 75,
 '164.138.98.124': 75,
 '186.197.114.139': 75,
 '113.58.134.64': 74,
 '136.108.134.135': 75,
 '49.50.50.159': 74,
 '137.170.154.83': 74,
 '163.70.109.155': 77,
 '204.28.120.155': 75,
 '168.120.130.122': 74,
 '119.108.102.158': 75,
 '0.138.125.209': 76,
 '82.69.113.168': 75,
 '70.168.130.156': 76,
 '28.156.118.159': 74,
 '269.64.127.77': 74,
 '92.175.129.212': 75,
 '158.110.147.43': 75,
 '227.112.50.155': 75,
 '98.180.195.200': 76,
 '113.170.90.76': 74,
 '113.85.116.139': 75,
 '52.82.137.64': 74,
 '148.159.194.191': 74,
 '108.52.137.92': 74,
 '117.114.112.48': 74,
 '103.201.106.76': 79,
 '131.139.255.102': 74,
 '140.52.237.136': 74,
 '165.224.177.9