# Project 2: Web Traffic Analysis
**This is the second of three mandatory projects to be handed in as part of the assessment for the course 02807 Computational Tools for Data Science at Technical University of Denmark, autumn 2019.**

#### Practical info
- **The project is to be done in groups of at most 3 students**
- **Each group has to hand in _one_ Jupyter notebook (this notebook) with their solution**
- **The hand-in of the notebook is due 2019-11-10, 23:59 on DTU Inside**

#### Your solution
- **Your solution should be in Python**
- **For each question you may use as many cells for your solution as you like**
- **You should document your solution and explain the choices you've made (for example by using multiple cells and use Markdown to assist the reader of the notebook)**
- **You should not remove the problem statements**
- **Your notebook should be runnable, i.e., clicking [>>] in Jupyter should generate the result that you want to be assessed**
- **You are not expected to use machine learning to solve any of the exercises**
- **You will be assessed according to correctness and readability of your code, choice of solution, choice of tools and libraries, and documentation of your solution**

## Introduction
In this project your task is to analyze a stream of log entries. A log entry consists of an [IP address](https://en.wikipedia.org/wiki/IP_address) and a [domain name](https://en.wikipedia.org/wiki/Domain_name). For example, a log line may look as follows:

`192.168.0.1 somedomain.dk`

One log line is the result of the event that the domain name was visited by someone having the corresponding IP address. Your task is to analyze the traffic on a number of domains. Counting the number of unique IPs seen on a domain doesn't correspond to the exact number of unique visitors, but it is a good estimate.

Specifically, you should answer the following questions from the stream of log entries.

- How many unique IPs are there in the stream?
- How many unique IPs are there for each domain?
- How many times was IP X seen on domain Y? (for some X and Y provided at run time)

**The answers to these questions can be approximate!**

You should also try to answer one or more of the following, more advanced, questions. The answers to these should also be approximate.

- How many unique IPs are there for the domains $d_1, d_2, \ldots$?
- How many times was IP X seen on domains $d_1, d_2, \ldots$?
- What are the X most frequent IPs in the stream?

You should use algorithms and data structures that you've learned about in the lectures, and you should provide your own implementations of these.

Furthermore, you are expected to:

- Document the accuracy of your answers when using algorithms that give approximate answers
- Argue why you are using certain parameters for your data structures

This notebook is in three parts. In the first part you are given an example of how to read from the stream (which for the purpose of this project is a remote file). In the second part you should implement the algorithms and data structures that you intend to use, and in the last part you should use these for analyzing the stream.

## Reading the stream
The following code reads a remote file line by line. It is wrapped in a generator to make it easier to extend. You may modify this if you want to, but your solution should remain parametrized, so that your notebook can be run without having to consume the entire file.

In [102]:
import mmh3
from tqdm import tqdm_notebook as tqdm
import numpy as np
from scipy.stats import hmean 
import random

In [103]:
url = 'https://files.dtu.dk/fss/public/link/public/stream/read/traffic?linkToken=3pLwj8eS8I_MkvCK&itemName=traffic'
url2 = 'https://files.dtu.dk/fss/public/link/public/stream/read/traffic_2?linkToken=_DcyO-U3MjjuNzI-&itemName=traffic_2'

In [104]:
import urllib

def stream(n):
    with urllib.request.urlopen(url2) as f:
        for i,line in enumerate(f):
            if i == n:
                break
                
            element = line.rstrip().decode("utf-8").split('\t')
            try:
                yield element
            except Exception as E:
                print(E)


In [105]:
STREAM_SIZE = 100
web_traffic_stream = stream(STREAM_SIZE)

# Data structures

## CountMin Sketch

In [106]:
class CountMin:
    def __init__(self, w, d):
        self.w = w
        self.d = d
        self.h = lambda x: [mmh3.hash(x, seed) % self.w for seed in range(self.d)]
        self.M = np.zeros((self.d,self.w))
    
    def __call__(self, e):
        h = self.h(e)
        for i in range(self.d):
            self.M[i,h[i]] += 1
        
    def query(self, e):
        h = self.h(e)
        return min([self.M[i,h[i]] for i in range(self.d)])
    
    def __repr__(self):
        return f'Counts: {list(self.M).__repr__()}'
    
    def getBins(self):
        return self.M

## HyperLogLog

In [107]:
class HLL:
    def __init__(self, m):
        assert (m & (m-1)) == 0, "m must be a power of 2"
        if m < 128:
            alphas = {16: 0.673, 32: 0.697, 64: 0.709}
            self.alpha = alphas[m]
        else:
            self.alpha = 0.7213 / (1+1.079/m)
        self.h = lambda x: mmh3.hash64(x)[0]
        self.M = np.zeros((m,1))
        self.split = int(np.log2(m))
        self.m = m
        
    def __call__(self, e):
        # Hash binary representation reversed because all ips were yielding
        binh = bin(self.h(e)).replace('b', '').replace('-','')[::-1]
        hp = binh[:self.split]
        lp = binh[self.split:]
        j = int(hp, base=2) # create int from binary (base 2)
        k = lp.index('1') + 1
        #if k > self.M[j]:
            #print(lp)
        self.M[j] = max(self.M[j],k)
        
    def query(self):
        # Try adding epsilon to avoid dividing by 0
        mean = hmean(pow(2,self.M))[0]
        return int(round(self.m * mean * self.alpha))
    
    def __repr__(self):
        return f'estimate: {self.query()} \n counters: {self.M}'

# Analysis

In [108]:
n_stream = int(1e6)

- How many unique IPs are there in the stream?

In [109]:
ip_counter = HLL(256)
for entry in tqdm(stream(n_stream), total=n_stream):
    ip_counter(entry[0])
f'{ip_counter.query()}/{n_stream} ({round(ip_counter.query()/n_stream*100, 2)}%) unique IP addresses'

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))




'840745/1000000 (84.07%) unique IP addresses'

In [110]:
ips = [x[0] for x in tqdm(stream(n_stream), total=n_stream)]
gt_unique = len(set(ips))
f'ground truth: {round(gt_unique/n_stream*100, 2)}'

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))




'ground truth: 97.17'

The ground truth and approximate values vary by about 13%. When the cardinality of the stream is so close to the total size of the stream, there is bound to be a fairly large margin of error like this.

- How many unique IPs are there for each domain?

### Solution
For this dataset, IP addresses are not frequently repeated. With this in mind, a simple first approach for this queston would be to count the unique website visits to each site using `CountMin` and use this as a proxy for the amount of unique visitors. This approach does not take into account however that some websites may have lots of repeated visitors while others may have no repeated visitors. A potentially more useful solution to this question is to keep a dictionary of unique websites and count unique visitors to each site. Theoretically, this could use an enormous amount of space if the number of unique websites was large. In this dataset, it is quite small at 9 websites, so this approach is quite possible. Keeping a dictionary of `HLL` counters, it is possible to quickly estimate the total number of unique IPs for each website. In theory, the number of unique IP addresses should be less than the number of website visits. Interestingly, the `CountMin` estimate of the number of visits to `Wikipedia.org` is actually lower than the `HyperLogLog` estimate of the number of unique visitors in this case.

Because the number of unique IPs is so high compared to the number of visits, CountMin cannot compress the data very well, to save at least a little space, values of `w=n_stream//1000` and `d=10` were chosen. For the HyperLogLog approach, a 32 bit hash was used, as this provided adequate results

In [120]:
website_visits = CountMin(n_stream//int(1e3), 10)
for entry in tqdm(stream(n_stream), total=n_stream):
    website_visits(entry[1])

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))

In [157]:
website_visits.query('wikipedia.org')

522404.0

In [113]:
ip_counters = {}
for entry in tqdm(stream(n_stream), total=n_stream):
    if entry[1] not in ip_counters.keys():
        ip_counters[entry[1]] = HLL(32)
        
    ip_counters[entry[1]](entry[0])

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))




In [114]:
ip_counters['wikipedia.org'].query()

544487

Ground Truth:

In [117]:
wiki_ips = {}
for entry in tqdm(stream(n_stream), total=n_stream):
    if entry[1] == 'wikipedia.org':
        try:
            wiki_ips[entry[0]] += 1
        except:
            wiki_ips[entry[0]] = 1
            
len(wiki_ips.keys())

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))

510191

The HyperLogLog approach appears to be fairly accurate, differing from the ground truth by about 30,000/1,000,000, or ~3%.

- How many times was IP X seen on domain Y? (for some X and Y provided at run time)

### Answer
Use `CountMin` and hash both ip and domain values. This will estimate the number of occurrences of each domain/ip pair.

In [15]:
ipx_domainy = CountMin(n_stream//int(1e1), 100)
for entry in tqdm(stream(n_stream), total=n_stream):
    ipx_domainy(" ".join(entry))

HBox(children=(IntProgress(value=0, max=100000), HTML(value='')))




In [133]:
ex = random.choice(list(stream(100)))
ex_str = " ".join(ex)
count_ex = ipx_domainy.query(ex_str)
f'IP {ex[0]} appears {count_ex} times on {ex[1]}'

'IP 115.170.85.106 appears 5.0 times on pandas.pydata.org'

Ground Truth

In [134]:
count = sum([1 for i in tqdm(stream(n_stream), total=n_stream) if i[0]==ex[0] and i[1]==ex[1]])
count

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))

1

It is quite difficult to achieve good accuracy on this problem because the number of unique domain/ip pairs is so close to the total number of visits. Using `w=n_stream/10` and `d=100` resulted in an estimate of between 3-5 times for a random IP address from the dataset, while the ground truth is almost always 1. If the number of people repeatedly visiting websites went up, our accuracy would go up as well.

- How many unique IPs are there for the domains $d_1, d_2, \ldots$?

- How many times was IP X seen on domains $d_1, d_2, \ldots$?

Solution:  
Use a dictionary with domains as keys and CountMin objects as values. Use each countmin object to estimate the number of visits for each IP address on the given domain.

In [18]:
ip_count_per_website = {}
for entry in tqdm(stream(n_stream), total=n_stream):
    if entry[1] not in ip_count_per_website.keys():
        ip_count_per_website[entry[1]] = CountMin(n_stream//100, 100)
        
    ip_count_per_website[entry[1]](entry[0])

HBox(children=(IntProgress(value=0, max=100000), HTML(value='')))




Check how many times the first IP address on the stream occurs across all websites except wikipedia

In [135]:
websites_tocheck = set(ip_count_per_website.keys()) - {'wikipedia.org'}
occurrences = sum([ip_count_per_website[website].query(ex[0]) for website in websites_tocheck])

print(f'IP {ex[0]} occurred {occurrences} times across websites {", ".join(websites_tocheck)}')

IP 115.170.85.106 occurred 19.0 times across websites databricks.com, spark.apache.org, dtu.dk, pandas.pydata.org, github.com, scala-lang.org, google.com, python.org, datarobot.com


Ground Truth

In [136]:
count = sum([1 for i in tqdm(stream(n_stream), total=n_stream) if i[0]==ex[0] and i[1] in websites_tocheck])
count

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))

1

Once again, the accuracy is hindered by the low number of times a given IP appears in the stream. This is compounded by the fact that CountMin is a biased estimator, so summing across many CountMin objects the errors add together rather than cancel out. Estimates appear to be off by as much as 20 times for this question.

- What are the X most frequent IPs in the stream?

Solution: Use countmin, and when item is added, query it too and get approximate number of occurences. Keep a sorted heap of k IP addresses and their frequencies. If the approximate count for the new value is greater than the min value of the heap, pop one and insert the new item. If the new item already exists in the heap, increment it's value and continue.

In [137]:
from heapq import *

In [138]:
class TopK:
    def __init__(self, w, d, k):
        self.w = w
        self.d = d
        self.h = lambda x: [mmh3.hash(x, seed) % self.w for seed in range(self.d)]
        self.M = np.zeros((self.d,self.w))
        self.heap = [(0, '000.000.000.000')]
        self.k = k
    
    def __call__(self, e):
        h = self.h(e)
        for i in range(self.d):
            self.M[i,h[i]] += 1
            
        estimate = self.query(e)
        
        try:        
            # if element already in list, update existing entry
            if e in list(zip(*self.heap))[1]:
                idx = list(zip(*self.heap))[1].index(e)
                self.heap[idx] = (estimate, e)
            else:
                temp = heappop(self.heap)

                if estimate > temp[0]:
                    heappush(self.heap, (estimate, e))

                if len(self.heap) < self.k:
                    heappush(self.heap, temp)
        except:
            print(self.heap)
            raise()
        
    def query(self, e):
        h = self.h(e)
        return min([self.M[i,h[i]] for i in range(self.d)])
    
    def topk(self):
        return self.heap
    
    def __repr__(self):
        return f'Counts: {list(self.M).__repr__()}'
    
    def getBins(self):
        return self.M

Because the number of unique IPs is so close to the number of total visits, significant compression is not possible. For this problem `w=n_stream/1000` and `d=100` were chosen. `k=30` was chosen arbitrarily as a reasonably small number that would show some outlier IPs as well as some with more normal counts.

In [160]:
top_30_ip_counter = TopK(n_stream//1000, 100, 30)
for entry in tqdm(stream(n_stream), total=n_stream):
    top_30_ip_counter(entry[0])
    
top_30_ips = top_30_ip_counter.topk()
sorted(top_30_ips, key=lambda k: int(k[0]))

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))

[(37.0, '132.60.146.100'),
 (39.0, '31.0.122.149'),
 (943.0, '164.194.109.120'),
 (943.0, '53.30.199.126'),
 (943.0, '167.157.107.246'),
 (943.0, '59.199.147.146'),
 (943.0, '88.210.234.107'),
 (944.0, '128.130.122.223'),
 (944.0, '185.126.73.174'),
 (944.0, '150.111.145.210'),
 (944.0, '56.29.200.127'),
 (944.0, '199.221.154.131'),
 (945.0, '1.186.108.218'),
 (946.0, '167.102.105.189'),
 (947.0, '57.27.198.129'),
 (947.0, '241.51.137.58'),
 (947.0, '148.229.103.163'),
 (948.0, '53.27.199.126'),
 (949.0, '55.25.199.130'),
 (950.0, '217.83.164.127'),
 (950.0, '54.31.200.128'),
 (950.0, '53.27.199.127'),
 (951.0, '54.27.199.128'),
 (951.0, '56.28.199.128'),
 (953.0, '58.31.200.128'),
 (954.0, '56.31.200.127'),
 (958.0, '53.29.200.127'),
 (1025.0, '204.141.72.187'),
 (1103.0, '108.41.112.108'),
 (1132.0, '72.187.84.158')]

In [148]:
counters = {}
for entry in tqdm(stream(n_stream), total=n_stream):
    try:
        counters[entry[0]] += 1
    except:
        counters[entry[0]] = 1

HBox(children=(IntProgress(value=0, max=1000000), HTML(value='')))

In [158]:
top_30_gt = zip(sorted(counters.keys(), key=lambda k: counters[k], reverse=True)[:30], sorted(counters.values(), reverse=True)[:30])

In [159]:
list(top_30_gt)

[('72.187.84.158', 222),
 ('108.41.112.108', 199),
 ('204.141.72.187', 139),
 ('53.30.199.128', 34),
 ('55.29.199.128', 34),
 ('56.29.200.127', 33),
 ('55.31.199.127', 32),
 ('54.30.199.128', 31),
 ('53.32.200.127', 30),
 ('53.29.198.127', 30),
 ('56.30.200.127', 30),
 ('55.30.198.128', 28),
 ('54.27.199.128', 28),
 ('54.31.200.128', 28),
 ('56.29.199.129', 27),
 ('56.30.200.128', 27),
 ('57.30.199.128', 27),
 ('54.32.199.127', 27),
 ('53.30.199.129', 27),
 ('55.29.200.128', 27),
 ('55.29.201.129', 27),
 ('55.28.198.128', 27),
 ('56.31.200.127', 27),
 ('54.30.200.128', 27),
 ('53.29.200.128', 26),
 ('53.28.199.127', 26),
 ('53.30.200.128', 26),
 ('56.28.199.128', 26),
 ('55.29.199.127', 26),
 ('53.29.200.127', 26)]

The results of this estimator are quite good, especially for the most commmon IP addresses. While the countmin sketch saturates and includes errors, these errors are approximately uniformly distributed. The top 3 most common IP addresses are outliers in terms of number of occurrences, and these all show up in the same order as the ground truth. Furthermore, while the exact count of other IP addresses in the top 30 isn't correct, or even close, many of the top 30 true IP addresses appear in the estimate. In this problem, as long as hash collisions are uniformly distributed, they do not effect the ability to detect outlier data such as spamming IP addresses.