# Project 2: Web Traffic Analysis
**This is the second of three mandatory projects to be handed in as part of the assessment for the course 02807 Computational Tools for Data Science at Technical University of Denmark, autumn 2019.**

#### Practical info
- **The project is to be done in groups of at most 3 students**
- **Each group has to hand in _one_ Jupyter notebook (this notebook) with their solution**
- **The hand-in of the notebook is due 2019-11-10, 23:59 on DTU Inside**

#### Your solution
- **Your solution should be in Python**
- **For each question you may use as many cells for your solution as you like**
- **You should document your solution and explain the choices you've made (for example by using multiple cells and use Markdown to assist the reader of the notebook)**
- **You should not remove the problem statements**
- **Your notebook should be runnable, i.e., clicking [>>] in Jupyter should generate the result that you want to be assessed**
- **You are not expected to use machine learning to solve any of the exercises**
- **You will be assessed according to correctness and readability of your code, choice of solution, choice of tools and libraries, and documentation of your solution**

## Introduction
In this project your task is to analyze a stream of log entries. A log entry consists of an [IP address](https://en.wikipedia.org/wiki/IP_address) and a [domain name](https://en.wikipedia.org/wiki/Domain_name). For example, a log line may look as follows:

`192.168.0.1 somedomain.dk`

One log line is the result of the event that the domain name was visited by someone having the corresponding IP address. Your task is to analyze the traffic on a number of domains. Counting the number of unique IPs seen on a domain doesn't correspond to the exact number of unique visitors, but it is a good estimate.

Specifically, you should answer the following questions from the stream of log entries.

- How many unique IPs are there in the stream?
- How many unique IPs are there for each domain?
- How many times was IP X seen on domain Y? (for some X and Y provided at run time)

**The answers to these questions can be approximate!**

You should also try to answer one or more of the following, more advanced, questions. The answers to these should also be approximate.

- How many unique IPs are there for the domains $d_1, d_2, \ldots$?
- How many times was IP X seen on domains $d_1, d_2, \ldots$?
- What are the X most frequent IPs in the stream?

You should use algorithms and data structures that you've learned about in the lectures, and you should provide your own implementations of these.

Furthermore, you are expected to:

- Document the accuracy of your answers when using algorithms that give approximate answers
- Argue why you are using certain parameters for your data structures

This notebook is in three parts. In the first part you are given an example of how to read from the stream (which for the purpose of this project is a remote file). In the second part you should implement the algorithms and data structures that you intend to use, and in the last part you should use these for analyzing the stream.

## Reading the stream
The following code reads a remote file line by line. It is wrapped in a generator to make it easier to extend. You may modify this if you want to, but your solution should remain parametrized, so that your notebook can be run without having to consume the entire file.

In [1]:
from urllib.request import urlopen

def stream(n):
    i = 0
    with urlopen('https://files.dtu.dk/fss/public/link/public/stream/read/traffic_2?linkToken=_DcyO-U3MjjuNzI-&itemName=traffic_2') as f:
        for line in f:
            element = line.rstrip().decode("utf-8")
            yield element
            i += 1
            if i == n:
                break

In [2]:
STREAM_SIZE = 10
web_traffic_stream = stream(STREAM_SIZE)
# list(web_traffic_stream)

## Data structures

The maximum length of an IP address is 15: ddd.ddd.ddd.ddd.
Therefore, according to the Flajolet-Martin Algorithm, a good estimate of required positions in a hash table is 2^15 = 32768.
We chose to simply use a 32 bit hash domain, which has 2^32 different values, giving more than enough unique positions, thus more or less eliminating the chance of overlap between input values.

In [3]:
# Example of a 32 bit hash function, the murmurhash3:
import mmh3
mmh3.hash('186.99.192.116', seed = 42)

-1302464439

In order to make sure that all data is processable, we chose to store the data in a database, since a very large data stream could exceed the memory space on our computer.
The chosen database structure was tables of each unique domain containing all IP addresses visiting that domain.
This structure allowed rapid searches within each domain and required less storage space than simply storing all data.

We chose not to store a count for each IP in the domain tables, as this would require a constant search through the database and make creating the database take too much time.

To further optimize storage usage, the domains and IPs were hashed in the database to a 32 bit domain.
This does introduce some kind of uncertainty, as two IPs could hash to the same value, but with the 2^32 possible values, the risk is minimal.
The domain names were hashed in positive values, as SQL does not allow punctuation as table names (except for underscores).

In [4]:
# Remove database in case of rerun
import os
try: os.remove("web.db")
except: pass

In [5]:
import sqlite3

conn = sqlite3.connect('web.db')
c = conn.cursor()

# Track unique domain names:
domain_names = set()

# Function for hashing in seed 42. Domain names are hashed in positive number space.
def hash(element, absolute = False):
    if absolute:
        return abs(mmh3.hash(element, seed = 42))
    else:
        return mmh3.hash(element, seed = 42)

STREAM_SIZE = 10000
for element in stream(STREAM_SIZE):
    element = element.split('\t')
    IP = hash(element[0])
    domain = hash(element[1], True)
    domain_names.add(element[1])

    add_table = """CREATE TABLE IF NOT EXISTS table_{}(id INTEGER PRIMARY KEY, ip REAL)""".format(domain)
    c.execute(add_table)
    c.execute('INSERT INTO table_{}(ip) VALUES ({})'.format(domain, IP))

conn.commit()

Now, a database has been created capable of answering most questions about the stream.

## Analysis


#### How many unique IPs are there in the stream?
To answer this question, we collected all distinct IPs from each domain and counted the number of total uniques:

In [6]:
# How many unique IPs are there in the stream?
uniques = set()
for domain in domain_names:
    domain_hash = hash(domain, True)
    get_IPs = '''SELECT DISTINCT ip FROM table_{}'''.format(domain_hash)
    c.execute(get_IPs)
    IPs = c.fetchall()
    [uniques.add(IP[0]) for IP in IPs]
print('Number of unique IPs:')
print(len(uniques))

Number of unique IPs:
9985


We note that the number of unique IPs are almost as large as the number of elements seen in the stream.

#### How many unique IPs are there for each domain?
Here, we looped through the tables and counted the number of distinct IPs for each.

In [7]:
# How many unique IPs are there for each domain?
for domain in domain_names:
    domain_hash = hash(domain, True)
    get_IPs = '''SELECT DISTINCT count(ip) FROM table_{}'''.format(domain_hash)
    c.execute(get_IPs)
    IP_count = c.fetchall()[0][0]
    print(domain, 'has', IP_count, 'unique IPs')

python.org has 2662 unique IPs
pandas.pydata.org has 1323 unique IPs
spark.apache.org has 52 unique IPs
databricks.com has 130 unique IPs
dtu.dk has 273 unique IPs
google.com has 256 unique IPs
datarobot.com has 14 unique IPs
scala-lang.org has 2 unique IPs
wikipedia.org has 5154 unique IPs
github.com has 134 unique IPs


#### How many times was IP X seen on domain Y? (for some X and Y provided at run time)
This is answered by looking in the domain table and outputting the count of IPs identical with the input IP_x.

In [8]:
# How many times was IP X seen on domain Y? (for some X and Y provided at run time)
# Define input variables:
IP_x = '186.99.192.116'
domain_y = 'python.org'

c.execute('SELECT count(ip) FROM table_{} WHERE ip = {}'.format(hash(domain_y, True), hash(IP_x)))
print('The number of times IP_x is seen on domain_y:')
print(c.fetchall()[0][0])

The number of times IP_x is seen on domain_y:
1


#### How many unique IPs are there for the domains d1,d2,…?
The answer to this is found in a much similar way as the similar question answered earlier. The domain names are defined and the corresponding tables are looped through.
As before, the count of distinct IPs are printed for each input domain.

In [9]:
# How many unique IPs are there for the domains d1,d2,…?
ds = ['python.org', 'wikipedia.org', 'pandas.pydata.org', 'github.com']

for d in ds:
    domain = hash(d, True)
    unique_IPs = '''SELECT DISTINCT count(ip) FROM table_{}'''.format(domain)
    c.execute(unique_IPs)
    print('The number of unique IPs visiting {} is:'.format(d))
    print(c.fetchall()[0][0])


The number of unique IPs visiting python.org is:
2662
The number of unique IPs visiting wikipedia.org is:
5154
The number of unique IPs visiting pandas.pydata.org is:
1323
The number of unique IPs visiting github.com is:
134


#### How many times was IP X seen on domains d1,d2,…?
Again a list of domains of interest are defined and the tables are looped through, counting the number of times a specific IP occurs.

In [10]:
# How many times was IP X seen on domains d1,d2,…?
ds = ['python.org', 'wikipedia.org', 'pandas.pydata.org', 'github.com']
IP_x = '186.99.192.116'

for d in ds:
    domain = hash(d, True)
    IP_x_seen = '''SELECT count(ip) FROM table_{}
                WHERE ip = {}'''.format(domain, hash(IP_x))
    c.execute(IP_x_seen)
    print('The number of times IP {} has visited {}:'.format(IP_x, d))
    print(c.fetchall()[0][0])

The number of times IP 186.99.192.116 has visited python.org:
1
The number of times IP 186.99.192.116 has visited wikipedia.org:
0
The number of times IP 186.99.192.116 has visited pandas.pydata.org:
0
The number of times IP 186.99.192.116 has visited github.com:
0


#### What are the X most frequent IPs in the stream?
Noting that the number of unique IPs in the stream is almost as large as the number of stream elements, it will not be possible to do this using a streaming algorithm.
Therefore, our best option to solve this is to make a dictionary of all IPs in our datastructure adding a counter to their number of occurances.
However, as the IPs were storing in hashes, it is not possible to retreive which hash is which IP, without running through the stream again. Another workaround would have been to store all unique IPs or not hash the IPs when storing them in the database in the first place.

In [11]:
# What are the X most frequent IPs in the stream?
X = 5
IP_counts = dict()

for domain in domain_names:
    domain_hash = hash(domain, True)
    get_IPs = '''SELECT ip FROM table_{}'''.format(domain_hash)
    c.execute(get_IPs)
    IPs = c.fetchall()
    IPs = [IP[0] for IP in IPs]
    for IP in IPs:
        if IP in IP_counts:
            IP_counts[IP] += 1
        else:
            IP_counts[IP] = 1

sorted_database = sorted(IP_counts.items(), key=lambda item: item[1], reverse = True)
print(sorted_database[:6])

[(1745939879.0, 3), (435711427.0, 3), (2113552735.0, 2), (-510818873.0, 2), (1372869047.0, 2), (1916741674.0, 2)]


In [12]:
conn.commit()
conn.close()