b) Write a program that downloads DNS measurements for Alexa Top 1M and Cisco Umbrella from OpenINTEL
for a specified date. Extend your program so that it untars the files and opens the .avro files within, and inspect
the file schema and format. Make your script extract all responses that correspond to A (IPv4 addresses), AAAA
(IPv6 addresses), and CNAME resource records for each queried www. subdomain, and store the results (along
with the measurement date) in a SQL database; distinguish between the Alexa and Umbrella list accordingly by
creating separate tables. Finally, extend your script to repeat this process for measurements of a whole month
(specified by the user). (2 points)

solution b):

The program is "question_b.py". It uses multiprocessing, so run it in an environment that supports multiple processes.
Note: starting the program with a maximum buffer size of 10000 is recommended.

An example is: **"question_b.py alexa 2020 10 --buffer 10000"**
```
usage: question_b.py [-h] [--day [day]] [--db [db_name]] [--cache [cache_dir]]
                     [--buffer [buffer_max]]
                     source year month

question_b

positional arguments:
  source                the source of the data, can be Alexa or Umbrella
  year                  the year to query, example: 10
  month                 the month to query, example: 2

optional arguments:
  -h, --help            show this help message and exit
  --day [day]           the day to query, if given, only one day will be
                        handled
  --db [db_name]        the db name (name only), example: "example.db"
  --cache [cache_dir]   the cache folder, example: data
  --buffer [buffer_max]
                        the max buffer size: the size of avro entries that are
                        buffered in the memory

```

To avoid running out of memory, a "buffer_size" parameter limits the number of
records buffered inside the "dao". If "buffer_size" is set to 0, the buffer is unlimited:
all records of an "avro-file" are loaded at once and then saved to the SQL database,
which is faster than running with a buffer size limit.
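
The buffering idea can be illustrated with a minimal sketch. It is not the code of "question_b.py": the class name `BufferedWriter`, the table layout, and the column names are assumptions made for this example, and only the standard `sqlite3` module is used.

```python
import sqlite3


class BufferedWriter:
    """Hypothetical stand-in for the "dao": batches rows before writing them."""

    def __init__(self, conn, table, buffer_max=10000):
        self.conn = conn
        self.table = table
        self.buffer_max = buffer_max  # 0 means "no limit"
        self.rows = []
        # Assumed table layout; the real schema may differ.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(query_name TEXT, rr_type TEXT, answer TEXT, date TEXT)"
        )

    def add(self, row):
        self.rows.append(row)
        # Write the batch once the limit is reached; never when unlimited.
        if self.buffer_max and len(self.rows) >= self.buffer_max:
            self.flush()

    def flush(self):
        if self.rows:
            self.conn.executemany(
                f"INSERT INTO {self.table} VALUES (?, ?, ?, ?)", self.rows
            )
            self.conn.commit()
            self.rows.clear()


conn = sqlite3.connect(":memory:")
writer = BufferedWriter(conn, "alexa", buffer_max=2)
for i in range(5):
    writer.add((f"www.example{i}.com.", "A", "192.0.2.1", "2020-10-01"))
writer.flush()  # write whatever is left in the buffer
```

With `buffer_max=0` nothing is written until `flush()` is called, which matches the unlimited mode described above: fewer, larger writes at the cost of holding every record in memory.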

The extraction of data from the "avro-files" is implemented with multiprocessing: each "avro-file" is
processed by its own process, which speeds up the extraction.

The download can also be implemented with multiprocessing in the same way,
but downloading many files and extracting all "avro-files" at once may require too much storage. For this reason,
the program deletes the tar and avro files immediately after the extraction has finished.
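
The extract, process, and clean-up cycle might look roughly like the sketch below. It is an illustration only: `process_archive` and its `handle_file` callback are hypothetical names, and the actual program may organize its cache folder differently.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def process_archive(tar_path, handle_file, keep_archive=False):
    """Unpack the .avro members of tar_path into a temporary folder, call
    handle_file on each, then delete the unpacked files and the archive."""
    tar_path = Path(tar_path)
    workdir = Path(tempfile.mkdtemp(prefix="avro_"))
    try:
        with tarfile.open(tar_path) as tar:
            for member in tar:
                if member.isfile() and member.name.endswith(".avro"):
                    # Keep only the file name so members cannot escape workdir.
                    target = workdir / Path(member.name).name
                    with tar.extractfile(member) as src, open(target, "wb") as dst:
                        shutil.copyfileobj(src, dst)
        return [handle_file(path) for path in sorted(workdir.glob("*.avro"))]
    finally:
        # Free the disk space whether or not processing succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
        if not keep_archive:
            tar_path.unlink(missing_ok=True)
```

Because the clean-up sits in a `finally` block, the storage used at any time is bounded by one archive and its contents per worker, even if a file fails to parse.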

In [None]:
# the program can also be run as a function call
from question_b import handle_one_month
# note: slow! about 10 min!
handle_one_month(2020, 10, db_name="example.db", buffer_max=10000)
# will print: "used time: 878.292760, total: 3682335"


d) Write a program that adds columns for the AS numbers of the Autonomous Systems (ASes), which announce
the IPv4 and IPv6 addresses of the resolved domains, to each table of your database. To map IP addresses to
AS numbers (e.g., with pyasn), use BGP data collected from the Amsterdam Internet Exchange (AMS-IX),
which you can download from the Route Views archive. For simplicity, it is sufficient to take one RIB file (e.g.,
for the 15th of the month at 12:00 PM) for all IP-ASN-mappings over the whole month. Why do we choose data
from AMS-IX rather than from other collectors? (2 points)

solution d):

The program is "question_d.py". To start it you must provide an existing database.
Note: starting the program with a chunksize of 10000 is recommended.

An example is: **"question_d.py alexa 2020 10 --chunksize 10000"**

```
usage: question_d.py [-h] [--db [db_name]] [--cache [cache_dir]]
                     [--chunksize [chunksize]]
                     source year month

question_d

positional arguments:
  source                the source of the data, can be Alexa or Umbrella
  year                  the year to query, example: 10
  month                 the month to query, example: 2

optional arguments:
  -h, --help            show this help message and exit
  --db [db_name]        the db name (name only), example: "example.db"
  --cache [cache_dir]   the cache folder, example: data
  --chunksize [chunksize]
                        the max chunksize to read and modify from sql table

```

The program downloads the RIB archives for the given year and month,
converts them to IPASN databases, and uses these to look up the AS numbers.
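
The lookup and the chunked table update can be sketched as follows. This is not the implementation of "question_d.py" and it does not use pyasn: the prefix table `PREFIXES` is invented sample data (documentation prefixes and private-use AS numbers), the lookup is a plain longest-prefix match with the standard `ipaddress` module, and the column names `answer` and `asn` are assumptions.

```python
import ipaddress
import sqlite3

# Hypothetical prefix table; real data would come from the converted RIB file.
PREFIXES = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("192.0.2.128/25"): 64501,
    ipaddress.ip_network("2001:db8::/32"): 64502,
}


def lookup_asn(address):
    """Return the ASN of the most specific matching prefix, or None."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return None  # e.g. a CNAME target, which is not an IP address
    matches = [net for net in PREFIXES if net.version == ip.version and ip in net]
    if not matches:
        return None
    return PREFIXES[max(matches, key=lambda net: net.prefixlen)]


def fill_asn_column(conn, table, chunksize=10000):
    """Fill the asn column, reading and updating chunksize rows at a time."""
    last_rowid = 0
    while True:
        chunk = conn.execute(
            f"SELECT rowid, answer FROM {table} "
            "WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, chunksize),
        ).fetchall()
        if not chunk:
            break
        conn.executemany(
            f"UPDATE {table} SET asn = ? WHERE rowid = ?",
            [(lookup_asn(answer), rowid) for rowid, answer in chunk],
        )
        conn.commit()
        last_rowid = chunk[-1][0]
```

Paging by `rowid` keeps only one chunk in memory and avoids updating the table while a read cursor is still open on it.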

In [None]:
# the program can also be run as a function call
from question_d import Asn
asn = Asn("example.db", 'alexa', year=2020, month=10, chunksize=10000)
# single process (a little slow)
asn.flush_ases()


