# Challenge

Write a Python program (OSX or Linux preferred) that queries the top 500 Alexa domains/sites:
- http://www.alexa.com/topsites
- https://www.dropbox.com/s/pqsimknj77ywqbn/top-1m.csv.tar.gz?dl=0

### CDN analysis

- Filter the domains served by a CDN (e.g., Akamai)
- Rank CDN providers by number of sites
- Calculate the average response time for the index page of each site (i.e., Time To First Byte) per CDN provider, and rank the providers by speed (ideally reporting DNS resolution, TCP connection, SSL negotiation, and receive time separately)

### BGP analysis

- Determine the ASN that each of the 500 sites maps to (based on hosting IP address)
- Rank ASNs by number of sites

In [5]:
from __future__ import division
import numpy as np
import os, sys
import matplotlib
#matplotlib.use('Agg')
%matplotlib nbagg
import matplotlib.pyplot as plt
import pandas as pd
from collections import defaultdict

In [6]:
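def getCDF(data):
    # Empirical CDF: sorted values paired with cumulative fractions in (0, 1].
    xdata = np.sort(data)
    # Start at 1/n so the largest value maps to 1.0 rather than (n-1)/n.
    ydata = np.arange(1, len(xdata) + 1) / len(xdata)
    return xdata, ydata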

# CDN Analysis

## Load Data
Load the top 500 Alexa websites from the CSV file.

In [9]:
sites = pd.read_csv('top-1m.csv', nrows=500, header=None, names=['rank', 'url'])

## Get CDN for website
- Run whois on the IP address the domain resolves to
- Extract a field from the DNS headers
- Check the resolved IP from multiple server locations
- Download all objects on the website's homepage
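
The DNS-based approach above could be sketched by resolving each domain and matching the returned hostnames against known CDN suffixes. This is a minimal sketch under stated assumptions, not the notebook's own method: `CDN_SUFFIXES` is a small illustrative sample rather than a complete table, and `guess_cdn` and `cname_chain` are hypothetical helper names.

```python
import socket

# Illustrative, incomplete sample of CDN hostname suffixes (an assumption;
# a real table would need many more entries and regular upkeep).
CDN_SUFFIXES = {
    "akamaiedge.net": "Akamai",
    "edgekey.net": "Akamai",
    "edgesuite.net": "Akamai",
    "cloudfront.net": "Amazon CloudFront",
    "fastly.net": "Fastly",
    "cdn.cloudflare.net": "Cloudflare",
}

def guess_cdn(hostnames):
    """Return the first CDN whose suffix matches any hostname, else None."""
    for name in hostnames:
        name = name.rstrip(".").lower()
        for suffix, cdn in CDN_SUFFIXES.items():
            # Match whole DNS labels only, so "notfastly.net" is not a hit.
            if name == suffix or name.endswith("." + suffix):
                return cdn
    return None

def cname_chain(domain):
    """Resolve a domain; return its canonical name plus any aliases."""
    try:
        canonical, aliases, _ips = socket.gethostbyname_ex(domain)
    except socket.error:
        return []
    return [canonical] + aliases
```

A call such as `guess_cdn(cname_chain('www.example.com'))` returns a provider name or `None`. A `None` result only means no suffix in the sample table matched, so the other approaches in the list remain useful as cross-checks.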

## Rank number of websites per CDN
- sort dataframe
- bar chart
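
The ranking step could look roughly like the sketch below. It assumes a plain dict mapping each site to a CDN name (or `None` when no CDN was detected); `rank_cdns` is a hypothetical helper and the `example` data is made up purely to show the input shape.

```python
from collections import Counter

def rank_cdns(site_to_cdn):
    """Return (cdn, site_count) pairs, most sites first; skip sites without a CDN."""
    counts = Counter(cdn for cdn in site_to_cdn.values() if cdn is not None)
    # Sort by descending count, then by name, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Made-up mapping used only to illustrate the expected input.
example = {
    "a.example": "Akamai",
    "b.example": "Fastly",
    "c.example": "Akamai",
    "d.example": None,
}
ranking = rank_cdns(example)
```

The resulting pairs can be loaded into a pandas Series and drawn with its `plot(kind='bar')` method to produce the bar chart.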

## Request Index page
- time to first byte
- time to DNS resolution
- time to TCP connection
- time to SSL negotiation
- receive time
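
One way to separate these phases is to timestamp each step of a single request made with raw sockets. The following is a sketch under stated assumptions rather than a definitive implementation: `time_request` and `phase_durations` are hypothetical helpers, only the first address returned by DNS is tried, redirects are not followed, and the `first_byte` phase here measures the wait between the end of the handshake and the first response byte.

```python
import socket
import ssl
from timeit import default_timer as now

def phase_durations(marks):
    """Turn ordered (label, timestamp) marks into per-phase durations in seconds."""
    durations = {}
    for (_, start), (label, end) in zip(marks, marks[1:]):
        durations[label] = end - start
    return durations

def time_request(host, path="/", use_tls=True, timeout=10):
    """Fetch one page over HTTP/1.1 and time each phase of the exchange."""
    port = 443 if use_tls else 80
    marks = [("start", now())]

    # DNS resolution: keep only the first address returned.
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, 0, socket.SOCK_STREAM)[0]
    marks.append(("dns", now()))

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
        marks.append(("tcp", now()))

        if use_tls:
            context = ssl.create_default_context()
            sock = context.wrap_socket(sock, server_hostname=host)
        marks.append(("ssl", now()))  # close to zero when use_tls is False

        request = "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n".format(
            path, host)
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # blocks until the first response byte arrives
        marks.append(("first_byte", now()))

        while sock.recv(65536):  # drain the rest of the response
            pass
        marks.append(("receive", now()))
    finally:
        sock.close()
    return phase_durations(marks)
```

Calling `time_request('www.example.com')` returns a dict keyed by `dns`, `tcp`, `ssl`, `first_byte`, and `receive`. Summing the first four gives the overall time to first byte, and averaging those dicts per CDN provider gives the speed ranking.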