### Test using Google BigQuery to access M-Lab data and get ISP names using cymruwhois

Created by John Burt, for allTBD group.

This notebook is a demo showing how to build a query with a Python port of Kinga's R-based query script and use it to read M-Lab data from Google BigQuery. The notebook then uses cymruwhois to get ISP names from IP addresses.

#### Process followed to get this working:
- Have Anaconda installed
- Install the pandas-gbq package (on the command line: 'conda install pandas-gbq --channel conda-forge')
- Follow the steps in the M-Lab [BigQuery quickstart](https://www.measurementlab.net/data/docs/bq/quickstart/) to create a Google Cloud account and project, enable the BigQuery API, etc.

#### Getting Kinga's script into Python:
- I tried to make the Python version as comparable as possible to the R original, for ease of maintenance

#### Getting cymruwhois to work:
- does NOT work: conda/pip install cymruwhois
- works: manual install from source
  - download the cymruwhois source tar from https://pypi.python.org/pypi/cymruwhois
  - extract
  - open a cmd window, go to the folder containing the extracted install files
  - enter "python setup.py install"

#### Example of querying/reading/writing the data multiple times:
- The last cell shows how to run multiple queries and save them to a single file for retrieval.
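
Before running the cells below, it can help to confirm that the packages from the setup steps are importable. The following is a minimal sketch; `missing_packages` is a hypothetical helper name, and the module names `pandas_gbq` and `cymruwhois` are assumed to be the import names of the packages installed above.

```python
import importlib.util


def missing_packages(names=("pandas", "pandas_gbq", "cymruwhois")):
    """Return the subset of module names that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```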


In [102]:
import pandas as pd
import datetime
import time

# This function takes as input the metric, mlab_location, AS number,
# start_time, end_time and an optional country (the default
# country is 'US').
# Check the MLabServers.csv file to look up possible values for the
# mlab_location and AS variables. The mlab_location should be entered as a
# string (in quotation marks); the AS should be entered as an integer.
# The choices for the metric are "dtp", "rtt", and "prt", for download
# throughput, round trip time and packet retransmission respectively.
# The start_time and end_time should be entered in 'mm/dd/yy' format.
# When successful, the function returns the query as a string; the code that
# would write it to a text file is commented out at the end of the function.

def query_writer(metric, mlab_location, AS, start_time, end_time, country = 'US' ): 
      
    #DEFINING THE BASIC QUERIES FOR EACH METRIC

    #The basic query for download throughput
    dtp_basic_query = ("SELECT "
        "\nweb100_log_entry.log_time AS log_time, "
        "\nconnection_spec.client_geolocation.city  AS client_city, "
        "\nconnection_spec.client_geolocation.area_code As client_area_code, "
        "\nweb100_log_entry.connection_spec.remote_ip AS client_ip, "
        "\nweb100_log_entry.connection_spec.local_ip AS MLab_ip, "
        "\n8 * (web100_log_entry.snap.HCThruOctetsAcked / "
        "\n(web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd + "
        "\nweb100_log_entry.snap.SndLimTimeSnd)) AS download_Mbps "
        "\nFROM "
        "\n[plx.google:m_lab.ndt.all] "
        "\nWHERE "
        "\nIS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd) "
        "\nAND project = 0 "
        "\nAND IS_EXPLICITLY_DEFINED(connection_spec.data_direction) "
        "\nAND connection_spec.data_direction = 1 "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.is_last_entry) "
        "\nAND web100_log_entry.is_last_entry = True "
        "\nAND web100_log_entry.snap.HCThruOctetsAcked >= 8192 "
        "\nAND (web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd + "
        "\nweb100_log_entry.snap.SndLimTimeSnd) >= 9000000 "
        "\nAND (web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd +   "
        "\nweb100_log_entry.snap.SndLimTimeSnd) < 3600000000 "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals) "
        "\nAND web100_log_entry.snap.CongSignals > 0 "
        "\nAND (web100_log_entry.snap.State == 1 "
        "\nOR (web100_log_entry.snap.State >= 5 "
        "\nAND web100_log_entry.snap.State <= 11))")


    #The basic query for finding round trip time 
    rtt_basic_query = ("SELECT "
        "\nweb100_log_entry.log_time AS log_time, "
        "\nconnection_spec.client_geolocation.city  AS client_city, "
        "\nconnection_spec.client_geolocation.area_code As client_area_code, "
        "\nweb100_log_entry.connection_spec.remote_ip AS client_ip, "
        "\nweb100_log_entry.connection_spec.local_ip AS MLab_ip, "
        "\nweb100_log_entry.snap.MinRTT AS min_rtt "
        "\nFROM "
        "\n[plx.google:m_lab.ndt.all] "
        "\nWHERE "
        "\nIS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd) "
        "\nAND project = 0 "
        "\nAND IS_EXPLICITLY_DEFINED(connection_spec.data_direction) "
        "\nAND connection_spec.data_direction = 1 "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.is_last_entry) "
        "\nAND web100_log_entry.is_last_entry = True "
        "\nAND web100_log_entry.snap.HCThruOctetsAcked >= 8192 "
        "\nAND (web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd + "
        "\nweb100_log_entry.snap.SndLimTimeSnd) >= 9000000 "
        "\nAND (web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd +   "
        "\nweb100_log_entry.snap.SndLimTimeSnd) < 3600000000 "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.MinRTT) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CountRTT) "
        "\nAND web100_log_entry.snap.CountRTT > 10 "
        "\nAND (web100_log_entry.snap.State == 1 "
        "\nOR (web100_log_entry.snap.State >= 5 "
        "\nAND web100_log_entry.snap.State <= 11))")


    #The basic query for packet retransmission 
    prt_basic_query = ("SELECT "
        "\nweb100_log_entry.log_time AS log_time, "
        "\nconnection_spec.client_geolocation.city  AS client_city,  "
        "\nconnection_spec.client_geolocation.area_code As client_area_code,  "
        "\nweb100_log_entry.connection_spec.remote_ip AS client_ip, "
        "\nweb100_log_entry.connection_spec.local_ip AS MLab_ip, "
        "\n(web100_log_entry.snap.SegsRetrans / web100_log_entry.snap.DataSegsOut) AS packet_retransmission_rate "
        "\nFROM "
        "\n[plx.google:m_lab.ndt.all] "
        "\nWHERE "
        "\nIS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd) "
        "\nAND project = 0 "
        "\nAND IS_EXPLICITLY_DEFINED(connection_spec.data_direction) "
        "\nAND connection_spec.data_direction = 1 "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.is_last_entry) "
        "\nAND web100_log_entry.is_last_entry = True "
        "\nAND web100_log_entry.snap.HCThruOctetsAcked >= 8192 "
        "\nAND (web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd + "
        "\nweb100_log_entry.snap.SndLimTimeSnd) >= 9000000 "
        "\nAND (web100_log_entry.snap.SndLimTimeRwin + "
        "\nweb100_log_entry.snap.SndLimTimeCwnd +   "
        "\nweb100_log_entry.snap.SndLimTimeSnd) < 3600000000 "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SegsRetrans) "
        "\nAND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.DataSegsOut) "
        "\nAND web100_log_entry.snap.DataSegsOut > 0 "
        "\nAND (web100_log_entry.snap.State == 1 "
        "\nOR (web100_log_entry.snap.State >= 5 "
        "\nAND web100_log_entry.snap.State <= 11))")
    
    #SELECTING THE RIGHT BASIC QUERY
    if metric == "dtp":
        basic = dtp_basic_query
    elif metric == "rtt" :
        basic = rtt_basic_query
    elif metric == "prt":
        basic = prt_basic_query
    else:
        print("The metric entered is invalid!")
        return

    #FINDING MLAB SERVER IPS
    servers = pd.read_csv('MLabServers.csv')
    #print(servers)
    cond = servers[(servers.City==mlab_location) & (servers.AS==AS) ]
    #print(cond)
    if cond.empty:
        print("There are no MLab servers satisfying the conditions entered.")
        return
    else:
        ips = cond.IP
    #print("\n",ips)

    #WRITING THE MLAB SERVER CONDITION
    mlab_serv_var = "web100_log_entry.connection_spec.local_ip"
    mlab_ips_cond = "\nAND ("
    for ip in ips[:-1]:
        mlab_ips_cond += mlab_serv_var + "=='" + ip + "' OR "
    mlab_ips_cond += mlab_serv_var + "=='" + str(ips[-1:].values[0]) + "')"
    #print(mlab_ips_cond)
    
    #CONVERTING DATE TO UNIX TIMESTAMP
    # NOTE: time.mktime interprets the date in the local time zone
    try:
        dt = datetime.datetime.strptime(start_time, "%m/%d/%y")
        start_time_unix = time.mktime(dt.timetuple())
    except (ValueError, TypeError):
        print("The start_time entered is invalid!")
        return
        
    try:
        dt = datetime.datetime.strptime(end_time, "%m/%d/%y")
        end_time_unix = time.mktime(dt.timetuple())
    except (ValueError, TypeError):
        print("The end_time entered is invalid!")
        return
    
    #print(datetime.datetime.fromtimestamp(start_time_unix).strftime('%Y-%m-%d %H:%M:%S'))    
    #print(datetime.datetime.fromtimestamp(end_time_unix).strftime('%Y-%m-%d %H:%M:%S'))    

    #WRITING THE TIME CONDITION
    tstamp_var = "web100_log_entry.log_time"
    tframe_cond = ("\nAND " + tstamp_var + "<=" + "%d"%(end_time_unix) +
        "\nAND " + tstamp_var + ">=" + "%d"%(start_time_unix))
    #print(tframe_cond)

    #WRITING THE COUNTRY CONDITION
    country_string = "'" + country + "'" 
    country_var = "connection_spec.client_geolocation.country_code"
    country_cond = "\nAND " + country_var + "==" + country_string
    #print(country_cond)

    #WRITING THE QUERY
    the_query = basic + country_cond + mlab_ips_cond + tframe_cond
    #with open("querypy.txt", "w") as text_file:
    #    text_file.write(the_query)
    return the_query
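
The server-IP condition that `query_writer` builds with a loop can also be assembled with `str.join`, which avoids special-casing the last address. This is only a sketch of an alternative, not the function's actual code; `build_ip_condition` is a hypothetical helper name, and the example addresses are taken from the MLab_ip values shown in the output further down.

```python
def build_ip_condition(ips, column="web100_log_entry.connection_spec.local_ip"):
    """Return an 'AND (col=='a' OR col=='b' ...)' clause for the given IPs."""
    ips = list(ips)
    if not ips:
        raise ValueError("at least one IP address is required")
    # One equality test per address, joined with OR
    tests = ["%s=='%s'" % (column, ip) for ip in ips]
    return "\nAND (" + " OR ".join(tests) + ")"


print(build_ip_condition(["38.106.70.147", "38.106.70.160"]))
```

Passing `cond.IP` from the function would produce the same clause text as the loop, in the same order.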

In [105]:
# test the query_writer output:
#the_query = query_writer("rtt", "New York", 174, "06/15/14", "05/13/15")
#print(the_query)
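
One thing to be aware of in `query_writer`: `time.mktime` interprets the date in the local time zone of the machine running the notebook, so the same 'mm/dd/yy' input gives different timestamps on different machines. If `log_time` is in seconds since the epoch in UTC (an assumption worth confirming against the M-Lab schema documentation), a UTC-based conversion would look roughly like this. `to_unix_utc` is a hypothetical helper name.

```python
import calendar
import datetime


def to_unix_utc(date_string, fmt="%m/%d/%y"):
    """Return the Unix timestamp for midnight UTC on the given date."""
    dt = datetime.datetime.strptime(date_string, fmt)
    # calendar.timegm treats the time tuple as UTC (time.mktime uses local time)
    return calendar.timegm(dt.timetuple())


print(to_unix_utc("06/15/14"))  # 1402790400
```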

In [113]:
from pandas.io import gbq

# this is my project ID, you will probably use a different one
project_id = 'mlab-194421'

# generate the query
querystring = query_writer("rtt", "New York", 174, "06/15/14", "05/13/15")

# read the query output into a pandas dataframe
#   NOTE: the first time this runs, you will be prompted for an authorization key. 
#    Click on the link provided, get the key string, paste it in, and go.
test_df = gbq.read_gbq(querystring, project_id=project_id)
    
# show contents    
test_df.head()

Requesting query... ok.
Job ID: 045db3de-271f-4ea9-a98c-9e0c07ee6d1e
Query running...
Query done.
Processed: 0.0 B Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
Got 195912 rows.

Total time taken 28.82 s.
Finished at 2018-02-21 20:31:20.


Unnamed: 0,log_time,client_city,client_area_code,client_ip,MLab_ip,min_rtt
0,1407508160,Bronx,718,173.68.141.95,38.106.70.147,12
1,1407977963,,0,107.19.172.10,38.106.70.173,13
2,1408004954,Streator,815,173.25.10.122,38.106.70.160,39
3,1408058463,Flushing,718,98.14.201.111,38.106.70.147,10
4,1408025714,Houston,713,74.124.33.150,38.106.70.147,39
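
The `log_time` values above are Unix timestamps. As a small sketch, assuming they are in seconds and in UTC, the first row can be converted to a readable date like this (`log_time_to_utc` is a hypothetical helper name):

```python
import datetime


def log_time_to_utc(log_time):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(log_time, tz=datetime.timezone.utc)


first = log_time_to_utc(1407508160)  # log_time of row 0 above
print(first.isoformat())  # 2014-08-08T14:29:20+00:00
```

On the whole dataframe, `pd.to_datetime(test_df.log_time, unit='s', utc=True)` does the same conversion column-wise.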


### Print out some names from the IP columns

In [114]:
from cymruwhois import Client

c = Client()

print('Some names from the client_ip column')
for r in c.lookupmany(list(test_df.client_ip[0:20])):
    print('\t',r.owner)

print('\n\nSome names from the MLab_ip column')
for r in c.lookupmany(list(test_df.MLab_ip[0:20])):
    print('\t',r.owner)


Some names from the client_ip column
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
	 WAYPORT - Wayport, Inc., US
	 MEDIACOM-ENTERPRISE-BUSINESS - Mediacom Communications Corp, US
	 SCRR-12271 - Time Warner Cable Internet LLC, US
	 PSLIGHTWAVE - PS Lightwave, US
	 COMCAST-7922 - Comcast Cable Communications, LLC, US
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
	 CABLE-NET-1 - Cablevision Systems Corp., US
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
	 SCRR-12271 - Time Warner Cable Internet LLC, US
	 FRII - Front Range Internet Inc., US
	 COMCAST-7922 - Comcast Cable Communications, LLC, US
	 SCRR-11427 - Time Warner Cable Internet LLC, US
	 NITEL - NETWORK INNOVATIONS, INC., US
	 ASN-CXA-ALL-CCI-22773-RDC - Cox Communications Inc., US
	 COMCAST-7922 - Comcast Cable Communications, LLC, US
	 CHARTER-NET-HKY-NC - Charter Communications, US
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
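
The same client IP can appear in many rows, so looking up every row repeats work. A possible way to cut down on lookups is to query each distinct address once and map the results back onto the rows. The sketch below takes the lookup function as a parameter so it can be tried without network access; `owners_for` and `fake_lookup_many` are hypothetical names, and the sketch assumes the lookup returns one record per input address, in input order, each with an `owner` attribute (as the loops above rely on).

```python
from types import SimpleNamespace


def owners_for(ips, lookup_many):
    """Return one owner name per input IP, looking up each distinct IP once."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique_ips = list(dict.fromkeys(ips))
    records = list(lookup_many(unique_ips))
    owner_by_ip = {ip: rec.owner for ip, rec in zip(unique_ips, records)}
    return [owner_by_ip[ip] for ip in ips]


# Stand-in for a real whois client, for illustration only
def fake_lookup_many(ips):
    return [SimpleNamespace(owner="owner-of-" + ip) for ip in ips]


names = owners_for(["10.0.0.1", "10.0.0.2", "10.0.0.1"], fake_lookup_many)
print(names)
```

With the client created above, the call would be along the lines of `owners_for(list(test_df.client_ip), c.lookupmany)`.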
	 

### Example of query/write cycle for large data reads

Note: pandas supports a number of file formats, most of them through matching `read_*` functions and `to_*` DataFrame methods:

CSV, JSON, HTML, Local clipboard, MS Excel, HDF5 Format, Feather Format, Parquet Format, Msgpack, Stata, SAS, Python Pickle Format, SQL, Google Big Query

For storing large datasets locally, I'd suggest Feather, which is designed for fast reads and writes and can be used from both Python and R.
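
The cell below splits a long period into month-long date windows typed out by hand. As a sketch, the same windows could be generated instead; `month_windows` is a hypothetical helper name, and it assumes the starting day of the month is 28 or lower so that the same day exists in every month.

```python
import datetime


def month_windows(first_start, count, fmt="%m/%d/%y"):
    """Return `count` consecutive (start, end) date strings, one month each."""
    start = datetime.datetime.strptime(first_start, fmt).date()
    windows = []
    for _ in range(count):
        # Same day of the following month, rolling over the year after December
        year = start.year + (start.month // 12)
        month = start.month % 12 + 1
        next_start = start.replace(year=year, month=month)
        # Each window ends the day before the next one starts
        end = next_start - datetime.timedelta(days=1)
        windows.append((start.strftime(fmt), end.strftime(fmt)))
        start = next_start
    return windows


for startdate, enddate in month_windows("06/15/14", 4):
    print(startdate, "-", enddate)
```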


In [112]:
from pandas.io import gbq
from cymruwhois import Client

# read and store several months' worth of data in a CSV file
startdates = ["06/15/14","07/15/14","08/15/14","09/15/14"]
enddates = ["07/14/14","08/14/14","09/14/14","10/14/14"]

# this is my project ID, you will probably use a different one
project_id = 'mlab-194421'

# output csv file
outputfile = "mlab query.csv"

# cymruwhois client for ISP names
c = Client()

# flag to indicate the first write
first_write = True

# iterate through each date set
for startdate, enddate in zip(startdates, enddates):
    print("\nreading data from %s - %s"%(startdate, enddate))

    # generate the query
    querystring = query_writer("rtt", "New York", 174, startdate, enddate)

    # read the query output into a pandas dataframe
    df = gbq.read_gbq(querystring, project_id=project_id, verbose=False)
    
# # NOTE: this takes a REALLY long time to complete!
#     print("getting ISP names...")
#     ISPname = []
#     for r in c.lookupmany(list(df.client_ip)):
#         ISPname.append(r.owner)
#     df["ISPname"] = ISPname
#     print("     DONE getting ISP names")
    
    print("writing dataframe to file...")
    # on the first write, create (or overwrite) the file and include the header
    if first_write:
        with open(outputfile, 'w') as f:
            df.to_csv(f, header=True)
        first_write = False

    # subsequent writes append and don't include the header
    else:
        with open(outputfile, 'a') as f:
            df.to_csv(f, header=False)
    print("     DONE writing dataframe to file.")

# now, read the entire data file that was created
df_in = pd.read_csv(outputfile)

print("\ncombined dataframe size: ",df_in.shape)
df_in.head()


reading data from 06/15/14 - 07/14/14
writing dataframe to file...
     DONE writing dataframe to file.

reading data from 07/15/14 - 08/14/14
writing dataframe to file...
     DONE writing dataframe to file.

reading data from 08/15/14 - 09/14/14
writing dataframe to file...
     DONE writing dataframe to file.

reading data from 09/15/14 - 10/14/14
writing dataframe to file...
     DONE writing dataframe to file.
dataframe size:  (81273, 7)


Unnamed: 0.1,Unnamed: 0,log_time,client_city,client_area_code,client_ip,MLab_ip,min_rtt
0,0,1404482078,,0,68.207.150.89,38.106.70.147,73
1,1,1404432017,Owego,607,71.188.233.57,38.106.70.160,62
2,2,1404496904,Lansdale,215,76.98.0.131,38.106.70.160,16
3,3,1404436662,Lindenhurst,631,174.44.220.233,38.106.70.160,13
4,4,1404446445,Hillsborough,908,98.221.88.221,38.106.70.147,221
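
The extra "Unnamed: 0" column in the table above appears because `to_csv` writes the dataframe's row index by default, and `read_csv` then reads it back as an unnamed column. A minimal sketch of an append helper that leaves the index out follows; `append_chunk` is a hypothetical name and the small dataframes are made-up stand-ins for query results.

```python
import os
import tempfile

import pandas as pd


def append_chunk(df, path):
    """Append df to a CSV file, writing the header only for a new file."""
    write_header = not os.path.exists(path)
    # index=False keeps the row index out of the file
    df.to_csv(path, mode="a", header=write_header, index=False)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlab_query.csv")
    append_chunk(pd.DataFrame({"log_time": [1, 2], "min_rtt": [12, 13]}), path)
    append_chunk(pd.DataFrame({"log_time": [3, 4], "min_rtt": [39, 10]}), path)
    combined = pd.read_csv(path)

print(combined.shape)
print(list(combined.columns))
```

Unlike the cell above, which overwrites the file on the first write, this helper always appends, so an output file left over from an earlier run would need to be deleted first.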
