### Test using Google BigQuery to access M-Lab data and get ISP names using cymruwhois

Created by John Burt, for allTBD group.

This notebook is a demo showing how to read a query file generated by Kinga's R-based query script and use it to pull M-Lab data from BigQuery. The notebook then uses cymruwhois to get ISP names from the IP addresses in the result.

#### Process followed to get this working:
- Have Anaconda installed
- Install pandas-gbq package (in command line: 'conda install pandas-gbq --channel conda-forge')
- Follow the steps in the M-Lab [BigQuery quickstart](https://www.measurementlab.net/data/docs/bq/quickstart/) to create a Google Cloud account and project, enable the BigQuery API, etc.

#### Getting Kinga's script to work:
- Install R
- Install R kernel for Jupyter notebook: https://irkernel.github.io/installation/
- Run Kinga's query generator script (an example query is included); it generates "query.txt"
- Run this notebook

#### Getting cymruwhois to work:
- Does NOT work: conda/pip install cymruwhois
- Works: manual install from source
  - download the cymruwhois source tarball from https://pypi.python.org/pypi/cymruwhois
  - extract
  - open cmd window, go to folder containing extracted install files
  - enter "python setup.py install"
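
Once the installs above are done, a quick check can confirm that Python sees the packages before running the rest of the notebook. This is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for illustration, and the module names assume that pandas-gbq imports as `pandas_gbq`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found on this Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules this notebook relies on (pandas-gbq is imported as pandas_gbq).
missing = missing_modules(['pandas', 'pandas_gbq', 'cymruwhois'])
print('missing modules:', missing if missing else 'none')
```

If anything is listed as missing, repeat the matching install step in the same environment that runs the notebook kernel.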



In [3]:
import pandas as pd
# pandas-gbq is the package installed above; newer pandas versions no longer ship pandas.io.gbq
import pandas_gbq as gbq

# this is my project ID, you will probably use a different one
project_id = 'mlab-194421'

# query text file generated by Kinga's query writer script
queryfile = 'query.txt'

# read the query text file
with open(queryfile, 'r') as myfile:
    querystring = myfile.read()

# read the query output into a pandas dataframe
#   NOTE: the first time this runs, you will be prompted for an authorization key. 
#    Click on the link provided, get the key string, paste it in, and go.
test_df = gbq.read_gbq(querystring, project_id=project_id)
    
# show contents    
test_df.head()

Requesting query... ok.
Job ID: 2ee8b6fd-ec40-4c5f-bc05-4196e593039b
Query running...
Query done.
Processed: 0.0 B Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
Got 195892 rows.

Total time taken 25.66 s.
Finished at 2018-02-15 17:41:48.


| | log_time | remote_ip | local_ip | min_rtt |
|---|---|---|---|---|
| 0 | 1407854173 | 67.231.254.19 | 38.106.70.173 | 27 |
| 1 | 1407508160 | 173.68.141.95 | 38.106.70.147 | 12 |
| 2 | 1403889089 | 207.200.215.121 | 38.106.70.147 | 213 |
| 3 | 1403842123 | 68.227.188.74 | 38.106.70.160 | 12 |
| 4 | 1403904453 | 67.182.181.146 | 38.106.70.147 | 84 |
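
The log_time values above look like Unix epoch seconds. That is an assumption, since the query output does not state the unit, but under it the first row falls in August 2014. A minimal sketch of the conversion, using a hypothetical helper `epoch_to_utc`:

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# First log_time value from the output above (assumed to be epoch seconds).
print(epoch_to_utc(1407854173).isoformat())

# pandas equivalent for the whole column, assuming test_df from the cell above:
# test_df['log_time_utc'] = pd.to_datetime(test_df.log_time, unit='s', utc=True)
```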


### Print out some names from the IP columns
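
The same IP address can appear in many rows (38.106.70.147 shows up three times in the first five local_ip values), so it may be worth dropping repeats before sending addresses to the whois service and then tallying the owners. The sketch below is one way to do that, not part of the original notebook: `unique_in_order`, `count_owners`, `FakeRecord`, and `fake_lookup` are hypothetical names, and the only thing assumed about the lookup is what the next cell already relies on, namely that it takes a list of IPs and yields records with an `.owner` attribute.

```python
from collections import Counter, namedtuple


def unique_in_order(ips):
    """Drop repeated IPs while keeping first-seen order."""
    return list(dict.fromkeys(ips))


def count_owners(ips, lookup_many):
    """Tally owner strings over the distinct IPs.

    lookup_many: any callable that takes a list of IPs and yields objects
    with an .owner attribute, for example Client().lookupmany.
    """
    return Counter(rec.owner for rec in lookup_many(unique_in_order(ips)))


# Offline stand-in for the whois client, with made-up owner names,
# so the sketch runs without network access.
FakeRecord = namedtuple('FakeRecord', 'owner')


def fake_lookup(ips):
    return (FakeRecord('OWNER-FOR-' + ip) for ip in ips)


sample = ['38.106.70.173', '38.106.70.147', '38.106.70.147', '38.106.70.160']
print(count_owners(sample, fake_lookup))
```

With the real client, the call would look like `count_owners(list(test_df.local_ip), Client().lookupmany)`. Note that this counts distinct addresses per owner, not rows per owner.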

In [4]:
from cymruwhois import Client

c = Client()

print('Some names from the remote_ip column')
for r in c.lookupmany(list(test_df.remote_ip[0:20])):
    print('\t',r.owner)

print('\n\nSome names from the local_ip column')
for r in c.lookupmany(list(test_df.local_ip[0:20])):
    print('\t',r.owner)


Some names from the remote_ip column
	 TURNKEY-INTERNET - Turnkey Internet Inc., US
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
	 NITEL - NETWORK INNOVATIONS, INC., US
	 ASN-CXA-ALL-CCI-22773-RDC - Cox Communications Inc., US
	 COMCAST-7922 - Comcast Cable Communications, LLC, US
	 CHARTER-NET-HKY-NC - Charter Communications, US
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
	 UUNET - MCI Communications Services, Inc. d/b/a Verizon Business, US
	 ALLO-COMM - Allo Communications LLC, US
	 COMCAST-7922 - Comcast Cable Communications, LLC, US
	 GOEASTON - Easton Utilities Commission, US
	 ASN-CXA-ALL-CCI-22773-RDC - Cox Communications Inc., US
	 CABLE-NET-1 - Cablevision Systems Corp., US
	 WAYPORT - Wayport, Inc., US
	 MEDIACOM-ENTERPRISE-BUSINESS - Mediacom Communications Corp, US
	 SCRR-12271 - Time Warner Cable Internet LLC, US
	 PSLIGHTWAVE - PS Lightwave, US
	 COMCAST-7922 - Comcast Cable Communications, LLC, US
	 UUNET - MCI Commun