<img src="http://dkrib.web.elte.hu/compare/COMPARE_LOGO.jpg" width=400>

<div class="jumbotron">

<center> <h2> Demo notebook  </h2> </center>

<center> <h1> Query all the geographic locations of EBI Influenza A samples and draw them on a map  </h1> </center>

</div>

---
#About the Jupyter-notebook


####The programming languages
- This notebook is written in Python, but you can use the same Jupyter framework with many other languages (R, Ruby, Julia, Haskell, etc.). See the Jupyter project webpage for more information about supported programming languages.

<center>
<a type="button" class="btn btn-lg btn-warning " href="https://jupyter.org/" >Link to Jupyter project</a>
</center>



####This is a markdown cell
- You can write easy markdown headers and notes like this
- Or you can write HTML like the jumbotron above. In the notebook, you can use the Bootstrap framework to get nice buttons and similar elements, like the one below.

<center>
<a type="button" class="btn btn-lg btn-success " href="http://getbootstrap.com/" >Link to Bootstrap</a>
</center>

- You can also write equations, which will be rendered by [MathJax](http://www.mathjax.org/)

$$E = mc^2$$

---
#The purpose of this small demo is to explore Influenza A geolocations in the [ENA](http://www.ebi.ac.uk/ena)

###First, we should figure out how many flu samples there are

- ENA has an advanced search option that lets us discover data with filtering. It has a graphical interface, but it also supports programmatic access through URL-based queries.
- [Advanced search graphical interface](http://www.ebi.ac.uk/ena/data/warehouse/search)
- [Advanced search tutorial ](http://www.ebi.ac.uk/ena/support/advanced-search-tutorial)

I will build the URL from its logical blocks:
- The block below is called a code cell. If you press Shift+Enter, or click the triangle in the menu bar, the code will be executed.

In [1]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?' #base for advanced search
url_query='\"tax_tree(11320)\"' #influenza A taxon and all subordinates (tree)
url_result='&result=sample' # looking for samples, they have location
url_count='&resultcount' # count the results

url=url_base+url_query+url_result+url_count #concatenate

print 'The url is:',url #print

The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?"tax_tree(11320)"&result=sample&resultcount


Query the url, read the result back as a string
- You can also click on the URL, and the results will be shown in the browser

In [2]:
import urllib #Python module for working with URLs
print urllib.urlopen(url).read()

Number of results: 1,108,550
Time taken: 0 seconds
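
The count comes back as plain text, so it has to be parsed before it can be reused, for example as the `length` parameter of a later query. The following is a minimal sketch written for Python 3; the function name `parse_result_count` is hypothetical, and it assumes the response always contains a `Number of results:` line formatted like the one shown above.

```python
import re

def parse_result_count(response_text):
    """Extract the integer from a 'Number of results: 1,108,550' line."""
    match = re.search(r'Number of results:\s*([\d,]+)', response_text)
    if match is None:
        raise ValueError('no result count found in response')
    # Drop the thousands separators before converting to int.
    return int(match.group(1).replace(',', ''))

# The response text shown above, copied verbatim.
sample_response = 'Number of results: 1,108,550\nTime taken: 0 seconds\n'
count = parse_result_count(sample_response)
print(count)
```

Raising an error on an unexpected response is a design choice of this sketch; it avoids silently querying with a wrong length.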


###Now I will download all the geolocation information associated with the samples

Build url again

In [4]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?'
url_query='\"tax_tree(11320)\"'
url_result='&result=sample'
url_display='&display=report' #report is the tab separated output
url_fields='&fields=accession,location' #get accession and location
url_limits='&offset=1&length=1095067' #get all the results

url=url_base+url_query+url_result+url_display+url_fields+url_limits
print url

http://www.ebi.ac.uk/ena/data/warehouse/search?"tax_tree(11320)"&result=sample&display=report&fields=accession,location&offset=1&length=1095067
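
The count URL and the report URL are built from the same logical blocks, so the concatenation could be wrapped in one helper. This is only a sketch written for Python 3: the function `build_ena_search_url` and its parameter names are hypothetical, and it does nothing more than reproduce the two URL strings printed in this notebook.

```python
def build_ena_search_url(taxon_id, result='sample', count_only=False,
                         fields=None, offset=None, length=None):
    """Assemble an ENA warehouse search URL from its logical parts."""
    base = 'http://www.ebi.ac.uk/ena/data/warehouse/search?'
    # Taxon query (the taxon and all its subordinates) and the result type.
    parts = ['"tax_tree({})"'.format(taxon_id), 'result=' + result]
    if count_only:
        parts.append('resultcount')
    else:
        parts.append('display=report')  # tab-separated output
        if fields:
            parts.append('fields=' + ','.join(fields))
        if offset is not None and length is not None:
            parts.append('offset={}&length={}'.format(offset, length))
    return base + '&'.join(parts)

count_url = build_ena_search_url(11320, count_only=True)
report_url = build_ena_search_url(11320, fields=['accession', 'location'],
                                  offset=1, length=1095067)
print(count_url)
print(report_url)
```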


The result is a tab-separated table
- I will download the table into a string

In [4]:
ena_flu_loco_page = urllib.urlopen(url).read()

Load the table into a pandas DataFrame
- [Pandas](http://pandas.pydata.org/) is a very useful library for data analysis in Python
- The DataFrame object is similar to R dataframes

In [5]:
import pandas as pd #pandas
from StringIO import StringIO #for reading string into pandas
ena_flu_loco_table = pd.read_csv(StringIO(ena_flu_loco_page),sep='\t')

Peek into the table
- Unfortunately most of the values are missing (NaNs)

In [6]:
ena_flu_loco_table.head()

Unnamed: 0,accession,location
0,33124,
1,9544,
2,9545,
3,SAMD00000344,
4,SAMD00000345,


###See how many samples have geolocation data

In [7]:
print "The number of samples with geolocations is: ",
print len(ena_flu_loco_table[ pd.isnull(ena_flu_loco_table['location']) == False ])

The number of samples with geolocations is:  125477


Get rid of samples with no geolocation

In [8]:
ena_flu_loco_table=ena_flu_loco_table[ pd.isnull(ena_flu_loco_table['location']) == False ]
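
To make the shape of the report and the filtering step concrete, here is a small standard-library sketch written for Python 3. It does not use pandas, and the three-row `report_text` is illustrative data assembled from accessions seen in this notebook, not a real download; the header line is assumed to match the `accession` and `location` columns shown above.

```python
import csv
import io

# Illustrative report text: a header plus three rows, one with a location.
report_text = (
    'accession\tlocation\n'
    'SAMD00000344\t\n'
    'SAMD00000345\t\n'
    'SAMD00003553\t44.337 N 143.38083 E\n'
)

# Read the tab-separated text into a list of dictionaries.
reader = csv.DictReader(io.StringIO(report_text), delimiter='\t')
rows = list(reader)

# Keep only rows whose location field is non-empty.
located = [row for row in rows if row['location']]

print(len(rows), len(located))
```

An empty field shows up here as an empty string, whereas pandas reports it as NaN; the filtering idea is the same.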

###Some locations are malformed

In [9]:
err= ena_flu_loco_table[ [ len(x.split(' '))!=4 for x in ena_flu_loco_table['location'] ]]
err.head()

Unnamed: 0,accession,location
196956,SAMEA2384944,18.49041 E
196980,SAMEA2384968,18.49041 E
202130,SAMEA2392363,23.1 E
234588,SAMEA2547925,28.23 E
234589,SAMEA2547926,28.23 E


Delete these

In [10]:
ena_flu_loco_table = ena_flu_loco_table[ [ 
        len(x.split(' '))==4 for x in ena_flu_loco_table['location'] ]]

###Parse the latitudes and longitudes
- The data uses hemisphere letters (N, S, E, W) rather than signed values, which is not the format the map needs, so I need to convert it: southern latitudes and western longitudes become negative.

In [11]:
def parse_lat(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[1] =='N'):
        return float(loc_list[0])
    elif (loc_list[1] =='S'):
        return -float(loc_list[0])
    
def parse_lon(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[3] =='E'):
        return float(loc_list[2])
    elif (loc_list[3] =='W'):
        return -float(loc_list[2])
    
ena_flu_loco_table['lat']=map(parse_lat,ena_flu_loco_table['location'])
ena_flu_loco_table['lon']=map(parse_lon,ena_flu_loco_table['location'])

ena_flu_loco_table=ena_flu_loco_table[['lat','lon','accession']]
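
The validity check and the sign conversion could also be combined in a single function. The sketch below is written for Python 3 and is an alternative, not what the cells above do: `parse_location` and `LOCATION_RE` are hypothetical names, and the regular expression assumes that every well-formed location looks like `44.337 N 143.38083 E`.

```python
import re

# Matches strings such as '44.337 N 143.38083 E'.
LOCATION_RE = re.compile(
    r'^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$')

def parse_location(text):
    """Return (lat, lon) as signed floats, or None if text is malformed."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        return None
    lat_value, lat_hemi, lon_value, lon_hemi = match.groups()
    # South and West become negative, North and East stay positive.
    lat = float(lat_value) if lat_hemi == 'N' else -float(lat_value)
    lon = float(lon_value) if lon_hemi == 'E' else -float(lon_value)
    return lat, lon

print(parse_location('44.337 N 143.38083 E'))
print(parse_location('18.49041 E'))
```

Returning `None` for a malformed string such as `18.49041 E` makes it possible to drop bad rows and convert good ones in one pass.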

In [12]:
ena_flu_loco_table.head()

Unnamed: 0,lat,lon,accession
2856,44.337,143.38083,SAMD00003553
2857,44.337,143.38083,SAMD00003554
2858,44.337,143.38083,SAMD00003555
2859,44.337,143.38083,SAMD00003556
2860,44.337,143.38083,SAMD00003557


###See how many unique locations we have

In [13]:
print 'Number of unique locations:',
print len(ena_flu_loco_table.groupby(['lat','lon']).size().reset_index())

Number of unique locations: 16142


###Generate a popup string for each unique location
- This will be shown on the map, when you click on the point with the mouse

Contents:
- Number of cases
- list of accession numbers, truncated if too long

I am using the SQL-like groupby operation to group the samples

In [14]:
#the function used for grouping
def form_acc(x):
    if (x['accession'].size < 5):
        return pd.Series(
            dict({'count' : x['accession'].size, 'acc_list' : ' '.join(x['accession']),
                }))
    else:
        return pd.Series(
            dict({'count' : x['accession'].size, 'acc_list' : ' '.join(list(
                        x['accession'])[:2]) + ' ... ' + ' '.join(list(
                        x['accession'])[-2:])}))

#group-by
uniq_locs_w_acc=ena_flu_loco_table.groupby(['lat','lon']).apply(form_acc).reset_index()
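
The truncation rule inside `form_acc` is easier to see in isolation. This is a plain-Python sketch written for Python 3, with a hypothetical helper name `summarize_accessions`; the defaults mirror the rule above, where fewer than five accessions are listed in full and longer lists keep only the first two and the last two.

```python
def summarize_accessions(accessions, max_full=4, keep=2):
    """Join accessions with spaces, eliding the middle of long lists."""
    accessions = list(accessions)
    if len(accessions) <= max_full:
        return ' '.join(accessions)
    head = ' '.join(accessions[:keep])
    tail = ' '.join(accessions[-keep:])
    return head + ' ... ' + tail

short_list = ['SAMD00003553', 'SAMD00003554', 'SAMD00003555']
long_list = ['SAMD00003553', 'SAMD00003554', 'SAMD00003555',
             'SAMD00003556', 'SAMD00003557']

print(summarize_accessions(short_list))
print(summarize_accessions(long_list))
```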

#Plot the points on a map

I will use the [Folium](http://folium.readthedocs.org/en/latest/) library, which is a Python wrapper for the [Leaflet](http://leafletjs.com/) JavaScript library for map-based visualizations
- (The magic is in the HTML output of the cell; if you are interested, you can read the HTML source of the notebook output cell.)



First define the map drawing function

In [15]:
from IPython.core.display import HTML
import folium

def inline_map(m, width=650, height=500):
    """Take a folium map instance and embed its HTML in an iframe."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed
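
The embedding trick is independent of Folium: any HTML document can be placed in the `srcdoc` attribute of an iframe once it has been escaped. Below is a sketch written for Python 3 using only the standard library. The name `iframe_from_html` is hypothetical, and unlike `inline_map`, which replaces only double quotes, this version escapes `&`, `<` and `>` as well; that is an assumption about what is safest, not what the notebook does.

```python
import html

def iframe_from_html(page_html, width=650, height=500):
    """Wrap an HTML document in an iframe via the srcdoc attribute."""
    # html.escape with quote=True escapes &, <, > and both quote characters,
    # so the document survives being placed inside a quoted attribute.
    srcdoc = html.escape(page_html, quote=True)
    return ('<iframe srcdoc="{}" '
            'style="width: {}px; height: {}px; '
            'border: none"></iframe>'.format(srcdoc, width, height))

# A tiny stand-in for the HTML that a map library would generate.
page = '<p class="note">Tom &amp; Jerry</p>'
iframe = iframe_from_html(page)
print(iframe)
```

Escaping `&` too means that entities already present in the page, such as `&amp;`, come back unchanged when the attribute value is decoded.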

Initialize the map object

In [16]:
width, height = 650, 500
flu_map = folium.Map(location=[47, -17], zoom_start=3,
                    tiles='OpenStreetMap', width=width, height=height)

Add points to the map object

- Let's make the area of each point proportional to the number of cases
    - This is misleading, because in some regions all nearby cases share the same position (Europe), while in others the positions are more scattered (Shanghai)

In [17]:
for i in xrange(len(uniq_locs_w_acc)):
    loc=(uniq_locs_w_acc.iloc[i]['lat'],uniq_locs_w_acc.iloc[i]['lon'] )
    name='Number of cases: '+str(uniq_locs_w_acc.iloc[i]['count'])
    name+='   Accessions: '+uniq_locs_w_acc.iloc[i]['acc_list']
    size=uniq_locs_w_acc.iloc[i]['count'] ** 0.5 
    
    flu_map.circle_marker(location=loc, radius=1e3*size,
                          line_color='none',fill_color='#3186cc',
                          fill_opacity=0.7, popup=name)
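
The square root in the loop is what makes the marker area, rather than the radius, proportional to the number of cases: area grows with the square of the radius, so a radius proportional to the square root of the count gives an area proportional to the count. A short Python 3 sketch of that arithmetic follows; the function names are hypothetical and the default `scale` of 1e3 is the factor used in the loop.

```python
import math

def marker_radius(count, scale=1e3):
    """Radius whose circle area grows linearly with count."""
    return scale * math.sqrt(count)

def marker_area(count, scale=1e3):
    """Area of the circle drawn for a location with this many cases."""
    radius = marker_radius(count, scale)
    return math.pi * radius ** 2

for count in (1, 4, 100):
    print(count, marker_radius(count))
```

With this rule, a location with 100 cases gets a radius 10 times that of a single case, and an area 100 times as large.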

And finally draw the map

In [18]:
inline_map(flu_map)

---
#Some notes about this notebook:

Memory footprint: 
- python: 270 MiB
- chrome: 400 MiB


Map:
- Rendering is slow on an Intel Bay Trail processor
- Folium has limited customization


Some small details:
- Bad geolocations can be seen at 0 longitude, which is not surprising
- Grid in France (due to truncation of decimal values?)
- Line of the Danube can clearly be seen 