### Visualisation with Bokeh
This notebook contains a short piece of visualisation code that displays the text data processed in the other notebooks.

In [1]:
print('------------------------------------------------------')
print('Step 5:  Creating visualisation')
from datetime import datetime as dt
print(dt.now())
print('------------------------------------------------------')

------------------------------------------------------
Step 5:  Creating visualisation
2018-02-16 18:00:25.609348
------------------------------------------------------


In [2]:
import pandas as pd
import numpy as np
import pickle

# %matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


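# explicit imports instead of `import *`, so it is clear where each name comes from
from bokeh.plotting import figure, output_file, show
from bokeh.models import (BoxSelectTool, ColumnDataSource, HoverTool,
                          OpenURL, TapTool)
from bokeh.io import output_notebook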

In [3]:
from config import Config as c
tfidf_filename = c.tfidf_filename
bar_filename = c.bar_filename
plot_title = c.title
doi_datapath = c.dois_pkl
working_data = c.working_data
set_name = c.set_name
# test_set = c.test_set
n_clusters = c.n_clusters
journal_of_interest = c.joi
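
The `config` module itself is not shown in this notebook. A minimal sketch of what it might contain, assuming a plain class with one attribute per setting read above. Only the attribute names come from this notebook; every value is a hypothetical placeholder, and `n_clusters = 50` is inferred from the bar-plot titles further down.

```python
# config.py: hypothetical sketch; only the attribute names are taken from the notebook
class Config:
    set_name = "neuroscience"                   # placeholder label used in plot titles
    title = "Articles by textual similarity"    # placeholder plot title
    n_clusters = 50                             # assumed from the bar-plot titles
    joi = "BRAIN"                               # journal of interest; placeholder choice
    working_data = "data/working_data.csv"      # placeholder paths
    dois_pkl = "data/dois.pkl"
    tfidf_filename = "outputs/tfidf_plot.html"
    bar_filename = "outputs/bar_plot.html"


# The notebook reads the settings as plain class attributes:
c = Config
print(c.set_name, c.n_clusters, c.joi)
```

Keeping the settings in one module means the same notebook can be re-run on a different dataset by editing a single file.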

In [4]:
# global variable
# open the pickle in a `with` block so the file handle is closed after loading
with open(doi_datapath, 'rb') as fh:
    dois = pickle.load(fh)

In [5]:
# read data
data = pd.read_csv(c.working_data,index_col = 0)
# data["Cluster_no"] = data["Cluster_no"].astype('category')
data.sample(5)

Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,...,highly_cited_1,highly_cited_10,highly_cited_5,recent_citations,relative_citation_ratio,times_cited,Citations,Cluster,Cluster_no,Article_kws
4564,10.1016/j.neuroimage.2012.06.024,2012,Activation and connectivity patterns of the pr...,"de Manzano, O; Ullen, F","de Manzano, Orjan; Ullen, Fredrik",NEUROIMAGE,"Neurosciences & Neurology; Radiology, Nuclear ...",1053-8119,1095-9572,13,...,False,False,False,12.0,0.9,20.0,20.0,"['connectivity', 'network', 'effects']",1,"['rhythms', 'free', 'premotor']"
2331,10.1523/JNEUROSCI.3942-12.2013,2013,Runx1 Controls Terminal Morphology and Mechano...,"Lou, S; Duan, B; Vong, L; Lowell, BB; Ma, QF","Lou, Shan; Duan, Bo; Linh Vong; Lowell, Bradfo...",JOURNAL OF NEUROSCIENCE,Neurosciences & Neurology,0270-6474,,52,...,False,True,True,23.0,3.89,67.0,67.0,"['dynamics', 'retinal', 'sleep']",4,"['runx1', 'vglut3', 'morphology']"
5847,10.1016/j.neuron.2012.04.021,2012,Odor Representations in Olfactory Cortex: Dist...,"Miura, K; Mainen, ZF; Uchida, N","Miura, Keiji; Mainen, Zachary F.; Uchida, Naos...",NEURON,Neurosciences & Neurology,0896-6273,,58,...,False,True,True,27.0,2.6,71.0,71.0,"['mice', 'induced', 'olfactory']",9,"['odor', 'distributed', 'rate']"
5427,10.1212/WNL.0b013e318248e4ff,2012,American Academy of Neurology policy on pharma...,"Hutchins, JC; Rydell, CM; Griggs, RC; Sagsveen...","Hutchins, J. C.; Rydell, C. M.; Griggs, R. C.;...",NEUROLOGY,Neurosciences & Neurology,0028-3878,,4,...,False,False,False,2.0,0.15,6.0,6.0,"['mutation', 'models', 'new']",10,"['policy', 'academy', 'american']"
3796,10.1016/j.neuroimage.2014.03.042,2014,Characterizing individual differences in funct...,"Smith, DV; Utevsky, AV; Bland, AR; Clement, N;...","Smith, David V.; Utevsky, Amanda V.; Bland, Am...",NEUROIMAGE,"Neurosciences & Neurology; Radiology, Nuclear ...",1053-8119,1095-9572,27,...,False,True,False,18.0,2.84,32.0,32.0,"['based', 'receptors', 'distinct']",7,"['characterizing', 'approaches', 'regression']"


## Set NaNs in the Citations column to zero

In [6]:
data['Citations'] = data['Citations'].fillna(0)

In [7]:
print('Showing data for the following journals:')
print(data['SO'].value_counts())

Showing data for the following journals:
JOURNAL OF NEUROSCIENCE       2450
NEUROIMAGE                    1449
NEURON                         462
NEUROLOGY                      430
BRAIN                          378
MULTIPLE SCLEROSIS JOURNAL     296
ANNALS OF NEUROLOGY            247
ACTA NEUROPATHOLOGICA          142
LANCET NEUROLOGY                66
Name: SO, dtype: int64


In [8]:
# for col in data.columns:
#     print(col,': ',sum(data[col].isnull()))

## Bokeh code
The first step is to define a few things that will go into the plot.

#### Hover tool
The hover tool defines what happens when you hover your mouse over the plot. 

In [10]:
hover = HoverTool(
        tooltips=[
#             ("index", "$index"),
#             ("(x,y)", "($x, $y)"),
            ("Journal", "@SO"),
            ("DOI", "@DI"),
            ("Article Keywords","@Article_kws"),
            ("Citations", "@Citations"),
            ("Cluster_no", "@Cluster_no"),
            ("Cluster Keywords","@Cluster")
                ])
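
Each `@name` in a tooltip is looked up as a column of the data source attached to the glyph, so the field names above must match column names in `data`. A minimal sketch of that link, using a hypothetical three-row frame rather than the real dataset:

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical toy data; the column names mirror those used in the notebook.
toy = pd.DataFrame({
    "TSNE1": [0.1, 0.5, 0.9],
    "TSNE2": [0.3, 0.2, 0.8],
    "SO": ["NEURON", "BRAIN", "NEUROLOGY"],
    "Citations": [71, 20, 6],
})
source = ColumnDataSource(toy)

toy_hover = HoverTool(tooltips=[("Journal", "@SO"), ("Citations", "@Citations")])
toy_plot = figure(tools=[toy_hover], title="Tooltip fields come from the data source")
toy_plot.scatter(x="TSNE1", y="TSNE2", source=source)

# Every @field referenced by the tooltips should exist in the source.
fields = [spec.lstrip("@") for _, spec in toy_hover.tooltips]
missing = [f for f in fields if f not in source.column_names]
print("missing columns:", missing)
```

A check like this is a cheap way to catch a misspelt column name before the plot is rendered.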

In [11]:
output_notebook()

#### Other tools
Other tools for the toolbar on the right-hand side of the plot can be selected from a list.

In [12]:
TOOLS = [BoxSelectTool(), hover, 'tap','box_zoom','reset', 'crosshair'] #,HoverTool()] # just say 'HoverTool()' for the default

## Add Alpha and size data

'Alpha' is the transparency, or 'brightness', of the dots.  The formula below makes low-cited articles dimmer than highly cited ones.  The effect on the final plot is quite subtle and the column can be removed, but it does help individual articles stand out, even if their colouring puts them in a cluster with low average citations.  Note that the plotting cell further down currently passes a fixed `alpha = 0.2`; swap in `'Alpha'` there to use this column.

'Sizes' is our setting for the size of the dots.  I used to use this to help accentuate highly cited papers, but I decided that it made the plot look cluttered.  Perhaps it's worth uncommenting this line if you are using a small dataset.

In [13]:
# data['Sizes'] = 3+(1.5*(np.log(1+data.Citations)))
data['Alpha'] = 0.1+(0.3*np.log2(1+data.Citations))
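
To get a feel for the formula, the sketch below evaluates it for a few hypothetical citation counts. The value passes 1.0 once an article has more than 7 citations, so clipping at 1.0 is added here as an assumption (it is not part of the original cell) to keep the column within the 0 to 1 interval that alpha values are normally expected to lie in.

```python
import numpy as np
import pandas as pd

# Hypothetical citation counts, not taken from the dataset
citations = pd.Series([0, 1, 7, 20, 67])

raw_alpha = 0.1 + 0.3 * np.log2(1 + citations)   # formula from the cell above
alpha = raw_alpha.clip(upper=1.0)                # assumed safeguard: cap at fully opaque

print(pd.DataFrame({"Citations": citations,
                    "raw": raw_alpha.round(3),
                    "Alpha": alpha.round(3)}))
```

With the cap in place, every article above roughly 7 citations is drawn at full opacity, so a gentler multiplier than 0.3 may be worth trying if more gradation is wanted.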

## Colour clusters by average citation rates
This is where we define our colour scheme. 

In [14]:

sns.palplot(sns.color_palette("YlOrRd", 50)[::-1])

In [15]:
cit_col_ls = []
avcits = data.groupby('Cluster_no')['Citations'].mean()
maxav = avcits.max()
minav = avcits.min()
avrange = maxav-minav
# n_clusters+1 colours are needed: the highest-cited cluster scales to exactly n_clusters below
palette = sns.color_palette("YlOrRd", n_clusters+1).as_hex()[::-1]
# palette2 = ['#%02x%02x%02x'%(int(y) for y in x) for x in palette]
# palette2
color_numerics=[]
for journal, cl_no in zip(data['SO'], data['Cluster_no']):
    if journal == journal_of_interest:
        cit_col_ls.append('#7ec0ee') # this colour is a bright sky-blue.  'green' or maybe purple would show up well, too
    else:
        cl_avcit = n_clusters*(avcits[cl_no]-minav)/avrange # gets the relative position of the cluster's average cites in the distribution
        color_numerics.append(cl_avcit)
        colr = palette[int(cl_avcit)]
        cit_col_ls.append(colr)
data['Cit_col'] = cit_col_ls   
# cit_col_ls
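
The same mapping can be written without an explicit loop. The sketch below is an assumed equivalent, not the notebook's own code: it uses a hypothetical five-row frame, three clusters and a hand-written four-colour palette in place of the seaborn one. It also shows why the palette needs `n_clusters+1` entries: the cluster with the highest average scales to exactly `n_clusters`, and that number is used as a list index.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's data, n_clusters, palette and journal_of_interest
toy = pd.DataFrame({
    "SO": ["NEURON", "BRAIN", "NEURON", "BRAIN", "NEUROLOGY"],
    "Cluster_no": [0, 0, 1, 2, 2],
    "Citations": [10, 30, 5, 80, 40],
})
n_clusters = 3
palette = ["#800026", "#e31a1c", "#fd8d3c", "#ffffcc"]  # n_clusters + 1 colours, dark to light
journal_of_interest = "BRAIN"

avcits = toy.groupby("Cluster_no")["Citations"].mean()
# Scale each cluster's mean to 0..n_clusters and truncate to an integer palette index
scaled = n_clusters * (avcits - avcits.min()) / (avcits.max() - avcits.min())
cluster_colour = scaled.astype(int).map(lambda idx: palette[idx])

# Look up each row's cluster colour, then override rows from the journal of interest
toy["Cit_col"] = toy["Cluster_no"].map(cluster_colour)
toy.loc[toy["SO"] == journal_of_interest, "Cit_col"] = "#7ec0ee"
print(toy)
```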

Check the distribution of the colours.  In some datasets, you'll find a poor spread of red/orange/yellow simply because of how citations are distributed in the dataset.  It is worth adjusting the scaling in the cell above to make the differences stand out.

In [16]:
pd.Series(color_numerics).hist()

<matplotlib.axes._subplots.AxesSubplot at 0x21db41980f0>
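
One possible adjustment, offered as a suggestion rather than something the notebook does, is to rank the clusters by average citations and spread the ranks evenly over the palette. A single very highly cited cluster then no longer squeezes all the others into one end of the colour range. The cluster averages below are hypothetical.

```python
import pandas as pd

# Hypothetical per-cluster mean citations with one extreme value
avcits = pd.Series([4.0, 5.0, 6.0, 7.0, 90.0], name="Citations")
n_clusters = len(avcits)

# Linear scaling, as in the colour cell: the outlier pushes everything else to index 0
linear_idx = (n_clusters * (avcits - avcits.min())
              / (avcits.max() - avcits.min())).astype(int)

# Rank scaling: order the clusters, then map ranks 0..n-1 onto indices 0..n_clusters
ranks = avcits.rank(method="first") - 1
rank_idx = (n_clusters * ranks / (n_clusters - 1)).astype(int)

print(pd.DataFrame({"mean": avcits, "linear": linear_idx, "rank": rank_idx}))
```

The trade-off is that rank scaling hides how large the gaps between clusters are; it only preserves their order.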

## Build plot
This is where we define the figure itself using the Bokeh package.

In [17]:
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.palettes import RdBu3
from bokeh.plotting import figure

# https://stackoverflow.com/questions/41856999/bokeh-plots-just-bring-up-a-blank-window
# BOKEH_RESOURCES=inline


# plotting
p = figure(plot_width=950, plot_height=600,
           title=plot_title,  # specified in the config file!
           tools=TOOLS,
          x_axis_label = "Textual similarity axis_1 (arbitrary units)",
          y_axis_label = 'Textual similarity axis_2 (arbitrary units)') # , active_inspect=None)

# p.toolbar.active_inspect = ['crosshair', hover]

p.background_fill_color = "black"

p.circle(x = 'TSNE1', 
         y = 'TSNE2', 
#           legend = 'Division',
#          size = 'Sizes',
         color = 'Cit_col', # Cit_col',#'j_col', #'cit_colr', # #841F27', #'Color',
         alpha = 0.2, #'Alpha', 
         line_alpha = 0,
         source = ColumnDataSource(data))  # This conversion to ColumnDataSource is crucial.

p.legend.location = "bottom_right"
# p.legend.text = div_colors

output_file(tfidf_filename,
           mode = 'inline')  # toggle for write-to-file


# add links
url = "@Link"
taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)
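
`OpenURL` fills `@Link` from the column of that name in the data source, so `data` needs a `Link` column before the `ColumnDataSource` is built. The sample output earlier in the notebook elides some columns, so it does not show whether `Link` already exists. If it does not, one way to create it, assuming the DOIs in `DI` should resolve through `https://doi.org/`, is sketched here on a two-row frame built from DOIs in that sample output:

```python
import pandas as pd

# Two DOIs copied from the sample rows shown earlier; the Link column is an assumption
toy = pd.DataFrame({"DI": ["10.1016/j.neuron.2012.04.021",
                           "10.1212/WNL.0b013e318248e4ff"]})

# Prefix each DOI with the resolver address so a tap can open the article page
toy["Link"] = "https://doi.org/" + toy["DI"]
print(toy["Link"].tolist())
```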

In [18]:
show(p)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


### Add bar plots
Now that the plot is completed, there are a few simple images that might help to describe the data.

In [None]:
# % matplotlib inline
df = pd.read_csv('data/cluster_data.csv', index_col=0)
df.sample(1)

In [None]:
df.dtypes

In [None]:
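# Convert Cluster to an ordered category data type
# (recent pandas releases no longer accept astype('category', ordered=True))
df['Cluster'] = df['Cluster'].astype(str).astype(pd.api.types.CategoricalDtype(ordered=True))
#check
df.dtypes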

### With Seaborn
Means first

In [None]:
x = 'Cluster'
y = 'mean_cites'

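# Note: assigning a Series aligns on the index, so the sort order is discarded and
# this line leaves Cluster unchanged; the bar order comes from `order=` in the plotting cell
df.Cluster = df.sort_values(y).Cluster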
df = df.sort_values(y, ascending = False)
df.Cluster.cat.ordered
# df['Cluster'] = df['Cluster'].cat.reorder_categories(list(df['Cluster']), ordered=True)
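
Two pandas behaviours matter here. Assigning a Series to a column aligns on the index, so assigning a sorted copy of a column back to itself changes nothing. The order of a categorical's categories is also independent of the row order. A small sketch with hypothetical values, using `reorder_categories` as in the commented-out line above:

```python
import pandas as pd

# Hypothetical cluster summary; the real one is read from data/cluster_data.csv
toy = pd.DataFrame({"Cluster": ["a", "b", "c"], "mean_cites": [5.0, 30.0, 12.0]})
toy["Cluster"] = toy["Cluster"].astype("category")
before = toy["Cluster"].tolist()

# Assigning the sorted column back realigns it on the index: nothing moves
toy["Cluster"] = toy.sort_values("mean_cites")["Cluster"]
unchanged = toy["Cluster"].tolist() == before

# Sort the rows, then make the category order follow the row order
toy = toy.sort_values("mean_cites", ascending=False)
toy["Cluster"] = toy["Cluster"].cat.reorder_categories(list(toy["Cluster"]), ordered=True)

print(unchanged, list(toy["Cluster"].cat.categories))
```

`reorder_categories` needs each label exactly once, so this only works when the frame has one row per cluster.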

In [None]:
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(20, 12))
p = sns.barplot(data = df, 
            x=df[y],
            y=df[x],
            color = 'b',
            orient='h',
            order = df[x]).set_title('Citation rates for {} split into 50 K-Means clusters'.format(c.set_name))
# p.set_xticklabels(labels = df[x],rotation=90)
ax.set(xlabel='Mean citations', ylabel='Cluster')
plt.savefig('outputs/K_Means_Barplot_means.png')
p

Now plot the medians

In [None]:
import seaborn as sns

df = pd.DataFrame(data.groupby('Cluster')['Citations'].median())
df.reset_index(level=0, inplace=True)

In [None]:
df.columns = ['Cluster', 'Median citations']
df.Cluster = df.Cluster.astype('category')
df.head()

In [None]:
x = 'Cluster'
y = 'Median citations'

In [None]:
# as with the means, index alignment makes this assignment a no-op
df[x] = df.sort_values(y).Cluster

In [None]:
df = df.sort_values(y, ascending = False)
df.Cluster.cat.ordered

In [None]:
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(20, 12))
p = sns.barplot(data = df, 
            x=df[y],
            y=df[x],
            color = 'b',
            orient='h',
            order = df[x]).set_title('Citation rates for {} split into 50 K-Means clusters'.format(c.set_name))
# p.set_xticklabels(labels = df[x],rotation=90)
ax.set(xlabel='Median citations', ylabel='Cluster')
plt.savefig('outputs/K_Means_Barplot_medians.png')
p

## Show relative sizes of journals over the years

In [None]:
# see clusters 15, 47, 10

bar_df = pd.DataFrame(data.groupby(['SO',
#                                     'Cluster',
                                    'PY']).size().reset_index(name="Count"))

# bar_df.columns
bar_df.sample()
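
For a quick look at the same counts without plotting, they can be laid out with one row per journal and one column per year. A sketch using `pd.crosstab` on hypothetical rows (the column names `SO` and `PY` are the ones used above):

```python
import pandas as pd

# Hypothetical article records: journal (SO) and publication year (PY)
toy = pd.DataFrame({
    "SO": ["NEURON", "NEURON", "BRAIN", "BRAIN", "BRAIN", "NEURON"],
    "PY": [2012, 2013, 2012, 2012, 2013, 2013],
})

# Long format, as in the cell above: one row per (journal, year) pair
bar_toy = toy.groupby(["SO", "PY"]).size().reset_index(name="Count")

# Wide format: journals as rows, years as columns
wide = pd.crosstab(toy["SO"], toy["PY"])
print(bar_toy)
print(wide)
```

The long format is what seaborn expects for a grouped bar plot; the wide format is easier to read as a table.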

In [None]:
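f, ax = plt.subplots(figsize=(20, 12))
# barplot is axes-level, so it draws on the `ax` created above.  factorplot is
# figure-level (and absent from recent seaborn releases): it draws on its own
# figure rather than on `ax`, which would explain the saving problem noted below
p = sns.barplot(data = bar_df,
                ax = ax,
                x = 'PY',
                y = 'Count',
                hue = 'SO')
# plt.title('Citations per year for cluster 15')
# p.set_xticklabels(labels = df[x],rotation=90)
# plt.savefig('outputs/Cluster_15_Barplot.png') #image not saving correctly
plt.show()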