### Visualisation with Bokeh
This notebook contains a short piece of visualisation code that displays the text data processed in the earlier notebooks.

In [55]:
print('------------------------------------------------------')
print('Step 5:  Creating visualisation')
from datetime import datetime as dt
print(dt.now())
print('------------------------------------------------------')

------------------------------------------------------
Step 5:  Creating visualisation
2018-02-20 16:38:52.259237
------------------------------------------------------


In [56]:
import pandas as pd
import numpy as np
import pickle

# %matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


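# explicit imports instead of wildcards, so it is clear where each name comes from
from bokeh.plotting import figure, output_file, show
from bokeh.models import (BoxSelectTool, ColumnDataSource, HoverTool,
                          OpenURL, TapTool)
from bokeh.io import output_notebook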

In [57]:
from config import Config as c
tfidf_filename = c.tfidf_filename
bar_filename = c.bar_filename
plot_title = c.title
doi_datapath = c.dois_pkl
working_data = c.working_data
set_name = c.set_name
# test_set = c.test_set
n_clusters = c.n_clusters
journal_of_interest = c.joi
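
The `config` module itself is not shown in this notebook. As a rough sketch, assuming it exposes a plain class with class-level attributes, it might look like the following. Only the attribute names come from the cell above; every value is a hypothetical placeholder.

```python
# Hypothetical sketch of config.py. Only the attribute names are taken from
# this notebook; every value below is an illustrative placeholder.
class Config:
    set_name = "Cardiology"                     # label used in plot titles
    title = "Example plot title"                # Bokeh figure title
    n_clusters = 50                             # number of K-Means clusters
    joi = "EXAMPLE JOURNAL"                     # journal of interest (an 'SO' value)
    working_data = "data/working_data.csv"      # per-article table
    dois_pkl = "data/dois.pkl"                  # pickled DOIs
    tfidf_filename = "outputs/tfidf_plot.html"  # Bokeh HTML output
    bar_filename = "outputs/bar_plot.html"      # bar-plot output


c = Config  # mirrors `from config import Config as c`
print(c.set_name, c.n_clusters)
```

Keeping these settings in one place means the notebook can be re-run on a different dataset by editing the config alone.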

In [58]:
# global variable
with open(doi_datapath, 'rb') as f:  # context manager closes the file after loading
    dois = pickle.load(f)

In [59]:
# read data
data = pd.read_csv(c.working_data,index_col = 0)
# data["Cluster_no"] = data["Cluster_no"].astype('category')
data.sample(5)

Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,...,TSNE2,DOI_y,Link_y,field_citation_ratio,highly_cited_1,highly_cited_10,highly_cited_5,recent_citations,relative_citation_ratio,times_cited
18137,10.1177/1708538114531259,2015,Through-and-through wire technique for endovas...,"Rohlffs, F; Larena-Avellaneda, AA; Petersen, J...","Rohlffs, Fiona; Larena-Avellaneda, Axel Antoni...",VASCULAR,Cardiovascular System & Cardiology,1708-5381,1708-539X,1,...,-19.47009,10.1177/1708538114531259,http://dx.doi.org/10.1177/1708538114531259,0.86,False,False,False,2.0,0.97,2.0
1318,10.1016/j.ahj.2012.01.029,2012,The impact of high-density lipoprotein cholest...,"Duffy, D; Holmes, DN; Roe, MT; Peterson, ED","Duffy, Danielle; Holmes, DaJuanicia N.; Roe, M...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,,11,...,-33.253826,10.1016/j.ahj.2012.01.029,http://dx.doi.org/10.1016/j.ahj.2012.01.029,2.41,False,False,False,3.0,0.64,13.0
3249,10.1161/CIR.0000000000000444,2016,Sleep Duration and Quality: Impact on Lifestyl...,"St-Onge, MP; Grandner, MA; Brown, D; Conroy, M...","St-Onge, Marie-Pierre; Grandner, Michael A.; B...",CIRCULATION,Cardiovascular System & Cardiology,0009-7322,1524-4539,25,...,17.714027,10.1161/CIR.0000000000000444,http://dx.doi.org/10.1161/CIR.0000000000000444,28.56,True,True,True,40.0,8.98,40.0
13915,10.1177/1526602817711424,2017,Outcomes After Endovascular Revascularization ...,"Uhl, C; Steinbauer, M; Torsello, G; Bisdas, T","Uhl, Christian; Steinbauer, Markus; Torsello, ...",JOURNAL OF ENDOVASCULAR THERAPY,Surgery; Cardiovascular System & Cardiology,1526-6028,1545-1550,0,...,-6.968868,10.1177/1526602817711424,http://dx.doi.org/10.1177/1526602817711424,,False,False,False,0.0,,0.0
10294,10.1177/2047487314553736,2014,Electrocardiographic monitoring during maratho...,"Spethmann, S; Prescher, S; Dreger, H; Nettlau,...","Spethmann, Sebastian; Prescher, Sandra; Dreger...",EUROPEAN JOURNAL OF PREVENTIVE CARDIOLOGY,Cardiovascular System & Cardiology,2047-4873,2047-4881,2,...,-1.732146,10.1177/2047487314553736,http://dx.doi.org/10.1177/2047487314553736,0.89,False,False,False,3.0,0.25,3.0


## Set NaNs in the Citations column to zero

In [60]:
data['Citations'] = data['Citations'].fillna(0)

In [61]:
print('Showing data for the following journals:')
print(data['SO'].value_counts())

Showing data for the following journals:
CIRCULATION                                                     1583
AMERICAN HEART JOURNAL                                          1391
EUROPACE                                                        1345
JOURNAL OF CEREBRAL BLOOD FLOW AND METABOLISM                   1230
EUROPEAN HEART JOURNAL                                          1210
JOURNAL OF INTERNATIONAL MEDICAL RESEARCH                       1023
CARDIOVASCULAR RESEARCH                                          927
EUROPEAN JOURNAL OF PREVENTIVE CARDIOLOGY                        915
EUROPEAN HEART JOURNAL-CARDIOVASCULAR IMAGING                    776
INTERNATIONAL JOURNAL OF STROKE                                  741
CLINICAL AND APPLIED THROMBOSIS-HEMOSTASIS                       597
ANGIOLOGY                                                        565
JOURNAL OF ENDOVASCULAR THERAPY                                  526
VASCULAR AND ENDOVASCULAR SURGERY                             

In [62]:
# for col in data.columns:
#     print(col,': ',sum(data[col].isnull()))

## Bokeh code
The first step is to define a few things that will go into the plot.

#### Hover tool
The hover tool defines what happens when you hover your mouse over the plot. 

In [63]:
hover = HoverTool(
        tooltips=[
#             ("index", "$index"),
#             ("(x,y)", "($x, $y)"),
            ("Journal", "@SO"),
            ("DOI", "@DI"),
            ("Article Keywords","@Article_kws"),
            ("Citations", "@Citations"),
            ("Cluster_no", "@Cluster_no"),
            ("Cluster Keywords","@Cluster")
                ])

In [64]:
output_notebook()

#### Other tools
Other tools to add to the right hand side of the plot can be selected from a list.

In [65]:
TOOLS = [BoxSelectTool(), hover, 'tap','box_zoom','reset', 'crosshair'] #,HoverTool()] # just say 'HoverTool()' for the default

## Add Alpha and size data

'Alpha' is the transparency (or 'brightness') of the dots.  The formula below makes low-cited articles dimmer than highly cited ones.  The effect on the final plot is subtle and can be removed, but it helps individual articles stand out, even if their colouring puts them in a group that has low citations.  Note that the plotting cell further down currently passes a constant `alpha = 0.2`; swap in `'Alpha'` there to use this column.

'Sizes' is our setting for the size of the dots.  I used to use this to help accentuate highly cited papers, but I decided that it made the plot look cluttered.  Perhaps it's worth uncommenting this line if you are using a small dataset.

In [66]:
# data['Sizes'] = 3+(1.5*(np.log(1+data.Citations)))
data['Alpha'] = 0.1+(0.3*np.log2(1+data.Citations))
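
To get a feel for the alpha formula, here is a small pure-Python sketch that evaluates it for a few citation counts. Unclipped, the value reaches 1.0 at 7 citations and keeps growing, so the sketch caps it at 1.0. The cap and the helper name `alpha_for` are assumptions for illustration, not part of the notebook.

```python
import math


def alpha_for(citations, base=0.1, scale=0.3):
    """Alpha from the formula above, capped at 1.0 (the cap is an added assumption)."""
    raw = base + scale * math.log2(1 + citations)
    return min(raw, 1.0)


# Uncited papers get the base alpha; the cap is hit from 7 citations upwards.
for n in (0, 1, 7, 100):
    print(n, round(alpha_for(n), 3))
```

On the real column, the same cap could be applied with `data['Alpha'].clip(upper=1.0)`.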

## Colour clusters by average citation rates
This is where we define our colour scheme. 

In [67]:

sns.palplot(sns.color_palette("YlOrRd", 50)[::-1])

In [68]:
cit_col_ls = []
color_numerics = []
avcits = data.groupby('Cluster_no')['Citations'].mean()  # mean citations per cluster
maxav = avcits.max()
minav = avcits.min()
avrange = maxav-minav
# n_clusters+1 colours: the top cluster scales to exactly n_clusters, which needs the extra palette entry
palette = sns.color_palette("YlOrRd", n_clusters+1).as_hex()[::-1]
for journal, cl_no in zip(data['SO'], data['Cluster_no']):
    if journal == journal_of_interest:
        cit_col_ls.append('#7ec0ee') # this colour is a bright sky-blue.  'green' or maybe purple would show up well, too
    else:
        # relative position of the cluster's average cites in the distribution, scaled to 0..n_clusters
        cl_avcit = n_clusters*(avcits[cl_no]-minav)/avrange
        color_numerics.append(cl_avcit)
        cit_col_ls.append(palette[int(cl_avcit)])
data['Cit_col'] = cit_col_ls
# cit_col_ls
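
The index arithmetic in the cell above can be checked in isolation. The sketch below uses hypothetical cluster means and a hypothetical helper name, `palette_index`. It shows that the cluster with the highest mean scales to exactly `n_clusters`, which is why the palette needs `n_clusters+1` entries.

```python
def palette_index(cluster_mean, min_mean, max_mean, n_clusters):
    """Scale a cluster's mean citations linearly to an integer palette index."""
    scaled = n_clusters * (cluster_mean - min_mean) / (max_mean - min_mean)
    return int(scaled)


n_clusters = 50
means = [2.0, 7.5, 13.0, 24.0]  # hypothetical cluster means
lo, hi = min(means), max(means)
indices = [palette_index(m, lo, hi, n_clusters) for m in means]
print(indices)  # → [0, 12, 25, 50]
```

Index 50 is only valid in a list of 51 colours, so a palette of exactly `n_clusters` entries would raise an `IndexError` for the top cluster.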

Check the distribution of the colours.  In some datasets the spread of red/orange/yellow is poor simply because of how citations are distributed.  It is worth adjusting the colour mapping to make the differences stand out.

In [69]:
pd.Series(color_numerics).hist()

<matplotlib.axes._subplots.AxesSubplot at 0x117526f72e8>
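
If the histogram is heavily skewed, one possible adjustment is to colour clusters by rank rather than by value. This is an alternative technique, not what the notebook does. The sketch below contrasts the two on hypothetical means; the helper names `linear_indices` and `rank_indices` are invented for illustration.

```python
def linear_indices(cluster_means, n_colours):
    """Value-based mapping, as in the notebook: position within the min..max range."""
    lo, hi = min(cluster_means.values()), max(cluster_means.values())
    return {cl: int((n_colours - 1) * (m - lo) / (hi - lo))
            for cl, m in cluster_means.items()}


def rank_indices(cluster_means, n_colours):
    """Rank-based mapping: spread clusters evenly over the palette by sort order."""
    order = sorted(cluster_means, key=cluster_means.get)  # cluster ids, lowest mean first
    top = max(len(order) - 1, 1)
    return {cl: round(rank * (n_colours - 1) / top) for rank, cl in enumerate(order)}


means = {0: 2.0, 1: 2.5, 2: 3.0, 3: 40.0}  # hypothetical, heavily skewed
print(linear_indices(means, n_colours=4))  # one outlier squeezes the rest together
print(rank_indices(means, n_colours=4))    # every cluster gets its own colour
```

The trade-off is that rank-based colours no longer show how far apart the cluster averages are, only their order.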

## Build plot
This is where we define the figure itself using the Bokeh package.

In [70]:
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, OpenURL, TapTool
from bokeh.plotting import figure

# https://stackoverflow.com/questions/41856999/bokeh-plots-just-bring-up-a-blank-window
# BOKEH_RESOURCES=inline


# plotting
p = figure(plot_width=950, plot_height=600,
           title=plot_title,  # specified in the config file!
           tools=TOOLS,
          x_axis_label = "Textual similarity axis_1 (arbitrary units)",
          y_axis_label = 'Textual similarity axis_2 (arbitrary units)') # , active_inspect=None)

# p.toolbar.active_inspect = ['crosshair', hover]

p.background_fill_color = "black"

p.circle(x = 'TSNE1', 
         y = 'TSNE2', 
#           legend = 'Division',
#          size = 'Sizes',
         color = 'Cit_col', # Cit_col',#'j_col', #'cit_colr', # #841F27', #'Color',
         alpha = 0.2, #'Alpha', 
         line_alpha = 0,
         source = ColumnDataSource(data))  # This conversion to ColumnDataSource is crucial.

p.legend.location = "bottom_right"
# p.legend.text = div_colors

output_file(tfidf_filename,
           mode = 'inline')  # toggle for write-to-file


# add links
url = "@Link"
taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

In [71]:
show(p)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


### Add bar plots
Now that the plot is completed, there are a few simple images that might help to describe the data.

In [72]:
# % matplotlib inline
df = pd.read_csv('data/cluster_data.csv', index_col=0)
df.sample(1)

Unnamed: 0,Cluster,cites,dois_ls,mean_cites,nz_cites,len_cites,Cluster_no,nz_pc
3,"['mortality', 'hospital', 'european']","[0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1....","['10.1016/j.ahj.2017.08.011', '10.1016/j.ahj.2...",21.998745,739.0,797,3,0.927227


In [73]:
df.dtypes

Cluster        object
cites          object
dois_ls        object
mean_cites    float64
nz_cites      float64
len_cites       int64
Cluster_no      int64
nz_pc         float64
dtype: object

In [74]:
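# Convert Cluster to an ordered category data type
# (passing `ordered` straight to astype('category', ...) is not accepted by newer pandas)
from pandas.api.types import CategoricalDtype
df['Cluster'] = df['Cluster'].astype(str).astype(CategoricalDtype(ordered=True))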
#check
df.dtypes

  


Cluster       category
cites           object
dois_ls         object
mean_cites     float64
nz_cites       float64
len_cites        int64
Cluster_no       int64
nz_pc          float64
dtype: object

### With Seaborn
Means first

In [75]:
x = 'Cluster'
y = 'mean_cites'

# Sort rows by mean citations, highest first; the barplot's `order` argument below follows this row order
df = df.sort_values(y, ascending = False)
df.Cluster.cat.ordered
# df['Cluster'] = df['Cluster'].cat.reorder_categories(list(df['Cluster']), ordered=True)

True
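
One pandas detail matters when ordering categories for a plot: assigning a re-sorted column back to a DataFrame realigns it on the index, so the values stay where they were. The minimal sketch below, on a hypothetical three-row frame, shows this. It also shows that sorting the rows and then calling `reorder_categories` does change the category order.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Cluster": pd.Categorical(["a", "b", "c"], ordered=True),
    "mean_cites": [5.0, 20.0, 10.0],
})

before = df_demo["Cluster"].tolist()
# Assigning a re-sorted copy of a column realigns on the index: nothing changes.
df_demo["Cluster"] = df_demo.sort_values("mean_cites")["Cluster"]
unchanged = df_demo["Cluster"].tolist() == before

# Sorting the rows, then reordering the categories to match, does change the order.
df_demo = df_demo.sort_values("mean_cites", ascending=False)
df_demo["Cluster"] = df_demo["Cluster"].cat.reorder_categories(
    df_demo["Cluster"].tolist(), ordered=True)
print(unchanged, list(df_demo["Cluster"].cat.categories))
```

When the plot call passes an explicit `order`, as the bar plots in this notebook do, the row sort alone is enough and the category order is not needed.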

In [76]:
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(20, 12))
p = sns.barplot(data = df, 
            x=df[y],
            y=df[x],
            color = 'b',
            orient='h',
            order = df[x]).set_title('Citation rates for {} split into 50 K-Means clusters'.format(c.set_name))
# p.set_xticklabels(labels = df[x],rotation=90)
ax.set(xlabel='Mean citations', ylabel='Cluster')
plt.savefig('outputs/K_Means_Barplot_means.png')
p



Text(0.5,1,'Citation rates for Cardiology split into 50 K-Means clusters')

Now plot the medians

In [77]:
import seaborn as sns

df = pd.DataFrame(data.groupby('Cluster')['Citations'].median())
df.reset_index(level=0, inplace=True)

In [78]:
df.columns = ['Cluster', 'Median citations']
df.Cluster = df.Cluster.astype('category')
df.head()

Unnamed: 0,Cluster,Median citations
0,"['ablation', 'catheter', 'fibrillation backgro...",7.0
1,"['angiotensin', 'mice', 'mice background']",7.0
2,"['associated', 'function', 'effects']",6.0
3,"['blood', 'pressure', 'blood pressure']",6.0
4,"['burden stroke', 'burden', 'response']",4.0


In [79]:
x = 'Cluster'
y = 'Median citations'

In [80]:
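# This assignment has no effect: pandas realigns the sorted column on the index, so values stay in place.
# The row sort in the next cell is what sets the bar order.
df[x] = df.sort_values(y).Cluster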

In [81]:
df = df.sort_values(y, ascending = False)
df.Cluster.cat.ordered

False

In [82]:
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(20, 12))
p = sns.barplot(data = df,
            x=df[y],
            y=df[x],
            color = 'b',
            orient='h',
            order = df[x]).set_title('Citation rates for {} split into 50 K-Means clusters'.format(c.set_name))
# p.set_xticklabels(labels = df[x],rotation=90)
ax.set(xlabel='Median citations', ylabel='Cluster')
plt.savefig('outputs/K_Means_Barplot_medians.png')
p

Text(0.5,1,'Citation rates for Cardiology split into 50 K-Means clusters')

## Show relative sizes of journals over the years

In [83]:
# see clusters 15, 47, 10

bar_df = pd.DataFrame(data.groupby(['SO',
#                                     'Cluster',
                                    'PY']).size().reset_index(name="Count"))

# bar_df.columns
bar_df.sample()

Unnamed: 0,SO,PY,Count
72,EUROPEAN JOURNAL OF CARDIOVASCULAR NURSING,2014,41


In [84]:
f, ax = plt.subplots(figsize=(20, 12))
# barplot is axes-level and draws directly onto `ax`;
# the figure-level factorplot creates a separate figure of its own
sns.barplot(data = bar_df,
            ax = ax,
            x = 'PY',
            y = 'Count',
            hue = 'SO')
ax.set(xlabel='Publication year', ylabel='Number of articles')
# plt.title('Citations per year for cluster 15')
# plt.savefig('outputs/Cluster_15_Barplot.png') #image not saving correctly
plt.show()