## About
In this notebook, we read in the tf-idf data, which is essentially one giant matrix, and reduce its dimensionality so that we can visualise its features. Much more can be done with the tf-idf data; everything here is purely for the purposes of visualisation.

In [1]:
print('------------------------------------------------------')
print('Step 2:  t-SNE dimensional reduction for visualisation')
from datetime import datetime as dt
print(dt.now())
print('------------------------------------------------------')

------------------------------------------------------
Step 2:  t-SNE dimensional reduction for visualisation
2018-02-16 12:47:03.997663
------------------------------------------------------


In [2]:
# import dependencies
import pickle
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD, PCA
from datetime import datetime as dt

In [3]:
# configure
from config import Config as c
working_data = c.working_data
doi_datapath = c.dois_pkl
# word_datapath = c.word_datapath
tfidf_datapath = c.tfidf_datapath
# cosine_sims_datapath = c.cosine_sims_datapath
vectorizer_datapath = c.vectorizer_datapath
tsne_data = c.tsne_data

In [4]:
# load data
t_start = dt.now()
with open(tfidf_datapath, 'rb') as f:
    tfidf = pickle.load(f).todense()  # should this be todense?
# cosine_distances = pickle.load(open(cosine_sims_datapath,'rb'))
with open(doi_datapath, 'rb') as f:
    dois = pickle.load(f)
data = pd.read_csv(working_data, index_col=0)

## Heuristics
This section is far from ideal. We are using t-SNE to reduce a high-dimensional dataset down to 2 dimensions. The right settings depend on the size of our dataset, on its structure, and on each other, so changing one setting usually means changing others as well to get the desired effect. Furthermore, the process is not completely deterministic and won't always give the same result with the same inputs. As far as I am aware, tuning t-SNE is not something that can be done programmatically.

Below, I have simply guessed at some approximate starting figures which I think will give OK results.  It may be necessary to modify them in config.py for each specific visualisation in order to get better results.
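
On the point about non-determinism: scikit-learn's `TSNE` accepts a `random_state` argument, and fixing it is one way to make repeated runs comparable while the other settings are being tuned. The following is only a minimal sketch on made-up toy data (not the notebook's tf-idf vectors); the helper name `embed` is an assumption for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the reduced vectors: 60 points in 10 dimensions (made-up data)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 10))

def embed(X, seed):
    """Run a 2-D t-SNE with a fixed seed so the run can be repeated."""
    return TSNE(n_components=2, perplexity=10, random_state=seed).fit_transform(X)

run_a = embed(X_toy, seed=0)
run_b = embed(X_toy, seed=0)

print(run_a.shape)
print('max difference between the two runs:', np.abs(run_a - run_b).max())
```

Fixing the seed does not make the embedding "correct"; it only removes one source of variation when comparing settings.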

In [5]:
# configure t-sne parameters
n_rows = data.shape[0]

if c.perplexity is None:
    perplexity = int(0.07*n_rows) # choosing 7 percent of dataset size for perplexity
    print('Warning: perplexity not set in config.py choosing default of: ',perplexity)
else:
    perplexity = c.perplexity
    print('Perplexity set to:',perplexity)

if c.n_iter is None:
    n_iter = int(2*n_rows)
    print('Warning: n_iter not set in config.py choosing default of: ',n_iter)
else:
    n_iter = c.n_iter
    print('Number of iterations set to: ',n_iter)
    
if c.n_iter_without_progress is None:
    n_iter_without_progress = max(n_iter//100 , 25)
    print('Warning: n_iter_without_progress not set in config.py choosing default of: ',n_iter_without_progress)
else:
    n_iter_without_progress = c.n_iter_without_progress
    print('Number of iterations without progress set to: ',n_iter_without_progress)

if c.learning_rate is None:
    learning_rate = int(7*n_rows)
    print('Warning: learning_rate not set in config.py choosing default of: ',learning_rate)
else:
    learning_rate = c.learning_rate
    print('Learning rate set to: ',learning_rate)


Perplexity set to: 80
Number of iterations set to:  1000
Number of iterations without progress set to:  50
Learning rate set to:  100
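
All four settings were supplied by config.py in this run, so none of the fallbacks fired. For reference, here is a small sketch of what the fallback arithmetic would give for a dataset of 7713 rows (the size of `data` in this notebook). The function name `heuristic_tsne_defaults` is hypothetical and not part of the notebook.

```python
def heuristic_tsne_defaults(n_rows):
    """Reproduce the fallback formulas used when a setting is None in config.py."""
    n_iter = int(2 * n_rows)
    return {
        'perplexity': int(0.07 * n_rows),               # 7 percent of the dataset size
        'n_iter': n_iter,                               # twice the dataset size
        'n_iter_without_progress': max(n_iter // 100, 25),
        'learning_rate': int(7 * n_rows),               # seven times the dataset size
    }

defaults = heuristic_tsne_defaults(7713)
for name, value in defaults.items():
    print(name, '=', value)
```

For this dataset the fallback perplexity would be 539, well above the range of 5 to 50 that the scikit-learn documentation suggests considering, which is one reason the values are best set explicitly in config.py.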


### Input data

In [6]:
np.shape(data)

(7713, 23)

In [7]:
data.sample(4)

Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,...,DOI,Link,field_citation_ratio,highly_cited_1,highly_cited_10,highly_cited_5,recent_citations,relative_citation_ratio,times_cited,Citations
2237,10.1523/JNEUROSCI.4627-13.2014,2014,Cortical and Thalamic Excitation Mediate the M...,"Doig, NM; Magill, PJ; Apicella, P; Bolam, JP; ...","Doig, Natalie M.; Magill, Peter J.; Apicella, ...",JOURNAL OF NEUROSCIENCE,Neurosciences & Neurology,0270-6474,,31,...,10.1523/JNEUROSCI.4627-13.2014,http://dx.doi.org10.1523/JNEUROSCI.4627-13.2014,8.32,False,True,False,27.0,3.64,41.0,41.0
4328,10.1016/S1474-4422(12)70311-3,2013,A multilevel intervention to increase communit...,"Scott, PA; Meurer, WJ; Frederiksen, SM; Kalbfl...","Scott, Phillip A.; Meurer, William J.; Frederi...",LANCET NEUROLOGY,Neurosciences & Neurology,1474-4422,,26,...,10.1016/S1474-4422(12)70311-3,http://dx.doi.org10.1016/S1474-4422(12)70311-3,6.86,False,True,False,12.0,2.18,32.0,32.0
1401,10.1523/JNEUROSCI.1034-14.2014,2014,Motion Direction Biases and Decoding in Human ...,"Wang, HX; Merriam, EP; Freeman, J; Heeger, DJ","Wang, Helena X.; Merriam, Elisha P.; Freeman, ...",JOURNAL OF NEUROSCIENCE,Neurosciences & Neurology,0270-6474,,8,...,10.1523/JNEUROSCI.1034-14.2014,http://dx.doi.org10.1523/JNEUROSCI.1034-14.2014,3.0,False,False,False,8.0,1.02,12.0,12.0
4670,10.1177/1352458512462920,2013,Characterization of anti-natalizumab antibodie...,"Lundkvist, M; Engdahl, E; Holmen, C; Moverare,...","Lundkvist, M.; Engdahl, E.; Holmen, C.; Movera...",MULTIPLE SCLEROSIS JOURNAL,Neurosciences & Neurology,1352-4585,,17,...,10.1177/1352458512462920,http://dx.doi.org10.1177/1352458512462920,2.32,False,False,False,6.0,1.33,17.0,17.0


How many articles do we have from each journal in our set?

In [8]:
data['SO'].value_counts()

JOURNAL OF NEUROSCIENCE       3243
NEUROIMAGE                    1822
NEURON                         644
NEUROLOGY                      499
BRAIN                          489
MULTIPLE SCLEROSIS JOURNAL     412
ANNALS OF NEUROLOGY            329
ACTA NEUROPATHOLOGICA          184
LANCET NEUROLOGY                91
Name: SO, dtype: int64

__Check:__ How many rows of tf-idf data did we create in the last step? Ideally this matches the 7713 rows of `data`.

In [9]:
len(tfidf)

7711
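
The 7711 rows here are two fewer than the 7713 rows of `data`. A minimal sketch of a guard that makes such a mismatch explicit before the embedding is joined back onto the metadata (the helper name `unmatched_rows` is an assumption for illustration):

```python
def unmatched_rows(n_metadata_rows, n_vector_rows):
    """Return how many metadata rows have no corresponding tf-idf row."""
    missing = n_metadata_rows - n_vector_rows
    if missing != 0:
        print('Warning:', n_metadata_rows, 'metadata rows but',
              n_vector_rows, 'tf-idf rows;', missing, 'unmatched')
    return missing

# Row counts taken from the outputs in this notebook
n_unmatched = unmatched_rows(7713, 7711)
```

In the notebook itself this could be called as `unmatched_rows(data.shape[0], tfidf.shape[0])`.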

### Dimensional reduction

In [10]:
method = 'TSVD'

print('Starting dimensional reduction with',method)
# Perform dimensional reduction
vectors = tfidf
# truncated svd
t_start = dt.now()
# n of dimensions for initial step
n_components = 50

print('Choosing ',n_components,' dimensions for initial step.  t-SNE will do the rest.')

if method =='TSVD':
    X_reduced = TruncatedSVD(n_components = n_components,
                            random_state = 0).fit_transform(vectors)
else:
    # PCA Option... seems to be slower than TSVD
    X_reduced = PCA(n_components = n_components).fit_transform(vectors)

dr_t = dt.now() - t_start
print(method,'took', dr_t)

Starting dimensional reduction with TSVD
Choosing  50  dimensions for initial step.  t-SNE will do the rest.
TSVD took 0:00:02.322022
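
On the `todense` question raised when loading the data: `TruncatedSVD` accepts SciPy sparse matrices directly, so this initial step does not strictly need a dense copy. A rough sketch on a made-up random sparse matrix (not the notebook's tf-idf data), which also prints how much variance the retained components explain:

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Made-up sparse matrix standing in for tf-idf: 200 documents x 500 terms
X_sparse = sparse.random(200, 500, density=0.05, format='csr', random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_small = svd.fit_transform(X_sparse)  # sparse input, no .todense() needed

print(X_small.shape)
print('explained variance kept:', svd.explained_variance_ratio_.sum())
```

Note that `len()` is not defined for SciPy sparse matrices, so a row-count check on sparse input would need `shape[0]` instead.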


In [11]:
print('Continuing dimensional reduction with t-SNE')
# tsne
X_embedded = TSNE(n_components = 2,
                  perplexity = perplexity,
                  verbose = 3,
                  n_iter = n_iter,
                  n_iter_without_progress = n_iter_without_progress,
                  learning_rate = learning_rate,
                  ).fit_transform(X_reduced)
print('Dimensional reduction complete!')

Continuing dimensional reduction with t-SNE
[t-SNE] Computing 241 nearest neighbors...
[t-SNE] Indexed 7711 samples in 0.084s...
[t-SNE] Computed neighbors for 7711 samples in 9.873s...
[t-SNE] Computed conditional probabilities for sample 1000 / 7711
[t-SNE] Computed conditional probabilities for sample 2000 / 7711
[t-SNE] Computed conditional probabilities for sample 3000 / 7711
[t-SNE] Computed conditional probabilities for sample 4000 / 7711
[t-SNE] Computed conditional probabilities for sample 5000 / 7711
[t-SNE] Computed conditional probabilities for sample 6000 / 7711
[t-SNE] Computed conditional probabilities for sample 7000 / 7711
[t-SNE] Computed conditional probabilities for sample 7711 / 7711
[t-SNE] Mean sigma: 0.019389
[t-SNE] Computed conditional probabilities in 1.435s
[t-SNE] Iteration 50: error = 79.7311020, gradient norm = 0.0000001 (50 iterations in 26.263s)
[t-SNE] Iteration 50: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 50 iterations with early 

In [12]:
np.shape(data), np.shape(X_embedded)

((7713, 23), (7711, 2))

In [13]:
# data.head()

In [14]:
# drop t-SNE columns left over from a previous run, if any
data = data.drop(columns=['TSNE1', 'TSNE2'], errors='ignore')
data = pd.concat([data, pd.DataFrame(X_embedded)], axis=1)
# test_df.head()
data = data.rename(columns={0: 'TSNE1', 1: 'TSNE2'})
data.columns

Index(['DI', 'PY', 'WD', 'AU', 'AF', 'SO', 'SC', 'SN', 'EI', 'TC', 'Z9', 'DOI',
       'Link', 'field_citation_ratio', 'highly_cited_1', 'highly_cited_10',
       'highly_cited_5', 'recent_citations', 'relative_citation_ratio',
       'times_cited', 'Citations', 'TSNE1', 'TSNE2'],
      dtype='object')
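
`pd.concat(..., axis=1)` aligns rows by index label rather than by position. Since `data` has 7713 rows and `X_embedded` has 7711, a sketch with made-up data may help show what that implies. Which rows of `data` lack tf-idf vectors is not shown in this notebook, so the `kept` list below is purely hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up metadata with 4 rows and an embedding with only 3 rows
meta = pd.DataFrame({'DI': ['a', 'b', 'c', 'd']})
embedding = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Default RangeIndex (0, 1, 2): rows are matched by label, so row 3 gets NaN
by_position = pd.concat(
    [meta, pd.DataFrame(embedding, columns=['TSNE1', 'TSNE2'])], axis=1)

# If the vectors actually belong to rows 0, 1 and 3, state that via the index
kept = [0, 1, 3]  # hypothetical labels of the rows that have vectors
by_label = pd.concat(
    [meta, pd.DataFrame(embedding, columns=['TSNE1', 'TSNE2'], index=kept)], axis=1)

print(by_position)
print(by_label)
```

If the missing rows are not the last ones, the default alignment silently attaches coordinates to the wrong articles, so it may be worth carrying the surviving row labels through from the tf-idf step.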

In [15]:
# add PCA data as a check to see how closely TSNE matches it

# X_embedded2 = PCA(n_components = 2).fit_transform(X_reduced)
# X_embedded2
# try:
#     data = data.drop(['PCA1','PCA2'], axis=1).head()
# except:
#     pass
# data = pd.concat([data,pd.DataFrame(X_embedded2)], axis =1)
# # test_df.head()
# data.rename(columns={0:'PCA1',1:'PCA2'}, inplace=True)
# data.columns

In [16]:
data.sample(4)

Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,...,field_citation_ratio,highly_cited_1,highly_cited_10,highly_cited_5,recent_citations,relative_citation_ratio,times_cited,Citations,TSNE1,TSNE2
6107,10.1016/j.neuroimage.2013.02.070,2013,Spontaneous EEG alpha oscillation interacts wi...,"Mayhew, SD; Ostwald, D; Porcaro, C; Bagshaw, AP","Mayhew, Stephen D.; Ostwald, Dirk; Porcaro, Ca...",NEUROIMAGE,"Neurosciences & Neurology; Radiology, Nuclear ...",1053-8119,,27,...,6.12,False,False,False,18.0,2.15,32.0,32.0,31.124981,-27.585726
7513,10.1016/j.neuron.2013.07.002,2013,Distinct Representations of Cognitive and Moti...,"Matsumoto, M; Takada, M","Matsumoto, Masayuki; Takada, Masahiko",NEURON,Neurosciences & Neurology,0896-6273,1097-4199,47,...,9.75,False,True,True,28.0,2.86,51.0,51.0,-29.711529,-12.323181
5677,10.1016/j.neuroimage.2013.09.057,2014,Dynamic and static contributions of the cerebr...,"Tak, S; Wang, DJJ; Polimeni, JR; Yan, LR; Chen...","Tak, Sungho; Wang, Danny J. J.; Polimeni, Jona...",NEUROIMAGE,"Neurosciences & Neurology; Radiology, Nuclear ...",1053-8119,1095-9572,11,...,3.25,False,False,False,7.0,1.76,16.0,16.0,0.684785,-32.77692
7094,10.1016/j.neuron.2014.10.014,2014,Cortical fosGFP Expression Reveals Broad Recep...,"Jouhanneau, JS; Ferrarese, L; Estebanez, L; Au...","Jouhanneau, Jean-Sebastien; Ferrarese, Leiron;...",NEURON,Neurosciences & Neurology,0896-6273,1097-4199,13,...,3.25,False,False,False,12.0,1.0,16.0,16.0,19.252451,-5.27196


In [17]:
np.shape(data)

(7713, 23)

### Output

In [18]:
data.sample(4)

Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,...,field_citation_ratio,highly_cited_1,highly_cited_10,highly_cited_5,recent_citations,relative_citation_ratio,times_cited,Citations,TSNE1,TSNE2
7408,10.1016/j.neuron.2013.09.028,2013,Causal Evidence of Performance Monitoring by N...,"Heilbronner, SR; Platt, ML","Heilbronner, Sarah R.; Platt, Michael L.",NEURON,Neurosciences & Neurology,0896-6273,1097-4199,16,...,4.78,False,False,False,13.0,1.16,20.0,20.0,41.914577,27.885458
4463,10.1177/1352458513509507,2014,Hypovitaminosis-D and EBV: no interdependence ...,"Ramien, C; Pachnio, A; Sisay, S; Begum, J; Lee...","Ramien, Caren; Pachnio, Annette; Sisay, Sofia;...",MULTIPLE SCLEROSIS JOURNAL,Neurosciences & Neurology,1352-4585,1477-0970,7,...,2.0,False,False,False,9.0,0.9,11.0,11.0,-24.931896,-47.509399
7605,10.1016/j.neuron.2013.03.004,2013,"Plum, an Immunoglobulin Superfamily Protein, R...","Yu, XMM; Gutman, I; Mosca, TJ; Iram, T; Ozkan,...","Yu, Xiaomeng M.; Gutman, Itai; Mosca, Timothy ...",NEURON,Neurosciences & Neurology,0896-6273,,18,...,3.03,False,False,False,6.0,1.06,22.0,22.0,16.197462,33.14278
2561,10.1523/JNEUROSCI.3029-13.2013,2013,The Transcription Factor Serum Response Factor...,"Stern, S; Haverkamp, S; Sinske, D; Tedeschi, A...","Stern, Sina; Haverkamp, Stephanie; Sinske, Dan...",JOURNAL OF NEUROSCIENCE,Neurosciences & Neurology,0270-6474,,20,...,3.08,False,False,False,11.0,1.07,20.0,20.0,1.311239,1.133781


In [19]:
# write to file
data.to_csv(working_data)
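
Because the output overwrites the same `working_data` file that was read at the start, a later run will load `TSNE1` and `TSNE2` back in, which is why those columns are dropped before a new embedding is attached. A small sketch of that round trip, using an in-memory buffer and made-up values instead of the real file:

```python
import io
import pandas as pd

# Made-up frame standing in for the notebook's data
df = pd.DataFrame({'DI': ['a', 'b'], 'TSNE1': [0.1, 0.2]}, index=[10, 11])

buffer = io.StringIO()   # in-memory stand-in for the working_data file
df.to_csv(buffer)        # the index is written as an unnamed first column
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)  # same call pattern as the load step

print(list(reloaded.columns))
print(list(reloaded.index))
```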

In [20]:
print('Done in '+str(dt.now()-t_start))

Done in 0:10:37.121142
