# First Test of The Model With Scoring Method

Testing the model after training it on one of the forty files in the dataset. Unfortunately, the trained model takes about 1 GB, so I cannot upload it to GitHub.

In [1]:
import gensim
import pandas as pd
import numpy as np

Loading the filtered dataset

In [2]:
data_dir = './data/csvTrain/'
filename = 's2-corpus-002_lang_cites.csv'

In [3]:
df = pd.read_csv(data_dir+filename)

Finding a paper from PRL (I have some domain knowledge in the field):

In [4]:
df[df.journalName.str.contains('review letters')].head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,paperAbstract,title,year,journalName,numCites,lang
1885,1917,1917,c367c5744775fa3d79404e1a5ad8cb0f9107655f,The branching fraction ratio R(D^{*})≡B(B[over...,Measurement of the ratio of branching fraction...,2015.0,Physical review letters,1,English
2906,2952,2952,3758ce746008b7d6ccc06d49e57a235a4c21c6f0,We grew tetragonally distorted FexCo1-x alloy ...,Perpendicular magnetic anisotropy induced by t...,2006.0,Physical review letters,3,English
6320,6422,6422,202dd914cbaadbd5ee76c1ddc919750f949c669b,"We introduce state-independent, nonperturbativ...",Quantum Speed Limits for Leakage and Decoherence.,2015.0,Physical review letters,5,English
8949,9082,9082,e93d5a8632797df8201733ff347f81e5cc3ef5a8,In this study we report on jumps in the magnet...,Interaction-induced partitioning and magnetiza...,2011.0,Physical review letters,0,English
9290,9427,9427,1384974a2b790d01ffa237bd084eec0af62688f7,Quantum teleportation faces increasingly deman...,"Remote preparation of single-photon ""hybrid"" e...",2010.0,Physical review letters,24,English


Row 6320 seems promising. It's quantum mechanics, and according to the abstract I would put it in either a condensed matter journal or a physical chemistry journal, as this kind of quantum calculation is heavily used in computational chemistry (e.g. PRL, PRE, The Journal of Chemical Physics, PCCP or something like that).<br>
Loading the model and finding the top 500 similar abstracts:

In [5]:
# The saved model is a Doc2Vec model (it exposes docvecs and infer_vector), so load it as such
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('./models/doc2vec/doc2vec_one_file_filter.model')

In [6]:
sims = model.docvecs.most_similar(df.iloc[6320].id, topn=500)
sims_id = [n for n,v in sims]
sims_score = np.exp(np.array([v for n,v in sims])*30)
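The `np.exp(similarity * 30)` transform in the cell above stretches small gaps in cosine similarity into large gaps in score. A minimal sketch with made-up similarity values (the numbers are illustrative assumptions, not model output) shows what the factor 30 does:

```python
import numpy as np

# Hypothetical cosine similarities, in the order most_similar() would return them.
similarities = np.array([0.30, 0.25, 0.20])

# Same transform as in the notebook: exponentiate with a sharpness factor of 30.
sharpness = 30
scores = np.exp(similarities * sharpness)

# Each 0.05 step in similarity changes the score by the same constant factor.
step_ratio = scores[0] / scores[1]
print(scores)
print(step_ratio)  # equals exp(30 * 0.05) = exp(1.5)
```

With this factor, a document at similarity 0.30 outweighs one at 0.20 by roughly a factor of 20, so the closest matches dominate any mean taken over the scores.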

In [7]:
df_sims = df[df.id.isin(sims_id)].copy()  # copy, so the new column is not written to a slice of df
df_sims['score'] = sims_score
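One thing to watch in the cell above: `df[df.id.isin(sims_id)]` returns rows in the order they appear in `df`, while `sims_score` is ordered by similarity, so a positional assignment can pair a score with the wrong paper. A sketch of an id-based alignment, using a toy frame with made-up ids and scores (all hypothetical):

```python
import pandas as pd

# Toy stand-ins for df and for the (id, score) pairs derived from most_similar().
df = pd.DataFrame({"id": ["a", "b", "c", "d"],
                   "journalName": ["J1", "J2", "J1", "J3"]})
sims_id = ["c", "a", "d"]      # ordered by similarity, best match first
sims_score = [9.0, 4.0, 1.0]

# Map each id to its score, then look the score up row by row.
score_by_id = dict(zip(sims_id, sims_score))
df_sims = df[df.id.isin(sims_id)].copy()
df_sims["score"] = df_sims["id"].map(score_by_id)

# Sort so that iloc[0] is the closest match.
df_sims = df_sims.sort_values("score", ascending=False)
print(df_sims)
```

After the sort, `df_sims.iloc[0]` is the most similar paper, which makes spot checks of individual abstracts easier to interpret.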
  


In [8]:
df.iloc[6320].paperAbstract

'We introduce state-independent, nonperturbative Hamiltonian quantum speed limits for population leakage and fidelity loss, for a gapped open system interacting with a reservoir. These results hold in the presence of initial correlations between the system and the reservoir, under the sole assumption that their interaction and its commutator with the reservoir Hamiltonian are norm bounded. The reservoir need not be thermal and can be time dependent. We study the significance of energy mismatch between the system and the local degrees of freedom of the reservoir that directly interact with the system. We demonstrate that, in general, by increasing the system gap we may reduce this energy mismatch, and, consequently, drive the system and the reservoir into resonance; this can accelerate fidelity loss, irrespective of the thermal properties or state of the reservoir. This implies that quantum error suppression strategies based on increasing the gap are not uniformly beneficial. Our speed 

In [9]:
df_sims.iloc[1].paperAbstract

'Despite the considerable evidence showing that dispersal between habitat patches is often asymmetric, most of the metapopulation models assume symmetric dispersal. In this paper, we develop a Monte Carlo simulation model to quantify the effect of asymmetric dispersal on metapopulation persistence. Our results suggest that metapopulation extinctions are more likely when dispersal is asymmetric. Metapopulation viability in systems with symmetric dispersal mirrors results from a mean field approximation, where the system persists if the expected per patch colonization probability exceeds the expected per patch local extinction rate. For asymmetric cases, the mean field approximation underestimates the number of patches necessary for maintaining population persistence. If we use a model assuming symmetric dispersal when dispersal is actually asymmetric, the estimation of metapopulation persistence is wrong in more than 50% of the cases. Metapopulation viability depends on patch connectivi

In [10]:
from sklearn.preprocessing import MinMaxScaler
def score_sites(df_sims, citesWeight=.1, yearWeight=1, scoreWeight=0.1):
    """
    The scoring function.
    For each abstract, the number of citations, publication year and model score are each
    min-max scaled, and a weighted sum defines the combined score. The abstracts are then
    aggregated by journal, and the number of papers from each journal is multiplied by the
    mean combined score of its abstracts to get the final ranking.

    @citesWeight : weight for the number of citations
    @yearWeight : weight for the publication year
    @scoreWeight : weight for the model similarity score

    @returns: a pandas DataFrame with the journal ranking.
    """
    df_sims = df_sims.copy()  # work on a copy so the caller's frame is left untouched
    scaler = MinMaxScaler()

    # Scale each feature to [0, 1]; fit_transform expects a 2D (n_samples, 1) input
    scaledCites = scaler.fit_transform(df_sims[['numCites']])
    scaledYear = scaler.fit_transform(df_sims[['year']])
    scaledScore = scaler.fit_transform(df_sims[['score']])
    df_sims['combinedScore'] = (citesWeight * scaledCites + yearWeight * scaledYear + scoreWeight * scaledScore).flatten()

    # Aggregate per journal; after count(), the 'id' column holds the number of papers
    groups = df_sims.groupby('journalName')
    df_sims2 = pd.DataFrame(groups.id.count())
    df_sims2['meanCites'] = groups.numCites.mean()
    df_sims2['meanScore'] = groups.score.mean()
    df_sims2['combinedScoreMean'] = groups.combinedScore.mean()
    # Keep only journals with more than 3 similar papers
    df_sims2 = df_sims2[df_sims2.id > 3].copy()
    df_sims2['finalScore'] = df_sims2.id * df_sims2.combinedScoreMean
    return df_sims2
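To make the weighted sum concrete, here is a small worked example on three made-up papers. It re-implements min-max scaling by hand in pandas instead of calling `MinMaxScaler`, which is an assumption made to keep the sketch dependency-free; the toy values and the `min_max` helper are hypothetical, and the "more than 3 papers" filter is left out because the toy set is so small.

```python
import pandas as pd

def min_max(col):
    """Rescale a numeric column to [0, 1] (hypothetical helper)."""
    return (col - col.min()) / (col.max() - col.min())

# Three made-up papers from two journals.
toy = pd.DataFrame({
    "journalName": ["A", "A", "B"],
    "numCites": [0, 10, 5],
    "year": [2005.0, 2015.0, 2010.0],
    "score": [100.0, 300.0, 200.0],
})

# Same default weights as score_sites: the year term dominates.
cites_w, year_w, score_w = 0.1, 1.0, 0.1
toy["combinedScore"] = (cites_w * min_max(toy.numCites)
                        + year_w * min_max(toy.year)
                        + score_w * min_max(toy.score))

# Per journal: paper count times mean combined score.
per_journal = toy.groupby("journalName").combinedScore.agg(["count", "mean"])
per_journal["finalScore"] = per_journal["count"] * per_journal["mean"]
print(per_journal)
```

Both toy journals end up with the same mean combined score (0.6), so journal A ranks higher purely because it contributes two papers instead of one.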

In [13]:
df_sims2 = score_sites(df_sims)
df_sims2.sort_values(by='finalScore', ascending=False).head(20)



journalName,id,meanCites,meanScore,combinedScoreMean,finalScore
The Journal of chemical physics,26,3.769231,2671.006723,0.869952,22.61874
"Physical review. E, Statistical, nonlinear, and soft matter physics",19,4.631579,2676.943138,0.804445,15.284459
Physical review letters,17,3.705882,3031.382269,0.81217,13.806895
CoRR,13,1.846154,4231.727015,0.91425,11.885255
Proceedings of the National Academy of Sciences of the United States of America,14,13.928571,2306.292477,0.79927,11.189775
Physical review. E,7,0.285714,2140.031539,0.965802,6.760612
Journal of the American Chemical Society,8,3.5,2119.064206,0.796398,6.371184
Journal of physics. Condensed matter : an Institute of Physics journal,7,1.142857,2120.546383,0.852909,5.970365
The journal of physical chemistry. B,7,4.714286,2223.644915,0.843937,5.907562
Journal of chemical theory and computation,6,0.833333,2646.387752,0.95017,5.701019


PRL is there, and most of the journals are in either condensed matter or physical chemistry. There are obviously not enough samples, and using the full dataset should fix that. I also need to think of a way to normalize by the total number of publications in a journal (i.e. a journal with tons of papers that are only somewhat similar to our abstract will currently be ranked high).
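One possible way to do that normalization, sketched here as an assumption rather than the approach eventually used: divide the number of similar papers from each journal by the number of papers that journal has in the whole corpus. The toy frames and the column name `hitShare` are hypothetical.

```python
import pandas as pd

# Toy corpus: journal "Big" publishes far more papers than journal "Small".
corpus = pd.DataFrame({"journalName": ["Big"] * 100 + ["Small"] * 10})

# Toy set of similar papers: 8 hits in "Big", 4 hits in "Small".
hits = pd.DataFrame({"journalName": ["Big"] * 8 + ["Small"] * 4})

total_per_journal = corpus.journalName.value_counts()
hits_per_journal = hits.journalName.value_counts()

# Share of each journal's papers that landed among the similar abstracts.
ranking = pd.DataFrame({"hits": hits_per_journal,
                        "total": total_per_journal})
ranking["hitShare"] = ranking["hits"] / ranking["total"]
ranking = ranking.sort_values("hitShare", ascending=False)
print(ranking)
```

By raw counts "Big" would rank first (8 hits against 4), while by share "Small" comes out ahead. A caveat with this choice is that journals with very few papers in the corpus give noisy shares, so a minimum-count filter like the one in `score_sites` would probably still be needed.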

Next, let's try a random abstract from 'Human Genetics' that I pulled from Google:

In [14]:
abst = 'Single nucleotide polymorphisms (SNPs) constitute the bulk of human genetic variation, occurring with an average density of ∼1/1000 nucleotides of a genotype. SNPs are either neutral allelic variants or are under selection of various strengths, and the impact of SNPs on fitness remains unknown. Identification of SNPs affecting human phenotype, especially leading to risks of complex disorders, is one of the key problems of medical genetics. SNPs in protein-coding regions that cause amino acid variants (non-synonymous cSNPs) are most likely to affect phenotypes. We have developed a straightforward and reliable method based on physical and comparative considerations that estimates the impact of an amino acid replacement on the three-dimensional structure and function of the protein. We estimate that ∼20% of common human non-synonymous SNPs damage the protein. The average minor allele frequency of such SNPs in our data set was two times lower than that of benign non-synonymous SNPs. The average human genotype carries approximately 103 damaging non-synonymous SNPs that together cause a substantial reduction in fitness.'

In [15]:
abst2 = abst.lower().split(' ')

In [16]:
bb = model.infer_vector(abst2)

In [17]:
# Reuse the vector inferred above; infer_vector is stochastic, so calling it again gives a different vector
sims = model.docvecs.most_similar(positive=[bb], topn=5000)
sims_id = [n for n,v in sims]
sims_score = np.exp(np.array([v for n,v in sims])*30)

In [18]:
df_sims = df[df.id.isin(sims_id)].copy()  # copy, so the new column is not written to a slice of df
df_sims['score'] = sims_score
  


In [19]:
df_sims2 = score_sites(df_sims)
df_sims2.sort_values(by='finalScore', ascending=False).head(20)


journalName,id,meanCites,meanScore,combinedScoreMean,finalScore
Proceedings of the National Academy of Sciences of the United States of America,67,77.313433,585.738023,0.74811,50.123371
The Journal of biological chemistry,53,23.698113,674.159148,0.71951,38.134025
Genetics,29,19.931034,623.716554,0.763917,22.153597
Gene,27,9.37037,659.067914,0.805058,21.736559
Methods in molecular biology,21,3.0,631.041874,0.91257,19.163968
Genome research,23,55.434783,534.222207,0.82704,19.021928
"Journal of chromatography. B, Analytical technologies in the biomedical and life sciences",21,5.666667,574.547938,0.860456,18.069574
American journal of human genetics,23,37.347826,509.190381,0.782092,17.98812
Human molecular genetics,22,55.409091,508.513719,0.816138,17.955032
CoRR,19,6.157895,674.727007,0.941688,17.892069


I have absolutely no knowledge in the field, but it seems reasonable.

Lastly, I'm gonna try one of my papers:

In [20]:
abst= 'The main challenge in predicting sliding friction is related to the complexity of highly \
nonequilibrium processes, the kinetics of which are controlled by the interface temperature. \
Our experiments reveal a nonmonotonic enhancement of dry nanoscale friction at cryogenic \
temperatures for different material classes. Concerted simulations show that it emerges from \
two competing processes acting at the interface: the thermally activated formation as well as \
rupturing of an ensemble of atomic contacts. These results provide a new conceptual \
framework to describe the dynamics of dry friction'

In [21]:
abst2 = abst.lower().split(' ')
bb = model.infer_vector(abst2)
# Reuse the inferred vector instead of inferring a second, different one
sims = model.docvecs.most_similar(positive=[bb], topn=5000)
sims_id = [n for n,v in sims]
sims_score = np.exp(np.array([v for n,v in sims])*30)
df_sims = df[df.id.isin(sims_id)].copy()  # copy, so the new column is not written to a slice of df
df_sims['score'] = sims_score

df_sims2 = score_sites(df_sims)
df_sims2.sort_values(by='finalScore', ascending=False).head(20)


journalName,id,meanCites,meanScore,combinedScoreMean,finalScore
The Journal of biological chemistry,115,14.721739,825.957468,0.774179,89.030574
Langmuir : the ACS journal of surfaces and colloids,97,2.453608,710.381089,0.888147,86.150222
The Journal of chemical physics,92,1.347826,567.726906,0.883199,81.254343
Proceedings of the National Academy of Sciences of the United States of America,94,25.691489,599.895766,0.814914,76.601876
Physical chemistry chemical physics : PCCP,79,1.113924,670.41954,0.927414,73.265734
Journal of the American Chemical Society,83,5.831325,682.779292,0.880732,73.10078
"Physical review. E, Statistical, nonlinear, and soft matter physics",76,2.578947,783.665615,0.856364,65.083641
The journal of physical chemistry. B,73,2.534247,694.160339,0.87788,64.085263
ACS applied materials & interfaces,60,1.75,615.797727,0.957149,57.428913
Biochemistry,81,10.209877,717.293664,0.693466,56.17075


Now I get The Journal of Biological Chemistry as the first result, which is an error, but the rest of them are more accurate. It is also noticeable that the combined score (`combinedScoreMean`) of this journal is much lower than the average; it's just the sheer number of papers there that tips the scale. I still need to compensate for journals with lots of papers in them...
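As a rough illustration of how strongly the paper count drives this ranking, the sketch below recomputes `finalScore` for five rows copied from the table above and compares it with a log-damped count. The damping is only one hypothetical option, not something the notebook implements, and `damped_score` is a made-up name.

```python
import math

# (journal, paper count, combinedScoreMean) copied from the table above.
rows = [
    ("The Journal of biological chemistry", 115, 0.774179),
    ("Langmuir : the ACS journal of surfaces and colloids", 97, 0.888147),
    ("The Journal of chemical physics", 92, 0.883199),
    ("Physical chemistry chemical physics : PCCP", 79, 0.927414),
    ("ACS applied materials & interfaces", 60, 0.957149),
]

def final_score(count, mean):
    """Ranking used in score_sites: count times mean combined score."""
    return count * mean

def damped_score(count, mean):
    """Hypothetical alternative: log-damped count times mean combined score."""
    return math.log1p(count) * mean

by_final = sorted(rows, key=lambda r: final_score(r[1], r[2]), reverse=True)
by_damped = sorted(rows, key=lambda r: damped_score(r[1], r[2]), reverse=True)

print([r[0] for r in by_final])
print([r[0] for r in by_damped])
```

Among these five rows, the biological chemistry journal drops from first to last place under the damped score, since its mean combined score is the lowest of the five and the count no longer compensates for it.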

So... this needs more work, but it seems like it's going in the right direction. Next I'll train a GloVe model and test the mean vector distance for similarity (there is some research which suggests this can work better for short documents).