<a href="https://colab.research.google.com/github/restrepo/cms_pub/blob/main/classify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning classifier for CMS publications
We use the CMS publications with the manually defined categories at:

https://cms-results.web.cern.ch/cms-results/public-results/publications/

to predict the category of a new CMS publication.

We will follow the `NaiveBayesClassifier` tutorial from `textblob` at:

https://textblob.readthedocs.io/en/dev/classifiers.html#classifiers
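Before using the library, it may help to see the idea behind a naive Bayes text classifier in a few lines. The sketch below is a simplified word-count variant with add-one smoothing, written only for illustration. It is not the `textblob` implementation, which is built on NLTK's classifier and differs in its details. The function names `train_naive_bayes` and `classify` and the toy training sentences are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count labels and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # Log prior: fraction of training samples carrying this label
        score = math.log(count / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen word/label pairs
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training set (not taken from the CMS data)
toy = [
    ("search for supersymmetry with missing energy", "SUS"),
    ("search for squarks and gluinos", "SUS"),
    ("measurement of top quark mass", "TOP"),
    ("top quark pair production cross section", "TOP"),
]
model = train_naive_bayes(toy)
print(classify("top quark production", model))  # → TOP
print(classify("search for gluinos", model))    # → SUS
```

The "naive" part is the assumption that words contribute independently to the score, which is why the per-word log-probabilities are simply added.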

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from textblob.classifiers import NaiveBayesClassifier
nltk.download('punkt')  

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Load the data and prepare the training data set based on the title and abstract of each publication and its assigned category, `'CTG'`.

In [2]:
df=pd.read_json('https://raw.githubusercontent.com/restrepo/cms_pub/main/cms.json')

Sample of the data

In [3]:
df[:2]

| | CMS_id | report | title | journal | date | CTG | category | inspire_id | abstract |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | FSQ-16-006 | Study of central exclusive $\pi^{+}\pi^{-}$ pr... | EPJC 80 (2020) 718 | 2020-03-05 | FSQ | Forward and Small-x QCD Physics | 1784063 | Central exclusive and semiexclusive production... |
| 1 | 57 | FSQ-12-033 | Measurement of single-diffractive dijet produc... | EPJC 80, 1164 (2020) | 2020-02-27 | FSQ | Forward and Small-x QCD Physics | 1782637 | Measurements are presented of the single-diffr... |


Prepare the data

In [5]:
df=df[df['abstract'].fillna('')!=''].reset_index(drop=True)
df['text']=df['title']+' '+df['abstract']
df=df.rename({'CTG':'label'},axis='columns')

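# Note: stopwords.words() without a language argument returns the stop words of every language shipped with NLTK
stop_words=set(stopwords.words())  # a set gives fast membership tests
def simplify(text,stop_words=stop_words):
    # split() already collapses any run of whitespace
    text_tokens = text.split()
    # Filtering happens before lowercasing, so capitalized stop words (e.g. 'The') are kept
    filtered_words = [w for w in text_tokens if w not in stop_words]
    return " ".join(filtered_words).lower()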

df['text']=df['text'].apply(simplify)

print('total publication count:',df.shape[0])

total publication count: 912
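The order of the two cleaning steps matters: if stop words are removed before the text is lowercased, a capitalized stop word is not matched and survives. The sketch below compares both orders on a made-up title, using a tiny hypothetical stop-word set instead of the NLTK list. The function names are ours and are not part of the notebook.

```python
# Tiny made-up stop-word set; the notebook uses NLTK's full list instead
STOP_WORDS = {"the", "of", "in", "a", "for"}


def simplify_filter_first(text, stop_words=STOP_WORDS):
    """Drop stop words first, then lowercase (the order used in the cell above)."""
    tokens = text.split()
    return " ".join(w for w in tokens if w not in stop_words).lower()


def simplify_lower_first(text, stop_words=STOP_WORDS):
    """Lowercase first, then drop stop words."""
    tokens = text.lower().split()
    return " ".join(w for w in tokens if w not in stop_words)


title = "Search for the Higgs boson in The CMS detector"
print(simplify_filter_first(title))  # → search higgs boson the cms detector
print(simplify_lower_first(title))   # → search higgs boson cms detector
```

Switching to the lowercase-first order would change the training text, so the accuracies quoted in this notebook would need to be recomputed.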


Shuffle the DataFrame

In [6]:
dfr=df.sample(df.shape[0])
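Because the shuffle is random, the train/test split and therefore the accuracies change from run to run. One possible way to make a run repeatable is to pass a fixed `random_state` to `DataFrame.sample`. The sketch below shows this on a small stand-in DataFrame with made-up rows. The seed value `42` is an arbitrary choice and is not used in this notebook.

```python
import pandas as pd

# Small stand-in DataFrame (made-up rows, not the CMS data)
toy = pd.DataFrame({"text": ["a", "b", "c", "d", "e"],
                    "label": ["SUS", "TOP", "HIG", "EXO", "SMP"]})

# frac=1 returns all rows in random order; random_state fixes the permutation
shuffled_1 = toy.sample(frac=1, random_state=42)
shuffled_2 = toy.sample(frac=1, random_state=42)

# The same seed gives the same order on every call
print(shuffled_1.index.tolist() == shuffled_2.index.tolist())
# Every original row appears exactly once
print(sorted(shuffled_1.index) == list(toy.index))
```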

Train the classifier on the `train` dataset and calculate its accuracy on the held-out `test` dataset

In [7]:
n_train=800
train=dfr[['text','label']][:n_train].reset_index(drop=True)
test =dfr[['text','label']][n_train:].reset_index(drop=True)

# Train the classifier with the `train` dataset
cl = NaiveBayesClassifier(  
    [ (d.get('text'),d.get('label')) for d in train.to_dict(orient='records')]  )

# Check the accuracy of the classifier with the `test` dataset
print('accuracy → ',
      round( cl.accuracy( [ (d.get('text'),d.get('label')) for d in test.to_dict(orient='records')] ),2)
     )

accuracy →  0.81


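Per-category accuracies: several are at or above the 90% level. Those would correspond to well-defined categories. The number of test samples per category is small, so each individual value is only a rough estimate.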

In [8]:
for c in test['label'].unique():
    print(c,'→',
          round(cl.accuracy( [ (d.get('text'),d.get('label')) for d in test[test['label']==c].to_dict(orient='records')]),2),
          f" samples → {test[test['label']==c].shape[0]}")

FSQ → 0.5  samples → 6
SMP → 0.93  samples → 15
SUS → 0.75  samples → 12
EXO → 0.93  samples → 27
TOP → 0.75  samples → 12
HIN → 1.0  samples → 9
HIG → 0.88  samples → 17
BPH → 0.62  samples → 8
B2G → 0.33  samples → 6
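As a consistency check, the per-category numbers can be recombined into the overall accuracy and compared with a trivial baseline that always predicts the most frequent category. The sketch below uses only the values printed above. The majority-class baseline is our addition for comparison and is not computed elsewhere in this notebook.

```python
# (accuracy, number of test samples) per category, copied from the output above
per_category = {
    "FSQ": (0.50, 6), "SMP": (0.93, 15), "SUS": (0.75, 12),
    "EXO": (0.93, 27), "TOP": (0.75, 12), "HIN": (1.00, 9),
    "HIG": (0.88, 17), "BPH": (0.62, 8), "B2G": (0.33, 6),
}

n_test = sum(n for _, n in per_category.values())
# Rounded accuracy times sample count recovers the number of correct predictions
n_correct = sum(round(acc * n) for acc, n in per_category.values())
overall = n_correct / n_test

# Baseline: always predict the most frequent category in the test set
majority_label, (_, majority_n) = max(per_category.items(), key=lambda kv: kv[1][1])
baseline = majority_n / n_test

print(n_test, n_correct)                   # → 112 91
print(round(overall, 2))                   # → 0.81
print(majority_label, round(baseline, 2))  # → EXO 0.24
```

The 112 test samples match 912 − 800, and the recombined accuracy reproduces the 0.81 obtained before, well above the 0.24 of the majority-class baseline.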


Check the individual predictions, including the probability assigned to the predicted category

In [9]:
fulltest=test.copy()
fulltest['test']=fulltest['text'].apply(cl.classify)

fulltest['prob']=fulltest['text'].apply(lambda t:   
                                cl.prob_classify( t ).prob(   cl.classify( t  )  ) ).round(2)

Show the ones that failed.
The following groups of categories are expected to be confused with one another:

* `[B2G,EXO,SUS]`
* `[HIG,TOP,SMP,FSQ]`

With these overlaps in mind, most of the errors are easy to understand.

__Hypothesis__: In the network analysis, these failed publications would lie at the boundaries between clusters.

In [10]:
fulltest[fulltest['label']!=fulltest['test']]

| | text | label | test | prob |
|---|---|---|---|---|
| 0 | exclusive semi-exclusive $\pi^{+}\pi^{-}$ prod... | FSQ | SMP | 1.0 |
| 2 | search dark matter supersymmetry compressed ma... | SUS | EXO | 1.0 |
| 8 | search anomalous single top quark production a... | TOP | EXO | 1.0 |
| 9 | observation diffractive contribution dijet pro... | FSQ | SMP | 0.98 |
| 11 | search massive resonance decaying higgs boson ... | EXO | B2G | 1.0 |
| 12 | study bose-einstein correlations pp, ppb, pbpb... | FSQ | HIN | 1.0 |
| 18 | search $s$ channel single top quark production... | TOP | HIG | 0.98 |
| 19 | measurement $ \lambda_{\mathrm{b}} $ polarizat... | BPH | SMP | 0.57 |
| 21 | combination searches heavy resonances decaying... | B2G | EXO | 1.0 |
| 22 | search $r$-parity violating supersymmetry disp... | SUS | EXO | 1.0 |
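To see how well the expected groups describe the failures, we can count how many of the misclassified publications listed above were assigned to a category in the same group as their true label. The sketch below uses only the `(label, test)` pairs from the displayed rows. The helper `same_group` is a hypothetical name introduced for this check.

```python
# Groups of categories expected to be confused, as listed above
groups = [{"B2G", "EXO", "SUS"}, {"HIG", "TOP", "SMP", "FSQ"}]

# (true label, predicted label) pairs copied from the failures displayed above
failures = [
    ("FSQ", "SMP"), ("SUS", "EXO"), ("TOP", "EXO"), ("FSQ", "SMP"),
    ("EXO", "B2G"), ("FSQ", "HIN"), ("TOP", "HIG"), ("BPH", "SMP"),
    ("B2G", "EXO"), ("SUS", "EXO"),
]


def same_group(label, predicted):
    """True if both categories belong to one of the expected groups."""
    return any(label in g and predicted in g for g in groups)


within = [pair for pair in failures if same_group(*pair)]
outside = [pair for pair in failures if not same_group(*pair)]
print(len(within), "of", len(failures))  # → 7 of 10
print(outside)  # → [('TOP', 'EXO'), ('FSQ', 'HIN'), ('BPH', 'SMP')]
```

Seven of the ten displayed failures stay within an expected group. The remaining three involve `HIN` and `BPH`, which belong to neither group, and one `TOP` publication classified as `EXO`.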
