<a href="https://colab.research.google.com/github/thomasavare/DNLP-project/blob/master/SPECTER_mesh_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook applies the MeSH classification task presented in the [SciDocs](https://github.com/allenai/scidocs) GitHub repository. This classification consists of 11 different classes.

| int | Class label |
| --- | ----------- |
| 0   | Cardiovascular diseases |
| 1   | Chronic kidney disease |
| 2   | Chronic respiratory diseases |
| 3   | Diabetes mellitus |
| 4   | Digestive disease |
| 5   | HIV/AIDS |
| 6   | Hepatitis A/B/C/E |
| 7   | Mental disorders |
| 8   | Musculoskeletal disorders |
| 9   | Neoplasms (cancer) |
| 10  | Neurological disorders |

With that said, let's begin.

First we import the [SPECTER](https://github.com/allenai/specter) embedding model. We are going to use the model available through the `sentence-transformers` library.

In [1]:
pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from sentence_transformers import SentenceTransformer, util

# load model and tokenizer
model = SentenceTransformer('allenai-specter')

We are now going to import the data used to train the models in the SciDocs benchmark. This includes the metadata (paper id, title, abstract and citation graph) and the class associated with each paper.

In [3]:
!pip install awscli
!aws s3 sync --no-sign-request s3://ai2-s2-research-public/specter/scidocs/ data/

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import pandas as pd

train_path, test_path = '/content/data/mesh/train.csv', '/content/data/mesh/test.csv'
train_df, test_df = pd.read_csv(train_path), pd.read_csv(test_path)
train_df

Unnamed: 0,pid,class_label
0,dda7899c46b7764ed16ab3092f4f9476a9cecedf,9
1,16d289c10d9834d6fcb2870ba0d6cb5fd839e9a3,7
2,a1337bbf7cff59dcd9d63600ac80eb90f191a26e,8
3,3b24a936f546fd65a9bfbeb79496a5edc6d0d419,9
4,be0a5a24498086d9544d0c9dd328a9a76a8aadb2,8
...,...,...
16473,f85e9cbd0ebbdc89873b945eb98d22f7d9211957,9
16474,c4165de55761f1bb97c8e7c5224c7617a061d1bc,8
16475,993ec23465dce0d7067d674b5b21426d716f372b,9
16476,1437f6adbedcfc60432df230128383d98997f93e,2


Unfortunately, the HuggingFace SPECTER model doesn't reproduce the embeddings provided with the dataset, so we have to recompute them. I did it once and saved the result so that I never have to do it again, because it took 3 hours. Here's the code that I used.

----
```{python}
from tqdm import tqdm

# Encode "title[SEP]abstract" for every paper and store the vector under its paper id
for index, row in tqdm(data_df.iterrows(), total=len(data_df.index)):
  data_df['embedding'][row['paper_id']] = model.encode(row['title'] + '[SEP]' + row['abstract'])
```

100%|██████████| 48473/48473 [3:05:07<00:00, 4.36it/s]

----
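SPECTER expects the title and the abstract joined by the `[SEP]` token, which is also what `classify_article` does later in this notebook. Some papers in the metadata have no abstract, so here is a minimal sketch of how the input string could be built, assuming a missing abstract should be treated as an empty string. The helper name `build_specter_input` is hypothetical and not part of the notebook.

```python
import math

def build_specter_input(title, abstract=None, sep="[SEP]"):
    """Join title and abstract into the single string passed to the encoder."""
    # A missing abstract may show up as None or as NaN (a float) after loading
    if abstract is None or (isinstance(abstract, float) and math.isnan(abstract)):
        abstract = ""
    return title + sep + abstract

print(build_specter_input("ON THE CLASSIFICATION OF THE SCIENCES"))
print(build_specter_input("A title", "An abstract."))
```

The result can then be passed to `model.encode` in place of the inline concatenation.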

The data is currently on my personal Drive, but I will upload it to my GitHub so that it can be imported easily and save 3 hours for anyone who uses this notebook one day.

In [5]:
embeddings_df = pd.read_json('/content/drive/MyDrive/DNLP-project/embeddings_metadata_mag_mesh.json', orient='index')

In [6]:
embeddings_df

Unnamed: 0,paper_id,title,abstract,embedding
00014a8515491f0b3fe2a1ff6e0f5305e584dcd9,00014a8515491f0b3fe2a1ff6e0f5305e584dcd9,ON THE CLASSIFICATION OF THE SCIENCES,,"[-0.7367060184, 0.8094774485, -0.0390381552, -..."
00021eeee2bf4e06fec98941206f97083c38b54d,00021eeee2bf4e06fec98941206f97083c38b54d,Opportunities and challenges for E-Commerce in...,Numerous studies of E-Commerce have emphasized...,"[-1.1841081381, 1.1308963299, 0.6682422161, 0...."
00027baa2a90e1a3d50c1da14882d518de6847f5,00027baa2a90e1a3d50c1da14882d518de6847f5,Interactions between model membranes and ligni...,In order to elucidate the modes of interaction...,"[0.2389703691, 0.7384790778, 1.0119105577, -0...."
00034a5a5bd11b51ec046d31de273946d91fb766,00034a5a5bd11b51ec046d31de273946d91fb766,Perpetual Peace: What Kant Should Have Said,,"[-0.7743704319, 0.5894224048, -0.3162544966, 0..."
000c8d85037886c86de15290e5a8e9bae7b66103,000c8d85037886c86de15290e5a8e9bae7b66103,Reimagining Greek Tragedy on the American Stage,List of Illustrations Preface Introduction CHA...,"[0.0163563173, 0.50049299, -0.0730320513, 0.64..."
...,...,...,...,...
ffe534bb74efd04b0038510be2a4ed5072430c8b,ffe534bb74efd04b0038510be2a4ed5072430c8b,Comparison of olmesartan combined with a calci...,The cardiovascular effects of combined therapy...,"[-0.5972291231, 0.17011463640000002, 0.6428002..."
ffeaac2b94fc298676e9784ed3bd7a6a7c23b9d1,ffeaac2b94fc298676e9784ed3bd7a6a7c23b9d1,Role of angiotensin II in plasma PAI-1 changes...,To evaluate the relationship between plasma pl...,"[-0.3963649869, 0.5153160095, 0.4946651161, -0..."
ffecb03cbf4dccf9fcdc8b26f40f53d5bc44be66,ffecb03cbf4dccf9fcdc8b26f40f53d5bc44be66,Factors that influence cancer patients' anxiet...,BACKGROUND\nNo study has yet assessed the impa...,"[-0.3526456356, 0.3627114594, -0.8083767891, 0..."
fff238844076ad5643dc2ff53153581bd89441ea,fff238844076ad5643dc2ff53153581bd89441ea,Is adjuvant chemotherapy indicated in stage I ...,OBJECTIVE\nConservative surgery followed by pl...,"[0.5204859972, -0.07988817990000001, -0.430343..."


Now we import the associated MeSH classification and merge it with the embedding dataframe.

In [7]:
train_path, test_path = '/content/data/mesh/train.csv', '/content/data/mesh/test.csv'
train_df, test_df = pd.read_csv(train_path), pd.read_csv(test_path)
train_df.columns

Index(['pid', 'class_label'], dtype='object')

In [8]:
# Inner join on the paper id, then keep only the columns needed for training
keep = ['paper_id', 'embedding', 'class_label']
train_merged_df = pd.merge(embeddings_df, train_df, how='inner', left_on='paper_id', right_on='pid')[keep]
test_merged_df = pd.merge(embeddings_df, test_df, how='inner', left_on='paper_id', right_on='pid')[keep]

In [9]:
train_merged_df

Unnamed: 0,paper_id,embedding,class_label
0,fedb8360a09a326f403dcca14494e1da8a5f3adc,"[-1.0045875311, -0.1595053226, 0.2786692381, 0...",0
1,0000f668a76cbc2bffa4c5ec476c37724b585e41,"[-1.0261574984, -0.34777581690000003, -0.33172...",2
2,00020be3126ad6ea52b111a83cb3c49bd9a53677,"[-1.2241401672, 0.9126404524, 0.1266495585, 0....",3
3,00043777ed6209e46430fc343dabcf028ab5b03e,"[0.554428637, 0.6132897735, -0.0784010291, 0.5...",9
4,0004a7236789b7683a123eabeaa9355ca537aa69,"[-0.2132486999, 0.2546937168, 0.3828205764, -0...",0
...,...,...,...
16428,ffd92664ddfa10882b89a3eaf4aed9b319542c5e,"[-0.2729554772, 0.6822428107, -0.2828726470000...",7
16429,ffdfe54fa3957cc1f4a08b0d100afc5e448ff9e7,"[0.2677490711, -0.24520285430000002, -0.392297...",9
16430,ffe534bb74efd04b0038510be2a4ed5072430c8b,"[-0.5972291231, 0.17011463640000002, 0.6428002...",0
16431,ffeaac2b94fc298676e9784ed3bd7a6a7c23b9d1,"[-0.3963649869, 0.5153160095, 0.4946651161, -0...",0
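An inner merge only keeps the papers whose id appears in both tables, so a labelled paper without a stored embedding is dropped silently. The sketch below uses plain Python and made-up ids (not the real data) to show how such dropped ids could be listed before merging.

```python
# Toy stand-ins for embeddings_df['paper_id'] and train_df (made-up ids)
embedding_ids = {"p1", "p2", "p3", "p5"}
labelled = {"p1": 9, "p2": 7, "p4": 8}  # pid -> class_label

# Inner join: keep only the ids present on both sides
merged = {pid: label for pid, label in labelled.items() if pid in embedding_ids}

# Labelled papers that the inner join would drop
dropped = sorted(set(labelled) - embedding_ids)

print(merged)
print(dropped)
```

With pandas, a similar check is possible by merging with `how='left'` and `indicator=True`, then looking at the rows marked `left_only`.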


Perfect. Now that we have our train and test data, we can focus on training different models.

Let's put the data in a suitable format.

In [10]:
import numpy as np

x_train, y_train = [], []
x_test, y_test = [], []

# Each row of the merged frames is (paper_id, embedding, class_label)
for _, embedding, class_label in train_merged_df.values:
  x_train.append(embedding)
  y_train.append(class_label)

x_train = np.array(x_train)
y_train = np.array(y_train)

for _, embedding, class_label in test_merged_df.values:
  x_test.append(embedding)
  y_test.append(class_label)

x_test = np.array(x_test)
y_test = np.array(y_test)

First let's try a linear SVC.

In [11]:
from sklearn import svm

classifier1 = svm.LinearSVC(penalty='l2', loss='squared_hinge', max_iter=50000, random_state=42)

In [12]:
classifier1.fit(x_train, y_train)

LinearSVC(max_iter=50000, random_state=42)

In [13]:
from sklearn.metrics import f1_score

y_pred = classifier1.predict(x_test)
classifier1.score(x_test, y_test), f1_score(y_test, y_pred, average='macro')

(0.8217934165720772, 0.7857127929592772)
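The cell above reports two numbers: the accuracy and the macro-averaged F1. Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. Here is a small plain-Python sketch of both metrics on made-up labels; the helper names `macro_f1` and `accuracy` are only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class never seen nor predicted correctly scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels: class 2 is rare and always missed
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

In this toy case the missed rare class pulls the macro F1 well below the accuracy, which is the same pattern as in the scores above.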

Let's put it to the test with different articles.

In [14]:
def classify_article(classifier, title, abstract, cls=False, probas=False):
  """Classify one article.

  With cls=True, return the (decision score, class name) pairs for all classes
  instead of the predicted class name. `probas` is currently unused.
  """
  classes = {0: "Cardiovascular diseases", 1: "Chronic kidney disease", 2: "Chronic respiratory diseases", 3: "Diabetes mellitus",
             4: "Digestive disease", 5: "HIV/AIDS", 6: "Hepatitis A/B/C/E", 7: "Mental disorders", 8: "Musculoskeletal disorders",
             9: "Neoplasms (cancer)", 10: "Neurological disorders"}

  # SPECTER input: title and abstract joined by the [SEP] token
  embed = model.encode([title + '[SEP]' + abstract])

  if cls:
    return np.array(list(zip(classifier.decision_function(embed)[0], classes.values())), dtype=object)
  return classes[classifier.predict(embed)[0]]

Title: SPECTER: Document-level Representation Learning using Citation-informed Transformers

Abstract: Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embedding of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SCIDOCS, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.

Class: Computer science, which is not one of the 11 MeSH classes, so the classifier has no correct answer available.


In [15]:
res = classify_article(classifier1, "SPECTER: Document-level Representation Learning using Citation-informed Transformers", "Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embedding of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SCIDOCS, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.", cls=True)
res[np.argsort(-res[:, 0])]

array([[0.18188403460280544, 'Musculoskeletal disorders'],
       [0.0014438201551862928, 'Cardiovascular diseases'],
       [-0.6703056599495868, 'Neoplasms (cancer)'],
       [-1.188735031315706, 'Chronic respiratory diseases'],
       [-1.570818688226489, 'Hepatitis A/B/C/E'],
       [-1.7029991593739067, 'Neurological disorders'],
       [-1.9067350494895103, 'Mental disorders'],
       [-4.894820763527614, 'Diabetes mellitus'],
       [-5.172434206536423, 'Chronic kidney disease'],
       [-5.56194483012853, 'Digestive disease'],
       [-5.635450894921967, 'HIV/AIDS']], dtype=object)

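The classifier can only answer with a MeSH class: it picks "Musculoskeletal disorders", and the two best decision scores are barely above zero, so this is not a confident match.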

Title: Exploring The Concept Of Cubism Art Essay

Abstract: Cubism was one of the most influential art movements of the 20th century. It took place between 1907 and about 1914. The innovators of the Cubist movement were Pablo Picasso (Spanish, 1881 1973) and Georges Braque (French, 1882 1963). Cubism was one of the most significant changes in ideas in the history of art. It allowed for the development of many of the abstract modern art movements in areas such as Futurism and Constructivism.

Class: Art

In [16]:
res = classify_article(classifier1, "Exploring The Concept Of Cubism Art Essay", "Cubism was one of the most influential art movements of the 20th century. It took place between 1907 and about 1914. The innovators of the Cubist movement were Pablo Picasso (Spanish, 1881 1973) and Georges Braque (French, 1882 1963). Cubism was one of the most significant changes in ideas in the history of art. It allowed for the development of many of the abstract modern art movements in areas such as Futurism and Constructivism.", cls=True)
res[np.argsort(-res[:, 0])]

array([[-0.7300249163725618, 'Cardiovascular diseases'],
       [-0.7333489748110631, 'Neoplasms (cancer)'],
       [-0.7539828753068685, 'Musculoskeletal disorders'],
       [-0.9048945074057232, 'Chronic respiratory diseases'],
       [-1.0358776720878116, 'Neurological disorders'],
       [-1.970403438661797, 'Chronic kidney disease'],
       [-2.3168198993510534, 'Mental disorders'],
       [-3.777500497878009, 'Hepatitis A/B/C/E'],
       [-3.9904632917053258, 'Digestive disease'],
       [-4.080385763275245, 'HIV/AIDS'],
       [-5.649749347456169, 'Diabetes mellitus']], dtype=object)

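Every decision score is negative here, so none of the MeSH classes claims the article, which is reasonable for a text about art.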
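In a one-vs-rest linear SVC, a positive decision score means the sample falls on the positive side of that class's hyperplane. One possible way to handle out-of-domain articles is therefore to reject a prediction when no score is above a cut-off. The sketch below reuses the three best scores of each ranking above; the helper `predict_or_reject` and the threshold of 0 are assumptions for illustration, not something the notebook implements.

```python
def predict_or_reject(scored_classes, threshold=0.0):
    """Return the best class name, or None if no score exceeds the threshold.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    best_score, best_name = max(scored_classes, key=lambda pair: pair[0])
    return best_name if best_score > threshold else None

# Top three (score, class) pairs copied from the two outputs above
specter_scores = [(0.18188403460280544, 'Musculoskeletal disorders'),
                  (0.0014438201551862928, 'Cardiovascular diseases'),
                  (-0.6703056599495868, 'Neoplasms (cancer)')]
cubism_scores = [(-0.7300249163725618, 'Cardiovascular diseases'),
                 (-0.7333489748110631, 'Neoplasms (cancer)'),
                 (-0.7539828753068685, 'Musculoskeletal disorders')]

print(predict_or_reject(specter_scores))
print(predict_or_reject(cubism_scores))
```

With a threshold of 0 the Cubism essay is rejected, while the SPECTER abstract still slips through; a higher cut-off would be needed to reject it too, at the cost of rejecting some genuine medical articles.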

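Let's use a grid search to find the best set of parameters. This can take a long time because every candidate is trained once per cross-validation fold. F1 is the scoring function used in the SciDocs benchmark; note that the cell below keeps scikit-learn's default scoring (accuracy), while an earlier run, whose parameters are kept after the `get_params()` output, scored with macro F1.

A grid search gets long quickly when it has a lot of parameters. That earlier run tested 2 different losses and 7 different values of C, so it trained and compared 14 candidate models to extract the best one. The run below fixes the loss to the squared hinge and only searches over the 7 values of C, with 3-fold cross-validation.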

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

parameters = {'C': np.logspace(-4, 2, 7)}
svc = svm.LinearSVC(loss='squared_hinge')
clf1 = GridSearchCV(svc, parameters, cv=3, verbose=1)

In [26]:
clf1.fit(x_train, y_train)

Fitting 3 folds for each of 7 candidates, totalling 21 fits




GridSearchCV(cv=3, estimator=LinearSVC(),
             param_grid={'C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             verbose=1)
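The log line "Fitting 3 folds for each of 7 candidates, totalling 21 fits" can be reproduced by enumerating the grid by hand. The following sketch uses only the standard library and builds the same values of C that `np.logspace(-4, 2, 7)` produces.

```python
from itertools import product

# Same grid as above: C = 10^-4 ... 10^2
param_grid = {"C": [10.0 ** k for k in range(-4, 3)]}
cv = 3

# Every combination of parameter values is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# One fit per candidate and per cross-validation fold
n_fits = len(candidates) * cv

print(len(candidates), n_fits)
```

Since `refit` is left at `True`, one more fit on the whole training set follows for the best candidate. Adding a second parameter with two values to `param_grid` would double the number of candidates.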

In [27]:
params = clf1.get_params()
params

{'cv': 3,
 'error_score': nan,
 'estimator__C': 1.0,
 'estimator__class_weight': None,
 'estimator__dual': True,
 'estimator__fit_intercept': True,
 'estimator__intercept_scaling': 1,
 'estimator__loss': 'squared_hinge',
 'estimator__max_iter': 1000,
 'estimator__multi_class': 'ovr',
 'estimator__penalty': 'l2',
 'estimator__random_state': None,
 'estimator__tol': 0.0001,
 'estimator__verbose': 0,
 'estimator': LinearSVC(),
 'n_jobs': None,
 'param_grid': {'C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 1}

```
{'cv': None,
 'error_score': nan,
 'estimator': LinearSVC(),
 'n_jobs': None,
 'param_grid': {'loss': ('hinge', 'squared_hinge'),
  'C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': make_scorer(f1_score, average=macro),
 'verbose': 0}
 ```

In [28]:
y_pred = clf1.predict(x_test)
clf1.score(x_test, y_test), np.mean(y_test == y_pred)

(0.8444948921679909, 0.8444948921679909)

f1 = 0.8286

accuracy = 0.8445

Not as good as in the SciDocs benchmark (f1 = 0.8195). The fit also raised a convergence warning that advises increasing the number of iterations. So let's reuse the parameters of the current best model and increase the number of iterations: the maximum is currently 1000, so let's set it to 50 000. All the other parameters of the best model are actually the default ones.

In [29]:
clf1_2 = svm.LinearSVC(max_iter=50000)

In [30]:
clf1_2.fit(x_train, y_train)

LinearSVC(max_iter=50000)

In [31]:
y_pred = clf1_2.predict(x_test)
clf1_2.score(x_test, y_test), f1_score(y_test, y_pred, average='macro')

(0.8217934165720772, 0.7857127929592772)

This doesn't improve the results. I also tried a grid search on the C parameter with max_iter=50 000, but it failed every time and took about 7 hours, so I'm not doing it.

Let's save the trained model to Drive; it may be useful later for a more user-friendly interface.

In [32]:
import pickle

filename = '/content/drive/MyDrive/DNLP-project/linearSVC_model_mesh.sav'
# The with-block closes the file once the model has been written
with open(filename, 'wb') as f:
  pickle.dump(clf1, f)
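To reuse a saved model later, it has to be loaded back with `pickle.load`. Below is a minimal round-trip sketch; the dictionary stands in for a fitted model and the temporary file stands in for the Drive path, both of which are assumptions made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object behaves the same way
toy_model = {"name": "linearSVC", "C": 1.0}

# Hypothetical location: a temporary directory instead of the Drive folder
filename = os.path.join(tempfile.mkdtemp(), "toy_model.sav")

# Save: the with-block closes the file even if pickling fails
with open(filename, "wb") as f:
    pickle.dump(toy_model, f)

# Load the object back from disk
with open(filename, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.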

----

Let's try a non-linear SVM now.

In [33]:
from sklearn.model_selection import GridSearchCV

parameters = {'C': np.logspace(-4, 2, 7)}
svc = svm.SVC()
clf2 = GridSearchCV(svc, parameters, scoring=make_scorer(f1_score, average="macro"))

In [34]:
clf2.fit(x_train, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             scoring=make_scorer(f1_score, average=macro))

In [35]:
clf2.get_params()

{'cv': None,
 'error_score': nan,
 'estimator__C': 1.0,
 'estimator__break_ties': False,
 'estimator__cache_size': 200,
 'estimator__class_weight': None,
 'estimator__coef0': 0.0,
 'estimator__decision_function_shape': 'ovr',
 'estimator__degree': 3,
 'estimator__gamma': 'scale',
 'estimator__kernel': 'rbf',
 'estimator__max_iter': -1,
 'estimator__probability': False,
 'estimator__random_state': None,
 'estimator__shrinking': True,
 'estimator__tol': 0.001,
 'estimator__verbose': False,
 'estimator': SVC(),
 'n_jobs': None,
 'param_grid': {'C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': make_scorer(f1_score, average=macro),
 'verbose': 0}


In [36]:
y_pred = clf2.predict(x_test)
clf2.score(x_test, y_test), np.mean(y_test == y_pred)

(0.8289055337231488, 0.8476163450624291)

f1 = 0.8289

accuracy = 0.8476

No improvement: the results are equivalent.

In [37]:
filename = '/content/drive/MyDrive/DNLP-project/SVC_model_mesh.sav'
pickle.dump(clf2, open(filename, 'wb'))  # save the non-linear SVC grid search, not clf1

----

Let's try naive Bayes.

In [38]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

In [39]:
gnb.fit(x_train, y_train)

GaussianNB()

In [40]:
y_pred = gnb.predict(x_test)
gnb.score(x_test, y_test), f1_score(y_test, y_pred, average="macro")

(0.7122587968217934, 0.681885251722862)

f1 = 0.6818

accuracy = 0.7123

It doesn't work well.

Let's try k-nearest neighbours.

In [41]:
from sklearn.neighbors import KNeighborsClassifier

knb = KNeighborsClassifier()

In [42]:
knb.fit(x_train, y_train)

KNeighborsClassifier()

In [43]:
y_pred = knb.predict(x_test)
knb.score(x_test, y_test), f1_score(y_test, y_pred, average="macro")

(0.8073212258796821, 0.7777932401154167)

f1 = 0.7778

accuracy = 0.8073

It works worse than both SVMs.

In [44]:
filename = '/content/drive/MyDrive/DNLP-project/Knn_model_mesh.sav'
pickle.dump(knb, open(filename, 'wb'))  # save the k-NN model, not clf1

----

Let's try an SGD classifier, first with two fixed losses and then with a grid search over the loss.

In [45]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier('squared_error')

In [46]:
sgd.fit(x_train, y_train)



SGDClassifier(loss='squared_error')

In [47]:
y_pred = sgd.predict(x_test)
sgd.score(x_test, y_test), f1_score(y_test, y_pred, average="macro")

(0.08740068104426787, 0.04208318135834049)

f1 = 0.0223

accuracy = 0.0451

That's hilarious, it's the most inaccurate classification model I've ever seen. The accuracy is even lower than 1/11 ≈ 0.0909, which is the theoretical expected accuracy of a uniform random classifier over the 11 classes. One likely reason is that `squared_error` is a loss designed for regression.
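The random baseline is easy to check. A classifier that guesses uniformly among k classes is right with probability 1/k, whatever the class distribution of the test set. With the 11 labels listed in the table at the top of this notebook, that gives about 0.0909. The simulation below uses made-up, deliberately unbalanced labels and a fixed seed.

```python
import random
from fractions import Fraction

n_classes = 11  # labels 0..10 in the table at the top of the notebook

# A uniform random guess matches the true label with probability 1/k
expected_accuracy = Fraction(1, n_classes)

# Quick simulation with made-up labels (seeded for reproducibility)
rng = random.Random(0)
n = 100_000
y_true = [rng.choice([0, 9, 9, 8, 7]) for _ in range(n)]  # deliberately unbalanced
y_pred = [rng.randrange(n_classes) for _ in range(n)]
simulated_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(float(expected_accuracy))
print(simulated_accuracy)
```

The simulated value should land close to 1/11, so any trained model scoring below that is doing worse than guessing.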

In [48]:
sgd = SGDClassifier('squared_epsilon_insensitive')

In [49]:
sgd.fit(x_train, y_train)



SGDClassifier(loss='squared_epsilon_insensitive')

In [50]:
y_pred = sgd.predict(x_test)
sgd.score(x_test, y_test), f1_score(y_test, y_pred, average="macro")

(0.18217934165720773, 0.060065520419049355)

f1 = 0.0194

accuracy = 0.0712 

That's even worse. It's kind of funny, but it took quite some time to train.

In [51]:
parameters = {'loss': ('hinge', 'modified_huber', 'squared_hinge')}
sgd = SGDClassifier()
clf3 = GridSearchCV(sgd, parameters, scoring=make_scorer(f1_score, average="macro"))

In [52]:
clf3.fit(x_train, y_train)

GridSearchCV(estimator=SGDClassifier(),
             param_grid={'loss': ('hinge', 'modified_huber', 'squared_hinge')},
             scoring=make_scorer(f1_score, average=macro))

In [53]:
clf3.get_params()

{'cv': None,
 'error_score': nan,
 'estimator__alpha': 0.0001,
 'estimator__average': False,
 'estimator__class_weight': None,
 'estimator__early_stopping': False,
 'estimator__epsilon': 0.1,
 'estimator__eta0': 0.0,
 'estimator__fit_intercept': True,
 'estimator__l1_ratio': 0.15,
 'estimator__learning_rate': 'optimal',
 'estimator__loss': 'hinge',
 'estimator__max_iter': 1000,
 'estimator__n_iter_no_change': 5,
 'estimator__n_jobs': None,
 'estimator__penalty': 'l2',
 'estimator__power_t': 0.5,
 'estimator__random_state': None,
 'estimator__shuffle': True,
 'estimator__tol': 0.001,
 'estimator__validation_fraction': 0.1,
 'estimator__verbose': 0,
 'estimator__warm_start': False,
 'estimator': SGDClassifier(),
 'n_jobs': None,
 'param_grid': {'loss': ('hinge', 'modified_huber', 'squared_hinge')},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': make_scorer(f1_score, average=macro),
 'verbose': 0}

In [54]:
y_pred = clf3.predict(x_test)
clf3.score(x_test, y_test), np.mean(y_test == y_pred)

(0.8063962048280385, 0.8280363223609535)

f1 = 0.7797

accuracy = 0.8079

In [55]:
filename = '/content/drive/MyDrive/DNLP-project/SGD_model_mesh.sav'
pickle.dump(clf3, open(filename, 'wb'))  # save the SGD grid search, not clf1

----

Let's try a neural network now. First we'll try a very simple network with only 2 dense layers (no convolution yet, despite the `Cnn1` name).

In [56]:
import tensorflow as tf
y_train_cnn = tf.keras.utils.to_categorical(y_train)
x_train_cnn = np.expand_dims(x_train, -1)
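`to_categorical` turns each integer label into a one-hot row, which is the target format that `categorical_crossentropy` expects. Here is a plain-Python sketch of the same idea; the helper `one_hot` is hypothetical and only meant to show the shape of the result.

```python
def one_hot(labels, num_classes=None):
    """Return one row per label, with a 1.0 at the label's index and 0.0 elsewhere."""
    if num_classes is None:
        # Infer the number of classes from the largest label
        num_classes = max(labels) + 1
    return [[1.0 if j == label else 0.0 for j in range(num_classes)] for label in labels]

rows = one_hot([9, 7, 0], num_classes=11)
for row in rows:
    print(row)
```

Taking the index of the largest value in a row gives the integer label back, which is what the `np.argmax` call does on the network's predictions further down.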

In [57]:
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(x_train.shape[1], 1,), dtype="float")


# We add a vanilla hidden layer (its width is the number of training samples, x_train.shape[0]):
x = layers.Flatten()(inputs)
x = layers.Dense(x_train.shape[0], activation="relu")(x)

# We project onto an 11-unit output layer (one unit per class), and squash each unit with a sigmoid:
predictions = layers.Dense(11, activation="sigmoid", name="predictions")(x)

Cnn1 = tf.keras.Model(inputs, predictions)

In [58]:
Cnn1.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 768, 1)]          0         
                                                                 
 flatten (Flatten)           (None, 768)               0         
                                                                 
 dense (Dense)               (None, 16433)             12636977  
                                                                 
 predictions (Dense)         (None, 11)                180774    
                                                                 
Total params: 12,817,751
Trainable params: 12,817,751
Non-trainable params: 0
_________________________________________________________________
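The parameter counts in the summary follow from `inputs × units + units` (weights plus biases) for each dense layer. A quick check in plain Python, using the shapes printed above:

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters of a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

hidden = dense_params(768, 16433)   # 'dense' layer
output = dense_params(16433, 11)    # 'predictions' layer

print(hidden, output, hidden + output)
```

Almost all of the 12.8 million parameters sit in the hidden layer, because its width was set to the number of training samples.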


In [59]:
Cnn1.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [60]:
epochs = 10

# Fit the model on the training set.
Cnn1.fit(x_train_cnn, y_train_cnn, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f21028f2e50>

In [61]:
x_test_cnn = np.expand_dims(x_test, -1)
y_pred = np.argmax(Cnn1.predict(x_test_cnn), axis=1)
np.mean(y_test == y_pred), f1_score(y_test, y_pred, average="macro")



(0.8280363223609535, 0.8100371088014756)

f1 = 0.8105

accuracy = 0.8368


No big improvement compared to the SVC classification. Let's expand our network. This model is inspired by AlexNet, which was originally designed for 1000-class image classification and won the ImageNet Large Scale Visual Recognition Challenge in 2012.

In [62]:
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(x_train.shape[1], 1,), dtype="float")

x = layers.Conv1D(96, 11, 4)(inputs)
x = layers.MaxPool1D(3, 2)(x)
x = layers.Conv1D(256, 5)(x)
x = layers.MaxPool1D(3, 2)(x)
x = layers.Conv1D(384, 3)(x)
x = layers.Conv1D(384, 3)(x)
x = layers.MaxPool1D(3, 2)(x)

x = layers.Flatten()(x)
x = layers.Dense(2048, activation="relu")(x)
x = layers.Dense(2048, activation="relu")(x)


# We project onto an 11-unit output layer (one unit per class), and squash each unit with a sigmoid:
predictions = layers.Dense(11, activation="sigmoid", name="predictions")(x)

Cnn2 = tf.keras.Model(inputs, predictions)

In [63]:
Cnn2.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 768, 1)]          0         
                                                                 
 conv1d (Conv1D)             (None, 190, 96)           1152      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 94, 96)           0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, 90, 256)           123136    
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 44, 256)          0         
 1D)                                                             
                                                                 
 conv1d_2 (Conv1D)           (None, 42, 384)           2952
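The output lengths in the summary above can be checked by hand. With no padding (the Keras default, `padding="valid"`), a 1D convolution or pooling layer maps a length `n` to `(n - kernel) // stride + 1`. The sketch below applies that formula to the first five layers; the helper name `out_len` is only for illustration.

```python
def out_len(length, kernel, stride=1):
    """Output length of a 1D convolution or pooling layer without padding."""
    return (length - kernel) // stride + 1

# (layer name, kernel size, stride) in the order the layers are applied
layers_spec = [("conv1d", 11, 4), ("max_pooling1d", 3, 2), ("conv1d_1", 5, 1),
               ("max_pooling1d_1", 3, 2), ("conv1d_2", 3, 1)]

length = 768  # size of a SPECTER embedding
lengths = {}
for name, kernel, stride in layers_spec:
    length = out_len(length, kernel, stride)
    lengths[name] = length

print(lengths)
```

The values match the second dimension of each output shape printed in the summary.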

In [64]:
Cnn2.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [65]:
epochs = 10

# Fit the model on the training set.
Cnn2.fit(x_train_cnn, y_train_cnn, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f21028583a0>

In [66]:
x_test_cnn = np.expand_dims(x_test, -1)
y_pred = np.argmax(Cnn2.predict(x_test_cnn), axis=1)
np.mean(y_test == y_pred), f1_score(y_test, y_pred, average="macro")



(0.8056186152099887, 0.7824980486995603)

What a disappointment: the larger network scores lower than the simple dense one (accuracy 0.8056 vs 0.8280).