# Context

In this notebook, I use extracted embeddings (the CLS token) as features for a LogisticRegression classifier.

I'm trying to understand how informative the extracted features are and what a classifier (simple or neural network) can do with them.

I load all the data (features + labels) from the dataset: [Custom Dataset for Evaluating Student Writing Competition](https://www.kaggle.com/renokan/dataset-student-writing)

## Versions

* V2 - Used base_bert_cls_tokens
* V1 - Used distil_bert_cls_tokens

```
cls_tokens_dict = {
    "base_bert": "../input/dataset-student-writing/base_bert_cls_tokens.csv",
    "distil_bert": "../input/dataset-student-writing/distil_bert_cls_tokens.csv"
}

use_tokens = "base_bert"
```

## Description

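How the embeddings were extracted and from which model output: the features are the `[CLS]` token vectors, i.e. the first position of the model's last hidden state. The example below uses DistilBERT.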

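```
from transformers import DistilBertTokenizer, DistilBertModel
import torch

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# position 0 of every sequence is the [CLS] token
cls_embeddings = outputs.last_hidden_state[:, 0, :]
```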

![last_hidden_states](https://camo.githubusercontent.com/6c2185c7620a3fe52f1968752febb6467723f4485c257442d3b0ed03bb0da197/68747470733a2f2f6a616c616d6d61722e6769746875622e696f2f696d616765732f64697374696c424552542f626572742d6f75747075742d74656e736f722d73656c656374696f6e2e706e67)



## Sources

* [A Visual Notebook to Using BERT for the First Time.ipynb](https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb)
* [A Visual Guide to Using BERT for the First Time <<< github.io](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)
* [Your First BERT: An Illustrated Guide (in Russian) <<< habr.com](https://habr.com/ru/post/498144/)

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

rs_param = 42

# Load features

In [None]:
cls_tokens_dict = {
    "base_bert": "../input/dataset-student-writing/base_bert_cls_tokens.csv",
    "distil_bert": "../input/dataset-student-writing/distil_bert_cls_tokens.csv"
}

use_tokens = "base_bert"

features = pd.read_csv(cls_tokens_dict.get(use_tokens))
features.head()

In [None]:
features.info(memory_usage='deep')

# Load labels

In [None]:
text_and_labels = pd.read_csv("../input/dataset-student-writing/text_and_labels.csv")
text_and_labels.head()

In [None]:
labels = text_and_labels.select_dtypes(exclude='object')
labels.head()

In [None]:
labels.mean().mul(100).round(2).map("{} %".format)
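
The share of positives per target also says what accuracy a trivial classifier would reach: always predicting the more frequent class gives `max(p, 1 - p)`. A minimal sketch of that baseline, assuming binary 0/1 label columns; the `labels_demo` frame and the `tp_Evidence` column name are hypothetical stand-ins, not data from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `labels` frame: binary columns, one per target
labels_demo = pd.DataFrame({
    "tp_Claim": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "tp_Evidence": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],  # hypothetical column name
})

# Share of positive rows per target (same quantity as labels.mean() above)
positive_rate = labels_demo.mean()

# Accuracy of always predicting the more frequent class for each target
baseline_accuracy = positive_rate.where(positive_rate >= 0.5, 1 - positive_rate)

print(baseline_accuracy.round(2))
```

Any accuracy reported later is easier to read next to this number: a score close to the baseline suggests the features add little for that target.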

# Create train / test data

In [None]:
print(features.shape)
print(labels.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    test_size=0.2, random_state=rs_param)
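
Because the features and the multi-column labels are passed to `train_test_split` together, both are cut on the same rows. A small sketch that checks this on toy data; the frames below and the `tp_Other` column name are hypothetical stand-ins for `features` and `labels`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 10 rows of 4-dimensional features and two label columns
features_demo = pd.DataFrame(np.arange(40).reshape(10, 4))
labels_demo = pd.DataFrame({"tp_Claim": [0, 1] * 5, "tp_Other": [1, 0] * 5})

X_tr, X_te, y_tr, y_te = train_test_split(
    features_demo, labels_demo, test_size=0.2, random_state=42
)

# The same row indices end up on the feature side and on the label side
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
print(X_tr.shape, X_te.shape)
```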

# Using LogisticRegression

In [None]:
%%time
for target in labels.columns:
    # solver='liblinear' is used because solver='lbfgs' raised
    # "AttributeError: 'str' object has no attribute 'decode'"
    # when fitting the LogisticRegression model here
    logreg = LogisticRegression(
        random_state=rs_param,
        solver='liblinear'
    )

    logreg.fit(X_train, y_train[target])

    # Mean accuracy on the held-out test set
    score = logreg.score(X_test, y_test[target])
    score = round(score, 5)
    print(f"{score}\t{target}\n")
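
`LogisticRegression.score` returns plain accuracy, which can look good on a rare target even when few positives are found. A minimal sketch that reports balanced accuracy and F1 next to it; the data here is synthetic (`make_classification` with an assumed 90/10 class split), not the embedding features, so the printed numbers say nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the embedding features and one binary target
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(solver="liblinear", random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Accuracy matches clf.score(); the other two are less sensitive to class imbalance
metrics_demo = {
    "accuracy": accuracy_score(y_te, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
}
for name, value in metrics_demo.items():
    print(f"{value:.5f}\t{name}")
```

In the loop above, the same three calls could be applied to `logreg.predict(X_test)` and `y_test[target]` for each target.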

# Using GridSearchCV + LogisticRegression

In [None]:
use_target = "tp_Claim"
use_model = LogisticRegression(solver='liblinear',
                               random_state=rs_param)
use_params = {
    'C': np.linspace(0.01, 10, 5),
    'class_weight': [None, 'balanced']
}
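
To see what this grid expands to, the candidates can be enumerated with `ParameterGrid`. The sketch below rebuilds the same grid under the name `grid_demo` (a name introduced here for illustration); the log-spaced `C_log` values are an alternative I am assuming might be worth trying, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as use_params above
grid_demo = {
    "C": np.linspace(0.01, 10, 5),
    "class_weight": [None, "balanced"],
}

# Every combination GridSearchCV would evaluate: 5 values of C x 2 class_weight options
candidates = list(ParameterGrid(grid_demo))
print(len(candidates))
print(grid_demo["C"])

# Assumption: a log-spaced grid over the same range, which samples small C more densely
C_log = np.logspace(-2, 1, 5)
print(C_log)
```

With `cv=4`, each candidate is fitted four times, plus one final refit of the best candidate on the full training set.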

In [None]:
%%time
search = GridSearchCV(estimator=use_model,
                      param_grid=use_params,
                      cv=4, verbose=3)

search.fit(X_train, y_train[use_target])

print('best parameters: ', search.best_params_)
print('best score: ', search.best_score_)

In [None]:
%%time
search.score(X_test, y_test[use_target])
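
Beyond `best_params_` and `best_score_`, the fitted search keeps the cross-validation scores of every candidate in `cv_results_`, which shows how much `C` and `class_weight` actually matter. A minimal sketch on synthetic data with a smaller, assumed grid; `search_demo` and the data are stand-ins for `search` and the embedding features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the embedding features and one binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=42)

search_demo = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", random_state=42),
    param_grid={"C": [0.01, 1.0, 10.0], "class_weight": [None, "balanced"]},
    cv=4,
)
search_demo.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
columns = ["param_C", "param_class_weight", "mean_test_score",
           "std_test_score", "rank_test_score"]
results = pd.DataFrame(search_demo.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

The same two lines that build `results` should work on the `search` object above once it has been fitted.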