# Supershort NLP classification notebook for prize competition "[Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)"

## This notebook is based on:
* notebook [NLP with DT cleaning: Simple Transformers predict](https://www.kaggle.com/vbmokin/nlp-with-dt-cleaning-simple-transformers-predict)
* notebook [NLP with DT: Simple Transformers Research](https://www.kaggle.com/vbmokin/nlp-with-dt-simple-transformers-research)
* notebook [SimpleTransformers + Hyperparam Tuning + k-fold CV](https://www.kaggle.com/szelee/simpletransformers-hyperparam-tuning-k-fold-cv)
* libraries [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers), [transformers](https://huggingface.co/transformers)
* dataset [NLP with Disaster Tweets - cleaning data](https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data)

For a more detailed solution, see: [NLP with DT: Simple Transformers Research](https://www.kaggle.com/vbmokin/nlp-with-dt-simple-transformers-research)

For the list of available pretrained models, see: https://huggingface.co/transformers/pretrained_models.html

### The notebook is generic: after a small adaptation to the new data structure, it can be reused in other Kaggle competitions or for predictions on another dataset.
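
As an illustration of that adaptation, the sketch below reshapes a differently named dataset into the two-column layout (text first, label second) that this notebook passes to `train_model`. The helper name `to_text_target` and the `reviews` frame with its column names are hypothetical and used only for illustration.

```python
import pandas as pd


def to_text_target(df: pd.DataFrame, text_col: str, label_col: str) -> pd.DataFrame:
    """Return a two-column frame named like the notebook's training data."""
    # Keep only the two needed columns and drop rows with missing values
    out = df[[text_col, label_col]].dropna()
    out = out.rename(columns={text_col: "text", label_col: "target"})
    out["text"] = out["text"].astype(str)
    return out.reset_index(drop=True)


# Hypothetical dataset with different column names (not competition data)
reviews = pd.DataFrame({
    "review": ["great phone", "battery died in a day", None],
    "sentiment": [1, 0, 1],
    "store": ["A", "B", "C"],
})
adapted = to_text_target(reviews, text_col="review", label_col="sentiment")
print(adapted)
```

The resulting frame could then take the place of `train_data`; a test set would need the same treatment with its identifier column kept instead of the label.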

In [None]:
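# Install simpletransformers with its dependencies
!pip install simpletransformers
# Pin fsspec to 2021.5.0
!pip uninstall fsspec -y
!pip install fsspec==2021.5.0
# Pin the library versions used by this notebook
!pip install seqeval==0.0.12 simpletransformers==0.45.3 tokenizers==0.8.1rc1 transformers==3.0.2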

In [None]:
import warnings
import pandas as pd
from simpletransformers.classification import ClassificationModel
warnings.simplefilter('ignore')

# Cleaned competition data: text + label for training, id + text for prediction
train_data = pd.read_csv('../input/nlp-with-disaster-tweets-cleaning-data/train_data_cleaning2.csv')[['text', 'target']]
test_data = pd.read_csv('../input/nlp-with-disaster-tweets-cleaning-data/test_data_cleaning2.csv')[['id', 'text']]

# DistilBERT with per-class loss weights; effective batch size is 4 * 2 = 8
model = ClassificationModel('distilbert', 'distilbert-base-uncased', weight=[0.44, 0.56],
    args={'fp16': False, 'train_batch_size': 4, 'gradient_accumulation_steps': 2, 'silent': True,
          'learning_rate': 2e-05, 'do_lower_case': True, 'overwrite_output_dir': True,
          'manual_seed': 42, 'num_train_epochs': 1})
model.train_model(train_data)

# predict() returns (predicted labels, raw model outputs)
test_data['target'], _ = model.predict(test_data['text'].tolist())
test_data.drop(columns=['text']).to_csv('submission.csv', index=False, header=True)
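
The class weights `[0.44, 0.56]` passed to the model above make errors on the second class count more in the loss. The notebook does not state how these values were chosen. One common heuristic, shown here only as an assumption, is inverse-frequency weighting normalized to sum to 1. The helper name `inverse_frequency_weights` is hypothetical and the labels are toy data.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weights proportional to 1 / class count, normalized to sum to 1."""
    counts = Counter(labels)
    classes = sorted(counts)  # weights are returned in class-index order
    inverse = [1.0 / counts[c] for c in classes]
    total = sum(inverse)
    return [w / total for w in inverse]


# Toy labels with a 60/40 split (not the competition data)
toy_labels = [0] * 60 + [1] * 40
weights = inverse_frequency_weights(toy_labels)
print([round(w, 2) for w in weights])  # [0.4, 0.6]
```

With this heuristic the rarer class receives the larger weight. A list computed this way from `train_data['target']` could be passed as the `weight` argument in place of the hard-coded values.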

In [None]:
# In addition: accuracy evaluation
# Note: everything below is computed on the training data, so it measures fit, not generalization
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Accuracy (accuracy_score returns a fraction, so scale it to a percentage)
result, model_outputs, wrong_predictions = model.eval_model(train_data, acc=accuracy_score)
print('Accuracy = ', round(result['acc'] * 100, 2), '%', sep='')

# Confusion matrix, accuracy score, classification report
predictions, _ = model.predict(train_data['text'].tolist())
matrix = confusion_matrix(train_data['target'], predictions)
print(matrix)

score = accuracy_score(train_data['target'], predictions)
print(score)

report = classification_report(train_data['target'], predictions)
print(report)
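
The competition is scored with F1 rather than accuracy, so it can help to read F1 directly off the confusion matrix printed above. For binary labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). The sketch below derives precision, recall and F1 for the positive class from those four counts. The function name `f1_from_confusion` is hypothetical and the counts are made up, not results from this notebook.

```python
def f1_from_confusion(matrix):
    """Precision, recall and F1 for the positive class of a 2x2 confusion matrix.

    Expects the scikit-learn layout: [[tn, fp], [fn, tp]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only
toy_matrix = [[50, 10], [5, 35]]
precision, recall, f1 = f1_from_confusion(toy_matrix)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.778 0.875 0.824
```

The same function accepts the `matrix` array computed above, since a 2x2 NumPy array unpacks row by row in the same way.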