<a class="anchor" id="0"></a>
# [NLP : Reports & News Classification](https://www.kaggle.com/vbmokin/nlp-reports-news-classification)
## Automatic Environmental Reports & News Classification (English)

# Acknowledgements

This notebook builds on the following notebooks:
* BERT, GPT-2 and XLNet summarization from the notebook [Text Summarization using BERT, GPT2,XLNET](https://www.kaggle.com/pemagrg/text-summarization-using-bert-gpt2-xlnet)
* data loading from my notebook [NLP for EN : BERT Classification for Water Report](https://www.kaggle.com/vbmokin/nlp-for-en-bert-classification-for-water-report)

The data comes from my dataset [NLP : Reports & News Classification](https://www.kaggle.com/vbmokin/nlp-reports-news-classification)

<a class="anchor" id="0.1"></a>
## Table of Contents

1. [Install and import libraries](#1)
1. [Download data](#2)
1. [Text Summarizing](#3)
    -  [BERT Summarizing](#3.1)
    -  [GPT-2 Summarizing](#3.2)
    -  [XLNet Summarizing](#3.3)
1. [Results](#4)

## 1. Install and import libraries <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
%%time
!pip install bert-extractive-summarizer

In [None]:
%%time
!pip install transformers==2.2.0

In [None]:
%%time
!pip install spacy==2.0.12

In [None]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt

# Both classes come from the bert-extractive-summarizer package installed above
from summarizer import Summarizer, TransformerSummarizer
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

## 2. Download data <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to https://www.kaggle.com/vbmokin/nlp-for-en-bert-classification-for-water-report
df = pd.read_csv('../input/nlp-reports-news-classification/water_problem_nlp_en_for_Kaggle_100.csv', delimiter=';', header=0)
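df = df.fillna(0)  # replace missing values with 0 so the label columns can be cast to int

# Target dtype for each column: the text itself and five binary topic labels
convert_dict = {'text': str,
                'env_problems': int,
                'pollution': int,
                'treatment': int,
                'climate': int,
                'biomonitoring': int}

df = df.astype(convert_dict)
df = df[:5]  # keep only the first 5 rows
df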

In [None]:
df.info()

In [None]:
df['text'].head(10)

In [None]:
df['text'].str.len().max()

In [None]:
# Build a list of longer text blocks by concatenating consecutive rows
min_block_length = 400  # minimum number of characters in each block (the last block may be shorter)
i = 0
bodies = []
while i < len(df):
    body = ""
    body_empty = True
    # Keep appending rows until the block is long enough or the data runs out
    while (len(body) < min_block_length) and (i < len(df)):
        if body_empty:
            body = df.loc[i, 'text']
            body_empty = False
        else:
            body += " " + df.loc[i, 'text']
        i += 1
    bodies.append(body)
    print("Length of block =", len(body))
print(f"\nNumber of text blocks = {len(bodies)}\n")
print("Text blocks:\n", bodies)
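
The block-building loop above can also be written as a small reusable function. This is only a sketch of the same greedy idea on a plain list of strings; `merge_into_blocks` and its parameter names are my own hypothetical names, not part of any library.

```python
def merge_into_blocks(texts, min_block_length=400):
    """Greedily join consecutive texts with spaces until each block
    has at least min_block_length characters (hypothetical helper)."""
    blocks = []
    parts = []
    for text in texts:
        parts.append(text)
        # Close the current block as soon as it is long enough
        if len(" ".join(parts)) >= min_block_length:
            blocks.append(" ".join(parts))
            parts = []
    if parts:
        # Leftover texts form a final block that may be shorter than the threshold
        blocks.append(" ".join(parts))
    return blocks


# Made-up sample data: three texts of 250, 200 and 100 characters
sample_texts = ["a" * 250, "b" * 200, "c" * 100]
sample_blocks = merge_into_blocks(sample_texts, min_block_length=400)
print([len(block) for block in sample_blocks])
```

In this notebook it could be called as `merge_into_blocks(df['text'].tolist())`.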

## 3. Text Summarizing <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
min_length_text = 40  # minimum sentence length (in characters), passed to each summarizer as min_length
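
As far as I understand the `bert-extractive-summarizer` package, `min_length` works as a filter on sentence length in characters: shorter sentences are not considered as summary candidates. The sketch below only illustrates that filtering idea. `candidate_sentences` is a hypothetical helper, and the naive regex split is an assumed stand-in for the sentence segmentation the package does itself.

```python
import re


def candidate_sentences(text, min_length=40, max_length=600):
    """Return sentences whose character length is strictly between
    min_length and max_length (illustration only, not the library code)."""
    # Naive split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if min_length < len(s) < max_length]


# Made-up sample text with one long and two short sentences
sample = ("Water is polluted. The treatment plant upstream has been "
          "discharging untreated wastewater for months. Why?")
print(candidate_sentences(sample, min_length=40))
```

With a threshold of 40, only the long middle sentence of the sample is kept, which is why very short sentences never show up in the summaries.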

### 3.1. BERT Summarizing <a class="anchor" id="3.1"></a>

[Back to Table of Contents](#0.1)

In [None]:
%%time
bert_model = Summarizer()  # load the model once and reuse it for every block
bert_summary = []
for body in bodies:
    bert_summary.append(''.join(bert_model(body, min_length=min_length_text)))

### 3.2. GPT-2 Summarizing <a class="anchor" id="3.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
%%time
# Load the model once and reuse it for every block
GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")
gpt_summary = []
for body in bodies:
    gpt_summary.append(''.join(GPT2_model(body, min_length=min_length_text)))

### 3.3. XLNet Summarizing <a class="anchor" id="3.3"></a>

[Back to Table of Contents](#0.1)

In [None]:
%%time
# Load the model once and reuse it for every block
xlnet_model = TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased")
xlnet_summary = []
for body in bodies:
    xlnet_summary.append(''.join(xlnet_model(body, min_length=min_length_text)))

## 4. Results <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [None]:
%%time
print("All Summarizing Results:\n")
for i in range(len(bodies)):
    print("ORIGINAL TEXT:")
    print(bodies[i])
    print("\nBERT Summarizing Result:")
    print(bert_summary[i])
    print("\nGPT-2 Summarizing Result:")
    print(gpt_summary[i])
    print("\nXLNet Summarizing Result:")
    print(xlnet_summary[i])
    print("\n\n")
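
Besides reading the printed summaries, a simple number can help compare the three models: how much shorter each summary is than its source text. The sketch below assumes a character-based compression ratio. `compression_ratio` is a hypothetical helper of my own, and the sample strings are made up.

```python
def compression_ratio(original, summary):
    """Summary length divided by original length, both in characters.
    Returns 0.0 for an empty original to avoid division by zero."""
    if not original:
        return 0.0
    return len(summary) / len(original)


# Made-up sample strings; in this notebook you would pass
# bodies[i] together with bert_summary[i], gpt_summary[i] or xlnet_summary[i]
original = ("Water quality in the river has declined. "
            "Monitoring stations report high nitrate levels.")
summary = "Monitoring stations report high nitrate levels."
print(f"Compression ratio: {compression_ratio(original, summary):.2f}")
```

A lower ratio means a shorter summary. It says nothing about summary quality, so it is best read next to the texts themselves.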

I hope you find this notebook useful and enjoyable.

Your comments and feedback are most welcome.

[Go to Top](#0)