## **Building the Summarization Pipeline**

Recall the `xsum` dataset introduced in the earlier **Summarization** section.

**Steps**
1. Load the data
```python
from datasets import load_dataset

dataset = load_dataset('xsum')
```
2. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

summarizer = pipeline(
    task="summarization",
    model="t5-small"
)
```
3. Use `summarizer` to summarize the articles
```python
summarizer(article)
```

In [1]:
# Step 1 - Load the dataset

from datasets import load_dataset

xsum_dataset = load_dataset('xsum')




In [2]:
xsum_sample = xsum_dataset['train'].select(range(10))

xsum_sample

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10
})

In [3]:
# Step 2 - Define the pipeline by specifying the task and model

from transformers import pipeline

summarizer = pipeline(
    task="summarization",
    model="t5-small",
    truncation=True
)

# If we do not set truncation=True, we get the following warning during inference:
# Token indices sequence length is longer than the specified maximum
# sequence length for this model (541 > 512). Running this sequence
# through the model will result in indexing errors

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
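
To make the truncation warning above concrete, here is a minimal sketch of what truncating an over-long input amounts to. It is a toy illustration, not the tokenizer's actual code: the `truncate_tokens` helper and the placeholder token list are hypothetical, and a real tokenizer works on subword IDs and also accounts for special tokens.

```python
# Toy illustration of truncation. The placeholder strings stand in for the
# subword tokens a real tokenizer would produce.
MAX_LENGTH = 512  # maximum sequence length quoted in the warning


def truncate_tokens(tokens, max_length=MAX_LENGTH):
    """Keep at most max_length tokens and drop everything after them."""
    return tokens[:max_length]


# A hypothetical 541-token input, matching the length in the warning.
tokens = ["tok"] * 541
truncated = truncate_tokens(tokens)

print(len(tokens), "->", len(truncated))
```

The practical consequence is that anything past the cutoff never reaches the model, so the summary can only reflect the first part of a long article.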


In [4]:
# Step 3 - Use summarizer to summarize the articles

summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]
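
The call above passes `do_sample=True`, `top_k=10` and `top_p=0.8`. As a rough sketch of what those two filters do to the next-token distribution, the pure-Python example below applies top-k and then top-p (nucleus) filtering to a made-up probability table. The `top_k_top_p_filter` function and the probabilities are hypothetical and simplified; the library itself operates on logits inside the generation loop.

```python
def top_k_top_p_filter(probs, top_k=10, top_p=0.8):
    """Return the renormalized distribution over the tokens that survive
    top-k filtering followed by top-p (nucleus) filtering."""
    # Rank tokens from most to least probable and keep the top_k best.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Hypothetical next-token probabilities, made up for illustration.
probs = {"flood": 0.5, "water": 0.25, "damage": 0.125, "road": 0.0625, "the": 0.0625}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print(filtered)
```

With these made-up numbers only `flood` and `water` survive, and the next token would be sampled from those two. Because sampling is random, rerunning the summarizer with `do_sample=True` can produce a different summary each time.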