<a href="https://colab.research.google.com/github/xutian1113/pytorch_practice/blob/main/WikiText_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers tqdm



## WikiText Description
The WikiText dataset is a collection of Wikipedia articles, drawn from those rated Good or Featured, designed for language modeling tasks. It is commonly used to train and evaluate language models, from LSTMs to Transformer-based models such as GPT and LLaMA.

- Developed by: Salesforce Research
- Purpose: Train and benchmark language models (next-word prediction, text generation, etc.).
- Dataset Variants:
  - WikiText-2 (smaller, ideal for quick training)
  - WikiText-103 (about 50 times larger, used for longer training runs)
  - WikiText-2-raw and WikiText-103-raw (keep the original rare words instead of replacing them with `<unk>`)

### Key Features of WikiText
| **Feature** | **Description** |
|---------------|----------------------|
| Task Type | Language Modeling (Next-Word Prediction) |
| Input Type | Continuous Wikipedia Text |
| Vocabulary | ~267K unique words (WikiText-103), ~33K (WikiText-2) |
| Dataset Size | WikiText-2: ~2M tokens, WikiText-103: ~103M tokens |
| Text Type | Cleaned Wikipedia articles (lists and tables removed; section headings kept as `= Heading =` lines) |
| Evaluation Metrics | Perplexity (PPL); bits per character (BPC) for character-level models |
| Commonly Used With | GPT, Transformer models, LSTMs |
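
The two evaluation metrics in the table can both be derived from a model's negative log-likelihood. The sketch below is illustrative only: the function names `perplexity` and `bits_per_character` are not part of the notebook, the loss values are made up, and losses are assumed to be in nats (as returned by PyTorch's cross-entropy).

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def bits_per_character(total_nll_nats, num_characters):
    """Convert a summed NLL in nats into bits per character."""
    return total_nll_nats / (num_characters * math.log(2))

# Hypothetical per-token losses in nats, invented for illustration.
example_nlls = [2.0, 3.0, 4.0]

ppl = perplexity(example_nlls)  # exp of the mean loss, here exp(3.0)
bpc = bits_per_character(sum(example_nlls), num_characters=13)

print(f"perplexity: {ppl:.2f}")
print(f"bits per character: {bpc:.3f}")
```

Lower is better for both metrics. Perplexity depends on the tokenization, so values are only comparable between models that share a vocabulary.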


### WikiText Variants and Size

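| Dataset | Tokens | Train Rows | Validation Rows | Test Rows |
|---------|------|-------------|------------------|------------|
| WikiText-2 | ~2M | 36,718 | 3,760 | 4,358 |
| WikiText-103 | ~103M | 1,801,350 | 3,760 | 4,358 |
| WikiText-2-raw | ~2M | 36,718 | 3,760 | 4,358 |
| WikiText-103-raw | ~103M | 1,801,350 | 3,760 | 4,358 |

Row counts are lines of text in the Hugging Face `datasets` release, blank lines included. Counted in articles, WikiText-2 has 600 / 60 / 60 and WikiText-103 has 28,475 / 60 / 60 for train / validation / test; both sizes share the same validation and test sets.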

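- WikiText-2: Small enough for quick experiments and fine-tuning.
- WikiText-103: Much larger training set, suited to training language models from scratch.
- Raw versions: Keep the original text, including the rare words that the standard versions replace with `<unk>`. They pair well with subword tokenizers such as GPT-2's.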
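
As a minimal sketch of how these variants could be loaded with the `datasets` library: the helper names `load_wikitext` and `non_empty_lines` are hypothetical, and the Hub repository id `Salesforce/wikitext` is an assumption about where the dataset is hosted. Downloading requires network access.

```python
# Hugging Face configuration names for each WikiText variant.
WIKITEXT_CONFIGS = {
    "WikiText-2": "wikitext-2-v1",
    "WikiText-2-raw": "wikitext-2-raw-v1",
    "WikiText-103": "wikitext-103-v1",
    "WikiText-103-raw": "wikitext-103-raw-v1",
}

def load_wikitext(variant="WikiText-2-raw", split="train"):
    """Hypothetical helper: download one split of a WikiText variant."""
    # Imported lazily so the mapping above is usable without the library.
    from datasets import load_dataset
    # Repository id is an assumption; adjust it if the dataset lives elsewhere.
    return load_dataset("Salesforce/wikitext", WIKITEXT_CONFIGS[variant], split=split)

def non_empty_lines(rows):
    """Keep only rows whose "text" field contains non-whitespace characters."""
    return [row["text"] for row in rows if row["text"].strip()]
```

A call such as `non_empty_lines(load_wikitext("WikiText-2-raw", "validation"))` would then return the text lines with blank separator rows dropped, which is often a convenient starting point before tokenization.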


### Example Sentences from WikiText

| **Article** | **Excerpt** |
|---------------|----------------------|
| Artificial Intelligence | "Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans and animals." |
| Physics | "Physics is the natural science that studies matter, its fundamental constituents, its motion and behavior through space and time, and the related entities of energy and force." |
| History of Computing | "The history of computing is longer than the history of computing hardware and modern computing technology and includes the history of methods intended for pen and paper or for chalk and slate." |




In [5]:
import numpy as np
import torch
# AdamW comes from PyTorch; the copy in transformers is deprecated and missing from recent releases
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    matthews_corrcoef,
    roc_auc_score,
)
from tqdm import tqdm
from transformers import (
    BertForSequenceClassification,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    get_linear_schedule_with_warmup,
)

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Encode a short prompt to test generation
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
# Single unpadded prompt, so every position is attended to
attention_mask = torch.ones_like(input_ids)





In [None]:
model2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model2.config.pad_token_id = tokenizer.pad_token_id

model2.eval()


# Greedy decoding; max_length=50 caps the total length (prompt plus generated tokens) at 50
output = model2.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    num_return_sequences=1,
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)