# Introduction to Large Language Models (LLMs)

This notebook covers the basic concepts of large language models (LLMs).

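## Table of Contents
1. What Is an LLM?
2. Key Characteristics of LLMs
3. Representative LLMs
4. Applications of LLMs
5. Hands-On: Basic Operations with a Transformer Model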

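## 1. What Is an LLM?

A large language model (LLM) is an AI model trained on large amounts of text data that can understand and generate natural language with human-like fluency.
These models share the following characteristics:

- Pre-training on large-scale datasets
- Use of the Transformer architecture
- Context understanding and text generation
- Ability to handle many tasks with a single model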

## 2. Key Characteristics of LLMs

### Scale
- Parameters: billions to trillions
- Training data: hundreds of GB to several TB of text
- Compute: large GPU clusters
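
To get a feel for these numbers, the following back-of-the-envelope sketch estimates how much memory the weights alone would occupy. The 175-billion figure is the GPT-3 size quoted later in this notebook; the 7-billion model and the bytes-per-parameter values (4 for float32, 2 for float16) are illustrative assumptions, and real training or serving also needs memory for activations and optimizer state.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Hypothetical model sizes used only for illustration
models = {
    "7B (hypothetical)": 7_000_000_000,
    "175B (GPT-3 scale)": 175_000_000_000,
}

for name, n in models.items():
    fp32 = weight_memory_gb(n, 4)  # 4 bytes per float32 parameter
    fp16 = weight_memory_gb(n, 2)  # 2 bytes per float16 parameter
    print(f"{name}: {fp32:.0f} GB in float32, {fp16:.0f} GB in float16")
```

Even in half precision, a model at the 175-billion scale does not fit on a single typical GPU, which is why the compute line above mentions clusters.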

### Architecture
- Transformer-based
- Self-attention mechanism
- Deep learning
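
The self-attention mechanism listed above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration: it sets the queries, keys, and values all equal to the input vectors, whereas a real Transformer layer applies learned projection matrices and uses multiple heads. The toy vectors are assumptions chosen for readability.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Each output vector is a weighted average of all input vectors
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dimensional token vectors
for row in self_attention(tokens):
    print([round(v, 3) for v in row])
```

Each output row mixes information from every position in the sequence, which is what lets the model use context when representing a token.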

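## 3. Representative LLMs

Many LLMs have been released, including:

- GPT (OpenAI)
- Claude (Anthropic)
- LLaMA (Meta)
- PaLM (Google)
- BERT (Google), an encoder-only model used mainly for language understanding rather than text generation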

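## 4. Applications of LLMs

LLMs are used in a wide range of areas:

- Text generation and summarization
- Question answering
- Code generation
- Machine translation
- Sentiment analysis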

In [None]:
# Import the required libraries
import transformers
import torch

print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")

## 5. Hands-On: Basic Operations with a Transformer Model

From here on, we run experiments with actual Transformer models.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use distilgpt2, a small distilled version of GPT-2
model_name = "distilgpt2"

# Load the tokenizer and the model
print("Loading the tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("\nLoading the model...")
model = AutoModelForCausalLM.from_pretrained(model_name)

print("\nReady!")

In [None]:
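# Text generation example
input_text = "Artificial Intelligence is"
# Calling the tokenizer returns both input_ids and attention_mask
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text; without do_sample=True, decoding is greedy
print("Generating text...")
output = model.generate(
    **inputs,
    max_length=50,  # total length in tokens, including the prompt
    num_return_sequences=1,
    no_repeat_ngram_size=2,  # never generate the same 2-gram twice
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"\nInput text: {input_text}")
print(f"Generated text: {generated_text}")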
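
The `no_repeat_ngram_size=2` argument above prevents the model from producing any 2-gram twice. The sketch below illustrates the idea on plain word lists; it is a simplified stand-in, not the Transformers implementation, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of whatever n-gram comes next
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(history, 2))
```

Here the sequence already contains the 2-gram "the cat", so after a second "the" the token "cat" is excluded. In the real decoder, the scores of such tokens are suppressed before the next token is chosen.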

In [None]:
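# Tokenization experiment
# Japanese sample text: "This is a natural language processing experiment. Let's learn about LLMs!"
example_text = "これは自然言語処理の実験です。LLMについて学んでいきましょう！"

# Tokenize the text
# GPT-2 uses byte-level BPE, so tokens for Japanese text print as
# byte-level symbols rather than readable characters
tokens = tokenizer.tokenize(example_text)
token_ids = tokenizer.encode(example_text)

print("Text:", example_text)
print("\nTokens:", tokens)
print("\nToken IDs:", token_ids)

# Show the number of tokens
print(f"\nNumber of tokens: {len(tokens)}")

# Convert the token IDs back to text
decoded_text = tokenizer.decode(token_ids)
print(f"\nRestored text: {decoded_text}")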
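
One reason the Japanese sample produces many tokens is that GPT-2's tokenizer works on UTF-8 bytes, and most Japanese characters occupy three bytes each. The sketch below compares character and byte counts only; it does not call the tokenizer. The English sentence and the helper name `utf8_stats` are assumptions added for comparison.

```python
def utf8_stats(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

samples = {
    "Japanese": "これは自然言語処理の実験です。",
    "English": "This is an NLP experiment.",
}

for name, text in samples.items():
    n_chars, n_bytes = utf8_stats(text)
    print(f"{name}: {n_chars} characters, {n_bytes} UTF-8 bytes")
```

For a byte-level tokenizer, the byte count is an upper bound on the number of tokens, so text in scripts that need more bytes per character tends to consume more of the model's context window unless the vocabulary was trained on that language.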

In [None]:
# A more detailed text generation experiment
def generate_text(prompt, max_length=50, num_sequences=3, temperature=0.7):
    """Sample num_sequences continuations of prompt and print them."""
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_sequences,
        temperature=temperature,  # lower values make sampling more conservative
        no_repeat_ngram_size=2,
        do_sample=True,  # sample instead of greedy decoding
        pad_token_id=tokenizer.eos_token_id
    )

    # Print the generated text
    print(f"Input prompt: {prompt}\n")
    print("Generated text:")
    for i, output in enumerate(outputs):
        generated_text = tokenizer.decode(output, skip_special_tokens=True)
        print(f"\nResult {i+1}:")
        print(generated_text)
        print("-" * 50)

# Try different prompts and parameters
prompts = [
    "The future of AI is",
    "In the next decade, technology will",
    "Artificial Intelligence can help humans by"
]

for prompt in prompts:
    print("=" * 80)
    generate_text(prompt, max_length=70, num_sequences=3, temperature=0.8)
    print("=" * 80)
    print("\n")
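
The `temperature` parameter used above rescales the model's scores before sampling. The following sketch shows the effect on a made-up set of three logits; the values are hypothetical and the function is an illustration of the standard temperature-scaled softmax, not code taken from the library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.5, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, giving more predictable text; higher temperatures flatten the distribution and make the samples more varied.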

In [19]:
from transformers import pipeline

# Prepare the sentiment analysis pipeline
# (no model is specified, so the library falls back to its default model)
sentiment_analyzer = pipeline("sentiment-analysis")

# Example texts to analyze
texts = [
    "I love working with artificial intelligence!",
    "This technology is very complicated and frustrating.",
    "AI has both benefits and challenges to consider.",
    "The future of AI looks promising and exciting!"
]

print("Sentiment analysis of the texts:")
print("-" * 50)

# Analyze the sentiment of each text
for text in texts:
    result = sentiment_analyzer(text)
    sentiment = result[0]

    print(f"\nText: {text}")
    print(f"Sentiment: {sentiment['label']}")
    print(f"Confidence: {sentiment['score']:.3f}")
    print("-" * 50)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


Sentiment analysis of the texts:
--------------------------------------------------

Text: I love working with artificial intelligence!
Sentiment: POSITIVE
Confidence: 0.999
--------------------------------------------------

Text: This technology is very complicated and frustrating.
Sentiment: NEGATIVE
Confidence: 0.998
--------------------------------------------------

Text: AI has both benefits and challenges to consider.
Sentiment: POSITIVE
Confidence: 0.999
--------------------------------------------------

Text: The future of AI looks promising and exciting!
Sentiment: POSITIVE
Confidence: 1.000
--------------------------------------------------
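
Note that the third sentence is balanced in tone, yet it is labelled POSITIVE with a score of 0.999. The default model named in the log was fine-tuned on SST-2, a two-label task, so it has no neutral class to fall back on. The sketch below assumes the reported score is a softmax over the two labels and uses hypothetical logits (not values from the actual model) to show how a modest gap between the two already yields a very high score.

```python
import math

def two_class_scores(neg_logit, pos_logit):
    """Softmax over two logits; the two probabilities always sum to 1."""
    m = max(neg_logit, pos_logit)
    e_neg = math.exp(neg_logit - m)
    e_pos = math.exp(pos_logit - m)
    total = e_neg + e_pos
    return {"NEGATIVE": e_neg / total, "POSITIVE": e_pos / total}

# Hypothetical logit pairs, not taken from the actual model
for neg, pos in [(0.0, 0.0), (-1.0, 1.0), (-3.5, 3.5)]:
    scores = two_class_scores(neg, pos)
    label = max(scores, key=scores.get)
    print(f"logits=({neg}, {pos}) -> {label} {scores[label]:.3f}")
```

A high score therefore indicates which of the two labels the model prefers, not that the text is strongly emotional. For mixed or neutral text, a model with a neutral label may be a better fit.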


In [20]:
from transformers import pipeline

# Prepare the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Text to summarize (a longer passage about AI)
long_text = """
Artificial Intelligence has transformed the way we live and work in the 21st century. 
Machine learning algorithms are now used in healthcare to diagnose diseases, in finance 
to detect fraudulent transactions, and in transportation to develop self-driving cars. 
These AI systems process vast amounts of data to identify patterns and make predictions 
with increasing accuracy. However, the rapid advancement of AI technology also raises 
important ethical questions about privacy, bias, and the future of human employment. 
Researchers and developers are working to address these challenges while continuing to 
push the boundaries of what AI can achieve.
"""

# Show the original text
print("Original text:")
print("-" * 50)
print(long_text)
print("-" * 50)

# Generate summaries (try different lengths)
# max_length and min_length are measured in tokens, not words
print("\nShort summary:")
short_summary = summarizer(long_text, max_length=50, min_length=10, do_sample=False)
print(short_summary[0]['summary_text'])

print("\nSlightly longer summary:")
long_summary = summarizer(long_text, max_length=100, min_length=30, do_sample=False)
print(long_summary[0]['summary_text'])

Device set to use mps:0


Original text:
--------------------------------------------------

Artificial Intelligence has transformed the way we live and work in the 21st century. 
Machine learning algorithms are now used in healthcare to diagnose diseases, in finance 
to detect fraudulent transactions, and in transportation to develop self-driving cars. 
These AI systems process vast amounts of data to identify patterns and make predictions 
with increasing accuracy. However, the rapid advancement of AI technology also raises 
important ethical questions about privacy, bias, and the future of human employment. 
Researchers and developers are working to address these challenges while continuing to 
push the boundaries of what AI can achieve.

--------------------------------------------------

Short summary:
Artificial Intelligence has transformed the way we live and work in the 21st century. Machine learning algorithms are now used in healthcare to diagnose diseases, in finance to detect fraudulent transactions, and in transpo

In [21]:
from transformers import pipeline

# Prepare the question-answering pipeline
# (no model is specified, so the library falls back to its default model)
qa_pipeline = pipeline("question-answering")

# Context (the passage the answers are extracted from)
# Note: GPT-4 was actually announced in March 2023; the date in this sample
# passage does not affect the questions asked below.
context = """
OpenAI's GPT (Generative Pre-trained Transformer) models have revolutionized natural language processing. 
The first GPT model was released in 2018, followed by GPT-2 in 2019 and GPT-3 in 2020. 
GPT-3 contains 175 billion parameters and can perform various tasks like translation, 
question-answering, and code generation. In 2022, GPT-4 was announced, showing significant 
improvements in reasoning and creativity. These models are trained using transformer 
architecture and learn from vast amounts of internet text data.
"""

# Questions to test
questions = [
    "When was the first GPT model released?",
    "How many parameters does GPT-3 have?",
    "What tasks can GPT-3 perform?",
    "What architecture do these models use?"
]

# Answer each question
print("Question-answering test:")
print("-" * 50)

for question in questions:
    result = qa_pipeline(question=question, context=context)

    print(f"\nQuestion: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.3f}")
    print("-" * 50)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use mps:0


Question-answering test:
--------------------------------------------------

Question: When was the first GPT model released?
Answer: 2018
Confidence: 0.978
--------------------------------------------------

Question: How many parameters does GPT-3 have?
Answer: 175 billion
Confidence: 0.799
--------------------------------------------------

Question: What tasks can GPT-3 perform?
Answer: translation, 
question-answering, and code generation
Confidence: 0.942
--------------------------------------------------

Question: What architecture do these models use?
Answer: transformer
Confidence: 0.678
--------------------------------------------------
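
The answers above are copied directly from the context rather than generated, which is why the third answer still contains the line break from the context string. Extractive question-answering models typically score every position as a possible answer start and answer end and then return the best-scoring span. The sketch below illustrates that selection step on whole words with hypothetical scores; the function name `best_span` is an assumption, and the real pipeline works on subword tokens and converts the scores into probabilities.

```python
def best_span(tokens, start_scores, end_scores, max_answer_len=5):
    """Pick the span with the highest start + end score, requiring start <= end."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(tokens))):
            score = s_score + end_scores[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    _, s, e = best
    return " ".join(tokens[s:e + 1])

tokens = "GPT-3 contains 175 billion parameters".split()
start_scores = [0.1, 0.2, 3.0, 0.5, 0.1]  # hypothetical start scores per token
end_scores = [0.0, 0.1, 0.4, 2.5, 0.3]    # hypothetical end scores per token
print(best_span(tokens, start_scores, end_scores))
```

Because the answer must be a contiguous span of the context, this approach cannot answer questions whose answer is not stated verbatim in the passage.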
