<center><a href="https://www.nvidia.cn/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# <font color="#76b900"> **Supplementary Notebook:** Licenses</font>

**This notebook summarizes the licenses of the trained and fine-tuned open-source models used in this course.**

In [None]:
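import gc

import torch
from transformers import Pipeline, pipeline

## Deletes every Pipeline object in the notebook's global namespace to free memory.
## Downloaded weights stay in the on-disk cache, so a pipeline can be recreated later.
def clear_env(verbose=True, check=False):
    """verbose: print each deleted name; check: ask for confirmation before each deletion."""
    for k, v in list(globals().items()):
        if isinstance(v, Pipeline):
            if not check and verbose: print(f"Delete {k}")
            if not check or input(f"Delete {k}: ").lower().startswith('y'):
                del globals()[k].model  ## drop the model reference first
                del globals()[k]

    gc.collect()                ## reclaim the now-unreferenced Python objects
    torch.cuda.empty_cache()    ## release cached GPU memory held by PyTorch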

### A1. Apache-2.0 License

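- **The full license text is available at [licenses/apache2_license.txt](licenses/apache2_license.txt) (or in [this online version](https://www.apache.org/licenses/LICENSE-2.0)).** It covers the terms for commercial use and distribution.

- Many models on HuggingFace are released under this license, but you should still check before using one!

- **Models used in this course under this license:**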
  - [bert-base-uncased](https://huggingface.co/bert-base-uncased)
  - [t5-base](https://huggingface.co/t5-base) / [t5-large](https://huggingface.co/t5-large)
  - [google/flan-t5-large](https://huggingface.co/google/flan-t5-large)
  - [openai/whisper-base](https://huggingface.co/openai/whisper-base) / [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)
  - [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)
  - [nicholasKluge/ToxicityModel](https://huggingface.co/nicholasKluge/ToxicityModel?text=Can+you+give+a+list+of+good+insults+to+use+against+my+brother%3F%0A%0AAs+a+software%2C+I+am+not+capable+of+engaging+in+verbal+sparring+or+offensive+behavior.%0A%0AIt+is+crucial+to+maintain+a+courteous+and+respectful+demeanor+at+all+times%2C+as+it+is+a+fundamental+aspect+of+human-AI+interactions.)
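
One way to run that check programmatically is to read the `license:` tag that the Hugging Face Hub attaches to a model's metadata. The sketch below is illustrative and not part of the course code: `license_from_tags` and `fetch_license` are hypothetical helper names, and the live lookup assumes the `huggingface_hub` package is installed, network access is available, and the model author declared a license in the model card.

```python
from typing import Iterable, Optional


def license_from_tags(tags: Iterable[str]) -> Optional[str]:
    """Return the value of the first 'license:<id>' tag, or None if absent."""
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def fetch_license(model_id: str) -> Optional[str]:
    """Look up a model's declared license on the Hub (needs network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import model_info

    return license_from_tags(model_info(model_id).tags or [])


# Offline demonstration with a hand-written tag list (not fetched from the Hub).
print(license_from_tags(["transformers", "fill-mask", "license:apache-2.0"]))
```

Calling `fetch_license("bert-base-uncased")` performs the live lookup. A `None` result means no license was declared, which is itself a reason to read the model card before use.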

In [None]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')
# unmasker("Hello I'm a [MASK] model.")

In [None]:
translator  = pipeline('translation_en_to_fr', model='t5-base')
trans_large = pipeline('translation_en_to_fr', model='t5-large')
# translator("Hello World! How's it going?")

In [None]:
flan_t5_pipe = pipeline("text2text-generation", model="google/flan-t5-large")
# flan_t5_pipe([
#     "translate English to Spanish: This is good!",
#     "continue the sentence: I love walking",
#     "continue the conversation as a helpful agent: User: Hello! Agent: ",
# ])

In [None]:
pipe_asr_large = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
pipe_asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
# pipe_asr('../audio-files/w_micro-machines.wav')

In [None]:
vit_pipe = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
# vit_pipe('../img-files/paint-cat.jpg')

In [None]:
toxit_pipe = pipeline("text-classification", model="nicholasKluge/ToxicityModel")
# toxit_pipe("I really don't like you very much actually")

In [None]:
clear_env()

### A2. CC BY 4.0 (Creative Commons Attribution 4.0 International License)
- **The full license text is available at [licenses/cc_by_license.txt](licenses/cc_by_license.txt) (an [online version](https://creativecommons.org/licenses/by/4.0/) is also available).**

Also broadly permissive, the CC BY 4.0 license allows for both personal and commercial use of the licensed material. It enables sharing, adapting, and building upon the material, under the condition that appropriate credit is given, a link to the license is provided, and any changes made are indicated.

- **Models used in this course under this license:**
  - [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2)
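
The three attribution conditions of CC BY 4.0 (credit, a link to the license, and a note about changes) can be collected into a single notice. The helper below is a hypothetical sketch, not an official Creative Commons tool: the function name, its parameters, and the wording of the notice are all assumptions.

```python
CC_BY_4_URL = "https://creativecommons.org/licenses/by/4.0/"


def cc_by_attribution(title: str, author: str, source_url: str, changes: str = "") -> str:
    """Build a one-line notice covering credit, license link, and changes."""
    notice = f'"{title}" by {author} ({source_url}), licensed under CC BY 4.0 ({CC_BY_4_URL}).'
    # CC BY 4.0 asks that modifications be indicated, so state them when present.
    if changes:
        notice += f" Changes: {changes}."
    return notice


print(cc_by_attribution(
    title="roberta-base-squad2",
    author="deepset",
    source_url="https://huggingface.co/deepset/roberta-base-squad2",
    changes="fine-tuned on in-house data",  # illustrative value only
))
```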

In [None]:
nlp = pipeline('question-answering', model="deepset/roberta-base-squad2")
# nlp({
#     'question': 'Why is model conversion important?',
#     'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
# })

### A3. MIT License

- **The full license text is available at [licenses/mit_license.txt](licenses/mit_license.txt) (an [online version](https://opensource.org/licenses/MIT) is also available).**

A very simple and permissive license, MIT allows reuse in both proprietary and open-source software. The software is distributed as-is, which limits the authors' liability, and the license and copyright notice must be included with any substantial portion of the software.

- **Models used in this course under this license:**
  - [roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions)
  - [gpt2](https://huggingface.co/gpt2)
  - [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli)
  - [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
  - [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)

In [None]:
emo_model = pipeline('sentiment-analysis', 'SamLowe/roberta-base-go_emotions')
# emo_model(["I love my old pillow?", "Why is it that every plant I touch dies within a few days?"])

In [None]:
generator = pipeline('text-generation', model='gpt2')
# generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

In [None]:
zsc_pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# sequence_to_classify = "one day I will see the world"
# candidate_labels = ['travel', 'cooking', 'dancing']
# zsc_pipe(sequence_to_classify, candidate_labels)

In [None]:
zsic_pipe = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
# labels_for_classification = ["cat covered in paint", "dog covered in paint", "cat playing around"]
# zsic_pipe("../img-files/paint-cat.jpg", candidate_labels=labels_for_classification)

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-base-en-v1.5')
## Can be used internally for a LlamaIndex embedding strategy

### A4. BSD 3-Clause License (Salesforce)

- **The full license text is available at [licenses/bsd3c_license](licenses/llama2_license.txt) (an [online version](https://github.com/salesforce/cos-e/blob/master/LICENSE) is also available).** Highlights include viability for commercial use. The chief requirements are that redistributions include the license and that the authors' names are not used to imply endorsement of derivatives.

- **Models used in this course under this license:**
  - [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
  - [Salesforce/codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono) / [Salesforce/codegen-2B-mono](https://huggingface.co/Salesforce/codegen-2B-mono)

In [None]:
blip_pipe = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
# blip_pipe('../img-files/paint-cat.jpg')

In [None]:
codegen = pipeline("text-generation", model="Salesforce/codegen-350M-mono")
codegen_large = pipeline("text-generation", model="Salesforce/codegen-2B-mono")
# codegen_large("def fibanocci(", max_length=128)

In [None]:
clear_env()

### A5. Llama-2

- **The full license text is available at [licenses/llama2_license.txt](licenses/llama2_license.txt) (an [online version](https://github.com/facebookresearch/llama/blob/main/LICENSE) is also available).**

Highlights include viability for commercial use (up to a very high limit on monthly active users), with some restrictions on using the models or their outputs to improve other large language models.

- > Llama 2 is pretrained using publicly available online data. An initial version of Llama Chat is then created through the use of supervised fine-tuning. Next, Llama Chat is iteratively refined using Reinforcement Learning from Human Feedback (RLHF), which includes rejection sampling and proximal policy optimization (PPO). [(source)](https://ai.meta.com/resources/models-and-libraries/llama/)

- > Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase knowledge and dampen hallucinations. We performed a variety of pretraining data investigations so that users can better understand the potential capabilities and limitations of our models; results can be found in Section 4.1. [(source)](https://arxiv.org/abs/2307.09288).

- Models quantized by [TheBloke (Tom Jobbins)](https://huggingface.co/TheBloke) using a few-shot GPTQ quantization strategy as exemplified [here](https://github.com/TheBloke/AutoGPTQ/blob/main/examples/quantization/basic_usage.py).
- **Models used in this course under this license:**
  - [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ)
  - [TheBloke/Llama-2-70B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ)

In [None]:
llama_pipe = pipeline("text-generation", model="TheBloke/Llama-2-13B-chat-GPTQ", device_map="auto")
# llama_pipe("def fibonacci(", max_new_tokens=256, do_sample=True, temperature=0.6)[0]['generated_text']

In [None]:
clear_env()

In [None]:
llama_pipe_large = pipeline("text-generation", model="TheBloke/Llama-2-70B-chat-GPTQ", device_map="auto")
# llama_pipe_large("def fibonacci(", max_new_tokens=256, do_sample=True, temperature=0.6)[0]['generated_text']

In [None]:
clear_env()

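### A6. Loading Other Models

You are free to use the current environment to load and experiment with other models, as long as you follow the license terms listed above and have seriously considered whether your use meets ethical standards. ***However***, please avoid downloading excessively large files!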

In [None]:
%%bash
## Before proceeding on from here, realize that we will be pulling in some pretty large models. 
## Please run the following command to see how much space you're using with your current set of models.
du -sch /root/.cache/huggingface/hub/*

## If your bottom number (folder total) is over 200GB, please rm -rf the models you no longer need
# rm -rf /root/.cache/huggingface/hub/*
rm -rf /root/.cache/huggingface/hub/tmp*  #<- not important
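
The same disk-usage check can be sketched in Python with only the standard library. `dir_size_bytes` and `cache_report` are hypothetical helpers, and the default path simply mirrors the `du` command above; adjust it if your cache lives elsewhere.

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # The Hub cache links snapshots to blobs; skip links to avoid double counting.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def cache_report(hub_dir: str = "/root/.cache/huggingface/hub") -> dict:
    """Map each top-level cache entry to its size in bytes, largest first."""
    hub = Path(hub_dir)
    if not hub.is_dir():
        return {}
    sizes = {entry.name: dir_size_bytes(entry) for entry in hub.iterdir() if entry.is_dir()}
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


for name, size in cache_report().items():
    print(f"{size / 1e9:8.2f} GB  {name}")
```

Sorting largest first puts the best candidates for deletion at the top of the list.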

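### A7. Training Data

Whatever license a model uses, remember that every model is trained on datasets. You may need to check each dataset's own license (many developers overlook this), trace its data sources, and consider whether it poses a risk for your use case, especially in commercial projects with large pipelines.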
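
As a starting point for that audit, the Hub exposes the datasets a model card declares as `dataset:` tags, and dataset repositories carry their own `license:` tags. The sketch below is an illustration under stated assumptions: `datasets_from_tags`, `hub_dataset_license`, and `dataset_licenses` are hypothetical names, and the live lookup assumes `huggingface_hub` is installed with network access. A model card lists only what its author chose to declare, so an empty result does not mean the model has no training data.

```python
from typing import Callable, Dict, Iterable, List, Optional


def datasets_from_tags(tags: Iterable[str]) -> List[str]:
    """Collect dataset ids from 'dataset:<id>' tags, preserving order."""
    return [tag.split(":", 1)[1] for tag in tags if tag.startswith("dataset:")]


def hub_dataset_license(dataset_id: str) -> Optional[str]:
    """Live lookup of a dataset's declared license (needs network access)."""
    from huggingface_hub import dataset_info  # lazy import: optional dependency

    for tag in dataset_info(dataset_id).tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def dataset_licenses(
    model_tags: Iterable[str],
    lookup: Callable[[str], Optional[str]] = hub_dataset_license,
) -> Dict[str, Optional[str]]:
    """Map each dataset declared by a model to its license (None if undeclared)."""
    return {ds: lookup(ds) for ds in datasets_from_tags(model_tags)}


# Offline demonstration: hand-written tags and a stub lookup table (illustrative values).
demo_tags = ["question-answering", "dataset:squad_v2", "license:cc-by-4.0"]
demo_lookup = {"squad_v2": "cc-by-sa-4.0"}.get
print(dataset_licenses(demo_tags, lookup=demo_lookup))
```

Passing the lookup as a parameter keeps the tracing logic testable offline. In a live session, the model tags could come from `huggingface_hub.model_info(model_id).tags`.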