# CIS6930 Week 9 Assignment: Getting Familiar with Hugging Face's Transformers Library and Model/Dataset Hub

The primary goal of this assignment is to give you an opportunity to learn about Hugging Face's Transformers library, Model Hub, and Dataset Hub. The problems should be straightforward. Problem 4 should be especially helpful for those who are interested in using text data for the term project.

I hope you will learn the basics of the `transformers` library and get familiar with the Hugging Face Hub. Enjoy! :)

---
Preparation: Go to `Runtime > Change runtime type` and choose `GPU` for the hardware accelerator.

In [None]:
gpu_info = !nvidia-smi -L
gpu_info = "\n".join(gpu_info)
if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU")
else:
    print(gpu_info)
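
The `!nvidia-smi` shell escape in the cell above only works inside a notebook. As a minimal sketch, assuming you want the same check in plain Python, the standard library can run the command directly. The helper name `describe_gpu` is hypothetical and not part of the assignment.

```python
import shutil
import subprocess


def describe_gpu():
    """Return the output of `nvidia-smi -L`, or a fallback message if no GPU is visible."""
    fallback = "Not connected to a GPU"
    # The command is missing entirely on machines without an NVIDIA driver.
    if shutil.which("nvidia-smi") is None:
        return fallback
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return fallback
    listing = result.stdout.strip()
    # Treat a non-zero exit code, a "failed" message, or empty output as "no GPU".
    if result.returncode != 0 or "failed" in listing or not listing:
        return fallback
    return listing


print(describe_gpu())
```

On a Colab GPU runtime this should print the same device list as the cell above; on a CPU-only runtime it prints the fallback message.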

## Preparation

For this notebook, we use Hugging Face's `transformers` library.

In [None]:
!pip install transformers

## Preliminary: Quick Tour (20 min read + 10 min watch)


- Go through [the Quick Tour](https://huggingface.co/transformers/quicktour.html#quick-tour) and watch the two videos on the page. 
- Make sure that you are now familiar with the following two patterns.


#### A) The "`tokenizer` + `model`" pattern 

```
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
```
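
Pattern A only loads the two objects. To get a prediction you still have to tokenize the text, call the model, and convert the returned logits into a label. The following is a minimal sketch of those steps, assuming a PyTorch sequence-classification model. The helper names `softmax`, `top_label`, and `classify` are hypothetical, and the softmax step is only roughly what the `pipeline` in pattern B does for you.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Pick the most probable class and return it in a pipeline-like dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def classify(text, tokenizer, model):
    """Run pattern A end to end for a single sentence."""
    # Tokenize into PyTorch tensors (input_ids, attention_mask, ...).
    inputs = tokenizer(text, return_tensors="pt")
    # Forward pass; the classification head returns one logit per class.
    outputs = model(**inputs)
    logits = outputs.logits[0].detach().tolist()
    # model.config.id2label maps class indices to human-readable labels.
    return top_label(logits, model.config.id2label)
```

After running the snippet for pattern A, you could call `classify("We are very happy to show you the 🤗 Transformers library.", tokenizer, pt_model)` in a notebook cell.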

#### B) The "`pipeline`" pattern

```
>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
```

For more details on `pipeline`, see the [Pipelines documentation](https://huggingface.co/transformers/main_classes/pipelines.html).

## Problem 1: Bug hunting!

The code below runs successfully but has an issue. Detect the issue. Technically, it is not a software bug as it does not raise any errors. 

You're allowed to **change 1 line** in the code. (Hint: Carefully look at the code line by line. You have two options. Either is fine.)


```
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer("Hello, world!", return_tensors="pt")
model(**inputs)
```

### Problem 1-a: Fix the code

In [None]:
# Directly edit this code block. Do not change more than 1 line!
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer("Hello, world!", return_tensors="pt")
model(**inputs)

### Problem 1-b: Explain the issue

What is the issue with the original code? Explain why the issue has to be fixed. 
(Hint: Compare encoded token IDs)

### Your Answer

`TYPE YOUR ANSWER HERE`

## Problem 2: Pre-trained model hunting at Hugging Face Hub

1. Go to Hugging Face Model Hub (https://huggingface.co/models) 
2. Search for pre-trained models that can be used for **summarizing Japanese text**.
3. Pick one model and give its name (either the `model name` or the `URL` is fine).

#### Your Answer

`TYPE YOUR ANSWER HERE`

## Problem 3: One thing needs to be done before use

Suppose you are trying out two pre-trained language models for text classification using the `pipeline` interface. 

- [Model 1](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
- [Model 2](https://huggingface.co/distilbert-base-uncased)

```
>>> from transformers import pipeline
>>> pipe1 = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=0) # cuda:0
>>> pipe1("This restaurant is awesome")
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]

>>> pipe2 = pipeline("text-classification", model="distilbert-base-uncased", device=0) # cuda:0
>>> pipe2("This restaurant is awesome")
[{'label': 'LABEL_0', 'score': 0.5170442461967468}]
```

Although both models appear to be the same type of pre-trained language model, the results look significantly different. While `distilbert-base-uncased-finetuned-sst-2-english` confidently classifies the sentence as having `positive` sentiment, the result from `distilbert-base-uncased` does not look good.

**Question:** What's the issue with the second model (`distilbert-base-uncased`)? Why does the model not return the expected result?

Hint: Look at the warning message. :)



In [None]:
from transformers import pipeline
pipe1 = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=0) # cuda:0
pipe1("This restaurant is awesome")

In [None]:
pipe2 = pipeline("text-classification", model="distilbert-base-uncased", device=0) # cuda:0
pipe2("This restaurant is awesome")

### Your Answer

`TYPE YOUR ANSWER HERE`

## Problem 4: Finding a dataset & a model (this can be part of your term project!)

- Go to Hugging Face Dataset Hub (https://huggingface.co/datasets)


### Problem 4-1: Pick a dataset 

- Take a look at the list of available datasets and **pick one dataset you like**. (Hint: Tasks tags might be helpful.)
- Read the description of the dataset and understand **the task category**. Note that some datasets consist of more than one problem. Avoid that type of dataset for this assignment. (It's fine to consider such a dataset for your term project.)
- Name the dataset and write the task category.

#### Your Answer

- Dataset name (URL): `TYPE URL HERE`
- Task category: `TYPE YOUR ANSWER HERE`

### Problem 4-2: Pick a model

Now, you have a problem to solve! It's time to find models that can be used for the problem! :)

- Go to Hugging Face Model Hub (https://huggingface.co/models)
- Search for models that can be used for the dataset you chose in the previous step (Hint: Tasks tags should be helpful)
- **Pick one model you like** and name the model. Also, explain why you think the model can be used for the dataset. 
- Note that the model does not have to perform well, but it must be **compatible** with the dataset. (Hint: Click the "Use in Transformers" tab to see what class is used to load the model.)
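
One way to think about compatibility is that each task category usually corresponds to one `AutoModelFor...` class, which is the class the "Use in Transformers" tab shows. The lookup below is a minimal sketch of that idea. The table is an illustrative subset rather than an exhaustive or official mapping, and `is_compatible` is a hypothetical helper, not a library function.

```python
# Hub task tag -> name of the `transformers` Auto class typically used to load such a model.
# Illustrative subset only; the Hub has many more task tags.
TASK_TO_AUTO_CLASS = {
    "text-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "fill-mask": "AutoModelForMaskedLM",
    "text-generation": "AutoModelForCausalLM",
    "summarization": "AutoModelForSeq2SeqLM",
    "translation": "AutoModelForSeq2SeqLM",
}


def is_compatible(dataset_task, model_auto_class):
    """Return True if the model's loading class matches the dataset's task category."""
    expected = TASK_TO_AUTO_CLASS.get(dataset_task)
    return expected is not None and expected == model_auto_class


print(is_compatible("summarization", "AutoModelForSeq2SeqLM"))
print(is_compatible("text-classification", "AutoModelForCausalLM"))
```

A match here only says that the model's head fits the task. The label set and the language of the dataset still need to be checked on the model card.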



#### Your Answer

- Model name (URL): `TYPE URL HERE`
- Explain why you think the model can be used for the dataset: `TYPE YOUR ANSWER HERE`

## Problem 5: Your thoughts on the term project?

Describe what type of problem you'd like to tackle for the term project. Do you want to use image data, textual data, or both? What kinds of techniques do you want to use?

If you're interested in using textual data and pre-trained Transformer models, do you think the dataset and the model you chose for Problems 4-1 and 4-2 can be part of your term project?

Are you interested in other types of data, such as images, or in other techniques, such as non-Transformer models?

Please share more information and your thoughts on the term project. 

### Your Answer

`TYPE YOUR ANSWER HERE`

## [Optional] Survey: Any ideas for additional content?

The detailed content plan for Weeks 12 and 13 is in progress, and there may be a chance to cover some content that you're interested in but that is not covered by the original plan.

Please feel free to share any topics/keywords/techniques that you're interested in learning. Note that due to the scope of the course (and the areas of my expertise), I may not be able to accommodate all requests, but I will do my best. 

Thanks!

### Your Answer

`TYPE YOUR ANSWER HERE`