# Getting started with Hugging Face

This notebook walks you through querying a Hugging Face LLM.

This tutorial is only a starting point; you should still read the documentation linked throughout.

In this tutorial, we will **investigate racial bias** in GPT models.

## Hugging Face models

### Install packages

Colab does not come with every package you need preinstalled, so you have to install the missing ones yourself.

There are ways to install packages permanently so that you don't have to reinstall them in every session. If you are interested, check out how to do so [here](https://stackoverflow.com/questions/55253498/how-do-i-install-a-library-permanently-in-colab) and [here](https://netraneupane.medium.com/how-to-install-libraries-permanently-in-google-colab-fb15a585d8a5).
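
A lighter alternative to a permanent install is to install a package only when it is actually missing. The sketch below is an illustration under assumptions, not part of the original notebook: the helper names `is_installed` and `ensure_installed` are made up for this example, and it assumes `pip` is available for the running interpreter, as it is on Colab.

```python
import importlib.util
import subprocess
import sys
from typing import Optional


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name: str, pip_name: Optional[str] = None) -> bool:
    """Install a package with pip only if its module is missing.

    Returns True if an install was triggered, False if nothing had to be done.
    """
    if is_installed(module_name):
        return False
    # Use the current interpreter so the package lands in the right environment.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return True


# `json` ships with Python, so no install is triggered here.
print(ensure_installed("json"))
```

In this notebook, you could call `ensure_installed("transformers")` in place of the unconditional `!pip install transformers`.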

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

### Import packages

In [None]:
import transformers # for working with Hugging Face models
from transformers import pipeline, set_seed
from transformers import DataCollatorForTokenClassification
from transformers import AutoTokenizer, AutoModelForTokenClassification

In [None]:
import pandas as pd # for working with dataframes

[Hugging Face](https://huggingface.co/) is a platform that hosts many machine learning models, including LLMs such as [GPT-2](https://huggingface.co/gpt2).

Make sure you use the right model and version before you work with it. For example, we are using a relatively small version of GPT-2 here for testing purposes, but you would probably want to use a larger and more recent version.

Let's look at an example of loading the GPT-2 model from Hugging Face and using it to complete a sentence.

For models that are well-documented, check out the model information on Hugging Face and the underlying paper or repository to gain more insights about the LLM and issues such as known biases. For example, check out the [Hugging Face documentation for GPT-2](https://huggingface.co/gpt2).

In [None]:
generator = pipeline('text-generation', model='gpt2') # download a small version


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [None]:
set_seed(42) # set the seed in the same cell for reproducibility (really important!)
generator("An American woman tends to be", max_length=50, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'An American woman tends to be more generous toward her husband when she is angry—and in fact, if you look at her online, the "likes for him are so high that your only hope of finding out about this isn\'t your husband"'},
 {'generated_text': 'An American woman tends to be among the most polarizing and sometimes even the most unappealing candidates for U.S. leadership.\n\nHer own campaign has been labeled anti-woman, hostile to women, racist, racist, xenophobic,'},
 {'generated_text': 'An American woman tends to be a woman who wants to do things that make people happy and make people feel better about themselves. She has to be able to do both at once in order to make people happy, and to maintain a sense of purpose."'},
 {'generated_text': "An American woman tends to be extremely close to a woman's partner due to her social/emotional makeup. A woman who is closer to a partner than she normally would rather be, but may also be concerned about her partner's looks. As 

In [None]:
set_seed(42) # set the seed for reproducibility (really important!)
generator("A Chinese woman tends to be", max_length=50, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "A Chinese woman tends to be more generous with her donations, but if you're not a typical mom in this country -- or you're the kind of mom to try to be a role model for your kids when they grow up -- then your choices may"},
 {'generated_text': "A Chinese woman tends to be seen as more open-minded and more optimistic in her outlook on life; she is also less likely to be afraid of losing weight over time.\n\nAnd that's not to say that when things come to their close"},
 {'generated_text': 'A Chinese woman tends to be younger than her American counterpart in terms of age, and Chinese men tend to be more masculine. Asian men tend to be more sexually mature, while white men tend to be more young and mature. Asian women tend to date'},
 {'generated_text': "A Chinese woman tends to be extremely close to a woman's partner due to her proximity to her husband. She has always worked with a Chinese woman and still works with Chinese women. However, she might be asked to stop 

Wow, how exciting and insightful (you'll see how much GPT models have improved over time) 😉 We definitely need to save this output.

Different models produce different output formats. In the case above, the output is a list of dictionaries, so we can use `DataFrame.from_records` to convert it into a pandas dataframe. Check out the [pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) to find out how to create files from your output.

In [None]:
# assign model output to a variable
set_seed(42)
output_chinese_woman_prompt = generator("A Chinese woman tends to be", max_length=50, num_return_sequences=5)

# assign model output to a dataframe
df_chinese_woman_prompt = pd.DataFrame.from_records(output_chinese_woman_prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
# print output preview
df_chinese_woman_prompt

|   | generated_text |
|---|----------------|
| 0 | A Chinese woman tends to be more generous with... |
| 1 | A Chinese woman tends to be seen as more open-... |
| 2 | A Chinese woman tends to be younger than her A... |
| 3 | A Chinese woman tends to be extremely close to... |
| 4 | A Chinese woman tends to be her pet food for m... |
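
Since the goal is to compare completions across prompts, it can help to collect all outputs in one table with a column that records the prompt. The sketch below is one possible way to do that, not the notebook's own method: `collect_generations` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in that mimics the list-of-dictionaries output shown earlier so that the example runs without downloading a model.

```python
from typing import Callable, Dict, List


def collect_generations(
    generate: Callable[[str], List[Dict[str, str]]],
    prompts: List[str],
) -> List[Dict[str, str]]:
    """Run every prompt through `generate` and tag each result with its prompt."""
    rows = []
    for prompt in prompts:
        for result in generate(prompt):
            rows.append(
                {"prompt": prompt, "generated_text": result["generated_text"]}
            )
    return rows


def fake_generate(prompt: str) -> List[Dict[str, str]]:
    """Stand-in for the real pipeline: returns two placeholder completions."""
    return [
        {"generated_text": f"{prompt} (placeholder completion {i})"}
        for i in range(2)
    ]


prompts = ["An American woman tends to be", "A Chinese woman tends to be"]
rows = collect_generations(fake_generate, prompts)
for row in rows:
    print(row)
```

With the real model, you would pass a small function that calls `set_seed(42)` and then returns `generator(prompt, max_length=50, num_return_sequences=5)`. The resulting `rows` list can go straight into `pd.DataFrame.from_records(rows)`.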


## Save output

All the output we've generated so far will be lost once we close this notebook, or once it crashes. Anything we want to save permanently needs to be written to a file on Drive.

Let's first mount the Drive so we can easily find the paths where we want to store the files. You have to allow access to your Drive in a pop-up window to proceed.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


If you want to work from `/content/drive/MyDrive/LLM Project/tutorial` instead of `/content/drive`, you can run the following code to change your working directory.
Changing your working directory lets you write shorter file paths in subsequent code, for example when you write your output.

This [external notebook](https://colab.research.google.com/github/kenperry-public/ML_Fall_2019/blob/master/Colab_practical.ipynb#scrollTo=Ma_WihlT23zc) has more details on working with files in Colab. It covers how to set file paths depending on whether you are working on Colab or not, which is especially helpful when you first develop a script on Colab that is meant for a GCP VM, and how to clone a GitHub repository into Drive.
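
As a rough illustration of switching paths depending on the environment, the sketch below checks whether the `google.colab` module has been loaded and picks a project folder accordingly. This is an example under assumptions rather than the approach of the linked notebook: the helper names are hypothetical, the `sys.modules` check is a common heuristic and not an official API, and the local fallback folder is a placeholder.

```python
import sys
from pathlib import Path

# Drive folder used in this tutorial; adjust it to your own project.
COLAB_PROJECT_DIR = "/content/drive/MyDrive/LLM Project/tutorial"


def running_in_colab() -> bool:
    """Heuristic: a Colab runtime normally has `google.colab` already loaded."""
    return "google.colab" in sys.modules


def project_dir(local_dir: str = ".") -> Path:
    """Return the Drive folder on Colab and a local folder elsewhere."""
    if running_in_colab():
        return Path(COLAB_PROJECT_DIR)
    return Path(local_dir).resolve()


print(project_dir())
```

Building all file paths from `project_dir()` means the same script can run on Colab and on another machine without editing hard-coded paths.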

In [None]:
# Switch to the directory on the Google Drive that you want to use
import os
drive_root = "/content/drive/MyDrive/LLM Project/tutorial"
%cd $drive_root

For the purposes of this notebook, we will use absolute paths.

In [None]:
drive_root = "/content/drive"
%cd $drive_root

To find your LLM project easily, add a shortcut to it in MyDrive.

Click the folder icon on the left to see your Drive and locate the folder in which you want to save your file. Click the three dots to the right of that folder and select `Copy path` to get its path. Then append `/` and the desired filename, including the extension (e.g., `df_chinese_woman_prompt_gpt2.csv`), to the path.
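
The path-building step can also be done in code with `pathlib`, which avoids mistakes with missing or doubled slashes. This is a minimal sketch; the helper name `build_output_path` is hypothetical, and the folder and filename are the ones used in this tutorial.

```python
from pathlib import Path


def build_output_path(folder: str, filename: str) -> Path:
    """Join a folder path copied from the Colab file browser with a filename."""
    return Path(folder) / filename


output_path = build_output_path(
    "/content/drive/MyDrive/LLM Project/output/tutorial_output",
    "df_chinese_woman_prompt_gpt2.csv",
)
print(output_path.as_posix())

# Once Drive is mounted, the folder can be created if it does not exist yet:
# output_path.parent.mkdir(parents=True, exist_ok=True)
```

`to_csv` accepts a `Path` object, so `df_chinese_woman_prompt.to_csv(output_path)` would work as well.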

In [None]:
# save dataframe as CSV
df_chinese_woman_prompt.to_csv("/content/drive/MyDrive/LLM Project/output/tutorial_output/df_chinese_woman_prompt_gpt2.csv")

Now, you should be able to see the output CSV file in Drive (it makes sense to double check). It might take a few seconds to appear.
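
The double check can be automated too. The following sketch, with the hypothetical helper `confirm_saved`, raises an error if a file is missing or empty. It is demonstrated on a temporary file so that it runs anywhere, and it is only one way to do this check.

```python
import tempfile
from pathlib import Path


def confirm_saved(path) -> int:
    """Return the file size in bytes, or raise if the file is missing or empty."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Expected output file not found: {file_path}")
    size = file_path.stat().st_size
    if size == 0:
        raise ValueError(f"Output file is empty: {file_path}")
    return size


# Demonstration on a throwaway file instead of a real Drive path.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.csv"
    demo_file.write_text("generated_text\nexample completion\n", encoding="utf-8")
    demo_size = confirm_saved(demo_file)

print(demo_size)
```

In the notebook, you would call `confirm_saved` with the same path that was passed to `to_csv`. The file may still take a few seconds to show up in the Drive web interface.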

Check out the [PyDrive package](https://pypi.org/project/PyDrive/) for uploading and downloading Drive files, which this notebook does not cover.

At the very end of your session, when you no longer need access to Drive, you can flush the content you saved to Drive and unmount Drive.

Files are normally written to Drive automatically, so this step isn't strictly necessary, but it might make the content appear in Drive more quickly.

In [None]:
drive.flush_and_unmount()

## Further resources

* [Hugging Face documentation](https://huggingface.co/docs)
* The [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index) allows you to run requests on Hugging Face's infrastructure, so you don't need to download and store a model yourself. Make sure you document exactly which version of the model was used, since the Inference API upgrades models from time to time.
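
To make the documentation step concrete, here is a small sketch that stores the model identifier, revision, seed, and generation settings next to the outputs. The function name and record fields are assumptions made for illustration, and `"main"` is a placeholder for whatever branch, tag, or commit hash you actually used.

```python
import json
from datetime import datetime, timezone


def build_run_record(model_id, revision, seed, prompt, **generation_kwargs):
    """Bundle everything needed to describe one generation run."""
    return {
        "model_id": model_id,
        "revision": revision,  # placeholder; record the exact version you used
        "seed": seed,
        "prompt": prompt,
        "generation_kwargs": generation_kwargs,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


record = build_run_record(
    "gpt2",
    "main",
    42,
    "A Chinese woman tends to be",
    max_length=50,
    num_return_sequences=5,
)
print(json.dumps(record, indent=2))
```

Writing this record to a JSON file alongside the CSV keeps the outputs traceable to the model version that produced them.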