# Hugging Face Dataset Loader
- How to load datasets from the Hugging Face Hub into LangChain.
- `HuggingFaceDatasetLoader` converts a Hugging Face dataset into the `Document` format that LangChain works with.
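
The conversion described above can be pictured with a small stand-in. This is a rough sketch, not the loader's actual code: `Doc` and `rows_to_docs` are hypothetical names, and the sample rows are made up. It assumes the column chosen as page content becomes the document text and every remaining column ends up in `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows, page_content_column):
    """Turn dataset rows (dicts) into Doc objects.

    Assumption: the chosen column becomes the text and all other
    columns are kept as metadata.
    """
    docs = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        text = row.pop(page_content_column)
        docs.append(Doc(page_content=text, metadata=row))
    return docs


# Made-up rows shaped like a text-classification dataset
sample_rows = [
    {"text": "A wonderful film.", "label": 1},
    {"text": "Not worth the time.", "label": 0},
]
docs = rows_to_docs(sample_rows, "text")
print(docs[0].page_content)  # A wonderful film.
print(docs[0].metadata)      # {'label': 1}
```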

In [10]:
# Load settings from a .env file so API keys can be managed as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()

True

In [3]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader

dataset_name = "imdb"  # Name of the dataset on the Hugging Face Hub
page_content_column = "text"  # Column whose values become each document's page content

# Create the loader from the dataset name and the page-content column name.
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
data = loader.load()  # Download the dataset and load it as a list of documents

Downloading readme: 100%|██████████| 7.81k/7.81k [00:00<00:00, 12.5MB/s]
Downloading data: 100%|██████████| 21.0M/21.0M [00:02<00:00, 8.86MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:02<00:00, 9.26MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:04<00:00, 10.3MB/s]
Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 90726.73 examples/s] 
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 85104.85 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 84544.68 examples/s]
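
The log above reports three splits. If the loader concatenates every split into one list (an assumption worth confirming with `len(data)`), the expected number of documents is the sum of the split sizes. The arithmetic, using only the figures from the log:

```python
# Split sizes copied from the log output above
split_sizes = {"train": 25_000, "test": 25_000, "unsupervised": 50_000}

# Assumption: all splits end up in the same list of documents
expected_documents = sum(split_sizes.values())
print(expected_documents)  # 100000
```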


In [6]:
data[0].page_content

'"I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered \\"controversial\\" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee
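
The output above begins with a double quote inside the Python string and contains escaped quotes (`\\"`), which is what JSON-encoded text looks like. Assuming the loader serializes the column value as JSON (not confirmed here), `json.loads` would recover the plain review text. The sample string below is made up for illustration.

```python
import json

# Made-up example shaped like the output above: a JSON string literal,
# i.e. the text wrapped in double quotes with inner quotes escaped.
raw_page_content = '"A film some call \\"controversial\\".<br /><br />Worth seeing."'

# Decode the JSON string to get the plain text back
text = json.loads(raw_page_content)

# Optionally replace the HTML line breaks that appear in the reviews
clean_text = text.replace("<br />", "\n")
print(clean_text)
```

If the content turns out not to be JSON-encoded, `json.loads` raises `json.JSONDecodeError`, so wrapping the call in a `try`/`except` may be sensible on real data.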

In [7]:
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders.hugging_face_dataset import (
    HuggingFaceDatasetLoader,
)

In [8]:
dataset_name = "tweet_eval"  # Name of the dataset
page_content_column = "text"  # Column whose values become each document's page content
name = "stance_climate"  # Name of the dataset configuration (subset) to load

# Create the loader for the selected subset of the dataset.
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column, name)

In [13]:
import os
from langchain_community.embeddings import OllamaEmbeddings

# The embedding model name is read from the OLLAMA_MODEL_NAME environment variable
embedding = OllamaEmbeddings(model=os.environ["OLLAMA_MODEL_NAME"])

In [16]:
from langchain_community.chat_models import ChatOllama
from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

callbacks = [StreamingStdOutCallbackHandler()]
llm = ChatOllama(model="gemma:7b", temperature=0, streaming=True, callbacks=callbacks)
# Build a vector store index from the loader.
index = VectorstoreIndexCreator(embedding=embedding).from_loaders([loader])
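
Conceptually, an index like this embeds each document, stores the vectors, and at query time passes the most similar documents to the LLM. The following is a toy sketch of the retrieval step only, under loud assumptions: `embed`, `cosine`, and `top_k` are hypothetical helpers, the "embedding" is a plain bag-of-words count, and the tweets are made up. It does not reflect how `VectorstoreIndexCreator` or the Ollama embeddings are implemented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9#]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, texts, k=2):
    """Return the k texts most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vec, embed(t)), reverse=True)
    return ranked[:k]


# Made-up tweets standing in for the indexed documents
tweets = [
    "Climate change is real #SemST",
    "Loving the weather today",
    "We need climate action now #SemST",
]
hits = top_k("climate action", tweets, k=2)
print(hits)
```

One consequence of this design is that the LLM typically sees only the few retrieved documents rather than the whole dataset, so answers to aggregate questions reflect that subset.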



In [20]:
query = "What are the most used hashtag?"  # Question to ask about the indexed tweets
result = index.query(llm=llm, question=query)  # Run the query; the answer streams to stdout

The most used hashtags are:

- #SemST
- #ClimateSummitoftheAmericas
- #CSOTA
- #HumanRights
- #SOSEurope
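
The list above is the LLM's answer based on the documents retrieved for the query, not necessarily an exact tally. As a cross-check, hashtags can be counted directly from the text. The sketch below uses a hypothetical helper `count_hashtags` and made-up tweets; with the real data one might pass `(doc.page_content for doc in loader.load())` instead.

```python
import re
from collections import Counter


def count_hashtags(texts):
    """Count hashtag occurrences (case-sensitive) across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"#\w+", text))
    return counts


# Made-up tweets used only to illustrate the counting
sample_tweets = [
    "Act now #SemST #HumanRights",
    "Summit starts today #CSOTA #SemST",
    "No hashtags here",
]
top = count_hashtags(sample_tweets).most_common(1)
print(top)  # [('#SemST', 2)]
```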