### create virtual environment and install libraries
```bash
python -m venv .venv
source .venv/bin/activate
pip install transformers tf-keras torch datasets
```

### text generation

In [9]:
from transformers import pipeline

generator = pipeline('text-generation', model='openai-community/gpt2')

generator("Whales are blue and giraffes are yellow", truncation=True, num_return_sequences=3)

Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Whales are blue and giraffes are yellow and the other colours are all white. We don\'t know if the colours in this picture are identical to those in the picture above. The colour pattern was taken from the right of the picture.\n\nWe have to be careful with the colours in this picture. To clarify a bit, the colours, which are very different from the ones in the picture above, are very different. However, they are not the same colour.\n\nSo, what does this mean? It means that there are more colours, but that is where the differences end. It means that there is more diversity between various colours. So we get the same colours, but the difference is not noticeable, so it is not going to affect our interpretation of the picture.\n\nIt means that the colours are not the same. So, it is not the same colours.\n\nThere is a very real possibility of the colour difference. So, it could be a pattern that is different from the one we saw before. The pattern could be called a 
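
As the output above shows, `generated_text` starts with the prompt and then continues with the model's own text. A minimal sketch of separating the two follows. The `result` dictionary is a hypothetical stand-in, shortened to the first sentence of the output above so the snippet runs without loading a model, and `continuation` is an illustrative helper, not part of `transformers`.

```python
# The prompt passed to the generator in the cell above.
prompt = "Whales are blue and giraffes are yellow"

# Hypothetical stand-in for one pipeline result, shortened to the first
# sentence of the output shown above so the sketch runs without a model.
result = {
    "generated_text": "Whales are blue and giraffes are yellow and the other "
                      "colours are all white."
}


def continuation(generated_text: str, prompt: str) -> str:
    """Return only the text the model appended after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a literal prefix.
    return generated_text


new_text = continuation(result["generated_text"], prompt)
print(repr(new_text))
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation without this post-processing step.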

### sentiment analysis

In [10]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

classifier("Dinosaurs are interestingly green!")

Device set to use mps:0


[{'label': 'neutral', 'score': 0.5663705468177795}]

### question answering

In [11]:
from transformers import pipeline

qa_model = pipeline("question-answering")

qa_model(question="What day would it be today, if tomorrow would be Saturday?", context="If tomorrow would be Saturday, that means today is Friday.")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'score': 0.9281117916107178, 'start': 51, 'end': 57, 'answer': 'Friday'}
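
The `start` and `end` values in the result are character offsets into the context string, with `end` exclusive. The short check below reuses the context and the result shown above to recover the answer by slicing; it needs no model.

```python
# Context string passed to the question-answering pipeline above.
context = "If tomorrow would be Saturday, that means today is Friday."

# Result copied from the output above.
result = {"score": 0.9281117916107178, "start": 51, "end": 57, "answer": "Friday"}

# Slicing the context with the reported offsets recovers the answer span.
span = context[result["start"]:result["end"]]
print(span)
```

Because the model is extractive, the answer is always a substring of the context, so the offsets can be used to highlight it in the original text.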

### import dataset

In [12]:
from datasets import load_dataset

dataset = load_dataset('CShorten/ML-ArXiv-Papers')

dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
        num_rows: 117592
    })
})

In [13]:
dataset['train'][0]

{'Unnamed: 0.1': 0,
 'Unnamed: 0': 0.0,
 'title': 'Learning from compressed observations',
 'abstract': '  The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rat
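
The `Unnamed: 0.1` and `Unnamed: 0` fields look like leftover index columns rather than useful features (an assumption based on their names and values). A minimal sketch of keeping only the text fields of a single record follows; the `record` below is copied from the output above with the abstract shortened to its opening words, and `keep_fields` is a hypothetical helper.

```python
# One record as shown above, with the abstract shortened for brevity.
record = {
    "Unnamed: 0.1": 0,
    "Unnamed: 0": 0.0,
    "title": "Learning from compressed observations",
    "abstract": "  The problem of statistical learning is to construct a predictor",
}


def keep_fields(row: dict, fields: tuple) -> dict:
    """Return a copy of the row restricted to the given field names."""
    return {name: row[name] for name in fields}


clean = keep_fields(record, ("title", "abstract"))
print(sorted(clean))
```

On the full split, the `remove_columns` method of the `datasets` library should achieve the same result without touching records one by one.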

### summarization

In [14]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

summarizer(dataset['train'][0]['abstract'])

Device set to use mps:0


[{'summary_text': 'Predictors are drawn from some specified class, and the goal is to approach the performance (expected loss) of the best predictor in the class. The ideas areillustrated on the example of nonparametric regression in Gaussian noise. Under suitable regularity conditions on the admissible predictors, we give an information-theoretic characterization of achievable predictorperformance in terms of conditional distortion-rate functions.'}]
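
The summary above contains fused words such as `areillustrated` and `predictorperformance`. One possible cause is the hard line breaks (`\n`) inside the abstract, although this has not been verified here. A minimal sketch of normalizing whitespace before summarizing follows; `normalize_whitespace` is a hypothetical helper and `excerpt` is the opening of the abstract shown earlier.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return " ".join(text.split())


# Opening of the abstract from the dataset record shown earlier.
excerpt = (
    "  The problem of statistical learning is to construct a predictor of a random\n"
    "variable $Y$ as a function of a related random variable $X$"
)

cleaned = normalize_whitespace(excerpt)
print(cleaned)
```

Passing the cleaned string to the summarization pipeline instead of the raw abstract is a cheap experiment to see whether the fused words disappear.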

### tokenization

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
sentence = "I can approve this by using the statistical methods."
tokens = tokenizer.tokenize(sentence)
tokens

['i',
 'can',
 'approve',
 'this',
 'by',
 'using',
 'the',
 'statistical',
 'methods',
 '.']
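
Every word in this sentence maps to a single token because each one is in the model's vocabulary. BERT tokenizers use WordPiece, which splits unknown words into sub-word pieces by greedy longest-match-first lookup. The sketch below illustrates that idea with a hypothetical toy vocabulary (`VOCAB`); it is a simplification, not the library's implementation, and the real `bert-base-uncased` vocabulary has roughly 30,000 entries.

```python
import re

# Hypothetical toy vocabulary, chosen only for this illustration.
VOCAB = {
    "i", "can", "approve", "this", "by", "using", "the",
    "statistical", "methods", ".", "token", "##ization", "[UNK]",
}


def basic_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def wordpiece(word, vocab):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=VOCAB):
    """Tokenize text into WordPiece-style tokens using the toy vocabulary."""
    return [piece for word in basic_tokenize(text) for piece in wordpiece(word, vocab)]


print(tokenize("I can approve this by using the statistical methods."))
print(tokenize("Tokenization"))
```

With this toy vocabulary, the first call reproduces the token list above, while `Tokenization` is split into a leading piece and a `##`-prefixed continuation piece.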