# A Review in Knowledge Extraction from Knowledge Bases

[Paper source](https://acl-bg.org/proceedings/2023/RANLP%202023/pdf/2023.ranlp-1.12.pdf)

Generative language models achieve state-of-the-art results in many natural language processing (NLP) tasks, but they fail to interpret knowledge (semantics). This lack of interpretability motivates the use of other technologies as a replacement for, or a complement to, generative language models. One such line of research incorporates knowledge from knowledge bases, mainly in the form of graphs. Large knowledge graphs are generated with unsupervised or semi-supervised techniques and, because of the size of the resulting databases, their knowledge is validated with the same type of techniques.

## Representing Knowledge - bases and graphs

Knowledge bases (KBs), generally represented by knowledge graphs (KGs) that store sets of nodes and edges (relations between the nodes), are widely used for storing information used in different machine learning tasks. Traditional machine learning models, including deep neural networks, take vectors as input, while the structure of a KG is more complex and cannot be reduced to a single vector because it has to represent nodes, edges, connectivity, global relations inside the graph and the features of every element. Methodologies for extracting knowledge from KGs focus either on creating latent vectors that capture the graph information (embeddings) or on using neural networks specially designed to deal with the graph structure.

Many KBs are developed using unsupervised machine learning techniques, generating massive amounts of data in the process. Those methods may introduce errors when completing the KB because of false relations between nodes. Large KBs also contain information that is not useful for a specific task and can be considered noise.

## Modeling Knowledge

For entity linking, three families of models are considered, divided according to the techniques used to perform the task.

### Translational Models

These models consider that the different relationships between elements can be represented as displacements in space.

#### Euclidean Space Models

Translational models express the relation between two entities as a translation in a vector space. A head entity h and a tail entity t are connected by a relation r that translates the first entity to the second; this is the case for the first translational model, TransE. This model does not deal well with complex relations, i.e. one-to-many (1-N), many-to-one (N-1) or many-to-many (N-N) relations. TransH improves the representation of complex relations by creating a unique hyperplane for each relation between two entities. TransR considers that entities and relations should lie in different spaces, which allows different entity representations depending on the relation between them.
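The translation idea can be illustrated with a minimal sketch of a TransE-style score. The function name `transe_score` and the toy two-dimensional embeddings are assumptions made for illustration; plain Python lists are used instead of a tensor library to keep the example dependency-free.

```python
import math


def transe_score(head, relation, tail):
    """Return the L2 distance ||h + r - t||; a lower value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy embeddings (illustrative values only, not learned).
paris = [1.0, 2.0]
capital_of = [0.5, -1.0]
france = [1.5, 1.0]
germany = [3.0, 3.0]

true_score = transe_score(paris, capital_of, france)    # head + relation lands on the tail
false_score = transe_score(paris, capital_of, germany)  # head + relation misses the tail
```

With these values the correct triple gets a distance of zero while the corrupted triple gets a larger distance, which is the behaviour a trained model tries to approximate.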

The TransD model uses fewer parameters than its predecessor by using vector multiplications instead of matrices. It assumes two vectors for each entity and relation: the first vector (h, r, t) represents the meaning of the entity or relation, and the second (hp, rp, tp) indicates how the entity must be projected into the relation space. TransD uses the same number of parameters for every relation, which can lead to overfitting when there are more parameters than necessary (simple relations) or underfitting when there are too few (complex relations). In TranSparse, each relation uses a sparse matrix for each entity, with different degrees of sparsity. This allows more or fewer parameters to be used depending on the complexity of the relation.

The regularization used in TransE is normalization: the magnitude of each entity embedding is forced to 1 at every learning step, which keeps the entity embeddings on a sphere. This conflicts with the translation principle, because the sum of the head entity and the relation can then no longer equal the tail entity, and it causes major problems by warping the embeddings obtained. To solve this, TorusE creates entity and relation embeddings using the same principle as TransE but on a torus.

PairRE employs paired vectors to represent complex relations. These vectors project entities into Euclidean space, where the distance is minimized if the relation holds. The main advantage of PairRE is that the paired vectors allow more versatility in the loss function, achieving a better representation of complex relations.

#### Complex Space Models

Even if Euclidean space models progressively improve state of the art, they still have difficulties dealing with relations of symmetry, anti-symmetry, inversion and composition. 

RotatE tries to solve this problem by representing embeddings in a complex space using Euler's identity. This way, the translation from the head entity to the tail entity becomes a rotation. The model also changes the loss function by introducing self-adversarial negative samples, which improves the training process. The score function employed in RotatE is the same as the TransE equation ($|h + r − t| ≈ 0$), but it uses the Hadamard product instead of the vector sum between head entity and relation. RotatE is extended to higher-dimensional spaces by modeling relations with orthogonal transform embeddings (OTE).
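A rough sketch of the rotation idea follows. It assumes each relation is stored as a list of phase angles and that the distance is the sum of element-wise complex moduli; `rotate_score` and the toy entities are hypothetical names, not taken from the paper.

```python
import cmath
import math


def rotate_score(head, phases, tail):
    """Distance between the rotated head (h * e^{i*phase}, element-wise) and the tail."""
    total = 0.0
    for h, phase, t in zip(head, phases, tail):
        rotated = h * cmath.exp(1j * phase)  # unit-modulus relation element: a pure rotation
        total += abs(rotated - t)
    return total


# A symmetric relation can be modelled with a rotation by pi: applying it twice is the identity.
spouse_phases = [math.pi, math.pi]
alice = [1 + 0j, 0 + 1j]
bob = [-1 + 0j, 0 - 1j]

forward = rotate_score(alice, spouse_phases, bob)   # alice rotated by pi lands on bob
backward = rotate_score(bob, spouse_phases, alice)  # bob rotated by pi lands on alice
```

Both directions score close to zero, which is how a rotation by pi captures symmetry, a pattern that a plain translation cannot represent with a non-zero relation vector.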

OTE applies orthogonal transformations from the head and relation vectors to the tail vector, and then from the tail and relation vectors to the head vector. Extending the idea of complex spaces, QuatE uses a hypercomplex space with three imaginary components i, j, k in order to give the obtained embeddings more degrees of freedom. In this case, the scoring function rotates the head entity using the Hamilton product.
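The Hamilton product mentioned above can be sketched as follows. The example assumes a single quaternion per entity and relation, a relation normalized to unit length before rotating, and an inner product with the tail as the final score; `hamilton_product` and `quate_score` are illustrative names.

```python
import math


def hamilton_product(p, q):
    """Multiply two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def quate_score(head, relation, tail):
    """Rotate the head by the unit-normalized relation, then take the inner product with the tail."""
    norm = math.sqrt(sum(x * x for x in relation))
    unit_relation = tuple(x / norm for x in relation)
    rotated = hamilton_product(head, unit_relation)
    return sum(x * y for x, y in zip(rotated, tail))


# i * j = k, while j * i = -k: the product is not commutative.
i_times_j = hamilton_product((0, 1, 0, 0), (0, 0, 1, 0))
j_times_i = hamilton_product((0, 0, 1, 0), (0, 1, 0, 0))
```

The non-commutativity shown in the last two lines is what gives quaternion rotations more freedom than rotations in the complex plane.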

#### Other Non-Euclidean Space Models

Other models explore the possibility of using mathematical expressions outside Euclidean space.

ManifoldE is a model that uses non-Euclidean space. It considers that translational models are algebraically ill-posed because they generate more equations than variables to solve, leading to approximate calculations for tasks like entity linking, where there are many entity candidates for one relation. ManifoldE instead uses a principle based on a "manifold" function for expressing the relation between two entities. With this approach the calculation should be exact, retrieving the true candidates for each relation. ManifoldE expands the position of golden triples from one point (as in TransE) to a manifold, using a higher-dimensional sphere, which reduces noise when detecting true relations among all candidates and improves the precision of the embedding vectors. Given a head entity and a relation, all possible tail entities lie on a manifold of greater dimension (a sphere). The scoring function is obtained as the difference between the radius of the sphere and the distance $|h + r − t|$ used in TransE. ManifoldE further improves its results by using a hyperplane as the manifold instead of a sphere.

Hyperbolic space is ideal for modeling entities with hierarchical information due to its curvature. The problem with hyperbolic space is representing entities with different hierarchies under different relations. MuRP uses a Poincaré ball as its hyperbolic space, creating multi-relational embeddings for each entity and relation. The key to MuRP is that a hypersphere in hyperbolic space grows exponentially compared to Euclidean space, leaving more room to separate the nodes. MuRP trains relation-specific parameters that transform entity embeddings through Möbius matrix-vector multiplication and Möbius addition. The hyperbolic entity embeddings are obtained by Möbius matrix-vector multiplication: the original embeddings are projected to the tangent space of the Poincaré ball, transformed by the diagonal relation matrix, and then projected back to the Poincaré ball.
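As a small illustration of the hyperbolic operations named above, the sketch below implements Möbius addition and the induced distance on the Poincaré ball. It assumes a fixed curvature of -1 and uses hypothetical helper names (`dot`, `mobius_add`, `poincare_distance`); it is not MuRP's full scoring function.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mobius_add(x, y):
    """Mobius addition of two points inside the unit Poincare ball (curvature -1)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1 + 2 * xy + xx * yy
    return [((1 + 2 * xy + yy) * a + (1 - xx) * b) / denom for a, b in zip(x, y)]


def poincare_distance(x, y):
    """Hyperbolic distance: 2 * artanh(||(-x) (+) y||)."""
    diff = mobius_add([-a for a in x], y)
    return 2 * math.atanh(math.sqrt(dot(diff, diff)))


origin = [0.0, 0.0]
near_centre = poincare_distance(origin, [0.5, 0.0])
near_boundary = poincare_distance(origin, [0.9, 0.0])
```

Moving from radius 0.5 to radius 0.9 less than doubles the Euclidean distance from the origin but increases the hyperbolic distance by a much larger factor, which is the extra room near the boundary that hierarchical embeddings exploit.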

MuRP cannot encode some logical properties of relationships. It uses a single fixed curvature, although a specific curvature for each relation would represent hierarchies better depending on the context, and it uses only translations in hyperbolic space. By contrast, ATTH creates embeddings in hyperbolic space using reflections and rotations, which enables the relation patterns handled by RotatE to be captured, and it considers a relation-specific curvature $c_r$ that allows a variety of hierarchies. Rotations are built with Givens transformation matrices because this model does not employ complex numbers. ATTH uses entity biases in the scoring function, which act as margins for triples.

Previous methods are designed to create entity and relation representations in Euclidean, hyperbolic or hyperspherical space, but none of them compares results across different spaces. Geometry Interaction Knowledge Graph Embeddings (GIE) considers vectors in Euclidean (E), hyperbolic (H) and hyperspherical (S) spaces for head and tail entities and uses an attention mechanism over each vector in order to prioritize the space that best represents the knowledge about the entity. Vectors in hyperbolic and hyperspherical space are logarithmically mapped to the tangent space before attention is applied, and then features are extracted. GIE has an attention vector with a specific component for each space, for both the head and the tail entity of a triple.

### Tensor Factorization Models

These factorization models represent the relationships between entities as tensors and perform decomposition operations on those tensors to represent each entity and relationship. This approach has some advantages over translational models: 1) tensors can represent multiple relations of any order simply by increasing the tensor dimensionality, and 2) prior knowledge of the problem structure is not necessary in order to infer knowledge from data.

#### Euclidean Space Models

RESCAL is the first tensor factorization model created to represent relations between entities. In this model, each relation is described by its own matrix which, like a confusion matrix, records for every pair of entities whether that relation holds between them. The data is given as an $n \times n \times m$ tensor, where $n$ is the number of entities and $m$ is the number of relations.

RESCAL employs the following factorization over each slice $X_k$ of the tensor:

$$X_k ≈ A R_k A^T, \quad \text{for } k = 1, ..., m$$

where $A$ is an $n \times r$ matrix containing the latent-component representation of the entities and $R_k$ is an $r \times r$ matrix that models the interactions between latent components for relation $k$. Matrix $R_k$ is asymmetric, which is useful for considering whether a latent component acts as a subject or an object, given that each entity has a unique latent-component representation regardless of whether it is the subject or the object of a relation. Matrices $A$ and $R_k$ are computed by solving a regularized minimization problem over this factorization.

In order to reduce the number of training parameters in RESCAL, DistMult uses a diagonal matrix $W_r$ instead of an asymmetric relation matrix. This leads to a model that is more expressive than TransE with the same number of parameters and as scalable as the previously mentioned models, but less expressive than RESCAL.
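The difference between a full relation matrix and a diagonal one can be sketched with the two bilinear scores below. The function names and toy vectors are assumptions for illustration only.

```python
def rescal_score(head, relation_matrix, tail):
    """Bilinear score h^T R t with a full, possibly asymmetric, r x r relation matrix."""
    return sum(
        head[i] * relation_matrix[i][j] * tail[j]
        for i in range(len(head))
        for j in range(len(tail))
    )


def distmult_score(head, relation_diag, tail):
    """Same bilinear form restricted to a diagonal relation matrix (stored as a vector)."""
    return sum(h * r * t for h, r, t in zip(head, relation_diag, tail))


a = [1.0, 0.0]
b = [0.0, 1.0]
asymmetric_relation = [[0.0, 1.0], [0.0, 0.0]]

rescal_ab = rescal_score(a, asymmetric_relation, b)  # 1.0: the relation holds from a to b
rescal_ba = rescal_score(b, asymmetric_relation, a)  # 0.0: but not from b to a

diag_relation = [2.0, 3.0]
x = [1.0, 2.0]
y = [3.0, 1.0]
distmult_xy = distmult_score(x, diag_relation, y)
distmult_yx = distmult_score(y, diag_relation, x)    # always equal to distmult_xy
```

The diagonal form needs only r parameters per relation instead of r squared, at the cost of scoring (x, r, y) and (y, r, x) identically, which is one way to see why it is less expressive than RESCAL.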

Holographic embeddings (HolE) use the circular correlation of vectors to represent pairs of entities. Correlation makes HolE efficient to compute and scalable to large datasets. This operation can be considered a compression of the tensor product: in circular correlation, each component is a sum over a fixed partition of pairwise interactions. HolE can store and retrieve information via circular convolution and circular correlation, respectively, and it also learns the embeddings from the data.
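Circular correlation itself is short to write down. The sketch below uses the direct quadratic-time definition rather than the FFT-based form, and assumes the HolE score is the inner product of the relation vector with the correlated pair (before any sigmoid); the function names are illustrative.

```python
def circular_correlation(a, b):
    """Component k is the sum over i of a[i] * b[(k + i) mod d]."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]


def hole_score(head, relation, tail):
    """Inner product between the relation vector and the head-tail correlation."""
    correlated = circular_correlation(head, tail)
    return sum(r * c for r, c in zip(relation, correlated))


corr = circular_correlation([1, 2, 3], [4, 5, 6])

# The operation is not commutative, so (h, r, t) and (t, r, h) can score differently.
ab = circular_correlation([1, 0, 0], [0, 1, 0])
ba = circular_correlation([0, 1, 0], [1, 0, 0])
```

Each output component sums d of the d squared pairwise products, so the pair representation has the same dimensionality as a single entity embedding.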

SimplE is a tensor factorization method based on Canonical Polyadic (CP) decomposition. It uses two vectors for each entity ($h_e$, $t_e$) and for each relation ($v_r$, $v_{r^{-1}}$). SimplE uses a similarity function for each triple that is the average of the CP scores of the current triple and of its inverse-relation triple.
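The averaging of the two CP scores can be sketched as below. Argument names such as `head_as_head` are hypothetical and only spell out which of the two vectors of each entity is used in each term.

```python
def triple_product(a, b, c):
    """Sum of element-wise products of three vectors, i.e. the CP score <a, b, c>."""
    return sum(x * y * z for x, y, z in zip(a, b, c))


def simple_score(head_as_head, head_as_tail, rel, rel_inv, tail_as_head, tail_as_tail):
    """Average of the CP score of (h, r, t) and the CP score of (t, r^-1, h)."""
    forward = triple_product(head_as_head, rel, tail_as_tail)
    backward = triple_product(tail_as_head, rel_inv, head_as_tail)
    return 0.5 * (forward + backward)


score = simple_score(
    head_as_head=[1.0, 2.0],
    head_as_tail=[0.0, 1.0],
    rel=[1.0, 1.0],
    rel_inv=[0.0, 2.0],
    tail_as_head=[3.0, 1.0],
    tail_as_tail=[1.0, 1.0],
)
```

Tying the two terms together is what lets both vectors of an entity receive a training signal from every triple it appears in, whichever side it is on.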

TuckER is a linear model for tensor factorization, based on Tucker decomposition, which generalizes previous tensor factorization models such as RESCAL, DistMult, ComplEx and SimplE. It decomposes the binary tensor of triples, factorizing it into a smaller core tensor multiplied by one matrix for each dimension of the original tensor. In the case of TuckER, the decomposition creates a smaller core tensor $W$ and matrices $e_h$, $w_r$ and $e_t$ for the head entity, relation and tail entity, respectively.
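A minimal sketch of a Tucker-style score follows, written as an explicit triple sum over the core tensor. The nested-list core and the name `tucker_score` are assumptions; a real implementation would use a tensor library.

```python
def tucker_score(core, head, relation, tail):
    """Contract the core tensor with the head, relation and tail vectors."""
    return sum(
        core[i][j][k] * head[i] * relation[j] * tail[k]
        for i in range(len(head))
        for j in range(len(relation))
        for k in range(len(tail))
    )


# With ones on the superdiagonal and zeros elsewhere, the score reduces to a DistMult-style product.
d = 2
diagonal_core = [
    [[1.0 if i == j == k else 0.0 for k in range(d)] for j in range(d)]
    for i in range(d)
]
score = tucker_score(diagonal_core, [1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
```

Here the score equals 1·3·5 + 2·4·6, which illustrates how a constrained core recovers a simpler model as a special case.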

#### Other Non-Euclidean Space Models

Like RotatE, ComplEx uses complex-valued embeddings. In this case, it performs tensor factorization using the Hermitian dot product, which involves the conjugate transpose of one of the two vectors being multiplied. This type of dot product yields a non-symmetric matrix, which makes it possible to represent antisymmetric relations while maintaining linearity and low time complexity.
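The effect of the conjugate can be seen in a tiny sketch. It assumes the usual form of the score, the real part of the sum of h·r·conj(t); the one-dimensional toy embeddings are illustrative.

```python
def complex_score(head, relation, tail):
    """Real part of sum_i h_i * r_i * conj(t_i) over complex-valued embeddings."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


# A purely imaginary relation makes the score antisymmetric in head and tail.
parent_of = [1j]
anna = [1 + 0j]
ben = [0 + 1j]

forward = complex_score(anna, parent_of, ben)
backward = complex_score(ben, parent_of, anna)
```

Swapping head and tail flips the sign of the score, something a real-valued diagonal model such as DistMult cannot do.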

### Deep Neural Models

The deep neural models are used to obtain the main characteristics of each possible relationship and determine whether they are truthful or encode information from nearby entities. Graph neural networks can encode information about neighbours from each specific node, introducing context during processing in the neural network.

#### Graph Convolutional Networks (GCNs)

The first GCN introduced generates a hidden state for each processed node, taking into consideration each neighbour and relation. In each GCN layer, the processed node aggregates information from every neighbour equally. The context given by graphs improves many tasks that deal with relational data. This is the case for R-GCN, an encoder that produces a hidden state for each node considering not only its neighbours but also the specific relations, in contrast with the original GCN, which makes it suitable for processing heterogeneous graphs.
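One propagation step of a basic GCN layer can be sketched as follows. The example assumes self-loops and the symmetric degree normalization commonly associated with the original GCN, with no relation-specific weights (so it is not R-GCN); `matmul` and `gcn_layer` are illustrative helpers on plain lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weights):
    """One step of ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    degrees = [sum(row) for row in a_hat]
    # Symmetric normalization: every neighbour contributes with a degree-based weight only.
    norm = [
        [a_hat[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]
    hidden = matmul(matmul(norm, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


adjacency = [[0, 1], [1, 0]]          # two nodes joined by one edge
features = [[1.0, 0.0], [0.0, 1.0]]   # one-hot node features
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adjacency, features, identity)
```

After one layer both nodes hold the same mixture of the two original feature vectors, showing how neighbour information is averaged without any learned per-edge importance.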

#### Graph Attention Networks (GATs)

GCNs perform convolutions that give equal importance to all edges in the processed graph, which may be too shallow an approach for tasks where specific nodes and edges carry more important information than others. Graph Attention Networks were introduced to solve this issue. GATs perform a convolution that assigns a different weight to each edge connected to a specific node, and each edge can have multiple associated weights, one per attention head.
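The per-edge weighting can be sketched for a single attention head. The example assumes node features that have already been linearly transformed, a LeakyReLU slope of 0.2 and an attention vector applied to the concatenation of the node and neighbour features; all names and values are illustrative.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_weights(node, neighbours, attn_vector):
    """Softmax over neighbours of LeakyReLU(a . [node || neighbour])."""
    logits = [
        leaky_relu(sum(a * v for a, v in zip(attn_vector, node + neighbour)))
        for neighbour in neighbours
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(node, neighbours, attn_vector):
    """Attention-weighted sum of neighbour features."""
    weights = attention_weights(node, neighbours, attn_vector)
    return [
        sum(w * neighbour[d] for w, neighbour in zip(weights, neighbours))
        for d in range(len(node))
    ]


node = [1.0, 0.0]
neighbours = [[1.0, 0.0], [0.0, 1.0]]
attn_vector = [0.0, 0.0, 1.0, 0.0]  # hypothetical learned parameters
weights = attention_weights(node, neighbours, attn_vector)
updated = aggregate(node, neighbours, attn_vector)
```

With several heads, this computation would be repeated with a separate attention vector per head, giving each edge one weight per head.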

A2N uses an attention mechanism with specific queries to generate conditioned embeddings that take into account each query together with the neighbourhood of a source entity. A scalar attention score is generated for each neighbour, and the neighbour embeddings are then aggregated to generate a new source embedding. Lastly, the new source embedding is concatenated with the initial one and projected to obtain the final source embedding. In the original paper, DistMult is used as the attention scoring function, as it allows neighbours to be projected into the same space as the target entities.

The use of non-Euclidean spaces has been extended to graph neural networks as in the case of M2GNN. Previous models using non-Euclidean spaces only considered homogeneous relations, so they lack expressiveness in this respect. M2GNN creates a non-constant heterogeneous curvature space using new parameters in the network called curvature coefficients. The proposed architecture also makes use of attention heads to improve the accuracy obtained.

#### Convolutional Neural Networks (CNNs)

CNNs, which are broadly used in computer vision, have recently been applied to entity linking. The main reason is that CNNs can solve entity linking tasks with far fewer parameters than previously mentioned models such as DistMult. CNNs are also considered a very expressive way of representing entities and relations compared to translational models, due to the number of features extracted by the CNN filters.

ConvE is the first convolutional model to achieve good results on entity linking tasks. It is simple, as it uses only one convolutional layer with 2D convolutions, a projection layer to the embedding dimension and an inner product to make the entity linking prediction. The convolution is applied after first concatenating the 2D vectors from the head entity and relation embeddings.

ConvKB uses a convolutional layer with 3-column matrices, where each matrix is made of the concatenation of the triple vectors (eh, r, et). The features obtained after convolution are concatenated and score is obtained performing a multiplication with a weight vector $w$. Filters used for convolution in previous models are designed arbitrarily, which can lead to a poor performance.

In order to solve this problem, HypER uses a hypernetwork for determining the right filter for each relation. A fully connected layer is used for obtaining embeddings representing head entity and relation, then the hypernetwork creates the filters of each relation embedding which will be utilized during convolution of entity embeddings. The hypernetwork proposed is a single fully connected layer. HypER uses a weight matrix that projects the results to another dimensional space in order to make the dot product between head entity and tail entity.

## Conclusions

In both translational and tensor factorization models, there is a tendency to represent increasingly complex spaces, to the point of combining different types of spaces into one (Euclidean, hyperspherical and hyperbolic), or to represent increasingly complex vector spaces (complex space, quaternions, etc.). However, in some cases the state of the art is surpassed without necessarily increasing the complexity of the space represented; this is the case of SimplE (which achieves results similar to ComplEx) and TuckER.

Alternative spaces to the euclidean with positive or negative curvature tend to better represent some properties of entities with a smaller number of features, such as circular relations in hyperspherical spaces and hierarchies in hyperbolic spaces, allowing the creation of embeddings at a lower computational cost.

In the case of deep neural models, tests have also been carried out with positive and negative curvature spaces. In these cases, curvature is a parameter to be trained within the network. The current state of the art is led by models that combine different vector spaces (GIE, M2GNN).

# ChatGPT-guided Semantics for Zero-shot Learning

[Paper PDF source](https://arxiv.org/pdf/2310.11657v1.pdf)

[Paper code repository](https://github.com/FHShubho/CGS-ZSL/tree/main)

### Zero-shot learning

The Zero-shot Learning (ZSL) approach aims to classify unseen objects not observed in training. A more generic version
named Generalized ZSL (GZSL) attempts to predict a class from seen and unseen classes together. Researchers have
started exploring ZSL and GZSL with 2D image datasets. Later, considering the availability of depth-sensing cameras,
exploring ZSL on 3D point cloud data got considerable attention. For both 2D and 3D cases, semantic descriptions of classes play a pivotal role in transferring knowledge from seen to unseen classes. Class semantics are designed to describe all objects with a common set of features or components, working as a bridge between seen and unseen worlds.
Prior works show that addressing ZSL tasks in 3D has more challenges than its 2D counterpart. Therefore, improving class semantics may help to address some challenges.

Class semantics can be obtained manually (attribute vectors) or automatically (word vectors). Attributes are identifiable features that describe a class; they require laborious human annotation and are not readily available for many large-scale or 3D point cloud datasets. In contrast, automatic word vectors are the output (as vectors) of language models given class names as input. These models are usually trained on text corpora containing billions of words from Wikipedia, news articles, etc. Compared to attributes, the automatic extraction of word vectors makes them more realistic for real-world applications. However, word vectors are noisier than manual attribute vectors, resulting in poorer ZSL performance. This issue becomes more challenging for ZSL on 3D point cloud objects because of the pre-trained models, poor-quality features, dataset size, etc.

Usually, word vectors are calculated using a single class name as input. However, many related words and definitions associated with that single class name are also necessary to improve the representativeness of the word vector. Considering related semantics can improve the semantic description of the given class. Again, such related semantics can be obtained by hard manual annotations, a costly process, or noisy web crawling annotation. 

To address this problem, ChatGPT can be a valuable source for describing a class with related semantics and attributes. It is both automatic and less noisy. Therefore, given a class name, we ask ChatGPT to describe that class with a paragraph of text containing related semantics and attributes. Then, a word embedding of that paragraph can be extracted using a language model (word2vec). As a last step, the word embeddings from the class name and from ChatGPT can be linearly combined to calculate an improved word vector. This approach does not need any prompt engineering, so it can be used in any existing ZSL model to increase accuracy. The new approach is tested on seven methods (DEM, LATEM, SYNC, GDAN, f-CLSWGAN, TF-VAEGAN and CADA-VAE), covering both 2D image and 3D point cloud datasets (ModelNet10, ModelNet40 and ScanObjectNN). The method is applicable to both the synthetic ModelNet40 and the real-world scanned ScanObjectNN cases.

### Methodology

Large language models (LLMs) have gained significant attention due to numerous advantages across various applications. To harness this power, the ChatGPT model is utilized to generate supplementary descriptions for the class names during the training phase of ZSL. This description includes attributes and semantics related to a class that can enhance its discriminative ability from other classes. Encoding this description with a word vector by forwarding it through a language model (word2vec) can augment additional supervision to ZSL models.

#### Problem Formulation

Let $x ∈ R^K$ represent the input data, corresponding to either an image or point cloud data. There are two sets of class labels, $Y^s = \{y^s_1, ..., y^s_s\}$ for seen classes and $Y^u = \{y^u_1, ..., y^u_u\}$ for unseen classes, and the seen and unseen labels are disjoint. Additionally, $E^s = \{ϕ(y^s_1), ..., ϕ(y^s_s)\}$ and $E^u = \{ϕ(y^u_1), ..., ϕ(y^u_u)\}$ represent the sets of semantic feature embeddings obtained using the embedding function $ϕ$, where $ϕ(y) ∈ R^d$. The set of $n_s$ seen samples is defined as $D^s = \{(x^s_i, y^s_i, e^s_i)\}^{n_s}_{i=1}$, where $x^s_i$ represents the $i^{th}$ instance from the seen set, with ground truth label $y^s_i ∈ Y^s$ and corresponding semantic vector $e^s_i = ϕ(y^s_i) ∈ E^s$. Similarly, the set of $n_u$ unseen samples is defined as $D^u = \{(x^u_i, y^u_i, e^u_i)\}^{n_u}_{i=1}$, where $x^u_i$ represents the $i^{th}$ sample from the unseen set, with ground truth label $y^u_i ∈ Y^u$ and corresponding semantic vector $e^u_i = ϕ(y^u_i) ∈ E^u$. The two main tasks addressed are Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL).
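The difference between the two tasks lies only in the label space searched at test time. The sketch below assumes input features that have already been mapped into the semantic space and a nearest-neighbour decision rule; the class names and vectors are made up for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_class(feature, class_embeddings):
    """Return the label whose semantic embedding is closest to the feature."""
    return min(class_embeddings, key=lambda label: squared_distance(feature, class_embeddings[label]))


seen = {"cat": [1.0, 0.0], "dog": [0.9, 0.1]}   # stands in for E^s
unseen = {"zebra": [0.0, 1.0]}                  # stands in for E^u

feature = [0.8, 0.2]

zsl_prediction = nearest_class(feature, unseen)                 # ZSL: only unseen labels are candidates
gzsl_prediction = nearest_class(feature, {**seen, **unseen})    # GZSL: seen and unseen labels compete
```

The same feature receives different labels in the two settings, which is why GZSL is usually reported separately and is typically the harder task.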

#### Preliminaries

The two main types of ZSL methods are embedding-based and generative. Embedding methods learn to map both the visual features of the input data and the semantic attributes to a common space. They can recognize unseen classes in this space by comparing their attribute representations with the input's features. In contrast, generative methods synthesize samples that look like unseen classes by using a mix of labeled data from seen classes and auxiliary information, such as class attributes or textual descriptions. Overall, in any ZSL methodology, an input backbone transforms the input data into a meaningful feature embedding.

Embedding-based ZSL aims to learn functions that map input data (images or point clouds) and semantic information (attributes or class labels) to a shared space. In this space, the model can link input features with semantic
information, allowing it to identify and categorize classes it has not seen before. In general, there are two branches in
the embedding approach. In the first, the input feature embedding is forwarded into a fully connected layer. Simultaneously, the corresponding semantic representation is fed into a fully connected layer, which maps the semantic embedding to a common space, where the Euclidean distance between the semantic and feature embeddings is minimized. This is achieved by optimizing the following objective function:

$$L_E = \frac{1}{n_s} \sum^{n_s}_{i=1} \|z'_i − e'_i\|^2_2 + λR(θ)$$

where $θ$ denotes the trainable parameters and $R$ is the regularization loss function. The hyperparameter $λ$ controls the trade-off between the regularization and embedding losses.
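The objective can be evaluated directly on small inputs, as in the sketch below. It assumes that the projected features and projected semantic vectors are already available and that $R(θ)$ is a squared L2 norm of the parameters, which the text does not specify; all names are illustrative.

```python
def embedding_loss(projected_features, projected_semantics, parameters, lam):
    """Mean squared Euclidean distance between paired embeddings plus lam * R(theta)."""
    n = len(projected_features)
    distance_term = sum(
        sum((z - e) ** 2 for z, e in zip(z_i, e_i))
        for z_i, e_i in zip(projected_features, projected_semantics)
    ) / n
    regularizer = sum(p * p for p in parameters)  # assumption: R(theta) is a squared L2 norm
    return distance_term + lam * regularizer


loss = embedding_loss(
    projected_features=[[1.0, 2.0], [0.0, 0.0]],
    projected_semantics=[[1.0, 0.0], [0.0, 1.0]],
    parameters=[1.0, 2.0],
    lam=0.1,
)
```

Raising `lam` shifts the optimum towards smaller parameters at the expense of a larger distance between the two embeddings.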

In generative Zero-Shot Learning (ZSL) methods, the objective is to generate synthetic samples for the unseen classes based on their semantic attributes. These methods enable the model to generate realistic and representative samples of unseen classes by leveraging the semantic information associated with those classes. The goal is to learn a conditional
Generative Adversarial Network (GAN) model, denoted as $G$, which takes random Gaussian noise $h ∼ N (0, 1)$ and the
semantic class embedding $e_i$ as inputs to generate the feature representation of class $i$, denoted as $\hat{z}_i ∈ R^m$. Concurrently, a discriminator model is trained to classify real features against synthetic features. The objective function for generating synthetic feature samples is defined as follows:

$$L_G =E_{z,e}[D(z, e)] − E_{z,e}[D(\hat{z}, e)]− ηE_{\tilde{z},eh}[(∥∇_{\tilde{z}}D(\tilde{z}, e)∥_2 − 1)^2]$$

where $\tilde{z} = βz + (1 − β)\hat{z}$ is an interpolation between real and synthetic features, with $β ∼ U(0, 1)$, and $η$ is the penalty coefficient. This objective function guides the training of the conditional GAN model, enabling it to generate realistic and diverse feature representations based on the input class embeddings and Gaussian noise. In addition, a classification loss is required to ensure that the synthetic samples are suitable for the classifier. This loss is defined as follows:

$$L_C = −E_{\hat{z}∼p_{\hat{z}}} [log P (y | \hat{z}; Θ)]$$

The provided loss function is computed using a linear softmax classifier parameterized by $Θ$, which undergoes pretraining on the real features $z ∈ D^s$ from seen classes. To elaborate, this loss function serves as a regularizer, motivating the generator to construct discerning features in its generated samples.
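The classification loss is an ordinary negative log-likelihood under a linear softmax classifier, which can be sketched as follows. The weights here are hand-picked stand-ins for a classifier pretrained on seen-class features, and the function names are illustrative.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(features, labels, weights, biases):
    """Mean negative log-likelihood of the labels under a linear softmax classifier."""
    loss = 0.0
    for z, y in zip(features, labels):
        logits = [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, biases)]
        loss -= math.log(softmax(logits)[y])
    return loss / len(features)


synthetic = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for generated features
labels = [0, 1]

uninformative = classification_loss(synthetic, labels, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
discriminative = classification_loss(synthetic, labels, [[5.0, 0.0], [0.0, 5.0]], [0.0, 0.0])
```

The loss is low only when the generated features fall on the correct side of the classifier's decision boundaries, which is the pressure that pushes the generator towards discriminative samples.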

#### Improved class semantics

The ChatGPT model is given the class name of a seen class and asked to generate descriptive sentences. Specifically, the class names are given to GPT-3.5 using prompts such as “Describe the [CLASS] in at most ten sentences, focusing on specific physical features and excluding any unavailable features.” The generated sentences offer detailed descriptions of the class names, helping to improve the class semantics for the ZSL tasks.

A multi-step process is devised to enrich the understanding of class characteristics. The class name is initially processed by ChatGPT, generating a comprehensive description. This description includes multiple sentences closely related
to the class name. Subsequently, both the class name’s semantic representation and the ChatGPT output’s semantic representation are extracted using a Word2vec model. These semantic representations offer valuable insights into the meaning and context of the class name and its associated description. Afterwards, these semantic descriptions are passed through separate fully connected layers before being merged. The merging process through addition allows the combination of the class-specific information with the contextual details from ChatGPT, resulting in a more comprehensive
and enriched representation.
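The merging step can be sketched as two separate linear projections followed by an addition. The sketch assumes the description is embedded by averaging its word vectors, which the text does not state, and uses hand-written identity layers in place of trained fully connected layers; every name below is hypothetical.

```python
def mean_vector(vectors):
    """Average a list of equally sized word vectors (assumed pooling for the description)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def linear(x, weight, bias):
    """Fully connected layer without activation: weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse(class_vec, description_vec, name_layer, description_layer):
    """Project both semantic vectors with separate layers, then merge them by addition."""
    projected_name = linear(class_vec, *name_layer)
    projected_description = linear(description_vec, *description_layer)
    return [a + b for a, b in zip(projected_name, projected_description)]


identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # (weight, bias) placeholder
class_vec = [1.0, 0.0]                                    # vector of the class name
description_vec = mean_vector([[0.0, 1.0], [0.0, 3.0]])   # vectors of the description words
fused = fuse(class_vec, description_vec, identity_layer, identity_layer)
```

Because the merge is an addition, both projections must map to the same output dimension, and the fused vector can replace the original class embedding in an existing ZSL model without changing its input size.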

### Conclusion

ChatGPT-based word vectors fused with traditional class name-based vectors can achieve better ZSL and GZSL performance on 2D image and 3D point cloud recognition tasks. Experiments on multiple embedding-based and generative ZSL methods show that this technique consistently improves those methods’ existing performance. ChatGPT could be a suitable annotation tool that can provide automatic and less noisy information without manual labor.

### Code Showcase


Handle dependencies 

In [1]:
import os
import openai
import pandas as pd
import time
from tqdm.notebook import tqdm
import random

Set up ChatGPT API and environment

In [2]:
GPT_KEY = 'insert API key here'

In [3]:
openai.api_key = GPT_KEY
openai.Model.list()

<OpenAIObject list at 0x203eb9dfab0> JSON: {
  "object": "list",
  "data": [
    {
      "id": "curie-search-query",
      "object": "model",
      "created": 1651172509,
      "owned_by": "openai-dev"
    },
    {
      "id": "babbage-search-query",
      "object": "model",
      "created": 1651172509,
      "owned_by": "openai-dev"
    },
    {
      "id": "dall-e-3",
      "object": "model",
      "created": 1698785189,
      "owned_by": "system"
    },
    {
      "id": "babbage-search-document",
      "object": "model",
      "created": 1651172510,
      "owned_by": "openai-dev"
    },
    {
      "id": "dall-e-2",
      "object": "model",
      "created": 1698798177,
      "owned_by": "system"
    },
    {
      "id": "gpt-3.5-turbo-0301",
      "object": "model",
      "created": 1677649963,
      "owned_by": "openai"
    },
    {
      "id": "text-embedding-ada-002",
      "object": "model",
      "created": 1671217299,
      "owned_by": "openai-internal"
    },
    {
      "id

In [4]:
model = 'gpt-3.5-turbo'

In [5]:
response = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "tell me a joke"}]
)

print(response)

RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.

Download dataset from [here](https://data.caltech.edu/records/65de6-vp158).

Read the class names and generate description.

Load file names from the dataset.

In [9]:
# Each class folder is named "<index>.<Bird_name>", e.g. "001.Black_footed_Albatross".
birds = os.listdir(os.path.join('CUB_200_2011', 'images'))

bird_names = {}

for folder in birds:
    index, name = folder.split('.', 1)
    bird_names[int(index) - 1] = name  # zero-based class index

bird_names = dict(sorted(bird_names.items()))
bird_names

{0: 'Black_footed_Albatross',
 1: 'Laysan_Albatross',
 2: 'Sooty_Albatross',
 3: 'Groove_billed_Ani',
 4: 'Crested_Auklet',
 5: 'Least_Auklet',
 6: 'Parakeet_Auklet',
 7: 'Rhinoceros_Auklet',
 8: 'Brewer_Blackbird',
 9: 'Red_winged_Blackbird',
 10: 'Rusty_Blackbird',
 11: 'Yellow_headed_Blackbird',
 12: 'Bobolink',
 13: 'Indigo_Bunting',
 14: 'Lazuli_Bunting',
 15: 'Painted_Bunting',
 16: 'Cardinal',
 17: 'Spotted_Catbird',
 18: 'Gray_Catbird',
 19: 'Yellow_breasted_Chat',
 20: 'Eastern_Towhee',
 21: 'Chuck_will_Widow',
 22: 'Brandt_Cormorant',
 23: 'Red_faced_Cormorant',
 24: 'Pelagic_Cormorant',
 25: 'Bronzed_Cowbird',
 26: 'Shiny_Cowbird',
 27: 'Brown_Creeper',
 28: 'American_Crow',
 29: 'Fish_Crow',
 30: 'Black_billed_Cuckoo',
 31: 'Mangrove_Cuckoo',
 32: 'Yellow_billed_Cuckoo',
 33: 'Gray_crowned_Rosy_Finch',
 34: 'Purple_Finch',
 35: 'Northern_Flicker',
 36: 'Acadian_Flycatcher',
 37: 'Great_Crested_Flycatcher',
 38: 'Least_Flycatcher',
 39: 'Olive_sided_Flycatcher',
 40: 'Scisso

Clear underscores from file names.

In [11]:
bird_names = {k: v.replace('_', ' ') for k, v in bird_names.items()}
bird_names

{0: 'Black footed Albatross',
 1: 'Laysan Albatross',
 2: 'Sooty Albatross',
 3: 'Groove billed Ani',
 4: 'Crested Auklet',
 5: 'Least Auklet',
 6: 'Parakeet Auklet',
 7: 'Rhinoceros Auklet',
 8: 'Brewer Blackbird',
 9: 'Red winged Blackbird',
 10: 'Rusty Blackbird',
 11: 'Yellow headed Blackbird',
 12: 'Bobolink',
 13: 'Indigo Bunting',
 14: 'Lazuli Bunting',
 15: 'Painted Bunting',
 16: 'Cardinal',
 17: 'Spotted Catbird',
 18: 'Gray Catbird',
 19: 'Yellow breasted Chat',
 20: 'Eastern Towhee',
 21: 'Chuck will Widow',
 22: 'Brandt Cormorant',
 23: 'Red faced Cormorant',
 24: 'Pelagic Cormorant',
 25: 'Bronzed Cowbird',
 26: 'Shiny Cowbird',
 27: 'Brown Creeper',
 28: 'American Crow',
 29: 'Fish Crow',
 30: 'Black billed Cuckoo',
 31: 'Mangrove Cuckoo',
 32: 'Yellow billed Cuckoo',
 33: 'Gray crowned Rosy Finch',
 34: 'Purple Finch',
 35: 'Northern Flicker',
 36: 'Acadian Flycatcher',
 37: 'Great Crested Flycatcher',
 38: 'Least Flycatcher',
 39: 'Olive sided Flycatcher',
 40: 'Scisso
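
The two steps above (parsing the numbered folder names and replacing the underscores) can be combined into one helper. The following is a minimal sketch that runs on a hard-coded sample list instead of the real dataset directory; `parse_class_dirs` and `sample_dirs` are hypothetical names introduced only for illustration.

```python
def parse_class_dirs(dir_names):
    """Map a zero-based class index to a readable class name.

    Assumes every entry looks like '001.Black_footed_Albatross'.
    """
    classes = {}
    for entry in dir_names:
        # Split once at the first dot: '001' and 'Black_footed_Albatross'.
        number, _, raw_name = entry.partition('.')
        classes[int(number) - 1] = raw_name.replace('_', ' ')
    # Sort by class index so the order does not depend on the listing order.
    return dict(sorted(classes.items()))


# Hypothetical sample input, deliberately out of order.
sample_dirs = [
    '003.Sooty_Albatross',
    '001.Black_footed_Albatross',
    '002.Laysan_Albatross',
]
parsed = parse_class_dirs(sample_dirs)
print(parsed)
```

Sorting inside the helper keeps the result independent of the order in which the operating system lists the folders.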

Convert the class names into a DataFrame.

In [12]:
bird_names_df = pd.DataFrame(bird_names.items(), columns=['index', 'Bird_name'])
bird_names_df.drop(['index'], axis=1, inplace=True)
bird_names_df.head()

|   | Bird_name |
|---|---|
| 0 | Black footed Albatross |
| 1 | Laysan Albatross |
| 2 | Sooty Albatross |
| 3 | Groove billed Ani |
| 4 | Crested Auklet |


Add a description column.

In [14]:
bird_names_df['Short_Description'] = ''
bird_names_df.head()

|   | Bird_name | Short_Description |
|---|---|---|
| 0 | Black footed Albatross | |
| 1 | Laysan Albatross | |
| 2 | Sooty Albatross | |
| 3 | Groove billed Ani | |
| 4 | Crested Auklet | |


Function to generate descriptions.

In [16]:
def generate_short_description(bird_name):
    # Fixed pause before every request to stay under the rate limit.
    time.sleep(30)
    # Retry with exponential backoff: base delays of 1, 2, 4, 8, 16 and 32 seconds.
    for delay_secs in (2**x for x in range(0, 6)):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": f"Describe the {bird_name} bird in at most 10 sentences using the specific physical features and do not need to mention the features that are not available in the bird. Also, do not use any numeric in descriptions, instead, use words."}]
            )
            # Print the first sentence as a progress indicator.
            print(response.choices[0].message['content'].partition('.')[0])
            return response.choices[0].message['content']

        except openai.OpenAIError as e:
            # Add up to one second of random jitter so that retries do not collide.
            randomness_collision_avoidance = random.randint(0, 1000) / 1000.0
            sleep_dur = delay_secs + randomness_collision_avoidance
            print(f"Error: {e}. Retrying in {round(sleep_dur, 2)} seconds.")
            time.sleep(sleep_dur)

    # Every attempt failed: return '' so the row still counts as missing.
    return ''
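
The retry logic in `generate_short_description` is exponential backoff with random jitter. The sketch below isolates that pattern so it can be run without an API key. It is an illustration under stated assumptions, not the notebook's code: `retry_with_backoff`, `flaky_call`, and the injectable `sleep` parameter are hypothetical names, and the sketch catches any `Exception`, whereas the notebook catches `openai.OpenAIError`.

```python
import random
import time


def retry_with_backoff(call, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Run `call()` until it succeeds or the attempts run out.

    Returns the result of `call()`, or None if every attempt raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # assumption: any error is worth retrying
            # Delay doubles each attempt; jitter in [0, 1) spreads out retries.
            delay = base_delay * 2 ** attempt + random.random()
            print(f"Attempt {attempt + 1} failed: {exc}. Retrying in {delay:.2f} s.")
            sleep(delay)
    return None


# Hypothetical stand-in for the API call: fails twice, then succeeds.
attempts = {'count': 0}


def flaky_call():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('simulated rate limit')
    return 'ok'


# Record the delays instead of actually sleeping, so the demo runs instantly.
recorded_delays = []
result = retry_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, len(recorded_delays))
```

Passing `sleep` as a parameter is a design choice for testability: real code would keep the default `time.sleep`, while a test can record the delays instead of waiting.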

Test it for 2 classes.

In [18]:
for i in tqdm(range(2)):
    # Assign through .loc; chained indexing (df[col][i] = ...) may write to a temporary copy.
    bird_names_df.loc[i, 'Short_Description'] = generate_short_description(bird_names_df.loc[i, 'Bird_name'])

bird_names_df.head()

  0%|          | 0/2 [00:00<?, ?it/s]

Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.. Retrying in 1.8 seconds.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.. Retrying in 2.84 seconds.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.. Retrying in 4.48 seconds.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.. Retrying in 8.08 seconds.
Error: You exceeded your current quota, please check your plan and billing details. For more info

Generate short descriptions for all classes, skipping the rows that already have one.

In [None]:
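for i in tqdm(range(len(bird_names_df['Bird_name']))):
    # Only rows that are still empty are sent to the API, so the cell can be re-run safely.
    if bird_names_df.loc[i, 'Short_Description'] == '':
        bird_names_df.loc[i, 'Short_Description'] = generate_short_description(bird_names_df.loc[i, 'Bird_name'])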

bird_names_df.head(10)
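
Because the loop only touches rows whose description is still empty, it can be re-run after an interruption without repeating finished requests. A minimal sketch of that resumable pattern follows, with a stub in place of the API call. `fill_missing`, `fake_generate`, and `demo_df` are hypothetical names introduced for illustration, and the sketch assumes that a failed call yields an empty or `None` result.

```python
import pandas as pd


def fill_missing(df, generate, text_col='Short_Description', name_col='Bird_name'):
    """Fill the empty cells of `text_col` in place and return how many were filled."""
    filled = 0
    for row in df.index:
        if df.loc[row, text_col] == '':
            result = generate(df.loc[row, name_col])
            if result:
                # Write through .loc so the original DataFrame is updated.
                df.loc[row, text_col] = result
                filled += 1
            # On failure the cell stays '' and is retried on the next run.
    return filled


# Hypothetical demo data: the first row is already done, the second is missing.
demo_df = pd.DataFrame({
    'Bird_name': ['Laysan Albatross', 'Sooty Albatross'],
    'Short_Description': ['already done', ''],
})


def fake_generate(name):
    # Stub standing in for the API call.
    return f'Description of the {name}.'


n_filled = fill_missing(demo_df, fake_generate)
print(n_filled)
print(demo_df)
```

Running `fill_missing` a second time on the same frame fills nothing, which is the property that makes the full run restartable.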

Save the descriptions to an Excel file.

In [None]:
bird_names_df.to_excel('CUB_description.xlsx', index=False)