# Comparison of link prediction with random-walk-based node embeddings

<table><tr><td>Run the latest release of this notebook:</td><td><a href="https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/link-prediction/homogeneous-comparison-link-prediction.ipynb" alt="Open In Binder" target="_parent"><img src="https://mybinder.org/badge_logo.svg"/></a></td><td><a href="https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/link-prediction/homogeneous-comparison-link-prediction.ipynb" alt="Open In Colab" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg"/></a></td></tr></table>

This notebook was developed as part of a university course on "Social Network Analysis". It compares link prediction performance using embeddings learned by Node2Vec [1], Attri2Vec [2], GraphSAGE [3] and GCN [4], all under the same train/test edge split. The text follows the original StellarGraph demo, which uses the Cora dataset; the cells in this version load a co-authorship graph instead. Node2Vec and Attri2Vec learn embeddings by capturing the similarity of nodes within random walk contexts. GraphSAGE and GCN are trained here in an unsupervised way, by making nodes that co-occur in short random walks lie close together in the embedding space.

The goal is to treat link prediction as a supervised learning problem on top of node representations/embeddings. Once the embeddings are available, a binary classifier can be used to predict whether or not a link exists between two nodes in the graph. Several hyperparameters can affect which link classifier performs best, so this demo builds model selection into the pipeline to choose the best binary operator to apply to a pair of node embeddings.

The process has four main steps:

1. Obtain embeddings for each node
2. For each set of hyperparameters, train a classifier
3. Select the classifier that performs best
4. Evaluate the selected classifier on unseen data to validate its ability to generalise

**References:**

[1] Node2Vec: Scalable Feature Learning for Networks. A. Grover, J. Leskovec. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[2] Attributed Network Embedding via Subspace Discovery. D. Zhang, J. Yin, X. Zhu and C. Zhang. Data Mining and Knowledge Discovery, 2019.

[3] Inductive Representation Learning on Large Graphs. W.L. Hamilton, R. Ying and J. Leskovec. Neural Information Processing Systems (NIPS), 2017.

[4] Graph Convolutional Networks (GCN): Semi-Supervised Classification with Graph Convolutional Networks. Thomas N. Kipf, Max Welling. International Conference on Learning Representations (ICLR), 2017.


In [1]:
# install StellarGraph if running on Google Colab
#import sys
#if 'google.colab' in sys.modules:
#  %pip install -q stellargraph[demos]==1.3.0b

In [2]:
# verify that we're using the correct version of StellarGraph for this notebook
#import stellargraph as sg

#try:
#    sg.utils.validate_notebook_version("1.3.0b")
#except AttributeError:
#    raise ValueError(
#        f"This notebook requires StellarGraph version 1.3.0b, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
#    ) from None

In [55]:
import matplotlib.pyplot as plt
from math import isclose
from sklearn.decomposition import PCA
import os
import networkx as nx
import numpy as np
import pandas as pd
from stellargraph import StellarGraph, datasets
from stellargraph.data import EdgeSplitter
from collections import Counter
import multiprocessing
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split
import stellargraph as sg

%matplotlib inline

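## Loading the dataset

The Cora dataset, used by the original demo, is a homogeneous network in which all nodes are papers and the edges between nodes are citation links, e.g. paper A cites paper B. In this version the Cora loading cell is commented out; the graph is read from `../graphs/coauthorship_graph.xml` and the node features from `features.csv`, indexed by `g_lattes_id`.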

(See [the "Loading from Pandas" demo](../basics/loading-pandas.ipynb) for details on how data can be loaded.)

In [56]:
#dataset = datasets.Cora()
#display(HTML(dataset.description))
#graph, _ = dataset.load(largest_connected_component_only=True, str_node_ids=True)

In [68]:
#import pickle
#import stellargraph as sg

## Load the graph from the file
#with open("meu_grafo.gpickle", "rb") as f:
#    G = pickle.load(f)
    
#graph = sg.StellarGraph.from_networkx(G)
#graph.info()

In [69]:
features_nodes = pd.read_csv('features.csv', index_col='g_lattes_id')
features_nodes.head()

| g_lattes_id | 3d | abastecimento | aberto | abertura | absorção | accounting | aceleradores | acervos | acessibilidade | acondicionamento | ... | ópticos | órbita | órbitas | óssea | ósseo | ósteo | ótica | óticas | óticos | úteis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LattesID_1003657277565622 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LattesID_1161800349977394 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LattesID_1004206862799097 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LattesID_1122362218673413 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LattesID_1010721095594895 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [70]:
G = nx.read_graphml('../graphs/coauthorship_graph.xml', node_type=str)
graph = sg.StellarGraph.from_networkx(G, node_features=features_nodes)

In [71]:
len(G.nodes)

3203

In [72]:
len(G.edges)

7570

In [73]:
print(graph.info())

StellarGraph: Undirected multigraph
 Nodes: 3203, Edges: 7570

 Node types:
  default: [3203]
    Features: float32 vector, length 2218
    Edge types: default-default->default

 Edge types:
    default-default->default: [7570]
        Weights: range=[1, 631], mean=4.58481, std=13.9803
        Features: none


## Construct splits of the input data

We have to split the data carefully to avoid information leakage and to evaluate the algorithms correctly:

* For computing node embeddings, a **Train Graph** (`graph_train`)
* For training classifiers, a classifier **Training Set** (`examples_train`) of positive and negative edges that were not used for computing node embeddings
* For choosing the best classifier, a **Model Selection Test Set** (`examples_model_selection`) of positive and negative edges that were not used for computing node embeddings or training the classifier
* For the final evaluation, with node embeddings learned from the **Train Graph** (`graph_train`), the chosen best classifier is applied to a **Test Set** (`examples_test`) of positive and negative edges that were not used for computing node embeddings, training the classifier or selecting the model

### Test Graph

We begin with the full graph and use the `EdgeSplitter` class to produce:

* Test Graph
* Test set of positive/negative link examples

The Test Graph is the reduced graph we obtain by removing the test set of links from the full graph.
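To make the idea of a reduced graph concrete, here is a minimal sketch of the splitting step on a toy graph. It is not StellarGraph's `EdgeSplitter` implementation: the function name `split_edges`, the uniform sampling and the toy edge list are all assumptions made for illustration.

```python
import random
from itertools import combinations


def split_edges(edges, nodes, p, seed=0):
    """Hold out a fraction p of edges as positives and sample as many non-edges as negatives."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    n_pos = int(len(edge_set) * p)

    # Positive examples: existing edges that are removed from the reduced graph.
    positives = rng.sample(sorted(edge_set, key=sorted), n_pos)
    reduced_edges = edge_set - set(positives)

    # Negative examples: node pairs that are not connected in the full graph.
    non_edges = [
        frozenset(pair)
        for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) not in edge_set
    ]
    negatives = rng.sample(non_edges, n_pos)

    examples = [tuple(sorted(e)) for e in positives + negatives]
    labels = [1] * n_pos + [0] * n_pos
    return reduced_edges, examples, labels


# Hypothetical toy graph: 6 nodes, 10 undirected edges.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 2), (1, 3), (2, 4), (3, 5)]
reduced_edges, examples, labels = split_edges(edges, nodes, p=0.2)
print(len(reduced_edges), examples, labels)
```

The held-out positive edges never appear in `reduced_edges`, which is what prevents the embeddings from seeing the links they are later asked to predict.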

In [74]:
# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)

# Randomly sample a fraction p=0.1 of all positive links, and the same number of negative links, from the graph,
# and obtain the reduced graph graph_test with the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global"
)

print(graph_test.info())

** Sampled 757 positive and 757 negative edges. **
StellarGraph: Undirected multigraph
 Nodes: 3203, Edges: 6813

 Node types:
  default: [3203]
    Features: float32 vector, length 2218
    Edge types: default-default->default

 Edge types:
    default-default->default: [6813]
        Weights: range=[1, 631], mean=4.52987, std=13.8952
        Features: none


### Train Graph

This time, we use the `EdgeSplitter` on the Test Graph, and perform a train/test split on the examples to produce:

* Train Graph
* Training set of link examples
* Set of link examples for model selection

In [75]:
# Do the same process to compute a training subset from within the test graph
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
    p=0.1, method="global"
)
(
    examples_train,
    examples_model_selection,
    labels_train,
    labels_model_selection,
) = train_test_split(examples, labels, train_size=0.75, test_size=0.25)

print(graph_train.info())

** Sampled 681 positive and 681 negative edges. **
StellarGraph: Undirected multigraph
 Nodes: 3203, Edges: 6132

 Node types:
  default: [3203]
    Features: float32 vector, length 2218
    Edge types: default-default->default

 Edge types:
    default-default->default: [6132]
        Weights: range=[1, 391], mean=4.48451, std=12.0058
        Features: none


Below is a summary of the different splits that have been created in this section.

In [76]:
pd.DataFrame(
    [
        (
            "Training Set",
            len(examples_train),
            "Train Graph",
            "Test Graph",
            "Train the Link Classifier",
        ),
        (
            "Model Selection",
            len(examples_model_selection),
            "Train Graph",
            "Test Graph",
            "Select the best Link Classifier model",
        ),
        (
            "Test set",
            len(examples_test),
            "Test Graph",
            "Full Graph",
            "Evaluate the best Link Classifier",
        ),
    ],
    columns=("Split", "Number of Examples", "Hidden from", "Picked from", "Use"),
).set_index("Split")

| Split | Number of Examples | Hidden from | Picked from | Use |
|-------|--------------------|-------------|-------------|-----|
| Training Set | 1021 | Train Graph | Test Graph | Train the Link Classifier |
| Model Selection | 341 | Train Graph | Test Graph | Select the best Link Classifier model |
| Test set | 1514 | Test Graph | Full Graph | Evaluate the best Link Classifier |
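The example counts reported in this section can be reproduced with simple arithmetic from the 7570 edges of the full graph. The sketch below is only a consistency check: the rounding rules (round down when sampling a fraction of edges, round up for the 25% model selection share) are assumptions chosen because they match the printed numbers, and `holdout_size` is a hypothetical helper.

```python
import math


def holdout_size(n_edges, p):
    # Assumption: the number of sampled positive edges is the fraction p, rounded down.
    return int(n_edges * p)


full_edges = 7570
test_pos = holdout_size(full_edges, 0.1)          # positives hidden from the Test Graph
test_graph_edges = full_edges - test_pos
train_pos = holdout_size(test_graph_edges, 0.1)   # positives hidden from the Train Graph
train_graph_edges = test_graph_edges - train_pos

n_test_examples = 2 * test_pos                    # positives plus as many negatives
n_examples = 2 * train_pos
n_model_selection = math.ceil(0.25 * n_examples)  # assumed rounding for test_size=0.25
n_train = n_examples - n_model_selection

print(test_pos, test_graph_edges, train_pos, train_graph_edges)
print(n_train, n_model_selection, n_test_examples)
```

The resulting values agree with the graph summaries and the table: 757 and 681 sampled positives, 6813 and 6132 remaining edges, and 1021, 341 and 1514 examples.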


## Create the random walker

We define a helper function to generate biased random walks from the given graph with fixed random walk parameters:

* `p` - Random walk parameter "p" that defines the unnormalised probability, "1/p", of returning to the source node
* `q` - Random walk parameter "q" that defines the unnormalised probability, "1/q", of moving to a node away from the source node
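As an illustration of how `p` and `q` shape a walk, the sketch below computes the unnormalised weights of the next step in a second-order biased walk, following the scheme described in the Node2Vec paper [1]. The function names and the toy adjacency dictionary are hypothetical, and this is not the code used by `BiasedRandomWalk`.

```python
def transition_weights(graph, previous, current, p, q):
    """Unnormalised weights for the next step of a second-order biased walk."""
    weights = {}
    for candidate in graph[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # return to the node just visited
        elif candidate in graph[previous]:
            weights[candidate] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[candidate] = 1.0 / q  # move further away from the previous node
    return weights


def transition_probabilities(graph, previous, current, p, q):
    """Normalise the weights so that they sum to 1."""
    weights = transition_weights(graph, previous, current, p, q)
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Hypothetical toy graph given as an adjacency dictionary.
toy_graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}

biased = transition_weights(toy_graph, previous="a", current="b", p=2.0, q=0.5)
uniform = transition_probabilities(toy_graph, previous="a", current="b", p=1.0, q=1.0)
print(biased)
print(uniform)
```

With `p = q = 1.0`, the values used in this notebook, every neighbour gets the same weight, so the walk reduces to a uniform random walk.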

In [77]:
from stellargraph.data import BiasedRandomWalk


def create_biased_random_walker(graph, walk_num, walk_length):
    # Parameter settings for "p" and "q":
    p = 1.0
    q = 1.0
    return BiasedRandomWalk(graph, n=walk_num, length=walk_length, p=p, q=q)

## Parameter Settings

We train Node2Vec, Attri2Vec, GraphSAGE and GCN by following the same unsupervised learning procedure: we first generate a set of short random walks from the given graph and then learn node embeddings from batches of `target, context` pairs collected from the random walks. To learn node embeddings, we need to specify the following parameters:

* `dimension` - Dimensionality of node embeddings
* `walk_number` - Number of walks from each node
* `walk_length` - Length of each random walk
* `epochs` - Number of epochs for training the embedding learning model
* `batch_size` - Batch size for training the embedding learning model

We consistently set the node embedding dimension to 128 for all algorithms. However, we use different hidden layers to learn node embeddings for each algorithm, so that each one can show its potential. For the remaining parameters, we use the following values:
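The way a walk becomes training data can be sketched as follows. This is a simplified illustration, not the behaviour of `UnsupervisedSampler`: it assumes the first node of the walk is the target, every later node is a positive context, and one negative context is drawn uniformly at random per positive. The name `walk_to_pairs` and the node identifiers are hypothetical.

```python
import random


def walk_to_pairs(walk, all_nodes, rng):
    """Turn one walk into labelled (target, context) pairs."""
    target = walk[0]
    pairs, labels = [], []
    for context in walk[1:]:
        # Positive pair: the context node co-occurs with the target in the walk.
        pairs.append((target, context))
        labels.append(1)
        # Negative pair: a random node (assumed uniform sampling, may collide by chance).
        pairs.append((target, rng.choice(all_nodes)))
        labels.append(0)
    return pairs, labels


rng = random.Random(42)
walk = ["n0", "n3", "n1", "n4", "n2"]  # a walk of length 5
all_nodes = ["n0", "n1", "n2", "n3", "n4", "n5"]
pairs, labels = walk_to_pairs(walk, all_nodes, rng)
print(len(pairs), labels)
```

Under these assumptions a walk of length 5 yields 4 positive and 4 negative pairs, which is why `walk_number` and `walk_length` directly control the amount of training data.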

|               | Node2Vec | Attri2Vec | GraphSAGE | GCN |
|---------------|----------|-----------|-----------|-----|
| `walk_number` |    20    |     4     |     1     |  1  |
| `walk_length` |     5    |     5     |     5     |  5  |
| `epochs`      |    6     |     6     |     6     |  6  |
| `batch_size`  |    50    |     50    |    50     |  50 |

Since all algorithms share the same `walk_length`, `batch_size` and `epochs`, we set these parameters once here. Note that this run sets `epochs` to 2, whereas the table lists the original value of 6:

In [78]:
walk_length = 5

In [79]:
epochs = 2  # the original setting is 6

In [80]:
batch_size = 50

For each algorithm, users can find the best parameter settings using the `Model Selection` edge set.

## Node2Vec

We use Node2Vec [1] to compute the node embeddings. These embeddings are learned so that nodes that are close in the graph remain close in the embedding space. We train Node2Vec with the Node2Vec components provided by StellarGraph.

In [81]:
from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import Node2VecLinkGenerator, Node2VecNodeGenerator
from stellargraph.layer import Node2Vec, link_classification
from tensorflow import keras


def node2vec_embedding(graph, name):

    # Set the embedding dimension and walk number:
    dimension = 128
    walk_number = 20

    print(f"Training Node2Vec for '{name}':")

    graph_node_list = list(graph.nodes())

    # Create the biased random walker to generate random walks
    walker = create_biased_random_walker(graph, walk_number, walk_length)

    # Create the unsupervised sampler to sample (target, context) pairs from random walks
    unsupervised_samples = UnsupervisedSampler(
        graph, nodes=graph_node_list, walker=walker
    )

    # Define a Node2Vec training generator, which generates batches of training pairs
    generator = Node2VecLinkGenerator(graph, batch_size)

    # Create the Node2Vec model
    node2vec = Node2Vec(dimension, generator=generator)

    # Build the model and expose input and output sockets of Node2Vec, for node pair inputs
    x_inp, x_out = node2vec.in_out_tensors()

    # Use the link_classification function to generate the output of the Node2Vec model
    prediction = link_classification(
        output_dim=1, output_act="sigmoid", edge_embedding_method="dot"
    )(x_out)

    # Stack the Node2Vec encoder and prediction layer into a Keras model, and specify the loss
    model = keras.Model(inputs=x_inp, outputs=prediction)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss=keras.losses.binary_crossentropy,
        metrics=[keras.metrics.binary_accuracy],
    )

    # Train the model
    model.fit(
        generator.flow(unsupervised_samples),
        epochs=epochs,
        verbose=2,
        use_multiprocessing=True,
        workers=-1,
        shuffle=True,
    )

    # Build the model to predict node representations from node ids with the learned Node2Vec model parameters
    x_inp_src = x_inp[0]
    x_out_src = x_out[0]
    embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

    # Get representations for all nodes in ``graph``
    node_gen = Node2VecNodeGenerator(graph, batch_size).flow(graph_node_list)
    node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=0)

    def get_embedding(u):
        u_index = graph_node_list.index(u)
        return node_embeddings[u_index]

    return get_embedding
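Each embedding function in this notebook ends with a `get_embedding` closure that calls `graph_node_list.index(u)`, which scans the node list on every lookup. One possible alternative, sketched here with hypothetical names and toy data, builds a dictionary once so that each lookup takes constant time. It is a suggestion, not part of the original demo.

```python
def make_embedding_lookup(node_ids, embeddings):
    """Return a function mapping a node id to its embedding row."""
    # Build the id -> row index mapping once, instead of calling list.index per lookup.
    index_of = {node_id: i for i, node_id in enumerate(node_ids)}

    def get_embedding(u):
        return embeddings[index_of[u]]

    return get_embedding


# Hypothetical node ids and 2-dimensional embeddings.
toy_ids = ["LattesID_a", "LattesID_b", "LattesID_c"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
lookup = make_embedding_lookup(toy_ids, toy_embeddings)
print(lookup("LattesID_b"))
```

The same helper works when `embeddings` is a NumPy array, since it only relies on integer indexing.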

## Attri2Vec

We use Attri2Vec [2] to compute the node embeddings. Attri2Vec learns node representations by applying a linear or non-linear mapping to the nodes' content features while, at the same time, making nodes that share similar contexts in random walks have similar representations. Because the content features are used to learn the embeddings, we expect Attri2Vec to achieve better link prediction performance than Node2Vec, which preserves only the network structure.

In [82]:
from stellargraph.mapper import Attri2VecLinkGenerator, Attri2VecNodeGenerator
from stellargraph.layer import Attri2Vec


def attri2vec_embedding(graph, name):

    # Set the embedding dimension and walk number:
    dimension = [128]
    walk_number = 4

    print(f"Training Attri2Vec for '{name}':")

    graph_node_list = list(graph.nodes())

    # Create the biased random walker to generate random walks
    walker = create_biased_random_walker(graph, walk_number, walk_length)

    # Create the unsupervised sampler to sample (target, context) pairs from random walks
    unsupervised_samples = UnsupervisedSampler(
        graph, nodes=graph_node_list, walker=walker
    )

    # Define an Attri2Vec training generator, which generates batches of training pairs
    generator = Attri2VecLinkGenerator(graph, batch_size)

    # Create the Attri2Vec model
    attri2vec = Attri2Vec(
        layer_sizes=dimension, generator=generator, bias=False, normalize=None
    )

    # Build the model and expose input and output sockets of Attri2Vec, for node pair inputs
    x_inp, x_out = attri2vec.in_out_tensors()

    # Use the link_classification function to generate the output of the Attri2Vec model
    prediction = link_classification(
        output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
    )(x_out)

    # Stack the Attri2Vec encoder and prediction layer into a Keras model, and specify the loss
    model = keras.Model(inputs=x_inp, outputs=prediction)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss=keras.losses.binary_crossentropy,
        metrics=[keras.metrics.binary_accuracy],
    )

    # Train the model
    model.fit(
        generator.flow(unsupervised_samples),
        epochs=epochs,
        verbose=2,
        use_multiprocessing=False,
        workers=1,
        shuffle=True,
    )

    # Build the model to predict node representations from node features with the learned Attri2Vec model parameters
    x_inp_src = x_inp[0]
    x_out_src = x_out[0]
    embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

    # Get representations for all nodes in ``graph``
    node_gen = Attri2VecNodeGenerator(graph, batch_size).flow(graph_node_list)
    node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=0)

    def get_embedding(u):
        u_index = graph_node_list.index(u)
        return node_embeddings[u_index]

    return get_embedding

## GraphSAGE

GraphSAGE [3] learns node embeddings for attributed graphs by aggregating the features of neighbouring nodes. The aggregation parameters are learned by encouraging pairs of nodes that co-occur in short random walks to have similar representations. Because node features are also used, GraphSAGE is expected to perform better than Node2Vec at link prediction.
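The aggregation idea can be illustrated with a single layer that averages neighbour features. This is a simplified sketch under stated assumptions: it sums the transformed node features and the transformed neighbour mean, whereas library implementations may combine them differently (for example by concatenation), and it uses all neighbours rather than a fixed-size sample. The function name, weights and toy graph are hypothetical.

```python
import numpy as np


def graphsage_mean_layer(features, neighbours, w_self, w_neigh):
    """One GraphSAGE-style layer with a mean aggregator, ReLU and L2 normalisation."""
    outputs = []
    for node, nbrs in enumerate(neighbours):
        if nbrs:
            neigh_mean = features[nbrs].mean(axis=0)  # average the neighbours' features
        else:
            neigh_mean = np.zeros(features.shape[1])  # isolated node: nothing to aggregate
        hidden = features[node] @ w_self + neigh_mean @ w_neigh
        outputs.append(np.maximum(hidden, 0.0))       # ReLU
    outputs = np.array(outputs)
    # L2-normalise each row, in the spirit of normalize="l2" in the notebook's model.
    norms = np.linalg.norm(outputs, axis=1, keepdims=True)
    return outputs / np.where(norms == 0.0, 1.0, norms)


# Hypothetical toy graph: node 0 is linked to nodes 1 and 2.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbours = [[1, 2], [0], [0]]
identity = np.eye(2)
embeddings = graphsage_mean_layer(features, neighbours, identity, identity)
print(embeddings)
```

In the notebook, `num_samples = [10, 5]` plays the role of `neighbours` here: it caps how many neighbours are sampled at each of the two layers.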

In [83]:
from stellargraph.mapper import GraphSAGELinkGenerator, GraphSAGENodeGenerator
from stellargraph.layer import GraphSAGE


def graphsage_embedding(graph, name):

    # Set the embedding dimensions, the numbers of sampled neighboring nodes and walk number:
    dimensions = [128, 128]
    num_samples = [10, 5]
    walk_number = 1

    print(f"Training GraphSAGE for '{name}':")

    graph_node_list = list(graph.nodes())

    # Create the biased random walker to generate random walks
    walker = create_biased_random_walker(graph, walk_number, walk_length)

    # Create the unsupervised sampler to sample (target, context) pairs from random walks
    unsupervised_samples = UnsupervisedSampler(
        graph, nodes=graph_node_list, walker=walker
    )

    # Define a GraphSAGE training generator, which generates batches of training pairs
    generator = GraphSAGELinkGenerator(graph, batch_size, num_samples)

    # Create the GraphSAGE model
    graphsage = GraphSAGE(
        layer_sizes=dimensions,
        generator=generator,
        bias=True,
        dropout=0.0,
        normalize="l2",
    )

    # Build the model and expose input and output sockets of GraphSAGE, for node pair inputs
    x_inp, x_out = graphsage.in_out_tensors()

    # Use the link_classification function to generate the output of the GraphSAGE model
    prediction = link_classification(
        output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
    )(x_out)

    # Stack the GraphSAGE encoder and prediction layer into a Keras model, and specify the loss
    model = keras.Model(inputs=x_inp, outputs=prediction)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss=keras.losses.binary_crossentropy,
        metrics=[keras.metrics.binary_accuracy],
    )

    # Train the model
    model.fit(
        generator.flow(unsupervised_samples),
        epochs=epochs,
        verbose=2,
        use_multiprocessing=False,
        workers=4,
        shuffle=True,
    )

    # Build the model to predict node representations from node features with the learned GraphSAGE model parameters
    x_inp_src = x_inp[0::2]
    x_out_src = x_out[0]
    embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

    # Get representations for all nodes in ``graph``
    node_gen = GraphSAGENodeGenerator(graph, batch_size, num_samples).flow(
        graph_node_list
    )
    node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=0)

    def get_embedding(u):
        u_index = graph_node_list.index(u)
        return node_embeddings[u_index]

    return get_embedding

## GCN

GCN [4] learns node embeddings through graph convolution. Traditionally, GCN relies on node labels as supervision for training. Here, we consider the unsupervised link prediction setting and try to learn informative GCN node embeddings by making nodes that co-occur in short random walks lie close together, as is done when training GraphSAGE.
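For readers unfamiliar with graph convolution, the sketch below builds the symmetrically normalised adjacency matrix with self-loops from the GCN paper [4] and applies one layer to toy features. It treats the graph as unweighted and dense, which is an assumption for illustration; the function names and inputs are hypothetical and this is not how `FullBatchLinkGenerator` is implemented.

```python
import numpy as np


def normalised_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: propagate features over the graph, transform, apply ReLU."""
    return np.maximum(adj_norm @ features @ weights, 0.0)


# Hypothetical path graph 0 - 1 - 2 with 2-dimensional node features.
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj_norm = normalised_adjacency(adj)
hidden = gcn_layer(adj_norm, features, np.eye(2))
print(adj_norm)
print(hidden)
```

Because the propagation matrix covers the whole graph, GCN is trained here with full-batch generators rather than sampled neighbourhoods.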

In [84]:
from stellargraph.mapper import FullBatchLinkGenerator, FullBatchNodeGenerator
from stellargraph.layer import GCN, LinkEmbedding


def gcn_embedding(graph, name):

    # Set the embedding dimensions and walk number:
    dimensions = [128, 128]
    walk_number = 1

    print(f"Training GCN for '{name}':")

    graph_node_list = list(graph.nodes())

    # Create the biased random walker to generate random walks
    walker = create_biased_random_walker(graph, walk_number, walk_length)

    # Create the unsupervised sampler to sample (target, context) pairs from random walks
    unsupervised_samples = UnsupervisedSampler(
        graph, nodes=graph_node_list, walker=walker
    )

    # Define a GCN training generator, which generates the full batch of training pairs
    generator = FullBatchLinkGenerator(graph, method="gcn")

    # Create the GCN model
    gcn = GCN(
        layer_sizes=dimensions,
        activations=["relu", "relu"],
        generator=generator,
        dropout=0.3,
    )

    # Build the model and expose input and output sockets of GCN, for node pair inputs
    x_inp, x_out = gcn.in_out_tensors()

    # Use the dot product of node embeddings to make node pairs co-occurring in short random walks represented closely
    prediction = LinkEmbedding(activation="sigmoid", method="ip")(x_out)
    prediction = keras.layers.Reshape((-1,))(prediction)

    # Stack the GCN encoder and prediction layer into a Keras model, and specify the loss
    model = keras.Model(inputs=x_inp, outputs=prediction)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss=keras.losses.binary_crossentropy,
        metrics=[keras.metrics.binary_accuracy],
    )

    # Train the model
    batches = unsupervised_samples.run(batch_size)
    for epoch in range(epochs):
        print(f"Epoch: {epoch+1}/{epochs}")
        batch_iter = 1
        for batch in batches:
            samples = generator.flow(batch[0], targets=batch[1], use_ilocs=True)[0]
            [loss, accuracy] = model.train_on_batch(x=samples[0], y=samples[1])
            output = (
                f"{batch_iter}/{len(batches)} - loss:"
                + " {:6.4f}".format(loss)
                + " - binary_accuracy:"
                + " {:6.4f}".format(accuracy)
            )
            if batch_iter == len(batches):
                print(output)
            else:
                print(output, end="\r")
            batch_iter = batch_iter + 1

    # Get representations for all nodes in ``graph``
    embedding_model = keras.Model(inputs=x_inp, outputs=x_out)
    node_embeddings = embedding_model.predict(
        generator.flow(list(zip(graph_node_list, graph_node_list)))
    )
    node_embeddings = node_embeddings[0][:, 0, :]

    def get_embedding(u):
        u_index = graph_node_list.index(u)
        return node_embeddings[u_index]

    return get_embedding

## Train and evaluate the link prediction model

There are a few steps involved in using the learned embeddings to perform link prediction:

1. We calculate link/edge embeddings for the positive and negative edge samples by applying a binary operator to the embeddings of the source and target nodes of each sampled edge.
2. Given the embeddings of the positive and negative examples, we train a logistic regression classifier to predict a binary value indicating whether an edge between two nodes should exist or not.
3. We evaluate the performance of the link classifier for each of the 4 operators on the model selection examples (`examples_model_selection`), with node embeddings calculated on the **Train Graph** (`graph_train`), and select the best classifier.
4. The best classifier is then used to calculate scores on the test data with node embeddings trained on the **Train Graph** (`graph_train`).

Below is a set of helper functions that let us repeat these steps for each of the binary operators.

In [85]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler


# 1. link embeddings
def link_examples_to_features(link_examples, transform_node, binary_operator):
    return [
        binary_operator(transform_node(src), transform_node(dst))
        for src, dst in link_examples
    ]


# 2. training classifier
def train_link_prediction_model(
    link_examples, link_labels, get_embedding, binary_operator
):
    clf = link_prediction_classifier()
    link_features = link_examples_to_features(
        link_examples, get_embedding, binary_operator
    )
    clf.fit(link_features, link_labels)
    return clf


def link_prediction_classifier(max_iter=5000):
    lr_clf = LogisticRegressionCV(Cs=10, cv=10, scoring="roc_auc", max_iter=max_iter)
    return Pipeline(steps=[("sc", StandardScaler()), ("clf", lr_clf)])


# 3. and 4. evaluate classifier
def evaluate_link_prediction_model(
    clf, link_examples_test, link_labels_test, get_embedding, binary_operator
):
    link_features_test = link_examples_to_features(
        link_examples_test, get_embedding, binary_operator
    )
    score = evaluate_roc_auc(clf, link_features_test, link_labels_test)
    return score


def evaluate_roc_auc(clf, link_features, link_labels):
    predicted = clf.predict_proba(link_features)

    # check which class corresponds to positive links
    positive_column = list(clf.classes_).index(1)
    return roc_auc_score(link_labels, predicted[:, positive_column])

We consider 4 different operators:

* *Hadamard*
* $L_1$
* $L_2$
* *average*

The paper [1] provides a detailed description of these operators. All operators produce link embeddings with the same dimensionality as the input node embeddings (128 dimensions in our example).
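For reference, the four operators can be written component-wise. The notation is an assumption introduced here: $f(u), f(v) \in \mathbb{R}^d$ are the embeddings of the two end nodes and $i = 1, \dots, d$ indexes their components.

$$
\begin{aligned}
\text{Hadamard:} \quad & [f(u) \odot f(v)]_i = f_i(u)\, f_i(v) \\
L_1: \quad & |f_i(u) - f_i(v)| \\
L_2: \quad & \left(f_i(u) - f_i(v)\right)^2 \\
\text{average:} \quad & \tfrac{1}{2}\left(f_i(u) + f_i(v)\right)
\end{aligned}
$$

Each expression is applied independently to every component, which is why the link embedding keeps the dimensionality $d$ of the node embeddings.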


In [86]:
def operator_hadamard(u, v):
    return u * v


def operator_l1(u, v):
    return np.abs(u - v)


def operator_l2(u, v):
    return (u - v) ** 2


def operator_avg(u, v):
    return (u + v) / 2.0


def run_link_prediction(binary_operator, embedding_train):
    clf = train_link_prediction_model(
        examples_train, labels_train, embedding_train, binary_operator
    )
    score = evaluate_link_prediction_model(
        clf,
        examples_model_selection,
        labels_model_selection,
        embedding_train,
        binary_operator,
    )

    return {
        "classifier": clf,
        "binary_operator": binary_operator,
        "score": score,
    }


binary_operators = [operator_hadamard, operator_l1, operator_l2, operator_avg]

### Train and evaluate the link model with the specified embedding

In [87]:
def train_and_evaluate(embedding, name):

    embedding_train = embedding(graph_train, "Train Graph")

    # Train the link classification model with the learned embedding
    results = [run_link_prediction(op, embedding_train) for op in binary_operators]
    best_result = max(results, key=lambda result: result["score"])
    print(
        f"\nBest result with '{name}' embeddings from '{best_result['binary_operator'].__name__}'"
    )
    display(
        pd.DataFrame(
            [(result["binary_operator"].__name__, result["score"]) for result in results],
            columns=("name", "ROC AUC"),
        ).set_index("name")
    )

    # Evaluate the best model using the test set
    test_score = evaluate_link_prediction_model(
        best_result["classifier"],
        examples_test,
        labels_test,
        embedding_train,
        best_result["binary_operator"],
    )

    return test_score

### Collect the link prediction results for Node2Vec, Attri2Vec, GraphSAGE and GCN

#### Get the link prediction result with Node2Vec

In [89]:
node2vec_result = train_and_evaluate(node2vec_embedding, "Node2Vec")

Training Node2Vec for 'Train Graph':
link_classification: using 'dot' method to combine node embeddings into edge embeddings




Epoch 1/2
9764/9764 - 59s - loss: 0.5519 - binary_accuracy: 0.6704 - 59s/epoch - 6ms/step
Epoch 2/2
9764/9764 - 59s - loss: 0.4444 - binary_accuracy: 0.7562 - 59s/epoch - 6ms/step

Best result with 'Node2Vec' embeddings from 'operator_hadamard'


| name | ROC AUC |
|------|---------|
| operator_hadamard | 0.676633 |
| operator_l1 | 0.623284 |
| operator_l2 | 0.647665 |
| operator_avg | 0.578868 |


#### Get the link prediction result with Attri2Vec

In [91]:
attri2vec_result = train_and_evaluate(attri2vec_embedding, "Attri2Vec")



Training Attri2Vec for 'Train Graph':
link_classification: using 'ip' method to combine node embeddings into edge embeddings
Epoch 1/2
1953/1953 - 10s - loss: 0.7234 - binary_accuracy: 0.5377 - 10s/epoch - 5ms/step
Epoch 2/2
1953/1953 - 10s - loss: 0.7058 - binary_accuracy: 0.5597 - 10s/epoch - 5ms/step

Best result with 'Attri2Vec' embeddings from 'operator_hadamard'


| name | ROC AUC |
|------|---------|
| operator_hadamard | 0.662649 |
| operator_l1 | 0.633733 |
| operator_l2 | 0.646045 |
| operator_avg | 0.638061 |


#### Get the link prediction result with GraphSAGE

In [92]:
graphsage_result = train_and_evaluate(graphsage_embedding, "GraphSAGE")

Training GraphSAGE for 'Train Graph':


link_classification: using 'ip' method to combine node embeddings into edge embeddings
Epoch 1/2
489/489 - 24s - loss: 0.6746 - binary_accuracy: 0.5969 - 24s/epoch - 49ms/step
Epoch 2/2
489/489 - 22s - loss: 0.5786 - binary_accuracy: 0.7240 - 22s/epoch - 45ms/step

Best result with 'GraphSAGE' embeddings from 'operator_l2'


name,ROC AUC
operator_hadamard,0.804228
operator_l1,0.817987
operator_l2,0.818643
operator_avg,0.617905


#### Get the link prediction result with GCN

In [93]:
gcn_result = train_and_evaluate(gcn_embedding, "GCN")

Training GCN for 'Train Graph':
Using GCN (local pooling) filters...




Epoch: 1/2
489/489 - loss: 0.6815 - binary_accuracy: 0.5000
Epoch: 2/2
489/489 - loss: 0.6718 - binary_accuracy: 0.5000

Best result with 'GCN' embeddings from 'operator_l1'


name,ROC AUC
operator_hadamard,0.648027
operator_l1,0.651786
operator_l2,0.636596
operator_avg,0.643286


#### Comparison of Node2Vec, Attri2Vec, GraphSAGE and GCN on the test set

ROC AUC scores on the test set of links for each embedding, using its best-performing operator:

In [94]:
pd.DataFrame(
    [
        ("Node2Vec", node2vec_result),
        ("Attri2Vec", attri2vec_result),
        ("GraphSAGE", graphsage_result),
        ("GCN", gcn_result),
    ],
    columns=("name", "ROC AUC"),
).set_index("name")

name,ROC AUC
Node2Vec,0.681033
Attri2Vec,0.589139
GraphSAGE,0.82534
GCN,0.660423
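
To read the comparison at a glance, the table can be sorted by score. The sketch below rebuilds the table from the four test scores printed above (hard-coded here only so that the snippet runs on its own) and ranks the methods. `ranked` and `best_name` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Test ROC AUC values copied from the output above
results = pd.DataFrame(
    [
        ("Node2Vec", 0.681033),
        ("Attri2Vec", 0.589139),
        ("GraphSAGE", 0.82534),
        ("GCN", 0.660423),
    ],
    columns=("name", "ROC AUC"),
).set_index("name")

# Highest score first
ranked = results.sort_values("ROC AUC", ascending=False)
best_name = ranked.index[0]
print(ranked)
```

In this run GraphSAGE ranks first. A single train/test split with two training epochs gives only a rough indication, so the ordering may change with other seeds or longer training.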


## Conclusion

This example showed how to use the `stellargraph` library to build a link prediction algorithm for homogeneous graphs using unsupervised embeddings learned by Node2Vec [1], Attri2Vec [2], GraphSAGE [3] and GCN [4].

For more detail on the link prediction process, each of these algorithms has its own dedicated demo:

- [Node2Vec](node2vec-link-prediction.ipynb)
- [Attri2Vec](attri2vec-link-prediction.ipynb)
- [GraphSAGE](graphsage-link-prediction.ipynb)
- [GCN](gcn-link-prediction.ipynb)


### Creating node features

In [11]:
import pandas as pd
areas = pd.read_csv('../data/processed/aplicacoes/areas.csv')
areas.head()

Unnamed: 0,LattesID,SEQUENCIA-AREA-DE-ATUACAO,NOME-GRANDE-AREA-DO-CONHECIMENTO,NOME-DA-AREA-DO-CONHECIMENTO,NOME-DA-SUB-AREA-DO-CONHECIMENTO,NOME-DA-ESPECIALIDADE
0,565598534943,5.0,CIENCIAS_EXATAS_E_DA_TERRA,Ciência da Computação,Metodologia e Técnicas da Computação,Linguagens de Programação
1,565598534943,4.0,CIENCIAS_EXATAS_E_DA_TERRA,Ciência da Computação,Processamento de Imagens,Visão Robótica
2,565598534943,3.0,CIENCIAS_EXATAS_E_DA_TERRA,Ciência da Computação,Processamento de Imagens,Visão Computacional Aplicada
3,565598534943,2.0,CIENCIAS_EXATAS_E_DA_TERRA,Ciência da Computação,Metodologia e Técnicas da Computação,Sistemas de Informação
4,565598534943,1.0,CIENCIAS_EXATAS_E_DA_TERRA,Ciência da Computação,Metodologia e Técnicas da Computação,Engenharia de Software


In [12]:
gerais = pd.read_csv('../data/processed/aplicacoes/gerais.csv')

# Build the node identifier used in the graph: 'LattesID_' followed by the numeric ID
gerais['g_lattes_id'] = 'LattesID_' + gerais['LattesID'].astype(str)

gerais.head()

Unnamed: 0,LattesID,NOME-COMPLETO,DATA-ATUALIZACAO,HORA-ATUALIZACAO,CIDADE-NASCIMENTO,UF-NASCIMENTO,PAIS-DE-NASCIMENTO,NACIONALIDADE,DATA-DE-FALECIMENTO,g_lattes_id
0,565598534943,Sdnei de Brito Alves,25012007,120204,Itajubá,BA,Brasil,B,,LattesID_565598534943
1,601083852823,Alexandre Loureiros Rodrigues,12072021,204404,Vitória,ES,Brasil,B,,LattesID_601083852823
2,5349558315095,Juliano Manabu Iyoda,24092021,105006,Recife,PE,Brasil,B,,LattesID_5349558315095
3,10858860721392,Hugo Bastos de Paula,2032021,83521,Belo Horizonte,MG,Brasil,B,,LattesID_10858860721392
4,11303079806761,Gerald Jean Francis Banon,3062014,113408,Paris,,França,B,,LattesID_11303079806761


In [13]:
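# Note: gerais_set holds prefixed string IDs ('LattesID_...') while areas_set holds
# the raw numeric IDs, so the two sets share no elements and the differences
# below are simply the sizes of each set.
gerais_set = set(gerais['g_lattes_id'])
areas_set = set(areas['LattesID'])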

In [14]:
len(gerais_set - areas_set)

3992

In [15]:
len(areas_set - gerais_set)

3916
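
In the two cells above, one set contains prefixed strings (`g_lattes_id`) and the other raw numeric `LattesID` values, so the two sets cannot share any element. A minimal sketch of a like-for-like comparison, assuming both tables are first mapped to the same `LattesID_<number>` format, is shown below. The two small DataFrames and the helper `to_graph_id` are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for the 'gerais' and 'areas' tables (hypothetical values)
gerais_toy = pd.DataFrame({"LattesID": [1, 2, 3]})
areas_toy = pd.DataFrame({"LattesID": [2, 3, 3, 4]})


def to_graph_id(series):
    # Map numeric IDs to the 'LattesID_<number>' format used for graph nodes
    return "LattesID_" + series.astype(str)


gerais_ids = set(to_graph_id(gerais_toy["LattesID"]))
areas_ids = set(to_graph_id(areas_toy["LattesID"]))

# Both sets now use the same key format, so the differences are meaningful
only_in_gerais = gerais_ids - areas_ids
only_in_areas = areas_ids - gerais_ids
```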

In [16]:
import networkx as nx
G = nx.read_graphml('../graphs/coauthorship_graph.xml', node_type=str)

In [17]:
# Same node identifier format as in 'gerais'
areas['g_lattes_id'] = 'LattesID_' + areas['LattesID'].astype(str)

In [18]:
# Keep a copy of the full table before narrowing it to two columns
areas_completo = areas.copy()

In [19]:
areas.shape

(16286, 7)

In [20]:
areas = areas[['NOME-DA-ESPECIALIDADE', 'g_lattes_id']]

In [21]:
areas

Unnamed: 0,NOME-DA-ESPECIALIDADE,g_lattes_id
0,Linguagens de Programação,LattesID_565598534943
1,Visão Robótica,LattesID_565598534943
2,Visão Computacional Aplicada,LattesID_565598534943
3,Sistemas de Informação,LattesID_565598534943
4,Engenharia de Software,LattesID_565598534943
...,...,...
16281,Mídia Esportiva,LattesID_9998824647536109
16282,Estados Emocionais,LattesID_9998824647536109
16283,,LattesID_9998824647536109
16284,,LattesID_9999217523842385


In [22]:
areas = areas.dropna()

In [23]:
areas.shape

(8236, 2)

In [24]:
areas.head()

Unnamed: 0,NOME-DA-ESPECIALIDADE,g_lattes_id
0,Linguagens de Programação,LattesID_565598534943
1,Visão Robótica,LattesID_565598534943
2,Visão Computacional Aplicada,LattesID_565598534943
3,Sistemas de Informação,LattesID_565598534943
4,Engenharia de Software,LattesID_565598534943


#### Checking that the nodes in the table are actually in the graph

In [301]:
a = set(areas['g_lattes_id'])

In [262]:
b = set(G.nodes())

In [263]:
len(a - b)  # nodes that are in the table but not in the graph

539

In [264]:
len(b - a)  # nodes that are in the graph but not in the table

785

In [223]:
df = pd.read_csv('../data/processed/aplicacoes/coauthorship_weighted.csv')

In [226]:
d = set(df['author1'].tolist() + df['author2'].tolist())

In [227]:
len(b - d)

0

In [228]:
len(d - b)

0

In [229]:
len(a - d)

539

In [230]:
len(d - a)

785

In [243]:
len(gerais_set - d)

789

In [244]:
len(d - gerais_set)

0

In [245]:
len(b - gerais_set)

0

In [246]:
len(gerais_set - b)

789

In [239]:
len(d)

3203
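
The pairwise checks above (`a - b`, `b - a`, `a - d`, and so on) can be gathered into one report. The helper below is a hypothetical convenience, not part of the original notebook. It counts, for every ordered pair of named sets, how many elements of the first are missing from the second. The toy sets stand in for the real ID sets.

```python
def coverage_report(named_sets):
    # For each ordered pair (left, right), count the elements of left absent from right
    report = {}
    for left_name, left in named_sets.items():
        for right_name, right in named_sets.items():
            if left_name != right_name:
                report[(left_name, right_name)] = len(left - right)
    return report


# Toy ID sets (hypothetical values)
toy_sets = {
    "table": {"a", "b", "c"},
    "graph": {"b", "c", "d", "e"},
}

report = coverage_report(toy_sets)
for (left_name, right_name), missing in report.items():
    print(f"in {left_name} but not in {right_name}: {missing}")
```

With the notebook's own sets, a call such as `coverage_report({"areas": a, "graph": b, "edges": d, "gerais": gerais_set})` would list all the counts computed cell by cell above.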

#### Applying TF-IDF

In [25]:
import pandas as pd

# Assumes the DataFrame is named 'areas'
# Group by 'g_lattes_id' and join each researcher's specialties into a single string
areas_tdidf = areas.groupby('g_lattes_id')['NOME-DA-ESPECIALIDADE'].apply(lambda x: ' '.join(x.dropna())).reset_index()

# Rename the columns to something clearer
areas_tdidf.columns = ['g_lattes_id', 'especialidades']

# Show the new DataFrame
areas_tdidf.head()

Unnamed: 0,g_lattes_id,especialidades
0,LattesID_1003657277565622,Relação Imagem e Som Trilha Sonora Sonoplastia...
1,LattesID_1004206862799097,Petrologia Geotectônica Geologia Estrutural
2,LattesID_1010721095594895,Materiais e Componentes de Construção
3,LattesID_1013393322957956,Representação da Informação Técnicas de Recupe...
4,LattesID_1013624109317787,Fisiologia de Plantas Cultivadas Grandes Cultu...


In [26]:
areas_tdidf.shape

(2957, 2)

In [27]:
areas_tdidf['g_lattes_id'].unique().shape

(2957,)

In [28]:
areas_tdidf.isnull().sum()

g_lattes_id       0
especialidades    0
dtype: int64

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
tfidf = TfidfVectorizer()

In [31]:
tfidf_matriz = tfidf.fit_transform(areas_tdidf['especialidades'])

In [32]:
palavras = tfidf.get_feature_names_out()

In [33]:
tfidf_dataframe = pd.DataFrame(tfidf_matriz.toarray(), columns=palavras)
tfidf_dataframe.head()

Unnamed: 0,3d,abastecimento,aberto,abertura,absorção,accounting,aceleradores,acervos,acessibilidade,acondicionamento,...,ópticos,órbita,órbitas,óssea,ósseo,ósteo,ótica,óticas,óticos,úteis
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
tfidf_dataframe.shape

(2957, 2218)
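
To make the numbers in this matrix less opaque, here is a small standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as its documentation describes them: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three documents are toy strings. Tokenization here is a plain whitespace split, which is an assumption and differs from scikit-learn's default tokenizer (that one drops single-character tokens such as "e").

```python
import math
from collections import Counter

# Toy "especialidades" strings, one per researcher (hypothetical values)
docs = [
    "Engenharia de Software Sistemas de Informação",
    "Visão Computacional Aplicada",
    "Engenharia de Software",
]

# Simplified tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}


def tfidf_row(tokens):
    # Term count times idf, then scale the row to unit L2 norm
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
```

Terms that appear in few documents get a larger weight than terms shared by many, which is why most entries in the 2218-column matrix are zero and the few non-zero ones mark a researcher's distinctive specialties.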

In [35]:
areas_tdidf = pd.concat([areas_tdidf, tfidf_dataframe], axis=1)

In [36]:
areas_tdidf.head()

Unnamed: 0,g_lattes_id,especialidades,3d,abastecimento,aberto,abertura,absorção,accounting,aceleradores,acervos,...,ópticos,órbita,órbitas,óssea,ósseo,ósteo,ótica,óticas,óticos,úteis
0,LattesID_1003657277565622,Relação Imagem e Som Trilha Sonora Sonoplastia...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,LattesID_1004206862799097,Petrologia Geotectônica Geologia Estrutural,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,LattesID_1010721095594895,Materiais e Componentes de Construção,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,LattesID_1013393322957956,Representação da Informação Técnicas de Recupe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,LattesID_1013624109317787,Fisiologia de Plantas Cultivadas Grandes Cultu...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
areas_tdidf.shape

(2957, 2220)

In [38]:
#areas_tdidf = pd.merge(areas_completo, areas_tdidf, on='g_lattes_id', how='left')

In [39]:
#areas_tdidf = areas_tdidf.drop(['LattesID', 'SEQUENCIA-AREA-DE-ATUACAO', 'NOME-GRANDE-AREA-DO-CONHECIMENTO',
 #                  'NOME-DA-AREA-DO-CONHECIMENTO', 'NOME-DA-SUB-AREA-DO-CONHECIMENTO',
  #                'NOME-DA-ESPECIALIDADE'], axis=1)

In [40]:
#areas_tdidf.isnull().sum()

In [41]:
#areas_tdidf = areas_tdidf.fillna(-1.0)

In [42]:
#areas_tdidf.isnull().sum()

In [43]:
#areas_tdidf.shape

In [44]:
#areas_tdidf.describe()

In [45]:
#areas_tdidf.head()

In [46]:
pesq_grafo = pd.DataFrame(list(G.nodes), columns=['g_lattes_id'])

In [47]:
pesq_tdidf = pd.merge(pesq_grafo, areas_tdidf, on='g_lattes_id', how='left')

In [48]:
pesq_tdidf

Unnamed: 0,g_lattes_id,especialidades,3d,abastecimento,aberto,abertura,absorção,accounting,aceleradores,acervos,...,ópticos,órbita,órbitas,óssea,ósseo,ósteo,ótica,óticas,óticos,úteis
0,LattesID_1003657277565622,Relação Imagem e Som Trilha Sonora Sonoplastia...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,LattesID_1161800349977394,Análise de Tensões Mecânica dos Corpos Sólidos...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,LattesID_1004206862799097,Petrologia Geotectônica Geologia Estrutural,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,LattesID_1122362218673413,Proveniência Sedimentar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,LattesID_1010721095594895,Materiais e Componentes de Construção,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3198,LattesID_965773984673376,Eletrônica Industrial Controle de Processos El...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3199,LattesID_3227995737489876,,,,,,,,,,...,,,,,,,,,,
3200,LattesID_98763826166873,,,,,,,,,,...,,,,,,,,,,
3201,LattesID_988994019537246,Análise de Algoritmos e Complexidade de Comput...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
pesq_tdidf.isnull().sum()

g_lattes_id         0
especialidades    785
3d                785
abastecimento     785
aberto            785
                 ... 
ósteo             785
ótica             785
óticas            785
óticos            785
úteis             785
Length: 2220, dtype: int64

In [50]:
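# Graph nodes with no entry in the areas table (785 of them) get -1.0
# in every TF-IDF column as a sentinel for "no specialty information".
pesq_tdidf = pesq_tdidf.fillna(-1.0)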

In [51]:
pesq_tdidf = pesq_tdidf.drop(['especialidades'], axis=1)
pesq_tdidf.set_index('g_lattes_id', inplace=True)

In [52]:
pesq_tdidf.shape

(3203, 2218)
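
The notebook marks nodes without specialty data by filling every feature with `-1.0`. One possible alternative, sketched here on toy data and not taken from the original analysis, is to fill the missing TF-IDF values with `0.0` and record the absence in a separate indicator column, so that the feature columns keep their usual non-negative range. The column name `has_features` and the toy frames are hypothetical. `indicator=True` is a standard option of `pd.merge`.

```python
import pandas as pd

# Toy graph nodes and toy TF-IDF features (hypothetical values)
nodes = pd.DataFrame({"g_lattes_id": ["n1", "n2", "n3"]})
feats = pd.DataFrame(
    {
        "g_lattes_id": ["n1", "n3"],
        "software": [0.7, 0.0],
        "visão": [0.0, 1.0],
    }
)

# Left merge keeps every graph node; '_merge' records whether features were found
merged = pd.merge(nodes, feats, on="g_lattes_id", how="left", indicator=True)

# 1.0 when the node had a row in the feature table, 0.0 otherwise
merged["has_features"] = (merged["_merge"] == "both").astype(float)

# Drop the helper column, fill the missing TF-IDF values with 0.0, index by node ID
merged = merged.drop(columns="_merge").fillna(0.0).set_index("g_lattes_id")
```

Whether this works better than the `-1.0` sentinel depends on the downstream model, so it is best treated as something to compare experimentally.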

In [53]:
pesq_tdidf

g_lattes_id,3d,abastecimento,aberto,abertura,absorção,accounting,aceleradores,acervos,acessibilidade,acondicionamento,...,ópticos,órbita,órbitas,óssea,ósseo,ósteo,ótica,óticas,óticos,úteis
LattesID_1003657277565622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LattesID_1161800349977394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LattesID_1004206862799097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LattesID_1122362218673413,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LattesID_1010721095594895,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
LattesID_965773984673376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LattesID_3227995737489876,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
LattesID_98763826166873,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
LattesID_988994019537246,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
pesq_tdidf.to_csv('features.csv')
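
When the saved file is read back later, the node ID column has to be restored as the index, and the rows may need to be put in the same order as the graph's nodes. The sketch below illustrates this round trip with an in-memory buffer instead of `features.csv`. The toy frame, the `node_order` list and the variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy feature table indexed by node ID (hypothetical values)
features = pd.DataFrame(
    {"software": [0.7, -1.0]},
    index=pd.Index(["LattesID_1", "LattesID_2"], name="g_lattes_id"),
)

# Write to an in-memory buffer; to_csv('features.csv') behaves the same way on disk
buffer = io.StringIO()
features.to_csv(buffer)
buffer.seek(0)

# index_col restores the node ID as the index instead of a plain column
loaded = pd.read_csv(buffer, index_col="g_lattes_id")

# Align the rows with a given node order, e.g. list(G.nodes) in the notebook
node_order = ["LattesID_2", "LattesID_1"]
aligned = loaded.reindex(node_order)
```

`reindex` inserts rows of missing values for any node absent from the file, so checking `aligned.isnull().any().any()` afterwards is a cheap way to confirm that every node has features.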