# Overview


For this analysis, we focus on the "aplicacoes" scope and use supervision data (orientações), where each record represents a supervisee. The two main CSV files used are:

- gerais.csv: Contains general data for all researchers.

- orientacoes.csv: Contains supervision data with information about both supervisees and supervisors.

In [81]:
import pandas as pd
import unidecode
from typing import Optional
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
from itertools import combinations

In [82]:
DATA_DIR = '../data/'

In [83]:
# function to read the Lattes dataset
def read_dataset(data_dir: str, scope: str, dataset: str) -> pd.DataFrame:
    """Read `{data_dir}processed/{scope}/{dataset}.csv` into a DataFrame."""
    data_path = f'{data_dir}processed/{scope}/{dataset}.csv'
    return pd.read_csv(data_path)

In [84]:
# function to standardize any string: lowercase it and strip diacritics
def standardize_string(x: str) -> str:
    return unidecode.unidecode(x.lower())

# Data Sources and Structure

### gerais.csv

Content: General data for all researchers (both supervisees and supervisors).

Key Columns:

- LattesID: Unique identifier for each researcher.

- NOME-COMPLETO: Full name of the researcher.

### orientacoes.csv

Content: Supervision records where each row corresponds to a specific supervisee.

Key Columns:

- LattesID: Identifier for the supervisee.

- NomeDoOrientador: Name of the supervisor.

- NumeroIdOrientado: Identifier for the supervisor.

Important: Not all supervisors may be present in the gerais.csv dataset because the filtering was applied only to supervisees. This means some supervisors might be missing from the overall researcher dataset.

# Data Preprocessing and Methodology

## Initial Considerations
- Filtering by Supervision Status:

Only supervision records with a status of CONCLUIDA (completed) are used. This choice excludes ongoing supervision relationships, which might affect the network analysis if such relationships are relevant.

- Unique Identifier Selection:

The Lattes ID is used as the unique identifier for both supervisees and supervisors. This is crucial for creating a reliable source-target relationship table for network analysis.

- Data Representation:

The supervision dataset (orientacoes.csv) is structured so that each row represents a supervisee, not a supervisor. Therefore, caution is needed when linking supervisors, as some might not be included in the gerais.csv dataset.

## Data Cleaning and Transformation

- Standardizing Strings:
All names are converted to lowercase and stripped of diacritics to ensure consistency during merging and matching.

- Handling Missing Values:

For supervisors missing their Lattes ID, a placeholder value (-100000) is used.

Missing supervisor names are filled with a default text (orientador com nome nao preenchido).

- Column Renaming:
Columns are renamed to clearly differentiate between supervisee and supervisor IDs (e.g., renaming LattesID to LattesID_Orientando for supervisees).
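
The cleaning steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's exact code: the sample rows are invented, and `strip_accents` is a hypothetical standard-library stand-in for `unidecode` (it only removes combining marks, whereas `unidecode` transliterates more broadly).

```python
import unicodedata
import pandas as pd

NULL_ID = -100000  # placeholder used in this notebook for a missing supervisor ID
NULL_NAME = 'orientador com nome nao preenchido'

def strip_accents(text: str) -> str:
    """Hypothetical stand-in for unidecode: lowercase and drop combining marks."""
    decomposed = unicodedata.normalize('NFKD', text.lower())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Invented sample rows, for illustration only
toy = pd.DataFrame({
    'LattesID': [1, 2],
    'NomeDoOrientador': ['João Araújo', None],
    'NumeroIdOrientado': [10.0, None],
})

# Rename so supervisee and supervisor IDs are clearly distinguished
toy = toy.rename(columns={'LattesID': 'LattesID_Orientando',
                          'NumeroIdOrientado': 'LattesID_Orientador'})
# Fill missing names with the default text, then standardize
toy['NomeDoOrientador'] = toy['NomeDoOrientador'].fillna(NULL_NAME).apply(strip_accents)
# Fill missing supervisor IDs with the placeholder and convert to integer
toy['LattesID_Orientador'] = toy['LattesID_Orientador'].fillna(NULL_ID).astype('int64')
```

Filling the name before standardizing matters here, because the standardizing function expects a string and would fail on a missing value.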

## Loading and processing the "orientacoes" dataframe

In [85]:
scope = 'aplicacoes' #current scope

# dataframe of supervision
df_orien = read_dataset(data_dir=DATA_DIR, scope=scope, dataset='orientacoes')

In [86]:
df_orien.head(2)

Unnamed: 0,LattesID,NATUREZA,STATUS,ANO,NomeDoOrientador,CODIGO-INSTITUICAO,NOME-INSTITUICAO,CODIGO-CURSO,FLAG-BOLSA,CODIGO-AGENCIA-FINANCIADORA,NOME-DA-AGENCIA,TITULO,NumeroIdOrientado,NOME-CURSO,NomeGrandeAreaDoConhecimento,NomeDaAreaDoConhecimento,NomeDaSubAreaDoConhecimento,TIPO-DE-ORIENTACAO-CONCLUIDA,TIPO-DE-ORIENTACAO
0,11303079806761,Dissertação de mestrado,CONCLUIDA,1995.0,JOAO CAMARGO NETO,8700000009,Instituto Nacional de Pesquisas Espaciais,,NAO,,,Utilizacao da Morfologia Matematica Na Analise...,,,CIENCIAS_EXATAS_E_DA_TERRA,Ciência da Computação,Metodologia e Técnicas da Computação,,ORIENTADOR_PRINCIPAL
1,11303079806761,Dissertação de mestrado,CONCLUIDA,1992.0,AILTON CRUZ DOS SANTOS,8700000009,Instituto Nacional de Pesquisas Espaciais,,NAO,,,Simulacao de Imagens de Sensores Com Largo Cam...,8226326000000000.0,,,,,,ORIENTADOR_PRINCIPAL


In [87]:
# shape
df_orien.shape

(239652, 19)

In [88]:
# select only completed supervisions
df_orien = df_orien[df_orien['STATUS'] == 'CONCLUIDA']

In [89]:
df_orien.shape

(222315, 19)

In [90]:
df_orien.duplicated(subset=['LattesID', 'NATUREZA', 'NomeDoOrientador']).sum()

13547

In [91]:
# keep only the columns needed; .copy() avoids SettingWithCopyWarning on later assignments
df_orien = df_orien[['LattesID', 'NomeDoOrientador', 'NumeroIdOrientado']].copy()

In [92]:
# adjusting column names
df_orien = df_orien.rename({'LattesID': 'LattesID_Orientando',
                            'NumeroIdOrientado': 'LattesID_Orientador'}, axis=1)

In [93]:
# null data
df_orien.isnull().sum()

LattesID_Orientando         0
NomeDoOrientador           11
LattesID_Orientador    183973
dtype: int64

The last row of the output shows that 183,973 records have no supervisor Lattes ID registered, only a name (and 11 records lack even the name). Later, we will try to recover these missing Lattes IDs by matching on the supervisor's name.

In [94]:
# fillna of NomeDoOrientador
df_orien['NomeDoOrientador'] = df_orien['NomeDoOrientador'].fillna('orientador com nome nao preenchido')

In [95]:
# standardize_string on 'NomeDoOrientador'
df_orien['NomeDoOrientador'] = df_orien['NomeDoOrientador'].apply(standardize_string)

In [96]:
# null variable
null_number = -100000

In [97]:
# Fill missing supervisor IDs with the placeholder, then convert to int
df_orien['LattesID_Orientador'] = df_orien['LattesID_Orientador'].fillna(null_number)
df_orien['LattesID_Orientador'] = df_orien['LattesID_Orientador'].astype('int')
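
One caveat with the cell above: because the column contains missing values, pandas reads it as `float64`, and 16-digit IDs above 2^53 cannot all be represented exactly in that type. A possible alternative, sketched here on invented data, is to read the column with pandas' nullable `Int64` dtype so that IDs stay exact. Whether rounding actually affects the real dataset has not been checked here, and the ID below is hypothetical.

```python
import io
import pandas as pd

# Invented CSV with one missing supervisor ID; the ID value is hypothetical
csv_text = "LattesID,NumeroIdOrientado\n1,9973453169372877\n2,\n"

# Default parsing: the missing value forces float64, which cannot hold this odd ID
as_float = pd.read_csv(io.StringIO(csv_text))
rounded = int(as_float.loc[0, 'NumeroIdOrientado'])

# Nullable integer dtype keeps the ID exact and represents the gap as <NA>
as_int = pd.read_csv(io.StringIO(csv_text), dtype={'NumeroIdOrientado': 'Int64'})
exact = int(as_int.loc[0, 'NumeroIdOrientado'])

print(rounded == 9973453169372877, exact == 9973453169372877)
```

With `Int64`, the placeholder step becomes optional, since missing IDs can be filtered with `isna()` directly.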

In [98]:
# null data
df_orien.isnull().sum()

LattesID_Orientando    0
NomeDoOrientador       0
LattesID_Orientador    0
dtype: int64

In [99]:
df_orien.dtypes

LattesID_Orientando     int64
NomeDoOrientador       object
LattesID_Orientador     int64
dtype: object

## Loading and processing the "geral" dataframe

In [100]:
# dataframe of general data
df_geral = read_dataset(data_dir=DATA_DIR, scope=scope, dataset='gerais')[['LattesID', 'NOME-COMPLETO']]

In [101]:
df_geral.shape

(3992, 2)

In [102]:
df_geral.head()

Unnamed: 0,LattesID,NOME-COMPLETO
0,565598534943,Sdnei de Brito Alves
1,601083852823,Alexandre Loureiros Rodrigues
2,5349558315095,Juliano Manabu Iyoda
3,10858860721392,Hugo Bastos de Paula
4,11303079806761,Gerald Jean Francis Banon


In [103]:
df_geral = df_geral.rename({'NOME-COMPLETO': 'NomeDoOrientando',
                           'LattesID': 'LattesID_Orientando'}, axis=1)

In [104]:
# This dataframe will be used to get the name of the students
df_geral.head()

Unnamed: 0,LattesID_Orientando,NomeDoOrientando
0,565598534943,Sdnei de Brito Alves
1,601083852823,Alexandre Loureiros Rodrigues
2,5349558315095,Juliano Manabu Iyoda
3,10858860721392,Hugo Bastos de Paula
4,11303079806761,Gerald Jean Francis Banon


In [105]:
# standardize_string 'NomeDoOrientando'
df_geral['NomeDoOrientando'] = df_geral['NomeDoOrientando'].apply(standardize_string)

In [106]:
df_geral.head()

Unnamed: 0,LattesID_Orientando,NomeDoOrientando
0,565598534943,sdnei de brito alves
1,601083852823,alexandre loureiros rodrigues
2,5349558315095,juliano manabu iyoda
3,10858860721392,hugo bastos de paula
4,11303079806761,gerald jean francis banon


In [107]:
df_geral.dtypes

LattesID_Orientando     int64
NomeDoOrientando       object
dtype: object

# Data Merging and Approaches

## Merge Process

1. Merge with gerais.csv (Supervisee Data):

The orientacoes dataset is merged with gerais.csv on the supervisee's Lattes ID to obtain standardized supervisee names.

2. First Approach – Direct ID Matching:

- Objective: Use available Lattes IDs for both supervisors and supervisees.

- Method: Remove rows with missing supervisor IDs and drop unnecessary name columns.

- Outcome: A straightforward merging of data where both IDs are present.

3. Second Approach – Recovering Missing Supervisor IDs:

- Objective: Retrieve missing supervisor Lattes IDs by matching on supervisor names.

- Method:

  - A concatenated geral dataset is created from the three scopes (aplicacoes, restritivo, abrangente).

  - The supervisor’s name from the orientacoes dataset is matched with NOME-COMPLETO in the geral dataset.

- Challenges:

  - Homonym Issue: Supervisors who share the same name can be misidentified. In the current dataset, only two homonymous names were detected, but this number might grow with larger or more varied datasets.

  - Data Consistency: Using additional attributes (e.g., university affiliation) could improve disambiguation; however, such information may be outdated or inconsistent.

4. Final Dataset Preparation:

- The two approaches are concatenated to form a comprehensive dataset.

- Duplicates are removed, and self-referential links (where the supervisee and supervisor are the same) are filtered out.
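
The four steps above can be sketched end to end on toy data. The rows and the helper name `build_edges` are invented for illustration and are not the notebook's own code.

```python
import pandas as pd

NULL_ID = -100000  # placeholder for a missing supervisor ID, as in this notebook

def build_edges(merged: pd.DataFrame, names: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper combining both approaches into one edge table."""
    cols = ['LattesID_Orientando', 'LattesID_Orientador']
    # First approach: keep rows whose supervisor ID is already known
    direct = merged.loc[merged['LattesID_Orientador'] != NULL_ID, cols]
    # Second approach: recover IDs by matching standardized supervisor names
    by_name = (merged.drop(columns='LattesID_Orientador')
                     .merge(names, on='NomeDoOrientador', how='inner')[cols])
    # Final preparation: concatenate, drop duplicates, drop self-referential links
    edges = pd.concat([direct, by_name], ignore_index=True).drop_duplicates()
    not_self = edges['LattesID_Orientando'] != edges['LattesID_Orientador']
    return edges[not_self].reset_index(drop=True)

# Invented rows: supervisee ID, supervisor ID (or placeholder), supervisor name
merged = pd.DataFrame({
    'LattesID_Orientando': [1, 1, 2, 3],
    'LattesID_Orientador': [10, NULL_ID, NULL_ID, 3],
    'NomeDoOrientador': ['ana', 'bia', 'ana', 'caio'],
})
names = pd.DataFrame({'LattesID_Orientador': [10, 20],
                      'NomeDoOrientador': ['ana', 'bia']})

edges = build_edges(merged, names)
print(edges)
```

Concatenating only the two typed tables, rather than starting from an empty DataFrame, also keeps the ID columns as integers instead of `object`.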

## Merging and processing

In [108]:
# merge both dataframes on 'LattesID_Orientando'
df_merged = df_orien.merge(df_geral, on='LattesID_Orientando')

In [109]:
# ordering columns for better interpretation
df_merged = df_merged[['LattesID_Orientando', 'NomeDoOrientando', 'LattesID_Orientador', 'NomeDoOrientador']]

In [110]:
df_merged.head(3)

Unnamed: 0,LattesID_Orientando,NomeDoOrientando,LattesID_Orientador,NomeDoOrientador
0,11303079806761,gerald jean francis banon,-100000,joao camargo neto
1,11303079806761,gerald jean francis banon,8226325993016336,ailton cruz dos santos
2,11303079806761,gerald jean francis banon,-100000,ana lucia bezerra candeias


In [111]:
# this is going to be our final dataframe
df_final = pd.DataFrame({}, columns=['LattesID_Orientando', 'LattesID_Orientador'])
df_final

Unnamed: 0,LattesID_Orientando,LattesID_Orientador


### First Approach – Direct ID Matching

In [112]:
first_mask = df_merged.copy()

# remove names
first_mask.drop(['NomeDoOrientando','NomeDoOrientador'], axis=1, inplace=True)

In [113]:
# turn the placeholder back into NaN
first_mask['LattesID_Orientador'] = first_mask['LattesID_Orientador'].replace(null_number, np.nan)

In [114]:
# drop NaN values
first_mask.dropna(inplace=True)

In [115]:
first_mask.dtypes

LattesID_Orientando      int64
LattesID_Orientador    float64
dtype: object

In [116]:
# convert to int
first_mask['LattesID_Orientador'] = first_mask['LattesID_Orientador'].astype(int)

In [117]:
first_mask.dtypes

LattesID_Orientando    int64
LattesID_Orientador    int64
dtype: object

In [118]:
# concat on final dataframe
df_final = pd.concat([df_final, first_mask], axis=0, ignore_index=True)

In [119]:
len(df_final)

38342

### Second Approach – Recovering Missing Supervisor IDs

In [120]:
# get all general data

# aplicacoes
df_geral_aplic = read_dataset(data_dir=DATA_DIR,
                              scope='aplicacoes',
                                dataset='gerais')[['LattesID', 'NOME-COMPLETO']]

# restritivo
df_geral_rest = read_dataset(data_dir=DATA_DIR,
                              scope='restritivo',
                                dataset='gerais')[['LattesID', 'NOME-COMPLETO']]

# abrangente
df_geral_abg = read_dataset(data_dir=DATA_DIR,
                              scope='abrangente',
                                dataset='gerais')[['LattesID', 'NOME-COMPLETO']]

# concat
df_geral = pd.concat([df_geral_aplic, df_geral_rest, df_geral_abg], axis=0,
                    ignore_index=True)

# standardize_string on 'NOME-COMPLETO'
df_geral['NOME-COMPLETO'] = df_geral['NOME-COMPLETO'].apply(standardize_string)

In [121]:
df_geral.shape

(24289, 2)

In [122]:
# A supervisor may appear in more than one scope
df_geral = df_geral.drop_duplicates(subset=['LattesID', 'NOME-COMPLETO'])

In [123]:
df_geral.shape

(13317, 2)

In [124]:
df_geral.head(3)

Unnamed: 0,LattesID,NOME-COMPLETO
0,565598534943,sdnei de brito alves
1,601083852823,alexandre loureiros rodrigues
2,5349558315095,juliano manabu iyoda


In [125]:
# showing the homonyms
df_geral[df_geral.duplicated(subset=['NOME-COMPLETO'], keep=False)]

Unnamed: 0,LattesID,NOME-COMPLETO
3843,9578832375902806,luciano silva
4297,315210533272570,carlos eduardo de souza
8515,4835041248145487,carlos eduardo de souza
11063,7514305376858192,luciano silva


In [126]:
df_geral[df_geral.duplicated(subset=['NOME-COMPLETO'], keep=False)].shape

(4, 2)

According to the result above, the current dataset contains only two homonymous names (luciano silva and carlos eduardo de souza), each shared by two researchers. However, with a different or larger dataset, this issue could become more significant.
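
One conservative way to limit misattribution, sketched here as an assumption rather than the approach used in this notebook, is to drop every name that maps to more than one Lattes ID before the name-based merge. The sample IDs and rows are invented; `validate='many_to_one'` asks pandas to raise an error if a repeated name slips through on the lookup side.

```python
import pandas as pd

# Invented lookup table with one homonym pair (IDs are hypothetical)
lookup = pd.DataFrame({
    'LattesID': [1, 2, 3],
    'NOME-COMPLETO': ['luciano silva', 'luciano silva', 'ana souza'],
})

# Keep only names that identify exactly one researcher
ambiguous = lookup.duplicated(subset=['NOME-COMPLETO'], keep=False)
unique_names = lookup[~ambiguous].rename(columns={'LattesID': 'LattesID_Orientador',
                                                  'NOME-COMPLETO': 'NomeDoOrientador'})

# Invented supervision records to be matched by supervisor name
records = pd.DataFrame({'LattesID_Orientando': [7, 8],
                        'NomeDoOrientador': ['luciano silva', 'ana souza']})

# validate raises pandas.errors.MergeError if the right side has repeated names
matched = records.merge(unique_names, on='NomeDoOrientador',
                        how='inner', validate='many_to_one')
print(matched)
```

The trade-off is that records pointing at an ambiguous name are lost instead of being linked to every homonym.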

In [127]:
# adjusting column names
df_geral = df_geral.rename({'LattesID': 'LattesID_Orientador',
                          'NOME-COMPLETO': 'NomeDoOrientador'}, axis=1)

In [128]:
df_geral.head(3)

Unnamed: 0,LattesID_Orientador,NomeDoOrientador
0,565598534943,sdnei de brito alves
1,601083852823,alexandre loureiros rodrigues
2,5349558315095,juliano manabu iyoda


In [129]:
df_merged.head(2)

Unnamed: 0,LattesID_Orientando,NomeDoOrientando,LattesID_Orientador,NomeDoOrientador
0,11303079806761,gerald jean francis banon,-100000,joao camargo neto
1,11303079806761,gerald jean francis banon,8226325993016336,ailton cruz dos santos


In [130]:
sec_mask = df_merged.copy().drop(['LattesID_Orientador'], axis=1)

In [131]:
sec_mask.head(2)

Unnamed: 0,LattesID_Orientando,NomeDoOrientando,NomeDoOrientador
0,11303079806761,gerald jean francis banon,joao camargo neto
1,11303079806761,gerald jean francis banon,ailton cruz dos santos


In [132]:
sec_mask.shape

(222315, 3)

In [133]:
df_geral.head(2)

Unnamed: 0,LattesID_Orientador,NomeDoOrientador
0,565598534943,sdnei de brito alves
1,601083852823,alexandre loureiros rodrigues


In [134]:
# merge on name
sec_mask = sec_mask.merge(df_geral, on='NomeDoOrientador', how='inner')

In [135]:
sec_mask.shape

(8077, 4)
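
The inner merge keeps only the records whose supervisor name was found in the lookup table. To see how many records were left unmatched, a left merge with `indicator=True` can be used instead. The sketch below uses invented rows and is not part of the original notebook.

```python
import pandas as pd

# Invented supervision records and name lookup
records = pd.DataFrame({'LattesID_Orientando': [1, 1, 2],
                        'NomeDoOrientador': ['ana', 'bia', 'caio']})
lookup = pd.DataFrame({'LattesID_Orientador': [10],
                       'NomeDoOrientador': ['ana']})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
checked = records.merge(lookup, on='NomeDoOrientador', how='left', indicator=True)
counts = checked['_merge'].value_counts()

n_matched = int(counts['both'])
n_unmatched = int(counts['left_only'])
print(n_matched, n_unmatched)
```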

In [136]:
sec_mask.head(2)

Unnamed: 0,LattesID_Orientando,NomeDoOrientando,NomeDoOrientador,LattesID_Orientador
0,11303079806761,gerald jean francis banon,ana lucia bezerra candeias,4950530398212920
1,11303079806761,gerald jean francis banon,marcos cordeiro d'ornellas,1765721612533942


In [137]:
# selecting relevant columns for IDs
sec_mask = sec_mask[['LattesID_Orientando', 'LattesID_Orientador']]

In [138]:
sec_mask.head()

Unnamed: 0,LattesID_Orientando,LattesID_Orientador
0,11303079806761,4950530398212920
1,11303079806761,1765721612533942
2,11303079806761,5123287769635741
3,11303079806761,362417828475021
4,11303079806761,362417828475021


In [139]:
len(df_final)

38342

In [140]:
df_final = pd.concat([df_final, sec_mask], axis=0, ignore_index=True)

In [141]:
len(df_final)

46419

### Final Dataset Preparation

In [142]:
df_final.drop_duplicates(inplace=True)

In [143]:
df_final.head()

Unnamed: 0,LattesID_Orientando,LattesID_Orientador
0,11303079806761,8226325993016336
1,11303079806761,9973453169372876
2,11303079806761,362417828475021
3,11303079806761,4950530398212920
4,11303079806761,5067006506847859


In [144]:
len(df_final)

36747

In [145]:
df_final.dtypes

LattesID_Orientando    object
LattesID_Orientador    object
dtype: object

In [146]:
# Removing rows where the two columns have the same value
# (these can come from both the first and the second mask)
df_final = df_final[df_final['LattesID_Orientando'] != df_final['LattesID_Orientador']].copy()

In [147]:
len(df_final)

36703

In [148]:
df_final.columns

Index(['LattesID_Orientando', 'LattesID_Orientador'], dtype='object')

In [149]:
len(df_final.LattesID_Orientando.unique()), len(df_final.LattesID_Orientador.unique())

(2537, 34490)

In [150]:
df_final.LattesID_Orientando = df_final.LattesID_Orientando.apply(lambda x: f'LattesID_{x}')
df_final.LattesID_Orientador = df_final.LattesID_Orientador.apply(lambda x: f'LattesID_{x}')

In [151]:
len(df_final.LattesID_Orientando.unique()), len(df_final.LattesID_Orientador.unique())

(2537, 34490)

In [152]:
df_final

Unnamed: 0,LattesID_Orientando,LattesID_Orientador
0,LattesID_11303079806761,LattesID_8226325993016336
1,LattesID_11303079806761,LattesID_9973453169372876
2,LattesID_11303079806761,LattesID_362417828475021
3,LattesID_11303079806761,LattesID_4950530398212920
4,LattesID_11303079806761,LattesID_5067006506847859
...,...,...
46406,LattesID_9900612496740076,LattesID_6150479997891841
46408,LattesID_9900612496740076,LattesID_6025151534761132
46411,LattesID_9962204158580009,LattesID_3264146604600929
46415,LattesID_9962204158580009,LattesID_9771138597443055


In [153]:
df_final.rename(columns={'LattesID_Orientando': 'Target', 'LattesID_Orientador': 'Source'}, inplace=True)

In [154]:
df_final = df_final[['Source', 'Target']]

In [155]:
#df_final.to_csv('grafo-orientacoes.csv', index=False)

In [156]:
df_final.columns

Index(['Source', 'Target'], dtype='object')

In [157]:
# Create a directed graph
G = nx.DiGraph()

# Add edges: supervisor -> supervisee
for _, row in df_final.iterrows():
    orientador = row['Source']
    orientando = row['Target']
    G.add_edge(orientador, orientando)


In [158]:
nx.write_gexf(G, "../graphs/aplicacoes/orientacoes.gexf")
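
As an alternative to the row-by-row loop, networkx can build the same kind of directed graph straight from an edge table with `from_pandas_edgelist`. The sketch below uses invented edges and adds two quick sanity checks; it is an illustration and has not been run against the real data.

```python
import networkx as nx
import pandas as pd

# Invented edges: supervisor (Source) -> supervisee (Target)
edges = pd.DataFrame({'Source': ['LattesID_10', 'LattesID_10', 'LattesID_1'],
                      'Target': ['LattesID_1', 'LattesID_2', 'LattesID_3']})

G_toy = nx.from_pandas_edgelist(edges, source='Source', target='Target',
                                create_using=nx.DiGraph)

# Out-degree counts how many supervisees each researcher supervised
n_supervised = dict(G_toy.out_degree())
# A cycle in a supervision graph would usually point to a data problem,
# for example a supervisor ID recovered from the wrong homonym
is_acyclic = nx.is_directed_acyclic_graph(G_toy)
print(n_supervised, is_acyclic)
```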