# Extracting Data

This notebook builds a table with the following data for each student: the student identifier (`discente`), the year the student entered the institution, the year and period of the student's last enrollment, the student's current status, and the number of times the student took each course.

Importing pandas and the `csv` module.

In [485]:
import pandas as pd
import csv

Reading the CSV file and loading its data into a DataFrame, using the semicolon as the separator.

In [486]:
df_dados = pd.read_csv('dataframe-bsi-2009-2022.csv', sep=';')
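
The effect of `sep=';'` can be seen without the real file. The sketch below reads a small in-memory sample; the sample rows and the names `amostra_csv` and `df_amostra` are made up for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the semicolon-separated layout.
amostra_csv = "discente;nome;ano\nabc123;LÓGICA;20091\ndef456;ÉTICA;20092\n"

# sep=';' tells pandas to split fields on semicolons instead of commas.
df_amostra = pd.read_csv(io.StringIO(amostra_csv), sep=';')

print(df_amostra.shape)
print(df_amostra.dtypes)
```

Without `sep=';'`, each line would be read as a single column, since the sample contains no commas.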

Listing the DataFrame's columns.

In [487]:
df_dados.columns

Index(['discente', 'unidade', 'media_final', 'descricao', 'ano',
       'id_componente', 'nome', 'ch_total', 'sexo', 'ano_nascimento',
       'ano_ingresso', 'status'],
      dtype='object')

# Filters

To narrow the scope of the analysis, we start with the required courses of the Bacharelado em Sistemas da Informação (BSI), the bachelor's program in Information Systems:

In [488]:
lista_obrigatórias = [
                'ALGORITMOS E LÓGICA DE PROGRAMAÇÃO',
                'INTRODUÇÃO À INFORMÁTICA',
                'FUNDAMENTOS DE MATEMÁTICA',
                'LÓGICA',
                'TEORIA GERAL DA ADMINISTRAÇÃO',
                'PROGRAMAÇÃO',
                'CÁLCULO DIFERENCIAL E INTEGRAL',
                'TEORIA GERAL DOS SISTEMAS',
                'PROGRAMAÇÃO ORIENTADA A OBJETOS I',
                'ESTRUTURA DE DADOS',
                'ÁLGEBRA LINEAR',
                'ORGANIZAÇÃO, SISTEMAS E MÉTODOS',
                'FUNDAMENTOS DE SISTEMAS DE INFORMAÇÃO',
                'PROGRAMAÇÃO WEB',
                'ARQUITETURA DE COMPUTADORES',
                'PROBABILIDADE E ESTATÍSTICA',
                'BANCO DE DADOS',
                'ENGENHARIA DE SOFTWARE I',
                'PROGRAMAÇÃO ORIENTADA A OBJETOS II',
                'SISTEMAS OPERACIONAIS',
                'PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS',
                'ENGENHARIA DE SOFTWARE II',
                'REDES DE COMPUTADORES',
                'EMPREENDEDORISMO EM INFORMÁTICA',
                'GESTÃO DE PROJETO DE SOFTWARE',
                'PROGRAMAÇÃO VISUAL',
                'MATEMÁTICA FINANCEIRA',
                'SISTEMAS DE APOIO À DECISÃO',
                'ÉTICA',
                ]
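# Keep only the rows whose course name is in the list; .copy() avoids
# SettingWithCopyWarning when the filtered frame is modified later.
df_dados_filtrado = df_dados[df_dados['nome'].isin(lista_obrigatórias)].copy()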
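
Because the filter matches course names exactly, a misspelled entry in the list would silently drop that course. A possible sanity check, sketched here on toy data (the frame `df_exemplo`, the list `obrigatorias_exemplo`, and the misspelled name are all hypothetical), is to look for listed names that never occur in the data:

```python
import pandas as pd

# Toy data: three rows for listed courses and one row for an unlisted one.
df_exemplo = pd.DataFrame({
    'nome': ['LÓGICA', 'ÉTICA', 'ATIVIDADES COMPLEMENTARES', 'LÓGICA'],
})

# The last name is deliberately misspelled (missing accents).
obrigatorias_exemplo = ['LÓGICA', 'ÉTICA', 'PROGRAMACAO']

# Keep only the rows whose course name is in the list.
filtrado_exemplo = df_exemplo[df_exemplo['nome'].isin(obrigatorias_exemplo)]

# Listed names that never occur in the data; a non-empty result hints at a typo.
ausentes = sorted(set(obrigatorias_exemplo) - set(df_exemplo['nome']))
print(ausentes)
```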

Counting the NaN values in each column.

In [489]:
df_dados_filtrado.isnull().sum()

discente             0
unidade           2034
media_final       4449
descricao            0
ano                  0
id_componente        0
nome                 0
ch_total             0
sexo                 0
ano_nascimento       0
ano_ingresso         0
status               0
dtype: int64

Filling the NaN values in the **unidade** column with 1.

In [490]:
df_dados_filtrado.loc[:, 'unidade'] = df_dados_filtrado['unidade'].fillna(1)
df_dados_filtrado 

Unnamed: 0,discente,unidade,media_final,descricao,ano,id_componente,nome,ch_total,sexo,ano_nascimento,ano_ingresso,status
0,afba64c0118bfcc8d5b3987e725ed545,1.0,15,REPROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1987,2009,CANCELADO
1,afba64c0118bfcc8d5b3987e725ed545,2.0,15,REPROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1987,2009,CANCELADO
2,afba64c0118bfcc8d5b3987e725ed545,3.0,15,REPROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1987,2009,CANCELADO
3,9526e01da587b20211a39b4e66673aea,1.0,92,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1990,2009,CONCLUÍDO
4,9526e01da587b20211a39b4e66673aea,2.0,92,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1990,2009,CONCLUÍDO
...,...,...,...,...,...,...,...,...,...,...,...,...
53194,22f4aed4a073c5e9515a8669e9c102f3,2.0,52,APROVADO POR NOTA,20222,62764,PROGRAMAÇÃO VISUAL,60,M,1984,2019,ATIVO - FORMANDO
53195,22f4aed4a073c5e9515a8669e9c102f3,3.0,52,APROVADO POR NOTA,20222,62764,PROGRAMAÇÃO VISUAL,60,M,1984,2019,ATIVO - FORMANDO
53199,943463d2de8c6a60ec5f5d959ba7f1ac,1.0,100,APROVADO,20222,2054400,MATEMÁTICA FINANCEIRA,60,M,1999,2018,CONCLUÍDO
53200,943463d2de8c6a60ec5f5d959ba7f1ac,2.0,100,APROVADO,20222,2054400,MATEMÁTICA FINANCEIRA,60,M,1999,2018,CONCLUÍDO


Keeping only the rows where **unidade** equals 1 (including those just filled in), so that the rows repeated for units 2 and 3 of the same enrollment are dropped.

In [491]:
df_dados_filtrado = df_dados_filtrado[df_dados_filtrado['unidade'] == 1]
df_dados_filtrado

Unnamed: 0,discente,unidade,media_final,descricao,ano,id_componente,nome,ch_total,sexo,ano_nascimento,ano_ingresso,status
0,afba64c0118bfcc8d5b3987e725ed545,1.0,15,REPROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1987,2009,CANCELADO
3,9526e01da587b20211a39b4e66673aea,1.0,92,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1990,2009,CONCLUÍDO
6,1ed6777bd6ff4fd393e0b334d519c642,1.0,80,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1991,2009,CONCLUÍDO
9,cd66757ed4a317a3537ae3e246648778,1.0,73,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1975,2009,CANCELADO
12,fa7b20f8ac2312976cd7338487ad527d,1.0,98,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,90,M,1978,2009,CONCLUÍDO
...,...,...,...,...,...,...,...,...,...,...,...,...
52942,b3c338925bb16486f1bf2a2f1771e464,1.0,70,EXCLUIDA,20222,70879,EMPREENDEDORISMO EM INFORMÁTICA,60,M,2001,2020,ATIVO
53189,7d2dd0d35ebb8319b0c0e612660d2c3a,1.0,93,APROVADO,20222,62766,SISTEMAS DE APOIO À DECISÃO,60,M,2000,2019,CONCLUÍDO
53190,7d2dd0d35ebb8319b0c0e612660d2c3a,1.0,98,APROVADO,20222,62764,PROGRAMAÇÃO VISUAL,60,M,2000,2019,CONCLUÍDO
53193,22f4aed4a073c5e9515a8669e9c102f3,1.0,52,APROVADO POR NOTA,20222,62764,PROGRAMAÇÃO VISUAL,60,M,1984,2019,ATIVO - FORMANDO
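
The two steps above can be sketched on toy data. The example assumes, as the sample output suggests, that each enrollment appears either once per unit (1, 2, 3) or once with a missing unit; the frame `df_exemplo` and its values are invented for illustration.

```python
import pandas as pd

# Student 'a' has one enrollment split into three unit rows;
# student 'b' has one enrollment with no unit recorded.
df_exemplo = pd.DataFrame({
    'discente': ['a', 'a', 'a', 'b'],
    'nome': ['LÓGICA', 'LÓGICA', 'LÓGICA', 'ÉTICA'],
    'ano': [20091, 20091, 20091, 20092],
    'unidade': [1.0, 2.0, 3.0, None],
})

# Treat a missing unit as unit 1 so the row survives the filter below.
df_exemplo['unidade'] = df_exemplo['unidade'].fillna(1)

# Keeping unit 1 only leaves a single row per enrollment.
uma_linha_por_matricula = df_exemplo[df_exemplo['unidade'] == 1]
print(uma_linha_por_matricula)
```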


Counting how many times each student took each course.

In [492]:
quantidade_disciplinas = df_dados_filtrado.groupby(['discente', 'nome']).size().reset_index(name='quantidade')
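
As a small illustration of what `groupby(...).size()` produces, the sketch below counts enrollments per student and course on invented data (`matriculas` and its rows are hypothetical):

```python
import pandas as pd

# Student 'a' took LÓGICA twice (two different periods) and ÉTICA once.
matriculas = pd.DataFrame({
    'discente': ['a', 'a', 'a', 'b'],
    'nome': ['LÓGICA', 'LÓGICA', 'ÉTICA', 'LÓGICA'],
    'ano': [20091, 20092, 20091, 20091],
})

# size() counts the rows in each (discente, nome) group;
# reset_index(name=...) turns the result into a regular DataFrame.
contagem = matriculas.groupby(['discente', 'nome']).size().reset_index(name='quantidade')
print(contagem)
```

Each row of the result is one (student, course) pair, which is the long format that the pivot step expects.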

Pivoting the courses into columns.

In [493]:
tabela_final = quantidade_disciplinas.pivot(index='discente', columns='nome', values='quantidade').reset_index()

Replacing NaN with 0 in the course columns.

In [494]:
tabela_final = tabela_final.fillna(0)
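
The counts show up as floats (`1.0`, `0.0`) in the outputs because the pivot introduces NaN for courses a student never took, which forces a float dtype. If integer counts are preferred, one option is to cast after filling, as in this sketch on toy data (the names `contagem` and `largo` are illustrative only):

```python
import pandas as pd

# Long-format counts: student 'b' has no row for ÉTICA.
contagem = pd.DataFrame({
    'discente': ['a', 'a', 'b'],
    'nome': ['LÓGICA', 'ÉTICA', 'LÓGICA'],
    'quantidade': [2, 1, 1],
})

# One row per student, one column per course; missing pairs become NaN.
largo = contagem.pivot(index='discente', columns='nome', values='quantidade')

# Fill the gaps with 0 and cast back to integers before restoring the index column.
largo = largo.fillna(0).astype(int).reset_index()
print(largo)
```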

Adding *ano_ingresso* for each student.

In [495]:
ano_ingresso_discente = df_dados_filtrado.drop_duplicates(subset=['discente'])[['discente', 'ano_ingresso']]
tabela_final = tabela_final.merge(ano_ingresso_discente, on='discente', how='left')
tabela_final

Unnamed: 0,discente,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,ARQUITETURA DE COMPUTADORES,BANCO DE DADOS,CÁLCULO DIFERENCIAL E INTEGRAL,EMPREENDEDORISMO EM INFORMÁTICA,ENGENHARIA DE SOFTWARE I,ENGENHARIA DE SOFTWARE II,ESTRUTURA DE DADOS,FUNDAMENTOS DE MATEMÁTICA,...,PROGRAMAÇÃO WEB,PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS,REDES DE COMPUTADORES,SISTEMAS DE APOIO À DECISÃO,SISTEMAS OPERACIONAIS,TEORIA GERAL DA ADMINISTRAÇÃO,TEORIA GERAL DOS SISTEMAS,ÁLGEBRA LINEAR,ÉTICA,ano_ingresso
0,001cea3c82e2010681f2cdeab21e5ecf,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2018
1,005c14d7c07bf7980b60c703f99c5ee7,1.0,2.0,1.0,1.0,0.0,1.0,0.0,3.0,3.0,...,1.0,2.0,1.0,0.0,1.0,1.0,1.0,2.0,1.0,2018
2,0107fd69d8cd7e3d30dede96fb68bfe5,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,2011
3,014789363f7940922e71e710ee9d22bc,2.0,3.0,1.0,0.0,1.0,1.0,1.0,2.0,3.0,...,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2016
4,014f0dec46fe7a9c5836527662e1df10,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,fe802d8d85de6f842749468401d1146c,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2022
678,fe87dfa176a74fc10a5cb701b9fb5dd4,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2016
679,fec9ed6026d55ecdf514c640312c3d08,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,2.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,2020
680,ff56f2c5048dae0797fd3e851572b80c,4.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,6.0,...,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,0.0,2014
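
The merge relies on there being exactly one row per student on the right-hand side. A minimal sketch of the same pattern on toy data is shown below; passing `validate='one_to_one'` is an optional safeguard added here for illustration, not something the notebook does.

```python
import pandas as pd

# Per-enrollment records: the entry year repeats on every row of a student.
registros = pd.DataFrame({
    'discente': ['a', 'a', 'b'],
    'ano_ingresso': [2018, 2018, 2020],
})

# Wide table with one row per student.
tabela = pd.DataFrame({'discente': ['a', 'b'], 'LÓGICA': [2, 1]})

# drop_duplicates keeps the first row found for each student.
por_discente = registros.drop_duplicates(subset=['discente'])[['discente', 'ano_ingresso']]

# validate='one_to_one' raises MergeError if 'discente' repeats on either side.
tabela = tabela.merge(por_discente, on='discente', how='left', validate='one_to_one')
print(tabela)
```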


Adding *status* for each student.

In [496]:
status_discente = df_dados_filtrado.drop_duplicates(subset=['discente'])[['discente', 'status']]
tabela_final = tabela_final.merge(status_discente, on='discente', how='left')
tabela_final

Unnamed: 0,discente,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,ARQUITETURA DE COMPUTADORES,BANCO DE DADOS,CÁLCULO DIFERENCIAL E INTEGRAL,EMPREENDEDORISMO EM INFORMÁTICA,ENGENHARIA DE SOFTWARE I,ENGENHARIA DE SOFTWARE II,ESTRUTURA DE DADOS,FUNDAMENTOS DE MATEMÁTICA,...,PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS,REDES DE COMPUTADORES,SISTEMAS DE APOIO À DECISÃO,SISTEMAS OPERACIONAIS,TEORIA GERAL DA ADMINISTRAÇÃO,TEORIA GERAL DOS SISTEMAS,ÁLGEBRA LINEAR,ÉTICA,ano_ingresso,status
0,001cea3c82e2010681f2cdeab21e5ecf,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2018,CANCELADO
1,005c14d7c07bf7980b60c703f99c5ee7,1.0,2.0,1.0,1.0,0.0,1.0,0.0,3.0,3.0,...,2.0,1.0,0.0,1.0,1.0,1.0,2.0,1.0,2018,CANCELADO
2,0107fd69d8cd7e3d30dede96fb68bfe5,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,2011,CANCELADO
3,014789363f7940922e71e710ee9d22bc,2.0,3.0,1.0,0.0,1.0,1.0,1.0,2.0,3.0,...,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2016,CONCLUÍDO
4,014f0dec46fe7a9c5836527662e1df10,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2020,CANCELADO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,fe802d8d85de6f842749468401d1146c,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2022,ATIVO
678,fe87dfa176a74fc10a5cb701b9fb5dd4,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2016,CONCLUÍDO
679,fec9ed6026d55ecdf514c640312c3d08,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,2020,ATIVO
680,ff56f2c5048dae0797fd3e851572b80c,4.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,6.0,...,1.0,1.0,1.0,1.0,1.0,2.0,1.0,0.0,2014,CONCLUÍDO
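
Since `drop_duplicates` keeps only the first row per student, the result is well defined only if *status* is the same on all of a student's rows. Whether that holds for this dataset is not shown in the notebook; a possible check, sketched on invented data, is:

```python
import pandas as pd

# Student 'b' deliberately carries two different status values.
registros = pd.DataFrame({
    'discente': ['a', 'a', 'b', 'b'],
    'status': ['CANCELADO', 'CANCELADO', 'ATIVO', 'CONCLUÍDO'],
})

# Number of distinct status values recorded for each student.
valores_distintos = registros.groupby('discente')['status'].nunique()

# Students with more than one status would need a rule for choosing one.
inconsistentes = valores_distintos[valores_distintos > 1].index.tolist()
print(inconsistentes)
```

An empty list would indicate that taking the first row per student loses no information.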


Grouping by student and summing the course hours (`ch_total`).

In [497]:
df_carga_horaria_cumprida = df_dados_filtrado.groupby('discente')['ch_total'].sum().reset_index()
df_carga_horaria_cumprida.rename(columns={'ch_total': 'ch_cumprida'}, inplace=True)

Adding each student's completed course hours.

In [498]:
tabela_final = tabela_final.merge(df_carga_horaria_cumprida, on='discente', how='left')

Dividing each student's completed course hours by the total hours of the required courses (1830).

In [499]:
tabela_final['ch_cumprida_dividida'] = tabela_final['ch_cumprida'] / 1830
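
As a worked example of this ratio, the sketch below applies the same division to two `ch_cumprida` values that appear in the final table (330 and 2340). The constant name `CH_OBRIGATORIA` and the frame `exemplo` are illustrative.

```python
import pandas as pd

# Divisor used in the notebook for the required-course hours.
CH_OBRIGATORIA = 1830

exemplo = pd.DataFrame({'discente': ['x', 'y'], 'ch_cumprida': [330, 2340]})

# Same division as in the notebook, applied to the toy frame.
exemplo['ch_cumprida_dividida'] = exemplo['ch_cumprida'] / CH_OBRIGATORIA
print(exemplo['ch_cumprida_dividida'].round(6).tolist())
```

The ratio can exceed 1 because the sum of `ch_total` counts every enrollment, so a course taken more than once contributes its hours each time.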

Identifying the distinct semesters attended by each student.

In [500]:
semestres_unicos_por_discente = df_dados_filtrado.groupby('discente')['ano'].nunique().reset_index()
semestres_unicos_por_discente.rename(columns={'ano' : 'semestre'}, inplace=True)
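
The difference between counting rows and counting distinct periods is easy to see on toy data. In this sketch (invented rows), student 'a' has three enrollments spread over two periods:

```python
import pandas as pd

matriculas = pd.DataFrame({
    'discente': ['a', 'a', 'a', 'b'],
    'ano': [20091, 20091, 20092, 20101],
})

# nunique() counts distinct values of 'ano' per student, not rows.
semestres = matriculas.groupby('discente')['ano'].nunique().reset_index()
semestres = semestres.rename(columns={'ano': 'semestre'})
print(semestres)
```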

Adding the number of semesters attended by each student.

In [501]:
tabela_final = tabela_final.merge(semestres_unicos_por_discente, on='discente', how='left')

Dividing the number of semesters attended by 8.

In [502]:
tabela_final['semestre_dividido'] = tabela_final['semestre'] / 8

Finding the *last period* in which each student was enrolled in the program.

In [503]:
ultimo_periodo = df_dados_filtrado.groupby('discente')['ano'].max().reset_index()
ultimo_periodo.rename(columns={'ano': 'ultimo_periodo'}, inplace=True)
ultimo_periodo

Unnamed: 0,discente,ultimo_periodo
0,001cea3c82e2010681f2cdeab21e5ecf,20181
1,005c14d7c07bf7980b60c703f99c5ee7,20221
2,0107fd69d8cd7e3d30dede96fb68bfe5,20121
3,014789363f7940922e71e710ee9d22bc,20206
4,014f0dec46fe7a9c5836527662e1df10,20206
...,...,...
677,fe802d8d85de6f842749468401d1146c,20222
678,fe87dfa176a74fc10a5cb701b9fb5dd4,20206
679,fec9ed6026d55ecdf514c640312c3d08,20222
680,ff56f2c5048dae0797fd3e851572b80c,20192
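
Values such as `20181` and `20206` appear to pack the year and the period into one integer. Assuming the last digit is the period and the remaining digits are the year (an inference from the sample values, not something the notebook states), the two parts could be separated like this; the column names `ultimo_ano` and `ultimo_periodo_num` are hypothetical:

```python
import pandas as pd

ultimo = pd.DataFrame({'discente': ['a', 'b'], 'ultimo_periodo': [20181, 20206]})

# Assumption: value = year * 10 + period.
ultimo['ultimo_ano'] = ultimo['ultimo_periodo'] // 10
ultimo['ultimo_periodo_num'] = ultimo['ultimo_periodo'] % 10
print(ultimo)
```

Under the same assumption, taking `max()` of the packed value orders enrollments first by year and then by period.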


Merging the last-period information into `tabela_final`.

In [504]:
tabela_final = tabela_final.merge(ultimo_periodo, on='discente', how='left')

Defining the new column order.

In [505]:
colunas_ordenadas = ['discente', 'ano_ingresso', 'ultimo_periodo', 'semestre',
                    'semestre_dividido', 'status','ch_cumprida', 'ch_cumprida_dividida',
                    'ALGORITMOS E LÓGICA DE PROGRAMAÇÃO',
                    'INTRODUÇÃO À INFORMÁTICA',
                    'FUNDAMENTOS DE MATEMÁTICA',
                    'LÓGICA',
                    'TEORIA GERAL DA ADMINISTRAÇÃO',
                    'PROGRAMAÇÃO',
                    'CÁLCULO DIFERENCIAL E INTEGRAL',
                    'TEORIA GERAL DOS SISTEMAS',
                    'PROGRAMAÇÃO ORIENTADA A OBJETOS I',
                    'ESTRUTURA DE DADOS',
                    'ÁLGEBRA LINEAR',
                    'ORGANIZAÇÃO, SISTEMAS E MÉTODOS',
                    'FUNDAMENTOS DE SISTEMAS DE INFORMAÇÃO',
                    'PROGRAMAÇÃO WEB',
                    'ARQUITETURA DE COMPUTADORES',
                    'PROBABILIDADE E ESTATÍSTICA',
                    'BANCO DE DADOS',
                    'ENGENHARIA DE SOFTWARE I',
                    'PROGRAMAÇÃO ORIENTADA A OBJETOS II',
                    'SISTEMAS OPERACIONAIS',
                    'PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS',
                    'ENGENHARIA DE SOFTWARE II',
                    'REDES DE COMPUTADORES',
                    'EMPREENDEDORISMO EM INFORMÁTICA',
                    'GESTÃO DE PROJETO DE SOFTWARE',
                    'PROGRAMAÇÃO VISUAL',
                    'MATEMÁTICA FINANCEIRA',
                    'SISTEMAS DE APOIO À DECISÃO',
                    'ÉTICA']


Reordering the columns.

In [506]:
tabela_final = tabela_final[colunas_ordenadas]
tabela_final

Unnamed: 0,discente,ano_ingresso,ultimo_periodo,semestre,semestre_dividido,status,ch_cumprida,ch_cumprida_dividida,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,INTRODUÇÃO À INFORMÁTICA,...,SISTEMAS OPERACIONAIS,PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS,ENGENHARIA DE SOFTWARE II,REDES DE COMPUTADORES,EMPREENDEDORISMO EM INFORMÁTICA,GESTÃO DE PROJETO DE SOFTWARE,PROGRAMAÇÃO VISUAL,MATEMÁTICA FINANCEIRA,SISTEMAS DE APOIO À DECISÃO,ÉTICA
0,001cea3c82e2010681f2cdeab21e5ecf,2018,20181,1,0.125,CANCELADO,330,0.180328,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,005c14d7c07bf7980b60c703f99c5ee7,2018,20221,10,1.250,CANCELADO,2340,1.278689,1.0,2.0,...,1.0,2.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,1.0
2,0107fd69d8cd7e3d30dede96fb68bfe5,2011,20121,3,0.375,CANCELADO,870,0.475410,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,014789363f7940922e71e710ee9d22bc,2016,20206,11,1.375,CONCLUÍDO,2430,1.327869,2.0,2.0,...,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0
4,014f0dec46fe7a9c5836527662e1df10,2020,20206,3,0.375,CANCELADO,630,0.344262,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677,fe802d8d85de6f842749468401d1146c,2022,20222,2,0.250,ATIVO,540,0.295082,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
678,fe87dfa176a74fc10a5cb701b9fb5dd4,2016,20206,3,0.375,CONCLUÍDO,420,0.229508,0.0,0.0,...,1.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
679,fec9ed6026d55ecdf514c640312c3d08,2020,20222,7,0.875,ATIVO,1470,0.803279,1.0,2.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
680,ff56f2c5048dae0797fd3e851572b80c,2014,20192,12,1.500,CONCLUÍDO,3390,1.852459,4.0,7.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


Saving the DataFrame as a CSV file with ';' as the separator. With `csv.QUOTE_NONNUMERIC`, every non-numeric field is wrapped in quotes and numeric values are written unquoted.

In [507]:
tabela_final.to_csv('tabela_final.csv', index=False, sep=';', quoting=csv.QUOTE_NONNUMERIC)
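
To see what this quoting mode writes, the sketch below sends a one-row toy frame to an in-memory buffer instead of a file and then reads it back; the frame `exemplo` and its column `razao` are made up for illustration.

```python
import csv
import io

import pandas as pd

exemplo = pd.DataFrame({
    'discente': ['abc'],
    'status': ['ATIVO'],
    'semestre': [2],
    'razao': [0.25],
})

# Write to a string buffer with the same options used for the real file.
buffer = io.StringIO()
exemplo.to_csv(buffer, index=False, sep=';', quoting=csv.QUOTE_NONNUMERIC)
texto = buffer.getvalue()
print(texto)

# Reading the text back with the same separator restores the numeric columns.
relido = pd.read_csv(io.StringIO(texto), sep=';')
print(relido.dtypes)
```

Header names and string values come out quoted, while integers and floats do not, which lets a reader of the file tell text from numbers.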