# Extracting Data

Building a table that contains the following data: the student identifier (`discente`), the year the student entered the institution, the year and term of the student's last enrollment, the student's current status, and the number of times the student took each course.

Importing pandas and the csv module.

In [248]:
import pandas as pd
import csv

Reading the CSV file and loading its data into a DataFrame, using the semicolon as the separator.

In [249]:
df_dados = pd.read_csv('dataframe-bsi-2009-2022.csv', sep=';')
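As an illustration of the separator option, the sketch below parses a small in-memory sample instead of the real file. The sample rows and the name `df_amostra` are hypothetical; only the column names follow the dataset used in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout.
amostra = io.StringIO(
    "discente;unidade;media_final;nome\n"
    "a1;1.0;15;LÓGICA\n"
    "b2;2.0;92;ÉTICA\n"
)

# sep=';' tells pandas to split fields on semicolons instead of commas.
df_amostra = pd.read_csv(amostra, sep=';')
print(df_amostra.dtypes)
```

Without `sep=';'`, pandas would split on commas and read each line as a single field.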

# Filters

Filtering the data for unit 1.

In [250]:
filtro = df_dados['unidade'] == 1
df_dados_filtrado = df_dados[filtro]
df_dados_filtrado

Unnamed: 0,discente,unidade,media_final,descricao,ano,id_componente,nome,sexo,ano_nascimento,ano_ingresso,status
0,afba64c0118bfcc8d5b3987e725ed545,1.0,15,REPROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,M,1987,2009,CANCELADO
3,9526e01da587b20211a39b4e66673aea,1.0,92,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,M,1990,2009,CONCLUÍDO
6,1ed6777bd6ff4fd393e0b334d519c642,1.0,80,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,M,1991,2009,CONCLUÍDO
9,cd66757ed4a317a3537ae3e246648778,1.0,73,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,M,1975,2009,CANCELADO
12,fa7b20f8ac2312976cd7338487ad527d,1.0,98,APROVADO,20091,2037000,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,M,1978,2009,CONCLUÍDO
...,...,...,...,...,...,...,...,...,...,...,...
53189,7d2dd0d35ebb8319b0c0e612660d2c3a,1.0,93,APROVADO,20222,62766,SISTEMAS DE APOIO À DECISÃO,M,2000,2019,CONCLUÍDO
53190,7d2dd0d35ebb8319b0c0e612660d2c3a,1.0,98,APROVADO,20222,62764,PROGRAMAÇÃO VISUAL,M,2000,2019,CONCLUÍDO
53193,22f4aed4a073c5e9515a8669e9c102f3,1.0,52,APROVADO POR NOTA,20222,62764,PROGRAMAÇÃO VISUAL,M,1984,2019,ATIVO - FORMANDO
53196,e10089f6080d3afa8904437086ea2752,1.0,93,APROVADO,20222,2050107,DIREITO E LEGISLAÇÃO SOCIAL,M,2002,2019,ATIVO


To narrow the scope of the analysis, we start with the required courses of the Bachelor's degree in Information Systems (Bacharelado em Sistemas da Informação, BSI):

In [251]:
lista_obrigatórias = [
                'ALGORITMOS E LÓGICA DE PROGRAMAÇÃO',
                'INTRODUÇÃO À INFORMÁTICA',
                'FUNDAMENTOS DE MATEMÁTICA',
                'LÓGICA',
                'TEORIA GERAL DA ADMINISTRAÇÃO',
                'PROGRAMAÇÃO',
                'CÁLCULO DIFERENCIAL E INTEGRAL',
                'TEORIA GERAL DOS SISTEMAS',
                'PROGRAMAÇÃO ORIENTADA A OBJETOS I',
                'ESTRUTURA DE DADOS',
                'ÁLGEBRA LINEAR',
                'ORGANIZAÇÃO, SISTEMAS E MÉTODOS',
                'FUNDAMENTOS DE SISTEMAS DE INFORMAÇÃO',
                'PROGRAMAÇÃO WEB',
                'ARQUITETURA DE COMPUTADORES',
                'PROBABILIDADE E ESTATÍSTICA',
                'BANCO DE DADOS',
                'ENGENHARIA DE SOFTWARE I',
                'PROGRAMAÇÃO ORIENTADA A OBJETOS II',
                'SISTEMAS OPERACIONAIS',
                'PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS',
                'ENGENHARIA DE SOFTWARE II',
                'REDES DE COMPUTADORES',
                'CONTABILIDADE E CUSTOS',
                'EMPREENDEDORISMO EM INFORMÁTICA',
                'GESTÃO DE PROJETO DE SOFTWARE',
                'PROGRAMAÇÃO VISUAL',
                'MATEMÁTICA FINANCEIRA',
                'SISTEMAS DE APOIO À DECISÃO',
                'ÉTICA',
                ]
condição_nome = f"nome in {lista_obrigatórias}"
df_dados_filtrado = df_dados_filtrado.query(condição_nome)
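The f-string above embeds the Python representation of the list in the query expression. A roughly equivalent filter can be written with `Series.isin`, which avoids building an expression string. The sketch below compares both forms on a tiny made-up frame; the data and the names `df`, `obrigatorias`, `via_query` and `via_isin` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: one course ('MÚSICA') is not in the required list.
df = pd.DataFrame({
    'discente': ['a1', 'a1', 'b2'],
    'nome': ['LÓGICA', 'MÚSICA', 'ÉTICA'],
})
obrigatorias = ['LÓGICA', 'ÉTICA']

# Same approach as in the notebook: the list is rendered into the expression.
via_query = df.query(f"nome in {obrigatorias}")

# Alternative: a boolean mask built with isin, with no string expression involved.
via_isin = df[df['nome'].isin(obrigatorias)]

print(via_isin)
```

The `isin` form may be the safer choice if a course name could ever contain a quote character, since nothing is parsed from a string.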

Counting how many times each student took each course.

In [252]:
quantidade_disciplinas = df_dados_filtrado.groupby(['discente', 'nome']).size().reset_index(name='quantidade')
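To make the counting step concrete, here is a minimal sketch on made-up rows. The data and the names `df` and `contagem` are hypothetical; the `groupby(...).size().reset_index(name=...)` chain is the same one used above.

```python
import pandas as pd

# Hypothetical enrollments: student a1 took LÓGICA twice.
df = pd.DataFrame({
    'discente': ['a1', 'a1', 'a1', 'b2'],
    'nome': ['LÓGICA', 'LÓGICA', 'ÉTICA', 'LÓGICA'],
})

# size() counts the rows in each (discente, nome) group;
# reset_index(name=...) turns the result into a regular DataFrame.
contagem = (
    df.groupby(['discente', 'nome'])
      .size()
      .reset_index(name='quantidade')
)
print(contagem)
```

Each (student, course) pair appears once in the result, with the number of attempts in `quantidade`.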

Pivoting the courses into columns, one row per student.

In [253]:
tabela_final = quantidade_disciplinas.pivot(index='discente', columns='nome', values='quantidade').reset_index()

Replacing NaN with 0 in the course columns.

In [254]:
tabela_final = tabela_final.fillna(0)
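The pivot leaves NaN wherever a student never took a course, which is why the counts are shown as floats such as `1.0`. If integer counts are preferred, one option is to cast the course columns after filling. This is a sketch on made-up data; `contagem`, `larga` and `colunas_disciplinas` are hypothetical names.

```python
import pandas as pd

# Hypothetical long-format counts: b2 has no row for ÉTICA.
contagem = pd.DataFrame({
    'discente': ['a1', 'a1', 'b2'],
    'nome': ['LÓGICA', 'ÉTICA', 'LÓGICA'],
    'quantidade': [2, 1, 1],
})

# One row per student, one column per course; missing pairs become NaN.
larga = contagem.pivot(index='discente', columns='nome', values='quantidade').reset_index()

# Fill the gaps with 0 and cast only the course columns back to integers.
colunas_disciplinas = [c for c in larga.columns if c != 'discente']
larga[colunas_disciplinas] = larga[colunas_disciplinas].fillna(0).astype(int)
print(larga)
```

Restricting the cast to the course columns keeps `discente` untouched.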

Adding the year of entry for each student.

In [255]:
ano_ingresso_discente = df_dados_filtrado.drop_duplicates(subset=['discente'])[['discente', 'ano_ingresso']]
tabela_final = tabela_final.merge(ano_ingresso_discente, on='discente', how='left')

Adding the status for each student.

In [256]:
status_discente = df_dados_filtrado.drop_duplicates(subset=['discente'])[['discente', 'status']]
tabela_final = tabela_final.merge(status_discente, on='discente', how='left')
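`drop_duplicates(subset=['discente'])` keeps the first row found for each student, so the two merges above implicitly assume that `ano_ingresso` and `status` are the same on every row of a given student. A minimal sketch of how that assumption could be checked, on made-up data (all names below are hypothetical):

```python
import pandas as pd

# Hypothetical rows: a1 appears twice with identical attributes.
df = pd.DataFrame({
    'discente': ['a1', 'a1', 'b2'],
    'ano_ingresso': [2009, 2009, 2010],
    'status': ['CONCLUÍDO', 'CONCLUÍDO', 'ATIVO'],
})

# Number of distinct values each student has in each attribute.
valores_por_discente = df.groupby('discente')[['ano_ingresso', 'status']].nunique()
consistente = bool((valores_por_discente <= 1).all().all())

# One row per student, then a left merge that fails loudly on duplicate keys.
atributos = df.drop_duplicates(subset=['discente'])[['discente', 'ano_ingresso', 'status']]
tabela = pd.DataFrame({'discente': ['a1', 'b2']})
tabela = tabela.merge(atributos, on='discente', how='left', validate='one_to_one')
print(consistente)
print(tabela)
```

The `validate='one_to_one'` argument makes `merge` raise an error if either side has repeated `discente` values, which guards against silently duplicated rows.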

Finding the last term in which each student took a course.

In [257]:
ultimo_periodo = df_dados_filtrado.groupby('discente')['ano'].max().reset_index()
ultimo_periodo.rename(columns={'ano': 'ultimo_periodo'}, inplace=True)
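Taking the maximum works here because the values in `ano` appear to concatenate the year with a one-digit term code (for example `20091`), so numeric order follows the year. Assuming that encoding holds for every row, the year and the term code could be split apart as sketched below. The data and the column names `ultimo_ano` and `codigo_periodo` are hypothetical.

```python
import pandas as pd

# Hypothetical last-term values in the year + one-digit code format.
ultimo = pd.DataFrame({
    'discente': ['a1', 'b2'],
    'ultimo_periodo': [20221, 20206],
})

# Integer division drops the last digit (year); modulo keeps it (term code).
ultimo['ultimo_ano'] = ultimo['ultimo_periodo'] // 10
ultimo['codigo_periodo'] = ultimo['ultimo_periodo'] % 10
print(ultimo)
```

What each term code stands for is not stated in this notebook, so the sketch only separates the digits and does not interpret them.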

Merging the last-term information into `tabela_final`.

In [258]:
tabela_final = tabela_final.merge(ultimo_periodo, on='discente', how='left')
tabela_final

Unnamed: 0,discente,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,ARQUITETURA DE COMPUTADORES,BANCO DE DADOS,CONTABILIDADE E CUSTOS,CÁLCULO DIFERENCIAL E INTEGRAL,EMPREENDEDORISMO EM INFORMÁTICA,ENGENHARIA DE SOFTWARE I,ENGENHARIA DE SOFTWARE II,ESTRUTURA DE DADOS,...,REDES DE COMPUTADORES,SISTEMAS DE APOIO À DECISÃO,SISTEMAS OPERACIONAIS,TEORIA GERAL DA ADMINISTRAÇÃO,TEORIA GERAL DOS SISTEMAS,ÁLGEBRA LINEAR,ÉTICA,ano_ingresso,status,ultimo_periodo
0,005c14d7c07bf7980b60c703f99c5ee7,1.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,...,1.0,0.0,1.0,1.0,1.0,2.0,1.0,2018,CANCELADO,20221
1,0107fd69d8cd7e3d30dede96fb68bfe5,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2011,CANCELADO,20121
2,014789363f7940922e71e710ee9d22bc,2.0,3.0,1.0,1.0,0.0,1.0,1.0,1.0,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2016,CONCLUÍDO,20206
3,014f0dec46fe7a9c5836527662e1df10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2020,CANCELADO,20206
4,0168075add041f9eb4bba46d6fdb6387,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2018,CANCELADO,20181
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,fe802d8d85de6f842749468401d1146c,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2022,ATIVO,20222
668,fe87dfa176a74fc10a5cb701b9fb5dd4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2016,CONCLUÍDO,20206
669,fec9ed6026d55ecdf514c640312c3d08,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,1.0,1.0,0.0,2020,ATIVO,20222
670,ff56f2c5048dae0797fd3e851572b80c,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,2014,CONCLUÍDO,20192


Defining the new column order.

In [259]:
colunas_ordenadas = ['discente', 'ano_ingresso', 'ultimo_periodo', 'status', 
                    'ALGORITMOS E LÓGICA DE PROGRAMAÇÃO',
                    'INTRODUÇÃO À INFORMÁTICA',
                    'FUNDAMENTOS DE MATEMÁTICA',
                    'LÓGICA',
                    'TEORIA GERAL DA ADMINISTRAÇÃO',
                    'PROGRAMAÇÃO',
                    'CÁLCULO DIFERENCIAL E INTEGRAL',
                    'TEORIA GERAL DOS SISTEMAS',
                    'PROGRAMAÇÃO ORIENTADA A OBJETOS I',
                    'ESTRUTURA DE DADOS',
                    'ÁLGEBRA LINEAR',
                    'ORGANIZAÇÃO, SISTEMAS E MÉTODOS',
                    'FUNDAMENTOS DE SISTEMAS DE INFORMAÇÃO',
                    'PROGRAMAÇÃO WEB',
                    'ARQUITETURA DE COMPUTADORES',
                    'PROBABILIDADE E ESTATÍSTICA',
                    'BANCO DE DADOS',
                    'ENGENHARIA DE SOFTWARE I',
                    'PROGRAMAÇÃO ORIENTADA A OBJETOS II',
                    'SISTEMAS OPERACIONAIS',
                    'PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS',
                    'ENGENHARIA DE SOFTWARE II',
                    'REDES DE COMPUTADORES',
                    'CONTABILIDADE E CUSTOS',
                    'EMPREENDEDORISMO EM INFORMÁTICA',
                    'GESTÃO DE PROJETO DE SOFTWARE',
                    'PROGRAMAÇÃO VISUAL',
                    'MATEMÁTICA FINANCEIRA',
                    'SISTEMAS DE APOIO À DECISÃO',
                    'ÉTICA']


Reordering the columns.

In [260]:
tabela_final = tabela_final[colunas_ordenadas]
tabela_final

Unnamed: 0,discente,ano_ingresso,ultimo_periodo,status,ALGORITMOS E LÓGICA DE PROGRAMAÇÃO,INTRODUÇÃO À INFORMÁTICA,FUNDAMENTOS DE MATEMÁTICA,LÓGICA,TEORIA GERAL DA ADMINISTRAÇÃO,PROGRAMAÇÃO,...,PROJETO E ADMINISTRAÇÃO DE BANCO DE DADOS,ENGENHARIA DE SOFTWARE II,REDES DE COMPUTADORES,CONTABILIDADE E CUSTOS,EMPREENDEDORISMO EM INFORMÁTICA,GESTÃO DE PROJETO DE SOFTWARE,PROGRAMAÇÃO VISUAL,MATEMÁTICA FINANCEIRA,SISTEMAS DE APOIO À DECISÃO,ÉTICA
0,005c14d7c07bf7980b60c703f99c5ee7,2018,20221,CANCELADO,1.0,2.0,2.0,2.0,1.0,2.0,...,2.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0107fd69d8cd7e3d30dede96fb68bfe5,2011,20121,CANCELADO,2.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,014789363f7940922e71e710ee9d22bc,2016,20206,CONCLUÍDO,2.0,2.0,3.0,2.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0
3,014f0dec46fe7a9c5836527662e1df10,2020,20206,CANCELADO,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0168075add041f9eb4bba46d6fdb6387,2018,20181,CANCELADO,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,fe802d8d85de6f842749468401d1146c,2022,20222,ATIVO,2.0,1.0,2.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,fe87dfa176a74fc10a5cb701b9fb5dd4,2016,20206,CONCLUÍDO,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
669,fec9ed6026d55ecdf514c640312c3d08,2020,20222,ATIVO,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,ff56f2c5048dae0797fd3e851572b80c,2014,20192,CONCLUÍDO,2.0,4.0,4.0,4.0,1.0,3.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
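Selecting with a list of names, as done above, raises a `KeyError` if any listed column is absent, for instance if no student in the filtered data took one of the courses. A more tolerant variant, sketched here on made-up data with hypothetical names, lists the missing columns and uses `reindex` to create them filled with 0.

```python
import pandas as pd

# Hypothetical table in which nobody took LÓGICA, so that column is absent.
tabela = pd.DataFrame({'discente': ['a1'], 'ÉTICA': [1], 'status': ['ATIVO']})
ordem = ['discente', 'status', 'LÓGICA', 'ÉTICA']

# Columns requested in the new order but not present in the table.
faltantes = [c for c in ordem if c not in tabela.columns]

# reindex reorders existing columns and adds the missing ones with fill_value.
tabela_ordenada = tabela.reindex(columns=ordem, fill_value=0)
print(faltantes)
print(tabela_ordenada)
```

Whether a missing course should become a column of zeros or stop the notebook with an error is a design choice; the strict selection is the safer default when every column is expected to exist.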


Saving the DataFrame as a CSV file with ';' as the separator. `csv.QUOTE_NONNUMERIC` puts quotes around every non-numeric field, so numeric values are written unquoted.

In [261]:
tabela_final.to_csv('tabela_final.csv', index=False, sep=';', quoting=csv.QUOTE_NONNUMERIC)
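To show what this quoting mode produces, the sketch below writes a one-row table to an in-memory buffer instead of a file and then reads it back. The data and the names `tabela`, `buffer`, `linhas` and `lido` are hypothetical.

```python
import csv
import io

import pandas as pd

# Hypothetical one-row table with two text columns and one numeric column.
tabela = pd.DataFrame({'discente': ['a1'], 'status': ['ATIVO'], 'LÓGICA': [2.0]})

# Same options as above, but writing to a string buffer.
buffer = io.StringIO()
tabela.to_csv(buffer, index=False, sep=';', quoting=csv.QUOTE_NONNUMERIC)
linhas = buffer.getvalue().splitlines()
print(linhas)

# Reading the text back with the same separator recovers the values.
lido = pd.read_csv(io.StringIO(buffer.getvalue()), sep=';')
print(lido)
```

The header and the text fields come out quoted, while the count stays a bare number.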