# Environment setup

Library imports:
- NumPy: array manipulation
- Pandas: DataFrame manipulation
- Matplotlib: data visualization
- Seaborn: data visualization

Plus a few others (Plotly, SciPy, and scikit-learn) used for visualization and unsupervised learning.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Reading and viewing the dataset

Read the file "dataset_clean.csv" with the Pandas function "read_csv" and store the result in the variable "df". The "sep" parameter tells Pandas which column separator the file uses, and the "encoding" parameter tells it the file's character encoding. Setting "low_memory" to False makes Pandas infer column types from the whole file rather than chunk by chunk, which avoids mixed-type columns.

In [3]:
csv_url = "https://github.com/viniciusgugelmin/data-science-2/blob/master/projects/cursos-prouni/data/dataset_clean.csv?raw=true"

df = pd.read_csv(csv_url, sep=';', encoding='utf-8', low_memory=False)
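The effect of "sep" can be checked on a tiny in-memory sample. This is a minimal sketch: the two rows and the reduced column set are hypothetical, chosen only to mirror the semicolon-separated layout, and "encoding" is omitted because the text is already decoded. It also shows that Pandas names a column with a blank header "Unnamed: 0", which is one plausible origin of that column in the output of this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout
sample = io.StringIO(
    ";grau;turno;mensalidade\n"
    "0;Bacharelado;Integral;9999.99\n"
    "1;Bacharelado;Noturno;9836.4\n"
)

# With sep=";" each field becomes its own column;
# the default separator (",") would not split these lines at all
df_sample = pd.read_csv(sample, sep=";")

# A blank header field is given the name "Unnamed: <position>"
columns = list(df_sample.columns)
print(columns)
print(df_sample.dtypes)
```

If the first column is only a saved row index, passing "index_col=0" to "read_csv" would use it as the index instead of keeping it as a data column.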

Display the first 5 rows of the DataFrame with the Pandas function "head" to check that the file loaded correctly and to get a sense of its contents.

In [4]:
df.head()

Unnamed: 0,grau,turno,mensalidade,bolsa_integral_cotas,bolsa_integral_ampla,bolsa_parcial_cotas,bolsa_parcial_ampla,curso_id,curso_busca,cidade_busca,uf_busca,cidade_filtro,universidade_nome,campus_nome,campus_id,nome,nota_integral_ampla,nota_integral_cotas,nota_parcial_ampla,nota_parcial_cotas
0,Bacharelado,Integral,9999.99,15.0,14.0,0.0,0.0,706710394154,Medicina,Campo Grande,MS,NTAwMjAwNDAyNzA0,Universidade Anhanguera - UNIDERP,CAMPO GRANDE - SEDE - Miguel Couto,706710,Medicina,740.22,726.46,0.0,0.0
1,Bacharelado,Noturno,9836.4,1.0,0.0,0.0,0.0,104191210567043,Enfermagem,Crateus,CE,MjMwNDAxODA0MTAz,Faculdade Princesa do Oeste - FPO,UNIDADE SEDE - São Vicente,1041912,Enfermagem,663.36,0.0,0.0,0.0
2,Bacharelado,Integral,9715.61,2.0,5.0,6.0,10.0,1002328574024,Medicina,Sao Paulo,SP,MzUxNTA2MTUwMzA4,Universidade Cidade de São Paulo - UNICID,UNIVERSIDADE CIDADE DE SÃO PAULO - UNICID - SE...,1002328,Medicina,739.62,738.08,738.96,718.64
3,Bacharelado,Noturno,9689.34,3.0,2.0,0.0,0.0,104191212798093,Psicologia,Crateus,CE,MjMwNDAxODA0MTAz,Faculdade Princesa do Oeste - FPO,UNIDADE SEDE - São Vicente,1041912,Psicologia,651.0,652.22,0.0,0.0
4,Bacharelado,Integral,9674.34,4.0,1.0,5.0,2.0,65899611932754,Medicina,Rio Branco,AC,MTIwMjAwNDAwNDAx,Faculdade Barão do Rio Branco - FAB,CAMPUS - RIO BRANCO - JARDIM EUROPA II - Jard...,658996,Medicina,758.32,723.94,734.92,711.26


# Cleaning and value replacement

We start by dropping columns that are not needed for learning, then use LabelEncoder to replace text values with numeric codes.

In [13]:
le = LabelEncoder()
df_new = df.copy()

# Drop the column that is not needed for learning
df_new = df_new.drop(['cidade_filtro'], axis=1)

# Replace each text column with integer codes (the encoder is refit per column)
text_columns = ["grau", "turno", "curso_busca", "cidade_busca", "uf_busca",
                "universidade_nome", "campus_nome", "nome"]
for column in text_columns:
    df_new[column] = le.fit_transform(df_new[column].values)

df_new

Unnamed: 0,grau,turno,mensalidade,bolsa_integral_cotas,bolsa_integral_ampla,bolsa_parcial_cotas,bolsa_parcial_ampla,curso_id,curso_busca,cidade_busca,uf_busca,universidade_nome,campus_nome,campus_id,nome,nota_integral_ampla,nota_integral_cotas,nota_parcial_ampla,nota_parcial_cotas
0,0,1,9999.99,15.0,14.0,0.0,0.0,706710394154,210,207,11,1230,371,706710,210,740.22,726.46,0.00,0.00
1,0,3,9836.40,1.0,0.0,0.0,0.0,104191210567043,79,314,5,588,4574,1041912,79,663.36,0.00,0.00,0.00
2,0,1,9715.61,2.0,5.0,6.0,10.0,1002328574024,210,1006,25,1242,4678,1002328,210,739.62,738.08,738.96,718.64
3,0,3,9689.34,3.0,2.0,0.0,0.0,104191212798093,248,314,5,588,4574,1041912,248,651.00,652.22,0.00,0.00
4,0,1,9674.34,4.0,1.0,5.0,2.0,65899611932754,210,881,0,254,568,658996,210,758.32,723.94,734.92,711.26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41442,1,0,149.00,1.0,0.0,0.0,0.0,994312865605,281,751,26,1251,2697,9943,281,502.36,0.00,0.00,0.00
41443,2,0,144.00,1.0,2.0,2.0,5.0,65868712869275,205,1020,25,988,4408,658687,205,533.34,450.00,450.00,450.00
41444,2,4,139.00,1.0,0.0,0.0,0.0,1056445674232,169,154,6,73,1236,1056445,169,580.76,0.00,0.00,0.00
41445,0,0,139.00,1.0,0.0,0.0,0.0,96781210935,270,1147,7,1251,4753,9678,270,548.26,0.00,0.00,0.00
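To make the encoding step easier to inspect, the sketch below applies LabelEncoder to a small toy frame. The values are hypothetical, not taken from the dataset, and the "encoders" dictionary is an illustrative choice rather than part of the original notebook. Keeping one fitted encoder per column preserves each mapping, so codes can be converted back to labels later; a single encoder that is refit for every column only remembers the last one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "turno": ["Noturno", "Integral", "Noturno", "Matutino"],
    "uf_busca": ["SP", "CE", "SP", "AC"],
})

# One encoder per column, so every mapping stays available
encoders = {}
encoded = toy.copy()
for col in ["turno", "uf_busca"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# classes_ holds the sorted labels; each code is a position in that array
turno_classes = list(encoders["turno"].classes_)
turno_codes = encoded["turno"].tolist()

# Map the codes back to the original labels
restored = encoders["turno"].inverse_transform(encoded["turno"]).tolist()

print(turno_classes)
print(turno_codes)
print(restored)
```

The codes follow the alphabetical order of the labels and carry no real ranking, so distance-based methods will treat, for example, code 2 as closer to code 1 than to code 0 even though the categories themselves are unordered.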
