# Analysis of Brazilian Higher Education Data
### (Part 1 - Data Extraction)

<b>Date:</b> 01/09/2020

<b>Author:</b> Cardoso, Thiago (thiago.guimaraesdf10@gmail.com)

<b>Project Description:</b> This study seeks to organize and develop a basic analysis of the higher education data available for Brazilian public and private institutions. This is my first data science/exploration project using Python, so the analysis and code showcased are far from being a repository of best practices. Any suggestions to improve this project are appreciated.

For better organization and comprehension, this study is divided into 3 parts:

- <b>Part 1 - Data Extraction</b>: Description of the datasets used in the project, their sources, and how to download them. This notebook will be updated if future analysis demands new datasets.
- <b>Part 2 - Data Cleaning</b>: Code used to clean data and standardize column names. The main goal is to have comparable panel data, with annual information for courses, students, and institutions in the last decade.
- <b>Part 3 - Data Analysis</b>: The analysis is subdivided into 6 sections. The first section analyzes the historical panorama of higher education growth in Brazil, especially in the last 10 years.

<b>This notebook is related to Part 1</b>

<b>Important issues:</b>

- English is not my native language. Sorry for the mistakes;
- Many (maybe most) code lines lack consistency, performance and/or efficiency. I did my best to balance productivity and code quality. Any suggestions to improve the code are welcome;
- The Analysis and Data Cleaning only scratch the surface of the extremely rich data used in this study. Any author seeking to further this study should feel free to contact me. I can also help with any translation issues and provide information on additional sources of data in Brazil.

## Library Import

In [4]:

import os
import shutil
import time
from os import path

import webbrowser
import xlsxwriter
import zipfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import folium

%matplotlib inline

## Brazilian Higher Education Census - 1995 to 2018

Data from Brazilian public and private higher education institutions (HEI) are collected and organized by the <a href="http://inep.gov.br/web/guest/about-inep">"National Institute for Educational Studies and Research Anísio Teixeira (INEP)"</a>. Since 2009, INEP has published the Higher Education Census microdata at student, teacher, course and institution level. Here is a brief description of the Higher Education Census provided by INEP:

    Annually, this initiative collects data on higher education in the country, including graduate and sequential courses – both attendance and distance learning courses, providing a "radiography" of this educational level.

    Higher education institutions fill out the census online forms and, based on this data, provide policy makers with an overview of educational policy trends. By collecting data regarding the number of enrollment and graduates, candidates to university entrance; information on faculty – by qualifications and contract nature – as well as on administrative and support staff; financial data and infrastructure, this initiative provides valuable information about an educational level that is perceived to be in a process of expanding and diversifying.

From 1995 to 2008, INEP published data only at course and institution level. In this study, we focus on extracting, cleaning, and analyzing data from 2009–2018. Further work is needed to organize the 1995 to 2008 data.

The code below downloads all Higher Education Census files, from 1995 to 2018, from the INEP website, one at a time. Each URL is opened in the default browser, and the loop waits until the file appears in the Downloads folder before requesting the next one.

In [None]:
# Data extraction (Higher Education Census)

# File names are in different formats:
#
# - 1995 to 2008 and 2017 to 2018 files are named microdados_educacao_superior_YEAR.zip
# - 2009 to 2016 files are named microdados_censo_superior_YEAR.zip
#
# At this date (16/08/2020) all files can be found in http://download.inep.gov.br/microdados

# range() excludes its upper bound, so 2009 is needed to include 2008
years = list(range(1995, 2009))
years.extend([2017, 2018])

for y in years:
    file_name = "microdados_educacao_superior_" + str(y) + ".zip"

    # Open the URL in the default browser, which saves the file to Downloads
    webbrowser.open("http://download.inep.gov.br/microdados/" + file_name)

    # Wait until the file shows up before requesting the next one
    while not os.path.exists("C:/Users/Thiago/Downloads/" + file_name):
        time.sleep(1)

# range(2009, 2017) covers 2009 to 2016
for y in range(2009, 2017):
    file_name = "microdados_censo_superior_" + str(y) + ".zip"

    webbrowser.open("http://download.inep.gov.br/microdados/" + file_name)

    while not os.path.exists("C:/Users/Thiago/Downloads/" + file_name):
        time.sleep(1)
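
The naming rule described in the comments above can be captured in a small helper, which makes the year boundaries explicit. This is only a minimal sketch: `census_zip_name` and `census_zip_url` are hypothetical names introduced here, and the base URL is the one quoted in the cell above, which may have changed since then.

```python
BASE_URL = "http://download.inep.gov.br/microdados"


def census_zip_name(year: int) -> str:
    """Return the zip file name used for a given census year (1995 to 2018)."""
    if not 1995 <= year <= 2018:
        raise ValueError(f"no census file known for year {year}")
    # 2009 to 2016 use the "censo" prefix; every other year uses "educacao"
    prefix = "censo" if 2009 <= year <= 2016 else "educacao"
    return f"microdados_{prefix}_superior_{year}.zip"


def census_zip_url(year: int) -> str:
    """Build the download URL for a given census year."""
    return f"{BASE_URL}/{census_zip_name(year)}"


# One URL per year, 1995 to 2018 inclusive
census_urls = {year: census_zip_url(year) for year in range(1995, 2019)}
```

Keeping the rule in one function avoids repeating the year ranges in each loop, which is where off-by-one mistakes with `range()` tend to appear.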

## National Assessment of Student Achievement (ENADE) - 2004 to 2018

ENADE is an external evaluation applied annually to assess the learning of undergraduate students in their final year. Each course is assessed every three years, because programs are grouped into three representative areas and one group is assessed each year.

ENADE microdata provides anonymized student-level information: each student's answers and results in the content assessment, as well as their answers to a questionnaire on socioeconomic background and perceived course quality.

I have not worked on the ENADE data yet, but I have already extracted it for future analysis. The code below downloads all ENADE files, from 2004 to 2018, from the INEP website, one at a time.

In [None]:
# Data extraction (ENADE)

# 2004 to 2015 and 2018 follow the same naming pattern;
# 2016 and 2017 have irregular names and are requested separately below
years = list(range(2004, 2016))
years.extend([2018])

for y in years:
    file_name = "microdados_enade_" + str(y) + ".zip"
    webbrowser.open("http://download.inep.gov.br/microdados/Enade_Microdados/" + file_name)
    # Wait until the file shows up in Downloads before requesting the next one
    while not os.path.exists("C:/Users/Thiago/Downloads/" + file_name):
        time.sleep(1)

webbrowser.open("http://download.inep.gov.br/microdados/Enade_Microdados/microdados_enade_2016_versao_28052018.zip")
webbrowser.open("http://download.inep.gov.br/microdados/Enade_Microdados/microdados_Enade_2017_portal_2018.10.09.zip")
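
One possible way to handle the two irregular ENADE file names is a lookup table with a fallback to the regular pattern. The sketch below only builds the URLs and does not download anything. `enade_zip_name` and `ENADE_IRREGULAR` are hypothetical names, and the file names are copied from the cell above.

```python
ENADE_BASE_URL = "http://download.inep.gov.br/microdados/Enade_Microdados"

# Years whose file names break the regular pattern (copied from the cell above)
ENADE_IRREGULAR = {
    2016: "microdados_enade_2016_versao_28052018.zip",
    2017: "microdados_Enade_2017_portal_2018.10.09.zip",
}


def enade_zip_name(year: int) -> str:
    """Return the ENADE zip file name for a given year (2004 to 2018)."""
    if not 2004 <= year <= 2018:
        raise ValueError(f"no ENADE file known for year {year}")
    # Fall back to the regular pattern when the year is not a special case
    return ENADE_IRREGULAR.get(year, f"microdados_enade_{year}.zip")


# One URL per year, 2004 to 2018 inclusive
enade_urls = [f"{ENADE_BASE_URL}/{enade_zip_name(y)}" for y in range(2004, 2019)]
```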

## Preliminary Course Concept (CPC)

CPC is an index, ranging from 1 to 5, calculated by INEP to assess undergraduate course quality. In this work, only 2009 and 2018 CPC data is used. 

The 2018 CPC is calculated according to the following aspects and weights:

- 20% - Student performance in ENADE,
- 35% - Difference between expected and observed performance in ENADE (expected performance is calculated considering performance in courses with similar student backgrounds),
- 7.5% - Proportion of teachers with an MA degree,
- 15% - Proportion of teachers with a Ph.D. degree,
- 7.5% - Teachers' work regime,
- 15% - Students' perception regarding infrastructure, pedagogical organization, and opportunities to improve academic and professional training.

The 2009 CPC uses slightly different weights for some aspects.
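
As a rough illustration of how the 2018 weights listed above combine, the sketch below computes a weighted sum of six component scores. The key names and the example values are made up for illustration, and the function returns only the weighted sum: any standardization of the components or conversion to the final 1 to 5 scale that INEP applies is not reproduced here.

```python
import math

# 2018 weights as listed above (key names are hypothetical)
CPC_2018_WEIGHTS = {
    "enade_performance": 0.20,
    "expected_vs_observed": 0.35,
    "teachers_ma": 0.075,
    "teachers_phd": 0.15,
    "work_regime": 0.075,
    "perception": 0.15,  # infrastructure, pedagogy and training opportunities
}

# The weights should cover 100% of the index
assert math.isclose(sum(CPC_2018_WEIGHTS.values()), 1.0)


def weighted_cpc(components: dict) -> float:
    """Weighted sum of component scores, all assumed to be on the same scale."""
    if components.keys() != CPC_2018_WEIGHTS.keys():
        raise ValueError("components must match the weight keys exactly")
    return sum(CPC_2018_WEIGHTS[name] * components[name] for name in CPC_2018_WEIGHTS)


# Made-up component scores for a single course
example_course = {
    "enade_performance": 3.0,
    "expected_vs_observed": 2.0,
    "teachers_ma": 4.0,
    "teachers_phd": 1.0,
    "work_regime": 5.0,
    "perception": 3.0,
}
example_score = weighted_cpc(example_course)
```

Because the difference between expected and observed performance carries 35% of the weight, it moves the result more than any other single component.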

CPC data from 2010 to 2017 can be found <a href='http://inep.gov.br/web/guest/educacao-superior/indicadores-de-qualidade/resultados'>here</a>

Further information for the CPC can be found <a href='http://inep.gov.br/web/guest/educacao-superior/indicadores-de-qualidade/indice-geral-de-cursos-igc-'>here</a>

In [None]:
# CPC 2009
webbrowser.open("http://download.inep.gov.br/download/enade/2009/cpc_decomposto_2009.xls")
# Wait until the file shows up in the Downloads folder
while not os.path.exists("C:/Users/Thiago/Downloads/cpc_decomposto_2009.xls"):
    time.sleep(1)

# CPC 2018
webbrowser.open("http://download.inep.gov.br/educacao_superior/igc_cpc/2018/portal_CPC_edicao2018.xlsx")
# Wait until the file shows up in the Downloads folder
while not os.path.exists("C:/Users/Thiago/Downloads/portal_CPC_edicao2018.xlsx"):
    time.sleep(1)

## General Course Index (IGC)

The IGC provides a quality index for higher education institutions based on the average CPC of the last three years, the distribution of students between undergrad and graduate levels, and the average evaluation of graduate courses.

Further information about the IGC can be found <a href="http://inep.gov.br/web/guest/educacao-superior/indicadores-de-qualidade/indice-geral-de-cursos-igc-">here</a> (in Portuguese).

Because IGC file names differ a lot, I wrote the laziest code possible to download all files.

In [None]:
webbrowser.open("http://download.inep.gov.br/download/areaigc/Downloads/igc_2009.xls")
webbrowser.open("http://download.inep.gov.br/educacao_superior/enade/igc/tabela_igc_2010_16_10_2012.xls" )
webbrowser.open("http://download.inep.gov.br/educacao_superior/enade/igc/tabela_igc_2011_15_01_2013.xls")
webbrowser.open("http://download.inep.gov.br/educacao_superior/enade/igc/tabela_igc_2012_30012014.xls" )
webbrowser.open("http://download.inep.gov.br/educacao_superior/enade/igc/2013/igc_2013_09022015.xlsx")
webbrowser.open("http://download.inep.gov.br/educacao_superior/enade/igc/2014/igc_2014.xlsx")
webbrowser.open("http://download.inep.gov.br/educacao_superior/indicadores/legislacao/2017/igc_2015_portal_04_12_2017.xlsx")
webbrowser.open("http://download.inep.gov.br/educacao_superior/igc_cpc/2016/resultado_igc_2016_11042018.xlsx")
webbrowser.open("http://download.inep.gov.br/educacao_superior/igc_cpc/2018/resultado_igc_2017.xlsx")
webbrowser.open("http://download.inep.gov.br/educacao_superior/igc_cpc/2018/portal_IGC_edicao2018.xlsx")
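
Unlike the census and ENADE cells, the calls above do not wait for each download to finish. A polling helper with a timeout could cover that case; the sketch below is one possible version. `wait_for_file` is a hypothetical name, and the default timeout and polling interval are arbitrary assumptions. It only checks that the file exists, so a browser that writes a partial file under its final name could still fool it.

```python
import os
import time


def wait_for_file(file_path, timeout=300.0, poll_interval=1.0):
    """Return True once file_path exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while not os.path.isfile(file_path):
        if time.monotonic() >= deadline:
            return False  # give up instead of looping forever
        time.sleep(poll_interval)
    return True
```

A call such as `wait_for_file("C:/Users/Thiago/Downloads/igc_2009.xls")` after the matching `webbrowser.open` would then stop the cell from hanging indefinitely if a link is broken.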

## Data Organization

Manual work was needed to organize and standardize file names. There is a huge variety in the folder paths inside the zip files, file names, and formats across years.

In this manual process of data organization, the following file names were chosen:

* For Higher Education Census (HEC) data:
  * student level files: alunos_'year'
  * course level files: cursos_'year'
  * university level files: ies_'year'

* For IGC data: igc_'year'

* For CPC data: cpc_'year'

All files are placed in 'data/csv_bases'.

The code below automates the first stage of this process for the HEC files. Namely: i) create a data/csv_bases folder in the project folder; ii) standardize HEC file names; iii) transfer HEC files to data/csv_bases in the project folder; iv) extract files from .zip

In [None]:
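# Create new directory to keep the Project files

# Note: this variable shadows the `path` name imported from os above;
# calls written as os.path.<function> are unaffected
path = 'C:/Users/Thiago/Documents/DataScience'

os.chdir(path)

# makedirs also creates the intermediate folders, and exist_ok=True
# makes the cell safe to run more than once
os.makedirs("Projeto_A/data/csv_bases", exist_ok=True)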

In [None]:
# Rename files to keep same name pattern

downloads_path = "C:/Users/Thiago/Downloads"

os.chdir(downloads_path)

# 2009 to 2016 files use the "censo" prefix (range() excludes its upper bound)
for y in range(2009, 2017):
    os.rename("microdados_censo_superior_" + str(y) + ".zip", "microdados_educacao_superior_" + str(y) + ".zip")

# Copy files to the project directory

files = []

for y in range(1995, 2019):
    files.append("microdados_educacao_superior_" + str(y) + ".zip")

for f in files:
    shutil.copy(f, path + '/Projeto_A/data/csv_bases')

# Delete files from Downloads folder

for y in range(1995, 2019):
    os.remove("microdados_educacao_superior_" + str(y) + ".zip")

# Extract zip files

for y in range(1995, 2019):
    with zipfile.ZipFile(path + '/Projeto_A/data/csv_bases/microdados_educacao_superior_' + str(y) + '.zip', 'r') as zip_ref:
        zip_ref.extractall(path + '/Projeto_A/data/csv_bases')
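
The same rename, move and extract steps could also be written with `pathlib` so that they are safe to rerun. The sketch below is an alternative under assumptions, not the code used in this project: `organize_census_zips` is a hypothetical helper, and it takes the source and destination folders as arguments instead of the hard-coded paths above.

```python
import re
import shutil
import zipfile
from pathlib import Path

# Matches the 2009 to 2016 naming pattern and captures the year
CENSO_PATTERN = re.compile(r"microdados_censo_superior_(\d{4})\.zip")


def organize_census_zips(downloads_dir, target_dir):
    """Move census zips into target_dir under one naming pattern and extract them."""
    downloads_dir, target_dir = Path(downloads_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # safe to run more than once

    moved = []
    for zip_path in sorted(downloads_dir.glob("microdados_*_superior_*.zip")):
        match = CENSO_PATTERN.fullmatch(zip_path.name)
        # Rename "censo" files to the "educacao" pattern; keep other names as they are
        new_name = (
            f"microdados_educacao_superior_{match.group(1)}.zip"
            if match
            else zip_path.name
        )
        destination = target_dir / new_name
        shutil.move(str(zip_path), str(destination))
        with zipfile.ZipFile(destination) as archive:
            archive.extractall(target_dir)
        moved.append(destination)
    return moved
```

Because the function works on whatever zips it finds, it does not fail when a year is missing from the Downloads folder, and the returned list shows which files were actually processed.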
