### LINKS FOR REFERENCE

https://biblioteca.ibge.gov.br/visualizacao/periodicos/308/cd_2000_v7.pdf

Site used by the SVS bulletin:
"Mortality rates were calculated based on the population projections of the Instituto Brasileiro de Geografia e Estatística (IBGE), using the age structure of the population projection for 2010 as the standard population."
https://www.ibge.gov.br/estatisticas/sociais/populacao/9662-censo-demografico-2010.html?edicao=9673
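
The quoted method is direct age standardization: each age-specific rate is weighted by that age group's share of a standard population. The sketch below is only an illustration of the formula. The function name, the three age bands, and every number in it are hypothetical placeholders, not IBGE or SVS figures.

```python
def direct_age_standardized_rate(deaths, population, standard_population, per=100_000):
    """Weight each age-specific rate by the standard population's age share."""
    total_standard = sum(standard_population)
    rate = 0.0
    for d, n, s in zip(deaths, population, standard_population):
        rate += (d / n) * (s / total_standard)
    return rate * per


# Hypothetical values for three age bands (placeholders, not real data)
deaths = [10, 50, 200]
population = [100_000, 80_000, 20_000]
standard_population = [50_000, 35_000, 15_000]  # stands in for the 2010 age structure

standardized = direct_age_standardized_rate(deaths, population, standard_population)
crude = sum(deaths) / sum(population) * 100_000

print(f"Crude rate per 100,000: {crude:.3f}")
print(f"Age-standardized rate per 100,000: {standardized:.3f}")
```

The crude and standardized rates differ because the hypothetical study population and the standard population have different age structures.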

Population census:

https://www.ibge.gov.br/estatisticas/sociais/populacao/9103-estimativas-de-populacao.html

https://sidra.ibge.gov.br/pesquisa/censo-demografico/demografico-2022/inicial


Population estimates:

https://sidra.ibge.gov.br/tabela/6579


#### Types of population count:

Census/Count: 1996, 2000, 2010, 2022

Estimates: 1997 to 1999, 2001 to 2009, 2011 to 2021
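
The year lists above can be encoded as a small lookup, which is handy when a later step needs to know whether a column holds counted or estimated values. This is a minimal sketch; the names `CENSUS_OR_COUNT_YEARS`, `ESTIMATE_YEARS`, and `population_source` are illustrative and not part of the notebook.

```python
# Years taken from the lists above
CENSUS_OR_COUNT_YEARS = {1996, 2000, 2010, 2022}
ESTIMATE_YEARS = (
    set(range(1997, 2000)) | set(range(2001, 2010)) | set(range(2011, 2022))
)


def population_source(year):
    """Return how the population figure for a given year was obtained."""
    if year in CENSUS_OR_COUNT_YEARS:
        return "census/count"
    if year in ESTIMATE_YEARS:
        return "estimate"
    raise ValueError(f"No population source recorded for {year}")


for year in (1996, 1998, 2010, 2021):
    print(year, population_source(year))
```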

In [9]:
# Imports
from IPython.display import display

import pandas as pd
import numpy as np

# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# OS and File imports
import os
from zipfile import ZipFile

In [10]:
dir_path = './'

In [11]:
# List all files in the working directory (dir_path)
all_files = os.listdir(dir_path)

# Filter for files that end with .zip extension
zip_files = [file for file in all_files if file.endswith('.zip')]
print('Zips: ', '\n', zip_files)

# Fail early if no .zip file was found (indexing an empty list would raise IndexError)
if not zip_files:
    raise FileNotFoundError("No .zip file found in the specified directory.")

# Use the first .zip file found
zip_file_name = zip_files[0]
print('Zip file name: ', '\n', zip_file_name)

# Dictionary to store data from each .xls file
xls_dict = {}

with ZipFile(zip_file_name, 'r') as z:
    # Find all .xls files in the zip
    xls_files = sorted([f for f in z.namelist() if f.endswith('.xls')])
    print('XLS files:', xls_files)
    
    # Iterate through each .xls file in the zip
    for xlsf in xls_files:
        # Open the .xls file within the zip
        with z.open(xlsf) as file:
            # Load the spreadsheet
            xls = pd.ExcelFile(file)

            # Load data from the second sheet
            data = pd.read_excel(xls, sheet_name=xls.sheet_names[1])
            # Skip the first row and the last two rows of the sheet
            data = data.iloc[1:-2]
            # Drop completely empty rows
            data = data.dropna(how='all').reset_index(drop=True)
            # Normalize location names: lowercase, no surrounding whitespace
            data.iloc[:, 0] = data.iloc[:, 0].str.lower().str.strip()

            xls_dict[xlsf] = data

            index_column = 'Localização'

            year = xlsf.split('-')[1].split('.')[0]
            data.columns = [index_column, year]
            data.set_index(index_column, inplace=True)

            # Optional: Print or check the first few rows of each file's data
            print(f"\nData from {xlsf}:")
            display(data.head(16))

Zips:  
 ['IBGE_97_98_99_data.zip']
Zip file name:  
 IBGE_97_98_99_data.zip
XLS files: ['estimativa_populacao-1997.xls', 'estimativa_populacao-1998.xls', 'estimativa_populacao-1999.xls']

Data from estimativa_populacao-1997.xls:


Localização,1997
brasil,159636413.0
região norte,11604158.0
rondônia,1255522.0
acre,500185.0
amazonas,2460602.0
roraima,254499.0
pará,5650681.0
amapá,401916.0
tocantins,1080753.0
região nordeste,45334385.0



Data from estimativa_populacao-1998.xls:


Localização,1998
brasil,161790311.0
região norte,11868725.0
rondônia,1276173.0
acre,514050.0
amazonas,2520684.0
roraima,260705.0
pará,5768476.0
amapá,420834.0
tocantins,1107803.0
região nordeste,45811342.0



Data from estimativa_populacao-1999.xls:


Localização,1999
brasil,163947554.0
região norte,12133705.0
rondônia,1296856.0
acre,527937.0
amazonas,2580860.0
roraima,266922.0
pará,5886454.0
amapá,439781.0
tocantins,1134895.0
região nordeste,46289042.0


In [12]:
for name, xls in xls_dict.items():
    # "Pernambuco" is misspelled as "Pernanbuco" in all three official IBGE files;
    # show the row before and after correcting the index
    print('"Pernambuco" not fixed:')
    display(xls[xls.index == 'pernanbuco'])
    xls.index = xls.index.str.replace('pernanbuco', 'pernambuco', regex=False)
    print('"Pernambuco" fixed:')
    display(xls[xls.index == 'Pernambuco'.lower()])
    print('------------------------------------------------------------------')

"Pernambuco" not fixed:


Localização,1997
pernanbuco,7466773.0


"Pernambuco" fixed:


Localização,1997
pernambuco,7466773.0


------------------------------------------------------------------
"Pernambuco" not fixed:


Localização,1998
pernanbuco,7523755.0


"Pernambuco" fixed:


Localização,1998
pernambuco,7523755.0


------------------------------------------------------------------
"Pernambuco" not fixed:


Localização,1999
pernanbuco,7580826.0


"Pernambuco" fixed:


Localização,1999
pernambuco,7580826.0


------------------------------------------------------------------


In [13]:
def capitalize_words(text):
    return ' '.join('-'.join(part.capitalize() for part in word.split('-')) for word in text.split())

concat_years_df = pd.concat([df for df in xls_dict.values()], axis=1)

# Removes the word "Região" for future comparison with other dataframes
concat_years_df.index = concat_years_df.index.str.replace("Região ".lower(), "", regex=False).map(capitalize_words)

concat_years_df.to_csv('estimated_pop-1997_1998_1999.csv', index=True)

# Display the transposed dataframe for easier inspection
display(concat_years_df.T.head())


Localização,Brasil,Norte,Rondônia,Acre,Amazonas,Roraima,Pará,Amapá,Tocantins,Nordeste,...,São Paulo,Sul,Paraná,Santa Catarina,Rio Grande Do Sul,Centro-Oeste,Mato Grosso Do Sul,Mato Grosso,Goiás,Distrito Federal
1997,159636413.0,11604158.0,1255522.0,500185.0,2460602.0,254499.0,5650681.0,401916.0,1080753.0,45334385.0,...,34752225.0,23862664.0,9142215.0,4958339.0,9762110.0,10769249.0,1964603.0,2287846.0,4639785.0,1877015.0
1998,161790311.0,11868725.0,1276173.0,514050.0,2520684.0,260705.0,5768476.0,420834.0,1107803.0,45811342.0,...,35284072.0,24154080.0,9258813.0,5028339.0,9866928.0,10994821.0,1995578.0,2331663.0,4744174.0,1923406.0
1999,163947554.0,12133705.0,1296856.0,527937.0,2580860.0,266922.0,5886454.0,439781.0,1134895.0,46289042.0,...,35816740.0,24445950.0,9375592.0,5098448.0,9971910.0,11220742.0,2026600.0,2375549.0,4848725.0,1969868.0
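
One quick sanity check on the parsed estimates is that the states of a region add up to the regional total. The sketch below does this for the Norte region in 1997, using the values printed above. It is written in plain Python so it runs without the notebook state, and the variable names are illustrative.

```python
# 1997 estimates for the states of the Norte region, copied from the output above
north_states_1997 = {
    "Rondônia": 1255522,
    "Acre": 500185,
    "Amazonas": 2460602,
    "Roraima": 254499,
    "Pará": 5650681,
    "Amapá": 401916,
    "Tocantins": 1080753,
}
north_total_1997 = 11604158  # "Norte" value for 1997 from the same table

state_sum = sum(north_states_1997.values())
difference = state_sum - north_total_1997

print(f"Sum of states: {state_sum}")
print(f"Regional total: {north_total_1997}")
print(f"Difference: {difference}")
```

A non-zero difference would suggest a dropped row or a misparsed value in the spreadsheet.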


In [14]:
from unidecode import unidecode

csv_dataframes = {}

# Iterate over each .csv file in the directory
for file_name in os.listdir(dir_path):
    if file_name.endswith('.csv'):
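        # Read the CSV file; sep=None makes pandas sniff the delimiter,
        # which requires the Python parsing engine
        df = pd.read_csv(os.path.join(dir_path, file_name), sep=None, engine='python')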

        # Get the name of the index column (assumes it's the first column)
        index_column = df.columns[0]

        # Set the index column, remove accents, and capitalize
        df.set_index(index_column, inplace=True)
        df.index = df.index.map(unidecode).str.lower().str.capitalize()
        
        # Sort the index alphabetically
        df.sort_index(inplace=True)
        
        # Store the sorted dataframe in the dictionary
        csv_dataframes[file_name] = df

# Concatenate all dataframes along columns
final_df = pd.concat(csv_dataframes.values(), axis=1)
final_df = final_df[sorted(final_df.columns)]

# Display the concatenated dataframe for verification
display(final_df.tail(15))

Unnamed: 0,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Paraiba,3305616,3331673.0,3353624.0,3375609.0,3444794,3468594,3494893,3518595,3568350,3595886,...,3914421,3943885,3972202,3999415,4025558,3996496,4018127,4039277,4059905,3974687
Parana,9003804,9142215.0,9258813.0,9375592.0,9564643,9694709,9798006,9906866,10135388,10261856,...,10997465,11081692,11163018,11242720,11320892,11348937,11433957,11516840,11597484,11444380
Pernambuco,7399071,7466773.0,7523755.0,7580826.0,7929154,8008207,8084667,8161862,8323911,8413593,...,9208550,9277727,9345173,9410336,9473266,9496294,9557071,9616621,9674793,9058931
Piaui,2673085,2695876.0,2714999.0,2734152.0,2843428,2873010,2898223,2923725,2974698,3006885,...,3184166,3194718,3204028,3212180,3219257,3264531,3273227,3281480,3289290,3271199
Rio de janeiro,13406308,13555657.0,13681410.0,13807358.0,14392106,14558545,14724475,14879118,15203750,15383407,...,16369179,16461173,16550024,16635996,16718956,17159960,17264943,17366189,17463349,16055174
Rio grande do norte,2558660,2594340.0,2624397.0,2654501.0,2777509,2815244,2852784,2888058,2962107,3003087,...,3373959,3408510,3442175,3474998,3507003,3479010,3506853,3534165,3560903,3302729
Rio grande do sul,9634688,9762110.0,9866928.0,9971910.0,10187842,10309819,10408540,10510992,10726063,10845087,...,11164043,11207274,11247972,11286500,11322895,11329605,11377239,11422973,11466630,10882965
Rondonia,1229306,1255522.0,1276173.0,1296856.0,1380952,1407886,1431777,1455907,1562085,1534594,...,1728214,1748531,1768204,1787279,1805788,1757589,1777225,1796460,1815278,1581196
Roraima,247131,254499.0,260705.0,266922.0,324397,337237,346871,357302,381896,391317,...,488072,496936,505665,514229,522636,576568,605761,631181,652713,636707
Santa catarina,4875244,4958339.0,5028339.0,5098448.0,5357864,5448736,5527707,5607233,5774178,5866568,...,6634254,6727148,6819190,6910553,7001161,7075494,7164788,7252502,7338473,7610361
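
Because the final table mixes counts and estimates, the step from an estimate year to a census year can be much larger than the step between two consecutive estimates. The sketch below computes year-over-year changes for Pernambuco from 1996 to 2000, using the values shown above, and flags large steps. The 2% threshold and the variable names are arbitrary assumptions chosen for illustration.

```python
# Pernambuco values for 1996-2000, copied from the table above
pernambuco = {
    1996: 7399071,
    1997: 7466773,
    1998: 7523755,
    1999: 7580826,
    2000: 7929154,
}

THRESHOLD = 0.02  # hypothetical cut-off for an unusually large step

years = sorted(pernambuco)
changes = {}
for previous, current in zip(years, years[1:]):
    # Relative change from the previous year to the current one
    changes[current] = pernambuco[current] / pernambuco[previous] - 1

flagged = [year for year, change in changes.items() if abs(change) > THRESHOLD]

for year, change in changes.items():
    print(f"{year}: {change:+.2%}")
print("Flagged years:", flagged)
```

Here only the step into 2000, a census year, exceeds the threshold.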


In [15]:
years_interval = f'{final_df.columns[0]}-{final_df.columns[-1]}'

dataframe_to_save_name = f'ibge_data-{years_interval}.csv'

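# Save the final concatenated dataframe to a new CSV file,
# creating the output directory first if it does not exist
output_dir = 'final_all_years_result_csv'
os.makedirs(output_dir, exist_ok=True)
final_df.to_csv(os.path.join(output_dir, dataframe_to_save_name), index=True)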