## Data preprocessing

This notebook shows how to clean and normalize the unit values (valores cuota) of the AFP pension funds in Chile. The pension fund data were downloaded from the [Superintendencia de Pensiones](https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFP.php) website.

In [33]:
import pandas as pd
import numpy as np
import os
import re
import requests

from datetime import datetime

## Download the updated data from the website

In [34]:
dataDir = '../data'

dataFileDF = pd.DataFrame([{'fondo':'A', 'URL':'https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=A&fecconf=20200331'},
                          {'fondo':'B', 'URL':'https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=B&fecconf=20200331'},
                          {'fondo':'C', 'URL':'https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=C&fecconf=20200331'},
                          {'fondo':'D', 'URL':'https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=D&fecconf=20200331'},
                          {'fondo':'E', 'URL':'https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=E&fecconf=20200331'}])

dataFileDF.head()

| | fondo | URL |
|---|---|---|
| 0 | A | https://www.spensiones.cl/apps/valoresCuotaFon... |
| 1 | B | https://www.spensiones.cl/apps/valoresCuotaFon... |
| 2 | C | https://www.spensiones.cl/apps/valoresCuotaFon... |
| 3 | D | https://www.spensiones.cl/apps/valoresCuotaFon... |
| 4 | E | https://www.spensiones.cl/apps/valoresCuotaFon... |
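
The five URLs differ only in the `tf` query parameter, so the same records could be generated from a template instead of being typed out. The following is a minimal sketch rather than the notebook's own code: `BASE_URL`, `build_fund_urls` and its parameter names are hypothetical, and the query values are simply copied from the URLs above.

```python
from urllib.parse import urlencode

# Endpoint and query values copied from the URLs used in this notebook.
BASE_URL = 'https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php'


def build_fund_urls(funds='ABCDE', year_start='0000', year_end='2020',
                    fecconf='20200331'):
    """Return one {'fondo', 'URL'} record per fund letter (hypothetical helper)."""
    records = []
    for fund in funds:
        # urlencode keeps the insertion order of the dict keys.
        query = urlencode({'aaaaini': year_start, 'aaaafin': year_end,
                           'tf': fund, 'fecconf': fecconf})
        records.append({'fondo': fund, 'URL': f'{BASE_URL}?{query}'})
    return records


records = build_fund_urls()
```

Passing `records` to `pd.DataFrame` would give the same two columns as `dataFileDF`.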


In [35]:
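# Make sure the download folder exists before writing into it
os.makedirs(dataDir, exist_ok=True)

for index, rowFile in dataFileDF.iterrows():

    url = rowFile['URL']
    print("Downloading URL: ", url)
    r = requests.get(url)
    if r.status_code == 200:
        # The file name is read from the Content-Disposition response header
        content = r.headers['content-disposition']

        # Raw string so that \. and \w reach the regex engine unchanged
        fileName = re.findall(r"filename=(.+\.\w{3})", content)[0]
        fileName = os.path.join(dataDir, fileName)

        with open(fileName, 'wb') as f:
            f.write(r.content)
        # Remember where each fund's file was saved for the parsing step
        dataFileDF.loc[index, 'fileName'] = fileName
    else:
        print("Download failed with status: ", r.status_code)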

Downloading URL:  https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=A&fecconf=20200331
Downloading URL:  https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=B&fecconf=20200331
Downloading URL:  https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=C&fecconf=20200331
Downloading URL:  https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=D&fecconf=20200331
Downloading URL:  https://www.spensiones.cl/apps/valoresCuotaFondo/vcfAFPxls.php?aaaaini=0000&aaaafin=2020&tf=E&fecconf=20200331
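
The download loop takes the file name from the `Content-Disposition` response header using a regular expression. The sketch below isolates that step so it can be tried without a network call. The helper name `filename_from_header`, its `fallback` argument and the sample header string are assumptions for illustration; the real header sent by the server may look different.

```python
import os
import re

# Same pattern as in the download loop: "filename=" followed by a name
# that ends in a dot and a three-character extension.
FILENAME_RE = re.compile(r'filename=(.+\.\w{3})')


def filename_from_header(content_disposition, data_dir, fallback='download.csv'):
    """Build a local path from a Content-Disposition value (hypothetical helper)."""
    match = FILENAME_RE.search(content_disposition or '')
    name = match.group(1) if match else fallback
    # Drop optional quotes and any directory part before joining.
    name = os.path.basename(name.strip('"'))
    return os.path.join(data_dir, name)


# Assumed header value, modelled on the file names printed by the notebook.
path = filename_from_header('attachment; filename=vcfA2020-2020.csv', '../data')
```

Falling back to a default name avoids the `IndexError` that `re.findall(...)[0]` raises when the header carries no file name.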


## Create the results folder to store the output

In [36]:
# exist_ok=True makes this a no-op when the folder is already there
os.makedirs('../results', exist_ok=True)

In [37]:
reHeader = re.compile(r'^Fecha.*')
reData = re.compile(r'^\d{4}-\d{2}-\d{2}.*')
       
dataList=[]
for index, rowFile in dataFileDF.iterrows():
    
    print("Processing file: ", rowFile['fileName'])
    
    dataRaw = []
    with open(rowFile['fileName'], 'r') as f:
        for line in f:
            #print(line)
            matchHeader = reHeader.match(line)
            matchData = reData.match(line)
            if matchHeader:
                headerRaw = matchHeader.group(0)
            if matchData:
                dataRaw.append(matchData.group(0))
                
    header = headerRaw.split(';')
    
    for rowData in dataRaw:
        rowValues = rowData.split(';')
                
        for i in range(1, len(header), 2):
            dataDict={'fecha': datetime.strptime(rowValues[0], '%Y-%m-%d'),
                      'fondo':rowFile['fondo'] }

            dataDict['administradora'] = header[i]
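            # Quota values use '.' for thousands and ',' for decimals:
            # drop the thousands separator and switch to a decimal point.
            if re.match(r"[0-9]+\..*", rowValues[i]):
                dataDict['valor cuota'] = float(rowValues[i].replace('.','').replace(',','.'))
            else:
                dataDict['valor cuota'] = np.nan
            # The patrimonio value sits in the column right after the quota value.
            if re.match(r"[0-9]+.*", rowValues[i+1]):
                dataDict['valor patrimonio'] = float(rowValues[i+1])
            else:
                dataDict['valor patrimonio'] = np.nan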
            
            dataList.append(dataDict)
           
dataDF = pd.DataFrame(dataList)

Processing file:  ../data/vcfA2020-2020.csv
Processing file:  ../data/vcfB2020-2020.csv
Processing file:  ../data/vcfC2020-2020.csv
Processing file:  ../data/vcfD2020-2020.csv
Processing file:  ../data/vcfE2020-2020.csv
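
The parsing cell turns each wide, semicolon-separated row (one date followed by a pair of columns per administradora) into one record per administradora. The following sketch shows that reshaping on a single row using only the standard library. The helper names `parse_cuota` and `wide_row_to_records` are hypothetical, and the sample header and row are made up to match the layout the cell expects; they reuse the PLANVITAL figures shown in the notebook output and are not copied from a downloaded file.

```python
import math
import re
from datetime import datetime

STARTS_WITH_DIGIT = re.compile(r'^[0-9]')


def parse_cuota(text):
    """Convert text such as '1.234,56' to a float, or NaN if it is not numeric."""
    if not STARTS_WITH_DIGIT.match(text):
        return math.nan
    return float(text.replace('.', '').replace(',', '.'))


def wide_row_to_records(header, row, fund):
    """Split one semicolon-separated row into one record per administradora."""
    names = header.split(';')
    values = row.split(';')
    fecha = datetime.strptime(values[0], '%Y-%m-%d')
    records = []
    # After the date, columns come in pairs: (valor cuota, valor patrimonio).
    for i in range(1, len(names), 2):
        patrimonio = values[i + 1]
        records.append({
            'fecha': fecha,
            'fondo': fund,
            'administradora': names[i],
            'valor cuota': parse_cuota(values[i]),
            'valor patrimonio': (float(patrimonio)
                                 if STARTS_WITH_DIGIT.match(patrimonio)
                                 else math.nan),
        })
    return records


# Made-up sample lines that follow the assumed file layout.
header = 'Fecha;CAPITAL;Valor Patrimonio;PLANVITAL;Valor Patrimonio'
row = '2020-04-29;;;72.899,76;901194200000'
records = wide_row_to_records(header, row, 'E')
```

Empty cells become NaN, which is why administradoras without a published value for a date show up with missing values in the resulting table.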


In [41]:
dataDF.tail(7)

| | fecha | fondo | administradora | valor cuota | valor patrimonio |
|---|---|---|---|---|---|
| 4193 | 2020-04-29 | E | CAPITAL | | |
| 4194 | 2020-04-29 | E | CUPRUM | | |
| 4195 | 2020-04-29 | E | HABITAT | | |
| 4196 | 2020-04-29 | E | MODELO | | |
| 4197 | 2020-04-29 | E | PLANVITAL | 72899.76 | 901194200000.0 |
| 4198 | 2020-04-29 | E | PROVIDA | | |
| 4199 | 2020-04-29 | E | UNO | | |


In [42]:
outputFileName = '../results/afp_chile_2020.csv'

dataDF.to_csv(outputFileName, index=False)