
# Importing the libraries

More SECOP data can be found [here](https://www.datos.gov.co/Ciencia-Tecnolog-a-e-Innovaci-n/Inventario-de-Datasets/2irh-ijg2).

In [0]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
import urllib.request
import json
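# json_normalize lives at the top level since pandas 1.0;
# the old pandas.io.json import path is deprecated.
from pandas import json_normalize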

# Importing the data

We download the 2019 data as a zip file from a Dropbox URL.

In [0]:
url = 'https://www.dropbox.com/s/r56zkj70r5eldmn/SI2019.zip?dl=1'
print('Beginning file download with urllib.request...')
urllib.request.urlretrieve(url, 'SI2019.zip')

Extracting the data

In [0]:
!ls

In [0]:
# Create a ZipFile object and load SI2019.zip into it
with ZipFile('SI2019.zip', 'r') as zipObj:
    # Extract all the contents of the zip file into the current directory
    zipObj.extractall()

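Let's read the data into pandas using `pd.read_json()`. The file holds one JSON object per line, hence `lines=True`; pandas infers the zip compression from the file extension.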

In [0]:
si2019 = pd.read_json('SI2019.zip', encoding='latin-1', lines=True)
si2019.head()
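As a rough illustration of what `lines=True` does, the sketch below parses a tiny in-memory JSON Lines sample. The two records and their values are made up for the example; only the top-level `Release` key mirrors the real data.

```python
import io

import pandas as pd

# Two made-up records, one JSON object per line (JSON Lines).
sample = io.StringIO(
    '{"Release": {"ocid": "x-1", "buyer": {"name": "A"}}}\n'
    '{"Release": {"ocid": "x-2", "buyer": {"name": "B"}}}\n'
)

# Each line becomes one row; nested objects stay as Python dicts in the cell.
frame = pd.read_json(sample, lines=True)
print(frame.shape)
print(type(frame.loc[0, "Release"]))
```

Because the nested objects arrive as plain dicts, they still need to be flattened before they are usable as columns.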

Now let's extract just one record, the first one:

In [0]:
df = si2019.head(1).copy()
df

Let's explore its structure:

In [21]:
df['Release'].to_dict()

{0: {'buyer': {'id': '891502397',
   'name': 'META - ALCALDÍA MUNICIPIO DE MESETAS'},
  'contracts': [{'awardID': 8118331,
    'dateSigned': '2019-01-22T00:00:00.000Z',
    'description': 'PRESTACION DE SERVICIOS DE APOYO A LA GESTION PARA EL EMBELLECIMIENTO DEL CEMENTERIO MUNICIPAL DE MESETAS',
    'id': 8118331,
    'items': [{'additionalClassifications': [{'description': 'Servicios de recursos humanos',
        'id': '8011',
        'scheme': 'UNSPSC',
        'uri': 'http://www.colombiacompra.gov.co/clasificador-de-bienes-y-servicios'}],
      'classification': {'description': 'Reclutamiento de personal',
       'id': '801117',
       'scheme': 'UNSPSC',
       'uri': 'http://www.colombiacompra.gov.co/clasificador-de-bienes-y-servicios'},
      'description': 'Reclutamiento de personal',
      'id': '801117'}],
    'period': {'durationInDays': 330, 'startDate': '2019-01-22T00:00:00.000Z'},
    'title': '043 DE 2019',
    'value': {'amount': 16800000.0, 'currency': 'COP'}}],
  'date
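The raw dictionary is hard to scan, so one option is to list only its key paths. The helper below, `list_key_paths`, is a hypothetical name introduced here for illustration, and the sample record is a shortened, made-up stand-in for a real release.

```python
def list_key_paths(node, prefix=""):
    """Return the sorted dotted key paths of a nested dict/list structure."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= set(list_key_paths(value, path))
    elif isinstance(node, list):
        # Mark array levels with "[]" so they stand out from plain nesting.
        for item in node:
            paths |= set(list_key_paths(item, prefix + "[]"))
    return sorted(paths)


# Made-up, shortened record used only to show the output shape.
sample_release = {
    "buyer": {"id": "1"},
    "contracts": [{"id": 7, "value": {"amount": 1.0}}],
}
key_paths = list_key_paths(sample_release)
print(key_paths)
```

Paths containing `[]` are the JSON arrays, which are the parts that need their own table when normalising.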

# Parsing the JSON records to obtain usable dataframes

We shall create a function that does the following *for each record*:

1. Normalises the JSON file.
1. Checks which columns are JSON arrays (Python lists).
1. Normalises each column that is an array.
1. Repeats the previous steps until there are no arrays left.
1. Keeps track of the identifiers for each column.
1. Creates a normalised DataFrame with all the data.


This function is called `json_row_to_df()`. The cell further down works through these steps by hand for a single record.

Some good resources are [this one](https://mindtrove.info/flatten-nested-json-with-pandas/), [this one](https://stackoverflow.com/questions/45418334/using-pandas-json-normalize-on-nested-json-with-arrays), [this one](https://stackoverflow.com/questions/45672130/how-to-identify-a-pandas-column-is-a-list) and [this one](https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe/53831756#53831756).
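The steps listed above can be sketched roughly as follows. This is not the final `json_row_to_df()`; `explode_record_sketch` is a hypothetical helper that assumes a record is a plain dict with an `ocid` key, requires pandas 1.0 or later for `pd.json_normalize`, and skips arrays of scalars.

```python
import pandas as pd


def explode_record_sketch(record, id_key="ocid"):
    """Flatten one nested record into a dict of tables keyed by array path."""
    record_id = record[id_key]
    tables = {}
    # Work queue of (path, list of dicts) that still need normalising.
    pending = [("", [record])]
    while pending:
        path, rows = pending.pop()
        frame = pd.json_normalize(rows)
        # Columns whose cells hold Python lists are JSON arrays.
        list_cols = [
            col for col in frame.columns
            if frame[col].map(lambda v: isinstance(v, list)).any()
        ]
        for col in list_cols:
            child_path = f"{path}.{col}" if path else col
            # Collect the dict items of every array cell; scalars are skipped.
            child_rows = [
                item
                for cell in frame[col] if isinstance(cell, list)
                for item in cell if isinstance(item, dict)
            ]
            if child_rows:
                pending.append((child_path, child_rows))
        # Keep the record identifier on every table so they can be re-joined.
        frame = frame.drop(columns=list_cols)
        frame[id_key] = record_id
        tables[path or "main"] = frame.set_index(id_key)
    return tables


# Made-up record used only to exercise the sketch.
example = {
    "ocid": "x-1",
    "buyer": {"id": "1"},
    "contracts": [
        {"id": 7, "items": [{"id": "a"}, {"id": "b"}],
         "period": {"durationInDays": 330}}
    ],
}
tables = explode_record_sketch(example)
print(sorted(tables))
```

Each table carries only the `ocid`, so the link between a child row and its specific parent row (for instance, which item belongs to which contract) is not preserved in this sketch.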

In [20]:
RELEASE = df['Release']
MAIN = json_normalize(RELEASE).set_index('ocid')
ocid = MAIN.index[0]

## Primary branches
PARTIES = json_normalize(RELEASE, record_path='parties')
MAIN = MAIN.drop(columns=['parties'])
PARTIES['ocid'] = ocid
PARTIES = PARTIES.set_index('ocid')
PARTIES = PARTIES.add_prefix('PARTIES.')

CONTRACTS = json_normalize(RELEASE, record_path='contracts')
MAIN = MAIN.drop(columns=['contracts'])
CONTRACTS['ocid'] = ocid
CONTRACTS = CONTRACTS.set_index('ocid')
CONTRACTS = CONTRACTS.add_prefix('CONTRACTS.')

## Secondary branches

# Use .iloc[0] for positional access: the index holds the ocid, not integers.
PLANNING_MILESTONES = json_normalize(MAIN['planning.milestones'].iloc[0])
MAIN = MAIN.drop(columns=['planning.milestones'])
PLANNING_MILESTONES['ocid'] = ocid
PLANNING_MILESTONES = PLANNING_MILESTONES.set_index('ocid')
PLANNING_MILESTONES = PLANNING_MILESTONES.add_prefix('PLANNING.MILESTONES.')

ITEMS = json_normalize(CONTRACTS['CONTRACTS.items'].iloc[0])
CONTRACTS = CONTRACTS.drop(columns=['CONTRACTS.items'])
ITEMS['ocid'] = ocid
ITEMS = ITEMS.set_index('ocid')
ITEMS = ITEMS.add_prefix('CONTRACTS.ITEMS.')

ADDITIONALCLASSIFICATIONS = json_normalize(ITEMS['CONTRACTS.ITEMS.additionalClassifications'].iloc[0])
ITEMS = ITEMS.drop(columns=['CONTRACTS.ITEMS.additionalClassifications'])
ADDITIONALCLASSIFICATIONS['ocid'] = ocid
ADDITIONALCLASSIFICATIONS = ADDITIONALCLASSIFICATIONS.set_index('ocid')
ADDITIONALCLASSIFICATIONS = ADDITIONALCLASSIFICATIONS.add_prefix('CONTRACTS.ITEMS.ADDITIONALCLASSIFICATIONS.')

# `period` is a nested dict, not an array, so json_normalize has already
# flattened it into the `CONTRACTS.period.*` columns of CONTRACTS. There is no
# 'CONTRACTS.period' column to normalise (looking it up raises a KeyError),
# so we simply select the flattened columns. The same applies to `value`.
CONTRACTS_PERIOD = CONTRACTS.filter(like='CONTRACTS.period.')
CONTRACTS = CONTRACTS.drop(columns=CONTRACTS_PERIOD.columns)


# ## JOIN

# RESULT = MAIN.join(CONTRACTS).join(ITEMS).join(ADDITIONALCLASSIFICATIONS).join(PARTIES).join(PLANNING_MILESTONES).join(CONTRACTS_PERIOD)
# RESULT = pd.DataFrame(RESULT.stack()).reset_index()
# #RESULT.index = 
# i = RESULT['level_1'].str.split('.',expand=True)
# v = RESULT.values
# #pd.DataFrame(v, index=[i])
# #CONTRACTS
# i.set_index([0,1,2,3])
# #MAIN
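One thing to keep in mind about the join drafted in the commented-out lines: every table shares the same `ocid` index, so joining several one-to-many tables multiplies their rows. The sketch below uses made-up miniature tables (the column names only imitate the prefixes used above) to show the effect.

```python
import pandas as pd

# Made-up miniature tables, all indexed by the same ocid.
main = pd.DataFrame(
    {"buyer.name": ["A"]},
    index=pd.Index(["x-1"], name="ocid"),
)
parties = pd.DataFrame(
    {"PARTIES.name": ["p1", "p2"]},
    index=pd.Index(["x-1", "x-1"], name="ocid"),
)
items = pd.DataFrame(
    {"CONTRACTS.ITEMS.id": ["a", "b", "c"]},
    index=pd.Index(["x-1"] * 3, name="ocid"),
)

# Joining on a repeated index value pairs every party with every item.
joined = main.join(parties).join(items)
print(joined.shape)
```

With 2 parties and 3 items the result has 2 x 3 rows for a single record, which is worth weighing before joining everything into one wide frame rather than keeping separate tables keyed by `ocid`.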