In this project, I'll look at corporate bankruptcies in Poland. To do that, I'll load data stored in a compressed JSON file, explore it, and turn it into a DataFrame that I'll use to train my model.

In [1]:
import gzip
import json

import pandas as pd

Using a context manager, I'll open the file and load it as a dictionary.

In [5]:
# Open compressed file and load contents
with gzip.open("poland-bankruptcy-data-2009.json.gz", "r") as read_file:
    poland_data_gz = json.load(read_file)

print(type(poland_data_gz))

<class 'dict'>

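As a self-contained illustration of the same pattern, the sketch below writes a tiny gzip-compressed JSON file to a temporary directory and reads it back. The file name toy.json.gz and its contents are made up for the example. Opening in text mode ("rt") with an explicit encoding is one option rather than a requirement, since json.load also accepts the binary stream used above.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical toy payload that mimics the top-level keys of the real file
payload = {"schema": {}, "metadata": {"dataYear": 2009}, "data": [{"company_id": 1}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy.json.gz"

    # Write the payload as gzip-compressed JSON text
    with gzip.open(path, "wt", encoding="utf-8") as write_file:
        json.dump(payload, write_file)

    # Read it back through the same kind of context manager
    with gzip.open(path, "rt", encoding="utf-8") as read_file:
        loaded = json.load(read_file)

print(type(loaded))
print(loaded == payload)
```
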

# Explore

In [8]:
print(poland_data_gz.keys())

dict_keys(['schema', 'data', 'metadata'])


schema tells us how the data is structured, metadata tells us where the data comes from, and data is the data itself.

Now let's take a look at the values. Remember, each value in a dictionary is the object stored under its key.

In [9]:
poland_data_gz["metadata"]

{'title': 'Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction',
 'authors': 'Zieba, M., Tomczak, S. K., & Tomczak, J. M.',
 'journal': 'Expert Systems with Applications',
 'publicationYear': 2016,
 'dataYear': 2009,
 'articleLink': 'doi:10.1016/j.eswa.2016.04.001',
 'datasetLink': 'https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data'}

In [10]:
poland_data_gz["schema"].keys()

dict_keys(['fields', 'primaryKey', 'pandas_version'])
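The keys fields, primaryKey, and pandas_version match the layout that pandas produces with to_json(orient="table"). It therefore seems reasonable to assume (the notebook does not show it) that fields is a list of dictionaries with name and type entries. Under that assumption, and with a made-up miniature schema standing in for the real one, the field names and types could be listed like this:

```python
# Hypothetical miniature schema; the real one lives in poland_data_gz["schema"]
toy_schema = {
    "fields": [
        {"name": "company_id", "type": "integer"},
        {"name": "feat_1", "type": "number"},
        {"name": "bankrupt", "type": "boolean"},
    ],
    "primaryKey": ["company_id"],
    "pandas_version": "0.20.0",
}

# Map each field name to its declared type
field_types = {field["name"]: field["type"] for field in toy_schema["fields"]}
for name, dtype in field_types.items():
    print(f"{name}: {dtype}")
```
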

This dataset includes all the information needed to figure out whether or not a Polish company went bankrupt in 2009. It contains many features, each of which corresponds to some element of a company's balance sheet. You can explore the features by looking at the data dictionary. Most importantly, you also know whether or not the company went bankrupt. That's the last key-value pair, "bankrupt".

Since there's a record for each company, let's take a look at how many companies there are.


In [12]:
# Calculate number of companies
len(poland_data_gz["data"])

9977

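And then let's see how many key-value pairs are stored for one of the companies. Note that this count includes "company_id" and the "bankrupt" label, not just the features.

In [13]:
# Calculate number of key-value pairs for the first company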
len(poland_data_gz["data"][0])

66

Here I'm dealing with data stored in a JSON file, which is common for semi-structured data. Because of that, I can't assume that all companies have the same features, so I'll check.

Iterate through the companies in poland_data_gz["data"] and check that they all have the same number of features.

In [14]:
# Iterate through companies and flag any whose number of
# key-value pairs differs from the 66 found for the first company.
# Prints nothing when every company has 66 key-value pairs.
for item in poland_data_gz["data"]:
    if len(item) != 66:
        print("ALERT: company", item.get("company_id"), "has", len(item), "key-value pairs.")
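A check of this kind can also be expressed as a tally. The sketch below runs on made-up records rather than the real dataset. It uses collections.Counter to count how many records have each number of key-value pairs, then compares key sets, since two records can have the same count but different keys. The names toy_records, length_counts, and mismatched_ids are illustrative only.

```python
from collections import Counter

# Hypothetical records; the third one is missing "feat_2" on purpose
toy_records = [
    {"company_id": 1, "feat_1": 0.17, "feat_2": 0.41, "bankrupt": False},
    {"company_id": 2, "feat_1": 0.15, "feat_2": 0.46, "bankrupt": False},
    {"company_id": 3, "feat_1": 0.00, "bankrupt": True},
]

# Tally how many records have each number of key-value pairs
length_counts = Counter(len(record) for record in toy_records)
print(length_counts)

# Compare every record's keys against the first record's keys
expected_keys = set(toy_records[0])
mismatched_ids = [
    record["company_id"]
    for record in toy_records
    if set(record) != expected_keys
]
print(mismatched_ids)
```

A single entry in length_counts and an empty mismatched_ids list would indicate that every record shares the same keys.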


Now I can turn the list of company dictionaries into a DataFrame using pandas.

In [15]:
# Create a DataFrame df that contains all the companies in the dataset, indexed by "company_id"
df = pd.DataFrame(poland_data_gz["data"]).set_index("company_id")
print(df.shape)
df.head()

(9977, 65)


company_id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,feat_10,...,feat_56,feat_57,feat_58,feat_59,feat_60,feat_61,feat_62,feat_63,feat_64,bankrupt
1,0.17419,0.41299,0.14371,1.348,-28.982,0.60383,0.21946,1.1225,1.1961,0.46359,...,0.16396,0.37574,0.83604,7e-06,9.7145,6.2813,84.291,4.3303,4.0341,False
2,0.14624,0.46038,0.2823,1.6294,2.5952,0.0,0.17185,1.1721,1.6018,0.53962,...,0.027516,0.271,0.90108,0.0,5.9882,4.1103,102.19,3.5716,5.95,False
3,0.000595,0.22612,0.48839,3.1599,84.874,0.19114,0.004572,2.9881,1.0077,0.67566,...,0.007639,0.000881,0.99236,0.0,6.7742,3.7922,64.846,5.6287,4.4581,False
5,0.18829,0.41504,0.34231,1.9279,-58.274,0.0,0.23358,1.4094,1.3393,0.58496,...,0.17648,0.32188,0.82635,0.073039,2.5912,7.0756,100.54,3.6303,4.6375,False
6,0.18206,0.55615,0.32191,1.6045,16.314,0.0,0.18206,0.79808,1.8126,0.44385,...,0.55577,0.41019,0.46957,0.029421,8.4553,3.3488,107.24,3.4036,12.454,False
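
One caveat worth illustrating: pandas does not raise an error when the records in a list have different keys. It fills the gaps with NaN instead. The sketch below uses made-up records (not the real dataset) to show how counting missing values per column could surface such gaps after the DataFrame is built. The names toy_records and toy_df are hypothetical.

```python
import pandas as pd

# Hypothetical records; company 3 has no "feat_2" entry
toy_records = [
    {"company_id": 1, "feat_1": 0.17, "feat_2": 0.41, "bankrupt": False},
    {"company_id": 2, "feat_1": 0.15, "feat_2": 0.46, "bankrupt": False},
    {"company_id": 3, "feat_1": 0.00, "bankrupt": True},
]

# Build the DataFrame the same way as above, indexed by "company_id"
toy_df = pd.DataFrame(toy_records).set_index("company_id")

# Count missing values per column; a missing key shows up as NaN
missing_per_column = toy_df.isna().sum()
print(toy_df.shape)
print(missing_per_column)
```
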
