# Beginner Friendly Guide: Flatten JSON Fields
Thanks to this [wonderful kernel](https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields/notebook) by [Julian](https://www.kaggle.com/julian3833), we can flatten the JSON fields in the input file. I based this kernel on his and added some basic explanations of the points that weren't clear to me when I first read it (this is my very first Kaggle competition without any tutorials :D).

# Necessary Imports:

In [None]:
import numpy as np
import pandas as pd
import os
import json
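# json_normalize is exposed at the top level since pandas 1.0;
# the old pandas.io.json import path no longer works in pandas 2.x
from pandas import json_normalize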

# Finding the csv Path in Kaggle:

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Loading the Data:

   * ``pd.read_csv`` has an optional parameter called `converters`, which is a dict of functions for converting values in certain columns. Keys can be either integers or column labels.
   * ``json.loads(s)`` is a function that deserializes ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance containing a JSON document) to a Python object.
   * The dataset contains many JSON objects in which every field must be converted to a separate column. For instance, the `totals` column looks like this:
   ```
   {"visits": "1", "hits": "1", "pageviews": "1", "bounces": "1", "newVisits": "1"}
   ```
What we need is to have each field as a separate column: `totals.visits`, `totals.hits`, etc.
   * We need to deserialize each of these JSON objects to a Python object. This is done by passing `converters` a dict that maps each JSON column name to ``json.loads``. Each resulting Python object (a dict) then needs to be normalized into a flat table, i.e. every field must be converted to a separate column. This is achieved using the pandas utility ``json_normalize``.
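
To see the two steps in isolation, here is a minimal sketch on a tiny in-memory CSV. The two sample rows are made up for illustration (they are not taken from the competition data), and `pd.concat` is used here instead of `merge` simply because the rows already line up by position.

```python
import io
import json

import pandas as pd

# Hypothetical two-row CSV with one JSON column, for illustration only.
SAMPLE_CSV = '''fullVisitorId,totals
0001,"{""visits"": ""1"", ""hits"": ""4""}"
0002,"{""visits"": ""1"", ""hits"": ""2"", ""bounces"": ""1""}"
'''

sample = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    converters={"totals": json.loads},  # parse each cell into a dict
    dtype={"fullVisitorId": "str"},     # keep leading zeros in the ID
)

# Flatten the dicts into one column per JSON field.
totals = pd.json_normalize(sample["totals"].tolist())
totals.columns = [f"totals.{name}" for name in totals.columns]

# Swap the JSON column for its flattened columns (rows align by position).
flat = pd.concat([sample.drop(columns="totals"), totals], axis=1)
print(flat)
```

Fields that are missing from a row (like `bounces` in the first one) come out as `NaN`, which is why the flattened training data contains many missing values.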

In [None]:
CSV_PATH='/kaggle/input/ga-customer-revenue-prediction/train.csv'
JSON_COLUMNS = ['device','geoNetwork', 'totals', 'trafficSource']
def load_data(csv_path=CSV_PATH, nrows=None, json_cols=JSON_COLUMNS):
    df = pd.read_csv(csv_path, # engine='python', 
                     converters={col: json.loads for col in json_cols}, 
                     dtype={'fullVisitorId': 'str'},
                     nrows=nrows)
    for col in json_cols:
        # normalizing (flattening) each json column
        col_as_df = json_normalize(df[col])
        # renaming each column of the new dataframe that resulted from
        # normalization by prefixing it with the name of the json column
        # it was extracted from, so that we can keep track of where
        # each column came from
        col_as_df.columns = [f"{col}.{subcol}" for subcol in col_as_df.columns]
        # replacing the original json column by the new dataframe we obtained above
        df = df.drop(col, axis=1).merge(col_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
        

In [None]:
%%time
customer_data_flattened = load_data()

# Saving to Feather Format:
## What is Feather?

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

   * **Lightweight, minimal API**: make pushing data frames in and out of memory as simple as possible

   * **Language agnostic**: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.

   * **High read and write performance**: When possible, Feather operations should be bound by local disk performance.

## Limitations:
Some features of pandas are not supported in Feather:

   * Non-string column names
   * Row indexes
   * Object-type columns with non-homogeneous data


In [None]:
os.makedirs('tmp', exist_ok=True)
customer_data_flattened.to_feather('tmp/ga-customer-data-flattened.feather')

In [None]:
# to read the data back in a future session, run the following line:
customer_data_flattened = pd.read_feather('tmp/ga-customer-data-flattened.feather')

***I hope this kernel helped at least one beginner like myself. If you have any questions or feedback I'd love to hear them. Enjoy :D***