# Working with JSON files in Python
Working with JSON files isn't the most fun. While pandas has the read_json function, which is useful for reading a .json file into a dataframe, we are often left with lists or dictionaries inside of columns. Since nested column values aren't really helpful for analyzing data, we'll explore some methods for unpacking the JSON and creating clean and orderly dataframes.
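To make the problem concrete, here is a minimal sketch on two made-up records in JSON Lines form. The field names mirror the dataset used below, but the values and the `sample` and `toy` names are assumptions for illustration only.

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line (JSON Lines).
sample = io.StringIO(
    '{"npi": 1, "provider_variables": {"gender": "M", "region": "South"}}\n'
    '{"npi": 2, "provider_variables": {"gender": "F", "region": "West"}}\n'
)

toy = pd.read_json(sample, lines=True)

# read_json parses each line, but the nested object stays a plain
# Python dict inside a single cell rather than becoming columns.
print(type(toy.loc[0, "provider_variables"]))
print(toy)
```

The nested object survives parsing as a dict, which is why a second unpacking step is needed.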

In [1]:
import numpy as np
import pandas as pd
import ijson
# pandas 1.0+ exposes json_normalize at the top level;
# the pandas.io.json import path is deprecated
from pandas import json_normalize

In [2]:
%%bash
# we can use %%bash magic to print a preview of our file

head ../input/roam_prescription_based_prediction.jsonl

{"cms_prescription_counts": {"DOXAZOSIN MESYLATE": 26, "MIDODRINE HCL": 12, "MEGESTROL ACETATE": 11, "BENAZEPRIL HCL": 11, "METOLAZONE": 73, "NOVOLOG": 12, "DIAZEPAM": 24, "HYDRALAZINE HCL": 50, "SENSIPAR": 94, "LABETALOL HCL": 28, "PREDNISONE": 40, "CALCITRIOL": 79, "HYDROCODONE-ACETAMINOPHEN": 64, "HYDROCHLOROTHIAZIDE": 59, "LOSARTAN-HYDROCHLOROTHIAZIDE": 14, "FENOFIBRATE": 14, "MINOXIDIL": 14, "MELOXICAM": 29, "ATENOLOL": 62, "CARISOPRODOL": 40, "GABAPENTIN": 35, "OMEPRAZOLE": 35, "KLOR-CON M10": 20, "LANTUS": 20, "AMLODIPINE BESYLATE": 175, "CARVEDILOL": 36, "LOSARTAN POTASSIUM": 41, "IRBESARTAN": 11, "NIFEDICAL XL": 32, "NIFEDIPINE ER": 51, "LEVOTHYROXINE SODIUM": 12, "POTASSIUM CHLORIDE": 30, "FUROSEMIDE": 162, "GLYBURIDE": 16, "CLONIDINE HCL": 43, "TEMAZEPAM": 41, "SPIRONOLACTONE": 50, "LOVASTATIN": 11, "LISINOPRIL": 44, "PANTOPRAZOLE SODIUM": 13, "CALCIUM ACETATE": 85, "NEXIUM": 44, "ZOLPIDEM TARTRATE": 41, "DIOVAN": 20, "OXYCODONE HCL": 51, "METOPROLOL SUCCINATE": 104, "RANITI

In [3]:
# read in data
raw_data = pd.read_json("../input/roam_prescription_based_prediction.jsonl",
                        lines=True,
                        orient='columns')
print(raw_data.shape)
raw_data.head()

(239930, 3)


,cms_prescription_counts,npi,provider_variables
0,"{'DOXAZOSIN MESYLATE': 26, 'MIDODRINE HCL': 12...",1295763035,"{'settlement_type': 'non-urban', 'generic_rx_c..."
1,"{'CEPHALEXIN': 23, 'AMOXICILLIN': 52, 'HYDROCO...",1992715205,"{'settlement_type': 'non-urban', 'generic_rx_c..."
2,"{'CEPHALEXIN': 28, 'AMOXICILLIN': 73, 'CLINDAM...",1578587630,"{'settlement_type': 'non-urban', 'generic_rx_c..."
3,{'AMOXICILLIN': 63},1932278405,"{'settlement_type': 'non-urban', 'generic_rx_c..."
4,"{'PIOGLITAZONE HCL': 24, 'BENAZEPRIL HCL': 29,...",1437366804,"{'settlement_type': 'non-urban', 'generic_rx_c..."


The preview above shows that the cms_prescription_counts and provider_variables columns hold dictionaries rather than scalar values. There are several options for extracting these values; in this kernel we will explore two of them: list comprehensions and json_normalize.
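Before running either option on the full file, a small sketch shows what each one does. The `toy` frame is a hypothetical stand-in for raw_data, and `pd.json_normalize` assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical stand-in for raw_data: one flat dict per row.
toy = pd.DataFrame({
    "npi": [1, 2],
    "provider_variables": [
        {"gender": "M", "region": "South"},
        {"gender": "F", "region": "West"},
    ],
})

# Option 1: build a list of dicts and hand it to the DataFrame constructor.
via_list = pd.DataFrame([d for d in toy.provider_variables])

# Option 2: let json_normalize turn the dict keys into columns.
via_normalize = pd.json_normalize(toy.provider_variables.tolist())

print(via_list)
print(via_normalize)
```

For dicts that are only one level deep, as here, both options should produce the same columns, so the choice mostly comes down to speed and readability.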

## Extract Prescriber Data
### List Comprehension

In [4]:
# each cell of provider_variables is a dict; a list of dicts becomes one row per dict
%time provider = pd.DataFrame([md for md in raw_data.provider_variables])
provider.head()

CPU times: user 828 ms, sys: 16 ms, total: 844 ms
Wall time: 842 ms


,brand_name_rx_count,gender,generic_rx_count,region,settlement_type,specialty,years_practicing
0,384,M,2287,South,non-urban,Nephrology,7
1,0,M,103,South,non-urban,General Practice,7
2,0,M,112,Midwest,non-urban,General Practice,7
3,0,M,63,South,non-urban,General Practice,7
4,316,M,1035,West,non-urban,Nephrology,6


In [5]:
# add npi as index
provider['npi'] = raw_data.npi
provider.set_index('npi', inplace=True)
provider.head()

npi,brand_name_rx_count,gender,generic_rx_count,region,settlement_type,specialty,years_practicing
1295763035,384,M,2287,South,non-urban,Nephrology,7
1992715205,0,M,103,South,non-urban,General Practice,7
1578587630,0,M,112,Midwest,non-urban,General Practice,7
1932278405,0,M,63,South,non-urban,General Practice,7
1437366804,316,M,1035,West,non-urban,Nephrology,6


### JSON Normalize

In [6]:
# flatten the same column with json_normalize for comparison
%time provider = json_normalize(data=raw_data.provider_variables)
provider.head()

CPU times: user 5.47 s, sys: 64 ms, total: 5.53 s
Wall time: 5.53 s


,brand_name_rx_count,gender,generic_rx_count,region,settlement_type,specialty,years_practicing
0,384,M,2287,South,non-urban,Nephrology,7
1,0,M,103,South,non-urban,General Practice,7
2,0,M,112,Midwest,non-urban,General Practice,7
3,0,M,63,South,non-urban,General Practice,7
4,316,M,1035,West,non-urban,Nephrology,6
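Both approaches return the same frame here because provider_variables is only one level deep. Where they differ is deeper nesting: json_normalize keeps flattening and joins the keys with a dot, while the DataFrame constructor stops at the first level. The sketch below uses made-up records (the `address` key is hypothetical and not part of this dataset) and assumes pandas 1.0 or later for `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical records with two levels of nesting under "provider".
records = [
    {"npi": 1, "provider": {"gender": "M", "address": {"region": "South"}}},
    {"npi": 2, "provider": {"gender": "F", "address": {"region": "West"}}},
]

# The plain constructor leaves "provider" as a dict in a single column.
shallow = pd.DataFrame(records)

# json_normalize flattens every level into dotted column names.
deep = pd.json_normalize(records)

print(list(shallow.columns))
print(list(deep.columns))
```

That extra flattening work is a plausible reason for the slower timing above, though this is an assumption rather than something measured here.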


## Extract Rx Data
### List Comprehension

In [7]:
# same approach for the prescription counts; every distinct drug name becomes a column
%time rx_counts = pd.DataFrame([rx for rx in raw_data.cms_prescription_counts])

CPU times: user 3min 26s, sys: 12.8 s, total: 3min 39s
Wall time: 3min 38s


In [8]:
print(rx_counts.shape)
rx_counts.head()

(239930, 2397)


,1ST TIER UNIFINE PENTIPS,ABACAVIR,ABELCET,ABILIFY,ABILIFY DISCMELT,ABILIFY MAINTENA,ABRAXANE,ABSTRAL,ACAMPROSATE CALCIUM,ACANYA,ACARBOSE,ACCOLATE,ACCUNEB,ACCUPRIL,ACEBUTOLOL HCL,ACETAMINOPH-CAFF-DIHYDROCODEIN,ACETAMINOPHEN-BUTALBITAL,ACETAMINOPHEN-CODEINE,ACETAZOLAMIDE,ACETIC ACID,ACETIC ACID-ALUMINUM,ACETYLCYSTEINE,ACIPHEX,ACITRETIN,ACTEMRA,ACTIGALL,ACTIMMUNE,ACTIQ,ACTIVELLA,ACTONEL,ACTOPLUS MET,ACTOPLUS MET XR,ACTOS,ACYCLOVIR,ACZONE,ADACEL TDAP,ADALAT CC,ADAPALENE,ADCIRCA,ADDERALL,...,ZIPRASIDONE HCL,ZIPSOR,ZIRGAN,ZITHROMAX,ZOCOR,ZOFRAN,ZOFRAN ODT,ZOLADEX,ZOLEDRONIC ACID,ZOLINZA,ZOLMITRIPTAN,ZOLMITRIPTAN ODT,ZOLOFT,ZOLPIDEM TARTRATE,ZOLPIDEM TARTRATE ER,ZOLPIMIST,ZOMETA,ZOMIG,ZOMIG ZMT,ZONALON,ZONEGRAN,ZONISAMIDE,ZORTRESS,ZOSTAVAX,ZOSYN,ZOVIA 1-35E,ZOVIA 1-50E,ZOVIRAX,ZUBSOLV,ZYCLARA,ZYFLO,ZYFLO CR,ZYLET,ZYLOPRIM,ZYMAXID,ZYPREXA,ZYPREXA RELPREVV,ZYPREXA ZYDIS,ZYTIGA,ZYVOX
0,,,,11.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,41.0,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,16.0,,,,,,,,,,,...,,,,,,,,,,,,,,35.0,,,,,,,,,,,,,,,,,,,,,,,,,,
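The wide frame above has 2,397 columns and, judging by the preview, most cells are NaN because each prescriber has counts for only a few drugs. One alternative, sketched here on made-up data and not the method used in this kernel, is a long table with one row per prescriber and drug. The `toy` frame, the `rx_long` name, and the `drug` and `count` column names are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for raw_data with two prescribers.
toy = pd.DataFrame({
    "npi": [1, 2],
    "cms_prescription_counts": [
        {"ABILIFY": 11, "ZOLPIDEM TARTRATE": 41},
        {"AMOXICILLIN": 63},
    ],
})

# One row per (prescriber, drug) pair instead of one column per drug,
# so no NaN padding is created for drugs a prescriber never wrote.
rx_long = pd.DataFrame(
    [
        (npi, drug, count)
        for npi, counts in zip(toy.npi, toy.cms_prescription_counts)
        for drug, count in counts.items()
    ],
    columns=["npi", "drug", "count"],
)

print(rx_long)
```

A long table like this can be pivoted back to the wide layout when a model needs it. Whether it is faster on the full file has not been measured here.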
