<h1 align='center'>Sanity Checks on API</h1>

<h4 align='center'>iReceptor $\mid$ Laura Gutierrez Funderburk $\mid$ October 23</h4>

<h4 align='center'>Supervised by Dr. Felix Breden, Dr. Jamie Scott, Dr. Brian Corrie</h4>

<h2 align='center'>Abstract</h2>

In this notebook I parse and study the JSON content found at https://ipa.ireceptor.org/v2/sequences_summary in order to run sanity checks on the data.

In [15]:
import requests
import json
import csv
import pandas as pd

In [2]:
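response = requests.get("https://ipa.ireceptor.org/v2/sequences_summary")
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page
DATA = response.json()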
DATA_items = DATA['items']
DATA_summary = DATA['summary']
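
The two top-level keys used above, `items` and `summary`, can be checked before they are indexed. The following is a minimal sketch, assuming the parsed payload is a dict whose two keys each hold a list of records. `split_payload` and the toy payload are hypothetical, introduced only for illustration; they are not part of the iReceptor API.

```python
def split_payload(payload):
    """Return (items, summary) from a parsed JSON payload, with basic checks."""
    # Assumption: both keys are present and each holds a list of records.
    for key in ("items", "summary"):
        if key not in payload:
            raise KeyError(f"expected top-level key {key!r} in payload")
        if not isinstance(payload[key], list):
            raise TypeError(f"expected payload[{key!r}] to be a list")
    return payload["items"], payload["summary"]


# Toy payload standing in for the parsed response (made-up values).
toy_payload = {"items": [{"a": 1}], "summary": [{"b": 2}, {"b": 3}]}
toy_items, toy_summary = split_payload(toy_payload)
print(len(toy_items), len(toy_summary))
```

With the real data the call would be `split_payload(DATA)`; a missing key then fails with a clear message rather than further down in the analysis.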

In [3]:
df_items = pd.DataFrame.from_dict(DATA_items)
# print("Column Names")
cols = list(df_items.columns)
# print(cols)
# print("There are a total of " + str(len(cols)) + " columns")

In [4]:
# What is this? It looks like a MongoDB ObjectId serialized as Extended JSON ({'$oid': ...}).
df_items["_id"][0]

{'$oid': '5a704fb7b4737f76afc4f4eb'}

In [5]:
#df_summary["disease_state_sample"]

In [6]:
# for item in df_items.columns:
#     print(df_items[item])

In [7]:
# for item in df_items.columns:
#     print(item)

In [8]:
df_summary = pd.DataFrame.from_dict(DATA_summary)

cols = list(df_summary.columns)

In [9]:
print("Are the entries in the JSON file uniform?")
lengths = {len(entry) for entry in DATA_items}  # distinct key counts across entries
print(lengths)  

Are the entries in the JSON file uniform?
{128}


In [10]:
print("Are the entries in the JSON file uniform?")
lengths = {len(entry) for entry in DATA_summary}  # distinct key counts across entries
print(lengths)  

Are the entries in the JSON file uniform?
{109, 81, 82, 84, 85}


From this, the first thing we observe is that the content of the rows is not uniform: different entries define different numbers of columns. The samples can therefore be categorized into groups:

Groups that have defined information under 4, 81, 82, 84, 85 or 109 columns. Defined information includes integers, floating-point values, strings and the JSON value `null`, which becomes `None` in the DataFrame.
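
The `null`-to-`None` conversion mentioned above happens in the JSON parser itself. The small sketch below uses a made-up record (the field names are illustrative only) to show that a key whose value is `null` is still counted by `len(entry.keys())`, so counting keys and counting non-null values can give different numbers.

```python
import json

# Made-up record; the field names are for illustration only.
raw = '{"sample_id": "S1", "age": 42, "ratio": 0.5, "disease_state": null}'
entry = json.loads(raw)

n_keys = len(entry)  # counts every key, including those set to null
n_non_null = sum(value is not None for value in entry.values())

print(entry["disease_state"])  # None
print(n_keys, n_non_null)      # 4 3
```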

Below we observe that the groups do not appear in any particular order. For instance, the entry with only 4 well-defined columns is the ninth entry (entries run from 0 to 549) in the JSON array DATA, and correspondingly the ninth row of the DataFrame df.

In [11]:
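# Study different cases and obtain row number/JSON array number.
# `lengths` holds the key counts of DATA_summary (previous cell), so the lookup
# must run over DATA_summary. Every DATA_items entry has 128 keys, so filtering
# DATA_items matches none of these counts and yields empty lists.
cases = []
for item in lengths:
    cases.append([item, [i for i, entry in enumerate(DATA_summary) if len(entry) == item]])
for count, rows in cases:
    print("Case: " + str(count) + ". Well-defined columns are on JSON entry(ies)/row number(s):\n" + str(rows) + "\n")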


Case: 109. Well-defined columns are on JSON entry(ies)/row number(s):
[]

Case: 81. Well-defined columns are on JSON entry(ies)/row number(s):
[]

Case: 82. Well-defined columns are on JSON entry(ies)/row number(s):
[]

Case: 84. Well-defined columns are on JSON entry(ies)/row number(s):
[]

Case: 85. Well-defined columns are on JSON entry(ies)/row number(s):
[]



### Remark

From the above we observe that there are a total of 123 columns; however, the cluster with the largest number of well-defined columns has 109. This means that some entries have columns that others don't.
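
One way to see which columns are missing from which entries is to compare each entry's keys with the union of all keys. The sketch below uses toy records in place of `DATA_summary`; `toy_records`, `all_keys` and `missing` are hypothetical names and the keys are made up.

```python
# Toy records standing in for DATA_summary (made-up keys and values).
toy_records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5},
    {"a": 6, "d": 7},
]

# Union of every key seen in any record; these become the DataFrame columns.
all_keys = set().union(*(record.keys() for record in toy_records))

# For each record, the keys it lacks relative to that union.
missing = [sorted(all_keys.difference(record.keys())) for record in toy_records]

print(sorted(all_keys))
for index, keys in enumerate(missing):
    print(index, keys)
```

Here the union has four keys although no single record defines more than three, which is the same situation as a column total that exceeds the size of the largest cluster.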

Let us study the clusters in more detail.

We define a test that studies whether all column names are equal within the same cluster. 

We find they are not. 

Let us take one cluster as an example: `cases[4]`, the cluster with 85 well-defined columns.

In [12]:
print(cases[4][0],cases[4][1])

85 []


We define a test that compares every entry in a cluster against the first and last entries of that cluster and checks whether the column names match. If they match the first entry, we append True along with the index of the first entry (0); if they match the last entry, we append True followed by -1. In all other cases we append False. This worked (luckily) because in every cluster the column names matched either the first or the last entry; we might not always be this lucky.

To see an example, consider the cluster that has 84 well-defined columns. From the test, we see that all column names match either the first entry of that cluster in the cases array or the last one.

No other cases were found. 
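
Because comparing only against the first and last entries can miss a third variant, a more general check is to group the entries of a cluster by their exact set of column names. This is a sketch of an alternative, not the test used in this notebook; `group_by_key_set` is a hypothetical helper and the records are toy data.

```python
from collections import defaultdict


def group_by_key_set(records, indices):
    """Map each distinct set of column names to the indices that share it."""
    groups = defaultdict(list)
    for index in indices:
        # frozenset is hashable, so it can serve as a dictionary key.
        groups[frozenset(records[index].keys())].append(index)
    return dict(groups)


# Toy cluster: every record has two keys, but three different key sets occur.
toy_cluster = [
    {"a": 1, "b": 2},
    {"a": 3, "c": 4},
    {"a": 5, "b": 6},
    {"b": 7, "c": 8},
]
groups = group_by_key_set(toy_cluster, range(len(toy_cluster)))

for key_set, members in groups.items():
    print(sorted(key_set), members)
print("distinct key sets:", len(groups))
```

A cluster is uniform exactly when the result has a single group, and the approach does not depend on which entries happen to come first or last.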

In [13]:
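def test_df_colNames(case_arr):
    """Compare each entry's column names with the first and last entries of the cluster."""
    # DATA is a dict with keys 'items' and 'summary', so it cannot be indexed by
    # row number; the cluster indices refer to rows of DATA_summary.
    if not case_arr:
        return []
    first_keys = DATA_summary[case_arr[0]].keys()
    last_keys = DATA_summary[case_arr[-1]].keys()
    case_arr_test = []
    for i in case_arr:
        keys = DATA_summary[i].keys()
        if keys == first_keys:
            case_arr_test.append([True, 0])
        elif keys == last_keys:
            case_arr_test.append([True, -1])
        else:
            case_arr_test.append([False])
    return case_arr_test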

In [14]:
# Running the test on all clusters
# for i in range(len(cases)):
#     print(cases[i][0])
#     print(test_df_colNames(cases[i][1]))


We summarize results as follows:

|Case | Result| Meaning         |
|-----|-------|-----------------|
|4 well defined columns|  0|All entries equal to one another|
|81 well defined columns| 0|All entries equal to one another|
|82 well defined columns| 0|All entries equal to one another|
|84 well defined columns| 0,-1|Two subcases: see first and last entries in corresponding JSON file|
|85 well defined columns| 0,-1|Two subcases: see first and last entries in corresponding JSON file|
|109 well defined columns | 0 |All entries equal to one another|
