<h1 align='center'>Sanity Checks on API</h1>

<h4 align='center'>iReceptor $\mid$ Laura Gutierrez Funderburk $\mid$ October 21</h4>

<h4 align='center'>Supervised by Dr. Felix Breden, Dr. Jamie Scott, Dr. Brian Corrie</h4>

<h2 align='center'>Abstract</h2>

In this notebook I parse and study the JSON content found at https://ipa.ireceptor.org/v2/samples in order to run sanity checks on the data.

In [1]:
import requests
import json
import csv
import pandas as pd


In [2]:
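# Download the sample metadata and parse the JSON array into a list of dicts
DATA = json.loads(requests.get("https://ipa.ireceptor.org/v2/samples").text)
# Each JSON entry becomes one row; keys missing from an entry are left undefined
df = pd.DataFrame.from_dict(DATA)
print("Column Names")
cols = list(df.columns)
print(cols)
print("There are a total of " + str(len(cols)) + " columns")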

Column Names
['Paired_read_assembly', 'Sequencing_method', 'Software_versions', '_id', 'accession', 'adapter_sequence_forward ', 'adapter_sequence_reverse', 'age', 'age_event', 'anatomic_site', 'ancestry_population', 'antigen ', 'barcode_1', 'barcode_2', 'biomaterial_provider', 'bioproject_biosample_id', 'cell_isolation', 'cell_number', 'cell_phenotype', 'cell_processing_protocol', 'cell_quality', 'cell_storage', 'cell_subset', 'cell_tissue', 'cell_type', 'cells_per_reaction', 'collapsing_method', 'collected_by', 'collection_time_event', 'collection_time_point_reference', 'collection_time_point_relative', 'complete_sequences', 'data_processing_protocols', 'disease_diagnosis', 'disease_length', 'disease_stage', 'disease_state', 'disease_state_sample', 'ethnicity', 'experiment_id', 'experiment_name', 'fasta_file_name', 'forward_PCR_primer_target_location', 'forward_pcr_primer_target_location', 'forward_primers', 'gene_derivation_heavy', 'gene_derivation_light', 'germline_database', 'gran

In [3]:
print("Are the entries in the JSON file uniform?")
# Collect the distinct numbers of keys found across all entries
lengths = {len(entry) for entry in DATA}
print(lengths)

Are the entries in the JSON file uniform?
{4, 109, 81, 82, 84, 85}


The first thing we observe is that the content of each row is non-uniform. The entries in samples fall into groups according to the number of columns for which they define information: 4, 109, 81, 82, 84 or 85.

Defined information includes integers, floating-point values, strings and the JSON value null, which is transformed into None in the DataFrame.

Below we observe that the groups do not appear in any particular order. The only entry with just 4 well-defined columns sits at index 9 of the JSON array DATA (indices run from 0 to 549), and correspondingly at row 9 of the DataFrame df.
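The effect described above can be reproduced on a small scale. The following is a minimal sketch on made-up records (the values in `raw` are hypothetical and are not taken from the API): one record lacks a key and another carries an explicit JSON null. It shows how the key counts differ and how both situations show up as missing values once the records are loaded into a DataFrame.

```python
import json
import pandas as pd

# Hypothetical records, not taken from the API: the second one lacks "age"
# and the third one carries an explicit JSON null.
raw = '[{"_id": 1, "age": 30}, {"_id": 2}, {"_id": 3, "age": null}]'
records = json.loads(raw)

# json.loads maps JSON null to Python None.
null_value = records[2]["age"]

# Distinct key counts per record, the same check as in cell In [3].
key_counts = {len(record) for record in records}

# A key that is absent from a record, or set to None, is reported as missing.
toy_df = pd.DataFrame(records)
missing_per_row = toy_df["age"].isna().tolist()

print(null_value)
print(key_counts)
print(missing_per_row)
```

Note that `isna()` does not distinguish between a key that was absent and a key that was present with a null value, so counting keys on the raw JSON entries is the more direct way to tell the two apart.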

In [4]:
print("Cluster with 4 defined entries JSON data")
case_4 = [i for i in range(len(DATA)) if len(DATA[i].keys())==4]
print(case_4)
print("We find only one entry with 4 defined column names\n")
print("Next we explore what those columns are")
print(DATA[9])
print("\n")
print("What does this look like on the dataframe?")
print(df.loc[9])

Cluster with 4 defined entries JSON data
[9]
We find only one entry with 4 defined column names

Next we explore what those columns are
{'_id': 21, 'study_title': 'Measurement and Clinical Monitoring of Human Lymphocyte Clonality by Massively Parallel V-D-J Pyrosequencing', 'lab_name': 'Department of Pathology, Stanford University', 'ir_project_sample_id': 21}


What does this look like on the dataframe?
Paired_read_assembly                                                             NaN
Sequencing_method                                                                NaN
Software_versions                                                                NaN
_id                                                                               21
accession                                                                        NaN
adapter_sequence_forward                                                         NaN
adapter_sequence_reverse                                                         N

In [5]:
# Study different cases and obtain row number/JSON array number
cases = []
for item in lengths:
    # Indices of all entries that define exactly `item` keys
    cases.append([item, [i for i, entry in enumerate(DATA) if len(entry) == item]])
for n_keys, rows in cases:
    print("Case: " + str(n_keys) + ". Well-defined columns are on JSON entry(ies)/row number(s):\n" + str(rows) + "\n")


Case: 4. Well-defined columns are on JSON entry(ies)/row number(s):
[9]

Case: 109. Well-defined columns are on JSON entry(ies)/row number(s):
[389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419]

Case: 81. Well-defined columns are on JSON entry(ies)/row number(s):
[305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 457, 458, 459, 460, 464, 465, 466, 468, 

In [6]:
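def test_df_colNames(case_arr):
    """Compare the keys of every entry in case_arr with the first and last entry.

    Returns one item per entry: [True, 0] if its keys match those of the first
    entry, [True, -1] if they match those of the last entry, [False] otherwise.
    """
    first_keys = DATA[case_arr[0]].keys()
    last_keys = DATA[case_arr[-1]].keys()
    case_arr_test = []
    for index in case_arr:
        if DATA[index].keys() == first_keys:
            case_arr_test.append([True, 0])
        elif DATA[index].keys() == last_keys:
            case_arr_test.append([True, -1])
        else:
            case_arr_test.append([False])
    return case_arr_test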

In [7]:
# Case                     Result
#test_df_colNames(case_4)  0
#test_df_colNames(case_81) 0
#test_df_colNames(case_82) 0
#test_df_colNames(case_84) 0,-1
#test_df_colNames(case_85) 0,-1
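Because `test_df_colNames` compares each entry only with the first and last entry of a case, a third distinct set of column names would be reported as `[False]` without being identified. As an alternative, the sketch below groups entries by their exact set of keys. The helper `group_by_key_set` and the records in `toy_records` are hypothetical and serve only as illustration; they are not part of the original analysis.

```python
from collections import defaultdict


def group_by_key_set(records):
    """Map each distinct set of keys to the indices of the records that use it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        # frozenset is hashable, so it can serve as a dictionary key
        groups[frozenset(record)].append(index)
    return dict(groups)


# Hypothetical records: all have two keys, but not the same two keys.
toy_records = [
    {"_id": 1, "study_title": "A"},
    {"_id": 2, "lab_name": "B"},
    {"_id": 3, "study_title": "C"},
]

groups = group_by_key_set(toy_records)
for keys, rows in groups.items():
    print(sorted(keys), rows)
```

Applied to `DATA`, the number of groups returned would give the number of distinct column layouts directly, regardless of how many keys each layout has.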