
# JSON

**Dataset:** `titanic-parquet.json` (passenger records stored as one JSON object per line)

### Learning goals
- Understand JSON structure: objects `{}` vs arrays `[]`, keys & values, nesting
- Load JSON into Python and navigate it (lists & dicts)
- Convert JSON to a DataFrame for analysis
- Perform data manipulation tasks (filtering, grouping, aggregations)
- (Preview) See how this maps 1:1 to JSON returned by APIs
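
To make the first goal concrete, here is a minimal sketch using a small made-up JSON string (not the Titanic file). It shows how JSON arrays and objects map onto Python lists and dicts, and how nesting is navigated one level at a time.

```python
import json

# A small, made-up JSON document: an array holding two objects,
# one with a nested array and one with a nested object.
raw = '''
[
  {"name": "Ada", "age": 36, "tags": ["math", "code"]},
  {"name": "Bob", "age": null, "address": {"city": "Paris"}}
]
'''

records = json.loads(raw)              # JSON array -> Python list

print(type(records))                   # <class 'list'>
print(type(records[0]))                # <class 'dict'> (a JSON object)
print(records[0]["tags"][1])           # code  (array inside an object)
print(records[1]["address"]["city"])   # Paris (object inside an object)
print(records[1]["age"])               # None  (JSON null -> Python None)
```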



## Part A — Guided Practice

### Step 1 — Load the JSON
Choose **one** of the two options below:
1) **Upload** `titanic-parquet.json` from your computer  
2) If the file is already in your Colab runtime, just use the filename

> Tip: After loading, we'll quickly check types to understand structure.


In [None]:
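import json

# Each line of the file holds one complete JSON object,
# so we parse the file line by line and collect the records in a list.
data = []
with open('titanic-parquet.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))

# Quick structure check: a list of dicts is what we expect
type(data), type(data[0])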

(list, dict)


**Checkpoint:**  
- `type(data)` should be `list` (array of passenger objects)  
- `type(data[0])` should be `dict` (one passenger record)
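
The loading cell above parses the file one line at a time, which works when every line is a complete JSON object. Files that contain a single JSON array need one `json.load` call instead. The sketch below contrasts the two layouts using made-up in-memory text (via `io.StringIO`) so it runs without any file on disk.

```python
import io
import json

# Two made-up in-memory "files" holding the same two records.
array_text = '[{"Name": "A", "Age": 22}, {"Name": "B", "Age": 38}]'
lines_text = '{"Name": "A", "Age": 22}\n{"Name": "B", "Age": 38}\n'

# Layout 1: the whole file is one JSON array -> a single json.load call.
from_array = json.load(io.StringIO(array_text))

# Layout 2: one JSON object per line -> parse line by line,
# skipping blank lines so a trailing newline does not raise an error.
from_lines = [
    json.loads(line)
    for line in io.StringIO(lines_text)
    if line.strip()
]

print(from_array == from_lines)  # True
```

If `type(data[0])` ever comes back as `list` instead of `dict`, the file was probably a single JSON array read with the line-by-line approach.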



### Step 2 — Inspect the structure
We'll peek at the first record and list its keys.


In [None]:

first = data[0]
print("First passenger record:")
print({k: first[k] for k in list(first.keys())[:8]})  # preview some fields

print("\nAll keys in one passenger:")
print(list(first.keys()))


First passenger record:
{'PassengerId': '1', 'Survived': '0', 'Pclass': '3', 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22, 'SibSp': '1', 'Parch': '0'}

All keys in one passenger:
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']



### Step 3 — Accessing values
We'll access specific fields via Python's list/dict notation.


In [None]:

# Example: name, ticket, and survived status for the 1st passenger
p0 = data[0]
(p0['Name'], p0['Ticket'], p0['Survived'])


('Braund, Mr. Owen Harris', 'A/5 21171', '0')
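
Square-bracket access raises a `KeyError` when a key is absent. As a small sketch on a hand-written record (a trimmed copy of the first passenger, with `Cabin` deliberately left out), `dict.get` offers a safer alternative that returns `None` or a default of your choice.

```python
# Hand-written record for illustration; "Cabin" is intentionally missing.
passenger = {
    "Name": "Braund, Mr. Owen Harris",
    "Ticket": "A/5 21171",
    "Survived": "0",
}

print(passenger["Name"])                   # Braund, Mr. Owen Harris
print(passenger.get("Cabin"))              # None
print(passenger.get("Cabin", "unknown"))   # unknown

# Bracket access on a missing key raises KeyError.
try:
    passenger["Cabin"]
except KeyError as err:
    print("Missing key:", err)             # Missing key: 'Cabin'
```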

In [None]:

# Guided exercise: Access fields for passenger #10 (index 9)
# TODO: Replace the ... with the correct keys to access these values.
p9 = data[9]
name_10 = p9['...']       # TODO
age_10 = p9['...']        # TODO
pclass_10 = p9['...']     # TODO
survived_10 = p9['...']   # TODO

name_10, age_10, pclass_10, survived_10


KeyError: '...'


### Step 4 — Convert to a DataFrame
Because each passenger is a flat dictionary (no nested objects), conversion is straightforward.


In [None]:
import pandas as pd

df = pd.DataFrame(data)
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
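
The Titanic records are flat, so `pd.DataFrame(data)` is enough. For comparison, here is a hedged sketch of what can be done when records contain nested objects: `pd.json_normalize` flattens them into dotted column names. The nested layout below is invented for illustration and is not how the Titanic file is structured.

```python
import pandas as pd

# Invented nested records: the ticket details sit inside a sub-object.
nested = [
    {"Name": "A", "Ticket": {"Number": "113803", "Fare": 53.1}},
    {"Name": "B", "Ticket": {"Number": "373450", "Fare": 8.05}},
]

# json_normalize expands nested dicts into columns joined with "."
flat = pd.json_normalize(nested)

print(flat.columns.tolist())
print(flat)
```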


### Understanding Key Fields in the Titanic Dataset

| **Field** | **Meaning** | **Explanation** |
|-----------|------------|----------------|
| **SibSp** | Number of siblings or spouses aboard | How many **brothers, sisters, or husband/wife** this passenger was traveling with. |
| **Parch** | Number of parents or children aboard | Stands for **Par**ent + **Ch**ild → indicates family members like mother, father, son, or daughter traveling together. |
| **Ticket** | Passenger’s ticket identifier | A booking reference — multiple passengers may share the same ticket if they booked together. |
| **Fare** | Amount paid for the ticket | The **price of the ticket**, often related to class and economic status. |
| **Cabin** | Cabin number assigned | The **room location** on the ship — many lower-class passengers don't have this recorded (`null`). |
| **Embarked** | Port of boarding | Shows where the passenger got on the ship:<br>**C = Cherbourg (France)**<br>**Q = Queenstown (Ireland)**<br>**S = Southampton (UK)** |




### Step 5 — Filtering & selecting (guided)
Let's warm up with basic selectors.


In [None]:
df.dtypes


Unnamed: 0,0
PassengerId,object
Survived,object
Pclass,object
Name,object
Sex,object
Age,float64
SibSp,object
Parch,object
Ticket,object
Fare,float64
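
The `object` dtypes above mean that columns such as `Survived`, `Pclass`, `SibSp`, and `Parch` hold text, which is why the filters in this notebook compare against `'0'` and `'1'` rather than numbers. As an optional alternative, the sketch below converts such columns with `pd.to_numeric` on a small made-up DataFrame, leaving the real `df` untouched.

```python
import pandas as pd

# Made-up frame mimicking the string-typed columns seen in df.dtypes.
toy = pd.DataFrame({
    "Survived": ["0", "1", "1"],
    "Pclass": ["3", "1", "3"],
    "Age": [22.0, 38.0, None],
})

# Comparing text with a number matches nothing.
print((toy["Survived"] == 1).sum())        # 0

# Convert the chosen columns to numbers on a copy.
numeric_cols = ["Survived", "Pclass"]
converted = toy.copy()
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric)

print(converted.dtypes)
print((converted["Survived"] == 1).sum())  # 2
```

Either approach works as long as the comparison value matches the column's type.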


In [None]:

# Example: Female passengers traveling alone (no spouse/siblings and no parents/children)

female_solo_travelers = df[
    (df['Sex'] == 'female') &      # female only
    (df['SibSp'] == '0') &           # no siblings/spouse
    (df['Parch'] == '0')             # no parents/children
]

female_solo_travelers[['Name', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Survived']]




Unnamed: 0,Name,Pclass,Sex,SibSp,Parch,Survived
2,"Heikkinen, Miss. Laina",3,female,0,0,1
11,"Bonnell, Miss. Elizabeth",1,female,0,0,1
14,"Vestrom, Miss. Hulda Amanda Adolfina",3,female,0,0,0
15,"Hewlett, Mrs. (Mary D Kingcome)",2,female,0,0,1
19,"Masselmani, Mrs. Fatima",3,female,0,0,1
...,...,...,...,...,...,...
862,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",1,female,0,0,1
865,"Bystrom, Mrs. (Karolina)",2,female,0,0,1
875,"Najib, Miss. Adele Kiamie ""Jane""",3,female,0,0,1
882,"Dahlberg, Miss. Gerda Ulrika",3,female,0,0,0


In [None]:

# Example: Passengers with Age under 25
young_passengers = df[df['Age'] < 25]

young_passengers[['Name', 'Age', 'Survived']]


Unnamed: 0,Name,Age,Survived
0,"Braund, Mr. Owen Harris",22.0,0
7,"Palsson, Master. Gosta Leonard",2.0,0
9,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,1
10,"Sandstrom, Miss. Marguerite Rut",4.0,1
12,"Saundercock, Mr. William Henry",20.0,0
...,...,...,...
875,"Najib, Miss. Adele Kiamie ""Jane""",15.0,1
876,"Gustafsson, Mr. Alfred Ossian",20.0,0
877,"Petroff, Mr. Nedelio",19.0,0
882,"Dahlberg, Miss. Gerda Ulrika",22.0,0



---

## Part B — Individual Practice

Work through the tasks below on your own. Each task has a **hint**.  
Feel free to add cells as needed. After this section, there's an optional **Solutions** section.



### Q1 — List of names who **survived**
Create a Python list of names for all passengers who **survived**.

**Hint:** `Survived` is stored as text, so filter with `df['Survived'] == '1'`, then select the `Name` column and convert it to a list.


In [None]:

# Q1: Your work here
# survivors = ...
# survivors[:10]



### Q2 — Count of passengers by embarkation port
Compute how many passengers embarked from each port (`Embarked`).

**Hint:** `value_counts()` is your friend. Consider missing values.


In [None]:

# Q2: Your work here
# df['Embarked'].value_counts(dropna=False)



### Q3 — Average fare by passenger class
Compute the **mean fare** for each `Pclass`.

**Hint:** Use `groupby('Pclass')['Fare'].mean()`.


In [None]:

# Q3: Your work here
# avg_fare_by_class = ...
# avg_fare_by_class



### Q4 — New column `IsChild`
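Create a boolean column `IsChild` that is `True` if `Age < 12`, else `False`. Note that a plain comparison also returns `False` where `Age` is missing; keep those rows as NaN if you want to tell "not a child" apart from "age unknown".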

**Hint:** `df['Age'] < 12`
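
To see how missing ages behave, here is a minimal sketch on a made-up three-value Series (not the real `df`). It compares the plain comparison with one possible way of keeping missing ages as NaN, using `Series.where`.

```python
import pandas as pd

# Made-up ages: a child, an adult, and a missing value.
ages = pd.Series([4.0, 30.0, None], name="Age")

# Plain comparison: a missing age compares as False.
plain = ages < 12

# Keep the comparison result only where Age is present; elsewhere NaN.
keep_missing = (ages < 12).where(ages.notna())

print(plain.tolist())          # [True, False, False]
print(keep_missing.tolist())   # [True, False, nan]
```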


In [None]:

# Q4: Your work here
# df['IsChild'] = ...
# df[['Age','IsChild']].head(10)



### Q5 — Filter: female, Second Class, did **not** survive
Return a DataFrame subset for **female** passengers in **Second Class** who **did not survive**.

**Hint:** Combine three boolean conditions with `&`.


In [None]:

# Q5: Your work here
# subset = df[(...) & (...) & (...)]
# subset.head()







### Q6 – Let’s Find People Like **Jack** and **Rose**

Adjusting for historical accuracy:

- **Rose-type** → Female, First Class, **no children**, but *may be traveling with a parent (like Rose with her mother)*, age under 25, embarked at Southampton

  → `Sex == 'female'`, `Pclass == '1'`, `SibSp == '0'`, `Parch == '1'`, `Age < 25`, `Embarked == 'S'`

- **Jack-type** → Male, Third Class, **traveling alone** (no spouse/siblings, no parents/children), age under 25, embarked at Southampton

  → `Sex == 'male'`, `Pclass == '3'`, `SibSp == '0'`, `Parch == '0'`, `Age < 25`, `Embarked == 'S'`

#### Task:
1. Filter **Rose-type passengers**
2. Filter **Jack-type passengers**
3. Compare:
   -  How many "Roses" were on board?
   -  What percentage of them survived?
   -  Compare with "Jack-type" — does the data match the movie narrative?



In [None]:
df.dtypes

Unnamed: 0,0
PassengerId,object
Survived,object
Pclass,object
Name,object
Sex,object
Age,float64
SibSp,object
Parch,object
Ticket,object
Fare,float64


In [None]:

# Q6: Your work here






---

## API Preview
## Step 1 — Start with the Base API URL

We always begin by defining the **base endpoint**.  
This is like setting up the connection before we add filters or queries.


###  Why It's Important to Read API Documentation First

Before coding with an API, always look at its official documentation page:

🔗 **Example:** https://openlibrary.org/dev/docs/api/search

**Why this matters:**


- Shows which **parameters** we can use (like `q`, `limit`, `fields`)
- Tells us which **fields** exist, so we don’t have to guess
- Explains how the **response is structured**
- Saves time and prevents errors
- Without it, we might get empty or messy results

>  Think of API docs like an *instruction manual* — it tells you how to ask for data the right way.



In [None]:
#Code Cell — Store Base URL
import requests

# Define the base API endpoint (no parameters yet)
BASE_URL = "https://openlibrary.org/search.json"

# Make a basic call WITHOUT parameters
response = requests.get(BASE_URL)
data = response.json()

data  # Notice: no results, because we didn't tell it what to search for yet



{'numFound': 0,
 'start': 0,
 'numFoundExact': True,
 'num_found': 0,
 'q': '',
 'documentation_url': 'https://openlibrary.org/dev/docs/api/search',
 'docs': []}

> The result above comes back **empty** (`"docs": []`) because we didn't include a **query parameter**.

APIs work like **input → output** systems.
Just like a function needs arguments, **APIs need parameters** to know what to return.

Now, we'll add parameters **separately** using a dictionary instead of hardcoding them into the URL.
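
One reason to prefer a dictionary is that the library takes care of URL encoding (spaces, special characters) for us. The sketch below uses the standard library's `urlencode` to show roughly what the final URL looks like; it is an illustration of the encoding step, not a claim about exactly how `requests` builds its URLs internally.

```python
from urllib.parse import urlencode

BASE_URL = "https://openlibrary.org/search.json"
params = {"q": "harry potter", "limit": 1}

# Turn the dictionary into a query string; the space becomes "+".
query_string = urlencode(params)
full_url = f"{BASE_URL}?{query_string}"

print(full_url)
# https://openlibrary.org/search.json?q=harry+potter&limit=1
```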


In [None]:
# Step 2 — Add parameters using a dictionary instead of writing directly in the URL

params = {
    "q": "harry potter",  # our search query
    "limit": 1            # return a single result while testing
}

response = requests.get(BASE_URL, params=params)  # cleaner way
data = response.json()

data


{'numFound': 3754,
 'start': 0,
 'numFoundExact': True,
 'num_found': 3754,
 'documentation_url': 'https://openlibrary.org/dev/docs/api/search',
 'q': 'harry potter',
 'offset': None,
 'docs': [{'author_key': ['OL23919A'],
   'author_name': ['J. K. Rowling'],
   'cover_edition_key': 'OL25662116M',
   'cover_i': 10523466,
   'ebook_access': 'borrowable',
   'edition_count': 226,
   'first_publish_year': 2003,
   'has_fulltext': True,
   'ia': ['harypotterizakon0000rowl',
    'harrypotter5undd0000joan',
    'harrypotter05har0000joan',
    'harrypotterorder0000jkro',
    'harrypotterogfni0000unse',
    'harrypotteresfon0000rowl',
    'harispoterisirfe0000rowl',
    'harrypotterorder0000rowl_j7r2',
    'harrynbsppottery0005jkro',
    'harrypotterorder0000rowl_h2t4',
    'harrypotterorder0000rowl_h7u3',
    'harrypotterorder2004rowl',
    'harrypotterendeo0000rowl',
    'harrypotterorder0000rowl_g1n2',
    'harrypotterorder0000rowl_o5a3',
    'harripotteriorde0000rowl',
    'harrypotterorde

### Reading an API Response

The Open Library Search API returns a top-level dictionary:

- **`numFound` / `num_found`**: total number of matches (the whole catalog search)
- **`start`**: where this page of results begins (0-based)
- **`q`**: the query we asked for
- **`docs`**: **list** of result items (each item is a **work** record)

Each item in `docs` is a **dictionary** with fields such as:

- `title`: work title  
- `author_name`: list of author names  
- `first_publish_year`: year the work first appeared  
- `edition_count`: number of known editions  
- `ebook_access`: ebook status (`public`, `borrowable`, `printdisabled`, `no_ebook`)  
- `has_fulltext`: whether any text is digitized  
- `language`: list of language codes for editions (e.g., `eng`, `spa`)  
- `key`: unique Work key (open with `https://openlibrary.org` + key)  
- `cover_i`: numeric cover id (for cover image URLs)  
- `ia`: Internet Archive identifiers for available copies
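
Navigating this response uses the same list and dict skills as the Titanic data. The sketch below works on a trimmed, hand-copied version of the response shown earlier (only a few fields kept), so it runs without a network call. The helper name `summarize` is hypothetical and introduced only for illustration.

```python
# Trimmed copy of the search response shown earlier (a few fields only).
sample = {
    "numFound": 3754,
    "start": 0,
    "q": "harry potter",
    "docs": [
        {
            "title": "Harry Potter and the Order of the Phoenix",
            "author_name": ["J. K. Rowling"],
            "first_publish_year": 2003,
            "key": "/works/OL82548W",
        }
    ],
}


def summarize(doc):
    """Return a small, flat summary of one item from `docs`."""
    authors = doc.get("author_name") or []   # the field may be missing
    return {
        "title": doc.get("title"),
        "first_author": authors[0] if authors else None,
        "year": doc.get("first_publish_year"),
        "work_url": "https://openlibrary.org" + doc["key"],
    }


print(sample["numFound"], "matches;", len(sample["docs"]), "returned")
for doc in sample["docs"]:
    print(summarize(doc))
```

Using `.get` for optional fields avoids a `KeyError` on records that lack an author or a publication year.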


## From Search Results to a Specific Book

The initial API search returns **many works** (books, editions, translations, box sets).
To get **one specific book**, we can follow this workflow:

> **Search → Inspect Result → Identify Key → Drill Down**

###  Three Ways to Be More Specific

| Method | Description | Example |
|--------|-----------|--------|
| 1. Use structured parameters (`title=`, `author=`) | Server matches against a **specific field**, which is usually more precise than `q=` | `?title=Harry Potter and the Goblet of Fire&author=Rowling` |
| 2. Filter results locally with pandas | First get results, then narrow using string filters | `df[df["title"].str.contains("Goblet", case=False)]` |
| 3. Jump to full details using a **work key** | After search, grab `/works/OLXXXXXW` and fetch full JSON | `https://openlibrary.org/works/OL82563W.json` |
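
As a rough sketch of method 1, the hypothetical helper below (`build_search_params` is not part of any library) assembles a parameter dictionary from whichever structured fields are supplied and drops the ones left empty. The resulting dictionary could then be passed as `params=` to `requests.get`.

```python
def build_search_params(q=None, title=None, author=None, limit=10):
    """Build a params dict, leaving out any field that was not supplied."""
    candidates = {"q": q, "title": title, "author": author, "limit": limit}
    return {name: value for name, value in candidates.items() if value is not None}


params = build_search_params(
    title="Harry Potter and the Goblet of Fire",
    author="Rowling",
)
print(params)
```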


In [None]:
# Make an API request
params = {"q": "harry potter", "limit": 10}  # small limit
response = requests.get(BASE_URL, params=params).json()

# Collect unique keys across all docs (not just one)
all_keys = set()
for doc in response["docs"]:
    all_keys.update(doc.keys())

sorted(all_keys)


['author_key',
 'author_name',
 'cover_edition_key',
 'cover_i',
 'ebook_access',
 'edition_count',
 'first_publish_year',
 'has_fulltext',
 'ia',
 'ia_collection_s',
 'key',
 'language',
 'lending_edition_s',
 'lending_identifier_s',
 'public_scan_b',
 'title']

In [None]:
# Ask the server to return only the fields we need (smaller response)
params = {
    "q": "harry potter",
    "limit": 200,
    "fields": "title,author_name,first_publish_year,edition_count"
}

response = requests.get(BASE_URL, params=params)
data_filtered = response.json()

data_filtered["docs"]

[{'author_name': ['J. K. Rowling'],
  'edition_count': 226,
  'first_publish_year': 2003,
  'title': 'Harry Potter and the Order of the Phoenix'},
 {'author_name': ['J. K. Rowling'],
  'edition_count': 357,
  'first_publish_year': 1997,
  'title': "Harry Potter and the Philosopher's Stone"},
 {'author_name': ['J. K. Rowling'],
  'edition_count': 126,
  'first_publish_year': 2007,
  'title': 'Harry Potter and the Deathly Hallows'},
 {'author_name': ['J. K. Rowling'],
  'edition_count': 263,
  'first_publish_year': 1999,
  'title': 'Harry Potter and the Prisoner of Azkaban'},
 {'author_name': ['J. K. Rowling'],
  'edition_count': 276,
  'first_publish_year': 1998,
  'title': 'Harry Potter and the Chamber of Secrets'},
 {'author_name': ['J. K. Rowling', 'Mary GrandPré'],
  'edition_count': 151,
  'first_publish_year': 2005,
  'title': 'Harry Potter and the Half-Blood Prince'},
 {'author_name': ['J. K. Rowling', 'Jim Kay'],
  'edition_count': 231,
  'first_publish_year': 1993,
  'title': '

###  From Broad Search to a Specific Book — Three Precision Methods

We have already narrowed the request on the server side by combining `q=` with `fields=`, which reduces the response size.

Now let’s continue with:

---

### Method 2 — Filter Locally with pandas (Client-Side)

Once we get the `docs` into a DataFrame, we can search **inside** the results.


In [None]:

# Convert to DataFrame (if not already)
df_api = pd.DataFrame(data_filtered["docs"])

# Filter for a specific title keyword (case-insensitive)
df_api[df_api["title"].str.contains("harry", case=False, na=False)]

Unnamed: 0,author_name,edition_count,first_publish_year,title
0,[J. K. Rowling],226,2003.0,Harry Potter and the Order of the Phoenix
1,[J. K. Rowling],357,1997.0,Harry Potter and the Philosopher's Stone
2,[J. K. Rowling],126,2007.0,Harry Potter and the Deathly Hallows
3,[J. K. Rowling],263,1999.0,Harry Potter and the Prisoner of Azkaban
4,[J. K. Rowling],276,1998.0,Harry Potter and the Chamber of Secrets
...,...,...,...,...
195,[Book Magical],1,2020.0,Harry Potter Livre de Coloriage
196,"[Dorian LOUIS, Joanna LOUIS]",1,2020.0,Harry Potter Livre de Coloriage
197,,1,,HARRY POTTER
198,[Anelia James],1,2019.0,Harry Potter Coloring Book
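
Note that `author_name` holds a **list** per row (and is missing for some rows), so `.str.contains` does not apply to it directly. One possible approach, sketched on a small made-up DataFrame built from titles and authors seen above, is to test each cell with a helper function. The name `has_author` is hypothetical.

```python
import pandas as pd

# Made-up frame mimicking the API results: lists of authors, one missing.
toy = pd.DataFrame({
    "title": [
        "Harry Potter and the Half-Blood Prince",
        "Harry Potter Coloring Book",
        "HARRY POTTER",
    ],
    "author_name": [
        ["J. K. Rowling", "Mary GrandPré"],
        ["Anelia James"],
        None,
    ],
})


def has_author(authors, name):
    """True only if the cell is a list that contains the given name."""
    return isinstance(authors, list) and name in authors


mask = toy["author_name"].apply(lambda a: has_author(a, "J. K. Rowling"))
print(toy[mask]["title"].tolist())
# ['Harry Potter and the Half-Blood Prince']
```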


In [None]:
# Step 1: Make a request (limit results to keep it clean)
params = {"q": "harry potter", "limit": 10}
response = requests.get(BASE_URL, params=params).json()

# Step 2: Convert to DataFrame for better inspection
df_api_results = pd.DataFrame(response["docs"])

# Display only title + key so students can choose
df_api_results[["title", "key"]]



Unnamed: 0,title,key
0,Harry Potter and the Order of the Phoenix,/works/OL82548W
1,Harry Potter and the Philosopher's Stone,/works/OL82563W
2,Harry Potter and the Deathly Hallows,/works/OL82586W
3,Harry Potter and the Prisoner of Azkaban,/works/OL82536W
4,Harry Potter and the Chamber of Secrets,/works/OL82537W
5,Harry Potter and the Half-Blood Prince,/works/OL82565W
6,Harry Potter and the Goblet of Fire,/works/OL82560W
7,Harry Potter,/works/OL20874116W
8,Harry Potter (series) 1-7,/works/OL14981609W
9,Harry Potter,/works/OL21385222W


In [None]:
# Paste ONE work key from the df_api_results["key"] output (example shown below)
selected_key = "/works/OL82548W"  # <-- e.g. "/works/OL82548W"

# Only run this AFTER entering a key above!
if selected_key:
    detail_url = f"https://openlibrary.org{selected_key}.json"
    full_details = requests.get(detail_url).json()

    print(" Fetching details from:", detail_url)
    print(list(full_details.keys())[:12], "...")  # Preview main fields
else:
    print("Please set selected_key first by copying one from df_api_results['key']")


 Fetching details from: https://openlibrary.org/works/OL82548W.json
['description', 'links', 'title', 'covers', 'subject_places', 'first_publish_date', 'subject_people', 'key', 'authors', 'excerpts', 'type', 'subjects'] ...
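
In a work record, the `description` field may arrive either as a plain string or as a dictionary with a `value` key, which is why it needs a type check before printing. A minimal sketch of that check as a reusable function follows; the name `get_description` and the sample inputs are made up for illustration.

```python
def get_description(details):
    """Return the description text from a work record, or None if absent."""
    description = details.get("description")
    if isinstance(description, dict):
        # Wrapped form: the text sits under the "value" key.
        return description.get("value")
    # Plain string, or None when the field is missing.
    return description


print(get_description({"description": "Plain text"}))            # Plain text
print(get_description({"description": {"value": "Wrapped"}}))    # Wrapped
print(get_description({"title": "No description here"}))         # None
```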


In [None]:
# Print important fields from full_details

print("Title:", full_details.get("title"))
print("First Publish Date:", full_details.get("first_publish_date"))
print("Description:",
      full_details.get("description").get("value")
      if isinstance(full_details.get("description"), dict)
      else full_details.get("description"))

print("Subject places:", full_details.get("subject_places"))
print("Cover IDs:", full_details.get("covers"))


Title: Harry Potter and the Order of the Phoenix
First Publish Date: 2003
Description: After the Dementors’ attack on his cousin Dudley, Harry knows he is about to become Voldemort’s next target.

Although many are denying the Dark Lord’s return, Harry is not alone, and a secret order is gathering at Grimmauld Place to fight against the Dark forces.

Meanwhile, Voldemort’s savage assaults on Harry’s mind are growing stronger every day.

He must allow Professor Snape to teach him to protect himself before he runs out of time.
([source][1])


----------
This work has also been published in multiple volumes. See:

 - [Harry Potter and the Order of the Phoenix: III](https://openlibrary.org/works/OL17937113W/Harry_Potter_and_the_Order_of_the_Phoenix_Chapters_17-23)
 - [Harry Potter and the Order of the Phoenix: IV](https://openlibrary.org/works/OL17915213W/Harry_Potter_and_the_Order_of_the_Phoenix_Chapters_24-30)

  [1]: https://www.jkrowling.com/book/harry-potter-order-phoe