# 📦 Restructuring JSON Data

## 🧪 Objective

This notebook **explores** and **analyzes** the structure of the `books++` JSON file. We'll look closely at how the data is organized, identify any **inconsistencies 🐛 or bugs**, and finally **restructure 🛠️** the data into a cleaner, more usable format.

---

## 🔍 What we'll do

- Use **Python libraries** like `json`, `pandas`, and `matplotlib` to **read**, **visualize**, and **manipulate** the data.
- Generate **graphs 📊 and visual outputs** to get a better understanding of how the data is structured.
- Spot any unusual patterns, missing values, or incorrect formats.
- Propose and apply a **clean and logical structure** for future use.

### Reading the JSON File and Checking Its Structure

In [26]:
import json  # 📚 To handle JSON data
from pprint import pprint  # 👓 Pretty printing nested dictionaries
import os  # 📁 To check file existence

# 🔹 Step 1: Load the JSON file
filename = "books++.json"  # Replace with your actual filename

# ✅ Check if file exists
if not os.path.exists(filename):
    print("❌ File not found:", filename)
else:
    with open(filename, 'r', encoding='utf-8') as file:
        try:
            data = json.load(file)  # 📥 Load JSON content into Python dict
            print("✅ JSON file loaded successfully!\n")
        except json.JSONDecodeError as e:
            print("❌ Error decoding JSON:", e)
            data = None

    # 🔹 Step 2: Print the data structure
    if data is not None:  # An empty list or dict is still valid JSON worth checking
        print("📜 Here's a preview of the JSON content:\n")
        pprint(data, width=120)

        # 🔹 Step 3: Check for inconsistencies 🔍
        print("\n🔍 Checking for structural inconsistencies...")

        # Let's assume the JSON should be a list of book entries 📚
        if not isinstance(data, list):
            print("⚠️ Expected a list of books, but found:", type(data).__name__)
        else:
            required_keys = {"title", "author", "year", "genre"}  # 🧩 Expected fields
            for i, book in enumerate(data):
                if not isinstance(book, dict):
                    print(f"❌ Entry {i} is not a dictionary.")
                    continue

                keys = set(book.keys())
                missing = required_keys - keys
                extra = keys - required_keys

                if missing:
                    print(f"⚠️ Book {i} is missing fields: {missing}")
                if extra:
                    print(f"ℹ️ Book {i} has extra unexpected fields: {extra}")
                # Optional: Check data types
                if "year" in book and not isinstance(book["year"], int):
                    print(f"🔁 Book {i} has non-integer 'year': {book['year']}")

❌ Error decoding JSON: Extra data: line 15 column 3 (char 2518)


### Error Identified: The books JSON has extra data. This means either two JSON objects were pasted one after another without a separating `,`, or a stray symbol was added.

Now let's see what the exact problem is.
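
A quick way to see where parsing stops is to read the position attributes that `json.JSONDecodeError` carries (`msg`, `pos`, `lineno`, `colno`). The sketch below is only an illustration: the two-object `sample` string and the helper name `locate_json_error` are assumptions made for this example, not part of the notebook's data or code.

```python
import json

def locate_json_error(text, window=20):
    """Return details about the first JSON parse error in text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        start = max(e.pos - window, 0)
        return {
            "msg": e.msg,          # short description, e.g. "Extra data"
            "pos": e.pos,          # character offset where parsing failed
            "lineno": e.lineno,    # 1-based line of the failure
            "colno": e.colno,      # 1-based column of the failure
            "before": text[start:e.pos],        # text just before the offset
            "after": text[e.pos:e.pos + window],  # text starting at the offset
        }
    return None

# Hypothetical two-object sample standing in for the start of the real file
sample = '{"_id": 1, "title": "Book A"}\n{"_id": 2, "title": "Book B"}'

info = locate_json_error(sample)
print("Message:", info["msg"])
print("Line", info["lineno"], "column", info["colno"], "offset", info["pos"])
print("Text at the failure point:", repr(info["after"]))
```

If the text at the failure point begins with `{`, a second object starting right after the first is the likely cause.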

In [28]:
# 🔹 Read the raw JSON text as a string
with open("books++.json", "r", encoding="utf-8") as f:
    raw = f.read()

print("📄 Raw content length:", len(raw), "characters")

# 🔍 Try parsing it from the start — we'll stop at first error
decoder = json.JSONDecoder()

try:
    # This returns a tuple: (parsed_object, index_where_it_stopped)
    parsed_data, idx = decoder.raw_decode(raw)
    print("✅ Successfully decoded JSON up to character index:", idx)
    
    # Check if there’s extra junk after the JSON
    leftover = raw[idx:].strip()
    
    if leftover:
        print("\n⚠️ Found extra characters after valid JSON:")
        print("─────────────────────────────────────")
        print(leftover[:200] + ("..." if len(leftover) > 200 else ""))
        print("─────────────────────────────────────")

        print("\n🧹 Removing extra part and using only the valid JSON...\n")
    else:
        print("✅ No extra data found. JSON is clean!")

    # Print the cleaned JSON object
    print("📚 Cleaned JSON content:")
    from pprint import pprint
    pprint(parsed_data, width=120)

except json.JSONDecodeError as e:
    print("❌ Still couldn't decode JSON properly:", e)


📄 Raw content length: 552350 characters
✅ Successfully decoded JSON up to character index: 2517

⚠️ Found extra characters after valid JSON:
─────────────────────────────────────
{
    "_id": 2,
    "title": "Android in Action, Second Edition",
    "isbn": "1935182722",
    "pageCount": 592,
    "publishedDate": {
        "$date": "2011-01-14T00:00:00.000-0800"
    },
    "thu...
─────────────────────────────────────

🧹 Removing extra part and using only the valid JSON...

📚 Cleaned JSON content:
{'_id': 1,
 'authors': ['W. Frank Ableson', 'Charlie Collins', 'Robi Sen'],
 'categories': ['Open Source', 'Mobile'],
 'isbn': '1933988673',
 'longDescription': 'Android is an open source mobile phone platform based on the Linux operating system and developed '
                    'by the Open Handset Alliance, a consortium of over 30 hardware, software and telecom companies '
                    'that focus on open standards for mobile devices. Led by search giant, Google, Android is designed

### ✅ **Problem Explanation**

We're encountering the error:

```
❌ Error decoding JSON: Extra data: line 15 column 3 (char 2518)
```

This means **multiple JSON objects** are present **one after another** without being enclosed in a single array (`[...]`). A JSON document must contain exactly one top-level value, typically a single object `{}` or a single array `[]`. So this input is **not valid JSON**:

```json
{ "_id": 1, ... }
{ "_id": 2, ... }
{ "_id": 3, ... }
```

Instead, it **must be wrapped in an array** like this:

```json
[
  { "_id": 1, ... },
  { "_id": 2, ... },
  { "_id": 3, ... }
]
```
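
The difference between the two layouts can be checked directly with `json.loads`. This is a minimal sketch: the helper `parses_as_json` and the miniature strings are hypothetical stand-ins for the real file.

```python
import json

def parses_as_json(text):
    """Return True if text is a single valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical miniature versions of the two layouts shown above
concatenated = '{"_id": 1}\n{"_id": 2}\n{"_id": 3}'
wrapped = '[{"_id": 1}, {"_id": 2}, {"_id": 3}]'

print("Concatenated objects parse:", parses_as_json(concatenated))
print("Wrapped array parses:", parses_as_json(wrapped))
```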

### 📘 JSON Cleaner Script

This script fixes JSON files that contain multiple JSON objects placed one after another without an enclosing array.

### ❓ Problem
Some JSON data exports (especially from MongoDB or other NoSQL databases) look like this:
```json
{ "_id": 1, "title": "Book A" }
{ "_id": 2, "title": "Book B" }
```
This is **not valid JSON** for standard parsers.

### 🛠️ Solution
The script:
- Reads the raw file.
- Uses regex to extract all top-level JSON objects.
- Wraps them in a proper array (`[ ... ]`).
- Saves the cleaned JSON in a new file `cleaned_books++.json`.

In [31]:
import json
import re  # Needed for re.findall below

# Read raw JSON content from file
with open("books++.json", "r", encoding="utf-8") as file:
    raw_content = file.read()

# Step 1: Find all individual JSON objects using regex.
# Each match starts at {"_id": <number> and ends at a } that is followed
# (after optional whitespace) by the next { or by the end of the file.
json_objects = re.findall(r'{\s*"_id"\s*:\s*\d+.*?}\s*(?={|$)', raw_content, re.DOTALL)

# Step 2: Wrap them in a list format
wrapped_json = "[" + ",".join(json_objects) + "]"

# Step 3: Decode cleaned JSON
try:
    data = json.loads(wrapped_json)
except json.JSONDecodeError as e:
    print("❌ Still error:", e)
else:
    print("✅ Successfully parsed JSON with", len(data), "items.")

    # Step 4: (Optional) Save to fixed file, only if parsing succeeded
    with open("cleaned_books++.json", "w", encoding="utf-8") as fixed_file:
        json.dump(data, fixed_file, indent=4)
    print("\n====== Saved cleaned JSON file as cleaned_books++.json =======")

✅ Successfully parsed JSON with 399 items.
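
The regex approach depends on where braces happen to fall in the text, so a `} {` sequence inside a string value could split an object in the wrong place. As an alternative, here is a minimal sketch that reuses `JSONDecoder.raw_decode` from the earlier cell in a loop, so the JSON parser itself decides where each object ends. The function name `iter_json_objects` and the sample data are assumptions for illustration; this is not the method used to produce the output above.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from text that holds several concatenated values."""
    decoder = json.JSONDecoder()
    idx = 0
    length = len(text)
    while idx < length:
        # raw_decode does not skip leading whitespace, so advance past it first
        while idx < length and text[idx].isspace():
            idx += 1
        if idx >= length:
            break
        # Returns the parsed value and the index just past it in text
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

# Hypothetical sample with a nested object and braces inside a string value
sample = '''
{"_id": 1, "title": "Book A", "publishedDate": {"$date": "2009-04-01"}}
{"_id": 2, "title": "Braces } { in a title", "authors": []}
'''

books = list(iter_json_objects(sample))
print(len(books), "objects parsed")
```

For the real file, passing `raw_content` to `iter_json_objects` and wrapping the result in `list(...)` should give a list that can go to `json.dump` in the same way as `data`. A `json.JSONDecodeError` raised inside the loop would point at the first position that is genuinely malformed.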

