<font style='font-size:1.5em'>**🔖 Week 07 Lecture (Appendix)**</font><br>
<font style='font-size:1.3em;color:#888888'>Life is hard without `pd.json_normalize()`</font>

<font style='font-size:1.2em'>LSE [DS105A](https://lse-dsi.github.io/DS105/autumn-term/index.html){style="font-weight:bold"} – Data for Data Science (2024/25) </font>

**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE:** <font style="color:#e26a4f;font-weight:bold;">THIS IS A NOTEBOOK ON WHAT NOT TO DO when working with JSON data in the context of this course.</font> In the [main notebook](./LSE_DS105A_2024_W07_lecture.ipynb), you saw how to 'unnest' a JSON file using `pd.json_normalize()`. Here, I will show you how much more complicated it would be to do the same without it.

**Why?**  Thinking more deeply about the conversion from a complex JSON to tabular data might help you better appreciate how `pd.json_normalize()` works and why it is so useful. You will also discover a few new Pandas tricks along the way.

**REFERENCES:**

- The [`DataFrame.transpose()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html) to convert rows to columns and vice-versa.

- The [`pd.concat()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to concatenate a list of DataFrames and produce a single DataFrame.

---

In [1]:
import pandas as pd

**SAMPLE DATA**

I will use the same JSON data as before:

In [2]:
json_one_location = {
  "latitude": 51.50853,
  "longitude": -0.12574,
  "generationtime_ms": 0.154972076416016,
  "utc_offset_seconds": 0,
  "timezone": "GMT",
  "timezone_abbreviation": "GMT",
  "elevation": 23,
  "daily_units": {
    "time": "iso8601",
    "weather_code": "wmo code",
    "daylight_duration": "s",
    "sunshine_duration": "s"
  },
  "daily": {
    "time": [
      "2024-10-26",
      "2024-10-27",
      "2024-10-28",
      "2024-10-29",
      "2024-10-30",
      "2024-10-31",
      "2024-11-01",
      "2024-11-02",
      "2024-11-03",
      "2024-11-04",
      "2024-11-05",
      "2024-11-06",
      "2024-11-07",
      "2024-11-08",
      "2024-11-09"
    ],
    "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
    "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
    "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
  }
}

In [3]:
json_two_locations = [
  {
    "latitude": 51.50853,
    "longitude": -0.12574,
    "generationtime_ms": 0.200033187866211,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 23,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
      "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
      "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
    }
  },
  {
    "latitude": 48.85341,
    "longitude": 	2.3488,
    "generationtime_ms": 0.160098075866699,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 43,
    "location_id": 1,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [55, 53, 3, 3, 3, 3, 3, 51, 3, 3, 3, 3, 3, 3, 3],
      "daylight_duration": [36679.21, 36477.63, 36276.98, 36077.46, 35879.25, 35682.56, 35487.58, 35294.54, 35103.64, 34915.11, 34729.17, 34546.06, 34366.02, 34188.94, 34013.09],
      "sunshine_duration": [17721.39, 1329.73, 14961.37, 3869.25, 13851.97, 17390.9, 130.73, 0, 17079.12, 17671.44, 0, 9888.03, 0, 0, 10666.35]
    }
  }
]

# 1. Unnesting a nested JSON: an exercise in data manipulation

[🤔 **Think about it:** What would it take to convert the original JSON data to a nicely shaped data frame if I didn't know about `pd.json_normalize()`?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

Let me do this step by step:

In [4]:
# Instead of json_one_location['daily'], let me work with the original dictionary
df = pd.DataFrame(json_one_location)

When I look at all of the columns I realise that I just care about the `daily` key.

In [5]:
df.columns

Index(['latitude', 'longitude', 'generationtime_ms', 'utc_offset_seconds',
       'timezone', 'timezone_abbreviation', 'elevation', 'daily_units',
       'daily'],
      dtype='object')

 I can drop the rest of the columns:

## 1.1 Subsetting the DataFrame

**Alternative 01:** Filter the DataFrame so it only contains the columns you want.

In [6]:
selected_columns = ['daily']

df[selected_columns]

Unnamed: 0,daily
time,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


In [7]:
# This does the same thing as the previous cell, but in a single line.
df[['daily']]

Unnamed: 0,daily
time,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


**Alternative 02:** Drop the columns you do not want with the `drop()` method.

In [8]:
invalid_columns = [column for column in df.columns if column != 'daily']

df.drop(columns=invalid_columns)

Unnamed: 0,daily
time,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


Of course, **Alternative 01** makes a lot more sense in this case because we can manually specify the columns we want to keep.

💡 In the future you might find yourself in a situation where you have a DataFrame with hundreds of columns and you want to drop all of them except for a few. In that case, you can use the `drop()` method above.
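
As a rough illustration of the two alternatives side by side, the sketch below builds a small hypothetical DataFrame (the column names `station`, `temp`, `debug_a` and `debug_b` are made up for this example) and checks that selecting and dropping reach the same result. Neither approach modifies the original DataFrame.

```python
import pandas as pd

# Hypothetical table; the column names are invented for illustration
wide = pd.DataFrame({
    "station": ["A", "B"],
    "temp": [11.2, 9.8],
    "debug_a": [0, 0],
    "debug_b": [1, 1],
})

# Alternative 01: name the columns to keep
kept = wide[["station", "temp"]]

# Alternative 02: name the columns to discard
dropped = wide.drop(columns=["debug_a", "debug_b"])

# Both routes produce the same DataFrame, and `wide` itself is unchanged
print(kept.equals(dropped))
print(wide.shape)
```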

## 1.2 Transposing the DataFrame

Essentially, we want to transform the **rows into columns**. Helpfully, each row already has a good name suitable for a column name.

Whenever you want to do this, you can use the `transpose()` method:

In [9]:
# You can use the transpose method to swap the rows and columns:
df[['daily']].transpose()

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
daily,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


In [10]:
# Or just use T if you prefer:
df[['daily']].T

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
daily,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."
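
If you are wondering where those neat row names came from: when `pd.DataFrame()` receives a dictionary whose values are themselves dictionaries, the inner keys become the row index. Here is a minimal sketch, using a shortened, hypothetical stand-in for `json_one_location` (two days, two variables):

```python
import pandas as pd

# Shortened, hypothetical stand-in for json_one_location
small = {
    "latitude": 51.50853,
    "daily": {
        "time": ["2024-10-26", "2024-10-27"],
        "weather_code": [3, 51],
    },
}

small_df = pd.DataFrame(small)

# The inner keys of 'daily' become row labels;
# the scalar latitude is repeated on every row
print(small_df.index.tolist())

# Transposing the 'daily' column turns those row labels into column names
transposed = small_df[["daily"]].T
print(transposed.shape)
print(transposed.columns.tolist())
```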


## 1.3 Exploding lists into separate rows

Now, notice that we ended up with a DataFrame where each cell contains a list. You will find yourself in this situation every now and then when working with JSON data.

The `DataFrame.explode()` method is a great way to handle this situation. It expects to receive the column name (as a string) or names (as a list of strings) that you want to explode out into separate rows.

In [11]:
selected_columns = ['daily']
columns_to_explode = ['time', 'weather_code', 'daylight_duration', 'sunshine_duration']

# Explode the columns that contain lists:
df[selected_columns].T.explode(columns_to_explode)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
daily,2024-10-26,3,35989.18,14002.7
daily,2024-10-27,51,35766.05,31127.43
daily,2024-10-28,51,35543.86,5811.8
daily,2024-10-29,51,35322.81,10787.3
daily,2024-10-30,3,35103.12,19920.14
daily,2024-10-31,3,34885.01,14421.7
daily,2024-11-01,3,34668.7,14437.86
daily,2024-11-02,51,34454.44,0.0
daily,2024-11-03,3,34242.45,1098.29
daily,2024-11-04,3,34032.99,8162.58


And that's it! Apart from the repeated `daily` index label, we achieved the same table as the one we would get from `pd.DataFrame(json_one_location['daily'])`.
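
Two details of the exploded output are worth a follow-up: every row still carries the index label `daily`, and exploded columns generally come back with the generic `object` dtype. The sketch below works on a shortened, hypothetical version of the data, resets the index and converts the types. Whether you need these conversions depends on what you plan to do next.

```python
import pandas as pd

# Shortened, hypothetical version of the 'daily' data (three days, two variables)
nested = pd.DataFrame({
    "daily": {
        "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
        "weather_code": [3, 51, 51],
    }
}).T

# Explode the list columns, then replace the repeated 'daily' label with 0, 1, 2, ...
tidy = nested.explode(["time", "weather_code"]).reset_index(drop=True)

# Convert the exploded columns to more specific dtypes
tidy["time"] = pd.to_datetime(tidy["time"])
tidy["weather_code"] = pd.to_numeric(tidy["weather_code"])

print(tidy.dtypes)
```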

Take-home message:

- **Always be mindful of the structure of your data**

- If you have a flat JSON, you can easily convert it into a DataFrame, no need to worry about anything else.

- If you have a slightly nested JSON, you can still convert it into a DataFrame, but you might need some extra work (subsetting, transposing and exploding, as above) to get it into a tidy shape.

# 2. Normalising more complex JSON data

Now, let me complicate things a bit further. Sometimes, you will find yourself with JSON data that is more complex and has more than one level of nesting.

To demonstrate this, I will use the `json_two_locations` object, which contains the weather data for two locations at once.

[🤔 **Think about it:** What would you expect to happen when you convert the `json_two_locations` object into a DataFrame?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

In [12]:
# Your recurrent reminder that JSON objects can be either dictionaries or lists.
type(json_two_locations)

list

What happens when I try to create a DataFrame for the entire JSON response?

In [13]:
pd.DataFrame(json_two_locations)

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units,daily,location_id
0,51.50853,-0.12574,0.200033,0,GMT,GMT,23,"{'time': 'iso8601', 'weather_code': 'wmo code'...","{'time': ['2024-10-26', '2024-10-27', '2024-10...",
1,48.85341,2.3488,0.160098,0,GMT,GMT,43,"{'time': 'iso8601', 'weather_code': 'wmo code'...","{'time': ['2024-10-26', '2024-10-27', '2024-10...",1.0
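
A side note on the table above: when `pd.DataFrame()` receives a list of dictionaries, each dictionary becomes one row. A key that is missing from some of the dictionaries (here, `location_id`) is filled with `NaN`, which also turns that column into floats. A minimal sketch with made-up records:

```python
import pandas as pd

# Made-up records; only the second one has a 'location_id'
records = [
    {"latitude": 51.50853, "elevation": 23},
    {"latitude": 48.85341, "elevation": 43, "location_id": 1},
]

records_df = pd.DataFrame(records)

# The first row has no location_id, so pandas stores NaN there
print(records_df["location_id"].isna().tolist())

# NaN is a float, so the whole column becomes float64
print(records_df["location_id"].dtype)
```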


⚡ **OH NO!** This time around the code we built above will not work. When I select just the `daily` column, I no longer have those neat index names that I can use as column names:

In [14]:
# If you try to transpose after this line, things will get messy
pd.DataFrame(json_two_locations)[selected_columns]

Unnamed: 0,daily
0,"{'time': ['2024-10-26', '2024-10-27', '2024-10..."
1,"{'time': ['2024-10-26', '2024-10-27', '2024-10..."


## 2.1 The long and winding road (NOT A GOOD PRACTICE -- BUT IT WORKS)

If your Internet is down and you can't look up any documentation, Google or ChatGPT, then you can always take the long way around and use `for` loops to rebuild the DataFrame.

After all, in case of panic, **DataFrames can always be converted to a mix of lists and dictionaries**:

### Converting a DataFrame back to 'pure Python'

Using the method below, each row of the DataFrame becomes a list, and each of those lists holds a single dictionary (the content of the `daily` cell).

In [15]:
# I will save that to a variable so I can use it later.
output = pd.DataFrame(json_two_locations)[selected_columns].values.tolist()

len(output)

2

In [16]:
print(f"The first element is a {type(output[0])} and the second element is also a {type(output[1])}.")
print(f"The first element has {len(output[0])} element and the second element has {len(output[1])} element.")

The first element is a <class 'list'> and the second element is also a <class 'list'>.
The first element has 1 element and the second element has 1 element.
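
To see why each element is a one-item list, it may help to run the same conversion on a tiny, hypothetical one-column DataFrame. `.values` gives a two-dimensional array (rows by columns), so `.tolist()` returns one inner list per row, with one entry per column.

```python
import pandas as pd

# Tiny, hypothetical one-column DataFrame whose cells are dictionaries
tiny = pd.DataFrame({
    "daily": [
        {"time": ["2024-10-26"]},
        {"time": ["2024-10-27"]},
    ]
})

rows = tiny[["daily"]].values.tolist()

# One inner list per row; each inner list has one item because there is one column
print(len(rows))
print(type(rows[0]), len(rows[0]))
print(type(rows[0][0]))
```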


### Rebuilding by concatenating manually

Therefore, I could convert each of those one-element lists into its own DataFrame and then concatenate the two DataFrames into a single one:

In [17]:
pd.concat([pd.DataFrame(output[0]), pd.DataFrame(output[1])])

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
0,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."
0,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[55, 53, 3, 3, 3, 3, 3, 51, 3, 3, 3, 3, 3, 3, 3]","[36679.21, 36477.63, 36276.98, 36077.46, 35879...","[17721.39, 1329.73, 14961.37, 3869.25, 13851.9..."
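
Note that both rows in the output above carry the index label `0`: `pd.concat()` keeps each input's own index unless told otherwise. A small sketch with two made-up one-row DataFrames shows the default behaviour next to the `ignore_index=True` option:

```python
import pandas as pd

# Two made-up one-row DataFrames, each with its own index starting at 0
first = pd.DataFrame({"weather_code": [3]})
second = pd.DataFrame({"weather_code": [55]})

# Default: the original index labels are kept, so 0 appears twice
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True: the result gets a fresh 0, 1, ... index
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())
```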


### Exploding so we have a conventional, unnested DataFrame

After which I could try to explode this DataFrame as we did before.

In [18]:
pd.concat([pd.DataFrame(output[0]), pd.DataFrame(output[1])]).explode(columns_to_explode)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
0,2024-10-26,3,35989.18,14002.7
0,2024-10-27,51,35766.05,31127.43
0,2024-10-28,51,35543.86,5811.8
0,2024-10-29,51,35322.81,10787.3
0,2024-10-30,3,35103.12,19920.14
0,2024-10-31,3,34885.01,14421.7
0,2024-11-01,3,34668.7,14437.86
0,2024-11-02,51,34454.44,0.0
0,2024-11-03,3,34242.45,1098.29
0,2024-11-04,3,34032.99,8162.58


**Success?! Not quite.** As the 'daily' key didn't have any information about the location, we lost that information.

Unlike the previous example, where we had a single location, we now have two locations and we need to keep track of which row belongs to which location.

### Manually adding the location information (DEFINITELY NOT A GOOD PRACTICE!!!)

While the code below works, this is very error-prone. It requires a lot of faith on the part of the programmer that the data is correct and that the code is doing what it is supposed to do.

You've been coding for a few weeks now, so you are probably familiar with the concept of 'oops, I forgot to add a comma here' or 'I forgot to add a plus sign there'. Imagine how much more complex it is to keep track of all of the different variables and the different data types.

<font style="color:#e26a4f;font-weight:bold;">**AVOID MANUALLY ADDING DATA TO DATAFRAMES AT ALL COSTS!**</font>

In [19]:
df = pd.concat([pd.DataFrame(output[0]), pd.DataFrame(output[1])]).explode(columns_to_explode)

print(f"The DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.")

The DataFrame has 30 rows and 4 columns.


I know that the first half of the DataFrame belongs to the first location and the second half belongs to the second location.

I also know (I think?!) that the first location is London and the second location is Paris.

I can manually add this information to the DataFrame:

In [20]:
# This hurts my eyes. Never use hard-coded values like this.
df['location_id'] = ['London'] * 15 + ['Paris'] * 15

In [5]:
["London"] * 15 + ["Paris"] * 15

['London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris',
 'Paris']
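
Part of what makes this so fragile: pandas only checks that the list has the right length, not that the labels are in the right order. A sketch with a made-up four-row DataFrame (the `temp` values and the city order are assumptions for this example):

```python
import pandas as pd

# Made-up data: four rows, and we *believe* the first two belong to London
temps = pd.DataFrame({"temp": [12.0, 13.0, 10.0, 11.0]})

# A list of the wrong length is rejected with a ValueError...
try:
    temps["city"] = ["London"] * 2 + ["Paris"]
except ValueError as err:
    length_error = str(err)
    print("Rejected:", length_error)

# ...but a list of the right length is accepted without question,
# even if our belief about the row order turns out to be wrong
temps["city"] = ["London"] * 2 + ["Paris"] * 2
print(temps)
```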

In [21]:
# It's a bittersweet victory:
df

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration,location_id
0,2024-10-26,3,35989.18,14002.7,London
0,2024-10-27,51,35766.05,31127.43,London
0,2024-10-28,51,35543.86,5811.8,London
0,2024-10-29,51,35322.81,10787.3,London
0,2024-10-30,3,35103.12,19920.14,London
0,2024-10-31,3,34885.01,14421.7,London
0,2024-11-01,3,34668.7,14437.86,London
0,2024-11-02,51,34454.44,0.0,London
0,2024-11-03,3,34242.45,1098.29,London
0,2024-11-04,3,34032.99,8162.58,London
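
For comparison, here is a sketch of the same loop-based idea without hard-coded labels: each location's identifying fields are read from its own dictionary before the pieces are concatenated. It uses a shortened, hypothetical stand-in for `json_two_locations` (three days, one variable), and it identifies locations by `latitude` and `longitude` because the data itself carries no city names.

```python
import pandas as pd

# Shortened, hypothetical stand-in for json_two_locations
locations = [
    {
        "latitude": 51.50853,
        "longitude": -0.12574,
        "daily": {
            "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
            "weather_code": [3, 51, 51],
        },
    },
    {
        "latitude": 48.85341,
        "longitude": 2.3488,
        "daily": {
            "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
            "weather_code": [55, 53, 3],
        },
    },
]

frames = []
for location in locations:
    # A dictionary of equal-length lists converts directly to a DataFrame
    daily_df = pd.DataFrame(location["daily"])
    # Scalars are repeated on every row, so each row knows where it came from
    daily_df["latitude"] = location["latitude"]
    daily_df["longitude"] = location["longitude"]
    frames.append(daily_df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because the coordinates travel with the data, nothing breaks if the API returns the locations in a different order or with a different number of days.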


---

**TAKE-HOME MESSAGE:**

- The last part of this notebook, where we manually included the location information, is a good example of what you should avoid doing at all costs. It is very easy to introduce errors into your data and it is very difficult to debug them later on.

- Always try to find a way to automate the process as much as possible. If you find yourself doing something manually, there is probably a pandas function that can help you do it more efficiently.
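
As a closing sketch, this is roughly what the automated route looks like on a shortened, hypothetical stand-in for `json_two_locations` (three days, one variable). `pd.json_normalize()` flattens the nested `daily` dictionary into columns named `daily.time` and `daily.weather_code` while keeping the location fields on the same row, so a single `explode()` finishes the job. The main notebook remains the reference for how this is used in the course.

```python
import pandas as pd

# Shortened, hypothetical stand-in for json_two_locations
sample = [
    {
        "latitude": 51.50853,
        "longitude": -0.12574,
        "daily": {
            "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
            "weather_code": [3, 51, 51],
        },
    },
    {
        "latitude": 48.85341,
        "longitude": 2.3488,
        "daily": {
            "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
            "weather_code": [55, 53, 3],
        },
    },
]

# One row per location; nested keys become 'daily.time', 'daily.weather_code'
flat = pd.json_normalize(sample)

# One row per location per day; latitude and longitude are repeated automatically
tidy = flat.explode(["daily.time", "daily.weather_code"]).reset_index(drop=True)
print(tidy)
```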