<font style='font-size:1.5em'>**🧑‍🏫 Week 07 Lecture**</font><br>
<font style='font-size:1.3em;color:#888888'>Normalising JSON + the Groupby -> Apply -> Combine Strategy </font>

<font style='font-size:1.2em;color:#e26a4f;font-weight:bold'>LSE DS105A – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 14 November 2024 

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Demonstrate how to 'disentangle' complex JSON data structures using the `json_normalize` function from the `pandas` library and introduce the `groupby -> apply -> combine` strategy to process data in a more efficient way than using loops. We will also discuss the `explode` function to handle cases when we find ourselves with columns made out of lists.

**REFERENCES:**

- The [`pd.json_normalize()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to convert JSON data more easily into tabular format

- The [DataFrame.explode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) function to handle cases when columns are made out of lists

In the labs later (second notebook), we will also cover:

- The [DataFrame.groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) function, combined with apply() and agg() to aggregate data 

---

In [1]:
import pandas as pd

# 1. Flat vs Nested JSON

You spent the past few weeks playing with data collected from OpenMeteo, a (mostly) free API that provides weather data for any location in the world in JSON format. JSON is the format most APIs use: it is easy for both humans and machines to read and write, and virtually every programming language can parse it. 

👉 However, most data analysis tools, such as `pandas` in Python, `dplyr` in R, and SQL, as well as most visualisation libraries, are designed to work with tabular data. This means that we need to convert JSON data into a tabular format to be able to analyse it.

OpenMeteo's JSON is overall fairly straightforward. If you just used a single location and a single temporal resolution (either `daily` or `hourly`), the JSON output was mostly "flat" and could be easily converted into a DataFrame. 

<div style="display: flex; flex-wrap: wrap; flex-direction:row;justify-content: left; margin: 0.5em;font-size:0.9em">

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

**What is considered a "flat" JSON?**

A "flat" JSON is one where the keys are all at the same level, and the values are either atomic (strings, numbers, booleans) or lists of atomic values.

For example:

```json
{
    "key1": "value1",
    "key2": "value2",
    "key3": [1, 2, 3]
}
```

</div>

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;margin-left:2em">

**What is considered a "nested" JSON?**

A "nested" JSON is one where the keys are at different levels, you have dictionaries within dictionaries, or lists of dictionaries.

For example:

```json
{
    "key1": "value1",
    "key2": {
        "key3": "value3",
        "key4": "value4"
    },
    "key5": [
        {"key6": "value6"},
        {"key7": "value7"}
    ]
}
```

</div>

</div>
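
The two definitions above can be turned into a small check. This is only a sketch: `is_atomic` and `is_flat` are names I made up for illustration, and I am assuming that JSON `null` (Python `None`) also counts as an atomic value.

```python
def is_atomic(value):
    # Strings, numbers and booleans are atomic; treating None as atomic is an assumption
    return value is None or isinstance(value, (str, int, float, bool))


def is_flat(obj):
    # A flat JSON object is a dictionary whose values are atomic
    # or lists made only of atomic values
    if not isinstance(obj, dict):
        return False
    for value in obj.values():
        if is_atomic(value):
            continue
        if isinstance(value, list) and all(is_atomic(item) for item in value):
            continue
        return False
    return True


flat_example = {"key1": "value1", "key2": "value2", "key3": [1, 2, 3]}
nested_example = {
    "key1": "value1",
    "key2": {"key3": "value3", "key4": "value4"},
    "key5": [{"key6": "value6"}, {"key7": "value7"}],
}

print(is_flat(flat_example))    # True
print(is_flat(nested_example))  # False
```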

<span style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;">🤔 **Think about it:** What is the easiest way to convert JSON data into a tabular format using Python?</span>

To answer that question, let's first understand the structure of the JSON data we are working with.

Take the first, flat JSON example above. We can convert it into a DataFrame straight away:

In [2]:
flat_json_dict = {
    'key1': 'value1',
    'key2': 'value2',
    'key3': [1, 2, 3]
}

pd.DataFrame(flat_json_dict)

Unnamed: 0,key1,key2,key3
0,value1,value2,1
1,value1,value2,2
2,value1,value2,3


We can't say the same about the second, nested JSON example:

In [3]:
nested_json_dict = {
    "key1": "value1",
    "key2": {
        "key3": "value3",
        "key4": "value4"
    },
    "key5": [
        {"key6": "value6"},
        {"key6": "another value 6", "key7": "value7"}
    ]
}

If you try to create a DataFrame out of `key2`, like this:

```python
pd.DataFrame(nested_json_dict['key2'])
```

You will get this error:

```python
ValueError: If using all scalar values, you must pass an index
```

In [4]:
# This works as a pandas Series
pd.Series(nested_json_dict['key2'])

key3    value3
key4    value4
dtype: object

In [5]:
# But if it is a pandas DataFrame
# We need to come up with an Index for that row
pd.DataFrame(nested_json_dict['key2'], index=[0])

Unnamed: 0,key3,key4
0,value3,value4
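
Another way around that error, sketched here with a stand-in dictionary (`record` is a name I picked for illustration), is to wrap the dictionary in a list. Pandas then reads it as a list of records, that is, one row per dictionary, and creates the index for us.

```python
import pandas as pd

# Stand-in for nested_json_dict['key2']
record = {"key3": "value3", "key4": "value4"}

# A list of dictionaries is read as "one row per dictionary",
# so pandas can build the default index (0) on its own
df_one_row = pd.DataFrame([record])
print(df_one_row)
```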


I could create separate DataFrames out of specific keys (not the best approach):

In [6]:
pd.DataFrame(nested_json_dict['key5'])

Unnamed: 0,key6,key7
0,value6,
1,another value 6,value7


And then combine them:

In [7]:
pd.concat([pd.DataFrame(nested_json_dict['key2'], index=[0]), 
           pd.DataFrame(nested_json_dict['key5'])])

Unnamed: 0,key3,key4,key6,key7
0,value3,value4,,
0,,,value6,
1,,,another value 6,value7
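
For comparison, here is a minimal sketch of what `pd.json_normalize()` does with the same kind of nested dictionary. The `record_path` and `meta` arguments are part of the pandas API; the variable names `nested`, `wide` and `long` are just mine.

```python
import pandas as pd

nested = {
    "key1": "value1",
    "key2": {"key3": "value3", "key4": "value4"},
    "key5": [
        {"key6": "value6"},
        {"key6": "another value 6", "key7": "value7"},
    ],
}

# Default behaviour: nested dictionaries become dotted columns (key2.key3, key2.key4),
# but the list of dictionaries under key5 stays packed inside a single cell
wide = pd.json_normalize(nested)

# record_path: produce one row per item of the list under "key5"
# meta: values copied from the parent object into every row
long = pd.json_normalize(
    nested,
    record_path="key5",
    meta=["key1", ["key2", "key3"]],
)

print(wide.columns.tolist())
print(long)
```

In `long`, the first row has no `key7`, so pandas fills that cell with a missing value, just like in the `pd.concat()` output above.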


## 1.1 Let's look at a familiar example

What follows below is the response I get after requesting historical weather data from OpenMeteo using the following parameters:

| Variable             | Value                                                                 |
|----------------------|-----------------------------------------------------------------------|
| **Latitude**             | 51.50853                                                             |
| **Longitude**            | -0.12574                                                            |
| **Period**               | from 26/Oct/2024 until 09/Nov/2024 <br> (when it was super grey in London!)      |
| **Frequency**            | daily                                                                 |
| **Weather variables**    | weather code, daylight duration, sunshine duration                    |

I copied the output and stored it as a dictionary to save us the hassle of sending a request:

In [8]:
json_one_location = {
  "latitude": 51.50853,
  "longitude": -0.12574,
  "generationtime_ms": 0.154972076416016,
  "utc_offset_seconds": 0,
  "timezone": "GMT",
  "timezone_abbreviation": "GMT",
  "elevation": 23,
  "daily_units": {
    "time": "iso8601",
    "weather_code": "wmo code",
    "daylight_duration": "s",
    "sunshine_duration": "s"
  },
  "daily": {
    "time": [
      "2024-10-26",
      "2024-10-27",
      "2024-10-28",
      "2024-10-29",
      "2024-10-30",
      "2024-10-31",
      "2024-11-01",
      "2024-11-02",
      "2024-11-03",
      "2024-11-04",
      "2024-11-05",
      "2024-11-06",
      "2024-11-07",
      "2024-11-08",
      "2024-11-09"
    ],
    "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
    "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
    "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
  }
}

<span style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;">🤔 **Think about it:** Is this a flat or nested JSON object?</span>

What about the output of OpenMeteo when I request data for two locations at once?

👇

In [9]:
json_two_locations = [
  {
    "latitude": 51.50853,
    "longitude": -0.12574,
    "generationtime_ms": 0.200033187866211,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 23,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
      "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
      "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
    }
  },
  {
    "latitude": 48.85341,
    "longitude": 2.3488,
    "generationtime_ms": 0.160098075866699,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 43,
    "location_id": 1,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [55, 53, 3, 3, 3, 3, 3, 51, 3, 3, 3, 3, 3, 3, 3],
      "daylight_duration": [36679.21, 36477.63, 36276.98, 36077.46, 35879.25, 35682.56, 35487.58, 35294.54, 35103.64, 34915.11, 34729.17, 34546.06, 34366.02, 34188.94, 34013.09],
      "sunshine_duration": [17721.39, 1329.73, 14961.37, 3869.25, 13851.97, 17390.9, 130.73, 0, 17079.12, 17671.44, 0, 9888.03, 0, 0, 10666.35]
    }
  }
]

👨🏻‍🏫 **TEACHING MOMENT:** Watch me as I demonstrate how to browse through the JSON data on VSCode.

## 1.2 Flat JSON makes for easy conversion to tabular format

Going back to the OpenMeteo data...

I can easily convert a mostly flat dictionary (similarly, a mostly flat JSON object) into a DataFrame using the `pd.json_normalize()` function, which turns nested keys into dotted column names such as `daily.time`:

In [10]:
(
    pd.json_normalize(json_one_location)
    # I don't care about the Daily Units (for now at least)
    .drop(columns=['daily_units.time', 'daily_units.weather_code', 'daily_units.daylight_duration', 'daily_units.sunshine_duration'])

    # I would rather have the lists expanded into rows
    .explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])

    # I don't like the 'daily.' prefix
    .rename(columns={
        'daily.time': 'time',
        'daily.weather_code': 'weather_code',
        'daily.daylight_duration': 'daylight_duration',
        'daily.sunshine_duration': 'sunshine_duration'
    })

    # There are these other useless columns that I don't care about
    .drop(columns=['generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation'])
)

Unnamed: 0,latitude,longitude,time,weather_code,daylight_duration,sunshine_duration
0,51.50853,-0.12574,2024-10-26,3,35989.18,14002.7
0,51.50853,-0.12574,2024-10-27,51,35766.05,31127.43
0,51.50853,-0.12574,2024-10-28,51,35543.86,5811.8
0,51.50853,-0.12574,2024-10-29,51,35322.81,10787.3
0,51.50853,-0.12574,2024-10-30,3,35103.12,19920.14
0,51.50853,-0.12574,2024-10-31,3,34885.01,14421.7
0,51.50853,-0.12574,2024-11-01,3,34668.7,14437.86
0,51.50853,-0.12574,2024-11-02,51,34454.44,0.0
0,51.50853,-0.12574,2024-11-03,3,34242.45,1098.29
0,51.50853,-0.12574,2024-11-04,3,34032.99,8162.58
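
Notice that every row in the output above has index `0`: `explode()` repeats the original row's index for each item of the list. The toy example below (made-up data, hypothetical names) shows this behaviour and one way to get a clean index back. Passing a list of columns to `explode()` needs pandas 1.3 or newer, and the lists in a given row must all have the same length.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["A", "B"],
    "time": [["d1", "d2"], ["d1", "d2", "d3"]],
    "value": [[1, 2], [3, 4, 5]],
})

# The lists are expanded together, row by row; the index is repeated
exploded = toy.explode(["time", "value"])
print(exploded.index.tolist())  # [0, 0, 1, 1, 1]

# Exploded columns come out with dtype 'object', so I convert 'value' back to integers
tidy = (
    exploded
    .reset_index(drop=True)
    .astype({"value": "int64"})
)
print(tidy)
```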


<details><summary>Click here to see a different way to do the same thing without method chaining</summary>

If you are not a fan of method chaining, you can also do the following:

```python
df = pd.json_normalize(json_one_location)

# I don't care about the Daily Units (for now at least)
df = df.drop(columns=['daily_units.time', 'daily_units.weather_code', 'daily_units.daylight_duration', 'daily_units.sunshine_duration'])

# I would rather have the lists expanded into rows
df = df.explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])

# I don't like the 'daily.' prefix
df = df.rename(columns={
    'daily.time': 'time',
    'daily.weather_code': 'weather_code',
    'daily.daylight_duration': 'daylight_duration',
    'daily.sunshine_duration': 'sunshine_duration'
})

# There are these other useless columns that I don't care about
df = df.drop(columns=['generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation'])

```

The advantage of method chaining is that it is more concise, you don't need to keep track of intermediate DataFrames, and, I'd argue, it is also easier to read if well formatted. I expand on this at the end of the notebook.

</details>

## 2.2 Normalising the JSON data (two locations)

The same code above will work for the `json_two_locations` object. 

The only difference is that OpenMeteo returns an additional key called `location_id` where it specifies if that data sample is for the first or the second location. Weirdly, it only starts counting from the second location, rendering the first location with a missing `location_id` (NaN).
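
If you wanted to keep `location_id`, a minimal sketch of how to patch the missing value could look like this. I am assuming the first location should get the id `0`; that is my reading of the response, so treat it as an assumption. The `responses` list is a trimmed-down stand-in for the real JSON.

```python
import pandas as pd

# Trimmed-down stand-in: only the second location carries a location_id
responses = [
    {"latitude": 51.50853, "longitude": -0.12574},
    {"latitude": 48.85341, "longitude": 2.3488, "location_id": 1},
]

df_locations = pd.json_normalize(responses)
print(df_locations["location_id"].tolist())  # [nan, 1.0]

# ASSUMPTION: a missing location_id means "the first location", i.e. 0
df_locations["location_id"] = df_locations["location_id"].fillna(0).astype(int)
print(df_locations)
```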

In [11]:
daily_cols = ['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration']

cols_to_keep = daily_cols + ['latitude', 'longitude']

df_sunshine = (
    pd.json_normalize(json_two_locations)[cols_to_keep]

    # I don't like the 'daily.' prefix
    .rename(columns={
        'daily.time': 'time',
        'daily.weather_code': 'weather_code',
        'daily.daylight_duration': 'daylight_duration',
        'daily.sunshine_duration': 'sunshine_duration'
    })

    # I would rather have the lists expanded into rows
    .explode(['time', 'weather_code', 'daylight_duration', 'sunshine_duration'])

)

# 'location_id' is not in cols_to_keep, so it has already been left out.
df_sunshine

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration,latitude,longitude
0,2024-10-26,3,35989.18,14002.7,51.50853,-0.12574
0,2024-10-27,51,35766.05,31127.43,51.50853,-0.12574
0,2024-10-28,51,35543.86,5811.8,51.50853,-0.12574
0,2024-10-29,51,35322.81,10787.3,51.50853,-0.12574
0,2024-10-30,3,35103.12,19920.14,51.50853,-0.12574
0,2024-10-31,3,34885.01,14421.7,51.50853,-0.12574
0,2024-11-01,3,34668.7,14437.86,51.50853,-0.12574
0,2024-11-02,51,34454.44,0.0,51.50853,-0.12574
0,2024-11-03,3,34242.45,1098.29,51.50853,-0.12574
0,2024-11-04,3,34032.99,8162.58,51.50853,-0.12574


# 3. Merging data

We explored merging briefly in a previous lecture, but let's revisit it here.

<span style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"> 🤔 **Think about it:** Wouldn't it be great if instead of `latitude` and `longitude`, we just had the city name?</span>

We DO have data on the city names! Let's read that old CSV file we used in the past:

In [12]:
df_world_cities = pd.read_csv('../data/world_cities.csv')

# Show a sample of the data just to get an idea of what it looks like
df_world_cities.head()

Unnamed: 0,country,name,lat,lng
0,AD,El Tarter,42.57952,1.65362
1,AD,Sant Julià de Lòria,42.46372,1.49129
2,AD,Pas de la Casa,42.54277,1.73361
3,AD,Ordino,42.55623,1.53319
4,AD,les Escaldes,42.50729,1.53414


**We want to merge the `df_sunshine` DataFrame with the `df_world_cities` DataFrame.**

What does that mean? It means that we want to add the `name` and `country` columns from the `df_world_cities` DataFrame to the `df_sunshine` DataFrame.

Then, I'd be able to drop the `latitude` and `longitude` columns that are not very human-readable.

### How to perform a merge

The `pd.merge()` function is the way to go. It expects the following arguments:

- `left`: the left DataFrame

- `right`: the right DataFrame

- `how`: the type of merge you want to perform (inner, outer, left, right)

    - `inner`: keeps only the rows that have a match in both DataFrames

    - `outer`: keeps all rows from both DataFrames, even if they don't have a match

    - `left`: keeps all rows from the left DataFrame, even if they don't have a match with the right DataFrame

    - `right`: keeps all rows from the right DataFrame, even if they don't have a match with the left DataFrame

- `left_on`: the column(s) on the left DataFrame that you want to use to merge

- `right_on`: the column(s) on the right DataFrame that you want to use to merge

It is important to note that the columns used to merge the DataFrames have to represent the same information. In this case, the `latitude` and `longitude` columns from the `df_sunshine` DataFrame represent the same information as the `lat` and `lng` columns from the `df_world_cities` DataFrame.
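
To get a feel for the `how` argument, here is a toy comparison with made-up data (the DataFrames `left` and `right` and their values are invented for illustration). Two keys match, one key exists only on the left, and one only on the right.

```python
import pandas as pd

left = pd.DataFrame({"latitude": [1.0, 2.0, 3.0], "sunshine": [10, 20, 30]})
right = pd.DataFrame({"lat": [1.0, 2.0, 4.0], "name": ["A", "B", "D"]})

# Count how many rows survive each type of merge
row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left=left, right=right, how=how,
                      left_on="latitude", right_on="lat")
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```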

In [13]:
pd.merge(left=df_sunshine, right=df_world_cities, how='left', 
         left_on=['latitude', 'longitude'], 
         right_on=['lat', 'lng'])

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration,latitude,longitude,country,name,lat,lng
0,2024-10-26,3,35989.18,14002.7,51.50853,-0.12574,GB,London,51.50853,-0.12574
1,2024-10-27,51,35766.05,31127.43,51.50853,-0.12574,GB,London,51.50853,-0.12574
2,2024-10-28,51,35543.86,5811.8,51.50853,-0.12574,GB,London,51.50853,-0.12574
3,2024-10-29,51,35322.81,10787.3,51.50853,-0.12574,GB,London,51.50853,-0.12574
4,2024-10-30,3,35103.12,19920.14,51.50853,-0.12574,GB,London,51.50853,-0.12574
5,2024-10-31,3,34885.01,14421.7,51.50853,-0.12574,GB,London,51.50853,-0.12574
6,2024-11-01,3,34668.7,14437.86,51.50853,-0.12574,GB,London,51.50853,-0.12574
7,2024-11-02,51,34454.44,0.0,51.50853,-0.12574,GB,London,51.50853,-0.12574
8,2024-11-03,3,34242.45,1098.29,51.50853,-0.12574,GB,London,51.50853,-0.12574
9,2024-11-04,3,34032.99,8162.58,51.50853,-0.12574,GB,London,51.50853,-0.12574


**WHAT HAPPENED THERE?**

We merged left on the `latitude` and `longitude` columns of the `df_sunshine` DataFrame and right on the `lat` and `lng` columns of the `df_world_cities` DataFrame.

This means that:

1. All the rows and columns of the `df_sunshine` DataFrame were kept intact 

2. Pandas identified, for each row, the corresponding row in the `df_world_cities` DataFrame that had the same `lat` and `lng` values

3. Pandas added all the columns from the `df_world_cities` DataFrame to the `df_sunshine` DataFrame, filling in with the values of the corresponding row in the `df_world_cities` DataFrame
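
One caveat: a left merge does not complain when a row finds no match, it just fills the new columns with missing values. Because we are matching on decimal numbers, coordinates that differ even in the last digit will not match. The sketch below uses made-up data to show two optional `pd.merge()` arguments that help you check the result: `indicator=True` adds a `_merge` column, and `validate="many_to_one"` raises an error if a key is duplicated on the right side.

```python
import pandas as pd

# Made-up data: the last weather row has coordinates that are not in the cities table
weather = pd.DataFrame({
    "latitude": [51.5, 51.5, 40.0],
    "longitude": [-0.1, -0.1, 3.0],
    "sunshine_duration": [100.0, 200.0, 300.0],
})
cities = pd.DataFrame({
    "lat": [51.5, 48.9],
    "lng": [-0.1, 2.3],
    "name": ["London", "Paris"],
})

checked = pd.merge(
    left=weather, right=cities, how="left",
    left_on=["latitude", "longitude"], right_on=["lat", "lng"],
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    validate="many_to_one",  # each (lat, lng) pair must be unique in 'cities'
)

# Rows that did not find a city
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```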

## How is that useful?

Well, we can now drop all the columns related to the location and keep only the `name` and `country` columns:

In [14]:
(
    pd.merge(left=df_sunshine, right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])
    # Drop the location columns now that I have the city name and country
    .drop(columns=['lat', 'lng', 'latitude', 'longitude'])
)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration,country,name
0,2024-10-26,3,35989.18,14002.7,GB,London
1,2024-10-27,51,35766.05,31127.43,GB,London
2,2024-10-28,51,35543.86,5811.8,GB,London
3,2024-10-29,51,35322.81,10787.3,GB,London
4,2024-10-30,3,35103.12,19920.14,GB,London
5,2024-10-31,3,34885.01,14421.7,GB,London
6,2024-11-01,3,34668.7,14437.86,GB,London
7,2024-11-02,51,34454.44,0.0,GB,London
8,2024-11-03,3,34242.45,1098.29,GB,London
9,2024-11-04,3,34032.99,8162.58,GB,London


You might also want to use the `rename()` function to rename the columns to something more meaningful:

In [15]:
df = (
    pd.merge(left=df_sunshine, right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])
    # Drop the location columns now that I have the city name and country
    .drop(columns=['lat', 'lng', 'latitude', 'longitude'])
    .rename(columns={'name': 'city', 'time': 'date'})
    # Reorder the columns
    [['city', 'country', 'date', 'weather_code', 'daylight_duration', 'sunshine_duration']]
)
df

Unnamed: 0,city,country,date,weather_code,daylight_duration,sunshine_duration
0,London,GB,2024-10-26,3,35989.18,14002.7
1,London,GB,2024-10-27,51,35766.05,31127.43
2,London,GB,2024-10-28,51,35543.86,5811.8
3,London,GB,2024-10-29,51,35322.81,10787.3
4,London,GB,2024-10-30,3,35103.12,19920.14
5,London,GB,2024-10-31,3,34885.01,14421.7
6,London,GB,2024-11-01,3,34668.7,14437.86
7,London,GB,2024-11-02,51,34454.44,0.0
8,London,GB,2024-11-03,3,34242.45,1098.29
9,London,GB,2024-11-04,3,34032.99,8162.58


# 4. A huge method chain to process the data from OpenMeteo

I've done a lot of repetition above with the intention to demonstrate the different ways you can manipulate JSON data.

Ultimately, though, it is in the philosophy of this course to be as efficient as possible and to not add as many intermediate DataFrames as I did above.

Let me show you how a huge method chain can be used to process the data from OpenMeteo in a single (huge) command:

In [16]:
### NOTE: THIS IS THE FINAL VERSION OF THE CODE ###
# You wouldn't reach this point in your first attempt
# Rewatch the lecture where I demonstrate how you would arrive at this point

daily_cols = ['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration']

cols_to_keep = daily_cols + ['latitude', 'longitude']

df_sunshine = (
    pd.json_normalize(json_two_locations)[cols_to_keep]

    # I don't like the 'daily.' prefix
    .rename(columns={
        'daily.time': 'time',
        'daily.weather_code': 'weather_code',
        'daily.daylight_duration': 'daylight_duration',
        'daily.sunshine_duration': 'sunshine_duration'
    })

    # I would rather have the lists expanded into rows
    .explode(['time', 'weather_code', 'daylight_duration', 'sunshine_duration'])

    .merge(right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])
    .drop(columns=['lat', 'lng', 'latitude', 'longitude'])
    .rename(columns={'name': 'city', 'time': 'date'})
    # Reorder the columns
    [['city', 'country', 'date', 'weather_code', 'daylight_duration', 'sunshine_duration']]
)

df_sunshine

Unnamed: 0,city,country,date,weather_code,daylight_duration,sunshine_duration
0,London,GB,2024-10-26,3,35989.18,14002.7
1,London,GB,2024-10-27,51,35766.05,31127.43
2,London,GB,2024-10-28,51,35543.86,5811.8
3,London,GB,2024-10-29,51,35322.81,10787.3
4,London,GB,2024-10-30,3,35103.12,19920.14
5,London,GB,2024-10-31,3,34885.01,14421.7
6,London,GB,2024-11-01,3,34668.7,14437.86
7,London,GB,2024-11-02,51,34454.44,0.0
8,London,GB,2024-11-03,3,34242.45,1098.29
9,London,GB,2024-11-04,3,34032.99,8162.58


**Why is this preferred?** 

A little bit is down to taste. You might just think that this is too complex and that you'd rather have the intermediate DataFrames to check if everything is going as expected. I get that. I also use intermediate DataFrames when I'm building the solution but, once I'm happy with the steps I've taken, I like to chain everything together.

<span style="display:block;color: #333333; background-color:rgba(93, 158, 188, 0.075);border-radius: 10px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);padding:1em;font-size:0.9em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;max-width:500px">💡 **TIP:** There is also a very practical reason why I'd like you to practice this approach to writing code. Chaining operations is the default way of working with data in other data analysis libraries (dplyr in R, SQL, etc.). If you master this skill in Python, you will be able to apply it to any other data manipulation tool you come across in the future.</span>

# 5. Groupby -> Apply -> Combine

We often want to calculate summary statistics for groups of data. In the case above, we might want to calculate the average daily sunshine duration for each city.

## 5.1 First, let me create a new column

What if I wanted to combine the `city` and `country` columns into a single column? It seems wasteful to have two columns for that information.

I'd use the `apply()` function to concatenate the two columns:

In [17]:
# Axis 1 means apply the function to each row
df_sunshine['city_country'] = df_sunshine.apply(lambda row: f"{row['city']} ({row['country']})", axis=1)

Then I can drop the `city` and `country` columns:

In [18]:
df_sunshine = df_sunshine.drop(columns=['city', 'country'])
df_sunshine

Unnamed: 0,date,weather_code,daylight_duration,sunshine_duration,city_country
0,2024-10-26,3,35989.18,14002.7,London (GB)
1,2024-10-27,51,35766.05,31127.43,London (GB)
2,2024-10-28,51,35543.86,5811.8,London (GB)
3,2024-10-29,51,35322.81,10787.3,London (GB)
4,2024-10-30,3,35103.12,19920.14,London (GB)
5,2024-10-31,3,34885.01,14421.7,London (GB)
6,2024-11-01,3,34668.7,14437.86,London (GB)
7,2024-11-02,51,34454.44,0.0,London (GB)
8,2024-11-03,3,34242.45,1098.29,London (GB)
9,2024-11-04,3,34032.99,8162.58,London (GB)
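
For string columns there is also a way to build the same label without `apply()`: adding the columns together works element by element. The sketch below uses a toy DataFrame (made-up, two rows) and also shows how `assign()` lets you create the column inside a method chain.

```python
import pandas as pd

toy = pd.DataFrame({"city": ["London", "Paris"], "country": ["GB", "FR"]})

# Row by row, as in the lecture
with_apply = toy.apply(lambda row: f"{row['city']} ({row['country']})", axis=1)

# Whole columns at once: '+' concatenates strings element by element
vectorised = toy["city"] + " (" + toy["country"] + ")"

# Same idea inside a chain: assign() creates the column, drop() removes the old ones
chained = (
    toy
    .assign(city_country=lambda d: d["city"] + " (" + d["country"] + ")")
    .drop(columns=["city", "country"])
)

print(vectorised.tolist())  # ['London (GB)', 'Paris (FR)']
```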


## 5.2 Now, let's calculate the average daily sunshine duration for each city

When I `groupby()` something, I'm telling pandas that I want to split the data into groups based on the values of a specific column.

Pandas will then apply an aggregation function to each group and combine the results into a new DataFrame.

The [`agg()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) expects a dictionary in which the keys are names of existing columns and the values are the aggregation functions you want to apply to those columns:

In [19]:
df_sunshine.groupby(['city_country']).agg({'sunshine_duration': 'mean', 'daylight_duration': 'mean'})

Unnamed: 0_level_0,sunshine_duration,daylight_duration
city_country,Unnamed: 1_level_1,Unnamed: 2_level_1
London (GB),9992.924,34475.617333
Paris (FR),8304.018667,35314.482667
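
To see what pandas is doing for us, here is a sketch of the same split, apply and combine steps written by hand with a loop, next to the `groupby()` version. The data is a made-up toy DataFrame, not the OpenMeteo numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "city_country": ["London (GB)", "London (GB)", "Paris (FR)", "Paris (FR)"],
    "sunshine_duration": [100.0, 300.0, 0.0, 50.0],
})

# By hand: split the rows per city, apply the mean, combine into a Series
manual = {}
for city in toy["city_country"].unique():
    group = toy[toy["city_country"] == city]       # split
    manual[city] = group["sunshine_duration"].mean()  # apply
manual = pd.Series(manual, name="sunshine_duration")  # combine

# With groupby: the same three steps in one line
grouped = toy.groupby("city_country")["sunshine_duration"].mean()

print(manual.to_dict())   # {'London (GB)': 200.0, 'Paris (FR)': 25.0}
print(grouped.to_dict())  # {'London (GB)': 200.0, 'Paris (FR)': 25.0}
```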


You can pass some commonly used function names as a string:

- "mean"
- "median"
- "prod"
- "sum"
- "std"
- "var"

But you can also pass your own lambda (or regular) functions.

Both variables are measured in seconds. Say we want the aggregator to return the average daily sunshine duration in hours:

In [20]:
# You can store a lambda function in a variable to reuse it
daily_avg_in_hours = lambda x: (sum(x)/len(x)) / 60 / 60

df_sunshine.groupby(['city_country']).agg({'sunshine_duration': daily_avg_in_hours,
                                           'daylight_duration': daily_avg_in_hours})

Unnamed: 0_level_0,sunshine_duration,daylight_duration
city_country,Unnamed: 1_level_1,Unnamed: 2_level_1
London (GB),2.775812,9.57656
Paris (FR),2.306672,9.809579
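
With the dictionary form, the output columns keep the names of the original columns, which can be confusing once the units change. One alternative is what pandas calls named aggregation: each keyword becomes the name of an output column and its value is a `(column, function)` pair. A minimal sketch with made-up data (the column names on the left-hand side are my choice):

```python
import pandas as pd

toy = pd.DataFrame({
    "city_country": ["London (GB)", "London (GB)", "Paris (FR)", "Paris (FR)"],
    "sunshine_duration": [3600.0, 10800.0, 0.0, 7200.0],
})

summary = toy.groupby("city_country").agg(
    avg_sunshine_s=("sunshine_duration", "mean"),
    avg_sunshine_h=("sunshine_duration", lambda s: s.mean() / 3600),
    days=("sunshine_duration", "count"),
)

print(summary)
```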


## 5.3 Use `.apply()` instead of `.agg()` for ultimate flexibility

Maybe you want to apply several different operations at once. You can specify a custom function that expects a DataFrame (with the group of rows) and returns a Series (with the aggregated values).

In [21]:
# My function receives a DataFrame (for an individual city)
# I calculate some statistics, **CUSTOMISE THE NAME OF THE NEW COLUMNS**
# And return a Series
def summarise_weather_data(group_of_data):
    output_dict = {
        'Avg. Daily Sunshine (s)': group_of_data['sunshine_duration'].mean(),
        'Avg. Daily Sunshine (h)': group_of_data['sunshine_duration'].mean()/(60*60)
    }
    output = pd.Series(output_dict)
    return output


df_sunshine.groupby(['city_country']).apply(summarise_weather_data, include_groups=False)

Unnamed: 0_level_0,Avg. Daily Sunshine (s),Avg. Daily Sunshine (h)
city_country,Unnamed: 1_level_1,Unnamed: 2_level_1
London (GB),9992.924,2.775812
Paris (FR),8304.018667,2.306672


--- 

**NOW WHAT?**

1. Read the [appendix notebook](./notebooks/W07-Lecture-Notebook-Appendix.ipynb) to gain a deeper insight into the process of normalising JSON data. This notebook should help you appreciate the power of the `json_normalize()` function.

2. Tomorrow in the labs (15 November 2024), you will be given a different dataset to work with, and you will have to apply the concepts you learned today to convert the JSON data into a tabular format. **Keep your notes about json_normalize() handy!**

3. There will also be a moment where your class teachers will tell you more about the `groupby -> apply -> combine` strategy to summarise and prepare data for plots more efficiently than using loops.