<font style='font-size:1.5em'>**✅ W07 Lab Solutions**</font><br>
<font style='font-size:1.3em;color:#888888'>Normalising JSON + the Groupby -> Apply -> Combine Strategy </font>

<font style='font-size:1.2em'>LSE [DS105A](https://lse-dsi.github.io/DS105/autumn-term/index.html){style="color:#e26a4f;font-weight:bold"} – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 15 November 2024 
</div>


**CREATORS:**  

- [Alex Soldatkin](https://github.com/alex-soldatkin) provided the dataset, the use case and the starter code
- Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io) adapted the content to align more closely with the lecture

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Practice normalising JSON data and using the groupby -> apply -> combine strategy to aggregate data.

**REFERENCES:**

- The [`pd.json_normalize()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to convert JSON data more easily into tabular format

- The [DataFrame.explode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) function to handle cases when columns are made out of lists

In the labs later (second notebook), we will also cover:

- The [DataFrame.groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) function, combined with apply() and agg() to aggregate data 

---

<div style="background-color:white;padding:0.5em;margin-left:2em;margin-bottom:1em;border-radius:0.5em;font-family: monospace;border: 1px solid #eda291;font-size:1.05em;width:500px">

💽 **DATA SPECIFICATION CARD:**

<font style="font-size:0.8em">We're going to use data from the [OpenSanctions](https://www.opensanctions.org/) project. This dataset includes information about individuals and entities that governments and international organizations have sanctioned worldwide. OpenSanctions is operated by a German company, [OpenSanctions Datenbanken GmbH](https://www.opensanctions.org/docs/about/), and has received funding from the German Federal Ministry for Education and Research. They offer a paid API for accessing the data, but you can also download the [data in bulk](https://www.opensanctions.org/datasets/sanctions/) for free, for academic and research purposes.</font>

A few things to know about the dataset:

- **We are focusing on Targets.** These are the individuals and entities that have been sanctioned. This dataset includes information about the name, country, and other 'properties' of the targets.

- **We have filtered for Russian Targets.** This is in part because Alex, who provided us with the data sample for this lab, is doing a PhD where he focuses on studying Russia, and also because the dataset is large and we want to make it more manageable for this lab.

- **We are using a small random sample.** Again, this is to make the dataset more manageable for this lab. The full dataset is much larger. 

</div>

**WARNING:** You will need to install a new package before running this notebook.

Either open a terminal and run:

```bash
pip install pycountry
```

or add a new Python cell and run:

```python
!pip install pycountry
```

(delete the cell after running it)

In [141]:
# To convert files to a suitable Python format (list or dictionary)
import json
import pycountry

import numpy as np
import pandas as pd

from IPython.display import Image

from lets_plot import *
LetsPlot.setup_html()

# 1. Let's normalise the JSON data

- You can work alone or in small groups for this. 

- If you want, feel free to play a game of <span style="display: inline-block; padding: 0 7.5px; font-size: 12px; font-weight: bold; line-height: 18px; white-space: nowrap; border: 1px solid rgba(20, 18, 11, 0.75); border-radius: 0.5em; color: rgb(20, 18, 11); background-color: rgba(255, 255, 255, 0.75); vertical-align: top; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1)"> 🧑‍✈️ Pilot</span> and <span style="display: inline-block; padding: 0 7.5px; font-size: 12px; font-weight: bold; line-height: 18px; white-space: nowrap; border: 1px solid rgba(20, 18, 11, 0.75); border-radius: 0.5em; color: #ac831d; background-color: rgba(255, 255, 255, 0.75); vertical-align: top; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1)">🙋 Copilot (s)</span> like we've done in the past.

Treat everything that comes below as 🎯 **ACTION POINTS:**


## 1.1 Read JSON data into a Python object

We have a JSON file called `sample_single_target.json` (stored under `../data/opensanctions/`) that contains information about a single target of sanctions.

Run the code below that reads the data from the file into a suitable Python object (either list or dictionary):

In [41]:
with open('../data/opensanctions/sample_single_target.json', mode='r') as file:
    sample_target = json.load(file)

## 1.2 Explore the JSON data:

Either browse the file or print the object you read the JSON data into to understand its structure.

**Questions:**

1. What is the type of the object that you read the JSON data into?

    My answer: 

    > **The `sample_target` object is a dictionary.**

2. Is this a flat or nested JSON object?

    My answer: 

    > **It's a very deeply nested JSON object.**

3. Can this object be converted into a DataFrame directly (with `pd.DataFrame()`), or do we need to do some pre-processing first?

    My answer: 

    > **No. Pandas cannot make sense of this object directly. It is deeply nested, and the lists stored under its sub-keys have different lengths. If you try to create a DataFrame from this object, you will get a `ValueError: All arrays must be of the same length`.**
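
To make the answer above more concrete, here is a minimal sketch using a made-up `toy_target` dictionary (a hypothetical, much smaller stand-in for the real file, not the actual data). It shows the error you get when lists of different lengths are passed to `pd.DataFrame()`, and how `pd.json_normalize()` keeps each list inside a single cell instead:

```python
import pandas as pd

# Hypothetical toy record: same overall shape as the lab data, but tiny
toy_target = {
    "id": "T1",
    "properties": {
        "alias": ["Name A", "Name B", "Name C"],
        "nationality": ["ru"],
    },
}

# Lists of different lengths cannot be lined up as columns
try:
    pd.DataFrame(toy_target["properties"])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)

# json_normalize flattens the nesting into dotted column names
# and leaves every list intact in a single cell
df_toy = pd.json_normalize(toy_target)
print(df_toy.columns.tolist())
```

The flattened column names follow the `parent.child` pattern, which is where names such as `properties.alias` come from.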

## 1.3 Normalise the JSON data

- Convert the JSON data into a DataFrame using the `pd.json_normalize()` function.

- Store the resulting DataFrame in a variable called `df_sample`.

You should see something like this:

In [42]:
# Uncomment the cell below to see what your DataFrame should look like
# Image("../figures/opensanctions/df_sample_v1.png")

In [43]:
# The json_normalize function is a good way to flatten nested dictionaries
df_sample = pd.json_normalize(sample_target)
df_sample

Unnamed: 0,id,caption,schema,referents,datasets,first_seen,last_seen,last_change,target,properties.alias,...,properties.address,properties.position,properties.nationality,properties.sourceUrl,properties.fatherName,properties.birthCountry,properties.birthPlace,properties.createdAt,properties.country,properties.sanctions
0,Q61116762,Aleksey Mikhailovich SALYAEV,Person,"[gb-fcdo-rus0208, au-dfat-3611-oleksii-mykhail...","[eu_fsf, au_dfat_sanctions, ua_nsdc_sanctions,...",2022-04-27T18:12:14,2024-09-30T06:58:02,2024-08-23T00:00:00,True,"[САЛЯЕВ Алексей Михайлович, Alexei Mikhailovic...",...,"[АР Крим, м. Сімферополь, вул. Федотова, 27, У...",[командир прикордонного сторожового корабля «Д...,[ru],[https://gels-avoirs.dgtresor.gouv.fr/Gels/Reg...,"[Mikhailovich, Михайлович]",[ua],[The Autonomous Republic of Crimea and the cit...,[2019-03-16],[ru],[{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c...


## 1.4 Subset for the most interesting `properties` columns

We want to focus on the following properties:

| Property                  | Description                                                                                     | Type   |
|---------------------------|-------------------------------------------------------------------------------------------------|--------|
| `properties.alias`        | The different names that the target is known by.                                                | List   |
| `properties.nationality`  | The nationality(ies) of the target.                                                             | List   |
| `properties.birthCountry` | The country where the target was born. This is stored as a list but should have only one element.| List   |
| `properties.sourceUrl`    | The URL where the information about the target was sourced from. This is stored as a list but should have only one element. | List   |
| `properties.sanctions`    | The sanctions that the target is subject to.                                                    | List   |


- Save the names of the columns above to a list called `interesting_columns`.

- Subset the DataFrame to keep only the columns listed above.

- Replace the `df_sample` variable with the new DataFrame that contains only the interesting columns.

💡 **TIP:** If you have GitHub Copilot installed on your machine, try adding the instructions above to the AI and see if it produces the output you want.

In [44]:
# Uncomment the cell below to see what your DataFrame should look like
# Image("../figures/opensanctions/df_sample_v2.png")

In [45]:
# Specify the columns in a list (the instructions call this list `interesting_columns`)
interesting_properties = ['properties.alias', 'properties.nationality', 
                       'properties.birthCountry',
                       'properties.sourceUrl', 'properties.sanctions']

# Filter the DataFrame to only include the columns in the list
# Because we're creating a shorter DataFrame, 
# it is best practice to create a copy of the DataFrame
df_sample = df_sample[interesting_properties].copy()
df_sample

Unnamed: 0,properties.alias,properties.nationality,properties.birthCountry,properties.sourceUrl,properties.sanctions
0,"[САЛЯЕВ Алексей Михайлович, Alexei Mikhailovic...",[ru],[ua],[https://gels-avoirs.dgtresor.gouv.fr/Gels/Reg...,[{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c...


## 1.5 Rename the columns

Let's get rid of the `properties.` prefix in the column names.

If you created the `interesting_columns` list (named `interesting_properties` in this solution) and the `df_sample` DataFrame correctly, you can run the code below to rename the columns.

Cut this piece of code and paste it in the cell below:

```python
new_column_names = [col.split('.')[1] for col in interesting_properties]

# Here's a new way to rename columns
df_sample.columns = new_column_names
```

In [46]:
new_column_names = [col.split('.')[1] for col in interesting_properties]

# Here's a new way to rename columns
df_sample.columns = new_column_names

# Yeah, that worked!
df_sample

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanctions
0,"[САЛЯЕВ Алексей Михайлович, Alexei Mikhailovic...",[ru],[ua],[https://gels-avoirs.dgtresor.gouv.fr/Gels/Reg...,[{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c...
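
As an alternative to overwriting `df.columns`, `rename()` also accepts a function that is applied to every column name. The sketch below uses a hypothetical one-row `df_toy` rather than the lab data:

```python
import pandas as pd

# Hypothetical one-row DataFrame that mimics the column names in the lab
df_toy = pd.DataFrame({
    "properties.alias": [["Name A", "Name B"]],
    "properties.nationality": [["ru"]],
})

# rename() calls the function once per column name and returns a new DataFrame
df_renamed = df_toy.rename(columns=lambda col: col.split(".")[-1])

print(df_renamed.columns.tolist())
```

Using `[-1]` instead of `[1]` means a column name without a dot is left unchanged rather than raising an `IndexError`.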


## 🏆 1.6 String Manipulation (Don't explode anything just yet)

We are interested in understanding which countries/entities have imposed sanctions on the target. This means the only column we want to explode is the `sanctions` column (formerly `properties.sanctions`).

All the other columns, despite being lists, should not be exploded. It makes a lot more sense to just convert them to meaningful strings.

We can use the `apply()` function on each of these columns to convert the lists into strings.

Here's, for example, how I would convert the `birthCountry` column from a list to a string:

```python
# Because I know the `birthCountry` column is a list that has just a single element, 
# I can extract it directly like this.
# Run it and check the result before assigning it back to the column
df_sample['birthCountry'].apply(lambda x: x[0])

# To make this change permanent, assign it back to the column
df_sample['birthCountry'] = df_sample['birthCountry'].apply(lambda x: x[0])

```


In [47]:
# Birth country is easy
df_sample['birthCountry'] = df_sample['birthCountry'].apply(lambda x: x[0])

In [48]:
# Alias is also easy to handle
df_sample['alias'] = df_sample['alias'].apply(lambda x: x[0])

In [49]:
# It's the same code as above, but with a different column name
# I wonder if I should have used a function instead of repeating the code
df_sample['sourceUrl'] = df_sample['sourceUrl'].apply(lambda x: x[0])

In [50]:
# We use ", ".join() to convert the list of nationalities into a single string
# in which the nationalities are separated by a comma and a space
df_sample['nationality'] = df_sample['nationality'].apply(lambda x: ", ".join(x))

In [51]:
df_sample

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanctions
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,[{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c...
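
One way to avoid repeating the same `lambda` for every column is to give it a name and loop over the columns. This is only a sketch: `first_item` and `df_toy` are hypothetical names introduced for illustration, and the data is made up:

```python
import pandas as pd


def first_item(values):
    """Return the first element of a list."""
    return values[0]


# Hypothetical one-row DataFrame with list-valued columns
df_toy = pd.DataFrame({
    "alias": [["Name A", "Name B"]],
    "birthCountry": [["ua"]],
    "sourceUrl": [["https://example.org/a", "https://example.org/b"]],
})

# Apply the same helper to every column that should keep only its first element
for column in ["alias", "birthCountry", "sourceUrl"]:
    df_toy[column] = df_toy[column].apply(first_item)

print(df_toy)
```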


In [52]:
# Uncomment the cell below to see what your DataFrame should look like
# Image("../figures/opensanctions/df_sample_v3.png")

## 1.7 Explode the columns

- Use the [DataFrame.explode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) function to explode the `sanctions` column.

In [53]:
# Uncomment the cell below to see what your DataFrame should look like
# Image("../figures/opensanctions/df_sample_v4.png")

In [54]:
# That was easy
df_sample = df_sample.explode('sanctions')

Just notice that because all the new rows came from the same original row, all the scalar values in the other columns will be repeated. **This also applies to the Index.**

In [55]:
df_sample

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanctions
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c8...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'be-fod-b614c4124050ffc8026f33e047687fb...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'ca-sema-8159ef9b659e50844d6dfc202ddb41...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'ch-seco-ed9c37948250b396934fcad975f889...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'eu-fsf-dcda1f3baa54269ca6de446ddde0b5b...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'eu-tb-af54460497b44ac7ffd29c9cdfc9e805...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'eu-tb-d9fc7a31d1f34e72e5effefbcefa6b78...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'fr-ga-69da9f08fe7cbbb366f62d65ed8d5c09...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'gb-fcdo-e05a6ff863d7ab10f453ad909dddf4...
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'gb-hmt-10625340ba89ed2059109526fad8f0e...


**Not part of the original instructions, but I recommend resetting the index after exploding the DataFrame.**

In [56]:
# What do you think drop=True does?
# You can try to use your favourite GenAI tool to find out
# or ask us, humans, in the Slack channel
df_sample = df_sample.reset_index(drop=True)
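
If you want to check your answer about `drop=True`, the sketch below compares both variants on a hypothetical `df_toy` (made-up data, not the lab sample):

```python
import pandas as pd

# Hypothetical DataFrame: two targets, with two and one sanctions respectively
df_toy = pd.DataFrame({
    "alias": ["Name A", "Name B"],
    "sanctions": [[{"id": "s1"}, {"id": "s2"}], [{"id": "s3"}]],
})

exploded = df_toy.explode("sanctions")
print(exploded.index.tolist())  # index labels are repeated: [0, 0, 1]

# Without drop=True, the old index is kept as a new column called 'index'
kept = exploded.reset_index()

# With drop=True, the old index is discarded
dropped = exploded.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```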

## 1.8 See the solution

Putting it all together, here's what the solution would look like if you were to use method-chaining:

<details><summary>Click HERE for the solution</summary>

```python
interesting_columns = ['properties.alias', 
                       'properties.nationality', 
                       'properties.birthCountry', 
                       'properties.sourceUrl', 
                       'properties.sanctions']

with open('../data/opensanctions/sample_single_target.json', mode='r') as file:
    sample_target = json.load(file)

df_sample = (
    pd.json_normalize(sample_target)
    [interesting_columns]
    .rename(columns={
        'properties.alias': 'alias',
        'properties.nationality': 'nationality',
        'properties.birthCountry': 'birthCountry',
        'properties.sourceUrl': 'sourceUrl',
        'properties.sanctions': 'sanctions'
    })
    .assign(
        alias=lambda x: x['alias'].apply(lambda x: x[0]),
        sourceUrl=lambda x: x['sourceUrl'].apply(lambda x: x[0]),
        nationality=lambda x: x['nationality'].apply(lambda x: ", ".join(x)),
        birthCountry=lambda x: x['birthCountry'].apply(lambda x: x[0])
    )
    .explode('sanctions')
    .reset_index(drop=True)
)

df_sample
```

</details>

# 2. Unnest the `sanctions` column

We have a lot of stuff but our `sanctions` column is still nested.

We can't explode it, as the data inside is not a list, but a dictionary.

Give it a go:

```python
# This won't give us what we want: explode() is meant for list-like values, not dictionaries
df_sample.explode('sanctions')
```

We need to:

- Keep the other columns as they are
- Work separately on the `sanctions` column, using the `pd.json_normalize()` function to normalise the data inside it.
- Concatenate the resulting DataFrame with the original one, keeping the index aligned.
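
The three steps above can be sketched on a tiny, made-up example. Everything here (`df_toy`, the sanction ids, the countries) is hypothetical, and the sketch assumes the DataFrame has a clean `0..n-1` index so that the two pieces line up when concatenated:

```python
import pandas as pd

# Hypothetical DataFrame: one dictionary per row in the 'sanctions' column
df_toy = pd.DataFrame({
    "alias": ["Name A", "Name A"],
    "sanctions": [
        {"id": "s1", "properties": {"country": ["au"]}},
        {"id": "s2", "properties": {"country": ["be"]}},
    ],
})

# Work separately on the 'sanctions' column
df_sanctions = pd.json_normalize(df_toy["sanctions"].tolist())

# Keep the other columns and place the normalised columns next to them
df_combined = pd.concat([df_toy.drop(columns="sanctions"), df_sanctions], axis=1)

print(df_combined)
```

`pd.concat(..., axis=1)` aligns rows by index label, which is why resetting the index after `explode()` matters here.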

## 2.1 Convert the entire `sanctions` column into a DataFrame of its own

Each row of the `sanctions` column now holds a single dictionary, so we can pass the whole column straight to `pd.json_normalize()`:

```python
# You can normalise a pandas column when its values are dictionaries
# json_normalize is not just for 'pure Python' lists of dictionaries
pd.json_normalize(df_sample['sanctions'])
```

Copy the code above and paste it in the cell below to observe the output.

In [34]:
# Your code here
pd.json_normalize(df_sample['sanctions'])

Unnamed: 0,id,caption,schema,referents,datasets,first_seen,last_seen,last_change,target,properties.entity,...,properties.summary,properties.authority,properties.reason,properties.authorityId,properties.listingDate,properties.provisions,properties.modifiedAt,properties.status,properties.endDate,properties.duration
0,au-dfat-b9ca38783b02c98ff4860e31b0f9c80056c953c0,Autonomous (Ukraine),Sanction,[],[au_dfat_sanctions],2023-04-20T12:10:15,2024-09-30T06:02:03,2024-06-23T20:02:02,False,[Q61116762],...,[Listed on: 16 Mar. 2019],[Department of Foreign Affairs and Trade],,,,,,,,
1,be-fod-b614c4124050ffc8026f33e047687fb32fbc4981,UKR,Sanction,[],[be_fod_sanctions],2023-04-20T12:13:17,2024-09-30T06:17:02,2023-04-20T12:13:17,False,[Q61116762],...,,[Federal Public Service Finance],[2020/1267 (OJ L298)],[EU.5064.30],[2020-09-11],,,,,
2,ca-sema-8159ef9b659e50844d6dfc202ddb41c8bbe63d42,Russia / Russie,Sanction,[],[ca_dfatd_sema_sanctions],2024-08-06T09:43:02,2024-09-30T06:43:02,2024-08-06T09:43:02,False,[Q61116762],...,,[Global Affairs Canada],"[1, Part 1]",[112],,,,,,
3,ch-seco-ed9c37948250b396934fcad975f8891ceae4bfbd,Ordinance of 4 March 2022 on measures related ...,Sanction,[],[ch_seco_sanctions],2023-04-20T12:16:14,2024-09-24T00:00:00,2024-08-16T00:00:00,False,[Q61116762],...,,[State Secretariat for Economic Affairs],,[40229],"[2020-04-02, 2019-04-02, 2020-09-29]",,,,,
4,eu-fsf-dcda1f3baa54269ca6de446ddde0b5b798fdadb6,UKR,Sanction,[],[eu_fsf],2023-04-20T18:00:25,2024-09-30T06:57:03,2024-08-08T16:57:03,False,[Q61116762],...,,"[Directorate‑General for Financial Stability, ...",[2020/1267 (OJ L298)],[EU.5064.30],[2020-09-11],,,,,
5,eu-tb-af54460497b44ac7ffd29c9cdfc9e805aa3a53fd,Sanction,Sanction,[],[eu_travel_bans],2024-02-23T15:37:01,2024-09-30T06:37:02,2024-06-23T21:37:02,False,[Q61116762],...,,[Council of the European Union],[Council Decision concerning restrictive measu...,,,,,,,
6,eu-tb-d9fc7a31d1f34e72e5effefbcefa6b78b8481d1b,Sanction,Sanction,[],[eu_travel_bans],2024-02-26T18:37:01,2024-09-30T06:37:02,2024-06-23T21:37:02,False,[Q61116762],...,,[Council of the European Union],[Council Decision concerning restrictive measu...,,,,,,,
7,fr-ga-69da9f08fe7cbbb366f62d65ed8d5c092ee52123,(UE) 2019/409 du 14/03/2019 (UE Ukraine intégr...,Sanction,[],[fr_tresor_gels_avoir],2023-04-20T10:12:18,2024-09-30T06:58:02,2024-06-23T21:58:02,False,[Q61116762],...,,[Direction Générale du Trésor],[Commandant du navire de patrouille frontalièr...,,,,,,,
8,gb-fcdo-e05a6ff863d7ab10f453ad909dddf44f6d126485,The Russia (Sanctions) (EU Exit) Regulations 2019,Sanction,[],[gb_fcdo_sanctions],2024-05-09T21:39:15,2024-09-30T06:40:02,2024-06-23T20:40:03,False,[Q61116762],...,,"[Foreign, Commonwealth & Development Office, UK]",[Commanding officer of the border patrol boat ...,[RUS0208],,"[Asset freeze, Travel Ban, Trust Services Sanc...",[2023-03-20],,,
9,gb-hmt-10625340ba89ed2059109526fad8f0e032a2764a,Russia,Sanction,[],[gb_hmt_sanctions],2023-04-20T10:42:49,2024-09-30T06:14:02,2023-10-17T12:01:46,False,[Q61116762],...,[Trust services],"[Office of Financial Sanctions Implementation,...",[Commanding officer of the border patrol boat ...,[RUS0208],[2019],[Asset freeze],[2023-03-21],[Asset Freeze Targets],,


From the resulting DataFrame above, we definitely want the `properties.country` column (renamed to just `sanctions_country`), but we want it as a string, not as a list.

Figure out how to create this column and add it to the DataFrame.

**Let me do this step by step:**

In [57]:
# Let me look at just the 'properties.country' column
pd.json_normalize(df_sample['sanctions'])['properties.country']

0     [au]
1     [be]
2     [ca]
3     [ch]
4     [eu]
5     [eu]
6     [eu]
7     [fr]
8     [gb]
9     [gb]
10    [ua]
11    [ua]
Name: properties.country, dtype: object

**After this, I would be able to do the same thing I've done with the 'birthCountry' column.**

That is:

In [58]:
(
    pd.json_normalize(df_sample['sanctions'])['properties.country']
    .apply(lambda x: x[0])
)

0     au
1     be
2     ca
3     ch
4     eu
5     eu
6     eu
7     fr
8     gb
9     gb
10    ua
11    ua
Name: properties.country, dtype: object

In [None]:
# Then I can just assign the result to a new column
df_sample['sanctions_country'] = (
    pd.json_normalize(df_sample['sanctions'])['properties.country']
    .apply(lambda x: x[0])
)

In [60]:
df_sample

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanctions,sanctions_country
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c8...,au
1,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'be-fod-b614c4124050ffc8026f33e047687fb...,be
2,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'ca-sema-8159ef9b659e50844d6dfc202ddb41...,ca
3,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'ch-seco-ed9c37948250b396934fcad975f889...,ch
4,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'eu-fsf-dcda1f3baa54269ca6de446ddde0b5b...,eu
5,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'eu-tb-af54460497b44ac7ffd29c9cdfc9e805...,eu
6,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'eu-tb-d9fc7a31d1f34e72e5effefbcefa6b78...,eu
7,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'fr-ga-69da9f08fe7cbbb366f62d65ed8d5c09...,fr
8,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'gb-fcdo-e05a6ff863d7ab10f453ad909dddf4...,gb
9,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,{'id': 'gb-hmt-10625340ba89ed2059109526fad8f0e...,gb


I don't need the `sanctions` column anymore, so I can drop it.


In [61]:
df_sample = df_sample.drop(columns='sanctions').copy()

In [62]:
df_sample

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanctions_country
0,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,au
1,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,be
2,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,ca
3,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,ch
4,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,eu
5,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,eu
6,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,eu
7,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,fr
8,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,gb
9,САЛЯЕВ Алексей Михайлович,ru,ua,https://gels-avoirs.dgtresor.gouv.fr/Gels/Regi...,gb


In [17]:
# Uncomment the cell below to see what your DataFrame should look like
# Image("../figures/opensanctions/df_sample_v5.png")

# 3. A larger dataset

Let's use a much larger dataset now. Small tweaks to the code are necessary because some of the data is missing and we now have more than one target.

In [63]:
# This is a list of JSON objects (each element is a JSON object like the sample we've used above)
df_targets = pd.read_json('../data/opensanctions/targets_sample_4000.jsonl', lines=True)
df_targets.head()

Unnamed: 0,id,caption,schema,properties,referents,datasets,first_seen,last_seen,last_change,target
0,NK-YPMWhnEtqnRViPgZx2ET7N,"Tovarystvo z obmezhenoiu vidpovidalnistiu ""Tek...",Organization,"{'innCode': ['5904094273'], 'jurisdiction': ['...",[ua-nsdc-24776-tovaristvo-z-obmezenou-vidpovid...,[ua_nsdc_sanctions],2023-05-25T20:13:16,2024-09-30T06:22:04,2024-03-06T18:31:06,True
1,ua-nsdc-15471-dovtaev-alihan-isajovic,Dovtaiev Alikhan Isaiovych,Person,"{'alias': ['Довтаєв Аліхан Ісайович', 'DOVTAIE...",[],[ua_nsdc_sanctions],2023-04-20T10:50:14,2024-09-30T06:22:04,2024-03-06T18:31:06,True
2,il-nbctf-9d2131feeb8917c9f53877a5c7cf1b86f0fd8cdc,468002109,CryptoWallet,"{'publicKey': ['468002109'], 'topics': ['crime...",[],[il_mod_crypto],2024-04-05T13:51:48,2024-09-30T06:02:03,2024-04-05T13:51:48,True
3,NK-djckrSBnoZYnm2PngKZRaJ,Samarchenko Svitlana Vitaliyivna,Person,"{'name': ['Samarchenko Svitlana Vitaliyivna', ...","[ua-nsdc-13365-samarcenko-svitlana-vitaliivna,...",[ua_nsdc_sanctions],2023-04-20T10:50:14,2024-09-30T06:22:04,2024-03-06T18:31:06,True
4,Q61116762,Aleksey Mikhailovich SALYAEV,Person,"{'alias': ['САЛЯЕВ Алексей Михайлович', 'Alexe...","[gb-fcdo-rus0208, au-dfat-3611-oleksii-mykhail...","[eu_fsf, au_dfat_sanctions, ua_nsdc_sanctions,...",2022-04-27T18:12:14,2024-09-30T06:58:02,2024-08-23T00:00:00,True


Just like before, we just care about the 'properties' columns, but this time we have a lot more data:

In [19]:
df_targets['properties'].head()

0    {'innCode': ['5904094273'], 'jurisdiction': ['...
1    {'alias': ['Довтаєв Аліхан Ісайович', 'DOVTAIE...
2    {'publicKey': ['468002109'], 'topics': ['crime...
3    {'name': ['Samarchenko Svitlana Vitaliyivna', ...
4    {'alias': ['САЛЯЕВ Алексей Михайлович', 'Alexe...
Name: properties, dtype: object

We can normalise the 'properties' column and work with the resulting DataFrame.


In [20]:
interesting_columns = ['alias', 'nationality', 'birthCountry', 'sourceUrl', 'sanctions']
pd.json_normalize(df_targets['properties'])[interesting_columns]

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanctions
0,"[Limited Liability Company ""Composite Tech№log...",,,,[{'id': 'ua-nsdc-1a55567bf27b82e85aaa21c74ed43...
1,"[Довтаєв Аліхан Ісайович, DOVTAIEV ALIKHAN]",[ru],,,[{'id': 'ua-nsdc-6d9d313cafb495bc2adff58623044...
2,,,,,
3,"[Самарченко Светлана Витальевна, Samarchenko S...",[ua],,,[{'id': 'ua-nsdc-254540b0426981a248dd23ad55938...
4,"[САЛЯЕВ Алексей Михайлович, Alexei Mikhailovic...",[ru],[ua],[https://gels-avoirs.dgtresor.gouv.fr/Gels/Reg...,[{'id': 'au-dfat-b9ca38783b02c98ff4860e31b0f9c...
...,...,...,...,...,...
3995,"[Агафонова Наталя Миколаївна, Агафонова Наталь...",[ru],,,[{'id': 'ua-nsdc-640bb3ba773bf28d09c6cfa6ff3b0...
3996,,[ir],,[https://sanctionssearch.ofac.treas.gov/Detail...,[{'id': 'ofac-c206e06e1f947b5884af787983b6aa90...
3997,"[Ginsburg Vladimir, Гинсбург Владимир Срульеви...",[ru],,,[{'id': 'ua-nsdc-1298251b55b833bfbbe09baef1b3b...
3998,,,,[https://sanctionssearch.ofac.treas.gov/Detail...,[{'id': 'ofac-16649d709282f4b69ea21d00ea9be712...


☝️ Notice how this time around there are some NaN values in the DataFrame. This is because some properties are missing for some of the targets.

We need to consider this when we normalise the data!
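
To see why missing values matter, here is a small sketch with a hypothetical `nationality` Series (made-up values). The plain `lambda` used earlier fails on `NaN`, while a version guarded with `isinstance()` skips the missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical column: two lists and one missing value
nationality = pd.Series([["ru"], np.nan, ["ua", "ru"]], dtype=object)

# NaN is a float, and floats cannot be indexed like lists
try:
    nationality.apply(lambda x: x[0])
    failed = False
except TypeError:
    failed = True

print(failed)

# Only join the values that really are lists, and leave the rest empty
safe = nationality.apply(lambda x: ", ".join(x) if isinstance(x, list) else None)
print(safe)
```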

Your task now is to **understand everything the code below does** and then **run it**.


In [None]:
# I will read this first, outside the method chain, so it's easier to see what's happening
df_targets = pd.read_json('../data/opensanctions/targets_sample_4000.jsonl', lines=True)

# I will also leave this outside the method chain, so it's easier to see what's happening
interesting_columns = ['alias', 'nationality', 'birthCountry', 'sourceUrl', 'sanctions']
df_targets = pd.json_normalize(df_targets['properties'])[interesting_columns]

# Have you seen the assign() method before? It's a very useful method to add new columns to a DataFrame
# Do you see what we're doing differently here?
df_targets = (
    df_targets
    .assign(
        alias=lambda x: x['alias'].apply(lambda x: x[0] if isinstance(x, list) else None),
        sourceUrl=lambda x: x['sourceUrl'].apply(lambda x: x[0] if isinstance(x, list) else None),
        nationality=lambda x: x['nationality'].apply(lambda x: ", ".join(x) if isinstance(x, list) else None),
        birthCountry=lambda x: x['birthCountry'].apply(lambda x: x[0] if isinstance(x, list) else None)
    )
    .explode('sanctions')
)

# Here's another way to add the 'sanction_country' column
sanction_country = pd.json_normalize(df_targets['sanctions'])['properties.country']
sanction_country = sanction_country.apply(lambda x: x[0] if isinstance(x, list) else None).tolist()
df_targets['sanction_country'] = sanction_country

df_targets = df_targets.drop(columns='sanctions')

df_targets

Unnamed: 0,alias,nationality,birthCountry,sourceUrl,sanction_country
0,"Limited Liability Company ""Composite Tech№logy""",,,,ua
1,Довтаєв Аліхан Ісайович,ru,,,ua
1,Довтаєв Аліхан Ісайович,ru,,,ua
2,,,,,
3,Самарченко Светлана Витальевна,ua,,,ua
...,...,...,...,...,...
3997,Ginsburg Vladimir,ru,,,ua
3998,,,,https://sanctionssearch.ofac.treas.gov/Details...,us
3999,BABII ANNA,ua,,,ua
3999,BABII ANNA,ua,,,ua
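
One detail in the code above is the `.tolist()` call. A likely reason for it: the index was not reset after `explode()`, so it contains repeated labels, and pandas aligns a Series by index label when you assign it as a column. The sketch below illustrates the difference with made-up data (`exploded` and `new_values` are hypothetical names):

```python
import pandas as pd

# Hypothetical exploded DataFrame: its index labels are [0, 0, 1]
exploded = pd.DataFrame({
    "alias": ["Name A", "Name B"],
    "sanctions": [["x", "y"], ["z"]],
}).explode("sanctions")

# New values with a fresh 0..n-1 index, one value per exploded row
new_values = pd.Series(["au", "be", "ua"])

# Assigning a Series aligns on index labels, so rows get the wrong values
by_label = exploded.assign(country=new_values)

# Assigning a plain list goes by position, which is what we want here
by_position = exploded.assign(country=new_values.tolist())

print(by_label["country"].tolist())
print(by_position["country"].tolist())
```

Resetting the index right after `explode()`, as in section 1.7, is another way to avoid this mismatch.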


# 4. Groupby -> Apply -> Combine

Write code to group the data by the `sanction_country` column and count the number of sanctions imposed by each country.

**This could be seen as a job for `value_counts()`:**

In [68]:
# How many sanctions did each country impose on targets?
df_targets['sanction_country'].value_counts()

sanction_country
ua    2420
us    1886
eu    1109
gb     724
ch     470
be     441
mc     420
fr     414
ca     369
jp     254
au     242
kz     170
tr     144
nz     141
kg      82
il      79
za      66
md      59
id      54
pl      51
ar      50
ir      34
ee      33
it      26
lt      25
ae      21
in      18
nl      14
at      11
hr      11
my      11
ro       8
gr       8
lv       7
sg       6
fi       5
es       5
np       4
az       3
cz       2
ng       2
sk       1
ie       1
si       1
Name: count, dtype: int64

<span style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;">🤔 **BUT think about it:** Each target might have been sanctioned by many organisations within the same country. We don't know if Ukraine (`ua`) sanctioned 2420 targets or if all of those sanctions were placed by many Ukrainian organisations on a single target. </span>

## 4.1 Asking better questions

**You should always critically evaluate the results you get from your code and reformulate your questions if necessary.**

Let's do that.

What if our question now was:

> **How many targets were sanctioned by each country?**

A good way to answer this question is to group the data by the `sanction_country` column and count the number of unique targets in each group.

In [70]:
# Group by 'sanction_country' 
# and count the number of unique/distinct 'alias', 
# it doesn't matter how many sanctions each country imposed on each target
df_targets.groupby(['sanction_country']).agg({'alias': 'nunique'})

Unnamed: 0_level_0,alias
sanction_country,Unnamed: 1_level_1
ae,4
ar,45
at,2
au,229
az,0
be,382
ca,297
ch,408
cz,1
ee,29


Or with `groupby().apply()`:

In [None]:
get_num_unique_targets = lambda x: pd.Series({'num_targets': x['alias'].nunique()})

(
    df_targets.groupby(['sanction_country'])
              .apply(get_num_unique_targets, include_groups=False)
)

Unnamed: 0_level_0,num_targets
sanction_country,Unnamed: 1_level_1
ae,4
ar,45
at,2
au,229
az,0
be,382
ca,297
ch,408
cz,1
ee,29
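
Conceptually, the groupby -> apply -> combine strategy can be spelled out by hand. The sketch below does so with a plain loop on a made-up `df_toy`. It is meant to illustrate the idea, not to describe how pandas implements it internally:

```python
import pandas as pd

# Hypothetical data: one row per sanction
df_toy = pd.DataFrame({
    "alias": ["A", "A", "B", "C"],
    "sanction_country": ["ua", "ua", "ua", "us"],
})

# 1. Split: iterating over a groupby yields (key, sub-DataFrame) pairs
# 2. Apply: compute one number for each sub-DataFrame
results = {}
for country, group in df_toy.groupby("sanction_country"):
    results[country] = group["alias"].nunique()

# 3. Combine: gather the per-group results into a single object
num_targets = pd.Series(results, name="num_targets")

print(num_targets)
```

Note that the `include_groups` argument used with `.apply()` above is only available in pandas 2.2 or later.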


In [75]:
# Reorder the DataFrame by the number of unique targets
(
    df_targets.groupby(['sanction_country'])
              .apply(get_num_unique_targets, include_groups=False)
              .sort_values(by='num_targets', ascending=False)
)

Unnamed: 0_level_0,num_targets
sanction_country,Unnamed: 1_level_1
ua,1444
us,860
eu,421
ch,408
fr,384
be,382
mc,354
gb,335
ca,297
au,229


The grouped DataFrame will have `sanction_country` as the index and the number of unique targets as the values. If we want it as a regular DataFrame, we can use the `reset_index()` method.

In [93]:
df_country_sanctions = (
    df_targets.groupby(['sanction_country'])
              .apply(get_num_unique_targets, include_groups=False)
              .sort_values(by='num_targets', ascending=False)
              .reset_index()
)

# Look at the top 10
df_country_sanctions.head(10)

Unnamed: 0,sanction_country,num_targets
0,ua,1444
1,us,860
2,eu,421
3,ch,408
4,fr,384
5,be,382
6,mc,354
7,gb,335
8,ca,297
9,au,229


## 4.2 Get nicer country names

There is this nice Python package called `pycountry` that can help us get the full country names from these two-letter country codes.

In [95]:
def get_country_name(code):
    selected_country = pycountry.countries.get(alpha_2=code)

    if selected_country is None:
        if code == 'eu':
            return 'European Union 🇪🇺'
        else:
            return code
    else:
        return f"{selected_country.name} {selected_country.flag}"
     
# Uncomment the line below to test that the function works
# df_country_sanctions['sanction_country'].apply(get_country_name)

In [96]:
df_country_sanctions['sanction_country'] = df_country_sanctions['sanction_country'].apply(get_country_name)

In [97]:
df_country_sanctions.head(10)

Unnamed: 0,sanction_country,num_targets
0,Ukraine 🇺🇦,1444
1,United States 🇺🇸,860
2,European Union 🇪🇺,421
3,Switzerland 🇨🇭,408
4,France 🇫🇷,384
5,Belgium 🇧🇪,382
6,Monaco 🇲🇨,354
7,United Kingdom 🇬🇧,335
8,Canada 🇨🇦,297
9,Australia 🇦🇺,229


## 4.3 Create a bar plot

In [127]:
# Let me further tweak the dataframe to make it easier to plot

plot_df = (
    df_country_sanctions.sort_values(by='num_targets', ascending=True)
    .tail(10)
)
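
Sorting in ascending order and then taking `.tail(10)` keeps the ten largest values, ordered from smallest to largest. A roughly equivalent sketch uses `nlargest()`, shown here on a hypothetical `df_toy` with made-up country codes and counts:

```python
import pandas as pd

# Hypothetical summary table
df_toy = pd.DataFrame({
    "sanction_country": ["aa", "bb", "cc", "dd"],
    "num_targets": [5, 40, 12, 30],
})

# Approach used in this notebook: sort ascending, keep the last rows
top_by_tail = df_toy.sort_values(by="num_targets", ascending=True).tail(2)

# Alternative: pick the largest rows first, then sort them ascending
top_by_nlargest = df_toy.nlargest(2, "num_targets").sort_values(by="num_targets")

print(top_by_tail["sanction_country"].tolist())
print(top_by_nlargest["sanction_country"].tolist())
```

The two approaches can order tied values differently, so the toy data deliberately avoids ties.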

In [140]:
(
    ggplot(plot_df,
           aes(y='sanction_country', x='num_targets', fill= 'num_targets'))    
    + geom_bar(stat='identity')
    + geom_text(aes(label='num_targets'), nudge_x=50, size=6)
    + scale_fill_gradient(name="Number of targets", low='blue', high='red', guide='none')
    # remove count and country name from the legend
    + labs(title='Unsurprisingly, Ukraine has sanctioned the most targets', 
           subtitle='Our sample consists of 4000 random targets so don\'t read too much into this',
           x='Number of unique targets', y='Top 10 countries')
    + ggsize(800, 300)
    + theme(
        plot_title=element_text(size=14, face='bold'),
        plot_subtitle=element_text(size=12),
    )
)