**Richard Davies, Finn McEvoy, Josh Hellings** - Automated Data Visualisation for Policymaking 2025

<a href="https://colab.research.google.com/drive/1W8kRQ9LEqRJBOJIHLvpA-R8KNgNmhiwv?usp=sharing" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python

The bread and butter of a data workflow is cleaning and preparation, taking raw datasets and transforming them into a useful form.

So far in the course, we've shown the manual process of finding data, cleaning it in Excel, uploading it, and charting it. Now, we'll start tidying up this workflow, pulling data directly into a Python notebook and preparing it for charting with Vega-Lite.

<br>

But first, a quick primer...

## What is a Python Notebook? (and why Google Colab)

A Python notebook is an interactive document made up of two types of cells:
- Code cells: where you write and run Python code.
- Markdown cells: where you write formatted text (headings, bullet points, links, images).

Why notebooks are great:
- **Interactive**: run code in small steps, see results immediately.
- **Narrative + code**: combine explanation, code, and outputs in one place.
- **Reproducible**: keep a record of the exact code that produced your results.

We'll run this notebook in **Google Colab**, a free, cloud-based environment with Python pre-installed:
- No setup required on your machine.
- Run cells with Shift+Enter.


<br>

---

# Introducing Tools: `Requests`

The `Requests` module allows us to fetch resources from the internet, whether these are `CSVs`, `JSONs`, images, `HTML` or anything else. This is particularly important for requesting data from APIs.

<br>

It is simple to use. Usually all we need to do is:

1. Make a request with `requests.get` and our target URL. For example, we can request something from our GitHub repos:

    `req = requests.get("https://raw.githubusercontent.com/mclass-user/mclass-user.github.io/main/s2_chart1.json")`
<br>
<br>
2. Access the fetched data, using `req.json()` for JSON data or `req.text` for most other data. For example, we can see the returned JSON for the chart we just fetched:
    <br>
    <br>
    `data = req.json()`
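
As a minimal sketch, the two steps above can be wrapped in one small function. The helper name `fetch_json` is hypothetical (it is not part of the course code), and the 10-second timeout is an arbitrary assumption. It adds two optional safeguards: a `timeout`, so a stalled request does not hang the notebook, and `raise_for_status()`, which raises an error if the server replies with a 4xx or 5xx status.

```python
import requests


def fetch_json(url, timeout=10):
    """Fetch a URL and return its body parsed as JSON.

    Hypothetical helper for illustration; the default timeout is an assumption.
    """
    req = requests.get(url, timeout=timeout)  # step 1: make the request
    req.raise_for_status()                    # stop early on HTTP errors such as 404
    return req.json()                         # step 2: parse the JSON body


# Example usage (needs an internet connection):
# data = fetch_json("https://raw.githubusercontent.com/mclass-user/mclass-user.github.io/main/s2_chart1.json")
```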

First, let's import the modules we'll be using.

In [1]:
# 1. PREPARATORY STEPS - ACCESS PACKAGES WE NEED

## // The "requests" package, for opening web sites and retrieving information:
import requests

## // The "pandas" package, for working with tabular data. This will be your most used tool.
import pandas as pd

In [2]:
response = requests.get("https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/refs/heads/main/charts/library/chartLine3.json")

# Let's try and view the data
print(response.json())


{'$schema': 'https://vega.github.io/schema/vega-lite/v5.json', 'title': {'text': 'Declining populations', 'subtitle': 'Cumulative % change in population. Source: UN.', 'subtitleFontStyle': 'italic', 'subtitleFontSize': 10, 'anchor': 'start', 'color': 'black'}, 'data': {'url': 'https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/data/demographicsUNPopChange.csv'}, 'repeat': {'layer': ['S Korea', 'Japan', 'UK', 'Italy', 'Spain', 'France', 'US']}, 'spec': {'height': 300, 'width': 240, 'mark': {'type': 'line', 'strokeWidth': 2}, 'encoding': {'x': {'field': 'Year', 'type': 'temporal'}, 'y': {'field': {'repeat': 'layer'}, 'type': 'quantitative', 'title': None}, 'color': {'datum': {'repeat': 'layer'}, 'scale': {'range': ['black', 'yellow', 'orange', 'lightgrey', 'lightblue', 'pink', 'lightgreen']}}}}}


So we've managed to bring some data (in this case, a JSON chart) from online into our Python Notebook using the `requests` module.
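
`response.json()` returns ordinary Python dictionaries and lists, so individual parts of the chart can be picked out with square brackets. The sketch below uses a trimmed, hand-copied subset of the output above instead of making a new request; the variable names `spec`, `chart_title`, `data_url` and `countries` are illustrative.

```python
# A trimmed copy of the chart specification printed above, so this runs offline.
spec = {
    "title": {
        "text": "Declining populations",
        "subtitle": "Cumulative % change in population. Source: UN.",
    },
    "data": {
        "url": "https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/data/demographicsUNPopChange.csv"
    },
    "repeat": {"layer": ["S Korea", "Japan", "UK", "Italy", "Spain", "France", "US"]},
}

print(list(spec.keys()))             # the top-level keys of the dictionary

chart_title = spec["title"]["text"]  # nested lookup: a dictionary inside a dictionary
data_url = spec["data"]["url"]       # where the chart loads its data from
countries = spec["repeat"]["layer"]  # a plain Python list

print(chart_title)
print(len(countries), "series:", countries)
```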

<br>

Now, let's move on to pulling some data into the notebook, which we will clean and prepare for charting.

<br>
<br>

---

# Practical

### Load a CSV from GitHub

There are two simple ways to read a CSV from a GitHub RAW URL:
- Requests + pandas: fetch with `requests.get`, then save as CSV and load into pandas
- Pandas directly: `pd.read_csv(raw_url)` (simplest for public files)

We'll use the Shiller dataset for both examples. Source website: [shillerdata.com](https://shillerdata.com/). The CSV data is stored on GitHub [here](https://github.com/jhellingsdata/jhellingsdata.github.io/blob/main/Data/data_shiller.csv).


<br>

##### Download the data:

Fetch with Requests, save as a CSV, then load into pandas.

In [3]:
# Define a variable with our URL
csv_url = "https://raw.githubusercontent.com/jhellingsdata/jhellingsdata.github.io/refs/heads/main/Data/data_shiller.csv"

In [4]:
# Make a GET request to retrieve the contents at our URL
response = requests.get(csv_url)
response.text

'\ufeffDate,Date_Fraction,P,D,E,CPI,Rate GS10,Price_real,Dividend_real,Earnings_real\r\n1871.01,1871.041667,4.44,0.26,0.4,12.46406116,5.32,115.6560178,6.772649694,10.41946107\r\n1871.02,1871.125,4.5,0.26,0.4,12.84464132,5.323333333,113.7457998,6.571979544,10.11073776\r\n1871.03,1871.208333,4.61,0.26,0.4,13.0349719,5.326666667,114.8247907,6.476018564,9.963105482\r\n1871.04,1871.291667,4.74,0.26,0.4,12.55922645,5.33,122.5350372,6.721331153,10.34050947\r\n1871.05,1871.375,4.86,0.26,0.4,12.27381157,5.333333333,128.5587538,6.877628805,10.58096739\r\n1871.06,1871.458333,4.82,0.26,0.4,12.08348099,5.336666667,129.5089587,6.985960426,10.74763142\r\n1871.07,1871.541667,4.73,0.26,0.4,12.08348099,5.34,127.0907416,6.985960426,10.74763142\r\n1871.08,1871.625,4.79,0.26,0.4,11.8932314,5.343333333,130.7616768,7.097711056,10.91955547\r\n1871.09,1871.708333,4.84,0.26,0.4,12.17864628,5.346666667,129.0301437,6.931371358,10.66364824\r\n1871.1,1871.791667,4.59,0.26,0.4,12.36889587,5.35,120.4832263,6.82475791
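
The raw text is hard to read as one long string. Two things stand out in it: `\ufeff` at the very start is a byte order mark left by the program that saved the file, and `\r\n` marks the end of each row. As a rough sketch, using a hand-copied fragment of the output above (the names `raw_text`, `lines` and `header` are illustrative), we can split the text into rows to check the column names:

```python
# The first three rows of the response text shown above, copied by hand.
raw_text = (
    "\ufeffDate,Date_Fraction,P,D,E,CPI,Rate GS10,Price_real,Dividend_real,Earnings_real\r\n"
    "1871.01,1871.041667,4.44,0.26,0.4,12.46406116,5.32,115.6560178,6.772649694,10.41946107\r\n"
    "1871.02,1871.125,4.5,0.26,0.4,12.84464132,5.323333333,113.7457998,6.571979544,10.11073776\r\n"
)

# Drop the byte order mark, then split on line endings (splitlines handles \r\n)
lines = raw_text.lstrip("\ufeff").splitlines()

# The first line holds the column names, separated by commas
header = lines[0].split(",")

print(header)
print(len(lines) - 1, "data rows in this fragment")
```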

<br>

Save the response as a CSV file.

In [5]:
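# Save contents as a CSV file
# encoding='utf-8' makes the text encoding explicit; newline='' keeps the original line endings unchanged
with open('data_shiller.csv', mode='w', encoding='utf-8', newline='') as f:
    f.write(response.text)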

<br>

Load the data back in using Pandas

In [6]:
# Load in the data using Pandas
df = pd.read_csv('data_shiller.csv')

df

Unnamed: 0,Date,Date_Fraction,P,D,E,CPI,Rate GS10,Price_real,Dividend_real,Earnings_real
0,1871.01,1871.041667,4.440000,0.260000,0.40,12.464061,5.320000,115.656018,6.772650,10.419461
1,1871.02,1871.125000,4.500000,0.260000,0.40,12.844641,5.323333,113.745800,6.571980,10.110738
2,1871.03,1871.208333,4.610000,0.260000,0.40,13.034972,5.326667,114.824791,6.476019,9.963105
3,1871.04,1871.291667,4.740000,0.260000,0.40,12.559226,5.330000,122.535037,6.721331,10.340509
4,1871.05,1871.375000,4.860000,0.260000,0.40,12.273812,5.333333,128.558754,6.877629,10.580967
...,...,...,...,...,...,...,...,...,...,...
1853,2025.06,2025.458333,6029.951500,77.350000,222.53,322.561000,4.380000,6069.414509,77.856217,223.986347
1854,2025.07,2025.541667,6296.498182,77.726667,,323.048000,4.390000,6328.151413,,
1855,2025.08,2025.625000,6408.949524,78.103333,,323.976000,4.260000,6422.717917,,
1856,2025.09,2025.708333,6584.018095,78.480000,,324.440000,4.120000,6588.726184,,


<br>
<br>

---

**Alternative method 1**

It's also possible to skip saving the CSV by using `io.StringIO()` to wrap the text response in a file-like object that pandas can read.

```python
import io                                       # Standard-library module that provides StringIO

response = requests.get(csv_url)                # Perform request to the URL
response_parsed = io.StringIO(response.text)    # Wrap the text response in a file-like object
df = pd.read_csv(response_parsed)               # Load the data into a Pandas DataFrame
```

<br>

**Alternative method 2**

When pulling CSVs from the web (for instance, copying a CSV download link), we'll typically need to use one of the approaches above.

However, when pulling CSV data from raw GitHub links, we can pull the data directly into a Pandas DataFrame.

```python
## Load from CSV URL directly into pandas (this will work for raw GitHub CSV links)
df = pd.read_csv(csv_url)
```

---

<br>
<br>

### Clean and transform

Before plotting the Price-Earnings ratio, we'll first need to calculate this value in a new column. We should also check the dates are in a format that works nicely in Vega-Lite (yyyy-mm-dd)


<br>

Let's inspect our dataframe again

In [7]:
df.head()       # .head() previews the first 5 rows; use .head(n) for the first n rows

Unnamed: 0,Date,Date_Fraction,P,D,E,CPI,Rate GS10,Price_real,Dividend_real,Earnings_real
0,1871.01,1871.041667,4.44,0.26,0.4,12.464061,5.32,115.656018,6.77265,10.419461
1,1871.02,1871.125,4.5,0.26,0.4,12.844641,5.323333,113.7458,6.57198,10.110738
2,1871.03,1871.208333,4.61,0.26,0.4,13.034972,5.326667,114.824791,6.476019,9.963105
3,1871.04,1871.291667,4.74,0.26,0.4,12.559226,5.33,122.535037,6.721331,10.340509
4,1871.05,1871.375,4.86,0.26,0.4,12.273812,5.333333,128.558754,6.877629,10.580967


<br>

##### 1. Compute the price–earnings ratio: `PE = P / E`.

In [8]:
# 1) Price-Earnings ratio
df['PE_ratio'] = df['P'] / df['E']
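
The division is applied row by row, and rows with no earnings figure (such as the most recent months in the table above) end up with a missing ratio too. A small sketch on a three-row stand-in table, with values copied from rows 0, 1853 and 1854 of the output above (the name `sample` is illustrative):

```python
import pandas as pd

# Stand-in for the full dataset: the last row has no earnings value yet
sample = pd.DataFrame({
    "P": [4.44, 6029.9515, 6296.498182],
    "E": [0.40, 222.53, None],
})

# Element-wise division: each row's P is divided by the same row's E
sample["PE_ratio"] = sample["P"] / sample["E"]

print(sample)

# Count how many ratios could not be computed
print(sample["PE_ratio"].isna().sum(), "row(s) with a missing PE ratio")
```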

<br>

##### 2. Parse `Date` from `YYYY.MM` to ISO `YYYY-MM-DD` (using the first day of the month).

First, let's make sure the data type for our 'Date' column is text (i.e. string), and not numerical (such as Integer or Float)

In [9]:
# Convert our column data type to string
df['Date'] = df['Date'].astype(str)

<br>

Pandas has built-in functions to help perform loads of tasks. We can use `pd.to_datetime()` to parse text into proper dates. We give the function two things:
- our date column, `df['Date']`
- the current format of the date values, `format='%Y.%m'`. These are the standard `strftime`-style format codes, which D3 (and so Vega) also uses. See more [here](https://d3js.org/d3-time-format#locale_format). `%Y` corresponds to 4-digit years, and `%m` to 2-digit months (01 for January, etc.).

In [10]:
# Convert to datetime and format as YYYY-MM-DD
pd.to_datetime(df['Date'], format='%Y.%m')

0      1871-01-01
1      1871-02-01
2      1871-03-01
3      1871-04-01
4      1871-05-01
          ...    
1853   2025-06-01
1854   2025-07-01
1855   2025-08-01
1856   2025-09-01
1857   2025-01-01
Name: Date, Length: 1858, dtype: datetime64[ns]
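
One thing to watch in the output above: the final row (1857) shows `2025-01-01` even though it follows `2025-09-01`. The source CSV writes October without its trailing zero (`1871.1`, `2025.1`), so after `astype(str)` the text is `'2025.1'`, which `%m` reads as month 1. A possible fix, sketched here on a three-value stand-in Series rather than the full dataset, is to format each number to two decimal places (applied to the numeric column, in place of `astype(str)`) before parsing. The names `dates`, `naive` and `padded` are illustrative.

```python
import pandas as pd

# Stand-in for the numeric Date column: September, October, November 1871
dates = pd.Series([1871.09, 1871.1, 1871.11])

# Plain string conversion keeps the missing zero: 1871.1 becomes "1871.1"
naive = pd.to_datetime(dates.astype(str), format="%Y.%m")

# Formatting to two decimal places restores it: 1871.1 becomes "1871.10"
padded = pd.to_datetime(dates.map("{:.2f}".format), format="%Y.%m")

print(naive.dt.month.tolist())   # October is misread as January
print(padded.dt.month.tolist())  # September, October, November as intended
```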

Let's save this as a new column

In [11]:
# Parse Date (YYYY.MM -> YYYY-MM-DD)
df['Date_iso'] = pd.to_datetime(df['Date'], format='%Y.%m')
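
`Date_iso` now holds pandas datetime values rather than text. If plain `YYYY-MM-DD` strings are preferred, for example to control exactly what appears in an exported file, one option is `.dt.strftime`. A minimal sketch on a small stand-in Series (the names `parsed` and `as_text` are illustrative):

```python
import pandas as pd

# Stand-in for df['Date_iso']: three already-parsed monthly dates
parsed = pd.to_datetime(pd.Series(["1871.01", "1871.02", "1871.03"]), format="%Y.%m")

# Render each datetime as text in year-month-day form
as_text = parsed.dt.strftime("%Y-%m-%d")

print(as_text.tolist())
```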

<br>

Check our finished dataframe

In [12]:
df.head(13)

Unnamed: 0,Date,Date_Fraction,P,D,E,CPI,Rate GS10,Price_real,Dividend_real,Earnings_real,PE_ratio,Date_iso
0,1871.01,1871.041667,4.44,0.26,0.4,12.464061,5.32,115.656018,6.77265,10.419461,11.1,1871-01-01
1,1871.02,1871.125,4.5,0.26,0.4,12.844641,5.323333,113.7458,6.57198,10.110738,11.25,1871-02-01
2,1871.03,1871.208333,4.61,0.26,0.4,13.034972,5.326667,114.824791,6.476019,9.963105,11.525,1871-03-01
3,1871.04,1871.291667,4.74,0.26,0.4,12.559226,5.33,122.535037,6.721331,10.340509,11.85,1871-04-01
4,1871.05,1871.375,4.86,0.26,0.4,12.273812,5.333333,128.558754,6.877629,10.580967,12.15,1871-05-01
5,1871.06,1871.458333,4.82,0.26,0.4,12.083481,5.336667,129.508959,6.98596,10.747631,12.05,1871-06-01
6,1871.07,1871.541667,4.73,0.26,0.4,12.083481,5.34,127.090742,6.98596,10.747631,11.825,1871-07-01
7,1871.08,1871.625,4.79,0.26,0.4,11.893231,5.343333,130.761677,7.097711,10.919555,11.975,1871-08-01
8,1871.09,1871.708333,4.84,0.26,0.4,12.178646,5.346667,129.030144,6.931371,10.663648,12.1,1871-09-01
9,1871.1,1871.791667,4.59,0.26,0.4,12.368896,5.35,120.483226,6.824758,10.499628,11.475,1871-01-01


<br>
<br>

Finally, we can save the data as a CSV.

In [13]:
# Include the .csv extension; index=False leaves out the row-number column
df.to_csv('data_shiller_cleaned.csv', index=False)

The CSV should appear in the files tab of your Colab workspace.

NOTE: When working in Colab, the CSV is only saved to the instance running your code. Make sure to download your data, otherwise it will be lost once the instance shuts down.
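
Besides downloading from the files tab, Colab can also trigger the download from code using `files.download` in the `google.colab` module. The wrapper below is a hypothetical helper (the name `download_from_colab` is an assumption, not part of the course code) that does nothing when run outside Colab:

```python
def download_from_colab(path):
    """Trigger a browser download of `path` when running inside Google Colab.

    Hypothetical helper for illustration; returns False outside Colab.
    """
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(path)  # opens the browser's download prompt
    return True


# Example usage: pass the same filename you gave to df.to_csv
# download_from_colab("<your saved file>")
```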