<a href="https://colab.research.google.com/github/sguzik/ga4-data-api-starter/blob/main/Getting_Started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieving data from the GA4 Analytics API

This notebook documents how to get started with the [Google Analytics Data API](https://developers.google.com/analytics/devguides/reporting/data/v1). It is intended as a resource for product managers and data analysts trying to transition to pulling data from GA4.

These notes were compiled by [Sam Guzik](https://samguzik.com), product lead at New York Public Radio. The code below is synthesized from a number of sources. The goal is to save others the time of searching for foundational answers so they can dive into analysis more quickly.

The key steps this document will cover are:


*   Creating a service account in Google Cloud Platform to access GA4 
*   Giving the service account access to GA4
*   Configuring a query in a Colab notebook

By the end of this notebook, you should be in a position to analyze your data using Pandas.


## Creating a service account in Google Cloud Platform to access GA4

In order to use the GA4 API, you need a service account with permissions to make API calls.

[The quickstart page](https://developers.google.com/analytics/devguides/reporting/data/v1/quickstart-client-libraries) of the Google Analytics Data API offers a one-click way to create a service account in Google Cloud Platform. If that button doesn't work, you'll need to follow these steps to create the account manually.

*You should only need to do this the first time you work with the GA4 API.*

1. Go to [Google Cloud](https://cloud.google.com) and sign into the management console. If you are using a Google account managed by your employer, you may need special permissions to access the cloud management console.
2. Select an existing project or create a new project. [Here are Google's directions](https://developers.google.com/workspace/guides/create-project) for creating a new project.
3. Search for `Google Analytics Data API` in the search bar. Click `Enable` on the results page. Note that there is also a `Google Analytics API` -- that's the legacy API for UA properties.
4. Navigate to `APIs & Services > Credentials` from the Navigation menu.
5. Click `Create Credentials` at the top of the window. Select `Service Account`.
6. Enter a name for the new account. Use whatever naming convention makes sense for your work. Click `Create Account` then click `Done` at the bottom of the screen.
7. On the `Credentials` page, select the account you just created from the list. It will be in the `Service Accounts` section.
8. Select `Keys` in the horizontal menu under the account name.
9. Click `Add Key` and then `Create new key` in the submenu. In the dialog box, select JSON. The prompt will start a file download.
10. Rename the downloaded file `credentials.json` and keep it for use later in the process.
11. Note the email address for your service account. You can find it on the `Details` tab. It will be in the format `<ACCOUNT_NAME>@<CLOUD_PROJECT_ID>.iam.gserviceaccount.com`. You will need it in the next step.

*The JSON file downloaded in this step includes authentication details for your account. Keep it in a safe place and do not store it in version control systems like GitHub.*
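
If you lose track of the service account email from step 11, it is also stored in the key file itself, in the `client_email` field. Here is a minimal sketch for reading it back. The function name `read_service_account_email` is my own invention for illustration, and it assumes the file was renamed to `credentials.json` as in step 10.

```python
import json


def read_service_account_email(path="credentials.json"):
    """Return the client_email stored in a service account key file."""
    with open(path, encoding="utf-8") as key_file:
        key = json.load(key_file)
    # Service account keys carry type == "service_account". Other JSON files
    # (for example OAuth client secrets) will not work with this walkthrough.
    if key.get("type") != "service_account":
        raise ValueError(f"{path} does not look like a service account key")
    return key["client_email"]
```

Calling `read_service_account_email()` next to the downloaded file should return the address you need to add to GA4 in the next section.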

## Giving the service account access to GA4

Now that you've created a service account, you need to give it access to GA4. Add the email address you saved in step 11 of the previous section to your GA account just like you would any new user.


1.   Log in to GA4 and click `Admin` in the left-hand navigation pane.
2.   Click `Property Access Management` in the `Property` column.
3.   Add the email address you noted above. `Viewer` permissions should be sufficient to pull data.

If you manage multiple GA properties, you can add the same service account to all of them at once by selecting `Account Access Management` in the `Account` column (on the left of the Admin panel).

## Configuring a query in a Colab notebook

This is the meat of getting data from GA4. 

This example assumes that you're using a Colab notebook, but the same principles should work in any Python script.

*NOTE: As of this writing, the Google Analytics Data API is still in beta. Details of this process may change as the API evolves.*

### Install dependencies

We start by installing the `google-analytics-data` package. This includes the functions we'll need to call the GA4 API. The package's documentation is available [here](https://googleapis.dev/python/analyticsdata/latest/).

When you run this step, Colab may give you a warning about a newly installed version of the `[google]` package. It will give you an option to restart the Colab runtime in order to use that package. If you do that, just start over from this step.

In [None]:
!pip install google-analytics-data
!pip install --upgrade pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
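
If you did restart the runtime, it can be reassuring to confirm that the packages are actually available before moving on. The sketch below uses the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, not part of any Google package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string for each package, or None if it is not installed.
print(installed_version("google-analytics-data"))
print(installed_version("pandas"))
```

If either line prints `None`, re-run the install cell above.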


Now we import [Pandas](https://pandas.pydata.org), which we'll use for data manipulation later in the process. We also import `os`, which will let us use environment variables to store our service account log-in information.

In [None]:
import pandas as pd
import os

### Upload service account credentials
Upload the `credentials.json` file you downloaded in step 10 of the first section into the `Files` pane of your Colab notebook.

Colab only stores files for a single session, so you'll need to repeat this every time you return to the project. That's by design -- it helps ensure your credentials stay secure.

In [None]:
# set credentials for GA4 login
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials.json"
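
Because Colab discards uploaded files between sessions, a forgotten upload usually only shows up later as an authentication error. As a sketch, you could wrap the assignment above in a small check that fails early instead. The helper name `use_credentials_file` is an assumption of mine, not something the Google client library provides.

```python
import os


def use_credentials_file(path="credentials.json"):
    """Point GOOGLE_APPLICATION_CREDENTIALS at path, failing early if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; upload it to the Files pane and re-run this cell"
        )
    absolute = os.path.abspath(path)
    # Same environment variable as the cell above, but with an absolute path
    # so a later change of working directory does not break the lookup.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = absolute
    return absolute
```

Calling `use_credentials_file()` in place of the direct assignment should behave the same way when the file is present.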

### Configure the query

Here's where the fun begins.

First we import the client and the request types we need from the `google-analytics-data` package.

Note that different queries will require different types. As you go deeper with the API, you may find that you need to import additional types here.

In [None]:
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    Filter,
    FilterExpression,
    FilterExpressionList,
    RunReportRequest,
)

Now, let's set up our first query.

Let's retrieve the top article pages from our site (Gothamist.com, in this example) since the beginning of the year. We'll retrieve two dimensions at once: The page title and article publish date (a custom dimension we've configured for Gothamist).

Note the difference in naming convention for the two dimensions. Page title is a built-in GA4 dimension, so it uses a value from [this list of dimensions](https://developers.google.com/analytics/devguides/reporting/data/v1/api-schema#dimensions) (specifically `pageTitle`). Article publish date is a custom dimension, so it uses a special format: `customEvent:article_publish_date`. There's more information about that [here](https://developers.google.com/analytics/devguides/reporting/data/v1/api-schema#custom_dimensions).
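
To make the naming rule concrete, here is a small sketch that builds the API name from the parameter name you registered in GA4. The helper `custom_dimension_name` is hypothetical. It covers event-scoped dimensions (`customEvent:`) as used in this notebook, plus user-scoped ones (`customUser:`), which follow the same pattern in the API schema.

```python
def custom_dimension_name(parameter_name, scope="event"):
    """Build the Data API name for a custom dimension."""
    prefixes = {"event": "customEvent", "user": "customUser"}
    if scope not in prefixes:
        raise ValueError(f"unsupported scope: {scope!r}")
    return f"{prefixes[scope]}:{parameter_name}"


print(custom_dimension_name("article_publish_date"))
# customEvent:article_publish_date
```

The returned string is what you would pass as `name` when creating a `Dimension`.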

We'll pull two metrics for our report: total users and page views. The list of metrics available to the API is [here](https://developers.google.com/analytics/devguides/reporting/data/v1/api-schema#metrics).

To demonstrate how to group filters together, we'll apply two: one to restrict our report to page view events and a second to limit it to article pages (using a custom dimension configured for Gothamist.com). You'll need to adapt those filters to work with your site. More details about how to configure filters for queries are available [here](https://developers.google.com/analytics/devguides/reporting/data/v1/basics#dimension_filters).
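
The nesting of a grouped filter can be hard to see at first. As an illustration only, this sketch builds the same two-filter structure as plain dictionaries, using the camelCase field names from the JSON body of the REST API. The helper names `string_filter` and `and_group` are my own. The Python client used in this notebook takes typed objects (`FilterExpression`, `FilterExpressionList`, `Filter`) rather than dictionaries like these.

```python
import json


def string_filter(field_name, value):
    """One leaf filter: match a dimension against a string value."""
    return {"filter": {"fieldName": field_name, "stringFilter": {"value": value}}}


def and_group(*expressions):
    """Combine filter expressions so that all of them must match."""
    return {"andGroup": {"expressions": list(expressions)}}


dimension_filter = and_group(
    string_filter("eventName", "page_view"),
    string_filter("customEvent:page_type", "article"),
)
print(json.dumps(dimension_filter, indent=2))
```

Each leaf is a filter expression wrapping one filter, and the group is itself a filter expression wrapping a list of them. That is the same shape the typed objects take in the query.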


In [None]:
def run_report(property_id="YOUR-GA4-PROPERTY-ID"):
    """Runs a simple report on a Google Analytics 4 property."""
    # This code is adapted from the Google Analytics Data API quickstart.
    # https://developers.google.com/analytics/devguides/reporting/data/v1/quickstart-client-libraries

    # Using a default constructor instructs the client to use the credentials
    # specified in GOOGLE_APPLICATION_CREDENTIALS environment variable.
    client = BetaAnalyticsDataClient()

    request = RunReportRequest(
        property=f"properties/{property_id}",
        dimensions=[
            Dimension(name="pageTitle"),
            Dimension(name="customEvent:article_publish_date")
        ],
        metrics=[
            Metric(name="totalUsers"),
            Metric(name="screenPageViews"),
        ],
        # Edit this value to set how many responses you get
        limit=100,
        # Edit the daterange here. The API also supports multiple date ranges.
        date_ranges=[DateRange(start_date="2023-01-01", end_date="yesterday")],
        dimension_filter=FilterExpression(
            and_group=FilterExpressionList(
                expressions=[
                    FilterExpression(
                      filter=Filter(
                          field_name="eventName",
                          string_filter=Filter.StringFilter(value="page_view"),
                      )
                    ),
                    FilterExpression(
                        filter=Filter(
                            field_name="customEvent:page_type",
                            string_filter=Filter.StringFilter(value="article"),
                        )
                    ),
                ]
            )
        ),
    )
    response = client.run_report(request)

    return ga4_response_to_df(response)
    
# Adapted from:
# https://serhiipuzyrov.com/2021/03/how-to-get-google-analytics-4-property-report-to-pandas-dataframe-using-api/
def ga4_response_to_df(response):
    """Converts a report response into a DataFrame with one row per report row."""
    dim_len = len(response.dimension_headers)
    metric_len = len(response.metric_headers)
    all_data = []
    for row in response.rows:
        row_data = {}
        for i in range(0, dim_len):
            row_data.update({response.dimension_headers[i].name: row.dimension_values[i].value})
        for i in range(0, metric_len):
            row_data.update({response.metric_headers[i].name: row.metric_values[i].value})
        all_data.append(row_data)
    df = pd.DataFrame(all_data)
    return df
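
If you want to see what the flattening step does without calling the API, the sketch below runs the same idea against a stand-in response built from `SimpleNamespace` objects. The stand-in only mimics the attributes the function reads (`dimension_headers`, `metric_headers`, `rows`). It is not a real API response, and `response_to_records` is a hypothetical variant that returns plain dictionaries instead of a DataFrame.

```python
from types import SimpleNamespace


def response_to_records(response):
    """Flatten report rows into a list of {header name: value} dictionaries."""
    dim_names = [header.name for header in response.dimension_headers]
    metric_names = [header.name for header in response.metric_headers]
    records = []
    for row in response.rows:
        record = {}
        # Values arrive in the same order as their headers.
        for name, cell in zip(dim_names, row.dimension_values):
            record[name] = cell.value
        for name, cell in zip(metric_names, row.metric_values):
            record[name] = cell.value
        records.append(record)
    return records


# Stand-in for a response with one dimension, one metric and one row.
fake_response = SimpleNamespace(
    dimension_headers=[SimpleNamespace(name="pageTitle")],
    metric_headers=[SimpleNamespace(name="totalUsers")],
    rows=[
        SimpleNamespace(
            dimension_values=[SimpleNamespace(value="Example headline")],
            metric_values=[SimpleNamespace(value="42")],
        )
    ],
)
records = response_to_records(fake_response)
print(records)
# [{'pageTitle': 'Example headline', 'totalUsers': '42'}]
```

Passing `records` to `pd.DataFrame` gives a DataFrame like the one returned by `ga4_response_to_df`. Note that the metric value is the string `'42'`, not a number.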


In [None]:
df = run_report(property_id="314466847")
df.head(10)

| | pageTitle | customEvent:article_publish_date | totalUsers | screenPageViews |
|---|---|---|---|---|
| 0 | Holdout tenant in $1,500 West Village apartmen... | 2023-02-05T12:01:00.000Z | 137519 | 142857 |
| 1 | Is New York City facing a ‘doom loop’ scenario... | 2023-01-02T10:01:17.004Z | 136304 | 156198 |
| 2 | High School Teacher Coerced Teen Into Posing F... | 2012-10-13T14:25:56.000Z | 120303 | 128502 |
| 3 | Dumping radioactive water in Hudson River is ‘... | 2023-02-17T17:00:00.000Z | 91291 | 105266 |
| 4 | Norovirus, a gross stomach bug, appears to be ... | 2023-02-24T11:01:00.000Z | 72397 | 83759 |
| 5 | MTA’s modern subway cars rolling on the tracks... | 2023-02-03T19:10:00.000Z | 64442 | 68349 |
| 6 | MTA union official and conductor brawl in Bron... | 2023-02-22T18:23:00.000Z | 51281 | 61154 |
| 7 | Flaco to remain free: Central Park Zoo gives u... | 2023-02-19T18:06:00.000Z | 44229 | 49689 |
| 8 | What New Yorkers need to know as thousands of ... | 2023-01-09T11:01:00.000Z | 41249 | 48231 |
| 9 | Sue Simmons Explains Why She Dropped F-Bomb On... | 2012-10-18T16:02:00.000Z | 39089 | 42768 |
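
The query above stops at `limit=100` rows. If you need more than one page of results, `RunReportRequest` also accepts an `offset`, and the response reports the total number of matching rows in `row_count`. Here is a minimal sketch of the arithmetic for working out which offsets to request. The helper `page_offsets` is hypothetical, and wiring it into `run_report` is left as an exercise.

```python
def page_offsets(row_count, limit):
    """Return the offset of each page needed to cover row_count rows."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return list(range(0, row_count, limit))


print(page_offsets(250, 100))
# [0, 100, 200]
```

Each offset would go into a separate request with the same `limit`, and the resulting DataFrames can be combined with `pd.concat`.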


### Cleanup

You should now have your analytics data in a Pandas dataframe. You can analyze it however you'd like.

Before proceeding, you may want to ensure that all numeric values are actually stored as numbers. I've noticed that metrics come through as strings (Pandas dtype `object`), making analysis harder.

So this code fixes that (you can adjust to reference the column names in your query).

In [None]:
df['totalUsers'] = pd.to_numeric(df['totalUsers'])
df['screenPageViews'] = pd.to_numeric(df['screenPageViews'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   pageTitle                         100 non-null    object
 1   customEvent:article_publish_date  100 non-null    object
 2   totalUsers                        100 non-null    int64 
 3   screenPageViews                   100 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 3.2+ KB
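
An alternative to converting columns afterwards is to convert the metric strings before the DataFrame is built. This is only a sketch of that idea using plain Python. The helpers `to_number` and `convert_metrics` are hypothetical, and they assume each row is a dictionary of strings keyed by column name.

```python
def to_number(text):
    """Convert a metric string to int when possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def convert_metrics(records, metric_names):
    """Return copies of the records with the named metric columns made numeric."""
    return [
        {
            key: to_number(value) if key in metric_names else value
            for key, value in record.items()
        }
        for record in records
    ]


rows = [{"pageTitle": "Example headline", "totalUsers": "137519"}]
print(convert_metrics(rows, {"totalUsers"}))
# [{'pageTitle': 'Example headline', 'totalUsers': 137519}]
```

For most notebooks `pd.to_numeric` is the simpler choice. This version is mainly useful if you want to work with the rows before they reach Pandas.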


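Now I can get basic summary statistics for the numeric columns. The output below also includes `publishedHour`, a column created in the next step; the cell appears to have been re-run after that step.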

In [None]:
df.describe()

| | totalUsers | screenPageViews | publishedHour |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 24000.68 | 27147.49 | 15.63 |
| std | 22855.146023 | 25111.692137 | 4.990203 |
| min | 10893.0 | 11990.0 | 0.0 |
| 25% | 13505.0 | 15106.0 | 12.0 |
| 50% | 17351.0 | 19798.0 | 16.5 |
| 75% | 22125.75 | 25201.0 | 19.0 |
| max | 137519.0 | 156198.0 | 23.0 |


In my example, I also want to treat the `article_publish_date` as a datetime value to make analysis easier. Here's how I do that:

In [None]:
df['dateTime'] = pd.to_datetime(df['customEvent:article_publish_date'])
df['publishedDate'] = df['dateTime'].dt.date
df['publishedDay'] = df['dateTime'].dt.day_name()
df['publishedHour'] = df['dateTime'].dt.hour
df.head(5)

| | pageTitle | customEvent:article_publish_date | totalUsers | screenPageViews | dateTime | publishedDate | publishedDay | publishedHour |
|---|---|---|---|---|---|---|---|---|
| 0 | Holdout tenant in $1,500 West Village apartmen... | 2023-02-05T12:01:00.000Z | 137519 | 142857 | 2023-02-05 12:01:00+00:00 | 2023-02-05 | Sunday | 12 |
| 1 | Is New York City facing a ‘doom loop’ scenario... | 2023-01-02T10:01:17.004Z | 136304 | 156198 | 2023-01-02 10:01:17.004000+00:00 | 2023-01-02 | Monday | 10 |
| 2 | High School Teacher Coerced Teen Into Posing F... | 2012-10-13T14:25:56.000Z | 120303 | 128502 | 2012-10-13 14:25:56+00:00 | 2012-10-13 | Saturday | 14 |
| 3 | Dumping radioactive water in Hudson River is ‘... | 2023-02-17T17:00:00.000Z | 91291 | 105266 | 2023-02-17 17:00:00+00:00 | 2023-02-17 | Friday | 17 |
| 4 | Norovirus, a gross stomach bug, appears to be ... | 2023-02-24T11:01:00.000Z | 72397 | 83759 | 2023-02-24 11:01:00+00:00 | 2023-02-24 | Friday | 11 |
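
One thing to keep in mind: the timestamps in this example end in `Z`, so the hours above are in UTC rather than local time. The sketch below shows the conversion using only the standard library. The fixed UTC-5 offset is a simplifying assumption standing in for US Eastern standard time. It ignores daylight saving, which a real time zone such as `zoneinfo.ZoneInfo("America/New_York")` would handle.

```python
from datetime import datetime, timedelta, timezone


def parse_publish_timestamp(value):
    """Parse a timestamp like 2023-02-05T12:01:00.000Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


# Fixed-offset stand-in for US Eastern standard time (assumption: no DST).
EASTERN_STANDARD = timezone(timedelta(hours=-5))

published = parse_publish_timestamp("2023-02-05T12:01:00.000Z")
local = published.astimezone(EASTERN_STANDARD)
print(published.hour, local.hour)
# 12 7
```

In Pandas, the equivalent would be to call `.dt.tz_convert(...)` on the `dateTime` column before extracting the day and hour.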


I went on to use this setup to analyze our top stories by publish date and time, among other things.

## Next steps
I hope this walkthrough saves you some Googling while getting started with the GA4 API.

Do you have any feedback? Email me at sam@samguzik.com.