# 1. Analyzing PyPI package downloads

This section covers how to use the PyPI package dataset to learn more about downloads of a package (or packages) hosted on PyPI. For example, you can use it to discover the distribution of Python versions used to download a package.

**Datasets**
* https://bigquery.cloud.google.com/dataset/the-psf:pypi

In [4]:
from google.cloud.bigquery import magics
from google.oauth2 import service_account

# The key.json file downloaded for the service account
credentials = service_account.Credentials.from_service_account_file('key.json')
magics.context.credentials = credentials
# Project id
magics.context.project = 'default-demo-app-d177'

In [5]:
%load_ext google.cloud.bigquery

## Counting package downloads

The following query counts the total number of downloads for the project “pytest”.

In [6]:
%%bigquery
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

Unnamed: 0,num_downloads
0,10262360
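
The `` _TABLE_SUFFIX `` filter above compares each daily table's suffix with two `` YYYYMMDD `` strings. As a rough illustration only, the same bounds can be computed in plain Python. `` table_suffix_range `` is a hypothetical helper written for this note, not part of any BigQuery client library.

```python
from datetime import date, timedelta
from typing import Tuple


def table_suffix_range(today: date, days: int = 30) -> Tuple[str, str]:
    """Return the (start, end) suffix strings covering the last `days` days."""
    start = today - timedelta(days=days)
    # Zero-padded YYYYMMDD strings sort in the same order as the dates,
    # which is why a plain string BETWEEN works on the suffix.
    return start.strftime("%Y%m%d"), today.strftime("%Y%m%d")


start, end = table_suffix_range(date(2019, 3, 31))
print(start, end)  # 20190301 20190331
```

Because `` BETWEEN `` is inclusive on both ends, a 30-day interval spans 31 daily tables.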


To count only downloads made with pip, filter on the `` details.installer.name `` column.

In [7]:
%%bigquery
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  AND details.installer.name = 'pip'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

Unnamed: 0,num_downloads
0,9673168


## Package downloads over time

To group downloads by month, use the first six characters (`` YYYYMM ``) of the `` _TABLE_SUFFIX `` pseudo-column. Filtering on the same pseudo-column also limits which tables are queried, and therefore the cost of the query.

In [8]:
%%bigquery
SELECT
  COUNT(*) AS num_downloads,
  SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
FROM `the-psf.pypi.downloads*`
WHERE
  file.project = 'pytest'
  -- Only query the last 6 months of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY `month`
ORDER BY `month` DESC

Unnamed: 0,num_downloads,month
0,7279245,201903
1,8833017,201902
2,9039863,201901
3,6135212,201812
4,7376321,201811
5,8871884,201810
6,6044538,201809


# 2. Data Driven Decisions Using PyPI Download Statistics

## Python Versions

In the original scenario we were curious if we could drop support for Python 2.6. What does that query look like? For the following examples we’ll be using cryptography since that’s my primary project and the one I run queries on most commonly.



In [17]:
%%bigquery --use_legacy_sql
SELECT
  REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") as python_version,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'cryptography'
GROUP BY
  python_version,
ORDER BY
  download_count DESC
LIMIT 100

Unnamed: 0,python_version,download_count
0,2.7,7952944
1,3.6,3340582
2,3.5,1230706
3,3.7,1172481
4,3.4,286225
5,,279882
6,2.6,76483
7,3.8,2601
8,3.3,389
9,3.2,65


As you can see, Python 2.6 makes up 76,483 of the 14,342,358 downloads listed above for the past 30 days, or roughly 0.5%. (The original analysis, run in 2016, saw 109,134 out of 2,707,632, roughly 4%.) Is that low enough to drop support?

`` null `` (the blank `` python_version `` row) makes up approximately 2% of the downloads above. These are downloads from PyPI made with clients that do not send the statistics we’re querying against, such as an older version of `` pip `` or an alternate client. The original analysis also saw 341 downloads from 1.17, which is…who knows! When making maintenance decisions, factor in these unknowns as you feel appropriate.
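
Shares like these are simple ratios of the counts in the query output. The sketch below recomputes them from the ten rows shown above; the counts are copied from that output, and the `` "null" `` key stands in for the blank `` python_version `` row.

```python
# Download counts copied from the query output above.
counts = {
    "2.7": 7952944,
    "3.6": 3340582,
    "3.5": 1230706,
    "3.7": 1172481,
    "3.4": 286225,
    "null": 279882,  # the blank python_version row
    "2.6": 76483,
    "3.8": 2601,
    "3.3": 389,
    "3.2": 65,
}

total = sum(counts.values())
# Percentage of all listed downloads attributed to each version.
shares = {version: 100 * n / total for version, n in counts.items()}

for version, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{version:>4}  {pct:6.2f}%")
```

Only the rows shown are included, so the percentages are relative to those rows rather than to every download of the package.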

## OpenSSL versions

`` cryptography `` supports a wide variety of OpenSSL versions. However, supporting 0.9.8 and 1.0.0 is a significant challenge, since they are missing many of the features we need (and are no longer supported upstream). Let’s craft a query to see which versions of OpenSSL are in use:

In [18]:
%%bigquery --use_legacy_sql
SELECT
  details.system.name,
  REGEXP_EXTRACT(details.openssl_version, r"^OpenSSL ([^ ]+) ") as openssl_version,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  details.openssl_version IS NOT NULL
GROUP BY
  details.system.name,
  openssl_version,
HAVING
  download_count >= 100
ORDER BY
  download_count DESC
LIMIT 100

Unnamed: 0,details_system_name,openssl_version,download_count
0,Linux,1.0.2g,597455067
1,Linux,1.0.2k-fips,365898383
2,Linux,1.1.0j,340292063
3,Linux,1.0.1f,149848214
4,Linux,1.1.0g,123137787
5,Linux,1.1.0f,115554361
6,Linux,1.0.1t,83532157
7,Linux,1.1.1a,57303159
8,Linux,,46991845
9,Linux,1.0.1e-fips,32690563


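While I haven’t provided the entire set of results, in the original analysis fewer than 100,000 downloads out of 210,063,137 were made using OpenSSL 1.0.0. (Those figures come from the original analysis; the counts in the output above are from a later run. Note that the query as written has no `` file.project `` filter, so it counts downloads of every project.) 0.9.8 holds a much greater share, but only due to Darwin (aka macOS…aka OS X). In cryptography’s case we statically link wheels on Mac and Windows, so we can ignore the OpenSSL version on those platforms. Looks like dropping 0.9.8 and 1.0.0 is probably safe!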
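
To see what the `` REGEXP_EXTRACT `` pattern in the query keeps, here is a rough Python equivalent. The sample strings are illustrative assumptions about what clients report, not rows taken from the dataset, and `` extract_openssl_version `` is a hypothetical helper. For a pattern this simple, Python's `` re `` module and BigQuery's regular expressions should behave the same way.

```python
import re
from typing import Optional

# Same pattern as the query: the token between "OpenSSL " and the next space.
OPENSSL_RE = re.compile(r"^OpenSSL ([^ ]+) ")


def extract_openssl_version(raw: Optional[str]) -> Optional[str]:
    """Return the version token, or None when the string does not match."""
    if raw is None:
        return None
    match = OPENSSL_RE.search(raw)
    return match.group(1) if match else None


# Illustrative inputs (assumed formats).
samples = [
    "OpenSSL 1.0.2g  1 Mar 2016",
    "OpenSSL 1.0.2k-fips  26 Jan 2017",
    "LibreSSL 2.6.5",
]
for raw in samples:
    print(repr(raw), "->", extract_openssl_version(raw))
```

Strings that do not start with `` OpenSSL `` yield no match, which is one plausible explanation for the Linux row with a blank `` openssl_version `` in the output above.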

## Most Popular Projects

Maybe you just want to know how popular your package is relative to others in the past 30 days.

In [19]:
%%bigquery --use_legacy_sql
SELECT
  file.project,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
GROUP BY
  file.project
ORDER BY
  total_downloads DESC
LIMIT 100


Unnamed: 0,file_project,total_downloads
0,urllib3,66275331
1,pip,60009468
2,six,55657795
3,botocore,50393304
4,s3transfer,46279104
5,python-dateutil,45917377
6,requests,43505516
7,certifi,39985490
8,pyasn1,39363621
9,pyyaml,38830196
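
The queries in this section use legacy SQL (`` --use_legacy_sql `` with `` TABLE_DATE_RANGE ``), while section 1 uses standard SQL with `` _TABLE_SUFFIX ``. As a sketch only, the popularity query might be expressed in standard SQL as below. `` popular_projects_query `` is a hypothetical helper that just builds the query text, and the result has not been run against the dataset here.

```python
def popular_projects_query(days: int = 30, limit: int = 100) -> str:
    """Build a standard SQL version of the most-popular-projects query."""
    # Coerce to int so only numbers are ever interpolated into the SQL text.
    days, limit = int(days), int(limit)
    return f"""
SELECT
  file.project AS project,
  COUNT(*) AS total_downloads
FROM `the-psf.pypi.downloads*`
WHERE _TABLE_SUFFIX
  BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL {days + 1} DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY project
ORDER BY total_downloads DESC
LIMIT {limit}
""".strip()


print(popular_projects_query())
```

The date bounds mirror the legacy query's range of 31 days ago through yesterday. The printed text can be pasted into a `` %%bigquery `` cell without the `` --use_legacy_sql `` flag.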


**Source**
* https://packaging.python.org/guides/analyzing-pypi-package-downloads/
* https://langui.sh/2016/12/09/data-driven-decisions/
* https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql