## Get data

Read package data from Anaconda and PyPI. The Anaconda package names come from the `repodata.json` of the `main` channel (`linux-64`):

```bash
curl https://repo.anaconda.com/pkgs/main/linux-64/repodata.json -o repodata.json
jq -r '.packages[].name' repodata.json | sort -u > anaconda.txt
```
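
The same extraction can be sketched in Python, which avoids the dependency on `jq`. This is a minimal sketch, assuming the usual `repodata.json` layout in which `packages` maps file names to records with a `name` field. The helper names `package_names` and `write_names` are hypothetical, and reading the `packages.conda` section is an assumption that goes beyond what the `jq` command above does.

```python
import json


def package_names(repodata):
    """Return the set of distinct package names in a parsed repodata.json."""
    names = set()
    # "packages" holds .tar.bz2 records; "packages.conda" (assumed, may be
    # absent) holds .conda records.
    for section in ("packages", "packages.conda"):
        for record in repodata.get(section, {}).values():
            names.add(record["name"])
    return names


def write_names(repodata_path, out_path):
    """Write one package name per line, sorted and de-duplicated."""
    with open(repodata_path) as f:
        names = package_names(json.load(f))
    with open(out_path, "w") as out:
        out.write("\n".join(sorted(names)) + "\n")
```

Calling `write_names('repodata.json', 'anaconda.txt')` would produce a file in the same one-name-per-line format that the notebook cells below read.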

Use [pypinfo](https://github.com/ofek/pypinfo) to query the 5000 most-downloaded PyPI projects in Google BigQuery. Follow the pypinfo directions to create a BigQuery project first. Note that the analysis below keeps only the top 1000 of these results.

```bash
pypinfo -l 5000 -j --days 365 "" project > popular-pypi-downloads.json
```

In [15]:
import json
from operator import itemgetter

# Load the Anaconda package names, one per line.
with open('anaconda.txt') as anaconda_data:
    anaconda_pkgs = {pkg.rstrip() for pkg in anaconda_data}

# Load the pypinfo results and keep only the top 1000 projects.
with open('popular-pypi-downloads.json') as pypi_data:
    pypi_json = json.load(pypi_data)
pypi_projects = pypi_json['rows']
pypi_top1k = pypi_projects[:1000]
pypi_pkgs = {project.get('project') for project in pypi_top1k}
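
Taking the first 1000 rows assumes that pypinfo returns them already ordered by download count. A small sketch that makes the ordering explicit is shown here; the helper name `top_projects` and the sample rows are hypothetical, while the `project` and `download_count` keys are the ones used in this notebook.

```python
from operator import itemgetter


def top_projects(rows, n=1000):
    """Return the n rows with the highest download_count, highest first."""
    ranked = sorted(rows, key=itemgetter("download_count"), reverse=True)
    return ranked[:n]


# Illustrative rows only, not real download figures.
rows = [
    {"project": "b", "download_count": 5},
    {"project": "a", "download_count": 9},
    {"project": "c", "download_count": 1},
]
print([row["project"] for row in top_projects(rows, n=2)])
```

With this helper, the top 1000 could be selected with `top_projects(pypi_json['rows'])` regardless of the order in the file.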

## Compare the difference

In [16]:
pkg_diff = sorted(pypi_pkgs.difference(anaconda_pkgs))
pkg_intersect = sorted(pypi_pkgs.intersection(anaconda_pkgs))
print('Number of PyPI packages that are not available in Anaconda:\n {}'.format(len(pkg_diff)))
# print('\nPyPI packages not in Anaconda:\n')
# for pkg in pkg_diff:
#     print(pkg)
print('\nNumber of PyPI packages that are available in Anaconda:\n {}'.format(len(pkg_intersect)))
# print('PyPI packages in Anaconda:\n')
# for pkg in pkg_intersect:
#     print(pkg)

Number of PyPI packages that are not available in Anaconda:
 613

Number of PyPI packages that are available in Anaconda:
 387
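
The set comparison above matches names exactly, so a project whose name is spelled with different case or separators on the two sides (for example `_` versus `-`) is counted as missing. One possible refinement, sketched here as an assumption rather than something this notebook does, is to normalize both sides with the PEP 503 rule before comparing. The helper names are hypothetical, and whether this changes the counts above has not been checked.

```python
import re


def normalize(name):
    """Normalize a project name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_by_availability(pypi_names, conda_names):
    """Return (missing, available) PyPI names after normalizing both sides."""
    conda_normalized = {normalize(n) for n in conda_names}
    missing = sorted(n for n in pypi_names if normalize(n) not in conda_normalized)
    available = sorted(n for n in pypi_names if normalize(n) in conda_normalized)
    return missing, available
```

It could be applied as `split_by_availability(pypi_pkgs, anaconda_pkgs)` in place of the two set operations.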


## List of PyPI packages not in Anaconda ordered by download count

In [18]:
# Look up the full pypinfo rows for the missing and the shared packages.
missing_names = set(pkg_diff)
missing_pkg_list = [prj for prj in pypi_top1k if prj.get('project') in missing_names]

by_download_counts = sorted(missing_pkg_list, key=itemgetter('download_count'), reverse=True)

shared_names = set(pkg_intersect)
intersect_pkg_list = [prj for prj in pypi_top1k if prj.get('project') in shared_names]
            
# for download in by_download_counts:
#     print(download)

## Calculate total downloads

In [25]:
# Sum the download counts for each group of projects.
download_total = sum(pkg.get('download_count') for pkg in pypi_top1k)
missing_total = sum(pkg.get('download_count') for pkg in missing_pkg_list)
intersect_total = sum(pkg.get('download_count') for pkg in intersect_pkg_list)
    
# Total PyPI downloads across all projects (hard-coded).
pypi_total = 15561303313

print('Total PyPI downloads: {}\n'.format(pypi_total))
print('Total top 1K downloads: {}\n'.format(download_total))
print('Total missing pkg downloads: {}\n'.format(missing_total))
print('Total number of PyPI package downloads with corresponding Anaconda packages: {}\n'.format(intersect_total))
print('Top 1K PyPI packages as percentage of total downloads: {}%\n'.format(round(download_total / pypi_total * 100)))
print('Percentage of top 1K PyPI downloads without corresponding Anaconda packages: {}%\n'.format(round(missing_total / download_total * 100)))
print('Percentage of all PyPI downloads without corresponding Anaconda packages: {}%\n'.format(round(missing_total / pypi_total * 100)))
print('Percentage of all PyPI downloads with corresponding Anaconda packages: {}%\n'.format(round(intersect_total / pypi_total * 100)))


Total PyPI downloads: 15561303313

Total top 1K downloads: 14403311239

Total missing pkg downloads: 3378293970

Total number of PyPI package downloads with corresponding Anaconda packages: 11025017269

Top 1K PyPI packages as percentage of total downloads: 93%

Percentage of top 1K PyPI downloads without corresponding Anaconda packages: 23%

Percentage of all PyPI downloads without corresponding Anaconda packages: 22%

Percentage of all PyPI downloads with corresponding Anaconda packages: 71%
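
Each percentage depends on which denominator is used: the top 1K total or the total for all of PyPI. The sketch below recomputes the shares from the totals printed above so the two denominators can be compared side by side. The helper name `pct` is hypothetical; the four figures are copied from the output.

```python
def pct(part, whole):
    """Return part as a whole-number percentage of whole."""
    return round(part / whole * 100)


# Figures copied from the output above.
pypi_total = 15561303313
top1k_total = 14403311239
missing_total = 3378293970
intersect_total = 11025017269

# The missing and available groups partition the top 1K, so their totals add up.
assert missing_total + intersect_total == top1k_total

print(pct(top1k_total, pypi_total))       # top 1K share of all downloads: 93
print(pct(missing_total, top1k_total))    # missing share of top 1K: 23
print(pct(missing_total, pypi_total))     # missing share of all downloads: 22
print(pct(intersect_total, pypi_total))   # available share of all downloads: 71
print(pct(intersect_total, top1k_total))  # available share of top 1K: 77
```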



## Get conda-forge data

In [26]:
from bs4 import BeautifulSoup
import requests

url = 'http://conda-forge.org/feedstocks/'
cf_feedstocks = requests.get(url)
feedstocks = set()
soup = BeautifulSoup(cf_feedstocks.content, 'html.parser')
for item in soup.find_all('li', attrs={'class':'list-group-item'}):
    feedstocks.add(item.find('a').contents[0].strip())

# for fs in feedstocks:
#     print(fs)
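
The parsing step can be tried offline on a small HTML string. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so it is not the code used above. It assumes the same page structure that the cell above relies on, namely feedstock names as link text inside `<li class="list-group-item">` elements. The names `FeedstockParser` and `parse_feedstocks` and the sample markup are hypothetical. Unlike the cell above, it collects every text chunk inside the link rather than only the first child.

```python
from html.parser import HTMLParser


class FeedstockParser(HTMLParser):
    """Collect the link text inside <li class="list-group-item"> elements."""

    def __init__(self):
        super().__init__()
        self.names = set()
        self._in_item = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "list-group-item" in classes:
            self._in_item = True
        elif tag == "a" and self._in_item:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a link within a list item.
        if self._in_link and data.strip():
            self.names.add(data.strip())


def parse_feedstocks(html):
    """Return the set of feedstock names found in an HTML string."""
    parser = FeedstockParser()
    parser.feed(html)
    return parser.names


# Made-up markup that mimics the assumed page structure.
sample_html = (
    '<ul>'
    '<li class="list-group-item"><a href="#"> numpy </a></li>'
    '<li class="other"><a href="#">skip</a></li>'
    '<li class="list-group-item"><a href="#">pandas</a></li>'
    '</ul>'
)
print(sorted(parse_feedstocks(sample_html)))
```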


## PyPI top 1K packages without a conda-forge feedstock

In [27]:
cf_pkg_diff = pypi_pkgs.difference(feedstocks)
print("PyPI top 1K packages without a conda-forge feedstock: {}\n".format(len(cf_pkg_diff)))
cf_diff_sorted = sorted(cf_pkg_diff)
for cf_pkg in cf_diff_sorted:
    print(cf_pkg)

PyPI top 1K packages without a conda-forge feedstock: 338

acme
analytics-python
antlr4-python3-runtime
aspy-yaml
awsebcli
azure-batch
azure-cli-acr
azure-cli-acs
azure-cli-advisor
azure-cli-appservice
azure-cli-backup
azure-cli-batch
azure-cli-batchai
azure-cli-billing
azure-cli-cdn
azure-cli-cloud
azure-cli-cognitiveservices
azure-cli-command-modules-nspkg
azure-cli-configure
azure-cli-consumption
azure-cli-container
azure-cli-cosmosdb
azure-cli-dla
azure-cli-dls
azure-cli-eventgrid
azure-cli-extension
azure-cli-feedback
azure-cli-find
azure-cli-interactive
azure-cli-iot
azure-cli-keyvault
azure-cli-lab
azure-cli-monitor
azure-cli-network
azure-cli-nspkg
azure-cli-profile
azure-cli-rdbms
azure-cli-redis
azure-cli-reservations
azure-cli-resource
azure-cli-role
azure-cli-servicefabric
azure-cli-sql
azure-cli-storage
azure-cli-vm
azure-cosmosdb-nspkg
azure-cosmosdb-table
azure-datalake-store
azure-eventgrid
azure-mgmt-advisor
azure-mgmt-applicationinsights
azure-mg