# PyPI spam identification by URL blacklist matching

*19 Feb 2018*

Following the [PyPI spam incident in February 2018](https://status.python.org/incidents/mgjw1g5yjy5j), this notebook tries to detect spam among the uploaded Python packages by matching the links found in their metadata against a list of blacklisted domain names.

The PyPI metadata (`pypi-metadata-2018-02-19.parq`) for the ~129000 Python packages used in this analysis is linked from the project [README.md](https://github.com/rth/pypi-stats-viz).

The hpHosts list of blacklisted domain names can be downloaded from https://hosts-file.net/?s=Download. For this analysis, the `hosts.txt` file should be extracted from `hosts.zip`. See https://hosts-file.net/?s=classifications for more details about the domain classification.

This notebook requires Python 3.5+. The dependencies can be installed with:
```
pip install numpy pandas dask toolz cloudpickle pyarrow urlextract
```

TL;DR: this approach flags 173 Python packages, most of which are false positives. So either PyPI currently contains almost no spam, or other approaches should be tried instead.

In [6]:
import pandas as pd
import numpy as np
from urllib.parse import urlparse
from pathlib import Path

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from urlextract import URLExtract

We start by loading the PyPI metadata and, for each package, extracting URLs from the `description`, `home_page` and `summary` fields:

In [7]:
df = dd.read_parquet(str(Path('..')/'data'/'pypi-metadata-2018-02-19.parq'), engine='pyarrow')

In [8]:
url_extractor = URLExtract()

with ProgressBar():
    # concatenate the three text fields into a single string per package
    df['text_all'] = (df.description + ' ' + df.home_page + ' ' + df.summary).astype('str')
    # declare the output as an object Series so that dask does not have to infer it
    df['extracted_urls'] = df.text_all.apply(url_extractor.find_urls,
                                             meta=('extracted_urls', 'object'))
    df_s = df[['name', 'extracted_urls']].compute()

[########################################] | 100% Completed | 58.6s


In [12]:
print('Number of processed PyPI packages: ', df_s.shape[0])

Number of processed PyPI packages:  128712


In [9]:
df_s.sample(10)

|  | name | extracted_urls |
|---|---|---|
| 5602 | dspam-milter | [https://travis-ci.org/whyscream/dspam-milter.... |
| 6336 | easyshop.core | [(http://www.demmelhuber.net/shop), http://www... |
| 2900 | gpssim | [https://bitbucket.org/wjiang/gpssim] |
| 10274 | nester-edvan | [http://www.headfirstlabs.com] |
| 7289 | imaprelay | [hostname=imap.exchange.megacorp.com, hostname... |
| 6586 | vice.plone.outbound | [http://dev.plone.org/collective/browser/vice.... |
| 1947 | DBSync | [http://github.com/zhoubangtao/dbsync] |
| 5476 | LabelLib | [https://github.com/Fluorescence-Tools/LabelLib] |
| 2239 | pyeasy | [] |
| 2816 | loop_lista2201 | [http://www.headfirstlabs.com] |


Next, we flatten this nested DataFrame to get a single (package, URL) pair per row.

*Note*: at the time of writing, there does not seem to be a simple vectorized way to do this in pandas, so we loop over the rows.

In [22]:
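res = []
for idx, row in df_s.iterrows():
    # skip rows where the URL list is missing (e.g. NaN instead of a list)
    if not hasattr(row['extracted_urls'], '__len__'):
        continue
    # emit one (package name, URL) record per extracted URL
    for url in row['extracted_urls']:
        res.append({'name': row['name'], 'url': url})

df_u = pd.DataFrame(res)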
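
As a possible alternative to the explicit loop, pandas 0.25 and later (released after this notebook was written) provide `DataFrame.explode`. The sketch below is not what the notebook ran; the `flatten_urls` helper and the toy data are hypothetical and only illustrate the idea on a frame shaped like `df_s`.

```python
import pandas as pd


def flatten_urls(frame):
    """Return one (name, url) row per extracted URL (hypothetical helper)."""
    return (
        frame[['name', 'extracted_urls']]
        .explode('extracted_urls')            # one row per list element
        .dropna(subset=['extracted_urls'])    # empty lists and missing values end up as NaN
        .rename(columns={'extracted_urls': 'url'})
        .reset_index(drop=True)
    )


# toy stand-in for df_s: a list of URLs, an empty list and a missing value
toy = pd.DataFrame({
    'name': ['pkg-a', 'pkg-b', 'pkg-c'],
    'extracted_urls': [['http://x.example', 'http://y.example'], [], float('nan')],
})
flat = flatten_urls(toy)
print(flat)
```

Packages without any extracted URL drop out of the result, which matches the behaviour of the loop.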

In [23]:
df_u.sample(10)

|  | name | url |
|---|---|---|
| 547848 | specio | https://travis-ci.org/paris-saclay-cds/specio.... |
| 335586 | moses | https://github.com/l04m33/moses.git |
| 154246 | django-cached-modelforms | models.py |
| 407768 | pantsbuild.pants.contrib.confluence | https://rbcommons.com/s/twitter/r/860 |
| 218735 | effective-distance | https://github.com/benmaier/effective-distance |
| 259259 | google-cloud-trace | https://cloud.google.com/trace |
| 453658 | py_sonicvisualiser | http://www.sonicvisualiser.org |
| 260156 | goprohero | https://github.com/joshvillbrandt/GoProController |
| 440255 | postmen | https://github.com/postmen/postmen-sdk-python/... |
| 489967 | python-lirc | [Debian](http://bugs.debian.org/cgi-bin/bugrep... |


We can then extract the domain name from each URL:

In [24]:
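def url_parse_netloc(url):
    """Return the network location (domain) of a URL, or NaN if it cannot be parsed."""
    try:
        return urlparse(url).netloc
    except ValueError:
        # urlparse raises ValueError on malformed input, e.g. an invalid IPv6 literal
        return float('nan')

df_u['url_netloc'] = df_u.url.apply(url_parse_netloc)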
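
To make the behaviour of `urlparse(...).netloc` concrete, here is a small standalone check on a few strings taken from the samples shown earlier in this notebook. Strings without a `scheme://` prefix, such as `models.py`, yield an empty network location, which is likely why the empty string ranks so high in the domain counts.

```python
from urllib.parse import urlparse

examples = [
    'https://github.com/l04m33/moses.git',
    'http://www.headfirstlabs.com',
    'models.py',                            # extracted as a "URL", but has no scheme
    'hostname=imap.exchange.megacorp.com',  # same: no scheme, so no netloc
]

# map each string to the domain that urlparse reports for it
netlocs = {url: urlparse(url).netloc for url in examples}
for url, netloc in netlocs.items():
    print(repr(url), '->', repr(netloc))
```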

For instance, the domains most frequently linked to from Python packages on PyPI are:

In [25]:
df_u.groupby('url_netloc').name.count().sort_values(ascending=False).head(20)

url_netloc
github.com                   162530
                             117585
pypi.python.org               31886
img.shields.io                28411
travis-ci.org                 27060
rbcommons.com                 12628
odoo-community.org            11313
coveralls.io                  10106
bitbucket.org                  7602
badge.fury.io                  5726
code.google.com                4722
readthedocs.org                4494
codecov.io                     4032
www.gnu.org                    3584
en.wikipedia.org               2888
pypip.in                       2810
raw.githubusercontent.com      2635
runbot.odoo-community.org      2229
docs.python.org                2096
landscape.io                   2015
Name: name, dtype: int64

Finally, we load the hpHosts hosts file:

In [27]:
blacklist = pd.read_csv('hosts.txt', skiprows=27, sep='\t', encoding='latin1', comment='#',
                        names=['localhost', 'domain'])
print('Domains in the blacklist: ', blacklist.shape[0])

Domains in the blacklist:  724042


We then merge it with the extracted URLs to find the links that point to blacklisted domains:

In [29]:
packages_flagged = (pd.merge(blacklist[['domain']], df_u,  how='inner', left_on='domain', right_on='url_netloc')
                      .drop_duplicates())
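
Note that this inner merge amounts to an exact string comparison between each URL's network location and the blacklist entries. The following standalone sketch mimics it with a plain set on made-up toy data (the package names and the two-entry blacklist are hypothetical), to show that `www.addthis.com` and `addthis.com` are treated as two different domains.

```python
from urllib.parse import urlparse

# toy blacklist and (package, url) pairs, for illustration only
blacklist_domains = {'addthis.com', 'openx.org'}
pairs = [
    ('pkg-a', 'http://openx.org'),
    ('pkg-b', 'https://www.addthis.com/bookmark.php'),
    ('pkg-c', 'https://github.com/rth/pypi-stats-viz'),
]

# keep only the pairs whose netloc is listed verbatim in the blacklist
flagged = [(name, url) for name, url in pairs
           if urlparse(url).netloc in blacklist_domains]
print(flagged)
```

Only `pkg-a` is flagged here: the `www.` variant would need its own blacklist entry to match.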

In [36]:
bad = packages_flagged.groupby(['name', 'domain']).url.apply(' , '.join).reset_index()

# some of the domains in the blacklist are OK in the context of PyPi
whitelist = ['mysite.com', 'addthis.com', 'www.addthis.com', 'www.academia.edu', 'www.google-analytics.com']
bad = bad[~bad.domain.isin(whitelist)]
print('Flagged %s PyPI packages' % bad.name.nunique())
bad.sample(5)

Flagged 173 PyPI packages


|  | name | domain | url |
|---|---|---|---|
| 103 | inlinestyler | www.campaignmonitor.com | http://www.campaignmonitor.com/css/ |
| 94 | github-release-notifier | acme.com | https://acme.com/updated |
| 82 | elektrika.openx | openx.org | http://openx.org |
| 147 | ns1cli | ns1.com | https://ns1.com |
| 122 | mailsnake | www.mailchimp.com | http://www.mailchimp.com/api/1.3/ |


In [37]:
bad.to_csv('bad.csv', sep='\t')

The full results are available at https://pastebin.com/DNP4D4uj. While a few of the flagged packages look dubious, most are false positives. So either PyPI currently contains almost no spam, or other approaches should be tried instead.