Project Sonar produces an SSL dataset every week. The data is gathered by first performing a TCP SYN scan of the entire IPv4 address space on port 443, then running a collection script against every system that returns a positive response. The results are compared against the previous scan, and any new entries (hosts, names, or certificates) are uploaded to scans.io. For SSL certificates on ports other than 443, please see the More SSL dataset.
The data consists of three files per week: certs, hosts, and names. All three are gzip-compressed CSVs.
The certs file contains a SHA1 hash of the X509 certificate followed by the base64-encoded X509 certificate itself. The hosts file contains an IP address and the SHA1 hash of the certificate that was found on that IP. If multiple certificates are found on a host, their hashes are listed in the order they were seen. It is common for an SSL/TLS server to provide multiple certificates in the response, typically the server's certificate, followed by a certificate authority's intermediate ("glue") certificate, and finally the root certificate. The names file contains a SHA1 hash of the certificate followed by the Common Name or one of the SubjectAltName entries. It is common for a single certificate to have many names associated with it.
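As a minimal sketch of the certs file format described above, the following Python reads rows of (SHA1, base64 DER) pairs and verifies that the hash column matches the decoded certificate bytes. The demonstration row uses synthetic bytes, not a real certificate:

```python
import base64
import csv
import hashlib
import io

def parse_certs_rows(fileobj):
    """Yield (sha1_hex, der_bytes) pairs from a certs CSV stream.

    Each row is: SHA1 hash of the certificate, base64-encoded DER certificate.
    """
    for sha1_hex, cert_b64 in csv.reader(fileobj):
        der = base64.b64decode(cert_b64)
        # Sanity check: the first column should be the SHA1 of the DER bytes.
        assert hashlib.sha1(der).hexdigest() == sha1_hex
        yield sha1_hex, der

# Demonstration with a synthetic row (not real certificate data):
fake_der = b"\x30\x82\x01\x0a-not-a-real-certificate"
row = "%s,%s\n" % (hashlib.sha1(fake_der).hexdigest(),
                   base64.b64encode(fake_der).decode())
for sha1_hex, der in parse_certs_rows(io.StringIO(row)):
    print(sha1_hex, der == fake_der)
```

In practice the second column would be passed to an X509 parser (e.g. the `cryptography` package) to extract subject, issuer, validity dates, and so on.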
Due to the incremental nature of published data, it is necessary to process all historical data files in order to obtain a complete picture of the latest scan. A reasonable approach is to download all data files and process them sequentially, loading the certs, hosts, and names into separate database tables. The X509 certificates will need to be parsed and possibly stored as multiple fields within the certs table. The date of the scan (represented by the file name) should be stored as a column within each table.
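The loading step above can be sketched with SQLite and the standard library. The table schemas and file names here are illustrative assumptions, not the dataset's official layout; the scan date would normally be parsed out of each downloaded file name:

```python
import csv
import gzip
import os
import sqlite3
import tempfile

def load_week(db, scan_date, certs_path, hosts_path, names_path):
    """Load one week's certs/hosts/names files into SQLite,
    tagging every row with the scan date taken from the file name."""
    cur = db.cursor()
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS certs (sha1 TEXT, raw TEXT, scan_date TEXT);
        CREATE TABLE IF NOT EXISTS hosts (ip TEXT, sha1 TEXT, scan_date TEXT);
        CREATE TABLE IF NOT EXISTS names (sha1 TEXT, name TEXT, scan_date TEXT);
    """)
    for table, path in (("certs", certs_path),
                        ("hosts", hosts_path),
                        ("names", names_path)):
        with gzip.open(path, "rt") as f:
            cur.executemany("INSERT INTO %s VALUES (?, ?, ?)" % table,
                            ((a, b, scan_date) for a, b in csv.reader(f)))
    db.commit()

# Demonstration with tiny synthetic files in a temp directory:
tmp = tempfile.mkdtemp()

def write_gz(name, rows):
    path = os.path.join(tmp, name)
    with gzip.open(path, "wt") as f:
        csv.writer(f).writerows(rows)
    return path

certs = write_gz("certs.gz", [("abc123", "AAAA")])
hosts = write_gz("hosts.gz", [("192.0.2.1", "abc123")])
names = write_gz("names.gz", [("abc123", "example.com")])
db = sqlite3.connect(":memory:")
load_week(db, "20141028", certs, hosts, names)
print(db.execute("SELECT COUNT(*) FROM hosts").fetchone()[0])
```

Processing the weekly files in chronological order preserves the incremental semantics: later rows supersede or extend earlier ones for the same host or certificate.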
Once all of the bulk data has been loaded in the correct order, it becomes easy to determine which certificates and names correspond to which IP addresses and vice versa. Depending on available memory and storage speed, it may make sense to create join tables or simply add indexes on the SHA1 columns.
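For example, with indexes on the SHA1 columns, mapping an IP address to the names on its certificates is a single hash join. This is a self-contained sketch with assumed table and index names, seeded with one synthetic row:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hosts (ip TEXT, sha1 TEXT, scan_date TEXT);
    CREATE TABLE names (sha1 TEXT, name TEXT, scan_date TEXT);
    -- Indexing the SHA1 columns makes hash-based lookups practical
    -- without materializing separate join tables.
    CREATE INDEX idx_hosts_sha1 ON hosts(sha1);
    CREATE INDEX idx_names_sha1 ON names(sha1);
    INSERT INTO hosts VALUES ('192.0.2.1', 'abc123', '20141028');
    INSERT INTO names VALUES ('abc123', 'example.com', '20141028');
""")

# Which names were served from a given IP?
rows = db.execute("""
    SELECT h.ip, n.name
    FROM hosts h JOIN names n ON h.sha1 = n.sha1
    WHERE h.ip = '192.0.2.1'
""").fetchall()
print(rows)
```

The reverse lookup (name to IPs) is the same join filtered on `n.name` instead, which is why indexing both sides of the SHA1 relationship pays off.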
The incremental data format is time-intensive to set up, but it is much faster to keep updated afterwards, since only the relatively small weekly data files need to be processed rather than the complete raw dataset.