More SSL Certificates
Project Sonar produces a "More SSL" dataset every week. This data is gathered by first performing a TCP SYN scan across the entire IPv4 address space for ports 25, 143, 465, 993, and 995. A collection script is then run against every system that returned a positive response. This scan includes all SSL data except for port 443 and handles STARTTLS as necessary. The data is then compared against the previous scan and any new entries (hosts, names, or certificates) are uploaded to scans.io.
The data consists of three files per week for each port and protocol: a certs file, an endpoints file, and a names file. All three are gzip-compressed CSVs.
The certs file contains a SHA1 hash of the X509 certificate followed by the base64-encoded X509 certificate itself.
The endpoints file contains an IP address and the SHA1 hash of the certificate that was found on that IP. If multiple certificates are found on an endpoint, their hashes are listed in the order they were seen. It is common for an SSL/TLS server to provide multiple certificates in the response, typically consisting of the server's certificate, followed by a certificate authority's intermediate ("glue") certificate, and finally the root certificate.
The names file contains a SHA1 hash of the certificate followed by the Common Name or one of the SubjectAltName entries. It is common for a single certificate to have many names associated with it.
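As a rough illustration of the three layouts described above, the following Python sketch reads gzip-compressed CSV data and builds simple lookups. The function names (`read_rows`, `parse_certs`, `group_pairs`, `make_gz`) are hypothetical, the column order is taken from the description above, and the sketch assumes the files have no header row. The sample rows are placeholder bytes rather than real certificates, and hashing the decoded bytes is an assumption about how the published SHA1 values are derived.

```python
import base64
import csv
import gzip
import hashlib
import io


def read_rows(gz_bytes):
    """Yield CSV rows from gzip-compressed bytes."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        yield from csv.reader(handle)


def parse_certs(gz_bytes):
    """Map each SHA1 hash to the certificate bytes decoded from base64."""
    return {sha1: base64.b64decode(b64) for sha1, b64 in read_rows(gz_bytes)}


def group_pairs(gz_bytes):
    """Group the second column by the first, preserving file order.

    Works for the endpoints file (IP -> hashes in the order seen) and
    for the names file (hash -> names).
    """
    grouped = {}
    for key, value in read_rows(gz_bytes):
        grouped.setdefault(key, []).append(value)
    return grouped


def make_gz(rows):
    """Build gzip-compressed CSV bytes from rows (used for sample data only)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return gzip.compress(buf.getvalue().encode("utf-8"))


# Placeholder "certificates": arbitrary bytes, not real X509 data.
leaf, ca = b"leaf-bytes", b"ca-bytes"
# Assumption: the hash is taken over the decoded certificate bytes.
leaf_sha1 = hashlib.sha1(leaf).hexdigest()
ca_sha1 = hashlib.sha1(ca).hexdigest()

certs_gz = make_gz([
    (leaf_sha1, base64.b64encode(leaf).decode("ascii")),
    (ca_sha1, base64.b64encode(ca).decode("ascii")),
])
endpoints_gz = make_gz([("192.0.2.10", leaf_sha1), ("192.0.2.10", ca_sha1)])
names_gz = make_gz([
    (leaf_sha1, "mail.example.com"),
    (leaf_sha1, "smtp.example.com"),
])

certs = parse_certs(certs_gz)
chains = group_pairs(endpoints_gz)
names = group_pairs(names_gz)
```

For real files, the same functions could be fed the bytes of a downloaded file, or `read_rows` could be adapted to call `gzip.open` on a path directly.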
Due to the incremental nature of published data, it is necessary to process all historical data files in order to obtain a complete picture of the latest scan. A reasonable approach is to download all data files and process them sequentially, loading the certs, endpoints, and names into separate database tables. The X509 certificates will need to be parsed and possibly stored as multiple fields within the certs table. The date of the scan (represented by the file name) should be stored as a column within each table.
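One possible way to do the sequential load is sketched below with Python's built-in `sqlite3` module. The table and column names, the `load_scan` helper, and the sample rows are all assumptions made for illustration, not part of the published data. The sketch stores the base64 certificate as-is; parsing it into separate fields is left out and would need an X509 library.

```python
import sqlite3

# Hypothetical schema: one table per file type, each with a scan_date column.
SCHEMA = """
CREATE TABLE IF NOT EXISTS certs (sha1 TEXT, cert_b64 TEXT, scan_date TEXT);
CREATE TABLE IF NOT EXISTS endpoints (ip TEXT, sha1 TEXT, scan_date TEXT);
CREATE TABLE IF NOT EXISTS names (sha1 TEXT, name TEXT, scan_date TEXT);
"""


def load_scan(conn, scan_date, certs, endpoints, names):
    """Insert one scan's rows, tagging every row with the scan date."""
    with conn:  # one transaction per scan
        conn.executemany(
            "INSERT INTO certs VALUES (?, ?, ?)",
            [(sha1, b64, scan_date) for sha1, b64 in certs],
        )
        conn.executemany(
            "INSERT INTO endpoints VALUES (?, ?, ?)",
            [(ip, sha1, scan_date) for ip, sha1 in endpoints],
        )
        conn.executemany(
            "INSERT INTO names VALUES (?, ?, ?)",
            [(sha1, name, scan_date) for sha1, name in names],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample data keyed by scan date (ISO dates sort chronologically).
scans = {
    "2018-01-08": {
        "certs": [("bbb", "QkJC")],
        "endpoints": [("192.0.2.20", "bbb")],
        "names": [("bbb", "imap.example.com")],
    },
    "2018-01-01": {
        "certs": [("aaa", "QUFB")],
        "endpoints": [("192.0.2.10", "aaa")],
        "names": [("aaa", "mail.example.com")],
    },
}

# Process the scans oldest first, as the incremental format requires.
for scan_date in sorted(scans):
    load_scan(conn, scan_date, **scans[scan_date])
```

Wrapping each scan in its own transaction means a failed load leaves earlier scans intact, so the run can be resumed from the scan that failed.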
Once all of the bulk data has been loaded in the correct order, it becomes easy to determine which certificates and names correspond to which IP addresses and vice versa. Depending on available memory and storage speed, it may make sense to create join tables or just add indexes to certain fields (SHA1).
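The lookups in both directions might look like the following sketch, which takes the index-based option rather than dedicated join tables. The schema, index names, helper functions, and sample rows are hypothetical and use short placeholder strings in place of real SHA1 hashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE endpoints (ip TEXT, sha1 TEXT, scan_date TEXT);
CREATE TABLE names (sha1 TEXT, name TEXT, scan_date TEXT);
-- Index the SHA1 columns used for joining, plus the lookup keys.
CREATE INDEX idx_endpoints_sha1 ON endpoints (sha1);
CREATE INDEX idx_endpoints_ip ON endpoints (ip);
CREATE INDEX idx_names_sha1 ON names (sha1);
CREATE INDEX idx_names_name ON names (name);
""")

# Made-up rows; "aaa" and "bbb" stand in for certificate hashes.
conn.executemany("INSERT INTO endpoints VALUES (?, ?, ?)", [
    ("192.0.2.10", "aaa", "2018-01-01"),
    ("192.0.2.10", "bbb", "2018-01-01"),
    ("192.0.2.20", "aaa", "2018-01-08"),
])
conn.executemany("INSERT INTO names VALUES (?, ?, ?)", [
    ("aaa", "mail.example.com", "2018-01-01"),
    ("aaa", "smtp.example.com", "2018-01-01"),
    ("bbb", "Example Intermediate CA", "2018-01-01"),
])


def names_for_ip(conn, ip):
    """Return every name on any certificate seen at the given IP."""
    query = """
        SELECT DISTINCT n.name
        FROM endpoints AS e
        JOIN names AS n ON n.sha1 = e.sha1
        WHERE e.ip = ?
        ORDER BY n.name
    """
    return [row[0] for row in conn.execute(query, (ip,))]


def ips_for_name(conn, name):
    """Return every IP that presented a certificate carrying the given name."""
    query = """
        SELECT DISTINCT e.ip
        FROM names AS n
        JOIN endpoints AS e ON e.sha1 = n.sha1
        WHERE n.name = ?
        ORDER BY e.ip
    """
    return [row[0] for row in conn.execute(query, (name,))]
```

If these joins turn out to be too slow on the full dataset, a precomputed join table keyed on IP and name is the alternative mentioned above, at the cost of extra storage.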
The incremental data format is time-intensive to set up, but becomes much faster to keep updated, as only the relatively small weekly data files need to be processed, as opposed to the complete raw dataset.
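A minimal sketch of the weekly update step follows, assuming a hypothetical `loaded_files` bookkeeping table and file names that begin with a sortable date (the real naming scheme may differ). The `load_file` callback stands in for whatever routine actually parses and inserts a file.

```python
import sqlite3


def pending_files(conn, available):
    """Return the files not yet loaded, in name order (oldest first)."""
    conn.execute("CREATE TABLE IF NOT EXISTS loaded_files (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM loaded_files")}
    return sorted(name for name in available if name not in done)


def update(conn, available, load_file):
    """Load each pending file and record it so later runs skip it."""
    loaded = []
    for name in pending_files(conn, available):
        with conn:  # the load and the bookkeeping row commit together
            load_file(conn, name)
            conn.execute("INSERT INTO loaded_files VALUES (?)", (name,))
        loaded.append(name)
    return loaded


conn = sqlite3.connect(":memory:")
seen = []


def record(conn, name):
    """Stand-in loader that only records which files it was given."""
    seen.append(name)


# Hypothetical file names with a sortable date prefix.
first = update(conn, ["20180108_certs.gz", "20180101_certs.gz"], record)
second = update(
    conn,
    ["20180101_certs.gz", "20180108_certs.gz", "20180115_certs.gz"],
    record,
)
```

After the initial bulk load, each later run touches only the files that have appeared since the previous run.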