This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

Commit ad9134a: add to model readme (soodoku committed Aug 6, 2020). Showing 1 changed file with 9 additions and 1 deletion: `pydomains/models/README.md`.

We use an LSTM to estimate the relationship between the characters in a domain name and the category of content it hosts. We first break the domain name into common bi-chars (character bigrams) and then learn patterns in the sequences of those bi-chars.
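The bi-char step above can be sketched as follows. This is a minimal illustration of splitting a domain name into overlapping character bigrams; the repo's actual tokenizer and its vocabulary of "common" bi-chars may differ.

```python
def bichars(domain: str) -> list[str]:
    """Split a domain name into overlapping character bigrams (bi-chars)."""
    s = domain.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

print(bichars("paypal.com"))
# ['pa', 'ay', 'yp', 'pa', 'al', 'l.', '.c', 'co', 'om']
```

In the full pipeline, each bi-char would then be mapped to an integer index and the resulting sequence fed to the LSTM.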

### Performance

For model performance and a comparison to Random Forest and SVC models, see the relevant notebooks and the [EPS images of the ROC curves](./pydomains/models/roc).

### Calibration

We also checked whether the predicted probabilities are calibrated. We find the LSTM to be well calibrated. The notebooks are posted [here](./pydomains/models/calibration/).
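A calibration check of the kind described above can be sketched with scikit-learn's `calibration_curve`. The arrays here are synthetic stand-ins for held-out labels and the LSTM's predicted probabilities (constructed to be calibrated); the repo's notebooks use the real model outputs.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Placeholder scores; labels drawn so that P(y=1 | score) == score,
# i.e. perfectly calibrated by construction.
y_prob = rng.uniform(size=1000)
y_true = (rng.uniform(size=1000) < y_prob).astype(int)

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

# For a well-calibrated model, the fraction of positives in each bin
# tracks the mean predicted probability in that bin.
print(np.abs(frac_pos - mean_pred).max())
```

Plotting `frac_pos` against `mean_pred` gives the usual reliability diagram, with a well-calibrated model hugging the diagonal.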

### Shallalist

In Shallalist, a few domains are assigned to multiple categories. We ignore those and only look at domain names assigned to a single category. The other issue with Shallalist is that some categories don't have many domains. To learn models with high accuracy and recall, we subset to categories with more than 1,000 unique domain names. We also take out categories where the recall is < .3, which suggests there is little systematic pattern to the domain names (at least of the kind our model can detect). This leaves us with 30 categories; we consign the rest of the domains to the 'other' category.
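The filtering steps above can be sketched in pandas. The column names and the toy data are assumptions for illustration (the real threshold is 1,000 unique domains, and the recall-based filter would additionally need per-category recall from a fitted model):

```python
import pandas as pd

# Toy stand-in for the Shallalist domain -> category assignments.
df = pd.DataFrame({
    "domain":   ["a.com", "a.com", "b.com", "c.com", "d.com"],
    "category": ["news", "shopping", "news", "news", "adult"],
})

# 1. Drop domains assigned to more than one category.
single = df.groupby("domain")["category"].nunique().eq(1)
df = df[df["domain"].isin(single[single].index)].copy()

# 2. Keep categories with more than `min_domains` unique domains;
#    everything else becomes 'other'. (1_000 in practice; 1 here so
#    the toy data survives.)
min_domains = 1
counts = df.groupby("category")["domain"].nunique()
keep = counts[counts > min_domains].index
df["category"] = df["category"].where(df["category"].isin(keep), "other")

print(df)
```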

### Phish

Phishing URLs are crafted to mimic the URLs of popular sites. For instance, there are a bunch that have the word 'paypal' in them. So rather than use Common Crawl, we use the Alexa top 1M domains as the source of 'legitimate' domains. In particular, we take 50,000 unique domains from PhishTank (2016-2017) and pair them with the 50,000 most visited domains from the Alexa 1M list.
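Assembling that training set amounts to concatenating the two sources with opposite labels. A minimal sketch, with toy lists standing in for the 50,000-domain PhishTank and Alexa files (the real code would read and deduplicate the full lists):

```python
import pandas as pd

phishtank = ["paypa1-login.com", "secure-paypal.verify.net"]  # placeholder for PhishTank domains
alexa_top = ["google.com", "youtube.com"]                     # placeholder for top Alexa domains

# Label phishing domains 1 and 'legitimate' Alexa domains 0.
data = pd.concat([
    pd.DataFrame({"domain": pd.Series(phishtank).drop_duplicates(), "label": 1}),
    pd.DataFrame({"domain": alexa_top, "label": 0}),
], ignore_index=True)

print(data)
```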

### Malware

