
TAnas0/Gitcoin_Metamorphosis

 
 


Gitcoin Metamorphosis Hackathon

Calling all data scientists and data analysts! The Honest Protocol is sponsoring $15K in prizes in Gitcoin’s Metamorphosis Hackathon.


Analysis of suspicious domains

This project builds a dataset around ScamSniffer's list of domains (or any set of websites in general) in order to shed more light on scam operations.

This dataset could eventually help to:

  • Categorize the scams: NFTs, ICO, Exchange, Donation campaign, airdrop, seed phrase theft...
  • Find links between scam operations, for example domains that share the same IP addresses
  • Automatically extract further information about the scam: wallet addresses, email addresses, social media accounts...
  • Assist in reporting the scam: geolocate it, identify the jurisdiction it falls under, and identify third parties to report to (hosting providers, domain name registrars...)
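
As an illustration of the automatic extraction mentioned above, here is a minimal sketch that pulls wallet and email addresses out of page text with regular expressions. It is not taken from this repository: the function name `extract_indicators` and both patterns are assumptions, and the wallet pattern only covers Ethereum-style addresses (`0x` followed by 40 hex characters).

```python
import re

# Assumed patterns, for illustration only.
# Ethereum-style address: "0x" followed by exactly 40 hexadecimal characters.
ETH_ADDRESS_RE = re.compile(r"\b0x[a-fA-F0-9]{40}\b")
# Deliberately simple email matcher; it does not cover every valid address.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def extract_indicators(text: str) -> dict[str, list[str]]:
    """Return unique wallet and email addresses, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    wallets = list(dict.fromkeys(ETH_ADDRESS_RE.findall(text)))
    emails = list(dict.fromkeys(EMAIL_RE.findall(text)))
    return {"wallets": wallets, "emails": emails}
```

Other chains (Bitcoin, Solana...) would need their own patterns, and matches should be treated as candidates to verify rather than confirmed scam wallets.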

Data collected

The data that seems relevant to me, in rough order of importance:

  • Most common words in the website
  • Text contents of the website
  • IP addresses of the domain
  • [] JavaScript tags and external scripts
  • [] Subdomains and paths of the website
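
To make the first item concrete, the most common words of a page can be counted with the standard library. This is a sketch under assumptions, not the code in `main.py`: the helper name `most_common_words` and the tokenization rule (lowercase runs of ASCII letters) are illustrative choices.

```python
import re
from collections import Counter


def most_common_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first so "Free" and "free" are counted together,
    # then keep only runs of ASCII letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)
```

For example, `most_common_words("Claim your free airdrop. Free NFT airdrop, free mint!", 2)` returns `[("free", 3), ("airdrop", 2)]`.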

How to run locally

This section explains how to set up the repository and run the data extraction yourself. All outputs are saved in the output directory. There you will find the domains.csv file, which includes the following information:

  • Is the website up?
  • IP addresses associated with the website
  • Does the website have a robots.txt file? (It could be used later to identify paths, or to flag suspicious behaviour such as blocking search engine indexing)
  • Processed text contents of the website: text with punctuation and English stopwords removed
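
The text processing described in the last item could look roughly like the sketch below. It is an assumption about the approach, not the repository's implementation: the tiny `STOPWORDS` set is a stand-in for a full English stopword list (such as the one a library like NLTK provides), and `clean_text` is a hypothetical name.

```python
import string

# Stand-in stopword list for illustration; a real run would use a full English list.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "to", "of", "in", "is", "your", "for", "on"}
)

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str, stopwords: frozenset[str] = STOPWORDS) -> str:
    """Lowercase text, replace punctuation with spaces, and drop stopwords."""
    no_punct = text.lower().translate(_PUNCT_TO_SPACE)
    return " ".join(word for word in no_punct.split() if word not in stopwords)
```

With this stand-in list, `clean_text("Connect your wallet to claim the airdrop!")` gives `"connect wallet claim airdrop"`.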

The raw HTML of each website is also dumped into the output/htmls directory. It can be used for manual inspection, for archiving the suspicious website, or for avoiding a repeated HTTP request in the code.
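
One way to use the dumped HTML to avoid an HTTP request is a read-through cache, sketched below. The file naming scheme (`<domain>.html`), the plain `http://` scheme, and the helper names `cache_path` and `load_html` are assumptions for illustration; the actual layout of output/htmls may differ.

```python
from pathlib import Path
from urllib.request import urlopen


def cache_path(domain: str, html_dir: str = "output/htmls") -> Path:
    """Assumed naming scheme: one '<domain>.html' file per domain."""
    return Path(html_dir) / f"{domain}.html"


def load_html(domain: str, html_dir: str = "output/htmls", timeout: float = 10.0) -> str:
    """Return the cached HTML if present; otherwise fetch it once and cache it."""
    path = cache_path(domain, html_dir)
    if path.exists():
        # Cache hit: no network request is made.
        return path.read_text(encoding="utf-8", errors="replace")
    with urlopen(f"http://{domain}", timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Since scam sites are often taken down quickly, the cached copy may be the only remaining record of a page, which is also why the dump doubles as an archive.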

# Download the list of suspicious domains
wget https://raw.githubusercontent.com/scamsniffer/scam-database/main/blacklist/domains.json

# Install Python dependencies
pip install -r requirements.txt

# Run the script
python main.py

Further improvements

  • Save to JSON instead of CSV (better suited to arrays of IPs, most common words...)
  • Lemmatise text contents
  • Scrape websites that require JavaScript to be enabled
  • Download the JS files referenced by the website and scan them for further information (API calls, wallet addresses...)
  • Check the reputation of the IP addresses
  • Retrieve the history of the domain from the Wayback Machine (Internet Archive)
  • List technologies the website is built with using Wappalyzer
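
As a sketch of the first improvement, list-valued fields such as IP addresses and common words map naturally onto JSON arrays, whereas a CSV cell would need an ad hoc encoding. The record fields below (`domain`, `is_up`, `ips`, `has_robots_txt`, `common_words`) are hypothetical names mirroring the columns described earlier, not the actual schema of domains.csv, and the IP address comes from a range reserved for documentation.

```python
import json
from pathlib import Path


def save_records(records: list[dict], path: str) -> None:
    """Write one JSON array of per-domain records, keeping lists as JSON arrays."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")


def load_records(path: str) -> list[dict]:
    """Read the records back; lists survive the round trip unchanged."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical record shape, for illustration only.
example_record = {
    "domain": "example.test",
    "is_up": True,
    "ips": ["203.0.113.10"],
    "has_robots_txt": False,
    "common_words": [["free", 3], ["airdrop", 2]],
}
```

Calling `save_records([example_record], "output/domains.json")` would write the file next to the existing CSV output.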
