A security system that uses machine learning to analyze and block malicious URLs in real time
flowchart LR
A([URL input]) --> B[/Feature extraction\nfeature_engineering.py/]
B --> C1[URL length calculation]
B --> C2[Special character analysis]
B --> C3[Domain analysis]
B --> C4[Keyword detection]
C1 & C2 & C3 & C4 --> D[(25-feature vector)]
D --> E[XGBoost model\ntrain_model.py]
E --> F{Prediction result}
F -->|Benign| G[allow]
F -->|Warning| H[alert]
F -->|Danger| I[block]
E --> K[Performance evaluation\nmodel_evaluation.png]
style A fill:#4CAF50,color:#fff
style E fill:#FF6600,color:#fff
style F fill:#2196F3,color:#fff
style G fill:#4CAF50,color:#fff
style H fill:#FF9800,color:#fff
style I fill:#F44336,color:#fff
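The last branch of the flowchart turns the model output into one of three verdicts (`allow`, `alert`, `block`). The sketch below shows one way such a mapping could look. The 0-100 score scale follows the example response later in this README, but the two thresholds and the function name `verdict_from_probability` are hypothetical, since the actual cut-offs are not stated here.

```python
# Hypothetical cut-offs on a 0-100 score; the real thresholds are not given in this README.
ALERT_THRESHOLD = 50
BLOCK_THRESHOLD = 80


def verdict_from_probability(p_malicious: float) -> dict:
    """Map a predicted probability of 'malicious' to a score and a verdict."""
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("p_malicious must be between 0 and 1")
    score = round(p_malicious * 100)
    if score >= BLOCK_THRESHOLD:
        verdict = "block"
    elif score >= ALERT_THRESHOLD:
        verdict = "alert"
    else:
        verdict = "allow"
    return {"score": score, "verdict": verdict}


for p in (0.10, 0.60, 0.99):
    print(p, verdict_from_probability(p))
```

Lowering `BLOCK_THRESHOLD` trades more false positives for fewer missed malicious URLs, so the values would need tuning against the evaluation results.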
IS_NetShield/
├── src/
│   ├── feature_engineering.py   # URL → feature extraction
│   ├── train_model.py           # XGBoost training and evaluation
│   └── api_server.py            # FastAPI server (planned)
├── data_mal/
│   ├── malicious_phish.csv      # Mixed benign/malicious URL dataset (Kaggle)
│   └── online-valid.csv         # Live phishing URLs (PhishTank, for comparative validation)
├── model/
│   └── xgb_model.pkl            # Trained model
├── results/
│   └── model_evaluation.png     # Visualization of model evaluation results
├── .gitignore
└── README.md
pip install xgboost scikit-learn pandas numpy matplotlib seaborn requests

[1] Kaggle - malicious_phish.csv:
https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
Columns: url, type (benign / phishing / malware / defacement)
[2] PhishTank (Cisco Talos), CC BY-SA 2.5
python train_model.py
Dataset: 651,191 URLs (Kaggle Malicious URL Dataset)
Test data: 130,239 URLs (20% of the total)
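As a quick check on the two figures above, the sketch below reproduces the test-set size from the total. It assumes the split rounds the test size up, which is how scikit-learn's `train_test_split` behaves; whether `train_model.py` actually uses that function is not stated here.

```python
import math

TOTAL_ROWS = 651_191   # dataset size stated above
TEST_FRACTION = 0.20   # "20% of the total"

# Assumption: the test size is rounded up, as scikit-learn's train_test_split does.
n_test = math.ceil(TOTAL_ROWS * TEST_FRACTION)
n_train = TOTAL_ROWS - n_test

print(f"test rows:  {n_test}")
print(f"train rows: {n_train}")
```

Under that assumption the model is fitted on the remaining 520,952 rows and evaluated on 130,239.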
| Metric | Value |
|---|---|
| Accuracy | 0.9408 |
| Precision | 0.9006 |
| Recall | 0.9296 |
| F1 Score | 0.9149 |
| ROC-AUC | 0.9860 |
| False Positive Rate | 0.0534 |
| | Predicted benign | Predicted malicious |
|---|---|---|
| Actual benign | 81,045 (TN) | 4,576 (FP) |
| Actual malicious | 3,140 (FN) | 41,478 (TP) |

- False positives (benign → malicious): 4,576 (5.3%)
- False negatives (malicious → benign): 3,140 (7.0%)
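The reported metrics can be re-derived from the four confusion-matrix counts, treating "malicious" as the positive class. The helper name `rates` is only for illustration.

```python
# Confusion-matrix counts from the table above (positive class = malicious).
tn, fp, fn, tp = 81_045, 4_576, 3_140, 41_478


def rates(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the standard binary-classification metrics from raw counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


metrics = rates(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The four counts sum to 130,239, the size of the test set, and the derived values round to the figures in the metrics table.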
| Rank | Feature | Meaning |
|---|---|---|
| 1 | domain_length | Malicious URLs tend to have longer domains |
| 2 | has_www | No www, with unusual subdomains used instead |
| 3 | subdomain_depth | The deeper the subdomain nesting, the more suspicious |
| 4 | path_depth | The more complex the path, the more suspicious |
| 5 | tld_risk | Use of high-risk TLDs such as .tk and .xyz |
| Category | Features |
|---|---|
| URL length | url_length, domain_length, path_length, query_length |
| Special characters | count_dots, count_hyphens, count_at, count_percent, etc. |
| Domain | subdomain_depth, has_ip_address, tld_risk |
| Protocol | is_https |
| Keywords | has_phishing_keyword, has_brand_keyword |
| Patterns | has_typosquatting, has_double_slash |
| Statistics | url_entropy, digit_ratio, path_depth |
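To make the feature table concrete, here is a minimal sketch of how some of the listed features could be computed with the Python standard library. It is not the contents of `feature_engineering.py`: the `HIGH_RISK_TLDS` set, the "labels minus two" rule for `subdomain_depth`, and the exact definitions of each feature are assumptions made for illustration. The keyword and typosquatting features are omitted because their word lists are not given.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical list; the README only names .tk and .xyz as examples.
HIGH_RISK_TLDS = {"tk", "xyz"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def extract_features(url: str) -> dict:
    """Compute a subset of the features in the table (assumed definitions)."""
    parts = urlsplit(url)  # note: the host is only parsed when a scheme is present
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "path_length": len(parts.path),
        "query_length": len(parts.query),
        "count_dots": url.count("."),
        "count_hyphens": url.count("-"),
        "count_at": url.count("@"),
        "count_percent": url.count("%"),
        # Naive rule: everything left of "name.tld" counts as subdomain depth.
        "subdomain_depth": 0 if is_ip else max(len(labels) - 2, 0),
        "has_ip_address": int(is_ip),
        "tld_risk": int(bool(labels) and not is_ip and labels[-1] in HIGH_RISK_TLDS),
        "is_https": int(parts.scheme == "https"),
        "url_entropy": shannon_entropy(url),
        "digit_ratio": sum(ch.isdigit() for ch in url) / len(url) if url else 0.0,
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
    }


features = extract_features("http://paypa1-secure.xyz/login/verify")
for name, value in features.items():
    print(f"{name}: {value}")
```

The naive subdomain rule miscounts multi-part suffixes such as `.co.uk`; a public-suffix-aware parser would be needed to handle those correctly.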
| Model | Characteristics | Type |
|---|---|---|
| Our model | XGBoost + 25 engineered features, local inference | Local ML |
| Google Safe Browsing | Industry standard, free API | Cloud API |
| VirusTotal | Ensemble of 70 engines, used as ground truth | Cloud API |
A FastAPI-based REST API server is planned to provide a real-time URL detection service.
# Install packages
pip install fastapi uvicorn
# Run the server
uvicorn api_server:app --reload --host 0.0.0.0 --port 8000

| Method | Path | Description |
|---|---|---|
| POST | /analyze | Analyze a single URL |
| POST | /analyze/batch | Analyze multiple URLs in one request (up to 100) |
| GET | /health | Check server status |
| GET | /stats | Retrieve model information |
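Since the server is still planned, the client sketch below only builds the HTTP requests for the two POST endpoints and does not send them. The request body shapes (`{"url": ...}` and `{"urls": [...]}`) and the helper names are assumptions. Only the paths and the 100-URL batch limit come from the table above.

```python
import json
from urllib.request import Request

MAX_BATCH_SIZE = 100  # batch limit stated in the endpoint table


def _json_post(endpoint: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_analyze_request(base_url: str, url: str) -> Request:
    # Assumed body schema: {"url": "<target URL>"}
    return _json_post(f"{base_url}/analyze", {"url": url})


def build_batch_request(base_url: str, urls: list) -> Request:
    # Assumed body schema: {"urls": ["<target URL>", ...]}
    if not 1 <= len(urls) <= MAX_BATCH_SIZE:
        raise ValueError(f"batch must contain 1 to {MAX_BATCH_SIZE} URLs")
    return _json_post(f"{base_url}/analyze/batch", {"urls": list(urls)})


req = build_analyze_request(
    "http://localhost:8000", "http://paypa1-secure.xyz/login/verify"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Once a server is running, a request built this way could be sent with `urllib.request.urlopen(req)` and the JSON body of the response decoded with `json.load`.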
{
"url": "http://paypa1-secure.xyz/login/verify",
"score": 99,
"verdict": "block",
"label": "Danger",
"reasons": ["Contains phishing keyword", "High-risk TLD domain", "Unencrypted HTTP"],
"response_time_ms": 12.4,
"timestamp": "2026-04-03T18:00:00"
}

The plan is to build an actual security perimeter by combining EC2, ALB, and WAF.
User / simulated attacker
↓
Route 53 (DNS)
↓
ALB (load balancer)
↓
AWS WAF (first-stage rule-based blocking)
↓
EC2 detection engine (FastAPI + XGBoost)
↓
S3 (log storage) + CloudWatch (monitoring)
| Service | Role |
|---|---|
| EC2 | Hosts the FastAPI server and the XGBoost model |
| ALB | Traffic distribution and HTTPS handling |
| AWS WAF | IP blocking and first-pass filtering of known malicious patterns |
| S3 | Storage for detection logs and model artifacts |
| CloudWatch | Real-time monitoring and alarms |
Stage 1: ML model training and evaluation (completed)
Stage 2: FastAPI server build (planned)
Stage 3: AWS deployment (EC2+ALB+WAF) (planned)
Stage 4: React dashboard UI (planned)
Security project | Information Security Class