<a href="https://colab.research.google.com/github/qkrdudwls/Bible-vs-Quran/blob/main/DM_LSH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Mining Project**


## Overview
- Crawl the full text of the Bible and the Quran in a single pass
- Compute their similarity by applying the shingling, min-hashing, and LSH algorithms

## Set up


### Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Install Selenium & install chromedriver on Google Drive

In [None]:
!pip install selenium
!apt-get update

!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver '/content/drive/MyDrive/Colab Notebooks'
!pip install chromedriver-autoinstaller

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-chromedriver is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.

### Install Spark

In [None]:
!pip install pyspark
!pip install -U -q PyDrive2
!apt install openjdk-8-jdk-headless -qq

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u432-ga~us1-0ubuntu2~22.04).
0 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.


### Install Datasketch

In [None]:
!pip install datasketch

import datasketch
print(f"Datasketch {datasketch.__version__}")

Datasketch 1.6.5


#### Check versions

In [None]:
!python --version

import selenium
import pyspark
import datasketch

print(f"Selenium {selenium.__version__}")
print(f"PySpark {pyspark.__version__}")
print(f"Datasketch {datasketch.__version__}")

Python 3.10.12
Selenium 4.26.1
PySpark 3.5.3
Datasketch 1.6.5


### Import library

In [None]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import sys
from selenium.webdriver.common.keys import Keys
import urllib.request
import os
from urllib.request import urlretrieve

import time
import pandas as pd
import chromedriver_autoinstaller
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import ArrayType, StringType
from datasketch import MinHash, MinHashLSH

import matplotlib.pyplot as plt
import numpy as np
import random
from collections import Counter
import math

### Configure chrome_options

In [None]:
# Path to the chromedriver copied to Google Drive; adjust to your configuration
chrome_path = "/content/drive/MyDrive/Colab Notebooks/chromedriver"

sys.path.insert(0, chrome_path)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # run without a GUI
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')  # avoid /dev/shm size limits in containers

# Install a chromedriver that matches the installed Chrome version
chromedriver_autoinstaller.install()

## Scraping

#### URL

In [None]:
bible_url = 'https://www.gutenberg.org/cache/epub/10/pg10-images.html'
quran_url = 'https://www.gutenberg.org/cache/epub/2800/pg2800-images.html'

#### Text preprocessing function
- Lowercase the text and remove every character that is not an English letter or whitespace

In [None]:
def process_text(text):
    processed_text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    return ' '.join(processed_text.split())
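A note on the character class used in the preprocessing step: a range written `A-z` spans ASCII 65 to 122, which also covers `[`, `\`, `]`, `^`, `_` and the backtick, whereas `A-Z` covers letters only. The sketch below compares the two on a made-up sample string; `clean_text` is a hypothetical stand-in for `process_text`, not part of the notebook.

```python
import re

def clean_text(text):
    # Hypothetical helper mirroring process_text: lowercase, replace anything
    # that is not an ASCII letter or whitespace with a space, collapse spaces.
    cleaned = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    return ' '.join(cleaned.split())

# Made-up sample containing digits, punctuation, brackets and an underscore
sample = "1:1 In the beginning [God] created the heaven_and the earth."

strict = clean_text(sample)
# Same pipeline, but with the wider A-z range in the character class
loose = ' '.join(re.sub(r'[^a-zA-z\s]', ' ', sample.lower()).split())

print(strict)
print(loose)
```

With the wider range, the brackets and the underscore survive cleaning and would end up inside shingles.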

### Bible
- Save the scraped text to bible.txt

In [None]:
bible=[]
bible_driver = webdriver.Chrome(options=chrome_options)
bible_driver.get(bible_url)
wait = WebDriverWait(bible_driver, 10)  # wait up to 10 seconds

# Extract all text inside every div with class="chapter"
try:
    chapter_divs = bible_driver.find_elements(By.CLASS_NAME, "chapter")
    for index, chapter_div in enumerate(chapter_divs, start=1):
        bible_text = chapter_div.get_attribute("textContent").strip()
        print(f"Extracted Text from div #{index}:\n{bible_text}\n")

        # Preprocess
        bible_processed_text=process_text(bible_text)

        bible.append(('bible',bible_processed_text))
except Exception as e:
    print("Error:", e)

# Quit the driver
bible_driver.quit()

# Save to a text file
with open("bible.txt", "w", encoding="utf-8") as f:
    for item in bible:
        f.write(f"{item[1]}\n\n")

Streaming output truncated to the last 5000 lines.

3:18 The grace of our Lord Jesus Christ be with you all. Amen.

Extracted Text from div #56:
The First Epistle of Paul the Apostle to Timothy

1:1 Paul, an apostle of Jesus Christ by the commandment of God our Saviour, and
Lord Jesus Christ, which is our hope; 1:2 Unto Timothy, my own son in the
faith: Grace, mercy, and peace, from God our Father and Jesus Christ our Lord.


1:3 As I besought thee to abide still at Ephesus, when I went into Macedonia,
that thou mightest charge some that they teach no other doctrine, 1:4 Neither
give heed to fables and endless genealogies, which minister questions, rather
than godly edifying which is in faith: so do.


1:5 Now the end of the commandment is charity out of a pure heart, and of a
good conscience, and of faith unfeigned: 1:6 From which some having swerved
have turned aside unto vain jangling; 1:7 Desiring to be teachers of the law;
understanding neither what they say, nor whereof they affir

### Quran
- Save the scraped text to quran.txt

In [None]:
quran=[]
start_collecting = False
quran_driver = webdriver.Chrome(options=chrome_options)
quran_driver.get(quran_url)
wait = WebDriverWait(quran_driver, 10)  # wait up to 10 seconds

# Collect all text from the first h4 tag up to <div id="pg-end-separator">
for element in quran_driver.find_elements(By.XPATH, "//*"):
    if element.tag_name == "h4":
        start_collecting = True
        continue
    if element.get_attribute("id") == "pg-end-separator":
        break
    if start_collecting and element.tag_name != "h4":
        text = element.text.strip()
        # Preprocess
        if text:
            quran_processed_text = process_text(text)
            if quran_processed_text:
                quran.append(quran_processed_text)

# Merge all text into a single string
filtered_text = "\n".join(quran)

# Print the result
print(filtered_text)

# Quit the driver
quran_driver.quit()
# Save the result to quran.txt
with open("quran.txt", "w", encoding="utf-8") as file:
    file.write(filtered_text)

mecca verses
in the name of god the compassionate the merciful
recite thou in the name of thy lord who created
created man from clots of blood
recite thou for thy lord is the most beneficent
who hath taught the use of the pen
hath taught man that which he knoweth not
nay verily man is insolent
because he seeth himself possessed of riches
verily to thy lord is the return of all
what thinkest thou of him that holdeth back
a servant of god when he prayeth
what thinkest thou hath he followed the true guidance or enjoined piety
what thinkest thou hath he treated the truth as a lie and turned his back
what doth he not know how that god seeth
nay verily if he desist not we shall seize him by the forelock
the lying sinful forelock
then let him summon his associates
we too will summon the guards of hell
nay obey him not but adore and draw nigh to god
the word sura occurs nine times in the koran viz sur ix xxiv xlvii twice ii x but it is not easy to determine whether it means a whole chapter or 

## Similarity

#### Spark

In [None]:
# Create the Spark session
spark = SparkSession.builder.appName("Calculate Similarity of Bible and Quran").getOrCreate()

### Shingling & Min-Hashing & LSH
- Apply the shingling, min-hashing, and LSH algorithms to find similar sentences in the Bible and Quran text data

#### Shingling

In [None]:
# Shingle length in characters
n = 5

def create_shingles(file_path, n):
    # Read the file with the same encoding it was written with
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read().replace('\n', ' ')

    # Build the set of character-level shingles
    shingles = set()
    for i in range(len(text) - n + 1):
        shingles.add(text[i:i+n])

    return shingles

# Apply shingling to the Bible and the Quran
bible_shingles = create_shingles('/content/bible.txt', n)
quran_shingles = create_shingles('/content/quran.txt', n)

# Check the results
print("Number of Bible shingles:", len(bible_shingles))
print("Number of Quran shingles:", len(quran_shingles))

Number of Bible shingles: 80285
Number of Quran shingles: 61484
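To make the shingling step concrete, here is a minimal sketch on a short string rather than the full files. It builds character-level shingles the same way `create_shingles` does; the helper name `char_shingles` and the sample phrase are illustrative assumptions.

```python
def char_shingles(text, n):
    # All contiguous substrings of length n (hypothetical helper name)
    return {text[i:i + n] for i in range(len(text) - n + 1)}

demo = char_shingles("in the name", 5)
for shingle in sorted(demo):
    print(repr(shingle))

# A text shorter than n yields no shingles at all
too_short = char_shingles("abc", 5)
print(len(demo), len(too_short))
```

An 11-character string produces 11 - 5 + 1 = 7 windows. Because the result is a set, repeated windows in a long text are counted only once, which is why the shingle counts are far smaller than the text lengths.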


#### Min-Hashing

In [None]:
# Build the MinHash functions
num_hashes = 100  # number of hash functions
max_shingle_id = 2**32 - 1  # maximum shingle ID

# Draw the (a, b) coefficients once so that both documents are hashed with the
# same functions; signatures built from different functions are not comparable.
hash_params = [(random.randint(1, max_shingle_id), random.randint(1, max_shingle_id))
               for _ in range(num_hashes)]

def minhash(shingles, hash_params):
    min_hashes = []
    for a, b in hash_params:
        min_hash = min(((a * hash(shingle) + b) % max_shingle_id) for shingle in shingles)
        min_hashes.append(min_hash)
    return min_hashes

# Apply MinHash
bible_minhash = minhash(bible_shingles, hash_params)
quran_minhash = minhash(quran_shingles, hash_params)

# Check the results
print("Bible Minhash:", bible_minhash[:5])
print("Quran Minhash:", quran_minhash[:5])

Bible Minhash: [58470, 5970, 32507, 35742, 155227]
Quran Minhash: [76202, 37536, 27080, 271, 245737]
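The point of a MinHash signature is that, when two sets are hashed with the same hash functions, the fraction of signature positions that agree estimates their Jaccard similarity. The sketch below illustrates this on synthetic sets with a known Jaccard value of 0.5. It is an assumption-laden toy, not the notebook's implementation: `make_hash_params`, `signature`, and `estimate_jaccard` are hypothetical names, `zlib.crc32` stands in for Python's per-process `hash`, and the modulus is a prime just above 2**32.

```python
import random
import zlib

PRIME = 4294967311  # a prime just above 2**32 (assumed modulus for this sketch)

def make_hash_params(num_hashes, seed=0):
    # One shared list of (a, b) pairs, reused for every set
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def signature(shingles, params):
    # Map each shingle to an integer ID, then keep the minimum of each hash
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingles]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # Fraction of positions where the two signatures agree
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

# Synthetic sets: 100 shared items out of 200 distinct ones, so Jaccard = 0.5
set_a = {f"w{i}" for i in range(0, 150)}
set_b = {f"w{i}" for i in range(50, 200)}

params = make_hash_params(200)
est = estimate_jaccard(signature(set_a, params), signature(set_b, params))
same = estimate_jaccard(signature(set_a, params), signature(set_a, params))
print(est, same)
```

The estimate fluctuates around the true value, and the spread shrinks as the number of hash functions grows.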


#### LSH

In [None]:
def lsh_signature(minhash, num_bands, rows_per_band):
    assert num_bands * rows_per_band == len(minhash), "num_bands * rows_per_band must equal the signature length."
    signature_bands = []
    for i in range(num_bands):
        start = i * rows_per_band
        end = start + rows_per_band
        band = tuple(minhash[start:end])
        signature_bands.append(band)
    return signature_bands

# Band and row settings
num_bands = 20
rows_per_band = 5

# Build the LSH signatures
bible_lsh = lsh_signature(bible_minhash, num_bands, rows_per_band)
quran_lsh = lsh_signature(quran_minhash, num_bands, rows_per_band)

# Check the results (number of bands)
print("Number of Bible LSH bands:", len(bible_lsh))
print("Number of Quran LSH bands:", len(quran_lsh))

# Similarity check: count the bands that are identical at the same position
common_bands = [i for i, (b, q) in enumerate(zip(bible_lsh, quran_lsh)) if b == q]
print("Number of common bands:", len(common_bands))

Number of Bible LSH bands: 20
Number of Quran LSH bands: 20
Number of common bands: 0
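Whether two signatures share a band depends on the band layout. Under the standard banding analysis, two sets with Jaccard similarity s whose signatures come from the same hash functions match in at least one of b bands of r rows with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula for this notebook's setting (b = 20, r = 5); the function name is a hypothetical helper, and the value 0.4346765705958549 is the exact Jaccard similarity this notebook reports in its next section.

```python
def candidate_probability(s, bands, rows):
    # P(at least one band matches) = 1 - (1 - s**rows)**bands
    return 1 - (1 - s ** rows) ** bands

bands, rows = 20, 5
for s in (0.2, 0.4346765705958549, 0.6, 0.8):
    p = candidate_probability(s, bands, rows)
    print(f"s={s:.2f}  P(candidate)={p:.3f}")

# Rough similarity threshold where the S-curve rises most steeply
threshold = (1 / bands) ** (1 / rows)
print(f"approximate threshold: {threshold:.3f}")
```

With 20 bands of 5 rows the threshold sits near 0.55, so a pair with similarity around 0.43 has roughly a one-in-four chance of sharing a band. Seeing zero common bands is therefore unsurprising even when the documents overlap substantially.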


#### Compute Jaccard Similarity

In [None]:
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

jaccard_sim = jaccard_similarity(bible_shingles, quran_shingles)
print("Jaccard similarity:", jaccard_sim)

Jaccard similarity: 0.4346765705958549
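For intuition, the Jaccard similarity is the size of the intersection divided by the size of the union. The toy sketch below applies it to two small sets and adds a guard for the case where both sets are empty; returning 0.0 in that case is a convention assumed here, and `jaccard` is a hypothetical name separate from the notebook's `jaccard_similarity`.

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, with 0.0 returned when both sets are empty (assumed convention)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

left = {"in th", "n the", " the "}
right = {"n the", " the ", "the n"}

toy_sim = jaccard(left, right)  # 2 shared shingles out of 4 distinct ones
empty_sim = jaccard(set(), set())
print(toy_sim, empty_sim)
```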


#### Compute Cosine Similarity

In [None]:
def cosine_similarity(set1, set2):
    # The inputs are sets, so every count is 1 and the vectors are binary
    counter1 = Counter(set1)
    counter2 = Counter(set2)

    common_shingles = set1 & set2
    dot_product = sum(counter1[shingle] * counter2[shingle] for shingle in common_shingles)

    magnitude1 = math.sqrt(sum(value**2 for value in counter1.values()))
    magnitude2 = math.sqrt(sum(value**2 for value in counter2.values()))

    return dot_product / (magnitude1 * magnitude2)

cosine_sim = cosine_similarity(bible_shingles, quran_shingles)
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.6113574926671269
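Because the inputs are sets, every shingle has a count of 1, so the cosine similarity reduces to |A ∩ B| / sqrt(|A| · |B|). As a consistency check, the sketch below recovers the intersection size from the shingle counts and the Jaccard value reported earlier in this notebook, then recomputes the cosine value from it. The variable names are illustrative; the three input numbers are copied from the outputs above.

```python
import math

size_a = 80285                 # Bible shingle count reported above
size_b = 61484                 # Quran shingle count reported above
jaccard = 0.4346765705958549   # Jaccard similarity reported above

# J = I / (|A| + |B| - I)  =>  I = J * (|A| + |B|) / (1 + J)
intersection = round(jaccard * (size_a + size_b) / (1 + jaccard))

# Cosine similarity of two binary vectors
cosine = intersection / math.sqrt(size_a * size_b)
print(intersection, cosine)
```

The recomputed value agrees with the cosine similarity printed above, so the two measures are telling the same story about the overlap at different scales.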


In [None]:
spark.stop()