PLAGIARISM BETWEEN TWO TEXTS

In [7]:
# Import necessary libraries
import spacy  # Importing the SpaCy library for natural language processing
from sklearn.feature_extraction.text import TfidfVectorizer  # Importing TF-IDF vectorization from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity  # Importing cosine similarity computation from scikit-learn

# Load the SpaCy English language model once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

# Function to perform Named Entity Recognition (NER) using SpaCy
def perform_ner(text):
    doc = nlp(text)  # Process the input text using the loaded SpaCy model
    entities = [ent.text for ent in doc.ents]  # Extract named entities from the processed text
    return entities

# Function to check plagiarism between two documents using TF-IDF and cosine similarity
def check_plagiarism(doc1, doc2):
    tfidf_vectorizer = TfidfVectorizer()  # Initialize TF-IDF vectorizer
    tfidf_matrix = tfidf_vectorizer.fit_transform([doc1, doc2])  # Fit on both documents; each row of the matrix is one document's TF-IDF vector
    similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]  # Compute cosine similarity
    return similarity

# Function for text summarization (simple example: taking the first 100 characters)
def summarize_text(text):
    return text[:100]  # Taking the first 100 characters.

# Main execution block
if __name__ == "__main__":
    document1 = input("Enter Document 1:\n")  # User input for Document 1
    document2 = input("Enter Document 2:\n")  # User input for Document 2

    # Named Entity Recognition
    doc1_entities = perform_ner(document1)  # Extract named entities from Document 1
    doc2_entities = perform_ner(document2)  # Extract named entities from Document 2

    # Plagiarism Detection
    similarity = check_plagiarism(document1, document2)  # Compute plagiarism similarity between the two documents

    # Summarization
    summary_doc1 = summarize_text(document1)  # Summarize Document 1
    summary_doc2 = summarize_text(document2)  # Summarize Document 2

    # Display the results
    print("\nNamed Entities in Document 1:", doc1_entities)
    print("\nNamed Entities in Document 2:", doc2_entities)
    print("\nPlagiarism Similarity:", similarity)
    print("\nSummary of Document 1:", summary_doc1)
    print("\nSummary of Document 2:", summary_doc2)


Enter Document 1:
The discovery of penicillin by Alexander Fleming was a significant milestone in the history of medicine. Fleming made this groundbreaking discovery in 1928 when he noticed that a mold called Penicillium notatum produced a substance that killed a wide variety of bacteria. This discovery revolutionized the field of antibiotics and laid the foundation for the development of modern medicine.  In addition to his work on penicillin, Alexander Fleming made other important contributions to medical science. He studied the properties of lysozyme, an enzyme that has antibacterial properties, and he also investigated the use of antiseptics in wound treatment. Fleming's research and discoveries have had a lasting impact on the field of microbiology and medicine.
Enter Document 2:
Marie Curie, a pioneering physicist and chemist, is best known for her groundbreaking research on radioactivity. Born in 1867 in Poland, Curie became the first woman to win a Nobel Prize and remains the o
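
The similarity score returned by check_plagiarism above can be easier to interpret with the arithmetic spelled out. The following is a minimal standard-library sketch, not the notebook's implementation. The names tokenize and tfidf_cosine are hypothetical, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and the two-or-more-character token rule are assumed to mirror scikit-learn's TfidfVectorizer defaults, so small differences from the library's output are possible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep word tokens of two or more characters
    # (assumed to approximate TfidfVectorizer's default tokenization)
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_cosine(doc1, doc2):
    # Raw term counts for each document
    counts = [Counter(tokenize(doc1)), Counter(tokenize(doc2))]
    vocab = sorted(set(counts[0]) | set(counts[1]))
    n_docs = len(counts)

    # Build one TF-IDF vector per document over the shared vocabulary
    vectors = []
    for doc_counts in counts:
        vector = []
        for term in vocab:
            df = sum(1 for c in counts if term in c)  # document frequency
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
            vector.append(doc_counts[term] * idf)
        vectors.append(vector)

    # Cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(vectors[0], vectors[1]))
    norm1 = math.sqrt(sum(a * a for a in vectors[0]))
    norm2 = math.sqrt(sum(b * b for b in vectors[1]))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document shares nothing with the other one
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    print(tfidf_cosine("the cat sat on the mat", "the dog sat on the log"))
```

A score near 1.0 means the two texts have almost the same weighted word profile, and 0.0 means they share no terms. The measure reflects word overlap only, so a paraphrased passage can score low even when the ideas are copied.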

PLAGIARISM BETWEEN TWO PDFS

In [3]:
pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.23.15-cp310-none-manylinux2014_x86_64.whl (4.4 MB)
Collecting PyMuPDFb==1.23.9 (from PyMuPDF)
  Downloading PyMuPDFb-1.23.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.23.15 PyMuPDFb-1.23.9
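
Text extracted from PDFs with a library such as PyMuPDF often contains typographic ligatures (for example the single character "ﬁ" in place of "fi") and hard line breaks inside titles and names, which can fragment or distort named entities. A minimal cleanup sketch follows, assuming Unicode NFKC normalization and whitespace collapsing are acceptable for the comparison. The helper name normalize_pdf_text is hypothetical and is not part of PyMuPDF or SpaCy.

```python
import re
import unicodedata


def normalize_pdf_text(text):
    # NFKC replaces compatibility characters such as the "fi" ligature (U+FB01)
    # with their plain equivalents
    text = unicodedata.normalize("NFKC", text)
    # Collapse newlines, tabs, and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_pdf_text("Identi\ufb01cation of\nPlant-Leaf  Diseases"))
```

Applying such a helper to the extracted text before NER and similarity scoring may reduce entities that are split across lines. NFKC also rewrites other compatibility characters, such as superscripts and full-width forms, so it is worth checking on a sample of the actual PDFs first.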


In [6]:
# Install required libraries in the Colab environment (must run before the imports below)
!pip install PyMuPDF
!python -m spacy download en_core_web_sm

# Import necessary libraries
import spacy  # For natural language processing
from sklearn.feature_extraction.text import TfidfVectorizer  # For TF-IDF vectorization
from sklearn.metrics.pairwise import cosine_similarity  # For cosine similarity computation
import fitz  # PyMuPDF library for working with PDFs

# Function to extract text from a PDF file using PyMuPDF
def extract_text_from_pdf(pdf_path):
    # Open the PDF with a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Extract text from each page and concatenate
        return "".join(page.get_text() for page in doc)

# Load the SpaCy English language model once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

# Function for Named Entity Recognition using SpaCy
def perform_ner(text):
    doc = nlp(text)  # Process the text with the loaded model
    entities = [ent.text for ent in doc.ents]  # Extract named entities
    return entities

# Function for checking plagiarism using TF-IDF and cosine similarity
def check_plagiarism(text1, text2):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
    return similarity

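# Upload PDFs manually to Colab environment
from google.colab import files
uploaded = files.upload()

# Extract file paths of the uploaded PDFs (at least two are required; the first two are compared)
pdf_paths = list(uploaded.keys())
if len(pdf_paths) < 2:
    raise ValueError("Please upload two PDF files to compare.")
pdf_path1, pdf_path2 = pdf_paths[:2]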

# Extract text from PDFs
text1 = extract_text_from_pdf(pdf_path1)
text2 = extract_text_from_pdf(pdf_path2)

# Named Entity Recognition
doc1_entities = perform_ner(text1)
doc2_entities = perform_ner(text2)

# Plagiarism Detection
similarity = check_plagiarism(text1, text2)

# Display the results
print("\nNamed Entities in PDF 1:", doc1_entities)
print("\nNamed Entities in PDF 2:", doc2_entities)
print("\nPlagiarism Similarity (%):", similarity * 100)


2024-01-17 06:32:34.739315: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-17 06:32:34.739377: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-17 06:32:34.740734: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load

Saving Identification_of_Plant-Leaf_Diseases_Using_CNN_an.pdf to Identification_of_Plant-Leaf_Diseases_Using_CNN_an (1).pdf
Saving agriculture-12-01192.pdf to agriculture-12-01192 (1).pdf

Named Entities in PDF 1: ['Article\nIdentiﬁcation of Plant-Leaf Diseases Using', 'CNN', 'Transfer-Learning Approach\nSk Mahmudul Hassan 1', 'Arnab Kumar Maji 1', 'Michał', '2', 'Zbigniew Leonowicz 2', 'El˙zbieta', 'Jasi´nska', '3', 'Citation', 'Hassan', 'S.M.', 'Maji', 'A.K.', 'Leonowicz', 'Z.', 'Jasi´nska', 'CNN', 'Transfer-Learning Approach', '2021', '10, 1388', 'https://doi.org/10.3390/\nelectronics10121388\nAcademic Editor', 'Juan M. Corchado\nReceived', '24 May 2021', '8 June 2021', '9 June 2021', '2021', 'Licensee MDPI', 'Basel', 'Switzerland', 'the Creative Commons\nAttribution', 'creativecommons.org/licenses/by/', '4.0/', '1', 'Department of Information Technology', 'North Eastern Hill University', 'Shillong', 'Meghalaya 793022', 'India', '2', 'Department of Electrical Engineering Fundamental