
---
## **Mini Project: Cosine Similarity**
---



### Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them. In text analysis, it is often used to compare documents based on their content. In general the score ranges from -1 to 1; for TF-IDF vectors, whose entries are all non-negative, it ranges from 0 to 1, where 0 means the documents share no weighted terms and 1 means their vectors point in exactly the same direction.
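
To make the definition concrete, here is a minimal sketch that computes the score directly from the formula cos(θ) = (a · b) / (‖a‖ ‖b‖) using only the standard library. The helper name `cosine_from_definition` and the sample vectors are illustrative assumptions, not part of the notebook, which relies on scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine_from_definition(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The cosine is undefined for a zero vector; returning 0.0 is a
        # convention assumed for this sketch.
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration
same_direction = cosine_from_definition([1, 2, 3], [2, 4, 6])   # close to 1
partial_overlap = cosine_from_definition([1, 1, 0], [1, 0, 1])  # close to 0.5
orthogonal = cosine_from_definition([1, 0], [0, 1])             # exactly 0.0
```

Scaling a vector does not change the score, which is why a long document and a short one with the same term proportions can still be rated as highly similar.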


import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Get the list of text files in the current directory (sorted for a stable order)
files = sorted(doc for doc in os.listdir() if doc.endswith('.txt'))

# Read the content of each text file
notes = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        notes.append(f.read())

# Vectorize the notes using TF-IDF
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(notes).toarray()

# Pair each file with its corresponding vector
file_vectors = list(zip(files, vectors))

# Set to store plagiarism results
result = set()

# Function to calculate similarity using cosine similarity
def similarity(doc1, doc2):
    return cosine_similarity([doc1, doc2])[0][1]

# Check plagiarism among the notes
def check_plagiarism():
    # Compare each unordered pair of files exactly once
    for i, (file_a, vector_a) in enumerate(file_vectors):
        for file_b, vector_b in file_vectors[i + 1:]:
            sim_score = similarity(vector_a, vector_b)
            file_pair = sorted((file_a, file_b))
            result.add((file_pair[0], file_pair[1], sim_score))
    return result

# Print the plagiarism results in a stable, sorted order
for data in sorted(check_plagiarism()):
    print("Similarity data:\n", data)


Similarity data:
 ('File1.txt', 'File2.txt', 0.1909077417163092)
Similarity data:
 ('File1.txt', 'File3.txt', 0.8465773472126983)
Similarity data:
 ('File2.txt', 'File3.txt', 0.42585024005265176)
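
The raw tuples are easier to act on once they are ranked and compared against a cutoff. The sketch below does that with the three scores copied from the output above. The threshold of 0.8 and the helper name `flag_similar` are assumptions made for illustration; the notebook itself does not define a cutoff, and a sensible value depends on the documents being compared.

```python
# Scores copied from the notebook output
results = {
    ('File1.txt', 'File2.txt', 0.1909077417163092),
    ('File1.txt', 'File3.txt', 0.8465773472126983),
    ('File2.txt', 'File3.txt', 0.42585024005265176),
}

THRESHOLD = 0.8  # hypothetical cutoff, not taken from the notebook

def flag_similar(results, threshold):
    """Return pairs scoring at or above the threshold, highest score first."""
    ranked = sorted(results, key=lambda item: item[2], reverse=True)
    return [(a, b, score) for a, b, score in ranked if score >= threshold]

flagged = flag_similar(results, THRESHOLD)
for file_a, file_b, score in flagged:
    print(f"{file_a} vs {file_b}: {score:.3f}")
```

With these scores and this cutoff, only the `File1.txt` / `File3.txt` pair is flagged. A high score indicates overlapping vocabulary, not proof of copying, so flagged pairs still need a manual check.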
