# Jaccard Similarity Project
#### Compare the sentences in the paragraph below using Jaccard Similarity. 

### Use this paragraph:

Jaccard similarity compares shared elements over total unique elements in two sets. Cosine similarity measures the angle between two vectors in high-dimensional space. Jaccard is best for binary features like word presence or absence. Cosine works well with weighted vectors like TF-IDF in NLP tasks.
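As a side note to the paragraph above (not part of the text to be compared), here is a minimal sketch of the two measures it describes. The function names `jaccard` and `cosine` and the toy documents are illustrative assumptions, and raw term counts stand in for the TF-IDF weights the paragraph mentions.

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Shared elements over total unique elements; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy documents, not taken from the assignment paragraph.
doc_a = "word word presence".split()
doc_b = "word absence".split()

print(jaccard(set(doc_a), set(doc_b)))         # 1 shared token out of 3 unique
print(cosine(Counter(doc_a), Counter(doc_b)))  # repeated words change the result
```

Jaccard only sees whether a token is present, so the repeated `word` in `doc_a` has no effect on it, while the cosine score does change with the counts.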

### To do this, perform the following steps:

HINT: You may re-use any of the code in this week's notebook to perform these tasks.

#### 1. Preprocess the sentences, including lowercasing, lemmatisation, and stop word removal.

In [1]:
! pip install nltk pandas matplotlib scikit-learn


In [2]:
# Text Preprocessing - Your Code Here. 
# Use multiple cells as needed.
import nltk
from nltk.corpus import stopwords

text = "Jaccard similarity compares shared elements over total unique elements in two sets. Cosine similarity measures the angle between two vectors in high-dimensional space. Jaccard is best for binary features like word presence or absence. Cosine works well with weighted vectors like TF-IDF in NLP tasks."
sentences = text.split('.')
sentence_dict = {}

nltk.download('stopwords')
nltk.download('wordnet')

# Get the English stop words
stop_words = set(stopwords.words('english'))

# Create lemmatizer
lemmatizer = nltk.WordNetLemmatizer()

for i, sentence in enumerate(sentences):
    if sentence.strip(): 
        # Convert to lowercase and split into words
        words = sentence.strip().lower().split()
        
        # Remove stop words
        words = [word for word in words if word not in stop_words]

        # Lemmatize words
        words = [lemmatizer.lemmatize(word) for word in words]
        
        # Join words back into sentence
        sentence_dict[f'doc{i+1}'] = ' '.join(words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rewheaton/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rewheaton/nltk_data...
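
The cell above tokenizes with `split()`, which is enough for this paragraph because its only punctuation is the sentence-final periods (removed by `split('.')`) and hyphens inside terms. For text that contains commas or parentheses, punctuation would stay attached to the words. A minimal regex-based sketch of an alternative tokenizer follows; `TOKEN_RE` and `tokenize` are hypothetical names, and the pattern is one possible choice, not the notebook's method.

```python
import re

# Matches runs of letters/digits, keeping internal hyphens (e.g. "tf-idf").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize(sentence: str) -> list[str]:
    """Lowercase a sentence and return its tokens without punctuation."""
    return TOKEN_RE.findall(sentence.lower())

# Hypothetical input with extra punctuation added for illustration.
print(tokenize("Cosine works well, with TF-IDF (in NLP tasks)!"))
# ['cosine', 'works', 'well', 'with', 'tf-idf', 'in', 'nlp', 'tasks']
```

Stop word removal and lemmatisation would then run on the returned list exactly as in the cell above.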


#### 2. Apply Jaccard similarity to the vectors.

In [14]:
# Jaccard similarity - Your Code Here.
# Use multiple cells as needed.
import pandas as pd

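def get_jaccard_similarity(doc1, doc2):
    # Treat each preprocessed document as a set of unique tokens
    set1 = set(doc1.split())
    set2 = set(doc2.split())
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    # Two empty documents would divide by zero; return 0.0 by convention
    if not union:
        return 0.0
    return len(intersection) / len(union)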

jaccard_matrix = []

for doc1 in sentence_dict:
    row = {key: 1.0 for key in sentence_dict}
    row['target'] = doc1
    for doc2 in sentence_dict:
        if doc1 != doc2:
            row[doc2] = get_jaccard_similarity(sentence_dict[doc1], sentence_dict[doc2])
            
    jaccard_matrix.append(row)

jaccard_matrix_df = pd.DataFrame(jaccard_matrix).set_index('target')
jaccard_matrix_df


|target|doc1|doc2|doc3|doc4|
|:--|:---|:---|:---|:---|
|doc1|1.0|0.133333|0.0625|0.0|
|doc2|0.133333|1.0|0.0|0.133333|
|doc3|0.0625|0.0|1.0|0.0625|
|doc4|0.0|0.133333|0.0625|1.0|
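
The non-zero scores in the matrix can be sanity-checked by hand. The sketch below assumes the pairs scoring 0.133333 share 2 tokens out of 15 unique ones and the pairs scoring 0.0625 share 1 out of 16. These counts are inferred from the matrix values (they are the only small-integer ratios that fit), not read from the notebook's intermediate output.

```python
from fractions import Fraction

# Assumed (shared, unique) token counts, inferred from the matrix values.
assumed_counts = {
    ("doc1", "doc2"): (2, 15),
    ("doc2", "doc4"): (2, 15),
    ("doc1", "doc3"): (1, 16),
    ("doc3", "doc4"): (1, 16),
}

for pair, (shared, unique) in assumed_counts.items():
    score = Fraction(shared, unique)
    # Show the exact ratio next to the rounded value pandas displays
    print(pair, score, round(float(score), 6))
```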


#### 3. Identify which 2 of the 4 documents are the MOST similar to each other.
Your answer here:

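There is a tie: Document 1 and Document 2, and Document 2 and Document 4, are the most similar pairs, each with a Jaccard similarity of 0.133333. Document 1 and Document 4 have a similarity of zero, which seems odd given that both sentences describe a similarity measure; after preprocessing they simply share no tokens.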

#### 4. Identify which 2 of the 4 documents are the LEAST similar to each other.
Your answer here:

Another tie: Document 1 and Document 4, as well as Document 2 and Document 3, have similarity scores of zero.
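
The ties in the two answers above can also be found programmatically instead of by reading the matrix. This is a minimal sketch that copies the off-diagonal scores from the matrix into a plain dictionary; the names `scores` and `pairs_with` are hypothetical.

```python
# Off-diagonal scores copied from the Jaccard matrix (each pair listed once).
scores = {
    ("doc1", "doc2"): 0.133333,
    ("doc1", "doc3"): 0.0625,
    ("doc1", "doc4"): 0.0,
    ("doc2", "doc3"): 0.0,
    ("doc2", "doc4"): 0.133333,
    ("doc3", "doc4"): 0.0625,
}

def pairs_with(value: float) -> list[tuple[str, str]]:
    """Return every pair whose score equals `value`, so ties are not lost."""
    return [pair for pair, score in scores.items() if score == value]

most_similar = pairs_with(max(scores.values()))
least_similar = pairs_with(min(scores.values()))

print(most_similar)   # [('doc1', 'doc2'), ('doc2', 'doc4')]
print(least_similar)  # [('doc1', 'doc4'), ('doc2', 'doc3')]
```

Collecting all pairs that match the extreme value, rather than calling `max` or `min` on the pairs directly, is what keeps both members of each tie.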

#### 5. State your opinion, whether you agree or disagree with the calculated findings, and WHY.
Your answer here:


I think the findings are mathematically correct, though I don't think word overlap is a really accurate measure of semantic similarity. It might actually be useful for identifying plagiarism, though you may want to use n-grams with n > 1.
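
To illustrate the n-gram idea, here is a minimal sketch of Jaccard similarity over word n-grams. The function names and the two example sentences are hypothetical, and whitespace tokenization is assumed for brevity.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams (as tuples) in a whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over word n-gram sets; 0.0 when both sets are empty."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

# Hypothetical sentences: same words, different order.
original = "the cat sat on the mat"
shuffled = "the mat sat on the cat"

print(ngram_jaccard(original, shuffled, n=1))  # 1.0: unigrams ignore word order
print(ngram_jaccard(original, shuffled, n=2))  # 4 shared bigrams out of 6 unique
```

With unigrams the two sentences look identical, while bigrams register the reordering, which is why longer n-grams are more informative for spotting copied passages.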

#### 6. Assignment Submission Directions: 

1. After creating, running, and completing all cells in the Jupyter Notebook, save the file.
2. Upload the saved Jupyter notebook in the following format: firstnamelastnameWEEK4PROJECT.ipynb
3. When ready, click UPLOAD (in Blackboard), attach all applicable files and click SUBMIT.

#### Grading Rubric - 50 points

||Emerging (0-7 points)|Proficient (80%+) (8-9 points)|Exemplary (10 points)|
|:--|:---|:---|:---|
|Text Preprocessing (10 points)|<b>MORE THAN ONE OF</b> Lowercase, Lemmatisation, or Stop Word Removal were not applied, or were applied incorrectly.|<b>ONE OF</b> Lowercase, Lemmatisation, or Stop Word Removal was not applied, or was applied incorrectly.|<b>ALL OF</b> Lowercase, Lemmatisation, and Stop Word Removal were applied correctly.|
|Jaccard Similarity (10 points)|Jaccard similarity was not applied, or all results are incorrect.|Jaccard similarity was applied but some results are not correct.|Jaccard similarity was applied correctly to the vectors and results are as expected.|
|Most Similarity Comparison (10 points)|<b>Neither</b> document is correctly identified.|Only <b>ONE</b> of the documents is correctly identified.|<b>Both</b> of the documents are correctly identified.|
|Least Similarity Comparison (10 points)|<b>Neither</b> document is correctly identified.|Only <b>ONE</b> of the documents is correctly identified.|<b>Both</b> of the documents are correctly identified.|
|Reflection (10 points)|<b>More than two</b> questions / prompts were not answered and/or rationale provided in reflection does not align with the questions.|Answers were missing for <b>one or two</b> questions / prompts and/or rationale provided in reflection aligns with most or all of the questions.|Answers were provided for <b>all</b> questions / prompts and rationale provided in reflection aligns with the questions.|