<a href="https://colab.research.google.com/github/sarahethompson/eportfolio/blob/main/artefacts/module-3/unit-5-Jaccard_Coefficient_Calculations_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Calculate the Jaccard coefficient for the following pairs based on the provided data: (Jack, Mary), (Jack, Jim), and (Jim, Mary).

Name    Gender  Fever   Cough   Test-1  Test-2  Test-3  Test-4
Jack    M       Y       N       P       N       N       A
Mary    F       Y       N       P       A       P       N
Jim     M       Y       P       N       N       N       A

## Create dataframe

### Subtask:
Create a pandas DataFrame from the provided data.


**Reasoning**:
The subtask is to create a pandas DataFrame from the provided data. I will import pandas, create a dictionary with the data, and then create and display the DataFrame.



In [40]:
import pandas as pd

data = {
    'Name': ['Jack', 'Mary', 'Jim'],
    'Fever': ['Y', 'Y', 'Y'],
    'Cough': ['N', 'N', 'P'],
    'Test-1': ['P', 'P', 'N'],
    'Test-2': ['N', 'A', 'N'],
    'Test-3': ['N', 'P', 'N'],
    'Test-4': ['A', 'N', 'A']
}

df = pd.DataFrame(data)
display(df)

Unnamed: 0,Name,Fever,Cough,Test-1,Test-2,Test-3,Test-4
0,Jack,Y,N,P,N,N,A
1,Mary,Y,N,P,A,P,N
2,Jim,Y,P,N,N,N,A


## Encode data

### Subtask:
Encode the 'Gender', 'Fever', 'Cough', 'Test-1', 'Test-2', 'Test-3', and 'Test-4' columns as binary (1 for relevant values, 0 for others).


**Reasoning**:
Define the columns to be encoded and apply the specified binary encoding rules to each of these columns in the DataFrame.



In [43]:
cols_to_encode = ['Fever', 'Cough', 'Test-1', 'Test-2', 'Test-3', 'Test-4']

# Encoding based on user's instruction: Y & P = 1; N & A = 0
df['Fever'] = df['Fever'].replace({'Y': 1, 'N': 0}).astype(int)
df['Cough'] = df['Cough'].replace({'P': 1, 'N': 0}).replace('N', 0).astype(int) # Ensure N is 0
df[['Test-1', 'Test-2', 'Test-3', 'Test-4']] = df[['Test-1', 'Test-2', 'Test-3', 'Test-4']].replace(['Y', 'P'], 1).replace(['N', 'A'], 0).astype(int)

display(df)

Unnamed: 0,Name,Fever,Cough,Test-1,Test-2,Test-3,Test-4
0,Jack,1,0,1,0,0,0
1,Mary,1,0,1,0,1,0
2,Jim,1,1,0,0,0,0


## Calculate jaccard similarity

### Subtask:
Calculate the Jaccard similarity for the pairs (Jack, Mary), (Jack, Jim), and (Jim, Mary) using the encoded data.


**Reasoning**:
Extract the encoded data for each person and calculate the Jaccard similarity for each pair.



In [44]:
from sklearn.metrics import jaccard_score

# Extract the encoded data for each person, excluding Gender
cols_for_jaccard = ['Fever', 'Cough', 'Test-1', 'Test-2', 'Test-3', 'Test-4']
jack_data = df.loc[df['Name'] == 'Jack', cols_for_jaccard].values.flatten()
mary_data = df.loc[df['Name'] == 'Mary', cols_for_jaccard].values.flatten()
jim_data = df.loc[df['Name'] == 'Jim', cols_for_jaccard].values.flatten()

# Calculate Jaccard similarity using the standard formula, ensuring binary average
jack_mary_jaccard = jaccard_score(jack_data, mary_data, average='binary')
jack_jim_jaccard = jaccard_score(jack_data, jim_data, average='binary')
jim_mary_jaccard = jaccard_score(jim_data, mary_data, average='binary')


print(f"Jaccard Similarity:")
print(f"Jack & Mary: {jack_mary_jaccard:.2f}")
print(f"Jack & Jim: {jack_jim_jaccard:.2f}")
print(f"Jim & Mary: {jim_mary_jaccard:.2f}")
print("-" * 20)

# Calculate Jaccard Distance (1 - Jaccard Similarity)
print(f"Jaccard Distance (1 - Jaccard Similarity):")
print(f"Jack & Mary: {1 - jack_mary_jaccard:.2f}")
print(f"Jack & Jim: {1 - jack_jim_jaccard:.2f}")
print(f"Jim & Mary: {1 - jim_mary_jaccard:.2f}")

Jaccard Similarity:
Jack & Mary: 0.67
Jack & Jim: 0.33
Jim & Mary: 0.25
--------------------
Jaccard Distance (1 - Jaccard Similarity):
Jack & Mary: 0.33
Jack & Jim: 0.67
Jim & Mary: 0.75


In [46]:
from sklearn.metrics import confusion_matrix
import numpy as np

# Use the same encoded data for each person
# jack_data, mary_data, and jim_data were already created in a previous cell

def simple_matching_coefficient(y_true, y_pred):
    """Calculates the Simple Matching Coefficient for binary data."""
    # Ensure inputs are numpy arrays
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Calculate the confusion matrix components
    # C[0,0] is TN (0-0 matches)
    # C[0,1] is FP (0-1 mismatches)
    # C[1,0] is FN (1-0 mismatches)
    # C[1,1] is TP (1-1 matches)
    C = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = C.ravel()

    # Calculate SMC
    smc = (tp + tn) / (tp + tn + fp + fn)
    return smc

# Calculate Simple Matching Coefficient
print("Jack & Mary (SMC):", simple_matching_coefficient(jack_data, mary_data))
print("Jack & Jim (SMC):", simple_matching_coefficient(jack_data, jim_data))
print("Jim & Mary (SMC):", simple_matching_coefficient(jim_data, mary_data))

Jack & Mary (SMC): 0.8333333333333334
Jack & Jim (SMC): 0.6666666666666666
Jim & Mary (SMC): 0.5


## Notes on Jaccard Similarity vs. Simple Matching Coefficient and Task Summary

This notebook explores calculating similarity between individuals based on their characteristics and test results using Jaccard Similarity and the Simple Matching Coefficient (SMC), and clarifies the specific requirements of the task.

**Task Requirements:**
The goal was to calculate a specific metric for pairs (Jack, Mary), (Jack, Jim), and (Jim, Mary) based on the provided data table. Based on the expected "answers" provided, it was determined that the task required the calculation of **Jaccard Distance**, not Jaccard Similarity, using a specific encoding and subset of the data.

**Data Encoding:**
Before calculating similarity, categorical data was encoded into a binary format (0s and 1s). Based on the expected results, the specific encoding rule applied was:
*   'Y' and 'P' were encoded as **1**.
*   'N' and 'A' were encoded as **0**.

**Included Columns:**
Also based on the expected results, only the following columns were included in the similarity calculation:
*   'Fever'
*   'Cough'
*   'Test-1', 'Test-2', 'Test-3', 'Test-4'
The 'Gender' column was excluded as it was considered a symmetric variable in the context of this specific task.

**Jaccard Similarity:**
The Jaccard similarity coefficient (\(J\)) measures the overlap in shared '1' values (shared presences). The formula is \( J = \frac{f_{11}}{f_{01}+f_{10}+f_{11}} \). It ignores shared '0' values. Using the specified encoding and columns, the Jaccard similarities were calculated as approximately:
*   Jack & Mary: 0.67
*   Jack & Jim: 0.33
*   Jim & Mary: 0.25

**Jaccard Distance:**
The Jaccard distance (\(D\)) is a measure of dissimilarity, calculated as \( D = 1 - J \). It can also be calculated as \( D = \frac{f_{01}+f_{10}}{f_{01}+f_{10}+f_{11}} \). Using the calculated Jaccard similarities, the Jaccard distances were:
*   Jack & Mary: 0.33
*   Jack & Jim: 0.67
*   Jim & Mary: 0.75
These values matched the expected "answers" for the task.

**Simple Matching Coefficient (SMC):**
The Simple Matching Coefficient is an alternative similarity measure that considers **all matches** (both 1-1 and 0-0). It was calculated to demonstrate a metric that treats shared absences as equally contributing to similarity as shared presences. The SMC results were different from the Jaccard scores, highlighting the importance of choosing the appropriate metric based on the analytical goal.

**Summary of Process:**
1.  The provided data was loaded into a pandas DataFrame.
2.  Relevant columns ('Fever', 'Cough', 'Test-1' to 'Test-4') were encoded into a binary format (Y/P=1, N/A=0) as required by the task's expected results.
3.  Jaccard similarity was calculated using the encoded data and the standard formula, ensuring correct handling of binary data (`average='binary'`).
4.  Jaccard distance was calculated as 1 minus the Jaccard similarity.
5.  The calculated Jaccard distances were found to match the expected answers for the task.
6.  The Simple Matching Coefficient was also calculated and discussed as a contrasting similarity measure.

This notebook demonstrates the process of calculating Jaccard distance based on specific data encoding and column selection requirements, and contrasts it with Jaccard similarity and the Simple Matching Coefficient.