# Prospecção de Dados (Data Mining) DI/FCUL - HA3

## Third Home Assignment (MC/DI/FCUL - 2024)

### Fill in the section below

### GROUP: `13`

* Miguel Landum, 35019 - Hours worked on the project
* Niklas Schmitz, 62689 - Hours worked on the project
* Pol Parra, 62692 - Hours worked on the project
* Til Dietrich, 62928 - Hours worked on the project




The purpose of this Home Assignment is to:
* Find similar items with Locality-Sensitive Hashing (LSH)
* Do Dimensionality Reduction

**NOTE 1: Students are not allowed to add more cells to the notebook**

**NOTE 2: The notebook must be submitted fully executed**

**NOTE 3: Name of notebook should be: HA3_GROUP-XX.ipynb (where XX is the group number)**


**NOTE to run code locally:** add data (data_d3.pickle, data_d4.pickle) to **assignment-3/data/** folder

## 1. Read the Dataset

The dataset corresponds to about 99% of the Human Proteome (the set of known Human Proteins, about 19,500), coded with specific structural elements. The proteins are presented in a dictionary where the key is the [UniprotID](https://www.uniprot.org/) of the protein and the value is a set of indices of a specific structural characteristic.

Students can use one of two datasets, that are **not** subsets of each other: 
* `data_d3.pickle` - smaller set of structural features (2048)
* `data_d4.pickle` - much larger set of structural features (20736) **Note:** This dataset has been zipped to fit into Moodle. Students should unzip it before use.

Select **one** of the datasets and perform all analyses with it. 

Using sparse matrices may be advisable, especially for the `d4` dataset.
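To make the sparse-matrix advice concrete, here is a rough memory estimate. It is only a sketch: the 8-byte value size, the 4-byte index size, and the average of 500 features per protein are assumptions for illustration, not measurements of the dataset. The helper names `dense_bytes` and `csr_bytes` are hypothetical.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes used by a dense array of the given shape."""
    return n_rows * n_cols * itemsize

def csr_bytes(n_rows, nnz, itemsize=8, index_size=4):
    """Approximate bytes used by a CSR matrix: values, column indices, row pointers."""
    return nnz * itemsize + nnz * index_size + (n_rows + 1) * index_size

n_rows, n_cols = 19258, 20736      # shape reported for d4 further down
avg_features = 500                 # hypothetical average set size per protein
nnz = n_rows * avg_features

print(f"dense: {dense_bytes(n_rows, n_cols) / 1e9:.2f} GB")
print(f"CSR:   {csr_bytes(n_rows, nnz) / 1e9:.2f} GB")
```

Under these assumptions the dense form needs a few gigabytes, while the CSR form stays far smaller as long as each protein has only a small fraction of the features.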



In [1]:
### Your code Here
import pickle
import numpy as np
from scipy.sparse import csr_matrix
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import itertools
import time

# Use context managers so the files are closed after loading
with open("data_d3.pickle", "rb") as f:
    data_d3 = pickle.load(f)
with open("data_d4.pickle", "rb") as f:
    data_d4 = pickle.load(f)

## 2. Perform Locality-Sensitive Hashing (LSH)

* examine the selected dataset in terms of similarities and select a set of LSH parameters able to capture the most similar proteins
* Comment your results

**BONUS POINTS:** It might be interesting to identify **some** of the candidate pairs in Uniprot, to check if they share some of the same properties (e.g. for [protein P28223](https://www.uniprot.org/uniprotkb/P28223/entry))


In [2]:

### Add supporting functions here
def create_sparse_matrix(data):
    uniprot_ids = list(data.keys())
    structural_features = list(data.values())

    # Number of columns: largest feature index + 1 (empty sets contribute -1)
    num_features = max(max(indices, default=-1) for indices in structural_features) + 1

    # Prepare data for csr_matrix
    matrix_data = []
    rows = []
    cols = []

    for row, indices in enumerate(structural_features):
        rows.extend([row] * len(indices))
        cols.extend(indices)
        matrix_data.extend([1] * len(indices))

    # Create the sparse matrix
    sparse_matrix = csr_matrix((matrix_data, (rows, cols)), shape=(len(uniprot_ids), num_features))
    
    return sparse_matrix, uniprot_ids


# Create sparse matrices for both datasets
sparse_matrix_d3, uniprot_ids_d3 = create_sparse_matrix(data_d3)
sparse_matrix_d4, uniprot_ids_d4 = create_sparse_matrix(data_d4)

# Create dense matrices for both datasets
dense_matrix_d3 = sparse_matrix_d3.toarray()
dense_matrix_d4 = sparse_matrix_d4.toarray()

df_d3 = pd.DataFrame(dense_matrix_d3, index=uniprot_ids_d3)
df_d4 = pd.DataFrame(dense_matrix_d4, index=uniprot_ids_d4)

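def MakeBucketsT(TDocs, perms, N, B, R, NB):
    """Group the N documents into LSH buckets using B bands of R MinHash rows each.

    TDocs[e] is the set of documents that contain element e; perms holds B * R
    permutations of the element indices.
    """
    Buckets = {}
    all_docs = set(range(N))
    for b in range(B):
        SIGS = np.zeros((N, R), dtype="int32")
        for r in range(R):
            perm = perms[b * R + r]
            L = all_docs.copy()  # documents that still need a signature value
            i = 0
            # Walk the permutation: a document's MinHash value is the first
            # position i at which one of its elements appears.
            while len(L) > 0:
                elem = perm[i]
                docs_found = TDocs[elem] & L
                if len(docs_found) > 0:
                    SIGS[list(docs_found), r] = i
                    L = L - docs_found
                i += 1
                if i == len(perm):
                    # Documents with no elements get the sentinel value len(perm)
                    SIGS[list(L), r] = i
                    L = set()
        # Documents whose R-row signature falls into the same bucket in this
        # band become candidates for similarity.
        for d in range(N):
            bucket = hash(tuple(SIGS[d])) % NB
            Buckets.setdefault((b, bucket), set()).add(d)
    return Buckets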

def LSHT(Data, B, R, NB=28934501):
    N, M = Data.shape
    DT = Data.T  # Transpose the matrix
    DataT = [set(np.where(DT[i] == 1)[0]) for i in range(M)]  # Adjust indexing for dense matrix
    P = B * R
    np.random.seed(3)
    perms = [np.random.permutation(M) for _ in range(P)]
    buckets = MakeBucketsT(DataT, perms, N, B, R, NB)
    return buckets

# Use LSHT to find similar proteins
buckets_d3 = LSHT(dense_matrix_d3, 80, 8)
buckets_d4 = LSHT(dense_matrix_d4, 80, 8)


def compare_characteristics(df, proteins):
    subset = df.loc[proteins]
    num_columns_differ = (subset.iloc[0] != subset.iloc[1]).sum()
    total_columns = len(subset.columns)
    num_columns_match = total_columns - num_columns_differ

    # Calculate the number of shared 1s between the proteins
    shared_1s = ((subset.iloc[0] == 1) & (subset.iloc[1] == 1)).sum()

    # Size of the union: features present in at least one of the two proteins
    total_1s = subset.iloc[0].sum() + subset.iloc[1].sum() - shared_1s

    return num_columns_differ, num_columns_match, shared_1s, total_1s

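def jaccard_similarity(protein1, protein2):
    # For binary vectors, element-wise min gives the intersection and max the union
    intersection = np.sum(np.minimum(protein1, protein2))
    union = np.sum(np.maximum(protein1, protein2))
    # Two all-zero vectors have an empty union; return 0.0 instead of dividing by zero
    return intersection / union if union > 0 else 0.0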

def extract_similar_pairs(buckets, uniprot_ids, df):
    similar_pairs = []
    for (b, buck), docs in buckets.items():
        if len(docs) > 1:
            #print("Band", b, "suggests these similar docs:", [uniprot_ids[d] for d in docs])
            combs = itertools.combinations(docs, 2)
            for i, j in combs:
                protein1 = df.iloc[i].values
                protein2 = df.iloc[j].values
                jaccard_sim = jaccard_similarity(protein1, protein2)
                

                num_differ, num_match, shared_1s, total_1s = compare_characteristics(df, [uniprot_ids[i], uniprot_ids[j]])
                similar_pairs.append((uniprot_ids[i], uniprot_ids[j], num_differ, num_match, shared_1s, total_1s, jaccard_sim))
                
                if len(similar_pairs) >= 20:  # Limit to 20 pairs
                    return similar_pairs
    return similar_pairs

# Extract similar pairs and include Jaccard similarity
similar_pairs_d3 = extract_similar_pairs(buckets_d3, uniprot_ids_d3, df_d3)
similar_pairs_d4 = extract_similar_pairs(buckets_d4, uniprot_ids_d4, df_d4)

# Create DataFrames with Jaccard similarity included
df_similar_pairs_d3 = pd.DataFrame(similar_pairs_d3, columns=['Protein 1', 'Protein 2', 'Characteristics Differ', 'Total Match', 'Shared Characteristics', 'Total Characteristics', 'Jaccard Similarity'])
df_similar_pairs_d4 = pd.DataFrame(similar_pairs_d4, columns=['Protein 1', 'Protein 2', 'Characteristics Differ', 'Total Match', 'Shared Characteristics', 'Total Characteristics', 'Jaccard Similarity'])


In [3]:
display("Sparse matrix shape for data_d3:", sparse_matrix_d3.shape)
display("Sparse matrix shape for data_d4:", sparse_matrix_d4.shape)


# Display similar protein pairs for data_d3
display("Similar protein pairs for data_d3:")
count_d3 = 0
for (b, buck), docs in buckets_d3.items():
    if len(docs) > 1 and count_d3 < 20:
        similar_proteins = [uniprot_ids_d3[d] for d in docs]
        display(f"Band {b} suggests these similar proteins: {similar_proteins}")
        count_d3 += 1

# Display similar protein pairs for data_d4
display("Similar protein pairs for data_d4:")
count_d4 = 0
for (b, buck), docs in buckets_d4.items():
    if len(docs) > 1 and count_d4 < 20:
        similar_proteins = [uniprot_ids_d4[d] for d in docs]
        display(f"Band {b} suggests these similar proteins: {similar_proteins}")
        count_d4 += 1
        
        
# Print similar protein pairs for data_d3
print("Similar protein pairs for data_d3:")
display(df_similar_pairs_d3)

# Print similar protein pairs for data_d4
print("Similar protein pairs for data_d4:")
display(df_similar_pairs_d4)




'Sparse matrix shape for data_d3:'

(19258, 2048)

'Sparse matrix shape for data_d4:'

(19258, 20736)

'Similar protein pairs for data_d3:'

"Band 0 suggests these similar proteins: ['A0A024RBG1', 'Q9NZJ9']"

"Band 0 suggests these similar proteins: ['A0A075B6P5', 'P01615']"

"Band 0 suggests these similar proteins: ['A0A075B6S2', 'A0A0C4DH68']"

"Band 0 suggests these similar proteins: ['A0A075B6S6', 'P06310']"

"Band 0 suggests these similar proteins: ['Q6IPX1', 'A0A087WVF3']"

"Band 0 suggests these similar proteins: ['A0A087WW87', 'P01614']"

"Band 0 suggests these similar proteins: ['A6NER0', 'A0A087X179']"

"Band 0 suggests these similar proteins: ['P07919', 'A0A096LP55']"

"Band 0 suggests these similar proteins: ['A0A0A0MT36', 'A0A0C4DH24']"

"Band 0 suggests these similar proteins: ['A0JP26', 'A0A0A6YYL3']"

"Band 0 suggests these similar proteins: ['A0A0B4J2D9', 'P0DP09']"

"Band 0 suggests these similar proteins: ['A0A0B4J2F2', 'P57059']"

"Band 0 suggests these similar proteins: ['P01742', 'A0A0B4J2H0']"

"Band 0 suggests these similar proteins: ['P01619', 'A0A0C4DH25']"

"Band 0 suggests these similar proteins: ['P01825', 'P0DP06', 'A0A0C4DH41']"

"Band 0 suggests these similar proteins: ['P01767', 'A0A0C4DH42']"

"Band 0 suggests these similar proteins: ['P01597', 'A0A0C4DH67', 'P04432']"

"Band 0 suggests these similar proteins: ['P01611', 'A0A0C4DH73']"

"Band 0 suggests these similar proteins: ['P0DPF7', 'A0A0J9YXY3']"

"Band 0 suggests these similar proteins: ['A0A1B0GTK5', 'P0DP71']"

'Similar protein pairs for data_d4:'

"Band 0 suggests these similar proteins: ['P34932', 'Q16401']"

"Band 0 suggests these similar proteins: ['P0CL82', 'P0DSO3', 'O76087', 'P0CL80', 'P0DTW1', 'P0CL81', 'Q13069']"

"Band 0 suggests these similar proteins: ['P47813', 'O14602']"

"Band 0 suggests these similar proteins: ['Q9QC07', 'Q9BXR3', 'Q9UQG0', 'P10266']"

"Band 0 suggests these similar proteins: ['P0CI25', 'P0CI26']"

"Band 0 suggests these similar proteins: ['Q16777', 'Q9BTM1', 'Q99878', 'P0C0S8', 'Q8IUE6']"

"Band 0 suggests these similar proteins: ['Q6ZUB0', 'P0C874']"

"Band 0 suggests these similar proteins: ['P63120', 'P63129']"

"Band 0 suggests these similar proteins: ['P0C0S5', 'Q71UI9']"

"Band 0 suggests these similar proteins: ['Q6RFH8', 'P0CJ90', 'Q9UBX2']"

"Band 0 suggests these similar proteins: ['P19961', 'P0DTE8']"

"Band 0 suggests these similar proteins: ['P01593', 'P01594']"

"Band 0 suggests these similar proteins: ['P01850', 'A0A5B9']"

"Band 0 suggests these similar proteins: ['P01614', 'A0A087WW87']"

"Band 0 suggests these similar proteins: ['Q99877', 'P62807', 'Q93079', 'P58876', 'O60814', 'Q5QNW6']"

"Band 0 suggests these similar proteins: ['P0C7W8', 'P0C7V4']"

"Band 0 suggests these similar proteins: ['P0DPH9', 'A0A1B0GTR3']"

"Band 0 suggests these similar proteins: ['O00571', 'O15523']"

"Band 0 suggests these similar proteins: ['Q9UQ88', 'P21127']"

"Band 0 suggests these similar proteins: ['Q96AQ1', 'Q96LY2']"

Similar protein pairs for data_d3:


| | Protein 1 | Protein 2 | Characteristics Differ | Total Match | Shared Characteristics | Total Characteristics | Jaccard Similarity |
|---|---|---|---|---|---|---|---|
| 0 | A0A024RBG1 | Q9NZJ9 | 13 | 2035 | 175 | 188 | 0.930851 |
| 1 | A0A075B6P5 | P01615 | 2 | 2046 | 123 | 125 | 0.984 |
| 2 | A0A075B6S2 | A0A0C4DH68 | 58 | 1990 | 95 | 153 | 0.620915 |
| 3 | A0A075B6S6 | P06310 | 9 | 2039 | 130 | 139 | 0.935252 |
| 4 | Q6IPX1 | A0A087WVF3 | 29 | 2019 | 361 | 390 | 0.925641 |
| 5 | A0A087WW87 | P01614 | 1 | 2047 | 127 | 128 | 0.992188 |
| 6 | A6NER0 | A0A087X179 | 50 | 1998 | 347 | 397 | 0.874055 |
| 7 | P07919 | A0A096LP55 | 20 | 2028 | 90 | 110 | 0.818182 |
| 8 | A0A0A0MT36 | A0A0C4DH24 | 19 | 2029 | 110 | 129 | 0.852713 |
| 9 | A0JP26 | A0A0A6YYL3 | 51 | 1997 | 305 | 356 | 0.856742 |


Similar protein pairs for data_d4:


| | Protein 1 | Protein 2 | Characteristics Differ | Total Match | Shared Characteristics | Total Characteristics | Jaccard Similarity |
|---|---|---|---|---|---|---|---|
| 0 | P34932 | Q16401 | 2530 | 18206 | 696 | 3226 | 0.215747 |
| 1 | P0CL82 | P0DSO3 | 23 | 20713 | 171 | 194 | 0.881443 |
| 2 | P0CL82 | O76087 | 9 | 20727 | 178 | 187 | 0.951872 |
| 3 | P0CL82 | P0CL80 | 8 | 20728 | 178 | 186 | 0.956989 |
| 4 | P0CL82 | P0DTW1 | 35 | 20701 | 162 | 197 | 0.822335 |
| 5 | P0CL82 | P0CL81 | 7 | 20729 | 178 | 185 | 0.962162 |
| 6 | P0CL82 | Q13069 | 33 | 20703 | 165 | 198 | 0.833333 |
| 7 | P0DSO3 | O76087 | 14 | 20722 | 177 | 191 | 0.926702 |
| 8 | P0DSO3 | P0CL80 | 17 | 20719 | 175 | 192 | 0.911458 |
| 9 | P0DSO3 | P0DTW1 | 18 | 20718 | 172 | 190 | 0.905263 |


### Your short analysis here

#### **Dataset Overview**
We are working with two datasets that correspond to approximately 99% of the Human Proteome. Each dataset is presented in a dictionary format where the key is the UniProtID of a protein and the value is a set of indices representing specific structural characteristics. 

The datasets are:

* `data_d3.pickle` - a smaller set of structural features (2048 features).
* `data_d4.pickle` - a larger set of structural features (20736 features).


### **Locality-Sensitive Hashing (LSH)**

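LSH is performed on both datasets to find similar proteins. The parameters B (number of bands) and R (number of rows per band) determine which similarity levels are captured. The results shown here were obtained with B = 80 and R = 8, for which the approximate similarity threshold (1/B)^(1/R) is about 0.58.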
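The effect of B and R can be sketched with the standard banding formula: two items with Jaccard similarity s become candidates in at least one band with probability 1 - (1 - s^R)^B. The function names below are hypothetical helpers, and the formula assumes ideal MinHash signatures without bucket collisions.

```python
def candidate_probability(s, bands, rows):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approximate_threshold(bands, rows):
    """Similarity around which the S-curve rises most steeply: (1/B)^(1/R)."""
    return (1.0 / bands) ** (1.0 / rows)

B, R = 80, 8  # values passed to LSHT in the code cell above
print(f"threshold ~ {approximate_threshold(B, R):.3f}")
for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(f"s = {s:.1f} -> P(candidate) = {candidate_probability(s, B, R):.4f}")
```

A pair with similarity around 0.2 has only a very small chance of becoming a candidate, but with roughly 185 million possible pairs among 19258 proteins a few such false positives can still be expected. Collisions from the modulo hashing of signatures into buckets are another possible source.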



For data_d3, the following are some of the similar protein pairs identified using LSH:


* Band 0 suggests these similar proteins: ['A0A024RBG1', 'Q9NZJ9']
* Band 0 suggests these similar proteins: ['A0A075B6P5', 'P01615']


For data_d4, the following are some of the similar protein pairs identified using LSH:

* Band 0 suggests these similar proteins: ['P34932', 'Q16401']
* Band 0 suggests these similar proteins: ['P0CL82', 'P0DSO3', 'O76087', 'P0CL80', 'P0DTW1', 'P0CL81', 'Q13069']



### **Detailed Analysis of Similar Protein Pairs**

We analyzed the characteristics of the similar protein pairs identified. Here are some of the detailed results for both datasets:



#### For data_d3:

* A0A024RBG1 (NUD4B_HUMAN) and Q9NZJ9 (NUDT4_HUMAN) share 175 specific structural characteristics and differ in 13 characteristics, out of a total of 188 characteristics present in at least one of the two proteins.
* A0A075B6P5 (KV228_HUMAN) and P01615 (KVD28_HUMAN) share 123 specific structural characteristics and differ in 2 characteristics, out of a total of 125 characteristics present in at least one of the two proteins.


A quick look at the UniProt website reveals that the two proteins in each pair belong to the same protein family, suggesting that the matches found using LSH also have biological relevance.


#### For data_d4:


* P34932 (HSP74_HUMAN) and Q16401 (PSMD5_HUMAN) share 696 specific structural characteristics and differ in 2530 characteristics, out of a total of 3226 characteristics present in at least one of the two proteins.
* P0CL82 (GG12I_HUMAN) and P0DSO3 (GAGE4_HUMAN) share 171 specific structural characteristics and differ in 23 characteristics, out of a total of 194 characteristics present in at least one of the two proteins.


A quick look at the UniProt website reveals that the first pair of proteins, P34932 (HSP74_HUMAN) and Q16401 (PSMD5_HUMAN), do not share a protein family or any biological function.
In the second group of proteins (P0CL82, P0DSO3, O76087, P0CL80, P0DTW1, P0CL81, Q13069), all members belong to the GAGE family, suggesting a biological similarity between them.
 


### **Conclusion**

Locality-Sensitive Hashing (LSH) effectively captures similar protein pairs in large proteomic datasets. By adjusting the parameters B and R, we can control the granularity of similarity detection and the computational cost. Identifying similar proteins can lead to insights into shared properties and functional relationships, which can be further verified through databases like UniProt.

In this specific scenario, screening a small subset of the candidate pairs from each dataset suggests that most of them are biologically consistent: both data_d3 pairs examined and the GAGE group from data_d4 belong to a single protein family each, while the P34932/Q16401 pair from data_d4 does not.

To augment our understanding, we introduced Jaccard similarity, quantitatively evaluating the degree of similarity between protein pairs. This metric provides additional context, indicating the proportion of shared structural characteristics relative to the total characteristics.

For instance, considering the pair A0A075B6P5 (KV228_HUMAN) and P01615 (KVD28_HUMAN) from data_d3, their Jaccard similarity of 0.984 suggests a high degree of overlap in their structural characteristics. 

By contrast, the pair P0CL82 (GG12I_HUMAN) and P0DSO3 (GAGE4_HUMAN) from data_d4, which was placed in a bucket with several other proteins, has a Jaccard similarity of 0.881443, suggesting a reasonable degree of overlap in their structural characteristics. The pair P34932 (HSP74_HUMAN) and Q16401 (PSMD5_HUMAN), however, has a Jaccard similarity of only 0.215747, indicating a limited intersection of their structural features. This is consistent with the UniProt database, according to which these proteins do not share any biological similarity.
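As a small illustration of the metric, the Jaccard similarity can also be computed directly on the feature sets stored in the dictionary, without building a dense matrix. The sets below are toy values, not taken from the dataset, and the helper name `jaccard` is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy feature sets (hypothetical indices, not taken from the dataset)
p1 = {1, 4, 7, 9}
p2 = {1, 4, 8, 9, 12}
print(jaccard(p1, p2))  # 3 shared features out of 6 in the union

# The counts in the result tables reproduce the reported similarities
print(round(123 / 125, 6))   # shared / total for A0A075B6P5 and P01615
print(round(696 / 3226, 6))  # shared / total for P34932 and Q16401
```

Working on sets keeps the cost proportional to the number of features a protein actually has rather than to the full feature space.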



## 3. Do dimensionality reduction 

Use the techniques discussed in class to make an appropriate dimensional reduction of the selected dataset. It is not necessary to be extensive, **it is better to select one approach and do it well than try a lot of techniques with poor insights and analysis**

It is important to do some sensitivity analysis, relating the dataset size reduction to the loss of information
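One possible way to relate size reduction to information loss, sketched here under assumptions, is to use a singular value decomposition and track the fraction of squared singular values retained by the first k components. The matrix below is synthetic random data standing in for the protein-by-feature matrix, SVD is only one candidate technique, and `retained_energy` is a hypothetical helper.

```python
import numpy as np

def retained_energy(singular_values, k):
    """Fraction of the total squared singular values kept by the first k components."""
    energy = singular_values ** 2
    return energy[:k].sum() / energy.sum()

# Synthetic binary matrix standing in for the protein-by-feature matrix
rng = np.random.default_rng(0)
X = (rng.random((200, 64)) < 0.1).astype(float)

# Singular values only, returned in descending order
s = np.linalg.svd(X, compute_uv=False)

for k in (2, 8, 16, 32, 64):
    print(f"k = {k:2d}: size ratio = {k / X.shape[1]:.2f}, "
          f"retained energy = {retained_energy(s, k):.3f}")
```

The matrix is not centered here, so this is a plain SVD rather than PCA. For the real data, a truncated solver that accepts sparse input, such as `scipy.sparse.linalg.svds`, would likely be more practical than a full dense decomposition.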



In [None]:
### Add supporting functions here

## 4. Discuss your findings [to fill on your own]

* Comment your results above
* Discuss how they could be used for the full Uniprot that currently has about [248 Million proteins](https://www.uniprot.org/uniprotkb/statistics)


