# Prospecção de Dados (Data Mining) DI/FCUL - HA3

## Third Home Assignment (MC/DI/FCUL - 2024)

### Fill in the section below

### GROUP:`02`

* João Martins, 62532 - Hours worked on the project: 8
* Rúben Torres, 62531 - Hours worked on the project: 8
* Nuno Pereira, 56933 - Hours worked on the project: 8




The purpose of this Home Assignment is to:
* Find similar items with Locality-Sensitive Hashing (LSH)
* Perform dimensionality reduction

**NOTE 1: Students are not allowed to add more cells to the notebook**

**NOTE 2: The notebook must be submitted fully executed**

**NOTE 3: Name of notebook should be: HA3_GROUP-XX.ipynb (where XX is the group number)**



## 1. Read the Dataset

The dataset corresponds to about 99% of the Human Proteome (the set of known human proteins, about 19,500), coded with specific structural elements. The proteins are presented in a dictionary where the key is the [UniprotID](https://www.uniprot.org/) of the protein and the value is a set of indices of specific structural characteristics.

Students can use one of two datasets, that are **not** subsets of each other: 
* `data_d3.pickle` - smaller set of structural features (2048)
* `data_d4.pickle` - much larger set of structural features (20736). **Note:** This dataset has been zipped to fit into Moodle. Students should unzip it before use.

Select **one** of the datasets and perform all analyses with it. 

Using sparse matrices may be advisable, especially for the `d4` dataset.
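
As a minimal sketch of the sparse representation suggested above, the dictionary of index sets can be turned into a SciPy CSR matrix with one row per protein and one column per structural feature. The function name `to_sparse_matrix`, the toy dictionary, and the IDs in it are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy.sparse import csr_matrix

def to_sparse_matrix(protein_features, n_features):
    """Build a binary CSR matrix (proteins x features) from a dict of index sets."""
    protein_ids = list(protein_features)  # fixes the row order
    rows, cols = [], []
    for row, pid in enumerate(protein_ids):
        for feature in protein_features[pid]:
            rows.append(row)
            cols.append(feature)
    values = np.ones(len(rows), dtype=np.int8)
    matrix = csr_matrix((values, (rows, cols)),
                        shape=(len(protein_ids), n_features))
    return protein_ids, matrix

# Toy input shaped like the dataset (made-up IDs and indices)
toy = {"P1": {0, 3, 5}, "P2": {3}, "P3": set()}
ids, X = to_sparse_matrix(toy, n_features=8)
print(X.shape, X.nnz)
```

Only the non-zero entries are stored, so memory grows with the number of features actually present per protein rather than with the full 2048 or 20736 columns.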



In [40]:
### Your code Here
import pickle

import numpy as np
import pandas as pd

# Load the d3 dataset: a dict mapping UniProt ID -> set of structural-feature indices,
# with keys such as 'A0A024R1R8', 'A0A024RBG1', 'A0A075B6H7'.
with open("data_d3.pickle", "rb") as f:
    human_proteins_dataset = pickle.load(f)



## 2. Perform Locality-Sensitive Hashing (LSH)

* Examine the selected dataset in terms of similarities and select a set of LSH parameters able to capture the most similar proteins
* Comment on your results

**BONUS POINTS:** It might be interesting to identify **some** of the candidate pairs in Uniprot, to check if they share some of the same properties (e.g. for [protein P28223](https://www.uniprot.org/uniprotkb/P28223/entry))
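
To help choose the LSH parameters, the following sketch tabulates the standard banding S-curve: with B bands of R rows each, two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^R)^B, and the curve rises most steeply near (1/B)^(1/R). The function names and the parameter combinations tried here are illustrative assumptions.

```python
def candidate_probability(s, bands, rows):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approx_threshold(bands, rows):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)

# Compare a few (B, R) settings at several similarity levels
for bands, rows in [(20, 5), (50, 4), (10, 10)]:
    t = approx_threshold(bands, rows)
    probs = ", ".join(f"s={s:.1f}: {candidate_probability(s, bands, rows):.3f}"
                      for s in (0.3, 0.5, 0.7, 0.9))
    print(f"B={bands:>2} R={rows:>2} threshold~{t:.3f} | {probs}")
```

Increasing R makes the filter stricter (fewer false positives), while increasing B makes it more permissive (fewer false negatives), at the cost of B * R permutations in total.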


In [41]:
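def MakeBucketsT(TDocs, perms, N, B, R, M, NB):
    """Hash MinHash band signatures into buckets.

    TDocs[w] is the set of documents containing word w (transposed data),
    perms holds B*R permutations of the M words, N is the number of documents,
    and NB is the number of buckets per band.
    """
    Buckets = {}
    all_docs = set(range(N))
    for b in range(B):
        SIGS = np.zeros((N, R), dtype="int32")
        for r in range(R):
            perm = perms[b*R + r]
            L = all_docs.copy()  # documents still without a signature value
            i = 0
            while len(L) > 0:
                elem = perm[i]
                docs_found = TDocs[elem] & L
                if len(docs_found) > 0:
                    # first position in the permutation where these docs have a word
                    SIGS[list(docs_found), r] = i
                    L = L - docs_found
                i += 1
                if i == M:
                    # permutation exhausted: remaining (empty) docs get the sentinel M
                    SIGS[list(L), r] = i
                    L = set()
        for d in range(N):
            bucket = hash(tuple(SIGS[d])) % NB
            Buckets.setdefault((b, bucket), set()).add(d)
    return Buckets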

def LSHTs(docs, words, B, R, NB=28934501, verbose=True):
    """Run MinHash LSH over docs (a list of sets of word indices in range(words))."""
    N, M = len(docs), words
    if verbose: print("Transpose the dataset")
    # dataT[w] = set of document indices that contain word w
    dataT = [set() for _ in range(M)]
    for doc_id, doc in enumerate(docs):
        for word in doc:
            dataT[word].add(doc_id)
    P = B * R
    np.random.seed(3)
    if verbose: print(f"Generating {P} permutations for {(1/B)**(1/R):.3f} similarity")
    perms = [np.random.permutation(M) for _ in range(P)]
    if verbose: print("Computing buckets...")
    # Arguments must follow MakeBucketsT's signature: (TDocs, perms, N, B, R, M, NB)
    buckets = MakeBucketsT(dataT, perms, N, B, R, M, NB)
    return buckets

def make_word_indexes(words):
    return dict(zip(words, range(len(words))))

def make_indexed_docs(docs, word_index):
    indexed_docs = []
    for doc in docs:
        indexed_doc = set(word_index[word] for word in doc)
        indexed_docs.append(indexed_doc)
    return indexed_docs

def remove_empty_docs(words_text_sets):
    return [doc for doc in words_text_sets if len(doc) > 0]

In [42]:
# Process the dataset
# The dataset is a dict, so iterate over its values (the feature sets), not its keys
protein_sets = list(human_proteins_dataset.values())
words = sorted(set(word for doc in protein_sets for word in doc))
word_index = make_word_indexes(words)
indexed_docs = make_indexed_docs(protein_sets, word_index)
# Note: doc indices below are positions in indexed_docs after empty sets are removed
indexed_docs = remove_empty_docs(indexed_docs)

B = 20
R = 5
bucks = LSHTs(indexed_docs, len(words), B, R)

for (b, buck), members in bucks.items():
    if len(members) > 1:
        print("Band", b, "suggests these similar docs:", members)

Transpose the dataset
Generating 100 permutations for 0.549 similarity
Computing buckets...
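
Buckets only propose candidates, since two proteins can share a bucket by chance. A minimal sketch of a verification step, assuming buckets shaped like the dictionary returned by `LSHTs` (keys `(band, bucket)`, values sets of document indices), is to compute the exact Jaccard similarity of each candidate pair. The helper names and the toy data are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def verified_pairs(buckets, docs, min_similarity):
    """Collect unique candidate pairs from all buckets and keep those whose
    exact Jaccard similarity reaches min_similarity, most similar first."""
    candidates = set()
    for members in buckets.values():
        if len(members) > 1:
            candidates.update(combinations(sorted(members), 2))
    scored = [(i, j, jaccard(docs[i], docs[j])) for i, j in candidates]
    return sorted((p for p in scored if p[2] >= min_similarity),
                  key=lambda p: -p[2])

# Toy data in the same shape as the notebook's structures (values are made up)
docs = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8}, {1, 2, 3, 4}]
buckets = {(0, 11): {0, 1, 3}, (0, 42): {2}, (1, 5): {0, 3}, (1, 9): {1, 2}}
for i, j, sim in verified_pairs(buckets, docs, min_similarity=0.5):
    print(i, j, round(sim, 3))
```

The indices refer to positions in the document list passed to the LSH step, so mapping them back to UniProt IDs requires keeping a list of IDs in the same order.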

### Your short analysis here



## 3. Do dimensionality reduction 

Use the techniques discussed in class to perform an appropriate dimensionality reduction of the selected dataset. It is not necessary to be exhaustive: **it is better to select one approach and do it well than to try many techniques with poor insights and analysis**

It is important to do some sensitivity analysis, relating the dataset size reduction to the loss of information
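
One possible way to frame the sensitivity analysis, assuming an SVD/PCA-style reduction is among the techniques covered (the assignment does not name one), is to report for each number of components k both the size ratio k / n_features and the fraction of variance retained. The function name and the random toy matrix below are illustrative assumptions, not the real protein data.

```python
import numpy as np

def svd_information_loss(X, ks):
    """For each k, return (k, size ratio, retained variance) of a rank-k
    truncated SVD of the column-centred matrix X."""
    Xc = X - X.mean(axis=0)  # centre features, as in PCA
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    energy = singular_values ** 2
    total = energy.sum()
    report = []
    for k in ks:
        retained = energy[:k].sum() / total  # fraction of variance kept
        report.append((k, k / X.shape[1], retained))
    return report

# Toy binary matrix standing in for the proteins x features data (random, not real)
rng = np.random.default_rng(0)
X = (rng.random((200, 64)) < 0.1).astype(float)
for k, size_ratio, retained in svd_information_loss(X, ks=[2, 8, 16, 32, 64]):
    print(f"k={k:>2}  size={size_ratio:.2%}  variance kept={retained:.3f}")
```

A dense SVD like this is only practical for small matrices. For the full dataset, a truncated solver that accepts sparse input, such as `sklearn.decomposition.TruncatedSVD`, would likely be a better fit.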



In [None]:
### Add supporting functions here



In [None]:
### Add processing code here



## 4. Discuss your findings [to fill on your own]

* Comment on your results above
* Discuss how they could be used for the full Uniprot, which currently has about [248 Million proteins](https://www.uniprot.org/uniprotkb/statistics)


