# TP4: Descriptive functions

## Winter 2023 - BIN710 Data Mining (UdeS)

Fourth assignment for the Data Mining class at UdeS.

Student name : Simon Lalonde

### Directory structure

├── TP4_data.csv    ---> Data

├── TP4_soln.ipynb   ---> Jupyter Notebook

└── TP4.pdf    ---> Tasks to complete

### Data
1 file
- __ objects
- __ attributes
- __ classes (?)

### Goal
Compare and use partitioning methods on sequence data. In this case, we are clustering RNA sequencing data.

---

## Necessary libraries

In [1]:
from pathlib import Path

import pandas as pd


## 0. Data pre-processing (inconsistencies) and feature generation (sequence length)

### Loading the data and exploring incongruencies

In [2]:
# dir/file setup and read
tp4_dir = Path.cwd()
filename = "TP4_data.csv"

df = pd.read_csv(tp4_dir / filename)
df.head()

|   | id | sequence |
|---|---|---|
| 0 | AAIY01303410.1/717-923 | CCAACGUGGAUACUCCCGGGAGGUCACUCUCCCCGGGCUCUGUCCA... |
| 1 | CP000140.1/4143906-4143709 | UACCUUUGCAUCCGAAUUGGUUCCGUACGCUCGUUCGGGCAUACGG... |
| 2 | URS0000D6BCE7_12908/1-215 | GCGUAACGCGCUAUGGCUUAAACGGCUGCCCCAAAGCUGCCAAAGG... |
| 3 | X71081.1/4425-4646 | CCAAUGUGGAUAUCCUUAGAGGUCUCUCUUGGGCUCUGUCCAGGUG... |
| 4 | AACY020770731.1/455-512 | UUUCGUUCACCCUCAAUUGAGGGCGCAGUUCGAGUCAUACCAUGGA... |


In [3]:
# Expecting str types OK
print(df.dtypes)

id          object
sequence    object
dtype: object


Checking for possible incongruencies

In [4]:
df["sequence"].head()

0    CCAACGUGGAUACUCCCGGGAGGUCACUCUCCCCGGGCUCUGUCCA...
1    UACCUUUGCAUCCGAAUUGGUUCCGUACGCUCGUUCGGGCAUACGG...
2    GCGUAACGCGCUAUGGCUUAAACGGCUGCCCCAAAGCUGCCAAAGG...
3    CCAAUGUGGAUAUCCUUAGAGGUCUCUCUUGGGCUCUGUCCAGGUG...
4    UUUCGUUCACCCUCAAUUGAGGGCGCAGUUCGAGUCAUACCAUGGA...
Name: sequence, dtype: object

In [5]:
# Biologically possible nucleotides for RNA
possible_nuc = ["C", "U", "A", "G"]
print(possible_nuc)

['C', 'U', 'A', 'G']


In [6]:
# Extracting the unique possible values of nucleotides/chars in the dataset
df["sequence"].str.split("").apply(lambda x: list(set([nuc for nuc in x if nuc != ""]))).str.join("").value_counts()

ACUG      571
UCAG      322
CAGUN       5
CKSAGU      1
MCAGU       1
Name: sequence, dtype: int64

**A few sequences contain inconsistent characters, i.e. letters other than the four RNA nucleotides (here K, M and S).**

**In DNA/RNA sequencing, "N" represents an unknown/undetermined base, so some sequences are expected to contain N's.**
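
The counts above list `ACUG` and `UCAG` separately even though they describe the same alphabet, because the characters of a `set` come out in arbitrary order. A minimal sketch of one way to avoid this, assuming a hypothetical helper `alphabet_signature` and toy sequences that are not taken from `TP4_data.csv`, is to sort the distinct characters before joining them:

```python
import pandas as pd


def alphabet_signature(seq: str) -> str:
    """Return the distinct characters of seq as one alphabetically sorted string."""
    return "".join(sorted(set(seq)))


# Toy sequences (hypothetical, for illustration only)
toy = pd.Series(["ACGU", "UGCA", "ACGUN", "AACCK"], name="sequence")

# Sequences sharing the same alphabet now fall into the same category
signature_counts = toy.map(alphabet_signature).value_counts()
print(signature_counts)
```

With sorted signatures, each distinct alphabet appears exactly once in `value_counts()`, which makes the unexpected characters easier to spot.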

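Checking for lowercase

In [7]:
# str.contains flags any sequence with at least one lowercase letter
print(f"Number of sequences containing lowercase characters: {df['sequence'].str.contains('[a-z]').sum()}")

Number of sequences containing lowercase characters: 0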
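
A per-sequence check and a per-character count answer different questions. The sketch below, on hypothetical toy strings, contrasts three pandas string methods: `str.islower()` (whole string is lowercase), `str.contains` (at least one lowercase letter) and `str.count` (number of lowercase letters).

```python
import pandas as pd

# Toy sequences (hypothetical): uppercase, lowercase, and mixed case
toy = pd.Series(["ACGU", "acgu", "ACgU"])

# True only when every cased character in the string is lowercase
all_lower = toy.str.islower()

# True when the string contains at least one lowercase letter
any_lower = toy.str.contains(r"[a-z]")

# Total number of lowercase letters across all strings
n_lower_chars = int(toy.str.count(r"[a-z]").sum())

print(all_lower.tolist(), any_lower.tolist(), n_lower_chars)
```

Note that the mixed-case string is only caught by the second and third checks.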


In [8]:
rna_nuc_with_unknowns = ["C", "U", "A", "G", "N"]

In [9]:
print(f'{len(df[df["sequence"].apply(lambda x: any(nuc not in possible_nuc for nuc in x))])} samples with nucleotides other than {possible_nuc}')

7 samples with nucleotides other than ['C', 'U', 'A', 'G']
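
The same kind of count can be obtained without a Python-level loop over characters. This is a sketch on hypothetical toy data, using a negated character class with `Series.str.contains`; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy sequences (hypothetical, not taken from TP4_data.csv)
toy = pd.DataFrame({"sequence": ["ACGU", "ACGNU", "AKCS", "GGCC"]})

# True where the sequence contains at least one character outside A, C, G, U
has_other = toy["sequence"].str.contains(r"[^ACGU]", regex=True)

n_flagged = int(has_other.sum())
print(f"{n_flagged} samples with characters other than A, C, G, U")
```

The boolean mask `has_other` can also be used directly to filter the rows, e.g. `toy[has_other]`.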


Checking the 2 samples that contain characters other than U, C, A, G and N

In [10]:
df[df["sequence"].apply(lambda x: any(nuc not in rna_nuc_with_unknowns for nuc in x))]

|   | id | sequence |
|---|---|---|
| 53 | AM462844.1/17936-17742 | UUGUGGAAGAAGGAGCUCUCUUUAGUCCAGUCCGAGACAGCUUCAA... |
| 145 | AM457512.2/2227-2039 | AGGGGCUUGUGGGAGCUUCUUUACACUCCAGAACUGAAAGGAGAUA... |


In [11]:
df[df["sequence"].apply(lambda x: any(nuc not in rna_nuc_with_unknowns for nuc in x))]["sequence"].str.split("").apply(lambda x: list(set([nuc for nuc in x if nuc != ""]))).str.join("")

53     CKSAGU
145     MCAGU
Name: sequence, dtype: object
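
For context, K, S and M are defined in the IUPAC nucleotide code as ambiguity symbols, each standing for a pair of possible bases. The sketch below lists only the symbols seen in this dataset, written for RNA (U instead of T); the dictionary name and helper function are hypothetical. Replacing these symbols with "N" is a simplification that discards the partial information they carry.

```python
# Subset of the IUPAC nucleotide ambiguity codes, written for RNA (U instead of T)
IUPAC_AMBIGUITY = {
    "K": {"G", "U"},
    "M": {"A", "C"},
    "S": {"C", "G"},
    "N": {"A", "C", "G", "U"},
}


def explain_ambiguous(seq: str) -> dict:
    """Map each ambiguity code found in seq to the sorted bases it may represent."""
    return {
        char: sorted(IUPAC_AMBIGUITY[char])
        for char in sorted(set(seq))
        if char in IUPAC_AMBIGUITY
    }


# Toy sequence (hypothetical) containing two ambiguity codes
print(explain_ambiguous("ACGUKS"))
```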


Let's write a function to change any non-nucleotide chars to "N"

In [12]:
def replace_mislabelled_nuc(df: pd.DataFrame, col_name_seq: str, possible_nucleotides: list, replacement_nuc: str) -> pd.DataFrame:
    """Replace every character not in possible_nucleotides with replacement_nuc (modifies df in place)."""
    # Find all samples containing mislabelled nucleotides
    mislabelled_samples = df[df[col_name_seq].apply(lambda x: any(nuc not in possible_nucleotides for nuc in x))]
    # Collect the distinct mislabelled characters
    unique_nucs = list(set("".join(mislabelled_samples[col_name_seq].to_list())))
    mislabelled_nucs = [nuc for nuc in unique_nucs if nuc not in possible_nucleotides]
    print(f"All mislabelled nucleotides present in data: {mislabelled_nucs}\n")

    # Replace each mislabelled character in the requested column (literal match, not a regex)
    print(f"Replacing with {replacement_nuc}'s...")
    for nuc in mislabelled_nucs:
        df[col_name_seq] = df[col_name_seq].str.replace(nuc, replacement_nuc, regex=False)

    return df
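
As an alternative to looping over each unexpected character, the replacement can be expressed as a single regular expression. This is a minimal sketch under stated assumptions: the function name `mask_unexpected_chars` and the toy data are hypothetical, and it returns a new Series rather than modifying the DataFrame.

```python
import re

import pandas as pd


def mask_unexpected_chars(sequences: pd.Series, allowed: str = "ACGUN", replacement: str = "N") -> pd.Series:
    """Return a copy of sequences where every character outside allowed is replaced."""
    # Negated character class: matches any single character not listed in allowed
    pattern = f"[^{re.escape(allowed)}]"
    return sequences.str.replace(pattern, replacement, regex=True)


# Toy sequences (hypothetical, not taken from TP4_data.csv)
toy = pd.Series(["ACGU", "AKCS", "MNNG"])
cleaned = mask_unexpected_chars(toy)
print(cleaned.tolist())
```

Returning a new Series leaves the original data untouched, which makes it easy to compare the sequences before and after cleaning.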

In [14]:
# Replacing mislabelled nucs in the df
replace_mislabelled_nuc(
    df=df,
    col_name_seq="sequence",
    possible_nucleotides=rna_nuc_with_unknowns,
    replacement_nuc="N"
)


All mislabelled nucleotides present in data: ['M', 'K', 'S']

Replacing with N's...


|   | id | sequence |
|---|---|---|
| 0 | AAIY01303410.1/717-923 | CCAACGUGGAUACUCCCGGGAGGUCACUCUCCCCGGGCUCUGUCCA... |
| 1 | CP000140.1/4143906-4143709 | UACCUUUGCAUCCGAAUUGGUUCCGUACGCUCGUUCGGGCAUACGG... |
| 2 | URS0000D6BCE7_12908/1-215 | GCGUAACGCGCUAUGGCUUAAACGGCUGCCCCAAAGCUGCCAAAGG... |
| 3 | X71081.1/4425-4646 | CCAAUGUGGAUAUCCUUAGAGGUCUCUCUUGGGCUCUGUCCAGGUG... |
| 4 | AACY020770731.1/455-512 | UUUCGUUCACCCUCAAUUGAGGGCGCAGUUCGAGUCAUACCAUGGA... |
| ... | ... | ... |
| 895 | URS0000D6890F_1069618/1-62 | CAUCUAUAGUUUCAGACAUGGAAUCGCCGAAAACGUCGGCGGUAAA... |
| 896 | ACLT01000067.1/45633-45455 | AAUAACUGAUUGACUGAAAGUAGGAAUUAAAGCCGUCAAGUUGAGC... |
| 897 | URS0000D6BC2B_12908/1-161 | UCCGUCAGCUAAUGGCAAUUAGACUGCUGAACUUAAACUGCAUAAG... |
| 898 | URS0000D6B588_12908/1-186 | GCGAGAAUGUCUACACACCACGGUGGUAGGCAGAGUGUAUUUGUAA... |


In [17]:
# Sanity check: after replacement, only C, U, A, G and N should remain
print(f'{len(df[df["sequence"].apply(lambda x: any(nuc not in rna_nuc_with_unknowns for nuc in x))])} samples with nucleotides other than {rna_nuc_with_unknowns}')

0 samples with nucleotides other than ['C', 'U', 'A', 'G', 'N']


### Generating the sequence length feature

In [20]:
df["sequence_length"] = df["sequence"].str.len()

In [21]:
df.head()

|   | id | sequence | sequence_length |
|---|---|---|---|
| 0 | AAIY01303410.1/717-923 | CCAACGUGGAUACUCCCGGGAGGUCACUCUCCCCGGGCUCUGUCCA... | 207 |
| 1 | CP000140.1/4143906-4143709 | UACCUUUGCAUCCGAAUUGGUUCCGUACGCUCGUUCGGGCAUACGG... | 198 |
| 2 | URS0000D6BCE7_12908/1-215 | GCGUAACGCGCUAUGGCUUAAACGGCUGCCCCAAAGCUGCCAAAGG... | 215 |
| 3 | X71081.1/4425-4646 | CCAAUGUGGAUAUCCUUAGAGGUCUCUCUUGGGCUCUGUCCAGGUG... | 222 |
| 4 | AACY020770731.1/455-512 | UUUCGUUCACCCUCAAUUGAGGGCGCAGUUCGAGUCAUACCAUGGA... | 58 |
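
The ids shown above appear to end in `/<start>-<end>` coordinates, which suggests a possible cross-check of the new feature: the coordinate span should equal the sequence length. The sketch below assumes that id format holds for every row (an assumption, not something stated in the assignment) and uses a hypothetical helper `span_from_id` on three ids and lengths copied from the `df.head()` output.

```python
import pandas as pd


def span_from_id(seq_id: str) -> int:
    """Length implied by an id of the assumed form '<accession>/<start>-<end>'."""
    coords = seq_id.rsplit("/", 1)[1]
    start, end = (int(value) for value in coords.split("-"))
    # abs() covers ids where start > end (coordinates listed in descending order)
    return abs(end - start) + 1


# Ids and lengths copied from the df.head() output above
check = pd.DataFrame({
    "id": ["AAIY01303410.1/717-923", "CP000140.1/4143906-4143709", "AACY020770731.1/455-512"],
    "sequence_length": [207, 198, 58],
})

check["id_span"] = check["id"].map(span_from_id)
check["matches"] = check["id_span"] == check["sequence_length"]
print(check)
```

On the full DataFrame, rows where `matches` is `False` would be worth inspecting before clustering on sequence length.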


## 1. Data partitioning based on sequence length