# ToDo
- [x] get representatives in data
- [x] "idem" means the same as the previous entry
- [x] get prot seqs from UniProt (API call)
- [x] clean up previous FASTA file
- [x] combine results with previous FASTA
- [x] check for duplicate sequences
- [x] extract UniProtID from old CSV and put in new column
- [ ] check for duplicates between both dsets
- [ ] merge CSV files
- [ ] create feature file for ProtSpace3D
- [ ] show results in ProtSpace3D (separate those by others)
- [ ] Split by this group
- [ ] highlight representative
- [ ] only look at French data separately
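
The duplicate checks listed above (within one FASTA file and between both datasets) could look roughly like the sketch below. It uses only the standard library and is not the code in `dset_3FTx`; the function name `find_shared_sequences` and the toy IDs and sequences are made up for illustration.

```python
from collections import defaultdict


def find_shared_sequences(dset_a, dset_b):
    """Map each sequence present in both datasets to the IDs carrying it.

    Both arguments are plain dicts of {entry_id: sequence}.
    """
    ids_by_seq = defaultdict(lambda: {"a": [], "b": []})
    for entry_id, seq in dset_a.items():
        # Normalise whitespace and case so trivially different copies match.
        ids_by_seq[seq.strip().upper()]["a"].append(entry_id)
    for entry_id, seq in dset_b.items():
        ids_by_seq[seq.strip().upper()]["b"].append(entry_id)
    # Keep only sequences that occur in both datasets.
    return {seq: ids for seq, ids in ids_by_seq.items() if ids["a"] and ids["b"]}


# Toy data (hypothetical IDs and sequences, for illustration only).
old = {"old_1": "MKTLLLTLVV", "old_2": "LECHNQQSSQ"}
new = {"new_1": "lechnqqssq", "new_2": "RICYNHQSTT"}
shared = find_shared_sequences(old, new)
print(shared)
```

Grouping by the sequence string keeps every ID attached to a duplicate, so the merge step can still decide which entry's metadata to keep.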

- [ ] check if possible to get whole sequence of genomic sequences (without UniProt entry)
- [ ] retrieve names
- [ ] merge activity names
- [ ] add French data
- [ ] check if cluster
- [ ] which dataset clusters best? full, 

- NCBI selection by acc_id ([AccID list](https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly))
  - entries coming from WGS: which are they?
  - ignore predicted entries, or add but label? -> XM_ XR_ XP_
    - maybe a match exists in UniProt in the meantime?
  - Which entries always to keep?
  - separation by division? https://www.ncbi.nlm.nih.gov/genbank/samplerecord/#GenBankDivisionA
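
One way to handle the "add but label" option for predicted entries is to tag each accession by its RefSeq prefix instead of dropping it. The sketch below assumes the `XM_`, `XR_` and `XP_` prefixes named above are the only ones to flag; the helper names and the example accessions are hypothetical and are not part of `ncbi_helper`.

```python
# RefSeq prefixes for model (computationally predicted) records.
PREDICTED_REFSEQ_PREFIXES = ("XM_", "XR_", "XP_")


def is_predicted_refseq(acc_id):
    """Return True if the accession carries a predicted-model RefSeq prefix."""
    return acc_id.strip().upper().startswith(PREDICTED_REFSEQ_PREFIXES)


def label_accessions(acc_ids):
    """Label every accession instead of silently dropping predicted ones."""
    return {
        acc: "predicted" if is_predicted_refseq(acc) else "other"
        for acc in acc_ids
    }


# Made-up accessions, for illustration only.
labels = label_accessions(["XP_000000001.1", "NP_000000002.1", "xm_000000003.2"])
print(labels)
```

Labelling keeps the decision reversible: the predicted entries can still be filtered out later, or shown as a separate group in ProtSpace3D.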

- How to decide which sequence to keep?
- What could serve as a quality control?
  - Exclude if protein evidence is at homology or predicted level.
  - Include if at protein or transcript level.
  - Exclude if it comes from EST, DNA, cDNA?
  - Only take mRNA.
  - Sequences that are experimentally verified to be translated (UniProt: evidence at protein level / transcript level)
  - What would be a good decision?
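
The protein-evidence rule from the list above can be written down as a small filter. This is one possible encoding, not a settled decision: it uses the five UniProt protein existence levels and keeps only levels 1 and 2. The names `PE_LEVELS` and `passes_quality_control` are hypothetical, and treating entries without any annotation as failing is an assumption of this sketch.

```python
# UniProt protein existence (PE) levels, from strongest to weakest evidence.
PE_LEVELS = {
    "Evidence at protein level": 1,
    "Evidence at transcript level": 2,
    "Inferred from homology": 3,
    "Predicted": 4,
    "Uncertain": 5,
}


def passes_quality_control(protein_existence, max_level=2):
    """Keep entries whose evidence is at protein or transcript level.

    Unknown or missing annotations are rejected (an assumption of this sketch).
    """
    level = PE_LEVELS.get(protein_existence)
    return level is not None and level <= max_level


# Toy annotations, for illustration only.
examples = ["Evidence at protein level", "Inferred from homology", None]
kept = [pe for pe in examples if passes_quality_control(pe)]
print(kept)
```

Raising `max_level` to 3 would also admit homology-inferred entries, which makes it easy to compare how strict the cut-off should be.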

In [1]:
import jupyter_black

jupyter_black.load()

In [2]:
%load_ext autoreload
%autoreload 2

## Prepare data

In [45]:
import importlib
from pathlib import Path
import pandas as pd

import dset_3FTx
import uniprot_helper
import ncbi_helper

# pd.io.clipboards.to_clipboard(df.to_markdown(), excel=False)

# --- PATHS ---
base = Path("../data")
out_dir = base / "protspace"
raw = base / "raw"
helpers = base / "helpers"

csv_in = raw / "Ivan_3FTx.csv"
fasta_in = raw / "3and6_new-2.fasta"
genomic_fasta = raw / "Translation of 156 sequences.fasta"
zhang_fasta = raw / "BungarusMulticinctus.fasta"
zhang_sp6_fasta = raw / "zhang_sp6_mature_seq.fasta"
zhang_annotation = raw / "zhang_annotation.xlsx"
ritu_csv = raw / "drysdalia.csv"
french_excel = raw / "french_data.xls"
uniprot_uids_files = [raw / "dashevsky_uniprot.txt", raw / "snake_3FTx_sp.txt"]

ncbi_dir = base / "ncbi_entries"
uniprot_dir = base / "uniprot_entries"
blast_dir = base / "blast_out"
taxon_mapper_file = helpers / "taxon_mapper.csv"
name_activity_file = helpers / "name_activity.json"

fasta_out = out_dir / "3FTx.fasta"
csv_out = out_dir / "3FTx.csv"

# --- MAIN ---
uniprot_collector = uniprot_helper.UniProtDataGatherer(uniprot_dir=uniprot_dir)
ncbi_collector = ncbi_helper.NcbiDataGatherer(ncbi_dir=ncbi_dir)

df_original = dset_3FTx.OriginalDset(
    csv_path=csv_in, fasta_path=fasta_in, genomic_fasta_path=genomic_fasta
).df
df_zhang = dset_3FTx.ZhangDset(
    fasta_path=zhang_fasta,
    mature_fasta_path=zhang_sp6_fasta,
    annotation_excel_path=zhang_annotation,
).df
df_ritu = dset_3FTx.RituDset(csv_path=ritu_csv).df
df_french = dset_3FTx.FrenchDset(excel_path=french_excel).df
df_uniprot = dset_3FTx.parse_uniprot_ids_file(uniprot_uids_files=uniprot_uids_files)
df = pd.concat(
    [df_original, df_french, df_uniprot, df_ritu, df_zhang], ignore_index=True
)
df = dset_3FTx.map_ids2uniprot(df=df)
# 28 of original mature_seq have missing ends or no UniProt entry
df = dset_3FTx.get_uniprot_metadata(df=df)
df = dset_3FTx.get_ncbi_metadata(df=df)
df = df.dropna(subset="species")
df = dset_3FTx.add_taxon_id(df=df, taxon_mapper_file=taxon_mapper_file)
# run BLASTp to find UniProt entries (ignore entries that already have a uniprot_id)
df = dset_3FTx.run_blast(
    df=df, blast_dir=blast_dir, uniprot_collector=uniprot_collector
)
df = dset_3FTx.get_uniprot_metadata(df=df)
df = dset_3FTx.remove_low_quality_entries(df=df)
df = dset_3FTx.manual_curation(df=df)
df = dset_3FTx.infer_activit_from_name(df=df, name_activity_file=name_activity_file)
dset_3FTx.save_data(df=df, csv_file=csv_out, fasta_file=fasta_out)

Original - 954 entries: 623 UniProt IDs; 1 RefSeq IDs; 47 GenBank IDs identified.
- 127 full sequences information added by genomic supported alignment
Zhang - 970 entries: 274 UniProt IDs; 56 RefSeq IDs; 595 GenBank IDs identified.
Ritu - 119 entries: 8 UniProt IDs, 111 GI numbers identified
French - 39 entries: 39 UniProt IDs, identified
UniProt - 617 entries: 617 UniProt IDs identified




BLAST: 100%|██████████| 745/745 [00:00<00:00, 61929.10it/s]


231 low quality sequences removed
958 duplicate sequences were merged.


## Statistics

In [46]:
df["data_origin"].value_counts()

original             954
paper_zhang          471
paper_ritu            38
snake_3FTx_sp         31
dashevsky_uniprot     16
Name: data_origin, dtype: int64

In [21]:
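def copy_df(df):
    # Work on a copy so the caller's index is not modified in place.
    df = df.copy()
    # Label missing index values so they show up in the markdown table.
    df.index = df.index.fillna("NA")
    # Copy the markdown rendering to the clipboard, then echo the data.
    pd.io.clipboards.to_clipboard(df.to_markdown(), excel=False)
    print(df)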

In [231]:
df["data_origin"].value_counts(dropna=False)

original             802
paper_zhang          669
genomic              152
paper_RituChandna    119
Name: data_origin, dtype: int64

In [8]:
len(df[df["mature_seq"].isna() & df["full_seq"].notna()])

54

In [10]:
df[df["mature_seq"].isna() & df["full_seq"].notna()]["data_origin"].value_counts(
    dropna=False
)

paper_zhang    54
Name: data_origin, dtype: int64

In [5]:
print(df["mature_seq"].notna().sum(), df["full_seq"].notna().sum())

1253 791


In [239]:
# Entries found in UniProt
res = df["db"].value_counts(dropna=False)
copy_df(df=res)

SP    701
NA    621
TR    267
NA    153
Name: db, dtype: int64


In [224]:
# number of UniProt accession IDs with a 100% sequence match
res = df["acc_id"].str.split(",").str.len().value_counts(dropna=False)
copy_df(df=res)

1.0    1303
NA      439
Name: acc_id, dtype: int64
