# Confirmation of known CNVs associated with psychosis

The largest genome-wide study on the role of CNVs in psychosis, conducted by Marshall et al. in 2016 with 41,321 subjects, identified significant associations at eight loci: ```1q21.1```, ```2p16.3 (NRXN1)```, ```3q29```, ```7q11.2```, ```15q13.3```, ```distal 16p11.2```, ```proximal 16p11.2```, and ```22q11.2```.

Identifying the genetic causes of psychosis matters clinically: it can inform personalized treatment and family planning, and it can help reduce both self-blame and the stigma associated with psychiatric disorders. Consequently, after data preprocessing, the next step was to detect known psychosis-associated CNVs in the study cohort.


In [1]:
# Import libraries
import pandas as pd

In [2]:
# Load the cohort CNV annotations and the Marshall et al. reference loci
cnv_df = pd.read_csv("../results/merged_all_cnv_annotations.csv")
marshal_df = pd.read_csv("../databases/marshal_cnv.csv")

# Keep only CNVs that overlap at least one protein-coding gene
cnv_df = cnv_df[cnv_df["All protein coding genes"].notna()]

Detection cross-referenced the loci reported by Marshall et al. (coordinates, cytobands, and CNV types) against the CNVs called in the cohort. A cohort CNV was counted as a match when it had the same type and cytoband as a reference locus and shared at least one base pair with it.
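
The overlap criterion can be sketched in plain Python. This is a minimal illustration, assuming coordinates are compared as simple differences (end minus start); the function name `overlap_bp` and the coordinates are hypothetical and chosen only to show the three possible cases.

```python
def overlap_bp(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Return the number of base pairs shared by two intervals (0 if disjoint)."""
    # The shared segment runs from the larger start to the smaller end;
    # a negative length means the intervals do not touch, so floor at zero.
    return max(0, min(end_a, end_b) - max(start_a, start_b))


# Hypothetical reference locus and cohort CNVs (not cohort data)
reference = (1_000, 5_000)
print(overlap_bp(2_000, 3_000, *reference))  # CNV fully inside the reference
print(overlap_bp(4_500, 6_000, *reference))  # partial overlap
print(overlap_bp(6_000, 7_000, *reference))  # disjoint, so it would be filtered out
```

Only the first two hypothetical CNVs would pass an overlap filter of more than zero base pairs.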

Applying these criteria identified 9 individuals carrying CNVs at 3 distinct loci: ```22q11.21``` (2 SCZ and 2 BD), ```2p16.3``` (2 SCZ and 1 UHR-NC), and ```1q21.1``` (2 SCZ).

In [3]:
# Inner-join cohort CNVs to reference loci with the same type and cytoband
matching_df = pd.merge(cnv_df, marshal_df, on=["Type", "Cytoband"], copy=False)

# Overlap length in bp: smaller end minus larger start, floored at zero
# (_x columns come from the cohort, _y columns from the reference table)
overlap_start = matching_df[["Start_x", "Start_y"]].max(axis=1)
overlap_end = matching_df[["End_x", "End_y"]].min(axis=1)
matching_df["Overlap with Marshal (bp)"] = (overlap_end - overlap_start).clip(lower=0)

# Keep only CNVs that share at least one base pair with a reference locus
matching_df = matching_df[matching_df["Overlap with Marshal (bp)"] > 0]

# Select columns and rename them
matching_df = matching_df[
    [
        "ID",
        "Phenotype_1",
        "Cytoband",
        "Chromosome",
        "Start_x",
        "End_x",
        "Type",
        "Overlap with Marshal (bp)",
        "All protein coding genes",
        "Brain_genes",
        "is_rare",
        "Gene",
        "Status",
        "Classification",
    ]
].rename(
    columns={
        "Phenotype_1": "Phenotype",
        "Start_x": "Start",
        "End_x": "End",
        "All protein coding genes": "Genes",
        "Gene": "Genes (Marshal)",
        "Status": "Status (Marshal)",
    }
)
matching_df.reset_index(drop=True, inplace=True)
matching_df

| | ID | Phenotype | Cytoband | Chromosome | Start | End | Type | Overlap with Marshal (bp) | Genes | Brain_genes | is_rare | Genes (Marshal) | Status (Marshal) | Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S36774 | SCZ | 22q11.21 | chr22 | 18731658 | 21110696 | DEL | 1018342 | LOC102724770, DGCR6, LOC102724788, PRODH, DGCR... | PRODH, ESS2, SEPTIN5, GP1BB, ARVCF, RTN4R, PI4... | True | | risk | Pathogenic |
| 1 | S36820 | UHR-NC | 2p16.3 | chr2 | 50672899 | 50921973 | DEL | 249074 | NRXN1 | NRXN1 | True | NRXN1 | risk | Uncertain significance |
| 2 | S36895 | SCZ | 22q11.21 | chr22 | 18731658 | 19023031 | DEL | 291373 | LOC102724770, DGCR6, LOC102724788, PRODH | PRODH | False | | risk | Uncertain significance |
| 3 | S36933 | SCZ | 2p16.3 | chr2 | 50270373 | 50287511 | DEL | 17138 | NRXN1 | NRXN1 | True | NRXN1 | risk | Uncertain significance |
| 4 | S36952 | SCZ | 2p16.3 | chr2 | 50814739 | 50830450 | DEL | 15711 | NRXN1 | NRXN1 | True | NRXN1 | risk | Uncertain significance |
| 5 | S36998 | SCZ | 1q21.1 | chr1 | 145544126 | 145808812 | DEL | 264686 | GPR89A, PDZK1, CD160, RNF115 | CD160 | True | | risk | Uncertain significance |
| 6 | S37004 | SCZ | 1q21.1 | chr1 | 145426050 | 146052017 | DUP | 625967 | GPR89A, PDZK1, CD160, RNF115, POLR3C, NUDT17, ... | CD160, ANKRD34A | True | | risk | Uncertain significance |
| 7 | S37045 | BD | 22q11.21 | chr22 | 19335594 | 19348367 | DEL | 12773 | HIRA | | False | | risk | Uncertain significance |
| 8 | S37122 | BD | 22q11.21 | chr22 | 18131010 | 18142544 | DUP | 11534 | TUBA8 | | False | | protective | Uncertain significance |
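
As a quick cross-check of the per-locus counts, the carriers can be tallied with the standard library. This is only a sketch: the ID, phenotype, and cytoband values are copied from the table above, while the variable names (`carriers`, `per_locus`) are illustrative and not part of the original notebook.

```python
from collections import Counter

# (ID, phenotype, cytoband) triples copied from the table above
carriers = [
    ("S36774", "SCZ", "22q11.21"),
    ("S36820", "UHR-NC", "2p16.3"),
    ("S36895", "SCZ", "22q11.21"),
    ("S36933", "SCZ", "2p16.3"),
    ("S36952", "SCZ", "2p16.3"),
    ("S36998", "SCZ", "1q21.1"),
    ("S37004", "SCZ", "1q21.1"),
    ("S37045", "BD", "22q11.21"),
    ("S37122", "BD", "22q11.21"),
]

# Count carriers for each (cytoband, phenotype) combination
per_locus = Counter((cytoband, phenotype) for _, phenotype, cytoband in carriers)
n_individuals = len({sample_id for sample_id, _, _ in carriers})
n_loci = len({cytoband for _, _, cytoband in carriers})

for (cytoband, phenotype), count in sorted(per_locus.items()):
    print(f"{cytoband}\t{phenotype}\t{count}")
print(f"{n_individuals} individuals across {n_loci} loci")
```

In the notebook itself, an equivalent tally could presumably be obtained with a `groupby` on `Cytoband` and `Phenotype` of `matching_df`.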


Additionally, we reviewed CNVs classified as "Pathogenic" or "Likely pathogenic". These were interpreted with caution, as they may be related to genetic disorders but not necessarily to psychosis. Besides the ```22q11.21``` deletion already listed above, they include: ```Xp22.31``` (1 UHR-NC), ```9p21.3``` (1 UHR-C), ```16p11.2``` (2 SCZ), ```15q11.2``` (1 SCZ), and ```8q24.3``` and ```8q22.1-q22.2-q22.3``` (both in the same UHR-NC individual).

In [4]:
# Keep only CNVs classified as pathogenic or likely pathogenic, then rename columns
pathogenic_df = cnv_df.loc[
    cnv_df["Classification"].isin(["Pathogenic", "Likely pathogenic"]),
    [
        "ID",
        "Phenotype_1",
        "Cytoband",
        "Chromosome",
        "Start",
        "End",
        "Type",
        "All protein coding genes",
        "Brain_genes",
        "is_rare",
        "Classification",
    ],
].rename(columns={"Phenotype_1": "Phenotype", "All protein coding genes": "Genes"})
pathogenic_df.reset_index(drop=True, inplace=True)
pathogenic_df

| | ID | Phenotype | Cytoband | Chromosome | Start | End | Type | Genes | Brain_genes | is_rare | Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S36747 | UHR-NC | Xp22.31 | chrX | 6535213 | 8168775 | DEL | PUDP, STS, VCX, PNPLA4 | | True | Pathogenic |
| 1 | S36774 | SCZ | 22q11.21 | chr22 | 18731658 | 21110696 | DEL | LOC102724770, DGCR6, LOC102724788, PRODH, DGCR... | PRODH, ESS2, SEPTIN5, GP1BB, ARVCF, RTN4R, PI4... | True | Pathogenic |
| 2 | S36795 | UHR-C | 9p21.3 | chr9 | 21959044 | 21974294 | DEL | CDKN2A | | False | Likely pathogenic |
| 3 | S36908 | SCZ | 16p11.2 | chr16 | 29599805 | 30187754 | DEL | SPN, QPRT, C16orf54, ZG16, KIF22, MAZ, PRRT2, ... | SPN, PRRT2, PAGR1, SEZ6L2, ASPHD1, DOC2A, TLCD3B | True | Pathogenic |
| 4 | S36961 | SCZ | 15q11.2 | chr15 | 22650252 | 23122942 | DEL | GOLGA6L7, NIPA1, NIPA2, CYFIP1, TUBGCP5 | NIPA1 | False | Pathogenic |
| 5 | S36981 | UHR-NC | 8q24.3 | chr8 | 142414773 | 145076114 | DUP | ADGRB1, ARC, JRK, PSCA, LY6K, THEM6, SLURP1, L... | ADGRB1, ARC, SLURP2, LYNX1, LY6H, GPIHBP1, NRB... | True | Likely pathogenic |
| 6 | S36981 | UHR-NC | 8q22.1-q22.2-q22.3 | chr8 | 93064965 | 104546553 | DUP | FBXO43, POLR2K, SPAG1, RNF19A, ANKRD46, SNX31,... | NCALD, BAALC, RIMS2, CCNE2, POP1, KCNS2 | True | Likely pathogenic |
| 7 | S37126 | SCZ | 16p11.2 | chr16 | 29599805 | 30187754 | DEL | SPN, QPRT, C16orf54, ZG16, KIF22, MAZ, PRRT2, ... | SPN, PRRT2, PAGR1, SEZ6L2, ASPHD1, DOC2A, TLCD3B | True | Pathogenic |
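
The text names two classes of interest, so one way to select exactly those labels is `Series.isin`. The sketch below compares that with excluding only "Uncertain significance". It runs on a hypothetical toy frame: the IDs are invented, and the "Benign" label is an assumption about what else an annotation pipeline might emit, not something observed in this cohort.

```python
import pandas as pd

# Hypothetical toy frame; IDs and the "Benign" label are illustrative only
toy = pd.DataFrame(
    {
        "ID": ["A", "B", "C", "D"],
        "Classification": [
            "Pathogenic",
            "Likely pathogenic",
            "Uncertain significance",
            "Benign",
        ],
    }
)

# Excluding one label keeps every other label, including "Benign"
exclusion = toy[toy["Classification"] != "Uncertain significance"]

# Listing the wanted labels keeps only those two classes
selected = toy[toy["Classification"].isin(["Pathogenic", "Likely pathogenic"])]

print(exclusion["ID"].tolist())  # ['A', 'B', 'D']
print(selected["ID"].tolist())   # ['A', 'B']
```

On the cohort table shown above the two approaches would give the same rows, since only "Pathogenic" and "Likely pathogenic" entries appear in the output.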


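Finally, the two tables were concatenated, duplicate ID and cytoband pairs were dropped (keeping the row from the Marshall et al. comparison, which is listed first), and the result was exported as a CSV file for further analysis and interpretation.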

In [5]:
# Concatenate the matching_df and pathogenic_df
df = pd.concat([matching_df, pathogenic_df], ignore_index=True)

# Drop duplicate rows based on 'ID' and 'Cytoband', then set as index
df.drop_duplicates(subset=["ID", "Cytoband"], inplace=True)
df.set_index(["ID", "Cytoband"], inplace=True)

# Save the final DataFrame to a CSV file
df.to_csv("../results/pathogenic_cnv_with_marshal.csv")
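
The effect of concatenating and then deduplicating can be illustrated on miniature tables. This is a sketch under stated assumptions: the frames `matched` and `pathogenic` and the IDs `S1` to `S3` are hypothetical stand-ins, and `keep="first"` is written out explicitly although it is the pandas default.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (not cohort data)
matched = pd.DataFrame(
    {
        "ID": ["S1", "S2"],
        "Cytoband": ["22q11.21", "2p16.3"],
        "Status (Marshal)": ["risk", "risk"],
    }
)
pathogenic = pd.DataFrame({"ID": ["S1", "S3"], "Cytoband": ["22q11.21", "16p11.2"]})

# Rows from `matched` come first, so they win when duplicates are dropped
combined = pd.concat([matched, pathogenic], ignore_index=True)
combined = combined.drop_duplicates(subset=["ID", "Cytoband"], keep="first")

print(len(combined))  # 3
print(combined.loc[combined["ID"] == "S1", "Status (Marshal)"].item())  # risk
```

Because the matched table is listed first, a CNV present in both tables keeps its reference annotations, while rows that appear only in the pathogenic table carry missing values in those columns.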