# AMP-Parkinson's Disease Progression Prediction
Use protein and peptide data measurements from Parkinson's Disease patients to predict progression of the disease.
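The goal of this competition is to predict the course of Parkinson's disease (PD) using protein abundance data. The complete set of proteins involved in PD is still an open research question, and any protein with predictive value is likely worth investigating further. The core of the dataset is protein abundance values derived from mass spectrometry of cerebrospinal fluid (CSF) samples collected from several hundred patients. Each patient provided multiple samples over multiple years, alongside assessments of PD severity.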

I plan to base the code on this notebook:<br>
https://www.kaggle.com/code/shimman/baseline-model-using-lgbm-regression

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import tqdm
import random

In [19]:
path = './data/'

In [25]:
# train_data
train_peptide = pd.read_csv(f'{path}train_peptides.csv')
train_protein = pd.read_csv(f'{path}train_proteins.csv')
train_clinical = pd.read_csv(f'{path}train_clinical_data.csv')
data = pd.read_csv(f'{path}supplemental_clinical_data.csv')
train_peptide.shape, train_protein.shape, train_clinical.shape, data.shape

((981834, 6), (232741, 5), (2615, 8), (2223, 8))

In [26]:
# test data
test_protein = pd.read_csv(f'{path}example_test_files/test_proteins.csv')
test_peptide = pd.read_csv(f'{path}example_test_files/test_peptides.csv')
test = pd.read_csv(f'{path}example_test_files/test.csv')

# EDA

- visit_id - ID code for the visit.
- visit_month - The month of the visit, relative to the first visit by the patient.
- patient_id - An ID code for the patient.
- UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein.
- Peptide - The sequence of amino acids included in the peptide. See this table for the relevant codes. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set.
- PeptideAbundance - The measured abundance of the peptide (the amino acid sequence) in the sample.

In [27]:
train_peptide.head(7)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7
5,55_0,0,55,O00533,TLKIENVSYQDKGNYR,23216.5
6,55_0,0,55,O00533,VIAVNEVGR,170878.0


- visit_id - ID code for the visit.
- visit_month - The month of the visit, relative to the first visit by the patient.
- patient_id - An ID code for the patient.
- UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein. The test set may include proteins not found in the train set.
- NPX - Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide.
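
The note about NPX not always matching its component peptides can be checked directly. The following is a minimal sketch, not part of the original notebook: it sums `PeptideAbundance` per `(visit_id, UniProt)` and compares the result with `NPX`. The helper name `compare_peptide_sum_to_npx` and the `peptide_sum` and `ratio` columns are hypothetical, and the toy rows are copied from the first rows of the peptide and protein tables for visit `55_0`, so the peptide list for `O00533` is probably incomplete.

```python
import pandas as pd

# Toy rows copied from the first rows of the peptide and protein tables (visit 55_0).
# Only the first peptides are included, so O00533 is likely incomplete here.
peptides = pd.DataFrame({
    "visit_id": ["55_0"] * 7,
    "UniProt": ["O00391"] + ["O00533"] * 6,
    "Peptide": [
        "NEQEQPLGQWHLS", "GNPEPTFSWTK", "IEIPSSVQQVPTIIK",
        "KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK", "SMEQNGPGLEYR",
        "TLKIENVSYQDKGNYR", "VIAVNEVGR",
    ],
    "PeptideAbundance": [11254.3, 102060.0, 174185.0, 27278.9,
                         30838.7, 23216.5, 170878.0],
})
proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0"],
    "UniProt": ["O00391", "O00533"],
    "NPX": [11254.3, 732430.0],
})


def compare_peptide_sum_to_npx(peptides, proteins):
    """Hypothetical helper: per-protein peptide sum next to the reported NPX."""
    summed = (
        peptides.groupby(["visit_id", "UniProt"], as_index=False)["PeptideAbundance"]
        .sum()
        .rename(columns={"PeptideAbundance": "peptide_sum"})
    )
    merged = proteins.merge(summed, on=["visit_id", "UniProt"], how="left")
    # A ratio of 1.0 means NPX equals the plain sum of the listed peptides.
    merged["ratio"] = merged["peptide_sum"] / merged["NPX"]
    return merged


comparison = compare_peptide_sum_to_npx(peptides, proteins)
print(comparison)
```

On the full data the same helper could be called as `compare_peptide_sum_to_npx(train_peptide, train_protein)`; ratios far from 1.0 would point at the proteins where the 1:1 relationship does not hold.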

In [28]:
train_protein.head(7)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0
2,55_0,0,55,O00584,39585.8
3,55_0,0,55,O14498,41526.9
4,55_0,0,55,O14773,31238.0
5,55_0,0,55,O14791,4202.71
6,55_0,0,55,O15240,177775.0


- visit_id - ID code for the visit.
- visit_month - The month of the visit, relative to the first visit by the patient.
- patient_id - An ID code for the patient.
- updrs_[1-4] - The patient's score for part N of the Unified Parkinson's Disease Rating Scale. Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.
- upd23b_clinical_state_on_medication - Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). These medications wear off fairly quickly (on the order of one day) so it's common for patients to take the motor function exam twice in a single month, both with and without medication.
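
Since the medication state is expected to mainly affect `updrs_3`, one quick check is to group the Part 3 score by that column. This is a minimal sketch, assuming missing states are simply labelled `"Unknown"`. The helper name `summarize_updrs3_by_medication` is hypothetical, and the toy rows are the first seven visits of patient 55 from `train_clinical_data.csv`, so nothing should be concluded from the numbers themselves.

```python
import pandas as pd

# Toy rows: first seven visits of patient 55 from train_clinical_data.csv.
clinical = pd.DataFrame({
    "visit_id": ["55_0", "55_3", "55_6", "55_9", "55_12", "55_18", "55_24"],
    "updrs_3": [15.0, 25.0, 34.0, 30.0, 41.0, 38.0, 49.0],
    "upd23b_clinical_state_on_medication": [None, None, None, "On", "On", "On", "On"],
})


def summarize_updrs3_by_medication(clinical):
    """Hypothetical helper: count and mean of updrs_3 per medication state."""
    # Assumption: a missing state is kept as its own "Unknown" group
    # instead of being dropped by groupby.
    state = clinical["upd23b_clinical_state_on_medication"].fillna("Unknown")
    return clinical.groupby(state)["updrs_3"].agg(["count", "mean"])


summary = summarize_updrs3_by_medication(clinical)
print(summary)
```

On the full data this could be run as `summarize_updrs3_by_medication(train_clinical)` to see how many assessments have no recorded state at all.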

In [29]:
train_clinical.head(7)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,55_0,55,0,10.0,6.0,15.0,,
1,55_3,55,3,10.0,7.0,25.0,,
2,55_6,55,6,8.0,10.0,34.0,,
3,55_9,55,9,8.0,9.0,30.0,0.0,On
4,55_12,55,12,10.0,10.0,41.0,0.0,On
5,55_18,55,18,7.0,13.0,38.0,0.0,On
6,55_24,55,24,16.0,9.0,49.0,0.0,On
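
All three tables share `visit_id`, so one possible way to prepare model inputs is to pivot the protein table to one row per visit and join it onto the clinical table. This is a sketch under assumptions, not the method of the referenced baseline: the helper name `build_feature_table` is hypothetical, the left join is a choice that keeps visits without a protein sample, and the toy frames below reuse a few values from the outputs above (the `55_3` row has no protein rows here purely for illustration).

```python
import pandas as pd

# Toy frames reusing values from the head() outputs in this notebook.
proteins = pd.DataFrame({
    "visit_id": ["55_0"] * 3,
    "UniProt": ["O00391", "O00533", "O00584"],
    "NPX": [11254.3, 732430.0, 39585.8],
})
clinical = pd.DataFrame({
    "visit_id": ["55_0", "55_3"],
    "patient_id": [55, 55],
    "visit_month": [0, 3],
    "updrs_1": [10.0, 10.0],
    "updrs_3": [15.0, 25.0],
})


def build_feature_table(proteins, clinical):
    """Hypothetical helper: one row per clinical visit, one column per protein."""
    # pivot raises if a (visit_id, UniProt) pair appears more than once.
    wide = proteins.pivot(index="visit_id", columns="UniProt", values="NPX")
    wide.columns.name = None
    # Left join: visits without protein measurements keep NaN features.
    return clinical.merge(wide, left_on="visit_id", right_index=True, how="left")


features = build_feature_table(proteins, clinical)
print(features)
```

On the full data the call would be `build_feature_table(train_protein, train_clinical)`. Tree-based models such as LightGBM can work with the resulting NaN values, but whether to impute them is a separate modelling decision.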
