# Data Cleaning
## Simultaneous ECG and PCG recordings transformed into continuous wavelet transform (ECG) and log-mel spectrogram (PCG)

There are two datasets of heart sound recordings: one containing only Normal recordings (EPHNOGram: https://physionet.org/content/ephnogram/1.0.0/) and one containing Normal and Abnormal recordings (CINC/PhysioNet2016 Challenge: https://physionet.org/content/challenge-2016/1.0.0/#files).
  
For the PhysioNet data: 'The normal recordings were
from healthy subjects and the abnormal ones were from
patients typically with heart valve defects and coronary
artery disease (CAD). Heart valve defects include mitral
valve prolapse, mitral regurgitation, aortic regurgitation,
aortic stenosis and valvular surgery'

For the EPHNOGram data: 'The current database, recorded by version 2.1 of the developed hardware, 
has been acquired from 24 healthy adults aged between 23 and 29 (average: 25.4 ± 1.9 years) 
in 30min stress-test sessions during resting, walking, running and biking conditions, 
using indoor fitness center equipment. The dataset also contains several 30s sample records acquired during rest conditions.'

The PhysioNet data is sampled at 2000 Hz for both ECG and PCG, and the EPHNOGRAM data is sampled at 8000 Hz for both. 
The EPHNOGRAM data is resampled to 2000 Hz so that both datasets share the same sample rate.
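
The resampling step can be sketched as below. This is only an illustration: the notebook does not say which resampling method was used, so SciPy's polyphase resampler is an assumption, and `resample_to_target` is a hypothetical helper name.

```python
import math

import numpy as np
from scipy.signal import resample_poly

def resample_to_target(signal, orig_fs=8000, target_fs=2000):
    """Resample a 1-D signal from orig_fs to target_fs (hypothetical helper)."""
    # Reduce target_fs/orig_fs to lowest terms, e.g. 2000/8000 -> 1/4
    g = math.gcd(orig_fs, target_fs)
    up, down = target_fs // g, orig_fs // g
    # resample_poly low-pass filters the signal before decimating it
    return resample_poly(signal, up, down)

# One second of a synthetic 50 Hz tone sampled at 8000 Hz
t = np.arange(8000) / 8000
tone = np.sin(2 * np.pi * 50 * t)
resampled = resample_to_target(tone)
```

One second of input at 8000 Hz becomes 2000 samples, matching the PhysioNet sample rate.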

# Transformations

The LWTNet algorithm performs object detection in video and audio using integrated attention over time. 
Here the ECG signals act as the 'video': windows of the signal are transformed into spectrograms 
at 30 spectrogram windows/s, to mimic a video frame rate. The PCG recordings act as the audio to 
be synchronised and associated with labelled 'speakers' in the audio: the heart sounds S1, S2 and systole (S3, murmurs).
The PCG audio is transformed into a log-Mel spectrogram for training through the modified LWTNet, ECG-PCG-LWTNet.
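
As a rough sketch of the 30 windows/s framing, the signal can be sliced into overlapping windows before each window is transformed. The window length of 256 samples and the helper name `frame_signal` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np

def frame_signal(signal, fs=2000, frames_per_second=30, window_len=256):
    """Slice a 1-D signal into overlapping windows at a fixed frame rate."""
    hop = fs / frames_per_second  # about 66.67 samples at 2000 Hz, not an integer
    n_frames = int((len(signal) - window_len) // hop) + 1
    # Round each start index so frame timing does not drift over long records
    starts = np.round(np.arange(n_frames) * hop).astype(int)
    return np.stack([signal[s:s + window_len] for s in starts])

# 10 s of random noise standing in for an ECG record sampled at 2000 Hz
ecg = np.random.default_rng(0).standard_normal(2000 * 10)
frames = frame_signal(ecg)
```

Because 2000/30 is not an integer, rounding the start indices keeps the average rate at 30 windows/s instead of accumulating an offset.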

In [3]:
import wfdb
from wfdb import processing
import tqdm
import matplotlib.pyplot as plt
from visualise_data import peaks_hr
import os
import pandas as pd
from helpers import clip_len, sample_rate_ecg, sample_rate_pcg, inputpath_physionet, inputpath_ephnogram_target, inputpath_ephnogram_data, outputpath, useDrive, get_filtered_df, create_new_folder

ModuleNotFoundError: No module named 'wfdb'

In [2]:
"""# Cleaning PhysioNet 2016 Challenge Data"""
def clean_physionet_data(inputpath_training, outputpath_, sample_clip_len=clip_len, ecg_sample_rate=sample_rate_ecg, pcg_sample_rate=sample_rate_pcg, create_spectrograms_ecg=True, create_spectrogram_pcg=True):
  print("* Cleaning PhysioNet Data - Creating References [1/4] *")
  # Skip the work if a non-empty output folder already exists
  if os.path.exists(outputpath_+'physionet') and len(os.listdir(outputpath_+'physionet')) != 0:
    print("! Warning: folder 'physionet' already exists - assuming PhysioNet data is clean !")
    return
  else:
    create_new_folder(outputpath_+'physionet')
  if not os.path.isfile(inputpath_training+'REFERENCE.csv'):
    raise ValueError("Error: file 'REFERENCE.csv' does not exist - aborting")
  ref_csv = pd.read_csv(inputpath_training+'REFERENCE.csv', names=['filename', 'label'])
  rows = []

  for index, ref in tqdm.tqdm(ref_csv.iterrows(), total=len(ref_csv)):
    # 1: Abnormal, -1: Normal
    label = ref['label']
    filename = ref['filename']
    record = wfdb.rdrecord(inputpath_training+filename, channels=[0]) #sampfrom=0, sampto=10000
    qrs_inds = processing.qrs.gqrs_detect(sig=record.p_signal[:,0], fs=record.fs)
    print(f"rr: {len(record.p_signal)}")
    # Plot results
    peaks_hr(sig=record.p_signal, peak_inds=qrs_inds, fs=record.fs,
             title=f"GQRS peak detection on record {filename}")
    # DataFrame.append returned a copy (and was removed in pandas 2.0),
    # so rows are collected in a list and converted once after the loop
    rows.append({'label': label, 'qrs_inds': qrs_inds})
  data = pd.DataFrame(rows, columns=['label', 'qrs_inds'])
  return data

print("*** Cleaning Data [0/4] ***")
print("** Cleaning PhysioNet Data **")

# Plot results
#peaks_hr(sig=record.p_signal, peak_inds=qrs_inds, fs=record.fs,
#         title="GQRS peak detection on record 100")
    
# Correct the peaks shifting them to local maxima
#min_bpm = 20
#max_bpm = 230
#min_gap = record.fs * 60 / min_bpm
# Use the maximum possible bpm as the search radius
#search_radius = int(record.fs * 60 / max_bpm)
#corrected_peak_inds = processing.peaks.correct_peaks(record.p_signal[:,0], 
#                                                     peak_inds=qrs_inds,
#                                                     search_radius=search_radius, 
#                                                     smooth_window_size=150)

# Display results
#print('Corrected GQRS detected peak indices:', sorted(corrected_peak_inds))
#peaks_hr(sig=record.p_signal, peak_inds=sorted(corrected_peak_inds), fs=record.fs,
#         title="Corrected GQRS peak detection on sampledata/100")
clean_physionet_data(inputpath_physionet, outputpath)

NameError: name 'clip_len' is not defined

Converting all heights to inches

In [None]:
import numpy as np

# Heights written as "feet-inches" (e.g. "6-2") are converted to inches;
# rows without a "-" are left unchanged
check = players['height'].str.split('-', expand=True)
check.columns = ['feet', 'inches']
has_inches = check['inches'].notnull()
check.loc[has_inches, 'feet'] = check.loc[has_inches, 'feet'].astype(np.int16) * 12 + check.loc[has_inches, 'inches'].astype(np.int16)
players['height'] = check['feet']
players['height'] = players['height'].astype(np.float32)
players
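
The same conversion can be checked on toy data. This is a minimal sketch, assuming heights are either "feet-inches" strings or plain inch values; `height_to_inches` and `toy_players` are hypothetical names.

```python
import numpy as np
import pandas as pd

def height_to_inches(height):
    """Convert 'feet-inches' strings such as '6-2' to inches; pass plain numbers through."""
    text = str(height).strip()
    if "-" in text:
        feet, inches = text.split("-")
        return float(int(feet) * 12 + int(inches))
    return float(text)

toy_players = pd.DataFrame({"height": ["6-2", "5-11", "74"]})
toy_players["height"] = toy_players["height"].map(height_to_inches).astype(np.float32)
```

Here "6-2" and "74" both end up as 74 inches, and "5-11" becomes 71.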

Making all dates the same format

In [None]:
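for idx, row in players.iterrows():
  # Missing birth dates are NaN (a float), so check the type before searching for "/"
  if isinstance(row['birthDate'], str) and "/" in row['birthDate']:
    # Reorder a/b/c into c-a-b
    split = row["birthDate"].split("/")
    players.loc[idx,"birthDate"] = split[2].replace(" ","")+"-"+split[0]+"-"+split[1]
players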
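
A per-value version of the date fix, shown on toy data. It assumes the slash-separated dates are month/day/year, and it zero-pads month and day, which the loop above does not do; `normalise_birth_date` is a hypothetical helper.

```python
import pandas as pd

def normalise_birth_date(value):
    """Rewrite 'MM/DD/YYYY' as 'YYYY-MM-DD'; leave other values (including missing ones) untouched."""
    if isinstance(value, str) and "/" in value:
        month, day, year = (part.strip() for part in value.split("/"))
        return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    return value

toy = pd.DataFrame({"birthDate": ["1990-03-07", "3/7/1990", None]})
toy["birthDate"] = toy["birthDate"].map(normalise_birth_date)
```

Zero-padding keeps the converted dates sortable as strings alongside the ones already in `YYYY-MM-DD` form.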

In [None]:
players.to_csv(outputpath+"cleaned_players.csv", index=False)
cleaned_players = pd.read_csv(outputpath+"cleaned_players.csv")
cleaned_players

# Plays

In [None]:
plays = pd.read_csv("plays.csv")
plays.head()

The data covers four special teams play types. Each type should be written to its own csv.
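
One way to do the split in a single pass is a `groupby` on the play type. This is a sketch on toy data (`toy_plays` is made up); the real notebook drops a different set of columns for each play type, so the per-type cells are still needed.

```python
import pandas as pd

toy_plays = pd.DataFrame({
    "playId": [1, 2, 3, 4],
    "specialTeamsPlayType": ["Kickoff", "Punt", "Kickoff", "Field Goal"],
})

# One DataFrame per play type; the grouping column is constant within each group, so drop it
by_type = {
    play_type: group.drop(columns=["specialTeamsPlayType"])
    for play_type, group in toy_plays.groupby("specialTeamsPlayType")
}
# Each subset could then be saved, e.g. by_type["Kickoff"].to_csv("kickoff.csv", index=False)
```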

In [None]:
plays['specialTeamsPlayType'].unique()

In [None]:
plays[plays['specialTeamsPlayType'] == "Kickoff"]["specialTeamsResult"].unique()

- Touchback - Kickoff resulted in the ball becoming dead in the defending team's end zone, so the defending team gains possession at the 25 or 20 yard line. Either the ball has to land there and stop, or a player catches it and kneels to end the play.
- Return - Kickoff resulted in the ball being received by the defending team, who then run it up the field. (Is caught or becomes dead not in end zone?)
- Muffed - Receiving team doesn't gain possession of the ball properly, and can only start at where the ball was downed?
- Kickoff Team Recovery - Kickoff team gains possession of the ball after it crosses the receiving team's restraining line (35 yards) or a member of the receiving team possesses the ball first.
- Out of Bounds - Ball went out of bounds.
- Fair Catch - Receiver signals that they want a fair catch, meaning they can catch the ball without interference. Then the ball becomes dead at that spot and the receiving team cannot advance it.
- Downed - Ball brought to the ground??

In [None]:
plays[plays['specialTeamsPlayType'] == "Punt"]["specialTeamsResult"].unique()

- Non-Special Teams Result - Punt is passed instead.

In [None]:
plays[plays['specialTeamsPlayType'] == "Field Goal"]["specialTeamsResult"].unique()

- Kick Attempt Good - goal scored
- Kick Attempt No Good - goal missed
- Blocked Kick Attempt - kick blocked by an opponent
- Non-Special Teams Result - kick set up but passed instead?

In [None]:
plays[plays['specialTeamsPlayType'] == "Extra Point"]["specialTeamsResult"].unique()

- Non-Special Teams Result - Can choose to attempt another touchdown after first touchdown instead of conversion kick, so no one attempts the kick, kickerId is null. Mostly fails however.

## Kickoff

In [None]:
kickoff = plays[plays['specialTeamsPlayType'] == "Kickoff"]
kickoff.columns

The percentage of NA values in each column:

In [None]:
for column in kickoff.columns:
  print(column,(kickoff[column].isnull().sum()/len(kickoff[column])*100))
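
The same percentages can be computed without a loop, since the mean of a boolean mask is the fraction of `True` values. A small sketch on made-up data (`toy_kickoff` is hypothetical):

```python
import numpy as np
import pandas as pd

toy_kickoff = pd.DataFrame({
    "kickerId": [10.0, 11.0, np.nan, 12.0],
    "kickBlockerId": [np.nan] * 4,
    "playResult": [25, 30, 0, 18],
})
# isnull() gives a boolean frame; the column-wise mean is the fraction of missing values
na_percent = (toy_kickoff.isnull().mean() * 100).sort_values(ascending=False)
```

Sorting puts the emptiest columns first, which makes candidates for dropping easy to spot.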

- Penalty columns have high NA percentages because penalties are rare, but they are still valid data
- Kickoffs have no kick blocker so kickBlockerId is irrelevant here
- passResult: Scrimmage outcome of the play if specialTeamsPlayResult is "Non-Special Teams Result", so irrelevant here
- It looks like yardlineNumber should always be 35 because that's where a kickoff occurs, but some may be different because of penalties?

In [None]:
kickoff = kickoff.drop(columns=["kickBlockerId","passResult","specialTeamsPlayType"])

In [None]:
kickoff.to_csv(outputpath+"kickoff.csv",index=False)

specialTeamsPlayType is removed because the csv only has data about one special type, so would be a column with all the same values
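
Columns like this can also be found programmatically rather than by inspection. A minimal sketch on toy data, where `toy` and `constant_cols` are illustrative names:

```python
import pandas as pd

toy = pd.DataFrame({
    "specialTeamsPlayType": ["Kickoff"] * 3,
    "kickLength": [65, 70, 62],
})
# nunique(dropna=False) counts NaN as a value, so all-NaN columns are flagged too
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
```

Whether a flagged column should actually be dropped is still a judgment call, as with the mostly-null penalty columns above.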

## Punt

In [None]:
punt = plays[plays['specialTeamsPlayType'] == "Punt"]
punt

In [None]:
for column in punt.columns:
  print(column,(punt[column].isnull().sum()/len(punt[column])*100))

- Some kickerIds are null because the punt is not kicked (??), it is passed instead. Indicated by having the specialTeamsResult set to Non-Special Teams Result, and then the passResult shows the result of the pass.
- kickBlockerId is mostly null because it is rare to block a punt. When not null, specialTeamsResult has Blocked Punt
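
The claim that a non-null kickBlockerId goes with Blocked Punt can be checked with a cross-tabulation. The sketch below uses made-up rows (`toy_punt`), so it shows the check rather than the result on the real data.

```python
import numpy as np
import pandas as pd

toy_punt = pd.DataFrame({
    "specialTeamsResult": ["Return", "Blocked Punt", "Fair Catch", "Blocked Punt"],
    "kickBlockerId": [np.nan, 41.0, np.nan, 52.0],
})
has_blocker = toy_punt["kickBlockerId"].notnull()
# Rows: play result; columns: whether a blocker id is recorded
blocker_label = has_blocker.map({True: "blocker", False: "no blocker"})
check = pd.crosstab(toy_punt["specialTeamsResult"], blocker_label)
# Set of results that occur together with a recorded blocker
blocked_results = set(toy_punt.loc[has_blocker, "specialTeamsResult"])
```

If `blocked_results` contains anything other than Blocked Punt on the real data, the note above needs revisiting.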


In [None]:
punt = punt.drop(columns=["specialTeamsPlayType"])

In [None]:
punt.to_csv(outputpath+"punt.csv",index=False)

## Field Goal

In [None]:
fieldGoal = plays[plays['specialTeamsPlayType'] == "Field Goal"]
fieldGoal

In [None]:
for column in fieldGoal.columns:
  print(column,(fieldGoal[column].isnull().sum()/len(fieldGoal[column])*100))

- kickReturnYardage is all null because the receiving team cannot (??) advance the ball after a field goal ??
- playResult is mostly 0 because most attempts score goals, so the kicking team essentially gains no yards because play is reset. Will be negative if the goal is missed, so the receiving team gets the ball at their 8 yard mark (??). For blocked kicks, it's anyone's ball after, so the kicking team may or may not gain yards afterwards.
- returnerId is mostly null because it's rare to return after a field goal??

In [None]:
fieldGoal = fieldGoal.drop(columns=["specialTeamsPlayType","kickReturnYardage"])

In [None]:
fieldGoal.to_csv(outputpath+"fieldGoal.csv",index=False)

## Extra Point

In [None]:
extraPoint = plays[plays['specialTeamsPlayType'] == "Extra Point"]
extraPoint

In [None]:
for column in extraPoint.columns:
  print(column,(extraPoint[column].isnull().sum()/len(extraPoint[column])*100))

- returnerId all null because no one returns
- kickLength all null because kicks happen at same place
- kickReturnYardage all null because you can't advance after an extra point attempt

In [None]:
extraPoint = extraPoint.drop(columns=["specialTeamsPlayType","kickReturnYardage","returnerId","kickLength"])
extraPoint.to_csv(outputpath+"extraPoint.csv",index=False)