# UCI Echocardiogram Data Exploration

The UCI Machine Learning Repository has a dataset
of echocardiogram results.  The data can be found here:
http://archive.ics.uci.edu/ml/datasets/Echocardiogram

The dataset comprises measurements from 132 patients who suffered heart attacks.  Each patient received an echocardiogram
after the heart attack, and the results were recorded.

The target characteristic is the `alive-at-1` variable, which indicates whether the patient was still alive one year after the heart attack.

Here is a listing of all of the columns in the dataset:

```
1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above. 
2. still-alive -- a binary variable. 0=dead at end of survival period, 1 means still alive 
3. age-at-heart-attack -- age in years when heart attack occurred 
4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid 
5. fractional-shortening -- a measure of contractility around the heart. Lower numbers are increasingly abnormal.
6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal. 
7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts. 
8. wall-motion-score -- a measure of how the segments of the left ventricle are moving 
9. wall-motion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score. 
10. mult -- a derived variable which can be ignored
11. name -- the name of the patient (I have replaced them with "name") 
12. group -- meaningless, ignore it 
13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.


```


In [17]:
# Start by importing necessary modules
import pandas
import seaborn

## Load data

Although not explicitly described as CSV, the data is comma delimited.  Try to use pandas to read it directly.

The dataset indicates that missing values are denoted by `?`, so add
that to the list pandas will use.
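To see what `na_values` does before pointing it at the real file, here is a minimal sketch on a made-up three-row sample (the rows are invented for illustration, not taken from the dataset). Every `?` should come back as `NaN`, and any column that contains one becomes a float column.

```python
import io
import pandas

# Hypothetical sample in the same comma-delimited style as the dataset
sample = io.StringIO("11,0,71\n19,?,72\n?,0,55\n")

# Treat "?" as missing; there is no header row in the sample
demo = pandas.read_csv(sample, na_values=["?"], header=None)

print(demo)
print(demo.dtypes)
```

Columns 0 and 1 each contain a `?`, so they are parsed as floats with a `NaN`; column 2 stays integer.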

In [2]:
dataFile = "data/echocardiogram_data.txt"
try:
    # "?" marks a missing value; the file has no header row
    ecg = pandas.read_csv(dataFile, na_values=["?"],
                          header=None)
except Exception as e:
    print(e)

Error tokenizing data. C error: Expected 13 fields in line 50, saw 14



So there's something wrong with line 50.

What is it?

In [3]:
# Grab that one line (line 50 is index 49)
with open(dataFile) as f:
    line50 = f.readlines()[49]
print(line50)
# Chop it up
fields = line50.split(",")
print(fields)
print(len(fields))


,?,?,77,?,?,?,?,?,2,?,name,2,?

['', '?', '?', '77', '?', '?', '?', '?', '?', '2', '?', 'name', '2', '?\n']
14


Alright, that's indeed bad.  Worse, the `name` column is two entries from the end, as it should be.  This means the extra field sits somewhere before `name`, not at the end of the line.

At least the rest of the record is so sparse
that it is useless anyway.

pandas has an option to skip lines with too many fields.  By default it will warn us about each bad line it skips.

In [4]:
try:
    # Skip malformed lines instead of raising
    # (pandas >= 1.3 spells this on_bad_lines="skip")
    ecg = pandas.read_csv(dataFile, na_values=["?"],
                          error_bad_lines=False)
except Exception as e:
    print(e)

Skipping line 50: expected 13 fields, saw 14
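Newer pandas releases replace `error_bad_lines` with `on_bad_lines` (the older flag was deprecated in pandas 1.3 and removed in 2.0). The sketch below uses a made-up three-line sample rather than the real file; it tries the newer keyword first and falls back to the older one. Whether the fallback branch is ever needed depends on the installed pandas version.

```python
import io
import pandas

# Hypothetical sample: the second row has one field too many
raw = "1,2,3\n4,5,6,7\n8,9,10\n"

try:
    # pandas >= 1.3
    demo = pandas.read_csv(io.StringIO(raw), header=None,
                           on_bad_lines="skip")
except TypeError:
    # Older pandas does not know on_bad_lines
    demo = pandas.read_csv(io.StringIO(raw), header=None,
                           error_bad_lines=False)

print(len(demo))
```

Only the two well-formed rows should survive.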



Ok, so there's only one line with too many fields.

Are there lines that are truncated?

In [5]:
# Report every line whose field count is not 13 (indices are zero-based)
with open(dataFile) as f:
    for i, line in enumerate(f):
        n_fields = len(line.strip().split(","))
        if n_fields != 13:
            print("Bad line: %s, length: %s" % (i, n_fields))

Bad line: 49, length: 14


Ok, so no lines are truncated.  The only bad line is the one we already found (index 49 is line 50).
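The same check can be phrased as a tally of field counts, which shows at a glance how many distinct line shapes a file has. This is a small sketch on an invented sample; `field_count_histogram` is a hypothetical helper name, not something from pandas.

```python
import collections
import io


def field_count_histogram(handle, delimiter=","):
    """Count how many lines have each number of delimited fields."""
    counts = collections.Counter()
    for line in handle:
        counts[len(line.strip().split(delimiter))] += 1
    return counts


# Hypothetical sample: two 3-field lines and one 4-field line
sample = io.StringIO("a,b,c\nd,e,f\ng,h,i,j\n")
histogram = field_count_histogram(sample)
print(histogram)
```

On the real file, a clean result would be a single key (13) plus one stray entry for the 14-field line.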

Now `ecg` has our data.

Take a look at it.

In [6]:
ecg.head()

Unnamed: 0,11,0,71,0.1,0.260,9,4.600,14,1,1.1,name,1.2,0.2
0,19,0,72,0,0.38,6.0,4.1,14,1.7,0.588,name,1,0
1,16,0,55,0,0.26,4.0,3.42,14,1.0,1.0,name,1,0
2,57,0,60,0,0.253,12.062,4.603,16,1.45,0.788,name,1,0
3,19,1,57,0,0.16,22.0,5.75,18,2.25,0.571,name,1,0
4,26,0,68,0,0.26,5.0,4.31,12,1.0,0.857,name,1,0


Oh, the file has no header row, so pandas used the first record as column names.
Put the column names in a file, load them, then reload
the dataset with those names.

In [11]:
columns

[]

In [14]:
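# One column name per line in columns.txt
with open("data/columns.txt") as f:
    columns = [line.strip() for line in f]

try:
    # Supplying names= keeps the first record as data instead of a header
    ecg = pandas.read_csv(dataFile, na_values=["?"],
                          error_bad_lines=False,
                          names=columns)
except Exception as e:
    print(e)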

In [15]:
ecg.head()

Unnamed: 0,survival,still_alive,age_at_heart_attack,pericardial_effusion,fractional_shortening,epss,lvdd,wall_motion_score,wall_motion_index,mult,name,group,alive_at_1
0,11,0,71,0,0.26,9.0,4.6,14,1.0,1.0,name,1,0
1,19,0,72,0,0.38,6.0,4.1,14,1.7,0.588,name,1,0
2,16,0,55,0,0.26,4.0,3.42,14,1.0,1.0,name,1,0
3,57,0,60,0,0.253,12.062,4.603,16,1.45,0.788,name,1,0
4,19,1,57,0,0.16,22.0,5.75,18,2.25,0.571,name,1,0
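The header problem is easy to reproduce on a tiny made-up sample. This sketch assumes nothing about the real file beyond it being headerless: without `names=` (or `header=None`), the first record is consumed as column labels and one row of data goes missing.

```python
import io
import pandas

# Hypothetical headerless sample with three records
raw = "11,0,71\n19,0,72\n16,0,55\n"

# First record is silently used as the header
without_names = pandas.read_csv(io.StringIO(raw))

# Supplying names keeps all three records as data
with_names = pandas.read_csv(
    io.StringIO(raw),
    names=["survival", "still_alive", "age_at_heart_attack"],
)

print(list(without_names.columns))
print(len(without_names))
print(len(with_names))
```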


## Missing data

There are a number of missing values.

The dataset description gives this listing:
```
   Attribute #:    Number of Missing Values: (total: 132)
   ------------    -------------------------
              1    2
              2    1
              3    5
              4    1
              5    8
              6    15
              7    11
              8    4
              9    1
             10    4
             11    0
             12    22
             13    58
```
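A listing like this can be compared against what pandas actually loaded by counting nulls per column. The sketch below uses a small invented frame, not the real data; on the notebook's frame the equivalent call would be `ecg.isnull().sum()`.

```python
import numpy
import pandas

# Hypothetical frame with a few missing entries
demo = pandas.DataFrame({
    "survival": [11.0, numpy.nan, 16.0],
    "epss": [9.0, numpy.nan, numpy.nan],
    "name": ["name", "name", "name"],
})

# Number of missing values in each column
missing = demo.isnull().sum()
print(missing)
```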

There are patients who have not been tracked for a full year and are still
alive.  Their one-year outcome is unknown, so they should not be included in the model.
The dataset description explains why:

    Because all the patients
    had their heart attacks at different times, it is
    possible that some patients have survived less than
    one year but they are still alive.  Check the second
    variable to confirm this.  Such patients cannot be
    used for the prediction task mentioned above.
       

In [22]:
# Followed for less than 12 months and still alive at last contact
recent = ecg[(ecg.survival < 12) & (ecg.still_alive == 1)]
print(len(recent))

34


Well, that's a bummer, just about a quarter of the dataset.  But 
it's definitely bogus data that we should not try to use.
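One way to act on this is to invert the same mask and keep everything else. The sketch below runs on a small invented frame; the names `followed_too_briefly` and `usable` are illustrative only, and the notebook does not say this is the filtering step it eventually uses.

```python
import pandas

# Hypothetical sample: rows 2 and 4 are alive but followed under 12 months
demo = pandas.DataFrame({
    "survival": [11.0, 19.0, 0.5, 57.0, 3.0],
    "still_alive": [0, 0, 1, 0, 1],
})

# Patients whose one-year outcome cannot be known yet
followed_too_briefly = (demo.survival < 12) & (demo.still_alive == 1)

# Keep only patients whose outcome is known
usable = demo[~followed_too_briefly]
print(len(usable))
```

Rows with a missing `survival` or `still_alive` value evaluate to `False` in the mask, so they stay in `usable` and would need separate handling.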

## Quick snapshots

Start with a seaborn pairplot.  This is a really quick way to take a snapshot of the data.


In [18]:
seaborn.pairplot(ecg)



AttributeError: max must be larger than min in range parameter.
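A likely cause of this error (an assumption, not confirmed in the notebook) is the missing values: a histogram over a column containing `NaN` has no valid range. A minimal sketch of one workaround, on an invented frame, is to keep only numeric columns and drop incomplete rows before plotting.

```python
import numpy
import pandas

# Hypothetical frame with missing values and a non-numeric column
demo = pandas.DataFrame({
    "age_at_heart_attack": [71.0, 72.0, numpy.nan, 60.0],
    "epss": [9.0, numpy.nan, 4.0, 12.062],
    "name": ["name"] * 4,
})

# Keep numeric columns only, then drop rows with any missing value
numeric = demo.select_dtypes(include=[numpy.number])
plot_ready = numeric.dropna()
print(plot_ready.shape)

# In the notebook, the cleaned frame could then be plotted:
# seaborn.pairplot(plot_ready)
```

Dropping every incomplete row can discard a lot of data when one column is mostly missing, so it may be worth selecting a subset of columns first.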