# ENSF 592 Programming Fundamentals for Data Engineers
Assignment 5: NumPy and Pandas (5 points)

Due: Wednesday, 29 July (11:59 pm)

Description:
We are going to study data collected from 294 patients with heart disease and extract some meaningful information. You can download the dataset (data.csv) from D2L; to learn more about it, see the Kaggle link below:
https://www.kaggle.com/imnikhilanand/heart-attack-prediction

|Feature|Description|
|-------|----------------------------|
|age|age in years|
|gender|(1 = male; 0 = female)|
|cp|chest pain type|
|trestbps|resting blood pressure (in mm Hg on admission to the hospital)|
|chol|serum cholesterol in mg/dl|
|fbs|(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)|
|restecg|resting electrocardiographic results|
|thalach|maximum heart rate achieved|
|exang|exercise induced angina (1 = yes; 0 = no)|
|oldpeak|ST depression induced by exercise relative to rest|
|slope|the slope of the peak exercise ST segment|
|ca|number of major vessels (0-3) colored by fluoroscopy|
|thal|3 = normal; 6 = fixed defect; 7 = reversible defect|
|num|diagnosis of heart disease (angiographic disease status)|

We would like to find answers to the following questions:
* Report1: what is the average age of the patients?
* Report2: report the average `chol` level of people in 10-year age intervals ([20,30], [30,40], [40,50], [50,60]).
* Report3: report the average `trestbps` of people with the highest `chol` levels (the highest 30%) and the lowest `chol` levels (the lowest 30%).
* Report4: report the percentage of men and women with a positive diagnosis of heart disease (`num=1`).


As a first step, design a **class** that is responsible for the general operations you'll need.
* Write a constructor that takes the file_name as input and stores the csv file in a data frame.
* Implement a `getColumn` method that takes a `column` name and returns that column.
* Implement a `select` method that takes a `column` name and a `value`, and returns the records whose entry in that column is equal to `value`.
* Implement a `rangeSelect` method that takes a `column` name, a `begin` and an `end`. It searches for records with `begin<column<end` and returns that sub-table.
* Implement a `percentageSelect` method that takes a `column` name, a `perc` (percentage) and an `index`, and returns a sub-table selected according to `column`:
    * if index == 0: returns the __first__ `perc*column.size` rows of the table.
    * if index == -1: returns the __last__ `perc*column.size` rows of the table.
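
The `select` and `rangeSelect` requirements above can be sketched on a small in-memory table. This is only an illustration under stated assumptions: the toy data and the function names `select_rows` and `range_select_strict` are hypothetical, and the range test uses the strict bounds `begin<column<end` given in the bullet.

```python
import pandas as pd

# Hypothetical toy table; the column names mirror the feature list above.
toy = pd.DataFrame({
    "age": [29, 30, 35, 40, 41],
    "fbs": [0, 1, 0, 1, 0],
})

def select_rows(df, column, value):
    """Return the rows whose entry in `column` equals `value`."""
    return df[df[column] == value]

def range_select_strict(df, column, begin, end):
    """Return the rows with begin < df[column] < end (both bounds excluded)."""
    mask = (df[column] > begin) & (df[column] < end)
    return df[mask]

fbs_true = select_rows(toy, "fbs", 1)
thirties = range_select_strict(toy, "age", 30, 40)
```

With strict bounds, a patient aged exactly 30 or 40 is excluded from the 30 to 40 group; inclusive bounds would instead place that patient in two neighbouring groups.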

In [2]:
import pandas as pd
import numpy as np

In [3]:
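class Table:
    def __init__(self,file_name):
        # Load the csv file into a data frame
        self.data_df = pd.read_csv(file_name)
        # Missing entries are marked with '?'; they are replaced by 0 here,
        # so they still take part in any average computed later
        self.data_df = self.data_df.replace({'?':0})
        # Cast every column to integers
        self.data_df = pd.DataFrame(self.data_df, dtype = "int64")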
        
    def getData(self):
        data = self.data_df
        return data
    
    def getColumn(self, col_name):
        col_data = self.data_df[col_name]
        return col_data
    
    def select(self, col_name, value):
        # Keep the rows whose entry in col_name equals the requested value
        select_data = self.data_df[self.data_df[col_name] == value]
        return select_data
    
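    def rangeSelect(self,col_name,begin:int,end:int):
        # Both bounds are included here (begin <= column <= end),
        # so a boundary value belongs to two neighbouring intervals
        data = self.data_df
        return data[(data[col_name] >= begin) & (data[col_name] <= end)]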
    
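    def percentageSelect(self, col_name, percent, index):
        # index == 0: sort descending, so the highest values come first;
        # any other index: sort ascending, so the lowest values come first
        if index == 0:
            sort_data = self.data_df.sort_values(by = col_name, ascending = False).reset_index()
        else:
            sort_data = self.data_df.sort_values(by = col_name, ascending = True).reset_index()
        # Number of rows that make up the requested fraction of the table
        num_rows = int(percent*sort_data[col_name].size)
        return sort_data[0:num_rows]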
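
The constructor above replaces every `?` with 0, which pulls averages down because the zeros are counted as real measurements. One alternative, sketched here on a hypothetical three-row sample rather than on data.csv, is to let `pd.read_csv` parse `?` as a missing value through its `na_values` parameter, so that `mean()` skips it. The reports in this notebook were not computed this way.

```python
import io
import pandas as pd

# Hypothetical sample in the style of data.csv, with '?' marking a missing value.
sample_csv = io.StringIO("age,chol\n40,200\n50,?\n60,300\n")

# na_values tells read_csv to parse '?' as NaN instead of keeping it as a string.
df = pd.read_csv(sample_csv, na_values="?")

# Series.mean() skips NaN by default, so only 200 and 300 are averaged.
mean_skipping_missing = df["chol"].mean()

# Filling with 0 reproduces the effect of replacing '?' by 0: three values are averaged.
mean_with_zero_fill = df["chol"].fillna(0).mean()
```

In this sample the two approaches give different means for `chol`, which shows how much a single zero-filled entry can shift a result.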

Report 1

In [8]:
fileName = 'datasets_23651_30233_data.csv'
obj = Table(fileName)
data = obj.getColumn('age')
averageAge = data.mean()
print("The average age of the patients is %f" %averageAge)

The average age of the patients is 47.826531


Report 2

In [5]:
# Average chol for each 10-year age group: 20-30, 30-40, 40-50, 50-60
for i in range(20,60,10):
    data_chol = obj.rangeSelect('age',i,i+10)['chol']
    print("In the age group of %d to %d, the average cholesterol level is %f" %(i, i+10, np.average(data_chol)))

In the age group of 20 to 30, the average cholesterol level is 153.000000
In the age group of 30 to 40, the average cholesterol level is 231.089286
In the age group of 40 to 50, the average cholesterol level is 227.991736
In the age group of 50 to 60, the average cholesterol level is 239.532258
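
As an alternative to looping over the intervals, the same kind of report can be sketched with `pd.cut` and `groupby`. The toy table below is hypothetical, and the bins are half-open (`right=False`), which is an assumption that differs from the closed intervals written in the assignment: it places an age of exactly 30 in the 30 to 40 group only.

```python
import pandas as pd

# Hypothetical toy table, not the assignment data.
toy = pd.DataFrame({
    "age": [25, 30, 35, 45, 55],
    "chol": [150, 200, 220, 230, 240],
})

# Bin edges for the four age groups; right=False gives [20, 30), [30, 40), ...
bins = [20, 30, 40, 50, 60]
toy["age_group"] = pd.cut(toy["age"], bins=bins, right=False)

# Mean chol per age group; observed=False keeps every bin in the result.
mean_chol_by_group = toy.groupby("age_group", observed=False)["chol"].mean()
print(mean_chol_by_group)
```

Because each patient falls into exactly one bin, the group sizes add up to the number of rows, which is not the case when boundary ages are counted twice.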


Report 3

In [6]:
# index 0 selects the 30 percent of patients with the highest chol
data = obj.percentageSelect('chol',.3, 0)['trestbps']
print("The average trestbps of the 30 percent with the highest chol level is " + str(np.average(data)))
# index -1 selects the 30 percent of patients with the lowest chol
data = obj.percentageSelect('chol',.3, -1)['trestbps']
print("The average trestbps of the 30 percent with the lowest chol level is " + str(np.average(data)))

The average trestbps of the 30 percent with the highest chol level is 133.36363636363637
The average trestbps of the 30 percent with the lowest chol level is 132.42045454545453
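
The top and bottom selection can also be sketched with `DataFrame.nlargest` and `DataFrame.nsmallest`, which avoid sorting the whole table by hand. The ten-row table below is hypothetical and only illustrates the idea; it is not the assignment data.

```python
import pandas as pd

# Hypothetical toy table with ten patients.
toy = pd.DataFrame({
    "chol": [180, 200, 220, 240, 260, 280, 300, 320, 340, 360],
    "trestbps": [110, 120, 125, 130, 135, 140, 145, 150, 155, 160],
})

# Number of rows that make up 30 percent of the table (3 of 10 here).
n_rows = int(0.3 * len(toy))

# Rows with the highest and the lowest chol values.
highest = toy.nlargest(n_rows, "chol")
lowest = toy.nsmallest(n_rows, "chol")

highest_mean = highest["trestbps"].mean()
lowest_mean = lowest["trestbps"].mean()
```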


Report 4

In [7]:
# Patients with a positive diagnosis
data = obj.select('number', 1)
total_dia = data.shape[0]
men_dia = data[data['sex'] == 1].shape[0]
women_dia = data[data['sex'] == 0].shape[0]
# Each count is divided by the number of positively diagnosed patients,
# so the two percentages are shares of that group
print("Percentage of men with positive diagnosis of heart disease is %f"%(men_dia/total_dia*100))
print("Percentage of women with positive diagnosis of heart disease is %f"%(women_dia/total_dia*100))

Percentage of men with positive diagnosis of heart disease is 88.679245
Percentage of women with positive diagnosis of heart disease is 11.320755
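
The figures above are each gender's share of the positively diagnosed patients. The question can also be read as the rate of positive diagnosis within each gender, and the two readings can differ. The sketch below contrasts them on a hypothetical six-row table; the column names `sex` and `num` are assumptions made for the illustration.

```python
import pandas as pd

# Hypothetical toy table: four men (sex = 1) and two women (sex = 0).
toy = pd.DataFrame({
    "sex": [1, 1, 1, 1, 0, 0],
    "num": [1, 1, 0, 0, 1, 0],
})

# Reading 1: share of each gender among positively diagnosed patients.
positive = toy[toy["num"] == 1]
share_of_positives = positive["sex"].value_counts(normalize=True) * 100

# Reading 2: rate of positive diagnosis within each gender.
# The mean of a 0/1 column is the fraction of ones.
rate_within_sex = toy.groupby("sex")["num"].mean() * 100

print(share_of_positives)
print(rate_within_sex)
```

In this toy table men make up two thirds of the positive cases, yet half of the men and half of the women are diagnosed, so the choice of denominator changes the conclusion.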
