
# **Assignment 7**

# **Weeks 8 & 9 - Pandas**
* In this homework assignment, you will explore and analyze a public dataset of your choosing. Since this assignment is “open-ended” in nature, you are free to expand upon the requirements below. However, you must meet the minimum requirements as indicated in each section.

* You must use Pandas as the **primary tool** to process your data.

* The preferred method for this analysis is in a .ipynb file. Feel free to use whichever platform you prefer.
 * https://www.youtube.com/watch?v=inN8seMm7UI (Getting started with Colab).

* Your data should need some "work", or be considered "dirty".  You must show your skills in data cleaning/wrangling.

### **Some data examples:**
•	https://www.data.gov/

•	https://opendata.cityofnewyork.us/

•	https://datasetsearch.research.google.com/

•	https://archive.ics.uci.edu/ml/index.php

### **Resources:**

•	https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

•	https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html


### **Headings or comments**
**You are required to use comments or headings for each section.  You must explain what your code is doing and the results of running your code.**  Act as if you were giving this assignment to your manager - you must include clear and descriptive information for each section.

### **You may work as a group or individually on this assignment.**


# Introduction

In this section, please describe the dataset you are using.  Include a link to the source of this data.  You should also provide some explanation on why you chose this dataset.

The dataset I'm using for this homework assignment contains electronic health records (EHRs) for patients diagnosed with and without heart disease. Here is the link: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/data?select=heart.csv. In my line of work, I help software engineers put together design history files (DHFs) for algorithms that are considered medical devices, which are then submitted to competent authorities. My main motivation for choosing this dataset is to learn more about metrics for heart health.

______________
# Data Exploration
Import your dataset into your .ipynb, create dataframes, and explore your data.  

Include:

* Summary statistics (means, medians, quartiles)
* Missing value information
* Any other relevant information about the dataset.  



In [None]:
import pandas as pd

# Import the data.
data = pd.read_csv("https://raw.githubusercontent.com/samato0624/DATA602/main/heart.csv")

# Display data type to confirm it's a dataframe.
print("Data type:", type(data))

# Display first few rows of the dataframe.
print("\nFirst few rows of the dataframe:")
print(data.head())

# Provide summary statistics with mean, median, and quartiles.
print("\nSummary statistics:")
summary_stats = data[["age", "trestbps", "chol", "target"]].groupby("target").describe()
print(summary_stats)

# Show a dataframe containing missing data (the dataframe should be empty as there are no missing values).
print("\nSubset of rows with missing data:")
missing_data_subset = data[data.isnull().any(axis=1)]
print(missing_data_subset.head())
'''
Here are simple explanations for what each column describes.
1. age in years
2. sex (1 = male; 0 = female)
3. chest pain type (4 values)
4. resting blood pressure (in mm Hg on admission to the hospital)
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
14. target: 0 = no heart disease; 1 = heart disease
'''
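
The exploration requirements above also ask for medians and missing value information. A minimal sketch of how those could be reported per column is shown here. It runs on a small made-up frame, so the name `sample` and all of its values are hypothetical and not taken from heart.csv; on the real data the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics a few heart.csv columns; values are invented.
sample = pd.DataFrame({
    "age": [52, 53, 70, 61, 62],
    "trestbps": [125, 140, np.nan, 148, 138],
    "chol": [212, 203, 174, 203, 294],
    "target": [0, 0, 0, 0, 1],
})

# Count missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Median of each numeric column (missing values are skipped by default).
medians = sample.median()
print(medians)

# Column data types, useful for spotting improperly coded variables.
print(sample.dtypes)
```

A per-column count is often easier to read than a subset of rows, because it still says something useful when only one or two columns are incomplete.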

# Data Wrangling
Create a subset of your original data and perform the following.  

1. Modify multiple column names.

2. Look at the structure of your data – are any variables improperly coded? Such as strings or characters? Convert to correct structure if needed.

3. Fix missing and invalid values in data.

4. Create new columns based on existing columns or calculations.

5. Drop column(s) from your dataset.

6. Drop a row(s) from your dataset.

7. Sort your data based on multiple variables.

8. Filter your data based on some condition.

9. Convert all the string values to upper or lower cases in one column.

10. Check whether numeric values are present in a given column of your dataframe.

11. Group your dataset by one column, and get the mean, min, and max values by group.
  * Groupby()
  * agg() or .apply()

12. Group your dataset by two columns and then sort the aggregated results within the groups.

**You are free (and encouraged) to add to these questions.  Please clearly indicate your answers to these questions in your assignment.**

In [None]:
# Create a list of desired columns.
columns_of_interest = ["age", "sex", "cp", "trestbps", "fbs", "chol", "thalach", "target"]

# Remove duplicate rows (req 3 & 6 covered).
data = data.drop_duplicates()

# Replace row values in the column "sex" from 1 and 0 to male and female (req 2 covered).
data["sex"] = data["sex"].replace({1: "male", 0: "female"})
data["sex"] = data["sex"].astype(str)

# Filter to the desired columns and create a new dataframe (req 5 covered).
# Using .copy() makes filtered_data independent of data, which avoids the SettingWithCopyWarning.
filtered_data = data[columns_of_interest].copy()

# Rename columns in use (req 1 covered).
filtered_data = filtered_data.rename(columns={
    "age": "Age",
    "sex": "Gender",
    "cp": "Chest_pain_type",
    "trestbps": "Resting_blood_pressure",
    "fbs": "Fasting_blood_glucose",
    "chol": "Cholesterol",
    "thalach": "Max_heart_rate",
    "target": "Patient_has_heart_disease",
})

# Convert all values in the gender column from lowercase to uppercase (req 9 covered).
filtered_data["Gender"] = filtered_data["Gender"].str.upper()

# Sort the data by age in ascending order and cholesterol by descending order (req 7 covered).
sorted_data = filtered_data.sort_values(by=["Age", "Cholesterol"], ascending=[True, False])
print(sorted_data.head())

# Check for numeric values in a column (req 10).
column_name = "Gender" # Column to look at.
# Try to parse every value as a number; values that cannot be parsed become NaN.
numeric_found = pd.to_numeric(sorted_data[column_name], errors="coerce").notna().any()

if numeric_found:
    print("At least one numeric value found in the column '{}'.".format(column_name))
else:
    print("No numeric values found in the column '{}'.".format(column_name))

# Create a new column defining the blood pressure as normal, pre-hypertension, or hypertension (req 4 covered).
normal_threshold = 120
pre_hypertension_threshold = 140
sorted_data['Blood_pressure_category'] = pd.cut(sorted_data['Resting_blood_pressure'],
                                         bins=[-float('inf'), normal_threshold, pre_hypertension_threshold, float('inf')],
                                         labels=['Normal', 'Pre-hypertension', 'Hypertension'])
print(sorted_data)

# Filtering rows where 'Resting_blood_pressure' is greater than 140 (req 8 covered).
condition = sorted_data['Resting_blood_pressure'] > 140
filtered_data = sorted_data[condition]
print(filtered_data)

# Group by 'Patient_has_heart_disease' and calculate mean, min, and max values for each group (req 11 covered).
grouped_data = sorted_data.groupby('Patient_has_heart_disease').agg({
    'Age': ['mean', 'min', 'max'],
    'Resting_blood_pressure': ['mean', 'min', 'max'],
    'Cholesterol': ['mean', 'min', 'max']
})
grouped_data.columns = ['Age_mean', 'Age_min', 'Age_max', # change the column names for clarity.
                        'Resting_blood_pressure_mean', 'Resting_blood_pressure_min', 'Resting_blood_pressure_max',
                        'Cholesterol_mean', 'Cholesterol_min', 'Cholesterol_max']
print(grouped_data)

# Group by 'Patient_has_heart_disease' and 'Gender' and calculate mean, min, and max values for each group (req 12 covered)
grouped_data = sorted_data.groupby(['Patient_has_heart_disease', 'Gender']).agg({
    'Age': ['mean', 'min', 'max'],
    'Resting_blood_pressure': ['mean', 'min', 'max'],
    'Cholesterol': ['mean', 'min', 'max']
})
grouped_data.columns = ['Age_mean', 'Age_min', 'Age_max', # change the column names for clarity.
                        'Resting_blood_pressure_mean', 'Resting_blood_pressure_min', 'Resting_blood_pressure_max',
                        'Cholesterol_mean', 'Cholesterol_min', 'Cholesterol_max']
print(grouped_data)
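
Requirement 12 also asks for the aggregated results to be sorted within the groups. One possible way to do that is sketched here, assuming the goal is to order the genders by mean cholesterol inside each diagnosis group. The frame `sample` and its values are hypothetical; only the column names follow the renamed columns used above.

```python
import pandas as pd

# Hypothetical sample using the renamed columns; values are invented.
sample = pd.DataFrame({
    "Patient_has_heart_disease": [0, 0, 0, 1, 1, 1],
    "Gender": ["MALE", "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE"],
    "Cholesterol": [240, 200, 260, 250, 230, 270],
})

# Mean cholesterol for each (diagnosis, gender) pair.
grouped = (
    sample.groupby(["Patient_has_heart_disease", "Gender"])["Cholesterol"]
    .mean()
    .reset_index(name="Cholesterol_mean")
)

# Sort within each diagnosis group by mean cholesterol, highest first.
sorted_within_groups = grouped.sort_values(
    by=["Patient_has_heart_disease", "Cholesterol_mean"],
    ascending=[True, False],
).reset_index(drop=True)
print(sorted_within_groups)
```

Sorting on the outer grouping column first keeps each diagnosis group together, and the second key then orders the rows inside it.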

# Conclusions  

After exploring your dataset, provide a short summary of what you noticed from this dataset.  What would you explore further with more time?

After exploring the dataset, I found that overall it was mishandled. The heart disease diagnosis in the 'target' column is 1 for people who do not have heart disease and 0 for those who do, which is the reverse of the documented coding (0 = no heart disease; 1 = heart disease).

In terms of what I can glean from the data, women show a difference of approximately 10% in both mean cholesterol and mean resting blood pressure between those who are diagnosed with heart disease and those who aren't. Men show a negligible difference in blood pressure but a similar difference in cholesterol. With more time, I would explore factors beyond cholesterol and blood pressure to understand why certain men experience heart disease and others do not.
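
The percentage differences discussed above could be computed from grouped means along the following lines. This is only a sketch: the frame `means` and its numbers are invented for illustration and are not the results from heart.csv, and the choice to express the difference relative to the group coded 0 is an assumption.

```python
import pandas as pd

# Hypothetical group means; values are invented for illustration only.
means = pd.DataFrame({
    "Gender": ["FEMALE", "FEMALE", "MALE", "MALE"],
    "Patient_has_heart_disease": [0, 1, 0, 1],
    "Cholesterol_mean": [250.0, 275.0, 240.0, 246.0],
})

# Put the two diagnosis groups side by side, one row per gender.
wide = means.pivot(
    index="Gender",
    columns="Patient_has_heart_disease",
    values="Cholesterol_mean",
)

# Percent difference of the group coded 1 relative to the group coded 0.
percent_difference = (wide[1] - wide[0]) / wide[0] * 100
print(percent_difference.round(1))
```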