# Introduction to Data Analysis with Pandas

## Introduction
This tutorial introduces you to data analysis with Pandas. You will learn how to load a dataset, explore it, calculate basic statistics, filter rows, and save your results.

## Setup
Before getting started, make sure that Pandas and PyArrow are installed. You can install both using pip:

In [None]:
!pip install pyarrow pandas

You also need to fetch the data you want to analyse. For this tutorial, we have placed the data into this project at `health_records.csv` for you.

Try to get a feeling for the data by opening the file tree and double-clicking the respective file to view and browse the data in Colab.

# Step-by-Step Tutorial
## Step 1: Load the dataset into a Pandas DataFrame

We start by loading the dataset into a Pandas DataFrame. This allows us to work with the data in a tabular format similar to a spreadsheet.

* Use pandas' [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function for this. It reads data from a CSV file: it takes the file path as an argument and returns a DataFrame containing the data from the file.

In [13]:
import pandas as pd

# Use pandas read_csv method to read in your CSV file.
df = pd.read_csv('health_records.csv')
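To see what `read_csv()` returns without depending on a file on disk, the following minimal sketch reads CSV text from memory. The column names and values are made up for illustration and are not part of `health_records.csv`.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (not the tutorial's data).
csv_text = "name,score\nAda,90\nGrace,85\nLinus,70\n"

# read_csv accepts a file path or a file-like object; StringIO wraps the text.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names taken from the header row
```

The first line of the CSV text becomes the column names, and each following line becomes one row.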

## Step 2: Display the first 10 rows of the dataset
Let's take a quick look at the first few rows of the dataset to get an idea of what it looks like.

* Use the [`DataFrame.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method of the DataFrame object you just created for this purpose.

In [None]:
df_head_10 = #...
print("First 10 rows of the dataset:")
print(df_head_10)
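As a small illustration of how `head()` behaves, here is a sketch on a made-up single-column DataFrame (the `value` column is hypothetical). Without an argument, `head()` returns the first 5 rows; passing a number changes how many rows you get.

```python
import pandas as pd

# Toy DataFrame with 20 rows, used only to illustrate head().
toy_df = pd.DataFrame({"value": range(20)})

first_default = toy_df.head()  # no argument: the first 5 rows
first_three = toy_df.head(3)   # explicit argument: the first 3 rows

print(first_three)
```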

## Step 3: Calculate and display statistics for the 'Age' column

Now, let's calculate some basic statistics for the 'Age' column, such as the mean, median, and standard deviation. You can get the age column of the CSV data by indexing the DataFrame object with the column name as a string, i.e. `df['Age']`, and calculate some basic statistics using [predefined methods](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats).

**Explanation:** `df['Age'].mean()` calculates the mean (average) of the 'Age' column. Similarly, `df['Age'].median()` calculates the median (middle value), and `df['Age'].std()` calculates the standard deviation of the 'Age' column.

In [None]:
mean_age = #...
median_age = #...
std_age = #...

print("\nStatistics for Age:")
print(f"Mean Age: {mean_age:.2f}")
print(f"Median Age: {median_age}")
print(f"Standard Deviation of Age: {std_age:.2f}")
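The sketch below applies the same three methods to a made-up column so you can check the results by hand. One detail worth knowing: `std()` computes the sample standard deviation by default (`ddof=1`); pass `ddof=0` if you want the population standard deviation instead. The `score` column is hypothetical.

```python
import pandas as pd

# Toy data, small enough to verify by hand.
toy_df = pd.DataFrame({"score": [1, 2, 3, 4]})

mean_score = toy_df["score"].mean()             # (1 + 2 + 3 + 4) / 4
median_score = toy_df["score"].median()         # midpoint between 2 and 3
std_sample = toy_df["score"].std()              # sample std, divides by n - 1
std_population = toy_df["score"].std(ddof=0)    # population std, divides by n

print(f"mean={mean_score}, median={median_score}")
print(f"sample std={std_sample:.4f}, population std={std_population:.4f}")
```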

## Step 4: Count and display the number of male and female patients

Next, let's count the number of male and female patients in the dataset to understand the gender distribution. We will call the [`Series.value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method on the column for this, which counts the occurrences of each unique value in that column. It's useful for understanding the distribution of categorical data.

In [None]:
gender_counts = #...

print("\nGender Distribution:")
print("Male:", gender_counts['Male'])
print("Female:", gender_counts['Female'])
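Here is a minimal sketch of counting values on a made-up categorical column (the `group` column and its values are hypothetical). The result is a Series indexed by the unique values, so an individual count can be looked up by its label.

```python
import pandas as pd

# Toy categorical column, used only to illustrate value_counts().
toy_df = pd.DataFrame({"group": ["a", "b", "a", "c", "a", "b"]})

# Absolute counts, sorted from most to least frequent.
group_counts = toy_df["group"].value_counts()

# Relative frequencies instead of counts.
group_shares = toy_df["group"].value_counts(normalize=True)

print(group_counts)
print("Count for 'a':", group_counts["a"])
```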

## Step 5: Calculate and display the average blood pressure for patients with and without heart conditions

We'll now calculate the average blood pressure for patients with and without heart conditions to see if there's any difference. For this, we will filter the DataFrame object using a syntax that's called boolean indexing on the `'Heart Condition'` column.

* For example, `df[df['Heart Condition'] == 'Yes']` filters the DataFrame to include only rows where the condition `df['Heart Condition'] == 'Yes'` is true, i.e. where the 'Heart Condition' column is 'Yes'.

In [None]:
patients_with_heart_condition = df['Heart Condition'] == 'Yes'
patients_without_heart_condition = #...
avg_bp_heart_condition = #...
avg_bp_no_heart_condition = #...

print("\nAverage Blood Pressure:")
print("Patients with Heart Condition:", avg_bp_heart_condition)
print("Patients without Heart Condition:", avg_bp_no_heart_condition)
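To make the two stages of boolean indexing visible, the following sketch first builds a boolean mask and then uses it to select rows on a made-up DataFrame. The `group` and `score` columns are hypothetical; `~` inverts the mask.

```python
import pandas as pd

# Toy data, used only to illustrate boolean indexing.
toy_df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "score": [10, 20, 30, 40],
})

# Stage 1: the comparison yields a boolean Series, one True/False per row.
is_group_a = toy_df["group"] == "a"

# Stage 2: indexing with the mask keeps only the rows marked True.
group_a_rows = toy_df[is_group_a]

# Select a column from the filtered rows and aggregate it.
mean_a = toy_df[is_group_a]["score"].mean()
mean_not_a = toy_df[~is_group_a]["score"].mean()  # ~ flips True and False

print(group_a_rows)
print(mean_a, mean_not_a)
```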

## Step 6: Create a new DataFrame for patients with high cholesterol and high blood pressure

Let's identify patients with high cholesterol and high blood pressure, as they may be at higher risk. For this, we need a compound condition that selects rows where the 'Cholesterol' column is greater than 200 and either the 'Systolic BP' column is greater than 140 or the 'Diastolic BP' column is greater than 90. We can build such a condition by combining boolean masks with the following unary and binary bitwise operators:
* `&` (and),
* `|` (or, inclusive or),
* `~` (not),
* `^` (xor, exclusive or).

For example, the following expression selects all male patients who are older than 50:
```python
df[(df['Gender'] == 'Male') & (df['Age'] > 50)]
```
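A condition of the form "A and (B or C)" can be sketched the same way on made-up numeric columns (`x`, `y`, `z` and the thresholds are hypothetical). Each comparison is wrapped in parentheses or stored in its own variable, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.

```python
import pandas as pd

# Toy data, used only to illustrate combining boolean masks.
toy_df = pd.DataFrame({
    "x": [1, 5, 8, 3, 9],
    "y": [10, 2, 7, 1, 6],
    "z": [0, 0, 9, 9, 0],
})

# Naming each mask keeps the combined condition readable.
x_large = toy_df["x"] > 4
y_large = toy_df["y"] > 5
z_large = toy_df["z"] > 5

# "x is large AND (y is large OR z is large)"
selected = toy_df[x_large & (y_large | z_large)]

# Exclusive or: rows where exactly one of the two masks is True.
exactly_one = toy_df[y_large ^ z_large]

print(selected)
```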

Now let's identify the high-risk patients with high cholesterol and high blood pressure as described above.

In [None]:
patients_with_high_cholesterol = df['Cholesterol'] > 200
patients_with_high_systolic_bp = #...
patients_with_high_diastolic_bp = #...
high_risk_patients = #...

print(high_risk_patients.head(10))

In the rows printed above, most high-risk patients seem to be older than 50. You can verify this by computing the average age of all high-risk patients.

In [None]:
print("The average age of high risk patients:")
print() #...

## Step 7: Calculate and display the percentage of patients with both high cholesterol and high blood pressure

We'll calculate the percentage of patients with both high cholesterol and high blood pressure to understand the prevalence of this risk factor. To do this, we need the number of high-risk patients, which `len(high_risk_patients)` provides, because `len()` returns the number of rows of a DataFrame. Dividing this number by the total number of patients and multiplying by 100 gives the percentage.

In [None]:
total_high_risk_patients = #...
total_patients = #...

percentage_high_risk = #...

print("\nPercentage of Patients with High Cholesterol and High Blood Pressure:", f"{percentage_high_risk:.2f}%")
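The same arithmetic, sketched on a made-up DataFrame so the result can be checked by hand (the `score` column and the threshold of 80 are hypothetical):

```python
import pandas as pd

# Toy data: 8 rows, 3 of which exceed the threshold.
toy_df = pd.DataFrame({"score": [55, 72, 91, 64, 88, 79, 95, 60]})

high_scores = toy_df[toy_df["score"] > 80]

# len() on a DataFrame returns its number of rows.
share_percent = len(high_scores) / len(toy_df) * 100  # 3 / 8 * 100

print(f"{share_percent:.2f}%")
```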


## Step 8: Save the high-risk patients DataFrame to a new CSV file

Finally, let's save the DataFrame containing high-risk patients to a new CSV file for further analysis.

* Use the `DataFrame.to_csv()` method, which saves the DataFrame to a CSV file at the path you pass in. Pass `index=False` to leave the row index out of the CSV file.

In [None]:
high_risk_patients#...
print("\nHigh-risk patients saved to 'high_risk_patients.csv'")
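To see what `index=False` changes, the sketch below calls `to_csv()` without a file path, in which case pandas returns the CSV text as a string instead of writing a file. The DataFrame and its index values are made up for illustration.

```python
import pandas as pd

# Toy data with a non-default index so the difference is easy to spot.
toy_df = pd.DataFrame(
    {"name": ["Ada", "Grace"], "score": [90, 85]},
    index=[7, 8],
)

# Without a path, to_csv() returns the CSV content as a string.
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)

print(with_index)     # first column holds the index values 7 and 8
print(without_index)  # only the 'name' and 'score' columns
```

Leaving the index out is usually what you want for a filtered DataFrame, since the remaining index values are just leftover row numbers from the original data.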