
# Pandas - Data Correlations and Plotting

In this lecture, we will explore two important topics in Pandas:
- Finding correlations between columns in a dataset
- Visualizing data with plots using Pandas and Matplotlib



## 1. Data Correlations

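The `corr()` method in Pandas calculates the correlation between every pair of columns in a DataFrame. By default it returns the Pearson coefficient, a number between -1 and 1: values close to 1 or -1 indicate a strong linear relationship (positive or negative), and values close to 0 indicate little or no linear relationship.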
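To get a feel for the scale before looking at real data, here is a minimal sketch on a toy DataFrame. The column names `x`, `y`, and `z` and their values are made up for illustration: `y` rises in lockstep with `x`, and `z` falls in lockstep with `x`.

```python
import pandas as pd

# Hypothetical toy data: y rises exactly with x, z falls exactly with x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

# corr() returns a square table with one row and one column per input column
toy_corr = toy.corr()
print(toy_corr.round(2))
```

The `x`/`y` entry should come out as 1 and the `x`/`z` entry as -1, and each column is perfectly correlated with itself, so the diagonal is all ones.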

Let's see how we can use it to find correlations between columns in our example dataset.

### Example:
The code below builds a small workout dataset and prints its correlation matrix. Note that two of the "Calories" values are missing (`None`):


In [None]:

import pandas as pd

# Sample data for correlation example
data = {
    "Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60, 60, 60, 60, 60, 60, 60, 60, 60, 45, 60, 45, 60, 45, 60, 45, 60, 60, 60, 60, 60, 60, 60],
    "Pulse": [110, 117, 103, 109, 117, 102, 110, 104, 109, 98, 103, 100, 100, 106, 104, 98, 98, 100, 90, 103, 97, 108, 100, 130, 105, 102, 100, 92, 103, 100, 102, 92],
    "Maxpulse": [130, 145, 135, 175, 148, 127, 136, 134, 133, 124, 147, 120, 120, 128, 132, 123, 120, 120, 112, 123, 125, 131, 119, 101, 132, 126, 120, 118, 132, 132, 129, 115],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0, 300.0, 374.0, 253.3, 195.1, 269.0, 329.3, 250.7, 250.7, 345.3, 379.3, 275.0, 215.2, 300.0, None, 323.0, 243.0, 364.2, 282.0, 300.0, 246.0, 334.5, 250.0, 241.0, None, 280.0, 380.3, 243.0]
}

df = pd.DataFrame(data)

# Calculate the correlation matrix (rows with a missing value are skipped pair by pair)
correlation_matrix = df.corr()
print(correlation_matrix)


In [None]:
correlation_matrix
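The observations that follow quote individual entries of this matrix. As a rough sketch of how one entry can be read, and of how missing values are handled, the example below uses a hypothetical five-row sample (not the full dataset) in which one "Calories" value is missing.

```python
import pandas as pd

# Hypothetical mini-sample with one missing Calories value
sample = pd.DataFrame({
    "Pulse": [110, 117, 103, 109, 90],
    "Calories": [409.1, 479.0, 340.0, 282.4, None],
})

matrix = sample.corr()

# One entry of the matrix, addressed by row label and column label
from_matrix = matrix.loc["Pulse", "Calories"]

# The same number computed directly from the two columns
direct = sample["Pulse"].corr(sample["Calories"])

# Dropping the incomplete row first gives the same result,
# because corr() already leaves out missing values pair by pair
complete_only = sample.dropna().corr().loc["Pulse", "Calories"]

print(from_matrix, direct, complete_only)
```

All three printed numbers should agree, which shows that the missing value does not need to be removed by hand before calling `corr()`.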


## **🚀 Key Observations**
### 🔹 **Duration vs. Calories (-0.114)**
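- **Weak negative correlation**, meaning **longer workouts are associated with slightly lower calorie counts** in this dataset. A correlation describes an association, not a cause.
- This is unusual. A likely explanation is the single row with a **`Duration` of 450**, while every other workout lasts 30 to 60 minutes. One extreme value like this can dominate the coefficient, so the result says little about typical workouts.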
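The dataset contains one `Duration` of 450 while all other values lie between 30 and 60, and a single extreme value can pull a Pearson coefficient a long way. The sketch below illustrates the effect on a small hypothetical sample (the numbers are invented for the demonstration, and the cut-off of 120 minutes is an arbitrary assumption, not a rule).

```python
import pandas as pd

# Hypothetical sample: calories rise with duration, except for one extreme row
workouts = pd.DataFrame({
    "Duration": [30, 45, 45, 60, 60, 60, 450],
    "Calories": [195, 250, 280, 300, 340, 400, 253],
})

# Correlation using every row, including the 450-minute one
r_all = workouts["Duration"].corr(workouts["Calories"])

# Correlation after leaving out rows above an arbitrary 120-minute cut-off
typical = workouts[workouts["Duration"] <= 120]
r_typical = typical["Duration"].corr(typical["Calories"])

print(f"all rows:     {r_all:.3f}")
print(f"typical rows: {r_typical:.3f}")
```

With the extreme row included the coefficient is negative; without it the coefficient is clearly positive. Whether such a row should be corrected, removed, or kept depends on what is known about the data.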

### 🔹 **Pulse vs. Calories (0.513)**
- **Moderate positive correlation**, meaning **workouts with a higher pulse tend to show higher calorie counts**.
- This is consistent with the idea that **intense workouts with a higher heart rate usually burn more energy**, although the correlation alone does not prove it.

### 🔹 **Maxpulse vs. Calories (0.357)**
- **Weak to moderate positive correlation**.
- Suggests that workouts where **Maxpulse is higher tend to show more calories burned**, but the relationship is not very strong.

### 🔹 **Pulse vs. Maxpulse (0.277)**
- **Weak positive correlation**.
- This suggests that **workouts with a higher pulse tend to have a slightly higher maximum pulse**, but not always. The dataset contains no resting heart rates, so it says nothing about those.

### 🔹 **Duration vs. Pulse (0.004)**
- **Almost zero correlation**, meaning **exercise duration has no meaningful linear relationship with heart rate** in this dataset.
- This could be because **both low and high heart rate exercises exist across all durations**.

---

## **🔥 Final Takeaways**
✔ **Strongest correlation:** `Pulse ↔ Calories (0.513)`, meaning **higher heart rates tend to go together with higher calorie burn**.  
✔ **Weakest correlation:** `Duration ↔ Pulse (0.004)`, meaning **workout duration shows almost no linear relationship with pulse rate**.  
✔ **Unexpected finding:** `Duration ↔ Calories (-0.114)`, suggesting **longer workouts might not always mean higher calorie burn**, although the single 450-minute row heavily influences this value.

---


## 2. Plotting

Pandas can create plots directly from a DataFrame with the `plot()` method. It draws with Matplotlib, so we import Matplotlib's `pyplot` submodule to display the result with `plt.show()`.

### Example: Basic Plot
Called without arguments, `plot()` draws a line chart with one line per numeric column, using the row index as the x-axis:


In [None]:

import matplotlib.pyplot as plt

# Plot the DataFrame
df.plot()

# Show the plot
plt.show()
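`plot()` returns a Matplotlib `Axes` object, which we can use to add labels to the default chart. The following is a minimal sketch on a hypothetical four-row sample. The title and label texts are our own choices, and the `Agg` backend line is only an assumption so that the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical four-row sample with three numeric columns
sample = pd.DataFrame({
    "Duration": [60, 60, 45, 30],
    "Pulse": [110, 117, 109, 109],
    "Maxpulse": [130, 145, 175, 133],
})

# One line per column; the row index is used for the x-axis
ax = sample.plot(title="Workout sample")

# The returned Axes lets us label the chart
ax.set_xlabel("Row index")
ax.set_ylabel("Value")

# In a notebook, call plt.show() here to display the figure
```

Because all columns share one y-axis, a column with much larger values than the others will flatten the remaining lines.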



### Scatter Plot
We can create scatter plots by specifying `kind='scatter'` in the `plot()` method. A scatter plot needs one column for the x-axis and one for the y-axis, which we pass as the `x` and `y` arguments.

Let's create a scatter plot to show the relationship between "Duration" and "Calories".


In [None]:

# Scatter plot of Duration vs Calories
df.plot(kind='scatter', x='Duration', y='Calories')

# Show the plot
plt.show()
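When many points share the same coordinates, as happens with repeated durations, they are drawn on top of each other. One possible remedy, sketched here on a hypothetical sample, is to make the markers partly transparent with Matplotlib's `alpha` setting, which `plot()` passes through. The value 0.5 and the title text are arbitrary choices, and the `Agg` backend line is only an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with repeated Duration values
sample = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 30],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0, 195.1],
})

# x and y name the columns to plot; alpha makes overlapping points visible
ax = sample.plot(kind="scatter", x="Duration", y="Calories", alpha=0.5)
ax.set_title("Calories by workout duration")

# In a notebook, call plt.show() here to display the figure
```

Pandas labels both axes with the column names, so no extra labelling is needed for a quick look.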



Now, let's create a scatter plot for "Duration" and "Maxpulse". The correlation matrix shows only a weak correlation between these two columns, so we should expect the points to show no clear trend.


In [None]:

# Scatter plot of Duration vs Maxpulse
df.plot(kind='scatter', x='Duration', y='Maxpulse')

# Show the plot
plt.show()



### Histogram
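Histograms can be created by specifying `kind='hist'` in the `plot()` method. A histogram splits the range of a column into intervals (bins) and shows how many values fall into each one. Pandas uses 10 bins by default, which can be changed with the `bins` argument.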

Let's create a histogram for the "Duration" column to see the distribution of workout durations.


In [None]:

# Histogram of Duration
df['Duration'].plot(kind='hist')

# Show the plot
plt.show()
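To see what the bars of a histogram represent, we can compute the counts ourselves. This is a sketch that uses NumPy's `histogram` function on a hypothetical eight-value sample, with 3 equal-width bins chosen only to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of workout durations in minutes
durations = pd.Series([60, 60, 60, 45, 45, 60, 60, 30], name="Duration")

# Split the range from the smallest to the largest value into 3 equal-width bins
counts, edges = np.histogram(durations, bins=3)

# Each bar of the histogram corresponds to one of these lines
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} minutes: {count} workouts")
```

The bins here run from 30 to 40, 40 to 50, and 50 to 60 minutes, with 1, 2, and 5 workouts respectively. The last bin includes its right edge, which is why the 60-minute workouts are counted there.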
