# Data Exploration

In this notebook, we will explore the Global Climate Change dataset to understand its structure, visualize data distributions, and identify relationships between features.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name; sns.set is an older alias)
sns.set_theme(style='whitegrid')

In [2]:
# Load the dataset
data_path = '../data/raw/Global_Climate_Change_Data_2020_2025.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataframe
df.head()

In [3]:
# Check the shape of the dataset
df.shape

(1000, 10)
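
The shape alone says little about what the ten columns contain. A minimal sketch of a per-column overview follows, assuming a helper named `column_overview` (hypothetical, not part of the original notebook) and a toy frame whose column names `year` and `country` are illustrative guesses rather than confirmed columns of the dataset:

```python
import pandas as pd


def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, non-null count, and number of distinct values."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'non_null': frame.notnull().sum(),
        'n_unique': frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy frame standing in for df; 'year' and 'country' are assumed names
toy = pd.DataFrame({
    'year': [2020, 2021, 2021],
    'country': ['A', 'B', None],
    'temperature': [14.1, 15.0, 15.0],
})
overview = column_overview(toy)
print(overview)
```

In the notebook itself this would be called as `column_overview(df)`. The built-in `df.info()` and `df.describe()` give similar information without a helper.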

In [4]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]
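
Raw counts are easier to judge next to the share of rows they represent. The sketch below assumes a hypothetical helper `missing_summary` and runs on a toy frame, so the numbers it prints are not results from the actual dataset:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    affected = summary[summary['missing_count'] > 0]
    return affected.sort_values('missing_count', ascending=False)


# Toy frame standing in for df; the values are made up for illustration
toy = pd.DataFrame({
    'temperature': [14.1, None, 15.3, 14.8],
    'co2_levels': [410.0, 412.5, None, None],
})
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(df)` in the notebook would return an empty frame if the dataset has no missing values.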

In [5]:
# Visualize the distribution of a key feature
plt.figure(figsize=(10, 6))
sns.histplot(df['temperature'], bins=30, kde=True)
plt.title('Distribution of Temperature')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.show()

In [6]:
# Visualize relationships between features
plt.figure(figsize=(10, 6))
sns.scatterplot(x='temperature', y='co2_levels', data=df)
plt.title('Temperature vs CO2 Levels')
plt.xlabel('Temperature')
plt.ylabel('CO2 Levels')
plt.show()

In [7]:
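# Correlation matrix (numeric columns only; pandas >= 2.0 raises on
# non-numeric columns unless numeric_only=True is passed)
plt.figure(figsize=(12, 8))
correlation = df.corr(numeric_only=True)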
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
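
A heatmap of many features can be hard to read at a glance, so it may help to list the strongest pairwise correlations as a ranked table. The following is a sketch only: `top_correlations` is a hypothetical helper, and the toy columns `a`, `b`, `c`, and `region` are placeholders rather than columns from the dataset.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep the upper triangle only (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude so strong negative correlations are not pushed to the bottom
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy frame standing in for df; 'region' shows that text columns are ignored
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
    'region': ['x', 'y', 'x', 'y', 'x'],
})
strongest = top_correlations(toy, n=3)
print(strongest)
```

In the notebook, `top_correlations(df)` would give candidate feature pairs to inspect with scatter plots. Correlation here measures linear association only and does not imply a causal link.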

## Conclusion

In this notebook, we explored the Global Climate Change dataset, visualized key features, and identified relationships between them. This analysis will guide our feature engineering and model training in subsequent notebooks.