# Exploratory Data Analysis (EDA) for Wind Energy Prediction

This notebook performs a detailed Exploratory Data Analysis (EDA) on the `data.csv` dataset to understand its structure, distributions, and relationships, and to identify potential issues before feature engineering and model training.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

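# Load the dataset
try:
    df = pd.read_csv('data.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: data.csv not found. Please ensure the file is in the correct directory.")
    # Re-raise instead of calling exit(): exit() shuts down the notebook kernel,
    # while re-raising stops this cell and keeps the traceback visible.
    raise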
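
Loaded with default settings, `timestamp` is not parsed as a datetime; the conversion happens later with `pd.to_datetime`. As an alternative, `read_csv` can parse the column at load time through its `parse_dates` parameter. The sketch below uses a small in-memory CSV with made-up values (not taken from `data.csv`) to show the difference.

```python
import io

import pandas as pd

# Made-up two-row CSV standing in for data.csv (illustrative values only).
csv_text = (
    "timestamp,wind_speed_ms\n"
    "2018-01-01 00:00:00,5.3\n"
    "2018-01-01 00:10:00,5.7\n"
)

# Default load: the timestamp column is left as text.
raw = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named column to datetime while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(raw.dtypes)
print(parsed.dtypes)
```

Either approach works for this notebook; parsing at load time simply makes the dtype overview in the next section reflect the datetime column.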


## 1. Dataset Overview

Let's start by examining the basic properties of the dataset: its shape, data types, and the presence of any missing values.


In [None]:
# Dataset shape
print(f"Dataset shape: {df.shape}")

# Dataset dtypes
print("\nDataset dtypes:")
print(df.dtypes)

# Missing values
print("\nMissing values:")
print(df.isnull().sum())
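
Raw missing counts are easier to judge next to the share of rows they represent. The helper below is a minimal sketch: `missing_summary` is a hypothetical name, and the `demo` frame holds made-up values in place of `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Small synthetic frame standing in for df (illustrative values only).
demo = pd.DataFrame({
    "wind_speed_ms": [5.1, np.nan, 7.3, 6.0],
    "actual_power_output_kw": [380.0, 450.0, np.nan, np.nan],
})

print(missing_summary(demo))
```

In the notebook, calling `missing_summary(df)` would give the same table for the real dataset.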


## 2. Summary Statistics

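Next, we'll look at the descriptive statistics for the numerical columns. `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each column, which summarize central tendency and dispersion and give a first hint at the shape of each distribution.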


In [None]:
# Summary statistics
print("\nSummary Statistics:")
print(df.describe())
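
`describe()` does not report skewness or kurtosis directly. One way to quantify distribution shape is with the pandas `skew()` and `kurt()` methods, as in the sketch below. The helper name `shape_stats` and the `demo` values are illustrative assumptions.

```python
import pandas as pd


def shape_stats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return sample skewness and excess kurtosis for each numerical column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({"skew": numeric.skew(), "kurtosis": numeric.kurt()})


# Synthetic columns: one symmetric, one with a long right tail (illustrative only).
demo = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 10.0],
})

stats = shape_stats(demo)
print(stats)
```

A skew near zero suggests a roughly symmetric column, while a clearly positive value points to a long right tail.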


## 3. Histograms and Boxplots

Visualizing the distribution of each numerical feature provides insight into its spread, skewness, and any outliers.


In [None]:
# Numerical columns to visualize (defined once and reused for both figures)
numeric_cols = ['wind_speed_ms', 'actual_power_output_kw', 'theoretical_power_curve_kwh', 'wind_direction_deg']

# Histograms with a kernel density estimate overlaid
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(2, 2, i + 1)
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Boxplots
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(y=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()
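
Boxplots show outliers visually; counting them makes the result easier to record. The sketch below applies the 1.5 × IQR rule, which matches the default whisker length of `sns.boxplot`. The helper name `iqr_outlier_counts` and the `demo` values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, cols, k: float = 1.5) -> pd.Series:
    """Count values lying more than k * IQR below Q1 or above Q3, per column."""
    counts = {}
    for col in cols:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        outside = (frame[col] < q1 - k * iqr) | (frame[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="outliers")


# Synthetic wind speeds with one implausibly large reading (illustrative only).
demo = pd.DataFrame({"wind_speed_ms": [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 30.0]})

outliers = iqr_outlier_counts(demo, ["wind_speed_ms"])
print(outliers)
```

Points flagged this way are candidates for inspection rather than automatic removal, since high wind speeds can be genuine.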


## 4. Time Series Line Plots

As the data is time-series based, line plots can reveal trends, seasonality, and other temporal patterns.


In [None]:
# Ensure 'timestamp' is datetime for plotting
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Line plots for key features over time.
# For readability, plot a random sample of 1,000 rows when the dataset is larger than that.
plot_df = df.sample(n=1000, random_state=42) if len(df) > 1000 else df

plt.figure(figsize=(18, 10))

plt.subplot(3, 1, 1)
sns.lineplot(x='timestamp', y='actual_power_output_kw', data=plot_df)
plt.title('Actual Power Output over Time')

plt.subplot(3, 1, 2)
sns.lineplot(x='timestamp', y='wind_speed_ms', data=plot_df)
plt.title('Wind Speed over Time')

plt.subplot(3, 1, 3)
sns.lineplot(x='timestamp', y='theoretical_power_curve_kwh', data=plot_df)
plt.title('Theoretical Power Curve over Time')

plt.tight_layout()
plt.show()
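
Random sampling keeps arbitrary points and can hide gaps or short-lived patterns. An alternative that preserves the temporal structure is to aggregate to a coarser, regular interval before plotting. The sketch below is one way to do that with `resample`. The helper name `downsample_for_plot`, the daily rule, and the 10-minute synthetic data are all assumptions, not properties of `data.csv`.

```python
import numpy as np
import pandas as pd


def downsample_for_plot(frame: pd.DataFrame, time_col: str = "timestamp", rule: str = "D") -> pd.DataFrame:
    """Average numerical columns over regular time bins (daily by default)."""
    ts = frame.copy()
    ts[time_col] = pd.to_datetime(ts[time_col])
    ts = ts.set_index(time_col).sort_index()
    return ts.resample(rule).mean(numeric_only=True)


# Three days of synthetic 10-minute readings (illustrative values only).
idx = pd.date_range("2018-01-01", periods=3 * 144, freq="10min")
demo = pd.DataFrame({
    "timestamp": idx,
    "wind_speed_ms": np.linspace(3.0, 12.0, len(idx)),
})

daily = downsample_for_plot(demo)
print(daily)
```

The resulting frame has a datetime index, so it can be plotted with `sns.lineplot(data=daily)` or `daily.plot()`.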


## 5. Correlation Heatmap

The heatmap below shows the pairwise Pearson correlations (the default for `df.corr`) between numerical features, which helps identify linear relationships and potential multicollinearity.


In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
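
Pearson correlation measures linear association, so it can understate a relationship that is monotonic but curved, such as power rising with the cube of wind speed. Spearman correlation, which is the Pearson correlation of the ranks, is a common complement. The sketch below compares the two on synthetic data; the cubic relationship and all values are illustrative assumptions, not results from `data.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic example: power grows with the cube of wind speed (illustrative only).
speed = np.arange(1.0, 11.0)
demo = pd.DataFrame({
    "wind_speed_ms": speed,
    "actual_power_output_kw": 0.5 * speed ** 3,
})

# Pearson correlation on the raw values.
pearson = demo.corr().loc["wind_speed_ms", "actual_power_output_kw"]

# Spearman correlation computed explicitly as the Pearson correlation of the ranks.
spearman = demo.rank().corr().loc["wind_speed_ms", "actual_power_output_kw"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

`DataFrame.corr` also accepts `method='spearman'`, so the heatmap cell can be rerun with that argument to compare both views on the real data.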


## 6. Important Observations

Based on the EDA, here are some important observations:

- **Dataset Shape and Missing Values:** The dataset contains X rows and Y columns. (To be filled after running the EDA).
- **Data Types:** All columns appear to be numerical except for 'timestamp', which `read_csv` does not parse as a datetime by default and which is converted with `pd.to_datetime` in Section 4 before plotting.
- **Summary Statistics:** (To be filled after running the EDA).
- **Histograms and Boxplots:** (To be filled after running the EDA).
- **Time Series Trends:** (To be filled after running the EDA).
- **Correlations:** (To be filled after running the EDA).
