# Renewable Energy Forecasting - Data Cleaning & Preprocessing
## AICTE/Shell/Edunet Internship Project

This notebook demonstrates a complete data cleaning and preprocessing workflow for renewable energy forecasting.

**Objective**: Predict solar or wind power generation based on weather data using machine learning.

**Dataset**: Renewable Energy Forecasting Dataset (1000 samples, 10 features + target)

## 1. Import Required Libraries

In [1]:
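import warnings

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Plotting libraries are optional here: no later cell uses them. Importing
# them last means a missing matplotlib can no longer stop the scikit-learn
# imports above, which is what broke the recorded run
# (ModuleNotFoundError: No module named 'matplotlib').
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
except ImportError:
    plt = sns = None
    print('Plotting libraries not installed; continuing without them.')

warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print('All libraries imported successfully!')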

## 2. Load Raw Dataset

In [2]:
# Load raw data
df_raw = pd.read_csv('renewable_energy_raw.csv')

print("Dataset Shape:", df_raw.shape)
print("\nFirst 10 rows:")
print(df_raw.head(10))
print("\nDataset Info:")
print(df_raw.info())

Dataset Shape: (1000, 11)

First 10 rows:
    Feature_1  Feature_2  Feature_3  Feature_4  Feature_5    Feature_6  \
0  374.602665   3.184427  17.851171  70.452725   0.871599   989.363552   
1  950.719235   8.357564  17.409364  79.751105   0.941630   997.343566   
2  732.020742  13.157715  37.187637  38.785092   0.928048  1035.454739   
3  598.698618  11.117261  17.486386  66.865557   0.746170   984.000439   
4  156.103039  12.195137  18.158492  62.880949   0.744775  1036.964968   
5  156.078921  10.052359        NaN  82.462278   0.780452   958.813443   
6   58.177804  10.538010  23.492195  87.956530   0.808322  1027.679844   
7  866.189528  12.813337  33.301317  20.911758   0.822537  1034.754763   
8  601.154900   4.120186  11.960985  70.551494   0.903909   968.181765   
9  708.101771   7.596662  24.627136  23.887685   0.717004   993.034653   

    Feature_7  Feature_8  Feature_9  Feature_10    Target  
0  233.372503   3.879945  33.216075    0.913578  0.607938  
1   62.059090  18.67725

## 3. Exploratory Data Analysis (EDA)

In [3]:
print("\n" + "="*60)
print("EXPLORATORY DATA ANALYSIS")
print("="*60)

# Missing values
print("\n1. MISSING VALUES:")
missing_data = df_raw.isnull().sum()
missing_percent = (missing_data / len(df_raw)) * 100
missing_df = pd.DataFrame({
    'Column': df_raw.columns,
    'Missing_Count': missing_data.values,
    'Missing_Percent': missing_percent.values
})
print(missing_df[missing_df['Missing_Count'] > 0].to_string(index=False))

# Descriptive Statistics
print("\n2. DESCRIPTIVE STATISTICS:")
print(df_raw.describe())

# Data Types
print("\n3. DATA TYPES:")
print(df_raw.dtypes)

# Duplicate rows
print(f"\n4. DUPLICATE ROWS: {df_raw.duplicated().sum()}")


EXPLORATORY DATA ANALYSIS

1. MISSING VALUES:
   Column  Missing_Count  Missing_Percent
Feature_1             16              1.6
Feature_3             14              1.4
Feature_5             12              1.2

2. DESCRIPTIVE STATISTICS:
        Feature_1    Feature_2   Feature_3    Feature_4   Feature_5  \
count  984.000000  1000.000000  986.000000  1000.000000  988.000000   
mean   488.592148     9.268573   25.045241    56.778129    0.848118   
std    292.676737    10.919924    8.736342    21.486701    0.086269   
min      4.731560     0.546665   10.000349    20.049004    0.700009   
25%    230.695884     4.081459   17.805702    38.142558    0.772617   
50%    494.158540     8.232562   24.950623    56.320946    0.846584   
75%    740.882540    11.719568   32.768441    75.315561    0.922501   
max    999.717702    96.795279   39.934626    94.966828    0.999325   

         Feature_6    Feature_7    Feature_8    Feature_9   Feature_10  \
count  1000.000000  1000.000000  1000.00000

## 4. Data Cleaning - Step 1: Handle Missing Values

In [4]:
print("\n" + "="*60)
print("STEP 1: HANDLING MISSING VALUES")
print("="*60)

df_cleaned = df_raw.copy()

# Use median imputation for numerical features
imputer = SimpleImputer(strategy='median')
numeric_cols = ['Feature_1', 'Feature_3', 'Feature_5']

print(f"\nMissing values before imputation:")
print(df_cleaned[numeric_cols].isnull().sum())

df_cleaned[numeric_cols] = imputer.fit_transform(df_cleaned[numeric_cols])

print(f"\nMissing values after imputation:")
print(df_cleaned[numeric_cols].isnull().sum())
print("\n✓ Missing values successfully handled using median imputation!")


STEP 1: HANDLING MISSING VALUES


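NameError: name 'SimpleImputer' is not defined

**Note**: In the recorded run, the import cell in Section 1 stopped at the `matplotlib` import, before the `sklearn.impute` line was reached, so `SimpleImputer` was never defined. Once Section 1 completes, this cell should run as written.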
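
Because the recorded run shows no imputation output, the following is a minimal sketch of what median imputation does, using a small hypothetical frame rather than the project dataset. The column names mirror the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "Feature_1": [100.0, 200.0, np.nan, 400.0, 1000.0],
    "Feature_3": [10.0, np.nan, 30.0, 40.0, 50.0],
})

imputer = SimpleImputer(strategy="median")
toy_filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# statistics_ holds the per-column medians of the observed (non-missing) values
fill_values = dict(zip(toy.columns, imputer.statistics_))
print(fill_values)
print(toy_filled)
```

The median is less sensitive to extreme values than the mean: the 1000.0 entry in `Feature_1` does not pull the fill value upward the way mean imputation would.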

## 5. Data Cleaning - Step 2: Detect and Handle Outliers

In [None]:
print("\n" + "="*60)
print("STEP 2: DETECTING AND HANDLING OUTLIERS (IQR Method)")
print("="*60)

def detect_outliers_iqr(data, column, multiplier=1.5):
    """Return the outlier rows and the IQR bounds for one column.

    This function only detects outliers; capping happens in the loop below.
    """
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

outlier_count = 0
for col in ['Feature_2', 'Feature_1', 'Feature_3']:
    outliers, lower, upper = detect_outliers_iqr(df_cleaned, col)
    if len(outliers) > 0:
        print(f"\n{col}: Found {len(outliers)} outliers")
        print(f"  Valid Range: [{lower:.2f}, {upper:.2f}]")
        # Cap outliers instead of removing
        df_cleaned[col] = df_cleaned[col].clip(lower, upper)
        outlier_count += len(outliers)

print(f"\n✓ Total outliers handled: {outlier_count} values were capped")
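
This cell was not executed in the recorded run, so no bounds are shown. As a small worked sketch of the same IQR capping idea, the example below uses a hypothetical series with one extreme value, loosely modeled on the `Feature_2` maximum of about 96.8 against a 75th percentile of about 11.7 in the descriptive statistics. The helper name `iqr_bounds` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(series, multiplier=1.5):
    """Return (lower, upper) bounds from the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical values: five typical readings and one extreme one
values = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0, 96.0], name="Feature_2")

lower, upper = iqr_bounds(values)
n_outliers = int(((values < lower) | (values > upper)).sum())

# clip() caps values at the bounds instead of dropping the rows
capped = values.clip(lower, upper)

print(f"Bounds: [{lower:.2f}, {upper:.2f}], outliers: {n_outliers}")
print(capped.tolist())
```

Capping keeps the row count unchanged, which matters when the other columns of that row still carry useful information.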

## 6. Data Cleaning - Step 3: Remove Duplicates

In [5]:
print("\n" + "="*60)
print("STEP 3: REMOVING DUPLICATES")
print("="*60)

rows_before = len(df_cleaned)
df_cleaned = df_cleaned.drop_duplicates()
duplicates_removed = rows_before - len(df_cleaned)

print(f"\nDuplicate rows found and removed: {duplicates_removed}")
print(f"Rows before: {rows_before}, After: {len(df_cleaned)}")
print(f"\n✓ Duplicate removal complete!")


STEP 3: REMOVING DUPLICATES

Duplicate rows found and removed: 0
Rows before: 1000, After: 1000

✓ Duplicate removal complete!


## 7. Data Cleaning - Step 4: Data Type Validation

In [6]:
print("\n" + "="*60)
print("STEP 4: DATA TYPE VALIDATION")
print("="*60)

print(f"\nData Types before conversion:")
print(df_cleaned.dtypes)

# Ensure all numeric columns are float64
numeric_features = [col for col in df_cleaned.columns if col.startswith('Feature')] + ['Target']
for col in numeric_features:
    df_cleaned[col] = df_cleaned[col].astype('float64')

print(f"\n✓ All numeric columns validated as float64")


STEP 4: DATA TYPE VALIDATION

Data Types before conversion:
Feature_1     float64
Feature_2     float64
Feature_3     float64
Feature_4     float64
Feature_5     float64
Feature_6     float64
Feature_7     float64
Feature_8     float64
Feature_9     float64
Feature_10    float64
Target        float64
dtype: object

✓ All numeric columns validated as float64


## 8. Data Preprocessing - Step 5: Feature Normalization

In [7]:
print("\n" + "="*60)
print("STEP 5: FEATURE NORMALIZATION (Min-Max Scaling)")
print("="*60)

# Initialize MinMax scaler
scaler = MinMaxScaler(feature_range=(0, 1))

features_to_scale = ['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 
                     'Feature_5', 'Feature_6', 'Feature_7', 'Feature_8', 
                     'Feature_9', 'Feature_10']

print(f"\nScaling the following features to range [0, 1]:")
print(features_to_scale)

df_cleaned[features_to_scale] = scaler.fit_transform(df_cleaned[features_to_scale])

print(f"\n✓ Features successfully normalized to [0, 1]")
print(f"\nNormalized data statistics:")
print(df_cleaned[features_to_scale].describe())


STEP 5: FEATURE NORMALIZATION (Min-Max Scaling)


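NameError: name 'MinMaxScaler' is not defined

**Note**: As with `SimpleImputer`, this error comes from the recorded run's import cell in Section 1 stopping at the `matplotlib` import, before the `sklearn.preprocessing` line was reached. Once Section 1 completes, this cell should run as written.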
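
The cell above fits `MinMaxScaler` on every row. If the data is later split into training and test sets (Section 12), the test rows then influence the scaling parameters. A common alternative is to fit the scaler on the training split only and reuse it on the test split. The sketch below illustrates this on synthetic data. The frame `X` and the 80-20 split are assumptions for illustration, not part of the recorded workflow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the cleaned feature columns (hypothetical data)
X = pd.DataFrame(
    rng.uniform(0, 1000, size=(100, 3)),
    columns=["Feature_1", "Feature_2", "Feature_3"],
)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = MinMaxScaler(feature_range=(0, 1))

# Learn min and max from the training rows only
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
# Apply the same parameters to the test rows without refitting
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.describe().loc[["min", "max"]])
print(X_test_scaled.describe().loc[["min", "max"]])
```

With this approach, scaled test values can fall slightly outside [0, 1] when a test row exceeds the training range. That is expected and reflects what the model would see on unseen data.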

## 9. Final Data Quality Check

In [None]:
print("\n" + "="*60)
print("CLEANED DATA SUMMARY")
print("="*60)

print(f"\nFinal Dataset Shape: {df_cleaned.shape}")
print(f"Total Rows: {len(df_cleaned)}")
print(f"Total Columns: {len(df_cleaned.columns)}")
print(f"Missing Values: {df_cleaned.isnull().sum().sum()}")

print(f"\nFirst 5 rows of cleaned data:")
print(df_cleaned.head())

print(f"\nDescriptive Statistics:")
print(df_cleaned.describe())

## 10. Data Comparison: Raw vs Cleaned

In [8]:
print("\n" + "="*60)
print("DATA COMPARISON: RAW vs CLEANED")
print("="*60)

# Count duplicates directly in each frame rather than hard-coding 0 for the raw data
comparison_summary = pd.DataFrame({
    'Metric': ['Total Rows', 'Total Columns', 'Missing Values', 'Duplicate Rows', 'Outliers Handled'],
    'Raw Data': [len(df_raw), len(df_raw.columns), df_raw.isnull().sum().sum(),
                 df_raw.duplicated().sum(), 0],
    'Cleaned Data': [len(df_cleaned), len(df_cleaned.columns), df_cleaned.isnull().sum().sum(),
                     df_cleaned.duplicated().sum(), outlier_count]
})

print(comparison_summary.to_string(index=False))


DATA COMPARISON: RAW vs CLEANED


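NameError: name 'outlier_count' is not defined

**Note**: `outlier_count` is defined in the Section 5 cell, which was not executed in the recorded run (it is shown as `In [None]`). Run Section 5 before this cell.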

## 11. Save Processed Data

In [None]:
# Save cleaned data
df_cleaned.to_csv('renewable_energy_cleaned.csv', index=False)
print("✓ Cleaned data saved as 'renewable_energy_cleaned.csv'")

# Save comparison summary
comparison_summary.to_csv('data_preprocessing_summary.csv', index=False)
print("✓ Preprocessing summary saved as 'data_preprocessing_summary.csv'")

print("\n" + "="*60)
print("DATA PREPROCESSING COMPLETE!")
print("="*60)
print("\nFiles:")
print("1. renewable_energy_raw.csv - Original raw dataset (input, unchanged)")
print("2. renewable_energy_cleaned.csv - Cleaned and processed dataset (generated)")
print("3. data_preprocessing_summary.csv - Comparison summary (generated)")

## 12. Next Steps for ML Model Development

Now that we have clean, preprocessed data, the next steps are:

1. **Train-Test Split**: Split data into training and testing sets (80-20 or 70-30)
2. **Feature Engineering**: Create new features if needed (e.g., lagged features for time-series)
3. **Model Selection**: Choose appropriate models:
   - LSTM (Long Short-Term Memory) for time-series forecasting
   - Random Forest for non-linear relationships
   - Gradient Boosting for high accuracy
   - XGBoost for improved performance
4. **Model Training**: Train models on the cleaned dataset
5. **Model Evaluation**: Use metrics like RMSE, MAE, R² Score
6. **Hyperparameter Tuning**: Optimize model parameters using GridSearchCV or RandomizedSearchCV
7. **Predictions & Visualization**: Generate predictions and visualize results
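
As a rough sketch of steps 1, 4, and 5 above, the example below splits a dataset 80-20, trains a Random Forest regressor, and reports RMSE, MAE, and R². It uses synthetic data as a stand-in, because the cleaned CSV is not available here. The target formula, feature count, and hyperparameters are all assumptions chosen for illustration, not results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_samples = 300

# Synthetic stand-in for scaled features and target (hypothetical relationship)
X = pd.DataFrame(
    rng.uniform(0, 1, size=(n_samples, 4)),
    columns=[f"Feature_{i}" for i in range(1, 5)],
)
y = 0.6 * X["Feature_1"] + 0.3 * X["Feature_2"] ** 2 + rng.normal(0, 0.02, n_samples)

# Step 1: 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: train a Random Forest on the training split
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out test split
pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}  R2: {r2:.4f}")
```

To use the real data, replace `X` and `y` with the feature columns and `Target` column loaded from `renewable_energy_cleaned.csv`.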

**Good luck with your AICTE/Shell/Edunet internship project!**