# Cloud Cover and Rainfall Prediction for Solar Panel Output in Tamil Nadu

This Jupyter notebook demonstrates how to predict cloud cover and rainfall probability in Tamil Nadu, India, to estimate their impact on solar panel energy output. It uses `rioxarray`, `xarray`, `rasterio` (built on GDAL), and Dask to work with geospatial datasets. The notebook converts non-spatial CSV weather data into gridded spatial datasets, performs spatial analysis, and produces visualizations for a demo to a solar panel installation company.

## Objectives
- Load and process large weather datasets (CSV and raster) using Dask for memory efficiency.
- Convert CSV data with coordinates into spatial datasets using `rioxarray`.
- Analyze cloud cover and rainfall probability.
- Estimate reduced solar panel output due to weather conditions.
- Visualize results for a demo in Chennai, Tamil Nadu.

## Prerequisites
- Install required packages: `pip install rioxarray xarray "dask[distributed]" geopandas pandas numpy scikit-learn matplotlib rasterio`
- Download sample weather data (e.g., from NASA or IMD) or use synthetic data as shown below.

In [None]:
# Import libraries
import pandas as pd
import geopandas as gpd
import xarray as xr
import rioxarray
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import rasterio
import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Create data directory
os.makedirs("data/weather", exist_ok=True)

# Initialize Dask client for parallel computing
try:
    cluster = LocalCluster()
    client = Client(cluster)
    print(client)
except Exception as e:
    print(f"Could not initialize Dask client: {e}")
    print("Continuing without Dask distributed computing...")

## Step 1: Load and Prepare Non-Spatial CSV Data

We simulate a CSV dataset of weather observations (temperature, humidity, cloud cover, precipitation) with latitude and longitude coordinates inside Tamil Nadu's bounding box. The demo file is small enough to load with pandas; for files that do not fit in RAM, `dask.dataframe` can be used instead.
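
When a CSV is too large to load at once, the same out-of-core idea can be sketched without Dask by streaming the file in chunks with pandas. This is a minimal sketch, not the notebook's method: the tiny in-memory CSV, the `chunksize` value, and the helper name `chunked_mean` are illustrative assumptions.

```python
import io
import pandas as pd

def chunked_mean(csv_source, column, chunksize=1000):
    """Compute the mean of one column while holding only one chunk in memory."""
    total, count = 0.0, 0
    # With chunksize set, pd.read_csv yields DataFrames one chunk at a time
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += chunk[column].count()  # count() skips NaN, matching sum()
    return total / count if count else float("nan")

# Hypothetical stand-in for a large weather file
csv_text = "latitude,longitude,cloud_cover\n13.1,80.2,40\n11.0,78.0,60\n9.5,77.5,80\n"
mean_cloud = chunked_mean(io.StringIO(csv_text), "cloud_cover", chunksize=2)
print(mean_cloud)
```

With Dask, `dd.read_csv(path)` followed by a column `.mean().compute()` expresses the same computation lazily and in parallel.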

In [None]:
# Simulate large CSV weather data
np.random.seed(42)
n_samples = 10000  # Reduced for demo, use larger for real data
data = {
    'date': pd.date_range('2024-01-01', periods=n_samples, freq='h'),  # Hourly; 'H' is deprecated in recent pandas
    'latitude': np.random.uniform(8.0, 13.5, n_samples),  # Tamil Nadu lat range
    'longitude': np.random.uniform(77.0, 80.3, n_samples),  # Tamil Nadu lon range
    'temperature': np.random.normal(30, 5, n_samples),  # Celsius
    'humidity': np.random.uniform(50, 90, n_samples),  # %
    'cloud_cover': np.random.uniform(0, 100, n_samples),  # %
    'precipitation': np.random.choice([0, 1], n_samples, p=[0.8, 0.2])  # Binary (0: no rain, 1: rain)
}
df = pd.DataFrame(data)
weather_csv_path = 'data/weather/weather_data_tn.csv'
df.to_csv(weather_csv_path, index=False)

# Load CSV with Pandas (or Dask for very large files)
df = pd.read_csv(weather_csv_path)
print(df.head())

## Step 2: Convert CSV to Spatial Dataset

Convert the DataFrame to a GeoDataFrame and then to an `xarray` Dataset with `rioxarray` for spatial analysis.
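
The point-to-grid step can also be expressed without nested loops. The sketch below assigns each observation to exactly one grid cell with `pd.cut` and averages per cell. Note that this differs from the notebook's loop, which averages all points within 0.5 degrees of each grid node. The sample points and the 2 x 2 grid are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical observations (not real data)
points = pd.DataFrame({
    "latitude":    [8.5, 8.7, 12.9, 13.2],
    "longitude":   [77.2, 77.4, 80.0, 79.9],
    "cloud_cover": [20.0, 40.0, 70.0, 90.0],
})

# Bin edges for a coarse 2 x 2 grid over the Tamil Nadu bounding box
lat_edges = np.linspace(8.0, 13.5, 3)
lon_edges = np.linspace(77.0, 80.3, 3)

# Label each point with the index of the cell it falls in
points["lat_idx"] = pd.cut(points["latitude"], lat_edges, labels=False, include_lowest=True)
points["lon_idx"] = pd.cut(points["longitude"], lon_edges, labels=False, include_lowest=True)

# Average per cell, then reshape to a (lat, lon) grid; empty cells stay NaN
grid = (
    points.groupby(["lat_idx", "lon_idx"])["cloud_cover"].mean()
    .unstack("lon_idx")
    .reindex(index=range(2), columns=range(2))
)
print(grid)
```

Leaving empty cells as NaN keeps them distinguishable from cells whose observed average happens to be zero.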

In [None]:
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs="EPSG:4326"
)

# Create a time, lat, lon xarray grid
# For demonstration, we'll create a coarser grid
lat_bins = np.linspace(8.0, 13.5, 20)
lon_bins = np.linspace(77.0, 80.3, 20)

# Use a smaller time sample for the demo (first five days only)
unique_dates = pd.to_datetime(df.date).dt.date.unique()[:5]
time_stamps = pd.to_datetime([f"{date} 12:00:00" for date in unique_dates])

# Create a dataset initialized with NaN so that empty grid cells count as missing
grid_shape = (len(time_stamps), len(lat_bins), len(lon_bins))
ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], np.full(grid_shape, np.nan)),
        'humidity': (['time', 'lat', 'lon'], np.full(grid_shape, np.nan)),
        'cloud_cover': (['time', 'lat', 'lon'], np.full(grid_shape, np.nan)),
        'precipitation': (['time', 'lat', 'lon'], np.full(grid_shape, np.nan))
    },
    coords={
        'time': time_stamps,
        'lat': lat_bins,
        'lon': lon_bins
    }
)

# Fill dataset with the average of nearby points (simple neighborhood averaging)
for t, timestamp in enumerate(time_stamps):
    date_str = timestamp.strftime('%Y-%m-%d')
    day_data = df[pd.to_datetime(df.date).dt.strftime('%Y-%m-%d') == date_str]
    
    # For each lat-lon grid cell, find nearby points and take their average
    for i, lat in enumerate(lat_bins):
        for j, lon in enumerate(lon_bins):
            # Find points within 0.5 degrees of this grid cell
            nearby = day_data[
                (np.abs(day_data.latitude - lat) < 0.5) & 
                (np.abs(day_data.longitude - lon) < 0.5)
            ]
            
            if len(nearby) > 0:
                ds['temperature'][t, i, j] = nearby.temperature.mean()
                ds['humidity'][t, i, j] = nearby.humidity.mean()
                ds['cloud_cover'][t, i, j] = nearby.cloud_cover.mean()
                ds['precipitation'][t, i, j] = nearby.precipitation.mean()

# Add CRS information
ds.rio.write_crs("EPSG:4326", inplace=True)
print(ds)

## Step 3: Create Synthetic Raster Data

Create a synthetic GeoTIFF for cloud cover to demonstrate raster processing. In a real scenario, you would use actual satellite data.
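
The synthetic west-to-east gradient can be built with NumPy broadcasting instead of nested loops. This is a minimal sketch assuming cloud cover rises linearly toward the east coast. The gradient endpoints (30% to 90%) and the noise level mirror the values used in this step, and `default_rng` is swapped in for the legacy `np.random` interface.

```python
import numpy as np

rng = np.random.default_rng(42)
height, width = 50, 50

# Column fraction: 0 at the western edge, 1 at the eastern edge
east_fraction = np.linspace(0.0, 1.0, width)

# Broadcast the 1-D gradient across all rows and add Gaussian noise
cloud = 30 + 60 * east_fraction[np.newaxis, :] + rng.normal(0, 10, size=(height, width))

# Keep values within the valid percentage range
cloud = np.clip(cloud, 0, 100).astype(np.float32)

print(cloud.shape, cloud[:, :5].mean(), cloud[:, -5:].mean())
```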

In [None]:
# Create synthetic GeoTIFF for cloud cover
cloud_tiff_path = 'data/weather/cloud_cover_tn.tiff'

# Generate synthetic cloud cover pattern (higher near the east coast)
height, width = 50, 50
cloud_data = np.zeros((height, width), dtype=np.float32)

# Column 0 is the western edge (77.0°E); the last column is the east coast (80.3°E)
for i in range(width):
    for j in range(height):
        # Proximity to the east coast, from 0 (west) to 1 (east)
        east_proximity = i / (width - 1)
        # More clouds near the coast
        cloud_data[j, i] = 30 + 60 * east_proximity + np.random.normal(0, 10)

# Clip values to valid range
cloud_data = np.clip(cloud_data, 0, 100)

# Write to GeoTIFF
with rasterio.open(
    cloud_tiff_path, 'w', driver='GTiff',
    height=height, width=width, count=1, dtype='float32',
    crs='EPSG:4326',
    transform=rasterio.transform.from_bounds(77.0, 8.0, 80.3, 13.5, width, height)
) as dst:
    dst.write(cloud_data, 1)

# Load raster with rioxarray
cloud_raster = rioxarray.open_rasterio(cloud_tiff_path)
print(cloud_raster)

# Plot cloud cover raster
plt.figure(figsize=(10, 8))
cloud_raster.plot(cmap='Blues')
plt.title('Cloud Cover Raster (%) - Tamil Nadu')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

## Step 4: Predict Rainfall Probability

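Train a Random Forest classifier to predict rainfall probability from the simulated data. Because the synthetic `precipitation` labels are drawn independently of temperature, humidity, and cloud cover, the model has no real signal to learn here; the reported accuracy mainly reflects the 80/20 class split, not predictive skill.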
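
Accuracy alone can be misleading when one class dominates. As a sanity check, the sketch below computes the accuracy of a trivial predictor that always outputs the majority class, using labels drawn with the same 80/20 split as the simulated `precipitation` column. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Labels drawn like the simulated precipitation column: 80% dry, 20% rain
y = rng.choice([0, 1], size=10_000, p=[0.8, 0.2])

# A trivial model that always predicts the most frequent class
majority_class = np.bincount(y).argmax()
baseline_accuracy = np.mean(y == majority_class)

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

A classifier is only informative if it clearly beats this baseline; per-class metrics such as recall for the rain class are usually more telling than raw accuracy in this setting.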

In [None]:
# Prepare data for machine learning
X = df[['temperature', 'humidity', 'cloud_cover']]
y_rain = df['precipitation']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_rain, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print(f"Rainfall Prediction Accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Show feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.bar(feature_importance['Feature'], feature_importance['Importance'])
plt.title('Feature Importance for Rainfall Prediction')
plt.ylabel('Importance')
plt.show()

## Step 5: Predict Rainfall Probability Across Tamil Nadu

Apply the trained model to grid data to create a rainfall probability map.

In [None]:
# Function to predict rain probability for a given grid
def predict_rain_probability(ds, time_idx):
    # Extract features for the specific time
    temp = ds['temperature'].isel(time=time_idx).values.flatten()
    humidity = ds['humidity'].isel(time=time_idx).values.flatten()
    cloud = ds['cloud_cover'].isel(time=time_idx).values.flatten()
    
    # Create feature array for prediction
    features = np.column_stack([temp, humidity, cloud])
    
    # Remove missing values
    valid_idx = ~np.isnan(features).any(axis=1)
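    # Wrap in a DataFrame so the column names match those used to fit the model
    valid_features = pd.DataFrame(features[valid_idx], columns=X.columns)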
    
    # Predict probabilities
    if len(valid_features) > 0:
        probas = rf.predict_proba(valid_features)[:, 1]  # Probability of class 1 (rain)
        
        # Create result array (initialize with NaN)
        result = np.full(temp.shape, np.nan)
        result[valid_idx] = probas  # Assign values to valid indices
        
        # Reshape back to grid dimensions
        return result.reshape(ds['temperature'].isel(time=time_idx).shape)
    else:
        return np.full_like(temp.reshape(ds['temperature'].isel(time=time_idx).shape), np.nan)

# Calculate rain probability for each time step
rain_probas = []
for t in range(len(ds.time)):
    rain_probas.append(predict_rain_probability(ds, t))

# Add to dataset
ds['rain_probability'] = (['time', 'lat', 'lon'], np.stack(rain_probas))

# Plot rainfall probability for the first time step
plt.figure(figsize=(10, 8))
ds['rain_probability'].isel(time=0).plot(cmap='Blues', vmin=0, vmax=1)
plt.title(f'Rainfall Probability - {ds.time.values[0]}')
plt.show()

## Step 6: Estimate Solar Panel Output Reduction

Estimate the reduction in solar panel output based on cloud cover and rainfall. Assume a simple linear model: output falls by 0.5 percentage points for every 1% of cloud cover, and by a further 20 percentage points when the predicted rain probability exceeds 0.5.
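
Written out, the assumed model for cloud cover $c$ (in percent) and predicted rain probability $p$ is shown below, followed by a worked example for $c = 40$ and $p = 0.7$:

$$
\text{output}(c, p) = 100 - 0.5\,c - 20 \cdot \mathbf{1}[p > 0.5]
$$

$$
\text{output}(40, 0.7) = 100 - 0.5 \cdot 40 - 20 = 60
$$

Under this model the output never drops below 30%, which is reached at full cloud cover with rain predicted.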

In [None]:
# Solar output reduction model
def calculate_solar_output(cloud_cover, rain_prob):
    base_output = 100  # Max output in clear conditions (%)
    cloud_reduction = cloud_cover * 0.5  # 0.5% reduction per 1% cloud cover
    rain_reduction = np.where(rain_prob > 0.5, 20, 0)  # 20% reduction if rain probability > 50%
    return base_output - cloud_reduction - rain_reduction

# Apply to dataset
ds['solar_output'] = calculate_solar_output(ds['cloud_cover'], ds['rain_probability'])

# Plot solar output for the first time step
plt.figure(figsize=(10, 8))
ds['solar_output'].isel(time=0).plot(cmap='viridis', vmin=0, vmax=100)
plt.title(f'Estimated Solar Panel Output (%) - {ds.time.values[0]}')
plt.show()

## Step 7: Visualize Results for Chennai

Create maps and plots for Chennai to demonstrate weather impacts on solar output.
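
Because the grid is coarse, the nearest grid point can sit some distance from the city itself. The sketch below reproduces the nearest-neighbour lookup with plain NumPy on the 20 x 20 grid defined in Step 2, to show which grid node stands in for Chennai.

```python
import numpy as np

# Same grid coordinates as in Step 2
lat_bins = np.linspace(8.0, 13.5, 20)
lon_bins = np.linspace(77.0, 80.3, 20)

chennai_lat, chennai_lon = 13.08, 80.27

# Index of the grid coordinate closest to the target along each axis
lat_idx = int(np.abs(lat_bins - chennai_lat).argmin())
lon_idx = int(np.abs(lon_bins - chennai_lon).argmin())

print(f"Nearest node: lat={lat_bins[lat_idx]:.2f}, lon={lon_bins[lon_idx]:.2f}")
```

On this grid the lookup resolves to the node at roughly 13.21°N, 80.30°E, so the "Chennai" series describes a grid node slightly north-east of the city, not the city centre.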

In [None]:
# Extract data for Chennai (approx. lat: 13.08, lon: 80.27)
# Find nearest grid point to Chennai
chennai_lat, chennai_lon = 13.08, 80.27
chennai_ds = ds.sel(lat=chennai_lat, lon=chennai_lon, method='nearest')

# Plot time series for Chennai
fig, axes = plt.subplots(3, 1, figsize=(12, 12), sharex=True)

# Plot cloud cover
chennai_ds['cloud_cover'].plot(ax=axes[0], marker='o')
axes[0].set_title('Cloud Cover in Chennai')
axes[0].set_ylabel('Cloud Cover (%)')
axes[0].grid(True)

# Plot rain probability
chennai_ds['rain_probability'].plot(ax=axes[1], marker='o', color='blue')
axes[1].set_title('Rain Probability in Chennai')
axes[1].set_ylabel('Probability')
axes[1].grid(True)

# Plot solar output
chennai_ds['solar_output'].plot(ax=axes[2], marker='o', color='green')
axes[2].set_title('Estimated Solar Output in Chennai')
axes[2].set_ylabel('Output (%)')
axes[2].set_xlabel('Time')
axes[2].grid(True)

plt.tight_layout()
plt.show()

## Step 8: Save Results for Demo

Save the processed dataset and visualizations for the solar panel company.

In [None]:
# Save dataset to NetCDF file
output_path = 'data/weather/solar_weather_tn.nc'
ds.to_netcdf(output_path)
print(f"Dataset saved to {output_path}")

# Save key plots
plt.figure(figsize=(10, 8))
ds['solar_output'].isel(time=0).plot(cmap='viridis', vmin=0, vmax=100)
plt.title(f'Estimated Solar Panel Output (%) - {ds.time.values[0]}')
plt.savefig('data/weather/solar_output_map.png', dpi=300, bbox_inches='tight')

# Save time series for Chennai
fig, axes = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
chennai_ds['cloud_cover'].plot(ax=axes[0], marker='o')
axes[0].set_title('Cloud Cover in Chennai')
chennai_ds['solar_output'].plot(ax=axes[1], marker='o', color='green')
axes[1].set_title('Estimated Solar Output in Chennai')
plt.tight_layout()
plt.savefig('data/weather/chennai_solar_output.png', dpi=300, bbox_inches='tight')
print("Plots saved to data/weather/")

## Conclusion

This notebook demonstrates how to process weather datasets, convert CSV data to spatial datasets, predict rainfall, and estimate solar panel output reductions. The visualizations and saved outputs are ready for a demo to a solar panel installation company in Tamil Nadu. For real-world use, replace synthetic data with actual weather data from sources like the India Meteorological Department (IMD) or NASA.