# Solar Insights AI: Data Understanding

This notebook explores the solar measurement data in the Sierra Leone (Bumbuna) dataset. The goal is to understand the dataset's structure, its features, and its basic statistics.

Key steps in this analysis include:
1. Loading and inspecting the dataset
2. Examining basic statistics and data distributions
3. Checking for missing values


In [1]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the Sierra Leone dataset
df_sierraleone = pd.read_csv('data/sierraleone-bumbuna.csv')

In [3]:
# Display the first five rows
print("First five rows of the dataset:")
display(df_sierraleone.head())

First five rows of the dataset:


,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
0,2021-10-30 00:01,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.1,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
1,2021-10-30 00:02,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.2,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
2,2021-10-30 00:03,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.2,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
3,2021-10-30 00:04,-0.7,0.0,-0.8,0.0,0.0,21.9,99.3,0.0,0.0,0.0,0.0,0.0,1002,0,0.1,22.3,22.6,
4,2021-10-30 00:05,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.3,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,


In [4]:
# Remove the Comments column (it is empty in the rows shown above)
del df_sierraleone['Comments']

# Get a summary of the data
print("\nDataFrame Info:")
df_sierraleone.info()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525600 entries, 0 to 525599
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Timestamp      525600 non-null  object 
 1   GHI            525600 non-null  float64
 2   DNI            525600 non-null  float64
 3   DHI            525600 non-null  float64
 4   ModA           525600 non-null  float64
 5   ModB           525600 non-null  float64
 6   Tamb           525600 non-null  float64
 7   RH             525600 non-null  float64
 8   WS             525600 non-null  float64
 9   WSgust         525600 non-null  float64
 10  WSstdev        525600 non-null  float64
 11  WD             525600 non-null  float64
 12  WDstdev        525600 non-null  float64
 13  BP             525600 non-null  int64  
 14  Cleaning       525600 non-null  int64  
 15  Precipitation  525600 non-null  float64
 16  TModA          525600 non-null  float64
 17  TModB          525600 non-null  float64
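
The `info()` output lists `Timestamp` with dtype `object`, meaning the values are stored as plain strings. A minimal sketch of converting the column to a datetime type is shown below. The helper name `parse_timestamps` and the small `sample` frame are hypothetical and only for illustration; the format string is an assumption based on the values shown in the `head()` output (for example `2021-10-30 00:01`).

```python
import pandas as pd

def parse_timestamps(df: pd.DataFrame, column: str = "Timestamp") -> pd.DataFrame:
    """Return a copy of df with `column` converted to a datetime dtype.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = df.copy()
    # Assumed format, inferred from values such as "2021-10-30 00:01"
    out[column] = pd.to_datetime(out[column], format="%Y-%m-%d %H:%M")
    return out

# Small illustrative frame using the first timestamps shown in head()
sample = pd.DataFrame({
    "Timestamp": ["2021-10-30 00:01", "2021-10-30 00:02", "2021-10-30 00:03"],
    "GHI": [-0.7, -0.7, -0.7],
})

parsed = parse_timestamps(sample)
print(parsed["Timestamp"].dtype)

# Gap between consecutive readings; a single value suggests a regular cadence
steps = parsed["Timestamp"].diff().dropna()
print(steps.value_counts())
```

On the full dataset, the same call would be `parse_timestamps(df_sierraleone)`. Checking the gaps between consecutive timestamps is one way to confirm whether the 525,600 rows form an unbroken one-minute series.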

In [5]:
# Get summary statistics
print("\nSummary Statistics:")
display(df_sierraleone.describe())


Summary Statistics:


,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB
count,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0
mean,201.957515,116.376337,113.720571,206.643095,198.114691,26.319394,79.448857,1.146113,1.691606,0.363823,133.044668,7.17222,999.876469,0.000967,0.004806,32.504263,32.593091
std,298.49515,218.652659,158.946032,300.896893,288.889073,4.398605,20.520775,1.239248,1.617053,0.295,114.284792,7.535093,2.104419,0.031074,0.047556,12.434899,12.009161
min,-19.5,-7.8,-17.9,0.0,0.0,12.3,9.9,0.0,0.0,0.0,0.0,0.0,993.0,0.0,0.0,10.7,11.1
25%,-2.8,-0.3,-3.8,0.0,0.0,23.1,68.7,0.0,0.0,0.0,0.0,0.0,999.0,0.0,0.0,23.5,23.8
50%,0.3,-0.1,-0.1,3.6,3.4,25.3,85.4,0.8,1.6,0.4,161.5,6.2,1000.0,0.0,0.0,26.6,26.9
75%,362.4,107.0,224.7,359.5,345.4,29.4,96.7,2.0,2.6,0.6,234.1,12.0,1001.0,0.0,0.0,40.9,41.3
max,1499.0,946.0,892.0,1507.0,1473.0,39.9,100.0,19.2,23.9,4.1,360.0,98.4,1006.0,1.0,2.4,72.8,70.4
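
The summary statistics show negative minimums for `GHI`, `DNI`, and `DHI`. One way to see how common such readings are is to count them per column, as in the sketch below. The helper `negative_share` is a hypothetical name, and the `sample` frame reuses only the five rows shown in the `head()` output, so its percentages say nothing about the full dataset.

```python
import pandas as pd

IRRADIANCE_COLS = ["GHI", "DNI", "DHI"]

def negative_share(df: pd.DataFrame, columns=IRRADIANCE_COLS) -> pd.DataFrame:
    """Count negative readings per column and express them as a percentage of rows.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    negatives = (df[columns] < 0).sum()
    return pd.DataFrame({
        "negative_count": negatives,
        "negative_pct": (negatives / len(df) * 100).round(2),
    })

# The five rows displayed by head(), irradiance columns only
sample = pd.DataFrame({
    "GHI": [-0.7, -0.7, -0.7, -0.7, -0.7],
    "DNI": [-0.1, -0.1, -0.1, 0.0, -0.1],
    "DHI": [-0.8, -0.8, -0.8, -0.8, -0.8],
})

report = negative_share(sample)
print(report)
```

On the full dataset this would be called as `negative_share(df_sierraleone)`. How to treat the negative values (for example clipping them to zero) is a cleaning decision that this notebook does not make.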


In [6]:
# Check for missing values
print("\nMissing Values in Each Column:")
missing_values = df_sierraleone.isnull().sum()
display(missing_values)


Missing Values in Each Column:


Timestamp        0
GHI              0
DNI              0
DHI              0
ModA             0
ModB             0
Tamb             0
RH               0
WS               0
WSgust           0
WSstdev          0
WD               0
WDstdev          0
BP               0
Cleaning         0
Precipitation    0
TModA            0
TModB            0
dtype: int64
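
The counts above were computed after the `Comments` column had been deleted. A sketch of a per-column summary that reports both counts and percentages is given below; run before a column is dropped, it can show whether the column really is empty. The helper `missing_summary` and the `sample` frame are hypothetical, with `Comments` set to missing values to mirror the blank entries in the `head()` output.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing_count", ascending=False)

# Illustrative frame: two measured columns plus an empty Comments column
sample = pd.DataFrame({
    "GHI": [-0.7, -0.7, -0.7],
    "Tamb": [21.9, 21.9, 21.9],
    "Comments": [None, None, None],
})

summary = missing_summary(sample)
print(summary)
```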

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df_sierraleone.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### Key Findings

1. **Data Completeness**:
   - After the `Comments` column is removed, the remaining 18 columns contain no missing values across all 525,600 rows.

2. **Data Distribution**:
   - The summary statistics describe the center and spread of each variable. For example:
     - `GHI` (Global Horizontal Irradiance) has a mean of approximately 201.96 and a standard deviation of 298.50.
     - `DNI` (Direct Normal Irradiance) has a mean of approximately 116.38 and a standard deviation of 218.65.
     - `Tamb` (Ambient Temperature) has a mean of approximately 26.32°C and a standard deviation of 4.40°C.
     - `RH` (Relative Humidity) has a mean of approximately 79.45% and a standard deviation of 20.52%.
   - The median `GHI` is only 0.3, so about half of the readings are at or below that value, which is consistent with night-time measurements.

3. **Data Range**:
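   - The minimum and maximum values in the summary statistics give the range of each column, such as:
     - `GHI` ranges from -19.5 to 1499.0.
     - `DNI` ranges from -7.8 to 946.0.
     - `Tamb` ranges from 12.3°C to 39.9°C.
     - `RH` ranges from 9.9% to 100.0%.
   - The negative irradiance minimums are worth reviewing during data cleaning.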
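
The figures quoted in the findings can be pulled directly from `describe()` rather than copied by hand, which reduces the risk of transcription errors. The sketch below shows one way to do this. The helper `findings_table` is a hypothetical name, and the values in `sample` are illustrative numbers only, not rows from the dataset.

```python
import pandas as pd

def findings_table(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return mean, std, min and max for the chosen columns, one row per column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    stats = df[columns].describe().loc[["mean", "std", "min", "max"]]
    return stats.round(2).T

# Illustrative values only; not actual rows from the dataset
sample = pd.DataFrame({
    "GHI": [-0.7, 0.3, 362.4, 1499.0],
    "Tamb": [21.9, 23.1, 29.4, 39.9],
})

table = findings_table(sample, ["GHI", "Tamb"])

# Print one summary line per column in the style of the findings above
for name, row in table.iterrows():
    print(f"`{name}`: mean {row['mean']}, std {row['std']}, "
          f"range {row['min']} to {row['max']}")
```

On the full dataset, `findings_table(df_sierraleone, ["GHI", "DNI", "Tamb", "RH"])` would produce the four rows summarized in the findings.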