# Solar Insights AI: Data Understanding

This Jupyter notebook explores solar measurement data from the Togo-Dapaong_qc dataset. The goal is to understand the dataset's structure, its features, and their basic statistics.

Key steps in this analysis include:
1. Loading and inspecting the dataset
2. Examining basic statistics and data distributions
3. Checking for and visualizing missing values


In [1]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the Togo dataset
df_togo = pd.read_csv('data/togo-dapaong_qc.csv')

In [3]:
# Display the first five rows
print("First five rows of the dataset:")
display(df_togo.head())

First five rows of the dataset:


Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
0,2021-10-25 00:01,-1.3,0.0,0.0,0.0,0.0,24.8,94.5,0.9,1.1,0.4,227.6,1.1,977,0,0.0,24.7,24.4,
1,2021-10-25 00:02,-1.3,0.0,0.0,0.0,0.0,24.8,94.4,1.1,1.6,0.4,229.3,0.7,977,0,0.0,24.7,24.4,
2,2021-10-25 00:03,-1.3,0.0,0.0,0.0,0.0,24.8,94.4,1.2,1.4,0.3,228.5,2.9,977,0,0.0,24.7,24.4,
3,2021-10-25 00:04,-1.2,0.0,0.0,0.0,0.0,24.8,94.3,1.2,1.6,0.3,229.1,4.6,977,0,0.0,24.7,24.4,
4,2021-10-25 00:05,-1.2,0.0,0.0,0.0,0.0,24.8,94.0,1.3,1.6,0.4,227.5,1.6,977,0,0.0,24.7,24.4,


In [4]:
# Remove the Comments column (errors='ignore' keeps the cell safe to re-run)
df_togo = df_togo.drop(columns=['Comments'], errors='ignore')

# Get a summary of the data
print("\nDataFrame Info:")
df_togo.info()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525600 entries, 0 to 525599
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Timestamp      525600 non-null  object 
 1   GHI            525600 non-null  float64
 2   DNI            525600 non-null  float64
 3   DHI            525600 non-null  float64
 4   ModA           525600 non-null  float64
 5   ModB           525600 non-null  float64
 6   Tamb           525600 non-null  float64
 7   RH             525600 non-null  float64
 8   WS             525600 non-null  float64
 9   WSgust         525600 non-null  float64
 10  WSstdev        525600 non-null  float64
 11  WD             525600 non-null  float64
 12  WDstdev        525600 non-null  float64
 13  BP             525600 non-null  int64  
 14  Cleaning       525600 non-null  int64  
 15  Precipitation  525600 non-null  float64
 16  TModA          525600 non-null  float64
 17  TModB          525600 non-null  float64
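
The summary above lists `Timestamp` with dtype `object`, meaning the values are stored as plain strings. A minimal sketch of converting them to datetimes follows. It uses a small stand-in frame named `sample` (a hypothetical name) built from the first rows shown earlier, and the format string is an assumption based on those rows.

```python
import pandas as pd

# Stand-in frame built from the first rows displayed earlier; in the notebook
# the same steps would be applied to df_togo.
sample = pd.DataFrame({
    "Timestamp": ["2021-10-25 00:01", "2021-10-25 00:02", "2021-10-25 00:03"],
    "GHI": [-1.3, -1.3, -1.3],
})

# Parse the strings into datetime values. The format is an assumption inferred
# from the displayed rows (year-month-day hour:minute).
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%Y-%m-%d %H:%M")

# Using the timestamps as the index enables time-based slicing and resampling.
sample = sample.set_index("Timestamp")
print(sample.index.dtype)
```

With a datetime index in place, later steps such as hourly averages or day and night splits become simple to express.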

In [5]:
# Get summary statistics
print("\nSummary Statistics:")
display(df_togo.describe())


Summary Statistics:


Unnamed: 0,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB
count,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0,525600.0
mean,230.55504,151.258469,116.444352,226.144375,219.568588,27.751788,55.01316,2.368093,3.22949,0.55774,161.741845,10.559568,975.915242,0.000535,0.001382,32.444403,33.54333
std,322.532347,250.956962,156.520714,317.346938,307.93251,4.758023,28.778732,1.462668,1.882565,0.268923,91.877217,5.91549,2.153977,0.023116,0.02635,10.998334,12.769277
min,-12.7,0.0,0.0,0.0,0.0,14.9,3.3,0.0,0.0,0.0,0.0,0.0,968.0,0.0,0.0,13.1,13.1
25%,-2.2,0.0,0.0,0.0,0.0,24.2,26.5,1.4,1.9,0.4,74.8,6.9,975.0,0.0,0.0,23.9,23.6
50%,2.1,0.0,2.5,4.4,4.3,27.2,59.3,2.2,2.9,0.5,199.1,10.8,976.0,0.0,0.0,28.4,28.4
75%,442.4,246.4,215.7,422.525,411.0,31.1,80.8,3.2,4.4,0.7,233.5,14.1,977.0,0.0,0.0,40.6,43.0
max,1424.0,1004.5,805.7,1380.0,1367.0,41.4,99.8,16.1,23.1,4.7,360.0,86.9,983.0,1.0,2.3,70.4,94.6
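
In the summary table, the median (`50%`) of the irradiance columns sits far below the mean, which suggests strongly right-skewed distributions. A minimal sketch of comparing the two follows. The `summary` frame (a hypothetical name) is filled by hand with values copied from the table above; in the notebook the same frame could be obtained with `df_togo.describe().T[['mean', '50%']]`.

```python
import pandas as pd

# Mean and median values copied from the summary statistics table above.
summary = pd.DataFrame(
    {
        "mean": [230.55504, 151.258469, 27.751788, 55.01316],
        "50%": [2.1, 0.0, 27.2, 59.3],
    },
    index=["GHI", "DNI", "Tamb", "RH"],
)

# A large positive gap means the mean is pulled upward by a long right tail.
summary["mean_minus_median"] = summary["mean"] - summary["50%"]

# Sort so the most skewed columns appear first.
print(summary.sort_values("mean_minus_median", ascending=False))
```

Columns with a small gap, such as `Tamb`, are roughly symmetric, while a large gap in `GHI` and `DNI` is consistent with many near-zero readings.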


In [6]:
# Check for missing values
print("\nMissing Values in Each Column:")
missing_values = df_togo.isnull().sum()
display(missing_values)


Missing Values in Each Column:


Timestamp        0
GHI              0
DNI              0
DHI              0
ModA             0
ModB             0
Tamb             0
RH               0
WS               0
WSgust           0
WSstdev          0
WD               0
WDstdev          0
BP               0
Cleaning         0
Precipitation    0
TModA            0
TModB            0
dtype: int64
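
The check above runs after `Comments` has already been removed, so it cannot show how empty that column was. A minimal sketch of measuring the missing fraction per column before dropping anything follows. The `sample` frame is a hypothetical stand-in based on the first rows shown earlier, where `Comments` is empty; whether the column is empty across the full dataset is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in frame mirroring the first rows displayed earlier, where Comments is empty.
sample = pd.DataFrame({
    "Timestamp": ["2021-10-25 00:01", "2021-10-25 00:02", "2021-10-25 00:03"],
    "GHI": [-1.3, -1.3, -1.3],
    "Comments": [np.nan, np.nan, np.nan],
})

# Percentage of missing entries in each column.
missing_pct = sample.isnull().mean() * 100

# Report only the columns that actually contain gaps, largest first.
report = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(report)

# Drop only the columns that are entirely empty, instead of naming them by hand.
cleaned = sample.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping with `how="all"` removes a column only when every value is missing, so partially filled columns survive for closer inspection.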

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df_togo.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### Key Findings

1. **Data Completeness**:
   - After the `Comments` column is dropped, none of the remaining 18 columns contains missing values across the 525,600 rows.

2. **Data Distribution**:
   - The summary statistics describe the center and spread of each column. For example:
     - `GHI` (Global Horizontal Irradiance) has a mean of approximately 230.56 W/m² and a standard deviation of 322.53 W/m².
     - `DNI` (Direct Normal Irradiance) has a mean of approximately 151.26 W/m² and a standard deviation of 250.96 W/m².
     - `Tamb` (Ambient Temperature) has a mean of approximately 27.75°C and a standard deviation of 4.76°C.
     - `RH` (Relative Humidity) has a mean of approximately 55.01% and a standard deviation of 28.78%.

3. **Data Range**:
   - The minimum and maximum values show the range of each column, for example:
     - `GHI` ranges from -12.7 to 1424.0 W/m².
     - `DNI` ranges from 0.0 to 1004.5 W/m².
     - `Tamb` ranges from 14.9°C to 41.4°C.
     - `RH` ranges from 3.3% to 99.8%.
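
Among these ranges, the negative lower bound of `GHI` stands out, since irradiance cannot physically fall below zero. Such readings are often attributed to sensor offset at night, though that cause is an assumption here and not something the notebook verifies. A minimal sketch of counting them follows, using a stand-in series built from the `GHI` minimum, quartiles, and maximum in the summary table. Clipping to zero is shown as one possible treatment, not as the notebook's method.

```python
import pandas as pd

# Stand-in values: the GHI min, 25%, 50%, 75%, and max from the summary table.
ghi = pd.Series([-12.7, -2.2, 2.1, 442.4, 1424.0], name="GHI")

# Flag readings below zero and report how many there are.
negative_mask = ghi < 0
print(f"Negative GHI readings: {negative_mask.sum()} of {len(ghi)}")

# One possible treatment (an assumption): clip negatives to zero in a copy,
# leaving the original series untouched for comparison.
ghi_clipped = ghi.clip(lower=0)
print(ghi_clipped.tolist())
```

In the notebook, the same mask applied to `df_togo['GHI']` would show what share of the 525,600 rows is affected before any cleaning decision is made.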