# Exploratory Data Analysis (EDA)

## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Handling Missing Values](#handling-missing-values)
3. [Feature Distributions](#feature-distributions)
4. [Possible Biases](#possible-biases)
5. [Correlations](#correlations)




In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Overview


- Life cycles of 4,500 clouds
- Simulated with the weather forecast model ICON
- Located above the Atlantic Ocean during the formation of Hurricane [Paulette](https://zoom.earth/storms/paulette-2020/)

### Background
A common method in cloud research is the application of cloud-tracking tools to study cloud life cycles and trajectories. The underlying data typically come from satellite imagery or from numerical weather prediction models. Clouds, however, are ephemeral, ever-changing objects, constantly shifting in shape and form. Tracking such transient features is therefore a challenging task.

The temporal resolution of satellite imagery is limited to a best case of around 5 minutes. Numerical models are not restricted in the same way, but they are limited by the amount of data that can realistically be written and stored. For cloud studies, this temporal resolution becomes a severe bottleneck: short-lived clouds may appear and disappear between the standard model output intervals of 15 to 60 minutes.
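
To make the bottleneck concrete, the following sketch counts how many output snapshots are guaranteed to fall within a cloud's lifetime for several output intervals. The 10-minute lifetime is an illustrative assumption, not a value taken from the dataset, and `min_snapshots` is a hypothetical helper.

```python
def min_snapshots(lifetime_s: float, interval_s: float) -> int:
    """Guaranteed minimum number of regularly spaced outputs that fall
    inside a cloud's lifetime, independent of when the cloud forms."""
    return int(lifetime_s // interval_s)

lifetime_s = 10 * 60  # assumed lifetime of a short-lived cloud: 10 minutes

# Output intervals in seconds: 30 s (Targo), 5 min (satellite best case),
# 15 min and 60 min (standard model output)
for interval_s in (30, 300, 900, 3600):
    n = min_snapshots(lifetime_s, interval_s)
    print(f"interval {interval_s:>4d} s -> at least {n} snapshot(s)")
```

Under this assumption, a 15-minute or 60-minute output interval may miss the cloud entirely, while a 30-second interval resolves its life cycle with about 20 samples.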

To overcome this limitation, the Leibniz Institute for Tropospheric Research ([TROPOS](https://www.tropos.de)) is developing the software Targo (Targeted Output) within the EU-funded CleanCloud project. Targo enables cloud tracking in simulations at an extremely high temporal resolution of around 30 seconds.

### Method
Targo is attached as a plugin to the weather forecast model ICON via the CoMIn interface. It requests meteorological fields at every model time step (typically 30–60 seconds) and performs cloud tracking using the community software tobac. At every cloud position, additional meteorological variables are requested, allowing for a comprehensive characterization of each cloud. This approach provides highly resolved cloud data at minimal storage cost.

### Case Study
The study object is Hurricane Paulette from September 2020. A hurricane is one of the most extreme meteorological phenomena on our planet. It is characterized by a strong pressure depression (the hurricane eye), surrounded by a ring of convective clouds, intense precipitation, and thunderstorms.

We applied our new cloud-tracking method to Paulette for a 48-hour period on 7–8 September 2020. From this simulation, we obtained 4,500 tracks of convective clouds. 
An example of identified cloud objects and an overview of the cloud trajectories can be found in the following figures.

![Identified Cloud Objects](Cloud_Features.png) ![Cloud Tracks](Tracks.png) 


### Data Structure
A cloud may exist in multiple frames (time steps) of the simulation. Corresponding clouds in consecutive frames are linked together into cells. In other words, a cell is the time series of a single cloud.
Our dataset consists of these cells.

Each cell is stored as an individual dataframe.
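
As a minimal sketch of this structure, the example below builds two toy cells and stacks them into one long table for analysis. The column names (`time`, `lat`, `lon`), the cell IDs, and all values are hypothetical placeholders and do not reflect the actual Targo output.

```python
import pandas as pd

step = pd.Timedelta(seconds=30)  # assumed model time step

# Hypothetical cells: dict keys are cell IDs, values are per-cell time series
cells = {
    1: pd.DataFrame({
        "time": pd.date_range("2020-09-07 00:00:00", periods=3, freq=step),
        "lat": [20.0, 20.1, 20.2],
        "lon": [-45.0, -45.1, -45.2],
    }),
    2: pd.DataFrame({
        "time": pd.date_range("2020-09-07 00:01:00", periods=2, freq=step),
        "lat": [22.0, 22.1],
        "lon": [-47.0, -47.1],
    }),
}

# Stack all cells into one long table; the dict keys become the "cell" column
all_cells = pd.concat(cells, names=["cell", "frame"]).reset_index()

# One row per cell: number of time steps, first and last time, and lifetime
summary = all_cells.groupby("cell")["time"].agg(
    n_steps="count", start="min", end="max"
)
summary["lifetime"] = summary["end"] - summary["start"]
print(summary)
```

Keeping a `cell` column in the combined table makes it possible to compute per-cloud statistics with `groupby` while still running dataset-wide checks on a single dataframe.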

In [None]:
import pandas as pd

# Load the data
# Replace 'your_dataset.csv' with the path to your actual dataset
df = pd.read_csv('your_dataset.csv')

# Number of samples
num_samples = df.shape[0]

# Number of features
num_features = df.shape[1]

# Display these dataset characteristics
print(f"Number of samples: {num_samples}")
print(f"Number of features: {num_features}")

# Display the first few rows of the dataframe to show the structure
print("Example data:")
print(df.head())



## Handling Missing Values

[Identify any missing values in the dataset, and describe your approach to handle them if there are any. If there are no missing values simply indicate that there are none.]


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values


In [None]:
# Handling missing values
# Example: Replacing NaN values with the mean value of each numeric column
# df.fillna(df.mean(numeric_only=True), inplace=True)

# Your code for handling missing values goes here

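Because each cell is a time series, filling gaps with a column mean ignores the temporal ordering of the values. One possible alternative is to interpolate within each cell. The sketch below assumes a time-indexed cell and uses a hypothetical column name, `cloud_top_height`, with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical cell with two gaps; column name and values are assumptions
cell = pd.DataFrame(
    {"cloud_top_height": [1000.0, np.nan, 1400.0, np.nan]},
    index=pd.date_range(
        "2020-09-07", periods=4, freq=pd.Timedelta(seconds=30)
    ),
)

# Time-weighted linear interpolation; limit_area="inside" fills only gaps
# that are enclosed by valid values and does not extrapolate past the last one
filled = cell.interpolate(method="time", limit_area="inside")
print(filled)
```

The trailing gap stays missing on purpose: extending a cloud property beyond its last valid observation would be extrapolation rather than interpolation.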

## Feature Distributions

[Plot the distribution of various features and target variables. Comment on the skewness, outliers, or any other observations.]


In [None]:
# Example: Plotting histograms of all numerical features
df.hist(figsize=(12, 12))
plt.show()

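Histograms give a visual impression; skewness and outliers can also be quantified. The sketch below uses synthetic data with hypothetical column names and applies two common conventions: the sample skewness from pandas and the 1.5 × IQR rule for flagging outliers.

```python
import numpy as np
import pandas as pd

# Synthetic demo data; replace `demo` with the real dataframe
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.lognormal(size=1000),
})

# Sample skewness per column: near 0 for symmetric data, > 0 for a long right tail
skewness = demo.skew(numeric_only=True)

# Count values outside the 1.5 * IQR fences for each column
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
outliers = ((demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)).sum()

print(pd.DataFrame({"skewness": skewness, "n_outliers": outliers}))
```

For strongly right-skewed features, a log transform before plotting or modelling is often worth considering.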

## Possible Biases

[Investigate the dataset for any biases that could affect the model’s performance and fairness (e.g., class imbalance, historical biases).]


In [None]:
# Example: Checking for class imbalance in a classification problem
# sns.countplot(x='target_variable', data=df)

# Your code to investigate possible biases goes here

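For track data, one bias worth checking is whether a few long-lived cells contribute most of the rows, so that any statistic computed per row is dominated by them. The sketch below assumes a long-format table with a hypothetical `cell` column holding one row per cell and time step; the values are toy data.

```python
import pandas as pd

# Hypothetical long-format table: one row per cell and time step
tracks = pd.DataFrame({"cell": [1, 1, 1, 1, 1, 1, 2, 2, 3]})

# Number of time steps contributed by each cell
rows_per_cell = tracks["cell"].value_counts()

# Fraction of all rows that each cell accounts for
share = rows_per_cell / rows_per_cell.sum()
print(share)
```

If the shares are very uneven, aggregating to one row per cell before computing statistics is one way to weight every cloud equally.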

## Correlations

[Explore correlations between features and the target variable, as well as among features themselves.]


In [None]:
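# Example: Plotting a heatmap to show correlations among numeric features
# numeric_only=True skips non-numeric columns, which would otherwise raise an error in pandas >= 2.0
correlation_matrix = df.corr(numeric_only=True)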
sns.heatmap(correlation_matrix, annot=True)
plt.show()
