## This mini-project will focus on processing and replacing missing values in a dataset. 
- The dataset chosen or generated must have a sufficient number of missing values to contribute to the main objective of this project.
- Each part in the project must be accompanied by a clear, valid, and detailed explanation, along with the corresponding code blocks.
- Respect exactly the same number of cells. Your marks will be strongly influenced by how well you respect this structure.

---
# 1. Choose an existing dataset or generate a synthetic one.
- If generated, describe how you ensure the quality of the synthetic data.
- If chosen, justify your choice, and provide a general description.
- In all cases, cite your objectives, and your working plan. 
- Provide code.
- As an objective, you may use ***a simple regression model or a clustering algorithm***.
---

## Answer

### 1. Chosen Data:
The chosen dataset describes the weather in the city of Jena (Germany) between 2009 and 2017. I found it in the Kaggle datasets section.

### 2. Justification:
The dataset is a rich time series dataset. It is useful for learning purposes, particularly for dealing with preprocessing, feature engineering, modeling, hypothesis testing, and drawing conclusions from insights.

### 3. Data Description:
The Jena Climate dataset is a weather time series dataset recorded at the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany.

The dataset consists of 14 different quantities (such as air temperature, atmospheric pressure, humidity, wind direction, and more) recorded every 10 minutes over several years. This dataset covers data from January 1st, 2009, to December 31st, 2016.

#### Columns Description:

| Index | Feature           | Format          | Description|
|:------|:------------------|:----------------|:-----------|
| 1     | Date Time         | 01.01.2009 00:10:00 | Date-time reference|
| 2     | Pressure (mbar)   | 996.52          | Atmospheric pressure. The pascal is the SI derived unit of pressure, but meteorological reports typically state atmospheric pressure in millibars. |
| 3     | Temperature (°C)  | -8.02           | Temperature in Celsius|
| 4     | Temperature (K)   | 265.4           | Temperature in Kelvin|
| 5     | Dew Point (°C)    | -8.9            | Temperature in Celsius relative to humidity. The dew point is the temperature at which the air cannot hold all the moisture, causing water condensation. |
| 6     | Relative Humidity (%) | 93.3         | Relative Humidity is a measure of how saturated the air is with water vapor; %RH indicates the water content within collection objects. |
| 7     | Saturation Vapor Pressure (mbar) | 3.33 | Saturation vapor pressure|
| 8     | Vapor Pressure (mbar) | 3.11        | Vapor pressure|
| 9     | Vapor Pressure Deficit (mbar) | 0.22 | Vapor pressure deficit|
| 10    | Specific Humidity (g/kg) | 1.94      | Specific humidity |
| 11    | Water Vapor Concentration (mmol/mol) | 3.12 | Water vapor concentration |
| 12    | Air Density (g/m³) | 1307.75         | Air density |
| 13    | Wind Speed (m/s)   | 1.03            | Wind speed |
| 14    | Maximum Wind Speed (m/s) | 1.75     | Maximum wind speed |
| 15    | Wind Direction (°) | 152.3           | Wind direction in degrees |



***Source of description:** the dataset information card. link: https://www.kaggle.com/datasets/mnassrib/jena-climate*

### 4. Load and general understanding of the data

In [1]:
import pandas as pd

# load dataset
filepath = "dataset/jena_climat_dataset_2009_2017.csv"
weather = pd.read_csv(filepath)

# change to datetime dtype
weather['datetime'] = pd.to_datetime(weather['datetime']) 

print("1. preview the dataset: ", "="*50)
print(weather.loc[1])

print("\n2. data exploration: ", "="*50)
print("\na. general information about the columns")
print(weather.info())

print("\nb. Descriptive statistics on the columns")
columns = ["temperature (celsius)", "relative_humidity (%)", "pressure (mbar)"] 
print(weather[columns].describe())



datetime                          2015-03-28 06:30:00
pressure (mbar)                                 995.5
temperature (celsius)                            4.67
potential_temperature (kelvin)                  278.2
dew_point (celsius)                              1.18
relative_humidity (%)                            78.0
max_vapor_pressure (mbar)                        8.53
actual_vapor_pressure (mbar)                     6.65
vapor_pressure_deficit (mbar)                    1.88
specific_humidity (g/kg)                         4.17
water_concentration (mmol/mol)                   6.68
air_density (g/m**3)                             1245
wind_speed (m/s)                                  NaN
max_wind_speed (m/s)                             1.24
wind_direction (degrees)                        229.2
Name: 1, dtype: object


a. general information about the columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449990 entries, 0 to 449989
Data columns (total 15 columns):
 #   Column  

### 5. Objective
The primary objective is to **forecast temperature (in degrees Celsius)** based on historical weather data. This involves leveraging the rich time series data from the Jena Climate dataset to predict future temperature values. Such forecasts can provide valuable insights for climate modeling and help in planning activities sensitive to temperature changes.


### 6. Working Plan

#### Step 1: Data Understanding and Preparation
1. **Explore the dataset**: done in *Load and general understanding of the data* above.
2. **Handle missing values**:
   - Identify missing values in the dataset.
   - Employ appropriate techniques to fill or remove missing values.
3. **Convert data types**:
   - Ensure the `datetime` column (`Date Time` in the description table) is in datetime format.
   - Check and correct other columns’ types if necessary.
4. **Feature selection**:
   - Select relevant features for the temperature forecasting model (e.g., atmospheric pressure, humidity, wind speed).
5. **Data splitting**:
   - Split the data into two parts, one for experimenting and the other for testing.
#### Step 2: Exploratory Data Analysis (EDA)
1. **Correlation analysis**:
   - Check for possible relationships between air temperature and other variables (not preferable, since those indicators will not be available when predicting the future).
2. **Visualize trends and seasonality of the target variable**:
   - Plot temperature over time to identify trends and seasonal patterns.
3. **Stationarity check**:
   - Perform stationarity tests (e.g., Augmented Dickey-Fuller test) to check if the series is suitable for forecasting.
4. **Insights!**

#### Step 3: Data Transformation
1. **Resample the data**:
   - Aggregate the data to a daily or hourly level if necessary to simplify the time series.
2. **Create lag features**:
   - Generate lag features for the target variable (e.g., temperature at t-1, t-2, etc.).
3. **Scale the data**:
   - Normalize or standardize the features to prepare for model training.

#### Step 4: Model Selection and Training
1. **Data preprocessing**:
   - Aggregate the data to a daily or hourly level if necessary to simplify the time series
   - Generate lag features for the target variable (e.g., temperature at t-1, t-2, etc.).
   - Normalize or standardize the features to prepare for model training.
   - Divide the dataset into training and validation sets based on time (e.g., first 80% for training, last 20% for validation).
2. **Modeling**:
   - Experiment with time series and regression models such as ARIMA, XGBoost, and LinearRegression.
   - Fit the model.
   - Visualize the predictions.
   - Compute model evaluation metrics.
   - Tune hyperparameters using GridSearch (to find the optimal hyperparameter values).
   - Final: training and evaluation!

#### Step 5: Forecasting and Interpretation
1. **Generate forecasts**:
   - Use the model to predict future temperatures.
2. **Interpret results**:
   - Analyze the forecast’s accuracy and reliability.
3. **Draw insights**:
   - Highlight key findings and potential applications of the forecast results.


---
# 2. Clean your data: 
- Evaluate the quality of data.
- Identify relevant features.
- Missing values or any additional operations? (Use different approaches to tackle the missing values issue, and create different data variants for a later comparison.)
- Statistics, data type checks, conversions, and more?
- Divide the data into two parts, one for experiments and another for validation (explain the logic of your splitting).
---

####  a. Explore the dataset:
We already did! See [*Load and general understanding of the data*](#4.-Load-and-general-understanding-of-the-data).


#### b. Handle missing values:
   - Identify missing values in the dataset.
   - Employ appropriate techniques to fill or remove missing values.


datetime                              0
pressure (mbar)                       0
temperature (celsius)             21350
potential_temperature (kelvin)        0
dew_point (celsius)                   0
relative_humidity (%)                 0
max_vapor_pressure (mbar)             0
actual_vapor_pressure (mbar)          0
vapor_pressure_deficit (mbar)         0
specific_humidity (g/kg)          31466
water_concentration (mmol/mol)        0
air_density (g/m**3)                  0
wind_speed (m/s)                  49568
max_wind_speed (m/s)                  0
wind_direction (degrees)              0
dtype: int64
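
The counts above can be reproduced, and the imputation variants prepared, along the following lines. This is a minimal sketch on a small hypothetical stand-in frame (the values and the `variants` dictionary are illustrative assumptions, not the project's actual data); the same calls would apply to the `weather` frame loaded earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `weather` frame: 10-minute readings with gaps.
idx = pd.date_range("2009-01-01", periods=12, freq="10min")
weather = pd.DataFrame(
    {
        "datetime": idx,
        "temperature (celsius)": [1.0, np.nan, 3.0, 4.0, np.nan, np.nan,
                                  7.0, 8.0, 9.0, np.nan, 11.0, 12.0],
    }
)

# 1. Identify: count and share of missing values per column.
missing_count = weather.isna().sum()
missing_share = weather.isna().mean()
print(missing_count)

# 2. Build one data variant per strategy for a later comparison.
col = "temperature (celsius)"
indexed = weather.set_index("datetime")
variants = {
    "dropped": indexed.dropna(subset=[col]),             # remove incomplete rows
    "mean": indexed.fillna({col: indexed[col].mean()}),  # constant fill
    "ffill": indexed.ffill(),                            # carry last reading forward
    "time_interp": indexed.interpolate(method="time"),   # linear in elapsed time
}

for name, frame in variants.items():
    print(name, len(frame), int(frame[col].isna().sum()))
```

Dropping rows breaks the regular 10-minute spacing, while the three fill strategies keep it, which matters later when lag features are built.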

#### c. Convert data types:
   - Ensure the `datetime` column (`Date Time` in the description table) is in datetime format.
   - Check and correct other columns’ types if necessary.
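
A possible way to make these conversions explicit is sketched below. The raw rows and the format string `%Y-%m-%d %H:%M:%S` are assumptions chosen to match the timestamp style seen in the preview output; the actual file may need a different format.

```python
import pandas as pd

# Hypothetical raw rows, as they might look right after read_csv.
raw = pd.DataFrame(
    {
        "datetime": ["2015-03-28 06:30:00", "2015-03-28 06:40:00", "not a date"],
        "pressure (mbar)": ["995.5", "995.7", "bad"],
    }
)

# An explicit format avoids per-row format inference; unparseable
# entries become NaT instead of raising an error.
raw["datetime"] = pd.to_datetime(
    raw["datetime"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Numeric columns read as text are coerced the same way (bad values -> NaN).
raw["pressure (mbar)"] = pd.to_numeric(raw["pressure (mbar)"], errors="coerce")

print(raw.dtypes)
```

With `errors="coerce"`, conversion failures show up as additional missing values, so they can be handled together with the other gaps.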

#### d. Feature selection:
   - Select relevant features for the temperature forecasting model (e.g., atmospheric pressure, humidity, wind speed).


#### e. Data splitting:
   - Split the data into two parts, one for experimenting and the other for testing.
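
For a forecasting task, one reasonable splitting logic is chronological: sort by time, then keep the earliest part for experiments and the most recent part for testing, so that no future information leaks into the experiments. The sketch below assumes an 80/20 ratio and uses hypothetical stand-in data; `experiment` and `validation` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, shuffled to mimic rows stored out of time order.
idx = pd.date_range("2009-01-01", periods=100, freq="60min")
weather = pd.DataFrame({"datetime": idx, "temperature (celsius)": np.arange(100.0)})
weather = weather.sample(frac=1.0, random_state=0)

# Sort chronologically before cutting, then split without shuffling.
weather = weather.sort_values("datetime").reset_index(drop=True)
cut = int(len(weather) * 0.8)   # assumed 80/20 ratio
experiment = weather.iloc[:cut]
validation = weather.iloc[cut:]

print(experiment["datetime"].max(), "<", validation["datetime"].min())
```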

---
# 3. Exploratory Data Analyses
- Describe the methodology you will follow to achieve your objectives: relationship analysis? Feature engineering?
- Apply your approach, and cite your findings.
- Draft a conclusion, and define new operations.
- Use visualization to understand your findings.
---

#### a. Approach and methodology

#### b. Correlation analysis:
   - Check for possible relationships between air temperature and other variables (not preferable, since those indicators will not be available when predicting the future).
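
A correlation check of this kind could look as follows. The data here are synthetic and only illustrate the mechanics (a dew point built to track temperature, and an independent wind speed); they say nothing about the real correlations in the Jena dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns, built so that one is related to temperature.
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(10, 8, n)
weather = pd.DataFrame(
    {
        "temperature (celsius)": temperature,
        "dew_point (celsius)": temperature - rng.uniform(0, 5, n),
        "wind_speed (m/s)": rng.uniform(0, 10, n),
    }
)

target = "temperature (celsius)"

# Pearson correlation of every other column with the target,
# ranked by absolute strength.
corr = weather.corr(method="pearson")[target].drop(target)
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```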


#### c. Visualize trends and seasonality of the target variable:
   - Plot temperature over time to identify trends and seasonal patterns.
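
The two series one would typically plot can be computed as sketched below: a per-month average for the seasonal profile, and a centered one-year rolling mean for the long-term trend. The temperature curve is a synthetic sinusoid (an assumption for illustration), and the window of 365 days is likewise an illustrative choice.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures over two years: coldest mid-January,
# warmest mid-July, centered on 10 degrees.
idx = pd.date_range("2009-01-01", "2010-12-31", freq="D")
day = idx.dayofyear.to_numpy()
temp = pd.Series(
    10 - 10 * np.cos(2 * np.pi * (day - 15) / 365.25),
    index=idx,
    name="temperature (celsius)",
)

# Seasonal profile: average temperature per calendar month.
monthly_mean = temp.groupby(temp.index.month).mean()

# Trend: a centered one-year rolling mean smooths out the seasonal cycle.
trend = temp.rolling(window=365, center=True).mean()

print(monthly_mean.round(2))
```

If matplotlib is installed, `temp.plot()`, `trend.plot()`, and `monthly_mean.plot(kind="bar")` would draw these series directly.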


#### d. Results!

---
# 4. Data modeling:
- Satisfied by your data ?
- Run simple clustering or regression.
- Evaluate your data modeling by choosing the right evaluators (clustering or regression).
- Make visualization.

---

#### a. Data preprocessing:
   - Aggregate the data to a daily or hourly level if necessary to simplify the time series

   - Generate lag features for the target variable (e.g., temperature at t-1, t-2, etc.).

   - Normalize or standardize the features to prepare for model training.

   - Divide the dataset into training and validation sets based on time (e.g., first 80% for training, last 20% for validation).
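
The four preprocessing steps above can be chained as in the following sketch. It runs on a synthetic 10-minute series; the hourly aggregation, the three lags, and the 80/20 ratio are illustrative assumptions. The scaling statistics are computed on the training part only, so that the validation part does not influence them.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute temperature series covering 10 days.
idx = pd.date_range("2009-01-01", periods=6 * 24 * 10, freq="10min")
rng = np.random.default_rng(0)
values = 5 + 3 * np.sin(2 * np.pi * np.arange(len(idx)) / 144) + rng.normal(0, 0.2, len(idx))
series = pd.Series(values, index=idx, name="temperature (celsius)")

# 1. Aggregate to an hourly level.
hourly = series.resample("60min").mean()

# 2. Lag features: temperature at t-1, t-2, t-3.
frame = hourly.to_frame("target")
for lag in (1, 2, 3):
    frame[f"lag_{lag}"] = frame["target"].shift(lag)
frame = frame.dropna()  # the first rows have no complete history

# 3. Time-based split: first 80% for training, last 20% for validation.
cut = int(len(frame) * 0.8)
train, valid = frame.iloc[:cut], frame.iloc[cut:]

# 4. Standardize features with statistics from the training part only.
feature_cols = [c for c in frame.columns if c.startswith("lag_")]
mean, std = train[feature_cols].mean(), train[feature_cols].std()
train_X = (train[feature_cols] - mean) / std
valid_X = (valid[feature_cols] - mean) / std

print(train_X.shape, valid_X.shape)
```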

#### b. Modeling:
   - Experiment with time series and regression models such as ARIMA, XGBoost, and LinearRegression.


   - Fit the model

   - Visualize the predictions

   - Model evaluation

   - Hyperparameter tuning using GridSearch (to find the optimal hyperparameter values).

   - Final: training, evaluation!
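
As a minimal illustration of the fit and evaluate steps, the sketch below fits an ordinary least squares regression on three lag features and compares it with a persistence baseline (predicting the previous value). It uses `numpy.linalg.lstsq` as a dependency-free stand-in for `LinearRegression`, on a synthetic hourly series; the data, the number of lags, and the helper names `mae` and `rmse` are assumptions for illustration. Hyperparameter tuning is not covered here.

```python
import numpy as np

# Synthetic hourly series with a daily cycle plus noise.
rng = np.random.default_rng(0)
n = 400
t = np.arange(n)
y = 10 + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, n)

# Lag matrix: row i holds y[t-1], y[t-2], y[t-3] for the target y[t].
lags = 3
X = np.column_stack([y[lags - k: n - k] for k in range(1, lags + 1)])
target = y[lags:]
X = np.column_stack([np.ones(len(target)), X])  # intercept column

# Time-based split: first 80% for training, last 20% for validation.
cut = int(len(target) * 0.8)
X_train, X_valid = X[:cut], X[cut:]
y_train, y_valid = target[:cut], target[cut:]

# Fit by ordinary least squares and predict the validation part.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_valid @ coef
baseline = X_valid[:, 1]  # persistence: predict y[t-1]

def mae(a, b):
    """Mean absolute error."""
    return float(np.mean(np.abs(a - b)))

def rmse(a, b):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("model    MAE / RMSE:", mae(y_valid, pred), rmse(y_valid, pred))
print("baseline MAE / RMSE:", mae(y_valid, baseline), rmse(y_valid, baseline))
```

Comparing against a simple baseline gives the error metrics a reference point: a model is only worth keeping if it clearly beats persistence on the validation part.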

---
# 5. Communicate results: 
- Provide textual description of your work.
- Use visualizations to simplify your findings, and communicate it.
- Any decisions based on facts ?
---