
<div style="text-align: center;">
  <img width="420" height="420" src="https://www.naterscreations.com/imputegap/logo_imputegab.png" />
</div>

<h1>ImputeGAP: Exercises</h1>

ImputeGAP is an end-to-end imputation library that implements the full imputation pipeline
from data collection to explaining the imputation results and their impact. It encompasses
two interleaving units: repair and explore. The two units can be accessed via a standardized
pipeline defined by configuration files or independent instantiation. Please install the library and the Jupyter requirements:

In [None]:
%pip install imputegap==1.0.8

In [None]:
%pip install -U ipywidgets
%matplotlib inline

<br>


<h1>Help</h1>

You can find all the information about the library in the official documentation: https://imputegap.readthedocs.io/

*All the assignments can be answered using the ImputeGAP library.*

<h1>Exercise 1: Load</h1>

1) Load the built-in "chlorine" dataset, which contains data from chlorine residual management efforts to safeguard water distribution system security.

In [None]:
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object


# load and normalize the dataset


ts.data

<br>

2) Display the first 10 time series, printing only 10 values of each.

<br>

3) Plot the first 5 time series in their entirety.

**ALL PLOTS ARE STORED IN "./imputegap_assets/*". To show a plot in Jupyter, add the following commands:**

In [None]:
%matplotlib inline
ts.plots.show()

In [None]:
# plot


%matplotlib inline
ts.plots.show()

<br>

4) Although the report is insightful, we want to focus the analysis on the first 4 seasonal patterns. Create a new object named **chlorine_ts** containing only the relevant values of the time series. Then, normalize the data using the min-max scaling method. Finally, plot the first 9 time series from this new dataset.

**Hint: the seasonal patterns are the repeating curve shapes in the data (see the plot).**
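
As a reminder of what min-max scaling does, here is a minimal NumPy sketch on a toy matrix. It is not the ImputeGAP call you are expected to use in the exercise: the function name `min_max_normalize` and the toy values are illustrative assumptions, and the sketch assumes a single global minimum and maximum for the whole matrix.

```python
import numpy as np

def min_max_normalize(data):
    """Rescale all values of a 2D array to [0, 1] using the global min and max."""
    data = np.asarray(data, dtype=float)
    lo, hi = np.nanmin(data), np.nanmax(data)
    if hi == lo:
        # A constant matrix has no range to rescale; return zeros.
        return np.zeros_like(data)
    return (data - lo) / (hi - lo)

# Toy matrix (assumption): 2 series (rows) with 3 values each.
toy = np.array([[2.0, 4.0, 6.0],
                [1.0, 5.0, 9.0]])
scaled = min_max_normalize(toy)
print(scaled)
```

After scaling, the smallest value of the matrix becomes 0 and the largest becomes 1, while the curve shapes are preserved.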

In [None]:
# load and normalize the dataset



<br>

5) What is the shape of the time series?

*your answer*


<br>

<h1>Exercise 2: Contamination</h1>

1) Reuse the **chlorine_ts** object and simulate a sensor dysfunction by contaminating its values. Specifically, simulate a scenario where all sensors located in Zurich (the top 10% of the series in the matrix) are disconnected between hours 300 and 410 (time series values). Store the resulting matrix in a new variable named **chlorine_m**.

**Hint: the offset must be equal to 0.5456 (54.56%); by default, the contamination is applied to the top of the matrix.**
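
To make the scenario concrete, the sketch below reproduces the same kind of contamination on a toy matrix with plain NumPy. It is an illustration only, not the ImputeGAP contamination API: the helper `disconnect_block`, the toy sizes, the row-per-series layout, and the half-open interval [300, 410) are all assumptions.

```python
import numpy as np

def disconnect_block(data, series_rate, start, end):
    """Return a copy where the top `series_rate` share of rows is NaN in columns [start, end)."""
    out = np.array(data, dtype=float, copy=True)
    n_series = int(round(series_rate * out.shape[0]))
    out[:n_series, start:end] = np.nan
    return out

# Hypothetical toy matrix: 20 series (rows) of 500 values (columns).
rng = np.random.default_rng(0)
toy = rng.random((20, 500))

toy_m = disconnect_block(toy, series_rate=0.10, start=300, end=410)

# 2 series x 110 values are now missing; the original matrix is untouched.
print(np.isnan(toy_m).sum(), np.isnan(toy).sum())
```

An offset such as the one in the hint presumably expresses the start of the missing block as a fraction of the series length, which is why it has to match the requested starting hour.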

In [None]:
#contaminate

<br>

2) Plot the result, ensuring that the missing values are visibly represented. Use subplots to display the 5 time series.

In [None]:
# plot

<br>

<h1>Exercise 3: Imputation</h1>

1) In this section, we aim to repair the values affected during the contamination. First, initialize an `IterativeSVD` imputation object from the MatrixCompletion family and assign it to a variable named `chlorine_imputer`. Then, perform the imputation to repair the dataset.
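
For intuition about what an iterative SVD imputer does, here is a minimal textbook-style sketch in NumPy. It is not ImputeGAP's implementation and should not replace the library call in your answer: the function `iterative_svd_impute`, its parameters, and the toy rank-1 matrix are assumptions made for illustration.

```python
import numpy as np

def iterative_svd_impute(data, rank=1, n_iter=200, tol=1e-9):
    """Fill NaNs by repeatedly projecting onto a rank-`rank` approximation."""
    x = np.array(data, dtype=float, copy=True)
    missing = np.isnan(x)
    if not missing.any():
        return x
    # Initial guess: column mean of the observed values.
    col_means = np.nanmean(x, axis=0)
    x[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        change = np.max(np.abs(x[missing] - low_rank[missing]))
        # Only the missing cells are updated; observed values stay fixed.
        x[missing] = low_rank[missing]
        if change < tol:
            break
    return x

# Toy rank-1 matrix (assumption) with two values removed.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0, 5.0])
holed = truth.copy()
holed[0, 0] = np.nan
holed[2, 3] = np.nan

recovered = iterative_svd_impute(holed, rank=1)
print(np.round(recovered, 3))
```

The rank plays the role of the main hyperparameter here: too low and the reconstruction underfits, too high and the imputed cells simply keep their initial guess.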

In [None]:
# impute

<br>

2) Evaluate the quality of the imputation by computing the score between the original `chlorine_ts` dataset and the imputed dataset. Display the RMSE (Root Mean Squared Error) value below.
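
As a reference for interpreting the score, the sketch below computes an RMSE by hand with NumPy, restricted to the positions that were missing. This is an illustrative assumption about how the error can be measured, not a description of ImputeGAP's internals; the helper name `rmse_on_missing` and the toy values are hypothetical.

```python
import numpy as np

def rmse_on_missing(ground_truth, imputed, contaminated):
    """RMSE between ground truth and imputed values, on the NaN positions of `contaminated`."""
    mask = np.isnan(contaminated)
    diff = np.asarray(ground_truth, dtype=float)[mask] - np.asarray(imputed, dtype=float)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data (assumption): two values are removed, then imputed imperfectly.
truth = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
contaminated = truth.copy()
contaminated[0, 1] = np.nan
contaminated[1, 2] = np.nan
imputed = truth.copy()
imputed[0, 1] = 2.5
imputed[1, 2] = 5.0

score = rmse_on_missing(truth, imputed, contaminated)
print(score)
```

A lower RMSE means the imputed values are closer to the ground truth; because the errors are squared, large deviations weigh more than small ones.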

In [None]:
#score

#print


*your answer*

<br>

3) Plot the results of the imputation using subplots. Each plot should display the imputed series, the contaminated series, and the original ground truth for comparison.

In [None]:
# plot

# .show()

<br>

4) Improve the imputation by applying automatic hyperparameter tuning using the `ray_tune` optimizer to find the optimal parameters for `IterativeSVD`. Report the new RMSE after optimization. Then, replot the results with the optimized imputation, showing the contaminated data, the imputed values, and the original ground truth for comparison.

**Hint: you need to create a dictionary to configure the optimizer.**

In [None]:
# impute with optimizer


# print


# plot


<br>

New RMSE : *your answer*

<br>

<h1>Exercise 4: Benchmark</h1>

In [None]:
from imputegap.recovery.benchmark import Benchmark

# set the analysis

<br>

2) What is the best algorithm, and why?

*your answer*

3) Which dataset is best suited for the `SoftImpute` algorithm, and why?


In [None]:
# set analysis

*your answer*

<br>

<h1>Exercise 5: Explainer</h1>

We will launch an explainer analysis to understand the influence of dataset features on the imputation process. This will help identify which features contribute most to the repair quality.

1)  Launch an explainer analysis using the `Explainer` object (assigned to the variable `exp_chlorine`) on the previously used dataset (`chlorine_ts.data`). Use the `pycatch-22` feature extractor, the `mcar` missing data pattern, and the `IterativeSVD` imputation method to understand the relationship between features and imputation quality.

In [None]:
from imputegap.recovery.explainer import Explainer

# set object


# configure the explanation

2) Print the result using the corresponding function of the Explainer object. What is the top feature, and can you explain why?

In [None]:
# print the impact of each feature

*your answer*

<br>

<h1>Exercise 6: Downstream</h1>

We will now analyze the impact of the imputation on downstream prediction tasks. The goal is to determine whether the repair process enhances the overall analytics and predictive performance of the dataset.

A) Launch a downstream analysis on the `forecast-economy` dataset using the `CDRec` algorithm (from the MatrixCompletion family), and compare its performance with the `ZeroImpute` strategy (which fills missing values with zero).

1) Create a time series object named `ts_downstream` and load the full `forecast-economy` dataset, applying `Z-Score` normalization.

2) Contaminate the data using the `aligned` missingness pattern, with an `80%` missing rate per time series.

3) Initialize an imputation object named `imputer_downstream` using the `KNNImputer` method.

4) Define a dictionary specifying the downstream task as `"forecast"`, the model as `"hw-add"`, and include the necessary comparators (`CDRec`, `ZeroImpute`).

5) Print the evaluation results using `imputer_downstream.downstream_metrics`.
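
To see why zero filling is a meaningful baseline after Z-score normalization, the sketch below walks through steps 1 and 2 on random toy data with NumPy only. It is not the ImputeGAP workflow: the helpers `z_score` and `aligned_block`, the per-series normalization, the 10% offset, and the reading of "aligned" as one shared block position for every series are assumptions.

```python
import numpy as np

def z_score(data):
    """Normalize each series (row) to zero mean and unit standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=1, keepdims=True)
    std = data.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant series
    return (data - mean) / std

def aligned_block(data, rate, offset=0.1):
    """Remove the same contiguous block of values, at the same position, in every series."""
    out = np.array(data, dtype=float, copy=True)
    n_values = out.shape[1]
    start = int(round(offset * n_values))
    length = int(round(rate * n_values))
    out[:, start:start + length] = np.nan
    return out

# Hypothetical toy data: 4 series of 100 values.
rng = np.random.default_rng(42)
toy = z_score(rng.normal(loc=50.0, scale=5.0, size=(4, 100)))
toy_m = aligned_block(toy, rate=0.8)

# Zero filling: every missing value is replaced by 0.
zero_filled = np.nan_to_num(toy_m, nan=0.0)
print(np.isnan(toy_m).sum(), np.round(toy.mean(axis=1), 6))
```

Because each normalized series has a mean of 0, filling with zeros amounts to filling with the series mean, which gives the downstream model a flat, information-free segment to compare against.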


In [None]:
from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object

# load and normalize the dataset

# contaminate the time series

# define and impute the contaminated series

# compute and print the downstream results


What do the results mean?

*your answer*

<br>

<h1>Bonus Exercise: GIT</h1>

If you like the library, please give our GitHub repository a ⭐.

