## **MINERALS OF ENERGY TRANSITION** 

**EXTRACTION, TRANSFORMATION, LOAD (ETL)**

**DISCLAIMER:** <br>
*The data used for analyzing the market was obtained from open sources. The information and insights in this document cannot be used for commercial purposes, and each data source remains subject to its original licence.*

**SUMMARY:** <br>
TBD

**OBJECTIVE:** <br>
To extract and transform the data from its source, taking the production values as the main (target) variable.


**PIPELINE:** <br>
* Data Extraction
* Data Transformation
    * Data Scaling
    * Time Series Components
* Transformed Data - Quality Assessment
* Data Load (Export)


### 00.00. RESOURCES AND WORK ENVIRONMENT SETTING
#### 00.01. LIBRARIES AND WD

In [1]:
##-- ENVIRONMENT SETTINGS
import pandas as pd
import numpy as np
import os
import json

import sqlite3

import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

In [2]:
#-- Custom Libraries
main_wd = os.getcwd()[:os.getcwd().find('2023.Minerals-EDA') + len('2023.Minerals-EDA') +1]
os.chdir(main_wd)
os.chdir([x for x in json.load(open('./config/config.json',))[0]['directory'] if 'wd_custom_libraries' in x][0]['wd_custom_libraries'])    
import aux_time_series

os.chdir(main_wd)

In [3]:
##-- Work Directory
#loading the configuration once and closing the file afterwards
with open('./config/config.json') as config_file:
    config = json.load(config_file)[0]

wd_in = [x for x in config['data'] if 'raw' in x][0]['raw']
wd_out = [x for x in config['data'] if 'processed' in x][0]['processed']
csvAttr_imp = config['csvAttr_imp'][0]
csvAttr_exp = config['csvAttr_exp'][0]

#### 00.02 DATA
There are two main datasets. The first contains the production data together with the international trade figures of each producer country; the two concepts are not equally complete because they come from different sources.
The second dataset contains only trade data, but for every country in the source database.

The mineral units of measure were standardized for trade and production in the [ETL main process](https://github.com/zapallo-droid-ca/2023.Minerals-ETL) (a separate project), where petroleum and natural gas are expressed in tonnes of oil equivalent and the rest of the minerals in tonnes.

The minerals query includes a custom variable that flags which analysis each observation (row) can be used for, considering that trade only has values for years after 1986 while production has older values.


In [4]:
##-- Queries
ft_minerals_q = open(main_wd + '/ETL/ft_minerals.sql').read()
ft_trade_q = open(main_wd + '/ETL/ft_trade.sql').read()

##-- Connection
conn = sqlite3.connect(wd_in + '/minerals_db.db')
cursor = conn.cursor()

##-- Data
#Fact Tables
ft_minerals = pd.read_sql(ft_minerals_q, conn)
ft_trade = pd.read_sql(ft_trade_q, conn)

#Aux Data
dim_country = pd.read_sql('SELECT country_code , country_desc , region_desc  FROM dim_country dc',conn)
dim_mineral = pd.read_sql('SELECT mineral_code , mineral_desc  FROM dim_mineral dm',conn)
dim_unit = pd.read_sql('SELECT unit_code , unit_desc  FROM dim_unit du',conn)
dim_calendar = pd.read_sql('SELECT year  FROM dim_calendar dc',conn)

#-- Closing Connection
conn.close()

del(conn, cursor, ft_minerals_q, ft_trade_q)

In [5]:
#--key for future joins
ft_minerals['key'] = ft_minerals['year'].astype(str) + '-' + ft_minerals['country_code'] + '-' + ft_minerals['mineral_code']

#--Indexing DateTime
ft_minerals.set_index(pd.to_datetime(ft_minerals['year'], format = '%Y'), inplace = True)
#because the index contains non-unique values, the frequency will be set when needed.

The main sources only hold production or trade values for the years in which transactions took place. To work with time series and obtain the statistical measures, the queries extract the data through a cross join between the fact tables and the calendar table of the Data Warehouse, and all resulting null values were imputed with “0”.

Rolling aggregations and ratios are also calculated inside the queries, delegating that processing to SQL.
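
The densification step described above can also be pictured outside SQL. The following is a minimal pandas sketch, not the actual content of `ft_minerals.sql`: the `fact` and `calendar` frames and their values are hypothetical, and only the idea (cross join against the calendar, then zero imputation) is taken from the text.

```python
import pandas as pd

# Hypothetical sparse fact table: rows exist only for years with activity
fact = pd.DataFrame({
    'year': [2000, 2002, 2001],
    'country_code': ['AAA', 'AAA', 'BBB'],
    'mineral_code': ['cm01', 'cm01', 'cm01'],
    'quantity_produced': [10.0, 30.0, 5.0],
})

# Hypothetical calendar dimension covering every year of interest
calendar = pd.DataFrame({'year': [2000, 2001, 2002]})

# Cross join: every (country, mineral) pair against every calendar year
pairs = fact[['country_code', 'mineral_code']].drop_duplicates()
grid = pairs.merge(calendar, how='cross')

# Left join the observed values onto the full grid and impute the gaps with 0
dense = grid.merge(fact, how='left', on=['year', 'country_code', 'mineral_code'])
dense['quantity_produced'] = dense['quantity_produced'].fillna(0)

print(dense.sort_values(['country_code', 'year']))
```

Every pair ends up with one row per calendar year, which is what gives each series a regular length for the later decomposition.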


In [6]:
#Saving shape to final control
struct_shape = ft_minerals.shape[0]

### 01.00. TRANSFORMATIONS
#### 01.01. Time Series Decomposition





In [7]:
##--Taking a look into the target table
ft_minerals.sort_values('quantity_produced', ascending = False).head(2)

year,country_code,country_desc,region_desc,mineral_code,mineral_desc,unit_code,unit_desc,year,quantity_produced,quantity_import,...,produced_percentage_of_region,imports_percentage_of_global,imports_percentage_of_region,exports_percentage_of_global,exports_percentage_of_region,imports_USD_percentage_of_global,imports_USD_percentage_of_region,exports_USD_percentage_of_global,exports_USD_percentage_of_region,key
2020-01-01,CHN,china,Asia,cm02,steel,1,tonnes,2020,1064767000.0,21224830.0,...,0.0,8.500191,12.760579,1.349451,12.674839,8.822646,13.667664,1.556738,18.587715,2020-CHN-cm02
2019-01-01,CHN,china,Asia,cm02,steel,1,tonnes,2019,996342000.0,23360820.0,...,0.0,6.029206,8.46531,1.345937,14.42612,6.361385,9.200178,1.641698,22.401386,2019-CHN-cm02


In [8]:
##-- Decomposing the Target Series
ft_prod_tsdecomp = aux_time_series.tsDecomposition(data = ft_minerals[['country_code','mineral_code','quantity_produced']], index_frequency = 'YS', period = 12, target = 'quantity_produced', category = 'country_code', sub_category = 'mineral_code', scale_data = False)
ft_prod_tsdecomp.tail()

process ended, df_timeSeriesDecomp: (40800, 7)


Unnamed: 0,date,country_code,mineral_code,level_original,residual,seasonal,trend
40269,2020-01-01,GAB,cm06,8147000.0,3075048.0,682605.2,4389347.0
40270,2020-01-01,GAB,cm13,0.4867966,0.09204907,0.03530675,0.3594408
40271,2020-01-01,GAB,cm14,11.73689,-1.025673,0.3346822,12.42788
40261,2020-01-01,FRA,cm02,11595700.0,-2400074.0,-1083084.0,15078860.0
40799,2020-01-01,_KS,cm11,0.0,-1211.233,-831.2674,2042.5


In [9]:
##--Adding components to main ft (fact table) 
#join key with ft_minerals
ft_prod_tsdecomp['key'] = ft_prod_tsdecomp['date'].dt.year.astype(str) + '-' + ft_prod_tsdecomp['country_code'] + '-' + ft_prod_tsdecomp['mineral_code']

ft_minerals = ft_minerals.merge(ft_prod_tsdecomp[['key','level_original','residual','seasonal','trend']], how = 'left', on = 'key')
ft_minerals.head()

Unnamed: 0,country_code,country_desc,region_desc,mineral_code,mineral_desc,unit_code,unit_desc,year,quantity_produced,quantity_import,...,exports_percentage_of_region,imports_USD_percentage_of_global,imports_USD_percentage_of_region,exports_USD_percentage_of_global,exports_USD_percentage_of_region,key,level_original,residual,seasonal,trend
0,CMR,cameroon,Africa,cm01,aluminium,1,tonnes,1970,52000.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1970-CMR-cm01,52000.0,0.0,3047.164352,0.0
1,EGY,egypt,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1970-EGY-cm01,0.0,0.0,-1710.436921,0.0
2,GHA,ghana,Africa,cm01,aluminium,1,tonnes,1970,113000.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1970-GHA-cm01,113000.0,0.0,25059.293692,0.0
3,MOZ,mozambique,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1970-MOZ-cm01,0.0,0.0,6046.878183,0.0
4,NGA,nigeria,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1970-NGA-cm01,0.0,0.0,-3027.256944,0.0


In [10]:
##--Categories (To be able to work only with this dataset)
ft_prod_tsdecomp = ft_prod_tsdecomp.merge(dim_country, on = 'country_code', how = 'left')
ft_prod_tsdecomp = ft_prod_tsdecomp.merge(dim_mineral, on = 'mineral_code', how = 'left')

#Reordering
ft_prod_tsdecomp = ft_prod_tsdecomp[['key','country_code','country_desc','region_desc','mineral_code','mineral_desc','date','level_original','trend','seasonal','residual']]
ft_prod_tsdecomp.head()

Unnamed: 0,key,country_code,country_desc,region_desc,mineral_code,mineral_desc,date,level_original,trend,seasonal,residual
0,1970-AFG-cm11,AFG,afghanistan,Asia,cm11,chromium,1970-01-01,0.0,0.0,-104.318866,0.0
1,1970-NLD-cm13,NLD,netherlands,Europe,cm13,natural gas,1970-01-01,30.708914,0.0,-3.5659,0.0
2,1970-NLD-cm14,NLD,netherlands,Europe,cm14,petroleum,1970-01-01,2.163717,0.0,-0.181257,0.0
3,1970-NOR-cm01,NOR,norway svalbard jan mayen,Europe,cm01,aluminium,1970-01-01,530000.0,0.0,23033.1875,0.0
4,1970-NOR-cm02,NOR,norway svalbard jan mayen,Europe,cm02,steel,1970-01-01,870000.0,0.0,-44818.075231,0.0


### 02.00. CLUSTERING
#### 02.01. Dynamic Time Warping Similarity

The dataset has several dimensions because of its categories. The countries could be taken as the subject of analysis, but there are also values for each mineral as a second category. The procedure therefore clusters the countries separately for each mineral.

Clustering is an unsupervised grouping problem, and a long-format time series dataset cannot be clustered without reshaping. To tackle this, the dataset is reshaped so that each date becomes a column, and the procedure is run separately by component. Countries are therefore clustered by mineral and also by component: x clusters per mineral on the trend component, y clusters per mineral on the seasonal component and z clusters per mineral on the residual component.
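
The reshaping described above can be sketched with a plain pandas pivot. This is only an illustration with a hypothetical toy table (`long_df`); the helper functions in `aux_time_series` used later may reshape the data differently.

```python
import pandas as pd

# Hypothetical long-format component table (one row per country, mineral, date)
long_df = pd.DataFrame({
    'country_code': ['AAA', 'AAA', 'BBB', 'BBB', 'AAA', 'AAA'],
    'mineral_code': ['cm01', 'cm01', 'cm01', 'cm01', 'cm02', 'cm02'],
    'date': pd.to_datetime(['2019-01-01', '2020-01-01'] * 3),
    'trend': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One wide matrix per mineral: countries as rows, dates as columns
wide_by_mineral = {
    mineral: group.pivot(index='country_code', columns='date', values='trend')
    for mineral, group in long_df.groupby('mineral_code')
}

print(wide_by_mineral['cm01'])
```

Each row of a wide matrix is then one country's full series for that mineral, which is the shape a clustering algorithm expects.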

In this case, Dynamic Time Warping (DTW) is used as the measure of similarity among the component series, aiming to group series with consistent shapes.

[Here is a nice explanation of the DTW algorithm](https://www.youtube.com/watch?v=_K1OsqCicBY)
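
For reference, the core of DTW fits in a few lines. The sketch below is a textbook dynamic-programming version with absolute difference as the local cost, plus a pairwise distance matrix. The names `dtw_distance` and `dtw_matrix` are illustrative, and this is not necessarily how `aux_time_series.dtw_matrix_funct` is implemented.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def dtw_matrix(series_list):
    """Symmetric matrix of pairwise DTW distances."""
    size = len(series_list)
    matrix = np.zeros((size, size))
    for i in range(size):
        for j in range(i + 1, size):
            matrix[i, j] = matrix[j, i] = dtw_distance(series_list[i], series_list[j])
    return matrix

# The first two series share the same shape but differ in length
series = [[0, 1, 2], [0, 1, 1, 2], [5, 5, 5]]
distances = dtw_matrix(series)
print(distances)
```

Because the warping path may repeat points, two series with the same shape but a different pace obtain a small distance, which is the property the clustering relies on.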


In [11]:
##--Pivoting Dataset and getting DTW Matrix 
#- Trend Component
dtw_matrix_trend, dtw_index_trend = aux_time_series.dtw_matrix_funct(data = ft_prod_tsdecomp, index = 'country_code', columns = 'date', values = 'trend', category = 'mineral_code')

#- Seasonal Component
dtw_matrix_seaso, dtw_index_seaso = aux_time_series.dtw_matrix_funct(data = ft_prod_tsdecomp, index = 'country_code', columns = 'date', values = 'seasonal', category = 'mineral_code')

#- Residual Component
dtw_matrix_resid, dtw_index_resid = aux_time_series.dtw_matrix_funct(data = ft_prod_tsdecomp, index = 'country_code', columns = 'date', values = 'residual', category = 'mineral_code')

process finished, 15 DTW matrix created and 15 index lists created

#### 02.02. Kmeans

To analyze and determine the optimal number of clusters, three measures are considered for each clustering algorithm: the silhouette score, the Calinski-Harabasz score and the Davies-Bouldin score.
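
As an illustration of how these three measures can guide the choice of the number of clusters, the sketch below scores KMeans for several values of k with scikit-learn. The feature matrix is a synthetic stand-in with three well-separated groups, not the DTW matrices of this project, and the selection logic inside `aux_time_series.clustering_kmeans_multi` (including its early stop) is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for a feature matrix: three well-separated groups
rng = np.random.default_rng(92)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([center + rng.normal(scale=0.5, size=(30, 2)) for center in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=92).fit_predict(X)
    scores[k] = {
        'silhouette': silhouette_score(X, labels),                # higher is better
        'calinski_harabasz': calinski_harabasz_score(X, labels),  # higher is better
        'davies_bouldin': davies_bouldin_score(X, labels),        # lower is better
    }

best_by_silhouette = max(scores, key=lambda k: scores[k]['silhouette'])
print(best_by_silhouette, scores[best_by_silhouette])
```

The three measures do not always agree, so some rule for combining them is needed in practice.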

In [12]:
#Parameters
randomStateValue = 92
early_stop_yield = 0.2 #% of total categories without changing results
categories = ft_minerals.mineral_code.unique()


In [13]:
#- Trend Component
result_t, clusters_tests_t = aux_time_series.clustering_kmeans_multi(x = dtw_matrix_trend, 
                                                                     y = dtw_index_trend,
                                                                    categories = categories, 
                                                                    early_stop_yield = early_stop_yield, 
                                                                    randomStateValue = randomStateValue)

12.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 12, cal: 0, dav: 11
7.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 7, cal: 0, dav: 7
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
18.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 18, cal: 0, dav: 9
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
4.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 4, cal: 0, dav: 4
11.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 11, cal: 0, dav: 0
5.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 5, cal: 0, dav: 2
19.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 19, cal: 1, dav: 19
10.0 iters without change reached (0.2 yield), iters w

In [14]:
#- Seasonal Component
result_s, clusters_tests_s = aux_time_series.clustering_kmeans_multi(x = dtw_matrix_seaso,
                                                                     y = dtw_index_seaso,
                                                                    categories = categories, 
                                                                    early_stop_yield = early_stop_yield, 
                                                                    randomStateValue = randomStateValue)

12.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 12, cal: 0, dav: 12
7.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 7, cal: 0, dav: 1
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
18.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 18, cal: 0, dav: 16
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
4.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 4, cal: 0, dav: 4
11.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 11, cal: 0, dav: 0
5.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 5, cal: 0, dav: 2
19.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 19, cal: 0, dav: 19
10.0 iters without change reached (0.2 yield), iters 

In [15]:
#- Residual Component
result_r, clusters_tests_r = aux_time_series.clustering_kmeans_multi(x = dtw_matrix_resid,
                                                                     y = dtw_index_resid,
                                                                    categories = categories, 
                                                                    early_stop_yield = early_stop_yield, 
                                                                    randomStateValue = randomStateValue)

12.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 12, cal: 0, dav: 12
7.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 7, cal: 0, dav: 3
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 5
18.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 18, cal: 0, dav: 15
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
4.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 4, cal: 0, dav: 4
11.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 11, cal: 0, dav: 3
5.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 5, cal: 0, dav: 1
19.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 19, cal: 0, dav: 19
10.0 iters without change reached (0.2 yield), iters 

In [16]:
##-- Unpacking Data
clustered_data = pd.DataFrame()
selection_criteria = 'suggested_clusters'

packing_data = [('trend',result_t),('seasonal',result_s),('residual',result_r)]

for package in packing_data:
    by = package[0]
    data = package[1]    

    for key in list(data.keys()):    
        extracted_data = data[key].copy()
        extracted_data = extracted_data[['index',selection_criteria]]
        extracted_data.columns = ['index','cluster']

        extracted_data['category'] = key        
        extracted_data['by'] = by   
        #extracted_data['key_join'] = extracted_data['category'] + '-' + extracted_data['index'] 
        clustered_data = pd.concat([clustered_data,extracted_data])    

clustered_data.head()

Unnamed: 0,index,cluster,category,by
0,ARE,0,cm01,trend
1,ARG,0,cm01,trend
2,AUS,0,cm01,trend
3,AUT,0,cm01,trend
4,AZE,0,cm01,trend


In [17]:
##-- Pivoting Data and Creating Join Key
clustered_data = clustered_data.pivot(index = ['index','category'], columns = 'by', values = 'cluster').reset_index()
clustered_data.columns = clustered_data.columns.rename('')
clustered_data.sort_values('index', ascending = True, inplace = True)

clustered_data['key'] = clustered_data['index'] + '-' + clustered_data['category']
clustered_data = clustered_data[['key','trend','seasonal','residual']]
clustered_data.columns = ['key','cluster_by_trend_dtw','cluster_by_seasonal_dtw','cluster_by_residual_dtw']
clustered_data.head()

Unnamed: 0,key,cluster_by_trend_dtw,cluster_by_seasonal_dtw,cluster_by_residual_dtw
0,AFG-cm11,0,0,0
1,AFG-cm13,0,0,0
2,AFG-cm14,0,0,0
3,AGO-cm02,0,0,0
4,AGO-cm06,0,0,0


In [18]:
##--Adding clusters to main ft (fact table)
#key in ft_minerals
ft_minerals.drop(columns = 'key', inplace = True)
ft_minerals['key'] = ft_minerals['country_code'] + '-' + ft_minerals['mineral_code']

#join
ft_minerals = ft_minerals.merge(clustered_data, how = 'left', on = 'key')
ft_minerals.drop(columns = 'key', inplace = True)
ft_minerals.head()

Unnamed: 0,country_code,country_desc,region_desc,mineral_code,mineral_desc,unit_code,unit_desc,year,quantity_produced,quantity_import,...,imports_USD_percentage_of_region,exports_USD_percentage_of_global,exports_USD_percentage_of_region,level_original,residual,seasonal,trend,cluster_by_trend_dtw,cluster_by_seasonal_dtw,cluster_by_residual_dtw
0,CMR,cameroon,Africa,cm01,aluminium,1,tonnes,1970,52000.0,0.0,...,0.0,0.0,0.0,52000.0,0.0,3047.164352,0.0,0,0,0
1,EGY,egypt,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-1710.436921,0.0,0,0,0
2,GHA,ghana,Africa,cm01,aluminium,1,tonnes,1970,113000.0,0.0,...,0.0,0.0,0.0,113000.0,0.0,25059.293692,0.0,0,0,0
3,MOZ,mozambique,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,6046.878183,0.0,0,0,0
4,NGA,nigeria,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-3027.256944,0.0,0,0,0


In [19]:
struct_shape == ft_minerals.shape[0]

True

#### 02.03. Alternative (Raw Time Series Component - No DTW)

To analyze and determine the optimal number of clusters, three measures are considered for each clustering algorithm: the silhouette score, the Calinski-Harabasz score and the Davies-Bouldin score.

In [20]:
##--Pivoting Dataset - Reshaping
#- Trend Component
matrix_trend, index_trend = aux_time_series.ts_component_kmeans_preproc(data = ft_prod_tsdecomp, index = 'country_code', columns = 'date', values = 'trend', category = 'mineral_code')

#- Seasonal Component
matrix_seaso, index_seaso = aux_time_series.ts_component_kmeans_preproc(data = ft_prod_tsdecomp, index = 'country_code', columns = 'date', values = 'seasonal', category = 'mineral_code')

#- Residual Component
matrix_resid, index_resid = aux_time_series.ts_component_kmeans_preproc(data = ft_prod_tsdecomp, index = 'country_code', columns = 'date', values = 'residual', category = 'mineral_code')

process finished, 15 matrix created and 15 index lists created

In [21]:
#Parameters
randomStateValue = 92
early_stop_yield = 0.2 #% of total categories without changing results
categories = ft_minerals.mineral_code.unique()


In [22]:
#- Trend Component
result_t, clusters_tests_t = aux_time_series.clustering_kmeans_multi(x = matrix_trend, 
                                                                     y = index_trend,
                                                                    categories = categories, 
                                                                    early_stop_yield = early_stop_yield, 
                                                                    randomStateValue = randomStateValue)

12.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 12, cal: 0, dav: 12
7.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 7, cal: 0, dav: 7
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 5
18.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 18, cal: 0, dav: 0
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
4.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 4, cal: 0, dav: 4
11.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 11, cal: 0, dav: 0
5.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 5, cal: 0, dav: 3
19.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 19, cal: 0, dav: 19
10.0 iters without change reached (0.2 yield), iters w

In [23]:
#- Seasonal Component
result_s, clusters_tests_s = aux_time_series.clustering_kmeans_multi(x = matrix_seaso,
                                                                     y = index_seaso,
                                                                    categories = categories, 
                                                                    early_stop_yield = early_stop_yield, 
                                                                    randomStateValue = randomStateValue)

12.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 12, cal: 12, dav: 12
7.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 7, cal: 0, dav: 5
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 3, dav: 6
18.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 18, cal: 0, dav: 18
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
4.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 4, cal: 4, dav: 4
11.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 11, cal: 0, dav: 0
5.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 5, cal: 0, dav: 4
19.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 19, cal: 0, dav: 18
10.0 iters without change reached (0.2 yield), iters

In [24]:
#- Residual Component
result_r, clusters_tests_r = aux_time_series.clustering_kmeans_multi(x = matrix_resid,
                                                                     y = index_resid,
                                                                    categories = categories, 
                                                                    early_stop_yield = early_stop_yield, 
                                                                    randomStateValue = randomStateValue)

12.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 12, cal: 12, dav: 12
7.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 7, cal: 0, dav: 3
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 5, dav: 5
18.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 18, cal: 0, dav: 18
6.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 6, cal: 0, dav: 6
4.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 4, cal: 4, dav: 4
11.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 11, cal: 0, dav: 4
5.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 5, cal: 0, dav: 5
19.0 iters without change reached (0.2 yield), iters without change by measure -> sil: 19, cal: 0, dav: 19
10.0 iters without change reached (0.2 yield), iters

In [25]:
##-- Unpacking Data
clustered_data = pd.DataFrame()
selection_criteria = 'suggested_clusters'

packing_data = [('trend',result_t),('seasonal',result_s),('residual',result_r)]

for package in packing_data:
    by = package[0]
    data = package[1]    

    for key in list(data.keys()):    
        extracted_data = data[key].copy()
        extracted_data = extracted_data[['index',selection_criteria]]
        extracted_data.columns = ['index','cluster']

        extracted_data['category'] = key        
        extracted_data['by'] = by   
        #extracted_data['key_join'] = extracted_data['category'] + '-' + extracted_data['index'] 
        clustered_data = pd.concat([clustered_data,extracted_data])    

clustered_data.head()

Unnamed: 0,index,cluster,category,by
0,ARE,0,cm01,trend
1,ARG,0,cm01,trend
2,AUS,0,cm01,trend
3,AUT,0,cm01,trend
4,AZE,0,cm01,trend


In [26]:
##-- Pivoting Data and Creating Join Key
clustered_data = clustered_data.pivot(index = ['index','category'], columns = 'by', values = 'cluster').reset_index()
clustered_data.columns = clustered_data.columns.rename('')
clustered_data.sort_values('index', ascending = True, inplace = True)

clustered_data['key'] = clustered_data['index'] + '-' + clustered_data['category']
clustered_data = clustered_data[['key','trend','seasonal','residual']]
clustered_data.columns = ['key','cluster_by_trend','cluster_by_seasonal','cluster_by_residual']
clustered_data.head()

Unnamed: 0,key,cluster_by_trend,cluster_by_seasonal,cluster_by_residual
0,AFG-cm11,0,1,0
1,AFG-cm13,0,0,0
2,AFG-cm14,0,0,0
3,AGO-cm02,0,0,0
4,AGO-cm06,0,1,0


In [27]:
##--Adding clusters to main ft (fact table) 
#key in ft_minerals
ft_minerals['key'] = ft_minerals['country_code'] + '-' + ft_minerals['mineral_code']

#join
ft_minerals = ft_minerals.merge(clustered_data, how = 'left', on = 'key')
ft_minerals.drop(columns = 'key', inplace = True)
ft_minerals.head()

Unnamed: 0,country_code,country_desc,region_desc,mineral_code,mineral_desc,unit_code,unit_desc,year,quantity_produced,quantity_import,...,level_original,residual,seasonal,trend,cluster_by_trend_dtw,cluster_by_seasonal_dtw,cluster_by_residual_dtw,cluster_by_trend,cluster_by_seasonal,cluster_by_residual
0,CMR,cameroon,Africa,cm01,aluminium,1,tonnes,1970,52000.0,0.0,...,52000.0,0.0,3047.164352,0.0,0,0,0,0,0,0
1,EGY,egypt,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,-1710.436921,0.0,0,0,0,0,0,0
2,GHA,ghana,Africa,cm01,aluminium,1,tonnes,1970,113000.0,0.0,...,113000.0,0.0,25059.293692,0.0,0,0,0,0,0,0
3,MOZ,mozambique,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,6046.878183,0.0,0,0,0,0,0,0
4,NGA,nigeria,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,0.0,0.0,-3027.256944,0.0,0,0,0,0,0,0


In [28]:
struct_shape == ft_minerals.shape[0]

True

### 03.00. Outliers Detection

Considering that we have the residual component, we can use it to detect outliers in the time series; in this case we apply the Hampel filter without imputations.

Outlier detection is performed separately for each country and mineral series.
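
For readers unfamiliar with the Hampel filter, a minimal sketch follows. It flags a point when its distance to the rolling median exceeds `n_sigmas` times the scaled median absolute deviation (MAD) of the window. The `hampel_flags` function, its `half_window` parameter and the sample values are hypothetical; the project's `aux_time_series.hampel_filter` may define its window differently.

```python
import numpy as np
import pandas as pd

def hampel_flags(series, half_window=3, n_sigmas=3.0):
    """Flag points far from the rolling median, measured in scaled MADs."""
    window = 2 * half_window + 1
    rolling = series.rolling(window=window, center=True, min_periods=1)
    median = rolling.median()
    # Median absolute deviation inside each window
    mad = rolling.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    # 1.4826 scales the MAD to estimate the standard deviation of Gaussian data
    threshold = n_sigmas * 1.4826 * mad
    return (series - median).abs() > threshold

# Hypothetical residual series with a single spike
residual = pd.Series([1.0, 1.1, 0.9, 1.0, 10.0, 1.0, 1.1, 0.9, 1.0, 1.05])
flags = hampel_flags(residual, half_window=3, n_sigmas=3.0)
print(residual[flags])
```

Using the median and MAD instead of the mean and standard deviation keeps the threshold from being inflated by the very outliers it is meant to detect.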

In [29]:
data = ft_minerals.copy()
index = 'year'
target = 'residual'
category = 'country_code'
sub_category = 'mineral_code'

In [30]:
#one series per (country, mineral) pair
iterators = tuple(data[[category,sub_category]].drop_duplicates().itertuples(index = False))

outliers = pd.DataFrame()
for cat_value, sub_value in iterators: #named values avoid shadowing the built-in iter()
    X = data[(data[category] == cat_value) & (data[sub_category] == sub_value)][[index,target]].reset_index(drop = True).sort_values(index, ascending = True).copy()

    results = aux_time_series.hampel_filter(data = X, 
                                            index = index,
                                            target = target,
                                            windows_size = 10, 
                                            n_sigmas = 3)

    results['key'] = results[index].astype(str) + '-' + cat_value + '-' + sub_value

    outliers = pd.concat([outliers,results])

outliers.shape[0] == ft_minerals.shape[0]

True

In [31]:
##--Adding outliers to main ft (fact table) 
#key in ft_minerals
ft_minerals['key'] = ft_minerals[index].astype(str) + '-' + ft_minerals[category] + '-' + ft_minerals[sub_category]

#join
ft_minerals = ft_minerals.merge(outliers[['key','outlier',f'imputed_{target}_values']], how = 'left', on = 'key')
ft_minerals.drop(columns = 'key', inplace = True)
ft_minerals.head()

Unnamed: 0,country_code,country_desc,region_desc,mineral_code,mineral_desc,unit_code,unit_desc,year,quantity_produced,quantity_import,...,seasonal,trend,cluster_by_trend_dtw,cluster_by_seasonal_dtw,cluster_by_residual_dtw,cluster_by_trend,cluster_by_seasonal,cluster_by_residual,outlier,imputed_residual_values
0,CMR,cameroon,Africa,cm01,aluminium,1,tonnes,1970,52000.0,0.0,...,3047.164352,0.0,0,0,0,0,0,0,False,0.0
1,EGY,egypt,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,-1710.436921,0.0,0,0,0,0,0,0,False,0.0
2,GHA,ghana,Africa,cm01,aluminium,1,tonnes,1970,113000.0,0.0,...,25059.293692,0.0,0,0,0,0,0,0,False,0.0
3,MOZ,mozambique,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,6046.878183,0.0,0,0,0,0,0,0,False,0.0
4,NGA,nigeria,Africa,cm01,aluminium,1,tonnes,1970,0.0,0.0,...,-3027.256944,0.0,0,0,0,0,0,0,False,0.0


In [32]:
struct_shape == ft_minerals.shape[0]

True

### 04.00. Exporting Data

In [33]:
##--Python EDA
ft_minerals.to_csv(wd_out + '/ft_minerals.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])
ft_trade.to_csv(wd_out + '/ft_trade.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])

In [34]:
ft_minerals.drop(columns = ['country_desc', 'region_desc', 'mineral_desc', 'unit_desc'])

Unnamed: 0,country_code,mineral_code,unit_code,year,quantity_produced,quantity_import,quantity_export,value_export,value_import,trade_isna,...,seasonal,trend,cluster_by_trend_dtw,cluster_by_seasonal_dtw,cluster_by_residual_dtw,cluster_by_trend,cluster_by_seasonal,cluster_by_residual,outlier,imputed_residual_values
0,CMR,cm01,1,1970,52000.0,0.0000,0.000,0.000000e+00,0.000000e+00,1,...,3047.164352,0.000000e+00,0,0,0,0,0,0,False,0.000000
1,EGY,cm01,1,1970,0.0,0.0000,0.000,0.000000e+00,0.000000e+00,0,...,-1710.436921,0.000000e+00,0,0,0,0,0,0,False,0.000000
2,GHA,cm01,1,1970,113000.0,0.0000,0.000,0.000000e+00,0.000000e+00,1,...,25059.293692,0.000000e+00,0,0,0,0,0,0,False,0.000000
3,MOZ,cm01,1,1970,0.0,0.0000,0.000,0.000000e+00,0.000000e+00,0,...,6046.878183,0.000000e+00,0,0,0,0,0,0,False,0.000000
4,NGA,cm01,1,1970,0.0,0.0000,0.000,0.000000e+00,0.000000e+00,0,...,-3027.256944,0.000000e+00,0,0,0,0,0,0,False,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40795,BRA,cm09,1,2020,421500.0,137514.5400,25.000,2.000000e+04,1.056098e+08,0,...,-12695.853009,4.283586e+05,3,0,0,0,0,0,False,5837.228009
40796,CHL,cm09,1,2020,28662.0,0.0003,55037.807,2.934161e+07,4.889000e+01,0,...,3664.631655,3.172079e+04,11,0,0,0,0,0,False,-6723.423322
40797,COL,cm09,1,2020,0.0,0.0000,0.000,0.000000e+00,0.000000e+00,0,...,-1.983796,0.000000e+00,0,0,0,0,0,0,False,1.983796
40798,ECU,cm09,1,2020,0.0,0.0000,0.000,0.000000e+00,0.000000e+00,0,...,-13.771412,0.000000e+00,0,0,0,0,0,0,False,13.771412


In [35]:
##--BI Tools
#- ft_minerals
cols_to_drop = ['cluster_by_trend_dtw', 'cluster_by_seasonal_dtw', 'cluster_by_residual_dtw', 
                'cluster_by_trend', 'cluster_by_seasonal', 'cluster_by_residual',
                'country_desc', 'region_desc', 'mineral_desc', 'unit_desc']

ft_minerals['cluster_key'] = ft_minerals['country_code'] + '-' + ft_minerals['mineral_code']             
ft_minerals.drop(columns = cols_to_drop).to_csv(wd_out + '/BI/ft_production.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])

#- ft_trade
cols_to_drop = ['country_desc', 'region_desc', 'mineral_desc', 'unit_desc']
ft_trade.drop(columns = cols_to_drop).to_csv(wd_out + '/BI/ft_trade.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])

#- Clusters
ft_cluster = ft_minerals[['country_code','mineral_code','cluster_by_trend_dtw', 'cluster_by_seasonal_dtw',
                           'cluster_by_residual_dtw', 'cluster_by_trend', 'cluster_by_seasonal',
                           'cluster_by_residual']].copy().drop_duplicates().reset_index(drop = True)

ft_cluster['key'] = ft_cluster['country_code'] + '-' + ft_cluster['mineral_code']
ft_cluster.to_csv(wd_out + '/BI/ft_cluster_countries.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])

#- Other Dimensions
dim_unit.to_csv(wd_out + '/BI/dim_unit.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])
dim_country.to_csv(wd_out + '/BI/dim_country.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])
dim_mineral.to_csv(wd_out + '/BI/dim_mineral.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])
dim_calendar.to_csv(wd_out + '/BI/dim_calendar.csv.gz', sep = csvAttr_exp['sep'], index = False, encoding = csvAttr_exp['encoding'])