<div class="alert alert-block alert-info">
This script <b>cleans the raw data by dropping some columns of the <code>df_master_raw</code></b>. 
    <hr> 
    Note: <br>
    <i><b>Input file(s)' name(s) and metadata</b></i> (if available) are <b>printed out (below 👇🏼) in 'read data to df' section.</b>
</div>

In [1]:
# %env
# %who_ls
# %who
# %who int
# %pinfo <var name>

# Imports

In [2]:
%config IPCompleter.use_jedi = False # disable jedi autocompleter (https://stackoverflow.com/a/65734178/14485040)

import project_path  # enables access to the `src` directory via a relative path
from src.utils import explore_dir, make_readme_info
from src.utils import read_excel_to_pandas as r_excel
from src.utils import set_outputs_dir
from src.utils import write_pandas_to_excel as w_excel

%run init_nb.ipynb

# INPUTS: Identify file(s) and read data to df

In [3]:
# Explore the directory to find the file(s)
inputs_dir, files_list = explore_dir(
    path_to_dir=r"..\data\lcaforsac", file_extension="xlsx", print_files_list=True
)

['lcia-results-from-sp910-combined.xlsx', 'mapped-lcia-results.xlsx']


<div class="alert alert-block alert-danger">
    <strong> pending (possible) improvements: </strong> <br>


1. Find the method columns with a regular expression: the pattern is the tuple-like naming of the methods.


</div>
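
The improvement noted above could be sketched as follows. This is only an illustration, assuming that every method column is labelled as the string form of a `(method, category, unit)` tuple, as in the labels printed further down. The column list and the names `TUPLE_LABEL` and `group_method_columns` are hypothetical; the regular expression detects candidate labels and `ast.literal_eval` parses them, so commas inside a category name are handled.

```python
import ast
import re
from collections import defaultdict

# Hypothetical column labels mimicking the tuple-like naming in the workbook
columns = [
    "Activity",
    "geo",
    "('IPCC 2013 GWP 100a V1.03', 'IPCC GWP 100a', 'kg CO2 eq')",
    "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ozone formation, Human health', 'kg NOx eq')",
]

# Candidate method labels look like "('...', '...', '...')"
TUPLE_LABEL = re.compile(r"^\('.*', '.*', '.*'\)$")


def group_method_columns(labels):
    """Return {method name: [labels]} for labels that parse as 3-tuples."""
    grouped = defaultdict(list)
    for label in labels:
        if not TUPLE_LABEL.match(label):
            continue  # plain metadata column
        try:
            parsed = ast.literal_eval(label)
        except (ValueError, SyntaxError):
            continue  # looked tuple-like but is not a valid literal
        if isinstance(parsed, tuple) and len(parsed) == 3:
            grouped[parsed[0]].append(label)
    return dict(grouped)


grouped = group_method_columns(columns)
print(list(grouped))
```

Unlike a substring test, this groups by the exact method name stored in the first tuple element, so one method name that happens to be contained in another would not be mixed up.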

In [4]:
# Process raw data

# Master df with raw data
df_master_raw = r_excel(inputs_dir, "mapped-lcia-results.xlsx", sheets="Sheet1")
print(
    "df of the master data (raw) ".ljust(40, "."),
    f"{df_master_raw.shape}\n".rjust(13, "."),
)

# Get unique names of the LCIA methods in a list
LCIA_METHODS = r_excel(
    inputs_dir, "mapped-lcia-results.xlsx", sheets="df_lcia_labels", show_readme=False
)["Method"].to_list()

print("Unique names of LCIA methods ({} in total):".format(len(LCIA_METHODS)))
print(
    "".join(map('\n\t"{}", '.format, LCIA_METHODS))
)  # unique method names from all the workbooks


===> Trying to load 'readme' data... ===
File: mapped-lcia-results.xlsx from
C:\Users\ViteksPC\Documents\00-ETH_projects\17-AESA_ecoinvent_chemicals\notebooks\0.02-vt-map-lcia-results-to-sp910-and-ei35apos-processes.ipynb
Generated on 2021-12-03 (Friday), 16:38:03 by Tulus, V.
Includes:
<<<
Sheet1: LCIA method results (per category) for ALL chemical markets from SimaPro910 mapped against metadata from Ecoinvent v3.5 APOS. 
df_lcia_labels: unique names of the LCIA methods used in Sheet1.
>>>

df of the master data (raw) ............ ...(946, 93)

Unique names of LCIA methods (7 in total):

	"ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H", 
	"PBs-LCIA (baseline) V0.72", 
	"PBs - Alternative: EF - LANCA V0.70", 
	"ReCiPe 2016 Endpoint (H) V1.03 / World (2010) H/A", 
	"Cumulative Energy Demand V1.11 / Cumulative energy demand", 
	"IPCC 2013 GWP 100a V1.03", 
	"PBs-LCIA V0.71 V0.71", 


# Operations 
- drop redundant and unnecessary columns
<div class="alert alert-block alert-info">
created: <code>df_analysis_prev</code>
</div>

## Identify columns w/ method labels and list "non-method" columns

In [5]:
# a. select all the methods, make a dictionary
"""creates a dictionary -> {'method': [method labels in df]}
        {'method1': ["('method1', 'category1', 'unit1')", "('method1', 'category2', 'unit2')", ...],
         'method2': [...]}
"""
dict_fullMethods = {}

for method in LCIA_METHODS:
    # collect every column label that contains the method name
    dict_fullMethods[method] = [
        label for label in df_master_raw.columns if method in label
    ]

# b. flat list of df's labels corresponding to a method
LCIA_METHODS_PER_CATEGORY = [
    value for key in dict_fullMethods.keys() for value in dict_fullMethods[key]
]
# (an alternative) [item for sublist in list(dict_fullMethods.values()) for item in sublist]
print(
    "df_master_raw (consisting of {} columns) contains a list of {} methods."
    "\n\nHere is a sample of 3 randomly shown methods:"
    "\n\t- {}\n\t- {}\n\t- {}"
    "\n\n*Check the full list of methods by printing 'LCIA_METHODS_PER_CATEGORY',\n"
    "or using 'dict_fullMethods' dictionary with keys in 'LCIA_METHODS'.".format(
        len(df_master_raw.columns),
        len(LCIA_METHODS_PER_CATEGORY),
        *random.sample(LCIA_METHODS_PER_CATEGORY, 3)
    )
)
# c. rest of the columns in df_master_raw
rest_of_columns = [col for col in df_master_raw.columns if col not in LCIA_METHODS_PER_CATEGORY]
print(
    "\nThe rest of the {} columns, shown below, "
    "may contain redundant or unnecessary information,"
    "\nfeel free to select only required columns.".format(
        len(rest_of_columns)
    )
)
print("".join(map('\n\t"{}", '.format, rest_of_columns)))

df_master_raw (consisting of 93 columns) contains a list of 62 methods.

Here is a sample of 3 randomly shown methods:
	- ('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Terrestrial ecotoxicity', 'kg 1,4-DCB')
	- ('PBs-LCIA (baseline) V0.72', 'Ocean acidification', 'Omega Aragon')
	- ('PBs - Alternative: EF - LANCA V0.70', 'Non-cancer human health effects', 'CTUh')

*Check the full list of methods by printing 'LCIA_METHODS_PER_CATEGORY',
or using 'dict_fullMethods' dictionary with keys in 'LCIA_METHODS'.

The rest of the 31 columns, shown below, may contain redundant or unnecessary information,
feel free to select only required columns.

	"wkbName", 
	"Activity", 
	"activity_comment", 
	"type", 
	"referenceProduct", 
	"shortName_geo", 
	"activityName_SP", 
	"fullName_SimaPro", 
	"unit", 
	"amount", 
	"allocation_percentage", 
	"wasteType", 
	"category", 
	"inline_comment", 
	"activityName_EI", 
	"geo", 
	"activity_ISICclass", 
	"activity_ecoSpold01class", 
	"technologyLevel", 
	"r

## Select columns w/ non-method labels
<div class="alert alert-block alert-danger">
    <strong> <code>METADATA</code> has to be populated manually ❗ </strong>
</div>

In [6]:
# 2. Pick from the rest of the columns
print(df_master_raw[rest_of_columns].nunique())

# list of df's non-method labels (select manually from the list printed above)
METADATA = [
    "Activity",
    "activity_comment",
    "type",
    "referenceProduct",
    "category",
    "inline_comment",
    # 👆🏼 above columns are originally from _SP,
    # 👇🏼 below from _EI
    "geo",
    "activity_ISICclass",
    "activity_ecoSpold01class",
    "technologyLevel",
    "referenceProductAmount",
    "referenceProductUnit",
    "referenceProduct_prodVolume",
    "referenceProduct_prodVolumeComment",
    "referenceProduct_price",
    "referenceProduct_priceUnit",
    "referenceProduct_priceComment",
    "referenceProduct_casNumber",
    "referenceProduct_CPCclass",
    "activity_generalComment",
    "sourceFilename",
]
print(
    "\nTotal number of non-method columns (above) is {}, you selected {} of them.".format(
        len(rest_of_columns), len(METADATA)
    )
)

wkbName                                18
Activity                              946
activity_comment                      946
type                                    1
referenceProduct                      720
shortName_geo                           8
activityName_SP                       243
fullName_SimaPro                      946
unit                                    2
amount                                  1
allocation_percentage                   1
wasteType                              17
category                               48
inline_comment                        516
activityName_EI                       724
geo                                     8
activity_ISICclass                     40
activity_ecoSpold01class               37
technologyLevel                         2
referenceProductName                  720
referenceProductAmount                  1
referenceProductUnit                    2
referenceProduct_prodVolume           514
referenceProduct_prodVolumeComment
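
The unique-value counts printed above can support the manual selection of `METADATA`. The sketch below is a hedged illustration in plain Python: the counts are copied from part of that printout, the helper name `flag_columns` is hypothetical, and with the real dataframe the dictionary could presumably be obtained via `df_master_raw[rest_of_columns].nunique().to_dict()`. It only flags columns; it does not decide which ones to drop.

```python
# Unique-value counts copied from part of the nunique() printout above
nunique_counts = {
    "wkbName": 18,
    "Activity": 946,
    "type": 1,
    "unit": 2,
    "amount": 1,
    "allocation_percentage": 1,
    "referenceProductAmount": 1,
}
n_rows = 946  # number of rows in df_master_raw


def flag_columns(counts, n_rows):
    """Split column names into constant and one-value-per-row groups."""
    constant = [col for col, n in counts.items() if n <= 1]
    row_unique = [col for col, n in counts.items() if n == n_rows]
    return constant, row_unique


constant_cols, id_like_cols = flag_columns(nunique_counts, n_rows)
print("constant:", constant_cols)
print("one value per row:", id_like_cols)
```

A constant column is not automatically redundant: the `METADATA` list in this notebook keeps `type` and `referenceProductAmount`, which both hold a single value, so the flags are a prompt for review rather than a rule.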

In [7]:
# Make df of METADATA for later export

df_metadata = pd.DataFrame(METADATA, columns=["METADATA"]) 
# df_metadata

## Select columns w/ method labels
<div class="alert alert-block alert-danger">
    <strong> <code>METHODS</code> is generated here 👇🏼</strong>  <br>
     Will be used throughout the script for calculations and plotting
</div>

In [8]:
# LCIA_METHODS_PER_CATEGORY # here is the complete list of methods per category if needed
print("Here is the list of method names (again): ")
print("".join(map('\n\t"{}", '.format, LCIA_METHODS)))

Here is the list of method names (again): 

	"ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H", 
	"PBs-LCIA (baseline) V0.72", 
	"PBs - Alternative: EF - LANCA V0.70", 
	"ReCiPe 2016 Endpoint (H) V1.03 / World (2010) H/A", 
	"Cumulative Energy Demand V1.11 / Cumulative energy demand", 
	"IPCC 2013 GWP 100a V1.03", 
	"PBs-LCIA V0.71 V0.71", 


In [9]:
# select from method names printed above
select_keys = [
    "IPCC 2013 GWP 100a V1.03",
    "ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H",
]  # change manually if needed

METHODS = []
for key in select_keys:
    METHODS += dict_fullMethods[key]
print("{} methods have been selected:".format(len(METHODS)))
del select_keys
METHODS

19 methods have been selected:


["('IPCC 2013 GWP 100a V1.03', 'IPCC GWP 100a', 'kg CO2 eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Global warming', 'kg CO2 eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Stratospheric ozone depletion', 'kg CFC11 eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ionizing radiation', 'kBq Co-60 eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ozone formation, Human health', 'kg NOx eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Fine particulate matter formation', 'kg PM2.5 eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ozone formation, Terrestrial ecosystems', 'kg NOx eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Terrestrial acidification', 'kg SO2 eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Freshwater eutrophication', 'kg P eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Marine eutrophication', 'kg N eq')",
 "('ReCiPe 2016 Midpoint (H) V1.03 / World (

In [10]:
# Make df of METHODS for later export

df_methods = pd.DataFrame(METHODS, columns=["METHODS"]) 
# df_methods

## Combine selected methods and metadata
- Generate ``analysis_prev`` df (and delete ``df_master_raw`` ?)

In [11]:
# 3. Combine the selected metadata and method columns

df_analysis_prev = df_master_raw.filter(items=METADATA + METHODS, axis=1).copy()
## or alternatively: 
## df_analysis_prev = df_master_raw.loc[:, list(METADATA + METHODS)].copy()
df_analysis_prev.sort_values(by="Activity", inplace=True)

# del df_master_raw # delete to free memory
pd.options.display.max_columns = None

print(
    "Created **df_analysis_prev** dataframe is of {} shape.".format(
        df_analysis_prev.shape
    )
)
df_analysis_prev.sample(5)

Created **df_analysis_prev** dataframe is of (946, 40) shape.


Unnamed: 0,Activity,activity_comment,type,referenceProduct,category,inline_comment,geo,activity_ISICclass,activity_ecoSpold01class,technologyLevel,referenceProductAmount,referenceProductUnit,referenceProduct_prodVolume,referenceProduct_prodVolumeComment,referenceProduct_price,referenceProduct_priceUnit,referenceProduct_priceComment,referenceProduct_casNumber,referenceProduct_CPCclass,activity_generalComment,sourceFilename,"('IPCC 2013 GWP 100a V1.03', 'IPCC GWP 100a', 'kg CO2 eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Global warming', 'kg CO2 eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Stratospheric ozone depletion', 'kg CFC11 eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ionizing radiation', 'kBq Co-60 eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ozone formation, Human health', 'kg NOx eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Fine particulate matter formation', 'kg PM2.5 eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Ozone formation, Terrestrial ecosystems', 'kg NOx eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Terrestrial acidification', 'kg SO2 eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Freshwater eutrophication', 'kg P eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Marine eutrophication', 'kg N eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Terrestrial ecotoxicity', 'kg 1,4-DCB')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Freshwater ecotoxicity', 'kg 1,4-DCB')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Marine ecotoxicity', 'kg 1,4-DCB')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Human carcinogenic toxicity', 'kg 1,4-DCB')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Human non-carcinogenic toxicity', 'kg 1,4-DCB')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Land use', 'm2a crop eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Mineral resource scarcity', 'kg Cu eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Fossil resource scarcity', 'kg oil eq')","('ReCiPe 2016 Midpoint (H) V1.03 / World (2010) H', 'Water consumption', 'm3')"
742,"Zeolite, slurry, without water, in 50% solutio...",The transport amounts are based on eurostat tr...,Products,"Zeolite, slurry, without water, in 50% solutio...",Chemicals\Washing agents\Builders\Market,Production Volume Amount: 1.32236940655721,RER,"2023:Manufacture of soap and detergents, clean...",,,1,kg,1.322369,,1.04,EUR2005,Calculated based on inputs: The price of the p...,1318-02-1,"353: Soap, cleaning preparations, perfumes and...",The transport amounts are based on eurostat tr...,17293_49cd7b4e-f489-52ed-a8dd-b1a7a863335f_136...,2.304393,2.336952,1.352792e-06,0.164593,0.00595,0.004536,0.006025,0.010201,0.001172,0.00012,9.576944,0.124431,0.178216,0.778974,2.593167,0.051024,0.055383,0.606109,0.033281
195,"Iron(II) chloride {GLO}| market for | APOS, S","In this market, expert judgement was used to d...",Products,Iron(II) chloride,Chemicals\Inorganic\Market,Production Volume Amount: 4,GLO,2011:Manufacture of basic chemicals,,,1,kg,4.0,,0.33,EUR2005,Calculated value based on data from United Nat...,7758-94-3,34240: Phosphates of triammonium; salts and pe...,"In this market, expert judgement was used to d...",17182_7574458f-153e-4670-b06a-4f1073cc6f14_3b8...,0.254924,0.258795,1.191418e-07,0.030102,0.000785,0.000657,0.000795,0.001233,0.0002,1.2e-05,2.14669,0.020188,0.028839,0.019613,0.659444,0.010073,0.002444,0.066322,0.001895
822,"Mischmetal {GLO}| market for | APOS, S","In this market, expert judgement was used to d...",Products,Mischmetal,Metals\Non ferro\Market,Production Volume Amount: 5378721.06382979,GLO,2420:Manufacture of basic precious and other n...,electronics/module,0.0,1,kg,5378721.0,,6.63,EUR2005,Calculated based on inputs: The price of the p...,,"34290: Compounds of rare earth metals, of yttr...","In this market, expert judgement was used to d...",22441_8cccbd36-2a78-42c2-b29c-4629fc59634a_474...,21.830847,22.189012,1.242645e-05,1.758199,0.050372,0.047848,0.051625,0.088914,0.010717,0.000993,65.608319,0.800723,1.138628,0.895452,25.679308,0.678095,0.066717,7.667719,0.274947
524,Methyl methacrylate {RER}| market for methyl m...,This dataset represents the supply of 1 kg of ...,Products,Methyl methacrylate,Chemicals\Organic\Market,Production Volume Amount: 1.32236940655721,RER,2013:Manufacture of plastics and synthetic rub...,,,1,kg,1.322369,,0.186,EUR2005,Temporary price data. Calculated as 90% of pur...,80-62-6,347: Plastics in primary forms,This dataset represents the supply of 1 kg of ...,17202_34a4a6f9-f3b1-5421-999d-6deee83660d8_206...,6.934042,7.17915,1.662463e-08,0.001113,0.014393,0.008629,0.015664,0.028019,0.000299,0.000715,0.429529,0.013173,0.018205,0.131612,0.311101,0.000612,0.000518,2.492809,0.028157
887,"Urea formaldehyde foam, in situ foaming {GLO}|...","In this market, expert judgement was used to d...",Products,"Urea formaldehyde foam, in situ foaming",Construction\Insulation\Market,Production Volume Amount: 4,GLO,2220:Manufacture of plastics products,insulation materials/production,0.0,1,kg,4.0,,1.77,EUR2005,Calculated from EU prices by use of exchange r...,,363: Semi-manufactures of plastics,"In this market, expert judgement was used to d...",23466_0b74180d-2a09-446c-a051-21637de5af84_414...,2.925079,2.97737,1.227155e-06,0.089152,0.006601,0.005344,0.006981,0.014051,0.000774,0.000309,15.940408,0.093202,0.138517,0.246742,3.00503,0.066098,0.014996,1.344878,0.118552
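
One caveat about the column selection used for `df_analysis_prev`: to my knowledge, `DataFrame.filter(items=...)` keeps only the labels it finds and does not raise on absent ones, whereas the `.loc` alternative raises a `KeyError`. A typo in `METADATA` would therefore pass unnoticed with `filter`. The sketch below shows a small pre-check in plain Python; the helper name `check_labels` and the example labels are hypothetical.

```python
from collections import Counter


def check_labels(requested, available):
    """Return (missing, duplicated) labels among the requested ones."""
    available = set(available)
    missing = [label for label in requested if label not in available]
    duplicated = [label for label, n in Counter(requested).items() if n > 1]
    return missing, duplicated


# Hypothetical labels, for illustration only
available_columns = ["Activity", "geo", "type", "wkbName"]
requested_columns = ["Activity", "geo", "geo", "sourceFilename"]

missing, duplicated = check_labels(requested_columns, available_columns)
print("missing:", missing)
print("duplicated:", duplicated)
```

In this notebook the call would presumably be `check_labels(METADATA + METHODS, df_master_raw.columns)`, run before the filtering step; both returned lists should be empty. The shape reported above, 40 columns, agrees with 21 metadata labels plus 19 method labels.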


# OUTPUTS: Export data to excel

In [12]:
%%time

# Set output directory
outputs_dir = set_outputs_dir(use_default=False, rel_path_output=r"..\data\lcaforsac")  # default `..\data\interim`

## Export dataframe to excel
excelName = "raw-data-chosen-lcia-methods-and-metadata.xlsx"

df_readme = make_readme_info(
    excelName,
    "Sheet1: Raw data with chosen LCIA methods and important metadata "
    "(redundant columns and extra methods were dropped)."
    "\nMETADATA: list of relevant metadata used in Sheet1."    
    "\nMETHODS: list of LCIA methods used in Sheet1."
    "\n[METHODS + METADATA have to be the only column labels in Sheet1]",
)

w_excel(
    path_to_file=outputs_dir,
    filename=excelName,
    dict_data_to_write={
        "Sheet1": df_analysis_prev,
        "METADATA": df_metadata,        
        "METHODS": df_methods,
    },
    readme_info=("readme", df_readme),
    ####         ExcelWriter_kwargs={"engine": "openpyxl", "encoding": "UTF-8"}
    #     startrow=0
)

File: raw-data-chosen-lcia-methods-and-metadata.xlsx successfully created in 
C:\Users\ViteksPC\Documents\00-ETH_projects\17-AESA_ecoinvent_chemicals\data\lcaforsac
Wall time: 1.42 s
