# Project proposal

In the investment fund industry, does green investment generate better performance?

## 1) The problem
* The idea for this project came from the rising interest in ESG funds, which aim to offer investors more sustainable investment vehicles.
* The aim of this project is to analyze whether funds that select environmentally friendly underlying assets generate better financial performance.
* Finding ESG market data was difficult, as this kind of data is mostly issued by market/index providers (MSCI, FTSE) and is not freely available to the public. Fortunately, the platform https://fossilfreefunds.org/ offers environmental ratings for more than 3,000 US stock funds, with more than 10 months of history.

## 2) The data

### 2.a Clear overview of your data
#### 2.a.1 Data origin and gathering
* The raw data can be downloaded as Excel files directly from the webpage https://fossilfreefunds.org/how-it-works
* Two Excel files are published each month. We only use the Fossil Free Funds (FFF) dataset, which contains the detailed information
* The code below downloads these Excel files and merges them into a single CSV file. I will mainly use the "Shareclasses" sheet, as it contains all the metrics for our analysis.

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
import seaborn as sns
from datetime import datetime
import time
import csv
import os
import pickle
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import export_graphviz
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
import graphviz
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor # There is also a KerasClassifier class

!pip install openpyxl
import openpyxl
print("Library versions: pandas", pd.__version__," numpy", np.__version__," seaborn", sns.__version__)

In [None]:
def download_data():
    URL = 'https://fossilfreefunds.org/how-it-works'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Collect the links to the monthly "results" workbooks
    urls = []
    names = []
    for link in soup.find_all('a'):
        FULLURL = link.get('href')
        # Anchors without an href return None, which re.search cannot handle
        if FULLURL and re.search(r'results.*\.xlsx', FULLURL):
            urls.append(FULLURL)
            names.append(os.path.basename(FULLURL))

    for name, url in zip(names, urls):
        print("Download file: "+name)
        # verify=False skips TLS certificate checks; remove it if the site's certificate validates
        r = requests.get(url, verify=False, stream=True)
        r.raw.decode_content = True
        # Save into 'data/', the folder that merge_excel reads from
        with open(os.path.join('data', name), 'wb') as out:
            shutil.copyfileobj(r.raw, out)

def merge_excel():
    files=os.listdir('data')
    files_xls = [f for f in files if f.endswith('.xlsx')]

    frames = []
    for f in files_xls:
        # Skip the workbooks dated May to September 2021
        if not re.match(r".*20210[5-9]+.*", f):
            print('Merging file: '+f)
            frames.append(pd.read_excel('data/'+f, 'Shareclasses',engine='openpyxl'))
    # DataFrame.append was removed in pandas 2.0; concatenate all sheets at once instead
    df = pd.concat(frames) if frames else pd.DataFrame()
    df.to_csv('/kaggle/working/fossilfund_dataset.csv', index=False)
    print('Export to /kaggle/working/fossilfund_dataset.csv is finished')

#No download necessary for this notebook: the Kaggle dataset is used directly
#if not os.path.exists('data'):
#    os.makedirs('data')
#    download_data()
#    merge_excel()
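
The file filter used in `merge_excel` can be checked without downloading anything. The sketch below applies the same regular expression to a few file names; the names are hypothetical, patterned on the workbook name used later in this notebook.

```python
import re

# Hypothetical file names, patterned on the FFF workbook naming scheme
files = [
    "InvestYourValuesshareclassresults20200716.xlsx",
    "InvestYourValuesshareclassresults20210315.xlsx",
    "InvestYourValuesshareclassresults20210615.xlsx",
    "notes.txt",
]

# Same pattern as in merge_excel: skip workbooks dated 2021-05 to 2021-09
skip_pattern = re.compile(r".*20210[5-9]+.*")

kept = [f for f in files if f.endswith(".xlsx") and not skip_pattern.match(f)]
print(kept)
```

Only the Excel workbooks outside the skipped date range remain in `kept`.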

#### 2.a.2 Feature overview
* The code below shows that we have a large number of columns (121) and rows (101407). Since the number of rows is far higher than the number of columns, the risk of overfitting our model should be limited
* 41 columns contain null values, so we will have to find a way to handle them
* We have 101 numerical columns

In [None]:
df = pd.read_csv('/kaggle/input/fossil-free-funds/fossilfund_dataset.csv')

In [None]:
print("Total no. of columns in the dataframe", len(df.columns))
print("No. of columns containing null values", len(df.columns[df.isna().any()]))

print("No. of columns not containing null values", len(df.columns[df.notna().all()]))
print("No. of numerical columns ", len(df.select_dtypes(np.number).columns))

print("Total no. of rows in the dataframe", len(df))

In [None]:
df.info(max_cols=200)

* We can already convert the date columns to a datetime type instead of the generic object type

In [None]:
date_cols=df.filter(regex=" date.*",axis=1).columns
df[date_cols]=df[date_cols].apply(pd.to_datetime, errors='coerce')
df[date_cols]
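
As a minimal illustration of what `errors='coerce'` does, the toy frame below (hypothetical column names and values) contains one entry that cannot be parsed; it becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical columns; one inception date is not a valid date
toy = pd.DataFrame({
    "Inception date": ["2015-03-31", "not available", "2019-12-31"],
    "As-of date": ["2020-06-30", "2020-06-30", "2020-06-30"],
})

date_cols = toy.filter(regex=" date", axis=1).columns
toy[date_cols] = toy[date_cols].apply(pd.to_datetime, errors="coerce")

print(toy.dtypes)
print(toy["Inception date"].isna().sum())
```

The coerced `NaT` values then show up as nulls in the completeness checks.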

* Sample dataset (first 10 rows)

In [None]:
display(HTML(df[0:10].to_html())) 

* With the sample dataset above, we can identify the following groups of features from the column names:
* *Fund information*: general information about the fund
    * Fund profile: Shareclass name
    * Fund profile: Ticker
    * Fund profile: Fund name
    * Fund profile: Asset manager
    * Fund profile: Shareclass type
    * Fund profile: Shareclass inception date
    * Fund profile: Category group
    * Fund profile: Sustainability mandate
    * Fund profile: US-SIF member
    * Fund profile: Oldest shareclass inception date
    * Fund profile: Shareclass tickers
    * Fund profile: Portfolio holdings as-of date
    * Fund profile: Fund net assets
    * Fund profile: Percent rated
* *Financial results*: our target variable is contained in that group
    * Financial performance: Financial performance as-of date 
        * *Important date, keeps track of the date where performance of the fund is calculated*
    * Financial performance: Month end trailing returns, year 1
    * Financial performance: Month end trailing returns, year 3
    * Financial performance: Month end trailing returns, year 5
    * Financial performance: Month end trailing returns, year 10
* *Grading metrics*: overall grading by the FFF organization
    * Fossil Free Funds: Fossil fuel grade
    * Deforestation Free Funds: Deforestation grade
    * Gender Equality Funds: Gender equality grade
    * Gun Free Funds: Civilian firearm grade
    * Weapon Free Funds: Military weapon grade
    * Tobacco Free Funds: Tobacco grade
    * Prison Free Funds: Prison industrial complex grade
* *Detailed breakdown by categories*:
    * Fossil energy:
        * Fossil Free Funds: Fossil fuel holdings, count
        * Fossil Free Funds: Fossil fuel holdings, weight
        * Fossil Free Funds: Fossil fuel holdings, asset
        * Fossil Free Funds: Carbon Underground 200, count
        * Fossil Free Funds: Carbon Underground 200, weight
        * Fossil Free Funds: Carbon Underground 200, asset
        * Fossil Free Funds: Coal industry, count
        * Fossil Free Funds: Coal industry, weight
        * Fossil Free Funds: Coal industry, asset
        * Fossil Free Funds: Oil / gas industry, count
        * Fossil Free Funds: Oil / gas industry, weight
        * Fossil Free Funds: Oil / gas industry, asset
        * Fossil Free Funds: Macroclimate 30 coal-fired utilities, count
        * Fossil Free Funds: Macroclimate 30 coal-fired utilities, weight
        * Fossil Free Funds: Macroclimate 30 coal-fired utilities, asset
        * Fossil Free Funds: Fossil-fired utilities, count
        * Fossil Free Funds: Fossil-fired utilities, weight
        * Fossil Free Funds: Fossil-fired utilities, asset
        * Fossil Free Funds: Relative carbon footprint (tonnes CO2 / $1M USD invested)
        * Fossil Free Funds: Relative carbon intensity (tonnes CO2 / $1M USD revenue)
        * Fossil Free Funds: Total financed emissions scope 1 + 2 (tCO2e)
        * Fossil Free Funds: Total financed emissions scope 1 + 2 + 3 (tCO2e)
        * Fossil Free Funds: Carbon footprint portfolio coverage by market value weight
        * Fossil Free Funds: Carbon footprint portfolio coverage by number of disclosing titles
        * Fossil Free Funds: Clean200, count
        * Fossil Free Funds: Clean200, weight
        * Fossil Free Funds: Clean200, asset
    * Deforestation:
        * Deforestation Free Funds: Deforestation-risk producer, count
        * Deforestation Free Funds: Deforestation-risk producer, weight
        * Deforestation Free Funds: Deforestation-risk producer, asset
        * Deforestation Free Funds: Deforestation-risk financier, count
        * Deforestation Free Funds: Deforestation-risk financier, weight
        * Deforestation Free Funds: Deforestation-risk financier, asset
        * Deforestation Free Funds: Deforestation-risk consumer brand, count
        * Deforestation Free Funds: Deforestation-risk consumer brand, weight
        * Deforestation Free Funds: Deforestation-risk consumer brand, asset    
    * Gender Equality:
        * Gender Equality Funds: Gender equality group ranking
        * Gender Equality Funds: Gender equality score (out of 100 points)
        * Gender Equality Funds: Gender equality score, gender balance (out of 100 points)
        * Gender Equality Funds: Gender equality score, gender policies (out of 100 points)
        * Gender Equality Funds: Count of holdings with Equileap gender equality scores
        * Gender Equality Funds: Weight of holdings with Equileap gender equality scores
        * Gender Equality Funds: Gender equality score - Overall score (out of 100 points)
        * Gender Equality Funds: Gender equality score - Gender balance in leadership and workforce (out of 40 points)
        * Gender Equality Funds: Gender equality score - Equal compensation and work life balance (out of 30 points)
        * Gender Equality Funds: Gender equality score - Policies promoting gender equality (out of 20 points)
        * Gender Equality Funds: Gender equality score - Commitment, transparency, and accountability (out of 10 points)            
    * Gun:
        * Gun Free Funds: Civilian firearm, count
        * Gun Free Funds: Civilian firearm, weight
        * Gun Free Funds: Civilian firearm, asset
        * Gun Free Funds: Gun manufacturer, count
        * Gun Free Funds: Gun manufacturer, weight
        * Gun Free Funds: Gun manufacturer, asset
        * Gun Free Funds: Gun retailer, count
        * Gun Free Funds: Gun retailer, weight
        * Gun Free Funds: Gun retailer, asset
    * Weapon:            
        * Weapon Free Funds: Military weapon, count
        * Weapon Free Funds: Military weapon, weight
        * Weapon Free Funds: Military weapon, asset
        * Weapon Free Funds: Major military contractors, count
        * Weapon Free Funds: Major military contractors, weight
        * Weapon Free Funds: Major military contractors, asset
        * Weapon Free Funds: Nuclear weapons, count
        * Weapon Free Funds: Nuclear weapons, weight
        * Weapon Free Funds: Nuclear weapons, asset
        * Weapon Free Funds: Cluster munitions / landmines, count
        * Weapon Free Funds: Cluster munitions / landmines, weight
        * Weapon Free Funds: Cluster munitions / landmines, asset
    * Tobacco:            
        * Tobacco Free Funds: Tobacco producer, count
        * Tobacco Free Funds: Tobacco producer, weight
        * Tobacco Free Funds: Tobacco producer, asset
        * Tobacco Free Funds: Tobacco-promoting entertainment company, count
        * Tobacco Free Funds: Tobacco-promoting entertainment company, weight
        * Tobacco Free Funds: Tobacco-promoting entertainment company, asset
    * Prison:            
        * Prison Free Funds: All flagged, count
        * Prison Free Funds: All flagged, weight
        * Prison Free Funds: All flagged, asset
        * Prison Free Funds: Prison industry, count
        * Prison Free Funds: Prison industry, weight
        * Prison Free Funds: Prison industry, asset
        * Prison Free Funds: Border industry, count
        * Prison Free Funds: Border industry, weight
        * Prison Free Funds: Border industry, asset
        * Prison Free Funds: All flagged, higher risk, count
        * Prison Free Funds: All flagged, higher risk, weight
        * Prison Free Funds: All flagged, higher risk, asset
        * Prison Free Funds: Prison industry, higher risk, count
        * Prison Free Funds: Prison industry, higher risk, weight
        * Prison Free Funds: Prison industry, higher risk, asset
        * Prison Free Funds: Border industry, higher risk, count
        * Prison Free Funds: Border industry, higher risk, weight
        * Prison Free Funds: Border industry, higher risk, asset
        * Prison Free Funds: Private prison operators, count
        * Prison Free Funds: Private prison operators, weight
        * Prison Free Funds: Private prison operators, asset

* So we have a wide range of information for each fund classification category, from overall category grades to the detailed breakdown of each category
* This classification is kept in an Excel spreadsheet (replace.xlsx)
* The information in this spreadsheet will also be used later for column renaming and feature grouping (continuous, discrete, nominal, ordinal variables)

More information about each feature is available in the "Key" sheet of the Excel files provided by FFF:

In [None]:
key_df=pd.read_excel('/kaggle/input/fossil-free-funds/raw/InvestYourValuesshareclassresults20200716.xlsx', 'Key',  engine='openpyxl', header=None, names=['Category', 'Feature', 'Description'])
s = key_df.style.set_properties(**{'text-align': 'left'})
display(HTML(s.render()))
#display(HTML(key_df.to_html(index=False))) 

#### 2.a.3 Feature distribution
* The numerical features distribution per category is shown below 
* We have features with empty data, notably on categories *Gender Equality Funds* and *Prison Free Funds*
* In several places, a global feature (e.g. *Fossil fuel holdings*) is decomposed into 3 sub-features: count, weight, asset
    * These 3 sub-features should normally be highly correlated with each other

In [None]:
#create variables for each group and display information separately
#Our target value has to be 1-y as it is the most complete => need to show fill factor
#display(HTML(    .to_html())) 
origin_categories=["Fund profile","Fossil Free Funds", "Deforestation Free Funds", "Gender Equality Funds", "Gun Free Funds", "Prison Free Funds", "Weapon Free Funds", "Tobacco Free Funds", "Financial performance"]
for category in origin_categories:
    print("Distribution of numerical features for category: "+category)
    display(HTML(df.filter(regex=category+".*",axis=1).describe().to_html())) 
    print("\n\n")

#### 2.a.4 Data completeness
* It is important to check the data completeness (fill-factor) of each feature
* We see that the *Prison free* category lacks information, as do some *Gender equality* indicators

In [None]:
columns_stats=pd.DataFrame()
columns_stats['fill_percent']=df.notnull().sum(axis=0)/len(df)*100
fig = plt.figure(figsize=(10, 22))
columns_stats['fill_percent'].sort_values().plot.barh()
plt.title("Fill-factor for all features")
plt.xlabel("Percentage")
plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()

#### 2.a.4.1 Remove non-relevant columns
* We drop the columns with less than 60% data completeness, mostly related to the Gender equality category

In [None]:
columns_stats[columns_stats['fill_percent']<60]

In [None]:
df.drop(columns=columns_stats[columns_stats['fill_percent']<60].index.values.tolist(), axis=1, inplace=True)
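
The fill-factor rule can be sketched on a toy frame. The column names and values below are hypothetical; only the 60% threshold comes from the text above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mostly_filled": [1.0, np.nan, 3.0, 4.0, 5.0],   # 80% filled
    "sparse": [np.nan, np.nan, np.nan, 4.0, 5.0],     # 40% filled
})

# Share of non-null values per column, in percent
fill_percent = toy.notnull().sum(axis=0) / len(toy) * 100

# Columns below the 60% threshold are dropped
to_drop = fill_percent[fill_percent < 60].index.tolist()
toy_reduced = toy.drop(columns=to_drop)

print(fill_percent)
print(toy_reduced.columns.tolist())
```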

#### 2.a.5 Target selection
* Our target column will be part of the group "Financial performance".
    * Based on the table below, we select the target (among year 1, 3, 5, 10) with the highest fill-factor, that is "year 1".

In [None]:
columns_stats['fill_percent'].filter(regex="Financial performance.*")

The "Financial performance: Month end trailing returns, year 1" variable is the most complete, as a high percentage of funds were launched more than 2 years ago (so the "year 1" performance is available) but less than 10 years ago (so no "year 10" performance is available).

In [None]:
perf_date=pd.DataFrame()
for year in [1,3,5,10]:
    filter_df=df['Fund profile: Shareclass inception date'][( (df['Financial performance: Financial performance as-of date'] - pd.DateOffset(years=year)) > df['Fund profile: Shareclass inception date']  )]
    perf_date["year"+str(year)] =filter_df.groupby(filter_df.dt.year).count()

#Normalize row perf_data by year
perf_date_norm=perf_date.div(perf_date.sum(axis=1), axis=0)*100
#The index holds integer inception years (from dt.year), so we slice with integers
perf_date_norm.loc[2000:2022].plot.bar(figsize=(16, 8), stacked=True)

#Plot properties
plt.xlabel("Inception fund dates")
plt.ylabel("Percentage")
plt.title("Fund inception date crossed with (performance as-of date - year)")
plt.legend(loc='lower center', bbox_to_anchor=(0.5, -0.22),
          ncol=4, fancybox=True, shadow=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
#plt.setp(plt.gca().get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()
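
The reasoning behind this plot can be sketched on a few hypothetical funds: an N-year trailing return can only exist if the fund was launched before (as-of date - N years).

```python
import pandas as pd

# Hypothetical inception dates, all observed at the same as-of date
funds = pd.DataFrame({
    "inception": pd.to_datetime(["2019-01-15", "2016-06-30", "2005-03-01"]),
    "as_of": pd.to_datetime(["2020-12-31"] * 3),
})

for years in [1, 3, 5, 10]:
    # The fund must already exist at (as-of date - N years)
    cutoff = funds["as_of"] - pd.DateOffset(years=years)
    funds[f"has_year{years}"] = cutoff > funds["inception"]

available = funds.filter(regex="has_year").sum()
print(available)
```

The count of funds with available data shrinks as the horizon grows, which is why "year 1" has the highest fill-factor.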

#### 2.a.6 Data uniqueness
* Performance data is the only type of information showing strong uniqueness

In [None]:
columns_stats=pd.DataFrame()
columns_stats['unique_values']=df.nunique()/len(df)*100
fig = plt.figure(figsize=(10, 22))
columns_stats['unique_values'].sort_values().plot.barh()
plt.title("Data uniqueness for all features")
plt.xlabel("Percentage")
plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()

#### 2.a.7 Columns renaming
* Column names are renamed to make plots more readable by using the Excel spreadsheet previously defined

In [None]:
cols_df=pd.read_excel('/kaggle/input/fossil-free-funds/misc/replace.xlsx', 'Cols',  engine='openpyxl')
df.rename(columns=dict(zip(cols_df["Column original"],cols_df["Short column name"])), inplace=True)

In [None]:
def getColCategory(category):
    return list(set(cols_df[cols_df['Category']==category]['Short column name']) & set(df.columns))

def getColType(type_col):
    return list(set(cols_df[cols_df['Type']==type_col]['Short column name']) & set(df.columns))    

def getEncoding(shortName):
    return cols_df[cols_df['Short column name']==shortName]['encoding'].values[0]
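
A minimal sketch of how such a mapping sheet drives both the renaming and the lookups. The rows and short names below are hypothetical stand-ins for the content of replace.xlsx; only the sheet column headers come from the code above.

```python
import pandas as pd

# Hypothetical excerpt of the mapping sheet; the real rows live in replace.xlsx
cols_map = pd.DataFrame({
    "Column original": ["Fund profile: Fund net assets",
                        "Fossil Free Funds: Fossil fuel grade"],
    "Short column name": ["FI_FundNetAssets", "F_FossilFuelGrade"],
    "Category": ["Fund information", "Fossil energy"],
    "Type": ["Continuous", "Ordinal"],
})

data = pd.DataFrame({
    "Fund profile: Fund net assets": [1.2e9, 3.4e8],
    "Fossil Free Funds: Fossil fuel grade": ["A", "C"],
})
data = data.rename(columns=dict(zip(cols_map["Column original"],
                                    cols_map["Short column name"])))

def cols_of_type(type_col):
    """Short names of the given type that are still present in the frame."""
    wanted = set(cols_map.loc[cols_map["Type"] == type_col, "Short column name"])
    return [c for c in data.columns if c in wanted]

print(cols_of_type("Continuous"))
```

Unlike a set intersection, the list comprehension keeps the frame's column order, which makes plots and printed output reproducible from run to run.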

In [None]:
continuous= getColType('Continuous')
discrete= getColType('Discrete')
ordinal= getColType('Ordinal') 
nominal= getColType('Nominal') 

date_cols=df.filter(regex=".*Date.*",axis=1).columns

#### 2.a.8 0-Values  
* It is important to keep track of features with a high percentage of 0-values, so we can add this information later in our analysis
* We observe that the features with a share of 0-values above our threshold are among the sub-features (count, weight, asset)

In [None]:
threshold_0_level=10

def zeros_columns(df, col_category):
    zeros_percentage=(df[col_category]==0).sum()*100/len(df[col_category])
    zeros=zeros_percentage[(zeros_percentage>threshold_0_level)].index.values.tolist()
    print("Columns with 0-values > "+str(threshold_0_level)+"% : "+str(len(zeros))+"/"+str(len(col_category)))
    print(zeros_percentage[(zeros_percentage>threshold_0_level)].sort_values(ascending=False))
    return zeros
continuous_zeros = zeros_columns(df, continuous)
discrete_zeros = zeros_columns(df, discrete)

#### 2.a.9 Null values
* There are many null values in the nominal and ordinal data
* For numerical data, we will replace n/a with the median value in the feature encoding section
* For categorical data, n/a values will be replaced with a NoMapping value

In [None]:
threshold_null_level=2

def null_columns(df, col_category):
    null_percentage=(df[col_category].isnull()).sum()*100/len(df[col_category])
    nulls=null_percentage[(null_percentage>threshold_null_level)].index.values.tolist()
    print("Columns with null-values > "+str(threshold_null_level)+"% : "+str(len(nulls))+"/"+str(len(col_category)))
    print(null_percentage[(null_percentage>threshold_null_level)].sort_values(ascending=False))
    return nulls
    
discrete_null = null_columns(df, discrete)
continuous_null = null_columns(df, continuous)
nominal_null = null_columns(df, nominal)
ordinal_null = null_columns(df, ordinal)

#### 2.a.10 Duplicate values check
* We first check whether there are fully duplicated rows => there are none

In [None]:
df.duplicated().sum()

* To catch duplicate values, as there is no unique id per row, we use a multi-index built from the combination 'FI_ShareclassName', 'FP_PerformanceAs-OfDate'
* We pick FI_ShareclassName for this combination because it has the most unique values among the nominal columns

In [None]:
df[nominal].nunique()/len(df)*100

* We first check how many rows are duplicated on this combination of 2 features: 9254 rows

In [None]:
index_col=['FI_ShareclassName', 'FP_PerformanceAs-OfDate']
duplicate_rows=df[df.duplicated(subset=index_col, keep=False)]
duplicate_rows.to_csv('/kaggle/working/temp.csv', sep=';')
duplicate_rows

After looking at the raw data, it appears that the same data has been published twice, in 2 datasets:
* Invest+Your+Values+shareclass+results+20200913.xlsx
* Invest+Your+Values+shareclass+results+20200928.xlsx

* We decide to keep only the last occurrence of each group of duplicated rows, based on 'FI_ShareclassName', 'FP_PerformanceAs-OfDate'

In [None]:
origin_len=len(df)
duplicate_len=len(df[df.duplicated(subset=index_col, keep='last')])
df=df[~df.duplicated(subset=index_col, keep='last')]

* We can verify that no duplicate rows remain and see how many rows have been removed

In [None]:
print("Remaining rows:",len(df),"(",origin_len,"-",duplicate_len,")")
df[df.duplicated(subset=index_col, keep=False)]
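
The difference between `keep=False` and `keep='last'` is easy to mix up, so here is a small sketch on hypothetical rows: the first counts every row involved in a duplicate group, the second only the rows that will actually be removed.

```python
import pandas as pd

# Hypothetical share classes published in two overlapping files
toy = pd.DataFrame({
    "share": ["A", "A", "B", "C", "C"],
    "as_of": ["2020-08-31"] * 5,
    "source_file": ["20200913", "20200928", "20200913", "20200913", "20200928"],
})
key = ["share", "as_of"]

n_involved = toy.duplicated(subset=key, keep=False).sum()   # every row that has a twin
n_removed = toy.duplicated(subset=key, keep="last").sum()   # all but the last of each group
deduped = toy[~toy.duplicated(subset=key, keep="last")]

print(n_involved, n_removed, len(deduped))
print(deduped["source_file"].tolist())
```

With pairs of duplicates, the count of removed rows is half the count of rows involved.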

#### 2.a.11 Missing data
With the plot below, we want to verify if there is a data gap:
 * We can identify a data gap for November 2020 
 * The number and allocation of asset managers per date is quite constant

In [None]:
df_tmp=df.copy()
df_tmp['FI_AssetManagerFirstLetter']=df_tmp['FI_AssetManager'].str[0]
grouped_df=df_tmp.groupby(df_tmp['FP_PerformanceAs-OfDate'])
grouped_df['FI_AssetManagerFirstLetter'].value_counts()

In [None]:
grouped_df['FI_AssetManagerFirstLetter'].value_counts().groupby(level=0).apply(
    lambda x: x 
).unstack().plot.bar(figsize=(16, 8), stacked=True)

#Plot properties
plt.ylabel("Number of funds")
plt.xlabel("FP_PerformanceAs-OfDate")
plt.title("Number of funds per date, by asset manager first letter")
plt.get_cmap('gist_rainbow')
plt.legend(loc='lower center', bbox_to_anchor=(0.5, -0.35),
          ncol=10, fancybox=True, shadow=True)
plt.setp(plt.gca().get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

#### We observe which funds (FI_ShareclassName) have partial information
* We decide to remove the funds that have 60% or less of the observations of the most complete fund

In [None]:
per_share_max_count = df.groupby(['FI_ShareclassName'])['FI_ShareclassName'].value_counts().max()
threshold_max_count=0.6

partial_share_missing=df.copy()
partial_share_missing=partial_share_missing.groupby(['FI_ShareclassName']).filter(lambda x: len(x) <= threshold_max_count * per_share_max_count)
partial_share_missing.groupby(['FI_ShareclassName'])['FI_ShareclassName'].value_counts().sort_values(ascending=False)

In [None]:
origin_len=len(df)
df.drop(partial_share_missing.index, inplace=True)
print("Remaining rows:",len(df),"(",origin_len,"-",len(partial_share_missing),")")
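
The group-level filter can be sketched on toy data (hypothetical share classes with 10, 7 and 4 monthly observations); the 0.6 threshold is the one used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "share": ["A"] * 10 + ["B"] * 7 + ["C"] * 4,
    "month": list(range(10)) + list(range(7)) + list(range(4)),
})

max_count = toy["share"].value_counts().max()   # observations of the most complete share class
threshold = 0.6

# Keep the groups at or below 60% of the maximum, then drop them from the data
partial = toy.groupby("share").filter(lambda g: len(g) <= threshold * max_count)
complete = toy.drop(partial.index)

print(sorted(partial["share"].unique()), sorted(complete["share"].unique()))
```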

#### 2.a.11.1 Missing date
* Replace missing performance date with the performance date of the previous row

In [None]:
for date_col in df.filter(regex="Date.*",axis=1).columns:
    print('Number of empty dates for columns',date_col,":",len(df[df[date_col].isnull()]))
df[df['FP_PerformanceAs-OfDate'].isnull()]

In [None]:
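missing_dates=df[df['FP_PerformanceAs-OfDate'].isnull()]
#Positions of the rows just before each missing date (index labels are no longer contiguous after the drops above)
prev_pos=df.index.get_indexer(missing_dates.index)-1
df.iloc[prev_pos][['FP_PerformanceAs-OfDate','FI_PortfolioHoldingsAs-OfDate']]

In [None]:
#Assign the raw values: assigning a Series would be re-aligned on the index labels and write NaT
df.loc[missing_dates.index,'FP_PerformanceAs-OfDate']=df.iloc[prev_pos]['FP_PerformanceAs-OfDate'].values
df[df['FP_PerformanceAs-OfDate'].isnull()]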
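
Copying a value from the previous row has one pitfall worth illustrating: when a Series is assigned through `.loc`, pandas aligns it on index labels, not on position. The toy example below (hypothetical data) contrasts a label-aligned assignment with a positional one and with `ffill`.

```python
import pandas as pd

toy = pd.DataFrame({"as_of": pd.to_datetime(["2020-07-31", None, "2020-08-31", None])})
missing = toy.index[toy["as_of"].isna()]        # labels 1 and 3

# Label-aligned: the right-hand side carries labels 0 and 2, so nothing lands on 1 and 3
aligned = toy.copy()
aligned.loc[missing, "as_of"] = toy.loc[missing - 1, "as_of"]

# Positional: .values strips the labels, so the previous rows are copied as intended
positional = toy.copy()
positional.loc[missing, "as_of"] = toy.loc[missing - 1, "as_of"].values

# A forward fill gives the same result in one call
filled = toy["as_of"].ffill()

print(aligned["as_of"].isna().sum(), positional["as_of"].isna().sum(), filled.isna().sum())
```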

#### 2.a.12 Categorical analysis
* We show below the unique values for all the categorical columns

In [None]:
def checkuniquevalues(df, cols):
    #Check unique values
    for col in cols:
        print(col,": Total unique:",len(df[col].sort_values().unique())," - Values:",df[col].sort_values().unique())
checkuniquevalues(df, ordinal+nominal)
    #for col in cols:
    #    values=""
    #    for elem in df[col].sort_values().unique():
    #        values=values+"'"+str(elem)+"':'"+str(elem)+"', "

* We decide to drop some columns, as they do not carry relevant information
 * FI_FundName and FI_ShareclassName are not highly correlated with our target
 * FI_Ticker and FI_ShareclassTickers carry the same information as FI_FundName and FI_ShareclassName
 * FI_AssetManager: it might have been interesting to see whether some asset management companies are linked to better performance, but an EDA with one-hot encoding of this feature did not show a strong correlation with our target.

In [None]:
df.drop(columns=['FI_Ticker','FI_ShareclassTickers','FI_ShareclassName', 'FI_FundName', 'FI_AssetManager'], axis=1, inplace=True)
nominal.remove('FI_Ticker')
nominal.remove('FI_ShareclassTickers')
nominal.remove('FI_ShareclassName')
nominal.remove('FI_FundName')
nominal.remove('FI_AssetManager')

In [None]:
#df.to_csv('/kaggle/input/fossil-free-funds/prep/fossilfund_dataset_clean.csv', index=False)

### 2.b Process the data
#### 2.b.1 Histogram for numeric data
* There is a mix of continuous, discrete values  
* Lots of 0-values are present that we need to address
* We observe different types of skewness


In [None]:
categories=cols_df['Category'].unique()   

for category in categories:
    #index_cols=list(set(cols_df[cols_df['Category']==category]['Short column name']) & set(df.columns))
    index_cols=getColCategory(category)
    length=len(df[index_cols].select_dtypes(exclude=object).columns)
    print("Histogram for "+category+" features")
    df[index_cols].hist(color='g', bins=50, grid=False, figsize=(length*2,length))
    plt.tight_layout()
    plt.show()

#### 2.b.2 Continuous features encoding exploration
* We try different types of features encoding here
* Based on our observations, we see that 
    * *weight* sub-features require a square-root transformation, $x^{1/2}$
    * *asset* sub-features require $\log(x+1)$ transformation
    * *Financial performance* features do not require transformation as they natively display a bell curve shape
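
To make the comparison concrete, one can measure the skewness of a feature before and after each candidate transformation. The sketch below uses a synthetic log-normal sample as a stand-in for an *asset* feature; it is an assumption for illustration, not the FFF data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for an "asset" feature
asset = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5000))

candidates = {
    "raw": asset,
    "x**0.5": asset ** 0.5,
    "log1p(x)": np.log1p(asset),
}
# Skewness close to 0 indicates a roughly symmetric, bell-like shape
skewness = {name: values.skew() for name, values in candidates.items()}
print(skewness)
```

On such heavy-tailed data the square root only reduces the skew, while `log1p` brings it close to zero; a mildly skewed feature may need only the square root.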

In [None]:
# (C) Preprocessing function
def df_wo_zeros_null(df):
    df = df.copy()

    # Continuous            
    # Add additional column for holding 0 values
    # Filter-out zero values
    for c in list(continuous_zeros)+list(discrete_zeros):
        name = c + "_isempty"
        idx= df[c]==0
        df[name] = idx
        #Convert bool col as int
        df[name] = df[name].astype(int)
        df[c] = df[~idx][c]

    # Missing values are left as NaN: DataFrame.hist skips them, and a
    # per-column dropna(inplace=True) cannot shorten a single column of the frame

    return df
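
The zero handling in this function can be sketched on a single hypothetical column: the zeros are remembered in a 0/1 flag column, then masked so that they no longer dominate the histogram.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"coal_weight": [0.0, 0.02, 0.0, 0.05]})  # hypothetical column

is_zero = toy["coal_weight"] == 0
toy["coal_weight_isempty"] = is_zero.astype(int)       # keep the information as a 0/1 flag
toy["coal_weight"] = toy["coal_weight"].mask(is_zero)  # zeros become NaN

print(toy)
```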

In [None]:
from sklearn.preprocessing import QuantileTransformer

#continuous+discrete
cols= list(set(continuous) & set(df.select_dtypes(np.number).columns))
temp_df=df_wo_zeros_null(df)
for category in categories:
    #index_cols=list(set(cols_df[cols_df['Category']==category]['Short column name']) & set(df.columns))
    index_cols=list( set(getColCategory(category)) & set(cols))
    length=len(index_cols)
    for d in [1, 0.5, 2, 3]:
        test_df=temp_df[index_cols].copy()
        for c in test_df.columns:
            name = '{}**{}'.format(c, d)
            test_df[name]=test_df[c]**d
            test_df.drop(c, axis=1, inplace=True)
        test_df.hist(color='g', bins=30, grid=False, figsize=((length*1.2)+10,length+3))
        #test_df[test_df>0].hist(color='g', bins=30, grid=False, figsize=(length*2,length))

    #quantile = QuantileTransformer(output_distribution='normal')
    #test_df=df[index_cols]
    #test_df[index_cols]=quantile.fit_transform(test_df[index_cols]) 
    #test_df.hist(color='g', bins=30, grid=False, figsize=((length*1.2)+10,length+3))

    #log1p
    if(category != 'Gender Equality'):
        test_df=np.log1p(df[index_cols].copy())
        test_df.columns = [str(col) + '_log1p' for col in test_df.columns]
        test_df.hist(color='g', bins=30, grid=False, figsize=((length*1.2)+10,length+3))
    plt.tight_layout()
    plt.show()   

* Special feature distribution handling

In [None]:
from scipy.stats import boxcox
special_distrib=['FI_PercentRated','F_CarbonMarketValueWeight','GE_WeightOfHoldings']
length=len(special_distrib)
test_df=df[special_distrib].copy()
for feature in special_distrib:
    name = '(1-{})**0.5'.format(feature)
    test_df[name]=(1-test_df[feature])**0.5
    
    #test_df[name]=boxcox(test_df[feature], 0.3)
    test_df.drop(feature, axis=1, inplace=True)
test_df.hist(color='g', bins=30, grid=False, figsize=((length*1.2)+10,length+3))    
plt.tight_layout()
plt.show() 

#### 2.b.2.1 Best encoding assessment
* We lightly pre-process the features here (0-values are filtered out; null values are replaced by the median in the next section) and encode them according to our findings
* We can verify that the numerical features are now closer to a Gaussian distribution

In [None]:
cols= list(set(continuous) & set(df.select_dtypes(np.number).columns))
preprocess_df=df_wo_zeros_null(df)
continuous_log1p=[]
continuous_exp05=[]
continuous_exp1_05=[]

for category in categories:
    #index_cols=list(set(cols_df[cols_df['Category']==category]['Short column name']) & set(df.columns))
    index_cols=list( set(getColCategory(category)) & set(cols))
    length=len(index_cols)
    test_df=preprocess_df[index_cols].copy()
    for c in test_df.columns:
        encoding=getEncoding(c)
        if(encoding == "log1p"):
            name = 'log1p({})'.format(c)
            test_df[name]=np.log1p(test_df[c])
            test_df.drop(c, axis=1, inplace=True)
            continuous_log1p.append(c)
        elif (encoding == "^0.5"):
            name = '{}**{}'.format(c, '0.5')
            test_df[name]=test_df[c]**0.5
            test_df.drop(c, axis=1, inplace=True)
            continuous_exp05.append(c)
        elif (encoding == "(1-x)^0.5"):
            name = '(1-{})^0.5'.format(c)
            test_df[name]=(1-test_df[c])**0.5
            test_df.drop(c, axis=1, inplace=True)
            continuous_exp1_05.append(c)            
    
    test_df.hist(color='g', bins=30, grid=False, figsize=((length*1.2)+10,length+3))
    plt.tight_layout()
    plt.show()   
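
The `if/elif` chain above can also be expressed as a lookup table from encoding label to transformation. This is only a sketch of an alternative layout, assuming the three labels used in the mapping sheet; `ENCODERS` and `encode` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Encoding labels as stored in the mapping sheet, each paired with its transformation
ENCODERS = {
    "log1p": np.log1p,
    "^0.5": lambda x: x ** 0.5,
    "(1-x)^0.5": lambda x: (1 - x) ** 0.5,
}

def encode(series, encoding):
    """Apply the named encoding; unknown labels leave the data unchanged."""
    return ENCODERS.get(encoding, lambda x: x)(series)

weights = pd.Series([0.0, 0.25, 1.0])
print(encode(weights, "^0.5").tolist())
print(encode(weights, "(1-x)^0.5").tolist())
```

Adding a new encoding then only requires a new dictionary entry.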

#### 2.b.3 Pre-processing numerical features

In [None]:
outliers_threshold=3

# (C) Preprocessing function
def preprocess_numerical(df):
    df = df.copy()

    # Continuous            
    # Add additional column for holding 0 values
    # Filter-out zero values
    for c in list(continuous_zeros)+list(discrete_zeros):
        name = c + "_isempty"
        idx= df[c]==0
        df[name] = idx
        #Convert bool col as int
        df[name] = df[name].astype(int)
        df[c] = df[~idx][c]


    # Apply feature encoding
    df[continuous_log1p]=np.log1p(df[continuous_log1p])
    df[continuous_exp05]=df[continuous_exp05]**0.5
    df[continuous_exp1_05]=(1-df[continuous_exp1_05])**0.5
  
    # Apply z-scores for data with outliers
    for c in continuous:
        z_scores = (df[c] - df[c].mean()) / df[c].std()
        idx = (np.abs(z_scores) > outliers_threshold)
        df[c] = df[~idx][c]

    # Fill missing values
    for c in list(set(continuous + discrete) & set(df.select_dtypes(np.number).columns)):
        df[c].fillna(df[c].median(), inplace=True)

    #Replace dates
    for date_col in date_cols:
        df[date_col]=pd.to_numeric(df[date_col].apply(pd.to_datetime, errors='coerce'))

    return df

preprocess_df=preprocess_numerical(df)

* Plot the features to visually validate the effectiveness of our preprocessing method

In [None]:
cols= list(set(continuous) & set(df.select_dtypes(np.number).columns))

for category in categories:
    index_cols=list( set(getColCategory(category)) & set(cols))
    length=len(index_cols)
    print("Histogram for "+category+" features")
    preprocess_df[index_cols].hist(color='g', bins=50, grid=False, figsize=(length*2,length))
    plt.tight_layout()
    plt.show()

* Verify that we have no null values

In [None]:
print("We had", len(df.columns), "features before preprocessing numerical")
print("After processing, we have a total of", len(preprocess_df.columns), "features")
print("Number of null values in the dataframe:", preprocess_df[discrete+continuous].isnull().sum().sum())

In [None]:
for c in preprocess_df.columns:
    print (c)

#### 2.b.5 Pre-processing categorical
* One-hot encode categorical features, except for grade where ordinal encoding is used
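
As a minimal illustration of the two encodings, here is a sketch on a toy frame. The column names `grade` and `region` and their values are assumptions made for the example; only the grade-to-integer mapping mirrors the one used in this section.

```python
import numpy as np
import pandas as pd

# Toy frame: column names and values are illustrative assumptions
toy = pd.DataFrame({
    "grade": ["A", np.nan, "F", "C"],
    "region": ["EU", "US", np.nan, "EU"],
})

# Ordinal encoding: the order A > B > ... > F carries information, missing -> 0
grade_order = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}
toy["grade"] = toy["grade"].map(grade_order).fillna(0).astype(int)

# One-hot encoding: one indicator column per category, no implied order
encoded = pd.get_dummies(toy, columns=["region"], dummy_na=False)
print(encoded)
```

With `dummy_na=False`, a row whose category is missing simply gets 0 in every indicator column.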

In [None]:
def ordinal_mapping(df, cols, dictionary):
    for col in cols:
        df[col]=df[col].map(dictionary)
    return df

#grade_cols=df.filter(regex="Grade",axis=1).columns
#checkuniquevalues(df, grade_cols)

In [None]:
def preprocessing_categorical(df):
    df = df.copy()
    # convert ordinal to int columns
    ordinal_mapping(df, ordinal, {'A':6, 'B':5, 'C':4, 'D':3, 'E':2, 'F':1, np.nan:0})

    # One-hot encoding
    df = pd.get_dummies(df, columns=nominal, dummy_na=False)

    return df

preprocess_df=preprocessing_categorical(preprocess_df)

* Perform some checks after the processing: we observe a large increase in the number of columns (one new column per category value created by the one-hot encoding)

In [None]:
print("After processing, we have a total of", len(preprocess_df.columns), "features")

In [None]:
def compareAfter_preprocess(df_origin, df_new):
    number_previous_col=0
    for col in df_new.columns:
        if(col in df_origin.columns ):
            if(not col in continuous):
                number_previous_col=number_previous_col+1
                if(len( list(set(df_new[col].unique()) -set(df_origin[col].unique()) )) >0):
                    print(col,"- difference of value mapping :",list(set(df_new[col].unique()) -set(df_origin[col].unique())))
        else:
            print(col,"- new column")    

compareAfter_preprocess(df,preprocess_df)

#### 2.b.4 Remove highly correlated features
* The goal of this step is to identify and remove highly correlated features (as we introduced many new features)
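
A compact alternative way to express this step is sketched below, under stated assumptions: it works on the absolute correlation matrix and inspects only its strict upper triangle. The helper `drop_correlated` and the toy columns are hypothetical. Unlike the loop used in this notebook, this variant also treats strongly negative correlations as redundant, and it drops a column even when its correlated partner has itself been dropped.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.95):
    """Drop every column whose absolute correlation with an earlier column
    reaches the threshold (hypothetical helper, upper-triangle approach)."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop


toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]})
toy["b"] = 2 * toy["a"] + 1        # perfectly correlated with a
toy["c"] = [2, 1, 4, 3, 6, 5]      # correlated with a, but below 0.95
toy["d"] = -toy["a"]               # perfectly anti-correlated with a

reduced, dropped = drop_correlated(toy)
print(dropped)
```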

In [None]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_correlations(df, n=10):
    au_corr = df.corr().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

In [None]:
correlation_threshold=0.95

for category in ['FI','FP','F','D','GE','G','W','T','P']:
    filter=preprocess_df.filter(regex=category+"_.*",axis=1).select_dtypes(np.number)
    if(len(filter.columns)>0):
        print("Top Absolute Correlations")
        tmp_corr=get_top_correlations(filter, 40)
        print(tmp_corr[tmp_corr >= correlation_threshold].sort_index())

In [None]:
def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    corr_results=[]

    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset
                    corr_results.append({
                        'Deleted column': colname,
                        'Correlation column': corr_matrix.columns[j],
                        'Correlation': corr_matrix.iloc[i, j]
                    })

    return pd.DataFrame(corr_results)

corr_results=correlation(preprocess_df, correlation_threshold)
pd.set_option('display.max_rows', None)
corr_results

In [None]:
print("Remaining features",len(preprocess_df.columns))

In [None]:
#preprocess_df.to_csv('/kaggle/input/fossil-free-funds/prep/fossilfund_dataset_prep.csv', index=False)

### 2.b Plan to manage and process the data

#### Data cleaning and data manipulation
- [x] A row duplicate cleaning process needs to be performed
- [x] Features with a nearly empty fill factor will be removed ('Fund profile: US-SIF member') as well as non-relevant/redundant categorical columns ('Fund profile: Ticker','Fund profile: Shareclass tickers')
- [x] Date columns need to be checked for consistency (within our analysis range) and integrity (no missing dates)
- [x] For each set of 'asset, count, weight' features, we need to ensure that the three variables are strongly correlated and consistent
    * If one of these variables equals zero (count, weight or asset = 0), the other two should also be 0

#### Feature engineering
- [x] Grades features will need to be encoded => ordinal encoding
* Pre-processing for numerical data:
    - [x] Null values will be replaced with median values
    - [x] Features with high number of 0-Values will be identified (new column _isempty will be added) 
    - [x] Outliers will also need to be removed (e.g. by discarding values that lie more than a fixed number of standard deviations from the mean, assuming a Gaussian distribution; the preprocessing code uses a threshold of 3)
* Pre-processing for continuous data: a transformation (log, power) might be required to reduce the skewness
* Pre-processing for categorical data:
    - [x] We need to make sure date is encoded consistently, verify the number of unique values
    - [x] n/a values will be replaced with NoMapping value
    - [x] categorical data will be one-hot encoded
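
The numerical steps of this plan can be sketched on a single column as follows. This is a simplified illustration rather than the notebook's `preprocess_numerical` function: it leaves out the log/power transformations, and the helper name `clean_numeric` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def clean_numeric(series, z_threshold=3.0):
    """Flag zeros, mask outliers by z-score, then fill gaps with the median.

    Returns the cleaned series and the 0/1 indicator of original zeros.
    """
    s = series.astype(float)
    is_empty = (s == 0).astype(int)      # remember where the zeros were
    s = s.mask(s == 0)                   # zeros become missing values
    z = (s - s.mean()) / s.std()         # NaN entries are ignored by mean/std
    s = s.mask(z.abs() > z_threshold)    # outliers become missing values
    return s.fillna(s.median()), is_empty


# Toy column: twenty ordinary values, two zeros and one extreme value
raw = pd.Series([1, 2, 3, 4, 5] * 4 + [0, 0, 1000])
cleaned, is_empty = clean_numeric(raw)
print(cleaned.tail(3).tolist(), is_empty.sum())
```

Keeping the `_isempty` indicator lets a model distinguish a genuine median value from a value that was imputed because the original was 0.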

## 3) Exploratory data analysis (EDA)
### (a) Preliminary EDA


#### 3.a.1 Correlation of target with features
* The following heatmap compares the performance targets with the other features. The correlations between the performance variables and the other features are mostly in the blue zone (negative correlation)
* The most negative correlation is with the Fossil Fuel Holdings features

In [None]:
fig = plt.figure(figsize=(25, 20))
sns.heatmap(preprocess_df.reindex(sorted(preprocess_df.columns), axis=1).corr(method='pearson'), 
            cmap='RdBu_r',
            annot=False,
            linewidth=0.5)

*Additional information* :
* As suggested, the following clustermaps group correlated features together.
* We can observe several clusters mixing features from different categories
* *asset* and *count* features are clustered together: this is probably due to their similar data distributions
* *weight* features do not show such clustering, which might indicate more independence between these variables
* A possible follow-up to these findings would be to group highly correlated features together
    * In part 2, highly correlated features have been removed to address this situation


In [None]:
sns.clustermap(preprocess_df.corr(method='pearson'), cmap='RdBu_r', figsize=(30, 30))

In [None]:
sns.clustermap(preprocess_df.filter(regex=".*_a.*",axis=1).corr(method='pearson'), cmap='RdBu_r', figsize=(10, 10))

In [None]:
sns.clustermap(preprocess_df.filter(regex=".*_c.*",axis=1).corr(method='pearson'), cmap='RdBu_r', figsize=(10, 10))

In [None]:
sns.clustermap(preprocess_df.filter(regex=".*_w.*",axis=1).corr(method='pearson'), cmap='RdBu_r', figsize=(10, 10))

In [None]:
sns.clustermap(preprocess_df.filter(regex=".*_isempty",axis=1).corr(method='pearson'), cmap='RdBu_r', figsize=(10, 10))

* Here are the top positive/negative correlations. Fossil Fuel holdings (weight) is the most correlated with our target

In [None]:
target = cols_df[ (cols_df['Category']=='Financial performance') & (cols_df['Type']=='Continuous') ]['Short column name']
correlations=preprocess_df.corr(method='pearson')[target]#.sort_values(ascending=False)
print('Top positive correlations with ReturnsY1')
correlations['FP_ReturnsY1'].sort_values(ascending=False)[0:15]

In [None]:
print('Top negative correlations with ReturnsY1')
correlations['FP_ReturnsY1'].sort_values(ascending=False)[-5:]

In [None]:
correlations.filter(regex="FI_AssetManager.*",axis=0)['FP_ReturnsY1'].sort_values(ascending=False)

* To go further, we draw pairplots between our target and the features to examine these correlations
    * Financial performance (Y3,Y5,Y10) features are strongly correlated 
    * We observe a downward trend for F_FossilFuelHoldings_w
    * Many 0-values and outliers can be seen; this will need to be addressed
    * Sampling had to be used to reduce the plotting time


In [None]:
target = 'FP_ReturnsY1'
nb_samples=5000
def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

for category in ['FI','FP','F','D','GE','G','W','T','P']:
    index_cols=preprocess_df.filter(regex="^"+category+"_.*",axis=1).columns
#for category in categories:
#    index_cols=list(set(getColCategory(category)) & set(preprocess_df.select_dtypes(np.number).columns))
    index_cols_numb_only=preprocess_df[index_cols].select_dtypes(exclude=object).columns
    index_cols_list=list(chunks(index_cols_numb_only, 5))
    number_sets=len(index_cols_list)
    for i in range(0,len(index_cols_list)):
        #Use sampling technique to speedup the process
        print("Pairplots for "+category+" numeric features - set "+str(i+1)+"/"+str(number_sets))
        plot= sns.pairplot(data=preprocess_df.sample(nb_samples), y_vars=target, x_vars=index_cols_list[i], height=3.5, aspect=1.1)  # index by i so the plotted set matches the printed set number
        plt.tight_layout()
        plt.show()

#### 3.a.2 Remove trend

* We try here to identify whether our target contains a trend that we should remove (e.g. the overall stock market recovery during 2020). One way of removing a trend is *differencing*, which consists in computing the difference between consecutive observations.
* More information on https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/
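
Differencing is only meaningful between consecutive observations of the same series. The sketch below shows one way to apply it per fund, ordered by date. It is a hedged illustration: the column names `fund`, `date` and `ret` and the helper `difference_per_fund` are assumptions, not names taken from the dataset.

```python
import pandas as pd


def difference_per_fund(df, fund_col, date_col, value_col):
    """Difference consecutive observations of the same fund, ordered by date."""
    ordered = df.sort_values([fund_col, date_col])
    # diff() restarts within each group, so the first date of a fund is NaN
    return ordered.groupby(fund_col)[value_col].diff()


# Toy data: column names are illustrative assumptions, rows deliberately shuffled
toy = pd.DataFrame({
    "fund": ["B", "A", "A", "B", "A"],
    "date": pd.to_datetime(["2020-02-29", "2020-03-31", "2020-01-31",
                            "2020-01-31", "2020-02-29"]),
    "ret": [9.0, 2.5, 1.0, 10.0, 1.5],
})
toy["ret_diff"] = difference_per_fund(toy, "fund", "date", "ret")
print(toy.sort_values(["fund", "date"]))
```

Because the result keeps the original index, assigning it back to the frame aligns each difference with its own row, whatever the row order.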

In [None]:
for date_col in date_cols:
    preprocess_df[date_col]=preprocess_df[date_col].apply(pd.to_datetime, errors='coerce')

In [None]:
preprocess_df['FP_ReturnsY1_diff']=df['FP_ReturnsY1'].diff()

In [None]:
# create figure and axis objects with subplots()
fig,ax=plt.subplots(figsize=(8,5))
ax.plot(preprocess_df.groupby(['FP_PerformanceAs-OfDate'])['FP_ReturnsY1'].mean(), label='Target')
ax.set_xlabel("Year")
ax.set_ylabel("FP_ReturnsY1",color='C0')

ax2=ax.twinx()
# make a plot with different y-axis using second axis object
ax2.plot(preprocess_df.groupby(['FP_PerformanceAs-OfDate'])['FP_ReturnsY1_diff'].mean(), color='g', label='Differenciation')
ax2.set_ylabel("FP_ReturnsY1_diff",color='g')
plt.show()

* This plot shows that differencing does not help to remove the trend, so we will not use this strategy.

In [None]:
# create figure and axis objects with subplots()
fig,ax=plt.subplots(figsize=(8,5))
ax.plot(preprocess_df.groupby(['FI_PortfolioHoldingsAs-OfDate'])['FP_ReturnsY1'].mean(), label='Target')
ax.set_xlabel("Year")
ax.set_ylabel("FP_ReturnsY1",color='C0')

ax2=ax.twinx()
# make a plot with different y-axis using second axis object
ax2.plot(preprocess_df.groupby(['FI_PortfolioHoldingsAs-OfDate'])['FP_ReturnsY1_diff'].mean(), color='g', label='Differenciation')
ax2.set_ylabel("FP_ReturnsY1_diff",color='g')
plt.show()

### 3.b) Discuss how the EDA informs your project plan
- [x] For the moment, we do not see a strong relationship (> +/-0.5) between features and our target, except for financial performance features which are closely tied together 
    * The other features correlate more strongly with Y3, Y5 and Y10 than with Y1
- [x] The situation will improve with processing of 0/null values, the grade encoding and feature engineering (log/exp encoding)

### 3.c) What further EDA do you plan for project?
- [x] More preparation of data is required at this stage so we can ensure a good data set before our modelling for the next step

## 4) Machine learning 
### (a) Phrase your project goal as a clear machine learning question
* The outcome of this project is to predict performance of a fund based on ESG features
* The target is the *Financial performance: Month end trailing returns, year 1* variable
* This is a regression problem 
* Fitting a tree-based model (random forest regressor) to identify key indicators for good fund selection will also be tried

### (b) What models are you planning to use and why?
* The principal ML training algorithm that will be used is the *Ridge regression* with :
    * Splitting for test and training data set
    * Standardization of data based on training set
    * Grid search for finding the best regularization (alpha parameter)
* The features will be selected with the SelectKBest method (f_regression, mutual_info_regression)
* Models will range from 
    * simple model: 3 features
    * intermediate model: 10 features to 20 features
    * complex model: all features
* We will also experiment using other ML training regressors such as *RandomForests, kNNs and neural networks* to allow comparison with results from *Ridge regression*

### (c) Please tell us your detailed machine learning strategy 
* The pre-processing part will mainly require that we have no null values in our dataset
* Ridge regression with optimization of the regularization parameter will be used
* Our baseline will be the *median* of performance price return
* For assessing the accuracy of our model (cost function), we will use the MAE (Mean Absolute Error) method:
  * MAE is easy to interpret. For instance, a score of 3 means that the predictions are, on average, above or below the observed value by a distance of 3.
  * MAE is robust to outliers, which is a nice statistical property, but it is harder to optimize because it is not smooth.
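
For reference, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The short sketch below computes it by hand on toy values (assumed, not taken from the dataset) and illustrates why the median is a natural baseline for this metric: among constant predictions, the median gives the lowest MAE, even when one observation is extreme.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average distance between prediction and target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy returns in percent, with one extreme fund (illustrative values)
returns = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median_baseline = np.full_like(returns, np.median(returns))
mean_baseline = np.full_like(returns, np.mean(returns))

print("MAE of median baseline:", mae(returns, median_baseline))
print("MAE of mean baseline:  ", mae(returns, mean_baseline))
```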

* Assess individual models
* How could individual models be assessed in more detail than just by their score? How could weaknesses be identified?
    * Assessment will be done through EDA, that is, by comparing the predictions of each trained estimator against the target. The goal is to find possible root causes for the errors and potentially add or modify features to reduce them
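
One possible way to carry out such an assessment is to break the error down by ranges of a feature, which shows where a model is weak. The sketch below is an assumption-laden illustration: the helper `mae_by_bin`, the bin edges and the toy values are all hypothetical.

```python
import pandas as pd


def mae_by_bin(y_true, y_pred, feature, bins):
    """Mean absolute error of the predictions within each bin of one feature."""
    frame = pd.DataFrame({
        "abs_error": (pd.Series(y_true) - pd.Series(y_pred)).abs(),
        "bin": pd.cut(pd.Series(feature), bins=bins),
    })
    return frame.groupby("bin", observed=True)["abs_error"].mean()


# Toy values (illustrative): the model is clearly worse for large feature values
feature = [1, 2, 3, 4]
y_true = [1.0, 1.0, 1.0, 1.0]
y_pred = [1.0, 2.0, 1.0, 4.0]
print(mae_by_bin(y_true, y_pred, feature, bins=[0, 2, 4]))
```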

## 5) Model fitting
### 5.1 Feature selection
* Make use of SelectKBest (f_regression, mutual_info_regression) to select features of the simple model and intermediate model.
* Check that all columns of df are numbers (no null values allowed)
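
To make the selection mechanism concrete, here is a minimal sketch on synthetic data (an assumption: only two of five features drive the target). It uses `SelectKBest` with `f_regression` and reads the selected columns through `get_support()`, which returns a mask aligned with the input columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): only x0 and x3 drive the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"x{i}" for i in range(5)])
y = 3 * X["x0"] - 2 * X["x3"] + rng.normal(scale=0.1, size=500)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# get_support() returns a boolean mask aligned with the input columns
selected = list(X.columns[selector.get_support()])
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

`f_regression` scores each feature on its own linear relationship with the target, so it can miss features that only matter non-linearly or in combination; `mutual_info_regression` is one alternative for that case.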

In [None]:
test_size_split=0.4
clean_df=pd.read_csv('/kaggle/input/fossil-free-funds/prep/fossilfund_dataset_prep.csv')

In [None]:
print("How many non numerical columns:",len(clean_df.columns)-len(clean_df.select_dtypes(['number']).columns))

In [None]:
#Remove FP
clean_df.drop(columns=['FP_ReturnsY3','FP_ReturnsY5','FP_ReturnsY10'], axis=1, inplace=True)

In [None]:
#Count the rows that contain at least one null value
print("No. of rows containing null values", clean_df.isna().any(axis=1).sum())
print("Total no. of columns in the dataframe", len(clean_df.columns))
print("No. of numerical columns ", len(clean_df.select_dtypes(np.number).columns))
print("No. of columns containing null values", len(clean_df.columns[clean_df.isna().any()]))
print("Details of columns containing null values", clean_df.columns[clean_df.isna().any()])
print("No. of columns not containing null values", len(clean_df.columns[clean_df.notna().all()]))

#clean_df[clean_df.isna().any(axis=1)]

* Prepare X and y variables for the SelectKBest

In [None]:
target = 'FP_ReturnsY1'
X = clean_df.drop(columns=target)
y = clean_df[target]

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# How many features do you want to keep?
k = 20

# Create the selecter object
skb = SelectKBest(f_regression, k=k)

# Fit the selecter to your data
X_new = skb.fit_transform(X, y)

# Extract the indices of the k features kept by the selecter
k_feat = skb.get_support(indices=True)

# Reduce the dataframe according to the selecter
df_reduced = clean_df[X.columns[k_feat]]

In [None]:
# instantiate SelectKBest to determine 20 best features
best_features = SelectKBest(score_func=f_regression, k=k)
fit = best_features.fit(X,y)
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(X.columns)
# concatenate dataframes
feature_scores = pd.concat([df_columns, df_scores],axis=1)
feature_scores.columns = ['Feature_Name','Score']  # name output columns
topk_features=feature_scores.nlargest(k,'Score')
print(topk_features)  # print top k best features

In [None]:
def plotPredictors(selection, feature_scores):
    subset_features=feature_scores[feature_scores['Feature_Name'].isin(selection)].sort_values(by='Score', ascending=False)
    plt.figure(figsize=(10,5))
    plt.bar(range(len(subset_features['Score'])), subset_features['Score'])
    plt.xticks(np.arange(0, len(subset_features['Score'])), subset_features['Feature_Name'], rotation="vertical")
    #plt.setp(plt.gca().get_xticklabels(), rotation=30, horizontalalignment='right')
    plt.tight_layout()

* We plot here the top 20 predictors (with the highest score):
 * Performance Reporting date has the best score
 * Grade and Carbon footprint are also present in the top 5
 * Absence of information (\_isempty) appears quite often

In [None]:
plotPredictors(topk_features['Feature_Name'],feature_scores)

* We plot the top features per category
 * Grade features appear quite often at the top of each category

In [None]:
for category in ['FI','FP','F','D','GE','G','W','T','P']:
    featurePerCategory=feature_scores[feature_scores['Feature_Name'].str.contains("^"+category+"_.*")]['Feature_Name'].values
    plotPredictors(featurePerCategory,feature_scores)

### 5.2 To get your top feature names
* We define our 4 models based on the previous observations
 * Simple: we only keep 3 features that are related neither to dates nor to the absence of scoring (*_isempty*). We also do not select *P_PrisonIndustrialComplexGrade*, as this feature had a high share of null values (33.64 %) that were replaced by the median
 * Grade: we keep only grade features, except for the Gender equality category, where *GE_WeightOfHoldings* scores better than the grade feature; we discard *G_CivilianFirearmGrade*, whose score is too low to be included
 * Intermediate: all top 20 features
 * Complex: all 122 features

In [None]:
features_simple_model=['F_RelativeCarbonFootprint','F_RelativeCarbonIntensity','F_FossilFuelGrade']

features_grade_model=['F_FossilFuelGrade',
'P_PrisonIndustrialComplexGrade',
'D_DeforestationGrade',
'T_TobaccoGrade',
'W_MilitaryWeaponGrade',
'GE_WeightOfHoldings']

features_intermediate_model=list(feature_scores.nlargest(k,'Score')['Feature_Name'].values)

features_complex_model=clean_df.drop(columns=target).columns

In [None]:
plotPredictors(features_simple_model,feature_scores)

In [None]:
plotPredictors(features_grade_model,feature_scores)

In [None]:
plotPredictors(features_intermediate_model,feature_scores)

In [None]:
plotPredictors(features_complex_model[0:50],feature_scores)

* We calculate here the MAE score for each model using Ridge regression with grid search tuning for the regularization (alpha) parameter

* Note: Because in linear regression the value of each coefficient is partially determined by the scale of its feature, and in regularized models all coefficients are penalized together, we must make sure to standardize the features prior to training.
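
A related precaution concerns how alpha is chosen: picking the alpha that minimizes the test MAE tends to make the reported test score optimistic. The sketch below shows one common alternative, under stated assumptions: alpha is selected by cross-validation on the training set only, and the scaler sits inside a `Pipeline` so that it is re-fitted on each training fold. The synthetic data and the alpha grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption), features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X @ np.array([2.0, 0.3, 0.01, 0.0]) + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

# The scaler sits inside the pipeline, so it is re-fitted on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
grid = {"ridge__alpha": np.logspace(-4, 4, num=9)}
search = GridSearchCV(pipe, grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_tr, y_tr)

# The test set is used only once, for the final score
best_alpha = search.best_params_["ridge__alpha"]
test_mae = mean_absolute_error(y_te, search.predict(X_te))
baseline_mae = mean_absolute_error(y_te, np.full_like(y_te, np.median(y_tr)))
print(best_alpha, test_mae, baseline_mae)
```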

In [None]:
def splitTrainTest(df,target,features):
    #split current dataframe into train set / test set
    # Create X, y
    X = df[features]
    y = df[target]
    # Split into train/test sets
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size_split, random_state=0)
    # Standardize data
    scaler = StandardScaler()
    X_tr_rescaled = scaler.fit_transform(X_tr)
    X_te_rescaled = scaler.transform(X_te)
    
    return X_tr_rescaled, X_te_rescaled, y_tr, y_te, scaler


def saveModelResults(model, modelName, X_te_rescaled, y_te):
    mae = MAE(y_te, model.predict(X_te_rescaled))
    print('{} - MAE on test set: {:,.3f}%'.format(modelName, mae))

    #Save result
    details = { 
        'model' : [modelName], 
        'test_accuracy' : [mae], 
    }
    df = pd.DataFrame(details)
    df.to_csv('/kaggle/working/results.csv', index=False, mode='a', header=False, float_format='%.3f')
    return [mae, model, modelName]

def RidgeModelTraining(df, features, modelName):
    X_tr_rescaled, X_te_rescaled, y_tr, y_te, scaler = splitTrainTest(df,target,features)

    # Fit/test N models for optimal regularization
    gs_results = []

    # Grid search
    for alpha in np.logspace(-8, 4, num=200):
        # Create and fit ridge regression
        ridge = Ridge(alpha=alpha)
        ridge.fit(X_tr_rescaled, y_tr)
        
        # Save model and its performance on train/test sets
        gs_results.append({
            'model': ridge,
            'alpha': alpha,
            'train_mse': MSE(y_tr, ridge.predict(X_tr_rescaled)),
            'train_mae': MAE(y_tr, ridge.predict(X_tr_rescaled)),
            'test_mse': MSE(y_te, ridge.predict(X_te_rescaled)),
            'test_mae': MAE(y_te, ridge.predict(X_te_rescaled)),
        })

    # Convert results to DataFrame
    gs_results = pd.DataFrame(gs_results)

    # Plot the validation curves
    plt.plot(np.log10(gs_results['alpha']), gs_results['train_mae'], label='train curve')
    plt.plot(np.log10(gs_results['alpha']), gs_results['test_mae'], label='test curve')

    # Mark best alpha value
    best_result = gs_results.loc[gs_results.test_mae.idxmin()]
    plt.scatter(np.log10(best_result.alpha), best_result.test_mae, marker='x', c='red', zorder=10)
    plt.title('Best alpha: {:.1e} - mse: {:.4f} mae: {:,.0f}%'.format(
        best_result.alpha, best_result.test_mse, best_result.test_mae))

    plt.xlabel('$log_{10}(alpha)$')
    plt.ylabel('MAE')
    plt.legend()
    plt.show()

    ridge = Ridge(alpha=best_result.alpha)
    ridge.fit(X_tr_rescaled, y_tr)
    #mae_best=MAE(y_te, ridge.predict(X_te_rescaled))
    #print('MAE with best alpha: {:,.2f}$'.format(mae_best))
    #return [mae_best, ridge]
    
    return saveModelResults(ridge, modelName, X_te_rescaled, y_te)
     

In [None]:
#MAE baseline
X_tr_rescaled, X_te_rescaled, y_tr, y_te, scaler = splitTrainTest(clean_df,target,features_simple_model)
median_predictions = np.full_like(y_te, np.median(y_tr))
mae_baseline=MAE(y_te, median_predictions)
print('Median baseline: {:,.2f}%'.format(mae_baseline))

In [None]:
ridge_simple = RidgeModelTraining(clean_df, features_simple_model, 'RidgeSimple')

In [None]:
ridge_grade = RidgeModelTraining(clean_df, features_grade_model, 'RidgeGrade')

In [None]:
ridge_intermediate = RidgeModelTraining(clean_df, features_intermediate_model, 'RidgeIntermediate')

In [None]:
ridge_complex = RidgeModelTraining(clean_df, features_complex_model, 'RidgeComplex')

* We finally compare the MAE values: the complex model (all features) performs better than the other models (lowest MAE)

In [None]:
# Ridge comparison
mae_values = [mae_baseline, ridge_simple[0], ridge_grade[0] , ridge_intermediate[0], ridge_complex[0]]
titles = ['median', 'Simple', 'Grade','Intermediate', 'Complex']

xcor = np.arange(len(mae_values))
plt.bar(xcor, mae_values)
plt.xticks(xcor, titles)

plt.ylabel('MAE')
plt.show()

### 5.3 ML training regressors
* In this chapter, the other ML regressors are trained on the features of the simple model
* The goal is to try non-linear ML regressors to check whether we can achieve better performance (lower MAE) than ridge regression
* N.B.: Hyperparameter optimization has already been run for these models; it is presented in the annex (section 5.5).

In [None]:
selected_model=features_simple_model
X_tr_rescaled, X_te_rescaled, y_tr, y_te, scaler = splitTrainTest(clean_df,target,selected_model)

#### 5.3.1 RandomForests
* We will be using the RandomForestRegressor as ML regressor
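* Main parameters of this model: `n_estimators` is the number of trees in the forest, `max_depth` is the maximum depth of each tree (`None` lets the trees grow until the leaves are pure or too small to split) and `max_features` is the number of features considered when looking for the best split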
* We first plot a graph with a limited depth random forest to visually assess how random forest behaves
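
Besides plotting a single tree, a fitted forest exposes impurity-based feature importances, which is one way to look for the key indicators mentioned in the project goal. The sketch below uses synthetic data (an assumption: the target depends non-linearly on the first feature only) and is meant as an illustration, not as the notebook's procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): a non-linear dependence on the first feature only
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = 5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1 across features
importances = forest.feature_importances_
print(importances)
```

A linear model would see almost no relationship here, since the dependence is symmetric around 0, whereas the forest attributes nearly all the importance to the first feature.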

In [None]:
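from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz
import graphviz

n_estimators=100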
max_depth=4
# Random Forest Regressor Model
rdForest = RandomForestRegressor(max_depth=max_depth,n_estimators=n_estimators, random_state=0)
model = rdForest.fit(X_tr_rescaled, y_tr)

#Show plot
tree = rdForest.estimators_[n_estimators-1]
# Export decision tree
dot_data = export_graphviz(
    tree, out_file=None,
    feature_names=clean_df[selected_model].columns, 
    class_names=categories,
    filled=True, rounded=True
)

# Display decision tree
graphviz.Source(dot_data)

* For this stage, we define a RandomForestRegressor with optimal parameters coming from our hyperparameter optimization
* We achieve significantly better performance than ridge regression. The non-linearity of the model might explain this improvement

In [None]:
n_estimators=500
max_depth=None
max_features="log2"

rdForest = RandomForestRegressor(max_features=max_features, max_depth=max_depth,n_estimators=n_estimators,random_state=0)
model = rdForest.fit(X_tr_rescaled, y_tr)

rdForestResults=saveModelResults(model, "RandomForest", X_te_rescaled, y_te)

#### 5.3.2 kNNs 
* We use KNeighborsRegressor as ML regressor
* As with random forest, we obtain among the best results, possibly thanks to the non-linearity of this model

In [None]:
from sklearn.neighbors import KNeighborsRegressor

n_neighbors=100
p=2
weights='distance'

neigh = KNeighborsRegressor(n_neighbors=n_neighbors, p=p, weights=weights)
neigh.fit(X_tr_rescaled, y_tr)


kNNResults=saveModelResults(neigh, "KNN", X_te_rescaled, y_te)

#### 5.3.3 Neural networks 
We have defined here a neural network with the following properties: 
 * A sequential model made of dense layers
 * '**relu**' as the activation function in the hidden layers
 * A '**normal**' initializer as the kernel_initializer (initializers define the way to set the initial random weights of Keras layers)
 * mean_absolute_error as the loss function
 * An output layer with a single node and a 'linear' activation function
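
To make these properties concrete without depending on Keras, the sketch below reproduces the forward pass of such a network in plain NumPy: ReLU on the hidden layers, a linear output node and the MAE as loss. The weights are hand-picked for illustration and are not trained; the helper names are assumptions.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative values are set to 0."""
    return np.maximum(x, 0.0)


def forward(x, layers):
    """Forward pass of a dense network: ReLU on hidden layers, linear output.

    `layers` is a list of (weights, bias) pairs; the last pair is the output layer.
    """
    activation = np.asarray(x, dtype=float)
    for weights, bias in layers[:-1]:
        activation = relu(activation @ weights + bias)
    out_weights, out_bias = layers[-1]
    return activation @ out_weights + out_bias


# Hand-picked toy weights (illustrative, not trained)
layers = [
    (np.eye(2), np.zeros(2)),                      # hidden layer, 2 neurons
    (np.array([[2.0], [3.0]]), np.array([0.5])),   # output layer, 1 neuron
]
x = np.array([[1.0, -2.0]])
prediction = forward(x, layers)
mae_loss = float(np.mean(np.abs(prediction - np.array([[3.0]]))))
print(prediction, mae_loss)
```

The ReLU zeroes the negative input before the output layer, which is what makes the network non-linear even though the layers are simply stacked.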

In [None]:
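import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

batch_size_nb=32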
epochs=200
number_of_neurons=128
number_of_hidden_layers=10

NN_model = Sequential()

# The Input Layer :
NN_model.add(Dense(number_of_neurons, kernel_initializer='normal',input_dim = X_tr_rescaled.shape[1], activation='relu'))

# The Hidden Layers :
for i in range(0,number_of_hidden_layers):
    NN_model.add(Dense(number_of_neurons, kernel_initializer='normal',activation='relu'))

# The Output Layer :
NN_model.add(Dense(1, kernel_initializer='normal',activation='linear'))

# Compile the network :
NN_model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model.summary()

* Train model and save the results to avoid model fitting calculation afterwards

In [None]:
modelfilepath='/kaggle/input/fossil-free-funds/misc/modelNN_simple'
if not os.path.exists(modelfilepath):  # the saved model may be a directory, which isfile() would not detect
    #os.makedirs(modelfilepath)
    #Define a checkpoint 
    #checkpoint_name = traindir+'/Weights-{epoch:03d}--{val_loss:.5f}.hdf5' 
    #checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_loss', verbose = 1, save_best_only = True, mode ='auto')
    #callbacks_list = [checkpoint]
    
    #Train model
    NN_model.fit(X_tr_rescaled, y_tr, epochs=epochs, batch_size=batch_size_nb, validation_split = 0.2)#, callbacks=callbacks_list)
    NN_model.save(modelfilepath)
else:
    NN_model=tf.keras.models.load_model(modelfilepath)

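* The result of the NN is similar to the ridge one. Note that `Sequential` only means that the layers are stacked one after another; with 'relu' activations the network is not restricted to a linear fit. One possible explanation is that, with the chosen architecture and training settings, it does not capture more than a roughly linear relationship from the three input features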

In [None]:
NN_model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
NNResults=saveModelResults(NN_model, "NN", X_te_rescaled, y_te)

### 5.4 Compare ML with target and conclusion
* In this section, we compare the features of our simple model (F_RelativeCarbonFootprint, F_RelativeCarbonIntensity, F_FossilFuelGrade) against the target for the different ML regressors seen above
* We can visually assess the better fit that random forest and kNN provide in comparison to Ridge and NN

In [None]:
def rand_jitter(arr, delta, feature_name):
    if not feature_name in continuous:
        return arr + delta 
    else:
        return arr

allModels=[kNNResults, rdForestResults, NNResults, ridge_simple ]

for i in range(0, len(selected_model)):
    plt.figure(figsize=(12,8))
    plt.title('ML comparison for '+selected_model[i])
    plt.xlabel(selected_model[i])
    plt.ylabel(target)
    x = scaler.inverse_transform(X_te_rescaled)[:,i]
    #plt.scatter(rand_jitter(x, 0, selected_model[i]), y_te, label="Test set", alpha=.5)
    sns.violinplot(y=rand_jitter(x, 0, selected_model[i]), x=y_te)
#    for idx,model in enumerate(allModels):
#        delta = (idx+1)/(len(allModels)*4)
#        plt.scatter(rand_jitter(x, delta, selected_model[i]), model[1].predict(X_te_rescaled), label=model[2],alpha=.5)
        
#        plt.legend()
    plt.show()

#### 5.4.1 MAE final comparison

In [None]:
# Final comparison
mae_values = [mae_baseline, ridge_simple[0], NNResults[0], rdForestResults[0], kNNResults[0]]
titles = ['Median', 'Ridge','NN', 'RFR', 'kNN']

xcor = np.arange(len(mae_values))
plt.bar(xcor, mae_values)
plt.xticks(xcor, titles)

plt.ylabel('MAE')
plt.show()

#### 5.4.2 Project conclusion

* The results above clearly indicate that funds with a high share of fossil-fuel stocks display lower returns
* For the other types of features (Gender equality, Weapon, ...), the trend is similar but on a smaller scale.
* As the data is taken from 2020, a longer timeframe would be required to assess whether this correlation has always held.
* Fortunately, these findings can serve as a strong marketing argument for the investment industry to transition to greener investment vehicles, which is a good thing both economically and for the environmental sustainability of our planet.

### 5.5 Annex - Hyper parameter optimization
* This section aims to discover the best hyperparameters for the different ML training models
#### 5.5.1 Grid search for Random Forest and K-NN models
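* Negative MAE values are expected: scikit-learn scorers follow a "greater is better" convention, so errors are reported negated (see https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error)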
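
The sign convention can be seen on a small example. The sketch below uses synthetic data (an assumption) with `cross_val_score` and the `neg_mean_absolute_error` scorer, then flips the sign to recover an MAE that reads the usual way.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.2, size=200)

# scikit-learn scorers are "greater is better", so the MAE is returned negated
neg_scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=5), X, y,
    scoring="neg_mean_absolute_error", cv=5,
)
cv_mae = -neg_scores.mean()
print(neg_scores, cv_mae)
```

The same applies to `best_score_` and `score()` of a `GridSearchCV` fitted with this scorer: both are negated MAE values.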

In [None]:
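from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

def grid_search(X_tr_rescaled, y_tr, X_te_rescaled, y_te):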
    gs_results=[]
    
    pipeline1 = Pipeline((
    ('rfr', RandomForestRegressor()),
    ))

    pipeline2 = Pipeline((
    ('knn', KNeighborsRegressor()),
    ))

    parameters1 = {
    'rfr__n_estimators': [5, 10, 15, 20, 50, 100, 200, 500], #np.arange(5,105,5).tolist(),
    #'rfr__criterion': ['mae'], too slow, see https://stackoverflow.com/questions/57243267/why-is-training-a-random-forest-regressor-with-mae-criterion-so-slow-compared-to
    'rfr__max_features': ['log2', 'sqrt', None],  # 'auto' dropped: for regressors it was equivalent to None and it was removed in scikit-learn 1.3
    'rfr__max_depth': [5, 10, 15, 20, 50, 100, None]
    }

    parameters2 = {
    'knn__n_neighbors': [3, 7, 10, 15, 20, 50, 100, 200, 500],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
    }

    pars = [parameters1, parameters2]#, parameters3, parameters4]
    pips = [pipeline1, pipeline2]#, pipeline3, pipeline4]

    print("Starting Gridsearch")
    for i in range(len(pars)):
        gs = GridSearchCV(pips[i], pars[i], verbose=2, scoring='neg_mean_absolute_error', n_jobs=-1)#, return_train_score=True)
        gs.fit(X_tr_rescaled, y_tr)
        print ("MAE score", gs.best_score_ , "with parameters",gs.best_params_)
        # Save model and its performance on train/test sets
        gs_results.append({
            'model': gs.cv_results_,
            'best_params': gs.best_params_,
            'train_mae': gs.best_score_,
            'test_mae': gs.score(X_te_rescaled, y_te),
        })
    return pd.DataFrame(gs_results)
gs=grid_search(X_tr_rescaled, y_tr, X_te_rescaled, y_te)  
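
The grid search stores its full `cv_results_`, which can be turned into a ranking of all tested combinations. The sketch below shows one way to do this on a small synthetic problem. The data, the reduced grid and the names `demo_search` and `ranking` are assumptions made for illustration, not the project's actual search.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Synthetic data and a deliberately small grid, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 2))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=150)

demo_pipeline = Pipeline([('knn', KNeighborsRegressor())])
demo_grid = {'knn__n_neighbors': [3, 7, 15], 'knn__weights': ['uniform', 'distance']}

demo_search = GridSearchCV(demo_pipeline, demo_grid, scoring='neg_mean_absolute_error', cv=5)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best combination first
ranking = pd.DataFrame(demo_search.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
ranking['cv_mae'] = -ranking['mean_test_score']  # flip the sign to get a positive MAE
ranking = ranking.sort_values('rank_test_score').reset_index(drop=True)
print(ranking[['params', 'cv_mae']])
```

Looking at the whole ranking, rather than only at the best row, helps to see whether several settings perform almost equally well.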

#### 5.5.2 NN hyperparameters 
* To tune the NN hyperparameters, we need a function that builds and returns a model.
* The parameters of this function must be the hyperparameters we want to tune.
* As shown below, we will be tuning:
  - the number of hidden layers
  - the number of neurons in each hidden layer

In [None]:
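from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import KFold, RandomizedSearchCV
# KerasRegressor is assumed to have been imported earlier in the notebook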

def build_model(number_of_hidden_layers=1, number_of_neurons=2):#,l2_penalty=0):
  model = Sequential()

  # First hidden layer
  model.add(Dense(number_of_neurons, kernel_initializer='normal', input_dim = X_tr_rescaled.shape[1], activation='relu')) #, kernel_regularizer = regularizers.l2(l2_penalty)))

  # hidden layers
  for hidden_layer_number in range(1, number_of_hidden_layers):
    model.add(Dense(number_of_neurons, kernel_initializer='normal', activation='relu'))

  # output layer
  model.add(Dense(1, kernel_initializer='normal',activation='linear'))

  model.compile(optimizer='adam', loss='mean_absolute_error')

  return model

tuned_model = KerasRegressor(build_fn=build_model)


params = {
      'number_of_hidden_layers': [2, 3, 4, 5, 10, 15, 20], #Deeper Network Topology
      'number_of_neurons': [16, 32, 64, 128] #Wider Network Topology
      #'l2_penalty': np.logspace(-4, -1, num=5) #array([0.0001    , 0.00056234, 0.00316228, 0.01778279, 0.1       ])
      }

# Create a randomized search cross-validation object; it uses a 4-fold KFold cross-validation and samples n_iter=8 hyperparameter combinations
random_search = RandomizedSearchCV(tuned_model, param_distributions = params, cv = KFold(4), n_jobs=2, scoring="neg_mean_absolute_error", n_iter=8)
#random_search = GridSearchCV(tuned_model, params, cv = KFold(4))

# find the best parameters!
random_search.fit(X_tr_rescaled, y_tr)

random_search.best_estimator_.get_params()

It was found that the best combination of hyperparameters is:
 - 'number_of_hidden_layers': 10,
 - 'number_of_neurons': 128
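
Because the search is randomized with `n_iter=8`, only part of the grid is evaluated, so the combination above is the best among the sampled candidates rather than necessarily the best of the whole grid. The sketch below counts the candidates with scikit-learn's `ParameterGrid` and `ParameterSampler`. The fixed `random_state=0` is an assumption added for reproducibility and is not set in the original search, so the sampled candidates shown here may differ from the ones actually tried.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the tuning cell
search_space = {
    'number_of_hidden_layers': [2, 3, 4, 5, 10, 15, 20],
    'number_of_neurons': [16, 32, 64, 128],
}

# Full grid: 7 depths x 4 widths
n_total = len(ParameterGrid(search_space))

# Candidates drawn by a randomized search with n_iter=8 (seed is an assumption)
sampled = list(ParameterSampler(search_space, n_iter=8, random_state=0))

print(f"{len(sampled)} of {n_total} combinations evaluated")
for candidate in sampled:
    print(candidate)
```

Raising `n_iter`, or switching to the commented-out `GridSearchCV` line, would cover more of the grid at the cost of a longer training time.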

### 5.6 Extra - Calculate the KNN MAE with the grade model
* We calculate the MAE for the grade model (instead of the simple model).
* The MAE is worse than for the simple model, which is expected as the grade model includes low-scoring features.
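
One plausible reason is that k-NN relies on distances, so uninformative features dilute the signal of the useful ones. The sketch below illustrates this effect on synthetic data. The data, the helper `knn_test_mae` and the k-NN settings are assumptions for illustration and do not reproduce the project's features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: one informative feature and ten pure-noise features
rng = np.random.default_rng(0)
n_samples = 600
informative = rng.normal(size=(n_samples, 1))
noise = rng.normal(size=(n_samples, 10))
y_demo = 2.0 * informative[:, 0] + rng.normal(scale=0.1, size=n_samples)

def knn_test_mae(X):
    """Fit a k-NN regressor on a train split and return its MAE on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(X, y_demo, test_size=0.25, random_state=0)
    model = KNeighborsRegressor(n_neighbors=10, weights='distance').fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

mae_informative = knn_test_mae(informative)
mae_with_noise = knn_test_mae(np.hstack([informative, noise]))
print("MAE, informative feature only:", mae_informative)
print("MAE, with 10 uninformative features:", mae_with_noise)
```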

In [None]:
selected_model=features_grade_model
X_tr_rescaled, X_te_rescaled, y_tr, y_te, scaler = splitTrainTest(clean_df,target,selected_model)
#grid_search(X_tr_rescaled, y_tr, X_te_rescaled, y_te)  

In [None]:
#{'knn__n_neighbors': 10, 'knn__p': 2, 'knn__weights': 'distance'}
n_neighbors=200
p=1
weights='distance'

neigh = KNeighborsRegressor(n_neighbors=n_neighbors, p=p, weights=weights)
neigh.fit(X_tr_rescaled, y_tr)


kNNResults=saveModelResults(neigh, "KNN", X_te_rescaled, y_te)

In [None]:
allModels=[kNNResults, ridge_grade ]

# One figure per feature: test set values against the predictions of each model
for i, feature in enumerate(selected_model):
    plt.figure(figsize=(12,8))
    plt.title('ML comparison for '+feature)
    plt.xlabel(feature)
    plt.ylabel(target)
    # Undo the scaling so the x axis shows the original feature values
    x = scaler.inverse_transform(X_te_rescaled)[:,i]
    plt.scatter(rand_jitter(x, 0, feature), y_te, label="Test set", alpha=.5)
    for idx,model in enumerate(allModels):
        # Shift each model slightly so that the point clouds do not overlap
        delta = (idx+1)/(len(allModels)*4)
        plt.scatter(rand_jitter(x, delta, feature), model[1].predict(X_te_rescaled), label=model[2],alpha=.5)

    # Draw the legend once, after all series have been plotted
    plt.legend()
    plt.show()