# ESG Score Prediction

## Notebook Outline

1. Introduction (ESG Score, Calculation method, Factors - Summary based on the TR pdf)
2. Data Explanation (Features - what they are)
3. Data Processing - Outlier Detection, Feature Transformation 
4. EDA - Basic Insights
5. Feature Selection/Importance
6. Data Modelling

## 1. Introduction 

A balance sheet is a financial statement that lists the assets, liabilities and equity of a company at a specific point in time and is used to calculate the net worth of a business. A basic tenet of double-entry bookkeeping is that total assets (what a business owns) must equal liabilities plus equity (how the assets are financed). In other words, the balance sheet must balance. Subtracting liabilities from assets shows the net worth of the business.

How does the ESG score come into the picture for a business?
(TBD by Aheli)


## 2. Data Explanation  

### Total Current Assets 
Current assets are cash, cash equivalents, and other assets that the business expects to use or convert to cash within one year.

### Total Current Liabilities 
Debts that are due in one year or less are classified as current liabilities. If they're due in more than one year, they're long-term liabilities.

### Total Debt 
Total debt refers to the sum of borrowed money that a business owes. It is calculated by adding together short-term and long-term debt.

### Total Assets Reported 
"Total long-term assets" is the sum of capital and plant, investments, and miscellaneous assets.
"Total assets" is the sum of total current assets and total long-term assets.

### Net Income - Actual 
Net income refers to the amount an individual or business makes after deducting costs, allowances and taxes.
In commerce, net income is what the business has left over after all expenses, including salary and wages, cost of goods or raw material and taxes. 

### Revenue Per Share
Revenue per share is a company's total revenue divided by the number of common shares it has outstanding. It is distinct from earnings per share (EPS), a related per-share metric that the P/E ratio is built on:

1. EPS is a company's net profit divided by the number of common shares it has outstanding.
2. EPS indicates how much money a company makes for each share of its stock and is a widely used metric for estimating corporate value.
3. A higher EPS indicates greater value because investors will pay more for a company's shares if they think the company has higher profits relative to its share price.
4. EPS can be arrived at in several forms, such as excluding extraordinary items or discontinued operations, or on a diluted basis.
5. Like other financial metrics, EPS is most valuable when compared against competitor metrics, companies of the same industry, or across a period of time.

### Total Revenue 
Total revenue is the amount of money a company brings in from selling its goods and services. In other words, companies use this metric to determine how well they are generating money from their core revenue-driving operations.

### Total Equity 
The total equity of a business is derived by subtracting its liabilities from its assets. The information for this calculation can be found on a company's balance sheet, which is one of its financial statements. 
An alternative approach for calculating total equity is to add up all of the line items in the stockholders' equity section of the balance sheet, which is comprised of common stock, additional paid-in capital, and retained earnings, minus treasury stock.
In essence, total equity is the amount invested in a company by investors in exchange for stock, plus all subsequent earnings of the business, minus all subsequent dividends paid out. Many smaller businesses are strapped for cash and so have never paid any dividends. In their case, total equity is simply invested funds plus all subsequent earnings.
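
The first calculation described above (assets minus liabilities) can be sketched in a few lines. The function name and the figures are hypothetical and are not taken from `Firm_Data.csv`; the second call only illustrates that equity turns negative when liabilities exceed assets.

```python
def total_equity(total_assets: float, total_liabilities: float) -> float:
    """Equity implied by the identity: assets = liabilities + equity."""
    return total_assets - total_liabilities


# Hypothetical figures (in millions), for illustration only
equity = total_equity(total_assets=1_500.0, total_liabilities=1_100.0)
print(equity)  # 400.0

# When liabilities exceed assets, equity is negative
negative_equity = total_equity(total_assets=900.0, total_liabilities=1_000.0)
print(negative_equity)  # -100.0
```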

### Company Market Capitalization
Market cap—or market capitalization—refers to the total value of all a company's shares of stock. It is calculated by multiplying the price of a stock by its total number of outstanding shares. For example, a company with 20 million shares selling at $50 a share would have a market cap of $1 billion.

Why is market capitalization such an important concept? It allows investors to understand the relative size of one company versus another. Market cap measures what a company is worth on the open market, as well as the market's perception of its future prospects, because it reflects what investors are willing to pay for its stock.
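
As a quick check of the worked example above, the calculation can be written as a small helper. The function name `market_cap` is an illustrative assumption; the inputs are the 20 million shares and $50 price from the text.

```python
def market_cap(share_price: float, shares_outstanding: float) -> float:
    """Market capitalization = share price x number of outstanding shares."""
    return share_price * shares_outstanding


# The example from the text: 20 million shares at $50 each
cap = market_cap(share_price=50.0, shares_outstanding=20_000_000)
print(f"${cap:,.0f}")  # $1,000,000,000
```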

### Property Plant And Equipment, Total - Gross
Carrying amount at the balance sheet date for long-lived physical assets used in the normal conduct of business and not intended for resale. This can include land, physical structures, machinery, vehicles, furniture, computer equipment, construction in progress, and similar items.

### P/E Ratio
The price/earnings (P/E) ratio, also known as an “earnings multiple,” is one of the most popular valuation measures used by investors and analysts. The basic definition of a P/E ratio is stock price divided by earnings per share (EPS).
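
A minimal sketch of that definition follows. The helper names and the figures are hypothetical. Returning `NaN` for non-positive EPS is a convention assumed here for illustration, not something the data provider is known to do.

```python
import math


def earnings_per_share(net_income: float, shares_outstanding: float) -> float:
    """EPS = net income / number of common shares outstanding."""
    return net_income / shares_outstanding


def pe_ratio(share_price: float, eps: float) -> float:
    """P/E = share price / EPS; treated as undefined (NaN) when EPS <= 0 (assumption)."""
    if eps <= 0:
        return math.nan
    return share_price / eps


# Hypothetical company: 40 million net income, 20 million shares, $50 share price
eps = earnings_per_share(net_income=40_000_000, shares_outstanding=20_000_000)
pe = pe_ratio(share_price=50.0, eps=eps)
print(eps, pe)  # 2.0 25.0
```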


## 3. Data Exploration

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df_firm = pd.read_csv("Firm_Data.csv")

In [3]:
df_firm.head()

Unnamed: 0,RIC,Date,Total Current Assets,Total Current Liabilities,Total Debt,"Total Assets, Reported",Net Income - Actual,Revenue Per Share,Total Revenue,Total Equity,Total CO2 Equivalent Emissions To Revenues USD in million,ESG Score,Social Pillar Score,Governance Pillar Score,Environmental Pillar Score,Company Market Capitalization,"Property Plant And Equipment, Total - Gross",P/E (Daily Time Series Ratio)
0,BVIC.L,31/12/2009,434.373405,483.824655,718.957376,1361.504594,107.309409,7.068281,1561.383359,-3.988004,,51.267135,45.539754,59.077651,53.535053,1430.56133,817.381317,19.299905
1,BVIC.L,31/12/2010,579.778906,580.095207,901.298414,1655.358922,139.656087,8.006665,1800.699024,-48.552134,,50.550242,57.622888,31.890959,55.261467,1772.376566,889.120843,16.9279
2,BVIC.L,31/12/2011,598.968478,607.694345,893.154869,1660.096296,122.255528,7.912984,2010.689188,35.059289,,46.73287,57.397993,28.924942,45.229496,1205.126197,836.280911,13.534436
3,BVIC.L,31/12/2012,615.523874,601.137943,907.121844,1658.099764,100.765756,7.914423,2030.84085,59.968319,,57.941343,70.63304,37.216902,55.762859,1603.272359,872.046035,18.363571
4,BVIC.L,31/12/2013,748.229068,814.064189,895.87401,1714.940377,133.919162,8.716901,2133.025672,65.996482,31.296503,49.513243,52.778937,33.146107,58.148148,2819.919676,907.653333,27.421399
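
The `Date` values in the preview above are day-first strings (`31/12/2009`). If a time-aware analysis is needed later, they could be parsed explicitly. This is a sketch on a small stand-in series built from the dates shown above, not a step the notebook itself performs.

```python
import pandas as pd

# Stand-in for df_firm["Date"], using the values visible in the preview
dates = pd.Series(["31/12/2009", "31/12/2010", "31/12/2011"])

# An explicit format avoids any day/month ambiguity
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed.dt.year.tolist())  # [2009, 2010, 2011]
```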


In [4]:
df_firm.shape

(24479, 18)

In [5]:
df_firm.columns

Index(['RIC', 'Date', 'Total Current Assets', 'Total Current Liabilities',
       'Total Debt', 'Total Assets, Reported', 'Net Income - Actual',
       'Revenue Per Share', 'Total Revenue', 'Total Equity',
       'Total CO2 Equivalent Emissions To Revenues USD in million',
       'ESG Score', 'Social Pillar Score', 'Governance Pillar Score',
       'Environmental Pillar Score', 'Company Market Capitalization',
       'Property Plant And Equipment, Total - Gross',
       'P/E (Daily Time Series Ratio)'],
      dtype='object')

In [6]:
# Let's rename some columns to shorter names
df_firm.rename(
    columns={
        "Total CO2 Equivalent Emissions To Revenues USD in million": "CO2 Emissions",
        "Property Plant And Equipment, Total - Gross": "PPE Total",
    },
    inplace=True,
)

In [7]:
df_firm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24479 entries, 0 to 24478
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   RIC                            24479 non-null  object 
 1   Date                           24479 non-null  object 
 2   Total Current Assets           22689 non-null  float64
 3   Total Current Liabilities      22681 non-null  float64
 4   Total Debt                     23628 non-null  float64
 5   Total Assets, Reported         23663 non-null  float64
 6   Net Income - Actual            21343 non-null  float64
 7   Revenue Per Share              23617 non-null  float64
 8   Total Revenue                  23718 non-null  float64
 9   Total Equity                   23640 non-null  float64
 10  CO2 Emissions                  14408 non-null  float64
 11  ESG Score                      17762 non-null  float64
 12  Social Pillar Score            17761 non-null 

#### Share of Null Values in the data

In [8]:
df_firm.isnull().sum() / len(df_firm) * 100

RIC                               0.000000
Date                              0.000000
Total Current Assets              7.312390
Total Current Liabilities         7.345071
Total Debt                        3.476449
Total Assets, Reported            3.333470
Net Income - Actual              12.810981
Revenue Per Share                 3.521386
Total Revenue                     3.108787
Total Equity                      3.427428
CO2 Emissions                    41.141386
ESG Score                        27.439846
Social Pillar Score              27.443932
Governance Pillar Score          27.439846
Environmental Pillar Score       27.443932
Company Market Capitalization     7.516647
PPE Total                         9.653172
P/E (Daily Time Series Ratio)    21.933086
dtype: float64

Let's remove rows where the ESG score is null, since rows without the target value cannot be used to train or evaluate a model.

In [9]:
df_firm = df_firm[df_firm["ESG Score"].notna()]
df_firm.shape

(17762, 18)

In [10]:
df_firm.isnull().sum() / len(df_firm) * 100

RIC                               0.000000
Date                              0.000000
Total Current Assets              4.081748
Total Current Liabilities         4.081748
Total Debt                        0.090080
Total Assets, Reported            0.095710
Net Income - Actual               3.704538
Revenue Per Share                 0.258980
Total Revenue                     0.090080
Total Equity                      0.095710
CO2 Emissions                    18.883009
ESG Score                         0.000000
Social Pillar Score               0.005630
Governance Pillar Score           0.000000
Environmental Pillar Score        0.005630
Company Market Capitalization     0.805090
PPE Total                         5.601847
P/E (Daily Time Series Ratio)    15.611981
dtype: float64
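
The before/after comparison above can be reproduced on a toy frame. The data below is made up for illustration; `isna().mean() * 100` is used as an equivalent shorthand for `isnull().sum() / len(df) * 100`.

```python
import numpy as np
import pandas as pd

# Made-up data: one row has no ESG score
toy = pd.DataFrame(
    {
        "RIC": ["A", "A", "B", "B"],
        "ESG Score": [50.0, np.nan, 60.0, 55.0],
        "CO2 Emissions": [np.nan, np.nan, 30.0, 28.0],
    }
)

# Percentage of missing values per column, largest first
null_share = toy.isna().mean().mul(100).sort_values(ascending=False)
print(null_share)

# Dropping rows without a target also changes the null share of the features
toy_with_target = toy[toy["ESG Score"].notna()].copy()
null_share_after = toy_with_target.isna().mean().mul(100)
print(null_share_after)
```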

#### Outliers 

Note: The percentile summary below shows some extreme values. The open question is whether these values are genuine outliers (for example, data errors) or valid real-world values.

In [None]:
df_firm.describe(
    [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99, 0.999]
)
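
One common way to flag candidate outliers for manual review is the interquartile-range (IQR) rule. This is a sketch of that general technique on made-up numbers, not the method used in this notebook; the helper name `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up values with one extreme entry
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
mask = iqr_outlier_mask(values)
print(values[mask].tolist())  # [500.0]
```

A flagged value is only a candidate. Whether it is an error or a legitimately large value still has to be judged against the source data.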

#### Visualising the distribution 

##### Correlation heatmaps

The heatmap shows a strong correlation (coefficient >= 0.8) between the following features:

1. Total Current Assets - Total Current Liabilities - Total Debt - Total Assets, Reported
2. Total Equity - Total Assets, Reported
In [None]:
columns = [
    "Total Current Assets",
    "Total Current Liabilities",
    "Total Debt",
    "Total Assets, Reported",
    "Net Income - Actual",
    "Total Revenue",
    "Total Equity",
    "ESG Score",
]

In [None]:
cormat = df_firm[columns].corr()
# Set the figure size before plotting; calling plt.figure() afterwards opens a new, empty figure
plt.figure(figsize=(20, 20))
sns.heatmap(cormat, annot=True)
plt.show()

The next heatmap shows that none of the following features are strongly correlated with each other.

In [None]:
columns = [
    "Revenue Per Share",
    "CO2 Emissions",
    "Company Market Capitalization",
    "PPE Total",
    "P/E (Daily Time Series Ratio)",
    "ESG Score",
]

In [None]:
cormat = df_firm[columns].corr()
# Set the figure size before plotting; calling plt.figure() afterwards opens a new, empty figure
plt.figure(figsize=(20, 20))
sns.heatmap(cormat, annot=True)
plt.show()

In [None]:
columns = [
    "Total Current Assets",
    "Total Current Liabilities",
    "Total Debt",
    "Total Assets, Reported",
    "Total Revenue",
    "Total Equity",
    "Net Income - Actual",
    "PPE Total",
    "Revenue Per Share",
    "CO2 Emissions",
    "Company Market Capitalization",
    "P/E (Daily Time Series Ratio)",
]

In [None]:
cormat = df_firm[columns].corr()
# Set the figure size before plotting; calling plt.figure() afterwards opens a new, empty figure
plt.figure(figsize=(30, 30))
sns.heatmap(cormat, annot=True, fmt=".1f")
plt.show()
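
The 0.8 threshold used when reading the heatmaps can also be applied programmatically, which avoids scanning a large annotated grid by eye. The helper `high_corr_pairs` and the toy columns `a`, `b`, `c` are hypothetical. On the real data it would be called as `high_corr_pairs(df_firm[columns])`.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Toy data: b is a linear function of a, c is only weakly related to a
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [3.0, 5.0, 7.0, 9.0, 11.0],
        "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)
result = high_corr_pairs(toy)
print(result)
```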

#### Data Imputation

In [12]:
columns_to_impute = [
    "Total Current Assets",
    "Total Current Liabilities",
    "Total Debt",
    "Total Assets, Reported",
    "Net Income - Actual",
    "Revenue Per Share",
    "Total Revenue",
    "Total Equity",
    "CO2 Emissions",
    "Company Market Capitalization",
    "PPE Total",
    "P/E (Daily Time Series Ratio)",
]

In [14]:
df_firm = df_firm.dropna(
    subset=columns_to_impute,
    how="all",
)

In [15]:
df_firm.shape

(17760, 18)

In [16]:
# Take an explicit copy first so the assignment below does not raise SettingWithCopyWarning
df_firm = df_firm.copy()
# Fill each gap with the mean of the same company (RIC) for that column
df_firm[columns_to_impute] = df_firm.groupby(["RIC"])[
    columns_to_impute
].transform(lambda x: x.fillna(x.mean()))


In [17]:
df_firm.shape

(17760, 18)
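
The group-wise imputation can be illustrated on a toy frame. The data is made up. It also shows why some gaps survive this step: a company with no observed value at all for a column has no group mean to fall back on, so a second, global fill is needed.

```python
import numpy as np
import pandas as pd

# Made-up data: company A has one gap, company B has no observed values at all
toy = pd.DataFrame(
    {
        "RIC": ["A", "A", "A", "B", "B"],
        "Total Debt": [10.0, np.nan, 30.0, np.nan, np.nan],
    }
)
cols = ["Total Debt"]

# Step 1: fill gaps with the mean of the same company
toy[cols] = toy.groupby("RIC")[cols].transform(lambda x: x.fillna(x.mean()))
after_group_fill = toy["Total Debt"].tolist()
print(after_group_fill)  # A is filled with 20.0, B is still NaN

# Step 2: fill whatever is left with the overall column mean
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy["Total Debt"].tolist())  # [10.0, 20.0, 30.0, 20.0, 20.0]
```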

In [21]:
df_firm[df_firm['PPE Total'].notna()].isna().sum()

RIC                                0
Date                               0
Total Current Assets             695
Total Current Liabilities        695
Total Debt                         0
Total Assets, Reported             0
Net Income - Actual              106
Revenue Per Share                  0
Total Revenue                      0
Total Equity                       0
CO2 Emissions                     10
ESG Score                          0
Social Pillar Score                1
Governance Pillar Score            0
Environmental Pillar Score         1
Company Market Capitalization      0
PPE Total                          0
P/E (Daily Time Series Ratio)    123
dtype: int64

In [23]:
# Fill the remaining gaps with column means; numeric_only=True skips the text columns (RIC, Date)
df_firm = df_firm.fillna(df_firm.mean(numeric_only=True))


In [24]:
df_firm[df_firm['PPE Total'].notna()].isna().sum()

RIC                              0
Date                             0
Total Current Assets             0
Total Current Liabilities        0
Total Debt                       0
Total Assets, Reported           0
Net Income - Actual              0
Revenue Per Share                0
Total Revenue                    0
Total Equity                     0
CO2 Emissions                    0
ESG Score                        0
Social Pillar Score              0
Governance Pillar Score          0
Environmental Pillar Score       0
Company Market Capitalization    0
PPE Total                        0
P/E (Daily Time Series Ratio)    0
dtype: int64

#### Simple Linear Regression

In [51]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [55]:
columns_selected = [
    "Total Equity",
    "CO2 Emissions",
    "Company Market Capitalization",
    "PPE Total",
    "P/E (Daily Time Series Ratio)",
]

In [56]:
X = df_firm[columns_selected]
y = df_firm['ESG Score']

In [57]:
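# Rescale the features to the [0, 1] range
scaler = MinMaxScaler()

# Fit the scaler on the selected feature columns and transform them.
# Note: fitting on the full data before the train/test split lets test-set ranges influence the scaling.
X = scaler.fit_transform(X)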

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [59]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model evaluation on the held-out test set
print("mean_squared_error : ", mean_squared_error(y_test, y_pred))
print("mean_absolute_error : ", mean_absolute_error(y_test, y_pred))

mean_squared_error :  328.77697789769456
mean_absolute_error :  14.942169409978684
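
In the cells above the scaler is fitted on all rows before the split. A possible alternative, sketched here on synthetic data, is to split first and wrap the scaler and the model in a scikit-learn pipeline so that scaling parameters are learned from the training rows only. The variable names and the synthetic data are assumptions for illustration. This is not the configuration that produced the scores above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the five selected features and the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The pipeline fits MinMaxScaler on the training rows only, then the regression
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

mae = mean_absolute_error(y_te, pipe.predict(X_te))
print("mean_absolute_error : ", mae)
```

Passing the same pipeline to `cross_val_score` refits the scaler inside each fold as well.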


In [60]:
# Create a shuffled 5-fold splitter and score the model with R^2 on the training set
folds = KFold(n_splits=5, shuffle=True, random_state=100)
scores = cross_val_score(model, X_train, y_train, scoring="r2", cv=folds)
scores

array([0.07562335, 0.08329892, 0.09090981, 0.07984194, 0.0562524 ])
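
To put these numbers in context, the fold scores and the test error reported above can be summarised as follows. The values are copied from the outputs in this notebook; only the summary statistics are new.

```python
import numpy as np

# Values copied from the cell outputs above
cv_r2 = np.array([0.07562335, 0.08329892, 0.09090981, 0.07984194, 0.0562524])
test_mse = 328.77697789769456

mean_r2 = cv_r2.mean()
std_r2 = cv_r2.std()
rmse = np.sqrt(test_mse)

print(f"mean R^2 across folds: {mean_r2:.3f} (std {std_r2:.3f})")
print(f"test RMSE: {rmse:.2f}")
```

A mean R^2 of roughly 0.08 suggests that this linear model explains only a small share of the variance in the ESG score. The RMSE of about 18 is in the same units as the score itself.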