# ESG Score Prediction

## Notebook Outline :

1. Introduction (ESG Score, Calculation method, Factors - Summary based on the TR pdf)
2. Data Explanation (Features - what they are)
3. Data Processing - Outlier Detection, Feature Transformation 
4. EDA - Basic Insights
5. Feature Selection/Importance
6. Data Modelling

# Introduction

### ESG Score

An environmental, social and governance (ESG) score measures key parameters of a company to evaluate its sustainability. It is available both as a percentage and as a letter grade (D- to A+). Thomson Reuters ESG scores were designed to measure a company's ESG performance based on several themes such as emissions, environmental product innovation, human rights, shareholders, etc. They use publicly available data from around 6000 public companies and around 400 ESG metrics.

### Factors 

The ESG score is based on a number of different factors.
<br> • Environmental factors include resource use, emissions, innovations, etc. 
<br> • Social factors include workforce, human rights, community etc. 
<br> • Governance factors include management, shareholders, CSR strategy etc. 

### Calculation Method

The calculation of the overall ESG score is based on two kinds of ESG scores.


Thomson Reuters ESG score - From publicly reported company data, 400 ESG measures are calculated. Of these, 178 data points are selected for the scoring process and grouped into 10 categories. 

Thomson Reuters ESG Controversy score - The controversy category score is based on a list of 23 controversy topics. It measures the company's ESG performance relative to negative stories captured from global media. 

The ESG Score Calculation Methodology:

Percentile Rank Method is used to calculate the scores. It is based on three factors:
<br> • How many companies have the same value?
<br> • How many companies have a value at all?
<br> • How many companies are worse than the current one?



$$ \text{Score} = \frac{\text{No. of companies with a worse value} + \frac{\text{No. of companies with the same value, including the current one}}{2}}{\text{No. of companies with a value}} $$
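
As a quick sanity check of the formula above, here is a minimal sketch in plain Python. The function name `percentile_rank_scores` and the toy values are made up for illustration, and the sketch assumes that a higher raw value is better (so "worse" means "lower"); for metrics where lower is better, the comparison would need to be flipped.

```python
def percentile_rank_scores(values):
    """Percentile rank score per company; companies without a value (None) get None.

    Assumption for this sketch: a higher raw value is better.
    """
    valid = [v for v in values if v is not None]   # companies with a value
    n = len(valid)
    scores = []
    for v in values:
        if v is None:
            scores.append(None)
            continue
        worse = sum(1 for other in valid if other < v)   # companies with a worse value
        same = sum(1 for other in valid if other == v)   # same value, including this company
        scores.append((worse + same / 2) / n)
    return scores


# Toy data: five companies, one of which reports no value
example = [10.0, 20.0, 20.0, None, 40.0]
scores = percentile_rank_scores(example)
print(scores)  # [0.125, 0.5, 0.5, None, 0.875]
```

The two tied companies share the same score, and the company without a value is excluded from the denominator.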



TODO :
1. Figure out Return and MC data values 
2. P/E (Daily Time Series Ratio) - what does this column mean? - Karthik
3. Read about KNN Imputation 
4. Null values and outlier 

'Total Current Assets', 'Total Current Liabilities', 'Total Debt', 'Total Assets, Reported' - Dev

'P/E (Daily Time Series Ratio)','CO2 Emissions','Total Revenue', 'Total Equity' - Karthik

 'Net Income - Actual','Revenue Per Share','Company Market Capitalization', 'PPE Total', - Sush

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df_firm = pd.read_csv("Firm Data.csv")

In [None]:
df_firm.head(20)

In [None]:
df_firm.shape

In [None]:
df_firm.columns

In [3]:
# Let's rename some columns to shorter names
df_firm.rename(columns={"Total CO2 Equivalent Emissions To Revenues USD in million": "CO2 Emissions", "Property Plant And Equipment, Total - Gross": "PPE Total"},inplace=True)

In [None]:
df_ID = pd.read_csv("ID_Data.csv")
df_ID.head()

In [None]:
df_return = pd.read_csv("Return_Data.csv")
df_return.head(20)

### Data Exploration

In [None]:
df_firm.info()

#### Share of Null Values in the data

In [None]:
# count = df_firm.isnull().sum()/len(df_firm) *100
# display(count)
import plotly.express as px
# Number of null values per column
null_df = df_firm.isnull().sum().to_frame(name='count')
print(null_df)

plt.plot(null_df.index, null_df['count'])
plt.xticks(null_df.index, null_df.index, rotation=45, horizontalalignment='right')
plt.margins(0.1)
plt.show()

figure = px.line(x = null_df.index, y = null_df['count'])
figure.update_layout(xaxis_title = '')


figure.show()

### Data imputation

In [None]:
# Data imputation
# Drop rows where all of the key columns are NaN for a stock
df_firm = df_firm.dropna(subset=['Total Current Assets', 'Total Current Liabilities', 'Total Debt', 'Total Assets, Reported', 'Net Income - Actual', 'Environmental Pillar Score','CO2 Emissions', 'P/E (Daily Time Series Ratio)'], how='all').copy()

# Fill missing values with the mean of the same company (grouped by RIC)
impute_cols = ['Total Current Assets', 'Total Current Liabilities', 'Total Debt',
               'Total Assets, Reported', 'Net Income - Actual', 'Revenue Per Share',
               'Total Revenue', 'Total Equity', 'CO2 Emissions', 'ESG Score',
               'Social Pillar Score', 'Governance Pillar Score', 'Environmental Pillar Score',
               'Company Market Capitalization', 'PPE Total', 'P/E (Daily Time Series Ratio)']
df_firm[impute_cols] = df_firm.groupby('RIC')[impute_cols].transform(lambda s: s.fillna(s.mean()))

# Handle the remaining NaN values (companies with no data at all for a column)
# by falling back to the overall column mean
for column in df_firm.columns[2:]:
    df_firm[column] = df_firm[column].fillna(df_firm[column].mean())

# Remaining share of null values per column, in percent
df_firm.isnull().sum()/len(df_firm) *100
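
The two-stage imputation in the cell above (per-company mean first, overall column mean as a fallback) can be checked on a small toy frame. This is only a sketch: the `toy` DataFrame and its values are made up and are not part of the firm data.

```python
import numpy as np
import pandas as pd

# Toy data: company A has one gap, company B has no observations at all
toy = pd.DataFrame({
    "RIC": ["A", "A", "A", "B", "B"],
    "Total Debt": [100.0, np.nan, 300.0, np.nan, np.nan],
})

# Stage 1: fill gaps with the mean of the same company (RIC)
toy["Total Debt"] = toy.groupby("RIC")["Total Debt"].transform(lambda s: s.fillna(s.mean()))

# Stage 2: companies with no data at all fall back to the overall column mean
toy["Total Debt"] = toy["Total Debt"].fillna(toy["Total Debt"].mean())

filled = toy["Total Debt"].tolist()
print(filled)  # [100.0, 200.0, 300.0, 200.0, 200.0]
```

Note that the fallback mean in stage 2 is computed after stage 1, so values already imputed per company contribute to it.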

#### Outliers 

In [None]:
df_firm.describe([0.1,0.2,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.97,0.99])

In [None]:
import plotly.express as px

fig = plt.figure(figsize=(4,3))
sns.boxplot(x=df_firm['Total Current Assets'])
plt.xlabel('Total Current Assets', fontsize=14)
plt.show()

box1 = px.box(df_firm, y="Total Current Liabilities", width= 500, height= 500)
box1.show()

box2 = px.box(df_firm, y="Total Debt", width= 500, height= 500)
box2.show()

In [None]:
df_firm['Total Current Assets'].quantile(0.8)

In [None]:
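# Assign the result back; calling clip(inplace=True) on a selected column may not update df_firm
df_firm['Total Current Assets'] = df_firm['Total Current Assets'].clip(upper=5000)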

In [4]:
columns = ['Total Current Assets', 'Total Current Liabilities',
       'Total Debt', 'Total Assets, Reported', 'Net Income - Actual',
       'Revenue Per Share', 'Total Revenue', 'Total Equity', 'CO2 Emissions',
       #'ESG Score', #'Social Pillar Score', 'Governance Pillar Score','Environmental Pillar Score',
        'Company Market Capitalization',
       'PPE Total', 'P/E (Daily Time Series Ratio)']

for i in columns :
    # Cap each feature at its 80th percentile to limit the effect of outliers
    df_firm[i] = df_firm[i].clip(upper=df_firm[i].quantile(0.8))

#### Visualising the distribution 

In [None]:
columns = ['Total Current Assets', 'Total Current Liabilities',
       'Total Debt', 'Total Assets, Reported', 'Net Income - Actual',
       'Revenue Per Share', 'Total Revenue', 'Total Equity', 'CO2 Emissions',
       #'ESG Score', #'Social Pillar Score', 'Governance Pillar Score','Environmental Pillar Score',
        'Company Market Capitalization',
       'PPE Total', 'P/E (Daily Time Series Ratio)']

### Boxplots

In [None]:
box3 = px.box(df_firm, y="Total Current Liabilities", width= 500, height= 500)
box3.show()

box4 = px.box(df_firm, y="Total Debt", width= 500, height= 500)
box4.show()



### Visualization of data

### Distplot

In [None]:
import plotly.figure_factory as ff

# x1 = df_firm['Total Current Assets']
# x2 = df_firm['Total Current Liabilities']
# x3 = df_firm['Total Debt']
# x4 = df_firm['Total Assets, Reported']
# x5 = df_firm['Net Income - Actual']
# x6 = df_firm['Revenue Per Share']
# x7 = df_firm['Total Revenue']
# x8 = df_firm['Total Equity']
# x9 = df_firm['CO2 Emissions']
# x13 = df_firm['Company Market Capitalization'] 
# x14 = df_firm['PPE Total']
# x15 = df_firm['P/E (Daily Time Series Ratio)']
new_df = df_firm[['Total Current Assets', 'Total Current Liabilities',
       'Total Debt', 'Total Assets, Reported', 'Net Income - Actual',
       'Revenue Per Share', 'Total Revenue', 'Total Equity', 'CO2 Emissions','Company Market Capitalization',
       'PPE Total', 'P/E (Daily Time Series Ratio)']]
# x = [x1, x2, x3, x4, x5, x6, x7, x8, x9, x13, x14, x15]
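print(new_df)
# create_distplot expects a list of 1-D samples and one label per sample
hist_data = [new_df[col].tolist() for col in new_df.columns]
fig = ff.create_distplot(hist_data, group_labels=list(new_df.columns))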
fig.show()

### Scatterplot of each feature

In [None]:
import plotly.express as px

# One scatter plot per feature against the ESG score, each with an OLS trendline
# (trendline='ols' requires the statsmodels package)
for col in columns:
    fig = px.scatter(df_firm, x=col, y="ESG Score", color='ESG Score', trendline='ols')
    fig.show()

In [None]:
df_firm[columns].hist( figsize=(15,15))

plt.show()

In [None]:
cormat = df_firm[columns].corr()
# Create the figure before drawing, otherwise the size is applied to a new empty figure
plt.figure(figsize=(20,20))
sns.heatmap(cormat)
plt.show()