<a href="https://colab.research.google.com/github/yastika/MscFE_Capstone/blob/dev_M4_submission/Climate_FinTech_Solution_Internalizing_Environmental_Risks_into_Financial_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Section 1: Introduction**

# **Climate-Adjusted Credit Scoring Prototype**
This notebook demonstrates how to internalize environmental risks (like floods and heatwaves) into credit scoring for SMEs using Gradient Boosting Machine (GBM) and a prototype Streamlit dashboard.


## **Step 1: Data Collection and Integration**
We merge SME-level financial data and regional climate event history based on region and year.


1.   Collecting the Air Quality Dataset from Kaggle
2.   Using the locations to get the Geo Coordinates
3.   Using the Geo Coordinates to fetch the satellite images over the years.








In [None]:
!pip install googletrans==4.0.0-rc1 googlesearch-python
!pip install geopy
! pip install -q kaggle

Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading idna-2.10-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting rfc3986<2,>=1.3 (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting httpcore==0.9.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading httpcore-0.9.

In [None]:
import numpy as np
import pandas as pd
import requests
from google.colab import userdata
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
!kaggle datasets download -d shrutibhargava94/india-air-quality-data

! unzip "india-air-quality-data.zip"

Dataset URL: https://www.kaggle.com/datasets/shrutibhargava94/india-air-quality-data
License(s): other
Archive:  india-air-quality-data.zip
  inflating: data.csv                


In [None]:
#Get the air pollution data
dataFrame_air = pd.read_csv("/content/data.csv",encoding='latin-1')

#Understanding the Dataset

print(f'Air Quality Dataframe shape : \n {dataFrame_air.shape}')
print('-----------------------------------------------------------------------')
print(f'Air Quality Dataframe columns : ]n {dataFrame_air.columns}')
print('-----------------------------------------------------------------------')
print(f'Air Quality Dataframe info : \n {dataFrame_air.info()}')
print('-----------------------------------------------------------------------')
print(f'Air Quality Dataframe describe : \n {dataFrame_air.describe()}')
print('-----------------------------------------------------------------------')
print(f'Air Quality Dataframe Column Datatypes : \n {dataFrame_air.dtypes}')
print('-----------------------------------------------------------------------')

#Identifying the columns with null value and the count
for i in dataFrame_air.columns:
  print(f'Column Name {i} and Null Values {dataFrame_air[i].isnull().sum()}')

  dataFrame_air = pd.read_csv("/content/data.csv",encoding='latin-1')


Air Quality Dataframe shape : 
 (435742, 13)
-----------------------------------------------------------------------
Air Quality Dataframe columns : ]n Index(['stn_code', 'sampling_date', 'state', 'location', 'agency', 'type',
       'so2', 'no2', 'rspm', 'spm', 'location_monitoring_station', 'pm2_5',
       'date'],
      dtype='object')
-----------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435742 entries, 0 to 435741
Data columns (total 13 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   stn_code                     291665 non-null  object 
 1   sampling_date                435739 non-null  object 
 2   state                        435742 non-null  object 
 3   location                     435739 non-null  object 
 4   agency                       286261 non-null  object 
 5   type                         430349 non-null  object 
 6  

In [None]:
dataFrame_air.head()

Unnamed: 0,stn_code,sampling_date,state,location,agency,type,so2,no2,rspm,spm,location_monitoring_station,pm2_5,date
0,150.0,February - M021990,Andhra Pradesh,Hyderabad,,"Residential, Rural and other Areas",4.8,17.4,,,,,1990-02-01
1,151.0,February - M021990,Andhra Pradesh,Hyderabad,,Industrial Area,3.1,7.0,,,,,1990-02-01
2,152.0,February - M021990,Andhra Pradesh,Hyderabad,,"Residential, Rural and other Areas",6.2,28.5,,,,,1990-02-01
3,150.0,March - M031990,Andhra Pradesh,Hyderabad,,"Residential, Rural and other Areas",6.3,14.7,,,,,1990-03-01
4,151.0,March - M031990,Andhra Pradesh,Hyderabad,,Industrial Area,4.7,7.5,,,,,1990-03-01


In [None]:
dataFrame_air.tail()

Unnamed: 0,stn_code,sampling_date,state,location,agency,type,so2,no2,rspm,spm,location_monitoring_station,pm2_5,date
435737,SAMP,24-12-15,West Bengal,ULUBERIA,West Bengal State Pollution Control Board,RIRUO,22.0,50.0,143.0,,"Inside Rampal Industries,ULUBERIA",,2015-12-24
435738,SAMP,29-12-15,West Bengal,ULUBERIA,West Bengal State Pollution Control Board,RIRUO,20.0,46.0,171.0,,"Inside Rampal Industries,ULUBERIA",,2015-12-29
435739,,,andaman-and-nicobar-islands,,,,,,,,,,
435740,,,Lakshadweep,,,,,,,,,,
435741,,,Tripura,,,,,,,,,,


In [None]:
#Data cleaning
dataFrame_air['date'] = pd.to_datetime(dataFrame_air['date'], errors='ignore')

cutoff_date = '2000-01-01'
dataFrame_air = dataFrame_air[dataFrame_air['date'] >= cutoff_date]
dataFrame_air = dataFrame_air.set_index('date')

dataFrame_air = dataFrame_air.drop(['stn_code', 'agency', 'type', 'sampling_date','location_monitoring_station','pm2_5'], axis=1)


for i in dataFrame_air.columns:
  if dataFrame_air[i].dtypes == 'object':
    dataFrame_air[i] = dataFrame_air[i].astype('category')
  else:
    dataFrame_air[i].fillna(0, inplace=True)


In [None]:
import difflib
import googletrans
import re
from googlesearch import search

# To get correct location name
def correct_spelling(location_name, search_results):
  #translator = googletrans.Translator()
  #corrected_name = translator.translate(location_name, src='en', dest='en').text
  #return corrected_name
  list_segments = []
  for result in search_results:
    Segments = result.rpartition('/')
    if '-' in Segments[2] or '_' in Segments[2]:
      #print(Segments[2])
      items = re.split(r"[-_]", Segments[2])
      list_segments.extend(items)
    else:
      list_segments.append(Segments[2])

  #print(list_segments)
  corrected_name = difflib.get_close_matches(location_name, list_segments, n=1, cutoff=0.5)
  if corrected_name:
    return corrected_name[0]
  else:
    return location_name

def get_correct_location(location_name):
  #corrected_name = correct_spelling(location_name)
  search_results = search(location_name, num_results=5)
  corrected_name = correct_spelling(location_name,search_results)
  return corrected_name



#location_name = "Tilamol"
#search_results = get_correct_location(location_name)
#print(search_results)

#for result in search_results:
  #Segments = result.rpartition('/')
  #print(Segments[2])

['', '12035', 'watch?v=ygKEWbGJk48', 'tilamola', 'goa', 'overview', 'Plfnr081f9pplz1y', '']
tilamola


In [54]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
import time

# Fetch geo coordinates
def get_geo_coordinates(dataFrame_geo_coordinates):
  geolocator = Nominatim(user_agent="my_geocoder") # Replace 'my_geocoder' with a descriptive name for your application

  for i in dataFrame_geo_coordinates['location']:
      try:
        location = geolocator.geocode(i)
        dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'latitude'] = location.latitude
        dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'longitude'] = location.longitude
        time.sleep(1)  # Pause for 1 second between requests to avoid rate limiting
      except GeocoderTimedOut:
        print(f"Geocoding timed out for {i}. Retrying...")
        time.sleep(5) # wait longer and retry
        location = geolocator.geocode(i)
        dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'latitude'] = location.latitude
        dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'longitude'] = location.longitude
      except Exception as e:
        print(f"Error geocoding {i}: {e}")
        print("Redirecting to automated geocoding...")
        search_results = get_correct_location(i)
        if (i == search_results):
          dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'latitude'] = None
          dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'longitude'] = None
        else:
          location = geolocator.geocode(search_results)
          dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'latitude'] = location.latitude
          dataFrame_geo_coordinates.loc[dataFrame_geo_coordinates['location'] == i, 'longitude'] = location.longitude
        time.sleep(1)  # Pause for 1 second between requests to avoid rate limiting


In [None]:
dataFrame_geo_coordinates = pd.DataFrame(dataFrame_air['location'].unique(), columns=['location'])
get_geo_coordinates(dataFrame_geo_coordinates)


Error geocoding Vishakhapatnam: 'NoneType' object has no attribute 'latitude'
Redirecting to automated geocoding
['search?num=7', 'licensed', 'image?q=tbn:ANd9GcQvW', '1qyBk', '69uLUA2bWjnNO1Kjd4pqAwDXsW40Y83', 'MuzG', 'qygh', 'ptujgoaFxouLMh', 'Visakhapatnam', '', 'search?num=7']




Error geocoding Tilamol: 'NoneType' object has no attribute 'latitude'
Redirecting to automated geocoding
['', '12035', 'watch?v=ygKEWbGJk48', 'tilamola', 'goa', 'overview', 'Plfnr081f9pplz1y', '']
Error geocoding Anklesvar: 'NoneType' object has no attribute 'latitude'
Redirecting to automated geocoding
['search?num=7', 'history', 'of', '', 'ankleshwar', 'Anklesvar', 'INA', 'Ankleshwar', '802608.html']
Error geocoding Trivendrum: 'NoneType' object has no attribute 'latitude'
Redirecting to automated geocoding
['search?num=7', 'Thiruvananthapuram', 'licensed', 'image?q=tbn:ANd9GcQfSWhZ2w9MkswPVS2on7SJqUVPpVXJrlwCIZuaHXRH8TFbCJpfQAWwrU5bXPtm5a0m', 'search?num=7', 'ksrtc', 'swift', 'bus', 'catches', 'fire', 'in', 'thiruvananthapuram', 'nuqtcpgq']
Error geocoding Kotttayam: 'NoneType' object has no attribute 'latitude'
Redirecting to automated geocoding
['search?num=7', 'Kottayam', 'licensed', 'image?q=tbn:ANd9GcRMz4IvIhyev5IJJxXQH7DPo55sqBiV665kzT', 'fVbReDOOGnm', '9QAW9uZBNoVHgarpy', 's

In [62]:
dataFrame_geo_coordinates.head()
dataFrame_geo_coordinates[dataFrame_geo_coordinates['location']=='Tilamol']

Unnamed: 0,location,latitude,longitude
74,Tilamol,15.220556,74.086642


In [None]:
#Sentiment related to air pollution

In [None]:
#Sector-wise impact analysis

**Section 3: Feature Engineering**

## **Step 2: Feature Engineering**
We construct normalized environmental scores and select relevant financial features.


In [None]:
baseline_temp = 30  # Hypothetical baseline

# Normalize climate risk features
combined_df["flood_risk"] = combined_df["flood_frequency"] / combined_df["flood_frequency"].max()
combined_df["heat_index"] = (combined_df["avg_temp"] - baseline_temp) / 10

# Select modeling features
features = combined_df[["flood_risk", "heat_index", "credit_score", "debt_ratio"]]
target = combined_df["defaulted"]


**Section 4: Model Development**

## **Step 3: Model Training (Gradient Boosting Classifier)**
Train a GBM model using SME and climate features to predict default probabilities.


In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, mean_squared_error

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train model
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

# Predictions
predictions = model.predict_proba(X_test)[:, 1]

# Evaluation metrics
auc = roc_auc_score(y_test, predictions)
rmse = mean_squared_error(y_test, predictions, squared=False)

print(f"AUC: {auc:.2f}")
print(f"RMSE: {rmse:.2f}")

**Section 5: Climate Risk Scoring Function**

## **Step 4: Climate Credit Risk Score Labeling**
Convert numeric probability scores into categorical labels: Low, Medium, High Risk.


In [None]:
def climate_credit_risk_score(probability):
    if probability > 0.7:
        return "High Risk"
    elif probability > 0.4:
        return "Medium Risk"
    else:
        return "Low Risk"

# Example usage
example_prob = 0.65
print(f"Risk Category: {climate_credit_risk_score(example_prob)}")

## **Step 5: Prototype Dashboard**

**Interactive Credit Scoring Dashboard with Ipywidgets**

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown

# Define the interactive widgets
flood_slider = widgets.FloatSlider(value=0.3, min=0.0, max=1.0, step=0.01, description='Flood Risk')
heat_slider = widgets.FloatSlider(value=0.2, min=0.0, max=1.0, step=0.01, description='Heat Index')
credit_slider = widgets.IntSlider(value=600, min=300, max=850, step=10, description='Credit Score')
debt_slider = widgets.FloatSlider(value=0.5, min=0.0, max=1.0, step=0.01, description='Debt Ratio')

# Define function to run model and display result
def update_risk(flood, heat, credit_score, debt_ratio):
    input_vector = [[flood, heat, credit_score, debt_ratio]]
    prob = model.predict_proba(input_vector)[0][1]
    risk_label = climate_credit_risk_score(prob)

    display(Markdown(f"### 📊 Predicted Risk Probability: `{prob:.2f}`"))
    display(Markdown(f"### 🛡️ Risk Category: `{risk_label}`"))

# Display interactive widget panel
widgets.interact(update_risk,
                 flood=flood_slider,
                 heat=heat_slider,
                 credit_score=credit_slider,
                 debt_ratio=debt_slider)