# KLIMATA ILOILO DATA EXTRACTION AND PREPROCESSING STAGE

### To start KLIMATA Iloilo, we extract and preprocess the data from PROJECT CCHAIN. This Jupyter notebook contains the code that preprocesses both the air quality and the atmospheric data.

### For the population and amenity datasets, we used Microsoft Excel for the preprocessing.

### Importing essential libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing CSV file for air quality

In [2]:
air_df = pd.read_csv(r"C:/Users/Value Lines/Documents/climate_air_quality.csv")

air_df.head()

Unnamed: 0,uuid,adm4_pcode,date,freq,no2,co,so2,o3,pm10,pm25
0,CAIRQ000000,PH015518001,2003-01-02,D,3.66,0.1034,0.31,38.39,23.63,16.21
1,CAIRQ000001,PH015518001,2003-01-03,D,4.35,0.101,0.7,38.37,34.43,23.82
2,CAIRQ000002,PH015518001,2003-01-04,D,3.82,0.1016,0.34,37.72,27.76,19.16
3,CAIRQ000003,PH015518001,2003-01-05,D,3.07,0.1053,0.24,38.44,31.26,20.64
4,CAIRQ000004,PH015518001,2003-01-06,D,3.02,0.0953,0.22,39.89,24.7,16.68
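
Because the file is large (the shape check below reports about 6.4 million rows), some of the later cleanup can optionally be done at load time. This is a minimal sketch and not the loading step used in this notebook: it reads a two-row in-memory sample copied from the preview above, and it assumes `freq` can be skipped and `date` parsed while reading. The names `sample_csv` and `sample_df` are hypothetical.

```python
import io

import pandas as pd

# Two rows copied from the preview above, held in memory instead of on disk.
sample_csv = io.StringIO(
    "uuid,adm4_pcode,date,freq,no2,co,so2,o3,pm10,pm25\n"
    "CAIRQ000000,PH015518001,2003-01-02,D,3.66,0.1034,0.31,38.39,23.63,16.21\n"
    "CAIRQ000001,PH015518001,2003-01-03,D,4.35,0.101,0.7,38.37,34.43,23.82\n"
)

sample_df = pd.read_csv(
    sample_csv,
    usecols=lambda name: name != "freq",   # skip the column that is dropped later
    parse_dates=["date"],                  # parse dates while reading
    dtype={"adm4_pcode": "category"},      # repeated codes are cheaper as a category
)
print(sample_df.dtypes)
```

Whether this is worthwhile depends on how much memory is available; the cell-by-cell approach in this notebook produces the same columns.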


### Checking the number of rows and columns in the air quality dataframe

In [3]:
air_df.shape

(6421095, 10)

### Checking columns' datatype and null value counts

In [4]:
print(air_df.info())
print(air_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6421095 entries, 0 to 6421094
Data columns (total 10 columns):
 #   Column      Dtype  
---  ------      -----  
 0   uuid        object 
 1   adm4_pcode  object 
 2   date        object 
 3   freq        object 
 4   no2         float64
 5   co          float64
 6   so2         float64
 7   o3          float64
 8   pm10        float64
 9   pm25        float64
dtypes: float64(6), object(4)
memory usage: 489.9+ MB
None
uuid            0
adm4_pcode      0
date            0
freq            0
no2           879
co              0
so2             0
o3              0
pm10          879
pm25          879
dtype: int64
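
To put the null counts in perspective, the short calculation below expresses them as a share of all rows. It only reuses the figures printed above (879 missing values per affected column, 6,421,095 rows); the variable names are hypothetical.

```python
# Figures taken from the info() and isnull().sum() output above.
n_rows = 6421095
missing_counts = {"no2": 879, "pm10": 879, "pm25": 879}

# Percentage of rows with a missing value, per affected column.
missing_pct = {col: 100 * n / n_rows for col, n in missing_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct:.4f}% missing")
```

Each affected column is missing roughly 0.014% of its values, so any imputation applied to them changes only a very small part of the dataset.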


### Cleans up all column names for consistency

In [6]:
# Normalize headers: trim whitespace, lowercase, and replace spaces with underscores
air_df.columns = air_df.columns.str.strip().str.lower().str.replace(' ', '_')
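
As a small illustration of what this chain of string methods does, the sketch below applies it to a toy dataframe with deliberately messy headers. The header names are made up for the example and do not come from the PROJECT CCHAIN files.

```python
import pandas as pd

# Hypothetical messy headers: stray spaces, mixed case, an inner space.
demo_df = pd.DataFrame(columns=[" Heat Index ", "PM25", "adm4_pcode"])

# Same chain as in the cell above: trim, lowercase, spaces to underscores.
demo_df.columns = demo_df.columns.str.strip().str.lower().str.replace(" ", "_")

cleaned_names = list(demo_df.columns)
print(cleaned_names)
```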

### Replacing missing values with the median of each column

In [7]:
num_cols = air_df.select_dtypes(include=['float64', 'int64']).columns
air_df[num_cols] = air_df[num_cols].fillna(air_df[num_cols].median())
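
The same median fill can be seen on a toy dataframe, where the result is easy to check by hand. This is a minimal sketch with made-up values; only the column names `no2` and `pm10` are borrowed from the air quality data.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap in each column.
toy = pd.DataFrame({
    "no2": [3.0, np.nan, 5.0, 10.0],
    "pm10": [20.0, 30.0, np.nan, 40.0],
})

num_cols = toy.select_dtypes(include=["float64", "int64"]).columns
medians = toy[num_cols].median()          # median() skips NaN by default

filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(medians)  # each column gets its own median
print(filled)
```

Passing the Series of medians to `fillna` matches on column names, so each gap is filled with the median of its own column rather than a single global value.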


### Converting column 'date' into proper datetime format

In [8]:
if 'date' in air_df.columns:
    air_df['date'] = pd.to_datetime(air_df['date'], errors='coerce')
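
With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error, so it is worth counting them after the conversion. The sketch below shows this on a toy Series; the invalid entry is invented for the example.

```python
import pandas as pd

# Two valid dates and one hypothetical bad value.
raw_dates = pd.Series(["2003-01-02", "2003-01-03", "not a date"])

parsed = pd.to_datetime(raw_dates, errors="coerce")

# Unparseable entries are now NaT and show up as missing.
n_failed = int(parsed.isna().sum())
print(parsed)
print("rows that failed to parse:", n_failed)
```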

### Cleans up all text (categorical) columns in the air quality dataframe for consistency

In [9]:
cat_cols = air_df.select_dtypes(include=['object']).columns
for col in cat_cols:
    air_df[col] = air_df[col].str.strip().str.lower()
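
One side effect of this step is that identifier columns such as `uuid` and `adm4_pcode` are lowercased too. The toy example below, with made-up whitespace added to two codes from the preview, shows the effect; the columns are created with an explicit `object` dtype so that the selection behaves the same as in the cell above.

```python
import pandas as pd

# Codes from the preview, with hypothetical stray spaces added.
toy_df = pd.DataFrame({
    "uuid": pd.Series([" CAIRQ000000", "CAIRQ000001 "], dtype="object"),
    "adm4_pcode": pd.Series(["PH015518001", " PH015518001 "], dtype="object"),
    "no2": [3.66, 4.35],
})

cat_cols = toy_df.select_dtypes(include=["object"]).columns
for col in cat_cols:
    toy_df[col] = toy_df[col].str.strip().str.lower()

# Both rows now carry the same, lowercased code.
unique_codes = toy_df["adm4_pcode"].unique().tolist()
print(unique_codes)
```

Any dataset joined on `adm4_pcode` later needs the same lowercase form, otherwise the keys will not match.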

### Drops irrelevant column

In [10]:
air_df.drop(columns=['freq'], inplace=True)

### Viewing the preprocessed air quality dataframe

In [None]:
air_df.head(10)

### Exporting the air quality dataframe as a CSV file

In [None]:
air_df.to_csv("AIR_PROCESSED.csv", index=False)
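
A CSV file does not store dtypes, so the `date` column comes back as text unless it is parsed again on reload. The sketch below does a round trip through an in-memory buffer instead of a file; the one-row dataframe and the names `toy`, `buffer` and `restored` are hypothetical.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "adm4_pcode": ["ph015518001"],
    "date": pd.to_datetime(["2003-01-02"]),
    "no2": [3.66],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
buffer.seek(0)

# parse_dates restores the datetime dtype when reading back.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```

Because of `index=False`, the reloaded dataframe has no extra unnamed index column.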

### After preprocessing the air quality data from PROJECT CCHAIN, we proceeded to the next dataset, the atmospheric data, which includes the heat index and rainfall estimates.

### Importing CSV file for atmospheric data (heat index and rainfall estimates)

In [13]:
atmosphere_df = pd.read_csv(r'C:/Users/Value Lines/Documents/climate_atmosphere.csv')

### Checking the number of rows and columns in the atmospheric dataframe

In [14]:
atmosphere_df.shape

(6421095, 13)

### Viewing the atmospheric dataframe

In [None]:
atmosphere_df.head(10)

### Cleans up all column names for consistency

In [15]:
atmosphere_df.columns = atmosphere_df.columns.str.strip().str.lower().str.replace(' ', '_')

### Replacing missing values with the median of each column

In [16]:
num_cols = atmosphere_df.select_dtypes(include=['float64', 'int64']).columns
atmosphere_df[num_cols] = atmosphere_df[num_cols].fillna(atmosphere_df[num_cols].median())
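
The cell above uses one median per column across all locations. An alternative that this notebook does not use is to fill each gap with the median of its own `adm4_pcode` group, which may suit variables that differ a lot between areas. The following is a sketch of that alternative on made-up values; the codes `a` and `b` and the column `tave_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two hypothetical areas with different temperature levels and one gap each.
toy = pd.DataFrame({
    "adm4_pcode": ["a", "a", "a", "b", "b", "b"],
    "tave": [26.0, np.nan, 28.0, 30.0, 32.0, np.nan],
})

# Median of each area, broadcast back to the original row order.
group_median = toy.groupby("adm4_pcode")["tave"].transform("median")

toy["tave_filled"] = toy["tave"].fillna(group_median)
print(toy)
```

Here the gap in area `a` is filled with 27.0 and the gap in area `b` with 31.0, whereas the column-wide median would give 29.0 for both.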

### Checking columns' null value counts

In [17]:
print(atmosphere_df.isnull().sum())

uuid          0
adm4_pcode    0
date          0
freq          0
tave          0
tmin          0
tmax          0
heat_index    0
pr            0
wind_speed    0
rh            0
solar_rad     0
uv_rad        0
dtype: int64


### Converting column 'date' into proper datetime format

In [18]:
if 'date' in atmosphere_df.columns:
    atmosphere_df['date'] = pd.to_datetime(atmosphere_df['date'], errors='coerce')

### Cleans up all text (categorical) columns in the atmospheric dataframe for consistency

In [19]:
cat_cols = atmosphere_df.select_dtypes(include=['object']).columns
for col in cat_cols:
    atmosphere_df[col] = atmosphere_df[col].str.strip().str.lower()

### Drops irrelevant column

In [20]:
atmosphere_df.drop(columns=['freq'], inplace=True)

### Exporting the atmospheric dataframe as a CSV file

In [25]:
atmosphere_df.to_csv('ATMOSPHERE.csv', index=False)