# Temperature Forecast Project using ML
**Project Description**
This dataset supports bias correction of the next-day maximum and minimum air temperature forecasts of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. It covers the summers of 2013 to 2017. The inputs consist largely of the LDAPS model's next-day forecasts, in-situ maximum and minimum temperatures for the present day, and auxiliary geographic variables. There are two outputs: the next-day maximum and minimum air temperatures. Hindcast validation was conducted for the period from 2015 to 2017.

**Attribute Information:**
For more information, read [Cho et al, 2020].
1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8

You have to build separate models that can predict the minimum temperature for the next day and the maximum temperature for the next day based on the details provided in the dataset.
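
Since the goal is bias correction, a useful reference point is how far the raw LDAPS forecast already is from the observed next-day temperature; a model is only worthwhile if it beats that. The following is a minimal sketch of such a baseline. The helper name `raw_forecast_error` and the toy values are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def raw_forecast_error(df, forecast_col, observed_col):
    """Return mean bias and RMSE of a raw forecast column against observations."""
    # Ignore rows where either value is missing.
    pair = df[[forecast_col, observed_col]].dropna()
    error = pair[forecast_col] - pair[observed_col]
    return {
        "bias": float(error.mean()),
        "rmse": float(np.sqrt((error ** 2).mean())),
    }


# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "LDAPS_Tmax_lapse": [28.0, 30.0, 31.0, np.nan],
    "Next_Tmax": [29.0, 31.0, 32.0, 30.0],
})
baseline = raw_forecast_error(toy, "LDAPS_Tmax_lapse", "Next_Tmax")
print(baseline)
```

On the real data the same call would be made with `temp_df`, once for `LDAPS_Tmax_lapse` against `Next_Tmax` and once for `LDAPS_Tmin_lapse` against `Next_Tmin`.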




**Dataset Link:**
• https://github.com/FlipRoboTechnologies/ML_-Datasets/blob/main/Temperature%20Forecast/temperature.csv


In [1]:
!pip install rasterio
!pip install folium


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Date and time handling
import datetime as dt
from datetime import datetime

import folium
import rasterio as rio
from folium import plugins
from folium.plugins import HeatMap

from sklearn.preprocessing import LabelEncoder #OneHotEncoder

#Standardize the feature
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

#Classification Models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier

#regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

In [3]:
#Get dataset
temp_url = 'https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Temperature%20Forecast/temperature.csv'
temp_df = pd.read_csv(temp_url)
temp_df.head()

HTTPError: HTTP Error 404: Not Found
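
The 404 above means the remote file could not be fetched, so every later cell fails with a `NameError`. One way to make the notebook robust is to fall back to a local copy of the CSV downloaded manually from the dataset link. The sketch below assumes a hypothetical helper `load_temperature_data` and a hypothetical local file name `temperature.csv`; neither appears in the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_temperature_data(url, local_path="temperature.csv"):
    """Try the remote CSV first; fall back to a local copy if that fails."""
    try:
        return pd.read_csv(url)
    except OSError as err:  # urllib's HTTPError and URLError are OSError subclasses
        local = Path(local_path)
        if not local.exists():
            raise FileNotFoundError(
                f"Could not read {url!r} ({err}) and no local copy at {local}"
            ) from err
        return pd.read_csv(local)
```

With this in place, the loading cell would become `temp_df = load_temperature_data(temp_url)`.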

In [4]:
temp_df.shape
temp_df.info()

NameError: name 'temp_df' is not defined

In [5]:
# Parse the Date column and extract day, month and year as separate columns
temp_df['Date'] = pd.to_datetime(temp_df['Date'])
temp_df['day'] = temp_df['Date'].dt.day
temp_df['month'] = temp_df['Date'].dt.month
temp_df['year'] = temp_df['Date'].dt.year
temp_df.head()


NameError: name 'temp_df' is not defined

In [None]:
temp_df.drop(['Date'],axis=1,inplace=True)

In [None]:
temp_df.isnull().sum()

In [None]:
temp_df.duplicated().sum()

In [None]:
temp_df.sample(5)

In [None]:
temp_df.describe()

In [None]:
# Fill the missing values with each column's mean
temp_df.fillna(temp_df.mean(),inplace=True)


In [None]:
temp_df.isnull().sum()

In [None]:
temp_df.shape

In [None]:
temp_df.describe()

In [None]:
#correlation matrix

corr_matrix = temp_df.corr()
corr_matrix

In [None]:
corr_matrix = temp_df.corr()

# Plot the heatmap
plt.figure(figsize=(20, 15))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
# plt.title is a function; assigning a string to it would overwrite it
plt.title('Correlation Matrix')
plt.show()

From the correlation matrix above:

**Next_Tmax** is positively correlated with Present_Tmax (0.61), Present_Tmin (0.47), LDAPS_Tmax_lapse (0.83), LDAPS_Tmin_lapse (0.59) and Next_Tmin (0.62), and negatively correlated with LDAPS_RHmin (-0.44).

**Next_Tmin** is positively correlated with Present_Tmax (0.62), Present_Tmin (0.8), LDAPS_Tmin_lapse (0.88) and LDAPS_Tmax_lapse (0.59).

**Solar radiation** is strongly negatively correlated with month (-0.84).

Based on these correlations, the candidate features for predicting **Next_Tmax** are Present_Tmax, Present_Tmin, LDAPS_RHmin, LDAPS_Tmax_lapse, LDAPS_Tmin_lapse and Next_Tmin.

For **Next_Tmin**, the candidate features are Present_Tmax, Present_Tmin, LDAPS_Tmin_lapse and LDAPS_Tmax_lapse.
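
Reading candidate features off a heatmap by eye can be automated. The sketch below lists the columns whose absolute Pearson correlation with a target meets a cutoff. The helper name `select_by_correlation`, the 0.4 cutoff and the synthetic data are assumptions made for illustration; the notebook itself does not define them.

```python
import numpy as np
import pandas as pd


def select_by_correlation(df, target, threshold=0.4, exclude=()):
    """List columns whose absolute correlation with `target` meets `threshold`."""
    corr = df.corr(numeric_only=True)[target]
    corr = corr.drop(labels=[target, *exclude], errors="ignore")
    selected = corr[corr.abs() >= threshold]
    # Strongest relationships first, regardless of sign.
    return selected.reindex(selected.abs().sort_values(ascending=False).index)


# Synthetic stand-in data: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "Present_Tmax": x,
    "LDAPS_RHmin": -x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
    "Next_Tmax": x + rng.normal(scale=0.5, size=200),
})
picked = select_by_correlation(toy, "Next_Tmax", threshold=0.4)
print(picked)
```

The `exclude` argument is there so that the other target (for example `Next_Tmin` when predicting `Next_Tmax`) can be left out of the candidate list if it would not be known at prediction time.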


In [None]:
temp_df.columns

In [None]:
# Plot every column as its own subplot
temp_df.plot(subplots=True,figsize=(25,20))
plt.show()

In [None]:
# Histogram of every column

temp_df.hist(figsize=(15,10))
plt.show()

In [None]:
#check the skewness
skew_df = temp_df.skew()
skew_df

Skewness by column, grouped with the common rule of thumb (above 1 in absolute value is high, 0.5 to 1 is moderate, below 0.5 is roughly symmetric):

**Highly positive skew:**
LDAPS_PPT4 (6.825464), LDAPS_PPT3 (6.457129), LDAPS_PPT2 (5.775355), LDAPS_PPT1 (5.393821), DEM (1.723257), LDAPS_WS (1.579236), Slope (1.563020)

**Moderately positive skew:**
LDAPS_LH (0.673757), LDAPS_CC4 (0.666482), LDAPS_CC3 (0.640735)

**Approximately symmetric:**
LDAPS_CC2 (0.472350), LDAPS_CC1 (0.459458), LDAPS_RHmin (0.300220), lat (0.087062), station (0.000000), year (0.000000), day (-0.008926), month (-0.195889), LDAPS_Tmax_lapse (-0.227880), Present_Tmax (-0.264137), lon (-0.285213), Next_Tmax (-0.340200), Present_Tmin (-0.367538), Next_Tmin (-0.404447)

**Moderately negative skew:**
Solar radiation (-0.511210), LDAPS_Tmin_lapse (-0.581763), LDAPS_RHmax (-0.855015)

In [None]:
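# Log-transform the positively skewed columns to reduce their skewness.
# np.log1p(x) computes log(1 + x), the same transform as np.log(x + 1).
# Note: LDAPS_LH can fall below -1 (its listed range starts at -13.6), where
# log1p returns NaN; those NaNs are filled by the KNN imputation step further down.

skewed_columns = ['LDAPS_WS', 'LDAPS_LH', 'LDAPS_CC1', 'LDAPS_CC2',
                  'LDAPS_CC3', 'LDAPS_CC4', 'LDAPS_PPT1',
                  'LDAPS_PPT2', 'LDAPS_PPT3', 'LDAPS_PPT4', 'DEM', 'Slope']
for col in skewed_columns:
    temp_df[col] = np.log1p(temp_df[col])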

In [None]:
# Skewness after the log transformation

skew_df = temp_df.skew()
skew_df

The log transformation has reduced the skewness of several features, but some remain highly skewed.

Next, we try another method, such as Box-Cox or a square-root transformation.
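
Both `np.log(x + 1)` and Box-Cox need strictly positive inputs, which is a problem for a column such as LDAPS_LH whose listed range starts at -13.6. One alternative worth considering is the Yeo-Johnson transform, which accepts zero and negative values. The sketch below is illustrative only: it uses scikit-learn's `PowerTransformer` on a synthetic column named `flux`, which is an assumption and not a column of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed column that also contains negative values.
toy = pd.DataFrame({"flux": rng.exponential(scale=40.0, size=1000) - 10.0})

# Yeo-Johnson handles zero and negative inputs.
# standardize=False keeps the output a pure power transform without rescaling.
pt = PowerTransformer(method="yeo-johnson", standardize=False)
toy["flux_yj"] = pt.fit_transform(toy[["flux"]]).ravel()

skew_before = toy["flux"].skew()
skew_after = toy["flux_yj"].skew()
print(skew_before, skew_after)
print("NaNs after transform:", int(toy["flux_yj"].isna().sum()))
```

As with any fitted preprocessing step, the transformer would ideally be fit on the training rows only and then applied to the test rows.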



In [None]:
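# Apply the Box-Cox transformation to the precipitation columns
from scipy.stats import boxcox

boxcox_columns = ['LDAPS_PPT1', 'LDAPS_PPT2', 'LDAPS_PPT3', 'LDAPS_PPT4']
for col in boxcox_columns:
    # Box-Cox requires strictly positive input, hence the +1 shift
    temp_df[col], _ = boxcox(temp_df[col] + 1)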


In [None]:
skew_df = temp_df.skew()
skew_df

In [None]:
tem_compar = temp_df.copy()

In [None]:
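# The index is a plain integer range, so filter on the 'year' column
# instead of slicing the index with year labels
compare_tem = tem_compar[tem_compar['year'].between(2013, 2017)]
compare_tem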

In [None]:
#scatter plot
sns.scatterplot(x='Present_Tmax',y='Next_Tmax',data=compare_tem)
plt.show()

In [None]:
# Separate the features from the two targets
features  = temp_df.drop(['Next_Tmax','Next_Tmin'],axis=1)
target_max = temp_df['Next_Tmax']
target_min = temp_df['Next_Tmin']



In [None]:
missing_val = features.isnull().sum()
missing_val

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
features = imputer.fit_transform(features)



In [None]:
#Splitting the data

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_log_error


X_train_max,X_test_max,y_train_max,y_test_max = train_test_split(features,target_max,test_size=0.2,random_state=42)
X_train_min,X_test_min,y_train_min,y_test_min = train_test_split(features,target_min,test_size=0.2,random_state=42)
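
A random split mixes days from the same summer into both the training and the test set. Since the project description mentions hindcast validation over whole years, a year-based holdout may give a more realistic estimate. The sketch below shows one way to do that; the helper `split_by_year`, the choice of 2017 as the held-out year and the toy frame are all assumptions for illustration. It would have to run on the DataFrame before the features are converted to a NumPy array.

```python
import pandas as pd


def split_by_year(df, target, test_years, drop=()):
    """Hold out whole years so the test set comes from seasons not seen in training."""
    is_test = df["year"].isin(test_years)
    # Remove the target and any other columns that must not be used as features.
    X = df.drop(columns=[target, *drop])
    y = df[target]
    return X[~is_test], X[is_test], y[~is_test], y[is_test]


# Toy frame for illustration; the notebook would pass temp_df instead.
toy = pd.DataFrame({
    "year": [2013, 2014, 2015, 2016, 2017, 2017],
    "Present_Tmax": [29.0, 30.5, 31.2, 28.7, 33.1, 32.0],
    "Next_Tmax": [30.1, 29.8, 32.0, 29.5, 33.4, 31.7],
    "Next_Tmin": [21.0, 22.3, 23.1, 20.9, 24.0, 23.5],
})
X_train, X_test, y_train, y_test = split_by_year(
    toy, "Next_Tmax", test_years=[2017], drop=["Next_Tmin"]
)
print(len(X_train), len(X_test))
```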

In [None]:
# Scale the features

from sklearn.preprocessing import StandardScaler

# Use one scaler per target so each is fit only on its own training split
scaler_max = StandardScaler()
X_train_max = scaler_max.fit_transform(X_train_max)
X_test_max = scaler_max.transform(X_test_max)

scaler_min = StandardScaler()
X_train_min = scaler_min.fit_transform(X_train_min)
X_test_min = scaler_min.transform(X_test_min)
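
When imputation and scaling are fit on the full data before cross-validation, statistics from the validation rows leak into training. A scikit-learn `Pipeline` avoids this by refitting each step inside every fold. The following is a minimal sketch on synthetic data, not the notebook's actual workflow; the array shapes, the 5% missing rate and the model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix, with a few missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan

# The imputer and scaler are refit on the training part of every CV fold,
# so validation rows never influence the fitted means or scales.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores.mean())
```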

In [None]:
# Model building for Next_Tmax

models = [LinearRegression(),
          Ridge(alpha=0.001),
          Lasso(alpha=0.003),
          SVR(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          # scikit-learn >= 1.2 names this argument `estimator`;
          # older releases used `base_estimator`
          AdaBoostRegressor(estimator=LinearRegression())]

model_names = ['LinearRegression', 'Ridge', 'Lasso', 'SVR', 'DecisionTreeRegressor',
               'RandomForestRegressor', 'GradientBoostingRegressor', 'AdaBoostRegressor']

rows = []
# Use `name` as the loop variable so the model_names list is not overwritten
for model, name in zip(models, model_names):
    model.fit(X_train_max, y_train_max)
    pred = model.predict(X_test_max)
    mse = mean_squared_error(y_test_max, pred)
    r2 = r2_score(y_test_max, pred)
    # cross_val_score uses the estimator's default scorer, which is R2 for regressors
    mean_cv = cross_val_score(model, X_train_max, y_train_max, cv=5).mean()
    rows.append({'Model_name': name, 'MSE': mse, 'R2': r2, 'MeanCV': mean_cv})
tmp_model_df = pd.DataFrame(rows, columns=['Model_name', 'MSE', 'R2', 'MeanCV'])
tmp_model_df


| | Model_name | MSE | R2 | MeanCV |
|---|---|---|---|---|
| 0 | LinearRegression | 2.269422 | 0.767641 | 0.764176 |
| 1 | Ridge | 2.269422 | 0.767641 | 0.764176 |
| 2 | Lasso | 2.271593 | 0.767419 | 0.764001 |
| 3 | SVR | 1.009534 | 0.896637 | 0.878005 |
| 4 | DecisionTreeRegressor | 2.147772 | 0.780096 | 0.747639 |
| 5 | **RandomForestRegressor** | 0.827086 | 0.915317 | 0.896692 |
| 6 | GradientBoostingRegressor | 1.337337 | 0.863074 | 0.848335 |
| 7 | AdaBoostRegressor | 2.403693 | 0.753893 | 0.750541 |



Per the results above, **RandomForestRegressor** performs best for Next_Tmax, with the lowest MSE (0.827086) and the highest R2 (0.915317).
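
MSE is in squared degrees, which is hard to read as a forecast error. Taking the square root gives RMSE in °C. The sketch below does this for the MSE values reported in the Next_Tmax results; only the conversion is new, the numbers are copied from those results.

```python
import numpy as np
import pandas as pd

# MSE values copied from the Next_Tmax model comparison.
mse = pd.Series({
    "LinearRegression": 2.269422,
    "Ridge": 2.269422,
    "Lasso": 2.271593,
    "SVR": 1.009534,
    "DecisionTreeRegressor": 2.147772,
    "RandomForestRegressor": 0.827086,
    "GradientBoostingRegressor": 1.337337,
    "AdaBoostRegressor": 2.403693,
})

# RMSE is the square root of MSE and is expressed in the target's own unit.
rmse = np.sqrt(mse).sort_values()
print(rmse.round(3))
```

Read this way, the random forest's typical error is a little under 1 °C, against roughly 1.5 °C for the linear models.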


**Hyperparameter tuning**

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


In [None]:
from sklearn.model_selection import GridSearchCV

rf_model = RandomForestRegressor(random_state = 42)
grid_search = GridSearchCV(estimator = rf_model, param_grid = param_grid, cv=3,scoring = 'neg_mean_squared_error',verbose =2,n_jobs= -1)
grid_search.fit(X_train_max, y_train_max)

In [None]:
#Best parameters and model :
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
print("Best Parameters:", best_params)
print("Best Model:", best_model)



In [None]:
# Refit the random forest with the best parameters reported by the grid search

rf_model_best = RandomForestRegressor(max_depth=20,min_samples_leaf=1,min_samples_split=2,n_estimators=200)
rf_model_best.fit(X_train_max,y_train_max)
pred_best = rf_model_best.predict(X_test_max)

mse_max = mean_squared_error(y_test_max,pred_best)
r2_max = r2_score(y_test_max,pred_best)

print("MSE:", mse_max)
print("R2:", r2_max)
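
Retyping the best parameters by hand works but can drift from what the search actually found. With the default `refit=True`, `GridSearchCV` already exposes a fitted model as `best_estimator_`. The sketch below demonstrates this on small synthetic data with a deliberately tiny grid; the data and grid values are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [20, 40], "max_depth": [None, 5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is already trained on all of
# X, y using best_params_, so the values never need to be copied by hand.
best_rf = search.best_estimator_

# Equivalent manual refit, unpacking best_params_ instead of retyping them.
manual_rf = RandomForestRegressor(random_state=42, **search.best_params_).fit(X, y)
print(search.best_params_)
print(np.allclose(best_rf.predict(X), manual_rf.predict(X)))
```

Fixing `random_state` in both places is what makes the two models comparable; without it, two random forests with identical parameters would still differ slightly.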

In [None]:
import joblib
joblib.dump(rf_model_best,'rf_model_best.pkl')

**Model building for Next_Tmin**

In [None]:
# Model building for Next_Tmin

models_min = [LinearRegression(),
              Ridge(alpha=0.001),
              Lasso(alpha=0.003),
              SVR(),
              DecisionTreeRegressor(),
              RandomForestRegressor(),
              GradientBoostingRegressor(),
              # scikit-learn >= 1.2 names this argument `estimator`;
              # older releases used `base_estimator`
              AdaBoostRegressor(estimator=LinearRegression())]

model_names_min = ['LinearRegression', 'Ridge', 'Lasso', 'SVR', 'DecisionTreeRegressor',
                   'RandomForestRegressor', 'GradientBoostingRegressor', 'AdaBoostRegressor']

rows_min = []
# Use `name` as the loop variable so the model_names_min list is not overwritten
for model, name in zip(models_min, model_names_min):
    model.fit(X_train_min, y_train_min)
    pred_min = model.predict(X_test_min)
    mse = mean_squared_error(y_test_min, pred_min)
    r2 = r2_score(y_test_min, pred_min)
    mean_cv = cross_val_score(model, X_train_min, y_train_min, cv=5).mean()
    rows_min.append({'Model_name': name, 'MSE': mse, 'R2': r2, 'MeanCV': mean_cv})
tmp_model_df = pd.DataFrame(rows_min, columns=['Model_name', 'MSE', 'R2', 'MeanCV'])
tmp_model_df

**Best model selection**

Key points for reading the results, as I understand them:

**Lower is better for mean squared error (MSE)**

**Higher is better for the R-squared (R2) value**

**Higher is better for the mean cross-validation score**
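
To make these three metrics concrete, the sketch below computes MSE and R2 by hand and checks them against scikit-learn. The four temperature values are made up for illustration. Note also that `cross_val_score` without a `scoring` argument uses the estimator's own `score` method, which for regressors is R2, so the MeanCV column is a mean R2 across folds.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted temperatures in degrees Celsius.
y_true = np.array([22.0, 24.5, 23.0, 26.0])
y_pred = np.array([21.5, 24.0, 24.0, 25.5])

residual = y_true - y_pred

# MSE: the average squared residual (0 is a perfect fit).
mse_manual = np.mean(residual ** 2)

# R2: 1 minus the ratio of residual variation to total variation around the mean
# (1 is a perfect fit, 0 is no better than always predicting the mean).
r2_manual = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```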




The best-performing model for Next_Tmin is SVR (support vector regressor), followed by RandomForestRegressor:

| Model | MSE | R2 | MeanCV |
|---|---|---|---|
| SVR | 0.486753 | 0.921890 | 0.905323 |
| RandomForestRegressor | 0.570188 | 0.908501 | 0.899237 |

I therefore select **SVR** as the best-performing model and **RandomForestRegressor** as the second best:

In [None]:
# Parameter grid for tuning the SVR model

param_grid = {
    'kernel': ['rbf'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'epsilon': [0.01, 0.1, 1]

}

svr =SVR()
grid_search = GridSearchCV(estimator=svr,param_grid=param_grid,cv=3,n_jobs=-1, verbose=2,scoring= 'neg_mean_squared_error')

grid_search.fit(X_train_min,y_train_min)

In [None]:
best_params = grid_search.best_params_
print(f'Best parameters :{best_params}')

In [None]:
# Refit SVR with the best parameters from the grid search
svr_best  = SVR(C = 10, epsilon =  0.1, gamma = 'scale', kernel = 'rbf')
svr_best.fit(X_train_min,y_train_min)

y_train_min_pred = svr_best.predict(X_train_min)
y_test_pred = svr_best.predict(X_test_min)


print(y_train_min_pred)
print(y_test_pred)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Training-set metrics
mse_train = mean_squared_error(y_train_min, y_train_min_pred)
mae_train = mean_absolute_error(y_train_min, y_train_min_pred)  # MAE, not MSE
r2_train = r2_score(y_train_min, y_train_min_pred)

# Test-set metrics
mse_test = mean_squared_error(y_test_min, y_test_pred)
mae_test = mean_absolute_error(y_test_min, y_test_pred)  # MAE, not MSE
r2_test = r2_score(y_test_min, y_test_pred)

print(f'MSE_train : {mse_train}')
print(f'MAE_train : {mae_train}')
print(f'R2_train : {r2_train}')

print(f'MSE_test : {mse_test}')
print(f'MAE_test : {mae_test}')
print(f'R2_test : {r2_test}')

In [None]:
import joblib
joblib.dump(svr_best,'svr_model.pkl')
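
A saved model is only usable later if new data receives the same scaling it was trained with. One option is to save the scaler and the model together as a single pipeline. The sketch below is a round-trip example on synthetic data; the file name `svr_pipeline.pkl` and the data are hypothetical, while the SVR settings mirror the ones used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for unscaled features and a temperature-like target.
rng = np.random.default_rng(0)
X = rng.normal(loc=25.0, scale=3.0, size=(120, 3))
y = 0.8 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.2, size=120)

# Bundling the scaler with the model means raw inputs can be passed to
# predict() after loading, with no separate preprocessing step to remember.
model = make_pipeline(
    StandardScaler(),
    SVR(C=10, epsilon=0.1, gamma="scale", kernel="rbf"),
)
model.fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "svr_pipeline.pkl")  # hypothetical file name
joblib.dump(model, path)

restored = joblib.load(path)
print(np.allclose(model.predict(X[:5]), restored.predict(X[:5])))
```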