## Linear Regression

We apply Linear Regression, along with its regularized variants and hyperparameter tuning, to the Algerian Forest Fires dataset.

* Dataset Information
  1. The dataset contains 244 instances, 122 from each of two regions of Algeria: the Bejaia region and the Sidi Bel-abbes region.
  2. The dataset is not clean, so it needs a fair amount of cleaning and EDA before we can model the FWI (Fire Weather Index) output.


#### Ridge (L2 Regularization), Lasso (L1 Regularization) & ElasticNet (L1 and L2 Regularization) Regression

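* Ridge (L2) controls overfitting by adding a penalty proportional to the sum of the squared slopes, (slope)^2, to the cost function. This shrinks the coefficients towards zero, at the price of a training cost that no longer reaches exactly 0.
* Lasso (L1) adds a penalty proportional to the sum of the absolute slopes, |slope|, to the cost function. It can shrink some coefficients to exactly zero, which is why it also acts as feature selection.
* Elastic Net adds both penalties, so it combines the shrinkage of Ridge with the feature selection of Lasso.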
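
The three penalties can be written out as plain functions. This is a minimal sketch in textbook form; the function names, the toy data and the penalty strength `alpha` are illustrative assumptions, and scikit-learn scales the individual terms of its objectives somewhat differently.

```python
import numpy as np

def mse_cost(X, y, w, b):
    """Mean squared error of the linear model X @ w + b."""
    residuals = X @ w + b - y
    return np.mean(residuals ** 2)

def ridge_cost(X, y, w, b, alpha):
    """MSE plus the L2 penalty: alpha * sum of squared slopes."""
    return mse_cost(X, y, w, b) + alpha * np.sum(w ** 2)

def lasso_cost(X, y, w, b, alpha):
    """MSE plus the L1 penalty: alpha * sum of absolute slopes."""
    return mse_cost(X, y, w, b) + alpha * np.sum(np.abs(w))

def elastic_net_cost(X, y, w, b, alpha, l1_ratio):
    """MSE plus a mix of both penalties; l1_ratio=1 is Lasso, l1_ratio=0 is Ridge."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return mse_cost(X, y, w, b) + alpha * (l1_ratio * l1 + (1 - l1_ratio) * l2)

# toy data (made up): y = 2 * x fits perfectly, so the plain MSE is 0
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([2.0])
b = 0.0

print("MSE:        ", mse_cost(X, y, w, b))
print("Ridge cost: ", ridge_cost(X, y, w, b, alpha=0.1))
print("Lasso cost: ", lasso_cost(X, y, w, b, alpha=0.1))
print("Elastic Net:", elastic_net_cost(X, y, w, b, alpha=0.1, l1_ratio=0.5))
```

Even with a perfect fit the penalized costs stay above 0, so the optimizer is pushed towards smaller slopes.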

In [None]:
## necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
## reading the dataset
dataset = pd.read_csv("/kaggle/input/huozai/Algerian_forest_fires_dataset_UPDATE.csv",header=1)
dataset.head()

In [None]:
## EDA plan
# Check for null values
# Check for anomalies in the dataset
# Check for collinearity
# Drop columns that aren't needed
# Convert data types where necessary

In [None]:
dataset.info()

In [None]:
## checking for null values
print("Null values \n",dataset.isnull().sum())
print("\nThe rows which contain null components")
# only the last expression of a cell is displayed, so this intermediate result is printed explicitly
print(dataset[dataset.isnull().any(axis=1)])

## the file holds two regional tables stacked on top of each other;
## label each part so that both can live in a single dataframe
## (.loc slices include both end labels, so row 122 is assigned twice and ends up as "Sidi Bel Abbes")
dataset.loc[:122,"Region"]="Bejaia"
dataset.loc[122:,"Region"]="Sidi Bel Abbes"

## removing the rows that contain null values
dataset = dataset.dropna().reset_index(drop=True)
print("\nThe rows which contain null components (should now be empty)")
dataset[dataset.isnull().any(axis=1)]
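
The region labelling above relies on `.loc` slicing by label, which includes both end points, unlike positional slicing with `.iloc`. A small sketch on a made-up frame (the name `toy` and the labels "A"/"B" are placeholders, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(6)})  # hypothetical stand-in for the fire dataset

toy.loc[:3, "Region"] = "A"   # label slice: rows 0, 1, 2 and 3
toy.loc[3:, "Region"] = "B"   # label slice: rows 3, 4 and 5, so row 3 is overwritten

by_position = toy.iloc[:3]    # positional slice: rows 0, 1 and 2 only

print(toy)
print(len(by_position))
```

Because the second assignment wins on the shared row, the boundary row ends up in the second group.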



In [None]:
print(dataset.iloc[118:125])
## the row at index 122 still holds the feature names (a repeated header), which isn't needed
dataset = dataset.drop(122).reset_index(drop=True)
dataset.iloc[118:125]

In [None]:
## check for stray spaces in the column names
print("Columns in the dataset are :",dataset.columns)
print("\nDataset data types: \n",dataset.dtypes)

## some column names carry leading/trailing spaces, so we strip them
dataset.columns = dataset.columns.str.strip()
int_cols = ['day','month','year','Temperature','RH','Ws']
double_cols = ['Rain','FFMC','DMC','DC','ISI','BUI','FWI']

## cast each column to its proper numeric type
for ic in int_cols:
    dataset[ic] = dataset[ic].astype(int)
for dc in double_cols:
    dataset[dc] = dataset[dc].astype(float)

## FWI is the output column for the regression problem
## Classes is the output column for classification (logistic regression) problems

print("Post operation")
print("Columns in the dataset are :",dataset.columns)
print("\nDataset data types: \n",dataset.dtypes)
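
One way to locate rows such as a repeated header without inspecting the frame by eye is to coerce a column to numbers and see which entries fail. A minimal sketch on made-up data; the name `raw` is hypothetical, and `pd.to_numeric(..., errors="coerce")` is a swapped-in technique, not what the cells above do:

```python
import pandas as pd

# made-up frame with a header row repeated in the middle of the data
raw = pd.DataFrame({
    "day": ["1", "2", "day", "1", "2"],
    "Temperature": ["29", "31", "Temperature", "33", "30"],
})

# values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(raw["day"], errors="coerce")
bad_rows = raw.index[as_numbers.isna()].tolist()
print("Rows that are not numeric:", bad_rows)

# drop the offending rows, then cast the remaining text to integers
clean = raw.drop(index=bad_rows).reset_index(drop=True).astype(int)
print(clean.dtypes)
```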


In [None]:
dataset.info()

In [None]:
dataset.head(5)

In [None]:
df = dataset.copy(deep=True)

In [None]:
df.head()

## EDA

In [None]:
df['Classes'].value_counts()

In [None]:
## removing spaces -- as we observe this column contains some spaces
df['Classes'] = df['Classes'].str.strip()

In [None]:
## encode Classes as numbers: 0 for "not fire", 1 for everything else (fire)
df['Classes'] = np.where(df['Classes']=="not fire",0,1)


In [None]:
df['Classes'].value_counts()

In [None]:
df.head()

In [None]:
df.drop(['year','month','day','Region'],axis=1,inplace=True)

In [None]:
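## Visualization to understand the data
# the 'seaborn' style name was deprecated in Matplotlib 3.6; sns.set_theme() applies seaborn's default look instead
sns.set_theme()
df.hist(bins=50,figsize=(20,20))
plt.show()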

In [None]:
## check the correlation
df.corr()

In [None]:
sns.heatmap(df.corr())

In [None]:
sns.pairplot(df)

In [None]:
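## plot the dependent feature, i.e. FWI, on a box plot
# pass the series by keyword; the meaning of the first positional argument differs between seaborn versions
sns.boxplot(x=df['FWI'],color='blue')

## the output variable has some outliers; these can be ignored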

In [None]:
## remove the highly collinear variables
# BUI and DC can be removed, as both are highly collinear with DMC
df.drop(['BUI','DC'],axis=1,inplace=True)
df.head()
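
Collinear columns can also be flagged programmatically rather than read off the heatmap. The helper below is a sketch only: the name `highly_correlated`, the 0.85 threshold and the toy data are assumptions, and the notebook itself picks BUI and DC by inspection.

```python
import numpy as np
import pandas as pd

def highly_correlated(frame, threshold=0.85):
    """Return the columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i):  # only compare with columns that come earlier
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[i])
    return to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # independent of "a"
})

flagged = highly_correlated(toy, threshold=0.85)
print(flagged)
```

Which member of a correlated pair gets flagged depends on column order, and the target column should be left out of the frame before calling such a helper.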

#### Monthly Fire Analysis of the Bejaia Region

In [None]:
dataset['Region'].value_counts()

In [None]:
df_bejaia = dataset[dataset['Region']=='Bejaia'].copy(deep=True)
# df_bejaia.head()
df_bejaia['Classes'] = df_bejaia['Classes'].str.strip()

plt.subplots(figsize=(13,6))
sns.set_style('whitegrid')
sns.countplot(x='month',hue='Classes',data=df_bejaia)
plt.xlabel('Months')
plt.ylabel('Fire')
plt.title('Monthly Analysis of Fire in the Bejaia Region')

## The maximum number of fire incidents in the Bejaia region occurs in August

In [None]:
df_sidi = dataset[dataset['Region']=='Sidi Bel Abbes'].copy(deep=True)
# df_sidi.head()
df_sidi['Classes'] = df_sidi['Classes'].str.strip()

plt.subplots(figsize=(13,6))
sns.set_style('whitegrid')
sns.countplot(x='month',hue='Classes',data=df_sidi)
plt.xlabel('Months')
plt.ylabel('Fire')
plt.title('Monthly Analysis of Fire in the Sidi Bel Abbes Region')

## The maximum number of fire incidents in the Sidi Bel Abbes region occurs in August

In [None]:
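## dividing the dataframe into train and test splits
from sklearn.model_selection import train_test_split
## FWI is the regression target; Classes is the classification label, so neither is used as a feature
## (selecting by name gives the same split as the positional iloc[:,-2] / iloc[:,:-2] and is less fragile)
y = df['FWI']
X = df.drop(columns=['FWI','Classes'])
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

## we need to scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
## to avoid data leakage the scaler is fitted on the training set only; the test set is just transformed
X_test = scaler.transform(X_test)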
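
What "fit on train, transform on test" means can be spelled out by hand. This is a sketch with made-up numbers; the array names are placeholders, and the only statistics used are the mean and standard deviation of the training split.

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(loc=30.0, scale=5.0, size=(100, 2))  # made-up training features
test = rng.normal(loc=30.0, scale=5.0, size=(20, 2))    # made-up test features

# the statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, never refit on test

print("train mean after scaling:", train_scaled.mean(axis=0))
print("test mean after scaling: ", test_scaled.mean(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the test columns are only close to that, which is expected.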

In [None]:
## Linear Regression Model
from sklearn.linear_model import LinearRegression
model_lr =LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

In [None]:
## metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
mse = mean_squared_error(y_test,y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
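N = len(y_test)      # number of test samples
P = len(X.columns)   # number of features
## adjusted R2 penalises R2 for the number of features used
adj_score = 1-((1-score)*(N-1)/(N-P-1))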

print("MSE is: ",mse)
print("RMSE is: ",rmse)
print("MAE is: ",mae)
print("R2 Score is: ",score)
print("Adj R2 score: ",adj_score)
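
The same five metrics are computed for every model in this notebook, so they could be wrapped in one helper. A possible sketch in plain NumPy; the function name `regression_report` and the toy numbers are illustrative, not part of the notebook:

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": mae, "R2": r2, "Adj R2": adj_r2}

# tiny made-up example: one prediction is off by 1
report = regression_report([1, 2, 3, 4], [1, 2, 3, 5], n_features=1)
for name, value in report.items():
    print(f"{name} is: {value}")
```

With such a helper each model's evaluation reduces to a single call with its own predictions.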

In [None]:
## Ridge Regression (L2) and Lasso (L1); the CV variants pick the penalty strength alpha by cross-validation
from sklearn.linear_model import RidgeCV,LassoCV
model_ridge = RidgeCV()
model_lasso = LassoCV()

print("Ridge Regression")
model_ridge.fit(X_train,y_train)
y_pred_ridge = model_ridge.predict(X_test)

mse = mean_squared_error(y_test,y_pred_ridge)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test,y_pred_ridge)
score = r2_score(y_test,y_pred_ridge)
N = len(y_test)
P = len(X.columns)
adj_score = 1-((1-score)*(N-1)/(N-P-1))

print("MSE is: ",mse)
print("RMSE is: ",rmse)
print("MAE is: ",mae)
print("R2 Score is: ",score)
print("Adj R2 score: ",adj_score)

print("\n\nLasso Regression")
model_lasso.fit(X_train,y_train)
y_pred_lasso = model_lasso.predict(X_test)

mse = mean_squared_error(y_test,y_pred_lasso)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test,y_pred_lasso)
score = r2_score(y_test,y_pred_lasso)
N = len(y_test)
P = len(X.columns)
adj_score = 1-((1-score)*(N-1)/(N-P-1))

print("MSE is: ",mse)
print("RMSE is: ",rmse)
print("MAE is: ",mae)
print("R2 Score is: ",score)
print("Adj R2 score: ",adj_score)
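
After fitting, `RidgeCV` and `LassoCV` expose the penalty strength they selected as `alpha_`, and the fitted slopes as `coef_`. A short sketch on synthetic data (the data, the variable names and the candidate alphas for Ridge are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# synthetic data: only the first two of five features influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

ridge_alphas = [0.1, 1.0, 10.0]
ridge = RidgeCV(alphas=ridge_alphas).fit(X_toy, y_toy)
lasso = LassoCV(cv=5, random_state=0).fit(X_toy, y_toy)

print("Ridge alpha chosen by CV:", ridge.alpha_)
print("Lasso alpha chosen by CV:", lasso.alpha_)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```

Comparing the two coefficient vectors is a quick way to see how strongly each penalty shrinks the uninformative features.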


In [None]:
## hyperparameter tuning for Elastic Net regression with a randomized search
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

params_grid = {
    'alpha':[0.1,0.5,1.0,5.0,10.0],
    'l1_ratio':[0.1,0.3,0.5,0.7,0.9]
}

model_elastic_net = ElasticNet(random_state=42)
random_search = RandomizedSearchCV(estimator=model_elastic_net,
                              param_distributions=params_grid,
                              random_state=42,
                              n_jobs=-1,
                              n_iter=10,
                              cv=5)

In [None]:
random_search.fit(X_train,y_train)
y_pred_en = random_search.predict(X_test)

print("Elastic Net with Hyper Parameter Tuning: \n")
mse = mean_squared_error(y_test,y_pred_en)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test,y_pred_en)
score = r2_score(y_test,y_pred_en)
N = len(y_test)
P = len(X.columns)
adj_score = 1-((1-score)*(N-1)/(N-P-1))

print("MSE is: ",mse)
print("RMSE is: ",rmse)
print("MAE is: ",mae)
print("R2 Score is: ",score)
print("Adj R2 score: ",adj_score)

In [None]:
## get the parameters
random_search.best_params_
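
The grid above holds 5 x 5 = 25 combinations, of which the randomized search tries only `n_iter=10`. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` to make that visible; it should mirror how the candidates are drawn, but treat it as an illustration rather than a description of the search internals.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params_grid = {
    'alpha': [0.1, 0.5, 1.0, 5.0, 10.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
}

# every combination an exhaustive grid search would evaluate
all_candidates = list(ParameterGrid(params_grid))

# the subset a randomized search with n_iter=10 would draw
sampled = list(ParameterSampler(params_grid, n_iter=10, random_state=42))

print(len(all_candidates), "combinations in total")
print(len(sampled), "combinations sampled")
for candidate in sampled:
    print(candidate)
```

Since the best of the 25 combinations may not be among the 10 sampled, an exhaustive `GridSearchCV` is a reasonable alternative for a grid this small.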

#### Of the models compared below, the best is: Lasso (L1 regularization)
* Linear Regression
* Ridge Regression
* Lasso Regression
* Elastic Net with Hyperparameter Tuning using Randomized Search CV

In [None]:
## serialize the trained model to disk with pickle so that it can be reloaded later
import pickle

with open('model.pkl', 'wb') as file:
    pickle.dump(model_lasso, file)
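
Since the model was trained on scaled features, the fitted scaler is needed again at prediction time. One option is to pickle both together and reload them as a pair. A self-contained sketch with a toy model; the file name `bundle.pkl`, the dictionary keys and the synthetic data are assumptions:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# toy data and model standing in for the notebook's scaler and Lasso model
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=30.0, scale=5.0, size=(50, 3))
y_toy = X_toy @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.1, size=50)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = Lasso(alpha=0.1).fit(toy_scaler.transform(X_toy), y_toy)

# save the scaler and the model side by side
path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"scaler": toy_scaler, "model": toy_model}, f)

# later: reload and predict on raw (unscaled) rows
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_rows = X_toy[:5]
restored = bundle["model"].predict(bundle["scaler"].transform(new_rows))
original = toy_model.predict(toy_scaler.transform(new_rows))
print("Predictions match:", np.allclose(restored, original))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.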