Data : Forestfires
Source : https://archive.ics.uci.edu/ml/datasets/Forest+Fires

  P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. 
  In J. Neves
  Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence
  Guimaraes
  Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf

1. Title: Forest Fires

2. Sources
   Created by: Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007
   
3. Past Usage:

   P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data.
   In Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence,
   December 2007.
   
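   In the above reference, the output "area" was first transformed with a ln(x+1) function.
   Then, after the models were fitted, their outputs were
   post-processed with the inverse of the ln(x+1) transform. Four different input setups were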
   used. The experiments were conducted using a 10-fold (cross-validation) x 30 runs. Two
   regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed
   with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value:
   12.71 +- 0.01 (mean and 95% confidence interval using a Student's t distribution). The
   best RMSE was attained by the naive mean predictor. An analysis of the regression error curve
   (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect,
   the SVM model is better at predicting small fires.
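
   The two regression metrics can be sketched as follows. This is a minimal illustration, assuming
   MAD is read as the mean absolute deviation; the toy areas and both constant predictors are
   made up for the example and are not taken from the paper.

```python
import numpy as np

def mad(y_true, y_pred):
    """Mean absolute deviation between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical burned areas (ha): mostly small fires plus one large fire.
y = np.array([0.0, 0.0, 1.0, 2.0, 47.0])

naive_mean = np.full_like(y, y.mean())        # always predicts the mean area
small_fire = np.full_like(y, np.median(y))    # always predicts a small (median) area

mad_mean, rmse_mean = mad(y, naive_mean), rmse(y, naive_mean)
mad_small, rmse_small = mad(y, small_fire), rmse(y, small_fire)

print(f"naive mean : MAD={mad_mean:.2f}  RMSE={rmse_mean:.2f}")
print(f"small fire : MAD={mad_small:.2f}  RMSE={rmse_small:.2f}")
```

   On skewed data like this, RMSE penalizes the one large miss heavily, so a predictor that is
   better on small fires can still lose to the mean predictor on RMSE while winning on MAD.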
 
4. Relevant Information:

   This is a very difficult regression task. It can be used to test regression methods. Also
   it could be used to test outlier detection methods
   are there. Yet

5. Number of Instances: 517 

6. Number of Attributes: 12 + output attribute
  
   Note: several of the attributes may be correlated, so it makes sense to apply some sort of
   feature selection.
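
   One simple way to spot correlated attributes is a pairwise correlation matrix. The sketch
   below uses synthetic columns (`a`, `b`, `c`) and an arbitrary cutoff of 0.9; both are
   assumptions for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 500
a = rng.normal(size=n_samples)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=n_samples),  # nearly a copy of "a"
    "c": rng.normal(size=n_samples),            # independent of both
})

corr = toy.corr().abs()   # absolute Pearson correlations
cutoff = 0.9              # illustrative cutoff, not a recommended value
cols = list(corr.columns)

# Visit each unordered pair once and keep those above the cutoff.
correlated_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > cutoff
]
print(correlated_pairs)
```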

7. Attribute information:

   For more information, see [Cortez and Morais, 2007].

   1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
   2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
   3. month - month of the year: "jan" to "dec" 
   4. day - day of the week: "mon" to "sun"
   5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
   6. DMC - DMC index from the FWI system: 1.1 to 291.3 
   7. DC - DC index from the FWI system: 7.9 to 860.6 
   8. ISI - ISI index from the FWI system: 0.0 to 56.10
   9. temp - temperature in Celsius degrees: 2.2 to 33.30
   10. RH - relative humidity in %: 15.0 to 100
   11. wind - wind speed in km/h: 0.40 to 9.40 
   12. rain - outside rain in mm/m2 : 0.0 to 6.4 
   13. area - the burned area of the forest (in ha): 0.00 to 1090.84 
   (this output variable is very skewed towards 0.0, so it may make
    sense to model with the logarithm transform). 
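
   The ln(x+1) transform and its inverse can be sketched with NumPy's `log1p` and `expm1`.
   The sample values below are illustrative (the last one is the maximum listed above), and
   the variable names are assumptions for this example.

```python
import numpy as np

# Illustrative burned areas in ha, including exact zeros.
area = np.array([0.0, 0.0, 0.5, 6.4, 1090.84])

area_log = np.log1p(area)       # ln(x + 1): defined at 0 and compresses the long tail
area_back = np.expm1(area_log)  # inverse transform, exp(z) - 1

print(area_log)
print(area_back)
```

   A model trained on `area_log` would have its predictions passed through `np.expm1` to get
   back to hectares.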

8. Missing Attribute Values: None


### Here, the dependent variable is area

In [1]:
from numpy import sqrt
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('forestfires.csv')
# Keep 'area': it is the target column that feature_selection_module expects.
data = data.drop(['month','day', 'rain'], axis=1)

In [42]:
def feature_selection_module(data):
    # Put the target first so that column 0 is 'area' and columns 1.. are the features.
    features = data.drop(['area'], axis=1)
    y = data['area']
    data_rearranged = pd.concat([y, features], axis=1)
    n = data.shape[1]           # number of columns (target + features)
    n_samples = data.shape[0]   # number of observations
    # Approximate two-sided 5% cutoff for |r| under "no correlation": 1.96/sqrt(sample size).
    significant_r = 1.96/(sqrt(n_samples))
    
    def covariance(X, Y, data):
        # Sample covariance of two columns. `data` is unused but kept so call sites stay the same.
        n_obs = len(X)
        return ((X - X.mean())*(Y - Y.mean())).sum()/(n_obs-1)
    
    def standard_deviation(X, data):
        # Sample standard deviation, using the same (n_obs - 1) divisor as covariance().
        n_obs = len(X)
        variance = ((X - X.mean())**2).sum()/(n_obs-1)
        return sqrt(variance)
    
    def correlation_coefficient(cov, std_dev1, std_dev2):
        corr_coeff = cov/(std_dev1*std_dev2)
        return corr_coeff

    def partial_correlation_coefficient(Y, X1, X2):
        # First-order partial correlation of Y and X1, controlling for X2:
        #   r_yx1.x2 = (r_yx1 - r_yx2*r_x1x2) / sqrt((1 - r_yx2**2)*(1 - r_x1x2**2))
        def r(A, B):
            cov = covariance(A, B, data_rearranged)
            return correlation_coefficient(cov,
                                           standard_deviation(A, data_rearranged),
                                           standard_deviation(B, data_rearranged))
        r_y_x1 = r(Y, X1)
        r_y_x2 = r(Y, X2)
        r_x1_x2 = r(X1, X2)
        numerator = r_y_x1 - r_y_x2*r_x1_x2
        denominator = sqrt((1-r_y_x2**2)*(1-r_x1_x2**2))
        partial_corr_coeff = numerator/denominator
        return partial_corr_coeff
    
    def correlation_matrix(data):
        # Build the full Pearson correlation matrix as a fresh list of lists on every call.
        n_cols = data.shape[1]
        corr_matrix = []
        for i in range(n_cols):
            corr_matrix.append([])
            for j in range(n_cols):
                cov = covariance(data[data.columns[i]], data[data.columns[j]],data)
                std_dev1 = standard_deviation(data[data.columns[i]],data)
                std_dev2 = standard_deviation(data[data.columns[j]],data)
                corr_matrix[i].append(correlation_coefficient(cov, std_dev1, std_dev2))
        return corr_matrix
    
    def multi_collinearity_module(data, significant_r):
        # Indices refer to columns of `data`: 0 is 'area', 1.. are the features.
        corr_mat = np.asarray(correlation_matrix(data))
        n_cols = corr_mat.shape[1]
        
        insignificant_features_index_list = []
        
        ### Removing insignificant features (for which |r| of y and x(i) < r_significant)
        for i in range(1, n_cols):
            if abs(corr_mat[0][i]) < significant_r:
                insignificant_features_index_list.append(i)
                
        ### Removing multi-collinearity: visit each feature pair (p, q) once
        for p in range(1, n_cols):
            for q in range(p+1, n_cols):
                if abs(corr_mat[p][q]) > significant_r:
                    partial_correlation_coefficient1 = abs(partial_correlation_coefficient(data['area'],data[data.columns[p]],data[data.columns[q]]))
                    partial_correlation_coefficient2 = abs(partial_correlation_coefficient(data['area'],data[data.columns[q]],data[data.columns[p]]))
                    if partial_correlation_coefficient1 > significant_r:
                        if partial_correlation_coefficient2 > significant_r:
                            # Both matter on their own: drop the weaker of the two.
                            if partial_correlation_coefficient1 > partial_correlation_coefficient2:
                                insignificant_features_index_list.append(q)
                            else :
                                insignificant_features_index_list.append(p)
                        else : 
                            insignificant_features_index_list.append(q)
                    else :
                        if partial_correlation_coefficient2 > significant_r:
                            insignificant_features_index_list.append(p)
                        else :
                            insignificant_features_index_list.append(p)
                            insignificant_features_index_list.append(q)
        return insignificant_features_index_list
    
    # Candidate feature indices: every column of data_rearranged except 'area' (index 0).
    features_list = list(range(1, data_rearranged.shape[1]))
    
    insignificant_features_index_list = multi_collinearity_module(data_rearranged, significant_r)
    # A feature can be flagged more than once, so compare against a set.
    insignificant = set(insignificant_features_index_list)
    significant_features_index_list = [i for i in features_list if i not in insignificant]
    return significant_features_index_list
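
The first-order partial correlation used in the module above can be checked independently. The sketch below is not part of the notebook: it uses synthetic data, and `partial_corr` and `residual` are hypothetical helper names. It compares the closed-form expression with the equivalent "correlate the residuals" definition, and also prints the 1.96/sqrt(N) cutoff for N = 517 instances.

```python
import numpy as np

def partial_corr(y, x1, x2):
    """First-order partial correlation of y and x1, controlling for x2."""
    r_y1 = np.corrcoef(y, x1)[0, 1]
    r_y2 = np.corrcoef(y, x2)[0, 1]
    r_12 = np.corrcoef(x1, x2)[0, 1]
    return (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))

def residual(v, z):
    """Residual of v after a least-squares line fit on z (with intercept)."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Synthetic data: x1 and x2 are correlated, and y depends on both.
rng = np.random.default_rng(0)
n_samples = 517
x2 = rng.normal(size=n_samples)
x1 = 0.8 * x2 + rng.normal(size=n_samples)
y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=n_samples)

from_formula = partial_corr(y, x1, x2)
# Same quantity by definition: correlate what is left of y and x1 once x2 is regressed out.
from_residuals = np.corrcoef(residual(y, x2), residual(x1, x2))[0, 1]

threshold = 1.96 / np.sqrt(n_samples)   # large-sample cutoff for |r|
print(from_formula, from_residuals, threshold)
```

The two values should agree up to floating-point error, which is a quick sanity check for any hand-written partial correlation routine.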
            
        