# ImbalanceMetrics (Regression): Usage
## Example 1: Beginner

## Dependencies
First, we load the required dependencies. Here we import `regression_metrics` from `imbalance_metrics` to evaluate the predictions of a `LinearRegression` model. In addition, we use pandas and numpy for data handling, `LabelEncoder` to encode categorical columns, and `train_test_split` to split the dataset.

In [1]:
## load dependencies
from imbalance_metrics import regression_metrics as rm
import pandas as pd 
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

## Data
First, we load the data. In this example, we use the training split of the Ames Housing Dataset retrieved from Kaggle, originally compiled by Dean De Cock.
Link to original dataset - https://www.kaggle.com/datasets/prevek18/ames-housing-dataset


In [2]:
## load data
df = pd.read_csv(
    'https://raw.githubusercontent.com/paobranco/ImbalanceMetrics/main/data/housing.csv', index_col=None, na_values=['NA']
)
df.head()

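| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |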


In [3]:
# Drop columns that contain mostly NaN values
df=df.drop(columns=["Alley","PoolQC","Fence","MiscFeature"])
# Convert the object (categorical) columns to numeric codes
object_data = df.select_dtypes(include='object')
num_data = df.select_dtypes(exclude='object')

enc = LabelEncoder()
for i in range(0, object_data.shape[1]):
    object_data.iloc[:,i] = enc.fit_transform(object_data.iloc[:,i])   



In [4]:
# Merge the numerical columns and the encoded object columns back together, then fill remaining NaN values with 0.
df = pd.concat([num_data, object_data], axis = 1)
df=df.fillna(0)
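
As a hedged alternative to the `LabelEncoder` loop above, the same kind of integer encoding can be sketched with pandas alone. The toy frame and the `"missing"` placeholder are assumptions made for illustration and are not part of the original notebook; the codes this produces may differ from those produced by the cells above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data (not the real dataset).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotShape": ["Reg", "Reg", "IR1", None],
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
})

# Pick every non-numeric column; these are the ones that need encoding.
categorical_cols = [
    col for col in toy.columns if not pd.api.types.is_numeric_dtype(toy[col])
]

encoded = toy.copy()
for col in categorical_cols:
    # Treat missing values as their own category ("missing" is an assumed label),
    # then map the sorted categories to integer codes 0..k-1.
    filled = encoded[col].fillna("missing")
    encoded[col] = filled.astype("category").cat.codes

print(encoded)
```

Making the missing-value category explicit keeps the encoding independent of how a given encoder version handles NaN in string columns.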

In [5]:
# Show the description of the dataset
df.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | KitchenQual | Functional | FireplaceQu | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 57.623288 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.117123 | 443.639726 | ... | 2.339726 | 5.749315 | 3.804795 | 2.485616 | 1.284247 | 3.927397 | 3.960959 | 1.856164 | 7.513014 | 3.770548 |
| std | 421.610009 | 42.300571 | 34.664304 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 180.731373 | 456.098091 | ... | 0.830161 | 0.979659 | 1.398954 | 1.933206 | 0.892831 | 0.647822 | 0.566832 | 0.496592 | 1.5521 | 1.100854 |
| min | 1.0 | 20.0 | 0.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 365.75 | 20.0 | 42.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 2.0 | 6.0 | 2.0 | 1.0 | 1.0 | 4.0 | 4.0 | 2.0 | 8.0 | 4.0 |
| 50% | 730.5 | 50.0 | 63.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 3.0 | 6.0 | 4.0 | 1.0 | 1.0 | 4.0 | 4.0 | 2.0 | 8.0 | 4.0 |
| 75% | 1095.25 | 70.0 | 79.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 164.25 | 712.25 | ... | 3.0 | 6.0 | 5.0 | 5.0 | 2.0 | 4.0 | 4.0 | 2.0 | 8.0 | 4.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 3.0 | 6.0 | 5.0 | 6.0 | 3.0 | 5.0 | 5.0 | 2.0 | 8.0 | 5.0 |


In [6]:
# Separate the features (X) from the target (y).
X = df.drop(columns="SalePrice")  # every column except SalePrice
y = df["SalePrice"]               # target: sale price

In [7]:
#Split X and y into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
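
The split above is random, so the metric values reported later will vary from run to run. A minimal sketch of how to make such a split repeatable, assuming scikit-learn's `random_state` parameter and using toy arrays in place of `X` and `y` (the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Passing the same random_state twice should yield the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Each split is [X_train, X_test, y_train, y_test]; compare them pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
```

Fixing the seed is mainly useful when comparing metrics across runs; it does not change how the model is trained.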

## Model
Next, we train our model on the training data. In this example, we use `LinearRegression()` from sklearn. The model predicts 'SalePrice' for the test set as `y_pred`, which we compare with the true values `y_test` in our evaluation.


In [8]:
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

## Evaluation

In this section, we present the evaluation metrics available in the package. These metrics can be used to evaluate model performance when the dataset is imbalanced. In this example, we use only the regression metrics.
The package includes the following evaluation metrics for regression:

* `phi_weighted_r2` : Calculates the R^2 score between 'y' and 'y_pred' with weighting by `phi`.

* `phi_weighted_mse` : Calculates the mean squared error between 'y' and 'y_pred' with weighting by `phi`.

* `phi_weighted_mae` : Calculates the mean absolute error between 'y' and 'y_pred' with weighting by `phi`.

* `phi_weighted_root_mse` : Calculates the root mean squared error between 'y' and 'y_pred' with weighting by `phi`.

* `ser_t` : Calculates the squared error-relevance values between 'y' and 'y_pred' with weighting by `phi` at threshold 't'.

* `aer_t` : Calculates the absolute error-relevance values between 'y' and 'y_pred' with weighting by `phi` at threshold 't'.

* `era` : Calculates the Squared/Absolute error-relevance areas (ERA) between 'y' and 'y_pred'.
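
To make the idea of weighting by `phi` concrete, here is a minimal sketch of a relevance-weighted mean squared error written with numpy only. The function `weighted_mse`, the toy values, and the hand-picked relevance scores in `phi_demo` are all assumptions for illustration; the package computes its own relevance values and its exact formula may differ.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Weighted mean of squared errors; `weights` plays the role of phi(y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.average((y_true - y_pred) ** 2, weights=weights))

# Hypothetical toy targets: three common values and one rare, extreme value.
y_true_demo = np.array([100.0, 110.0, 120.0, 400.0])
y_pred_demo = np.array([105.0, 108.0, 118.0, 300.0])

# Hand-picked relevance scores in [0, 1]; the rare value gets the highest score.
phi_demo = np.array([0.0, 0.1, 0.1, 1.0])

plain = float(np.mean((y_true_demo - y_pred_demo) ** 2))
weighted = weighted_mse(y_true_demo, y_pred_demo, phi_demo)

print("Plain MSE:", plain)
print("Relevance-weighted MSE:", weighted)
```

In this toy case the weighted score is dominated by the error on the rare value, which is the behaviour a relevance-weighted metric is meant to expose.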


But first, we compute `mean_squared_error`, `mean_absolute_error` and `r2_score` from the sklearn package (imported earlier) as a baseline for comparison.

In [9]:
mse = mean_squared_error(y_test , y_pred)
print("Mean Squared Error:", mse) 

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae) 

r2 = r2_score(y_test, y_pred)
print("R2:", r2)            

Mean Squared Error: 1706894348.1517744
Mean Absolute Error: 20058.32356322943
R2: 0.7095855146158936


Next, we use the metric functions from our package, "ImbalanceMetrics".

In [10]:
# Calculate phi_weighted_mse. The phi relevance weights are computed with the default settings.
wmse = rm.phi_weighted_mse(y_test, y_pred)
print("Weighted Mean Squared Error:", wmse)

# Calculate phi_weighted_mae with the default phi relevance settings.
wmae = rm.phi_weighted_mae(y_test, y_pred)
print("Weighted Mean Absolute Error:", wmae)

# Calculate phi_weighted_r2 with the default phi relevance settings.
wr2 = rm.phi_weighted_r2(y_test, y_pred)
print("Weighted R2:", wr2)

Weighted Mean Squared Error: 1631760419.90464
Weighted Mean Absolute Error: 27660.475642436235
Weighted R2: 0.7480105800662555


In [11]:
# Calculate phi_weighted_root_mse. The phi relevance weights are computed with the default settings.
wrmse = rm.phi_weighted_root_mse(y_test, y_pred)
print("Weighted Root Mean Squared Error:", wrmse)

Weighted Root Mean Squared Error: 40395.054399080094


In [12]:
# The phi relevance function can be configured through rel_method, rel_xtrm_type, rel_coef and ctrl_pts.
# None of them are passed in this cell, so the default values apply.

# Calculate ser_t. The threshold is defined by the user; here it is 0.055.
threshold = .055
ser_t = rm.ser_t(y_test, y_pred, threshold)
print("Ser("+str(threshold)+"):", ser_t)

# Calculate ser_t again with a threshold of 0.3.
threshold = .3
ser_t = rm.ser_t(y_test, y_pred, threshold)
print("Ser("+str(threshold)+"):", ser_t)

# Calculate SERA with the default weight (None).
sera = rm.era(y_test, y_pred)
print("Sera:", sera)

# Calculate SERA with weight = 'phi'.
sera = rm.era(y_test, y_pred, weight='phi')
print("Sera(phi):", sera)

# Calculate SERA with weight = 'threshold'.
sera = rm.era(y_test, y_pred, weight='threshold')
print("Sera(thres):", sera)

# Calculate SERA with weight = 'density'.
sera = rm.era(y_test, y_pred, weight='density')
print("Sera(density):", sera)


Ser(0.055): 183120530675.53183
Ser(0.3): 163681756109.36713
Sera: 140734625271.35788
Sera(phi): 120955635778.19684
Sera(thres): 60488453317.65305
Sera(density): 113109541878.96922
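
The relationship between `ser_t` and SERA can be illustrated with a small numpy sketch. It assumes the usual definition: SER at threshold t is the sum of squared errors over the cases whose relevance is at least t, and SERA is the area under that curve for t in [0, 1]. The helper names `ser_at` and `sera_area`, the toy data, and the hand-picked relevance scores are hypothetical; the package's own implementation (relevance function, integration step, weighting options) may differ.

```python
import numpy as np

def ser_at(y_true, y_pred, phi, t):
    """Sum of squared errors over cases with relevance phi >= t."""
    mask = phi >= t
    return float(np.sum((y_true[mask] - y_pred[mask]) ** 2))

def sera_area(y_true, y_pred, phi, steps=1001):
    """Approximate the area under the SER_t curve on [0, 1] (trapezoid rule)."""
    ts = np.linspace(0.0, 1.0, steps)
    vals = np.array([ser_at(y_true, y_pred, phi, t) for t in ts])
    return float(np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(ts)))

# Hypothetical toy data with hand-picked relevance scores in [0, 1].
y_true_demo = np.array([100.0, 110.0, 120.0, 400.0])
y_pred_demo = np.array([105.0, 108.0, 118.0, 300.0])
phi_demo = np.array([0.0, 0.1, 0.1, 1.0])

ser_low = ser_at(y_true_demo, y_pred_demo, phi_demo, 0.055)
ser_high = ser_at(y_true_demo, y_pred_demo, phi_demo, 0.3)
area = sera_area(y_true_demo, y_pred_demo, phi_demo)

print("SER(0.055):", ser_low)
print("SER(0.3):", ser_high)
print("SERA (approx.):", area)
```

Raising the threshold can only remove cases from the sum, so SER at 0.3 is never larger than SER at 0.055, which matches the ordering of the two `Ser` values printed above.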


In [13]:

# As in the previous cell, the phi relevance function uses its default settings
# (rel_method, rel_xtrm_type, rel_coef and ctrl_pts are not passed).

# Calculate aer_t. The threshold is defined by the user; here it is 0.055.
threshold = .055
aer_t = rm.aer_t(y_test, y_pred, threshold)
print("Aer("+str(threshold)+"):", aer_t)

# Calculate AERA (tech = 'absolute') with the default weight (None).
aera = rm.era(y_test, y_pred, tech='absolute')
print("Aera:", aera)

# Calculate AERA with weight = 'phi'.
aera = rm.era(y_test, y_pred, tech='absolute', weight='phi')
print("Aera(phi):", aera)

# Calculate AERA with weight = 'threshold'.
aera = rm.era(y_test, y_pred, tech='absolute', weight='threshold')
print("Aera(thres):", aera)

# Calculate AERA with weight = 'density'.
aera = rm.era(y_test, y_pred, tech='absolute', weight='density')
print("Aera(density):", aera)



Aer(0.055): 3691987.9800428455
Aera: 2358140.0752145112
Aera(phi): 1886574.0988630585
Aera(thres): 943458.6663679656
Aera(density): 1814534.628494367


## Conclusion

In this example, we used our package, "ImbalanceMetrics", to evaluate a linear regression model with evaluation metrics specifically designed for imbalanced domains.

The package includes several evaluation metrics that address the challenges of imbalanced domains. They can provide a more informative assessment of model performance than traditional metrics, which can be misleading when the target distribution is imbalanced.

To learn more about our package, please refer to the documentation, which includes detailed descriptions of all the available metrics and their usage.

