The scatterplot in the article (Figure 4) shows the distribution of the dataset after the anomalies have been removed. I used the Interquartile Range (IQR) method, also known as the fence rule or Tukey's method, to identify and remove outliers. This technique calculates the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile), then computes the IQR as Q3 minus Q1. The lower fence is Q1 minus k times the IQR, and the upper fence is Q3 plus k times the IQR, where k is a multiplier. Any value below the lower fence or above the upper fence is considered a potential outlier. The most commonly used multiplier is 1.5. Note that the detect_outliers_iqr function below defaults to k=1.0 and is called without overriding it, so the fences applied in this notebook are tighter than the conventional 1.5 rule. The function checks every numeric column and drops a row if it is flagged in any of them.

Our dataset initially contained 545 observations. After applying the IQR method, 438 remain, so 107 rows (about 20%) were removed. That is a substantial share of the data, but for the purpose of this article I am focused on building a well-fitted model, so let's proceed with the 438 observations that remain.

I saved the cleaned dataset as Housing_cleaned.csv for use in the Multiple_Linear_Regression file.
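
The fence arithmetic described above can be checked on a handful of numbers. The following is a minimal sketch, not the notebook's own code: the helper name tukey_fences and the toy values are made up for illustration, and quartiles use pandas' default linear interpolation.

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values, chosen so that the multiplier changes the result
toy = [10, 12, 13, 15, 16, 18, 19, 27, 60]

for k in (1.5, 1.0):
    lower, upper = tukey_fences(toy, k)
    flagged = [v for v in toy if v < lower or v > upper]
    print(f"k={k}: lower={lower}, upper={upper}, flagged={flagged}")
```

Here Q1 is 13 and Q3 is 19, so the IQR is 6. With k=1.5 the fences are 4 and 28 and only 60 is flagged. With k=1.0 the fences tighten to 7 and 25, and 27 is flagged as well.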

In [9]:
import pandas as pd  
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np  
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [10]:
# Read the CSV file
housing_data = pd.read_csv('Housing.csv')

# How many rows and columns are in the dataset?
housing_data.shape

(545, 13)

In [8]:
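# Detect outliers in numeric columns of a pandas DataFrame using the IQR method.

def detect_outliers_iqr(df, numeric_cols=None, k=1.0):
    """Flag rows outside [Q1 - k*IQR, Q3 + k*IQR] in any numeric column.

    Returns (outlier_indices, outlier_summary_df, clean_df). The default
    k=1.0 is stricter than Tukey's conventional 1.5.
    """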
    # If no columns are specified, use all numeric columns
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    
    outlier_indices = {}
    outlier_summary = []
    
    # Calculate outliers for each numeric column
    for col in numeric_cols:
        # Calculate Q1, Q3, and IQR
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        
        # Define outlier boundaries
        lower_bound = Q1 - k * IQR
        upper_bound = Q3 + k * IQR
        
        # Find outlier indices
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index.tolist()
        outlier_indices[col] = outliers
        
        # Calculate outlier statistics
        n_outliers = len(outliers)
        pct_outliers = 100 * n_outliers / len(df)
        
        # Add to summary
        outlier_summary.append({
            'column': col,
            'lower_bound': lower_bound,
            'upper_bound': upper_bound,
            'num_outliers': n_outliers,
            'pct_outliers': pct_outliers
        })
    
    # Create summary DataFrame
    outlier_summary_df = pd.DataFrame(outlier_summary)
    
    # Get all unique outlier indices across columns
    all_outliers = sorted({idx for col_outliers in outlier_indices.values() for idx in col_outliers})
    
    # Create clean DataFrame with outliers removed
    clean_df = df.drop(all_outliers)
    
    return outlier_indices, outlier_summary_df, clean_df
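
Because the function drops a row when it is flagged in any numeric column, the share of rows removed can exceed the outlier percentage of each individual column. The sketch below illustrates this with a vectorized boolean mask. It is an alternative formulation rather than the function used in this notebook, and the helper name iqr_outlier_rows and the toy DataFrame are made up for illustration.

```python
import numpy as np
import pandas as pd

def iqr_outlier_rows(df, k=1.5):
    """Boolean Series: True where a row falls outside the fences in ANY numeric column."""
    num = df.select_dtypes(include=[np.number])
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on column names
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.any(axis=1)

# Made-up data: one extreme value in each numeric column, on different rows
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "b": [50, 10, 11, 12, 13, 14, 15, 16, 17],
    "label": list("xxxxxxxxx"),  # non-numeric column, ignored by the check
})

mask = iqr_outlier_rows(toy_df)
print("rows flagged:", int(mask.sum()), "of", len(toy_df))
print(toy_df[~mask])
```

Each numeric column has a single value outside its fences, but the two values sit on different rows, so two of the nine rows are dropped.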

In [11]:
# Take a look at our new dataset (the third return value is the cleaned DataFrame)
_, _, ndf = detect_outliers_iqr(housing_data)
ndf.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
36,8043000,7482,3,2,3,yes,no,no,yes,no,1,yes,furnished
40,7875000,6550,3,1,2,yes,no,yes,no,yes,0,yes,furnished
45,7560000,6000,3,2,3,yes,no,no,no,yes,0,no,semi-furnished
48,7455000,4300,3,2,2,yes,no,yes,no,no,1,no,unfurnished
49,7420000,7440,3,2,1,yes,yes,yes,no,yes,0,yes,semi-furnished


In [93]:
# Save the cleaned dataset to a CSV file for further use
ndf.to_csv("Housing_cleaned.csv", index=False)
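
One consequence of index=False is that the original row labels (36, 40, 45, ...) are not written to the file, so a later read assigns a fresh 0-based index. The sketch below shows this round trip on three prices taken from the preview above. It writes to an in-memory buffer instead of a real file, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# Three rows from the preview above, keeping their original index labels
sample = pd.DataFrame({"price": [8043000, 7875000, 7560000]}, index=[36, 40, 45])

buf = io.StringIO()              # in-memory stand-in for a file on disk
sample.to_csv(buf, index=False)  # same option as used for the cleaned dataset
buf.seek(0)

reloaded = pd.read_csv(buf)
print(reloaded.index.tolist())   # fresh 0-based index, original labels are gone
print(reloaded.columns.tolist())
```

If the original row labels are needed downstream, writing with the default index=True and reading back with index_col=0 would preserve them.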