# Importing all the required libraries
Pandas: Pandas is a powerful data analysis and manipulation library in Python, providing data structures and tools for working with structured data.

NumPy: NumPy is a fundamental package for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

scikit-learn: scikit-learn is a machine learning library in Python that provides simple and efficient tools for data mining and data analysis. It features various classification, regression, and clustering algorithms, along with tools for model selection and evaluation.

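LabelEncoder: LabelEncoder is a utility class from scikit-learn that maps each distinct category to an integer. Its documentation describes it as intended for target labels; in this notebook it is applied to feature columns, which gives the categories an arbitrary numeric order.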

train_test_split: train_test_split is a function from scikit-learn that splits datasets into training and testing subsets, which is essential for evaluating model performance.

LinearRegression: LinearRegression is a class from scikit-learn used to fit a linear regression model and make predictions.

mean_squared_error: mean_squared_error is a metric from scikit-learn for evaluating the mean squared error between actual and predicted values in regression tasks.

r2_score: r2_score is a metric from scikit-learn for evaluating the goodness of fit of a regression model, providing the R-squared coefficient.

In [159]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


# Load And Read The Dataset
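This code reads the CSV file into a DataFrame df and computes the total number of missing values (NaN) in the entire dataset. Because that expression is not the last line of the cell, the notebook does not display its result; the non-null counts reported by info() show that no column has missing values.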

The info() method provides a concise summary of the DataFrame, including the number of non-null values in each column and the data types.

In [160]:
df=pd.read_csv(r"C:\Users\Satya Kilani\OneDrive\Documents\hunarintern\task2\house price data.csv")
df.isna().sum().sum()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float
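
As a small illustration of the missing-value check, the sketch below uses a hypothetical toy DataFrame (not the house price data) to show the difference between the per-column counts from isna().sum() and the single total from isna().sum().sum(). The column values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the house price dataset
toy = pd.DataFrame({
    "price": [313000.0, np.nan, 420000.0],
    "bedrooms": [3.0, 3.0, np.nan],
    "city": ["Seattle", None, "Kent"],
})

# One count per column (a Series indexed by column name)
missing_per_column = toy.isna().sum()

# A single number for the whole DataFrame
missing_total = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", missing_total)
```

Wrapping the total in print() (or placing it on the last line of a cell) makes sure the notebook actually shows it.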

# Checking for duplicate rows and columns
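This code checks the DataFrame for duplicate rows and for duplicate column names. df.duplicated() flags rows that repeat an earlier row exactly, and df.columns.duplicated() flags column labels that appear more than once (it compares the labels, not the values stored in the columns).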

In [161]:
duplicate_rows = df[df.duplicated()]
if not duplicate_rows.empty:
    print("Duplicate rows found:")
    print(duplicate_rows)
else:
    print("No duplicate rows found.")
duplicates = df.columns[df.columns.duplicated()]
if len(duplicates) > 0:
    print("Duplicate columns found:")
    print(duplicates)
else:
    print("No duplicate columns found.")

No duplicate rows found.
No duplicate columns found.
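
The following sketch, on a hypothetical toy DataFrame, shows what each check catches. It also adds one check that is not part of the notebook: transposing the frame and calling duplicated() to find columns whose contents are identical even though their names differ.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0, and "a_copy" repeats column "a"
toy = pd.DataFrame({
    "a": [1, 2, 1],
    "b": [10, 20, 10],
    "a_copy": [1, 2, 1],
})

# True for every row that exactly repeats an earlier row
dup_row_mask = toy.duplicated()

# True for every column label that repeats an earlier label (names only)
dup_name_mask = toy.columns.duplicated()

# Extra check (an addition, not in the notebook): columns with identical contents
dup_content_mask = toy.T.duplicated()

print(toy[dup_row_mask])
print(list(toy.columns[dup_name_mask]))
print(list(toy.columns[dup_content_mask]))
```

The transpose-based check can be slow on wide or mixed-type data, so it is best treated as an optional diagnostic.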


# Removing the Irrelevant columns


In [162]:
irrelevant = ['date', 'waterfront', 'country']
df_irr = df.drop(columns=irrelevant)
df_irr.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          4600 non-null   float64
 1   bedrooms       4600 non-null   float64
 2   bathrooms      4600 non-null   float64
 3   sqft_living    4600 non-null   int64  
 4   sqft_lot       4600 non-null   int64  
 5   floors         4600 non-null   float64
 6   view           4600 non-null   int64  
 7   condition      4600 non-null   int64  
 8   sqft_above     4600 non-null   int64  
 9   sqft_basement  4600 non-null   int64  
 10  yr_built       4600 non-null   int64  
 11  yr_renovated   4600 non-null   int64  
 12  street         4600 non-null   object 
 13  city           4600 non-null   object 
 14  statezip       4600 non-null   object 
dtypes: float64(4), int64(8), object(3)
memory usage: 539.2+ KB


# Checking the total number of missing values (NaN) in the modified DataFrame df_irr
This code sums the missing values across all columns in df_irr. It verifies that no missing values remain after removing the irrelevant columns ('date', 'waterfront', 'country').
The result is 0, so df_irr contains no NaN values.

In [163]:
df_irr.isna().sum().sum()

0

# Using LabelEncoder for categorical variables
This code uses LabelEncoder from scikit-learn to replace the categorical columns street, statezip, and city in df_irr with integer labels. Each column is encoded in its own block of statements; the same steps could be applied to all three columns in a loop.

In [164]:
# Label-encode street (fit_transform refits the encoder on each column)
le = LabelEncoder()
df_irr['street'] = le.fit_transform(df_irr['street'])

# Label-encode statezip
df_irr['statezip'] = le.fit_transform(df_irr['statezip'])

# Label-encode city
df_irr['city'] = le.fit_transform(df_irr['city'])


In [165]:
df_irr.shape

(4600, 15)
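
One way to encode several columns systematically is to loop over them and keep one fitted encoder per column. The sketch below is an illustration on a hypothetical toy DataFrame; the encoders dictionary and the sample values are assumptions made for the example, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with the same three categorical column names
toy = pd.DataFrame({
    "street": ["9 Elm St", "1 Oak Ave", "9 Elm St"],
    "city": ["Seattle", "Kent", "Seattle"],
    "statezip": ["WA 98133", "WA 98042", "WA 98133"],
})

# Keep one fitted encoder per column so the labels can be decoded later
encoders = {}
for col in ["street", "city", "statezip"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)

# Map integer labels back to the original city names
print(encoders["city"].inverse_transform([0, 1]))
```

Storing the encoders separately avoids the situation in the cell above, where a single encoder is refitted and only the mapping for the last column remains available.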

# Removing the outliers using Z-SCORE
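The function remove_outliers removes outliers from df_irr using the Z-score method. It standardizes every column (subtracting the column mean and dividing by the column standard deviation) and keeps only the rows whose absolute Z-score is below the threshold (3 by default) in every column. Because it is applied to the whole DataFrame, the target column price and the label-encoded columns are screened as well.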

In [166]:
def remove_outliers(df_irr, threshold=3):
    z_scores = np.abs((df_irr - df_irr.mean()) / df_irr.std())
    return df_irr[(z_scores < threshold).all(axis=1)]

# Remove outliers from the entire dataset
data_no_outliers = remove_outliers(df_irr)

# Check the shape of the original and cleaned datasets
print(f"Original data shape: {df_irr.shape}")
print(f"Cleaned data shape: {data_no_outliers.shape}")


Original data shape: (4600, 15)
Cleaned data shape: (4241, 15)
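
To see the Z-score filter at work on something small, here is a sketch with a hypothetical 20-row DataFrame in which one sqft_living value is far from the rest. The data are invented; the filtering logic mirrors remove_outliers above. Note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 19 ordinary rows and one extreme sqft_living value
toy = pd.DataFrame({
    "sqft_living": [1000 + 10 * i for i in range(19)] + [10000],
    "bedrooms": [3.0] * 10 + [4.0] * 10,
})

# Absolute Z-score of every cell, computed column by column
z_scores = np.abs((toy - toy.mean()) / toy.std())

# Keep a row only if all of its columns are within 3 standard deviations
keep = (z_scores < 3).all(axis=1)
cleaned = toy[keep]

print(z_scores.tail(3))
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Only the last row exceeds the threshold here, so it is the only one dropped. On very small samples the largest attainable Z-score is limited by the sample size, which is why the toy frame uses 20 rows rather than a handful.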


# Separate independent variables (features) and dependent variable (target)

In [167]:
x = data_no_outliers.drop(columns=['price'])
y=data_no_outliers['price']

In [168]:
x

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip
0,3.0,1.50,1340,7912,1.5,0,3,1340,0,1955,2005,1522,36,62
2,3.0,2.00,1930,11947,1.0,0,4,1930,0,1966,0,2291,18,26
3,3.0,2.25,2000,8030,1.0,0,4,1000,1000,1963,0,4263,3,7
4,4.0,2.50,1940,10500,1.0,0,4,1140,800,1976,1992,4352,31,31
5,2.0,1.00,880,6380,1.0,0,3,880,0,1938,1994,3521,35,54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,3.0,1.75,1510,6360,1.0,0,4,1510,0,1954,1979,3447,35,62
4596,3.0,2.50,1460,7573,2.0,0,3,1460,0,1983,2009,960,3,6
4597,3.0,2.50,3010,7014,2.0,0,3,3010,0,2009,0,4033,32,37
4598,4.0,2.00,2090,6630,1.0,0,3,1070,1020,1974,0,3498,35,71


In [169]:
y

0       313000.000000
2       342000.000000
3       420000.000000
4       550000.000000
5       490000.000000
            ...      
4595    308166.666667
4596    534333.333333
4597    416904.166667
4598    203400.000000
4599    220600.000000
Name: price, Length: 4241, dtype: float64

# Splitting the data into training and testing sets

In [170]:
x_train, x_test, y_train, y_test = train_test_split(x ,y, test_size=0.3, random_state=42)

In [171]:
x_train,x_test

(      bedrooms  bathrooms  sqft_living  sqft_lot  floors  view  condition  \
 3421       4.0       1.50         1770      5750     2.0     0          3   
 2697       2.0       1.00         1230      3800     1.0     0          3   
 3619       3.0       2.50         1700      9100     1.0     0          3   
 3888       3.0       4.50         3850     62726     2.0     0          3   
 3059       3.0       2.00         1200      5029     1.0     0          3   
 ...        ...        ...          ...       ...     ...   ...        ...   
 3725       4.0       1.75         1700      5846     1.0     0          3   
 499        4.0       2.25         1800      8623     1.0     0          4   
 3340       4.0       2.50         3920     12415     2.0     0          3   
 4081       4.0       2.25         1950      9800     1.0     0          3   
 919        3.0       1.75         2200      7706     2.0     2          3   
 
       sqft_above  sqft_basement  yr_built  yr_renovated  stre

In [172]:
y_train,y_test

(3421    440000.0
 2697    435000.0
 3619    475000.0
 3888    915000.0
 3059    420000.0
           ...   
 3725    616000.0
 499     522000.0
 3340    770000.0
 4081    560000.0
 919     485000.0
 Name: price, Length: 2968, dtype: float64,
 1461    460000.00
 1555    950000.00
 2698    465950.00
 1213    660000.00
 4496    351250.00
           ...    
 3420    670000.00
 651     317000.00
 610     287600.00
 360     868500.00
 4446    667781.25
 Name: price, Length: 1273, dtype: float64)
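
The subset sizes shown above (2968 training rows and 1273 test rows out of 4241) follow from test_size=0.3. The sketch below reproduces those sizes with placeholder arrays of the same length and also shows that a fixed random_state makes the split repeatable. X_demo and y_demo are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as data_no_outliers
X_demo = np.arange(4241).reshape(-1, 1)
y_demo = np.arange(4241)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Repeating the call with the same random_state gives the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```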

# Create the Linear Regression model

We create a LinearRegression model, fit it to the training data, and then use it to predict prices for the testing data.

In [173]:
# Create the model
model = LinearRegression()

# Fit the model to the training data
model.fit(x_train, y_train)

# Predict on the testing data
y_pred = model.predict(x_test)
y_pred

array([494237.86682449, 973946.79127781, 541865.57981414, ...,
       486025.09943605, 752721.05548426, 644200.48093431])
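
As a sanity check of the fit/predict workflow, the sketch below fits LinearRegression to a tiny synthetic dataset generated from a known line, y = 2x + 1, so the learned coefficient and intercept can be compared with the true values. The data and the names X_demo, y_demo, and demo_model are invented for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2x + 1 (no noise)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo[:, 0] + 1.0

# Fit the model, then inspect what it learned
demo_model = LinearRegression().fit(X_demo, y_demo)
print("coefficient:", demo_model.coef_)
print("intercept:", demo_model.intercept_)

# Predict for a value that was not in the training data
pred = demo_model.predict(np.array([[5.0]]))
print("prediction for x=5:", pred)
```

On the house price model, the same coef_ and intercept_ attributes of model show how much each feature contributes to the predicted price.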

# Interpretation of Metrics:
Mean Absolute Error (MAE): This measures the average magnitude of the errors in a set of predictions. A MAE close to zero indicates very accurate predictions.

Mean Squared Error (MSE): This penalizes larger errors more than smaller ones because the errors are squared. A very low MSE suggests extremely accurate predictions.

Root Mean Squared Error (RMSE): This is the square root of MSE and provides an error metric in the same units as the target variable. A low RMSE indicates accurate predictions.

R-squared (R²): This represents the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² of 1.0 would mean the model explains 100% of the variability in the dependent variable (a perfect fit), while an R² near 0 means it does little better than always predicting the mean.
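
The definitions above can be checked by computing each metric by hand with NumPy and comparing the results with scikit-learn. The sketch uses four invented price/prediction pairs; y_true and y_hat are hypothetical values, not outputs of the model in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([300000.0, 450000.0, 500000.0, 750000.0])
y_hat = np.array([320000.0, 400000.0, 560000.0, 700000.0])

errors = y_true - y_hat

# MAE: average absolute error
mae_manual = np.mean(np.abs(errors))

# MSE: average squared error; RMSE: its square root, in the units of the target
mse_manual = np.mean(errors ** 2)
rmse_manual = np.sqrt(mse_manual)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
print(rmse_manual)
print(r2_manual, r2_score(y_true, y_hat))
```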

In [174]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R²): {r2}")


Mean Absolute Error (MAE): 135163.15154323605
Mean Squared Error (MSE): 36837439417.24509
Root Mean Squared Error (RMSE): 191930.81935229967
R-squared (R²): 0.46149138365687026


Based on these metrics, the model is only a moderate fit. The R² of about 0.46 means the linear model explains roughly 46% of the variance in price on the test set, far from the 1.0 of a perfect fit. The errors are also large in absolute terms: the MAE is about 135,000 and the RMSE about 192,000, in the same units as price. The RMSE being well above the MAE suggests that a number of predictions miss by a wide margin.