### Linear Regression
Linear regression is a statistical method for modeling the relationship between a dependent variable (often called the outcome or target) and one or more independent variables (predictors). With a single predictor, the aim is to find the best-fitting straight line (the regression line) through the data points; with several predictors, as in the example below, the same idea extends to a best-fitting plane or hyperplane. The fitted model can then be used to predict the dependent variable from the independent variables.
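
To make "best-fitting" concrete: ordinary least squares chooses the intercept and slope that minimize the sum of squared differences between the predicted and observed values. The following is a minimal sketch with NumPy; the toy values are made up for illustration and are not the dataset used in the steps below.

```python
import numpy as np

# Toy data (hypothetical values chosen to lie exactly on y = 2x + 1)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Design matrix: a column of ones for the intercept, then the predictor
A = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem: minimize ||A @ beta - y||^2
beta, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Adding more predictors only adds more columns to the design matrix; the minimization itself is unchanged.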

#### Step 1: Creating a Simple Dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [2]:
# Create a simple dataset
data = {
    'Size': [1500, 1600, 1700, np.nan, 1800, 1900, 2000, 2100, 2200, 2300],
    'Bedrooms': [3, 3, 2, 4, 3, 3, 2, 2, np.nan, 3],
    'Age': [10, 15, 20, 5, 10, 20, 30, 25, 15, 10],
    'Price': [300000, 320000, 340000, 360000, 380000, 400000, 420000, 440000, 460000, 480000]
}

df = pd.DataFrame(data)

In [3]:
# Display the initial dataset
print("Initial Dataset:")
print(df)

Initial Dataset:
     Size  Bedrooms  Age   Price
0  1500.0       3.0   10  300000
1  1600.0       3.0   15  320000
2  1700.0       2.0   20  340000
3     NaN       4.0    5  360000
4  1800.0       3.0   10  380000
5  1900.0       3.0   20  400000
6  2000.0       2.0   30  420000
7  2100.0       2.0   25  440000
8  2200.0       NaN   15  460000
9  2300.0       3.0   10  480000


#### Step 2: Data Preprocessing

In [4]:
# Handling missing values: replace each NaN with the mean of its column
imputer = SimpleImputer(strategy='mean')
df[['Size', 'Bedrooms']] = imputer.fit_transform(df[['Size', 'Bedrooms']])
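
As a quick check of what mean imputation does, the sketch below applies `SimpleImputer` to the `Size` column alone and confirms that the missing entry is replaced by the mean of the nine observed values. The variable names `sizes` and `filled` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The Size column from Step 1, with its single missing value at index 3
sizes = pd.DataFrame(
    {'Size': [1500, 1600, 1700, np.nan, 1800, 1900, 2000, 2100, 2200, 2300]}
)

# Mean of the observed (non-missing) values
observed_mean = sizes['Size'].mean()  # pandas skips NaN by default

# fit_transform returns a 2D NumPy array with NaN replaced by that mean
filled = SimpleImputer(strategy='mean').fit_transform(sizes)

print(observed_mean)
print(filled[3, 0])
```

Both printed values should be the same number, which is why row 3 of the preprocessed dataset sits at exactly the column mean.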

In [5]:
# Standardizing the features (zero mean, unit variance)
scaler = StandardScaler()
df[['Size', 'Bedrooms', 'Age']] = scaler.fit_transform(df[['Size', 'Bedrooms', 'Age']])

In [6]:
# Display the preprocessed dataset
print("\nPreprocessed Dataset:")
print(df)


Preprocessed Dataset:
       Size  Bedrooms       Age   Price
0 -1.632993  0.372678 -0.816497  300000
1 -1.224745  0.372678 -0.136083  320000
2 -0.816497 -1.304373  0.544331  340000
3  0.000000  2.049729 -1.496910  360000
4 -0.408248  0.372678 -0.816497  380000
5  0.000000  0.372678  0.544331  400000
6  0.408248 -1.304373  1.905159  420000
7  0.816497 -1.304373  1.224745  440000
8  1.224745  0.000000 -0.136083  460000
9  1.632993  0.372678 -0.816497  480000
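
The scaled values above can be reproduced by hand. `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`). The sketch below checks this for the imputed `Size` column; `z_manual` and `z_sklearn` are names introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Size column after mean imputation (the missing value became 1900)
size = np.array(
    [1500, 1600, 1700, 1900, 1800, 1900, 2000, 2100, 2200, 2300], dtype=float
)

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (size - size.mean()) / size.std(ddof=0)

# StandardScaler expects a 2D array, hence the reshape
z_sklearn = StandardScaler().fit_transform(size.reshape(-1, 1)).ravel()

print(np.round(z_manual, 6))
print(np.allclose(z_manual, z_sklearn))
```

The first entry comes out to about -1.632993, matching row 0 of the `Size` column in the preprocessed dataset.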


#### Step 3: Applying Linear Regression

In [9]:
# Splitting the dataset into training and testing sets
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# Creating and training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

In [11]:
# Making predictions
y_pred = model.predict(X_test)
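
Note that the steps above impute and scale the full dataset before splitting it, so the test rows contribute to the means and standard deviations used in preprocessing. A common alternative is to split first and let a scikit-learn `Pipeline` fit the preprocessing on the training rows only. The following is a sketch of that alternative, not the approach used in this notebook; `Pipeline` and the names `raw`, `pipe`, and `pipe_pred` are additions for illustration, and the predictions will differ slightly from the ones reported below.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same raw data as in Step 1, before any preprocessing
raw = pd.DataFrame({
    'Size': [1500, 1600, 1700, np.nan, 1800, 1900, 2000, 2100, 2200, 2300],
    'Bedrooms': [3, 3, 2, 4, 3, 3, 2, 2, np.nan, 3],
    'Age': [10, 15, 20, 5, 10, 20, 30, 25, 15, 10],
    'Price': [300000, 320000, 340000, 360000, 380000,
              400000, 420000, 440000, 460000, 480000],
})
X_raw = raw[['Size', 'Bedrooms', 'Age']]
y_raw = raw['Price']

# Split first so the test rows stay unseen during preprocessing
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Imputation and scaling statistics are learned from the training rows only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('regress', LinearRegression()),
])
pipe.fit(Xr_train, yr_train)

# At prediction time the test rows are transformed with the training statistics
pipe_pred = pipe.predict(Xr_test)
print(pipe_pred)
```

With only ten rows the practical difference is small, but the pattern matters on real data, where leaking test information into preprocessing can make evaluation scores look better than they are.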

#### Step 4: Evaluating the Model

In [16]:
# Calculating R-squared
r2 = r2_score(y_test, y_pred)

print("\nModel Evaluation:")
print(f"R-squared: {r2}")


Model Evaluation:
R-squared: 0.9991530636369441


In [17]:
# Displaying the actual vs predicted values
print("\nActual vs Predicted:")
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)


Actual vs Predicted:
   Actual      Predicted
8  460000  457119.063696
1  320000  319986.495611
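
R-squared is unitless, so it helps to pair it with an error measured in the units of the target. `mean_squared_error` is imported in the first cell but not used above; the sketch below applies it to the two test predictions, with the values copied from the "Actual vs Predicted" output. The names `actual`, `predicted`, `mse`, and `rmse` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Values copied from the "Actual vs Predicted" output above
actual = np.array([460000, 320000])
predicted = np.array([457119.063696, 319986.495611])

# Mean squared error, and its square root in the same units as Price
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R-squared: {r2_score(actual, predicted):.6f}")
```

Because the test set holds only two rows, these figures are a rough indication at best; a larger dataset or cross-validation would give a more reliable estimate.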
