# Assignment 1 
## Linear Regression Analysis on USA Housing Dataset

DSCI 6601: Practical Machine Learning

Student: Sahil Khan

Date: 01-Oct-2024
_____________________________________________________________________________________________________________________________________________________________________________________________

### Introduction
In this assignment, we will analyze the USA Housing dataset using linear regression. The objective is to build a linear regression model, evaluate its performance, and explore how the features influence the predictions. We will perform the tasks listed in the assignment and include our Python code, comments explaining each step, and the corresponding outputs for each question. The dataset can be downloaded from the following link: https://www.kaggle.com/datasets/vedavyasv/usa-housing.
_____________________________________________________________________________________________________________________________________________________________________________________________

### IMPORTING LIBRARIES

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

_____________________________________________________________________________________________________________________________________________________________________________________________

## Task 1
### i) Model Fitting and Evaluation
- Splitting data into a training set (80%) and a testing set (20%).
- Building linear regression model to fit the training data.

In [2]:
housingData = pd.read_csv('data/USA_Housing.csv')   # Loading the data


X = housingData.drop(['Price', 'Address'], axis=1)  # Select all columns except 'Price' and 'Address'
y = housingData['Price']    # Select the target column 'Price' by name rather than by position

# Data split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


We removed the 'Address' column from the feature set for the following reasons:
- It is the only column with string data, and LinearRegression requires numeric inputs, so it cannot be used without additional encoding.
- The absence of postal codes in the 'Address' limits its usefulness, as postal codes could significantly impact house price predictions.
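
As a minimal sketch of how the non-numeric column could be identified programmatically, the snippet below uses a small toy frame. The values are made up for illustration, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with made-up values (an assumption for illustration only).
toy = pd.DataFrame({
    "Avg. Area Income": [60000.0, 70000.0, 80000.0],
    "Price": [1000000.0, 1200000.0, 1400000.0],
    "Address": ["addr A", "addr B", "addr C"],
})

# Columns with a numeric dtype can be passed to LinearRegression directly.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else would need encoding or has to be dropped.
non_numeric_cols = [c for c in toy.columns if c not in numeric_cols]
print("Non-numeric columns:", non_numeric_cols)
```

Running the same check on `housingData` would be one way to confirm which columns need to be dropped before fitting.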

In [3]:
# Linear regression model
housingModel = LinearRegression()
housingModel.fit(X_train, y_train)

Now the model is successfully fitted using the training data.
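
To illustrate what fitting means here, the following sketch solves the same ordinary least squares problem with `np.linalg.lstsq` and compares the result to `LinearRegression`. The data is synthetic and the names `X_demo`, `y_demo`, and `beta` are hypothetical, not part of the assignment code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): 3 features, known coefficients, small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Least squares by hand: prepend a column of ones so the first
# entry of beta plays the role of the intercept.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# A prediction is the intercept plus the weighted sum of the features.
manual_pred = model.intercept_ + X_demo @ model.coef_

print("sklearn intercept/coefs:", model.intercept_, model.coef_)
print("lstsq   intercept/coefs:", beta[0], beta[1:])
```

Both approaches minimize the same sum of squared residuals, so the intercept and coefficients should agree up to floating-point error.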

### ii) Reporting Coefficients and Model Evaluation
Now we retrieve the model's coefficients and intercept. We also calculate the R2 score to evaluate the model's performance on the test data.

In [4]:
# Model coefficients and intercept
print("RESULT:")
print("Intercept:", housingModel.intercept_)
print("Coefficients:", housingModel.coef_)

y_pred = housingModel.predict(X_test)

# Evaluate model using R2 score
r2Score = r2_score(y_test, y_pred)
print("R2 Score:", r2Score)

RESULT:
Intercept: -2635072.900931213
Coefficients: [2.16522058e+01 1.64666481e+05 1.19624012e+05 2.44037761e+03
 1.52703134e+01]
R2 Score: 0.9179971706834602


An R2 score of about 0.92 means that roughly 92% of the variance in house prices in the test set is explained by the model, indicating a strong fit.
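
For reference, the R2 score is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. The sketch below computes it by hand on a few made-up values and compares it to `r2_score`. The arrays `y_true` and `y_hat` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("Manual R2: ", r2_manual)
print("sklearn R2:", r2_score(y_true, y_hat))
```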
_____________________________________________________________________________________________________________________________________________________________________________________________

## Task 2
### Predictions on Sample Data
Now we randomly select 20 samples from the test set. Using the fitted model, we make predictions for these inputs, compare the predicted values to the actual values, and present the results in a table.

In [5]:
# Randomly select 20 samples (no seed is set, so the selected rows change on each run)
sampleIndices = np.random.choice(X_test.index, 20, replace=False)
X_sample = X_test.loc[sampleIndices]    # Sample Test Data
y_actual = y_test.loc[sampleIndices]    # Corresponding Actual Target data


y_predicted = housingModel.predict(X_sample)


# Compare predicted and actual values
table = pd.DataFrame({
    'Actual': y_actual.round(2),
    'Predicted': np.round(y_predicted, 2),
    'Diff': (y_actual - y_predicted).round(2),
})
print(table)

          Actual   Predicted       Diff
4615  1434835.95  1484464.02  -49628.06
1718  1251688.62  1256991.46   -5302.85
393   1638265.39  1621289.95   16975.44
3496   561703.77   722156.79 -160453.02
95    1840236.01  1598770.08  241465.93
144   1230149.14  1253548.56  -23399.42
927   1626941.79  1499196.39  127745.39
296    984311.77   962610.88   21700.89
1108  1423296.12  1457954.83  -34658.71
3970  1271518.96  1060076.43  211442.54
3375  1311389.58  1374687.26  -63297.69
2119  1111580.10  1103733.14    7846.96
1553   541953.91   593301.15  -51347.25
4231  1794014.30  1563758.96  230255.33
2710   783350.67   967240.33 -183889.65
1973  1608726.68  1516813.82   91912.86
2707  1439431.45  1560733.67 -121302.22
2195  1014093.31   998979.74   15113.57
1615   493350.03   432977.16   60372.87
239   1690091.02  1691529.14   -1438.12


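The table above compares the predicted prices to the actual prices for the 20 randomly selected test samples. The absolute differences range from about 1,400 to about 241,000 on prices between roughly 0.49 and 1.84 million, which is consistent with the high R2 score.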
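
To put a number on the differences, the sketch below summarizes the 20 rows of the table with a mean absolute error and a mean absolute percentage error. The values are copied from the table above. The variable names are hypothetical, and the choice of these two metrics is an assumption rather than part of the assignment.

```python
import numpy as np

# Actual and predicted prices copied from the comparison table.
actual = np.array([
    1434835.95, 1251688.62, 1638265.39, 561703.77, 1840236.01,
    1230149.14, 1626941.79, 984311.77, 1423296.12, 1271518.96,
    1311389.58, 1111580.10, 541953.91, 1794014.30, 783350.67,
    1608726.68, 1439431.45, 1014093.31, 493350.03, 1690091.02,
])
predicted = np.array([
    1484464.02, 1256991.46, 1621289.95, 722156.79, 1598770.08,
    1253548.56, 1499196.39, 962610.88, 1457954.83, 1060076.43,
    1374687.26, 1103733.14, 593301.15, 1563758.96, 967240.33,
    1516813.82, 1560733.67, 998979.74, 432977.16, 1691529.14,
])

abs_err = np.abs(actual - predicted)  # size of each miss
pct_err = abs_err / actual * 100      # miss relative to the actual price

print(f"Mean absolute error: {abs_err.mean():,.2f}")
print(f"Largest absolute error: {abs_err.max():,.2f}")
print(f"Mean absolute percentage error: {pct_err.mean():.2f}%")
```

Because only 20 rows are sampled, these summaries will vary from run to run. Computing them on the full test set would give a more stable picture.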
_____________________________________________________________________________________________________________________________________________________________________________________________

## Task 3
### Feature Importance and Model Adjustment
Based on the absolute values of the model's coefficients, we rank the features by importance. We then drop the lowest-ranked feature and refit the model using the remaining features. Finally, we recalculate the R2 score for the new model and compare it to the original R2 score.

In [6]:
# Ranking features by importance
featureRank = pd.DataFrame({'Feature': X.columns, 'Coefficient': np.abs(housingModel.coef_)})   # taking absolute value of coefficient because it can be -ve or +ve.
featureRank = featureRank.sort_values(by='Coefficient', ascending=False)    # sorting ranks
print("Feature Importance:\n", featureRank)

# Dropping the least important feature
leastRank = featureRank['Feature'].iloc[-1]
X_train_new = X_train.drop(columns=[leastRank]) # new train data
X_test_new = X_test.drop(columns=[leastRank])   # new test data

# Refit the model without the least important feature
housingModelNew = LinearRegression()
housingModelNew.fit(X_train_new, y_train)

# Recalculate R2 score
r2_score_new = housingModelNew.score(X_test_new, y_test)
print("\nRefined R2 Score:", r2_score_new)
print("Original R2 Score:", r2Score)

Feature Importance:
                         Feature    Coefficient
1           Avg. Area House Age  164666.480722
2     Avg. Area Number of Rooms  119624.012232
3  Avg. Area Number of Bedrooms    2440.377611
0              Avg. Area Income      21.652206
4               Area Population      15.270313

Refined R2 Score: 0.7466716127211295
Original R2 Score: 0.9179971706834602
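
The comparison above drops a single feature. A minimal sketch of extending it to every feature in turn is shown below on synthetic data. The column names `income`, `age`, and `rooms`, the coefficients, and the noise level are all assumptions chosen for illustration and are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features on very different scales (assumed values).
rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame({
    "income": rng.normal(68000, 10000, n),
    "age": rng.normal(6, 1, n),
    "rooms": rng.normal(7, 1, n),
})
y_demo = (21 * X_demo["income"] + 160000 * X_demo["age"]
          + 120000 * X_demo["rooms"] + rng.normal(0, 50000, n))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: R2 with all features.
full_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Refit once per feature, leaving that feature out, and record the test R2.
drop_r2 = {}
for col in X_demo.columns:
    reduced = LinearRegression().fit(X_tr.drop(columns=[col]), y_tr)
    drop_r2[col] = reduced.score(X_te.drop(columns=[col]), y_te)

print("Full model R2:", full_r2)
for col, score in sorted(drop_r2.items(), key=lambda kv: kv[1]):
    print(f"R2 without {col}: {score:.4f}")
```

The feature whose removal lowers R2 the most is the one the model relies on most, regardless of how large its raw coefficient is.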


_____________________________________________________________________________________________________________________________________________________________________________________________

#### FINDINGS:

After dropping the 'Area Population' feature, the R2 score decreased from **0.92** to **0.75**. Although 'Area Population' has the smallest raw coefficient, removing it significantly degrades the model's performance, so it carries substantial predictive power. A likely explanation is that the features are on very different scales, which means raw coefficient magnitudes are not directly comparable.
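
One common way to make coefficients comparable is to standardize the features before fitting. The sketch below shows, on synthetic data, how the ranking can flip between raw and standardized coefficients. The feature names, scales, and coefficients are assumptions for illustration, and `StandardScaler` is an addition that the original analysis does not use.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 'income' has a large scale and a small per-unit
# effect, 'age' has a small scale and a large per-unit effect.
rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame({
    "income": rng.normal(68000, 10000, n),
    "age": rng.normal(6, 1, n),
})
y_demo = 21 * X_demo["income"] + 160000 * X_demo["age"] + rng.normal(0, 50000, n)

# Ranking by raw coefficient magnitude (per-unit effects).
raw_model = LinearRegression().fit(X_demo, y_demo)
raw_rank = pd.Series(np.abs(raw_model.coef_), index=X_demo.columns)
raw_rank = raw_rank.sort_values(ascending=False)

# Ranking after scaling each feature to zero mean and unit variance,
# so each coefficient is the effect of a one-standard-deviation change.
X_scaled = StandardScaler().fit_transform(X_demo)
std_model = LinearRegression().fit(X_scaled, y_demo)
std_rank = pd.Series(np.abs(std_model.coef_), index=X_demo.columns)
std_rank = std_rank.sort_values(ascending=False)

print("Raw coefficient ranking:\n", raw_rank)
print("\nStandardized coefficient ranking:\n", std_rank)
```

In this synthetic setup the raw ranking puts `age` first, while the standardized ranking puts `income` first. Applying the same idea to the housing features may give a ranking that agrees better with the observed drop in R2.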

_____________________________________________________________________________________________________________________________________________________________________________________________

### Conclusion
In this analysis, we successfully built and evaluated a linear regression model to predict housing prices. The model demonstrated a good fit, with an R² score of 0.92. We also ranked the features by the absolute value of their coefficients and found that dropping the lowest-ranked feature (Area Population) reduced the R² score to 0.75. This suggests that raw coefficient magnitude is not a reliable measure of importance when features are on different scales.

This assignment showcases the effectiveness of linear regression for predicting housing prices and highlights the importance of assessing feature importance carefully before removing features from a model.