# Conversational AI: Data Science (UCS663)
# House Prices: Advanced Regression Techniques
**Suvrat Arora** <br>
**101903331** <br>
**3CO13** <br>

In this problem we work with the **'House Prices: Advanced Regression Techniques'** dataset, which contains 80 columns. Many values are missing and the data is inconsistent, so Exploratory Data Analysis and Feature Engineering are needed before any model is built. <br>

We'll therefore approach the problem in three steps: <br>
1. Exploratory Data Analysis (EDA) <br>
2. Feature Engineering <br>
3. Model Building and Accuracy Comparison <br>

In [12]:
#Importing Necessary Libraries
import cuml
import cudf
import cupy
import seaborn as sn

In [13]:
#Loading the train and test dataset
train_data = cudf.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data = cudf.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

# Exploratory Data Analysis

In [14]:
#Train Data Shape
train_data.shape

In [15]:
#Test Data Shape
test_data.shape

In [16]:
train_data.head()

In [17]:
train_data.describe()

In [18]:
train_data.info()

**Missing Values:** <br>
First, we'll visualize the missing values in the training dataset using a matrix plot, and then find the percentage of missing values in each column.

In [19]:
# Missing values Visualization using Missingno Matrix Plot
import missingno as msg
msg.matrix(train_data.to_pandas())

In [20]:
#Checking the features with missing values along with their percentage

features_with_na=[features for features in train_data.columns if train_data[features].isnull().sum()>0]

for feature in features_with_na:
    # Convert the missing fraction to a percentage before rounding
    print(feature, 'has', round(float(train_data[feature].isnull().mean()) * 100, 2),'% Missing Values')
    
print("\n Total Number of Columns with Missing Values: ",len(features_with_na) )

In [21]:
# Dropping columns where more than 40% values are null 
#Train Data
print("Columns deleted from training dataset:")
for i in train_data.columns:
    if train_data[i].isna().sum()/len(train_data) > 0.40:
        print(i)
        train_data.drop([i],axis=1,inplace=True)
#Test Data  
print("\nColumns deleted from testing dataset:")
for i in test_data.columns:
    # The missing fraction of a test column is measured against the test row count
    if test_data[i].isna().sum()/len(test_data) > 0.40:
        print(i)
        test_data.drop([i],axis=1,inplace=True)
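
The thresholding step above can be sketched without a GPU on a toy table. This is a minimal, dependency-free illustration rather than the notebook's own code: the names `toy_train`, `toy_test`, `missing_fraction`, and `columns_to_drop` are hypothetical, the values are made up, and it assumes (the notebook does not state this) that the columns to drop are decided on the training data only and then removed from both tables so that their schemas stay aligned.

```python
# Toy tables stored as {column name: list of values}; None marks a missing value.
toy_train = {
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Alley": [None, None, "Pave", None, None],        # 80% missing
    "MasVnrArea": [196.0, None, 162.0, 0.0, 350.0],   # 20% missing
}
toy_test = {
    "LotArea": [11622, 14267],
    "Alley": [None, "Grvl"],
    "MasVnrArea": [0.0, 108.0],
}


def missing_fraction(values):
    """Return the fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


THRESHOLD = 0.40

# Decide which columns to drop by looking at the training data only.
columns_to_drop = [
    name for name, values in toy_train.items()
    if missing_fraction(values) > THRESHOLD
]

# Remove the same columns from both tables so train and test keep the same schema.
toy_train = {k: v for k, v in toy_train.items() if k not in columns_to_drop}
toy_test = {k: v for k, v in toy_test.items() if k not in columns_to_drop}

print("Dropped:", columns_to_drop)
print("Remaining train columns:", list(toy_train))
print("Remaining test columns:", list(toy_test))
```

Deciding the drop list once and reusing it avoids the situation where a column survives in one dataset but not in the other.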

In [22]:
train_data.info()

In [23]:
# Determining Numerical Features
numerical_features = [feature for feature in train_data.columns if train_data[feature].dtype != 'O']
print('Number of Numerical Features are', len(numerical_features))

In [24]:
# Determining Categorical Features
categorical_features = [feature for feature in train_data.columns if train_data[feature].dtype == 'O']
print('Number of Categorical Features are', len(categorical_features))

In [25]:
# Determining Date Time Features
Date_Time_features = [feature for feature in numerical_features if 'Year' in feature or 'Yr' in feature ]
Date_Time_features

In [26]:
# Determining the number of categories in each categorical feature
for feature in categorical_features:
    print(feature,' has ', len(train_data[feature].unique()), 'categories')

In [27]:
categorical_features_nan = [feature for feature in train_data.columns if train_data[feature].isnull().sum() > 0 and train_data[feature].dtype == 'O']


for feature in categorical_features_nan:
    # Multiply the missing fraction by 100 so the printed value really is a percentage
    print(f"{feature}: {round(float(train_data[feature].isnull().mean()) * 100, 2)}% missing values")

# Feature Engineering 

In [28]:
def replace_missing_nan_cat(dataset,features):
    data = dataset.copy()
    data[features] = data[features].fillna('Missing')
    return data

In [29]:
train_data = replace_missing_nan_cat(train_data,categorical_features)
test_data = replace_missing_nan_cat(test_data,categorical_features)

In [30]:
train_data[categorical_features].head(100)

In [31]:
# Verifying that no missing values remain in the categorical features
for feature in categorical_features_nan:
    print(f"{feature}: {round(float(train_data[feature].isnull().mean()) * 100, 2)}% missing values")

In [32]:
# Applying Median Imputation on numerical features (training data)
numerical_features_nan = [feature for feature in train_data.columns if train_data[feature].isnull().sum() > 0 and train_data[feature].dtype != 'O']
numerical_features_nan

for feature in numerical_features_nan:
    train_data[feature] = train_data[feature].fillna(train_data[feature].median())
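
To make the imputation step concrete, here is a small dependency-free sketch on made-up values. The names `train_col`, `test_col`, and `fill_missing` are hypothetical. It also illustrates one assumption that goes beyond the cell above: the median is computed on the training column and reused for the test column, so no information from the test set enters the preprocessing.

```python
from statistics import median

# Made-up LotFrontage-like values; None marks a missing entry.
train_col = [65.0, 80.0, None, 60.0, 84.0]
test_col = [None, 81.0, 74.0]

# The median is computed from the observed training values only.
train_median = median(v for v in train_col if v is not None)


def fill_missing(values, fill_value):
    """Return a copy of `values` with every None replaced by `fill_value`."""
    return [fill_value if v is None else v for v in values]


train_filled = fill_missing(train_col, train_median)
test_filled = fill_missing(test_col, train_median)  # reuse the training median

print(train_median)
print(train_filled)
print(test_filled)
```

The median is preferred over the mean here because it is less affected by the extreme values that are common in area and price columns.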

In [33]:
print(train_data[numerical_features_nan].isnull().sum())

In [34]:
# Converting the year features into ages relative to the year of sale (YrSold - year)
for feature in ['YearBuilt','YearRemodAdd','GarageYrBlt']:
    train_data[feature] = train_data['YrSold'] - train_data[feature]
    test_data[feature] = test_data['YrSold'] - test_data[feature]
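
As a worked example of the transformation above, the sketch below converts absolute years into "years before the sale" for three made-up rows. The column names come from the dataset, while `rows`, `year_columns`, and `ages` are hypothetical names used only for illustration.

```python
# Three made-up houses with their sale year and two year-valued features.
rows = [
    {"YrSold": 2008, "YearBuilt": 2003, "YearRemodAdd": 2003},
    {"YrSold": 2007, "YearBuilt": 1976, "YearRemodAdd": 1976},
    {"YrSold": 2006, "YearBuilt": 1915, "YearRemodAdd": 1970},
]
year_columns = ["YearBuilt", "YearRemodAdd"]

# Replace each absolute year by the number of years that passed before the sale.
ages = [
    {col: row["YrSold"] - row[col] for col in year_columns}
    for row in rows
]

for age in ages:
    print(age)
```

An age of 5 means the same thing for a house sold in 2006 as for one sold in 2010, whereas the raw build year does not, which is why the difference is the more useful feature.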

In [35]:
train_data[['YearBuilt','YearRemodAdd','GarageYrBlt']].head()

In [36]:
# Applying a log transform to skewed, strictly positive continuous features (training data only)
num_continuous_features_log=['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
for feature in num_continuous_features_log:
    train_data[feature] = cupy.log(train_data[feature])
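
Since `SalePrice` is among the log-transformed columns, a model trained on it predicts log-prices. The sketch below, on made-up prices, shows the round trip with the standard library: `math.log` going in and `math.exp` coming back. It assumes no further transformation is applied to the target in between, and the names `sale_prices`, `log_prices`, and `recovered` are hypothetical.

```python
import math

# Made-up sale prices; the natural log is only defined for strictly positive values.
sale_prices = [208500.0, 181500.0, 223500.0]

# Forward transform: compresses the long right tail of the price distribution.
log_prices = [math.log(p) for p in sale_prices]

# Inverse transform: needed to turn a prediction on the log scale back into a price.
recovered = [math.exp(v) for v in log_prices]

for original, logged, back in zip(sale_prices, log_prices, recovered):
    print(f"{original:>10.1f} -> {logged:.4f} -> {back:>10.1f}")
```

A column that can contain zeros would need a variant such as `math.log1p`, because `math.log(0)` raises an error.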

In [37]:
train_data[categorical_features].head(10)


In [38]:
train_data['MSZoning'].unique()

In [39]:
# Scaling the numerical features
from cuml.preprocessing import MinMaxScaler
scale=MinMaxScaler()
train_data[numerical_features]=scale.fit_transform(train_data[numerical_features])
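
Min-max scaling maps each value `x` to `(x - min) / (max - min)`. The dependency-free sketch below spells this out on made-up living-area values. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and the sketch assumes that the minimum and maximum are learned from the training column and then reused for the test column.

```python
def fit_min_max(values):
    """Learn the scaling parameters (minimum and maximum) from a column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously learned parameters; a constant column maps to 0.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Made-up living-area values.
train_area = [1000.0, 1500.0, 2000.0, 3000.0]
test_area = [1250.0, 3500.0]

lo, hi = fit_min_max(train_area)                  # parameters come from train only
train_scaled = apply_min_max(train_area, lo, hi)
test_scaled = apply_min_max(test_area, lo, hi)    # same parameters reused

print(train_scaled)
print(test_scaled)
```

Test values outside the training range end up outside `[0, 1]`, as the second test value does here. That is expected and is usually preferable to refitting the scaler on the test data.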

In [40]:
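# Encoding Categorical features
from cuml.preprocessing import LabelEncoder
for feature in categorical_features:
    # Fit one encoder per feature on train and test together, so that a category
    # is mapped to the same integer code in both datasets
    enc = LabelEncoder()
    enc.fit(cudf.concat([train_data[feature], test_data[feature]]))
    train_data[feature] = enc.transform(train_data[feature])
    test_data[feature] = enc.transform(test_data[feature])
train_data[categorical_features].head()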
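
Label encoding is only meaningful if a category receives the same integer in the training and the test data. The sketch below shows, on made-up `MSZoning` values, how a shared mapping differs from one fitted on the test data alone. The names `code_of` and `separate_test_map` are hypothetical, and the plain dictionary stands in for an encoder object.

```python
# Made-up zoning labels for a handful of houses.
train_zone = ["RL", "RM", "RL", "FV"]
test_zone = ["RM", "RH", "RL"]

# Shared mapping: built from every category seen in either dataset.
categories = sorted(set(train_zone) | set(test_zone))
code_of = {category: code for code, category in enumerate(categories)}

train_codes = [code_of[z] for z in train_zone]
test_codes = [code_of[z] for z in test_zone]

# For contrast: a mapping fitted on the test data alone assigns different codes.
separate_test_map = {c: i for i, c in enumerate(sorted(set(test_zone)))}

print(code_of)
print(train_codes, test_codes)
print("RM shared:", code_of["RM"], "| RM fitted on test only:", separate_test_map["RM"])
```

With separate mappings the model would be trained with one meaning for a code and evaluated with another, which silently corrupts the categorical features.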

# Feature Selection



In [41]:
train_df = train_data[["OverallQual","YearBuilt","YearRemodAdd","ExterQual","TotalBsmtSF","1stFlrSF","GrLivArea","FullBath","TotRmsAbvGrd","GarageCars","GarageArea",
                   "MSZoning", "Utilities","BldgType","Heating","KitchenQual","SaleCondition","LandSlope","SalePrice"]]
test_df_X = test_data[["OverallQual","YearBuilt","YearRemodAdd","ExterQual","TotalBsmtSF","1stFlrSF","GrLivArea","FullBath","TotRmsAbvGrd","GarageCars","GarageArea",
                   "MSZoning", "Utilities","BldgType","Heating","KitchenQual","SaleCondition","LandSlope"]]
len(train_df.columns)

# Model Building

In [42]:
from sklearn.model_selection import train_test_split
# Use the features chosen in the Feature Selection step (train_df) rather than the full table
X = train_df.drop('SalePrice',axis=1)
Y = train_df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [43]:
X_train.head()


In [44]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [45]:
import cuml
from cuml import LinearRegression

In [46]:
import pandas as pd
import matplotlib.pyplot as plt

algorithm = ['svd', 'eig', 'qr', 'svd-qr', 'svd-jacobi']
models = pd.DataFrame(columns=["Algorithm","MAE","MSE","R2 Score"],index=algorithm)

for i in algorithm:
    lr = LinearRegression(fit_intercept = True, normalize = False, algorithm = i)
    reg = lr.fit(X_train,y_train)
    preds=lr.predict(X_test)
    MSE=cuml.metrics.regression.mean_squared_error(y_test,preds)
    R2_Score=cuml.metrics.regression.r2_score(y_test,preds)
    MAE=cuml.metrics.regression.mean_absolute_error(y_test,preds)
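    row = {"Algorithm": i,"MAE": MAE, "MSE": MSE,"R2 Score": R2_Score}
    print(row)
    # Store the metrics so that the algorithms can be compared side by side
    models.loc[i] = [i, float(MAE), float(MSE), float(R2_Score)]
models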
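
For reference, the three metrics reported above can be written out in a few lines of plain Python. This is a minimal sketch of the standard definitions on made-up numbers, not the cuML implementation; the function names are chosen here only to mirror the metric names.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more than MAE does."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R2 :", r2_score(y_true, y_pred))
```

Lower MAE and MSE are better, while an R2 score closer to 1 means the model explains more of the variance in the target. Because the target here has been transformed, the errors are on the transformed scale rather than in the original price units.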
