# <center> CS530 Midterm </center>

### Notes:
Download two data sets house_SalePrice.csv and house_SalePrice_predict.csv from Canvas and answer the following questions. We want to build a model to predict the sales price of homes. The target variable is 'SalePrice'.

The file house_SalePrice.csv is for training. Each row represents one house and contains both the features of the house and its sale price.

The file house_SalePrice_predict.csv contains additional houses but does not include the sale price. Your goal is to predict the price for these houses.

### Questions

#### 1. Pre-processing the data
1). There are missing values in the data. Instead of dropping them, fill them in by setting each missing value to the mean/median/mode of the column.  Here are some references if you need them: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html and https://scikit-learn.org/stable/modules/impute.html

Note:  If there's a more sophisticated method you prefer, you can use that instead.  Just note it.
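
As a rough illustration of the mean/median/mode option described above, the sketch below fills numeric columns with their median and categorical columns with their mode using `DataFrame.fillna`. The toy frame and its column names (`LotArea`, `Street`) are made up for illustration and are not taken from house_SalePrice.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 9500.0, 12000.0],
    "Street": ["Pave", "Pave", None, "Grvl"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
# Numeric columns: replace missing values with the column median.
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
# Categorical columns: replace missing values with the most frequent value.
filled[categorical_cols] = toy[categorical_cols].fillna(
    toy[categorical_cols].mode().iloc[0]
)

print(filled)
```

Computing the fill values on the training rows only, and reusing them for the test rows, would avoid leaking information from the test set; `SimpleImputer` inside a pipeline handles that automatically.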

In [287]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Preprocessing
from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn import linear_model

# Model evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score


import warnings
warnings.filterwarnings('ignore')

test = pd.read_csv('house_SalePrice_predict.csv')

df = pd.read_csv('house_SalePrice.csv')

2). Use one-hot encoding to convert the categorical variables into dummies.
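
A minimal sketch of one-hot encoding with `pd.get_dummies`, assuming a toy frame whose column names (`Street`, `LotArea`) are hypothetical. Passing `drop_first=True` drops one level per categorical variable, which avoids perfectly collinear dummy columns in a linear model.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8000, 9500, 12000],
})

# Each category level becomes its own indicator column; the first
# level in sorted order ("Grvl") is dropped as the reference level.
dummies = pd.get_dummies(toy, columns=["Street"], drop_first=True)

print(dummies)
```

Inside a scikit-learn pipeline, `OneHotEncoder(handle_unknown='ignore')` is usually the safer choice, because it remembers the categories seen during fitting and can encode new data consistently.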

#### 2. Regression model building
In this part, build a linear model using OLS first, and then build another linear model using Lasso. Make sure to split the data into training and test sets. Report the performance on the test set. Use k-fold cross-validation to tune the hyperparameters in Lasso.

In [288]:
X_train , X_test,y_train, y_test = train_test_split(df.drop(['SalePrice'], 
                                                            axis = 1), df['SalePrice'],
                                                     test_size=0.33, random_state=42)

col_trans = make_column_transformer((OneHotEncoder(handle_unknown = 'ignore', 
                                                   drop = 'first'), list(X_train.select_dtypes(include = 'O').columns)),
                                    (StandardScaler(), 
                                     list(X_train.select_dtypes(exclude = 'O').columns)),
                                  remainder = 'passthrough')

# Impute missing values with the most frequent value (mode) of each column
transformer = FeatureUnion(
    transformer_list=[
                      ('features', SimpleImputer(strategy='most_frequent'))])

ols_pipe = Pipeline(steps = [
                         ('preprocess', col_trans),
                         ('transformer', transformer), 
                         ('model', LinearRegression())
                         ])

lasso_pipe = Pipeline(steps = [
                         ('preprocess', col_trans), 
                         ('transformer', transformer), 
                         ('model', linear_model.Lasso(max_iter=10000))])


parameters = {'model__alpha': [0.00001, 0.001, 0.01, 1, 10, 100, 1000, 10000, 100000]}


## OLS SCORING 

In [289]:
ols_preds = ols_pipe.fit(X_train, y_train).predict(X_test)

print(f'The R-squared for the OLS model is: {round(ols_pipe.score(X_test, y_test), 3)}')

The R-squared for the OLS model is: 0.73


## LASSO PREDICTIONS

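The best alpha is 100, the value with the highest mean cross-validated R-squared (about 0.746) in the grid search results below. The largest value in the grid, 100000, ranks last.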

In [290]:
clf_lasso = GridSearchCV(lasso_pipe, parameters, cv = 5, scoring = "r2")

clf_lasso.fit(X_train, y_train)
lasso_preds = clf_lasso.predict(X_test)

print(f'The R-squared for the lasso model is: {round(clf_lasso.score(X_test, y_test), 3)}')

The R-squared for the lasso model is: 0.752


In [291]:
clf_lasso.cv_results_

{'mean_fit_time': array([0.44046779, 0.10746312, 0.10519147, 0.09429593, 0.09598956,
        0.07479467, 0.058988  , 0.06396427, 0.05908289]),
 'mean_score_time': array([0.01560607, 0.01334434, 0.01898804, 0.0189887 , 0.01585441,
        0.01774817, 0.01626134, 0.01936398, 0.01767683]),
 'mean_test_score': array([ 0.73986156,  0.73986193,  0.73986523,  0.74026342,  0.74247091,
         0.746031  ,  0.74261025,  0.68084174, -0.00492056]),
 'param_model__alpha': masked_array(data=[1e-05, 0.001, 0.01, 1, 10, 100, 1000, 10000, 100000],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'params': [{'model__alpha': 1e-05},
  {'model__alpha': 0.001},
  {'model__alpha': 0.01},
  {'model__alpha': 1},
  {'model__alpha': 10},
  {'model__alpha': 100},
  {'model__alpha': 1000},
  {'model__alpha': 10000},
  {'model__alpha': 100000}],
 'rank_test_score': array([7, 6, 5, 4, 3, 1, 2, 8, 9], dtype=i
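
The `mean_test_score` and `rank_test_score` arrays above already identify the winning alpha. As a small sketch, the snippet below copies the alphas and mean scores from that output and picks the alpha with the highest mean score. On the fitted search object itself, `clf_lasso.best_params_` should report the same value without any manual lookup.

```python
# Values copied from the cv_results_ output shown above.
alphas = [1e-05, 0.001, 0.01, 1, 10, 100, 1000, 10000, 100000]
mean_test_score = [
    0.73986156, 0.73986193, 0.73986523, 0.74026342, 0.74247091,
    0.746031, 0.74261025, 0.68084174, -0.00492056,
]

# Index of the highest mean cross-validated R-squared.
best_index = max(range(len(alphas)), key=lambda i: mean_test_score[i])
best_alpha = alphas[best_index]
best_score = mean_test_score[best_index]

print(f"best alpha: {best_alpha}, mean CV R-squared: {best_score:.3f}")
# prints: best alpha: 100, mean CV R-squared: 0.746
```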

#### 3. Classification model building

Create a binary variable to indicate whether the sale price is greater than median sale price (=1 if it's higher than median, and 0 otherwise). Build and compare two classification models to predict whether or not the house sells above the median price: a logistic model and a random forest model.

 ## Logistic regression 

In [292]:
df2 = df.copy()
# 1 if the sale price is above the median sale price, 0 otherwise
df2['above_median'] = (df2['SalePrice'] > df2['SalePrice'].median()).astype(int)
df2.drop(['SalePrice'], axis = 1, inplace=True)

X_train2 , X_test2, y_train2, y_test2 = train_test_split(df2.drop(['above_median'], 
                                                            axis = 1), df2['above_median'],
                                                     test_size=0.33, random_state=42)

col_trans = make_column_transformer((OneHotEncoder(handle_unknown = 'ignore', 
                                                   drop = 'first'), list(X_train2.select_dtypes(include = 'O').columns)),
                                    (StandardScaler(), 
                                     list(X_train2.select_dtypes(exclude = 'O').columns)),
                                     remainder = 'passthrough')

transformer = FeatureUnion(
    transformer_list=[
                      ('features', SimpleImputer(strategy='most_frequent'))])


logistic_pipe = Pipeline(steps = [
                         ('preprocess', col_trans), 
                         ('transformer', transformer), 
                         ('model', linear_model.LogisticRegression())])


logistic_pipe.fit(X_train2, y_train2)
logistic_preds = logistic_pipe.predict(X_test2)


matthews_corrcoef(y_test2, logistic_preds)

0.7729477813923188

## Random Forest 

In [293]:
from sklearn.ensemble import RandomForestClassifier


rf_pipe = Pipeline(steps = [
                         ('preprocess', col_trans), 
                         ('transformer', transformer), 
                         ('model', RandomForestClassifier())])


rf_pipe.fit(X_train2, y_train2)
rf_preds = rf_pipe.predict(X_test2)


matthews_corrcoef(y_test2, rf_preds)

0.7823229278882637

There is a lot of variation between train-test splits, but the random forest usually performs better than the logistic regression.
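
Both classifiers are scored above with `matthews_corrcoef`. As a sketch of what that score measures, the function below computes the Matthews correlation coefficient for binary labels from the four confusion-matrix counts. The function name `mcc_binary` and the toy labels are illustrative only; in practice scikit-learn's implementation should be preferred.

```python
from math import sqrt


def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels: 3 TP, 3 TN, 1 FP, 1 FN.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = mcc_binary(toy_true, toy_pred)
print(score)  # 0.5
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and it uses all four cells of the confusion matrix.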

#### 4. Model comparisons
Now use your OLS and Lasso regressions from part 2 and binarize the output so that they predict 1 if the house is predicted to sell for more than the median, and 0 otherwise.

How does the result compare to the Logistic and Random Forest models? Which one is best and why do you think it performs better?

In [294]:
ols_preds_binary = ols_preds > np.median(df['SalePrice'])

matthews_corrcoef(y_test2, ols_preds_binary)

0.725394827191577

In [295]:
lasso_preds_binary = lasso_preds > np.median(df['SalePrice'])

matthews_corrcoef(y_test2, lasso_preds_binary)

0.7035039542847492

The results show that the logistic regression and the random forest perform very similarly (MCC of about 0.77 and 0.78 on this split). In some train-test splits the logistic regression is better; in others the random forest is. Both still beat the binarized OLS and Lasso predictions (MCC of about 0.73 and 0.70).

I think this shows a limitation of applying methods designed for continuous targets (OLS and Lasso) to classification. They are not bad, but they are systematically worse than the two classification algorithms. The latter are designed for a categorical target: logistic regression passes a linear function through a logistic transformation and is fit by minimizing log loss (cross-entropy), and the random forest chooses its splits with an impurity measure. Both therefore optimize directly for separating the two classes rather than for squared error on the price.

#### 5. Test your model (Bonus)

Create a csv file with two columns by using house_SalePrice_predict.csv:
  1. Your best salePrice prediction.
  2. Your best aboveMedian prediction.

Name the file as '\<Your Last Name\>_prediction.csv'

In [296]:
lasso_final_preds = clf_lasso.predict(test)

logistic_final_preds = logistic_pipe.predict(test)


my_predictions = pd.DataFrame({"salePrice":lasso_final_preds, "aboveMedian": logistic_final_preds})

# index=False keeps the file to the two requested columns
my_predictions.to_csv("Charris_prediction.csv", index=False)