# Group Project: Minimal Working Example

The script below is a minimal working example that uses the group project data to derive an ML model.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')  # the 'seaborn-whitegrid' style name was removed in recent Matplotlib versions
plt.rcParams['font.size'] = 10

In [None]:
# Load data
df = pd.read_csv("GroupProjectDataSet.csv", sep=',')
print('Shape of data frame:', df.shape)
df.head(10)

# Overview

The data set consists of 1460 observations and 81 variables, including the target variable "Class" (the price class) and the id variable. The remaining 79 variables are descriptive variables that should explain Class.

Quantitative: 1stFlrSF, 2ndFlrSF, 3SsnPorch, BedroomAbvGr, BsmtFinSF1, BsmtFinSF2, BsmtFullBath, BsmtHalfBath, BsmtUnfSF, EnclosedPorch, Fireplaces, FullBath, GarageArea, GarageCars, GarageYrBlt, GrLivArea, HalfBath, KitchenAbvGr, LotArea, LotFrontage, LowQualFinSF, MSSubClass, MasVnrArea, MiscVal, MoSold, OpenPorchSF, OverallCond, OverallQual, PoolArea, ScreenPorch, TotRmsAbvGrd, TotalBsmtSF, WoodDeckSF, YearBuilt, YearRemodAdd, YrSold

Qualitative: Alley, BldgType, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, CentralAir, Condition1, Condition2, Electrical, ExterCond, ExterQual, Exterior1st, Exterior2nd, Fence, FireplaceQu, Foundation, Functional, GarageCond, GarageFinish, GarageQual, GarageType, Heating, HeatingQC, HouseStyle, KitchenQual, LandContour, LandSlope, LotConfig, LotShape, MSZoning, MasVnrType, MiscFeature, Neighborhood, PavedDrive, PoolQC, RoofMatl, RoofStyle, SaleCondition, SaleType, Street, Utilities
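
The split into quantitative and qualitative variables can be approximated from the column dtypes. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the project data). Note that a dtype-based split would still count integer-coded categories such as MSSubClass as quantitative.

```python
import pandas as pd

# Hypothetical toy frame standing in for the project data (values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "Street": ["Pave", "Pave", "Grvl"],
    "CentralAir": ["Y", "N", "Y"],
})

# Numeric dtypes are treated as quantitative, everything else as qualitative
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```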

# Handling Missing Data

In [None]:
# Plot missing values

missing = df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
missing.plot.bar()

In [None]:
# Plot missing values 2.0

# Assess missing values
cols = df.columns[df.isna().any()]
df_nan = df[cols].copy()
df_nan['Class'] = df['Class']
print('Share of missing values per column:')
print(df_nan.isna().sum() / df_nan.shape[0])  # print explicitly: not the last expression in the cell



# Plot missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df_nan.isna().transpose(),
            cmap="Blues",
            cbar_kws={'label': 'Missing Values'});





In [None]:
# Assess class imbalance. Make your own assessment of its potential effects.
# One bin per class, assuming the classes are labelled 1 to 5
plt.hist(df['Class'], bins=np.arange(0.5, 6.5));

In [None]:
# Percentage of missing values for the variables

percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing, percent], axis=1, keys=['Nr. of missing values', 'Share'])
missing_data.head(20)

Nineteen variables have missing values. Four of them (PoolQC, MiscFeature, Alley, Fence) are missing in more than 50% of the observations, and one (FireplaceQu) in nearly 50%. However, NA often does not mean that the value is unknown. Instead, especially for the categorical variables, it means that the house lacks the corresponding feature: NA in PoolQC means that there is no pool, and NA in Alley means "no alley access". The data description states for each variable whether NA stands for unavailable data or for a missing trait.


The following variables have NAs that can be filled:
- PoolQC: NA = No Pool
- MiscFeature: NA = None
- Alley: NA = No alley access
- Fence: NA = No Fence
- FireplaceQu: NA = No Fireplace
- GarageCond: NA = No Garage
- GarageType: NA = No Garage
- GarageFinish: NA = No Garage
- GarageQual: NA = No Garage
- BsmtFinType2: NA = No Basement
- BsmtExposure: NA = No Basement
- BsmtQual: NA = No Basement
- BsmtCond: NA = No Basement
- BsmtFinType1: NA = No Basement

In [None]:
# Filling missing values for variables where appropriate

# Columns where NA means that the house lacks the feature
na_means_none = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
                 "GarageCond", "GarageType", "GarageFinish", "GarageQual",
                 "BsmtFinType2", "BsmtExposure", "BsmtQual", "BsmtCond", "BsmtFinType1"]
df[na_means_none] = df[na_means_none].fillna("None")

In [None]:
missing = df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
missing.plot.bar()

In [None]:
# Percentage of missing values for the variables

percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing, percent], axis=1, keys=['Nr. of missing values', 'Share'])
missing_data.head(5)

For all but five variables we could fill the missing data, because for them NA indicates the absence of the corresponding trait. Of the remaining variables, LotFrontage is missing 17% of its values and GarageYrBlt 5.5%.

- LotFrontage ---> High Correlation with other variable?
- GarageYrBlt can probably be ignored since it highly correlates with YearBuilt.
- MasVnrType and MasVnrArea have a strong correlation with "YearBuilt" and "OverallQual" ---> Delete them?
- Electrical one missing value ---> Delete this observation or just leave it?
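
Whether a variable with missing values is largely redundant can be checked with a pairwise correlation. The sketch below uses synthetic data (the frame `toy` and the way garage years are generated are assumptions for illustration only) and shows the kind of check that could back the GarageYrBlt decision above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: garages are assumed to be built with the house or a few years later
year_built = rng.integers(1900, 2010, size=200)
garage_year = year_built + rng.choice([0, 0, 0, 5, 10], size=200)

toy = pd.DataFrame({"YearBuilt": year_built,
                    "GarageYrBlt": garage_year.astype(float)})

# Knock out 5% of the garage years to mimic missing values
toy.loc[toy.sample(frac=0.05, random_state=0).index, "GarageYrBlt"] = np.nan

# Series.corr uses only the rows where both values are present
corr = toy["YearBuilt"].corr(toy["GarageYrBlt"])
print(f"Pearson correlation: {corr:.3f}")
```

On the real data, the same call on `df["YearBuilt"]` and `df["GarageYrBlt"]` would show whether dropping GarageYrBlt loses much information.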

# Feature Engineering


## Dealing with Categorical Features (Encoding Categorical Variables) / Splitting Into X and y


In [None]:
# Numerical variables that should be handled as categorical variables
df = df.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 
50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 
80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 
150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"}})
df = df.replace({"MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun",
7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}})

# !!!!!!!!! Dropping NAs !!!!!!!!!!! 

In [None]:
df = df.dropna()
print(len(df))

In [None]:
# Assign response to y
y = df.iloc[:, -1] # Corresponds to Class

# Encode categorical values and assign the output to X:
# pd.get_dummies creates one dummy variable per category (the pandas way)
X = pd.get_dummies(df.iloc[:, :-1])
X.head()




In [None]:
X.info()
X.shape

## Partitioning of the Data Set Into Train and Test Set

We use a 70/30 (training/testing) split. (The parameter random_state=0 fixes the random split so that the results are reproducible.)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=0, 
                                                    stratify=y)

A stratified sample is one that maintains the proportion of values as in the original data set. If, for example, the response vector $y$ is a binary categorical variable with 25% zeros and 75% ones, stratify=y ensures that the random splits have 25% zeros and 75% ones too. Note that stratify=y does not mean stratify=yes but rather tells the function to take the categorical proportions from response vector y.
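
The effect of stratification can be verified on a toy response vector. This is a minimal sketch with made-up data (`X_toy` and `y_toy` are hypothetical), mirroring the 25%/75% example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy response with 25% zeros and 75% ones
y_toy = np.array([0] * 25 + [1] * 75)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# The share of ones should be close to 0.75 in both parts
print("Share of ones in train:", y_tr.mean())
print("Share of ones in test: ", y_te.mean())
```

Because 25% of 30 test samples is not a whole number, the proportions can only match approximately.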



In [None]:
cols_scl = X.columns.values
cols_scl

## Feature Scaling


In [None]:
from sklearn.preprocessing import MinMaxScaler 

# Get cols to scale
cols_scl = X.columns.values[:6]

# Apply MinMaxScaler on continuous columns only (check dummies!!!)
mms = MinMaxScaler()
#X_train_norm = mms.fit_transform(X_train[cols_scl])  # fit & transform
#X_test_norm  = mms.transform(X_test[cols_scl])  # ONLY transform

In [None]:
from sklearn.preprocessing import StandardScaler 

# Apply StandardScaler on continuous columns only
stdsc = StandardScaler()
#X_train_std = stdsc.fit_transform(X_train[cols_scl])  # fit & transform
#X_test_std  = stdsc.transform(X_test[cols_scl])  # ONLY transform

# Assessing Target Variable "Class"

**Assess class imbalance. Make your own assessment of the potential effects of class imbalance.**

In [None]:
plt.figure(1); plt.title('Distribution of Class')
sns.histplot(data=y, discrete = True)

We see that "Class" deviates from the normal distribution: it is positively skewed and shows peakedness (kurtosis).

In [None]:
#skewness and kurtosis
print("Skewness: %f" % df['Class'].skew())
print("Kurtosis: %f" % df['Class'].kurt())


# !!! Provided Code !!!

In [None]:
# Imports
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.compose import make_column_selector as selector

In [None]:
# Mute warnings (related to LogReg 'max_iter' param)
import warnings
warnings.filterwarnings('ignore')


num_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), ("imputer", SimpleImputer(strategy="median"))]
)

cat_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, selector(dtype_include=np.number)),
        ("cat", cat_transformer, selector(dtype_include=object)),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

In [None]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
search_cv

In [None]:
search_cv.fit(X_train, y_train)

# Print results
print('Best CV accuracy: {:.2f}'.format(search_cv.best_score_))
print('Test score:       {:.2f}'.format(search_cv.score(X_test, y_test)))
print('Best parameters: {}'.format(search_cv.best_params_))


Now let's do the same for a random forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier


clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

In [None]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__max_depth": [1, 3, 5, 10],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)

search_cv.fit(X_train, y_train)

# Print results
print('Best CV accuracy: {:.2f}'.format(search_cv.best_score_))
print('Test score:       {:.2f}'.format(search_cv.score(X_test, y_test)))
print('Best parameters: {}'.format(search_cv.best_params_))


In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

cat_selector = selector(dtype_include=object)
num_selector = selector(dtype_include=np.number)

cat_tree_processor = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-2,
)
num_tree_processor = SimpleImputer(strategy="mean", add_indicator=True)

tree_preprocessor = make_column_transformer(
    (num_tree_processor, num_selector), (cat_tree_processor, cat_selector)
)

#####

clf = Pipeline(
    steps=[("preprocessor", tree_preprocessor), ("classifier", RandomForestClassifier())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

In [None]:
param_grid = {
    "classifier__max_depth": [5, 10, 25],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=3, random_state=0)  # the grid has only 3 candidates

search_cv.fit(X_train, y_train)

# Print results
print('Best CV accuracy: {:.2f}'.format(search_cv.best_score_))
print('Test score:       {:.2f}'.format(search_cv.score(X_test, y_test)))
print('Best parameters: {}'.format(search_cv.best_params_))


# Copy Paste Function from Last Year !!!!!!

For scoring our models we will be using the weighted $F_1$-Score.
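
The weighted $F_1$-Score is the average of the per-class $F_1$-Scores, each weighted by the number of true samples $n_k$ of class $k$: $F_1^{\text{weighted}} = \sum_k \frac{n_k}{n} F_{1,k}$. The following sketch uses made-up labels (`y_true` and `y_pred` are hypothetical) to show that this matches `f1_score(..., average="weighted")` in scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted classes for three classes
y_true = np.array([1, 1, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred = np.array([1, 1, 1, 2, 2, 2, 3, 3, 1, 3])

# Per-class F1 scores, ordered by class label (1, 2, 3)
per_class = f1_score(y_true, y_pred, average=None)

# Number of true samples per class (index 0 is dropped, labels start at 1)
support = np.bincount(y_true)[1:]

manual = np.sum(per_class * support) / support.sum()
weighted = f1_score(y_true, y_pred, average="weighted")

print(f"Manual weighted F1:  {manual:.4f}")
print(f"sklearn weighted F1: {weighted:.4f}")
```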

In [None]:

############## Report Functions ##############
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Function to get the best parameters, the mean cross-validation score of the best parameters,
# the standard deviation of the cross-validation scores of the best parameters
# and the score on the test set
def get_results_cv(func, X_test_cleaned, y_test_cleaned):
  """
  Inputs required: already fitted GridSearchCV or RandomizedSearchCV object, cleaned X test set, cleaned y test set
  Prints the best parameters and the mean and standard deviation of their cross-validation score,
  as well as the test score of the best parameters
  """
  std_best_score = func.cv_results_["std_test_score"][func.best_index_]
  print(f"Best parameters: {func.best_params_}")
  print(f"Mean CV score: {func.best_score_: .6f}")
  print(f"Standard deviation of CV score: {std_best_score: .6f}")
  print("Test Score: {:.6f}".format(func.score(X_test_cleaned, y_test_cleaned)))

# Function to get the metrics report and a heatmap of the confusion matrix for the test set
def final_report(y_true, y_pred):
  """
  Inputs required: true classes, predicted classes
  Prints the classification report and plots the confusion matrix of the model
  """
  class_report = metrics.classification_report(y_true, y_pred)
  print(class_report)
  cm = confusion_matrix(y_true, y_pred, normalize = "all")
  cm = pd.DataFrame(cm, ["1", "2", "3", "4", "5"],  ["1", "2", "3", "4", "5"])
  plt.figure(figsize = (10,5))
  sns.heatmap(cm, annot = True, fmt = ".2%", cmap = "Blues").set(xlabel = "Assigned Class", ylabel = "True Class", title = "Confusion Matrix")
     


## Support Vector Machines (Selina)

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.


Support vector machines (SVMs) are a type of supervised learning algorithm that can be used for classification, regression, and outlier detection. One of their key benefits is that they are effective on high-dimensional data, including cases where the number of dimensions exceeds the number of samples. SVMs are also memory-efficient because the decision function uses only a subset of the training points, referred to as "support vectors". Moreover, SVMs are versatile: different kernel functions can be specified for the decision function, including custom kernels.

SVMs also come with their own set of limitations. When the number of features far exceeds the number of samples, the kernel function and the regularization term must be chosen with care to prevent over-fitting. In addition, SVMs do not offer direct probability estimates; these have to be calculated with a time-consuming five-fold cross-validation.
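
Two of the points above, support vectors and probability estimates, can be illustrated on synthetic data. This is a minimal sketch, not part of the project pipeline; `X_toy`, `y_toy`, `svc_plain` and `svc_proba` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class problem (made-up data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)

# Default SVC: only the support vectors are kept for the decision function
svc_plain = SVC(kernel="rbf").fit(X_toy, y_toy)
print("Training samples:", X_toy.shape[0])
print("Support vectors per class:", svc_plain.n_support_)
print("Has predict_proba:", hasattr(svc_plain, "predict_proba"))

# probability=True triggers the additional internal cross-validation during fit
svc_proba = SVC(kernel="rbf", probability=True, random_state=0).fit(X_toy, y_toy)
proba = svc_proba.predict_proba(X_toy[:5])
print("Class probabilities for the first five samples:")
print(np.round(proba, 3))
```

Fitting with `probability=True` is slower than the default, which is why it is switched off unless probability estimates are needed.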


## Importing Packages

In [None]:
from sklearn import svm
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix


## SVC

In [None]:
# The SVC class from sklearn
# Note: this assignment rebinds the name `svm` from the imported module to the
# estimator, so the import cell must be run again before re-running this cell.
svm = svm.SVC(
        C=1.0,                          # The regularization parameter
        kernel='rbf',                   # The kernel type used 
        degree=3,                       # Degree of polynomial function 
        gamma='scale',                  # The kernel coefficient
        coef0=0.0,                      # If kernel = 'poly'/'sigmoid'
        shrinking=True,                 # To use shrinking heuristic
        probability=False,              # Enable probability estimates
        tol=0.001,                      # Stopping criterion
        cache_size=200,                 # Size of kernel cache
        class_weight=None,              # The weight of each class
        verbose=False,                  # Enable verbose output
        max_iter=-1,                    # Hard limit on iterations (-1 = no limit)
        decision_function_shape='ovr',  # One-vs-rest or one-vs-one
        break_ties=False,               # How to handle breaking ties
        random_state=42                 # Random state of the model
    )

print(f"Parameters of the Support Vector Machine: {svm.get_params().keys()}")

In [None]:
# Building and training our model
clf = svm.fit(X_train, y_train)

# Making predictions with our data
predictions = clf.predict(X_test)

print(accuracy_score(y_test, predictions))

## Scaling/Pipeline

Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be done easily by using a Pipeline:

In [None]:
scaler = StandardScaler()
mms = MinMaxScaler()

ros = RandomOverSampler(random_state = 42)
kFold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)


svm_pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["SVM", svm]])
param_grid = {
    'ros': [ros, None], 
    'scaler': [scaler, mms],
    "SVM__kernel": ["linear", "sigmoid"],
    "SVM__C": [1, 10],
    "SVM__gamma": ["auto", "scale"]
}
gs = GridSearchCV(estimator = svm_pipe, param_grid = param_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1)

gs = gs.fit(X_train, y_train)

get_results_cv(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)


# get test score, metrics report and confusion matrix
final_report(y_test, y_pred)

