# Project ML in Finance Group 5
### April 2023


#### Cyrill Stoll, Arthur Schlegel, Aleksandar Kuljanin and Selina Waber


## Introduction

The Ames Housing dataset was compiled by Dean De Cock and is available on OpenML ([Dataset](https://www.openml.org/search?type=data&sort=runs&id=42165&status=active)). It describes the sales of residential properties in Ames, Iowa, between 2006 and 2010. The full dataset consists of 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) that are used to assess home values.



## Importing Libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
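# The 'seaborn-whitegrid' style name was renamed in Matplotlib 3.6 and later removed;
# seaborn's own style setter works across versions
sns.set_style('whitegrid')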
plt.rcParams['font.size'] = 10

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Load Data

In [None]:
# Load data
df = pd.read_csv("GroupProjectDataSet.csv", sep=',', index_col='Id')
print('Shape of data frame:', df.shape)
df.head()


In [None]:
df.describe()

### Overview

The data set consists of 1460 observations with 81 variables (including the target variable `Class`, i.e. the price class, and the id variable). The remaining 79 variables are descriptive variables that should explain `Class`.

Quantitative: 1stFlrSF, 2ndFlrSF, 3SsnPorch, BedroomAbvGr, BsmtFinSF1, BsmtFinSF2, BsmtFullBath, BsmtHalfBath, BsmtUnfSF, EnclosedPorch, Fireplaces, FullBath, GarageArea, GarageCars, GarageYrBlt, GrLivArea, HalfBath, KitchenAbvGr, LotArea, LotFrontage, LowQualFinSF, MSSubClass, MasVnrArea, MiscVal, MoSold, OpenPorchSF, OverallCond, OverallQual, PoolArea, ScreenPorch, TotRmsAbvGrd, TotalBsmtSF, WoodDeckSF, YearBuilt, YearRemodAdd, YrSold

Qualitative: Alley, BldgType, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, CentralAir, Condition1, Condition2, Electrical, ExterCond, ExterQual, Exterior1st, Exterior2nd, Fence, FireplaceQu, Foundation, Functional, GarageCond, GarageFinish, GarageQual, GarageType, Heating, HeatingQC, HouseStyle, KitchenQual, LandContour, LandSlope, LotConfig, LotShape, MSZoning, MasVnrType, MiscFeature, Neighborhood, PavedDrive, PoolQC, RoofMatl, RoofStyle, SaleCondition, SaleType, Street, Utilities

In [None]:
numCols = list(df.select_dtypes(exclude='object').columns)
print(f"There are {len(numCols)} numerical features:\n", numCols)

In [None]:
catCols = list(df.select_dtypes(include='object').columns)
print(f"There are {len(catCols)} categorical features:\n", catCols)

## Handling Missing Values

Identifying missing values in data is crucial before determining the appropriate course of action, such as dropping features or imputing missing values, as many machine learning algorithms generate errors when trained on incomplete data.

In [None]:
# Plot missing values
missing = df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
missing.plot.bar()

In [None]:
# Assess missing values
cols = df.columns[df.isna().any()]
df_nan = df[cols].copy()
df_nan['Class'] = df['Class']


# Plot missing values 2.0
plt.figure(figsize=(10, 6))
sns.heatmap(df_nan.isna().transpose(),
            cmap="Blues",
            cbar_kws={'label': 'Missing Values'});

In [None]:
# Percentage of missing values for the variables
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing, percent], axis=1, keys=['Nr. of missing values', 'Share'])
missing_data.head(22)

### Filling missing values for variables where appropriate

19 variables have missing values. Four of them (PoolQC, MiscFeature, Alley, Fence) have more than 50% missing data, and one (FireplaceQu) has nearly 50%. However, NA often does not mean that the data are unavailable. Instead (especially for the categorical variables) it means that the house lacks the feature in question: NA in PoolQC means that there is no pool, and NA in Alley means "no alley access". The data description states for each variable whether NA stands for unavailable data or for a missing trait.





#### Filling Categorical Variables

The following variables have NAs that can be filled:

- PoolQC: NA = No Pool
- MiscFeature: NA = None
- Alley: NA = No alley access
- Fence: NA = No Fence
- FireplaceQu: NA = No Fireplace
- GarageCond: NA = No Garage
- GarageType: NA = No Garage
- GarageFinish: NA = No Garage
- GarageQual: NA = No Garage
- BsmtFinType2: NA = No Basement
- BsmtExposure: NA = No Basement
- BsmtQual: NA = No Basement
- BsmtCond: NA = No Basement
- BsmtFinType1: NA = No Basement
- MasVnrType: NA = None

In [None]:
## Filling Categorical Variables 

df["PoolQC"] = df["PoolQC"].fillna(value = "No")
df["MiscFeature"] = df["MiscFeature"].fillna(value = "No")
df["Alley"] = df["Alley"].fillna(value = "No")
df["Fence"] = df["Fence"].fillna(value = "No")
df["FireplaceQu"] = df["FireplaceQu"].fillna(value = "No")
df["GarageCond"] = df["GarageCond"].fillna(value = "No")
df["GarageType"] = df["GarageType"].fillna(value = "No")
df["GarageFinish"] = df["GarageFinish"].fillna(value = "No")
df["GarageQual"] = df["GarageQual"].fillna(value = "No")
df["BsmtFinType2"] = df["BsmtFinType2"].fillna(value = "No")
df["BsmtExposure"] = df["BsmtExposure"].fillna(value = "No")
df["BsmtQual"] = df["BsmtQual"].fillna(value = "No")
df["BsmtCond"] = df["BsmtCond"].fillna(value = "No")
df["BsmtFinType1"] = df["BsmtFinType1"].fillna(value = "No")
df["MasVnrType"] = df["MasVnrType"].fillna(value= "No") #newly added

For all but five variables we could fill the missing data, because for them NA indicates the lack of the corresponding trait. LotFrontage is missing for 17% of the observations and GarageYrBlt for 5.5%.

- LotFrontage: is it highly correlated with another variable?
- GarageYrBlt: can probably be ignored since it correlates highly with YearBuilt.
- MasVnrType and MasVnrArea: both correlate strongly with YearBuilt and OverallQual. Delete them?
- Electrical: one missing value. Delete this observation or just leave it?

#### Filling missing values for numerical data

In [None]:
## Numerical Variables
missing_numerical = ['GarageArea', 'GarageCars', 'BsmtFinSF1',
                     'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF',
                     'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea']

df[missing_numerical] = df[missing_numerical].fillna(0)

#### Filling special variables
 

In [None]:
# Filling special variables

df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"]) 
# assuming that the garage was built together with the house


df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())



most_frequent= df['Electrical'].value_counts().idxmax()
df["Electrical"] = df["Electrical"].fillna(most_frequent)


In [None]:
# further data cleaning
#df = df.dropna(axis='columns', thresh=1459)
#df = df.dropna(axis='rows', how = "any")

## Outliers - To DO -

Because regression models are very sensitive to outliers, we need to be aware of them. For categorical data, one can use sklearn's `OneHotEncoder` and specify the `min_frequency` parameter. With `min_frequency` set, rare categorical values are grouped into a single `infrequent_sklearn` category.

https://medium.com/owl-analytics/categorical-outliers-dont-exist-8f4e82070cb2

## Create New Variables

<mark>Adapted from https://chriskhanhtran.github.io/minimal-portfolio/projects/ames-house-price.html</mark>

To do: rework this section instead of copying it.
    
    

In [None]:
df['totalSqFeet'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
df['totalBathroom'] = df["FullBath"] + df["BsmtFullBath"] + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"])
df['houseAge'] = df["YrSold"] - df["YearBuilt"]
df['reModeled'] = np.where(df["YearRemodAdd"] == df["YearBuilt"], 0, 1)
df['isNew'] = np.where(df["YrSold"] == df["YearBuilt"], 1, 0)



Dropping columns

In [None]:
not_used_anymore = ['TotalBsmtSF','1stFlrSF', '2ndFlrSF',
                    "FullBath", "BsmtFullBath", "HalfBath",
                    "BsmtHalfBath", "YearBuilt", "YearRemodAdd"  ]

df= df.drop(not_used_anymore, axis=1)

In [None]:
df.columns.values

## Feature Engineering


### Dealing with Categorical Features (Encoding Categorical Variables) 

In [None]:
# Numerical variables that should be handled as categorical variables
df = df.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 
50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 
80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 
150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"}})

df = df.replace({"MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun",
7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}})


In [None]:
# other approach:
#to_factor_cols = ['YrSold', 'MoSold', 'MSSubClass']
#for col in to_factor_cols:
#    X[col] = X[col].apply(str)

## Numerical Features





#### Histograms

In [None]:
#  Visualize data to gain insights (Histograms)
df.hist(figsize=(30, 20), bins = 15, edgecolor = 'black', grid = False, color = 'royalblue')
plt.suptitle('Histograms of numerical features', x = 0.5, y = 1.02, size = 35)


#### Top 10 numerical variables highly correlated with `Class`:

In [None]:
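# numeric_only=True restricts the correlation to numerical columns
# (needed in pandas >= 2.0 when the frame contains text columns)
corr_mat = df.corr(numeric_only=True)["Class"].sort_values(ascending=False)
corr_mat.head(11)  # 11 rows because Class itself (correlation 1) is included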

#### Recursive Feature Elimination

What are the top 10 features selected by Recursive Feature Elimination?


In [None]:
# Assign columns to the interim feature matrix X and the response vector y
X_interim = df.loc[:, df.columns != "Class"]
y = df["Class"]

from sklearn.model_selection import train_test_split
X_train_interim , X_test_interim, y_train_interim, y_test_interim = train_test_split(X_interim, y, 
                                                    test_size=0.3, 
                                                    random_state=0, 
                                                    stratify=y)

frames = [X_train_interim, y_train_interim]
df_train_interim = pd.concat(frames, axis=1)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
rfe = RFE(estimator, n_features_to_select=10, step=1)
selector = rfe.fit(X_train_interim.fillna(0).select_dtypes(exclude='object'), y_train_interim)
selectedFeatures = list(
    X_interim.select_dtypes(exclude='object').columns[selector.support_])
selectedFeatures



<mark>To verify: can this selection be correct?</mark>



### Overall Quality

Overall quality is a very important feature: higher-quality houses are more expensive.

In [None]:
sns.boxplot(x='OverallQual', y='Class', data=df)
title = plt.title('House Price/Class by Overall Quality')

### Living Area

The price of a house is linearly correlated with its living area. By examining the scatter plot depicted below, it is evident that there exist some outliers in the data, particularly the two houses positioned in the bottom-right corner. These houses have a living area of more than 4000 square feet but are priced lower than Class 2.



In [None]:
print("Correlation: ", df[['GrLivArea','Class']].corr().iloc[1, 0])
sns.jointplot(x=df['GrLivArea'],y= df['Class'], kind='reg', marginal_kws={'kde': True})


### GarageCars

Houses with a garage that can hold four cars are cheaper than houses with a three-car garage.


In [None]:
sns.boxplot(x='GarageCars', y='Class', data=df)
title = plt.title('House Price/Class by Garage Size')

### House Age

In addition to living area, the age of a house also influences its price significantly. Typically, newer houses command higher prices on average. However, it is worth noting that there are some houses constructed before 1900 that have a relatively high price despite their age.

In [None]:
sns.scatterplot(x='houseAge', y='Class', data=df)
title = plt.title('House Price/Class by House Age')


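## Label Encoding

Ordinal categorical features are label encoded. <mark>Adapted from https://chriskhanhtran.github.io/minimal-portfolio/projects/ames-house-price.html; to verify whether this makes sense.</mark> Note that `LabelEncoder` assigns integer codes in alphabetical order of the labels, so the codes do not necessarily reflect the ranking of the categories.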
    

In [None]:
from sklearn.preprocessing import LabelEncoder

# Ordinal categorical columns
label_encoding_cols = [
    "Alley", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
    "BsmtQual", "ExterCond", "ExterQual", "FireplaceQu", "Functional",
    "GarageCond", "GarageQual", "HeatingQC", "KitchenQual", "LandSlope",
    "LotShape", "PavedDrive", "PoolQC", "Street", "Utilities"
]

# Apply Label Encoder
label_encoder = LabelEncoder()

for col in label_encoding_cols:
    df[col] = label_encoder.fit_transform(df[col])
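
One way to check whether label encoding preserves the ranking is to compare it with an encoder that receives the order explicitly. The sketch below is an assumption-laden alternative, not the approach used above: the toy values are invented, and the order `Po < Fa < TA < Gd < Ex` is assumed here to be the quality scale of the data description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy quality column with Ames-style grades (values are illustrative)
quality = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex", "Fa", "Gd", "Po"]})

# LabelEncoder sorts the labels alphabetically: Ex=0, Fa=1, Gd=2, Po=3, TA=4
alphabetical = LabelEncoder().fit_transform(quality["KitchenQual"])

# OrdinalEncoder with an explicit order: Po=0 < Fa=1 < TA=2 < Gd=3 < Ex=4
grade_order = ["Po", "Fa", "TA", "Gd", "Ex"]
ordinal = OrdinalEncoder(categories=[grade_order])
ranked = ordinal.fit_transform(quality[["KitchenQual"]]).ravel()

print(alphabetical)
print(ranked)
```

With the alphabetical codes, the best grade `Ex` receives the smallest value, whereas the explicit order keeps the codes monotone in quality.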

## Assign columns to feature matrix X and response vector y


In [None]:
# Assign columns to feature matrix X and response vector y

X = df.loc[:, df.columns != "Class"]
y = df["Class"] 

print(X.shape)
print(y.shape)

## Adding Dummies

In [None]:
### I am not 100% sure about this one!!! ####
### Does not change a thing!!!!!!!!!!!!
# factorise the binary variables (no need to create two dummy variables)
# ---> Problem of Multicollinearity 
#Without this the get_dummies would create two variables CentralAir_y and CentralAir_n
#pd.factorize(X['Street'])
# Central Air and one other
# does not change a thing
# pd.factorize(X['CentralAir'])

In [None]:
# Factorize categorical values, assign output to X
# create (multiple) dummy variables for a categorical variable
X = pd.get_dummies(X.iloc[:,:]) 

print(X.shape)
X.head()

In [None]:
X.columns.values

In [None]:
X.info()

## Partitioning of the Data Set Into Train and Test Set

We use a 70/30 train/test split. (The parameter `random_state=0` fixes the random split so that the results are reproducible.)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=0, 
                                                    stratify=y)

A stratified sample is one that maintains the proportion of values as in the original data set. If, for example, the response vector  𝑦 is a binary categorical variable with 25% zeros and 75% ones, `stratify=y` ensures that the random splits have 25% zeros and 75% ones too. Note that `stratify=y` does not mean `stratify=yes` but rather tells the function to take the categorical proportions from response vector `y`.
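
The effect of `stratify` can be checked on a small synthetic example. The labels below are made up to match the 25%/75% case described above; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 25% zeros and 75% ones
labels = np.array([0] * 25 + [1] * 75)
features = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

# Share of ones in each part; both should be close to 0.75
share_train = y_tr.mean()
share_test = y_te.mean()
print(share_train, share_test)
```

Because the class counts must be whole numbers, the shares match the original proportion only up to rounding.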

In [None]:
X_train.info()

## ANOVA of Categorical Variables



### Correlation

In [None]:
# Create correlation matrix from train data excluding `Class`

#corr_mat = df_train.iloc[:, :-1].corr()
#corr_mat = df_train.corr()
## difference????
corr_mat = X_train.corr()

# Select correlations greater than 0.5
high_corr_mat = corr_mat[abs(corr_mat) > 0.5]

# Plot correlation heatmap
sns.heatmap(high_corr_mat, annot=True)
title = plt.title('Correlation Heatmap')

We can see multicollinearity in our training data. Highly correlated are:

- ?
- ?
- ?
- ???

#### What is Multicollinearity?

Multicollinearity is a situation in which two or more predictor variables in a machine learning model are highly correlated with each other. It is considered bad for several reasons:

- It reduces the statistical significance of the coefficients of the correlated variables, making it difficult to interpret the importance of individual predictors in the model.

- It can lead to unstable or unreliable estimates of the coefficients, making it hard to predict the effect of changes in the predictor variables on the outcome variable.

- It can cause the model to be overfit, meaning it performs well on the training data but poorly on new, unseen data.

- It can also increase the variance of the coefficients, making the model more sensitive to small changes in the input data.


--> Therefore, for each pair of highly correlated features, we will remove the feature that has the lower correlation with `Class`.
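
A possible implementation of this rule is sketched below. It is only a sketch under assumptions: the helper `drop_correlated` is hypothetical, the threshold of 0.5 mirrors the heatmap cut-off above, and the demo data are synthetic rather than taken from the project.

```python
import numpy as np
import pandas as pd

def drop_correlated(features, target, threshold=0.5):
    """Hypothetical helper: for each highly correlated pair of features,
    drop the one that is less correlated with the target."""
    corr = features.corr().abs()
    target_corr = features.corrwith(target).abs()
    to_drop = set()
    cols = list(features.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # one member of this pair is already removed
            if corr.loc[a, b] > threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return features.drop(columns=sorted(to_drop)), sorted(to_drop)

# Synthetic demo data: "area_noisy" is a noisy copy of "area"
rng = np.random.default_rng(0)
area = rng.normal(size=200)
demo = pd.DataFrame({
    "area": area,
    "area_noisy": area + rng.normal(scale=0.5, size=200),
    "unrelated": rng.normal(size=200),
})
demo_target = pd.Series(area * 2.0)  # target driven by "area"

reduced, dropped = drop_correlated(demo, demo_target)
print(dropped)
```

In the project, the call would presumably take `X_train` and `y_train` so that the selection is based on the training data only.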

<mark>TO DO!!</mark>

## Skewness and Normalizing Variables

Linear regression assumes normally distributed errors, and strongly skewed variables often violate this assumption. Transforming skewed data can therefore improve the performance of the models.

In [None]:
plt.figure(1); plt.title('Distribution of Class')
sns.histplot(data=y, discrete = True)

In [None]:
sns.displot(data=y, kind='hist', kde=True)
title = plt.title("House Price Distribution")

We see that our `Class` deviates from the normal distribution, is positively (right-) skewed, and shows peakedness (kurtosis).

In [None]:
#skewness and kurtosis
print("Skewness: %f" % df['Class'].skew())
print("Kurtosis: %f" % df['Class'].kurt())

To normalize right-skewed data, a log transformation can be used, since it pulls the larger values towards the center. However, because log(0) is undefined (NumPy returns `-inf`), log(1+X) is preferred as a fix for the skewness.
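
The effect of the transformation can be illustrated on synthetic right-skewed data. The log-normal sample below is an assumption made for the demo and is unrelated to the project data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal), for illustration only
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_before = prices.skew()
skew_after = np.log1p(prices).skew()  # log1p(x) computes log(1 + x)

zero_safe = np.log1p(0.0)  # stays finite, unlike np.log(0.0)
print(skew_before, skew_after, zero_safe)
```

`np.log1p` is the numerically stable NumPy equivalent of `np.log(1 + x)`.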

In [None]:
y_trafo = np.log1p(y)  # log(1 + y)
sns.displot(data=y_trafo, kind='hist', kde=True)
title = plt.title("House Price Distribution Transformation")

In [None]:
#skewness = train_data.skew().sort_values(ascending=False)
#skewness[abs(skewness) > 0.75]

In [None]:
# List of skewed columns
#skewed_cols = list(skewness[abs(skewness) > 0.5].index)

# Remove 'MSSubClass' and 'SalePrice'
#skewed_cols = [
#    col for col in skewed_cols if col not in ['MSSubClass', 'SalePrice']
#]

# Log-transform skewed columns
#for col in skewed_cols:
#    X[col] = np.log(1 + X[col])

## Feature Scaling

Standardizing the dataset before running machine learning algorithms is generally recommended, except for Decision Tree and Random Forest models. This is because optimization methods and gradient descent algorithms tend to perform and converge faster on features that are similarly scaled.

However, outliers can have a negative impact on the sample mean and standard deviation, and models such as Lasso are highly sensitive to outliers. In such cases, using the median and interquartile range is a better alternative. For this reason, the <mark>RobustScaler or StandardScaler (still to be decided)</mark> method is used to transform the training data.
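
To make the difference concrete, the following sketch compares both scalers on a made-up column with a single extreme value. It does not settle which scaler the project should use; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic column with one extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

standard = StandardScaler().fit_transform(values).ravel()  # mean / std
robust = RobustScaler().fit_transform(values).ravel()      # median / IQR

# Spread of the five ordinary points after scaling
spread_standard = standard[:5].max() - standard[:5].min()
spread_robust = robust[:5].max() - robust[:5].min()
print(spread_standard, spread_robust)
```

With `StandardScaler`, the outlier inflates the standard deviation and squeezes the ordinary points together, while `RobustScaler` keeps them spread out.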
    

In [None]:
from sklearn.preprocessing import MinMaxScaler 

# Get cols to scale
cols_scl = X.columns.values[:]

# Apply MinMaxScaler on continuous columns only (check dummies!!!)
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train[cols_scl])  # fit & transform
X_test_norm  = mms.transform(X_test[cols_scl])  # ONLY transform

In [None]:
from sklearn.preprocessing import StandardScaler 

# Apply StandardScaler on continuous columns only
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train[cols_scl])  # fit & transform
X_test_std  = stdsc.transform(X_test[cols_scl])  # ONLY transform

## One Hot Encoding

## Cross Validation

## Leave-One-Out Cross Validation


## Decision Tree

## !! UNDER CONSTRUCTION !!!

In [None]:
# Imports
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.compose import make_column_selector as selector

In [None]:
# Mute warnings (related to LogReg 'max_iter' param)
import warnings
warnings.filterwarnings('ignore')


num_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), ("imputer", SimpleImputer(strategy="median"))]
)

cat_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, selector(dtype_include=np.number)),
        ("cat", cat_transformer, selector(dtype_include=object)),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

In [None]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
search_cv

In [None]:
search_cv.fit(X_train, y_train)

# Print results
print('Best CV accuracy: {:.2f}'.format(search_cv.best_score_))
print('Test score:       {:.2f}'.format(search_cv.score(X_test, y_test)))
print('Best parameters: {}'.format(search_cv.best_params_))


Now we repeat the same steps with a random forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier


clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

In [None]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__max_depth": [1, 3, 5, 10],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)

search_cv.fit(X_train, y_train)

# Print results
print('Best CV accuracy: {:.2f}'.format(search_cv.best_score_))
print('Test score:       {:.2f}'.format(search_cv.score(X_test, y_test)))
print('Best parameters: {}'.format(search_cv.best_params_))


In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

cat_selector = selector(dtype_include=object)
num_selector = selector(dtype_include=np.number)

cat_tree_processor = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-2,
)
num_tree_processor = SimpleImputer(strategy="mean", add_indicator=True)

tree_preprocessor = make_column_transformer(
    (num_tree_processor, num_selector), (cat_tree_processor, cat_selector)
)

#####

clf = Pipeline(
    steps=[("preprocessor", tree_preprocessor), ("classifier", RandomForestClassifier())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

In [None]:
param_grid = {
    "classifier__max_depth": [5, 10, 25],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)

search_cv.fit(X_train, y_train)

# Print results
print('Best CV accuracy: {:.2f}'.format(search_cv.best_score_))
print('Test score:       {:.2f}'.format(search_cv.score(X_test, y_test)))
print('Best parameters: {}'.format(search_cv.best_params_))


# Copy Paste from Last Year Function (need to change that)

For scoring our models we will be using the weighted $F_1$-score.
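
The weighted $F_1$-score averages the per-class $F_1$-scores, weighting each class by its number of true instances (its support). A small sketch with invented labels shows the calculation; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels for three classes with supports 4, 2 and 1
y_true_demo = [1, 1, 1, 1, 2, 2, 3]
y_pred_demo = [1, 1, 1, 2, 2, 2, 3]

per_class = f1_score(y_true_demo, y_pred_demo, average=None)
weighted = f1_score(y_true_demo, y_pred_demo, average="weighted")

# Manual check: support-weighted mean of the per-class scores
support = np.array([4, 2, 1])
manual = (per_class * support).sum() / support.sum()
print(per_class, weighted, manual)
```

This is the same metric that `scoring="f1_weighted"` selects in `GridSearchCV`.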

In [None]:

############## Report Functions ##############
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Function to get the best parameters, the mean cross-validation score of the best parameters,
# the standard deviation of the cross-validation scores of the best parameters,
# and the score on the test set
def get_results_cv(func, X_test_cleaned, y_test_cleaned):
  """
  Inputs required: already fitted GridSearchCV or RandomizedSearchCV object, cleaned X test set, cleaned y test set
  Prints best parameters, mean score, and sd of score of cross-validation. Also prints the test score of the best parameters
  """
  std_best_score = func.cv_results_["std_test_score"][func.best_index_]
  print(f"Best parameters: {func.best_params_}")
  print(f"Mean CV score: {func.best_score_: .6f}")
  print(f"Standard deviation of CV score: {std_best_score: .6f}")
  print("Test Score: {:.6f}".format(func.score(X_test_cleaned, y_test_cleaned)))

# Function to get the metrics report and a heatmap of the confusion matrix for the test set
def final_report(y_true, y_pred):
  """
  Inputs required: true classes, predicted classes
  Prints the classification report and plots the confusion matrix of the model
  """
  class_report = metrics.classification_report(y_true, y_pred)
  print(class_report)
  # normalize="all" expresses every cell as a share of all observations
  cm = confusion_matrix(y_true, y_pred, normalize = "all")
  cm = pd.DataFrame(cm, ["1", "2", "3", "4", "5"],  ["1", "2", "3", "4", "5"])
  plt.figure(figsize = (10,5))
  sns.heatmap(cm, annot = True, fmt = ".2%", cmap = "Blues").set(xlabel = "Assigned Class", ylabel = "True Class", title = "Confusion Matrix")
     


# Support Vector Machines (Selina)

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

Support vector machines (SVMs) are supervised learning algorithms that can be used for classification, regression, and outlier detection. They are effective in high-dimensional spaces, including cases where the number of dimensions exceeds the number of samples. They are also memory-efficient, because the decision function uses only a subset of the training points, the "support vectors". Moreover, SVMs are versatile: different kernel functions can be specified for the decision function, including custom kernels.

SVMs also have limitations. When the number of features far exceeds the number of samples, the kernel function and the regularization term must be chosen carefully to prevent over-fitting. In addition, SVMs do not provide probability estimates directly; these are calculated with a time-consuming five-fold cross-validation.
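
Two of these properties, the support vectors and the optional probability estimates, can be inspected directly on a fitted model. The sketch below uses a synthetic data set from `make_classification`; the sample size, feature count, and variable names are assumptions for the demo and are not related to the project data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification problem, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# probability=True triggers the additional internal cross-validation
svc_demo = SVC(kernel="rbf", probability=True, random_state=0).fit(X_demo, y_demo)

n_support = svc_demo.support_.shape[0]       # number of support vectors
proba = svc_demo.predict_proba(X_demo[:3])   # class probabilities per row
print(n_support, proba)
```

Only the training points counted in `n_support` enter the decision function, which is typically a subset of the training set.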

## Importing Packages


In [None]:
from sklearn import svm
from sklearn.svm import SVC

from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix


## SVM

In [None]:
# The SVC Class from Sklearn
# Use the imported SVC class: `svm.SVC(...)` fails once `svm` is rebound to the model
svm = SVC(
        C=1.0,                          # The regularization parameter
        kernel='rbf',                   # The kernel type used 
        degree=3,                       # Degree of polynomial function 
        gamma='scale',                  # The kernel coefficient
        coef0=0.0,                      # If kernel = 'poly'/'sigmoid'
        shrinking=True,                 # To use shrinking heuristic
        probability=False,              # Enable probability estimates
        tol=0.001,                      # Stopping criterion
        cache_size=200,                 # Size of kernel cache
        class_weight=None,              # The weight of each class
        verbose=False,                  # Enable verbose output
        max_iter= -1,                   # Hard limit on iterations
        decision_function_shape='ovr',  # One-vs-rest or one-vs-one
        break_ties=False,               # How to handle breaking ties
        random_state=42               # Random state of the model
    )

print(f"Parameters of the Support Vector Machine: {svm.get_params().keys()}")

In [None]:
# Building and training our model
clf = svm.fit(X_train, y_train)

# Making predictions with our data
predictions = clf.predict(X_test)

print(accuracy_score(y_test, predictions))

## Scaling/Pipeline

Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be done easily by using a Pipeline:

In [None]:
scaler = StandardScaler()
mms = MinMaxScaler()

ros = RandomOverSampler(random_state = 42)
kFold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)


svm_pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["SVM", svm]])
param_grid = {
    'ros': [ros, None], 
    'scaler': [scaler, mms],
    "SVM__kernel": ["linear", "sigmoid"],
    "SVM__C": [1, 10],
    "SVM__gamma": ["auto", "scale"]
}
gs = GridSearchCV(estimator = svm_pipe, param_grid = param_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1)

gs = gs.fit(X_train, y_train)

get_results_cv(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)


# get test score, metrics report and confusion matrix
final_report(y_test, y_pred)



### From Internet

https://stackoverflow.com/questions/62346013/svc-object-has-no-attribute-svc

In [None]:
tuned_parameters = [{'kernel': ['linear', 'poly', 'rbf'],
                     'C': [1]}
                   ]

clf = GridSearchCV(SVC(), tuned_parameters, scoring='accuracy')
clf.fit(X_train, y_train)


print("Best parameters set found on development set:\n")
print(clf.best_params_)
print("\nGrid scores on development set:\n")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))

print("\nDetailed classification report:\n")
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")

y_true, y_pred = y_test, clf.predict(X_test)

# classification_report is accessed through the imported sklearn.metrics module
print(metrics.classification_report(y_true, y_pred))

### Random Forest Feature Selection ###


In [None]:
# `svm` already refers to an SVC instance, so the SVC class is used directly
svm = SVC(random_state = 42, max_iter = 1000)
pipe = imbpipeline(steps=[["scaler", scaler],["ros", ros], ["SVM", svm]])

param_grid= {
    "scaler": [scaler, mms],
    "ros": [ros, None],
    "SVM__kernel": ["linear", "sigmoid"],
    "SVM__C": [1, 10],
    "SVM__gamma": ["auto", "scale"]
}
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train_rf, y_train_cleaned)

get_results_cv(gs, X_test_rf, y_test_cleaned)
y_pred = gs.best_estimator_.predict(X_test_rf)
# get test score, metrics report and confusion matrix
final_report(y_test_cleaned, y_pred)

### XGBoost Feature Selection 

In [None]:
# `svm` already refers to an SVC instance, so the SVC class is used directly
svm = SVC(random_state = 42, max_iter = 1000)
pipe = imbpipeline(steps=[["scaler", scaler],["ros", ros], ["SVM", svm]])

param_grid= {
    "scaler": [scaler, mms],
    "ros": [ros, None],
    "SVM__kernel": ["linear", "sigmoid"],
    "SVM__C": [1, 10],
    "SVM__gamma": ["auto", "scale"]
}
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train_xgbc, y_train_cleaned)

get_results_cv(gs, X_test_xgbc, y_test_cleaned)
y_pred = gs.best_estimator_.predict(X_test_xgbc)
# get test score, metrics report and confusion matrix
final_report(y_test_cleaned, y_pred)

### PCA Dimension Reduction 

In [None]:
from sklearn.decomposition import PCA

svm = SVC(random_state = 42, max_iter = 1000, shrinking = True, kernel = "sigmoid", C = 1)
pca = PCA()

svm_pipe = imbpipeline(steps=[["scaler", scaler], ["pca", pca], ["ros", ros], ["SVM", svm]])
param_grid = {
    'ros': [ros, None], 
    'scaler': [scaler, mms],
    "SVM__gamma": ["auto", "scale"],
    "pca__n_components": np.arange(4, 10, 1)
}
gs = GridSearchCV(estimator = svm_pipe, param_grid = param_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1)
gs = gs.fit(X_train_cleaned, y_train_cleaned)

get_results_cv(gs, X_test_cleaned, y_test_cleaned)
y_pred = gs.best_estimator_.predict(X_test_cleaned)
# get test score, metrics report and confusion matrix
final_report(y_test_cleaned, y_pred)