# Data Science Project Spring 2023

## 200+ Financial Indicators of US stocks (2014-2018)

### Yiwei Gong, Janice Herman, Alexander Morawietz and Selina Waber

University of Zurich, Spring 2023

## Importing Packages

In [None]:
import os 
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from pandas_datareader import data


from sklearn.model_selection import train_test_split

## Loading the Data Set


We used the data set published by Nicolas Carbone on [Kaggle](https://www.kaggle.com/datasets/cnic92/200-financial-indicators-of-us-stocks-20142018). It consists of one file per year, and each file contains more than 200 financial indicators taken from the [10-K filings](https://www.investopedia.com/terms/1/10-k.asp#:~:text=Key%20Takeaways-,A%2010%2DK%20is%20a%20comprehensive%20report%20filed%20annually%20by,detailed%20than%20the%20annual%20report.) of publicly traded US companies for the years 2014 to 2018.

In [None]:
def load_dataset():
    project_directory = sys.path[0]  # path of the project directory
    data_directory = os.path.join(project_directory, 'data')

    years = [2014, 2015, 2016, 2017, 2018]

    # Load the yearly datasets into the list dfs
    dfs = []
    for year in years:
        df = pd.read_csv(os.path.join(data_directory, f'{year}_Financial_Data.csv'), sep=',')
        df['year'] = str(year)  # add a column holding the respective year
        # Give the target the same name in every frame, e.g. '2016 PRICE VAR [%]' becomes 'PRICE VAR [%]'
        df['PRICE VAR [%]'] = df[f'{year + 1} PRICE VAR [%]']
        df = df.drop(columns=[f'{year + 1} PRICE VAR [%]'])  # drop the year-specific column
        df = df.rename(columns={df.columns[0]: 'Stock Name'})  # name the unnamed first column
        dfs.append(df)

    df = pd.concat(dfs, ignore_index=True)  # concatenate the yearly dataframes

    return df

## Some Explanation of Variables:

### Adding `year` as a categorical variable

We added a column named `year` that contains the respective year.

### Handling the variable `PRICE VAR [%]`

The last column, `PRICE VAR [%]`, lists the percent price variation of each stock for the year. For example, if we consider the dataset 2015_Financial_Data.csv, we will have:

- 200+ financial indicators for the year 2015;
- percent price variation for the year 2016 (meaning from the first trading day on Jan 2016 to the last trading day on Dec 2016).

We renamed each year-specific column, e.g. `2016 PRICE VAR [%]`, to `PRICE VAR [%]` and dropped the original column.
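
The renaming step can be sketched on a toy frame. The frame `toy` and its values are made up for illustration; only the column naming pattern is taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for 2015_Financial_Data.csv (values are invented).
year = 2015
toy = pd.DataFrame({
    "Revenue": [100.0, 250.0],
    f"{year + 1} PRICE VAR [%]": [12.5, -3.0],
})

# Rename the year-specific target column and record the filing year.
toy = toy.rename(columns={f"{year + 1} PRICE VAR [%]": "PRICE VAR [%]"})
toy["year"] = str(year)

print(toy.columns.tolist())
```

After this step every yearly frame shares the same column names, which is what allows them to be concatenated row-wise.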

### The variable `Class`

`Class` lists a binary classification for each stock, where

- for each stock, if the `PRICE VAR [%]` value is positive, `Class` = 1. From a trading perspective, the 1 identifies those stocks that a hypothetical trader should BUY at the start of the year and sell at the end of the year for a profit.
- for each stock, if the `PRICE VAR [%]` value is negative, `Class` = 0. From a trading perspective, the 0 identifies those stocks that a hypothetical trader should NOT BUY, since their value will decrease, meaning a loss of capital.

The columns `PRICE VAR [%]` and `Class` make it possible to use the datasets for both classification and regression tasks:

- If the user wishes to train a machine learning model so that it learns to classify stocks as buy-worthy or not buy-worthy, it is possible to get the targets from the `Class` column;

- If the user wishes to train a machine learning model so that it learns to predict the future value of a stock, it is possible to get the targets from the `PRICE VAR [%]` column.
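
The relationship between the two target columns can be sketched as follows. The values are invented, and mapping a price variation of exactly zero to class 0 is an assumption, since the description only covers positive and negative values.

```python
import pandas as pd

# Invented price variations in percent.
price_var = pd.Series([12.5, -3.0, 0.0, 48.1], name="PRICE VAR [%]")

# Positive variation -> 1 (buy-worthy), otherwise 0.
# Assumption: a variation of exactly 0 is treated as class 0.
derived_class = (price_var > 0).astype(int)

print(derived_class.tolist())
```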

### The variable `Stock Name`

We named the first column `Stock Name` since it is unnamed in the original dataset.


## Numerical and Categorical Features/Variables



In [None]:
# Convert Class to a categorical variable.
def class_to_categorical(df):
    # 'object' stores the labels as generic Python objects; 'category' would be pandas' dedicated categorical dtype
    df['Class'] = df.Class.astype('object')
    return df
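
The difference between the `object` and `category` dtypes can be illustrated on a small series. This is a sketch with made-up labels; the relevant point for this notebook is that `select_dtypes(include='object')` does not pick up `category` columns, so a `category` column would be counted among the numerical features by a check that only excludes `object`.

```python
import pandas as pd

labels = pd.Series([1, 0, 1, 1, 0])

as_object = labels.astype("object")      # generic Python objects
as_category = labels.astype("category")  # fixed set of categories with integer codes

frame = pd.DataFrame({"obj": as_object, "cat": as_category, "num": labels})

# select_dtypes treats the two dtypes as distinct groups.
object_cols = frame.select_dtypes(include="object").columns.tolist()
category_cols = frame.select_dtypes(include="category").columns.tolist()

print(as_object.dtype, as_category.dtype)
print(as_category.cat.categories.tolist())
print(object_cols, category_cols)
```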

In [None]:
def print_number_of_numerical_categorical_variables(df):

    numCols = df.select_dtypes(exclude='object').columns
    print(f"There are {len(numCols)} numerical features:\n", numCols)

    catCols = df.select_dtypes(include='object').columns
    print(f"There are {len(catCols)} categorical features:\n", catCols)

## Any Duplicates?

There are no duplicate rows, but 20 columns are duplicated: they form 10 pairs in which the two columns have different names but contain the same data.

In [None]:
def check_duplicates_row(df):
    print('Duplicates in Rows:', df.duplicated().any())

In [None]:
def check_duplicates_col(df):
    print('Duplicates in Columns:', df.T.duplicated().any())
    print("Show the Duplicates:")
    print(df.T[df.T.duplicated(keep=False)].T)

In [None]:
def remove_duplicates(df,columns):
    shape_old=df.shape

    df=df.drop(columns=columns)

    #print(f' Shape with duplicates:', shape_old) 
    #print(f' Shape after removal of duplicates:', df.shape) 
    
    return df


Our duplicates are the following pairs:

- `ebitperRevenue` and `eBITperRevenue`
- `ebtperEBIT` and `eBTperEBIT`
- `niperEBT` and `nIperEBT`
- `returnOnAssets` and `Return on Tangible Assets`
- `returnOnCapitalEmployed` and `ROIC`
- `payablesTurnover` and `Payables Turnover`
- `inventoryTurnover` and `Inventory Turnover`
- `debtRatio` and `Debt to Assets`
- `debtEquityRatio` and `Debt to Equity`
- `cashFlowToDebtRatio` and `cashFlowCoverageRatios`
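
Such pairs can be found by transposing the frame and checking for duplicated rows. The sketch below uses a toy frame with invented values; only the two column names in the first pair are taken from the list above.

```python
import pandas as pd

toy = pd.DataFrame({
    "debtRatio": [0.2, 0.5, 0.9],
    "Debt to Assets": [0.2, 0.5, 0.9],   # same data under a different name
    "Revenue": [10.0, 20.0, 30.0],
})

# After transposing, duplicated columns become duplicated rows.
dup_mask = toy.T.duplicated(keep="first").to_numpy()
duplicate_columns = toy.columns[dup_mask].tolist()

# Keep the first column of every duplicated group.
deduplicated = toy.loc[:, ~dup_mask]

print(duplicate_columns)
```

With `keep="first"`, the first column of each pair is kept and the later one is flagged.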

## Feature Engineering

We took the yearly figures from the following web pages: [S&P means](https://www.macrotrends.net/2526/sp-500-historical-annual-returns) and [inflation](https://www.macrotrends.net/countries/USA/united-states/inflation-rate-cpi)

In [None]:
def adding_indicators(df):

    ## Yearly means of the S&P 500 (in percent)
    ## TODO: decide whether the index should run from 2015 to 2019 instead (the value for 2019 is 28.88 %)
    sp500_means = pd.Series([11.39, -0.73, 9.54, 19.42, -6.24], index = [2014, 2015, 2016, 2017, 2018])

    ## Yearly inflation rate (in percent) measured by the consumer price index
    ## TODO: same question about the year offset; alternatively use the annual change
    inflation = pd.Series([1.62, 0.12, 1.26, 2.13, 2.44], index = [2014, 2015, 2016, 2017, 2018])

    ## Add both series to the dataframe by looking up the year of each row
    years = df["year"].astype(int)
    df["inflation"] = years.map(inflation)
    df["sp500_means"] = years.map(sp500_means)

    ## Calculation of Excess Return: price variation minus the S&P 500 figure
    df["excess_return"] = df["PRICE VAR [%]"] - df["sp500_means"]

    ## Calculation of Cashflow Margin:
    df["cashflow_margin"] = df["Operating Cash Flow"].divide(df["Revenue"])
    # Division by zero yields +/-inf in pandas instead of raising an error; replace it by NaN
    df["cashflow_margin"] = df["cashflow_margin"].replace([np.inf, -np.inf], np.nan)

    ## Calculation of Return on Net Assets (RONA)
    df["Net_working_capital"] = df["Total assets"] - df["Cash and cash equivalents"]
    df["RONA"] = df["EBIT"] / df["Net_working_capital"]
    df["RONA"] = df["RONA"].replace([np.inf, -np.inf], np.nan)

    #df["operating_liabilities"] = df["Total liabilities"]-df["Total debt"]

    return df
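
The handling of zero denominators in the ratio features can be checked on a toy frame. The values are invented; the column names match those used for the cash flow margin.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Operating Cash Flow": [50.0, 10.0, 0.0],
    "Revenue": [200.0, 0.0, 0.0],
})

# pandas does not raise on division by zero: 10/0 gives inf and 0/0 gives NaN.
ratio = toy["Operating Cash Flow"].divide(toy["Revenue"])

# Map the infinite values to NaN so that they are treated as missing later on.
ratio = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.tolist())
```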

### Dropping obviously wrong indicators

In [None]:
def dropping_indicators(df):
    df_new = df.drop(["operatingProfitMargin"], axis = 1) # consisting only of the value 1.
    
    # TODO: check whether more indicators should be dropped that are not yet covered by the correlation or duplicate checks
    
    return df_new

## Correlation of the variables

In [None]:
def show_correlation(df):
    X = df[df.columns.difference(['Class', 'Stock Name', 'Sector', 'year', 'PRICE VAR [%]'])]
    y = df["Class"]
    plt.matshow(X.corr().abs())
    plt.colorbar()
    plt.show()

TODO: double-check that `abs_corr_unstack` is correct.

In [None]:
def correlation(df):
    X = df[df.columns.difference(['Class', 'Stock Name', 'Sector', 'year', 'PRICE VAR [%]'])]
    y = df["Class"]

    abs_corr = X.corr().abs()
    for i in range(len(abs_corr)):
        abs_corr.iloc[i, i] = 0
        
    abs_corr_unstack = abs_corr.unstack()
    # For inspection in a notebook: abs_corr_unstack.sort_values()[-50:] lists the 50 most correlated pairs

    #print((abs_corr_unstack.values>0.99).sum()/2)

    return abs_corr_unstack

In [None]:
# Suggestion for dealing with the correlations: remove a variable if its correlation with another variable is higher than 0.99
def remove_correlation(df, abs_corr_unstack):
    columns_to_drop = []
    columns_to_remain = []

    for pair in abs_corr_unstack.index.values:
        if abs_corr_unstack[pair] > 0.99:
            if pair[0] not in columns_to_remain and pair[1] not in columns_to_remain:
                columns_to_remain.append(pair[0])
                if pair[1] not in columns_to_drop:
                    columns_to_drop.append(pair[1])
            elif pair[0] in columns_to_remain:
                if pair[1] not in columns_to_drop:
                    columns_to_drop.append(pair[1])
            elif pair[1] in columns_to_remain:
                if pair[0] not in columns_to_drop:
                    columns_to_drop.append(pair[0])

    df_corr_removed = df.drop(columns=columns_to_drop)

    return df_corr_removed
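
A common alternative to the pair-wise bookkeeping above is to look only at the upper triangle of the correlation matrix, so that each pair is considered once. The sketch below is not the function used in this notebook; it applies the same 0.99 threshold to synthetic data with made-up column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,   # perfectly correlated with "a"
    "b": rng.normal(size=200),   # independent noise
})

abs_corr = toy.corr().abs()

# Keep only the entries above the diagonal; everything else becomes NaN.
upper_mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(upper_mask)

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
pruned = toy.drop(columns=to_drop)

print(to_drop)
```

Of each highly correlated pair, this variant keeps the column that appears first in the frame.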

## Class Balance?

The variable `Class` is not balanced. We have to keep that in mind for the train-test split.

In [None]:
def check_class_imbalance(y):
    sns.countplot(x=y)
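
The class shares can also be inspected numerically, and a stratified split keeps them equal in both parts. The sketch uses a synthetic target with a 70/30 split, which is an invented ratio and not the one found in the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target (the 70/30 ratio is invented).
y = pd.Series([1] * 70 + [0] * 30, name="Class")
X = pd.DataFrame({"feature": np.arange(len(y))})

shares = y.value_counts(normalize=True)

# stratify=y preserves the class shares in the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

print(shares.to_dict(), y_train.mean(), y_test.mean())
```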

## Outlier Detection for `PRICE VAR [%]`

In [None]:
def get_list_of_sectors(df):
    df_ = df.loc[:, ['Sector','PRICE VAR [%]']]

    # Get list of sectors
    sector_list = df_['Sector'].unique()

    # Plot the percent price variation for each sector
    for sector in sector_list:
        
        temp = df_[df_['Sector'] == sector]

        plt.figure(figsize=(30,5))
        plt.plot(temp['PRICE VAR [%]'])
        plt.title(sector.upper())
        plt.show()
    

### Outliers

Adapted from https://www.kaggle.com/code/cnic92/explore-and-clean-financial-indicators-dataset

In [None]:
def check_outliers(df):
    # Get stocks that increased more than 500%
    gain = 500
    top_gainers = df[df['PRICE VAR [%]'] >= gain]
    top_gainers = top_gainers['PRICE VAR [%]'].sort_values(ascending=False)
    print(f'{len(top_gainers)} STOCKS with more than {gain}% gain.')



## Outliers cleaning

There are outliers and extreme values that are probably caused by data-entry errors. During our analysis of the data, we noticed that NA and 0 occur frequently and that 0 is used interchangeably with NA. In addition, many values seem impossible.

In [None]:
from sklearn.ensemble import IsolationForest

def remove_outliers(X_train, X_test, y_train, y_test):
    ## Isolation Forest
    outliers = IsolationForest(random_state = 42).fit(X_train) # fit Isolation Forest only to training data
    outliers_train = outliers.predict(X_train)
    outliers_test = outliers.predict(X_test)

    ## Keep only the inliers: predict() returns 1 for inliers and -1 for outliers
    train_mask = outliers_train == 1
    test_mask = outliers_test == 1
    X_train_cleaned = X_train[train_mask]
    y_train_cleaned = y_train[train_mask]
    X_test_cleaned = X_test[test_mask]
    y_test_cleaned = y_test[test_mask]
    print("Shape with outliers: ", X_train.shape,", Shape without outliers: ", X_train_cleaned.shape,", Removed outliers: ", X_train.shape[0]-X_train_cleaned.shape[0])
    print("Shape with outliers: ", X_test.shape,", Shape without outliers: ", X_test_cleaned.shape,", Removed outliers: ", X_test.shape[0]-X_test_cleaned.shape[0]) 
    

    return X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned
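
How the Isolation Forest flags rows can be tried out on synthetic data. This is a sketch with invented values, in which a few rows are shifted far away from the rest.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 2))
X_train[:5] += 10.0   # a few rows far away from the bulk of the data

forest = IsolationForest(random_state=42).fit(X_train)

# predict() returns 1 for inliers and -1 for outliers.
flags = forest.predict(X_train)
inlier_mask = flags == 1
X_train_cleaned = X_train[inlier_mask]

print(X_train.shape, X_train_cleaned.shape)
```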


See also: https://www.kaggle.com/code/nareshbhat/outlier-the-silent-killer

## Missing Values

There are a lot of missing values. 

In [None]:
def check_missing_values(df):
    print(f'There are in total {df.isnull().sum().sum()} NAN in the dataframe')

    ## Overview of all variables with missing values
    df.isnull().mean().sort_values(ascending=False).plot.bar(figsize=(100,20))
    plt.ylabel('Percentage of missing values')
    plt.xlabel('Variables')
    plt.title('Quantifying ALL missing data')
    plt.show()

    most_nan = df.isnull().mean().sort_values(ascending=False)
    most_nan = most_nan[most_nan > 0.3]

    most_nan.plot.bar(figsize=(20,20))
    plt.ylabel('Percentage of missing values')
    plt.xlabel('Variables')
    plt.title('Data with more than 30% missing')
    plt.show()

    # Number and share of missing values per variable
    missing = df.isnull().sum().sort_values(ascending=False)
    percent = df.isnull().mean().sort_values(ascending=False)
    missing_data = pd.concat([missing, percent], axis=1, keys=['Nr. of missing values', 'Share of missing values'])
    print(missing_data.head(25))

    # Heatmap of the missing values
    sns.heatmap(df.isna().transpose(), cmap="Blues", cbar_kws={'label': 'Missing Values'})
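
The share of missing values per column, and the columns above a threshold, can be computed without any plotting. The toy frame and its values are invented; the threshold of 0.3 is the one used later in the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Revenue": [1.0, np.nan, 3.0, 4.0],
    "EBIT": [np.nan, np.nan, np.nan, 1.0],
    "Sector": ["Tech", "Energy", None, "Tech"],
})

# isna().mean() gives the share of missing values per column (between 0 and 1).
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns whose share of missing values exceeds the threshold.
above_threshold = missing_share[missing_share > 0.3].index.tolist()

print(missing_share.to_dict(), above_threshold)
```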


## Handling Missing Data

In [None]:
def handle_missing_data(df, threshold):

    print(sum(df.isna().mean() > threshold)) # number of variables above the threshold (76 of the remaining variables have more than 30% NAs)
    # Calculate the share of missing values in each column
    missing_values_percentage = df.isna().sum() / df.shape[0]
    # Identify the columns with a higher share of missing values than the threshold
    columns_to_drop = missing_values_percentage[missing_values_percentage > threshold].index
    print(f"Columns to drop: {columns_to_drop}")

    # NOTE: these columns are only reported; the drop below is disabled
    # df = df.loc[:, df.isna().mean() < threshold] # drop all columns with NA proportion higher than threshold

    # Impute numerical columns with their median
    numCols = df.select_dtypes(include=['float64', 'int64']).columns
    print("New numerical columns:", numCols)
    df[numCols] = df[numCols].fillna(df[numCols].median())

    # Fill categorical columns with the placeholder "Unknown"
    catCols = df.select_dtypes(exclude=np.number).columns
    print("New categorical columns:", catCols)
    for col in catCols:
        df[col] = df[col].fillna("Unknown")

    return numCols, catCols, df

## Handling unique values and cardinality


"Each categorical variable consists of unique values. A categorical feature is said to possess high cardinality when there are too many of these unique values. One-Hot Encoding becomes a big problem in such a case since we have a separate column for each unique value (indicating its presence or absence) in the categorical variable. This leads to two problems, one is obviously space consumption, but this is not as big a problem as the second problem, the curse of dimensionality" [reference here](https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b)


In [None]:
def reduce_cardinality(column, threshold):
    #threshold
    threshold_value = int(threshold * len(column))
    # Initialize
    categories_list = []
    s = 0
    counts = []
    
    # Count the frequencies of unique values in the column
    for value in column:
        # Check if the value is already in the counts list
        index = next((i for i, x in enumerate(counts) if x[0] == value), None)
        if index is not None:
            counts[index] = (value, counts[index][1] + 1)
        else:
            counts.append((value, 1))
    
    # Sort the list of tuples based on count in descending order
    counts.sort(key=lambda x: x[1], reverse=True)
    
    # Loop through the tuples (value, count)
    for i, j in counts:
        # Add the frequency to the global sum
        s += j
        # Append the category name to the list
        categories_list.append(i)
        # Check if the global sum has reached the threshold value, if so break the loop
        if s >= threshold_value:
            break
    
    # Append the category 'Other' to the list
    categories_list.append('Other')
    
    # Replace all instances not in our new categories by 'Other'
    new_column=column.apply(lambda x: x if x in categories_list else 'Other')
    
    #print(categories_list)
    
    return new_column
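
The same idea can be written more compactly with `value_counts`. This is a sketch rather than a drop-in replacement: the function name `reduce_cardinality_sketch` is made up, the sector counts are invented, missing values are assumed to have been filled beforehand, and categories with equal counts may be ordered differently than in the loop above.

```python
import pandas as pd

def reduce_cardinality_sketch(column, threshold):
    # Frequencies of the unique values, sorted in descending order.
    counts = column.value_counts()
    cumulative = counts.cumsum()
    threshold_value = int(threshold * len(column))
    # Keep all categories up to and including the first one that reaches the threshold.
    n_keep = int((cumulative < threshold_value).sum()) + 1
    keep = counts.index[:n_keep]
    # Everything else is collapsed into 'Other'.
    return column.where(column.isin(keep), "Other")

sectors = pd.Series(["Tech"] * 5 + ["Energy"] * 3 + ["Utilities"] + ["Retail"])
reduced = reduce_cardinality_sketch(sectors, 0.75)

print(reduced.value_counts().to_dict())
```

In this toy example the two most frequent sectors cover 8 of 10 rows, which reaches the 75 % threshold, so the remaining two sectors are merged into `Other`.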


## Adding Dummies

Important:
- Apply `reduce_cardinality(df['Sector'], threshold)` first!
- Call `add_dummies()` only for the categorical columns. Of `Sector` and `Class`, `Class` is already binary encoded, so dummies are only created for `Sector`.

In [None]:
def add_dummies(df, catCols):
    df = pd.get_dummies(df, columns=catCols)
    #df.head()

    return df
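
What `pd.get_dummies` produces for a reduced `Sector` column can be seen on a toy frame with invented values.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sector": ["Tech", "Energy", "Other", "Tech"],
    "Revenue": [1.0, 2.0, 3.0, 4.0],
})

# One indicator column per category; the original 'Sector' column is removed.
encoded = pd.get_dummies(toy, columns=["Sector"])

print(encoded.columns.tolist())
```

Each row has exactly one active indicator, so the number of new columns equals the number of categories that remain after the cardinality reduction.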

# Part 2: Putting It All Together

In [None]:
def putting_all_together(): 
    
    # load dataset
    df = load_dataset()
    
    # class to categorical
    df = class_to_categorical(df)
    
    # check_duplicates_row(df)
    # check_duplicates_col(df)
    duplicated_columns = ['eBITperRevenue', 'eBTperEBIT', 'nIperEBT', 'Return on Tangible Assets', 
                     'ROIC', 'Payables Turnover', 'Inventory Turnover', 'Debt to Assets', 'Debt to Equity', 
                     'cashFlowCoverageRatios']
    
    # Remove duplicated columns
    df = remove_duplicates(df, duplicated_columns)


    # check correlation
    abs_corr_unstack = correlation(df)

    # Remove correlation
    df = remove_correlation(df, abs_corr_unstack)

    # adding indicators
    df = adding_indicators(df)

    # drop indicators that are obviously wrong (TODO: this step still has to be optimized)
    df = dropping_indicators(df)

    # Impute missing values (median for numerical, "Unknown" for categorical columns)
    numCols, catCols, df = handle_missing_data(df, 0.3)

    df_reduced = df
    df_reduced['Sector'] = reduce_cardinality(df['Sector'], 0.75)

    df_dummies = df_reduced[df_reduced.columns.difference(['Class', 'year', 'PRICE VAR [%]', 'Stock Name'])]
    df_dummies = add_dummies(df_dummies, ['Sector'])

    # NOTE: selecting by the original column names leaves out the Sector dummy columns created above.
    # NOTE: X still contains 'excess_return', which is computed from 'PRICE VAR [%]' and therefore leaks the target.
    X = df_dummies[df.columns.difference(['Class', 'Stock Name', 'Sector', 'year', 'PRICE VAR [%]'])]
    y = df['Class']

    return X, y

## Train-Test Split

In [None]:
X,y = putting_all_together()
y = y.astype('int')

## Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.3, random_state = 42) 

### Removing Outliers
X_train, X_test, y_train, y_test = remove_outliers(X_train, X_test, y_train, y_test)

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

# Part 3: Feature Selection


We implemented feature selection based on the following models:
- `ExtraTreesClassifier`
- `RandomForestClassifier`
- `XGBClassifier`

### ExtraTrees Feature Selection

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

def get_significant_features(X_train, X_test, y_train, n):
    # Feature selection using an Extra Trees classifier fitted on the training data
    model = ExtraTreesClassifier(random_state=42)
    model.fit(X_train, y_train)
    importances = model.feature_importances_
    # Standard deviation of the importances across the individual trees
    importances_std = np.std([tree.feature_importances_ for tree in
                                        model.estimators_],
                                        axis = 0)

    # Select the n features with the highest importance scores
    top_features = pd.Series(importances, index=X_train.columns).nlargest(n)

    # Subset X_train and X_test to the selected features
    X_train_selected = X_train[top_features.index]
    X_test_selected = X_test[top_features.index]

    return X_train_selected, X_test_selected, importances_std
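
The top-`n` selection can be tried on synthetic data in which only two features carry signal. The data, the feature names and the choice of `n = 2` are invented for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
y = (X["f0"] + X["f1"] > 0).astype(int)   # only f0 and f1 determine the label

model = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep the two features with the highest importance.
top_features = importances.nlargest(2).index.tolist()
X_selected = X[top_features]

print(importances.round(3).to_dict())
```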

### Random Forest Feature Selection

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

## Feature Selection using Random Forest for outside of pipeline
def random_forest_feature_selection(X_train, X_test, y_train, y_test, n):
    
    model = SelectFromModel(RandomForestClassifier(n_estimators = n))
    model.fit(X_train, y_train)
    
    list_train_rf= X_train.columns[(model.get_support())]
    list_test_rf= X_test.columns[(model.get_support())]

    X_train_rf = X_train[list_train_rf]
    X_test_rf = X_test[list_test_rf]
    
    return X_train_rf, X_test_rf


### XG-Boost Feature Selection

In [None]:
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

## Feature Selection using XGBoost for outside of pipeline
def xg_boost_feature_selection(X_train, X_test, y_train, y_test, n):
    
    params = {"objective": "binary:logistic", "random_state": 42}  # Class is a binary target
    model= xgb.XGBClassifier(**params)
    select_xgbc = SelectFromModel(estimator = model, threshold = "median")
    select_xgbc.fit(X_train, y_train)

    list_train_xgbc= X_train.columns[(select_xgbc.get_support())]
    list_test_xgbc= X_test.columns[(select_xgbc.get_support())]


    X_train_xgbc = X_train[list_train_xgbc]
    X_test_xgbc = X_test[list_test_xgbc]
    
    return X_train_xgbc, X_test_xgbc

# Part 4: Actual Machine Learning Models/Algos

TODO: The functions below were copied from earlier work and still need to be adapted.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

def print_results_crossvalidation(func, X_test, y_test):

  # Standard deviation of the CV score of the best parameter combination
  std_best_score = func.cv_results_["std_test_score"][func.best_index_]
  print(f"Best parameters: {func.best_params_}")
  print(f"Mean CV score: {func.best_score_}")
  print(f"Standard deviation of CV score: {std_best_score}")
  print(f"Test score: {func.score(X_test, y_test)}")

def report(y_true, y_pred):

  class_report = classification_report(y_true, y_pred)
  print(class_report)
  # Row-normalized confusion matrix: each row (true class) sums to 1
  conf_matrix = confusion_matrix(y_true, y_pred, normalize = "true")
  conf_matrix = pd.DataFrame(conf_matrix, ["Class 0", "Class 1"],  ["Class 0", "Class 1"])
  sns.heatmap(conf_matrix, annot = True).set(xlabel = "Assigned Class", ylabel = "True Class", title = "Confusion Matrix")
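
The effect of `normalize = "true"` can be checked on a small example with invented labels: each row of the matrix then holds the shares of one true class, so every row sums to 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, normalize="true")

print(cm)
```

Here two of the three true class-0 samples and four of the five true class-1 samples are classified correctly, which gives the diagonal entries 2/3 and 0.8.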
    

## Initialization of the pipeline and GridSearchCV

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

from imblearn.pipeline import Pipeline as imbpipeline

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
# Set up pipeline and GridSearchCV
scaler = StandardScaler()
mms = MinMaxScaler()

ros = RandomOverSampler(random_state = 42)
kFold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
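
The oversampler is placed inside the `imblearn` pipeline because samplers in such a pipeline are applied only when the pipeline is fitted, i.e. to the training folds, and never to the validation fold. What random oversampling does can be sketched with pandas alone. This is an illustration with invented data, not the implementation used by `RandomOverSampler`.

```python
import pandas as pd

# Invented, imbalanced training data: 7 rows of class 1 and 3 rows of class 0.
train = pd.DataFrame({"feature": range(10), "Class": [1] * 7 + [0] * 3})

majority_size = train["Class"].value_counts().max()

parts = []
for _, group in train.groupby("Class"):
    parts.append(group)
    n_extra = majority_size - len(group)
    if n_extra > 0:
        # Draw additional minority rows with replacement until the classes match.
        parts.append(group.sample(n=n_extra, replace=True, random_state=42))

balanced = pd.concat(parts, ignore_index=True)

print(balanced["Class"].value_counts().to_dict())
```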

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

A first, simple implementation using all features

In [None]:
# initialize logistic regression classifier
logistic = LogisticRegression(random_state=42, max_iter=20, n_jobs=-1)
print(f"Parameters of the logistic regression : {logistic.get_params().keys()}")


Pipeline and GridSearchCV

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["classifier", logistic]])
# Set up parameter grid with hyperparameters we want to tune
param_grid = {'ros': [ros, None], # upsampling or not
              'scaler': [scaler, None, mms], # scaling input by standardizing or min-max scaling or not scaling at all
              'classifier__C': [1, 6, 7, 8, 9, 10],
              # NOTE: the default lbfgs solver supports neither 'l1' nor 'elasticnet'; those fits fail and are scored as NaN
              'classifier__penalty': [None, 'l2', 'l1', 'elasticnet']}

# Conduct grid search with cross-validation to find hyperparameters that yield the best score
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
# Run the search on the training data; the best pipeline is then refit on the full training set
gs = gs.fit(X_train, y_train)


In [None]:
# Printing
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
report(y_test, y_pred)

###  Logistic Regression with Random Forest Feature Selection 

In [None]:
forest = RandomForestClassifier(random_state = 42)

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["feature_selection",  SelectFromModel(estimator = forest, threshold = "median")],
                          ["ros", ros], ["classifier", logistic]])

param_grid = {'ros': [ros, None],
              'scaler': [scaler, mms, None],
              'classifier__C': [1, 6, 7, 8, 9, 10],
              'classifier__penalty': [None, 'l2', 'l1', 'elasticnet']}

gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train, y_train)

In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
report(y_test, y_pred)

### Logistic Regression with XGBoost Feature Selection

In [None]:
xgbc = xgb.XGBClassifier(objective = "binary:logistic", random_state = 42)  # Class is a binary target

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["feature_selection",  SelectFromModel(estimator = xgbc, threshold = "median")],
                          ["ros", ros], ["classifier", logistic]])

param_grid = {'ros': [ros, None],
              'scaler': [scaler, mms, None],
              'classifier__C': [1, 6, 7, 8, 9, 10],
              'classifier__penalty': [None, 'l2', 'l1', 'elasticnet']}

gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train, y_train)

In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
report(y_test, y_pred)

### Logistic Regression with Kernel PCA Dimension Reduction

In [None]:
from sklearn.decomposition import KernelPCA

logistic = LogisticRegression(random_state=42, max_iter=20, n_jobs=-1)
kpca = KernelPCA(random_state = 42, eigen_solver = "arpack")

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["kpca",  kpca],
                          ["ros", ros], ["classifier", logistic]])

param_grid = {'kpca__n_components': np.arange(5, 10, 1),
              # the linear kernel corresponds to the standard PCA discussed in the lecture
              # only linear and sigmoid are considered: in earlier runs sigmoid usually performed better than poly and rbf
              "kpca__kernel": ["linear", "sigmoid"],
              'classifier__C': [1, 6, 7, 8, 9, 10]}
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold)
gs = gs.fit(X_train, y_train)

In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
# get test score, metrics report and confusion matrix
report(y_test, y_pred)

### Logistic Regression with Kernel PCA Finetuning

In [None]:
logistic = LogisticRegression(random_state=42, max_iter=20, n_jobs=-1, C = 9)
kpca = KernelPCA(random_state = 42, eigen_solver = "arpack", n_components = 6, kernel = "sigmoid")

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["kpca",  kpca], ["ros", ros], ["classifier", logistic]])                     

param_grid = {'ros': [ros, None],
              'scaler': [scaler, mms, None],
              'classifier__penalty': [None, 'l2', 'l1', 'elasticnet']}
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold)
gs = gs.fit(X_train, y_train)


In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
# get test score, metrics report and confusion matrix
report(y_test, y_pred)

### Feature Engineering with Logistic Regression

In [None]:
logistic = LogisticRegression(random_state=42, max_iter=20, n_jobs=-1) # initialize logistic regression classifier

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["classifier", logistic]])
# Set up parameter grid with hyperparameters we want to tune
param_grid = {'ros': [ros, None], # upsampling or not
              'scaler': [scaler, None, mms], # scaling input by standardizing or min-max scaling or not scaling at all
              'classifier__C': [1, 6, 7, 8, 9, 10],
              'classifier__penalty': [None, 'l2', 'l1', 'elasticnet']}
# Conduct grid search with cross-validation to find hyperparameters that yield the best score
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
# Run the search on the training data; the best pipeline is then refit on the full training set
gs = gs.fit(X_train_fe, y_train_fe)


In [None]:
print_results_crossvalidation(gs, X_test_fe, y_test_fe)
y_pred = gs.best_estimator_.predict(X_test_fe)
report(y_test_fe, y_pred)

# Random Forest

## Random Forest with all Features

In [None]:
forest = RandomForestClassifier(random_state = 42)
print(f"Parameters of the Random Forest: {forest.get_params().keys()}")

In [None]:
rfpipe = imbpipeline(steps=[["ros", ros], ["rf", forest]])

random_grid = {
    "rf__criterion": ["gini", "entropy"],
    "rf__max_features": ["sqrt", "log2"], # "auto" was an alias of "sqrt" for classifiers and has been removed from recent scikit-learn versions
    "rf__max_depth": np.array([None, 5, 10, 20]),
    "rf__min_samples_leaf":np.array([1, 2, 5]),
    "rf__min_samples_split": np.array([2, 5, 10]),
    "rf__n_estimators": np.array([50, 100, 200, 500]),
    "rf__class_weight": [None, "balanced", "balanced_subsample"]
}

rs = RandomizedSearchCV(estimator = rfpipe, param_distributions = random_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1, n_iter = 100, random_state = 42, error_score = "raise")
rs = rs.fit(X_train, y_train)


In [None]:
print_results_crossvalidation(rs, X_test, y_test)
y_pred = rs.best_estimator_.predict(X_test)
report(y_test, y_pred)

### Random Forest with All Features but also Finetuning

In [None]:
forest = RandomForestClassifier(random_state = 42, class_weight = "balanced_subsample", criterion = "entropy", max_features = "sqrt", min_samples_split = 10, min_samples_leaf = 5)


In [None]:
rfpipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["rf", forest]])

param_grid = {
    "scaler": [scaler, mms, None],
    "ros": [ros, None],
    "rf__max_depth": [None, 5, 10, 15, 20],
    "rf__n_estimators": np.array([150, 200, 250])
}

gs = GridSearchCV(estimator = rfpipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train, y_train)


In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
report(y_test, y_pred)

### Random Forest with Random Forest Feature Selection

In [None]:
# instead of the pipeline implementation, we use the prepared data sets X_train_rf and X_test_rf
forest = RandomForestClassifier(random_state = 42)

In [None]:
rfpipe = imbpipeline(steps=[["ros", ros], ["rf", forest]])

random_grid = {
    "rf__criterion": ["gini", "entropy"],
    "rf__max_features": ["sqrt", "log2"], # "auto" was an alias of "sqrt" for classifiers and has been removed from recent scikit-learn versions
    "rf__max_depth": np.array([None, 5, 10, 20]),
    "rf__min_samples_leaf":np.array([1, 2, 5]),
    "rf__min_samples_split": np.array([2, 5, 10]),
    "rf__n_estimators": np.array([ 100, 200, 500]),
    "rf__class_weight": [None, "balanced", "balanced_subsample"]
}

rs = RandomizedSearchCV(estimator = rfpipe, param_distributions = random_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1, n_iter = 100, random_state = 42, error_score = "raise")
rs = rs.fit(X_train_rf, y_train)

In [None]:
print_results_crossvalidation(rs, X_test_rf, y_test)
y_pred = rs.best_estimator_.predict(X_test_rf)
report(y_test, y_pred)

### Random Forest with XGBoost Feature Selection

In [None]:
# instead of the pipeline implementation, we use the prepared data sets X_train_xgbc and X_test_xgbc
forest = RandomForestClassifier(random_state = 42)

In [None]:
rfpipe = imbpipeline(steps=[["ros", ros], ["rf", forest]])

random_grid = {
    "rf__criterion": ["gini", "entropy"],
    "rf__max_features": ["sqrt", "log2"], # "auto" was an alias of "sqrt" for classifiers and has been removed from recent scikit-learn versions
    "rf__max_depth": np.array([None, 5, 10, 20]),
    "rf__min_samples_leaf":np.array([1, 2, 5]),
    "rf__min_samples_split": np.array([2, 5, 10]),
    "rf__n_estimators": np.array([ 100, 200]),
    "rf__class_weight": [None, "balanced", "balanced_subsample"]
}

rs = RandomizedSearchCV(estimator = rfpipe, param_distributions = random_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1, n_iter = 100, random_state = 42, error_score = "raise")
rs = rs.fit(X_train_xgbc, y_train)

In [None]:
print_results_crossvalidation(rs, X_test_xgbc, y_test)
y_pred = rs.best_estimator_.predict(X_test_xgbc)
report(y_test, y_pred)

### Random Forest with Kernel PCA: Broad Hyperparameter Tuning

In [None]:
forest = RandomForestClassifier(random_state = 42, max_features = "sqrt")
kpca = KernelPCA(random_state = 42)


In [None]:
rfpipe = imbpipeline(steps=[["scaler", scaler],["kpca", kpca], ["ros", ros], ["rf", forest]])

random_grid = {
    "kpca__n_components": np.arange(4, 10, 1),
    "kpca__kernel": ["linear", "sigmoid"],
    "kpca__gamma": np.linspace(0.005, 0.01, 5),
    "kpca__coef0": np.linspace(0.8, 1.2, 5),
    "rf__criterion": ["gini", "entropy"],
    "rf__max_depth": np.array([None, 2, 5, 10, 20]),
    "rf__n_estimators": np.array([ 100, 200, 500]),
}

rs = RandomizedSearchCV(estimator = rfpipe, param_distributions = random_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1, n_iter = 100, random_state = 42, error_score = "raise")
rs = rs.fit(X_train, y_train)

In [None]:
print_results_crossvalidation(rs, X_test, y_test)
y_pred = rs.best_estimator_.predict(X_test)
report(y_test, y_pred)
     

### Random Forest with Kernel PCA: Hyperparameter Fine-Tuning

In [None]:
forest_ft = RandomForestClassifier(random_state = 42, class_weight = "balanced_subsample", criterion = "gini", 
                                   max_features = "sqrt", min_samples_leaf= 10, min_samples_split = 5)
kpca_rf_ft = KernelPCA(random_state = 42, kernel = "sigmoid", n_components = 5, gamma = 0.01, coef0 = 0.9)





In [None]:
rfpipe = imbpipeline(steps=[["scaler", scaler],["kpca", kpca_rf_ft], ["ros", ros], ["rf", forest_ft]])

param_grid = {
    "scaler": [scaler, mms, None],
    "ros": [ros, None],
    "rf__max_depth": np.array([15, 20, 25]),
    "rf__n_estimators": np.array([100, 300, 500]),
}

gs = GridSearchCV(estimator = rfpipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train, y_train)

In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
report(y_test, y_pred)

### Random Forest with Feature Engineering

Here the random forest is trained on all features.

In [None]:
forest = RandomForestClassifier(random_state = 42)

In [None]:
rfpipe = imbpipeline(steps=[["ros", ros], ["rf", forest]]) # we saw that random forest performed better when not scaling the input

random_grid = {
    "rf__criterion": ["gini", "entropy"],
    "rf__max_features": ["sqrt", "log2"],  # "auto" was removed in scikit-learn 1.3; for classifiers it was equivalent to "sqrt"
    "rf__max_depth": np.array([None, 5, 10, 20]),
    "rf__min_samples_leaf":np.array([1, 2, 5]),
    "rf__min_samples_split": np.array([2, 5, 10]),
    "rf__n_estimators": np.array([50, 100, 200]),
    "rf__class_weight": [None, "balanced", "balanced_subsample"]
}

rs = RandomizedSearchCV(estimator = rfpipe, param_distributions = random_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1, n_iter = 100, random_state = 42, error_score = "raise")
rs = rs.fit(X_train, y_train)


In [None]:
print_results_crossvalidation(rs, X_test, y_test)
y_pred = rs.best_estimator_.predict(X_test)
report(y_test, y_pred)

### Random Forest with Feature Engineering: Fine-Tuning

In [None]:
forest = RandomForestClassifier(random_state = 42, class_weight = "balanced_subsample", criterion = "gini", max_features = "sqrt", min_samples_split = 10, min_samples_leaf = 5)

In [None]:
rfpipe = imbpipeline(steps=[["ros", ros], ["rf", forest]])

param_grid = {
    "rf__max_depth": [None, 20],
    "rf__n_estimators": np.array([150, 200, 250])
}

gs = GridSearchCV(estimator = rfpipe, param_grid = param_grid, scoring = "f1_weighted", cv = kFold, n_jobs = -1)
gs = gs.fit(X_train, y_train)

In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)
report(y_test, y_pred)

## Support Vector Machines

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import svm
from sklearn.svm import SVC
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
X,y = putting_all_together()
y = y.astype('int')

## Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.3, random_state = 42) 

### Removing Outliers
X_train, X_test, y_train, y_test = remove_outliers(X_train, X_test, y_train, y_test)

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# The SVC Class from Sklearn
svm1= svm.SVC(
        C=1.0,                          # The regularization parameter
        kernel='rbf',                   # The kernel type used 
        degree=3,                       # Degree of polynomial function 
        gamma='scale',                  # The kernel coefficient
        coef0=0.0,                      # If kernel = 'poly'/'sigmoid'
        shrinking=True,                 # To use shrinking heuristic
        probability=False,              # Enable probability estimates
        tol=0.001,                      # Stopping criterion
        cache_size=200,                 # Size of kernel cache
        class_weight=None,              # The weight of each class
        verbose=False,                  # Enable verbose output
        max_iter= -1,                   # Hard limit on iterations
        decision_function_shape='ovr',  # One-vs-rest or one-vs-one
        break_ties=False,               # How to handle breaking ties
        random_state=42               # Random state of the model
    )

print(f"Parameters of the Support Vector Machine: {svm1.get_params().keys()}")


# Building and training our model
clf = svm1.fit(X_train, y_train)

# Making predictions with our data
predictions = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", accuracy_score(y_test, predictions))

# Model precision: of the samples predicted as a class, what fraction truly belongs to it?
# average='weighted' computes the metric per class and averages it, weighted by class support
print("Precision:", precision_score(y_true=y_test, y_pred=predictions, average='weighted'))

# Model recall: of the samples that truly belong to a class, what fraction is predicted as such?
print("Recall:", recall_score(y_test, predictions, average='weighted'))

#Whole classification report
print(classification_report(y_test, predictions))
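
The `average='weighted'` option used above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming hypothetical toy labels that are not taken from the stock data set; the helper names `per_class_precision_recall` and `weighted_average` are illustrative and not part of scikit-learn.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall, support)} for every label in y_true."""
    support = Counter(y_true)
    result = {}
    for label in sorted(support):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support[label]
        result[label] = (precision, recall, support[label])
    return result


def weighted_average(per_class, index):
    """Average one metric over the classes, weighting each class by its support."""
    total = sum(values[2] for values in per_class.values())
    return sum(values[index] * values[2] for values in per_class.values()) / total


# Hypothetical toy labels: six samples of class 0, four of class 1
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

per_class = per_class_precision_recall(y_true, y_pred)
weighted_precision = weighted_average(per_class, 0)
weighted_recall = weighted_average(per_class, 1)

print(per_class)
print("Weighted precision:", round(weighted_precision, 4))
print("Weighted recall:", round(weighted_recall, 4))
```

Because each class is weighted by its number of true samples, weighted recall always equals plain accuracy. Weighted precision can differ from it, and the majority class dominates both numbers.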

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold

scaler = StandardScaler()
mms = MinMaxScaler()

ros = RandomOverSampler(random_state = 42)
kFold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
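
As a rough illustration of what random oversampling does conceptually, the sketch below duplicates randomly chosen minority-class rows until every class is as large as the largest one. It is a plain-Python sketch on hypothetical toy data, not the imblearn implementation; `random_oversample` is an illustrative helper name.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: four samples of class 0, one of class 1
X_toy = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y_toy = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_res))
```

Inside an imblearn pipeline the sampler is applied only during `fit`, so the validation folds and the test set keep their original class distribution.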

In [None]:
svm_pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["SVM", svm1]])
param_grid = {
    'ros': [ros, None], 
    'scaler': [scaler, mms],
    "SVM__kernel": ["linear", "sigmoid", "rbf"],
    "SVM__C": [1, 5, 10, 50],
    "SVM__gamma": ["auto", "scale"]
}
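
Before starting an exhaustive grid search it helps to estimate its cost. The sketch below counts the candidates in the grid above, using placeholder strings instead of the estimator objects and assuming the five folds defined by `kFold`.

```python
from itertools import product

# Same shape as param_grid above; the strings "ros", "scaler" and "mms"
# are placeholders for the actual estimator objects
grid = {
    "ros": ["ros", None],
    "scaler": ["scaler", "mms"],
    "SVM__kernel": ["linear", "sigmoid", "rbf"],
    "SVM__C": [1, 5, 10, 50],
    "SVM__gamma": ["auto", "scale"],
}

candidates = list(product(*grid.values()))
n_candidates = len(candidates)
n_splits = 5  # assumption: matches StratifiedKFold(n_splits = 5)
n_fits = n_candidates * n_splits

# Candidates with a linear kernel; gamma has no effect on this kernel
kernel_index = list(grid).index("SVM__kernel")
n_linear = sum(1 for c in candidates if c[kernel_index] == "linear")

print(f"{n_candidates} candidates x {n_splits} folds = {n_fits} fits")
print(f"{n_linear} candidates use the linear kernel")
```

Since `gamma` is ignored by the linear kernel, half of the linear-kernel candidates repeat work that another candidate already covers. With `refit = True`, one additional fit on the full training set follows the search.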

In [None]:
gs = GridSearchCV(estimator = svm_pipe, param_grid = param_grid, scoring = "f1_weighted",
                  cv = kFold, n_jobs = -1, refit = True, verbose = 5)

In [None]:
%time gs.fit(X_train, y_train)  # fit once; GridSearchCV.fit updates gs in place

In [None]:
print_results_crossvalidation(gs, X_test, y_test)
y_pred = gs.best_estimator_.predict(X_test)

report(y_test, y_pred)

In [None]:
# # Plotting a Bar Graph to compare the models
# plt.bar(X_train.columns, importances)
# plt.xlabel('Feature Labels')
# plt.ylabel('Feature Importances')
# plt.title('Comparison of different Feature Importances')
# plt.show()

In [None]:
# Create StandardScaler object
sc = StandardScaler()

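# Standardize features: fit_transform on the training set gives the same result as fit followed by transform;
# the test set is only transformed, using the training-set mean and standard deviation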
# X_train_std = sc.fit_transform(X_train_selected)
# X_test_std = sc.transform(X_test_selected)
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
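
The reason for calling `fit_transform` on the training set and only `transform` on the test set can be seen in a small plain-Python sketch. It assumes a single hypothetical feature column; `fit_standardizer` and `standardize` are illustrative helpers, not scikit-learn functions.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn mean and population standard deviation from the training column."""
    mean = fmean(column)
    std = pstdev(column)
    return mean, (std if std > 0 else 1.0)  # avoid division by zero for constant features


def standardize(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(value - mean) / std for value in column]


# Hypothetical values of one feature
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_standardizer(train)           # statistics come from the training data only
train_std = standardize(train, mean, std)
test_std = standardize(test, mean, std)       # the test data reuses them

print("train:", [round(v, 3) for v in train_std])
print("test: ", [round(v, 3) for v in test_std])
```

The standardized training column has mean 0 and standard deviation 1, while the test column generally does not. That is intended: estimating the statistics on the test set would leak information about it into the preprocessing.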

In [None]:
# # Grid Search for getting optimal C and gamma
# gamma_range = np.outer(np.logspace(-3, 0, 4),np.array([1,5]))
# gamma_range = gamma_range.flatten()
# print(gamma_range)

# C_range = np.outer(np.logspace(-1, 1, 3),np.array([1,5]))
# C_range = C_range.flatten()
# print(C_range)

# parameters = {'kernel':['linear', 'rbf'], 'C':C_range, 'gamma': gamma_range}

# svm = SVC()
# grid = RandomizedSearchCV(estimator=svm, param_distributions=parameters, n_iter=5, n_jobs=-1, verbose=2)
# grid.fit(X_train_std, y_train)

# print('Best CV accuracy: {:.2f}'.format(grid.best_score_))
# print('Test score:       {:.2f}'.format(grid.score(X_test_std, y_test)))
# print('Best parameters: {}'.format(grid.best_params_))

In [None]:
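from sklearn.metrics import classification_report, confusion_matrix

# Baseline RBF-kernel SVM on the standardized features.
# Named svm_rbf so that it does not shadow the imported sklearn.svm module.
svm_rbf = SVC(kernel='rbf', C=1.0)
svm_rbf.fit(X_train_std, y_train)

# Predict classes and print results
y_pred = svm_rbf.predict(X_test_std)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Test accuracy: {:.2f}".format(svm_rbf.score(X_test_std, y_test)))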
