# Data Science Project Spring 2023

## 200+ Financial Indicators of US stocks (2014-2018)

### Yiwei Gong, Janice Herman, Alexander Morawietz and Selina Waber

University of Zurich, Spring 2023

## Importing Packages

In [None]:
import os 
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from pandas_datareader import data

from sklearn.impute import KNNImputer
from sklearn.ensemble import IsolationForest

from sklearn.model_selection import train_test_split

from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

from imblearn.pipeline import Pipeline as imbpipeline
from scipy.stats import loguniform

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
import xgboost as xgb

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

from sklearn.decomposition import PCA


from sklearn.naive_bayes import GaussianNB



## Our Goal

The efficient market hypothesis states that, if asset prices reflect all available information, it should not be possible to outperform the market in the long run ("beat the market"). But does this hypothesis really hold? Based on the 10-K filings, we try to predict the market by applying different machine learning algorithms. We apply Naive Bayes, Random Forest, Logistic Regression and an SVM classifier and try to answer the following questions:
- Which algorithms yield the best prediction scores?
- Can the prediction be improved by applying feature selection algorithms?
- Which features are most relevant for predicting the stock market?

Finally, we build a feedforward neural network. However, we could not integrate it into our general pipeline and therefore leave it as an addendum.


### Data set

We used the data set by Nicolas Carbone from [Kaggle](https://www.kaggle.com/datasets/cnic92/200-financial-indicators-of-us-stocks-20142018). Each yearly dataset contains over 200 financial indicators that are found in the [10-K filings](https://www.investopedia.com/terms/1/10-k.asp#:~:text=Key%20Takeaways-,A%2010%2DK%20is%20a%20comprehensive%20report%20filed%20annually%20by,detailed%20than%20the%20annual%20report.) of publicly traded US companies for the years 2014 to 2018.


### Our Approach

Overall, we agreed to define and use classes and functions as much as possible. Classes can make the code cleaner and easier to manage and debug, particularly for complex data processing tasks.

We have defined the following classes:
- Dataset
- TrainModel

They will be explained in more detail below. 

Furthermore, we defined the following functions outside our classes, specifically for feature selection:
- get_significant_features
- random_forest_feature_selection
- xg_boost_feature_selection
- pca_feature_selection 




## Part 1: The class Dataset

We define a class `Dataset` to handle all stages of pre-processing and exploratory data analysis (EDA) for our financial dataset. The steps performed by this class can be grouped into several categories:

1) Data loading and merging: The `load_dataset` function loads multiple yearly datasets from csv files located in the specified path, and concatenates them into one large pandas DataFrame. The `load_small_dataset` function does a similar operation but for a single file.

2) Data cleaning: This includes removing duplicated columns (`remove_duplicates`), handling missing data (`handle_missing_data`), and removing outliers (`remove_outliers`). Outlier removal uses the Isolation Forest algorithm, which is fitted on the training data only.

3) Feature Engineering: The class has several methods (`adding_indicators`, `dropping_indicators`, `remove_correlation`) for generating new features (such as `cashflow_margin` and `RONA`), and removing highly correlated features.

4) Data transformation: The class provides functionality to convert categorical variables to dummy variables (`add_dummies`), and reduce the cardinality of categorical variables (`reduce_cardinality`).

5) Dataset splitting: The `split_dataset` function splits the dataset into training and testing sets using stratified sampling.

6) EDA: The `do_eda` function integrates various functions to perform a complete exploratory data analysis, including removing duplicates, removing correlated features, feature engineering, dropping certain indicators, handling missing data, and converting categorical variables into dummy variables. It also provides an option to reduce the cardinality of certain columns.

In [None]:
class Dataset():
    def __init__(self, path, years):
        # A list of years loads and concatenates several files; a single int loads one file
        if isinstance(years, list):
            self.data = self.load_dataset(path, years)
        elif isinstance(years, int):
            self.data = self.load_small_dataset(path, years)
        else:
            raise TypeError("`years` must be a list of ints or a single int")
        self.X = None
        self.y = None
        # Same attribute names as the ones set later in split_dataset
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None

    def load_dataset(self, path, years):
        ## Load the yearly datasets into the list dfs
        dfs = []
        for year in years:
            dfs.append(self.load_small_dataset(path, year))

        df = pd.concat(dfs, ignore_index=True) ## concatenate the yearly dataframes
        return df

    def load_small_dataset(self, path, year):
        ## Load the file of a single year from the directory `path`
        df = pd.read_csv(os.path.join(path, f'{year}_Financial_Data.csv'), sep=',')
        df['year'] = str(year) ## append a column with the respective year
        ## Use the same target name in every df, e.g. '2016 PRICE VAR [%]' becomes 'PRICE VAR [%]'
        df['PRICE VAR [%]'] = df[f'{year + 1} PRICE VAR [%]']
        df = df.drop(columns=[f'{year + 1} PRICE VAR [%]']) # drop the year-specific column
        df = df.rename(columns={df.columns[0]: 'Stock Name'}) # name the first column
        return df
    
    def class_to_categorical(self):
        self.data['Class'] = self.data.Class.astype('object') ## convert 'Class' to categorical data as an object.

    def remove_duplicates(self, columns):
        self.data=self.data.drop(columns=columns)
    
    def adding_indicators(self):
        df = self.data
        ## Yearly Means of S&P 500
        sp500_means = pd.Series([11.39, -0.73, 9.54, 19.42, -6.24], index = [2014, 2015, 2016, 2017, 2018]) 
        
        # Yearly inflation rate measured by the consumer price index
        inflation = pd.Series([1.62, 0.12, 1.26, 2.13, 2.44], index = [2014, 2015, 2016, 2017, 2018]) 
        
        ##Adding to the dataframe
        df["inflation"] = df.apply(lambda x: inflation[int(x["year"])], axis=1)
        df["sp500_means"] = df.apply(lambda x: sp500_means[int(x["year"])], axis=1)
       
        ## Calculation of Cashflow Margin:
        df["cashflow_margin"] = df["Operating Cash Flow"].divide(df["Revenue"])
        # Division by zero yields +/-inf in pandas (no ZeroDivisionError); replace infinities by NaN
        df["cashflow_margin"] = df["cashflow_margin"].replace([np.inf, -np.inf], np.nan)
        
        ## Calculation of Return on Net Assets (RONA)
        df["Net_working_capital"] = df["Total assets"]-df["Cash and cash equivalents"]
        df["RONA"] = df["EBIT"]/df["Net_working_capital"]
        df["RONA"] = df["RONA"].replace([np.inf, -np.inf], np.nan)
        self.data = df
    
    def dropping_indicators(self):
        self.data = self.data.drop(["operatingProfitMargin"], axis = 1) # consisting only of the value 1.

    def remove_correlation(self):
        X = self.data[self.data.columns.difference(['Class', 'Stock Name', 'Sector', 'year', 'PRICE VAR [%]'])]
        y = self.data["Class"]

        abs_corr = X.corr().abs()
        for i in range(len(abs_corr)):
            abs_corr.iloc[i, i] = 0

        abs_corr_unstack = abs_corr.unstack()
        columns_to_drop = []
        columns_to_remain = []

        for pair in abs_corr_unstack.index.values:
            if abs_corr_unstack[pair] > 0.99:
                if pair[0] not in columns_to_remain and pair[1] not in columns_to_remain:
                        columns_to_remain.append(pair[0])
                        if pair[1] not in columns_to_drop:
                            columns_to_drop.append(pair[1])
                elif pair[0] in columns_to_remain:
                    if pair[1] not in columns_to_drop:
                        columns_to_drop.append(pair[1])
                elif pair[1] in columns_to_remain:
                    if pair[0] not in columns_to_drop:
                        columns_to_drop.append(pair[0])

        self.data = self.data.drop(columns=columns_to_drop)
    
    def remove_outliers(self):
        # Fit the Isolation Forest on the training data only
        outliers = IsolationForest(random_state = 42).fit(self.X_train)
        # predict returns 1 for inliers and -1 for outliers
        inlier_train = outliers.predict(self.X_train) == 1
        inlier_test = outliers.predict(self.X_test) == 1

        # Keep only the inliers in both sets
        self.X_train = self.X_train[inlier_train]
        self.y_train = self.y_train[inlier_train]
        self.X_test = self.X_test[inlier_test]
        self.y_test = self.y_test[inlier_test]

    def handle_missing_data(self, threshold=0.3, imputer="median"):
        df = self.data
        # Calculate the fraction of missing values in each column
        missing_values_percentage = df.isna().sum() / df.shape[0]
        # Drop the columns whose fraction of missing values exceeds the threshold
        columns_to_drop = missing_values_percentage[missing_values_percentage > threshold].index
        df = df.drop(columns=columns_to_drop)

        # Fill missing categorical values with a placeholder, whichever imputer is chosen
        catCols = df.select_dtypes(exclude=np.number).columns
        df[catCols] = df[catCols].fillna("Unknown")

        if imputer == "median":
            numCols = df.select_dtypes(include=['float64', 'int64']).columns
            df[numCols] = df[numCols].fillna(df[numCols].median())
        elif imputer == "KNN":
            non_features = ["PRICE VAR [%]", "Class", "year", "Sector", "Stock Name"]
            X = df.drop(columns=non_features)
            knn = KNNImputer(n_neighbors = 5, weights = "distance")
            # Keep index and column names so that the assignment below aligns row by row
            X = pd.DataFrame(knn.fit_transform(X), columns=X.columns, index=X.index)
            df[X.columns] = X

        self.data = df
    
    def reduce_cardinality(self, column, threshold):
        threshold_value = int(threshold * len(column))
        categories_list = []
        s = 0
        counts = []
        
        # Count the frequencies of unique values in the column
        for value in column:
            # Check if the value is already in the counts list
            index = next((i for i, x in enumerate(counts) if x[0] == value), None)
            if index is not None:
                counts[index] = (value, counts[index][1] + 1)
            else:
                counts.append((value, 1))
        
        # Sort the list of tuples based on count in descending order
        counts.sort(key=lambda x: x[1], reverse=True)
        
        # Loop through the tuples (value, count)
        for i, j in counts:
            # Add the frequency to the global sum
            s += j
            # Append the category name to the list
            categories_list.append(i)
            # Check if the global sum has reached the threshold value, if so break the loop
            if s >= threshold_value:
                break
        
        # Append the category 'Other' to the list
        categories_list.append('Other')
        print(categories_list)
        new_column = column.apply(lambda x: x if x in categories_list else 'Other')
        self.data['Sector'] = new_column

    def add_dummies(self, catCols):
        df_dummies = pd.get_dummies(self.data, columns=catCols)
        self.data = df_dummies
    
    def do_eda(self, remove_duplicates=True, feature_eng=True, remove_corr=True, missing_thresh=0.3, reduce_col=['Sector'], reduce_thresh=0.75, imputer = "median"):
        self.class_to_categorical()
        
        if remove_duplicates:
            duplicated_columns = ['eBITperRevenue', 'eBTperEBIT', 'nIperEBT', 'Return on Tangible Assets', 
                     'ROIC', 'Payables Turnover', 'Inventory Turnover', 'Debt to Assets', 'Debt to Equity', 
                     'cashFlowCoverageRatios']
            self.remove_duplicates(duplicated_columns)
        
        if remove_corr:
            self.remove_correlation()
        
        if feature_eng:
            self.adding_indicators()
        
        self.dropping_indicators()
        self.handle_missing_data(missing_thresh, imputer)
        #self.reduce_cardinality(self.data['Sector'], reduce_thresh)
        self.add_dummies(reduce_col)
        
    def split_dataset(self, remove_outliers=True):
        self.y = self.data['Class'].astype('int')
        self.X = self.data[self.data.columns.difference(['Class', 'Stock Name', 'Sector', 'year', 'PRICE VAR [%]'])]

        ## Train-Test Split
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, stratify = self.y, test_size = 0.3, random_state = 42) 

        if remove_outliers:
            self.remove_outliers()


## Explanation of the variables: 


### Adding `year` as a categorical variable

After loading the data set, we add a column named `year` which is used for the exploratory data analysis. However, for the training of our models the variable year is dropped again.

### Handling the variable `PRICE VAR [%]`

The data set contains a column `PRICE VAR [%]`, which lists the percentage price variation of each stock in the year after the 10-K filings were published. Since this variable carries the specific year in its name in the original data set, we had to rename it, e.g. `2016 PRICE VAR [%]` to `PRICE VAR [%]`.
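
The renaming can be illustrated on toy data. This is a minimal sketch, not the project's loading code: the two small tables and their values are made up, and only the column naming pattern is taken from the data set.

```python
import pandas as pd

# Two toy yearly tables; only the naming pattern of the target column is realistic.
df_2014 = pd.DataFrame({"Revenue": [10.0, 20.0], "2015 PRICE VAR [%]": [5.0, -3.0]})
df_2015 = pd.DataFrame({"Revenue": [12.0, 18.0], "2016 PRICE VAR [%]": [-1.0, 7.5]})

frames = []
for year, frame in [(2014, df_2014), (2015, df_2015)]:
    # The target refers to the year after the filing, hence `year + 1`.
    frame = frame.rename(columns={f"{year + 1} PRICE VAR [%]": "PRICE VAR [%]"})
    frame["year"] = str(year)
    frames.append(frame)

# With a common column name the yearly tables can be stacked.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Without the common name, `pd.concat` would create one target column per year and fill the other years with missing values.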

### The variable `Class`

Our machine learning algorithms are trained to classify whether the stock price falls or rises. This information is stored in the categorical variable `Class`, which reflects the sign of `PRICE VAR [%]`. If `Class` is 1, the stock gained value in the year after the corresponding 10-K filings were published. A hypothetical trader should have bought this stock. Conversely, a value of 0 indicates that the stock lost value and that a hypothetical trader should not have bought it.

The two variables `PRICE VAR [%]` and `Class` would allow both regression and classification tasks. However, since we cannot quantify the uncertainty of a `PRICE VAR [%]` prediction, the hypothetical trader would not gain any information from regression compared to classification. Therefore, we restrict ourselves to classification algorithms.


### Duplicates

During data cleaning, we detected several duplicated columns but no duplicated rows. The duplicates are the following pairs:

- `ebitperRevenue` and `eBITperRevenue`
- `ebtperEBIT` and `eBTperEBIT`
- `niperEBT` and `nIperEBT`
- `returnOnAssets` and `Return on Tangible Assets`
- `returnOnCapitalEmployed` and `ROIC`
- `payablesTurnover` and `Payables Turnover`
- `inventoryTurnover` and `Inventory Turnover`
- `debtRatio` and `Debt to Assets`
- `debtEquityRatio` and `Debt to Equity`
- `cashFlowToDebtRatio` and `cashFlowCoverageRatios`
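
One way to find such pairs is to compare the columns value by value. The sketch below is an assumption about how this could be done, not the procedure we actually used; `find_duplicate_columns` is a hypothetical helper and the toy values are made up. It only detects exact duplicates.

```python
import pandas as pd

toy = pd.DataFrame({
    "ebitperRevenue": [0.1, 0.2, 0.3],
    "eBITperRevenue": [0.1, 0.2, 0.3],   # same values under a different name
    "debtRatio": [0.5, 0.4, 0.6],
})

def find_duplicate_columns(df):
    """Return (first, second) pairs of columns whose values are identical."""
    pairs = []
    cols = list(df.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Series.equals compares the values (NaNs in the same place count as equal)
            if df[first].equals(df[second]):
                pairs.append((first, second))
    return pairs

duplicate_pairs = find_duplicate_columns(toy)
print(duplicate_pairs)
```

The second column of each pair can then be passed to `drop(columns=...)`, as `remove_duplicates` does.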

### Outliers cleaning and Missing values

Furthermore, we noticed many missing values in our data set. We decided to drop columns with more than 30% missing values and to impute the remaining ones; `handle_missing_data` supports a median imputer (the default) and a KNN imputer. In addition to the missing data, we found many outliers, which may reduce the quality of the data. However, for financial indicators it is quite possible that strong outliers reflect true values, so the handling of outliers is not obvious. We decided to use an Isolation Forest algorithm to remove the outliers.
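
The column-dropping and imputation steps can be sketched on toy data. This is a minimal sketch with made-up column names and values; it shows median imputation only and leaves out the outlier removal.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],   # 75% missing, will be dropped
    "some_missing": [1.0, np.nan, 3.0, 5.0],           # 25% missing, will be imputed
    "complete": [2.0, 4.0, 6.0, 8.0],
})

threshold = 0.3
missing_share = toy.isna().mean()                 # fraction of NaN per column
kept = toy.loc[:, missing_share <= threshold]     # drop columns above the threshold
imputed = kept.fillna(kept.median())              # column-wise median imputation
print(imputed)
```

The median is less sensitive to extreme values than the mean, which matters here because the indicators contain strong outliers.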

### Handling unique values and cardinality

Our data set contains one categorical variable called `Sector`, which indicates the field in which the company operates. Some values of `Sector` are very sparsely populated. To prevent the creation of unnecessary dummy variables, we had to reduce the cardinality. Cardinality is described as follows: "Each categorical variable consists of unique values. A categorical feature is said to possess high cardinality when there are too many of these unique values. One-Hot Encoding becomes a big problem in such a case since we have a separate column for each unique value (indicating its presence or absence) in the categorical variable. This leads to two problems, one is obviously space consumption, but this is not as big a problem as the second problem, the curse of dimensionality" ([source](https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b)).
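
The idea behind the cardinality reduction can be sketched with `value_counts`. This is a compact sketch on made-up sector labels, not the `reduce_cardinality` method itself: it keeps the most frequent categories until they cover the given share of the rows and maps everything else to `Other`.

```python
import pandas as pd

# Toy sector column: two frequent and two rare categories.
sector = pd.Series(["Tech"] * 5 + ["Energy"] * 3 + ["Utilities"] + ["Real Estate"])

def reduce_cardinality(column, threshold):
    """Keep the most frequent categories until they cover `threshold` of the rows."""
    shares = column.value_counts(normalize=True)   # sorted by frequency, descending
    covered_before = shares.cumsum() - shares      # share covered before each category
    keep = shares.index[covered_before < threshold]
    return column.where(column.isin(keep), "Other")

reduced = reduce_cardinality(sector, threshold=0.75)
print(reduced.value_counts())
```

With a threshold of 0.75, the two rare sectors are merged, so one-hot encoding creates three dummy columns instead of four.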

## Part 2: Feature Selection Functions


We implemented the following feature selection functions:
1) Extra Trees
2) Random Forest
3) XGBoost
4) PCA


### Why Feature Selection?

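Feature selection is the process of selecting a subset of relevant features (variables or predictors). It can have a strong impact on the performance of a model. The main benefits of feature selection include:
- Simpler models: fewer features make a model easier to interpret.
- Less overfitting: removing irrelevant features reduces the chance of fitting noise.
- Faster training: fewer features mean less computation.
- Improved accuracy: removing misleading features can improve predictions, although this is not guaranteed.
- Reduced noise: uninformative features are excluded.
- Less multicollinearity: dropping redundant, highly correlated features makes coefficient estimates more stable.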

### Difference between Random Forest and XGBoost

- Random Forest measures feature importance based on the average decrease in impurity (Gini or Entropy based on the chosen criterion) that results from splits over each feature. This is averaged over all trees in the forest. The random forest algorithm uses bagging (bootstrap aggregating), where different subsets of the data are used to create each decision tree. As a result, it tends to give a more robust estimate of feature importance which is less likely to be influenced by noise or outliers.

- XGBoost, on the other hand, can calculate feature importance in several ways: the number of times a feature is used to split across all trees in the model (weight), the average gain of the splits that use the feature (gain), or the average coverage of those splits, roughly the number of observations they affect (cover). XGBoost uses a concept called boosting, where each new tree is created to correct the errors made by the existing set of trees. As a result, it can model complex relationships and may assign higher importance to features that appear in complex interactions.
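
To make the "decrease in impurity" concrete, the following minimal sketch computes the Gini impurity decrease of a single split. It is an illustration in plain Python, not the scikit-learn implementation; the function names are hypothetical. A Random Forest accumulates this quantity per feature over all splits and trees, additionally weighting each node by the share of samples that reach it.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 0, 1, 1, 1]
# A split that separates the classes perfectly removes all impurity of the parent.
good_split = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])
# A split whose children have the same class mix as the parent removes none.
poor_split = impurity_decrease(parent, [0, 1], [0, 0, 1, 1])
print(good_split, poor_split)
```

A feature that is repeatedly used for splits like the first one receives a high importance, whereas a feature that only produces splits like the second one receives an importance close to zero.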






It's also important to remember that feature importance doesn't necessarily imply causality; a feature may be important in the context of a particular model without being a causal factor for the outcome being predicted.

### 1) ExtraTrees Feature Selection

In [None]:
def get_significant_features(X_train, X_test, y_train, n):
    # Rank the features with an Extra Trees classifier fitted on the training data
    model = ExtraTreesClassifier(random_state=42)
    model.fit(X_train, y_train)
    importances = model.feature_importances_
    # Standard deviation of the importances across the trees (a measure of spread)
    importances_std = np.std([tree.feature_importances_ for tree in
                                        model.estimators_],
                                        axis = 0)

    # Select the n features with the highest importance scores
    top_features = pd.Series(importances, index=X_train.columns).nlargest(n)

    # Subset X_train and X_test to the selected features
    X_train_selected = X_train[top_features.index]
    X_test_selected = X_test[top_features.index]

    return X_train_selected, X_test_selected, importances_std

### 2) Random Forest Feature Selection

In [None]:
def random_forest_feature_selection(X_train, X_test, y_train, y_test, n):
    # n is the number of trees in the forest. With its default threshold, SelectFromModel
    # keeps the features whose importance is at least the mean importance.
    model = SelectFromModel(RandomForestClassifier(n_estimators = n, random_state = 42))
    model.fit(X_train, y_train)

    # Apply the same selection to both sets so that their columns match
    selected = X_train.columns[model.get_support()]
    X_train_rf = X_train[selected]
    X_test_rf = X_test[selected]

    return X_train_rf, X_test_rf


### 3) XG-Boost Feature Selection

In [None]:
def xg_boost_feature_selection(X_train, X_test, y_train, y_test):
    
    params = { "objective": "multi:softmax", 'num_class': 2 , 'random_state': 42 }
    model= xgb.XGBClassifier(**params)
    select_xgbc = SelectFromModel(estimator = model, threshold = "median")
    select_xgbc.fit(X_train, y_train)

    list_train_xgbc= X_train.columns[(select_xgbc.get_support())]
    list_test_xgbc= X_test.columns[(select_xgbc.get_support())]


    X_train_xgbc = X_train[list_train_xgbc]
    X_test_xgbc = X_test[list_test_xgbc]
    
    return X_train_xgbc, X_test_xgbc

### 4) PCA Feature Selection

In [None]:
def pca_feature_selection(X_train, X_test, y_train, y_test, n_components=3):
    # Standardize with a local scaler that is fitted on the training data only
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Create a PCA object and fit it to the scaled training data
    pca = PCA(n_components=n_components) # number of components to keep
    pca.fit(X_train_scaled)

    # Project both sets onto the selected number of components
    X_train_pca = pca.transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    return X_train_pca, X_test_pca

## Part 3: Machine Learning Models and Algorithms

In [None]:
# Initializing dataset path
path = os.path.join(sys.path[0], 'data')
years = [2014, 2015, 2016, 2017, 2018] # Years to include in the training

# Loading and Handling dataset
df = Dataset(path, years)
df.do_eda(feature_eng=True, missing_thresh=0.3, reduce_thresh=0.75, imputer = "median")

# Splitting dataset into training and testing dataset
df.split_dataset()
X_train, X_test, y_train, y_test = df.X_train, df.X_test, df.y_train, df.y_test

# Unified parameters for pipeline's random search
scaler = StandardScaler()
mms = MinMaxScaler()
ros = RandomOverSampler(random_state = 42)

# Standardize features: fit the scaler on the training data, then apply it to the test data
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)


## The class TrainModel

Here is our class `TrainModel`. This class is used for training our different models. It consists of the following parts:

1) The `__init__` method: This is the constructor that initializes the `TrainModel` instance. It takes four arguments:

- `model`, which is the model to be trained,
- `model_params`, which are the hyperparameters for the model,
- `pipeline`, which is a series of data preprocessing steps followed by the model, and
- `pipeline_params`, which are the parameters for the pipeline steps.

2) The `train_model` method: This method trains the model using `RandomizedSearchCV`, which performs hyperparameter tuning. If `use_std` is set to True, it uses the standardized training data (`X_train_std`); otherwise it uses the unscaled training data (`X_train`).

3) The `train_pipeline` method: This method trains the entire pipeline using randomized search with stratified 5-fold cross-validation and the weighted F1 score as the scoring metric. The best parameters found are printed.

4) The `get_results` method: After the model has been trained, this method is used to get the model's predictions on the test set. If `use_std` is True, it uses the standardized test data (`X_test_std`); otherwise it uses the unscaled test data (`X_test`). The predictions are then passed to the `report` method.

5) The `report` method: This method prints a classification report which includes precision, recall, F1 score, and support for each class. It also prints a confusion matrix normalized by the true class and plots it using seaborn's heatmap, which shows, for each true class, the fraction of correct and incorrect predictions made by the model.

In this code, it's important to note that X_train, X_test, y_train, y_test, X_train_std, X_test_std should be defined in the global scope for this class to work, as they are not passed as arguments to the methods.

In [None]:
class TrainModel:
    def __init__(self, model, model_params, pipeline, pipeline_params):
        self.model = model
        self.pipe = pipeline

        # Set uniform params for pipeline Randomized Search
        self.pipeline_params = pipeline_params

        # Set params for model Randomized Search
        self.model_params = model_params
        
        self.kFold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
    
    def train_model(self, use_std=False):
        search = RandomizedSearchCV(estimator=self.model, param_distributions=self.model_params, n_iter=1, verbose=10, random_state=42, n_jobs=-1)
        if use_std:
            print("used_std")
            self.model = search.fit(X_train_std, y_train)
        else:
            self.model = search.fit(X_train, y_train)
        print("Best model params used:", self.model.best_params_)

    def train_pipeline(self, use_std=False):
        # random_state makes the sampled hyperparameter candidates reproducible
        search = RandomizedSearchCV(estimator=self.pipe, param_distributions=self.pipeline_params, scoring = "f1_weighted", cv = self.kFold, n_jobs = -1, verbose=10, random_state=42)
        if use_std:
            print("used_std")
            self.model = search.fit(X_train_std, y_train)
        else:
            self.model = search.fit(X_train, y_train)
        print("Best pipeline params used:", self.model.best_params_)

    def get_results(self, use_std=False, plot=True):
        if use_std:
            print("used_std")
            y_pred = self.model.best_estimator_.predict(X_test_std)
        else:
            y_pred = self.model.best_estimator_.predict(X_test)
        self.report(y_test, y_pred, plot)

    def report(self, y_test, y_pred, plot=True):
        class_report = classification_report(y_test, y_pred)
        print(class_report)
        conf_matrix = confusion_matrix(y_test, y_pred, normalize = "true")
        print(conf_matrix)
        conf_matrix = pd.DataFrame(conf_matrix, ["Class 0", "Class 1"],  ["Class 0", "Class 1"])
        if plot:    
            sns.heatmap(conf_matrix, annot = True).set(xlabel = "Assigned Class", ylabel = "True Class", title = "Confusion Matrix")



### Scaling our Data

Scaling is a necessary pre-processing step for certain machine learning algorithms, especially those that rely on the calculation of distances or optimization methods.

Here are some examples of algorithms where scaling is crucial:
- Linear and Logistic Regression: When regularization is used (like Ridge or Lasso), features need to be on the same scale since regularization penalizes larger weights more heavily.

- Support Vector Machines (SVM): SVMs try to maximize the margin between different classes. Feature scaling can have a significant impact on the margin size and therefore, the SVM's performance.

- K-Nearest Neighbors (K-NN): This algorithm calculates the distance between different points. Scaling ensures that all features contribute equally to the distance calculation.

- Neural Networks: The weights of neural networks are updated using optimization methods like gradient descent. Feature scaling can make the surface of the loss function smoother, leading to quicker convergence.

- Principal Component Analysis (PCA): PCA is affected by the scales of the features because it maximizes the variance along the new axis.

However, there are also algorithms where scaling isn't necessary:

- Tree-based algorithms: Algorithms like Decision Trees, Random Forests, Gradient Boosting, and XGBoost do not require feature scaling. These algorithms are not distance-based and can handle various scales.

- Naive Bayes: Naive Bayes is not affected by the feature scales as it is not distance based.
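
The two scalers used in our pipelines apply simple column-wise formulas. The following sketch reproduces them with NumPy on made-up values; it is meant as an illustration of what `StandardScaler` and `MinMaxScaler` compute with their default settings, not as a replacement for them.

```python
import numpy as np

# Two toy features on very different scales (values are illustrative only).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# Standardization: zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each column to the range [0, 1].
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_mm)
```

After either transformation both columns have the same values in this example, so neither feature dominates a distance calculation or a regularization penalty merely because of its unit.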


### Report

The F1 score is a measure of a model's performance that considers both precision and recall. It is the harmonic mean of these two metrics. Precision is the fraction of instances predicted as a class that truly belong to it, while recall is the fraction of instances of a class that the model identifies correctly.

The "weighted" F1 score means that each class's F1 score is weighted by the number of true instances for that class. This is useful for imbalanced datasets, in which the number of instances per class varies greatly.
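
The weighted F1 score can be computed by hand as follows. This is a minimal sketch in plain Python with made-up labels; the helper names are hypothetical, and in our pipeline the score is computed by scikit-learn via `scoring = "f1_weighted"`.

```python
def f1_per_class(y_true, y_pred, label):
    """Precision, recall and F1 score with `label` treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by the number of true instances."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        support = sum(t == label for t in y_true)
        score += support / total * f1_per_class(y_true, y_pred, label)[2]
    return score

# Toy labels: four stocks of class 1 and two of class 0.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(weighted_f1(y_true, y_pred))
```

In this example class 1 has an F1 score of 0.75 and class 0 has one of 0.5; because class 1 has twice as many true instances, it contributes twice as much to the weighted average.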



### Comparison of Logistic Regression, Random Forest, Support Vector Machines, and Naive Bayes methods:

1) Logistic Regression (LR):

This is a statistical model used for binary classification problems (can be extended for multiclass problems).
LR models the probabilities of class memberships as a logistic function of a linear combination of the predictors.
It assumes a linear relationship between the features and the logit of the outcome, and it is parametric, meaning it makes assumptions about the data distribution.
LR may struggle with complex non-linear data, and features should ideally be scaled before being fed into this algorithm.

2) Random Forest (RF):

This is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
RF can handle both numerical and categorical data, and it's good for high-dimensional spaces as well as large numbers of training examples.
It does not require feature scaling, and it inherently provides a measure of feature importance.
However, a large number of trees can make the model slow and ineffective for real-time predictions.

3) Support Vector Machines (SVM):

This is a powerful and flexible class of supervised algorithms for both classification and regression.
SVMs can handle linear and non-linear data well by applying a technique called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
It can handle high-dimensional data well, but it can be sensitive to overfitting depending on the kernel used and can struggle with larger datasets due to its computational complexity.
SVM requires feature scaling for optimal performance.

4) Naive Bayes (NB):

This is a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
It is simple and fast, and it's particularly suited for high-dimensional datasets.
NB can handle both numerical and categorical data, and it works well even with less training data.
However, the strong assumption of independent features (which is rarely true in real life) is a limitation.

## 1 Logistic Regression

Logistic Regression is a statistical and machine learning algorithm often used for binary classification problems. It is simple and computationally efficient, which is why it is often used as a baseline.
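
The prediction step of a fitted logistic regression model can be sketched with NumPy. The coefficients and inputs below are hypothetical and chosen only for illustration; in our pipeline they are estimated by scikit-learn's `LogisticRegression`.

```python
import numpy as np

def predict_proba(X, weights, bias):
    """Probability of class 1 under a logistic regression model."""
    logits = X @ weights + bias            # linear combination of the predictors
    return 1.0 / (1.0 + np.exp(-logits))   # logistic (sigmoid) function

# Hypothetical coefficients and inputs, not estimated from our data.
weights = np.array([0.8, -0.5])
bias = 0.1
X = np.array([[0.0, 0.0],
              [2.0, 1.0],
              [-1.0, 3.0]])

probabilities = predict_proba(X, weights, bias)
predicted_class = (probabilities >= 0.5).astype(int)   # class 1 if the probability is at least 0.5
print(probabilities, predicted_class)
```

Because the logit is linear in the features, the decision boundary at a probability of 0.5 is a hyperplane, which is why the model can struggle with strongly non-linear data.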


In [None]:
model = LogisticRegression(solver='lbfgs', random_state=42, max_iter=3000, n_jobs=-1) 

pipeline_params = {'ros': [ros, None], # upsampling or not
              'scaler': [scaler, mms], # scaling input by standardizing or min-max scaling
              'classifier__C': loguniform(1e-2, 1e0),
              'classifier__penalty': ['l2']} 
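The entry `loguniform(1e-2, 1e0)` is a distribution object rather than a list, so a randomized search draws a fresh value of `C` from it for every candidate. The sketch below only illustrates what such draws look like; the sample size and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import loguniform

# Same distribution as used for classifier__C
c_dist = loguniform(1e-2, 1e0)
c_samples = c_dist.rvs(size=1000, random_state=42)

# A log-uniform distribution spreads the draws evenly over the orders of magnitude,
# so roughly half of them fall below the geometric mean sqrt(1e-2 * 1e0) = 0.1
share_below_geom_mean = np.mean(c_samples < 0.1)

print(f"min {c_samples.min():.4f}, max {c_samples.max():.4f}")
print(f"share of draws below 0.1: {share_below_geom_mean:.2f}")
```

Compared with a uniform distribution on the same interval, small regularization strengths are therefore explored as often as large ones.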

### 1.1 Logistic Regression without Feature Selection

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["classifier", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

### 1.2 Logistic Regression with Random Forest Feature Selection 

In [None]:
feat_sel = RandomForestClassifier( random_state = 42, n_estimators=100, n_jobs=-1)
pipe = imbpipeline(steps=[["scaler", scaler], ["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["classifier", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()
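With `threshold = "median"`, `SelectFromModel` keeps the features whose importance is at least the median importance, which is roughly half of them. A minimal sketch on synthetic data (the data set and the variable names are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

selector = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
X_reduced = selector.fit_transform(X_demo, y_demo)

n_kept = X_reduced.shape[1]
print(f"kept {n_kept} of {X_demo.shape[1]} features")
```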

### 1.3 Logistic Regression with XGBoost Feature Selection

In [None]:
feat_sel = xgb.XGBClassifier(objective = "multi:softmax", num_class = 2,  random_state = 42, n_jobs =-1)
pipe = imbpipeline(steps=[["scaler", scaler], ["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["classifier", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()


### 1.4 Logistic Regression with PCA Feature Selection

In [None]:
pca = PCA()

pipeline_params = {'ros': [ros, None], # upsampling or not
              'scaler': [scaler, mms], # scaling input by standardizing or min-max scaling
              'classifier__C': loguniform(1e-2, 1e0),
              'classifier__penalty': ['l2'],
              'pca__n_components': np.arange(4, 10, 1)} 

pipe = imbpipeline(steps=[["scaler", scaler], ["pca", pca ],  ["ros", ros], ["classifier", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()
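Strictly speaking, PCA does not select features but projects them onto a few new components. How much variance the first 4 to 9 components retain can be checked as in the following sketch. The correlated synthetic data and all names are assumptions; the numbers for the financial indicators will differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): 3 latent factors mixed into 30 noisy columns
latent = rng.normal(size=(400, 3))
X_demo = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(400, 30))

# PCA is sensitive to the scale of the features, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# Cumulative explained variance for the same range as pca__n_components
cum_var = {
    k: PCA(n_components=k).fit(X_scaled).explained_variance_ratio_.sum()
    for k in range(4, 10)
}
for k, share in cum_var.items():
    print(f"{k} components: {share:.3f}")
```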


## 2 Random Forest

Random Forest is a popular and versatile machine learning method that can perform both regression and classification tasks. Its built-in feature importances can also be used to reduce the number of features, and its tree-based splits make it comparatively robust to outliers. Depending on the implementation, it can also cope with missing values.

Random Forests generally have a high prediction accuracy and are quite efficient on large datasets. 



In [None]:
model = RandomForestClassifier(random_state = 42)
pipeline_params = {
    "rf__criterion": ["gini", "entropy"],
    "rf__max_features": [ "sqrt", "log2"],
    "rf__max_depth": np.array([None, 5, 10, 20]),
    "rf__n_estimators": np.array([ 50, 100, 200, 250]),
    "rf__class_weight": [None, "balanced", "balanced_subsample"],
}

### 2.1 Random Forest without Feature Selection

In [None]:
pipe = imbpipeline(steps=[["ros", ros], ["rf", model]]) # "rf" is the step name; it must match the "rf__" prefix in pipeline_params

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()
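The keys in `pipeline_params` follow the pattern `<step name>__<parameter>`, so `"rf__max_depth"` addresses the `max_depth` parameter of the pipeline step named `"rf"`. The sketch below shows this with a plain scikit-learn `Pipeline`; the name `demo_pipe` is an assumption, and the imbalanced-learn pipeline used above should resolve parameter names in the same way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# One-step pipeline whose step is called "rf"
demo_pipe = Pipeline(steps=[("rf", RandomForestClassifier(random_state=42))])

# Parameters of a step are addressed as "<step name>__<parameter>"
demo_pipe.set_params(rf__max_depth=5, rf__n_estimators=50)

rf_keys = sorted(k for k in demo_pipe.get_params() if k.startswith("rf__"))
print(rf_keys[:5])
print(demo_pipe.named_steps["rf"].max_depth)
```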

### 2.2 Random Forest with Random Forest Feature Selection

In [None]:
feat_sel = RandomForestClassifier(random_state = 42, n_estimators=100, n_jobs=-1)
pipe = imbpipeline(steps=[["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["rf", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

### 2.3 Random Forest with XGBoost Feature Selection

In [None]:
feat_sel = xgb.XGBClassifier(objective = "multi:softmax", num_class = 2,  random_state = 42, n_jobs =-1)
pipe = imbpipeline(steps=[["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["rf", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

### 2.4 Random Forest with PCA Feature Selection

In [None]:
pca = PCA()
pipeline_params = { "rf__criterion": ["gini", "entropy"],
                    "rf__max_features": [ "sqrt", "log2"],
                    "rf__max_depth": np.array([None, 5, 10, 20]),
                    "rf__n_estimators": np.array([ 50, 100, 200, 250]),
                    "rf__class_weight": [None, "balanced", "balanced_subsample"],
                    'pca__n_components': np.arange(4, 10, 1)} 

pipe = imbpipeline(steps=[["scaler", scaler], ["pca", pca ],  ["ros", ros], ["rf", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

## 3 Support Vector Machines


Support Vector Machines (SVMs) aim to find a distinct hyperplane in a high-dimensional space to classify data points. They excel in managing high-dimensional data, even when dimensions outnumber samples, and are memory-efficient by leveraging a subset of training points called "support vectors" in decision-making. 

However, when the number of features is much larger than the number of samples, the kernel and the regularization term must be chosen carefully to avoid over-fitting.
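Why the choice of kernel matters can be illustrated on a toy problem that is not linearly separable. The sketch below compares a linear and an RBF kernel on two concentric circles; the data set from `make_circles` and all names are assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two concentric circles: no straight line separates the classes
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

kernel_scores = {}
for kernel in ["linear", "rbf"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    kernel_scores[kernel] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel}: {score:.3f}")
```

On this toy data the RBF kernel should clearly outperform the linear one, which is why both kernels are included in the search space below.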

### 3.1 SVM without Feature Selection

In [None]:
model= SVC(random_state=42, max_iter= 1000)

In [None]:
pipe = imbpipeline(steps=[["ros", ros], ["SVM", model]])
pipeline_params = {
    'ros': [ros, None], 
    "SVM__kernel": ["linear", "rbf"],
    "SVM__C": np.outer(np.logspace(-1, 1, 3),np.array([1,5])).flatten(),
    "SVM__gamma": np.outer(np.logspace(-3, 0, 4),np.array([1,5])).flatten()
}
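The `np.outer(...).flatten()` construction builds a grid with two values (1x and 5x) per decade. The following sketch prints the resulting candidates; the variable names are assumptions.

```python
import numpy as np

# Each power of ten is multiplied by 1 and by 5
c_grid = np.outer(np.logspace(-1, 1, 3), np.array([1, 5])).flatten()
gamma_grid = np.outer(np.logspace(-3, 0, 4), np.array([1, 5])).flatten()

print(c_grid)      # six values from 0.1 to 50
print(gamma_grid)  # eight values from 0.001 to 5
```

Note that `gamma` only affects the RBF kernel; for the linear kernel the different `gamma` candidates lead to identical models.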

In [None]:
train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline(use_std=True)
train.get_results(use_std=True)

### 3.2 SVM with Random Forest Feature Selection

In [None]:
feat_sel = RandomForestClassifier( random_state = 42, n_estimators=100, n_jobs=-1)
pipe = imbpipeline(steps=[["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["SVM", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline(use_std=True)
train.get_results(use_std=True)

### 3.3 SVM with XGBoost Feature Selection

In [None]:
feat_sel = xgb.XGBClassifier(objective = "multi:softmax", num_class = 2,  random_state = 42, n_jobs =-1)
pipe = imbpipeline(steps=[["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["SVM", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline(use_std=True)
train.get_results(use_std=True)

### 3.4 SVM with PCA Feature Selection

In [None]:
pca = PCA()

pipe = imbpipeline(steps=[["scaler", scaler], ["pca", pca], ["ros", ros], ["SVM", model]])
pipeline_params = {
    'ros': [ros, None], 
    'scaler': [scaler, mms],
    "SVM__kernel": ["linear", "rbf"],
    "SVM__C": [ 10],
    "SVM__gamma": ["scale"],
    "pca__n_components": np.arange(4, 10, 1),
}

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

## 4 Gaussian Naive Bayes

GaussianNB assumes that, given the class, the features are conditionally independent of each other and that each feature follows a Gaussian distribution. Therefore, GaussianNB is only suited to continuous features. The probability of belonging to a specific class is estimated from the prior probability of the class and the densities of the fitted Gaussian distributions.
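The estimate described above can be reproduced by hand: for every class, the log prior is added to the sum of the log Gaussian densities of the individual features, and the result is normalized over the classes. The sketch below does this on synthetic data and compares it with `predict_proba`. The data set, the helper `manual_posterior` and the variable names are assumptions; the fallback to `sigma_` covers older scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
gnb = GaussianNB().fit(X_demo, y_demo)

# Per-class feature variances (called sigma_ in older scikit-learn versions)
variances = gnb.var_ if hasattr(gnb, "var_") else gnb.sigma_


def manual_posterior(x):
    """Posterior class probabilities of one sample x (hypothetical helper)."""
    log_prior = np.log(gnb.class_prior_)
    # Independence assumption: the log densities of the features simply add up
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - gnb.theta_) ** 2 / variances, axis=1
    )
    joint = log_prior + log_lik
    joint -= joint.max()  # numerical stability before exponentiating
    posterior = np.exp(joint)
    return posterior / posterior.sum()


manual = manual_posterior(X_demo[0])
reference = gnb.predict_proba(X_demo[:1])[0]
print(manual, reference)
```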

In [None]:
model = GaussianNB() # GaussianNB is deterministic and does not accept a random_state argument
priors = (sum(y_train == 0)/y_train.shape[0], sum(y_train == 1)/y_train.shape[0])
pipeline_params = {
    "scaler": [scaler, mms, None],
    "ros": [ros, None],
    "gnb__priors": [None, priors],
    "gnb__var_smoothing": np.logspace(0, -10, num = 100),
}

### 4.1 Naive Bayes without Feature Selection

In [None]:
pipe = imbpipeline(steps=[["scaler", scaler], ["ros", ros], ["gnb", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

### 4.2 Naive Bayes with Random Forest Feature Selection

In [None]:
feat_sel = RandomForestClassifier(random_state = 42)
pipe = imbpipeline(steps=[["scaler", scaler], ["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["gnb", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

### 4.3 Naive Bayes with XGBoost Feature Selection

In [None]:
feat_sel = xgb.XGBClassifier(objective = "multi:softmax", num_class = 2,  random_state = 42, n_jobs =-1)
pipe = imbpipeline(steps=[["scaler", scaler], ["feature_selection",  SelectFromModel(estimator = feat_sel, threshold = "median")],
                          ["ros", ros], ["gnb", model]])

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

### 4.4 Naive Bayes with PCA Feature Selection

In [None]:
pca = PCA()
pipe = imbpipeline(steps=[["scaler", scaler], ["pca", pca], ["gnb", model]])
pipeline_params= {
    "scaler": [scaler, mms, None], 
    "gnb__priors": [None, priors],
    "gnb__var_smoothing": np.linspace(1, 0, num = 5),
    "pca__n_components": np.arange(4, 10, 1)
}

train = TrainModel(model, None, pipe, pipeline_params)
train.train_pipeline()
train.get_results()

## Part 4: Feed Forward Neural Network

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch.utils.data import DataLoader
# Import the PyTorch base class under an alias so that it does not shadow our own Dataset class
from torch.utils.data import Dataset as TorchDataset
from torchvision import datasets as datasets
from torchvision import transforms as transforms

In [None]:
X, y = df.X.copy(), df.y # work on a copy so that df.X is not modified in place
stdCols = X.select_dtypes(include = "float64").columns #columns to be standardized
X[stdCols] = (X[stdCols]-X[stdCols].mean())/X[stdCols].std()

input_dim = X.shape[1]
batch_size = 4
PATH = './net.pth'

optimal_lr = 0.009
optimal_momentum = 0.8
optimal_dropout = 0.3

In [None]:
class FCNetwork(nn.Module):
    def __init__(self, dropout_rate=0.3, input_dim = input_dim):
        super().__init__()
        self.lin1 = nn.Linear(input_dim, input_dim//2)
        self.lin2 = nn.Linear(input_dim//2, input_dim//4)
        self.lin3 = nn.Linear(input_dim//4, 1)
        self.drop1 = nn.Dropout(p = dropout_rate)
        self.drop2 = nn.Dropout(p = dropout_rate)
        self.prob = nn.Sigmoid()

    def forward(self, x):
        x = F.relu(self.lin1(x))
        x = self.drop1(x)
        x = F.relu(self.lin2(x))
        x = self.drop2(x)
        x = self.lin3(x)
        x = self.prob(x)
        return x

In [None]:
#Definition of Dataset for Dataloader
class CustomDataset(TorchDataset):
    def __init__(self, X, y):
        super().__init__()
        self.X = torch.tensor(X.values, dtype=torch.float32)
        self.y = torch.tensor(y.astype("int32").values, dtype=torch.float32).reshape(-1, 1)
        
    def __len__(self):
        return len(self.X)
        
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

In [None]:
def CustomDataloader(X, y):    
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.3, random_state = 42) 
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify = y_train, test_size = 0.2, random_state = 42)

    train_dataset = CustomDataset(X_train, y_train)
    val_dataset = CustomDataset(X_val, y_val)
    test_dataset = CustomDataset(X_test, y_test)

    trainloader = DataLoader(train_dataset, batch_size=batch_size,
                                              shuffle=True)

    valloader = DataLoader(val_dataset, batch_size=batch_size,
                                             shuffle=False)

    testloader = DataLoader(test_dataset, batch_size=batch_size,
                                             shuffle=False)
    
    return trainloader, valloader, testloader
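The two consecutive calls to `train_test_split` first hold out 30% of the data for testing and then 20% of the remainder for validation. The resulting shares of the full data set can be worked out as in this small sketch; the function name `split_fractions` is hypothetical.

```python
def split_fractions(test_size=0.3, val_size=0.2):
    """Shares of the full data set that end up in train, validation and test."""
    test = test_size
    val = (1 - test_size) * val_size          # 20% of the remaining 70%
    train = (1 - test_size) * (1 - val_size)  # 80% of the remaining 70%
    return train, val, test


train_share, val_share, test_share = split_fractions()
print(f"train {train_share:.2f}, validation {val_share:.2f}, test {test_share:.2f}")
```

So about 56% of the samples are used for training, 14% for validation and 30% for testing.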

In [None]:
def FCNetwork_train(network, trainloader, valloader, PATH, nr_epochs, print_running_loss):
    criterion = nn.BCELoss()
    optimizer = optim.SGD(network.parameters(), lr=optimal_lr, momentum=optimal_momentum)
    
    min_val_loss = float("inf")
    for epoch in range(nr_epochs):  

        running_loss = 0.0
        network.train()
        n = 0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data

            optimizer.zero_grad()

            outputs = network(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            n += 1

        if print_running_loss:
            print(f'[{epoch + 1}, {i + 1:5d}] train loss: {running_loss / n:.3f}')

        network.eval()
        val_loss = 0
        n = 0
        with torch.no_grad():
            for i, data in enumerate(valloader, 0):
                inputs, labels = data

                outputs = network(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                n += 1

        if print_running_loss:
            print(f'[{epoch + 1}, {i + 1:5d}] validation loss: {val_loss / n:.3f}')            

        if val_loss < min_val_loss:
            min_val_loss = val_loss
            torch.save(network.state_dict(), PATH)
            
            if print_running_loss:   
                print(f"The new best model is at epoch {epoch + 1}")
        
        if print_running_loss:
            print(f'Epoch: {epoch + 1} over')

In [None]:
def test_true_predicted(network, testloader, PATH):
    network.load_state_dict(torch.load(PATH))

    y_pred = []
    y_true = []
    
    network.eval()
    with torch.no_grad():
        for data in testloader:
            inputs, labels = data
            outputs = network(inputs)
            predicted = outputs>0.5 
            
            # Flatten instead of reshaping to batch_size, since the last batch may be smaller
            y_true.extend(labels.reshape(-1).int().tolist())
            y_pred.extend(predicted.reshape(-1).int().tolist())

    return y_true, y_pred

In [None]:
network = FCNetwork(input_dim = input_dim, dropout_rate = optimal_dropout)
trainloader, valloader, testloader = CustomDataloader(X, y)
FCNetwork_train(network, trainloader, valloader, PATH, nr_epochs=15, print_running_loss=True)

In [None]:
# Evaluate the test set only once and unpack both lists
y_true, y_pred = test_true_predicted(network, testloader, PATH)

In [None]:
class_report = classification_report(y_true, y_pred)
print(class_report)
conf_matrix = confusion_matrix(y_true, y_pred, normalize = "true")
print(conf_matrix)

# Collection of important Figures from EDA and Feature Selection 

In the following, we explore the characteristics of the most important features in more detail. For this purpose, we fit an Extra Trees classifier and show histograms of the most important features.

In [None]:
df = Dataset(path, years)
df.do_eda(remove_duplicates=True, feature_eng=False, remove_corr=True, missing_thresh=0.3, reduce_col=['Sector'], reduce_thresh=0.75, imputer = "median")
df.split_dataset()

X, y = df.X, df.y

In [None]:
model = ExtraTreesClassifier(random_state=42)
model.fit(X, y)
importances = model.feature_importances_
# Standard deviation of the feature importances across the individual trees
importances_std = np.std([tree.feature_importances_ for tree in
                                    model.estimators_],
                                    axis = 0)

In [None]:
top_features = pd.Series(importances, index=X.columns).nlargest(200)
n = len(top_features)

# Plot feature importance of all features
plt.figure(figsize=(6, 5))
plt.bar(range(n), top_features[:n], align='center')

plt.xlim([-1, n])
plt.xlabel('Feature')
plt.ylabel('Feature Importance')

plt.tight_layout()

In [None]:
# Plot the importance of the n most and the n least important features
n = 15

plt.figure(figsize=(10, 8))

# Plot the most important features
plt.subplot(1,2,1)
plt.bar(range(n), top_features[:n], align='center')

for i in range(n):
    plt.text(i, top_features.iloc[i]+1e-4 , round(top_features.iloc[i], 3), ha='center', fontsize = 6)

plt.xticks(range(n), top_features.index[:n], rotation=90)
plt.xlim([-1, n])
plt.ylim([0, 0.09])
plt.title(f"The {n} most important features")
plt.ylabel('Feature Importance')

# Plot the least important features

plt.subplot(1,2,2)
plt.bar(range(n), top_features[-n:], align='center')

for i in range(n):
    plt.text(i, top_features.iloc[len(top_features)-n+i]+1e-4 , round(top_features.iloc[len(top_features)-n+i], 3), ha='center', fontsize = 6)

plt.xticks(range(n), top_features.index[-n:], rotation=90)
plt.xlim([-1, n])
plt.ylim([0, 0.09])
plt.title(f"The {n} least important features")
plt.ylabel('Feature Importance')

plt.tight_layout()

In [None]:
fig, axes = plt.subplots(2,3)
fig.set_size_inches(20, 10)

# df is a Dataset object, so the features are taken from X and the class labels from y

for i, ax in enumerate(axes.flatten()):
    feat_cl0 = X.loc[y == 0, top_features.index[i]]
    feat_cl1 = X.loc[y == 1, top_features.index[i]]
    
    mean_cl0 = feat_cl0.mean()
    mean_cl1 = feat_cl1.mean()
    
    med_cl0 = feat_cl0.median()
    med_cl1 = feat_cl1.median()    
    
    plot_range = (feat_cl0.quantile(0.05), feat_cl0.quantile(0.95))
    ax.hist(feat_cl0, color = "red", bins = 100, alpha = 0.5, range = plot_range, label = "Sell")
    ax.axvline(x = med_cl0, color = "red", linestyle='--', label = f"median {round(med_cl0, 3)}")
    ax.axvline(x = mean_cl0, color = "red", linestyle=':', label = f"mean {round(mean_cl0, 3)}")
    ax.hist(feat_cl1, color = "green", bins = 100, alpha = 0.5, range = plot_range, label = "Buy")
    ax.axvline(x = med_cl1, color = "green", linestyle='--', label = f"median {round(med_cl1, 3)}")
    ax.axvline(x = mean_cl1, color = "green", linestyle=':', label = f"mean {round(mean_cl1, 3)}")
    ax.set_xlabel(top_features.index[i])
    ax.legend()

plt.tight_layout()
plt.show()

# Discussion

With accuracies higher than 0.6, our models seem to predict the market better than chance. However, the conclusion that it is possible to "beat the market" should not be drawn without further consideration. Firstly, since our data set is slightly imbalanced, always predicting class 1 already yields an accuracy of 0.55. More problematic, however, is the fact that we trained our models on the combined data set from 2014 to 2018. Our class variable is very imbalanced when the years are considered individually. Therefore, it is possible that our algorithms learn the year of an observation in the combined data set and assign the majority class of that year to it.

To check whether we can predict the stock market without any possible information about the year, we conclude our project by training Random Forest classifiers on the different years separately. The comparison of the accuracy scores of the models trained on the different years reveals that the prediction is not strongly impeded by the class imbalance. Furthermore, we conclude that the predictive power observed in the combined data set of the years 2014-2018 is independent of a hidden effect of the year.
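The two trivial baselines against which our accuracies should be judged can be worked out by hand. The sketch below assumes a share of 0.55 for class 1, as stated above; the function name `baseline_accuracies` is hypothetical.

```python
def baseline_accuracies(p_one):
    """Expected accuracy of two trivial classifiers for a binary target with P(y = 1) = p_one."""
    # Always predicting the majority class is correct for every majority sample
    majority = max(p_one, 1 - p_one)
    # Guessing with the class priors is correct with probability p1 * p1 + p0 * p0
    prior_guess = p_one**2 + (1 - p_one) ** 2
    return majority, prior_guess


majority_acc, prior_guess_acc = baseline_accuracies(0.55)
print(f"majority class: {majority_acc:.3f}, random guess with priors: {prior_guess_acc:.3f}")
```

Under this assumption the majority-class baseline reaches 0.55, while guessing with the class priors is expected to reach about 0.505.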

### Accuracy with random prediction

In [None]:
from sklearn.metrics import accuracy_score

print(f"Accuracy when all predictions are set to one: {round(accuracy_score(y_test, np.ones(len(y_test))),2)}")

In [None]:
accuracies = []
priors = (sum(y_test == 0)/y_test.shape[0], sum(y_test == 1)/y_test.shape[0])
np.random.seed(42)

for i in range(10000):
    y_pred = np.random.choice([0, 1], size = len(y_test), p = priors) 
    accuracies.append(accuracy_score(y_test, y_pred))
    
fig = plt.hist(accuracies, bins = 50, label = rf"mean {round(np.mean(accuracies),2)} $\pm$ {round(np.std(accuracies),2)}")
plt.xlabel("accuracy")
plt.xlim([0.4,0.6])
plt.ylim([0,750])
plt.legend()
plt.title("Accuracy with random prediction")
plt.show()

### Class Imbalance in the years 2014-2018

In [None]:
project_directory = sys.path[0]
data_directory = os.path.join(project_directory, 'data')

class0_years = []
class1_years = []
for year in range(2014,2019):
    df_year = pd.read_csv(os.path.join(data_directory, f'{year}_Financial_Data.csv'), sep=',')
    class0_years.append(sum(df_year["Class"] == 0))
    class1_years.append(sum(df_year["Class"] == 1))
    
N = 5
ind = np.arange(N) # the x locations for the groups
width = 0.35
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(ind, class0_years, width, color='r', label = "Class 0 (Sell)")
ax.bar(ind, class1_years, width, bottom=class0_years, color='g', label = "Class 1 (Buy)")
ax.set_ylabel('Occupancy')
ax.set_title('Class occupancy in different years')
ax.set_xticks(ind, [f"{i}" for i in range(2014,2019)])
#ax.set_yticks(np.arange(0, 81, 10))
ax.legend()
plt.show()

### Random Forest Classifier trained on the data from different years separately

In [None]:
#Random Forest Classifier fitted to the data of each year separately
for year in range(2014,2019):
    df = Dataset(path, year)
    df.do_eda(feature_eng=False, missing_thresh=0.3, reduce_thresh=0.75, imputer = "median")
    df.split_dataset()
    X_train, X_test, y_train, y_test = df.X_train, df.X_test, df.y_train, df.y_test

    model = RandomForestClassifier(random_state = 42)
    pipeline_params = {
        "rf__criterion": ["gini", "entropy"],
        "rf__max_features": [ "sqrt", "log2"],
        "rf__max_depth": np.array([None, 5, 10, 20]),
        "rf__n_estimators": np.array([ 50, 100, 200, 250]),
        "rf__class_weight": [None, "balanced", "balanced_subsample"],
    }

    pipe = imbpipeline(steps=[["ros", ros], ["rf", model]])

    train = TrainModel(model, None, pipe, pipeline_params)
    print(f"Random Forest Classifier trained on data from {year}\n")
    train.train_pipeline()
    train.get_results(plot=False)
    print("\n\n")