<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="400" alt="cognitiveclass.ai logo">
</center>

# **Investigating relationships between the BTC/BUSD exchange rate and the ADOSC, NATR, and TRANGE indicators**

## Lab 5. Classification in finance

Estimated time needed: **30** minutes



### Objectives

After completing this lab you will be able to:

*   Preprocess data (normalize numeric features and encode categorical ones) and create a dataset
*   Select features
*   Build classification models
*   Visualize the decision tree of a classification model

### Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import and Load Data</li>
    <li>Data preparation</li>
        <ul>
            <li>Data transformation</li>
            <li>Encoding and Normalization</li>
        </ul>
    <li>Features Selection</li>
        <ul>
            <li>Chi-Squared Statistic</li>
            <li>Mutual Information Statistic</li>
            <li>Feature Importance</li>
        </ul>  
    <li>Classification models</li>
            <ul>
                <li>Train and Test DataSets Creation</li>
                <li>Extra Trees Classifier</li>
                <li>Logistic Regression</li>
            </ul>
    <li>Decision Tree</li>
            <ul>
                <li>Build model</li>
                <li>Visualization of Decision Tree</li>
            </ul>
</ol>

</div>

----


## Dataset Description

### Context
The dataset contains historical changes of the ***BTC/BUSD*** exchange rate and the ***ADOSC, NATR, TRANGE indicators*** for the period from *11/11/2022 to 11/24/2022* with a *1-minute* aggregation time.

### Columns

#### Input columns
* ***Time*** - the timestamp of the record
* ***Open*** -  the price of the asset at the beginning of the trading period
* ***High*** -  the highest price of the asset during the trading period
* ***Low*** - the lowest price of the asset during the trading period.
* ***Close*** - the price of the asset at the end of the trading period
* ***Volume*** - the total number of shares or contracts of a particular asset that are traded during a given period
* ***Count*** -  the number of individual trades or transactions that have been executed during a given time period
* ***ADOSC*** - Chaikin oscillator indicator
* ***NATR*** - normalized average true range (ATR) indicator
* ***TRANGE*** - true range indicator
* ***Volume_binned*** - categorical field that indicates the size of the Volume *(Low, Medium, High)*
* ***ADOSC_binned*** - categorical field that indicates the size of the ADOSC indicator *(Low, Medium, High)*
* ***NATR_binned*** - categorical field that indicates the size of the NATR indicator *(Low, Medium, High)*
* ***TRANGE_binned*** - categorical field that indicates the size of the TRANGE indicator *(Low, Medium, High)*

#### Target columns
* ***Price*** - the average price at which a particular asset has been bought or sold during a given period
* ***Price_binned*** - categorical field that indicates the size of the Price *(Low, Medium, High)*


----


This lab performs a preliminary analysis of the cryptocurrency price level: we divide the price into categories by level and try to predict the category from numerical indicator values.

In this lesson, we will try to answer a set of questions that may be relevant when analyzing financial data:

1. What are the most useful Python libraries for classification analysis?
2. How to transform categorical data?
3. How to create a dataset?
4. How to do feature selection?
5. How to build, fit, and visualize a classification model?

In addition, we will draw conclusions from the results of our classification analysis to find out whether indicators can be used in cryptocurrency price prediction.


## 1. Import and Load Data


### Setup

[Scikit-learn](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01) (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Let's install <em>scikit-learn</em> and other needed modules:


In [ ]:
! conda install -c conda-forge scikit-learn -y
! conda install python-graphviz -y

### Import Libraries


Import the libraries needed for this lab. We add aliases to make the libraries easier to use in our code, set a default figure size for the plots, and suppress warnings.


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import graphviz
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
# Data transformation
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
# Features Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif
# Classifiers
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
# warnings deactivate
import warnings
warnings.filterwarnings('ignore')
# for better visualization
from sklearn import set_config
set_config(display = 'diagram')

Next, set the `display.precision` option to 2 to display two decimal places (instead of the default 6).


In [ ]:
pd.set_option("display.precision", 2)
pd.options.display.float_format = '{:.2f}'.format

### Load Data


We will use the same dataset as in the previous labs, so the next few steps will be the same.


First, we assign the URL of the dataset to <code>"path"</code>. 


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0ZXSEN/BTCBUSD_trade.csv' 

Then use the Pandas function <code>read_csv()</code> to load the data from the web address, and convert the DataFrame index to <strong>datetime</strong> type using the <code>pd.to_datetime()</code> function for correct time series analysis. 


In [ ]:
df = pd.read_csv(path)
df.set_index('Time', inplace=True)
df.index = pd.to_datetime(df.index)

df.head()
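
The index conversion can also be done inside `read_csv()` itself. The following is a minimal sketch on a tiny in-memory CSV; the three rows are made-up sample values (not taken from the lab dataset), and the only assumption carried over from the lab is that the timestamp column is named `Time`:

```python
import io
import pandas as pd

# Made-up sample rows that only mimic the layout of the lab's CSV file
csv_text = """Time,Close,Volume
2022-11-11 00:00:00,17500.0,12.5
2022-11-11 00:01:00,17510.5,8.1
2022-11-11 00:02:00,17498.2,15.3
"""

# index_col + parse_dates turns 'Time' into a DatetimeIndex in a single call
sample = pd.read_csv(io.StringIO(csv_text), index_col="Time", parse_dates=True)

print(type(sample.index).__name__)  # DatetimeIndex
print(sample.shape)                 # (3, 2)
```

Both approaches should give the same result; which one to use is mostly a matter of taste.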

In [ ]:
df.shape

As you can see, the dataset consists of 15 columns and 16751 rows. The `Price_binned` column will be the target in the classification models below. We investigated the input columns in previous labs. Our classification models will use the following features:


Input features (column names):
1. `Volume` - the total number of units of the asset traded on all exchanges within a particular time period <em>(numeric)</em>
2. `ADOSC` - a volume-based indicator that measures the cumulative flow of money into and out of an asset <em>(numeric)</em>
3. `NATR` - an indicator measuring the volatility level <em>(numeric)</em>
4. `TRANGE` - a technical indicator which measures the daily range plus any gap from the closing price of the preceding day <em>(numeric)</em>
5. `Volume_binned` - Volume values divided into three categories based on their level <em>(categorical: `Low`, `Medium`, `High`)</em>
6. `ADOSC_binned` - ADOSC values divided into three categories based on their level <em>(categorical: `Low`, `Medium`, `High`)</em>
7. `NATR_binned` - NATR values divided into three categories based on their level <em>(categorical: `Low`, `Medium`, `High`)</em>
8. `TRANGE_binned` - TRANGE values divided into three categories based on their level <em>(categorical: `Low`, `Medium`, `High`)</em>

Output feature (desired target):

1. `Price_binned` - the price category that the cryptocurrency price falls into


Our goal is to create a classification model that can predict the cryptocurrency price level. To do this, we must analyze and prepare the data for this type of model.


## 2. Data preparation


### Data transformation


First of all, we should check which types Pandas assigned to the features.


In [ ]:
df.info()

As you can see, all categorical features were recognized as object. We must change their type to "categorical". 


In [ ]:
col_cat = list(df.select_dtypes(include=['object']).columns)
col_cat

Let's convert these columns to the <code>category</code> type and check the result.


In [ ]:
df[col_cat] = df[col_cat].astype('category')
df.info()

To see the unique values of a specific feature (column) we can use the <code>unique()</code> method:


In [ ]:
df['ADOSC_binned'].unique()

As noted earlier, each row of the dataset has 15 features (columns), including 1 target feature (y). Five features, including the target, are categorical. Values of this type cannot be used directly for classification, so we must transform them to int or float. 
To do this we can use **[LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)** and **[OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)**. These classes encode categorical features as an integer array.

First of all, we split the dataset into input and output (target) parts:


In [ ]:
X = df[['Volume', 'ADOSC', 'NATR', 'TRANGE', 'Volume_binned', 'ADOSC_binned', 'NATR_binned', 'TRANGE_binned']]  #input columns
y = df['Price_binned']    #target column 

### Encoding and Normalization


Then create a list of the categorical fields and transform their values into integer arrays:


In [ ]:
col_cat = ['Volume_binned', 'ADOSC_binned', 'NATR_binned', 'TRANGE_binned']
oe = OrdinalEncoder()
oe.fit(X[col_cat])
X_cat_enc = oe.transform(X[col_cat])

In [ ]:
X_cat_enc

Then we must transform the arrays back into a DataFrame:


In [ ]:
X_cat_enc = pd.DataFrame(X_cat_enc)
X_cat_enc.columns = col_cat
X_cat_enc
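
Note that by default `OrdinalEncoder` sorts the categories alphabetically, so the codes do not follow the natural *Low < Medium < High* order. The lab keeps this default. If the order matters for a model, it can be passed explicitly through the `categories` parameter. A minimal sketch on a toy column (the values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (hypothetical values) with the same three levels used in the lab
toy = pd.DataFrame({"Volume_binned": ["Low", "High", "Medium", "Low"]})

# Default behaviour: categories are sorted alphabetically (High, Low, Medium)
default_codes = OrdinalEncoder().fit_transform(toy)

# Explicit order: one list of levels per encoded column
ordered_enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = ordered_enc.fit_transform(toy)

print(default_codes.ravel().tolist())  # [1.0, 0.0, 2.0, 1.0]
print(ordered_codes.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Tree-based models are largely insensitive to this choice, while linear models such as logistic regression may treat the codes as a numeric scale.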

Numerical fields can have different scales and can contain negative values. This can cause rounding errors and exceptions in some methods (for example, the chi-squared test used later requires non-negative inputs). To avoid this, these features must be normalized.

Let's create a list of the numerical fields and normalize them using **[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)**:


In [ ]:
col_num = ['Volume', 'ADOSC', 'NATR', 'TRANGE']

scaler = MinMaxScaler(feature_range=(0, 1))
X_num_enc = scaler.fit_transform(X[col_num])

In [ ]:
X_num_enc

As in the previous case, transform the resulting arrays back into a DataFrame:


In [ ]:
X_num_enc = pd.DataFrame(X_num_enc)
X_num_enc.columns = col_num
X_num_enc
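
With `feature_range=(0, 1)`, `MinMaxScaler` rescales each column as x' = (x - min) / (max - min). The sketch below checks this on a toy column (hypothetical numbers) by comparing a manual calculation with the scikit-learn result:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column with negative and positive values (hypothetical numbers)
values = np.array([[-10.0], [0.0], [10.0], [30.0]])

# Manual min-max scaling, computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation done by scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(values)

print(manual.ravel().tolist())      # [0.0, 0.25, 0.5, 1.0]
print(np.allclose(manual, scaled))  # True
```

After scaling, negative inputs such as ADOSC values become non-negative.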

Then we should concatenate these DataFrames into one input DataFrame:


In [ ]:
x_enc = pd.concat([X_cat_enc, X_num_enc], axis=1)
x_enc

We must apply the same transformation to the target field:


In [ ]:
le = LabelEncoder()
le.fit(y)
y_enc = le.transform(y)
y_enc = pd.Series(y_enc, name=y.name)

In [ ]:
y.to_frame()

In [ ]:
y_enc.unique()

As you can see, the value <em>'High'</em> was encoded as 0, <em>'Low'</em> as 1, and <em>'Medium'</em> as 2.
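
The mapping between labels and integer codes does not have to be read off by eye: a fitted `LabelEncoder` stores it in its `classes_` attribute, and `inverse_transform()` converts codes back to labels. A minimal sketch on a toy target (hypothetical values):

```python
from sklearn.preprocessing import LabelEncoder

# Toy target (hypothetical values) with the three price levels
labels = ["Medium", "Low", "High", "Low"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_[i] is the original label that was encoded as integer i
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)            # {'High': 0, 'Low': 1, 'Medium': 2}

# inverse_transform turns integer codes back into the original labels
restored = enc.inverse_transform(codes).tolist()
print(restored)           # ['Medium', 'Low', 'High', 'Low']
```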


## 3. Features Selection


As noted before, the input fields consist of 8 features. Of course, some of them are more significant for classification than others. 

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable:
* Chi-Squared Statistic.
* Mutual Information Statistic.

Let’s take a closer look at each in turn. To do this we can use **[SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)**.


### Chi-Squared Statistic


<em><strong>Pearson’s chi-squared statistical hypothesis test</strong></em> is an example of a test for independence between categorical variables.

You can learn more about this statistical test in the tutorial:
*   [A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01).

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.

The scikit-learn machine learning library provides an implementation of the chi-squared test in the **[chi2()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01#sklearn.feature_selection.chi2)** function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

For example, we can define the <em>SelectKBest class</em> to use the <code>chi2()</code> function and select all (or most significant) features, then transform the train and test sets.


Apply SelectKBest class to extract features:


In [ ]:
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(x_enc,y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  # same column order as the data passed to fit()

Concatenate the two DataFrames for better visualization:


In [ ]:
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
featureScores.sort_values(by=['Score'], ascending=False)
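
To build intuition for what the scores mean, here is a minimal sketch on synthetic data (not the lab dataset): one feature is an exact copy of the target and the other is random noise. The variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic data: 300 samples, 3 classes
y_toy = rng.integers(0, 3, size=300)
informative = y_toy.copy()              # perfectly tied to the target
noise = rng.integers(0, 3, size=300)    # unrelated to the target
X_toy = np.column_stack([informative, noise])

# chi2 requires non-negative feature values, which holds for encoded categories
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

print(selector.scores_)         # the first score is much larger than the second
print(selector.get_support())   # boolean mask of the selected feature(s)
```

A feature that is independent of the target gets a score close to zero, which is why low-scoring features are candidates for removal.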

### Mutual Information Statistic 


<em><strong>Mutual information</strong></em> from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

<em>Mutual information</em> is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

[You can learn more about mutual information in the following tutorial.](https://machinelearningmastery.com/information-gain-and-mutual-information?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the **[mutual_info_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01#sklearn.feature_selection.mutual_info_classif)** function.

Like <code>chi2()</code>, it can be used in the SelectKBest feature selection strategy (and other strategies).


In [ ]:
bestfeatures = SelectKBest(score_func=mutual_info_classif, k='all')
fit = bestfeatures.fit(x_enc,y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  # same column order as the data passed to fit()
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
featureScores.sort_values(by=['Score'], ascending=False)

As you can see, these two functions rank almost the same features as significant.


We can see that the categorical DataFrame columns have the most significant impact. Thus, let's use only them as inputs for the predictive model.  


In [ ]:
x_enc = x_enc[x_enc.columns[:4]]
x_enc
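
A note on `mutual_info_classif()`: with the default `discrete_features='auto'`, a dense input is treated as continuous and a small amount of random noise is added to it, so the scores can differ slightly between runs unless `random_state` is set. The sketch below, on synthetic data (not the lab dataset), declares the encoded features as discrete instead. This is only an illustration and not the setting used in the lab.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data: one feature copies the target, the other is noise
y_toy = rng.integers(0, 3, size=300)
informative = y_toy.copy()
noise = rng.integers(0, 3, size=300)
X_toy = np.column_stack([informative, noise])

# discrete_features=True: both columns hold category codes, not measurements
mi = mutual_info_classif(X_toy, y_toy, discrete_features=True, random_state=0)

print(mi)   # high score for the informative feature, near zero for the noise
```

To use such settings with `SelectKBest`, the scoring function can be wrapped, for example with `functools.partial`.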

### Feature Importance


You can get the importance of each feature of your DataFrame from the feature importance attribute of a fitted classification model.

<em>Feature importance</em> gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.

<em><strong>For example:</strong></em>

Feature importance is a built-in attribute of **[Tree Based Classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)**. We will use the Extra Trees Classifier to rank the features of the dataset (up to the top 10).


Let's create and fit the model:


In [ ]:
model = ExtraTreesClassifier()
model.fit(x_enc,y_enc)

Use the built-in <code>feature_importances_</code> attribute of tree-based classifiers:


In [ ]:
print(model.feature_importances_)

Let's transform it into a Series and plot the feature importances for better visualization:


In [ ]:
feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

You can see that for the <em>Extra Trees Classifier</em> the importance of the features is the same as in the previous cases. 


## 4. Classification models


### Train and Test DataSets Creation


First of all, we must split the data into train and test datasets in order to calculate the accuracy of the models. To do this we can use **[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)**. 

Let's split the data, assigning a <em>0.33</em> share of the rows to the test set (<em>train/test</em>):


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(x_enc, y_enc, test_size=0.33, shuffle=False, random_state=1)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)
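
Because the split uses `shuffle=False`, the rows keep their original (chronological) order: the model is trained on the earlier part of the period and tested on the later part. With `shuffle=False` the `random_state` argument has no effect. A minimal sketch, using a range of integers as a stand-in for consecutive minutes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for 100 consecutive minutes of data (0 = oldest, 99 = newest)
minutes = np.arange(100)

train, test = train_test_split(minutes, test_size=0.33, shuffle=False)

print(len(train), len(test))   # 67 33
print(train[-1], test[0])      # 66 67
```

Keeping the order avoids training on observations that come after the ones being predicted, which matters for time series data.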

### Extra Trees Classifier


#### What is an extra tree classifier?

<strong><em>Extra trees</em></strong> (short for <em>extremely randomized trees</em>) is an ensemble supervised machine learning method that fits a number of randomized decision trees and averages their predictions to improve predictive accuracy and control overfitting.

Let's create an ExtraTreesClassifier, fit it on our train dataset, and calculate the classification accuracy:


In [ ]:
model = ExtraTreesClassifier()
model.fit(X_train, y_train)

Apply the model to the test data to obtain predictions:


In [ ]:
yhat = model.predict(X_test)
print(yhat)

Evaluate accuracy: 


In [ ]:
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
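
An accuracy value is easier to interpret next to a simple baseline. The sketch below uses scikit-learn's `DummyClassifier`, which is not part of this lab and is brought in here only for illustration, on synthetic imbalanced labels (not the lab data). It always predicts the most frequent training class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, imbalanced labels: class 0 dominates
y_tr = np.array([0] * 60 + [1] * 25 + [2] * 15)
y_te = np.array([0] * 30 + [1] * 12 + [2] * 8)
X_tr = np.zeros((len(y_tr), 1))   # features are ignored by the dummy model
X_te = np.zeros((len(y_te), 1))

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

print('Baseline accuracy: %.2f' % (baseline_acc * 100))  # Baseline accuracy: 60.00
```

A classifier is only useful if it clearly beats such a baseline on the same test set.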

There are many different techniques for scoring features and selecting features based on scores. <em>How do you know which one to use?</em>

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.


### Logistic Regression


<em><strong>Logistic regression</strong></em> is a good model for testing feature selection methods, as it can perform better if irrelevant features are removed from the model. We will use this model in exactly the same way as the previous one.


In [ ]:
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

<details><summary>Click <b>here</b> for the solution</summary> 
<code>    
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
    </code>
</details>


As you can see, the accuracy did not improve.


## 5. Decision Tree


The biggest drawback of the previous methods is the inability to visualize or justify the decision.


<strong><em>Decision trees</em></strong> are a popular supervised learning method for a variety of reasons. 

Benefits of decision trees include that <em>they can be used for both regression and classification</em>, they don’t require feature scaling, and they are relatively easy to interpret as you can visualize decision trees. This is not only a powerful way to understand your model, but also to communicate how your model works. 

Consequently, it would help to know how to make a visualization based on your model.


<em>A [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)</em> is a supervised machine learning algorithm. It uses a binary tree graph (each internal node has two children) to assign a target value to each data sample. The target values are stored in the tree leaves. To reach a leaf, a sample is propagated through the nodes, starting at the root node. At each node, a decision is made about which child node the sample should go to. 

Each decision is based on the value of one selected feature of the sample. Decision Tree learning is a process of finding the optimal rules in each internal tree node according to the selected metric.
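
One such metric is Gini impurity, which is the default split criterion of scikit-learn's `DecisionTreeClassifier`. The sketch below shows how it scores candidate splits; `gini` and `split_impurity` are hypothetical helper names written for this illustration, not scikit-learn functions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class shares p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["Low", "Low", "High", "High"]
print(gini(parent))                                       # 0.5

# A split that separates the classes perfectly has zero impurity
print(split_impurity(["Low", "Low"], ["High", "High"]))   # 0.0

# A split that leaves both children mixed does not help at all
print(split_impurity(["Low", "High"], ["Low", "High"]))   # 0.5
```

At each node the tree picks the rule with the lowest weighted impurity among the candidates it considers.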


### Build model


This method also allows us to calculate feature importance.
Let's fit the model, calculate the importances, plot up to 10 of the most important features, and then visualize the decision tree.


In [ ]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100)) 

In [ ]:
print("Feature importances:", model.feature_importances_)

Plot the feature importances for better visualization:


In [ ]:
feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

### Visualization of Decision Tree


Let's visualize the decision tree.
There are several ways to do it:

*   Text visualization
*   Plot tree
*   Graph visualization


#### Text visualization


In [ ]:
text_representation = tree.export_text(model)
print(text_representation)

You can save it to a file:


In [ ]:
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)
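
By default `export_text()` labels the features as `feature_0`, `feature_1`, and so on. Passing `feature_names` makes the rules easier to read. A minimal sketch on a tiny toy problem (hypothetical values, one encoded feature and two classes):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: the class is fully determined by one encoded feature
X_toy = [[0], [0], [2], [2]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# feature_names replaces the generic "feature_0" label in the output
rules = export_text(toy_tree, feature_names=["NATR_binned"])
print(rules)
```

For the lab's model, the list of column names of the training DataFrame could be passed in the same way.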

#### Plot tree 


You can plot the tree in two different ways:
1. **[`plot_tree()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX01KTEN2525-2023-01-01)** from <code>sklearn.tree</code>
2. <code>export_graphviz()</code> from <code>sklearn.tree</code>, rendered with the <em>graphviz</em> library


Rendering with <code>plot_tree()</code> is slow, so this cell can take some time to run: 


In [ ]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(model,
               feature_names = list(x_enc.columns), 
               class_names = list(le.classes_),  # labels in the same order as the encoded classes
               filled = True)

In [ ]:
fig.savefig('decision_tree.png')

Let's visualize our decision tree in graph form using the <em>graphviz</em> module:


In [ ]:
dot_data = tree.export_graphviz(model,
                                feature_names = list(x_enc.columns), 
                                class_names = list(le.classes_),  # labels in the same order as the encoded classes
                                filled=True)

Once it is created, you can draw the graph:


In [ ]:
graph = graphviz.Source(dot_data, format="png") 
graph

And render it to a file:


In [ ]:
graph.render("decision_tree_graphviz")

## Conclusion


In this lab we learned how to do preliminary data processing: in particular, how to change data types, normalize numeric data, and encode categorical data. We saw how to select features by different methods, how to build training and test datasets, how to work with different classifiers, and how to visualize a decision tree.

As a result, the accuracy of all three classification models did not exceed 52%. In the following courses, we will consider the effectiveness of a wider range of nonlinear models in economics and financial services.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #1:</b>

Create an ExtraTreesClassifier object called "model".
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

model = ExtraTreesClassifier()

<details><summary>Click <b>here</b> for the solution</summary> 
<code>model = ExtraTreesClassifier()</code>
</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #2:</b>

Create a user-defined function that calculates the accuracy of a given classifier model.
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

def model_ac(x_train, y_train, x_test, y_test, clf):
    # Fit the classifier passed as `clf` (not the global `model`)
    clf.fit(x_train, y_train)
    yhat = clf.predict(x_test)
    accuracy = accuracy_score(y_test, yhat)
    return accuracy

<details><summary>Click <b>here</b> for the solution</summary> 
<code>
def model_ac(x_train, y_train, x_test, y_test, clf):
    clf.fit(x_train, y_train)
    yhat = clf.predict(x_test)
    accuracy = accuracy_score(y_test, yhat)
    return accuracy
</code>
</details>


In [ ]:
print('Accuracy: %.2f' % (model_ac(X_train, y_train, X_test, y_test, model)*100))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #3:</b>

Create a user-defined function that calculates the feature importances of a given classifier model.
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

def model_imp(x_train, y_train, clf):
    # Fit the classifier passed as `clf` and label importances with the training columns
    clf.fit(x_train, y_train)
    feat_importances = pd.Series(clf.feature_importances_, index=x_train.columns)
    return feat_importances.sort_values(ascending=False)

<details><summary>Click <b>here</b> for the solution</summary> 
<code>    
def model_imp(x_train, y_train, clf):
    clf.fit(x_train, y_train)
    feat_importances = pd.Series(clf.feature_importances_, index=x_train.columns)
    return feat_importances.sort_values(ascending=False)
    </code>
</details>


In [ ]:
imp = model_imp(X_train, y_train, model)
print(imp)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #4:</b>

Build a plot that shows how the accuracy of a given model depends on the number of input features.
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(X_train[col], y_train, X_test[col], y_test, model))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.Series(ac)
ac.plot()

<details><summary>Click <b>here</b> for the solution</summary> 
<code>    
col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(X_train[col], y_train, X_test[col], y_test, model))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.Series(ac)
ac.plot()
    </code>
</details>


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/yaryna_beida?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0ZXSEN2580-2023-01-01">Yaryna Beida</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0ZXSEN2580-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0ZXSEN2580-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>


 
## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                            |
| ----------------- | ------- | ---------- | --------------------------------------------- |
|     2023-03-25    |   1.0   |Yaryna Beida| Lab created                                   |

<hr>

<h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
