![alt text](https://www.nlab.org.uk/wp-content/uploads/nlab.png)
# Implementing Orange Dataflows in Python

Remember, libraries just provide custom functions and custom data objects. Combining them in a workflow is the same as connecting the custom data processing functions that widgets provide in Orange.

An important difference between Orange and Python is that, while Orange passes data along links to *Widgets* in the background, in Python we explicitly pass the data (as a custom data object) between *functions*.

This has positives and negatives:
1. We have to know about the data format (-ve)
2. We can at any time get statistics / ask questions about our data (+ve)

In standard machine learning tasks (i.e. those covered by Orange) all data exists in a table-like (two-dimensional array) format, with rows representing data points and columns representing features. In a supervised setting, where we have labels, one of these columns will be the output feature.

There are a number of choices for the custom data object, two common ones:
1. **A two-dimensional (numpy) array**: The simplest implementation you can have, and also the fastest.
2. **A pandas Dataframe**: A custom object that encapsulates a two-dimensional array but also includes meta-data and methods to easily summarize and manipulate data for use within a standard machine learning task.

**Throughout this module we'll be using pandas Dataframes**. However, if you are interested in learning more about using numpy arrays please ask!
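
To make the two options concrete, here is a minimal sketch that holds the same tiny table first as a numpy array and then as a pandas Dataframe. The values and the column names (`feature_a`, `feature_b`, `label`) are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# A tiny made-up table: 3 data points (rows), 2 input features plus a label column.
array = np.array([[0.1, 2.0, 0],
                  [0.4, 1.5, 1],
                  [0.3, 3.2, 0]])

# The same table as a Dataframe: identical values, plus column names.
frame = pd.DataFrame(array, columns=['feature_a', 'feature_b', 'label'])

print(array.shape)                 # (rows, columns) of the raw array
print(frame.shape)                 # the Dataframe wraps a table of the same shape
print(frame.describe())            # summary statistics come built in with pandas
print(frame['feature_b'].mean())   # columns can be selected by name
```

With the array you have to remember which column index holds which feature; the Dataframe carries the names with it.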

Another important difference is that, unlike Orange, we do not select widgets (and access help regarding widgets) from a graphical interface. Rather, we must know (or google) which functions exist and read their documentation online.

## Having trouble?
1. In class? Ask us!
2. If it's basic Python it might be worth [brushing up on your understanding of that first.](https://snakify.org)
3. If it's with concepts surrounding numpy, sklearn or pandas, take a look at the material from FBA, which covered 99% of the concepts in here. This is primarily a refresh in a slightly different context.

# Your task?
### Most notebook cells are empty apart from comments. The comments indicate what should be there. Fill them in and run them.

# Demo Task: Predicting bankruptcy for companies

Task: Predict whether a company will go bankrupt in 2 years.

Data set from the paper: Zieba, M., Tomczak, S. K., & Tomczak, J. M. (2016). Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications.

Original data URL: http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

**In this practical, you should use the data (csv) directly accessible from this URL:**

https://drive.google.com/uc?export=download&id=1_UVpItT70ncyihqq1Axs3pxJqa8eay3X

This is a direct link to a csv file. You can use it directly in read_csv(...).

## The flow

![Screenshot%20from%202018-01-27%2015-02-50.png](https://www.nlab.org.uk/wp-content/uploads/Screenshot-from-2018-01-27-15-02-50.png)

## Replacing the *File* Widget
![file.png](https://www.nlab.org.uk/wp-content/uploads/file.png)

The equivalent of loading a file via the File widget in Orange is to call one of the csv loading functions available in Python.

A number of libraries provide these. As in FBA we will consider loading csv files via the pandas library.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

The *File* Widget provides four main functions:
1. Loading the data
2. Showing us a preview so we get an idea if we've loaded the correct headings
3. Guessing what data type each column is and enabling us to set these if they're wrong (in conjunction with the [data documentation](http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#)).
4. Providing us with information (in conjunction with the [data documentation](http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#)) as to whether we need to fix anything else with regard to the data source and/or alter parameters within the widget.

In Python the following steps are done interactively (i.e. equivalent to us setting up the flow and checking we have the right settings in Orange):
1. Loading the data
2. Showing us a preview so we get an idea if we've loaded the correct headings
3. Read the [data documentation](http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#) and adjust any load parameters to ensure the data is loaded correctly


Loading the data is done via the function [read_csv(...)](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). There are many parameters to handle the different variants of csv and to deal with the numerous errors that can occur in these kinds of files. Of note:
1. read_csv will attempt to infer the data types for each feature. Alternatively you can manually specify all types (via the parameter dtype, see FBA tutorial 5 or the function documentation).
2. If the file does not contain a header we need to set header=None; otherwise the first row of data is used as the column headings. If the file does contain a header it will likely be inferred correctly (if not, see the documentation).
3. If there is no header then one can specify the feature names during loading using the parameter names=['feature1','feature2',...]. Here we will not do this so we can show you how to add them after data load.
4. The delimiter is defined by either the parameter delimiter=',' or sep=',' (they do the same thing). Default is ','.
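
The effect of these parameters is easiest to see on a small example. The sketch below loads a made-up, headerless csv held in memory (so no download is needed); the column names `f1`, `f2` and `label` are hypothetical and not part of the bankruptcy data.

```python
import io
import pandas as pd

# Made-up CSV text with NO header row: two features and a 0/1 label.
csv_text = "0.5,1.2,0\n0.7,3.4,1\n0.9,5.6,0\n"

# Default behaviour: the first line is treated as the header, so one data row is lost.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is data and the columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: supply column names at load time (these names are hypothetical).
named = pd.read_csv(io.StringIO(csv_text), header=None, names=['f1', 'f2', 'label'])

# sep=';' handles a file that uses a different delimiter.
semicolon = pd.read_csv(io.StringIO(csv_text.replace(',', ';')), sep=';', header=None)

print(with_default.shape)    # one row fewer than the file contains
print(no_header.shape)       # all rows kept
print(list(named.columns))
print(semicolon.shape)
```

If the column headings in a preview look like data values, that is usually a sign the file has no header and header=None is needed.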


In [2]:
import pandas as pd

# 1. Load the data
df = pd.read_csv('https://drive.google.com/uc?export=download&id=1_UVpItT70ncyihqq1Axs3pxJqa8eay3X')

In [3]:
# 2. Show a preview (first 10 lines)
df.head(10)

Unnamed: 0,0.15929,0.4624,0.07773,1.1683,-44.853,0.46702,0.18948,0.82895,1.1223,0.3833,...,0.10899,0.41557,0.89101,0.001422,7.7928,4.9914,119.81,3.0465,3.056,0
0,-0.12743,0.46243,0.26917,1.7517,7.597,0.000925,-0.12743,1.1625,1.2944,0.53757,...,-0.089372,-0.23704,1.0625,0.15041,5.4327,3.4629,100.97,3.615,3.4725,0
1,0.070488,0.2357,0.52781,3.2393,125.68,0.16367,0.086895,2.8718,1.0574,0.67689,...,0.054286,0.10413,0.94571,0.0,7.107,3.3808,76.076,4.7978,4.7818,0
2,0.13676,0.40538,0.31543,1.8705,19.115,0.50497,0.13676,1.4539,1.1144,0.58938,...,0.10263,0.23203,0.89737,0.073024,6.1384,4.2241,88.299,4.1337,4.6484,0
3,-0.11008,0.69793,0.18878,1.2713,-15.344,0.0,-0.11008,0.43282,1.735,0.30207,...,0.43988,-0.3644,0.57153,0.0,18.801,2.7925,146.39,2.4934,15.036,0
4,0.021539,0.58425,0.086614,1.1791,-36.394,-0.001609,0.029628,0.71161,1.4388,0.41575,...,0.2196,0.051807,0.80128,0.12508,8.7603,3.8576,122.7,2.9746,3.3482,0
5,0.22743,0.52266,0.44456,1.87,-8.6787,0.0,0.283,0.91328,1.9811,0.47734,...,0.1611,0.47646,0.85765,0.024511,4.1654,5.2485,94.141,3.8772,44.539,0
6,0.038662,0.59498,0.070504,1.1191,-37.64,-0.52978,0.038662,0.68074,3.0861,0.40502,...,0.27059,0.095456,0.72991,0.0,11.085,8.4593,70.003,5.2141,9.1408,0
7,0.13103,0.47202,0.4935,2.1374,31.876,0.37472,0.16378,1.1185,1.0729,0.52798,...,0.067952,0.24817,0.93205,0.072213,7.5119,4.4377,69.488,5.2527,31.392,0
8,0.17698,0.19359,0.13925,3.7779,124.1,0.33845,0.21281,4.1656,1.2128,0.80641,...,0.17547,0.21946,0.82453,0.1779,9.2352,2.4957,51.133,7.1382,0.44144,0
9,0.11767,0.37332,0.26743,2.3229,18.308,0.14871,0.14571,1.6787,1.1986,0.62668,...,0.16567,0.18777,0.83433,0.27314,4.778,5.4098,84.179,4.336,1.6525,0


We now need to know if pandas has correctly identified the feature data types. Look at the [data documentation](http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#). What are they meant to be?

The datatype of each column is recorded in the attribute *dtypes* of a pandas dataframe. Check them in the cell below (don't know how? Remember attributes of objects are accessed via the dot (.) operator. Still don't know? This was covered in FBA tutorial 5... or you could ask).


In [4]:
# Check the data types
df.dtypes

0.15929    float64
0.4624     float64
0.07773    float64
1.1683     float64
-44.853    float64
            ...   
4.9914     float64
119.81     float64
3.0465     float64
3.056      float64
0            int64
Length: 65, dtype: object

If a feature has a dtype of **object** then it has been interpreted as a String. If you are not happy to just accept this is the case please ask why in class for the longer explanation. No features have been interpreted that way here.

If a feature has a dtype of int64 or float64 then it should be considered to represent a continuous feature.

Checking our documentation, these datatypes are correct for everything except our output feature. At this point we could manually specify all the datatypes in the read_csv function, but instead we'll deal with the output feature when we look at its category labels in the step below.
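
Should you want to change a column's datatype after loading, one possible approach is the `astype` method. This is only a sketch on a made-up two-column dataframe (`toy`), not a step this tutorial requires.

```python
import pandas as pd

# Made-up stand-in for the real data: one continuous feature and a 0/1 output column.
toy = pd.DataFrame({'x1': [0.15, -0.12, 0.07], 'status': [0, 0, 1]})
print(toy.dtypes)   # x1 is float64, status is int64

# Convert the output column to a categorical type after loading.
toy['status'] = toy['status'].astype('category')
print(toy.dtypes)                      # status is now category
print(toy['status'].cat.categories)    # the distinct category values
```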

## Replacing the *Edit Domain* Widget

![EditDomain.png](https://www.nlab.org.uk/wp-content/uploads/EditDomain.png)

The *Edit Domain* Widget has two main functions:

1. Updating the column names
2. Updating/providing categorical values within data points with meaningful labels

In a pandas dataframe the column names are stored in the dataframe's *columns* attribute as a list-like object. We can simply replace it with a new list. Since the features are different financial measurements (given by formulas in the documentation), we will simply use the labels given: x1, ..., x64.

For the task below you will need a list of the column headings in the documentation. So you can copy/paste rather than type it here it is:
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64','status']
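
If you would rather not paste a long list, the same names can be generated. A minimal sketch, assuming the labels really are just x1 to x64 followed by status:

```python
# Build 'x1' ... 'x64' with a list comprehension, then append the output column name.
new_columns = ['x{}'.format(i) for i in range(1, 65)] + ['status']

print(len(new_columns))                    # 64 features plus the output column
print(new_columns[:3], new_columns[-2:])   # peek at both ends of the list
```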


In [5]:
# print what the column headings currently are
df.columns

Index(['0.15929', '0.4624', '0.07773', '1.1683', '-44.853', '0.46702',
       '0.18948', '0.82895', '1.1223', '0.3833', '0.18948.1', '0.41025',
       '0.15548', '0.18948.2', '771.49', '0.47311', '2.1626', '0.18948.3',
       '0.13466', '46.838', '1.0346', '0.18082', '0.11321', '0.57607',
       '0.3833.1', '0.40783', '1.4423', '0.16882', '6.0662', '0.30915',
       '0.13466.1', '134.47', '2.7144', '0.39104', '0.18082.1', '1.4771',
       '658.7', '0.38385', '0.12851', '0.16702', '0.072354', '0.12851.1',
       '119.96', '73.126', '0.88223', '0.77736', '52.568', '0.15153',
       '0.10769', '1.1669', '0.46185', '0.3684', '0.83251', '0.8337',
       '90533.0', '0.10899', '0.41557', '0.89101', '0.001422', '7.7928',
       '4.9914', '119.81', '3.0465', '3.056', '0'],
      dtype='object')

In [6]:
# column names from documentation
new_columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64','status']

# update the column headings based on our documentation
df.columns = new_columns

# print the column headings in the dataframe again to check they updated
df.columns


Index(['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11',
       'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21',
       'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31',
       'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41',
       'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51',
       'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61',
       'x62', 'x63', 'x64', 'status'],
      dtype='object')

Since the output feature is already 0 for *operating* and 1 for *bankrupt* we'll leave it as is. This is because, when considering two-class problems and evaluation measures, sklearn assumes output features to be labeled with 1 for the target class and 0 for the non-target class.

While this "assumption" may seem a little strange it simplifies the implementation (makes it faster) which is important when dealing with very large datasets. Moreover, as we'll see next week sklearn has an extensive *preprocessing* set of libraries which makes transforming the data to fit this assumption really easy.
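
To see what this assumption means in practice, the sketch below computes recall on four made-up predictions. By default sklearn's `recall_score` treats the label 1 as the target class; its `pos_label` parameter lets you pick a different one.

```python
from sklearn.metrics import recall_score

# Made-up true labels and predictions for four data points.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Default: class 1 is the target class (pos_label=1).
recall_for_1 = recall_score(y_true, y_pred)

# Explicitly treat class 0 as the target class instead.
recall_for_0 = recall_score(y_true, y_pred, pos_label=0)

print(recall_for_1)   # one of the two actual 1s was found
print(recall_for_0)   # both actual 0s were found
```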

## Replacing the *Select Columns* Widget

![file.png](https://www.nlab.org.uk/wp-content/uploads/selectColumns.png)

The *Select Columns* Widget has three main functions:

1. Select the output feature (target variable)
2. Define the input features
3. Remove any features we do not want to include

When using pandas dataframes we select the output feature by its column name.
To create the input features we could select every input column by name (see FBA tutorial 6), but normally we just want to remove the output feature. In this case we can use the dataframe method .drop(). Note that drop() returns a copy and does not modify the original dataframe. See the [documentation for drop()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html).
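
The copy behaviour of drop() is worth checking once for yourself. A minimal sketch on a made-up dataframe (`toy`) with two features and a status column:

```python
import pandas as pd

toy = pd.DataFrame({'x1': [0.1, 0.2], 'x2': [1.0, 2.0], 'status': [0, 1]})

y_toy = toy['status']                 # select the output feature by column name
X_toy = toy.drop('status', axis=1)    # a new dataframe without the output column

print(list(X_toy.columns))   # the output column is gone from the copy
print(list(toy.columns))     # the original dataframe still has all three columns
```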

In [10]:
y = df['status']
df.filter(items=['status'])

Unnamed: 0,status
0,0
1,0
2,0
3,0
4,0
...,...
9786,1
9787,1
9788,1
9789,1


In [13]:
# create the output feature dataframe by selecting the output feature from the whole dataset
y = df['status']

# create the input feature dataframe by dropping the output feature column
# (the axis parameter denotes whether to drop the row, axis = 0, or column, axis = 1).
X = df.drop('status', axis=1)
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (9791, 64)
y shape: (9791,)


In [14]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

LinearRegression()


## Replacing the *Random Forest* and *Logistic Regression* Widget (or any model)

![RF.png](https://www.nlab.org.uk/wp-content/uploads/RF.png)![lgregress.png](https://www.nlab.org.uk/wp-content/uploads/lgregress.png)

The Random Forest Widget in this Orange flow is only declared (i.e. its parameters provided and fixed) at this stage, with training and testing being undertaken later. The same is true for Python. Instead of a Widget we create a custom object representing an untrained Random Forest Classifier (we have a classification problem).

The same is true for most models, including Logistic Regression.

[Documentation for the Random Forest Classifier with all possible parameters documented.](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

[Documentation of the Logistic Regression Classifier with all possible parameters documented.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In general the default parameters are OK, although normally you would alter these. However, we will be looking into this (parameter tuning) in more depth later in the course so we'll leave them as is for now.
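
As a small illustration of "declared but not trained", the sketch below creates a classifier, reads back its parameters and checks that nothing has been learned yet. The value `n_estimators=50` is an arbitrary choice for the example, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier

# Declaring the model only fixes its parameters; nothing is learned yet.
forest = RandomForestClassifier(n_estimators=50, random_state=0)

params = forest.get_params()
print(params['n_estimators'])           # the value we set
print(params['max_depth'])              # a default we left alone
print(hasattr(forest, 'estimators_'))   # the fitted trees only appear after .fit()
```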

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Set up the Random Forest Classifier
rf_model = RandomForestClassifier(
    random_state=42
)

# Set up the Logistic Regression Classifier
log_reg_model = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=5000))
])


In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
rf_model.fit(X_train, y_train)
log_reg_model.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('logreg', LogisticRegression(max_iter=5000))])


## Replacing the *Test & Score* Widget

![testandscore.png](https://www.nlab.org.uk/wp-content/uploads/testandscore.png)

The Test & Score Widget coordinates the repeated splitting of the data into training and test sets based on a resampling strategy. It then averages the results of training and testing the model on each split, given a measure of success. Typically this strategy will be either repeated random sampling or cross-validation.

The Test & Score Widget required you to specify:
1. The resampling strategy
2. The evaluation measure
3. If the evaluation measure is a binary success measure you need to specify the *target class*

In Python the coordination of the repeated training/testing by a given resampling strategy is done by the cross_val_score function provided by sklearn. While it is possible to implement repeated random sampling it is not that straightforward, so we will use 10-fold cross-validation, where 10 is a parameter you set when calling the cross_val_score function.

[All options are detailed in the full documentation for the cross_val_score function.](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
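
Should you want to experiment with repeated random sampling anyway, one possible route is sklearn's `ShuffleSplit` splitter passed as the `cv` argument. This is a sketch on synthetic data from `make_classification`, not part of this tutorial's flow; the split count and test size are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic two-class data standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 5 independent random splits, each holding out 30% of the rows for testing.
random_splits = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

shuffle_scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                                 scoring='accuracy', cv=random_splits)
print(len(shuffle_scores), shuffle_scores.mean())   # one score per random split
```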

Of note:
1. The function takes exactly one classifier. You must call the function once per classifier.
2. The function returns one value per fold, representing the result of the classifier (as defined by the evaluation measure) after each round of training and testing.
3. The parameter cv=k denotes k-fold cross-validation (stratified for classifiers) where k is an integer (e.g. 10). However, since we want to compare classifiers and we want the same random sets to be used each time, we will first partition the data via a KFold strategy and then reuse that each time. See FBA Tutorial 6 for more information.
4. The evaluation measure is specified by the parameter *scoring*, see [the documentation for a full list of possibilities.](http://scikit-learn.org/stable/modules/model_evaluation.html)
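
The points above can be sketched on synthetic data. The example below reuses one `KFold` object for two classifiers so both are scored on identical splits; the data comes from `make_classification` and the decision tree is only a stand-in second model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# One splitter, fixed random_state: every call produces the same 5 folds.
folds_demo = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in [('logistic', LogisticRegression(max_iter=1000)),
                    ('tree', DecisionTreeClassifier(random_state=0))]:
    # One call per classifier; each call returns one score per fold.
    results[name] = cross_val_score(model, X_demo, y_demo,
                                    scoring='accuracy', cv=folds_demo)

for name in sorted(results):
    print(name, len(results[name]), np.mean(results[name]))
```

Because the splitter has a fixed random_state, any difference between the two averages comes from the models rather than from the luck of the split.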

In [None]:
from sklearn.model_selection import cross_val_score, KFold

# Create the training and test splits (all k = 10 of them) for reuse for evaluating each classifier
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# Define a target class
# Do nothing, since the output feature is already encoded as sklearn expects
# with 1 for target class and 0 for non-target class.
# Next week we'll look at how to use the pre-processing framework to define target classes

# Evaluate the Logistic Regression Classifier
# (log_reg_model, rf_model, X and y were defined in the cells above)
scores = {}
scores['Logistic Regression, CA'] = cross_val_score(log_reg_model, X, y, scoring = 'accuracy', cv=folds)
scores['Logistic Regression, Precision'] = cross_val_score(log_reg_model, X, y, scoring = 'precision', cv=folds)
scores['Logistic Regression, Recall'] = cross_val_score(log_reg_model, X, y, scoring = 'recall', cv=folds)

# Evaluate the Random Forest Classifier
scores['Random Forest, CA'] = cross_val_score(rf_model, X, y, scoring = 'accuracy', cv=folds)
scores['Random Forest, Precision'] = cross_val_score(rf_model, X, y, scoring = 'precision', cv=folds)
scores['Random Forest, Recall'] = cross_val_score(rf_model, X, y, scoring = 'recall', cv=folds)

In [None]:
import numpy as np

# Print the results
for k in sorted(scores):
    print('{0:31}: {1:5.2f}%'.format(k, np.mean(scores[k])*100) )


And we're done!

Did you understand the last block of code? There might have been some things you haven't seen before. Let me break the code block down.

**for k in sorted(scores):**
Recall that a for loop assigns a value to k for each item in the iterable thing after the *in* keyword. It then performs the code block directly underneath it for each value of k. In this case the *thing* is a dictionary called *scores*. By default when we ask to iterate over a dictionary we iterate over its keys (i.e. 'Logistic Regression, CA', 'Logistic Regression, Precision', ...). A dictionary hands back its keys in the order they were added (guaranteed since Python 3.7; in older versions the order was arbitrary), which is not necessarily the order we want to report. Since we want to output our results in a fixed order (by classifier then evaluation measure) we order them using the default, alphabetical, ordering of the sorted function.

*In summary:* we call the sorted function (which sorts a list-like object), passing it the dictionary, which provides its keys to the function. The function sorts these and returns them as a list, which the for loop iterates over.
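
A quick way to convince yourself is to try it on a small dictionary. The keys below mirror the ones used in this notebook, but the values are made up.

```python
# Keys added in a deliberately jumbled order; the values are made-up scores.
scores_demo = {}
scores_demo['Random Forest, CA'] = [0.9, 0.8]
scores_demo['Logistic Regression, Recall'] = [0.1, 0.3]
scores_demo['Logistic Regression, CA'] = [0.7, 0.9]

insertion_order = list(scores_demo)   # iterating a dictionary yields its keys
sorted_order = sorted(scores_demo)    # sorted() returns a new, alphabetically ordered list

print(insertion_order)
print(sorted_order)
```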

**print('{0:31}: {1:5.2f}%'.format(k, np.mean(scores[k])*100) )**:
This is simply a print statement with what is known as String Formatters.

{} act as placeholders. Within the {} there are two parameters separated by a colon (:).
1. The left of the colon denotes the argument number in the .format() function. See print formatting example 1 below.
2. The right of the colon contains a parameter that denotes the formatting. Using this we can specify a number of things including padding, rounding and data type conversion. In the code above .2f tells the computer to round the float value (which it must be) to two decimal places. The number 5 before the decimal point and the number 31 in the other {} denote the padding width. [Read about String Formatters in Python 3](https://www.digitalocean.com/community/tutorials/how-to-use-string-formatters-in-python-3).

The remaining piece is *np.mean(scores[k])*: scores[k] asks the dictionary for the results stored under key k. These are the results from the *cross_val_score* function. Remember this function returns an array of performance results (as defined by the evaluation measure provided) with one result per fold. Since we did 10-fold cross-validation there are 10 results. To get a final score we want to average these. To do this we use the function *mean* from the numpy package (np is just an alias we defined for numpy in the import). *numpy.mean* takes a list-like object of numbers and returns the mean.

In [None]:
# Print formatting example 1
print('{0:}, {1:}, {0:}'.format('pos0','pos1'))

pos0, pos1, pos0
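
A second formatting sketch, this time showing the padding and rounding used in the results loop. The label and value are made up for illustration.

```python
label = 'Random Forest, CA'
value = 0.954231   # a made-up score between 0 and 1

# The same format string as in the results loop above.
line = '{0:31}: {1:5.2f}%'.format(label, value * 100)
print(line)

# Width 31 pads the text with spaces on the right so the colons line up.
padded = '{0:31}'.format(label)
print(repr(padded), len(padded))

# Width 5 with two decimals pads short numbers with spaces on the left.
short = '{0:5.2f}'.format(1.6)
print(repr(short))
```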


## Our stand-alone script
Now that we're done, we have our full stand-alone script to compare to our Orange flow:
![Screenshot%20from%202018-01-27%2015-02-50.png](https://www.nlab.org.uk/wp-content/uploads/Screenshot-from-2018-01-27-15-02-50.png)



In [None]:
import pandas
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler # optional, not part of this tutorial
from sklearn.model_selection import cross_val_score, KFold

# Replacing the File Widget
data = pandas.read_csv('https://drive.google.com/uc?export=download&id=1_UVpItT70ncyihqq1Axs3pxJqa8eay3X', header = None)

# Replacing the Edit Domain Widget
data.columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64','status']

# Replacing the Select Columns Widget
input_features = data.drop('status',axis = 1)

input_features = StandardScaler().fit_transform(input_features) # Optional - not part of this tutorial, see demo discussion and a future week's lecture

output_feature = data.status

# Replacing the Random Forest Widget
rf = RandomForestClassifier()

# Replacing the Logistic regression Widget
lg = LogisticRegression()

# Replacing the Test & Score Widget
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = {}
scores['Logistic Regression, CA'] = cross_val_score(lg, input_features, output_feature, scoring = 'accuracy', cv=folds)
scores['Logistic Regression, Precision'] = cross_val_score(lg, input_features, output_feature, scoring = 'precision', cv=folds)
scores['Logistic Regression, Recall'] = cross_val_score(lg, input_features, output_feature, scoring = 'recall', cv=folds)
scores['Random Forest, CA'] = cross_val_score(rf, input_features, output_feature, scoring = 'accuracy', cv=folds)
scores['Random Forest, Precision'] = cross_val_score(rf, input_features, output_feature, scoring = 'precision', cv=folds)
scores['Random Forest, Recall'] = cross_val_score(rf, input_features, output_feature, scoring = 'recall', cv=folds)

# Print the results
for k in sorted(scores):
    print('{0:31}: {1:5.2f}%'.format(k, np.mean(scores[k])*100) )

Logistic Regression, CA        : 94.51%
Logistic Regression, Precision : 19.26%
Logistic Regression, Recall    :  1.61%
Random Forest, CA              : 95.42%
Random Forest, Precision       : 76.77%
Random Forest, Recall          : 18.64%
