# Making a model through datascribe

This example uses the Emergency Department data and the Scribe object that were preprocessed previously. We then define a logistic regression model and fit it.

## Imports

In [66]:
import os
import sys

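# Notebooks do not define __file__, so the string "__file__" is resolved
# against the current working directory to get a path inside it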
current_script_path = os.path.abspath("__file__")

# Deduce the root folder of the project
project_root = os.path.dirname(os.path.dirname(os.path.dirname(current_script_path)))

# Add the project root to the Python path
sys.path.append(project_root)

In [67]:
from datascribe.datasets import load_ed_scribe_processed
import pandas as pd

Load example data frame and Scribe object which contains the preprocessed dataset.

In [68]:
df, s = load_ed_scribe_processed()

## k_fold

#### *(Scribe.model.k_fold(n_splits=5))*

Because we want to use cross-validation, we first define a k-fold splitter with `n_splits` folds. It uses the `StratifiedKFold` class, so each fold keeps roughly the same class proportions as the full dataset.

In [69]:
n_splits = 10
s.model.k_fold(n_splits=n_splits)
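To illustrate what stratified folds look like, here is a minimal sketch that calls scikit-learn's `StratifiedKFold` directly on toy data. The arrays `X_toy` and `y_toy` are made up for illustration, and how `k_fold` configures the splitter internally (shuffling, seed) is an assumption not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical, not the ED dataset): 20 rows, 25% positive class.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 5 + [0] * 15)

# shuffle and random_state are illustrative choices, not datascribe defaults.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count the positives that land in each validation fold.
fold_positive_counts = []
fold_sizes = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    fold_positive_counts.append(int(y_toy[val_idx].sum()))
    fold_sizes.append(len(val_idx))

print(fold_positive_counts, fold_sizes)
```

Every validation fold receives the same share of positive cases, which is the property that matters when the outcome is imbalanced.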

## Split the datasets

#### *(Scribe.model.split_dataset(\*arrays, test_size=0.25, stratify=None))*

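Separate the target column (`y`) from the input features (`X`), then split both into training and test sets. Passing `stratify=y` keeps the proportion of admitted patients similar in both sets.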

In [70]:
X = df.drop('Admitted_Flag', axis=1)
y = df['Admitted_Flag']
test_size = 0.3
s.model.split_dataset(X, y, test_size=test_size, stratify=y)
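The effect of `stratify` can be checked on toy data. The sketch below uses scikit-learn's `train_test_split`, on the assumption that `split_dataset` behaves similarly; the toy arrays and the `random_state` value are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows with a 30% positive rate.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# With stratification, both partitions keep roughly the original positive rate.
train_rate = float(y_tr.mean())
test_rate = float(y_te.mean())
print(len(y_tr), len(y_te), train_rate, test_rate)
```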

## Standardise the independent variables

#### *(Scribe.preprocessing.standardise_data(X_train, X_test))*

You can standardise the input values using `standardise_data` from `preprocessing`. The standardised frames are then written back to `splitted_data` so that later steps use them.

In [71]:
X_train, X_test = s.preprocessing.standardise_data(
    s.model.splitted_data['X_train'],
    s.model.splitted_data['X_test'],
)

s.model.splitted_data['X_train'] = X_train
s.model.splitted_data['X_test'] = X_test
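As a rough illustration of the idea behind standardisation, the sketch below computes the mean and standard deviation on the training values only and reuses them for the test values, so no information from the test set leaks into training. This is standard-library Python on a made-up column; the helper names are hypothetical, and whether `standardise_data` works exactly this way is an assumption.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)

def apply_scaler(column, mu, sigma):
    """Centre and scale values using statistics learned from the training data."""
    return [(value - mu) / sigma for value in column]

# Hypothetical single feature, split into train and test values.
train_values = [20.0, 30.0, 40.0, 50.0]
test_values = [35.0, 60.0]

mu, sigma = fit_scaler(train_values)                     # learned from train only
train_scaled = apply_scaler(train_values, mu, sigma)
test_scaled = apply_scaler(test_values, mu, sigma)       # reuses train statistics

print(train_scaled, test_scaled)
```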

## Creating parameters and creating the regression model

#### *(Scribe.model.regression_model(params=params))*

Next, the regression model is defined together with a grid of candidate parameter values. `GridSearchCV` searches this grid during training to find the best-performing combination for the logistic regression model.

A full list of parameters can be found in the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) documentation.

In [72]:
params = {
    'random_state': [42],
    'solver': ['lbfgs', 'sag']
}
s.model.regression_model(params=params)
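For readers who want to see the underlying mechanism, a grid search over the same parameter grid can be sketched with scikit-learn directly. This is an assumption about what `regression_model` and `fit` do together, not datascribe's actual code; the synthetic data and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary classification data (hypothetical stand-in for the ED data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

# Same grid as in the notebook: each key is a LogisticRegression parameter,
# each value is the list of candidates to try.
param_grid = {'random_state': [42], 'solver': ['lbfgs', 'sag']}

# Stratified 10-fold cross-validation scores every candidate combination.
cv = StratifiedKFold(n_splits=10)
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
search.fit(X_toy, y_toy)

best = search.best_params_
print(best)
```

After fitting, `best_params_` holds the combination with the highest mean cross-validated score, and the search object itself predicts with the refitted best estimator.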

To check that a model has been created, you can use the `check_model_exists` method.

In [73]:
s.model.check_model_exists()

True

## Fitting the model

#### *(Scribe.model.fit())*

At this stage, the model is trained on the `X_train` and `y_train` data.

In [74]:
s.model.fit()

## Test the model

#### *(Scribe.model.predict())*

Now, we can test the trained model and predict using `X_test` data.

In [75]:
y_predict = pd.DataFrame(s.model.predict())

The parameter combination that gave the best cross-validated performance is available in `best_params_`.

In [76]:
s.model.model.best_params_

{'random_state': 42, 'solver': 'sag'}

In [77]:
s.model.splitted_data['y_test']

49208     0
13367     0
92594     0
100965    1
41704     0
         ..
80172     1
52220     0
34116     1
95913     1
81291     0
Name: Admitted_Flag, Length: 31831, dtype: Int8

## Viewing the text summary for the model

#### *(Scribe.model.model_commentary())*

To view the write-up for this model, use the `model.model_commentary()` method. This is a preview of the text that will be written to the final output file.

In [78]:
s.model.model_commentary()

'A logistic regression model was used with stratified k fold (k=10) and GridCV search used for feature selection.  The train/test split was 70%/30%.'