## Preprocessing your dataset and keeping a log with `datascribe`

For this example, we will load the example Emergency Department dataset, create a Scribe instance, and then preprocess the data via the Scribe instance so that it is ready for a logistic regression model.

It will cover the methods for logging the following:

* Imputing missing values
* Scaling categorical data
* Dummy coding categorical fields

In [31]:
import sys
import os

# Get the path of the current script
current_script_path = os.path.abspath("__file__")

# Deduce the root folder of the project
project_root = os.path.dirname(os.path.dirname(os.path.dirname(current_script_path)))

# Add the project root to the Python path
sys.path.append(project_root)

### `datascribe` imports

In [32]:
# load to access example dataset
from datascribe.datasets import load_ed_example
# import Scribe object from datascribe package
from datascribe.scribe import Scribe

First, we will load the Emergency Department (ED) Dataset example, taken from the NHS A&E Synthetic dataset available from [NHS England](https://nhsengland-direct-uploads.s3-eu-west-1.amazonaws.com/A%26E+Synthetic+Data.7z?versionId=null).  A little further information is available in the helper function.

In [33]:
# load ED attendances example
df = load_ed_example()

Create a Scribe instance, specifying the output folder location.

In [34]:
# specify folder directory
file_loc = 'output'

# define what the rows are called (singular, plural)
row_description = ("attendance", "attendances")

# create a Scribe object called s
s = Scribe(df, dir=file_loc, row_descriptor=row_description)

Directory 'output' already exists.


At this point, it is worth reviewing the data frame information using `.info()` from **pandas**.

For this example dataset, the data types were input when the csv file was read into the pandas Dataframe.  If you have not checked the data types of your dataset, it is best to check that the data types are accurate for the information you are using.

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106102 entries, 0 to 106101
Data columns (total 13 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   IMD_Decile_From_LSOA             106075 non-null  Int8          
 1   Age_Band                         106102 non-null  category      
 2   Sex                              106071 non-null  category      
 3   AE_Arrive_Date                   106102 non-null  datetime64[ns]
 4   AE_Arrive_HourOfDay              106071 non-null  category      
 5   AE_Time_Mins                     106102 non-null  Int32         
 6   AE_HRG                           106100 non-null  category      
 7   AE_Num_Diagnoses                 106102 non-null  Int32         
 8   AE_Num_Investigations            106102 non-null  Int32         
 9   AE_Num_Treatments                106102 non-null  Int32         
 10  AE_Arrival_Mode                  106102 non-

## Checking and managing null values in the dataset

Null values need to be managed for the model.  Use **pandas**' `.isna().sum()` to check which columns contain nulls.

In [36]:
df.isna().sum()

IMD_Decile_From_LSOA               27
Age_Band                            0
Sex                                31
AE_Arrive_Date                      0
AE_Arrive_HourOfDay                31
AE_Time_Mins                        0
AE_HRG                              2
AE_Num_Diagnoses                    0
AE_Num_Investigations               0
AE_Num_Treatments                   0
AE_Arrival_Mode                     0
Provider_Patient_Distance_Miles    27
Admitted_Flag                      27
dtype: int64

As we can see from the output, we have four columns with missing values.

There are several methods to filling missing values, which are available in the Scribe object's attribute `preprocessing`.  A few of them are demonstrated below.

### Imputing with the mean value
#### *(Scribe.preprocessing.imputing_numeric_mean(df, columns=None))*

`Provider_Patient_Distance_Miles` is a numerical field and advises us on how many miles the patient lives from the provider.  Here, we will use the mean value to imput missing values.

`imputing_numeric_mean` in the `preprocessing` attribute of `Scribe` allows you to either transform all numeric items to the mean value if no columns specified, or a list of columns if input as a parameter

In [37]:
cols_for_mean = 'Provider_Patient_Distance_Miles'
df = s.preprocessing.imputing_numeric_mean(df, cols_for_mean)

If you would like to check that the values were imputed, you can recheck the null values for the column.

In [38]:
# check all null values replaced
df[cols_for_mean].isna().sum()

0

You can view the columns which have been imputed with this method by checking the `imputed_mean_cols` attribute within `preprocessing`.

In [39]:
s.preprocessing.imputed_mean_cols

['Provider_Patient_Distance_Miles']

If you try to impute a column which has no null values, either because you have already immputed the values, or there were no missing values to begin with, you will receive a message to advise you.  However, if you try to impute a list of columns which contain some columns with missing values and some with no missing values, no message will appear but only the columns which were imputed will appear in `preprocessing.imputed_mean_cols`.

In [40]:
df = s.preprocessing.imputing_numeric_mean(df, cols_for_mean)

None of the columns were imputed with the mean as there were no suitable columns or null values.


### Imputing with the median value
#### *(Scribe.preprocessing.imputing_numeric_median(df, columns=None))*

`IMD_Decile_From_LSOA` is an integer data type and and indicates the deprivation decile of where the patient lives.  1 indicates the least deprived and 10 indicates the most deprived.

While it is numerical, we cannot use the mean value as you cannot be between items on the scale.  Therefore, we will replace the values with the median value.

Again, the method can process all numeric fields in the dataset or you can speficy which with the `columns` parameter.

In [41]:
cols_for_median = 'IMD_Decile_From_LSOA'
df = s.preprocessing.imputing_numeric_median(df, cols_for_median)

In [42]:
# check all null values replaced
df[cols_for_median].isna().sum()

0

### Imputing with the mode 
#### *(Scribe.preprocessing.imputing_numeric_mode(df, columns=None), Scribe.preprocessing.imputing_non_numeric_mode(df, columns=None))* 

For `Admitted_Flag`, we will use the mode to replace the missing values.  We will use the numeric method.

In [43]:
cols_for_mode_num = 'Admitted_Flag'
df = s.preprocessing.imputing_numeric_mode(df, cols_for_mode_num)

In [44]:
# check all null values replaced
df[cols_for_mode_num].isna().sum()

0

We will impute with the mode for the `AE_Arrive_HourOfDay` field.

In [45]:
cols_mode_cat = 'AE_Arrive_HourOfDay'
df = s.preprocessing.imputing_non_numeric_mode(df, cols_mode_cat)

There is a separate method for non-numeric fields, which mainly differs from the above if you decide to replace all numerical or non-numerical fields in one go.  However, it is best practice to tailor the impute method for the specific fields.

### Imputing with constant value
#### *(Scribe.preprocessing.imputing_non_numeric_constant(df, constant='missing', columns=None), Scribe.preprocessing.imputing_numeric_mode(df, constant=0, columns=None))*

Similar to mode, there are two methods, `imputing_numeric_constant` and `imputing_non_numeric_constant` for either replacing all the fields with that data type with a fixed value, or a selection of columns.  Here, we will impute the column `Sex` with the default value for the non-numeric method, which is **missing**.

In [46]:
col_constant = 'Sex'
df = s.preprocessing.imputing_non_numeric_constant(df, columns=col_constant)

In [47]:
# check all null values replaced
df[col_constant].isna().sum()

0

You can check the new value appears in the dataset by checking the column's unique values.

In [48]:
df['Sex'].unique()

['2', '1', 'missing']
Categories (3, object): ['1', '2', 'missing']

We will specify a constant value for `AE_HRG`, which will be *Nothing* (already a category in the dataset).

In [49]:
df = s.preprocessing.imputing_non_numeric_constant(df,
                                                         constant='Nothing',
                                                         columns='AE_HRG')

Now we will check all the fields have no missing values.

In [50]:
df.isna().sum()

IMD_Decile_From_LSOA               0
Age_Band                           0
Sex                                0
AE_Arrive_Date                     0
AE_Arrive_HourOfDay                0
AE_Time_Mins                       0
AE_HRG                             0
AE_Num_Diagnoses                   0
AE_Num_Investigations              0
AE_Num_Treatments                  0
AE_Arrival_Mode                    0
Provider_Patient_Distance_Miles    0
Admitted_Flag                      0
dtype: int64

### Other impute methods

There are options as well to backfill and forward fill which work in the same way, but are not split into numeric and non-numeric options.

#### *Scribe.preprocessing.imputing_backwardfill(df, columns=None)*

#### *Scribe.preprocessing.imputing_forwardfill(df, columns=None)*

### Checking that preprocessing steps have taken place

#### *(Scribe.preprocessing.check_imputes_step())*

You can check that values have been imputed via datascribe by using the checking method.  If any imput method has been successfully performed to replace values, it will return True.  Otherwise, it will return False.

In [51]:
s.preprocessing.check_imputes_step()

True

### Previewing the text summary

#### (Scribe.preprocessing.imputing_commentary())

With `imputing_commentary`, you can preview the text summary which would be prepared for the output file.

In [52]:
s.preprocessing.imputing_commentary()

'Null values in Provider_Patient_Distance_Miles were imputed with the mean value. Null values in IMD_Decile_From_LSOA were imputed with the median value. Null values in Admitted_Flag and AE_Arrive_HourOfDay were imputed with the mode value. Null values in Sex and AE_HRG were imputed with the the following constant values: missing, Nothing.'

## Scaling categories which have an order

#### *(Scribe.preprocessing.scale_categories(df, mappings, columns))*

We will use the `scale_categories` method in `preprocessing` to turn a couple of the non-numeric fields into a scale so that this information is not lost in the model.

While you can do this with just one column, we will process three columns at the same time using a list for the `columns` parameter and a list of dictionaries for the `mapppings`.

In [53]:
# specify columns
cols_for_scaling = ['Age_Band', 'AE_Arrive_HourOfDay','AE_HRG']
# specify mappings for above columns
cols_scale_map = [{'1-17': 1, '18-24': 2,
                   '45-64': 3, '25-44': 4,
                   '65-84': 5, '85+': 6},
                {'01-04': 1, '05-08': 2, '09-12': 3,
                 '13-16': 4, '17-20': 5, '21-24': 6},
                {'Nothing': 1, 'Low': 2, 'Medium': 3, 'High': 4}]
# call method to scale
df = s.preprocessing.scale_categories(df, cols_scale_map, cols_for_scaling)

If we preview the columns, we can see they are now on a scale between 0 and 1.

In [54]:
df[cols_for_scaling].head()

Unnamed: 0,Age_Band,AE_Arrive_HourOfDay,AE_HRG
0,0.6,0.0,0.666667
1,0.8,0.6,0.666667
2,0.6,1.0,0.333333
3,0.6,0.8,0.666667
4,0.2,0.0,0.333333


### Checking whether any categorical scaling has taken place

#### *(Scribe.preprocessing.check_cats_scaled())*

You can check whether any categorical fields have been processed with this checker.  True means that it has taken place.

In [55]:
s.preprocessing.check_cats_scaled()

True

### Scaling categories commentary

#### *(Scribe.preprocessing.cat_scaling_commentary())*

You can preview the summary text for scaling categories by using the `cat_scaling commentary` method.

In [56]:
s.preprocessing.cat_scaling_commentary()

'The order of Age_Band, AE_Arrive_HourOfDay and AE_HRG were retained by mapping the order of the values and using the `MinMaxScaler()` method from the `sklearn` package to create a scale between 0 and 1.'

## Encoding categorical items

#### *(Scribe.preprocessing.dummy_encoder(self, df, columns=None))*

For categories which do not have an order of values to maintain, we can create dummies or 'one-hot encoding'.  Similar to other methods described above, if no columns are specified, the method will transform all fields.

In [57]:
cols_for_dummies = ['Sex', 'AE_Arrival_Mode']
df = s.preprocessing.dummy_encoder(df, cols_for_dummies)

We can see the new columns have been added to the data frame, and the original columns removed.

In [58]:
df.head()

Unnamed: 0,IMD_Decile_From_LSOA,AE_Arrive_Date,AE_Time_Mins,AE_Num_Diagnoses,AE_Num_Investigations,AE_Num_Treatments,Provider_Patient_Distance_Miles,Admitted_Flag,Age_Band,AE_Arrive_HourOfDay,AE_HRG,Sex_1,Sex_2,Sex_missing,AE_Arrival_Mode_0,AE_Arrival_Mode_1,AE_Arrival_Mode_2
0,4,2016-08-30,150,1,8,4,1.0,0,0.6,0.0,0.666667,False,True,False,False,True,False
1,3,2016-01-13,220,1,8,5,0.0,0,0.8,0.6,0.666667,False,True,False,False,False,True
2,2,2016-04-28,160,1,8,4,1.0,0,0.6,1.0,0.333333,True,False,False,False,False,True
3,4,2016-11-23,240,1,8,4,4.0,0,0.6,0.8,0.666667,True,False,False,False,False,True
4,8,2016-09-03,60,1,8,3,8.0,0,0.2,0.0,0.333333,False,True,False,False,True,False


If you wish to preview the draft write up which will be included in the output document, you can preview it by calling `Scribe.preprocessing.dummy_encoding_commentary()`

In [59]:
s.preprocessing.dummy_encoding_commentary()

'One-hot encoding was used on Sex and AE_Arrival_Mode using the pandas `get_dummies` method.'

We can see that all the fields are now usable in the logistic regression model apart from the date field.  Currently, the package does not process this, so we will drop the field.

In [60]:
df.drop(columns='AE_Arrive_Date', inplace=True)

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106102 entries, 0 to 106101
Data columns (total 16 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   IMD_Decile_From_LSOA             106102 non-null  Int8   
 1   AE_Time_Mins                     106102 non-null  Int32  
 2   AE_Num_Diagnoses                 106102 non-null  Int32  
 3   AE_Num_Investigations            106102 non-null  Int32  
 4   AE_Num_Treatments                106102 non-null  Int32  
 5   Provider_Patient_Distance_Miles  106102 non-null  Float32
 6   Admitted_Flag                    106102 non-null  Int8   
 7   Age_Band                         106102 non-null  float64
 8   AE_Arrive_HourOfDay              106102 non-null  float64
 9   AE_HRG                           106102 non-null  float64
 10  Sex_1                            106102 non-null  bool   
 11  Sex_2                            106102 non-null  bool   
 12  Se

The final method in `preprocessing` is the `standardise_data` method, which will be used in the next notebook where the model is created.