# An Ensembling Approach

**Purpose:** This notebook interactively guides the user through an end-to-end process for building an ensemble.  It provides a generic dataset, but the notebook can be repurposed for any structured dataset in .csv or .xlsx format.

## The dataset
Kickstarter is one of the main online crowdfunding platforms in the world. The dataset provided contains data on tens of thousands of projects launched on the platform in 2018. The datasets provided have the same structure and contain the following columns:

- **ID**: internal ID, _numeric_
- **name**: name of the project, _string_
- **category**: project's category, _string_
- **main_category**: campaign's category, _string_
- **currency**: project's currency, _string_
- **deadline**: project's deadline date, _timestamp_
- **goal**: fundraising goal, _numeric_
- **launched**: project's start date, _timestamp_
- **pledged**: amount pledged by backers (project's currency), _numeric_
- **state**: project's current state, _string_; **this is what you have to predict**
- **backers**: number of people that backed the project, _numeric_
- **country**: project's country, _string_
- **usd pledged**: amount pledged by backers converted to USD (conversion made by KS), _numeric_
- **usd_pledged_real**: amount pledged by backers converted to USD (conversion made by fixer.io api), _numeric_
- **usd_goal_real**: fundraising goal in USD, _numeric_
- **launch_month**: the numeric value for the month of the year in which the project was launched, _numeric_
- **launch_dow**: the day of the week on which the project was launched, an integer value ranging from 1 (Sunday) to 7 (Saturday), _numeric_
- **duration**: the number of days between the launched date and deadline date, _numeric_
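
The last three columns are derived from the `launched` and `deadline` timestamps. The notebook does not show how they were computed, so the following is only a minimal sketch of one way to reproduce them. The function name `derive_launch_features` and the timestamp format are assumptions made for illustration.

```python
from datetime import datetime

def derive_launch_features(launched, deadline, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (launch_month, launch_dow, duration) for one project.

    Assumes both timestamps are strings in `fmt`; the real columns may use another format.
    """
    start = datetime.strptime(launched, fmt)
    end = datetime.strptime(deadline, fmt)
    launch_month = start.month
    # isoweekday() gives Monday=1 ... Sunday=7; remap to Sunday=1 ... Saturday=7
    launch_dow = start.isoweekday() % 7 + 1
    # Whole days between launch and deadline
    duration = (end - start).days
    return launch_month, launch_dow, duration

features = derive_launch_features("2018-03-04 10:00:00", "2018-04-03 10:00:00")
print(features)  # (3, 1, 30): launched in March, on a Sunday, running for 30 days
```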

<br/><br/>

**Resources**:
* [iPython Widget Documentation](https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html)
    
## Table of Contents

**1.0** **- Ingest Data**
    * 1.1 - Set Your Working Directory
    * 1.2 - Upload Your Data (for Modeling)
    * 1.3 - Select a Data Frame (for Modeling)
     
**2.0** **- Build Models**
    * 2.1 - Select Your Target Variable
    * 2.2 - Build Individual Models
    * 2.3 - Build Ensembles
    * 2.4 - Examine Ensembles
    * 2.5 - Save and Export a Model

**3.0** **- Score New Data**
    * 3.1 - Upload Your Data (for Scoring)
    * 3.2 - Select a Data Frame (for Scoring)
    * 3.3 - Score Your Data
    * 3.4 - Export Your Scored Data

## Dependencies

This script was executed using the following version of Python:
* **Python 3.6.2 :: Anaconda, Inc.**

Use this link to install Python on your machine:
* https://www.anaconda.com/distribution/#download-section

**About Python Versions:**
If you are running a newer version of Python and this notebook fails to execute properly, you can downgrade your version in the terminal by running the following commands:
* conda search python [to see which versions are available to install]
* conda install python=3.6.2 [which will switch the active version to 3.6.2; if available in the list above]

**About Python Packages:**
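All third-party packages used in this notebook can be installed using the "pip install [package_name]" command in your terminal.  Modules from the Python standard library (os, datetime, re, io, warnings, json, subprocess, pickle, copy) ship with Python and do not need to be installed.  Be sure you've installed each of the third-party packages below before attempting to execute the notebook.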

Current package requirements include:
* os - https://docs.python.org/3/library/os.html
* Pandas - https://pandas.pydata.org/
* Datetime - https://docs.python.org/3/library/datetime.html
* re - https://docs.python.org/3/library/re.html
* Numpy - http://www.numpy.org/
* ipywidgets - https://ipywidgets.readthedocs.io/en/stable/user_install.html
* ipython - https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/extension_api.html
* scikit-learn - https://scikit-learn.org/stable/install.html
* requests - https://2.python-requests.org/en/master/user/install/
* io - https://docs.python.org/3/library/io.html
* warnings - https://docs.python.org/3/library/warnings.html
* json - https://docs.python.org/3/library/json.html
* subprocess - https://docs.python.org/3/library/subprocess.html
* mlxtend - http://rasbt.github.io/mlxtend/
* joblib - https://joblib.readthedocs.io/en/latest/
* pickle - https://docs.python.org/3/library/pickle.html
* copy - https://docs.python.org/3.6/library/copy.html

The current template uses the following versions:
* os== module 'os' from '/anaconda3/lib/python3.6/os.py'
* pandas==0.24.1
* datetime== module 'datetime' from '/anaconda3/lib/python3.6/datetime.py'
* re== module 're' from '/anaconda3/lib/python3.6/re.py'
* numpy==1.16.1
* ipywidgets==7.4.2
* ipython==6.2.1
* scikit-learn==0.19.1
* requests==2.18.4
* io== module 'io' from '/anaconda3/lib/python3.6/io.py'
* warnings== module 'warnings' from '/anaconda3/lib/python3.6/warnings.py'
* json== module 'json' from '/anaconda3/lib/python3.6/json.py'
* mlxtend==0.15.0.0
* joblib==0.14.1
* pickle== module 'pickle' from '/anaconda3/lib/python3.6/pickle.py'
* copy== module 'copy' from '/anaconda3/lib/python3.6/copy.py'

## Before you begin, ensure you've installed the required Python packages

* See the list above and make note of the specific versions that were used in this notebook
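
One way to compare your environment against the versions listed above is sketched below. It relies on `importlib.metadata`, which is part of the standard library only from Python 3.8 onward; on Python 3.6 you would need the `importlib_metadata` backport instead. The `EXPECTED` dictionary and the `installed_version` helper are names introduced here for illustration.

```python
from importlib import metadata  # standard library in Python 3.8+

# Third-party packages and the versions quoted in this notebook
EXPECTED = {
    "pandas": "0.24.1",
    "numpy": "1.16.1",
    "ipywidgets": "7.4.2",
    "ipython": "6.2.1",
    "scikit-learn": "0.19.1",
    "requests": "2.18.4",
    "mlxtend": "0.15.0.0",
    "joblib": "0.14.1",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name, expected in EXPECTED.items():
    found = installed_version(name)
    status = "ok" if found == expected else "differs"
    print("{:<14} expected {:<9} found {} ({})".format(name, expected, found, status))
```

A "differs" status is not necessarily a problem; it only flags where your results may deviate from the outputs shown in this notebook.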

In [3]:
############################################
###### Import required Python packages #####
############################################

import os
import pandas as pd
import re
import datetime as dt
from datetime import timezone
import numpy as np
from ipywidgets import interact, interactive, IntSlider, Layout
import ipywidgets as widgets
from IPython.display import display
from sklearn.model_selection import train_test_split
import io
import requests
import subprocess
import json
import joblib
import pickle
import copy
import warnings
warnings.filterwarnings('ignore')

## Note: Code Cells are Hidden by Default for Ease-of-Use

This notebook uses interactive "widgets", which require large blocks of code to enable specific user interactions.  Executing the cell below will hide all "Code" cells while keeping all outputs visible to the user.  Refer to the link below for the source, or simply run the block below to see the impact on the rest of the notebook.

* https://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer

#### Disclaimer:
* As the "output text" notes, simply click the "here" hyperlink in the text to toggle on/off this feature

In [554]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

## 1.0 - Data Ingestion

The series of code blocks below will walk you through the process of mapping to your working directory and uploading your dataset.

## 1.1 - Set Your Working Directory

Your "working directory" is a folder location on your computer that will store files either read-in or written-out by this script.  This code by default will return your current, active directory.  You can change this directory by typing in a specific path into the text box provided.

## AN IMPORTANT NOTE ABOUT INTERACTIVE WIDGETS

This notebook uses interactive widgets to help you make selections and inputs more conveniently.  As you work through this notebook, be sure to follow the steps below to ensure your selections are incorporated in the cells that follow:

#### 1. Run the cell containing the interactive widget(s) to bring them into view
#### 2. Apply your selections and/or inputs to the widgets that appear
#### 3. DO NOT rerun the cell as it will erase your selections and inputs
#### 4. To proceed, simply click on the next cell in the notebook, and Run it

<br/>

In [4]:
set_working_directory = widgets.Text(
    value=os.getcwd(),
    placeholder='/Users/bblanchard006/Desktop/SMU/QTW/Week 14',
    description='Directory:',
    disabled=False,
    layout=Layout(width='100%')
)

display(set_working_directory)

Text(value='/Users/bblanchard006/Desktop/SMU/QTW/Week 14', description='Directory:', layout=Layout(width='100%…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

After executing the cell above, you can leave the default directory or overwrite the text string that appears with your desired folder directory. **DO NOT execute the cell again after making your update.** The input above will be fed into the following code cell, where it will either successfully map to the new directory or notify you of an error.

In [5]:
try:
    os.chdir(set_working_directory.value)
    print('Changed directory to {}'.format(set_working_directory.value))
except Exception as e:
    print('Failed to change directory')
    print(e)

Changed directory to /Users/bblanchard006/Desktop/SMU/QTW/Week 14


## 1.2 - Upload Your Data (Excel and CSV files)

The function in the code cell below will find, ingest, and format both xlsx and csv files.  This is the dataset with "known" values which will be used to train your models.

In [6]:
########################################
##### Data Ingestion Functions
########################################

def compile_raw_data(filename, tab_names, subfolder, delimiter_char = ',', skip_rows = 0, file_ext = 'xlsx'):
    
    # Inputs: 
    ## filename = 'sample.csv' | 'sample.xlsx' - the filename in the directory (including the extension) 
    ## tab_names = None | ['Sheet1','Sheet2'] - None for csv; [comma separated list of tab names] for xlsx
    ## subfolder = 'source_data' - string containing the name of a folder in the working directory
    ## delimiter_char = ',' | ';' - None for xlsx
    ## rows to skip = default 0 - Not used for csv; trims the user-defined number of rows from an xlsx
    ## file extension = csv | xlsx
    
    # Description: reads in the workbook; standardizes header names; 
    # Outputs: returns a dictionary of dataframes

    master_data = {}
    if subfolder:
        file_path = subfolder+'/{}'.format(filename)
    else:
        file_path = filename

    if file_ext == 'csv':
        # Escape the dot and anchor at the end so only a trailing ".csv" is removed
        tab_names = [re.sub(r'\.csv$', '', filename)]

    for tab in tab_names:
        try:
            if file_ext == 'xlsx':
                # Pass arguments by keyword; a positional third argument is read as `header`, not `skiprows`
                dframe = pd.read_excel(file_path, sheet_name=tab, skiprows=skip_rows)
            else:
                # Read csv files with whichever delimiter was selected
                dframe = pd.read_csv(file_path, header=0, delimiter=delimiter_char)
                
            sanitizer = {
                        '$':'USD',
                        '(':' ',
                        ')':' ',
                        '/':' ',
                        '-':' ',
                        ',':' ',
                        '.':' '
            }
                        
            for key, value in sanitizer.items():
                dframe.rename(columns=lambda x: x.replace(key, value), inplace=True)
                
            dframe.rename(columns=lambda x: x.strip(), inplace=True)
            dframe.rename(columns=lambda x: re.sub(' +','_', x), inplace=True)
            
            dframe.columns = map(str.lower, dframe.columns)
            
            master_data.update({tab:dframe})
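        except Exception as e:
            # Show why the load failed, then flag the table so the summary cell can report it
            print('Failed to load {}: {}'.format(tab, e))
            master_data.update({tab:'Failed'})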
    
    return master_data
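
To make the header standardization inside `compile_raw_data` easier to follow, here is a small standalone sketch that applies the same steps to a single column name. The helper name `sanitize_header` and the sample headers are illustrative and not part of the notebook.

```python
import re

# Same character substitutions as the `sanitizer` dictionary above
SANITIZER = {"$": "USD", "(": " ", ")": " ", "/": " ", "-": " ", ",": " ", ".": " "}

def sanitize_header(name):
    """Apply the column-name clean-up steps to a single header string."""
    for old, new in SANITIZER.items():
        name = name.replace(old, new)
    name = name.strip()               # drop leading and trailing spaces
    name = re.sub(" +", "_", name)    # collapse runs of spaces into one underscore
    return name.lower()

print(sanitize_header("usd pledged"))   # usd_pledged
print(sanitize_header(" Goal ($) "))    # goal_usd
print(sanitize_header("Launch-Date"))   # launch_date
```

This is why the `usd pledged` column from the raw file appears as `usd_pledged` after ingestion.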

The code blocks below enable conditional filtering to support multiple file types. Further instructions are provided below:

**Uploading csv files**

To upload a csv file, complete these steps:
1. Type in your filename along with the extension (ex. sample.csv)
2. Check the 'csv' radio-button
3. Is your file in the main directory or a sub-folder in the directory:
    * Select the "no" radio-button if your file is in your main directory
    * Select the "yes" radio-button to expose a text-box where you can type-in the name of your sub-folder
    
**Uploading xlsx files**

To upload an xlsx file, complete these steps:
1. Type in your filename along with the extension (ex. sample.xlsx)
2. Check the 'xlsx' radio-button
3. Type in the tab-names you'd like to ingest (comma-separated; Sheet1,Sheet2,Sheet3)
4. If the data in your file has leading rows, select how many rows to skip before ingesting the data (ex. if your data starts on Row 2 in the Excel-file, set the Skip Rows value to 1)
5. Is your file in the main directory or a sub-folder in the directory:
    * Select the "no" radio-button if your file is in your main directory
    * Select the "yes" radio-button to expose a text-box where you can type-in the name of your sub-folder

In [7]:
upload_type = widgets.RadioButtons(
    options=['local', 'url'],
    description='File Location:',
    disabled=False
)

upload_url = widgets.Text(
    value='https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv',
    placeholder='http://',
    description='URL:',
    disabled=False,
    layout=Layout(width='80%')
)
upload_filename = widgets.Text(
    value='training_data.csv',
    placeholder='Sample File.csv',
    description='File Name:',
    disabled=False,
    layout=Layout(width='50%')
)

file_type = widgets.RadioButtons(
    options=['csv', 'xlsx'],
    description='File Type:',
    disabled=False
)

tab_names = widgets.Text(
    value='Sheet1, Sheet2, Sheet3, etc',
    placeholder='ALL EMPLOYEES, PAST EMPLOYEES',
    description='Tab(s):',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder_name = widgets.Text(
    value='source_data',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

skip_rows = widgets.IntSlider(
    value=0,
    min=0,
    max=10,
    step=1,
    description='Skip Rows:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

delimiter = widgets.RadioButtons(
    options=[',',';'],
    value=',',
    description='Delimiter:',
    disabled=False
)

def text_field(x):
    # Show the xlsx-only inputs (tab names, rows to skip) or the csv delimiter.
    # No submit callback is needed: later cells read each widget's .value directly.
    if x == 'xlsx':
        display(tab_names)
        display(skip_rows)
    else:
        display(delimiter)
        print('Tab Names: Not needed for csv files')

def sub_folder(y):
    # Show the subfolder text box only when the file lives in a subfolder
    if y == 'yes':
        display(subfolder_name)
    else:
        print('Using {} folder'.format(os.getcwd()))

def file_location(z):
    if(z=='local'):
        display(upload_filename)
        i = widgets.interactive(text_field, x=file_type)
        display(i)
        p = widgets.interactive(sub_folder, y=subfolder)
        display(p)
    else:
        display(upload_url)
    
q = widgets.interactive(file_location, z=upload_type)

display(q)

interactive(children=(RadioButtons(description='File Location:', options=('local', 'url'), value='local'), Out…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The following code cell will attempt to ingest the data you've selected in the widgets above:

**Note About xlsx Files** - Depending on the number of tabs and the size of the data on each tab, ingesting an xlsx file can take several minutes to execute.  If possible, it may be more efficient to break your Excel file into separate csv files which take only a fraction of a second to ingest.

In [8]:
master_data = {}

if upload_type.value == 'url':
    url_response = requests.request("GET", upload_url.value)
    master_data['url_data'] = pd.read_csv(io.BytesIO(url_response.content))
else:
    if file_type.value == 'csv':
        tabs = None
        skiprows = 0
    else:
        tabs = [x.strip() for x in tab_names.value.split(',')]
        skiprows = skip_rows.value

    # Use a separate name so the `subfolder` radio-button widget is not overwritten
    if subfolder.value == 'yes':
        subfolder_path = subfolder_name.value
    else:
        subfolder_path = None
    master_data = compile_raw_data(upload_filename.value, tabs, subfolder_path, delimiter_char = delimiter.value, skip_rows = skiprows, file_ext = file_type.value)


**Note:** If you see an AttributeError: 'NoneType' object has no attribute 'value' message above, simply rerun the last two code cells to reset the input parameters.

The following code cell will print out the attributes associated with the files you've uploaded and alert you of any errors:

In [9]:
for key, value in master_data.items():
    try:
        print('{} table was ingested with {} rows and {} columns'.format(key,value.shape[0],value.shape[1]))
    except AttributeError:
        # Failed tables are stored as the string 'Failed', which has no .shape
        print('{} table failed to load'.format(key))

training_data table was ingested with 35000 rows and 18 columns


## 1.3 - Select a Data Frame

The following menus will allow you to select the dataset you would like to use in your modeling and the variables you would like included in the subsequent processes.  You can preview a sample of the data as well as increase or decrease the number of records returned by using the integer input widget (which has a default range; minimum rows = 1, maximum rows = 50).

Select an available frame from the list below:

In [10]:
dict_keys = widgets.Select(
    options=list(master_data.keys()),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

Select(description='Tables:', layout=Layout(width='50%'), options=('training_data',), value='training_data')

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The cell below will provide a quick snapshot of the data you have selected above

In [11]:
master_data[dict_keys.value].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35000 entries, 0 to 34999
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                35000 non-null  int64  
 1   name              35000 non-null  object 
 2   category          35000 non-null  object 
 3   main_category     35000 non-null  object 
 4   currency          35000 non-null  object 
 5   deadline          35000 non-null  object 
 6   goal              35000 non-null  float64
 7   launched          35000 non-null  object 
 8   pledged           35000 non-null  float64
 9   state             35000 non-null  object 
 10  backers           35000 non-null  int64  
 11  country           35000 non-null  object 
 12  usd_pledged       34659 non-null  float64
 13  usd_pledged_real  35000 non-null  float64
 14  usd_goal_real     35000 non-null  float64
 15  launch_month      35000 non-null  int64  
 16  launch_dow        35000 non-null  int64 

After selecting a frame above, select the variables you would like included in your workflow from the list below:

**NOTE:** To select multiple values from the picklist, either hold down the command key on your keyboard (Ctrl on Windows and Linux) or click and hold the shift key to select ranges of variables.  You can scroll down if your mouse is within the widget window.

In [12]:
review_variables = widgets.SelectMultiple(
    options=master_data[dict_keys.value].columns.tolist(),
    description='Variables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(review_variables)

SelectMultiple(description='Variables:', layout=Layout(width='50%'), options=('id', 'name', 'category', 'main_…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
Input the number of rows you'd like to sample:

In [13]:
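# Convert the tuple of selected column names into a list
review_var_list = list(review_variables.value)

# Copy the selected columns so later edits do not touch the source table
master_data['custom_table'] = master_data[dict_keys.value][review_var_list].copy()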

head_number = widgets.BoundedIntText(
    value=5,
    min=1,
    max=50,
    step=1,
    description='Rows:',
    disabled=False
)

def sample_view(head_number):
    sample = master_data['custom_table'].head(head_number)
    print(sample)

out = widgets.interactive_output(sample_view, {'head_number':head_number})

widgets.VBox([widgets.VBox([head_number]), out])

VBox(children=(VBox(children=(BoundedIntText(value=5, description='Rows:', max=50, min=1),)), Output()))

## 2.0 - Develop Individual Models

The following cells will walk you through the process of building the individual models needed for your ensemble.

## 2.1 - Select Your Target Variable

Your "Target" variable represents the thing you are attempting to predict. It should be either "categorical" (ex. text, labels) or "continuous" (ex. numeric values) in nature. The target's type determines which algorithms can be used and which evaluation metrics are meaningful when assessing each model's performance.

Select your Target variable and note whether or not it is a categorical or continuous data type:

In [14]:
target = widgets.Select(
    options=master_data['custom_table'].columns.tolist(),
    description='Target',
    disabled=False
)

target_type = widgets.Select(
    options=['Continuous','Categorical'],
    description='Type',
    disabled=False,
)

display(target)
display(target_type)

Select(description='Target', options=('main_category', 'goal', 'state', 'country', 'launch_month', 'launch_dow…

Select(description='Type', options=('Continuous', 'Categorical'), value='Continuous')

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

In the next cell, choose the labels you would like to include in your modeling.  Note: you can choose all for a multi-class problem, or you can select two for a binary-class problem.  TODO: develop a process for handling continuous target variables.

In [15]:
var_labels = master_data['custom_table'][target.value].unique().tolist()

target_labels = widgets.SelectMultiple(
    options=var_labels,
    description='Target',
    disabled=False
)

display(target_labels)

SelectMultiple(description='Target', options=('failed', 'canceled', 'successful', 'live', 'undefined', 'suspen…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The next cell will filter your dataset to the labels selected.

In [16]:
selected_labels = [x for x in target_labels.value]

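# Keep only the rows whose target value is one of the selected labels
master_data['model_table'] = master_data['custom_table'][master_data['custom_table'][target.value].isin(selected_labels)].copy()
# Confirm which labels remain (uses the selected target rather than a hard-coded column)
master_data['model_table'][target.value].unique()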

array(['failed', 'successful'], dtype=object)

In [17]:
categorical_vars = master_data['custom_table'].columns.tolist()

cat_vars = widgets.SelectMultiple(
    options=categorical_vars,
    description='Categorical',
    disabled=False
)

display(cat_vars)

SelectMultiple(description='Categorical', options=('main_category', 'goal', 'state', 'country', 'launch_month'…

In [18]:
one_hot_vars = [x for x in cat_vars.value]
one_hot_df = pd.get_dummies(master_data['model_table'][one_hot_vars],prefix=one_hot_vars)
master_data['model_table'] = master_data['model_table'].merge(one_hot_df, how='inner', left_index=True, right_index=True)

for o in one_hot_vars:
    del master_data['model_table'][o]
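
The cell above replaces each selected categorical column with one indicator column per distinct value (one-hot encoding). The sketch below shows the same idea in plain Python on a few made-up rows; the `one_hot` helper and the sample values are illustrative only, and the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(rows, column):
    """One-hot encode `column` in a list of dict rows (a stand-in for a DataFrame)."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column, then add one 0/1 indicator per level
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row["{}_{}".format(column, level)] = int(row[column] == level)
        encoded.append(new_row)
    return encoded

# Hypothetical rows, not taken from the Kickstarter data
rows = [
    {"goal": 1000.0, "country": "GB"},
    {"goal": 30000.0, "country": "US"},
    {"goal": 5000.0, "country": "US"},
]
for row in one_hot(rows, "country"):
    print(row)
# {'goal': 1000.0, 'country_GB': 1, 'country_US': 0}
# {'goal': 30000.0, 'country_GB': 0, 'country_US': 1}
# {'goal': 5000.0, 'country_GB': 0, 'country_US': 1}
```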


In [19]:
master_data['model_table']

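| | goal | state | launch_month | launch_dow | duration | main_category_Art | main_category_Comics | main_category_Crafts | main_category_Dance | main_category_Design | ... | country_JP | country_LU | country_MX | country_N,0" | country_NL | country_NO | country_NZ | country_SE | country_SG | country_US |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000.0 | failed | 8 | 3 | 58 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 30000.0 | failed | 9 | 7 | 60 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 45000.0 | failed | 1 | 7 | 45 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 5000.0 | failed | 3 | 7 | 30 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 50000.0 | successful | 2 | 6 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34995 | 150000.0 | failed | 5 | 6 | 33 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 34996 | 5000.0 | failed | 3 | 4 | 30 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 34997 | 2000.0 | successful | 4 | 7 | 29 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 34998 | 5500.0 | failed | 3 | 5 | 59 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |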


## 2.2 - Build Individual Models

The code cells below will separate our target variable from our independent variables and create training and testing splits before executing build sequences for several different model types.

The next cell will separate our target variable from the independent variables

In [145]:
X = master_data['model_table'].drop(columns = [target.value])
y = master_data['model_table'][target.value]

The cells below will create our training and testing datasets.  Note: the split is stratified on the target variable, so the training and testing sets keep approximately the same proportion of each class.  Use the slider to set the percentage of records to include in your testing set.

In [146]:
test_prop = widgets.IntSlider(
    value=30,
    min=1,
    max=50,
    step=1,
    description='Test %:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

display(test_prop)

IntSlider(value=30, description='Test %:', max=50, min=1)

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

In [147]:
test_proportion = test_prop.value/100
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_proportion, stratify=y)
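
To illustrate what the `stratify=y` argument achieves, the sketch below splits each class separately so that both sets keep the original class mix. It is a simplified stand-in written for this explanation, not scikit-learn's implementation, and the `stratified_split` helper and the labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share of that class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 60% failed, 40% successful
labels = ["failed"] * 60 + ["successful"] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.30)

print(Counter(labels[i] for i in test_idx))   # 18 failed and 12 successful
print(Counter(labels[i] for i in train_idx))  # 42 failed and 28 successful
```

Both sets keep the 60/40 mix of the full dataset, which is the property the stratified split is meant to preserve.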

### K-nearest Neighbor (k-NN)

We will first fit a k-nearest neighbors classifier to our dataset by executing the cells below.  By default, the grid search has been hard-coded to evaluate values of k from 1 to 24 (`np.arange(1, 25)` excludes its upper bound).  The models are evaluated using 5-fold cross-validation.

In [148]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(1999)

knn = KNeighborsClassifier()

params_knn = {'n_neighbors': np.arange(1, 25)}

knn_gs = GridSearchCV(knn, params_knn, cv=5)

knn_gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

Once we've completed our grid search above, the cell below will return the "best model"

In [149]:
knn_best = knn_gs.best_estimator_

print(knn_gs.best_params_)

{'n_neighbors': 24}
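
The 5-fold cross-validation used by the grid search above can be pictured with the sketch below, which shows how rows rotate between training and validation. It is a simplified, unshuffled k-fold written for illustration; when `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds, which additionally balance the classes in each fold. The `k_fold_indices` helper is a name introduced here.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, validation_idx) pairs for plain k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, validation_idx
        start += size

folds = list(k_fold_indices(12, n_splits=5))
for fold, (train_idx, validation_idx) in enumerate(folds, start=1):
    print("fold {}: validate on {}, train on {} rows".format(fold, validation_idx, len(train_idx)))
```

Each candidate value of `n_neighbors` is fitted once per fold and scored on the held-out rows; the grid search then keeps the value with the best average score.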


### Random Forest Classifier

We will next fit a Random Forest to our dataset by executing the cells below.  By default, the number of estimators has been hard-coded to 50, 100, and 200.  The models are evaluated using 5-fold cross-validation.

In [150]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1)

params_rf = {'n_estimators': [50, 100, 200]}

rf_gs = GridSearchCV(rf, params_rf, cv=5)

rf_gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=1,
                                   

Once we've completed our grid search above, the cell below will return the "best model"

In [151]:
rf_best = rf_gs.best_estimator_

print(rf_gs.best_params_)

{'n_estimators': 200}


### Logistic Regression

Finally, we will fit a Logistic Regression to our dataset by executing the cells below.

In [152]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=1)

log_reg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Individual Model Performance

Now that we have trained, evaluated, and selected three separate models, we can view their individual accuracies on the test set below:

In [153]:
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))

knn: 0.6100075979594052
rf: 0.6218387061760555
log_reg: 0.5990448279604906
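
For classifiers, `score` returns the mean accuracy, that is, the share of test rows predicted correctly. Accuracy is easier to interpret next to a baseline that always predicts the most common class. The notebook does not report the class balance of the modeling table, so the sketch below uses hypothetical labels, and the helper names are introduced only for illustration.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority_label] * len(test_labels))

# Hypothetical labels, not taken from the Kickstarter data
train_labels = ["failed"] * 6 + ["successful"] * 4
test_labels = ["failed", "failed", "successful", "failed", "successful"]

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(baseline)  # 0.6
```

In the notebook, the equivalent call would pass `y_train` and `y_test`; a model is only adding value to the extent that it beats this number.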


## 2.3 - Build an Ensemble

Now that we've seen how well our models perform independently, we can generate various ensembles to test their collective performance to see if we can improve our accuracy.

[Implementation of a majority voting EnsembleVoteClassifier for classification](http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/)
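
Before building the ensembles, it may help to see how the two voting schemes combine predictions. The sketch below is a plain-Python illustration, not mlxtend's implementation: soft voting averages the (optionally weighted) class probabilities, while hard voting counts each model's predicted label. The probabilities and helper names are hypothetical.

```python
from collections import Counter

def soft_vote(probabilities, classes, weights=None):
    """Average (optionally weighted) class probabilities and pick the largest."""
    weights = weights or [1] * len(probabilities)
    total = sum(weights)
    averaged = [
        sum(w * p[i] for w, p in zip(weights, probabilities)) / total
        for i in range(len(classes))
    ]
    return classes[averaged.index(max(averaged))], averaged

def hard_vote(probabilities, classes):
    """Let each model cast one vote for its most probable class."""
    votes = [classes[p.index(max(p))] for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

classes = ["failed", "successful"]
# Hypothetical [P(failed), P(successful)] from three models for one project
probabilities = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]

print(hard_vote(probabilities, classes))                      # failed (two votes to one)
print(soft_vote(probabilities, classes)[0])                   # successful (mean P = 0.60)
print(soft_vote(probabilities, classes, [0.3, 0.3, 0.4])[0])  # successful (weighted P = 0.63)
```

Two lukewarm votes for `failed` win the hard vote, but one confident vote for `successful` dominates the averaged probabilities, so the two schemes can disagree on the same record.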

### Ensemble 1: Soft-voting with equal weighting

In [154]:
from mlxtend.classifier import EnsembleVoteClassifier

ensemble_soft_evc = EnsembleVoteClassifier(clfs=[knn_best,rf_best,log_reg], voting='soft', weights=[1,1,1], refit=False)
ensemble_soft_evc.fit(X_train, y_train)

print('soft ensemble: {}'.format(ensemble_soft_evc.score(X_test, y_test)))


soft ensemble: 0.6303050037989797


### Ensemble 2: Soft-voting with various weights

In [155]:
ensemble_soft_wgt_evc = EnsembleVoteClassifier(clfs=[knn_best,rf_best,log_reg], voting='soft', weights=[.3,.3,.4], refit=False)
ensemble_soft_wgt_evc.fit(X_train, y_train)

print('soft weight ensemble: {}'.format(ensemble_soft_wgt_evc.score(X_test, y_test)))


soft weight ensemble: 0.6319331379572344


### Ensemble 3: Hard-voting (unweighted)

In [156]:
ensemble_hard_evc = EnsembleVoteClassifier(clfs=[knn_best,rf_best,log_reg], voting='hard', refit=False)
ensemble_hard_evc.fit(X_train, y_train)

print('hard ensemble: {}'.format(ensemble_hard_evc.score(X_test, y_test)))

hard ensemble: 0.6216216216216216


### Comparing the Outputs

In [157]:
ens_soft_preds_evc = ensemble_soft_evc.predict(X_test)
ens_soft_preds_prob_evc = ensemble_soft_evc.predict_proba(X_test)
ens_soft_test_df_evc = copy.deepcopy(X_test)
ens_soft_test_df_evc['actual_state'] = y_test
ens_soft_test_df_evc['soft_ens_predicted_state'] = ens_soft_preds_evc

col_header = ['soft_ens_' + x + '_prob' for x in ensemble_soft_evc.classes_]
ens_soft_preds_prob_df_evc = pd.DataFrame(ens_soft_preds_prob_evc)
ens_soft_preds_prob_df_evc.columns = col_header
ens_soft_combined_output_evc = pd.concat([ens_soft_test_df_evc.reset_index(drop=True), ens_soft_preds_prob_df_evc.reset_index(drop=True)], axis=1)


In [158]:
ens_soft_wgt_preds_evc = ensemble_soft_wgt_evc.predict(X_test)
ens_soft_wgt_preds_prob_evc = ensemble_soft_wgt_evc.predict_proba(X_test)
ens_soft_wgt_test_df_evc = copy.deepcopy(X_test)
ens_soft_wgt_test_df_evc['actual_state'] = y_test
ens_soft_wgt_test_df_evc['soft_ens_wght_predicted_state'] = ens_soft_wgt_preds_evc

col_header = ['soft_ens_wght_' + x + '_prob' for x in ensemble_soft_wgt_evc.classes_]
ens_soft_wgt_preds_prob_df_evc = pd.DataFrame(ens_soft_wgt_preds_prob_evc)
ens_soft_wgt_preds_prob_df_evc.columns = col_header
ens_soft_wgt_combined_output_evc = pd.concat([ens_soft_wgt_test_df_evc.reset_index(drop=True), ens_soft_wgt_preds_prob_df_evc.reset_index(drop=True)], axis=1)


In [159]:
ens_hard_preds_evc = ensemble_hard_evc.predict(X_test)
ens_hard_test_df_evc = copy.deepcopy(X_test)
ens_hard_test_df_evc['actual_state'] = y_test
ens_hard_test_df_evc['hard_ens_predicted_state'] = ens_hard_preds_evc

ens_hard_combined_output_evc = ens_hard_test_df_evc


In [160]:
ens_soft_summary_df_evc = ens_soft_combined_output_evc[ens_soft_combined_output_evc.columns[-4:]]
ens_soft_wgt_summary_df_evc = ens_soft_wgt_combined_output_evc[ens_soft_wgt_combined_output_evc.columns[-3:]]
ens_hard_summary_df_evc = ens_hard_combined_output_evc[ens_hard_combined_output_evc.columns[-1:]]

ens_combined_output_evc = pd.concat([ens_soft_summary_df_evc.reset_index(drop=True), ens_soft_wgt_summary_df_evc.reset_index(drop=True),ens_hard_summary_df_evc.reset_index(drop=True)], axis=1)
ens_combined_output_evc


Unnamed: 0,actual_state,soft_ens_predicted_state,soft_ens_failed_prob,soft_ens_successful_prob,soft_ens_wght_predicted_state,soft_ens_wght_failed_prob,soft_ens_wght_successful_prob,hard_ens_predicted_state
0,failed,failed,0.727248,0.272752,failed,0.708031,0.291969,failed
1,successful,successful,0.417118,0.582882,successful,0.425041,0.574959,successful
2,successful,successful,0.422306,0.577694,successful,0.437934,0.562066,failed
3,failed,failed,0.619347,0.380653,failed,0.630883,0.369117,failed
4,successful,successful,0.274576,0.725424,successful,0.295657,0.704343,successful
...,...,...,...,...,...,...,...,...
9208,failed,failed,0.524016,0.475984,failed,0.526233,0.473767,failed
9209,successful,successful,0.457640,0.542360,successful,0.471668,0.528332,successful
9210,failed,failed,0.604614,0.395386,failed,0.597537,0.402463,failed
9211,failed,failed,0.653708,0.346292,failed,0.644949,0.355051,failed


## 2.4 - How Ensembles Work

The next few cells build a standardized prediction table for each base model (KNN, random forest, and logistic regression) so that the ensemble's components can be inspected side by side.

In [161]:
knn_preds = knn_best.predict(X_test)
knn_preds_prob = knn_best.predict_proba(X_test)
knn_test_df = copy.deepcopy(X_test)
knn_test_df['actual_state'] = y_test
knn_test_df['knn_predicted_state'] = knn_preds

col_header = ['knn_' + x + '_prob' for x in knn_best.classes_]
knn_preds_prob_df = pd.DataFrame(knn_preds_prob)
knn_preds_prob_df.columns = col_header
knn_combined_output = pd.concat([knn_test_df.reset_index(drop=True), knn_preds_prob_df.reset_index(drop=True)], axis=1)


In [162]:
rf_preds = rf_best.predict(X_test)
rf_preds_prob = rf_best.predict_proba(X_test)
rf_test_df = copy.deepcopy(X_test)
rf_test_df['actual_state'] = y_test
rf_test_df['rf_predicted_state'] = rf_preds

col_header = ['rf_' + x + '_prob' for x in rf_best.classes_]
rf_preds_prob_df = pd.DataFrame(rf_preds_prob)
rf_preds_prob_df.columns = col_header
rf_combined_output = pd.concat([rf_test_df.reset_index(drop=True), rf_preds_prob_df.reset_index(drop=True)], axis=1)


In [163]:
lr_preds = log_reg.predict(X_test)
lr_preds_prob = log_reg.predict_proba(X_test)
lr_test_df = copy.deepcopy(X_test)
lr_test_df['actual_state'] = y_test
lr_test_df['lr_predicted_state'] = lr_preds

col_header = ['lr_' + x + '_prob' for x in log_reg.classes_]
lr_preds_prob_df = pd.DataFrame(lr_preds_prob)
lr_preds_prob_df.columns = col_header
lr_combined_output = pd.concat([lr_test_df.reset_index(drop=True), lr_preds_prob_df.reset_index(drop=True)], axis=1)


### Consolidated Modeling Output Table

In [165]:
knn_summary_df = knn_combined_output[knn_combined_output.columns[-4:]]
rf_summary_df = rf_combined_output[rf_combined_output.columns[-3:]]
lr_summary_df = lr_combined_output[lr_combined_output.columns[-3:]]

combined_output = pd.concat([knn_summary_df.reset_index(drop=True), rf_summary_df.reset_index(drop=True), lr_summary_df.reset_index(drop=True)], axis=1)
combined_output


Unnamed: 0,actual_state,knn_predicted_state,knn_failed_prob,knn_successful_prob,rf_predicted_state,rf_failed_prob,rf_successful_prob,lr_predicted_state,lr_failed_prob,lr_successful_prob
0,failed,failed,0.791667,0.208333,failed,0.855000,0.145000,failed,0.535078,0.464922
1,successful,failed,0.500000,0.500000,successful,0.255000,0.745000,successful,0.496353,0.503647
2,successful,failed,0.583333,0.416667,successful,0.105000,0.895000,failed,0.578584,0.421416
3,failed,failed,0.708333,0.291667,successful,0.415000,0.585000,failed,0.734708,0.265292
4,successful,successful,0.083333,0.916667,successful,0.255000,0.745000,successful,0.485393,0.514607
...,...,...,...,...,...,...,...,...,...,...
9208,failed,failed,0.583333,0.416667,successful,0.442524,0.557476,failed,0.546189,0.453811
9209,successful,successful,0.375000,0.625000,successful,0.400000,0.600000,failed,0.597920,0.402080
9210,failed,failed,0.625000,0.375000,failed,0.655000,0.345000,failed,0.533842,0.466158
9211,failed,failed,0.750000,0.250000,failed,0.645000,0.355000,failed,0.566123,0.433877


### Consolidated Ensemble Output Table

In [166]:
ens_combined_output_evc

Unnamed: 0,actual_state,soft_ens_predicted_state,soft_ens_failed_prob,soft_ens_successful_prob,soft_ens_wght_predicted_state,soft_ens_wght_failed_prob,soft_ens_wght_successful_prob,hard_ens_predicted_state
0,failed,failed,0.727248,0.272752,failed,0.708031,0.291969,failed
1,successful,successful,0.417118,0.582882,successful,0.425041,0.574959,successful
2,successful,successful,0.422306,0.577694,successful,0.437934,0.562066,failed
3,failed,failed,0.619347,0.380653,failed,0.630883,0.369117,failed
4,successful,successful,0.274576,0.725424,successful,0.295657,0.704343,successful
...,...,...,...,...,...,...,...,...
9208,failed,failed,0.524016,0.475984,failed,0.526233,0.473767,failed
9209,successful,successful,0.457640,0.542360,successful,0.471668,0.528332,successful
9210,failed,failed,0.604614,0.395386,failed,0.597537,0.402463,failed
9211,failed,failed,0.653708,0.346292,failed,0.644949,0.355051,failed


### Analyzing a Single Record

In [169]:
row_num = 2

val1 = combined_output['knn_failed_prob'][row_num]
lab1 = combined_output['knn_predicted_state'][row_num]

val2 = combined_output['rf_failed_prob'][row_num]
lab2 = combined_output['rf_predicted_state'][row_num]

val3 = combined_output['lr_failed_prob'][row_num]
lab3 = combined_output['lr_predicted_state'][row_num]

print('knn label: '+lab1+', knn failed prob: '+str(val1))
print('rf label: '+lab2+', rf failed prob: '+str(val2))
print('lr label: '+lab3+', lr failed prob: '+str(val3))


lab4 = ens_combined_output_evc['soft_ens_predicted_state'][row_num]
lab5 = ens_combined_output_evc['soft_ens_wght_predicted_state'][row_num]
lab6 = ens_combined_output_evc['hard_ens_predicted_state'][row_num]
print('soft ensemble label: '+lab4+', soft ensemble failed prob: '+str((val1+val2+val3)/3))
print('soft wght ensemble label: '+lab5+', soft wght ensemble failed prob: '+str((val1*.3)+(val2*.3)+(val3*.4)))
print('hard ensemble label: '+lab6)

knn label: failed, knn failed prob: 0.5833333333333334
rf label: successful, rf failed prob: 0.105
lr label: failed, lr failed prob: 0.5785843485731808
soft ensemble label: successful, soft ensemble failed prob: 0.42230589396883805
soft wght ensemble label: successful, soft wght ensemble failed prob: 0.43793373942927233
hard ensemble label: failed
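The disagreement on row 2 can be reproduced by hand. The sketch below is only an illustration of the arithmetic, not the `EnsembleVoteClassifier` internals: it takes the three "failed" probabilities printed above (rounded to six decimals), averages them with equal and with `[.3, .3, .4]` weights for soft voting, and counts labels for hard voting. The helper names (`soft_vote`, `hard_vote`, `label_from_failed_prob`) are hypothetical.

```python
from collections import Counter

# "failed" probabilities for row 2, copied (rounded) from the output above
failed_probs = {"knn": 0.583333, "rf": 0.105000, "lr": 0.578584}

def label_from_failed_prob(p_failed):
    # Binary case: the class with the larger probability wins
    # (assumption: a 0.5 tie goes to "failed", the first class)
    return "failed" if p_failed >= 0.5 else "successful"

def soft_vote(probs, weights):
    # Weighted mean of the probabilities; weights do not need to sum to 1
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

def hard_vote(labels):
    # Majority vote over the individual predicted labels
    return Counter(labels).most_common(1)[0][0]

probs = list(failed_probs.values())
soft_equal = soft_vote(probs, [1, 1, 1])
soft_weighted = soft_vote(probs, [0.3, 0.3, 0.4])
hard_label = hard_vote([label_from_failed_prob(p) for p in probs])

print(round(soft_equal, 4), label_from_failed_prob(soft_equal))        # 0.4223 successful
print(round(soft_weighted, 4), label_from_failed_prob(soft_weighted))  # 0.4379 successful
print(hard_label)                                                      # failed
```

Two of the three models vote "failed", so hard voting returns "failed". The random forest's very low "failed" probability (0.105) pulls both soft averages below 0.5, which is why the soft ensembles return "successful".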


## 2.5 - Save and Export a Model

There are a few ways to export a trained model. Most often you will either serialize it with Python's built-in `pickle` module or use the `joblib` library, which tends to be more efficient for objects that carry large NumPy arrays. See the link below for further details:

[Save and Load Machine Learning Models in Python with scikit-learn](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/)

In [170]:
filename = 'ensemble_model.sav'
# Use a context manager so the pickle file handle is closed after writing
with open('pickle_'+filename, 'wb') as f:
    pickle.dump(ensemble_soft_evc, f)
joblib.dump(ensemble_soft_evc, 'joblib_'+filename)

['joblib_ensemble_model.sav']

### Confirmation that our saved model can be reloaded

In [171]:
#loaded_model = pickle.load(open('pickle_'+filename, 'rb'))
loaded_model = joblib.load('joblib_'+filename)
result = loaded_model.score(X_test, y_test)
print('imported model:' +str(result))

imported model:0.6303050037989797


The original model is provided below for comparison:

In [172]:
print('soft ensemble: {}'.format(ensemble_soft_evc.score(X_test, y_test)))

soft ensemble: 0.6303050037989797
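For readers who prefer the standard-library `pickle` route, a save-and-reload round trip might look like the sketch below. The dictionary is a hypothetical stand-in for a fitted model object, and the temporary directory is only there to keep the example self-contained; in the notebook you would pass `ensemble_soft_evc` and a path in your working directory instead.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model (any picklable object works the same way)
model_state = {"voting": "soft", "weights": [1, 1, 1], "classes": ["failed", "successful"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pickle_ensemble_model.sav")

    # Context managers close the file handles even if dump/load raises
    with open(path, "wb") as fh:
        pickle.dump(model_state, fh)

    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The reloaded object should be equal to, but not the same object as, the original
print(restored == model_state, restored is model_state)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.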


## 3.0 - Score a New Dataset

The following process will walk you through uploading another dataset to score against your top model.

**Note:** The new dataset must contain the same fields that were used to train your models in the prior steps. Its structure does not have to match the training dataset exactly (e.g., there is no need to align column positions).
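A quick way to confirm this requirement before scoring is to compare column names. The sketch below is a minimal illustration; `missing_fields` is a hypothetical helper, and both column lists are example values rather than the notebook's actual selections.

```python
def missing_fields(required, available):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [col for col in required if col not in available_set]

# Example values only: fields used for training vs. columns found in a new file
training_fields = ["main_category", "goal", "country", "launch_month", "launch_dow", "duration"]
new_columns = ["ID", "name", "main_category", "goal", "country", "launch_month", "duration"]

print(missing_fields(training_fields, new_columns))  # ['launch_dow']
```

An empty list means every training field is present; extra columns in the new file are harmless because they are dropped when you select variables in Section 3.2.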

## 3.1 - Upload Your Data (Excel and CSV files)

Follow the same process you used in the previous steps to upload the dataset you would like to score with your trained model. This is the dataset with "unknown" values that your trained models will attempt to predict.

In [173]:
upload_type = widgets.RadioButtons(
    options=['local', 'url'],
    description='File Location:',
    disabled=False
)

upload_url = widgets.Text(
    value='https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv',
    placeholder='http://',
    description='URL:',
    disabled=False,
    layout=Layout(width='80%')
)
upload_filename = widgets.Text(
    value='hold_out_data.csv',
    placeholder='Sample File.csv',
    description='File Name:',
    disabled=False,
    layout=Layout(width='50%')
)

file_type = widgets.RadioButtons(
    options=['csv', 'xlsx'],
    description='File Type:',
    disabled=False
)

tab_names = widgets.Text(
    value='Sheet1, Sheet2, Sheet3, etc',
    placeholder='ALL EMPLOYEES, PAST EMPLOYEES',
    description='Tab(s):',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder_name = widgets.Text(
    value='source_data',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

skip_rows = widgets.IntSlider(
    value=0,
    min=0,
    max=10,
    step=1,
    description='Skip Rows:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

delimiter = widgets.RadioButtons(
    options=[',',';'],
    value=',',
    description='Delimiter:',
    disabled=False
)

def text_field(x):
    if(x=='xlsx'):
        display(tab_names)
        display(skip_rows)
    else:
        display(delimiter)
        print('Tab Names: Not needed for csv files')

def sub_folder(y):
    if(y=='yes'):
        display(subfolder_name)
    else:
        print('Using {} folder'.format(os.getcwd()))

def file_location(z):
    if(z=='local'):
        display(upload_filename)
        i = widgets.interactive(text_field, x=file_type)
        display(i)
        p = widgets.interactive(sub_folder, y=subfolder)
        display(p)
    else:
        display(upload_url)

q = widgets.interactive(file_location, z=upload_type)

display(q)

interactive(children=(RadioButtons(description='File Location:', options=('local', 'url'), value='local'), Out…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
The following code cell will attempt to ingest the data you've selected in the widgets above:

**Note About xlsx Files** - Depending on the number of tabs and the size of the data on each tab, ingesting an xlsx file can take several minutes to execute.  If possible, it may be more efficient to break your Excel file into separate csv files which take only a fraction of a second to ingest.

In [174]:
new_data = {}

if upload_type.value == 'url':
    url_response = requests.request("GET", upload_url.value)
    new_data['url_data'] = pd.read_csv(io.BytesIO(url_response.content))
else:
    if file_type.value == 'csv':
        tabs = None
        skiprows = 0
    else:
        tabs = [x.strip() for x in tab_names.value.split(',')]
        skiprows = skip_rows.value

    # Store the chosen folder under a separate name so the `subfolder` widget is not overwritten
    if subfolder.value == 'yes':
        subfolder_path = subfolder_name.value
    else:
        subfolder_path = None
    new_data = compile_raw_data(upload_filename.value, tabs, subfolder_path, delimiter_char = delimiter.value, skip_rows = skiprows, file_ext = file_type.value)


**Note:** If you see an AttributeError: 'NoneType' object has no attribute 'value' message above, simply rerun the last two code cells to reset the input parameters.

The following code cell will print out the attributes associated with the files you've uploaded and alert you of any errors:

In [175]:
for key, value in new_data.items():
    try:
        print('{} table was ingested with {} rows and {} columns'.format(key,value.shape[0],value.shape[1]))
    except Exception:
        print('{} table failed to load'.format(key))

hold_out_data table was ingested with 15000 rows and 18 columns


## 3.2 - Select a Data Frame to be Scored

The following menus will allow you to select the dataset you would like to score against your trained model. This dataset should contain the fields you used to train the models in prior steps, but it does not have to have the same structure (e.g., there is no need to remove unused columns or align column positions).

Select an available frame from the list below:

In [176]:
dict_keys = widgets.Select(
    options=new_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

Select(description='Tables:', layout=Layout(width='50%'), options=('hold_out_data',), value='hold_out_data')

Select the variables required by your model (**Note:** these are the variables you used to train your original model)

In [177]:
pred_variables = widgets.SelectMultiple(
    options=new_data[dict_keys.value].columns.tolist(),
    description='Variables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(pred_variables)

SelectMultiple(description='Variables:', layout=Layout(width='50%'), options=('id', 'name', 'category', 'main_…

Select all categorical variables in your dataset (for one-hot encoding)

In [178]:
trim_vars = [x for x in pred_variables.value]
cat_vars = new_data[dict_keys.value][trim_vars].columns.tolist()

cat_vars_2 = widgets.SelectMultiple(
    options=cat_vars,
    description='Target',
    disabled=False
)

display(cat_vars_2)

SelectMultiple(description='Target', options=('main_category', 'goal', 'country', 'launch_month', 'launch_dow'…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The cell below will execute the one-hot encoding and drop the original variables from your dataset

In [179]:
one_hot_vars = [x for x in cat_vars_2.value]
one_hot_df = pd.get_dummies(new_data[dict_keys.value][one_hot_vars],prefix=one_hot_vars)

master_data['score_table'] = copy.deepcopy(new_data[dict_keys.value][trim_vars])

master_data['score_table'] = master_data['score_table'].merge(one_hot_df, how='inner', left_index=True, right_index=True)

for o in one_hot_vars:
    del master_data['score_table'][o]
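One thing worth checking at this point: `pd.get_dummies` only creates columns for the category values it actually sees, so a new dataset can end up with fewer, more, or differently ordered dummy columns than the training frame. The sketch below shows one possible way to line the two up with `DataFrame.reindex`. The `train_columns` list and the toy data are assumptions for illustration; in practice the list would be captured from the training frame after its own one-hot encoding.

```python
import pandas as pd

# Hypothetical column list captured from the training frame after one-hot encoding
train_columns = ["goal", "duration", "country_GB", "country_US", "country_NL"]

# Toy "new" data: it contains a country (SE) unseen in training and lacks GB and NL
new_raw = pd.DataFrame({"goal": [50.0, 5000.0], "duration": [29, 29], "country": ["US", "SE"]})
new_encoded = pd.get_dummies(new_raw, columns=["country"], prefix=["country"])

# reindex adds training-only columns filled with 0, drops columns the model
# never saw, and restores the training column order
aligned = new_encoded.reindex(columns=train_columns, fill_value=0)

print(aligned.columns.tolist())
```

Dropping an unseen category silently may or may not be appropriate for your data, so it can be useful to log which columns were added or removed before scoring.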


## 3.3 - Score the Data

Select the model object from your local directory.  This is the file you saved above in **Section 2.5**

In [180]:
file_list = widgets.Select(
    options=os.listdir(),
    description='Files:',
    disabled=False,
    layout=Layout(width='100%')
)

display(file_list)

Select(description='Files:', layout=Layout(width='100%'), options=('.DS_Store', 'Archive', 'ensemble_model.sav…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The cell below will load your model and score your selected dataset

In [181]:
#loaded_model = pickle.load(open(file_list.value, 'rb'))
loaded_model = joblib.load(file_list.value)
result = loaded_model.predict(master_data['score_table'])

View your scored dataset below:

In [182]:
master_data['score_table']['predicted'] = result
master_data['score_table'].head()

Unnamed: 0,goal,launch_month,launch_dow,duration,main_category_Art,main_category_Comics,main_category_Crafts,main_category_Dance,main_category_Design,main_category_Fashion,...,country_LU,country_MX,"country_N,0""",country_NL,country_NO,country_NZ,country_SE,country_SG,country_US,predicted
0,50.0,5,3,29,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,successful
1,5000.0,3,5,29,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,successful
2,7500.0,9,3,50,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,failed
3,12000.0,6,6,49,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,failed
4,50000.0,6,3,39,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,failed


## 3.4 - Export Your Scored Data

The cells below allow you to export your scored dataset

In [183]:
def dict_to_excel(dict_name, dframe, subfolder, timestamp = False):
    
    # Inputs: a dictionary of dataframes; timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes dataframes contained within a dictionary to xlsx (on your directory)

    # Build the filename suffix once; colons are stripped from the ISO timestamp
    suffix = '_' + re.sub(r"\:+", '', dt.datetime.now().isoformat()) + '.xlsx' if timestamp else '.xlsx'
    file_path = os.path.join(subfolder, dframe + suffix) if subfolder else dframe + suffix
        
    try:
        dict_name[dframe].to_excel(file_path, index = False)
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(dframe+suffix, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        print('Writing the data to the directory failed: {}'.format(e))
        
def dict_to_parquet(dict_name, dframe, subfolder, timestamp = False):
    
    # Inputs: a dictionary of dataframes; timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes dataframes contained within a dictionary to parquet (on your directory)

    # Build the filename suffix once; colons are stripped from the ISO timestamp
    suffix = '_' + re.sub(r"\:+", '', dt.datetime.now().isoformat()) + '.parquet.gzip' if timestamp else '.parquet.gzip'
    file_path = os.path.join(subfolder, dframe + suffix) if subfolder else dframe + suffix
        
    try:
        dict_name[dframe].to_parquet(file_path, compression='gzip')
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(dframe+suffix, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        print('Writing the data to the directory failed: {}'.format(e))
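The file-naming logic shared by the two export helpers above can be looked at in isolation. The sketch below is an illustrative rewrite, not the notebook's own function: `build_export_path` and its `now` parameter are hypothetical, with `now` added only so the timestamp can be fixed for a repeatable result.

```python
import datetime as dt
import os
import re

def build_export_path(table_name, extension, subfolder=None, timestamp=False, now=None):
    """Return the output path for a table, optionally timestamped and inside a subfolder."""
    stamp = ""
    if timestamp:
        moment = now if now is not None else dt.datetime.now()
        # Colons are stripped because they are not allowed in file names on some systems
        stamp = "_" + re.sub(r":+", "", moment.isoformat())
    file_name = table_name + stamp + extension
    return os.path.join(subfolder, file_name) if subfolder else file_name

fixed = dt.datetime(2018, 5, 1, 13, 45, 30)
print(build_export_path("score_table", ".xlsx"))
print(build_export_path("score_table", ".xlsx", subfolder="output", timestamp=True, now=fixed))
```

With the fixed timestamp above, the second call yields `score_table_2018-05-01T134530.xlsx` inside the `output` folder. Note that neither this sketch nor the helpers above create the subfolder, so it has to exist before writing.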


Select which datasets you would like to export along with where and how you would like them exported.

In [184]:
dict_keys = widgets.SelectMultiple(
    options=master_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

subfolder_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

output_type = widgets.RadioButtons(
    options=['xlsx','parquet'],
    value='xlsx',
    description='Output Type:',
    disabled=False
)

timestamp_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Timestamp:',
    disabled=False
)

subfolder_text = widgets.Text(
    value='output',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

def sub_folder_edit(y):
    if(y=='yes'):
        display(subfolder_text)
        print('Your file(s) will be written to the subfolder in {}[Your Entry Above]'.format(os.getcwd()+os.sep))
    else:
        print('Using {} folder'.format(os.getcwd()))
        
y = widgets.interactive(sub_folder_edit, y=subfolder_option)

display(y, timestamp_option, output_type)

SelectMultiple(description='Tables:', layout=Layout(width='50%'), options=('training_data', 'custom_table', 'm…

interactive(children=(RadioButtons(description='Subfolder:', options=('no', 'yes'), value='no'), Output()), _d…

RadioButtons(description='Timestamp:', options=('no', 'yes'), value='no')

RadioButtons(description='Output Type:', options=('xlsx', 'parquet'), value='xlsx')

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The cell below exports your selected datasets using the options you defined above.

In [185]:
if subfolder_option.value == 'yes':
    subfolder = subfolder_text.value
else:
    subfolder = None
    
dframe_list = list(dict_keys.value)

timestamp_boolean = timestamp_option.value == 'yes'
 
for df in dframe_list:
    if output_type.value == 'parquet':
        dict_to_parquet(master_data, df, subfolder, timestamp = timestamp_boolean)
    else:
        dict_to_excel(master_data, df, subfolder, timestamp = timestamp_boolean)

Successfully wrote score_table.xlsx with 15000 rows and 43 columns to the directory


#### If you need any support,  please feel free to contact me at bablanchard@mail.smu.edu