# Section 7. Submission 

Hello all! Glad to see you here. You are almost at the finish line. Now that you have read through all the tutorials, it is your turn to create your solution. This notebook covers the submission guidelines. To make sure your submission is valid, please confirm it with your mentor. 

* <a href='00 - DSC 2022 Welcome and Logistics.ipynb#top'>**Section 0. Welcome and Logistics**</a> 
* <a href='01 - DSC 2022 Problem Definition.ipynb#top'>**Section 1. Problem Definition**</a> 
* <a href='02 - DSC 2022 Exploratory Data Analysis.ipynb#top'>**Section 2. Exploratory Data Analysis**</a> 
* <a href='03 - DSC 2022 Hypothesis testing.ipynb#top'>**Section 3. Hypothesis Testing**</a> 
* <a href='04 - DSC 2022 Feature Engineering.ipynb#top'>**Section 4. Feature Engineering**</a> 
* <a href='05 - DSC 2022 Modeling.ipynb#top'>**Section 5. Modeling**</a>
* <a href='06 - DSC 2022 Modeling with Deep Learning.ipynb#top'>**Section 6. Modeling with Deep Learning**</a>
* <a href='07 - DSC 2022 Submission.ipynb#top'>**Section 7. Submission**</a>
  * [Data Literacy Track](#literacy)
  * [Data Modeling Track](#model)

<a id='literacy'></a>
## Data Literacy Track 

Your submission should contain **no more than five figures**, named following the pattern `team_DL_UK1_fig1.png` (for example) and saved in the environment of one of your teammates. Figures can contain subplots or facet grids, but keep in mind the tradeoff between showing as much information as possible and keeping the figure readable for your audience. You can find how to save a figure in notebook <a href='02 - DSC 2022 Exploratory Data Analysis.ipynb#top'>**Section 2. Exploratory Data Analysis**</a>. Your submission will be evaluated by our judging panel. 
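
The naming rule above can be sketched as a small helper that builds the file names and enforces the five-figure limit. This is only an illustration: the function name `figure_filenames` is hypothetical, and it assumes that `DL_UK1` in the example name is the team ID, so please confirm the exact pattern with your mentor.

```python
MAX_FIGURES = 5  # limit stated in the submission guidelines


def figure_filenames(team_id, n_figures):
    """Return file names such as team_DL_UK1_fig1.png (hypothetical helper)."""
    if not 1 <= n_figures <= MAX_FIGURES:
        raise ValueError(f"Expected between 1 and {MAX_FIGURES} figures, got {n_figures}")
    return [f"team_{team_id}_fig{i}.png" for i in range(1, n_figures + 1)]


# Assumption: 'DL_UK1' is the team ID part of the example file name.
names = figure_filenames("DL_UK1", 3)
print(names)
```

Each name can then be passed to the figure-saving call shown in Section 2, for example matplotlib's `fig.savefig(name)`.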

<a id='model'></a>
## Data Modeling Track 

As explained in an earlier notebook, you will be evaluated on data that has been **held out**. To confirm that your prediction algorithm works, you will have to generate predictions on a "dummy" holdout dataset that has the same shape as the real holdout data. Once submitted, **your mentors will replace the dummy file with the real one** and re-run your algorithm on the real holdout dataset to evaluate your predictions.

Please make sure that the cells below can run from top to bottom.

In [1]:
import pandas as pd
import numpy as np
import pickle as pk
from feature_engineering import *
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

### 1. Refit and save a model

<img src="fig/train_test_split.png" width=600 height=400 />

Recall that in Section 5 we performed a train-test split on the given dataset (the red and yellow parts) to estimate how our model would perform on unseen data. However, the truly unseen data is the holdout dataset. Therefore, before you submit, refit your model on the entire given dataset so that it gets to see more data. 
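
The two-step idea (estimate performance on a split, then refit on everything) can be sketched on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration and stand in for the engineered features and targets; the five target columns are an assumption based on the five-column prediction checked later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 200 rows, 4 features, 5 target columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ rng.normal(size=(4, 5)) + rng.normal(scale=0.1, size=(200, 5))

# Step 1: hold out part of the given data to estimate performance on unseen rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
eval_model = DecisionTreeRegressor(random_state=0, max_depth=10).fit(X_tr, y_tr)
print("R^2 on the internal test split:", eval_model.score(X_te, y_te))

# Step 2: refit the same configuration on all given rows before submitting.
final_demo_model = DecisionTreeRegressor(random_state=0, max_depth=10).fit(X_demo, y_demo)
print(final_demo_model.predict(X_demo[:3]).shape)
```

The score from step 1 is the performance estimate to report; the model from step 2 is the one to save, since no internal test rows remain to score it on.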

In [2]:
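# load the full given data set
cmg = pd.read_excel('cmg.xlsx', index_col = 'offeringId')
# test_frac = 0 keeps every row for training, so X_test and y_test are not used here
X_train, X_test, y_train, y_test = feature_engineering(cmg, test_frac = 0)
final_model = DecisionTreeRegressor(random_state=0, max_depth = 10).fit(X_train, y_train)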

In [3]:
model_name = 'final_model.pkl'
# save the fitted model to disk
with open(model_name,'wb') as file:
    pk.dump(final_model, file)

# load it back to confirm that the saved file can be read
with open(model_name,'rb') as file:
    saved_model = pk.load(file)
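
One way to check that a saved model survives the round trip is to compare predictions before and after loading. The sketch below uses a small synthetic model and a temporary file, both of which are assumptions made for illustration rather than part of the competition data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small synthetic model (assumption) standing in for final_model.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 3))
y_small = rng.normal(size=(50, 5))
model_before = DecisionTreeRegressor(random_state=0, max_depth=3).fit(X_small, y_small)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model_before, f)
    with open(path, "rb") as f:
        model_after = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model_before.predict(X_small), model_after.predict(X_small))
print(same_predictions)
```

Pickled scikit-learn models are best loaded with the same library version that saved them, and pickle files should only be loaded from sources you trust.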

### 2. Prepare holdout set for modeling 

Please replace the cell below with your preprocessing steps.

In [4]:
# later for evaluation, we will replace the file path with the real file path
holdout = pd.read_excel('holdout_dummy.xlsx', index_col = 'offeringId')

# fill NA's 
holdout.fillna(0, inplace = True)

# create new feature 
holdout = change_bank(holdout)

# feature selection: drop the columns that are not used as features
drop_cols = ['offeringPricingDate', 'offeringSubSector', 'issuerCusip', 'issuerName',
             'underwriters', 'leftLeadFirmId', 'leftLeadFirmName']
holdout.drop(columns = drop_cols, inplace = True)

# apply the same steps to the given data so that the preprocessor can be fit on it
cmg.fillna(0, inplace = True)
cmg = change_bank(cmg)
cmg.drop(columns = drop_cols, inplace = True)
# drop every column whose name contains 'post'
cmg.drop(columns = list(cmg.filter(like = 'post')), inplace = True)

# normalize numerical columns and one-hot encode categorical columns
numerical_cols = list(holdout.select_dtypes(include=np.number))
categorical_cols = list(holdout.select_dtypes(exclude=np.number))
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop = 'if_binary')
preprocessor = ColumnTransformer(
        transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])
# fit on the given data only, then apply the fitted transformation to the holdout set
preprocessor.fit(cmg)
holdout = preprocessor.transform(holdout)
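
Because the dummy holdout file will be swapped for the real one, the real data may contain a category that never appears in the given data. With its default settings, `OneHotEncoder` raises an error at transform time in that case. The sketch below shows one possible safeguard, `handle_unknown="ignore"`, on made-up data; the column names and values are hypothetical, and whether this setting suits your pipeline is a judgment call to discuss with your mentor.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data (assumption): 'utilities' appears only in the holdout rows.
train_demo = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0],
    "sector": ["tech", "energy", "tech", "health"],
})
holdout_demo = pd.DataFrame({
    "size": [2.5, 5.0],
    "sector": ["energy", "utilities"],
})

preprocessor_demo = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["size"]),
    # Unseen categories are encoded as all zeros instead of raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])

# Fit on the training rows only, then transform both sets with the same fitted object.
preprocessor_demo.fit(train_demo)
train_t = preprocessor_demo.transform(train_demo)
holdout_t = preprocessor_demo.transform(holdout_demo)
print(train_t.shape, holdout_t.shape)
```

Both outputs have the same number of columns, which is what the saved model needs in order to predict on the holdout rows.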

### 3. Save prediction

In [5]:
your_prediction = saved_model.predict(holdout)

**Important: The cell below checks whether your output is of the right dimension. Please don't make any modifications to the cell!**

In [6]:
print(your_prediction.shape == (holdout.shape[0], 5))

True


In [7]:
# replace XX by your team ID!
YOUR_TEAM_FILE = "team_XX_pred.txt"
np.savetxt(YOUR_TEAM_FILE, your_prediction, fmt='%s')
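
As an optional sanity check, the saved text file can be read back and compared with the array in memory. The sketch below does this with a made-up prediction array and a temporary directory; both are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np

# Made-up predictions (assumption): 8 rows and 5 columns.
demo_prediction = np.random.default_rng(0).normal(size=(8, 5))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "team_XX_pred.txt")
    # Same call as in the cell above, written to a temporary location.
    np.savetxt(path, demo_prediction, fmt="%s")
    reloaded = np.loadtxt(path)

print(reloaded.shape)
print(np.allclose(reloaded, demo_prediction))
```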

You have reached the end of this tutorial series.

# Competition takeaways
We hope that you have learned, and will keep exploring:
- How to formalize a problem in a data science framework 
- That running a notebook is not so hard, and that it is a good tool for exploring data and running code
- That many basic machine learning models are easily available and open source

In the future, when you come across interesting data sources, you can think of ways to quickly test their predictive power using these resources!