# Section 4. Feature engineering 

This notebook introduces feature engineering steps to carry out before modeling. Topics include dealing with NA's, feature creation, train test split and data transformation. Although feature engineering is critical, there is no one 'correct' way to engineer the features. The approaches in this notebook are not necessarily the best ones; use this notebook as a demo and start from here. 

### CONTENTS
* <a href='00 - DSC 2022 Welcome and Logistics.ipynb#top'>**Section 0. Welcome and Logistics**</a> 
* <a href='01 - DSC 2022 Problem Definition.ipynb#top'>**Section 1. Problem Definition**</a> 
* <a href='04 - DSC 2022 Feature Engineering.ipynb#top'>**Section 4. Feature Engineering**</a> 
  * [1. Deal with NA's](#na)
  * [2. Feature creation](#create)
  * [3. Transformation](#transform)
  * [4. Put everything together](#function)
* <a href='05 - DSC 2022 Modeling.ipynb#top'>**Section 5. Modeling**</a>
* <a href='06 - DSC 2022 Modeling with Deep Learning.ipynb#top'>**Section 6. Modeling with Deep Learning**</a>
* <a href='07 - DSC 2022 Submission.ipynb#top'>**Section 7. Submission**</a>

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

Again, we will read in the data first. In this notebook we will be making changes to the data frame, so a safer way is to make a deep copy of the original data frame and make changes on the copy. That way we don't accidentally modify the original data frame. It is always good to have a backup. 

In [2]:
cmg = pd.read_excel('cmg.xlsx', index_col = 'offeringId')
cmg_transformed = cmg.copy(deep = True)
cmg_transformed.head()

| offeringId | offeringPricingDate | offeringType | offeringSector | offeringSubSector | offeringDiscountToLastTrade | offeringPrice | issuerCusip | issuerName | pre15_Price_Normalized | pre14_Price_Normalized | ... | pre1_Price_Normalized | underwriters | totalBookrunners | leftLeadFirmId | leftLeadFirmName | post1_Price_Normalized | post7_Price_Normalized | post30_Price_Normalized | post90_Price_Normalized | post180_Price_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 2003-10-02 | IPO | Consumer Cyclical | Vehicles & Parts | 0.0 | 13.0 | 501889208 | BharCap Acquisition Corp. | NaN | NaN | ... | NaN | [{'firmId': '15af8b8d-c949-4fa0-b35e-a6482d3ca... | 2 | 759ce574-3755-480b-8b83-c614f4568db1 | Baird | -0.855769 | -0.85 | -0.831635 | -0.825481 | -0.836538 |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 2005-07-21 | IPO | Communication Services | Telecom Services | 0.0 | 13.0 | 209034107 | GrandSouth Bancorporation | NaN | NaN | ... | NaN | [{'firmId': 'dac135c0-9e99-4362-9762-7179a0023... | 2 | 5eb63e75-8f95-464e-86fe-3222865c54ef | Credit Suisse | 0.060769 | 0.136923 | 0.041538 | -0.018462 | -0.016923 |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 2005-08-04 | IPO | Communication Services | Internet Content & Information | 0.0 | 27.0 | 056752108 | Brand Velocity Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | dac135c0-9e99-4362-9762-7179a0023c9e | Goldman Sachs & Co. | -0.546148 | -0.637407 | -0.711852 | -0.746296 | -0.798111 |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | 2005-11-10 | IPO | Industrials | Consulting Services | 0.0 | 16.0 | G47567105 | ProLung Inc. | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | cd9cd378-73b5-4cef-8666-ad2c5149ccd8 | Goldman Sachs & Co. | -0.699502 | -0.697394 | -0.682808 | -0.566124 | -0.512702 |
| 96a13598-121a-41c0-83b5-448843cd8709 | 2006-02-03 | IPO | Energy | Oil & Gas Midstream | 0.0 | 21.0 | 29273V100 | Golden Star Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': '7d932034-3e85-46ab-97b4-b6e8e86ee... | 3 | 8fdb6c2d-3b35-40d4-a886-0a3461b42d98 | UBS Investment Bank | -0.730357 | -0.73869 | -0.740595 | -0.703571 | -0.688095 |


<a id='na'></a>
## 1. Deal with NA's

Recall that the data set provided contains NA's in the pre-deal performance columns for IPO's, and most models won't work with NA's. In this case, however, we probably don't want to drop all rows that contain NA's, since we would then lose every observation that is an IPO. 

Another approach to deal with NA's is to impute the missing values. There are various ways to impute these values. 
Given the definition of pre-deal prices, for example 
$$\text{pre15_Price_Normalized} = \frac{\text{raw price 15 days prior to deal announcement} - \text{offering price}}{\text{offering price}},$$ we could fill all the normalized pre-deal prices with 0 for all IPO's, which assumes that if those raw prices did exist, they would be the same as the offering price. Note that the cell below fills every NA in the data frame with 0, not only the pre-deal columns. 

In [3]:
cmg_transformed.fillna(0, inplace = True)
cmg_transformed.isna().sum()

offeringPricingDate            0
offeringType                   0
offeringSector                 0
offeringSubSector              0
offeringDiscountToLastTrade    0
offeringPrice                  0
issuerCusip                    0
issuerName                     0
pre15_Price_Normalized         0
pre14_Price_Normalized         0
pre13_Price_Normalized         0
pre12_Price_Normalized         0
pre11_Price_Normalized         0
pre10_Price_Normalized         0
pre9_Price_Normalized          0
pre8_Price_Normalized          0
pre7_Price_Normalized          0
pre6_Price_Normalized          0
pre5_Price_Normalized          0
pre4_Price_Normalized          0
pre3_Price_Normalized          0
pre2_Price_Normalized          0
pre1_Price_Normalized          0
underwriters                   0
totalBookrunners               0
leftLeadFirmId                 0
leftLeadFirmName               0
post1_Price_Normalized         0
post7_Price_Normalized         0
post30_Price_Normalized        0
post90_Pri
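
As a minimal sketch of a narrower alternative, the snippet below fills only the pre-deal columns of IPO rows and leaves NA's in other columns untouched. The toy data frame and its values are invented for illustration; only the column names follow the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cmg; the values are invented for illustration.
toy = pd.DataFrame({
    "offeringType": ["IPO", "OVERNIGHT_FO", "IPO"],
    "offeringSector": ["Energy", None, "Technology"],
    "pre2_Price_Normalized": [np.nan, 0.04, np.nan],
    "pre1_Price_Normalized": [np.nan, 0.02, np.nan],
})

# Select only the normalized pre-deal price columns.
pre_cols = toy.filter(regex=r"^pre\d+_Price_Normalized$").columns
is_ipo = toy["offeringType"] == "IPO"

# Fill the pre-deal columns with 0 for IPO rows only.
toy.loc[is_ipo, pre_cols] = toy.loc[is_ipo, pre_cols].fillna(0)

# The missing offeringSector is still reported as NA.
print(toy.isna().sum())
```

Whether a remaining NA should then be dropped, filled, or flagged is a separate modeling decision.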

<a id='create'></a>
## 2. Feature creation 

Sometimes, the variables we need for inference are not present in the given data frame, so we need to create features based on what's given. 

For example, one hypothesis that we had earlier is that if an issuer switched lead banks from its past deal, the current deal performs worse. We might want to incorporate this hypothesis in our model, but the original data frame has no such variable. So for each observation (deal) we define a new feature **changeBank** that indicates whether the deal uses a different lead bank than the issuer's previous deal.

First, let's check whether issuers with more than one offering exist in our data set.

In [4]:
temp = cmg.groupby(by = 'issuerCusip').size().to_frame('numOffering').sort_values('numOffering', ascending = False)
print('number of issuers that have more than 1 offering', sum(temp.numOffering > 1)) 
temp

number of issuers that have more than 1 offering 2010


| issuerCusip | numOffering |
|---|---|
| 649604501 | 17 |
| 647551100 | 16 |
| 008492100 | 14 |
| 015271109 | 13 |
| 570759100 | 12 |
| ... | ... |
| 550351100 | 1 |
| 550424105 | 1 |
| 55068A100 | 1 |
| 55087P104 | 1 |


Looks like we have a bunch of issuers that made multiple offerings. For example, we can pull out all the offerings by the issuer 649604501, ordered by date. We see that the issuer switched lead firms multiple times (from Ladenburg Thalmann & Co. Inc. to Deutsche Bank Securities, UBS Investment Bank and finally Morgan Stanley). 

In [5]:
def queryOffering(issuerCusip):
    return cmg[cmg.issuerCusip == issuerCusip].sort_values(by = ['offeringPricingDate']).filter(items = ['offeringId', 'offeringPricingDate', 'offeringType', 'leftLeadFirmName'])

queryOffering('649604501')

| offeringId | offeringPricingDate | offeringType | leftLeadFirmName |
|---|---|---|---|
| 41912c5e-4348-4398-aff5-f6d1e584af53 | 2011-06-28 | OVERNIGHT_FO | Ladenburg Thalmann & Co. Inc. |
| eb44b8ae-3fdc-4ff4-b468-2622007c4d45 | 2011-12-01 | OVERNIGHT_FO | Ladenburg Thalmann & Co. Inc. |
| 3a7c2db7-93f1-4141-9a89-622b69820414 | 2012-05-25 | OVERNIGHT_FO | Ladenburg Thalmann & Co. Inc. |
| f6002e6f-1258-4f89-8625-e740414f103d | 2012-07-12 | OVERNIGHT_FO | Ladenburg Thalmann & Co. Inc. |
| 4b7d3caf-bc85-4b97-ad03-863d0ecb8fef | 2012-08-16 | OVERNIGHT_FO | Deutsche Bank Securities |
| 4daf79f9-4eb2-4ea4-a9a6-c121040703c1 | 2012-10-03 | OVERNIGHT_FO | Deutsche Bank Securities |
| d351fe2d-d1ac-47b5-8aa4-7677f0e40091 | 2013-04-29 | REGISTERED_BLOCK | Deutsche Bank Securities |
| 42f3b9ed-e84f-40f6-9da2-edcf04923568 | 2014-01-07 | OVERNIGHT_FO | UBS Investment Bank |
| 9e5b0b4b-e947-4000-a4d0-f54407d66363 | 2014-04-02 | OVERNIGHT_FO | UBS Investment Bank |
| c753d0d3-e132-4dd3-b1e0-f4bc03fdfba9 | 2014-11-21 | OVERNIGHT_FO | UBS Investment Bank |


Without further ado, let's create our new feature **changeBank**. Our strategy here is to sort the data set by issuerCusip and pricing date, then line each row up with the previous row by shifting the relevant columns down by one. After that, we set changeBank to true for rows where both deals come from the same issuer (i.e. issuerCusip == lagissuerCusip) and the current deal uses a different lead firm than the previous deal (i.e. leftLeadFirmName != lagLeftLeadFirmName).

In [6]:
# keep only the columns we need and put each issuer's deals in chronological order
temp = cmg.filter(items = ['offeringId', 'issuerCusip', 'offeringPricingDate', 'leftLeadFirmName']).sort_values(by = ['issuerCusip', 'offeringPricingDate'])
# shift by one row so that each deal sits next to the deal in the previous row
temp['lagLeftLeadFirmName'] = temp['leftLeadFirmName'].shift(1)
temp['lagissuerCusip'] = temp['issuerCusip'].shift(1)
# changeBank is True only when the previous row is the same issuer with a different lead firm
temp['changeBank'] = temp.apply(lambda x: True if x.leftLeadFirmName != x.lagLeftLeadFirmName and x.issuerCusip == x.lagissuerCusip else False, axis =1)
print('number of offerings that change bank from the previous offering', len(temp[temp.changeBank == True]))
temp.head(5)

number of offerings that change bank from the previous offering 1921


| offeringId | issuerCusip | offeringPricingDate | leftLeadFirmName | lagLeftLeadFirmName | lagissuerCusip | changeBank |
|---|---|---|---|---|---|---|
| bd9e5775-0981-48a0-87a0-01387de77e3f | 307108 | 2014-10-01 | William Blair | NaN | NaN | False |
| 82fb9a9d-f7a2-4759-92f6-32a8e247bfb3 | 307108 | 2017-11-07 | Raymond James | William Blair | 307108.0 | True |
| 76f2a97e-7b6a-4452-8c0a-8cbb5526cb7f | 307108 | 2017-11-14 | Raymond James | Raymond James | 307108.0 | False |
| 6e6de148-cbfd-4e08-b5f8-fbffd5a12740 | 380204 | 2020-10-21 | Morgan Stanley | Raymond James | 307108.0 | False |
| 6e6de148-cbfd-4e08-b5f8-fbffd5a12740 | 380204 | 2020-10-21 | Morgan Stanley | Morgan Stanley | 380204.0 | False |
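
The same lag logic can also be sketched with a grouped shift, which looks up the previous deal within each issuer and so needs no lagissuerCusip helper column. This is an alternative sketch on invented toy data, not the approach used in the rest of this notebook; it assumes leftLeadFirmName is never missing.

```python
import pandas as pd

# Toy deals for two hypothetical issuers "A" and "B".
toy = pd.DataFrame(
    {
        "issuerCusip": ["A", "A", "A", "B", "B"],
        "offeringPricingDate": pd.to_datetime(
            ["2014-10-01", "2017-11-07", "2017-11-14", "2019-01-05", "2020-10-21"]
        ),
        "leftLeadFirmName": [
            "William Blair", "Raymond James", "Raymond James",
            "Morgan Stanley", "Morgan Stanley",
        ],
    },
    index=pd.Index(["d1", "d2", "d3", "d4", "d5"], name="offeringId"),
)

# Order each issuer's deals chronologically.
toy = toy.sort_values(["issuerCusip", "offeringPricingDate"])

# Previous lead firm within the same issuer; NaN for an issuer's first deal.
prev_bank = toy.groupby("issuerCusip")["leftLeadFirmName"].shift(1)

# A switch needs a previous deal and a different lead firm.
toy["changeBank"] = prev_bank.notna() & (toy["leftLeadFirmName"] != prev_bank)

print(toy["changeBank"].tolist())  # [False, True, False, False, False]
```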


After creating the new feature, we add it to the original data frame. 

In [7]:
cmg_transformed = cmg_transformed.merge(temp[['changeBank']], how = 'left', left_index = True, right_index = True)
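
The index-on-index merge above relies on offeringId being unique. If an ID appears more than once on both sides, the merge returns every pairing and silently adds rows. The sketch below, on invented toy data, shows the effect and one way to guard against it with the `validate` argument of `merge`; whether dropping duplicates is appropriate for the real data is an assumption you would need to check.

```python
import pandas as pd

# Toy frames where the ID "b" is duplicated on both sides.
idx = pd.Index(["a", "b", "b"], name="offeringId")
left = pd.DataFrame({"offeringPrice": [13.0, 27.0, 27.0]}, index=idx)
right = pd.DataFrame({"changeBank": [False, True, True]}, index=idx)

# Each duplicated "b" on the left pairs with each "b" on the right.
merged = left.merge(right, how="left", left_index=True, right_index=True)
print(len(left), len(merged))  # 3 5

# validate="one_to_one" makes pandas raise instead of adding rows.
try:
    left.merge(right, how="left", left_index=True, right_index=True,
               validate="one_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

# One possible fix: keep the first row per ID before merging.
left_unique = left[~left.index.duplicated(keep="first")]
right_unique = right[~right.index.duplicated(keep="first")]
clean = left_unique.merge(right_unique, how="left", left_index=True,
                          right_index=True, validate="one_to_one")
print(len(clean))  # 2
```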

We can wrap up the process of creating the new feature changeBank into a function. We will be using this function later. 

In [8]:
def change_bank(df):
    temp = df.filter(items = ['offeringId', 'issuerCusip', 'offeringPricingDate', 'leftLeadFirmName']).sort_values(by = ['issuerCusip', 'offeringPricingDate'])
    temp['lagLeftLeadFirmName'] = temp['leftLeadFirmName'].shift(1)
    temp['lagissuerCusip'] = temp['issuerCusip'].shift(1)
    temp['changeBank'] = temp.apply(lambda x: True if x.leftLeadFirmName != x.lagLeftLeadFirmName and x.issuerCusip == x.lagissuerCusip else False, axis =1)
    df = df.merge(temp[['changeBank']], how = 'left', left_index = True, right_index = True)
    return df

<a id='transform'></a>
## 3. Transformation
- [Train test split](#split)
- [Column transformation](#trans)

<a id='split'></a>
### Train test split

In data science, what we are really trying to do is learn patterns from known data and then apply those patterns to unknown data. In this competition, the unknown data is the holdout data (the navy part), which is not provided to you. 

Then how would you know beforehand whether you have trained a reasonable model? The answer is **train test split**! Some people also call it train validation split. In one sentence, the reason for a train test split is to **get an estimate of how your model would perform on the unknown data**! In this competition, you are provided with a data frame that contains both predictors and outcomes. You could divide the provided data into two parts: train (the red part) and test (the yellow part). You fit models on the train set **pretending that you don't see the test set**; after you have trained your models, you predict on the test set. Since you know the true outcome values for the test set, you can compare your predictions with them to estimate how your model would perform on new, unseen data. 


After getting an estimate of your model performance, do not forget to refit your model on the entire provided data set (the red and the yellow parts), since the real unknown data is the holdout set. You want your final model to be trained on as much data as possible!

<img src="fig/train_test_split.png" width=600 height=400 />

Hint: The **holdout data set contains offerings that were made later than the offerings in the provided data set**. Given this information, is there a better way to do the train test split that would give you a better estimate of the test error? 

In [9]:
y = cmg_transformed.filter(like = 'post')
X = cmg_transformed.loc[:, ~cmg_transformed.columns.isin(list(y))].drop(columns = ['offeringPricingDate', 'offeringSubSector', 'issuerCusip', 'issuerName', 'underwriters', 'leftLeadFirmId', 'leftLeadFirmName'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
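
One possible reading of the hint is to split by time instead of at random, so that the test set contains only the latest offerings. The sketch below is an assumption-laden illustration: `time_based_split` is a hypothetical helper, and the toy values are invented.

```python
import pandas as pd

# Toy offerings in arbitrary order; values are invented.
toy = pd.DataFrame({
    "offeringPricingDate": pd.to_datetime(
        ["2012-05-25", "2003-10-02", "2020-10-21", "2005-07-21", "2014-01-07"]
    ),
    "offeringPrice": [16.0, 13.0, 21.0, 13.0, 27.0],
    "post1_Price_Normalized": [0.01, -0.86, 0.03, 0.06, -0.02],
})


def time_based_split(df, date_col, test_frac=0.4):
    """Hypothetical helper: hold out the most recent test_frac of rows."""
    ordered = df.sort_values(date_col)
    n_test = int(round(len(ordered) * test_frac))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


train, test = time_based_split(toy, "offeringPricingDate", test_frac=0.4)

# Every training offering is priced before every test offering.
print(train["offeringPricingDate"].max() < test["offeringPricingDate"].min())
```

A split like this mimics the relation between the provided data and the holdout set, at the cost of never testing on older market conditions.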

<a id='trans'></a>
### Column transformation

```ColumnTransformer``` from sklearn is a very useful class that can apply different transformations to categorical and continuous columns in a single step. You can find more details <a href='https://scikit-learn.org/stable/search.html?q=columntransformer'>here</a>.

For **categorical variables**, we would like to do one-hot encoding as shown in the figure below. 

<img src="fig/onehot.png" width=400 height=200 />

For **continuous variables**, we would like to standardize features by removing the mean and scaling to unit variance so that the scale of a feature won't affect model fitting. We always do the train-test split before standard scaling, because the test set should be transformed with the mean and variance of the train set. 

In [10]:
numerical_cols = list(X.select_dtypes(include=np.number))
categorical_cols = list(X.select_dtypes(exclude=np.number))
numerical_cols
categorical_cols

['offeringType', 'offeringSector', 'changeBank']

In [11]:
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop = 'if_binary')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

X_train_transformed = preprocessor.fit_transform(X_train)
# reuse the statistics fitted on the train set; do not refit on the test set
X_test_transformed = preprocessor.transform(X_test)
print(type(X_train_transformed), X_train_transformed.shape)

<class 'numpy.ndarray'> (5991, 35)
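
To see why the test set is transformed with statistics learned from the train set, here is a small sketch on invented toy data. It also passes `handle_unknown="ignore"` to `OneHotEncoder`, which is not used elsewhere in this notebook, so that a category seen only in the test set is encoded as all zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train and test sets; the test row has an unseen offeringType.
train = pd.DataFrame({
    "offeringPrice": [10.0, 20.0, 30.0],
    "offeringType": ["IPO", "OVERNIGHT_FO", "IPO"],
})
test = pd.DataFrame({
    "offeringPrice": [40.0],
    "offeringType": ["REGISTERED_BLOCK"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["offeringPrice"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["offeringType"]),
    ],
    sparse_threshold=0,  # always return a dense numpy array
)

# Learn mean, variance and categories from the train set only.
train_t = preprocessor.fit_transform(train)
# Apply the same learned statistics to the test set.
test_t = preprocessor.transform(test)

print(train_t[:, 0].mean())  # close to 0 by construction
print(test_t)                # scaled with the train mean (20) and train std
```

Calling `fit_transform` on the test set instead would rescale it with its own mean and variance, so the same raw value would map to different features in train and test.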


The ColumnTransformer has automatically converted our pandas data frame to a numpy array. We can always convert it back to a pandas data frame and add the column names. After the transformation, we observe that the means of the numerical columns are now very close to 0, which is exactly what we wanted.

In [12]:
# get_feature_names_out requires scikit-learn >= 1.0 (older releases used get_feature_names)
cols = numerical_cols + list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols))
X_train_transformed = pd.DataFrame(X_train_transformed, columns = cols )
X_test_transformed = pd.DataFrame(X_test_transformed, columns = cols)
X_train_transformed.mean()

offeringDiscountToLastTrade              2.313752e-16
offeringPrice                            2.834858e-17
pre15_Price_Normalized                  -3.921917e-16
pre14_Price_Normalized                   6.485103e-17
pre13_Price_Normalized                  -1.581503e-16
pre12_Price_Normalized                  -2.223712e-16
pre11_Price_Normalized                   2.191954e-16
pre10_Price_Normalized                  -1.541614e-16
pre9_Price_Normalized                    6.476069e-17
pre8_Price_Normalized                    9.424665e-17
pre7_Price_Normalized                    8.869183e-17
pre6_Price_Normalized                   -2.145486e-16
pre5_Price_Normalized                   -1.770316e-16
pre4_Price_Normalized                    1.851067e-16
pre3_Price_Normalized                    1.321760e-16
pre2_Price_Normalized                    3.803593e-16
pre1_Price_Normalized                   -2.180580e-16
totalBookrunners                        -1.138391e-16
offeringType_IPO            

<a id='function'></a>
## 4. Put everything together

We have introduced multiple ways of engineering our features. However, rather than running through all the cells above, isn't it more satisfying to have a single function that takes in the original data frame and outputs transformed data? The cell below is a function wrapper for all the engineering steps we had earlier. Plus, once everything is in a function, specifying parameters becomes much easier: for example, the fraction of data for the test set, whether or not to normalize the data, and so on. This function is stored in *feature_engineering.py*.

In [13]:
def feature_engineering(df, test_frac = 0.2, normalize = True, random_state = 42):

    # fill na 
    df = df.fillna(0)
    
    # create new feature 
    df = change_bank(df)
    
    # split to X&y and feature selection 
    y = df.filter(like = 'post')
    X = df.loc[:, ~df.columns.isin(list(y))].drop(columns = ['offeringPricingDate', 'offeringSubSector', 'issuerCusip', 'issuerName', 'underwriters', 'leftLeadFirmId', 'leftLeadFirmName'])
    
    # train test split 
    if test_frac != 0: 
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_frac, random_state=random_state)
    elif test_frac == 0:
        X_train, X_test, y_train, y_test = X, pd.DataFrame(columns = list(X)), y, pd.DataFrame(columns = list(y))
    if not normalize:  return X_train, X_test, y_train, y_test
    
    # normalize data 
    numerical_cols = list(X.select_dtypes(include=np.number))
    categorical_cols = [col for col in list(X) if col not in numerical_cols]
    numerical_transformer = StandardScaler()
    categorical_transformer = OneHotEncoder(drop = 'if_binary')
    preprocessor = ColumnTransformer(
        transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])
    X_train_transformed = preprocessor.fit_transform(X_train)
    # get_feature_names_out requires scikit-learn >= 1.0 (older releases used get_feature_names)
    cols = numerical_cols + list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols))
    X_train_transformed = pd.DataFrame(X_train_transformed, columns = cols, index = X_train.index )
    if X_test.shape[0] != 0: 
        X_test_transformed = preprocessor.transform(X_test)
        X_test_transformed = pd.DataFrame(X_test_transformed, columns = cols, index = X_test.index)
    else: 
        X_test_transformed = X_test   
    return X_train_transformed, X_test_transformed, y_train, y_test

In [14]:
X_train_transformed, X_test_transformed, y_train, y_test = feature_engineering(cmg)
X_train_transformed.head()

| offeringId | offeringDiscountToLastTrade | offeringPrice | pre15_Price_Normalized | pre14_Price_Normalized | pre13_Price_Normalized | pre12_Price_Normalized | pre11_Price_Normalized | pre10_Price_Normalized | pre9_Price_Normalized | pre8_Price_Normalized | ... | offeringSector_Consumer Cyclical | offeringSector_Consumer Defensive | offeringSector_Energy | offeringSector_Financial Services | offeringSector_Healthcare | offeringSector_Industrials | offeringSector_Real Estate | offeringSector_Technology | offeringSector_Utilities | changeBank_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5930a8cd-0703-4a73-8247-77e84bda7028 | -2.05567 | -0.336381 | -0.057339 | -0.071574 | -0.07265 | -0.083294 | -0.080572 | -0.071593 | -0.085162 | -0.086253 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 599ea9d8-ab39-4f64-9ffb-fd85edcb15ff | 0.60087 | -0.35938 | -0.063319 | -0.062724 | -0.063376 | -0.064914 | -0.065365 | -0.065639 | -0.06611 | -0.066429 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6aec700f-222a-40a6-8dae-f472c46dbcd1 | 0.299699 | -0.434124 | -0.07262 | -0.066542 | -0.055603 | -0.059997 | -0.054745 | -0.046673 | -0.049256 | -0.038926 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 41e8dcfa-e6e3-442c-96e4-b510b0620347 | 0.60087 | -0.35938 | -0.063319 | -0.062724 | -0.063376 | -0.064914 | -0.065365 | -0.065639 | -0.06611 | -0.066429 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3cd1799a-0146-491e-b2b8-ff3a038586b6 | 0.252646 | -0.362829 | -0.054213 | -0.058364 | -0.060417 | -0.058787 | -0.052325 | -0.055362 | -0.053953 | -0.03741 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
