# Section 1. Problem Definition

* <a href='00 - DSC 2022 Welcome and Logistics.ipynb#top'>**Section 0. Welcome and Logistics**</a> 
* <a href='01 - DSC 2022 Problem Definition.ipynb#top'>**Section 1. Problem Definition**</a> 
  * [1. Data dictionary](#var)
  * [2. Get started with the data](#data)
* <a href='02 - DSC 2022 Exploratory Data Analysis.ipynb#top'>**Section 2. Exploratory Data Analysis**</a> 
* <a href='03 - DSC 2022 Hypothesis testing.ipynb#top'>**Section 3. Hypothesis Testing**</a> 
* <a href='04 - DSC 2022 Feature Engineering.ipynb#top'>**Section 4. Feature Engineering**</a> 
* <a href='05 - DSC 2022 Modeling.ipynb#top'>**Section 5. Modeling**</a>
* <a href='06 - DSC 2022 Modeling with Deep Learning.ipynb#top'>**Section 6. Modeling with Deep Learning**</a>
* <a href='07 - DSC 2022 Submission.ipynb#top'>**Section 7. Submission**</a>

The problem for this year's competition is to **use deal information to predict deal performance**. Our goal is to lower risk by participating less in deals that are likely to break price, and to increase alpha by raising our indication in deals that are likely to perform well.

Below are **some hypotheses** that we have in mind and would like to test out.


1. Follow-on offerings perform better than blocks.
2. Stock performance ahead of deal announcement is a predictor of post-deal performance. 
3. There is a correlation between discount to offering and performance.
4. Sectors matter. 
5. If the issuer switched lead banks since its past deals, the current deal performs worse.

<a id='var'></a>
## 1. Data dictionary

This section describes the features in the provided data set (cmg.xlsx). In our problem definition, each observation (a deal) corresponds to a row in the data frame, and each row contains the features listed below.

**`Predictors`**

Predictors can be divided into four categories: offering data, issuer data, past performance data, and underwriter data.

1. <u>Offering data </u> 
- offeringId: unique IDs of offerings
- offeringPricingDate: time when offerings are priced 
- offeringType: types of offerings ('IPO', 'OVERNIGHT_FO', 'MARKETED_FO', 'UNREGISTERED_BLOCK', 'REGISTERED_BLOCK', 'FO')
- offeringSector: sectors of the offerings ('Basic Materials', 'Healthcare', 'Financial Services', 'Consumer Defensive', 'Consumer Cyclical', 'Industrials', 'Communication Services', 'Technology', 'Utilities', 'Real Estate','Energy')
- offeringSubSector: subsectors of the offerings
- offeringDiscountToLastTrade: discount of the offering relative to the last trade, stored as a float
- offeringPrice: filing price 

2. <u>Issuer data </u>
- issuerCusip: unique IDs of issuers 
- issuerName: name of issuers

3. <u>Past performance data</u>: stock prices on each of the 15 days before the deal announcement, normalized with respect to the filing price. For example, pre15_Price_Normalized is calculated as
$$\text{pre15_Price_Normalized} = \frac{\text{raw price 15 days prior to deal announcement} - \text{offering price}}{\text{offering price}}.$$
Note that **we don't have past performance data for IPOs**, so the features listed below are NA for IPOs.

- pre15_Price_Normalized
- pre14_Price_Normalized  
...
- pre1_Price_Normalized   
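
The normalization formula above can be sketched in a few lines of Python. This is only an illustration: the helper name `normalize_price` and the two prices are hypothetical and are not taken from cmg.xlsx.

```python
def normalize_price(raw_price: float, offering_price: float) -> float:
    """Return the price change relative to the offering price."""
    return (raw_price - offering_price) / offering_price


# Hypothetical values for illustration only (not from cmg.xlsx).
offering_price = 20.0
raw_price_15_days_before = 22.0

pre15 = normalize_price(raw_price_15_days_before, offering_price)
print(pre15)  # 0.1, i.e. the stock traded 10% above the offering price
```

A positive value means the stock traded above the offering price on that day, and a negative value means it traded below.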


4. <u>Underwriter data</u>
- underwriters: list of all underwriters of a deal, with each underwriter's ID, name, economic percentage, and role in the deal
- totalBookrunners: total number of bookrunners for the offering
- leftLeadFirmId: unique IDs of lead firms 
- leftLeadFirmName: name of lead firms

**`Outcomes`**

Outcomes are returns with respect to filing price. For example, 
$$\text{post180_Price_Normalized} = \frac{\text{raw price 180 days after deal announcement} - \text{offering price}}{\text{offering price}}.$$

- post1_Price_Normalized 
- post7_Price_Normalized 
- post30_Price_Normalized 
- post90_Price_Normalized 
- post180_Price_Normalized 
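
Because each outcome is a return relative to the offering price, the formula can be inverted to recover the implied raw price. The sketch below assumes the definition above holds exactly. The helper name `raw_price_from_normalized` is hypothetical, and the two inputs are copied from the first row of the `cmg.head(5)` output shown later in this notebook.

```python
def raw_price_from_normalized(normalized: float, offering_price: float) -> float:
    """Invert normalized = (raw - offering) / offering to get the raw price."""
    return offering_price * (1.0 + normalized)


# Values from the first displayed row: offeringPrice and post1_Price_Normalized.
offering_price = 13.0
post1_normalized = -0.855769

implied_raw_price = raw_price_from_normalized(post1_normalized, offering_price)
print(round(implied_raw_price, 3))  # roughly 1.875
```

Reading outcomes this way can help with sanity checks: a value of -0.86 means the stock lost about 86% of the offering price.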

<a id='data'></a>
## 2. Get started with the data

pandas and NumPy are two packages that data scientists use heavily when working with data in Python, so let's import both first. In this section, you will learn how to read in the data and how to index and select data from a pandas data frame.

- [Read in data](#read)
- [Index and select data](#index)

In [1]:
import pandas as pd
import numpy as np

<a id='read'></a>
### Read in data 

From the data dictionary, we learned that offeringId is the unique ID of each offering, so it uniquely identifies each observation (each row). Therefore, when reading in the data, we specify offeringId as the index column.

In [2]:
cmg = pd.read_excel('cmg.xlsx', index_col = 'offeringId')
print("The data contains {} rows and {} columns.".format(cmg.shape[0], cmg.shape[1]))
cmg.head(5)

The data contains 8489 rows and 32 columns.


| offeringId | offeringPricingDate | offeringType | offeringSector | offeringSubSector | offeringDiscountToLastTrade | offeringPrice | issuerCusip | issuerName | pre15_Price_Normalized | pre14_Price_Normalized | ... | pre1_Price_Normalized | underwriters | totalBookrunners | leftLeadFirmId | leftLeadFirmName | post1_Price_Normalized | post7_Price_Normalized | post30_Price_Normalized | post90_Price_Normalized | post180_Price_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 2003-10-02 | IPO | Consumer Cyclical | Vehicles & Parts | 0.0 | 13.0 | 501889208 | BharCap Acquisition Corp. | NaN | NaN | ... | NaN | [{'firmId': '15af8b8d-c949-4fa0-b35e-a6482d3ca... | 2 | 759ce574-3755-480b-8b83-c614f4568db1 | Baird | -0.855769 | -0.85 | -0.831635 | -0.825481 | -0.836538 |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 2005-07-21 | IPO | Communication Services | Telecom Services | 0.0 | 13.0 | 209034107 | GrandSouth Bancorporation | NaN | NaN | ... | NaN | [{'firmId': 'dac135c0-9e99-4362-9762-7179a0023... | 2 | 5eb63e75-8f95-464e-86fe-3222865c54ef | Credit Suisse | 0.060769 | 0.136923 | 0.041538 | -0.018462 | -0.016923 |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 2005-08-04 | IPO | Communication Services | Internet Content & Information | 0.0 | 27.0 | 056752108 | Brand Velocity Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | dac135c0-9e99-4362-9762-7179a0023c9e | Goldman Sachs & Co. | -0.546148 | -0.637407 | -0.711852 | -0.746296 | -0.798111 |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | 2005-11-10 | IPO | Industrials | Consulting Services | 0.0 | 16.0 | G47567105 | ProLung Inc. | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | cd9cd378-73b5-4cef-8666-ad2c5149ccd8 | Goldman Sachs & Co. | -0.699502 | -0.697394 | -0.682808 | -0.566124 | -0.512702 |
| 96a13598-121a-41c0-83b5-448843cd8709 | 2006-02-03 | IPO | Energy | Oil & Gas Midstream | 0.0 | 21.0 | 29273V100 | Golden Star Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': '7d932034-3e85-46ab-97b4-b6e8e86ee... | 3 | 8fdb6c2d-3b35-40d4-a886-0a3461b42d98 | UBS Investment Bank | -0.730357 | -0.73869 | -0.740595 | -0.703571 | -0.688095 |
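
Since offeringId is used as the index, it may be worth confirming that it really is unique after loading. The sketch below is a stand-in, not the real data: it reads a small made-up CSV string with `pd.read_csv` (which accepts `index_col` the same way `pd.read_excel` does) so that it runs without cmg.xlsx. The shortened IDs and prices are hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for cmg.xlsx with three deals (IDs shortened for readability).
csv_text = """offeringId,offeringType,offeringPrice
a1,IPO,13.0
b2,FO,27.5
c3,IPO,16.0
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col="offeringId")

print(toy.index.is_unique)  # True when no offeringId repeats
print(toy.shape)            # (rows, columns); the index is not counted as a column
```

On the real data frame, the same check would be `cmg.index.is_unique`.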


The cell below shows you a way of getting column names in a data frame. You could also do `cmg.columns` to get column names. 

In [3]:
list(cmg)

['offeringPricingDate',
 'offeringType',
 'offeringSector',
 'offeringSubSector',
 'offeringDiscountToLastTrade',
 'offeringPrice',
 'issuerCusip',
 'issuerName',
 'pre15_Price_Normalized',
 'pre14_Price_Normalized',
 'pre13_Price_Normalized',
 'pre12_Price_Normalized',
 'pre11_Price_Normalized',
 'pre10_Price_Normalized',
 'pre9_Price_Normalized',
 'pre8_Price_Normalized',
 'pre7_Price_Normalized',
 'pre6_Price_Normalized',
 'pre5_Price_Normalized',
 'pre4_Price_Normalized',
 'pre3_Price_Normalized',
 'pre2_Price_Normalized',
 'pre1_Price_Normalized',
 'underwriters',
 'totalBookrunners',
 'leftLeadFirmId',
 'leftLeadFirmName',
 'post1_Price_Normalized',
 'post7_Price_Normalized',
 'post30_Price_Normalized',
 'post90_Price_Normalized',
 'post180_Price_Normalized']
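
With this many columns, it can be convenient to build the lists of predictor and outcome columns from their name prefixes rather than typing them out. The following is a minimal sketch on a toy frame that borrows a few of the real column names; the values and the variable names `pre_cols` and `post_cols` are made up.

```python
import pandas as pd

# Toy frame with a few of the real column names; values are made up.
toy = pd.DataFrame({
    "offeringType": ["IPO", "FO"],
    "pre2_Price_Normalized": [None, 0.05],
    "pre1_Price_Normalized": [None, 0.02],
    "post1_Price_Normalized": [-0.10, 0.03],
})

# list(df) and df.columns give the same names.
print(list(toy) == list(toy.columns))

# Group columns by prefix.
pre_cols = [c for c in toy.columns if c.startswith("pre")]
post_cols = [c for c in toy.columns if c.startswith("post")]
print(pre_cols)
print(post_cols)
```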

We can check the type of each feature in the data frame using the code below.

In [4]:
cmg.dtypes

offeringPricingDate            datetime64[ns]
offeringType                           object
offeringSector                         object
offeringSubSector                      object
offeringDiscountToLastTrade           float64
offeringPrice                         float64
issuerCusip                            object
issuerName                             object
pre15_Price_Normalized                float64
pre14_Price_Normalized                float64
pre13_Price_Normalized                float64
pre12_Price_Normalized                float64
pre11_Price_Normalized                float64
pre10_Price_Normalized                float64
pre9_Price_Normalized                 float64
pre8_Price_Normalized                 float64
pre7_Price_Normalized                 float64
pre6_Price_Normalized                 float64
pre5_Price_Normalized                 float64
pre4_Price_Normalized                 float64
pre3_Price_Normalized                 float64
pre2_Price_Normalized             
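
Once the dtypes are known, they can be used to pick out groups of columns. The sketch below uses a toy frame with made-up values to show one possible follow-up: `select_dtypes` to list the numeric columns and `isna().sum()` to count missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few cmg columns; values are for illustration only.
toy = pd.DataFrame({
    "offeringPricingDate": pd.to_datetime(["2003-10-02", "2005-07-21"]),
    "offeringType": ["IPO", "IPO"],
    "offeringPrice": [13.0, 13.0],
    "pre1_Price_Normalized": [np.nan, np.nan],
})

# Names of the numeric columns (dates and strings are excluded).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```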

<a id='index'></a>
### Indexing and selecting data 

In the cells below, you will learn several ways to index and select data from a pandas data frame.  
- [[ ]](#basic)  
- [.loc](#loc): indexing by label.  
- [.iloc](#iloc): indexing by position.

You can find more details on indexing and selecting data in the pandas user guide: https://pandas.pydata.org/docs/user_guide/indexing.html

<a id='basic'></a>
#### The basic: [ ]

You can use square brackets [] with a slice to select rows by position, or with a column name (or a list of column names) to select columns.

In [5]:
cmg[:3]

| offeringId | offeringPricingDate | offeringType | offeringSector | offeringSubSector | offeringDiscountToLastTrade | offeringPrice | issuerCusip | issuerName | pre15_Price_Normalized | pre14_Price_Normalized | ... | pre1_Price_Normalized | underwriters | totalBookrunners | leftLeadFirmId | leftLeadFirmName | post1_Price_Normalized | post7_Price_Normalized | post30_Price_Normalized | post90_Price_Normalized | post180_Price_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 2003-10-02 | IPO | Consumer Cyclical | Vehicles & Parts | 0.0 | 13.0 | 501889208 | BharCap Acquisition Corp. | NaN | NaN | ... | NaN | [{'firmId': '15af8b8d-c949-4fa0-b35e-a6482d3ca... | 2 | 759ce574-3755-480b-8b83-c614f4568db1 | Baird | -0.855769 | -0.85 | -0.831635 | -0.825481 | -0.836538 |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 2005-07-21 | IPO | Communication Services | Telecom Services | 0.0 | 13.0 | 209034107 | GrandSouth Bancorporation | NaN | NaN | ... | NaN | [{'firmId': 'dac135c0-9e99-4362-9762-7179a0023... | 2 | 5eb63e75-8f95-464e-86fe-3222865c54ef | Credit Suisse | 0.060769 | 0.136923 | 0.041538 | -0.018462 | -0.016923 |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 2005-08-04 | IPO | Communication Services | Internet Content & Information | 0.0 | 27.0 | 056752108 | Brand Velocity Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | dac135c0-9e99-4362-9762-7179a0023c9e | Goldman Sachs & Co. | -0.546148 | -0.637407 | -0.711852 | -0.746296 | -0.798111 |


In [6]:
cmg[['issuerCusip', 'offeringSector', 'offeringSubSector']]

| offeringId | issuerCusip | offeringSector | offeringSubSector |
|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 501889208 | Consumer Cyclical | Vehicles & Parts |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 209034107 | Communication Services | Telecom Services |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 056752108 | Communication Services | Internet Content & Information |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | G47567105 | Industrials | Consulting Services |
| 96a13598-121a-41c0-83b5-448843cd8709 | 29273V100 | Energy | Oil & Gas Midstream |
| ... | ... | ... | ... |
| 7fd2fd2f-339c-47a3-be1e-1af916dbfa37 | 53946R106 | Financial Services | Banks |
| b39ff819-8e3f-4518-b838-3bebd9e7f8fb | 91704K202 | Industrials | Farm & Heavy Construction Machinery |
| 8bafb8a8-e8c6-4a80-b3e5-f68871f08404 | 828363101 | Basic Materials | Metals & Mining |
| 9c1370c6-21e3-43ee-b4a6-924836979d66 | 11134Y101 | Financial Services | Shell Companies |
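
The two uses of [] shown above return different kinds of objects, which is easy to miss. Here is a minimal sketch on a toy frame (the shortened IDs and the values are made up) that contrasts a row slice, a single column name, and a list of column names.

```python
import pandas as pd

# Toy frame indexed by a shortened, made-up offeringId.
toy = pd.DataFrame(
    {
        "offeringSector": ["Energy", "Technology", "Utilities"],
        "offeringPrice": [21.0, 16.0, 27.0],
    },
    index=pd.Index(["a1", "b2", "c3"], name="offeringId"),
)

first_two_rows = toy[:2]            # slice: rows at positions 0 and 1
one_column = toy["offeringPrice"]   # single name: returns a Series
sub_frame = toy[["offeringPrice"]]  # list of names: returns a DataFrame

print(type(one_column).__name__)
print(type(sub_frame).__name__)
```

The double brackets in `cmg[['issuerCusip', ...]]` are therefore a list of column names inside the selection brackets, not a separate operator.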


<a id='loc'></a>
#### .loc  
.loc allows you to select by label (i.e., index labels and column names).

In [7]:
cmg.loc[['43f06950-8d20-4cfc-b16d-237e0927e1e6'], ['issuerCusip', 'offeringSector', 'offeringSubSector']]

| offeringId | issuerCusip | offeringSector | offeringSubSector |
|---|---|---|---|
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | G47567105 | Industrials | Consulting Services |
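
A few more `.loc` patterns may be useful beyond a list of labels. The sketch below uses a toy frame with made-up IDs and values to show a scalar lookup, a label slice (which, unlike positional slices, includes both endpoints), and a boolean mask.

```python
import pandas as pd

# Toy frame indexed by a shortened, made-up offeringId.
toy = pd.DataFrame(
    {
        "offeringSector": ["Energy", "Technology", "Utilities"],
        "offeringPrice": [21.0, 16.0, 27.0],
    },
    index=pd.Index(["a1", "b2", "c3"], name="offeringId"),
)

# One row label and one column label: returns a single value.
sector_b2 = toy.loc["b2", "offeringSector"]

# Label slice: both "a1" and "b2" are included.
rows_a1_to_b2 = toy.loc["a1":"b2"]

# Boolean mask on rows, one column by name.
expensive_sectors = toy.loc[toy["offeringPrice"] > 20, "offeringSector"]

print(sector_b2)
print(list(rows_a1_to_b2.index))
print(list(expensive_sectors))
```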


<a id='iloc'></a>
#### .iloc  
.iloc allows you to select by row/column position. Positions are zero-based and the end of a slice is excluded, so the cell below selects columns 8 through 22 (inclusive) for all rows.

In [8]:
cmg.iloc[:, 8:23]

| offeringId | pre15_Price_Normalized | pre14_Price_Normalized | pre13_Price_Normalized | pre12_Price_Normalized | pre11_Price_Normalized | pre10_Price_Normalized | pre9_Price_Normalized | pre8_Price_Normalized | pre7_Price_Normalized | pre6_Price_Normalized | pre5_Price_Normalized | pre4_Price_Normalized | pre3_Price_Normalized | pre2_Price_Normalized | pre1_Price_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 96a13598-121a-41c0-83b5-448843cd8709 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7fd2fd2f-339c-47a3-be1e-1af916dbfa37 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| b39ff819-8e3f-4518-b838-3bebd9e7f8fb | 0.050000 | 0.225000 | 0.261000 | 0.600000 | 0.900 | 1.00 | 1.100000 | 0.60000 | 1.150000 | 1.006000 | 0.300000 | 6.499000 | 4.800000 | 4.000000 | 4.005000 |
| 8bafb8a8-e8c6-4a80-b3e5-f68871f08404 | 0.028261 | 0.005435 | -0.011957 | 0.018478 | -0.025 | 0.05 | 0.103261 | 0.21413 | 0.068478 | 0.040217 | 0.079348 | 0.115217 | 0.122826 | 0.090217 | 0.076087 |
| 9c1370c6-21e3-43ee-b4a6-924836979d66 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
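
Hard-coded positions such as `8:23` are easy to get wrong if the column order changes. One possible alternative, sketched below on a toy frame with made-up values, is to look the positions up with `columns.get_loc`, or to use a `.loc` label slice, which includes both endpoints.

```python
import pandas as pd

# Toy frame with a few of the real column names; values are made up.
toy = pd.DataFrame({
    "offeringType": ["FO", "FO"],
    "pre2_Price_Normalized": [0.05, 0.01],
    "pre1_Price_Normalized": [0.02, 0.03],
    "post1_Price_Normalized": [-0.10, 0.03],
})

# Positional slice: positions 1 and 2 are selected, position 3 is excluded.
by_position = toy.iloc[:, 1:3]

# Same selection with positions looked up from the column names.
start = toy.columns.get_loc("pre2_Price_Normalized")
stop = toy.columns.get_loc("pre1_Price_Normalized") + 1  # +1: the stop is exclusive
by_lookup = toy.iloc[:, start:stop]

# Same selection by label; a .loc slice includes both endpoints.
by_label = toy.loc[:, "pre2_Price_Normalized":"pre1_Price_Normalized"]

print(list(by_position.columns))
```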
