# Section 1. Problem Definition

* <a href='00- DSC 2022 Problem Definition.ipynb#top'>**Section 0. Welcome and Logistics**</a> 
* <a href='01- DSC 2022 Problem Definition.ipynb#top'>**Section 1. Problem Definition**</a> 
  * [1. Variable Description](#var)
  * [2. Data](#data)

* <a href='02- DSC 2022 Exploratory Data Analysis.ipynb#top'>**Section 2. Exploratory Data Analysis**</a> 


The problem for this year's competition is to **use deal information to predict deal performance**. Our goal is twofold: lower risk by reducing participation in deals that are likely to break price, and increase alpha by increasing indication in deals that are likely to perform well.

Below are **some hypotheses** that we have in mind.


1. Follow-on offerings perform better than blocks
2. Stock performance ahead of the deal announcement is a predictor of performance
3. There is a correlation between discount to offering and performance
4. Sectors matter
5. Past deal performance is an indicator of current deal performance
6. If the issuer switched lead banks from past deals, the current deal performs worse

<a id='var'></a>
## 1. Variable description

In our problem, each observation (a deal) corresponds to a row in the data frame. Each row contains the variables listed below.

**`Predictors`**

Predictors can be divided into 4 categories: offering data, issuer data, past performance data and underwriter data.

1. <u>Offering data </u> 
- offeringId: unique IDs of offerings
- offeringPricingDate: time when offerings are priced 
- offeringType: types of offerings ('IPO', 'OVERNIGHT_FO', 'MARKETED_FO', 'UNREGISTERED_BLOCK', 'REGISTERED_BLOCK', 'FO')
- offeringSector: sectors of the offerings ('Basic Materials', 'Healthcare', 'Financial Services', 'Consumer Defensive', 'Consumer Cyclical', 'Industrials', 'Communication Services', 'Technology', 'Utilities', 'Real Estate','Energy')
- offeringSubSector: subsectors of the offerings
- offeringDiscountToLastTrade: whether or not the offering was made at a discount
- offeringPrice: filing price 

2. <u>Issuer data </u>
- issuerCusip: unique IDs of issuers 
- issuerName: name of issuers

3. <u>Past performance data</u>: stock prices normalized with respect to the filing price for each of the 15 days prior to the deal announcement
- pre15_Price_Normalized
- pre14_Price_Normalized 
    ...
- pre1_Price_Normalized   


4. <u>Underwriter data</u>
- underwriters: list of all underwriters of a deal, containing information on underwriter ID, name, economic percentage and role in the deal
- totalBookrunners: total number of bookrunners for the offering
- leftLeadFirmId: unique IDs of lead firms 
- leftLeadFirmName: name of lead firms

**`Outcomes`**

Outcomes are returns with respect to filing price. For example, 
$$\text{post1\_Price\_Normalized} = \frac{\text{post1\_Price} - \text{offeringPrice}}{\text{offeringPrice}}$$

- post1_Price_Normalized 
- post7_Price_Normalized 
- post30_Price_Normalized 
- post90_Price_Normalized 
- post180_Price_Normalized 
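
As a quick check of the outcome formula, the sketch below computes a normalized return in plain Python. The function name `normalized_return` and the day-1 price of 1.875 are illustrative assumptions and are not part of the dataset.

```python
def normalized_return(post_price: float, offering_price: float) -> float:
    """Return of a post-deal price relative to the offering price."""
    if offering_price == 0:
        raise ValueError("offering_price must be non-zero")
    return (post_price - offering_price) / offering_price


# Hypothetical prices: a deal priced at 13.0 that trades at 1.875 one day later.
post1_normalized = normalized_return(1.875, 13.0)
print(round(post1_normalized, 6))
```

A negative value means the stock traded below the offering price on that day, and a value of 0 means it traded exactly at the offering price.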

<a id='data'></a>
## 2. Data

Data scientists working in Python rely heavily on two packages, pandas and NumPy, so let's first import them. In the section below, you will learn how to read in the data and how to index and select data from a pandas data frame.

- [Read in data](#read)
- [Index and select data](#index)

In [14]:
import pandas as pd
import numpy as np

<a id='read'></a>
### Read in data 

In the variable descriptions, we learned that offeringId is the unique ID of each offering, so it serves as a unique identifier of each observation (each row). Therefore, when reading in the data frame, we specify offeringId as the index column.

In [15]:
cmg = pd.read_excel('cmg.xlsx', index_col = 'offeringId')
print("The data contains {} rows and {} columns.".format(cmg.shape[0], cmg.shape[1]))
cmg.head(5)

The data contains 8489 rows and 32 columns.


| offeringId | offeringPricingDate | offeringType | offeringSector | offeringSubSector | offeringDiscountToLastTrade | offeringPrice | issuerCusip | issuerName | pre15_Price_Normalized | pre14_Price_Normalized | ... | pre1_Price_Normalized | underwriters | totalBookrunners | leftLeadFirmId | leftLeadFirmName | post1_Price_Normalized | post7_Price_Normalized | post30_Price_Normalized | post90_Price_Normalized | post180_Price_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 2003-10-02 | IPO | Consumer Cyclical | Vehicles & Parts | 0.0 | 13.0 | 501889208 | BharCap Acquisition Corp. | NaN | NaN | ... | NaN | [{'firmId': '15af8b8d-c949-4fa0-b35e-a6482d3ca... | 2 | 759ce574-3755-480b-8b83-c614f4568db1 | Baird | -0.855769 | -0.85 | -0.831635 | -0.825481 | -0.836538 |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 2005-07-21 | IPO | Communication Services | Telecom Services | 0.0 | 13.0 | 209034107 | GrandSouth Bancorporation | NaN | NaN | ... | NaN | [{'firmId': 'dac135c0-9e99-4362-9762-7179a0023... | 2 | 5eb63e75-8f95-464e-86fe-3222865c54ef | Credit Suisse | 0.060769 | 0.136923 | 0.041538 | -0.018462 | -0.016923 |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 2005-08-04 | IPO | Communication Services | Internet Content & Information | 0.0 | 27.0 | 056752108 | Brand Velocity Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | dac135c0-9e99-4362-9762-7179a0023c9e | Goldman Sachs & Co. | -0.546148 | -0.637407 | -0.711852 | -0.746296 | -0.798111 |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | 2005-11-10 | IPO | Industrials | Consulting Services | 0.0 | 16.0 | G47567105 | ProLung Inc. | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | cd9cd378-73b5-4cef-8666-ad2c5149ccd8 | Goldman Sachs & Co. | -0.699502 | -0.697394 | -0.682808 | -0.566124 | -0.512702 |
| 96a13598-121a-41c0-83b5-448843cd8709 | 2006-02-03 | IPO | Energy | Oil & Gas Midstream | 0.0 | 21.0 | 29273V100 | Golden Star Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': '7d932034-3e85-46ab-97b4-b6e8e86ee... | 3 | 8fdb6c2d-3b35-40d4-a886-0a3461b42d98 | UBS Investment Bank | -0.730357 | -0.73869 | -0.740595 | -0.703571 | -0.688095 |
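
To see what setting an index column does, here is a minimal sketch. The real `cmg.xlsx` is not reproduced here, so the sketch assumes a made-up three-column CSV read with `pd.read_csv`, where `index_col` behaves the same way as in `pd.read_excel`. The deal IDs `deal-a` and `deal-b` are hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for the real file: two deals with three columns.
raw = io.StringIO(
    "offeringId,offeringType,offeringPrice\n"
    "deal-a,IPO,13.0\n"
    "deal-b,FO,27.0\n"
)

# The column named in index_col becomes the row index
# instead of staying a regular column.
deals = pd.read_csv(raw, index_col="offeringId")

print(deals.shape)            # rows and columns, not counting the index
print(deals.index.is_unique)  # an ID used as the index should not repeat
```

Because the index column is not counted in `shape`, the column count reported for the real data likewise excludes offeringId.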


The cell below shows one way to get the column names of a data frame. You could also use `cmg.columns`, which returns the same names as a pandas Index.

In [16]:
list(cmg)

['offeringPricingDate',
 'offeringType',
 'offeringSector',
 'offeringSubSector',
 'offeringDiscountToLastTrade',
 'offeringPrice',
 'issuerCusip',
 'issuerName',
 'pre15_Price_Normalized',
 'pre14_Price_Normalized',
 'pre13_Price_Normalized',
 'pre12_Price_Normalized',
 'pre11_Price_Normalized',
 'pre10_Price_Normalized',
 'pre9_Price_Normalized',
 'pre8_Price_Normalized',
 'pre7_Price_Normalized',
 'pre6_Price_Normalized',
 'pre5_Price_Normalized',
 'pre4_Price_Normalized',
 'pre3_Price_Normalized',
 'pre2_Price_Normalized',
 'pre1_Price_Normalized',
 'underwriters',
 'totalBookrunners',
 'leftLeadFirmId',
 'leftLeadFirmName',
 'post1_Price_Normalized',
 'post7_Price_Normalized',
 'post30_Price_Normalized',
 'post90_Price_Normalized',
 'post180_Price_Normalized']
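
The two ways of listing column names can be compared on a small frame. This is only a sketch: the frame `toy` and its values are made up, and the prefix filter at the end is one possible way to pick out the outcome columns, not a step from the original notebook.

```python
import pandas as pd

# Made-up frame that borrows three column names from the dataset.
toy = pd.DataFrame(
    {
        "offeringType": ["IPO", "FO"],
        "pre1_Price_Normalized": [0.02, -0.01],
        "post1_Price_Normalized": [0.05, -0.03],
    }
)

names_from_list = list(toy)             # iterating a DataFrame yields its column labels
names_from_attr = toy.columns.tolist()  # .columns is an Index; tolist() gives a plain list

# Column names are ordinary strings, so they can be filtered by prefix.
outcome_columns = [name for name in toy.columns if name.startswith("post")]

print(names_from_list == names_from_attr)
print(outcome_columns)
```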

In [17]:
cmg.dtypes

offeringPricingDate            datetime64[ns]
offeringType                           object
offeringSector                         object
offeringSubSector                      object
offeringDiscountToLastTrade           float64
offeringPrice                         float64
issuerCusip                            object
issuerName                             object
pre15_Price_Normalized                float64
pre14_Price_Normalized                float64
pre13_Price_Normalized                float64
pre12_Price_Normalized                float64
pre11_Price_Normalized                float64
pre10_Price_Normalized                float64
pre9_Price_Normalized                 float64
pre8_Price_Normalized                 float64
pre7_Price_Normalized                 float64
pre6_Price_Normalized                 float64
pre5_Price_Normalized                 float64
pre4_Price_Normalized                 float64
pre3_Price_Normalized                 float64
pre2_Price_Normalized             

<a id='index'></a>
### Indexing and selecting data 

In the cells below, you will learn several ways to index and select data from a pandas data frame.  
- [\[ \]](#basic): basic selection by row slice or column name.  
- [.loc](#loc): indexing by label.  
- [.iloc](#iloc): indexing by position.

You can find more details on how to index and select data here: https://pandas.pydata.org/docs/user_guide/indexing.html

<a id='basic'></a>
#### The basic: [ ]

You can use square brackets `[]` to select rows by position (with a slice) or columns by name.

In [18]:
cmg[:3]

| offeringId | offeringPricingDate | offeringType | offeringSector | offeringSubSector | offeringDiscountToLastTrade | offeringPrice | issuerCusip | issuerName | pre15_Price_Normalized | pre14_Price_Normalized | ... | pre1_Price_Normalized | underwriters | totalBookrunners | leftLeadFirmId | leftLeadFirmName | post1_Price_Normalized | post7_Price_Normalized | post30_Price_Normalized | post90_Price_Normalized | post180_Price_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 2003-10-02 | IPO | Consumer Cyclical | Vehicles & Parts | 0.0 | 13.0 | 501889208 | BharCap Acquisition Corp. | NaN | NaN | ... | NaN | [{'firmId': '15af8b8d-c949-4fa0-b35e-a6482d3ca... | 2 | 759ce574-3755-480b-8b83-c614f4568db1 | Baird | -0.855769 | -0.85 | -0.831635 | -0.825481 | -0.836538 |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 2005-07-21 | IPO | Communication Services | Telecom Services | 0.0 | 13.0 | 209034107 | GrandSouth Bancorporation | NaN | NaN | ... | NaN | [{'firmId': 'dac135c0-9e99-4362-9762-7179a0023... | 2 | 5eb63e75-8f95-464e-86fe-3222865c54ef | Credit Suisse | 0.060769 | 0.136923 | 0.041538 | -0.018462 | -0.016923 |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 2005-08-04 | IPO | Communication Services | Internet Content & Information | 0.0 | 27.0 | 056752108 | Brand Velocity Acquisition Corp | NaN | NaN | ... | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 | dac135c0-9e99-4362-9762-7179a0023c9e | Goldman Sachs & Co. | -0.546148 | -0.637407 | -0.711852 | -0.746296 | -0.798111 |


In [19]:
cmg[['issuerCusip', 'offeringSector', 'offeringSubSector']]

| offeringId | issuerCusip | offeringSector | offeringSubSector |
|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | 501889208 | Consumer Cyclical | Vehicles & Parts |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | 209034107 | Communication Services | Telecom Services |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | 056752108 | Communication Services | Internet Content & Information |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | G47567105 | Industrials | Consulting Services |
| 96a13598-121a-41c0-83b5-448843cd8709 | 29273V100 | Energy | Oil & Gas Midstream |
| ... | ... | ... | ... |
| fca78fb5-120c-4b6f-adbb-96674bb1e887 | 747324101 | Healthcare | Biotechnology |
| b311ad95-9a6c-48c9-ac8f-3e21c45360ed | 465005106 | Healthcare | Medical Devices & Instruments |
| 508e6045-373d-43cd-9aa2-61c5aee49a3e | 57778T106 | Financial Services | Shell Companies |
| 658e0de4-fc92-4dd3-a562-bd0fcb3f5272 | 171757107 | Healthcare | Biotechnology |
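
The two uses of `[]` shown in the cells above can be tried on a small frame. This is a sketch under assumptions: the frame `toy`, its deal IDs and its values are made up for illustration.

```python
import pandas as pd

# Made-up frame indexed by hypothetical deal IDs.
toy = pd.DataFrame(
    {
        "offeringSector": ["Energy", "Healthcare", "Technology"],
        "offeringPrice": [21.0, 16.0, 27.0],
    },
    index=pd.Index(["deal-a", "deal-b", "deal-c"], name="offeringId"),
)

first_two_rows = toy[:2]            # a slice selects rows by position
one_column = toy["offeringPrice"]   # a single name returns a Series
sub_frame = toy[["offeringPrice"]]  # a list of names returns a DataFrame

print(type(one_column).__name__, type(sub_frame).__name__)
```

The difference between a single name and a list of names matters later on, because a Series and a one-column DataFrame behave differently in many operations.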


<a id='loc'></a>
#### .loc  
.loc allows you to select by label (i.e. index names and column names).

In [20]:
cmg.loc[['43f06950-8d20-4cfc-b16d-237e0927e1e6'], ['issuerCusip', 'offeringSector', 'offeringSubSector']]

| offeringId | issuerCusip | offeringSector | offeringSubSector |
|---|---|---|---|
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | G47567105 | Industrials | Consulting Services |
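
A few more `.loc` patterns are sketched below on a made-up frame. The deal IDs and values are hypothetical, and the boolean-mask selection is an extra illustration that the cell above does not use.

```python
import pandas as pd

# Made-up frame indexed by hypothetical deal IDs.
toy = pd.DataFrame(
    {
        "offeringSector": ["Energy", "Healthcare", "Technology"],
        "offeringPrice": [21.0, 16.0, 27.0],
    },
    index=pd.Index(["deal-a", "deal-b", "deal-c"], name="offeringId"),
)

# Two scalar labels return a single value.
price = toy.loc["deal-b", "offeringPrice"]

# Lists of labels return a DataFrame, even when only one row matches.
row_frame = toy.loc[["deal-b"], ["offeringSector", "offeringPrice"]]

# A boolean mask in the row position keeps only the rows where it is True.
energy_prices = toy.loc[toy["offeringSector"] == "Energy", ["offeringPrice"]]

print(price)
print(row_frame.shape)
print(list(energy_prices.index))
```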


<a id='iloc'></a>
#### .iloc  
.iloc allows you to select by row/column position. Positions are zero-based and the end of a slice is excluded, so the cell below selects the columns at positions 10 through 24 (15 columns) for all rows.

In [21]:
cmg.iloc[:, 10:25]

| offeringId | pre13_Price_Normalized | pre12_Price_Normalized | pre11_Price_Normalized | pre10_Price_Normalized | pre9_Price_Normalized | pre8_Price_Normalized | pre7_Price_Normalized | pre6_Price_Normalized | pre5_Price_Normalized | pre4_Price_Normalized | pre3_Price_Normalized | pre2_Price_Normalized | pre1_Price_Normalized | underwriters | totalBookrunners |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b969a1c8-0a26-438a-81e6-5e95f3b30501 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': '15af8b8d-c949-4fa0-b35e-a6482d3ca... | 2 |
| 1081394b-c9f2-4479-8dd2-528027ff1eea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': 'dac135c0-9e99-4362-9762-7179a0023... | 2 |
| 714a166d-9eb0-4b3c-ab8e-7c0dc6f21ee0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 |
| 43f06950-8d20-4cfc-b16d-237e0927e1e6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': 'a82a866c-d40e-453a-99e1-8acb44efb... | 2 |
| 96a13598-121a-41c0-83b5-448843cd8709 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': '7d932034-3e85-46ab-97b4-b6e8e86ee... | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| fca78fb5-120c-4b6f-adbb-96674bb1e887 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': '71a3e477-4edd-402b-be66-e96e722c7... | 4 |
| b311ad95-9a6c-48c9-ac8f-3e21c45360ed | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': '60526594-5581-4d1d-b134-c710fee46... | 4 |
| 508e6045-373d-43cd-9aa2-61c5aee49a3e | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | [{'firmId': 'a34617b2-947b-467e-959e-b33da1a27... | 1 |
| 658e0de4-fc92-4dd3-a562-bd0fcb3f5272 | 0.348387 | 0.380645 | 0.445161 | 0.483871 | 0.445161 | 0.516129 | 0.445161 | 0.438710 | 0.445161 | 0.406452 | 0.258065 | 0.264516 | 0.225806 | [{'firmId': '7da216ae-4833-4d6c-9aed-6e7ed62d9... | 1 |
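
Position slices and label slices treat their end point differently, which is easy to check on a small frame. The frame below is made up for illustration, with hypothetical column names `c0` to `c3`.

```python
import pandas as pd

# Made-up frame with four columns and two hypothetical deal IDs.
toy = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=["c0", "c1", "c2", "c3"],
    index=["deal-a", "deal-b"],
)

by_position = toy.iloc[:, 1:3]    # positions 1 and 2; position 3 is excluded
by_label = toy.loc[:, "c1":"c3"]  # label slices include both end points

print(list(by_position.columns))
print(list(by_label.columns))
```

The same rule explains why `cmg.iloc[:, 10:25]` returns 15 columns rather than 16.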
