# Final (Woosoo Kim)

#### The following materials and tools are allowed: 
1. Lecture materials
2. Laptop

#### The following tools/actions are not allowed: 
1. ChatGPT
2. Web search engines
3. Communicating with anyone
4. Sharing any information about the exam
5. Phones
6. Communication apps (all must be turned off during the exam)

## Problem 1

In this exercise, you will scrape Ford Motor Company’s ISS Governance QualityScore and pillar scores (Audit, Board, Shareholder Rights, Compensation) from Ford's profile page on *Yahoo! Finance* (`https://finance.yahoo.com/quote/F/profile?p=F`).

#### (10 points)
#### Use regular expressions only (no text splitting) to scrape Ford's scores.

#### Define variables as follows and print the scores:
* `QualityScore` = ISS Governance QualityScore
* `Audit` = Audit Score
* `Board` = Board Score
* `Shareholder` = Shareholder Rights Score
* `Compensation` = Compensation Score.

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
import re
import warnings

# Suppress the pandas deprecation warning about passing literal HTML to read_html
warnings.filterwarnings("ignore", message="Passing literal html to 'read_html' is deprecated.*")

In [2]:
# Set up headless Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Optional: runs the browser in the background
driver = webdriver.Chrome(options=options)

# Load the Yahoo Finance Profile page
url = "https://finance.yahoo.com/quote/F/profile?p=F"
driver.get(url)

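# Set the implicit wait used by element lookups. This call does not pause the script;
# driver.get() above already blocks until the page's load event fires.
driver.implicitly_wait(5)  # Element lookups retry for up to 5 seconds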

# Retrieve HTML
html = driver.page_source

# Close the browser
driver.quit()

# Extract the scores
QualityScore = int(re.findall(r'(ISS Governance QualityScore.*is )(\d+)', html)[0][1])
Audit = int(re.findall(r'(Audit: )(\d+)', html)[0][1])
Board = int(re.findall(r'(Board: )(\d+)', html)[0][1])
Shareholder = int(re.findall(r'(Shareholder Rights: )(\d+)', html)[0][1])
Compensation = int(re.findall(r'(Compensation: )(\d+)', html)[0][1])

print("QualityScore :", QualityScore)
print("Audit        :", Audit)
print("Board        :", Board)
print("Shareholder  :", Shareholder)
print("Compensation :", Compensation)

QualityScore : 10
Audit        : 10
Board        : 10
Shareholder  : 10
Compensation : 8
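
The extraction step can be tried without launching a browser by running the same kind of patterns against a short string. The sketch below is only an illustration: `sample_html`, `PATTERNS`, and `parse_scores` are hypothetical names, and the sample sentence (including its date) merely imitates the wording the patterns above expect, so the live page's markup may differ. It uses `re.search` with a single capture group and a non-greedy `.*?` in place of `re.findall(...)[0][1]`, and it returns `None` for a missing score instead of raising an `IndexError`.

```python
import re

# Hypothetical snippet that imitates the sentence the patterns look for;
# the real page source may be worded or structured differently.
sample_html = (
    "<p>Ford Motor Company's ISS Governance QualityScore as of December 1, 2024 is 10. "
    "The pillar scores are Audit: 10; Board: 10; Shareholder Rights: 10; Compensation: 8.</p>"
)

# One pattern per variable; the single capture group holds the digits.
PATTERNS = {
    "QualityScore": r"ISS Governance QualityScore.*?is (\d+)",
    "Audit": r"Audit: (\d+)",
    "Board": r"Board: (\d+)",
    "Shareholder": r"Shareholder Rights: (\d+)",
    "Compensation": r"Compensation: (\d+)",
}


def parse_scores(html):
    """Return a dict of scores found in html; a missing score maps to None."""
    scores = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, html)
        scores[name] = int(match.group(1)) if match else None
    return scores


print(parse_scores(sample_html))
```

Separating the parsing from the browser session also makes it possible to check the patterns against saved page source before scraping several tickers.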


#### (10 points)
#### Create a function that takes a ticker as an input to scrape a firm's ISS Governance QualityScore and pillar scores (Audit, Board, Shareholder Rights, Compensation) using regular expressions only (no text splitting).

#### Then extract the scores for 'F', 'AAPL', 'AMZN', and 'WMT', and save the data to a new pandas DataFrame. Print the resulting DataFrame.

In [3]:
def get_scores(ticker):

    # Set up headless Chrome options
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Optional: runs the browser in the background
    driver = webdriver.Chrome(options=options)
    
    # Load the Yahoo Finance Profile page
    url = f"https://finance.yahoo.com/quote/{ticker}/profile?p={ticker}"
    driver.get(url)

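    # Set the implicit wait used by element lookups. This call does not pause the script;
    # driver.get() above already blocks until the page's load event fires.
    driver.implicitly_wait(5)  # Element lookups retry for up to 5 seconds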

    # Retrieve HTML
    html = driver.page_source

    # Close the browser
    driver.quit()

    # Extract the scores
    QualityScore = int(re.findall(r'(ISS Governance QualityScore.*is )(\d+)', html)[0][1])
    Audit = int(re.findall(r'(Audit: )(\d+)', html)[0][1])
    Board = int(re.findall(r'(Board: )(\d+)', html)[0][1])
    Shareholder = int(re.findall(r'(Shareholder Rights: )(\d+)', html)[0][1])
    Compensation = int(re.findall(r'(Compensation: )(\d+)', html)[0][1])
    scores_df = pd.DataFrame({
        'Ticker': ticker, 
        'QualityScore': QualityScore, 
        'Audit': Audit, 
        'Board': Board, 
        'Shareholder': Shareholder, 
        'Compensation': Compensation
        }, index=[0])
    
    return scores_df

In [4]:
dfs = []
tickers = ['F', 'AAPL', 'AMZN', 'WMT']

for ticker in tickers:
    dfs.append(get_scores(ticker))
    
result_df = pd.concat(dfs, ignore_index=True)
result_df

|  | Ticker | QualityScore | Audit | Board | Shareholder | Compensation |
|---|---|---|---|---|---|---|
| 0 | F | 10 | 10 | 10 | 10 | 8 |
| 1 | AAPL | 1 | 6 | 1 | 1 | 2 |
| 2 | AMZN | 9 | 5 | 10 | 3 | 10 |
| 3 | WMT | 3 | 8 | 8 | 1 | 6 |


## Problem 2

In this exercise, you will predict future stock returns using various return signals.

#### (10 points)

#### Suppose you are a portfolio manager in December 2017, and you would like to use return signals to study patterns in returns using data from 2010 to 2017. You will predict returns for 2018 and 2019. 

#### You are given two datasets: 
* `signals.xlsx` includes monthly return signals, and all the variables are already lagged.
* `ret.xlsx` includes monthly stock returns.

#### Compare two prediction methods: Simple regressions and Random forests
* Before estimating any model, winsorize the return signals (excluding the y-variable) at the 5% level (i.e., `limits=[0.05, 0.05]`).
* You should use all the return signals in both simple regression and random forest models.
* When estimating the random forest models, use the following parameters: `(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=1234).fit(X_train, y_train.values.ravel())`

#### Grading will be based on the following:
* (3 points) Did students correctly handle and manage the data before estimation?
* (5 points) Did students read, merge, and generate X_train, X_test, y_train, and y_test according to the goal of this question?
* (2 points) Did students correctly present the performance of the two methods (please do not include charts)?

#### You should always print out your results whenever possible so that you do not get penalized for not showing outputs.

## Linear Regression

In [5]:
import statsmodels.api as sm
from scipy.stats import mstats
from sklearn.linear_model import LinearRegression

In [6]:
ret = pd.read_excel('ret.xlsx', sheet_name='Sheet1')
ret

|  | PERMNO | year | month | Returns |
|---|---|---|---|---|
| 0 | 10001 | 2010 | 1 | -0.018932 |
| 1 | 10001 | 2010 | 2 | -0.000656 |
| 2 | 10001 | 2010 | 3 | 0.020643 |
| 3 | 10001 | 2010 | 4 | 0.124385 |
| 4 | 10001 | 2010 | 5 | 0.004829 |
| ... | ... | ... | ... | ... |
| 860459 | 93436 | 2019 | 8 | -0.066222 |
| 860460 | 93436 | 2019 | 9 | 0.067639 |
| 860461 | 93436 | 2019 | 10 | 0.307427 |
| 860462 | 93436 | 2019 | 11 | 0.047695 |


In [7]:
signals = pd.read_excel('signals.xlsx', sheet_name='Sheet1')
signals

Unnamed: 0,PERMNO,year,month,Price-to-Book,Cyclically-Adjusted Price to Sales,Cyclically-Adjusted Price-Earnings Ratio,Cyclically-Adjusted Price to Cash Flows,Cyclically-Adjusted Price to Free Cash Flows,Fiscal Book-to-Market,Short Term Leverage,...,Price Delay 1,Price Delay 2,Price Delay 3,Active Flows,Short Interest Ratio,Short Interest Scaled by Supply,Short Squeeze Probability,Volatility of Liquidity,Probability of dividend increase,Momentum Acceleration
0,10026,2010,1,-2.001473,-0.262609,0.042719,0.081986,0.043090,0.462267,-740.740356,...,0.109283,0.117712,0.125230,-0.472102,-19.596277,-30.924706,0.396259,0.685669,0.000513,-0.259706
1,10026,2010,2,-2.094134,-0.308845,0.040788,0.078282,0.041143,0.462267,-740.740356,...,0.307764,-0.260631,-0.325766,-0.472102,-21.234505,-33.509975,0.327583,0.542910,0.270081,-0.006817
2,10026,2010,3,-2.122217,-0.341798,0.039065,0.075209,0.039325,0.470744,-750.095764,...,0.373770,-0.444835,-0.441161,-0.182801,-26.737495,-40.171337,0.292102,0.559151,0.177226,-0.240768
3,10026,2010,4,-2.159981,-0.360631,0.038336,0.073806,0.038591,0.461961,-750.095764,...,0.430938,-1.608122,-1.622348,-0.182801,-27.394373,-41.158257,0.511576,0.889156,0.016806,-0.384905
4,10026,2010,5,-2.315011,-0.430001,0.035767,0.068860,0.036005,0.431002,-750.095764,...,0.354480,-1.214991,-1.273745,-0.182801,-23.192181,-34.844734,0.339407,0.892484,0.118616,-0.254163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40895,93423,2015,12,-11.978427,-1.480301,0.036168,0.068284,0.022223,0.084676,-0.318761,...,0.231828,-1.998100,-2.072584,0.063021,-50.951458,-58.558365,0.193941,0.868222,0.029665,-0.220867
40896,93423,2016,1,-12.680053,-1.537223,0.034167,0.064506,0.020993,0.079991,-0.318761,...,0.153004,-3.169122,-2.873457,0.063021,-54.418720,-62.543278,0.539353,0.895557,0.149446,-0.248443
40897,93423,2016,2,-11.602225,-1.448390,0.037341,0.070498,0.022944,0.087422,-0.318761,...,0.165371,-0.847746,-0.726417,0.063021,-58.107441,-66.782715,0.415136,0.774848,0.307180,-0.078321
40898,93423,2016,4,-30.884527,-1.523693,0.025958,0.072775,0.024961,0.031814,-0.152185,...,0.023173,-0.529685,-0.513124,-0.038767,-58.657127,-65.876579,0.280588,0.595971,0.041203,-0.004096


In [8]:
merged_data = pd.merge(ret, signals, on=['PERMNO', 'year', 'month'], how='left')
merged_data.dropna(inplace=True)
merged_data

Unnamed: 0,PERMNO,year,month,Returns,Price-to-Book,Cyclically-Adjusted Price to Sales,Cyclically-Adjusted Price-Earnings Ratio,Cyclically-Adjusted Price to Cash Flows,Cyclically-Adjusted Price to Free Cash Flows,Fiscal Book-to-Market,...,Price Delay 1,Price Delay 2,Price Delay 3,Active Flows,Short Interest Ratio,Short Interest Scaled by Supply,Short Squeeze Probability,Volatility of Liquidity,Probability of dividend increase,Momentum Acceleration
215,10026,2010,1,0.046296,-2.001473,-0.262609,0.042719,0.081986,0.043090,0.462267,...,0.109283,0.117712,0.125230,-0.472102,-19.596277,-30.924706,0.396259,0.685669,0.000513,-0.259706
216,10026,2010,2,0.021526,-2.094134,-0.308845,0.040788,0.078282,0.041143,0.462267,...,0.307764,-0.260631,-0.325766,-0.472102,-21.234505,-33.509975,0.327583,0.542910,0.270081,-0.006817
217,10026,2010,3,0.020311,-2.122217,-0.341798,0.039065,0.075209,0.039325,0.470744,...,0.373770,-0.444835,-0.441161,-0.182801,-26.737495,-40.171337,0.292102,0.559151,0.177226,-0.240768
218,10026,2010,4,0.071774,-2.159981,-0.360631,0.038336,0.073806,0.038591,0.461961,...,0.430938,-1.608122,-1.622348,-0.182801,-27.394373,-41.158257,0.511576,0.889156,0.016806,-0.384905
219,10026,2010,5,-0.046362,-2.315011,-0.430001,0.035767,0.068860,0.036005,0.431002,...,0.354480,-1.214991,-1.273745,-0.182801,-23.192181,-34.844734,0.339407,0.892484,0.118616,-0.254163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859322,93423,2015,12,0.058574,-11.978427,-1.480301,0.036168,0.068284,0.022223,0.084676,...,0.231828,-1.998100,-2.072584,0.063021,-50.951458,-58.558365,0.193941,0.868222,0.029665,-0.220867
859323,93423,2016,1,-0.085002,-12.680053,-1.537223,0.034167,0.064506,0.020993,0.079991,...,0.153004,-3.169122,-2.873457,0.063021,-54.418720,-62.543278,0.539353,0.895557,0.149446,-0.248443
859324,93423,2016,2,0.023274,-11.602225,-1.448390,0.037341,0.070498,0.022944,0.087422,...,0.165371,-0.847746,-0.726417,0.063021,-58.107441,-66.782715,0.415136,0.774848,0.307180,-0.078321
859326,93423,2016,4,0.082177,-30.884527,-1.523693,0.025958,0.072775,0.024961,0.031814,...,0.023173,-0.529685,-0.513124,-0.038767,-58.657127,-65.876579,0.280588,0.595971,0.041203,-0.004096


In [9]:
# Winsorize each return signal at the 5% level in both tails
# (identifier columns and the y-variable are excluded)
X_cols = [col for col in merged_data.columns if col not in ['PERMNO', 'year', 'month', 'Returns']]
for col in X_cols:
    merged_data[col] = mstats.winsorize(merged_data[col], limits=[0.05, 0.05])

# Split the data into train and test
train_data = merged_data[merged_data['year'] <= 2017]
test_data = merged_data[merged_data['year'] > 2017]
X_train = train_data[X_cols].assign(_const=1)
y_train = train_data[['Returns']]
X_test = test_data[X_cols].assign(_const=1)
y_test = test_data[['Returns']].copy()

print(f'# Observations in X_train: {len(X_train)}')
print(f'# Observations in y_train: {len(y_train)}')
print(f'# Observations in X_test: {len(X_test)}')
print(f'# Observations in y_test: {len(y_test)}')

display(X_train.head())
display(y_train.head())

# Observations in X_train: 32510
# Observations in y_train: 32510
# Observations in X_test: 8300
# Observations in y_test: 8300


Unnamed: 0,Price-to-Book,Cyclically-Adjusted Price to Sales,Cyclically-Adjusted Price-Earnings Ratio,Cyclically-Adjusted Price to Cash Flows,Cyclically-Adjusted Price to Free Cash Flows,Fiscal Book-to-Market,Short Term Leverage,Long Term Leverage,Accruals,Accrual Percentage,...,Price Delay 2,Price Delay 3,Active Flows,Short Interest Ratio,Short Interest Scaled by Supply,Short Squeeze Probability,Volatility of Liquidity,Probability of dividend increase,Momentum Acceleration,_const
215,-2.001473,-0.262609,0.042719,0.081986,0.04309,0.462267,-40.93116,0.13631,0.092713,0.951806,...,0.117712,0.12523,-0.377582,-19.596277,-30.924706,0.396259,0.685669,0.000513,-0.259706,1
216,-2.094134,-0.308845,0.040788,0.078282,0.041143,0.462267,-40.93116,0.13631,0.092713,0.951806,...,-0.260631,-0.325766,-0.377582,-21.234505,-33.509975,0.327583,0.568199,0.135139,-0.006817,1
217,-2.122217,-0.341798,0.039065,0.075209,0.039325,0.470744,-40.93116,0.13631,0.093389,0.887193,...,-0.444835,-0.441161,-0.182801,-26.737495,-40.171337,0.292102,0.568199,0.135139,-0.240768,1
218,-2.159981,-0.360631,0.038336,0.073806,0.038591,0.461961,-40.93116,0.13631,0.093389,0.887193,...,-1.608122,-1.622348,-0.182801,-27.394373,-41.158257,0.511576,0.889156,0.016806,-0.384905,1
219,-2.315011,-0.430001,0.035767,0.06886,0.036005,0.431002,-40.93116,0.13631,0.093389,0.887193,...,-1.214991,-1.273745,-0.182801,-23.192181,-34.844734,0.339407,0.892484,0.118616,-0.254163,1


|  | Returns |
|---|---|
| 215 | 0.046296 |
| 216 | 0.021526 |
| 217 | 0.020311 |
| 218 | 0.071774 |
| 219 | -0.046362 |
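
The effect of the winsorization step in the cell above can be seen on a small list. The sketch below is a pure-Python illustration of the idea behind `limits=[0.05, 0.05]`: the lowest 5% and highest 5% of observations are replaced by the nearest value that is kept. The function name `winsorize_by_count` and the toy `signal` are hypothetical, and this is not SciPy's implementation, so its handling of edge cases (ties, masked values, rounding of the tail count) may differ from `mstats.winsorize`.

```python
def winsorize_by_count(values, lower=0.05, upper=0.05):
    """Clamp the lowest and highest fractions of values to the nearest retained value."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices from smallest to largest
    k_low = int(lower * n)    # number of observations clamped in the lower tail
    k_high = int(upper * n)   # number of observations clamped in the upper tail
    result = list(values)
    for i in order[:k_low]:
        result[i] = values[order[k_low]]               # smallest retained value
    for i in order[n - k_high:]:
        result[i] = values[order[n - k_high - 1]]      # largest retained value
    return result


# Toy signal: 18 ordinary values plus one extreme value in each tail (20 in total)
signal = [-100.0] + [float(v) for v in range(1, 19)] + [100.0]
clipped = winsorize_by_count(signal)
print("before:", min(signal), max(signal))
print("after :", min(clipped), max(clipped))
```

With 20 observations, 5% corresponds to one observation per tail, so only the two extreme values change and the rest of the list is untouched.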


In [10]:
model = LinearRegression().fit(X_train, y_train.values.ravel())

y_test['Returns_p'] = model.predict(X_test)
y_test.head()

|  | Returns | Returns_p |
|---|---|---|
| 311 | -0.088191 | 0.006101 |
| 312 | -0.029688 | -0.002922 |
| 313 | 0.019951 | 0.006393 |
| 314 | 0.006224 | 0.009488 |
| 315 | 0.030638 | 0.007235 |


In [11]:
# Evaluate out-of-sample fit by regressing realized returns on predicted returns
model = sm.OLS(y_test['Returns'], y_test[['Returns_p']].assign(_const=1)).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Returns   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     3.970
Date:                Fri, 06 Dec 2024   Prob (F-statistic):             0.0463
Time:                        17:35:53   Log-Likelihood:                 7393.2
No. Observations:                8300   AIC:                        -1.478e+04
Df Residuals:                    8298   BIC:                        -1.477e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Returns_p      0.1882      0.094      1.993      0.0
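
The evaluation above regresses realized test-period returns on the model's predictions. In such a regression, a slope close to 1 with an intercept close to 0 would suggest the forecasts are well calibrated, and the R-squared gives the share of return variance the forecasts explain. The sketch below works through the same calculation by hand for a one-regressor model. The helper `simple_ols` is a hypothetical name, and the inputs are the first five test rows shown above rounded to three decimals, so the numbers it prints are for illustration only and do not reproduce the full-sample summary.

```python
def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the regression y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give R-squared
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - sse / sst


# Illustrative inputs only: five rounded rows, not the 8,300-row test sample
predicted = [0.006, -0.003, 0.006, 0.009, 0.007]
realized = [-0.088, -0.030, 0.020, 0.006, 0.031]

intercept, slope, r2 = simple_ols(predicted, realized)
print(f"intercept={intercept:.4f}, slope={slope:.4f}, R-squared={r2:.4f}")
```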

## Random Forests

In [12]:
from sklearn.ensemble import RandomForestRegressor

In [13]:
model = RandomForestRegressor(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=1234).fit(X_train,y_train.values.ravel())

y_test['Returns_p_alt'] = model.predict(X_test)
y_test.head()

|  | Returns | Returns_p | Returns_p_alt |
|---|---|---|---|
| 311 | -0.088191 | 0.006101 | 0.003052 |
| 312 | -0.029688 | -0.002922 | 0.007799 |
| 313 | 0.019951 | 0.006393 | -0.000445 |
| 314 | 0.006224 | 0.009488 | 0.007533 |
| 315 | 0.030638 | 0.007235 | 0.009963 |


In [14]:
# Evaluate out-of-sample fit by regressing realized returns on the random forest predictions
model = sm.OLS(y_test['Returns'], y_test[['Returns_p_alt']].assign(_const=1)).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Returns   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     4.400
Date:                Fri, 06 Dec 2024   Prob (F-statistic):             0.0360
Time:                        17:36:15   Log-Likelihood:                 7393.5
No. Observations:                8300   AIC:                        -1.478e+04
Df Residuals:                    8298   BIC:                        -1.477e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Returns_p_alt     0.1601      0.076      2.098

#### (Continued from the previous problem)

#### (10 points)

#### Suppose you are a portfolio manager in December 2017, and you would like to use return signals to study patterns in returns using data from 2010 to 2017. You will predict "HOMERUN" returns for 2018 and 2019. 

#### You are given two datasets: 
* `signals.xlsx` includes monthly return signals, and all the variables are already lagged.
* `ret.xlsx` includes monthly stock returns.

#### Compare two prediction methods: Simple regressions and Random forests
* Define "homerun" returns as an indicator variable that equals 1 if monthly returns are greater than 0.05 and 0 otherwise.
* Before estimating any model, winsorize the return signals (excluding the y-variable) at the 5% level (i.e., `limits=[0.05, 0.05]`).
* You should use all the return signals in both simple regression and random forest models.
* When estimating the random forest models, use the following parameters: `(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=1234).fit(X_train, y_train.values.ravel())`

#### Grading will be based on the following:
* (3 points) Did students correctly handle and manage the data before estimation?
* (5 points) Did students read, merge, and generate X_train, X_test, y_train, and y_test according to the goal of this question?
* (2 points) Did students correctly present the performance of the two methods (please do not include charts)?

#### You should always print out your results whenever possible so that you do not get penalized for not showing outputs.

## Logistic Regression

In [15]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [16]:
ret['homerun'] = np.where(ret['Returns'] > 0.05, 1, 0)
merged_data = pd.merge(ret, signals, on=['PERMNO', 'year', 'month'], how='left')
merged_data.dropna(inplace=True)
merged_data

Unnamed: 0,PERMNO,year,month,Returns,homerun,Price-to-Book,Cyclically-Adjusted Price to Sales,Cyclically-Adjusted Price-Earnings Ratio,Cyclically-Adjusted Price to Cash Flows,Cyclically-Adjusted Price to Free Cash Flows,...,Price Delay 1,Price Delay 2,Price Delay 3,Active Flows,Short Interest Ratio,Short Interest Scaled by Supply,Short Squeeze Probability,Volatility of Liquidity,Probability of dividend increase,Momentum Acceleration
215,10026,2010,1,0.046296,0,-2.001473,-0.262609,0.042719,0.081986,0.043090,...,0.109283,0.117712,0.125230,-0.472102,-19.596277,-30.924706,0.396259,0.685669,0.000513,-0.259706
216,10026,2010,2,0.021526,0,-2.094134,-0.308845,0.040788,0.078282,0.041143,...,0.307764,-0.260631,-0.325766,-0.472102,-21.234505,-33.509975,0.327583,0.542910,0.270081,-0.006817
217,10026,2010,3,0.020311,0,-2.122217,-0.341798,0.039065,0.075209,0.039325,...,0.373770,-0.444835,-0.441161,-0.182801,-26.737495,-40.171337,0.292102,0.559151,0.177226,-0.240768
218,10026,2010,4,0.071774,1,-2.159981,-0.360631,0.038336,0.073806,0.038591,...,0.430938,-1.608122,-1.622348,-0.182801,-27.394373,-41.158257,0.511576,0.889156,0.016806,-0.384905
219,10026,2010,5,-0.046362,0,-2.315011,-0.430001,0.035767,0.068860,0.036005,...,0.354480,-1.214991,-1.273745,-0.182801,-23.192181,-34.844734,0.339407,0.892484,0.118616,-0.254163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859322,93423,2015,12,0.058574,1,-11.978427,-1.480301,0.036168,0.068284,0.022223,...,0.231828,-1.998100,-2.072584,0.063021,-50.951458,-58.558365,0.193941,0.868222,0.029665,-0.220867
859323,93423,2016,1,-0.085002,0,-12.680053,-1.537223,0.034167,0.064506,0.020993,...,0.153004,-3.169122,-2.873457,0.063021,-54.418720,-62.543278,0.539353,0.895557,0.149446,-0.248443
859324,93423,2016,2,0.023274,0,-11.602225,-1.448390,0.037341,0.070498,0.022944,...,0.165371,-0.847746,-0.726417,0.063021,-58.107441,-66.782715,0.415136,0.774848,0.307180,-0.078321
859326,93423,2016,4,0.082177,1,-30.884527,-1.523693,0.025958,0.072775,0.024961,...,0.023173,-0.529685,-0.513124,-0.038767,-58.657127,-65.876579,0.280588,0.595971,0.041203,-0.004096


In [17]:
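# Winsorize each return signal at the 5% level in both tails
# (identifier columns, Returns, and the homerun indicator are excluded)
X_cols = [col for col in merged_data.columns if col not in ['PERMNO', 'year', 'month', 'Returns', 'homerun']]
for col in X_cols:
    merged_data[col] = mstats.winsorize(merged_data[col], limits=[0.05, 0.05])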

# Split the data into train and test
train_data = merged_data[merged_data['year'] <= 2017]
test_data = merged_data[merged_data['year'] > 2017]
X_train = train_data[X_cols].assign(_const=1)
y_train = train_data[['homerun']]
X_test = test_data[X_cols].assign(_const=1)
y_test = test_data[['homerun']].copy()

print(f'# Observations in X_train: {len(X_train)}')
print(f'# Observations in y_train: {len(y_train)}')
print(f'# Observations in X_test: {len(X_test)}')
print(f'# Observations in y_test: {len(y_test)}')

display(X_train.head())
display(y_train.head())

# Observations in X_train: 32510
# Observations in y_train: 32510
# Observations in X_test: 8300
# Observations in y_test: 8300


Unnamed: 0,Price-to-Book,Cyclically-Adjusted Price to Sales,Cyclically-Adjusted Price-Earnings Ratio,Cyclically-Adjusted Price to Cash Flows,Cyclically-Adjusted Price to Free Cash Flows,Fiscal Book-to-Market,Short Term Leverage,Long Term Leverage,Accruals,Accrual Percentage,...,Price Delay 2,Price Delay 3,Active Flows,Short Interest Ratio,Short Interest Scaled by Supply,Short Squeeze Probability,Volatility of Liquidity,Probability of dividend increase,Momentum Acceleration,_const
215,-2.001473,-0.262609,0.042719,0.081986,0.04309,0.462267,-40.93116,0.13631,0.092713,0.951806,...,0.117712,0.12523,-0.377582,-19.596277,-30.924706,0.396259,0.685669,0.000513,-0.259706,1
216,-2.094134,-0.308845,0.040788,0.078282,0.041143,0.462267,-40.93116,0.13631,0.092713,0.951806,...,-0.260631,-0.325766,-0.377582,-21.234505,-33.509975,0.327583,0.568199,0.135139,-0.006817,1
217,-2.122217,-0.341798,0.039065,0.075209,0.039325,0.470744,-40.93116,0.13631,0.093389,0.887193,...,-0.444835,-0.441161,-0.182801,-26.737495,-40.171337,0.292102,0.568199,0.135139,-0.240768,1
218,-2.159981,-0.360631,0.038336,0.073806,0.038591,0.461961,-40.93116,0.13631,0.093389,0.887193,...,-1.608122,-1.622348,-0.182801,-27.394373,-41.158257,0.511576,0.889156,0.016806,-0.384905,1
219,-2.315011,-0.430001,0.035767,0.06886,0.036005,0.431002,-40.93116,0.13631,0.093389,0.887193,...,-1.214991,-1.273745,-0.182801,-23.192181,-34.844734,0.339407,0.892484,0.118616,-0.254163,1


|  | homerun |
|---|---|
| 215 | 0 |
| 216 | 0 |
| 217 | 0 |
| 218 | 1 |
| 219 | 0 |


In [18]:
model = LogisticRegression(C=1e9, max_iter=10000).fit(X_train, y_train.values.ravel())

y_test['homerun_p'] = model.predict(X_test)
y_test.head()

|  | homerun | homerun_p |
|---|---|---|
| 311 | 0 | 0 |
| 312 | 0 | 0 |
| 313 | 0 | 0 |
| 314 | 0 | 0 |
| 315 | 0 | 0 |


In [19]:
# Rows are actual classes and columns are predicted classes
conf_mat = confusion_matrix(y_test['homerun'], y_test['homerun_p'])
print(conf_mat)

TN, FP, FN, TP = conf_mat.ravel()

Accuracy = (TN + TP)/(TN + TP + FN + FP)
Sensitivity = TP/(TP + FN)  # Share of actual homeruns predicted correctly
Specificity = TN/(TN + FP)  # Share of actual non-homeruns predicted correctly

print(f'Accuracy    : {Accuracy:.3f}')
print(f'Sensitivity : {Sensitivity:.3f}')
print(f'Specificity : {Specificity:.3f}')

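# Regress the realized indicator on the predicted class to test whether the predictions are informative
model = sm.Logit(y_test['homerun'], y_test[['homerun_p']].assign(_const=1)).fit()
print(model.summary())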

[[5760   18]
 [2515    7]]
Accuracy    : 0.695
Sensitivity : 0.003
Specificity : 0.997
Optimization terminated successfully.
         Current function value: 0.614092
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                homerun   No. Observations:                 8300
Model:                          Logit   Df Residuals:                     8298
Method:                           MLE   Df Model:                            1
Date:                Fri, 06 Dec 2024   Pseudo R-squ.:               6.720e-06
Time:                        17:36:34   Log-Likelihood:                -5097.0
converged:                       True   LL-Null:                       -5097.0
Covariance Type:            nonrobust   LLR p-value:                    0.7935
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
homerun_p     -0.1158
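
Accuracy on its own can be misleading when one class dominates, so it helps to compare it with the accuracy of always predicting the more common class. The sketch below recomputes the three reported metrics from the logistic regression confusion matrix printed above and adds that majority-class baseline. The function name `classification_metrics` and the `majority_baseline` key are hypothetical; only the four counts come from the output above.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute summary metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Accuracy obtained by always predicting the more common actual class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


# Counts copied from the logistic regression confusion matrix printed above
logit_metrics = classification_metrics(tn=5760, fp=18, fn=2515, tp=7)
for name, value in logit_metrics.items():
    print(f"{name:18s}: {value:.3f}")
```

For these counts the baseline is 5778/8300 ≈ 0.696, marginally above the model's accuracy of 0.695, which is consistent with the very low sensitivity: the model almost never predicts a homerun.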

## Random Forests

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
model = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=1234).fit(X_train,y_train.values.ravel())

y_test['homerun_p_alt'] = model.predict(X_test)
y_test.head()

|  | homerun | homerun_p | homerun_p_alt |
|---|---|---|---|
| 311 | 0 | 0 | 0 |
| 312 | 0 | 0 | 0 |
| 313 | 0 | 0 | 0 |
| 314 | 0 | 0 | 0 |
| 315 | 0 | 0 | 0 |


In [22]:
# Rows are actual classes and columns are predicted classes
conf_mat = confusion_matrix(y_test['homerun'], y_test['homerun_p_alt'])
print(conf_mat)

TN, FP, FN, TP = conf_mat.ravel()

Accuracy = (TN + TP)/(TN + TP + FN + FP)
Sensitivity = TP/(TP + FN)  # Share of actual homeruns predicted correctly
Specificity = TN/(TN + FP)  # Share of actual non-homeruns predicted correctly

print(f'Accuracy    : {Accuracy:.3f}')
print(f'Sensitivity : {Sensitivity:.3f}')
print(f'Specificity : {Specificity:.3f}')

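# Regress the realized indicator on the random forest's predicted class to test whether the predictions are informative
model = sm.Logit(y_test['homerun'], y_test[['homerun_p_alt']].assign(_const=1)).fit()
print(model.summary())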

[[5689   89]
 [2474   48]]
Accuracy    : 0.691
Sensitivity : 0.019
Specificity : 0.985
Optimization terminated successfully.
         Current function value: 0.614012
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                homerun   No. Observations:                 8300
Model:                          Logit   Df Residuals:                     8298
Method:                           MLE   Df Model:                            1
Date:                Fri, 06 Dec 2024   Pseudo R-squ.:               0.0001362
Time:                        17:36:53   Log-Likelihood:                -5096.3
converged:                       True   LL-Null:                       -5097.0
Covariance Type:            nonrobust   LLR p-value:                    0.2387
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
homerun_p_alt  