# Modeling How to Beat the Streak

In 1941, Joe DiMaggio recorded hits in 56 consecutive games of Major League Baseball. This record has stood for 80 years and is seen as one of the most unlikely to be broken. In 2001, MLB started a fantasy game called Beat the Streak, challenging fans to pick, among all players, one who will get a hit on a given day, and then to do that for 57 consecutive days to symbolically beat DiMaggio's streak. In 20 years nobody has won, though a few have come as close as 51.

In this notebook I aim to take a statistical approach to determining these daily picks in order to give myself the best possible chance of winning this competition and, with it, the $5.6 million prize.
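To put the difficulty in numbers: if each daily pick succeeds with probability p, and the days are treated as independent with a constant p (both simplifying assumptions of mine, not properties of the contest), the chance of 57 straight successes is p raised to the 57th power. The sketch below uses a hypothetical helper, `streak_probability`, to show how sharply that chance depends on per-pick accuracy.

```python
def streak_probability(p_hit: float, streak_length: int = 57) -> float:
    """Probability that `streak_length` independent picks all succeed."""
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be between 0 and 1")
    return p_hit ** streak_length


# Compare a few hypothetical per-pick success rates.
for p in (0.70, 0.80, 0.90):
    print(f"p = {p:.2f} -> P(57 straight) = {streak_probability(p):.2e}")
```

Even small gains in per-pick accuracy compound over 57 days, which is why the rest of the notebook focuses on improving the daily pick.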

My first step will be to gather the relevant data and assemble it in a useful way.

In [None]:
import pandas as pd
pd.set_option("display.max_columns", None)
import numpy as np
import pybaseball
import statsapi

from datetime import datetime, timedelta
from time import sleep
from dateutil import tz
from geopy.geocoders import Nominatim


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


pybaseball.statcast allows me to gather details of every MLB game since 2017 down to the individual pitch level. As this alone is a lot of data I'll save it as I gather it so I don't have to rerun this part.

In [None]:
# I've set a limit for my data to between the beginning of the 2019 season and June 30th, 2021.

data = pybaseball.statcast('2019-03-20', '2019-09-29')

data.to_csv('untouched_2019_statcast_pbp.csv')

data = pybaseball.statcast('2020-07-23', '2020-09-27')

data.to_csv('untouched_2020_statcast_pbp.csv')

data = pybaseball.statcast('2021-04-01', '2021-06-30')

data.to_csv('untouched_2021_statcast_pbp.csv')

In [None]:
df = pd.concat([pd.read_csv('untouched_2019_statcast_pbp.csv', index_col=0),
                pd.read_csv('untouched_2020_statcast_pbp.csv', index_col=0),
               pd.read_csv('untouched_2021_statcast_pbp.csv', index_col=0)], ignore_index=True)

In [None]:
df.head()

In [None]:
df.info()

The various features vary wildly in count. While some of this variation would be expected for columns like "on_3b", "on_2b" and "on_1b," others demonstrate a change in how the data has been collected. For example, despite the presence of an umpire column it doesn't seem like statcast presently collects information about the people umpiring these games.

To start my data cleaning I'll drop columns with only one value.

In [None]:
for col in df.columns:
        if len(df[col].unique()) == 1 and col not in ['home_']:
            df = df.drop([col], axis=1)
print('dropping columns done')
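The same clean-up can be sketched without dropping columns one at a time inside a loop. This is an illustration on a toy frame, not the notebook's data; `drop_constant_columns` is a hypothetical helper name, and `nunique(dropna=False)` is used so that an all-missing column counts as having a single value, as it does with `len(df[col].unique())`.

```python
import pandas as pd


def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so all-missing columns are caught too.
    distinct_counts = frame.nunique(dropna=False)
    constant_cols = distinct_counts[distinct_counts <= 1].index
    return frame.drop(columns=constant_cols)


toy = pd.DataFrame({
    "game_year": [2019, 2019, 2019],    # constant, so it is dropped
    "umpire": [None, None, None],       # all missing, so it is dropped
    "pitch_type": ["FF", "SL", "FF"],   # varies, so it is kept
})
print(drop_constant_columns(toy).columns.tolist())
```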

Now to explore some individual columns to see if they need to be transformed in any way.

In [None]:
df['pitch_type'].value_counts()

In [None]:
df.columns

I'll likely need to transform "pitch_type" into ordinals later.

In [None]:
df['batter'].value_counts()

In [None]:
pybaseball.playerid_reverse_lookup([543760, 656803], key_type='mlbam')

In [None]:
df['pitcher'].value_counts()

In [None]:
df['game_pk'].value_counts()

In [None]:
statsapi.schedule(game_id=633588)

These three columns refer to an internal ID system which will have to be consulted.

In [None]:
df['events'].value_counts()

In [None]:
df['description'].value_counts()

One thing that's become apparent to me at this point is that many of these features are not necessarily things the user would know prior to the game, such as how hard an individual ball is hit. These features will have to be dropped.

Additionally, some information that the user would know and that could be important, such as who the expected starting pitchers are, is not included. At this point we'll use game_pk to add in more information through the MLB-statsapi library.

In [None]:
df.head()

"game_datetime" seems to be in UTC. In order to account for possible differences between day and night games I'll convert it to local time and then to a timestamp.

Now because of the size of the data and the way I'll be manipulating it later, I found it useful to change a number of the object columns into integers and explicitly set types to the smallest possible as early as possible in the process to cut down on run time.

In [None]:
pa_ending_events = np.array(['field_out',
                                'strikeout',
                                'single',
                                'walk',
                                'double',
                                'home_run',
                                'force_out',
                                'grounded_into_double_play',
                                'hit_by_pitch',
                                'field_error',
                                'sac_fly',
                                'triple',
                                'sac_bunt',
                                'fielders_choice',
                                'double_play',
                                'fielders_choice_out',
                                'strikeout_double_play',
                                'catcher_interf',
                                'sac_fly_double_play',
                                'triple_play',
                                'sac_bunt_double_play'])

hit_events = np.array(['single',
                        'double',
                        'home_run',
                        'triple'])
   

In [None]:
def set_types(df):
    
#     First we need to fill in the  missing values in numeric columns to make the type transformation possible
    
    num_columns = ['release_speed', 'batter', 'pitcher', 'zone', 'hit_location', 'balls', 'strikes', 'game_year', 'on_3b', 'on_2b', 'on_1b', 'outs_when_up', 'inning', 'fielder_2', 'hit_distance_sc', 'release_spin_rate', 'game_pk', 'pitcher.1', 'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9', 'woba_value', 'woba_denom', 'babip_value', 'at_bat_number', 'pitch_number', 'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score', 'post_home_score', 'post_bat_score', 'post_fld_score']

    num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median'))
    ])
    preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_columns)])
    clf = Pipeline(steps=[('preprocessor', preprocessor)])

    t = clf.fit_transform(df)
    
    df[num_columns] = t
    
    
    df['game_date'] = pd.to_datetime(df['game_date'], format='%Y-%m-%d')
    print('datetime done')
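    # 1 = hit, 0 = any other plate-appearance-ending event, -1 = pitch that did not end a plate appearance
    df['events'] = df['events'].apply(lambda x: 1 if x in hit_events else (0 if x in pa_ending_events else -1)).astype(np.int8)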
    print('events done')
    df['batter_righty'] = df['stand'].apply(lambda x: 1 if x == 'R' else 0).astype(np.int8)
    print('stand done')
    df['pitcher_righty'] = df['p_throws'].apply(lambda x: 1 if x == 'R' else 0).astype(np.int8)
    print('pitcher done')
    df['bottom'] = df['inning_topbot'].apply(lambda x: 1 if x == 'Bot' else 0).astype(np.int8)
    print('inning done')
    df['release_speed'] = df['release_speed'].astype(np.int8)
    df['batter'] = df['batter'].astype(np.int32)
    df['pitcher'] = df['pitcher'].astype(np.int32)
    df['zone'] = df['zone'].astype(np.int8)
    df['hit_location'] = df['hit_location'].astype(np.int8)
    df['balls'] = df['balls'].astype(np.uint8)
    df['strikes'] = df['strikes'].astype(np.uint8)
    df['game_year'] = df['game_year'].astype(np.int16)
    df['on_3b'] = df['on_3b'].astype(np.int32)
    df['on_2b'] = df['on_2b'].astype(np.int32)
    df['on_1b'] = df['on_1b'].astype(np.int32)
    df['outs_when_up'] = df['outs_when_up'].astype(np.uint8)
    df['inning'] = df['inning'].astype(np.uint8)
    df['fielder_2'] = df['fielder_2'].astype(np.int32)
    df['hit_distance_sc'] = df['hit_distance_sc'].astype(np.int16)
    df['release_spin_rate'] = df['release_spin_rate'].astype(np.int16)
    df['game_pk'] = df['game_pk'].astype(np.int32)
    df['pitcher.1'] = df['pitcher.1'].astype(np.int32)
    df['fielder_2.1'] = df['fielder_2.1'].astype(np.int32)
    df['fielder_3'] = df['fielder_3'].astype(np.int32)
    df['fielder_4'] = df['fielder_4'].astype(np.int32)
    df['fielder_5'] = df['fielder_5'].astype(np.int32)
    df['fielder_6'] = df['fielder_6'].astype(np.int32)
    df['fielder_7'] = df['fielder_7'].astype(np.int32)
    df['fielder_8'] = df['fielder_8'].astype(np.int32)
    df['fielder_9'] = df['fielder_9'].astype(np.int32)
    df['woba_value'] = df['woba_value'].astype(np.int8)
    df['woba_denom'] = df['woba_denom'].astype(np.int8)
    df['babip_value'] = df['babip_value'].astype(np.int8)
    df['at_bat_number'] = df['at_bat_number'].astype(np.uint8)
    df['pitch_number'] = df['pitch_number'].astype(np.uint8)
    df['home_score'] = df['home_score'].astype(np.uint8)
    df['away_score'] = df['away_score'].astype(np.uint8)
    df['bat_score'] = df['bat_score'].astype(np.uint8)
    df['fld_score'] = df['fld_score'].astype(np.uint8)
    df['post_away_score'] = df['post_away_score'].astype(np.uint8)
    df['post_home_score'] = df['post_home_score'].astype(np.uint8)
    df['post_bat_score'] = df['post_bat_score'].astype(np.uint8)
    df['post_fld_score'] = df['post_fld_score'].astype(np.uint8)
    df = df.select_dtypes(exclude=['object'])
    
    df = df.drop_duplicates()
    
    return(df)

df = set_types(df)
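To see why the explicit types matter, here is a small sketch on a toy frame. It uses `pd.to_numeric(..., downcast=...)`, which picks the smallest fitting type automatically; that is an alternative I'm assuming for illustration, not what `set_types` does, and the column values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "balls": np.array([0, 1, 2, 3] * 250, dtype=np.int64),
    "batter": np.array([543760, 656803] * 500, dtype=np.int64),
})

before = toy.memory_usage(deep=True).sum()

# Downcast each column to the smallest unsigned integer type that holds its values.
compact = toy.copy()
compact["balls"] = pd.to_numeric(compact["balls"], downcast="unsigned")
compact["batter"] = pd.to_numeric(compact["batter"], downcast="unsigned")

after = compact.memory_usage(deep=True).sum()
print(compact.dtypes.to_dict(), before, after)
```

The savings scale with row count, so they become significant on pitch-level data covering several seasons.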

Now to drop all rows that don't include a plate-appearance-ending play.

In [None]:
df_filtered = df[df['events'].isin([1,0])]
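This filter only removes rows if pitches that did not end a plate appearance carry a value other than 0 or 1. A minimal sketch of one such encoding follows, assuming a sentinel of -1 for those pitches; `encode_event` and the abbreviated event sets are illustrative names, not part of the notebook.

```python
import pandas as pd

HIT_EVENTS = {"single", "double", "triple", "home_run"}
# Abbreviated for illustration; the notebook's pa_ending_events list is longer.
OTHER_PA_ENDING_EVENTS = {"field_out", "strikeout", "walk"}


def encode_event(event) -> int:
    """1 = hit, 0 = other plate-appearance-ending event, -1 = anything else."""
    if event in HIT_EVENTS:
        return 1
    if event in OTHER_PA_ENDING_EVENTS:
        return 0
    return -1  # mid-plate-appearance pitches (missing event) and unlisted events


raw = pd.Series(["single", None, "strikeout", None, "home_run", "walk"])
encoded = raw.apply(encode_event).astype("int8")

# Keep only the rows that ended a plate appearance.
pa_enders = encoded[encoded.isin([0, 1])]
print(pa_enders.tolist())
```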

In [None]:
df_filtered['game_pk'].unique()

At this point, purely for presentational purposes, I'm going to trim the dataframe down to the first 5000 rows so that the following code runs in a timely manner. A project like this can draw on a huge amount of data, and on the full dataset parts of what follows can run for literal days.

In [None]:
df_filtered = df_filtered.head(5000)

I'd also like to account for physical location (whether latitude, longitude, and altitude play a role) and for weather. To add these I'll need to map each game to its venue via statsapi.schedule(), then map each venue to coordinates that I can pass to a weather API called Visual Crossing. I'll also need to convert the game times to local time for Visual Crossing.

In [None]:
def convert_UTC_to_local(row):
    venue_coords = pd.read_csv('Parks.csv')
    
    venue_name = row['venue_name']
    game_datetime = row['game_datetime']
    from_zone = tz.gettz('UTC')

    locator = Nominatim(user_agent='myGeocoder')
    try:
        city = venue_coords[venue_coords['NAME'] == row['venue_name']]['CITY'].iloc[0]
        lat = venue_coords[venue_coords['NAME'] == row['venue_name']]['Latitude'].iloc[0]
        lon = venue_coords[venue_coords['NAME'] == row['venue_name']]['Longitude'].iloc[0]
        alt = venue_coords[venue_coords['NAME'] == row['venue_name']]['Altitude'].iloc[0]
    except IndexError:
        # .iloc[0] raises IndexError when the venue is missing from Parks.csv
        raise NameError(venue_name)
    if city == 'Tokyo':
        to_zone = tz.gettz('Asia/Tokyo')
    elif city == 'London':
        to_zone = tz.gettz('Europe/London')
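    elif city == 'Phoenix':
        # Arizona stays on Mountain Standard Time year-round (no daylight saving)
        to_zone = tz.gettz('America/Phoenix')
    elif city in ['San Francisco',
                        'Oakland',
                        'Seattle',
                        'Los Angeles',
                        'San Diego',
                        'Anaheim']:
        to_zone = tz.gettz('America/Los_Angeles')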
    elif city == 'Denver':
        to_zone = tz.gettz('America/Denver')
    elif city in ['Minneapolis',
                        'Milwaukee',
                        'Chicago',
                        'St. Louis',
                        'Arlington',
                        'Kansas City',
                        'Houston',
                        'Monterrey',
                        'Omaha',
                        'Dyersville']:
        to_zone = tz.gettz('America/Chicago')
    elif city in ['Buffalo',
                        'Detroit',
                        'Cincinnati',
                        'Pittsburgh',
                        'Tampa Bay',
                        'Philadelphia',
                        'Atlanta',
                        'New York',
                        'Washington',
                        'Cleveland',
                        'Miami',
                        'Boston',
                        'Baltimore',
                        'Toronto',
                        'Williamsport',
                        'Dunedin',
                        'St. Petersburg']:
        to_zone = tz.gettz('America/New_York')
    else:
        raise NameError(venue_name, city)
    utc = datetime.strptime(game_datetime, '%Y-%m-%dT%H:%M:%SZ')
    utc = utc.replace(tzinfo=from_zone)
    local = utc.astimezone(to_zone)

    return(local)
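The city-to-zone branching above can also be expressed as a lookup table. The sketch below is a simplified stand-in, not a drop-in replacement: it assumes Python 3.9+ for the standard-library `zoneinfo` module (the notebook uses `dateutil.tz`), the table is abbreviated, and `utc_string_to_local` is a hypothetical name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical, abbreviated city -> IANA zone table for illustration.
CITY_TIME_ZONES = {
    "Tokyo": "Asia/Tokyo",
    "London": "Europe/London",
    "Denver": "America/Denver",
    "Chicago": "America/Chicago",
    "New York": "America/New_York",
    "Seattle": "America/Los_Angeles",
}


def utc_string_to_local(game_datetime: str, city: str) -> datetime:
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' UTC string to the city's local time."""
    try:
        zone = ZoneInfo(CITY_TIME_ZONES[city])
    except KeyError:
        raise KeyError(f"No time zone configured for city {city!r}") from None
    utc = datetime.strptime(game_datetime, "%Y-%m-%dT%H:%M:%SZ")
    utc = utc.replace(tzinfo=timezone.utc)
    return utc.astimezone(zone)


local = utc_string_to_local("2019-07-04T23:05:00Z", "New York")
print(local.isoformat())
```

With a table, supporting a new host city means adding one entry instead of editing a branch.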

In [None]:
venue_coords = pd.read_csv('Parks.csv')

game_pk_df = pd.DataFrame(columns = statsapi.schedule(game_id=566083)[0].keys())

venue_dict = {}

# for each game, get the available data including probable pitchers
for game in df_filtered['game_pk'].unique():
    game_data = statsapi.schedule(game_id = int(game))[-1]
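    # statsapi.schedule returns dicts, so wrap the game in a one-row DataFrame before concatenating
    game_pk_df = pd.concat([game_pk_df, pd.DataFrame([game_data])], ignore_index=True)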
    print(game)

try:
    for venue in game_pk_df['venue_name'].unique():
        print(venue)
        city = venue_coords[venue_coords['NAME'] == venue]['CITY'].iloc[0]
        lat = venue_coords[venue_coords['NAME'] == venue]['Latitude'].iloc[0]
        lon = venue_coords[venue_coords['NAME'] == venue]['Longitude'].iloc[0]
        alt = venue_coords[venue_coords['NAME'] == venue]['Altitude'].iloc[0]
        venue_dict[venue] = np.array((city, lat, lon, alt))
except IndexError:
#     In case there's a new venue name and Parks.csv needs to be updated
    raise NameError(venue)

game_pk_df = game_pk_df.sort_values(['game_datetime'])


# convert UTC time to local time
game_pk_df['local_datetime'] = game_pk_df.apply(lambda row: convert_UTC_to_local(row), axis=1)
game_pk_df['local_datetime'] = game_pk_df['local_datetime'].apply(lambda x: x.strftime("%Y-%m-%dT%H:%M:%S"))

# create a dictionary to map each start time with the associated venues to cut down on api calls
datetime_coordinate_matching = {}
for index, row in game_pk_df.iterrows():
    city, lat, lon, alt = venue_dict[row['venue_name']]
    dc_datetime = row['local_datetime']
    if dc_datetime in datetime_coordinate_matching.keys():
        datetime_coordinate_matching[dc_datetime].append(','.join([str(lat), str(lon)]))
    else:
        datetime_coordinate_matching[dc_datetime] = [','.join([str(lat), str(lon)])]

game_pk_df['coordinates'] = game_pk_df['venue_name'].apply(lambda x: ','.join(venue_dict[x][1:3]))
game_pk_df['alt'] = game_pk_df['venue_name'].apply(lambda x: venue_dict[x][3])
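The grouping of venues by start time is what keeps the number of weather requests down, since one request can cover every venue that shares a start time. A minimal sketch of that idea using `collections.defaultdict`, with made-up start times and coordinates standing in for the rows of game_pk_df:

```python
from collections import defaultdict

# Hypothetical (local start time, "lat,lon") pairs standing in for game_pk_df rows.
games = [
    ("2019-07-04T19:05:00", "40.8296,-73.9262"),
    ("2019-07-04T19:05:00", "39.9061,-75.1665"),
    ("2019-07-04T13:10:00", "41.9484,-87.6553"),
]

coordinates_by_start = defaultdict(list)
for start_time, coordinates in games:
    # Skip repeats so a location is requested only once per start time.
    if coordinates not in coordinates_by_start[start_time]:
        coordinates_by_start[start_time].append(coordinates)

# One request per start time, with locations joined by '|'.
batched = {start: "|".join(coords) for start, coords in coordinates_by_start.items()}
print(len(batched))
```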

In [None]:
game_pk_df.head()

In [None]:
import os
visual_crossing = os.environ.get('VISUAL_CROSSING_API_KEY', '')
if not visual_crossing:
    print("Warning: VISUAL_CROSSING_API_KEY not set. Weather data fetching will fail.")
import requests

api_counter = 0

weather_df = pd.DataFrame()
for key, value in datetime_coordinate_matching.items():
    print(key)
    url_locations = '|'.join(value)
    URL = f'https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/weatherdata/history?&aggregateHours=1&startDateTime={key}&endDateTime={key}&unitGroup=us&contentType=json&location={url_locations}&key={visual_crossing}'
#     api is limited to 1000 free calls per day, after which it's $0.0001 per result
    if api_counter == 1000:
        print('Preprocessing will continue at ' + (datetime.now()+timedelta(seconds = 86400)).strftime('%H:%M:%S'))
        sleep(86400)
        api_counter = 0
    try:
        api_counter += 1
        response = requests.get(URL)
        data = response.json()
        locations = list(data['locations'].keys())
        for each in locations:
            values = data['locations'][each]['values'][0]
            weather_df = pd.concat([weather_df, pd.DataFrame([{**{'coordinates': each}, **values}])], ignore_index=True)
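    except Exception as err:
        # response may not exist if the request itself failed, so report only the request inputs
        raise RuntimeError(f'Weather lookup failed for {key} at {value}') from err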

weather_df['datetimeStr'] = weather_df['datetimeStr'].apply(lambda x: x[:-6])
weather_df = weather_df.rename(columns={'datetimeStr': 'local_datetime'})

# merge all this new data together

games_and_weather = pd.merge(
    game_pk_df,
    weather_df,
    how="left",
    on=['local_datetime', 'coordinates'],
    suffixes=("_gpk", "_acw"),
)

df_detailed = pd.merge(
    games_and_weather,
    df_filtered,
    how="right",
    left_on='game_id',
    right_on='game_pk',
    suffixes=("_gaw", "_dff"),
)
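Because the weather join matches on a formatted datetime string and a coordinate string, a small formatting difference would silently leave games without weather. One way to check for that, sketched here on toy frames with invented values, is pandas' `indicator` and `validate` options to `merge`.

```python
import pandas as pd

games = pd.DataFrame({
    "game_id": [1, 2, 3],
    "local_datetime": ["2019-07-04T19:05:00", "2019-07-04T19:05:00", "2019-07-05T13:10:00"],
    "coordinates": ["40.8,-73.9", "39.9,-75.2", "41.9,-87.7"],
})
weather = pd.DataFrame({
    "local_datetime": ["2019-07-04T19:05:00", "2019-07-04T19:05:00"],
    "coordinates": ["40.8,-73.9", "39.9,-75.2"],
    "temp": [84.0, 86.5],
})

merged = games.merge(
    weather,
    how="left",
    on=["local_datetime", "coordinates"],
    validate="many_to_one",   # raise if a weather key appears more than once
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Games whose weather lookup found no matching row.
missing_weather = merged.loc[merged["_merge"] == "left_only", "game_id"].tolist()
print(missing_weather)
```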

There are a couple more pieces of information I'd like to add: whether a game was scheduled for only 7 innings, as doubleheader games were in 2020 and 2021 (due to COVID), and whether a designated hitter is used, which could indicate longer outings for the starting pitcher.

In [None]:
american_league_teams = np.array(['Boston Red Sox', 'Houston Astros', 'Chicago White Sox', 'Tampa Bay Rays', 'Oakland Athletics', 'Seattle Mariners', 'New York Yankees', 'Toronto Blue Jays', 'Los Angeles Angels', 'Cleveland Indians', 'Detroit Tigers', 'Kansas City Royals', 'Minnesota Twins', 'Texas Rangers', 'Baltimore Orioles'])

In [None]:
df_detailed['covid_doubleheader'] = df_detailed.apply(lambda row: 1 if row['game_year'] in [2020, 2021] and row['doubleheader'] =='Y' else 0, axis=1)



df_detailed['designated_hitter'] = df_detailed.apply(lambda row: 1 if row['home_name'] in american_league_teams or row['game_year'] == 2020 else 0, axis = 1)
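The two flags can also be computed with boolean column operations instead of a row-wise `apply`, which tends to be faster on large frames. A sketch on a toy frame, with an abbreviated stand-in for american_league_teams and invented rows:

```python
import numpy as np
import pandas as pd

# Abbreviated, hypothetical stand-in for american_league_teams.
al_teams = ["Boston Red Sox", "Houston Astros"]

toy = pd.DataFrame({
    "game_year": [2019, 2020, 2021, 2021],
    "doubleheader": ["N", "Y", "Y", "N"],
    "home_name": ["Boston Red Sox", "Chicago Cubs", "Houston Astros", "Chicago Cubs"],
})

# Same conditions as the apply-based version, evaluated column-wise.
seven_inning = toy["game_year"].isin([2020, 2021]) & (toy["doubleheader"] == "Y")
uses_dh = toy["home_name"].isin(al_teams) | (toy["game_year"] == 2020)

toy["covid_doubleheader"] = seven_inning.astype(np.int8)
toy["designated_hitter"] = uses_dh.astype(np.int8)
print(toy[["covid_doubleheader", "designated_hitter"]].values.tolist())
```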

And we'll save our dataframe here in case something were to go wrong.

In [None]:
df_detailed.to_csv('df_detailed')

At this point I have a good amount of data but I'm skeptical any model I use would be able to understand who is a good hitter or pitcher, two factors that would likely be extremely relevant. To make up for this I'm going to assemble a handful of statistics for each player based on their performance in recent history - the previous 2 years.
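The key constraint is that a player's statistics for a given game may only use plate appearances from before that game, otherwise the target leaks into the features. A stripped-down sketch of that trailing window, using only the standard library and a made-up plate-appearance log (`trailing_hits_per_pa` is a hypothetical name):

```python
from datetime import date, timedelta

# Hypothetical plate-appearance log: (game_date, batter_id, got_hit).
plate_appearances = [
    (date(2018, 5, 1), 543760, 1),   # older than the window, ignored
    (date(2019, 6, 1), 543760, 1),
    (date(2019, 6, 2), 543760, 0),
    (date(2021, 4, 5), 543760, 1),
    (date(2021, 4, 6), 543760, 1),   # the game being predicted, ignored
]


def trailing_hits_per_pa(log, batter_id, game_date, weeks=104):
    """H/PA over the `weeks` before `game_date`, excluding the game day itself."""
    end = game_date - timedelta(days=1)
    start = end - timedelta(weeks=weeks)
    outcomes = [hit for day, batter, hit in log
                if batter == batter_id and start <= day <= end]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


rate = trailing_hits_per_pa(plate_appearances, 543760, date(2021, 4, 6))
print(rate)
```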

In [None]:
# data = statcast('2017-01-01', '2017-12-31')

# data.to_csv('untouched_2017_statcast_pbp.csv')



# data = statcast('2018-01-01', '2018-12-31')

# data.to_csv('untouched_2018_statcast_pbp.csv')

history = pd.concat([
                    # index_col=0 keeps the saved index from becoming an extra 'Unnamed: 0' column
                    set_types(pd.read_csv('untouched_2017_statcast_pbp.csv', index_col=0)),
                    set_types(pd.read_csv('untouched_2018_statcast_pbp.csv', index_col=0)),
                    set_types(pd.read_csv('untouched_2019_statcast_pbp.csv', index_col=0)),
                    set_types(pd.read_csv('untouched_2020_statcast_pbp.csv', index_col=0)),
                    set_types(pd.read_csv('untouched_2021_statcast_pbp.csv', index_col=0))
                    ], ignore_index=True)

In [None]:
def derived_cumulative_stats(player, stats_type, history, start_date=None, end_date=datetime.strptime('2021-06-30', '%Y-%m-%d')):
    print(end_date)
    print(player)

#     in order to speed this up and cut down on api calls, I'll simply be slicing a collection of all statcast data since 2017 (2 years prior to the earliest data I'm modeling)

    if start_date is None:
        start_date = end_date-timedelta(weeks=104)
    if isinstance(player, str):
#         playerid_lookup has trouble accounting for 3 names
        if player[-3:] == 'Jr.':
            player = player[:-4]
        player = player.split(' ')
        player.reverse()
        try:
            player_info = pybaseball.playerid_lookup(player[0], player[-1])
            if len(player_info) >1:
                player_info = player_info.iloc[0]
                player = player_info['key_mlbam']
            elif player_info.empty:
                player_info = pybaseball.playerid_lookup(player[0], player[-1], fuzzy=True).iloc[0]
                player = player_info['key_mlbam']
            else:
                try:
                    player = player_info.iloc[0]['key_mlbam']
                except:
                    raise NameError(player_info, player)
        except:
            raise NameError(player)
            
    history = history[(history['game_date'] >= start_date) & (history['game_date'] <= end_date)]
    print('game_date filtered')
    if stats_type == 'pitcher':
        try:
            history = history[history['pitcher'] == player]
        except:
            raise NameError(player)
        print('player filtered')
    #   PITCH METRICS
        pitches = {}
        while True:
            try:
                pitches['pitch_hand'] = statsapi.player_stat_data(player)['pitch_hand']
                break
            except:
                print('pitch_hand_error')
                sleep(10)
        print('pitch_hand')
    #     filter history down to PA-enders
        history = history[history['events'].isin([1, 0])]
        print('pa ending events')
    #    PLAYER METRICS
        games = history['game_pk'].unique()
        at_bat_list = []
        num_hits_list = []
        for game in games:
            game = history[history['game_pk'] == game]
            inning = game['inning'].max()
            at_bats = len(game)
            at_bat_list.append(at_bats)
            for i in range(1, inning+1):
                num_hits = len(game[(game['inning'] == i) & (game['events'] == 1)])
                num_hits_list.append(num_hits)
        pitches['games_played_last_2_years_pitcher'] = len(games)
        pitches['avg_PAs_per_apparence_pitcher'] = np.array(at_bat_list).mean()
        pitches['avg_hits_per_inning'] = np.array(num_hits_list).mean()

#         L/R splits
        left_pitcher = history[history['batter_righty'] == 0]
        right_pitcher = history[history['batter_righty'] == 1]
        l_pas = len(left_pitcher)
        r_pas = len(right_pitcher)
        # build each mask from the frame it filters so the boolean index aligns
        l_hits = len(left_pitcher[left_pitcher['events'] == 1])
        r_hits = len(right_pitcher[right_pitcher['events'] == 1])

        if l_pas > 0 and r_pas > 0:
            pitches['H/PA_pitcher'] = (l_hits + r_hits)/(l_pas + r_pas)
            pitches['against_lefties_H/PA'] = (l_hits)/(l_pas)
            pitches['against_righties_H/PA'] = (r_hits)/(r_pas)
        elif l_pas > 0:
            pitches['H/PA_pitcher'] = (l_hits + r_hits)/(l_pas + r_pas)
            pitches['against_lefties_H/PA'] = (l_hits)/(l_pas)
            pitches['against_righties_H/PA'] = 0
        elif r_pas > 0:
            pitches['H/PA_pitcher'] = (l_hits + r_hits)/(l_pas + r_pas)
            pitches['against_lefties_H/PA'] = 0
            pitches['against_righties_H/PA'] = (r_hits)/(r_pas)
        else:
            pitches['H/PA_pitcher'] = 0
            pitches['against_lefties_H/PA'] = 0
            pitches['against_righties_H/PA'] = 0
        return(pitches)

    
    elif stats_type == 'batter':
        history = history[history['batter'] == player]


#     filter history down to PA-enders
        history = history[history['events'].isin([1, 0])]

#     H/PA per pitch type
#     K/PA per pitch type
#     H/PA vs righties, lefties
#     PA/G
#     avg_launch_angle
#     avg_launch_speed
#     xBA based on estimated_ba_using_speedangle



        # pitches = history['pitch_type'].unique()
        games = history['game_pk'].unique()
        pas_list = []

        batter = {}

        for game in games:
            pas = len(history[history['game_pk'] == game])
            # print(pas)
            pas_list.append(pas)
        batter['games_played_last_2_years_batter'] = len(games)
        batter['PA/G_batter'] = np.array(pas_list).mean()

#         righty/lefty split
        pitcher_right = history[history['pitcher_righty'] == 1]
        pas_r = len(pitcher_right)
        hits_r = len(pitcher_right[pitcher_right['events'] == 1])

        if pas_r > 0:
            batter['H/PA_against_R'] = hits_r/pas_r
        else:
            batter['H/PA_against_R'] = 0

        pitcher_left = history[history['pitcher_righty'] == 0]
        pas_l = len(pitcher_left)
        hits_l = len(pitcher_left[pitcher_left['events'] == 1])

        if pas_l > 0:
            batter['H/PA_against_L'] = hits_l/pas_l
        else:
            batter['H/PA_against_L'] = 0

        if pas_r > 0 or pas_l > 0:
            batter['H/PA_batter'] = (hits_r+hits_l)/(pas_r+pas_l)
        else:
            batter['H/PA_batter'] = 0
        batter['avg_launch_angle'] = history['launch_angle'].mean()
        batter['avg_launch_speed'] = history['launch_speed'].mean()
        batter['xBA'] = history['estimated_ba_using_speedangle'].mean()

        return(batter)

We'll run each projected starting pitcher through derived_cumulative_stats to assemble their statistics as of the day before each game.

In [None]:
pitcher_df = pd.concat([df_detailed[['game_id', 'game_date_gaw', 'away_probable_pitcher']].rename(columns={'away_probable_pitcher': 'probable_pitcher'}), df_detailed[['game_id', 'game_date_gaw', 'home_probable_pitcher']].rename(columns={'home_probable_pitcher': 'probable_pitcher'})])
pitcher_df = pitcher_df.drop_duplicates()
pitcher_df['probable_pitcher'] = pitcher_df['probable_pitcher'].replace('', np.nan)
pitcher_df = pitcher_df.dropna(subset=['probable_pitcher'])
pitcher_df = pitcher_df.reset_index(drop=True)

pitcher_stats = pitcher_df.apply(lambda row: pd.Series(derived_cumulative_stats(row['probable_pitcher'], 'pitcher', history, end_date=datetime.strptime(row['game_date_gaw'], '%Y-%m-%d')-timedelta(days=1))), axis = 1)

pitcher_df = pd.merge(
    pitcher_stats,
    pitcher_df,
    how="left",
    left_index=True,
    right_index=True,
    suffixes=("_s", "_df"),
)

In order to properly pair batters with their pitcher we'll need to split home and away batters and re-merge those dataframes by matching them with their proper pitcher.

In [None]:
home_batters = df_detailed[df_detailed['bottom'] == 1]
away_batters = df_detailed[df_detailed['bottom'] == 0]

# home batters face the away team's probable pitcher
df_wpitching_p1 = pd.merge(
    pitcher_df,
    home_batters,
    how="right",
    left_on=['game_id', 'probable_pitcher'],
    right_on=['game_id', 'away_probable_pitcher'],
    suffixes=("_p", "_df"),
)

# away batters face the home team's probable pitcher
df_wpitching_p2 = pd.merge(
    pitcher_df,
    away_batters,
    how="right",
    left_on=['game_id', 'probable_pitcher'],
    right_on=['game_id', 'home_probable_pitcher'],
    suffixes=("_p", "_df"),
)
df_wpitching = pd.concat([df_wpitching_p1, df_wpitching_p2], ignore_index=True)
df_wpitching = df_wpitching.sort_values(['game_date_gaw_df', 'game_pk','home_name', 'inning', 'bottom', 'outs_when_up', 'at_bat_number', 'pitch_number']).reset_index(drop=True)

# Make the target value explicit and drop repeat occurrences of it, keeping one row per batter per game

df_wpitching['got_a_hit'] = df_wpitching.apply(lambda row: 1 if row['events'] == 1 else 0, axis=1)
df_wpitching = df_wpitching.sort_values('got_a_hit').drop_duplicates(subset=['game_id', 'batter'], keep='last').sort_values(['game_date_gaw_df', 'game_pk','home_name', 'inning', 'bottom', 'outs_when_up', 'at_bat_number', 'pitch_number']).reset_index(drop=True)
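The two ideas in this cell, pairing each batter with the opposing starter and collapsing plate appearances to one row per batter per game, can be sketched on a toy frame. The pitcher names and rows are invented, and the `groupby(...).max()` step is an alternative to the sort-and-drop-duplicates approach above, not a reproduction of it.

```python
import pandas as pd

# Hypothetical plate appearances: events is 1 for a hit, 0 otherwise.
pas = pd.DataFrame({
    "game_id": [1, 1, 1, 1, 1],
    "batter":  [10, 10, 20, 20, 10],
    "bottom":  [1, 1, 0, 0, 1],
    "events":  [0, 1, 0, 0, 0],
})
game = {"home_probable_pitcher": "Home Starter", "away_probable_pitcher": "Away Starter"}

# Home batters (bottom == 1) face the away starter, and vice versa.
pas["opposing_probable_pitcher"] = pas["bottom"].map(
    {1: game["away_probable_pitcher"], 0: game["home_probable_pitcher"]}
)

# One row per batter per game; the target is 1 if any plate appearance was a hit.
per_game = (
    pas.groupby(["game_id", "batter", "opposing_probable_pitcher"], as_index=False)["events"]
       .max()
       .rename(columns={"events": "got_a_hit"})
)
print(per_game.to_dict("records"))
```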
        

At this point it's safe to run our batters through the batter half of the derived_cumulative_stats function.

In [None]:
df_wpitching.columns.to_list()

In [None]:
batting_stats = df_wpitching.apply(lambda row: pd.Series(derived_cumulative_stats(row['batter'], 'batter', history, end_date=datetime.strptime(row['game_date_gaw_df'], '%Y-%m-%d')-timedelta(days=1))), axis = 1)


df_final = pd.merge(
    batting_stats,
    df_wpitching,
    how="right",
    left_index=True,
    right_index=True,
    suffixes=("_bs", "_df"),
)

In [None]:
df_final.head()

Looks like we have just a few things to clean up as well as a number of columns that we wouldn't know ahead of time.

In [None]:
df_final = df_final.rename(columns={'games_played_last_2_years_df': 'games_played_last_2_years_pitcher', 'games_played_last_2_years_bs': 'games_played_last_2_years_batter', 'bottom': 'home'})

df_final['local_datetime'] = pd.to_datetime(df_final['local_datetime'])
df_final['local_date'] = df_final['local_datetime'].apply(lambda x: x.toordinal())
df_final['local_datetime'] = df_final['local_datetime'].apply(lambda x: x.timestamp())

columns_to_drop = ['player_name',
 'game_type_dff', 'pitch_type',
 'release_speed',
 'release_pos_x',
 'release_pos_z',
 'pitcher',
 'events',
 'description',
 'spin_dir',
'spin_rate_deprecated',
 'break_angle_deprecated',
 'break_length_deprecated',
 'zone',
 'des',
  'p_throws',
'type',
 'hit_location',
 'bb_type',
 'balls',
 'strikes',
 'pfx_x',
 'pfx_z',
 'plate_x',
 'plate_z',
 'on_3b',
 'on_2b',
 'on_1b',
 'outs_when_up',
 'inning',
 'hc_x',
 'hc_y',
 'tfs_deprecated',
 'tfs_zulu_deprecated',
  'sv_id',
 'vx0',
 'vy0',
 'vz0',
 'ax',
 'ay',
 'az',
 'sz_top',
 'sz_bot',
 'hit_distance_sc',
 'launch_speed',
 'launch_angle',
 'effective_speed',
 'release_spin_rate',
 'release_extension',
 'pitcher.1',
 'fielder_2.1',
 'fielder_3',
 'fielder_4',
 'fielder_5',
 'fielder_6',
 'fielder_7',
 'fielder_8',
 'fielder_9',
 'release_pos_y',
 'estimated_ba_using_speedangle',
 'estimated_woba_using_speedangle',
 'woba_value',
 'woba_denom',
 'babip_value',
 'iso_value',
 'launch_speed_angle',
 'at_bat_number',
 'pitch_number',
 'pitch_name',
 'home_score_dff',
 'away_score_dff',
 'bat_score',
 'fld_score',
 'post_away_score',
 'post_home_score',
 'post_bat_score',
 'post_fld_score',
 'if_fielding_alignment',
 'of_fielding_alignment',
 'spin_axis',
 'delta_home_win_exp',
 'delta_run_exp',
 'game_pk',
 'game_id','game_date_gaw_p',
 'game_datetime',
 'game_date_gaw_df',
 'game_type_gaw',
 'datetime',
#  'datetimeStr',
 'game_date_dff','winning_team',
 'losing_team',
 'winning_pitcher',
 'losing_pitcher',
 'save_pitcher',
 'summary', 'home_probable_pitcher',
 'away_probable_pitcher',
 'home_pitcher_note',
 'away_pitcher_note',
 'away_score_gaw',
 'home_score_gaw',
 'current_inning',
 'inning_state',
 'venue_id',
 'status',
 'home_team', 'away_team', 'home_id', 'away_id']

# errors='ignore' skips any listed column that isn't present
df_final = df_final.drop(columns=columns_to_drop, errors='ignore')


# MODEL WORK

## Imports

In [None]:
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, plot_confusion_matrix, precision_score, plot_roc_curve, make_scorer
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold

import xgboost as xgb

import category_encoders as ce

First I wrote a function to return certain traditional statistical scores for my models.

In [None]:
def evaluate(estimator, X_train, X_val, y_train, y_val, holdout, roc_auc='proba', output = False):

    X_holdout = holdout.drop(['got_a_hit'], axis=1)
    y_holdout = holdout['got_a_hit']
    
    #     grab predictions
    train_preds = estimator.predict(X_train)
    val_preds = estimator.predict(X_val)
    holdout_preds = estimator.predict(X_holdout)
    
#     output needed for roc_auc score
    if roc_auc == 'skip':
        train_out = False
        val_out = False
        holdout_out = False
    elif roc_auc == 'dec': # not all classifiers have a decision function
        train_out = estimator.decision_function(X_train)
        val_out = estimator.decision_function(X_val)
        holdout_out = estimator.decision_function(X_holdout)
    elif roc_auc == 'proba':
        try:
            train_out = estimator.predict_proba(X_train)[:, 1]
            val_out = estimator.predict_proba(X_val)[:, 1]
            holdout_out = estimator.predict_proba(X_holdout)[:, 1]
        except AttributeError:
            train_out = estimator.predict(X_train)
            val_out = estimator.predict(X_val)
            holdout_out = estimator.predict(X_holdout)

    else:
        raise Exception("The value for roc_auc should be 'skip', 'dec', or 'proba'.")
    
    ac = accuracy_score(y_train, train_preds)
    f1 = f1_score(y_train, train_preds)
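    # Only compute ROC-AUC when scores are available (roc_auc='skip' sets train_out to False)
    ras = roc_auc_score(y_train, train_out) if roc_auc != 'skip' else None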
    pr = precision_score(y_train, train_preds)
    
    if output == True:
        print('Train Scores')
        print('------------')
        print(f'Accuracy: {ac}')
        print(f'F1 Score: {f1}')
        if type(train_out) == np.ndarray:
            print(f'ROC-AUC: {ras}')
        print(f'Precision: {pr}')

    ac = accuracy_score(y_val, val_preds)
    f1 = f1_score(y_val, val_preds)
    ras = roc_auc_score(y_val, val_out) if roc_auc != 'skip' else None
    pr = precision_score(y_val, val_preds)
    
    if output == True:
        print('-----------------------------------')
        print('Val Scores')
        print('-----------')
        print(f'Accuracy: {ac}')
        print(f'F1 Score: {f1}')
        if type(val_out) == np.ndarray:
            print(f'ROC-AUC: {ras}')
        print(f'Precision: {pr}')
    
    ac = accuracy_score(y_holdout, holdout_preds)
    f1 = f1_score(y_holdout, holdout_preds)
    ras = roc_auc_score(y_holdout, holdout_out) if roc_auc != 'skip' else None
    pr = precision_score(y_holdout, holdout_preds)
    
    if output == True:
        print('-----------------------------------')
        print('Holdout Scores')
        print('-----------')
        print(f'Accuracy: {ac}')
        print(f'F1 Score: {f1}')
        if type(holdout_out) == np.ndarray:
            print(f'ROC-AUC: {ras}')
        print(f'Precision: {pr}')

        print('\nVal Data')
        print('-----------')

        plot_confusion_matrix(estimator, X_val, y_val, values_format=',.5g')
        plt.show()
        
        print('Holdout Data')        
        print('-----------')

        plot_confusion_matrix(estimator, X_holdout, y_holdout, values_format=',.5g')
        plt.show()

Then I wrote one to measure hitting streaks, which is the most relevant metric for this contest.

In [None]:
def streak_checker(dataframe, output = False, metric = 'odds', ascending=False, combiner=None, timed=False, names=False):
    results = []

    try:
        if 'L/R_split' in metric and combiner == 'add':
            if timed == True:
                start = datetime.now()

                print(combiner, start)
            dataframe_copy = dataframe.copy()
            dataframe_copy['temp'] = dataframe.apply(lambda row: row['H/PA_against_R'] + row['against_lefties_H/PA'] if (row['pitcher_righty'] == 1 and row['batter_righty'] == 0) else (row['H/PA_against_L'] + row['against_righties_H/PA'] if (row['pitcher_righty'] == 0 and row['batter_righty'] == 1) else 0), axis=1)
            mms = MinMaxScaler()
            # Assign the scaled values as a flat array so they are not re-aligned on the index
            dataframe_copy['temp'] = mms.fit_transform(dataframe_copy[['temp']]).ravel()
            
            if timed == True:
                print(datetime.now() - start)
            dataframe = dataframe_copy
            metric='temp'
        elif 'L/R_split' in metric and combiner == 'multiply':
            if timed == True:
                start = datetime.now()

                print(combiner, start)
            dataframe_copy = dataframe.copy()
            dataframe_copy['temp'] = dataframe.apply(lambda row: row['H/PA_against_R'] * row['against_lefties_H/PA'] if (row['pitcher_righty'] == 1 and row['batter_righty'] == 0) else (row['H/PA_against_L'] * row['against_righties_H/PA'] if (row['pitcher_righty'] == 0 and row['batter_righty'] == 1) else 0), axis=1)
            mms = MinMaxScaler()
            # Assign the scaled values as a flat array so they are not re-aligned on the index
            dataframe_copy['temp'] = mms.fit_transform(dataframe_copy[['temp']]).ravel()
            
            if timed == True:
                print(datetime.now() - start)
            dataframe = dataframe_copy
            metric='temp'
        if type(metric) == list:
            dataframe_copy = dataframe.copy()
            if combiner == 'add':
                if timed == True:
                    start = datetime.now()

                    print(combiner, start)
                dataframe_copy['temp'] = dataframe.apply(lambda row: row[metric[0]] + row[metric[1]], axis=1)
                mms = MinMaxScaler()
                # Assign the scaled values as a flat array so they are not re-aligned on the index
                dataframe_copy['temp'] = mms.fit_transform(dataframe_copy[['temp']]).ravel()
                
                if timed == True:
                    print(datetime.now() - start)
            elif combiner == 'multiply':
                if timed == True:
                    start = datetime.now()

                    print(combiner, start)
                dataframe_copy['temp'] = dataframe.apply(lambda row: row[metric[0]] * row[metric[1]], axis=1)
                mms = MinMaxScaler()
                # Assign the scaled values as a flat array so they are not re-aligned on the index
                dataframe_copy['temp'] = mms.fit_transform(dataframe_copy[['temp']]).ravel()
                
                if timed == True:
                    print(datetime.now() - start)
            
            else:
                raise KeyError(f'{combiner} combiners have not been implemented yet')
            dataframe = dataframe_copy
            metric = 'temp'
        else:
            pass
    except KeyError:
        raise KeyError("metric must be chosen from dataframe's columns")
    
    date_list = np.sort(dataframe['local_date'].unique())
    
    for day in date_list:
        best_bet = dataframe[dataframe['local_date'] == day].sort_values(metric, ascending=ascending).iloc[0]
        
        odds = best_bet[metric]
        date = datetime.fromordinal(best_bet['local_date']).strftime("%b %d %Y")
        
        
        if best_bet['home'] == 0:
            team = best_bet['away_name']
        else:
            team = best_bet['home_name']
    #     team = statsapi.lookup_team(player['currentTeam']['id'])[0]['name']
    #     player_name = player['fullName']
        actual_result = best_bet['got_a_hit']
        results.append(actual_result)
        if output == True:
            if day > date_list[-20] and names==True:
                # Look up the batter's name from their MLBAM id
                player_info = pybaseball.playerid_reverse_lookup([best_bet['batter']])
                if player_info.empty:
                    # No match in the lookup table: fall back to the raw id
                    player = str(best_bet['batter'])
                else:
                    # Use the first match if the lookup returns more than one row
                    player = ' '.join(player_info.iloc[0][['name_first', 'name_last']])
                print(date, player, round(odds, 2), actual_result)
    longest = 0
    current = 0
    for num in results:
        if num == 1:
            current += 1
            longest = max(longest, current)
        else:
            longest = max(longest, current)
            current = 0
    if output == True:
        print(f'Longest streak in set ({len(results)} days): ', longest)
        print(f'Total correct guesses in set ({len(results)} days): ', sum(results), sum(results)/len(results))
#     return (dataframe, best_bet)
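
Stripped of the metric-combining options, the core of this check is small: take the top-ranked batter for each day, then find the longest run of hits among those picks. The following is a minimal sketch on made-up data, assuming the same `local_date`, `odds`, and `got_a_hit` column names; `daily_picks` and `longest_streak` are hypothetical helper names that do not appear elsewhere in this notebook.

```python
import pandas as pd

def daily_picks(df, metric="odds"):
    """Return one row per day: the row with the highest value of `metric`."""
    top_idx = df.groupby("local_date")[metric].idxmax()
    return df.loc[top_idx].sort_values("local_date")

def longest_streak(results):
    """Length of the longest run of consecutive 1s in `results`."""
    longest = current = 0
    for hit in results:
        current = current + 1 if hit == 1 else 0
        longest = max(longest, current)
    return longest

# Toy data (made up for illustration): two candidate batters per day.
toy = pd.DataFrame({
    "local_date": [1, 1, 2, 2, 3, 3, 4, 4],
    "odds":       [0.9, 0.4, 0.3, 0.8, 0.7, 0.6, 0.2, 0.5],
    "got_a_hit":  [1, 0, 0, 1, 0, 1, 1, 1],
})

picks = daily_picks(toy)
streak = longest_streak(picks["got_a_hit"])
print(picks[["local_date", "odds", "got_a_hit"]])
print("Longest streak:", streak)
```

Here the daily picks hit on days 1, 2, and 4 but miss on day 3, so the longest streak is 2.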

And finally, a function to streamline running the different types of models. The rough estimate of how long a model will take to run is currently commented out.

In [None]:
def run_model(clf, X_train, X_val, y_train, y_val, holdout):
#     probability = True
    
#     clf.fit(X_train.head(100), y_train.head(100))

#     evaluate(clf, X_train.head(100), X_val.head(100), y_train.head(100), y_val.head(100), holdout);

#     runtime = datetime.now() - start

#     if runtime > timedelta(seconds=5):
#         print(f'Model would take {runtime.total_seconds()*200/60} minutes to run.')
#     else:
#         print('Model should take between {:.0f} and {:.0f} minutes to run and finish by {}'.format(runtime.total_seconds()*20/60, runtime.total_seconds()*200/60, (datetime.now()+(timedelta(seconds=(runtime.total_seconds()*200)))).strftime("%H:%M")))
    clf.fit(X_train, y_train)
    evaluate(clf, X_train, X_val, y_train, y_val, holdout, output=True)
    plot_roc_curve(clf, X_val, y_val)
    plt.show()

    try:
        val_df_odds = pd.Series(clf.predict_proba(X_val)[:, 1], name='odds')
        holdout_df_odds = pd.Series(clf.predict_proba(holdout.drop(['got_a_hit'], axis=1))[:, 1], name='odds')
    except AttributeError:
        probability = False
        val_df_odds = pd.Series(clf.predict(X_val), name= 'odds')
        holdout_df_odds = pd.Series(clf.predict(holdout.drop(['got_a_hit'], axis=1)), name='odds')

    val_df = X_val.assign(got_a_hit = y_val).reset_index(drop=True)
    val_df = val_df.assign(odds=val_df_odds)
    holdout = holdout.reset_index(drop=True)
    holdout = holdout.assign(odds=holdout_df_odds)
    if set(val_df_odds.value_counts().index) == set([1, 0]):
        print(str(clf.get_params()['steps'][1][1]), "doesn't return probabilities, so no streak results will be returned based on this alone.")
        pass
    else:
        print('-----------------------------------')

        print('Val data')
        print('-----------')

        streak_checker(val_df, output=True)
        print('Holdout data')
        print('-----------')

        streak_checker(holdout, output=True)

    print('Odds multiplied by PA/G_batter')
    print('-----------')

    print('Val data')
    print('-----------')
    streak_checker(val_df, output=True, metric=['PA/G_batter', 'odds'], combiner='multiply')
    print('Holdout data')
    print('-----------')
    streak_checker(holdout, output=True, metric=['PA/G_batter', 'odds'], combiner='multiply')

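    # NOTE: streak_checker's 'L/R_split' branch ignores the other entries in `metric`,
    # so 'odds' is not actually combined with the split here.
    print('L/R split (added)')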
    print('-----------')

    print('Val data')
    print('-----------')
    streak_checker(val_df, output=True, metric=['L/R_split', 'odds'], combiner='add')
    print('Holdout data')
    print('-----------')
    streak_checker(holdout, output=True, metric=['L/R_split', 'odds'], combiner='add')

Also, I made a custom train_test_split function (it replaces the scikit-learn function of the same name imported above). I realized a traditional random split wouldn't be as effective for this data as a split based on date. With a random split, the best candidate for a hit on any given day could land in either set, altering my streak results for no clear analytical gain.

In [None]:
def train_test_split(df, test_size=.25):
    
    '''My simplified train_test_split which just cuts the dataframe by date, in order to accurately evaluate streak results, to roughly the test size requested'''
    
    sample_size = len(df)
    train_size = 1 - test_size
    train_size = round(train_size * sample_size)
    df = df.sort_values(['local_date']).reset_index(drop=True)
    
    date_cutoff = df.iloc[train_size]['local_date']
    train = df[df['local_date'] <= date_cutoff]
    val = df[df['local_date'] > date_cutoff]
    
    X_train = train.drop(['got_a_hit'], axis=1)
    y_train = train['got_a_hit']
    
    X_val = val.drop(['got_a_hit'], axis=1)
    y_val = val['got_a_hit']
    
    return(X_train, X_val, y_train, y_val)
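
A quick way to see what the date-based cut buys: no calendar day ends up on both sides of the split. This is a minimal sketch on made-up data; `split_by_date` is a hypothetical stand-in that mirrors the logic above but returns whole frames rather than separate X and y.

```python
import pandas as pd

def split_by_date(df, test_size=0.25, date_col="local_date"):
    """Cut a frame at a date so that no day appears in both halves."""
    df = df.sort_values(date_col).reset_index(drop=True)
    cutoff_row = round((1 - test_size) * len(df))
    cutoff_date = df.iloc[cutoff_row][date_col]
    train = df[df[date_col] <= cutoff_date]
    val = df[df[date_col] > cutoff_date]
    return train, val

# Toy data: 8 days with 5 rows each (values are made up).
toy = pd.DataFrame({
    "local_date": [day for day in range(1, 9) for _ in range(5)],
    "got_a_hit": [0, 1, 0, 0, 1] * 8,
})

train, val = split_by_date(toy)
shared_days = set(train["local_date"]) & set(val["local_date"])
print(len(train), len(val), shared_days)
```

Because the whole cutoff day goes into the training set, the validation share can come out smaller than requested: here 5 of 40 rows rather than 10.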

For this section, I'll be importing my previously organized sample and holdout sets.

In [None]:
sample = pd.read_csv('sample', index_col=1)
holdout = pd.read_csv('holdout', index_col=1)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(sample)

In [None]:
sample

In [None]:
X_train

I decided to fill any missing numerical values with the feature's median, as this seemed the best way to minimize undue bias. Missing values in the one-hot-encoded features are replaced with "Unknown." Finally, for the remaining categorical features, I decided on target encoding, as it seemed most sensible to align the values with their relationship to the target.

In [None]:
num_cols = [] # all numeric columns
cols_to_ohe = [] # doubleheader, conditions, stand
cols_to_targ = [] # all other object columns

for c in X_train.columns:
    if X_train[c].dtype in ['float64', 'int64'] and c not in ['game_num', 'batter', 'fielder_2', 'umpire']:
        num_cols.append(c)
    elif len(X_train[c].unique()) < 10:
        cols_to_ohe.append(c)
    else:
        cols_to_targ.append(c)

In [None]:
num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

ohe_transformer = Pipeline(steps=[
    ('ohe_imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('ohencoder', OneHotEncoder(handle_unknown='ignore'))
])

target_encoder = Pipeline(steps=[
    ('freq_enc', ce.target_encoder.TargetEncoder()),
    ('freq_imputer', SimpleImputer(strategy='constant', fill_value=0))
    
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('ohe', ohe_transformer, cols_to_ohe),
        ('target', target_encoder, cols_to_targ)
    ])
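
To illustrate the idea behind target encoding, here is a hand-rolled sketch in plain pandas. It is not what `ce.target_encoder.TargetEncoder` does internally (that encoder also applies smoothing), and the `venue` column and its values are made up for illustration.

```python
import pandas as pd

# Toy training data (made up): a categorical feature and the binary target.
train = pd.DataFrame({
    "venue": ["A", "A", "A", "B", "B", "C"],
    "got_a_hit": [1, 1, 0, 0, 0, 1],
})
val = pd.DataFrame({"venue": ["A", "B", "D"]})  # "D" never appears in training

# Learn the per-category hit rate from the training rows only.
category_means = train.groupby("venue")["got_a_hit"].mean()
global_mean = train["got_a_hit"].mean()

# Apply to validation rows; unseen categories fall back to the overall rate.
val["venue_encoded"] = val["venue"].map(category_means).fillna(global_mean)
print(val)
```

Each category is replaced by the share of training rows in that category that resulted in a hit, which is why the encoder has to be fit on the training split alone.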

# Baseline - Simple models

For this project it wouldn't have been useful to simply normalize the number of hits against non-hits. My goal isn't simply to identify whether a situation will result in a hit but rather which batter is the most likely to get a hit on a given day. So I devised a number of simple models to compare my more advanced models against.
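
To put the daily-pick goal in perspective, here is a back-of-the-envelope calculation. It assumes (unrealistically) that every daily pick succeeds independently with the same probability p, so the chance that a given run of 57 picks all succeed is p^57. The values of p below are illustrative, not results from this notebook, and `chance_of_streak` is a hypothetical helper.

```python
STREAK_LENGTH = 57  # one more than DiMaggio's 56

def chance_of_streak(p, length=STREAK_LENGTH):
    """Probability that `length` independent picks all hit, each with probability p."""
    return p ** length

for p in (0.70, 0.80, 0.90):
    print(f"p = {p:.2f} -> chance of {STREAK_LENGTH} straight hits: {chance_of_streak(p):.2e}")
```

Small gains in per-day accuracy compound over 57 days, which is why the baselines below are judged on their daily picks rather than on overall classification scores.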

Pick player with the highest Plate Appearances per Game over the last 2 years

In [None]:
streak_checker(sample, output=True, metric='PA/G_batter')
streak_checker(holdout, output=True, metric='PA/G_batter')

Observations: this trends towards a small group of players, except when a player makes their MLB debut towards the top of the lineup (Wander Franco, Greg Deichmann). Sample size is an issue; it could be addressed by either looking at games played or excluding PA/G above 4.9.
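
The second option could look something like the sketch below, which drops candidates above the PA/G ceiling before ranking. The rows are made up, and `filter_small_samples` is a hypothetical helper rather than something used elsewhere in this notebook.

```python
import pandas as pd

def filter_small_samples(df, max_pa_per_game=4.9):
    """Drop rows whose PA/G is implausibly high, a sign of very few games played."""
    return df[df["PA/G_batter"] <= max_pa_per_game]

toy = pd.DataFrame({
    "batter": [101, 102, 103],       # made-up player ids
    "PA/G_batter": [4.6, 5.0, 4.3],  # 5.0 stands in for a debut with one game
    "local_date": [1, 1, 1],
})

filtered = filter_small_samples(toy)
best = filtered.sort_values("PA/G_batter", ascending=False).iloc[0]
print(int(best["batter"]))
```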



Pick player with highest launch speed

In [None]:
streak_checker(sample, output=True, metric='avg_launch_speed')
streak_checker(holdout, output=True, metric='avg_launch_speed')

The metrics point to this being mostly useless, which is backed up by the fact that it occasionally chooses pitchers (Michael Pineda, Touki Toussaint) when it's not fawning over Josh Donaldson. It could be worth excluding pitchers as a class, though that would be hard to implement with the current dataset, and two-way players would need to be exempted from the filter.

In [None]:
streak_checker(sample, output=True, metric=['avg_launch_angle', 'avg_launch_speed'], combiner='add')
streak_checker(holdout, output=True, metric=['avg_launch_angle', 'avg_launch_speed'], combiner='add')

Exacerbates the bias towards small sample sizes, leading part-time players to dominate selection rather than proven hitters.

It would also be worth checking lefty/righty splits.

In [None]:
streak_checker(sample, output=True, metric='L/R_split', combiner='add')

print("--------------------------------------------")

streak_checker(holdout, output=True, metric='L/R_split', combiner='add')

print("--------------------------------------------")


streak_checker(sample, output=True, metric='L/R_split', combiner='multiply')

print("--------------------------------------------")

streak_checker(holdout, output=True, metric='L/R_split', combiner='multiply')

Nothing stands out in these groups. Seems to be a mix of journeymen and all-stars.

# Model 1 - Decision Tree

In [None]:
start = datetime.now()
dt = DecisionTreeClassifier()

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('tree', dt)
])

run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 2: Logistic Regression

In [None]:
start = datetime.now()

logreg = LogisticRegression(solver='sag')

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('logreg', logreg)
])

run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 3: LinearSVC

SVC is too slow on this dataset, and LinearSVC doesn't supply probabilities, so I need a secondary way to select a single batter per day.

In [None]:
start = datetime.now()

Linearsvc = LinearSVC()


clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('linearsvc', Linearsvc)
])


run_model(clf, X_train, X_val, y_train, y_val, holdout)
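
Two possible ways to get a ranking out of LinearSVC, shown as a sketch on synthetic data rather than the notebook's features: rank by the raw `decision_function` margin, or wrap the model in scikit-learn's `CalibratedClassifierCV` so that it exposes `predict_proba`. Neither is used in the run above; variable names here are illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the preprocessed features (not the notebook's data).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_fit, X_new, y_fit = X[:300], X[300:], y[:300]

# Option 1: use the signed distance from the separating hyperplane as a ranking score.
svc = LinearSVC().fit(X_fit, y_fit)
margins = svc.decision_function(X_new)
top_pick_by_margin = margins.argmax()

# Option 2: calibrate the margins into probabilities with cross-validated sigmoid scaling.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_fit, y_fit)
hit_probability = calibrated.predict_proba(X_new)[:, 1]

print(top_pick_by_margin, hit_probability[:5].round(3))
```

The margin is enough if all that is needed is the single best candidate per day; calibration matters only if the scores are going to be combined with other quantities as probabilities.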

# Model 4: KNN - 5 neighbors

In [None]:
start = datetime.now()

knn = KNeighborsClassifier(n_jobs=-1)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', knn)
])


run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 4: KNN - 3 neighbors

In [None]:
start = datetime.now()

knn = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', knn)
])


run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 5: Random Forest

In [None]:
start = datetime.now()

rfc = RandomForestClassifier(n_jobs=-1)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('rfc', rfc)
])

run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 6: AdaBoost


In [None]:
start = datetime.now()

ada = AdaBoostClassifier(random_state=42)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('ada', ada)
])

run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 7: GradientBoost


In [None]:
start = datetime.now()

gbm = GradientBoostingClassifier()
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('gbm', gbm)
])

run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 8: XGBoost


In [None]:
start = datetime.now()

xbg = xgb.XGBClassifier(random_state=42)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('xbg', xbg)
])

run_model(clf, X_train, X_val, y_train, y_val, holdout)

# Model 9: Random Forest with Grid Search


In [None]:
start = datetime.now()

parameters = {
    'min_samples_split': [3, 5, 100],
    'n_estimators': [100, 300],
    'max_depth': [3, 5, 15, 25],
}

estimator = RandomForestClassifier(random_state=42)

grid = GridSearchCV(estimator, parameters, n_jobs=-1, cv=5, scoring = 'precision')

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('grid', grid)
])

clf.fit(X_train, y_train)
print(datetime.now()-start)

# set_params takes keyword arguments, so the best-parameter dict must be unpacked
estimator = RandomForestClassifier(n_jobs=-1).set_params(**grid.best_params_)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('grid', estimator)
])

# clf.fit(X_train, y_train)
run_model(clf, X_train, X_val, y_train, y_val, holdout)
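
An alternative layout, shown as a sketch: wrap the whole pipeline in `GridSearchCV` instead of placing the grid inside the pipeline. The preprocessing is then refit within each fold, and the tuned pipeline is available directly as `best_estimator_`. The data here is synthetic, the grid is deliberately tiny, and `MinMaxScaler` stands in for the full preprocessor.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; MinMaxScaler stands in for the full preprocessor.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline(steps=[
    ("preprocessor", MinMaxScaler()),
    ("rfc", RandomForestClassifier(random_state=42)),
])

# Keys are "<step name>__<parameter>"; this grid is deliberately tiny.
param_grid = {
    "rfc__max_depth": [3, 5],
    "rfc__n_estimators": [20, 50],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="precision")
search.fit(X, y)

print(search.best_params_)
best_pipeline = search.best_estimator_  # already refit on all of X, y
```

With this layout there is no need to rebuild a second estimator from `best_params_`, since `GridSearchCV` refits the best configuration by default.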