# Feature Generation For Regression

In previous notebooks, we generated features using p1_winner as the label for classification (win/loss).

While this column works for classification, it is a binary win/loss flag, so it does not make a good label for regression.

In this notebook, we are going to add some columns to our existing data so that we can have labels to use for regression.

Columns that we will create in this notebook:
* minutes - how long a match lasts
* p1/p2 sets won
* p1/p2 games won
* set_diff = player 1 sets won differential = p1 sets won - p2 sets won
* games_diff = player 1 games won differential = p1 games won - p2 games won
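
To make the set and games differentials concrete, here is a minimal sketch of how they could be derived from a score string. The notebook's real parser is `su.process_scores`, whose implementation is not shown here. `parse_score` and `SET_PATTERN` are hypothetical names, and the score format (`"6-4 3-6 7-6(5)"`, winner's games first) is an assumption.

```python
import re

# Hypothetical helper for illustration only; not the notebook's su.process_scores.
# Assumes each set is written "W-L" with the match winner's games first,
# optionally followed by a tiebreak score in parentheses.
SET_PATTERN = re.compile(r"^(\d+)-(\d+)(?:\(\d+\))?$")


def parse_score(score):
    """Return (winner_sets, winner_games, loser_sets, loser_games, set_diff, games_diff)."""
    w_sets = w_games = l_sets = l_games = 0
    for token in score.split():
        match = SET_PATTERN.match(token)
        if match is None:
            # Skip tokens that are not plain set scores, e.g. "RET" or "W/O"
            continue
        w, l = int(match.group(1)), int(match.group(2))
        w_games += w
        l_games += l
        # Whoever has more games in the set is credited with the set
        if w > l:
            w_sets += 1
        elif l > w:
            l_sets += 1
    return w_sets, w_games, l_sets, l_games, w_sets - l_sets, w_games - l_games


print(parse_score("6-4 3-6 7-6(5)"))
```

Note that this sketch credits a set to whoever leads it, so an unfinished set in a retirement would need separate handling.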

Some of the more extensive history datasets take almost 12 hours to create. To save time, instead of re-creating the datasets from the original data, we will compute these columns from the pre-processed file and concatenate them to our existing data files.

From our previous notebook, we know that each line in the pre-processed file produces two entries in the feature files: in the first entry the winner is in the player 1 position, and in the second entry the winner is in the player 2 position. We will use this ordering to create the new columns.
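
The row ordering described above can be illustrated on a toy frame. This is only a sketch: `pre_toy` and `mapping` are hypothetical names and the match labels are made up.

```python
import pandas as pd

# Toy stand-in for the pre-processed file: one row per match (values are made up)
pre_toy = pd.DataFrame({"match": ["A", "B", "C"]})

# Match i contributes feature row 2*i (winner as player 1)
# and feature row 2*i + 1 (winner as player 2)
mapping = pd.DataFrame({
    "feature_row": range(2 * len(pre_toy)),
    "match": pre_toy["match"].repeat(2).to_numpy(),
    "winner_position": ["p1", "p2"] * len(pre_toy),
})
print(mapping)
```

Because the new columns are attached by position rather than by key, this ordering has to hold exactly for the concatenation to line up.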

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import json
import random
import os
import sys
import re

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from util.model_util import RSTATE
random.seed(RSTATE)

import util.score_util as su


%matplotlib inline
sns.set()

In [2]:
# Constants
START_YEAR = 1985
END_YEAR = 2019

DATASET_DIR = '../datasets'
MODEL_DIR = '../models'

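# Sometimes I run these notebooks from the command line. The IPYNB_DEBUG environment variable tells us whether we are in DEBUG mode.
# Environment values are strings, so bool("False") would be True; treat empty, "0" and "false" as off instead.
DEBUG = os.environ.get("IPYNB_DEBUG", "").strip().lower() not in ("", "0", "false")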

if DEBUG:
    PREPROCESSED_FILE = f'{DATASET_DIR}/test-preprocessed.csv'
    FEATURE_FILE_RAW_DIFF_OHE = f'{DATASET_DIR}/atp_matches_{START_YEAR}-{END_YEAR}_features_test-raw_diff-ohe.csv'
    FEATURE_FILE_RAW_DIFF_OHE_STATS = f'{DATASET_DIR}/atp_matches_{START_YEAR}-{END_YEAR}_features_test-raw_diff-ohe-history-matchup-stats.csv'
else:
    # this is the file we generated from our pre-processing notebook
    PREPROCESSED_FILE = f'{DATASET_DIR}/atp_matches_{START_YEAR}-{END_YEAR}_preprocessed.csv'
    FEATURE_FILE_RAW_DIFF_OHE = f'{DATASET_DIR}/atp_matches_{START_YEAR}-{END_YEAR}_features-raw_diff-ohe.csv'
    FEATURE_FILE_RAW_DIFF_OHE_STATS = f'{DATASET_DIR}/atp_matches_{START_YEAR}-{END_YEAR}_features-raw_diff-ohe-history-matchup-stats.csv'

# list of files to append this data to
APPEND_FILE_LIST = [FEATURE_FILE_RAW_DIFF_OHE_STATS]

In [3]:
# each pre-processed row becomes two feature rows: winner as p1 first, then winner as p2
pre = pd.read_csv(PREPROCESSED_FILE)

In [4]:
def get_lr_columns(index, row):
    """
    Get score information from the row
    
    Information returned will be in dict format so you can append it to dataframe
    
    :param index - index of the current row (not used)
    :param row - row series object
    :return: player 1 as winner data (dict), player 1 as loser data (dict)
    """
    winner_sets_won, winner_games_won, loser_sets_won, loser_games_won, set_diff, games_diff = su.process_scores(row.score)
    d1 = {
        "minutes": row.minutes,
        "p1_sets_won": winner_sets_won,
        "p1_games_won": winner_games_won,
        "p2_set_won": loser_sets_won,
        "p2_games_won": loser_games_won,
        "set_diff": set_diff,
        "games_diff": games_diff
    }
    d2 = {
        "minutes": row.minutes,
        "p1_sets_won": loser_sets_won,
        "p1_games_won": loser_games_won,
        "p2_set_won": winner_sets_won,
        "p2_games_won": winner_games_won,
        "set_diff": -1 * set_diff,
        "games_diff": -1 * games_diff
    }
    return d1, d2



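# Collect the rows in a list and build the DataFrame once at the end.
# DataFrame.append was removed in pandas 2.0 and copied the whole frame on every call.
rows = []

for idx, row in pre.iterrows():
    if idx % 10000 == 0:
        print(f'Processing index: {idx}')
    
    d1, d2 = get_lr_columns(idx, row)
    rows.append(d1)
    rows.append(d2)

# sort the columns alphabetically
df = pd.DataFrame(rows).sort_index(axis=1)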
    
# change all columns to int
type_dict = { col: np.int32 for col in df.columns }
df = df.astype(type_dict)

df.info()

Processing index: 0
Processing index: 10000
Processing index: 20000
Processing index: 30000
Processing index: 40000
Processing index: 50000
Processing index: 60000
Processing index: 70000
Processing index: 80000
Processing index: 90000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199910 entries, 0 to 199909
Data columns (total 7 columns):
games_diff      199910 non-null int32
minutes         199910 non-null int32
p1_games_won    199910 non-null int32
p1_sets_won     199910 non-null int32
p2_games_won    199910 non-null int32
p2_set_won      199910 non-null int32
set_diff        199910 non-null int32
dtypes: int32(7)
memory usage: 5.3 MB
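
Since every match contributes a winner-as-p1 row followed by a winner-as-p2 row, the two rows should mirror each other. A minimal sketch of a check for this is shown on toy data with made-up values; `is_mirrored` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy frame in the same layout as df: even rows = winner as p1, odd rows = winner as p2
toy = pd.DataFrame({
    "p1_sets_won": [2, 1, 2, 0],
    "p2_set_won":  [1, 2, 0, 2],
    "set_diff":    [1, -1, 2, -2],
    "games_diff":  [0, 0, 7, -7],
    "minutes":     [150, 150, 80, 80],
})


def is_mirrored(frame):
    """Check that each odd row is the mirror image of the even row before it."""
    even = frame.iloc[0::2].reset_index(drop=True)
    odd = frame.iloc[1::2].reset_index(drop=True)
    return bool(
        len(even) == len(odd)
        and (even["set_diff"] == -odd["set_diff"]).all()
        and (even["games_diff"] == -odd["games_diff"]).all()
        and (even["p1_sets_won"] == odd["p2_set_won"]).all()
        and (even["minutes"] == odd["minutes"]).all()
    )


print(is_mirrored(toy))
```

The same function could be called on `df` before the files are written, as a cheap guard against an off-by-one in the row order.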


In [5]:

for file in APPEND_FILE_LIST:
    features = pd.read_csv(file)
    # check that the row counts match before concatenating
    assert features.shape[0] == df.shape[0], "shape mismatch"

    new_file = pd.concat([features, df], axis=1)
    new_file = new_file.reindex(sorted(new_file.columns), axis=1)
    new_file_name = file.replace(".csv", "-lr.csv")
    
    print(f'Saving file {new_file_name}')
    new_file.to_csv(new_file_name, index=False)


Saving file ../datasets/atp_matches_1985-2019_features-raw_diff-ohe-history-matchup-stats-lr.csv
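
The row-count assertion only confirms that the two frames have the same length, not that their rows line up. One possible extra check, sketched here on toy data, compares the sign of `set_diff` with the existing `p1_winner` label. It assumes `p1_winner` is encoded as 1 when player 1 won and 0 otherwise. `count_label_mismatches` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the concatenated file (values are made up).
# Assumption: p1_winner is 1 when player 1 won and 0 otherwise.
toy = pd.DataFrame({
    "p1_winner": [1, 0, 1, 0],
    "set_diff":  [1, -1, 2, -2],
})


def count_label_mismatches(frame):
    """Count rows where the sign of set_diff disagrees with p1_winner."""
    p1_won_more_sets = frame["set_diff"] > 0
    p1_is_winner = frame["p1_winner"] == 1
    return int((p1_won_more_sets != p1_is_winner).sum())


print(count_label_mismatches(toy))
```

On real data a small number of mismatches may be legitimate, for example when a match ends in a retirement, so the count is better read as a diagnostic than as a hard assertion.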
