### Latent Semantic Analysis

#### Natural Language Processing ~ A family of techniques used to derive meaning from data

#### Document ~ A collection of words--the or "rows" of our dataset

#### Body ~ A collection of documents--our entire dataset

#### Dictionary ~ The set of all words that appear in at least one document in our body 

#### Topic ~ A collection of words that co-occur 

#### Latent ~ Features that are "hidden" in the data

Unsupervised

aim - to create representations of the text data in terms of these topics or latent features

2 steps
1) Document Term Matrix
2) Singular Value Decomposition ~ performed on Document Term Matrix

You can represent documents as vectors (i.e. you can plot data points on a chart with text data if you turn it into words)

Output - topic encoded data

source: https://www.youtube.com/watch?v=hB51kkus-Rc


Rows = The entire sentence (Document). 
Coumns = 1 if document contains the column word (contained in the set of all words--Dictionary)

Then perform SVD (same idea as PCA)

The latent features represent topics or collections of words that co-occur

In [145]:
import pandas as pd
import numpy as np
from datetime import *
from datetime import date
from dateutil.relativedelta import *
import calendar
from datetime import datetime as dt

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn import decomposition
from sklearn.model_selection import KFold
from sklearn.decomposition import TruncatedSVD
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.svm import SVR

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

import hyperopt
import colorama
import random
from sklearn.metrics import accuracy_score, log_loss
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

from betacal import BetaCalibration
from catboost import CatBoostRegressor, Pool
import catboost
from catboost.utils import eval_metric

from betacal import BetaCalibration



In [146]:
df = pd.read_csv("~/Documents/NBA/nba-text-machine-learning/ringer_data_clean.csv")
df = df.drop("Unnamed: 0",axis=1)

In [147]:
drop_list = ["standing_reach", "school"]
df = df.drop(drop_list, axis=1)
df

Unnamed: 0,age,board_rank,draft_year,height,name,negative_report,position,positive_report,school_year,weight,wingspan,rookie_BPM,rookie_VORP,rookie_minutes
0,18.7,1.0,2020,77.0,KILLIAN HAYES,Left-hand dominant: He might as well tie his r...,guard,Playmaking is his best skill. He can whip pass...,,215,80.25,,,
1,18.6,2.0,2020,79.0,LAMELO BALL,"Ball is a great passer, but he can’t be classi...",guard,Ambidextrous passer with pinpoint accuracy and...,,190,82.0,,,
2,18.7,3.0,2020,77.0,ANTHONY EDWARDS,Not a pure shooter; he settles for jumpers eve...,guard,Powerful driving to the rim; when he initiates...,freshman,225,81.0,,,
3,20.1,4.0,2020,77.0,TYRESE HALIBURTON,Lack of athleticism and burst limits his upsid...,guard,Always in control; he lacks lightning speed or...,sophomore,175,84.0,,,
4,19.6,6.0,2020,79.0,DEVIN VASSELL,Lacks burst to beat defenders off the dribble ...,wing,Elite team defender who will immediately help ...,sophomore,194,82.0,,,
5,19.0,7.0,2020,85.0,JAMES WISEMAN,Poor shot selection in high school; he played ...,big,Elite measurables with long arms and a strong ...,freshman,237,90.0,,,
6,19.5,8.0,2020,74.0,TYRELL TERRY,Developing a stepback and side-dribble 3 is th...,guard,"Elite shooter with a quick, high release. He c...",freshman,160,,,,
7,19.4,9.0,2020,75.0,TYRESE MAXEY,Lacks top-end quickness and acceleration. He’s...,guard,Clever finisher at the rim who can score from ...,freshman,198,78.0,,,
8,19.2,10.0,2020,78.0,ISAAC OKORO,Stiff shooter with clunky mechanics—defenses a...,wing,"Great finisher who delivers through contact, d...",freshman,225,81.0,,,
9,22.1,11.0,2020,81.0,OBI TOPPIN,Brutal pick-and-roll defender who displays lit...,big,Glides through the air for ferocious dunks; he...,,220,83.0,,,


In [148]:
zion_index = df[df["name"] == "ZION WILLIAMSON"].index[0]
print("Zion index", zion_index)

Zion index 29


In [4]:
vectorizer = CountVectorizer(stop_words="english")



In [5]:
"""
cv = CountVectorizer(stop_words="english")
neg_matrix = cv.fit_transform(df["negative_report"]).toarray()
neg_featurenames = cv.get_feature_names()

pos_matrix = cv.fit_transform(df["positive_report"]).toarray()
pos_featurenames = cv.get_feature_names()
"""

#Tf-idf Transformation
tfidf = TfidfVectorizer(min_df=2, stop_words="english")
tfidf_neg_matrix = tfidf.fit_transform(df["negative_report"]).toarray()
tfidf_pos_matrix = tfidf.fit_transform(df["positive_report"]).toarray()


In [6]:
len(tfidf.vocabulary_)

1064

In [7]:
svd = TruncatedSVD(n_components=10, n_iter=7, random_state=7)
svd.fit(tfidf_neg_matrix)
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
print(svd.singular_values_)
latent_neg_features_all = svd.transform(tfidf_neg_matrix)

svd.fit(tfidf_pos_matrix)
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
print(svd.singular_values_)
latent_pos_features_all = svd.transform(tfidf_pos_matrix)


[0.00248166 0.0155376  0.01441446 0.01280462 0.01245127 0.01194041
 0.01165857 0.01156332 0.01102901 0.01080761]
0.11468852691615256
[3.99197065 1.73436401 1.67043824 1.57430754 1.55228453 1.52015533
 1.50201364 1.49629849 1.46089512 1.44635923]
[0.00238914 0.01640285 0.01500676 0.01352422 0.01244276 0.01227209
 0.01144472 0.01099548 0.01082953 0.01060682]
0.11591435713596969
[4.65147129 1.75588571 1.67936909 1.59364783 1.52871071 1.51811092
 1.466029   1.4370124  1.42624099 1.41139772]


In [8]:
len(latent_features)

NameError: name 'latent_features' is not defined

In [None]:
df["Rookie_VORP"][zion_index:].fillna(0).head()

In [73]:
df

Unnamed: 0,age,board_rank,draft_year,height,name,negative_report,position,positive_report,school_year,weight,wingspan,Rookie_BPM,Rookie_VORP
0,18.7,1.0,2020,77.0,KILLIAN HAYES,Left-hand dominant: He might as well tie his r...,Guard,Playmaking is his best skill. He can whip pass...,,215,80.25,,
1,18.6,2.0,2020,79.0,LAMELO BALL,"Ball is a great passer, but he can’t be classi...",Guard,Ambidextrous passer with pinpoint accuracy and...,,190,82.0,,
2,18.7,3.0,2020,77.0,ANTHONY EDWARDS,Not a pure shooter; he settles for jumpers eve...,Guard,Powerful driving to the rim; when he initiates...,freshman,225,81.0,,
3,20.1,4.0,2020,77.0,TYRESE HALIBURTON,Lack of athleticism and burst limits his upsid...,Guard,Always in control; he lacks lightning speed or...,sophomore,175,84.0,,
4,19.6,6.0,2020,79.0,DEVIN VASSELL,Lacks burst to beat defenders off the dribble ...,Wing,Elite team defender who will immediately help ...,sophomore,194,82.0,,
5,19.0,7.0,2020,85.0,JAMES WISEMAN,Poor shot selection in high school; he played ...,Big,Elite measurables with long arms and a strong ...,freshman,237,90.0,,
6,19.5,8.0,2020,74.0,TYRELL TERRY,Developing a stepback and side-dribble 3 is th...,Guard,"Elite shooter with a quick, high release. He c...",freshman,160,,,
7,19.4,9.0,2020,75.0,TYRESE MAXEY,Lacks top-end quickness and acceleration. He’s...,Guard,Clever finisher at the rim who can score from ...,freshman,198,78.0,,
8,19.2,10.0,2020,78.0,ISAAC OKORO,Stiff shooter with clunky mechanics—defenses a...,Wing,"Great finisher who delivers through contact, d...",freshman,225,81.0,,
9,22.1,11.0,2020,81.0,OBI TOPPIN,Brutal pick-and-roll defender who displays lit...,Big,Glides through the air for ferocious dunks; he...,,220,83.0,,


In [149]:
def GetCategoricalIndicies(cat_list, dataframe):
    index_list = []
    for cat_col in cat_list:
        index_list.append(dataframe.columns.get_loc(cat_col))
    return index_list

In [None]:
labeled_data_features

In [84]:
future_latent_pos_features = pd.DataFrame(latent_pos_features_all)[:zion_index]
future_latent_neg_features = pd.DataFrame(latent_neg_features_all)[:zion_index]
future_labeled_data_features = df[:zion_index].drop(["Rookie_VORP", "Rookie_BPM", "negative_report", 
                             "positive_report", "draft_year"], axis=1)
future_labeled_data_features = future_labeled_data_features.join(future_latent_pos_features, rsuffix='pos')
future_labeled_data_features = future_labeled_data_features.join(future_latent_neg_features, rsuffix='neg')
future_labeled_data_features["position"] = future_labeled_data_features["position"].fillna("N/A")
future_labeled_data_features["school_year"] = future_labeled_data_features["school_year"].fillna("N/A")
future_preds = pd.DataFrame()
future_preds["name"] = future_labeled_data_features["name"]
future_labeled_data_features

Unnamed: 0,age,board_rank,height,name,position,school_year,weight,wingspan,0,1,2,3,4,5,6,7,8,9,0neg,1neg,2neg,3neg,4neg,5neg,6neg,7neg,8neg,9neg
0,18.7,1.0,77.0,KILLIAN HAYES,Guard,,215,80.25,0.367432,0.070175,0.168915,-0.058646,0.191231,0.04137,0.045903,-0.061319,-0.140383,0.074986,0.307065,0.086764,-0.197096,0.361987,-0.200711,0.04979,0.173143,0.015386,-0.078127,-0.036308
1,18.6,2.0,79.0,LAMELO BALL,Guard,,190,82.0,0.339368,0.075685,0.171293,-0.133348,0.10281,-0.107649,-0.012198,0.07952,-0.029955,-0.031259,0.355876,-0.255152,-0.046546,-0.045124,0.032291,-0.028009,0.009945,-0.16805,0.032122,-0.078634
2,18.7,3.0,77.0,ANTHONY EDWARDS,Guard,freshman,225,81.0,0.284715,-0.00223,0.151311,-0.043322,0.121756,0.058741,-0.0702,-0.061787,0.200819,-0.004933,0.272121,-0.085462,0.09327,-0.086263,-0.075293,-0.186043,0.034362,-0.031817,-0.172586,0.105087
3,20.1,4.0,77.0,TYRESE HALIBURTON,Guard,sophomore,175,84.0,0.315675,0.152914,0.121488,0.021829,0.116813,0.015854,0.079086,0.039092,-0.060482,-0.08056,0.249009,-0.018768,-0.045435,-0.039615,0.103279,-0.158333,4.6e-05,-0.083886,0.06134,0.036078
4,19.6,6.0,79.0,DEVIN VASSELL,Wing,sophomore,194,82.0,0.293762,0.059975,-0.093149,-0.127085,-0.008422,0.132569,-0.075036,0.085774,-0.080107,0.048434,0.259968,0.219082,-0.018946,0.047287,0.317629,0.268269,0.055521,-0.006739,0.076821,0.05476
5,19.0,7.0,85.0,JAMES WISEMAN,Big,freshman,237,90.0,0.304904,-0.195574,-0.048454,0.095267,0.110367,-0.073556,-0.058176,-0.046296,-0.076075,-0.071687,0.236368,0.049422,-0.003945,-0.032413,-0.068989,-0.106817,-0.086565,0.107688,-0.033304,-0.033429
6,19.5,8.0,74.0,TYRELL TERRY,Guard,freshman,160,,0.288189,0.183297,-0.002109,-0.004463,0.177838,-0.132532,0.058514,-0.139714,-0.10118,0.049291,0.291901,-0.07305,-0.189513,0.020493,0.096888,-0.069475,0.144024,0.027811,0.063653,0.022917
7,19.4,9.0,75.0,TYRESE MAXEY,Guard,freshman,198,78.0,0.359904,0.050275,-0.03593,-0.006701,0.165267,0.071864,-0.067314,-0.145305,-0.133149,0.000957,0.247361,0.018008,-0.078493,-0.164658,-0.150912,-0.02013,-0.024964,0.153358,-0.004492,-0.132496
8,19.2,10.0,78.0,ISAAC OKORO,Wing,freshman,225,81.0,0.317496,-0.059115,-0.007749,-0.170846,0.126994,0.036645,-0.086326,0.027407,-0.001574,-0.022071,0.204305,-0.159856,-0.198732,-0.084268,-0.051738,0.157078,-0.210433,-0.009219,-0.001636,0.031131
9,22.1,11.0,81.0,OBI TOPPIN,Big,,220,83.0,0.306873,-0.023585,0.059553,0.10012,0.15568,0.015369,-0.025323,0.08374,0.06004,0.131758,0.261991,-0.01776,0.161467,0.068448,0.074732,-0.103924,0.104344,0.189504,-0.051501,-0.106981


In [85]:
future_preds

Unnamed: 0,name
0,KILLIAN HAYES
1,LAMELO BALL
2,ANTHONY EDWARDS
3,TYRESE HALIBURTON
4,DEVIN VASSELL
5,JAMES WISEMAN
6,TYRELL TERRY
7,TYRESE MAXEY
8,ISAAC OKORO
9,OBI TOPPIN


In [12]:
latent_pos_features = pd.DataFrame(latent_pos_features_all)[zion_index:]
latent_neg_features = pd.DataFrame(latent_neg_features_all)[zion_index:]

labeled_data_target = df["rookie_VORP"][zion_index:].fillna(0)
labeled_data_features= df[zion_index:].drop(["rookie_VORP", "rookie_BPM", "rookie_minutes", "negative_report", 
                             "positive_report", "draft_year"], axis=1)

labeled_data_features = labeled_data_features.join(latent_pos_features, rsuffix='pos')
labeled_data_features = labeled_data_features.join(latent_neg_features, rsuffix='neg')

kf = KFold(n_splits=10)
cat_features = ["name", "position", "school_year"]
result_list = []
labeled_data_features["position"] = labeled_data_features["position"].fillna("N/A")
labeled_data_features["school_year"] = labeled_data_features["school_year"].fillna("N/A")

count=1
for train, test in kf.split(labeled_data_features):
    print("%s %s" % (train, test))
    y_train = labeled_data_target.iloc[train]
    X_train = labeled_data_features.iloc[train]
    y_test = labeled_data_target.iloc[test]
    X_test = labeled_data_features.iloc[test]
    

    """
    X_train, X_test, y_train, y_test = train_test_split(labeled_data_features, labeled_data_target,
                                                    test_size=0.2, random_state=7)
    """

    cat_indicies = GetCategoricalIndicies(cat_features, dataframe=X_train)
    #cat_indicies = []
    print(cat_indicies)
    train_pool = Pool(X_train, y_train, cat_features=cat_indicies)
    test_pool =  Pool(X_test, y_test, cat_features=cat_indicies)

    model = CatBoostRegressor(iterations=1000,
                          learning_rate=0.10,
                          depth=4,
                          l2_leaf_reg=15,
                          eval_metric="RMSE",
                          ignored_features=GetCategoricalIndicies(["name"], X_train),
                          )

    model.fit(train_pool, eval_set=test_pool)
    
    result = mean_absolute_error(model.predict(X_test), y_test)
    print(result)
    result_list.append(result)
    
    future_preds["fold_"+str(count)] = model.predict(future_labeled_data_features)
    count+=1
    
print(result_list)
result_median = np.median(np.array(result_list))
result_mean = np.mean(np.array(result_list))
print("median", result_median)
print("mean", result_mean)



[ 18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
[3, 4, 5]
0:	learn: 0.5846555	test: 0.8348575	best: 0.8348575 (0)	total: 48.1ms	remaining: 48s
1:	learn: 0.5813545	test: 0.8376321	best: 0.8348575 (0)	total: 51.7ms	remaining: 25.8s
2:	learn: 0.5781803	test: 0.8346178	best: 0.8346178 (2)	total: 52.9ms	remaining: 17.6s
3:	learn: 0.57489

109:	learn: 0.3943181	test: 0.8894181	best: 0.8346178 (2)	total: 341ms	remaining: 2.76s
110:	learn: 0.3928853	test: 0.8911536	best: 0.8346178 (2)	total: 343ms	remaining: 2.75s
111:	learn: 0.3920343	test: 0.8906242	best: 0.8346178 (2)	total: 345ms	remaining: 2.73s
112:	learn: 0.3895567	test: 0.8898038	best: 0.8346178 (2)	total: 347ms	remaining: 2.72s
113:	learn: 0.3884778	test: 0.8908724	best: 0.8346178 (2)	total: 349ms	remaining: 2.71s
114:	learn: 0.3870681	test: 0.8900136	best: 0.8346178 (2)	total: 351ms	remaining: 2.7s
115:	learn: 0.3848040	test: 0.8906633	best: 0.8346178 (2)	total: 354ms	remaining: 2.7s
116:	learn: 0.3805905	test: 0.8929527	best: 0.8346178 (2)	total: 356ms	remaining: 2.69s
117:	learn: 0.3795881	test: 0.8914533	best: 0.8346178 (2)	total: 359ms	remaining: 2.68s
118:	learn: 0.3791124	test: 0.8914214	best: 0.8346178 (2)	total: 363ms	remaining: 2.69s
119:	learn: 0.3784042	test: 0.8880323	best: 0.8346178 (2)	total: 365ms	remaining: 2.68s
120:	learn: 0.3761718	test: 0.8873

205:	learn: 0.2820466	test: 0.8759093	best: 0.8346178 (2)	total: 532ms	remaining: 2.05s
206:	learn: 0.2805338	test: 0.8759024	best: 0.8346178 (2)	total: 536ms	remaining: 2.05s
207:	learn: 0.2801399	test: 0.8760657	best: 0.8346178 (2)	total: 538ms	remaining: 2.05s
208:	learn: 0.2791867	test: 0.8764430	best: 0.8346178 (2)	total: 539ms	remaining: 2.04s
209:	learn: 0.2785751	test: 0.8759977	best: 0.8346178 (2)	total: 543ms	remaining: 2.04s
210:	learn: 0.2781348	test: 0.8757616	best: 0.8346178 (2)	total: 545ms	remaining: 2.04s
211:	learn: 0.2779792	test: 0.8758083	best: 0.8346178 (2)	total: 546ms	remaining: 2.03s
212:	learn: 0.2772366	test: 0.8758128	best: 0.8346178 (2)	total: 547ms	remaining: 2.02s
213:	learn: 0.2764908	test: 0.8760736	best: 0.8346178 (2)	total: 549ms	remaining: 2.02s
214:	learn: 0.2746618	test: 0.8764187	best: 0.8346178 (2)	total: 553ms	remaining: 2.02s
215:	learn: 0.2744624	test: 0.8765078	best: 0.8346178 (2)	total: 558ms	remaining: 2.03s
216:	learn: 0.2737097	test: 0.87

329:	learn: 0.1859587	test: 0.9009947	best: 0.8346178 (2)	total: 723ms	remaining: 1.47s
330:	learn: 0.1857211	test: 0.9007931	best: 0.8346178 (2)	total: 729ms	remaining: 1.47s
331:	learn: 0.1846320	test: 0.9015303	best: 0.8346178 (2)	total: 731ms	remaining: 1.47s
332:	learn: 0.1838842	test: 0.9015656	best: 0.8346178 (2)	total: 732ms	remaining: 1.47s
333:	learn: 0.1836627	test: 0.9013846	best: 0.8346178 (2)	total: 734ms	remaining: 1.46s
334:	learn: 0.1830073	test: 0.9018776	best: 0.8346178 (2)	total: 736ms	remaining: 1.46s
335:	learn: 0.1818614	test: 0.9020747	best: 0.8346178 (2)	total: 738ms	remaining: 1.46s
336:	learn: 0.1810715	test: 0.9015497	best: 0.8346178 (2)	total: 740ms	remaining: 1.46s
337:	learn: 0.1794867	test: 0.9003011	best: 0.8346178 (2)	total: 742ms	remaining: 1.45s
338:	learn: 0.1792708	test: 0.9007122	best: 0.8346178 (2)	total: 743ms	remaining: 1.45s
339:	learn: 0.1790483	test: 0.9007877	best: 0.8346178 (2)	total: 745ms	remaining: 1.45s
340:	learn: 0.1778546	test: 0.90

443:	learn: 0.1248672	test: 0.9117645	best: 0.8346178 (2)	total: 916ms	remaining: 1.15s
444:	learn: 0.1247049	test: 0.9120428	best: 0.8346178 (2)	total: 918ms	remaining: 1.14s
445:	learn: 0.1244523	test: 0.9120197	best: 0.8346178 (2)	total: 921ms	remaining: 1.14s
446:	learn: 0.1237287	test: 0.9128226	best: 0.8346178 (2)	total: 923ms	remaining: 1.14s
447:	learn: 0.1235713	test: 0.9133678	best: 0.8346178 (2)	total: 925ms	remaining: 1.14s
448:	learn: 0.1234573	test: 0.9133075	best: 0.8346178 (2)	total: 927ms	remaining: 1.14s
449:	learn: 0.1224749	test: 0.9129018	best: 0.8346178 (2)	total: 932ms	remaining: 1.14s
450:	learn: 0.1220323	test: 0.9126101	best: 0.8346178 (2)	total: 933ms	remaining: 1.14s
451:	learn: 0.1213002	test: 0.9132372	best: 0.8346178 (2)	total: 934ms	remaining: 1.13s
452:	learn: 0.1206061	test: 0.9127842	best: 0.8346178 (2)	total: 937ms	remaining: 1.13s
453:	learn: 0.1203117	test: 0.9126874	best: 0.8346178 (2)	total: 938ms	remaining: 1.13s
454:	learn: 0.1201935	test: 0.91

554:	learn: 0.0856026	test: 0.9146200	best: 0.8346178 (2)	total: 1.1s	remaining: 886ms
555:	learn: 0.0855627	test: 0.9147044	best: 0.8346178 (2)	total: 1.11s	remaining: 885ms
556:	learn: 0.0852115	test: 0.9144807	best: 0.8346178 (2)	total: 1.11s	remaining: 883ms
557:	learn: 0.0844899	test: 0.9143812	best: 0.8346178 (2)	total: 1.11s	remaining: 880ms
558:	learn: 0.0836212	test: 0.9147974	best: 0.8346178 (2)	total: 1.11s	remaining: 877ms
559:	learn: 0.0835517	test: 0.9149503	best: 0.8346178 (2)	total: 1.11s	remaining: 874ms
560:	learn: 0.0831820	test: 0.9149342	best: 0.8346178 (2)	total: 1.11s	remaining: 871ms
561:	learn: 0.0827384	test: 0.9146223	best: 0.8346178 (2)	total: 1.11s	remaining: 869ms
562:	learn: 0.0822666	test: 0.9145872	best: 0.8346178 (2)	total: 1.12s	remaining: 867ms
563:	learn: 0.0816270	test: 0.9150666	best: 0.8346178 (2)	total: 1.12s	remaining: 865ms
564:	learn: 0.0815454	test: 0.9151341	best: 0.8346178 (2)	total: 1.12s	remaining: 862ms
565:	learn: 0.0811031	test: 0.915

693:	learn: 0.0543142	test: 0.9216374	best: 0.8346178 (2)	total: 1.29s	remaining: 570ms
694:	learn: 0.0540303	test: 0.9217632	best: 0.8346178 (2)	total: 1.3s	remaining: 569ms
695:	learn: 0.0536574	test: 0.9213109	best: 0.8346178 (2)	total: 1.3s	remaining: 567ms
696:	learn: 0.0533084	test: 0.9215286	best: 0.8346178 (2)	total: 1.3s	remaining: 565ms
697:	learn: 0.0532749	test: 0.9215767	best: 0.8346178 (2)	total: 1.3s	remaining: 563ms
698:	learn: 0.0532478	test: 0.9216015	best: 0.8346178 (2)	total: 1.3s	remaining: 561ms
699:	learn: 0.0529727	test: 0.9217907	best: 0.8346178 (2)	total: 1.3s	remaining: 559ms
700:	learn: 0.0529198	test: 0.9218420	best: 0.8346178 (2)	total: 1.3s	remaining: 557ms
701:	learn: 0.0525926	test: 0.9223939	best: 0.8346178 (2)	total: 1.31s	remaining: 555ms
702:	learn: 0.0525363	test: 0.9224206	best: 0.8346178 (2)	total: 1.31s	remaining: 554ms
703:	learn: 0.0521621	test: 0.9224422	best: 0.8346178 (2)	total: 1.31s	remaining: 553ms
704:	learn: 0.0521290	test: 0.9224162	b

808:	learn: 0.0380769	test: 0.9238924	best: 0.8346178 (2)	total: 1.49s	remaining: 351ms
809:	learn: 0.0379503	test: 0.9239637	best: 0.8346178 (2)	total: 1.49s	remaining: 349ms
810:	learn: 0.0377826	test: 0.9239976	best: 0.8346178 (2)	total: 1.49s	remaining: 347ms
811:	learn: 0.0377037	test: 0.9240381	best: 0.8346178 (2)	total: 1.49s	remaining: 345ms
812:	learn: 0.0376766	test: 0.9240033	best: 0.8346178 (2)	total: 1.49s	remaining: 343ms
813:	learn: 0.0376454	test: 0.9239698	best: 0.8346178 (2)	total: 1.49s	remaining: 341ms
814:	learn: 0.0374670	test: 0.9238513	best: 0.8346178 (2)	total: 1.49s	remaining: 339ms
815:	learn: 0.0373633	test: 0.9238928	best: 0.8346178 (2)	total: 1.5s	remaining: 337ms
816:	learn: 0.0372163	test: 0.9241680	best: 0.8346178 (2)	total: 1.5s	remaining: 336ms
817:	learn: 0.0371005	test: 0.9241311	best: 0.8346178 (2)	total: 1.5s	remaining: 334ms
818:	learn: 0.0369625	test: 0.9239553	best: 0.8346178 (2)	total: 1.5s	remaining: 332ms
819:	learn: 0.0369150	test: 0.923975

927:	learn: 0.0273178	test: 0.9245579	best: 0.8346178 (2)	total: 1.67s	remaining: 130ms
928:	learn: 0.0272922	test: 0.9245587	best: 0.8346178 (2)	total: 1.68s	remaining: 128ms
929:	learn: 0.0271291	test: 0.9245291	best: 0.8346178 (2)	total: 1.68s	remaining: 126ms
930:	learn: 0.0271025	test: 0.9245912	best: 0.8346178 (2)	total: 1.68s	remaining: 124ms
931:	learn: 0.0270251	test: 0.9246344	best: 0.8346178 (2)	total: 1.68s	remaining: 123ms
932:	learn: 0.0269988	test: 0.9246427	best: 0.8346178 (2)	total: 1.68s	remaining: 121ms
933:	learn: 0.0269905	test: 0.9246504	best: 0.8346178 (2)	total: 1.68s	remaining: 119ms
934:	learn: 0.0269570	test: 0.9247316	best: 0.8346178 (2)	total: 1.69s	remaining: 117ms
935:	learn: 0.0269444	test: 0.9247521	best: 0.8346178 (2)	total: 1.69s	remaining: 116ms
936:	learn: 0.0268298	test: 0.9247248	best: 0.8346178 (2)	total: 1.69s	remaining: 114ms
937:	learn: 0.0267925	test: 0.9247799	best: 0.8346178 (2)	total: 1.69s	remaining: 112ms
938:	learn: 0.0267414	test: 0.92

NameError: name 'future_labeled_data_features' is not defined

In [None]:
future_preds["fold_average"] = future_preds.mean(axis=1)
future_preds

In [159]:
def custom_bpm(row, min_threshold=500):
    if row['rookie_minutes'] >= min_threshold:
        return row['rookie_BPM']
    else:
        return -2
def bpm_threshold(row, lowest_acceptable_threshold=0.5)

## Regression

In [195]:
N_HYPEROPT_PROBES = 5000
HYPEROPT_ALGO = tpe.suggest
#HYPEROPT_ALGO = hyperopt.rand.suggest #  tpe.suggest OR hyperopt.rand.suggest

def get_catboost_params(space):
    params = dict()
    params['learning_rate'] = space['learning_rate']
    params['depth'] = space['depth']
    params['l2_leaf_reg'] = space['l2_leaf_reg']
    params['rsm'] = space['rsm']
    params["random_strength"] = space["random_strength"]
    #params["bagging_temperature"] = space["bagging_temperature"]
    params["bootstrap_type"] = space["bootstrap_type"]
    params["sampling_unit"] = space["sampling_unit"]
    params["svd_comp"] = space["svd_comp"]
    params["svd_iter"] = space["svd_iter"]
    if space["svd_comp"] <= space["svd_cols_dropped"]: #We can't drop more columns than we have
        params["svd_cols_dropped"] = space["svd_comp"]-2
    else:
        params["svd_cols_dropped"] = space["svd_comp"] - space["svd_cols_dropped"]
    #params["model_shrink_rate"]= space["model_shrink_rate"]
    #params["grow_policy"] = space["grow_policy"]
    #params["sampling_frequency"] = space["sampling_frequency"]
    
    return params

#bagging_temperature [0,100), rsm (0;1], l2_leaf_reg Any positive values are allowed



obj_call_count = 0
cur_best_result_mean = np.inf
cur_best_preds_df = pd.DataFrame()
log_writer = open( 'catboost-hyperopt-log.txt', 'w' )

def objective(space):   
    global obj_call_count, cur_best_result_mean, cur_best_preds_df

    obj_call_count += 1

    print('\nCatBoost objective call #{} cur_best_result_mean={:7.5f}'.format(obj_call_count,cur_best_result_mean,))

    params = get_catboost_params(space)

    sorted_params = sorted(space.items(), key=lambda z: z[0])
    params_str = str.join(' ', ['{}={}'.format(k, v) for k, v in sorted_params])
    print('Params: {}'.format(params_str))
    
    
    
    tfidf = TfidfVectorizer(min_df=2)
    tfidf_neg_matrix = tfidf.fit_transform(df["negative_report"]).toarray()
    tfidf_pos_matrix = tfidf.fit_transform(df["positive_report"]).toarray()
    svd = TruncatedSVD(n_components=int(params['svd_comp']), n_iter=int(params['svd_iter']), random_state=7)
    svd.fit(tfidf_neg_matrix)
    latent_neg_features_all = svd.transform(tfidf_neg_matrix)

    svd.fit(tfidf_pos_matrix)
    latent_pos_features_all = svd.transform(tfidf_pos_matrix)
    
    
    latent_pos_features = pd.DataFrame(latent_pos_features_all)[zion_index:]
    latent_neg_features = pd.DataFrame(latent_neg_features_all)[zion_index:]
    

    labeled_data_target = df["rookie_VORP"][zion_index:].fillna(0)
    #labeled_data_target = df[zion_index:].apply(custom_bpm, axis=1)
    #print(labeled_data_target)
    labeled_data_features= df[zion_index:].drop(["rookie_VORP", "rookie_BPM", "rookie_minutes", "negative_report", 
                                 "positive_report", "draft_year"], axis=1)

    latent_all_features = latent_pos_features.join(latent_neg_features, rsuffix='neg')
    
    
    
    estimator = SVR(kernel="linear")
    num_cols_left = int(params['svd_comp']) - int(params["svd_cols_dropped"])
    selector = RFECV(estimator, step=1, min_features_to_select=num_cols_left, cv=5)
    selector = selector.fit(latent_all_features, labeled_data_target)
    labeled_data_features_trim = selector.transform(latent_all_features)
    labeled_data_features_trim = pd.DataFrame(labeled_data_features_trim)
    labeled_data_features_trim = labeled_data_features_trim.shift()[zion_index:]
    labeled_data_features_merge = labeled_data_features_trim.join(labeled_data_features)
    
    
    
    cat_features = ["name", "position", "school_year"]
    result_list = []
    labeled_data_features_merge["position"] = labeled_data_features_merge["position"].fillna("N/A")
    labeled_data_features_merge["school_year"] = labeled_data_features_merge["school_year"].fillna("N/A")

    
    future_latent_pos_features = pd.DataFrame(latent_pos_features_all)[:zion_index]
    future_latent_neg_features = pd.DataFrame(latent_neg_features_all)[:zion_index]
    future_labeled_data_features = df[:zion_index].drop(["rookie_VORP", "rookie_BPM", "rookie_minutes", "negative_report", 
                             "positive_report", "draft_year"], axis=1)
    
    future_latent_all_features = future_latent_pos_features.join(future_latent_neg_features, rsuffix='neg')
    future_latent_all_features = pd.DataFrame(selector.transform(future_latent_all_features))
    future_labeled_data_merge = future_latent_all_features.join(future_labeled_data_features)
    
    future_labeled_data_merge["position"] = future_labeled_data_merge["position"].fillna("N/A")
    future_labeled_data_merge["school_year"] = future_labeled_data_merge["school_year"].fillna("N/A")
    #(below) Trimming columns off our prediction dataframe i.e. 2020 draft class bc of feature selection
    
    future_preds = pd.DataFrame()
    future_preds["name"] = future_labeled_data_merge["name"]
    
    preds_df = pd.DataFrame(future_preds)
    
    kf = KFold(n_splits=5, shuffle=True)
    
    print(labeled_data_features_merge.columns)
    count=1
    for train, test in kf.split(labeled_data_features_merge):
        print("Fold #"+str(count))
        #print("%s %s" % (train, test))
        y_train = labeled_data_target.iloc[train]
        X_train = labeled_data_features_merge.iloc[train]
        y_test = labeled_data_target.iloc[test]
        X_test = labeled_data_features_merge.iloc[test]


        """
        X_train, X_test, y_train, y_test = train_test_split(labeled_data_features, labeled_data_target,
                                                        test_size=0.2, random_state=7)
        """

        cat_indicies = GetCategoricalIndicies(cat_features, dataframe=X_train)
        #cat_indicies = []
        train_pool = Pool(X_train, y_train, cat_features=cat_indicies)
        test_pool =  Pool(X_test, y_test, cat_features=cat_indicies)

        
        model = CatBoostRegressor(iterations=10000,
                                            learning_rate=params['learning_rate'],
                                            depth=int(params['depth']),
                                            use_best_model=True,
                                            l2_leaf_reg=params['l2_leaf_reg'],
                                            od_wait=500,
                                            random_strength=params["random_strength"],
                                            has_time=True,
                                            random_seed=7,
                                            fold_len_multiplier=2,
                                            eval_metric="RMSE",
                                            ignored_features=GetCategoricalIndicies(["name"], X_train),
                                            verbose=500,      
                                            boost_from_average=True,
                                            sampling_unit=params["sampling_unit"], 
                                            bootstrap_type = params["bootstrap_type"],
                                            #sampling_frequency = params["sampling_frequency"],
                                            task_type="GPU",
                                            #model_shrink_rate = params['model_shrink_rate'],
                                            #grow_policy = params["grow_policy"],
                                            )

        model.fit(train_pool, eval_set=test_pool)
        
        """
        importance = model.get_feature_importance(train_pool)
        for num, feature in enumerate(train_pool.get_feature_names()):
            print(num, feature, "importance: ", importance[num])
        """
        
        result = mean_squared_error(model.predict(X_test), y_test)
        #print(result)
        result_list.append(result)
        preds_df["fold_"+str(count)] = pd.Series(model.predict(future_labeled_data_merge))
        
        count+=1
    
    #print(result_list)
    result_median = np.median(np.array(result_list))
    result_mean = np.mean(np.array(result_list))
    print("median", result_median)
    print("mean", result_mean)
    
    preds_df["fold_average"] = preds_df.drop("name", axis=1).mean(axis=1)

    log_writer.flush()

    if result_mean < cur_best_result_mean:
        cur_best_result_mean = result_mean
        #model.save_model("/home/truman/Documents/UFC/CatboostModels/Sherdog/Optimized/UFC-Loss-"+str(obj_call_count))
        print("Great enough dude")
        print(colorama.Fore.RED + 'NEW BEST KFOLDS MEAN={}'.format(cur_best_result_mean) + colorama.Fore.RESET)
        cur_best_preds_df = preds_df
        
    return{'loss':result_mean, 'status': STATUS_OK }

space ={'depth': hp.quniform("depth", 2, 6, 1),
        'rsm': hp.uniform('rsm', 0, 1),
        'learning_rate': hp.uniform('learning_rate', 0.01, 0.55),
        'l2_leaf_reg': hp.uniform('l2_leaf_reg', 0, 500),
        #'bagging_temperature': hp.uniform('bagging_temperature', 0, 10),
        "sampling_unit": hp.choice("sampling_unit", ["Object"]), #"Group"]), 
        "random_strength": hp.uniform("random_strength", 0, 100),
        "bootstrap_type": hp.choice("bootsrap_type", ["Bayesian"]), #["Bayesian", "Bernoulli", "MVS", "No"])
        #"sampling_frequency": hp.choice("sampling_frequency", ["PerTree", "PerTreeLevel"]),
        #"grow_policy": hp.choice("grow_policy", ["SymmetricTree", "Depthwise", "Lossguide"]),
        #"model_shrink_rate": hp.uniform("model_shrink_rate", 0, 1),
        'svd_comp': hp.quniform("svd_comp", 2, 180, 1),
        'svd_iter': hp.quniform("svd_iter", 2, 10, 1),
        'svd_cols_dropped': hp.quniform("svd_cols_dropped", 0, 180, 1),
       }
#hp.uniform('x', -10, 10)

trials = Trials()
best = hyperopt.fmin(fn=objective,
                     space=space,
                     algo=HYPEROPT_ALGO,
                     max_evals=N_HYPEROPT_PROBES,
                     trials=trials,
                     verbose=100)

print('-'*50)
print('The best params:')
print( best )

print('\n\n')


                                                        
CatBoost objective call #1 cur_best_result_mean=    inf
Params: bootstrap_type=Bayesian depth=4.0 l2_leaf_reg=283.58009630924397 learning_rate=0.43070713476815103 random_strength=28.20196224370639 rsm=0.23698472914400026 sampling_unit=Object svd_cols_dropped=33.0 svd_comp=109.0 svd_iter=9.0
Index([            0,             1,             2,             3,
                   4,             5,             6,             7,
                   8,             9,            10,            11,
                  12,            13,            14,            15,
                  16,            17,            18,            19,
                  20,            21,            22,            23,
                  24,            25,            26,            27,
                  28,            29,            30,            31,
                  32,            33,            34,         'age',
        'board_rank',      'height',        'nam

500:	learn: 0.2939846	test: 0.4627193	best: 0.4179930 (64)	total: 2.33s	remaining: 44.2s

bestTest = 0.4179929646                                                               

bestIteration = 64                                                                    

Shrink model to first 65 iterations.                                                  
Fold #2                                                                               
0:	learn: 0.6471286	test: 0.6988440	best: 0.6988440 (0)	total: 6.01ms	remaining: 1m   

500:	learn: 0.2675431	test: 0.7325192	best: 0.6958682 (27)	total: 2.18s	remaining: 41.3s

bestTest = 0.6958681551                                                               

bestIteration = 27                                                                    

Shrink model to first 28 iterations.                                                  
Fold #3                                                                               
0:	learn: 0.5765823	test: 0.9149

500:	learn: 0.2199324	test: 0.8988423	best: 0.8682665 (7)	total: 4.15s	remaining: 1m 18s

bestTest = 0.8682665461                                                              

bestIteration = 7                                                                    

Shrink model to first 8 iterations.                                                  
Fold #3                                                                              
0:	learn: 0.6323684	test: 0.7568574	best: 0.7568574 (0)	total: 8.08ms	remaining: 1m 20s

500:	learn: 0.2436392	test: 0.7754565	best: 0.7530775 (43)	total: 4.04s	remaining: 1m 16s

bestTest = 0.7530774873                                                              

bestIteration = 43                                                                   

Shrink model to first 44 iterations.                                                 
Fold #4                                                                              
0:	learn: 0.7022095	test: 0.4405383	be

KeyboardInterrupt: 

In [196]:
cur_best_preds_df.sort_values("fold_average", ascending=False)

Unnamed: 0,name,fold_1,fold_2,fold_3,fold_4,fold_5,fold_average
0,KILLIAN HAYES,-0.001388,0.336022,-0.051625,0.01903,0.364589,0.133326
25,JALEN SMITH,-0.013312,0.34101,0.029014,0.01903,0.206913,0.116531
17,SADDIQ BEY,-0.038708,0.077577,0.084763,0.01903,0.361365,0.100805
19,PRECIOUS ACHIUWA,-0.015861,0.149437,0.116395,0.01903,0.212873,0.096375
3,TYRESE HALIBURTON,-0.015146,0.066452,0.034766,0.01903,0.227551,0.06653
2,ANTHONY EDWARDS,-0.04397,0.139483,0.002569,0.018978,0.194023,0.062217
16,JOSH GREEN,-0.069914,0.004634,0.049163,0.01903,0.290598,0.058702
11,RJ HAMPTON,-0.001388,0.047306,-0.033255,0.01903,0.260551,0.058449
7,TYRESE MAXEY,-0.017278,0.018913,-0.023721,0.01903,0.288585,0.057106
5,JAMES WISEMAN,-0.038805,0.169404,0.008579,0.01903,0.112863,0.054214


## Classifier

In [204]:
test_df = pd.DataFrame(df[zion_index:].apply(custom_bpm, axis=1))
test_df["name"]=df["name"]
test_df
#df[["name", "rookie_BPM"]].sort_values("rookie_BPM", ascending=False)

Unnamed: 0,0,name
29,2.4,ZION WILLIAMSON
30,0.4,JA MORANT
31,-4.3,R.J. BARRETT
32,-4.7,DE'ANDRE HUNTER
33,-5.6,DARIUS GARLAND
34,-2.8,COBY WHITE
35,-4.0,JARRETT CULVER
36,-4.2,CAM REDDISH
37,-3.5,NASSIR LITTLE
38,0.1,JAXSON HAYES


## Creating A Machine Learning Model with No Scouting Report (no text data)

In [None]:
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices, feature_names=list(X_train.columns))
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices, feature_names=list(X_train.columns))

Documents ~ Players
 ~ scouting report
 
 
use truncated SVD in sci-kit learn