### Latent Semantic Analysis

#### Natural Language Processing ~ A family of techniques used to derive meaning from text data

#### Document ~ A collection of words; the observations, or "rows", of our dataset

#### Body ~ A collection of documents (our entire dataset), more commonly called a corpus

#### Dictionary ~ The set of all words that appear in at least one document in our body 

#### Topic ~ A collection of words that co-occur 

#### Latent ~ Features that are "hidden" in the data

LSA is unsupervised: it needs no labels.

Aim: to create representations of the text data in terms of these topics, or latent features.

2 steps
1) Document Term Matrix
2) Singular Value Decomposition ~ performed on Document Term Matrix

You can represent documents as vectors (i.e. once the text is turned into numbers, each document can be plotted as a data point on a chart)

Output - topic encoded data

source: https://www.youtube.com/watch?v=hB51kkus-Rc


Rows = one per document (the entire sentence).
Columns = one per word in the Dictionary; the entry is 1 if the document contains the column word, and 0 otherwise.
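
To make the row/column layout concrete, here is a minimal sketch that builds a binary document-term matrix by hand. The three toy documents are made up for illustration and are not taken from the dataset.

```python
# Toy corpus: each string is one document (one row of the matrix).
documents = [
    "lacks burst off the dribble",
    "elite shooter off the dribble",
    "lacks elite burst",
]

# Dictionary: every distinct word that appears in at least one document.
dictionary = sorted({word for doc in documents for word in doc.split()})

# Binary document-term matrix: 1 if the document contains the column word.
doc_term_matrix = [
    [1 if word in doc.split() else 0 for word in dictionary]
    for doc in documents
]

print(dictionary)
for row in doc_term_matrix:
    print(row)
```

Each row has one entry per Dictionary word, so every document becomes a vector of the same length regardless of how many words it contains.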

Then perform SVD (same idea as PCA)

The latent features represent topics or collections of words that co-occur
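
The SVD step can be sketched with plain NumPy on a small document-term matrix. This is an illustration under assumptions: the matrix and `k = 2` are arbitrary toy choices. One difference from PCA worth keeping in mind is that PCA centers the columns first, while a truncated SVD of the raw matrix does not.

```python
import numpy as np

# Rows are documents, columns are words (a small binary document-term matrix).
X = np.array([
    [1, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 0],
], dtype=float)

# SVD: X = U @ diag(s) @ Vt, with singular values sorted largest first.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent features (topics) to keep; arbitrary for this toy example
doc_topics = U[:, :k] * s[:k]  # each document expressed in k latent features
topic_words = Vt[:k]           # each latent feature as weights over the words

# Rank-k approximation of the original matrix from the kept components.
X_approx = doc_topics @ topic_words
print(doc_topics.shape, topic_words.shape)
```

`doc_topics` plays the role of the topic-encoded data, and each row of `topic_words` shows which words load together on one latent feature.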

In [20]:
import pandas as pd
import numpy as np
from datetime import *
from datetime import date
from dateutil.relativedelta import *
import calendar
from datetime import datetime as dt

pd.set_option('display.max_columns', 500)

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn import decomposition
from sklearn.decomposition import TruncatedSVD
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from betacal import BetaCalibration
from catboost import CatBoostClassifier, CatBoostRegressor, Pool
import catboost
from catboost.utils import eval_metric




In [79]:
df = pd.read_csv("/home/truman/Documents/NBA/ScoutingAI/ringer_data_clean.csv")
df = df.drop("Unnamed: 0",axis=1)

In [80]:
drop_list = ["standing_reach", "school"]
df = df.drop(drop_list, axis=1)
df

Unnamed: 0,age,board_rank,draft_year,height,name,negative_report,position,positive_report,school_year,weight,wingspan,Rookie_BPM,Rookie_VORP
0,18.7,1.0,2020,77.0,KILLIAN HAYES,Left-hand dominant: He might as well tie his r...,Guard,Playmaking is his best skill. He can whip pass...,,215,80.25,,
1,18.6,2.0,2020,79.0,LAMELO BALL,"Ball is a great passer, but he can’t be classi...",Guard,Ambidextrous passer with pinpoint accuracy and...,,190,82.00,,
2,18.7,3.0,2020,77.0,ANTHONY EDWARDS,Not a pure shooter; he settles for jumpers eve...,Guard,Powerful driving to the rim; when he initiates...,freshman,225,81.00,,
3,20.1,4.0,2020,77.0,TYRESE HALIBURTON,Lack of athleticism and burst limits his upsid...,Guard,Always in control; he lacks lightning speed or...,sophomore,175,84.00,,
4,19.6,6.0,2020,79.0,DEVIN VASSELL,Lacks burst to beat defenders off the dribble ...,Wing,Elite team defender who will immediately help ...,sophomore,194,82.00,,
5,19.0,7.0,2020,85.0,JAMES WISEMAN,Poor shot selection in high school; he played ...,Big,Elite measurables with long arms and a strong ...,freshman,237,90.00,,
6,19.5,8.0,2020,74.0,TYRELL TERRY,Developing a stepback and side-dribble 3 is th...,Guard,"Elite shooter with a quick, high release. He c...",freshman,160,,,
7,19.4,9.0,2020,75.0,TYRESE MAXEY,Lacks top-end quickness and acceleration. He’s...,Guard,Clever finisher at the rim who can score from ...,freshman,198,78.00,,
8,19.2,10.0,2020,78.0,ISAAC OKORO,Stiff shooter with clunky mechanics—defenses a...,Wing,"Great finisher who delivers through contact, d...",freshman,225,81.00,,
9,22.1,11.0,2020,81.0,OBI TOPPIN,Brutal pick-and-roll defender who displays lit...,Big,Glides through the air for ferocious dunks; he...,,220,83.00,,


In [None]:
vectorizer = CountVectorizer(stop_words="english")



In [86]:
cv = CountVectorizer(stop_words="english")
neg_matrix = cv.fit_transform(df["negative_report"]).toarray()
featurenames = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

#Tf-idf Transformation
tfidf = TfidfTransformer()
tfidf_neg_matrix = tfidf.fit_transform(neg_matrix).toarray()
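
The count, tf-idf, and SVD steps can also be chained in one scikit-learn pipeline. The sketch below is not the notebook's code: it swaps in `TfidfVectorizer` (equivalent to `CountVectorizer` followed by `TfidfTransformer`), and the toy reports and `n_components=2` are assumptions for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for df["negative_report"].
toy_reports = [
    "Lacks burst to beat defenders off the dribble",
    "Stiff shooter with clunky mechanics",
    "Elite shooter with a quick release",
    "Lacks athleticism and burst",
]

# TfidfVectorizer keeps the matrix sparse, so no .toarray() is needed.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=42),
)

# One row per document, one column per latent feature.
toy_latent = lsa.fit_transform(toy_reports)
print(toy_latent.shape)
```

Keeping the steps in a pipeline means the same fitted vocabulary and components can later be applied to new reports with `lsa.transform(...)`.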


In [87]:
cv.vocabulary_

{'left': 941,
 'hand': 729,
 'dominant': 463,
 'tie': 1724,
 'right': 1415,
 'considering': 327,
 'little': 970,
 'uses': 1843,
 'passes': 1188,
 'make': 1008,
 'relies': 1392,
 'limited': 961,
 'athlete': 116,
 'lacks': 901,
 'burst': 227,
 'bounce': 203,
 'hinders': 763,
 'finishing': 609,
 'ability': 32,
 'especially': 539,
 'rarely': 1342,
 'advanced': 63,
 'handle': 730,
 'picks': 1217,
 'dribble': 478,
 'gets': 692,
 'trouble': 1777,
 'shifty': 1507,
 'doesn': 459,
 'create': 359,
 'ton': 1740,
 'separation': 1490,
 'string': 1637,
 'moves': 1087,
 'break': 210,
 'defenses': 401,
 'experiences': 558,
 'lapses': 909,
 'defense': 400,
 'missing': 1066,
 'rotations': 1439,
 'falling': 579,
 'ball': 149,
 'stance': 1595,
 'needs': 1110,
 'vocal': 1866,
 'lead': 923,
 'guard': 717,
 'better': 172,
 'command': 308,
 'team': 1688,
 'great': 708,
 'passer': 1187,
 'classified': 285,
 'playmaker': 1226,
 'decision': 389,
 'making': 1011,
 'jacks': 862,
 'poor': 1245,
 'shots': 1519,
 'ear

In [88]:
svd = TruncatedSVD(n_components=150, n_iter=7, random_state=42)
svd.fit(tfidf_neg_matrix)
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
print(svd.singular_values_)
latent_features = svd.transform(tfidf_neg_matrix)

[0.00235639 0.01357617 0.01269528 0.01139216 0.01111263 0.01070669
 0.01052317 0.01037864 0.0099457  0.00977572 0.00968771 0.00938841
 0.00921409 0.00909197 0.00903725 0.00885841 0.00871302 0.00861943
 0.00852164 0.00835677 0.00829415 0.00807584 0.00803794 0.00793377
 0.00781438 0.00776504 0.00772312 0.00761451 0.00754153 0.00739108
 0.00733153 0.00729629 0.00719242 0.00717112 0.00710541 0.00704557
 0.00696789 0.00694989 0.00684017 0.00675792 0.00671    0.00665092
 0.00664392 0.00661299 0.00649515 0.00645019 0.00639651 0.00636258
 0.00625928 0.00621892 0.00618515 0.00614725 0.00605753 0.00602199
 0.00601655 0.00594297 0.00592771 0.00586479 0.00582227 0.0057299
 0.0057186  0.00568921 0.00563393 0.00558996 0.00554788 0.00548574
 0.00545654 0.00543449 0.00539439 0.00537674 0.00532834 0.00531551
 0.00528143 0.00519771 0.00518787 0.00515365 0.00514385 0.00506612
 0.00505563 0.00500817 0.00497168 0.00495517 0.00492613 0.00488589
 0.00487786 0.00484921 0.00480648 0.00476868 0.0047489  0.00469
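
One way to read the explained variance ratios is cumulatively. The sketch below uses only the first six ratios printed above, and the 5% threshold is an arbitrary value chosen so the toy example has an answer; it is not a recommendation.

```python
import numpy as np

# First six explained variance ratios, copied from the output above.
ratios = np.array([0.00235639, 0.01357617, 0.01269528,
                   0.01139216, 0.01111263, 0.01070669])

# Running total of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.05  # illustrative only
n_needed = int(np.searchsorted(cumulative, threshold)) + 1

print(cumulative)
print(n_needed)
```

With the full array from `svd.explained_variance_ratio_`, the same two lines give the number of components needed for any target share of variance.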

In [92]:
len(latent_features)

209

In [104]:
df["Rookie_VORP"][29:].fillna(0).head()

29    0.6
30    1.1
31   -1.0
32   -1.4
33   -1.7
Name: Rookie_VORP, dtype: float64

In [None]:
df

In [103]:
model = CatBoostRegressor(iterations=1000,
                          learning_rate=1,
                          depth=6)

# Use rows 29 onward as the labeled set; missing Rookie_VORP values are filled with 0.
labeled_data_features = latent_features[29:]
labeled_data_target = df["Rookie_VORP"][29:].fillna(0)

X_train, X_test, y_train, y_test = train_test_split(labeled_data_features, labeled_data_target, test_size=0.20, random_state=7)

# The latent features are all numeric, so no cat_features are needed here.
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)
model.fit(train_pool, eval_set=test_pool)

#preds = model.predict(eval_data)

Note: as originally run, this cell then reassigned labeled_data_features to df[29:].drop(["Rookie_VORP", "Rookie_BPM", "negative_report", "positive_report", "draft_year"], axis=1), which still contains string columns such as name, so the fit failed with:

CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=3]="DAMYEAN DOTSON": Cannot convert 'b'DAMYEAN DOTSON'' to float

In [102]:
len(X_test)

36

## Creating A Machine Learning Model with No Scouting Report (no text data)

In [None]:
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices, feature_names=list(X_train.columns))
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices, feature_names=list(X_train.columns))
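
The cell above uses `categorical_features_indices`, which is not defined anywhere in this notebook. One possible way to build it, sketched on a small stand-in frame (`X_demo` is hypothetical; its values are copied from the first rows of the table above), is to take the positions of the non-numeric columns. Missing categorical values are filled with a placeholder string first, since CatBoost expects categorical features to be strings or integers rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train: a mix of numeric and string columns.
X_demo = pd.DataFrame({
    "age": [18.7, 20.1, 19.6],
    "position": ["Guard", "Guard", "Wing"],
    "school_year": [np.nan, "sophomore", "sophomore"],
    "weight": [215, 175, 194],
})

# Positions of the non-numeric columns, in column order.
categorical_features_indices = [
    i for i, col in enumerate(X_demo.columns)
    if not pd.api.types.is_numeric_dtype(X_demo[col])
]

# Replace missing categorical values with a placeholder string.
cat_cols = X_demo.columns[categorical_features_indices]
X_demo[cat_cols] = X_demo[cat_cols].fillna("none")

print(categorical_features_indices)
# These indices could then be passed as Pool(..., cat_features=categorical_features_indices).
```

An identifier column such as `name` would also show up as non-numeric, so it is probably better dropped before building the pools than treated as a category.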

Documents ~ Players (each player's scouting report)

Use TruncatedSVD in scikit-learn