<a href="https://colab.research.google.com/github/rritam94/mycode/blob/main/nba.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Here, I import the libraries I need: pandas to read the data and scikit-learn to train and test
# my model.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Here, I use the pandas library to read the CSV input file that I created. I then call
# nba_data.head() to display the first 5 rows of the data and make sure everything loaded correctly.

nba_data = pd.read_csv('/content/nba.csv')
nba_data.head()

FileNotFoundError: ignored
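
The output above shows that `/content/nba.csv` was not present when the cell ran, so every later cell that uses `nba_data` would fail as well. A minimal sketch of failing early with a clearer message follows. The helper name `load_nba_csv` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_nba_csv(csv_path):
    """Hypothetical helper: read the CSV, or stop early with a clearer message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} was not found. Upload nba.csv to the session or fix the path."
        )
    return pd.read_csv(path)


# Intended usage in the notebook (not executed here, since the file may be missing):
# nba_data = load_nba_csv('/content/nba.csv')
# nba_data.head()
```

In Colab, files under `/content` belong to the current session, so the CSV would need to be uploaded again after the runtime is reset.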

In [None]:
# Here, I encode the labels: 'good' becomes 1 and 'bad' becomes 0. Then I split the data into X and Y, where X is
# the points per game and Y is the category (whether the player is good or bad).

nba_data.loc[nba_data['Category'] == 'bad', 'Category'] = 0
nba_data.loc[nba_data['Category'] == 'good', 'Category'] = 1

X = nba_data['Points']
Y = nba_data['Category']
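
The same label encoding can also be written as a single dictionary lookup. This is a sketch on a made-up three-row frame, not the real `nba.csv`. The column names `Points` and `Category` are taken from the cell above, and the values are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for nba.csv.
toy = pd.DataFrame({
    'Points': [4.2, 25.4, 30.1],
    'Category': ['bad', 'good', 'good'],
})

# Map each label to its numeric code in one step.
toy['Category'] = toy['Category'].map({'bad': 0, 'good': 1})

encoded = toy['Category'].tolist()
print(encoded)  # [0, 1, 1]
```

One difference to keep in mind: with `map`, any label other than 'bad' or 'good' becomes NaN, whereas the `.loc` assignments leave such a label unchanged.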

In [None]:
# Here, I split the data into training and testing data. In this case, 20% of the data is allotted
# for testing, and the remaining 80% is used for training.

# Training data is the data used to fit the machine learning model.
# Testing data is held back during training; its known labels are used afterwards to check how well
# the model predicts on data it has not seen.

# train_test_split() is a function in the scikit-learn library.

# random_state seeds the shuffling done before the split, so the same value gives the same split on every run.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 3)
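
To see what `random_state` and `test_size` do in practice, here is a small sketch on made-up data. The points-per-game values and labels are invented for illustration. Only `train_test_split` and its arguments come from the cell above.

```python
from sklearn.model_selection import train_test_split

# Made-up points-per-game values and labels, for illustration only.
ppg = [4.2, 6.8, 8.1, 9.5, 11.0, 19.3, 22.7, 25.4, 27.9, 30.1]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Two calls with the same random_state shuffle the rows identically.
first = train_test_split(ppg, labels, test_size=0.2, random_state=3)
second = train_test_split(ppg, labels, test_size=0.2, random_state=3)
same_split = first == second
print(same_split)  # True

# With test_size=0.2, 2 of the 10 rows are held out for testing.
X_train, X_test, Y_train, Y_test = first
print(len(X_train), len(X_test))  # 8 2
```

Changing `random_state` to another integer gives a different, but equally repeatable, split.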

In [None]:
# TfidfVectorizer is a class in the scikit-learn library that turns text into numeric features: it splits each
# string into tokens and weights every token by TF-IDF (term frequency times inverse document frequency).
# The line below creates an instance of the class with my chosen parameters; that object is then used to transform the data.
# I decided not to change everything to lowercase, as all of my data is lowercase already.
feature_extraction = TfidfVectorizer(min_df = 1, stop_words = 'english', lowercase = False)

# Here, I convert X_train and X_test to strings. The vectorizer only accepts text: with float64 values
# it failed in a call to .lower(), which floats do not have.
X_train = X_train.astype(str)
X_test = X_test.astype(str)

# Here, I convert Y_train and Y_test into integer format, since each value is just 0 or 1 depending on
# whether the player is bad or good.
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

# This is where the data is actually transformed. fit_transform() learns the vocabulary and IDF weights from the
# training data; transform() then applies that same fitted vocabulary to the test data.
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
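
It helps to look at what the vectorizer does with a number written as text. This sketch uses the same constructor arguments as the cell above and a few made-up inputs. With scikit-learn's default token pattern, only runs of two or more word characters are kept, so the decimal point splits the number and single digits are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=False)

# build_analyzer() returns the function the vectorizer uses to tokenize one string.
analyzer = vectorizer.build_analyzer()
tokens_long = analyzer("32.56")
tokens_short = analyzer("7.5")
print(tokens_long)   # ['32', '56']
print(tokens_short)  # []

# Fitting on a few made-up values shows the learned vocabulary.
samples = ["32.56", "7.5", "18.0"]
features = vectorizer.fit_transform(samples)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)      # ['18', '32', '56']
print(features.shape)  # (3, 3)
```

The features are therefore token matches rather than magnitudes: "32.56" and "32.99" share the token "32", while "7.5" produces no tokens at all. A value whose tokens never appeared in the training data is transformed into an all-zero row.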

In [None]:
# Here, I create a LogisticRegression model from the scikit-learn library.
model = LogisticRegression()

# The line below is where the training of our very own machine learning model occurs.
model.fit(X_train_features, Y_train) # as a reminder, X is the input (ppg) and Y is the output (0 - bad or 1 - good)
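
The notebook imports `accuracy_score` and holds out 20% of the data for testing. Below is a sketch of how the held-out rows could be scored. It repeats the same steps on made-up data so that it runs on its own. The values are invented, so the printed accuracies say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up points-per-game strings and labels, for illustration only.
ppg = ["4.21", "6.83", "8.14", "9.52", "11.07",
       "19.36", "22.71", "25.48", "27.93", "30.15"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train, X_test, Y_train, Y_test = train_test_split(
    ppg, labels, test_size=0.2, random_state=3
)

# Same vectorizer settings as in the notebook.
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=False)
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_features, Y_train)

# Compare predictions against the known labels on both splits.
train_accuracy = accuracy_score(Y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(Y_test, model.predict(X_test_features))
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

A large gap between the two numbers would suggest the model memorized the training rows rather than learning something that carries over to unseen players.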

In [None]:
# Now, we will test our model. I am creating an input which will just take a String of the points/game.
input_ppg = ["32.56"]

# I have to again convert my input_ppg text to a feature for the model to understand.
input_feature = feature_extraction.transform(input_ppg)

# Now, I use the model's predict() method to predict an outcome based on the input_feature.
prediction = model.predict(input_feature)

# Now, we check whether we received 0 or 1, and print "Good" or "Bad" accordingly.
if prediction[0] == 1:
  print("Good")
else:
  print("Bad")

Good
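
For comparison, here is a sketch of a different technique from the one used in this notebook: passing points per game to `LogisticRegression` as a plain number instead of as TF-IDF text features. The data is made up, and the name `numeric_model` is hypothetical. The point is only that a numeric feature lets the model use the size of the value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up points-per-game values and labels, for illustration only.
# reshape(-1, 1) turns the 1-D array into the (rows, 1 column) shape scikit-learn expects.
ppg = np.array([4.2, 6.8, 8.1, 9.5, 11.0, 19.3, 22.7, 25.4, 27.9, 30.1]).reshape(-1, 1)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

numeric_model = LogisticRegression()
numeric_model.fit(ppg, labels)

# Predict for one high and one low scoring average.
high = numeric_model.predict([[32.56]])[0]
low = numeric_model.predict([[5.0]])[0]
print("Good" if high == 1 else "Bad")  # Good
print("Good" if low == 1 else "Bad")   # Bad
```

With this setup no string conversion or vectorizer is needed, and a value that never appeared in training, such as 32.56, still lands on the expected side of the decision boundary.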
