# Introduction

Please read "Experiment objectives.txt" and then the "Generate data" notebook first to understand the aim and objectives. In short, at this point we have generated a dataset for our experimentation, and we would like to develop multiple AI agents on exactly the same dataset. Throughout the experimentation the only variable that changes is the vectorization; comparing the resulting performances can reveal which techniques align well with this kind of experiment.
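
As a rough sketch of that setup, the snippet below keeps the classifier and the train/test split fixed and swaps only the vectorizer. The toy sentences, the choice of `TfidfVectorizer` as a second candidate, and the helper name `evaluate_vectorizer` are assumptions for illustration; they are not taken from the experiment itself.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the generated dataset (assumption, not the real data)
sentences = [
    "king rook pawn castling", "bishop knight checkmate", "queen pawn endgame",
    "rook stalemate opening", "knight bishop king", "pawn queen castling",
    "apple river sky", "music phone running", "mountain book sleeping",
    "dancing computer apple", "sky jumping music", "river phone mountain",
]
labels = [1] * 6 + [0] * 6  # 1 = chess, 0 = other


def evaluate_vectorizer(vectorizer):
    """Hypothetical helper: same split and classifier, only the vectorizer varies."""
    text_train, text_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.25, random_state=42, stratify=labels
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(vectorizer.fit_transform(text_train), y_train)
    return accuracy_score(y_test, clf.predict(vectorizer.transform(text_test)))


results = {
    "bag of words": evaluate_vectorizer(CountVectorizer()),
    "tf-idf": evaluate_vectorizer(TfidfVectorizer()),
}
print(results)
```

Because the split and the classifier seed are identical across runs, any difference in the reported scores can be attributed to the vectorizer alone.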

## Experiment structure

1. Import libraries and load the dataset

2. Basic data visualization

3. Data preparation in the form of a vectorized dataset

4. Train multiple AI agents

5. Performance evaluation

In [3]:
# Essential libraries
import pandas as pd
import numpy as np

In [5]:
# Load dataset
df = pd.read_csv('generated_sentences.csv')

In [7]:
# Shows the first 5 rows
df.head()  

Unnamed: 0,Generated Sentence,About
0,"king, endgame rook rook bishop FIDE pawn castling",CHESS
1,"dancing phone running jumping, mountain",OTHER
2,"bishop, Elo rating pawn pawn, knight, rook kni...",CHESS
3,"river apple, phone, mountain, eating apple, ri...",OTHER
4,"FIDE, knight pawn, pawn stalemate rook",CHESS


In [8]:
# (Number of rows, number of columns)
df.shape  

(10000, 2)

In [6]:
# A concise summary including index dtype and column dtypes, non-null values, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Generated Sentence  10000 non-null  object
 1   About               10000 non-null  object
dtypes: object(2)
memory usage: 156.4+ KB


In [9]:
# Summary statistics; when there are no numerical columns, describe() falls back to summarizing the object (categorical) columns.
df.describe()

Unnamed: 0,Generated Sentence,About
count,10000,10000
unique,10000,2
top,"king, endgame rook rook bishop FIDE pawn castling",CHESS
freq,1,5500


In [10]:
# Check for missing values

df.isnull().sum()

Generated Sentence    0
About                 0
dtype: int64

In [15]:
# Count how often each unique value occurs in a given column, in our case Generated Sentence
df['Generated Sentence'].value_counts()

Generated Sentence
king, endgame rook rook bishop FIDE pawn castling                              1
sky, computer mountain, dancing apple, river dancing sleeping, apple, music    1
bishop, king checkmate, opening, king castling endgame FIDE                    1
sky, dancing, computer sky river, phone sleeping sky                           1
swimming, music running, book mountain phone, apple, sky apple phone           1
                                                                              ..
bishop king endgame endgame, Elo rating                                        1
castling, FIDE bishop, checkmate, queen, stalemate                             1
computer jumping dancing computer, mountain, book mountain river sleeping      1
computer, jumping apple sky, computer                                          1
mountain computer, sky jumping sleeping apple                                  1
Name: count, Length: 10000, dtype: int64

# Data preparation

In [48]:
# One-hot encode the About column and keep the CHESS indicator as 0/1
About_chess = pd.get_dummies(df['About'], drop_first=False)['CHESS'].astype(int)


In [49]:
About_chess

0       1
1       0
2       1
3       0
4       1
       ..
9995    1
9996    0
9997    0
9998    1
9999    0
Name: CHESS, Length: 10000, dtype: int64

In [50]:
prepared_dataset = pd.concat([df['Generated Sentence'],About_chess],axis=1)

In [51]:
prepared_dataset

Unnamed: 0,Generated Sentence,CHESS
0,"king, endgame rook rook bishop FIDE pawn castling",1
1,"dancing phone running jumping, mountain",0
2,"bishop, Elo rating pawn pawn, knight, rook kni...",1
3,"river apple, phone, mountain, eating apple, ri...",0
4,"FIDE, knight pawn, pawn stalemate rook",1
...,...,...
9995,"king, rook opening castling knight",1
9996,"apple, music, music, music, running sky phone,...",0
9997,"sky, computer dancing sleeping apple, running ...",0
9998,"middlegame opening knight middlegame, queen, q...",1


In [52]:
X = prepared_dataset['Generated Sentence']
y = prepared_dataset['CHESS']

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

In [54]:
from sklearn.model_selection import train_test_split

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
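
Note that the vectorizer above is fitted on the whole dataset before the split, so its vocabulary also reflects the test sentences. One common way to avoid this, sketched here on toy data (an assumption, not the notebook's dataset), is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so the vocabulary is learned from the training text only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the raw sentences and labels (assumption)
raw_sentences = [
    "king rook pawn", "bishop knight queen", "pawn castling rook",
    "checkmate king queen", "apple river sky", "music phone book",
    "mountain apple music", "sky river phone",
]
raw_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class ratio with stratify
text_train, text_test, label_train, label_test = train_test_split(
    raw_sentences, raw_labels, test_size=0.25, random_state=42, stratify=raw_labels
)

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(text_train, label_train)  # vocabulary comes from training text only

train_vocabulary = set(pipeline.named_steps["vectorizer"].vocabulary_)
predictions = pipeline.predict(text_test)
print(len(predictions), sorted(train_vocabulary))
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside every cross-validation fold.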

# Modelling 

Random forests are a well-known and strong baseline for many ML tasks,
therefore a random forest is the classifier of choice for this experiment.

In [59]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [62]:
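# Note: in scikit-learn "entropy" and "log_loss" both select the Shannon
# information gain, so this grid effectively varies only max_features.
parameters = {
    "criterion": ["entropy", "log_loss"],
    "max_features": ["sqrt", "log2"],
}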

In [63]:
RF_classifier = RandomForestClassifier(max_depth=200)

In [67]:
RF_classifier_hypertuned = GridSearchCV(RF_classifier, parameters,verbose=2)

In [68]:
RF_classifier_hypertuned.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END ...............criterion=entropy, max_features=sqrt; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=sqrt; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=sqrt; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=sqrt; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=sqrt; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=log2; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=log2; total time=   0.2s
[CV] END ...............criterion=entropy, max_features=log2; total time=   0.1s
[CV] END ...............criterion=entropy, max_features=log2; total time=   0.1s
[CV] END ...............criterion=entropy, max_features=log2; total time=   0.2s
[CV] END ..............criterion=log_loss, max_features=sqrt; total time=   0.2s
[CV] END ..............criterion=log_loss, max_fe

In [70]:
# Only the last expression of a cell is displayed, so best_score_ is not shown below
RF_classifier_hypertuned.best_score_
RF_classifier_hypertuned.best_params_

{'criterion': 'entropy', 'max_features': 'sqrt'}

In [71]:
from sklearn.metrics import classification_report

In [78]:
y_pred = RF_classifier_hypertuned.predict(X_test)

In [82]:
# target_names follow the sorted label order: 0 = unrelated to chess, 1 = about chess
print(classification_report(y_test, y_pred, target_names=["sentence unrelated to chess","sentence about chess"]))

                             precision    recall  f1-score   support

sentence unrelated to chess       1.00      1.00      1.00       884
       sentence about chess       1.00      1.00      1.00      1116

                   accuracy                           1.00      2000
                  macro avg       1.00      1.00      1.00      2000
               weighted avg       1.00      1.00      1.00      2000
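
`classification_report` assigns `target_names` to the labels in sorted order, so with labels 0 and 1 the first name describes class 0. The small sketch below uses made-up labels (not the notebook's predictions) to check which row belongs to which class via the `support` column.

```python
from sklearn.metrics import classification_report

# Made-up labels: 0 = unrelated to chess, 1 = about chess
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1]

report = classification_report(
    y_true_toy,
    y_pred_toy,
    labels=[0, 1],  # make the label order explicit
    target_names=["unrelated to chess", "about chess"],
    output_dict=True,
)

print(report["unrelated to chess"]["support"])  # 2 samples of class 0
print(report["about chess"]["support"])         # 3 samples of class 1
```

Comparing the support counts against the known class ratio is a quick sanity check that the names are attached to the intended classes.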



In [83]:
# Let's test the program

In [121]:
target = {0:"unrelated to chess", 1:"About chess"}
def classify_input(text):
    # Vectorize with the already fitted CountVectorizer, then predict the class
    transformed_input = vectorizer.transform([text])
    classified = RF_classifier_hypertuned.predict(transformed_input)
    print(classified)  # raw prediction array, e.g. [1]
    return target[classified[0]]

In [122]:
user_input = "Magnus carlsen plays e4 90% of his games and usually delays castling because this move can be tricky to deal with in the high elo levels."
classify_input(user_input)

[1]


'About chess'

In [123]:
second_user_input = "I would love to run a horse riding school busines, simply I love horses especially arabic horses. The true vibe comes when this is sheered with arabian poem, coffee and desert camping!" 
classify_input(second_user_input)

[1]


'About chess'

In [128]:
third_input = "Apple, Music" 
classify_input(third_input)

[0]


'unrelated to chess'

# Results

The model reaches 100% accuracy, and a perfect score on every other metric, even though the dataset contains no duplicate sentences, meaning it was trained on one set of sentences and tested on unseen ones. This suggests the model has overfitted to the generated data rather than learned something general. Looking back at the class ratio (55% chess related, 45% unrelated), the model does not appear to weigh how often a specific word occurs and decide the class from that. Instead it is fitted to the exact structure of the data: it has likely trained on something like "apple, music" and then been tested on "apple, music, sky". Put simply, the model looks for words such as "apple, music", and if they are not present it classifies the sentence as chess related, which is what happened with the horse riding sentence above.
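
One way to probe this behaviour is to check how much of an input the vectorizer actually recognizes. The sketch below uses a toy vocabulary (an assumption, not the fitted `vectorizer` from this notebook) and a hypothetical helper, `known_token_count`. A sentence whose words all fall outside the vocabulary becomes an all-zero vector, so the classifier has no evidence to base its decision on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary standing in for the generated dataset (assumption)
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["king rook pawn castling", "apple music sky river"])


def known_token_count(text, fitted_vectorizer):
    """Hypothetical helper: number of token occurrences found in the vocabulary."""
    return int(fitted_vectorizer.transform([text]).sum())


in_vocabulary = known_token_count("Apple, Music", toy_vectorizer)
out_of_vocabulary = known_token_count("I love arabic horses and coffee", toy_vectorizer)
print(in_vocabulary, out_of_vocabulary)  # 2 0
```

A count of zero could be used as a signal to answer "unknown" instead of forcing one of the two classes, although that is a design choice outside the scope of this notebook.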

In other words, despite reaching an unbelievably high accuracy, the model is unusable because no text processing techniques were applied. However, the aim of this notebook was to refresh our memory of the "Bag of Words" technique, which we successfully completed.