# Assignment 01: Data Science Programming: Data Preparation

## Source of Data: 
1. For this assignment, I have chosen the Chess (King-Rook vs. King) dataset [Source Link](https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29) from the UCI ML repository.

## Business Scenario 

I think this data presents an interesting case for analysis: we are trying to build a model that predicts, from the positions of the chess pieces on the board, whether White can win and in how many moves.
As humans, we may not have thought of every move that is possible with the pieces we have on the board. A model trained on this data can steer a player toward the checkmate strategy that needs the fewest moves. Additionally, it gives us an understanding of unique movement patterns that we might never think of, and it can assist a beginner in making the right moves toward a win.


## Data Profile: 
Below is the attribute information: 
Predictors: 
   1. White King file (column)
   2. White King rank (row)
   3. White Rook file
   4. White Rook rank
   5. Black King file
   6. Black King rank

Target: 
   7. optimal depth-of-win for White in 0 to 16 moves, otherwise drawn
               {draw, zero, one, two, ..., sixteen}.
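To make the six predictors concrete, here is a minimal sketch of how one record can be read as three board squares. The helper name `to_squares` and the zero-based tuple layout are illustrative assumptions, not part of the dataset or of this notebook.

```python
# Illustrative helper (hypothetical name): read one record as three squares.
FILES = "abcdefgh"  # chess files (columns), left to right


def to_squares(record):
    """Turn (wk_file, wk_rank, wr_file, wr_rank, bk_file, bk_rank) into
    zero-based (file_index, rank_index) pairs, one per piece."""
    pieces = ("white_king", "white_rook", "black_king")
    squares = {}
    for i, piece in enumerate(pieces):
        file_letter, rank = record[2 * i], record[2 * i + 1]
        squares[piece] = (FILES.index(file_letter), int(rank) - 1)
    return squares


# A record with the White King on a1, the White Rook on b3, the Black King on c2.
example = to_squares(("a", 1, "b", 3, "c", 2))
print(example)
```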


# Data Preparation:

In [5]:
# lib imports 

import pandas as pd 
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split


## Load source data

In [17]:
df = pd.read_csv('krkopt.csv')
df.head()

Unnamed: 0,white_king_file,white_king_rank,white_rook_file,white_rook_rank,black_king_file,black_king_rank,moves
0,a,1,b,3,c,2,draw
1,a,1,c,1,c,2,draw
2,a,1,c,1,d,1,draw
3,a,1,c,1,d,2,draw
4,a,1,c,2,c,1,draw


In [12]:
# dimensions [rows * columns]
df.shape

(28056, 7)

In [13]:
# info on columns and data points

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28056 entries, 0 to 28055
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   white_king_file  28056 non-null  object
 1   white_king_rank  28056 non-null  int64 
 2   white_rook_file  28056 non-null  object
 3   white_rook_rank  28056 non-null  int64 
 4   black_king_file  28056 non-null  object
 5   black_king_rank  28056 non-null  int64 
 6   moves            28056 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.5+ MB


### Observations on source data:  
1. file: the column on the chess board, from a to h.
2. rank: the row on the chess board, ranging from 1 to 8. These are discrete and finite numbers, so the column is categorical in nature.
3. moves: our target column, the number of moves needed to checkmate the black king.
   - a) This one is tricky, because we have both a zero value and a draw value.
   - b) zero means the given position on the board is already checkmate for the black king, so no further moves are needed. This is a less likely scenario.
   - c) draw can occur under many conditions, but for our data the feasible condition is a stalemate: if there is still no checkmate after sixteen consecutive moves, then it is a draw.
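The value ranges described above can be checked directly before any conversion. The following is a minimal sketch on a small hand-made frame that reuses the notebook's column naming; the helper `check_domains` and the sample rows are assumptions for illustration.

```python
import pandas as pd

VALID_FILES = set("abcdefgh")
VALID_RANKS = set(range(1, 9))


def check_domains(frame):
    """Return the names of columns holding values outside the expected domain."""
    bad = []
    for col in frame.columns:
        if col.endswith("_file"):
            valid = VALID_FILES
        elif col.endswith("_rank"):
            valid = VALID_RANKS
        else:
            continue  # skip anything else, e.g. the target column
        if not set(frame[col].unique()) <= valid:
            bad.append(col)
    return bad


sample = pd.DataFrame({
    "white_king_file": ["a", "a", "d"],
    "white_king_rank": [1, 1, 4],
    "black_king_file": ["c", "d", "z"],  # 'z' is deliberately invalid
    "black_king_rank": [2, 1, 8],
    "moves": ["draw", "draw", "zero"],
})
bad_columns = check_domains(sample)
print(bad_columns)
```

An empty list would indicate that every file and rank column stays inside its expected range.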



### Assumptions: 
1. Based on the above observations, all of our variables are categorical.
2. Based on the above observations on the target column, we assume that a 'zero' value is the ideal outcome and 'draw' is the worst outcome.

We will conduct our analysis based on these assumptions.

##  Data Imbalance

According to the dataset source, the classes of our target column are clearly imbalanced.
Class distribution:

    draw       2796
    zero         27
    one          78
    two         246
    three        81
    four        198
    five        471
    six         592
    seven       683
    eight      1433
    nine       1712
    ten        1985
    eleven     2854
    twelve     3597
    thirteen   4194
    fourteen   4553
    fifteen    2166
    sixteen     390

    Total     28056

Our dataset is fairly large, with around 28k records, so we can use undersampling to reduce the class imbalance.
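The size of the imbalance can be quantified from the counts listed above. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
# Class counts copied from the distribution listed above.
class_counts = {
    "draw": 2796, "zero": 27, "one": 78, "two": 246, "three": 81,
    "four": 198, "five": 471, "six": 592, "seven": 683, "eight": 1433,
    "nine": 1712, "ten": 1985, "eleven": 2854, "twelve": 3597,
    "thirteen": 4194, "fourteen": 4553, "fifteen": 2166, "sixteen": 390,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)
# Ratio between the most and the least frequent class.
imbalance_ratio = class_counts[largest] / class_counts[smallest]

print(f"total records: {total}")
print(f"largest class: {largest}, smallest class: {smallest}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```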

In [18]:
target = 'moves'
predictors = list(df.columns)
predictors.remove(target)

X = df[predictors]
y = df[[target]]
print(f"X dimensions : {X.shape}, y dimensions: {y.shape}")

X dimensions : (28056, 6), y dimensions: (28056, 1)


In [19]:
y.value_counts()

moves   
fourteen    4553
thirteen    4194
twelve      3597
eleven      2854
draw        2796
fifteen     2166
ten         1985
nine        1712
eight       1433
seven        683
six          592
five         471
sixteen      390
two          246
four         198
three         81
one           78
zero          27
dtype: int64

In [21]:


# 'majority' undersamples only the largest class (fourteen) down to the size of the smallest class
under_sampler = RandomUnderSampler(sampling_strategy='majority')
X_res, y_res = under_sampler.fit_resample(X, y)


In [22]:
y_res.value_counts()

moves   
thirteen    4194
twelve      3597
eleven      2854
draw        2796
fifteen     2166
ten         1985
nine        1712
eight       1433
seven        683
six          592
five         471
sixteen      390
two          246
four         198
three         81
one           78
fourteen      27
zero          27
dtype: int64
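The counts above show that `sampling_strategy='majority'` changed only the largest class (fourteen, now 27 rows) and left every other class untouched. If the intent is to shrink every class to the size of the smallest one, a plain-pandas sketch could look like the following. It runs on toy data, the helper `undersample_to_smallest` is hypothetical, and it is not what this notebook executed.

```python
import pandas as pd


def undersample_to_smallest(frame, target, seed=42):
    """Randomly keep n rows per class, where n is the smallest class size."""
    n = frame[target].value_counts().min()
    return (
        frame.groupby(target, group_keys=False)
        .sample(n=n, random_state=seed)
        .reset_index(drop=True)
    )


# Toy data: 6 'fourteen' rows, 3 'draw' rows, 2 'zero' rows.
toy = pd.DataFrame({
    "white_king_file": list("aabbccddabc"),
    "moves": ["fourteen"] * 6 + ["draw"] * 3 + ["zero"] * 2,
})
balanced = undersample_to_smallest(toy, "moves")
print(balanced["moves"].value_counts())
```

imbalanced-learn also documents `sampling_strategy='not minority'`, which resamples every class except the smallest. Balancing this dataset fully would leave only 27 rows per class, so the trade-off between balance and data volume deserves a deliberate choice.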

## Train-test split of source data before pre-processing

In [23]:
# train-test split 


X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, 
                                                    test_size=0.3, random_state=42)
y_test.head()


Unnamed: 0,moves
2734,draw
8393,fifteen
8618,fifteen
2111,draw
6538,eleven
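Because several classes are very small, a purely random split can leave some of them under-represented in the test set. Below is a minimal sketch of a stratified split using the `stratify` argument of `train_test_split`. The toy frame and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 40 rows, 30 of class 'draw' and 10 of class 'zero'.
toy_X = pd.DataFrame({"white_king_rank": list(range(1, 9)) * 5})
toy_y = pd.Series(["draw"] * 30 + ["zero"] * 10, name="moves")

# stratify keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Both splits keep the 3:1 ratio of the toy labels.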


## Type conversion to categorical

In [24]:
# convert categorical columns into pandas categories
# note: the dtype printed in each message is read after the conversion
for column in X_train.columns:
    X_train[column] = X_train[column].astype('category')
    print(f"Column: {column}({X_train[column].dtype}) -->  category ")
print()

for column in X_test.columns:
    X_test[column] = X_test[column].astype('category')
    print(f"Column: {column}({X_test[column].dtype}) -->  category ")
print()

y_train[target] = y_train[target].astype('category')
print(f"Column: {target}({y_train[target].dtype}) -->  category ")
print()

y_test[target] = y_test[target].astype('category')
print(f"Column: {target}({y_test[target].dtype}) -->  category ")
print()

Column: white_king_file(category) -->  category 
Column: white_king_rank(category) -->  category 
Column: white_rook_file(category) -->  category 
Column: white_rook_rank(category) -->  category 
Column: black_king_file(category) -->  category 
Column: black_king_rank(category) -->  category 

Column: white_king_file(category) -->  category 
Column: white_king_rank(category) -->  category 
Column: white_rook_file(category) -->  category 
Column: white_rook_rank(category) -->  category 
Column: black_king_file(category) -->  category 
Column: black_king_rank(category) -->  category 

Column: moves(category) -->  category 

Column: moves(category) -->  category 
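Converting the train and test frames independently lets pandas infer the categories from whatever values happen to be present in each frame. The sketch below assumes the full domains are known up front (files a to h, ranks 1 to 8) and fixes the categories with `CategoricalDtype` so that both splits share them. The toy frames are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Explicit domains, so the categories do not depend on the rows in a split.
file_dtype = CategoricalDtype(categories=list("abcdefgh"))
rank_dtype = CategoricalDtype(categories=list(range(1, 9)))

toy_train = pd.DataFrame({"black_king_file": ["a", "c", "h"],
                          "black_king_rank": [1, 2, 8]})
toy_test = pd.DataFrame({"black_king_file": ["b"],
                         "black_king_rank": [5]})

for frame in (toy_train, toy_test):
    frame["black_king_file"] = frame["black_king_file"].astype(file_dtype)
    frame["black_king_rank"] = frame["black_king_rank"].astype(rank_dtype)

same_categories = (
    list(toy_train["black_king_file"].cat.categories)
    == list(toy_test["black_king_file"].cat.categories)
)
print(same_categories)
```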



## Encoding categorical variables


In [25]:
# The categorical columns have a non-linear relationship among their classes.
# For example, the difference between king file positions a and b is not the same as between b and c,
# because one of those differences may matter more if it brings you closer to checkmate.

# Hence one-hot encoding seems suitable for our categorical columns.

# Label encoding is used for the target, to convert it into a numerical column, since its classes have a natural order.

# get_dummies performs the one-hot conversion

target = 'moves'
predictors = list(df.columns)[:-1]

df_encoded_predictors = pd.get_dummies(df[predictors], prefix_sep='_ohe_', drop_first=True)

X_train_encoded = pd.get_dummies(X_train, prefix_sep='_ohe_', drop_first=True)
X_test_encoded = pd.get_dummies(X_test, prefix_sep='_ohe_', drop_first=True)

# 'zero' is the best outcome; 'draw' gets the largest code as the worst outcome
label_map = {'draw': 18, 'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
             'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10,
             'eleven': 11, 'twelve': 12, 'thirteen': 13, 'fourteen': 14,
             'fifteen': 15, 'sixteen': 16}

y_train['moves'] = y_train['moves'].map(label_map)
y_test['moves'] = y_test['moves'].map(label_map)

# y_train = y_train.drop(columns=['moves'])
# y_test = y_test.drop(columns=['moves'])
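When `pd.get_dummies` is applied to the train and test frames separately, a level that is missing from one split produces a different set of columns. The sketch below uses toy data to align the test columns to the training columns with `DataFrame.reindex`. The frame names are illustrative and not part of this notebook.

```python
import pandas as pd

toy_train = pd.DataFrame({"white_rook_file": ["a", "b", "c"]})
toy_test = pd.DataFrame({"white_rook_file": ["a", "b"]})  # level 'c' never appears

train_enc = pd.get_dummies(toy_train, prefix_sep="_ohe_", drop_first=True)
test_enc = pd.get_dummies(toy_test, prefix_sep="_ohe_", drop_first=True)

# Add any column the test split is missing (filled with 0) and drop extras,
# so both frames end up with the same columns in the same order.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

A model fitted on the training columns can then receive the aligned test frame without a column mismatch.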


## Save Pre-processed data

In [26]:
# save to parquet format to preserve dtypes 

X_train_encoded.to_parquet('X_train_krk.parquet', engine='fastparquet', index=False)
X_test_encoded.to_parquet('X_test_krk.parquet', engine='fastparquet', index=False)
y_train.to_parquet('y_train_krk.parquet', engine='fastparquet', index=False)
y_test.to_parquet('y_test_krk.parquet', engine='fastparquet', index=False)