# Cuneiform Language Detection

## Model used
- Ensemble of `LogisticRegression` and `RandomForest` using `VotingClassifier`.

## Data Preprocessing
- Target labels were encoded using `LabelEncoder`.
- The cuneiform text was vectorized using `CountVectorizer`.
- The vectorized output was standardized using `StandardScaler` with `with_mean = False`, since centering is not supported for sparse matrices such as the one returned by `CountVectorizer`.
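
The need for `with_mean = False` can be checked on a toy input. The sketch below uses made-up strings rather than the competition data; it shows that scaling without centering keeps the matrix sparse, while asking for centering on a sparse matrix raises an error.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Toy documents (illustrative only, not taken from the dataset)
toy_docs = ["abca", "bcd", "aab"]

counts = CountVectorizer(lowercase=False, analyzer="char").fit_transform(toy_docs)

# Dividing by the standard deviation alone preserves zeros, so the output stays sparse
scaled = StandardScaler(with_mean=False).fit_transform(counts)
print(sparse.issparse(counts), sparse.issparse(scaled))

# Centering would turn zeros into non-zeros, so it is rejected for sparse input
try:
    StandardScaler(with_mean=True).fit(counts)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print(err)
```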

## Hyperparameter Tuning and Cross Validation
- Hyperparameters were tuned using `RandomizedSearchCV` with `RepeatedStratifiedKFold` with 5 folds and 2 repeats for cross validation.

## Metrics and Evaluation
- The best model was selected based on the `balanced_accuracy` metric.
- Other metrics, namely `accuracy` and weighted one-vs-rest ROC-AUC (`roc_auc_ovr_weighted`), were also recorded.
- A confusion matrix was plotted for predictions over the entire dataset, which is the same data the model was fitted on.
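
To illustrate why `balanced_accuracy` can be a more informative selection metric than `accuracy` when classes are imbalanced, here is a minimal sketch on hypothetical labels (not the notebook's data). A model that always predicts the majority class scores well on accuracy but only at chance level on balanced accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1
y_true = [0] * 8 + [1] * 2
# A degenerate model that always predicts the majority class
y_pred = [0] * 10

# Accuracy: 8 of 10 predictions are correct
acc = accuracy_score(y_true, y_pred)
# Balanced accuracy: mean of per-class recall, here (1.0 + 0.0) / 2
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(acc, bal_acc)
```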

## Outputs
- **predictions.csv**: CSV of the original data with predictions.
- **grid_cv_results.csv**: Results of `RandomizedSearchCV` with cross validation.
- **confusion_matrix.jpg**: Plot of the confusion matrix over the entire dataset.

## Install a pinned version of scikit-learn (0.24.2)

In [None]:
!pip install scikit-learn==0.24.2

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from scipy.stats import loguniform  # available in SciPy >= 1.4; avoids scikit-learn's internal compatibility module
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

## Constants

In [None]:
RANDOM_STATE = 7
GRID_SEARCH_ITER = 30

## Data Loading

In [None]:
df = pd.read_csv(r'../input/cuneiform-language-identification/train.csv')
df.head()

## Data Preprocessing and Modelling

### `LabelEncoder` for encoding the target classes as integers

In [None]:
le = LabelEncoder()
df['enc_lang'] = le.fit_transform(df['lang'])
df[['lang', 'enc_lang']].sample(10)
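
The encoder can also map integer predictions back to the original labels. The following is a small sketch with hypothetical language codes (`lang_a`, `lang_b`, `lang_c` are placeholders, not the dataset's classes).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical language codes, used only for illustration
toy_langs = ["lang_b", "lang_a", "lang_c", "lang_a"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_langs)

# Classes are stored in sorted order; each integer is an index into classes_
print(list(toy_le.classes_))
print(list(encoded))

# inverse_transform maps integers (e.g. model predictions) back to the labels
decoded = toy_le.inverse_transform(encoded)
print(list(decoded))
```

In this notebook the same call would presumably be `le.inverse_transform(...)` applied to the integer predictions.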

### `RepeatedStratifiedKFold` for cross validation

In [None]:
rskf = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 2, random_state = RANDOM_STATE)
rskf
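
As a quick check on the cost of this scheme, the sketch below counts the splits it produces and the resulting number of pipeline fits, assuming the 30 search iterations set in `GRID_SEARCH_ITER`.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

toy_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=7)

# 5 folds repeated 2 times gives 10 train/validation splits per candidate
n_splits_total = toy_cv.get_n_splits()

# With 30 sampled candidates, the search fits 30 * 10 pipelines,
# plus one final refit of the best candidate on the full data
n_candidates = 30
n_fits = n_candidates * n_splits_total
print(n_splits_total, n_fits)
```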

### Count vectorization for the text data at the character level
There are about 550 unique characters across the entire dataset.

In [None]:
vectorizer = CountVectorizer(lowercase = False, analyzer = 'char')
vectorizer
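
To make the character-level counting concrete, here is a minimal sketch on toy ASCII strings standing in for cuneiform text. Each distinct character becomes one column, and case is preserved because `lowercase` is disabled.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy strings standing in for cuneiform text (illustrative only)
toy_docs = ["abA", "bbc"]

toy_vectorizer = CountVectorizer(lowercase=False, analyzer="char")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# One column per distinct character; 'A' and 'a' stay separate
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# Rows are documents, columns follow the sorted vocabulary
print(toy_counts.toarray())
```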

### `StandardScaler` for scaling the data, since the `saga` solver used in `LogisticRegression` converges faster on features of similar scale

In [None]:
scaler = StandardScaler(with_mean = False)
scaler

### Pipeline with the `CountVectorizer`, `StandardScaler` and `VotingClassifier` model

In [None]:
logreg = LogisticRegression(solver = 'saga', random_state = RANDOM_STATE)
forest = RandomForestClassifier(random_state = RANDOM_STATE)
vote_clf = VotingClassifier(estimators = [('logreg', logreg), ('forest', forest)], voting = "soft")
vote_clf
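
With `voting = "soft"`, the ensemble averages the class probabilities of its members, optionally weighted, and predicts the class with the highest average. The sketch below reproduces that by hand on synthetic data; the data, the weights `[2, 1]` and the member settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the mechanics
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

toy_vote = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    voting="soft",
    weights=[2, 1],
)
toy_vote.fit(X_toy, y_toy)

# Weighted average of the fitted members' class probabilities
member_probas = [est.predict_proba(X_toy) for est in toy_vote.estimators_]
manual_proba = np.average(member_probas, axis=0, weights=[2, 1])

# The predicted class is the argmax of the averaged probabilities
manual_pred = manual_proba.argmax(axis=1)
matches = np.allclose(manual_proba, toy_vote.predict_proba(X_toy))
print(matches)
```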

In [None]:
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('scaler', scaler),
    ('vote_clf', vote_clf)
])

pipeline

### Get list of all configurable parameters for hyperparameter tuning for the pipeline

In [None]:
pipeline.get_params().keys()
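
The keys returned above follow the `<step name>__<parameter name>` convention, with further `__` levels for nested estimators. A minimal sketch on a stand-in pipeline (the step names `scaler` and `clf` are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A small stand-in pipeline; step names here are illustrative
toy_pipeline = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("clf", LogisticRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>
param_names = toy_pipeline.get_params().keys()
print("clf__C" in param_names, "scaler__with_mean" in param_names)

# The same keys can be passed to set_params to change a nested value
toy_pipeline.set_params(clf__C=10.0)
print(toy_pipeline.named_steps["clf"].C)
```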

### Hyperparameter tuning

In [None]:
params = {
    'vectorizer__binary': [True, False],
    'vote_clf__weights': [[1, 1], [2, 1], [1, 2]],
    'vote_clf__logreg__C': loguniform(1e0, 1e3),
    'vote_clf__logreg__class_weight': ['balanced', None],
    'vote_clf__logreg__max_iter': [200, 300],
    'vote_clf__forest__n_estimators': [50, 100, 150, 200],
    'vote_clf__forest__criterion': ['gini', 'entropy'],
    'vote_clf__forest__max_depth': [20, 40, 60],
    'vote_clf__forest__max_features': ['sqrt', 'log2'],
    'vote_clf__forest__class_weight': ['balanced', 'balanced_subsample', None],
    'vote_clf__forest__max_samples': [0.4, 0.6, 0.8]
}

scorers = {'Weighted ROC-AUC': 'roc_auc_ovr_weighted', 'Accuracy': 'accuracy', 'Balanced Accuracy': 'balanced_accuracy'}

model = RandomizedSearchCV(
    pipeline, 
    params, 
    n_iter = GRID_SEARCH_ITER, 
    cv = rskf, 
    n_jobs = 4, 
    scoring = scorers, 
    refit = 'Balanced Accuracy',
    random_state = RANDOM_STATE, 
    verbose = 1
)

model
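
The `loguniform(1e0, 1e3)` distribution used for `C` spreads samples evenly across orders of magnitude rather than evenly across raw values. The sketch below, which assumes SciPy's `scipy.stats.loguniform`, draws samples and counts how many land in each decade.

```python
import numpy as np
from scipy.stats import loguniform

# Same range as the search space for C above
c_dist = loguniform(1e0, 1e3)
c_samples = c_dist.rvs(size=3000, random_state=7)

# Count samples per decade: [1, 10), [10, 100), [100, 1000]
decade_counts, _ = np.histogram(np.log10(c_samples), bins=[0, 1, 2, 3])

# The three counts should be roughly equal
print(decade_counts)
```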

### Model training

In [None]:
%%time

model.fit(df['cuneiform'], df['enc_lang'])

results = pd.DataFrame(model.cv_results_).sort_values('mean_test_Balanced Accuracy', ascending = False)
results.head()

## Model Predictions and Evaluation

### Predictions

In [None]:
df['pred'] = model.predict(df['cuneiform'])
df.sample(5)

### Best parameters and score

In [None]:
model.best_params_

In [None]:
model.best_score_
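
With several scorers, `best_score_` refers to the metric named in `refit`, and `best_index_` points at the matching row of `cv_results_`. The sketch below checks this on a tiny synthetic search; the data, estimator and search space are assumptions chosen to keep it fast.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data and a tiny search, used only to illustrate the attributes
X_toy, y_toy = make_classification(n_samples=150, n_features=8, random_state=0)

toy_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-2, 1e2)},
    n_iter=4,
    cv=3,
    scoring={"Accuracy": "accuracy", "Balanced Accuracy": "balanced_accuracy"},
    refit="Balanced Accuracy",
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# best_index_ selects the row of cv_results_ with the best refit metric
mean_scores = toy_search.cv_results_["mean_test_Balanced Accuracy"]
best_row_score = mean_scores[toy_search.best_index_]
print(np.isclose(best_row_score, toy_search.best_score_))
```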

## Outputs

### Confusion Matrix

In [None]:
_, ax = plt.subplots(figsize = (16, 12))

ConfusionMatrixDisplay(
    confusion_matrix(df['enc_lang'], df['pred'], labels = range(len(le.classes_))), display_labels = le.classes_
).plot(ax = ax, xticks_rotation = 'vertical')
                        
plt.savefig('confusion_matrix.jpg', dpi = 200, bbox_inches = 'tight')
plt.show()
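
Because the matrix above is computed on the rows the model was fitted on, it will tend to look optimistic. One possible alternative, not used in this notebook, is to build the matrix from out-of-fold predictions. The sketch below shows the idea on synthetic data with a stand-in classifier; in the notebook the inputs would presumably be `df['cuneiform']` and `df['enc_lang']` with the tuned pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (not the competition data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

toy_clf = RandomForestClassifier(n_estimators=20, random_state=0)

# cross_val_predict needs every sample in exactly one test fold, so a plain
# StratifiedKFold is used here rather than the repeated variant
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row is predicted by a model that did not see it during fitting
oof_pred = cross_val_predict(toy_clf, X_toy, y_toy, cv=skf)

oof_cm = confusion_matrix(y_toy, oof_pred, labels=[0, 1, 2])
print(oof_cm)
```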

### Predictions and results of hyperparameter tuning

In [None]:
results.to_csv("grid_cv_results.csv", index = False)
df.to_csv("predictions.csv", index = False)