## LeetCode Analysis

I have recently been applying for jobs at some large companies, and algorithm questions are an essential part of their interviews. LeetCode is the best-known platform for programmers to practice algorithms, yet I found little information about it on Kaggle. So I would like to analyze the LeetCode questions, which may help me prepare better for the interviews.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
leetcode = pd.read_csv("/kaggle/input/leetcode/leetcode.csv")
leetcode.head()

### Total Number of Questions

In [1]:
len(leetcode)

## How many questions are paid?

In [1]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally as x.
sns.countplot(x=leetcode["isPaid"]);


## Difficulty of LeetCode Questions

In [1]:
sns.countplot(x=leetcode["difficulty"]);

In [1]:
# Paid vs. free questions within each difficulty level.
sns.countplot(x="difficulty", hue="isPaid", data=leetcode);

## Calculate Accept Rate

In [1]:
leetcode["accept_rate"] = leetcode['total Accepted'] / leetcode['total Submitted']
leetcode.head()

### Accept rate for different difficulties

In [1]:
leetcode.groupby("difficulty")\
.accept_rate.mean().plot(kind="bar")
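
The bar chart above averages the per-question accept rates, so every question counts equally no matter how many submissions it received. A pooled rate (total accepted divided by total submitted within each difficulty) weights every submission equally instead, and the two can differ a lot. The sketch below contrasts them on a made-up toy frame; the numbers are invented for illustration and only the column names mirror the notebook.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the notebook's.
toy = pd.DataFrame({
    "difficulty": ["Easy", "Easy", "Hard", "Hard"],
    "total Accepted": [90, 100, 5, 300],
    "total Submitted": [100, 1000, 10, 1000],
})
toy["accept_rate"] = toy["total Accepted"] / toy["total Submitted"]

# Mean of per-question rates: every question counts equally.
mean_of_rates = toy.groupby("difficulty")["accept_rate"].mean()

# Pooled rate: every submission counts equally.
sums = toy.groupby("difficulty")[["total Accepted", "total Submitted"]].sum()
pooled_rate = sums["total Accepted"] / sums["total Submitted"]

print(pd.DataFrame({"mean_of_rates": mean_of_rates, "pooled_rate": pooled_rate}))
```

Which one is more useful depends on the question being asked: the mean of rates describes a typical question, while the pooled rate describes a typical submission.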

### Average Submissions for different difficulties

In [1]:
leetcode.groupby("difficulty")['total Submitted']\
.mean().plot(kind="bar")

### Average Accepted Submissions for different difficulties

In [1]:
leetcode.groupby("difficulty")['total Accepted']\
.mean().plot(kind="bar")

### Calculate Title Length

In [1]:
leetcode["title_len"] = leetcode["title"].apply(len)
leetcode.head()

### Title Length for different difficulties

In [1]:
leetcode.groupby("difficulty")['title_len']\
.mean().plot(kind="bar")

In [1]:
leetcode.describe()

## Modeling
Now I want to build a model that predicts a question's difficulty from the other columns, such as the total accepted submissions, the total submitted submissions, the accept rate, the title length, and whether the question is paid.

### Using CatBoost

In [1]:
from catboost import CatBoostClassifier

In [1]:
from sklearn.model_selection import train_test_split
train, val = train_test_split(leetcode, test_size=0.1, random_state=42)

In [1]:
train.head()

In [1]:
featrue_columns = ["total Accepted", 'total Submitted', "accept_rate", "title_len", "isPaid"]
label_column = 'difficulty'
cat_params = {
    "verbose": 1000,
    "learning_rate": 0.01, 
    "od_wait": 1000,
    "l2_leaf_reg": 10,
    "iterations": 10000,
    "eval_metric": "Accuracy"
}
model = CatBoostClassifier(**cat_params)
model.fit(train[featrue_columns], train[label_column], eval_set=(val[featrue_columns], val[label_column]))
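
An accuracy number is easier to read next to a trivial baseline. One common choice, sketched here, is to always predict the most frequent training label and measure how often that is right on the validation split. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical and not part of the notebook.

```python
import pandas as pd


def majority_baseline_accuracy(train_labels: pd.Series, val_labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label = train_labels.value_counts().idxmax()
    return float((val_labels == majority_label).mean())


# Toy labels, invented for illustration.
toy_train_labels = pd.Series(["Medium", "Medium", "Easy", "Hard", "Medium"])
toy_val_labels = pd.Series(["Medium", "Easy", "Medium", "Hard"])

baseline = majority_baseline_accuracy(toy_train_labels, toy_val_labels)
print(baseline)
```

With the notebook's data this would presumably be called as `majority_baseline_accuracy(train[label_column], val[label_column])`; a classifier that does not beat this value has learned little from the features.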

### Using TensorFlow

In [1]:
import tensorflow as tf
from tensorflow import keras

### Text Vectorization

In [1]:
vocab_size = 1500
sequence_length = 32
vectorizer = keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_sequence_length=sequence_length,
    standardize="lower_and_strip_punctuation",
)
vectorizer.adapt(leetcode["title"]);

In [1]:
vectorizer(["hello world"])

In [1]:
len(vectorizer.get_vocabulary())
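
To make the vectorization step less of a black box, here is a plain-Python sketch of the same idea: lowercase the text, strip punctuation, split on whitespace, map each token to an integer id, and pad or truncate to a fixed length. It is an illustration and not Keras's implementation. It follows the same convention of reserving id 0 for padding and id 1 for out-of-vocabulary tokens, but the alphabetical tie-breaking and the helper names (`standardize`, `build_vocab`, `encode`) are assumptions of this sketch.

```python
import string

PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for tokens outside the vocabulary


def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(titles, max_tokens):
    """Map the most frequent tokens to ids, starting at 2."""
    counts = {}
    for title in titles:
        for token in standardize(title).split():
            counts[token] = counts.get(token, 0) + 1
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx + 2 for idx, tok in enumerate(ordered[: max_tokens - 2])}


def encode(text, vocab, sequence_length):
    """Turn a text into a fixed-length list of token ids."""
    ids = [vocab.get(tok, OOV_ID) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


toy_titles = ["Two Sum", "Three Sum", "Add Two Numbers"]
toy_vocab = build_vocab(toy_titles, max_tokens=10)
print(encode("Two Sum!", toy_vocab, sequence_length=4))
print(encode("hello world", toy_vocab, sequence_length=4))
```

Words that never appear in the adapted titles, such as "hello" and "world" here, all collapse to the same out-of-vocabulary id, so the model cannot tell them apart.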

### Tabular data preprocessing

In [1]:
# Copy the slice so that later column assignments do not touch (or warn about) `train`.
train_features = train[featrue_columns].copy()
train_features.head()

In [1]:
# Copy the slice so that later column assignments do not touch (or warn about) `val`.
val_features = val[featrue_columns].copy()
val_features.head()

In [1]:
for column in featrue_columns:
    if column == "isPaid":
        # Encode the paid flag as 1.0 / 0.0.
        train_features[column] = (train_features[column] == True).astype(float)
        val_features[column] = (val_features[column] == True).astype(float)
    else:
        # Standardize with the mean and std of the training split only,
        # so that no information from the validation split leaks in.
        mean_value = train_features[column].mean()
        std_value = train_features[column].std()
        train_features[column] = (train_features[column] - mean_value) / std_value
        val_features[column] = (val_features[column] - mean_value) / std_value
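
The scaling step computes the mean and standard deviation on the training split and reuses them for the validation split. A small worked example makes the effect concrete. The helper `standardize_with_train_stats` and the toy frames are hypothetical and exist only for this illustration.

```python
import pandas as pd


def standardize_with_train_stats(train_df, val_df, columns):
    """Z-score the given columns using statistics from train_df only."""
    train_out = train_df.copy()
    val_out = val_df.copy()
    for column in columns:
        mean_value = train_df[column].mean()
        std_value = train_df[column].std()  # sample standard deviation (ddof=1)
        train_out[column] = (train_df[column] - mean_value) / std_value
        val_out[column] = (val_df[column] - mean_value) / std_value
    return train_out, val_out


# Toy data, invented for illustration: training mean is 20 and std is 10.
toy_train = pd.DataFrame({"title_len": [10.0, 20.0, 30.0]})
toy_val = pd.DataFrame({"title_len": [20.0, 40.0]})

scaled_train, scaled_val = standardize_with_train_stats(toy_train, toy_val, ["title_len"])
print(scaled_train["title_len"].tolist())
print(scaled_val["title_len"].tolist())
```

The scaled validation column does not have zero mean, and that is expected: it is expressed in the training split's units, which is what the model will see at prediction time.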

In [1]:
train_features.head()

In [1]:
val_features.head()

### Build the Model

In [1]:

def get_model():
    tabular_inputs = keras.Input((val_features.shape[1], ))
    text_inputs = keras.Input((1, ), dtype="string")
    tabular_model = keras.Sequential([
        tabular_inputs,
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(32, activation="relu")
    ])
    text_model = keras.Sequential([
        text_inputs,
        vectorizer,
        keras.layers.Embedding(vocab_size, 64),
        keras.layers.LSTM(16, recurrent_dropout=0.2, return_sequences=True),
        keras.layers.LSTM(16, recurrent_dropout=0.2),
        keras.layers.Dense(16, activation="relu"),
    ])
    x = keras.layers.Concatenate()([tabular_model.output, text_model.output])
    x = keras.layers.Dense(64, activation="relu")(x)
    x = keras.layers.Dropout(0.3)(x)
    output =  keras.layers.Dense(3, activation="softmax")(x)
    model = keras.Model(inputs=[tabular_inputs, text_inputs], outputs=[output])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model, tabular_model, text_model

In [1]:
model, _, text_model  = get_model()
model.summary()

In [1]:
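# One-hot encode the labels as floats (newer pandas versions default to booleans).
train_target = pd.get_dummies(train["difficulty"], dtype="float32")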
train_target.head()

In [1]:
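# One-hot encode the labels as floats (newer pandas versions default to booleans).
val_target = pd.get_dummies(val["difficulty"], dtype="float32")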
val_target.head()
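
Calling `pd.get_dummies` separately on each split only yields matching columns if every class occurs in both splits. With a 10% validation split that is likely but not guaranteed. A minimal sketch of a more defensive encoding fixes the category list up front. The label values in `CLASSES` are an assumption about the dataset, and the helper `one_hot` is hypothetical.

```python
import pandas as pd

# Assumed label values; adjust to whatever the "difficulty" column actually contains.
CLASSES = ["Easy", "Medium", "Hard"]


def one_hot(labels: pd.Series) -> pd.DataFrame:
    """One-hot encode labels with a fixed column order, even for absent classes."""
    categorical = labels.astype(pd.CategoricalDtype(categories=CLASSES))
    return pd.get_dummies(categorical, dtype="float32")


# Toy validation labels with no "Hard" question, invented for illustration.
toy_val_labels = pd.Series(["Easy", "Medium", "Easy"])
encoded = one_hot(toy_val_labels)
print(encoded)
```

The "Hard" column is still present and filled with zeros, so the target always has three columns in the same order as the softmax output expects.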

## Training

In [1]:
model_path = "model.tf"
checkpoint = keras.callbacks.ModelCheckpoint(model_path, monitor="val_accuracy", save_best_only=True, save_weights_only=True) 
model.fit(
    x=[train_features[featrue_columns], train["title"]], y=train_target, epochs=10, 
    validation_data=([val_features[featrue_columns], val["title"]], val_target),
    callbacks=[checkpoint]
)

In [1]:
model.load_weights(model_path)

## Training CatBoost with Word Vector Information

In [1]:
train_text_vector = text_model(train["title"])
val_text_vector = text_model(val["title"])
train_text_vector.shape, val_text_vector.shape

In [1]:
new_train_features = np.concatenate([train_text_vector.numpy(), train_features], axis=-1)
new_val_features = np.concatenate([val_text_vector.numpy(), val_features], axis=-1)
new_val_features.shape, new_train_features.shape

In [1]:
model = CatBoostClassifier(**cat_params)
model.fit(new_train_features, train[label_column], eval_set=(new_val_features, val[label_column]))