# Introduction

This is my solution for the [StumbleUpon Evergreen Classification Challenge](https://www.kaggle.com/c/stumbleupon) on Kaggle. In this challenge, the task is to predict whether or not a given site will remain relevant in the future. In other words, we are required to predict whether or not a site will be "evergreen". To do this, we are provided with the text of the page at the given URL and various other meta-data features. Thus, it is a text classification problem, and in this notebook I have approached it with a feedforward neural network that has multiple inputs to handle the meta-data and text features separately.

# Part 1: Exploratory data analysis and data cleaning

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


pd.set_option("display.max_columns", None)

In [None]:
train_df = pd.read_csv("../input/stumbleupon/train.tsv", delimiter="\t")
test_df = pd.read_csv("../input/stumbleupon/test.tsv", delimiter="\t")
sub_df = pd.read_csv("../input/stumbleupon/sampleSubmission.csv")

train_df.head()

Separating the **label** column from the train set.

In [None]:
target = train_df["label"]
train_df.drop(["label"], axis=1, inplace=True)

**urlid** only assigns an ID to each row, so dropping it from both train and test sets.

In [None]:
train_df.drop(["urlid"], axis=1, inplace=True)
test_df.drop(["urlid"], axis=1, inplace=True)

Extracting the site name from the site URL and changing the column name from **url** to **site_name**.

In [None]:
def get_domain(url):
    temp = url.split("/")
    if temp[0][:4] == "http":
        temp = temp[2]
    else:
        temp = temp[0]

    temp = temp.split(".")
    if temp[0][:3] == "www":
        temp = temp[1]
    else:
        temp = temp[0]

    return temp


train_df["url"] = train_df["url"].apply(get_domain)
test_df["url"] = test_df["url"].apply(get_domain)

train_df.rename(columns={"url": "site_name"}, inplace=True)
test_df.rename(columns={"url": "site_name"}, inplace=True)
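To sanity-check the extraction logic, here is a small sketch of the same idea built on the standard library's `urllib.parse.urlparse`. The helper name `extract_site_name` and the sample URLs are made up for illustration; the notebook itself uses `get_domain` as defined above.

```python
from urllib.parse import urlparse


def extract_site_name(url):
    """Hypothetical helper: return the first host label, skipping a leading 'www'."""
    # urlparse only fills in netloc when a scheme separator is present,
    # so scheme-less inputs get a "//" prefix first.
    netloc = urlparse(url if "://" in url else "//" + url).netloc
    labels = netloc.split(".")
    if labels[0].startswith("www") and len(labels) > 1:
        return labels[1]
    return labels[0]


# Made-up URLs covering the with-scheme, with-www and scheme-less cases.
for url in [
    "http://www.bloomberg.com/news/some-article",
    "https://blog.example.com/post/1",
    "example.org/page",
]:
    print(url, "->", extract_site_name(url))
```

Note that, like `get_domain`, this keeps only the first label of the host, so a subdomain such as `blog.example.com` is reduced to `blog`.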

Many sites occur fewer than 5 times in the train set. Grouping these sites into the "other" category. Sites in the test set that were not seen in the train set are also mapped to "other".

In [None]:
from sklearn.impute import SimpleImputer


site = train_df["site_name"].value_counts()
site_map = pd.Series(site.index, index=site.index)

site_map[site < 5] = "other"
del site

train_df["site_name"] = train_df["site_name"].map(site_map)
test_df["site_name"] = test_df["site_name"].map(site_map)
del site_map

test_df["site_name"] = SimpleImputer(strategy="constant", fill_value="other").fit_transform(test_df[["site_name"]])
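The grouping step can be tried out on a toy example. This is only a sketch with made-up site names and an alternative formulation using `Series.where` and `isin`; it is meant to show the effect of the threshold, not to replace the cell above.

```python
import pandas as pd

# Made-up site names: "a" and "b" are frequent, "c" and "d" are rare.
toy_train_sites = pd.Series(["a"] * 6 + ["b"] * 5 + ["c"] * 2 + ["d"])
toy_test_sites = pd.Series(["a", "c", "zzz"])  # "zzz" never appears in train

# Keep only the sites that occur at least 5 times in the toy train set.
counts = toy_train_sites.value_counts()
frequent = counts[counts >= 5].index

# Everything else, including unseen test sites, becomes "other".
grouped_train = toy_train_sites.where(toy_train_sites.isin(frequent), "other")
grouped_test = toy_test_sites.where(toy_test_sites.isin(frequent), "other")

print(grouped_train.value_counts())
print(grouped_test.tolist())
```

Because the list of frequent sites comes from the train set only, rare and unseen sites in the test set fall into the same bucket.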

Imputing missing values in **alchemy_category_score** using SimpleImputer with median strategy.

In [None]:
imputer = SimpleImputer(strategy="median")

# "?" marks a missing score; convert it to NaN so the column can be cast to float.
train_df["alchemy_category_score"] = train_df["alchemy_category_score"].replace({"?": np.nan}).astype(float)
test_df["alchemy_category_score"] = test_df["alchemy_category_score"].replace({"?": np.nan}).astype(float)

# Fit the median on the train set only and reuse it for the test set.
train_df["alchemy_category_score"] = imputer.fit_transform(train_df[["alchemy_category_score"]]).ravel()
test_df["alchemy_category_score"] = imputer.transform(test_df[["alchemy_category_score"]]).ravel()
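As a toy illustration of median imputation, the sketch below learns the median on one small frame and reuses it on another. The frames `toy_train` and `toy_test` and their values are invented for the example, and `pd.to_numeric` is used here in place of the `replace` call from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented scores where "?" marks a missing value.
toy_train = pd.DataFrame({"score": ["0.2", "?", "0.8", "0.4"]})
toy_test = pd.DataFrame({"score": ["?", "0.9"]})

# Turn "?" into NaN while converting the column to float.
for frame in (toy_train, toy_test):
    frame["score"] = pd.to_numeric(frame["score"], errors="coerce")

# The median is learned from the toy train frame and applied to both frames.
toy_imputer = SimpleImputer(strategy="median")
train_filled = toy_imputer.fit_transform(toy_train[["score"]]).ravel()
test_filled = toy_imputer.transform(toy_test[["score"]]).ravel()

print(np.round(train_filled, 2))
print(np.round(test_filled, 2))
```

Here the missing entries in both frames are filled with the median of the observed train scores.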

Analyzing the missing value count and dtype of various columns in train and test sets.

In [None]:
train_df.info()

In [None]:
test_df.info()

Calculating the number of unique values of various columns in train and test sets.

In [None]:
pd.DataFrame({
    "Train": train_df[test_df.columns].nunique(),
    "Test": test_df.nunique()
})
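A table of unique-value counts is also a quick way to spot constant columns. The sketch below shows the idea on a made-up frame; the column names are borrowed from the dataset but the values are invented.

```python
import pandas as pd

# Invented values: "framebased" is constant, the other columns are not.
toy = pd.DataFrame({
    "framebased": [0, 0, 0, 0],
    "is_news": ["1", "?", "1", "?"],
    "html_ratio": [0.21, 0.35, 0.18, 0.40],
})

# A column with a single unique value has zero variance and carries no signal.
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

print(unique_counts)
print("Constant columns:", constant_cols)
```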

**framebased** has zero variance, so dropping it from both train and test sets. Also casting the categorical columns to strings ("object" dtype) and the continuous columns to floats.

In [None]:
train_df.drop(["framebased"], axis=1, inplace=True)
test_df.drop(["framebased"], axis=1, inplace=True)

cat_cols = [
    "site_name",
    "alchemy_category",
    "hasDomainLink",
    "is_news",
    "lengthyLinkDomain",
    "news_front_page",
    "numwords_in_url"
]
# Keep the original column order so that the plots below are reproducible.
cont_cols = [col for col in test_df.columns if col not in cat_cols + ["boilerplate"]]

cat_cols = np.array(cat_cols)
cont_cols = np.array(cont_cols)

train_df[cat_cols] = train_df[cat_cols].astype(str)
test_df[cat_cols] = test_df[cat_cols].astype(str)

train_df[cont_cols] = train_df[cont_cols].astype(float)
test_df[cont_cols] = test_df[cont_cols].astype(float)

Some basic statistical parameters for numerical columns in train and test sets.

In [None]:
train_df[cont_cols].describe()

In [None]:
test_df[cont_cols].describe()

Plotting histograms for all the numerical columns in the train set.

In [None]:
fig, ax = plt.subplots(nrows=6, ncols=3, figsize=(20, 20))
ax = np.array(ax).ravel()

for i, col in enumerate(cont_cols):
    ax[i].hist(train_df[col], bins=50)
    ax[i].set_title(col)
    ax[i].grid(True)

fig.savefig("feature_histogram.jpeg")
plt.show()

# Part 2: Feature engineering categorical and text features

**site_name** has very high cardinality, so encoding it with a binary encoding scheme. The binary flag columns are cast to integers, and the remaining categorical columns are one-hot encoded.

In [None]:
from category_encoders.one_hot import OneHotEncoder
from category_encoders.binary import BinaryEncoder


train_df[["hasDomainLink", "lengthyLinkDomain"]] = train_df[["hasDomainLink", "lengthyLinkDomain"]].astype(int)
test_df[["hasDomainLink", "lengthyLinkDomain"]] = test_df[["hasDomainLink", "lengthyLinkDomain"]].astype(int)

# "?" means the page is not flagged as news.
train_df["is_news"] = train_df["is_news"].map({"1": 1, "?": 0})
test_df["is_news"] = test_df["is_news"].map({"1": 1, "?": 0})

onehot_enc = OneHotEncoder(cols=["alchemy_category", "news_front_page", "numwords_in_url"])
train_df = onehot_enc.fit_transform(train_df)
test_df = onehot_enc.transform(test_df)

binary_enc = BinaryEncoder(cols=["site_name"])
train_df = binary_enc.fit_transform(train_df)
# Reuse the encoder fitted on the train set so that each site gets the same
# binary code, and the same set of columns, in both train and test sets.
test_df = binary_enc.transform(test_df)
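To make the idea behind binary encoding concrete, here is a hand-rolled sketch that turns each category into the bits of an ordinal code. The function `binary_encode` and the site names are hypothetical, and the exact codes and column layout produced by `category_encoders` may differ from this illustration.

```python
import pandas as pd


def binary_encode(values, categories, prefix="site_name"):
    """Hand-rolled illustration: map each category to the bits of its ordinal code."""
    # Codes start at 1 so that 0 (all bits off) is left for unseen categories.
    codes = {category: i + 1 for i, category in enumerate(categories)}
    width = len(categories).bit_length()  # bits needed for the largest code
    rows = []
    for value in values:
        code = codes.get(value, 0)
        rows.append([(code >> bit) & 1 for bit in reversed(range(width))])
    return pd.DataFrame(rows, columns=[f"{prefix}_{i}" for i in range(width)])


# Made-up categories learned from a "train" set.
toy_sites = ["bbc", "cnn", "other", "nytimes", "wired"]
encoded = binary_encode(["bbc", "wired", "unseen_site"], toy_sites)
print(encoded)
```

Five categories fit into three columns here, whereas one-hot encoding would need five. The gap grows quickly with cardinality, which is why the binary scheme suits **site_name**.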

Cleaning the boilerplate data. Every row is a JSON string (similar to a Python dictionary) with the keys *title*, *body* and *url*. Extracting the text from all the keys.

In [None]:
from json import loads


def get_text(bp):
    text_dict = loads(bp)

    # Concatenate every non-null field (title, body, url) into one string.
    text = ""
    for value in text_dict.values():
        if value is not None:
            text = text + " " + value

    return text


train_df["boilerplate"] = train_df["boilerplate"].apply(get_text)
test_df["boilerplate"] = test_df["boilerplate"].apply(get_text)

train_df.rename(columns={"boilerplate": "text"}, inplace=True)
test_df.rename(columns={"boilerplate": "text"}, inplace=True)
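A quick way to see what this step does is to run the same logic on a single record. The sample string below is invented, and `join_boilerplate` is a hypothetical, slightly more compact variant of `get_text`.

```python
import json


def join_boilerplate(bp):
    """Hypothetical variant of get_text: join all non-null JSON fields."""
    fields = json.loads(bp)
    return " ".join(value for value in fields.values() if value is not None)


# Invented record in the same shape as a boilerplate entry; "url" is null.
sample = '{"title": "Easy Bread", "body": "Mix flour and water.", "url": null}'
print(join_boilerplate(sample))
```

JSON `null` is parsed as Python `None`, which is why the null check is needed before concatenating the fields.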

Removing stopwords, numbers and punctuation, and performing lemmatization using the spaCy library.

In [None]:
import spacy


def preprocess(df):
    nlp = spacy.load("en_core_web_sm")
    docs = nlp.pipe(df["text"], n_process=4, batch_size=250, disable=["ner", "parser"])

    # Build the cleaned texts in order and assign them in one step, so the
    # result does not depend on the dataframe having a 0..n-1 index.
    df["text"] = [
        " ".join(token.lemma_ for token in doc if token.is_alpha and not token.is_stop)
        for doc in docs
    ]

    return df


train_df = preprocess(train_df)
test_df = preprocess(test_df)

In [None]:
train_df.head()

Splitting the train dataframe into training and validation sets and then scaling the meta-data features with sklearn RobustScaler.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler


train_df, val_df, train_target, val_target = train_test_split(
    train_df,
    target,
    test_size=0.2,
    stratify=target,
    shuffle=True,
    random_state=42
)

train_meta, train_text = train_df.drop(["text"], axis=1), train_df["text"]
del train_df

val_meta, val_text = val_df.drop(["text"], axis=1), val_df["text"]
del val_df

test_meta, test_text = test_df.drop(["text"], axis=1), test_df["text"]
del test_df

scaler = RobustScaler()
train_meta = scaler.fit_transform(train_meta)
val_meta = scaler.transform(val_meta)
test_meta = scaler.transform(test_meta)
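For intuition, robust scaling can be reproduced by hand with NumPy. The sketch below assumes the default settings of `RobustScaler` (centering on the median and dividing by the interquartile range between the 25th and 75th percentiles) and uses invented numbers with one large outlier.

```python
import numpy as np

# Invented feature values; 100.0 is an outlier.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x_val = np.array([5.0])

# Statistics come from the training values only.
median = np.median(x_train)
q1, q3 = np.percentile(x_train, [25, 75])
iqr = q3 - q1

# The same median and IQR are then applied to the validation values.
scaled_train = (x_train - median) / iqr
scaled_val = (x_val - median) / iqr

print("median:", median, "IQR:", iqr)
print(scaled_train)
print(scaled_val)
```

The outlier barely affects the median and the IQR, so the bulk of the values stays in a narrow range after scaling, which is the motivation for choosing a robust scaler over mean and standard deviation here.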

# Part 3: Building the feedforward neural network model

Analyzing the corpus to find the total number of unique unigram and bigram tokens and the maximum sequence length.

In [None]:
from tensorflow.keras.layers import TextVectorization


layer = TextVectorization(ngrams=2)
layer.adapt(train_text)

print("Total number of tokens:", layer.vocabulary_size())
print("Maximum sequence length:", layer(train_text).shape[1])
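To see why the vocabulary grows so quickly, here is a plain-Python sketch of n-gram generation. It assumes that `ngrams=2` means "all n-grams up to size 2", i.e. unigrams plus bigrams; the function `ngrams_up_to` and the sample tokens are made up for illustration.

```python
def ngrams_up_to(tokens, n):
    """Hypothetical helper: return all n-grams of size 1..n as space-joined strings."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


tokens = ["easy", "bread", "recipe"]
print(ngrams_up_to(tokens, 2))
```

A document with `k` tokens yields `k` unigrams and `k - 1` bigrams, and most bigrams are rare, so the number of distinct tokens across the corpus is much larger than with unigrams alone.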

Performing TF-IDF vectorization of the corpus and defining a multi-input neural network model to handle text and meta-data features separately.

In [None]:
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.metrics import AUC
from tensorflow.keras import Model
from tensorflow import string


MAX_TOKENS = 900000

text2vec = TextVectorization(
    max_tokens=MAX_TOKENS,
    ngrams=2,
    pad_to_max_tokens=True,
    output_mode="tf_idf"
)
text2vec.adapt(train_text)

text_input = Input(shape=(1,), dtype=string, name="text_input")
x = text2vec(text_input)

meta_input = Input(shape=train_meta.shape[1:], name="meta_input")
y = Concatenate()([x, meta_input])
y = Dense(units=1, activation="sigmoid", name="sigmoid_output")(y)

model = Model(inputs=[text_input, meta_input], outputs=y, name="NLP_Model")

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", AUC(name="auc")]
)

model.summary()
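Since the model has a single sigmoid unit on top of the concatenated inputs, its forward pass can be sketched in a few lines of NumPy. The shapes, random inputs and weights below are invented and much smaller than the real ones; this only illustrates the computation, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-ins: 4 samples, 6 TF-IDF features and 3 meta-data features.
tfidf_features = rng.random((4, 6))
meta_features = rng.normal(size=(4, 3))

# Concatenate both inputs along the feature axis, as the Concatenate layer does.
features = np.concatenate([tfidf_features, meta_features], axis=1)

# One weight per input feature plus a bias, followed by a sigmoid.
weights = rng.normal(scale=0.1, size=(features.shape[1], 1))
bias = np.zeros(1)
probabilities = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

print(features.shape)
print(probabilities.ravel())
```

Each output is a value between 0 and 1 that can be read as the predicted probability of the page being evergreen.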

In [None]:
from tensorflow.keras.utils import plot_model


plot_model(
    model=model,
    to_file="fnn_model.jpeg",
    show_shapes=True,
    dpi=75
)

Defining some callbacks for learning rate optimization and early stopping.

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping


reduce_lr = ReduceLROnPlateau(
    monitor="val_auc",
    factor=0.2,
    patience=5,
    verbose=True,
    mode="max"
)

early_stop = EarlyStopping(
    monitor="val_auc",
    patience=30,
    verbose=True,
    mode="max",
    restore_best_weights=True
)

callbacks = [reduce_lr, early_stop]
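The effect of the learning-rate callback can be approximated with a small simulation. This is a simplified sketch that ignores options such as `min_delta`, `cooldown` and `min_lr`; the function `simulate_reduce_lr`, the starting rate of `1e-3` and the validation scores are assumptions made for the example.

```python
def simulate_reduce_lr(scores, lr=1e-3, factor=0.2, patience=5):
    """Simplified plateau schedule: cut lr by `factor` after `patience` epochs without a new best score."""
    best = float("-inf")
    wait = 0
    history = []
    for score in scores:
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history


# Invented validation AUC values: two improvements followed by a plateau.
val_auc = [0.80, 0.85] + [0.84] * 5
lr_history = simulate_reduce_lr(val_auc)
print(lr_history)
```

With `patience=5` the rate stays unchanged for the first four stagnant epochs and drops to a fifth of its value on the fifth one. Early stopping uses a much longer patience of 30 epochs, so the rate can be reduced several times before training is halted.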

In [None]:
history = model.fit(
    x=[train_text, train_meta],
    y=train_target,
    batch_size=512,
    epochs=300,
    verbose=1,
    callbacks=callbacks,
    validation_data=([val_text, val_meta], val_target),
    shuffle=True
)

In [None]:
model.evaluate([val_text, val_meta], val_target)

In [None]:
# predict returns an array of shape (n, 1); flatten it before storing it as a column.
sub_df["label"] = model.predict([test_text, test_meta]).ravel()
sub_df.to_csv("submission.csv", index=False)

In [None]:
model.save("fnn_model", save_format="tf")

Plotting training history of various metrics.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
ax = np.array(ax).ravel()

for i, metric in enumerate(["loss", "accuracy", "auc"]):
    ax[i].plot(history.history[metric])
    ax[i].plot(history.history["val_"+metric])
    ax[i].legend(["train", "val"])
    ax[i].set_xlabel("epochs")
    ax[i].set_ylabel(metric)
    ax[i].set_title(metric + " vs epochs")

fig.show()

In [None]:
fig.savefig("fnn_training_history.jpeg")