**This is the starter code for the competition [jigsaw-toxic-severity-rating](https://www.kaggle.com/c/jigsaw-toxic-severity-rating)**

In [None]:
import re
from bs4 import BeautifulSoup
import os
import random
import joblib
import numpy as np
import pandas as pd

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.kernel_ridge import KernelRidge

import matplotlib.pyplot as plt

In [None]:
TEST_DATA_PATH = '../input/jigsaw-toxic-severity-rating/comments_to_score.csv'
VALID_DATA_PATH = '../input/jigsaw-toxic-severity-rating/validation_data.csv'
TRAIN_DATA_PATH = '../input/d/julian3833/jigsaw-toxic-comment-classification-challenge/train.csv'

In [None]:
VAL_SIZE = 0.15
SEED = 10
N_FOLDS = 5

## Functions

In [None]:
def set_seed(seed=42):
    """Utility function to use for reproducibility.
    :param seed: Random seed
    :return: None
    """
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)


def set_display():
    """Function sets display options for charts and pd.DataFrames.
    """
    # Plots display settings
    plt.style.use('fivethirtyeight')
    plt.rcParams['figure.figsize'] = 12, 8
    plt.rcParams.update({'font.size': 14})
    # DataFrame display settings
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.options.display.float_format = '{:.4f}'.format
    
def text_cleaning(text: str) -> str:
    """Function cleans text removing special characters,
    extra spaces, embedded URL links, HTML tags and emojis.
    Code source: https://www.kaggle.com/manabendrarout/pytorch-roberta-ranking-baseline-jrstc-infer
    :param text: Original text
    :return: Preprocessed text
    """
    template = re.compile(r'https?://\S+|www\.\S+')  # website links
    text = template.sub(r'', text)

    soup = BeautifulSoup(text, 'lxml')  # HTML tags
    only_text = soup.get_text()
    text = only_text

    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    text = re.sub(r"[^a-zA-Z\d]", " ", text)  # special characters
    text = re.sub(' +', ' ', text)  # extra spaces
    text = text.strip()  # spaces at the beginning and at the end of string

    return text

## Data processing

In [None]:
set_seed(SEED)
set_display()

By the rules of this competition we can use public data sets on Kaggle. However, for some reason we cannot submit notebooks that directly use the data from the previous competition, held 4 years ago, where text samples were labeled with 6 classes. In this version we use an exact copy of that competition's data set, created and published by [Julian Peller](https://www.kaggle.com/julian3833) [here](https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge).

In [None]:
# Extract classified text samples.
data_train = pd.read_csv(TRAIN_DATA_PATH)
data_train.head()

In [None]:
data_train.iloc[6]

In [None]:
data_train.describe()

In [None]:
categories = data_train.loc[:, 'toxic':'identity_hate'].sum()
plt.title('Category Frequency')
plt.bar(categories.index, categories.values)
plt.show()

In the previous competition the task was multi-label classification: a text sample could be labeled with one or several categories, or with none. Non-toxic comments make up the majority of text samples, toxic comments are a minority class, and severely toxic comments are rarer than plain toxic ones.

In this competition we have to score texts based on the level of toxicity. To get a toxicity score from the previous data we can use two approaches:
- Simply sum the six label values in each row of the DataFrame. The toxicity score will vary between 0 and 6. However, some unequally toxic samples could end up with the same score.
- Adjust the values in the DataFrame according to the severity of each category (for example, "toxic" and "severe_toxic" should have different scores) and then sum the values per row.
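To make the drawback of the first approach concrete, here is a minimal sketch on a made-up three-row DataFrame. The rows and the `toy` and `simple_sum` names are illustrative assumptions, not competition data; only the six label column names come from the data set.

```python
import pandas as pd

# The six label columns of the training data.
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical rows: a plainly toxic comment, a threat, and a comment
# carrying four labels at once.
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [1, 1, 1, 0, 1, 0]],
    columns=LABELS,
    index=['plain_toxic', 'threat_only', 'multi_label'],
)

# Simple sum: every label counts the same.
toy['simple_sum'] = toy[LABELS].sum(axis=1)
print(toy['simple_sum'])
```

The "plain_toxic" and "threat_only" rows get the same simple sum even though one would probably want a threat to rank higher, which is the motivation for the second approach.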

In [None]:
scores = data_train.loc[:, 'toxic':'identity_hate'].sum(axis=1).value_counts()
plt.bar(scores.index, scores.values)
plt.title('Scores Distribution: Simple Sum')
plt.show()

In the original DataFrame each category contains binary values (0 or 1). We will change the original values using multiplicative factors observing the following (disputable) common sense rules:
- Normal non-offensive text samples would have a score of 0.
- "toxic" category with a score of 1 would be used as a benchmark to score other offensive categories.
- Samples marked as "severe_toxic" would have higher toxicity score than those marked as "toxic".
- Obscene language alone would have a slightly lower score than "toxic" samples.
- Insults would be scored in between "toxic" and "severe_toxic" closer to the upper bound.
- Samples containing threats would have the highest toxicity score.
- Identity hate would be scored marginally lower than threats.
- If a sample is marked as belonging to several offensive classes, its total score would be calculated as the sum of the values in all individual categories.
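As a quick sanity check of these rules, the sketch below uses the same multipliers as the notebook's `cat_mtpl` dictionary and confirms the intended ordering of the categories. The names `CAT_WEIGHTS`, `weighted_score` and `ranking` are hypothetical helpers introduced only for this illustration.

```python
# Multipliers copied from the notebook's cat_mtpl dictionary.
CAT_WEIGHTS = {'toxic': 1, 'severe_toxic': 1.75, 'obscene': 0.95,
               'threat': 2, 'insult': 1.6, 'identity_hate': 1.95}


def weighted_score(active_labels):
    """Sum the multipliers of the labels a sample is marked with."""
    return sum(CAT_WEIGHTS[label] for label in active_labels)


# Categories ordered from the lowest to the highest multiplier.
ranking = sorted(CAT_WEIGHTS, key=CAT_WEIGHTS.get)
print(ranking)

# A sample marked as toxic, obscene and insulting at the same time.
print(round(weighted_score(['toxic', 'obscene', 'insult']), 2))
```

With these multipliers a sample carrying all six labels would score 9.25, so the adjusted score ranges from 0 to 9.25 instead of 0 to 6.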

In [None]:
# Multiplication factors for categories.
cat_mtpl = {'toxic': 1, 'severe_toxic': 1.75, 'obscene': 0.95,
            'threat': 2, 'insult': 1.6, 'identity_hate': 1.95}

for category in cat_mtpl:
    data_train[category] = data_train[category] * cat_mtpl[category]

data_train['score'] = data_train.loc[:, 'toxic':'identity_hate'].sum(axis=1)

In [None]:
plt.hist(data_train['score'])
plt.title('Scores Distribution: Adjusted Sum')
plt.show()

The data set is imbalanced: the number of neutral text samples is much larger than the number of toxic samples. We limit the number of samples with a toxicity score of 0 to 1/5 of the number of texts labeled as toxic.

In [None]:
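# Rows with a non-zero score are toxic, the rest are neutral.
n_samples_toxic = len(data_train[data_train['score'] != 0])
n_samples_normal = len(data_train) - n_samples_toxic

# Keep the first n_samples_toxic // 5 neutral rows (in file order) and drop the rest.
idx_to_drop = data_train[data_train['score'] == 0].index[n_samples_toxic//5:]
data_train = data_train.drop(idx_to_drop)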

print(f'Reduced number of neutral text samples from {n_samples_normal} to {n_samples_toxic//5}.')
print(f'Total number of training samples: {len(data_train)}')
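The cell above keeps the first neutral rows in file order. If a random subset is preferred, one possible alternative is sketched below. The function `downsample_neutral` and its parameters are hypothetical and not part of the original notebook; it assumes a `score` column like the one built earlier.

```python
import pandas as pd


def downsample_neutral(df, ratio=5, seed=10, score_col='score'):
    """Keep all toxic rows plus a random sample of neutral rows.

    The number of neutral rows kept is len(toxic) // ratio,
    capped at the number of neutral rows available.
    """
    toxic = df[df[score_col] != 0]
    neutral = df[df[score_col] == 0]
    n_keep = min(len(neutral), len(toxic) // ratio)
    kept = neutral.sample(n=n_keep, random_state=seed)
    # Restore the original row order after concatenation.
    return pd.concat([toxic, kept]).sort_index()


# Toy data: 10 toxic rows and 20 neutral rows.
toy = pd.DataFrame({'score': [1.0] * 10 + [0.0] * 20})
balanced = downsample_neutral(toy)
print(len(balanced), (balanced['score'] == 0).sum())
```

Passing the notebook's `SEED` as `seed` would keep the selection reproducible across runs.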

In [None]:
print(f'Mean toxicity score: {data_train["score"].mean()}\n'
      f'Standard deviation: {data_train["score"].std()}')

In [None]:
data_train.iloc[6]

In [None]:
X = data_train["comment_text"]
Y = data_train['score']

**Now you have the values of X and Y and can train whatever type of model you want. If this helps you, please give an upvote.**

In [None]:
X = X.apply(text_cleaning)

In [None]:
X.head()

In [None]:
# New data for validation: text pairs.
data_valid = pd.read_csv(VALID_DATA_PATH)

# Clean the texts
data_valid['less_toxic'] = data_valid['less_toxic'].apply(text_cleaning)
data_valid['more_toxic'] = data_valid['more_toxic'].apply(text_cleaning)

data_valid.head()
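Since the validation data comes as text pairs, a model can be checked by counting how often it scores the `more_toxic` text above the `less_toxic` one. The sketch below assumes that this fraction of correctly ordered pairs is the quantity of interest. `pairwise_agreement`, `toy_score` and the toy pairs are hypothetical stand-ins, not a trained model or real validation data.

```python
import pandas as pd


def pairwise_agreement(pairs, score_fn):
    """Fraction of pairs where more_toxic is scored strictly higher."""
    less = pairs['less_toxic'].map(score_fn)
    more = pairs['more_toxic'].map(score_fn)
    return float((less < more).mean())


def toy_score(text):
    """Hypothetical scorer: counts words from a tiny made-up word list."""
    bad_words = {'idiot', 'stupid'}
    return sum(word in bad_words for word in text.lower().split())


toy_pairs = pd.DataFrame({
    'less_toxic': ['have a nice day', 'you are stupid', 'thanks for the edit'],
    'more_toxic': ['you idiot', 'you stupid idiot', 'please stop'],
})

agreement = pairwise_agreement(toy_pairs, toy_score)
print(agreement)
```

In practice `score_fn` would wrap the prediction of a fitted model, and `pairs` would be the cleaned `data_valid` DataFrame.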