# Toxic Comment Classification

In [9]:
import pandas as pd

## 1. Data Preparation

In [10]:
master_dataset = pd.read_csv("Datasets/train.csv")
master_dataset.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [11]:
master_dataset.shape

(159571, 8)

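The dataset has 159571 rows and 8 columns: an id, a comment_text column holding the raw text, and 6 binary target variables (toxic, severe_toxic, obscene, threat, insult, identity_hate). Because a single comment can carry several of these labels at once, this is a multi-label classification problem.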
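
To make the multi-label framing concrete, the sketch below counts how many labels each comment carries. It uses a tiny hand-made frame with the same six label columns rather than the real train.csv, so the values are invented for illustration only.

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for master_dataset (values are made up, not taken from train.csv)
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],  # clean comment, no labels
        [1, 0, 1, 0, 1, 0],  # toxic + obscene + insult
        [1, 0, 0, 0, 0, 0],  # toxic only
    ],
    columns=label_cols,
)

# Number of labels set per comment; any value above 1 is what makes this multi-label
labels_per_comment = toy[label_cols].sum(axis=1)
print(labels_per_comment.value_counts().sort_index())

# Number of positive examples per label
print(toy[label_cols].sum())
```

Running the same two lines on master_dataset would show how imbalanced each label is, which matters later when choosing evaluation metrics.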

### 1.1 Drop Duplicates
The function fn_del_dup_rows takes a dataframe as input and checks it for duplicate records. If any are found, the duplicate records are dropped, keeping only one copy of each.
Next, any rows that are entirely empty are dropped from the dataframe. The function returns two dataframes: a copy of the original dataframe and the cleaned dataframe with duplicates and empty records removed.

In [12]:
def fn_del_dup_rows(df):
    duplicated_df = df.copy()
    tot_rows = duplicated_df.shape[0]
    
    # Dropping duplicate records
    df.drop_duplicates(inplace=True)
    distinct_rows = df.shape[0]
    if(distinct_rows<tot_rows):
        print("Duplicates found. Total duplicates",tot_rows-distinct_rows)
    else:
        print("No duplicates found")
    
    #Dropping empty records
    tot_rows = df.shape[0]
    df.dropna(axis=0, how='all',inplace=True)
    distinct_rows = df.shape[0]
    if(distinct_rows<tot_rows):
        print("Empty records found. Total empty records",tot_rows-distinct_rows)
    else:
        print("No empty records found")
    return duplicated_df,df

duplicated_df,master_dataset = fn_del_dup_rows(master_dataset)

No duplicates found
No empty records found
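
Because fn_del_dup_rows calls drop_duplicates and dropna with inplace=True, it also modifies the dataframe that was passed in. If that side effect is unwanted, a non-mutating variant could look like the sketch below. The name drop_dups_and_empty and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_dups_and_empty(df):
    """Return a cleaned copy of df; the input frame is left untouched."""
    deduped = df.drop_duplicates()               # keeps the first of each duplicate group
    n_dups = len(df) - len(deduped)
    cleaned = deduped.dropna(axis=0, how="all")  # drops rows where every column is missing
    n_empty = len(deduped) - len(cleaned)
    print(f"Duplicates dropped: {n_dups}, empty rows dropped: {n_empty}")
    return cleaned

# Toy frame with one duplicated row and one fully empty row
toy = pd.DataFrame(
    {"id": ["a", "a", None, "b"], "comment_text": ["x", "x", None, "y"]}
)
cleaned = drop_dups_and_empty(toy)
```

Here toy still has all four rows afterwards, while cleaned holds only the two distinct, non-empty ones.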


In [13]:
master_dataset.dtypes

id               object
comment_text     object
toxic             int64
severe_toxic      int64
obscene           int64
threat            int64
insult            int64
identity_hate     int64
dtype: object

### 1.2 Datatype Conversion
Pandas has inferred int64 for the six label columns and object for the two text columns (id and comment_text). Let's assign the intended datatype to each variable.

The function get_uniq_vals takes a dataframe as an input and returns a Series of unique values in each column. This helps in identifying the categorical variables in the dataset.
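
The cell that defines get_uniq_vals is not shown in this notebook. A minimal sketch consistent with the description above might look like the following; the body is an assumption, not the author's code.

```python
import pandas as pd

def get_uniq_vals(df):
    # Assumed implementation: a Series indexed by column name,
    # where each entry is the list of unique values in that column
    return pd.Series({col: df[col].unique().tolist() for col in df.columns})

# Toy frame for illustration
toy = pd.DataFrame({"comment_text": ["a", "b", "c"], "toxic": [0, 1, 0]})
uniq = get_uniq_vals(toy)
print(uniq)
```

Columns with only a handful of unique values, such as a 0/1 label, are natural candidates for the category dtype.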

The function fn_set_dtypes takes 5 parameters: the dataframe, followed by four lists of column names to be cast to category, integer, float, and object respectively.

The function returns the dataframe with the appropriate datatype assigned to each variable.

In [15]:
def fn_set_dtypes(df,categories_,ints_,floats_,objects_):
    for category_ in categories_:
        df[category_] = df[category_].astype("category")
        
    for int_ in ints_:
        df[int_] = df[int_].astype("int64")
        
    for float_ in floats_:
        df[float_] = df[float_].astype("float64")
        
    for object_ in objects_:
        df[object_] = df[object_].astype("object")
    return df

categories_=['toxic','severe_toxic','obscene','threat','insult','identity_hate']
ints_=[]
floats_=[]
objects_=['comment_text','id']
master_dataset = fn_set_dtypes(master_dataset,categories_,ints_,floats_,objects_)
master_dataset.dtypes

id                 object
comment_text       object
toxic            category
severe_toxic     category
obscene          category
threat           category
insult           category
identity_hate    category
dtype: object
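
As a more compact alternative, DataFrame.astype also accepts a mapping from column name to dtype, so the same kind of conversion can be done in one call. The sketch below uses a small made-up frame rather than master_dataset.

```python
import pandas as pd

# Toy frame for illustration
toy = pd.DataFrame({
    "id": ["a", "b"],
    "comment_text": ["hello", "world"],
    "toxic": [0, 1],
    "insult": [0, 0],
})

# Map each column to its target dtype; columns not listed keep their current dtype
dtype_map = {"toxic": "category", "insult": "category"}
converted = toy.astype(dtype_map)  # returns a new frame, toy is unchanged
print(converted.dtypes)
```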

## 2. Missing Value Treatment

Let's check whether any variable has missing values and, if so, choose an appropriate imputation method.

The function fn_get_missing_vals takes a dataframe and a list of column names, and returns a dataframe listing each of those variables that has missing values, along with the count and percentage of missing values.

In [16]:
def fn_get_missing_vals(df,cols_):
    df=df[cols_]
    n_rows = df.shape[0]
    miss_val_cnts = df.isna().sum()
    miss_vals = pd.DataFrame(miss_val_cnts[miss_val_cnts>0],columns=['Missing Val Count'])
    miss_vals['Percentage'] = miss_vals['Missing Val Count']*100/n_rows
    return miss_vals

fn_get_missing_vals(master_dataset,categories_+ints_+floats_+objects_)

| | Missing Val Count | Percentage |
|---|---|---|


Cool! There are no missing values.
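
Since the real dataset produces an empty report, here is a small sketch of what the same count-and-percentage logic yields when gaps do exist. The toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (not real data)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "comment_text": ["fine", None, "also fine", None],
    "toxic": [0, 1, np.nan, 0],
})

counts = toy.isna().sum()                                   # missing values per column
report = counts[counts > 0].to_frame("Missing Val Count")   # keep only columns with gaps
report["Percentage"] = report["Missing Val Count"] * 100 / len(toy)
print(report)
```

Only comment_text and toxic appear in the report; id has no gaps and is filtered out.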

## 3. Text Preprocessing

In [23]:
master_dataset['comment_text']

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
5         "\n\nCongratulations from me as well, use the ...
6              COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
7         Your vandalism to the Matt Shirvington article...
8         Sorry if the word 'nonsense' was offensive to ...
9         alignment on this subject and which are contra...
10        "\nFair use rationale for Image:Wonju.jpg\n\nT...
11        bbq \n\nbe a man and lets discuss it-maybe ove...
12        Hey... what is it..\n@ | talk .\nWhat is it......
13        Before you start throwing accusations and warn...
14        Oh, and the girl above started her arguments w...
15        "\n\nJuelz Santanas Age\n\nIn 2002, Juelz Sant...
16        Bye! \n\nDon't look, come or t

In [24]:
master_dataset.loc[29,'comment_text']

'"== A barnstar for you! ==\n\n  The Real Life Barnstar lets us be the stars\n   "'
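
The samples above contain newline characters, wiki heading markers such as ==, stray quote characters, and runs of whitespace. One possible first cleaning pass is sketched below. It is an assumption about the kind of preprocessing that could follow, not the notebook's actual method, and clean_comment is a hypothetical helper.

```python
import re

def clean_comment(text):
    text = text.replace("\n", " ")       # turn newlines into spaces
    text = re.sub(r"={2,}", " ", text)   # drop wiki heading markers like ==
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip(' "').lower()      # trim spaces and wrapping quotes, lowercase

sample = '"== A barnstar for you! ==\n\n  The Real Life Barnstar lets us be the stars\n   "'
cleaned_sample = clean_comment(sample)
print(cleaned_sample)
```

To apply it to the whole column, something like master_dataset['comment_text'].apply(clean_comment) would work, though lowercasing may discard a useful signal given that some toxic comments are written entirely in capitals.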