# Toxic comment classification
## Exploration and preprocessing



## Instructions


## General Outline

Recall the general outline for SageMaker projects using a notebook instance.

1. Unzip the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


In [2]:
#data 
import pandas as pd 
import numpy as np

#paths
import os

#save
import pickle

#plot
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# text processing 
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from bs4 import BeautifulSoup

In [3]:
cols = [ 'ID', 'label', 'statement', 'subject', 'speaker', 'speaker_job', 'state_info',
 'afiliation', 'barely_true_counts', 'false_counts', 'half_true_counts', 'mostly_true_counts', 
 'pants_on_fire_counts', 'context']

len(cols)

14

## Step 1: Inspecting for missing values

In [4]:

train = pd.read_csv('./data/liar_dataset/train.csv', sep='\t',header=None) 

train.columns = cols

ValueError: Length mismatch: Expected axis has 1 elements, new values have 14 elements
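
The `ValueError` above means pandas parsed the file as a single column, which typically happens when the file is not actually tab-separated (for example, if it was rewritten earlier as a comma-separated file). The sketch below shows one way to fail with a clearer message. `load_liar` is a hypothetical helper, not part of the original notebook, and the demo writes tiny throwaway files instead of touching the real data.

```python
import os
import tempfile

import pandas as pd

COLS = ['ID', 'label', 'statement', 'subject', 'speaker', 'speaker_job', 'state_info',
        'afiliation', 'barely_true_counts', 'false_counts', 'half_true_counts',
        'mostly_true_counts', 'pants_on_fire_counts', 'context']


def load_liar(path, cols=COLS, sep='\t'):
    """Hypothetical helper: read a header-less file and check its width before naming columns."""
    df = pd.read_csv(path, sep=sep, header=None)
    if df.shape[1] != len(cols):
        raise ValueError(
            f"{path}: expected {len(cols)} columns with sep={sep!r}, found {df.shape[1]}"
        )
    df.columns = cols
    return df


# Demo on made-up one-row files: one tab-separated, one comma-separated
row = [str(i) for i in range(len(COLS))]
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.tsv')
    with open(good, 'w') as fh:
        fh.write('\t'.join(row) + '\n')
    demo = load_liar(good)  # 14 columns, names assigned

    bad = os.path.join(tmp, 'bad.csv')
    with open(bad, 'w') as fh:
        fh.write(','.join(row) + '\n')
    try:
        load_liar(bad)  # parsed as a single column, so the check fires
    except ValueError as err:
        message = str(err)

print(demo.shape)
print(message)
```

Checking the column count before assigning names turns a generic length mismatch into a message that points at the separator.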

In [None]:
train.head()

In [None]:
# Count null entries per column after dropping rows with a missing 'afiliation'
print(train.dropna(subset = ['afiliation']).isnull().sum())
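
To see what this check reports, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration; only the column names come from the dataset). It compares the null counts before and after dropping rows that lack an `afiliation`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'statement': ['a', 'b', 'c'],
    'speaker_job': ['senator', np.nan, np.nan],
    'afiliation': ['democrat', np.nan, 'republican'],
})

# Null entries per column in the full frame
null_counts = toy.isnull().sum()

# Null entries per column once rows without an affiliation are dropped
after_drop = toy.dropna(subset=['afiliation']).isnull().sum()

print(null_counts)
print(after_drop)
```

Note that `dropna(subset=...)` only guarantees that the listed column is complete; other columns can still contain nulls afterwards.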


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10241 entries, 0 to 10240
Data columns (total 1 columns):
0    10241 non-null object
dtypes: object(1)
memory usage: 80.1+ KB


In [6]:
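# Keep the file tab-separated and header-less so the read_csv call in Step 1 can read it back
train.to_csv('./data/liar_dataset/train.csv', sep='\t', header=False, index=False)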

## Step 2: Preprocessing

### Splitting comments from labels

In [51]:
train.columns

Index(['ID', 'label', 'statement', 'subject', 'speaker', 'speaker_job',
       'state_info', 'afiliation', 'barely_true_counts', 'false_counts',
       'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts',
       'context'],
      dtype='object')

**Looking at the statements, one finds text that needs preprocessing:**

In [60]:
rev=660
print('TRAIN\n')
print(train['statement'][rev][:120],'\n')


TRAIN

It used to be the policy of the Republican Party to get rid of the Department of Education. We finally get in charge and 



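The first step in processing the statements is to clean the raw text. The function below replaces every character that is not a word character or an apostrophe with a space, substitutes each number with the placeholder `n`, and keeps only the first 100 words. Removing HTML tags, removing stopwords, and stemming (so that words such as *entertained* and *entertaining* are treated as the same token) are left commented out as optional steps.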

In [61]:
# Runs of digits are replaced by the placeholder 'n'
rep_numbers = re.compile(r'\d+')

# Anything that is not a word character or an apostrophe (plus underscores) becomes a space
rep_special_chars = re.compile(r"[^\w']|_")

# Only needed if the optional stopword/stemming steps below are enabled
stemmer = PorterStemmer()
nltk.download("stopwords", quiet=True)

def review_to_words(review):
    """Clean one statement and return at most its first 100 words."""
    # text = BeautifulSoup(review, "html.parser").get_text()  # Optional: remove HTML tags

    text = rep_special_chars.sub(' ', review)  # Replace special characters with spaces
    text = rep_numbers.sub('n', text)  # Substitute every number with 'n'

    words = text.split()[:100]  # Split into words and keep the first 100
    # words = [w for w in words if w not in stopwords.words("english")]  # Optional: remove stopwords
    # words = [stemmer.stem(w) for w in words]  # Optional: reduce words to their stems

    return ' '.join(words)

The `review_to_words` function defined above does not currently call `BeautifulSoup` or the `nltk` stopword and stemming steps, which are commented out; it only strips special characters, replaces numbers, and truncates the text. As a check that everything works as expected, apply `review_to_words` to one of the statements in the training set.

In [65]:
rev=1
print(train['statement'][rev],'\n AFTER: \n')
print(review_to_words(train['statement'][rev]))

When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. 
 AFTER: 

When did the decline of coal start It started when natural gas took off that started to begin in President George W Bushs administration
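
The statement above contains no digits, so it does not exercise the number substitution or the 100-word cut-off. The sketch below does, using the same two patterns the function ends up applying. The sample sentence is made up, and the `max_words` parameter is an assumption added for illustration (the notebook hard-codes 100).

```python
import re

# Same patterns as the cleaning cell: digit runs, and non-word characters except apostrophes
rep_numbers = re.compile(r'\d+')
rep_special_chars = re.compile(r"[^\w']|_")


def review_to_words(review, max_words=100):
    """Replace special characters with spaces, numbers with 'n', and truncate."""
    text = rep_special_chars.sub(' ', review)
    text = rep_numbers.sub('n', text)
    return ' '.join(text.split()[:max_words])


sample = "Taxes rose 25% in 2010, didn't they?"  # made-up statement
cleaned = review_to_words(sample)
print(cleaned)  # Taxes rose n in n didn't they

# A 150-word input is cut to the first 100 words
long_text = ' '.join(['word'] * 150)
truncated_len = len(review_to_words(long_text).split())
print(truncated_len)  # 100
```

Apostrophes survive the cleaning, which is why `didn't` stays intact while `%`, `,` and `?` disappear.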


The cell below applies `review_to_words` to every statement in the training set and saves the result as a JSON Lines file. On later runs, you can load the libraries and start the notebook from here.

In [66]:
train.statement = train.statement.apply(review_to_words)
train.to_json('./data/liar_dataset/liar.json', orient='records', lines=True)
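
To pick the work up again later, the saved file can be read back with `pd.read_json` and `lines=True`. The sketch below demonstrates the round trip on a made-up two-row frame written to a temporary directory, so it does not depend on the real `liar.json`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned training frame (values invented for illustration)
toy = pd.DataFrame({
    'label': ['half-true', 'mostly-true'],
    'statement': ['first cleaned statement', 'second cleaned statement'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'liar.json')
    # One JSON object per line, matching the call used in the notebook
    toy.to_json(path, orient='records', lines=True)
    # lines=True tells pandas to parse the file line by line
    restored = pd.read_json(path, orient='records', lines=True)

print(restored)
```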