# Toxic comment classification
## Exploration and preprocessing




## General Outline

Recall the general outline for SageMaker projects using a notebook instance.

1. Unzip the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


In [1]:
#data 
import pandas as pd 
import numpy as np

#paths
import os

#save
import pickle

#plot
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# text processing 
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from bs4 import BeautifulSoup

In [2]:
cols = [ 'ID', 'label', 'statement', 'subject', 'speaker', 'speaker_job', 'state_info',
 'afiliation', 'barely_true_counts', 'false_counts', 'half_true_counts', 'mostly_true_counts', 
 'pants_on_fire_counts', 'context']

len(cols)

14

## Step 1: Inspecting for missing values

In [3]:

train = pd.read_csv('./data/liar_dataset/train.tsv', sep='\t',header=None) 

train.columns = cols

In [4]:
train.head()

|   | ID | label | statement | subject | speaker | speaker_job | state_info | afiliation | barely_true_counts | false_counts | half_true_counts | mostly_true_counts | pants_on_fire_counts | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2635.json | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer |
| 1 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |
| 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |
| 3 | 1123.json | false | Health care reform legislation is likely to ma... | health-care | blog-posting |  |  | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 4 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist |  | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |


In [5]:
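# Count null entries per column after dropping the rows that have no affiliation.
# speaker_job, state_info and context still contain nulls.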
print(train.dropna(subset = ['afiliation']).isnull().sum())


ID                         0
label                      0
statement                  0
subject                    0
speaker                    0
speaker_job             2895
state_info              2206
afiliation                 0
barely_true_counts         0
false_counts               0
half_true_counts           0
mostly_true_counts         0
pants_on_fire_counts       0
context                  100
dtype: int64
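The counts above are easier to compare when shown next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the values and the names `toy` and `summary` are illustrative assumptions, not part of the LIAR data); on the real data the same two calls would be applied to `train`.

```python
import pandas as pd

# Toy frame standing in for the training set; the values are made up.
toy = pd.DataFrame({
    "speaker_job": ["President", None, None, "State delegate"],
    "state_info": ["Illinois", None, "Florida", "Virginia"],
    "context": ["Denver", "a news release", "an interview", None],
})

# isnull() gives a boolean frame; sum() counts the nulls per column,
# mean() gives the fraction of rows that are null.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean().round(2),
})
print(summary)
```

Columns with a large share of missing values, such as `speaker_job` and `state_info` in the output above, are candidates for imputation or for being left out of the model input.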


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10240 entries, 0 to 10239
Data columns (total 14 columns):
ID                      10240 non-null object
label                   10240 non-null object
statement               10240 non-null object
subject                 10238 non-null object
speaker                 10238 non-null object
speaker_job             7343 non-null object
state_info              8032 non-null object
afiliation              10238 non-null object
barely_true_counts      10238 non-null float64
false_counts            10238 non-null float64
half_true_counts        10238 non-null float64
mostly_true_counts      10238 non-null float64
pants_on_fire_counts    10238 non-null float64
context                 10138 non-null object
dtypes: float64(5), object(9)
memory usage: 1.1+ MB


In [8]:
train.to_csv('./data/liar_dataset/train.csv', index=False)

## Step 2: Preprocessing

### Splitting statements from labels

In [51]:
train.columns

Index(['ID', 'label', 'statement', 'subject', 'speaker', 'speaker_job',
       'state_info', 'afiliation', 'barely_true_counts', 'false_counts',
       'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts',
       'context'],
      dtype='object')

**Looking at the statements, one finds text that needs some preprocessing:**

In [60]:
rev=660
print('TRAIN\n')
print(train['statement'][rev][:120],'\n')


TRAIN

It used to be the policy of the Republican Party to get rid of the Department of Education. We finally get in charge and 



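The first step in processing the statements is to replace every character that is not a letter, digit, or apostrophe with a space, and to replace each run of digits with the placeholder `n`. Each statement is then truncated to its first 100 words. HTML tag removal, stop-word removal, and stemming (which would treat words such as *entertained* and *entertaining* as the same) are available in the function below but are left commented out.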

In [61]:
# Runs of digits, replaced by the placeholder 'n'
rep_numbers = re.compile(r'\d+')

#rep_jumps = re.compile(r'\n+')

# Anything that is not a word character or an apostrophe, plus underscores
rep_special_chars = re.compile(r"[^\w']|_")

stemmer = PorterStemmer()
nltk.download("stopwords", quiet=True)

def review_to_words(review):
    #text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    #text = rep_jumps.sub(' ', review)

    text = rep_special_chars.sub(' ', review) # Replace special characters with spaces
    text = rep_numbers.sub('n', text) # Substitute all numbers

    words = text.split()[:100] # Split string into words, keep the first 100
    #words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords
    #words = [stemmer.stem(w) for w in words] # Reduce words to their stems

    return ' '.join(words)

The `review_to_words` function defined above replaces special characters and numbers and keeps at most the first 100 words of each statement; the `BeautifulSoup` and `nltk` steps are commented out. As a check to ensure we know how everything is working, try applying `review_to_words` to one of the statements in the training set.

In [65]:
rev=1
print(train['statement'][rev],'\n AFTER: \n')
print(review_to_words(train['statement'][rev]))

When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. 
 AFTER: 

When did the decline of coal start It started when natural gas took off that started to begin in President George W Bushs administration
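The statement above contains no digits, apostrophes, or underscores, so it does not exercise every rule. The sketch below applies the same two regular expressions to a made-up sample string; the helper name `clean` and the sample text are assumptions introduced only for illustration.

```python
import re

# Same patterns as in the notebook cell that defines review_to_words.
rep_numbers = re.compile(r"\d+")
rep_special_chars = re.compile(r"[^\w']|_")

def clean(text, max_words=100):
    """Hypothetical stand-alone copy of the cleaning steps."""
    text = rep_special_chars.sub(" ", text)  # punctuation and underscores -> space
    text = rep_numbers.sub("n", text)        # each run of digits -> 'n'
    return " ".join(text.split()[:max_words])

# Made-up sample, not taken from the dataset.
sample = "Obama's plan_B cut 1,200 jobs (in 2009)!"
cleaned = clean(sample)
print(cleaned)  # Obama's plan B cut n n jobs in n
```

Note that apostrophes are kept, and that `1,200` becomes two placeholders because the comma is replaced by a space before the digits are substituted.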


The cell below applies `review_to_words` to every statement in the training set and caches the result as a JSON Lines file. This way, a later session can load the libraries and restart the notebook from this point.

In [66]:
train.statement = train.statement.apply(review_to_words)
train.to_json('./data/liar_dataset/liar.json', orient='records', lines=True)
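To pick up from the cached file in a later session, the JSON Lines output can be read back with `pd.read_json` and `lines=True`. The sketch below shows the round trip on a made-up two-row frame held in memory (the names `toy` and `restored` and the row values are assumptions); with the notebook's data, the file path used above would be passed in place of the in-memory buffer.

```python
import io
import pandas as pd

# Made-up rows standing in for the cleaned training set.
toy = pd.DataFrame({
    "label": ["half-true", "mostly-true"],
    "statement": ["first cleaned statement", "second cleaned statement"],
})

# With no path given, to_json returns the text it would have written:
# one JSON object per line.
json_lines = toy.to_json(orient="records", lines=True)

# lines=True tells read_json to parse one record per line.
restored = pd.read_json(io.StringIO(json_lines), orient="records", lines=True)
print(restored)
```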