# Toxic comment classification
## Exploration and preprocessing



## General Outline

Recall the general outline for SageMaker projects using a notebook instance.

1. Unzip the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


In [44]:
#data 
import pandas as pd 
import numpy as np

#paths
import os

#save
import pickle

#plot
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# text processing 
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from bs4 import BeautifulSoup

## Step 1: Inspecting for missing values

In [45]:

train = pd.read_csv('./data/train_stances.csv') 

train.head(1)

|   | Headline | Body ID | Stance |
|---|---|---|---|
| 0 | Police find mass graves with at least '15 bodi... | 712 | unrelated |


In [46]:
train = train.dropna()

In [48]:
# No null entries remain after dropna()
print(train.isnull().sum())

Headline    0
Body ID     0
Stance      0
dtype: int64


In [49]:
print(train.notnull().sum())

Headline    49972
Body ID     49972
Stance      49972
dtype: int64
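
The counts above are taken after `dropna()`, so they cannot show how many rows were removed. As a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from `train_stances.csv`), counting missing cells before and after the drop makes the effect visible:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stance data; all values are made up.
toy = pd.DataFrame({
    "Headline": ["First headline", None, "Third headline"],
    "Body ID": [1, 2, np.nan],
    "Stance": ["unrelated", "agree", "discuss"],
})

missing_before = toy.isnull().sum()     # per-column count of missing cells
cleaned = toy.dropna()                  # drop every row with at least one missing cell
missing_after = cleaned.isnull().sum()  # should be zero in every column

print(missing_before)
print(f"rows kept: {len(cleaned)} of {len(toy)}")
```

Running the same check on the real `train` frame before the `dropna()` cell would show whether any rows were actually discarded.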


In [64]:
train.to_csv('./data/stance_preprocessed.csv', index=False)

## Step 2: Preprocessing

### Splitting headlines from labels

In [51]:
train.columns

Index(['Headline', 'Body ID', 'Stance'], dtype='object')

**Looking at the headlines, one finds text that needs some preprocessing:**

In [53]:
rev=115
print('TRAIN\n')
print(train['Headline'][rev][:120],'\n')


TRAIN

Christian Bale In Talks To Play Steve Jobs In Sony's Next Biopic 



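The first step in processing the headlines is to replace every character that is not a letter, digit, or apostrophe with a space, and then to replace every run of digits with the placeholder `n`. Each headline is then split on whitespace and truncated to its first 100 words. HTML-tag removal, stopword removal, and stemming (which would map words such as *entertained* and *entertaining* to the same stem) are left commented out in the code below.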

In [54]:
rep_numbers = re.compile(r'\d+')

# Matches anything that is not a word character or an apostrophe, plus underscores
rep_special_chars = re.compile(r"[^\w']|_")

stemmer = PorterStemmer()  # only needed if the stemming step below is re-enabled
nltk.download("stopwords", quiet=True)

def review_to_words(review):
    """Clean one headline and return it as a single space-separated string."""
    # Optional: strip HTML tags first
    # review = BeautifulSoup(review, "html.parser").get_text()

    text = rep_special_chars.sub(' ', review)  # replace special characters with spaces
    text = rep_numbers.sub('n', text)          # replace every run of digits with 'n'

    words = text.split()[:100]  # split on whitespace and keep the first 100 words
    # Optional: remove stopwords
    # words = [w for w in words if w not in stopwords.words("english")]
    # Optional: reduce words to their stems
    # words = [stemmer.stem(w) for w in words]

    return ' '.join(words)

The `review_to_words` function defined above uses regular expressions to replace special characters and digits, and keeps at most the first 100 words of each headline. The `BeautifulSoup` and `nltk` steps are commented out. As a check that everything works as expected, apply `review_to_words` to one of the headlines in the training set.

In [56]:
rev=3
print(train['Headline'][rev],'\n AFTER: \n')
print(review_to_words(train['Headline'][rev]))

HBO and Apple in Talks for $15/Month Apple TV Streaming Service Launching in April 
 AFTER: 

HBO and Apple in Talks for n Month Apple TV Streaming Service Launching in April
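
To see where the `n` in the output comes from, the sketch below applies the same two patterns used by `review_to_words` one step at a time. It relies only on the standard `re` module; the names `step1`, `step2`, and `step3` are introduced here for illustration.

```python
import re

# Same patterns as in review_to_words.
rep_special_chars = re.compile(r"[^\w']|_")
rep_numbers = re.compile(r"\d+")

headline = ("HBO and Apple in Talks for $15/Month Apple TV "
            "Streaming Service Launching in April")

step1 = rep_special_chars.sub(" ", headline)  # '$' and '/' become spaces
step2 = rep_numbers.sub("n", step1)           # '15' becomes 'n'
step3 = " ".join(step2.split()[:100])         # collapse whitespace, cap at 100 words

for label, value in [("special chars", step1), ("numbers", step2), ("joined", step3)]:
    print(f"{label:>13}: {value}")
```

Because special characters are replaced first, `$15/Month` turns into ` 15 Month` before the digits are substituted, which is why the price collapses to a single `n` token.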


The cell below applies `review_to_words` to every headline in the training set and caches the result as a JSON Lines file. On later runs, you can load the libraries and start the notebook from here.

In [57]:
train.Headline = train.Headline.apply(review_to_words)
train.to_json('./data/train_stance.json', orient='records', lines=True)
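
A file written with `orient='records', lines=True` holds one JSON object per line and can be read back with `pd.read_json(..., lines=True)`. The round trip below is a sketch on made-up data: the `toy` frame and the file name `toy_stance.json` are assumptions, and a temporary directory is used so nothing under `./data` is touched.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned headlines; all values are made up.
toy = pd.DataFrame({
    "Headline": ["HBO and Apple in Talks for n Month", "Another headline"],
    "Body ID": [1, 2],
    "Stance": ["unrelated", "discuss"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "toy_stance.json")  # hypothetical file name
    toy.to_json(cache_path, orient="records", lines=True)  # one JSON object per line
    reloaded = pd.read_json(cache_path, orient="records", lines=True)

print(reloaded)
```

Reading the real cache would presumably use the same call with `./data/train_stance.json` as the path.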

### Yet another dataset

In [58]:
train = pd.read_csv('./data/fake.csv') 

train.head(1)

|   | uuid | ord_in_thread | author | published | title | text | language | crawled | site_url | country | domain_rank | thread_title | spam_score | main_img_url | replies_count | participants_count | likes | comments | shares | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6a175f46bcd24d39b3e962ad0f29936721db70db | 0 | Barracuda Brigade | 2016-10-26T21:41:00.000+03:00 | Muslims BUSTED: They Stole Millions In Gov’t B... | Print They should pay all the back all the mon... | english | 2016-10-27T01:49:27.168+03:00 | 100percentfedup.com | US | 25689.0 | Muslims BUSTED: They Stole Millions In Gov’t B... | 0.0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |


In [60]:
train['title_text'] = train['title'] +' '+ train['text']

In [62]:
train['title_text'][0]

'Muslims BUSTED: They Stole Millions In Gov’t Benefits Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! More Related'
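
One caveat with concatenating columns using `+`: if either `title` or `text` is missing in a row, the combined value for that row is also missing. Whether `fake.csv` contains such rows is not shown here, so the sketch below uses a made-up `toy` frame to illustrate the behavior and one possible workaround with `fillna('')`.

```python
import numpy as np
import pandas as pd

# Toy rows; the second one has no title. All values are made up.
toy = pd.DataFrame({
    "title": ["Some title", np.nan],
    "text": ["Some body text.", "Body without a title."],
})

# Plain concatenation: a missing title makes the whole result missing.
plain = toy["title"] + " " + toy["text"]

# NaN-safe variant: treat missing pieces as empty strings, then trim stray spaces.
safe = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(plain.isnull().sum(), "missing with plain concatenation")
print(safe.isnull().sum(), "missing with the NaN-safe variant")
```

Whether to fill or to drop such rows is a modelling choice; filling keeps articles that have a body but no title.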