## Fake news detection

This notebook uses the fake and real news dataset from Kaggle (https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset) to train a fake news classifier with SageMaker BlazingText.

Begin by importing the standard SageMaker modules and setting up the execution role, session, and S3 bucket

In [None]:
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'fake-news'

print('using bucket %s'%bucket)

### Data preparation

Begin by preparing the data for use with the BlazingText algorithm

In [None]:
import pandas as pd
import sagemaker

Import the real news data

In [None]:
true_data = pd.read_csv("True.csv")
# BlazingText expects labels prefixed with "__label__"
true_data['type'] = '__label__true'
# Keep only the title and the label
true_data.drop(columns=['text', 'subject', 'date'], inplace=True)

true_data.head()

Import the fake news data

In [None]:
false_data = pd.read_csv("Fake.csv")
# BlazingText expects labels prefixed with "__label__"
false_data['type'] = '__label__false'
# Keep only the title and the label
false_data.drop(columns=['text', 'subject', 'date'], inplace=True)

false_data.head()

Combine the data sets together

In [None]:
from sklearn.utils import shuffle

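# ignore_index=True gives the combined frame a unique index instead of
# repeating the row numbers from each source file
full_data = pd.concat([true_data, false_data], ignore_index=True)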
full_data['tokenized_sentence'] = ''

full_data.head()

Now tokenize the dataset and make it ready to use with BlazingText

In [None]:
import multiprocessing
import csv
import nltk
# Tokenizer models used by nltk.word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

In [None]:
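# Lowercase each title, tokenize it with NLTK, and join the tokens with spaces.
# Assigning the whole column stores the result in full_data; the rows yielded
# by iterrows() are copies, so writing to them does not update the frame.
full_data['tokenized_sentence'] = full_data['title'].apply(
    lambda title: ' '.join(nltk.word_tokenize(title.lower())))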
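
For reference, BlazingText's supervised mode expects one example per line: the label prefixed with `__label__`, followed by the space-separated tokens. The sketch below shows that shape for a single title. `simple_tokenize` and `to_blazingtext_line` are hypothetical helpers, and the regex tokenizer is only a stand-in for `nltk.word_tokenize` so the example runs without the punkt download; its output can differ from NLTK's on some inputs.

```python
import re

def simple_tokenize(text):
    """Stand-in tokenizer (assumption): lowercase, then split into word runs
    and single punctuation marks. Not equivalent to nltk.word_tokenize."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def to_blazingtext_line(label, title):
    """Build one training line: '__label__<label>' followed by the tokens."""
    return "__label__{} {}".format(label, " ".join(simple_tokenize(title)))

print(to_blazingtext_line("true", "French socialists sell historic headquarters for $53 million"))
```

The label comes first on the line, which is why the `type` column has to precede the tokenized text when the frame is written out.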

The data looks good, so drop the "title" column and split the rest into training and test data sets

In [None]:
full_data.drop(columns='title', inplace=True)
full_data.head()

In [None]:
from sklearn.model_selection import train_test_split

full_data = shuffle(full_data)
train, test = train_test_split(full_data, test_size=0.2)

In [None]:
# Write one "<label> <tokens>" line per example, with no header, index, or quoting
train.to_csv('train.data', index=False, header=False, sep=' ', quoting=csv.QUOTE_NONE, escapechar=" ")
test.to_csv('test.data', index=False, header=False, sep=' ', quoting=csv.QUOTE_NONE, escapechar=" ")
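
Before uploading, it can be worth confirming that every line in the written files has the shape BlazingText expects. This is a minimal sketch, assuming the two labels used above; `find_bad_lines` is a hypothetical helper and not part of the SageMaker SDK.

```python
def find_bad_lines(path, labels=("__label__true", "__label__false")):
    """Return (line_number, line) pairs that are not a known label followed by text."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            # Split off the first whitespace-separated field (the label)
            parts = line.split(None, 1)
            if len(parts) < 2 or parts[0] not in labels:
                bad.append((number, line.rstrip("\n")))
    return bad
```

Calling `find_bad_lines('train.data')` and `find_bad_lines('test.data')` should return empty lists if the files were written as intended.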

In [None]:
train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='train.data', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='test.data', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)


s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

### Training

Now that the data has been preprocessed and uploaded to S3, we can train the model

In [None]:
import boto3
region_name = boto3.Session().region_name

In [None]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

In [None]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.c4.4xlarge',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

In [None]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=40,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=3)

In [None]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

In [None]:
bt_model.fit(inputs=data_channels, logs=True)

### Hosting/Inference

Deploy the model to an endpoint so we can run inference against it

In [None]:
text_classifier = bt_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
runtime_client = boto3.client('runtime.sagemaker')

In [None]:
import json

sentences = ["Fukushima Nuclear Flowers: Not much more to say, this is what happens when flowers get nuclear birth defects.",
            "Did AOC Commit a Felony by Making a False Accusation Against Trumpf?",
            "where’s the oversight? obama funneled billions to liberal groups through doj ‘slush fund’",
            "french socialists sell historic headquarters for $53 million",
            "trump calls for special prosecutor to investigate clinton foundation",
            "obama names u.s. ambassador to cuba, setting up confirmation fight"]

# Apply the same preprocessing used for training: lowercase, then tokenize with NLTK
tokenized_sentences = [' '.join(nltk.word_tokenize(sent.lower())) for sent in sentences]

payload = {"instances" : tokenized_sentences}
#print(payload)
response = text_classifier.predict(json.dumps(payload))

#print(response)
predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

#*******************************************************************************************************************
#response = runtime_client.invoke_endpoint(EndpointName = "<ENDPOINT NAME>", ContentType = "application/json", Body = json.dumps(payload))
#body = response['Body']
#result = body.read().decode("UTF-8")

#for item in result:
#    print(item)

#print(result)
#print(json.dumps(result, indent=2))
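
The endpoint's JSON response is a list with one entry per input sentence. The sketch below pairs each sentence with its top label and probability, assuming the response shape documented for BlazingText text classification (a `label` list and a `prob` list per entry). `summarize_predictions` is a hypothetical helper, and the sample response is made-up data for illustration, not output from this model.

```python
def summarize_predictions(sentences, predictions, prefix="__label__"):
    """Pair each sentence with its top predicted label (prefix removed) and probability."""
    results = []
    for sentence, prediction in zip(sentences, predictions):
        label = prediction["label"][0]
        if label.startswith(prefix):
            label = label[len(prefix):]
        results.append((sentence, label, prediction["prob"][0]))
    return results

# Made-up response in the assumed shape, for illustration only
sample_sentences = ["first headline", "second headline"]
sample_predictions = [
    {"label": ["__label__true"], "prob": [0.93]},
    {"label": ["__label__false"], "prob": [0.88]},
]

for sentence, label, prob in summarize_predictions(sample_sentences, sample_predictions):
    print("{:<6} {:.2f}  {}".format(label, prob, sentence))
```

In the notebook, the same helper could be called as `summarize_predictions(sentences, predictions)` once `predictions` has been parsed from the endpoint response.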

In [None]:
sess.delete_endpoint(text_classifier.endpoint)