<a href="https://colab.research.google.com/github/rahiakela/automl-experiments/blob/main/automated-machine-learning-with-autokeras/05-text-classification-and-regression/02_predicting_news_popularity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Predicting news popularity in social media

In this notebook, we will create a model that predicts the popularity
score of an article on social media platforms based on its text. For this,
we will train the model with a [News Popularity dataset collected between
2015 and 2016](https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms).

As we want to approximate a score (number of likes), we will use a text regressor for
this task.

##Setup

In [None]:
!pip3 -q install autokeras

In [1]:
import tensorflow as tf
from tensorflow.keras.utils import plot_model

import pandas as pd 
import numpy as np
import autokeras as ak
from sklearn import model_selection
from sklearn.model_selection import train_test_split

##Preparing the dataset

First, we load the News Popularity dataset directly from
the UCI Machine Learning Repository.

In [2]:
news_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00432/Data/News_Final.csv")

In [3]:
news_df.head()

Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn
0,99248.0,Obama Lays Wreath at Arlington National Cemetery,Obama Lays Wreath at Arlington National Cemete...,USA TODAY,obama,2002-04-02 00:00:00,0.0,-0.0533,-1,-1,-1
1,10423.0,A Look at the Health of the Chinese Economy,"Tim Haywood, investment director business-unit...",Bloomberg,economy,2008-09-20 00:00:00,0.208333,-0.156386,-1,-1,-1
2,18828.0,Nouriel Roubini: Global Economy Not Back to 2008,"Nouriel Roubini, NYU professor and chairman at...",Bloomberg,economy,2012-01-28 00:00:00,-0.42521,0.139754,-1,-1,-1
3,27788.0,Finland GDP Expands In Q4,Finland's economy expanded marginally in the t...,RTT News,economy,2015-03-01 00:06:00,0.0,0.026064,-1,-1,-1
4,27789.0,"Tourism, govt spending buoys Thai economy in J...",Tourism and public spending continued to boost...,The Nation - Thailand's English news,economy,2015-03-01 00:11:00,0.0,0.141084,-1,-1,-1


Because we want to estimate the popularity score of an article from
its title and headline, we concatenate both fields into a single text input per article.

In [4]:
text_inputs = np.array(news_df.Title + ". " + news_df.Headline).astype("str")
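One detail of this concatenation is worth checking: if either column has a missing value, pandas propagates it as NaN, and `astype("str")` then turns it into the literal string `"nan"`. The sketch below is not part of the original notebook; it uses a small hypothetical frame (`toy_df`) with the same column names to show the effect and one possible way to avoid it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for news_df (hypothetical rows, same column names).
toy_df = pd.DataFrame({
    "Title": ["Finland GDP Expands In Q4", "A Look at the Economy"],
    "Headline": ["Finland's economy expanded marginally.", None],
})

# Plain concatenation: the missing headline propagates as NaN, then becomes "nan".
naive = np.array(toy_df.Title + ". " + toy_df.Headline).astype("str")

# Filling missing text first keeps the title intact.
safe = np.array(
    toy_df.Title.fillna("") + ". " + toy_df.Headline.fillna("")
).astype("str")

print(naive[1])
print(safe[1])
```

Whether the real dataset needs this step depends on how many rows lack a title or headline, which `news_df.isna().sum()` would show.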

Now, we extract the popularity score of each article on LinkedIn, to be used as
labels. We have decided to use only the LinkedIn scores to simplify the example.

In [5]:
media_success_outputs = news_df.LinkedIn.to_numpy(dtype="int")
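Before training, it can help to look at how the labels are distributed, since the preview above shows `-1` in the popularity columns of the first rows. The following is a minimal sketch with hypothetical values (`toy_labels`), not the real column; in the notebook the same expressions could be applied to `media_success_outputs`.

```python
import numpy as np

# Hypothetical label values standing in for news_df.LinkedIn.
toy_labels = np.array([-1, -1, 0, 3, 12, 250], dtype="int")

# Share of rows carrying a negative value such as the -1 seen in the preview.
negative_share = float((toy_labels < 0).mean())

# Comparing mean and median gives a quick read on how skewed the scores are.
print(negative_share, toy_labels.mean(), np.median(toy_labels))
```

A mean far above the median would suggest a few very popular articles dominate the squared-error loss.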

In [6]:
# split the dataset in a train and test set
x_train, x_test, y_train, y_test = train_test_split(text_inputs, media_success_outputs, test_size=.2, random_state=2021)

##Creating a text regressor

Because we want to predict a popularity score from a set of text sentences, and this score
is a scalar value, we are going to use AutoKeras TextRegressor.

In [9]:
# Initialize the TextRegressor
clf = ak.TextRegressor(max_trials=2, overwrite=True)

# Callback to avoid overfitting with the EarlyStopping.
cbs = [tf.keras.callbacks.EarlyStopping(patience=2)]

#  Search for the best model
clf.fit(x_train, y_train, callbacks=cbs)

Trial 2 Complete [00h 08m 04s]
val_loss: 29302.716796875

Best val_loss So Far: 28876.587890625
Total elapsed time: 00h 14m 36s
INFO:tensorflow:Oracle triggered exit
Epoch 1/3
Epoch 2/3
Epoch 3/3
INFO:tensorflow:Assets written to: ./text_regressor/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7f9b0b6d6b90>

After two trials, the best model reaches a validation loss (mean squared error, or MSE)
of about 28,877. Taking the square root gives a root mean squared error (RMSE) of
about 170, so predictions are typically off by a margin of that order in the final score,
which is a reasonable starting point for the time invested.
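The conversion from MSE to an error expressed in the same unit as the labels can be checked directly. This small sketch takes the best `val_loss` value printed in the search log above and computes its square root.

```python
import math

# Best validation loss (MSE) copied from the search log above.
best_val_mse = 28876.587890625

# RMSE is the square root of MSE and is expressed in the same unit as the labels.
best_val_rmse = math.sqrt(best_val_mse)
print(f"validation RMSE: {best_val_rmse:.1f}")
```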

##Evaluating the model

It's time to evaluate the best model with the testing dataset.

In [11]:
clf.evaluate(x_test, y_test)



[56350.63671875, 56350.63671875]

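`evaluate` returns the loss followed by the tracked metric; both are the MSE here, so the
two values match. The test MSE is about 56,351 (an RMSE of about 237), which is noticeably
worse than the best validation loss, so this model still leaves considerable room for improvement.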
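An MSE value is easier to judge against a trivial baseline. The sketch below, which is not part of the original notebook, computes the MSE obtained by always predicting the mean of the training labels; `mean_baseline_mse` and the toy arrays are hypothetical names and values introduced for illustration.

```python
import numpy as np

def mean_baseline_mse(y_train, y_test):
    """MSE obtained by predicting the training mean for every test item."""
    prediction = np.mean(y_train)
    return float(np.mean((np.asarray(y_test) - prediction) ** 2))

# Hypothetical toy labels; in the notebook, pass y_train and y_test instead.
toy_train = np.array([0, 2, 4, 6])
toy_test = np.array([1, 5])
print(mean_baseline_mse(toy_train, toy_test))
```

If the value returned by `evaluate` is not clearly below this baseline on the real split, the model has learned little from the text.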

##Visualizing the model

Now we can look at a brief summary of the architecture of the best model found.

In [12]:
model = clf.export_model()
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None,)]                 0         
_________________________________________________________________
expand_last_dim (ExpandLastD (None, 1)                 0         
_________________________________________________________________
text_vectorization (TextVect (None, 64)                0         
_________________________________________________________________
embedding (Embedding)        (None, 64, 128)           640128    
_________________________________________________________________
dropout (Dropout)            (None, 64, 128)           0         
_________________________________________________________________
separable_conv1d (SeparableC (None, 62, 32)            4512      
_________________________________________________________________
separable_conv1d_1 (Separabl (None, 60, 32)            1152  

##Improving the model performance

If we need more precision in less time, we can fine-tune
our model using an advanced AutoKeras feature that lets us customize the search
space.

In [7]:
# Callback to avoid overfitting with the EarlyStopping.
cbs = [tf.keras.callbacks.EarlyStopping(patience=2)]

In [None]:
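# Custom search space: text input -> n-gram text block -> regression head.
input_node = ak.TextInput()
# TextBlock vectorizes the raw text itself, so it is connected directly to the
# input; max_tokens caps the vocabulary size.
output_node = ak.TextBlock(block_type="ngram", max_tokens=20000)(input_node)
output_node = ak.RegressionHead()(output_node)

# Search for the best model within the restricted space.
auto_model = ak.AutoModel(inputs=input_node, outputs=output_node, objective="val_mean_squared_error", max_trials=2)
auto_model.fit(x_train, y_train, callbacks=cbs)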

##Evaluating the model with the test set

After training, it is time to measure the actual prediction of our model using the reserved
test dataset.

In [None]:
auto_model.evaluate(x_test, y_test)

The performance is slightly better than that of the model without fine-tuning, and training it
for a longer time will likely improve it further.