<a href="https://colab.research.google.com/github/viktoruebelhart/Keras_NPL_News/blob/main/Keras_NPL_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## News Portal

This news portal wants to automate the classification of news articles into categories.

The categories are world, sports, business, and science and technology.

In [1]:
url ='https://github.com/allanspadini/curso-tensorflow-proxima-palavra/raw/main/dados/train.zip'

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv(url, header=None,names=['ClassIndex', 'Title', 'Description'])

In [4]:
df.head()

,ClassIndex,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   ClassIndex   120000 non-null  int64 
 1   Title        120000 non-null  object
 2   Description  120000 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.7+ MB


In [6]:
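# Outputs below were recorded with ['Description'] (a list), which appended the literal word 'Description'.
df['Text'] = df['Title'] + ' ' + df['Description']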

In [7]:
df['Text']

,Text
0,Wall St. Bears Claw Back Into the Black (Reute...
1,Carlyle Looks Toward Commercial Aerospace (Reu...
2,Oil and Economy Cloud Stocks' Outlook (Reuters...
3,Iraq Halts Oil Exports from Main Southern Pipe...
4,"Oil prices soar to all-time record, posing new..."
...,...
119995,Pakistan's Musharraf Says Won't Quit as Army C...
119996,Renteria signing a top-shelf deal Description
119997,Saban not going to Dolphins yet Description
119998,Today's NFL games Description
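As a minimal sketch of combining two text columns row by row, the toy frame below (its rows are made up for illustration and are not taken from the dataset) adds one column to another with a space in between. Both operands are columns of the frame, so each title is joined with its own description.

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Title": ["Oil prices soar", "Today's NFL games"],
    "Description": ["Crude hits a record.", "Pittsburgh at NY Giants."],
})

# Series + str + Series concatenates the strings element by element.
toy["Text"] = toy["Title"] + " " + toy["Description"]

print(toy["Text"].tolist())
```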


In [8]:
df.head()

,ClassIndex,Title,Description,Text
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli...",Wall St. Bears Claw Back Into the Black (Reute...
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Carlyle Looks Toward Commercial Aerospace (Reu...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Oil and Economy Cloud Stocks' Outlook (Reuters...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Iraq Halts Oil Exports from Main Southern Pipe...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...","Oil prices soar to all-time record, posing new..."


In [9]:
df['ClassIndex'].unique()

array([3, 4, 2, 1])

In [10]:
df['ClassIndex'].value_counts()

ClassIndex,count
3,30000
4,30000
2,30000
1,30000


We will pass the labels to TensorFlow in numerical form, and the class indices must start at 0: with 4 classes, the sparse categorical cross-entropy loss used later expects labels in the range 0 to 3.

In [11]:
df['ClassIndex'] = df['ClassIndex'] - 1

In [12]:
df['ClassIndex'].unique()

array([2, 3, 1, 0])
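To make the shifted labels easier to read, the sketch below maps them back to category names. The order of `CLASS_NAMES` is an assumption: it follows the order in which the categories are listed at the top of this notebook, and it agrees with the sample rows, where `ClassIndex` 3 holds business news.

```python
import numpy as np

# Assumed mapping: position i holds the name of shifted label i.
CLASS_NAMES = ["world", "sports", "business", "science and technology"]

original = np.array([3, 4, 2, 1])   # values as they appear in the raw data
shifted = original - 1              # same shift applied to df['ClassIndex']

# Labels for a 4-class sparse loss must lie in {0, 1, 2, 3}.
in_range = shifted.min() == 0 and shifted.max() == len(CLASS_NAMES) - 1

names = [CLASS_NAMES[i] for i in shifted]
print(in_range, names)
```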

We need to split the data into one part to train the deep learning model and another to validate it.

To split it, we will use the scikit-learn library.

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
x_train, x_test, y_train, y_test = train_test_split(df['Text'].values, df['ClassIndex'].values, test_size=0.20, random_state=4256)
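The effect of `test_size=0.20` can be checked on a small toy array. The sketch below uses made-up texts and labels; the `stratify` argument is an optional addition that the cell above does not use, shown here only because it keeps the class balance identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 texts spread evenly over 4 classes (illustrative only).
texts = np.array([f"news item {i}" for i in range(20)])
labels = np.array([i % 4 for i in range(20)])

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.20,
    random_state=4256,
    stratify=labels,   # optional: preserve class proportions in both parts
)

print(len(x_tr), len(x_te))
print(np.bincount(y_te, minlength=4))
```

With 20 samples, 16 go to training and 4 to testing, and stratification puts one sample of each class in the test part.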

So far we have explored the data and split it into training and test sets, stored in x_train and x_test (the inputs) and y_train and y_test (the expected outputs of the neural network).

The inputs are still text, however. A neural network operates on numbers, so we need to convert the text into a numerical representation.

In [15]:
import tensorflow as tf

In [16]:
vocab_size = 1000

In [17]:
encoder = tf.keras.layers.TextVectorization(max_tokens=vocab_size)

In [18]:
encoder.adapt(x_train)

In [19]:
encoder_vocab = encoder.get_vocabulary()
encoder_vocab[:20]

['',
 '[UNK]',
 'description',
 'to',
 'in',
 'for',
 'on',
 'of',
 'ap',
 'the',
 '39s',
 'us',
 'a',
 'at',
 'reuters',
 'with',
 'new',
 'as',
 '39',
 'up']

In [20]:
example = "Today's NFL games PITTSBURG at NY GIANTS"

In [21]:
encoder(example)

<tf.Tensor: shape=(7,), dtype=int64, numpy=array([  1, 402, 251,   1,  13,   1, 303])>
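The encoding above can be approximated in plain Python to see where the indices come from. This is only a sketch of the idea, not the `TextVectorization` implementation: the helper names (`standardize`, `build_vocab`, `encode`) and the tiny corpus are hypothetical, and details such as tie-breaking between equally frequent words may differ from the real layer. It assumes the default behaviour of lowercasing, stripping punctuation, splitting on whitespace, and reserving index 0 for padding and index 1 for unknown words.

```python
import re
import string
from collections import Counter

PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")

def standardize(text):
    """Lowercase, strip punctuation, then split on whitespace."""
    return PUNCT.sub("", text.lower()).split()

def build_vocab(corpus, max_tokens):
    """Reserve '' (padding) and '[UNK]', then add the most frequent tokens
    until the vocabulary holds max_tokens entries."""
    counts = Counter(tok for doc in corpus for tok in standardize(doc))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each token to its vocabulary index, or 1 if it is unknown."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text)]

corpus = ["NFL games today", "Oil prices at record", "Games at home"]
vocab = build_vocab(corpus, max_tokens=6)
encoded = encode("Today's NFL games at PITTSBURG", vocab)
print(vocab)
print(encoded)
```

Words outside the vocabulary, such as "todays" (what remains of "Today's" after the apostrophe is stripped) and "pittsburg", map to 1, which is why the tensor above contains several 1 values.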

Now, we will build the neural network that will perform this classification process.

In [26]:
modelo = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=16,
        mask_zero=True
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax')
])
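The embedding and pooling steps of the model above can be sketched in NumPy. The table values, sizes, and token ids below are made up for illustration. The sketch assumes that, with `mask_zero=True`, the pooling layer averages only over positions whose token id is not 0, so padding does not dilute the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Two sequences padded with 0 to the same length.
token_ids = np.array([[2, 3, 0, 0],
                      [4, 1, 5, 2]])

vectors = embedding_table[token_ids]   # shape (batch, time, embed_dim)
mask = (token_ids != 0)[..., None]     # shape (batch, time, 1)

# Sum the vectors of real tokens and divide by the number of real tokens.
pooled = (vectors * mask).sum(axis=1) / mask.sum(axis=1)
print(pooled.shape)
```

Each text becomes one fixed-size vector regardless of its length, which is what allows the dense layers that follow to work with inputs of varying length.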

In [29]:
modelo.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
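As a rough illustration of what the `sparse_categorical_crossentropy` loss measures, the NumPy sketch below computes the mean negative log of the probability assigned to the true class. The function is a simplified stand-in written for this example, not the Keras implementation, and the probability rows are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

# A model that guesses uniformly over 4 classes.
uniform_loss = sparse_categorical_crossentropy(np.full((1, 4), 0.25), [0])

# A model that is confident and correct.
confident_loss = sparse_categorical_crossentropy([[0.97, 0.01, 0.01, 0.01]], [0])

print(uniform_loss, confident_loss)
```

A uniform guess over 4 classes gives a loss of ln(4), about 1.386, so an untrained model can be expected to start near that value.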

In [31]:
x_test[1]

"Palestinians Pour Out Grief Over Arafat's Death Description"

In [35]:
modelo.predict(x_test[:1])

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 557ms/step


array([[0.24963516, 0.24835794, 0.2510588 , 0.25094813]], dtype=float32)

The prediction is a vector with 4 values, one per class, each representing the probability that the text belongs to that class. To get a single predicted class, we take the index of the highest value with argmax.

In [39]:
modelo.predict(x_test[:1]).argmax(axis=1)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step


array([2])
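The same step can be reproduced without the model by applying `argmax` to the probability vector printed earlier. The sketch also measures how small the gap between the two highest probabilities is.

```python
import numpy as np

# Probabilities copied from the predict output shown earlier.
probs = np.array([[0.24963516, 0.24835794, 0.2510588, 0.25094813]])

pred = probs.argmax(axis=1)       # index of the largest value in each row
total = probs.sum(axis=1)         # softmax outputs sum to about 1

# Gap between the highest and second-highest probability.
top_two = np.sort(probs[0])[-2:]
margin = top_two[1] - top_two[0]

print(pred, total, margin)
```

The winning class leads by roughly 0.0001, so this prediction carries almost no information.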

In [37]:
y_test[1]

0

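The prediction is class 2, while the label printed above is 0. Note that x_test[:1] selects the first sample (index 0), whereas y_test[1] is the label of the second sample, so a fair comparison should use the same index on both sides. Either way, the four probabilities are all close to 0.25 because the neural network was initialized with random weights and has not been trained yet, so its prediction is no better than a guess.

So we need to train the neural network on our data.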