# Lab V: Neural Network

<img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/20230602113310/Neural-Networks-Architecture.png">

## I have an idea: text classification!

<img src="https://gmu.ac.ae/wp-content/uploads/2017/03/idea.jpg">

## I need some data!

<img src="https://staging.herovired.com/wp-content/uploads/2023/04/What-Is-Data-Definition-01.webp">

## [Huggingface](https://huggingface.co/docs/datasets/index)

- IMDB dataset: hf://datasets/scikit-learn/imdb/IMDB Dataset.csv
- AG News dataset (loaded below): hf://datasets/fancyzhx/ag_news/

In [2]:
import pandas as pd

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df_ag_new = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
df_ag_new.head()

                                                text  label
0  Wall St. Bears Claw Back Into the Black (Reute...      2
1  Carlyle Looks Toward Commercial Aerospace (Reu...      2
2  Oil and Economy Cloud Stocks' Outlook (Reuters...      2
3  Iraq Halts Oil Exports from Main Southern Pipe...      2
4  Oil prices soar to all-time record, posing new...      2


In [3]:
df_ag_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    120000 non-null  object
 1   label   120000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.8+ MB


## Text cleaning

<img src="https://www.henryford.com/-/media/project/hfhs/henryford/henry-ford-blog/images/mobile-interior-banner-images/2019/02/bucket-of-cleaning-products.jpg">

In [None]:
# Lower case
# Remove white spaces
# Remove special characters
# Expand contractions: !pip install contractions
# Remove HTML or XML tags
# Remove punctuation
# Remove numbers
# Remove stop words
# Lemmatization
# Stemming
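
The cell further down only covers part of this checklist. As a minimal sketch of a few of the remaining steps (HTML tags, punctuation, numbers, repeated whitespace), using only the standard library: the helper name `clean_text` and the two regular expressions are illustrative assumptions, not part of the notebook.

```python
import html
import re

# Assumption: a naive tag matcher is good enough for short news snippets.
TAG_RE = re.compile(r"<[^>]+>")
# After lowercasing, anything that is not a-z or whitespace is treated as noise.
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def clean_text(text: str) -> str:
    """Hypothetical helper: strip tags, punctuation and digits, normalise spaces."""
    text = html.unescape(text)          # "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)        # drop HTML/XML tags
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text) # punctuation and digits become spaces
    return " ".join(text.split())       # collapse runs of whitespace

print(clean_text("  <b>Oil prices</b> soar to all-time record in 2004 &amp; beyond! "))
```

Replacing unwanted characters with a space, rather than deleting them, keeps "all-time" as two words instead of merging it into "alltime".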

In [5]:
# prompt: Please write Python code that cleans the text column
# of the df_ag_new dataset step by step, based on
# the following instructions:
# - Lower case
# - Remove white spaces
# - Remove special characters
# - Remove stop words
# - Lemmatization
# Show a progress bar for every task (instruction) and use the tqdm
# library for this.

import pandas as pd
from tqdm import tqdm
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assuming df_ag_new is already defined and contains a 'text' column

# Lower case
tqdm.pandas(desc="Lowercasing")
df_ag_new['text'] = df_ag_new['text'].progress_apply(lambda x: x.lower())

# Remove white spaces (trim both ends and collapse repeated whitespace)
tqdm.pandas(desc="Removing whitespace")
df_ag_new['text'] = df_ag_new['text'].progress_apply(lambda x: ' '.join(x.split()))

# Remove special characters
tqdm.pandas(desc="Removing special characters")
df_ag_new['text'] = df_ag_new['text'].progress_apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Remove stop words
stop_words = set(stopwords.words('english'))
tqdm.pandas(desc="Removing stop words")
df_ag_new['text'] = df_ag_new['text'].progress_apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# Lemmatization
lemmatizer = WordNetLemmatizer()
tqdm.pandas(desc="Lemmatization")
df_ag_new['text'] = df_ag_new['text'].progress_apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

print(df_ag_new.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Lowercasing: 100%|██████████| 120000/120000 [00:00<00:00, 532558.18it/s]
Removing whitespace: 100%|██████████| 120000/120000 [00:00<00:00, 920735.69it/s]
Removing special characters: 100%|██████████| 120000/120000 [00:00<00:00, 288537.35it/s]
Removing stop words: 100%|██████████| 120000/120000 [00:00<00:00, 155116.71it/s]
Lemmatization: 100%|██████████| 120000/120000 [00:14<00:00, 8259.92it/s]

                                                text  label
0  wall st bear claw back black reuters reuters s...      2
1  carlyle look toward commercial aerospace reute...      2
2  oil economy cloud stock outlook reuters reuter...      2
3  iraq halt oil export main southern pipeline re...      2
4  oil price soar alltime record posing new menac...      2





In [6]:
# prompt: Write a unit test to validate the lowercasing step.

import pandas as pd
import unittest
from tqdm import tqdm  # needed for progress_apply below

class TestLowerCasing(unittest.TestCase):

    def test_lower_casing(self):
        # Create a sample DataFrame with a 'text' column
        df = pd.DataFrame({'text': ['This Is A Test', 'ANOTHER TEST']})

        # Apply the lower casing function to the DataFrame
        tqdm.pandas(desc="Lowercasing")
        df['text'] = df['text'].progress_apply(lambda x: x.lower())

        # Assert that all text values are lower case
        self.assertTrue(all([isinstance(text, str) and text.islower() for text in df['text']]))


# Run the tests
unittest.main(argv=['first-arg-is-ignored'], exit=False)


Lowercasing: 100%|██████████| 2/2 [00:00<00:00, 1158.49it/s]
.
----------------------------------------------------------------------
Ran 1 test in 0.010s

OK


<unittest.main.TestProgram at 0x7a99053ecfd0>

## Training, validation and test set

<img src="https://www.brainstobytes.com/content/images/2020/01/Sets.png">



In [7]:
X = df_ag_new['text'].values
y = df_ag_new['label'].values
X.shape, y.shape

((120000,), (120000,))

In [9]:
# prompt: Split the X and y variables into train, validation and test data.
# Train should be 80%; validation and test should be 10% each.

from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


X_train shape: (96000,)
y_train shape: (96000,)
X_val shape: (12000,)
y_val shape: (12000,)
X_test shape: (12000,)
y_test shape: (12000,)
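
The two `train_test_split` calls first hold out 20% of the rows and then halve that hold-out, which gives 80/10/10. A minimal standard-library sketch of the same idea on row indices: the function name `split_indices` is hypothetical, and the shuffle will not reproduce scikit-learn's exact row order.

```python
import random

def split_indices(n, seed=42, train_frac=0.8, val_frac=0.1):
    """Hypothetical helper: shuffle row indices, then slice into three parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # whatever is left
    return train, val, test

train_idx, val_idx, test_idx = split_indices(120_000)
print(len(train_idx), len(val_idx), len(test_idx))
```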


In [17]:
# prompt: Apply count vectorization to my X train,
# test and val variables. The new variable names should be X_train_cv,
# X_test_cv and X_val_cv

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)

X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)
X_val_cv = vectorizer.transform(X_val)
X_train_cv.shape, X_test_cv.shape, X_val_cv.shape

((96000, 1000), (12000, 1000), (12000, 1000))
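
What the vectorizer does can be sketched by hand: learn a vocabulary of the most frequent tokens on the training texts only, then count those tokens in every document. This is an illustration under simplifying assumptions (whitespace tokenisation, hypothetical helper names `fit_vocabulary` and `transform`), not scikit-learn's implementation.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens and give each a column index."""
    counts = Counter(token for doc in docs for token in doc.split())
    kept = [token for token, _ in counts.most_common(max_features)]
    return {token: column for column, token in enumerate(sorted(kept))}

def transform(docs, vocab):
    """Turn each document into a list of token counts, one column per vocab entry."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.split():
            if token in vocab:          # tokens unseen during fitting are ignored
                row[vocab[token]] += 1
        rows.append(row)
    return rows

train_docs = ["oil price soar", "oil price halt", "stock outlook oil"]
vocab = fit_vocabulary(train_docs, max_features=2)
print(vocab)
print(transform(["oil oil stock", "price unseen"], vocab))
```

Fitting on the training split only, as the cell above does, keeps information from the validation and test texts out of the vocabulary.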

In [16]:
# prompt: Apply TF-IDF to my X train, test and val variables.
# The new variable names should be X_train_tfidf, X_test_tfidf and
# X_val_tfidf

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=1000)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
X_val_tfidf = tfidf_vectorizer.transform(X_val)

print("X_train_tfidf shape:", X_train_tfidf.shape)
print("X_test_tfidf shape:", X_test_tfidf.shape)
print("X_val_tfidf shape:", X_val_tfidf.shape)


X_train_tfidf shape: (96000, 1000)
X_test_tfidf shape: (12000, 1000)
X_val_tfidf shape: (12000, 1000)
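
TF-IDF reweights the raw counts so that tokens appearing in many documents count for less. A rough sketch, assuming the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row (scikit-learn's documented defaults); the helper name `tfidf_rows` and the whitespace tokenisation are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hypothetical helper: return (vocabulary, L2-normalised TF-IDF rows)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by 0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab_tfidf, rows = tfidf_rows(["oil price", "oil export", "stock market"])
first = dict(zip(vocab_tfidf, rows[0]))
print(first["oil"], first["price"])
```

In the first document, "price" ends up with a larger weight than "oil" because "oil" also occurs in the second document.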


## Modelling

In [15]:
set(y_train[:10])

{0, 1, 2, 3}
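
`set(y_train[:10])` only looks at the first ten labels. Counting every label also shows how balanced the classes are. A small sketch with made-up labels: `class_counts` is a hypothetical helper, and for the real data one would pass something like `y_train.tolist()`.

```python
from collections import Counter

def class_counts(labels):
    """Return {label: number of rows}, sorted by label."""
    return dict(sorted(Counter(labels).items()))

example_labels = [2, 2, 0, 3, 1, 0, 3, 3]   # illustrative values, not the dataset
print(class_counts(example_labels))
```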

In [18]:
# prompt: Generate a classifier neural network with two hidden layers,
# apply it to the count-vectorized data and use the TensorFlow machine
# learning library.

import tensorflow as tf

# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train_cv.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax')  # Assuming 4 classes in your dataset
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train_cv.toarray(), y_train, epochs=10, batch_size=32, validation_data=(X_val_cv.toarray(), y_val))

# Evaluate the model
loss, accuracy = model.evaluate(X_test_cv.toarray(), y_test)
print('Test accuracy:', accuracy)




Epoch 1/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 12s 4ms/step - accuracy: 0.8262 - loss: 0.4938 - val_accuracy: 0.8665 - val_loss: 0.3650
Epoch 2/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 20s 4ms/step - accuracy: 0.8910 - loss: 0.3074 - val_accuracy: 0.8746 - val_loss: 0.3495
Epoch 3/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 24s 5ms/step - accuracy: 0.9150 - loss: 0.2383 - val_accuracy: 0.8798 - val_loss: 0.3504
Epoch 4/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 16s 3ms/step - accuracy: 0.9382 - loss: 0.1753 - val_accuracy: 0.8744 - val_loss: 0.3820
Epoch 5/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.9578 - loss: 0.1213 - val_accuracy: 0.8752 - val_loss: 0.4401
Epoch 6/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 18s 3ms/step - accuracy: 0.9705 - loss: 0.0842 - val_accuracy: 0.8705 - val_loss: 0.5225
Epoch 7/10
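
The last layer uses `softmax` and the loss is `sparse_categorical_crossentropy`. For one sample, both can be written out by hand: softmax turns the four raw scores into probabilities, and the loss is the negative log of the probability given to the true class. A plain-Python sketch with made-up scores, not the Keras implementation:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample whose true class is the integer `label`."""
    return -math.log(probs[label])

probs = softmax([2.0, 0.5, 0.1, -1.0])          # illustrative scores for 4 classes
print(probs)
print(sparse_categorical_crossentropy(probs, label=0))
```

"Sparse" here means the labels are plain integers (0 to 3), as in `y_train`, rather than one-hot vectors.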

In [19]:
# prompt: Generate a classifier neural network with two hidden layers, apply it to the TF-IDF data and use the TensorFlow machine learning library.

# Define the model
model_tfidf = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train_tfidf.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax')  # Assuming 4 classes in your dataset
])

# Compile the model
model_tfidf.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model_tfidf.fit(X_train_tfidf.toarray(), y_train, epochs=10, batch_size=32, validation_data=(X_val_tfidf.toarray(), y_val))

# Evaluate the model
loss_tfidf, accuracy_tfidf = model_tfidf.evaluate(X_test_tfidf.toarray(), y_test)
print('Test accuracy (TF-IDF):', accuracy_tfidf)


Epoch 1/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 12s 4ms/step - accuracy: 0.8241 - loss: 0.5063 - val_accuracy: 0.8691 - val_loss: 0.3611
Epoch 2/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - accuracy: 0.8776 - loss: 0.3365 - val_accuracy: 0.8733 - val_loss: 0.3498
Epoch 3/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 11s 4ms/step - accuracy: 0.8982 - loss: 0.2850 - val_accuracy: 0.8760 - val_loss: 0.3383
Epoch 4/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 21s 4ms/step - accuracy: 0.9167 - loss: 0.2372 - val_accuracy: 0.8803 - val_loss: 0.3473
Epoch 5/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 20s 3ms/step - accuracy: 0.9370 - loss: 0.1845 - val_accuracy: 0.8780 - val_loss: 0.3765
Epoch 6/10
3000/3000 ━━━━━━━━━━━━━━━━━━━━ 21s 4ms/step - accuracy: 0.9553 - loss: 0.1322 - val_accuracy: 0.8767 - val_loss: 0.4153
Epoch 7/10
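
In both training logs the training loss keeps falling while `val_loss` starts rising after a few epochs, which is the usual sign of overfitting. The sketch below replays the logged `val_loss` values to find the best epoch and where a patience-based stop would have triggered. The helper names are hypothetical; in Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback, which the notebook does not use.

```python
def best_epoch(val_losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def stop_epoch(val_losses, patience):
    """1-based epoch at which training would stop, or None if it never would."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0      # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# val_loss per epoch, copied from the two training logs (epochs 1 to 6).
val_loss_cv = [0.3650, 0.3495, 0.3504, 0.3820, 0.4401, 0.5225]
val_loss_tfidf = [0.3611, 0.3498, 0.3383, 0.3473, 0.3765, 0.4153]

print(best_epoch(val_loss_cv), stop_epoch(val_loss_cv, patience=2))
print(best_epoch(val_loss_tfidf), stop_epoch(val_loss_tfidf, patience=2))
```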


## Evaluating

In [20]:
# prompt: Evaluate the count-vectorized and TF-IDF models on the test datasets

loss_cv, accuracy_cv = model.evaluate(X_test_cv.toarray(), y_test)
print('Test accuracy (CountVectorizer):', accuracy_cv)

loss_tfidf, accuracy_tfidf = model_tfidf.evaluate(X_test_tfidf.toarray(), y_test)
print('Test accuracy (TF-IDF):', accuracy_tfidf)


375/375 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8644 - loss: 0.9043
Test accuracy (CountVectorizer): 0.8652499914169312
375/375 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.8692 - loss: 0.7544
Test accuracy (TF-IDF): 0.8709999918937683


## Prediction / Inference

In [21]:
# prompt: Please create Python code which demonstrates prediction on some elements of the test dataset.

# Assuming you have already trained your model (e.g., model_tfidf) and have X_test_tfidf

# Choose some elements from the test dataset (e.g., the first 5)
num_samples_to_predict = 5
X_test_samples = X_test_tfidf[:num_samples_to_predict].toarray()

# Make predictions using your model
predictions = model_tfidf.predict(X_test_samples)

# Get the predicted class labels (the class with the highest probability)
predicted_labels = [tf.argmax(prediction).numpy() for prediction in predictions]

# Print the predicted labels and corresponding actual labels
print("Predictions:", predicted_labels)
print("Actual labels:", y_test[:num_samples_to_predict].tolist())

# You can also print the probabilities for each class if needed:
# for i, prediction in enumerate(predictions):
#   print(f"Sample {i+1}: Probabilities: {prediction}")



1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 83ms/step
Predictions: [3, 0, 2, 3, 3]
Actual labels: [0, 0, 2, 3, 0]
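
To make the five predictions easier to read, the integer labels can be mapped to class names and scored. This sketch reuses the two lists printed above; the label order in `AG_NEWS_LABELS` is an assumption taken from the AG News dataset card, and `accuracy` is a hypothetical helper.

```python
# Assumed label order for AG News: 0 World, 1 Sports, 2 Business, 3 Sci/Tech.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def accuracy(predicted, actual):
    """Share of positions where the predicted label equals the actual label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

predicted = [3, 0, 2, 3, 3]   # from the output above
actual = [0, 0, 2, 3, 0]      # from the output above

for p, a in zip(predicted, actual):
    status = "ok" if p == a else "wrong"
    print(f"predicted {AG_NEWS_LABELS[p]:<8} actual {AG_NEWS_LABELS[a]:<8} {status}")

print("Sample accuracy:", accuracy(predicted, actual))
```

Three of the five samples match, so the sample accuracy is 0.6; five rows are far too few to compare with the test accuracy reported earlier.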
