# Text Classification with ANN using TensorFlow and Keras

In this notebook, we will create a simple Artificial Neural Network (ANN) for text classification using a small dataset. We will use TensorFlow, Keras, Pandas, and Scikit-learn to build and evaluate our model.

## Install Required Libraries

First, ensure you have the necessary libraries installed. You can run the following command in a Jupyter Notebook cell:

```python
!pip install tensorflow pandas numpy scikit-learn
```

In [16]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

## Creating a DataFrame for Text Classification

In this section, we will create a DataFrame that contains sample text data and their corresponding labels. This DataFrame will serve as our dataset for training a simple neural network model for text classification.

### Define the Data

We start by defining a dictionary named `data`. This dictionary consists of two keys:

1. **text**: A list of sentences that represent different opinions and statements.
2. **label**: A list of corresponding labels that categorize the sentiment of each sentence. The possible labels include:
   - **positive**: Indicates a positive sentiment towards programming or problem-solving.
   - **negative**: Indicates a negative sentiment related to debugging.
   - **neutral**: Indicates a neutral sentiment towards programming languages.

In [17]:
data = {
    'text': [
        'I love programming in Python',
        'Python is an amazing language',
        'I enjoy machine learning',
        'Deep learning is a subset of machine learning',
        'I dislike bugs in code',
        'Debugging can be frustrating',
        'I love solving problems',
        'I prefer Java over C++',
        'C++ is powerful but complex',
        'I enjoy reading about algorithms'
    ],
    'label': [
        'positive', 'positive', 'positive', 'positive', 'negative',
        'negative', 'positive', 'neutral', 'neutral', 'positive'
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)
print(df)

                                            text     label
0                   I love programming in Python  positive
1                  Python is an amazing language  positive
2                       I enjoy machine learning  positive
3  Deep learning is a subset of machine learning  positive
4                         I dislike bugs in code  negative
5                   Debugging can be frustrating  negative
6                        I love solving problems  positive
7                         I prefer Java over C++   neutral
8                    C++ is powerful but complex   neutral
9               I enjoy reading about algorithms  positive


## Data Preprocessing: Cleaning Text and Converting Labels

In this section, we will preprocess the text data and convert the sentiment labels into a numerical format suitable for model training. This step is crucial in preparing our data for a machine learning model.

### Clean the Text Data

The first step in preprocessing is to clean the text data. We define a function called `clean_text` that converts the text to lowercase, which helps standardize the data and reduces the number of unique tokens.

```python
# Clean the text data
def clean_text(text):
    return text.lower()  # Convert to lowercase
```
Next, we apply this function to the `text` column of our DataFrame `df` using the `apply` method, which transforms every text entry to lowercase.

### Convert Labels to Numerical Format

Keras loss functions expect numerical targets, so we convert the sentiment labels into integer codes using pandas' categorical data type. Categories are sorted alphabetically before codes are assigned, so here `negative` becomes 0, `neutral` becomes 1, and `positive` becomes 2.

```python
df['label'] = df['label'].astype('category').cat.codes
```
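
Because `cat.codes` overwrites the label strings, it can help to record which integer stands for which label first. The following is a minimal sketch, assuming pandas' default sorted ordering of string categories; the names `labels`, `as_category`, and `code_to_label` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Same label values as in the dataset defined earlier
labels = pd.Series([
    'positive', 'positive', 'positive', 'positive', 'negative',
    'negative', 'positive', 'neutral', 'neutral', 'positive'
])

as_category = labels.astype('category')

# Categories are sorted, so the codes follow alphabetical order
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)                    # {0: 'negative', 1: 'neutral', 2: 'positive'}
print(as_category.cat.codes.tolist())   # [2, 2, 2, 2, 0, 0, 2, 1, 1, 2]
```

A dictionary built this way can later translate predicted class indices back into label strings without hard-coding the order.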

In [18]:
# Clean the text data
def clean_text(text):
    return text.lower()  # Convert to lowercase

df['text'] = df['text'].apply(clean_text)

# Convert labels to numerical format
df['label'] = df['label'].astype('category').cat.codes
print(df)

                                            text  label
0                   i love programming in python      2
1                  python is an amazing language      2
2                       i enjoy machine learning      2
3  deep learning is a subset of machine learning      2
4                         i dislike bugs in code      0
5                   debugging can be frustrating      0
6                        i love solving problems      2
7                         i prefer java over c++      1
8                    c++ is powerful but complex      1
9               i enjoy reading about algorithms      2


## Splitting the Dataset into Training and Validation Sets

In this section, we will split our dataset into training and validation sets. This step is essential for evaluating the performance of our model and ensuring it generalizes well to unseen data.

### Train-Test Split

We use the `train_test_split` function from the `sklearn.model_selection` module to divide our data. This function randomly splits the dataset into two parts: a training set and a validation set (sometimes called a test set).

In [19]:
X_train, X_val, y_train, y_val = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

print(f'Training samples: {len(X_train)}, Validation samples: {len(X_val)}')

Training samples: 8, Validation samples: 2


### Parameters Explained

- **`df['text']`**: The input features (text data) that we want to use for training and validation.
- **`df['label']`**: The target variable (labels) associated with the text data.
- **`test_size=0.2`**: This parameter specifies the proportion of the dataset to include in the validation set. In this case, 20% of the data will be used for validation, and the remaining 80% will be used for training.
- **`random_state=42`**: This parameter ensures that the split is reproducible. Using a fixed seed (like 42) means that you will get the same split every time you run the code, which is useful for debugging and consistency.
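
The effect of `random_state` can be checked directly. This is a small sketch, assuming placeholder sentences in place of the real text column; `texts`, `split_a`, and `split_b` are illustrative names.

```python
from sklearn.model_selection import train_test_split

texts = [f'sentence {i}' for i in range(10)]   # placeholder inputs
labels = [2, 2, 2, 2, 0, 0, 2, 1, 1, 2]        # integer codes from the table above

# Two calls with the same seed produce the same partition
split_a = train_test_split(texts, labels, test_size=0.2, random_state=42)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=42)

X_train_a, X_val_a, y_train_a, y_val_a = split_a
print(len(X_train_a), len(X_val_a))  # 8 2
print(split_a == split_b)            # True
```

With only two validation samples and three classes, at least one class is always missing from the validation set. Passing `stratify=labels` would not help at this size, since scikit-learn requires the validation set to hold at least as many samples as there are classes.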


## Text Vectorization

Text vectorization is a crucial step in natural language processing (NLP) that transforms text data into a numerical format that machine learning models can use. Neural networks operate on numbers, so each sentence must be mapped to a sequence of integers before it reaches the model.

### Setting Up Text Vectorization

In this section, we will use the `TextVectorization` layer from TensorFlow's Keras library to convert our text data into a format suitable for model training.

```python
max_features = 1000  # Maximum vocabulary size
sequence_length = 10  # Length of each input sequence after padding or truncation

vectorize_layer = layers.TextVectorization(max_tokens=max_features, output_sequence_length=sequence_length)
```

In [20]:
max_features = 1000  # Number of unique words
sequence_length = 10  # Length of each input sequence

vectorize_layer = layers.TextVectorization(max_tokens=max_features, output_sequence_length=sequence_length)

# Fit the vectorization layer on the training data
vectorize_layer.adapt(X_train)

# Vectorize the text data
X_train_vectorized = vectorize_layer(X_train)
X_val_vectorized = vectorize_layer(X_val)
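
To make the transformation less opaque, here is a pure-Python sketch of roughly what an integer-mode vectorizer does: standardize, split on whitespace, look up each token, then pad or truncate. It is an approximation under stated assumptions, not the Keras implementation. The helpers `standardize`, `build_vocab`, and `vectorize` are hypothetical, and tie-breaking between equally frequent words may differ from the real layer.

```python
import re
import string
from collections import Counter

def standardize(text):
    # Lowercase and drop ASCII punctuation, similar to the layer's default
    return re.sub(f'[{re.escape(string.punctuation)}]', '', text.lower())

def build_vocab(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding and index 1 for unknown tokens
    vocab = ['', '[UNK]'] + [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(text, vocab, length):
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:length]                        # truncate long inputs
    return ids + [0] * (length - len(ids))    # pad short inputs with 0

train_texts = ['i love programming in python', 'c++ is powerful but complex']
vocab = build_vocab(train_texts, max_tokens=1000)
print(vectorize('I love C++ and Java', vocab, length=10))
# [2, 3, 7, 1, 1, 0, 0, 0, 0, 0]
```

Two details are worth noticing: words never seen during `adapt` (here `and` and `java`) all collapse to the unknown index 1, and stripping punctuation turns `c++` into `c`.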


# Importance of Loss Function and Optimizer in Neural Networks

## Loss Function

- **Performance Measurement**: Quantifies how well the model's predictions match the actual target values, providing a numerical value that reflects model performance during training.
  
- **Guidance for Learning**: Calculates the difference between predicted and actual values, guiding the model to adjust its weights for improvement. A high loss indicates poor performance, while a low loss signifies better performance.

- **Objective of Training**: The goal is to minimize the loss function by adjusting the model's parameters (weights and biases) based on the specific task (e.g., regression, binary classification, multi-class classification).
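
For the multi-class case with integer labels, the loss used later in this notebook (`sparse_categorical_crossentropy`) averages the negative log of the probability assigned to the true class. The following NumPy sketch shows the calculation; the function and the sample arrays are illustrative, not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    # Pick the predicted probability of the true class for each sample
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

y_true = np.array([2, 0])
uniform = np.full((2, 3), 1 / 3)             # no preference among 3 classes
confident = np.array([[0.05, 0.05, 0.90],
                      [0.80, 0.10, 0.10]])

print(sparse_categorical_crossentropy(y_true, uniform))    # about 1.0986 (ln 3)
print(sparse_categorical_crossentropy(y_true, confident))  # about 0.1643
```

A loss near ln 3 (about 1.10) for a three-class problem therefore means the model is close to guessing uniformly, which is a useful reference point when reading the training log.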

## Optimizer

- **Parameter Update Mechanism**: Adjusts the model's parameters based on gradients computed from the loss function, dictating how weights are updated during training.

- **Learning Rate Control**: Determines the size of the steps taken during weight updates. A well-chosen learning rate is critical: one that is too large can make training unstable, while one that is too small slows convergence.

- **Adaptability**: Different optimizers use different strategies for updating parameters. For instance, Adam scales the step for each parameter using running averages of past gradients and of their squares, which often helps convergence on complex loss surfaces.
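
As an illustration of the update mechanism, here is a minimal single-variable sketch of the Adam update rule applied to a toy objective. The function `adam_minimize` and the hyperparameter values are assumptions chosen for the example; Keras' own optimizer handles many parameters at once and uses a smaller default learning rate.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = x^2 with gradient 2x; the minimum is at x = 0
x_final = adam_minimize(lambda x: 2 * x, x0=5.0)
print(x_final)
```

Dividing by the square root of `v_hat` is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.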

## Building and Compiling the Neural Network Model

In this section, we will create a neural network model for text classification using Keras. The model will be built using the Sequential API, which allows us to stack layers in a linear fashion.

### Model Architecture

We define the model as follows:


In [26]:
model = keras.Sequential([
    layers.Embedding(max_features, 8, input_length=sequence_length),
    layers.GlobalAveragePooling1D(),
    layers.Dense(8, activation='relu'),
    layers.Dense(len(df['label'].unique()), activation='softmax')  # Output layer for multi-class classification
])


### Model Components Explanation

- **`keras.Sequential`**: 
  - This class allows us to create a linear stack of layers for our model. 
  - Each layer will receive the output of the previous layer as input.

- **`layers.Embedding(max_features, 8, input_length=sequence_length)`**:
  - This layer converts integer-encoded words into dense vectors of fixed size.
  - **`max_features`**: The maximum number of unique words to be considered (set earlier to 1000).
  - **`8`**: The size of the output vectors for each word.
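  - **`input_length=sequence_length`**: Declares the length of the input sequences (10 here). The `TextVectorization` layer already pads or truncates every sequence to this length. Recent Keras releases (Keras 3) treat this argument as deprecated, so it can be omitted there.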

- **`layers.GlobalAveragePooling1D()`**:
  - This layer reduces the dimensionality of the output from the Embedding layer by taking the average of all time steps (words) in the input sequences.
  - It produces a single vector for each input sequence, which is then passed to the next layer.

- **`layers.Dense(8, activation='relu')`**:
  - A fully connected layer with 8 units and a ReLU (Rectified Linear Unit) activation function.
  - This layer introduces non-linearity to the model, allowing it to learn more complex patterns.

- **`layers.Dense(len(df['label'].unique()), activation='softmax')`**:
  - The output layer for the model, which corresponds to the number of unique classes (labels) in our dataset.
  - **`activation='softmax'`**: This activation function converts the output into probabilities for each class, making it suitable for multi-class classification tasks.
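
The data flow through these layers can be traced with plain NumPy. This is a sketch with randomly initialised stand-in weights, meant only to show the tensor shapes and the parameter count; the variable names are illustrative and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden, n_classes = 1000, 8, 8, 3

# Randomly initialised stand-ins for the trainable weights
E = rng.normal(size=(vocab_size, embed_dim))
W1, b1 = rng.normal(size=(embed_dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_classes)), np.zeros(n_classes)

token_ids = rng.integers(0, vocab_size, size=(2, 10))  # (batch, sequence_length)

embedded = E[token_ids]                 # (2, 10, 8): one vector per token
pooled = embedded.mean(axis=1)          # (2, 8): average over the 10 positions
h = np.maximum(pooled @ W1 + b1, 0)     # (2, 8): dense layer with ReLU
logits = h @ W2 + b2                    # (2, 3): one score per class
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)   # softmax: each row sums to 1

n_params = E.size + W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)  # (2, 3) 8099
```

The count of 8,099 parameters breaks down as 8,000 for the embedding table, 72 for the hidden layer, and 27 for the output layer, so almost all of the capacity sits in the embedding.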


# Training the Model

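Before training, the model must be compiled with a loss function, an optimizer, and the metrics to report. The model is then trained using the `fit` method, which adjusts the model's weights based on the training data.

In [25]:
# Compile first: calling fit on an uncompiled model raises an error
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
history = model.fit(X_train_vectorized, y_train, validation_data=(X_val_vectorized, y_val), epochs=10, batch_size=2)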

Epoch 1/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.7333 - loss: 1.0136 - val_accuracy: 0.5000 - val_loss: 1.0673
Epoch 2/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.7833 - loss: 0.9938 - val_accuracy: 0.5000 - val_loss: 1.0647
Epoch 3/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7833 - loss: 0.9859 - val_accuracy: 0.5000 - val_loss: 1.0627
Epoch 4/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.4500 - loss: 1.0544 - val_accuracy: 0.5000 - val_loss: 1.0610
Epoch 5/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7833 - loss: 0.9635 - val_accuracy: 0.5000 - val_loss: 1.0584
Epoch 6/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.4000 - loss: 1.0632 - val_accuracy: 0.5000 - val_loss: 1.0576
Epoch 7/10
4/4 ━━━━━━━━━━━━━━━━━━━━

### Parameters Explained

- **X_train_vectorized**: 
  - The input features (vectorized text data) used for training the model.

- **y_train**: 
  - The target labels corresponding to the training data.

- **validation_data**: 
  - A tuple containing the validation features and labels, used to evaluate the model's performance on unseen data during training. This helps in monitoring overfitting.

- **epochs**: 
  - The number of times the model will go through the entire training dataset. In this case, the model will be trained for 10 epochs.

- **batch_size**: 
  - The number of samples processed before the model's internal parameters are updated. A batch size of 2 means that the model will update its weights after processing every 2 samples.
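
The relationship between dataset size, batch size, and the number of weight updates is simple arithmetic. The sketch below uses the values from this notebook (8 training samples, batch size 2, 10 epochs); the variable names are illustrative.

```python
import math

n_train, batch_size, epochs = 8, 2, 10

# Each epoch processes the training set in batches; the last batch may be smaller
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 4 40
```

The four steps per epoch match the `4/4` counter shown in the training log.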

# Evaluate the Model

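Next, we evaluate the trained model on the validation set. With only two validation samples, accuracy can take only the values 0, 0.5, or 1, so treat the result as a smoke test rather than a reliable estimate.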

In [23]:
loss, accuracy = model.evaluate(X_val_vectorized, y_val)
print(f'Validation Loss: {loss}, Validation Accuracy: {accuracy}')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 104ms/step - accuracy: 0.5000 - loss: 1.0700
Validation Loss: 1.069966197013855, Validation Accuracy: 0.5


# Preparing New Texts for Prediction

To make predictions using the trained model, we need to prepare new texts similarly to how the training data was prepared.

In [33]:
# Sample new texts for prediction
new_texts = [
    'I love coding in Python',
    'Debugging is hard',
    'Machine learning is fascinating'
]

# cat.codes assigns integers in sorted category order,
# matching the encoded DataFrame printed earlier
label_mapping = {
    0: 'negative',
    1: 'neutral',
    2: 'positive'
}

# Clean and vectorize the new texts
new_texts_cleaned = [clean_text(text) for text in new_texts]
new_texts_vectorized = vectorize_layer(new_texts_cleaned)

predictions = model.predict(new_texts_vectorized)
predicted_classes = predictions.argmax(axis=1)

predicted_strings = [label_mapping[label] for label in predicted_classes]

for text, pred in zip(new_texts, predicted_strings):
    print(f'Text: "{text}" - Predicted label: {pred}')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
Text: "I love coding in Python" - Predicted label: positive
Text: "Debugging is hard" - Predicted label: positive
Text: "Machine learning is fascinating" - Predicted label: positive


`argmax` is a NumPy array method that returns the indices of the maximum values along a given axis of an array.

In the context of the following code snippet:

```python
predicted_classes = predictions.argmax(axis=1)
```
* `predictions` is the NumPy array returned by `model.predict`. It holds one row per input sample and one column per class, with the probabilities produced by the softmax activation in the final layer.
* `axis=1` tells NumPy to search across the columns, so for each row it finds which class has the highest predicted probability.

Thus, `predictions.argmax(axis=1)` returns an array of indices, one per input sample, giving the predicted class based on the highest probability.
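
A small worked example makes this concrete. The probability values below are hypothetical, and `code_to_label` assumes the sorted category order that `cat.codes` produced earlier in the notebook.

```python
import numpy as np

# Hypothetical softmax outputs for three samples and three classes
predictions = np.array([
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.25, 0.50, 0.25],
])

predicted_classes = predictions.argmax(axis=1)
print(predicted_classes)  # [2 0 1]

# Translate class indices back to label strings
code_to_label = {0: 'negative', 1: 'neutral', 2: 'positive'}
print([code_to_label[i] for i in predicted_classes.tolist()])
# ['positive', 'negative', 'neutral']
```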