<a href="https://colab.research.google.com/github/saiteja7467/community-starter-kit/blob/master/toxic_comment_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

In [3]:
data=pd.read_csv(os.path.join("drive","MyDrive","jigsaw-toxic-comment-classification-challenge","train.csv"))
data.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


#Text Vectorization
eg:I like ice cream

what text vectorization does is assigns a unique integer to each word present in the sentence

I=23

like=9

ice=87

cream=12

then these integers our further processed into something called
word embeddings which are converted into 0s and 1s

### 🔤 `TextVectorization` Layer:

This is a preprocessing layer in TensorFlow/Keras that **converts text into numbers**, so it can be fed into a neural network.

---

### ✅ Code:

```python
MAX_WORDS = 200000  # number of words in the vocabulary

vectorize = TextVectorization(
    max_tokens=MAX_WORDS,
    output_sequence_length=1500,
    output_mode="int"
)
```

---

### 🔍 What each part means (in plain terms):

| Part                          | Meaning                                                                    |
| ----------------------------- | -------------------------------------------------------------------------- |
| `MAX_WORDS = 200000`          | Limit the vocabulary to the **top 200,000 most frequent words**            |
| `max_tokens=MAX_WORDS`        | Only keep these top words; others will be treated as unknown (`[UNK]`)     |
| `output_sequence_length=1500` | Every text input will be **padded or truncated** to 1500 words             |
| `output_mode="int"`           | The output will be a **sequence of integers**, where each number = word ID |

---

### 🎯 What it does:

* Takes raw text like: `"This is a sample sentence."`
* Turns it into a list of word IDs like: `[10, 4, 23, 88, 9, ...]` (up to 1500 tokens)
* Each unique word gets a unique integer from the vocabulary

---

### 💡 Why it’s important:

Neural networks can’t work with text directly — they need **numbers**. This layer helps convert variable-length raw text into **fixed-length integer sequences**.

Great question!

### 🔝 What does "top words" mean in `max_tokens=MAX_WORDS`?

When we say **"top words"**, we mean:

> **The most frequently occurring words** in your training dataset.

---

### 🧠 How it works behind the scenes:

1. The `TextVectorization` layer **scans through your text data**.
2. It **counts how many times each word appears**.
3. Then, it keeps only the **most common (frequent) `MAX_WORDS` words**.
4. Any word **not in that top list** is replaced with a special token: `[UNK]` (unknown).

---

### 🔍 Example:

Let's say your text dataset contains these sentences:

```python
["I love pizza", "You love burgers", "They love pasta"]
```

Here's how it might rank word frequency:

| Word      | Frequency |
| --------- | --------- |
| "love"    | 3         |
| "pizza"   | 1         |
| "You"     | 1         |
| "burgers" | 1         |
| "They"    | 1         |
| "pasta"   | 1         |
| "I"       | 1         |

Now if `max_tokens=3`, it will **keep only the top 3 most frequent words**:

🟢 `"love"`, `"pizza"`, `"You"`
🔴 The rest (`"burgers"`, `"They"`, etc.) will be replaced with `[UNK]`.

---

### ✅ Why it's useful:

* Helps reduce memory and computation by **limiting the vocabulary size**.
* Filters out rare or noisy words that don’t contribute much to the model.
* Prevents overfitting on rare tokens.



In [4]:
from tensorflow.keras.layers import TextVectorization
#Splitting the data into x and y

x=data["comment_text"]
y=data.drop(columns=["id","comment_text"],axis=1).values#Converts into numpy array(labels)
MAX_WORDS=200000 #Number of words for vocab
vectorize=TextVectorization(max_tokens=MAX_WORDS,
                            output_sequence_length=1500,
                            output_mode="int")


In [5]:
vectorize.adapt(x.values)#Learns all the words in the x dataset (.values just converts into numpy array)

ok so as we intialized vectorized variable, what this does is take a sentence and assigns a unique integer(removes any punctuation and converts them into lowercase which is default).Even though "this is a sample" is just 4 tokens, it pads the rest with zeros up to 1500.

In [5]:
vectorize("this is a sample")

<tf.Tensor: shape=(1500,), dtype=int64, numpy=array([14,  9,  6, ...,  0,  0,  0])>

In [6]:
# Apply vectorization to all data
x_vectorized=vectorize(x)  # This gives a tensor of shape (num_samples, 1500)

In [8]:
type(x_vectorized)

tensorflow.python.framework.ops.EagerTensor

#Creating a data pipeline

rember like this for data pipelining

MCSBAP map,chache,shuffle,batch,prefetch from tensor_slices, list_file

Absolutely — let’s make this **as simple as possible** so you can quickly understand what each line does when you look back:

---

```python
dataset = tf.data.Dataset.from_tensor_slices((x_vectorized, y))
```

📦 **Makes a dataset** from your inputs (`x_vectorized`) and labels (`y`).
➡️ It's like turning your data into a format TensorFlow can use for training.

---

```python
dataset = dataset.cache()
```

🧠 **Remembers the data** after the first time it's used.
➡️ Makes training faster (only if your data fits in memory).

---

```python
dataset = dataset.shuffle(160000)
```

🔀 **Mixes up the data randomly**.
➡️ Helps the model learn better by not always seeing the same order.

---

```python
dataset = dataset.batch(16)
```

📚 **Groups the data** into chunks of 16.
➡️ Instead of learning from one example at a time, it learns from 16 at once (faster + better).

---

```python
dataset = dataset.prefetch(8)

🚀 **Prepares the next 8 batches while training** on the current one.
➡️ Keeps training smooth and fast — no waiting around.


> “I’m turning my inputs and labels into a special dataset TensorFlow understands.
> Then I speed it up by remembering it, mixing it up, splitting it into small groups, and preparing the next group early.”

!


In [7]:
dataset=tf.data.Dataset.from_tensor_slices((x_vectorized,y))#only expects one arguement so pass a tuple
dataset = dataset.cache()
dataset = dataset.shuffle(160000)
dataset = dataset.batch(16)
dataset = dataset.prefetch(8)

In [8]:
batch_x,batch_y=dataset.as_numpy_iterator().next()

In [9]:
batch_x #a batch comments which are tokenized

array([[    21,    301,     13, ...,      0,      0,      0],
       [  1721, 196529,  27378, ...,      0,      0,      0],
       [     8,     74,     15, ...,      0,      0,      0],
       ...,
       [    64,      9,     14, ...,      0,      0,      0],
       [    70,    265,    215, ...,      0,      0,      0],
       [ 10286,  17820,     21, ...,      0,      0,      0]])

In [10]:
batch_y #a labels of labels

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 0]])

#Splitting the dataset into training,validation and testing

In [11]:
train=dataset.take(int(len(dataset)*0.8))
valid=dataset.skip((int(len(dataset)*0.8))).take(int(len(dataset)*0.2))
test=dataset.skip(int(len(dataset)*0.9)).take(int(len(dataset)*0.1))


from tensorflow.keras.models import Sequential
```

✅ This loads the `Sequential` model — a **simple stack of layers** added one after another.

from tensorflow.keras.layers import LSTM, Dropout, Bidirectional, Dense, Embedding
```

✅ These are the **different building blocks** (layers) you'll use:

* `Embedding`: turns words into numbers (word vectors)
* `LSTM`: helps the model understand the **order** of words
* `Bidirectional`: lets the LSTM look at the sentence **forward and backward**
* `Dense`: regular neural network layers
* `Dropout`: randomly turns off some neurons (helps avoid overfitting)

---


model = Sequential()
```

✅ You're creating an **empty model**, ready to add layers to it.

---

## 🔤 Step 2: Word Embedding Layer


model.add(Embedding(MAX_WORDS + 1, 32))
```

🧠 Think of this as the **“dictionary layer”**.

* It takes each word in a sentence (already converted to numbers by `TextVectorization`) and maps it to a **vector of 32 numbers**.
* `MAX_WORDS + 1`: size of your vocabulary (number of different words your model knows)
* `32`: size of each word's vector (you choose this — bigger size = more information)

📌 Result: Your sentence becomes a list of word vectors (like turning "I love cats" into 3 points in 32-D space).

---

## 🔁 Step 3: Bidirectional LSTM Layer

```python
model.add(Bidirectional(LSTM(32, activation="tanh")))
```

🧠 This layer tries to **understand the meaning of the whole sentence**, one word at a time — but in **both directions** (start → end **and** end → start).

* `LSTM`: a type of RNN (Recurrent Neural Network) that can **remember important info** over long sequences.
* `32`: number of memory units (the brain size of the LSTM).
* `Bidirectional`: reads the sentence forward *and* backward — so it gets better context.
* `activation="tanh"`: helps decide how much memory to keep or forget.

📌 This is the **core** of your model — it learns the sentence’s meaning.

---

## 🧱 Step 4: Dense Layers (Neural Network Brains)


model.add(Dense(128, activation="relu"))
```

🧠 This is a **regular neural network layer** with 128 “neurons”.

* `Dense`: fully connected — each neuron sees everything from the previous layer.
* `activation="relu"`: only keeps positive values, which helps the network learn faster and better.

---


model.add(Dense(256, activation="relu"))
```

🧠 A **bigger brain** with 256 neurons. Helps the model learn more complex features.

---


model.add(Dense(128, activation="relu"))
```

🧠 One more layer with 128 neurons. This can help the model **refine** its understanding before making the final prediction.

---

## 🎯 Step 5: Output Layer


model.add(Dense(6, activation="sigmoid"))
```

🧠 This layer gives you the **final prediction** — 6 outputs.

* `6`: because you’re predicting **6 different labels** (e.g., toxic, threat, insult, etc.)
* `activation="sigmoid"`: squashes each output to a value between 0 and 1 → useful for **multi-label classification** (each label is independent, and can be 0 or 1).

📌 It tells you the **probability** that a comment belongs to each of the 6 categories.

---

## 🧠 Final Model Summary:

Your model does the following:

1. **Turns words into vectors** (Embedding)
2. **Understands the sentence** in both directions (Bidirectional LSTM)
3. **Processes that info** through deep neural layers (Dense layers)
4. **Outputs 6 predictions** (e.g., how likely it is that the comment is "toxic")


In [12]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dropout,Bidirectional,Dense,Embedding

MAX_WORDS = 200000
model=Sequential()
#Creating the embedding layers
model.add(Embedding(MAX_WORDS+1,32))

#Out LSTM input layer
model.add(Bidirectional(LSTM(32,activation="tanh")))

#Feature extractor layer all are connected
model.add(Dense(128,activation="relu"))
model.add(Dense(256,activation="relu"))
model.add(Dense(128,activation="relu"))

#output layer
model.add(Dense(6,activation="sigmoid"))

In [13]:
#Compiling our model
model.compile(loss="BinaryCrossentropy",optimizer="adam")

In [14]:
model.build(input_shape=(None, 1500))  # (batch_size, sequence_length)
model.summary()

#Fitting the model



In [15]:
history=model.fit(train,epochs=1,validation_data=valid)

[1m7979/7979[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m659s[0m 82ms/step - loss: 0.0820 - val_loss: 0.0512


#Making Predictions

input_text = vectorize("You suck in content creation")
➡️ This converts the raw text into a sequence of integers.

Each word is replaced by its corresponding number from the vocabulary the model learned.

Result: input_text has shape (1500,) — a flat vector of word indices, padded or truncated to 1500 words.

python
Copy
Edit
input_text_batch = tf.expand_dims(input_text, axis=0)
➡️ This adds a batch dimension to your input.

Machine learning models expect a batch of examples, not just one.

This changes the shape from (1500,) to (1, 1500) — meaning "1 sentence of 1500 tokens".

python
Copy
Edit
res = model.predict(input_text_batch)
➡️ This runs the input through the model.

Your model has 6 output units (because of the 6 labels).

So res will be a NumPy array like:
[[0.01, 0.99, 0.02, 0.05, 0.88, 0.12]]

Each number is a probability (between 0 and 1) — showing how likely the sentence belongs to each of the 6 classes.

In [16]:
input_text=vectorize("You suck in content creation")
input_text_batch = tf.expand_dims(input_text, axis=0)
res = model.predict(input_text_batch)
print(res)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 618ms/step
[[0.96772754 0.03278205 0.758113   0.01630224 0.62516266 0.06579437]]


In [22]:
batches_x,batches_y=test.as_numpy_iterator().next()

In [27]:
(model.predict(batches_x)>0.5).astype(int)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step


array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

#Evaluation of metrics

In [17]:
from tensorflow.keras.metrics import BinaryAccuracy,Precision,Recall

ba=BinaryAccuracy()
p=Precision()
r=Recall()

for batch in test.as_numpy_iterator():
  #Unpacks the batch x-> as comments and y->as labels['toxic', 'severe_toxic', 'obscene', 'threat', 'insult','identity_hate']
  x_true,y_true=batch
  #Performs prediction on x_true
  y_pred=model.predict(x_true)

  #Flatten into one huge vector
  y_true=y_true.flatten()
  y_pred=y_pred.flatten()

  #Evaluation metrics
  p.update_state(y_true,y_pred)
  ba.update_state(y_true,y_pred)
  r.update_state(y_true,y_pred)



[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 268ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 74ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 79ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 72ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 81ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 72ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7

In [18]:
print(f"Precision {p.result().numpy()} , Recall {r.result().numpy()} , BinaryAccuracy {ba.result().numpy()}")

Precision 0.8765432238578796 , Recall 0.5744977593421936 , BinaryAccuracy 0.9810368418693542


In [19]:
import gradio as gr
import tensorflow as tf

In [20]:
model.save("toxicity.h5")
model_saved=tf.keras.models.load_model("toxicity.h5")



In [22]:
data.columns

Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

In [23]:
#enumerate
y="name"
for idx,letter in enumerate(y):
  print(idx,letter)

0 n
1 a
2 m
3 e


In [32]:
def score_comment(comment):
  vectorize_comment = tf.expand_dims(vectorize(comment), axis=0)
  results=model.predict(vectorize_comment)

  text=" "
  for idx,col in enumerate(data.columns[2:]):
    text+="{}: {}\n".format(col,results[0][idx]>0.5)

  return text

In [33]:
interface=gr.Interface(fn=score_comment,
                       inputs=gr.Textbox(lines=2,placeholder="Comment to score"),
                       outputs=gr.Text())

In [34]:
interface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://5f5d871c98a965c052.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


