# Auto Tagging for Social Media Systems using NER


**1.Data Collection:**

The project uses a hardcoded dataset of sentences along with their corresponding entity tags. Each word in a sentence is tagged with an entity label (e.g., "O" for non-entities, "B-LOC" for the beginning of a location entity, and "I-LOC" for continuation).

2**.Data Preparation:**

Vocabulary and Tag Mapping:
Create mappings for words to indices and tags to indices. This is essential for feeding the data into a machine learning model.
Padding:
Since sentences may vary in length, they are padded to a uniform length. This allows the model to process batches of data efficiently.

**3.Model Building:**

A Bidirectional Long Short-Term Memory (BiLSTM) model is constructed:
Embedding Layer: Converts word indices into dense vectors of fixed size.
BiLSTM Layer: Captures contextual information from both past and future words in the sentence, allowing for a better understanding of the entities.
TimeDistributed Dense Layer: Outputs a probability distribution for each token in the sequence across all possible tags.

**4.Training:**

The model is trained on the prepared data using categorical cross-entropy as the loss function and Adam as the optimizer. The training process adjusts the model's parameters to minimize the loss and improve accuracy.

**5.Evaluation:**

After training, the model is evaluated on a separate test dataset to measure its performance in identifying entities.
Prediction:

The model can be used to predict entities in new sentences. Given a sentence, it converts words to their corresponding indices, pads the sequence, and generates predictions.





In [3]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [1]:
# Sample sentences and corresponding tags
train_sentences = [
    ['I', 'love', 'New', 'York', 'City'],
    ['San', 'Francisco', 'is', 'beautiful'],
    ['Barack', 'Obama', 'was', 'the', 'president'],
    ['Facebook', 'is', 'a', 'large', 'company'],
    ['Apple', 'releases', 'new', 'iPhone', 'every', 'year'],
    ['The', 'new', 'Google', 'Pixel', 'is', 'amazing'],
    ['Tesla', 'is', 'innovating', 'in', 'electric', 'vehicles'],
    ['I', 'met', 'Elon', 'Musk', 'yesterday'],
    ['The', 'Golden', 'Gate', 'Bridge', 'is', 'in', 'San', 'Francisco'],
    ['Microsoft', 'announced', 'Windows', '11'],
    ['Amazon', 'is', 'headquartered', 'in', 'Seattle'],
    ['Paris', 'is', 'known', 'for', 'the', 'Eiffel', 'Tower'],
    ['The', 'CEO', 'of', 'Twitter', 'is', 'Elon', 'Musk'],
    ['Bill', 'Gates', 'is', 'the', 'founder', 'of', 'Microsoft'],
    ['New', 'York', 'is', 'often', 'called', 'the', 'Big', 'Apple'],
    ['The', 'Great', 'Wall', 'of', 'China', 'is', 'ancient'],
    ['I', 'enjoyed', 'the', 'movie', 'Inception'],
    ['Leonardo', 'DiCaprio', 'starred', 'in', 'Inception'],
    ['Amazon', 'Prime', 'has', 'great', 'shows'],
    ['Google', 'announced', 'a', 'new', 'AI', 'feature']
]

train_tags = [
    ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC'],
    ['B-LOC', 'I-LOC', 'O', 'O'],
    ['B-PER', 'I-PER', 'O', 'O', 'O'],
    ['B-ORG', 'O', 'O', 'O', 'O'],
    ['B-ORG', 'O', 'O', 'B-PROD', 'O', 'O'],
    ['O', 'O', 'B-ORG', 'B-PROD', 'O', 'O'],
    ['B-ORG', 'O', 'O', 'O', 'O', 'O'],
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'I-LOC'],
    ['B-ORG', 'O', 'B-PROD', 'I-PROD'],
    ['B-ORG', 'O', 'O', 'O', 'B-LOC'],
    ['B-LOC', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC'],
    ['O', 'O', 'O', 'B-ORG', 'O', 'B-PER', 'I-PER'],
    ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-ORG'],
    ['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'O'],
    ['O', 'O', 'O', 'O', 'B-PROD'],
    ['B-PER', 'I-PER', 'O', 'O', 'B-PROD'],
    ['B-ORG', 'B-PROD', 'O', 'O', 'O'],
    ['B-ORG', 'O', 'O', 'O', 'B-TECH', 'O']
]

test_sentences = [
    ['Google', 'is', 'based', 'in', 'Mountain', 'View'],
    ['I', 'visited', 'the', 'Eiffel', 'Tower'],
    ['Facebook', 'is', 'planning', 'a', 'new', 'feature'],
    ['Tesla', 'unveils', 'new', 'Model', 'S'],
    ['Apple', 'launched', 'the', 'new', 'iPad'],
    ['Twitter', 'CEO', 'Elon', 'Musk', 'speaks', 'at', 'conference'],
    ['Amazon', 'expands', 'to', 'new', 'markets'],
    ['Microsoft', 'Teams', 'is', 'popular', 'in', 'offices'],
    ['The', 'Great', 'Wall', 'of', 'China', 'is', 'a', 'tourist', 'attraction']
]

test_tags = [
    ['B-ORG', 'O', 'O', 'O', 'B-LOC', 'I-LOC'],
    ['O', 'O', 'O', 'B-LOC', 'I-LOC'],
    ['B-ORG', 'O', 'O', 'O', 'O', 'B-TECH'],
    ['B-ORG', 'O', 'O', 'B-PROD', 'I-PROD'],
    ['B-ORG', 'O', 'O', 'O', 'B-PROD'],
    ['B-ORG', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O'],
    ['B-ORG', 'O', 'O', 'O', 'B-LOC'],
    ['B-ORG', 'B-PROD', 'O', 'O', 'O', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O']
]

In [2]:
# Build vocabulary and tag sets
vocab = list(set(word for sentence in train_sentences for word in sentence))
tags = list(set(tag for tag_seq in train_tags for tag in tag_seq))

In [4]:
word2idx = {w: i + 2 for i, w in enumerate(vocab)}
word2idx['UNK'] = 1  # Unknown words
word2idx['PAD'] = 0  # Padding
tag2idx = {t: i + 1 for i, t in enumerate(tags)}
tag2idx['PAD'] = 0  # Padding

In [5]:
idx2tag = {i: w for w, i in tag2idx.items()}

In [6]:
X_train = [[word2idx.get(w, 1) for w in s] for s in train_sentences]
y_train = [[tag2idx[t] for t in tag_seq] for tag_seq in train_tags]

X_test = [[word2idx.get(w, 1) for w in s] for s in test_sentences]
y_test = [[tag2idx[t] for t in tag_seq] for tag_seq in test_tags]

In [7]:
# Pad sequences
max_len = 10
X_train = pad_sequences(X_train, maxlen=max_len, padding='post')
y_train = pad_sequences(y_train, maxlen=max_len, padding='post')
X_test = pad_sequences(X_test, maxlen=max_len, padding='post')
y_test = pad_sequences(y_test, maxlen=max_len, padding='post')
# Pad sequences
max_len = 10
X_train = pad_sequences(X_train, maxlen=max_len, padding='post')
y_train = pad_sequences(y_train, maxlen=max_len, padding='post')
X_test = pad_sequences(X_test, maxlen=max_len, padding='post')
y_test = pad_sequences(y_test, maxlen=max_len, padding='post')


In [8]:
# Convert y to categorical
y_train = [to_categorical(i, num_classes=len(tag2idx)) for i in y_train]
y_test = [to_categorical(i, num_classes=len(tag2idx)) for i in y_test]

In [9]:
# Define the model
input = tf.keras.layers.Input(shape=(max_len,))
model = tf.keras.layers.Embedding(input_dim=len(word2idx), output_dim=50, input_length=max_len)(input)
model = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(len(tag2idx), activation='softmax'))(model)



In [10]:
model = tf.keras.models.Model(input, out)

In [11]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

In [12]:
# Train the model
history = model.fit(X_train, np.array(y_train), batch_size=32, epochs=5, validation_split=0.1, verbose=1)

Epoch 1/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 6s/step - accuracy: 0.0389 - loss: 2.3029 - val_accuracy: 0.5000 - val_loss: 2.2812
Epoch 2/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step - accuracy: 0.4889 - loss: 2.2826 - val_accuracy: 0.4500 - val_loss: 2.2599
Epoch 3/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 74ms/step - accuracy: 0.4889 - loss: 2.2616 - val_accuracy: 0.4500 - val_loss: 2.2365
Epoch 4/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 139ms/step - accuracy: 0.5056 - loss: 2.2382 - val_accuracy: 0.4500 - val_loss: 2.2099
Epoch 5/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 138ms/step - accuracy: 0.5333 - loss: 2.2117 - val_accuracy: 0.4500 - val_loss: 2.1789


In [13]:
#Model Evaluation
test_loss, test_acc = model.evaluate(X_test, np.array(y_test))
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_acc}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step - accuracy: 0.4667 - loss: 2.1866
Test Loss: 2.1866328716278076
Test Accuracy: 0.46666669845581055


In [14]:
# Example prediction
def predict(sentence):
    sentence_idx = [word2idx.get(w, 1) for w in sentence]
    sentence_padded = pad_sequences([sentence_idx], maxlen=max_len, padding='post')
    pred = model.predict(sentence_padded)
    pred = np.argmax(pred, axis=-1)
    return [idx2tag[idx] for idx in pred[0] if idx != 0]

example_sentence = ['I', 'love', 'San', 'Francisco']
print(predict(example_sentence))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 553ms/step
['O']
