# 💡 Project Showcase: Named Entity Recognition

A named entity is any real-world object that has a name, such as a person, a place, an organization, or a product.

For example, Grammarly identifies incorrect spelling and punctuation in a text and corrects them, but it leaves named entities untouched, since it uses the same technique to recognize them.
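
To make the definition concrete, here is a minimal sketch of what a tagged sentence can look like and how the named entities can be read back out of it. The sentence is made up, the `B-`/`I-`/`O` labels are an assumed IOB-style scheme (only the `O` tag appears in the preview of the dataset), and `extract_entities` is a hypothetical helper written for this illustration.

```python
def extract_entities(tokens, tags):
    """Collect (entity text, entity type) pairs from IOB-style tags."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins: flush the previous one, if any
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the entity that is currently open
            current_words.append(token)
        else:
            # "O" (or a stray "I-" tag) closes any open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Made-up sentence with assumed tags, for illustration only
tokens = ["Barack", "Obama", "visited", "London", "yesterday"]
tags = ["B-per", "I-per", "O", "B-geo", "O"]

entities = extract_entities(tokens, tags)
print(entities)  # [('Barack Obama', 'per'), ('London', 'geo')]
```

Every word tagged `O` is ignored, which is why ordinary words such as "visited" never show up in the result.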

We will be using the [ner_dataset.csv](https://raw.githubusercontent.com/amankharwal/Website-data/master/ner_dataset.csv) for this project.

## Step 1: Importing the libraries and Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [9]:
df = pd.read_csv('ner_dataset.csv', encoding = "unicode_escape")
df.head()

|   | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |
| 3 | NaN | have | VBP | O |
| 4 | NaN | marched | VBN | O |


## Step 2: EDA and Data Preprocessing

In [10]:
df.columns

Index(['Sentence #', 'Word', 'POS', 'Tag'], dtype='object')

Let us extract the mappings between tokens (and tags) and integer indices, which are needed to train the neural network.

In [12]:
from itertools import chain

def get_dict_map(data, token_or_tag):
    """Build index lookups for the 'Word' column ('token') or the 'Tag' column (anything else)."""
    column = 'Word' if token_or_tag == 'token' else 'Tag'
    # The unique values of the chosen column form the vocabulary
    vocab = list(set(data[column].to_list()))

    # Forward (value -> index) and reverse (index -> value) lookups
    tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
    idx2tok = {idx: tok for idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

token2idx, idx2token = get_dict_map(df, 'token')
tag2idx, idx2tag = get_dict_map(df, 'tag')
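
As a quick sanity check of the idea, the sketch below builds the same two lookups on a tiny hand-written word list and round-trips it through them. The list reuses the words from the preview above plus one repeated word; the variable names `encoded` and `decoded` are only for this illustration.

```python
# Tiny word list standing in for df['Word']; "of" is repeated on purpose
words = ["Thousands", "of", "demonstrators", "have", "marched", "of"]

# Same construction as in get_dict_map: unique values, then two lookups
vocab = list(set(words))
tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
idx2tok = {idx: tok for idx, tok in enumerate(vocab)}

# Encode the words as integers and decode them again
encoded = [tok2idx[w] for w in words]
decoded = [idx2tok[i] for i in encoded]

print(len(tok2idx))      # 5 unique words
print(decoded == words)  # True
```

Because the vocabulary comes from a `set`, the particular index assigned to each word can differ from one run to the next, so the mappings should be kept alongside any model trained with them.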

Next, we map each word and tag to its index and group the rows by sentence, which gives us the sequential data for our neural network.

In [14]:
df['Word_idx'] = df['Word'].map(token2idx)
df['Tag_idx'] = df['Tag'].map(tag2idx)
# Forward-fill so that every row carries its sentence label
data_fillna = df.ffill(axis=0)
# Group by sentence and collect each column into a list
data_group = data_fillna.groupby(
    ['Sentence #'], as_index=False
)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))
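
The effect of the forward fill and the grouping is easier to see on a small frame. The sketch below uses a hand-made, five-row stand-in for the dataset (the second sentence is invented for illustration) and keeps only the `Word` and `Tag` columns.

```python
import pandas as pd

# Toy stand-in for the dataset: only the first row of each sentence is labelled
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "marched"],
    "Tag": ["O", "O", "O", "O", "O"],
})

# Forward-fill copies each sentence label down to the rows below it
filled = toy.ffill()

# One row per sentence, with the words and tags collected into lists
grouped = filled.groupby("Sentence #", as_index=False)[["Word", "Tag"]].agg(list)

print(grouped["Word"].tolist())
# [['Thousands', 'of', 'demonstrators'], ['They', 'marched']]
```

Each row of `grouped` now holds one whole sentence, which is the shape a sequence model expects.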

