<a href="https://colab.research.google.com/github/meghanabhange/denomme/blob/multi-lingual-name/nbs/train-denomme.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# I. Installation and Setup

In [None]:
!pip install spacy --upgrade
!pip install fire
!pip install spacy_transformers
!pip install "transformers[sentencepiece]"

In [None]:
!git clone https://github.com/meghanabhange/denomme.git
%cd denomme
# !git checkout multi-lingual-name

## Copy your assets folder here. 
---
### Expected dir structure for assets: 
```
denomme
└───assets
│   │   train.json
│   │   dev.json
│   │
│   └───raw
│   │   │   data.txt
│   │   │   more_data.txt
│   │   │   ...
│   └───processed
│       │   data.json
│       │   more_data.json
│       │   ...
│   
└─── ...
```
---
Raw files can be converted into the expected format with `scripts/convert_to_spacy`, which writes the results to `processed`. Then split the processed data into `train.json` and `dev.json` for training.

See section `II. Data conversion` for details.
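
The split into `train.json` and `dev.json` could be done along these lines. This is a minimal sketch, not part of the repository: `split_train_dev` and `write_split` are hypothetical helpers, and the code assumes each file in `processed` holds a JSON list of examples, which should be checked against the converter's actual output.

```python
import json
import random
from pathlib import Path


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle a list of examples and split it into (train, dev)."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction)) if shuffled else 0
    return shuffled[n_dev:], shuffled[:n_dev]


def write_split(processed_dir, out_dir, dev_fraction=0.2, seed=0):
    """Merge every *.json list in processed_dir and write train.json / dev.json."""
    examples = []
    for path in sorted(Path(processed_dir).glob("*.json")):
        # Assumption: each processed file contains a JSON list of examples.
        examples.extend(json.loads(path.read_text(encoding="utf-8")))
    train, dev = split_train_dev(examples, dev_fraction, seed)
    out = Path(out_dir)
    (out / "train.json").write_text(
        json.dumps(train, ensure_ascii=False), encoding="utf-8"
    )
    (out / "dev.json").write_text(
        json.dumps(dev, ensure_ascii=False), encoding="utf-8"
    )
    return len(train), len(dev)
```

With the directory layout above, a call such as `write_split("assets/processed", "assets")` would place both files where the training step expects them.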

In [None]:
!cp -r ../drive/MyDrive/denomme/assets . 

# II. Data conversion

Expected data format: `.txt` files in the `raw` directory, with one token per line followed by its tag:

- `S-PER` - Single person name without last name or middle name
- `B-PER` - Beginning of person name in multi-word name
- `I-PER` - Middle name(s) can be multiple including initials
- `E-PER` - End of person name
- `O` - Everything else 

---

### Example:

```
two O
characters O
are O
named O
Dixon B-PER
R. I-PER
L. I-PER
Stien E-PER
and O
Winterburn S-PER
-- O
```
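
To make the tagging scheme concrete, here is a minimal sketch that reads lines in this format and reassembles the tagged names. `parse_tagged_lines` and `extract_names` are hypothetical helpers written for illustration; they are not the logic used by `scripts/convert_to_spacy`.

```python
def parse_tagged_lines(text):
    """Turn 'token TAG' lines into a list of (token, tag) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The tag is the last whitespace-separated field on the line.
        token, tag = line.rsplit(maxsplit=1)
        pairs.append((token, tag))
    return pairs


def extract_names(pairs):
    """Collect person names from S-PER and B-PER ... I-PER ... E-PER runs."""
    names, current = [], []
    for token, tag in pairs:
        if tag == "S-PER":
            names.append(token)
            current = []
        elif tag == "B-PER":
            current = [token]
        elif tag == "I-PER" and current:
            current.append(token)
        elif tag == "E-PER" and current:
            current.append(token)
            names.append(" ".join(current))
            current = []
        else:
            # "O" or an out-of-sequence tag ends any unfinished name.
            current = []
    return names


sample = """\
named O
Dixon B-PER
R. I-PER
L. I-PER
Stien E-PER
and O
Winterburn S-PER
-- O
"""
print(extract_names(parse_tagged_lines(sample)))
# ['Dixon R. L. Stien', 'Winterburn']
```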

In [None]:
!python -m scripts.convert_to_spacy --modified_ner_dir raw --out_dir processed

# III. Train with transformers

In [None]:
!python -m spacy project run train-denomme

# IV. Upload to AWS [optional]

In [None]:
# %cd /content/denomme/packages/xx_denomme-0.3.1
# !python setup.py sdist
# %cd /content/denomme

In [None]:
# !curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
# !unzip awscliv2.zip
# !sudo ./aws/install
# !aws configure

In [None]:
# !aws s3 cp /content/denomme/packages/xx_denomme-0.3.1/dist/xx_denomme-0.3.1.tar.gz s3://denomme/xx_denomme-0.3.1/dist/

# V. Testing out the model

In [None]:
!pip install https://denomme.s3.us-east-2.amazonaws.com/xx_denomme-0.3.1/dist/xx_denomme-0.3.1.tar.gz

In [None]:
import spacy
from tqdm import tqdm
from spacy.lang.xx import MultiLanguage
from denomme import denomme_component
from spacy import displacy
tqdm.pandas()

In [None]:
nlp = MultiLanguage()
nlp.add_pipe("denomme")
doc = nlp("Hi my name is Meghana S.R Bhange and I want to talk Asha")
print(doc._.person_name)

In [None]:
%timeit doc = nlp("My name is Ketaki S.R Ambadkar"); doc.ents

In [None]:
%timeit doc = nlp("في 9 مايو 2006 , تم افتتاح ملعب جديد في مدينة ريال مدريد الرياضية وتم تسميته بأحد أهم اللاعبين الذي سطروا أسمائهم بأحرف من ذهب في تاريخ ريال مدريد وهو ألفريدو دي ستيفانو ."); doc.ents

In [None]:
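def get_ents(doc):
  # Build the dict format that displacy expects when manual=True.
  # Use each span's own character offsets: doc.text.find() returns the first
  # occurrence only, so a name that appears twice would be highlighted wrongly.
  ents = [
        {
            "start": name.start_char,
            "end": name.end_char,
            "label": name.label_
        }
        for name in doc._.person_name
        ]
  ex = [{"text": doc.text,
        "ents": ents,
        "title": "Denomme : Name Detection"}]
  return ex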

In [None]:
doc = nlp("Hi , I my name is Meghana S.R Bhange, I want to book an appointment for Ketaki and also Dr. S Kumar")
ex = get_ents(doc)
displacy.render(ex, style="ent", manual=True, jupyter=True)

In [None]:
doc = nlp("في 9 مايو 2006 , تم افتتاح ملعب جديد في مدينة ريال مدريد الرياضية وتم تسميته بأحد أهم اللاعبين الذي سطروا أسمائهم بأحرف من ذهب في تاريخ ريال مدريد وهو ألفريدو دي ستيفانو .")
ex = get_ents(doc)
displacy.render(ex, style="ent", manual=True, jupyter=True)

# VI. Checking Metrics
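
The comparison in this section is an exact string match between each predicted name and the `Name` column, so a partial match (a first name against a full name, say) counts as an error. The sketch below illustrates that behavior in plain Python. `exact_match_accuracy` is a hypothetical helper, the rows are made up, and the lowercasing and whitespace stripping are assumptions about how the names are normalized.

```python
def exact_match_accuracy(gold_names, predicted_names):
    """Fraction of rows whose predicted name equals the gold name exactly."""
    if len(gold_names) != len(predicted_names):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_names:
        return 0.0
    matches = sum(
        gold.strip().lower() == pred.strip().lower()
        for gold, pred in zip(gold_names, predicted_names)
    )
    return matches / len(gold_names)


# Made-up rows for illustration only; "" means no name was expected or found.
gold = ["meghana", "", "asha", "ketaki"]
pred = ["meghana", "", "", "ketaki ambadkar"]
print(exact_match_accuracy(gold, pred))  # 0.5
```

Here the third row is a miss and the fourth is a partial match, and both count as errors.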

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, f1_score, recall_score
from sklearn.metrics import roc_auc_score
import pandas as pd

In [None]:
test_set_1 = pd.read_csv("assets/01_test_set.csv")
test_set_2 = pd.read_csv("assets/02_test_set.csv")

In [None]:
def predict_names(sent):
  doc = nlp(sent.lower())
  names = [name.text for name in doc._.person_name]
  if names:
    return names[0].lower()
  return ""
  
def metrics(df):
  # Predict with Denomme, then treat missing names and predictions as "".
  df["Denomme"] = df["Input"].progress_apply(predict_names)
  df.fillna("", inplace=True)

  results = {}
  for model in ("Namignizer", "Denomme"):
    # scikit-learn expects (y_true, y_pred): gold names first, predictions
    # second. Swapping them exchanges macro precision and recall.
    y_true, y_pred = df["Name"], df[model]
    results[model] = {
        "Accuracy Score": accuracy_score(y_true, y_pred),
        "f1_score Macro": f1_score(y_true, y_pred, average="macro"),
        "precision_score Macro": precision_score(y_true, y_pred, average="macro"),
        "recall_score Macro": recall_score(y_true, y_pred, average="macro"),
    }
  # roc_auc_score is left out: it needs class scores, not predicted labels.
  return results
  

In [None]:
metrics(test_set_1)

In [None]:
metrics(test_set_2)