# Named entity recognition in CSVs

This notebook uses [spaCy](https://spacy.io/) to perform named-entity recognition (NER) on text in specified columns of a CSV file. The notebook adds new columns to the CSV with the identified entities.
It is 90% based on this wonderful notebook: https://github.com/quinnanya/csv-ner/blob/master/named-entity-recognition-in-csvs.ipynb. Only step 5 is new :)


## Step 1. Load core modules

In [1]:
#os is used to change the directory
import os
#spacy is used for the NER
import spacy
#pandas is used to read, edit, and write tabular data
import pandas as pd

## Step 2. Download and load language-specific data
This notebook was originally created for German text, but you can substitute the values in the following two cells with the corresponding ones for [another language that spaCy supports](https://spacy.io/models). Choose the language you'd like on the spaCy site, and check the box for "import as module" to see the values to use.

For instance, to use Lithuanian, you'd change the first code cell below to: `!python -m spacy download lt_core_news_sm` 

and the second one to: 

`import lt_core_news_sm
snlp = spacy.load("lt_core_news_sm")`

After you've run the first code cell once on the computer where you're running this notebook, you can skip it in later sessions and just run the cell that imports the module. There's no harm in running the first cell again, but it won't change anything.
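
If you're not sure whether the model is already on your computer, a small check like the one below can tell you before you run the download cell. This is only a sketch: it relies on the fact that spaCy models are installed as importable Python packages, and the helper name `model_is_installed` is made up for this example.

```python
import importlib.util


def model_is_installed(package_name: str) -> bool:
    """Return True if a Python package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# "de_core_news_md" is the model package used in this notebook;
# swap in the name of the model for your language.
if model_is_installed("de_core_news_md"):
    print("Model already installed; the download cell can be skipped.")
else:
    print("Model not found; run the download cell first.")
```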

In [2]:
#downloads the model for the specified language (German)
!python -m spacy download de_core_news_md

Collecting de-core-news-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.1.0/de_core_news_md-3.1.0-py3-none-any.whl (47.8 MB)
     |████████████████████████████████| 47.8 MB 1.1 MB/s
You should consider upgrading via the '/home/silviaegt/anaconda3/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')


In [4]:
#imports the model as a module
import de_core_news_md
#loads the model as snlp
snlp = spacy.load("de_core_news_md")

## Step 3. Specify file directory and file
Replace `/home/silviaegt/Documents/corpus_de` in the cell below with the full path to the directory that has your input CSV file. This is also the directory where your output CSV file will be saved.

The syntax for the path is different on Mac, Linux, and Windows. For instance, the default path to the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

- On Mac: '/Users/YOUR-USER-NAME/Documents'
- On Linux: '/home/YOUR-USER-NAME/Documents'
- On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents'

Then, replace `metadata_de_clean.csv` with the name of a CSV file in the directory you've specified. The first row of the CSV file should be a header (i.e. with the name of each column).

In [7]:
sourcefiledirectory = '/home/silviaegt/Documents/corpus_de'
#changes the working directory to the directory specified above
os.chdir(sourcefiledirectory)
infilename = 'metadata_de_clean.csv'
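
If you'd rather not change the working directory, one alternative is to build full paths with Python's standard `pathlib` module, which uses the right separator on Mac, Linux, and Windows. The sketch below is not part of the original workflow; the directory and file names are just the ones from this example, and the variable names are made up for illustration.

```python
from pathlib import Path

# Example values only; substitute your own directory and file name.
source_dir = Path.home() / "Documents" / "corpus_de"
infile = source_dir / "metadata_de_clean.csv"

# Output file in the same directory, with 'ner_' added to the file name.
outfile = infile.with_name("ner_" + infile.name)

print(infile)
print(outfile)
```

Both `pd.read_csv` and `DataFrame.to_csv` accept paths like these in place of a plain file name.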

## Step 4. Reading CSV, performing NER
This step actually processes your data. 

In this example, there is one column in the source file that contains text where we want to find entities: *title_normalized*.

The cell below reads in the data from the CSV file you've specified above. It creates one new column, *ner_text*, which holds the entities extracted from the *title_normalized* column (you can change both names).

The `print(df['ner_text'])` command is optional, but it is a convenient way to get a sense of what the output will be. When the values are printed, each entity is surrounded by parentheses (), and all the entities for a given row of the CSV are surrounded by square brackets \[\]. When you *print* the output, the words of a multi-word entity are separated by commas within the parentheses (such as: (Europäische, Union)), but when you *write* it to a new CSV file, the parentheses and the commas between individual words disappear, and you'll just get a single comma-separated list inside square brackets, with commas separating individual entities (e.g. \[Uploadfilter, Europäische Union\]).

In [77]:
#creates pandas dataframe with your specified input file, using the first row as a header
df = pd.read_csv(infilename, header=0)
#creates a new column, ner_text, with entities extracted from the column titled 'title_normalized'
df['ner_text'] = df['title_normalized'].astype(str).apply(lambda x: list(snlp(x).ents))
#prints the values from ner_text
print(df['ner_text'])

0                [(jüdischen), (nationalsozialistischen)]
1                                           [(Englische)]
2       [(Australian, Reflections, in, a, Mirror, Clou...
3                                           [(Deutschen)]
4       [(gerbet, han), (Wolframs, von, Eschenbach, Pa...
                              ...                        
1065                                                   []
1066                                                   []
1067                  [(Deutschlands), (Thüringer, Wald)]
1068                                     [(Neil, Gaiman)]
1069    [(A, Close, and, Distant, Reading, of, Shakesp...
Name: ner_text, Length: 1070, dtype: object
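
If you want each CSV cell to hold plain text rather than the bracketed list described above, you could join the text of the entities yourself before writing the file. The sketch below uses a stand-in `FakeEntity` type so it runs without a model; real spaCy entities expose the same `.text` and `.label_` attributes. The helper name, the separator, and the labels on the sample entities are assumptions made for this example.

```python
from collections import namedtuple

# Stand-in for spaCy entity objects, so the sketch runs without a model.
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])


def entities_to_text(ents, separator="; "):
    """Join the text of each entity into one plain string for a CSV cell."""
    return separator.join(ent.text for ent in ents)


# Sample entities; the labels here are illustrative, not model output.
row_entities = [
    FakeEntity("Uploadfilter", "MISC"),
    FakeEntity("Europäische Union", "ORG"),
]
print(entities_to_text(row_entities))
```

With the dataframe from this step, something like `df['ner_text'].apply(entities_to_text)` should produce one such string per row.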


In [79]:
#prints the label (PER, LOC, ORG, or MISC) of every entity found in the column
for t in df['title_normalized'].astype(str):
    for w in snlp(t).ents:
        print(w.label_)

MISC
MISC
MISC
MISC
MISC
MISC
PER
MISC
PER
PER
MISC
MISC
MISC
LOC
LOC
PER
PER
MISC
ORG
ORG
MISC
MISC
MISC
MISC
LOC
MISC
MISC
PER
MISC
PER
MISC
LOC
LOC
LOC
PER
PER
PER
LOC
LOC
PER
LOC
MISC
PER
PER
PER
LOC
MISC
MISC
MISC
MISC
PER
MISC
MISC
MISC
PER
PER
PER
PER
MISC
PER
PER
LOC
PER
MISC
PER
LOC
MISC
MISC
ORG
MISC
MISC
LOC
MISC
MISC
MISC
PER
PER
MISC
MISC
MISC
MISC
LOC
MISC
MISC
LOC
MISC
MISC
LOC
LOC
ORG
MISC
ORG
MISC
MISC
MISC
MISC
MISC
MISC
MISC
MISC
PER
PER
MISC
LOC
MISC
MISC
PER
PER
MISC
MISC
PER
PER
PER
PER
MISC
PER
PER
MISC
LOC
PER
LOC
PER
MISC
LOC
MISC
PER
PER
PER
PER
PER
ORG
MISC
MISC
MISC
LOC
MISC
PER
PER
MISC
PER
PER
PER
PER
MISC
LOC
MISC
LOC
MISC
PER
MISC
LOC
MISC
PER
PER
MISC
LOC
LOC
MISC
MISC
MISC
MISC
LOC
PER
MISC
MISC
MISC
MISC
MISC
PER
PER
MISC
MISC
PER
PER
PER
MISC
LOC
LOC
PER
PER
PER
LOC
MISC
PER
LOC
MISC
PER
LOC
LOC
LOC
PER
MISC
MISC
MISC
ORG
ORG
ORG
LOC
MISC
MISC
PER
MISC
PER
ORG
LOC
PER
PER
MISC
MISC
LOC
MISC
PER
MISC
PER
MISC
MISC
PER
MISC
PER
MISC
PER
PER
PER
PER
MISC
PER
MISC
MISC
PER
PER
MISC
LOC
PER
PER
PER
MISC
LOC
ORG
MISC
ORG
ORG
ORG
LOC
LOC
PER
PER
MISC
LOC
MISC
MISC
ORG
LOC
LOC
LOC
LOC
MISC
LOC
MISC
PER
MISC
ORG
PER
MISC
ORG
ORG
PER
MISC
ORG
ORG
MISC
PER
PER
PER
MISC
PER
MISC
PER
PER
LOC
LOC
PER
PER
PER
PER
PER
PER
PER
MISC
ORG
MISC
MISC
MISC
ORG
PER
LOC
MISC
ORG
MISC
ORG
MISC
MISC
PER
ORG
PER
PER
PER
MISC
MISC
MISC
PER
PER
MISC
LOC
PER
PER
MISC
MISC
PER
MISC
PER
LOC
LOC
LOC
PER
MISC
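
A long column of labels like the one above is hard to read at a glance. One option is to tally the labels with `collections.Counter` from the standard library. The sketch below uses a short made-up list of labels so that it runs on its own; in the notebook you would collect `w.label_` into a list instead of printing it.

```python
from collections import Counter

# Made-up sample of labels; in the notebook these would come from
# the label_ attribute of each entity that snlp finds.
labels = ["MISC", "MISC", "PER", "LOC", "PER", "ORG", "MISC"]

label_counts = Counter(labels)

# Prints each label with its count, most frequent first.
for label, count in label_counts.most_common():
    print(label, count)
```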


## Step 5. Filtering NER Labels
If you're only interested in entities that are persons (label `PER`), you can run the following line:

In [75]:
df['persons'] = df['title_normalized'].astype(str).map(lambda text: [ent for ent in snlp(text).ents if ent.label_ == "PER"])
print(df['persons'])

0                                                      []
1                                                      []
2                               [(Christopher, J., Koch)]
3                                                      []
4       [(gerbet, han), (Wolframs, von, Eschenbach, Pa...
                              ...                        
1065                                                   []
1066                                                   []
1067                                                   []
1068                                     [(Neil, Gaiman)]
1069                                                   []
Name: persons, Length: 1070, dtype: object
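
The same idea extends to more than one label. The sketch below keeps any entity whose label is in a set you choose. It again uses a stand-in `FakeEntity` type so it runs without a model, and the helper name `filter_entities` and the labels on the sample entities are assumptions for illustration only.

```python
from collections import namedtuple

# Stand-in for spaCy entity objects, which also have .text and .label_.
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])


def filter_entities(ents, wanted_labels):
    """Keep only the entities whose label is one of wanted_labels."""
    wanted = set(wanted_labels)
    return [ent for ent in ents if ent.label_ in wanted]


# Sample entities; the labels here are illustrative, not model output.
sample = [
    FakeEntity("Neil Gaiman", "PER"),
    FakeEntity("Thüringer Wald", "LOC"),
    FakeEntity("Deutschlands", "LOC"),
]

only_people = filter_entities(sample, ["PER"])
people_and_places = filter_entities(sample, ["PER", "LOC"])
print([ent.text for ent in only_people])
print([ent.text for ent in people_and_places])
```

In the notebook, you could pass `snlp(text).ents` to a helper like this inside the `map` call instead of writing the list comprehension inline.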

In [76]:
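#names the output file by adding the prefix 'ner_' to the input file name
outfilename = 'ner_'+infilename
#writes the dataframe, including the new entity columns, to the output CSV
df.to_csv(outfilename)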

In [3]:
#installs spacy-transformers, which spaCy's transformer-based models require
pip install spacy-transformers

Collecting spacy-transformers
  Downloading spacy_transformers-1.1.4-py2.py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.4/51.4 KB 5.1 MB/s eta 0:00:00
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 13.1 MB/s eta 0:00:00
Installing collected packages: spacy-alignments, spacy-transformers
Successfully installed spacy-alignments-0.8.4 spacy-transformers-1.1.4
Note: you may need to restart the kernel to use updated packages.


In [6]:
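#tries to download a transformer-based English model; this package name dates from spaCy v2 and is not available for spaCy v3 (the v3 English transformer model is en_core_web_trf)
!python -m spacy download en_trf_bertbaseuncased_lg

✘ No compatible package found for 'en_trf_bertbaseuncased_lg' (spaCy v3.2.2)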



## Suggested citation
If you use this notebook as part of your project workflow, you can cite it with something to the effect of:

Dombrowski, Quinn. *Named entity recognition in CSVs* Jupyter notebook. https://github.com/quinnanya/csv-ner. 2019.