# Named entity recognition in CSVs

This notebook uses [spaCy](https://spacy.io/) to perform named-entity recognition (NER) on text in specified columns of a CSV file, then adds new columns to the CSV with the identified entities.
It is 90% based on this wonderful notebook: https://github.com/quinnanya/csv-ner/blob/master/named-entity-recognition-in-csvs.ipynb. Only Step 5 is new :)


## Step 1. Load core modules

In [1]:
#os is used to change the directory
import os
#spacy is used for the NER
import spacy
#pandas is used to read, edit, and write tabular data
import pandas as pd

## Step 2. Download and load language-specific data
The original notebook was created for German text, and this version uses English, but you can substitute the values in the following two cells with the corresponding ones for [another language that spaCy supports](https://spacy.io/models). Choose the language you'd like on the spaCy site and check the box for "import as module" to see the values to use.

For instance, to use Lithuanian, you'd change the first code cell below to: `!python -m spacy download lt_core_news_sm` 

and the second one to: 

`import lt_core_news_sm
snlp = spacy.load("lt_core_news_sm")`

Once you've run the first code cell on the computer where you're running this notebook, you can skip it in later sessions and just run the cell that imports the module. There's no harm in running the first cell again, but it won't change anything.
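
One way to decide whether the download cell still needs to run is to check whether the model package can already be imported. This is a minimal sketch, assuming the model is installed as a regular Python package (which is what `import en_core_web_sm` relies on). `model_is_installed` is a hypothetical helper name, not part of spaCy.

```python
import importlib.util

def model_is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

model_name = "en_core_web_sm"
if model_is_installed(model_name):
    print(f"{model_name} is already installed; the download cell can be skipped.")
else:
    print(f"{model_name} not found; run: python -m spacy download {model_name}")
```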

In [3]:
#downloads the model for the specified language (English)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 3.7 MB/s            
You should consider upgrading via the '/home/silviaegt/anaconda3/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
#imports the model as a module
import en_core_web_sm
#loads the model as snlp
snlp = spacy.load("en_core_web_sm")

## Step 3. Specify file directory and file
Replace `/home/silviaegt/Documents/alemania_test/` with the full path to the directory that has your input CSV file. This is also the directory where your output CSV file will be saved.

The syntax for the path is different on Mac and Windows. For instance, the default path to the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

- On Mac: `'/Users/YOUR-USER-NAME/Documents'`
- On Windows: `'C:\\Users\\YOUR-USER-NAME\\Documents'`

Then, replace `EThOS_CSV_202109.csv` with the name of a CSV file in the directory you've specified. The first row of the CSV file should be a header (i.e. the name of each column).
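
If you'd rather not worry about the differences between Mac and Windows path syntax, `pathlib` can build the path for you and confirm that the file exists before you try to read it. This is a sketch of an alternative to `os.chdir`, not what the notebook itself does. `find_input_csv` and `example.csv` are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path

def find_input_csv(directory, filename):
    """Return the full path to the CSV, or raise if the file is missing."""
    csv_path = Path(directory).expanduser() / filename
    if not csv_path.is_file():
        raise FileNotFoundError(f"No such input file: {csv_path}")
    return csv_path

# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.csv").write_text("Title,Abstract\n", encoding="utf-8")
    found = find_input_csv(tmp, "example.csv")
    print(found.name)
```

The returned path can be passed straight to `pd.read_csv`, so the working directory does not have to change.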

In [5]:
sourcefiledirectory = '/home/silviaegt/Documents/alemania_test/'
#changes the working directory to the directory specified above
os.chdir(sourcefiledirectory)
infilename = 'EThOS_CSV_202109.csv'

## Step 4. Reading CSV, performing NER
This step actually processes your data. 

In this example, there are two columns in the source file that contain text where we want to find entities: *Title* and *Abstract*. 

The cells below read in the data from the CSV file you've specified above, keep only the rows whose *Title* contains "Language", and create a new column, *ner_text*, which holds the entities extracted from the *Abstract* column (you can change both names).

The `print(df['ner_text'])` command is optional, but it is a convenient way to get a sense of what the output will be. When the values are printed, each entity is surrounded by parentheses (), and all the entities for a given row of the CSV are surrounded by square brackets \[\]. When you *print* the output, the words of a multi-word entity are separated by commas within the parentheses (such as: (Europäische, Union)), because pandas displays each entity as a sequence of its tokens. When you *write* it to a new CSV file, the parentheses and the commas between individual words disappear, and you get a single comma-separated list inside square brackets, with commas separating individual entities (e.g. \[Uploadfilter, Europäische Union\]).
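
If you want the printed and written forms to match, one option is to store each entity's plain text rather than the entity object itself. The sketch below assumes a blank English pipeline with a rule-based `entity_ruler` and a single hand-written pattern, so it runs without downloading a model. In the notebook you would use `snlp` instead. The example sentence and pattern are made up for illustration.

```python
import spacy

# Blank pipeline plus one hand-written rule: an illustrative stand-in
# for the statistical model loaded as snlp in the notebook.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "European Union"}])

doc = nlp("The European Union debated upload filters.")

# Each entity is a Span: .text is its plain string, .label_ its entity type.
entity_texts = [ent.text for ent in doc.ents]
entity_labels = [ent.label_ for ent in doc.ents]
print(entity_texts)
print(entity_labels)
```

Applied to the notebook, this would mean using `[ent.text for ent in snlp(x).ents]` inside the `lambda` instead of `list(snlp(x).ents)`.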

In [8]:
#creates pandas dataframe with your specified input file, using the first row as a header
df = pd.read_csv(infilename, header=0, low_memory=False)


In [12]:
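#keeps only the rows whose Title contains "Language"; rows with a missing Title are dropped
df = df[df['Title'].str.contains("Language", na=False)]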
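
The filter above is case-sensitive and depends on how missing titles are handled. Here is a small sketch on made-up data (the titles below are not from the EThOS file) showing two optional `str.contains` arguments: `na=False` treats missing values as non-matches, and `case=False` ignores capitalization.

```python
import pandas as pd

# Made-up titles for illustration, including one missing value.
titles = pd.DataFrame(
    {"Title": ["Language and identity", "Sign languages of Ghana", None, "Soil chemistry"]}
)

# Build a True/False mask, then keep only the rows where it is True.
mask = titles["Title"].str.contains("language", case=False, na=False)
matching = titles[mask]
print(matching["Title"].tolist())
```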

In [22]:
#creates a new column, ner_text, with entities extracted from the 'Abstract' column
df['ner_text'] = df['Abstract'].astype(str).apply(lambda x: list(snlp(x).ents))
#prints the values from ner_text
print(df['ner_text'])

3485                                                     []
4550                                                     []
5781      [(Webster), (1986), (Campbell, ,, Burden, &, W...
5980                                                     []
6492                                                     []
                                ...                        
582400    [(Online, Community, Projects), (English), (a,...
582694    [(Saudi, Arabia), (G20), (2016), (2030), (Engl...
582972              [(first), (two), (four), (1), (2), (3)]
583480    [(Ghanaian, Sign, Language), (three), (GSL), (...
584221          [(Critical, Language, Awareness), (Kuwait)]
Name: ner_text, Length: 654, dtype: object


In [23]:
#prints the label (entity type) of every entity found in the Title column
for title in df['Title']:
    for ent in snlp(title).ents:
        print(ent.label_)

PERSON
LANGUAGE
LANGUAGE
PERSON
ORDINAL
LANGUAGE
NORP
LANGUAGE
DATE
NORP
LANGUAGE
PERSON
ORG
GPE
PERSON
GPE
GPE
ORDINAL
GPE
PERSON
PERSON
NORP
GPE
ORG
CARDINAL
PRODUCT
DATE
EVENT
ORG
DATE
GPE
GPE
ORG
PERSON
LAW
NORP
GPE
DATE
GPE
PERSON
GPE
NORP
GPE
ORG
ORG
PERSON
NORP
ORG
NORP
NORP
FAC
LANGUAGE
LOC
LOC
PERSON
ORG
ORG
GPE
LOC
PERSON
PERSON
GPE
DATE
PERSON
PERSON
PERSON
WORK_OF_ART
DATE
PERSON
PERSON
LANGUAGE
LANGUAGE
NORP
LOC
GPE
DATE
ORG
PERSON
PERSON
GPE
ORG
GPE
WORK_OF_ART
NORP
PERSON
NORP
GPE
LANGUAGE
ORDINAL
PERSON
GPE
EVENT
PERSON
ORG
NORP
NORP
LANGUAGE
LANGUAGE
LOC
GPE
CARDINAL
CARDINAL
NORP
NORP
ORG
ORG
GPE
DATE
NORP
GPE
NORP
GPE
LANGUAGE
LANGUAGE
DATE
ORG
ORDINAL
NORP
GPE
GPE
GPE
NORP
NORP
NORP
LANGUAGE
LOC
GPE
PERSON
GPE
EVENT
LANGUAGE
GPE
ORG
CARDINAL
CARDINAL
DATE
LANGUAGE
ORDINAL
GPE
ORG
NORP
ORG
GPE
ORG
PERSON
PERSON
PERSON
GPE
PERSON
PERSON
LOC
GPE
ORDINAL
GPE
NORP
LANGUAGE
ORG
NORP
LANGUAGE
LOC
ORG
ORG
GPE
CARDINAL
LANGUAGE
GPE
PERSON
LANGUAGE
CARDINAL
GPE
ORG
GPE
ORG
LO
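
A long list of labels like the one above is easier to read as a tally. This is a minimal sketch using `collections.Counter`. `count_entity_labels` and `stand_in_doc` are hypothetical helpers, and the stand-in documents only mimic the `.ents` and `.label_` attributes so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def count_entity_labels(docs):
    """Tally entity labels across an iterable of processed documents."""
    return Counter(ent.label_ for doc in docs for ent in doc.ents)

def stand_in_doc(*labels):
    """Hypothetical stand-in for a spaCy Doc: only .ents and .label_ exist."""
    return SimpleNamespace(ents=[SimpleNamespace(label_=label) for label in labels])

# In the notebook you would pass (snlp(t) for t in df['Title']) instead.
docs = [
    stand_in_doc("PERSON", "LANGUAGE"),
    stand_in_doc("LANGUAGE", "GPE", "LANGUAGE"),
    stand_in_doc(),
]
label_counts = count_entity_labels(docs)

# Most frequent labels first.
for label, count in label_counts.most_common():
    print(label, count)
```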

## Step 5. Filtering NER Labels
If you're only interested in entities of one type, such as persons, you can filter on the entity label:

In [24]:
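#creates a new column, persons, with only the entities labelled PERSON in the 'Abstract' column
df['persons'] = df['Abstract'].astype(str).map(lambda text: [ent for ent in snlp(text).ents if ent.label_ == "PERSON"])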

In [25]:
df['persons']

3485                                                  []
4550                                                  []
5781      [(Webster), (Pattison), (Jensema), (Baddeley)]
5980                                                  []
6492                                                  []
                               ...                      
582400                                                []
582694                                                []
582972                                                []
583480                                                []
584221                                                []
Name: persons, Length: 654, dtype: object
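
The same idea extends to several labels at once. The sketch below assumes a blank English pipeline with a few hand-written `entity_ruler` patterns in place of `snlp`, so it runs without a downloaded model. `entity_texts` is a hypothetical helper, and the sentence is made up.

```python
import spacy

def entity_texts(doc, wanted_labels):
    """Return the text of every entity in doc whose label is in wanted_labels."""
    return [ent.text for ent in doc.ents if ent.label_ in wanted_labels]

# Rule-based stand-in for the statistical model used in the notebook.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Webster"},
    {"label": "GPE", "pattern": "Kuwait"},
    {"label": "GPE", "pattern": "Saudi Arabia"},
])

# Process the text once, then filter the same result for each label set.
doc = nlp("Webster compared classrooms in Kuwait and Saudi Arabia.")
persons = entity_texts(doc, {"PERSON"})
places = entity_texts(doc, {"GPE", "LOC"})
print(persons)
print(places)
```

Processing each text once and filtering the result several times avoids running the model again for every new column.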

In [26]:
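#names the output file after the input file, with a prefix
outfilename = 'nerabstract_'+infilename
#writes the dataframe, including the new columns, to a CSV in the working directory
df.to_csv(outfilename)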
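
When you read the output file back in, each list of entities arrives as a single string. If the column holds plain strings (for example from `ent.text`) rather than spaCy objects, one way to recover the lists is `ast.literal_eval`. This is a sketch on made-up data that writes to an in-memory buffer instead of a file. `index=False`, which leaves out the row labels, is an optional choice the notebook does not make.

```python
import ast
import io

import pandas as pd

# Made-up results: the persons column holds lists of plain strings.
results = pd.DataFrame({
    "Title": ["A study", "Another study"],
    "persons": [["Webster", "Pattison"], []],
})

# Write to an in-memory buffer; index=False omits the row labels.
buffer = io.StringIO()
results.to_csv(buffer, index=False)
buffer.seek(0)

# The converter turns the text "['Webster', 'Pattison']" back into a list.
reloaded = pd.read_csv(buffer, converters={"persons": ast.literal_eval})
print(reloaded["persons"].tolist())
```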

## Suggested citation
If you use this notebook as part of your project workflow, you can cite it with something to the effect of:

Dombrowski, Quinn. *Named entity recognition in CSVs* Jupyter notebook. https://github.com/quinnanya/csv-ner. 2019.