# Named entity recognition in CSVs

This notebook uses [spaCy](https://spacy.io/) to perform named-entity recognition (NER) on text in specified columns of a CSV file. It then writes a copy of the CSV with new columns that contain the identified entities.

## Step 1. Load core modules

In [None]:
#os is used to change the directory
import os
#spacy is used for the NER
import spacy
#pandas is used to read, edit, and write tabular data
import pandas as pd

## Step 2. Download and load language-specific data
This notebook was originally created for German text, but you can substitute the values in the following two cells with the corresponding ones for [another language that spaCy supports](https://spacy.io/models). On the spaCy site, choose your language and check the box for "import as module" to see the values to use.

For instance, to use Lithuanian, you'd change the first code cell below to: `!python -m spacy download lt_core_news_sm` 

and the second one to: 

`import lt_core_news_sm
snlp = spacy.load("lt_core_news_sm")`

You only need to run the first code cell once on the computer where you're running this notebook. After that, you can skip it and just run the cell that imports the module. There's no harm in running the first cell again, but you don't need to.

In [None]:
#downloads the model for the specified language (German)
!python -m spacy download de_core_news_md

In [None]:
#imports the model as a module
import de_core_news_md
#loads the model as snlp
snlp = spacy.load("de_core_news_md")
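If you aren't sure whether the model has already been downloaded on this computer, one way to check is sketched below. It relies only on the fact that spaCy models are installed as importable Python packages, which is why the cell above can `import de_core_news_md`. The helper name `model_is_installed` is made up for this illustration and is not part of spaCy.

```python
import importlib.util


def model_is_installed(model_name: str) -> bool:
    """Return True if a Python package with this name can be imported.

    Hypothetical helper for this notebook; it does not load the model.
    """
    return importlib.util.find_spec(model_name) is not None


# "de_core_news_md" is the German model used in this notebook.
if model_is_installed("de_core_news_md"):
    print("Model found; you can skip the download cell.")
else:
    print("Model not found; run the download cell first.")
```

If you switch to another language, pass that model's name instead (for example `lt_core_news_sm`).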

## Step 3. Specify file directory and file
Replace `/Users/qad/Documents/netzdg` with the full path to the directory that has your input CSV file. This is also the directory where your output CSV file will be saved.

The syntax for the path is different on Mac and Windows. For instance, the default path to the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

- On Mac: `'/Users/YOUR-USER-NAME/Documents'`
- On Windows: `'C:\\Users\\YOUR-USER-NAME\\Documents'`

Then, replace `netzdg_blog.csv` with the name of a CSV file in the directory you've specified. The first row of the CSV file should be a header (i.e. with the name of each column).

In [None]:
sourcefiledirectory = '/Users/qad/Documents/netzdg'
#changes the working directory to the directory specified above
os.chdir(sourcefiledirectory)
infilename = 'netzdg_blog.csv'
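As an alternative to changing the working directory, the sketch below builds full paths with Python's standard `pathlib` module. It also shows that a Windows path can be written as a raw string (`r'...'`) instead of doubling every backslash. The variable names `infilepath`, `outfilepath`, `windows_dir`, and `same_dir` are introduced here for illustration only and are not used elsewhere in this notebook.

```python
from pathlib import Path, PureWindowsPath

# Example values from this notebook; replace them with your own.
sourcefiledirectory = Path('/Users/qad/Documents/netzdg')
infilename = 'netzdg_blog.csv'

# The / operator joins a directory and a file name into one path.
infilepath = sourcefiledirectory / infilename
outfilepath = sourcefiledirectory / ('ner_' + infilename)

# Two equivalent ways to write the same Windows directory:
# a raw string, or a normal string with doubled backslashes.
windows_dir = PureWindowsPath(r'C:\Users\YOUR-USER-NAME\Documents')
same_dir = PureWindowsPath('C:\\Users\\YOUR-USER-NAME\\Documents')

print(infilepath)
print(outfilepath)
print(windows_dir == same_dir)
```

pandas accepts path objects like these, so `pd.read_csv(infilepath)` should work without calling `os.chdir()` first.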

## Step 4. Read CSV, run NER, write CSV
This step actually processes your data. 

In this example, there are two columns in the source file that contain text where we want to find entities: *text* and *comments*. 

The cell below reads in the data from the CSV file you've specified above. It creates two new columns: *ner_text* which includes the entities extracted from the *text* column, and *ner_comments* with the entities from the *comments* column.

If you want to do NER on columns with different names, change *text* and *comments* to match the corresponding headers in your source CSV. You may also want to change *ner_text* and *ner_comments* to something more informative for your data set. To process more than two columns, add lines that follow the same model. To process only one column, delete one of the sets of lines.

The `print(df['ner_text'])` and `print(df['ner_comments'])` commands are optional, but they're a convenient way to preview the output. When the values are printed, each entity is surrounded by parentheses (), and all the entities for a given row of the CSV are surrounded by square brackets \[\]. In the *printed* output, the words of a multi-word entity are separated by commas within the parentheses, such as (Europäische, Union). When you *write* the data to a new CSV file, the parentheses and the commas between individual words disappear, and each row gets a single comma-separated list of entities inside square brackets, e.g. \[Uploadfilter, Europäische Union\].
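If you'd rather store each row's entities as one plain string, with no brackets in either the printed or the written output, a minimal sketch follows. The helper `entities_to_string` and the `"; "` separator are assumptions made for this example, not part of the original notebook. So that the sketch runs without a language model, it uses stand-in objects that have a `.text` attribute, which is the attribute a spaCy entity uses to expose its text.

```python
from types import SimpleNamespace


def entities_to_string(ents, separator="; "):
    """Join the text of each entity into a single plain string."""
    return separator.join(ent.text for ent in ents)


# Stand-ins for spaCy entities (assumption: only .text is needed here).
demo_ents = [
    SimpleNamespace(text="Uploadfilter"),
    SimpleNamespace(text="Europäische Union"),
]

print(entities_to_string(demo_ents))  # prints: Uploadfilter; Europäische Union
```

In this notebook, you could apply it to a finished column with something like `df['ner_text'] = df['ner_text'].apply(entities_to_string)`.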

The final cell specifies a name for the CSV file the notebook will generate. That file includes all the columns of the original CSV, plus the new ones you've created with the extracted entities. The name is set up to prefix the name of the original CSV with `ner_`, but you can change it to something else if you prefer.

In [None]:
#creates pandas dataframe with your specified input file, using the first row as a header
df = pd.read_csv(infilename, header=0)
#creates a new column, ner_text, with entities extracted from a column titled 'text'
df['ner_text'] = df['text'].astype(str).apply(lambda x: list(snlp(x).ents))
#prints the values from ner_text
print(df['ner_text'])
#creates a new column, ner_comments, with entities extracted from a column titled 'comments'
df['ner_comments'] = df['comments'].astype(str).apply(lambda x: list(snlp(x).ents))
#prints the values from ner_comments
print(df['ner_comments'])
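If you have many text columns, copying the same pair of lines for each one gets repetitive. The sketch below wraps the logic from the cell above in a loop. The function name `add_ner_columns` and its `prefix` parameter are made up for this example. It returns a new dataframe rather than modifying the original, which is a design choice made here and not something the notebook requires.

```python
import pandas as pd


def add_ner_columns(df: pd.DataFrame, columns, nlp, prefix: str = "ner_") -> pd.DataFrame:
    """Return a copy of df with one new entity column per listed column.

    nlp is a loaded spaCy pipeline, such as snlp in this notebook.
    """
    out = df.copy()
    for column in columns:
        # Same logic as the cell above: convert each value to a string,
        # run the pipeline on it, and keep the entities as a list.
        out[prefix + column] = out[column].astype(str).apply(
            lambda text: list(nlp(text).ents)
        )
    return out


# Hypothetical usage with the names from this notebook:
# df = add_ner_columns(df, ['text', 'comments'], snlp)
```

Each new column is named by adding the prefix to the source column's name, so *text* becomes *ner_text*, matching the names used above.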

In [None]:
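#names the output file by prefixing the input file name with ner_
outfilename = 'ner_'+infilename
#writes the dataframe, including the new columns, to the output CSV file
df.to_csv(outfilename)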
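One thing to be aware of: by default, `to_csv` also writes the dataframe's row index as an extra, unnamed first column. The sketch below uses a tiny made-up dataframe (not your data) to show the difference when `index=False` is passed. It writes to in-memory strings, so no files are created.

```python
import io

import pandas as pd

# Tiny made-up dataframe, used only to illustrate the index column.
demo = pd.DataFrame({"text": ["a", "b"], "ner_text": ["[]", "[]"]})

# Default behavior: the row index is written as an unnamed first column.
csv_with_index = demo.to_csv()
columns_with_index = list(pd.read_csv(io.StringIO(csv_with_index)).columns)

# With index=False, only the dataframe's own columns are written.
csv_without_index = demo.to_csv(index=False)
columns_without_index = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(columns_with_index)     # ['Unnamed: 0', 'text', 'ner_text']
print(columns_without_index)  # ['text', 'ner_text']
```

If you don't want that extra column in your output file, you could change the last line of the cell above to `df.to_csv(outfilename, index=False)`.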

## Suggested citation
If you use this notebook as part of your project workflow, you can cite it with something to the effect of:

Dombrowski, Quinn. *Named entity recognition in CSVs* Jupyter notebook. https://github.com/quinnanya/csv-ner. 2019.