# Session Outline

- Go through homework 5
- Answers about Indexing
- What is the Lambda operator?
- A Note on **Visualization**
- Cleaning Pipeline
- Adding to the NLP Pipeline: **Tokenization**



![coding](https://media.giphy.com/media/FPbnShq1h1IS5FQyPD/giphy.gif)

# Needed Libraries

In [14]:
import numpy as np  # use np.random for random data; `from numpy import *` would shadow built-ins such as sum and abs
import csv
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from itertools import chain
import nltk
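nltk.download('punkt')  # tokenizer models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # stopword lists used by nltk.corpus.stopwords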
import codecs
import string
from nltk.corpus import stopwords
import re


[nltk_data] Downloading package punkt to /Users/Ashrakat/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Answers about Indexing

## Indexing in a list

In [None]:
with open('/Users/Ashrakat/Dropbox/University/Oxford/Jobs/Teaching/Text Analysis/code/Basics-of-Text-Analysis-for-Political-Science/Data/dataset_solution4_ex2.csv') as f:
    lines = f.read().splitlines()

In [None]:
lines

In [None]:
# keep only the year (drop the last four characters)

lines_year = lines.copy()
lines_year[2]=lines_year[2][:-4]
lines_year

In [None]:
#keep only months
lines_month = lines.copy()
lines_month[2]=lines_month[2][4:6]
lines_month

In [None]:
#keep only days
lines_day = lines.copy()
lines_day[2]=lines_day[2][6:8]
lines_day

In [None]:
# or: slice from position 6 to the end of the string

lines_day2 = lines.copy()
lines_day2[2]=lines_day2[2][6:]
lines_day2
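The slices above only work if each date is stored as a fixed-width string. A minimal sketch, assuming a `YYYYMMDD` layout (the value `'20200315'` is made up for illustration and is not taken from the dataset):

```python
# Hypothetical date string in YYYYMMDD form (not taken from the course data)
date = '20200315'

year = date[:-4]    # everything except the last four characters
month = date[4:6]   # characters at positions 4 and 5
day = date[6:8]     # characters at positions 6 and 7
day_alt = date[6:]  # from position 6 to the end; same result for 8-character strings

print(year, month, day, day_alt)
```

If the strings had a different length or layout, the slice positions would need to change accordingly.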

# What is the Lambda Operator?



A lambda function is a small anonymous function.

`lambda arguments : expression`

**Let's take a look at some examples.**

In [None]:
#Example 1 Lambda

x = lambda a : a + 10
print(x(5))

In [None]:
#Example 2 Lambda

x = lambda a, b : a * b
print(x(5, 6))
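A lambda is interchangeable with a small named function written with `def`; it is mostly handy when a function is needed only once, for instance as a sort key. A short sketch (the names `add_ten`, `add_ten_lambda` and `pairs` are illustrative only):

```python
# A lambda and the equivalent named function
add_ten_lambda = lambda a: a + 10

def add_ten(a):
    return a + 10

print(add_ten_lambda(5), add_ten(5))  # both calls return the same value

# Typical use: a throwaway key function for sorting by the second element
pairs = [('b', 3), ('a', 1), ('c', 2)]
by_number = sorted(pairs, key=lambda pair: pair[1])
print(by_number)
```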

## Example: Indexing in Pandas
#### Now I want a function that goes through each row and removes the last two characters in the id-snippet column. A lambda can help with that.


In [None]:
sascat_ccounts = pd.read_csv('/Users/Ashrakat/Dropbox/University/Oxford/Jobs/Teaching/Text Analysis/code/Basics-of-Text-Analysis-for-Political-Science/Data/datafromsession5_exc3.tsv',delimiter="\t")
del sascat_ccounts['Unnamed: 0']
sascat_ccounts.head()

In [None]:
sascat_ccounts['id-snippet'] = sascat_ccounts['id-snippet'].map(lambda x: str(x)[:-2])


In [None]:
sascat_ccounts

# A Note on Visualization


In [None]:
sascat_ccounts = pd.read_csv('/Users/Ashrakat/Dropbox/University/Oxford/Jobs/Teaching/Text Analysis/code/Basics-of-Text-Analysis-for-Political-Science/Data/datafromsession5_exc3.tsv',delimiter="\t")
del sascat_ccounts['Unnamed: 0']


- `itertools` is a standard-library module with functions for building efficient loops over iterables; read more here https://docs.python.org/2/library/itertools.html and here https://realpython.com/python-itertools/

- `chain` from itertools: chain('ABC', 'DEF') --> A B C D E F. This function takes any number of iterables as arguments and “chains” them together.
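To make the two bullets above concrete, here is a small sketch of `chain.from_iterable` flattening a list of lists before counting. The country names are made-up sample values, not rows from the dataset:

```python
from collections import Counter
from itertools import chain

# Hypothetical nested list, shaped like the output of [a.split() for a in column]
nested = [['iraq', 'kuwait'], ['iraq'], [], ['zaire', 'iraq']]

flat = list(chain.from_iterable(nested))  # one flat list of all items
counts = Counter(flat)                    # how often each item occurs

print(flat)
print(counts.most_common(2))
```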


In [None]:
country_notflat = [a.split() for a in sascat_ccounts['mentions_countries']] 
# this puts the countries in a list of lists, one inner list per row of the pandas column
country_notflat

In [None]:
counter = Counter(chain.from_iterable(country_notflat))
counter

In [None]:
#sort by values
from collections import OrderedDict
d_sorted_by_value = OrderedDict(sorted(counter.items(), key=lambda x: x[1]))
d_sorted_by_value
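Since Python 3.7 a plain `dict` also remembers insertion order, so `OrderedDict` is optional for this purpose. A sketch of two ways to order counts, using made-up numbers (`sample_counter` is a hypothetical stand-in for the real `counter`):

```python
from collections import Counter

sample_counter = Counter({'iraq': 3, 'kuwait': 1, 'zaire': 2})  # hypothetical counts

# Ascending by value, using a plain dict instead of OrderedDict
ascending = dict(sorted(sample_counter.items(), key=lambda item: item[1]))

# Descending by value: Counter has a built-in helper for this
descending = sample_counter.most_common()

print(list(ascending))
print(descending)
```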

In [None]:
plt.bar(d_sorted_by_value.keys(), d_sorted_by_value.values(), color='r')
plt.gcf().set_size_inches(20, 10)
plt.xticks(rotation=90)
plt.show()

# Cleaning Pipeline


In [None]:
def clean(x): # function to clean a single string
    x = x.lower() # lowercase the text
    x = re.sub(r'[^\w\s]', '', x) # remove punctuation (str.replace does not understand regex patterns)
    x = re.sub(r'\d+', '', x) # replace digits with nothing
    x = x.split() # split into words
    stop = set(stopwords.words('english')) # change this if you are using another language
    x = [word for word in x if word not in stop] # remove stopwords
    x = " ".join(x) # join the words back into one string
    return x
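Note that the string method `str.replace` matches its first argument literally; regular-expression patterns such as `[^\w\s]` only take effect with `re.sub` (or with pandas' `.str.replace(..., regex=True)`). A minimal sketch of the difference, on a made-up sentence:

```python
import re

text = "Hello, World! It is 2021."  # made-up example sentence

# str.replace looks for the literal characters '[^\w\s]' and finds none
literal = text.replace(r'[^\w\s]', '')

# re.sub treats the same pattern as a regular expression
no_punct = re.sub(r'[^\w\s]', '', text)
no_digits = re.sub(r'\d+', '', no_punct)

print(literal)
print(no_punct)
print(no_digits)
```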

# Tokenization Session - using NLTK package

Tokenization is splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

- Tokenization has advantages over Python's `.split()` method, especially when it comes to sentence tokenization


In [None]:
#sentence tokenizer
#inspired by Edward Ma's blog post on Medium

In [None]:
article = 'In computer science, lexical analysis, lexing or tokenization is the process of \
converting a sequence of characters (such as in a computer program or web page) into a \
sequence of tokens (strings with an assigned and thus identified meaning). A program that \
performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner \
is also a term for the first stage of a lexer. A lexer is generally combined with a parser, \
which together analyze the syntax of programming languages, web pages, and so forth.'

article2 = 'ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456'

article3 = 'It is a great moment from 10 a.m. to 1 p.m. every weekend.'

In [None]:
import re

for doc in [article, article2, article3]: # for every document
    print('Original Article:', doc) # print the original article
    print() # print an empty line

    sentences = re.split(r'(\.|!|\?)', doc) # split on '.', '!' or '?' using a regex

    for i, s in enumerate(sentences): # number each piece, starting at 0
        print('-->Sentence %d: %s' % (i, s)) # %d and %s are placeholders for i and s

# a different way, using NLTK, follows in the next cell
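One detail of the regular expression above: because the pattern is wrapped in parentheses (a capturing group), `re.split` returns the delimiters as separate items, so not every item in the result is a sentence. A small sketch with a made-up string:

```python
import re

sample = 'Hi. Bye!'  # made-up example

with_group = re.split(r'(\.|!|\?)', sample)   # delimiters are kept as separate items
without_group = re.split(r'[.!?]', sample)    # delimiters are dropped

print(with_group)
print(without_group)
```

Either way, the regex has no notion of abbreviations such as "a.m.", which is where a trained sentence tokenizer helps.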

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
print('NLTK Version: %s' % nltk.__version__)

for text in [article, article2, article3]: # a separate loop variable avoids overwriting `article`
    print('Original Article: %s' % text)
    print()

    sentences = sent_tokenize(text)
    for i, sentence in enumerate(sentences):
        print('-->Sentence %d: %s' % (i, sentence))

In [None]:
text = """Amazing day... however, I still need a good night sleep. I will see you tomorrow for sure. Bye.Bye"""
# Splits at '. ' (a period followed by a space)
text.split('. ') 

In [None]:
tokenized_text = nltk.sent_tokenize(text)
tokenized_text

### Let's work with a new dataset and explore it

In [None]:
import codecs
dataset = codecs.open("/Users/Ashrakat/Desktop/rt_dataset.tsv", "r", "utf-8").read().strip().split("\n") 
# we split by line breaks

In [None]:
print(dataset[0])
print(dataset[1])


In [None]:
print (len(dataset))

In [None]:
# how to count the number of topics?

topics = []
for line in dataset:
    topic = line.split("\t")[2]
    topics.append(topic)
    

In [None]:
from collections import Counter

print (Counter(topics).most_common(30))

In [None]:
#using list comprehension
topics = [line.split("\t")[2] for line in dataset]

from collections import Counter

print (Counter(topics).most_common(30))

In [None]:
# let's start with some real NLP

# let's focus on a specific article, for example

article = dataset[50].split("\t")[3]
print (article)

In [None]:
print (type(article))


In [None]:
sentences = nltk.sent_tokenize(article) # <-- documentation for this command: http://www.nltk.org/_modules/nltk/tokenize.html

# for checking what you're getting back from a library, run these commands
print (type(sentences))
print (len(sentences))
print (sentences)

In [None]:
# let us consider a single sentence - how do we do that? ## use the 5th sentence

sentence = sentences[4]
print (sentence)

In [None]:
# let's divide the sentence in tokens (aka single words)
tokenized_sentence = nltk.word_tokenize(sentence)

print (tokenized_sentence)

In [None]:
# lower-casing the sentence
without_capital_letters = [word.lower() for word in tokenized_sentence]

print (without_capital_letters)

# homework: write a for-loop for doing the same thing

In [None]:
#remove stopwords
stop = stopwords.words('english')

without_stop_words = [word for word in without_capital_letters if word not in stop]

print (without_stop_words)

In [None]:
import string
exclude = set(string.punctuation)

# homework: how do we exclude punctuation? Hint: use `exclude`, defined above

without_punct = [word for word in without_stop_words if word not in exclude]

print (without_punct)

## Let's take a look at our cleaning pipeline

Note: `nlp_pipeline1` below does not yet remove stopwords (that step is commented out).

In [None]:
dataset = codecs.open("/Users/Ashrakat/Desktop/rt_dataset.tsv", "r", "utf-8").read().strip().split("\n") 
article=dataset[1]
article

In [None]:
with open('/Users/Ashrakat/Desktop/rt_dataset.tsv') as f:
    counter = Counter(f.read().strip().split())

print(counter.most_common(20))

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
exclude = set(string.punctuation)
exclude.add("‘")
exclude.add("“")

import re

def clean1(x): 
    x=x.replace('\n\n','') # remove the line breaks
    x=x.lower()# lower text
    x = ''.join(ch for ch in x if ch not in exclude) #remove punctuation
    x=re.sub('[0-9]+', '', x) # remove numbers
    x=x.split() #split words 
    x=[word for word in x if word not in stop_words]#remove stopwords
    #x=" ".join(x) # uncomment this to return a single string instead of a list
    return x

In [None]:
cleaned1=clean1(article)
print(cleaned1)

### Can you spot:
- `httpstcohwuvvftbelarus`
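
One plausible way such a token arises: paragraph breaks are replaced with nothing and punctuation is deleted character by character, so a link and the word after it fuse into one string. A sketch that reproduces the effect on a made-up input (the URL and sentence are hypothetical, chosen only to produce the same kind of token):

```python
import string

exclude = set(string.punctuation)

# Hypothetical raw text: a short link followed by a paragraph break
raw = 'Read more: https://t.co/hWUvvFt\n\nBelarus responded.'

step1 = raw.replace('\n\n', '')                            # glues the link to the next word
step2 = step1.lower()                                      # lowercase everything
step3 = ''.join(ch for ch in step2 if ch not in exclude)   # strips ':', '/' and '.'
tokens = step3.split()

print(tokens)
```

Replacing the line breaks with a space instead, or tokenizing before removing punctuation, would keep the two apart.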


In [None]:
def nlp_pipeline1(text):
    
    # if you want you can split in sentences - i'm usually skipping this step
    text=text.lower()
    
    #tokenize words for each sentence
    text = nltk.word_tokenize(text)
    
    # remove punctuation and numbers
    #text = [token for token in text if token.isalpha()]
    
    # remove stopwords - be careful with this step    
    #text = [token for token in text if token not in stop_words]

    return text

In [None]:
cleaned2=nlp_pipeline1(article)
print(cleaned2)

In [10]:
def nlp_pipeline2(text):
    
    # if you want you can split in sentences - i'm usually skipping this step
    text=text.lower()
    
    #tokenize words for each sentence
    text = nltk.word_tokenize(text)
    
    # remove punctuation and numbers
    text = [token for token in text if token.isalpha()] # isalpha() keeps only tokens made up entirely of letters
    
    # remove stopwords - be careful with this step    
    text = [token for token in text if token not in stop_words]

    return text

In [11]:
cleaned2=nlp_pipeline2(article)
print(cleaned2)


In [None]:
# discrepancies between both lists: items in `first` that are not in `second`
def diff(first, second):
    second = set(second)
    return [item for item in first if item not in second]

diff(cleaned1,cleaned2)
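
The helper keeps the order (and duplicates) of `first` and uses a set only for fast membership tests, so the result depends on the argument order. A sketch with made-up toy lists:

```python
def diff(first, second):
    second = set(second)  # set membership tests are fast
    return [item for item in first if item not in second]

# Hypothetical outputs of two cleaning functions
a = ['putin', 'dont', 'approve', 'wada', 'dont']
b = ['putin', 'approve', 'wada']

only_in_a = diff(a, b)  # items of a that are missing from b, duplicates kept
only_in_b = diff(b, a)  # items of b that are missing from a

print(only_in_a)
print(only_in_b)
```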

- Take care of the odd punctuation at the beginning
- We will see in sentiment analysis why words such as "don't" can be important

# Exercises:

**Exercise 1**

- Find out what the built-in `filter` function does and how it works together with a lambda
- Use the list `listofnumbers` below
- Write a lambda expression that keeps only the numbers in the list that are larger than or equal to 5, and print them

In [None]:
listofnumbers = [10,2,8,7,5,4,3,11,0, 1]


In [None]:
filter00 = filter (lambda x: x >= 5, listofnumbers) 
print(list(filter00))
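
For comparison, the same selection can be written as a list comprehension. A short sketch showing that both forms give the same result (the names `with_filter` and `with_comprehension` are illustrative):

```python
listofnumbers = [10, 2, 8, 7, 5, 4, 3, 11, 0, 1]

# filter() returns a lazy iterator, so wrap it in list() to see the values
with_filter = list(filter(lambda x: x >= 5, listofnumbers))

# The same selection written as a list comprehension
with_comprehension = [x for x in listofnumbers if x >= 5]

print(with_filter)
print(with_filter == with_comprehension)
```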

**Exercise 2**

- Use the dataset datafromsession5_exc3; you can find it on GitHub in the data folder
- Separate the id into three new columns: a) the first six characters (for example hrxxxx), b) the number between the dash and the dot (for example 101 in the first row), c) the number after the dot (for example "1" in the first row)
- Call the new columns: id1 id2 id3

In [None]:
sascat_ccounts = pd.read_csv('/Users/Ashrakat/Dropbox/University/Oxford/Jobs/Teaching/Text Analysis/code/Basics-of-Text-Analysis-for-Political-Science/Data/datafromsession5_exc3.tsv',delimiter="\t")
sascat_ccounts

In [None]:
sascat_ccounts.rename(columns={'id-snippet':'id'}, 
                 inplace=True)
sascat_ccounts['id3']= (sascat_ccounts["id"].str[11:])
sascat_ccounts

In [None]:
# split once at the dash; expand=True returns the two parts as separate columns
sascat_ccounts[['id1', 'id2_needsedit']] = sascat_ccounts['id'].str.split('-', n=1, expand=True)
sascat_ccounts

In [None]:
sascat_ccounts['id2_wrong']=(sascat_ccounts["id2_needsedit"].str[:-2]) # what's the mistake here?

In [None]:
sascat_ccounts

In [None]:
# removing everything after the '.' is a better option for creating id2
sascat_ccounts['id2_correct'] = sascat_ccounts['id2_needsedit'].str.split('.').str[0]

In [None]:
sascat_ccounts

In [None]:
# regex=False treats '.' as a literal dot rather than the regex wildcard
sascat_ccounts['id2_correct'] = sascat_ccounts['id2_correct'].str.replace('.', '', regex=False)
sascat_ccounts
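
The reason fixed-width slicing fails for `id2` is that the part after the dot can have one or two digits. A plain-Python sketch using two ids that appear in the data, with a regular expression as one possible alternative (the pattern itself is an assumption about the id format, not part of the original solution):

```python
import re

ids = ['hr5114-101.1', 'hr5368-102.16']  # two ids as they appear in the data

# Cutting a fixed number of characters leaves a trailing dot for two-digit suffixes
cut = '102.16'[:-2]
print(cut)

# A pattern-based split does not depend on the number of digits
# (assumed format: prefix, dash, digits, dot, digits)
pattern = re.compile(r'^([^-]+)-(\d+)\.(\d+)$')
parts = [pattern.match(i).groups() for i in ids]
print(parts)
```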


**Exercise 3**

- Load convertfromwidetolong using pandas
- Reshape dataset to long format

In [None]:

import pandas as pd
reshapedata = pd.read_csv('/Users/Ashrakat/Desktop/convertfromwidetolong.tsv',delimiter="\t")
reshapedata.head()

In [None]:
reshapedata1=reshapedata.melt(id_vars='country', var_name='year', value_name='some_value')
reshapedata1
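
Conceptually, `melt` turns every (row, year-column) cell into its own row. A plain-Python sketch of that idea, with made-up values and column names (the real file's columns may differ, and pandas orders the resulting rows differently):

```python
# Hypothetical wide table: one row per country, one column per year
wide = [
    {'country': 'A', '2000': 1.0, '2001': 2.0},
    {'country': 'B', '2000': 3.0, '2001': 4.0},
]

long_rows = []
for row in wide:
    for column, value in row.items():
        if column == 'country':
            continue  # the id column is repeated on every output row, not unpivoted
        long_rows.append({'country': row['country'], 'year': column, 'some_value': value})

for r in long_rows:
    print(r)
```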

**Exercise 4**

a)
- Your first task is to improve the country dictionary from last class.
- Countries like the UK can appear in the text as UK, United Kingdom, Great Britain, etc.; so far we do not account for all the ways a country can be mentioned.
- This is a serious problem in text analysis: if you cannot recognize the different ways a country is mentioned, you will undercount it.
- Create a new dictionary for Russia that includes all the forms in which Russia can be mentioned.
- Create the same type of dictionary for the Democratic Republic of Congo (called `CRC` in the solution below).
- Create the same type of dictionary for Yugoslavia.

then:

- Create three new columns: one for Russia, one for the Democratic Republic of Congo, and one for Yugoslavia.
- In each of these columns, add the name under which the country was mentioned.

b)
- Use the text-processing methods we have covered so far to clean the articles. You can use pandas or any other method (think about when this step has to be done).

c) 
- Count most frequent words overall
- Plot most frequent words in a histogram

In [71]:
countries = ['United States of America', 'Canada', 'Bahamas', 'Cuba', 'Haiti', 'Dominican Republic', 'Jamaica', 'Trinidad and Tobago', 'Barbados', 'Mexico', 'Belize', 'Guatemala', 'Honduras', 'El Salvador', 'Nicaragua', 'Costa Rica', 'Panama', 'Colombia', 'Venezuela', 'Guyana', 'Surinam', 'Ecuador', 'Peru', 'Brazil', 'Bolivia', 'Paraguay', 'Chile', 'Argentina', 'Uruguay', 'United Kingdom', 'Ireland', 'Netherlands', 'Belgium', 'Luxembourg', 'France', 'Switzerland', 'Spain', 'Portugal', 'German Federal Republic', 'German Democratic Republic', 'Poland', 'Austria', 'Hungary', 'Czechoslovakia', 'Czech Republic', 'Slovakia', 'Italy/Sardinia', 'Malta', 'Albania', 'Montenegro', 'Macedonia (Former Yugoslav Republic of)', 'Croatia', 'Serbia', 'Yugoslavia', 'Bosnia-Herzegovina', 'Kosovo', 'Slovenia', 'Greece', 'Cyprus', 'Bulgaria', 'Moldova', 'Rumania', 'Russia (Soviet Union)', 'Estonia', 'Latvia', 'Lithuania', 'Ukraine', 'Belarus (Byelorussia)', 'Armenia', 'Georgia', 'Azerbaijan', 'Finland', 'Sweden', 'Norway', 'Denmark', 'Iceland', 'Cape Verde', 'Guinea-Bissau', 'Equatorial Guinea', 'Gambia', 'Mali', 'Senegal', 'Benin', 'Mauritania', 'Niger', "Cote D'Ivoire", 'Guinea', 'Burkina Faso (Upper Volta)', 'Liberia', 'Sierra Leone', 'Ghana', 'Togo', 'Cameroon', 'Nigeria', 'Gabon', 'Central African Republic', 'Chad', 'Congo', 'Congo, Democratic Republic of (Zaire)', 'Uganda', 'Kenya', 'Tanzania/Tanganyika', 'Zanzibar', 'Burundi', 'Rwanda', 'Somalia', 'Djibouti', 'Ethiopia', 'Eritrea', 'Angola', 'Mozambique', 'Zambia', 'Zimbabwe (Rhodesia)', 'Malawi', 'South Africa', 'Namibia', 'Lesotho', 'Botswana', 'Swaziland', 'Madagascar', 'Comoros', 'Mauritius', 'Morocco', 'Algeria', 'Tunisia', 'Libya', 'Sudan', 'South Sudan', 'Iran (Persia)', 'Turkey (Ottoman Empire)', 'Iraq', 'Egypt', 'Syria', 'Lebanon', 'Jordan', 'Israel', 'Saudi Arabia', 'Yemen (Arab Republic of Yemen)', "Yemen, People's Republic of", 'Kuwait', 'Bahrain', 'Qatar', 'United Arab Emirates', 'Oman', 'Afghanistan', 
'Turkmenistan', 'Tajikistan', 'Kyrgyz Republic', 'Uzbekistan', 'Kazakhstan', 'China', 'Tibet', 'Mongolia', 'Taiwan', "Korea, People's Republic of", 'Korea, Republic of', 'Japan', 'India', 'Bhutan', 'Pakistan', 'Bangladesh', 'Myanmar (Burma)', 'Sri Lanka (Ceylon)', 'Maldives', 'Nepal', 'Thailand', 'Cambodia (Kampuchea)', 'Laos', 'Vietnam, Democratic Republic of', 'Vietnam, Republic of', 'Malaysia', 'Singapore', 'Brunei', 'Philippines', 'Indonesia', 'East Timor', 'Australia', 'Papua New Guinea', 'New Zealand', 'Solomon Islands', 'Fiji']


In [82]:
yugoslavia = ["Serbia", "Yugoslavia", "Serbia and Montenegro"]
CRC=["Congo", "Democratic Republic of Congo", "Zaire"]
russia=["Russia", "Russian Federation", "Soviets", "Soviet Union"]

In [83]:
yugoslavia=[x.lower() for x in yugoslavia]
CRC=[x.lower() for x in CRC]
russia=[x.lower() for x in russia]

In [84]:
sascat = pd.read_csv('/Users/Ashrakat/Desktop/sascat_excerpt.tsv',delimiter="\t")
sascat = sascat.rename({"Unnamed: 10": 'content'}, axis=1)
sascat = sascat[['content','id-snippet']]


In [85]:
def nlp_pipeline(text):
    
    text=text.lower()
    
    #tokenize words for each sentence
    text = nltk.word_tokenize(text)
    
    # remove punctuation and numbers
    text = [token for token in text if token.isalpha()] # isalpha() keeps only tokens made up entirely of letters
    
    # remove stopwords - be careful with this step    
    text = [token for token in text if token not in stop_words]

    return text

In [86]:
sascat['content'] = sascat['content'].apply(nlp_pipeline)
sascat['content'] = [' '.join(map(str, l)) for l in sascat['content']] # remove list structure

sascat.head()

| | content | id-snippet |
|---|---|---|
| 0 | provides enforcement provisions law impose san... | hr5114-101.1 |
| 1 | requires delivery excess defense articles nato... | hr5114-101.2 |
| 2 | prohibits making available esf foreign militar... | hr5114-101.3 |
| 3 | prohibits obligation funds european bank recon... | hr5114-101.4 |
| 4 | prohibits assistance countries fail take steps... | hr5114-101.5 |


In [87]:
for index, row in sascat.iterrows(): # iterate over rows
    words_to_keep = [] # create a list of words you want to keep
    for word in row['content'].split(' '): # split the content of this row on spaces
        if word in russia: # keep the word if it is in the Russia dictionary
            words_to_keep.append(word + ' ') # append the matching word to the list
    sascat.loc[index, 'mentions_russia'] = ''.join(words_to_keep) # write the matches to a new column

In [88]:
for index, row in sascat.iterrows(): # iterate over rows
    words_to_keep = [] # create a list of words you want to keep
    for word in row['content'].split(' '): # split the content of this row on spaces
        if word in CRC: # keep the word if it is in the CRC dictionary
            words_to_keep.append(word + ' ') # append the matching word to the list
    sascat.loc[index, 'mentions_crc'] = ''.join(words_to_keep) # write the matches to a new column

In [89]:
for index, row in sascat.iterrows(): # iterate over rows
    words_to_keep = [] # create a list of words you want to keep
    for word in row['content'].split(' '): # split the content of this row on spaces
        if word in yugoslavia: # keep the word if it is in the Yugoslavia dictionary
            words_to_keep.append(word + ' ') # append the matching word to the list
    sascat.loc[index, 'mentions_yugoslavia'] = ''.join(words_to_keep) # write the matches to a new column
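
The loops above compare one token at a time, so multi-word names such as "soviet union" can never match, and the stopword step has already dropped words like "of" and "and". One possible workaround, sketched here with a hypothetical helper `find_mentions` and made-up sentences, is to search the lowercased raw text for whole phrases before tokenizing:

```python
import re

russia = ['russia', 'russian federation', 'soviets', 'soviet union']

def find_mentions(text, names):
    """Return the entries of `names` that occur in `text` as whole words or phrases."""
    text = text.lower()
    found = []
    for name in names:
        # \b marks a word boundary, so 'russia' does not match inside 'russian'
        if re.search(r'\b' + re.escape(name) + r'\b', text):
            found.append(name)
    return found

# Made-up example sentences, not taken from the dataset
first = find_mentions('Aid to the Soviet Union was suspended.', russia)
second = find_mentions('Relations with Russia improved.', russia)

print(first)
print(second)
```

In pandas this helper could be applied to the uncleaned text column with `.apply`, before the cleaning pipeline runs.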

In [91]:
sascat['mentions_russia']=sascat['mentions_russia'].str.split(' ').apply(set).str.join(', ')
sascat['mentions_crc']=sascat['mentions_crc'].str.split(' ').apply(set).str.join(', ')
sascat['mentions_yugoslavia']=sascat['mentions_yugoslavia'].str.split(' ').apply(set).str.join(', ')


In [93]:
sascat['mentions_russia'] = sascat['mentions_russia'].apply(lambda x: x.replace(',', ''))
sascat['mentions_crc'] = sascat['mentions_crc'].apply(lambda x: x.replace(',', ''))
sascat['mentions_yugoslavia'] = sascat['mentions_yugoslavia'].apply(lambda x: x.replace(',', ''))


In [94]:
sascat

| | content | id-snippet | mentions_russia | mentions_crc | mentions_yugoslavia |
|---|---|---|---|---|---|
| 0 | provides enforcement provisions law impose san... | hr5114-101.1 | | | |
| 1 | requires delivery excess defense articles nato... | hr5114-101.2 | | | yugoslavia |
| 2 | prohibits making available esf foreign militar... | hr5114-101.3 | | | |
| 3 | prohibits obligation funds european bank recon... | hr5114-101.4 | | | |
| 4 | prohibits assistance countries fail take steps... | hr5114-101.5 | | | |
| ... | ... | ... | ... | ... | ... |
| 95 | earmarks esf development assistance available ... | hr5368-102.16 | | | |
| 96 | expresses sense congress recommended levels es... | hr5368-102.17 | | | |
| 97 | prohibits esf assistance zaire | hr5368-102.18 | | zaire | |
| 98 | limits amount esf assistance tied aid credits ... | hr5368-102.19 | | | |
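
Part c) of the exercise, counting the most frequent words, can be sketched with `Counter`, assuming the cleaned texts are space-joined strings as in `sascat['content']`. The list `cleaned_texts` is a hypothetical stand-in; only its first entry is a full row from the table above, the other two are made up:

```python
from collections import Counter

# Hypothetical stand-in for sascat['content']
cleaned_texts = [
    'prohibits esf assistance zaire',
    'prohibits assistance countries',
    'limits esf assistance',
]

# Count every word across all texts
word_counts = Counter(word for text in cleaned_texts for word in text.split())
top_words = word_counts.most_common(3)

print(top_words)
```

The resulting (word, count) pairs can be unpacked and passed to `plt.bar` in the same way as in the visualization section.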


**Exercise 5**

Use the cleaning methods we have learned so far to clean the text of rt_dataset.tsv

You can use pandas or not; it's your choice

In [33]:
import pandas as pd
rt_news = pd.read_csv('/Users/Ashrakat/Desktop/rt_dataset.tsv',delimiter="\t")
rt_news.columns = ['date', 'title',"topic","content"]
rt_news.head()


| | date | title | topic | content |
|---|---|---|---|---|
| 0 | 16 Sep, 2016 14:08 | Putin: We don’t approve of WADA hackers, but i... | news | We don’t approve of what hackers do, but what ... |
| 1 | 11 Sep, 2016 22:33 | Hillary Clinton diagnosed with pneumonia, canc... | usa | Dr. Lisa Bardack, Clinton’s personal doctor s... |
| 2 | 2 Dec, 2016 20:15 | Ronaldinho and Riquelme offer to come out of r... | sport | READ MORE: 71 dead after plane carrying Brazil... |
| 3 | 9 Feb, 2016 21:13 | NATO & European leaders whip up hysteria over ... | news | “The leaders of NATO member states and a numbe... |
| 4 | 5 Apr, 2016 18:01 | US ‘Gremlin’ drones designed to cause missile ... | usa | Four firms, including fighter jet manufacturer... |


In [34]:
rt_news['title_content']= rt_news['title']+ " " + rt_news['content']

In [35]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
def nlp_pipeline2(text):
    
    text=text.lower()

    #tokenize words for each sentence
    text = nltk.word_tokenize(text)
    
    # remove punctuation and numbers
    text = [token for token in text if token.isalpha()] # isalpha() keeps only tokens made up entirely of letters
    
    # remove stopwords - be careful with this step    
    text = [token for token in text if token not in stop_words]
    

    return text

In [36]:
# note: only the first 5000 rows are cleaned; any rows beyond that become NaN in title_content
rt_news["title_content"]=rt_news['title_content'][0:5000].apply(nlp_pipeline2)

In [37]:
rt_news.head()

| | date | title | topic | content | title_content |
|---|---|---|---|---|---|
| 0 | 16 Sep, 2016 14:08 | Putin: We don’t approve of WADA hackers, but i... | news | We don’t approve of what hackers do, but what ... | [putin, approve, wada, hackers, information, l... |
| 1 | 11 Sep, 2016 22:33 | Hillary Clinton diagnosed with pneumonia, canc... | usa | Dr. Lisa Bardack, Clinton’s personal doctor s... | [hillary, clinton, diagnosed, pneumonia, cance... |
| 2 | 2 Dec, 2016 20:15 | Ronaldinho and Riquelme offer to come out of r... | sport | READ MORE: 71 dead after plane carrying Brazil... | [ronaldinho, riquelme, offer, come, retirement... |
| 3 | 9 Feb, 2016 21:13 | NATO & European leaders whip up hysteria over ... | news | “The leaders of NATO member states and a numbe... | [nato, european, leaders, whip, hysteria, myth... |
| 4 | 5 Apr, 2016 18:01 | US ‘Gremlin’ drones designed to cause missile ... | usa | Four firms, including fighter jet manufacturer... | [us, gremlin, drones, designed, cause, missile... |


In [None]:
#Sources: https://medium.com/@makcedward/nlp-pipeline-sentence-tokenization-part-6-86ed55b185e6