# Students Do: Crude Stopwords
For this activity, write a function that takes an article and returns a list of words with stopwords and all non-letter characters removed. After reviewing the results, define your own list of stopwords to add to the NLTK default set.

In [1]:
from nltk.corpus import reuters, stopwords
from nltk.tokenize import word_tokenize
import re

# Code to download corpora
import nltk
nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\lorie\AppData\Roaming\nltk_data...
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lorie\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lorie\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [2]:
crude_article = reuters.raw(fileids=reuters.fileids(categories='crude')[2])

In [3]:
print(crude_article)

TURKEY CALLS FOR DIALOGUE TO SOLVE DISPUTE
  Turkey said today its disputes with
  Greece, including rights on the continental shelf in the Aegean
  Sea, should be solved through negotiations.
      A Foreign Ministry statement said the latest crisis between
  the two NATO members stemmed from the continental shelf dispute
  and an agreement on this issue would effect the security,
  economy and other rights of both countries.
      "As the issue is basicly political, a solution can only be
  found by bilateral negotiations," the statement said. Greece has
  repeatedly said the issue was legal and could be solved at the
  International Court of Justice.
      The two countries approached armed confrontation last month
  after Greece announced it planned oil exploration work in the
  Aegean and Turkey said it would also search for oil.
      A face-off was averted when Turkey confined its research to
  territorrial waters. "The latest crises created an historic
  opportunity to solve th

In [4]:
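# Define a cleaning function: strip non-letters, tokenize, drop stopwords
def clean_text(article):
    # NLTK's default English stopwords, as a set for fast membership tests
    sw = set(stopwords.words('english'))
    # Matches any character that is not a letter or a space
    regex = re.compile("[^a-zA-Z ]")

    # Matched characters are deleted, not replaced with a space
    re_clean = regex.sub('', article)
    words = word_tokenize(re_clean)
    # Lowercase each token and keep it only if it is not a stopword
    output = [word.lower() for word in words if word.lower() not in sw]
    return output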

In [5]:
result = clean_text(crude_article)

In [6]:
# print out unique words
print(set(result))

{'disputes', 'stemmed', 'prime', 'minister', 'bilateral', 'latest', 'members', 'turkey', 'countries', 'disclosed', 'armed', 'basicly', 'solved', 'could', 'negotiations', 'foreign', 'created', 'greece', 'also', 'aegean', 'due', 'akiman', 'month', 'found', 'turkish', 'planned', 'dispute', 'reply', 'nazmi', 'continental', 'international', 'announced', 'faceoff', 'andreas', 'agreement', 'dialogue', 'justice', 'ozal', 'crises', 'averted', 'turkeys', 'greek', 'work', 'calls', 'effect', 'opportunity', 'week', 'ministry', 'statement', 'crisis', 'exploration', 'would', 'confined', 'message', 'sea', 'repeatedly', 'legal', 'including', 'search', 'research', 'confrontation', 'sent', 'security', 'nato', 'waters', 'oil', 'issue', 'political', 'solution', 'court', 'today', 'ambassador', 'territorrial', 'economy', 'rights', 'approached', 'contents', 'athens', 'shelf', 'said', 'meet', 'turgut', 'papandreou', 'last', 'two', 'solve', 'historic'}
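
The set above contains `faceoff` (from "face-off" in the article) and `turkeys` (which looks like a possessive that lost its apostrophe). Both come from deleting non-letter characters instead of replacing them with a space. The following is a minimal sketch of the difference, not part of the original solution: `strip_non_letters` is a hypothetical helper, the sample sentence is made up, and `str.split` stands in for `word_tokenize` so the snippet runs without NLTK data.

```python
import re

# Same pattern as clean_text: anything that is not a letter or a space
NON_LETTERS = re.compile(r"[^a-zA-Z ]")

def strip_non_letters(text, replacement=""):
    """Substitute every non-letter, non-space character with `replacement`."""
    return NON_LETTERS.sub(replacement, text)

# Made-up sample text, not a quote from the article
sample = "A face-off was averted. Turkey's reply"

# Deleting the characters glues the neighbouring letters together
merged = strip_non_letters(sample).lower().split()

# Replacing them with a space keeps the word boundaries
separated = strip_non_letters(sample, " ").lower().split()

print(merged)     # ['a', 'faceoff', 'was', 'averted', 'turkeys', 'reply']
print(separated)  # ['a', 'face', 'off', 'was', 'averted', 'turkey', 's', 'reply']
```

Replacing with a space leaves stray single letters such as `s`, so that variant would need those filtered out as well.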


In [7]:
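# Second iteration, with custom stopwords
def clean_text(article):
    # Custom additions chosen after inspecting the first result.
    # Note: the article spells the word 'basicly', so 'basically' never matches it.
    sw_addons = {'said', 'sent', 'found', 'including', 'today', 'announced', 'week', 'basically', 'also'}
    # Build the combined stopword set once rather than once per word
    sw = set(stopwords.words('english')).union(sw_addons)
    regex = re.compile("[^a-zA-Z ]")

    re_clean = regex.sub('', article)
    words = word_tokenize(re_clean)
    output = [word.lower() for word in words if word.lower() not in sw]
    return output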

In [8]:
result2 = clean_text(crude_article)
print(set(result2))

{'disputes', 'stemmed', 'prime', 'minister', 'bilateral', 'latest', 'members', 'turkey', 'countries', 'disclosed', 'armed', 'basicly', 'solved', 'could', 'negotiations', 'foreign', 'created', 'greece', 'aegean', 'due', 'akiman', 'month', 'turkish', 'planned', 'dispute', 'reply', 'nazmi', 'continental', 'international', 'faceoff', 'andreas', 'agreement', 'dialogue', 'justice', 'ozal', 'crises', 'averted', 'turkeys', 'greek', 'work', 'calls', 'effect', 'opportunity', 'ministry', 'statement', 'crisis', 'exploration', 'would', 'confined', 'message', 'sea', 'repeatedly', 'legal', 'search', 'research', 'confrontation', 'security', 'nato', 'waters', 'oil', 'issue', 'political', 'solution', 'court', 'ambassador', 'territorrial', 'economy', 'rights', 'approached', 'contents', 'athens', 'shelf', 'meet', 'turgut', 'papandreou', 'last', 'two', 'solve', 'historic'}
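
Counting how often each token appears can help when choosing custom stopwords, as a complement to reading the set by eye. The following is a minimal sketch using `collections.Counter`; the `tokens` list is a small hand-written stand-in for the list returned by `clean_text`, not real output from the article.

```python
from collections import Counter

# Hypothetical stand-in for the list returned by clean_text()
tokens = ["turkey", "said", "greece", "said", "oil", "said", "turkey", "issue"]

# Tally each token and list the most frequent ones first
counts = Counter(tokens)
top = counts.most_common(2)

print(top)  # [('said', 3), ('turkey', 2)]
```

A frequent word that carries little meaning, such as `said`, is a reasonable stopword candidate. A frequent word that names the topic, such as `turkey`, is usually worth keeping.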
