# Project 01
## Simple word-counting 
## TF-IDF semantics

In [3]:
# Load sys to read files
import sys

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import re

import os
import requests

import nltk
from nltk.corpus import reuters
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# nltk.download('reuters')

In [5]:
# List downloaded files

# os.listdir('Data/reuters21578/')

After reading the README, I know that the collection I am interested in is spread across the 22 `.sgm` files.
To read these files, I will use `with open(...) as infile`.

The directory also contains 6 files describing the categories used to index the data.

Opening just one file as an example:

In [10]:
# Open one file as an example. Mode 'r' opens it read-only in text mode.
with open('Data/reuters21578/reut2-008.sgm', 'r', encoding = 'utf-8', errors = 'ignore') as infile:
    data = infile.read()

# To print as a sanity check, uncomment.
# print(data)
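
To make the effect of `errors = 'ignore'` concrete, here is a minimal sketch on a made-up byte string (it is not taken from the Reuters files, and the variable names are hypothetical). Bytes that are not valid UTF-8 are dropped silently, whereas `errors = 'replace'` leaves a visible marker in the text.

```python
# Hypothetical sample: 'café' encoded as Latin-1, so the byte 0xE9 is not valid UTF-8.
raw = b'caf\xe9 au lait'

# errors='ignore' drops the undecodable byte without any warning.
ignored = raw.decode('utf-8', errors='ignore')

# errors='replace' substitutes U+FFFD, which makes the data loss visible.
replaced = raw.decode('utf-8', errors='replace')

print(repr(ignored))         # 'caf au lait'
print('\ufffd' in replaced)  # True
```

If characters matter for the analysis, it may be worth checking how many bytes are being discarded before settling on `'ignore'`.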

Despite the hard-to-read markup, the corpus is there.

Next, I will open all the files in the directory and collect their contents in a list called `data`.

In [11]:
data = []

for i in range(22):
    # Build each filename, zero-padding the index to 3 digits with str methods.
    # range(22) yields 0 to 21, i.e. reut2-000.sgm to reut2-021.sgm.
    filename = 'Data/reuters21578/reut2-{0}.sgm'.format(str(i).zfill(3))
    
    # Decode as UTF-8 (the most common scheme); bytes that are not valid UTF-8 are dropped.
    with open(filename, 'r', encoding = 'utf-8', errors = 'ignore') as infile:
        data.append(infile.read())
        
# Show the first 600 characters of the first file
data[0][:600]

'<!DOCTYPE lewis SYSTEM "lewis.dtd">\n<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">\n<DATE>26-FEB-1987 15:01:01.79</DATE>\n<TOPICS><D>cocoa</D></TOPICS>\n<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>\n<PEOPLE></PEOPLE>\n<ORGS></ORGS>\n<EXCHANGES></EXCHANGES>\n<COMPANIES></COMPANIES>\n<UNKNOWN> \n&#5;&#5;&#5;C T\n&#22;&#22;&#1;f0704&#31;reute\nu f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>\n<TEXT>&#2;\n<TITLE>BAHIA COCOA REVIEW</TITLE>\n<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in\nthe Bahia cocoa zone, alle'
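
As a small aside, the filename padding used in the loop can be checked without touching the disk. The sketch below only builds the names; `names_zfill` and `names_spec` are hypothetical variables, and the `{:03d}` format spec is shown as an equivalent alternative to `str(i).zfill(3)`.

```python
# Two equivalent ways to zero-pad the file index to three digits.
names_zfill = ['reut2-{0}.sgm'.format(str(i).zfill(3)) for i in range(22)]
names_spec = ['reut2-{:03d}.sgm'.format(i) for i in range(22)]

# First name, last name, and how many names were generated.
print(names_zfill[0], names_zfill[-1], len(names_zfill))
# reut2-000.sgm reut2-021.sgm 22
```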

Each file is now stored as one string in the `data` list. The markup is SGML, which looks much like HTML, so I will use the BeautifulSoup package to parse it.

As a sanity check, I will look at what the NLTK package provides as its 'Reuters' corpus.

In [12]:
# nltk.corpus.reuters.raw()

It looks like the same collection, although the articles are in a different order.

Now I will use BeautifulSoup to parse the text, which makes it possible to strip all the tags. There are several ways to do this. I could loop over the articles and append the text between the `<BODY>` tags to a dataframe. (I might do that later; for simplicity, I will just put it in a list for now.)

In [13]:
from bs4 import BeautifulSoup
example_soup = BeautifulSoup(data[0], 'html.parser')

# print(example_soup.prettify())        # Indents the markup so it is easier to read, much like a browser's inspector panel when scraping.
# print(example_soup.get_text())        # Strips the tags and returns only the text.
# print(example_soup.find_all('body'))  # Returns every element with the chosen tag, in this case 'body'.

Let's now run the whole corpus through BeautifulSoup. Each soup object is built from a single string, so we need to iterate over the `data` list.

In [14]:
corpora = []
for text in data:
    # Parse text as html using beautiful soup
    parsed_text = BeautifulSoup(text, 'html.parser')
    # Keep the first <BODY> element of this file
    table = parsed_text.find_all('body')[0]

# This runs after the loop, so it prints the first article body of the last file.
print(table)
# df = pd.read_html(str(table))
# print(tabulate(df[0], headers='keys', tablefmt='psql'))

<body>Huge oil platforms dot the Gulf like
beacons -- usually lit up like Christmas trees at night.
    One of them, sitting astride the Rostam offshore oilfield,
was all but blown out of the water by U.S. Warships on Monday.
    The Iranian platform, an unsightly mass of steel and
concrete, was a three-tier structure rising 200 feet (60
metres) above the warm waters of the Gulf until four U.S.
Destroyers pumped some 1,000 shells into it.
    The U.S. Defense Department said just 10 pct of one section
of the structure remained.
    U.S. helicopters destroyed three Iranian gunboats after an
American helicopter came under fire earlier this month and U.S.
forces attacked, seized, and sank an Iranian ship they said had
been caught laying mines.
    But Iran was not deterred, according to U.S. defense
officials, who said Iranian forces used Chinese-made Silkworm
missiles to hit a U.S.-owned Liberian-flagged ship on Thursday
and the Sea Isle City on Friday.
    Both ships were hit in the ter

In [15]:
corpora = []
for text in data:
    # Parse text as html using beautiful soup
    parsed_text = BeautifulSoup(text, 'html.parser')
    
    # Extract the text between <BODY> and </BODY> of every article and add it to the list of articles.
    corpora += [body.get_text() for body in parsed_text.find_all('body')]
# Print the first 300 characters of the first article as an example
print(corpora[0][:300])

Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will b


## Normalization

After removing the tags, some items still need further normalization.
The tasks I can think of for now are: lowercasing capital letters, removing punctuation and digits, and reducing words to their lemmas or stems.

In [16]:
# Dictionary where punctuation is mapped to none.
no_punc = str.maketrans('', '', string.punctuation) 

# Remove punctuation from corpora.
corpora = [corpus.translate(no_punc) for corpus in corpora]

In [17]:
# Lowercase all capital letters.
corpora = [corpus.lower() for corpus in corpora]

In [18]:
# Remove digits from corpora.
corpora = [re.sub(r'\d+', '', corpus) for corpus in corpora]
 
# Combine the English stopwords with additional tokens identified in this corpus, then remove them.
# Named stop_words so that it does not shadow the nltk.corpus.stopwords module imported above.
stop_words = set(nltk.corpus.stopwords.words('english') + ['reuter', '\x03', '``','’', '`','br','"',"”", "''", "'s", "\\n"])
corpora = [[word for word in corpus.split() if word not in stop_words] for corpus in corpora]

In [19]:
corpora[0][:10]

['showers',
 'continued',
 'throughout',
 'week',
 'bahia',
 'cocoa',
 'zone',
 'alleviating',
 'drought',
 'since']
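
One side effect of deleting punctuation outright is that words separated only by a slash or a hyphen get fused, which is where tokens such as `junejuli` and `augsept` in the stemmed output further down come from. The sketch below, on a made-up sentence, compares deletion with a hedged alternative that maps punctuation to spaces instead. The variable names are hypothetical, and the alternative has its own cost: abbreviations such as `U.S.` fall apart into single letters.

```python
import string

# Made-up sentence with a slash, a hyphen, and an abbreviation.
sample = "June/July shipments rose, U.S.-owned ships said."

# Deleting punctuation, as in the cells above, glues neighbouring words together.
delete_table = str.maketrans('', '', string.punctuation)
deleted = sample.translate(delete_table).lower().split()

# Mapping each punctuation character to a space keeps the words apart.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = sample.translate(space_table).lower().split()

print(deleted)
# ['junejuly', 'shipments', 'rose', 'usowned', 'ships', 'said']
print(spaced)
# ['june', 'july', 'shipments', 'rose', 'u', 's', 'owned', 'ships', 'said']
```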

In [15]:
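wnl = WordNetLemmatizer()  # created here so this cell runs on its own
# lemmatize() assumes pos='n' (noun) by default, so the verb form 'continued' comes back unchanged.
example = wnl.lemmatize('continued')
example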

'continued'

In [14]:
# Replace each word in the corpora with its lemma
wnl = WordNetLemmatizer()
corpora_lem = [" ".join([wnl.lemmatize(word) for word in corpus]) for corpus in corpora]
# print the first article as a running example
print(corpora_lem[0])

shower continued throughout week bahia cocoa zone alleviating drought since early january improving prospect coming temporao although normal humidity level restored comissaria smith said weekly review dry period mean temporao late year arrival week ended february bag kilo making cumulative total season mln stage last year seems cocoa delivered earlier consignment included arrival figure comissaria smith said still doubt much old crop cocoa still available harvesting practically come end total bahia crop estimate around mln bag sale standing almost mln hundred thousand bag still hand farmer middleman exporter processor doubt much cocoa would fit export shipper experiencing dificulties obtaining bahia superior certificate view lower quality recent week farmer sold good part cocoa held consignment comissaria smith said spot bean price rose cruzados per arroba kilo bean shipper reluctant offer nearby shipment limited sale booked march shipment dlrs per tonne port named new crop sale also l

In [58]:
# Replace each word in the corpora with its stem
ps = PorterStemmer()
corpora_stem = [" ".join([ps.stem(word) for word in corpus]) for corpus in corpora]
# print the first article as a running example
print(corpora_stem[0])

shower continu throughout week bahia cocoa zone allevi drought sinc earli januari improv prospect come temporao although normal humid level restor comissaria smith said weekli review dri period mean temporao late year arriv week end februari bag kilo make cumul total season mln stage last year seem cocoa deliv earlier consign includ arriv figur comissaria smith said still doubt much old crop cocoa still avail harvest practic come end total bahia crop estim around mln bag sale stand almost mln hundr thousand bag still hand farmer middlemen export processor doubt much cocoa would fit export shipper experienc dificulti obtain bahia superior certif view lower qualiti recent week farmer sold good part cocoa held consign comissaria smith said spot bean price rose cruzado per arroba kilo bean shipper reluct offer nearbi shipment limit sale book march shipment dlr per tonn port name new crop sale also light open port junejuli go dlr dlr new york juli augsept dlr per tonn fob routin sale butter