
# Text Preprocessing & Web Scraping

## Sample Data Import

In [0]:
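import nltk
# Text file containing more than 200 words (assumed to be UTF-8 encoded).
# Stating the encoding avoids garbled characters such as curly apostrophes.
with open('test_text.txt', encoding='utf-8') as file_handle:
    text_data = file_handle.read()
text_data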

'Hackers are illegally generating Monero, Bitcoin and other cryptocurrencies by exploiting a software flaw that was leaked from the U.S. government, according to new research, raising questions about the security of one of the fastest-growing corners of financial markets.\nDetected cases of illicit cryptocurrency mining -- the digital equivalent of minting money -- have surged 459 percent in 2018 compared to last year, Cyber Threat Alliance said in a report released Wednesday.\nThe spike is tied to the 2017 leak of Eternal Blue, a tool to exploit vulnerabilities in outdated Microsoft Systems software. When the tool became known, it tipped hackers to a previously unknown flaw in the software, now the basis of some hackers’ efforts to commandeer computing power of others to generate digital currency.\nAs of July this year, 85 percent of all illicit cryptocurrency mining has targeted Monero, according to the report. Bitcoin made up about 8 percent, while other cryptocurrencies accounted

## Tokenization

In [0]:
tokens = nltk.word_tokenize(text_data)
print(tokens)

['Hackers', 'are', 'illegally', 'generating', 'Monero', ',', 'Bitcoin', 'and', 'other', 'cryptocurrencies', 'by', 'exploiting', 'a', 'software', 'flaw', 'that', 'was', 'leaked', 'from', 'the', 'U.S.', 'government', ',', 'according', 'to', 'new', 'research', ',', 'raising', 'questions', 'about', 'the', 'security', 'of', 'one', 'of', 'the', 'fastest-growing', 'corners', 'of', 'financial', 'markets', '.', 'Detected', 'cases', 'of', 'illicit', 'cryptocurrency', 'mining', '--', 'the', 'digital', 'equivalent', 'of', 'minting', 'money', '--', 'have', 'surged', '459', 'percent', 'in', '2018', 'compared', 'to', 'last', 'year', ',', 'Cyber', 'Threat', 'Alliance', 'said', 'in', 'a', 'report', 'released', 'Wednesday', '.', 'The', 'spike', 'is', 'tied', 'to', 'the', '2017', 'leak', 'of', 'Eternal', 'Blue', ',', 'a', 'tool', 'to', 'exploit', 'vulnerabilities', 'in', 'outdated', 'Microsoft', 'Systems', 'software', '.', 'When', 'the', 'tool', 'became', 'known', ',', 'it', 'tipped', 'hackers', 'to', 'a

### Oddities and how they might be handled
* Tokens such as ',', '.', '--', and '``' are punctuation that is not needed for the analysis; they are removed with a filter-lambda expression.
* Words such as "a", "an", "the", and "are" carry little meaning on their own; they can be removed using the NLTK stopwords corpus.
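
The two clean-up steps listed above can be sketched without any corpus download. This is a minimal illustration, not the notebook's actual pipeline: `SAMPLE_STOP_WORDS` is a tiny hand-written stand-in for the NLTK stopwords corpus, and `clean_tokens` is a hypothetical helper name.

```python
# Punctuation tokens as produced by a Treebank-style tokenizer.
PUNCTUATION = {",", ".", "--", "``", "''"}

# Hypothetical, hand-written stand-in for nltk.corpus.stopwords.words('english').
SAMPLE_STOP_WORDS = {"a", "an", "the", "are", "and", "by", "that", "was", "from"}


def clean_tokens(tokens):
    """Drop punctuation tokens, lowercase the rest, then drop stop words."""
    words = [tok.lower() for tok in tokens if tok not in PUNCTUATION]
    return [word for word in words if word not in SAMPLE_STOP_WORDS]


sample = ["Hackers", "are", "illegally", "generating", "Monero", ",",
          "Bitcoin", "and", "other", "cryptocurrencies", "."]
print(clean_tokens(sample))
# ['hackers', 'illegally', 'generating', 'monero', 'bitcoin', 'other', 'cryptocurrencies']
```

Lowercasing before the stop-word check means that capitalized forms such as "The" at the start of a sentence are removed as well.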


## Punctuation and Stop Words Removal

In [0]:
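from nltk.corpus import stopwords
# Exact-token matching via a set (an `in` test on a string would match any substring).
punctuation = {",", ".", "-", "--", "'", "''", "`", "``"}
token_punc_removed = [word for word in tokens if word not in punctuation]
# Build the stop-word set once. The check runs before lowercasing, so capitalized stop words such as 'The' survive.
english_stopwords = set(stopwords.words('english'))
token_stopwords_removed = [word for word in token_punc_removed if word not in english_stopwords]
norm_tokens = [word.lower() for word in token_stopwords_removed]
print(norm_tokens)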

['hackers', 'illegally', 'generating', 'monero', 'bitcoin', 'cryptocurrencies', 'exploiting', 'software', 'flaw', 'leaked', 'u.s.', 'government', 'according', 'new', 'research', 'raising', 'questions', 'security', 'one', 'fastest-growing', 'corners', 'financial', 'markets', 'detected', 'cases', 'illicit', 'cryptocurrency', 'mining', 'digital', 'equivalent', 'minting', 'money', 'surged', '459', 'percent', '2018', 'compared', 'last', 'year', 'cyber', 'threat', 'alliance', 'said', 'report', 'released', 'wednesday', 'the', 'spike', 'tied', '2017', 'leak', 'eternal', 'blue', 'tool', 'exploit', 'vulnerabilities', 'outdated', 'microsoft', 'systems', 'software', 'when', 'tool', 'became', 'known', 'tipped', 'hackers', 'previously', 'unknown', 'flaw', 'software', 'basis', 'hackers’', 'efforts', 'commandeer', 'computing', 'power', 'others', 'generate', 'digital', 'currency', 'as', 'july', 'year', '85', 'percent', 'illicit', 'cryptocurrency', 'mining', 'targeted', 'monero', 'according', 'report'

## Part of Speech Tagging 

In [0]:
text_with_pos = nltk.pos_tag(norm_tokens)
print(text_with_pos)

[('hackers', 'NNS'), ('illegally', 'RB'), ('generating', 'VBG'), ('monero', 'NN'), ('bitcoin', 'NN'), ('cryptocurrencies', 'NNS'), ('exploiting', 'VBG'), ('software', 'NN'), ('flaw', 'NN'), ('leaked', 'VBD'), ('u.s.', 'JJ'), ('government', 'NN'), ('according', 'VBG'), ('new', 'JJ'), ('research', 'NN'), ('raising', 'VBG'), ('questions', 'NNS'), ('security', 'NN'), ('one', 'CD'), ('fastest-growing', 'NN'), ('corners', 'NNS'), ('financial', 'JJ'), ('markets', 'NNS'), ('detected', 'VBD'), ('cases', 'NNS'), ('illicit', 'JJ'), ('cryptocurrency', 'NN'), ('mining', 'NN'), ('digital', 'JJ'), ('equivalent', 'JJ'), ('minting', 'VBG'), ('money', 'NN'), ('surged', 'VBD'), ('459', 'CD'), ('percent', 'NN'), ('2018', 'CD'), ('compared', 'VBN'), ('last', 'JJ'), ('year', 'NN'), ('cyber', 'VBD'), ('threat', 'NN'), ('alliance', 'NN'), ('said', 'VBD'), ('report', 'NN'), ('released', 'VBN'), ('wednesday', 'IN'), ('the', 'DT'), ('spike', 'NN'), ('tied', 'VBD'), ('2017', 'CD'), ('leak', 'JJ'), ('eternal', 'JJ

# Stemming with Porter Stemmer 

In [0]:
from nltk.stem import PorterStemmer
# Text file containing more than 200 words (assumed to be UTF-8 encoded).
with open('test_text_2.txt', encoding='utf-8') as file_handle:
    text_data = file_handle.read()
text_data

'A Met Eireann forecaster said: "Cloudy this morning with outbreaks of rain. Brighter this afternoon and evening as rain clears. Cool. Highest temperatures 12 to 14 degrees.\n\n"Dry tonight with good clear spells. Cool with light winds. Minimum temperatures 4 to 6 degrees.\n\n"Fine and dry tomorrow. Good sunny spells. Light northwest breezes. High of 12 to 14 degrees."\n\nThese low temperatures are expected to last into next week although it will be very dry.\n\nThe forecaster added: "Other than isolated showers near north coasts, Sunday night will be dry but noticeably cool.\n\n"Both Monday and Tuesday will be fine and dry days with varying cloud and sunny spells. Daytime temperatures however will be on the cool side- typically in the low to mid - teens.\n\n"Even though winds will be light and variable, night will be distinctly chilly with grass frosts in places.\n\n"It will get somewhat milder for Wednesday and Thursday as southwest breezes pick up. Most of the south and east of the 

In [0]:
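tokens = nltk.word_tokenize(text_data)
norm_tokens = [word.lower() for word in tokens]
# Exact-token matching via a set (an `in` test on a string would match any substring).
punctuation = {",", ".", ":", "-", "--", "'", "''", "`", "``"}
token_punc_removed = [word for word in norm_tokens if word not in punctuation]
# Build the stop-word set once instead of once per token.
english_stopwords = set(stopwords.words('english'))
token_stopwords_removed = [word for word in token_punc_removed if word not in english_stopwords]
stemmer = PorterStemmer()
after_stemming = [stemmer.stem(token) for token in token_stopwords_removed]
print(after_stemming)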

['met', 'eireann', 'forecast', 'said', 'cloudi', 'morn', 'outbreak', 'rain', 'brighter', 'afternoon', 'even', 'rain', 'clear', 'cool', 'highest', 'temperatur', '12', '14', 'degre', 'dri', 'tonight', 'good', 'clear', 'spell', 'cool', 'light', 'wind', 'minimum', 'temperatur', '4', '6', 'degre', 'fine', 'dri', 'tomorrow', 'good', 'sunni', 'spell', 'light', 'northwest', 'breez', 'high', '12', '14', 'degre', 'low', 'temperatur', 'expect', 'last', 'next', 'week', 'although', 'dri', 'forecast', 'ad', 'isol', 'shower', 'near', 'north', 'coast', 'sunday', 'night', 'dri', 'notic', 'cool', 'monday', 'tuesday', 'fine', 'dri', 'day', 'vari', 'cloud', 'sunni', 'spell', 'daytim', 'temperatur', 'howev', 'cool', 'side-', 'typic', 'low', 'mid', 'teen', 'even', 'though', 'wind', 'light', 'variabl', 'night', 'distinctli', 'chilli', 'grass', 'frost', 'place', 'get', 'somewhat', 'milder', 'wednesday', 'thursday', 'southwest', 'breez', 'pick', 'south', 'east', 'countri', 'stay', 'dri', 'light', 'wind', 'outb

#### Porter stemming removed suffixes from words such as:

* "er" from "forecaster", giving "forecast"
* "ing" from "morning", giving "morn", which changes the meaning
* "ing" from "evening", giving "even", which also changes the meaning
* The stemmer does not consider the context of a word when stemming it.
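
To see why context-free suffix stripping can change meaning, consider the toy sketch below. It is not the Porter algorithm, which applies several rule phases with conditions on the remaining stem. The function `strip_suffix` and its two-suffix rule list are hypothetical and only mimic the "ing" and "er" cases listed above.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "er")  # hypothetical rule list covering the cases above


def strip_suffix(word, min_stem_length=3):
    """Remove the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for word in ["forecaster", "morning", "evening", "rain"]:
    print(word, "->", strip_suffix(word))
# forecaster -> forecast
# morning -> morn
# evening -> even
# rain -> rain
```

Because the rule looks only at the characters of the word, "evening" (the time of day) and "even" end up with the same stem regardless of how each is used in the sentence.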

# Lemmatization with WordNetLemmatizer

In [0]:
print("\n================== After Porter Stemming ===================\n")
print(" ".join(after_stemming))
pos_tagged = nltk.pos_tag(token_stopwords_removed)

def convert_tags(tag):
    """Map a lowercased Penn Treebank tag to a WordNet POS code ('v' or 'n')."""
    # Only these three verb tags are treated as verbs; everything else defaults to noun.
    if tag in ('vbd', 'vbg', 'vbz'):
        return 'v'
    return 'n'

wnl = nltk.WordNetLemmatizer()
print("\n================== After Lemmatization ===================\n")
# Lemmatize each token using the POS code derived from its tag.
after_lemmatization = [wnl.lemmatize(word, convert_tags(tag.lower())) for word, tag in pos_tagged]

print(" ".join(after_lemmatization))



met eireann forecast said cloudi morn outbreak rain brighter afternoon even rain clear cool highest temperatur 12 14 degre dri tonight good clear spell cool light wind minimum temperatur 4 6 degre fine dri tomorrow good sunni spell light northwest breez high 12 14 degre low temperatur expect last next week although dri forecast ad isol shower near north coast sunday night dri notic cool monday tuesday fine dri day vari cloud sunni spell daytim temperatur howev cool side- typic low mid teen even though wind light variabl night distinctli chilli grass frost place get somewhat milder wednesday thursday southwest breez pick south east countri stay dri light wind outbreak rain affect northwestern counti time latter part week continu essenti dri 'll milder wind stay light littl rain may recur near north coast


met eireann forecaster say cloudy morning outbreak rain brighter afternoon even rain clear cool highest temperature 12 14 degree dry tonight good clear spell cool light wind minimum

## Observation
Comparing the Porter stemming and lemmatization results, lemmatization appears to work better here: it uses each word's part-of-speech tag as context and returns whole dictionary words, which keeps the meaning largely intact. Stemming, on the other hand, distorts the meaning of words more often.
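
The tag conversion in the lemmatization cell maps only three verb tags and treats everything else as a noun. A slightly broader mapping, sketched below, uses the prefix of the Penn Treebank tag to choose among the four WordNet part-of-speech codes (`'n'`, `'v'`, `'a'`, `'r'`). The function name `penn_to_wordnet` is hypothetical, and the results for this text were not re-run with it.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    tag = tag.upper()
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    return "n"


for tag in ["VBD", "VBN", "JJ", "RB", "NNS", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned code could be passed as the second argument to `wnl.lemmatize` in place of the result of `convert_tags`.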

# Web Scraping with BeautifulSoup - IMDb top 50 list 2018

*Accessed the IMDb site to get the list of the top 50 movies of 2018 by number of votes.*

Used BeautifulSoup together with the browser's Inspect Element tool to find the class name of the div that holds the movie title in each movie description snippet.

In [0]:
import urllib.request
from bs4 import BeautifulSoup as bsup

url = "https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1"
html = urllib.request.urlopen(url).read()

soup = bsup(html, 'html.parser')
# Each movie snippet is a div with this class (found via Inspect Element).
movies_divs = soup.find_all('div', class_='lister-item mode-advanced')
print(soup.title)
# The title is the text of the link inside each snippet's <h3> heading.
top_50_movie_list = [item.h3.a.text for item in movies_divs]
print(top_50_movie_list)

<title>IMDb: Most Voted Titles Released 2018-01-01 to 2018-12-31 - IMDb</title>
['Avengers: Infinity War', 'Black Panther', 'Deadpool 2', 'Ready Player One', 'A Quiet Place', 'Annihilation', 'Jurassic World: Fallen Kingdom', 'Solo: A Star Wars Story', 'Mission: Impossible - Fallout', 'Tomb Raider', 'Game Night', 'Red Sparrow', 'Incredibles 2', 'Ant-Man and the Wasp', 'Altered Carbon', 'Hereditary', 'Rampage', "Ocean's Eight", 'Maze Runner: The Death Cure', 'Isle of Dogs', 'The Cloverfield Paradox', 'Pacific Rim: Uprising', 'The Commuter', 'Love, Simon', 'Upgrade', 'Den of Thieves', 'Enes Batur Hayal mi Gerçek mi?', 'Tag', 'The Meg', 'Lost in Space', 'Sicario: Day of the Soldado', 'Sacred Games', 'Blockers', 'Skyscraper', '12 Strong', 'Death Wish', 'The Nun', "To All the Boys I've Loved Before", 'Fifty Shades Freed', 'Insidious: The Last Key', 'O Mecanismo', 'Mamma Mia! Here We Go Again', 'BlacKkKlansman', 'The Kissing Booth', 'Sanju', 'When We First Met', 'Extinction', 'I Feel Pretty',

$ python bs_impl.py<br>
IMDb: Most Voted Titles Released 2018-01-01 to 2018-12-31 - IMDb<br>
['Avengers: Infinity War', 'Black Panther', 'Deadpool 2', 'Ready Player One', 'A Quiet Place', 'Annihilation', 'Jurassic World: Fallen Kingdom', 'Solo: A Star Wars Story', 'Mission: Impossible - Fallout', 'Tomb Raider', 'Game Night', 'Red Sparrow', 'Incredibles 2', 'Ant-Man and the Wasp', 'Altered Carbon', 'Rampage', 'Hereditary', "Ocean's Eight", 'Maze Runner: The Death Cure', 'The Cloverfield Paradox', 'Isle of Dogs', 'Pacific Rim: Uprising', 'The Commuter', 'Love, Simon', 'Upgrade', 'Den of Thieves', 'Enes Batur Hayal mi Gerçek mi?', 'Tag', 'Lost in Space', 'The Meg', 'Sacred Games', 'Blockers', 'Sicario: Day of the Soldado', '12 Strong', 'Skyscraper', 'Death Wish', "To All the Boys I've Loved Before", 'Fifty Shades Freed', 'The Nun', 'Insidious: The Last Key', 'O Mecanismo', 'Mamma Mia! Here We Go Again', 'BlacKkKlansman', 'The Kissing Booth', 'Sanju', 'When We First Met', 'Extinction', 'I Feel Pretty', 'How It Ends', 'The Equalizer 2']
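
The search URL and the request can also be assembled with `urllib.parse` and `urllib.request` from the standard library. This is a sketch under assumptions: the `User-Agent` string is a made-up example, and the source does not state whether IMDb requires one (some sites reject the default urllib agent). The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://www.imdb.com/search/title"

# Same query as the hard-coded URL above; safe="," keeps the comma unescaped.
params = {"release_date": 2018, "sort": "num_votes,desc", "page": 1}
url = BASE_URL + "?" + urllib.parse.urlencode(params, safe=",")

# Hypothetical User-Agent value, for illustration only.
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (example)"})

print(request.full_url)
# https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1
```

Passing `request` to `urllib.request.urlopen` would then fetch the page in the same way as the scraping cell does with the plain URL string.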