# MidTerm Summary 1-5
## Scrape data from website
Scrape the authors and quotes from this website: http://quotes.toscrape.com/ (all 10 pages).
Combine them in a DataFrame and export that DataFrame to Excel.
- (Then try the same thing with dictionaries.)

In [1]:
import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# Collect the quote texts and author names from all 10 pages.
quotes = []
authors = []
for pageNr in range(1, 11):
    response = requests.get(f"http://quotes.toscrape.com/page/{pageNr}/")
    responseText = bs(response.text, "html.parser")
    quotes_tags = responseText.find_all('span', class_='text')
    authors_tags = responseText.find_all('small', class_='author')
    for quote_tag in quotes_tags:
        quotes.append(quote_tag.get_text())
    for authors_tag in authors_tags:
        authors.append(authors_tag.get_text())

quotesDF = pd.DataFrame(data=quotes, columns=["Quotes"])
authorsDF = pd.DataFrame(data=authors, columns=["Authors"])
combinedDF = pd.concat([quotesDF, authorsDF], axis=1)
# combinedDF.to_excel("combined_authors_quotes.xlsx")

In [2]:
# Try the same thing with dictionaries
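
A possible sketch of the dictionary version, not a definitive solution: build one dictionary per quote so that each quote stays paired with its author. It assumes every quote on the site sits in a `div` with class `quote` that contains both the `span.text` and the `small.author` tag. The names `scrape_quotes` and `group_by_author` are made up for this example.

```python
def scrape_quotes(base_url="http://quotes.toscrape.com", pages=range(1, 11)):
    """Return one {'Quote': ..., 'Author': ...} dictionary per quote."""
    # Imported here so the pure helper below also works without these packages.
    import requests
    from bs4 import BeautifulSoup

    records = []
    for page_nr in pages:
        response = requests.get(f"{base_url}/page/{page_nr}/")
        soup = BeautifulSoup(response.text, "html.parser")
        # Assumption: each quote lives in <div class="quote"> ... </div>.
        for block in soup.find_all("div", class_="quote"):
            records.append({
                "Quote": block.find("span", class_="text").get_text(),
                "Author": block.find("small", class_="author").get_text(),
            })
    return records


def group_by_author(records):
    """Turn the list of records into {author: [quote, quote, ...]}."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["Author"], []).append(record["Quote"])
    return grouped
```

A list of dictionaries can be passed straight to pandas, for example `pd.DataFrame(scrape_quotes())`, which gives the columns `Quote` and `Author`.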

## Tokenization
Get the data from "https://www.gutenberg.org/cache/epub/8001/pg8001.html", then:
- remove newlines and \r, and everything that is not a word
- convert the words to lowercase
- tokenize the text into words
- find the 10 most frequent words

In [3]:
response = requests.get("https://www.gutenberg.org/cache/epub/8001/pg8001.html")
responseText = bs(response.text, "html.parser")
paragraphs = responseText.find_all("p")
paragraphs_not_all = []
for paragraph in paragraphs[0:50]:
    paragraphs_not_all.append(" ".join(w.lower() for w in paragraph.get_text().split()))

import re

words = []
for p in paragraphs_not_all:
    for w in re.sub(r'[^A-Za-z]+', ' ', p).split():
        words.append(w)

from collections import Counter

counter = Counter(words)
# counter.most_common(10)
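
The steps listed for this exercise (clean, lowercase, tokenize, count) can also be sketched as one small function. This is only an illustration in plain Python: `top_words` and its optional `stop_words` parameter are hypothetical names, and the sample sentence is made up.

```python
import re
from collections import Counter


def top_words(text, n=10, stop_words=frozenset()):
    """Return the n most frequent words of text as (word, count) pairs."""
    # Lowercase first, then replace every run of non-letters
    # (newlines, \r, digits, punctuation) with a single space.
    cleaned = re.sub(r"[^a-z]+", " ", text.lower())
    # Splitting on whitespace is the simplest possible tokenizer.
    tokens = [word for word in cleaned.split() if word not in stop_words]
    return Counter(tokens).most_common(n)


sample = "The cat sat.\r\nThe cat ran; the dog sat!"
print(top_words(sample, n=1))
print(top_words(sample, n=2, stop_words={"the"}))
```

Passing a set of stopwords filters them out before counting, so the same function covers both the plain and the stopword-free count.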

Get the Alice in Wonderland text from the Gutenberg corpus, then:
- remove newlines
- remove everything that is not a word
- lowercase
- find the 10 most frequent words
- remove the stopwords and find the 10 most frequent words again

In [4]:
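from nltk.corpus import gutenberg, stopwords

nltk.download('gutenberg')
nltk.download('stopwords')  # needed for stopwords.words()
alice_raw = gutenberg.raw(fileids='carroll-alice.txt')
# Replace everything that is not a letter (newlines included) with a space, then lowercase.
alice_words = re.sub(r'[^A-Za-z]+', ' ', alice_raw).lower().split()
counter_all = Counter(alice_words)
# counter_all.most_common(10)

# Lowercase before the stopword check, otherwise "The" is not filtered out.
stop_words = set(stopwords.words('english'))
alice_filtered = [word for word in alice_words if word not in stop_words]
counter = Counter(alice_filtered)
# counter.most_common(10)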

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\sonja\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


## POS-tagging
Download the NLTK tagsets, POS-tag sentence1, then POS-tag sentence2
and chunk it with these patterns:
- `NP: {<DT>?<JJ>*<NN>}`
- `VBD: {<VBD>}`
- `IN: {<IN>}`
- `NP: {<NNP>+}`

Then show the result as a tree.

In [5]:
nltk.download("tagsets")  # the package is called "tagsets", not "tagset"
# Model used by nltk.pos_tag; newer NLTK releases may ask for "averaged_perceptron_tagger_eng" instead.
nltk.download("averaged_perceptron_tagger")
sentence1 = "They refuse to permit us to obtain the refuse permit"
sentence2 = "The little yellow dog barked at the angry cat that belongs to Heidi Choi."

sentence1_pos = pd.DataFrame(nltk.pos_tag(sentence1.split()))
sentence2_pos = nltk.pos_tag(sentence2.split())
# sentence1_pos.T

pattern = r"""
NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}
NP: {<NNP>+}
"""

NPchunker = nltk.RegexpParser(pattern)  # build the chunker from the grammar above
result = NPchunker.parse(sentence2_pos)  # returns an nltk.Tree with the chunks
# result.draw()  # opens a window that shows the tree
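
The chunk patterns are regular expressions over POS tags instead of characters. A rough sketch in plain Python (this is `re` on a string of tags, not NLTK's `RegexpParser`) shows which words the two NP rules would pick up. The tags are assigned by hand and are an assumption; `nltk.pos_tag` may tag some words differently. `find_chunks` is a made-up helper name.

```python
import re

# Hand-assigned Penn Treebank tags (assumption, not the output of nltk.pos_tag).
tagged = [
    ("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("angry", "JJ"),
    ("cat", "NN"), ("that", "WDT"), ("belongs", "VBZ"), ("to", "TO"),
    ("Heidi", "NNP"), ("Choi", "NNP"),
]


def find_chunks(tagged_tokens, tag_regex):
    """Return the word groups whose tag sequence matches tag_regex."""
    tag_string = ""
    starts, ends = {}, {}  # character offset in tag_string -> token index
    for i, (_, tag) in enumerate(tagged_tokens):
        starts[len(tag_string)] = i
        tag_string += f"<{tag}>"
        ends[len(tag_string)] = i
    chunks = []
    for match in re.finditer(tag_regex, tag_string):
        first, last = starts[match.start()], ends[match.end()]
        chunks.append(" ".join(word for word, _ in tagged_tokens[first:last + 1]))
    return chunks


# NP: {<DT>?<JJ>*<NN>} and NP: {<NNP>+} written as ordinary regexes.
print(find_chunks(tagged, r"(<DT>)?(<JJ>)*<NN>"))
print(find_chunks(tagged, r"(<NNP>)+"))
```

Because `<NN>` includes the closing bracket, it does not match `<NNP>`, which is why the proper nouns need their own rule.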


Get the Hamlet text from the Gutenberg corpus, POS-tag it, and use the tags to extract all the nouns.

In [6]:
from nltk.corpus import gutenberg

gutenberg.fileids()
hamlet = gutenberg.raw(fileids='shakespeare-hamlet.txt')
hamlet_pos = nltk.pos_tag(hamlet.split())
# All noun tags start with "NN": NN, NNS, NNP and NNPS.
nouns = [word for word, pos in hamlet_pos if pos.startswith('NN')]

# print(nouns)

## Stemming

In [7]:
words = ['plays', 'playing', 'played', 'player', 'pharmacies', 'badly']

These are the given words. Use the three stemmers (Porter, Snowball, Lancaster) to get the stem of each word.
Then lemmatize the words.

In [8]:
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
print([porter_stemmer.stem(word) for word in words])

['play', 'play', 'play', 'player', 'pharmaci', 'badli']


In [9]:
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english')
print([snowball_stemmer.stem(word) for word in words])

['play', 'play', 'play', 'player', 'pharmaci', 'bad']


In [10]:
from nltk.stem import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print([lancaster_stemmer.stem(word) for word in words])

['play', 'play', 'play', 'play', 'pharm', 'bad']


In [11]:
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# print([lemmatizer.lemmatize(w) for w in words])

words_tagged = nltk.pos_tag(words)
# for word, pos in words_tagged:
#     print(lemmatizer.lemmatize(word=word, pos=pos))
# This does not work because nltk.pos_tag returns Penn Treebank tags (e.g. 'VBZ', 'NNS'),
# while WordNetLemmatizer only accepts the WordNet tags 'n', 'v', 'a', 'r' and 's'.
# The Penn tags have to be mapped to WordNet tags first.

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sonja\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sonja\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
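
One way to make the tagged lemmatization work is to translate the Penn Treebank tags into WordNet tags before calling the lemmatizer. A minimal sketch, assuming that unknown tags should fall back to noun (which is also the lemmatizer's default); `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a' or 'r')."""
    if tag.startswith("J"):  # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):  # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("R"):  # RB, RBR, RBS -> adverb
        return "r"
    return "n"  # nouns and everything else


def lemmatize_tagged(tagged_words):
    """Lemmatize (word, penn_tag) pairs, e.g. the output of nltk.pos_tag."""
    # Imported here so the mapping above can be used without NLTK installed.
    from nltk.stem import WordNetLemmatizer  # needs the 'wordnet' data

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_words
    ]
```

With the cell above this would be called as `lemmatize_tagged(words_tagged)`.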


## Googlenews
- get the top news
- get the top business articles
- get everything about New York in 2022 (%22New+York%22) and print the values for each key and the title of every entry
- put the title and link of every story from the search results for "weather" in a DataFrame
- remove a column in the DataFrame
- rename a column in the DataFrame
- reorder the DataFrame

In [12]:
from pygooglenews import GoogleNews

gn = GoogleNews()
top_news = gn.top_news()
business = gn.topic_headlines('business')
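# With the default helper=True the query is URL-encoded by pygooglenews (-> %22New+York%22);
# from_ and to_ limit the search to 2022.
new_york = gn.search('"New York"', from_='2022-01-01', to_='2022-12-31')
# new_york.keys() # -> dict_keys(['feed', 'entries'])

# for entry in new_york['entries']:
#     print(entry.title) # or print(entry['title'])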

In [13]:
weather = gn.search('weather')
weather_stories = []
for story in weather['entries']:
    weather_stories.append({'title': story['title'], 'link': story['link']})

dataframe = pd.DataFrame(weather_stories)
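# Titles look like "Headline - Source"; split on the last " - " only,
# because the headline itself may contain hyphens.
title_parts = dataframe['title'].str.rsplit(' - ', n=1, expand=True)
dataframe['Title'] = title_parts[0]
dataframe['Source'] = title_parts[1]
dataframe.drop(['title'], axis=1, inplace=True)  # remove a column
dataframe.rename(columns={'link': "Link"}, inplace=True)  # rename a column
dataframe = dataframe[["Title", "Source", "Link"]]  # reorder the columns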
# dataframe
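
The story titles seem to follow the shape "Headline - Source", and a headline can contain hyphens of its own, so it is safer to split on the last " - " only. A small sketch in plain Python of that idea; `split_title` is a made-up helper and the titles are invented examples, not real search results.

```python
def split_title(title, separator=" - "):
    """Split 'Headline - Source' on the last separator only."""
    headline, found, source = title.rpartition(separator)
    if not found:  # no separator at all: keep the whole title as headline
        return title, None
    return headline, source


# Hypothetical titles, made up for illustration.
print(split_title("Heat wave - what to expect this week - Example Times"))
print(split_title("Weather"))
```

In pandas the same idea is `Series.str.rsplit(' - ', n=1, expand=True)`.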