# Week 2. Web Crawling 

Hello everyone, we will be using Jupyter Notebook in every class, so let's make sure we know how to use it. <br>
If you run into any problem with Jupyter Notebook, ask me, or you can google it :) <br>

Let's import the requests library and download the HTML of the page. <br>
We will save the server's response in a variable named <code>request</code>.

## HTTP request

In [733]:
import requests

request = requests.get("http://quotes.toscrape.com")

Did it go well? <br>
Let's check it out!

In [734]:
request

<Response [200]>

If it shows <code>Response [200]</code>, the server answered with status code 200 (OK) and you are on the right track. <br>
Let's check what we downloaded.

In [735]:
#request.text
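The status check above can also be done in code. Here is a minimal sketch, assuming a helper named `fetch_html` (a made-up name, not part of the original notebook) and a 10-second timeout chosen only for illustration. `raise_for_status()` turns a 4xx or 5xx answer into an exception, so a failed download does not go unnoticed.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising if the request failed."""
    # timeout stops the call from waiting forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx codes
    response.raise_for_status()
    return response.text


# Example call (needs an internet connection):
# html = fetch_html("http://quotes.toscrape.com")
# print(html[:200])
```

If the download works, `fetch_html` returns the same string as `request.text` in the cell above.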

## HTML Parsing

We will be using BeautifulSoup. <br>
Let's import the BeautifulSoup library. <br>
Then, we will get a parsed version of our data using Python's built-in HTML parser.

In [736]:
from bs4 import BeautifulSoup as bs

Let's parse the downloaded HTML and store the result in a variable named <code>text</code>, so we can compare it with the raw version above.

In [737]:
text = bs(request.text, 'html.parser')

Let's navigate through the tree.

In [738]:
#text.head

In [739]:
#text.body

In [740]:
#text.footer
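To see what this kind of navigation returns without downloading anything, here is a small sketch on a made-up HTML string. The `sample_html` content and all variable names are assumptions for illustration; the real page is much longer.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page that stands in for the real download
sample_html = """
<html>
  <head><title>Quotes to Scrape</title></head>
  <body>
    <div class="quote"><span class="text" itemprop="text">“First quote.”</span></div>
    <footer><p>Made by someone</p></footer>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Writing a tag name after the dot returns the first tag with that name
title_text = soup.head.title.get_text()
first_span = soup.body.find("span")
footer_text = soup.footer.get_text(strip=True)

print(title_text)
print(first_span)
print(footer_text)
```

Each attribute such as `soup.head` or `soup.footer` gives back only the first matching tag, which is why it works well for parts of the page that appear once.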

Let's make our soup look better using prettify.

In [741]:
#print(text.prettify())

## Storing target information

We will try to find quotes this time. <br>
Each quote sits in a <code>span</code> tag with the attribute <code>itemprop="text"</code>, so let's search for that.

In [742]:
text.find_all(itemprop="text")

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

In [743]:
text.find_all('span', itemprop="text")

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.
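Both cells above use `find_all`, which returns every match. If only the first quote is wanted, `find` can be used instead. Here is a minimal sketch on a made-up two-quote snippet (the HTML string and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="quote"><span class="text" itemprop="text">“First quote.”</span></div>
<div class="quote"><span class="text" itemprop="text">“Second quote.”</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find() gives back a single tag (or None when nothing matches)
first_match = soup.find("span", itemprop="text")

# find_all() gives back a list-like collection of every matching tag
all_matches = soup.find_all("span", itemprop="text")

first_quote = first_match.get_text()
match_count = len(all_matches)

print(first_quote)
print(match_count)
```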

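Uh oh! If you try to search by <code>class</code> instead, for example <code>text.find_all("span", class="text")</code>, Python raises an error. <br>
What does the error say? <br>
<code>class</code> is a reserved Python keyword, so it cannot be used as an argument name. We can solve this problem by adding an underscore after it: <code>class_</code>. <br>
A single trailing underscore is the usual Python convention for a name that would otherwise clash with a keyword.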
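The keyword clash can be reproduced safely with a short sketch. The HTML string and variable names below are made up for illustration. `compile` is used only so that the broken line can be shown without stopping the notebook.

```python
from bs4 import BeautifulSoup

sample_html = '<span class="text">“First quote.”</span><small class="author">Some Author</small>'
soup = BeautifulSoup(sample_html, "html.parser")

# `class` is a reserved word, so using it as an argument name does not even compile
try:
    compile("soup.find_all('span', class='text')", "<demo>", "exec")
    keyword_error = None
except SyntaxError as err:
    keyword_error = type(err).__name__

# BeautifulSoup accepts `class_` (with a trailing underscore) instead
with_underscore = soup.find_all("span", class_="text")

# Passing an attrs dictionary is another way around the keyword
with_attrs = soup.find_all("span", attrs={"class": "text"})

print(keyword_error)
print(with_underscore)
print(with_attrs)
```

Both working calls find the same tag, so which one to use is mostly a matter of taste.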

I want to get all quotes from this page. <br>
Let's try out the <code>find_all</code> function with <code>class_</code>.

The search we used above also works! <br>
<code>text.find_all('span', itemprop='text')</code>

Let's save our data in a variable named <code>txt_with_tags</code><br>
and print it out.

In [744]:
txt_with_tags = text.find_all("span", class_="text")
txt_with_tags

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

This still looks messy to me. <br>
I want to get just the text out of the <code>txt_with_tags</code> variable. <br>
To do that, we need to loop through <code>txt_with_tags</code> and tell Python to: <br>
1. go through the list one element at a time, and
2. call <code>get_text()</code> on each element.


In [745]:
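quotes = []                        # start with an empty list
# the loop variable is called `tag`, not `text`, so we do not overwrite our soup
for tag in txt_with_tags:          # go through every <span> we found
    quotes.append(tag.get_text())  # keep only the text inside the tag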

In the cell above, we extracted the quotes <br>
and saved them in a variable called <code>quotes</code>.<br>
To do so, we first made an empty list called <code>quotes</code>.<br>
Then, using a loop, we <code>append</code>ed the text of each quote to that list.
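The loop-and-append pattern can also be written in one line as a list comprehension. The sketch below shows both versions side by side on a made-up snippet (the HTML string and variable names are assumptions for illustration).

```python
from bs4 import BeautifulSoup

sample_html = """
<span class="text">“First quote.”</span>
<span class="text">“Second quote.”</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")
tags = soup.find_all("span", class_="text")

# Explicit loop: start with an empty list and append each tag's text
quotes_loop = []
for tag in tags:
    quotes_loop.append(tag.get_text())

# The same result written as a list comprehension
quotes_comprehension = [tag.get_text() for tag in tags]

print(quotes_loop)
print(quotes_comprehension)
```

Both lists hold plain strings with the HTML tags removed.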

## Quotes into dataframe!

We will now import pandas and use it to turn our <code>quotes</code> list into a dataframe.

In [746]:
import pandas as pd

In [747]:
quotesDF = pd.DataFrame(data=quotes, columns=["quotes"])
quotesDF

|   | quotes |
|---|---|
| 0 | “The world as we have created it is a process ... |
| 1 | “It is our choices, Harry, that show what we t... |
| 2 | “There are only two ways to live your life. On... |
| 3 | “The person, be it gentleman or lady, who has ... |
| 4 | “Imperfection is beauty, madness is genius and... |
| 5 | “Try not to become a man of success. Rather be... |
| 6 | “It is better to be hated for what you are tha... |
| 7 | “I have not failed. I've just found 10,000 way... |
| 8 | “A woman is like a tea bag; you never know how... |
| 9 | “A day without sunshine is like, you know, nig... |


Let's export our dataframe to Excel!

In [748]:
# quotesDF.to_excel("quotes_DF_Ex.xlsx")

## Author names 
Let's quickly do the same with author names!

In [749]:
text = bs(request.text, 'html.parser')  # parse the page into a soup
authors_with_tags = text.find_all("small", class_="author")  # every <small class="author"> tag
authors = []
for author_tag in authors_with_tags:
    authors.append(author_tag.get_text())  # keep only the author's name

Let's combine our two dataframes. <br>
I will use pandas' <code>concat</code> function with <code>axis=1</code>, which places the columns side by side.

In [750]:
quotesDF = pd.DataFrame(data=quotes, columns=["Quotes"])
authorsDF = pd.DataFrame(data=authors, columns=["Authors"])
both = pd.concat([quotesDF, authorsDF], axis=1)  # axis=1 joins the columns side by side
both

|   | Quotes | Authors |
|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein |
| 6 | “It is better to be hated for what you are tha... | André Gide |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin |
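To see what the `axis` argument changes, here is a small sketch with two made-up two-row dataframes (the data and variable names are assumptions for illustration). With `axis=1` the frames are placed next to each other and rows are matched by their index. With `axis=0` they are stacked on top of each other instead.

```python
import pandas as pd

quotes_df = pd.DataFrame({"Quotes": ["“First quote.”", "“Second quote.”"]})
authors_df = pd.DataFrame({"Authors": ["Author A", "Author B"]})

# axis=1: side by side, rows matched on the index (0, 1)
both = pd.concat([quotes_df, authors_df], axis=1)
shape_side_by_side = both.shape

# axis=0: stacked vertically, missing cells are filled with NaN
stacked = pd.concat([quotes_df, authors_df], axis=0)
shape_stacked = stacked.shape

print(both)
print(shape_side_by_side, shape_stacked)
```

Because the rows are matched on the index, this only lines up correctly when the quote and author lists are in the same order and have the same length.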


Let's export the table to excel!
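A minimal sketch of the export, assuming the `openpyxl` package is installed (pandas needs it to write `.xlsx` files). The two-row table and the file name `quotes_and_authors.xlsx` are made up for illustration. The file goes into a temporary folder so that nothing in your working folder is overwritten.

```python
import os
import tempfile

import pandas as pd

both = pd.DataFrame(
    {
        "Quotes": ["“First quote.”", "“Second quote.”"],
        "Authors": ["Author A", "Author B"],
    }
)

# Hypothetical file name inside a fresh temporary folder
out_path = os.path.join(tempfile.mkdtemp(), "quotes_and_authors.xlsx")

try:
    # index=False leaves the 0, 1, 2, ... row labels out of the file
    both.to_excel(out_path, index=False)
    exported = True
except ImportError:
    # pandas raises ImportError when openpyxl is not installed
    exported = False

print(exported, out_path)
```

In the notebook itself, something like `both.to_excel("quotes_and_authors.xlsx", index=False)` would write the file next to the notebook.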

Another way to build the table is to use a dictionary.
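The notebook does not show the dictionary version, so the following is only a sketch of one possible way, with made-up data. Each dictionary key becomes a column name and each list becomes that column's values, so both lists must have the same length.

```python
import pandas as pd

# Made-up lists standing in for the scraped quotes and authors
quotes = ["“First quote.”", "“Second quote.”"]
authors = ["Author A", "Author B"]

# One step instead of two dataframes plus concat
both = pd.DataFrame({"Quotes": quotes, "Authors": authors})

print(both)
```

This skips the separate `quotesDF` and `authorsDF` dataframes and the `concat` call.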

Please save your work and make sure to follow this format. <br>
"LASTNAME_firstname.ipynb" <br>
Then, upload it where it says "hand in your jupyter notebook file here!" <br>