<div align='center'><font size="5" color='#353B47'>Scraping famous phrases and sayings</font></div>
<div align='center'><font size="4" color="#353B47">For an ironclad repartee</font></div>
<br>
<hr>

<div align="center"><img src=https://www.apc.edu.ph/wp-content/uploads/2019/05/Its-Raining-Cats-and-Dogs_Beatrice-Baylosis.png width="60%"></div>

<div align='justify'><font size=3 color ='#D291BC'><i>It is raining ropes, take some seed, I am splitting my pear...</i> All of these are wrongly translated. I created this playful notebook to get started with scraping and to collect as many sayings as possible. I first focus on English sayings, then I will add more and more countries.</font></div>

---

## <div id="summary">Table of contents</div>

**<font size="2"><a href="#chap1">1. English phrases and sayings</a></font>**
**<br><font size="2"><a href="#chap2">2. Chinese proverbs</a></font>**
**<br><font size="2"><a href="#chap3">3. French expressions</a></font>**

In [None]:
# Libraries
import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
from tqdm import tqdm
import os
import re

# <font color = '#957DAD'>What is web scraping?</font>

<div align='justify'><font size=3>Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.</font></div>

# <font color = '#957DAD'>What is HTTP?</font>

<div align='justify'><font size=3>HTTP is a protocol designed to enable communication between clients and servers. It works as a request-response protocol: the client sends a request and the server returns a response. A web browser may be the client, and an application on a computer that hosts a website may be the server. Two request methods are used most often:</font></div>
<br>

- <font color ='#D291BC'>GET</font> : to request data from the server.
- <font color ='#D291BC'>POST</font> : to submit data to be processed to the server.
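
<div align='justify'><font size=3>As a minimal sketch of the difference between the two methods, the standard library can build both kinds of request without sending anything. The example.com address and the form data are placeholders I made up for illustration; the rest of this notebook uses the requests library instead.</font></div>

```python
from urllib.request import Request

# Placeholder URL: the requests are only built, never sent
get_request = Request('https://example.com/sayings')

# Attaching a body turns the request into a POST
post_request = Request('https://example.com/sayings', data=b'q=proverb')

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```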

# <font color = '#957DAD'>Time to scrape!</font>

---

# <div id="chap1"><font color = '#957DAD'>PART I: Scraping English sayings</font></div>

In [None]:
# The URL I want to scrape data from
url = 'https://www.phrases.org.uk/meanings/phrases-and-sayings-list.html'

# Send the GET request
response = get(url)

# Parse the webpage and store it as a bs4.BeautifulSoup object
html_soup = BeautifulSoup(response.text, 'html.parser')

<div align='justify'><font size=3>Now all the source code is stored into html_soup and we can work with it and select what we need. You will see that bs4 is a powerful library that can save you a lot of time.</font></div>

In [None]:
quotes = html_soup.find_all('p', class_ = 'phrase-list')
size = len(quotes)
quotes[:5]

<div align='justify'><font size=3>We obtain a list containing all the p tags of the targeted class. Each of these p tags wraps an a tag whose href attribute holds a link. This link will be useful to retrieve the explanation of the saying: each saying links to a page explaining its meaning and its origin.</font></div>
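
<div align='justify'><font size=3>To see this structure on something small, here is a sketch on a hand-written HTML fragment. The fragment, its href values and the 'other' class are assumptions made for illustration; only the phrase-list class name comes from the real page.</font></div>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the layout of the list page
html = """
<p class="phrase-list"><a href="a-bird-in-the-hand.html">A bird in the hand</a></p>
<p class="phrase-list"><a href="a-piece-of-cake.html">A piece of cake</a></p>
<p class="other">Not a phrase</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# Only the p tags with the targeted class are returned
items = soup.find_all('p', class_='phrase-list')

# .text gives the visible text, .a['href'] the link of the nested a tag
texts = [p.text for p in items]
links = [p.a['href'] for p in items]
print(texts)
print(links)
```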

In [None]:
cleaned_quotes = [quote.text for quote in quotes]
cleaned_quotes[:5]

<div align='justify'><font size=3>Now we have the cleaned text of every quote. The next step consists in retrieving the explanation of each quote in that list. To do so, I need to build a list of links so I can scrape the explanation associated with each quote. For a given p tag, I can access its a tag and href attribute directly with the following expression:</font></div>
<br>

> <font size = 3>text.a['href']</font>

In [None]:
href_quotes = [quote.a['href'] for quote in quotes]

<div align='justify'><font size=3>I retrieved the list of all href links associated with each quote. Now, for each of these links, a generic scraping step will be performed to retrieve the explanation. It is not always that easy because sometimes the link points to a totally different website, with a different organisation of tags and different class names. So I will try my best to get as many explanations as I can.</font></div>
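
<div align='justify'><font size=3>One possible way to cope with links that point to another website is to let urljoin decide how to combine the base address and the href. This is only a sketch, not what the function below does, and both href values are made up for illustration.</font></div>

```python
from urllib.parse import urljoin

BASE_LINK = 'https://www.phrases.org.uk/meanings/'

# Both href values are hypothetical examples
relative_href = 'a-bird-in-the-hand.html'
external_href = 'https://example.com/some-other-page.html'

# A relative href is appended to the base, an absolute one is kept as-is
relative_url = urljoin(BASE_LINK, relative_href)
external_url = urljoin(BASE_LINK, external_href)
print(relative_url)
print(external_url)
```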

In [None]:
# The base link
BASE_LINK = 'https://www.phrases.org.uk/meanings/'

def get_explanations(url):
    
    # Same request-and-parse steps as at the beginning of this notebook
    response = get(BASE_LINK + url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    quote_explanation = html_soup.find_all('p', class_ = 'meanings-body')
    if len(quote_explanation) >= 1:
        quote_explanation = str(quote_explanation[0].text)
    else:
        quote_explanation = "NO INFORMATION"
        
    return quote_explanation

> <div align='justify'><font size=3>I noticed that most of the time, the pages containing explanations have the same structure. I need to find the p tag whose class is meanings-body; if the structure is regular, the explanation can be read directly from the .text attribute. When the page is different, find_all returns an empty list, so quote_explanation has a length of zero. To avoid an error in that case, I simply store the string 'NO INFORMATION' instead of the true explanation.</font></div>

In [None]:
%%time
# This might take a while, you can grab a coffee or just reduce the number of quotes.
# Here I keep only the first five quotes so it runs faster
number_of_quotes = 5
assert number_of_quotes <= len(quotes)

explanations = [get_explanations(i) for i in tqdm(href_quotes[:number_of_quotes])]

In [None]:
# Constructing the final dataframe
quotes_dataframe = pd.DataFrame()
quotes_dataframe['text'] = quotes[:number_of_quotes]
quotes_dataframe['text'] = quotes_dataframe['text'].apply(lambda x:x.text)
quotes_dataframe['explanation'] = explanations
quotes_dataframe['origin'] = 'English'

quotes_dataframe.head()

In [None]:
# Save all the data in .csv file
quotes_dataframe.to_csv('English_phrases_and_sayings.csv')

**<font size="2"><a href="#summary">Back to summary</a></font>**

---

# <div id="chap2"><font color = '#957DAD'>PART II: Chinese Proverbs</font></div>

<div align='justify'><font size=3>Now let's scrape a different website to get Chinese proverbs. They have the peculiarity of carrying a moral, not without a sense of humour as well.</font></div>

In [None]:
# The URL I want to scrape data from
url = 'https://www.chinahighlights.com/travelguide/learning-chinese/chinese-sayings.htm'

# Send the GET request
response = get(url)

# Parse the webpage and store it as a bs4.BeautifulSoup object
html_soup = BeautifulSoup(response.text, 'html.parser')

<div align='justify'><font size=3>This website is not built in the same way, so you will have to adapt the way you retrieve the data. Unfortunately, there are no miracle methods to automatically adapt to any website, you have to do it on a case by case basis. Here, all the quotes are contained in a div tag.</font></div>

In [None]:
proverbs_container = html_soup.find_all('div', class_ = 'col-md-19 col-sm-19 col-xs-24 pull-right')

<div align='justify'><font size=3>Now we are sure to have all the proverbs contained in this variable. To extract them, we need to keep only the lines starting with a number. This means searching for a pattern in a string of characters, which is what regular expressions are for.</font></div>

> # <div align="center"><font color = '#957DAD'>What are Regular Expressions (RE)?</font></div>
> <img src="https://www.oreilly.com/content/wp-content/uploads/sites/2/2019/06/email-regex_crop-ae942dc427c8cebd3a83c52d17389123.jpg" width=400>
> <br>
> <div align='justify'><font size=3>A regular expression is a sequence of characters that defines a search pattern. Such patterns are usually used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. So basically you will have to learn the regex syntax to match whatever you want in a string. I strongly recommend <a href="https://regex101.com/">this website</a>, which lets you test a regex live.</font></div>
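
<div align='justify'><font size=3>The three uses mentioned above (find, find and replace, validation) can be sketched with the re module on a made-up sentence. The sentence and the variable names are assumptions chosen for illustration only.</font></div>

```python
import re

# Made-up sentence used only for this illustration
sentence = "Call 555-0199 before 18:30, or write to room 42."

# Find: every run of consecutive digits
digit_runs = re.findall(r'\d+', sentence)

# Find and replace: hide each digit behind a '#'
masked = re.sub(r'\d', '#', sentence)

# Validation: the whole string must look like HH:MM
is_time = re.fullmatch(r'\d{2}:\d{2}', '18:30') is not None

print(digit_runs)
print(masked)
print(is_time)
```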

In [None]:
def starts_with_digit(text):

    """
    Return True if 'text' starts with a digit, False otherwise
    text: string
    """

    # re.match only looks at the beginning of the string,
    # so the pattern matches only when the first character is a digit
    return re.match(r'\d', text) is not None

<div align='justify'><font size=3>Here we get all the tags in our proverbs_container that are of type p. After retrieving all of them, I define a mask which tells, for each p tag, whether the sentence contained in this tag starts with a number. I will use the previous function for that task. Once I get this mask, I will just have to filter all my p tags stored in the 'to_browse' variable.</font></div>

In [None]:
# All the p tags in proverbs_container
to_browse = proverbs_container[0].find_all('p')

In [None]:
# One boolean per p tag: does the sentence start with a number?
mask = [starts_with_digit(quote.text) for quote in to_browse]

# Filter to get all the proverbs
list_of_proverbs = [quote.text for quote, keep in zip(to_browse, mask) if keep]

<div align='justify'><font size=3>Now I would like to split each row into several parts so I can create a dataframe whose columns hold the following information:</font>
<br>
<br>- <font size=3 color ='#D291BC'>The Chinese version of the proverb</font> ('<b>in_chinese</b>')
<br>- <font size=3 color ='#D291BC'>The PinYin version with its wordwise translation in English</font> ('<b>pin_yin</b>')
<br>- <font size=3 color ='#D291BC'>The English version of the proverb</font> ('<b>text</b>')</div>

In [None]:
pattern_chinese = re.compile(r'((?<=\d\.)(.*?)(?=\())')
pattern_chinese.search("6. 三个和尚没水喝。 (Sān gè héshàng méi shuǐ hē. 'three monks have no water to drink') — Too many cooks spoil the broth.").group()

(?<=\d\.) is a lookbehind: my match starts right after the number and the dot at the beginning of the sentence.

(?=\() is a lookahead: my match ends right before '('.

(.*?) lazily matches any characters, as few as possible.

On the whole, I get all the characters between the number and '(', both excluded.
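
<div align='justify'><font size=3>A small sketch of how the pattern above behaves on a shortened, made-up line. It also shows that the spaces around the captured text are part of the match, so a strip() afterwards can be useful.</font></div>

```python
import re

# Shortened, made-up line with the same layout as a proverb
line = "6. abc (def)"

# Same pattern as above: lookbehind, lazy match, lookahead
lookaround = re.compile(r'(?<=\d\.)(.*?)(?=\()')
raw = lookaround.search(line).group()

print(repr(raw))          # ' abc '
print(repr(raw.strip()))  # 'abc'
```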

In [None]:
pattern_pin_yin = re.compile(r'((?<=\()(.*?)(?=\)))')
pattern_pin_yin.search("6. 三个和尚没水喝。 (Sān gè héshàng méi shuǐ hē. 'three monks have no water to drink') — Too many cooks spoil the broth.").group()

(?<=\() is a lookbehind: my match starts right after '('.

(?=\)) is a lookahead: my match ends right before ')'.

(.*?) lazily matches any characters, as few as possible.

On the whole, I get all the characters between '(' and ')', both excluded.

In [None]:
pattern_translation = re.compile(r'(?<=\—)(.*?)$')
pattern_translation.search("6. 三个和尚没水喝。 (Sān gè héshàng méi shuǐ hē. 'three monks have no water to drink') — Too many cooks spoil the broth.").group()

(?<=\—) is a lookbehind: my match starts right after '—'.

$ anchors the match at the end of the string, so the pattern catches all the remaining characters.

(.*?) lazily matches any characters; here the $ anchor forces it to extend to the end of the string.

On the whole, I get all the characters after '—'.

<div align='justify'><font size=3>Once each of the patterns works on an example, I can retrieve the matching strings for every observation and drop the 'all_text' column. One piece of information is still missing...</font></div>
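
<div align='justify'><font size=3>As an alternative sketch, the three fields can also be captured in a single pass with named groups. This is not the approach used in the next cell; the pattern name proverb_pattern and the way whitespace is skipped are my own assumptions, tested only on the example sentence.</font></div>

```python
import re

# One pattern with named groups instead of three separate searches.
# \u2014 is the long dash that separates the pinyin from the English text.
proverb_pattern = re.compile(
    r'^\s*\d+\.\s*(?P<in_chinese>.*?)\s*'
    r'\((?P<pin_yin>.*?)\)\s*'
    r'\u2014\s*(?P<text>.*)$'
)

example = ("6. 三个和尚没水喝。 (Sān gè héshàng méi shuǐ hē. "
           "'three monks have no water to drink') \u2014 Too many cooks spoil the broth.")

match = proverb_pattern.search(example)
# groupdict() maps each group name to the text it captured
fields = match.groupdict() if match else {}
print(fields)
```

<div align='justify'><font size=3>A row that does not follow the expected layout gives an empty dictionary here instead of raising an error, which makes irregular rows easier to spot.</font></div>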

In [None]:
chinese_proverbs = pd.DataFrame()
chinese_proverbs['all_text'] = list_of_proverbs
chinese_proverbs['in_chinese'] = chinese_proverbs['all_text'].apply(lambda x:pattern_chinese.search(x).group())
chinese_proverbs['pin_yin'] = chinese_proverbs['all_text'].apply(lambda x:pattern_pin_yin.search(x).group())
chinese_proverbs['text'] = chinese_proverbs['all_text'].apply(lambda x:pattern_translation.search(x).group())
chinese_proverbs['category'] = "-1"
chinese_proverbs['origin'] = "Chinese"

chinese_proverbs = chinese_proverbs.drop(['all_text'], axis=1)

<div align='justify'><font size=3>Now that we have retrieved all the proverbs, we can also associate them with a category. The h2 tags contain the categories into which the different proverbs are divided. We even have the number of proverbs per category, which will allow us to have the exact number of proverbs.</font></div>

In [None]:
proverbs_container[0].find_all('h2')

In [None]:
# I define a list with the name of categories
categories = ['Wisdom', 'Friendship', 'Love', 'Family', 'Encouragement', 'Education', 'Literature', 'Dragons']

# I define a list of proverbs per category (same order)
number_of_quotes_per_category = [26, 10, 10, 10, 21, 10 ,30 ,10]

# Put both list in a dict
dict_categories = dict(zip(categories, number_of_quotes_per_category))

In [None]:
# I ensure that we have same number of proverbs
len(list_of_proverbs) == sum(number_of_quotes_per_category)

In [None]:
# I define a cumsum list to get range index of proverbs that are contained in a category
cumsum = np.cumsum(number_of_quotes_per_category)

In [None]:
# I complete the 'category' column
chinese_proverbs.loc[:cumsum[0], 'category'] = categories[0]
for index in range(6):
    chinese_proverbs.loc[cumsum[index]:cumsum[index+1], 'category'] = categories[index+1]
chinese_proverbs.loc[cumsum[6]:, 'category'] = categories[7]
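
<div align='justify'><font size=3>Note that .loc slices include both endpoints, so two consecutive ranges above share one row; the result is still right because each later assignment overwrites that shared row. As a sketch of a way to avoid the overlap, the labels can be built by repeating each category name as many times as it has proverbs. The names and counts below are shortened toy values.</font></div>

```python
# Toy data: shortened category list and counts, for illustration only
categories = ['Wisdom', 'Friendship', 'Love']
counts = [3, 2, 1]

# Repeat each category name once per proverb it contains
labels = [name for name, count in zip(categories, counts) for _ in range(count)]
print(labels)
```

<div align='justify'><font size=3>The resulting list has one label per row and could be assigned to the 'category' column in one go, provided the counts sum to the number of rows. With NumPy, np.repeat(categories, counts) gives the same sequence.</font></div>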

In [None]:
chinese_proverbs.head()

<div align='justify'><font size=3>There is not much in common with the English phrases and sayings dataset except the 'text' and 'origin' columns. It does not really matter, but if you want to combine these two datasets, you can do it easily as long as you keep the text and origin columns. My goal here was rather to scrape different websites and to show that static web pages can call for totally different approaches.</font></div>
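
<div align='justify'><font size=3>A minimal sketch of such a combination, on two toy frames whose rows are made up for illustration. With join='inner', pd.concat keeps only the columns that every frame has.</font></div>

```python
import pandas as pd

# Toy frames with made-up rows; only 'text' and 'origin' are shared
english = pd.DataFrame({'text': ['A piece of cake'],
                        'explanation': ['Something very easy'],
                        'origin': ['English']})
chinese = pd.DataFrame({'in_chinese': ['三个和尚没水喝。'],
                        'text': ['Too many cooks spoil the broth.'],
                        'origin': ['Chinese']})

# join='inner' keeps only the columns present in every frame
combined = pd.concat([english, chinese], join='inner').reset_index(drop=True)
print(sorted(combined.columns))  # ['origin', 'text']
```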

In [None]:
# Save all the data in .csv file
chinese_proverbs.to_csv('Chinese_proverbs.csv')

**<font size="2"><a href="#summary">Back to summary</a></font>**

---

# <div id="chap3"><font color = '#957DAD'>PART III: French expressions</font></div>

<div align='justify'><font size="3">Now let's retrieve some French expressions from another website.</font></div>

In [None]:
# The URL I want to scrape data from
url = "https://frenchtogether.com/french-idioms/"

# Send the GET request
response = get(url)

# Parse the webpage and store it as a bs4.BeautifulSoup object
html_soup = BeautifulSoup(response.text, 'html.parser')

<div align='justify'><font size="3">Once I managed to retrieve the html page, I retrieve the tag that allows me to have the list of expressions. The list is easy to retrieve because it is in a specific h3 tag. Only the last two items in the list of retrieved expressions are irrelevant.</font></div>

In [None]:
quotes = html_soup.find_all('h3')

# Get the list of all french quotes
french_quotes = [quote.text for quote in quotes]

# I got these elements that I need to clean
print(french_quotes[-2:])

# Excluding the two last elements which are not quotes
french_quotes = french_quotes[:-2]

<div align='justify'><font size="3">After each h3 tag, there is a succession of p tags. These tags contain the information I want to retrieve: the texts that follow the strong tags Literally, Meaning and English counterpart.</font></div>

In [None]:
# For each quote, get all p tags that have all the necessary information
all_texts = html_soup.find_all('p')

<div align='justify'><font size="3">Now that I have a list of all p tags, I want to retrieve the strong tags with the keywords I mentioned. I define a function which tells whether the first element of a p tag contains the specified keyword.</font></div>

In [None]:
def has_strong_tag(quote, chunk):
    assert chunk in ['Literally', 'Meaning', 'English counterpart']
    # True when the first child of the p tag contains the keyword
    return chunk in quote.contents[0]

<div align='justify'><font size="3">Then, in a first step I will apply this function to all elements of all_texts list three times, one time per keyword. I will get a list of booleans. With np.where, I can get the indexes of all True values.</font></div>
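
<div align='justify'><font size="3">A quick sketch of what np.where returns on a list of booleans. The flags are made up; the point is that the result is a tuple with one array per dimension, hence the [0] used in the next cell.</font></div>

```python
import numpy as np

# Made-up list of booleans, one per element
flags = [False, True, False, True, True]

# np.where returns a tuple; its first item holds the positions of the True values
indexes = np.where(flags)[0]
print(indexes.tolist())  # [1, 3, 4]
```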

In [None]:
# Get indexes of elements that contain specified keywords
literally_text_indexes = np.where([has_strong_tag(text, 'Literally') for text in all_texts])[0]
meaning_text_indexes = np.where([has_strong_tag(text, 'Meaning') for text in all_texts])[0]
eng_cnt_text_indexes = np.where([has_strong_tag(text, 'English counterpart') for text in all_texts])[0]

<div align='justify'><font size="3">Once I have the indexes for each keyword, I can get the text associated with that keyword. Unfortunately, I get 91 elements instead of 90 in the literally_text_indexes and meaning_text_indexes lists, and 87 in eng_cnt_text_indexes. I need to filter the first two lists so that each one has exactly 90 elements.</font></div>

In [None]:
all_texts_arr = np.array(all_texts)

# Filter to keep all the Literally texts
literally_filtered = list(all_texts_arr[literally_text_indexes])
literally_real = [i.contents[1] for i in literally_filtered if 'strong' in str(i.contents[0])]

# Filter to keep all the Meaning texts
meanings_filtered = list(all_texts_arr[meaning_text_indexes])
meanings_real = [i.contents[1] for i in meanings_filtered if 'strong' in str(i.contents[0])]

# Filter to keep all the English Counterpart texts
eng_cnt_filtered = list(all_texts_arr[eng_cnt_text_indexes])
eng_cnt_real = [i.contents[1] for i in eng_cnt_filtered if 'strong' in str(i.contents[0])]

<div align='justify'><font size="3">Here, I retrieve the index of the element that should be removed and I delete it. The two lists will then have the same length as the list of French quotes I retrieved earlier.</font></div>

In [None]:
# Cleaning
to_find = ['strong' not in str(i.contents[0]) for i in literally_filtered]
print(np.where(to_find)[0][0])
print(literally_filtered[67])
to_find = [':' not in str(i.contents[1]) for i in meanings_filtered if 'strong' in str(i.contents[0])]
print(np.where(to_find)[0][0])
print(meanings_filtered[1])

In [None]:
all_texts_arr[literally_text_indexes[67]]
all_texts_arr[meaning_text_indexes[1]]

In [None]:
literally_text_indexes = [i for i in literally_text_indexes if i != literally_text_indexes[67]]
meaning_text_indexes = [i for i in meaning_text_indexes if i != meaning_text_indexes[1]]

<div align='justify'><font size="3">So now, you may wonder why I insist on getting the indexes of the Literally and Meaning texts. The reason is simple: I got only 87 English counterpart texts, so I cannot pair them with the French expressions by position, as an offset can be introduced at any point. However, I know that a French quote in an h3 tag is followed by a Literally text in a p tag, then a Meaning text in a p tag, then (when it exists) an English counterpart, also in a p tag. So by comparing the index of each English counterpart among all p tags with the Meaning indexes, I can find which French quote it belongs to.</font></div>
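
<div align='justify'><font size="3">The alignment idea can be sketched on toy numbers with the bisect module. All positions and labels below are made up, and this is an alternative to the nested loop of the next cell, not a copy of it: for each English counterpart, I look for the last Meaning paragraph that comes before it.</font></div>

```python
from bisect import bisect_left

# Made-up positions of the 'Meaning' paragraphs among all p tags, one per expression
meaning_indexes = [3, 7, 11, 15]
# Made-up positions of the 'English counterpart' paragraphs (expression 1 has none)
counterpart_indexes = [4, 12, 16]
counterparts = ['first', 'third', 'fourth']

aligned = [None] * len(meaning_indexes)
for pivot, counterpart in zip(counterpart_indexes, counterparts):
    # The owner is the last expression whose 'Meaning' paragraph comes before the pivot
    owner = bisect_left(meaning_indexes, pivot) - 1
    if owner >= 0:
        aligned[owner] = counterpart

print(aligned)  # ['first', None, 'third', 'fourth']
```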

In [None]:
# Construct DataFrame
column_data = ['in_french', 'literally', 'meaning', 'text']
french_expressions = pd.DataFrame(index = range(0, 90) ,columns = column_data)

french_expressions['in_french'] = french_quotes
french_expressions['literally'] = literally_real    
french_expressions['meaning'] = meanings_real
french_expressions['lit_index'] = literally_text_indexes
french_expressions['mea_index'] = meaning_text_indexes
french_expressions['origin'] = "French"

for index, pivot in enumerate(eng_cnt_text_indexes):
    for i in range(len(french_expressions)-1):
        if ((french_expressions.loc[i,'mea_index']<pivot) and (french_expressions.loc[i+1,'mea_index']>pivot)):
            french_expressions.loc[i, 'text'] = eng_cnt_real[index]
french_expressions.loc[89, 'text'] = eng_cnt_real[-1]

<div align='justify'><font size="3">The dataframe is almost ready to be used. I perform some cleaning on it, save it and add it to the database.</font></div>

In [None]:
# Cleaning dataframe
french_expressions.literally = french_expressions.literally.apply(lambda x:str(x).replace(':','').replace(';', ''))
french_expressions.meaning = french_expressions.meaning.apply(lambda x:str(x).replace(':','').replace(';', ''))
french_expressions.text = french_expressions.text.apply(lambda x:str(x).replace(':','').replace(';', ''))
french_expressions = french_expressions.drop(['lit_index', 'mea_index'], axis = 1)

In [None]:
french_expressions.head()

In [None]:
# Save all the data in .csv file
french_expressions.to_csv('French_expressions.csv', index=False)

<div align='center'><font size="5" color='#353B47'>Never give up</font></div>
<br>

<div align="center"><img src="https://i.dawn.com/primary/2020/06/5ef3b46a5fa89.jpg"></div>

<div align='justify'><font size="3">Note that any of the websites I scraped might be updated at any time, in which case my code is likely to stop working. I will save a version of this notebook with its outputs displayed. The most important thing in this notebook is to show the process of analyzing a webpage and scraping its data. A webpage rarely changes completely: in most cases, only a few adjustments are necessary.</font></div>

**<font size="2"><a href="#summary">Back to summary</a></font>**

---

# <font color = '#957DAD'>BONUS</font>

<div align='justify'><font size="3">If by chance you would like to get the English, Chinese and French quotes in one dataframe, here is how I do it.</font></div>

In [None]:
Eng_quotes = pd.read_csv("../input/phrases-and-sayings/English_phrases_and_sayings.csv")
Chi_quotes = pd.read_csv("../input/phrases-and-sayings/Chinese_proverbs.csv")
Fre_quotes = pd.read_csv("../input/phrases-and-sayings/French_expressions.csv")

new_df = pd.concat([Eng_quotes, Chi_quotes, Fre_quotes], join="inner").reset_index(drop=True)
new_df.to_csv('Concatenated_quotes.csv')

# <font color = '#957DAD'>References</font>

* https://www.scrapinghub.com/what-is-web-scraping/ 
* https://www.phrases.org.uk/meanings/phrases-and-sayings-list.html
* https://regex101.com/
* https://www.chinahighlights.com/travelguide/learning-chinese/chinese-sayings.htm

<hr>
<br>
<div align='justify'><font color="#353B47" size="4">Thank you for taking the time to read this notebook. I hope that I was able to answer your questions or satisfy your curiosity, and that it was easy to follow. <u>Any constructive comments are welcome</u>. They help me progress and motivate me to share better quality content. I am above all a passionate person who tries to advance my knowledge but also that of others. If you liked it, feel free to <u>upvote and share my work.</u> </font></div>
<br>
<div align='center'><font color="#353B47" size="3">Thank you and may passion guide you.</font></div>