## _30 Days Python Bootcamp @ BEST-ENLIST_

### _Author: SANDHYA S_

### _Date: 12 July '21_

## _Task: BeautifulSoup_
---

- Scraping is the process of extracting, copying and screening data from a source.
- When the data is extracted from the web (for example from web pages or websites), it is called web scraping.
- Beautiful Soup is a Python package that parses HTML and XML documents. It copes with messy web data, including badly formed HTML, and presents the result as a tree structure that is easy to traverse and search.
```
pip install beautifulsoup4
```

```
from bs4 import BeautifulSoup
import requests
url = "https://www.xlwings.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print(soup.title)
```

- When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the complex HTML page into a tree of Python objects. The four major kinds of objects are:
> - Tag
> - NavigableString
> - BeautifulSoup
> - Comment
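
As a minimal sketch of the four object kinds listed above, the snippet below parses a small inline HTML string (made up for illustration) and checks the type of each piece. It assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative markup: one tag holding a text node and a comment
html = "<p class='intro'>Hello<!-- a hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p            # the <p> element is a Tag
text = p.contents[0]  # the text inside it is a NavigableString
note = p.contents[1]  # the comment is a Comment

for obj in (soup, p, text, note):
    print(type(obj).__name__)
```

Comment is a special kind of NavigableString, and the BeautifulSoup object itself behaves like a Tag in most situations.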

- A tag object can have any number of attributes.
- The tag &lt;b class="boldest"&gt; has an attribute "class" whose value is "boldest".
- An attribute is not a tag itself: it is a name/value pair attached to a tag.
- You can read a single attribute by treating the tag like a dictionary and using the attribute name as the key (like "class" in the example above), or get all attributes at once as a dictionary through ".attrs".

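```
# Parse a small snippet; "html.parser" ships with Python, so no extra install is needed
xlwings = BeautifulSoup("<div class='xlwings'></div>", 'html.parser')
tag2 = xlwings.div
tag2['class']  # ['xlwings']
```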
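
The two access styles described above can be sketched as follows. The markup and the variable names are illustrative assumptions; only dictionary-style access, `.attrs` and `.get()` are used.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="first" class="boldest">bold text</b>', "html.parser")
tag = soup.b

tag_id = tag["id"]          # dictionary-style access to one attribute
all_attrs = dict(tag.attrs) # copy of every attribute as a plain dict
missing = tag.get("title")  # .get() returns None instead of raising KeyError

# Attributes can also be changed or removed like dictionary entries
tag["id"] = "renamed"
del tag["class"]
print(tag)
```

Using `tag["title"]` on a missing attribute raises `KeyError`, so `.get()` is the safer choice when an attribute may be absent.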

- Some HTML attributes can have multiple values. The most commonly used is the "class" attribute, since one tag can carry several CSS classes. Beautiful Soup returns the value of such an attribute as a list.
- Others include "rel", "rev", "headers", "accesskey" and "accept-charset".

```
# Two CSS classes on one tag come back as a list
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']  # ['body', 'strikeout']
```
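
To contrast multi-valued and single-valued attributes, here is a short sketch on a made-up link tag. It assumes the default behaviour of Beautiful Soup, where "class" and "rel" are split into lists while "id" is kept as one string even if it contains a space.

```python
from bs4 import BeautifulSoup

html = '<a id="my link" class="btn primary" rel="nofollow noopener" href="#">x</a>'
link = BeautifulSoup(html, "html.parser").a

classes = link["class"]  # multi-valued: split on whitespace into a list
rels = link["rel"]       # multi-valued on <a> tags as well
link_id = link["id"]     # single-valued: stays one string

print(classes, rels, link_id)
```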

- A NavigableString object represents the text contained in a tag. To access it, use ".string" on the tag.

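```
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup("<h2 id='message'>Hello, xlwings!</h2>", 'html.parser')
soup.string  # 'Hello, xlwings!'
```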

```
type(soup.string)
```
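
One detail worth knowing is that ".string" only returns text when a tag has a single child. The sketch below, on an illustrative snippet, shows that case next to a tag with mixed content, where `get_text()` is the usual alternative.

```python
from bs4 import BeautifulSoup

html = "<div><h2>Hello, xlwings!</h2><p>One <b>bold</b> word</p></div>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.h2.string       # NavigableString: <h2> has exactly one child
plain = str(heading)           # convert to an ordinary Python str
para_string = soup.p.string    # None: <p> has several children
para_text = soup.p.get_text()  # concatenates all the text inside <p>

print(plain, para_string, para_text)
```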

- The BeautifulSoup object is created when we parse a web resource, and it represents the complete document we are trying to scrape. Most of the time, it can be treated as a Tag object.
- New data or content is added to an existing tag with the tag.append() method, which works much like the append() method of a Python list.
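
A minimal sketch of tag.append(), assuming a small made-up list: it appends a plain string to an existing tag and then a newly created tag built with `new_tag()`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>first</li></ul>", "html.parser")

# Append a plain string to the existing <li>
soup.ul.li.append(" item")

# Create a new <li> and append it to the <ul>
item = soup.new_tag("li")
item.string = "second"
soup.ul.append(item)

result = str(soup)
print(result)
```

As with a list, the appended element always becomes the last child of the tag.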

---
# _Exercise_

### _1. Take any webpage, scrape the data, extract only the class data and save it to Excel_

```
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Beautiful_Soup'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Collect every distinct, non-empty class value found on any tag
class_list = set()
for tag in soup.find_all():
    if tag.has_attr("class") and tag["class"]:
        class_list.add(" ".join(tag["class"]))

print(f'CLASS:\n{class_list}')
```

```
CLASS:
{'mw-hidden-catlinks mw-hidden-cats-hidden', 'mw-portlet mw-portlet-personal vector-user-menu-legacy vector-menu', 'mw-portlet mw-portlet-coll-print_export vector-menu vector-menu-portal portal', 'searchButton mw-fallbackSearchButton', 'mw-parser-output', 'read-more-container', 'shortdescription nomobile noexcerpt noprint searchaux', 'mw-editsection', 'vector-menu-heading', 'wbc-editpage', 'mw-portlet mw-portlet-interaction vector-menu vector-menu-portal portal', 'vector-menu-content-list', 'searchButton', 'dmbox-body', 'after-portlet after-portlet-lang', 'mw-headline', 'catlinks', 'vector-menu-content', 'mw-jump-link', 'mw-editsection-bracket', 'selected', 'mw-normal-catlinks', 'uls-after-portlet-link', 'mw-body-content mw-content-ltr', 'image', 'vector-menu-checkbox', 'metadata plainlinks dmbox dmbox-disambig', 'mw-portlet mw-portlet-tb vector-menu vector-menu-portal portal', 'mw-body', 'noprint', 'mw-footer', 'noprint stopMobileRedirectToggle', 'printfooter', 'mw-indicators',
```

```
# Build a dictionary with one list of classes and one list of their text content
class_dict = {'Class': [],
              'Text' : []}
for clas in class_list:
    # Look for the first <div> whose class attribute is exactly this value
    content = soup.find('div', {"class": clas})
    if content is None:
        # No <div> carries this class, so there is no paragraph text to record
        text = 'None'
    else:
        text = ''
        for i in content.find_all('p'):
            text = text + ' ' + i.text
    class_dict['Class'].append(clas)
    class_dict['Text'].append(text)
```
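
To try the two steps above without a network request, here is a minimal offline sketch. The inline HTML and the names `class_names` and `rows` are illustrative assumptions, not part of the exercise; the logic mirrors the cells above on a much smaller document.

```python
from bs4 import BeautifulSoup

# Small stand-in for a downloaded page
html = """
<div class="intro lead"><p>First paragraph.</p><p>Second paragraph.</p></div>
<span class="badge">new</span>
<div class="empty"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: collect the distinct class values, joined with spaces
class_names = sorted(
    {" ".join(t["class"]) for t in soup.find_all() if t.has_attr("class") and t["class"]}
)

# Step 2: for each class, gather the paragraph text of the first matching <div>
rows = {}
for name in class_names:
    div = soup.find("div", {"class": name})
    if div is None:
        rows[name] = None  # the class exists, but not on a <div>
    else:
        rows[name] = " ".join(p.get_text() for p in div.find_all("p"))

print(rows)
```

Classes that appear only on other tags, such as the `<span>` here, end up with no text, which is why most rows in the real output are empty.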

```
# Create a DataFrame from the dictionary
import pandas as pd
df = pd.DataFrame(class_dict)
```

```
# Display the first 10 entries
df.head(10)
```

| | Class | Text |
|---|---|---|
| 0 | mw-hidden-catlinks mw-hidden-cats-hidden | |
| 1 | mw-portlet mw-portlet-personal vector-user-men... | |
| 2 | mw-portlet mw-portlet-coll-print_export vector... | |
| 3 | searchButton mw-fallbackSearchButton | |
| 4 | mw-parser-output | Beautiful Soup may refer to:\n |
| 5 | read-more-container | |
| 6 | shortdescription nomobile noexcerpt noprint se... | |
| 7 | mw-editsection | |
| 8 | vector-menu-heading | |
| 9 | wbc-editpage | |


```
# Display the last 10 entries
df.tail(10)
```

| | Class | Text |
|---|---|---|
| 39 | client-nojs | |
| 40 | mediawiki ltr sitedir-ltr mw-hide-empty-elt ns... | |
| 41 | firstHeading | |
| 42 | wb-langlinks-add wb-langlinks-link | |
| 43 | anonymous-show | |
| 44 | mw-wiki-logo | |
| 45 | mw-portlet mw-portlet-views vector-menu vector... | |
| 46 | vector-body | Beautiful Soup may refer to:\n |
| 47 | mw-portlet mw-portlet-variants emptyPortlet ve... | |
| 48 | extiw | |


```
# Write the DataFrame to an Excel file
df.to_excel('Scraped_class.xlsx')
```

---
## _Thank You!_