# Web Scraping with BeautifulSoup Python Library

Vast amounts of data are stored on the Internet. The most accessible and ready to use for data scientists are those stored in CSV, JSON or Excel files. Data can also be accessed via an API (application programming interface).
But what if we need data that are not available in those formats? Then we need to scrape them from a website.
So, what is 'web scraping'?
- Web scraping is the process of collecting data from websites. Basically, we write code that requests a specified file from the server hosting the website we want to scrape. The code then goes through that file and collects the data we instructed it to collect.
- There are different web scraping tools that can be used for collecting data; here we are going to use the Python library **Beautiful Soup**.

For this project I am going to collect data about teas from this beautiful site, [Tea Forte](https://www.teaforte.com/).
First I am going to designate the data that I want to collect from the website. In this case it is a site about tea, and here are the data that I need:
- pages,
- Name,
- Category,
- Description,
- Review,
- Review_score

In [43]:
# creating empty lists
pages = []
names = []
categories =[]
descriptiones = []
reviewes = []
review_scores = []

In these lists I will store the data from which, at the end, I am going to create a dictionary and a data frame.
Let's first import the requests and BeautifulSoup libraries.
After that I am going to store the web page in a soup object and display it.

In [44]:
import requests
from bs4 import BeautifulSoup

In [45]:
page = requests.get('https://www.teaforte.com/teas/')
page

<Response [200]>

In [78]:
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())

So, as can be seen, this is the page that lists all the teas that can be ordered from the website. There are 82 different teas displayed on this page. In order to extract the necessary data, I need to click on every tea name under the tea picture, or on the picture itself. Using the browser's Developer Tools, I can see that the links that lead to a specific tea are stored under:

```html
<h2 class="product-card--name">
    <a href="/store/gourmet-tea/african-solstice/">African Solstice</a>
</h2>
```

I am going to select it.    

In [48]:
soup.select('h2 > a')

[<a href="/store/gourmet-tea/african-solstice/">African Solstice</a>,
 <a href="/store/gourmet-tea/apricot-amaretto/">Apricot Amaretto</a>,
 <a href="/store/gourmet-tea/belgian-mint/">Belgian Mint Tea</a>,
 <a href="/store/gourmet-tea/berry-basket/">Berry Basket</a>,
 <a href="/store/gourmet-tea/black-cherry/">Black Cherry</a>,
 <a href="/store/gourmet-tea/black-currant/">Black Currant Tea</a>,
 <a href="/store/gourmet-tea/blood-orange/">Blood Orange Tea</a>,
 <a href="/store/gourmet-tea/blueberry-merlot/">Blueberry Merlot Tea</a>,
 <a href="/store/gourmet-tea/bombay-chai/">Bombay Chai Tea</a>,
 <a href="/store/gourmet-tea/caramel-nougat/">Caramel Nougat</a>,
 <a href="/store/gourmet-tea/chai-matcha/">Chai Matcha</a>,
 <a href="/store/gourmet-tea/chamomile-citron/">Chamomile Citron Tea</a>,
 <a href="/store/gourmet-tea/cherry-amour/">Cherry Amour</a>,
 <a href="/store/gourmet-tea/hanami/">Cherry Blossom Hanami</a>,
 <a href="/store/gourmet-tea/cherry-cosmo/">Cherry Cosmo Tea</a>,
 <a h

Since I don't need the complete 'a' tag, just the href attribute, which contains part of the address, let's extract it and store all the address parts that lead to a specific tea in the page list:

In [49]:
page = [a['href'] for a in soup.select('h2 > a')]
page

['/store/gourmet-tea/african-solstice/',
 '/store/gourmet-tea/apricot-amaretto/',
 '/store/gourmet-tea/belgian-mint/',
 '/store/gourmet-tea/berry-basket/',
 '/store/gourmet-tea/black-cherry/',
 '/store/gourmet-tea/black-currant/',
 '/store/gourmet-tea/blood-orange/',
 '/store/gourmet-tea/blueberry-merlot/',
 '/store/gourmet-tea/bombay-chai/',
 '/store/gourmet-tea/caramel-nougat/',
 '/store/gourmet-tea/chai-matcha/',
 '/store/gourmet-tea/chamomile-citron/',
 '/store/gourmet-tea/cherry-amour/',
 '/store/gourmet-tea/hanami/',
 '/store/gourmet-tea/cherry-cosmo/',
 '/store/gourmet-tea/cherry-marzipan/',
 '/store/gourmet-tea/chocolate-matcha/',
 '/store/gourmet-tea/chocolate-rose/',
 '/store/gourmet-tea/citrus-mint/',
 '/store/gourmet-tea/coconut-matcha/',
 '/store/gourmet-tea/cucumber-mint/',
 '/store/gourmet-tea/darjeeling-quince-tea/',
 '/store/gourmet-tea/decaf-breakfast/',
 '/store/gourmet-tea/defense/',
 '/store/gourmet-tea/earl-grey/',
 '/store/gourmet-tea/english-breakfast/',
 '/stor
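
The selection and href extraction above can be tried offline on a small snippet. This is only a sketch: the HTML string is a hypothetical stand-in modelled on the markup shown earlier, and the `section-title` heading is my own invention to show why a class-qualified selector may be safer than the bare `h2 > a`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the listing page (not the real site markup).
html = """
<h2 class="product-card--name">
  <a href="/store/gourmet-tea/african-solstice/">African Solstice</a>
</h2>
<h2 class="section-title"><a href="/about/">About us</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")

# 'h2 > a' matches every link that is a direct child of any <h2> ...
all_h2_links = [a["href"] for a in soup.select("h2 > a")]

# ... while the class-qualified selector keeps only the product cards.
product_links = [a["href"] for a in soup.select("h2.product-card--name > a")]

print(all_h2_links)
print(product_links)
```

On the real page the two selectors may well return the same links; the narrower one simply guards against other headings that happen to contain a link.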

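Next, I add the missing part of the address, which is common to all the extracted parts. Here I do it by joining the parts with the base address as a separator and then splitting the result again.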

In [50]:
start = 'https://www.teaforte.com'
start = start.join(page)
start

'/store/gourmet-tea/african-solstice/https://www.teaforte.com/store/gourmet-tea/apricot-amaretto/https://www.teaforte.com/store/gourmet-tea/belgian-mint/https://www.teaforte.com/store/gourmet-tea/berry-basket/https://www.teaforte.com/store/gourmet-tea/black-cherry/https://www.teaforte.com/store/gourmet-tea/black-currant/https://www.teaforte.com/store/gourmet-tea/blood-orange/https://www.teaforte.com/store/gourmet-tea/blueberry-merlot/https://www.teaforte.com/store/gourmet-tea/bombay-chai/https://www.teaforte.com/store/gourmet-tea/caramel-nougat/https://www.teaforte.com/store/gourmet-tea/chai-matcha/https://www.teaforte.com/store/gourmet-tea/chamomile-citron/https://www.teaforte.com/store/gourmet-tea/cherry-amour/https://www.teaforte.com/store/gourmet-tea/hanami/https://www.teaforte.com/store/gourmet-tea/cherry-cosmo/https://www.teaforte.com/store/gourmet-tea/cherry-marzipan/https://www.teaforte.com/store/gourmet-tea/chocolate-matcha/https://www.teaforte.com/store/gourmet-tea/chocolate-

In [51]:
links = start.split('https://')
links

['/store/gourmet-tea/african-solstice/',
 'www.teaforte.com/store/gourmet-tea/apricot-amaretto/',
 'www.teaforte.com/store/gourmet-tea/belgian-mint/',
 'www.teaforte.com/store/gourmet-tea/berry-basket/',
 'www.teaforte.com/store/gourmet-tea/black-cherry/',
 'www.teaforte.com/store/gourmet-tea/black-currant/',
 'www.teaforte.com/store/gourmet-tea/blood-orange/',
 'www.teaforte.com/store/gourmet-tea/blueberry-merlot/',
 'www.teaforte.com/store/gourmet-tea/bombay-chai/',
 'www.teaforte.com/store/gourmet-tea/caramel-nougat/',
 'www.teaforte.com/store/gourmet-tea/chai-matcha/',
 'www.teaforte.com/store/gourmet-tea/chamomile-citron/',
 'www.teaforte.com/store/gourmet-tea/cherry-amour/',
 'www.teaforte.com/store/gourmet-tea/hanami/',
 'www.teaforte.com/store/gourmet-tea/cherry-cosmo/',
 'www.teaforte.com/store/gourmet-tea/cherry-marzipan/',
 'www.teaforte.com/store/gourmet-tea/chocolate-matcha/',
 'www.teaforte.com/store/gourmet-tea/chocolate-rose/',
 'www.teaforte.com/store/gourmet-tea/citru

The first element still lacks the domain and the others lack the 'https://' prefix, so to make the addresses complete let's do this:

In [52]:
begin ='https://'

for link in links:
    if link == links[0]:
        link = 'https://www.teaforte.com' + links[0]
        pages.append(link)
    else:
        link = begin + link
        pages.append(link)                        
pages

['https://www.teaforte.com/store/gourmet-tea/african-solstice/',
 'https://www.teaforte.com/store/gourmet-tea/apricot-amaretto/',
 'https://www.teaforte.com/store/gourmet-tea/belgian-mint/',
 'https://www.teaforte.com/store/gourmet-tea/berry-basket/',
 'https://www.teaforte.com/store/gourmet-tea/black-cherry/',
 'https://www.teaforte.com/store/gourmet-tea/black-currant/',
 'https://www.teaforte.com/store/gourmet-tea/blood-orange/',
 'https://www.teaforte.com/store/gourmet-tea/blueberry-merlot/',
 'https://www.teaforte.com/store/gourmet-tea/bombay-chai/',
 'https://www.teaforte.com/store/gourmet-tea/caramel-nougat/',
 'https://www.teaforte.com/store/gourmet-tea/chai-matcha/',
 'https://www.teaforte.com/store/gourmet-tea/chamomile-citron/',
 'https://www.teaforte.com/store/gourmet-tea/cherry-amour/',
 'https://www.teaforte.com/store/gourmet-tea/hanami/',
 'https://www.teaforte.com/store/gourmet-tea/cherry-cosmo/',
 'https://www.teaforte.com/store/gourmet-tea/cherry-marzipan/',
 'https://
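
As an alternative to the join, split and prefix steps above, the same absolute addresses can probably be built in one pass with `urljoin` from the standard library. This is a minimal sketch, assuming the relative links look like the ones extracted earlier; the names `base`, `relative_links` and `full_links` are mine.

```python
from urllib.parse import urljoin

base = "https://www.teaforte.com"

# Two of the relative addresses extracted from the listing page.
relative_links = [
    "/store/gourmet-tea/african-solstice/",
    "/store/gourmet-tea/apricot-amaretto/",
]

# urljoin resolves each relative path against the base address.
full_links = [urljoin(base, href) for href in relative_links]
print(full_links)
```

A list comprehension such as `[base + href for href in relative_links]` would give the same result here; `urljoin` additionally copes with links that are already absolute.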

Now we have all the addresses that need to be visited in order to collect the rest of the data. If I make a request to the first address in the list 'pages' and parse the response p0 into the soup object soup0, I can easily extract the name and the rest of the data I want to collect. I am also going to break the soup object into smaller parts, since the data I need are contained in the 'body' tag. Here is how it looks:

In [53]:
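# request the first tea page and parse it
p0 = requests.get(pages[0])
soup0 = BeautifulSoup(p0.text, 'html.parser')
# inspect the top-level nodes; on this page the <html> tag is at index 2
#[type(item) for item in list(soup0.children)]
html0 = list(soup0.children)[2]
# the <body> tag is at index 3 among the children of <html>
body0 = list(html0.children)[3]
#body0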

In [54]:
name = body0.select("div .product-card--name > h1")
name

[<h1 class="title data-layer-flavor" data-flavor-name="African Solstice" itemprop="name">African Solstice</h1>]
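
Picking the body by position, as in `list(soup0.children)[2]`, depends on the exact layout of doctype, whitespace and tags in the downloaded file. A less position-dependent sketch is to ask Beautiful Soup for the tag directly with `soup.body`. The HTML below is a hypothetical stand-in, not the real product page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified product page.
html = """<!DOCTYPE html>
<html>
<head><title>African Solstice</title></head>
<body>
  <div class="product-card--name">
    <h1 class="title data-layer-flavor">African Solstice</h1>
  </div>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# soup.body returns the first <body> tag wherever it sits in the tree.
body = soup.body

# select_one returns the first match (or None) instead of a list.
heading = body.select_one("h1.title")
name = heading.get_text(strip=True)
print(name)
```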

Now I will make a request for every address in the pages list and store each body object in a list as well.

In [55]:
bodies = []
for item in pages:
    # download and parse each tea page
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')
    # <html> is at index 2 of the top-level nodes, <body> at index 3 inside it
    html = list(soup.children)[2]
    body = list(html.children)[3]
    bodies.append(body)
#body
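
Since this loop sends 82 requests in a row, it may be worth being a little more careful with the server. The sketch below is one possible variant, not the code used in this project: the function name `fetch_bodies` and its `delay` and `timeout` parameters are my own assumptions, while `requests.Session`, the `timeout` argument and `raise_for_status()` are part of the requests library.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_bodies(urls, session=None, delay=1.0, timeout=10):
    """Download each URL and return the parsed <body> tags (hypothetical helper)."""
    session = session or requests.Session()  # reuse one connection for all pages
    bodies = []
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # stop early on 4xx/5xx answers
        soup = BeautifulSoup(response.text, "html.parser")
        bodies.append(soup.body)
        time.sleep(delay)  # short pause between requests
    return bodies
```

Calling `fetch_bodies(pages)` should then give a list comparable to `bodies` above, at the cost of roughly one extra second per page.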

In [56]:
bodies[0]
name0 = bodies[0].find('h1', class_ = 'title data-layer-flavor').get_text()
name0

'African Solstice'

In [57]:
print(len(bodies))

82


From this point on it is easy to extract everything I need; I am just going to change the tags that contain the necessary data.

In [58]:
for body in bodies:
    name = body.find('h1', class_ = 'title data-layer-flavor').get_text()
    names.append(name)
names

['African Solstice',
 'Apricot Amaretto',
 'Belgian Mint Tea',
 'Berry Basket',
 'Black Cherry',
 'Black Currant Tea',
 'Blood Orange Tea',
 'Blueberry Merlot Tea',
 'Bombay Chai Tea',
 'Caramel Nougat',
 'Chai Matcha',
 'Chamomile Citron Tea',
 'Cherry Amour',
 'Cherry Blossom Hanami',
 'Cherry Cosmo Tea',
 'Cherry Marzipan',
 'Chocolate Matcha',
 'Chocolate Rose Tea',
 'Citrus Mint',
 'Coconut Matcha',
 'Cucumber Mint',
 'Darjeeling Quince Tea',
 'Decaf Breakfast Tea',
 'Defense',
 'Earl Grey Tea',
 'English Breakfast Tea',
 'Estate Darjeeling',
 'Formosa Oolong',
 'Ginger Guru Chai',
 'Ginger Lemongrass',
 'Ginger Matcha',
 'Ginger Snap',
 'Green Mango Peach',
 'Harvest Apple Spice',
 'Hibiscus Blossom',
 'Iced Blood Orange',
 'Iced Blueberry Merlot',
 'Iced Ceylon Gold',
 'Iced Ginger Pear',
 'Iced Mango Peach',
 'Iced Raspberry Nectar',
 'Invigorate',
 'Jasmine Green Tea',
 'Kiwi Lime Ginger Tea',
 'Lemon Lavender Tea',
 'Lemon Sorbetti Tea',
 'Lemon Vervain',
 'Mango Mélange',
 '

In [59]:
for body in bodies:
    score = body.find('span', class_ = 'review-score').get_text()
    review_scores.append(score)
review_scores

['4.9',
 '4.6',
 '4.8',
 '4.7',
 '4.3',
 '4.9',
 '4.5',
 '4.6',
 '4.8',
 '4.6',
 '4.7',
 '4.8',
 '4.3',
 '4.8',
 '4.5',
 '4.7',
 '4.7',
 '4.4',
 '4.7',
 '4.6',
 '4.7',
 '4.8',
 '4.8',
 '4.7',
 '4.9',
 '4.9',
 '4.8',
 '4.7',
 '4.5',
 '4.7',
 '4.8',
 '4.3',
 '4.7',
 '4.7',
 '4.8',
 '4.7',
 '4.8',
 '5.0',
 '4.9',
 '4.9',
 '4.9',
 '4.6',
 '4.8',
 '3.8',
 '4.5',
 '4.7',
 '4.5',
 '4.5',
 '4.9',
 '4.6',
 '4.8',
 '4.9',
 '4.5',
 '4.8',
 '3.5',
 '4.6',
 '4.6',
 '4.4',
 '4.6',
 '4.9',
 '4.9',
 '4.3',
 '4.6',
 '4.8',
 '4.8',
 '4.7',
 '4.6',
 '4.2',
 '4.7',
 '4.5',
 '4.3',
 '4.3',
 '4.8',
 '4.5',
 '4.5',
 '4.9',
 '4.6',
 '4.6',
 '4.8',
 '4.8',
 '4.6',
 '4.7']
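
The loops in this section call `.get_text()` directly on the result of `find`. That works here because every page has the element, but `find` returns `None` when nothing matches, and the call would then raise an `AttributeError`. A small guard could look like the sketch below; the helper name `text_or_none` and the two HTML snippets are hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, **attrs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical fragments: one with a score, one without.
with_score = BeautifulSoup('<div><span class="review-score">4.9</span></div>', "html.parser")
without_score = BeautifulSoup("<div><span>No reviews yet</span></div>", "html.parser")

print(text_or_none(with_score, "span", class_="review-score"))
print(text_or_none(without_score, "span", class_="review-score"))
```

Appending `None` for a missing value also keeps all the lists the same length, which matters later when they are combined into one data frame.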

In [69]:
for body in bodies:
    review = body.find('count', itemprop = 'reviewCount').get_text()
    reviewes.append(review)
reviewes

['651',
 '170',
 '115',
 '49',
 '92',
 '259',
 '104',
 '295',
 '314',
 '201',
 '41',
 '479',
 '3',
 '315',
 '106',
 '191',
 '66',
 '131',
 '242',
 '51',
 '139',
 '64',
 '147',
 '9',
 '886',
 '1134',
 '173',
 '28',
 '49',
 '237',
 '29',
 '29',
 '328',
 '162',
 '49',
 '68',
 '33',
 '126',
 '85',
 '71',
 '84',
 '9',
 '649',
 '38',
 '84',
 '150',
 '2',
 '65',
 '13',
 '83',
 '162',
 '133',
 '98',
 '334',
 '2',
 '51',
 '134',
 '42',
 '150',
 '7',
 '12',
 '26',
 '122',
 '249',
 '35',
 '50',
 '132',
 '6',
 '64',
 '8',
 '62',
 '235',
 '70',
 '6',
 '105',
 '9',
 '47',
 '163',
 '714',
 '104',
 '11',
 '243']

In [70]:
for body in bodies:
    category = body.find('span', class_ = 'c-herbal').get_text()
    categories.append(category)
categories

['Herbal Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Black Tea',
 'Black Tea',
 'Black Tea',
 'Black Tea',
 'Herbal Tea',
 'Black Tea',
 'Black Tea',
 'Green Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Green Tea',
 'Herbal Tea',
 'Green Tea',
 'Green Tea',
 'Black Tea',
 'Herbal Tea',
 'Green Tea',
 'Green Tea',
 'Black Tea',
 'Black Tea',
 'Green Tea',
 'Black Tea',
 'Black Tea',
 'Black Tea',
 'Oolong Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Green Tea',
 'Black Tea',
 'Green Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Black Tea',
 'Iced Tea',
 'Black Tea',
 'White Tea',
 'Green Tea',
 'Herbal Tea',
 'Green Tea',
 'Green Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Green Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Black Tea',
 'Herbal Tea',
 'Green Tea',
 'Oolong Tea',
 'Green Tea',
 'Black Tea',
 'Green Tea',
 'White Tea',
 'Black Tea',
 'White Tea',
 'Green Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Black Tea',
 'Black Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Black Tea',
 'Green Tea',
 'Herbal Tea',
 'Herbal Tea',
 'Herbal 

In [71]:
for body in bodies:
    description = body.find('p', itemprop = 'description').get_text()
    descriptiones.append(description)
descriptiones

['From the soul of South Africa, this naturally caffeine-free, antioxidant-rich rooibos herb is layered with sweet berries and rose. A smooth and delicious cup.',
 'The taste of juicy apricot and almond pair for a refreshing infusion.',
 'Rich dark chocolate and cool peppermint: a classic dessert pairing.\r\n\r\nNOTE: While this tea is herbal, it does contain a very small amount of caffeine that naturally occurs in the cocoa husk (approximately 0.5 - 3mg caffeine per serving).',
 'Sun-ripened berries and cornflower petals brighten up classic black tea.',
 'Vanilla, licorice, and strawberry sweeten a classic flavor pairing.',
 'A lush, fruity, sweet steep. Blackberry leaves add a floral note.',
 'A bold ruby blend with a sweet zing. Delicious hot or iced.',
 'AWARD WINNER: Best Herbal Tea at the North American Tea Championships.\r\nA fruity, herbal blend with sweet berries and savory sage.',
 'A traditional blend of warming spices for a timeless ritual.',
 'AWARD WINNER: at the North Am

Now let's do the rest of the job: make a dictionary and, ultimately, a data frame that can be used for further examination. I will also store it as a CSV file.

In [73]:
tea_forte = {'Name' : names, 'Category' : categories, 'Description': descriptiones, 'Review': reviewes, 'Review Score' : review_scores}
tea_forte

{'Name': ['African Solstice',
  'Apricot Amaretto',
  'Belgian Mint Tea',
  'Berry Basket',
  'Black Cherry',
  'Black Currant Tea',
  'Blood Orange Tea',
  'Blueberry Merlot Tea',
  'Bombay Chai Tea',
  'Caramel Nougat',
  'Chai Matcha',
  'Chamomile Citron Tea',
  'Cherry Amour',
  'Cherry Blossom Hanami',
  'Cherry Cosmo Tea',
  'Cherry Marzipan',
  'Chocolate Matcha',
  'Chocolate Rose Tea',
  'Citrus Mint',
  'Coconut Matcha',
  'Cucumber Mint',
  'Darjeeling Quince Tea',
  'Decaf Breakfast Tea',
  'Defense',
  'Earl Grey Tea',
  'English Breakfast Tea',
  'Estate Darjeeling',
  'Formosa Oolong',
  'Ginger Guru Chai',
  'Ginger Lemongrass',
  'Ginger Matcha',
  'Ginger Snap',
  'Green Mango Peach',
  'Harvest Apple Spice',
  'Hibiscus Blossom',
  'Iced Blood Orange',
  'Iced Blueberry Merlot',
  'Iced Ceylon Gold',
  'Iced Ginger Pear',
  'Iced Mango Peach',
  'Iced Raspberry Nectar',
  'Invigorate',
  'Jasmine Green Tea',
  'Kiwi Lime Ginger Tea',
  'Lemon Lavender Tea',
  'Lemon

In [74]:
import pandas as pd

In [75]:
df = pd.DataFrame(tea_forte)
df.index += 1

In [76]:
df.head()

| | Name | Category | Description | Review | Review Score |
|---|---|---|---|---|---|
| 1 | African Solstice | Herbal Tea | From the soul of South Africa, this naturally ... | 651 | 4.9 |
| 2 | Apricot Amaretto | Herbal Tea | The taste of juicy apricot and almond pair for... | 170 | 4.6 |
| 3 | Belgian Mint Tea | Herbal Tea | Rich dark chocolate and cool peppermint: a cla... | 115 | 4.8 |
| 4 | Berry Basket | Black Tea | Sun-ripened berries and cornflower petals brig... | 49 | 4.7 |
| 5 | Black Cherry | Black Tea | Vanilla, licorice, and strawberry sweeten a cl... | 92 | 4.3 |
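
The review counts and scores were scraped as text, so in the data frame they are strings. For further examination they will most likely need to be numeric. Here is a minimal sketch of the conversion with `pd.to_numeric`, on a small stand-in frame with the same column names; `errors="coerce"` turns anything unparsable into NaN.

```python
import pandas as pd

# Small stand-in frame with string-typed columns, like the scraped data.
df = pd.DataFrame({
    "Name": ["African Solstice", "Apricot Amaretto"],
    "Review": ["651", "170"],
    "Review Score": ["4.9", "4.6"],
})

# Convert the text columns to numbers.
df["Review"] = pd.to_numeric(df["Review"], errors="coerce")
df["Review Score"] = pd.to_numeric(df["Review Score"], errors="coerce")

print(df.dtypes)
```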


In [77]:
df.to_csv(r'D:\Data\tea_forte.csv', index = False, header = True)
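
The path above is specific to my Windows machine. A more portable sketch, assuming you want the file in a folder of your choice, builds the path with `pathlib` and creates the folder if it is missing. The helper name `save_frame` and its parameters are hypothetical.

```python
from pathlib import Path

import pandas as pd


def save_frame(df, out_dir, filename="tea_forte.csv"):
    """Write df as CSV into out_dir, creating the folder if needed (hypothetical helper)."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)  # index=False leaves out the row index
    return out_path
```

For example, `save_frame(df, "data")` would write `data/tea_forte.csv` relative to the current working directory.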