# **Data Mining**

Tutorial based on https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/


# **1. The Requests Library:**

Let’s try downloading a simple sample website, https://dataquestio.github.io/web-scraping-pages/simple.html.

We’ll need to first import the requests library, and then download the page using the requests.get method:




In [None]:
import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully. A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
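
The status code ranges described above can be sketched as a small helper. This is only an illustration: `describe_status` is a hypothetical name, not part of requests, and the descriptive phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Classify an HTTP status code by the range it falls in."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Print the standard phrase and our classification for a few common codes
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

With a real response, the same check would be applied to page.status_code.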

We can print out the HTML content of the page using the content property:

In [None]:
page.content


b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
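
The b prefix in the output shows that content is a bytes object rather than a string. Here is a minimal sketch of turning such bytes into text, assuming the page is UTF-8 encoded. The bytes are a shortened copy of the output above, so no download is needed. With a real response, the text attribute of the Response object does this decoding for us.

```python
# Bytes as returned by page.content (shortened copy of the output above)
raw = b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n</html>'

# Decode the bytes into a str; UTF-8 is an assumption here
html_text = raw.decode("utf-8")

print(type(raw).__name__, type(html_text).__name__)
print(html_text.splitlines()[0])
```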

# **2. Parsing a page with BeautifulSoup:**
As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


This step isn’t strictly necessary, and we won’t always bother with it, but it can be helpful to look at prettified HTML to make the structure of the page and where tags are nested easier to see.

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup.

Note that children returns an iterator rather than a list, so we need to call the list function on it:

In [None]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

The above tells us that there are two elements at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

In [None]:
[type(item) for item in list(soup.children)]
# For example, the output shows that the initial 'html' item is the document type,
# the newline character is a NavigableString, and the final item is a Tag object.

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As we can see, all of the items are BeautifulSoup objects:

The first is a Doctype object, which contains information about the type of the document.
The second is a NavigableString, which represents text found in the HTML document.
The final item is a Tag object, which contains other nested tags.
The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects in the BeautifulSoup documentation.

We can now select the html tag and its children by taking the third item in the list:

In [None]:
html = list(soup.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.

Now, we can find the children inside the html tag:

In [None]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

As we can see above, there are two tags here, head, and body. We want to extract the text inside the p tag, so we’ll dive into the body:

In [None]:
body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

In [None]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

In [None]:
p = list(body.children)[1]

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

In [None]:
p.get_text()

'Here is some simple content for this page.'
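
The hard-coded indexes used above ([2], [3], [1]) depend on where the newline strings happen to fall. One possible way to make the walk less fragile is to keep only the Tag children at each level. This is a sketch on an inline copy of the sample page. `tag_children` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


def tag_children(node):
    """Return only the Tag children, skipping the doctype and whitespace strings."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(HTML, "html.parser")
html = tag_children(soup)[0]        # the only top-level tag
head, body = tag_children(html)     # the two tags inside <html>
p = tag_children(body)[0]           # the only tag inside <body>
print(p.get_text())
```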

# **3. Finding all instances of a tag at once**

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple.

If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [None]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

In [None]:
soup.find('p')

<p>Here is some simple content for this page.</p>
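
One difference between the two methods shows up when nothing matches: find_all gives back an empty list, while find gives back None. Here is a small sketch on an inline copy of the sample paragraph. The table tag is only an example of something that is absent from the page.

```python
from bs4 import BeautifulSoup

HTML = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(HTML, "html.parser")

# There is no <table> in this document
missing_all = soup.find_all("table")
missing_one = soup.find("table")

print(len(missing_all))
print(missing_one)

# Guard against None before calling methods on the result of find
text = missing_one.get_text() if missing_one is not None else ""
```

Checking for None before calling get_text avoids an AttributeError when a page does not contain the tag we expect.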

# **4. Searching for tags by class and id**

HTML elements can carry class and id attributes, and it may not be obvious at first why they are useful for scraping.

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we’re scraping, we can also use them to specify the elements we want to scrape.



The example document for this section is available at the URL https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html.

Let’s first download the page and create a BeautifulSoup object:

In [None]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

In [None]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In the below example, we’ll look for any tag that has the class outer-text:

In [None]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by id:



In [None]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
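
The same searches can also be written as CSS selectors with the select and select_one methods, which the weather example later relies on. Here is a sketch on an inline, whitespace-trimmed copy of the document above, so it runs without a download.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

# "p.outer-text" matches p tags that have the class outer-text
outer = soup.select("p.outer-text")

# "div p" matches p tags nested anywhere inside a div
inside_div = soup.select("div p")

# "#first" matches the element whose id is first; select_one returns a single tag
first = soup.select_one("#first")

print(len(outer), len(inside_div), first.get_text())
```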

# **5. Practice with weather data**

We now know enough to download a real page and start parsing it. The page used here is the National Weather Service forecast for San Francisco. In the below code, we will:

Download the web page containing the forecast.
Create a BeautifulSoup object to parse the page.
Find the div with id seven-day-forecast, and assign it to seven_day.
Inside seven_day, find each individual forecast item.
Extract and print the first forecast item.

In [None]:
# Download the forecast page with the requests library
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
# Parse the downloaded HTML with BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
# Find the first element whose id is seven-day-forecast
seven_day = soup.find(id="seven-day-forecast")
# Each forecast period is stored in an element with the class tombstone-container
forecast_items = seven_day.find_all(class_="tombstone-container")
# Take the first forecast item
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Increasing clouds, with a low around 55. Breezy, with a west wind 13 to 23 mph, with gusts as high as 30 mph. " class="forecast-icon" src="DualImage.php?i=nwind_sct&amp;j=nbkn" title="Tonight: Increasing clouds, with a low around 55. Breezy, with a west wind 13 to 23 mph, with gusts as high as 30 mph. "/>
 </p>
 <p class="short-desc">
  Partly Cloudy
  <br/>
  and Breezy
  <br/>
  then Mostly
  <br/>
  Cloudy
 </p>
 <p class="temp temp-low">
  Low: 55 °F
 </p>
</div>


As we can see, inside the forecast item tonight is all the information we want. There are four pieces of information we can extract:

The name of the forecast item: in this case, Tonight.
The description of the conditions: this is stored in the title attribute of img.
A short description of the conditions: in this case, Partly Cloudy and Breezy then Mostly Cloudy.
The temperature low: in this case, 55 degrees.

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

In [None]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Tonight
Partly Cloudyand Breezythen MostlyCloudy
Low: 55 °F
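
In the short description above the words run together ("Cloudyand") because the text is split by br tags and get_text joins the pieces with nothing in between. One way around this, sketched on an inline copy of the paragraph, is to pass the separator and strip arguments of get_text.

```python
from bs4 import BeautifulSoup

HTML = '<p class="short-desc">Partly Cloudy<br/>and Breezy<br/>then Mostly<br/>Cloudy</p>'
p = BeautifulSoup(HTML, "html.parser").find(class_="short-desc")

# Default behaviour: the text pieces are concatenated directly
joined = p.get_text()

# Put a space between the pieces and strip surrounding whitespace from each one
spaced = p.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```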


Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

In [None]:
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Increasing clouds, with a low around 55. Breezy, with a west wind 13 to 23 mph, with gusts as high as 30 mph. 
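
Dictionary-style access raises a KeyError when the attribute is missing. As a sketch, the get method of a tag can be used instead when an attribute may be absent. The width attribute below is only an example of a missing attribute, and the inline img tag is a shortened copy of the one above.

```python
from bs4 import BeautifulSoup

HTML = ('<p><img alt="Tonight: Increasing clouds" class="forecast-icon" '
        'src="DualImage.php?i=nwind_sct&amp;j=nbkn" '
        'title="Tonight: Increasing clouds, with a low around 55."/></p>')
img = BeautifulSoup(HTML, "html.parser").find("img")

# Dictionary-style access works for attributes that exist
title = img["title"]

# get returns None (or a chosen default) instead of raising KeyError
width = img.get("width")
width_or_default = img.get("width", "unknown")

# class can hold several values, so BeautifulSoup returns it as a list
classes = img["class"]

print(title)
print(width, width_or_default, classes)
```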


Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we will:

Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight']

As we can see above, our technique gets us each of the period names, in order.

We can apply the same technique to get the other three fields:

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Partly Cloudyand Breezythen MostlyCloudy', 'Mostly Sunnythen Sunnyand Breezy', 'Mostly Clearand Breezythen MostlyClear', 'Sunny thenSunny andBreezy', 'Clear andBreezy thenMostly Clear', 'Sunny', 'Mostly Clearand Breezythen MostlyClear', 'Sunny', 'Partly Cloudy']
['Low: 55 °F', 'High: 70 °F', 'Low: 57 °F', 'High: 75 °F', 'Low: 55 °F', 'High: 72 °F', 'Low: 55 °F', 'High: 69 °F', 'Low: 55 °F']
['Tonight: Increasing clouds, with a low around 55. Breezy, with a west wind 13 to 23 mph, with gusts as high as 30 mph. ', 'Monday: Partly sunny, then gradually becoming sunny, with a high near 70. Breezy, with a west wind 13 to 18 mph increasing to 20 to 25 mph in the afternoon. Winds could gust as high as 32 mph. ', 'Monday Night: Mostly clear, with a low around 57. Breezy, with a west wind 13 to 23 mph, with gusts as high as 30 mph. ', 'Tuesday: Sunny, with a high near 75. Breezy, with a west wind 15 to 23 mph, with gusts as high as 30 mph. ', 'Tuesday Night: Mostly clear, with a low around 55
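
Building four separate lists works as long as every forecast item contains all four fields. If one item lacked, say, a temperature, the lists would have different lengths and no longer line up. A possible alternative, sketched here on made-up HTML that imitates the page structure, is to loop over each tombstone-container and collect one record per item. `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the forecast structure; the second item lacks a temp and an img
HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img title="Tonight: Increasing clouds." src="a.png"/></p>
    <p class="short-desc">Partly Cloudy</p>
    <p class="temp temp-low">Low: 55 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Monday</p>
    <p class="short-desc">Mostly Sunny</p>
  </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the text of the first match for selector, or None if there is none."""
    node = parent.select_one(selector)
    return node.get_text(separator=" ", strip=True) if node is not None else None


soup = BeautifulSoup(HTML, "html.parser")
rows = []
for item in soup.select("#seven-day-forecast .tombstone-container"):
    img = item.find("img")
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img.get("title") if img is not None else None,
    })

for row in rows:
    print(row)
```

A list of dictionaries like rows can be passed directly to pd.DataFrame, with missing fields showing up as empty values.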

# **6. Combining our data into a pandas DataFrame**

We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary.

Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

In [None]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

Unnamed: 0,period,short_desc,temp,desc
0,Tonight,Partly Cloudyand Breezythen MostlyCloudy,Low: 55 °F,"Tonight: Increasing clouds, with a low around ..."
1,Monday,Mostly Sunnythen Sunnyand Breezy,High: 70 °F,"Monday: Partly sunny, then gradually becoming ..."
2,MondayNight,Mostly Clearand Breezythen MostlyClear,Low: 57 °F,"Monday Night: Mostly clear, with a low around ..."
3,Tuesday,Sunny thenSunny andBreezy,High: 75 °F,"Tuesday: Sunny, with a high near 75. Breezy, w..."
4,TuesdayNight,Clear andBreezy thenMostly Clear,Low: 55 °F,"Tuesday Night: Mostly clear, with a low around..."
5,Wednesday,Sunny,High: 72 °F,"Wednesday: Sunny, with a high near 72."
6,WednesdayNight,Mostly Clearand Breezythen MostlyClear,Low: 55 °F,"Wednesday Night: Mostly clear, with a low arou..."
7,Thursday,Sunny,High: 69 °F,"Thursday: Sunny, with a high near 69."
8,ThursdayNight,Partly Cloudy,Low: 55 °F,"Thursday Night: Partly cloudy, with a low arou..."


We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

In [None]:
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    55
1    70
2    57
3    75
4    55
5    72
6    55
7    69
8    55
Name: temp_num, dtype: object
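
The output above has dtype object because str.extract returns the matched digits as strings; it is the astype call that turns them into integers. As a sketch, the same method can pull out more than one named group at once, here the Low/High label as well as the number. The column name kind is our own choice, and the three values are copied from the temps list above.

```python
import pandas as pd

temps = pd.Series(["Low: 55 °F", "High: 70 °F", "Low: 57 °F"], name="temp")

# Two named groups give a DataFrame with one column per group
parts = temps.str.extract(r"(?P<kind>Low|High): (?P<temp_num>\d+)")

# The extracted digits are still strings, so convert them explicitly
parts["temp_num"] = parts["temp_num"].astype(int)

print(parts)
```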

We could then find the mean of all the high and low temperatures:

In [None]:
weather["temp_num"].mean()

62.55555555555556

We could also only select the rows that happen at night:

In [None]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool

In [None]:
weather[is_night]

Unnamed: 0,period,short_desc,temp,desc,temp_num,is_night
0,Tonight,Partly Cloudyand Breezythen MostlyCloudy,Low: 55 °F,"Tonight: Increasing clouds, with a low around ...",55,True
2,MondayNight,Mostly Clearand Breezythen MostlyClear,Low: 57 °F,"Monday Night: Mostly clear, with a low around ...",57,True
4,TuesdayNight,Clear andBreezy thenMostly Clear,Low: 55 °F,"Tuesday Night: Mostly clear, with a low around...",55,True
6,WednesdayNight,Mostly Clearand Breezythen MostlyClear,Low: 55 °F,"Wednesday Night: Mostly clear, with a low arou...",55,True
8,ThursdayNight,Partly Cloudy,Low: 55 °F,"Thursday Night: Partly cloudy, with a low arou...",55,True
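
The overall mean mixes daytime highs with nighttime lows. As a sketch of one further step, the two can be averaged separately with groupby. The temperature strings are copied from the output above, and the labels night and day are our own.

```python
import pandas as pd

weather = pd.DataFrame({
    "temp": ["Low: 55 °F", "High: 70 °F", "Low: 57 °F", "High: 75 °F", "Low: 55 °F",
             "High: 72 °F", "Low: 55 °F", "High: 69 °F", "Low: 55 °F"],
})

# Numeric temperature, as in the cells above
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Label each row as night or day based on whether it reports a low
weather["part_of_day"] = weather["temp"].str.contains("Low").map({True: "night", False: "day"})

# Mean temperature for each label
means = weather.groupby("part_of_day")["temp_num"].mean()
print(means)
```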


# **7. Individual Practice of Weather Scores**

---



Using the acquired knowledge, scrape the best weather conditions of each province in Vietnam:

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download and parse the Vietnamese weather page
page = requests.get("https://thoitietvietnam.locvy.com")
soup = BeautifulSoup(page.content, 'html.parser')

# Judging by the selectors below, this column holds an h5 (province name)
# and a class-less div (temperature) for each province
weather_table = soup.find(class_="col-md-3 wea-right")
province_names = [x.get_text() for x in weather_table.select("h5")]
trueweather = [y.get_text() for y in weather_table.find_all('div', class_=None)]

# Combine the two lists into a DataFrame, one row per province
weather = pd.DataFrame({"Province Name": province_names,
                        "Temperature": trueweather})
weather

Unnamed: 0,Province Name,Temperature
0,Hà Nội,36°C
1,Hải Phòng,34°C
2,Bắc Giang,36°C
3,Bắc Kạn,34°C
4,Bắc Ninh,36°C
...,...,...
58,Long An,33°C
59,Tây Ninh,33°C
60,Tiền Giang,33°C
61,Trà Vinh,32°C
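
The Temperature column holds strings such as 36°C, which cannot be compared or sorted numerically as they are. Here is a minimal sketch of converting them, using the first five rows of the table above as sample data. `to_celsius` is a hypothetical helper.

```python
import re

# First five rows of the table above
rows = [("Hà Nội", "36°C"), ("Hải Phòng", "34°C"), ("Bắc Giang", "36°C"),
        ("Bắc Kạn", "34°C"), ("Bắc Ninh", "36°C")]


def to_celsius(text):
    """Pull the integer out of a string like '36°C'; return None if there is none."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


parsed = [(name, to_celsius(temp)) for name, temp in rows]

# Sort from coolest to warmest; ties keep their original order
coolest_first = sorted(parsed, key=lambda row: row[1])
print(coolest_first)
```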


# **8. Individual Practice of Mining Canonical SMILES according to CID and appending to pre-existing CSV file**

Using PubChem and the downloaded CSV file:
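
The cell below downloads the compound page itself, and its closing comment notes that the page is dynamic. One possible alternative, sketched here under assumptions, is PubChem's PUG REST interface, which returns properties as JSON. The property name CanonicalSMILES, the response shape, and the helper names `smiles_url` and `extract_smiles` are all assumptions for illustration and should be checked against the PUG REST documentation. The payload in the sketch is hand-written, not a real response.

```python
import json
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_url(cid, prop="CanonicalSMILES"):
    """Build a PUG REST URL asking for one property of one compound as JSON."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{quote(prop)}/JSON"


def extract_smiles(payload):
    """Map CID -> SMILES from a property response (response shape is assumed)."""
    result = {}
    for row in payload["PropertyTable"]["Properties"]:
        # Accept any key ending in "SMILES", since the exact key name is an assumption
        smiles = next((v for k, v in row.items() if k.endswith("SMILES")), None)
        result[row["CID"]] = smiles
    return result


# Hand-written payload imitating the assumed shape (ethanol, CID 702)
sample = json.loads('{"PropertyTable": {"Properties": [{"CID": 702, "CanonicalSMILES": "CCO"}]}}')

print(smiles_url(3034011))
print(extract_smiles(sample))
```

With network access, requests.get(smiles_url(cid)).json() would supply the payload, and the resulting dictionary could then be mapped onto the CID column of the CSV file.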

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get("https://pubchem.ncbi.nlm.nih.gov/compound/3034011")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
can_smiles = soup.find_all(id ="Canonical-SMILES")
print(can_smiles)

#df = pd.read_csv("/content/DILI_dataset_AID_588211.csv")
#df["Canonical-SMILES"] = ""
# Because the requested page is dynamic (its content is filled in after download), find_all could not be used to locate the SMILES element


<!DOCTYPE html>

<html lang="en">
<head>
<meta content="index,follow,noarchive" name="robots"/>
<meta charset="utf-8"/>
<title>Idoxifene | C28H30INO - PubChem</title>
<script type="application/ld+json">
      {
          "@context": "https://schema.org",
          "@type": "Organization",
          "name": "PubChem",
          "url": "https://pubchem.ncbi.nlm.nih.gov",
          "logo": "https://pubchem.ncbi.nlm.nih.gov/pcfe/logo/PubChem_logo.png",
          "foundingDate": "2004"
      }
      
    </script>
<link href="/pcfe/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/pcfe/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/pcfe/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/pcfe/favicon/manifest.json" rel="manifest"/>
<link color="#0071bc" href="/pcfe/favicon/safari-pinned-tab.svg" rel="mask-icon"/>
<link href="/pcfe/favicon/favicon.ico" rel="shortcut-icon"/>
<link href="http