This notebook follows the tutorial __"Python Tutorial: Web Scraping with BeautifulSoup and Requests"__ by __Corey Schafer__.<br>
The tutorial video is at https://www.youtube.com/watch?v=ng2o98k983k
 

In [102]:
from bs4 import BeautifulSoup
import requests
import csv

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

In [103]:
soup

<!DOCTYPE html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link href="css/normalize.css" rel="stylesheet"/>
<link href="css/main.css" rel="stylesheet"/>
</head>
<body>
<h1 id="site_title">Test Website</h1>
<hr/>
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<hr/>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
<hr/>
<div class="footer">
<p>Footer Information</p>
</div>
<script src="js/vendor/modernizr-3.5.0.min.js"></script>
<script src="js/plugins.js"></script>
<script src="js/main.js"></script>
</body>
</html>

In [104]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>



In [105]:
soup.head.title

<title>Test - A Sample Website</title>

In [106]:
soup.head.title.text

'Test - A Sample Website'

In [107]:
soup.head.link

<link href="css/normalize.css" rel="stylesheet"/>

In [108]:
match = soup.find('div')  # find() returns the first <div> in the document
print(match)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


In [109]:
match = soup.find('div', class_='footer')  # class_ (trailing underscore, since class is a Python keyword) selects the <div> whose class is "footer"
print(match)

<div class="footer">
<p>Footer Information</p>
</div>
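
`find()` returns `None` when nothing matches, so chaining `.p.text` onto a missing element raises `AttributeError`. The following is a minimal sketch of guarding against that. It assumes a small inline HTML string in place of `simple.html`, uses the standard-library `html.parser` so that lxml is not required, and the `sidebar` class and the `demo_soup` name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for simple.html
html = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2></div>
<div class="footer"><p>Footer Information</p></div>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml
demo_soup = BeautifulSoup(html, 'html.parser')

footer = demo_soup.find('div', class_='footer')
sidebar = demo_soup.find('div', class_='sidebar')  # no such div in this HTML

# Only drill into the element when find() actually returned one
footer_text = footer.p.text if footer is not None else None
sidebar_text = sidebar.p.text if sidebar is not None else None

print(footer_text)
print(sidebar_text)
```

The same check is worth adding before any `.h2.a.text` style chain when scraping pages whose structure is not under your control.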


In [110]:
article = soup.find('div', class_='article') 
print(article)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


In [111]:
article.h2.a.text

'Article 1 Headline'

In [112]:
headline = article.h2.a.text
print(headline)

Article 1 Headline


In [113]:
article.p.text

'This is a summary of article 1'

In [114]:
summary = article.p.text
print(summary)

This is a summary of article 1


In [115]:
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    print(headline)
    
    summary = article.p.text
    print(summary)
    
    print()

Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2



In [116]:
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('http://coreyms.com').text

soup = BeautifulSoup(source, 'lxml')

# newline='' keeps the csv module from writing blank rows on Windows;
# the with block closes the file even if the loop raises
with open('cms_scrape.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['headline', 'summary', 'video_link'])

    for article in soup.find_all('article'):
        headline = article.h2.a.text
        print(headline)

        summary = article.find('div', class_='entry-content').p.text
        print(summary)

        try:
            vid_src = article.find('iframe', class_='youtube-player')['src']

            # Take the piece after the fourth '/' and drop any query string
            vid_id = vid_src.split('/')[4]
            vid_id = vid_id.split('?')[0]

            yt_link = f'https://youtube.com/watch?v={vid_id}'
        except (TypeError, KeyError, IndexError):
            # No matching iframe, no src attribute, or an unexpected URL shape
            yt_link = None

        print(yt_link)

        print()

        csv_writer.writerow([headline, summary, yt_link])

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
None

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
None

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how to run code in parallel using the multiprocessing module. We will also 
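
The cell above pulls the video ID out of the iframe `src` by position, with `split('/')[4]` followed by `split('?')[0]`. A slightly more defensive variant, sketched here with the standard library's `urllib.parse`, checks the shape of the path before building the link. The helper name `youtube_watch_link` and the example embed URL are assumptions for illustration, not something taken from the scraped page.

```python
from urllib.parse import urlparse

def youtube_watch_link(embed_src):
    """Turn an embed URL into a watch URL, or return None if no ID is found."""
    path_parts = [part for part in urlparse(embed_src).path.split('/') if part]
    # Expect a path of the form /embed/<video_id>; urlparse already separates
    # the query string, so no manual split on '?' is needed.
    if len(path_parts) == 2 and path_parts[0] == 'embed':
        return f'https://youtube.com/watch?v={path_parts[1]}'
    return None

# Hypothetical embed URL; the video ID is the one from the tutorial link above
example_src = 'https://www.youtube.com/embed/ng2o98k983k?version=3&rel=1'
print(youtube_watch_link(example_src))
print(youtube_watch_link('https://example.com/not-a-video'))
```

Returning `None` for an unexpected URL keeps the behaviour of the `try`/`except` in the cell above, without relying on an exception to signal it.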

In [117]:
# Source page: https://nptel.ac.in/course.html
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://nptel.ac.in/course.html').text

soup = BeautifulSoup(source, 'lxml')

In [118]:
content = soup.tbody.tr
print(content.prettify())

<tr>
 <td>
  <a href="courses/105/107/105107200/" target="_blank">
   NOC:Geomorphology
  </a>
 </td>
 <td>
  Civil Engineering
 </td>
 <td>
  Prof. Pitambar Pati
 </td>
 <td>
  IIT Roorkee
 </td>
 <td>
  Video
 </td>
</tr>



In [119]:
content.td.a

<a href="courses/105/107/105107200/" target="_blank">NOC:Geomorphology</a>

In [120]:
subject_name = content.td.a.text
subject_name

'NOC:Geomorphology'

In [121]:
content.find('a')['href']

'courses/105/107/105107200/'

In [122]:
course_link = 'https://nptel.ac.in/'
course_link += content.find('a')['href']
course_link

'https://nptel.ac.in/courses/105/107/105107200/'
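
Concatenating the site root and the `href` works here because the link is relative to the root. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves an `href` against the URL of the page it came from and also copes with links that are already absolute. The `page_url` and `href` values are copied from the cells above. The `joined_link` name is illustrative, and the `example.com` URL is made up.

```python
from urllib.parse import urljoin

page_url = 'https://nptel.ac.in/course.html'
href = 'courses/105/107/105107200/'

# Resolve the relative href against the page it was found on
joined_link = urljoin(page_url, href)
print(joined_link)

# An href that is already absolute is returned unchanged
print(urljoin(page_url, 'https://example.com/other/'))
```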

In [123]:
content = soup.tbody.tr
print(content.prettify())

<tr>
 <td>
  <a href="courses/105/107/105107200/" target="_blank">
   NOC:Geomorphology
  </a>
 </td>
 <td>
  Civil Engineering
 </td>
 <td>
  Prof. Pitambar Pati
 </td>
 <td>
  IIT Roorkee
 </td>
 <td>
  Video
 </td>
</tr>



In [124]:
for td in content.find_all('td'):
    print(td)

<td><a href="courses/105/107/105107200/" target="_blank">NOC:Geomorphology</a></td>
<td>Civil Engineering</td>
<td>Prof. Pitambar Pati</td>
<td>IIT Roorkee</td>
<td>Video</td>


In [125]:
content.find_all('td')

[<td><a href="courses/105/107/105107200/" target="_blank">NOC:Geomorphology</a></td>,
 <td>Civil Engineering</td>,
 <td>Prof. Pitambar Pati</td>,
 <td>IIT Roorkee</td>,
 <td>Video</td>]

In [126]:
subject_name = content.find_all('td')[0].a.text
subject_name

'NOC:Geomorphology'

In [127]:
course_link = 'https://nptel.ac.in/'
course_link += content.find('a')['href']
course_link

'https://nptel.ac.in/courses/105/107/105107200/'

In [128]:
discipline = content.find_all('td')[1].text
discipline

'Civil Engineering'

In [129]:
prof_name = content.find_all('td')[2].text
prof_name

'Prof. Pitambar Pati'
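
The cells above call `find_all('td')` once per field and index into the result each time. One possible shortcut, sketched below, is to collect the cell texts in a single pass and zip them with column names. The row HTML is copied from the output above and wrapped in a `<table>` so it can be parsed on its own. The column names, the `demo_soup` name, and the use of `html.parser` instead of lxml are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Row copied from the output above, wrapped in a table so it parses standalone
row_html = """
<table><tbody><tr>
<td><a href="courses/105/107/105107200/" target="_blank">NOC:Geomorphology</a></td>
<td>Civil Engineering</td>
<td>Prof. Pitambar Pati</td>
<td>IIT Roorkee</td>
<td>Video</td>
</tr></tbody></table>
"""

demo_soup = BeautifulSoup(row_html, 'html.parser')
row = demo_soup.find('tr')

# One find_all call, then get_text(strip=True) on each cell
cells = [td.get_text(strip=True) for td in row.find_all('td')]

# Hypothetical column names for illustration
columns = ['subject_name', 'discipline', 'prof_name', 'institute', 'media']
record = dict(zip(columns, cells))
print(record)
```

Because `zip` stops at the shorter input, a row with fewer cells simply produces a record with fewer keys rather than raising `IndexError`.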

In [130]:
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://nptel.ac.in/course.html').text

soup = BeautifulSoup(source, 'lxml')

# newline='' keeps the csv module from writing blank rows on Windows;
# the with block closes the file even if the loop raises
with open('course_list.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Subject Name', 'Course Link', 'Discipline', 'SME Name', 'Institute'])

    for content in soup.tbody.find_all('tr'):
        cells = content.find_all('td')  # look up the row's cells once

        # Each field falls back to None when its cell or link is missing
        try:
            subject_name = cells[0].a.text
        except (IndexError, AttributeError):
            subject_name = None

        try:
            course_link = 'https://nptel.ac.in/' + content.find('a')['href']
        except (TypeError, KeyError):
            course_link = None

        try:
            discipline = cells[1].text
        except IndexError:
            discipline = None

        try:
            prof_name = cells[2].text
        except IndexError:
            prof_name = None

        try:
            institute = cells[3].text
        except IndexError:
            institute = None

        csv_writer.writerow([subject_name, course_link, discipline, prof_name, institute])
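
A quick way to check a scrape like this is to read the CSV back. The sketch below writes two made-up rows to a temporary file and loads them with `csv.DictReader`. The second row, the temporary file name, and the variable names are hypothetical. It also shows that `csv.writer` stores `None` as an empty string, so missing fields come back as `''` rather than `None`.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped data; None marks a missing field
rows = [
    ['NOC:Geomorphology', 'https://nptel.ac.in/courses/105/107/105107200/',
     'Civil Engineering', 'Prof. Pitambar Pati', 'IIT Roorkee'],
    ['Example Course', None, 'Example Discipline', None, 'Example Institute'],
]
header = ['Subject Name', 'Course Link', 'Discipline', 'SME Name', 'Institute']

path = os.path.join(tempfile.mkdtemp(), 'course_list_demo.csv')

# The with block closes the file even if a row raises;
# newline='' is what the csv module documentation asks for
with open(path, 'w', newline='', encoding='utf-8') as demo_file:
    demo_writer = csv.writer(demo_file)
    demo_writer.writerow(header)
    demo_writer.writerows(rows)

# Read the file back, using the header row as dictionary keys
with open(path, newline='', encoding='utf-8') as demo_file:
    records = list(csv.DictReader(demo_file))

print(len(records))
print(records[1]['Course Link'] == '')  # None was written as an empty string
```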
    