# Using BeautifulSoup and Requests packages for Web Scraping

In [3]:
! pip install bs4 lxml requests



In [4]:
import bs4
import requests

In [5]:
import urllib.request 

In [6]:
from bs4 import BeautifulSoup
import csv
import urllib

# Using a writer function to convert the contents from websites to a csv file

In [66]:
x = open ('msdegrees12.csv', 'w', newline = '')
writer = csv.writer(x)
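
A minimal sketch of the same CSV-writing step using a `with` block, so the file is flushed and closed even if a later cell fails. The sample rows and the temporary file name `msdegrees_sample.csv` are assumptions for illustration, not data from the scraped page.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped table cells.
rows = [
    ["", "MS Business Analytics", "MS Finance"],
    ["Academic Year Begins", "Fall", "Fall"],
]

# Write to a temporary directory so the sketch does not overwrite real output.
path = os.path.join(tempfile.mkdtemp(), "msdegrees_sample.csv")

# The with-block closes the file automatically once the rows are written.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm that the rows were actually written to disk.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```

With a bare `open()` as in the cell above, the rows may stay in the write buffer until the handle is closed explicitly with `x.close()`.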

# Web scraping Santa Clara University url and accessing the HTML content of MS Degrees Comparison table

# Source 1 : https://www.scu.edu/business/ms-degrees/ms-comparison/

In [57]:
soup = BeautifulSoup(urllib.request.urlopen("https://www.scu.edu/business/ms-degrees/ms-comparison/").read(), 'lxml')

# Reading the rows and columns of the HTML table and removing tags and spaces 

In [67]:
tbody = soup('table',{"class":"table"})[0].find_all('tr')
for row in tbody:
    cols = row.findChildren(recursive=False)
    cols = [ele.text.strip() for ele in cols]
    writer.writerow(cols)
    print(cols)

['', 'MS Business Analytics', 'MS Finance', 'MS Information Systems', 'MS Supply Chain Management']
['Application Deadlines', 'May 1 International applicant final deadline (Fall)\nJune 1\xa0Final Deadline (Fall)\nApplications received after final deadline may be considered on a case-by-case basis', 'May 1 International applicant final deadline (Fall)\nJune 1\xa0Final Deadline (Fall)\nApplications received after final deadline may be considered on a case-by-case basis', 'May 1 International applicant final deadline (Fall)\nJune 1 Final Deadline (Fall)\nOctober 15th Round 1 Deadline & International applicant final deadline (Winter)\nNovember 1 Final Deadline (Winter)\nJanuary 15th Round 1 Deadline & International applicant final deadline (Spring)\nFebruary 1st Final Deadline (Spring)', 'May 1 International applicant final deadline (Fall)\nJune\xa01 Final Deadline (Fall)\nApplications received after final deadline may be considered on a case-by-case basis']
['Academic Year Begins', 'Fall 
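
The output above still contains `\n` and `\xa0` (non-breaking space) characters inside the cells. The sketch below shows one possible way to normalize them while reading the rows. It runs on a small inline HTML sample rather than the live page; `SAMPLE_HTML` and `table_to_rows` are hypothetical names, and the sample cells are abbreviated stand-ins for the real table.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the comparison table (assumption, not the real markup).
SAMPLE_HTML = """
<table class="table">
  <tr><th></th><th>MS Finance</th></tr>
  <tr><td>Application Deadlines</td><td>May 1 International deadline<br/>June 1&nbsp;Final Deadline (Fall)</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the table as a list of rows, each a list of cleaned cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    rows = []
    for tr in table.find_all("tr"):
        # Only direct th/td children, so nested tables would not be flattened in.
        cells = tr.find_all(["th", "td"], recursive=False)
        # get_text(" ") joins text fragments with a space; split()/join collapses
        # every whitespace run, including newlines and non-breaking spaces.
        rows.append([" ".join(cell.get_text(" ").split()) for cell in cells])
    return rows

rows = table_to_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Collapsing whitespace keeps each cell on one line in the CSV, which tends to be easier to work with in Tableau than cells with embedded line breaks.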

# Reading the Heading 1 and paragraph of the webpage

In [59]:
soup.h1.text

'MS Degrees'

In [60]:
soup.p

<p>Compare our demanding, full-time MS programs in Business Analytics, Finance, Information Systems, Supply Chain Management. In just a year, your career could be heading in a new direction.</p>

# Using prettify function to print the entire HTML code

In [61]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Leavey School of Business MS Degree Comparison" name="description"/>
  <meta content="SCU, LSB, Leavey School of Business, MS Degree Comparison" name="keywords"/>
  <meta content="Santa Clara University" name="author"/>
  <meta content="T4 Site Specific Full Width" name="generator"/>
  <link href="/assets/images/favicons/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="/assets/images/favicons/apple-touch-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="/assets/images/favicons/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
  <link href="/assets/images/favicons/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
  <link href="/assets/images/favicons/apple-touch-icon-114x114.png" rel="apple-tou

In [62]:
soup.title

<title>MS Comparison - Leavey School of Business - Santa Clara University</title>

In [63]:
soup.a

<a class="sr-only sr-only-focusable" href="#content">Skip to main content</a>

# Find all HTML Tags using find_all() function

In [64]:
soup.find_all()

[<html lang="en">
 <head>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <meta content="Leavey School of Business MS Degree Comparison" name="description"/>
 <meta content="SCU, LSB, Leavey School of Business, MS Degree Comparison" name="keywords"/>
 <meta content="Santa Clara University" name="author"/>
 <meta content="T4 Site Specific Full Width" name="generator"/>
 <link href="/assets/images/favicons/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
 <link href="/assets/images/favicons/apple-touch-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
 <link href="/assets/images/favicons/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
 <link href="/assets/images/favicons/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
 <link href="/assets/images/favicons/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>


In [65]:
print(soup.title.text)
for link in soup.find_all('a'):
   print(link.get('href'))
   print(link.text) 

MS Comparison - Leavey School of Business - Santa Clara University
#content
Skip to main content
/
Santa Clara University Homepage
/
Santa Clara University
/business/
Leavey School of Business
/business/about/
About LSB
#
View subcategory links
/business/about/
About LSB
/business/about/leadership/
Leadership
/business/about/lucashall/
Lucas Hall
/business/alumni/
Alumni
/business/about/staff/
Staff Directory
/business/about/accreditation/
Accreditation
/business/about/faculty-governance/
Faculty Governance
/business/about/contact/
Contact Us
/business/undergraduates/
Undergraduates
#
View subcategory links
/business/undergraduates/
Undergraduates
/business/undergraduates/academics/
Academics
/business/undergraduates/community/
Community & Campus Involvement
/business/undergraduates/advising/
Advising
/business/undergraduates/calendar/
Calendar
/business/undergraduates/contact-undergraduate-business/
Contact Undergraduate Business
/business/graduate-degrees/
Graduate
#
View subcategory
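
Most of the `href` values printed above are relative paths (`/business/`) or in-page anchors (`#`, `#content`). A short sketch, assuming the page URL is used as the base, of how they could be turned into absolute URLs with `urllib.parse.urljoin`; `SAMPLE_HTML` and `extract_links` are hypothetical names used only for this example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.scu.edu/business/ms-degrees/ms-comparison/"

# A few anchors modeled on the output above, plus one without an href.
SAMPLE_HTML = """
<a class="sr-only" href="#content">Skip to main content</a>
<a href="/business/">Leavey School of Business</a>
<a href="/business/about/">About LSB</a>
<a href="#">View subcategory links</a>
<a>No href</a>
"""

def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs, skipping in-page anchors."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only <a> tags that actually carry an href attribute.
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("#"):
            continue  # in-page anchor, not a separate page
        links.append((urljoin(base_url, href), a.get_text(strip=True)))
    return links

links = extract_links(SAMPLE_HTML, BASE_URL)
for url, text in links:
    print(url, text)
```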

In [18]:
import pandas as pd

# Web scraping Santa Clara University url and accessing the HTML content of SCU MS Program Demographics

# Source 2: https://www.scu.edu/business/ms-information-systems/prospective-students/class-profile

In [19]:
y = open ('misprofile.csv', 'w', newline = '')
writer = csv.writer(y)

In [20]:
page = requests.get("https://www.scu.edu/business/ms-information-systems/prospective-students/class-profile")
soup1 = BeautifulSoup(page.text, 'html.parser')


In [21]:
soup1.h1

<h1><span></span>MS in Information Systems</h1>

# Accessing the div class to retrieve the desired contents

In [22]:
msis_profile_list = soup1.find(class_='col-md-6')    
print(msis_profile_list.prettify())

<div class="col-md-6">
 <div class="one-column news module">
  <ul class="media-list">
   <li class="media">
    <div class="media-left">
     <span class="thumbnail">
      <img alt="Graphics crowd concept" class="media-object" src="/media/leavey-school-of-business/academics-/graphics-gifs-etc/crowd-adapt-red-sq-crop-238x238.gif"/>
     </span>
    </div>
    <div class="media-body">
     <h4 class="media-heading">
      Class Demographics
     </h4>
     <p>
      <strong>
       Women
      </strong>
      : 51%
     </p>
     <p>
      <strong>
       Average age
      </strong>
      : 26
     </p>
     <p>
      <strong>
       Multilingual
      </strong>
      :  89%
     </p>
    </div>
   </li>
  </ul>
 </div>
 <div class="one-column news module">
  <ul class="media-list">
   <li class="media">
    <div class="media-left">
     <span class="thumbnail">
      <img alt="Graduation" class="media-object" src="/media/leavey-school-of-business/academics-/graphics-gifs-etc/grads-200

# Removing the HTML tags and retrieving only p tags

In [23]:
msis_profile_list_items = msis_profile_list.find_all('p')
for msis_profile in msis_profile_list_items:
    print(msis_profile.prettify())

<p>
 <strong>
  Women
 </strong>
 : 51%
</p>

<p>
 <strong>
  Average age
 </strong>
 : 26
</p>

<p>
 <strong>
  Multilingual
 </strong>
 :  89%
</p>
<p>
 <strong>
  Average undergraduate GPA
 </strong>
 3.2
</p>

<p>
 <strong>
  Average GMAT
 </strong>
 650
</p>

<p>
 <strong>
  Average GRE
 </strong>
 308
</p>

<p>
 <strong>
  % holding graduate degrees
 </strong>
 20%
</p>
<p>
 <strong>
  Average work experience
 </strong>
 :  2.3 years
</p>

<p>
 <strong>
  Employed at time of admission
 </strong>
 : 66%
</p>

<p>
 <strong>
  Selected hiring companies
 </strong>
 :  Apple, Cisco, Ernst &amp; Young, Facebook, NetApp, NVIDIA, Symantec, Twitter
</p>


In [24]:
for row in msis_profile_list_items:
    cols = row.findChildren()
    cols = [ele.text.strip() for ele in cols]
    writer.writerow(cols)
    print(cols)

['Women']
['Average age']
['Multilingual']
['Average undergraduate GPA']
['Average GMAT']
['Average GRE']
['% holding graduate degrees']
['Average work experience']
['Employed at time of admission']
['Selected hiring companies']
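
The loop above writes only the labels, because `findChildren()` returns the `<strong>` child of each paragraph and the value (`: 51%`, `650`, ...) is loose text next to it. Below is a minimal sketch of one way to keep both label and value. It assumes the `<strong>` label comes first in each `<p>`, as in the output shown earlier; `SAMPLE_HTML` and `profile_pairs` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Three paragraphs copied in shape from the class-profile output above.
SAMPLE_HTML = """
<div class="col-md-6">
  <p><strong>Women</strong>: 51%</p>
  <p><strong>Average GMAT</strong> 650</p>
  <p><strong>Average work experience</strong>:  2.3 years</p>
</div>
"""

def profile_pairs(html):
    """Return [label, value] pairs from <p><strong>label</strong> value</p> items."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for p in soup.find_all("p"):
        strong = p.find("strong")
        if strong is None:
            continue  # paragraph without a label
        label_text = strong.get_text()
        # Assumes the label is the first thing in the paragraph: slice it off,
        # then drop the separating colon and surrounding spaces.
        value = p.get_text()[len(label_text):].lstrip(": \xa0").strip()
        pairs.append([label_text.strip(), value])
    return pairs

pairs = profile_pairs(SAMPLE_HTML)
for pair in pairs:
    print(pair)
```

Each pair can be passed directly to `writer.writerow(pair)` to get a two-column CSV.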


# Web scraping Santa Clara University Linkedin page and accessing the HTML content of SCU Alumni Demographics

In [25]:
link = requests.get("https://www.linkedin.com/school/10458916/alumni/")
print(link.status_code)
soup2 = BeautifulSoup(link.text, 'html.parser')
print(soup2)

200


In [51]:
soup2 = BeautifulSoup(link.text, 'html.parser')
soup2.find_all('div')

[<div id="artdeco-modal-outlet"></div>,
 <div id="a11y-menu"></div>,
 <div class="nav-main__content display-flex"><div class="nav-main__inbug-container fl mr3"><div class="nav-item--inbug" id="inbug-nav-item" lang="en"><a class="nav-item__link js-nav-item-link" data-alias="" data-control-name="" data-link-to="feed" data-resource="feed/badge" href="/feed/"><span aria-role="presentation" class="nav-item__icon nav-item__icon--inbug" lang="en"><li-icon aria-hidden="true" color="brand" size="34dp" type="linkedin-bug"><svg preserveaspectratio="xMinYMin meet" xmlns="http://www.w3.org/2000/svg"><g class="scaling-icon" style="fill-opacity: 1"><defs><lineargradient id="premium-linkedin-bug-color-gradient" x1="100%" x2="0%" y1="0%" y2="100%"><stop class="stop1" offset="0%" stop-color="#C5B583"></stop><stop class="stop2" offset="50%" stop-color="#AF9B62"></stop></lineargradient></defs><g class="bug-14dp" fill="none" fill-rule="evenodd" stroke="none" stroke-width="1"><g class="dp-1"><path class="bu

In [52]:
print(soup2.get_text())

 


!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);


LinkedIn










































































 LinkedIné¢è± Home My Network Jobs Messaging NotificationsTry Premium for free More

















  {"data":{"$deletedFields":["launchAlert"],"mediaConfig":"crc4KtMwSB9CyP/2sFIqSw==,root,mediaConfig","$type":"com.linkedin.voyager.common.Configuration","$id":"crc4KtMwSB9CyP/2sFIqSw==,root"},"included":[{"mprConfig":"crc4KtMwSB9CyP/2sFIqSw==,root,mediaConfig,mprConfig","$deletedFields":[],"$type":"com.linkedin.voyager.common.MediaConfig","$id":"crc4KtMwSB9CyP/2sFIqSw==,root,mediaConfig"},{"$deletedFields":[],"sizes":["crc4KtMwSB9CyP/2sFIqSw==,root,mediaConfig,mprConfig,sizes,81902298-6e6f-4235-9552-639e8eea12a6-0","crc4KtMwSB9CyP/2sFIqSw==,root,mediaConfig,mprConfig,sizes

In [41]:
#soup2.p
sculinkedin = soup2.find('container',class_='artdeco-carousel-container ember-view')


In [45]:
sculinkedin = soup2.find('container',class_='artdeco-carousel-container ember-view')
print(sculinkedin)
sculinkedin_list = sculinkedin.find_all('strong','span')


None


AttributeError: 'NoneType' object has no attribute 'find_all'
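
The error above has two likely causes. `find('container', ...)` looks for a tag literally named `<container>`, which the page does not have, so `find` returns `None`. Also, in `find_all('strong', 'span')` the second positional argument is treated as a CSS class filter, not a second tag name. The sketch below shows a guarded lookup on inline sample markup; `find_texts` and both HTML samples are assumptions for illustration, not LinkedIn's real markup.

```python
from bs4 import BeautifulSoup

# Logged-out style markup: the carousel container is absent.
MISSING_HTML = '<div id="a11y-menu"></div><div class="nav-main__content"></div>'

# Hypothetical markup in which the container is present.
PRESENT_HTML = (
    '<div class="artdeco-carousel-container ember-view">'
    '<strong>A</strong><span>B</span></div>'
)

def find_texts(html, css_class, child_tags):
    """Return the text of child_tags inside the first element with css_class."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on class only, whatever the tag name is.
    container = soup.find(class_=css_class)
    if container is None:
        return []  # nothing matched, so avoid calling find_all on None
    # A list of names matches any of the listed tags.
    return [tag.get_text(strip=True) for tag in container.find_all(child_tags)]

missing = find_texts(MISSING_HTML, "artdeco-carousel-container", ["strong", "span"])
present = find_texts(PRESENT_HTML, "artdeco-carousel-container", ["strong", "span"])
print(missing, present)
```

An empty result here is a hint that the content is rendered by JavaScript after page load or is only served to logged-in sessions, so it never appears in the HTML that `requests` receives.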

In [32]:
# Credentials redacted; never hard-code them or disable TLS verification (verify=False)
page1 = requests.get("https://www.linkedin.com/school/10458916/alumni/", auth=('<username>', '<password>'))



# Accessing the LinkedIn URL with a username and password

# Source 3: https://www.linkedin.com/school/10458916/alumni/

In [46]:
client = requests.Session()

HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/uas/login-submit'

html = client.get(HOMEPAGE_URL).content
# Name the parser explicitly so BeautifulSoup does not warn that no parser was specified
soup = BeautifulSoup(html, 'lxml')
csrf = soup.find(id="loginCsrfParam-login")['value']

# Credentials redacted; supply your own rather than hard-coding them
login_information = {
    'session_key': '<username>',
    'session_password': '<password>',
    'loginCsrfParam': csrf,
}

client.post(LOGIN_URL, data=login_information)

page = client.get('https://www.linkedin.com/school/10458916/alumni/')


In [35]:
print(page.content)

b'<html><head>\n<script type="text/javascript">\nwindow.onload = function() {\n  // Parse the tracking code from cookies.\n  var trk = "bf";\n  var trkInfo = "bf";\n  var cookies = document.cookie.split("; ");\n  for (var i = 0; i < cookies.length; ++i) {\n    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n      trk = cookies[i].substring(8);\n    }\n    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n      trkInfo = cookies[i].substring(8);\n    }\n  }\n\n  if (window.location.protocol == "http:") {\n    // If "sl" cookie is set, redirect to https.\n    for (var i = 0; i < cookies.length; ++i) {\n      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n        return;\n      }\n    }\n  }\n\n  // Get the new domain. For international domains such as\n  // fr.linkedin.com, we convert it to www.linkedin.com\n

In [47]:
soup2 = BeautifulSoup(page.text, 'html.parser')
soup2.h1

In [48]:
#soup2 = BeautifulSoup(page.text, 'html.parser')

#soup2 = BeautifulSoup(page.content)
sculinkedin = soup2.find(class_='artdeco-carousel-container ember-view')


In [29]:
soup2.find_all()

[<html lang="en">
 <head>
 <script type="application/javascript">!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);</script>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <title>LinkedIn</title>
 <meta content="" name="description"/>
 <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
 <meta content="#0077B5" name="theme-color"/>
 <meta content="false" name="is-http2"/>
 <meta content="%7B%22modulePrefix%22%3A%22extended%22%2C%22environment%22%3A%22production%22%2C%22lix%22%3A%7B%22tests%22%3A%5B%22lego_free_profile_review_promo_banner_widget%22%2C%22lego_jobs%3Ajobs-home_onboarding_flow_oc_promo_in_header_widget%22%2C%22lego_jobs_open_candidates_banner_widget%22%2C%22neptune.jobs.enableInitiateReferrals%22%2C%22voyager.jobs.web.apply-bu

In [69]:
z = open ('scualumni.csv', 'w', newline = '')
writer3 = csv.writer(z)

# Tableau Visual 1

https://public.tableau.com/views/Lab4_50/MSComparison?:embed=y&:display_count=yes&publish=yes

For Tableau visual 1, I scraped the HTML comparison table with Python, saved it to a CSV file, and then built the visual in
Tableau. Visual 1 is a packed-bubbles chart that shows the admission criteria and requirements for each SCU MS program.
Its advantages are:
    - It is easy to read and interpret the program requirements
    - It is easy for the audience to compare the different MS programs offered by SCU
    - Viewers can identify the major prerequisites for securing admission
The disadvantage of visual 1 is that it takes effort to pick out the MSIS program requirements. With more historical
data on students, we could identify patterns and relationships in how they choose MS programs.
    

# Tableau Visual 2

https://public.tableau.com/profile/muhammad.adeel3420#!/vizhome/Lab4_50/MSISStudentProfile

Tableau visual 2 was prepared by scraping the page's HTML with the BeautifulSoup Python library.
It shows the profile of SCU MSIS students and helps in analyzing student demographics
and identifying key measures such as average age, average GMAT, work experience, and gender of the students enrolling
in the SCU MSIS program. Its advantages are readability, clarity, and ease of use for the audience.

# Tableau Visual 3

https://public.tableau.com/views/Lab4_50/SCUAlumniSkillset?:embed=y&:display_count=yes

Visual 3 shows the profile of SCU alumni currently working in industry, specifically their top skills
across various business areas. The visual is clear and easy to read: each bubble represents a skill, and its measure is the
number of SCU alumni who have that skill. It helps the audience see which skills are currently in demand and
practised by alumni in their professions.

# Tableau Visual 4

https://public.tableau.com/views/Lab4_50/SCUAlumniProfession?:embed=y&:display_count=yes

Visual 4 shows the industries in which SCU alumni currently work. It is a horizontal bar graph, with colours
identifying the industries, and it is readable and easy to understand. However, with more data on the alumni,
such as graduation year, we could gain deeper insight into their work experience and the industries they
work in. We could also identify whether any of them switched to a different career after
graduating from Santa Clara University.

# Tableau Visual 5

https://public.tableau.com/views/Lab4_50/SCUAlumniResidence?:embed=y&:display_count=yes

Tableau visual 5 is a pie chart showing where SCU alumni currently live. The pie chart is useful and easy to read
here because it shows the locations and how alumni are distributed among them. The colours make it clear
which location holds the majority and which account for only a small slice of the pie.

https://public.tableau.com/views/Lab4_50/Dashboard1?:embed=y&:display_count=yes