# Please don't edit directly in this document. Create your own copy first.

# This is a demo of scraping the [Canadian Archive of Women in STEM](https://biblio.uottawa.ca/en/women-in-stem) with the [requests](https://github.com/psf/requests) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python libraries.

---



# Extract

First, run the cell below to import the necessary libraries. Most commonly used Python libraries are pre-installed; additional ones can be installed with *!pip install [package name]* or *!apt-get install [package name]*.

## 1. Libraries
*   [requests](https://github.com/psf/requests): an elegant and simple HTTP library for Python.
*   [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): a Python library for pulling data out of HTML and XML files.






In [0]:
# install
#!pip install requests
#!pip install beautifulsoup4

# import
import requests
from bs4 import BeautifulSoup

Second, set the URL of the website you want to extract data from, then download the page with the requests library imported in the first step. If the request succeeds, the output is <Response [200]>.

## 2. Set the URL

In [0]:
# Set the URL you want to scrape from
url='https://biblio.uottawa.ca/en/women-in-stem'

# Connect to the URL and download document
response = requests.get(url)
response

<Response [200]>
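
A <Response [200]> output only confirms that this particular request succeeded. As a minimal sketch, the check can be made explicit so that a failed download stops the notebook before any parsing happens. The helper names `ensure_ok` and `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original demo.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged, or raise requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response


def fetch_html(page_url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper; the timeout is an assumption)."""
    return ensure_ok(requests.get(page_url, timeout=timeout)).text


# Offline illustration with hand-built Response objects (no network access needed)
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(ensure_ok(ok_response).status_code)
try:
    ensure_ok(missing_response)
except requests.HTTPError as error:
    print("Request failed:", error)
```

In the notebook, the text returned by `fetch_html(url)` could be passed to BeautifulSoup in place of `response.text`.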

Third, parse the HTML with BeautifulSoup.

## 3. Parse HTML file

In [0]:
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lt IE 9]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 9]><html class="ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gt IE 9)|(gt IEMobile 7)]><!--><html dir="ltr" lang="en" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><!--<![endif]-->
<head>
<title>Canadian Archive of Women in STEM | Library | University of Ottawa</title>
<meta content="width" name="MobileOptimized"/>
<meta content="true" name="HandheldFriendly"/>
<meta content="width=device-width" name="viewport"/>
<meta charset="utf-8">
<link href="https://biblio.uottawa.ca/sites/all/themes/custom/uottawa_zen_assets/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="Drupal 7 (https://www.drupal.org)" name="gener
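
Displaying the whole soup object prints the entire page. A lighter way to confirm that parsing worked is to look at a single element, such as the page title. The sketch below uses a small inline HTML string (an assumption standing in for response.text) and the name `demo_soup`, so it does not overwrite the notebook's soup variable.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (illustrative only)
demo_page = """
<html>
  <head><title>Canadian Archive of Women in STEM | Library | University of Ottawa</title></head>
  <body><p>Placeholder body</p></body>
</html>
"""

demo_soup = BeautifulSoup(demo_page, "html.parser")

# .title returns the first <title> tag; .string returns its text
page_title = demo_soup.title.string
print(page_title)

# Show only the first 80 characters of the re-indented markup
print(demo_soup.prettify()[:80])
```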

Fourth, find elements by tag name and attributes.

### Syntax: find_all(name, attrs)
Returns a list of every element that matches the given tag name and attributes.
### Syntax: find(name, attrs)
Returns only the first matching element, or None if nothing matches.
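
The difference between the two calls can be seen on a tiny HTML fragment. This is a sketch with made-up markup: the class names `title` and `other` and the variable names are assumptions chosen for illustration, not taken from the archive page.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two matching elements and one non-matching element
demo_fragment = """
<div class="title">First fonds</div>
<div class="title">Second fonds</div>
<div class="other">Not a title</div>
"""
demo_soup = BeautifulSoup(demo_fragment, "html.parser")

# find() returns the first matching element
first_match = demo_soup.find("div", attrs={"class": "title"})

# find_all() returns every matching element in document order
all_matches = demo_soup.find_all("div", attrs={"class": "title"})

# find() returns None when nothing matches, so check before using .text
no_match = demo_soup.find("div", attrs={"class": "absent"})

print(first_match.text)
print([element.text for element in all_matches])
print(no_match)
```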

## 4. Extract fonds title, hosting institution, description and STEM fields.

In [0]:
# First element only
soup.find('div', attrs={'class': 'field field-name-title-field field-type-text field-label-hidden'}).text


'Academic Faculties and Schools: School of Nursing'

In [0]:
# Find fonds_title: save as list

title = [fonds_title.text for fonds_title in soup.find_all('div', attrs={'class': 'field field-name-title-field field-type-text field-label-hidden'})]
title

['Academic Faculties and Schools: School of Nursing',
 'Ada Funnell fonds',
 'Ada Janet Ross collection',
 'Adele A. Crowder fonds',
 "Admission of Women to Queen's Medical School (20th Century)",
 'Alfreda Jean Attrill collection',
 'Alice Girard fonds ',
 'Allie Vibert Douglas fonds',
 'Alumnae Association, School of Nursing, Toronto General Hospital fonds',
 'Amelia Taylor Anderson fonds',
 'Amelia Yeomans file',
 'Anila Maskeri fonds',
 'Ann E. McJanet fonds',
 'Anna Stahmer fonds',
 'Annie Elaine Bryenton fonds',
 'Association of Consulting Engineers of Canada fonds',
 'Audrey Maureen Cowling fonds',
 'Augusta Stowe-Gullen fonds',
 'Barbara Gill fonds ',
 'Barbara Kelsey fonds',
 'Barbara Triggs-Raine file',
 'Beatrice Kidd collection ',
 'Bernice Goldsmith fonds ',
 'Bertha L. Gregory fonds',
 'Betty Havens fonds ',
 'Betty J Havens file',
 'Betty Nordrum fonds',
 'Blackberg Family fonds; Anna Regina Blackberg ',
 'Brenda Loveridge file',
 'Calgary General Hospital School of Nurs

In [0]:
# Hosting institutions: save as list

hosting = [hosting_institutions.text for hosting_institutions in soup.find_all('div', attrs={'class': 'field field-name-uottawa-women-organization field-type-text field-label-hidden'})]
hosting

['Trinity Western University',
 "Queen's University",
 'Health Sciences Centre Winnipeg',
 "Queen's University",
 "Queen's University",
 'Health Sciences Centre Winnipeg',
 'University of Montreal',
 "Queen's University",
 'City of Toronto',
 'Glenbow Museum',
 'University of Manitoba',
 'Ryerson University',
 'University of Toronto',
 'Ryerson University',
 'Provincial Archives of New Brunswick ',
 'Library and Archives Canada',
 'University of Toronto',
 'Victoria University',
 'Provincial Archives of New Brunswick ',
 'Ryerson University',
 'University of Manitoba',
 'Health Sciences Centre Winnipeg',
 'Concordia University   ',
 'Provincial Archives of New Brunswick ',
 'University of Manitoba',
 'University of Manitoba',
 'University of Manitoba',
 'University of Alberta',
 'University of Manitoba',
 'Glenbow Museum',
 'Library and Archives Canada',
 'Library and Archives Canada',
 'Library and Archives Canada',
 'Library and Archives Canada',
 'Library and Archives Canada',
 'Rye
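
Some of the extracted values carry trailing spaces, for example 'Concordia University   '. One possible cleanup, sketched here on made-up markup, is to call `get_text(strip=True)` instead of reading `.text`. The class name `org` and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup reproducing the stray whitespace seen in the output above
demo_fragment = (
    '<div class="org">Concordia University   </div>'
    '<div class="org"> Glenbow Museum</div>'
)
demo_soup = BeautifulSoup(demo_fragment, "html.parser")
org_elements = demo_soup.find_all("div", attrs={"class": "org"})

# .text keeps the whitespace exactly as it appears in the page
raw_values = [element.text for element in org_elements]

# get_text(strip=True) removes leading and trailing whitespace
clean_values = [element.get_text(strip=True) for element in org_elements]

print(raw_values)
print(clean_values)
```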

In [0]:
# Description: save as list

description = [fonds_description.text for fonds_description in soup.find_all('div', attrs={'class': 'field field-name-uottawa-women-scope field-type-text-long field-label-hidden'})]
description

['The fonds consists of records that document the activities and functions of the School of Nursing, including correspondence, minutes, and reports.',
 "The fonds consists of a few pieces of correspondence pertaining to Dr. Funnell's death as well as her certification by the College of Physicians and Surgeons of Ontario; plus a crayon portrait of her as a young woman. There are also certificates and\xa0diplomas from her time at Queen's Univeristy, notebooks containing medicinal recipes, autographs and bible studies, and photographs of Ada and her sister Rose. Also included are two syringe cases, with syringes, a silver prescription scale by Dr. C.H. Fitch and a portable electro-medical apparatus, patented by A. Gaiffe, philosophical instrument maker.",
 'The collection reflects Ada Janet Ross’ nursing career and wartime experience from her nursing training, through to her death and posthumous honours. The records document her time at the Winnipeg General Hospital School of Nursing, the

In [0]:
# STEM Fields: save as list

stem = [stem_fields.text for stem_fields in soup.find_all('div', attrs={'class': 'field field-name-uottawa-women-category field-type-entityreference field-label-hidden'})]
stem

['Nursing',
 'Medicine',
 'Nursing',
 'Biology',
 'Medicine',
 'Nursing',
 'Nursing',
 'Astrophysics',
 'Nursing',
 'Nursing',
 'Medicine',
 'Nursing',
 'Architecture',
 'Trades and Technology',
 'Nursing',
 'Engineering',
 'DentistryNursing',
 'Medicine',
 'Nursing',
 'Information technology',
 'Biochemistry',
 'Nursing',
 'Engineering',
 'Public HealthNursing',
 'Gerontology',
 'Medicine',
 'Nursing',
 'Pharmacy',
 'Physiotherapy',
 'Nursing',
 'Engineering',
 'Engineering',
 'Nutrition',
 'Botany',
 'Nutrition',
 'Information technology',
 'Microbiology',
 'Nursing',
 'Information technology',
 'Nursing',
 'Ecology',
 'Science',
 'MathematicsPhysics ',
 'Medicine',
 'Botany',
 'MedicineObstetrics and Gynaecology',
 'ChemistryPhysics ',
 'Botany',
 'ChemistryPhysics ',
 'Medicine',
 'Nursing',
 'Biology',
 'Information technology',
 'Nursing',
 'MedicineSurgery',
 'Science',
 'Medicine',
 'Nursing',
 'Nursing',
 'MedicinePhysiology',
 'ZoologyAnatomy',
 'MedicinePsychiatry',
 'Medici
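
Values such as 'DentistryNursing' and 'ChemistryPhysics ' suggest that some fonds list several STEM fields and that `.text` joins them without a separator. The sketch below assumes each field sits in its own child element (the `stem` and `field-item` class names are guesses, not verified against the page) and shows two ways to keep the values apart.

```python
from bs4 import BeautifulSoup

# Assumed structure: one child element per STEM field (not verified against the real page)
demo_fragment = (
    '<div class="stem">'
    '<div class="field-item">Dentistry</div>'
    '<div class="field-item">Nursing</div>'
    '</div>'
)
demo_soup = BeautifulSoup(demo_fragment, "html.parser")
stem_element = demo_soup.find("div", attrs={"class": "stem"})

# .text concatenates the child strings with nothing in between
joined_without_separator = stem_element.text

# get_text() with a separator keeps the fields distinguishable
joined_with_separator = stem_element.get_text(separator="; ", strip=True)

# stripped_strings yields each field as its own string
separate_fields = list(stem_element.stripped_strings)

print(joined_without_separator)
print(joined_with_separator)
print(separate_fields)
```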

# Export to CSV
Import the necessary libraries. The file is saved on the Colab virtual machine, so to download the CSV file to your local computer you need to import *files* from google.colab.

## 1. Libraries

*   [pandas](https://pandas.pydata.org/): open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.




In [0]:
# import
import pandas as pd
from google.colab import files

You need to combine the four lists (title, hosting institution, description, and STEM fields) into one data frame, with one row per fonds. The easiest approach is to create an empty data frame and then add each list as a column.

## 2. DataFrame

In [0]:
# Create an empty data frame
df = pd.DataFrame()

# Add a title to df
df['Title'] = title

# Add a hosting institution to df
df['Hosting'] = hosting

# Add description to df
df['Description'] = description

# Add STEM fields to df
df['STEM'] = stem

df

| | Title | Hosting | Description | STEM |
|---|---|---|---|---|
| 0 | Academic Faculties and Schools: School of Nursing | Trinity Western University | The fonds consists of records that document th... | Nursing |
| 1 | Ada Funnell fonds | Queen's University | The fonds consists of a few pieces of correspo... | Medicine |
| 2 | Ada Janet Ross collection | Health Sciences Centre Winnipeg | The collection reflects Ada Janet Ross’ nursin... | Nursing |
| 3 | Adele A. Crowder fonds | Queen's University | The fonds consists of three articles (one typs... | Biology |
| 4 | Admission of Women to Queen's Medical School (... | Queen's University | The article details the story of the readmissi... | Medicine |
| 5 | Alfreda Jean Attrill collection | Health Sciences Centre Winnipeg | The collection reflects Alfreda Jean Attrill’s... | Nursing |
| 6 | Alice Girard fonds | University of Montreal | The fonds documents Alice Girard's professiona... | Nursing |
| 7 | Allie Vibert Douglas fonds | Queen's University | The fonds consists of correspondence, lecture ... | Astrophysics |
| 8 | Alumnae Association, School of Nursing, Toront... | City of Toronto | Fonds consists of the textual and photographic... | Nursing |
| 9 | Amelia Taylor Anderson fonds | Glenbow Museum | The fonds consists of photographs of Amelia an... | Nursing |
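
Adding the lists column by column only works when every list has the same number of items. If one fonds were missing a field on the page, the lists would differ in length and pandas would raise an error. As a sketch, the lengths can be checked first so the message names the column that is off. The helper `build_frame` and the two-row sample data are illustrative assumptions.

```python
import pandas as pd


def build_frame(columns):
    """Build a DataFrame from a dict of lists after checking that all lists have equal length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame(columns)


# Two-row sample standing in for the scraped lists
demo_df = build_frame({
    "Title": ["Ada Funnell fonds", "Adele A. Crowder fonds"],
    "STEM": ["Medicine", "Biology"],
})
print(demo_df)

# A mismatch is reported with the length of each column
try:
    build_frame({"Title": ["Ada Funnell fonds", "Adele A. Crowder fonds"], "STEM": ["Medicine"]})
except ValueError as error:
    print(error)
```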


You can save df (the data frame) in CSV format using df.to_csv().

## 3. Save as CSV

In [0]:
# Save df in the virtual machine provided by Google
# Note: sep='\t' makes the file tab-delimited even though it is named .csv
df.to_csv('women_stem.csv', sep='\t', encoding='utf-8')

# Download the file to your computer
files.download("women_stem.csv")
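
The cell above passes sep='\t', so the file is tab-delimited, and it also writes the row index as an extra unnamed first column. If a comma-separated file without the index is preferred, a variant could look like the sketch below. It runs outside Colab as well, writes to a temporary folder, and reads the file back to check the result. The sample row and the file name `women_stem_demo.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# One-row sample standing in for the scraped data frame
demo_df = pd.DataFrame({
    "Title": ["Ada Funnell fonds"],
    "Description": ["Correspondence, notebooks, and photographs"],
})

# Write to a temporary folder so the example leaves the working directory untouched
demo_path = os.path.join(tempfile.mkdtemp(), "women_stem_demo.csv")

# Default separator is a comma; index=False omits the row index column
demo_df.to_csv(demo_path, index=False, encoding="utf-8")

# Read the file back to confirm that the columns and values survived
round_trip = pd.read_csv(demo_path, encoding="utf-8")
print(list(round_trip.columns))
print(round_trip.loc[0, "Description"])
```

Fields that contain commas, such as the description here, are quoted automatically by to_csv, so they are read back as a single value.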
