# Web Scraper
Scrapes data from a website and writes it to a CSV file. This script is based on the Programming Historian tutorial *Intro to Beautiful Soup* by Jeri Wieringa (for getting data from one page) and the BetterProgramming tutorial *How to Scrape Multiple Pages of a Website Using a Python Web Scraper* by Angelica Dietzel (for getting data from multiple pages). Documentation for *Beautiful Soup* can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
The website that I'm scraping is *Lexikon Japans Studierende*, provided by the Staatsbibliothek zu Berlin: https://themen.crossasia.org/japans-studierende/index/show

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import csv

**re** is only required if you use a regex to extract text. Use Python 3 for this script; Python 2 does not handle UTF-8 text well.

In [2]:
from time import sleep
from random import randint

These imports are used to pause between requests, so the website is not queried too quickly.
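A minimal sketch of how these two imports can work together. The helper name `pause_between_requests` and its parameters are hypothetical and only illustrate the idea; the script itself calls `sleep(randint(2,10))` directly.

```python
from random import randint
from time import sleep


def pause_between_requests(low=2, high=10):
    """Sleep for a random whole number of seconds between low and high (inclusive).

    Hypothetical helper, not part of the original script.
    Returns the chosen delay so it can be logged if needed.
    """
    delay = randint(low, high)  # randint includes both endpoints
    sleep(delay)
    return delay
```

Calling `pause_between_requests()` once per page request would spread the requests out at irregular intervals.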

In [3]:
names = []
places = []
years = []
texts = []

Initializes empty lists to hold the scraped data.

In [4]:
pages = np.arange(0, 52)

All the entries of "Lexikon Japans Studierende" can be found on 52 consecutive webpages. I'm only passing **start** and **stop** to np.arange; **stop** is exclusive, so this yields the numbers 0 through 51. The **step** argument does not need to be set, because its default of 1 is what I need in this case.
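To make the page numbering concrete, here is a small sketch that builds the list of page URLs. It assumes the built-in `range` in place of `np.arange` (both exclude the stop value and default to a step of 1); the names `BASE_URL`, `page_numbers` and `urls` are illustrative only.

```python
# Base URL taken from the scraping loop; the page number is appended to it.
BASE_URL = "https://themen.crossasia.org/japans-studierende/index/show/page/"

# range(0, 52) yields 0, 1, ..., 51: the stop value is excluded
# and the step defaults to 1, just like np.arange(0, 52).
page_numbers = list(range(0, 52))
urls = [BASE_URL + str(number) for number in page_numbers]

print(len(urls))   # number of pages that would be requested
print(urls[0])     # first URL
print(urls[-1])    # last URL
```

Printing the first and last URL is a quick way to check that the range covers exactly the pages you expect before sending any requests.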

In [5]:
for page_number in pages:
    # Fetch one index page and parse it with the lxml parser.
    response = requests.get("https://themen.crossasia.org/japans-studierende/index/show/page/" + str(page_number))
    soup = BeautifulSoup(response.text, features="lxml")
    # Each entry sits in its own full-width table.
    jstudies = soup.find_all('table', width="100%")

    # Pause for 2 to 10 seconds before the next request.
    sleep(randint(2,10))

    for container in jstudies:
        # The name is the bold text that contains neither "Name" nor a digit.
        name = container.find("b", string=re.compile(r"^(?!.*(Name|\d))")).get_text()
        names.append(name)

        tds = container.find_all("td")

        # Use '-' as a placeholder whenever a cell is empty.
        place = str(tds[3].get_text(strip=True))
        places.append(place if place else '-')

        year = str(tds[5].get_text(strip=True))
        years.append(year if year else '-')

        text = str(tds[7].get_text(strip=True))
        texts.append(text if text else '-')

Defines what **Beautiful Soup** receives as raw data: the responses from **requests**, with each number from **pages** (0 through 51) appended to the base URL for every subsequent webpage. The parser is lxml.

**jstudies** defines the scope within which I look for the data. All the data that I want to extract (names, dates, places, ...) can be found within the individual 'table' tags (specifically those that have width="100%"). All subsequent searches run inside these **jstudies** containers.

**sleep** pauses between requests to the website for a random interval of 2 to 10 seconds.

After "for container in jstudies:" comes the actual scraping code. The first expression grabs the name from the 'b' tag with a regex (excluding "Name" and any digits, which are also set in bold). Places, years, and texts can be found in the table data ('td') tags within each 'table'. With "str" the data is extracted as a string. The index in square brackets identifies the position of the individual 'td'; strip=True is required to get rid of any tabs and newlines (\t and \n). The "if" clause substitutes '-' for empty cells, so the code still works when data is missing.
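The regex and the fallback for missing data can be tried out without touching the website. The sketch below uses only the standard library; the sample strings and the helper `text_or_dash` are hypothetical, and it assumes that Beautiful Soup tests a compiled pattern against each tag's string with `search`.

```python
import re

# Same pattern as in the scraping loop, written as a raw string.
NAME_PATTERN = re.compile(r"^(?!.*(Name|\d))")

# Hypothetical contents of bold tags; the real page may use different labels.
bold_strings = ["Name", "Abe Isoo 阿部磯雄", "1865", "Nr. 2"]

# Keep only strings that contain neither "Name" nor a digit.
kept = [s for s in bold_strings if NAME_PATTERN.search(s)]
print(kept)


def text_or_dash(raw):
    """Strip surrounding whitespace and fall back to '-' for empty cells."""
    cleaned = raw.strip()
    return cleaned if cleaned else "-"


print(text_or_dash("\n\tFukuoka\n"))  # cell with content
print(text_or_dash("\n\t"))           # cell with whitespace only
```

The negative lookahead rejects a string as soon as "Name" or a digit appears anywhere in it, which is why only the personal name survives the filter.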

In [6]:
aus_studies = pd.DataFrame({
'person': names,
'herkunft': places,
'lebensdaten': years,
'karriere': texts,
})

Fills a pandas DataFrame with the content collected by the scraping code.

In [7]:
print(aus_studies)

                     person                    herkunft           lebensdaten  \
0         Abe Bunjirô 阿部文次郎                     Niigata            21.3.1880–   
1             Abe Isoo 阿部磯雄                     Fukuoka    4.2.1865–10.2.1949   
2        Abe Masayoshi 阿部正義                   Fukui-shi  2.7.1860–19.11. 1909   
3              Abe Masayuki                       Ehime            11.5.1853–   
4         Abe Mikishi 阿部美樹志       Ichinoseki, Iwate-ken    4.5.1883–20.2.1965   
...                     ...                         ...                   ...   
2546      Yukawa Genyô 湯川玄洋                Wakayama-shi   15.8.1867–10.8.1935   
2547  Yumoto Takehiko 湯本武比古  Shimotakai-gun, Nagano-ken   1.12.1857–27.9.1925   
2548   Yunome Suketaka 湯目補隆                      Sendai                     -   
2549       Yusa Yoshio 遊佐慶夫               Fukushima-ken            Jan. 1889–   
2550     Zushisaki Sukamoto                           -                     -   

                           

In [8]:
aus_studies.to_csv('aus_studies_compl.csv')

Exports the table as a CSV file.
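For comparison, the same kind of export can be sketched with the standard-library **csv** module instead of pandas. This is an alternative illustration, not the method used above: it writes to an in-memory buffer, uses two rows taken from the printed output, and fills the 'karriere' column with '-' as a placeholder.

```python
import csv
import io

# Two sample rows taken from the printed table; 'karriere' is a placeholder.
names = ["Abe Isoo 阿部磯雄", "Yunome Suketaka 湯目補隆"]
places = ["Fukuoka", "Sendai"]
years = ["4.2.1865–10.2.1949", "-"]
texts = ["-", "-"]

# Write to an in-memory buffer so the sketch does not create a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["person", "herkunft", "lebensdaten", "karriere"])
writer.writerows(zip(names, places, years, texts))

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer could be replaced with `open(path, "w", newline="", encoding="utf-8")`, which keeps the Japanese characters intact.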