## Python Code for Scraping Syuuin (House of Representatives) Speeches
#### Data Management (Spring/Summer 2018) at OSIPP, Osaka U

#### Note: Use the API if you need to download a large number of texts. [Link](http://kokkai.ndl.go.jp/api.html)
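
For bulk downloads, a query URL for the API can be assembled without a browser. The sketch below only builds the URL and does not send a request. The endpoint and the parameter names (`any`, `from`, `until`, `nameOfHouse`, `startRecord`, `maximumRecords`, `recordPacking`) are assumptions and are not taken from this notebook, so check them against the linked API documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify it against the API documentation linked above.
API_ENDPOINT = "https://kokkai.ndl.go.jp/api/speech"


def build_speech_query(keyword, date_from, date_until, house, start=1, max_records=100):
    """Build a search URL for one page of speech records (parameter names are assumed)."""
    params = {
        "any": keyword,                 # free-text keyword
        "from": date_from,              # start date, YYYY-MM-DD
        "until": date_until,            # end date, YYYY-MM-DD
        "nameOfHouse": house,           # name of the house
        "startRecord": start,           # 1-based offset of the first record
        "maximumRecords": max_records,  # page size
        "recordPacking": "json",        # response format
    }
    return API_ENDPOINT + "?" + urlencode(params)


query_url = build_speech_query("TPP", "2017-04-01", "2018-03-31", "衆議院")
print(query_url)
```

Increasing `start` by `max_records` on each call would step through the result pages, which replaces the manual paging needed in the browser.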

### Preamble

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
import pandas as pd
import time as t

- If this cell raises `ModuleNotFoundError: No module named 'selenium'`, install Selenium first (for example, `pip install selenium`).

In [5]:
from selenium.webdriver.chrome.service import Service

path_to_chromedriver = "C:\\Users\\shu\\Desktop\\chromedriver" # set path to your chrome driver
browser = webdriver.Chrome(service=Service(path_to_chromedriver)) # launch Google Chrome (Selenium 4 style)

browser.implicitly_wait(10) # wait up to 10 sec for elements to appear

### Find relevant data

In [6]:
url = 'http://kokkai.ndl.go.jp/' 
browser.get(url) # open the webpage

In [7]:
browser.find_element('id', 'b_easy-search').click() # click 'easy search' ('id' is the value of By.ID)

In [8]:
browser.switch_to.frame("frame1") # select the first frame (there are two frames)

#### - Start and end date
- The search period runs from 2017/4/1 to 2018/3/31. Years are entered in the Japanese era calendar: Heisei 29 (2017) and Heisei 30 (2018).
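
The year values typed into the form are Heisei era years, where Heisei 1 corresponds to 1989. A minimal sketch of the conversion follows; the helper name `to_heisei_fields` is hypothetical and only illustrates how the strings for the year, month, and day inputs can be derived from a Gregorian date.

```python
from datetime import date

HEISEI_OFFSET = 1988  # Heisei 1 corresponds to 1989


def to_heisei_fields(d):
    """Return (year, month, day) as strings, with the year in the Heisei era."""
    # The Heisei era ran from 1989-01-08 to 2019-04-30.
    if not (date(1989, 1, 8) <= d <= date(2019, 4, 30)):
        raise ValueError("date is outside the Heisei era")
    return str(d.year - HEISEI_OFFSET), str(d.month), str(d.day)


start_fields = to_heisei_fields(date(2017, 4, 1))
end_fields = to_heisei_fields(date(2018, 3, 31))
print(start_fields)  # ('29', '4', '1')
print(end_fields)    # ('30', '3', '31')
```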

In [9]:
fromY = '/html/body/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[3]/input[2]'
fromM = '/html/body/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[3]/input[3]'
fromD = '/html/body/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[3]/input[4]'

browser.find_element('xpath', fromY).clear() # use xpath to select "fromY" as defined above ('xpath' is the value of By.XPATH)
browser.find_element('xpath', fromY).send_keys('29')  # start year (Heisei 29 = 2017)
browser.find_element('xpath', fromM).clear()
browser.find_element('xpath', fromM).send_keys('4')  # start month
browser.find_element('xpath', fromD).clear()
browser.find_element('xpath', fromD).send_keys('1')  # start day

In [10]:
toY = '/html/body/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/input[1]'
toM = '/html/body/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/input[2]'
toD = '/html/body/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/input[3]'

browser.find_element('xpath', toY).clear()
browser.find_element('xpath', toY).send_keys('30')  # end year (Heisei 30 = 2018)
browser.find_element('xpath', toM).clear()
browser.find_element('xpath', toM).send_keys('3')  # end month
browser.find_element('xpath', toD).clear()
browser.find_element('xpath', toD).send_keys('31')  # end day

#### - Meetings
- Select the House of Representatives (Syuuin).

In [11]:
all_meetings = '/html/body/p[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[2]/table/tbody/tr/td[1]/input'
syuuin       = '/html/body/p[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[2]/table/tbody/tr/td[3]/input'
sannin       = '/html/body/p[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[2]/table/tbody/tr/td[5]/input'
ryouin       = '/html/body/p[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[1]/td[2]/table/tbody/tr/td[7]/input'

browser.find_element('xpath', syuuin).click() # you can change syuuin to another option

#### - Keywords
- Use "TPP" as a keyword.

In [12]:
clue = '/html/body/p[3]/table[1]/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/input'

browser.find_element('xpath', clue).send_keys('TPP')

#### - Click 'Search'

In [13]:
t.sleep(3) 

browser.find_element('xpath', '/html/body/p[3]/table[2]/tbody/tr/td/table/tbody/tr/td[1]/a/img').click()

#### - Show results in the browser

In [14]:
t.sleep(3)

browser.find_element('xpath', '/html/body/table[4]/tbody/tr/td[6]/a/img').click()

### Get attributes
- We scrape only the attributes shown on the first page (20 rows). To scrape all results, add a loop that goes through the remaining pages.
- Attributes include: (a) the session number, (b) the name of the house, (c) the name of the meeting, (d) the number of the meeting, and (e) the date of the meeting.

In [None]:
attr = browser.find_elements('xpath', '/html/body/table[7]/tbody/tr/td[2]/table/tbody/tr/td')

## collect the cell texts in a list. there are 180 cells (20 rows * 9 columns) in this case.
attr_list = [cell.text for cell in attr]

## convert the list to a 20 * 9 matrix (Numpy array), then to a data frame
df_attr = pd.DataFrame(np.reshape(attr_list, (-1, 9))) # -1 infers the number of rows
df_attr = df_attr.iloc[:,1:6] # drop empty columns
df_attr
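
The reshaping step above turns a flat list of table cells into rows of nine cells and keeps columns 1 to 5. A minimal sketch of the same idea in plain Python is shown below. The cell values are hypothetical placeholders, not scraped results, and `chunk_rows` is a helper introduced only for this illustration.

```python
def chunk_rows(cells, n_cols):
    """Split a flat list of cell texts into rows of n_cols cells each."""
    if len(cells) % n_cols != 0:
        raise ValueError("number of cells is not a multiple of n_cols")
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


# Hypothetical placeholder cells: 2 rows * 9 columns, with empty first and last columns
cells = [
    "", "kai1", "house1", "name1", "gou1", "date1", "", "", "",
    "", "kai2", "house2", "name2", "gou2", "date2", "", "", "",
]

rows = chunk_rows(cells, 9)
attrs = [row[1:6] for row in rows]  # same slice as df_attr.iloc[:, 1:6]
print(attrs)
```

Deriving the number of rows from the length of the list avoids hard-coding 180, so the same code also handles a result page with fewer than 20 rows.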

### Open each meeting page and get texts

In [None]:
text_list = []  # make an empty list
j = 1
for i in range(2,22):
    browser.find_element('xpath', '/html/body/table[7]/tbody/tr/td[2]/table/tbody/tr['+str(i)+']/td[9]/a').click() # click a meeting
    browser.switch_to.window(browser.window_handles[j]) # switch to a new tab
    browser.switch_to.frame("MAIN1") # select the main frame
    text_list.append(browser.find_element('xpath', '/html/body').text) # get the page text
    browser.switch_to.window(browser.window_handles[0]) # switch back to the original tab
    j += 1
    t.sleep(2) # wait for 2 sec
len(text_list)
text_list

In [None]:
df_text = pd.DataFrame(text_list) # convert the list to a data frame
df_text.shape # check the shape of df_text
df_text.iloc[0,0] # check the first row

### Merge attributes with texts
- We have scraped attributes and texts. Let's merge them!

In [None]:
data = df_attr.join(df_text) # join texts to attributes
data.columns = ['kai','house','m_name','gou','date','content'] # add column names
data.shape # check the shape

dta = [elem.strip().replace('\n','') for elem in data.content] # remove \n
lst = [elem.strip().split('○') for elem in dta] # split per speaker
lst

## select speeches that mention 安倍
abe_list = [[x for x in speeches if "安倍" in x] for speeches in lst]

n_cols = max((len(row) for row in abe_list), default=0) # the widest row sets the number of columns
abe_col = ['s' + str(x) for x in range(n_cols)] # make column names
abe_list = pd.DataFrame(abe_list, columns=abe_col) # convert the list to a data frame
abe_list

del(data['content']) # drop content variable

data = pd.merge(data, abe_list,right_index=True,left_index=True) # merge abe_list
data
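
The split on `○` and the keyword filter can be tried on a small string before running them on the scraped texts. The sketch below uses a made-up three-speaker transcript. It also assumes that each segment starts with a speaker label followed by a full-width space, which is an assumption about the transcript layout rather than something checked in this notebook. Under that assumption, speeches made by a speaker can be separated from speeches that merely mention the name.

```python
def split_speeches(transcript, marker="○"):
    """Remove line breaks and split a transcript into per-speaker segments."""
    flat = transcript.strip().replace("\n", "")
    return [seg for seg in flat.split(marker) if seg]  # drop the empty leading piece


def speeches_mentioning(segments, name):
    """Segments that contain the name anywhere (same rule as the filter above)."""
    return [seg for seg in segments if name in seg]


def speeches_by(segments, name):
    """Segments whose speaker label contains the name (label layout is assumed)."""
    return [seg for seg in segments if name in seg.split("\u3000", 1)[0]]


# Made-up sample text, not taken from the scraped data
sample = (
    "○議長\u3000これより会議を開きます。\n"
    "○安倍内閣総理大臣\u3000TPPについてお答えします。\n"
    "○質問者\u3000安倍総理に伺います。\n"
)

segments = split_speeches(sample)
mentions = speeches_mentioning(segments, "安倍")
spoken = speeches_by(segments, "安倍")
print(len(segments), len(mentions), len(spoken))  # 3 2 1
```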

### Export data as a csv file
- Note: The file size is about 430KB. 

In [358]:
data.to_csv("syuuin_speech_tpp2017.csv", encoding='cp932')
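
Writing with `encoding='cp932'` fails with `UnicodeEncodeError` if any text contains a character that cp932 cannot represent. A minimal sketch of a pre-check follows; `pick_csv_encoding` is a hypothetical helper, and falling back to `utf-8-sig` (UTF-8 with a byte order mark) is one possible choice, not part of the original workflow.

```python
def pick_csv_encoding(texts, preferred="cp932", fallback="utf-8-sig"):
    """Return `preferred` if every text can be encoded with it, else `fallback`."""
    for text in texts:
        try:
            text.encode(preferred)
        except UnicodeEncodeError:
            return fallback
    return preferred


ok_encoding = pick_csv_encoding(["安倍内閣総理大臣", "TPP"])
fallback_encoding = pick_csv_encoding(["TPP \U0001F600"])  # an emoji is outside cp932
print(ok_encoding)        # cp932
print(fallback_encoding)  # utf-8-sig
```

The returned name could then be passed to `data.to_csv(..., encoding=...)` after checking the text columns of `data`.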