# Scraping India's Parliament Websites To Extract Questions and Answers

During every Parliament session (Lok Sabha or Rajya Sabha), the first hour is called the Question Hour. It is reserved for members of Parliament to put questions to ministers and hold the government accountable. The answers the government gives to these questions are a crucial source of data and information for journalists.

In India, where official reports may be delayed or unavailable online, Parliament Questions help journalists access the latest data on a wide range of issues.

This scraper scans <b>both the Lok Sabha and Rajya Sabha</b> pages to extract all the available questions and answers for a chosen ministry, <b>with links to the answers</b>. The details are then exported to a CSV file.

<i>I have also noted all the ways the code can be customised and what to change if something goes wrong.</i>

In [1]:
import pandas as pd
from bs4 import BeautifulSoup

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.firefox import GeckoDriverManager



I have imported the Firefox driver for Selenium. <b>If you are using Chrome</b>, change the last line in the above chunk of code to
```python
from webdriver_manager.chrome import ChromeDriverManager
```

Also change the first line in the next chunk to 
```python
driver = webdriver.Chrome(ChromeDriverManager().install())
```

In [2]:
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())






### Scraping Lok Sabha website

In [3]:
driver.get("http://loksabhaph.nic.in/Questions/Qministrysearch.aspx")

Right now, the scraper only extracts questions from the <b>WOMEN AND CHILD DEVELOPMENT</b> ministry. To get questions from another ministry, change the name in the line of code below. Typing the name in capital letters should work, but I suggest copying and pasting the exact name from the [Lok Sabha website](http://loksabhaph.nic.in/Questions/Qministrysearch.aspx).

In [4]:
Select(driver.find_element(By.ID, "ContentPlaceHolder1_ddlministry")).select_by_visible_text("WOMEN AND CHILD DEVELOPMENT")
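Since `select_by_visible_text` needs the option text to match exactly, a small lookup can find the exact spelling from whatever you type. This is only a sketch: `match_ministry` is a hypothetical helper (not part of Selenium), and the sample option list is made up for illustration.

```python
def match_ministry(typed_name, option_texts):
    """Return the dropdown option matching typed_name, ignoring case and extra spaces."""
    def normalise(text):
        # Collapse runs of whitespace and compare case-insensitively
        return " ".join(text.split()).casefold()

    wanted = normalise(typed_name)
    for option in option_texts:
        if normalise(option) == wanted:
            return option
    raise ValueError(f"No ministry named {typed_name!r} in the dropdown")


# Illustrative option texts; on the live page they would come from the dropdown.
sample_options = ["--Select--", "WOMEN AND CHILD DEVELOPMENT", "YOUTH AFFAIRS AND SPORTS"]
print(match_ministry("women and  child development", sample_options))
```

On the live page, the option texts can be collected with `[o.text for o in Select(driver.find_element(By.ID, "ContentPlaceHolder1_ddlministry")).options]`, and the returned name passed to `select_by_visible_text`.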

In [5]:
driver.find_element(By.XPATH, '//*[@id="ContentPlaceHolder1_search1"]').click()

The Lok Sabha website can be slow. <b>If only the first 10 or so questions are extracted, increase the number in the</b> ```time.sleep``` call. The number is in seconds. The default (1 minute) should be enough in most cases.

In [6]:
driver.find_element(By.XPATH, '//*[@id="ContentPlaceHolder1_Button1"]').click()
time.sleep(60)
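A fixed `time.sleep(60)` waits the full minute even when the page loads quickly, and can still be too short on a slow day. One alternative is to poll until the number of table rows stops growing. The sketch below assumes that a settled row count means the page has finished loading; `wait_until_stable` is a hypothetical helper (not a Selenium function), and a fake counter stands in for the browser so the example runs on its own.

```python
import time


def wait_until_stable(get_count, timeout=60, poll=1.0, stable_checks=3):
    """Poll get_count() until it returns the same non-zero value
    stable_checks times in a row, then return that value.

    Raises TimeoutError if this does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last = None
    repeats = 0  # how many times in a row the current value has repeated
    while time.monotonic() < deadline:
        current = get_count()
        if current and current == last:
            repeats += 1
            if repeats >= stable_checks - 1:
                return current
        else:
            repeats = 0
        last = current
        time.sleep(poll)
    raise TimeoutError(f"Row count did not settle within {timeout} seconds")


# Stand-in for a page that is still loading: 10 rows at first, then 120.
fake_counts = iter([10, 10, 120, 120, 120, 120])
print(wait_until_stable(lambda: next(fake_counts), timeout=5, poll=0.01))
```

With the browser open, `get_count` could be something like `lambda: len(driver.find_elements(By.TAG_NAME, "tr"))`.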

In [7]:
df=pd.read_html(driver.page_source)[2]

In [8]:
# Drop the first two rows of the parsed table
df=df.drop([0, 1])

In [9]:
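# Remove the 'PDF/WORD' and '(Hindi)' link labels from the question type.
# regex=False treats the brackets as plain text, whatever the pandas version.
df['Q.Type']=df['Q.Type'].str.replace(' PDF/WORD', '', regex=False)
df['Q.Type']=df['Q.Type'].str.replace('(Hindi)', '', regex=False)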



In [10]:
# Name the parser explicitly so BeautifulSoup does not have to guess one
data=BeautifulSoup(driver.page_source, "html.parser")

In [11]:
# The answer links are the <a> tags styled in green; collect their URLs
links=[]

answers=data.find_all("a", style="color:green;")
for answer in answers:
    link=answer['href']
    links.append(link)

In [12]:
df['links']=links
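The assignment above raises a `ValueError` if the number of links differs from the number of rows. That can happen if the page had not finished loading or if the site's link styling has changed. A minimal sketch of a clearer check, where `check_links` is a hypothetical helper and the sample values are made up:

```python
def check_links(row_count, links):
    """Return links unchanged if there is exactly one per table row,
    otherwise raise an error that explains the mismatch."""
    if len(links) != row_count:
        raise ValueError(
            f"Found {len(links)} links for {row_count} rows; "
            "the page may not have finished loading, or the site layout may have changed."
        )
    return links


# Made-up values for illustration
print(check_links(2, ["answer1.pdf", "answer2.pdf"]))
```

In the notebook it would be used as `df['links'] = check_links(len(df), links)`.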

In [13]:
df.to_csv("LS_WCD.csv", index=False)
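If the links saved in the CSV turn out to be relative (I have not assumed either way), they can be turned into full URLs with `urljoin` from the standard library before exporting. The `href` values below are hypothetical and only illustrate the behaviour.

```python
from urllib.parse import urljoin

# Page the links were scraped from
BASE_URL = "http://loksabhaph.nic.in/Questions/Qministrysearch.aspx"

# Hypothetical href values, for illustration only
links = ["QResult.aspx?qref=1", "http://example.com/annex.pdf"]

# Relative links are resolved against the page URL; absolute ones pass through unchanged
absolute_links = [urljoin(BASE_URL, link) for link in links]
print(absolute_links)
```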

### Scraping Rajya Sabha website

If you are using Chrome, change the first line of code again in the chunk below. It should be

```python
driver = webdriver.Chrome(ChromeDriverManager().install())
```

In [14]:
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get("https://rajyasabha.nic.in/Questions/IntegratedSearchForm")






As with the Lok Sabha, the scraper only extracts questions from the <b>WOMEN AND CHILD DEVELOPMENT</b> ministry. To get questions from another ministry, change the name in the line of code below. Typing the name in capital letters should work, but I suggest copying and pasting the exact name from the [Rajya Sabha website](https://rajyasabha.nic.in/Questions/IntegratedSearchForm).

In [15]:
Select(driver.find_element(By.ID, "ministrycode")).select_by_visible_text("WOMEN AND CHILD DEVELOPMENT")

Here too, I have kept the <b>default wait time as 60 seconds</b>. It can be changed by modifying the ```time.sleep``` call below.

In [16]:
driver.find_element(By.XPATH, '//*[@id="show"]').click()
time.sleep(60)

In [17]:
df=pd.read_html(driver.page_source)[0]

In [18]:
# Name the parser explicitly so BeautifulSoup does not have to guess one
data=BeautifulSoup(driver.page_source, "html.parser")

In [19]:
links=[]
answers=data.select("tr a")

for answer in answers:
    if answer.text=="English":
        link=answer['href']
        
        links.append(link)
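To see what this filter keeps without opening a browser, the same idea can be tried on a small piece of HTML. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs anywhere. The sample HTML and the `EnglishLinkParser` class are made up for illustration and are not the Rajya Sabha page's real markup. Unlike the loop above, it also strips spaces around the link text before comparing.

```python
from html.parser import HTMLParser


class EnglishLinkParser(HTMLParser):
    """Collect href values of <a> tags inside table rows whose text is 'English'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._row_depth = 0   # how many <tr> tags we are currently inside
        self._href = None     # href of the <a> being read, if any
        self._text = []       # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_depth += 1
        elif tag == "a" and self._row_depth:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._row_depth:
            self._row_depth -= 1
        elif tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "English":
                self.links.append(self._href)
            self._href = None


# Made-up HTML: two rows with English links, one Hindi link, one link outside the table
sample_html = """
<table>
  <tr><td>Q1</td><td><a href="/q1_en.pdf"> English </a> <a href="/q1_hi.pdf">Hindi</a></td></tr>
  <tr><td>Q2</td><td><a href="/q2_en.pdf">English</a></td></tr>
</table>
<a href="/elsewhere">English</a>
"""
parser = EnglishLinkParser()
parser.feed(sample_html)
print(parser.links)
```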

In [20]:
df['links']=links

In [21]:
df = df.drop('Answer', axis=1)

In [22]:
df.to_csv("RS_WCD.csv", index=False)
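Both file names are hard-coded with `WCD`, so they also need changing when you switch ministries. One way to keep them in step is to build the name from the ministry's initials. This is a sketch only: `csv_name` is a hypothetical helper, and two ministries with the same initials would get the same file name.

```python
def csv_name(house, ministry):
    """Build a file name such as 'LS_WCD.csv' from a house prefix and a ministry name."""
    skip = {"AND", "OF", "THE"}  # small words left out of the initials
    initials = "".join(word[0] for word in ministry.upper().split() if word not in skip)
    return f"{house}_{initials}.csv"


print(csv_name("LS", "WOMEN AND CHILD DEVELOPMENT"))
print(csv_name("RS", "WOMEN AND CHILD DEVELOPMENT"))
```

The result can be passed straight to `df.to_csv(csv_name("RS", "WOMEN AND CHILD DEVELOPMENT"), index=False)`.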