<a href="https://colab.research.google.com/github/so-dipe/Web-Scraping-Datasets/blob/main/Web_Scraping_Nigerian_2015_Election_Data_from_PDFs_on_INEC_website.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
pip install tabula-py

Installing collected packages: distro, tabula-py
Successfully installed distro-1.7.0 tabula-py-2.4.0


In [None]:
import tabula as t
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
import ssl
from tempfile import TemporaryFile
# Turns off HTTPS certificate verification for urllib requests; this is insecure
ssl._create_default_https_context = ssl._create_unverified_context


The function below downloads a PDF from INEC's website and uses `tabula` to convert the tables in it into `pandas` DataFrames, which are easier to work with. `BeautifulSoup` (`bs4`) is used further down to find the PDF links on each results page.

In [None]:
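def scrape_table(table_link):
  # urlretrieve saves the PDF to a temporary file and returns its path
  pdf_path, headers = urllib.request.urlretrieve(table_link)
  # header=None keeps each table's first row as data rather than column names
  df_list = t.read_pdf(pdf_path, pages='all', pandas_options={'header':None})
  return df_list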
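
`urlretrieve` picks the temporary file's location itself. A minimal sketch, assuming we want to control where the download goes and when it is deleted; `download_to_temp` is a hypothetical helper that is not part of this notebook, and the local `file://` URL only stands in for an INEC link so the example runs offline.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def download_to_temp(url, suffix='.pdf'):
    """Download url into a new temporary file and return the file's path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, 'wb') as out, urllib.request.urlopen(url) as response:
        out.write(response.read())
    return path

# Offline demonstration: a local file:// URL stands in for an INEC PDF link.
source = Path(tempfile.mkdtemp()) / 'sample.pdf'
source.write_bytes(b'%PDF-1.4 placeholder')
local_copy = download_to_temp(source.as_uri())
copied_bytes = Path(local_copy).read_bytes()
os.remove(local_copy)  # the caller decides when the download is deleted
```

The returned path could be handed to `t.read_pdf` before the file is removed.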

The function `scrape_links` takes a webpage URL and returns every link on that page whose `href` contains the string `pdf`, that is, every link that probably points to a PDF document.

In [None]:
def scrape_links(webpage):
  page = urllib.request.urlopen(webpage)
  # name the parser explicitly so bs4 does not have to guess one
  soup = BeautifulSoup(page, 'html.parser')
  links = []
  for link in soup.find_all('a'):
    href = link.get('href')
    # keep only anchors whose href mentions 'pdf'
    if (href is not None) and ('pdf' in href):
      links.append(href)
  return links
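
The substring test `'pdf' in href` also matches links that merely mention "pdf" somewhere in the address. A stricter check, sketched here under the assumption that result files always end in `.pdf`, looks only at the path part of the link. `is_pdf_link` and the sample hrefs are hypothetical and not taken from the INEC page.

```python
from urllib.parse import urlparse

def is_pdf_link(href):
    """Return True when the path part of href ends in .pdf (case-insensitive)."""
    if not href:
        return False
    return urlparse(href).path.lower().endswith('.pdf')

hrefs = [
    '/wp-content/uploads/2019/10/ABIA-CENTRAL.pdf',  # hypothetical result file
    '/pdf-viewer/help.html',                         # mentions 'pdf' but is not a PDF
    '/files/RESULT.PDF?download=1',                  # upper-case extension plus query string
    None,                                            # anchor without an href
]
pdf_links = [h for h in hrefs if is_pdf_link(h)]
```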

In [None]:
links = scrape_links('https://www.inecnigeria.org/2019-senatorial-district-elections-result/')

In [None]:
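df_senatorial = []
inec_ini = 'https://www.inecnigeria.org'
col_dict = {0:'S/N', 1:'NAME', 2:'GENDER', 3:'PARTY', 4:'VOTES', 5:'ELECTED'}
for link in links:
  # prepend the domain to the link and keep only the first table found in the PDF
  table = scrape_table(inec_ini + link)[0]
  table.rename(columns=col_dict, inplace=True)
  # link[28:-4] drops the first 28 characters of the link and the '.pdf' suffix
  district = link[28:-4]
  table['SENATORIAL DISTRICT'] = district
  # the state is whatever comes before the first hyphen in the district name
  table['STATE'] = district.split('-')[0]
  df_senatorial.append(table)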
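
The fixed slice `link[28:-4]` only works while every link has the same prefix length. A minimal sketch of a less position-dependent alternative, assuming the district name is the PDF's file name: `urljoin` builds the full URL and `PurePosixPath(...).stem` strips the directory and the extension. `describe_link` and the example links are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://www.inecnigeria.org/'

def describe_link(link):
    """Return (full_url, district, state) for a relative or absolute PDF link."""
    full_url = urljoin(BASE_URL, link)
    # stem is the file name without its directory and without '.pdf'
    district = PurePosixPath(urlparse(full_url).path).stem
    state = district.split('-')[0]
    return full_url, district, state

relative = describe_link('/wp-content/uploads/2019/10/ABIA-CENTRAL.pdf')
absolute = describe_link('https://www.inecnigeria.org/wp-content/uploads/2019/KANO-NORTH.pdf')
```

As with the original slice, taking the text before the first hyphen still shortens multi-word states.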

Now that we have the data we need and have done some cleaning, one small problem remains.

Each table is in its own DataFrame, but they are easier to work with together. As shown below, the tables all share the same layout, so we can use the `pd.concat` function to combine them into one big table.

In [None]:
df_senatorial[0].head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,SENATORIAL DISTRICT,STATE
0,1,EHICHANYA OBINNAYA,M,A,40,,ABIA-CENTRAL,ABIA
1,2,NWOKO QUINET IFUNANYACHUKWU,F,ACD,76,,ABIA-CENTRAL,ABIA
2,3,IKECHUKWU EVEREST EGEONU,M,ADC,297,,ABIA-CENTRAL,ABIA
3,4,NWAOGU NKECHI JUSTINA WOKOCHA,F,APC,29860,,ABIA-CENTRAL,ABIA
4,5,AJAEGBU CHIDI ONYEUKWU,M,APGA,19534,,ABIA-CENTRAL,ABIA


In [None]:
# concatenate every district table in a single call
senate_df = pd.concat(df_senatorial)
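
Concatenating this way keeps each table's own row numbers, so index values repeat across districts. If a fresh running index is preferred, `ignore_index=True` can be passed. A minimal sketch on placeholder data (the names and numbers are illustrative, not scraped):

```python
import pandas as pd

# Two small placeholder tables standing in for two district results
frames = [
    pd.DataFrame({'NAME': ['A', 'B'], 'VOTES': [40, 76]}),
    pd.DataFrame({'NAME': ['C', 'D'], 'VOTES': [297, 12]}),
]

# One call combines the whole list; ignore_index renumbers the rows 0..n-1
combined = pd.concat(frames, ignore_index=True)
```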

In [None]:
senate_df.tail(10)

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,SENATORIAL DISTRICT,STATE
3,3,THADDEUS LIKITA ASHEI,M,ANN,239,,KADUNA-SOUTH-1,KADUNA
4,4,BALA BARNABAS YUSUF,M,APC,133287,,KADUNA-SOUTH-1,KADUNA
5,5,MUSA LAZARUS,M,CAP,277,,KADUNA-SOUTH-1,KADUNA
6,6,MARUF ABDULAHI,M,DA,64,,KADUNA-SOUTH-1,KADUNA
7,7,DAUDA PHILIBUS,M,GPN,176,,KADUNA-SOUTH-1,KADUNA
8,8,HASSAN MICHAEL PETER,M,LP,566,,KADUNA-SOUTH-1,KADUNA
9,9,LAAH DANJUMA TELLA,M,PDP,268923,ELECTED,KADUNA-SOUTH-1,KADUNA
10,10,PATRICK,M,PPN,252,,KADUNA-SOUTH-1,KADUNA
11,11,BULUS JAMES,M,PRP,1546,,KADUNA-SOUTH-1,KADUNA
12,12,SHEKARI RIJO SHEKARI,M,SDP,9609,,KADUNA-SOUTH-1,KADUNA


In [None]:
senate_df['STATE'].unique()

array(['ABIA', 'ADAMAWA', 'AKWA', 'ANAMBRA', 'BAUCHI', 'BAYELSA', 'BENUE',
       'CROSS', 'DELTA', 'EBONYI', 'EDO', 'EKITI', 'ENUGU',
       'ABAJI_GWAGWALADA', 'BWARI', 'FCT', 'GOMBE', 'IMO', 'JIGAWA',
       'kano', 'KANO', 'KATSINA', 'KEBBI', 'KOGI', 'KWARA', 'LAGOS',
       'NASSARAWA', 'NIGER', 'OGUN', 'ONDO', 'OSUN', 'OYO', 'PLATEAU',
       'RIVERS', 'SOKOTO', 'TARABA', 'YOBE', 'ZAMFARA', 'BORNO', 'KADUNA'],
      dtype=object)

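The data looks good, except that there are 40 distinct values instead of the expected 37 (36 states plus the FCT). The main issue is that 'ABAJI_GWAGWALADA' and 'BWARI' are being treated as states instead of districts under the FCT. Also, 'kano' and 'KANO' are the same state. Note too that 'AKWA' and 'CROSS' are shortened forms of Akwa Ibom and Cross River, because only the text before the first hyphen is kept.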
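
One possible cleanup, sketched on a placeholder Series rather than on `senate_df` itself: upper-case every value and map the two FCT districts to 'FCT'. The `fct_districts` mapping is an assumption based on the observation above.

```python
import pandas as pd

# Placeholder values copied from the unique() output above
states = pd.Series(['ABIA', 'kano', 'KANO', 'ABAJI_GWAGWALADA', 'BWARI', 'FCT'])

# Districts that belong to the FCT rather than being states of their own
fct_districts = {'ABAJI_GWAGWALADA': 'FCT', 'BWARI': 'FCT'}

cleaned = states.str.upper().replace(fct_districts)
unique_states = sorted(cleaned.unique())
```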

The code below is optional. It just saves each scraped DataFrame to its own CSV file.

In [None]:
# for i in range(len(df_senatorial)):
#   filepath = '/content/drive/MyDrive/Election Data/Raw Data/raw_senatorial/' + 'senatorial district'  + str(i) + '.csv'
#   df_senatorial[i].to_csv(filepath, index=False)

In [None]:
webpage = 'https://www.inecnigeria.org/2019-house-of-representative-elections-result/'

In [None]:
def scrape_table_from_divs(webpage):
  inec_link = "https://www.inecnigeria.org/"
  col_names = {0:'S/N', 1:'NAME', 2:'GENDER', 3:'PARTY', 4:'VOTES', 5:'ELECTED'}
  df_list = []
  page = urllib.request.urlopen(webpage)
  soup = BeautifulSoup(page, 'html.parser')
  # one 'card' element per state
  for state in soup.find_all(class_='card'):
    get_state = None
    # find returns None when there is no header; find_all returns a list, which is never None
    if state.find(class_='card-header') is not None:
      # the state name is taken as the second-to-last word of the card's first link text
      get_state = state.find('a').text.lstrip().split()[-2]
    # one 'col-lg-12' element per zone, each linking to a PDF
    for zone in state.find_all(class_='col-lg-12'):
      get_zone = zone.find('a').text
      get_link = zone.find('a').get('href')
      # keep only the first table found in the PDF
      get_table = scrape_table(inec_link + get_link)[0]
      get_table.rename(columns=col_names, inplace=True)
      get_table['STATE'] = get_state
      get_table['ZONE'] = get_zone
      df_list.append(get_table)
  return df_list

    

In [None]:
reps_df_ls = scrape_table_from_divs(webpage)

In [None]:
reps_df = pd.concat(reps_df_ls)
reps_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE,6
0,S/N,NAME OF CANDIDATE,GENDER,PARTY,VOTES RECEIVED,REMARKS,ABA,ABA NORTH-ABA SOUTH,
1,,IFEANYI NWOSU CHIOMA,,,,,ABA,ABA NORTH-ABA SOUTH,
2,1,,F,A,1,,ABA,ABA NORTH-ABA SOUTH,
3,,ADAORA,,,,,ABA,ABA NORTH-ABA SOUTH,
4,2,OKPECHI ADANNE H.,F,ADC,20,,ABA,ABA NORTH-ABA SOUTH,


array(['REMARKS', nan, 'DECLARED', 'ELECTED', 'REMARK', 'VOTES RECEIVED',
       'DECLARED ELECTED', 'DECLARED\rELECTED', '22', '25', '2,713',
       '8,690', '7', '107', '17', '10', '12', '37', '52,934', '77',
       'EECTED', 'ELECETD', 'CODE: FC/343/FC', 'Elected (As',
       'per S/Court', 'Decision)'], dtype=object)
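
The array above shows the same remark written several ways: split by a carriage return ('DECLARED\rELECTED') and misspelled ('EECTED', 'ELECETD'). A minimal sketch of normalizing those spellings on a placeholder Series; the `typo_fixes` mapping covers only the two misspellings seen above and is an assumption about what they were meant to say.

```python
import pandas as pd

# Placeholder remarks copied from the kinds of values seen in the output above
remarks = pd.Series(['DECLARED ELECTED', 'DECLARED\rELECTED', 'EECTED', 'ELECETD', 'Elected', None])

typo_fixes = {'EECTED': 'ELECTED', 'ELECETD': 'ELECTED'}

normalized = (
    remarks.str.replace(r'\s+', ' ', regex=True)  # collapse \r and repeated spaces
           .str.strip()
           .str.upper()
           .replace(typo_fixes)                   # fix the known misspellings
)
```

Numeric strings such as '2,713' in the same column are vote counts that landed in the wrong column, so they would need separate handling.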

In [None]:
gov_df_ls = scrape_table_from_divs('https://www.inecnigeria.org/2019-governorship-election-results/')

In [None]:
gov_df_ls[27].head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE
0,S/N,NAME OF CANDIDATE,GENDER,PARTY,VOTES RECEIVED,REMARK,YOBE,YOBE
1,1,ISAH MOHAMMED,MALE,ADC,1350,,YOBE,YOBE
2,,,,,,Declared,YOBE,YOBE
3,2,MAI MALA,MALE,APC,444013,,YOBE,YOBE
4,,,,,,Elected,YOBE,YOBE


In [None]:
gov_df = pd.concat(gov_df_ls)
gov_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE
0,S/N,NAME OF CANDIDATE,GENDER,PARTY,VOTES RECEIVED,REMARK,ABIA,ABIA
1,1,EMEKA UWAKOLAM,MALE,A,43,,ABIA,ABIA
2,2,UBANI VINCENT ANTHONY,MALE,AAC,254,,ABIA,ABIA
3,3,OPARA ALPHONSIUS OBINNA,MALE,ACD,166,,ABIA,ABIA
4,4,OBINNA KELENNA,MALE,ADC,333,,ABIA,ABIA


In [None]:
reps_df[reps_df[6] == 'DECLARED ELECTED']

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE,6
16,12,OSSAI NICHOLAS OSSAI,MALE,PDP,,52934.0,DELTA,NDOKWA EAST-NDOKWA WEST-UKWUANI,DECLARED ELECTED
5,5,ABUBAKAR ABUBAKAR KABIR,MALE,APC,37573.0,,KANO,BCH-169,DECLARED ELECTED
9,9,YAKASAI MUKHTAR ISHAQ,MALE,APC,43049.0,,KANO,KMC-179,DECLARED ELECTED


In [None]:
reps_df[6].unique()

array([nan, 'REMARK', 'DECLARED', 'ELECTED', 'REMARKS',
       'DECLARED ELECTED'], dtype=object)

In [None]:
reps_df = reps_df[reps_df['NAME'].notna()]

In [None]:
reps_df = reps_df[reps_df['GENDER'] != "GENDER"]

In [None]:
# assign through a single .loc call so the change is written to reps_df itself
reps_df.loc[reps_df['NAME'] == "OSSAI NICHOLAS OSSAI", 'VOTES'] = "52,934"
reps_df.loc[reps_df['NAME'] == "ABUBAKAR ABUBAKAR KABIR", 'ELECTED'] = "ELECTED"
reps_df.loc[reps_df['NAME'] == "YAKASAI MUKHTAR ISHAQ", 'ELECTED'] = "ELECTED"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

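The warning above is how pandas signals that an assignment may have gone to a temporary copy instead of the original DataFrame. A minimal sketch, on placeholder rows, of the single-step `.loc[mask, column]` form that the warning recommends:

```python
import pandas as pd

# Placeholder rows; the values are illustrative only
df = pd.DataFrame({
    'NAME': ['OSSAI NICHOLAS OSSAI', 'YAKASAI MUKHTAR ISHAQ'],
    'VOTES': [None, '43049'],
    'ELECTED': [None, None],
})

# Row mask and column label go into the same .loc call,
# so pandas writes into df directly
df.loc[df['NAME'] == 'OSSAI NICHOLAS OSSAI', 'VOTES'] = '52,934'
df.loc[df['NAME'] == 'YAKASAI MUKHTAR ISHAQ', 'ELECTED'] = 'ELECTED'
```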

In [None]:
senate_df.to_csv('nigerian-senate-election-data-2019.csv', index=False)

In [None]:
gov_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE
0,S/N,NAME OF CANDIDATE,GENDER,PARTY,VOTES RECEIVED,REMARK,ABIA,ABIA
1,1,EMEKA UWAKOLAM,MALE,A,43,,ABIA,ABIA
2,2,UBANI VINCENT ANTHONY,MALE,AAC,254,,ABIA,ABIA
3,3,OPARA ALPHONSIUS OBINNA,MALE,ACD,166,,ABIA,ABIA
4,4,OBINNA KELENNA,MALE,ADC,333,,ABIA,ABIA


In [None]:
gov_df = gov_df[gov_df['GENDER'] != "GENDER"]

In [None]:
gov_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE
1,1,EMEKA UWAKOLAM,MALE,A,43,,ABIA,ABIA
2,2,UBANI VINCENT ANTHONY,MALE,AAC,254,,ABIA,ABIA
3,3,OPARA ALPHONSIUS OBINNA,MALE,ACD,166,,ABIA,ABIA
4,4,OBINNA KELENNA,MALE,ADC,333,,ABIA,ABIA
5,5,OKEY OKORO UDO,MALE,ADP,522,,ABIA,ABIA


In [None]:
gov_df.to_csv('nigerian-governorship-election-data-2019.csv', index=False)

In [None]:
pred_df_ls = scrape_table('https://www.inecnigeria.org/wp-content/uploads/2019/10/2019-GE-PRESIDENTIAL-ELECTION-RESULTS.pdf')

In [None]:
pred_df_ls[1]

Unnamed: 0,0,1,2,3,4,5
0,39,KRIZ DAVID,M,LM,1438,
1,40,MUHAMMED USMAN ZAKI,M,LP,5074,
2,41,ADESANYA-DAVIES MERCY OLUFUNMILAYO,F,MAJA,2651,
3,42,BASHAYI ISA DANSARKI,M,MMN,14540,
4,43,SANTURAKI HAMISU,M,MPN,2752,
5,44,RABIA YASAI HASSAN CENGIZ,F,NAC,2279,
6,45,ADEMOLA BABATUNDE ABIDEMI,M,NCMP,1378,
7,46,SALISU YUNUSA TANKO,M,NCP,3799,
8,47,A. EDOSOMWAN JOHNSON,M,NDCP,1192,
9,48,AKPUA ROBINSON,M,NDLP,1588,


In [None]:
pred_df = pd.concat([pred_df_ls[0], pred_df_ls[1]])
pred_df.head()

Unnamed: 0,0,1,2,3,4,5
0,SN,NAME OF CANDIDATE,GENDE,PARTY,VOTES RECEIVED,REMARKS
1,1,OSITELU ISAAC BABATUNDE,M,A,19219,
2,2,ABDULRASHID HASSAN BABA,M,AA,14380,
3,3,OMOYELE SOWORE,M,AAC,33953,
4,4,CHIKE UKAEGBU,M,AAP,8902,


In [None]:
col_name = {0:'S/N', 1:'NAME', 2:'GENDER', 3:'PARTY', 4:'VOTES', 5:'ELECTED'}
pred_df.rename(columns=col_name, inplace=True)
pred_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED
0,SN,NAME OF CANDIDATE,GENDE,PARTY,VOTES RECEIVED,REMARKS
1,1,OSITELU ISAAC BABATUNDE,M,A,19219,
2,2,ABDULRASHID HASSAN BABA,M,AA,14380,
3,3,OMOYELE SOWORE,M,AAC,33953,
4,4,CHIKE UKAEGBU,M,AAP,8902,


In [None]:
pred_df = pred_df[pred_df['GENDER'] != "GENDE"]
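
Filtering on the literal header text works, but it has to match each spelling ("GENDER", "GENDE"). A more general sketch, assuming every real candidate row has a numeric S/N, keeps only the rows whose S/N parses as a number. The table below is placeholder data.

```python
import pandas as pd

# Placeholder table with two header rows mixed into the data
table = pd.DataFrame({
    'S/N': ['SN', '1', '2', 'S/N', '3'],
    'NAME': ['NAME OF CANDIDATE', 'CANDIDATE A', 'CANDIDATE B',
             'NAME OF CANDIDATE', 'CANDIDATE C'],
})

# Header text cannot be parsed as a number, so it becomes NaN and is dropped
is_candidate = pd.to_numeric(table['S/N'], errors='coerce').notna()
candidates = table[is_candidate].copy()
```

This would also drop rows with a missing S/N, so it is not suitable for tables where a candidate's name spills onto a separate row.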

In [None]:
pred_df.to_csv('nigerian-presidential-election-2019.csv', index=False)

In [None]:
senate_df['ELECTION-TYPE'] = 'SENATE'
gov_df['ELECTION-TYPE'] = 'GOVERNORSHIP'
pred_df['ELECTION-TYPE'] = 'PRESIDENTIAL'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [None]:
senate_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,SENATORIAL DISTRICT,STATE,ELECTION-TYPE
0,1,EHICHANYA OBINNAYA,M,A,40,,ABIA-CENTRAL,ABIA,SENATE
1,2,NWOKO QUINET IFUNANYACHUKWU,F,ACD,76,,ABIA-CENTRAL,ABIA,SENATE
2,3,IKECHUKWU EVEREST EGEONU,M,ADC,297,,ABIA-CENTRAL,ABIA,SENATE
3,4,NWAOGU NKECHI JUSTINA WOKOCHA,F,APC,29860,,ABIA-CENTRAL,ABIA,SENATE
4,5,AJAEGBU CHIDI ONYEUKWU,M,APGA,19534,,ABIA-CENTRAL,ABIA,SENATE


In [None]:
gov_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,STATE,ZONE,ELECTION-TYPE
1,1,EMEKA UWAKOLAM,MALE,A,43,,ABIA,ABIA,GOVERNORSHIP
2,2,UBANI VINCENT ANTHONY,MALE,AAC,254,,ABIA,ABIA,GOVERNORSHIP
3,3,OPARA ALPHONSIUS OBINNA,MALE,ACD,166,,ABIA,ABIA,GOVERNORSHIP
4,4,OBINNA KELENNA,MALE,ADC,333,,ABIA,ABIA,GOVERNORSHIP
5,5,OKEY OKORO UDO,MALE,ADP,522,,ABIA,ABIA,GOVERNORSHIP


In [None]:
pred_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,ELECTION-TYPE
1,1,OSITELU ISAAC BABATUNDE,M,A,19219,,PRESIDENTIAL
2,2,ABDULRASHID HASSAN BABA,M,AA,14380,,PRESIDENTIAL
3,3,OMOYELE SOWORE,M,AAC,33953,,PRESIDENTIAL
4,4,CHIKE UKAEGBU,M,AAP,8902,,PRESIDENTIAL
5,5,SHIPI MOSES GODIA,M,ABP,4523,,PRESIDENTIAL


In [None]:
election_df = pd.concat([senate_df, gov_df, pred_df])

In [None]:
election_df.head()

Unnamed: 0,S/N,NAME,GENDER,PARTY,VOTES,ELECTED,SENATORIAL DISTRICT,STATE,ELECTION-TYPE,ZONE
0,1,EHICHANYA OBINNAYA,M,A,40,,ABIA-CENTRAL,ABIA,SENATE,
1,2,NWOKO QUINET IFUNANYACHUKWU,F,ACD,76,,ABIA-CENTRAL,ABIA,SENATE,
2,3,IKECHUKWU EVEREST EGEONU,M,ADC,297,,ABIA-CENTRAL,ABIA,SENATE,
3,4,NWAOGU NKECHI JUSTINA WOKOCHA,F,APC,29860,,ABIA-CENTRAL,ABIA,SENATE,
4,5,AJAEGBU CHIDI ONYEUKWU,M,APGA,19534,,ABIA-CENTRAL,ABIA,SENATE,


In [None]:
election_df.to_csv('nigerian-election-2019-not-reps-data.csv', index=False)
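
Before any analysis, the VOTES column will likely need converting to numbers, since some values carry thousands separators (for example "52,934"). A minimal sketch on a placeholder Series, assuming anything that cannot be parsed should become NaN:

```python
import pandas as pd

# Placeholder values mixing integers, strings with commas, a gap and header text
votes = pd.Series([40, '52,934', '133287', None, 'VOTES RECEIVED'], dtype=object)

numeric_votes = pd.to_numeric(
    votes.astype(str).str.replace(',', '', regex=False),  # drop thousands separators
    errors='coerce',                                       # unparseable text becomes NaN
)
```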