# <img style="float: left; padding-right: 100px; width: 300px" src="images/logo.png">AI4SG Bootcamp:



## Module 2D: Data Scraping


**Authors:** Martha Shaka

---

### What is Data scraping
It is commonly defined as a system where a technology extracts data from a particular codebase or program. Data scraping provides results for a variety of uses and automates aspects of data aggregation.
Most common data scraping is *web scraping*, is the process of importing information from a website into a spreadsheet or 
local file saved on your computer. It’s one of the most efficient ways to get data from the web, and in some cases to channel
that data to another website. 
Data scraping has a vast number of applications – it’s useful in just about any case where data needs to be moved from one place to another

#### Uses of data scraping include:
<ol start="1">
<li>Research for web content/business intelligence </li>
<li>Pricing for travel booker sites/price comparison sites </li>
<li>Finding sales leads/conducting market research by crawling public data sources (e.g. Yell and Twitter) </li>
<li>Sending product data from an e-commerce site to another online vendor (e.g. Google Shopping)</li>
</ol>

In this lab, we'll scrape 2019 A level Results:

https://onlinesys.necta.go.tz/results/2019/acsee/acsee.htm .

We'll walk through scraping the list pages for the school names/urls

Although many programming languages offer libraries for web information retrieval and analysis, 
we will be focusing on the Python data analysis ecosystem given its popularity and capabilities.

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Exploring the Web pages and downloading them</li>
<li> Parse the page, extract school urls </li>
<li> Parse a school page, extract student list with results </li>
<li> Set up a pipeline for fetching and parsing</li>
</ol>

## Learning Goals

Understand the structure of a web page. Use Beautiful soup to scrape content from these web pages.

*This lab corresponds to lectures 2, 3 and 4 and maps on to homework 1 and further.*

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

In [3]:
import time, requests

## 1. Exploring the web pages and downloading them

We're going to capture infromation on the from six NECTA results list for school. 



To getch this page we use the `requests` module. But are we allowed to do this? Lets check:

https://onlinesys.necta.go.tz/results/2019/acsee/acsee.htm

Yes we are.

In [4]:
URLSTART="https://onlinesys.necta.go.tz/results/2019/acsee/acsee.htm"
url = URLSTART
print(url)
page = requests.get(url)

https://onlinesys.necta.go.tz/results/2019/acsee/acsee.htm


We can see properties of the page. Most relevant are `status_code` and `text`. The former tells us  if the web-page was found, and if found , ok. (See lecture notes.)

In [5]:
page.status_code # 200 is good

200

In [164]:
page.text[:1000]

'\ufeff<HTML>\r\n<HEAD>\r\n<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">\r\n<HEAD>\r\n<BODY TEXT="#000080" LINK="#0000ff" VLINK="#800080" BGCOLOR="LIGHTYELLOW">\r\n<FONT COLOR="#800080"><H2>NATIONAL EXAMINATIONS COUNCIL OF TANZANIA</H2>\r\n<H3><U> ACSEE 2019 EXAMINATION RESULTS ENQUIRIES </U></H3></FONT>\r\n<BR>\r\n<TABLE BORDER=0 CELLSPACING=0 WIDTH="75%" ALIGN="CENTER" >\r\n<TR ><TD WIDTH="100%" VALIGN="MIDDLE"><B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">CLICK ANY LETTER BELOW TO FILTER CENTRES BY ALPHABET</B></FONT></TD></TR>\r\n</TABLE>\r\n<TABLE BORDER CELLSPACING=1 WIDTH="75%" ALIGN="CENTER" >\r\n<TR> <TD WIDTH="10%" VALIGN="MIDDLE" BGCOLOR="LIGHTGREEN">\r\n<B><FONT FACE="Arial" SIZE=3><P ALIGN="CENTER" ><A HREF="index.htm">ALL CENTRES\r\n <TD WIDTH="2.5%" VALIGN="MIDDLE" BGCOLOR="LIGHTBLUE">\r\n<B><FONT FACE="Arial" SIZE=3><P ALIGN="CENTER"><A HREF="indexfiles/index_a.htm">A\r\n</A></B></FONT></TD>\r\n <TD WIDTH="2.5%" VALIGN="MIDDLE" BGCOLOR="LIG

Let us write a loop to fetch students results for 2 schools with code p0104 and p0110 from goodreads. Notice the use of a format string. This is an example of old-style python format strings

In [9]:
URLSTART="https://onlinesys.necta.go.tz/results/2019/acsee/results/"
SCHOOL=["p0104","p0110"]
for i in range(0,2):
    bookpage=str(i)
    stuff=requests.get(URLSTART+SCHOOL[i]+".htm")
    filetowrite="files/school"+ '%02d' % i + ".html"
    print("FTW", filetowrite)
    fd=open(filetowrite,"w",encoding="utf-8")
    fd.write(stuff.text)
    fd.close()
    time.sleep(2)

FTW files/school00.html
FTW files/school01.html


## 2. Parse the page, extract school urls

Notice how we do file input-output, and use beautiful soup in the code below. The `with` construct ensures that the file being read is closed, something we do explicitly for the file being written. 

In [11]:
from bs4 import BeautifulSoup

Getting the html of the page is just the first step. Next step is to create a Beautiful Soup object from the html. 
This is done by passing the html to the BeautifulSoup() function. 
The Beautiful Soup package is used to parse the html, that is, take the raw html text and break it into Python objects. The second argument 'lxml' is the html parser whose details you do not need to worry about at this point.

In [158]:
filetoread="files/school00.html"
print("FTW", filetoread)
with open(filetoread) as fdr:
    data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
print(soup)

FTW files/school00.html
ï»¿<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<head>
<body bgcolor="LIGHTBLUE" link="#0000ff" text="#000080" vlink="#800080">
<font color="#800080"><h2>NATIONAL EXAMINATIONS COUNCIL OF TANZANIA</h2>
<h1><p align="LEFT"> ACSEE 2019 EXAMINATION RESULTS</p></h1>
<h3><p align="LEFT">P0104 BWIRU BOYS' SECONDARY SCHOOL CENTRE

<h3>DIV-I = 1;  DIV-II = 13;  DIV-III = 41;  DIV-IV = 15;  DIV-0 = 14  </h3></p></h3></font>
</body></head></head></html>
<table bgcolor="LIGHTYELLOW" border="" cellspacing="2" width="70%">
<tr><td valign="MIDDLE" width="6%">
<p align="CENTER"><b><font face="Courier" size="2">CNO</font></b></p></td></tr></table>
<td valign="MIDDLE" width="4%">
<b><font face="Courier" size="2"><p align="CENTER">SEX</p></font></b></td>
<td valign="MIDDLE" width="6%">
<b><font face="Courier" size="2"><p align="CENTER">AGGT</p></font></b></td>
<td valign="MIDDLE" width="4%">
<b><font face="Courier" size="2"><p align="CENTER">D

The soup object allows you to extract interesting information about the website you're scraping such as getting the title of the page as shown below. 
You can view the html of the webpage by right-clicking anywhere on the webpage and selecting "Inspect." This above see shows what the result looks like.

In [160]:
# Get the title
title = soup.title
print(title)               #for this page it has no title as you can see on the above html codes

None


You can use the find_all() method of soup to extract useful html tags within a webpage. 
Examples of useful tags include < a > for hyperlinks, < table > for tables, < tr > for table rows, < th > for table headers, and < td > for table cells.
The code below shows how to extract all the rows within the table in a webpage. for our case we have only one table.

In [167]:
rows = soup.find_all('tr')
print(rows[:5])

[<tr><td valign="MIDDLE" width="6%">
<p align="CENTER"><b><font face="Courier" size="2">CNO</font></b></p></td></tr>, <tr><td valign="MIDDLE" width="6%">
<font face="Arial" size="1"><p align="CENTER">P0104/0501</p></font></td>
<td valign="MIDDLE" width="4%">
<font face="Arial" size="1"><p align="CENTER">F</p></font></td>
<td valign="MIDDLE" width="6%">
<font face="Arial" size="1"><p align="CENTER"> 18</p></font></td>
<td valign="MIDDLE" width="4%">
<font face="Arial" size="1"><p align="CENTER">IV</p></font></td>
<td valign="MIDDLE" width="58%">
<font face="Arial" size="1"><p align="LEFT">G/STUDIES - 'F'   GEOGR - 'E'   CHEMISTRY - 'F'   BIOLOGY - 'S'   BAM - 'F'   </p></font></td></tr>, <tr><td valign="MIDDLE" width="6%">
<font face="Arial" size="1"><p align="CENTER">P0104/0502</p></font></td>
<td valign="MIDDLE" width="4%">
<font face="Arial" size="1"><p align="CENTER">F</p></font></td>
<td valign="MIDDLE" width="6%">
<font face="Arial" size="1"><p align="CENTER"> 16</p></font></td>
<

The output above shows that each row is printed with html tags embedded in each row. This is not what you want. You can use remove the html tags using Beautiful Soup or regular expressions.

The easiest way to remove html tags is to use Beautiful Soup, and it takes just one line of code to do this. Pass the string of interest into BeautifulSoup() and use the get_text() method to extract the text without html tags.

In [170]:
#finding columns in each row
cols= []
for row in rows:
    cells = row.find_all('td')
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    cols.append(cleantext)
print(cols[:5])

["[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES - 'E'   HISTORY - 'S'   KISWAHILI - 'C'   ENGLISH - 'D'   ]", "[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES - 'E'   HISTORY - 'S'   KISWAHILI - 'C'   ENGLISH - 'D'   ]", "[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES - 'E'   HISTORY - 'S'   KISWAHILI - 'C'   ENGLISH - 'D'   ]", "[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES - 'E'   HISTORY - 'S'   KISWAHILI - 'C'   ENGLISH - 'D'   ]", "[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES - 'E'   HISTORY - 'S'   KISWAHILI - 'C'   ENGLISH - 'D'   ]"]


The next step is to convert the list into a dataframe and get a quick view of the first 10 rows using Pandas.

In [172]:
dataframe = pd.DataFrame(cols)
dataframe.head(10)

Unnamed: 0,0
0,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
1,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
2,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
3,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
4,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
5,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
6,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
7,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
8,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."
9,"[\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ..."


okey lets do the same for all the schools we have saved their html files

In [175]:
list_rows = []
for i in range(0,2):    
    stri = '%02d' % i
    filetoread="files/school"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread) as fdr:
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    rows = soup.find_all('tr')
    schoolNames=soup.select("h3")[0].text

    
    for row in rows:
        cells = row.find_all('td')
        cleantext = BeautifulSoup(str_cells, "lxml").get_text()
        list_rows.append(cleantext)
     
    df = pd.DataFrame(list_rows)
    print(df)

FTW files/school00.html
                                                    0
0   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
1   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
2   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
3   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
4   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
5   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
6   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
7   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
8   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
9   [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
10  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
11  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
12  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
13  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
14  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
15  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
16  [\nP0110/0741, \nF, \n 13, \nIII, \nG/STUDIES ...
17  

## 3. Data Manipulation and Cleaning

Ok so now lets dive in and get the data in the format required.
The dataframe is not in the format we want. To clean it up, you should split the "0" column into multiple columns at the comma position. This is accomplished by using the str.split() method.

In [184]:
df1 = df[0].str.split(',', expand=True)
df1.head(10)

Unnamed: 0,0,1,2,3,4
0,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
1,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
2,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
3,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
4,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
5,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
6,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
7,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
8,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...
9,[\nP0110/0741,\nF,\n 13,\nIII,\nG/STUDIES - 'E' HISTORY - 'S' KISWAHILI...


This looks much better, but there is still work to do. The dataframe has unwanted square brackets surrounding each row. You can use the strip() method to remove the opening square bracket on column "0."

In [188]:
df1[0] = df1[0].str.strip('[')    #removing the [ in column 1
df1[4] = df1[4].str.strip('[')    #removing the [ in the last column 
df1[0] = df1[0].str.strip('\n') 
df1[1] = df1[1].str.strip(' \n') 
df1[2] = df1[2].str.strip('\n ') 
df1[3] = df1[3].str.strip(' \n')
df1[4] = df1[4].str.strip(' \n') #removing the \n in all the columns
df1.head(10)


Unnamed: 0,0,1,2,3,4
0,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
1,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
2,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
3,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
4,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
5,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
6,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
7,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
8,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
9,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...


![](images/goodreads3.png)

The table is missing table headers. You can use the find_all() method to get the table headers. The html tag fpr table header is usually <th> but for our case there is no defined table header we will just right the list of the columns name

In [190]:
df1.columns = ['index' , 'sex', 'aggr','division','subjects']
df1.head()

Unnamed: 0,index,sex,aggr,division,subjects
0,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
1,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
2,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
3,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...
4,P0110/0741,F,13,III,G/STUDIES - 'E' HISTORY - 'S' KISWAHILI - ...


In [191]:
df1.info()
df1.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337 entries, 0 to 336
Data columns (total 5 columns):
index       337 non-null object
sex         337 non-null object
aggr        337 non-null object
division    337 non-null object
subjects    337 non-null object
dtypes: object(5)
memory usage: 13.2+ KB


(337, 5)

Now we can save the data for future uses.


In [192]:
df.to_csv("files/student_results.csv", index=False, header=True)

### Exercise 
Your job is to write the code to get the list of ten schools code for example Bwiru Boys' Center is P0104 and save in a list called school_code.

Use this link https://onlinesys.necta.go.tz/results/2019/acsee/acsee.htm to get list of the school codes

In [202]:
## write you code here

## Get the html file first and save it as homepage.html at the folder called results

####################### write you code here


## Create Beautiful soup object
school_url= []
filetoread="files/homepage.html"
####################### write you code here


##Create a list of url for the first 100 schools

####################### write you code here

FTW files/homepage.html
FTW files/homepage.html
['results/p0101.htm', 'results/p0104.htm', 'results/p0110.htm', 'results/p0112.htm', 'results/p0116.htm', 'results/p0119.htm', 'results/p0123.htm', 'results/p0129.htm', 'results/p0132.htm', 'results/p0133.htm']


## 4. Set up a pipeline for fetching and parsing

Ok lets get back to the fetching...

In [226]:

urlStart = "https://onlinesys.necta.go.tz/results/2019/acsee/"
for i in range(0,20):    
    stuff=requests.get(urlStart+school_url[i])
    filetowrite = school_url[i]
    print("FTW", filetowrite)
    f = open(filetowrite,'w',encoding="utf-8")
    f.write(stuff.text)
    f.close()


FTW results/p0101.htm
FTW results/p0104.htm


We have now saved html files for each school according to there school codes


### Note

Now we have used Beautiful soup to scrape content from NECTA website. 

Whether or not you intend to use data scraping in your work, it’s advisable to still educate yourself on the subject, as it is most likely to become even more important in the next few years.

There are now data scraping AI on the market that can use machine learning to keep on getting better at recognising inputs which only humans have traditionally been able to interpret – like images