# **Fundamentals of Data Analysis 2021 - Assessment**

## Instructions
A Jupyter notebook called cao.ipynb that contains the following: 
- A clear and concise overview of how to load CAO points information from the CAO website into a pandas data frame, pitched at your classmates.
- A detailed comparison of CAO points in 2019, 2020, and 2021 using the functionality in pandas.
- Appropriate plots and other visualisations to enhance your notebook for viewers.

<br>

***

## **Part 2 - Analysing CAO Points Information from the CAO Website**

<br>

***

## **Comment on Assignment Completion**

Despite numerous attempts, I have not been able to extract the numerical points data for the years 2019 and 2020 in a form that can be analysed. In these data sets the points are often combined with special characters (* and #), and I have been unable to isolate the numbers and convert them to an integer data type. 
The most recent failed attempt is documented in section 2.3.8. 

As a result, part 4 containing the data analysis is very limited, because only the 2021 data could be used. My plan was to compare the three years to investigate similarities and differences, and to see whether any trends of increasing or decreasing points requirements could be observed. 

<br>

***

## **Table of Contents**
<br>

**[1.0 CAO Points Overview](#part1)**<br>
&emsp; **[1.1 Calculation of CAO Points](#part1.1)**<br>
&emsp; **[1.2 Application and Offers Process](#part1.2)**<br>
**[2.0 Retrieving CAO Points Information](#part2)**<br>
&emsp; **[2.1 Preparation Steps](#part2.1)**<br>
&emsp;&emsp; **[2.1.1 Importing Required Modules](#part2.1.1)**<br>
&emsp;&emsp; **[2.1.2 Time Stamp Creation](#part2.1.2)**<br>
&emsp; **[2.2 Retrieving CAO Points from PDF Format (Year 2019)](#part2.2)**<br>
&emsp;&emsp; **[2.2.1 Defining File Paths](#part2.2.1)**<br>
&emsp;&emsp; **[2.2.2 Extracting Data](#part2.2.2)**<br>
&emsp;&emsp; **[2.2.3 Exporting Data to csv](#part2.2.3)**<br>
&emsp;&emsp; **[2.2.4 Create Data Frame](#part2.2.4)**<br>
&emsp;&emsp; **[2.2.5 Update Column Names](#part2.2.5)**<br>
&emsp;&emsp; **[2.2.6 Remove Irrelevant Rows](#part2.2.6)**<br>
&emsp;&emsp;&emsp; **[2.2.6.1 Level 6/7 - Remove First 8 Rows](#part2.2.6.1)**<br>
&emsp;&emsp;&emsp; **[2.2.6.2 Remove Rows with College Names](#part2.2.6.2)**<br>
&emsp;&emsp; **[2.2.7 Set Course Code as Index](#part2.2.7)**<br>
&emsp;&emsp; **[2.2.8 Add Column with Level](#part2.2.8)**<br>
&emsp; **[2.3 Retrieving CAO Points from Excel Format (Year 2020)](#part2.3)**<br>
&emsp;&emsp; **[2.3.1 Defining File Path](#part2.3.1)**<br>
&emsp;&emsp; **[2.3.2 Extracting Data](#part2.3.2)**<br>
&emsp;&emsp; **[2.3.3 Creating the pandas Data Frame](#part2.3.3)**<br>
&emsp;&emsp; **[2.3.4 Update Column Names](#part2.3.4)**<br>
&emsp;&emsp; **[2.3.5 Update Level Column](#part2.3.5)**<br>
&emsp;&emsp; **[2.3.6 Set Course Code as Index](#part2.3.6)**<br>
&emsp;&emsp; **[2.3.7 Update Random to boolean](#part2.3.7)**<br>
&emsp;&emsp; **[2.3.8 Split Points and Portfolio Indicators](#part2.3.8)**<br>
&emsp;&emsp; **[2.3.9 Exporting to csv File](#part2.3.9)**<br>
&emsp; **[2.4 Retrieving CAO Points from HTTP Format (Year 2021)](#part2.4)**<br>
&emsp;&emsp; **[2.4.1 Defining File Paths](#part2.4.1)**<br>
&emsp;&emsp; **[2.4.2 Extracting Data](#part2.4.2)**<br>
&emsp;&emsp; **[2.4.3 Saving the Original Data Set](#part2.4.3)**<br>
&emsp;&emsp; **[2.4.4 Identifying Relevant Lines](#part2.4.4)**<br>
&emsp;&emsp; **[2.4.5 Exporting Data to csv](#part2.4.5)**<br>
&emsp;&emsp; **[2.4.6 Creating the pandas Data Frame](#part2.4.6)**<br>
&emsp;&emsp; **[2.4.7 Add Column with Level](#part2.4.7)**<br>
&emsp;&emsp; **[2.4.8 Set Course Code as Index](#part2.4.8)**<br>
&emsp;&emsp; **[2.4.9 Update Random and Portfolio Indicators to boolean](#part2.4.9)**<br>
**[3.0 Merging Data Frames](#part3)**<br>
&emsp; **[3.1 Identifying List of Courses](#part3.1)**<br>
&emsp; **[3.2 Merging Data Frames into One](#part3.2)**<br>
**[4.0 Data Set Analysis - 2021 Data](#part4)**<br>
&emsp; **[4.1 General Overview](#part4.1)**<br>
&emsp; **[4.2 Round 1 Specifics](#part4.2)**<br>
&emsp; **[4.3 Round 2 Specifics](#part4.3)**<br>
&emsp; **[4.4 Round 1 vs Round 2 Comparison](#part4.4)**<br>

**[References Used](#references)**

<br>

***

<a id= 'part1'></a>
## **1.0 CAO Points Overview**
This notebook analyses the minimum CAO points required for taking up undergraduate studies in Ireland for the years 2019 to 2021. <br><br>
The Central Applications Office (CAO) is an organisation that manages applications for higher education courses at colleges and universities in the Republic of Ireland.
The third level education application process is centrally managed by the CAO rather than by the individual higher education institutions (HEI). 


There are three types of third level qualifications within the Irish National Framework of Qualifications (NFQ).

NFQ Levels
- Level 6 - Higher Certificates (awarded by institutes of technology)
- Level 7 - Ordinary Bachelor Degree (awarded by institutes of technology and universities)
- Level 8 - Honours Bachelor Degree (awarded by institutes of technology and universities)


Applications are assessed on the basis of the points awarded to students for their exam results in the Leaving Certificate (the final second level qualification). [[1]](#reference1), [[2]](#reference2), [[3]](#reference3), [[4]](#reference4)

<br>

<a id= 'part1.1'></a>
### **1.1 Calculation of CAO Points**

A student's six best subjects are counted towards the final score, which has a maximum of 625 points (six subjects at 100 points each plus 25 bonus points for mathematics). For their Leaving Certificate, students can opt to take each subject at ordinary or higher level, with higher level results earning more points. 

**Points awarded for higher and ordinary level subjects:** [[5]](#reference5)

| Percentage Result | Grade <br>(Ordinary level) | Points <br>(Ordinary level) | Grade <br>(Higher level) | Points <br>(Higher level) |
|---------|:---:|---:|:---:|----:|
| 90 +    | O1  | 56 | H1  | 100 |
| 80 - 89 | O2  | 46 | H2  | 88  |
| 70 - 79 | O3  | 37 | H3  | 77  |
| 60 - 69 | O4  | 28 | H4  | 66  |
| 50 - 59 | O5  | 20 | H5  | 56  |
| 40 - 49 | O6  | 12 | H6  | 46  |
| 30 - 39 | O7  | 0  | H7  | 37  |
| 0 - 29  | O8  | 0  | H8  | 0   |


An extra 25 points can be awarded for passing the higher level mathematics exam. [[6]](#reference6)
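
The calculation described above can be sketched in a few lines of Python. This is only an illustration and not an official CAO calculator: the function name `cao_points`, the subject name `'Mathematics'` and the sample grades are made up, and the sketch assumes that the 25 bonus points apply to higher level mathematics grades H1 to H6 and are added to the mathematics score before the six best subjects are selected.

```python
# Points per grade, taken from the table above
POINTS = {
    'H1': 100, 'H2': 88, 'H3': 77, 'H4': 66, 'H5': 56, 'H6': 46, 'H7': 37, 'H8': 0,
    'O1': 56, 'O2': 46, 'O3': 37, 'O4': 28, 'O5': 20, 'O6': 12, 'O7': 0, 'O8': 0,
}

# Assumption: the maths bonus applies to higher level grades H1 to H6
BONUS_GRADES = {'H1', 'H2', 'H3', 'H4', 'H5', 'H6'}
MATHS_BONUS = 25


def cao_points(results):
    """Return the CAO points for a dict mapping subject name to grade."""
    scores = []
    for subject, grade in results.items():
        points = POINTS[grade]
        # Hypothetical subject name used to identify the bonus subject
        if subject == 'Mathematics' and grade in BONUS_GRADES:
            points += MATHS_BONUS
        scores.append(points)
    # Only the six best subjects count towards the total
    return sum(sorted(scores, reverse=True)[:6])


# Made-up example results for seven subjects
example = {
    'Mathematics': 'H3', 'English': 'H2', 'Irish': 'O1', 'French': 'H4',
    'Biology': 'H1', 'Physics': 'H5', 'Geography': 'O3',
}
print(cao_points(example))
```

In this made-up example the weakest result (Geography, O3) is not counted because only the six best subjects contribute to the total.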

<br>

<a id= 'part1.2'></a>
### **1.2 Application and Offers Process**

The sum of a student's points for their six best subjects makes up their CAO points, so the highest possible score is 625. 

Students then apply to the CAO for their preferred courses submitting their CAO Points. For courses where the number of applicants exceeds the number of available places, the places are offered to students with the highest points. 

Offers are sent to applicants in the following rounds: 
- Round A: Deferred applicants, mature applicants, etc.
- Round Zero: Medicine applicants, additional mature, deferred and access applicants
- Round One: Applicants applying on the basis of school leaving cert results
- Round Two (and subsequent): Offers are issued until the end of the offer season or until all places have been filled. 

[[7]](#reference7)
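
The idea that places go to the applicants with the highest points, and that the points of the last applicant to receive an offer become the minimum points for the course, can be sketched as follows. The applicants, their points and the number of places are made up, and the sketch ignores details of the real process such as random selection between applicants on equal points.

```python
# Made-up applicants and their CAO points for one hypothetical course
applicants = {'A': 520, 'B': 478, 'C': 601, 'D': 455, 'E': 533}

# Made-up number of available places
places = 3

# Rank applicants from highest to lowest points
ranked = sorted(applicants.items(), key=lambda item: item[1], reverse=True)

# Offer the available places to the highest ranked applicants
offers = ranked[:places]

# The points of the last applicant to receive an offer are the minimum points
minimum_points = offers[-1][1]
print(offers, minimum_points)
```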

<br>

***

<a id= 'part2'></a>
## **2.0 Retrieving CAO Points Information**

On their website the CAO provides an overview of the minimum points required to access each of the study courses. 
The CAO points information for the years 2019, 2020 and 2021 is maintained on the CAO website in different formats: 
- 2019: Two lists in pdf format, one for level 8 and one for level 6 and 7 courses
- 2020: A combined list for level 6, 7 and 8 in Excel (.xlsx) format
- 2021: Two lists in http format, one for level 8 and one for level 6 and 7 courses

Due to the different file formats, different methods of retrieving the data need to be used.

<br>

***

<a id= 'part2.1'></a>
### **2.1 Preparation Steps**

<br>

<a id= 'part2.1.1'></a>
#### **2.1.1 Importing Required Modules**

As a first step, the Python packages that are used throughout the notebook are imported.  

In [1]:
# Import the urllib.request module for downloading 
import urllib.request as urlrq

# Import the regular expressions package for matching strings
import re

# Import the requests module to access and retrieve data from HTTP websites
import requests as rq

# Import the datetime module for setting time stamps when naming locally saved copies of data files
import datetime as dt

# Import the pandas library for working with data frames
import pandas as pd

# Import the tabula module for accessing data in PDF files
import tabula

<br>

<a id= 'part2.1.2'></a>
#### **2.1.2 Time Stamp Creation**

A time stamp of the current date and time is created. It is used in the names of local copies of files that are downloaded from the internet or created as part of the analysis. 

In [2]:
# Returns current date and time (as per computer time) using the datetime.now() function
now = dt.datetime.now()

In [3]:
# Convert the time stamp to string
nowstr = now.strftime('%Y%m%d_%H%M%S')
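
To show what the format string produces, the sketch below applies it to a fixed, made-up date and time instead of the current time, and builds a file path in the same way as the later sections do. The names `example_time`, `example_str` and `example_path` are only used for this illustration.

```python
import datetime as dt

# A fixed, made-up date and time so that the result is reproducible
example_time = dt.datetime(2022, 1, 2, 23, 21, 8)

# Same format string as above: year, month, day, underscore, hour, minute, second
example_str = example_time.strftime('%Y%m%d_%H%M%S')

# Build a file path by placing the time stamp between a prefix and the file extension
example_path = 'data/cao2020_' + example_str + '.xlsx'
print(example_path)
```

Because the time stamp starts with the year and ends with the seconds, file names created this way sort chronologically.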

<br>

***

<a id= 'part2.2'></a>
### **2.2 Retrieving CAO Points from PDF Format (Year 2019)**
The 2019 data is provided on the CAO website as pdf files. The tabula module can be used to access data in pdf documents. The module first needs to be installed, and it also requires Java to be installed on the machine. [[8]](#reference8)

Link to 2019 data on the CAO website: [CAO Points Required for Entry to 2019 Courses](http://www.cao.ie/index.php?page=points&p=2019)

<br>

<a id= 'part2.2.1'></a>
#### **2.2.1 Defining File Paths**

Defining the URL where the pdf files can be accessed and creating a file path for saving a local copy of the original pdf files. 

In [4]:
# Define the URL
# Level 8 Courses
url2019_8 ='http://www2.cao.ie/points/lvl8_19.pdf'

# Level 6 and 7 Courses
url2019_67 = 'http://www2.cao.ie/points/lvl76_19.pdf'

In [5]:
# Create a file path for saving the original data
# Level 8 Courses
path2019_8 = 'data/cao2019_lv8' + nowstr + '.pdf'

# Level 6 and 7 Courses
path2019_67 = 'data/cao2019_lv67' + nowstr + '.pdf'

In [6]:
# Create a file path for saving the extracted data as csv
# Level 8 Courses
path2019_8csv = 'data/cao2019_8_' + nowstr + '.csv'

# Level 6 and 7 Courses
path2019_67csv = 'data/cao2019_67_' + nowstr + '.csv'

<br>

<a id= 'part2.2.2'></a>
#### **2.2.2 Extracting Data**

A local copy of the original pdf files is saved first using the `urlretrieve()` function. The data can then be extracted from the pdf files with the tabula module, for example using its `read_pdf()` function. [[9]](#reference9)

In [7]:
# Save original pdf files to disk
# Level 8 Courses
urlrq.urlretrieve(url2019_8, path2019_8)

# Level 6 and 7 Courses
urlrq.urlretrieve(url2019_67, path2019_67)

('data/cao2019_lv6720220102_232108.pdf',
 <http.client.HTTPMessage at 0x1d690fb8e80>)

<br>

<a id= 'part2.2.3'></a>
#### **2.2.3 Exporting Data to csv**

The `convert_into()` function of the tabula module extracts the tables from the pdf files and saves them as csv files in the data folder. The extracted data can then be transformed into a pandas data frame by reading in the created csv files. [[10]](#reference10), [[11]](#reference11)

In [8]:
# Exporting the data to csv
# Level 8 Courses
tabula.convert_into(url2019_8, path2019_8csv, output_format='csv', pages='all')

# Level 6 and 7 courses
tabula.convert_into(url2019_67, path2019_67csv, output_format='csv', pages='all')

<br>

<a id= 'part2.2.4'></a>
#### **2.2.4 Create Data Frame**

In [9]:
# Level 8 Courses

# Read in the created csv file 
df2019_8 = pd.read_csv(path2019_8csv)

# Display the first 10 rows 
df2019_8.head(10)

Unnamed: 0,Course Code,INSTITUTION and COURSE,EOS,Mid
0,,Athlone Institute of Technology,,
1,AL801,Software Design with Virtual Reality and Gaming,304.0,328.0
2,AL802,Software Design with Cloud Computing,301.0,306.0
3,AL803,Software Design with Mobile Apps and Connected...,309.0,337.0
4,AL805,Network Management and Cloud Infrastructure,329.0,442.0
5,AL810,Quantity Surveying,307.0,349.0
6,AL820,Mechanical and Polymer Engineering,300.0,358.0
7,AL830,General Nursing,410.0,429.0
8,AL832,Psychiatric Nursing,387.0,403.0
9,AL836,Nutrition and Health Science,352.0,383.0


In [10]:
# Level 6 and 7 Courses

# Read in the created csv file
df2019_67 = pd.read_csv(path2019_67csv)

# Display the first 10 rows 
df2019_67.head(10)

Unnamed: 0.1,Unnamed: 0,ADMISSION DATA 2019,Unnamed: 2,Unnamed: 3
0,,End of Season,,
1,,"Level 6, 7",,
2,,The details given are for general information...,,
3,*,Not all on this points score were offered places,,
4,#,Test / Interview / Portfolio / Audition,,
5,AQA,All qualified applicants,,
6,,,,
7,Course Code,INSTITUTION and COURSE,EOS,Mid
8,,Athlone Institute of Technology,,
9,AL600,Software Design,205,306


<br>

<a id= 'part2.2.5'></a>
#### **2.2.5 Update Column Names**

Update the column names of both data frames to standardise them. 

In [11]:
# Level 8 Courses
df2019_8.columns = ['COURSE CODE', 'COURSE TITLE', '2019 EOS', '2019 MID']

# View to check
df2019_8.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2019 EOS,2019 MID
0,,Athlone Institute of Technology,,
1,AL801,Software Design with Virtual Reality and Gaming,304.0,328.0
2,AL802,Software Design with Cloud Computing,301.0,306.0
3,AL803,Software Design with Mobile Apps and Connected...,309.0,337.0
4,AL805,Network Management and Cloud Infrastructure,329.0,442.0


In [12]:
# Level 6/7 Courses
df2019_67.columns = ['COURSE CODE', 'COURSE TITLE', '2019 EOS', '2019 MID']

# View to check
df2019_67.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2019 EOS,2019 MID
0,,End of Season,,
1,,"Level 6, 7",,
2,,The details given are for general information...,,
3,*,Not all on this points score were offered places,,
4,#,Test / Interview / Portfolio / Audition,,


<br>

<a id= 'part2.2.6'></a>
#### **2.2.6 Remove Irrelevant Rows**

<br>

<a id= 'part2.2.6.1'></a>
##### **2.2.6.1 Level 6/7 - Remove First 8 Rows**

The first 8 rows of the level 6/7 data frame are not part of the actual data, so they can be removed. [[12]](#reference12)
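
Instead of hard-coding the number of rows, the header row could also be located programmatically. The sketch below illustrates this on a small made-up data frame (the name `raw` and its contents are hypothetical stand-ins for the level 6/7 data). It assumes that the header row can be recognised by the text 'Course Code' in the first column.

```python
import pandas as pd

# Made-up stand-in for the level 6/7 data after renaming the columns
raw = pd.DataFrame({
    'COURSE CODE': [None, '*', 'AQA', None, 'Course Code', None, 'AL600'],
    'COURSE TITLE': ['End of Season',
                     'Not all on this points score were offered places',
                     'All qualified applicants', None, 'INSTITUTION and COURSE',
                     'Athlone Institute of Technology', 'Software Design'],
    '2019 EOS': [None, None, None, None, 'EOS', None, '205'],
})

# Position of the first row whose course code cell holds the header text
# (note: argmax returns 0 if the text is not found at all)
header_pos = int((raw['COURSE CODE'] == 'Course Code').to_numpy().argmax())

# Keep only the rows after the header row and renumber them from 0
cleaned = raw.iloc[header_pos + 1:].reset_index(drop=True)
print(cleaned)
```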

In [13]:
# Removing first 8 rows of Level 6 and 7 courses
df2019_67.drop(index=[0, 1, 2, 3, 4, 5, 6, 7], inplace=True)

# Reset the index
df2019_67.reset_index(drop=True, inplace=True)
df2019_67.head(10)

Unnamed: 0,COURSE CODE,COURSE TITLE,2019 EOS,2019 MID
0,,Athlone Institute of Technology,,
1,AL600,Software Design,205,306.0
2,AL601,Computer Engineering,196,272.0
3,AL602,Mechanical Engineering,258,424.0
4,AL604,Civil Engineering,252,360.0
5,AL630,Pharmacy Technician,306,366.0
6,AL631,Dental Nursing,326,379.0
7,AL632,Applied Science,243,372.0
8,AL650,Business,210,317.0
9,AL651,Music and Instrument Technology,AQA,296.0


<br>

<a id= 'part2.2.6.2'></a>
##### **2.2.6.2 Remove Rows with College Names**

In the rows with actual course data, the first column contains the course code, which consists of two letters and three digits. The rows containing only the name of a college or university have also been extracted and added to the data frame (for example the first row). 
These rows can be identified by the value `NaN` in the first column (and also in the third and fourth columns). 
To remove them, the `dropna()` function can be used with the `subset` parameter, which specifies the columns to check for missing values. [[13]](#reference13), [[14]](#reference14)
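
The effect of the `subset` parameter can be seen on a small made-up example. The data frame `sample` below is hypothetical. It contains one college name row without a course code and one course row with a missing points value, to show that only the former is removed.

```python
import numpy as np
import pandas as pd

# Made-up sample: one college name row followed by two course rows
sample = pd.DataFrame({
    'COURSE CODE': [np.nan, 'AL801', 'AL802'],
    'COURSE TITLE': ['Athlone Institute of Technology',
                     'Software Design with Virtual Reality and Gaming',
                     'Software Design with Cloud Computing'],
    '2019 EOS': [np.nan, 304, np.nan],
})

# Drop only the rows where the course code is missing;
# NaN values in other columns do not cause a row to be dropped
courses = sample.dropna(subset=['COURSE CODE']).reset_index(drop=True)
print(courses)
```

Without the `subset` parameter, `dropna()` would also remove the course row with the missing points value.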

In [14]:
# Level 8 Courses

# Removing rows where Course Code is NaN
df2019_8.dropna(subset=['COURSE CODE'], inplace=True)

# Reset the index
df2019_8.reset_index(drop=True, inplace=True)
df2019_8.head(10)

Unnamed: 0,COURSE CODE,COURSE TITLE,2019 EOS,2019 MID
0,AL801,Software Design with Virtual Reality and Gaming,304,328
1,AL802,Software Design with Cloud Computing,301,306
2,AL803,Software Design with Mobile Apps and Connected...,309,337
3,AL805,Network Management and Cloud Infrastructure,329,442
4,AL810,Quantity Surveying,307,349
5,AL820,Mechanical and Polymer Engineering,300,358
6,AL830,General Nursing,410,429
7,AL832,Psychiatric Nursing,387,403
8,AL836,Nutrition and Health Science,352,383
9,AL837,Sports Science with Exercise Physiology,351,392


In [15]:
# Level 6/7 Courses

# Removing rows where Course Code is NaN
df2019_67.dropna(subset=['COURSE CODE'], inplace=True)

# Reset the index
df2019_67.reset_index(drop=True, inplace=True)
df2019_67.head(10)

Unnamed: 0,COURSE CODE,COURSE TITLE,2019 EOS,2019 MID
0,AL600,Software Design,205,306
1,AL601,Computer Engineering,196,272
2,AL602,Mechanical Engineering,258,424
3,AL604,Civil Engineering,252,360
4,AL630,Pharmacy Technician,306,366
5,AL631,Dental Nursing,326,379
6,AL632,Applied Science,243,372
7,AL650,Business,210,317
8,AL651,Music and Instrument Technology,AQA,296
9,AL660,Culinary Arts,AQA,216


<br>

<a id= 'part2.2.7'></a>
#### **2.2.7 Set Course Code as Index**

As a next step, the column with the Course Code is set as the index using the `set_index()` function, replacing the existing numerical index. This enables combining the data sets for the different years with the Course Code as the key. [[15]](#reference15)
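
The `verify_integrity` parameter used in the cells below makes `set_index()` check that the new index contains no duplicates, which matters when the index is later used as a key. The sketch below demonstrates this on a made-up data frame (`dupes`) in which one course code appears twice.

```python
import pandas as pd

# Made-up sample in which the course code 'AL600' appears twice
dupes = pd.DataFrame({
    'COURSE CODE': ['AL600', 'AL601', 'AL600'],
    'COURSE TITLE': ['Software Design', 'Computer Engineering', 'Software Design'],
})

try:
    dupes.set_index('COURSE CODE', verify_integrity=True)
    duplicate_found = False
except ValueError as err:
    # pandas raises a ValueError when the new index has duplicate entries
    duplicate_found = True
    print(err)

# After removing the duplicate row the same call succeeds
unique = dupes.drop_duplicates(subset='COURSE CODE').set_index(
    'COURSE CODE', verify_integrity=True)
print(unique.index.is_unique)
```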

In [16]:
# Level 8 Courses

# Set the index to the Course Code column 
df2019_8.set_index('COURSE CODE', inplace=True, verify_integrity=True)

# Display first 10 rows of dataset to verify
df2019_8.head(10)

Unnamed: 0_level_0,COURSE TITLE,2019 EOS,2019 MID
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL801,Software Design with Virtual Reality and Gaming,304,328
AL802,Software Design with Cloud Computing,301,306
AL803,Software Design with Mobile Apps and Connected...,309,337
AL805,Network Management and Cloud Infrastructure,329,442
AL810,Quantity Surveying,307,349
AL820,Mechanical and Polymer Engineering,300,358
AL830,General Nursing,410,429
AL832,Psychiatric Nursing,387,403
AL836,Nutrition and Health Science,352,383
AL837,Sports Science with Exercise Physiology,351,392


In [17]:
# Level 6/7 Courses

# Set the index to the Course Code column 
df2019_67.set_index('COURSE CODE', inplace=True, verify_integrity=True)

# Display first 10 rows of data set to verify
df2019_67.head(10)

Unnamed: 0_level_0,COURSE TITLE,2019 EOS,2019 MID
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL600,Software Design,205,306
AL601,Computer Engineering,196,272
AL602,Mechanical Engineering,258,424
AL604,Civil Engineering,252,360
AL630,Pharmacy Technician,306,366
AL631,Dental Nursing,326,379
AL632,Applied Science,243,372
AL650,Business,210,317
AL651,Music and Instrument Technology,AQA,296
AL660,Culinary Arts,AQA,216


<br>

<a id= 'part2.2.8'></a>
#### **2.2.8 Add Column with Level**

A column that indicates the level of the course is added to each data frame. 

In [18]:
# Level 8 Courses

# Add column for level
df2019_8['LEVEL'] = '8'

# Display first 5 rows to verify
df2019_8.head()

Unnamed: 0_level_0,COURSE TITLE,2019 EOS,2019 MID,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AL801,Software Design with Virtual Reality and Gaming,304,328,8
AL802,Software Design with Cloud Computing,301,306,8
AL803,Software Design with Mobile Apps and Connected...,309,337,8
AL805,Network Management and Cloud Infrastructure,329,442,8
AL810,Quantity Surveying,307,349,8


In [19]:
# Level 6/7 Courses

# Add column for level
df2019_67['LEVEL'] = '6/7'

# Display first 5 rows to verify
df2019_67.head()

Unnamed: 0_level_0,COURSE TITLE,2019 EOS,2019 MID,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AL600,Software Design,205,306,6/7
AL601,Computer Engineering,196,272,6/7
AL602,Mechanical Engineering,258,424,6/7
AL604,Civil Engineering,252,360,6/7
AL630,Pharmacy Technician,306,366,6/7


In [20]:
df2019_67.info()

<class 'pandas.core.frame.DataFrame'>
Index: 461 entries, AL600 to WD208
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   COURSE TITLE  461 non-null    object
 1   2019 EOS      454 non-null    object
 2   2019 MID      453 non-null    object
 3   LEVEL         461 non-null    object
dtypes: object(4)
memory usage: 18.0+ KB


In [21]:
df2019_8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, AL801 to WD230
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   COURSE TITLE  930 non-null    object
 1   2019 EOS      926 non-null    object
 2   2019 MID      915 non-null    object
 3   LEVEL         930 non-null    object
dtypes: object(4)
memory usage: 36.3+ KB


<br>

***

<a id= 'part2.3'></a>
### **2.3 Retrieving CAO Points from Excel Format (Year 2020)**

The 2020 data is provided on the CAO website as an Excel file. The `urlretrieve()` function can be used to save a local copy of the file, then the pandas `read_excel()` function is used to create a data frame from the Excel data. 

Link to 2020 data on the CAO Website: [Points Required for Entry to 2020 Courses](http://www.cao.ie/index.php?page=points&p=2020)

<br>

<a id= 'part2.3.1'></a>
#### **2.3.1 Defining File Path**

In [22]:
# Define the URL
url2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

In [23]:
# Create a file path for saving the original data
path = 'data/cao2020_' + nowstr + '.xlsx'

<br>

<a id= 'part2.3.2'></a>
#### **2.3.2 Extracting Data**

In [24]:
# Save original file to disk
urlrq.urlretrieve(url2020, path)

('data/cao2020_20220102_232108.xlsx',
 <http.client.HTTPMessage at 0x1d6911114c0>)

<br>

<a id= 'part2.3.3'></a>
#### **2.3.3 Creating the pandas Data Frame**

Using the `pd.read_excel()` function, the data is read from the Excel file and parsed into a pandas data frame. 
The first 10 rows of the Excel file do not contain course data, so they are skipped by setting the `skiprows` parameter. [[16]](#reference16)

Only some of the columns are added to the data frame; these are selected with the `usecols` parameter. 

In [25]:
# Download and parse the excel spreadsheet to a data frame
df2020 = pd.read_excel(
    path, skiprows=10, usecols=['COURSE TITLE', 'COURSE CODE2','R1 POINTS',
                                   'R1 Random *', 'R2 POINTS', 'R2 Random*',
                                   'EOS', 'EOS Random *', 'EOS Mid-point',
                                   'LEVEL'])

In [26]:
# Display first 10 rows of data frame
df2020.head(10)

Unnamed: 0,COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,LEVEL
0,International Business,AC120,209,,,,209,,280,8
1,Liberal Arts,AC137,252,,,,252,,270,8
2,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,8
3,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,8
4,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,8
5,Education & Design or Fine Art (Second Level T...,AD202,#+matric,,,,#+matric,,#+matric,8
6,Fine Art (portfolio),AD204,#+matric,,,,#+matric,,#+matric,8
7,Fashion Design (portfolio),AD211,#+matric,,,,#+matric,,#+matric,8
8,Product Design (portfolio),AD212,#+matric,,,,#+matric,,#+matric,8
9,Visual Culture,AD215,377,,320.0,,320,,389,8


In [27]:
# Reordering the columns
df2020 = df2020[['COURSE CODE2', 'COURSE TITLE', 'EOS', 'EOS Random *',
                 'EOS Mid-point', 'R1 POINTS', 'R1 Random *','R2 POINTS',
                 'R2 Random*','LEVEL']]

# Display first 5 rows
df2020.head()

Unnamed: 0,COURSE CODE2,COURSE TITLE,EOS,EOS Random *,EOS Mid-point,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,LEVEL
0,AC120,International Business,209,,280,209,,,,8
1,AC137,Liberal Arts,252,,270,252,,,,8
2,AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,,#+matric,#+matric,,,,8
3,AD102,Graphic Design and Moving Image Design (portfo...,#+matric,,#+matric,#+matric,,,,8
4,AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,,#+matric,#+matric,,,,8


<br>

<a id= 'part2.3.4'></a>
#### **2.3.4 Update Column Names**

In [28]:
# Update column names
df2020.columns = ['COURSE CODE', 'COURSE TITLE', '2020 EOS', 
                  '2020 EOS RANDOM', '2020 MID', '2020 R1 POINTS', 
                  '2020 R1 RANDOM', '2020 R2 POINTS', '2020 R2 RANDOM',
                  'LEVEL']

# View to check
df2020.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL
0,AC120,International Business,209,,280,209,,,,8
1,AC137,Liberal Arts,252,,270,252,,,,8
2,AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,,#+matric,#+matric,,,,8
3,AD102,Graphic Design and Moving Image Design (portfo...,#+matric,,#+matric,#+matric,,,,8
4,AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,,#+matric,#+matric,,,,8


<br>

<a id= 'part2.3.5'></a>
#### **2.3.5 Update Level Column**
The level column is updated to align with the other two data sets, which do not distinguish between levels 6 and 7. Wherever LEVEL is 6 or 7, the value is replaced by 6/7.
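
A possible variant of this update is sketched below on a made-up series (`levels`). It converts all values to text first, so that the column ends up holding only the strings '8' and '6/7', in the same style as the level column added to the 2019 data frames. This is an illustration of the idea, not the approach taken in the cells that follow.

```python
import pandas as pd

# Made-up sample of the LEVEL column as read from the Excel file (integers)
levels = pd.Series([8, 7, 6, 8, 7], name='LEVEL')

# Convert the values to strings, then map '6' and '7' to '6/7',
# so that the column holds text values only ('8' or '6/7')
levels_text = levels.astype(str).replace({'6': '6/7', '7': '6/7'})
print(levels_text.value_counts())
```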

In [29]:
# Viewing the different levels
df2020['LEVEL'].value_counts()

8    1027
7     346
6      91
Name: LEVEL, dtype: int64

In [30]:
# Update level column
df2020.loc[df2020['LEVEL'] == 6, 'LEVEL'] = '6/7'
df2020.loc[df2020['LEVEL'] == 7, 'LEVEL'] = '6/7'

In [31]:
# Viewing the levels after update to verify
df2020['LEVEL'].value_counts()

8      1027
6/7     437
Name: LEVEL, dtype: int64

<br>

<a id= 'part2.3.6'></a>
#### **2.3.6 Set Course Code as Index**

In [32]:
# Set the index to the Course Code column 
df2020.set_index('COURSE CODE', inplace=True, verify_integrity=True)

# Display first 10 rows of dataset to verify
df2020.head(10)

Unnamed: 0_level_0,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AC120,International Business,209,,280,209,,,,8
AC137,Liberal Arts,252,,270,252,,,,8
AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,,#+matric,#+matric,,,,8
AD102,Graphic Design and Moving Image Design (portfo...,#+matric,,#+matric,#+matric,,,,8
AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,,#+matric,#+matric,,,,8
AD202,Education & Design or Fine Art (Second Level T...,#+matric,,#+matric,#+matric,,,,8
AD204,Fine Art (portfolio),#+matric,,#+matric,#+matric,,,,8
AD211,Fashion Design (portfolio),#+matric,,#+matric,#+matric,,,,8
AD212,Product Design (portfolio),#+matric,,#+matric,#+matric,,,,8
AD215,Visual Culture,320,,389,377,,320.0,,8


<br>

<a id= 'part2.3.7'></a>
#### **2.3.7 Update Random to boolean**

There are three columns which contain an indicator (*) showing whether random selection was applied to select applicants. To assist with the analysis, these columns will be updated to boolean values.

In [33]:
# Update the 'RANDOM' columns to replace the * indicator with True
df2020.loc[df2020['2020 EOS RANDOM'] == '*', '2020 EOS RANDOM'] = True
df2020.loc[df2020['2020 R1 RANDOM'] == '*', '2020 R1 RANDOM'] = True
df2020.loc[df2020['2020 R2 RANDOM'] == '*', '2020 R2 RANDOM'] = True

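# Update the 'RANDOM' columns to replace the NaN values with False
# (the result is assigned back instead of calling fillna(inplace=True) on a column selection)
df2020['2020 EOS RANDOM'] = df2020['2020 EOS RANDOM'].fillna(False)
df2020['2020 R1 RANDOM'] = df2020['2020 R1 RANDOM'].fillna(False)
df2020['2020 R2 RANDOM'] = df2020['2020 R2 RANDOM'].fillna(False)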

In [34]:
# View results
df2020.head(10)

Unnamed: 0_level_0,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AC120,International Business,209,False,280,209,False,,False,8
AC137,Liberal Arts,252,False,270,252,False,,False,8
AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,False,#+matric,#+matric,False,,False,8
AD102,Graphic Design and Moving Image Design (portfo...,#+matric,False,#+matric,#+matric,False,,False,8
AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,False,#+matric,#+matric,False,,False,8
AD202,Education & Design or Fine Art (Second Level T...,#+matric,False,#+matric,#+matric,False,,False,8
AD204,Fine Art (portfolio),#+matric,False,#+matric,#+matric,False,,False,8
AD211,Fashion Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8
AD212,Product Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8
AD215,Visual Culture,320,False,389,377,False,320.0,False,8


<br>

<a id= 'part2.3.8'></a>
#### **2.3.8 Split Points and Portfolio Indicators**
Similar to the random indicator, a # indicator shows that students needed to provide a portfolio or sit a test or interview during the application process. It is added to the points value in several columns: '2020 EOS', '2020 MID', '2020 R1 POINTS', '2020 R2 POINTS'.

The following cells attempt to split the '2020 EOS' column into two columns, one containing the points and the other containing the indicator for portfolio (#). 

In [35]:
# Defining a function to split the data in column 2020 EOS

def eos_to_array_new(s):
    # If value is a number, return number and blank space
    if s.isdigit():
        return [s,'']
    else: 
        # If first character is a '#', set portfolio value as '#'
        portfolio = ''
        if s[0] == '#':
            portfolio = '#'
        # For any numerical values, set it as the points value
        points = ''
        for i in s:
            if i.isdigit():
                points = points + i
        return [points, portfolio]
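
As a possible alternative to the character-by-character function above, the split could be done with the vectorised string methods of pandas, followed by `pd.to_numeric()` to turn the digits into numbers. The sketch below is an illustration on a made-up series (`eos`): the values '#700' and '320*' are invented to cover the cases described in this notebook, and the approach has not been run against the real CAO data.

```python
import pandas as pd

# Made-up sample of points values in the styles seen in the data sets
eos = pd.Series([209, '#+matric', '#700', '320*', 'AQA', None], name='2020 EOS')

# Convert every value to text so that the string methods can be applied
text = eos.astype(str)

# Portfolio indicator: True where the value starts with '#'
portfolio = text.str.startswith('#', na=False)

# Points: first run of digits in the value, or missing if there is none
digits = text.str.extract(r'(\d+)', expand=False)

# Convert to a nullable integer type so that missing values are allowed
points = pd.to_numeric(digits, errors='coerce').astype('Int64')

print(pd.DataFrame({'points': points, 'portfolio': portfolio}))
```

The nullable `Int64` type is used here because a regular integer column cannot hold missing values, which occur for entries such as '#+matric' or 'AQA' that contain no digits.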

In [36]:
# Converting column data type to string
df2020['2020 EOS'] = df2020['2020 EOS'].astype(str)

In [37]:
# Creating a new column "New" applying the function defined above
df2020['New'] = df2020['2020 EOS'].apply(eos_to_array_new)

In [38]:
# View the result
df2020.head(10)

Unnamed: 0_level_0,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL,New
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
AC120,International Business,209,False,280,209,False,,False,8,"[209, ]"
AC137,Liberal Arts,252,False,270,252,False,,False,8,"[252, ]"
AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD102,Graphic Design and Moving Image Design (portfo...,#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD202,Education & Design or Fine Art (Second Level T...,#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD204,Fine Art (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD211,Fashion Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD212,Product Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,"[, #]"
AD215,Visual Culture,320,False,389,377,False,320.0,False,8,"[320, ]"


The values in the new column are Python lists. The next step is to split each list into the points and the portfolio indicator.

In [39]:
print(type(df2020.loc[df2020.index[0], 'New']))

<class 'list'>


In [40]:
# Create a new column '2020 EOS Portfolio' containing the second part of 'New'
df2020['2020 EOS Portfolio'] = df2020['New'].str.get(1)

In [41]:
# View the result
df2020.head(10)

Unnamed: 0_level_0,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL,New,2020 EOS Portfolio
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
AC120,International Business,209,False,280,209,False,,False,8,"[209, ]",
AC137,Liberal Arts,252,False,270,252,False,,False,8,"[252, ]",
AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD102,Graphic Design and Moving Image Design (portfo...,#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD202,Education & Design or Fine Art (Second Level T...,#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD204,Fine Art (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD211,Fashion Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD212,Product Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,"[, #]",#
AD215,Visual Culture,320,False,389,377,False,320.0,False,8,"[320, ]",


In [42]:
# Update column 'New' to only display the first part of the list
df2020['New'] = df2020['New'].str.get(0)

In [43]:
# View the result
df2020.head(10)

Unnamed: 0_level_0,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL,New,2020 EOS Portfolio
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
AC120,International Business,209,False,280,209,False,,False,8,209.0,
AC137,Liberal Arts,252,False,270,252,False,,False,8,252.0,
AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,False,#+matric,#+matric,False,,False,8,,#
AD102,Graphic Design and Moving Image Design (portfo...,#+matric,False,#+matric,#+matric,False,,False,8,,#
AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,False,#+matric,#+matric,False,,False,8,,#
AD202,Education & Design or Fine Art (Second Level T...,#+matric,False,#+matric,#+matric,False,,False,8,,#
AD204,Fine Art (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,,#
AD211,Fashion Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,,#
AD212,Product Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,,#
AD215,Visual Culture,320,False,389,377,False,320.0,False,8,320.0,


The data type of column 'New' is now string, and empty values are displayed as blank.

In [44]:
print(type(df2020.loc[df2020.index[0], 'New']))

<class 'str'>


In [45]:
print(type(df2020.loc[df2020.index[2], 'New']))

<class 'str'>


Updating the blank values to None (also tried np.nan). 

In [46]:
# Updating the blank values to None
df2020.loc[df2020['New'] == '', 'New'] = None

In [47]:
# View the result
df2020.head(10)

Unnamed: 0_level_0,COURSE TITLE,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,LEVEL,New,2020 EOS Portfolio
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
AC120,International Business,209,False,280,209,False,,False,8,209.0,
AC137,Liberal Arts,252,False,270,252,False,,False,8,252.0,
AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,False,#+matric,#+matric,False,,False,8,,#
AD102,Graphic Design and Moving Image Design (portfo...,#+matric,False,#+matric,#+matric,False,,False,8,,#
AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,False,#+matric,#+matric,False,,False,8,,#
AD202,Education & Design or Fine Art (Second Level T...,#+matric,False,#+matric,#+matric,False,,False,8,,#
AD204,Fine Art (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,,#
AD211,Fashion Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,,#
AD212,Product Design (portfolio),#+matric,False,#+matric,#+matric,False,,False,8,,#
AD215,Visual Culture,320,False,389,377,False,320.0,False,8,320.0,


Changing the data type of column 'New' to integer is unsuccessful because the column now contains missing values. The code below returns an error: "TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'". 

In [48]:
# Change data type of column 'New' to integer
# Commented out because it returns an error

# df2020['New'] = df2020['New'].astype(int)

Other attempts to resolve this issue included 
- using `str.split()` to split the column
- using `str.strip()` to remove the brackets from the column
- using regular expressions to match patterns

[[17]](#reference17), [[18]](#reference18), [[19]](#reference19), [[20]](#reference20), [[21]](#reference21), [[22]](#reference22), [[23]](#reference23), [[24]](#reference24), [[25]](#reference25), [[26]](#reference26), [[27]](#reference27), [[28]](#reference28)
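
One approach that was not tried above is to extract the digits with a regular expression and convert them to the nullable integer type `Int64`, which allows missing values in an integer column. The following is only a minimal sketch: the sample values are made up to imitate the mixed content of the points columns, and the names `eos`, `digits`, `points` and `portfolio` are hypothetical rather than taken from the notebook.

```python
import pandas as pd

# Made-up sample values imitating the mixed content of a points column
eos = pd.Series(['209', '252', '#+matric', '#320', '451*', None])

# Keep the first run of digits in each value; values without digits become NaN
digits = eos.str.extract(r'(\d+)', expand=False)

# Convert to numbers, then to the nullable integer type 'Int64' (capital I),
# which can hold whole numbers and missing values in the same column
points = pd.to_numeric(digits, errors='coerce').astype('Int64')

# Flag the '#' indicator separately; missing values count as False
portfolio = eos.str.contains('#', regex=False, na=False)

print(points.tolist())
print(portfolio.tolist())
```

Whether this handles every value in the real 2019 and 2020 data would need to be checked against the actual columns, for example entries that contain more than one number.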

<br>

<a id= 'part2.3.9'></a>
#### **2.3.9 Exporting to csv File**

Saving the extracted course lines as a csv file.

In [49]:
# Create a file path for the pandas data
pathpd = 'data/cao2020_new_csv' + nowstr + '.csv'

In [50]:
#Save pandas data frame as a csv file
df2020.to_csv(pathpd)

<br>

<a id= 'part2.4'></a>
### **2.4 Retrieving CAO Points from HTTP Format (Year 2021)**

The 2021 data is provided on the CAO website as a list on a web page (HTML), which is retrieved over HTTP. 

Links to 2021 data on the CAO Website: 
- [Points Required for Entry to 2021 Level 8 Courses](http://www2.cao.ie/points/l8.php)
- [Points Required for Entry to 2021 Level 7/6 Courses](http://www2.cao.ie/points/l76.php)


<br>

<a id= 'part2.4.1'></a>
#### **2.4.1 Defining File Paths**

In [51]:
# Create a file path for saving the original data

# Level 8 Courses
path2021_8 = 'data/cao2021_8_' + nowstr + '.html'

# Level 6/7 Courses
path2021_67 = 'data/cao2021_67_' + nowstr + '.html'

In [52]:
# Create a file path for saving extracted data as csv

# Level 8 Courses
path2021_8csv = 'data/cao2021_8_new_csv_' + nowstr + '.csv'

# Level 6/7 Courses
path2021_67csv = 'data/cao2021_67_new_csv_' + nowstr + '.csv'

<br>

<a id= 'part2.4.2'></a>
#### **2.4.2 Extracting Data**

Using the get() function from the requests module to retrieve the web pages containing the 2021 points from the CAO website.

In [53]:
# Using the get() function for fetching the CAO URL

# Level 8 Courses
resp_8 = rq.get('http://www2.cao.ie/points/l8.php')

# Level 6/7 Courses
resp_67 = rq.get('http://www2.cao.ie/points/l76.php')

In [54]:
# To check that response is successful (code 200) - Level 8
resp_8

<Response [200]>

In [55]:
# To check that response is successful (code 200) - Level 6/7
resp_67

<Response [200]>

<br>

<a id= 'part2.4.3'></a>
#### **2.4.3 Saving the Original Data Set**

Creating a local copy of the original data as a text file. 

In [56]:
# The content is encoded as Windows-1252 (cp1252) rather than iso-8859-1 as indicated by the website,
# which results in some characters not being displayed correctly.
# Changing the encoding to cp1252 allows the text to be decoded correctly.
# Keep a note of the encoding reported by the response before overriding it:
original_encoding = resp_8.encoding

# Change to cp1252
resp_8.encoding = 'cp1252'
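
The effect of the encoding choice can be illustrated without contacting the server. The sketch below uses made-up bytes, not content from the CAO page, and decodes them once as iso-8859-1 and once as cp1252. The two encodings agree on most characters (such as é) but differ in the byte range 0x80 to 0x9F, where cp1252 defines printable punctuation and iso-8859-1 only has control characters.

```python
# Made-up bytes: 0xE9 is 'é' in both encodings, while 0x96 and 0x92 are an
# en dash and a right single quotation mark in cp1252 only
raw = b'Caf\xe9 \x96 Children\x92s Nursing'

as_latin1 = raw.decode('iso-8859-1')
as_cp1252 = raw.decode('cp1252')

# repr() makes the invisible control characters visible
print(repr(as_latin1))
print(repr(as_cp1252))
```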

In [57]:
# Save a local copy of the original html file
with open(path2021_8, 'w') as f:
    f.write(resp_8.text)

In [58]:
# The Level 6/7 page has the same encoding issue as the Level 8 page:
# the content is Windows-1252 (cp1252), not the encoding indicated by the website.
# Override the encoding before saving the local copy
resp_67.encoding = 'cp1252'

In [59]:
# Save a local copy of the original html file
with open(path2021_67, 'w') as f:
    f.write(resp_67.text)

<br>

<a id= 'part2.4.4'></a>
#### **2.4.4 Identifying Relevant Lines**

Aside from the course data, the CAO website contains a number of additional text lines, headings and hyperlinks which are not required for the dataframe. Regular expressions can be used to identify and extract only the relevant lines with course data.

All of the course lines begin with the course code (2 letters, 3 numbers) followed by the course name and further information. 

In [60]:
# Compiling the regular expression for matching lines using the compile() function. 
# The regular expression matches all lines beginning with 2 uppercase letters (indicated by [A-Z]{2}), then 3 digits (indicated by [0-9]{3}),
# followed by any further characters (the . (dot) is a wildcard for any character, and * is a quantifier indicating 0 or more characters).
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')
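
To show how this pattern separates course lines from other content, the sketch below applies `fullmatch()` to a few made-up lines. The sample lines only imitate the kind of text found on the page and are not copied from it.

```python
import re

# Same pattern as above: 2 uppercase letters, 3 digits, then anything
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')

# Made-up lines; only the first one looks like a course line
samples = [
    'AL801  Software Design for Virtual Reality and Gaming     300',
    'Athlone Institute of Technology',
    '<a href="index.php">Back</a>',
    'al801  lower-case codes are not matched',
]

for line in samples:
    match = re_course.fullmatch(line)
    if match:
        # group(1) is the course code captured by the first pair of brackets
        print('course line:', match.group(1))
    else:
        print('skipped    :', line)
```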

In [61]:
# Defining a function to split a points entry into the points value and the indicators for additional selection requirements like tests or portfolios (#) and random selection (*)

def points_to_array(s):
    # https://www.pythonpool.com/empty-string-python/ using len() to check for empty values
    if len(s) == 0:
        return ['','','']
    else:
        portfolio = ''
        if s[0] == '#':
            portfolio = '#'
        random = ''
        if s[-1] == '*':
            random = '*'
        points = ''
        for i in s:
            if i.isdigit():
                points = points + i

        return [points, portfolio, random]
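
The behaviour of the function can be checked on a few typical inputs. The sketch below repeats the function in a condensed form so that it runs on its own. The input strings are made up, and the condensed version is intended to behave the same way as the original rather than replace it.

```python
def points_to_array(s):
    # Condensed copy of the function above, repeated so the example is self-contained
    if len(s) == 0:
        return ['', '', '']
    portfolio = '#' if s[0] == '#' else ''
    random = '*' if s[-1] == '*' else ''
    points = ''.join(ch for ch in s if ch.isdigit())
    return [points, portfolio, random]

# Made-up examples of the different formats a points entry can take
for entry in ['300', '#554', '451*', '#', '']:
    print(repr(entry), '->', points_to_array(entry))
```

The order of the returned list (points, portfolio, random) matches the order of the three columns per round in the csv header.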

<br>

<a id= 'part2.4.5'></a>
#### **2.4.5 Exporting Data to csv**

Extract the course lines and save them as a csv file. 

https://www.w3schools.com/python/ref_string_join.asp 

**Level 8 Courses**

In [62]:
# Loop through the lines containing courses using the iter_lines() function

# Adding a line count to keep track of the number of courses found by the regular expression:
no_lines = 0

# Open the csv file for writing
with open(path2021_8csv,'w') as f:
    # Create a header row
    f.write(','.join(['COURSE CODE','COURSE TITLE','2021 R1 POINTS', '2021 R1 PORTFOLIO', '2021 R1 RANDOM *','2021 R2 POINTS','2021 R2 PORTFOLIO', '2021 R2 RANDOM*']) + '\n')
    # Loop through the lines of the response
    for line in resp_8.iter_lines():
        #Decode the line using the Windows-1252 encoding
        dline = line.decode('cp1252')
        # If the regular expression defined above matches the line
        if re_course.fullmatch(dline):
            # Add one to the lines count
            no_lines = no_lines + 1
            # Extract the course code (first five characters of the line)
            course_code = dline[:5]
            # Extract the course title (characters 8 to 57 of the line), using strip() to remove surrounding white space
            course_title = dline[7:57].strip()
            # Extract the round 1 and round 2 points (from character 61 onwards), splitting the two rounds on one or more blank spaces
            course_points = re.split(' +', dline[60:])
            # Using join() to change each list returned by points_to_array into a string separated by commas
            course_points_1 = ",".join(points_to_array(course_points[0]))
            course_points_2 = ",".join(points_to_array(course_points[1]))
            
            # Collect the fields for this course
            linesplit = [course_code, course_title, course_points_1, course_points_2]
            # Join the fields with commas and write the row to the csv file
            f.write(','.join(linesplit) + '\n')
            
# Print the total number of processed lines
print(f"\nTotal number of lines is {no_lines}.")



Total number of lines is 949.


**Level 6/7**

In [63]:
# Loop through the lines containing courses using the iter_lines() function

# Adding a line count to keep track of the number of courses found by the regular expression:
no_lines = 0

# Open the csv file for writing
with open(path2021_67csv,'w') as f:
    # Create a header row
    f.write(','.join(['COURSE CODE','COURSE TITLE','2021 R1 POINTS', '2021 R1 PORTFOLIO', '2021 R1 RANDOM *','2021 R2 POINTS','2021 R2 PORTFOLIO', '2021 R2 RANDOM*']) + '\n')
    # Loop through the lines of the response
    for line in resp_67.iter_lines():
        #Decode the line using the Windows-1252 encoding
        dline = line.decode('cp1252')
        # If the regular expression defined above matches the line
        if re_course.fullmatch(dline):
            # Add one to the lines count
            no_lines = no_lines + 1
            # Extract the course code (first five characters of the line)
            course_code = dline[:5]
            # Extract the course title (characters 8 to 57 of the line), using strip() to remove surrounding white space
            course_title = dline[7:57].strip()
            # Extract the round 1 and round 2 points (from character 61 onwards), splitting the two rounds on one or more blank spaces
            course_points = re.split(' +', dline[60:])
            # Using join() to change each list returned by points_to_array into a string separated by commas
            course_points_1 = ",".join(points_to_array(course_points[0]))
            course_points_2 = ",".join(points_to_array(course_points[1]))
            
            # Collect the fields for this course
            linesplit = [course_code, course_title, course_points_1, course_points_2]
            # Join the fields with commas and write the row to the csv file
            f.write(','.join(linesplit) + '\n')
            
# Print the total number of processed lines
print(f"\nTotal number of lines is {no_lines}.")


Total number of lines is 416.


The total number of courses has been verified against the CAO website.
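
Joining the fields with `','.join()` works as long as no course title contains a comma. As a precaution, the rows could instead be written with the `csv` module from the standard library, which adds quotes where needed. The sketch below is an assumption about how that might look and uses a made-up course code and title; it is not the method used in this notebook.

```python
import csv
import io

# Made-up rows; the title deliberately contains a comma
rows = [
    ['COURSE CODE', 'COURSE TITLE', '2021 R1 POINTS'],
    ['XX001', 'Art & Design (Common Entry,portfolio)', '554'],
]

# Writing to an in-memory buffer instead of a file keeps the example self-contained
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(rows)

# A plain join would split the title into two fields when read back
naive = ','.join(rows[1])
print(len(naive.split(',')), 'fields with a plain join')

# csv.writer quotes the title, so reading the text back gives the original rows
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[1])
```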

<br>

<a id= 'part2.4.6'></a>
#### **2.4.6 Creating the pandas Data Frame**

Reading in data from the csv files created in the previous step and parsing it to a pandas data frame. 

In [64]:
# Converting 2021 level 8 data into a pandas data frame
df2021_8 = pd.read_csv(path2021_8csv)

# Display first 5 rows of data frame
df2021_8.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*
0,AL801,Software Design for Virtual Reality and Gaming,300.0,,,,,
1,AL802,Software Design in Artificial Intelligence for...,313.0,,,,,
2,AL803,Software Design for Mobile Apps and Connected ...,350.0,,,,,
3,AL805,Computer Engineering for Network Infrastructure,321.0,,,,,
4,AL810,Quantity Surveying,328.0,,,,,


In [65]:
# Converting 2021 Level 6/7 data into a pandas data frame
df2021_67 = pd.read_csv(path2021_67csv)

# Display first 5 rows of data frame
df2021_67.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*
0,AL605,Music and Instrument Technology,211.0,,,,,
1,AL630,Pharmacy Technician,308.0,,,,,
2,AL631,Dental Nursing,311.0,,,,,
3,AL632,Applied Science,297.0,,,,,
4,AL650,Business,,,,,,


<br>

<a id= 'part2.4.7'></a>
#### **2.4.7 Add Column with Level**

In [66]:
# Level 8 Courses

# Add column for level
df2021_8['LEVEL'] = '8'

# Display first 5 rows to verify
df2021_8.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*,LEVEL
0,AL801,Software Design for Virtual Reality and Gaming,300.0,,,,,,8
1,AL802,Software Design in Artificial Intelligence for...,313.0,,,,,,8
2,AL803,Software Design for Mobile Apps and Connected ...,350.0,,,,,,8
3,AL805,Computer Engineering for Network Infrastructure,321.0,,,,,,8
4,AL810,Quantity Surveying,328.0,,,,,,8


In [67]:
# Level 6/7 Courses

# Add column for level
df2021_67['LEVEL'] = '6/7'

# Display first 5 rows to verify
df2021_67.head()

Unnamed: 0,COURSE CODE,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*,LEVEL
0,AL605,Music and Instrument Technology,211.0,,,,,,6/7
1,AL630,Pharmacy Technician,308.0,,,,,,6/7
2,AL631,Dental Nursing,311.0,,,,,,6/7
3,AL632,Applied Science,297.0,,,,,,6/7
4,AL650,Business,,,,,,,6/7


<br>

<a id= 'part2.4.8'></a>
#### **2.4.8 Set Course Code as Index**

In [68]:
# Level 8 Courses

# Set the index to the Course Code column 
df2021_8.set_index('COURSE CODE', inplace=True, verify_integrity=True)

# Display first 10 rows of dataset to verify
df2021_8.head(10)

Unnamed: 0_level_0,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AL801,Software Design for Virtual Reality and Gaming,300.0,,,,,,8
AL802,Software Design in Artificial Intelligence for...,313.0,,,,,,8
AL803,Software Design for Mobile Apps and Connected ...,350.0,,,,,,8
AL805,Computer Engineering for Network Infrastructure,321.0,,,,,,8
AL810,Quantity Surveying,328.0,,,,,,8
AL811,Civil Engineering,,,,,,,8
AL820,Mechanical and Polymer Engineering,327.0,,,,,,8
AL830,General Nursing,451.0,,*,444.0,,,8
AL832,Mental Health Nursing,440.0,,*,431.0,,,8
AL835,Pharmacology,356.0,,,,,,8


In [69]:
# Level 6/7 Courses

# Set the index to the Course Code column 
df2021_67.set_index('COURSE CODE', inplace=True, verify_integrity=True)

# Display first 10 rows of data set to verify
df2021_67.head(10)

Unnamed: 0_level_0,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AL605,Music and Instrument Technology,211.0,,,,,,6/7
AL630,Pharmacy Technician,308.0,,,,,,6/7
AL631,Dental Nursing,311.0,,,,,,6/7
AL632,Applied Science,297.0,,,,,,6/7
AL650,Business,,,,,,,6/7
AL660,Culinary Arts,,,,,,,6/7
AL661,Bar Supervision,,,,,,,6/7
AL663,Business (Sport and Recreation),,,,,,,6/7
AL701,Computer Engineering for Network Infrastructure,207.0,,,,,,6/7
AL702,Software Design in Artificial Intelligence for...,220.0,,,,,,6/7
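
The `verify_integrity=True` argument used in both `set_index()` calls makes pandas raise an error if a course code appears more than once, so the calls succeeding confirms that the codes are unique within each data frame. The sketch below illustrates this on a made-up frame with a repeated code; the name `demo` and its contents are hypothetical.

```python
import pandas as pd

# Made-up frame in which the code 'XX001' appears twice
demo = pd.DataFrame({
    'COURSE CODE': ['XX001', 'XX002', 'XX001'],
    'COURSE TITLE': ['Course A', 'Course B', 'Course A (repeat)'],
})

try:
    demo.set_index('COURSE CODE', verify_integrity=True)
    duplicate_found = False
except ValueError as err:
    # pandas refuses to create an index containing duplicate values
    duplicate_found = True
    print('Index could not be set:', err)

# duplicated(keep=False) marks every row whose code occurs more than once
repeats = demo[demo['COURSE CODE'].duplicated(keep=False)]
print(repeats)
```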


In [70]:
df2021_8['2021 R1 RANDOM *'].unique()

array([nan, '*'], dtype=object)

In [71]:
df2021_8['2021 R1 PORTFOLIO'].unique()

array([nan, '#'], dtype=object)

<br>

<a id= 'part2.4.9'></a>
#### **2.4.9 Update Random and Portfolio Indicators to boolean**

In [72]:
# Level 8 courses

# Update the 'RANDOM' columns to replace the * indicator with True
df2021_8.loc[df2021_8['2021 R1 RANDOM *'] == '*', '2021 R1 RANDOM *'] = True
df2021_8.loc[df2021_8['2021 R2 RANDOM*'] == '*', '2021 R2 RANDOM*'] = True

# Update the 'PORTFOLIO' columns to replace the # indicator with True
df2021_8.loc[df2021_8['2021 R1 PORTFOLIO'] == '#', '2021 R1 PORTFOLIO'] = True
df2021_8.loc[df2021_8['2021 R2 PORTFOLIO'] == '#', '2021 R2 PORTFOLIO'] = True

# Update the 'RANDOM' columns to replace the NaN values with False
df2021_8['2021 R1 RANDOM *'].fillna(False, inplace = True)
df2021_8['2021 R2 RANDOM*'].fillna(False, inplace = True)

# Update the 'PORTFOLIO' columns to replace the NaN values with False
df2021_8['2021 R1 PORTFOLIO'].fillna(False, inplace = True)
df2021_8['2021 R2 PORTFOLIO'].fillna(False, inplace = True)

In [73]:
# Level 6/7 courses

# Update the 'RANDOM' columns to replace the * indicator with True
df2021_67.loc[df2021_67['2021 R1 RANDOM *'] == '*', '2021 R1 RANDOM *'] = True
df2021_67.loc[df2021_67['2021 R2 RANDOM*'] == '*', '2021 R2 RANDOM*'] = True

# Update the 'PORTFOLIO' columns to replace the # indicator with True
df2021_67.loc[df2021_67['2021 R1 PORTFOLIO'] == '#', '2021 R1 PORTFOLIO'] = True
df2021_67.loc[df2021_67['2021 R2 PORTFOLIO'] == '#', '2021 R2 PORTFOLIO'] = True

# Update the 'RANDOM' columns to replace the NaN values with False
df2021_67['2021 R1 RANDOM *'].fillna(False, inplace = True)
df2021_67['2021 R2 RANDOM*'].fillna(False, inplace = True)

# Update the 'PORTFOLIO' columns to replace the NaN values with False
df2021_67['2021 R1 PORTFOLIO'].fillna(False, inplace = True)
df2021_67['2021 R2 PORTFOLIO'].fillna(False, inplace = True)
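
The same result can probably be reached in one step per column by comparing each value with its marker, since a missing value never equals the marker and therefore becomes False. The sketch below shows this on a small made-up frame; `demo` and `markers` are hypothetical names, and the approach is an alternative to the cells above rather than what was run.

```python
import numpy as np
import pandas as pd

# Made-up frame imitating the indicator columns after read_csv
demo = pd.DataFrame({
    '2021 R1 PORTFOLIO': [np.nan, '#', np.nan],
    '2021 R1 RANDOM *': ['*', np.nan, np.nan],
})

# Marker expected in each column
markers = {'2021 R1 PORTFOLIO': '#', '2021 R1 RANDOM *': '*'}

for column, marker in markers.items():
    # eq() compares element by element; NaN is not equal to the marker, so it gives False
    demo[column] = demo[column].eq(marker)

print(demo)
print(demo.dtypes)
```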

In [74]:
df2021_8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 949 entries, AL801 to WD232
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   COURSE TITLE       949 non-null    object 
 1   2021 R1 POINTS     923 non-null    float64
 2   2021 R1 PORTFOLIO  949 non-null    bool   
 3   2021 R1 RANDOM *   949 non-null    bool   
 4   2021 R2 POINTS     255 non-null    float64
 5   2021 R2 PORTFOLIO  949 non-null    bool   
 6   2021 R2 RANDOM*    949 non-null    bool   
 7   LEVEL              949 non-null    object 
dtypes: bool(4), float64(2), object(2)
memory usage: 40.8+ KB


In [75]:
df2021_67.info()

<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, AL605 to WD208
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   COURSE TITLE       416 non-null    object 
 1   2021 R1 POINTS     394 non-null    float64
 2   2021 R1 PORTFOLIO  416 non-null    bool   
 3   2021 R1 RANDOM *   416 non-null    bool   
 4   2021 R2 POINTS     113 non-null    float64
 5   2021 R2 PORTFOLIO  416 non-null    bool   
 6   2021 R2 RANDOM*    416 non-null    bool   
 7   LEVEL              416 non-null    object 
dtypes: bool(4), float64(2), object(2)
memory usage: 17.9+ KB


<br>

***

<a id= 'part3'></a>
### **3.0 Merging the Data Frames**

<br>

<a id= 'part3.1'></a>
#### **3.1 Identifying List of Courses**

In [76]:
# Extract Course Codes, Course Title and Level only
courses2021_8 = df2021_8[['COURSE TITLE', 'LEVEL']]
courses2021_67 = df2021_67[['COURSE TITLE', 'LEVEL']]
courses2020 = df2020[['COURSE TITLE', 'LEVEL']]
courses2019_8 = df2019_8[['COURSE TITLE', 'LEVEL']]
courses2019_67 = df2019_67[['COURSE TITLE', 'LEVEL']]

In [77]:
# View example to check
courses2021_67.head()

Unnamed: 0_level_0,COURSE TITLE,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1
AL605,Music and Instrument Technology,6/7
AL630,Pharmacy Technician,6/7
AL631,Dental Nursing,6/7
AL632,Applied Science,6/7
AL650,Business,6/7


In [78]:
# Combining all 5 course code data sets
allcourses = pd.concat([courses2021_8, courses2021_67, courses2020, 
                        courses2019_8, courses2019_67])

Sorting the data frame by the course code shows that there are now a number of duplicates which will need to be removed. [[29]](#reference29)


In [79]:
# Sort by course code 
allcourses.sort_index(inplace=True)

# Display first 10 rows
allcourses.head(10)

Unnamed: 0_level_0,COURSE TITLE,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1
AC120,International Business,8
AC120,International Business,8
AC120,International Business,8
AC137,Liberal Arts,8
AC137,Liberal Arts,8
AC137,Liberal Arts,8
AD101,First Year Art and Design (Common Entry portfo...,8
AD101,"First Year Art & Design (Common Entry,portfolio)",8
AD101,First Year Art & Design (Common Entry),8
AD102,Graphic Design and Moving Image Design (portfo...,8


In [80]:
allcourses = allcourses.reset_index().drop_duplicates(subset='COURSE CODE').set_index('COURSE CODE')
allcourses

Unnamed: 0_level_0,COURSE TITLE,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1
AC120,International Business,8
AC137,Liberal Arts,8
AD101,First Year Art and Design (Common Entry portfo...,8
AD102,Graphic Design and Moving Image Design (portfo...,8
AD103,Textile and Surface Design and Jewellery and O...,8
...,...,...
WD211,Creative Computing,8
WD212,Recreation and Sport Management,8
WD230,Mechanical and Manufacturing Engineering,8
WD231,Early Childhood Care and Education,8


In [81]:
allcourses.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1793 entries, AC120 to WD232
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   COURSE TITLE  1793 non-null   object
 1   LEVEL         1793 non-null   object
dtypes: object(2)
memory usage: 42.0+ KB
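
Duplicates can also be removed without resetting and restoring the index by using `Index.duplicated()`. The sketch below shows this on a made-up frame; the course codes and titles are hypothetical. As with `drop_duplicates()`, only the first row for each code is kept, so the title that survives depends on the row order.

```python
import pandas as pd

# Made-up frame with the code 'XX001' listed under two slightly different titles
demo = pd.DataFrame(
    {'COURSE TITLE': ['Course A (2021)', 'Course A (2020)', 'Course B'],
     'LEVEL': ['8', '8', '6/7']},
    index=pd.Index(['XX001', 'XX001', 'XX002'], name='COURSE CODE'),
)

# duplicated() marks every repeat of an index label after its first occurrence;
# the ~ operator inverts the mask so that only first occurrences are kept
unique_courses = demo[~demo.index.duplicated(keep='first')]
print(unique_courses)
```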


<br>

<a id= 'part3.2'></a>
#### **3.2 Merging Data Frames into One**

In [82]:
# Merging the two 2021 data frames
df2021 = pd.concat([df2021_67, df2021_8])

In [83]:
# Merging the two 2019 data frames
df2019 = pd.concat([df2019_67, df2019_8])

In [84]:
# Defining new data frame for all years, using allcourses as basis
dfall = allcourses 

In [85]:
# Defining the columns for the combined data frame
columns2021 = ['2021 R1 POINTS', '2021 R1 PORTFOLIO', '2021 R1 RANDOM *', 
               '2021 R2 POINTS', '2021 R2 PORTFOLIO', '2021 R2 RANDOM*']
columns2020 = ['2020 EOS', '2020 EOS RANDOM', '2020 MID', '2020 R1 POINTS', 
               '2020 R1 RANDOM', '2020 R2 POINTS', '2020 R2 RANDOM']
columns2019 = ['2019 EOS', '2019 MID']

In [86]:
# Create combined data frame
dfall = dfall.join((df2021[columns2021], df2020[columns2020], df2019[columns2019]), how = 'left')

In [87]:
# Display first 10 rows
dfall.head(10)

Unnamed: 0_level_0,COURSE TITLE,LEVEL,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,2021 R2 POINTS,2021 R2 PORTFOLIO,2021 R2 RANDOM*,2020 EOS,2020 EOS RANDOM,2020 MID,2020 R1 POINTS,2020 R1 RANDOM,2020 R2 POINTS,2020 R2 RANDOM,2019 EOS,2019 MID
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
AC120,International Business,8,294.0,False,False,294.0,False,False,209,False,280,209,False,,False,234,269
AC137,Liberal Arts,8,271.0,False,False,270.0,False,False,252,False,270,252,False,,False,252,275
AD101,First Year Art and Design (Common Entry portfo...,8,554.0,True,False,,False,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 550
AD102,Graphic Design and Moving Image Design (portfo...,8,538.0,True,False,,False,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 635
AD103,Textile and Surface Design and Jewellery and O...,8,505.0,True,False,,False,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 545
AD202,Education & Design or Fine Art (Second Level T...,8,591.0,True,False,,False,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 580
AD204,Fine Art,8,514.0,True,False,,False,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 600
AD211,Fashion Design,8,760.0,True,False,679.0,True,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 600
AD212,Product Design (portfolio),8,413.0,True,False,,False,False,#+matric,False,#+matric,#+matric,False,,False,# +mat,ic 600
AD215,Visual Culture,8,337.0,False,False,300.0,False,False,320,False,389,377,False,320.0,False,300,338
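
The left join keeps every course from the list of all courses and fills in NaN where a course has no entry in a given year. The sketch below reproduces this on three small made-up frames; all names and values in it are hypothetical.

```python
import pandas as pd

# Made-up list of all courses, indexed by course code
courses = pd.DataFrame(
    {'COURSE TITLE': ['Course A', 'Course B', 'Course C']},
    index=pd.Index(['XX001', 'XX002', 'XX003'], name='COURSE CODE'),
)

# Made-up points for two different years; neither covers all three courses
points_2021 = pd.DataFrame(
    {'2021 R1 POINTS': [300.0, 451.0]},
    index=pd.Index(['XX001', 'XX002'], name='COURSE CODE'),
)
points_2020 = pd.DataFrame(
    {'2020 R1 POINTS': [320.0]},
    index=pd.Index(['XX003'], name='COURSE CODE'),
)

# Join both years onto the course list; rows are matched on the index
combined = courses.join([points_2021, points_2020], how='left')
print(combined)
```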


<br>

***

<a id= 'part4'></a>
### **4.0 Data Analysis - 2021 Data**

Because the numerical values could not be isolated from the 2019 and 2020 data sets, the analysis below covers the 2021 data only. 

<br>

<a id= 'part4.1'></a>
#### **4.1 General Overview**

In [88]:
# Overview of data set
df2021.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1365 entries, AL605 to WD232
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   COURSE TITLE       1365 non-null   object 
 1   2021 R1 POINTS     1317 non-null   float64
 2   2021 R1 PORTFOLIO  1365 non-null   bool   
 3   2021 R1 RANDOM *   1365 non-null   bool   
 4   2021 R2 POINTS     368 non-null    float64
 5   2021 R2 PORTFOLIO  1365 non-null   bool   
 6   2021 R2 RANDOM*    1365 non-null   bool   
 7   LEVEL              1365 non-null   object 
dtypes: bool(4), float64(2), object(2)
memory usage: 90.9+ KB


Using the `info()` function, it can be established that for 2021 there were 1365 courses listed by the CAO. 

<br>

<a id= 'part4.2'></a>
#### **4.2 Round 1 Specifics**

In [89]:
# Overview of descriptive statistics for Round 1 points
df2021['2021 R1 POINTS'].describe()

count    1317.000000
mean      361.665148
std       139.025919
min        57.000000
25%       260.000000
50%       325.000000
75%       462.000000
max      1028.000000
Name: 2021 R1 POINTS, dtype: float64

In Round 1, 1,317 of the 1,365 courses were assigned based on minimum CAO points. The mean points requirement was 362 and the median was 325. 

<br>

**Highest Points Requirements**

In [90]:
# Identifying the five courses with highest points required
df2021.nlargest(5, ['2021 R1 POINTS'])[['COURSE TITLE', '2021 R1 POINTS',
                                        '2021 R1 PORTFOLIO', 'LEVEL']]

Unnamed: 0_level_0,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CR125,Popular Music at CIT Cork School of Music,1028.0,True,8
LC115,Art and Design Teacher Education (LIT and UL ...,993.0,True,8
DL832,Animation,989.0,True,8
DL843,Film,987.0,True,8
LC114,Fashion and Textiles for Product and Costume (...,914.0,True,8


There are multiple courses with CAO points higher than the maximum achievable Leaving Cert result of 625 points (see section 1.1). My best guess is that applicants are awarded additional points for the required portfolio, interview or assessment, which are added to the Leaving Cert result. 

When courses with portfolio requirements are filtered out, the courses below have the highest points. Four courses required the maximum Leaving Cert result of 625 points, and applicants were randomly selected. 

In [91]:
# Filter data frame for courses without portfolio requirements
df2021_noport1 = df2021[df2021['2021 R1 PORTFOLIO'] == False]

# Identifying the five courses with highest points required
df2021_noport1.nlargest(
    5, ['2021 R1 POINTS'])[
    ['COURSE TITLE', '2021 R1 POINTS', '2021 R1 PORTFOLIO', 
     '2021 R1 RANDOM *', 'LEVEL']]

Unnamed: 0_level_0,COURSE TITLE,2021 R1 POINTS,2021 R1 PORTFOLIO,2021 R1 RANDOM *,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
CK702,Dentistry,625.0,False,True,8
TR034,Management Science and Information Systems Stu...,625.0,False,True,8
TR052,Dental Science,625.0,False,True,8
DN670,Economics and Finance,625.0,False,True,8
CK703,Pharmacy,613.0,False,True,8


<br>

**Portfolio Requirements**

In [92]:
# Identify number of courses with portfolio requirement
df2021['2021 R1 PORTFOLIO'].value_counts()

False    1268
True       97
Name: 2021 R1 PORTFOLIO, dtype: int64

In total 97 courses required a portfolio or additional assessment/interview as part of the application process. 

<br>

**Random Selection**

In [93]:
# Identify number of courses with random selection
df2021['2021 R1 RANDOM *'].value_counts()

False    1289
True       76
Name: 2021 R1 RANDOM *, dtype: int64

For 76 courses random selection was applied.

<br>

<a id= 'part4.3'></a>
#### **4.3 Round 2 Specifics**

In [94]:
# Overview of descriptive statistics for Round 2 points
df2021['2021 R2 POINTS'].describe()

count    368.000000
mean     350.146739
std      161.468061
min       60.000000
25%      217.000000
50%      316.000000
75%      485.250000
max      904.000000
Name: 2021 R2 POINTS, dtype: float64

In Round 2, 368 courses were assigned based on minimum CAO points. The mean points requirement in the second round was 350.  

<br>

**Highest Points Requirements**

In [95]:
# Identifying the five courses with highest points required
df2021.nlargest(5, ['2021 R2 POINTS'])[['COURSE TITLE', '2021 R2 POINTS',
                                        '2021 R2 PORTFOLIO', 'LEVEL']]

Unnamed: 0_level_0,COURSE TITLE,2021 R2 POINTS,2021 R2 PORTFOLIO,LEVEL
COURSE CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CR121,Music at CIT Cork School of Music,904.0,True,8
DL843,Film,874.0,True,8
TR051,Medicine (HPAT required),743.0,True,8
RC001,Medicine - Undergraduate Entry (HPAT required),740.0,True,8
CK701,Medicine (Undergraduate Entry - HPAT required),737.0,True,8


As in Round 1, the highest points requirements in Round 2 exceed the maximum of 625 points. 

When courses with portfolio requirements are filtered out, the same five courses as in Round 1 have the highest points requirements, although random selection was no longer applied to DN670. 

In [96]:
# Filter data frame for courses without portfolio requirements
df2021_noport2 = df2021[df2021['2021 R2 PORTFOLIO'] == False]

# Identifying the five courses with highest points required
df2021_noport2.nlargest(
    5, ['2021 R2 POINTS'])[
    ['COURSE TITLE', '2021 R2 POINTS', '2021 R2 PORTFOLIO', 
     '2021 R2 RANDOM*', 'LEVEL']]

| COURSE CODE | COURSE TITLE | 2021 R2 POINTS | 2021 R2 PORTFOLIO | 2021 R2 RANDOM* | LEVEL |
|---|---|---|---|---|---|
| CK702 | Dentistry | 625.0 | False | True | 8 |
| TR034 | Management Science and Information Systems Stu... | 625.0 | False | True | 8 |
| TR052 | Dental Science | 625.0 | False | True | 8 |
| DN670 | Economics and Finance | 625.0 | False | False | 8 |
| CK703 | Pharmacy | 613.0 | False | True | 8 |
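One way to check whether every course above the 625-point maximum carries the portfolio flag is sketched below. It uses a small subset of rows copied from the two round 2 outputs above rather than the full `df2021` data frame, and the names `subset`, `MAX_LC_POINTS`, `above_max` and `all_flagged` are chosen for this illustration only.

```python
import pandas as pd

# Subset of rows taken from the round 2 outputs above (not the full data set)
subset = pd.DataFrame(
    {'2021 R2 POINTS': [904.0, 874.0, 743.0, 625.0, 613.0],
     '2021 R2 PORTFOLIO': [True, True, True, False, False]},
    index=pd.Index(['CR121', 'DL843', 'TR051', 'CK702', 'CK703'],
                   name='COURSE CODE'))

# Maximum points mentioned in the text; the constant name is hypothetical
MAX_LC_POINTS = 625

# Keep only courses whose requirement exceeds the maximum
above_max = subset[subset['2021 R2 POINTS'] > MAX_LC_POINTS]

# True only if every one of those courses has the portfolio flag set
all_flagged = bool(above_max['2021 R2 PORTFOLIO'].all())
print(len(above_max), all_flagged)
```

Run against the full data frame, the same two lines would show whether any course exceeds 625 points without a portfolio or additional assessment.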


<br>

**Portfolio Requirements**

In [97]:
# Identify number of courses with portfolio requirement
df2021['2021 R2 PORTFOLIO'].value_counts()

False    1341
True       24
Name: 2021 R2 PORTFOLIO, dtype: int64

For round 2, portfolios were required for 24 courses. 

<br>

**Random Selection**

In [98]:
# Identify number of courses with random selection
df2021['2021 R2 RANDOM*'].value_counts()

False    1313
True       52
Name: 2021 R2 RANDOM*, dtype: int64

For 52 courses random selection was applied.
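As an alternative to calling `value_counts()` on each flag column separately, boolean columns can be summed, since `True` counts as 1. The sketch below uses hypothetical flags for four courses (not CAO data) to show the idea; only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical flags for four courses: illustrative values only
flags = pd.DataFrame(
    {'2021 R2 PORTFOLIO': [True, False, False, False],
     '2021 R2 RANDOM*': [False, True, True, False]})

# True counts as 1, so sum() gives the number of flagged courses per column
flag_counts = flags.sum()

# mean() gives the share of flagged courses per column
flag_share = flags.mean()

print(flag_counts)
print(flag_share)
```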

<br>

<a id= 'part4.4'></a>
#### **4.4 Round 1 vs Round 2 Comparison**

In [99]:
# Add a new column calculating the difference between round 1 and 2 points
df2021_pointsdiff = df2021.assign(DIFF_R1_R2= df2021['2021 R1 POINTS'] 
                                  - df2021['2021 R2 POINTS'])

In [100]:
# Drop NaN values in the new column
# (dropna returns a new data frame, so the result is assigned back)
df2021_pointsdiff = df2021_pointsdiff.dropna(subset=['DIFF_R1_R2'])

# Display the 5 courses with largest difference between rounds 1 and 2
df2021_pointsdiff.nlargest(5, ['DIFF_R1_R2'])

| COURSE CODE | COURSE TITLE | 2021 R1 POINTS | 2021 R1 PORTFOLIO | 2021 R1 RANDOM * | 2021 R2 POINTS | 2021 R2 PORTFOLIO | 2021 R2 RANDOM* | LEVEL | DIFF_R1_R2 |
|---|---|---|---|---|---|---|---|---|---|
| BY002 | Business | 347.0 | False | False | 171.0 | False | False | 6/7 | 176.0 |
| SG401 | Science | 282.0 | False | False | 128.0 | False | False | 6/7 | 154.0 |
| DL843 | Film | 987.0 | True | False | 874.0 | True | False | 8 | 113.0 |
| DB531 | Marketing | 288.0 | False | False | 176.0 | False | False | 8 | 112.0 |
| PC404 | Applied Social Studies - Professional Social Care | 200.0 | False | False | 100.0 | False | False | 6/7 | 100.0 |


In [101]:
# Counting number of courses with same point requirements in both rounds
df2021_pointsdiff[df2021_pointsdiff['DIFF_R1_R2'] == 0.0][
    'COURSE TITLE'].count()

119

For 119 courses the points requirement was identical in rounds 1 and 2. For two courses the requirement was higher in the second round than in the first.

In [102]:
# Identify courses with higher round 2 point requirements than round 1
df2021_pointsdiff[df2021_pointsdiff['DIFF_R1_R2'] < 0.0]

| COURSE CODE | COURSE TITLE | 2021 R1 POINTS | 2021 R1 PORTFOLIO | 2021 R1 RANDOM * | 2021 R2 POINTS | 2021 R2 PORTFOLIO | 2021 R2 RANDOM* | LEVEL | DIFF_R1_R2 |
|---|---|---|---|---|---|---|---|---|---|
| TU708 | Engineering (Common Entry with Award options) | 117.0 | False | False | 263.0 | False | False | 6/7 | -146.0 |
| WD177 | Science (Mol. Biology with Biopharm. Food Science | 205.0 | False | False | 455.0 | False | False | 6/7 | -250.0 |
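The three cases discussed in this section (lower, unchanged and higher requirements in round 2) could also be counted in a single step by taking the sign of the difference. The sketch below is one possible approach, not the method used in this notebook. The first three rows reuse points values shown in the outputs above, while `XX001` and `XX002` are hypothetical courses added to illustrate the unchanged case and a missing round 2 figure.

```python
import numpy as np
import pandas as pd

# BY002, TU708 and WD177 use values from the outputs above;
# XX001 and XX002 are hypothetical rows added for illustration
rounds = pd.DataFrame(
    {'2021 R1 POINTS': [347.0, 117.0, 205.0, 400.0, 500.0],
     '2021 R2 POINTS': [171.0, 263.0, 455.0, 400.0, np.nan]},
    index=pd.Index(['BY002', 'TU708', 'WD177', 'XX001', 'XX002'],
                   name='COURSE CODE'))

# Round 1 minus round 2; courses without a round 2 figure are dropped
diff = (rounds['2021 R1 POINTS'] - rounds['2021 R2 POINTS']).dropna()

# np.sign maps each difference to 1.0, 0.0 or -1.0, which is then labelled
labels = np.sign(diff).map({1.0: 'lower in round 2',
                            0.0: 'unchanged',
                            -1.0: 'higher in round 2'})

# Number of courses in each category
direction_counts = labels.value_counts()
print(direction_counts)
```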


<br>

***

<a id= 'references'></a>
## **References Used**

<a id='reference1'></a> [[1] Wikipedia Contributors, 2021: *Central Applications Office*](https://en.wikipedia.org/wiki/Central_Applications_Office) (Accessed 2 November 2021)

<a id='reference2'></a> [[2] Citizens Information Board, 2021: *College application and entrance requirements*](https://www.citizensinformation.ie/en/education/third_level_education/applying_to_college/application_procedures_and_entry_requirements.html) (Accessed 2 November 2021)

<a id='reference3'></a> [[3] Citizens Information Board, 2021: *Third-level education in Ireland*](https://www.citizensinformation.ie/en/education/third_level_education/colleges_and_qualifications/third_level_education_in_ireland.html) (Accessed 2 November 2021)

<a id='reference4'></a> [[4] Quality and Qualifications Ireland, 2021: *IRISH NATIONAL FRAMEWORK OF QUALIFICATIONS (NFQ)*](https://nfq.qqi.ie/) (Accessed 10 December 2021)

<a id='reference5'></a> [[5] Wikipedia Contributors, 2021: *The points system*](https://en.wikipedia.org/wiki/Central_Applications_Office#The%20Points%20System) (Accessed 2 November 2021)

<a id='reference6'></a> [[6] Central Applications Office Ltd., 2021: *Irish Leaving Certificate Examination Points Calculation Grid*](http://www.cao.ie/index.php?page=scoring&s=lcepointsgrid) (Accessed 2 November 2021)

<a id='reference7'></a> [[7] Central Applications Office Ltd., 2021: *Offer Round Dates and Reply Dates*](https://www.cao.ie/help_files/round_dates.php) (Accessed 2 November 2021)

<a id='reference8'></a> [[8] ARIGA, A., 2019: *Getting Started*](https://tabula-py.readthedocs.io/en/latest/getting_started.html#) (Accessed 30 October 2021)

<a id='reference9'></a> [[9] soumilshah1995, 2019: *How to extract tables from online PDF as Pandas DF in Python*](https://www.youtube.com/watch?v=6QSe_hlsUPc) (Accessed 30 October 2021)

<a id='reference10'></a> [[10] Python Software Foundation, 2021: *tabula-py 2.3.0*](https://pypi.org/project/tabula-py/) (Accessed 30 October 2021)

<a id='reference11'></a> [[11] ARIGA, A., 2019: *tabula.io.convert_into*](https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.convert_into) (Accessed 30 October 2021)

<a id='reference12'></a> [[12] Zach, 2021: *How to Drop Rows by Index in Pandas (With Examples)*](https://www.statology.org/pandas-drop-row-by-index/) (Accessed 1 January 2022)

<a id='reference13'></a> [[13] The pandas development team, 2021: *pandas.DataFrame.dropna*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) (Accessed 16 December 2021)

<a id='reference14'></a> [[14] Data to Fish, 2021: *How to Drop Rows with NaN Values in Pandas DataFrame*](https://datatofish.com/dropna/) (Accessed 18 December 2021)

<a id='reference15'></a> [[15] The pandas development team, 2021: *pandas.DataFrame.set_index*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html) (Accessed 16 December 2021)

<a id='reference16'></a> [[16] The pandas development team, 2021: *pandas.read_excel*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) (Accessed 2 January 2022)

<a id='reference17'></a> [[17] The pandas development team, 2021: *pandas.Series.str.split*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) (Accessed 2 January 2022)

<a id='reference18'></a> [[18] Stackoverflow Discussion, 2019: *Replacing blank values (white space) with NaN in pandas*](https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas) (Accessed 2 January 2022)

<a id='reference19'></a> [[19] Stackoverflow Discussion, 2016: *How to remove square bracket from pandas dataframe*](https://stackoverflow.com/questions/38147447/how-to-remove-square-bracket-from-pandas-dataframe) (Accessed 2 January 2022)

<a id='reference20'></a> [[20] Stackoverflow Discussion, 2017: *Pandas error in Python: columns must be same length as key*](https://stackoverflow.com/questions/46585193/pandas-error-in-python-columns-must-be-same-length-as-key) (Accessed 2 January 2022)

<a id='reference21'></a> [[21] Statology, 2021: *How to Split String Column in Pandas into Multiple Columns*](https://www.statology.org/pandas-split-column/) (Accessed 2 January 2022)

<a id='reference22'></a> [[22] GeeksforGeeks, 2019: *Python | Creating a Pandas dataframe column based on a given condition*](https://www.geeksforgeeks.org/python-creating-a-pandas-dataframe-column-based-on-a-given-condition/?ref=gcse) (Accessed 2 January 2022)

<a id='reference23'></a> [[23] GeeksforGeeks, 2021: *Ways to apply an if condition in Pandas DataFrame*](https://www.geeksforgeeks.org/ways-to-apply-an-if-condition-in-pandas-dataframe/?ref=gcse) (Accessed 2 January 2022)

<a id='reference24'></a> [[24] Stackoverflow Discussion, 2020: *Pandas Dataframe: split column into multiple columns*](https://stackoverflow.com/questions/61705123/pandas-dataframe-split-column-into-multiple-columns) (Accessed 2 January 2022)

<a id='reference25'></a> [[25] W3Schools, 2021: *Pandas - Cleaning Empty Cells*](https://www.w3schools.com/python/pandas/pandas_cleaning_empty_cells.asp) (Accessed 2 January 2022)

<a id='reference26'></a> [[26] GeeksforGeeks, 2021: *How to Replace Values in Column Based on Condition in Pandas?*](https://www.geeksforgeeks.org/how-to-replace-values-in-column-based-on-condition-in-pandas/) (Accessed 2 January 2022)

<a id='reference27'></a> [[27] GeeksforGeeks, 2019: *Python | Splitting Text and Number in string*](https://www.geeksforgeeks.org/python-splitting-text-and-number-in-string/) (Accessed 2 January 2022)

<a id='reference28'></a> [[28] GeeksforGeeks, 2020: *Change the data type of a column or a Pandas Series*](https://www.geeksforgeeks.org/change-the-data-type-of-a-column-or-a-pandas-series/) (Accessed 2 January 2022)

<a id='reference29'></a> [[29] Data to Fish, 2021: *How to Sort an Index in Pandas DataFrame*](https://datatofish.com/sort-index-pandas-dataframe/) (Accessed 2 January 2022)