# Exploring Population Trends in Ireland ☘️
***

### Name: Stephen Hasson
### Student No: sba23014
### Student Email: sba23014@student.cct.ie
### Course: CCT MSC in Data Analytics
### Assignment: MSC_DA_CA1
### Year: Sept-23 Intake
### Data Source: https://data.cso.ie/product/pme
***

## Table of Contents

### 1. [Import packages & load data](#1.-Import-packages-&-load-data)
### 2. [Initial data exploration & cleaning](#2.-Initial-data-exploration-&-cleaning)
### 3. [Data Cleaning](#Data-Cleaning)
### 4. [Data Cleaning](#Data-Cleaning)
### 5. [Data Cleaning](#Data-Cleaning)
### 6. [Data Cleaning](#Data-Cleaning)
### 7. [Data Cleaning](#Data-Cleaning)
### 8. [Data Cleaning](#Data-Cleaning)
### 9. [Data Cleaning](#Data-Cleaning)
### 10. [Data Cleaning](#Data-Cleaning)
***

## 1. Import packages & load data

In [1]:
# Import eda & visualisation packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning packages from sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import r2_score

# Configure default colour scheme for seaborn
sns.set(color_codes=True)

# Display all columns of the pandas df
pd.set_option('display.max_columns', None)

In [2]:
# Specify the URLs of the three datasets, saved in the GitHub account
pop_change_url = 'https://raw.githubusercontent.com/sba23014/cct_msc_data_analytics/main/s1_ca1/datasets/20230925_annual_population_change_1951_to_2023.csv'
pop_est_url = 'https://raw.githubusercontent.com/sba23014/cct_msc_data_analytics/main/s1_ca1/datasets/20230925_population_estimates_1950_to_2023.csv'
gdp_gnp_url = 'https://raw.githubusercontent.com/sba23014/cct_msc_data_analytics/main/s1_ca1/datasets/20230901_gdp_%26_gnp_1995_to_2023.csv'

# Read the files from the GitHub URLs into pandas DataFrames (df)
pop_change_df = pd.read_csv(pop_change_url)
pop_est_df = pd.read_csv(pop_est_url)
gdp_gnp_df = pd.read_csv(gdp_gnp_url)

In [3]:
# Return the first 5 rows to validate df creation
pop_change_df.head()

Unnamed: 0,STATISTIC,STATISTIC Label,TLIST(A1),Year,C02541V03076,Component,UNIT,VALUE
0,PEA15,Annual Population Change,1951,1951,1,Annual births,Thousand,
1,PEA15,Annual Population Change,1951,1951,2,Annual deaths,Thousand,
2,PEA15,Annual Population Change,1951,1951,3,Natural increase,Thousand,26.6
3,PEA15,Annual Population Change,1951,1951,4,Immigrants,Thousand,
4,PEA15,Annual Population Change,1951,1951,5,Emigrants,Thousand,


In [4]:
# Return the first 5 rows to validate df creation
pop_est_df.head()

Unnamed: 0,STATISTIC,STATISTIC Label,TLIST(A1),Year,C02076V02508,Age Group,C02199V02655,Sex,UNIT,VALUE
0,PEA01,Population Estimates (Persons in April),1950,1950,200,Under 1 year,-,Both sexes,Thousand,61.1
1,PEA01,Population Estimates (Persons in April),1950,1950,200,Under 1 year,1,Male,Thousand,31.4
2,PEA01,Population Estimates (Persons in April),1950,1950,200,Under 1 year,2,Female,Thousand,29.7
3,PEA01,Population Estimates (Persons in April),1950,1950,205,0 - 4 years,-,Both sexes,Thousand,
4,PEA01,Population Estimates (Persons in April),1950,1950,205,0 - 4 years,1,Male,Thousand,


In [5]:
# Return the first 5 rows to validate df creation
gdp_gnp_df.head()

Unnamed: 0,STATISTIC,Statistic Label,TLIST(Q1),Quarter,C02196V02652,State,UNIT,VALUE
0,NAQ03C01,GVA at Constant Basic Prices,19951,1995Q1,-,State,Euro Million,21283
1,NAQ03C01,GVA at Constant Basic Prices,19952,1995Q2,-,State,Euro Million,22083
2,NAQ03C01,GVA at Constant Basic Prices,19953,1995Q3,-,State,Euro Million,22529
3,NAQ03C01,GVA at Constant Basic Prices,19954,1995Q4,-,State,Euro Million,22342
4,NAQ03C01,GVA at Constant Basic Prices,19961,1996Q1,-,State,Euro Million,22989


## 2. Initial data exploration & cleaning

Exploring the '20230925_annual_population_change_1951_to_2023' dataset first.

In [6]:
# Print the dimensionality of the df
pop_change_df.shape

(584, 8)

584 rows & 8 columns

In [7]:
# Return the first 20 rows to validate df creation
pop_change_df.head(20)

Unnamed: 0,STATISTIC,STATISTIC Label,TLIST(A1),Year,C02541V03076,Component,UNIT,VALUE
0,PEA15,Annual Population Change,1951,1951,1,Annual births,Thousand,
1,PEA15,Annual Population Change,1951,1951,2,Annual deaths,Thousand,
2,PEA15,Annual Population Change,1951,1951,3,Natural increase,Thousand,26.6
3,PEA15,Annual Population Change,1951,1951,4,Immigrants,Thousand,
4,PEA15,Annual Population Change,1951,1951,5,Emigrants,Thousand,
5,PEA15,Annual Population Change,1951,1951,6,Net migration,Thousand,-35.0
6,PEA15,Annual Population Change,1951,1951,7,Population change,Thousand,-8.4
7,PEA15,Annual Population Change,1951,1951,8,Population,Thousand,2960.6
8,PEA15,Annual Population Change,1952,1952,1,Annual births,Thousand,
9,PEA15,Annual Population Change,1952,1952,2,Annual deaths,Thousand,


Notes:
* 'VALUE' column contains nulls
* 'Component' holds several distinct measures; splitting the data by component will be essential for meaningful results
* Several 'system code' type columns: 'STATISTIC', 'TLIST(A1)', 'C02541V03076'
* Validate that each 'system code' column maps 1:1 to its label column; if so, drop as required
* 'C02541V03076' is already numeric, so it could be kept for ML purposes
* 'UNIT' specifies the denomination of 'VALUE' (thousands); 'VALUE' could be converted to absolute numbers
* Only part of the date range is in scope, so several observations can be removed based on 'Year'
* Deciding how to handle the 'VALUE' NaNs is important
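
One possible way to split the data by component is to pivot the long table so that each 'Component' becomes its own column. The sketch below is illustrative only: `sample_df` is a hypothetical frame holding a handful of rows copied from the outputs in this notebook, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a few rows copied by hand from the notebook outputs
sample_df = pd.DataFrame({
    'Year': [1951, 1951, 1951, 2004, 2004, 2023],
    'Component': ['Annual births', 'Natural increase', 'Net migration',
                  'Annual births', 'Natural increase', 'Net migration'],
    'VALUE': [np.nan, 26.6, -35.0, 62.0, 33.3, 77.6],
})

# One row per year, one column per component; absent combinations become NaN
wide_df = sample_df.pivot(index='Year', columns='Component', values='VALUE')
print(wide_df)
```

In the wide layout each component can be summarised or plotted on its own scale, which avoids mixing total population with much smaller flows such as births or migration.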

In [8]:
# Print a concise summary of the df
pop_change_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   STATISTIC        584 non-null    object 
 1   STATISTIC Label  584 non-null    object 
 2   TLIST(A1)        584 non-null    int64  
 3   Year             584 non-null    int64  
 4   C02541V03076     584 non-null    int64  
 5   Component        584 non-null    object 
 6   UNIT             584 non-null    object 
 7   VALUE            440 non-null    float64
dtypes: float64(1), int64(3), object(4)
memory usage: 36.6+ KB


In [9]:
# Generate descriptive statistics for all attributes
pop_change_df.describe(include = 'all')

Unnamed: 0,STATISTIC,STATISTIC Label,TLIST(A1),Year,C02541V03076,Component,UNIT,VALUE
count,584,584,584.0,584.0,584.0,584,584,440.0
unique,1,1,,,,8,1,
top,PEA15,Annual Population Change,,,,Annual births,Thousand,
freq,584,584,,,,73,584,
mean,,,1987.0,1987.0,4.5,,,633.262273
std,,,21.089371,21.089371,2.293252,,,1378.671718
min,,,1951.0,1951.0,1.0,,,-58.0
25%,,,1969.0,1969.0,2.75,,,24.775
50%,,,1987.0,1987.0,4.5,,,35.55
75%,,,2005.0,2005.0,6.25,,,70.775


* 'VALUE' column contains nulls (440 non-null out of 584)
* Several columns are of object dtype, so they have no numeric descriptive statistics
* The remaining columns all appear to have no missing values
* This analysis only covers the last 20 years (2004 to 2023), so earlier observations will be dropped and the stats reviewed again
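
Before picking a cut-off year, it can help to see where the missing values sit. The following is a minimal sketch, assuming a small hypothetical `sample_df` built from rows shown in the outputs above; on the real data the same pattern would be applied to `pop_change_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows copied by hand from the notebook outputs
sample_df = pd.DataFrame({
    'Year': [1951, 1951, 1951, 2004, 2004, 2004],
    'Component': ['Annual births', 'Annual deaths', 'Natural increase'] * 2,
    'VALUE': [np.nan, np.nan, 26.6, 62.0, 28.6, 33.3],
})

# Count missing 'VALUE' entries for each year
missing_by_year = sample_df['VALUE'].isna().groupby(sample_df['Year']).sum()
print(missing_by_year)

# Years in which every component has a value
complete_years = missing_by_year[missing_by_year == 0].index.tolist()
print(complete_years)
```

If the missing values turn out to be concentrated in the early years, restricting the analysis to a recent window removes them without any imputation.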

In [10]:
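# Keep only the last 20 years (drop rows where 'Year' is less than 2004)
# .copy() avoids a SettingWithCopyWarning when columns are added later
pop_change_df = pop_change_df[pop_change_df['Year'] >= 2004].copy()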

# Reset the index and drop the old index
pop_change_df.reset_index(drop = True, inplace = True)

# Validate results of new df & index reset
pop_change_df

Unnamed: 0,STATISTIC,STATISTIC Label,TLIST(A1),Year,C02541V03076,Component,UNIT,VALUE
0,PEA15,Annual Population Change,2004,2004,1,Annual births,Thousand,62.0
1,PEA15,Annual Population Change,2004,2004,2,Annual deaths,Thousand,28.6
2,PEA15,Annual Population Change,2004,2004,3,Natural increase,Thousand,33.3
3,PEA15,Annual Population Change,2004,2004,4,Immigrants,Thousand,58.5
4,PEA15,Annual Population Change,2004,2004,5,Emigrants,Thousand,26.5
...,...,...,...,...,...,...,...,...
155,PEA15,Annual Population Change,2023,2023,4,Immigrants,Thousand,141.6
156,PEA15,Annual Population Change,2023,2023,5,Emigrants,Thousand,64.0
157,PEA15,Annual Population Change,2023,2023,6,Net migration,Thousand,77.6
158,PEA15,Annual Population Change,2023,2023,7,Population change,Thousand,97.6


In [11]:
# Count of unique values in 'Year' column
unique_year_count = len(pop_change_df['Year'].unique())

print(f"The number of unique values in the 'Year' column is {unique_year_count}.")

The number of unique values in the 'Year' column is 20.


In [12]:
# Validate that only the expected time period (2004 to 2023) remains
pop_change_df['Year'].unique()

array([2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023], dtype=int64)

In [13]:
# Validate only one denomination value in the 'UNIT' column
pop_change_df['UNIT'].unique()

array(['Thousand'], dtype=object)

* Several 'system code' type columns: 'STATISTIC', 'TLIST(A1)', 'C02541V03076'
* Validate that each one maps 1:1 to its label column; if so, the code columns can be dropped
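
A 1:1 mapping can also be tested programmatically rather than by reading the group counts. The helper below, `is_one_to_one`, is a hypothetical function written for this illustration (it is not part of pandas), and `sample_df` is a small made-up frame that reuses the code and label names seen in this dataset.

```python
import pandas as pd

def is_one_to_one(df, code_col, label_col):
    """Return True if each code has exactly one label and each label one code."""
    labels_per_code = df.groupby(code_col)[label_col].nunique()
    codes_per_label = df.groupby(label_col)[code_col].nunique()
    return bool((labels_per_code == 1).all() and (codes_per_label == 1).all())

# Hypothetical sample using codes and labels that appear in the dataset
sample_df = pd.DataFrame({
    'C02541V03076': [1, 2, 3, 1, 2, 3],
    'Component': ['Annual births', 'Annual deaths', 'Natural increase'] * 2,
})
print(is_one_to_one(sample_df, 'C02541V03076', 'Component'))

# Deliberately break the mapping: code 1 now carries two different labels
broken_df = sample_df.copy()
broken_df.loc[0, 'Component'] = 'Annual deaths'
print(is_one_to_one(broken_df, 'C02541V03076', 'Component'))
```

Checking both directions matters: a code column is only safe to drop if no label is shared by two codes and no code is shared by two labels.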

In [14]:
# Count rows for each ('STATISTIC', 'STATISTIC Label') pair to check the 1:1 mapping
pop_change_df_group_statistic = pop_change_df.groupby(['STATISTIC', 'STATISTIC Label']).size()

# Display the resulting counts (a pandas Series)
print(pop_change_df_group_statistic)

STATISTIC  STATISTIC Label         
PEA15      Annual Population Change    160
dtype: int64


In [15]:
# Count rows for each ('TLIST(A1)', 'Year') pair to check the 1:1 mapping
pop_change_df_group_year = pop_change_df.groupby(['TLIST(A1)', 'Year']).size()

# Display the resulting counts (a pandas Series)
print(pop_change_df_group_year)

TLIST(A1)  Year
2004       2004    8
2005       2005    8
2006       2006    8
2007       2007    8
2008       2008    8
2009       2009    8
2010       2010    8
2011       2011    8
2012       2012    8
2013       2013    8
2014       2014    8
2015       2015    8
2016       2016    8
2017       2017    8
2018       2018    8
2019       2019    8
2020       2020    8
2021       2021    8
2022       2022    8
2023       2023    8
dtype: int64


In [16]:
# Count rows for each ('C02541V03076', 'Component') pair to check the 1:1 mapping
pop_change_df_group_component = pop_change_df.groupby(['C02541V03076', 'Component']).size()

# Display the resulting counts (a pandas Series)
print(pop_change_df_group_component)

C02541V03076  Component        
1             Annual births        20
2             Annual deaths        20
3             Natural increase     20
4             Immigrants           20
5             Emigrants            20
6             Net migration        20
7             Population change    20
8             Population           20
dtype: int64


* Each code column maps 1:1 to its label column and 'UNIT' holds a single value, so these columns add no information and can be dropped

In [17]:
# Drop the columns we don't need
pop_change_df.drop(['STATISTIC', 'STATISTIC Label', 'TLIST(A1)', 'C02541V03076', 'UNIT'], axis = 1, inplace = True)

pop_change_df

Unnamed: 0,Year,Component,VALUE
0,2004,Annual births,62.0
1,2004,Annual deaths,28.6
2,2004,Natural increase,33.3
3,2004,Immigrants,58.5
4,2004,Emigrants,26.5
...,...,...,...
155,2023,Immigrants,141.6
156,2023,Emigrants,64.0
157,2023,Net migration,77.6
158,2023,Population change,97.6


* Renaming columns

In [18]:
# Rename columns
pop_change_df.rename(columns = {'VALUE': 'Value (K)'}, inplace = True)

# Validate changes
pop_change_df.head()

Unnamed: 0,Year,Component,Value (K)
0,2004,Annual births,62.0
1,2004,Annual deaths,28.6
2,2004,Natural increase,33.3
3,2004,Immigrants,58.5
4,2004,Emigrants,26.5


* Create an additional column holding 'Value (K)' as an absolute number (thousands multiplied by 1,000)

In [19]:
pop_change_df['Value'] = pop_change_df['Value (K)'] * 1000

pop_change_df

Unnamed: 0,Year,Component,Value (K),Value
0,2004,Annual births,62.0,62000.0
1,2004,Annual deaths,28.6,28600.0
2,2004,Natural increase,33.3,33300.0
3,2004,Immigrants,58.5,58500.0
4,2004,Emigrants,26.5,26500.0
...,...,...,...,...
155,2023,Immigrants,141.6,141600.0
156,2023,Emigrants,64.0,64000.0
157,2023,Net migration,77.6,77600.0
158,2023,Population change,97.6,97600.0


In [21]:
# Print a concise summary of the df
pop_change_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Year       160 non-null    int64  
 1   Component  160 non-null    object 
 2   Value (K)  160 non-null    float64
 3   Value      160 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 5.1+ KB


* Change data types

In [24]:
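# Convert 'Value' to 'int'; round first because astype truncates and
# float multiplication can leave values fractionally below a whole number
pop_change_df['Value'] = pop_change_df['Value'].round().astype('int64')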

# Validate the results
pop_change_df.dtypes

Year           int64
Component     object
Value (K)    float64
Value          int64
dtype: object
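
Casting floats to integers truncates towards zero, so a product that lands fractionally below a whole number would lose a unit. The sketch below compares truncating with rounding first, assuming a small hypothetical series built from three 'Value (K)' figures shown above.

```python
import pandas as pd

# Hypothetical sample: three 'Value (K)' figures from the 2004 rows
value_k = pd.Series([62.0, 28.6, 33.3], name='Value (K)')

# Scale thousands to absolute numbers (still float64 at this point)
scaled = value_k * 1000

# astype alone truncates; rounding first snaps to the nearest whole number
truncated = scaled.astype('int64')
rounded = scaled.round().astype('int64')

print(pd.DataFrame({'scaled': scaled, 'truncated': truncated, 'rounded': rounded}))
```

Whether the two columns differ depends on the floating-point representation of each product, which is why rounding before the cast is the safer default.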

* Check and remove duplicates

In [None]:
# Total number of rows and columns
print('df shape:')
print(pop_change_df.shape)
print()

# Rows containing duplicate data
duplicate_rows_df = pop_change_df[pop_change_df.duplicated()]
print('Number of duplicate rows:')
print(duplicate_rows_df.shape[0])
print()

# Count the non-null values per column before removing duplicates
print('Row count before removing duplicates:')
print()
print(pop_change_df.count())
print()

# Drop the duplicates and reset the index
pop_change_df = pop_change_df.drop_duplicates().reset_index(drop = True)

# Count the non-null values per column after removing duplicates
print('Row count after removing duplicates:')
print()
print(pop_change_df.count())
print()
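
Exact row duplicates are one case; a second check worth considering is whether any ('Year', 'Component') pair appears more than once, since each pair should identify a single observation. This is a sketch under that assumption, using a hypothetical `sample_df` in which one duplicate row has been injected on purpose.

```python
import pandas as pd

# Hypothetical sample: the third row deliberately repeats the first
sample_df = pd.DataFrame({
    'Year': [2004, 2004, 2004, 2023],
    'Component': ['Annual births', 'Annual deaths', 'Annual births', 'Immigrants'],
    'Value (K)': [62.0, 28.6, 62.0, 141.6],
})

# Columns assumed to identify one observation
key_cols = ['Year', 'Component']

# Flag every row whose key occurs more than once
dup_mask = sample_df.duplicated(subset = key_cols, keep = False)
print(sample_df[dup_mask])

# Keep the first occurrence of each key and rebuild the index
deduped_df = sample_df.drop_duplicates(subset = key_cols, keep = 'first').reset_index(drop = True)
print(deduped_df)
```

Using `keep = False` for the inspection step shows all copies side by side, which makes it easier to confirm they really are identical before dropping any of them.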

In [20]:
# Generate descriptive statistics for numeric attributes
# pop_change_df.describe(include = 'all')

* Reducing the time period for analysis resolved the missing values in the 'VALUE' column; the assumption is that data collection in earlier years was less complete than it has been since 2004
* The values in 'TLIST(A1)' and 'Year' are the same, which indicates a 1:1 mapping (confirmed by the groupby check above)
* 'Year':
    * Min & max year as expected
* 'VALUE':
    * min & max are far apart, so the values span a wide range
    * mean very different from the 50% (median)
    * likely explained by the different