This notebook analyses the differences between the sexes by age in Ireland, using the following dataset: cso-populationbyage.csv.

- Weighted mean age (by sex).
- The difference between the sexes by age.
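
As a rough illustration of these two quantities, the weighted mean age is the sum of age times head count divided by the total head count, and the difference by age is the gap between the counts for each sex at the same age. The sketch below uses a small made-up table; the column names `sex`, `age` and `count` and all numbers are assumptions for illustration and do not come from the CSO file.

```python
import pandas as pd

# Hypothetical toy data (not taken from cso-populationbyage.csv).
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "age": [0, 30, 60, 0, 30, 60],
    "count": [100, 300, 200, 120, 300, 180],
})

def weighted_mean_age(group: pd.DataFrame) -> float:
    """Mean age weighted by the number of people at each age."""
    return float((group["age"] * group["count"]).sum() / group["count"].sum())

# Weighted mean age for each sex.
mean_age_by_sex = {sex: weighted_mean_age(g) for sex, g in toy.groupby("sex")}

# Difference between the sexes at each age (Female minus Male).
wide = toy.pivot(index="age", columns="sex", values="count")
diff_by_age = wide["Female"] - wide["Male"]

print(mean_age_by_sex)
print(diff_by_age)
```

Weighting by head count matters because a plain mean of the age labels would treat an age with 8 people the same as an age with 57,796 people.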

In [1]:
import pandas as pd

In [2]:
FILENAME = 'cso-populationbyage.csv'
DATADIR = './'
FULLPATH = DATADIR + FILENAME

# read the CSV file into a DataFrame
df = pd.read_csv(FULLPATH)
print(df.head(3))  # print the first 3 rows to verify


  Statistic Label  CensusYear         Sex Single Year of Age  \
0      Population        2022  Both sexes           All ages   
1      Population        2022  Both sexes           All ages   
2      Population        2022  Both sexes           All ages   

  Administrative Counties    UNIT    VALUE  
0                 Ireland  Number  5149139  
1   Carlow County Council  Number    61968  
2     Dublin City Council  Number   592713  


In [3]:
#Create a new CSV file with the cleaned data
#df.to_csv("population_foranalysis.csv")

Drop columns that we don't need.

In [4]:
drop_col_list = ['Statistic Label', 'CensusYear', 'Sex', 'UNIT']
df.drop(columns=drop_col_list, inplace=True)  # inplace=True modifies the DataFrame directly instead of returning a copy
print(df.head(3))  # print the first 3 rows to verify

  Single Year of Age Administrative Counties    VALUE
0           All ages                 Ireland  5149139
1           All ages   Carlow County Council    61968
2           All ages     Dublin City Council   592713


Words in the age field, such as "All ages", "Under 1 year", "years and over", and "years", get in the way of numerical analysis. Therefore, we will remove these words and keep only the age as a number.

The first step is to remove the rows labelled "All ages", since they hold totals rather than a single age.

In [5]:
df = df[df["Single Year of Age"] != "All ages"].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
print(df.head(3))  # print the first 3 rows to verify

   Single Year of Age Administrative Counties  VALUE
32       Under 1 year                 Ireland  57796
33       Under 1 year   Carlow County Council    699
34       Under 1 year     Dublin City Council   6213


The second step is to replace every "Under 1 year" value with "0", since children under one year old have an age of 0 in completed years.

In [6]:
df["Single Year of Age"] = df["Single Year of Age"].replace("Under 1 year", "0")
print(df.head(5))  # print the first 5 rows to verify

   Single Year of Age               Administrative Counties  VALUE
32                  0                               Ireland  57796
33                  0                 Carlow County Council    699
34                  0                   Dublin City Council   6213
35                  0  Dún Laoghaire Rathdown County Council   2457
36                  0                 Fingal County Council   4009
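
A small sketch of why `Series.replace` suits this step: with plain strings it matches whole cell values only, whereas `Series.str.replace` works on substrings. The sample labels below are assumptions chosen for illustration, not rows read from the file.

```python
import pandas as pd

# Illustrative labels only (assumed spellings).
ages = pd.Series(["Under 1 year", "1 year", "2 years"])

# Whole-value match: only the cell equal to "Under 1 year" changes.
whole = ages.replace("Under 1 year", "0")

# No cell is exactly "year", so this leaves the Series unchanged.
unchanged = ages.replace("year", "")

# Substring match: "year" is removed inside every cell.
substring = ages.str.replace("year", "", regex=False)

print(whole.tolist())
print(unchanged.tolist())
print(substring.tolist())
```

Because the match is on the whole value, labels such as "1 year" are left alone at this stage and can be handled separately.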


The third step is to remove the text that recurs throughout the column, such as "years". To avoid replacing each value one by one ('2 years', '3 years', and so on), we can use a regular expression, which lets us change or delete a repeated pattern across the whole data set at once.

Source: https://www.w3schools.com/python/python_regex.asp

In [7]:
# Regular expression \D matches any non-digit character, so this keeps only the number in 'Single Year of Age'
df["Single Year of Age"] = df["Single Year of Age"].str.replace(r'\D', '', regex=True)  # the r prefix marks a raw string, so the backslash is kept as-is
print(df.tail(5))  # print the last 5 rows to verify

     Single Year of Age   Administrative Counties  VALUE
3259                100  Roscommon County Council     12
3260                100      Sligo County Council     12
3261                100      Cavan County Council     18
3262                100    Donegal County Council     33
3263                100   Monaghan County Council      8
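
To see what the `\D` pattern does to each kind of label, here is a minimal sketch using the standard `re` module. The exact label spellings are assumptions based on the words listed earlier in this notebook. It also shows why "Under 1 year" has to be replaced with "0" before this step: stripping non-digits from it would leave "1".

```python
import re

# Sample labels; the exact spellings are assumptions, not read from the file.
labels = ["Under 1 year", "1 year", "2 years", "100 years and over", "All ages"]

def digits_only(label: str) -> str:
    # Remove every character that is not a digit.
    return re.sub(r"\D", "", label)

cleaned = {label: digits_only(label) for label in labels}

for label, value in cleaned.items():
    print(f"{label!r} -> {value!r}")
```

After this step the column still holds strings; converting it with something like `pd.to_numeric` would be needed before doing arithmetic on the ages.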


In [8]:
#Create a new CSV file with the cleaned data
df.to_csv("population_for_analysis.csv")