# Assignment 5: Population Analysis

## Imports

In [1]:
import pandas as pd
import numpy as np
import re

## Part 1 
*Write a Jupyter notebook that analyses the differences between the sexes by age in Ireland.*
- *Weighted mean age (by sex)*
- *The difference between the sexes by age*

First, I read in the dataset, kept only the rows for Ireland as a whole, and dropped the columns I would not be using.

In [2]:
FILENAME="cso-populationbysex.csv"
DATADIR=r"C:\Users\ZMH\OneDrive\Desktop\PFDA\data\\" # I got a FileNotFoundError when I used "../../data/"
FULLPATH =  DATADIR + FILENAME

# Read in the CSV file
df = pd.read_csv(FULLPATH)

# Filter to only include rows where "Administrative Counties" is "Ireland"
df = df[df["Administrative Counties"] == "Ireland"]

# Create a list of columns to drop
drop_col_list = ["Statistic Label","CensusYear","Administrative Counties","UNIT"]

# Drop columns
df.drop(columns=drop_col_list, inplace=True)

# Have a look
df.head()

Unnamed: 0,Sex,Single Year of Age,VALUE
0,Male,All ages,2544549
32,Male,Under 1 year,29610
64,Male,1 year,28875
96,Male,2 years,30236
128,Male,3 years,31001
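
A relative path such as "../../data/" is resolved against the notebook's current working directory, which is one possible cause of the FileNotFoundError mentioned in the comment above. A minimal sketch of building the path with `pathlib` and checking it before reading; the temporary directory, the file name `example-population.csv` and its two rows are stand-ins for illustration, not the real CSO file:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in data directory and file (assumptions for illustration only)
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "example-population.csv"
csv_path.write_text("Sex,VALUE\nMale,10\nFemale,12\n")

# Relative paths are resolved against this directory
print(Path.cwd())

# Check the file exists before reading, so a wrong path fails with a clear message
if not csv_path.exists():
    raise FileNotFoundError(f"Expected data file at {csv_path.resolve()}")

example_df = pd.read_csv(csv_path)
print(example_df)
```

Joining paths with `/` on a `Path` object also avoids having to escape backslashes by hand on Windows.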


Then I removed the "All ages" rows from the dataset, as I want to deal with each age individually.

I tidied up the ages, replacing "Under 1 year" with "0" and stripping the non-digit text (such as "year" and "years") from the rest.

I also converted the ages from strings to integers.

In [3]:
# Remove "All ages" from the dataset (copy so later assignments do not act on a slice)
df = df[df["Single Year of Age"] != "All ages"].copy()

# I was getting errors when trying to tidy up the ages, so I changed the datatype to string first
df["Single Year of Age"] = df["Single Year of Age"].astype("str")

# Tidy up ages: literal replacement first, then strip every non-digit character
df["Single Year of Age"] = df["Single Year of Age"].str.replace("Under 1 year", "0", regex=False)
df["Single Year of Age"] = df["Single Year of Age"].str.replace(r"\D", "", regex=True)

# Change ages and population counts to int
df["Single Year of Age"] = df["Single Year of Age"].astype("int64")
df["VALUE"] = df["VALUE"].astype("int64")

# Show
df.head()


Unnamed: 0,Sex,Single Year of Age,VALUE
32,Male,0,29610
64,Male,1,28875
96,Male,2,30236
128,Male,3,31001
160,Male,4,31686
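
The order of the two replacements matters: if the non-digit characters were stripped first, "Under 1 year" would become "1" and clash with "1 year". A small sketch of the same clean-up on made-up labels (the "100 years and over" label is an assumption about what the top age band might look like):

```python
import pandas as pd

# Made-up labels in the same style as the "Single Year of Age" column
labels = pd.Series(["Under 1 year", "1 year", "2 years", "100 years and over"])

# Literal replacement first, then strip every non-digit character
ages = (
    labels.str.replace("Under 1 year", "0", regex=False)
          .str.replace(r"\D", "", regex=True)
          .astype("int64")
)
print(ages.tolist())
```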


I then created a pivot table of population by age, with a separate column for each sex.

Then I wrote the cleaned-up pivot table out to a CSV file on my computer.

In [4]:
# Create pivot table with different columns for male and female
df_pivot = pd.pivot_table(df, values="VALUE", index="Single Year of Age", columns="Sex", aggfunc="sum")
print(df_pivot.head(3))

# Write the entire pivot table out to the local machine
df_pivot.to_csv("population_for_analysis.csv")

Sex                 Female   Male
Single Year of Age               
0                    28186  29610
1                    27545  28875
2                    28974  30236


Now that I have cleaned up my dataframe, I can do some analysis.

First I want to get the weighted mean age by sex.

The weighted mean age is sum(age * population at that age) / sum(population at each age), so each age counts in proportion to the number of people of that age.
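
As a quick check of the formula, here is a minimal sketch on made-up numbers (three ages and invented population counts, not CSO data), compared against `np.average` with its `weights` argument:

```python
import numpy as np

# Made-up example: three ages and the number of people at each age
ages = np.array([0, 1, 2])
population = np.array([10, 20, 30])

# sum(age * population) / sum(population)
manual = (ages * population).sum() / population.sum()

# np.average with weights computes the same quantity
via_numpy = np.average(ages, weights=population)

print(manual, via_numpy)
```

Both values come out at 80 / 60, roughly 1.33, which is closer to age 2 than the unweighted mean of 1 because more people are at that age.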

In [5]:
df_pivot

Sex,Female,Male
Single Year of Age,Unnamed: 1_level_1,Unnamed: 2_level_1
0,28186,29610
1,27545,28875
2,28974,30236
3,29483,31001
4,29819,31686
...,...,...
96,956,327
97,732,217
98,492,130
99,336,105


First I got the number of females and number of males

In [6]:
# Number of females
number_females = df_pivot["Female"].sum()
print(f"No. of females: {number_females}")

# Number of males
number_males = df_pivot["Male"].sum()
print(f"No. of males: {number_males}")

No. of females: 2604590
No. of males: 2544549


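Then I got the cumulative ages for females and males, meaning the sum of each age multiplied by the population at that age (the numerator of the weighted mean)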

In [7]:
# Cumulative ages for females
cumages_female = df_pivot["Female"].mul(df_pivot.index, axis=0).sum()
print(cumages_female)

# Cumulative ages for males
cumages_male = df_pivot["Male"].mul(df_pivot.index, axis=0).sum()
print(cumages_male)

101422203
96029874


Now I can find the weighted mean age for both sexes

In [8]:
# Female weighted mean age
w_mean_f = cumages_female / number_females
print(f"Weighted mean age for females in Ireland: {w_mean_f}")

# Male weighted mean age
w_mean_m = cumages_male / number_males
print(f"Weighted mean age for males in Ireland: {w_mean_m}")

Weighted mean age for females in Ireland: 38.9397958987787
Weighted mean age for males in Ireland: 37.7394477371039


## Part 2
*In the same notebook, make a variable that stores an age (say 35).*

*Write the code that would group the people within 5 years of that age together, into one age group*

*Calculate the population difference between the sexes in that age group.*

I created the variable `age` that stored the age 35.

Then I took the existing pivot table `df_pivot` and sliced it to include those within 5 years of age 35 (i.e. ages 30 to 40 inclusive).

In [20]:
# Create variable that stores an age
age = 35

# Create age group within 5 years of that age
age_group = df_pivot.loc[age - 5 : age + 5] # label-based slice, inclusive of both 30 and 40

# Show
age_group


Sex,Female,Male
Single Year of Age,Unnamed: 1_level_1,Unnamed: 2_level_1
30,32841,30858
31,33710,32237
32,34382,32413
33,34489,31888
34,36284,33121
35,37940,34695
36,39030,35828
37,39193,36427
38,40902,37513
39,42592,38749
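
An integer slice with `.iloc` (and, as far as I know, with plain square brackets on a DataFrame) selects rows by position, so it only lines up with ages when the index starts at 0 with no gaps. Slicing with `.loc` selects by the age labels and includes both endpoints. A minimal sketch of the difference, on a made-up table whose index deliberately starts at 28:

```python
import pandas as pd

# Made-up pivot-style table whose age index starts at 28 rather than 0
toy = pd.DataFrame(
    {"Female": range(100, 115), "Male": range(200, 215)},
    index=pd.Index(range(28, 43), name="Single Year of Age"),
)

age = 35

# Label-based: selects ages 30 to 40, both endpoints included
by_label = toy.loc[age - 5 : age + 5]

# Position-based: asks for rows 30 to 40 by position, which do not exist here
by_position = toy.iloc[age - 5 : age + 6]

print(by_label.index.tolist())
print(len(by_position))
```

Here the label-based slice returns the eleven ages from 30 to 40, while the positional slice returns no rows at all.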


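Then I used `.iloc` to take the first and last rows of the age group and subtracted the first from the last.

This gives, for each sex, how much larger the population at the oldest age in the group is than at the youngest. It is not yet the difference between the sexes; that would be the Female total minus the Male total for the group.

In [None]:
# Change in population between the first and last age in the group, for each sex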
pop_diff = age_group.iloc[-1] - age_group.iloc[0]

print(pop_diff)

Sex
Female    10302
Male       9443
dtype: int64
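
The task asks for the population difference between the sexes within the age group. A minimal sketch of that calculation, on made-up counts standing in for `age_group` (the name `toy_group` and its values are invented for illustration, not CSO figures):

```python
import pandas as pd

# Made-up counts standing in for a slice of the pivot table (not CSO data)
toy_group = pd.DataFrame(
    {"Female": [120, 130, 140], "Male": [110, 125, 150]},
    index=pd.Index([34, 35, 36], name="Single Year of Age"),
)

# Total population of each sex across the whole age group
totals = toy_group.sum()

# Difference between the sexes for the group as a whole
group_diff = totals["Female"] - totals["Male"]

# Difference between the sexes at each individual age
diff_by_age = toy_group["Female"] - toy_group["Male"]

print(group_diff)
print(diff_by_age.tolist())
```

Applied to the real pivot table, the same two lines would be `age_group.sum()` followed by the Female total minus the Male total.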


## Part 3 (10%)
*In the same notebook.*

*Write the code that would work out which region in Ireland has the biggest population difference between the sexes in that age group*
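
One possible approach, sketched on made-up rows rather than the real file: keep the regional rows instead of filtering to "Ireland", restrict to the age group, total each sex per region, and take the region with the largest absolute difference. The region names "Region A" and "Region B" and all values are invented; the column names are assumed to match the cleaned data used earlier in the notebook.

```python
import pandas as pd

# Made-up long-format rows in the shape of the cleaned data (values are invented)
toy = pd.DataFrame({
    "Administrative Counties": [
        "Ireland", "Ireland", "Region A", "Region A",
        "Region B", "Region B", "Region A", "Region A",
    ],
    "Sex": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "Single Year of Age": [35, 35, 35, 35, 35, 35, 60, 60],
    "VALUE": [500, 480, 200, 190, 300, 270, 50, 10],
})

age = 35

# Keep the regions (not the national total) and only ages within 5 years of `age`
in_group = toy[
    (toy["Administrative Counties"] != "Ireland")
    & toy["Single Year of Age"].between(age - 5, age + 5)
]

# One row per region, one column per sex, summed over the age group
by_region = in_group.pivot_table(
    values="VALUE", index="Administrative Counties", columns="Sex", aggfunc="sum"
)

# Absolute difference between the sexes, then the region where it is largest
region_diff = (by_region["Female"] - by_region["Male"]).abs()
print(region_diff.idxmax(), region_diff.max())
```

In this toy data the age-60 rows fall outside the group, so "Region B" has the larger difference.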

End