# Assignment for Topic 5

## Part 1

Analyse the differences between the sexes by age in Ireland (not regions)

Using [CSO data](https://data.cso.ie/), load data from the FY006A - Population database. 

### Import the csv data from the url containing the data

In [1]:
# import pandas

import pandas as pd


In [2]:
# Define url
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"

# Define dataframe and import data from url 
df = pd.read_csv(url)

# Sanity check: show the first 5 rows (these show "Both sexes"), then the last 5 rows
print(df.head(5))
print(df.tail(5))

# check this against csv I downloaded from the CSO (to be sure)

# Also, send to csv to check all of the data
df.to_csv("population_import.csv")

# After checking the population_import.csv file against the downloaded CSV file (which is in Excel), 
# I am happy that the import has worked correctly.


Observation: I only want male and female for sex comparison so I want to get rid of "Both sexes"

### Remove "Both sexes"

In [3]:
# Remove "Both sexes" from the "Sex" column
df = df[df["Sex"] != "Both sexes"]

# Sanity check here (commented out when I am happy with it, also sending to csv below will void this output)
# df.head(5)

# As it's my first attempt at this task, I will do a second sanity check by exporting to csv again
# to check the data
df.to_csv("population_male_female.csv")

# I am happy that the "Sex" column now contains only male and female
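
A quicker check than opening the exported csv is to list the distinct values left in the column. The sketch below shows the idea on a small made-up DataFrame; the name `toy` and its numbers are assumptions for illustration, not CSO data.

```python
import pandas as pd

# Toy data standing in for the CSO table (values are made up)
toy = pd.DataFrame({
    "Sex": ["Both sexes", "Male", "Female", "Both sexes", "Male", "Female"],
    "VALUE": [10, 4, 6, 20, 9, 11],
})

# Keep only the rows that are not the "Both sexes" aggregate
toy_mf = toy[toy["Sex"] != "Both sexes"]

# List the distinct values that remain in the column
remaining = sorted(toy_mf["Sex"].unique())
print(remaining)
```

The same `unique()` call on the real `df["Sex"]` column should confirm the filter without leaving the notebook.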

### Remove "All ages" as I don't need it

In [4]:
# Remove "All ages" from the "Single Year of Age" column
df = df[df["Single Year of Age"] != "All ages"]

# Sanity check here (commented out when I am happy with it)
# df.head(5)

# Again, I will do a second sanity check by again exporting to csv to check the data
df.to_csv("population_ages.csv")

# I am happy that the data now contains no reference to "All ages", and only male and female
# in the "Sex" column

### Remove columns I don't need

I can see that I don't want the following columns: STATISTIC, Statistic Label, TLIST(A1), CensusYear, C02199V02655, C02076V03371, C03789V04537, and UNIT. So I will remove them, leaving me with Index, Sex, Single Year of Age, Administrative Counties, and VALUE.

In [5]:
headers = df.columns.tolist()
headers

['STATISTIC',
 'Statistic Label',
 'TLIST(A1)',
 'CensusYear',
 'C02199V02655',
 'Sex',
 'C02076V03371',
 'Single Year of Age',
 'C03789V04537',
 'Administrative Counties',
 'UNIT',
 'VALUE']

In [6]:
# Use the headers list to create a list of columns to remove

drop_col_list = ['STATISTIC', 'Statistic Label', 'TLIST(A1)', 'CensusYear', 'C02199V02655', 'C02076V03371', 'C03789V04537', 'UNIT']
# Assign the result back instead of using inplace=True, because df is a filtered copy
df = df.drop(columns=drop_col_list)

# Sanity check: export to csv. Warnings were appearing, so I am checking where the issue is.
# Everything works up to this write-to-csv stage.
df.to_csv("population_columns_dropped.csv")

In [8]:
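# Remove text from ages
df['Single Year of Age'] = df['Single Year of Age'].str.replace('Under 1 year', '0')
# Raw string (r'\D') so that \D reaches the regex engine as "any non-digit character"
df['Single Year of Age'] = df['Single Year of Age'].str.replace(r'\D', '', regex=True)

# Save to csv for sanity check
df.to_csv("population_clean_ages.csv")


### Note on above syntax warnings

Writing the pattern as a plain string, `'\D'`, makes Python report an invalid escape sequence (a `SyntaxWarning` in recent versions), so the pattern above is written as a raw string, `r'\D'`.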
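
As a minimal sketch of the age clean-up, the example below strips the text from a few made-up age labels. The labels and the names `ages`, `clean` and `clean_int` are assumptions for illustration; the real column may contain other wording.

```python
import pandas as pd

# Made-up age labels in the style of the "Single Year of Age" column
ages = pd.Series(["Under 1 year", "1 year", "2 years", "100 years and over"])

# Replace the special case first (plain text match), then strip every non-digit.
# The r prefix makes the pattern a raw string, so Python passes \D to the regex engine untouched.
clean = ages.str.replace("Under 1 year", "0", regex=False).str.replace(r"\D", "", regex=True)

# Convert the cleaned strings to integers
clean_int = clean.astype("int64")
print(clean_int.tolist())
```

The order matters: if the non-digits were stripped first, "Under 1 year" would become 1 rather than 0.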


In [None]:
# Following on from my note below
# Remove "Ireland" from the "Administrative Counties" column
df = df[df["Administrative Counties"] != "Ireland"]

# Sanity check here (commented out when I am happy with it)
#df.head(5)

# Again, I will do a second sanity check by again exporting to csv to check the data
df.to_csv("population_no_Ireland.csv")

In [36]:
# Convert the age and population columns to integers

df['Single Year of Age'] = df['Single Year of Age'].astype('int64')
df['VALUE'] = df['VALUE'].astype('int64')

# Look at the dataframe
print(df.head(3))
df.info()

       Sex  Single Year of Age                Administrative Counties  VALUE
3297  Male                   0                  Carlow County Council    346
3298  Male                   0                    Dublin City Council   3188
3299  Male                   0  Dún Laoghaire Rathdown County Council   1269
<class 'pandas.core.frame.DataFrame'>
Index: 6262 entries, 3297 to 9791
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Sex                      6262 non-null   object
 1   Single Year of Age       6262 non-null   int64 
 2   Administrative Counties  6262 non-null   object
 3   VALUE                    6262 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 244.6+ KB


In [37]:
# Use a pivot table to reframe the dataframe for analysis
# Investigating why my pivot table is giving warnings:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
# Conversation with chatgpt: 

df_analysis = pd.pivot_table(
    df,
    values='VALUE',
    index=['Administrative Counties', 'Single Year of Age'],
    columns='Sex',
    aggfunc="sum",
)
print(df_analysis)

# Save the pivoted data to csv, ready for analysis
df_analysis.to_csv("population_for_anal.csv")

Sex                                         Female  Male
Administrative Counties Single Year of Age              
Carlow County Council   0                      353   346
                        1                      302   347
                        2                      334   355
                        3                      378   376
                        4                      369   376
...                                            ...   ...
Wicklow County Council  96                      26    12
                        97                      21     4
                        98                      12     4
                        99                       7     1
                        100                     15     3

[3131 rows x 2 columns]


## Analysis of the population_for_anal.csv file

### First, determine the weighted mean age (by sex)

In [38]:
# Get the column name for each sex from the pivot table

headers = list(df_analysis.columns)
female = headers[0]
male = headers[1]
female, male

('Female', 'Male')

In [39]:
# Total number of females (sanity check before the weighted mean age calculation)
number_female = df_analysis[female].sum()
number_female

np.int64(2604590)

In [40]:
# Total number of males (sanity check before the weighted mean age calculation)
number_male = df_analysis[male].sum()
number_male

np.int64(2544549)

Note: at this stage the male and female totals were each approx. 5 million, which I know is approx. the population of the whole country. So I think I forgot to remove the "Ireland" counts for each sex and age. I will go back and remove them; this comment reflects my thinking at the time, but the totals above should now be correct.
I think this has worked, as each total is now roughly 2.5 to 2.6 million.
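
The double counting described in this note can be reproduced on a tiny made-up table. In the sketch below, the county names and counts are assumptions for illustration; the "Ireland" row is simply the sum of the two counties.

```python
import pandas as pd

# Made-up counts: two counties plus a national "Ireland" row that is their sum
toy = pd.DataFrame({
    "Administrative Counties": ["Ireland", "County A", "County B"],
    "VALUE": [300, 100, 200],
})

# Summing everything counts each person twice (once in the county, once in "Ireland")
total_with_ireland = toy["VALUE"].sum()

# Dropping the national row gives the true total
is_county = toy["Administrative Counties"] != "Ireland"
total_without_ireland = toy.loc[is_county, "VALUE"].sum()

print(total_with_ireland, total_without_ireland)
```

A total that is about twice the expected figure is a useful hint that an aggregate row is still in the data.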

In [41]:
df_analysis

Sex                                         Female  Male
Administrative Counties Single Year of Age              
Carlow County Council   0                      353   346
                        1                      302   347
                        2                      334   355
                        3                      378   376
                        4                      369   376
...                                            ...   ...
Wicklow County Council  96                      26    12
                        97                      21     4
                        98                      12     4
                        99                       7     1
                        100                     15     3


In [None]:
# Following your notebook, I will use the numpy method to calculate the weighted mean

import numpy as np

# Weighted mean age female
# AI suggested this code, makes sense to me
w_mean_female = np.average(df_analysis.index.get_level_values('Single Year of Age'), weights=df_analysis[female])
w_mean_female

np.float64(38.9397958987787)

In [None]:
# Weighted mean age male
w_mean_male = np.average(df_analysis.index.get_level_values('Single Year of Age'), weights=df_analysis[male])
w_mean_male

np.float64(37.7394477371039)
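
To see what `np.average` does with the `weights` argument, the sketch below compares it with the weighted mean worked out by hand, sum(age × count) / sum(count). The ages and counts are made up for illustration and are not CSO figures.

```python
import numpy as np

# Made-up ages and head counts (not CSO figures)
ages = np.array([0, 1, 2, 3])
counts = np.array([10, 20, 30, 40])

# Weighted mean by hand: sum(age * count) / sum(count)
manual = (ages * counts).sum() / counts.sum()

# The same calculation with np.average
with_numpy = np.average(ages, weights=counts)

print(manual, with_numpy)
```

Because the weights are summed over every row, it does not matter that each age appears once per county in the pivot table's index.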

In [53]:
# The difference between the sexes
# Subtract the male mean from the (larger) female mean and round to 2 decimal places

difference = w_mean_female - w_mean_male
round_difference = round(difference, 2)
round_female_w = round(w_mean_female, 4)
round_male_w = round(w_mean_male, 4)

print(f"The difference between the weighted mean age of the sexes in Ireland (to two decimal places) is {round_difference} years.\nTo four decimal places, the female weighted mean age is {round_female_w} years and the male weighted mean age is {round_male_w} years.")

The difference between the weighted mean age of the sexes in Ireland (to two decimal places) is 1.2 years.
To four decimal places, the female weighted mean age is 38.9398 years and the male weighted mean age is 37.7394 years.


## Part 2

Create a variable that stores an age (40). Group the people within 5 years of that age together into one age group. Calculate the population differences between the sexes in that age group.

### Come up with a plan of what I would like to try:

1. Store an age (40 years old).

2. Take only the Ireland data for each age, with no administrative counties. I will have to go back and see which csv I want to work on, then use it to remove all administrative counties that aren't "Ireland" and export a csv with only the Ireland data for each single year of age.

3. Locate age 40 by its position in the index/column of this dataframe.

4. Define a function to group ages 35-45.

5. Calculate the difference between male and female for this age group.
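
One possible way to carry out this plan is sketched below on made-up national counts. The function name `age_group_totals`, the `window` parameter and all the numbers are assumptions for illustration only, not the final solution.

```python
import pandas as pd

# Made-up national counts by single year of age (not CSO figures)
toy = pd.DataFrame({
    "Single Year of Age": [33, 35, 40, 45, 47],
    "Female": [100, 110, 120, 130, 140],
    "Male": [90, 100, 115, 120, 150],
})


def age_group_totals(df, age, window=5):
    """Sum the Female and Male columns for ages within `window` years of `age`."""
    # between() is inclusive at both ends, so age 40 with window 5 keeps 35 to 45
    in_group = df["Single Year of Age"].between(age - window, age + window)
    return df.loc[in_group, ["Female", "Male"]].sum()


age = 40  # the stored age
totals = age_group_totals(toy, age)

# Difference between the sexes in the age group
difference = totals["Female"] - totals["Male"]
print(totals["Female"], totals["Male"], difference)
```

Filtering on the age values themselves avoids having to find where age 40 sits in the index.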
        

## Part 3

Which region in Ireland has the biggest population difference between the sexes in the 35-45 year old age group?


Plan for part 3:

1. Work out the number of people in that age group for each administrative county.

2. Sum the males and the females in this age group for each county.

3. For each county, subtract one value from the other (male population, female population) and take the absolute value.

4. Rank these values across the counties, or perhaps compile them into a list and find the maximum (this would be quicker if it works).

5. Find the highest/maximum value (depending on my method) and look up which county it came from (i.e. look at its index).
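
The steps above could be sketched with `groupby` and `idxmax`, as below. The county names and counts are made up for illustration, and the layout (one row per county and age, with a column per sex) is an assumption about how the data would be arranged.

```python
import pandas as pd

# Made-up counts per county and age (not CSO figures)
toy = pd.DataFrame({
    "Administrative Counties": ["County A", "County A", "County A", "County B", "County B"],
    "Single Year of Age": [36, 44, 50, 36, 44],
    "Female": [100, 120, 1000, 300, 310],
    "Male": [95, 118, 0, 280, 350],
})

# Keep only the 35-45 age group (inclusive at both ends)
group = toy[toy["Single Year of Age"].between(35, 45)]

# Sum each sex per county
totals = group.groupby("Administrative Counties")[["Female", "Male"]].sum()

# Absolute difference between the sexes in each county
abs_diff = (totals["Female"] - totals["Male"]).abs()

# idxmax returns the index label (the county) of the largest difference
biggest_county = abs_diff.idxmax()
print(biggest_county, abs_diff.max())
```

Using `idxmax` gives the county directly, so there is no need to rank the values and trace the maximum back to its index.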