# Assignment 5 - Populations
Author: Vanessa Lyra

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Part 1
Write a Jupyter notebook that analyses the differences between the sexes by age in Ireland.
- Weighted mean age (by sex)
- The difference between the sexes by age

In [2]:
# Fetching data
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)

# Checking and printing data columns
headers = df.columns.tolist()
headers

['STATISTIC',
 'Statistic Label',
 'TLIST(A1)',
 'CensusYear',
 'C02199V02655',
 'Sex',
 'C02076V03371',
 'Single Year of Age',
 'C03789V04537',
 'Administrative Counties',
 'UNIT',
 'VALUE']

#### Cleaning data

In [3]:
# Drop all unwanted columns
drop_columns = ["STATISTIC", "Statistic Label","TLIST(A1)", "CensusYear", "C02199V02655", "C02076V03371", "C03789V04537", "UNIT"]
df.drop(columns=drop_columns, inplace=True, errors='ignore')

# Analysing only Ireland from Administrative counties data
df = df[df["Administrative Counties"] == "Ireland"]

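# Dropping the "All ages" total rows; .copy() makes the assignments below act on an independent copy rather than a slice
df = df[df["Single Year of Age"] != "All ages"].copy()

# Converting age labels to integers: "Under 1 year" becomes 0, then every non-digit character is stripped
df["Single Year of Age"] = df["Single Year of Age"].replace("Under 1 year", "0")
df["Single Year of Age"] = df["Single Year of Age"].replace(r"\D", "", regex=True)
df["Single Year of Age"] = df["Single Year of Age"].astype("int64")

# Casting population counts to integers
df["VALUE"] = df["VALUE"].astype("int64")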

# Pivoting data: one row per age, one column per sex (aggfunc defaults to "mean", so the values come out as floats)
df_anal = pd.pivot_table(df, values="VALUE", index="Single Year of Age", columns="Sex")

# After pivoting, "Single Year of Age" is the index; reset_index() turns it back into a regular column
df_anal = df_anal.reset_index()

# Removing the "Sex" label that pandas keeps as the name of the column index, which clutters the printed output
df_anal.columns = df_anal.columns.rename(None)

# Female and male weighted means calculations
mean_female = np.average(df_anal["Single Year of Age"], weights=df_anal["Female"])
mean_male = np.average(df_anal["Single Year of Age"], weights=df_anal["Male"])
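
The order of the two `replace` calls in the cleaning step matters: "Under 1 year" itself contains the digit 1, so stripping non-digits first would turn it into age 1 instead of age 0. A minimal sketch of that step on its own, assuming a few hypothetical labels written in the style of the "Single Year of Age" column (they are not copied from the dataset):

```python
import pandas as pd

# Hypothetical labels, made up to resemble the "Single Year of Age" column
labels = pd.Series(["Under 1 year", "1 year", "25 years", "100 years and over"])

# Replace the age-0 label first, then strip every non-digit character
ages = (
    labels.replace("Under 1 year", "0")
    .replace(r"\D", "", regex=True)
    .astype("int64")
)

print(ages.tolist())  # [0, 1, 25, 100]
```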

#### Weighted mean age (by sex)

In [4]:
# Printing statements
print("Weighted mean age (by sex)")
print(f"Weighted mean females: {mean_female:.2f}")
print(f"Weighted mean males: {mean_male:.2f}")
print(f"Difference (Females & Males): {mean_female - mean_male:.2f}\n")

Weighted mean age (by sex)
Weighted mean females: 38.94
Weighted mean males: 37.74
Difference (Females & Males): 1.20
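
The weighted mean treats each age as a value and the number of people at that age as its weight, i.e. sum(age × count) / sum(count). The sketch below checks on made-up numbers (not census data) that `np.average` with `weights` gives the same result as the formula written out by hand:

```python
import numpy as np

# Toy data, made up for illustration: ages and the number of people at each age
ages = np.array([0, 1, 2, 3])
counts = np.array([10, 20, 30, 40])

# Weighted mean by hand: sum(age * count) / sum(count)
manual = (ages * counts).sum() / counts.sum()

# The same value from np.average, using the counts as weights
weighted = np.average(ages, weights=counts)

# Unweighted mean of the ages, for contrast
plain = ages.mean()

print(manual, weighted, plain)
```

The unweighted mean ignores how many people are at each age, which is why it is not used for the population figures.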



#### Difference between the sexes by age

In [5]:
# Calculating females minus males for each single year of age
df_anal["Difference (Female - Male)"] = df_anal["Female"] - df_anal["Male"]

# Printing the difference for each single year of age
print(df_anal[["Single Year of Age", "Difference (Female - Male)"]])
# print(df_anal.columns) # Printing statement for testings, checking column names in DF

     Single Year of Age  Difference (Female - Male)
0                     0                     -1424.0
1                     1                     -1330.0
2                     2                     -1262.0
3                     3                     -1518.0
4                     4                     -1867.0
..                  ...                         ...
96                   96                       629.0
97                   97                       515.0
98                   98                       362.0
99                   99                       231.0
100                 100                       430.0

[101 rows x 2 columns]


#### References

Reset index: https://www.geeksforgeeks.org/pandas/python-pandas-dataframe-reset_index/ 


## Part 2
Make a variable that stores an age (say 35).
- Write the code that would group the people within 5 years of that age together into one age group.
- Calculate the population difference between the sexes in that age group.

In [6]:
# Defining age for analysis
base_age = 35

# Defining the lower and upper bounds of the age group
younger_group = base_age - 5
older_group = base_age + 5

# Pivoting data from df for the new analysis: one row per age, one column per sex
df_anal1 = pd.pivot_table(df, values="VALUE", index="Single Year of Age", columns="Sex")

# After pivoting, "Single Year of Age" is the index; reset_index() turns it back into a regular column
df_anal1 = df_anal1.reset_index()

# Removing the "Sex" label that pandas keeps as the name of the column index, which clutters the printed output
df_anal1.columns = df_anal1.columns.rename(None)

# Finding people at defined age group in dataframe
age_group = df_anal1[
    (df_anal1["Single Year of Age"] >= younger_group) &
    (df_anal1["Single Year of Age"] <= older_group)]

# Sum total of males and females from age group
female_group = age_group["Female"].sum()
male_group = age_group["Male"].sum()

# Calculating the population difference between the sexes in the age group
sexes_diff = female_group - male_group

# Printing results to user
print(f"Age group of study: {younger_group} - {older_group}")
print(f"Female group: {female_group:.0f}")  # .0f prints the value with no decimal places
print(f"Male group: {male_group:.0f}")
print(f"Population difference between sexes: {sexes_diff:.0f}\n")

Age group of study: 30 - 40
Female group: 414506
Male group: 384030
Population difference between sexes: 30476
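
The same selection can be wrapped in a small helper so that the base age and the window width become parameters. This is only a sketch: `sex_difference_in_window` is a hypothetical name, the toy table is made up, and `Series.between` (inclusive at both ends by default) stands in for the two comparisons used above.

```python
import pandas as pd

def sex_difference_in_window(wide, base_age, window=5):
    """Return (female_total, male_total, difference) for ages within `window` years of `base_age`."""
    # between() keeps both endpoints, matching the >= and <= comparisons
    in_window = wide["Single Year of Age"].between(base_age - window, base_age + window)
    group = wide[in_window]
    female_total = int(group["Female"].sum())
    male_total = int(group["Male"].sum())
    return female_total, male_total, female_total - male_total

# Toy data, made up for illustration: ages 29 and 41 fall outside the 30-40 window
toy = pd.DataFrame({
    "Single Year of Age": [29, 30, 35, 40, 41],
    "Female": [10, 20, 30, 40, 50],
    "Male": [5, 15, 25, 35, 45],
})

female_total, male_total, difference = sex_difference_in_window(toy, base_age=35)
print(female_total, male_total, difference)
```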



#### References

Retrieving range of data from Dataframe: https://stackoverflow.com/questions/38884466/how-to-select-a-range-of-values-in-a-pandas-dataframe-column  
Sum in Dataframe: https://www.statology.org/pandas-groupby-range/  
Decimal formatting: https://www.askpython.com/python/string/decimal-formatting-0f-vs-1f  

## Part 3
Write the code that would work out which region in Ireland has the biggest population difference between the sexes in that age group.

In [None]:
# Fetching data
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)

# Cleaning data
# Drop all unwanted columns
drop_columns = ["STATISTIC", "Statistic Label","TLIST(A1)", "CensusYear", "C02199V02655", "C02076V03371", "C03789V04537", "UNIT"]
df.drop(columns=drop_columns, inplace=True, errors='ignore')

# Removing Ireland from Administrative counties data
df = df[df["Administrative Counties"] != "Ireland"]

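# Dropping the "All ages" total rows; .copy() makes the assignments below act on an independent copy rather than a slice
df = df[df["Single Year of Age"] != "All ages"].copy()

# Converting age labels to integers: "Under 1 year" becomes 0, then every non-digit character is stripped
df["Single Year of Age"] = df["Single Year of Age"].replace("Under 1 year", "0")
df["Single Year of Age"] = df["Single Year of Age"].replace(r"\D", "", regex=True)
df["Single Year of Age"] = df["Single Year of Age"].astype("int64")

# Casting population counts to integers
df["VALUE"] = df["VALUE"].astype("int64")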

# Pivoting data
# Resetting the index to turn "Single Year of Age" and "Administrative Counties" back into regular columns for analysis
df_anal2 = pd.pivot_table(df, values="VALUE", index=["Single Year of Age","Administrative Counties"], columns="Sex", aggfunc="sum").reset_index()

# Defining age for analysis
base_age = 35

# Defining the lower and upper bounds of the age group
younger_group = base_age - 5
older_group = base_age + 5

# Finding people in the defined age group in the dataframe
age_group = df_anal2[
    (df_anal2["Single Year of Age"] >= younger_group) &
    (df_anal2["Single Year of Age"] <= older_group)]


# Grouping the age group by region and summing each sex
irl_region = age_group.groupby("Administrative Counties").agg(
    female_group=("Female", "sum"),
    male_group=("Male", "sum"))

# Calculating the difference between the sexes (female - male)
irl_region["sex_difference"] = irl_region["female_group"] - irl_region["male_group"]

# Finding the region with the largest (female - male) difference; idxmax() ranks signed values, so a region with a male surplus would not be selected
region_diff = irl_region["sex_difference"].idxmax()
diff_value = irl_region.loc[region_diff, "sex_difference"]

# Printing statements to user
print(f"Population difference between sexes for ages {younger_group} - {older_group}")
print("Region with the largest sex difference:")
print(f"{region_diff} and difference is: {diff_value}")

Population difference between sexes for ages 30 - 40
Region with the largest sex difference:
Fingal County Council and difference is: 2942
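
`idxmax()` on the signed difference picks the largest female surplus. If "biggest difference" is read as the largest gap in either direction, the absolute value can be ranked instead. A minimal sketch of both readings on made-up regions "A", "B" and "C" (not CSO data), where the two give different answers:

```python
import pandas as pd

# Toy data, made up for illustration: region B has the largest gap, but it is a male surplus
toy = pd.DataFrame({
    "Administrative Counties": ["A", "A", "B", "B", "C", "C"],
    "Single Year of Age": [30, 31, 30, 31, 30, 31],
    "Female": [100, 100, 50, 50, 80, 80],
    "Male": [90, 95, 70, 65, 80, 79],
})

# Sum each sex per region, then take female minus male
by_region = toy.groupby("Administrative Counties")[["Female", "Male"]].sum()
by_region["sex_difference"] = by_region["Female"] - by_region["Male"]

# Largest female surplus (signed) versus largest gap in either direction (absolute)
largest_signed = by_region["sex_difference"].idxmax()
largest_absolute = by_region["sex_difference"].abs().idxmax()

print(largest_signed, largest_absolute)
```

Whether the two readings agree on the real data depends on whether any region has a male surplus larger than the biggest female surplus in that age group.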


#### References

Ignore Ireland from dataframe: https://www.statology.org/pandas-filter-by-column-value-not-equal/  
Groupby and aggregate: https://medium.com/@heyamit10/understanding-groupby-and-aggregate-in-pandas-f45e524538b9  
Idxmax inspiration: https://community.dataquest.io/t/pandas-return-row-with-the-maximum-value-of-a-column/258474   
Loc inspiration: https://saturncloud.io/blog/how-to-search-pandas-data-frame-by-index-value-and-value-in-any-column/  
Dataframe index name removal: https://stackoverflow.com/questions/29765548/remove-index-name-in-pandas  

**End**