<a href="https://www.kaggle.com/code/rodolphojustino/wrangling-data-for-gds?scriptVersionId=122165128" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Author: Rodolpho Justino

This dataset comes from a UN study of world population covering 1950 to 2021 and is available [here](https://www.kaggle.com/datasets/ahmethoso/wpp-population-by-age-and-sex?select=PopulationByAgeSex.csv).

Here we wrangle the data so that the dataset can be used in Google Data Studio.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
# Parse the `time` column (position 2) as dates; `infer_datetime_format` is deprecated since pandas 2.0
df = pd.read_csv('/kaggle/input/world-population-ds/world-population.csv', parse_dates = ['time'])

In [3]:
df

,id,location,time,child,adolescent,adult,old,male,female,total
0,1,Afghanistan,1950-01-01,3179.867,783.732,3398.243,390.275,4099.243,3652.874,7752.117
1,2,Afghanistan,1951-01-01,3235.425,790.341,3418.766,395.619,4134.756,3705.395,7840.151
2,3,Afghanistan,1952-01-01,3281.624,801.420,3452.410,400.542,4174.450,3761.546,7935.996
3,4,Afghanistan,1953-01-01,3327.335,814.914,3493.155,404.280,4218.336,3821.348,8039.684
4,5,Afghanistan,1954-01-01,3376.382,829.684,3538.851,406.399,4266.484,3884.832,8151.316
...,...,...,...,...,...,...,...,...,...,...
14035,66810,Zimbabwe,2017-01-01,6064.344,1533.435,6002.048,636.772,6777.054,7459.545,14236.599
14036,66811,Zimbabwe,2018-01-01,6122.127,1558.954,6104.387,653.344,6879.119,7559.693,14438.812
14037,66812,Zimbabwe,2019-01-01,6174.235,1594.436,6206.268,670.534,6983.353,7662.120,14645.473
14038,66813,Zimbabwe,2020-01-01,6229.220,1638.895,6306.378,688.434,7092.010,7770.917,14862.927


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14040 entries, 0 to 14039
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   id          14040 non-null  int64         
 1   location    14040 non-null  object        
 2   time        14040 non-null  datetime64[ns]
 3   child       14040 non-null  float64       
 4   adolescent  14040 non-null  float64       
 5   adult       14040 non-null  float64       
 6   old         14040 non-null  float64       
 7   male        14040 non-null  float64       
 8   female      14040 non-null  float64       
 9   total       14040 non-null  float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(1)
memory usage: 1.1+ MB


The DataFrame has no missing values, so we can proceed with the wrangling.
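The statement above is read off the `Non-Null Count` column of `df.info()`. A minimal sketch of a more direct check follows. The frame `sample` is a hypothetical stand-in built from the first two rows shown earlier; in the notebook the same calls would run on the full `df`.

```python
import pandas as pd

# Hypothetical two-row sample copied from the output above; the real notebook
# would run the same check on the full `df`.
sample = pd.DataFrame({
    "location": ["Afghanistan", "Afghanistan"],
    "time": pd.to_datetime(["1950-01-01", "1951-01-01"]),
    "total": [7752.117, 7840.151],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```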

* Drop the `id` column and rename `location` to `country` and `time` to `date`;

* Adjust the format of the date column;

* Calculate the percentages of males, females, and age groups.

In [5]:
df = df.drop(['id'], axis = 1)

In [6]:
df = df.rename(columns = {'location': "country", "time": "date"})

In [7]:
df["date"] = df["date"].dt.strftime('%b %d, %Y')
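For reference, here is a small sketch of what the `'%b %d, %Y'` pattern produces and how the resulting strings can be parsed back. The series `dates` is a hypothetical example, and the month abbreviation from `%b` depends on the active locale, so the printed values assume an English or C locale.

```python
import pandas as pd

# Hypothetical example dates; the notebook applies the same pattern to df["date"].
dates = pd.Series(pd.to_datetime(["1950-01-01", "2020-01-01"]))

# Format each date with the same pattern used in the notebook.
formatted = dates.dt.strftime("%b %d, %Y")
print(formatted.tolist())  # ['Jan 01, 1950', 'Jan 01, 2020'] under an English/C locale

# After formatting, the column holds plain strings rather than datetimes.
# Parsing with the same pattern recovers the original values.
parsed = pd.to_datetime(formatted, format="%b %d, %Y")
print(parsed.tolist())
```

Because the column holds text after this step, any date arithmetic or sorting by date is easier to do before the formatting is applied.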

In [8]:
df["child_perc"] = 100 * df["child"] / df["total"]
df["adolescent_perc"] = 100 * df["adolescent"] / df["total"]
df["adult_perc"] = 100 * df["adult"] / df["total"]
df["old_perc"] = 100 * df["old"] / df["total"]

In [9]:
df["male_perc"] = 100 * df["male"] / df["total"]
df["female_perc"] = 100 * df["female"] / df["total"]
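A quick sanity check on these percentages, sketched on two rows copied from the raw data shown earlier: if the four age groups partition the total, and likewise male and female, each set of percentages should sum to 100. That partition holds for the rows shown, but it is an assumption for the rest of the dataset. The names `sample`, `age_sum`, and `sex_sum` are illustrative only.

```python
import numpy as np
import pandas as pd

# Two rows copied from the raw data shown earlier (Afghanistan, 1950 and 1951).
sample = pd.DataFrame({
    "child": [3179.867, 3235.425],
    "adolescent": [783.732, 790.341],
    "adult": [3398.243, 3418.766],
    "old": [390.275, 395.619],
    "male": [4099.243, 4134.756],
    "female": [3652.874, 3705.395],
    "total": [7752.117, 7840.151],
})

# Same formula as in the notebook, applied in a loop over the group columns.
for col in ["child", "adolescent", "adult", "old", "male", "female"]:
    sample[f"{col}_perc"] = 100 * sample[col] / sample["total"]

# Each family of percentages should add up to 100, within floating-point error.
age_sum = sample[["child_perc", "adolescent_perc", "adult_perc", "old_perc"]].sum(axis=1)
sex_sum = sample[["male_perc", "female_perc"]].sum(axis=1)

ages_ok = bool(np.allclose(age_sum, 100.0))
sexes_ok = bool(np.allclose(sex_sum, 100.0))
print("Age groups sum to 100:", ages_ok)
print("Sexes sum to 100:", sexes_ok)
```

Once the columns are rounded to two decimals, the sums can drift from 100 by a few hundredths, so a check like this is best run before rounding.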

Rounding the numeric columns to 2 decimal places

In [10]:
for col in df.columns:
    if col not in ["date", "country"]:
        df[col] = df[col].apply(lambda num: round(num, 2))

Reordering the columns so that each count is followed by its percentage

In [11]:
df = df[["date", "country", "total", "male", "male_perc",
        "female", "female_perc", "child", "child_perc",
        "adolescent", "adolescent_perc", "adult", "adult_perc",
        "old", "old_perc"]]

df.head()

,date,country,total,male,male_perc,female,female_perc,child,child_perc,adolescent,adolescent_perc,adult,adult_perc,old,old_perc
0,"Jan 01, 1950",Afghanistan,7752.12,4099.24,52.88,3652.87,47.12,3179.87,41.02,783.73,10.11,3398.24,43.84,390.27,5.03
1,"Jan 01, 1951",Afghanistan,7840.15,4134.76,52.74,3705.39,47.26,3235.43,41.27,790.34,10.08,3418.77,43.61,395.62,5.05
2,"Jan 01, 1952",Afghanistan,7936.0,4174.45,52.6,3761.55,47.4,3281.62,41.35,801.42,10.1,3452.41,43.5,400.54,5.05
3,"Jan 01, 1953",Afghanistan,8039.68,4218.34,52.47,3821.35,47.53,3327.34,41.39,814.91,10.14,3493.16,43.45,404.28,5.03
4,"Jan 01, 1954",Afghanistan,8151.32,4266.48,52.34,3884.83,47.66,3376.38,41.42,829.68,10.18,3538.85,43.41,406.4,4.99


Saving the dataset as a CSV file to load into Google Data Studio

In [12]:
df.to_csv('./world-pop-wrangled.csv', sep = ",", index = False)
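The formatted dates contain a comma, which is also the field separator, so it is worth confirming that the export survives a round trip. The sketch below writes a hypothetical one-row frame (`wrangled`) to an in-memory buffer instead of a file and reads it back. With its default settings, `to_csv` quotes fields that contain the separator.

```python
import io

import pandas as pd

# Hypothetical one-row frame using values from the wrangled output above.
wrangled = pd.DataFrame({
    "date": ["Jan 01, 1950"],
    "country": ["Afghanistan"],
    "total": [7752.12],
})

# Write to an in-memory buffer instead of a file, with the same index=False setting.
buffer = io.StringIO()
wrangled.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back; index=False means no extra index column appears.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())
```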