# **Week-4 Assignment**
## **World Bank Data**

---


# Python Implementation of Hans Rosling-style Visualizations

This project recreates dynamic visualizations like those famously presented by Hans Rosling, using Python. It relies on popular libraries such as Matplotlib, Seaborn, and Plotly (including Plotly's choropleth maps) for data analysis and visualization.

## Project Overview
Contained within this repository are Python scripts and notebooks illustrating the process of replicating Hans Rosling's data visualizations. The project is structured into the following sections:

### Data Preprocessing
Details on how the datasets were prepared for analysis.

### Analysis and Visualization
A breakdown of the analysis conducted using Matplotlib, Seaborn, and Plotly (including choropleth maps), accompanied by the visualizations created.

## Datasets Used
Four datasets were employed for analysis:

1. life_expectancy.csv
2. fertility_rate.csv
3. country_population.csv
4. Metadata_Country.csv

## Tools and Libraries Utilized
- Matplotlib
- Seaborn
- Plotly
- Choropleth maps (drawn with Plotly)

## Usage Guidelines
### Prerequisites
- Python
- Installation of the aforementioned required libraries

## Visualization Examples
- Animated scatter plots using Matplotlib
- Interactive plots using Plotly
- Choropleth maps displaying global data

## Inspiration from TED Talk
This project draws inspiration from [Hans Rosling's TED Talk](https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen), aiming to replicate his exceptional data storytelling and visualization techniques.


---

## **Importing Libraries**

First, we import the required libraries:
*  `NumPy` for working with arrays.
*  `Pandas` for working with data sets.
*  `Plotly` for interactive graphical visualization.

`pd.set_option`: controls how floating-point numbers are displayed within Pandas DataFrames (here, with two decimal places).

In [33]:
import numpy as np                                  # import numpy for working with arrays
import pandas as pd                                 # import for working with data sets
import plotly.express as px                         # import for interactive visualization of graphs
pd.set_option('display.float_format', lambda x: '%.2f' % x)       # display floats with two decimal places
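As a small illustration of what the `display.float_format` option changes, the sketch below formats a made-up DataFrame (the values are illustrative and not taken from the datasets). It uses `pd.option_context` so the setting applies only inside the `with` block, and the stored values are not modified.

```python
import pandas as pd

# made-up sample values, for illustration only
demo = pd.DataFrame({"value": [1234567.891, 0.5]})

# option_context applies the display setting only inside the with-block
with pd.option_context("display.float_format", lambda x: "%.2f" % x):
    formatted = demo.to_string()

print(formatted)                # floats are rendered with two decimals
print(demo["value"].iloc[0])    # the underlying value is unchanged
```

The option only affects how numbers are printed; calculations still use the full-precision values.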

We will import the raw .csv files as follows:
1.  `population`: the population of each country, from 1960 to 2016.
2.  `fertility`: the fertility rate of each country, from 1960 to 2016.
3.  `expectancy`: the life expectancy of each country, from 1960 to 2016.
4.  `metadata`: information about each country, including the region it belongs to. Helpful for grouping the countries later.

In [34]:
population=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/country_population.csv')        # reads population dataset

fertility=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/fertility_rate.csv')          # reads fertility rate dataset

metadata=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/Metadata_Country.csv')            # reads metadata dataset

expectancy=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/life_expectancy.csv')        # reads life expectancy dataset

## **Cleaning & Processing Data Frame**

### Population Data Frame

To clean and process the `population` data frame:
*  First, remove the unnecessary columns using a list. Here: `['Indicator Name', 'Indicator Code']`
*  Rename column headers wherever necessary. Here: `'ï»¿"Country Name"'` (the header carries byte-order-mark residue from the CSV) to `'Country Name'`
*  Filter out any placeholder row that is not an actual country. Here: `'Not classified'`
*  Standardize the column headers by lowercasing them and replacing spaces with `'_'`

In [35]:
columns_to_remove = ['Indicator Name', 'Indicator Code']                  # variable to store list of unnecessary columns
population = population.drop(columns=columns_to_remove)                   # dropping the unnecessary columns
population.rename(columns={'ï»¿"Country Name"': 'Country Name'}, inplace=True)    # renaming the column header
population = population[population['Country Name'] != 'Not classified']           # filtering out the specific row
population.columns=population.columns.str.lower().str.replace(' ','_')            # standardizing the column header
population

Unnamed: 0,country_name,country_code,1960,1961,1962,1963,1964,1965,1966,1967,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,54211.00,55438.00,56225.00,56695.00,57032.00,57360.00,57715.00,58055.00,...,101220.00,101353.00,101453.00,101669.00,102053.00,102577.00,103187.00,103795.00,104341.00,104822.00
1,Afghanistan,AFG,8996351.00,9166764.00,9345868.00,9533954.00,9731361.00,9938414.00,10152331.00,10372630.00,...,26616792.00,27294031.00,28004331.00,28803167.00,29708599.00,30696958.00,31731688.00,32758020.00,33736494.00,34656032.00
2,Angola,AGO,5643182.00,5753024.00,5866061.00,5980417.00,6093321.00,6203299.00,6309770.00,6414995.00,...,20997687.00,21759420.00,22549547.00,23369131.00,24218565.00,25096150.00,25998340.00,26920466.00,27859305.00,28813463.00
3,Albania,ALB,1608800.00,1659800.00,1711319.00,1762621.00,1814135.00,1864791.00,1914573.00,1965598.00,...,2970017.00,2947314.00,2927519.00,2913021.00,2905195.00,2900401.00,2895092.00,2889104.00,2880703.00,2876101.00
4,Andorra,AND,13411.00,14375.00,15370.00,16412.00,17469.00,18549.00,19647.00,20758.00,...,82683.00,83861.00,84462.00,84449.00,83751.00,82431.00,80788.00,79223.00,78014.00,77281.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,Kosovo,XKX,947000.00,966000.00,994000.00,1022000.00,1050000.00,1078000.00,1106000.00,1135000.00,...,1733404.00,1747383.00,1761474.00,1775680.00,1791000.00,1805200.00,1824100.00,1821800.00,1801800.00,1816200.00
260,"Yemen, Rep.",YEM,5172135.00,5260501.00,5351799.00,5446063.00,5543339.00,5643643.00,5748588.00,5858638.00,...,21751605.00,22356391.00,22974929.00,23606779.00,24252206.00,24909969.00,25576322.00,26246327.00,26916207.00,27584213.00
261,South Africa,ZAF,17456855.00,17920673.00,18401608.00,18899275.00,19412975.00,19942303.00,20486439.00,21045785.00,...,49887181.00,50412129.00,50970818.00,51584663.00,52263516.00,52998213.00,53767396.00,54539571.00,55291225.00,56015473.00
262,Zambia,ZMB,3044846.00,3140264.00,3240587.00,3345145.00,3452942.00,3563407.00,3676189.00,3791887.00,...,12725974.00,13082517.00,13456417.00,13850033.00,14264756.00,14699937.00,15153210.00,15620974.00,16100587.00,16591390.00
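The same four cleaning steps are applied to each indicator table, so they could also be wrapped in one function. The sketch below is one possible way to do that, not the code used in this notebook: `clean_indicator_frame` is a hypothetical helper, and the tiny `raw` frame is made up to mimic the layout of the CSV files. Unlike the cell above, it also resets the row index.

```python
import pandas as pd


def clean_indicator_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the four cleaning steps to one indicator table."""
    # 1. drop the indicator columns (ignored if they are absent)
    df = df.drop(columns=["Indicator Name", "Indicator Code"], errors="ignore")
    # 2. strip byte-order-mark residue and stray quotes from every header
    df = df.rename(
        columns=lambda c: str(c).replace("\ufeff", "").replace("ï»¿", "").strip('" ')
    )
    # 3. filter out the placeholder row
    df = df[df["Country Name"] != "Not classified"]
    # 4. lowercase the headers and replace spaces with underscores
    df = df.rename(columns=lambda c: c.lower().replace(" ", "_"))
    return df.reset_index(drop=True)


# made-up two-row sample that mimics the raw CSV layout
raw = pd.DataFrame({
    'ï»¿"Country Name"': ["Aruba", "Not classified"],
    "Country Code": ["ABW", "XXX"],
    "Indicator Name": ["Population, total"] * 2,
    "Indicator Code": ["SP.POP.TOTL"] * 2,
    "1960": [54211.0, None],
})

cleaned = clean_indicator_frame(raw)
print(cleaned.columns.tolist())
```

With such a helper, each table would need a single call, for example `population = clean_indicator_frame(population)`.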


### Fertility Data Frame

To clean and process the `fertility` data frame:
*  First, remove the unnecessary columns using a list. Here: `['Indicator Name', 'Indicator Code']`
*  Rename column headers wherever necessary. Here: `'ï»¿"Country Name"'` to `'Country Name'`
*  Filter out any placeholder row that is not an actual country. Here: `'Not classified'`
*  Standardize the column headers by lowercasing them and replacing spaces with `'_'`

In [36]:
columns_to_remove = ['Indicator Name', 'Indicator Code']        # variable to store list of unnecessary columns
fertility = fertility.drop(columns=columns_to_remove)           # dropping the unnecessary columns
fertility.rename(columns={'ï»¿"Country Name"': 'Country Name'}, inplace=True)   # renaming the column header
fertility = fertility[fertility['Country Name'] != 'Not classified']            # filtering out the specific row
fertility.columns=fertility.columns.str.lower().str.replace(' ','_')            # standardizing the column header
fertility

Unnamed: 0,country_name,country_code,1960,1961,1962,1963,1964,1965,1966,1967,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,4.82,4.66,4.47,4.27,4.06,3.84,3.62,3.42,...,1.76,1.76,1.77,1.78,1.78,1.79,1.80,1.80,1.80,1.80
1,Afghanistan,AFG,7.45,7.45,7.45,7.45,7.45,7.45,7.45,7.45,...,6.46,6.25,6.04,5.82,5.59,5.38,5.17,4.98,4.80,4.63
2,Angola,AGO,7.48,7.52,7.56,7.59,7.61,7.62,7.62,7.61,...,6.37,6.31,6.24,6.16,6.08,6.00,5.92,5.84,5.77,5.69
3,Albania,ALB,6.49,6.40,6.28,6.13,5.96,5.77,5.58,5.39,...,1.67,1.65,1.65,1.65,1.67,1.69,1.70,1.71,1.71,1.71
4,Andorra,AND,,,,,,,,,...,1.18,1.25,1.19,1.27,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,Kosovo,XKX,,,,,,,,,...,2.43,2.38,2.34,2.29,2.24,2.19,2.16,2.13,2.09,2.06
260,"Yemen, Rep.",YEM,7.49,7.53,7.58,7.62,7.67,7.71,7.74,7.76,...,5.09,4.94,4.80,4.67,4.55,4.44,4.33,4.21,4.10,4.00
261,South Africa,ZAF,6.04,6.03,6.01,5.99,5.96,5.92,5.88,5.83,...,2.64,2.62,2.60,2.59,2.57,2.55,2.53,2.51,2.48,2.46
262,Zambia,ZMB,7.12,7.17,7.21,7.25,7.27,7.29,7.30,7.32,...,5.64,5.56,5.48,5.40,5.32,5.24,5.17,5.10,5.04,4.98


### Expectancy Data Frame

To clean and process the `expectancy` data frame:
*  First, remove the unnecessary columns using a list. Here: `['Indicator Name', 'Indicator Code']`
*  Rename column headers wherever necessary. Here: `'ï»¿"Country Name"'` to `'Country Name'`
*  Filter out any placeholder row that is not an actual country. Here: `'Not classified'`
*  Standardize the column headers by lowercasing them and replacing spaces with `'_'`

In [37]:
columns_to_remove = ['Indicator Name', 'Indicator Code']           # variable to store list of unnecessary columns
expectancy = expectancy.drop(columns=columns_to_remove)            # dropping the unnecessary columns
expectancy.rename(columns={'ï»¿"Country Name"': 'Country Name'}, inplace=True)  # renaming the column header
expectancy = expectancy[expectancy['Country Name'] != 'Not classified']         # filtering out the specific row
expectancy.columns=expectancy.columns.str.lower().str.replace(' ','_')          # standardizing the column header
expectancy

Unnamed: 0,country_name,country_code,1960,1961,1962,1963,1964,1965,1966,1967,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,65.66,66.07,66.44,66.79,67.11,67.44,67.76,68.09,...,74.58,74.72,74.87,75.02,75.16,75.30,75.44,75.58,75.72,75.87
1,Afghanistan,AFG,32.29,32.74,33.19,33.62,34.06,34.49,34.93,35.36,...,59.69,60.24,60.75,61.23,61.67,62.09,62.49,62.90,63.29,63.67
2,Angola,AGO,33.25,33.57,33.91,34.27,34.65,35.03,35.43,35.83,...,55.10,56.19,57.23,58.19,59.04,59.77,60.37,60.86,61.24,61.55
3,Albania,ALB,62.28,63.30,64.19,64.91,65.46,65.85,66.11,66.30,...,75.66,75.94,76.28,76.65,77.03,77.39,77.70,77.96,78.17,78.34
4,Andorra,AND,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,Kosovo,XKX,,,,,,,,,...,69.20,69.40,69.65,69.90,70.15,70.50,70.80,71.10,71.35,71.65
260,"Yemen, Rep.",YEM,34.36,34.47,34.74,35.19,35.81,36.60,37.49,38.43,...,62.55,62.89,63.21,63.51,63.79,64.05,64.29,64.52,64.74,64.95
261,South Africa,ZAF,52.22,52.56,52.89,53.23,53.57,53.93,54.30,54.69,...,53.01,53.72,54.70,55.89,57.20,58.55,59.83,60.99,61.98,62.77
262,Zambia,ZMB,45.12,45.50,45.87,46.23,46.57,46.93,47.30,47.70,...,52.31,53.75,55.19,56.59,57.87,59.01,59.98,60.77,61.40,61.87


### Metadata Data Frame

To clean and process the `metadata` data frame:
*  First, remove the unnecessary columns using a list. Here: `['SpecialNotes', 'Unnamed: 5']`
*  Rename column headers wherever necessary. Here: `'ï»¿"Country Code"'` to `'Country Code'`, `'IncomeGroup'` to `'Income Group'`, and `'TableName'` to `'Country Name'`
*  Move the `Country Name` column to the front so the layout matches the other data frames.
*  Standardize the column headers by lowercasing them and replacing spaces with `'_'`

In [38]:
columns_to_remove = ['SpecialNotes', 'Unnamed: 5']          # variable to store list of unnecessary columns
metadata = metadata.drop(columns=columns_to_remove)         # dropping the unnecessary columns
metadata.rename(columns={'ï»¿"Country Code"': 'Country Code'}, inplace=True)    # renaming the column header
metadata.rename(columns={'IncomeGroup': 'Income Group'}, inplace=True)          # renaming the column header
metadata.rename(columns={'TableName': 'Country Name'}, inplace=True)            # renaming the column header
column_to_shift = metadata.pop(metadata.columns[3])           # shifting column 3 to 0
metadata.insert(0, column_to_shift.name, column_to_shift)
metadata.columns=metadata.columns.str.lower().str.replace(' ','_')              # standardizing the column header
metadata

Unnamed: 0,country_name,country_code,region,income_group
0,Aruba,ABW,Latin America & Caribbean,High income
1,Afghanistan,AFG,South Asia,Low income
2,Angola,AGO,Sub-Saharan Africa,Lower middle income
3,Albania,ALB,Europe & Central Asia,Upper middle income
4,Andorra,AND,Europe & Central Asia,High income
...,...,...,...,...
258,Kosovo,XKX,Europe & Central Asia,Lower middle income
259,"Yemen, Rep.",YEM,Middle East & North Africa,Lower middle income
260,South Africa,ZAF,Sub-Saharan Africa,Upper middle income
261,Zambia,ZMB,Sub-Saharan Africa,Lower middle income
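The column shift above relies on the position of the column (`metadata.columns[3]`). As an alternative sketch, the same layout can be obtained by listing the column names in the desired order, which does not depend on positions. The one-row `meta` frame below is made up for illustration and uses the already standardized headers.

```python
import pandas as pd

# made-up one-row frame with the columns in their pre-shift order
meta = pd.DataFrame({
    "country_code": ["ABW"],
    "region": ["Latin America & Caribbean"],
    "income_group": ["High income"],
    "country_name": ["Aruba"],
})

# select the columns by name in the order we want them displayed
ordered = meta[["country_name", "country_code", "region", "income_group"]]
print(ordered.columns.tolist())
```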


## **Melting & Merging Data Frame**
Reference:
1. [Hans Rosling - Recreation](https://www.youtube.com/watch?v=cmBRctdMykc)
2. [Melting Data](https://www.youtube.com/watch?v=49PKysycCGc)

<br> I'd like to express my gratitude for the valuable references that have been instrumental in enhancing my understanding of specific data handling techniques, in addition to the resources provided by `Prepinsta`.

### Population Dataframe

**Merging:**<br>
To merge the `metadata` dataframe into the `population` dataframe:
1. First, list the columns to bring over from `metadata` in `columns_to_merge`.
2. Then join the two dataframes on `country_name`, which is unique. `merge` performs an inner join by default, so rows without a match in both dataframes are dropped.
3. The `region` column ends up last after the join, so move it to its intended position next to `country_code`.


In [39]:
columns_to_merge = ['country_name','region']    # listing the column headers required after merging

population_metadata = population.merge(metadata[columns_to_merge], on='country_name')    # merge at country_name
column_to_shift = population_metadata.pop(population_metadata.columns[-1])               # shifting columns
population_metadata.insert(2, column_to_shift.name, column_to_shift)

population_metadata.head()      # print the first few rows of the merged dataset

Unnamed: 0,country_name,country_code,region,1960,1961,1962,1963,1964,1965,1966,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,Latin America & Caribbean,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,...,101220.0,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0
1,Afghanistan,AFG,South Asia,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,...,26616792.0,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0
2,Angola,AGO,Sub-Saharan Africa,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,...,20997687.0,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0
3,Albania,ALB,Europe & Central Asia,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,...,2970017.0,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0
4,Andorra,AND,Europe & Central Asia,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,...,82683.0,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0
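Since an inner join silently drops rows that have no match, it can help to check which names fail to match before relying on the merged result. The sketch below is an optional check, not part of the original workflow; it uses two made-up frames, and "Example Aggregate" is a hypothetical row name. The `indicator` and `validate` arguments are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# made-up frames; "Example Aggregate" is a hypothetical unmatched row
left = pd.DataFrame({
    "country_name": ["Aruba", "Angola", "Example Aggregate"],
    "1960": [54211.0, 5643182.0, 1000.0],
})
right = pd.DataFrame({
    "country_name": ["Aruba", "Angola"],
    "region": ["Latin America & Caribbean", "Sub-Saharan Africa"],
})

# default behaviour: inner join keeps only names present in both frames
inner = left.merge(right, on="country_name")

# left join with an indicator column shows which rows found no partner;
# validate raises an error if country_name is duplicated in `right`
checked = left.merge(
    right, on="country_name", how="left", indicator=True, validate="many_to_one"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "country_name"].tolist()

print(len(inner), unmatched)
```

Joining on `country_code` instead of `country_name` may be more robust to spelling differences, assuming the codes are consistent across the files.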


**Melting:**
<br>The following code restructures the `population` dataframe. <br>
*  Melting transforms the year columns of the wide dataset into rows using `.melt()`, giving one row per country and year.
*  The melted dataset is then sorted by `country_name` and `year`.
<br>

The goal is to put the population data in long format, ordered for analysis and presentation.

In [40]:
melted_data_population = population_metadata.melt(id_vars=['country_name', 'country_code','region'], var_name='year', value_name='population')
sorted_data_population = melted_data_population.sort_values(by=['country_name','year']) # melting, restructuring and sorting data of population
sorted_data_population

Unnamed: 0,country_name,country_code,region,year,population
1,Afghanistan,AFG,South Asia,1960,8996351.00
255,Afghanistan,AFG,South Asia,1961,9166764.00
509,Afghanistan,AFG,South Asia,1962,9345868.00
763,Afghanistan,AFG,South Asia,1963,9533954.00
1017,Afghanistan,AFG,South Asia,1964,9731361.00
...,...,...,...,...,...
13461,Zimbabwe,ZWE,Sub-Saharan Africa,2012,14710826.00
13715,Zimbabwe,ZWE,Sub-Saharan Africa,2013,15054506.00
13969,Zimbabwe,ZWE,Sub-Saharan Africa,2014,15411675.00
14223,Zimbabwe,ZWE,Sub-Saharan Africa,2015,15777451.00
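To make the wide-to-long reshaping concrete, here is a minimal sketch of `.melt()` on a two-country, two-year excerpt (the numbers are copied from the tables above). The final `reset_index` is an addition for readability and is not done in the cell above.

```python
import pandas as pd

# small wide-format excerpt: one column per year
wide = pd.DataFrame({
    "country_name": ["Aruba", "Afghanistan"],
    "country_code": ["ABW", "AFG"],
    "1960": [54211.0, 8996351.0],
    "1961": [55438.0, 9166764.0],
})

# melt turns the year columns into (year, population) rows
long = (
    wide.melt(
        id_vars=["country_name", "country_code"],
        var_name="year",
        value_name="population",
    )
    .sort_values(by=["country_name", "year"])
    .reset_index(drop=True)
)

print(long)
```

Each country now contributes one row per year, which is the layout that the `animation_frame` argument of Plotly Express expects.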


### Fertility Dataframe

**Melting:**
<br>The following code restructures the `fertility` dataframe. <br>
*  Melting transforms the year columns of the wide dataset into rows using `.melt()`, giving one row per country and year.
*  The melted dataset is then sorted by `country_name` and `year`.
<br>

The goal is to put the fertility data in long format, ordered for analysis and presentation.

In [41]:
melted_data_fertility = fertility.melt(id_vars=['country_name','country_code'], var_name='year', value_name='fertility')
sorted_data_fertility = melted_data_fertility.sort_values(by=['country_name','year'])       # melting, restructuring and sorting data of fertility
sorted_data_fertility

Unnamed: 0,country_name,country_code,year,fertility
1,Afghanistan,AFG,1960,7.45
264,Afghanistan,AFG,1961,7.45
527,Afghanistan,AFG,1962,7.45
790,Afghanistan,AFG,1963,7.45
1053,Afghanistan,AFG,1964,7.45
...,...,...,...,...
13938,Zimbabwe,ZWE,2012,4.00
14201,Zimbabwe,ZWE,2013,3.96
14464,Zimbabwe,ZWE,2014,3.90
14727,Zimbabwe,ZWE,2015,3.84


**Merging:**<br>
To merge the `sorted_data_fertility` dataframe into the `sorted_data_population` dataframe:
1. First, list the columns to bring over in `columns_to_merge`.
2. Then join the two dataframes on `country_name` and `year` (an inner join, the default for `merge`).
3. After joining, we can view the merged `population_fertility` dataset.


In [42]:
columns_to_merge = ['fertility','country_name','year']  # listing the column headers required after merging
# merge at country_name and year
population_fertility = sorted_data_population.merge(sorted_data_fertility[columns_to_merge], on=['country_name','year'])
population_fertility

Unnamed: 0,country_name,country_code,region,year,population,fertility
0,Afghanistan,AFG,South Asia,1960,8996351.00,7.45
1,Afghanistan,AFG,South Asia,1961,9166764.00,7.45
2,Afghanistan,AFG,South Asia,1962,9345868.00,7.45
3,Afghanistan,AFG,South Asia,1963,9533954.00,7.45
4,Afghanistan,AFG,South Asia,1964,9731361.00,7.45
...,...,...,...,...,...,...
14473,Zimbabwe,ZWE,Sub-Saharan Africa,2012,14710826.00,4.00
14474,Zimbabwe,ZWE,Sub-Saharan Africa,2013,15054506.00,3.96
14475,Zimbabwe,ZWE,Sub-Saharan Africa,2014,15411675.00,3.90
14476,Zimbabwe,ZWE,Sub-Saharan Africa,2015,15777451.00,3.84


### Expectancy Dataframe

**Melting:**
<br>The following code restructures the `expectancy` dataframe. <br>
*  Melting transforms the year columns of the wide dataset into rows using `.melt()`, giving one row per country and year.
*  The melted dataset is then sorted by `country_name` and `year`.
<br>

The goal is to put the life expectancy data in long format, ordered for analysis and presentation.

In [43]:
melted_data_expectancy = expectancy.melt(id_vars=['country_name','country_code'], var_name='year', value_name='expectancy')
sorted_data_expectancy = melted_data_expectancy.sort_values(by=['country_name','year'])       # melting, restructuring and sorting data of life expectancy
sorted_data_expectancy

Unnamed: 0,country_name,country_code,year,expectancy
1,Afghanistan,AFG,1960,32.29
264,Afghanistan,AFG,1961,32.74
527,Afghanistan,AFG,1962,33.19
790,Afghanistan,AFG,1963,33.62
1053,Afghanistan,AFG,1964,34.06
...,...,...,...,...
13938,Zimbabwe,ZWE,2012,56.52
14201,Zimbabwe,ZWE,2013,58.05
14464,Zimbabwe,ZWE,2014,59.36
14727,Zimbabwe,ZWE,2015,60.40


**Merging:**<br>
To merge the `sorted_data_expectancy` dataframe into the `population_fertility` dataframe:
1. First, list the columns to bring over in `columns_to_merge`.
2. Then join the two dataframes on `country_name` and `year` (an inner join, the default for `merge`).
3. After joining, we can view the merged `final` dataset.


In [44]:
columns_to_merge = ['expectancy','country_name','year']         # listing the column headers required after merging
# merge at country_name and year
final = population_fertility.merge(sorted_data_expectancy[columns_to_merge], on=['country_name','year'])
final

Unnamed: 0,country_name,country_code,region,year,population,fertility,expectancy
0,Afghanistan,AFG,South Asia,1960,8996351.00,7.45,32.29
1,Afghanistan,AFG,South Asia,1961,9166764.00,7.45,32.74
2,Afghanistan,AFG,South Asia,1962,9345868.00,7.45,33.19
3,Afghanistan,AFG,South Asia,1963,9533954.00,7.45,33.62
4,Afghanistan,AFG,South Asia,1964,9731361.00,7.45,34.06
...,...,...,...,...,...,...,...
14473,Zimbabwe,ZWE,Sub-Saharan Africa,2012,14710826.00,4.00,56.52
14474,Zimbabwe,ZWE,Sub-Saharan Africa,2013,15054506.00,3.96,58.05
14475,Zimbabwe,ZWE,Sub-Saharan Africa,2014,15411675.00,3.90,59.36
14476,Zimbabwe,ZWE,Sub-Saharan Africa,2015,15777451.00,3.84,60.40
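The two merges above follow the same pattern, so they could also be chained in one step. The following is a sketch under the assumption that all long tables share the same key columns; it uses `functools.reduce` on small excerpts (values copied from the tables above) and joins on `country_code` and `year`, which is a variation on the `country_name` key used in this notebook.

```python
from functools import reduce

import pandas as pd

keys = ["country_code", "year"]

# small long-format excerpts, one per indicator
pop = pd.DataFrame({"country_code": ["AFG", "AFG"], "year": ["1960", "1961"],
                    "population": [8996351.0, 9166764.0]})
fert = pd.DataFrame({"country_code": ["AFG", "AFG"], "year": ["1960", "1961"],
                     "fertility": [7.45, 7.45]})
life = pd.DataFrame({"country_code": ["AFG", "AFG"], "year": ["1960", "1961"],
                     "expectancy": [32.29, 32.74]})

# fold the list of tables into one by merging them pairwise on the keys
combined = reduce(
    lambda left, right: left.merge(right, on=keys, how="inner"),
    [pop, fert, life],
)

print(combined)
```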


## **Final DataFrame Process**

After building the `final` dataframe, a few steps remain:
*  Drop the rows with a missing `'region'` value.
*  Round the floating-point numbers to 2 decimals in the `'fertility'` and `'expectancy'` columns for display.
*  Convert the `'year'` column to the integer data type.
*  Drop the rows with a missing `'population'` value.

In [45]:
final.dropna(subset=['region'], inplace=True)                   # drop rows with a missing 'region'
final['fertility'] = final['fertility'].round(decimals=2)       # round the floating values to 2 decimals
final['expectancy'] = final['expectancy'].round(decimals=2)
final["year"] = final["year"].astype(int)                       # convert the year to integer data type
final.dropna(subset=['population'], inplace=True)               # drop missing 'population' values
final

Unnamed: 0,country_name,country_code,region,year,population,fertility,expectancy
0,Afghanistan,AFG,South Asia,1960,8996351.00,7.45,32.29
1,Afghanistan,AFG,South Asia,1961,9166764.00,7.45,32.74
2,Afghanistan,AFG,South Asia,1962,9345868.00,7.45,33.18
3,Afghanistan,AFG,South Asia,1963,9533954.00,7.45,33.62
4,Afghanistan,AFG,South Asia,1964,9731361.00,7.45,34.06
...,...,...,...,...,...,...,...
14473,Zimbabwe,ZWE,Sub-Saharan Africa,2012,14710826.00,4.00,56.52
14474,Zimbabwe,ZWE,Sub-Saharan Africa,2013,15054506.00,3.96,58.05
14475,Zimbabwe,ZWE,Sub-Saharan Africa,2014,15411675.00,3.90,59.36
14476,Zimbabwe,ZWE,Sub-Saharan Africa,2015,15777451.00,3.84,60.40
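The same finishing steps can also be written as one chained expression that returns a new dataframe instead of modifying `final` in place. This is only a stylistic sketch on a made-up three-row frame ("Hypothetical Aggregate" is an invented row without a region).

```python
import pandas as pd

# made-up rows: one complete, one without region, one without population
demo = pd.DataFrame({
    "country_name": ["Afghanistan", "Hypothetical Aggregate", "Afghanistan"],
    "region": ["South Asia", None, "South Asia"],
    "year": ["1960", "1960", "1961"],
    "population": [8996351.0, 100.0, None],
    "fertility": [7.4512, 5.0, 7.449],
})

cleaned = (
    demo.dropna(subset=["region", "population"])      # drop incomplete rows
    .assign(
        year=lambda d: d["year"].astype(int),         # year as integer
        fertility=lambda d: d["fertility"].round(2),  # two decimals
    )
    .reset_index(drop=True)
)

print(cleaned)
```

Only the complete Afghanistan row for 1960 remains, with `year` stored as an integer.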


## **Visual Representation**

### 1.1. Scatter Plot comparing Fertility Rate and Life Expectancy [Countrywise]:
Let's visualize the famous graph presented by [Hans Rosling](https://en.wikipedia.org/wiki/Hans_Rosling).<br>
The Fertility Rate vs. Life Expectancy graph shows how the world appears divided into two groups of countries and how that picture changes over time.

In [46]:
# create the scatter chart using Plotly Express
fig_fertility_expectation_region = px.scatter(data_frame=final,
                 x='fertility',                    # x-axis consists of fertility rate
                 y='expectancy',                   # y-axis consists of life expectancy
                 size='population',                # bubble size based on population
                 size_max=50,                      # set the maximum size for bubbles
                 hover_name='country_name',        # will view country names on hovering
                 color='region',                   # will color accordingly to region
                 animation_frame='year',           # will animate in time frame
                 animation_group='country_name',   # keep each country as the same bubble across frames
                 template='plotly_dark',           # black as template
                 labels={'region': 'Region'},      # standardize the label
                 range_x=[0, 10],                  # range of x axis
                 range_y=[10, 90])                 # range of y axis

fig_fertility_expectation_region.update_layout(title='Fertility Rate vs. Life Expectancy',        # title of graph
                  xaxis_title='Fertility Rate - Total [Births per Woman]',  # x label
                  yaxis_title='Life Expectancy at Birth - Total')           # y label

fig_fertility_expectation_region.show()     # show the plot

### 1.2. Scatter Plot comparing Fertility Rate and Life Expectancy [Regionwise]:


In [47]:
# aggregating data at the region level
region_data = final.groupby(['region','year']).agg({
    'fertility': 'mean',      # calculating the mean fertility rate for each region
    'expectancy': 'mean',     # calculating the mean life expectancy for each region
    'population': 'sum'       # summing up the population for each region
}).reset_index()

# create the bubble chart for region-wise fertility, life expectancy, and population
fig_region_bubble = px.scatter(data_frame=region_data,
    x='fertility',                  # x-axis consists of fertility rate
    y='expectancy',                 # y-axis consists of life expectancy
    size='population',              # bubble size based on population
    size_max=50,                    # set the maximum size for bubbles
    hover_name='region',            # will view region on hovering
    color='region',                 # will color accordingly to region
    template='plotly_dark',         # black as template
    animation_frame='year',         # will animate in time frame
    animation_group='region',       # keep each region as the same bubble across frames
    labels={'region': 'Region'},    # standardize the label
    range_x=[0, 10],                # range of x axis
    range_y=[10, 90]                # range of y axis
)

# Update layout and titles
fig_region_bubble.update_layout(title='Region-wise Fertility Rate vs. Life Expectancy',   # title of graph
    xaxis_title='Fertility Rate - Average [Births per Woman]',        # x label
    yaxis_title='Life Expectancy at Birth - Average')                 # y label

fig_region_bubble.show()      # show the plot
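The regional values above are simple averages, so a small country counts as much as a large one. If a population-weighted regional rate is preferred, it could be computed as sketched below. This is an alternative under that assumption, not the method used in this notebook, and the `demo` values and region names are made up.

```python
import pandas as pd

# made-up data: two countries in "Region A", one in "Region B"
demo = pd.DataFrame({
    "region": ["Region A", "Region A", "Region B"],
    "year": [2000, 2000, 2000],
    "fertility": [2.0, 6.0, 3.0],
    "population": [300.0, 100.0, 50.0],
})

# rows with a missing value would distort the weights, so drop them first
valid = demo.dropna(subset=["fertility", "population"])

# weighted mean = sum(fertility * population) / sum(population)
weighted = (
    valid.assign(fertility_x_population=valid["fertility"] * valid["population"])
    .groupby(["region", "year"], as_index=False)[["fertility_x_population", "population"]]
    .sum()
)
weighted["fertility_weighted"] = (
    weighted["fertility_x_population"] / weighted["population"]
)

# simple (unweighted) mean for comparison
unweighted = valid.groupby(["region", "year"], as_index=False)["fertility"].mean()

print(weighted[["region", "year", "fertility_weighted"]])
print(unweighted)
```

In this made-up example the weighted rate for "Region A" is lower than the simple mean because the larger country has the lower fertility rate.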

### 2. Line Chart showing Population Trends over Years:

In [48]:
# create the line chart using Plotly Express
fig_population = px.line(final,
              x='year',                                 # x-axis consists of year
              y='population',                           # y-axis consists of population
              labels={'country_name': 'Country Name'},  # standardize the label
              color='country_name',                     # will color accordingly to countries
              template='plotly_dark')                   # black as template

fig_population.update_layout(title='Population Trends over Years by Country',               # title of graph
                  xaxis_title='Year',                                       # x label
                  xaxis=dict(type='category'),                              # Set x-axis type to category for years
                  yaxis_title='Population - Total [in Billions]')           # y label

fig_population.show()     # show the plot

### 3. Population Comparison Among Regions

In [49]:
# grouping the data by region to get the sum of population (summed across all years in `final`)
region_population = final.groupby('region')['population'].sum().reset_index()

# create the bar chart using Plotly Express
fig_region_population = px.bar(region_population,
                               x='region',                    # x-axis consists of region
                               y='population',                # y-axis consists of population
                               template='plotly_dark',        # black as template
                               labels={'region': 'Region'},   # standardize the label
                               color='region')                # will color accordingly to region

fig_region_population.update_layout(title='Population Comparison Among Regions',      # title of graph
                  xaxis_title='Region',                                               # x label
                  yaxis_title='Population - Total [in Billions]')                     # y label

fig_region_population.show()    # show the plot
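Because the grouping above adds up every year in `final`, the bar heights are sums over all years rather than the population at one point in time. If a single-year comparison is wanted instead, one option is to filter to the latest year before grouping. The sketch below shows the difference on made-up numbers and region names.

```python
import pandas as pd

# made-up data: two regions, two years
demo = pd.DataFrame({
    "region": ["Region A", "Region A", "Region B", "Region B"],
    "year": [2015, 2016, 2015, 2016],
    "population": [100.0, 110.0, 50.0, 55.0],
})

# sum over all years (what a plain groupby on region does)
all_years = demo.groupby("region", as_index=False)["population"].sum()

# sum for the most recent year only
latest_year = demo["year"].max()
latest = (
    demo[demo["year"] == latest_year]
    .groupby("region", as_index=False)["population"]
    .sum()
)

print(all_years)
print(latest)
```

The `latest` frame could then be passed to `px.bar` in place of `region_population`.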

### 4.1. Life Expectancy Distribution

In [50]:
# create the histogram using Plotly Express
fig_life_expectancy = px.histogram(final,
                                   x='expectancy',                          # x-axis consists of life expectancy
                                   template='plotly_dark')                  # black as template

fig_life_expectancy.update_layout(bargap=0.2,                               # gap between adjacent bars
                                  title='Life Expectancy Distribution',     # title of graph
                                  xaxis_title='Life Expectancy [in Years]', # x label
                                  yaxis_title='Count of Country-Year Rows') # y label

fig_life_expectancy.show()    # show the plot

### 4.2. Fertility Rate Distribution

In [51]:
# create the histogram using Plotly Express
fig_fertility_rate = px.histogram(final,
                                   x='fertility',                          # x-axis consists of fertility rate
                                   color_discrete_sequence=['cyan'],       # define color to histogram discretely
                                   template='plotly_dark')                 # black as template

fig_fertility_rate.update_layout(bargap=0.2,                               # gap between adjacent bars
                                  title='Fertility Rate Distribution',     # title of graph
                                  xaxis_title='Fertility Rate [Births per Woman]', # x label
                                  yaxis_title='Count of Country-Year Rows') # y label

fig_fertility_rate.show()    # show the plot

### 5. Population of Each Region Over Years

In [52]:
# grouping the data by region and year to get the sum of population for each region per year
region_population = final.groupby(['region', 'year'])['population'].sum().reset_index()

# creating the area chart using Plotly Express
fig_region_population_line = px.area(region_population,
              x='year',                                       # x-axis consists of year
              y='population',                                 # y-axis consists of population
              color='region',                                 # will color accordingly to region
              line_group='region',
              labels={'region': 'Region'},                    # standardize the label
              template='plotly_dark')                         # black as template

fig_region_population_line.update_layout(title='Population of Each Region Over Years',      # title of graph
                  xaxis=dict(type='category'),                                              # Set x-axis type to category for years
                  xaxis_title='Year',                                                       # x label
                  yaxis_title='Population - Total [in Billions]')                           # y label

fig_region_population_line.show()   # show the plot

### 6. Life Expectancy of Each Region Over Years

In [53]:
# grouping the data by region and year to get the average life expectancy for each region per year
region_life_expectancy = final.groupby(['region', 'year'])['expectancy'].mean().reset_index()

# creating the area chart using Plotly Express (px.area stacks the regions on top of each other)
fig_region_life_expectancy = px.area(region_life_expectancy,
                                     x='year',                          # x-axis consists of year
                                     y='expectancy',                    # y-axis consists of life expectancy
                                     color='region',                    # color each area by region
                                     labels={'region': 'Region'},       # standardize the label
                                     template='plotly_dark')            # use the dark theme

fig_region_life_expectancy.update_layout(title='Life Expectancy of Each Region Over Years',      # title of graph
                  xaxis=dict(type='category'),                                                   # Set x-axis type to category for years
                  xaxis_title='Year',                                                            # x label
                  yaxis_title='Life Expectancy [in Years]')                                      # y label

fig_region_life_expectancy.show()   # show the plot
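
The regional value computed above is an unweighted mean, so every country counts equally regardless of its size. As a rough alternative, a population-weighted average can be derived from the same columns. The sketch below runs on a tiny made-up frame (the numbers are illustrative, not World Bank data), and `weighted_region_mean` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def weighted_region_mean(df, value_col, weight_col="population"):
    """Population-weighted mean of value_col per (region, year)."""
    # keep only rows where both the value and the weight are present
    valid = df.dropna(subset=[value_col, weight_col]).copy()
    valid["_weighted"] = valid[value_col] * valid[weight_col]
    # sum of value * weight and sum of weights within each group
    grouped = valid.groupby(["region", "year"])[["_weighted", weight_col]].sum()
    result = (grouped["_weighted"] / grouped[weight_col]).rename(value_col)
    return result.reset_index()

# illustrative values only
toy = pd.DataFrame({
    "region": ["A", "A", "B"],
    "year": [2000, 2000, 2000],
    "population": [100, 300, 50],
    "expectancy": [60.0, 80.0, 70.0],
})
weighted = weighted_region_mean(toy, "expectancy")
```

In the toy data, region A's plain mean is 70, while the weighted mean moves toward the larger country's value. With the notebook's data the call would presumably be `weighted_region_mean(final, 'expectancy')`.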

###7. Fertility Rate of Each Region Over Years

In [54]:
# grouping the data by region and year to get the average fertility rate for each region per year
region_fertility = final.groupby(['region', 'year'])['fertility'].mean().reset_index()

# creating the area chart using Plotly Express (px.area stacks the regions on top of each other)
fig_region_fertility = px.area(region_fertility,
                               x='year',                            # x-axis consists of year
                               y='fertility',                       # y-axis consists of fertility rate
                               color='region',                      # color each area by region
                               labels={'region': 'Region'},         # standardize the label
                               template='plotly_dark')              # use the dark theme

fig_region_fertility.update_layout(title='Fertility Rate of Each Region Over Years',      # title of graph
                  xaxis=dict(type='category'),                                      # Set x-axis type to category for years
                  xaxis_title='Year',                                               # x label
                  yaxis_title='Fertility Rate [Births per Woman]')                  # y label

fig_region_fertility.show()     # show the plot

###8. Geographical Population by Country

In [55]:
# creating the geographical bubble map using Plotly Express
fig_population_country = px.scatter_geo(final,
    locations="country_code",       # using country codes for locations
    color="country_name",           # color points by country name
    size="population",              # bubble size based on population
    hover_name="country_name",      # country name displayed on hover
    animation_frame="year",         # one frame per year, so yearly bubbles do not overlap
    projection="natural earth",     # choose a map projection
    labels={'country_name': 'Country Name'},  # standardize the label
    title="Population by Country")            # title of the plot


fig_population_country.show()     # show the plot

###9. Correlation Analysis
The heatmap below is adapted from [7 Ways To Make a Correlation Matrix In Python](https://pub.towardsai.net/7-ways-to-make-a-correlation-matrix-in-python-d45392aaa83d).

In [56]:
import plotly.figure_factory as ff          # figure_factory provides the annotated heatmap
# select only the numeric columns from the DataFrame
numeric_columns = final.select_dtypes(include='number')

# compute the correlation matrix for the numeric columns
correlation_matrix = numeric_columns.corr()

# create the annotated heatmap from the correlation matrix
fig_heatmap = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    colorscale='Blues',
    annotation_text=correlation_matrix.round(2).values
)
fig_heatmap.update_layout(title='Correlation Analysis Heatmap')
fig_heatmap.show()
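
If `year` has an integer dtype, `select_dtypes` keeps it, so it appears in the matrix alongside the indicators. When only the relationships between the indicators themselves are of interest, the columns can be picked by name before calling `corr()`. A minimal sketch on made-up values, assuming the column names used in `final`:

```python
import pandas as pd

# illustrative values only
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "population": [10.0, 11.0, 12.0, 13.0],
    "fertility": [4.0, 3.5, 3.0, 2.5],
    "expectancy": [60.0, 62.0, 64.0, 66.0],
})

# name the indicator columns explicitly so 'year' is left out
indicator_columns = ["population", "fertility", "expectancy"]
indicator_corr = toy[indicator_columns].corr()
```

The resulting 3x3 frame can be passed to `ff.create_annotated_heatmap` in the same way as `correlation_matrix`.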


#**Coding Playground**


The loading, cleaning, and plotting code combined into a single cell.

In [57]:
import numpy as np                                  # import numpy for working with arrays
import pandas as pd                                 # import for working with data sets
import plotly.express as px                         # import for interactive visualization of graphs
pd.set_option('display.float_format', lambda x: '%.2f' % x)       # display floats with two decimal places


population=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/country_population.csv')        # reads population dataset

fertility=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/fertility_rate.csv')          # reads fertility rate dataset

metadata=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/Metadata_Country.csv')            # reads country metadata dataset

expectancy=pd.read_csv('https://raw.githubusercontent.com/vignay21/Prepinsta/main/PrepInsta-Week4/Datasets/life_expectancy.csv')        # reads life expectancy dataset

columns_to_remove = ['Indicator Name', 'Indicator Code']                  # variable to store list of unnecessary columns
population = population.drop(columns=columns_to_remove)                   # dropping the unnecessary columns
population.rename(columns={'ï»¿"Country Name"': 'Country Name'}, inplace=True)    # renaming the column header
population = population[population['Country Name'] != 'Not classified']           # filtering out the specific row
population.columns=population.columns.str.lower().str.replace(' ','_')            # standardizing the column header
population.iloc[:, 2:] = population.iloc[:, 2:].apply(lambda row: row.fillna(row.median()), axis=1)   # replacing the NaN values with median of row

columns_to_remove = ['Indicator Name', 'Indicator Code']        # variable to store list of unnecessary columns
fertility = fertility.drop(columns=columns_to_remove)           # dropping the unnecessary columns
fertility.rename(columns={'ï»¿"Country Name"': 'Country Name'}, inplace=True)   # renaming the column header
fertility = fertility[fertility['Country Name'] != 'Not classified']            # filtering out the specific row
fertility.columns=fertility.columns.str.lower().str.replace(' ','_')            # standardizing the column header

columns_to_remove = ['Indicator Name', 'Indicator Code']           # variable to store list of unnecessary columns
expectancy = expectancy.drop(columns=columns_to_remove)            # dropping the unnecessary columns
expectancy.rename(columns={'ï»¿"Country Name"': 'Country Name'}, inplace=True)  # renaming the column header
expectancy = expectancy[expectancy['Country Name'] != 'Not classified']         # filtering out the specific row
expectancy.columns=expectancy.columns.str.lower().str.replace(' ','_')          # standardizing the column header

columns_to_remove = ['SpecialNotes', 'Unnamed: 5']          # variable to store list of unnecessary columns
metadata = metadata.drop(columns=columns_to_remove)         # dropping the unnecessary columns
metadata.rename(columns={'ï»¿"Country Code"': 'Country Code'}, inplace=True)    # renaming the column header
metadata.rename(columns={'IncomeGroup': 'Income Group'}, inplace=True)          # renaming the column header
metadata.rename(columns={'TableName': 'Country Name'}, inplace=True)            # renaming the column header
column_to_shift = metadata.pop(metadata.columns[3])           # shifting column 3 to 0
metadata.insert(0, column_to_shift.name, column_to_shift)
metadata.columns=metadata.columns.str.lower().str.replace(' ','_')              # standardizing the column header

columns_to_merge = ['country_name','region']    # listing the column headers required after merging

population_metadata = population.merge(metadata[columns_to_merge], on='country_name')    # merge at country_name
column_to_shift = population_metadata.pop(population_metadata.columns[-1])               # shifting columns
population_metadata.insert(2, column_to_shift.name, column_to_shift)
melted_data_population = population_metadata.melt(id_vars=['country_name', 'country_code','region'], var_name='year', value_name='population')
sorted_data_population = melted_data_population.sort_values(by=['country_name','year']) # melting, restructuring and sorting data of population

melted_data_fertility = fertility.melt(id_vars=['country_name','country_code'], var_name='year', value_name='fertility')
sorted_data_fertility = melted_data_fertility.sort_values(by=['country_name','year'])       # melting, restructuring and sorting data of fertility
columns_to_merge = ['fertility','country_name','year']  # listing the column headers required after merging
# merge at country_name and year
population_fertility = sorted_data_population.merge(sorted_data_fertility[columns_to_merge], on=['country_name','year'])

melted_data_expectancy = expectancy.melt(id_vars=['country_name','country_code'], var_name='year', value_name='expectancy')
sorted_data_expectancy = melted_data_expectancy.sort_values(by=['country_name','year'])       # melting, restructuring and sorting data of life expectancy
columns_to_merge = ['expectancy','country_name','year']         # listing the column headers required after merging
# merge at country_name and year
final = population_fertility.merge(sorted_data_expectancy[columns_to_merge], on=['country_name','year'])

final.dropna(subset=['region'], inplace=True)               # drop rows that have no region
final['fertility'] = final['fertility'].round(decimals=2)       # round the floating values to 2 decimals
final['expectancy'] = final['expectancy'].round(decimals=2)
final["year"] = final["year"].astype(int)                       # convert the year to integer data type
final.dropna(subset=['population'], inplace=True)               # drop rows with missing population, which is used for bubble size


fig_fertility_expectation_region = px.scatter(data_frame=final,
                 x='fertility',                    # x-axis consists of fertility
                 y='expectancy',                   # y-axis consists of life expectancy
                 size='population',                # bubble size based on population
                 size_max=50,                      # set the maximum size for bubbles
                 hover_name='country_name',        # will view country names on hovering
                 color='region',                   # color each bubble by region
                 animation_frame='year',           # will animate in time frame
                 animation_group='country_name',   # animation format
                 template='plotly_dark',           # black as template
                 range_x=[0, 10],                   # range of x axis
                 range_y=[10, 90])                 # range of y axis

fig_fertility_expectation_region.update_layout(title='Fertility Rate vs. Life Expectancy',                    # title of graph
                  xaxis_title='Fertility Rate - Total [Births per Woman]',  # x label
                  yaxis_title='Life Expectancy at Birth - Total')           # y label

fig_fertility_expectation_region.show()
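
The three indicator files go through the same steps in the cell above: drop the indicator columns, filter out `Not classified`, normalize the headers, and melt the year columns into rows. Those steps could be gathered into one function, sketched below. `load_indicator` is a hypothetical name, and the in-memory sample is made up to stand in for a downloaded file. Reading with `encoding='utf-8-sig'` is an assumption about the files: the `ï»¿` prefix renamed in the cell above is what a UTF-8 byte-order mark looks like when decoded with the wrong codec, and this encoding strips it at read time.

```python
import io
import pandas as pd

def load_indicator(source, value_name):
    """Read one wide, World Bank-style CSV and return it in long form."""
    # 'utf-8-sig' strips a leading byte-order mark, if the file has one
    df = pd.read_csv(source, encoding="utf-8-sig")
    # errors='ignore' skips any of these columns that are absent
    df = df.drop(columns=["Indicator Name", "Indicator Code"], errors="ignore")
    df = df[df["Country Name"] != "Not classified"]            # filter out the specific row
    df.columns = df.columns.str.lower().str.replace(" ", "_")  # standardize the headers
    # wide to long: one row per country and year
    long_df = df.melt(id_vars=["country_name", "country_code"],
                      var_name="year", value_name=value_name)
    long_df["year"] = long_df["year"].astype(int)              # year labels become integers
    return long_df.sort_values(["country_name", "year"]).reset_index(drop=True)

# tiny in-memory stand-in for a downloaded file (made-up numbers)
sample = ('\ufeff"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961"\n'
          '"Aland","ALD","Fertility","X",5.1,5.0\n'
          '"Not classified","INX","Fertility","X",,\n').encode("utf-8")
fertility_long = load_indicator(io.BytesIO(sample), "fertility")
```

With a helper like this, each dataset would need a single call such as `load_indicator(url, 'fertility')` before the merges, which keeps the cleaning rules in one place.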