# Python for Data Science Project Session 2: Social Sciences

The Sustainable Development Goals (SDGs) are a UN initiative consisting of 17 goals.  Their aim is to continue striving towards development in a more environmentally conscious manner.  

In this notebook, we will investigate data on the aim of increasing access to electricity.  We will create three main graphs: the first looks at data on a global scale, the second focuses on Africa, and the third examines whether progress has been made in the 10 countries with the lowest access.  

We will be using World Bank data spanning from 2000-2019 relating to Goal 7 (Affordable and Clean Energy).

First, import the necessary packages:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import pycountry 
import seaborn as sns

Then, run the code below to import the dataset `Goal7data.csv` (find the file on the session 2 page) and extract the relevant variables.  This will create a data frame called `sdg` that contains our 6 variables of interest:
* `GeoAreaCode`
* `GeoAreaName`
* `TimePeriod`
* `Value`
* `Location`
* `Nature`



In [None]:
raw_sdg = pd.read_csv("Goal7data.csv")
sdg = raw_sdg[["GeoAreaCode", "GeoAreaName", "TimePeriod", "Value", "Location", "Nature"]].copy()

Have a look at the data frame `sdg`, calling up random samples. Notice how the data is split up into `ALLAREA`, `RURAL` and `URBAN` data points, along with values for countries and wider regions.  
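One possible way to call up random samples is `DataFrame.sample()`, combined with `value_counts()` to see how the rows split across `Location`. The sketch below runs on a small made-up frame called `demo` (a hypothetical stand-in, with invented values) so that it works without the CSV file; the same calls should apply to `sdg`.

```python
import pandas as pd

# Hypothetical miniature stand-in for `sdg`; all values are invented.
demo = pd.DataFrame({
    "GeoAreaName": ["World", "World", "Kenya", "Kenya", "Kenya", "Western Africa"],
    "TimePeriod": [2000, 2019, 2000, 2000, 2000, 2019],
    "Value": [78.0, 90.0, 15.0, 6.0, 50.0, 52.0],
    "Location": ["ALLAREA", "ALLAREA", "ALLAREA", "RURAL", "URBAN", "ALLAREA"],
})

# Draw three random rows; random_state makes the draw repeatable.
sample_rows = demo.sample(n=3, random_state=0)
print(sample_rows)

# Count how many rows fall under each Location category.
location_counts = demo["Location"].value_counts()
print(location_counts)
```

On the real data, `sdg.sample(10)` would show ten random rows each time the cell is run.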

## A Global Look

In this first section, we will look at the data on a global scale.  Run the code below to create the data frame `df_world` that consists of the `ALLAREA` data from the geographical area called `World`.

In [None]:
df_world = sdg[(sdg["GeoAreaName"] == "World") & (sdg["Location"] == "ALLAREA")]

Then, create a line plot from this data frame, with the x-axis being the `TimePeriod`, and the y-axis being the access to electricity  `Value`.  Specify a colour, and add the parameter `label = "All Country Data"` to the plot specifications in order to help clarify the legend.

Format the figure, referring to the documentation (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html) for help.  Ensure that:
* the x-axis ticks are correctly spaced [hint: use slicing] and rotated so they can be read
* both axes are labelled
* there is a grid (and it is behind the line plot)
* the plot has a title
* the plot has a legend

The named colours are documented at https://matplotlib.org/stable/gallery/color/named_colors.html.
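The formatting steps listed above could look roughly like the sketch below. It is only one way of doing it, and it uses an invented frame `demo_world` (a hypothetical stand-in for `df_world`) so that it runs on its own; the colour, axis labels and title are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `df_world`; the values are invented, not World Bank data.
years = list(range(2000, 2020))
demo_world = pd.DataFrame({
    "TimePeriod": years,
    "Value": [78 + 0.6 * i for i in range(len(years))],
})

fig, ax = plt.subplots()
ax.plot(demo_world["TimePeriod"], demo_world["Value"],
        color="crimson", label="All Country Data")

# Slice the year list with a step of 2 so that every second year gets a tick.
ax.set_xticks(years[::2])
ax.tick_params(axis="x", labelrotation=45)

ax.set_xlabel("Year")
ax.set_ylabel("Access to electricity (%)")
ax.set_title("Global access to electricity (illustrative data)")
ax.grid(True)
ax.set_axisbelow(True)  # draw the grid behind the plotted line
ax.legend()
```

In a notebook the figure is drawn at the end of the cell; in a script, `plt.show()` would be needed as well.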

Now, we will add a scatter of all countries' data.  First, run the code below to filter `sdg` into a new data frame, `df_c`.  It consists of the `ALLAREA` data for countries only, thereby removing the values for regions and the urban/rural breakdown.

In [None]:
#Create list of countries
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
    
#Create data frame without regions
df_c = sdg[(sdg["Location"] == "ALLAREA")]
df_c = df_c[df_c.GeoAreaName.isin(countries_list)]
df_c

Now, add a scatter of all countries' data, using the data frame `df_c`, to the line plot created above.  Change the transparency for clarity using `alpha`, and make the colour something neutral so that the world average stands out on top.  Also include a `label` parameter for clarity in the legend.

The scatter plot contains a lot of data, and so is not very clear to read.  However, giving rural and urban data points different colours would show whether access to electricity differs markedly between the two.

Run and observe the code below.  It is an example which illustrates how to plot separate series (in the example, the separate series are different flower species) using a `for` loop.

In [None]:
iris = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
iris.head()

In [None]:
fig, ax = plt.subplots()

for species in iris.species.unique():
    iris_subset = iris.loc[iris.species == species]
    ax.scatter(iris_subset.sepal_length, iris_subset.sepal_width,
               label=species)
ax.legend()
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
plt.show()

Now, we will use this `for` loop method to plot the rural and urban parts of our data frame as separate series.  Run the cell below to create the data frame `df_c2`, which holds the rural and urban data for all countries.

In [None]:
df_c2 = sdg[(sdg["Location"] != "ALLAREA") & (sdg.GeoAreaName.isin(countries_list))]
df_c2

Then, replace the previous scatter plot by using the `for` loop to create a two-tone scatter chart that distinguishes between rural and urban areas.  Keep the world average in the figure.

What do you notice about the distribution of rural vs urban data points?
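A minimal sketch of the two-tone scatter, following the same `for` loop pattern as the iris example. The frames `demo_c2` and `demo_world` are hypothetical stand-ins with invented values, and the colours in the `colours` dictionary are an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for `df_c2` and `df_world`; all values are invented.
demo_c2 = pd.DataFrame({
    "TimePeriod": [2000, 2000, 2010, 2010, 2019, 2019],
    "Value": [20.0, 85.0, 35.0, 90.0, 55.0, 96.0],
    "Location": ["RURAL", "URBAN", "RURAL", "URBAN", "RURAL", "URBAN"],
})
demo_world = pd.DataFrame({"TimePeriod": [2000, 2010, 2019],
                           "Value": [78.0, 84.0, 90.0]})

colours = {"RURAL": "tab:green", "URBAN": "tab:orange"}  # arbitrary colour choice

fig, ax = plt.subplots()
for location in demo_c2["Location"].unique():
    # Select the rows for one location type and plot them as their own series.
    subset = demo_c2.loc[demo_c2["Location"] == location]
    ax.scatter(subset["TimePeriod"], subset["Value"],
               color=colours[location], alpha=0.4, label=location.title())

# A higher zorder keeps the world line on top of the scatter points.
ax.plot(demo_world["TimePeriod"], demo_world["Value"],
        color="black", label="World", zorder=3)
ax.legend()
```

Looking the colour up in a dictionary keeps the loop body identical for both series, so a further category would only need one more dictionary entry.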

## Deep Dive: A look into Africa

Research generally points to Africa being the continent that is the most energy deprived.  We will therefore focus the next step of our analysis on this continent.

The data includes a breakdown of 6 African regions:
* Northern Africa
* Southern Africa
* Middle Africa
* Eastern Africa
* Western Africa
* Sub-Saharan Africa

The aim is to create two line plots side by side, one with the urban data for these regions, and the other with the rural data.  Line graphs have less flexibility than scatter plots in terms of determining colours and series.  It is therefore easiest to split the rural and urban data into separate data frames.

Read and run the code below, which creates the data frame of urban data for the African regions.  It first keeps every row whose `GeoAreaName` contains the word `Africa` and whose `Location` is `URBAN`.  The `value_counts()` output shows that this also catches some undesirable areas, such as the country "South Africa".  These are then removed to create the final `df_africa_u` data frame.

In [None]:
# African regions, urban data: keep names containing "Africa"
df_africa_rawu = sdg[sdg.GeoAreaName.str.contains("Africa") & (sdg["Location"] == "URBAN")]
print(df_africa_rawu.GeoAreaName.value_counts())  # inspect which areas matched
# Drop the matches that are not African regions
not_regions = ["Northern Africa and Western Asia", "South Africa", "Central African Republic"]
df_africa_u = df_africa_rawu[~df_africa_rawu.GeoAreaName.isin(not_regions)]


Follow the above methodology to create a data frame for the rural regions of Africa:

Look at either the `df_africa_u` or `df_africa_r` data frame.  All the regions are listed together under the `GeoAreaName` column.  This makes it tricky to create a line plot.

To help plot each region as a separate line, we will create something known as a 'pivot table'.  This will reshape the data so that each region becomes its own column. 

Referring to the documentation here (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html), create two new dataframes: `pivot_u` and `pivot_r` with the urban and rural data for the regions.
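As a rough illustration of what `DataFrame.pivot` does, the sketch below reshapes a tiny made-up frame (`demo_africa_u`, a hypothetical stand-in with invented values) so that years form the index and each region becomes a column. It assumes each region/year pair appears only once, which `pivot` requires.

```python
import pandas as pd

# Hypothetical stand-in for `df_africa_u`; the values are invented.
demo_africa_u = pd.DataFrame({
    "GeoAreaName": ["Eastern Africa", "Eastern Africa",
                    "Western Africa", "Western Africa"],
    "TimePeriod": [2000, 2019, 2000, 2019],
    "Value": [40.0, 70.0, 55.0, 80.0],
})

# One row per year, one column per region, cells filled from `Value`.
demo_pivot_u = demo_africa_u.pivot(index="TimePeriod",
                                   columns="GeoAreaName",
                                   values="Value")
print(demo_pivot_u)
```

With the years as the index, calling `.plot()` on the pivoted frame draws one line per column.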



Having reorganised the data, now create two subplots next to each other, one with the urban data for the African regions, and the other with the rural data.  A pivot table automatically plots each series in a different colour, so no colour arguments are needed in `.plot()`; the only input it needs is the subplot to draw on (`ax=`).

Each subplot has to be formatted separately, so a trick for applying the same settings to both is `plt.setp((ax1,ax2), function = [])`.  It does not work in every case (here, it does not seem to work for adding grid lines), but it often helps speed things up. 

Look at the documentation (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html), and find the necessary functions to:
* set the x axis from 2000 to 2019
* set the y axis from 0 to 100
* correct the x axis ticks (slicing won't work here unfortunately - I suggest jumps of 2)
* add grid lines
* add a legend to the figure
* set titles for the individual plots
* set an overall title for the figure
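The checklist above could be approached along the lines of the sketch below. The frames `demo_pivot_u` and `demo_pivot_r` are hypothetical stand-ins with invented values, and the figure size, titles and legend placement are arbitrary choices rather than requirements.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pivoted frames standing in for `pivot_u` and `pivot_r`; values invented.
years = [2000, 2010, 2019]
demo_pivot_u = pd.DataFrame({"Eastern Africa": [40.0, 55.0, 70.0],
                             "Western Africa": [55.0, 68.0, 80.0]}, index=years)
demo_pivot_r = pd.DataFrame({"Eastern Africa": [5.0, 12.0, 25.0],
                             "Western Africa": [10.0, 18.0, 30.0]}, index=years)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# legend=False suppresses the per-subplot legends so one shared legend can be added.
demo_pivot_u.plot(ax=ax1, legend=False)
demo_pivot_r.plot(ax=ax2, legend=False)

# plt.setp applies the same properties to both subplots in a single call.
plt.setp((ax1, ax2), xlim=(2000, 2019), ylim=(0, 100),
         xticks=list(range(2000, 2020, 2)))

for ax in (ax1, ax2):
    ax.grid(True)  # grid lines are switched on one subplot at a time

ax1.set_title("Urban")
ax2.set_title("Rural")

# Both panels show the same regions, so the handles from one panel are enough.
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, loc="lower center", ncol=2)
fig.suptitle("Access to electricity in African regions (illustrative data)")
```

A single figure-level legend avoids repeating the same entries in both panels.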

## Progress or not?

With the aim of the SDGs being to help countries develop, checking whether progress is being made in achieving these goals is an important element of analysis.

We will therefore find the 10 countries that had the lowest access to electricity in 2000, and compare those values with the same countries' data for 2019.  This will be done using a more advanced plotting tool called "Seaborn" to create a grouped bar graph.

Run the cells below to create a dataframe of all countries in 2000 (`df_2000`) and 2019 (`df_2019`).

In [None]:
df_2000 = df_c[(df_c["Location"] == "ALLAREA") & (df_c["TimePeriod"] == 2000)]
df_2019 = df_c[(df_c["Location"] == "ALLAREA") & (df_c["TimePeriod"] == 2019)]

Use the `Value` column to find the 10 countries with the lowest access to electricity in 2000.  Call this new data frame `df_2000_10`. 

Hint = use the `.nsmallest(n, column)` function to find the bottom 10 countries

We will now create a data frame of the same 10 countries, but with their 2019 data.

First, create a list titled `c_10` of the names in the `GeoAreaName` column so we can use it to filter the 2019 data.

Hint = use `.tolist()` 

Now, using this list as a filter, create a dataframe `df_2019_10` from the dataframe `df_2019`, which holds the data for all countries in 2019.

Hint = use the `.isin()` function
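The three hinted methods chain together as in the sketch below. It uses invented frames (`demo_2000` and `demo_2019`, hypothetical stand-ins with placeholder country names) and keeps the lowest 2 rows rather than 10, so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical stand-ins for `df_2000` and `df_2019`; names and values are invented.
demo_2000 = pd.DataFrame({
    "GeoAreaName": ["A", "B", "C", "D"],
    "TimePeriod": [2000] * 4,
    "Value": [5.0, 60.0, 12.0, 95.0],
})
demo_2019 = pd.DataFrame({
    "GeoAreaName": ["A", "B", "C", "D"],
    "TimePeriod": [2019] * 4,
    "Value": [30.0, 80.0, 45.0, 100.0],
})

# Keep the two rows with the smallest `Value` (the exercise uses 10).
demo_2000_low = demo_2000.nsmallest(2, "Value")

# Collect the matching names into a plain Python list.
low_names = demo_2000_low["GeoAreaName"].tolist()

# Keep only the 2019 rows whose name appears in that list.
demo_2019_low = demo_2019[demo_2019["GeoAreaName"].isin(low_names)]
print(low_names)
```

Filtering the 2019 data by name, rather than taking its own lowest rows, is what guarantees that both frames describe the same countries.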

Seaborn requires the data to be in a single data frame, so run the code below to combine the two data frames (you can learn more about such dataframe manipulations in session 4).  This will create a final dataframe called `df_10`.

In [None]:
frames = [df_2000_10, df_2019_10]

df_10 = pd.concat(frames)
df_10

Now, we will use Seaborn to create a grouped bar graph.  Look at the example at https://seaborn.pydata.org/examples/grouped_barplot.html, and use it as a guide to plot the `df_10` dataframe below. 

Not all the elements from the example are necessary.  In addition to what it shows, rotate the x-axis ticks 90 degrees.

You can find a lot more information on Seaborn plot customisation here = https://s3.amazonaws.com/assets.datacamp.com/production/course_15192/slides/chapter4.pdf
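A pared-down sketch of a grouped bar graph with `sns.catplot`, run on an invented frame `demo_10` (a hypothetical stand-in for `df_10`, with placeholder country names). Converting the year to text in a new `Year` column is an assumption made here so that the two years are treated as categories; the axis labels are arbitrary.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for `df_10`; names and values are invented.
demo_10 = pd.DataFrame({
    "GeoAreaName": ["A", "C", "A", "C"],
    "TimePeriod": [2000, 2000, 2019, 2019],
    "Value": [5.0, 12.0, 30.0, 45.0],
})

# Store the year as text so it is handled as a category rather than a number.
demo_10["Year"] = demo_10["TimePeriod"].astype(str)

# hue="Year" draws one bar per year, side by side, for each country.
g = sns.catplot(data=demo_10, kind="bar",
                x="GeoAreaName", y="Value", hue="Year")
g.set_axis_labels("Country", "Access to electricity (%)")

# Rotate the country names so that longer labels do not overlap.
g.ax.tick_params(axis="x", labelrotation=90)
```

`catplot` returns a grid object rather than a plain subplot, which is why the tick rotation is applied through `g.ax`.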

Python has loads of powerful methods to visualise data, and hopefully this notebook has given you a taste for what is possible.  Hope you enjoyed!