# ECON GA- 4003 Project Data Report

# Country Level Analysis of Happiness and Peace 

### Authors: Syeda Naveera Fatima Rizvi and Suniya Raza

This data report is intended to let readers successfully download and view the data required for the final analysis. We download three main datasets:

1. Data for Peace Index. This is the Global Peace Index, which measures the peacefulness of a country. The index ranges from 1 to 5, and a lower value represents a higher level of peace.
<br>
<br>
2. Data for Happiness Index. This is a Cantril ladder score, which is meant to represent happiness or subjective well-being. Respondents are asked to imagine a ladder with steps numbered from 0 to 10. They are then asked to rate their current lives based on where they consider themselves to stand on the ladder, where step 0 is the worst possible life and step 10 is the best. This dataset also includes data on the factors considered to contribute to the general level of happiness in a country.
<br>
<br>
3. Data for Female Heads of State. This data lists female heads of state and government who were elected, or appointed by a governing committee or parliament. It includes Presidents, Prime Ministers and Chancellors. The dataset includes the countries these women governed as well as the start and end dates of each mandate.

In [246]:
import pandas as pd 
import requests
import numpy as np
 

# Dataset for Peace Index

We download this dataset using an API request from the [vision of humanity](https://www.visionofhumanity.org/) website. 


This data has been produced by the Institute for Economics and Peace (IEP)
under the guidance of an international panel of independent experts.  


The following code sends the request and outputs the data if the request is successful.
Sometimes the request fails; however, we found that running the cell again solves the problem.

We also do some basic cleaning and reshaping to present the data in a manner that is presentable and consistent with the other sources.
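
Since re-running the cell usually resolves a failed request, the retry can also be automated. The following is only a minimal sketch, not part of our notebook: the helper name `fetch_with_retry` and its parameters `attempts` and `wait_seconds` are hypothetical, and it assumes that only status code 200 counts as success.

```python
import time


def fetch_with_retry(fetch, attempts=3, wait_seconds=1.0):
    """Call fetch() up to `attempts` times and return the first response with status 200.

    `fetch` is any zero-argument callable that returns an object with a
    `status_code` attribute (hypothetical helper, for illustration only).
    """
    last_status = None
    for attempt in range(attempts):
        resp = fetch()
        if resp.status_code == 200:
            return resp
        last_status = resp.status_code
        if attempt < attempts - 1:
            time.sleep(wait_seconds)  # pause before the next try
    raise ValueError(f"Response error with code {last_status} after {attempts} attempts")
```

With the `requests` import from the first cell, it could be called as `fetch_with_retry(lambda: requests.get(url_PI))`.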

***Key Variables: Year, Country, Peace_index***

In [247]:
url_PI = "https://www.visionofhumanity.org/wp-content/uploads/2021/06/GPI-2021-overall-scores-and-domains-2008-2021.xlsx"
resp = requests.get(url_PI)

# Proceed only if the request succeeded (HTTP 200); any other code raises below
if resp.status_code == 200:
    # Skip the first two rows of the sheet and keep the first 16 columns
    peace_index= pd.read_excel(resp.content, sheet_name="Overall Scores").iloc[2:,:16]
    # Use the first remaining row (label 2) as the column names, then drop it
    peace_index.columns= peace_index.iloc[0]
    peace_index=peace_index.drop([2], axis=0)
    # Reshape from one column per year to one row per country-year
    peace_index=peace_index.drop(['iso3c'], axis=1).set_index("Country").unstack().reset_index().rename(columns={2:'Year', 0:'Peace_index'})
    peace_index["Peace_index"]= peace_index["Peace_index"].astype(float)
    print(peace_index)
else:
    raise ValueError(f"Response error with code {resp.status_code}. Try running this code again")


        Year      Country  Peace_index
0     2008.0  Afghanistan        3.129
1     2008.0      Albania        1.860
2     2008.0      Algeria        2.322
3     2008.0       Angola        2.047
4     2008.0    Argentina        1.883
...      ...          ...          ...
2277  2021.0    Venezuela        2.934
2278  2021.0      Vietnam        1.835
2279  2021.0        Yemen        3.407
2280  2021.0       Zambia        1.964
2281  2021.0     Zimbabwe        2.490

[2282 rows x 3 columns]
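
The reshaping in the cell above turns a table with one column per year into one row per country and year. As a hedged illustration of the same idea, the sketch below uses `melt` on a small made-up table. The countries "A" and "B" and their values are hypothetical, and `melt` is an alternative to the `unstack` call we actually used.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "Country": ["A", "B"],
    2008: [1.5, 2.5],
    2009: [1.6, 2.4],
})

# Turn the year columns into rows: one row per (Country, Year) pair
tidy = wide.melt(id_vars="Country", var_name="Year", value_name="Peace_index")
tidy["Year"] = tidy["Year"].astype(int)
tidy = tidy.sort_values(["Year", "Country"]).reset_index(drop=True)
print(tidy)
```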


We describe the data below. We have a total of 2282 observations for 163 countries. Our data covers the years 2008-2021 (a total of 14 years).

The minimum of the Peace index is 1.093 (most peaceful) and the maximum is 3.648 (least peaceful). The mean is 2.067

In [248]:
peace_index.describe(include='all').iloc[:,[1,0,2]]

Unnamed: 0,Country,Year,Peace_index
count,2282,2282.0,2271.0
unique,163,,
top,Croatia,,
freq,14,,
mean,,2014.5,2.067386
std,,4.032012,0.4653
min,,2008.0,1.093
25%,,2011.0,1.756
50%,,2014.5,2.033
75%,,2018.0,2.29


# Dataset for Happiness Index

Next we download the data for the happiness index using an API request from the [World Happiness Report](https://worldhappiness.report/ed/2021/) published in the year 2021. 

The World Happiness Report is a publication of the Sustainable Development Solutions Network, powered by data from the Gallup World Poll and
Lloyd’s Register Foundation, who provided access to the World Risk Poll.
The 2021 Report includes data from the ICL-YouGov Behaviour Tracker as part of the COVID Data Hub from the Institute of Global Health Innovation.

Other than the happiness index, we also have data for the six factors which are supposed to contribute to the level of happiness.

These are:

- Log GDP per capita
- Healthy life expectancy
- Social support. The national average of the binary question of whether people believe they have someone to count on in times of trouble.
- Freedom to make life choices. The national average of the response to whether respondents are satisfied or dissatisfied with their freedom to choose what they do with their lives.
- Generosity
- Perceptions of corruption (in both government and businesses)

Two additional variables in the dataset are:

- Positive affect. The average of three positive affect measures: happiness, laughter and enjoyment.
- Negative affect. The average of three negative affect measures: worry, sadness and anger.

Full details on the variables can be found in this [appendix](https://happiness-report.s3.amazonaws.com/2021/Appendix1WHR2021C2.pdf)

***Key Variables: Year, Country, Life Ladder (Happiness index)***

In [249]:
url_happiness = "https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls"
resp2 = requests.get(url_happiness)

# Proceed only if the request succeeded (HTTP 200); any other code raises below
if resp2.status_code == 200:
    df_happiness= pd.read_excel(resp2.content)
    # Use the same key column names as the peace index data
    df_happiness=df_happiness.rename(columns={'Country name':'Country', 'year':'Year'})
    print(df_happiness)
else:
    raise ValueError(f"Response error with code {resp2.status_code}")
    

          Country  Year  Life Ladder  Log GDP per capita  Social support  \
0     Afghanistan  2008     3.723590            7.370100        0.450662   
1     Afghanistan  2009     4.401778            7.539972        0.552308   
2     Afghanistan  2010     4.758381            7.646709        0.539075   
3     Afghanistan  2011     3.831719            7.619532        0.521104   
4     Afghanistan  2012     3.782938            7.705479        0.520637   
...           ...   ...          ...                 ...             ...   
1944     Zimbabwe  2016     3.735400            7.984372        0.768425   
1945     Zimbabwe  2017     3.638300            8.015738        0.754147   
1946     Zimbabwe  2018     3.616480            8.048798        0.775388   
1947     Zimbabwe  2019     2.693523            7.950132        0.759162   
1948     Zimbabwe  2020     3.159802            7.828757        0.717243   

      Healthy life expectancy at birth  Freedom to make life choices  \
0              

We describe the data below. We have a total of 1949 observations for 166 countries. Our data covers the years 2005-2020 (a total of 16 years), although no country is observed in more than 15 of them.

The minimum of the Life Ladder is 2.38 (least happy) and the maximum is 8.02 (happiest), with a mean of 5.47.

In [250]:
df_happiness.describe(include="all")

Unnamed: 0,Country,Year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
count,1949,1949.0,1949.0,1913.0,1936.0,1894.0,1917.0,1860.0,1839.0,1927.0,1933.0
unique,166,,,,,,,,,,
top,Philippines,,,,,,,,,,
freq,15,,,,,,,,,,
mean,,2013.216008,5.466707,9.368459,0.812553,63.359375,0.742567,0.000108,0.747111,0.709998,0.268552
std,,4.166828,1.115717,1.154091,0.11848,7.510244,0.142104,0.162221,0.186793,0.107106,0.085176
min,,2005.0,2.375092,6.635322,0.290184,32.299999,0.257534,-0.33504,0.035198,0.32169,0.082737
25%,,2010.0,4.640079,8.463744,0.74939,58.685,0.647048,-0.112973,0.690305,0.625373,0.206403
50%,,2013.0,5.386025,9.460323,0.835167,65.199997,0.763476,-0.025393,0.802428,0.722391,0.258117
75%,,2017.0,6.283498,10.352778,0.905291,68.589998,0.85603,0.090967,0.871942,0.799276,0.319716


# Merged Dataset

We merge the two datasets above into a final consolidated dataset named master_file

In [251]:
#Merge on Country and Year and use values from both datasets
master_file=df_happiness.merge(peace_index, how='outer', on=['Country', 'Year'])
master_file

Unnamed: 0,Country,Year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Peace_index
0,Afghanistan,2008,3.723590,7.370100,0.450662,50.799999,0.718114,0.167640,0.881686,0.517637,0.258195,3.129
1,Afghanistan,2009,4.401778,7.539972,0.552308,51.200001,0.678896,0.190099,0.850035,0.583926,0.237092,3.270
2,Afghanistan,2010,4.758381,7.646709,0.539075,51.599998,0.600127,0.120590,0.706766,0.618265,0.275324,3.121
3,Afghanistan,2011,3.831719,7.619532,0.521104,51.919998,0.495901,0.162427,0.731109,0.611387,0.267175,3.122
4,Afghanistan,2012,3.782938,7.705479,0.520637,52.240002,0.530935,0.236032,0.775620,0.710385,0.267919,3.241
...,...,...,...,...,...,...,...,...,...,...,...,...
2631,Venezuela,2021,,,,,,,,,,2.934
2632,Vietnam,2021,,,,,,,,,,1.835
2633,Yemen,2021,,,,,,,,,,3.407
2634,Zambia,2021,,,,,,,,,,1.964
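
An outer merge keeps every country-year that appears in either dataset, which is why rows such as the 2021 ones above have a peace index but no happiness data. One way to see how many rows matched is pandas' `indicator=True` option, sketched below on made-up data. The frames `happy` and `peace` and their values are hypothetical stand-ins for `df_happiness` and `peace_index`.

```python
import pandas as pd

# Hypothetical stand-ins for df_happiness and peace_index
happy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2019, 2020, 2020],
    "Life Ladder": [5.0, 5.1, 6.2],
})
peace = pd.DataFrame({
    "Country": ["A", "B", "B"],
    "Year": [2020, 2020, 2021],
    "Peace_index": [1.8, 2.1, 2.0],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = happy.merge(peace, how="outer", on=["Country", "Year"], indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` have happiness data only, and rows marked `right_only` have peace data only.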


# Dataset for Female Head of State



We now get the data for our final dataset: female heads of state.

We scrape this data from the Wikipedia page for [List of elected and appointed female heads of state and government](https://en.wikipedia.org/wiki/List_of_elected_and_appointed_female_heads_of_state_and_government)

We use scrapy to scrape our data

## Note

We have already scraped the data and we provide it as a csv in the zip folder containing this data report. We also provide the spider Python file used to crawl the data.

Inside this zip folder there is a folder named female_heads_data. Inside it there is a csv named fem_head. This csv contains our data.
The folder additionally has a subfolder also named female_heads_data. This subfolder contains a folder named spiders, which contains a Python source file named fem_head. This is the spider file we used to crawl the data.

The steps required to create this spider file, crawl the webpage and store the data in the csv are discussed below.

We first need to run the following lines of code in the terminal:

1. Set the current directory to where you want your Scrapy project. You should replace the path after `cd` with the path to your own directory.

2. Run the command `scrapy startproject <insert name>`. For our code we name the project female_heads_data. This creates a folder named female_heads_data in our current directory.

3. Set the project folder as your current directory.

4. Run the command `scrapy genspider <insert spider file name> <insert the webpage link>`. For our code we name the spider file fem_head and insert the link to the Wikipedia webpage.

***To Run in the Terminal***

cd "..........\Data for Social Skills\Project" 

scrapy startproject female_heads_data

cd female_heads_data

scrapy genspider fem_head en.wikipedia.org/wiki/List_of_elected_and_appointed_female_heads_of_state_and_government

Scrapy has now created a spider file inside our project. It first created the project folder named female_heads_data in our original current directory. This project folder will later also contain our csv data after we have crawled the webpage. The female_heads_data folder contains a subfolder with the same name, i.e. female_heads_data. This subfolder contains a folder named spiders, which contains the fem_head spider file Scrapy just created.

You can run the tree command in the terminal to locate this file

We will next edit this fem_head spider file in order to be able to crawl the data. As discussed above, the final version of the file has already been created and can be viewed separately. In order to explain the code inside it, we have copied its contents below for convenience.

Inside the spider file we first need to remove the slash "/" at the end of the path in start_urls. This slash creates an error when Scrapy tries to
crawl the webpage and so must be removed.

Next we add our code inside the parse function. Full details of the code are given in the
docstring of this function.

***SPIDER FILE***


import scrapy


class FemHeadSpider(scrapy.Spider):

    name = 'fem_head' 
    allowed_domains = ['en.wikipedia.org']  # allowed_domains takes bare domain names, not paths
    start_urls = ['http://en.wikipedia.org/wiki/List_of_elected_and_appointed_female_heads_of_state_and_government']
    
    # Remove the slash at the end of the path in start_urls if it is still present. 
    # The "/" creates an error when scrapy tries to crawl the data

    def parse(self, response):
        """
        This function aims to extract the name, country, mandate start and mandate end date of each female head of state in the table on the 
        wikipedia page.

        We first identify the tbody of the table and store it in a variable tab. We then extract the rows of this table in the tr.
        We exclude the first row as it doesn't contain any actual data.
        Then we index over all the rows and for each row we extract the child td.
    
        The wikipedia data of our interest needs to be retrieved from 2 different children ("a" and "span") of this td child.

        We store the text contents of "a" in a variable nam_count as these are the names of the head of the state and the country
        We store the text content of "span" in a variable start_end as these are the mandate start and end dates

        The wikipedia data is not consistent across all rows. The length of the a and span items and the position of their elements 
        varies with each row. 
        Eg. in some rows the start date is the first element of the span item and in some rows it is the second.
        We noticed that the position of the relevant text varies with the length of the item and so we use this with if conditions
        to fix the errors

        Similarly for leaders with multiple terms, the name and country are provided for the first term but are missing in the successive
        terms.
        If the name and country are not missing then these variables are populated accordingly. If they are missing then the value from
        the previous row is used. 
        They are only missing if they are a successive term, of a head with multiple terms, so using the previous value is valid

        For heads of state who are currently in power, we fix the end date as 31 Jan 2022. This is a date later than all start dates
        in our sample. This is just done to make analysis easier by giving date values to all entries. It doesn't affect our analysis
        otherwise as we will be focusing on 2020 only

        The if conditions take care of all these cleaning errors.

        In the end we store our data in a dictionary named data and we then yield it
        """
        tab= response.xpath('//*[@id="mw-content-text"]/div[1]/table[3]/tbody')
        rows= tab.xpath('.//tr')[1:]
        for row in rows:
            td= row.css("td")
            nam_count= td.css("a::text").getall() #This contains the name of the state and its female head 
            start_end= td.css("span::text").getall() #Start and End Dates of the head's position in power

            #Cleaning
            
            #Extracting Name and Country
            if len(nam_count)>=2:
                name= nam_count[0]    
                country= nam_count[1] 
            
            #Extracting date entries based on each case
            if len(start_end)==5:
                start= start_end[1]
                end= start_end[2]
            elif len(start_end)==4:
                start= start_end[2]
                end= start_end[3]
            elif len(start_end)==3:
                start= start_end[1]
                end= start_end[2]
            elif len(start_end)==2:
                if len(start_end[0])>=4:
                    start= start_end[0]
                    end= start_end[1]
                else:
                    start= start_end[1]
                    end= "31-Jan-22"       #31 Jan 22 entry as end date for heads currently in power
            elif len(start_end)==1:
                start= start_end[0]
                end= "31-Jan-22"
            
            
            data={
                "Name": name, 
                "Country":country, 
                "Start_Date":start,
                "End_Date":end
                }
            yield data
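
The date rules in the parse function depend only on the list of span texts, so they can be tried out without running a crawl. The sketch below isolates them in a helper. The names `pick_dates` and `DEFAULT_END` are hypothetical and are not part of our spider file. For lengths the spider does not handle, this sketch returns `(None, None)`, whereas the spider keeps the previous row's values.

```python
DEFAULT_END = "31-Jan-22"  # placeholder end date the spider uses for heads still in power


def pick_dates(start_end):
    """Return (start, end) from a row's span texts, following the spider's length rules."""
    n = len(start_end)
    if n in (3, 5):
        return start_end[1], start_end[2]
    if n == 4:
        return start_end[2], start_end[3]
    if n == 2:
        # A first element of 4 or more characters is treated as the start date
        if len(start_end[0]) >= 4:
            return start_end[0], start_end[1]
        return start_end[1], DEFAULT_END
    if n == 1:
        return start_end[0], DEFAULT_END
    # Assumption for this sketch only: signal an unexpected length with None
    return None, None


# Hypothetical span contents, for illustration
print(pick_dates(["x", "6 April 1940", "11 October 1944"]))
print(pick_dates(["30 November 2021"]))
```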


Once our spider file is ready, we can run the crawl with the following command in the terminal

***To Run in the Terminal***

scrapy crawl fem_head -O fem_head.csv

Scrapy has now created a csv named fem_head.csv inside the main Scrapy project folder (named female_heads_data).
We will now use this csv for our analysis.

In [252]:
#Reading the CSV
F_Head_of_State= pd.read_csv("female_heads_data/fem_head.csv")
#Converting Dates to datetime format
F_Head_of_State['Start_Date']= pd.to_datetime(F_Head_of_State['Start_Date'])
F_Head_of_State['End_Date']= pd.to_datetime(F_Head_of_State['End_Date'])
F_Head_of_State

Unnamed: 0,Name,Country,Start_Date,End_Date
0,Khertek Anchimaa-Toka,Tannu Tuva,1940-04-06,1944-10-11
1,Sükhbaataryn Yanjmaa,Mongolia,1953-09-07,1954-07-07
2,Sirimavo Bandaranaike,Ceylon,1960-07-21,1965-03-27
3,Sirimavo Bandaranaike,Ceylon,1970-05-29,1977-07-23
4,Indira Gandhi,India,1966-01-24,1977-04-24
...,...,...,...,...
157,Natalia Gavrilița,Moldova,2021-08-06,2022-01-31
158,Najla Bouden,Tunisia,2021-10-11,2022-01-31
159,Sandra Mason,Barbados,2021-11-30,2022-01-31
160,Magdalena Andersson,Sweden,2021-11-30,2022-01-31


Our data lists 142 unique female heads of state from 91 unique countries.

Lithuania has had 5 terms with a female head of state, making it the country with the highest number of such terms in our data.

Gro Harlem Brundtland from Norway has had the most terms in office, i.e. 3.
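
Counts like these can be read off the describe output below, or computed directly. The following is a minimal sketch on a made-up table with one row per term. The leaders and countries are hypothetical, and the column names follow `F_Head_of_State`.

```python
import pandas as pd

# Hypothetical table with one row per term in office
terms = pd.DataFrame({
    "Name": ["Leader 1", "Leader 1", "Leader 2", "Leader 3"],
    "Country": ["X", "X", "X", "Y"],
})

n_leaders = terms["Name"].nunique()                          # unique heads of state
terms_per_country = terms["Country"].value_counts()          # terms with a female head, per country
terms_per_leader = terms.groupby(["Name", "Country"]).size() # terms per head of state

print(n_leaders)
print(terms_per_country.idxmax())
print(terms_per_leader.max())
```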



In [253]:
F_Head_of_State.describe(include="all")



Unnamed: 0,Name,Country,Start_Date,End_Date
count,162,162,162,162
unique,142,91,161,133
top,Gro Harlem Brundtland,Lithuania,2021-11-30 00:00:00,2022-01-31 00:00:00
freq,3,5,2,30
first,,,1940-04-06 00:00:00,1944-10-11 00:00:00
last,,,2022-01-27 00:00:00,2022-01-31 00:00:00


# Dataset for 2020

Next, for our 2020 analysis (the impact of a female head of state on peace and happiness during COVID), we merge two subsets of data.

We create a subset of female heads of state who were in power during 2020, i.e. heads of state whose term
started before 1 Jan 2021 and ended after 31 Dec 2019.

In [254]:

Female_HOS_2020=F_Head_of_State.loc[(F_Head_of_State["Start_Date"]<'2021-01-01') & (F_Head_of_State["End_Date"]>'2019-12-31'),:]
Female_HOS_2020

Unnamed: 0,Name,Country,Start_Date,End_Date
44,Sheikh Hasina,Bangladesh,2009-01-06,2022-01-31
70,Angela Merkel,Germany,2005-11-22,2022-01-31
81,Ivy Matsepe-Casaburri,South Africa,2008-09-25,2022-01-31
109,Erna Solberg,Norway,2013-10-16,2021-10-14
115,Kolinda Grabar-Kitarović,Croatia,2015-02-19,2020-02-18
121,Bidya Devi Bhandari,Nepal,2015-10-29,2022-01-31
123,Hilda Heine,Marshall Islands,2016-01-28,2020-01-14
124,Aung San Suu Kyi,Myanmar,2016-04-06,2021-02-01
125,Tsai Ing-wen,Taiwan,2016-05-20,2022-01-31
128,Kersti Kaljulaid,Estonia,2016-10-10,2021-10-11
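
The filter above keeps a term if it overlaps the year 2020 at all, even by a few days. The sketch below applies the same two conditions to made-up terms to show which cases pass. The names and dates are hypothetical.

```python
import pandas as pd

# Hypothetical terms covering the edge cases of the 2020 filter
heads = pd.DataFrame({
    "Name": ["Ended 2019", "Spans 2020", "Starts 2021", "Ends Jan 2020"],
    "Start_Date": pd.to_datetime(["2015-01-01", "2018-06-01", "2021-03-01", "2016-01-28"]),
    "End_Date": pd.to_datetime(["2019-12-31", "2022-01-31", "2022-01-31", "2020-01-14"]),
})

# A term overlaps 2020 if it started before 1 Jan 2021 and ended after 31 Dec 2019
in_2020 = heads.loc[(heads["Start_Date"] < "2021-01-01") & (heads["End_Date"] > "2019-12-31")]
print(in_2020["Name"].tolist())
```

A term ending exactly on 31 Dec 2019 is excluded, while one ending in mid-January 2020 is kept.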


We have 30 countries with female heads of state during 2020 in our data.

In [255]:
Female_HOS_2020.describe(include="all")




Unnamed: 0,Name,Country,Start_Date,End_Date
count,30,30,30,30
unique,30,30,30,10
top,Sahle-Work Zewde,Croatia,2016-01-28 00:00:00,2022-01-31 00:00:00
freq,1,1,1,21
first,,,2005-11-22 00:00:00,2020-01-07 00:00:00
last,,,2020-12-24 00:00:00,2022-01-31 00:00:00


Next we create a subset of data for the year 2020 from our master file (merged data of peace and happiness).

We then merge the two subsets on Country with a left merge, so that every country in the master file subset is kept.
We make a binary variable named female_head which takes the value 1 for states with female heads and 0 otherwise. This is our final merged dataset for the 2020 analysis.

In [256]:
df_2020=master_file.loc[master_file['Year']==2020]
merged_2020=df_2020.merge(Female_HOS_2020, how='left', on='Country').drop(['Start_Date', 'End_Date'], axis=1)

merged_2020['female_head'] = merged_2020['Name'].apply(lambda x: 1 if pd.notnull(x) else 0)  #Female head =1, 0 otherwise
merged_2020.sort_values("female_head")

Unnamed: 0,Country,Year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Peace_index,Name,female_head
0,Albania,2020,5.364910,9.497252,0.710115,69.300003,0.753671,0.006968,0.891359,0.678661,0.265066,1.821,,0
104,Burundi,2020,,,,,,,,,,2.455,,0
105,Central African Republic,2020,,,,,,,,,,3.174,,0
106,Chad,2020,,,,,,,,,,2.492,,0
107,Costa Rica,2020,,,,,,,,,,1.719,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136,Nepal,2020,,,,,,,,,,2.015,Bidya Devi Bhandari,1
27,Ethiopia,2020,4.549220,7.710983,0.823138,59.500000,0.768694,0.188497,0.783822,0.669389,0.251514,2.492,Sahle-Work Zewde,1
28,Finland,2020,7.889350,10.750446,0.961621,72.099998,0.962424,-0.115532,0.163636,0.744292,0.192898,1.391,Sanna Marin,1
72,Serbia,2020,6.041546,9.788260,0.852102,69.000000,0.843480,0.149401,0.824472,0.602846,0.357580,1.767,Ana Brnabić,1
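
The left merge and the binary flag can be illustrated on a small made-up example. The frames `countries` and `heads` below are hypothetical stand-ins for `df_2020` and `Female_HOS_2020`, and `notna().astype(int)` is used here as an equivalent of the `apply` call in our cell.

```python
import pandas as pd

# Hypothetical stand-ins for df_2020 and Female_HOS_2020
countries = pd.DataFrame({"Country": ["A", "B", "C"], "Peace_index": [1.8, 2.1, 2.5]})
heads = pd.DataFrame({"Country": ["B"], "Name": ["Leader 1"]})

# A left merge keeps all rows of `countries`; Name is missing where there is no match
merged = countries.merge(heads, how="left", on="Country")

# 1 if a female head was matched, 0 otherwise
merged["female_head"] = merged["Name"].notna().astype(int)
print(merged)
```

If a country had more than one matching head in the year, a left merge would repeat that country's row. In our data each of the 30 countries appears only once.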
