# Project 3

- **Dataset(s) to be used:** 
    - [**World happiness data**](https://ourworldindata.org/grapher/share-of-people-who-say-they-are-happy?tab=chart): This dataset documents the **share of people who say they are happy**, collected from the [Integrated Values Surveys (2022)](https://www.worldvaluessurvey.org/WVSEVStrend.jsp) and processed by [Our World in Data](https://ourworldindata.org/).
    - [**World poverty ratio data**](https://data.un.org/Data.aspx?q=poverty&d=WDI&f=Indicator_Code%3aSI.POV.GAPS): This dataset documents the **poverty headcount ratio** at $2.15 a day (2017 PPP) (% of population).
    - [**World inequality data**](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LM4OWF): This dataset contains income inequality data with broad coverage across countries and over time.
- **Analysis question:** 
  1. Is there a correlation between inequality and happiness, and between poverty and happiness?
- **Columns that will (likely) be used:**
  - **inequality dataset**: `country`, `gini_disp`, `year`
  - **happiness dataset**: `Entity`, `Year`, `Happiness: Happy (aggregate)`
  - **poverty dataset**: `Country or Area`, `Year`, `Value`
- **Columns to be used to merge/join them:**
  - **inequality dataset**: `country`, `year`
  - **happiness dataset**: `Entity`, `Year`
  - **poverty dataset**: `Country or Area`, `Year`
- **Hypothesis**: 
    1. Inequality is negatively correlated with happiness (measured as the share of the population who say they are happy)
    2. Poverty ratio is negatively correlated with happiness (measured as the share of the population who say they are happy)
- **Site URL:** [https://cic-yuchen.readthedocs.io/en/latest/index.html]

In [281]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

## Part 0: Read the datasets

In [282]:
import pandas as pd
import plotly.express as px
import numpy as np

In [283]:
happiness = pd.read_csv("happiness.csv")
inequality = pd.read_csv("inequality.csv")
poverty = pd.read_csv("poverty.csv")

### Snapshot of the datasets

In [284]:
happiness.head()

Unnamed: 0,Entity,Code,Year,Happiness: Happy (aggregate)
0,Albania,ALB,1998,33.43343
1,Albania,ALB,2004,58.8
2,Albania,ALB,2010,66.85212
3,Albania,ALB,2022,73.9271
4,Algeria,DZA,2004,80.73323


In [285]:
inequality.head()

Unnamed: 0,country,year,gini_disp,gini_disp_se,gini_mkt,gini_mkt_se,abs_red,abs_red_se,rel_red,rel_red_se
0,Afghanistan,2007,31.4,2.6,33.0,2.97,,,,
1,Afghanistan,2008,31.4,2.52,33.0,2.88,,,,
2,Afghanistan,2009,31.4,2.56,33.1,2.91,,,,
3,Afghanistan,2010,31.5,2.58,33.2,2.94,,,,
4,Afghanistan,2011,31.6,2.59,33.2,2.99,,,,


In [286]:
poverty.head()

Unnamed: 0,Country or Area,Year,Value,Value Footnotes
0,Albania,2020,0.0,1.0
1,Albania,2019,0.0,1.0
2,Albania,2018,0.0,1.0
3,Albania,2017,0.0,1.0
4,Albania,2016,0.0,1.0


### For convenience of data visualization and analytics
**Clean and preprocess the datasets**

1. Convert year to integer
2. Lower-case all country names
3. Rename the country column to `country`
4. Rename the year column to `year`

In [287]:
# inequality: keep the columns of interest; .copy() avoids SettingWithCopyWarning on the next assignment
inequality = inequality[['country', 'year', 'gini_disp']].copy()
inequality['country'] = inequality['country'].str.lower()

# poverty: keep the first 2498 rows only (the remaining rows of the raw file are footnotes)
poverty = poverty[:2498]
poverty = poverty[['Country or Area', 'Year', 'Value']]
poverty = poverty.rename(columns={"Value": "poverty_ratio", 'Year':'year', 'Country or Area':'country'})
poverty['year'] = poverty['year'].astype(int)
poverty['country'] = poverty['country'].str.lower()

# happiness: rename columns to match the other two datasets
happiness = happiness.rename(columns={'Year': 'year', 'Happiness: Happy (aggregate)':'happiness_score', 'Entity':'country'})
happiness['country'] = happiness['country'].str.lower()

In [288]:
# treat a value of 0 as missing (NA), then count the rows with a missing poverty ratio
poverty.replace(0, np.nan, inplace=True)
len(poverty[poverty['poverty_ratio'].isna()])

379

**There are 379 rows with missing values in the poverty ratio dataset. For convenience in this analysis, we drop them from the poverty dataset.**

In [289]:
poverty = poverty.dropna(subset=["poverty_ratio"])

**Now merge the three datasets into a single dataset**

In [290]:
poverty.head()

Unnamed: 0,country,year,poverty_ratio
6,albania,2014,0.1
7,albania,2012,0.1
9,albania,2005,0.1
10,albania,2002,0.2
11,albania,1996,0.1


In [291]:
inequality.head()

Unnamed: 0,country,year,gini_disp
0,afghanistan,2007,31.4
1,afghanistan,2008,31.4
2,afghanistan,2009,31.4
3,afghanistan,2010,31.5
4,afghanistan,2011,31.6


In [292]:
merged_df = pd.merge(poverty, inequality, how='outer', on=['country','year'])
merged_df =  pd.merge(merged_df, happiness, how='outer', on=['country','year'])

In [293]:
merged_df.head()

Unnamed: 0,country,year,poverty_ratio,gini_disp,Code,happiness_score
0,albania,2014,0.1,38.2,,
1,albania,2012,0.1,37.9,,
2,albania,2005,0.1,37.5,,
3,albania,2002,0.2,37.3,,
4,albania,1996,0.1,36.5,,


**There are NA values for each variable of interest, which is expected with an outer merge. We'll keep the NA values for now and handle them case by case in further analysis.**
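
To see where the NA values come from, it can help to ask pandas which side of the merge each row originated from. The sketch below is only an illustration: `poverty_toy` and `happiness_toy` are hypothetical stand-ins built from a few rows shown in the snapshots above, not the full datasets. It uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical toy frames standing in for the cleaned poverty and happiness tables.
poverty_toy = pd.DataFrame({
    "country": ["albania", "albania", "argentina"],
    "year": [2012, 2014, 2010],
    "poverty_ratio": [0.1, 0.1, 0.3],
})
happiness_toy = pd.DataFrame({
    "country": ["albania", "argentina", "algeria"],
    "year": [2010, 2010, 2004],
    "happiness_score": [66.85212, 87.6046, 80.73323],
})

# indicator=True adds a `_merge` column with 'left_only', 'right_only', or 'both'.
merged_toy = pd.merge(
    poverty_toy, happiness_toy,
    how="outer", on=["country", "year"], indicator=True,
)

# Count how many country-year rows came from each side of the merge.
coverage = merged_toy["_merge"].value_counts()
print(coverage)
```

Rows marked `left_only` or `right_only` are the ones that carry NA values after an outer merge; only rows marked `both` have data from both tables.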

In [294]:
# Number of NA values for each indicator of interest
len(merged_df[merged_df['happiness_score'].isna()])

6590

In [295]:
len(merged_df[merged_df['poverty_ratio'].isna()])

4894

In [296]:
len(merged_df[merged_df['gini_disp'].isna()])

656

## Inequality vs. Happiness

In this part of the analysis, we will explore the relationship between inequality and happiness. 

- Hypothesis: more inequality --> lower level of happiness

In [297]:
merged_df.head()

Unnamed: 0,country,year,poverty_ratio,gini_disp,Code,happiness_score
0,albania,2014,0.1,38.2,,
1,albania,2012,0.1,37.9,,
2,albania,2005,0.1,37.5,,
3,albania,2002,0.2,37.3,,
4,albania,1996,0.1,36.5,,


In [298]:
inequality_happiness = merged_df.dropna(subset=['happiness_score', 'gini_disp'])

In [299]:
inequality_happiness.head()

Unnamed: 0,country,year,poverty_ratio,gini_disp,Code,happiness_score
17,argentina,2014,0.3,38.2,ARG,86.44928
21,argentina,2010,0.3,40.2,ARG,87.6046
27,argentina,2004,1.3,45.2,ARG,81.37083
33,argentina,1998,2.1,45.8,ARG,81.5944
38,argentina,1993,1.2,43.1,ARG,76.34731


In [300]:
# select a year with most values
inequality_happiness['year'].mode()

0    2010
Name: year, dtype: int64

In [301]:
# look at 2010 only
temp = inequality_happiness[inequality_happiness['year']==2010]

# Plot
fig = px.scatter(temp, 
              x = 'gini_disp', 
              y = 'happiness_score', 
              title = 'Happiness Score vs. Inequality in 2010', 
              trendline='ols', 
              hover_data=['country'],
              labels={ 'gini_disp': 'Gini Coefficient (Inequality)', 
                      'happiness_score': 'Happiness Score'})


fig.show()

**Analysis**

The above graph shows a very loose correlation: a higher Gini score (more inequality) is associated with a lower happiness score. The relationship between inequality and happiness score is not as obvious as expected.
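
A correlation coefficient can put a number on how loose the relationship is. The following is a minimal sketch, not part of the original analysis: `pearson_r` is a hypothetical helper, and `sample` holds only the six 2010 rows that happen to appear in this notebook's printed outputs, so the resulting value is illustrative and does not describe the full 2010 sample.

```python
import pandas as pd

def pearson_r(df, x, y):
    """Return (Pearson r, n) using only rows where both columns are present."""
    pair = df[[x, y]].dropna()
    return pair[x].corr(pair[y]), len(pair)

# Six 2010 rows copied from the outputs printed in this notebook (not the full sample).
sample = pd.DataFrame({
    "country": ["argentina", "zambia", "rwanda", "ethiopia",
                "north macedonia", "south africa"],
    "gini_disp": [40.2, 54.3, 50.2, 32.7, 34.4, 63.4],
    "happiness_score": [87.6046, 51.98119, 85.33511, 63.46667, 80.4359, 77.83786],
})

r, n = pearson_r(sample, "gini_disp", "happiness_score")
print(f"Pearson r = {r:.2f} (n = {n})")
```

Applied to the 2010 subset of `inequality_happiness`, the same helper would give the coefficient behind the trendline in the plot.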

## Poverty vs. Happiness

In this part of the analysis, we'll look at the relationship between poverty ratio and happiness score. 
- Hypothesis: More poverty --> less happy

In [302]:
# create dataset with only poverty and happiness score data
poverty_happiness = merged_df.dropna(subset=['happiness_score', 'poverty_ratio'])

In [303]:
poverty_happiness.head()

Unnamed: 0,country,year,poverty_ratio,gini_disp,Code,happiness_score
17,argentina,2014,0.3,38.2,ARG,86.44928
21,argentina,2010,0.3,40.2,ARG,87.6046
27,argentina,2004,1.3,45.2,ARG,81.37083
33,argentina,1998,2.1,45.8,ARG,81.5944
38,argentina,1993,1.2,43.1,ARG,76.34731


In [304]:
poverty_happiness['year'].mode()

0    2010
Name: year, dtype: int64

In [305]:
# normalize poverty ratio (z-score); take an explicit copy first so that the
# assignment below does not trigger pandas' SettingWithCopyWarning
poverty_happiness = poverty_happiness.copy()
poverty_happiness['poverty_ratio_normalized'] = (poverty_happiness['poverty_ratio'] - poverty_happiness['poverty_ratio'].mean()) / poverty_happiness['poverty_ratio'].std()

In [306]:
# look at 2010 only
temp = poverty_happiness[poverty_happiness['year']==2010]

# Plot
fig = px.scatter(temp, 
              x = 'poverty_ratio', 
              y = 'happiness_score', 
              title = 'Happiness Score vs. Poverty in 2010', 
              trendline='ols', 
              labels={ 'poverty_ratio': 'Poverty Ratio (% of population)', 
                      'happiness_score': 'Happiness Score'})


fig.show()

**The plot above fails to capture an obvious trend because of some outliers. Now let's handle these outliers properly.**

In [307]:
# identify the outlier
temp.sort_values('poverty_ratio', ascending = False).head()

Unnamed: 0,country,year,poverty_ratio,gini_disp,Code,happiness_score,poverty_ratio_normalized
2109,zambia,2010,31.4,54.3,ZMB,51.98119,6.026203
1625,rwanda,2010,22.8,50.2,RWA,85.33511,4.225399
594,ethiopia,2010,8.1,32.7,ETH,63.46667,1.147278
1442,north macedonia,2010,6.0,34.4,MKD,80.4359,0.707547
1673,south africa,2010,5.5,63.4,ZAF,77.83786,0.602849


The two outliers are **Zambia** and **Rwanda**. These two outliers present some interesting findings:
- Rwanda has a very high poverty ratio (22.8%), but its reported happiness score is unexpectedly high (85.33511).
- Zambia, on the other hand, tells a story that is more or less expected: a high poverty ratio and a low happiness score. 
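
The `poverty_ratio_normalized` column offers one way to make the outlier call less of an eyeball judgment. The sketch below is an assumption-laden illustration rather than the method used in this notebook: it copies the five rows printed above into a hypothetical `top_2010` frame and flags any row whose z-score exceeds 3 in absolute value, where the cutoff `Z_THRESHOLD` is a common rule of thumb chosen here for illustration.

```python
import pandas as pd

# The five highest-poverty 2010 rows, copied from the output above.
top_2010 = pd.DataFrame({
    "country": ["zambia", "rwanda", "ethiopia", "north macedonia", "south africa"],
    "poverty_ratio": [31.4, 22.8, 8.1, 6.0, 5.5],
    "poverty_ratio_normalized": [6.026203, 4.225399, 1.147278, 0.707547, 0.602849],
})

# Assumed cutoff: |z| > 3 is a common rule of thumb, not something fixed by the data.
Z_THRESHOLD = 3.0

is_outlier = top_2010["poverty_ratio_normalized"].abs() > Z_THRESHOLD
outliers = top_2010.loc[is_outlier, "country"].tolist()
print(outliers)  # ['zambia', 'rwanda']
```

With this cutoff, the flagged countries match the two identified by inspection.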

Now, remove the two outliers and visualize the trend.

In [308]:

# look at 2010 only
temp = poverty_happiness[poverty_happiness['year']==2010]

# remove rwanda and zambia
temp = temp[(temp['country']!='rwanda') & (temp['country']!='zambia')]

# Plot
fig = px.scatter(temp, 
              x = 'poverty_ratio', 
              y = 'happiness_score', 
              title = 'Happiness Score vs. Poverty in 2010', 
              trendline='ols', 
              hover_data=['country'], 
              labels={ 'poverty_ratio': 'Poverty Ratio (% of population)', 
                      'happiness_score': 'Happiness Score'})


fig.show()

The above plot shows a clearer, yet still loose, relationship between happiness score and poverty ratio. In general, a higher poverty ratio is associated with a lower happiness score. However, many countries have both a low poverty ratio and a low happiness score. 
- For example, [**Moldova**](https://en.wikipedia.org/wiki/Moldova) and [**Bulgaria**](https://en.wikipedia.org/wiki/Bulgaria) have a low poverty ratio but a low happiness score. 
- **Indonesia** has a relatively high poverty ratio but still a high happiness score.
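
One way to check how much the two removed countries drive the trendline is to fit a straight line with and without them and compare the slopes. This is a rough sketch under stated assumptions: `rows_2010` contains only the six 2010 rows visible in this notebook's outputs, `ols_slope` is a hypothetical helper, and `np.polyfit` with `deg=1` stands in for the OLS trendline that Plotly draws.

```python
import numpy as np
import pandas as pd

# Six 2010 rows copied from the outputs printed in this notebook (not the full sample).
rows_2010 = pd.DataFrame({
    "country": ["argentina", "zambia", "rwanda", "ethiopia",
                "north macedonia", "south africa"],
    "poverty_ratio": [0.3, 31.4, 22.8, 8.1, 6.0, 5.5],
    "happiness_score": [87.6046, 51.98119, 85.33511, 63.46667, 80.4359, 77.83786],
})

def ols_slope(df):
    """Slope of a least-squares line of happiness_score on poverty_ratio."""
    slope, _intercept = np.polyfit(df["poverty_ratio"], df["happiness_score"], deg=1)
    return slope

# Fit once on all rows, once without the two outliers.
slope_all = ols_slope(rows_2010)
trimmed = rows_2010[~rows_2010["country"].isin(["rwanda", "zambia"])]
slope_trimmed = ols_slope(trimmed)

print(f"slope with outliers:    {slope_all:.3f}")
print(f"slope without outliers: {slope_trimmed:.3f}")
```

A large gap between the two slopes would suggest that the fitted trend depends heavily on a couple of high-leverage points, which is worth keeping in mind when reading the trimmed plot.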