# World Happiness Report - Course Project No. 1


In [1]:
# Import useful libraries for the project
import matplotlib.pyplot as plt
%matplotlib inline

## A. Importing, cleaning and numerical summaries

### Load and analyze the data

First, we build a pandas dataframe object by loading the data.csv file.

In [2]:
import os
import pandas as pd

datafile_path = os.path.join('data','data.csv')

df = pd.read_csv(datafile_path)


Now it's time to grab some information from this freshly loaded dataframe.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 12 columns):
Country             153 non-null object
Happiness Rank      153 non-null int64
Happiness Score     153 non-null float64
Economy             153 non-null float64
Family              153 non-null float64
Health              153 non-null float64
Freedom             153 non-null float64
Generosity          153 non-null float64
Corruption          153 non-null float64
Dystopia            153 non-null float64
Job Satisfaction    151 non-null float64
Region              153 non-null object
dtypes: float64(9), int64(1), object(2)
memory usage: 14.4+ KB


This dataframe has 12 columns and 153 rows.

The column types are floats (the observed values), an integer (the country rank) and two `object` columns (country names and regions).

The `df.info()` output shows that _Job Satisfaction_ is missing two values (151 non-null entries out of 153 rows).
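
As a complement to `df.info()`, missing values can also be counted directly. The following is a minimal sketch on a made-up toy dataframe (the `toy` name and its values are assumptions for illustration, not the project data):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Score": [7.1, 5.2, 4.8, 6.0],
    "Job Satisfaction": [90.1, np.nan, 70.3, np.nan],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```

Applied to the project dataframe, the same two calls would show which column has gaps and which countries are affected.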

As required for this project, we will remove the rows that are missing data. We will use the `dropna()` function, which by default removes every row that has at least one missing value.

_Do not forget the `inplace=True` parameter to update the current dataframe._



In [4]:
df.dropna(inplace=True)
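
To make the behaviour of `dropna()` concrete, here is a small sketch on a toy dataframe (the names `toy`, `cleaned` and `cleaned_subset` are hypothetical, and the values are made up). It shows that, without `inplace=True`, the original object is left unchanged and a new dataframe is returned, and that the `subset` parameter limits the check to chosen columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Score": [7.1, 5.2, 4.8],
    "Job Satisfaction": [90.1, np.nan, 70.3],
})

# Without inplace=True, dropna() returns a new DataFrame and leaves `toy` untouched.
cleaned = toy.dropna()

# `subset` restricts the missing-value check to the listed columns only.
cleaned_subset = toy.dropna(subset=["Job Satisfaction"])

print(len(toy), len(cleaned), len(cleaned_subset))  # 3 2 2
```

Reassigning the result (`df = df.dropna()`) is an alternative to `inplace=True` with the same end result.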

Here is the count of non-null values for each numerical column. We can confirm that there are no empty cells left: every column has a count of 151 (we dropped the two rows where _Job Satisfaction_ was not set).

In [5]:
df.describe().loc[['count']]

| | Happiness Rank | Happiness Score | Economy | Family | Health | Freedom | Generosity | Corruption | Dystopia | Job Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 |


---
### Reindex the dataset

Displaying a few rows of the dataframe with `df.head(5)` shows that the default integer index could be replaced by the country name.
This is easily done with the following code.

_Again, do not forget `inplace=True` to update the current dataframe._



In [6]:
df.set_index('Country',inplace=True)
df.head(3)

| Country | Happiness Rank | Happiness Score | Economy | Family | Health | Freedom | Generosity | Corruption | Dystopia | Job Satisfaction | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Malta | 27 | 6.527 | 1.34328 | 1.488412 | 0.821944 | 0.588767 | 0.574731 | 0.153066 | 1.556863 | 85.2 | Western Europe |
| Zimbabwe | 138 | 3.875 | 0.375847 | 1.083096 | 0.196764 | 0.336384 | 0.189143 | 0.095375 | 1.59797 | 56.3 | Africa |
| Cyprus | 65 | 5.621 | 1.355938 | 1.131363 | 0.844715 | 0.355112 | 0.271254 | 0.041238 | 1.621249 | 88.7 | Eastern Europe |


At this point, our dataframe contains 11 columns, since `'Country'` became its index.

In [7]:
df.columns

Index(['Happiness Rank', 'Happiness Score', 'Economy', 'Family', 'Health',
       'Freedom', 'Generosity', 'Corruption', 'Dystopia', 'Job Satisfaction',
       'Region'],
      dtype='object')
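
The effect of `set_index()` can be checked on a small example. This sketch uses a made-up toy dataframe (the names `toy`, `indexed` and `restored` are assumptions for illustration) to show that the indexed column leaves the column list, that rows can then be selected by label, and that `reset_index()` undoes the operation:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Rank": [2, 1, 3],
    "Region": ["North", "South", "North"],
})

indexed = toy.set_index("Country")

# 'Country' is now the index, so it no longer appears among the columns.
print(list(indexed.columns))  # ['Happiness Rank', 'Region']

# Rows can be selected by country name with the .loc indexer.
rank_b = indexed.loc["B", "Happiness Rank"]
print(rank_b)  # 1

# reset_index() moves the index back into a regular column.
restored = indexed.reset_index()
```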

---
### Grab some stats

Basic statistics of the data can be obtained using the `describe()` function.

We can restrict the displayed statistics to the mean, min and max of each numerical column using the `.loc` indexer.

In [8]:
df.describe().loc[['mean','min','max']]

| | Happiness Rank | Happiness Score | Economy | Family | Health | Freedom | Generosity | Corruption | Dystopia | Job Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|
| mean | 77.827815 | 5.357874 | 0.983895 | 1.190509 | 0.550794 | 0.409805 | 0.244914 | 0.123008 | 1.85491 | 75.209934 |
| min | 1.0 | 2.693 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.377914 | 44.4 |
| max | 155.0 | 7.537 | 1.870766 | 1.610574 | 0.949492 | 0.658249 | 0.838075 | 0.464308 | 3.117485 | 95.1 |
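
As an aside, the same three statistics can be computed without going through `describe()`. The sketch below, on a made-up toy dataframe (`toy`, `via_describe` and `via_agg` are hypothetical names), compares the `describe().loc[...]` approach with `agg()` applied to the numerical columns only:

```python
import pandas as pd

toy = pd.DataFrame({
    "Happiness Score": [7.0, 5.0, 3.0],
    "Economy": [1.5, 1.0, 0.5],
    "Region": ["North", "South", "North"],
})

# describe() ignores the non-numerical 'Region' column by default.
via_describe = toy.describe().loc[["mean", "min", "max"]]

# agg() computes only the requested statistics; we keep numerical columns explicitly.
via_agg = toy.select_dtypes(include="number").agg(["mean", "min", "max"])

print(via_agg)
```

Both approaches give one row per statistic; `agg()` simply skips the statistics that would be discarded anyway.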


Using the `sort_values` method of the `DataFrame` object, we can easily get the 10 happiest countries, as well as the 10 least happy countries.

#### Top 10 happiest countries

In [9]:
df.sort_values('Happiness Rank').head(10)['Happiness Rank']

Country
Norway          1
Denmark         2
Iceland         3
Switzerland     4
Finland         5
Netherlands     6
Canada          7
New Zealand     8
Sweden          9
Australia      10
Name: Happiness Rank, dtype: int64

#### 10 least happy countries

In [10]:
df.sort_values('Happiness Rank',ascending=False).head(10)['Happiness Rank']

Country
Central African Republic    155
Burundi                     154
Tanzania                    153
Syria                       152
Rwanda                      151
Togo                        150
Guinea                      149
Liberia                     148
Yemen                       146
Haiti                       145
Name: Happiness Rank, dtype: int64
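
The two rankings above rely on `sort_values` followed by `head`. A possible shortcut, sketched here on a made-up toy series (the `ranks`, `top2` and `bottom2` names are assumptions for illustration), is to use `nsmallest` and `nlargest`, which return the requested number of entries already ordered:

```python
import pandas as pd

# Toy ranks indexed by country (values are made up for illustration).
ranks = pd.Series({"A": 3, "B": 1, "C": 4, "D": 2}, name="Happiness Rank")
ranks.index.name = "Country"

# Lowest rank numbers first: the happiest countries.
top2 = ranks.nsmallest(2)

# Highest rank numbers first: the least happy countries.
bottom2 = ranks.nlargest(2)

print(top2)
print(bottom2)
```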

---
##  B. Indexing and grouping

In this part of the project, we have to handle data grouped by _Region_.

I've decided to build a dictionary of _DataFrame_ objects, one for each _Region_. The dictionary keys are the _Region_ names.

Note: I use a Python `set()` object to build the list of _Region_ values. A `set()` keeps only the unique values of the sequence it is built from.

In [18]:
# Collect the unique region names
regions = set(df['Region'].to_list())
df_by_region = dict()

for i in regions:
    print("Processing region",i,"=>",os.path.join('data',i+'.csv'))
    # Keep only the rows of the current region
    df_by_region[i] = df[df['Region'] == i]
    # Save this region's data to its own CSV file
    df_by_region[i].to_csv(os.path.join('data',i+'.csv'))
    


Processing region Asia-Pacific => data/Asia-Pacific.csv
Processing region Latin America => data/Latin America.csv
Processing region Europe => data/Europe.csv
Processing region North America => data/North America.csv
Processing region Africa => data/Africa.csv
Processing region Western Europe => data/Western Europe.csv
Processing region Eastern Europe => data/Eastern Europe.csv


The `df_by_region` variable is now a dict of DataFrames, keyed by region name.
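
For comparison, a similar dictionary can be built with `groupby`, which yields one `(key, sub-dataframe)` pair per distinct value. This is only a sketch on a made-up toy dataframe (`toy` and `toy_by_region` are hypothetical names), not the code used in this project:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["North", "South", "North", "South"],
    "Happiness Score": [7.0, 5.0, 6.0, 4.0],
}).set_index("Country")

# Iterating over a groupby yields (region name, rows of that region) pairs.
toy_by_region = {region: group for region, group in toy.groupby("Region")}

print(sorted(toy_by_region))   # ['North', 'South']
print(toy_by_region["North"])
```

This avoids filtering the full dataframe once per region, at the cost of hiding the explicit loop.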


### Mean happiness score for each region, ranked from most happy to least happy

Although we could use our `df_by_region` dict to get this ranking, we will use `pandas` library functions instead.

