# Introduction
Happiness is an emotional state characterized by feelings of joy, satisfaction, contentment, and fulfillment. While happiness has many different definitions, it is often described as involving positive emotions and life satisfaction. 

This notebook covers the following:
1. [Load and check data](#1)
2. [Variable description](#2)
3. [Univariate Variable Analysis](#3)
    * [Categorical Univariate Variable Analysis](#4)
    * [Numerical Univariate Variable Analysis](#5)
4. [Basic Data Analysis](#6)
5. [Missing Values](#7)   
    * [Finding Missing Value](#8)
    * [Filling Missing Value](#9)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#useful libraries
import matplotlib.pyplot as plt
import seaborn as sns 
from collections import Counter #to count items

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Thanks to the default code lines, we can see that there are five CSV files, so we can use any of them.

<a id=1></a>
# Load and Check Data
In this part we will load the data and look at what it contains.

In [None]:
#loading all of these files
data_2015 = pd.read_csv('../input/world-happiness/2015.csv')
data_2016 = pd.read_csv('../input/world-happiness/2016.csv')
data_2017 = pd.read_csv('../input/world-happiness/2017.csv')
data_2018 = pd.read_csv('../input/world-happiness/2018.csv')
data_2019 = pd.read_csv('../input/world-happiness/2019.csv')

In [None]:
#to see all of the features:
data_2015.columns

In [None]:
data_2016.columns

In [None]:
data_2017.columns

In [None]:
data_2018.columns

In [None]:
data_2019.columns

As we can see, the CSV files share some column names, while others differ from year to year.
From now on we will continue with the 2018 file ("2018.csv", loaded as data_2018).
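
The overlap between the column names can also be checked programmatically instead of by eye. The following is a minimal sketch using set operations; the small frames and their column lists are hypothetical stand-ins, not the real files, and in the notebook the same idea would apply to `data_2015` ... `data_2019`.

```python
import pandas as pd

# Hypothetical stand-ins for the yearly files (column lists are illustrative only)
frames = {
    "2017": pd.DataFrame(columns=["Country", "Happiness.Score", "Generosity"]),
    "2018": pd.DataFrame(columns=["Country or region", "Score", "Generosity"]),
    "2019": pd.DataFrame(columns=["Country or region", "Score", "Generosity"]),
}

# Columns that appear in every frame
common = set.intersection(*(set(df.columns) for df in frames.values()))

# Columns each frame has in addition to the common ones
extras = {year: sorted(set(df.columns) - common) for year, df in frames.items()}

print("common:", sorted(common))
print("extras:", extras)
```

Listing the extras per year makes it easier to decide which columns would need renaming before files from different years are compared.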

First, let's look at which data types the file contains. We will look at some of its values afterwards.

In [None]:
#All different data types
data_2018.info()

In [None]:
#Let's see what is inside this file - only first 10 rows
data_2018.head(10)

<a id=2></a>
# Variable Description
* Overall rank : Rank of the country/region by Score; unique for each row, so it works as an ID
* Country or region : Which country or region is mentioned
* Score : Happiness score given to each country/region
* GDP per capita : Measure of a country's economic output that accounts for its number of people
* Social support : The physical and emotional comfort given to us by our family, friends, co-workers and others
* Healthy life expectancy : Average number of years that a person can expect to live in "full health" by taking into account years lived in less than full health due to disease and/or injury
* Freedom to make life choices : An individual's opportunity and autonomy to perform an action selected from at least two available options, unconstrained by external parties
* Generosity : The quality of being kind and generous.
* Perceptions of corruption : An index that scores countries on how corrupt their governments are believed to be

* int64(1) : Overall rank       
* object(1) : Country or region 
* float64(7) : Score, GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption     
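
The dtype summary above can be derived from the DataFrame itself rather than read off `info()`. This is a small sketch on a toy frame with made-up values; `toy` is a hypothetical name, and in the notebook the same calls would be made on `data_2018`.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the 2018 file (values are made up)
toy = pd.DataFrame({
    "Overall rank": [1, 2, 3],
    "Country or region": ["A", "B", "C"],
    "Score": [7.5, 7.4, 7.3],
    "Generosity": [0.2, 0.3, 0.1],
})

# How many columns there are of each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Column names split into numeric and non-numeric groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```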

In [None]:
#statistical values
data_2018.describe()

<a id=3></a>
# Univariate Variable Analysis
* Categorical Variables: Country or region
* Numerical Variables: Score, GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Overall rank (we will not examine this feature because it is just an ID)

<a id = 4></a>
## Categorical Variable 

In [None]:
def bar_plot(variable):
    #get feature
    var = data_2018[variable]
    #count number of categorical variable (value/sample)
    varValue = var.value_counts()
    
    #visualize
    plt.figure(figsize = (20,6))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}:\n{}".format(variable, varValue))

In [None]:
category1 = ['Country or region']
for c in category1:
    bar_plot(c)

<a id = "5"></a>
## Numerical Variable

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    
    plt.hist(data_2018[variable], bins = 50)
    
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
#numeric variables
numericVar = ['Score','GDP per capita','Social support','Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption']
for n in numericVar:
    plot_hist(n)

Possible comments:

* Scores are mostly around 6
* Healthy life expectancy values are around 70%
* Freedom to make life choices has a maximum value of 45%
* Generosity values are mostly between 10% and 20%

Let's look at the correlation between all of the features:

In [None]:
f,ax = plt.subplots(figsize=(13,13))
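#corr() needs numeric columns only, so the text column is dropped first
sns.heatmap(data_2018.select_dtypes(include="number").corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)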
plt.show()

According to the heatmap we can say:
* GDP per capita is strongly correlated with Score, Social support and Healthy life expectancy
* Social support is strongly correlated with Score, Healthy life expectancy and GDP per capita
* Healthy life expectancy is strongly correlated with Score, Social support and GDP per capita
* Generosity is only weakly correlated with Score, Social support, Healthy life expectancy and GDP per capita
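
Reading strong pairs off a heatmap can be cross-checked by filtering the correlation matrix directly. The sketch below uses a toy frame with made-up values, and the `threshold` of 0.7 is an assumed cut-off for "strong", not something defined by the dataset.

```python
import pandas as pd

# Toy data (made-up values) with one clearly related pair and one unrelated column
toy = pd.DataFrame({
    "Score": [7.6, 6.9, 5.2, 4.1, 3.0],
    "GDP per capita": [1.3, 1.2, 0.9, 0.5, 0.1],
    "Generosity": [0.20, 0.10, 0.25, 0.05, 0.15],
})

corr = toy.corr()

threshold = 0.7  # assumed cut-off for a "strong" correlation
cols = list(corr.columns)
pairs = []
# Walk the upper triangle so each pair is reported once
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r = float(corr.loc[a, b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 2)))

print(pairs)
```

Using the absolute value keeps strong negative correlations in the result as well.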

<a id  =6 > </a><br>
# Basic Data Analysis
In this chapter we will examine the relations between some features:
* Country or region and Score
* Country or region and Healthy life expectancy 
* Country or region and Social support
* Country or region and Generosity
* Healthy life expectancy and GDP per capita
* Healthy life expectancy and Freedom to make life choices
* GDP per capita and Perceptions of corruption

In [None]:
#Country or region and Score
data_2018[["Country or region","Score"]].groupby(["Country or region"], as_index = False).mean().sort_values(by = "Score", ascending = True)

Burundi has the lowest score (2.905), while Finland has the highest (7.632).

In [None]:
#Country or region and Healthy life expectancy
data_2018[["Country or region","Healthy life expectancy"]].groupby(["Country or region"], as_index = False).mean().sort_values(by = "Healthy life expectancy", ascending = False)

Hong Kong has the highest Healthy life expectancy value.

In [None]:
#Country or region and Social support
data_2018[["Country or region","Social support"]].groupby(["Country or region"], as_index = False).mean().sort_values(by = "Social support", ascending = False)

The three countries or regions with the highest Social support values are Iceland, New Zealand and Finland.

In [None]:
#Country or region and Generosity
data_2018[["Country or region","Generosity"]].groupby(["Country or region"], as_index = False).mean().sort_values(by = "Generosity", ascending = True)

The three countries or regions with the lowest Generosity values are Greece, Morocco and Lithuania.
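
When only the top or bottom few rows are of interest, `nlargest` and `nsmallest` give them without sorting and scanning the whole table. This is a minimal sketch on a toy frame; the country labels and values are made up.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "D", "E"],
    "Social support": [1.60, 1.20, 1.50, 0.90, 1.55],
    "Generosity": [0.30, 0.00, 0.12, 0.03, 0.25],
})

# Three highest Social support values
top_support = toy.nlargest(3, "Social support")["Country or region"].tolist()

# Three lowest Generosity values
low_generosity = toy.nsmallest(3, "Generosity")["Country or region"].tolist()

print(top_support)
print(low_generosity)
```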

In [None]:
#Healthy life expectancy and GDP per capita
plt.scatter(data_2018["Healthy life expectancy"],data_2018["GDP per capita"],color="blue")
plt.xlabel("Healthy life expectancy")
plt.ylabel("GDP per capita")
plt.show()

We can say that the higher the GDP per capita, the higher the Healthy life expectancy.

In [None]:
#Healthy life expectancy and Freedom to make life choices
plt.scatter(data_2018["Healthy life expectancy"],data_2018["Freedom to make life choices"],color="blue")
plt.xlabel("Healthy life expectancy")
plt.ylabel("Freedom to make life choices")
plt.show()

There does not seem to be much of a relation between Healthy life expectancy and Freedom to make life choices.

In [None]:
#GDP per capita and Perceptions of corruption
plt.scatter(data_2018["GDP per capita"],data_2018["Perceptions of corruption"],color="blue")
plt.xlabel("GDP per capita")
plt.ylabel("Perceptions of corruption")
plt.show()

Perceptions of corruption values rise once GDP per capita reaches about 1.25.
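
The readings from the scatter plots above can be backed by a number. The sketch below computes a Pearson coefficient (linear relation) and a rank-based Spearman coefficient (monotonic relation) for one pair of columns; the toy values are made up, and Spearman is obtained here by correlating ranks, which is an implementation choice of this example.

```python
import pandas as pd

# Toy data (made-up values) where both columns increase together
toy = pd.DataFrame({
    "GDP per capita": [0.1, 0.5, 0.9, 1.2, 1.3],
    "Healthy life expectancy": [0.2, 0.4, 0.7, 0.9, 1.0],
})

x = toy["GDP per capita"]
y = toy["Healthy life expectancy"]

# Pearson: strength of the linear relation
pearson = x.corr(y)

# Spearman: Pearson correlation of the ranks, captures any monotonic relation
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

A value close to 0 would support a "not much relation" reading, while a value close to 1 or -1 would support a strong one.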


<a id=7></a>
# Missing Values   
* Finding Missing Value
* Filling Missing Value

In [None]:
#concatenate two datasets: data_2018 and data_2019
data_2018 = pd.concat([data_2018,data_2019],axis=0).reset_index(drop = True)

In [None]:
data_2018
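
After concatenation, rows from the two years can no longer be told apart. One possible way to keep that information is to tag each frame before stacking, as sketched below; the `Year` column is a hypothetical addition that does not exist in the original files, and the toy frames hold made-up values.

```python
import pandas as pd

# Toy stand-ins for the two yearly frames (made-up values)
toy_2018 = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.6, 7.5]})
toy_2019 = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.7, 7.4]})

# Tag each frame with its year (hypothetical column), then stack them
combined = pd.concat(
    [toy_2018.assign(Year=2018), toy_2019.assign(Year=2019)],
    axis=0,
).reset_index(drop=True)

# Sanity check: how many rows came from each year
rows_per_year = combined.groupby("Year").size().to_dict()
print(combined)
print(rows_per_year)
```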

<a id=8></a>
## Finding Missing Values

In [None]:
#Are there any missing values?
data_2018.columns[data_2018.isnull().any()]

In [None]:
#How many missing values are there?
data_2018.isnull().sum()

<a id=9></a>
## Filling Missing Values 
There is only one missing value in Perceptions of corruption. Instead of deleting this row, we are going to fill this empty cell.

In [None]:
#Where is this empty cell
data_2018[data_2018["Perceptions of corruption"].isnull()]

In [None]:
#mean value of Perceptions of corruption for United Arab Emirates
np.mean(data_2018[data_2018["Country or region"] == "United Arab Emirates"]["Perceptions of corruption"])

In [None]:
#Filling empty value with mean value
uae_mean = data_2018.loc[data_2018["Country or region"] == "United Arab Emirates", "Perceptions of corruption"].mean()
data_2018["Perceptions of corruption"] = data_2018["Perceptions of corruption"].fillna(uae_mean)

In [None]:
#check whether the cell is now filled
data_2018[data_2018["Perceptions of corruption"].isnull()]
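
The fill above is written for a single known country. If missing values appeared for several countries, a more general variant would fill each gap with the mean of that country's own rows. This is a sketch under that assumption, on a toy frame with made-up values, not a replacement for the step above.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): country A has one missing entry
toy = pd.DataFrame({
    "Country or region": ["A", "A", "B", "B"],
    "Perceptions of corruption": [0.20, np.nan, 0.10, 0.30],
})

col = "Perceptions of corruption"

# Per-row mean of the row's own country (NaN values are skipped by mean)
group_mean = toy.groupby("Country or region")[col].transform("mean")

# Fill only the missing cells with their country's mean
toy[col] = toy[col].fillna(group_mean)

print(toy)
```

A country whose values are all missing would still be left as NaN, so a final `isnull()` check remains useful.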