# Happy Life

#### Project Mission
* This project is based on the World Happiness Report.
* The World Happiness Report is based on survey questions that ask people to rate their happiness on a scale from 1-10.
* My goal is to use the features from the gathered data to create a model that can effectively predict Happiness.

### Imports

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import os
import wrangle_happy as wh
import explore_happy as eh
import model_happy as mh
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

### Acquire Data

* Data was acquired from the Kaggle database
* The data set contains 782 rows and 9 columns after cleaning
* Each row represents a country in a given year
* Each column represents a feature of the survey

In [2]:
# acquiring Happy data
df_2015, df_2016, df_2017, df_2018, df_2019 = wh.wrangle_happi()

### Prepare Data

#### Actions:

* Removed columns that did not contain useful information
* Renamed columns to promote readability and allow for concatenation
* Removed nulls in the data 
* Concatenated 5 dataframes to make 1 master dataframe 
* Split data into train, validate and test (approx. 56/24/20)

In [3]:
# preparing Happy data
happy_df = wh.join_happy(df_2015, df_2016, df_2017, df_2018, df_2019)
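
The body of `wh.join_happy` is not shown in this notebook. As a rough sketch of the rename-then-concatenate step listed above, the example below assumes each yearly frame uses slightly different column names. The toy frames, `RENAME_MAP`, and `join_years` are hypothetical illustrations, not the project's actual schema or code.

```python
import pandas as pd

# Hypothetical yearly frames with inconsistent column names.
df_a = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.1, 5.2]})
df_b = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.0, 5.4]})

# Hypothetical mapping onto one shared schema.
RENAME_MAP = {
    "Country or region": "Country",
    "Happiness Score": "Happiness_Score",
    "Score": "Happiness_Score",
}

def join_years(frames_by_year):
    """Rename columns, tag each frame with its year, stack, and drop nulls."""
    cleaned = []
    for year, frame in frames_by_year.items():
        cleaned.append(frame.rename(columns=RENAME_MAP).assign(Year=year))
    return pd.concat(cleaned, ignore_index=True).dropna()

combined = join_years({2015: df_a, 2019: df_b})
print(combined)
```

Renaming before concatenating matters because `pd.concat` aligns on column names; mismatched names would produce extra columns filled with nulls.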

In [4]:
# splitting Happy data
train, validate, test = mh.split_data(happy_df)
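
The internals of `mh.split_data` are not shown. One common way to arrive at an approximate 56/24/20 split is two passes of scikit-learn's `train_test_split`: hold out 20% for test, then 30% of the remainder for validate. The function name `split_56_24_20` and the seed below are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_56_24_20(df, seed=42):
    """Split a frame into train/validate/test at roughly 56/24/20."""
    # First hold out 20% of all rows for test.
    train_validate, test_part = train_test_split(
        df, test_size=0.2, random_state=seed
    )
    # Then hold out 30% of the remaining 80% (24% overall) for validate.
    train_part, validate_part = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train_part, validate_part, test_part

# Placeholder frame with the same row count as the cleaned data set.
demo = pd.DataFrame({"row_id": range(782)})
train_demo, validate_demo, test_demo = split_56_24_20(demo)
print(len(train_demo), len(validate_demo), len(test_demo))
```

On 782 rows this two-pass approach gives 437 training rows, the same count that `train.info()` reports in this notebook.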

### Data Dictionary

| Feature | Definition | Type |
|:--------|:-----------|:-------|
|**Country**| Name of the country | *obj*|
|**Happiness_Rank** | Rank of the country based on the score |*int*|
|**Economy (GDP per Capita)**| The extent to which GDP contributes to the score calculation | *float*|
|**Health (Life Expectancy)**| The extent to which life expectancy contributes to the score calculation | *float*|
|**Freedom to make life choices**| The extent to which freedom contributes to the score calculation | *float*|
|**Perceptions of Corruption in Gov**| The extent to which perception of corruption contributes to the score calculation | *float*|
|**Generosity**| The extent to which generosity contributes to the score calculation | *float*|
|**Year**| Year the data was assembled | *int*|
|**Happiness_Score** | A metric measured by asking the sampled people "How would you rate your happiness?" |*float*|

In [5]:
# ready for exploration and modeling
x_train, y_train, x_validate, y_validate, x_test, y_test = mh.model_sets(train, validate, test)
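
`mh.model_sets` is not shown here. A minimal sketch of separating features from the target might look like the following. Which columns are dropped is an assumption: `Country` is an identifier, `Happiness_Rank` is derived from the target, and `Year` is excluded later in this notebook. The name `make_xy` is hypothetical.

```python
import pandas as pd

TARGET = "Happiness_Score"
# Assumed non-feature columns (not confirmed by the source module).
DROP_COLS = ["Country", "Happiness_Rank", "Year", TARGET]

def make_xy(df):
    """Return (X, y) with the assumed non-feature columns removed from X."""
    return df.drop(columns=DROP_COLS), df[TARGET]

toy_rows = pd.DataFrame({
    "Country": ["A", "B"],
    "Happiness_Rank": [1, 2],
    "Happiness_Score": [7.5, 7.4],
    "Economy": [1.4, 1.3],
    "Freedom": [0.6, 0.5],
    "Year": [2015, 2015],
})
x_toy_rows, y_toy_rows = make_xy(toy_rows)
print(list(x_toy_rows.columns))
```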

### Overview of Data

In [6]:
# Shows data at a glance
# Key takeaway - no nulls and 9 columns including the target variable
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 437 entries, 117 to 0
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Country                     437 non-null    object 
 1   Happiness_Rank              437 non-null    int64  
 2   Happiness_Score             437 non-null    float64
 3   Economy                     437 non-null    float64
 4   Health_Life_Expectancy      437 non-null    float64
 5   Freedom                     437 non-null    float64
 6   Perceptions_Corruption_Gov  437 non-null    float64
 7   Generosity                  437 non-null    float64
 8   Year                        437 non-null    int64  
dtypes: float64(6), int64(2), object(1)
memory usage: 34.1+ KB


### Data Summary

In [7]:
# Key takeaway: the average/mean of the Happiness Score, which serves as the baseline.
train.describe()

| | Happiness_Rank | Happiness_Score | Economy | Health_Life_Expectancy | Freedom | Perceptions_Corruption_Gov | Generosity | Year |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| count | 437.0 | 437.0 | 437.0 | 437.0 | 437.0 | 437.0 | 437.0 | 437.0 |
| mean | 77.837529 | 5.401465 | 0.917545 | 0.613893 | 0.411162 | 0.127307 | 0.221546 | 2016.983982 |
| std | 45.490635 | 1.12968 | 0.423243 | 0.252466 | 0.151789 | 0.107628 | 0.128605 | 1.423015 |
| min | 1.0 | 2.839 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2015.0 |
| 25% | 40.0 | 4.508 | 0.59325 | 0.42864 | 0.312 | 0.054 | 0.13 | 2016.0 |
| 50% | 75.0 | 5.401 | 0.987 | 0.650785 | 0.434 | 0.0927 | 0.202 | 2017.0 |
| 75% | 119.0 | 6.182 | 1.252785 | 0.810696 | 0.523 | 0.162 | 0.285 | 2018.0 |
| max | 158.0 | 7.594 | 2.096 | 1.088 | 0.724 | 0.55191 | 0.838075 | 2019.0 |


### Explore Data

### What is the average Happiness Score over the 5-year span?

In [8]:
# Returns the baseline = average/mean of the Happiness Score
baseline = train.Happiness_Score.mean()
print(f'The average Happiness Score is {baseline:.2f}')

The average Happiness Score is 5.40
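
A mean baseline predicts the same value for every row, so its error is the yardstick any real model has to beat. The sketch below computes the root mean squared error, $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$, for such a constant prediction. The function name `baseline_rmse` and the toy scores are illustrative assumptions, not project code or project data.

```python
import numpy as np

def baseline_rmse(y_true):
    """RMSE of always predicting the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    # Constant prediction: the mean of the observed values.
    baseline_pred = np.full_like(y_true, y_true.mean())
    return float(np.sqrt(np.mean((y_true - baseline_pred) ** 2)))

# Toy scores, not taken from the data set.
toy_scores = [4.0, 5.0, 6.0]
toy_baseline_rmse = baseline_rmse(toy_scores)
print(f"{toy_baseline_rmse:.4f}")
```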


In [9]:
# Setting up for predictions
train_predictions, validate_predictions, test_predictions = mh.predict(train, validate, test)

### Statistical Tests

**I will now use a Pearson's r statistical test to investigate whether each listed feature and Happiness Score are correlated**

* I will use a confidence level of 95%
* The resulting alpha is .05

${H_0}$: There is **no** linear relationship between the listed feature and Happiness Score.

${H_a}$: There **is** a linear relationship between the listed feature and Happiness Score.
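
The helper functions in `explore_happy` are not shown. As a hedged sketch of how one such test can be run and turned into an outcome string, the example below uses `scipy.stats.pearsonr`, which returns the correlation coefficient and a two-sided p-value. The function name `pearson_outcome` and the toy data are assumptions for illustration.

```python
from scipy import stats

ALPHA = 0.05  # 1 - 0.95 confidence level

def pearson_outcome(x, y, alpha=ALPHA):
    """Return (r, p_value, outcome) for a two-sided Pearson correlation test."""
    r, p_value = stats.pearsonr(x, y)
    if p_value < alpha:
        outcome = "We reject the null hypothesis"
    else:
        outcome = "We fail to reject the null hypothesis"
    return r, p_value, outcome

# Toy data: y rises almost linearly with x.
feature_toy = [1, 2, 3, 4, 5, 6, 7, 8]
score_toy = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2]
r_toy, p_toy, outcome_toy = pearson_outcome(feature_toy, score_toy)
print(f"r = {r_toy:.3f}, p = {p_toy:.4f}: {outcome_toy}")
```

Running the same function once per feature column against `Happiness_Score` would produce one column of the results table.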

In [10]:
# creates dataframe for statistical results
results_stats_df = eh.make_stats_df()

In [11]:
# retrieves results and reads them to dataframe
results_stats_df = eh.get_results(train, results_stats_df)

In [12]:
# visual of results
results_stats_df

| | Index Scores | Health | Year | Generosity | Economy | Freedom | Perceptions of Corruption |
|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | PearsonsR | 0.740978 | 0.04726 | 0.140445 | 0.784104 | 0.570145 | 0.391867 |
| 1 | P-Value | 0.0 | 0.324294 | 0.00326 | 0.0 | 0.0 | 0.0 |
| 2 | Outcome | We reject the null hypothesis | We fail to reject the null hypothesis | We reject the null hypothesis | We reject the null hypothesis | We reject the null hypothesis | We reject the null hypothesis |


### Exploration Summary

* Pearson's r statistical tests found a statistically significant relationship with Happiness Score for five of the six features; only Year was not significant (p = 0.32).

### Creating predictive models

#### Features included:
Features with a statistically significant relationship to the target variable are the most likely to contribute predictive power.

* Health
* Generosity
* Economy
* Freedom
* Perceptions of Corruption

#### Features not included:
This feature had the weakest relationship to the target variable, and the test failed to reject the null hypothesis for it.

* Year

## Modeling
### Simple Linear Regression Model

In [13]:
# fits the model on train and validate
train, train_predictions, validate, validate_predictions = mh.simple_lm_model(train, x_train, y_train, validate, x_validate, train_predictions, validate_predictions)
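
`mh.simple_lm_model` is not shown. A typical pattern, sketched here with scikit-learn's `LinearRegression` on synthetic stand-in data, is to fit on the training split only and then score both splits with RMSE. All variable names and the data below are assumptions, not the project's code or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: two features with a known linear relationship.
rng = np.random.default_rng(0)
X_fit = rng.uniform(0, 2, size=(100, 2))
y_fit = 3.0 + 1.5 * X_fit[:, 0] + 0.5 * X_fit[:, 1] + rng.normal(0, 0.1, size=100)
X_val = rng.uniform(0, 2, size=(40, 2))
y_val = 3.0 + 1.5 * X_val[:, 0] + 0.5 * X_val[:, 1] + rng.normal(0, 0.1, size=40)

# Fit on the training split only, then predict on both splits.
lm = LinearRegression().fit(X_fit, y_fit)

# mean_squared_error returns MSE, so take the square root to get RMSE.
rmse_fit = np.sqrt(mean_squared_error(y_fit, lm.predict(X_fit)))
rmse_val = np.sqrt(mean_squared_error(y_val, lm.predict(X_val)))
print(f"train RMSE = {rmse_fit:.3f}, validate RMSE = {rmse_val:.3f}")
```

Keeping the validate split out of the fit is what makes its RMSE a fair estimate of how the model handles unseen rows.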

### Generalized Linear Regression Model

In [14]:
# fits the model on train and validate
train, train_predictions, validate, validate_predictions = mh.glm_model(train, x_train, y_train, validate, x_validate, train_predictions, validate_predictions)

In [15]:
# a glance at the predictions
train_predictions.head()

| | Happiness_Score | Baseline | lm_predictions | glm_predictions |
|:--|:--|:--|:--|:--|
| 117 | 4.465 | 5.401465 | 5.119084 | 5.427748 |
| 71 | 5.504 | 5.401465 | 5.112611 | 5.314086 |
| 84 | 5.254 | 5.401465 | 5.284811 | 5.367115 |
| 3 | 7.522 | 5.401465 | 7.098599 | 5.647168 |
| 42 | 6.071 | 5.401465 | 5.343723 | 5.358461 |


### Evaluate

In [17]:
# creates evaluation dataframe
evaluate_df = mh.make_stats_df()

In [18]:
# reads results of evaluations to dataframe
evaluate_df = mh.final_eval(train, validate, evaluate_df)

In [19]:
evaluate_df

| | models | RMSE |
|:--|:--|:--|
| 0 | Baseline Train | 1.116563 |
| 1 | SimpleLinear Train | 0.562486 |
| 2 | GeneralizedLinear Train | 0.990515 |
| 3 | Baseline Validate | 1.116563 |
| 4 | SimpleLinear Validate | 0.593547 |
| 5 | GeneralizedLinear Validate | 0.987753 |


### Modeling Summary

* The SimpleLinear Regression Model out-performed the other models on the train and validate data sets
    * Train RMSE: .5625
    * Validate RMSE: .5935
    
* The GeneralizedLinear Regression Model out-performed only the baseline
    * Train RMSE: .9905
    * Validate RMSE: .9877
    
* The ideal model is the one with the lowest RMSE.
    * For this reason the SimpleLinear model will now be applied to the test data set
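
The selection rule above can be sketched in a few lines of pandas. The validate RMSE values are copied from the evaluation table; the frame name `validate_scores` is an assumption for illustration.

```python
import pandas as pd

# Validate RMSE values copied from the evaluation table.
validate_scores = pd.DataFrame({
    "models": [
        "Baseline Validate",
        "SimpleLinear Validate",
        "GeneralizedLinear Validate",
    ],
    "RMSE": [1.116563, 0.593547, 0.987753],
})

# Pick the row with the smallest validate RMSE.
best_row = validate_scores.loc[validate_scores["RMSE"].idxmin()]
print(best_row["models"], best_row["RMSE"])
```

Choosing on validate rather than train RMSE guards against picking a model that has merely memorized the training rows.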

### Simple Model on Test

In [20]:
# fits the model on test dataset
test, test_predictions, validate, validate_predictions = mh.test_lm_model(test, x_test, y_test, validate, x_validate, test_predictions, validate_predictions)

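In [22]:
# Adding the test results to compare
# mean_squared_error returns MSE, so take the square root to get RMSE
test_rmse = np.sqrt(mean_squared_error(test_predictions.Happiness_Score, test_predictions.lm_predictions))
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
evaluate_df = pd.concat(
    [evaluate_df, pd.DataFrame([{'models': 'SimpleLinear Test', 'RMSE': test_rmse}])],
    ignore_index=True)
evaluate_df

| | models | RMSE |
|:--|:--|:--|
| 0 | Baseline Train | 1.116563 |
| 1 | SimpleLinear Train | 0.562486 |
| 2 | GeneralizedLinear Train | 0.990515 |
| 3 | Baseline Validate | 1.116563 |
| 4 | SimpleLinear Validate | 0.593547 |
| 5 | GeneralizedLinear Validate | 0.987753 |
| 6 | SimpleLinear Test | 0.54235 |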


### Conclusions
* SimpleLinear Regression model scores:
    * 0.562486 RMSE on training data samples
    * 0.593547 RMSE on validate data samples
    * approximately 0.5424 RMSE on test data samples (the square root of 0.294145, the mean squared error that `mean_squared_error` returns)

#### Key Takeaway:
The SimpleLinear Regression model beat the baseline RMSE of 1.116563 on the train, validate and test data sets.

### Recommendations

* Consider the age of survey respondents as a feature
* Consider the gender of survey respondents as a feature
* Consider gathering data seasonally