# Brief Description of My Dataset

**World Happiness Report**

The World Happiness Report describes the state of happiness in different parts of the world. It ranks countries (158 in the 2015 edition used here) by their happiness levels, which depend on factors such as economic production, social support, and life expectancy.

The happiness scores and rankings use data from the Gallup World Poll and are based on answers to the main life evaluation question asked in the poll. The dataset I used contains the values for the year 2015.

The dataset consists of 12 columns and 158 rows, one row per country (158 unique country names).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

plt.style.use('bmh')

world_happiness = pd.read_csv('../input/world-happiness/2015.csv') # Read the 2015.csv file into a pandas dataframe
world_happiness.head() # First 5 rows

In [None]:
world_happiness.columns

The 'Country' and 'Region' columns show where the survey was conducted, and 'Happiness Rank' is the country's rank among the others based on its 'Happiness Score'. The Happiness Score was measured in 2015 by asking the sampled people: "How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?"

The columns following the happiness score estimate the extent to which each of six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contributes to making life evaluations higher in each country than they are in Dystopia, a hypothetical country whose values equal the world's lowest national averages for each of the six factors.

The residuals, or unexplained components, differ for each country, reflecting the extent to which the six variables either over- or under-explain average 2014-2016 life evaluations. Although some life evaluation residuals are quite large, occasionally exceeding one point on the scale from 0 to 10, they are always much smaller than the calculated value in Dystopia, where the average life is rated at 1.85 on the 0 to 10 scale.

The 'Dystopia Residual' column is the Dystopia happiness score (1.85) plus the residual, that is, the unexplained component, for each country.
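The decomposition described above can be checked numerically. The sketch below uses a tiny made-up table rather than the real file, and the column names in `FACTOR_COLS` are an assumption about how `2015.csv` labels the six factors, so compare them with `world_happiness.columns` before reusing it. If the columns add up as described, the gap between the score and the summed components should be close to zero.

```python
import numpy as np
import pandas as pd

# Assumed column names for the six factors; verify against world_happiness.columns.
FACTOR_COLS = [
    'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
    'Freedom', 'Trust (Government Corruption)', 'Generosity',
]

# Illustrative values only, not taken from the report.
toy = pd.DataFrame({
    'Economy (GDP per Capita)': [1.30, 0.40],
    'Family': [1.20, 0.60],
    'Health (Life Expectancy)': [0.90, 0.30],
    'Freedom': [0.60, 0.35],
    'Trust (Government Corruption)': [0.30, 0.10],
    'Generosity': [0.25, 0.20],
    'Dystopia Residual': [2.50, 1.90],
})
toy['Happiness Score'] = toy[FACTOR_COLS].sum(axis=1) + toy['Dystopia Residual']


def score_gap(df: pd.DataFrame) -> pd.Series:
    """Score minus (six factors + Dystopia Residual); expected to be near 0."""
    explained = df[FACTOR_COLS].sum(axis=1) + df['Dystopia Residual']
    return df['Happiness Score'] - explained


gap = score_gap(toy)

# Unexplained part of each row, using the 1.85 Dystopia baseline quoted above.
residual = toy['Dystopia Residual'] - 1.85
```

On the real data, `score_gap(world_happiness).abs().max()` would show how closely the published columns add up, assuming the names match.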

![Opera Anlık Görüntü_2021-09-06_104918_www.kaggle.com.png](attachment:eacb9541-14a6-4bf7-b576-47336697fa5d.png)

![Opera Anlık Görüntü_2021-09-06_104932_www.kaggle.com.png](attachment:43fef88f-8786-46c0-b307-c103b9eb35e9.png)

![Opera Anlık Görüntü_2021-09-06_104943_www.kaggle.com.png](attachment:ca34e430-d430-40a1-9082-6fd71fa0affc.png)

![Opera Anlık Görüntü_2021-09-06_105043_www.kaggle.com.png](attachment:d020028c-1e9e-4563-b6cc-caa7b0bfd5e7.png)

![Opera Anlık Görüntü_2021-09-06_105054_www.kaggle.com.png](attachment:dd55cc6f-3fd0-48de-82d5-e0193f3288a8.png)

![Opera Anlık Görüntü_2021-09-06_105113_www.kaggle.com.png](attachment:cac9151a-3c3b-4f7f-9151-39b24d46ee47.png)

![Opera Anlık Görüntü_2021-09-06_105124_www.kaggle.com.png](attachment:d2485635-ae01-4e3a-be3f-685c3a399a33.png)

![Opera Anlık Görüntü_2021-09-06_105135_www.kaggle.com.png](attachment:79babbdd-797a-4451-81ad-81836ed0c58e.png)

![Opera Anlık Görüntü_2021-09-06_105159_www.kaggle.com.png](attachment:3fa5d830-553b-41eb-95cb-1240f6167a0d.png)

![Opera Anlık Görüntü_2021-09-06_105224_www.kaggle.com.png](attachment:af22e440-488e-416e-ba82-e8da97902bc2.png)

# Initial Plan for Data Exploration

This analysis is the first step toward building a model that predicts the happiness score from the other factors. The plan is:

1. Data Insight
2. Data Cleaning and Feature Engineering
3. Formulating and Testing Hypotheses
4. Analyzing relationships between variables
5. Conclusion

**1. Data Insight**

In [None]:
world_happiness.nunique(axis=0)

In [None]:
world_happiness.describe()

**2. Cleaning the Dataset and Feature Engineering**

In [None]:
world_happiness.info()

I removed the 'Happiness Rank' column because it is derived directly from the target ('Happiness Score') and adds no independent information.

There aren't any missing values, so we don't need to remove other features.

Nevertheless, I used .dropna(axis=0) to remove any rows with null values in case I missed them.

In [None]:
df_cleaned = world_happiness.dropna(axis=0)

In [None]:
del df_cleaned['Happiness Rank']

In [None]:
df_cleaned.shape
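A quick way to back up the claim that nothing is missing is to count the nulls per column and compare row counts before and after `dropna`. This is a minimal sketch on a hypothetical three-row frame (`toy`, with one gap added on purpose), not on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for world_happiness, with one missing value added on purpose.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Happiness Score': [7.5, np.nan, 4.2],
    'Freedom': [0.6, 0.5, 0.3],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
rows_before = len(toy)
rows_after = len(toy.dropna(axis=0))    # rows that survive the same dropna call
print(missing_per_column)
print(f'dropna removed {rows_before - rows_after} row(s)')
```

Applied to `world_happiness`, all counts being zero and an unchanged row count would confirm that `dropna` is a no-op here.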

In [None]:
plt.figure(figsize=(9, 8))
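# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.histplot(df_cleaned['Happiness Score'], color='g', bins=100, kde=True, stat='density', alpha=0.4);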

The happiness score has an almost normal distribution, with some outliers.

So let's look at the distribution of all of the features by plotting them.

In [None]:
df_cleaned.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)

Now, I will try to find the features that are correlated with the target, 'Happiness Score'. I will save those features in a variable called corr_target.

We have 5 values that are strongly correlated with the happiness score. Next, we must find the features with very few or explainable outliers. Then we can remove the outliers from these features and see which ones still correlate well without them.

In [None]:
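# Correlate only the numeric columns; 'Country' and 'Region' are text
df_num_corr = df_cleaned.corr(numeric_only=True)['Happiness Score']
# Keep the columns whose absolute correlation with the target exceeds 0.5
corr_target = df_num_corr[abs(df_num_corr) > 0.5].sort_values(ascending=False)
print("There are {} values strongly correlated with Happiness Score:\n{}".format(len(corr_target), corr_target))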
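The outlier-removal step mentioned above is not implemented in this notebook. One common option, shown here purely as an illustration and not necessarily the method intended, is the 1.5 × IQR rule: keep only values within 1.5 interquartile ranges of the first and third quartiles, then recompute the correlation. The helper `iqr_mask` and the `toy` data are hypothetical.

```python
import pandas as pd


def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


# Made-up data: six well-behaved points and one extreme feature value.
toy = pd.DataFrame({
    'feature': [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 9.0],
    'target':  [5.0, 5.2, 4.9, 5.4, 5.1, 4.7, 5.0],
})

kept = toy[iqr_mask(toy['feature'])]                  # rows without the outlier
corr_all = toy['feature'].corr(toy['target'])         # correlation on all rows
corr_kept = kept['feature'].corr(kept['target'])      # correlation after filtering
print(f'all rows: {corr_all:.2f}, outliers removed: {corr_kept:.2f}')
```

In this toy case a single extreme point hides an otherwise strong linear relationship, which is the effect the paragraph above is looking for.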

In [None]:
for i in range(0, len(df_cleaned.columns), 4):
    sns.pairplot(data=df_cleaned,
                x_vars=df_cleaned.columns[i:i+4],
                y_vars=['Happiness Score'])

Now we can see some relationships. Most of the features have a linear relationship with the 'Happiness Score'. Some of the data points lie at x=0, which might mean that the feature contributes nothing to the score for those countries.

We can remove these '0' values and repeat the process of finding correlated values.
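A minimal sketch of that idea follows, assuming the zeros should be dropped per feature rather than across the whole row. The helper name `corr_ignoring_zeros` and the `toy` values are made up for illustration.

```python
import pandas as pd


def corr_ignoring_zeros(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric feature with the target, skipping rows where that feature is 0."""
    numeric = df.select_dtypes(include='number')
    result = {}
    for col in numeric.columns.drop(target):
        rows = numeric[numeric[col] != 0]        # drop only this feature's zeros
        result[col] = rows[col].corr(rows[target])
    return pd.Series(result).sort_values(ascending=False)


# Made-up data: two countries report 0 for 'Trust'.
toy = pd.DataFrame({
    'Country': list('ABCDEF'),
    'Trust': [0.0, 0.0, 0.1, 0.2, 0.3, 0.4],
    'Freedom': [0.20, 0.30, 0.10, 0.25, 0.35, 0.50],
    'Happiness Score': [6.0, 5.5, 4.0, 4.6, 5.1, 5.8],
})

with_zeros = toy['Trust'].corr(toy['Happiness Score'])
without_zeros = corr_ignoring_zeros(toy, 'Happiness Score')
print(f"Trust with zeros: {with_zeros:.2f}, without: {without_zeros['Trust']:.2f}")
```

On the real data the call would be `corr_ignoring_zeros(df_cleaned, 'Happiness Score')`; whether a zero really means "absent" rather than a genuine measurement is a judgment call that this sketch does not settle.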

**Log Transformation for Skewed Variables**

In [None]:
def hist_loop(data: pd.DataFrame,
              rows: int,
              cols: int,
              figsize: tuple):
    fig, axes = plt.subplots(rows,cols, figsize=figsize)
    for i, ax in enumerate(axes.flatten()):
        if i < len(data.columns):
            data[sorted(data.columns)[i]].plot.hist(bins=30, ax=ax)
            ax.set_title(f'{sorted(data.columns)[i]} distribution', fontsize=10)
            ax.tick_params(axis='x', labelsize=10)
            ax.tick_params(axis='y', labelsize=10)
            ax.get_yaxis().get_label().set_visible(False)
        else:
            fig.delaxes(ax=ax)
    fig.tight_layout()

In [None]:
# Create a function to list the skewed columns
def skew_df(data: pd.DataFrame, skew_limit: float) -> pd.DataFrame:
    # Skewness of each numeric column (text columns such as 'Country' are skipped)
    skew_vals = data.skew(numeric_only=True)

    # Keep the columns whose absolute skew exceeds the limit above which we log transform
    skew_cols = (skew_vals
                 .sort_values(ascending=False)
                 .to_frame('Skew')
                 .query('abs(Skew) > {}'.format(skew_limit))
    )
    return skew_cols

In [None]:
# Print out skewed columns
skew_cols = skew_df(df_cleaned, 0.75)
skew_cols

In [None]:
# Perform log transformation
for col in skew_cols.index.values:
    df_cleaned['log_' + col] = df_cleaned[col].apply(np.log1p)

In [None]:
# Check skewness on log transformed data
log_df = df_cleaned.filter(regex='^log_', axis=1)
skew_log_cols = skew_df(log_df, 0.75)
skew_log_cols
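To see why `np.log1p` is used for this, here is a small sketch on synthetic right-skewed data (a log-normal sample, not the happiness data). `log1p(x)` computes `log(1 + x)`, so it stays defined at `x = 0`, which matters for features that contain zeros.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strongly right-skewed sample; stands in for a skewed feature column.
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500), name='feature')
logged = np.log1p(raw)              # log(1 + x), defined for x = 0

skew_before = raw.skew()
skew_after = logged.skew()
print(f'skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}')
```

The transform pulls in the long right tail, so the skew shrinks. It does not guarantee that a column falls under the 0.75 limit, which is why the cell above checks the transformed columns again.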

In [None]:
# Plot log columns that have nearly normal distribution
log_df = log_df.drop(skew_log_cols.index, axis=1)
hist_loop(data=log_df.copy(),
          rows=3,
          cols=4,
          figsize=(20,10))

In [None]:
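# df_cleaned already holds every log_ column, so merging log_df back in would join on
# float values without adding anything; instead drop the log columns that are still skewed
df_cleaned = df_cleaned.drop(columns=skew_log_cols.index)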

- The target (Happiness Score) has a nearly normal distribution.
- There are some linear relationships between the features and the target. Linear regression might be suitable for this problem.

In [None]:
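from scipy.stats import pearsonr


def calculate_pvalues(df):
    """Return a matrix of p-values for the pairwise Pearson correlations."""
    # Keep complete rows and numeric columns only
    df = df.dropna().select_dtypes(include='number')
    pvalues = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for r in df.columns:
        for c in df.columns:
            # pearsonr returns (correlation, p-value); .loc avoids chained assignment
            pvalues.loc[r, c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues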

# Hypothesis Testing

The p-values come from the Pearson correlation test, whose null hypothesis is that the two variables have no linear relationship.

1. First Hypothesis
   - H0: Happiness Score and Standard Error don't have a linear relationship
   - H1: Happiness Score and Standard Error have a linear relationship

2. Second Hypothesis
   - H0: Happiness Score and Generosity don't have a linear relationship
   - H1: Happiness Score and Generosity have a linear relationship

3. Third Hypothesis
   - H0: Happiness Score and Trust don't have a linear relationship
   - H1: Happiness Score and Trust have a linear relationship

In [None]:
calculate_pvalues(df_cleaned) 


The smaller the p-value, the stronger the evidence against the null hypothesis. A p-value of 0.05 or less is conventionally treated as statistically significant: data at least this extreme would occur less than 5% of the time if the null hypothesis were true. It is not the probability that the null hypothesis is correct.

The results show that:
- Happiness Score and Generosity don't have a linear relationship
- Happiness Score and Standard Error don't have a linear relationship


More analysis is needed before jumping to a conclusion.
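To make the reading of a single p-value concrete, here is a sketch on synthetic data (not the happiness data). SciPy's `pearsonr` computes its p-value under the null hypothesis that the two samples are uncorrelated, so a small p-value is evidence of a linear association. The variable names and the 0.05 threshold `alpha` are choices made for this example.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 158                                      # same number of rows as the dataset

x = rng.normal(size=n)
y_related = 0.8 * x + rng.normal(size=n)     # built to depend linearly on x
y_unrelated = rng.normal(size=n)             # generated independently of x

r_related, p_related = pearsonr(x, y_related)
r_unrelated, p_unrelated = pearsonr(x, y_unrelated)

alpha = 0.05
for name, r, p in [('related', r_related, p_related),
                   ('unrelated', r_unrelated, p_unrelated)]:
    decision = 'reject H0' if p < alpha else 'fail to reject H0'
    print(f'{name}: r = {r:.3f}, p = {p:.4f} -> {decision}')
```

Note that a significant p-value says nothing about the strength of the relationship; with 158 rows even a weak correlation can fall below 0.05, so the correlation coefficient should be read alongside it.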

# Next Steps in Analyzing This Data
- Apply mutual information regression for feature selection
- Apply Backward Stepwise Regression
- Build a pipeline to preprocess data and run the model on the test set
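As a rough idea of what the first of these steps could look like, the sketch below ranks features with scikit-learn's `mutual_info_regression`. It runs on synthetic data with hypothetical column names (`informative`, `noise`), and it assumes scikit-learn is installed; nothing here is a result from the happiness dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 300

# Synthetic features: one drives the target, the other is unrelated to it.
X = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y = 2.0 * X['informative'] + rng.normal(scale=0.5, size=n)

# Estimated mutual information between each feature and the target (higher = more informative).
mi = pd.Series(
    mutual_info_regression(X, y, random_state=0),
    index=X.columns,
).sort_values(ascending=False)
print(mi)
```

Unlike the Pearson correlation used earlier, mutual information also picks up non-linear dependence, which is one reason to try it before settling on a linear model.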

# Conclusion

Linear regression might be a good fit for this dataset, since most of the features have a linear relationship with the target. Nevertheless, collecting more data would make for a better dataset.