# Introduction

#### This notebook is aimed at beginners in Data Science. It gives a step-by-step guide to EDA. After reading it, you should be more comfortable with EDA and able to present your data with better visuals. You can apply the methods shown here to any dataset.

#### I have divided the notebook into 2 major parts:

#### 1. Data Cleaning
#### 2. Exploratory Data Analysis (EDA)

 **Please upvote, and recommend anything else I should implement.**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

%matplotlib inline

# 1. Data Cleaning

#### If you are new to data science, let me tell you that most of the time your data is not clean, and every data scientist has to do the dirty work of cleaning it first. EDA is like taking pictures from various angles: to look good in photos you have to clean yourself up, wear some good clothes, set your hair, etc., and we are doing the same thing with our data. If you want correct and appropriate visuals, you have to clean your data.

#### Data may have missing values, wrong data types, outliers, etc. So, before you do EDA, check for these things. If they are present, clean the data before moving further, otherwise you can get misleading and incorrect visuals. So let's roll up our sleeves and get ready for cleaning!

In [None]:
# Reading the csv files
df=pd.read_csv('../input/world-happiness-report-2021/world-happiness-report-2021.csv')
df1=pd.read_csv('../input/world-happiness-report-2021/world-happiness-report.csv')

df.head()

In [None]:
df1.head()

In [None]:
# df.info() shows basic information about the data: column names, data types, non-null counts, memory usage, etc.
df.info()

#### Everything looks fine here, as all numeric values are 'float' and strings are 'object', so let's proceed to the next steps. If your data has incorrect data types, you can use [df.astype()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) to change them.
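#### To make the data type fix concrete, here is a minimal sketch on a made-up frame (the `toy` data and its column names are hypothetical, not part of the happiness dataset). It shows numbers that were read in as strings being converted with `astype`.

```python
import pandas as pd

# Hypothetical toy frame: the numeric columns were read in as strings
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'score': ['7.8', '6.1', '4.9'],
                    'rank': ['1', '2', '3']})

# astype accepts a {column: dtype} mapping, so several columns can be converted at once
toy_fixed = toy.astype({'score': 'float64', 'rank': 'int64'})

print(toy_fixed.dtypes)
```

#### After the conversion, numeric operations such as `toy_fixed['score'].mean()` behave as expected.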

In [None]:
# Checking for missing values
df.isnull().sum()

#### As there are no missing values in our data, we can proceed. If your data has missing values, you can either remove them with [df.dropna](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) or substitute them, for example with the mean or mode, using [df.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html). Some models raise an error when missing values are present and others give incorrect results, so it is important to handle them.
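#### Here is a small sketch of both options on a made-up frame (the `toy` data is hypothetical). It drops incomplete rows in one case, and fills a numeric column with its mean and a text column with its mode in the other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({'region': ['East', 'West', None, 'East'],
                    'score': [5.0, np.nan, 7.0, 6.0]})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill each column with a sensible substitute
filled = toy.fillna({'score': toy['score'].mean(),        # mean for numeric data
                     'region': toy['region'].mode()[0]})  # most frequent value for text data

print(dropped)
print(filled)
```

#### Dropping is simplest but throws data away. Filling keeps every row, at the cost of introducing values that were never observed.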

# 2. EDA + ML

#### To identify outliers in the numeric columns and understand the spread of the data, we will use a box plot, which is commonly used for this purpose.

![box plot](https://miro.medium.com/max/18000/1*2c21SkzJMf3frPXPAR_gZA.png)

In [None]:
# Visualization of numeric columns

fig, ax=plt.subplots(3,6, figsize=(15,10)) # Creates a grid of 3 rows and 6 columns as we have 18 numeric columns.
numeric_col=df.select_dtypes('float64').columns # Selects the columns of a particular data type
for num_col, axis in zip(numeric_col, ax.ravel()): # ax.ravel() flattens the 2D grid of axes so we can iterate over it
    sns.boxplot(x=num_col, data=df, ax=axis)

plt.tight_layout() # Adjusts the spacing so that the subplots do not overlap

#### Many of the box plots show points beyond the whiskers, that is, more than 1.5 times the interquartile range above the 75th percentile or below the 25th percentile. The data can vary greatly from country to country, so these values need not be treated as outliers. Whether a value is an outlier is ultimately a decision for a domain expert. Some cases are clear-cut, for example an 'age' cannot be below 0 and is very unlikely to exceed 100, but here I am not removing these values, as some countries may genuinely have such unusual statistics.
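#### The points a box plot draws individually can also be found numerically. This is a minimal sketch of the usual 1.5 x IQR rule on made-up numbers (the `values` series is hypothetical). Seaborn's default whisker length is 1.5, so this matches what the plots above flag.

```python
import pandas as pd

# Hypothetical data: nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1                       # interquartile range
lower = q1 - 1.5 * iqr              # anything below this is flagged
upper = q3 + 1.5 * iqr              # anything above this is flagged

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

#### The same lines can be applied to any numeric column of `df`, for example `df['Ladder score']`, to list the flagged rows before deciding whether they are genuine.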

#### The data cleaning part is over, so now let's move on to EDA, which you are probably eager to do. I want to mention that this data was already clean, so we didn't face any problems and this part went quickly, but that won't be the case for other datasets. So, if you are working hard at cleaning your data, don't get frustrated; be patient.

#### We are using geopandas to display the happiness index, as it shows at a glance how happy each part of the world is. It is a library for working with geospatial data. To dive deeper, you can go through this: [Geopandas](https://geopandas.org/).

#### Don't worry too much about geopandas for now; the code is simple and you can easily reuse it for your own dataset.

In [None]:
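# Getting country details for plotting them on a map.
# Geopandas ships a built-in dataset for this, so you don't have to search for it elsewhere.
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0. On newer versions,
# download the Natural Earth data yourself and pass its path to gpd.read_file.

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))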
world.head()

#### The 'geometry' column holds the shape (a polygon or multipolygon) of each country's borders. One of them is displayed below.

In [None]:
world['geometry'][1]

In [None]:
# Checking for differing country names. There can be a mismatch in the spelling of country names between 'df' and 'world'
country_data = list(df['Country name'].unique())
country_geo = list(world['name'])

country_diff = [country for country in country_data if country not in country_geo]
country_diff
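#### An alternative way to spot names that will not match is a merge with `indicator=True`. This is only a sketch on two made-up frames (`left` and `right` are hypothetical stand-ins for 'df' and 'world').

```python
import pandas as pd

# Hypothetical stand-ins for the two tables whose country names must agree
left = pd.DataFrame({'Country name': ['Finland', 'United States', 'Kosovo']})
right = pd.DataFrame({'name': ['Finland', 'United States of America', 'Kosovo']})

# indicator=True adds a '_merge' column saying where each row was found
check = left.merge(right, left_on='Country name', right_on='name',
                   how='left', indicator=True)

# 'left_only' rows exist in the left table but have no partner in the right one
unmatched = check.loc[check['_merge'] == 'left_only', 'Country name'].tolist()
print(unmatched)
```

#### This gives the same information as the list comprehension above. It is convenient when you are going to merge the tables anyway.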

In [None]:
# Renaming the countries in 'df' to match the spelling used in 'world'; you can also do it the other way round.
# 'Swaziland' is listed as 'eSwatini' in the Natural Earth data (it is not Switzerland).
df['Country name'] = df['Country name'].replace({'United States' : 'United States of America',
                                          'Taiwan Province of China':'Taiwan',
                                          'Bosnia and Herzegovina':'Bosnia and Herz.',
                                          'Dominican Republic':'Dominican Rep.',
                                          'North Cyprus':'N. Cyprus',
                                          'Swaziland':'eSwatini'})

In [None]:
# Function for plotting on the world map.
# It takes 3 arguments - the data frame, the column to be plotted and the title of the plot

def plot_on_worldmap(df, col_to_map, title=None):
    # Join df with the world geometries on the country name
    mapped = world.set_index('name').join(df.set_index('Country name')).reset_index()

    # Minimum and maximum of the column, used to fix the colour scale
    vmin, vmax = df[col_to_map].min(), df[col_to_map].max()

    fig, ax = plt.subplots(1, figsize=(20,10))

    # dropna() removes rows with missing values, i.e. countries with no matching row in df
    mapped.dropna().plot(column=col_to_map, cmap='Blues', linewidth=0.8, legend=True, ax=ax,
                         vmin=vmin, vmax=vmax, edgecolors='0.8', legend_kwds={'shrink': 0.5})

    ax.set_title(title, fontdict={'fontsize':20})

In [None]:
plot_on_worldmap(df, 'Ladder score', 'Happiness Index')

#### North America looks like a happy continent, as most of its countries have a very high happiness index, while Africa overall doesn't seem to be happy.

#### Now that we have an idea of how the happiness index varies across countries and continents, let's explore more!

In [None]:
# The most and least happy country
# df.sort_values sorts by 'Ladder score' in ascending order by default
# With ignore_index=True the sorted frame gets a fresh 0..n-1 index; with False the original index labels are kept

sorted_countries=df.sort_values(by='Ladder score', ignore_index=True)['Country name']
least_happy_country=sorted_countries.iloc[0]
most_happy_country=sorted_countries.iloc[-1]

print(f'The most happy country is {most_happy_country}, and the least happy country is {least_happy_country}')

#### We have many factors, like GDP, Social support, Health, Freedom, Corruption, Dystopia, and Generosity, for deciding how happy a country is. Now let's find out which factor influences the happiness index the most. Before you go further, what do you feel is most important for happiness? Please comment below, I am curious to know what you think!

#### For this purpose, we will build a Machine Learning model and use its [feature importance](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) to see which factor is most important for happiness.
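#### Before applying this to the happiness data, here is a toy sketch of what feature importance reports. The data is synthetic and the column names 'strong', 'weak' and 'noise' are made up for illustration: the target depends heavily on one column, slightly on another, and not at all on the third.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features: three independent random columns
toy_X = pd.DataFrame({'strong': rng.normal(size=300),
                      'weak': rng.normal(size=300),
                      'noise': rng.normal(size=300)})

# The target is driven mostly by 'strong', a little by 'weak', and not by 'noise'
toy_y = 5 * toy_X['strong'] + 0.5 * toy_X['weak'] + rng.normal(scale=0.1, size=300)

toy_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# feature_importances_ holds one value per column; the values sum to 1
toy_importances = pd.Series(toy_model.feature_importances_,
                            index=toy_X.columns).sort_values(ascending=False)
print(toy_importances)
```

#### The importances are relative shares, so they tell you which columns the forest relied on most, not how large each effect is in the original units.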

In [None]:
from sklearn.ensemble import RandomForestRegressor

y=df['Ladder score']

col_to_consider=['Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual']

X= df[col_to_consider]

#### You might be wondering why I am using the Random Forest algorithm, and why I am using the 'Explained by:' columns rather than the other columns to build the ML model. The answer to the first question is that Random Forest exposes a ready-to-use 'feature_importances_' attribute. You can also use [XGBoost](https://stackoverflow.com/questions/37627923/how-to-get-feature-importance-in-xgboost) and [Catboost](https://catboost.ai/docs/concepts/python-reference_catboost_get_feature_importance.html) for this purpose, but Random Forest takes a little less time to train, so I am using that.
#### As for the second question, you can also use the prior columns, as they carry the same information as the 'Explained by:' columns. Let me show you:

In [None]:
prior_cols=['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia']

# 2 x 4 grid for the 7 column pairs (the last axis stays empty)
fig, ax=plt.subplots(2,4, figsize=(10,8))
# Use a separate loop variable so the list 'prior_cols' is not overwritten
for prior_col, after_col, axis in zip(prior_cols, col_to_consider, ax.ravel()):
    sns.scatterplot(x=prior_col, y=after_col, data=df, ax=axis)

plt.tight_layout()

#### The prior columns and the 'Explained by:' columns are linearly related (except for Dystopia); each 'Explained by:' column is just a rescaled version of its prior column. So, the ML model should give similar results whether you feed it the prior columns or the 'Explained by:' columns.
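#### What a straight line in these scatter plots means can be checked with a correlation. This is a sketch on synthetic numbers: the `raw` series, the coefficient 0.3 and the baseline are all made up, not taken from the report. A column that is only shifted and rescaled has a correlation of 1 with the original.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical 'prior' column, e.g. something on the scale of a logged GDP
raw = pd.Series(rng.uniform(7, 11, size=100))

# Hypothetical 'explained by' column: subtract a baseline and multiply by a coefficient
explained = 0.3 * (raw - raw.min())

corr = raw.corr(explained)   # Pearson correlation
print(round(corr, 6))
```

#### On the real data you could run the same check with, for example, `df['Logged GDP per capita'].corr(df['Explained by: Log GDP per capita'])`. A rescaling with a positive coefficient keeps the ordering of the values, which is what tree-based models split on.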

In [None]:
# n_estimators - number of trees to build
# max_depth - maximum depth to which a tree can grow; if it is set too high, the model can overfit
# min_samples_leaf - minimum number of samples a leaf must have
# min_samples_split - minimum number of samples a node must have to be split further

random_forest= RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=20, min_samples_split=40).fit(X,y)

#### I am not scaling the data, as it is not required for Random Forest (to know more, click [here](https://stackoverflow.com/questions/8961586/do-i-need-to-normalize-or-scale-data-for-randomforest-r-package#:~:text=No%2C%20scaling%20is%20not%20necessary,%2C%20aren't%20so%20important.)). Also, I am not splitting the data into train and test sets, because the focus is on finding the features that contribute most to happiness, not on predicting happiness from given features.

In [None]:
# Getting the feature importance
feature_importances_=random_forest.feature_importances_
feature_importances=pd.DataFrame({'Feature_name':col_to_consider, 'Feature_importance':feature_importances_})

fig, ax=plt.subplots(1, figsize=(15,8))
sns.barplot(x='Feature_name', y='Feature_importance', data=feature_importances, ax=ax)

# For making the graph look good
plt.xticks(fontsize=12, rotation=30); # Rotating the names by 30 degrees so that they do not overlap
plt.yticks(fontsize=14);

plt.xlabel('Feature name',fontsize=18)
plt.ylabel('Feature importance',fontsize=18)

#### Wow! Did you get it right? To be honest, I thought 'Freedom to make choices' would have more influence on the Happiness index. 

#### Let's do some further exploration

In [None]:
# Happy Vs Unhappy Countries
threshold=df['Ladder score'].mean() # Countries at or above the threshold count as happy; you can choose another threshold value

df['Happy_Unhappy']=df['Ladder score'].apply(lambda x: 1 if x>=threshold else 0)

plot_on_worldmap(df, 'Happy_Unhappy', 'Happy and Unhappy Countries')

#### North and South America, Australia and the western European countries are very happy, while Asian and African countries are not as happy.

## EDA using Seaborn

#### Now we will do EDA using the [seaborn](https://seaborn.pydata.org/) library. If you have no idea which plots will suit your data, and you want a quick look at how the columns relate to each other, you can use [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) in seaborn.

In [None]:
# Let's see how happy and unhappy countries vary across the different parameters
# 0 => Unhappy (Blue), 1 => Happy (Orange)
# x_vars => variables on the x-axis, y_vars => variables on the y-axis, hue => column of interest
ax=sns.pairplot(x_vars=col_to_consider, y_vars=col_to_consider, hue='Happy_Unhappy', data=df, height=3)

# use plt.savefig('figure_name.png') for saving the image

#### Tip:
#### Use 'pairplot' for getting a detailed view of the data.
#### If both columns are numeric => relplot, regplot, lmplot.
#### If both columns are categorical or one categorical and one numeric => catplot, barplot, countplot.

## regplot, relplot and lmplot - Both numeric variables

#### All three of [regplot](https://seaborn.pydata.org/generated/seaborn.regplot.html), [relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html) and [lmplot](https://seaborn.pydata.org/generated/seaborn.lmplot.html) can be used when both of your variables are numeric. I will show you the basics of 'lmplot'; once you have practised and become confident with it, you will find that the other two are a piece of cake!

In [None]:
sns.lmplot(x='Ladder score', y='Logged GDP per capita', data=df)

#### Now, if you want to add a new dimension to this, it can be done in two ways - 
#### 1. hue (uses the same image for displaying the 3rd dimension)
#### 2. col or row (uses multiple columns or rows for displaying the 3rd dimension)

#### Caution: The column used for the new dimension should be categorical. If you want to do this with a numeric column, you first have to convert it into a categorical one by using [pd.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)
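#### As a minimal sketch of that conversion, the following bins a made-up numeric series into three labelled bands with `pd.cut`. The `scores` values, the bin edges and the labels are all hypothetical; choose edges that make sense for your own column.

```python
import pandas as pd

# Hypothetical numeric values, e.g. happiness scores
scores = pd.Series([2.5, 4.9, 5.5, 6.2, 7.8])

# Bin edges and one label per interval: (0, 5], (5, 6.5], (6.5, 10]
bins = [0, 5, 6.5, 10]
labels = ['low', 'medium', 'high']

score_band = pd.cut(scores, bins=bins, labels=labels)
print(score_band.tolist())
```

#### The result is a categorical series, so it can be stored as a new column and passed to 'hue', 'col' or 'row'.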

In [None]:
# 1. using 'hue'
sns.lmplot(x='Ladder score', y='Logged GDP per capita', data=df, hue='Happy_Unhappy')

In [None]:
# 2. using 'col'
sns.lmplot(x='Ladder score', y='Logged GDP per capita', data=df, col='Happy_Unhappy')

## Catplot - One numeric and One categorical variable

#### Catplot has many types (kinds) of plotting the data - stripplot, violinplot, boxenplot, pointplot, boxplot, swarmplot

In [None]:
kinds=['strip','violin','boxen','point','box','swarm']

for kind in kinds:
    #By changing the 'kind', you can have various type of plots
    ax=sns.catplot(y='Logged GDP per capita', x='Happy_Unhappy', data=df, kind=kind)
    ax.fig.suptitle(f'{kind} plot')

#### You can use whichever of these you like. Each one has its own purpose; you can go through this for more details: [Catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html)

## EDA using Plotly

#### Plots made with seaborn and matplotlib are static, so you can't easily interact with them; Plotly makes this easy. It is an open-source library that is widely used for visualizations and interactive plots. I am covering the basic plots in Plotly; for more details you can go through the [Plotly documentation](https://plotly.com/python/).

In [None]:
# Importing Plotly
import plotly.express as px

## 1. Scatterplot

In [None]:
# color => exactly like 'hue' in seaborn
# hover_name => shows the name related to the data point, when mouse is hovered over it

fig1=px.scatter(data_frame=df,x='Ladder score', y='Logged GDP per capita', color='Happy_Unhappy', hover_name='Country name')
fig1.update_layout(title=dict(text='Happiness Index Vs GDP per capita', xanchor='center', yanchor='top', x=0.5))
fig1.show()

## 2. Sunburst plot

In [None]:
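# Adding Continent to df
# rename returns a new frame, which avoids assigning column names on a slice of 'world'
continent_country=world[['name','continent']].rename(columns={'name':'Country name', 'continent':'Continent'})
df=df.merge(continent_country, on='Country name')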

In [None]:
fig2=px.sunburst(data_frame=df, path=['Continent','Country name'], values='Ladder score')
fig2.update_layout(title=dict(text='Happiness Index across Continents and Countries', xanchor='center', yanchor='top', x=0.5))
fig2.show()

## 3. Barplot

In [None]:
# numeric_only=True averages the numeric columns and skips the text columns
df3=df.groupby(['Continent', 'Country name']).mean(numeric_only=True).reset_index()

fig3=px.bar(data_frame=df3, x='Continent', y='Logged GDP per capita', 
            color='Happy_Unhappy', hover_name='Country name', color_continuous_scale='burg')
fig3.update_layout(title=dict(text='GDP per capita across Continents', xanchor='center', yanchor='top', x=0.5))

fig3.show()

## 4. Treemap

In [None]:
# numeric_only=True averages the numeric columns and skips the text columns
df4=df.groupby(['Continent', 'Country name']).mean(numeric_only=True).reset_index()
df4['World']='World'

fig4=px.treemap(data_frame=df4, path=['World', 'Continent', 'Country name'], values='Healthy life expectancy')
fig4.update_layout(title=dict(text='Healthy life expectancy across Continents', xanchor='center', yanchor='top', x=0.5))
fig4.show()

In [None]:
from sklearn.preprocessing import MinMaxScaler # For bringing the values to scale

col=col_to_consider.copy()
col.append('Continent')

# Creating df_continent, having continents and their corresponding mean values of all columns
df_continent=df[col].groupby(['Continent']).mean().reset_index()

# Scaling the values to a common 0-1 range for easy comparison
scaler=MinMaxScaler()
df_continent[col_to_consider]=scaler.fit_transform(df_continent[col_to_consider])
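#### To see what this scaling does, here is a small sketch comparing MinMaxScaler with the formula (x - min) / (max - min) computed by hand. The `toy` frame and its columns 'a' and 'b' are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame with two columns on very different scales
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 30.0, 20.0]})

# Min-max scaling by hand, column by column
by_hand = (toy - toy.min()) / (toy.max() - toy.min())

# The same thing with scikit-learn (returns a NumPy array)
by_sklearn = MinMaxScaler().fit_transform(toy)

print(by_hand)
print(np.allclose(by_hand.values, by_sklearn))
```

#### In each column, the smallest value becomes 0 and the largest becomes 1. For the continent averages, this means the lowest continent on a factor sits at 0 and the highest at 1, which is worth keeping in mind when reading the polar plots.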

## 5. Polar plot

In [None]:
import plotly.graph_objects as go

# Plotting the comparative Polar plot for continent1 and continent2
def plot_polar(continent1,continent2):
    
    theta=df_continent.columns[1:]
    r1= df_continent[df_continent['Continent']==continent1].iloc[:,1:].values.flatten().tolist()
    r2= df_continent[df_continent['Continent']==continent2].iloc[:,1:].values.flatten().tolist()

    graph1=go.Scatterpolar(r = r1,theta = theta,fill = 'toself',name=continent1)
    graph2=go.Scatterpolar(r = r2,theta = theta,fill = 'toself',name=continent2)
    
    data = [graph1, graph2]
    fig = go.Figure(data = data)
    fig.update_layout(title=dict(text='Continent comparison', xanchor='center', yanchor='top', x=0.5))
    fig.show()

In [None]:
plot_polar('Africa','Asia')

In [None]:
plot_polar('Europe','Asia')

In [None]:
df1.head()

In [None]:
# Renaming the countries in 'df1' in the same way as was done for 'df' above.
# 'Swaziland' is listed as 'eSwatini' in the Natural Earth data (it is not Switzerland).
df1['Country name'] = df1['Country name'].replace({'United States' : 'United States of America',
                                          'Taiwan Province of China':'Taiwan',
                                          'Bosnia and Herzegovina':'Bosnia and Herz.',
                                          'Dominican Republic':'Dominican Rep.',
                                          'North Cyprus':'N. Cyprus',
                                          'Swaziland':'eSwatini'})

In [None]:
# GDP per capita across countries
df_choropleth=df1.groupby(['Country name','year']).mean(numeric_only=True).reset_index()

fig = px.choropleth(data_frame=df_choropleth, locations='Country name',locationmode="country names", 
                    color='Log GDP per capita', projection='orthographic', 
                    color_continuous_scale=[(0, "red"), (0.5, "white"), (1, "blue")])

fig.update_layout(title=dict(text='GDP per capita across countries', xanchor='center', yanchor='top', x=0.47))
fig.show()

In [None]:
category_order=[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
df_choropleth=df1.groupby(['year','Country name']).mean(numeric_only=True).reset_index()

fig = px.choropleth(data_frame=df_choropleth, locations='Country name',locationmode="country names", 
                    color='Log GDP per capita',animation_frame='year',
                    color_continuous_scale=[(0, "red"), (0.5, "white"), (1, "blue")])

fig.update_layout(title=dict(text='GDP per capita across countries from 2005 to 2020', xanchor='center', yanchor='top', x=0.47))
fig.show()

#### This was a brief guide to doing EDA. I have just scratched the surface; there are many parameters in each type of plot which you can explore. If you are a beginner in Data Science, I highly recommend going through the [seaborn library tutorial](https://seaborn.pydata.org/tutorial.html), as it is a widely used and user-friendly library. You can also go through the [plotly library](https://plotly.com/python/) for interactive plots. Sooner or later you will need to plot your data, and knowing these tools will put you ahead of others.

#### Hope you find this helpful. If you like this notebook, don't forget to upvote!