![](https://i.makeagif.com/media/11-19-2015/Tzr09f.gif)
# Introduction
According to the World Health Organisation (WHO), malnutrition comes in numerous forms, including:
* Undernutrition, which covers wasting, stunting and underweight
* Inadequate vitamins and minerals 
* Overweight and obesity

An estimated 1.9 billion adults are overweight or obese, while approximately 462 million are underweight. WHO also estimates that around 45% of deaths among children under 5 are linked to undernutrition. These deaths mostly occur in low and middle-income countries. Strikingly, rates of overweight and obesity are rising in these same countries. In this exploratory data analysis, we will dive deeper into the topic and identify key issues behind the prevalence of malnutrition across the world.

## Importing required packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.impute import SimpleImputer
from scipy import stats

## Overview of the data

In [None]:
def overview():
    
    data = pd.read_csv("../input/malnutrition-across-the-globe/country-wise-average.csv")
    data1 = pd.read_csv("../input/malnutrition-across-the-globe/malnutrition-estimates.csv")
    # Print the first 5 lines of data
    print("First 5 lines of data \n\n")
    print(data.head())
    print("\n")
    print(data1.head())
    
    # Print data type
    print("\n\n\nDatatype\n")
    print(data.dtypes)
    print("\n")
    print(data1.dtypes)
    
    # Print number of null values 
    print("\n\n\nNumber of null values\n")
    print(data.isnull().sum())
    print("\n")
    print(data1.isnull().sum())
    
    # Print data summary
    print("\n\n\nData summary\n")
    print(data.describe())
    print("\n")
    print(data1.describe())
    
    # Print data shape
    print("\n\n\nData shape\n")
    print("Data has {} rows and {} columns".format(data.shape[0], data.shape[1]))
    print("\n")
    print("Data1 has {} rows and {} columns".format(data1.shape[0], data1.shape[1]))
    
    return data, data1

data, data1 = overview()

## Deal with NaN values
- I usually remove rows with NaN values if the percentage of NaN values in a column is below 5%.
- For `data`, we can remove rows with NaN values in wasting, overweight, stunting and underweight. For severe wasting, however, I will replace NaN values with the median.
- For `data1`, we can remove rows with NaN values in stunting, survey sample (N), notes and underweight. NaN values in the remaining columns will be replaced with the median.
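
The 5% rule of thumb above can be checked programmatically rather than by eye. The following is a minimal sketch on a made-up toy frame (the values and the `THRESHOLD` name are assumptions for illustration, not part of the dataset): columns with a small share of missing values have their rows dropped, and the rest are filled with the column median.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names echo the dataset.
toy = pd.DataFrame({
    "Wasting": [5.0, 7.5, np.nan] + [6.0] * 37,      # 1 of 40 missing (2.5%)
    "Severe Wasting": [np.nan] * 10 + [2.0] * 30,    # 10 of 40 missing (25%)
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100

THRESHOLD = 5  # percent; the rule of thumb used in this notebook

# Columns below the threshold: drop the affected rows
drop_cols = missing_pct[(missing_pct > 0) & (missing_pct < THRESHOLD)].index.tolist()
# Columns at or above the threshold: impute with the median instead
impute_cols = missing_pct[missing_pct >= THRESHOLD].index.tolist()

cleaned = toy.dropna(subset=drop_cols).copy()
cleaned[impute_cols] = cleaned[impute_cols].fillna(cleaned[impute_cols].median())

print(missing_pct)
print(cleaned.isnull().sum())
```

Printing `missing_pct` first makes it easy to see which columns fall on which side of the threshold before any rows are removed.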

In [None]:
data = data.dropna(subset = ['Wasting', 'Overweight', 'Stunting', 'Underweight'])
data1 = data1.dropna(subset = ['Stunting', 'Underweight', 'Survey Sample (N)', 'Notes'])

imputer = SimpleImputer(missing_values = np.nan, strategy = 'median')
data[['Severe Wasting']] = imputer.fit_transform(data[['Severe Wasting']])
data1[['Severe Wasting', 'Wasting', 'Overweight']] = imputer.fit_transform(data1[['Severe Wasting', 'Wasting', 'Overweight']])

In [None]:
# Check for NaN
print(data.isnull().sum())
print(data1.isnull().sum())

## Dealing with outliers
![](https://i.imgur.com/efYiKtP.gif)
- To deal with outliers, we use Z-score. 
- A Z-score is a numerical measurement used in statistics of a value’s relationship to the mean (average) of a group of values, measured in terms of standard deviations from the mean. 
    - If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. 
    - A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. 
    - Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
- In most cases a threshold of 3 is used: if a data point's Z-score is greater than 3 or less than -3, it is identified as an outlier.
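
To make the threshold concrete, here is a small sketch on a made-up list of numbers (not taken from the dataset). It computes each Z-score as (value - mean) / standard deviation and keeps only values whose absolute Z-score is below 3. The population standard deviation is used here because that is the default of `scipy.stats.zscore` (`ddof=0`).

```python
import statistics

# Made-up sample: 19 ordinary values and one extreme value (50)
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 12, 10, 11, 10, 9, 11, 10, 50]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (ddof=0)

# Z-score: distance from the mean, measured in standard deviations
z_scores = [(v - mean) / std for v in values]

# Keep values with |Z| < 3; everything else is treated as an outlier
kept = [v for v, z in zip(values, z_scores) if abs(z) < 3]
removed = [v for v, z in zip(values, z_scores) if abs(z) >= 3]

print("kept:", kept)
print("removed:", removed)
```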

In [None]:
# Create helper functions that list the categorical and numerical columns
# of a DataFrame, based on each column's dtype
def cat_variable(df):
    return list(df.select_dtypes(include = ['category', 'object']))

def num_variable(df):
    return list(df.select_dtypes(exclude = ['category', 'object']))

categorical_variable = cat_variable(data)
numerical_variable = num_variable(data)
categorical_variable1 = cat_variable(data1)
numerical_variable1 = num_variable(data1)

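# Create a function to process outlier data
def outlier(df, columns):
    # Absolute Z-score of every value in the given numerical columns
    z = np.abs(stats.zscore(df[columns]))
    # Keep only rows where every column has a Z-score between -3 and 3
    return df[(z < 3).all(axis=1)]

# Each DataFrame is filtered using its own numerical columns
data = outlier(data, numerical_variable)
data1 = outlier(data1, numerical_variable1)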

## Income level
Here, we convert the numeric codes in the income classification column to their real meaning. This allows readers to understand the context without having to constantly go back to the description.

In [None]:
# Convert the income classification from float to int
data['Income Classification'] = data['Income Classification'].astype('int')

# Create a new column to represent income level
def func(row):
    if row == 0:
        return 'Low income'
    elif row == 1:
        return 'Lower middle income'
    elif row == 2:
        return 'Upper middle income'
    else:
        return 'High income'

data['Income level'] = data.apply(lambda x: func(x['Income Classification']), axis=1)

# Plot countplot
plt.figure(figsize = (10,8))
sns.countplot(data = data, x = 'Income level').set_title('Income level')

We can see that a large share of countries fall into the upper middle income and lower middle income groups.

## Look into the extent of wasting across income levels
- In child malnutrition statistics, wasting refers to low weight-for-height, where muscle and fat tissue "waste" away.
- This could be due to a lack of food or proper nutrition, or to a debilitating disease.

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = data, x = 'Income level', y = 'Severe Wasting').set_title("Severe wasting among different income")

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = data, x = 'Income level', y = 'Wasting').set_title("Wasting among different income")

Low income countries tend to have a higher level of wasting. Wasting is caused by numerous factors, and one of them is low energy intake.
People in these countries often survive on fewer than three meals a day and hence suffer from severe malnutrition. Their bodies switch to survival mode and their muscles are stripped away to preserve energy.

## Overweight

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = data, x = 'Income level', y = 'Overweight').set_title("Overweight among different income")

Unsurprisingly, we see a high level of overweight issues in high income and upper middle income countries. As more people in these countries can afford their meals, we tend to see a greater level of overweight issues.

## Stunting
- Stunting is the impaired growth and development that children experience from poor nutrition, repeated infection, and inadequate psychosocial stimulation. 

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = data, x = 'Income level', y = 'Stunting').set_title("Stunting among different income")

Both low and lower middle income countries have high median stunting rates. The likely reasons are similar to those given for wasting above.

## Underweight

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = data, x = 'Income level', y = 'Underweight').set_title("Underweight among different income")

Both low and lower middle income countries have high rates of underweight issues. The likely reasons are similar to those given for wasting above.

## Have malnutrition issues improved over the years for Bangladesh?
- Here, I will pick countries with more survey years, since this makes the change over time easier to see.

In [None]:
# Choosing the country
bangladesh = data1[data1['Country'] == 'BANGLADESH']

# Indexing the year
bangladesh = bangladesh.set_index('Year')
# Creating time series
axes = bangladesh[["Severe Wasting", "Wasting", "Overweight", "Stunting", "Underweight"]].plot(figsize=(11, 9), subplots=True, linewidth=1)

We observed an improvement in severe wasting, wasting, overweight, stunting and underweight issues over the years.

## Have malnutrition issues improved over the years for Kuwait?

In [None]:
# Choosing the country
kuwait = data1[data1['Country'] == 'KUWAIT']

# Indexing the year
kuwait = kuwait.set_index('Year')
# Creating time series
axes = kuwait[["Severe Wasting", "Wasting", "Overweight", "Stunting", "Underweight"]].plot(figsize=(11, 9), subplots=True, linewidth=1)

For Kuwait we observed fluctuating values, with an improvement from 2007 onwards. One possible explanation is that it took Kuwait quite some time to recover from the war, although the data alone cannot confirm this.

## Have malnutrition issues improved over the years for Chile?

In [None]:
# Choosing the country
chile = data1[data1['Country'] == 'CHILE']

# Indexing the year
chile = chile.set_index('Year')
# Creating time series
axes = chile[["Severe Wasting", "Wasting", "Overweight", "Stunting", "Underweight"]].plot(figsize=(11, 9), subplots=True, linewidth=1)

We observed a decrease in malnutrition issues in Chile over the years. This is consistent with Chile being a high income country, although the data alone cannot establish the cause.
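
The three country sections above repeat the same steps: filter by country, index by year, and select the five indicators. As a sketch, these steps could be wrapped in one helper. The function name `country_trend` and the toy data below are hypothetical and only for illustration. The helper also sorts by year, which is an addition to the original cells, so that the lines are drawn in chronological order.

```python
import pandas as pd

INDICATORS = ["Severe Wasting", "Wasting", "Overweight", "Stunting", "Underweight"]

def country_trend(df, country, columns=INDICATORS):
    """Return one country's indicators indexed by year, in chronological order."""
    subset = df[df["Country"] == country]
    return subset.set_index("Year")[columns].sort_index()

# Toy stand-in for data1 with made-up values
toy = pd.DataFrame({
    "Country": ["CHILE", "KUWAIT", "CHILE"],
    "Year": [2008, 2001, 1994],
    "Severe Wasting": [0.1, 0.5, 0.2],
    "Wasting": [0.3, 2.0, 0.5],
    "Overweight": [9.0, 6.0, 8.0],
    "Stunting": [2.0, 4.0, 4.0],
    "Underweight": [0.5, 2.0, 0.9],
})

trend = country_trend(toy, "CHILE")
print(trend)
```

With the real data, calling `country_trend(data1, 'BANGLADESH').plot(figsize=(11, 9), subplots=True, linewidth=1)` should produce the same kind of chart as the cells above.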