![](https://ichef.bbci.co.uk/news/660/cpsprodpb/143A0/production/_112184828_gettyimages-939290830.jpg)

Obesity is a medical condition in which excess body fat has accumulated to an extent that it may have a negative effect on health. People are generally considered obese when their body mass index (BMI), a measurement obtained by dividing a person's weight by the square of the person's height, is over 30 kg/m²; the range 25–30 kg/m² is defined as overweight. Obesity increases the likelihood of various diseases and conditions, particularly cardiovascular diseases, type 2 diabetes, obstructive sleep apnea, certain types of cancer, osteoarthritis, and depression.
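
As a quick illustration of the BMI definition above, here is a minimal sketch. The helper names `bmi` and `bmi_category` are hypothetical, the weights and heights are made-up examples, and the cut-offs are just the 25 and 30 thresholds quoted in the paragraph (with 30 itself counted as obese, to match the `>= 30` metric used for this dataset).

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by the square of height (m)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Label a BMI value using only the thresholds quoted in the text."""
    if value >= 30:
        return "obese"
    if value >= 25:
        return "overweight"
    return "below the overweight range"


# Made-up examples, not taken from the dataset
for weight, height in [(95, 1.75), (80, 1.75), (60, 1.75)]:
    value = bmi(weight, height)
    print(f"{weight} kg, {height} m -> BMI {value:.2f} ({bmi_category(value)})")
```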

Here we look at the WHO adults' obesity dataset for each country from 1975 to 2016.

The metric used is the percentage of adults whose Body Mass Index (BMI) is >= 30.

### **Update**:

Thanks to [Ran.Krish](https://www.kaggle.com/rankirsh) and [AndrewHou](https://www.kaggle.com/andrewhou) for the explanation about the estimate range within brackets in the data.

I'll quote them here:

> It is probably a confidence interval; I'll try to explain.
> Researchers can't know exactly what the obesity rate is, so they compute a confidence interval (usually 95%). It means that we are 95% confident that the obesity rate in every sample (of the same size) in this specific country will be between 0.2 and 1.1.

> Thanks, you did a great job explaining.
> Let me just add this for people without a statistical background.
> Researchers aren't able to ask every person in order to compute the exact obesity %; it would cost a lot of money.
> So they just use sampling methods, for example, asking 10% of the people in every city (saturated sampling).
> 0.5 is the point estimate.
> 0.2~1.1 is the interval estimate, which means researchers are 95% or so confident that the REAL obesity % is between 0.2 and 1.1.
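
The point-estimate and interval idea in the quotes above can be sketched with the textbook normal-approximation (Wald) interval for a sample proportion. This is only an illustration: the survey counts are made up, `proportion_ci` is a hypothetical helper, and nothing here is meant to describe how the bracketed ranges in the dataset were actually produced.

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Point estimate and approximate 95% interval for a sample proportion.

    Uses the normal (Wald) approximation; z=1.96 corresponds to ~95%.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because a proportion cannot leave that range
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Made-up survey: 50 of 1000 sampled adults have BMI >= 30
estimate, low, high = proportion_ci(50, 1000)
print(f"{estimate * 100:.1f} [{low * 100:.1f}-{high * 100:.1f}]")
```

A larger sample narrows the interval, which is why the bracketed range can differ a lot between countries and years.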

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Loading the data**

In [None]:
df=pd.read_csv('/kaggle/input/obesity-among-adults-by-country-19752016/data.csv')

Let us see what the data from the downloaded csv file looks like:

In [None]:
df.head()

Check the dimensions of the dataframe df

In [None]:
df.shape

One of the headers is not clearly visible, so let's print it:

In [None]:
df.iloc[0,1]

What are the column titles?

In [None]:
df.columns

Let's drop the unnecessary top 3 rows

In [None]:
df.drop([0,1,2],inplace=True)
df.head()

As we can see, the column titles with the suffixes **.1** and **.2** hold the **'Male'** and **'Female'** values respectively, and the columns titled with only the year (e.g., 2016) hold the overall obesity for both sexes.
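
The **.1** and **.2** suffixes come from how `pd.read_csv` handles repeated header names: it keeps the first occurrence and appends `.1`, `.2`, ... to the later ones. A minimal sketch on a made-up, in-memory CSV (the rows below are invented for illustration and are not the real file):

```python
import io

import pandas as pd

# Toy CSV whose header repeats the year three times and has an empty first cell
csv_text = (
    ",2016,2016,2016\n"
    "Country A,1.0,2.0,3.0\n"
    "Country B,4.0,5.0,6.0\n"
)

toy = pd.read_csv(io.StringIO(csv_text))

# The empty header cell becomes 'Unnamed: 0'; the repeated years get suffixes
print(list(toy.columns))
# ['Unnamed: 0', '2016', '2016.1', '2016.2']
```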

Also, reset the indices:

In [None]:
df.reset_index(drop=True,inplace=True)

`drop=True` is used to delete the 'index' column which is created if we do `df.reset_index()` without `drop=True`
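
A small sketch of that difference, using a made-up two-row frame whose index starts at 3 (as ours does after dropping the top rows):

```python
import pandas as pd

# Toy frame with a leftover index of 3, 4
toy = pd.DataFrame({'Country': ['Country A', 'Country B']}, index=[3, 4])

kept = toy.reset_index()              # old index is moved into an 'index' column
dropped = toy.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))     # ['index', 'Country']
print(list(dropped.columns))  # ['Country']
print(list(dropped.index))    # [0, 1]
```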

In [None]:
df.head()

Set the first column title to be **Country**

In [None]:
df.rename(columns={'Unnamed: 0': 'Country'}, inplace=True)

In [None]:
df.head()

Here comes the important part.

What I personally want to do right here is have 4 columns, namely, **Country, Year, Obesity (%) and Sex** for easier Exploratory Data Analysis and Visualizations.

I used `df.melt()` for this. It essentially "unpivots" the data from a wide format (one column per year and sex) to a long format (one row per country, year and sex).

The documentation for `melt` can be viewed [here](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)

In [None]:
ndf = df.melt('Country', var_name='Year', value_name='Obesity (%)')
ndf[['Year', 'Sex']] = ndf['Year'].str.split('.', expand=True)
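
To see what these two lines do, here is the same melt-then-split pattern on a tiny made-up frame (the country names and values are invented, and the `'value [low-high]'` cell format is an assumption based on the bracketed ranges discussed in the update above):

```python
import pandas as pd

# Wide toy frame: one column per year/sex combination
wide = pd.DataFrame({
    'Country': ['Country A', 'Country B'],
    '2016': ['1.0 [0.5-1.5]', '4.0 [3.0-5.0]'],
    '2016.1': ['0.8 [0.4-1.2]', '3.5 [2.5-4.5]'],
    '2016.2': ['1.2 [0.6-1.8]', '4.5 [3.5-5.5]'],
})

# Unpivot: every non-'Country' column becomes a (Year, Obesity (%)) row
tidy = wide.melt('Country', var_name='Year', value_name='Obesity (%)')

# '2016.1' -> Year '2016', Sex '1'; plain '2016' has no suffix, so Sex is missing
tidy[['Year', 'Sex']] = tidy['Year'].str.split('.', expand=True)

print(tidy)
```

Two countries times three value columns give six rows, each with a `Year` of `'2016'` and a `Sex` that is missing, `'1'` or `'2'`.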

In [None]:
ndf.head(10)

Now we sort our dataset by **Country** and by **Year**

In [None]:
ndf=ndf.sort_values(by=['Country','Year'])

Resetting the indices:

In [None]:
ndf=ndf.reset_index(drop=True)

In [None]:
ndf.head()

As we can see, the **'Sex'** column holds 3 distinct values (a missing value, '1' and '2') that we need to replace with readable labels.

We do this as follows:

In [None]:
# A missing suffix means the column covered both sexes; fill it after mapping '1' and '2'
ndf['Sex']=ndf['Sex'].map({'1': 'Male', '2':'Female'}).fillna('Both sexes')

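Each obesity value holds two whitespace-separated parts. Let's split them: the second part (the estimate range within brackets) goes into another column, 'Age standardized estimate', and the first part stays in 'Obesity (%)':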

In [None]:
ndf['Age standardized estimate']=ndf['Obesity (%)'].apply(lambda x:x.split()[1])

In [None]:
ndf['Obesity (%)']=ndf['Obesity (%)'].apply(lambda x:x.split()[0])
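
After this split both columns still hold strings. If numeric values are needed later, one possible approach is to parse the cells with a regular expression instead. This is a sketch under the assumption that cells look like `'0.5 [0.2-1.1]'`; the sample values, the `'No data'` placeholder and the column names `estimate`, `low` and `high` are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up cells in the assumed 'value [low-high]' format, plus one unparseable cell
raw = pd.Series(['0.5 [0.2-1.1]', '12.3 [10.0-14.9]', 'No data'])

# Named groups become the column names of the extracted frame
pattern = (
    r'^\s*(?P<estimate>\d+(?:\.\d+)?)\s*'
    r'\[(?P<low>\d+(?:\.\d+)?)-(?P<high>\d+(?:\.\d+)?)\]'
)

# Rows that do not match the pattern come back as NaN in every column
parts = raw.str.extract(pattern).astype(float)

print(parts)
```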

In [None]:
ndf.head()

Voila! We fixed the downloaded dataset as we wanted.

Let us save the output .csv file.

In [None]:
ndf.to_csv('/kaggle/working/obesity-clean-split.csv')
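
Note that `to_csv` also writes the row index by default, which shows up as an extra `Unnamed: 0` column when the file is read back. If that column is not wanted, `index=False` leaves it out. A sketch on a made-up frame, written to a temporary directory rather than `/kaggle/working/`:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'Country': ['Country A', 'Country B'], 'Year': ['2015', '2016']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')

    toy.to_csv(path)               # default: the index is written as the first column
    with_index = pd.read_csv(path)

    toy.to_csv(path, index=False)  # the index is left out of the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # ['Unnamed: 0', 'Country', 'Year']
print(list(without_index.columns))  # ['Country', 'Year']
```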