**Data vs Information**

Data are raw or organized findings.
* Number of beers consumed during Carnaval 2019.
* Checking accounts opened in 2018.
* Price of Bitcoin in 2017.

Information is data that has been given context.
* The market share of brewers in Brazil over the past 10 years tells us whether the combined share of the top 5 brewers has changed significantly.

Data analysis helps uncover information, answer queries, and forecast the unknown. However, before any of this can be accomplished, the data must be cleaned.
In this kernel, I will take you through the steps of preparing data for analysis.

**Import data**

In [None]:
#Import pandas library
import pandas as pd

In [None]:
#Save your dataframe into a variable so that you can keep working with it
Military_Expenditure = pd.read_csv("../input/military-expenditure-of-countries-19602019/Military Expenditure.csv")
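If you do not have the Kaggle input file at hand, the same call can be tried on an in-memory CSV. The sketch below is only an illustration: the two rows, their values, and the indicator text are hypothetical stand-ins that mimic the column layout used in this kernel.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the layout of the real file
csv_text = """Name,Code,Type,Indicator Name,1960,1961
Country A,AAA,Country,Military expenditure (current USD),,
Country B,BBB,Country,Military expenditure (current USD),100.0,120.0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN
print(sample)
```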

**Explore the data set**

In [None]:
#The shape attribute returns a Python tuple
#The first value is the number of rows and the second value is the number of columns
Military_Expenditure.shape
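Because `shape` is a plain tuple, it can be unpacked into two variables. A minimal sketch on a hypothetical toy frame (not the real data set):

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (hypothetical values)
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Code": ["AAA", "BBB", "CCC"]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```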

In [None]:
#DataFrame, index dtype, column dtypes, non-null values and memory usage
Military_Expenditure.info()

In [None]:
#Head gives us the first five rows
#A pandas dataframe has three components: the index, the columns, and the values (the body of the dataframe).
Military_Expenditure.head()

**Subset Columns**

In [None]:
Name = Military_Expenditure[['Name', 'Code']]

In [None]:
Name
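The double brackets matter here: a list of column names returns a DataFrame, while a single name in single brackets returns a Series. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Type": ["Country", "Country"],
})

one_col_series = toy["Name"]       # single brackets -> Series
one_col_frame = toy[["Name"]]      # list of one name -> DataFrame
two_cols = toy[["Name", "Code"]]   # list of names -> DataFrame

print(type(one_col_series).__name__, type(one_col_frame).__name__)
```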

**Subset Rows**

In [None]:
#loc matches the index label and iloc matches the index position
Military_Expenditure.loc[[0, 1, 2]]
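With the default index (0, 1, 2, ...), labels and positions coincide, so `loc` and `iloc` look interchangeable. The difference only shows up when the index is something else. A sketch on a hypothetical toy frame with a custom index:

```python
import pandas as pd

# Toy frame whose index labels are 10, 20, 30 (hypothetical values)
toy = pd.DataFrame({"Name": ["A", "B", "C"]}, index=[10, 20, 30])

by_label = toy.loc[[10, 20]]     # matches the index labels 10 and 20
by_position = toy.iloc[[0, 1]]   # matches the first and second rows

# There is no label 0 here, so toy.loc[[0]] would raise a KeyError
label_zero_exists = 0 in toy.index

print(by_label.equals(by_position), label_zero_exists)
```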

**Subset rows and columns**

In [None]:
rows_columns = Military_Expenditure.loc[0:10, ['Name', 'Code', 'Type']]

In [None]:
rows_columns.head()
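Note that a `loc` slice includes its end label, so `loc[0:10]` returns eleven rows, whereas an `iloc` slice excludes the end position like ordinary Python slicing. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame with five rows and the default 0..4 index (hypothetical values)
toy = pd.DataFrame({"Name": list("ABCDE"), "Code": list("VWXYZ")})

label_slice = toy.loc[0:2, ["Name"]]   # labels 0, 1 and 2 -> three rows
position_slice = toy.iloc[0:2, [0]]    # positions 0 and 1 -> two rows

print(len(label_slice), len(position_slice))
```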

In [None]:
#Multiple Criteria Filtering
filter_list = ['Regions Clubbed Economically', 'Semi Autonomous Region', 'Regions Clubbed Geographically']
Military_Expenditure[Military_Expenditure.Type.isin(filter_list)]

In [None]:
#This data set has a column named Indicator Name that should be removed since it adds no value to the data set
Military_Expenditure.drop(['Indicator Name'], axis='columns', inplace=True)

In [None]:
Military_Expenditure.head()
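One practical caveat in a notebook: re-running a cell that drops a column in place raises a `KeyError` the second time, because the column is already gone. A possible alternative, sketched on a hypothetical toy frame, is to assign the result of `drop` to a new variable and pass `errors="ignore"`:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Name": ["A"], "Indicator Name": ["x"], "1960": [1.0]})

# Without inplace=True, drop returns a new frame and leaves toy untouched
trimmed = toy.drop(columns=["Indicator Name"])

# errors="ignore" makes the call safe to repeat when the column is already gone
trimmed_again = trimmed.drop(columns=["Indicator Name"], errors="ignore")

print(list(toy.columns), list(trimmed.columns))
```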

**Melt**

In [None]:
#When column headers are values and not variables the data set needs to be melted.
Military_Expenditure = Military_Expenditure.melt(id_vars=['Name', 'Code', 'Type'],
                                                 var_name='Year', value_name='Expenditure_USD')

In [None]:
Military_Expenditure.head()
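To see what melting does on a small scale, here is a sketch with a hypothetical two-row wide frame. The year columns become values of a single `Year` column, and each original cell becomes its own row.

```python
import pandas as pd

# Wide layout: one column per year (hypothetical values)
wide = pd.DataFrame({
    "Name": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "1960": [1.0, 2.0],
    "1961": [3.0, 4.0],
})

# Long layout: one row per (Name, Code, Year) combination
tidy = wide.melt(id_vars=["Name", "Code"],
                 var_name="Year", value_name="Expenditure_USD")

# The Year values come from the column headers, so they are still strings
print(tidy)
```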

**Convert data type**

In [None]:
#Check the data types: after melting, Year holds the old column headers, so it is stored as text instead of a number
Military_Expenditure.dtypes

In [None]:
#Convert the Year column to integers
Military_Expenditure["Year"] = Military_Expenditure["Year"].astype("int")

In [None]:
#List the column data types after the conversion
Military_Expenditure.dtypes
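`astype("int")` works here because every year label is a clean number. If a column might contain stray text, `astype` raises a `ValueError`; one possible alternative is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. A sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical column with one entry that is not a number
years = pd.Series(["1960", "1961", "n/a"])

# The strict conversion fails on the bad entry
try:
    years.astype(int)
    strict_ok = True
except ValueError:
    strict_ok = False

# The lenient conversion replaces the bad entry with NaN
converted = pd.to_numeric(years, errors="coerce")

print(strict_ok)
print(converted)
```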

**Identify and handle missing data**

Missing values in a data set can be represented as NaN, ?, 0, or an empty cell. The Expenditure_USD column has several missing values, which are represented as NaN.
There are several ways to deal with missing information:
* Go to the source and try to find the missing information
* Remove the rows that contain the missing value(s)
* Replace the missing value(s)
* Leave the missing data as missing data

Which option would you choose? And why?
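To make the trade-offs concrete, the sketch below applies three of these options to a hypothetical toy frame and compares the resulting mean. The numbers are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing expenditure (hypothetical values)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Expenditure_USD": [10.0, np.nan, 30.0],
})

# Option: remove the rows with a missing value
dropped = toy.dropna(subset=["Expenditure_USD"])

# Option: replace the missing value, here with 0, in that column only
filled = toy.fillna({"Expenditure_USD": 0})

# Option: leave it missing; pandas skips NaN when computing the mean
mean_left_missing = toy["Expenditure_USD"].mean()   # (10 + 30) / 2
mean_filled = filled["Expenditure_USD"].mean()      # (10 + 0 + 30) / 3

print(len(dropped), mean_left_missing, mean_filled)
```

Filling with 0 treats "unknown" as "spent nothing", which pulls the average down, so the right choice depends on what the analysis is meant to answer.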

In [None]:
Military_Expenditure.head()

In [None]:
#Count missing values in each column (True = missing, False = present)
for column in Military_Expenditure.columns.values.tolist():
    print(column)
    print(Military_Expenditure[column].isnull().value_counts())
    print("")
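A more compact way to get the number of missing values per column is `isna().sum()`, and `isna().mean()` gives the missing share. A sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing expenditures (hypothetical values)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Expenditure_USD": [1.0, np.nan, np.nan, 4.0],
})

# isna() marks missing cells as True; summing counts them per column
missing_per_column = toy.isna().sum()

# The mean of the same mask is the fraction of missing cells per column
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```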

In [None]:
#Replace all NaN elements with 0s
Military_Expenditure.fillna(value=0, inplace=True)

In [None]:
Military_Expenditure.head()

Keep in mind that when you are working with large datasets, a lot of your time will be spent cleaning the data.