**Content:**
1. [Cleaning Data](#1)
    1. [Diagnose data for cleaning](#2)
    1. [Exploratory data analysis](#3)
    1. [Visual exploratory data analysis](#4)
    1. [Tidy data](#5)
    1. [Pivoting data](#6)
    1. [Concatenating data](#7)
    1. [Data types](#8)
    1. [Missing data and testing with assert](#9)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

<a id="1"></a> <br>
# 1.CLEANING DATA

<a id="2"></a> <br>
### DIAGNOSE DATA for CLEANING
Data must be diagnosed and cleaned before it can be explored.
<br>Signs of unclean data:
* Inconsistent column names, such as mixed upper and lower case letters or spaces between words
* Missing data
* Entries written in different languages

<br> We will use the head(), tail() and info() methods and the columns and shape attributes to diagnose the data.
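
As a minimal sketch of fixing the first problem in the list, column names can be normalized with the string methods available on `columns`. The frame `toy` and its column names are made up for illustration and are not the Pokemon data.

```python
import pandas as pd

# Hypothetical toy frame with inconsistent column names (not the Pokemon data)
toy = pd.DataFrame({"Type 1": ["Grass", "Fire"], " Sp. Atk": [65, 60], "HP": [45, 39]})

# Strip surrounding whitespace, lowercase, and replace inner spaces with underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(toy.columns.tolist())
```

Whether to rename columns at all is a style choice; the rest of this notebook keeps the original names such as `Type 1`.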

In [None]:
data = pd.read_csv('../input/Pokemon.csv')  # read the data

In [None]:
data.info()     #data's information

In [None]:
data.head()  # first 5 rows

In [None]:
data.tail()  # last 5 rows

In [None]:
data.columns # give me column names

In [None]:
data.shape  # shape gives the number of rows and columns as a tuple

<a id="3"></a> <br>
### EXPLORATORY DATA ANALYSIS
value_counts(): frequency counts
<br>Outliers: values that are considerably higher or lower than the rest of the data
* Let Q3 be the value at 75% and Q1 the value at 25%. The interquartile range is IQR = Q3 - Q1.
* Outliers are values smaller than Q1 - 1.5 * IQR or larger than Q3 + 1.5 * IQR.

<br>We will use the describe() method, which reports:
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry

<br> What is a quartile?

* 1,3,4,5,8,9,10,12,13,17,18,21,23
* The median is the number in the **middle** of the sorted sequence. Here it is 10.

* The lower quartile is the median of the values from the smallest number up to the median, i.e. from 1 to 10 (1,3,4,5,8,9,10), which is 5.
* The upper quartile is the median of the values from the median up to the largest number, i.e. from 10 to 23 (10,12,13,17,18,21,23), which is 17.
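
The quartiles and the outlier rule can be checked numerically. The sketch below uses `np.percentile` with its default linear interpolation, which is also what `describe()` uses for its 25%, 50% and 75% rows. Quartile conventions differ between textbooks; under this one, Q1 of the sequence is 5. The variable names are chosen for illustration.

```python
import numpy as np

values = np.array([1, 3, 4, 5, 8, 9, 10, 12, 13, 17, 18, 21, 23])

# Quartiles with NumPy's default linear interpolation
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Anything outside [lower_bound, upper_bound] counts as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(q1, median, q3, iqr)
print(lower_bound, upper_bound, outliers)
```

For this sequence the bounds are -13 and 35, so no value is flagged as an outlier.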

In [None]:
# frequency of each pokemon type in the Type 2 column
print(data['Type 2'].value_counts(dropna=False))   # dropna=False also counts NaN values

In [None]:
# For example, the max Speed is 180 and the min Attack is 5
data.describe()  # null entries are ignored

<a id="4"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS
* Box plots: visualize the five-number summary of a data set: minimum, first quartile, median, third quartile and maximum.

In [None]:
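# For example: compare the Defense of legendary and non-legendary pokemon
# Reading each box from top to bottom (line colors depend on the matplotlib version):
# Top whisker end: max (largest value within Q3 + 1.5*IQR)
# Top edge of the box: 75% (Q3)
# Line inside the box: median (50%)
# Bottom edge of the box: 25% (Q1)
# Bottom whisker end: min (smallest value within Q1 - 1.5*IQR)
# Any points drawn beyond the whiskers are outliers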
# -------------------------------------------
# boxplot parameters
# column : Column name or list of names, or vector.
# by : Column in the DataFrame to pandas.DataFrame.groupby(). 
# ax : The matplotlib axes to be used by boxplot.
# fontsize : Tick label font size in points or as a string (e.g., large).
# grid : Setting this to True will show the grid.
# figsize : The size of the figure to create in matplotlib.

data.boxplot(column='Defense',by = 'Legendary',fontsize = 'large', figsize = (8,8) )

<a id="5"></a> <br>
### TIDY DATA
We tidy data with melt().
Melt is easier to understand from an example than from a description, so let's work through one.

In [None]:
# Take a small sample to melt
new_data = data.head()  # only the first 5 rows
new_data  # show new_data


In [None]:
# Let's melt
# id_vars = columns we do not want to melt
# value_vars = columns we want to melt
melted = pd.melt(frame=new_data,id_vars ='Name' , value_vars =['HP','Speed'])
melted     # show melted

<a id="6"></a> <br>
### PIVOTING DATA
Pivoting is the reverse of melting.
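
To see the round trip in isolation, here is a minimal sketch on a made-up frame. The names `wide_df`, `long_df` and `restored` and the values in them are hypothetical and not taken from the Pokemon data.

```python
import pandas as pd

# Hypothetical wide frame: one row per name, one column per measurement
wide_df = pd.DataFrame({"Name": ["A", "B"], "HP": [45, 60], "Speed": [45, 80]})

# Wide -> long: each (Name, variable) pair becomes its own row
long_df = pd.melt(frame=wide_df, id_vars="Name", value_vars=["HP", "Speed"])

# Long -> wide: the distinct entries of 'variable' become columns again
restored = long_df.pivot(index="Name", columns="variable", values="value")

print(long_df)
print(restored)
```

Note that pivot() requires every (index, columns) pair to be unique; it raises a ValueError if the long frame contains duplicates.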

In [None]:
# Let's reverse the melt
melted.pivot(index ='Name' , columns ='variable', values = 'value' )

<a id="7"></a> <br>
### CONCATENATING DATA
We can concatenate two data frames, either row-wise or column-wise.
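
The effect of `axis` and `ignore_index` is easiest to see on tiny inputs. This is a sketch with two made-up frames, `top` and `bottom`, whose values are hypothetical.

```python
import pandas as pd

top = pd.DataFrame({"HP": [45, 60], "Speed": [45, 80]})
bottom = pd.DataFrame({"HP": [80, 39], "Speed": [100, 65]})

# axis=0 stacks rows; ignore_index=True renumbers the result 0..n-1
rows = pd.concat([top, bottom], axis=0, ignore_index=True)

# Without ignore_index the original labels are kept, so 0 and 1 repeat
rows_kept = pd.concat([top, bottom], axis=0)

# axis=1 places the objects side by side, aligning them on the index
cols = pd.concat([top["HP"], bottom["Speed"]], axis=1)

print(rows.index.tolist(), rows_kept.index.tolist())
print(cols)
```

Repeated index labels are legal but make label-based selection ambiguous, which is why ignore_index=True is usually convenient when stacking rows.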

In [None]:
# Create 2 data frames
data1 = data.head()
data2 = data.tail()
conc_data_row = pd.concat([data1,data2],axis=0, ignore_index=True ) # axis=0: stack the data frames vertically (row-wise)
conc_data_row   # show

In [None]:
# Create 2 series (single columns)
data1 = data['HP'].head()
data2 = data['Speed'].head()
conc_data_col = pd.concat([data1,data2],axis=1 )  # axis=1: place the series side by side (column-wise)
conc_data_col   # show

<a id="8"></a> <br>
### DATA TYPES
Data types: object (string), boolean, integer, float and categorical.
<br> We can convert between data types, for example from str to categorical or from int to float.
<br> Why is the category type important?
* It makes the data frame smaller in memory
* It can be utilized for analysis, especially with sklearn (we will learn this later)
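
The memory saving can be measured with `memory_usage(deep=True)`. The sketch below assumes a made-up column with only four distinct labels repeated many times; the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times
types = pd.Series(["Grass", "Fire", "Water", "Bug"] * 2500)

as_object = types.astype("object")
as_category = types.astype("category")

# deep=True also counts the bytes of the string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.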

In [None]:
data.dtypes

In [None]:
# convert object(str) -----> categorical
# convert int ------> float
data['Type 1'] = data['Type 1'].astype('category')
data['Defense'] = data['Defense'].astype('float')

In [None]:
# Type 1 changed from object to category
# Defense changed from int to float
data.dtypes

<a id="9"></a> <br>
### MISSING DATA and TESTING WITH ASSERT
If we encounter missing data, we can:
* leave it as is
* drop it with dropna()
* fill missing values with fillna()
* fill missing values with a summary statistic such as the mean

<br>Assert statement: a check that you can switch on while testing your program and switch off when you are done (Python skips assert statements when run with the -O flag)
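
Filling with the mean is the one option in the list that the cells in this section do not demonstrate. Here is a minimal sketch, assuming a made-up numeric column named `scores` (hypothetical, not part of the Pokemon data), followed by an assert that confirms nothing is missing afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries
scores = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# mean() skips NaN by default, so this is the mean of 10, 30 and 50
filled = scores.fillna(scores.mean())

# Passes silently; the message is only shown if the check fails
assert filled.notnull().all(), "filled still contains missing values"

print(filled.tolist())
```

Filling with the mean keeps the column mean unchanged but shrinks its variance, so it is a convenience rather than a neutral choice.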

In [None]:
# Type 2 has 414 non-null entries, so it has 386 null entries
data.info()

In [None]:
# Let's check Type 2
data["Type 2"].value_counts(dropna =False)

In [None]:
# Let's drop the rows with NaN values in Type 2
data1 = data.copy()   # work on a copy; the original data is still needed to fill missing values later
data1 = data1.dropna(subset=['Type 2'])   # dropna returns a new frame without those rows, so we assign it back

In [None]:
# Assert statement:
assert 1==1   # returns nothing because it is true

In [None]:
# False, so this would raise an AssertionError
#assert 1==2

In [None]:
assert data1['Type 2'].notnull().all()  # returns nothing because we dropped the NaN values

In [None]:
data["Type 2"] = data["Type 2"].fillna('empty')  # assign back; inplace=True on a selected column may not update data

In [None]:
assert data['Type 2'].notnull().all()  # returns nothing because we do not have nan values

What we learned in this chapter:
* Diagnose data for cleaning
* Exploratory data analysis
* Visual exploratory data analysis
* Tidy data
* Pivoting data
* Concatenating data
* Data types
* Missing data and testing with assert