# DATA SCIENTIST
**In this tutorial, I will show you only the first steps toward becoming a data scientist using python.**

A data scientist needs these skills:

1. Basic Tools: Like python, R or SQL. You do not need to know everything. All you need is to learn how to use **python**.
1. Basic Statistics: Like mean, median or standard deviation. If you know basic statistics, you can use **python** easily.
1. Data Munging: Working with messy and difficult data, like inconsistent date and string formatting. As you can guess, **python** helps us.
1. Data Visualization: The title is self-explanatory. We will visualize the data with **python** libraries like matplotlib and seaborn.
1. Machine Learning: You do not need to understand the math behind every machine learning technique. You only need to understand the basics of machine learning and learn how to implement it using **python**.

### In summary, we will learn python to become data scientists!

**Content:**
1. [Introduction to Python:](#1)
    1. [Matplotlib](#2)
    1. [Dictionaries ](#3)
    1. [Pandas](#4)
    1. [Logic, control flow and filtering](#5)
    1. [Loop data structures](#6)
1. [Python Data Science Toolbox:](#7)
    1. [User defined function](#8)
    1. [Scope](#9)
    1. [Nested function](#10)
    1. [Default and flexible arguments](#11)
    1. [Lambda function](#12)
    1. [Anonymous function](#13)
    1. [Iterators](#14)
    1. [List comprehension](#15)
1. [Cleaning Data](#16)
    1. [Diagnose data for cleaning](#17)
    1. [Exploratory data analysis](#18)
    1. [Visual exploratory data analysis](#19)
    1. [Tidy data](#20)
    1. [Pivoting data](#21)
    1. [Concatenating data](#22)
    1. [Data types](#23)
    1. [Missing data and testing with assert](#24)
1. [Pandas Foundation](#25)
    1. [Review of pandas](#26)
    1. [Building data frames from scratch](#27)
    1. [Visual exploratory data analysis](#28)
    1. [Statistical exploratory data analysis](#29)
    1. [Indexing pandas time series](#30)
    1. [Resampling pandas time series](#31)
1. [Manipulating Data Frames with Pandas](#32)
    1. [Indexing data frames](#33)
    1. [Slicing data frames](#34)
    1. [Filtering data frames](#35)
    1. [Transforming data frames](#36)
    1. [Index objects and labeled data](#37)
    1. [Hierarchical indexing](#38)
    1. [Pivoting data frames](#39)
    1. [Stacking and unstacking data frames](#40)
    1. [Melting data frames](#41)
    1. [Categoricals and groupby](#42)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files 
#in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Load the CSV file as a pandas DataFrame

In [None]:
data = pd.read_csv('../input/pokemon.csv')

We can take a closer look at our data using the code below.

In [None]:
data.info() # Display the content of data

In [None]:
# Show the first 5 rows
data.head() 

#data.head(10) # Show the first 10 rows

In [None]:
# Show the last 5 rows
data.tail() 

#data.tail(15) # Show the last 15 rows

In [None]:
# Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, 
#excluding NaN values.
data.describe()

In [None]:
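# Display the positive and negative correlations between the numeric columns
# (text columns are excluded because recent pandas versions raise an error on them)
data.select_dtypes(include=["number", "bool"]).corr()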

In [None]:
# Display the positive and negative correlations between the numeric columns as a heatmap
figure, axes = plt.subplots(figsize=(15, 15))
sns.heatmap(data.select_dtypes(include=["number", "bool"]).corr(), annot=True, linewidths=.5, fmt='.1f', ax=axes)
plt.show()

# Detailed explanation
# figure -> the figure that is created
# axes -> the matplotlib.axes.Axes instance on which the heatmap is drawn (passed to heatmap as ax).
# If ax is not provided, heatmap uses the current axes or creates new ones.
# plt -> the matplotlib.pyplot library imported as plt
# subplots -> creates a figure and one or more axes, so two or more plots
# can be drawn in one figure.
# figsize -> size of the whole figure as (width, height) in inches
# corr() -> correlation between columns
# annot=True -> writes the correlation value in each cell
# linewidths -> thickness of the lines between cells
# cmap -> color map used for the cells (not set here, so the default is used)
# fmt -> number format of the annotations ('.1f' = one digit after the decimal point)
# If the correlation between two columns is 1 or close to 1, they are positively correlated.
# If it is -1 or close to -1, they are negatively correlated.
# If it is 0 or close to 0, there is no linear relationship between them.

For more information;

https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.subplots.html

https://seaborn.pydata.org/generated/seaborn.heatmap.html
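
As a small illustration of how to read correlation values, here is a minimal sketch on a tiny made-up DataFrame. The column names `x`, `up`, `down` and `curve` are invented for this example and are not part of the Pokemon dataset.

```python
import pandas as pd

# Made-up numbers, used only to illustrate how to read corr()
toy = pd.DataFrame({
    "x":     [1, 2, 3, 4, 5],
    "up":    [10, 20, 30, 40, 50],   # grows with x
    "down":  [5, 4, 3, 2, 1],        # shrinks as x grows
    "curve": [4, 1, 0, 1, 4],        # depends on x, but not linearly
})

corr = toy.corr()        # Pearson correlation is the default method
print(corr.round(2))
```

Here `x` and `up` correlate at 1, `x` and `down` at -1, and `x` and `curve` at 0 even though `curve` is computed from `x`. A value near 0 only rules out a linear relationship.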

In [None]:
# Show the 10 rows with the highest Defense value.
data.sort_values("Defense", ascending = False).head(10)

<a id="1"></a> <br>
# 1. INTRODUCTION TO PYTHON

<a id="2"></a> <br>
### MATPLOTLIB
Matplotlib is a python library that helps us plot data. The easiest and most basic plots are line, scatter and histogram plots.
* A line plot is better when the x axis is time.
* A scatter plot is better when we want to see the correlation between two variables.
* A histogram is better when we need to see the distribution of numerical data.
* Customization: colors, labels, line thickness, title, opacity, grid, figsize, axis ticks and linestyle.

In [None]:
# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.Speed.plot(kind = 'line', color = 'g',label = 'Speed',linewidth=1, alpha = 0.5, grid = True, linestyle = ':', figsize=(15,5))
data.Defense.plot(kind = 'line', color = 'r',label = 'Defense',linewidth=1, alpha = 0.5, grid = True, linestyle = '-.')
plt.legend(loc='upper right')     # legend = puts label into plot
plt.xlabel('x axis')              # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot')            # title = title of plot
plt.show()

In [None]:
# subplots
data.plot(subplots = True, figsize=(15,15))
plt.show()

In [None]:
plt.subplot(4,2,1)
data.HP.plot(kind="line", color="orange", label="HP", linewidth=1, alpha=1, grid=True, figsize=(20,15))
data.Attack.plot(kind="line", color="purple", label="Attack", linewidth=1, alpha=0.5, grid=True)
plt.ylabel("HP")
plt.subplot(4,2,2)
data.Attack.plot(kind="line", color="blue", label="Attack", linewidth=1, alpha=0.8, grid=True, linestyle=":")
plt.ylabel("Attack")
plt.subplot(4,2,3)
data.Defense.plot(kind="line", color="green", label="Defense", linewidth=1, alpha=0.6, grid=True, linestyle="-.")
plt.ylabel("Defense")
plt.subplot(4,2,4)
data.Speed.plot(kind="line", color="red", label="Speed", linewidth=1, alpha=0.4, grid=True)
plt.ylabel("Speed")
plt.show()

In [None]:
# Scatter Plot 
# x = attack, y = defense
# alpha = opacity, color = color of the points, figsize = size of the figure
data.plot(kind='scatter', x='Attack', y='Defense', alpha = 0.5, color = 'blue', figsize=(10,5))
plt.xlabel('Attack') # label = name of label
plt.ylabel('Defense')
plt.title('Attack - Defense Scatter Plot') # title = title of plot
plt.show() # for showing plot

In [None]:
# Histogram
# bins = number of bars in the figure (here one bin per row, which gives very thin bars)
data.Speed.plot(kind = 'hist',bins = len(data[["Speed"]]), figsize = (12,12))
plt.show()

In [None]:
# Show the first 30 Attack values as a bar chart
data.Attack.head(30).plot(kind="bar", figsize=(10,5))
plt.show()

In [None]:
# clf() = clears the current figure so that you can start afresh
data.Speed.plot(kind = 'hist',bins = 50)
plt.clf()
plt.show()
# We cannot see the plot because of clf()

<a id="3"></a> <br>
### DICTIONARY
Why do we need dictionaries?
* They store data as 'key' and 'value' pairs
* Looking up a value by its key is faster than searching through a list
<br>

What are key and value? Example:
* dictionary = {'brand' : 'ford'}
* The key is brand.
* The value is ford.
<br>
<br>**It's that easy.**
<br>Let's practice some other operations: keys(), values(), updating, adding, checking and removing a key, removing all entries and deleting the dictionary.

In [None]:
# Create a dictionary and look at its keys and values
dictionary = {'brand' : 'ford','model' : 'mustang'}
print(dictionary.keys())
print(dictionary.values())

In [None]:
# Keys have to be immutable (hashable) objects like strings, booleans, floats, integers or tuples
# A list is mutable, so it cannot be used as a key
# Keys are unique
dictionary['brand'] = "Ford"         # Update existing entry
print(dictionary)

dictionary['year'] = 1964            # Add new entry
print(dictionary)

del dictionary['brand']              # Remove entry with key 'brand'
print(dictionary)

print('model' in dictionary)         # Check whether the key is in the dictionary

dictionary.clear()                   # Remove all entries in dict
print(dictionary)

# To run all the code, keep the lines below commented out
# del dictionary         # delete entire dictionary     
#print(dictionary)       # it gives error because dictionary is deleted

<a id="4"></a> <br>
### PANDAS
What do we need to know about pandas?
* CSV: comma-separated values


In [None]:
data = pd.read_csv('../input/pokemon.csv') # Add csv file as a pandas dataframe

In [None]:
print(type(data))                  # pandas.core.frame.DataFrame
print(type(data[["Attack"]]))      # pandas.core.frame.DataFrame
print(type(data["Attack"]))        # pandas.core.series.Series
print(type(data["Attack"].values)) # numpy.ndarray

In [None]:
series = data['Defense']        # data['Defense'] = series
data_frame = data[['Defense']]  # data[['Defense']] = data frame

print(type(series))
print(type(data_frame), end = "\n\n")

print(series.head(10), end = "\n\n")
print(data_frame.head(10))

<a id="5"></a> <br>
Before continuing with pandas, we need to learn **logic, control flow** and **filtering.**
<br>Comparison operators: ==, !=, <, >, <=, >=
<br>Boolean operators: and, or, not
<br>Filtering pandas data frames

In [None]:
# Comparison operator
print(1 >0)
print(1 != 0)

# Boolean operators
print(True and False)
print(True or False)

In [None]:
# Filtering Pandas data frame
Filtered_Defense_200 = data['Defense'] > 200     # Only 3 pokemons have a defense value higher than 200
data[Filtered_Defense_200]

In [None]:
# Filtering pandas with logical and
# Only 2 pokemons have a defense value higher than 200 and an attack value higher than 100
data[np.logical_and(data['Defense'] > 200, data['Attack'] > 100)]

# The line below does the same thing, so we can also use '&' for filtering.
#data[(data['Defense'] > 200) & (data['Attack'] > 100)]

<a id="6"></a> <br>
### WHILE and FOR LOOPS
We will learn the most basic while and for loops.

In [None]:
# Stay in the loop while the condition (counter is not equal to 10) is true
counter = 0
while counter != 10 :
    print('counter is: ',counter)
    counter +=1 
print('counter is equal to 10 (Loop finished)')

In [None]:
# Visit each item of the list once
list_names = ["berkant", "dogus", "kutay"]
for name in list_names:
    print("Name is: ", name)
    
print("")

# Visit each number of the list once
list_numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for number in list_numbers:
    print("Number is: ", number)

print("")

# Enumerate the index and value of each list item
# index : value = 0:0, 1:1, 2:2, 3:3, 4:4, ...
for index, value in enumerate(list_numbers):
    print(index,". index : ",value, sep = "")

print("")

# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We learned about keys and values in the dictionary part.
dictionary_car = {'Brand':'Ford','Model':'Mustang'}
for key in dictionary_car:
    print(key)

print("")

for key, value in dictionary_car.items():
    print(key," : ",value)
    
print("")

# For pandas we can access the index and value of each row
for index,value in data[["Defense"]][0:5].iterrows():
    print(index," : ",value)

What have we learned?
* How to import a csv file
* Plotting line, scatter and histogram plots
* Basic dictionary features
* Basic pandas features like filtering, which is used all the time and is essential for a data scientist
* While and for loops

<a id="7"></a> <br>
# 2. PYTHON DATA SCIENCE TOOLBOX

<a id="8"></a> <br>
### USER DEFINED FUNCTION

In Python, a function is a group of related statements that performs a specific task.

As you already know, Python gives you many built-in functions like print(), 
but you can also create your own functions. These functions are called user-defined functions.

Functions help break our program into smaller, modular chunks. 
As our program grows larger and larger, functions make it more organized and manageable.

What we need to know about functions:
* docstrings: documentation for functions. Example:
<br>for foo():
    <br>"""This is docstring for documentation of function foo"""
    <br>The first line should always be a short, concise summary of the object’s purpose.
* tuple: A tuple is an immutable sequence of Python objects.  
<br>Tuples are sequences, just like lists.
<br>The differences between tuples and lists are that tuples cannot be changed, unlike lists, and that tuples use parentheses, whereas lists use square brackets.
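
To make the docstring convention concrete, here is a minimal sketch. The function `rectangle_area` is a hypothetical example written for this illustration; it is not used elsewhere in this tutorial.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle.

    width and height are expected to be numbers in the same unit.
    """
    return width * height

# The docstring is stored on the function object itself
print(rectangle_area.__doc__)

area = rectangle_area(3, 4)
print(area)
```

Calling `help(rectangle_area)` shows the same text together with the function signature.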

In [None]:
# For example
def tuple_function():
    """ This function returns defined tuple"""
    tuple_names = ("berkant", "dogus", "kutay")
    return tuple_names

name1, name2, name3 = tuple_function()
print(name1, name2, name3)

# You can not change tuples!
tuple_numbers = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
print(tuple_numbers)
#tuple_numbers[0] = 8 # This line gives an error about assignment

# You can print some part of tuples
print(tuple_numbers[4:8])

<a id="9"></a> <br>
### SCOPE
What we need to know about scope:
* Local - Enclosing - Global - Built in 
<br><br>Let's look at some basic examples

![image.png](attachment:image.png)

* The local scope is the namespace of the “current” level of the program. This is either within a function, class, or imported module that is not the main module. 
For example:

In [None]:
x = 2  # x is now defined within the module namespace
def foo():
    x = 3 # x is now defined within the local namespace of function
    print(x)

foo()

* Bear in mind the local namespace is not the lowest level of all the nested functions, classes, or modules, but rather the level on which the code is executing. 
For example:

In [None]:
x = "2"  # x is now defined within the module namespace
def example():
    x = "3" # x is now defined as 3 within the local namespace of example
    def method():
        x = "4" # x is now defined as 4 within the local namespace of method
        def function():
            x = "5" # x is now defined as 5 within the local namespace of function
            print("Function Scope: " + x)
        function()
        print("Method Scope: " + x)
    method()
    print("Example Scope: " + x)
example()
print("Module Scope: " + x)

* In any namespace, declaring a variable with the global statement will store and retrieve that variable immediately from the module scope. 
For example:

In [None]:
x = "2"  # x is now defined within the module namespace
def example():
    x = "3" # x is now defined as 3 within the local namespace of example
    def method():
        global x  # x will now be defined as being within the module scope 
        x = "4" # x is now set to 4 in the module namespace (no local x is created in method)
        def function():
            x = "5" # x is now defined as 5 within the local namespace of function
            print("Function Scope: " + x)
        function()
        print("Method Scope: " + x)
    method()
    print("Example Scope: " + x)
example()
print("Module Scope: " + x)

* The builtin scope contains all of the Python functions that are builtin to vanilla Python. These include common functions such as print and dir.

In [None]:
print(type(dir))
print(type(print))
print(type(open), end = "\n\n")

# How can we see what is in the built-in scope?
import builtins
print(dir(builtins))

<a id="10"></a> <br>
### NESTED FUNCTION
* A nested function is a function defined inside another function.
* Names are looked up with the LEGB rule: the 'L'ocal scope is searched first, then the 'E'nclosing function, then the 'G'lobal scope and finally the 'B'uilt-in scope.
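
The LEGB lookup order can be sketched with one name per scope. The functions `outer` and `inner` and the variables `x`, `y` and `z` are made up for this illustration.

```python
x = "global x"          # G: module (global) scope

def outer():
    y = "enclosing y"   # E: scope of the enclosing function

    def inner():
        z = "local z"   # L: local scope of inner()
        # z is found locally, y in the enclosing function,
        # x in the global scope and len in the built-in scope
        return z, y, x, len(z)

    return inner()

result = outer()
print(result)
```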

In [None]:
#nested function
def square():
    """ Return square of value """
    def add():
        """ Add two local variable """
        number1 = 4
        number2 = 3
        return number1 + number2
    return add() ** 2
print(square()) 

<a id="11"></a> <br>
### DEFAULT and FLEXIBLE ARGUMENTS

The most useful form is to specify a default value for one or more arguments. This creates a function that can be called with fewer arguments than it is defined to allow.

* Default argument example:
<br> def foo(a, b=1):
        """ b = 1 is default argument"""
* Flexible argument example:
<br> def foo(*args):
       """ *args collects zero or more positional arguments in a tuple"""
<br>def foo(**kwargs):
       """ **kwargs collects keyword arguments in a dictionary"""
       
<br><br> For example:

In [None]:
def ask_ok(prompt, retries=4, reminder='Please try again!'):
    print("")
    while True:
        answer = input(prompt)
        if answer in ('y', 'ye', 'yes'):
            return True
        if answer in ('n', 'no', 'nop', 'nope'):
            return False
        retries = retries - 1
        if retries < 0:
            raise ValueError('Invalid user response!')
        print(reminder)

#ask_ok('Do you really want to quit?')

This function can be called in several ways:

* giving only the mandatory argument: ask_ok('Do you really want to quit?')
* giving one of the optional arguments: ask_ok('OK to overwrite the file?', 2)
* or even giving all arguments: ask_ok('OK to overwrite the file?', 2, 'Come on, only yes or no!')
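
The same calling patterns can be tried without keyboard input. This is a minimal sketch with a hypothetical function `power`, invented for this illustration.

```python
def power(base, exponent=2, label="result"):
    """Raise base to exponent and return a labelled string."""
    return f"{label}: {base ** exponent}"

only_mandatory = power(3)               # exponent and label use their defaults
one_optional = power(3, 3)              # overrides exponent by position
all_arguments = power(3, 3, "cube")     # overrides both defaults
by_keyword = power(3, label="square")   # skips exponent, sets label by name

print(only_mandatory, one_optional, all_arguments, by_keyword, sep="\n")
```

Passing `label` by keyword shows that an optional argument can be set without giving the ones before it.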

In [None]:
# Flexible arguments *args
def multiply(*args):
    z = 1
    for num in args:
        z *= num
    print(z)

multiply(4, 5)
multiply(10, 9)
multiply(2, 3, 4)
multiply(3, 5, 10, 6)

print("")

# Flexible arguments: **kwargs is a dictionary
def car_info(**kwargs):
    """ Print the key and value of each keyword argument"""
    for key, value in kwargs.items(): # If this is unclear, go back to the for loop part and look at looping over a dictionary
        print(key, ":", value)
car_info(brand = 'ford', model = 'mustang', year = 1964)

<a id="12"></a> <br>
### LAMBDA FUNCTION
A shorter way of writing a function.

In Python, a lambda function is a function without a name. As we already know, the def keyword is used to define normal functions, while the lambda keyword is used to create lambda functions. 

In [None]:
# lambda function
square = lambda x: x**2     # where x is name of argument
print(square(9))

total = lambda x,y,z: x+y+z   # where x,y,z are names of arguments
print(total(5,2,8))

<a id="13"></a> <br>
### ANONYMOUS FUNCTION
A lambda function that is not assigned to a name; it is usually passed directly to another function such as map().
* map(func,seq) : applies a function to all the items in a sequence

In [None]:
list_numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list_cubes = list(map(lambda x: x**3, list_numbers))
print(list_cubes)

<a id="14"></a> <br>
### ITERATORS
* iterable: an object that can return an iterator; it has an associated \_\_iter\_\_() method, which the built-in iter() calls
<br> examples: lists, strings and dictionaries
* iterator: produces the next value each time next() is called on it
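
The relationship between an iterable and its iterator can be sketched as a hand-written version of a for loop. The list `letters` is made up for this illustration.

```python
letters = ["a", "b", "c"]     # a list is an iterable

iterator = iter(letters)      # iter() asks the iterable for an iterator
collected = []

# Roughly what a for loop does behind the scenes
while True:
    try:
        item = next(iterator)  # next() returns the following value
    except StopIteration:      # raised once the iterator is exhausted
        break
    collected.append(item)

print(collected)

# An exhausted iterator stays empty; the original list is unchanged
leftover = list(iterator)
print(leftover, letters)
```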

In [None]:
# Iteration example
name = "berkant"
iterable_name = iter(name)

print(next(iterable_name))    # print next iteration
print(next(iterable_name))    # print next iteration         
print(next(iterable_name))    # print next iteration
print(next(iterable_name))    # print next iteration
print(*iterable_name)         # print remaining iteration

* zip() 

zip() pairs up the elements that share the same index in several containers, so that they can be used together as a single entity.

In [None]:
list_rank = [1,2,3]
list_name = ["berkant","dogus","kutay"]

zip_result = zip(list_rank, list_name)
print(zip_result)
#print(type(zip_result))

print("") 

list_zip_result = list(zip_result)  #converting zip to list type
print(list_zip_result)
#print(type(list_zip_result))

print("")

iterable_zip_result = iter(list_zip_result) 
print(next(iterable_zip_result))   # print next iteration
print(*iterable_zip_result)        # print remaining iteration
#print(type(iterable_zip_result))

In [None]:
unzip_result = zip(*list_zip_result)
list_rank, list_name = list(unzip_result) # unzip returns tuple

print(list_rank)
print(list_name)

print(type(list_rank))
print(type(list(list_name))) # to change the data type from tuple to list, we use list()

<a id="15"></a> <br>
### LIST COMPREHENSION
**One of the most important topics of this kernel**

<br>List comprehensions provide a concise way to create lists. 

<br>We often use list comprehensions in data analysis. 
<br>list comprehension: collapses a for loop that builds a list into a single line

<br>For example, assume we want to create a list of squares, like:

In [None]:
squares = []
for x in range(10):
    squares.append(x**2)
    
print(squares)

Note that this creates (or overwrites) a variable named 'x' that still exists after the loop completes. We can calculate the list of squares without any side effects using:

In [None]:
squares = list(map(lambda x: x**2, range(10)))
print(squares)

or, equivalently:

In [None]:
squares = [x**2 for x in range(10)]
print(squares)

which is more concise and readable.

[x \** 2 for x in range(10)]: list comprehension
<br> x \** 2: expression evaluated for each item
<br> for x in range(10): for loop syntax
<br> x: loop variable
<br> range(10): iterable object

In [None]:
output = [(x,y) for x in [1,2,3] for y in [3,1,4] if x != y]

print(output)

and it’s equivalent to:

In [None]:
output = []
for x in [1,2,3]:
    for y in [3,1,4]:
        if x != y:
            output.append((x, y))

print(output)

Note how the order of the for and if statements is the same in both these snippets.

In [None]:
# Another example
result = ["Positive" if i > 0  else "Negative" if i<0 else "Zero" for i in range(-10,10,1)]
print(result)

In [None]:
# Let's go back to pokemon.csv and make one more list comprehension example
# Let's classify pokemons by whether they have high or low speed. Our threshold is the average speed.
threshold = sum(data.Speed)/len(data.Speed)
print("Threshold : ", threshold)

data["speed_level"] = ["high" if i > threshold else "low" for i in data.Speed]
print(data.loc[:10,["speed_level","Speed"]]) # We will learn loc more detailed later

What have we learned?

* User defined function 
* Scope
* Nested function
* Default and flexible arguments
* Lambda function
* Anonymous function
* Iterators
* List comprehension

<a id="16"></a> <br>
# 3. CLEANING DATA

<a id="17"></a> <br>
### DIAGNOSE DATA for CLEANING
We need to diagnose and clean data before exploring it.
<br>Unclean data:
* Column name inconsistency like upper/lower case letters or spaces between words
* Missing data
* Different languages

We will use the head, tail, columns, shape and info methods to diagnose data.

In [None]:
data = pd.read_csv('../input/pokemon.csv')
data.head()  # Head shows first 5 rows

In [None]:
data.tail()  # Tail shows last 5 rows

In [None]:
# Columns gives column names of features
data.columns

In [None]:
# Shape gives the number of rows and columns as a tuple
data.shape

In [None]:
# Info gives the data type (DataFrame), the number of samples (rows), the number of features (columns), the feature types and the memory usage
data.info()

In [None]:
data.rename(columns={"Type 1":"type1", "Type 2":"type2"}, inplace=True)
data.columns

In [None]:
# To replace spaces with an underscore
data.columns = [each.replace(" ","_") if(len(each.split())>1) else each for each in data.columns]
print(data.columns)

In [None]:
# To replace upper case with lower case
data.columns = [column.lower() for column in data.columns]
print(data.columns)

<a id="18"></a> <br>
### EXPLORATORY DATA ANALYSIS
value_counts(): Frequency counts
<br>outliers: values that are considerably higher or lower than the rest of the data
* Let's say the value at 75% is Q3 and the value at 25% is Q1. 
* Outliers are smaller than Q1 - 1.5(Q3-Q1) or bigger than Q3 + 1.5(Q3-Q1). (Q3-Q1) = IQR
<br>We will use the describe() method. It includes:
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry

<br> What is a quartile?

* 1,4,5,6,8,9,11,12,13,14,15,16,17
* The median is the number in the **middle** of the sorted sequence. In this case it is 11.

* The lower quartile is the median of the values from the smallest number up to the median, i.e. from 1 to 11, which is 6.
* The upper quartile is the median of the values from the median up to the largest number, i.e. from 11 to 17, which is 14.
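
The quartiles and the outlier rule can be checked on the same sequence. This is a minimal sketch that uses the standard library's `statistics` module instead of describe(); the `inclusive` method is chosen here because it reproduces the hand calculation.

```python
import statistics

values = [1, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17]

# method="inclusive" treats the data as the whole population and
# interpolates linearly between neighbouring values
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Values outside these fences count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

With Q1 = 6 and Q3 = 14 the IQR is 8, so the fences are -6 and 26 and this sequence has no outliers.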

![](http://www.whatissixsigma.net/wp-content/uploads/2015/07/Box-Plot-Diagram-to-identify-Outliers-figure-1.png)

Let the data range be 199, 201, 236, 269,271,278,283,291, 301, 303, and 341

![](http://www.whatissixsigma.net/wp-content/uploads/2015/07/Box-Plot-Diagram-to-identify-Outliers-figure-2.png)

In [None]:
# For example, let's look at the frequency of pokemon types
print(data.type1.value_counts(dropna = False, sort = True, ascending = True))  # dropna = False: NaN values are counted too
# sort : boolean, default True   =>Sort by values
# dropna : boolean, default True =>Don’t include counts of NaN.
# As can be seen in the output, there are 112 water pokemon and 70 grass pokemon

In [None]:
# For example max Attack is 190 or min Defense is 5
# First quartile of HP is 50
# Median (second quartile) of Speed is 65
data.describe() # ignores null entries

<a id="19"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS
* Box plots: The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.

In [None]:
print(data.columns)

In [None]:
# For example: compare attack of pokemons that are legendary  or not
# Black line at top is max
# Blue line at top is 75%
# Red line is median (50%)
# Blue line at bottom is 25%
# Black line at bottom is min
# There are no outliers
data.boxplot(column='attack',by = 'legendary')
plt.show()

<a id="20"></a> <br>
### TIDY DATA
We tidy data with melt().
Describing melt in words is confusing, so let's use an example to understand it.

In [None]:
data_head = data.head()
data_head

In [None]:
# lets melt
# id_vars = what we do not wish to melt
# value_vars = what we want to melt
data_melted = pd.melt(frame = data_head, id_vars = 'name', value_vars = ['attack','defense'])
data_melted

<a id="21"></a> <br>
### PIVOTING DATA
Reverse of melting.

In [None]:
# Index is name
# The entries of 'variable' become the columns
# The entries of 'value' fill those columns
data_melted.pivot(index = 'name', columns = 'variable', values = 'value')

<a id="22"></a> <br>
### CONCATENATING DATA
We can concatenate two data frames.

In [None]:
# First let's create 2 data frames
data_head = data.head()
data_tail = data.tail()
conc_data_row = pd.concat([data_head, data_tail], axis = 0, ignore_index = True) # axis = 0 : stacks the data frames row-wise
conc_data_row

In [None]:
# First let's take two columns (series)
data_attack_head = data.attack.head()
data_defense_head = data.defense.head()
conc_data_col = pd.concat([data_attack_head, data_defense_head], axis = 1) # axis = 1 : places them side by side as columns
conc_data_col

<a id="23"></a> <br>
### DATA TYPES
There are 5 basic data types: object (string), boolean, integer, float and categorical.
<br> We can convert between data types, for example from str to categorical or from int to float.

<br> Why is category important: 
* it makes the dataframe smaller in memory 
* it can be utilized for analysis, especially with sklearn (we will learn it later)
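
The memory saving can be sketched on a made-up column. The series `types` below is invented for this illustration and only imitates a column with few distinct values, like the pokemon types.

```python
import pandas as pd

# Made-up column: 30,000 strings but only three distinct values
types = pd.Series(["water", "grass", "fire"] * 10_000)
types_category = types.astype("category")

# deep=True also counts the memory used by the strings themselves
bytes_plain = types.memory_usage(deep=True)
bytes_category = types_category.memory_usage(deep=True)

print("plain strings:", bytes_plain, "bytes")
print("category     :", bytes_category, "bytes")
print("categories   :", list(types_category.cat.categories))
```

A categorical column stores each distinct value once plus a small integer code per row, which is why it is smaller when values repeat often.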

In [None]:
# To learn data types in dataset
data.dtypes

In [None]:
# lets convert object(str) to categorical and int to float.
data.type1 = data.type1.astype('category')
data.speed = data.speed.astype('float')

In [None]:
# As you can see type1 is converted from object to categorical
# And speed is converted from int to float
data.dtypes

<a id="24"></a> <br>
### MISSING DATA and TESTING WITH ASSERT
If we encounter missing data, we can:
* leave it as is
* drop it with dropna()
* fill missing values with fillna()
* fill missing values with summary statistics like the mean
<br>Assert statement: a check that you can turn on while testing and turn off when you are done testing the program
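
The options for missing data can be compared side by side on a tiny made-up series. The values in `scores` are invented for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up scores with two missing entries
scores = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

dropped = scores.dropna()                     # remove the missing entries
filled_zero = scores.fillna(0)                # fill with a fixed value
filled_mean = scores.fillna(scores.mean())    # fill with the mean of the known values

print(dropped.tolist())
print(filled_zero.tolist())
print(filled_mean.tolist())

# assert raises an AssertionError if the condition is False
assert filled_mean.notnull().all()
```

All three calls return new objects, so `scores` itself still contains its two NaN values.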

In [None]:
# Let's check whether the pokemon data has NaN values
# As you can see there are 800 entries. However type2 has 414 non-null objects, so it has 386 null objects.
data.info()

In [None]:
# Let's check type2
data.type2.value_counts(dropna = False)
# As you can see, there are 386 NaN values

In [None]:
# Let's drop the rows whose type2 is NaN
# dropna() returns a new data frame, so data itself stays untouched for the fillna() example below
data_dropna = data.dropna(subset = ["type2"])
# So did it work?

In [None]:
# Let's check with an assert statement
# Assert statement:
assert 1==1 # returns nothing because the condition is true

In [None]:
# In order to run all the code, we keep this line commented out
# assert 1==2 # raises an AssertionError because the condition is false

In [None]:
assert  data_dropna.type2.notnull().all() # returns nothing because we dropped the NaN values
data_dropna.info()

# Alternatively, fill the missing values instead of dropping them
data["type2"] = data["type2"].fillna('empty')

In [None]:
assert  data.type2.notnull().all() # returns nothing because we filled the NaN values


In [None]:
# With the assert statement we can check a lot of things. For example:
# assert data.columns[1] == 'name'
# assert data.speed.dtypes == np.float64   # np.float was removed from recent NumPy versions

In this part, you learned:
* Diagnose data for cleaning
* Exploratory data analysis
* Visual exploratory data analysis
* Tidy data
* Pivoting data
* Concatenating data
* Data types
* Missing data and testing with assert

<a id="25"></a> <br>
# 4. PANDAS FOUNDATION

<a id="26"></a> <br>
### REVIEW of PANDAS
As you may have noticed, I do not introduce every idea at once. Although we have learned some basics of pandas, we will now go deeper.
* single column = series
* NaN = not a number
* dataframe.values = numpy array

<a id="27"></a> <br>
### BUILDING DATA FRAMES FROM SCRATCH
* We can build data frames from csv as we did earlier.
* We can also build data frames from dictionaries
    * zip() function: returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.
* Adding a new column
* Broadcasting: create a new column and assign one value to the entire column

In [None]:
# data frames from dictionary
brand = ["Ford","Opel"]
model = ["Focus","Corsa"]
list_label = ["Brand","Model"]

list_column = [brand, model]
data_zipped = list(zip(list_label, list_column))

data_dictionary = dict(data_zipped)
dataFrame = pd.DataFrame(data_dictionary)
dataFrame

In [None]:
# Add new columns
dataFrame["Year"] = ["2012","2015"]
dataFrame

In [None]:
# Broadcasting
dataFrame["Color"] = "White" # Broadcasting entire column
dataFrame

<a id="28"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS
* Plot
* Subplot
* Histogram:
    * bins: number of bins
    * range(tuple): min and max values of bins
    * density(boolean): normalize or not
    * cumulative(boolean): compute cumulative distribution

In [None]:
data_ads = data.loc[:, ["attack", "defense", "speed"]] 
data_ads.plot(subplots = True)
plt.show()

In [None]:
# scatter plot  
data_ads.plot(kind = "scatter", x= "attack", y = "defense")
plt.show()

In [None]:
# hist plot  
data_ads.plot(kind = "hist", y = "defense", bins = 50, range= (0,250), density = 1)
plt.show()

In [None]:
# histogram subplot with non cumulative and cumulative
figure, axes = plt.subplots(nrows = 2, ncols = 1)
data_ads.plot(kind = "hist", y = "defense", bins = 50, range= (0,250), density = 1, ax = axes[0])
data_ads.plot(kind = "hist", y = "defense", bins = 50, range= (0,250), density = 1, ax = axes[1], cumulative = True)
plt.savefig('graph.png')
plt.show()

<a id="29"></a> <br>
### STATISTICAL EXPLORATORY DATA ANALYSIS
I already explained this in the previous parts. However, let's look at it one more time.
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry
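
A small worked example may make these rows concrete. The series below is made up for illustration; the quartile rows of `describe()` are the same numbers that `quantile()` returns.

```python
import pandas as pd

# Hypothetical scores
scores = pd.Series([1, 2, 3, 4, 5])

summary = scores.describe()
print(summary)

# The 25%, 50% and 75% rows match the 0.25, 0.5 and 0.75 quantiles
quartiles = scores.quantile([0.25, 0.5, 0.75])
print(quartiles)
```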

In [None]:
data.describe()

<a id="30"></a> <br>
### INDEXING PANDAS TIME SERIES
* datetime = object
* parse_dates(boolean): parse dates while reading the file, so they become datetime values shown in ISO 8601 (yyyy-mm-dd hh:mm:ss) format
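
As a sketch of how `parse_dates` behaves in `pd.read_csv`, the example below reads a small in-memory csv twice. The csv text and column names are assumptions made up for this example; here `parse_dates` is given a list of column names, which `read_csv` also accepts.

```python
import io
import pandas as pd

# Hypothetical csv content
csv_text = "date,hp\n2019-06-21,45\n2019-06-22,60\n"

# Without parse_dates the date column stays as text
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["date"].dtype)

# With parse_dates the column is converted to datetime and can serve as the index
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")
print(parsed.index.dtype)
```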

In [None]:
time_list = ["1992-03-08","1992-04-12"]
print(type(time_list[1])) # As you can see, the date is a string
# However, we want it to be a datetime object
datetime_object = pd.to_datetime(time_list)
print(type(datetime_object))

In [None]:
# In order to practice, let's take the head of the pokemon data and add a date list to it
# .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning
data_datetime = data.head().copy()
date_list = ["2019-06-21","2019-06-22","2019-06-23","2020-01-11","2020-01-12"]
datetime_object = pd.to_datetime(date_list)
data_datetime["date"] = datetime_object
# let's make date the index
data_datetime = data_datetime.set_index("date")
data_datetime

In [None]:
# Now we can select according to our date index
print(data_datetime.loc["2019-06-22"])
print("---")
print(data_datetime.loc["2019-06-21":"2019-06-23"])
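
A date index also accepts partial date strings, so a whole month or year can be selected at once. The sketch below uses a made-up frame rather than the pokemon data.

```python
import pandas as pd

# Hypothetical frame with a datetime index
dates = pd.to_datetime(["2019-06-21", "2019-06-22", "2020-01-11"])
frame = pd.DataFrame({"hp": [45, 60, 80]}, index=dates)

june_2019 = frame.loc["2019-06"]   # every row in June 2019
year_2020 = frame.loc["2020"]      # every row in 2020
print(june_2019)
print(year_2020)
```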

<a id="31"></a> <br>
### RESAMPLING PANDAS TIME SERIES
* Resampling: statistical method over different time intervals
    * Needs a string to specify the frequency, like "M" = month or "A" = year (pandas 2.2 and later prefer "ME" and "YE")
* Downsampling: reduce datetime rows to a lower frequency, like from daily to weekly
* Upsampling: increase datetime rows to a higher frequency, like from daily to hourly
* Interpolate: fill in values according to different methods like 'linear', 'time' or 'index'
    * https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
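
The difference between downsampling and upsampling can be sketched on two made-up series. The dates and values are assumptions chosen so the results are easy to verify by hand; the "D" and "W" frequency strings mean daily and weekly.

```python
import pandas as pd

# Downsampling: 14 daily values starting on a Monday become 2 weekly means
daily = pd.Series(range(14), index=pd.date_range("2020-01-06", periods=14, freq="D"))
weekly = daily.resample("W").mean()
print(weekly)

# Upsampling: values every 2 days become daily values; the new rows start as NaN
sparse = pd.Series([0.0, 2.0, 4.0], index=pd.date_range("2020-01-01", periods=3, freq="2D"))
upsampled = sparse.resample("D").asfreq()
print(upsampled)

# Interpolate fills the NaN rows on a straight line between the known values
filled = upsampled.interpolate("linear")
print(filled)
```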

In [None]:
# We will use the data_datetime that we created in the previous part
# numeric_only=True skips text columns, which cannot be averaged
data_datetime.resample("A").mean(numeric_only=True)

In [None]:
# Let's resample by month
data_datetime.resample("M").mean(numeric_only=True)
# As you can see, there are a lot of NaN values because data_datetime does not include all months

In [None]:
# With real data (not data created by us like data_datetime) we can solve this problem with interpolate
# We can interpolate from the first value
data_datetime.resample("M").first().interpolate("linear")

In [None]:
# Or we can interpolate with mean()
data_datetime.resample("M").mean(numeric_only=True).interpolate("linear")

<a id="32"></a> <br>
# MANIPULATING DATA FRAMES WITH PANDAS

<a id="33"></a> <br>
### INDEXING DATA FRAMES
* Indexing using square brackets
* Using column attribute and row label
* Using loc accessor
* Selecting only some columns

In [None]:
# Read csv file
data = pd.read_csv('../input/pokemon.csv')
data = data.set_index("#")
data.head()

In [None]:
# Indexing using square brackets
data["HP"][1]

In [None]:
# Using column attribute and row label
data.HP[1]

In [None]:
# Using loc accessor
data.loc[1,["HP"]]

In [None]:
# Selecting only some columns
data[["HP","Attack"]]
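
With the loc accessor, the type of the result depends on whether a single label or a list of labels is passed. The following sketch uses a made-up two-row frame to show the three cases.

```python
import pandas as pd

# Hypothetical frame whose index starts at 1, like the pokemon data after set_index("#")
frame = pd.DataFrame({"HP": [45, 60], "Attack": [49, 62]}, index=[1, 2])

scalar = frame.loc[1, "HP"]       # single row label and single column label: one value
row_part = frame.loc[1, ["HP"]]   # list of columns: a Series
table = frame.loc[[1], ["HP"]]    # lists for rows and columns: a DataFrame
print(scalar)
print(type(row_part))
print(type(table))
```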

<a id="34"></a> <br>
### SLICING DATA FRAME
* Difference between selecting columns: series and data frames
* Slicing and indexing series
* Reverse slicing 
* From something to end

In [None]:
# Difference between selecting columns: series and dataframes
print(type(data["HP"]))     # series
print(type(data[["HP"]]))   # data frames

In [None]:
# Slicing and indexing series
data.loc[1:10,"HP":"Defense"] # 10 and "Defense" are inclusive

In [None]:
# Reverse slicing 
data.loc[10:1:-1,"HP":"Defense"] 

In [None]:
# From something to end
data.loc[1:10,"Sp. Atk":] 
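
Label slices with loc include the end label, unlike position slices. As a comparison, the sketch below uses a made-up frame and the position-based `iloc` accessor, which is not used elsewhere in this section.

```python
import pandas as pd

# Hypothetical frame with labels 1 to 4
frame = pd.DataFrame({"HP": [45, 60, 80, 39]}, index=[1, 2, 3, 4])

by_label = frame.loc[1:3]       # labels 1, 2 and 3: the end label is included
by_position = frame.iloc[1:3]   # positions 1 and 2: the end position is excluded
print(by_label.index.tolist())
print(by_position.index.tolist())
```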

<a id="35"></a> <br>
### FILTERING DATA FRAMES
* Creating boolean series
* Combining filters
* Filtering a column based on another column

In [None]:
# Creating boolean series
boolean = data.HP > 180
data[boolean]

In [None]:
# Combining filters
first_filter = data.HP > 180
second_filter = data.Speed > 15
data[first_filter & second_filter]

In [None]:
# Filtering a column based on another column
data.HP[data.Speed < 20]
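
Filters can be combined with `&` (and), `|` (or) and `~` (not). When the comparisons are written inline, each one needs its own parentheses. A sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical frame
frame = pd.DataFrame({"HP": [45, 250, 190, 20], "Speed": [45, 5, 50, 10]})

both = frame[(frame.HP > 180) & (frame.Speed > 15)]         # both conditions hold
either = frame[(frame.HP > 180) | (frame.Speed > 15)]       # at least one condition holds
neither = frame[~((frame.HP > 180) | (frame.Speed > 15))]   # no condition holds
print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```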

<a id="36"></a> <br>
### TRANSFORMING DATA
* Plain python functions
* Lambda function: apply an arbitrary python function to every element
* Defining a column using other columns

In [None]:
# Plain python functions
def div(n):
    return n/2
data.HP.apply(div)

In [None]:
# Or we can use lambda function
data.HP.apply(lambda n : n/2)

In [None]:
# Defining column using other columns
data["total_power"] = data.Attack + data.Defense
data.head()
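
For simple arithmetic, the same result can be obtained without apply by operating on the whole column at once. This sketch on a made-up frame compares the two; for large columns the column-wide form is usually the faster one.

```python
import pandas as pd

# Hypothetical frame
frame = pd.DataFrame({"HP": [45, 60, 80]})

halved_apply = frame.HP.apply(lambda n: n / 2)   # calls the function once per element
halved_vector = frame.HP / 2                     # one operation on the whole column
print(halved_apply.equals(halved_vector))
```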

<a id="37"></a> <br>
### INDEX OBJECTS AND LABELED DATA
* index: sequence of labels

In [None]:
# Our index name is this:
print(data.index.name)
# Let's change it
data.index.name = "index_name"
data.head()

In [None]:
# Overwrite index
# If we want to modify the index, we need to replace all of its labels.
# First copy our data to data_indexed, then change the index
data_indexed = data.copy()
# Let's make the index start from 100. It is not a remarkable change, just an example
data_indexed.index = range(100, 100 + len(data_indexed))
data_indexed.head()
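
The reason the whole index has to be replaced is that an index object does not allow changing a single label in place. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with the default index 0, 1, 2
frame = pd.DataFrame({"HP": [45, 60, 80]})

try:
    frame.index[0] = 100          # changing one label in place is not allowed
    changed_in_place = True
except TypeError as error:
    changed_in_place = False
    print("Cannot change one label:", error)

# Assigning a complete new index works
frame.index = range(100, 100 + len(frame))
print(frame.index.tolist())
```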

In [None]:
# We can make one of the columns the index. I did this at the beginning of the manipulating data frames with pandas section
# It was like this
# data = data.set_index("#")
# You can also use
# data.index = data["#"]

<a id="38"></a> <br>
### HIERARCHICAL INDEXING
* Setting indexing

In [None]:
# Let's read the data frame one more time to start from the beginning
data = pd.read_csv('../input/pokemon.csv')
data.head()
# As you can see, there is a default index. However, we want to set one or more columns as the index

In [None]:
# Setting index: Type 1 is the outer index, Type 2 is the inner index
data_index = data.set_index(["Type 1","Type 2"]) 
data_index.head(100)
# data_index.loc["Fire","Flying"] # how to use the indexes
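
Once two columns form the index, rows can be selected by the outer label alone or by an (outer, inner) pair. The sketch below uses a made-up three-row frame with the same column names as the pokemon data; the index is sorted first, which keeps label lookups efficient.

```python
import pandas as pd

# Hypothetical rows
frame = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Water"],
    "Type 2": ["Flying", "Dragon", "Flying"],
    "HP": [78, 78, 130],
})
indexed = frame.set_index(["Type 1", "Type 2"]).sort_index()

fire_rows = indexed.loc["Fire"]                          # all rows with outer label "Fire"
water_flying_hp = indexed.loc[("Water", "Flying"), "HP"] # outer and inner label together
print(fire_rows)
print(water_flying_hp)
```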

<a id="39"></a> <br>
### PIVOTING DATA FRAMES
* pivoting : reshape tool

In [None]:
dictionary = {"treatment":["A","A","B","B"],"gender":["F","M","F","M"],"response":[10,45,5,9],"age":[15,4,72,65]}
dataframe = pd.DataFrame(dictionary)
dataframe

In [None]:
# Pivoting
dataframe.pivot(index="treatment",columns = "gender",values="response")

<a id="40"></a> <br>
### STACKING and UNSTACKING DATAFRAME
* deal with multi label indexes
* level: position of unstacked index
* swaplevel : change inner and outer level index position

In [None]:
dataframe_index = dataframe.set_index(["treatment","gender"])
dataframe_index
# Lets unstack it

In [None]:
# level determines which index level is moved to the columns
dataframe_index.unstack(level = 0)

In [None]:
dataframe_index.unstack(level = 1)

In [None]:
# change inner and outer level index position
dataframe_swap = dataframe_index.swaplevel(0, 1)
dataframe_swap
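
Stacking is the reverse of unstacking: unstack moves an index level to the columns, and stack moves it back to the rows. A sketch under the assumption of a small frame like the one above, rebuilt here so the block is self-contained:

```python
import pandas as pd

# Same shape of data as the treatment/gender example, rebuilt for a self-contained block
frame = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "gender": ["F", "M", "F", "M"],
    "response": [10, 45, 5, 9],
}).set_index(["treatment", "gender"])

wide = frame.unstack(level="gender")      # gender moves from the row index to the columns
long_again = wide.stack(level="gender")   # gender moves back to the row index
print(wide)
print(long_again)
```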

<a id="41"></a> <br>
### MELTING DATA FRAMES
* Reverse of pivoting

In [None]:
dataframe

In [None]:
# dataframe.pivot(index="treatment", columns = "gender", values="response")
pd.melt(dataframe, id_vars = "treatment", value_vars = ["age","response"])
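
To illustrate that melting reverses pivoting, the sketch below melts a made-up wide table and pivots it back. The `var_name` and `value_name` arguments only name the two new columns.

```python
import pandas as pd

# Hypothetical wide table: one response column per gender
wide = pd.DataFrame({"treatment": ["A", "B"], "F": [10, 5], "M": [45, 9]})

# Melt: the F and M columns become rows
long = pd.melt(wide, id_vars="treatment", value_vars=["F", "M"],
               var_name="gender", value_name="response")
print(long)

# Pivot: the rows become columns again
back = long.pivot(index="treatment", columns="gender", values="response")
print(back)
```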

<a id="42"></a> <br>
### CATEGORICALS AND GROUPBY

In [None]:
# We will use dataframe
dataframe

In [None]:
# Group by treatment and take the means of the other numeric features
dataframe.groupby("treatment").mean(numeric_only=True) # Mean is an aggregation / reduction method
# There are other methods like sum, std, max or min

In [None]:
# We can choose only one of the features
dataframe.groupby("treatment").age.max() 

In [None]:
# Or we can choose multiple features
dataframe.groupby("treatment")[["age","response"]].min() 

In [None]:
dataframe.info()
# As you can see, gender is object
# However, before using groupby we can convert it to categorical data,
# because categorical data uses less memory and speeds up operations like groupby
#dataframe["gender"] = dataframe["gender"].astype("category")
#dataframe["treatment"] = dataframe["treatment"].astype("category")
#dataframe.info()
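
The memory claim can be checked on a longer column with many repeated values. The series below is made up for this purpose; exact byte counts depend on the pandas version, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column with 2000 entries but only two distinct values
gender = pd.Series(["F", "M"] * 1000)
as_category = gender.astype("category")

print(gender.memory_usage(deep=True))        # memory of the text column in bytes
print(as_category.memory_usage(deep=True))   # memory of the categorical column in bytes
print(as_category.cat.categories.tolist())   # the distinct values are stored only once
```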

# CONCLUSION
Thank you for your votes and comments

This document was prepared using [this](https://www.kaggle.com/kanncaa1/data-sciencetutorial-for-beginners) tutorial.