# DATA SCIENTIST
**In this tutorial, I explain only what you need to know to become a data scientist.**

A data scientist needs these skills:

1. Basic Tools: Such as Python, R or SQL. You do not need to know all of them. You only need to learn how to use **python**.
1. Basic Statistics: Such as mean, median or standard deviation. If you know basic statistics, you can apply them easily with **python**.
1. Data Munging: Working with messy and difficult data, such as inconsistent date and string formatting. As you can guess, **python** helps us here.
1. Data Visualization: The title is self-explanatory. We will visualize the data with **python** libraries like matplotlib and seaborn.
1. Machine Learning: You do not need to understand the math behind machine learning techniques. You only need to understand the basics of machine learning and learn how to implement it with **python**.

### In summary, we will learn python to become data scientists!

**Content:**
1. [Introduction to Python:](#1)
    1. [Matplotlib](#2)
    1. [Dictionaries ](#3)
    1. [Pandas](#4)
    1. [Logic, control flow and filtering](#5)
    1. [Loop data structures](#6)
1. [Python Data Science Toolbox:](#7)
    1. [User defined function](#8)
    1. [Scope](#9)
    1. [Nested function](#10)
    1. [Default and flexible arguments](#11)
    1. [Lambda function](#12)
    1. [Anonymous function](#13)
    1. [Iterators](#14)
    1. [List comprehension](#15)
1. [Cleaning Data](#16)
    1. [Diagnose data for cleaning](#17)
    1. [Exploratory data analysis](#18)
    1. [Visual exploratory data analysis](#19)
    1. [Tidy data](#20)
    1. [Pivoting data](#21)
    1. [Concatenating data](#22)
    1. [Data types](#23)
    1. [Missing data and testing with assert](#24)
1. [Pandas Foundation](#25)
    1. [Review of pandas](#26)
    1. [Building data frames from scratch](#27)
    1. [Visual exploratory data analysis](#28)
    1. [Statistical exploratory data analysis](#29)
    1. [Indexing pandas time series](#30)
    1. [Resampling pandas time series](#31)
1. [Manipulating Data Frames with Pandas](#32)
    1. [Indexing data frames](#33)
    1. [Slicing data frames](#34)
    1. [Filtering data frames](#35)
    1. [Transforming data frames](#36)
    1. [Index objects and labeled data](#37)
    1. [Hierarchical indexing](#38)
    1. [Pivoting data frames](#39)
    1. [Stacking and unstacking data frames](#40)
    1. [Melting data frames](#41)
    1. [Categoricals and groupby](#42)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/pokemon-challenge/pokemon.csv')

In [None]:
data.info()

In [None]:
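# corr() needs numeric data, so keep only the numeric and boolean columns
data.select_dtypes(include=['number', 'bool']).corr()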

In [None]:
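# correlation map (text columns are dropped first because corr() needs numeric data)
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(data.select_dtypes(include=['number', 'bool']).corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
plt.show()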

In [None]:
data.head()

In [None]:
data.columns

<a id="2"></a> <br>
### MATPLOTLIB
Matplotlib is a Python library that helps us plot data. The simplest and most basic plots are line, scatter and histogram plots.
* A line plot is better when the x axis is time.
* A scatter plot is better when we want to see the correlation between two variables.
* A histogram is better when we need to see the distribution of numerical data.
* Customization: colors, labels, line thickness, title, opacity, grid, figsize, axis ticks and linestyle

In [None]:
# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.Speed.plot(kind = 'line', color = 'g',label = 'Speed',linewidth=1,alpha = 0.5,grid = True,linestyle = ':')
data.Defense.plot(color = 'r',label = 'Defense',linewidth=1, alpha = 0.5,grid = True,linestyle = '-.')
plt.legend(loc='upper right')     # legend = puts label into plot
plt.xlabel('x axis')              # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot')            # title = title of plot
plt.show()

In [None]:
# Scatter Plot 
# x = attack, y = defense
data.plot(kind='scatter', x='Attack', y='Defense',alpha = 0.5,color = 'red')
plt.xlabel('Attack')              # label = name of label
plt.ylabel('Defense')
plt.title('Attack Defense Scatter Plot')            # title = title of plot
plt.show()

In [None]:
# Histogram
# bins = number of bars in the figure
data.Speed.plot(kind = 'hist',bins = 50,figsize = (12,12))
plt.show()

In [None]:
# clf() clears the current figure so you can start fresh
data.Speed.plot(kind = 'hist',bins = 50)
plt.clf()
# We cannot see the plot because of clf()

<a id="3"></a> <br>
### DICTIONARY
Why do we need dictionaries?
* Each entry has a 'key' and a 'value'
* Looking up a value by its key is faster than searching through a list
<br>
What are key and value? Example:
* dictionary = {'spain' : 'madrid'}
* The key is spain.
* The value is madrid.
<br>
<br>**It's that easy.**
<br>Let's practice some other operations: keys(), values(), updating, adding and checking entries, removing a key, removing all entries and deleting the dictionary.
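
To make the lookup point concrete, here is a minimal sketch that stores the same pairs in a list and in a dictionary. The names `capitals_list`, `capitals_dict` and `find_in_list` are made up for this illustration.

```python
# Hypothetical example data: the same pairs stored two ways
capitals_list = [('spain', 'madrid'), ('france', 'paris'), ('italy', 'rome')]
capitals_dict = dict(capitals_list)

def find_in_list(pairs, key):
    """Scan the list pair by pair until the key matches."""
    for k, v in pairs:
        if k == key:
            return v
    return None

# List: may have to check every pair before the key is found
print(find_in_list(capitals_list, 'italy'))
# Dictionary: goes straight to the value through the key's hash
print(capitals_dict['italy'])
# get() returns a default instead of raising KeyError for a missing key
print(capitals_dict.get('germany', 'unknown'))
```

With three entries the difference is invisible, but the list search grows with the number of entries while the dictionary lookup does not.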

In [None]:
#create dictionary and look at its keys and values
dictionary = {'spain' : 'madrid','usa' : 'vegas'}
print(dictionary.keys())
print(dictionary.values())

In [None]:
# Keys have to be immutable objects like strings, booleans, floats, integers or tuples
# Lists are mutable, so they cannot be used as keys
# Keys are unique
dictionary['spain'] = "barcelona"    # update existing entry
print(dictionary)
dictionary['france'] = "paris"       # Add new entry
print(dictionary)
del dictionary['spain']              # remove entry with key 'spain'
print(dictionary)
print('france' in dictionary)        # check whether the key exists
dictionary.clear()                   # remove all entries in dict
print(dictionary)


In [None]:
print(dictionary)

<a id="4"></a> <br>
### PANDAS
What do we need to know about pandas?
* CSV: comma-separated values, the file format we read into a data frame with pd.read_csv



In [None]:
data = pd.read_csv('../input/pokemon-challenge/pokemon.csv')

In [None]:
series = data['Defense']        # data['Defense'] = series
print(type(series))
data_frame = data[['Defense']]  # data[['Defense']] = data frame
print(type(data_frame))


In [None]:
# 1 - Filtering Pandas data frame
x = data['Defense']>200     # Only 3 pokemon have a defense value higher than 200
data[x]

In [None]:
data[data['Defense']>200 ]

In [None]:
data[np.logical_and(data['Defense']>200, data['Attack']>100 )]

In [None]:
data[(data['Defense']>200) & (data['Attack']>100)]

<a id="6"></a> <br>
### WHILE and FOR LOOPS
We will learn the most basic while and for loops.

In [None]:
i = 0
while i != 5 :
    print('i is: ',i)
    i +=1 
print(i,' is equal to 5')

In [None]:
lis = [1,2,3,4,5]
for i in lis:
    print('i is: ',i)
print('end')

In [None]:
# Enumerate index and value of list
# index : value = 0:1, 1:2, 2:3, 3:4, 4:5
for index, value in enumerate(lis):
    print(index," : ",value)
print('')   

# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We covered keys and values in the dictionary part.
dictionary = {'spain':'madrid','france':'paris'}
for key,value in dictionary.items():
    print(key," : ",value)
print('')

# For pandas we can access the index and value of each row
for index,value in data[['Attack']][0:1].iterrows():
    print(index," : ",value)


<a id="7"></a> <br>
# 2. PYTHON DATA SCIENCE TOOLBOX

<a id="8"></a> <br>
### USER DEFINED FUNCTION
What we need to know about functions:
* docstrings: documentation for functions. Example:
<br>for f():
    <br>"""This is docstring for documentation of function f"""
* tuple: an immutable sequence of python objects.
<br>its values cannot be modified
<br>a tuple uses parentheses, like t = (1,2,3)
<br>a tuple can be unpacked into several variables, like a,b,c = t
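
As a small sketch of these points, the function below carries a docstring, returns a tuple, and the last lines show what happens when we try to modify that tuple. The function name `min_max` is only an illustration.

```python
def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)

# The docstring is stored on the function; help(min_max) shows it too
print(min_max.__doc__)

result = min_max([4, 1, 9])
print(result)            # the two values packed into one tuple
low, high = result       # unpack the tuple into two variables

# Tuples are immutable: assigning to an element raises TypeError
try:
    result[0] = 100
except TypeError as error:
    print('cannot modify a tuple:', error)
```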

In [None]:
# example of what we learned above
def tuple_ex():
    """ return the tuple t defined inside the function """
    t = (1,2,3)
    return t
a,b,c = tuple_ex()
print(a,b,c)

<a id="9"></a> <br>
### SCOPE
What we need to know about scope:
* global: defined in the main body of the script
* local: defined inside a function
* built-in scope: names in the predefined built-in module, such as print and len
<br><br>Let's look at some basic examples

In [None]:
# guess what this prints
x = 2
def f():
    x = 3
    return x
print(x)      # x = 2 global scope
print(f())    # x = 3 local scope

In [None]:
f

In [None]:
x = 5
def f():
    y = 2*x        # there is no local scope x
    return y
print(f())         # it uses global scope x
# First the local scope is searched, then the global scope; if the name is found in neither, the built-in scope is searched last.

In [None]:
# How can we see which names are in the built-in scope?
import builtins
builtin_names = dir(builtins)   # avoid the name `list`, which would shadow the built-in list type
builtin_names

<a id="10"></a> <br>
### NESTED FUNCTION
* A nested function is a function defined inside another function.
* Names are resolved by the LEGB rule: Python searches the local scope, then the enclosing function's scope, then the global scope and finally the built-in scope.
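
The LEGB order can be sketched with one name, `x`, defined at several levels. All function names here are illustrative.

```python
x = 'global x'

def outer():
    x = 'enclosing x'
    def inner_local():
        x = 'local x'      # L: found in the local scope first
        return x
    def inner_enclosing():
        return x           # E: no local x, so the enclosing function's x is used
    return inner_local(), inner_enclosing()

def use_global():
    return x               # G: no local or enclosing x, so the global x is used

def use_builtin(values):
    return len(values)     # B: len is not defined in this script, so the built-in is used

print(outer())
print(use_global())
print(use_builtin([1, 2, 3]))
```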

In [None]:
#nested function
def square():
    """ return square of value """
    def add():
        """ add two local variables """
        x = 2
        y = 3
        z = x + y
        return z
    return add()**2
print(square()) 

<a id="11"></a> <br>
### DEFAULT and FLEXIBLE ARGUMENTS
* Default argument example:
<br> `def f(a, b=1):`
<br> `""" b = 1 is the default argument """`
* Flexible argument example:
<br> `def f(*args):`
<br> `""" *args collects any number of positional arguments into a tuple """`
<br> `def f(**kwargs):`
<br> `""" **kwargs collects keyword arguments into a dictionary """`

<br><br> Let's write some code to practice
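
As a first sketch, the hypothetical function `describe` combines a required argument, a default argument, `*args` and `**kwargs`, and simply returns what it received so that we can see where each value ends up.

```python
def describe(name, greeting='hello', *args, **kwargs):
    """Return the pieces the function received, for inspection."""
    return name, greeting, args, kwargs

# Only the required argument: the default is used, args and kwargs are empty
first = describe('ash')
print(first)

# Extra positional values go to args, keyword values go to kwargs
second = describe('ash', 'hi', 1, 2, level=5)
print(second)
```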

In [None]:
# default arguments
def f(a, b = 1, c = 2):
    y = a + b + c
    return y
print(f(5))
# what if we want to change default arguments
print(f(5,4,3))

In [None]:
# flexible arguments *args
def f(*args):
    for i in args:
        print(i)
f(1)
print("")
f(1,2,3,4)
# flexible arguments: **kwargs is a dictionary
def f(**kwargs):
    """ print key and value of dictionary"""
    for key, value in kwargs.items():   # see the dictionary example in the for loop part if this is unclear
        print(key, " ", value)
f(country = 'spain', capital = 'madrid', population = 123456)

<a id="12"></a> <br>
### LAMBDA FUNCTION
A shorter way of writing a function

In [None]:
# lambda function
square = lambda x: x**2     # where x is name of argument
print(square(4))
tot = lambda x,y,z: x+y+z   # where x,y,z are names of arguments
print(tot(1,2,3))

<a id="13"></a> <br>
### ANONYMOUS FUNCTION
A lambda function is anonymous: it has no name, so it can be passed directly to another function such as map().
* map(func,seq) : applies a function to all the items in a sequence


In [None]:
number_list = [1,2,3]
y = [*map(lambda x: x**2, number_list)]   # apply the lambda to every item and unpack the results into a list
y

<a id="14"></a> <br>
### ITERATORS
* iterable: an object that can return an iterator
* iterable: an object with an associated `__iter__()` method, which iter() calls
<br> examples: lists, strings and dictionaries
* iterator: produces the next value with the next() function
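
A minimal sketch of the protocol, using a small made-up list: `iter()` gets the iterator, `next()` pulls values one at a time, and `StopIteration` signals the end. A for loop performs the same steps for us.

```python
numbers = [10, 20, 30]      # a list is an iterable
it = iter(numbers)          # iter() asks the iterable for an iterator

print(next(it))             # first value
print(next(it))             # second value
print(next(it))             # third value

# When nothing is left, next() raises StopIteration
try:
    next(it)
except StopIteration:
    print('iterator is exhausted')

# A for loop does the same steps and stops at StopIteration by itself
collected = []
for n in numbers:
    collected.append(n)
print(collected)
```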

In [None]:
# iteration example
name = "ronaldo"
it = iter(name)
print(next(it))    # print the first item
print(*it)         # print the remaining items


<a id="15"></a> <br>
### LIST COMPREHENSION
**One of the most important topics of this kernel**
<br>We often use list comprehensions in data analysis.
<br> list comprehension: collapses a for loop that builds a list into a single line
<br>Ex: num1 = [1,2,3] and we want to make num2 = [2,3,4]. This can be done with a for loop, but that is unnecessarily long. With a list comprehension we can do it in one line of code.
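
For comparison, here is a sketch of both versions side by side. The variable names `num2_loop` and `num2_comp` are only used to keep the two results apart.

```python
num1 = [1, 2, 3]

# Long version: build the list with a for loop
num2_loop = []
for i in num1:
    num2_loop.append(i + 1)

# Short version: the same logic as a list comprehension
num2_comp = [i + 1 for i in num1]

print(num2_loop)
print(num2_comp)
```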

In [None]:
# Example of list comprehension
num1 = [1,2,3]
num2 = [i + 1 for i in num1 ]
print(num2)

In [None]:
# Conditionals on iterable
num1 = [5,10,15]
# read as: i**2 if i == 10, otherwise i-5 if i < 7, otherwise i+5
num2 = [i**2 if i == 10 else i-5 if i < 7 else i+5 for i in num1]
print(num2)

In [None]:
# let's return to the pokemon csv and make one more list comprehension example
# let's classify pokemon as having high or low speed. Our threshold is the average speed.
threshold = sum(data.Speed)/len(data.Speed)
data["speed_level"] = ["high" if i > threshold else "low" for i in data.Speed]
data.loc[:10,["speed_level","Speed"]] # we will cover loc in more detail later

<a id="16"></a> <br>
# 3.CLEANING DATA

<a id="17"></a> <br>
### DIAGNOSE DATA for CLEANING
We need to diagnose and clean data before exploring it.
<br>Unclean data:
* Inconsistent column names, such as mixed upper and lower case letters or spaces between words
* missing data
* different languages

<br> We will use the head, tail, columns, shape and info methods to diagnose data
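
Once column name inconsistencies have been spotted, one possible fix is to normalize the names. The sketch below works on a made-up list of names (`raw_columns` is hypothetical, not read from the dataset) and uses only string methods.

```python
# Hypothetical column names showing typical inconsistencies
raw_columns = ['Name', 'Type 1', ' HP', 'Sp. Atk', 'speed ']

def clean_name(column):
    """Strip outer spaces, lower-case, and replace inner spaces with underscores."""
    return column.strip().lower().replace(' ', '_')

clean_columns = [clean_name(c) for c in raw_columns]
print(clean_columns)
```

With a pandas data frame, a cleaned list like this can be assigned back to `data.columns`.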

In [None]:
data = pd.read_csv('../input/pokemon-challenge/pokemon.csv')
data.head()  # head shows first 5 rows

In [None]:
# tail shows last 5 rows
data.tail()

In [None]:
# columns gives column names of features
data.columns

In [None]:
# shape gives number of rows and columns in a tuple
data.shape

In [None]:
# info gives the data type (e.g. data frame), number of samples (rows), number of features (columns), feature types and memory usage
data.info()

<a id="18"></a> <br>
### EXPLORATORY DATA ANALYSIS
value_counts(): Frequency counts
<br>outliers: values that are considerably higher or lower than the rest of the data
* Let's say the value at 75% is Q3 and the value at 25% is Q1.
* Outliers are smaller than Q1 - 1.5(Q3-Q1) or bigger than Q3 + 1.5(Q3-Q1). (Q3-Q1) = IQR
<br>We will use the describe() method. It reports:
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry

<br> What is a quartile?

* 1,4,5,6,8,9,11,12,13,14,15,16,17
* The median is the number in the **middle** of the sequence. In this case it is 11.

* The lower quartile is the median of the values between the smallest number and the median, i.e. between 1 and 11, which is 6.
* The upper quartile is the median of the values between the median and the largest number, i.e. between 11 and 17, which is 14.
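
The quartiles and outlier limits for this sequence can be checked with a short sketch using Python's standard `statistics` module (Python 3.8 or newer). The extra value 40 is made up here only to show an outlier being flagged.

```python
import statistics

values = [1, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17]

# method='inclusive' interpolates with the minimum and maximum included;
# for this sequence it gives the same quartiles as the worked example
q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, median, q3)
print(lower_fence, upper_fence)

# 40 is a made-up extra value, added only to show an outlier being flagged
outliers = [v for v in values + [40] if v < lower_fence or v > upper_fence]
print(outliers)
```

None of the original thirteen values falls outside the limits, so only the added value is reported.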

In [None]:
# For example let's look at the frequency of pokemon types
print(data['Type 1'].value_counts(dropna =False))  # with dropna=False, NaN values are counted too
# As can be seen below, there are 112 water pokemon and 70 grass pokemon

In [None]:
# For example max HP is 255 or min defense is 5
data.describe() # describe() ignores null entries

<a id="19"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS
* Box plots: visualize basic statistics like outliers, min/max or quartiles

In [None]:
# For example: compare the attack of pokemon that are legendary or not
# Black line at top is max
# Blue line at top is 75%
# Red line is median (50%)
# Blue line at bottom is 25%
# Black line at bottom is min
# There are no outliers
data.boxplot(column='Attack',by = 'Legendary')
plt.title('')
plt.show()

<a id="20"></a> <br>
### TIDY DATA
We tidy data with melt().
Describing melt in words is confusing, so let's use an example to understand it.
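
Before using pandas, the reshaping idea can be sketched in plain Python. This is only an illustration of what melting does to the rows, not how pandas implements it. The rows and the helper `melt_rows` are made up, and the output keys `variable` and `value` mirror the default column names that `pd.melt` produces.

```python
# Hypothetical wide-format rows: one row per pokemon, one column per stat
wide_rows = [
    {'Name': 'A', 'Attack': 49, 'Defense': 49},
    {'Name': 'B', 'Attack': 62, 'Defense': 63},
]

def melt_rows(rows, id_var, value_vars):
    """Turn each (row, value column) pair into its own long-format row."""
    long_rows = []
    for column in value_vars:          # one block of rows per melted column
        for row in rows:
            long_rows.append({id_var: row[id_var],
                              'variable': column,
                              'value': row[column]})
    return long_rows

long_rows = melt_rows(wide_rows, 'Name', ['Attack', 'Defense'])
for long_row in long_rows:
    print(long_row)
```

Two rows with two melted columns become four rows: the identifier column is repeated, and the former column names move into the `variable` column.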


In [None]:
# First I create a new data frame from the pokemon data to explain melt more easily.
data_new = data.head()    # I take only the first 5 rows into the new data frame
data_new

In [None]:
# lets melt
# id_vars = what we do not wish to melt
# value_vars = what we want to melt
melted = pd.melt(frame=data_new,id_vars = 'Name', value_vars= ['Attack','Defense'])
melted