# DATA SCIENTIST

**In this tutorial, I will show you only the first steps toward becoming a data scientist using Python.**

**Content:**
1. [Introduction to Python:](#1)
    1. [Matplotlib](#2)
    1. [Dictionaries ](#3)
    1. [Pandas](#4)
    1. [Logic, control flow and filtering](#5)
    1. [Loop data structures](#6)
1. [Python Data Science Toolbox:](#7)
    1. [User defined function](#8)
    1. [Scope](#9)
    1. [Nested function](#10)
    1. [Default and flexible arguments](#11)
    1. [Lambda function](#12)
    1. [Anonymous function](#13)
    1. [Iterators](#14)
    1. [List comprehension](#15)
1. [Cleaning Data](#16)
    1. [Diagnose data for cleaning](#17)
    1. [Exploratory data analysis](#18)
    1. [Visual exploratory data analysis](#19)
    1. [Tidy data](#20)
    1. [Pivoting data](#21)
    1. [Concatenating data](#22)
    1. [Data types](#23)
    1. [Missing data and testing with assert](#24)
1. [Pandas Foundation](#25)
    1. [Review of pandas](#26)
    1. [Building data frames from scratch](#27)
    1. [Visual exploratory data analysis](#28)
    1. [Statistical exploratory data analysis](#29)
    1. [Indexing pandas time series](#30)
    1. [Resampling pandas time series](#31)
1. [Manipulating Data Frames with Pandas](#32)
    1. [Indexing data frames](#33)
    1. [Slicing data frames](#34)
    1. [Filtering data frames](#35)
    1. [Transforming data frames](#36)
    1. [Index objects and labeled data](#37)
    1. [Hierarchical indexing](#38)
    1. [Pivoting data frames](#39)
    1. [Stacking and unstacking data frames](#40)
    1. [Melting data frames](#41)
    1. [Categoricals and groupby](#42)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.max_columns",None) 
pd.set_option("display.max_rows",None)
# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv("../input/2015.csv")

In [None]:
data.info()  # summary of the columns, non-null counts and dtypes

In [None]:
data.rename(columns={"Economy (GDP per Capita)":"economy","Health (Life Expectancy)":"health","Trust (Government Corruption)":"Trust"}, inplace=True)

In [None]:
# shape gives number of rows and columns in a tuple
data.shape

In [None]:
data.columns

In [None]:
data.columns = [each.replace(" ","_") if(len(each.split())>1) else each for each in data.columns]
print(data.columns)

In [None]:
data.columns = [each.lower() for each in data.columns]
print(data.columns)

In [None]:
data.describe()

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.sample(5)

In [None]:
data.dtypes

Standard Error: The standard error of the happiness score.

Economy (GDP per Capita): The extent to which GDP contributes to the calculation of the Happiness Score.

Family: The extent to which Family contributes to the calculation of the Happiness Score

Health (Life Expectancy): The extent to which Life expectancy contributed to the calculation of the Happiness Score.

Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.

Trust (Government Corruption): The extent to which Perception of Corruption contributes to Happiness Score.

Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.

Dystopia Residual: The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.


In [None]:
# Display positive and negative correlation between the numeric columns
data.select_dtypes(include="number").corr()

In [None]:
# sort all pairwise correlations in ascending order
data.select_dtypes(include="number").corr().unstack().sort_values().drop_duplicates()

In [None]:
# correlation map
plt.subplots(figsize=(10,10))
sns.heatmap(data.select_dtypes(include="number").corr(), annot=True, linewidths=.5, cmap="YlGnBu", fmt=".2f")
plt.show()
# figsize - image size
# corr() - pairwise correlation between the numeric columns
# annot=True - writes the correlation value in each cell
# linewidths - thickness of the lines between the cells
# cmap - color map used for the cells
# fmt - number format (".2f" = two digits after the decimal point)
# A correlation close to 1 means the two columns are positively correlated.
# A correlation close to -1 means the two columns are negatively correlated.
# A correlation close to 0 means there is no linear relationship between them.

In [None]:
data.isnull().head(15)

In [None]:
data.isnull().sum() # number of missing values in each column

In [None]:
data.isnull().sum().sum()  # total number of missing values in the data

In [None]:
data[["happiness_score"]].isnull().head(15)

In [None]:
data.sort_values("happiness_score", ascending=False).head(10)

In [None]:
data.sort_values("happiness_score", ascending=True).head(10)

In [None]:
data[["happiness_score","economy","family","health"]].head(10)

<a id="1"></a> <br>
# 1. INTRODUCTION TO PYTHON

<a id="2"></a> <br>
### MATPLOTLIB
Matplotlib is a Python library that helps us plot data. The simplest and most basic plots are line, scatter and histogram plots.
* A line plot is better when the x axis is time.
* A scatter plot is better when we want to see the correlation between two variables.
* A histogram is better when we need to see the distribution of numerical data.
* Customization: colors, labels, line thickness, title, opacity, grid, figsize, axis ticks and linestyle

In [None]:
# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.happiness_score.plot(kind="line", color="g", label="happiness_score", linewidth=1, alpha=0.5, grid=True, figsize=(12,12))
data.economy.plot(kind="line", color="r", label="economy", linewidth=1, alpha=0.5, grid=True)
data.family.plot(kind="line", color="y", label="family", linewidth=1, alpha=0.5, grid=True)
data.health.plot(kind="line", color="b", label="health", linewidth=1, alpha=0.5, grid=True)
plt.legend(loc="upper right")# legend = puts label into plot
plt.xlabel("x axis")         # label = name of label
plt.ylabel("y axis")
plt.title("line Plot")       # title = title of plot
plt.show()

#plt.xticks(np.arange(first value,last value,step)) 
#plt.xticks(np.arange(0,800,30)) #Determines the ranges of values in the x-axis
#plt.yticks(np.arange(0,300,30)) #Determines the ranges of values in the y-axis
#plt.show()

In [None]:
# subplots
data.plot(subplots = True, figsize=(12,12))
plt.show()

In [None]:
plt.subplot(4,2,1)
data.family.plot(kind="line", color="orange", label="family", linewidth=1, alpha=0.5, grid=True, figsize=(10,10))
data.happiness_score.plot(kind="line", color="green", label="happiness_score", linewidth=1, alpha=0.5, grid=True, figsize=(10,10))
plt.ylabel("family")
plt.subplot(4,2,2)
data.generosity.plot(kind="line", color="blue", label="generosity", linewidth=1, alpha=0.5, grid=True, linestyle=":")
plt.ylabel("generosity")
plt.subplot(4,2,3)
data.trust.plot(kind="line", color="green", label="trust", linewidth=1, alpha=0.5, grid=True, linestyle="-.")
plt.ylabel("trust")
plt.subplot(4,2,4)
data.freedom.plot(kind="line", color="red", label="freedom", linewidth=1, alpha=0.5, grid=True)
plt.ylabel("freedom")
plt.show()

In [None]:
# Scatter Plot 
# x = happiness_score, y = economy
data.plot(kind="scatter", x="happiness_score", y="economy", alpha=0.5, color="green", figsize=(5,5))
plt.xlabel("happiness_score")    # label = name of label
plt.ylabel("economy")
plt.title("Happiness Score Economy Scatter Plot") # title = title of plot
plt.show()

In [None]:
data.plot(kind="scatter", x="economy", y="health", alpha=0.5, color="blue", figsize=(5,5))
plt.xlabel("economy")    # label = name of label
plt.ylabel("health")
plt.title("Economy Health Scatter Plot") # title = title of plot
plt.show()

In [None]:
# Histogram
# bins = number of bar in figure
data.happiness_score.plot(kind="hist",color="orange", bins=160, figsize=(10,10))
plt.show()

In [None]:
data.happiness_score.head(30).plot(kind="bar")
plt.show()

In [None]:
data.happiness_score.sample(30).plot(kind="bar")
plt.show()

In [None]:
data.happiness_score.head(100).plot(kind="area")
plt.show()

In [None]:
# clf() clears the current figure so you can start afresh
data.happiness_score.plot(kind="hist", bins=50)
plt.clf() # the plot is not shown because clf() clears it before show()
plt.show()

<a id="3"></a> <br>
### DICTIONARY
Why do we need dictionaries?
* Each entry has a 'key' and a 'value'.
* Looking up a value by its key is faster than searching through a list.
<br>
What are key and value? Example:
* dictionary = {'spain' : 'madrid'}
* The key is spain.
* The value is madrid.
<br>
<br>**It's that easy.**
<br>Let's practice some other operations: keys(), values(), updating, adding, checking and removing a key, removing all entries, and deleting the dictionary.
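
As a small illustration of key-based lookup, here is a minimal sketch. The `capitals` dictionary and its entries are made up for this example and are not part of the dataset. It shows `in`, `get()` with a default value, and `pop()`.

```python
# Hypothetical example data: country -> capital
capitals = {"spain": "madrid", "france": "paris"}

# Look up a value directly by its key
print(capitals["spain"])

# `in` checks the keys, not the values
has_spain = "spain" in capitals
has_madrid = "madrid" in capitals

# get() returns a default instead of raising KeyError for a missing key
italy = capitals.get("italy", "unknown")

# pop() removes a key and returns its value
removed = capitals.pop("france")
print(has_spain, has_madrid, italy, removed, capitals)
```

Using `get()` is one way to avoid an error when a key may be absent; whether a default value is appropriate depends on the data.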

In [None]:
# not used later; it is just an example of building a data frame from a list of dictionaries
dic2 = [{"id": 825, "name": "Orhan"}, {"id": 851, "name": "Kadir"},{"id": 856, "name": "Cemal"}]
df2 = pd.DataFrame(dic2)
df2

In [None]:
# create a dictionary and look at its keys and values
dictionary = {"Turkey":"Ankara","Germany":"Berlin"}
print(dictionary.keys())
print(dictionary.values())

In [None]:
# Keys have to be immutable objects like string, boolean, float, integer or tuples
# A list is mutable, so it cannot be used as a key
# Keys are unique
dictionary["Turkey"] = "Ankara" # update existing entry
print(dictionary)
dictionary["France"] = "Paris"    # add new entry
print(dictionary)
del dictionary["France"]           # remove entry with key 'France'
print(dictionary)
print("France" in dictionary)     # check whether a key is included or not
dictionary.clear()                # remove all entries in dict
print(dictionary)

In [None]:
# To run all the code without an error, keep the next line commented out
#del dictionary         # delete entire dictionary
print(dictionary)       # raises NameError if the dictionary has been deleted

<a id="4"></a> <br>
### PANDAS
What do we need to know about pandas?
* CSV: comma-separated values

In [None]:
print(type(data)) # pandas.core.frame.DataFrame
print(type(data[["freedom"]])) #pandas.core.frame.DataFrame
print(type(data["freedom"])) #pandas.core.series.Series
print(type(data["freedom"].values)) #numpy.ndarray

In [None]:
series = data['freedom']        # data['freedom'] = series
data_frame = data[['freedom']]  # data[['freedom']] = data frame

print(type(series))
print(type(data_frame))

print(series.head(10))
data_frame.head(10)

<a id="5"></a> <br>
Before continuing with pandas, we need to learn **logic, control flow** and **filtering.**
<br>Comparison operators: ==, !=, <, >, <=, >=
<br>Boolean operators: and, or, not
<br>Filtering pandas
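
The sketch below illustrates filtering on a tiny made-up data frame (`toy` and its values are assumptions for this example; only the column names mimic the dataset). Python's `and`, `or` and `not` keywords do not work element-wise on a pandas Series, so the masks are combined with `&`, `|` and `~` instead.

```python
import pandas as pd

# Hypothetical toy data; the column names only mimic the tutorial's dataset
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "happiness_score": [7.5, 4.2, 6.1, 3.9],
    "economy": [1.4, 0.3, 1.1, 0.9],
})

# A comparison on a column gives a boolean Series (one True/False per row)
high = toy["happiness_score"] > 5.0

# Combine masks element-wise with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses.
both = toy[(toy["happiness_score"] > 5.0) & (toy["economy"] > 1.3)]
either = toy[(toy["happiness_score"] > 7.0) | (toy["economy"] < 0.5)]
not_high = toy[~high]

print(both["country"].tolist())
print(either["country"].tolist())
print(not_high["country"].tolist())
```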

In [None]:
# Comparison operator
print(3 > 2)
print(3!=2)
# Boolean operators
print(True and False)
print(True or False)

In [None]:
# 1 - Filtering Pandas data frame
x = data["happiness_score"]>5.0
data[x]

In [None]:
# 2 - Filtering pandas with logical_and
data[np.logical_and(data["family"]>1.3,data["economy"]>1.3)]

In [None]:
# This is also same with previous code line. Therefore we can also use '&' for filtering.
data[(data["family"]>1.3) & (data["economy"]>1.3)]

<a id="6"></a> <br>
### WHILE and FOR LOOPS
We will learn the most basic while and for loops.

In [None]:
# Stay in the loop while the condition (i is not equal to 5) is true
i = 0
while i != 5:
    print("i is: ",i)
    i+=1
print(i," is equal to 5")

In [None]:
# The for loop visits each element of the list in turn
lis = [1,2,3,4,5]

for i in lis:
    print("i is: ",i)
print("")    
# Enumerate index and value of list
# index : value = 0:1, 1:2, 2:3, 3:4, 4:5
for index,value in enumerate(lis):
    print(index," : ",value)
print("")
# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We learned about keys and values in the dictionary part.
dictionary = {'Turkey':'Ankara','France':'Paris'}
for key in dictionary:
    print(key)
print("")
for key,value in dictionary.items():
    print(key," : ",value)
print("")
# For pandas we can access the index and the value of each row
for index,value in data[["freedom"]][0:5].iterrows():
    print(index," : ",value)
data[["freedom"]][0:5]

<a id="7"></a> <br>
# 2. PYTHON DATA SCIENCE TOOLBOX


<a id="8"></a> <br>
### USER DEFINED FUNCTION
What we need to know about functions:
* docstrings: documentation for functions. Example:
<br>for f():
    <br>"""This is docstring for documentation of function f"""
* tuple: an immutable sequence of Python objects.
<br>its values cannot be modified
<br>a tuple uses parentheses, like t = (1,2,3)
<br>a tuple can be unpacked into several variables, like a,b,c = t
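
To make the two points above concrete, here is a minimal sketch. The function `min_max` is a hypothetical helper written for this example. It shows where the docstring is stored, how a returned tuple is unpacked, and what happens when we try to modify a tuple.

```python
def min_max(values):
    """Return the smallest and largest item of values as a tuple."""
    return min(values), max(values)

# The docstring is stored on the function and is also shown by help(min_max)
print(min_max.__doc__)

# Unpack the returned tuple into two variables
low, high = min_max([3, 1, 4, 1, 5])
print(low, high)

# A tuple cannot be changed after it is created
t = (1, 2, 3)
try:
    t[0] = 10
    modified = True
except TypeError:
    modified = False
print(modified)
```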
    

In [None]:
# example of what we learned above
def tuple_ex():
    """ return the defined tuple t """
    t = (1,2,3)
    return t
a,b,c = tuple_ex()
print(a,b,c)

<a id="9"></a> <br>
### SCOPE
What we need to know about scope:
* global: defined in the main body of the script
* local: defined inside a function
* built-in scope: names in the predefined built-in module, such as print and len
<br><br>Let's make some basic examples

In [None]:
# guess what will be printed
x = 2
def f():
    x=3
    return x
print(x)      # x = 2 global scope
print(f())    # x = 3 local scope

In [None]:
# What if there is no local variable
x = 5
def f():
    y = 2*x        # there is no local x
    return y
print(f())         # it uses the global x
# The local scope is searched first, then the global scope; if the name is in neither, the built-in scope is searched last.

In [None]:
# How can we see what is in the built-in scope
import builtins
dir(builtins)

<a id="10"></a> <br>
### NESTED FUNCTION
* A function defined inside another function.
* Names are looked up by the LEGB rule: local scope, enclosing function scope, global scope and built-in scope, in that order.
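
The lookup order can be sketched as follows. All function names here are made up for the illustration; each one resolves `x` (or `len`) at a different level of the LEGB chain.

```python
x = "global"

def outer():
    x = "enclosing"

    def inner_local():
        x = "local"      # found in the local scope first
        return x

    def inner_enclosing():
        return x         # not local, so the enclosing function's x is used

    return inner_local(), inner_enclosing()

def uses_global():
    return x             # neither local nor enclosing, so the global x is used

def uses_builtin():
    return len("abc")    # len is found last, in the built-in scope

print(outer())
print(uses_global())
print(uses_builtin())
```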

In [None]:
#nested function
def square():
    """ return square of value """
    def add():
        """ add two local variable """
        x = 2
        y = 3
        z = x + y
        return z
    return add()**2
print(square())    

<a id="11"></a> <br>
### DEFAULT and FLEXIBLE ARGUMENTS
* Default argument example:
<br> def f(a, b=1):
        """ b = 1 is the default argument"""
* Flexible argument examples:
<br> def f(*args):
       """ *args collects any number of positional arguments into a tuple"""
<br> def f(**kwargs):
       """ **kwargs collects keyword arguments into a dictionary"""

<br><br> Let's write some code to practice.
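
As a compact sketch, the three kinds of arguments can also appear in one signature. The function `total` and its parameter names are hypothetical, chosen only for this example.

```python
def total(first, *rest, scale=1, **labels):
    """Sum first and rest, multiply by scale, and return the pieces."""
    result = (first + sum(rest)) * scale
    return result, rest, labels

# Only the required argument: rest is an empty tuple, labels an empty dict
print(total(1))

# Extra positional arguments are collected into the tuple `rest`
print(total(1, 2, 3))

# scale overrides its default; unit is collected into the dict `labels`
print(total(1, 2, 3, scale=10, unit="kg"))
```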

In [None]:
# default arguments
def f(a, b = 1, c = 2):
    y = a + b + c
    return y
print(f(5))
# what if we want to change default arguments
print(f(5,4,3))

In [None]:
# flexible arguments *args
def f(*args):
    for i in args:
        print(i)
f(1,1,2)
print("")
f(1,2,3,4)
print("")
f("orhan","kadir","cemal",1)
# flexible arguments: **kwargs is a dictionary
def f(**kwargs):
    """ print key and value of dictionary"""
    for key, value in kwargs.items():     # If you do not understand this part, go back to the for loop part and look at the dictionary loop
        print(key, " ", value)
f(country = 'Turkey', capital = 'Ankara', population = 80000000)

<a id="12"></a> <br>
### LAMBDA FUNCTION
A shorter way of writing a function as a single expression.

In [None]:
# lambda function
square = lambda x: x**2     # where x is name of argument
print(square(4))
tot = lambda x,y,z: x+y+z   # where x,y,z are names of arguments
print(tot(1,2,3))

<a id="13"></a> <br>
### ANONYMOUS FUNCTION
An anonymous function is a lambda that is not given a name; it is passed straight to another function, which can then apply it to many values.
* map(func,seq) : applies a function to all the items in a sequence
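
A short sketch of `map()` with anonymous functions follows; the lists `numbers` and `tens` are made up for the example. It also shows that `map()` returns a lazy iterator, which is why the result is wrapped in `list()`.

```python
numbers = [1, 2, 3]
tens = [10, 20, 30]

# The lambda has no name; it is passed straight to map()
squares = list(map(lambda x: x ** 2, numbers))

# map() accepts several sequences; the lambda then takes one argument per sequence
sums = list(map(lambda x, y: x + y, numbers, tens))

# map() returns a lazy iterator, so it can be consumed only once
lazy = map(lambda x: x + 1, numbers)
first_pass = list(lazy)
second_pass = list(lazy)

print(squares, sums, first_pass, second_pass)
```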


In [None]:
number_list=(1,2,3,4,5,6,7,8,9)
y = map(lambda x : x**2,number_list)
#list_y = list(y)
#print(list_y) #[1, 4, 9, 16, 25, 36, 49, 64, 81]
# OR the short way
print(list(y)) #[1, 4, 9, 16, 25, 36, 49, 64, 81]

<a id="14"></a> <br>
### ITERATORS
* iterable: an object that can return an iterator, i.e. an object that iter() can be called on
<br> example: lists, strings and dictionaries
* iterator: an object that produces the next value each time next() is called on it
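
The following sketch spells out roughly what a for loop does with an iterable: it asks for an iterator once and then calls `next()` until `StopIteration` is raised. The list `letters` is made up for the example.

```python
letters = ["a", "b", "c"]   # a list is an iterable
it = iter(letters)           # iter() returns an iterator over it

collected = []
while True:
    try:
        collected.append(next(it))   # next() produces the next value
    except StopIteration:            # raised when nothing is left
        break

print(collected)

# An exhausted iterator stays exhausted; the iterable can hand out a fresh one
leftover = list(it)
fresh = list(iter(letters))
print(leftover, fresh)
```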

In [None]:
# iteration example
name = "Orhan"
itr = iter(name)
print(next(itr))# print next iteration
print(next(itr))# print next iteration
print(*itr)     # print remaining iteration

**zip(): zip lists**

In [None]:
list1 = [1,2,3,4]
list2 = [5,6,7,8]
z = zip(list1,list2)
print(z)
z_list = list(z)  #converting zip to list type
print(z_list)
print("")    
itr = iter(z_list) 
print(next(itr))   # print next iteration
print(*itr)        # print remaining iteration

In [None]:
un_zip = zip(*z_list)
unlist1,unlist2 = list(un_zip) # unzip returns tuple
print(unlist1)
print(unlist2)
print(type(unlist1))
print(type(list(unlist1))) # use list() to convert a tuple to a list

<a id="15"></a> <br>
### LIST COMPREHENSION
**One of the most important topics of this kernel**
<br>We often use list comprehensions for data analysis.
<br>List comprehension: collapses a for loop that builds a list into a single line.
<br>Ex: num1 = [1,2,3] and we want to make num2 = [2,3,4]. This can be done with a for loop, but that is unnecessarily long. With a list comprehension it becomes one line of code.
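
For comparison, here is a sketch of the explicit loop next to the equivalent comprehension, using the same `num1` as in the text. The last line adds a trailing `if`, which is an extra illustration beyond the example described above.

```python
num1 = [1, 2, 3]

# Building the list with an explicit for loop
num2_loop = []
for i in num1:
    num2_loop.append(i + 1)

# The same list with a list comprehension
num2 = [i + 1 for i in num1]

# A trailing `if` keeps only the elements that satisfy a condition
odd_plus_one = [i + 1 for i in num1 if i % 2 == 1]

print(num2_loop, num2, odd_plus_one)
```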

In [None]:
num1 = [1,2,3]
num2 = [i+1 for i in num1]
print(num2)
#OR
print([i+1 for i in num1])

[i + 1 for i in num1]: list comprehension
<br> i + 1: the expression evaluated for each element
<br> for i in num1: for loop syntax
<br> i: loop variable
<br> num1: iterable object

In [None]:
# Conditionals on iterable
num1 = [5,10,15]
num2 = [i**2 if i==10 else i-5 if i<7 else i+5 for i in num1]
print(num2)

In [None]:
# let's return to 2015.csv and make one more list comprehension example
# let's classify each happiness_score as high or low. Our threshold is the average happiness_score.
threshold = sum(data.happiness_score)/len(data.happiness_score)
data["happiness_score_level"] = ["high" if i>threshold else "low" for i in data.happiness_score]
data.loc[60:90,["happiness_score_level","happiness_score"]]

Up to now, you have learned:
* User defined function
* Scope
* Nested function
* Default and flexible arguments
* Lambda function
* Anonymous function
* Iterators
* List comprehension

<a id="16"></a> <br>
# 3. CLEANING DATA

<a id="17"></a> <br>
### DIAGNOSE DATA for CLEANING
We need to diagnose and clean data before exploring it.
<br>Unclean data:
* Column name inconsistencies, like upper/lower case letters or spaces between words
* missing data
* different languages

<br> We will use the head, tail, columns, shape and info methods to diagnose data


In [None]:
data = pd.read_csv('../input/2015.csv')
data.head()  # head shows first 5 rows

In [None]:
data.rename(columns={"Economy (GDP per Capita)":"economy","Health (Life Expectancy)":"health","Trust (Government Corruption)":"Trust"}, inplace=True)


In [None]:
data.columns = [each.replace(" ","_") if(len(each.split())>1) else each for each in data.columns]
print(data.columns)

In [None]:

data.columns = [each.lower() for each in data.columns]
print(data.columns)

In [None]:
# tail shows last 5 rows
data.tail()

In [None]:
# columns gives column names of features
data.columns

In [None]:
# shape gives number of rows and columns in a tuple
data.shape

In [None]:
data.dtypes

**Filtering Data**

In [None]:
data["region"].unique() #shows the unique region values

In [None]:
(data["happiness_score"] > 1).head(20) # boolean mask: True where the condition holds

In [None]:
# Use the boolean mask to keep only the matching rows
data[data["happiness_score"] > 1].head(20)

<a id="18"></a> <br>
### EXPLORATORY DATA ANALYSIS
value_counts(): frequency counts
<br>outliers: values that are considerably higher or lower than the rest of the data
* Let's say the value at 75% is Q3 and the value at 25% is Q1.
* Outliers are smaller than Q1 - 1.5(Q3-Q1) or bigger than Q3 + 1.5(Q3-Q1). (Q3-Q1) = IQR
<br>We will use the describe() method. The describe method includes:
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry

<br> What is a quartile?

* 1,4,5,6,8,9,11,12,13,14,15,16,17
* The median is the number that is in the **middle** of the sequence. In this case it is 11.

* The lower quartile is the median of the values between the smallest number and the median, i.e. between 1 and 11, which is 6.
* The upper quartile is the median of the values between the median and the largest number, i.e. between 11 and 17, which is 14.
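
The worked example above can be checked with a few lines of Python. This is a minimal sketch that follows the convention used in the text (the median is counted in both halves); other quartile conventions can give slightly different values on other data.

```python
from statistics import median

values = [1, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17]  # already sorted

mid = len(values) // 2            # index of the middle element (odd length)
q2 = median(values)               # median
q1 = median(values[:mid + 1])     # median of the lower half, median included
q3 = median(values[mid:])         # median of the upper half, median included

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]

print(q1, q2, q3, iqr, lower_bound, upper_bound, outliers)
```

With Q1 = 6 and Q3 = 14 the IQR is 8, so any value below -6 or above 26 would count as an outlier; this sequence has none.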

In [None]:
# For example, let's look at the frequency of each region
print(data["region"].value_counts(dropna=False,sort=True)) # dropna=False: NaN values, if any, are counted too
#sort : boolean, default True   => sort by counts
#dropna : boolean, default True => don't include counts of NaN
# As can be seen in the output, there are 40 countries in the Sub-Saharan Africa region and 29 in Central and Eastern Europe
# As you can see, there are no NaN values

<a id="19"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS
* Box plots: visualize basic statistics like outliers, min/max or quartiles

In [None]:
# For example: compare happiness_score between regions
# The whisker at the top ends at the max (excluding outliers)
# The top edge of the box is 75%
# The line inside the box is the median (50%)
# The bottom edge of the box is 25%
# The whisker at the bottom ends at the min (excluding outliers)
# Outliers are smaller than Q1 - 1.5(Q3-Q1) or bigger than Q3 + 1.5(Q3-Q1).     (Q3-Q1) = IQR
data.boxplot(column='happiness_score',by = 'region',fontsize=9,figsize=(20,20))

data2 = data[data["region"]=="Western Europe"]
print(data2.happiness_score.max())
print(data2.happiness_score.quantile(q=0.75))
print(data2.happiness_score.quantile(q=0.5))
print(data2.happiness_score.quantile(q=0.25))
print(data2.happiness_score.min())

data3 = data[data["region"]=="North America"]
print(data3.happiness_score.max())
print(data3.happiness_score.quantile(q=0.75))
print(data3.happiness_score.quantile(q=0.5))
print(data3.happiness_score.quantile(q=0.25))
print(data3.happiness_score.min())

data4 = data[data["region"]=="Australia and New Zealand"]
print(data4.happiness_score.max())
print(data4.happiness_score.quantile(q=0.75))
print(data4.happiness_score.quantile(q=0.5))
print(data4.happiness_score.quantile(q=0.25))
print(data4.happiness_score.min())

data5 = data[data["region"]=="Latin America and Caribbean"]
print(data5.happiness_score.max())
print(data5.happiness_score.quantile(q=0.75))
print(data5.happiness_score.quantile(q=0.5))
print(data5.happiness_score.quantile(q=0.25))
print(data5.happiness_score.min())
# FINDING OUTLIERS
#we should change the formula according to our data

#FOR data2
print("max outliers=",[x for x in data2.happiness_score if x>(data2.happiness_score.quantile(0.75)+1.5*(data2.happiness_score.quantile(0.75)-data2.happiness_score.quantile(0.25)))])
print("min outliers=",[x for x in data2.happiness_score if x<(data2.happiness_score.quantile(0.25)-1.5*(data2.happiness_score.quantile(0.75)-data2.happiness_score.quantile(0.25)))])
print("")

#FOR data3
print("max outliers=",[x for x in data3.happiness_score if x>(data3.happiness_score.quantile(0.75)+1.5*(data3.happiness_score.quantile(0.75)-data3.happiness_score.quantile(0.25)))])
print("min outliers=",[x for x in data3.happiness_score if x<(data3.happiness_score.quantile(0.25)-1.5*(data3.happiness_score.quantile(0.75)-data3.happiness_score.quantile(0.25)))])
print("")

#FOR data4
print("max outliers=",[x for x in data4.happiness_score if x>(data4.happiness_score.quantile(0.75)+1.5*(data4.happiness_score.quantile(0.75)-data4.happiness_score.quantile(0.25)))])
print("min outliers=",[x for x in data4.happiness_score if x<(data4.happiness_score.quantile(0.25)-1.5*(data4.happiness_score.quantile(0.75)-data4.happiness_score.quantile(0.25)))])
print("")

#doing with for loop
#FOR data5
for x in data5.happiness_score:
    if x>(data5.happiness_score.quantile(0.75)+1.5*(data5.happiness_score.quantile(0.75)-data5.happiness_score.quantile(0.25))):
       print("max outliers=",x)
    elif x<(data5.happiness_score.quantile(0.25)-1.5*(data5.happiness_score.quantile(0.75)-data5.happiness_score.quantile(0.25))):
       print("min outliers=",x)

<a id="20"></a> <br>
### TIDY DATA
We tidy data with melt().
Describing melt in words is confusing, so let's work through an example to understand it.


In [None]:
# First I create a new data frame from the 2015 data to explain melt more easily.
data_new = data.head(5)    # I take only 5 rows for the new data
data_new

In [None]:
# lets melt
# id_vars = what we do not wish to melt
# value_vars = what we want to melt
melted = pd.melt(frame=data_new, id_vars = "country",value_vars=["economy","health"])
melted

<a id="21"></a> <br>
### PIVOTING DATA
Pivoting is the reverse of melting.
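
To see the round trip on something small, here is a sketch with a made-up two-row table (`wide` and its values are assumptions for this example). `melt()` turns the measure columns into rows, and `pivot()` turns them back into columns.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per measure
wide = pd.DataFrame({
    "country": ["A", "B"],
    "economy": [1.4, 0.9],
    "health": [0.94, 0.72],
})

# melt: one row per (country, variable) pair
melted = pd.melt(frame=wide, id_vars="country", value_vars=["economy", "health"])
print(melted)

# pivot: back to one column per variable
pivoted = melted.pivot(index="country", columns="variable", values="value")
print(pivoted)
```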

In [None]:
# index is the country name
# the distinct entries of "variable" become the columns
# the entries of "value" fill those columns
melted.pivot(index="country", columns = "variable", values="value")

<a id="22"></a> <br>
### CONCATENATING DATA
We can concatenate two data frames.

In [None]:
# First let's create 2 data frames
data1 = data.head()
data2 = data.tail()
v_concat = pd.concat([data1,data2],axis=0,ignore_index=True)# axis = 0 : stacks the data frames vertically (adds rows)
v_concat

In [None]:
data1 = data.country.head()
data2 = data.happiness_score.head()
h_concat = pd.concat([data1,data2], axis=1)
h_concat

In [None]:
data.info()

In [None]:
data1 = data.country.head(10)
data2 = data.happiness_score.head(10)
data3 = data.trust.head(10)
data4 = data.region.head(10)
h_concat = pd.concat([data4+" - "+data1,data2,data3], axis=1)
h_concat

<a id="23"></a> <br>
### DATA TYPES
There are 5 basic data types: object (string), boolean, integer, float and categorical.
<br> We can convert between data types, for example from str to categorical or from int to float.
<br> Why is category important:
* makes the dataframe smaller in memory
* can be utilized for analysis, especially with sklearn (we will learn it later)
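
The memory point can be illustrated with a rough sketch. The `regions` series is made up for this example, and the exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column with a few distinct strings repeated many times
regions = pd.Series(["Western Europe", "North America", "Southern Asia"] * 1000)

as_object = regions.astype("object")
as_category = regions.astype("category")

# deep=True counts the memory of the strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(as_category.dtype)
print(object_bytes, category_bytes)
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it is largest when a column has few distinct values.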

In [None]:
data.dtypes

In [None]:
# Let's convert object (str) to categorical and float to int.
# Don't forget to convert the columns back to their original types afterwards
#data["region"] = data["region"].astype("category")
#data.freedom = data.freedom.astype("int")
#data.freedom[0:10] # as you would see, it is converted from float to int

In [None]:
# With the two conversion lines above uncommented, region would change from object to category
# and freedom from float to int
data.dtypes

<a id="24"></a> <br>
### MISSING DATA and TESTING WITH ASSERT
If we encounter missing data, what can we do?
* leave it as is
* drop the rows with dropna()
* fill missing values with fillna()
* fill missing values with a summary statistic like the mean
<br>Assert statement: a check that you can turn on or turn off when you are done testing the program
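
The options above can be sketched on a tiny made-up frame (`scores` and its values are assumptions for this example). Each result is then checked with `assert`, which does nothing when the condition holds and raises `AssertionError` otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
scores = pd.DataFrame({
    "country": ["A", "B", "C"],
    "score": [6.0, np.nan, 4.0],
})

# Drop the rows with a missing score
dropped = scores.dropna(subset=["score"])

# Fill the missing score with the column mean (mean() skips NaN)
filled = scores.fillna({"score": scores["score"].mean()})

# assert raises AssertionError if the condition is False, otherwise does nothing
assert dropped["score"].notnull().all()
assert filled["score"].notnull().all()

print(len(dropped), filled["score"].tolist())
```

Assertions are skipped when Python is started with the `-O` option, which is one way they can be turned off after testing.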

In [None]:
data = pd.read_csv("../input/2015.csv")
data.rename(columns={"Economy (GDP per Capita)":"economy","Health (Life Expectancy)":"health","Trust (Government Corruption)":"Trust"}, inplace=True)
data.columns = [each.replace(" ","_") if(len(each.split())>1) else each for each in data.columns]
data.columns = [each.lower() for each in data.columns]
print(data.columns)

data2 = data.copy(deep=True)
data2.loc[4:7, "region"] = np.nan  # rows 4 to 7 inclusive
data2

In [None]:
# Let's check region
data2["region"].value_counts(dropna =False)
# As you can see, there are 4 NaN values

In [None]:
# Let's drop the rows whose region is NaN
# (the alternative, filling the missing values, is shown further below)
data2 = data2.dropna(subset=["region"])
data2

In [None]:
# Let's check with an assert statement
# Assert statement:
assert 1==1 # raises nothing because the condition is true

In [None]:
# To run all the code, we keep this line commented out
#assert 1==2 # raises AssertionError because the condition is false

In [None]:
assert  data2['region'].notnull().all() # raises nothing because we dropped the NaN values

In [None]:
data2["region"] = data2["region"].fillna("empty")
data2
# fillna changes nothing here because the NaN rows were already dropped
# to fill the missing values instead of dropping them, call fillna right after loading the csv file

In [None]:
# With assert statements we can check a lot of things. For example:
assert data2.columns[1] == "region"
assert data2.happiness_score.dtype == "float"
#OR
assert data2.region.dtype == "object"
assert data2.happiness_score.dtype == "float64"
print(data2.happiness_score.dtypes)

<a id="25"></a> <br>
# 4. PANDAS FOUNDATION 

<a id="26"></a> <br>
### REVIEW of PANDAS
As you may have noticed, I do not introduce every idea at once. We have learned some basics of pandas; now we will go deeper.
* single column = series
* NaN = not a number
* dataframe.values = numpy array


<a id="27"></a> <br>
### BUILDING DATA FRAMES FROM SCRATCH
* We can build data frames from csv as we did earlier.
* We can also build data frames from dictionaries
    * zip(): returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.
* Adding a new column
* Broadcasting: create a new column and assign one value to the entire column

In [None]:
country = ["Turkey","France"]
population = ["1000","2000"]
list_label = ["country","population"]
list_col = [country,population]
print(list_col)
zipped = list(zip(list_label,list_col))
print(zipped)
data_dict = dict(zipped)
print(data_dict)
df = pd.DataFrame(data_dict)
df

In [None]:
df["capital"]=["ankara","paris"]
df

In [None]:
df["income"] = 0
df

<a id="28"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS
* Plot
* Subplot
* Histogram:
    * bins: number of bins
    * range (tuple): min and max values of bins
    * density (boolean): normalize or not (named normed in older matplotlib versions)
    * cumulative(boolean): compute cumulative distribution
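
What these histogram parameters compute can be sketched without drawing anything, using `numpy.histogram` on a few made-up values. This swaps the plotting call for a plain computation purely for illustration.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 8])  # hypothetical data

# bins=5 over range (0, 10) gives the bin edges 0, 2, 4, 6, 8, 10
counts, edges = np.histogram(values, bins=5, range=(0, 10))

# cumulative: running total of the counts
cumulative = np.cumsum(counts)

# density=True rescales the bars so that their total area is 1
density, _ = np.histogram(values, bins=5, range=(0, 10), density=True)
area = float(np.sum(density * np.diff(edges)))

print(counts, cumulative, edges, area)
```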

In [None]:
# Plotting all data 
data1 = data.loc[:,["happiness_score","freedom","health"]]
data1.plot()
# SAME THING
#data.happiness_score.plot()
#data.freedom.plot()
#data.health.plot()

In [None]:
# subplots
data1.plot(subplots = True)
plt.show()

In [None]:
# scatter plot  
data1.plot(kind = "scatter",x="freedom",y = "health")
plt.show()

In [None]:
# hist plot  
data1.happiness_score.plot(kind ="hist",range= (0,10),bins=50)
plt.show()

In [None]:
# histogram subplot with non cumulative and cumulative
fig, axes = plt.subplots(nrows=2,ncols=1)
data1.plot(kind = "hist",y = "happiness_score",color="orange",bins = 50,range= (0,10),ax = axes[0])
data1.plot(kind = "hist",y = "happiness_score",color="green",bins = 50,range= (0,10),ax = axes[1],cumulative = True)
plt.savefig('graph.png')
plt.show()

In [None]:
# To practice, let's take the head of the 2015.csv data and add a list of dates to it
data2 = data.head().copy()  # copy so that adding a column does not touch data
date_list = ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"]
datetime_object = pd.to_datetime(date_list)
data2["date"] = datetime_object
# lets make date as index
data2= data2.set_index("date")
#OR
#data2.set_index("date",inplace=True)
data2 

In [None]:
# show all columns and rows
pd.set_option("display.max_columns",None) 
pd.set_option("display.max_rows",None)

# Now we can select according to our date index
print(data2.loc["1993-03-16"]) #print(data2.loc["1993-03-16",:]) same thing
print(data2.loc["1992-03-10":"1993-03-16"])

<a id="31"></a> <br>
### RESAMPLING PANDAS TIME SERIES
* Resampling: applying a statistical method over different time intervals
    * Needs a string to specify the frequency, like "M" = month or "A" = year
* Downsampling: reduce date time rows to a slower frequency, like from daily to weekly
* Upsampling: increase date time rows to a faster frequency, like from daily to hourly
* Interpolate: interpolate values according to different methods like 'linear', 'time' or 'index'
    * https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
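
The three ideas can be sketched on a short made-up daily series (`daily` and its values are assumptions for this example). The frequency strings "3D" and "D" are used here instead of months or years only to keep the output small.

```python
import pandas as pd

# Hypothetical daily series: 1, 2, ..., 6 on six consecutive days
daily = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

# Downsampling: one mean per 3-day interval
down = daily.resample("3D").mean()

# Upsampling: the 3-day series back to daily rows; the new rows start as NaN
up = down.resample("D").mean()

# Interpolate fills the NaN rows linearly between the known values
up_filled = up.interpolate("linear")

print(down.tolist())
print(up.tolist())
print(up_filled.tolist())
```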


In [None]:
# We will use data2 that we created in the previous part
data2.resample("A").mean() # yearly mean of each feature

In [None]:
# Lets resample with month
data2.resample("M").mean()
# As you can see there are a lot of NaN values because data2 does not include all months

In [None]:
# With real data (not made up by us like data2) we can solve this problem with interpolate
# We can interpolate from the first value
data2.resample("M").first().interpolate("linear")

In [None]:
# Or we can interpolate with mean()
data2.resample("M").mean().interpolate("linear")

<a id="32"></a> <br>
# MANIPULATING DATA FRAMES WITH PANDAS

<a id="33"></a> <br>
### INDEXING DATA FRAMES
* Indexing using square brackets
* Using column attribute and row label
* Using loc accessor
* Selecting only some columns
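
As a quick sketch of the four approaches listed above, here is a made-up frame whose row labels are 10, 20, 30 instead of 0, 1, 2. The frame and its values are assumptions for illustration; the point is that each lookup uses the row label, not the row position.

```python
import pandas as pd

# Hypothetical toy frame with non-default row labels
toy = pd.DataFrame(
    {"happiness_rank": [1, 2, 3], "freedom": [0.5, 0.25, 0.75]},
    index=[10, 20, 30],
)

a = toy["happiness_rank"][20]      # square brackets: column, then row label
b = toy.happiness_rank[20]         # column attribute, then row label
c = toy.loc[20, "happiness_rank"]  # loc accessor: row label, column label
d = toy[["happiness_rank"]]        # list of columns: result stays a DataFrame

print(a, b, c)
print(type(d))
```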

In [None]:
data1 = data.head(10)
data1

In [None]:
# indexing using square brackets
data1["happiness_rank"][1]

In [None]:
# using column attribute and row label
data1.happiness_rank[1]

In [None]:
# using loc accessor
data1.loc[2,["happiness_rank"]]

In [None]:
# Selecting only some columns
data1[["happiness_rank"]]

<a id="34"></a> <br>
### SLICING DATA FRAME
* Difference between selecting columns: series and data frames
* Slicing and indexing series
* Reverse slicing 
* From something to end
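
The slicing rules above can be checked on a small made-up frame. The column names mirror the happiness data, but the values are assumptions for illustration. Note that label slices with `loc` include both end points.

```python
import pandas as pd

# Hypothetical toy frame with row labels 1..4
toy = pd.DataFrame(
    {
        "health": [0.9, 0.8, 0.7, 0.6],
        "freedom": [0.6, 0.5, 0.4, 0.3],
        "trust": [0.4, 0.3, 0.2, 0.1],
    },
    index=[1, 2, 3, 4],
)

# Label slicing: row 3 and column "freedom" are both included
inclusive = toy.loc[1:3, "health":"freedom"]

# Reverse slicing: a step of -1 walks rows and columns backwards
reverse = toy.loc[3:1:-1, "freedom":"health":-1]

# From something to end: leave the stop label empty
to_end = toy.loc[2:, "freedom":]

print(inclusive)
print(reverse)
print(to_end)
```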

In [None]:
# Difference between selecting columns: series and dataframes
print(type(data["freedom"]))     # series
print(type(data[["freedom"]]))   # data frames

In [None]:
# Slicing and indexing series
data.loc[1:10,"health":"generosity"]   # 10 and "generosity" are inclusive

In [None]:
# Reverse slicing 
a =data.loc[10:1:-1,"generosity":"health":-1] 
a

In [None]:
# From something to end
data.loc[1:10,"trust":] 

<a id="35"></a> <br>
### FILTERING DATA FRAMES
* Creating boolean series
* Combining filters
* Filtering a column based on others
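
A minimal sketch of these three steps on a made-up frame (the country names and values are assumptions). It uses the `&`, `|` and `~` operators, which work element-wise on boolean series in the same way as `np.logical_and`, `np.logical_or` and `np.logical_not`.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "family": [0.99, 0.90, 0.97, 0.50],
        "health": [0.96, 0.98, 0.80, 0.40],
    }
)

# Creating boolean series
high_family = toy.family > 0.95
high_health = toy.health > 0.95

# Combining filters: and, or, not
both = toy[high_family & high_health]
either = toy[high_family | high_health]
neither = toy[~(high_family | high_health)]

# Filtering one column based on another
names = toy.loc[high_health, "country"]

print(both)
print(names)
```

The parentheses matter when the comparison is written inline, for example `toy[(toy.family > 0.95) & (toy.health > 0.95)]`, because `&` binds more tightly than `>`.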

In [None]:
# Creating boolean series
boolean = data.health > 0.95
data[boolean]

In [None]:
# Combining filters
first_filter = data.family > .95
second_filter = data.health > .95
data[np.logical_and(first_filter,second_filter)]
#OR
#data[first_filter & second_filter]

In [None]:
# Filtering a column based on others
data.country[data.happiness_score>7]

In [None]:
# Filtering a column based on others
data[["freedom"]][data.happiness_score>7]

In [None]:
# Filtering a column based on others
a = data[data.happiness_score>7]
a[["trust"]]

<a id="36"></a> <br>
### TRANSFORMING DATA
* Plain python functions
* Lambda function: to apply arbitrary python function to every element
* Defining column using other columns
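
The sketch below applies the three approaches to a made-up frame (the values are assumptions) and adds a plain arithmetic version for comparison. For simple arithmetic like this, operating on the whole column gives the same result as `apply`.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame(
    {
        "happiness_score": [7.5, 6.0, 4.5],
        "trust": [0.5, 0.25, 0.125],
        "economy": [1.5, 1.0, 0.5],
    }
)

def halve(n):
    """Return half of n."""
    return n / 2

# Plain python function applied to every element
toy["half_apply"] = toy.happiness_score.apply(halve)

# Lambda function doing the same thing
toy["half_lambda"] = toy.happiness_score.apply(lambda score: score / 2)

# Arithmetic on the whole column at once
toy["half_vectorized"] = toy.happiness_score / 2

# Defining a column using other columns
toy["trust_plus_economy"] = toy.trust + toy.economy

print(toy)
```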

In [None]:
# Plain python functions
def div(n):
    return n/2
data["new_happiness_score"]=data.happiness_score.apply(div)
data

In [None]:
# Lambda function: same transformation as div above
data["new_happiness_score"] = data.happiness_score.apply(lambda hp : hp/2)
data

In [None]:
# Defining column using other columns
data["new_total_happiness_score"] = data.trust + data.economy
data.head()

<a id="37"></a> <br>
### INDEX OBJECTS AND LABELED DATA
* index: sequence of labels
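
A small sketch of how a column can become the index and come back again. The frame is made up for illustration; `set_index` and `reset_index` are the standard pandas methods for this.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"happiness_rank": [1, 2, 3], "country": ["A", "B", "C"]})

# set_index moves the column into the index (it is no longer a column)
moved = toy.set_index("happiness_rank")

# drop=False keeps it as both index and column
kept = toy.set_index("happiness_rank", drop=False)

# reset_index turns the index back into a regular column
restored = moved.reset_index()

print(moved.index.name)
print(list(kept.columns))
print(list(restored.columns))
```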


In [None]:
# our index name is this:
print(data.index.name)
#lets change it
data.index.name = "index_name"
data.head()

In [None]:
# Overwrite index
# To modify the index we need to replace all of its labels at once.
data.head()
# first copy our data to data2, then change the index
data2 = data.copy()
# let's make the index start from 100. It is not a remarkable change, it is just an example
data2.index = range(100, 100 + len(data2))  # 100 is inclusive; one label per row
data2.tail()

In [None]:
# We can make one of the columns the index. I actually did it at the beginning of the manipulating data frames with pandas section
# It was like this
# data= data.set_index("happiness_rank")
# you can also use
data.index = data["happiness_rank"]
data.index = data["freedom"]
data.index = data["happiness_rank"]
data.head()
# set_index moves happiness_rank into the index, so by default it is no longer a column
# (pass drop=False or call reset_index() to get it back as a column)
# assigning the data["happiness_rank"] series to data.index keeps it as both index and feature

<a id="38"></a> <br>
### HIERARCHICAL INDEXING
* Setting indexing
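
As a sketch of what a two-level index allows, the example below builds a made-up frame (region and country names are real, the scores are assumptions) and selects by the outer level, by both levels, and by the inner level only.

```python
import pandas as pd

# Hypothetical toy frame; the scores are made up for illustration
toy = pd.DataFrame(
    {
        "region": ["Western Europe", "Western Europe", "North America"],
        "country": ["Switzerland", "Iceland", "Canada"],
        "happiness_score": [7.5, 7.4, 7.3],
    }
)

# region is the outer level, country is the inner level
multi = toy.set_index(["region", "country"]).sort_index()

# Select by outer level: all countries of one region
outer = multi.loc["Western Europe"]

# Select by both levels with a tuple
one = multi.loc[("Western Europe", "Iceland"), "happiness_score"]

# Select by inner level only with xs
by_inner = multi.xs("Canada", level="country")

print(outer)
print(one)
print(by_inner)
```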

In [None]:
# Setting index: region is the outer index, country is the inner index
data1 = data.set_index(["region","country"]) 
data1

In [None]:
data1.loc[["Western Europe"]] # how to use indexes

<a id="39"></a> <br>
### PIVOTING DATA FRAMES
* pivoting: reshape tool
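
One thing worth knowing when reshaping: `pivot` needs each index/column pair to appear only once. The sketch below uses a made-up frame with a repeated pair (treatment B, gender M) to show that `pivot` raises an error there, while `pivot_table` aggregates the duplicates instead. The data and the choice of `mean` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: the pair (B, M) appears twice
df_dup = pd.DataFrame(
    {
        "treatment": ["A", "A", "B", "B", "B"],
        "gender": ["F", "M", "F", "M", "M"],
        "response": [10, 45, 5, 9, 11],
    }
)

# pivot cannot decide which of the two (B, M) values to keep
try:
    df_dup.pivot(index="treatment", columns="gender", values="response")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table combines duplicates with an aggregation function
table = df_dup.pivot_table(
    index="treatment", columns="gender", values="response", aggfunc="mean"
)

print(pivot_failed)
print(table)
```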

In [None]:
dic = {"treatment":["A","A","B","B"],"gender":["F","M","F","M"],"response":[10,45,5,9],"age":[15,4,72,65]}
df = pd.DataFrame(dic)
df

In [None]:
# pivoting
df.pivot(index="treatment",columns = "gender",values="response")

<a id="40"></a> <br>
### STACKING and UNSTACKING DATAFRAME
* deal with multi label indexes
* level: position of unstacked index
* swaplevel: change inner and outer level index position
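
A minimal sketch of these operations on a small made-up frame (same shape as the treatment/gender example, values are assumptions). It also shows that `stack` undoes `unstack` when no values are missing.

```python
import pandas as pd

# Hypothetical toy frame with a two-level index
toy = pd.DataFrame(
    {
        "treatment": ["A", "A", "B", "B"],
        "gender": ["F", "M", "F", "M"],
        "response": [10, 45, 5, 9],
    }
).set_index(["treatment", "gender"])

# unstack: the gender level moves from the row index to the columns
wide = toy.unstack(level="gender")

# stack: the innermost column level moves back to the row index
long_again = wide.stack()

# swaplevel: gender becomes the outer level, treatment the inner level
swapped = toy.swaplevel(0, 1)

print(wide)
print(long_again)
print(swapped)
```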

In [None]:
df1 = df.set_index(["treatment","gender"])
df1
#OR
#df1 = df.set_index(["gender","treatment"])

In [None]:
# let's unstack it
# level selects which index level is moved to the columns
df1.unstack(level=0)

In [None]:
df1.unstack(level=1)

In [None]:
# change inner and outer level index position
df2 = df1.swaplevel(0,1)
df2

<a id="41"></a> <br>
### MELTING DATA FRAMES
* Reverse of pivoting
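
To see why melting is the reverse of pivoting, the sketch below melts a made-up wide frame and then pivots it back. The frame is an assumption for illustration; `variable` and `value` are the default column names that `pd.melt` creates.

```python
import pandas as pd

# Hypothetical toy frame in wide form
wide = pd.DataFrame({"treatment": ["A", "B"], "age": [15, 72], "response": [10, 5]})

# melt: one row per (treatment, variable) pair
melted = pd.melt(wide, id_vars="treatment", value_vars=["age", "response"])

# pivot: back to one column per variable
back = melted.pivot(index="treatment", columns="variable", values="value")

print(melted)
print(back)
```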

In [None]:
df

In [None]:
pd.melt(df,id_vars="treatment",value_vars=["age","response"])

<a id="42"></a> <br>
### CATEGORICALS AND GROUPBY
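
Before working with the real data, here is a minimal sketch of the groupby pattern on a made-up frame: rows are split by a key column, an aggregation is applied to each group, and the results are combined with the key as index. The region names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame(
    {
        "region": ["X", "X", "Y", "Y", "Y"],
        "country": ["a", "b", "c", "d", "e"],
        "happiness_score": [7.0, 5.0, 6.0, 4.0, 2.0],
        "economy": [1.5, 0.5, 1.0, 0.5, 0.0],
    }
)

# Number of countries per region
counts = toy.groupby("region").country.count()

# Mean of selected features per region
means = toy.groupby("region")[["happiness_score", "economy"]].mean()

# Several aggregations of one feature at once
summary = toy.groupby("region").happiness_score.agg(["mean", "max", "min"])

print(counts)
print(means)
print(summary)
```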

In [None]:
data = pd.read_csv("../input/2015.csv")
data.rename(columns={"Economy (GDP per Capita)":"economy","Health (Life Expectancy)":"health","Trust (Government Corruption)":"Trust"}, inplace=True)
data.columns = [each.replace(" ","_") if(len(each.split())>1) else each for each in data.columns]
data.columns = [each.lower() for each in data.columns]
data.head()

In [None]:
data.groupby("region").count()

In [None]:
data.groupby("region").country.count()

In [None]:
data.groupby("region").country.count().sum()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False)

In [None]:
# let's find the North America count
data[data["region"]=="North America"].region.count()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False).plot(kind="line")
plt.show()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False).plot(kind="bar")
plt.show()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False).plot(kind="hist",bins=50)
plt.show()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False).plot(kind="box")
plt.show()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False).plot(kind="area")
plt.show()

In [None]:
data.groupby("region").country.count().sort_values(ascending=False).plot(kind="pie")
plt.show()

In [None]:
# take the mean of the numeric features for each region
data.groupby("region").mean(numeric_only=True)   # mean is an aggregation / reduction method
# there are other methods like sum, std, max or min

In [None]:
# we can choose only one of the features
data.groupby("region").happiness_score.mean() 
#OR
#data.groupby("region")[["happiness_score"]].mean() 

In [None]:
data.groupby("region").mean(numeric_only=True).sort_values("happiness_score",ascending=False)

In [None]:
# we can choose only one of the features
data.groupby("region").happiness_score.max()

In [None]:
# Or we can choose multiple features
data.groupby("region")[["happiness_score","economy"]].mean()

In [None]:
data.groupby("region")[["happiness_score"]].mean() 