1. [Pandas Basics](#1)
    1. [Review of pandas](#2)
    1. [Building data frames from scratch](#3)
    1. [Visual exploratory data analysis](#4)
    1. [Statistical exploratory data analysis](#5)
    1. [Indexing pandas time series](#6)
    1. [Resampling pandas time series](#7)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/tmdb_5000_movies.csv')

In [None]:
data.info()

<a id="1"></a> <br>
#  PANDAS BASICS 

<a id="2"></a> <br>
### REVIEW OF PANDAS

* A single column of a data frame is a Series.
* NaN stands for "not a number" and marks a missing value.
* dataframe.values returns the underlying data as a NumPy array.
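
A minimal sketch of these three points. The frame `toy` and its column names are made up for illustration and are not part of the movie data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10, 20, 30]})

col = toy["a"]                # selecting a single column gives a Series
n_missing = col.isna().sum()  # NaN entries are counted as missing
arr = toy.values              # the underlying data as a NumPy array

print(type(col).__name__)
print(n_missing)
print(type(arr).__name__, arr.shape)
```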

<a id="3"></a> <br>
### BUILDING DATA FRAMES FROM SCRATCH
* We can build data frames from csv files, as we did earlier.
* We can also build data frames from dictionaries.
    * zip() function: pairs up its arguments and yields tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. In Python 3 it returns an iterator, so we wrap it in list() to get a list.
* Adding a new column
* Broadcasting: create a new column and assign one value to the entire column

In [None]:
# Data frame from a dictionary
country = ["Turkey", "USA"]
population = [10000, 15000]  # stored as numbers so they can be used in calculations
list_label = ["country", "population"]
list_col = [country, population]
zipped = list(zip(list_label, list_col))  # [(label, column values), ...]
data_dict = dict(zipped)
df = pd.DataFrame(data_dict)
df

In [None]:
# Add a column
df["capital"] = ["Ankara", "Washington, D.C."]
df

In [None]:
# Broadcasting
df["income"] = 0  # the single value 0 is assigned to every row of the column
df

<a id="4"></a> <br>
### VISUAL EXPLORATORY DATA ANALYSIS

* Plot
* Subplot
* Histogram:
    * bins: number of bins
    * range (tuple): min and max values of the bins
    * density (boolean; called normed in older matplotlib releases): normalize the histogram or not
    * cumulative (boolean): compute the cumulative distribution

In [None]:
# Plotting all data
new_data = data.loc[:, ["runtime", "popularity"]]
new_data.plot()
plt.show()
# Both columns are drawn on one axis, so the plot is hard to read

In [None]:
#subplots
new_data.plot(subplots=True)
plt.show()


In [None]:
# scatter plot
new_data.plot(kind="scatter",x="runtime",y="popularity")
plt.show()

In [None]:
# Histogram plot
# density=True replaces normed=True, which was removed from matplotlib
new_data.plot(kind="hist", y="popularity", bins=40, range=(0, 250), density=True)
plt.show()


In [None]:
# Histogram subplots: non-cumulative (top) and cumulative (bottom)
# density=True replaces normed=True, which was removed from matplotlib
fig, axes = plt.subplots(nrows=2, ncols=1)
new_data.plot(kind="hist", y="popularity", bins=50, range=(0, 250), density=True, ax=axes[0])
new_data.plot(kind="hist", y="popularity", bins=50, range=(0, 250), density=True, ax=axes[1], cumulative=True)
plt.savefig('graph.png')
plt.show()
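
To see what the normalization and cumulative options compute, here is a small sketch with NumPy's `np.histogram` instead of a plot. The sample values are made up for illustration; only the idea carries over to the movie data.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1, 2, 2, 3, 3, 3, 8, 9])

# bins=4 over range (0, 10) gives the edges 0, 2.5, 5, 7.5, 10
counts, edges = np.histogram(values, bins=4, range=(0, 10))
density, _ = np.histogram(values, bins=4, range=(0, 10), density=True)

widths = np.diff(edges)
# With density=True the bar areas (height * width) sum to 1
area = (density * widths).sum()
# A cumulative histogram is the running sum of the per-bin values
cumulative_counts = np.cumsum(counts)

print(counts)
print(area)
print(cumulative_counts)
```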

<a id="5"></a> <br>
### STATISTICAL EXPLORATORY DATA ANALYSIS
These statistics were covered in earlier parts. Here is a brief recap:
* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry

In [None]:
data.describe()
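
As a rough check of what these rows mean, the sketch below compares `describe()` with the quartiles computed directly. The series `s` is a made-up example with one outlier, which also shows how far the mean can sit from the median.

```python
import pandas as pd

# Hypothetical values with one large outlier
s = pd.Series([1, 2, 3, 4, 100])

summary = s.describe()
q1, median, q3 = s.quantile([0.25, 0.5, 0.75])

print(summary)
print(q1, median, q3)        # the 25%, 50% and 75% rows of describe()
print(s.mean(), s.median())  # the outlier pulls the mean above the median
```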

<a id="6"></a> <br>
### INDEXING PANDAS TIME SERIES
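* By default, dates read from a file are stored as strings (object dtype), not as datetime values.
* parse_dates (option of pd.read_csv): parse the given columns as datetimes, which pandas displays in ISO 8601 (yyyy-mm-dd hh:mm:ss) format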

In [None]:
time_list = ["2001-05-07", "2001-04-11"]
print(type(time_list[1]))  # As you can see, the date is a string
# However, we want it to be a datetime object
datetime_object = pd.to_datetime(time_list)
print(type(datetime_object))
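
The `parse_dates` option can be sketched the same way. The two-row CSV below is made up and held in memory, so the column names `date` and `value` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical two-row CSV kept in memory so the example is self-contained
csv_text = "date,value\n2001-05-07,1\n2001-04-11,2\n"

# Without parse_dates the date column stays as plain strings
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(plain["date"].dtype)
print(parsed.index.dtype)
```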

In [None]:
# To practice, take the head of the tmdb_5000 data and add a date column to it
data1 = data.head().copy()  # copy so the assignment below does not raise SettingWithCopyWarning
date_list = ["1992-01-10", "1992-02-10", "1992-03-10", "1993-03-15", "1993-03-16"]
datetime_object = pd.to_datetime(date_list)
data1["date"] = datetime_object
# Make the date column the index
data1 = data1.set_index("date")
data1

In [None]:
# Now we can select according to our date index
print(data1.loc["1993-03-16"])
print(data1.loc["1992-02-10":"1993-03-16"])

<a id="7"></a> <br>
### RESAMPLING PANDAS TIME SERIES
* Resampling: applying a statistical method over different time intervals
    * Needs a string to specify the frequency, like "M" = month or "A" = year (recent pandas releases prefer "ME" and "YE")
* Downsampling: reduce datetime rows to a slower frequency, like from daily to weekly
* Upsampling: increase datetime rows to a faster frequency, like from daily to hourly
* Interpolate: fill in missing values according to different methods like 'linear', 'time' or 'index'
    * https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html


In [None]:
# We will use data1, which we created in the previous part
# numeric_only=True skips the text columns, which have no mean
data1.resample("A").mean(numeric_only=True)

In [None]:
# Let's resample by month
data1.resample("M").mean(numeric_only=True)
# As you can see, there are a lot of NaN values because data1 does not include all months

In [None]:
# With real data (not dates we made up ourselves, as in data1) we can fill these gaps with interpolate
# We can interpolate starting from the first value of each month
data1.resample("M").first().interpolate("linear")

In [None]:
# Or we can interpolate the monthly means
data1.resample("M").mean(numeric_only=True).interpolate("linear")
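
The cells above only downsample. A minimal sketch of both directions on a made-up series follows. To keep the output short it goes from every second day to daily (rather than daily to hourly); the values and the name `s` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical series with one value every second day
idx = pd.date_range("2020-01-01", periods=4, freq="2D")
s = pd.Series([0.0, 10.0, 20.0, 30.0], index=idx)

# Downsampling: 4-day bins, each aggregated with the mean
down = s.resample("4D").mean()

# Upsampling: daily rows; the newly created days hold NaN
up = s.resample("D").asfreq()

# Interpolation fills the new rows on a straight line between neighbours
filled = up.interpolate("linear")

print(down)
print(up)
print(filled)
```

Downsampling always needs an aggregation such as mean(), while upsampling creates empty rows that have to be filled, for example with interpolate().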