![data-x](http://oi64.tinypic.com/o858n4.jpg)

---
# Pandas Introduction 
### with Stock Data and Correlation Examples


**Author list:** Ikhlaq Sidhu & Alexander Fred Ojala

**References / Sources:** 
Includes examples from Wes McKinney and the 10min intro to Pandas


**License Agreement:** Feel free to do whatever you want with this code

___

## What Does Pandas Do?
<img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p1.jpg">

## What is a Pandas Table Object?
<img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p2.jpg">


In [None]:
# ## This table is a dictionary of sequences (like np arrays)
# <img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p3.jpg">


### Topics:
1. DataFrame creation
2. Reading data into DataFrames
3. Data manipulation

## Import package

In [1]:
import pandas as pd

# Part 1: Creating Pandas DataFrames

**Key Points:** Main data types in Pandas:
* Series (similar to numpy arrays, but with index)
* DataFrames (table or spreadsheet with Series in the columns)
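
As a minimal sketch of the two types listed above (the names `prices` and `table` and all values are made up for illustration), the following builds a Series with its own index and a DataFrame whose columns are Series:

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index attached
prices = pd.Series([10.0, 11.5, 9.8], index=['mon', 'tue', 'wed'])
volumes = pd.Series([100, 150, 90], index=prices.index)

# A DataFrame is a table; each column is a Series and all columns share one index
table = pd.DataFrame({'price': prices, 'volume': volumes})

print(type(prices))           # a pandas Series
print(type(table['price']))   # a single DataFrame column is also a Series
print(table)
```

Selecting one column from the table gives back a Series that carries the same index as the table.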




### We use `pd.DataFrame( )` and can insert almost any data type as an argument

**Function:** `pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)`

Input data can be a NumPy ndarray (structured or homogeneous), a dictionary, or another DataFrame.


### 1.1 Create Dataframe using an array

In [2]:
# Try it with an array

import numpy as np
np.random.seed(0) # set seed for reproducibility

a1 = np.array(np.random.randn(3))
a2 = np.array(np.random.randn(3))
a3 = np.array(np.random.randn(3))

print (a1)
print (a2)
print (a3)

[ 1.76405235  0.40015721  0.97873798]
[ 2.2408932   1.86755799 -0.97727788]
[ 0.95008842 -0.15135721 -0.10321885]


In [3]:
# Create our first DataFrame w/ an np.array - it becomes a column
df0 = pd.DataFrame(a1)
print ("This is a", type(df0), ':')
df0

This is a <class 'pandas.core.frame.DataFrame'> :


Unnamed: 0,0
0,1.764052
1,0.400157
2,0.978738


In [4]:
print(df0) # compare the plain-text print() output with the table shown when df0 is the last line of a cell

          0
0  1.764052
1  0.400157
2  0.978738


In [5]:
# DataFrame from list of np.arrays

df0 = pd.DataFrame([a1, a2, a3])
df0

# notice that there is no column label, only integer values,
# and the index is set automatically

Unnamed: 0,0,1,2
0,1.764052,0.400157,0.978738
1,2.240893,1.867558,-0.977278
2,0.950088,-0.151357,-0.103219


In [6]:

# Let's set row and column names of our choice

df0 = pd.DataFrame([a1, a2, a3],columns=['a1','a2','a3'],index=['a','b','c'])
df0
# notice that the index and column labels are now the ones we chose


Unnamed: 0,a1,a2,a3
a,1.764052,0.400157,0.978738
b,2.240893,1.867558,-0.977278
c,0.950088,-0.151357,-0.103219


In [7]:
# add  more columns to dataframe
df0['col4']=a2
df0

Unnamed: 0,a1,a2,a3,col4
a,1.764052,0.400157,0.978738,2.240893
b,2.240893,1.867558,-0.977278,1.867558
c,0.950088,-0.151357,-0.103219,-0.977278


In [8]:
# DataFrame from 2D np.array
np.random.seed(0)
array_2d = np.array(np.random.randn(9)).reshape(3,3)
array_2d

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721, -0.10321885]])

In [9]:
df0 = pd.DataFrame(array_2d,columns=['a1','a2','a3'],index=[100,200,99]) 
# we can also assign columns and indices, sizes have to match
df0

Unnamed: 0,a1,a2,a3
100,1.764052,0.400157,0.978738
200,2.240893,1.867558,-0.977278
99,0.950088,-0.151357,-0.103219


### 1.2 Create Dataframe using a dictionary

In [12]:
# DataFrame from a Dictionary
dict1 = {'a1':a1, 'a2':a2,'a3':a3}
df1 = pd.DataFrame(dict1,index=[0,1,2]) 
# note that the dictionary keys become the column labels, so we do not pass columns=
df1

Unnamed: 0,a1,a2,a3
0,1.764052,2.240893,0.950088
1,0.400157,1.867558,-0.151357
2,0.978738,-0.977278,-0.103219


In [13]:
# DataFrame from a Dictionary
# dict1 = {0:a1, 1:a2,2:a3}
# df1 = pd.DataFrame(dict1,index=['a1','a2','a3']) 
# # note that we now have columns without assignment
# df1.T

In [14]:
# We can easily add another column (just as you add values to a dictionary)
df1['add-column']=a3
df1

Unnamed: 0,a1,a2,a3,add-column
0,1.764052,2.240893,0.950088,0.950088
1,0.400157,1.867558,-0.151357,-0.151357
2,0.978738,-0.977278,-0.103219,-0.103219


In [15]:
# We can add a list with strings and ints as a column 
df1['L'] = ["List", 3, "words"]
print ("The column L is a ",type (df1['L']))
df1

The column L is a  <class 'pandas.core.series.Series'>


Unnamed: 0,a1,a2,a3,add-column,L
0,1.764052,2.240893,0.950088,0.950088,List
1,0.400157,1.867558,-0.151357,-0.151357,3
2,0.978738,-0.977278,-0.103219,-0.103219,words


### Pandas Series object: Like an np.array, but we can combine data types and it has its own index


In [16]:
# Note: Every column in a DataFrame is a Series
print(df1['L'])
print()
print(type(df1['L']))

0     List
1        3
2    words
Name: L, dtype: object

<class 'pandas.core.series.Series'>


In [17]:
# Create a Series from a Python list
s = pd.Series([1,np.nan,3]) # automatic index, 0,1,2...
s2 = pd.Series([2, 3, 4], index = ['a','b','c']) #specific index
print (s)
print()
print (s2)

0    1.0
1    NaN
2    3.0
dtype: float64

a    2
b    3
c    4
dtype: int64


In [18]:
# We can add the Series s to the DataFrame above as column 'Series' - values are matched by index
df1['Series'] = s
df1

Unnamed: 0,a1,a2,a3,add-column,L,Series
0,1.764052,2.240893,0.950088,0.950088,List,1.0
1,0.400157,1.867558,-0.151357,-0.151357,3,
2,0.978738,-0.977278,-0.103219,-0.103219,words,3.0
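
The reminder to match indices matters because assigning a Series to a column aligns on index labels, not on position. A minimal sketch with made-up values (`frame` and `partial` are hypothetical names):

```python
import pandas as pd

frame = pd.DataFrame({'x': [10, 20, 30]}, index=[0, 1, 2])

# This Series has labels 1, 2 and 3: label 0 is missing and label 3 is extra
partial = pd.Series([1.5, 2.5, 3.5], index=[1, 2, 3])

# Assignment aligns on the index labels of frame
frame['y'] = partial
print(frame)
```

Row 0 receives NaN because `partial` has no label 0, and the value stored under label 3 is dropped because `frame` has no such row.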


In [19]:
# We can also rename columns
df1 = df1.rename(columns = {'L':'RenamedL'})
df1

Unnamed: 0,a1,a2,a3,add-column,RenamedL,Series
0,1.764052,2.240893,0.950088,0.950088,List,1.0
1,0.400157,1.867558,-0.151357,-0.151357,3,
2,0.978738,-0.977278,-0.103219,-0.103219,words,3.0


In [20]:
# We can delete columns
del df1['RenamedL']
df1

Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,
2,0.978738,-0.977278,-0.103219,-0.103219,3.0


In [21]:
# or drop columns
df1.drop('a2',axis=1) # does not change df1 if we don't set inplace=True

Unnamed: 0,a1,a3,add-column,Series
0,1.764052,0.950088,0.950088,1.0
1,0.400157,-0.151357,-0.151357,
2,0.978738,-0.103219,-0.103219,3.0


In [22]:
df1

Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,
2,0.978738,-0.977278,-0.103219,-0.103219,3.0


In [24]:
# or drop rows
df1.drop(0,axis=0)

Unnamed: 0,a1,a2,a3,add-column,Series
1,0.400157,1.867558,-0.151357,-0.151357,
2,0.978738,-0.977278,-0.103219,-0.103219,3.0


### 1.3 Slicing / Indexing in Pandas DataFrames

In [25]:
# Example: view only one column
df1['a1']

0    1.764052
1    0.400157
2    0.978738
Name: a1, dtype: float64

In [26]:
# Or view several columns
df1[['a1','a2']]

Unnamed: 0,a1,a2
0,1.764052,2.240893
1,0.400157,1.867558
2,0.978738,-0.977278


In [None]:
# A row slice returns a new DataFrame
# Here we take the first three rows, then the first two rows of that result
df1[0:3][0:2]

In [27]:
# Let's print the first 2 elements of column a1
# The result is a new Series
df1['a1'][0:2]

0    1.764052
1    0.400157
Name: a1, dtype: float64

In [28]:
# Let's print 2 columns and their top 3 values - note the list of columns
df1[['a1','a2']][0:3]

Unnamed: 0,a1,a2
0,1.764052,2.240893
1,0.400157,1.867558
2,0.978738,-0.977278


In [None]:
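# Try to get the first element of df1 with NumPy-style indexing
# This raises a KeyError: df1[...] looks up column labels, and (0, 0) is not one
df1[0,0]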


## Instead of double indexing, we can use loc, iloc

#### loc gets rows (or columns) with particular labels from the index.
#### iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
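
The difference is easiest to see when the index labels are not 0, 1, 2. The sketch below uses a small made-up frame (`demo` is a hypothetical name) with the index `[100, 200, 99]`:

```python
import pandas as pd

demo = pd.DataFrame({'a1': [1.0, 2.0, 3.0], 'a2': [4.0, 5.0, 6.0]},
                    index=[100, 200, 99])

by_label = demo.loc[99, 'a2']       # row labelled 99, column labelled 'a2'
by_position = demo.iloc[0, 1]       # first row, second column

# Slices differ too: loc includes the end label, iloc excludes the end position
label_slice = demo.loc[100:200, 'a1']    # rows labelled 100 and 200
position_slice = demo.iloc[0:1, 0]       # only the first row

print(by_label, by_position)
print(label_slice)
print(position_slice)
```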

## .iloc[]

In [29]:
df1.iloc[0,0]

1.764052345967664

In [30]:


df1.iloc[0:2,0:2] # first 2 rows, first 2 columns (end positions are excluded)

Unnamed: 0,a1,a2
0,1.764052,2.240893
1,0.400157,1.867558


In [31]:
# iloc will also accept 2 'lists' of position numbers
df1.iloc[[0,2],[0,2]]

Unnamed: 0,a1,a3
0,1.764052,0.950088
2,0.978738,-0.103219


In [32]:
# Data only from the row at position 1
print (df1.iloc[1])
print()
print (df1.iloc[1,:])

a1            0.400157
a2            1.867558
a3           -0.151357
add-column   -0.151357
Series             NaN
Name: 1, dtype: float64

a1            0.400157
a2            1.867558
a3           -0.151357
add-column   -0.151357
Series             NaN
Name: 1, dtype: float64


## .loc[]

In [33]:
# Usually we want to grab values by column names

# Note: loc needs the actual index and column labels,
# and a label slice such as 0:2 includes the end label 2
df1.loc[0:2,['a3','a2']]

Unnamed: 0,a3,a2
0,0.950088,2.240893
1,-0.151357,1.867558
2,-0.103219,-0.977278


In [34]:
# Boolean indexing
# return full rows where a2 > 0

df1[df1['a2']>0]

# df1['a2']>0 checks the condition and returns a boolean Series,
# which df1[...] then uses to keep only the rows marked True



Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,
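
The two steps of boolean indexing can be written out separately. This is a minimal sketch on a made-up frame (`demo`, `mask` and the values are illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({'a2': [2.2, 1.9, -1.0], 'a3': [0.9, -0.2, -0.1]})

mask = demo['a2'] > 0          # boolean Series: True, True, False
positive_rows = demo[mask]     # keeps only the rows where the mask is True

# loc accepts the mask together with a column label in a single step
a3_where_a2_positive = demo.loc[mask, 'a3']

print(mask)
print(positive_rows)
print(a3_where_a2_positive)
```

Passing the mask and the column to `loc` in one call avoids indexing twice in a row.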


In [None]:
# return column a3 values where a2 >0
df1['a3'][df1['a2']>0] 

In [None]:
# If you want the values in an np array
npg = df1.loc[:,"a2"].values # without .values this returns an indexed Series
print(type(npg))
print()
npg

### More Basic Statistics

In [35]:
df1.describe()

Unnamed: 0,a1,a2,a3,add-column,Series
count,3.0,3.0,3.0,3.0,2.0
mean,1.047649,1.043724,0.231837,0.231837,2.0
std,0.684554,1.760165,0.622489,0.622489,1.414214
min,0.400157,-0.977278,-0.151357,-0.151357,1.0
25%,0.689448,0.44514,-0.127288,-0.127288,1.5
50%,0.978738,1.867558,-0.103219,-0.103219,2.0
75%,1.371395,2.054226,0.423435,0.423435,2.5
max,1.764052,2.240893,0.950088,0.950088,3.0


In [None]:
df1.describe().loc[['mean','std'],['a2','a3']]

In [None]:
# We can change the index sorting
df1.sort_index(axis=0, ascending=False).head() # rows now run from index 2 down to 0

#### For more functionalities check this notebook
https://github.com/ikhlaqsidhu/data-x/blob/master/02b-tools-pandas_intro-mplib_afo/10-minutes-to-pandas-w-data-x.ipynb



## Part 2: Reading data into a pandas DataFrame


#### Now, let's get some data in CSV format.

See https://www.quantshare.com/sa-43-10-ways-to-download-historical-stock-quotes-data-for-free


In [None]:
# We can download data from the web by using pd.read_csv
# A CSV file is a comma-separated values file
# We can use this 'pd.read_csv' method with URLs that host CSV files
# Note: these endpoints may no longer serve CSV data; if they fail,
# point read_csv at any other URL or local path that hosts a CSV file

df_google = pd.read_csv('https://finance.google.com/finance/historical?output=csv&q=googl') # Google stock data
df_apple = pd.read_csv('https://finance.google.com/finance/historical?output=csv&q=aapl') # Apple stock data
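
`pd.read_csv` also accepts a file-like object, which makes it possible to try the call without a network connection. The sketch below is an illustration only: the CSV text, its values, and the name `toy` are made up, and the column layout is an assumption rather than the actual download format.

```python
import io
import pandas as pd

# Made-up rows, for illustration only
csv_text = """Date,Open,Volume
2018-01-02,100.0,1000
2018-01-03,101.5,1200
2018-01-04,99.0,900
"""

# io.StringIO wraps the string so read_csv can treat it like a file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head())
print(toy.dtypes)   # Date is read as plain strings unless it is parsed explicitly
```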

In [None]:
df_google.head()
# Volume is the number of shares or contracts traded

In [None]:
# check dtypes in each column
df_google.dtypes


### Breakout: Check the file attributes & general statistics using Pandas

In [None]:
# shape

In [None]:
#show first five values

In [None]:
# show last three

In [None]:
# return column names

In [None]:
# get statistics - mean and std of the "Open" column

###  Convert the Date string  to pandas datetime object

In [None]:
type(df_google['Date'][0])

In [None]:
# convert the 'Date' strings to datetime format
df_google['Date'] = pd.to_datetime(df_google['Date']) # the format is inferred; this does not set the index
df_google.head()

In [None]:
# substitute the dates with years only:
df_google['Date']=df_google['Date'].apply(lambda x:x.year)
df_google.head()
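
As an alternative to `apply` with a lambda, a datetime column exposes its parts through the `.dt` accessor. A minimal sketch on made-up dates (`toy` and the extra `Year` column are illustrative names); it keeps the full date and stores the year in a separate column:

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['2017-12-29', '2018-01-02', '2018-01-03']})

# Parse the strings into datetime values
toy['Date'] = pd.to_datetime(toy['Date'])

# The .dt accessor returns the year for the whole column at once
toy['Year'] = toy['Date'].dt.year

print(toy)
print(toy['Year'].value_counts())   # number of rows per year
```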

In [None]:
# Count the number of occurrences of each value in a column
df_google['Date'].value_counts()

### Masks and Boolean Indexing

In [None]:
# Check mask 1
df_google['Open']>941


In [None]:
# Use a mask to select values
df_google['Open'][df_google['Open']>1000]
# shows only the Open values greater than 1000

In [None]:
# Among the first 10 rows, show only those where the opening price is greater than 1000
df_google['Open'][:10][df_google['Open']>1000]

In [None]:

# Show rows where the opening price is >1000 in 2017
df_google[(df_google['Open']>1000) &(df_google['Date']==2017)]

In [None]:
# a whole-frame mask keeps the matching cells and turns all others into NaN
df_google[df_google>1150].head(10)

In [None]:
# we can also drop all NaN values
df_google[df_google>1150].head(10).dropna(axis=0) # try axis=1 to drop columns instead of rows

In [None]:
# another way to filter is with isin()

df_google[df_google['Open'].isin([1170.62,1184.98])]
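
The filters in this section can be tried on a small made-up frame (`toy` and its values are illustrative only). The sketch combines two masks with `&` and `|` and then filters with `isin`:

```python
import pandas as pd

toy = pd.DataFrame({'Date': [2017, 2017, 2018, 2018],
                    'Open': [990.0, 1010.0, 1050.0, 980.0]})

# Each comparison needs its own parentheses; & means "and", | means "or"
high_2017 = toy[(toy['Open'] > 1000) & (toy['Date'] == 2017)]
high_or_2018 = toy[(toy['Open'] > 1000) | (toy['Date'] == 2018)]

# isin keeps the rows whose value appears in the given list
picked = toy[toy['Open'].isin([990.0, 1050.0])]

print(high_2017)
print(high_or_2018)
print(picked)
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators.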

### Manipulating  Values


In [None]:
# Recall
df_google.head(4)

In [None]:
# All the ways to view (by location, by index, iat, etc) 
# can also be used to set values
# good for data normalization

df_google['Volume'] = df_google['Volume']/1000.0
df_google.head(4)

In [None]:
# assigning a single value overwrites every row of the column
df_google['Volume'] = 9999
print(df_google.head(10))

In [None]:
# Change specific entry
df_google.iat[0,1] = 0
df_google.head(3)

In [None]:
# Comments on dropping and filling NaN values
# Both calls return a new object and leave the original unchanged
df_google.dropna(how='any')  # drops every row that contains at least one NaN
df1.fillna(value=5)    # fills NaN values with 5
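
A minimal sketch of both calls on a made-up frame that actually contains NaN values (`toy`, `dropped` and `filled` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

dropped = toy.dropna(how='any')   # keeps only the rows without any NaN
filled = toy.fillna(value=5)      # replaces every NaN with 5

print(dropped)
print(filled)
print(toy)   # the original still contains its NaN values
```

To keep a result, assign it to a name as done here.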

### More Statistics and Operations

In [None]:
# mean by column, also try var() for variance
df_google.mean()   

In [None]:
df_google.groupby('Date').count()
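
`groupby` splits the rows by the values of a column and then applies an aggregation to each group. The sketch below uses a made-up frame (`toy` and its values are illustrative only) with a year column like the one created earlier:

```python
import pandas as pd

toy = pd.DataFrame({'Date': [2017, 2017, 2018],
                    'Open': [990.0, 1010.0, 1050.0]})

rows_per_year = toy.groupby('Date').count()       # non-null entries per column in each group
mean_open = toy.groupby('Date')['Open'].mean()    # average Open for each year

print(rows_per_year)
print(mean_open)
```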

In [None]:
df_google[0:5].mean(1) # row means of first five rows
# df_google.mean(axis = 1)

In [None]:
# Use the apply method to perform calculations on every element
df_google[0:10].apply(np.sqrt)
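
`apply` passes each column (or each row, with `axis=1`) to the given function. With `np.sqrt` the result is element-wise because NumPy computes the square root of every entry in the column it receives. A minimal sketch on made-up values (`toy` and the result names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, 4.0, 9.0], 'y': [16.0, 25.0, 36.0]})

# np.sqrt receives each column and returns a column of the same length
roots = toy.apply(np.sqrt)

# A function that returns a single number gives one value per column
col_range = toy.apply(lambda col: col.max() - col.min())

# axis=1 passes each row to the function instead
row_sum = toy.apply(lambda row: row.sum(), axis=1)

print(roots)
print(col_range)
print(row_sum)
```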