![data-x](http://oi64.tinypic.com/o858n4.jpg)

---
# Pandas Introduction 
### with Stock Data and Correlation Examples


**Author list:** Ikhlaq Sidhu & Alexander Fred Ojala

**References / Sources:** 
Includes examples from Wes McKinney and the 10min intro to Pandas


**License Agreement:** Feel free to do whatever you want with this code

___

## What Does Pandas Do?
<img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p1.jpg">

## What is a Pandas Table Object?
<img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p2.jpg">


In [1]:
# ## This table is a dictionary of sequences (like np arrays)
# <img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p3.jpg">


### Topics:
1. DataFrame creation
2. Reading data into DataFrames
3. Data Manipulation

## Import package

In [2]:
import pandas as pd

## Part 1: Creating Pandas DataFrames

**Key Points:** Main data types in Pandas:
* Series (similar to numpy arrays, but with index)
* DataFrames (table or spreadsheet with Series in the columns)




### We use `pd.DataFrame( )` and can insert almost any data type as an argument

**Function:** `pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)`

Input data can be a numpy ndarray (structured or homogeneous), dictionary, or DataFrame.
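
As a minimal sketch of these input types, the snippet below builds the same small table from a 2-D array, from a dictionary, and from an existing DataFrame. The variable names and values are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 2.0], [3.0, 4.0]])

# From a homogeneous 2-D ndarray: each inner array becomes a row
from_array = pd.DataFrame(values, columns=['x', 'y'])

# From a dictionary: keys become column labels, values become the columns
from_dict = pd.DataFrame({'x': [1.0, 3.0], 'y': [2.0, 4.0]})

# From another DataFrame: copy=True asks for an independent copy
from_frame = pd.DataFrame(from_array, copy=True)

print(from_array.equals(from_dict))  # both routes give the same table
print(from_frame.shape)
```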


### 1.1 Create a DataFrame using an array

In [3]:
# Try it with an array

import numpy as np
np.random.seed(0) # set seed for reproducibility

a1 = np.array(np.random.randn(3))
a2 = np.array(np.random.randn(3))
a3 = np.array(np.random.randn(3))

print (a1)
print (a2)
print (a3)

[ 1.76405235  0.40015721  0.97873798]
[ 2.2408932   1.86755799 -0.97727788]
[ 0.95008842 -0.15135721 -0.10321885]


In [4]:
# Create our first DataFrame w/ an np.array - it becomes a column
df0 = pd.DataFrame(a1)
print ("This is a", type(df0), ':')
df0

This is a <class 'pandas.core.frame.DataFrame'> :


Unnamed: 0,0
0,1.764052
1,0.400157
2,0.978738


In [5]:
print(df0) # print() gives plain text, unlike the table shown when df0 is the last line of a cell

          0
0  1.764052
1  0.400157
2  0.978738


In [6]:
# DataFrame from list of np.arrays

df0 = pd.DataFrame([a1, a2, a3])
df0

# notice that there is no column label, only integer values,
# and the index is set automatically

Unnamed: 0,0,1,2
0,1.764052,0.400157,0.978738
1,2.240893,1.867558,-0.977278
2,0.950088,-0.151357,-0.103219


In [7]:

# Let's set row and column names of our choice

df0 = pd.DataFrame([a1, a2, a3],columns=['a1','a2','a3'],index=['a','b','c'])
df0
# notice that the index and column labels are now the ones we chose


Unnamed: 0,a1,a2,a3
a,1.764052,0.400157,0.978738
b,2.240893,1.867558,-0.977278
c,0.950088,-0.151357,-0.103219


In [8]:
# add more columns to the DataFrame
df0['col4']=a2
df0

Unnamed: 0,a1,a2,a3,col4
a,1.764052,0.400157,0.978738,2.240893
b,2.240893,1.867558,-0.977278,1.867558
c,0.950088,-0.151357,-0.103219,-0.977278


In [9]:
# DataFrame from 2D np.array
np.random.seed(0)
array_2d = np.array(np.random.randn(9)).reshape(3,3)
array_2d

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721, -0.10321885]])

In [10]:
df0 = pd.DataFrame(array_2d,columns=['a1','a2','a3'],index=[100,200,99]) 
# we can also assign columns and indices, sizes have to match
df0

Unnamed: 0,a1,a2,a3
100,1.764052,0.400157,0.978738
200,2.240893,1.867558,-0.977278
99,0.950088,-0.151357,-0.103219


### 1.2 Create a DataFrame using a dictionary

In [11]:
# DataFrame from a Dictionary
dict1 = {'a1':a1, 'a2':a2,'a3':a3}
df1 = pd.DataFrame(dict1,index=[0,1,2]) 
df1
# note that the dictionary keys are used as the column labels

Unnamed: 0,a1,a2,a3
0,1.764052,2.240893,0.950088
1,0.400157,1.867558,-0.151357
2,0.978738,-0.977278,-0.103219


In [12]:
# We can easily add another column (just as you add values to a dictionary)
df1['add-column']=a3
df1

Unnamed: 0,a1,a2,a3,add-column
0,1.764052,2.240893,0.950088,0.950088
1,0.400157,1.867558,-0.151357,-0.151357
2,0.978738,-0.977278,-0.103219,-0.103219


In [13]:
# We can add a list with strings and ints as a column 
df1['L'] = ["List", 3, "words"]
print ("The column L is a ",type (df1['L']))
df1

The column L is a  <class 'pandas.core.series.Series'>


Unnamed: 0,a1,a2,a3,add-column,L
0,1.764052,2.240893,0.950088,0.950088,List
1,0.400157,1.867558,-0.151357,-0.151357,3
2,0.978738,-0.977278,-0.103219,-0.103219,words


### Pandas Series object: Like an np.array, but we can combine data types and it has its own index


In [14]:
# Note: Every column in a DataFrame is a Series
print(df1['L'])
print()
print(type(df1['L']))

0     List
1        3
2    words
Name: L, dtype: object

<class 'pandas.core.series.Series'>


In [15]:
# Create a Series from a Python list
s = pd.Series([1,np.nan,3]) # automatic index, 0,1,2...
s2 = pd.Series([2, 3, 4], index = ['a','b','c']) #specific index
print (s)
print()
print (s2)

0    1.0
1    NaN
2    3.0
dtype: float64

a    2
b    3
c    4
dtype: int64


In [16]:
# We can add the Series s to the DataFrame above as the column 'Series' - values are matched on the index
df1['Series'] = s
df1

Unnamed: 0,a1,a2,a3,add-column,L,Series
0,1.764052,2.240893,0.950088,0.950088,List,1.0
1,0.400157,1.867558,-0.151357,-0.151357,3,
2,0.978738,-0.977278,-0.103219,-0.103219,words,3.0
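
To illustrate what "matching indices" means here, the following sketch (with a hypothetical frame and Series that are not part of the notebook's data) assigns a Series whose index only partly overlaps the DataFrame's index. Values are placed by label, and rows without a matching label get NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [10, 20, 30]}, index=[0, 1, 2])

# Labels 1 and 2 exist in frame; label 5 does not
partial = pd.Series([1.5, 2.5, 3.5], index=[1, 2, 5])

frame['b'] = partial  # aligned by index label, not by position

print(frame)
```

Row 0 has no counterpart in `partial`, so it ends up as NaN, and the value stored under label 5 is dropped.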


In [17]:
# We can also rename columns
df1 = df1.rename(columns = {'L':'RenamedL'})
df1

Unnamed: 0,a1,a2,a3,add-column,RenamedL,Series
0,1.764052,2.240893,0.950088,0.950088,List,1.0
1,0.400157,1.867558,-0.151357,-0.151357,3,
2,0.978738,-0.977278,-0.103219,-0.103219,words,3.0


In [18]:
# We can delete columns
del df1['RenamedL']
df1

Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,
2,0.978738,-0.977278,-0.103219,-0.103219,3.0


In [19]:
# or drop columns
df1.drop('a2',axis=1) # does not change df1 if we don't set inplace=True

Unnamed: 0,a1,a3,add-column,Series
0,1.764052,0.950088,0.950088,1.0
1,0.400157,-0.151357,-0.151357,
2,0.978738,-0.103219,-0.103219,3.0


In [20]:
df1

Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,
2,0.978738,-0.977278,-0.103219,-0.103219,3.0


In [21]:
# or drop rows
df1.drop(0,axis=0)

Unnamed: 0,a1,a2,a3,add-column,Series
1,0.400157,1.867558,-0.151357,-0.151357,
2,0.978738,-0.977278,-0.103219,-0.103219,3.0
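
A small sketch of the point made above: `drop` returns a new DataFrame and leaves the original alone unless you reassign the result. The frame `demo` is a made-up example.

```python
import pandas as pd

demo = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4]})

dropped = demo.drop('a2', axis=1)  # returns a new DataFrame
print(list(demo.columns))          # the original still has both columns
print(list(dropped.columns))

demo = demo.drop(0, axis=0)        # reassign to keep the change
print(list(demo.index))
```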


### 1.3 Slicing / Indexing in a Pandas DataFrame

In [22]:
# Example: view only one column
df1['a1']

0    1.764052
1    0.400157
2    0.978738
Name: a1, dtype: float64

In [23]:
# Or view several columns
df1[['a1','a2']]

Unnamed: 0,a1,a2
0,1.764052,2.240893
1,0.400157,1.867558
2,0.978738,-0.977278


In [24]:
# a row slice returns a new DataFrame
# this takes the first three rows, then the first 2 rows of that result
(df1[0:3][0:2])

Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,


In [25]:
# Let's print the first 2 elements of column a1
# The result is a new Series
df1['a1'][0:2]

0    1.764052
1    0.400157
Name: a1, dtype: float64

In [26]:
# Let's print 2 columns and their top 3 values - note the list of columns
df1[['a1','a2']][0:3]

Unnamed: 0,a1,a2
0,1.764052,2.240893
1,0.400157,1.867558
2,0.978738,-0.977278


In [27]:
# get first element of df1
# df1[0,0]  # raises a KeyError: plain [] looks up column labels, not positions


## Instead of double indexing, we can use loc, iloc

#### loc gets rows (or columns) with particular labels from the index.
#### iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
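
The difference is easiest to see when the index labels are integers that do not match the row positions. The sketch below uses a hypothetical frame with index labels 100, 200 and 99.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions
demo = pd.DataFrame({'a1': [1.0, 2.0, 3.0], 'a2': [4.0, 5.0, 6.0]},
                    index=[100, 200, 99])

by_label = demo.loc[99, 'a1']   # row labelled 99, which is the third row
by_position = demo.iloc[0, 0]   # first row, first column

print(by_label, by_position)
```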

## .iloc[]

In [28]:
df1.iloc[0,0]

1.764052345967664

In [29]:


df1.iloc[0:2,0:2] # first 2 rows, first 2 columns

Unnamed: 0,a1,a2
0,1.764052,2.240893
1,0.400157,1.867558


In [30]:
# iloc will also accept 2 'lists' of position numbers
df1.iloc[[0,2],[0,2]]

Unnamed: 0,a1,a3
0,1.764052,0.950088
2,0.978738,-0.103219


In [31]:
# Data only from the row at position 1
print (df1.iloc[1])
print()
print (df1.iloc[1,:])

a1            0.400157
a2            1.867558
a3           -0.151357
add-column   -0.151357
Series             NaN
Name: 1, dtype: float64

a1            0.400157
a2            1.867558
a3           -0.151357
add-column   -0.151357
Series             NaN
Name: 1, dtype: float64


## .loc[]

In [32]:
# Usually we want to grab values by column names 

# Note: You have to know the index and column labels; a .loc slice includes its end label
df1.loc[0:2,['a3','a2']]

Unnamed: 0,a3,a2
0,0.950088,2.240893
1,-0.151357,1.867558
2,-0.103219,-0.977278
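
Note that `0:2` returned three rows above. A short sketch on a made-up frame shows that a `.loc` slice includes its end label, while an `.iloc` slice follows the usual Python rule and excludes the end position.

```python
import pandas as pd

demo = pd.DataFrame({'a2': [1, 2, 3, 4]})  # default index 0..3

label_slice = demo.loc[0:2]      # labels 0, 1 and 2 -> 3 rows
position_slice = demo.iloc[0:2]  # positions 0 and 1 -> 2 rows

print(len(label_slice), len(position_slice))
```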


In [33]:
# Boolean indexing
# return full rows where a2>0

df1[df1['a2']>0]

# df1['a2']>0 checks the condition and returns a boolean Series, which is used as a mask



Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,


In [34]:
# return column a3 values where a2 >0
df1['a3'][df1['a2']>0] 

0    0.950088
1   -0.151357
Name: a3, dtype: float64
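
The same selection can be written in a single step with `.loc`, passing the boolean mask as the row selector and the column label as the column selector. This is a sketch on a small made-up frame, not the notebook's `df1`.

```python
import pandas as pd

demo = pd.DataFrame({'a2': [2.2, 1.9, -1.0], 'a3': [0.9, -0.2, -0.1]})

mask = demo['a2'] > 0            # boolean Series: True, True, False
selected = demo.loc[mask, 'a3']  # rows where the mask is True, column a3 only

print(selected.tolist())
```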

In [35]:
# If you want the values in an np array
npg = df1.loc[:,"a2"].values # without .values it returns an indexed Series
print(type(npg))
print()
npg

<class 'numpy.ndarray'>



array([ 2.2408932 ,  1.86755799, -0.97727788])
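
Besides `.values`, recent pandas versions also provide the `to_numpy()` method for the same purpose. A minimal sketch on a made-up column:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a2': [2.0, 1.5, -1.0]})

arr = demo['a2'].to_numpy()  # plain ndarray, the index is left behind

print(type(arr), arr)
```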

### More Basic Statistics

In [36]:
df1.describe()

Unnamed: 0,a1,a2,a3,add-column,Series
count,3.0,3.0,3.0,3.0,2.0
mean,1.047649,1.043724,0.231837,0.231837,2.0
std,0.684554,1.760165,0.622489,0.622489,1.414214
min,0.400157,-0.977278,-0.151357,-0.151357,1.0
25%,0.689448,0.44514,-0.127288,-0.127288,1.5
50%,0.978738,1.867558,-0.103219,-0.103219,2.0
75%,1.371395,2.054226,0.423435,0.423435,2.5
max,1.764052,2.240893,0.950088,0.950088,3.0


In [37]:
df1.describe().loc[['mean','std'],['a2','a3']]

Unnamed: 0,a2,a3
mean,1.043724,0.231837
std,1.760165,0.622489
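
The numbers in `describe()` can also be computed one at a time. The sketch below uses a made-up column with the same values as the `Series` column (1, NaN, 3) to show that NaN entries are skipped and that `std` is the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0])

print(col.count())      # NaN is skipped, so 2 values are counted
print(col.mean())       # mean of 1 and 3
print(col.std())        # sample standard deviation (ddof=1)
print(col.std(ddof=0))  # population standard deviation
```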


In [38]:
# We can change the index sorting
df1.sort_index(axis=0, ascending=False).head() # rows in descending index order

Unnamed: 0,a1,a2,a3,add-column,Series
2,0.978738,-0.977278,-0.103219,-0.103219,3.0
1,0.400157,1.867558,-0.151357,-0.151357,
0,1.764052,2.240893,0.950088,0.950088,1.0


#### For more functionalities check this notebook
https://github.com/ikhlaqsidhu/data-x/blob/master/02b-tools-pandas_intro-mplib_afo/10-minutes-to-pandas-w-data-x.ipynb



## Part 2: Reading data into a Pandas DataFrame


#### Now, let's get some data in CSV format.

See https://www.quantshare.com/sa-43-10-ways-to-download-historical-stock-quotes-data-for-free


In [39]:
# We can download data from the web by using pd.read_csv
# A CSV file is a comma-separated values file
# We can use this 'pd.read_csv' method with urls that host csv files

df_google = pd.read_csv('https://finance.google.com/finance/historical?output=csv&q=googl') # Google stock data
df_apple = pd.read_csv('https://finance.google.com/finance/historical?output=csv&q=aapl') # Apple stock data

In [40]:
df_google.head()
# Volume is the number of shares or contracts traded

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,24-Jan-18,1184.98,1187.05,1167.4,1171.29,1856429
1,23-Jan-18,1170.62,1178.51,1167.25,1176.17,1956865
2,22-Jan-18,1143.82,1166.88,1141.82,1164.16,1477520
3,19-Jan-18,1138.03,1143.78,1132.5,1143.5,1527554
4,18-Jan-18,1139.35,1140.58,1124.46,1135.97,1374873


In [41]:
# check dtypes in each column
df_google.dtypes


Date       object
Open      float64
High      float64
Low       float64
Close     float64
Volume      int64
dtype: object
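
`pd.read_csv` also accepts local paths and file-like objects, which is handy for trying things out when a URL is unavailable. The sketch below feeds it an in-memory string holding three rows copied from the `df_google.head()` output above; the names `csv_text` and `sample` are illustrative.

```python
import io
import pandas as pd

# Three rows copied from the df_google.head() output above
csv_text = """Date,Open,High,Low,Close,Volume
24-Jan-18,1184.98,1187.05,1167.4,1171.29,1856429
23-Jan-18,1170.62,1178.51,1167.25,1176.17,1956865
22-Jan-18,1143.82,1166.88,1141.82,1164.16,1477520
"""

sample = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts file-like objects

print(sample.dtypes)
print(sample.shape)
```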

### Breakout: Check the file attributes & general statistics using Pandas

In [42]:
# df_google.shape

In [43]:
# df_google.head() # show first five values

In [44]:
# df_google.tail(3) # last three

In [45]:
# df_google.columns # returns columns, can be used to loop over

In [46]:
# df_google.index # returns the row index

In [47]:
# df_google.describe()

In [48]:
# Suppose you do not want the Volume column:
# df_google.drop('Volume',axis=1,inplace=True)

### Convert the Date string to a pandas datetime object

In [49]:
type(df_google['Date'][0])

str

In [50]:
# convert the 'Date' strings to pandas datetime objects
# (infer_datetime_format is deprecated in pandas 2.0+, where format inference is the default)
df_google['Date'] = pd.to_datetime(df_google['Date'],infer_datetime_format=True)
df_google.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018-01-24,1184.98,1187.05,1167.4,1171.29,1856429
1,2018-01-23,1170.62,1178.51,1167.25,1176.17,1956865
2,2018-01-22,1143.82,1166.88,1141.82,1164.16,1477520
3,2018-01-19,1138.03,1143.78,1132.5,1143.5,1527554
4,2018-01-18,1139.35,1140.58,1124.46,1135.97,1374873


In [51]:
# substitute the date with the year only:
df_google['Date']=df_google['Date'].apply(lambda x:x.year)
df_google.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,1184.98,1187.05,1167.4,1171.29,1856429
1,2018,1170.62,1178.51,1167.25,1176.17,1956865
2,2018,1143.82,1166.88,1141.82,1164.16,1477520
3,2018,1138.03,1143.78,1132.5,1143.5,1527554
4,2018,1139.35,1140.58,1124.46,1135.97,1374873


In [52]:
# Count the number of occurrences of each value in a column
df_google['Date'].value_counts()

2017    235
2018     16
Name: Date, dtype: int64
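
For datetime columns, the `.dt` accessor is a vectorized alternative to `apply(lambda x: x.year)`. The sketch below uses three made-up date strings in the same day-month-year style as the data and assumes English month abbreviations.

```python
import pandas as pd

dates = pd.Series(['24-Jan-18', '23-Jan-18', '29-Dec-17'])
parsed = pd.to_datetime(dates, format='%d-%b-%y')  # explicit format, no guessing

years = parsed.dt.year         # year of every entry at once
counts = years.value_counts()  # 2018 appears twice, 2017 once

print(counts)
```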

### Masks and Boolean Indexing

In [53]:
# Build a boolean mask: True where the opening price is above 941
df_google['Open']>941


0       True
1       True
2       True
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10      True
11      True
12      True
13      True
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21      True
22      True
23      True
24      True
25      True
26      True
27      True
28      True
29      True
       ...  
221    False
222    False
223    False
224    False
225    False
226    False
227    False
228    False
229    False
230    False
231    False
232    False
233    False
234    False
235    False
236    False
237    False
238    False
239    False
240    False
241    False
242    False
243    False
244    False
245    False
246    False
247    False
248    False
249    False
250    False
Name: Open, Length: 251, dtype: bool

In [54]:
# Use a mask to select values
df_google['Open'][df_google['Open']>1000]
# shows only the opening prices greater than 1000

0      1184.98
1      1170.62
2      1143.82
3      1138.03
4      1139.35
5      1136.36
6      1140.31
7      1110.10
8      1112.31
9      1107.00
10     1118.44
11     1111.00
12     1103.45
13     1097.09
14     1073.93
15     1053.02
16     1055.49
17     1062.25
18     1066.60
19     1068.64
20     1070.00
21     1075.39
22     1080.92
23     1083.02
24     1076.45
25     1063.78
26     1055.49
27     1052.08
28     1050.00
29     1051.11
        ...   
41     1051.16
42     1040.04
43     1036.00
44     1049.80
45     1038.75
46     1035.00
47     1037.72
48     1040.80
49     1043.87
50     1048.00
51     1050.05
52     1049.65
53     1049.10
54     1042.75
55     1039.99
56     1036.32
57     1033.00
58     1029.16
59     1030.99
63     1005.18
64     1007.05
65     1004.75
66     1011.05
67     1007.44
68     1009.63
69     1009.11
70     1003.84
157    1005.49
158    1004.23
160    1003.31
Name: Open, Length: 71, dtype: float64

In [55]:
# Of the first 10 rows, show only those where the opening price is greater than 1000
df_google['Open'][:10][df_google['Open']>1000]

0    1184.98
1    1170.62
2    1143.82
3    1138.03
4    1139.35
5    1136.36
6    1140.31
7    1110.10
8    1112.31
9    1107.00
Name: Open, dtype: float64

In [56]:

# Show rows where the opening price is >1000 in 2017
df_google[(df_google['Open']>1000) &(df_google['Date']==2017)]

Unnamed: 0,Date,Open,High,Low,Close,Volume
16,2017,1055.49,1058.05,1052.7,1053.4,1180340
17,2017,1062.25,1064.84,1053.38,1055.95,994249
18,2017,1066.6,1068.27,1058.38,1060.2,1116203
19,2017,1068.64,1068.86,1058.64,1065.85,918767
20,2017,1070.0,1071.72,1067.64,1068.86,889446
21,2017,1075.39,1077.52,1069.0,1070.85,1282025
22,2017,1080.92,1081.24,1068.6,1073.56,1436391
23,2017,1083.02,1084.98,1072.27,1079.78,1317519
24,2017,1076.45,1086.49,1070.37,1085.09,1514601
25,2017,1063.78,1075.25,1060.09,1072.0,3187985


In [57]:
# A mask over the whole DataFrame keeps matching entries and sets all others to NaN
df_google[df_google>1150].head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,1184.98,1187.05,1167.4,1171.29,1856429
1,2018,1170.62,1178.51,1167.25,1176.17,1956865
2,2018,,1166.88,,1164.16,1477520
3,2018,,,,,1527554
4,2018,,,,,1374873
5,2018,,,,,1391510
6,2018,,,,,1823100
7,2018,,,,,1929306
8,2018,,,,,1121216
9,2018,,,,,1036655


In [58]:
# we can also drop all rows that contain NaN values
df_google[df_google>1150].head(10).dropna(axis=0) # try axis=1 to drop columns instead

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,1184.98,1187.05,1167.4,1171.29,1856429
1,2018,1170.62,1178.51,1167.25,1176.17,1956865


In [59]:
# another way to filter is with isin()

df_google[df_google['Open'].isin([1170.62,1184.98])]

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,1184.98,1187.05,1167.4,1171.29,1856429
1,2018,1170.62,1178.51,1167.25,1176.17,1956865


### Manipulating  Values


In [60]:
# Recall
df_google.head(4)

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,1184.98,1187.05,1167.4,1171.29,1856429
1,2018,1170.62,1178.51,1167.25,1176.17,1956865
2,2018,1143.82,1166.88,1141.82,1164.16,1477520
3,2018,1138.03,1143.78,1132.5,1143.5,1527554


In [61]:
# All the ways to view (by location, by index, iat, etc) 
# can also be used to set values
# good for data normalization

df_google['Volume'] = df_google['Volume']/1000.0
df_google.head(4)

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,1184.98,1187.05,1167.4,1171.29,1856.429
1,2018,1170.62,1178.51,1167.25,1176.17,1956.865
2,2018,1143.82,1166.88,1141.82,1164.16,1477.52
3,2018,1138.03,1143.78,1132.5,1143.5,1527.554


In [62]:
df_google['Volume'] = 9999
print(df_google.head(10))

   Date     Open     High      Low    Close  Volume
0  2018  1184.98  1187.05  1167.40  1171.29    9999
1  2018  1170.62  1178.51  1167.25  1176.17    9999
2  2018  1143.82  1166.88  1141.82  1164.16    9999
3  2018  1138.03  1143.78  1132.50  1143.50    9999
4  2018  1139.35  1140.58  1124.46  1135.97    9999
5  2018  1136.36  1139.32  1123.49  1139.10    9999
6  2018  1140.31  1148.88  1126.66  1130.70    9999
7  2018  1110.10  1131.30  1108.01  1130.65    9999
8  2018  1112.31  1114.85  1106.48  1112.05    9999
9  2018  1107.00  1112.78  1103.98  1110.14    9999


In [63]:
# Change specific entry
df_google.iat[0,1] = 0
df_google.head(3)

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2018,0.0,1187.05,1167.4,1171.29,9999
1,2018,1170.62,1178.51,1167.25,1176.17,9999
2,2018,1143.82,1166.88,1141.82,1164.16,9999


In [64]:
# Comments on dropping and filling NaN values
# Both calls return a new DataFrame and leave the original unchanged
df_google.dropna(how='any')  # this would be used to drop rows with NaN
df1.fillna(value=5)    # this would be used to fill NaN values with 5

Unnamed: 0,a1,a2,a3,add-column,Series
0,1.764052,2.240893,0.950088,0.950088,1.0
1,0.400157,1.867558,-0.151357,-0.151357,5.0
2,0.978738,-0.977278,-0.103219,-0.103219,3.0


### More Statistics and Operations

In [65]:
# mean by column, also try var() for variance
df_google.mean()   

Date      2017.063745
Open       953.526614
High       964.304741
Low        951.905259
Close      958.707849
Volume    9999.000000
dtype: float64

In [66]:
df_google.groupby('Date').count()

Date,Open,High,Low,Close,Volume
2017,235,235,235,235,235
2018,16,16,16,16,16
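
`count()` is only one of the aggregations that can follow `groupby`. As a sketch on a made-up frame, the snippet below counts rows per year and also averages a single column per year.

```python
import pandas as pd

demo = pd.DataFrame({'Date': [2018, 2018, 2017],
                     'Open': [1184.98, 1170.62, 1055.49],
                     'Close': [1171.29, 1176.17, 1053.40]})

counts = demo.groupby('Date').count()            # non-NaN entries per year, per column
mean_open = demo.groupby('Date')['Open'].mean()  # average opening price per year

print(counts)
print(mean_open)
```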


In [67]:
df_google[0:5].mean(1) # row means of first five rows
# df_google.mean(axis = 1)

0    2590.456667
1    2784.925000
2    2772.280000
3    2762.468333
4    2759.560000
dtype: float64

In [68]:
# Use the apply method to apply a function to each column (np.sqrt then acts on every element)
df_google[0:10].apply(np.sqrt)

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,44.922155,0.0,34.453592,34.167236,34.224114,99.995
1,44.922155,34.214324,34.329433,34.165041,34.295335,99.995
2,44.922155,33.820408,34.159625,33.790827,34.119789,99.995
3,44.922155,33.7347,33.819817,33.652637,33.815677,99.995
4,44.922155,33.754259,33.772474,33.532969,33.704154,99.995
5,44.922155,33.709939,33.753815,33.518502,33.750556,99.995
6,44.922155,33.768476,33.895132,33.565756,33.625883,99.995
7,44.922155,33.318163,33.634803,33.286784,33.625139,99.995
8,44.922155,33.351312,33.38937,33.263794,33.347414,99.995
9,44.922155,33.27161,33.358357,33.226194,33.318763,99.995
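
`apply` passes each column (or each row, with `axis=1`) to the function as a Series. With a NumPy function such as `np.sqrt` the result is element-wise, but a function that reduces a Series gives one value per column or row. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Open': [4.0, 9.0], 'Close': [16.0, 25.0]})

roots = demo.apply(np.sqrt)                                        # element-wise result
col_range = demo.apply(lambda col: col.max() - col.min())          # one value per column
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)  # one value per row

print(roots)
print(col_range)
print(row_range)
```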
