![data-x](http://oi64.tinypic.com/o858n4.jpg)

---
# Pandas Introduction 
### with Stock Data and Correlation Examples


**Author list:** Ikhlaq Sidhu & Alexander Fred Ojala

**References / Sources:** 
Includes examples from Wes McKinney and the 10min intro to Pandas


**License Agreement:** Feel free to do whatever you want with this code

___

## What Does Pandas Do?
<img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p1.jpg">

## What is a Pandas Table Object?
<img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p2.jpg">


In [None]:
# ## This table is a dictionary of sequences (like np arrays)
# <img src="https://github.com/ikhlaqsidhu/data-x/raw/master/imgsource/pandas-p3.jpg">


### Topics:
1. DataFrame creation
2. Reading data into DataFrames
3. Data Manipulation

## Import package

In [None]:
# pandas
import pandas as pd

In [None]:
# Extra packages
import numpy as np
import matplotlib.pyplot as plt # for plotting

# jupyter notebook magic to display plots in output
%matplotlib inline

plt.rcParams['figure.figsize'] = (10,6) # make the plots bigger

# Part 1: Creation of Pandas objects

**Key Points:** Main object types in Pandas:
* Series (similar to numpy arrays, but with index)
* DataFrames (table or spreadsheet with Series in the columns)




### We use `pd.DataFrame( )` and can insert almost any data type as an argument

**Function:** `pd.DataFrame(data=None, index=None, columns=None, dtype=None, ...)`

Input data can be a numpy ndarray (structured or homogeneous), dictionary, or DataFrame.


### 1.1 Create a DataFrame using an array

In [None]:
# Try it with an array

import numpy as np
np.random.seed(0) # set seed for reproducibility

a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

print(a1)
print(a2)
print(a3)

In [None]:
# Create our first DataFrame w/ an np.array - it becomes a column
df0 = pd.DataFrame(a1)
df0

In [None]:
print(df0) # print() gives plain text; a bare last line in a cell renders as a formatted table

In [None]:
# Check type
type(df0)

In [None]:
# DataFrame from list of np.arrays

df0 = pd.DataFrame([a1, a2, a3])
df0

# notice that there is no column label, only integer values,
# and the index is set automatically

In [None]:
# We can set column and index names

df0 = pd.DataFrame([a1, a2, a3], columns=['a1','a2','a3'],
                   index=['first','second','third'])
df0

In [None]:
# add  more columns to dataframe, like a dictionary, dimensions must match

df0['col4']=a2
df0

In [None]:
# DataFrame from 2D np.array

np.random.seed(0)
array_2d = np.random.randn(9).reshape(3,3) # randn already returns an np.array
array_2d

In [None]:
df0 = pd.DataFrame(array_2d, columns=['rand_normal_1','Random Again','Third'],
                   index=[100,200,99])

df0

### 1.2 Create a DataFrame using a dictionary

In [None]:
# DataFrame from a Dictionary
dict1 = {'a1':a1, 'a2':a2,'a3':a3}
dict1

In [None]:
df1 = pd.DataFrame(dict1,index=[4,5,6]) 
# note that the column names come from the dictionary keys, without explicit assignment
df1

In [None]:
# We can add a list with strings and ints as a column 
df1['L'] = ["List", "three", "words"]
df1

### Pandas Series object
Every column of a DataFrame is a Series. A Series is like a one-dimensional np.array, but it has its own index and can hold mixed data types (stored with dtype `object`).

In [None]:
df1['a1']

In [None]:
type(df1['a1'])

In [None]:
# Create a Series from a Python list
s = pd.Series([1,5,3]) # automatic index, 0,1,2...
s

In [None]:
s2 = pd.Series([2, 3, 4], index = ['a','b','c']) #specific index
s2

In [None]:
s2.iloc[2] # third element by position; s2['c'] selects the same value by label

In [None]:
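# We can add the Series s to the DataFrame above as a column
# Remember to match indices: s is indexed 0,1,2 but df1 is indexed 4,5,6,
# so no labels line up and the new column is filled with NaN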
df1['Series'] = s
df1
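
The all-NaN column above comes from index alignment: pandas matches Series values to rows by label, not by position. A minimal sketch of that behavior, using a hypothetical `frame` in place of `df1`, with two ways around it:

```python
import pandas as pd

# Hypothetical frame with a non-default index, mirroring df1 above
frame = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[4, 5, 6])

# Labels 0, 1, 2 do not exist in frame.index, so nothing aligns
unaligned = pd.Series([1, 5, 3])
frame['unaligned'] = unaligned

# Option 1: give the Series the same index as the frame
frame['aligned'] = pd.Series([1, 5, 3], index=frame.index)

# Option 2: pass a plain array, which is assigned by position
frame['by_position'] = unaligned.to_numpy()

print(frame)
```

Which option fits depends on whether the labels carry meaning; assigning by position silently ignores them.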

In [None]:
s2.index = df1.index
s2

In [None]:
df1['Series2'] = s2
df1

In [None]:
# We can also rename columns
df1 = df1.rename(columns = {'L':'RenamedL'})
df1

In [None]:
# We can delete columns
del df1['RenamedL']
df1

In [None]:
# or drop columns, see axis = 1
# does not change df1 if we don't set inplace=True
df1.drop('a2', axis=1) # returns a copy

In [None]:
df1

In [None]:
# or drop rows
df1.drop(5,axis=0)

### 1.3 Indexing / Slicing a Pandas DataFrame

In [None]:
df1

In [None]:
# Example: view only one column
df1['a1']

In [None]:
# Or view several columns
df1[['a1','a3']]

In [None]:
# Let's select two columns, then slice rows 0 to 2 (here, all three rows) - note the list of columns
df1[['a1','a3']][0:3]

## Instead of double indexing, we can use `loc` and `iloc`

#### `loc` gets rows (or columns) with particular labels from the index.
#### `iloc` gets rows (or columns) at particular positions in the index (so it only takes integers).
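
The difference is easiest to see when integer labels do not coincide with positions. The sketch below uses a hypothetical `demo` frame indexed 4, 5, 6 (like `df1`):

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from positions
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=[4, 5, 6])

by_label = demo.loc[4, 'a']        # row labeled 4, column labeled 'a'
by_position = demo.iloc[0, 0]      # first row, first column

# Label slices include the end point; position slices exclude it
label_slice = demo.loc[4:5]        # rows labeled 4 and 5
position_slice = demo.iloc[0:1]    # only the first row

print(by_label, by_position, len(label_slice), len(position_slice))
```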

## .iloc[]

In [None]:
df1

In [None]:
df1.iloc[0,0]

In [None]:
df1.iloc[0:2,0:2] # first two rows, first two columns (end position is excluded)

## .loc[]

In [None]:
# Usually we want to grab values by column names

# Note: You have to know the index labels and column names;
# unlike iloc, a label slice such as 4:5 includes both end points
df1.loc[4:5,['a3','a2']]

In [None]:
df1

In [None]:
# Boolean indexing
# return full rows where a3 < 0

df1[df1['a3']<0]

# df1['a3']<0 checks the condition and returns a boolean Series,
# which is then used as a row mask



In [None]:
# return column a3 values where a3 > 0
df1['a3'][df1['a3']>0] 
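
The same selection can be written with a single `.loc` call, which takes the row mask and the column label together. This is a sketch on a hypothetical `demo` frame with made-up values, not the notebook's `df1`:

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({'a2': [2.2, -1.0, 0.9], 'a3': [0.95, -0.15, -0.10]})

mask = demo['a3'] > 0              # boolean Series, one entry per row
rows = demo[mask]                  # full rows where the mask is True
values = demo.loc[mask, 'a3']      # one column, filtered in a single step

# Assigning through .loc modifies demo itself; chained indexing may not
demo.loc[demo['a3'] < 0, 'a3'] = 0.0

print(demo)
```

Using one `.loc` call matters mostly when assigning, since a chained expression can end up writing to a temporary copy.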

In [None]:
# Convert to Numpy array
npg = df1.values # the underlying NumPy array, without index or column labels
print(type(npg))
print()
npg

### More Basic Statistics

In [None]:
df1.describe()

In [None]:
df1.describe().loc[['mean','std'],['a2','a3']]

In [None]:
# We can change the index sorting
df1.sort_index(axis=0, ascending=False) # rows in descending index order; returns a copy

#### For more functionality, check the 10-minute intro to Pandas

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html


# Part 2: Reading data into a pandas DataFrame


### Now, let's get some data in CSV format.

#### Description:
Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/UCBAdmissions.html

In [None]:
!head -n 4 data/ucbadmissions.csv

In [None]:
df = pd.read_csv('data/ucbadmissions.csv')
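
`pd.read_csv` accepts any file-like object, so its options can be tried without a file on disk. The sketch below parses a small in-memory CSV; the column names (including `Freq`) and the counts are assumptions for illustration, not the contents of `data/ucbadmissions.csv`:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Admit,Gender,Dept,Freq
Admitted,Male,A,10
Rejected,Male,A,20
Admitted,Female,A,30
"""

# usecols keeps only the listed columns; dtypes are inferred per column
sample = pd.read_csv(io.StringIO(csv_text), usecols=['Admit', 'Dept', 'Freq'])

print(sample.dtypes)
```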

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# check statistics

In [None]:
df.columns

In [None]:
df.head(12)

In [None]:
df.tail(3)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.groupby(['Admit','Gender']).sum(numeric_only=True) # sum only the numeric columns

In [None]:
pd.unique(df['Dept'])

In [None]:
# Total number of applicants per department (Dept A is the first row)
df.groupby('Dept').sum(numeric_only=True)
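
A group-wise sum can be combined with a filter to get rates per group. The following sketch uses a made-up table; the `Freq` column name and all counts are assumptions, not values from the admissions file:

```python
import pandas as pd

# Made-up counts; the real file's column names are assumed to match
toy = pd.DataFrame({
    'Admit': ['Admitted', 'Rejected', 'Admitted', 'Rejected'],
    'Dept':  ['A', 'A', 'B', 'B'],
    'Freq':  [30, 10, 20, 40],
})

# Total applicants per department
per_dept = toy.groupby('Dept')['Freq'].sum()

# Admitted applicants per department, then the admission rate
admitted = toy[toy['Admit'] == 'Admitted'].groupby('Dept')['Freq'].sum()
admit_rate = admitted / per_dept   # division aligns on the Dept index

print(admit_rate)
```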

In [None]:
(df.groupby('Dept').sum(numeric_only=True)
   .plot(kind='bar', label='index', color=[plt.cm.Paired(np.arange(len(df)))], legend=False, grid=True))

# Install Pandas datareader to access APIs with Stock data

Read about data sources here (note: not all of them work anymore): https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

In [None]:
# Install pandas_datareader (comment out the line below if it is already installed)
!pip install pandas_datareader

In [None]:
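# Workaround for older pandas_datareader releases that expect is_list_like in pd.core.common
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data as web
from datetime import datetime as dt

# Note: the 'iex' source may require an API key or be unavailable (see the link above)
df_google = web.DataReader('GOOGL', data_source='iex', start=dt(2018, 1, 1), end=dt.now()).reset_index()
df_apple = web.DataReader('AAPL', data_source='iex', start=dt(2018, 1, 1), end=dt.now()).reset_index()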

In [None]:
df_google.head()
# Volume is the number of shares or contracts traded

In [None]:
# check dtypes in each column
df_google.dtypes

### Breakout: Check the file attributes & general statistics using Pandas

In [None]:
df_google.shape

In [None]:
df_google.columns

In [None]:
df_google.mean(numeric_only=True) # skip the string 'date' column

In [None]:
# shape

In [None]:
# show first five values

In [None]:
# show last three

In [None]:
# return column names

In [None]:
# get statistics- mean and std of "open" column

### Convert the date string to a pandas datetime object

In [None]:
df_google['date'][0]

In [None]:
type(df_google['date'][0])

In [None]:
# convert string 'date' to datetime format
# (the format is inferred; infer_datetime_format is deprecated in pandas 2.x)
df_google['date'] = pd.to_datetime(df_google['date'])
df_google.head()
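
When the date layout is known, it can be passed explicitly instead of being inferred. A minimal sketch on a made-up Series, assuming ISO-style `YYYY-MM-DD` strings:

```python
import pandas as pd

# Made-up strings; the last one is deliberately invalid
raw = pd.Series(['2018-01-02', '2018-01-03', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
```

An explicit `format` avoids ambiguity between day-first and month-first layouts, and `errors='coerce'` makes bad rows easy to find with `parsed.isna()`.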

In [None]:
df_google['date'][0]

In [None]:
# set index
df_google = df_google.set_index('date')

In [None]:
df_google.head(5)

In [None]:
# Then we can query date indices with strings
# Only January (use .loc; df['2018-01'] no longer selects rows in pandas 2.x)
df_google.loc['2018-01']

In [None]:
df_google['2018-01-03':'2018-01-09']

In [None]:
df_google.loc['2018-02-28':'2018-04-21',['open','low']].head()
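
Partial string indexing works on any frame with a `DatetimeIndex`. The sketch below uses synthetic daily values rather than real market data, so it runs without the stock download:

```python
import numpy as np
import pandas as pd

# Synthetic daily prices, not real market data
dates = pd.date_range('2018-01-01', '2018-03-31', freq='D')
prices = pd.DataFrame({'open': np.arange(len(dates), dtype=float)}, index=dates)

january = prices.loc['2018-01']                         # every row in January 2018
window = prices.loc['2018-01-03':'2018-01-09', 'open']  # both end dates included

print(len(january), len(window))
```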

In [None]:
# Opening price statistics: round each price down to the nearest 100
open_price = df_google['open'].map(lambda x: int(np.floor(x/100)*100))
open_price.value_counts()

In [None]:
open_price.hist()

In [None]:
open_price.value_counts().sort_values().plot(kind='bar')

### Masks and Boolean Indexing

In [None]:
# Check mask 1
df_google['open']>1200

In [None]:
# Use mask 1
df_google['open'][df_google['open']>1200]
# shows only rows with opening price greater than 1200

In [None]:
# Show rows where the opening price is >1200 before August 1st 2018
df_google[(df_google['open']>1200) & (df_google.index < dt(2018,8,1))]

In [None]:
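# masking the whole DataFrame keeps its shape and puts NaN where the condition fails;
# those NaN values can then be dropped (next cell)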
df_google[df_google>1220]

In [None]:
df_google[df_google>1220].dropna(axis=0).head(10) #play with axis

In [None]:
# another way to filter is with isin()

df_google[df_google['open'].isin([1170.62,1184.98])]
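
`isin()` tests for exact equality, which can miss floating-point values that differ only by rounding. One possible alternative is a tolerance-based comparison with `np.isclose`; the frame below is a made-up example:

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is stored as 0.30000000000000004, not exactly 0.3
prices = pd.DataFrame({'open': [0.1 + 0.2, 1170.62, 1184.98]})

exact = prices[prices['open'].isin([0.3])]        # exact match: finds nothing
close = prices[np.isclose(prices['open'], 0.3)]   # tolerant match: finds the first row

print(len(exact), len(close))
```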

### Manipulating  Values


In [None]:
# Recall
df_google.head(4)

In [None]:
# All the ways to view (by location, by index, iat, etc) 
# can also be used to set values
# good for data normalization

df_google['volume'] = df_google['volume']/1000.0
df_google.head(4)
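
Dividing by a constant is the simplest rescaling. Two other common normalizations, min-max and z-score, follow the same column-wise pattern. This is a sketch on illustrative values, not the downloaded volumes:

```python
import pandas as pd

# Illustrative values only
vol = pd.DataFrame({'volume': [1000.0, 2000.0, 4000.0]})
col = vol['volume']

# Min-max scaling maps the column onto [0, 1]
vol['volume_minmax'] = (col - col.min()) / (col.max() - col.min())

# Z-score scaling gives zero mean and unit (sample) standard deviation
vol['volume_z'] = (col - col.mean()) / col.std()

print(vol)
```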

In [None]:
# Change specific entry
df_google.iloc[0,1] = 2
df_google.head(3)

### More Statistics and Operations

In [None]:
# mean by column, also try var() for variance
df_google.mean()   

In [None]:
# Use the apply method to perform calculations on every element
# (apply passes each column to np.sqrt, which works elementwise)
df_google[0:10].apply(np.sqrt).head(5)
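
`apply` hands the function one column at a time (or one row with `axis=1`), so the result shape depends on what the function returns. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the square roots are whole numbers
toy = pd.DataFrame({'open': [4.0, 9.0, 16.0], 'close': [1.0, 25.0, 36.0]})

roots = toy.apply(np.sqrt)                                        # same shape as toy
col_range = toy.apply(lambda col: col.max() - col.min())          # one value per column
row_range = toy.apply(lambda row: row.max() - row.min(), axis=1)  # one value per row

print(roots)
print(col_range)
print(row_range)
```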

In [None]:
df_google['month']=df_google.index.month_name()

In [None]:
df_google.groupby('month')['open'].mean()

In [None]:
df_google[0:5].mean(axis=1, numeric_only=True) # row means of first five rows, skipping the 'month' strings

# Plot Correlation
### Load several stocks

In [None]:
# Reload
dfg = pd.read_csv('data/googl.csv').drop('Unnamed: 0',axis=1) # Google stock data
dfa = pd.read_csv('data/apple.csv').drop('Unnamed: 0',axis=1) # Apple stock data
dfm = pd.read_csv('data/microsoft.csv').drop('Unnamed: 0',axis=1) # Microsoft stock data
dfn = pd.read_csv('data/nike.csv').drop('Unnamed: 0',axis=1) # Nike stock data
dfb = pd.read_csv('data/boeing.csv').drop('Unnamed: 0',axis=1) # Boeing stock data

In [None]:
dfb.head()

In [None]:
# Rename columns
dfg = dfg.rename(columns = {'Close':'GOOG'})
#print (dfg.head())

dfa = dfa.rename(columns = {'Close':'AAPL'})
#print (dfa.head())

dfm = dfm.rename(columns = {'Close':'MSFT'})
#print (dfm.head())

dfn = dfn.rename(columns = {'Close':'NKE'})
#print (dfn.head())

dfb = dfb.rename(columns = {'Close':'BA'})

In [None]:
dfb.head(2)

In [None]:
# Let's merge some tables
# They will all merge on the common column Date

df = dfg[['Date','GOOG']].merge(dfa[['Date','AAPL']])
df = df.merge(dfm[['Date','MSFT']])
df = df.merge(dfn[['Date','NKE']])
df = df.merge(dfb[['Date','BA']])

df.head()
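
By default `merge` performs an inner join on the columns the two frames share, so dates missing from either table are dropped. The sketch below shows this on two toy tables; the tickers `AAA` and `BBB` and all values are placeholders:

```python
import pandas as pd

# Toy price tables; tickers and values are placeholders
left = pd.DataFrame({'Date': ['2017-01-03', '2017-01-04', '2017-01-05'],
                     'AAA': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Date': ['2017-01-04', '2017-01-05', '2017-01-06'],
                      'BBB': [10.0, 20.0, 30.0]})

inner = left.merge(right, on='Date')               # default how='inner'
outer = left.merge(right, on='Date', how='outer')  # keep all dates, fill gaps with NaN

print(inner)
print(outer)
```

Naming the key with `on=` makes the join column explicit instead of relying on whichever column names happen to overlap.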

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.head()

In [None]:
df.plot()

In [None]:
df.loc['2017', ['NKE','BA']].plot() # rows from 2017, two columns

In [None]:
# show a correlation matrix (pearson)
crl = df.corr()
crl

In [None]:
crl.sort_values(by='GOOG',ascending=False)

In [None]:
s = crl.unstack()
so = s.sort_values(ascending=False)
so[so<1]
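
Filtering with `so<1` relies on the self-correlations being exactly 1, and it still lists every pair twice. One possible alternative is to mask everything except the upper triangle before unstacking. The sketch uses random placeholder data with made-up column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random placeholder data with made-up column names
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['X', 'Y', 'Z'])
corr = toy.corr()

# True strictly above the diagonal: one entry per pair, no self-correlations
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN and are dropped after unstacking
pairs = corr.where(upper).unstack().dropna().sort_values(ascending=False)

print(pairs)
```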

In [None]:
# column means; subtracting them (next cell) centers each series at zero so they can be compared in one plot
df.mean()

In [None]:
sim=df-df.mean()
sim.tail()

In [None]:
sim[['MSFT','BA']].plot()
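
Mean-centering is also the first step of the Pearson correlation itself: the coefficient is the sum of products of the centered series, divided by the square root of the product of their sums of squares. The sketch below checks this against `Series.corr` on synthetic data (the series are random, not stock prices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic series standing in for two price columns
x = pd.Series(rng.normal(size=100)).cumsum()
y = 0.5 * x + pd.Series(rng.normal(size=100))

xc, yc = x - x.mean(), y - y.mean()   # mean-centered copies
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
r_pandas = x.corr(y)                  # Pearson by default

print(r_manual, r_pandas)
```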