# Introduction to Pandas and Data Preprocessing

Agenda
    1. What is Pandas
    2. Pandas Vs. Other Platforms
    3. Python Lists (Revisiting)
    4. Numpy 1D Arrays (We will come back and discuss in detail)
    5. Pandas Series
    6. Boolean Arrays
    7. Series Index 
        a. Data Types
        b. Iterations
        c. Broadcasting
        d. CRUD operations
        e. Summary Stats
        f. Handling duplicates
        g. NaN
        h. Plotting (we will revisit in detail later in viz class)
        i. Serialization
    8. Data Frames
        a. Columnar Data
        b. Similarities To Py Dicts
        c. Reading CSV
    9. Relational Algebra
        Selections, Projections, Cartesian Product, Union, Difference
    10. Data Exploration
    11. Axes of DataFrames
    12. Index and Columns
    13. Summary Stats
    14. Histograms
    15. Transposing Data
    16. How to add rows/cols to Dataframes
    17. Iterations over Data
    18. Setting Data
    19. Joins 
        concat on Rows/Cols and various types of Joins; joins on Index
    20. Filtering Dataframes
    21. Grouping
        Pivot, Stack, unstack
    22. Serialization
        JSON, Pickle, Numpy, Excel
    23. Plotting (basic; we will cover Matplotlib and Seaborn in detail)
    24. Time Series 
        Date Manipulations, Window functions, Plotting
    25. Sample Application
        

# Introduction:
  
**PAN**el **D**ata **A**nalysi**S** (*PANDAS*) is an open-source library in the PyData stack, originally developed by Wes McKinney, used for data wrangling and data exploration.

The library introduces two data structures, Series and DataFrame, both built on top of NumPy.
Tabular data is analysed with pandas through DataFrames (a concept inherited from R).
    

# Why Pandas?
* vs R
* vs Excel
* vs Matlab
* SAS

## Python Lists

In [None]:
#Find the temps greater than mean in the given sequence 
temps  = [30,45,60,90]

In [None]:
def mean_temps(seq):
    """
     Finds the mean of a sequence
    """
    return sum(seq)/len(seq)

In [None]:
mean_val = mean_temps(temps)
print(mean_val)

In [None]:
type(mean_val)

In [None]:
''' to find temps greater than mean'''
hold = []
for temp in temps:
    if temp > mean_val:
        hold.append(temp)

In [None]:
hold

In [None]:
# Pythonic ways of doing the same using list comprehensions.
[temp for temp in temps if temp > mean_temps(temps)]

In [None]:
a = 1+1
a

In [None]:
b = 2 + 1
b

In [None]:
__

# Numpy

In [None]:
import numpy as np

The PyData stack is faster than *pure Python* because its core routines are written in C, C++, Fortran, etc.

In [None]:
#Lets look at an example
np_temp = np.array(temps) 

In [None]:
type(np_temp)

In [None]:
from timeit import Timer
from numpy import arange

In [None]:
Timer('py_lst = range(10000000); sum(py_lst)').timeit(10)

In [None]:
sum(range(10))

In [None]:
Timer('np_lst = arange(10000000);np_lst.sum()','from numpy import arange').timeit(10)

In [None]:
__/_

In [None]:
#Boolean arrays
np_temp[np_temp > np_temp.mean()]

In [None]:
np_temp.mean()

In [None]:
np_temp

In [None]:
mask  = np_temp > np_temp.mean()

In [None]:
np_temp[mask]
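
To tie the list version and the NumPy version together, here is a minimal sketch that computes "temps greater than the mean" both ways on the same four temperatures. The variable names `above_py` and `above_np` are made up for this illustration.

```python
import numpy as np

temps = [30, 45, 60, 90]
np_temp = np.array(temps)

# Pure Python: loop over the list and keep values above the mean
mean_val = sum(temps) / len(temps)
above_py = [t for t in temps if t > mean_val]

# NumPy: the comparison is applied element-wise and yields a boolean array
mask = np_temp > np_temp.mean()
above_np = np_temp[mask]

print(mask)
print(above_py, above_np.tolist())
```

Both approaches keep the same values; the NumPy form avoids the explicit loop.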

In [None]:
temps[1:3]

# *Series*
* A Series is a one-dimensional object similar to an array, list, or column in a table. 
* It will assign a labeled index to each item in the Series.
* By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.
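
As a quick illustration of the points above, the sketch below builds one Series with the default index and one with custom labels. The day labels are just sample values.

```python
import numpy as np
import pandas as pd

temps = [30, 45, 60, 90]

# Default index: integer labels 0 .. len-1
default_idx = pd.Series(temps, name="Temperature")
print(list(default_idx.index))

# Custom labels: look values up by label instead of position
labelled = pd.Series(temps, index=["M", "T", "W", "Th"], name="Temperature")
print(labelled["W"])

# The values are held in a NumPy array underneath
values = labelled.to_numpy()
print(type(values))
```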

In [None]:
# Pandas Series
import pandas as pd

In [None]:
ser_temp = pd.Series(temps, name = "Temperature")

In [None]:
ser_temp

In [None]:
ser_temp.mean()

In [None]:
ser_temp.sum()

In [None]:
dir(ser_temp)

### Boolean Arrays

In [None]:
hot = pd.Series([False, False, False, False, True, True])

In [None]:
hot

In [None]:
ser_temp[hot]

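A boolean Series used as a mask is aligned on the index, so it must contain a label for every entry of the Series being filtered. It may be longer than the base Series; the extra labels are ignored.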
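
A small sketch of how a boolean Series is matched against the data: the match is by index label, not by position. The reversed index below is an artificial choice made only to expose that behaviour.

```python
import pandas as pd

ser = pd.Series([30, 45, 60, 90])   # labels 0, 1, 2, 3

# Same labels as `ser`, but listed in reverse order
mask = pd.Series([True, False, True, False], index=[3, 2, 1, 0])

# Labels 3 and 1 are True, so those rows are kept (in the order of `ser`)
picked = ser[mask]
print(picked)
```

If the mask were applied by position, the first and third values would have been kept instead.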

In [None]:
# Another way of creating a boolean array
mask = ser_temp > 55
mask

In [None]:
ser_temp[mask]

In [None]:
#Pure Python: a list cannot be compared with a number element-wise (TypeError in Python 3)
temps >  55

Multiple boolean arrays can be combined to build compound logic using `&` (and), `|` (or) and `~` (not).


In [None]:
mask2 = ser_temp  < 90

In [None]:
ser_temp

In [None]:
print(ser_temp[mask & mask2])
print(ser_temp[mask | mask2])

In [None]:
# Operator precedence
ser_temp[(ser_temp > 55) & (ser_temp < 90)]
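
The sketch below shows the three combining operators on a small Series. The thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

s = pd.Series([30, 45, 60, 90])

# & and | bind tighter than > and <, so each comparison needs its own parentheses
between = s[(s > 55) & (s < 90)]
extremes = s[(s < 40) | (s > 80)]
not_hot = s[~(s > 55)]

print(between.tolist(), extremes.tolist(), not_hot.tolist())
```

Python's `and`, `or` and `not` do not work element-wise on a Series, which is why the bitwise operators are used here.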

### Index in Series

In [None]:
ser_temp.index

In [None]:
temp2 = pd.Series(temps, name = "Temperature2", index  = ["M","T","W","Th"])

In [None]:
temp2

A meaningful index makes plotting and data subsetting easier.

In [None]:
temp3 = pd.Series(temps, name = "Temperature3", index = pd.date_range("20170101",periods=4))

In [None]:
temp3

In [None]:
temp4 = pd.Series(temps, index = [0,1,1,2])

In [None]:
temp4

# Pandas Data type

In [None]:
import pandas as pd

In [None]:
ser = pd.Series(range(10))

In [None]:
ser

In [None]:
ser2 = pd.Series([1.1,2.2,3,4])

In [None]:
ser2

In [None]:
ser3 = pd.Series(['a','b','c'])

In [None]:
ser3

In [None]:
ser_bad_practise = pd.Series([{}, [], (2,)])  # mixed Python objects: dtype becomes 'object'

In [None]:
ser_dates = pd.Series(['2017-05-01','2017-06-01','2017-07-01'])

In [None]:
ser_dates.dtype

In [None]:
ser_dates2 = pd.to_datetime(ser_dates)

In [None]:
ser_dates2.dtype

Here the dtype is a NumPy datetime type (typically `datetime64[ns]`); how it is displayed, for example as `<M8[ns]`, depends on the NumPy build and the machine architecture.

In [None]:
ser_categorical = pd.Series(['Apple','Orange','Melon','Apple','Berry','Orange'],dtype='category')

In [None]:
ser_categorical.dtype

`category` is a special data type, often used in ML, for a variable/feature that takes a limited number of distinct values. pandas stores each distinct value once and keeps compact integer codes for the entries, which saves memory.

A Series can store ints, floats, dates, categoricals and heterogeneous values. A heterogeneous Series gets the dtype `object`, which is not considered good practice.
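
The following sketch prints the dtype pandas picks for a few kinds of input and peeks inside a categorical. The exact dtype names printed can vary between pandas versions, so treat the output as indicative.

```python
import pandas as pd

ints = pd.Series(range(3))
floats = pd.Series([1.1, 2.2, 3, 4])   # the ints are upcast to float
dates = pd.to_datetime(pd.Series(["2017-05-01", "2017-06-01", "2017-07-01"]))
fruit = pd.Series(["Apple", "Orange", "Melon", "Apple", "Berry", "Orange"],
                  dtype="category")

for name, ser in [("ints", ints), ("floats", floats),
                  ("dates", dates), ("fruit", fruit)]:
    print(name, ser.dtype)

# A categorical stores each distinct label once and refers to it by an integer code
print(list(fruit.cat.categories))
print(fruit.cat.codes.tolist())
```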

###  Iteration

In [None]:
temps = [30, 45, 60, 90]
s = pd.Series(temps)

In [None]:
s

In [None]:
for val in s:
    print(val)

Iterating over a Series yields its values, just like a list.

Value check (containment) in lists vs Series

In [None]:
temps

In [None]:
30 in temps

In [None]:
47 in temps

In [None]:
s

In [None]:
30 in s

In [None]:
45 in s

In [None]:
1 in s

In [None]:
2 in s

In [None]:
45 in set(s)

In [None]:
30 in set(s)

In [None]:
0 in dict(s)

In [None]:
1 in dict(s)
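
The cells above can look surprising, so here is a compact sketch of what `in` checks on a Series: it tests the index labels, like a dict tests its keys, not the values.

```python
import pandas as pd

s = pd.Series([30, 45, 60, 90])   # labels 0, 1, 2, 3

label_hit = 1 in s               # 1 is an index label
value_miss = 30 in s             # 30 is a value, not a label
value_hit = 30 in s.values       # check the underlying values instead
flags = s.isin([30, 45])         # element-wise membership test

print(label_hit, value_miss, value_hit)
print(flags.tolist())
```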

In [None]:
for key, val in s.items():  # iteritems() was removed in pandas 2.0
    print(key, val)

In [None]:
for key, val in dict(s).items():  # Python 3 dicts use items()
    print(key, val)

# Broadcasting

In [None]:
s

In [None]:
s + 2

In [None]:
s.shape

`s` is one-dimensional (a Series) and `2` is a scalar, so the operation is applied to each individual element.
The dunder method `__add__` handles this under the hood, and the operands need not have the same length.
All the basic arithmetic operations can be performed this way.
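
A minimal sketch of scalar broadcasting. The Fahrenheit conversion assumes, purely for illustration, that the values are in Celsius.

```python
import pandas as pd

s = pd.Series([30, 45, 60, 90])

plus_two = s + 2               # the scalar is applied to every element
fahrenheit = s * 9 / 5 + 32    # several operations chained, still element-wise

print(plus_two.tolist())
print(fahrenheit.tolist())
```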

In [None]:
temp_vals = pd.Series([1,2,3,4,5,6])

In [None]:
temp_vals.shape

In [None]:
temp_vals + s

In [None]:
s2 = pd.Series([10,20,30], index = [2,3,4])

In [None]:
s2

In [None]:
s

In [None]:
s + s2

Operations between two Series are performed index-wise: values are matched by label, and labels present in only one Series produce NaN.
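
The sketch below repeats the addition with the same two Series and shows one way to avoid the NaN results, using `Series.add` with `fill_value`. Treating the missing side as 0 is an assumption that suits sums, not every use case.

```python
import pandas as pd

s = pd.Series([30, 45, 60, 90])                 # labels 0, 1, 2, 3
s2 = pd.Series([10, 20, 30], index=[2, 3, 4])   # labels 2, 3, 4

total = s + s2                     # labels present on only one side give NaN
filled = s.add(s2, fill_value=0)   # treat the missing side as 0 instead

print(total)
print(filled)
```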

In [None]:
s * s2

In [None]:
# Applying functions to series

In [None]:
def add_num(num):
    return num+2

s.apply(add_num)

In [None]:
# lambda functions
s.apply(lambda x : x**2)

In [None]:
s.apply(float)

In [None]:
s.astype(int)

# CRUD
    * Create, Read, Update and Delete: the four basic operations on stored state

In [None]:
s

There are three ways to retrieve data from a Series:
* using plain `[]` indexing
* using `loc` (based on the label)
* using `iloc` (based on the position)
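
To make the difference visible, the sketch below uses integer labels that deliberately do not match the positions. The labels 10 to 40 are arbitrary.

```python
import pandas as pd

s = pd.Series([30, 45, 60, 90], index=[10, 20, 30, 40])

by_label = s.loc[10]     # label 10
by_pos = s.iloc[0]       # first position
last = s.iloc[-1]        # last position
plain = s[10]            # plain [] on an integer index looks up the label

print(by_label, by_pos, last, plain)
```

With this index `s.loc[0]` would raise a `KeyError`, since no row carries the label 0.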

In [None]:
s[0]

In [None]:
s.loc[0]

In [None]:
s.iloc[-1]

In [None]:
temps

In [None]:
temp2 = pd.Series(temps, index = ['M','T','W','Th'])

In [None]:
temp2

In [None]:
temp2['W'] # plain [] looked up by label

In [None]:
temp2[3]

In [None]:
# With a non-integer index, plain [] accepts labels and falls back to positions for integers (deprecated in recent pandas; prefer iloc)
temp2[-1]

In [None]:
# With a non-integer index, loc works only on labels
temp2.loc['M']

In [None]:
# iloc works only on positions, whatever the index looks like
temp2.iloc[-1]

In [None]:
temp2.iloc[1]

In [None]:
temp2.M

In [None]:
temp2.shape

In [None]:
temp3 = pd.Series(temps, index=['M','T', 0, 1])

In [None]:
temp3

In [None]:
temp3.M

In [None]:
temp3['M']

In [None]:
temp3[0]

In [None]:
temp3.loc['M']

### Zen of Python: 
    Explicit is better than implicit.
    Use loc or iloc when accessing elements of a Series or DataFrame rather than indexing directly.

## Update

In [None]:
temp2

In [None]:
temp2.W = 70

In [None]:
temp2

In [None]:
temp2.loc['W'] = 60
temp2

In [None]:
temp2.T = 50

In [None]:
temp2

In [None]:
temp2.iloc[-1] = 100

These operations work in place: the Series is mutable.
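
A short sketch contrasting in-place assignment with `pd.concat`, which returns a new object. The extra labels "F" and "S" are sample values.

```python
import pandas as pd

temp = pd.Series([30, 45, 60, 90], index=["M", "T", "W", "Th"])

temp.loc["W"] = 70       # overwrites an existing label in place
temp.loc["F"] = 110      # assigning to a new label enlarges the Series in place

# pd.concat leaves `temp` untouched and returns a new Series
longer = pd.concat([temp, pd.Series([95], index=["S"])])

print(temp.index.tolist())
print(longer.index.tolist())
```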

In [None]:
lst = [30,45,60,90]

In [None]:
lst.append(100)

In [None]:
lst

In [None]:
st = pd.concat([temp2, pd.Series([110], index = ["F"])])  # Series.append was removed in pandas 2.0

In [None]:
st

In [None]:
temp2

In [None]:
# Only vectors can be appended similar to an extend method

In [None]:
temp2.loc['X'] = 10 # set_value was removed in pandas 1.0; loc updates in place, unlike pd.concat, which returns a new Series

In [None]:
temp2

In [None]:
temp2.loc['S'] = 95

# Delete

In [None]:
d = {"foo":'bar'}

In [None]:
del d['foo']

In [None]:
d

In [None]:
temp2

In [None]:
del temp2['M']

In [None]:
temp2

In [None]:
temp2[temp2 < 94] # temp2 is not mutated; this returns a new Series

In [None]:
temp2

In [None]:
mask = temp2 < 94

In [None]:
temp2.index == 'T'

In [None]:
mask2 = temp2.index == 'T'

In [None]:
temp2[mask]

# Summary Stats

In [None]:
temp2.mean(), temp2.min(), temp2.median()

In [None]:
temp2.describe()

In [None]:
temp2.value_counts()

In [None]:
ser_categorical.value_counts()

In [None]:
ser_categorical.describe()

In [None]:
temp2.quantile(.3)

In [None]:
temp2.describe(percentiles=[0.05,.1,.2])

# Duplicates

In [None]:
temp2

In [None]:
temp3 = pd.concat([temp2, pd.Series([60],index = ['T'])])

In [None]:
temp3

In [None]:
temp3.duplicated()

In [None]:
temp3.duplicated().all()

In [None]:
temp3.duplicated().any()

In [None]:
temp3.duplicated(keep='last')

In [None]:
mask = temp3.duplicated(keep = False)

In [None]:
temp3[mask]

In [None]:
temp3[~mask]

In [None]:
temp3.drop_duplicates(keep='last')

In [None]:
#Dealing with duplicates in index labels
temp4 = pd.concat([temp3, pd.Series([100],index = ['T'])])

In [None]:
temp4

In [None]:
temp4.iloc[1]

In [None]:
temp4.loc['S'] #label

In [None]:
temp4.loc['T'] # returns a series not a scalar
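
The duplicate-handling calls above can be summarised on one small Series. The values and labels below are sample data with one repeated value (60) and one repeated label ("T").

```python
import pandas as pd

s = pd.Series([50, 60, 100, 60], index=["T", "W", "Th", "T"])

first = s.duplicated()              # later repeats are flagged
last = s.duplicated(keep="last")    # earlier repeats are flagged
every = s.duplicated(keep=False)    # every repeated value is flagged
unique_vals = s.drop_duplicates()

# The index has its own duplicate check, independent of the values
idx_dupes = s.index.duplicated()

# A repeated label returns all matching rows as a Series, not a scalar
t_rows = s.loc["T"]

print(first.tolist(), last.tolist(), every.tolist())
print(unique_vals.tolist(), idx_dupes.tolist(), t_rows.tolist())
```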

# NaN

In [None]:
temp6 = pd.Series([1,2,None])

In [None]:
temp6

In [None]:
pd.Series([None])

In [None]:
pd.Series([np.nan])

In [None]:
temp7 = pd.concat([temp3, pd.Series([100,None], index = ['F','Sun'])])

In [None]:
temp7

In [None]:
len(temp7)

In [None]:
temp7.count() # Some methods, such as count(), ignore missing values.

In [None]:
temp7 + 2

In [None]:
temp7.mean() # default ignores nan

In [None]:
#Find nan in series
temp7.isnull()

In [None]:
temp7[~temp7.isnull()]

In [None]:
temp7.isnull().any()

In [None]:
temp7[temp7.notnull()]

### Replacing and dropping null values in series

In [None]:
temp7.dropna()

In [None]:
temp7.fillna(-1)

In [None]:
temp7.ffill() # forward fill; fillna(method='ffill') is deprecated in recent pandas

In [None]:
temp7
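
A compact sketch of the missing-value behaviour shown above, on a three-element Series so the results are easy to check by hand.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])   # None is stored as NaN in a float Series

n_total, n_valid = len(s), s.count()   # count() skips missing values
avg = s.mean()                         # so does mean() by default

print(n_total, n_valid, avg)
print(s.isna().tolist())
print(s.fillna(0).tolist())    # replace with a constant
print(s.ffill().tolist())      # carry the previous value forward
print(s.dropna().tolist())     # drop the missing entries
```

None of these calls modifies `s`; each returns a new Series.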

# Plotting

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
temp3.plot()

In [None]:
temp7.plot(style='r')

In [None]:
temp7.plot.box()

In [None]:
temp7.plot.hist()

In [None]:
temp3.plot.kde()

In [None]:
temp3.plot.pie()

In [None]:
ax = temp3.plot.barh()
ax.set_xlabel('temperature')
ax.set_ylabel('days')

# Reading a file

In [None]:
temp3

In [None]:
temp3.to_csv('temps.csv')

In [None]:
!pwd

In [None]:
cat temps.csv

In [None]:
temp3.name  = "Temp"

In [None]:
temp3.to_csv('temps.csv',header=True)

In [None]:
!cat temps.csv

In [None]:
temp3.to_csv('temps.csv',header=True, index_label= 'Index')

In [None]:
!cat temps.csv

In [None]:
temp4 = pd.read_csv('temps.csv', header=None, index_col=0).squeeze("columns")  # Series.from_csv was removed in pandas 1.0

In [None]:
temp4

In [None]:
temp4.dtype

In [None]:
temp4 = pd.read_csv('temps.csv', header=0, index_col=0).squeeze("columns")

In [None]:
temp4

In [None]:
df = pd.read_csv('temps.csv')

In [None]:
df

In [None]:
df.dtypes

In [None]:
df = pd.read_csv('temps.csv',index_col=0)

In [None]:
df

In [None]:
df.dtypes

In [None]:
type(df['Temp'])

In [None]:
type(df.Temp)
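
The write/read round trip can also be tried without touching the disk. The sketch below uses an in-memory `io.StringIO` buffer in place of `temps.csv`; that substitution is only to keep the example self-contained.

```python
import io
import pandas as pd

temp = pd.Series([30, 45, 60, 90], index=["M", "T", "W", "Th"], name="Temp")

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
temp.to_csv(buf, header=True, index_label="Index")
print(buf.getvalue())

# Read it back: the first column becomes the index, the rest is a DataFrame
buf.seek(0)
df = pd.read_csv(buf, index_col=0)
restored = df["Temp"]    # selecting one column gives a Series again

print(type(df).__name__, type(restored).__name__)
```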

# Data Frames
* Tabular representation of data
* Each column is a Series
* Think of a DataFrame as a collection of Series (cols)
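
A minimal sketch of the "collection of Series" idea: pulling one column out gives a Series, pulling a list of columns gives a smaller DataFrame, and every column shares the frame's index.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "George", "Ringo"],
                   "age": [22, 21, 23]})

age = df["age"]           # one column -> Series
subset = df[["name"]]     # list of columns -> DataFrame

print(type(age).__name__, type(subset).__name__)
print(age.index.equals(df.index))   # every column shares the frame's index
```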

In [None]:
cols = {'name': ['Paul','George','Ringo'],
       'age':[22,21,23]}

In [None]:
cols

In [None]:
df = pd.DataFrame(cols)

In [None]:
df

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.age # Returns a Series ; attribute access

In [None]:
df['age'] # Returns a Series ; key (bracket) access

In [None]:
df.describe() # Looks at numeric cols (int, float, etc.) and reports summary stats

# DF vs Dictionaries 

In [None]:
df

In [None]:
df['last'] = pd.Series(['Lennon', 'McCartney', 'Starkey'], index = [4,0,2])

In [None]:
df
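
Assigning a Series as a new column matches on index labels, which explains the missing value in the frame above. A small sketch of that behaviour, next to a plain list, which is assigned by position:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "George", "Ringo"]})   # index 0, 1, 2

# Labels 0 and 2 match rows; label 4 matches nothing and row 1 is left empty
df["last"] = pd.Series(["Lennon", "McCartney", "Starkey"], index=[4, 0, 2])

# A plain list has no labels, so it is assigned by position and must match in length
df["instrument"] = ["Bass", "Guitar", "Drums"]

print(df)
```

"Lennon" is silently dropped because no row carries the label 4.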

In [None]:
df['instrument'] = ['Bass','Guitar','Drums'] 

In [None]:
# Broadcasting in Pandas

In [None]:
df['birthplace'] = 'Liverpool'

In [None]:
df.age

In [None]:
df['age']

In [None]:
#accessing multiple cols

sub_df = df[['name','last']]
sub_df

In [None]:
type(sub_df)

In [None]:
del df['last'] # does inplace deletion 

In [None]:
df

In [None]:
df.pop('birthplace') # does inplace deletion and mutation

In [None]:
df

# Creation from List Dicts

In [None]:
cols = {'name': ['Paul','George','Ringo'],
       'age':[22,21,23]}

In [None]:
df = pd.DataFrame(cols)

In [None]:
df # On Python 3.7+ the columns follow the dictionary's insertion order; older setups could reorder them because dicts were unordered

In [None]:
df = pd.DataFrame(cols, columns = ['name','age'])

In [None]:
df

In [None]:
row = [{'name':'Paul', 'age' : 22},
      {'name': 'George', 'age': 23},
      {'name':'Ringo','age':20}]

In [None]:
df = pd.DataFrame(row)

In [None]:
df

### Creating DataFrame from JSON

In [None]:
import json

In [None]:
col_json = json.dumps(cols)
col_json

In [None]:
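from io import StringIO  # recent pandas expects a file-like object rather than a raw JSON string
df = pd.read_json(StringIO(col_json))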

In [None]:
df

In [None]:
row_json = json.dumps(row)

In [None]:
type(row_json)

In [None]:
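from io import StringIO  # recent pandas expects a file-like object rather than a raw JSON string
df = pd.read_json(StringIO(row_json))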

In [None]:
df

### Creating DataFrame from CSV

In [None]:
cat beatles.csv

In [None]:
filename = 'beatles.csv'

In [None]:
df = pd.read_csv(filename)

In [None]:
df

# Relational Algebra
    * Foundations of databases: joins, projections, etc.

### Selection
    * AND
    * OR 

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'Name':["Paul",'George','Ringo'],
                  'Growth':[.5,.7,.9]})

In [None]:
df

In [None]:
df[df.Name == 'Paul'] # Boolean array; returns a df  

In [None]:
df.Name.str.contains('o')

In [None]:
df.dtypes

In [None]:
df[(df.Name == 'Paul') & (df.Growth == 0.5)] # each comparison needs parentheses because & binds tighter than ==

In [None]:
!pip install numexpr

In [None]:
df.query("Name == 'Paul' and Growth > 0.3") # SQL-like way of doing subselections

In [None]:
df[(df.Name == 'Paul')| (df.Name == 'George')]

In [None]:
df[~(df.Name == 'John')]

In [None]:
~(df.Name == 'John') # parentheses needed: ~ binds tighter than ==

In [None]:
df[df.Name != 'John']

In [None]:
df.head(2)

In [None]:
df.tail(2)

In [None]:
df.iloc[[2,1]]

In [None]:
# Similar to slicing of python list
df[1:2]
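
The boolean-mask form and the `query` form express the same selection. A minimal sketch comparing them on the frame from this section:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})

# Boolean-mask form: parenthesise each comparison before combining with &
by_mask = df[(df["Name"] == "Paul") & (df["Growth"] > 0.3)]

# query() form: the same condition written as a string expression
by_query = df.query("Name == 'Paul' and Growth > 0.3")

print(by_mask)
print(by_mask.equals(by_query))
```

`query` reads closer to SQL; the mask form is easier to build up programmatically.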

### Projections
* Used for pulling columns out of a table

In [None]:
df

In [None]:
df.dtypes

In [None]:
df.Growth # works because Growth is a valid Python attribute name (starts with a letter or underscore)

In [None]:
df['Name']

In [None]:
# df[Name] raises NameError: the column label must be a quoted string, as in df['Name']

In [None]:
df[['Growth','Name']]

In [None]:
df.iloc[:,1:] # First is for Rows and the second for cols

In [None]:
df.loc[df.Name.str.contains('r'),['Name']]

In [None]:
df.loc[:,["Name"]]
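
Selection and projection can be combined in one `loc` call: the first argument picks rows, the second picks columns. A sketch, with an arbitrary threshold of 0.6:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})

# Rows where Growth > 0.6, keeping only the Name column (result is a DataFrame)
fast_growers = df.loc[df["Growth"] > 0.6, ["Name"]]

# Passing a bare label instead of a list returns a Series
fast_names = df.loc[df["Growth"] > 0.6, "Name"]

print(fast_growers)
print(type(fast_names).__name__)
```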

###  Product

* Join without any conditions
* Inner Join
* Outer Join
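
A join without any condition is the Cartesian product. One way to sketch it is `how="cross"`, which assumes pandas 1.2 or newer; the frames mirror the ones used in this section.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})
inst_df = pd.DataFrame({"Name": ["John", "Ringo"],
                        "inst": ["guitar", "drum"]})

# how="cross" pairs every row of df with every row of inst_df
product = pd.merge(df, inst_df, how="cross")

print(product.shape)
print(product.columns.tolist())   # the shared column name gets _x / _y suffixes
```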


In [None]:
df

In [None]:
inst_df = pd.DataFrame([{'Name': 'John','inst':'guitar'},
                      {'Name': 'Ringo','inst':'drum'}])

In [None]:
inst_df

In [None]:
pd.merge(df, inst_df)

In [None]:
pd.merge(df,inst_df, how = 'inner', on ='Name')

In [None]:
pd.merge(df, inst_df, how="outer", on = "Name")

In [None]:
pd.merge(df, inst_df, how="left", on = "Name")

In [None]:
pd.merge(df, inst_df, how="right", on = "Name")
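
To compare the join types side by side, the sketch below counts the rows each one returns and uses `indicator=True` to show where each row of the outer join came from.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})
inst_df = pd.DataFrame({"Name": ["John", "Ringo"],
                        "inst": ["guitar", "drum"]})

# Row counts for each join type on the shared "Name" column
sizes = {how: len(pd.merge(df, inst_df, how=how, on="Name"))
         for how in ["inner", "left", "right", "outer"]}
print(sizes)

# indicator=True adds a _merge column saying which side each row came from
outer = pd.merge(df, inst_df, how="outer", on="Name", indicator=True)
origin = dict(zip(outer["Name"], outer["_merge"].astype(str)))
print(origin)
```

Only "Ringo" appears in both frames, so the inner join has a single row.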

In [None]:
inst_df2 = inst_df.copy()

In [None]:
inst_df2 = inst_df2.rename(columns= {"Name":"First"})

In [None]:
inst_df2

In [None]:
pd.merge(df, inst_df2) # raises MergeError: the two frames share no column names

In [None]:
pd.merge(df, inst_df2, left_on="Name", right_on="First")

In [None]:
# Merging on Index
pd.merge(df,inst_df, left_index= True, right_index= True)

In [None]:
df.set_index("Name")

In [None]:
pd.merge(df.set_index("Name"),inst_df2.set_index("First"),left_index= True, right_index=True)

# Union

In [None]:
df

In [None]:
inst_df

In [None]:
pd.concat([df,inst_df]) # The frames need not have the same cols; the result has the union of the col names

In [None]:
pd.concat([df,inst_df]).reset_index()

In [None]:
pd.concat([df, inst_df2]) # DataFrame.append was removed in pandas 2.0

In [None]:
pd.concat([df, inst_df2], ignore_index= True)

In [None]:
df2 = pd.concat([df, df])

In [None]:
df2

In [None]:
df2.index.is_unique

In [None]:
df2.reset_index().index.is_unique

In [None]:
df2.iloc[0]

In [None]:
df2.loc[0]

In [None]:
temp_df = df2.reset_index()
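
Stacking a frame on itself keeps the original row labels, so the index stops being unique. A short sketch of the consequence for `loc` and `iloc`, and of `ignore_index=True` as one way around it:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})

stacked = pd.concat([df, df])                        # labels 0, 1, 2, 0, 1, 2
renumbered = pd.concat([df, df], ignore_index=True)  # labels 0 .. 5

print(stacked.index.is_unique, renumbered.index.is_unique)

by_label = stacked.loc[0]    # two rows share label 0 -> DataFrame
by_pos = stacked.iloc[0]     # exactly one row -> Series

print(type(by_label).__name__, type(by_pos).__name__)
```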

#### Difference

In [None]:
df

In [None]:
inst_df

In [None]:
~df.Name.isin(inst_df.Name)

In [None]:
df[~df.Name.isin(inst_df.Name)]
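
The set difference "rows of `df` whose Name is not in `inst_df`" can be sketched on its own as follows, using the same two frames:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})
inst_df = pd.DataFrame({"Name": ["John", "Ringo"],
                        "inst": ["guitar", "drum"]})

in_other = df["Name"].isin(inst_df["Name"])   # True where the name also appears in inst_df
only_in_df = df[~in_other]                    # keep the rest

print(in_other.tolist())
print(only_in_df)
```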

# Exploring DF

In [None]:
# https://www.ssa.gov/oact/babynames/limits.html

In [None]:
df = pd.read_csv('yob1880.txt')

In [None]:
df

In [None]:
df = pd.read_csv('yob1880.txt', names=["Name","gender","count"])

In [None]:
df

In [None]:
df.T

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.gender.value_counts()

In [None]:
df['gender'] = df['gender'].astype("category")

In [None]:
df.info()

In [None]:
df.head(4)

In [None]:
df.tail()

In [None]:
df.shape

### Axes of DF

In [None]:
df.axes

In [None]:
len(df.axes)

In [None]:
df.axes[0]

In [None]:
df.axes[1]

In [None]:
# axes 0 = index
# axes 1  = col headers
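
The `axis` argument of most DataFrame methods refers to these two axes. A minimal sketch on a made-up 2x2 frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

col_totals = df.sum(axis=0)   # collapse the rows: one result per column
row_totals = df.sum(axis=1)   # collapse the columns: one result per row

print(col_totals.tolist(), row_totals.tolist())
print(df.axes[0].equals(df.index), df.axes[1].equals(df.columns))
```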

In [None]:
df.sort_index(axis=0, ascending= False) # sort by rows

In [None]:
df.index #synonym of axes 0

In [None]:
df.index.unique()

In [None]:
df.index.values

In [None]:
df.index.duplicated()

In [None]:
df.index.duplicated().any()

In [None]:
dupe_index = pd.Index([1,1,1])

In [None]:
dupe_index.duplicated()

In [None]:
dupe_index.duplicated().any()

In [None]:
dupe_index.duplicated().all()

In [None]:
df.columns

In [None]:
df.columns.duplicated()

In [None]:
df.axes[0] == df.index

In [None]:
df.axes[0] is df.index

# Summary statistics

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
df.gender.describe()

In [None]:
df.describe(percentiles=[.4,.6,.8])

In [None]:
df.describe(include='all')

In [None]:
df.Name.value_counts()

In [None]:
df[df.Name == 'Clara']

In [None]:
df.groupby('gender')['count'].sum() # total count per gender; selecting 'count' keeps the Name strings out of the sum
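
Since `yob1880.txt` may not be at hand, here is a sketch of the same grouping on a tiny frame with the same three columns. The names and counts are made up for illustration and are not the SSA figures.

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Mary", "Anna", "John", "William"],
                      "gender": ["F", "F", "M", "M"],
                      "count": [10, 5, 8, 4]})

# Split the rows by gender, then sum the count column within each group
totals = names.groupby("gender")["count"].sum()

# Several statistics at once for the same grouping
summary = names.groupby("gender")["count"].agg(["sum", "max"])

print(totals)
print(summary)
```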

# Histograms

In [None]:
df['count'].hist()

In [None]:
ax = df['count'].hist()
ax.set_ylim([0,10])

In [None]:
ax = df['count'].plot(kind = 'hist')
ax.set_ylim([0,10])

In [None]:
df.columns

In [None]:
df[df['count'] > 5000] # df.count is the DataFrame.count method, so use brackets for this column
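
A column named `count` clashes with the `DataFrame.count` method, so attribute access does not reach the column. The sketch below shows the clash on a two-row frame; the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mary", "John"], "count": [7065, 9655]})

is_method = callable(df.count)   # attribute access finds the method, not the column
col = df["count"]                # bracket access always finds the column

big = df[df["count"] > 8000]

print(is_method, type(col).__name__)
print(big)
```

The same applies to any column whose name matches an existing attribute, such as `sum`, `index` or `T`.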

In [None]:
df.plot(kind='box')

In [None]:
ax = df.plot(kind='box')
ax.set_ylim([0,100])

In [None]:
df['count'].describe()

### Transposing

In [None]:
df

In [None]:
df.T

In [None]:
df.T.iloc[:,:5]
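
Transposing swaps the two axes: row labels become column labels and the other way round. A small sketch on a three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Paul", "George", "Ringo"],
                   "Growth": [0.5, 0.7, 0.9]})

t = df.T

print(df.shape, t.shape)
print(t.index.tolist())     # the old column names
print(t.columns.tolist())   # the old row labels
```

Each column of the transposed frame now mixes strings and floats, so transposing frames with mixed column types usually leaves you with `object` columns.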