# Pandas
<ul> 
    <li>A package that provides (1) data structures for tabular data and (2) data manipulation tools.</li>
    <li>Adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without <i>for</i> loops.</li>
    <li>Works with heterogeneous data (different data types). </li>
</ul>
We will learn:<br>
<ol>
    <li>Data Structures: Series, DataFrame, Index </li>
    <li>Essential Functionality</li>
    <li>Summarizing and Computing Descriptive Statistics</li>
</ol>

# Series
<b>#1 Definition</b><br>

A Series is a <u>one-dimensional array-like object</u> containing a <u>sequence of values</u> of the <u>same type</u> (the types are similar to NumPy types) and an associated <u>array of data labels</u>, called its index.<br>

<b>#2 Create a Series</b><br>

<b>#3 As an array</b><br>
+ slicing<br>
+ indexing<br>
+ batch operations: +, -, *, /, >, >=, ...<br>
+ numpy universal functions <br>

<b>#4 As a dictionary</b><br>
+ keys = index labels (immutable data type)<br>
+ adding a new value to a Series is the same as adding a new value to a dictionary
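
The two views listed above (a Series as an array and as a dictionary) can be sketched as follows. This is a minimal illustration; the variable names and values are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Array-like behaviour: element-wise arithmetic, comparisons, NumPy ufuncs
doubled = s * 2            # every value multiplied by 2
filtered = s[s > 15]       # boolean mask keeps the values greater than 15
roots = np.sqrt(s)         # a NumPy universal function keeps the index labels

# Dictionary-like behaviour: index labels act as keys
has_b = 'b' in s           # membership is tested against the labels
s['d'] = 40                # assigning to a new label adds a new value

print(doubled)
print(filtered)
print(s)
```

Note that `'b' in s` checks the labels, not the values, which matches how `in` works on a dictionary.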


In [140]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(data = np.random.rand(5,3), 
                   columns = ['one', 'two', 'three'], 
                   index = ['NY', 'NJ', 'MA', 'VA', 'IA'])

df1['AA'] = ["1a", "2a", "3a", "4a", "5a"]
df1["BB"] = ["1b", "2b", "3b", "4b", "5b"]

#df1.describe()
#df1.info()
#df1.head(3)
#df1.tail(3)
df1.columns

Index(['one', 'two', 'three', 'AA', 'BB'], dtype='object')

In [141]:
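#2 create a Series
a = pd.Series([1, 2, 3, 4, 5, "abc"])   # mixed values => dtype is object

b = pd.Series({'a': 11, 'b': 22, 'c':33})   # dict keys become the index labels
print(b)

b.iloc[0] = 111    # assign by position (b[0] on a labeled Series is deprecated in recent pandas)
b.loc['a'] = 123   # assign by label; this overwrites the same element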
print(b)


a    11
b    22
c    33
dtype: int64
a    123
b     22
c     33
dtype: int64


In [142]:
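#Series: accessing: slicing and indexing (numpy concepts)

b = pd.Series({'a': 11, 'b': 22, 'c':33})
c = b.iloc[0:2]     # positional slice; in the run shown here it behaves like a NumPy view of b
c.iloc[0] = 100     # so this write also shows up in b (with Copy-on-Write enabled, b would stay unchanged)
#print(b)
#print(c)

d = b.iloc[[0, 2]]  # a list of positions (fancy indexing) returns a copy
print(d)

d['a'] = 1000       # changes only the copy d; b is unaffected
print(b)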

a    100
c     33
dtype: int64
a    100
b     22
c     33
dtype: int64
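
Whether a slice shares data with the original Series depends on the pandas version and its Copy-on-Write setting. A small sketch of the version-independent way to get an independent object is to call `copy()` explicitly:

```python
import pandas as pd

b = pd.Series({'a': 11, 'b': 22, 'c': 33})

# An explicit copy never shares data with the original
c = b.iloc[0:2].copy()
c.iloc[0] = 100

print(b)   # b keeps its original values
print(c)   # only c has changed
```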


In [143]:
#Series accessing: slicing and indexing (pandas concepts: loc and iloc)
#loc[] works for label-indexing, iloc[] works for integer-indexing

b = pd.Series({'a': 11, 'b': 22, 'c':33})
c = b.loc['a':'b']  #c is a Series = Series({'a': 11, 'b': 22}); a loc[] slice includes the end label

#c = b.loc[1] raises a KeyError: loc[] looks up labels, and 1 is not one of b's labels ('a', 'b', 'c')

#iloc[] always selects by integer position (0, 1, 2, ...), whatever labels the index holds
print(b.iloc[1])
print(b.iloc[0:1])

#iloc[] does NOT accept labels, so the following statement raises a TypeError
#print(b.iloc['a'])





22
a    11
dtype: int64
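
The difference between `loc[]` and `iloc[]` is easiest to see when the labels are themselves integers. The sketch below uses a made-up Series whose labels are 10, 20, 30:

```python
import pandas as pd

s = pd.Series([11, 22, 33], index=[10, 20, 30])

by_label = s.loc[10]         # the value stored under label 10
by_position = s.iloc[0]      # the value at position 0 (the same element here)

label_slice = s.loc[10:20]   # label slices include the end label: 2 values
pos_slice = s.iloc[0:1]      # position slices exclude the end position: 1 value

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```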


In [144]:
#Series: accessing: indexing (dict concept)
b = pd.Series({'a': 11, 'b': 22, 'c':33})
b['a'] = 111
print(b)


a    111
b     22
c     33
dtype: int64


# DataFrame
<b>#1 Definition </b><br>
A DataFrame represents a <u>rectangular table</u> of data and contains an <u>ordered, named collection of columns</u>, each of which can be a <u>different value type</u> (numeric, string, Boolean, etc.). The DataFrame has both a row and column index.<br>
<b>#2 As a dictionary of Series</b><br>
+ keys = columns<br>
+ values = the columns themselves (each one a Series indexed by the row labels)<br>
+ 1 column = 1 Series<br>
<b>#3 Working with columns</b><br>
+ accessing one or more columns <br>
+ deleting one or more columns <br>
+ adding one column<br>
+ inserting one column<br>
<b>#4 Working with rows</b><br>
+ accessing one or more rows <br>
+ deleting one or more rows <br>
+ adding one or more rows<br>
+ inserting one or more rows<br>
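
The "dictionary of Series" view can be sketched directly by building a DataFrame from a dict whose values are Series. The column names and numbers below are made up purely for illustration:

```python
import pandas as pd

# Two Series sharing the same row labels (values are invented for the example)
population = pd.Series({'MA': 7.0, 'NY': 19.8, 'NJ': 9.3})
area = pd.Series({'MA': 10.6, 'NY': 54.6, 'NJ': 8.7})

# Dict keys become column names; the Series indexes become the row index
df = pd.DataFrame({'population': population, 'area': area})

col = df['population']        # one key => one column => one Series
print(type(col).__name__)
print(list(df.columns))
print(list(df.index))
```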



In [145]:
#create a dataframe
#(disabled alternative; the cells below use the 5-row df1 created in the first code cell)

#df1 = pd.DataFrame(data = np.arange(16).reshape(4,4), 
#                   columns = ['one', 'two', 'three', 'four'], 
#                   index = ['MA', 'NY', 'NJ', 'VA'])
#df1

In [146]:
#accessing using []

#---one column: dictionary concept: a DataFrame is a dictionary of Series (keys = column names)
#---=> returns a Series
df2 = df1['one']
print(df2)


df2 = df1['one'][['MA', 'NY']]
print(df2)

df2 = df1['one']['MA']
print(type(df2))
print(df2)


#---more than one column (column names in a list) => returns a DataFrame
df2 = df1[['one', 'two']]
print(df2)

#---one or more rows (the convenient [] syntax: slicing selects rows) => returns a DataFrame
df2 = df1[:2]
print(df2)

df2 = df1[:2][['one', 'two']]
print(df2)

#---the last row

df2 = df1[-1:]
print("the last row:", df2)

#---you can use boolean indexing too: applied to rows or to all entries
#---when applied to rows, the length of the boolean array-like MUST equal the number of rows
#---when applied to entries, the shape of the boolean 2D array-like MUST equal the shape of the DataFrame

df2 = df1[[True, False, True, False, True]]
print(df2)

#print(df1 > 5)
#df1[df1 > 5] = 0
#print(df1)



NY    0.891418
NJ    0.450991
MA    0.212504
VA    0.266251
IA    0.815886
Name: one, dtype: float64
MA    0.212504
NY    0.891418
Name: one, dtype: float64
<class 'numpy.float64'>
0.21250388562372002
         one       two
NY  0.891418  0.668674
NJ  0.450991  0.515860
MA  0.212504  0.284589
VA  0.266251  0.224719
IA  0.815886  0.006091
         one       two     three  AA  BB
NY  0.891418  0.668674  0.714022  1a  1b
NJ  0.450991  0.515860  0.432764  2a  2b
         one       two
NY  0.891418  0.668674
NJ  0.450991  0.515860
the last row:          one       two     three  AA  BB
IA  0.815886  0.006091  0.928639  5a  5b
         one       two     three  AA  BB
NY  0.891418  0.668674  0.714022  1a  1b
MA  0.212504  0.284589  0.988748  3a  3b
IA  0.815886  0.006091  0.928639  5a  5b
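
Boolean masks are usually computed from the data rather than typed by hand. A minimal sketch with a small made-up numeric DataFrame, showing both the row-wise and the entry-wise form described in the comments above:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 6, 3], 'two': [8, 2, 9]},
                  index=['NY', 'NJ', 'MA'])

# Row-wise: a boolean Series with one entry per row selects rows
rows = df[df['one'] > 2]

# Entry-wise: a boolean DataFrame of the same shape selects individual entries
df2 = df.copy()
df2[df2 > 5] = 0          # every entry greater than 5 is replaced by 0

print(rows)
print(df2)
```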


In [147]:
#accessing by dot column name => returns a Series (just one column)
df2 = df1.one
print(df2)

NY    0.891418
NJ    0.450991
MA    0.212504
VA    0.266251
IA    0.815886
Name: one, dtype: float64


In [148]:
#accessing by loc[]: for label indexing

#---one row => returns a Series 
#---(index = [columns' names], values = all entries in the row 'MA')
df2 = df1.loc['MA']
print(df2)


#---2 rows: returns a df
df2 = df1.loc[['MA', 'VA']]
print(df2)


#---verify label indexing: 1 is not a row label, so this raises a KeyError
#df2 = df1.loc[1]


#---rule: row then column
#---one row => returns a Series
df2 = df1.loc['MA'].loc[['one', 'two']]
print(df2)

df2 = df1.loc['MA'][['one', 'two']]
print(df2)

df2 = df1.loc['MA', ['one', 'two']]
print(df2)

#--- 2 or more rows => returns a df  

df2 = df1.loc[['MA', 'NY']].loc[:, ['one', 'two']]
print(df2)

df2 = df1.loc[['MA', 'NY']][['one', 'two']]
print(df2)

df2 = df1.loc[['MA', 'NY'], ['one', 'two']]
print(df2)

one      0.212504
two      0.284589
three    0.988748
AA             3a
BB             3b
Name: MA, dtype: object
         one       two     three  AA  BB
MA  0.212504  0.284589  0.988748  3a  3b
VA  0.266251  0.224719  0.860182  4a  4b
one    0.212504
two    0.284589
Name: MA, dtype: object
one    0.212504
two    0.284589
Name: MA, dtype: object
one    0.212504
two    0.284589
Name: MA, dtype: object
         one       two
MA  0.212504  0.284589
NY  0.891418  0.668674
         one       two
MA  0.212504  0.284589
NY  0.891418  0.668674
         one       two
MA  0.212504  0.284589
NY  0.891418  0.668674


In [149]:
#learn iloc[] yourself
#---iloc[] mirrors loc[], but selects by integer position instead of by label


In [150]:
#---copy the cell #accessing by loc[]: for label indexing 
#---and change loc => iloc to learn iloc yourself


#accessing by iloc[]: for integer (position) indexing 
#---one row => returns a Series 
#---(index = [column names], values = all entries in the row at position 0, here 'NY')
df2 = df1.iloc[0]
print(df2)


one      0.891418
two      0.668674
three    0.714022
AA             1a
BB             1b
Name: NY, dtype: object


In [151]:
#---column operations
#---adding new column (dictionary concept)
#---syntax: df['new_column_name'] = a Series

df2 = df1.copy()
df2['AABB'] = df1["AA"] + " " + df1["BB"]
df2['one_over_two'] = df1.one / df2.two
df2['CC'] = ["1c", "2c", "3c", "4c", "5c"]
#print(df2)

#---concatenate 2 DataFrames side by side: use concat(axis = 1); join = "inner" keeps only the row labels found in both

df3 = pd.DataFrame(data = np.arange(10).reshape(5,2), 
                   columns = ['five', 'six'], 
                   index = ['MA', 'NY', 'NJ', 'VA', 'OH'])


df4 = pd.concat([df3,df2], axis = 1, join = "inner")
#print(df4)

#---dropping columns: use the drop() method
#---syntax: df.drop(['column1', 'column2', ...], axis = 1) or df.drop(columns = [...])
#---the code below creates df2 from df1 by dropping columns 'three' and 'one'
#---note df1 is still unchanged. If you want df1 itself to change =>
#---supply the parameter inplace = True, e.g., df1.drop(['three', 'one'], axis = 1, inplace = True)
#---and then the drop() method returns None

df2 = df1.drop(columns = ['three', 'one'])
#print(df2)

#---insert a column: use the insert() method
#---syntax: df.insert(location, column_name, values)
df2 = df1.copy()
df2.insert(1, 'inserted_col', ['11', '22', '33', '44', '55'])
df2


         one inserted_col       two     three  AA  BB
NY  0.891418           11  0.668674  0.714022  1a  1b
NJ  0.450991           22  0.515860  0.432764  2a  2b
MA  0.212504           33  0.284589  0.988748  3a  3b
VA  0.266251           44  0.224719  0.860182  4a  4b
IA  0.815886           55  0.006091  0.928639  5a  5b


In [162]:
#---row operations
#---adding new row(s): organize the new row(s) in a DataFrame, then use concat(axis = 0)

df2 = df1.copy()
#---build the row from a nested list: np.asarray() on mixed values would turn the numbers into strings
df3 = pd.DataFrame(data = [[1, 2, 3, 'new_aa', 'new_bb']], 
                   columns = df1.columns,
                   index = ['WA'])


df4 = pd.concat([df2,df3], axis = 0)
#print(df4)


#---insert row(s): no built-in method is provided; split the DataFrame and concat() the pieces yourself
#---drop row(s): use drop(index = [...]) or drop([...], axis = 0)

df2 = df1.drop(index = ['MA', 'NY'])
#---df1: drop all rows with one > 0.5

ind_to_del = df1[(df1.one > 0.5)].index

df2 = df1.drop(index = ind_to_del)
df2



         one       two     three  AA  BB
NJ  0.450991  0.515860  0.432764  2a  2b
MA  0.212504  0.284589  0.988748  3a  3b
VA  0.266251  0.224719  0.860182  4a  4b
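
Since pandas has no dedicated method for inserting a row at a given position, one possible approach is to split the DataFrame and concatenate the pieces. The helper `insert_row` below is a hypothetical name written for this sketch, not a pandas function, and the data is made up:

```python
import pandas as pd

def insert_row(df, position, new_row):
    """Return a new DataFrame with new_row (a one-row DataFrame) placed at position."""
    top = df.iloc[:position]        # rows before the insertion point
    bottom = df.iloc[position:]     # rows from the insertion point onward
    return pd.concat([top, new_row, bottom], axis=0)

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'AA': ['1a', '2a', '3a']},
                  index=['NY', 'NJ', 'MA'])
new_row = pd.DataFrame({'one': [9.0], 'AA': ['9a']}, index=['WA'])

result = insert_row(df, 1, new_row)
print(result)
```

The original `df` is left unchanged because `concat()` always builds a new object.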


# Index

+ an immutable sequence used for <u>indexing</u> and <u>alignment</u>.<br>
+ creating an Index object <br>
+ as an array<br>
+ as a fixed-size set, but it can contain duplicates<br>
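
A short sketch of the properties listed above: array-like slicing, set-like operations, immutability, and duplicates. The values are made up for the example.

```python
import pandas as pd

ind1 = pd.Index([1, 3, 5, 7, 9])
ind2 = pd.Index([5, 7, 11])

part = ind1[1:3]                    # array-like slicing returns another Index
common = ind1.intersection(ind2)    # set-like: labels present in both
combined = ind1.union(ind2)         # set-like: labels present in either

# An Index cannot be modified in place
immutable = False
try:
    ind1[0] = 100
except TypeError:
    immutable = True

# Unlike a true set, an Index may hold duplicate labels
dup = pd.Index(['a', 'a', 'b'])

print(part, common, combined)
print(immutable, dup.is_unique)
```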


In [153]:
#--- create an index object

ind = pd.Index([1, 3, 5, 7, 9])
print(ind)

#---Series and DataFrames have Index objects associated with them. These Index objects are
#---used by join-like functions (concat(), join()).
#---we will see them in a few chapters

Int64Index([1, 3, 5, 7, 9], dtype='int64')


In [154]:
#Essential Functionality
#---reindex(): 
#------works for both Series and DataFrame
#------=> a way to select and reorder rows/columns/both; labels that do not exist are filled with NaN



df2 = df1.reindex(['MA', 'NY', 'IA'])
df2

#---'five' is not a column of df1, so reindex() creates it filled with NaN
#df2 = df1.reindex(columns = ['one', 'two', 'five'])
#df2

#df2 = df1.reindex(columns = ['one', 'two', 'five'], index = ['MA', 'NY', 'IA'])
#df2



         one       two     three  AA  BB
MA  0.212504  0.284589  0.988748  3a  3b
NY  0.891418  0.668674  0.714022  1a  1b
IA  0.815886  0.006091  0.928639  5a  5b
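
`reindex()` differs from plain selection in how it treats labels that do not exist. A minimal sketch on a made-up DataFrame (the label 'TX' and the column 'five' are deliberately absent):

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['NY', 'NJ', 'MA'])

r1 = df.reindex(['MA', 'NY', 'TX'])           # 'TX' is missing => a row of NaN
r2 = df.reindex(columns=['one', 'five'])      # 'five' is missing => a column of NaN
r3 = df.reindex(['MA', 'TX'], fill_value=0)   # fill_value replaces the NaN default

print(r1)
print(r2)
print(r3)
```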


In [155]:
#---apply(): applies a function that takes a one-dimensional array (a Series) to each column or row.

def f1(x):
    return x.max() - x.min()

def f2(x):
    return (pd.Series([x.max(), x.min()], index = ['max', 'min']))


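#---Series.apply() calls the function on each element
df2 = df1["AA"].apply(lambda e: e.upper())
df2

#---DataFrame.apply(axis = 0) calls the function on each column
df2 = df1[["one", "two"]].apply(f2, axis = 0)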
df2


          one       two
max  0.891418  0.668674
min  0.212504  0.006091
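
The `axis` argument decides whether the function receives each column or each row. A small sketch with made-up numbers, reusing the max-minus-min idea from `f1` under the hypothetical name `value_range`:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 5.0, 3.0], 'two': [4.0, 2.0, 9.0]},
                  index=['NY', 'NJ', 'MA'])

def value_range(x):
    # x is a Series: one column (axis=0) or one row (axis=1)
    return x.max() - x.min()

per_column = df.apply(value_range, axis=0)   # one result per column
per_row = df.apply(value_range, axis=1)      # one result per row

print(per_column)
print(per_row)
```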


In [163]:
#---sorting: sort_index() and sort_values()

df2 = df1.sort_index()
df2
df3 = df2.sort_values(['one', 'two'])
df3


         one       two     three  AA  BB
MA  0.212504  0.284589  0.988748  3a  3b
VA  0.266251  0.224719  0.860182  4a  4b
NJ  0.450991  0.515860  0.432764  2a  2b
IA  0.815886  0.006091  0.928639  5a  5b
NY  0.891418  0.668674  0.714022  1a  1b
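
Both sorting methods accept an `ascending` argument, and `sort_values()` can take one direction per sort key. A minimal sketch on made-up data where the first key contains a tie:

```python
import pandas as pd

df = pd.DataFrame({'one': [2, 1, 2], 'two': [5, 7, 3]},
                  index=['NY', 'MA', 'NJ'])

by_label = df.sort_index()     # rows ordered by their labels

# 'one' descending first; ties in 'one' are broken by 'two' ascending
by_value = df.sort_values(['one', 'two'], ascending=[False, True])

print(by_label)
print(by_value)
```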


In [166]:
#---arithmetic and data alignment: (1) Series with Series and DataFrame with DataFrame
#---operate element-wise on matching labels (numpy-style vectorization); (2) a Series (or list)
#---combined with a DataFrame is broadcast across the rows (numpy-style broadcasting)

df2 = df1.copy()
df3 = df1.copy()
df4 = df2 + df3
#df4

s = [1, 2, 3, "aa", "bb"]   # one value per column; it is applied to every row
df5 = df1 + s
df5

         one       two     three    AA    BB
NY  1.891418  2.668674  3.714022  1aaa  1bbb
NJ  1.450991  2.515860  3.432764  2aaa  2bbb
MA  1.212504  2.284589  3.988748  3aaa  3bbb
VA  1.266251  2.224719  3.860182  4aaa  4bbb
IA  1.815886  2.006091  3.928639  5aaa  5bbb
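
The example above adds objects whose labels already match, so alignment is invisible. The sketch below uses made-up Series with only partly overlapping labels to show what alignment does, and a Series broadcast across the rows of a DataFrame:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2                      # labels found on one side only give NaN
filled = s1.add(s2, fill_value=0)    # treat a missing side as 0 instead

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['NY', 'NJ'])
row = pd.Series({'one': 10, 'two': 100})
shifted = df + row                   # row labels match the columns; applied to every row

print(total)
print(filled)
print(shifted)
```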


In [158]:
#---Summarizing and Computing Descriptive Statistics
#------count(), min(), max(), sum(), mean(), ...
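
A minimal sketch of the methods named above, on a made-up DataFrame with one missing value. By default these methods skip NaN and work column by column; `axis = 1` switches to row by row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, np.nan], 'two': [4.0, 5.0, 6.0]},
                  index=['NY', 'NJ', 'MA'])

counts = df.count()          # number of non-NaN values per column
col_means = df.mean()        # column means; NaN is skipped
row_sums = df.sum(axis=1)    # row sums; NaN is skipped
summary = df.describe()      # count, mean, std, min, quartiles, max per numeric column

print(counts)
print(col_means)
print(row_sums)
print(summary)
```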