# Python S6 Pandas

## My Course Notes and Code

These are my notes from the Udemy course available at: 
https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/

I'm focusing on section 6 of the course, which deals with **Pandas**.

#### S6 Overview

- Series
- DataFrames
- Missing Data
- GroupBy
- Merging, Joining, Concatenating
- Pandas Operations
- Data Input and Output

#### S6V25 Intro to Pandas

- Pandas is built on top of **NumPy**
- It enables fast analysis, data cleaning, and data preparation
- Python's version of **Excel**, or R **Data Frames**
- Has built-in visualization features
- Can work with data from a variety of formats

#### S6V26 Series

- It's the data type we'll use to build DataFrames in the next lecture
- Similar to *NumPy arrays*   
    - However, Series have (labelled) indexes, and...
    - ... can hold a variety of object types
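
A small sketch of my own (not from the course) to make the two differences above concrete. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series(arr, index=['a', 'b', 'c'])

# A NumPy array is addressed by position only
first_from_array = arr[0]

# A Series can be addressed by its label
first_from_series = ser['a']

# A Series can also hold mixed Python objects; the dtype then becomes `object`
mixed = pd.Series([1, 'two', 3.0])

print(first_from_array, first_from_series, mixed.dtype)
```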

##### Setting various Python objects to Pandas Series

In [98]:
import numpy as np
import pandas as pd

In [99]:
labels = ['a', 'b', 'c']            # list
my_data = [10, 20, 30]              # list
arr = np.array(my_data)             # NumPy array
d = {'a' : 10, 'b' : 20, 'c' : 30}  # dictionary

In [100]:
pd.Series(data = my_data)           # no index given -> default integer index (0, 1, 2)

0    10
1    20
2    30
dtype: int64

In [101]:
pd.Series(data = my_data, index = labels)

a    10
b    20
c    30
dtype: int64

In [102]:
pd.Series(my_data, labels) # keyword names can be omitted: data first, then index, is the default order

a    10
b    20
c    30
dtype: int64

In [103]:
pd.Series(arr)     # Works the same as with lists; the dtype follows the array (int32 here is platform-dependent)

0    10
1    20
2    30
dtype: int32

In [104]:
pd.Series(arr, labels) 

a    10
b    20
c    30
dtype: int32

In [105]:
pd.Series(d) # Keys are automatically set as indexes, and values as data points

a    10
b    20
c    30
dtype: int64
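
Not covered in the lecture, but as far as I can tell, passing both a dictionary and an `index` makes Pandas look up each label in the dictionary. Labels that are not keys end up as NaN. A minimal sketch, where the label `'z'` is made up for the example:

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Only the requested labels are kept; 'z' is not a key in d, so it becomes NaN
picked = pd.Series(d, index=['a', 'c', 'z'])

print(picked)
```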

##### Pandas Series can store a wide variety of data types...

- ... "pretty much any type of Python object" 
- Super-flexible

In [106]:
pd.Series(labels)           # e.g., strings

0    a
1    b
2    c
dtype: object

In [107]:
pd.Series(data = [sum, print, len])  # functions

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

##### "Grabbing" information from a Series

In [108]:
ser1 = pd.Series([1,2,3,4], ['USA', 'Germany', 'USSR', 'Japan'])
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [109]:
ser2 = pd.Series([1,2,5,4], ['USA', 'Germany', 'Italy', 'Japan'])
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [110]:
ser1['USA'] # works similarly to Python dictionaries: type the label, get the data

1

In [111]:
ser3 = pd.Series(labels)
ser3

0    a
1    b
2    c
dtype: object

In [112]:
ser3[0]

'a'

##### Basic operations with Series

In [113]:
ser1 + ser2 # Pandas tries to perform the operation based on the matching index

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64

- 'Italy' and 'USSR' appear in only one of the two Series, so they couldn't be matched -> **NaN**
- The result is *float64* even though both inputs hold *integers*: NaN is a floating-point value, so Pandas upcasts the whole result to make room for it
    - This way no values are lost or silently replaced
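
To check the points above, here is a sketch of my own using the same two Series. It also uses `Series.add` with `fill_value`, which was not shown in the lecture, to treat a label missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], ['USA', 'Germany', 'Italy', 'Japan'])

# Indexes match completely: no NaN is needed, so the integer dtype is kept
same_index_sum = ser1 + ser1

# Indexes match only partly: unmatched labels become NaN and the dtype turns to float
partial_sum = ser1 + ser2

# A label missing on one side is treated as 0 instead of producing NaN
filled_sum = ser1.add(ser2, fill_value=0)

print(same_index_sum.dtype, partial_sum.dtype)
print(filled_sum)
```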

#### S6V27 - DataFrames pt 1

DataFrames will be our main tool in working with Pandas.

In [114]:
# import numpy as np
# import pandas as pd
from numpy.random import randn
np.random.seed(101)

In [115]:
df = pd.DataFrame(
    data = randn(5, 4), 
    index = ['A', 'B', 'C', 'D', 'E'],
    columns = ['W', 'X', 'Y', 'Z'])

df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Each of the columns W-Z is actually a Pandas Series. They all share the same index.

A DataFrame is a group of Series sharing the same index :)
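
A sketch of my own that takes this idea literally: building a small DataFrame out of two Series that share an index. The values and the name `small_df` are made up for the example.

```python
import pandas as pd

index = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([4.0, 5.0, 6.0], index=index)

# Dictionary keys become column names; the shared index becomes the row labels
small_df = pd.DataFrame({'W': w, 'X': x})

# Each column comes back out as a Series carrying that same index
col = small_df['W']

print(small_df)
```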

##### Indexing and Selection (selecting columns)

In [116]:
df['W']         # First way to grab a column

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [117]:
type(df['W']) # it really is just a Series :)

pandas.core.series.Series

In [118]:
type(df)

pandas.core.frame.DataFrame

In [119]:
df.W # 2nd way to grab a column (attribute access, SQL-like). Looks like accessing an attribute or calling a method off of df

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
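
One reason I prefer the bracket notation: the attribute style breaks when a column name clashes with an existing DataFrame method or contains a space. A sketch with made-up column names (`count` and `my col` are my own picks for the demonstration):

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'count': [10, 20], 'my col': [5, 6]})

# Bracket notation always returns the column
count_column = demo['count']

# Attribute access finds the DataFrame.count method first, not the column
count_attribute = demo.count
is_method = callable(count_attribute)

# Names with spaces can only be reached with brackets
spaced_column = demo['my col']

print(type(count_column).__name__, is_method)
```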

In [120]:
df[['Y', 'Z']] # grabbing multiple columns # returns DataFrame

Unnamed: 0,Y,Z
A,0.907969,0.503826
B,-0.848077,0.605965
C,0.528813,-0.589001
D,-0.933237,0.955057
E,2.605967,0.683509


##### Creating new columns

In [121]:
df['new'] = randn(5)
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,0.302665
B,0.651118,-0.319318,-0.848077,0.605965,1.693723
C,-2.018168,0.740122,0.528813,-0.589001,-1.706086
D,0.188695,-0.758872,-0.933237,0.955057,-1.159119
E,0.190794,1.978757,2.605967,0.683509,-0.134841


In [122]:
df['new2'] = df['W'] + df['X']
df

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,0.302665,3.334983
B,0.651118,-0.319318,-0.848077,0.605965,1.693723,0.3318
C,-2.018168,0.740122,0.528813,-0.589001,-1.706086,-1.278046
D,0.188695,-0.758872,-0.933237,0.955057,-1.159119,-0.570177
E,0.190794,1.978757,2.605967,0.683509,-0.134841,2.169552


##### Dropping columns and rows

In [123]:
df.drop('new2', axis = 'columns') # or `axis = 1` # This does not happen in place

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,0.302665
B,0.651118,-0.319318,-0.848077,0.605965,1.693723
C,-2.018168,0.740122,0.528813,-0.589001,-1.706086
D,0.188695,-0.758872,-0.933237,0.955057,-1.159119
E,0.190794,1.978757,2.605967,0.683509,-0.134841


In [124]:
df

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,0.302665,3.334983
B,0.651118,-0.319318,-0.848077,0.605965,1.693723,0.3318
C,-2.018168,0.740122,0.528813,-0.589001,-1.706086,-1.278046
D,0.188695,-0.758872,-0.933237,0.955057,-1.159119,-0.570177
E,0.190794,1.978757,2.605967,0.683509,-0.134841,2.169552


In [125]:
df.drop(['new', 'new2'], axis = 'columns', inplace = True) # `inplace = True` must be specified...
df                                # ... for many Pandas methods to modify the object itself.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
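
An alternative to `inplace = True`, not used in the lecture, is to assign the returned DataFrame to a name. A minimal sketch on a made-up frame; the `columns=` keyword of `drop` is an equivalent to `axis = 'columns'` in reasonably recent Pandas versions.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]})

# drop returns a new DataFrame; the original is left untouched
trimmed = demo.drop('new', axis='columns')

# Same result using the `columns=` keyword instead of `axis`
trimmed_kw = demo.drop(columns='new')

print(list(demo.columns), list(trimmed.columns))
```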


In [126]:
df.drop('E', axis = 0) # `axis = 0` is set by default # Not in place

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [127]:
df.shape # tuple # rows (0) and columns (1)

(5, 4)

##### Selecting rows

In [128]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [129]:
df.loc['A'] # Not only columns, but rows as well are Series # loc[RowName]

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [130]:
df.iloc[2] # same row as `df.loc['C']` # iloc[RowIndexPosition]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
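
The difference between `loc` and `iloc` is easiest to see when the labels are integers that do not match the positions. A sketch of my own with a deliberately shuffled index:

```python
import pandas as pd

demo = pd.DataFrame({'val': ['x', 'y', 'z']}, index=[2, 0, 1])

# loc looks for the row whose LABEL is 0 (the second row here)
by_label = demo.loc[0, 'val']

# iloc takes the row at POSITION 0 (the first row, labelled 2)
by_position = demo.iloc[0, 0]

print(by_label, by_position)
```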

##### Selecting subsets of rows and columns

In [131]:
df.loc['B', 'Y'] # row, col

-0.8480769834036315

In [132]:
df.loc['B', 'W':'Y']

W    0.651118
X   -0.319318
Y   -0.848077
Name: B, dtype: float64

In [133]:
df.loc[['A', 'B'], 'X':]

Unnamed: 0,X,Y,Z
A,0.628133,0.907969,0.503826
B,-0.319318,-0.848077,0.605965


In [134]:
df.loc[['A', 'E'], ['W', 'Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
E,0.190794,0.683509
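
Something worth noting about the slices above: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position, as in plain Python. A sketch of my own on a made-up frame of integers:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=list('ABCDE'),
    columns=list('WXYZ'))

# Label slices include both ends: rows A-B, columns W-Y
by_label = demo.loc['A':'B', 'W':'Y']

# Position slices exclude the end: rows 0-1, columns 0-2
by_position = demo.iloc[0:2, 0:3]

print(by_label.equals(by_position))
```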


#### S6V28 - DataFrames pt 2

##### Conditional Selection

In [135]:
# import numpy as np
# import pandas as pd
# from numpy.random import randn
# np.random.seed(101)
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [136]:
df > 0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [137]:
df[df > 0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [138]:
df['W'] > 0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [140]:
df[df['W'] > 0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [141]:
df[df['Z'] < 0]

Unnamed: 0,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


Conditional selection of this sort returns a DataFrame, so the result can be indexed further (or have methods called on it) just like any other DataFrame.

In [142]:
result = df[df['Z'] < 0] # In two steps
result['X']

C    0.740122
Name: X, dtype: float64

In [144]:
df[df['W'] > 0]['X']    # In one step

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [145]:
df[df['W'] > 0][['Y', 'X']]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


In [139]:
df['W'][df['W'] > 0]

A    2.706850
B    0.651118
D    0.188695
E    0.190794
Name: W, dtype: float64
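
The one-step versions above chain two sets of brackets. As far as I know, `loc` can do the same in a single call by taking the Boolean Series as the row selector and the column name(s) as the second argument. A sketch of my own on a small made-up frame:

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [2.7, -2.0, 0.2], 'X': [0.6, 0.7, -0.8]},
    index=['A', 'C', 'D'])

# Chained: filter the rows first, then pick the column
chained = demo[demo['W'] > 0]['X']

# Single call: Boolean mask for the rows, label for the column
single = demo.loc[demo['W'] > 0, 'X']

print(single)
```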

##### Multiple Conditions for Selection

In [146]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


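Python's normal `and` operator won't work on Series: it tries to reduce each whole Series to a single `True`/`False`, which is ambiguous and raises a `ValueError`. Instead, we use `&`, which compares element by element. Each condition needs its own parentheses, because `&` binds more tightly than `>` or `<`.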

In [150]:
df[(df['W'] > 0) & (df['X'] < 1)] # and

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057


The same goes for `or`: use `|` instead.

In [151]:
df[(df['W'] > 0) | (df['X'] < 1)] # or

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
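
A sketch of my own showing what actually happens with `and`, next to the working `&` version. It also uses `~` for negation, which was not in the lecture. The frame is made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, -1.0, 2.0], 'X': [0.5, 0.5, 3.0]})

# `and` needs a single True/False from each Series, which is ambiguous
try:
    demo[(demo['W'] > 0) and (demo['X'] < 1)]
    and_failed = False
except ValueError:
    and_failed = True

# `&` combines the two Boolean Series element by element
both = demo[(demo['W'] > 0) & (demo['X'] < 1)]

# `~` flips every Boolean value: the rows where W is NOT positive
not_positive = demo[~(demo['W'] > 0)]

print(and_failed, list(both.index), list(not_positive.index))
```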


##### Resetting the index/Setting it to something else

To be continued :)