# Common Operations
> NumPy functions work on **Pandas** objects.    
> Statistical operations are like those in Excel.  
> Values left missing after index alignment can be filled with a fixed number:
```python
A.add(B, fill_value=0)
```
> Or, by mean value:
```python
A.stack().mean()
```

In [2]:
import pandas as pd
import numpy as np

In [5]:
rng  = np.random.RandomState(42)
ser0 = pd.Series(rng.randint(0, 10, 4))
ser0

0    6
1    3
2    7
3    4
dtype: int32

In [6]:
np.random.seed(42)
ser = pd.Series(np.random.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [7]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


In [8]:
df*3

Unnamed: 0,A,B,C,D
0,18,27,6,18
1,21,12,9,21
2,21,6,15,12
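
> The note that NumPy functions work on Pandas objects can be illustrated with a short sketch. The values and the names `s`, `frame`, `squared` and `roots` are made up for this example and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative Series (values chosen for this sketch)
s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

# A NumPy ufunc applied to a Series returns a Series with the same index
squared = np.square(s)
print(squared)

# The same holds for a DataFrame: row and column labels are preserved
frame = pd.DataFrame({'A': [1, 4], 'B': [9, 16]})
roots = np.sqrt(frame)
print(roots)
```

> The result keeps the labels of its input, which is what allows later operations to align on the index.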


## Pandas Arithmetic Operations
![](fig31.jpg)

In [11]:
df.stack()

0  A    6
   B    9
   C    2
   D    6
1  A    7
   B    4
   C    3
   D    7
2  A    7
   B    2
   C    5
   D    4
dtype: int32

In [15]:
df.stack().mean()

5.166666666666667

In [10]:
print(df.mean())
print('---------------')
print(df.mean(axis=1))

A    6.666667
B    5.000000
C    3.333333
D    5.666667
dtype: float64
---------------
0    5.75
1    5.25
2    4.50
dtype: float64
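
> A minimal sketch that puts the three kinds of mean side by side. It rebuilds the 3x4 table shown above under the assumed name `df_demo`, so it runs without the random generator.

```python
import pandas as pd

# Same values as the DataFrame displayed above
df_demo = pd.DataFrame([[6, 9, 2, 6],
                        [7, 4, 3, 7],
                        [7, 2, 5, 4]], columns=list('ABCD'))

col_means = df_demo.mean()             # axis=0 (default): one value per column
row_means = df_demo.mean(axis=1)       # axis=1: one value per row
overall_a = df_demo.stack().mean()     # flatten to a Series, then average
overall_b = df_demo.to_numpy().mean()  # average the underlying NumPy array

print(col_means, row_means, overall_a, overall_b, sep='\n')
```

> The two overall means agree here because there are no missing values. With NaN present, the pandas mean skips them by default, while the plain NumPy array mean returns NaN.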


## Activity 1

> 1. Generate a matrix using:  
```python
x = np.random.randint(0,10,size=(3,2))
```
> 2. Convert it into a DataFrame, naming the columns A and B.  
> 3. Calculate the mean of each row. *Hint: use ```axis=1```*.    
> 4. Calculate the overall mean of **x**. 

In [3]:
x = np.random.randint(0,10,size=(3,2))
df = pd.DataFrame(x,columns=list('AB'))
df

Unnamed: 0,A,B
0,4,2
1,6,4
2,2,7


In [4]:
print(df.mean(axis=1))
print(df.stack().mean())

0    3.0
1    5.0
2    4.5
dtype: float64
4.166666666666667


# Handling Missing Values
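> Missing data is commonly encoded as a sentinel such as ***999***, or as ***NaN***.  
> Arithmetic between two objects generates **NaN** for labels that appear in only one of them.  
> Missing values can be filled with a particular number (e.g. 0) using the ```fill_value``` argument, as in **A.add(B, fill_value=0)**.  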
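
> When missing data is coded as a sentinel such as 999, it usually has to be converted to NaN before pandas will treat it as missing. The readings below are hypothetical and the names `raw` and `clean` are chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where 999 marks a missing measurement
raw = pd.Series([12.5, 999, 14.0, 999, 13.5])

# Replace the sentinel with NaN so that pandas recognises it as missing
clean = raw.replace(999, np.nan)

print(raw.mean())    # the sentinel values distort the mean
print(clean.mean())  # NaN entries are skipped by default
```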

In [16]:
# Operations align on index labels, so the order of data entry does not matter

area = pd.Series({'Alaska': 1723337, 
                  'Texas': 695662,
                  'California': 423967}, 
                 name='area')

population = pd.Series({'California': 38332521, 
                        'Texas': 26448193,
                        'New York': 19651127}, 
                       name='population')

In [17]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [16]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Union of the two indexes; newer pandas versions no longer treat `|` as a set union
A.index.union(B.index)

Int64Index([0, 1, 2, 3], dtype='int64')

In [17]:
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [18]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
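
> The same ```fill_value``` idea applies to the other arithmetic methods. A small sketch reusing the two Series above; the choice of 1 as the neutral value for multiplication is an assumption made for this example.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Labels missing from one side are replaced by fill_value before the operation
diff = A.sub(B, fill_value=0)   # missing entries count as 0
prod = A.mul(B, fill_value=1)   # missing entries count as 1

print(diff)
print(prod)
```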

In [19]:
# list('AB') gives ['A', 'B'], the same as 'A B'.split()

x1 = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
x1

Unnamed: 0,A,B
0,1,11
1,5,1


In [20]:
x2 = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
x2

Unnamed: 0,B,A,C
0,4,0,9
1,5,8,0
2,9,2,6


In [31]:
print('Sum of 2 pd:\n')
x1.add(x2)

Sum of 2 pd:



Unnamed: 0,A,B,C
0,1.0,15.0,
1,13.0,6.0,
2,,,


In [26]:
print('Stack all elements in x1: \n',x1.stack())

print('----------------')

print('Mean of these all elements in x1: \n',x1.stack().mean())

Stack all elements in x1: 
 0  A     1
   B    11
1  A     5
   B     1
dtype: int32
----------------
Mean of these all elements in x1: 
 4.5


In [27]:
fill = x1.stack().mean()
x1.add(x2, fill_value=fill)

Unnamed: 0,A,B,C
0,1.0,15.0,13.5
1,13.0,6.0,4.5
2,6.5,13.5,10.5


In [32]:
A  = rng.randint(10, size=(3, 4))
df = pd.DataFrame(A, columns=list('QRST'))
df

Unnamed: 0,Q,R,S,T
0,3,8,2,4
1,2,6,4,8
2,6,1,3,8


In [18]:
# Arithmetic Operations
print('Difference between df and 1st row: \n', df - df.iloc[0])

print('----------------')

print('Difference between df and R column: \n', df.subtract(df['R'], axis=0))

Difference between df and 1st row: 
    Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4
Difference between df and R column: 
    Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7


In [19]:
halfrow = df.iloc[0, ::2]
halfrow

Q    3
S    2
Name: 0, dtype: int32

In [20]:
df - halfrow

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,-1.0,,2.0,
2,3.0,,1.0,
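
> Columns R and T become NaN because `halfrow` has no entries for them. One way to keep those columns unchanged, assuming the missing labels should count as 0, is to reindex the row first. The names `df_q`, `padded` and `kept` are chosen for this sketch.

```python
import pandas as pd

# Same values as the Q/R/S/T table above
df_q = pd.DataFrame([[3, 8, 2, 4],
                     [2, 6, 4, 8],
                     [6, 1, 3, 8]], columns=list('QRST'))
halfrow = df_q.iloc[0, ::2]      # Series with labels Q and S only

# Plain subtraction: columns absent from halfrow turn into NaN
aligned = df_q - halfrow

# Assumption: treat the absent labels as 0 so R and T stay as they were
padded = halfrow.reindex(df_q.columns, fill_value=0)
kept = df_q - padded

print(aligned)
print(kept)
```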


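## Missing Values
> Forms of missing values: **np.nan**, **None**  
> Detecting null values  

### Missing values in NumPy
> An array containing **None** has dtype *object*, so aggregations such as ```sum()``` raise a TypeError.  
> **np.nan** propagates through ordinary aggregations; the NaN-aware versions skip it: **np.nansum**, **np.nanmin**, **np.nanmax**.   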

In [37]:
x1 = np.array([1, None, 3, 4])
print(x1[1])
x1.sum()

None


TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [18]:
x2 = np.array([1, np.nan, 3, 4])
print(1 + np.nan)
print(0*np.nan)
print(x2.sum())
print(x2.max())
print(x2.min())
print('----------------')
print(np.nansum(x2))
print(np.nanmax(x2))
print(np.nanmin(x2))

nan
nan
nan
nan
nan
----------------
8.0
4.0
1.0
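
> The difference between the two arrays comes down to their dtype. A short sketch, with variable names chosen for this example, that checks the dtypes and counts the missing entries with **np.isnan**:

```python
import numpy as np

obj_arr = np.array([1, None, 3, 4])     # None forces dtype=object
flt_arr = np.array([1, np.nan, 3, 4])   # np.nan keeps a float dtype

print(obj_arr.dtype, flt_arr.dtype)

# Summing the object array fails because int + None is undefined
try:
    obj_arr.sum()
    failed = False
except TypeError:
    failed = True

# On the float array, NaN can be detected and skipped
mask = np.isnan(flt_arr)
n_missing = int(mask.sum())
mean_valid = np.nanmean(flt_arr)
print(failed, n_missing, mean_valid)
```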




### Missing Values in Pandas
> When numeric data contains missing values, the **dtype** becomes *float64*.  
> **NaN** and **None** are treated interchangeably as missing.  
> **df.isnull()** returns a Boolean mask (True where a value is missing).  

In [42]:
# Note the dtype in pd.Series is float64

x1 = pd.Series([1, np.nan, 2, None]) 
print(x1[3])  # None has been converted to NaN
print(x1.dtype)

nan
float64


In [46]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool
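
> Because the mask is Boolean, it can be summed to count missing values. A minimal sketch on a small DataFrame; the names `frame`, `per_column` and `rows_with_gaps` are chosen for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]], columns=list('ABC'))

mask = frame.isnull()                         # True where a value is missing
per_column = mask.sum()                       # missing count per column
total = int(mask.sum().sum())                 # missing count overall
rows_with_gaps = frame[mask.any(axis=1)]      # rows with at least one gap

print(per_column)
print(total)
print(rows_with_gaps)
```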

### Dropping NA values
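> Ways to drop NaN and None in Pandas include:  
> 1. Boolean indexing with **df.notnull()**  
> 2. **df.dropna()**, with options such as ```axis='columns'``` (same as ```axis=1```), ```how='all'``` and ```thresh=3```. Recent pandas versions do not accept ```how``` and ```thresh``` in the same call.  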

In [47]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [48]:
data.dropna()

0        1
2    hello
dtype: object

In [20]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [70]:
# Any row (observation) containing a null value is dropped

df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [71]:
df.dropna(axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


In [72]:
df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


In [21]:
a   = np.full((1,3),np.nan)
new = pd.DataFrame(a ,columns=[0,1,2])
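# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# The result is not assigned, so df itself stays unchanged.
pd.concat([df, new], ignore_index=True)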

Unnamed: 0,0,1,2
0,1.0,,2.0
1,2.0,3.0,5.0
2,,4.0,6.0
3,,,


In [22]:
df.dropna(axis=0, how='all')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [23]:
# thresh = min. number of non-null values 

df.dropna(axis='rows', thresh=3)

Unnamed: 0,0,1,2
1,2.0,3.0,5
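
> To see ```how='all'``` actually remove something, the frame needs a row that is entirely NaN. The sketch below builds such a frame directly under the assumed name `extended` and compares the options.

```python
import numpy as np
import pandas as pd

# The 3x3 example plus one row made up only of NaN
extended = pd.DataFrame([[1, np.nan, 2],
                         [2, 3, 5],
                         [np.nan, 4, 6],
                         [np.nan, np.nan, np.nan]])

drop_any = extended.dropna(how='any')     # drop rows with at least one NaN
drop_all = extended.dropna(how='all')     # drop rows where every value is NaN
at_least_two = extended.dropna(thresh=2)  # keep rows with 2 or more non-null values

print(drop_any)
print(drop_all)
print(at_least_two)
```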


### Filling in Values at NA
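> Fill with a constant: **df.fillna(value)**  
> Forward fill along a particular axis: **df.ffill(axis=1)** (older spelling: **df.fillna(method='ffill', axis=1)**)  
> Backward fill along a particular axis: **df.bfill(axis=0)** (older spelling: **df.fillna(method='bfill', axis=0)**)  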

In [77]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [78]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [79]:
# Forward-fill (equivalent to the older data.fillna(method='ffill'))
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [80]:
# Back-fill (equivalent to the older data.fillna(method='bfill'))
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [81]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [82]:
df.ffill(axis=1)  # same as the older df.fillna(method='ffill', axis=1)

Unnamed: 0,0,1,2
0,1.0,1.0,2.0
1,2.0,3.0,5.0
2,,4.0,6.0
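
> Forward fill leaves a gap when a row starts with NaN, as in row 2 above. Another common choice, sketched here as an illustration and not taken from the notebook, is to fill each column with its own mean by passing a Series to **fillna**.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]], columns=list('ABC'))

# frame.mean() is a Series indexed by column name; fillna matches on those labels
filled = frame.fillna(frame.mean())
print(filled)
```

> Each NaN is replaced by the mean of the non-missing values in the same column, so no gaps remain as long as every column has at least one value.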


## Activity 2
> 1. Generate ```df``` using the following code, and print it out.   
```python
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]],columns=list('ABC'))
```
> 2. Create a new column **D** filled with **np.nan** values.  
> 3. Fill the NA values in column **D** by *forward fill* along each row (```axis=1```), so that each cell takes the value from the column to its left.   

In [5]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]],columns=list('ABC'))
df

Unnamed: 0,A,B,C
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [6]:
df['D']=[np.nan]*3
df

Unnamed: 0,A,B,C,D
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [7]:
df.ffill(axis=1)  # same as the older df.fillna(method='ffill', axis=1)

Unnamed: 0,A,B,C,D
0,1.0,1.0,2.0,2.0
1,2.0,3.0,5.0,5.0
2,,4.0,6.0,6.0
