# Lecture 5-2: NumPy, Pandas and Matplotlib

* [1 NumPy](#1)
* [2 Pandas](#2)
    * [2.1 Pandas Series](#2.1)
    * [2.2 Pandas DataFrame](#2.2)
    * [2.3 Indexing and Slicing](#2.3)
    * [2.4 Manipulation](#2.4)
    * [2.5 Data Cleaning](#2.5)
* [3 Matplotlib](#3)


## 1 NumPy <a class="anchor" id="1"></a>

Check out Lecture 5-1!

## 2 Pandas <a class="anchor" id="2"></a>

https://pandas.pydata.org/docs/index.html

In [None]:
import pandas
pandas.__version__

### 2.1 Pandas Series <a class="anchor" id="2.1"></a>

A Pandas Series is like a column in a table.

It is a **one-dimensional** array holding data of **any type**.

In [None]:
import numpy as np
import pandas as pd

In [None]:
a = [1, 7, 2]
ser = pd.Series(a)
print(ser)
print(type(ser))

In [None]:
print(ser.index, type(ser.index))
print(ser.values,type(ser.values))

In [None]:
print(ser.index)
ser.index = ['a','b','b']
print(ser)
print(ser.index)

In [None]:
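print(ser.iloc[2])  # positional access; plain ser[2] on a string index is deprecated
print(ser['b'])     # the label 'b' is duplicated, so this returns a Series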

In [None]:
print(ser.ndim, ser.shape, ser.size)
print(ser.dtype)

In [None]:
a = [1, 'seven', 2]
ser = pd.Series(a)
print(ser)
# compare with numpy.ndarray
print()
b = np.array(a)
print(b)
print(b.dtype) # U21: 21-character unicode string
# a special case for class str
print()
c = pd.Series(['one','two','three'])
print(c)

In [None]:
print(ser.name)

In [None]:
ser.name = 'my_series'

In [None]:
print(ser)

When creating a Series, in addition to using
- `list`

we can also use

- `numpy.ndarray`
- `dictionary`

as the input data.

In [None]:
arr = np.linspace(0,10,4)
ser = pd.Series(arr, index = ['alice','ben','cindy','david'], name = 'my_ser')
print('Array:')
print(arr)
print('Series:')
print(ser)

In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}
ser = pd.Series(calories)
print(ser)

In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}
ser_last2days = pd.Series(calories, index = ["day2","day3"])
print(ser_last2days)

In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}
ser_new = pd.Series(calories, index = ["day2","day3","day4"])
print(ser_new)

In [None]:
x = ser_new['day4']
print(x, type(x))

In [None]:
y = ser_new.isnull()
# equivalently
y = pd.isnull(ser_new)
print(y)

In [None]:
y = ser_new.notnull()
# equivalently
y = pd.notnull(ser_new)
print(y)

In [None]:
help(pd.Series)

#### Series vs DataFrame

Data sets in Pandas are usually stored as **two-dimensional tables**, called DataFrames.

A Series is like a single column; a DataFrame is the whole table. Thus every column of a DataFrame is a Series, and we can also create a DataFrame from Series.
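As a minimal sketch of this relationship (the names `height`, `weight` and `table` are made up for illustration), the cell below builds a DataFrame from two Series and then pulls one column back out as a Series:

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
height = pd.Series([180, 167, 140], index=["jack", "alice", "zoe"], name="height")
weight = pd.Series([75.0, 58.5, 42.0], index=["jack", "alice", "zoe"], name="weight")

# Build a DataFrame from the Series: each Series becomes one column
table = pd.DataFrame({"height": height, "weight": weight})
print(table)

# Selecting a single column gives back a Series
col = table["height"]
print(isinstance(col, pd.Series))  # True
print(col.equals(height))          # True
```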

### 2.2 Pandas DataFrame <a class="anchor" id="2.2"></a>

In [None]:
df = pd.DataFrame([[1,2,3],[4,5,6]])
print(df)
print(type(df))

In [None]:
print(df.columns,type(df.columns))
print(df.index,type(df.index))
print(df.values,type(df.values))
print(df.ndim, df.shape, df.size)

In [None]:
df.columns =['col_1','col_2','col_3']
df.index =['a','a']
print(df)
print(df.columns,type(df.columns))
print(df.index,type(df.index))

#### `dtypes` of DataFrame

In [None]:
y = df.dtypes
print(y)
print()
print(type(y))
print(y.index==df.columns)

In [None]:
df = pd.DataFrame([['one',1,1.0],['two',2,2.0]])
print(df)
print()
y = df.dtypes
print(y)
# Pandas uses the object dtype for storing strings.
# A short explanation is that the length of the string is not fixed.

In [None]:
help(pd.DataFrame)

#### Create a DataFrame from
- 2-dim array 
- dictionary
- Series or other DataFrame
- loading csv files or xls/xlsx files
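The file-based route can be tried without any data file on disk. The sketch below assumes a tiny, made-up CSV string (`csv_text`) and reads it through `io.StringIO`, which `read_csv` accepts in place of a path:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "Date,Price\n2020-01-01,7200.5\n2020-01-02,6985.0\n"

# read_csv accepts a path or any file-like object
prices = pd.read_csv(io.StringIO(csv_text))
print(prices)
print(prices.dtypes)

# to_csv without a path returns the CSV text; index=False omits the row labels
round_trip = prices.to_csv(index=False)
print(round_trip)
```

Note that `to_csv` writes the row index by default, so reading such a file back produces an extra unnamed column unless `index=False` is passed when writing.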

In [None]:
a = np.arange(12).reshape(4,3)
print(a)
print()
df = pd.DataFrame(a, columns = ['A','B','C'], index = np.arange(1,5))
print(df)
print()
display(df)

In [None]:
df = pd.DataFrame({'float': [2.0,3],
                   'int': [2,1],
                   'string': ['apple','banana']})
print(df)

In [None]:
dict1 = {'a':[1,2,3,4],'b':[5,6,7,8]}
df = pd.DataFrame(dict1, index = ['r1','r2','r3','r4'],columns=['a','c'])
print(df)

In [None]:
dict2 = {'a':{'r1':1,'r2':2,'r3':3,'r4':4},'b':{'r1':5,'r2':6,'r3':7,'r4':8}}
df = pd.DataFrame(dict2)
print(df)

In [None]:
dict2 = {'a':{'r1':1,'r2':2,'r3':3,'r4':4},'b':{'R1':5,'R2':6,'R3':7}}
df = pd.DataFrame(dict2)
print(df)

In [None]:
ser1 = pd.Series([1,2,3,4])
ser2 = pd.Series([1,2,3,5])
ser3 = pd.Series([1,2,3,6])
df = pd.DataFrame([ser1,ser2,ser3])
print(df)

In [None]:
ser1 = pd.Series([1,2,3,4])
ser2 = pd.Series([1,2,3,5])
ser3 = pd.Series([1,2,3,6])
df = pd.DataFrame({'c1':ser1, 'c2':ser2, 'c3':ser3})
print(df)

In [None]:
ser1 = pd.Series([1,2,3,4])
ser2 = pd.Series([1,2,3,5])
ser3 = pd.Series([1,2,3,6])
df = pd.DataFrame({'c1':ser1, 'c2':ser2, 'c3':ser3}, index = [1,3,5,7])
print(df)

In [None]:
df_sub = df[['c1','c2']]
print(df_sub)

In [None]:
sub1 = df[['c3']]
sub2 = df['c3']
print(type(sub1))
print(sub1)
print()
print(type(sub2))
print(sub2)

In [None]:
# read a csv file
Bitcoin = pd.read_csv("data/Bitcoin.csv")

In [None]:
print(type(Bitcoin))

In [None]:
display(Bitcoin)

In [None]:
# using os.getcwd(): get current working directory
import os
folder = os.getcwd()
print(folder)

In [None]:
Bitcoin = pd.read_csv("data/Bitcoin.csv")
# equivalently
Bitcoin = pd.read_csv(os.getcwd()+"/data/Bitcoin.csv")
display(Bitcoin)

In [None]:
Bitcoin_new = Bitcoin[['Date','Price']]
display(Bitcoin_new)

In [None]:
Bitcoin_new.to_csv('Bitcoin_new.csv')

In [None]:
Currency = pd.read_excel(os.getcwd()+"/data/RMBDailyCurrency.xlsx",header = None)
display(Currency)

In [None]:
Currency.columns=['date','currency']
Currency.to_excel('currency_copy.xlsx')

### 2.3 Indexing and Slicing <a class="anchor" id="2.3"></a>

In [None]:
ser = pd.Series([120.4,120.2,121,119.6,119.3], index = range(1,6), name ='weight')
print(ser)

In [None]:
ser[3]

In [None]:
x = ser[[1,3,5]]
print(x)
print(type(x))

In [None]:
x = ser[::2] # equivalent to x = ser[0:5:2]
print(x)
print(type(x))

In [None]:
ser = pd.Series([120.4,120.2,121,119.6,119.3],index = ['day%d'%i for i in range(1,6)], name ='weight')
print(ser)

#### Quiz: find out the weights since day3

In [None]:
x = ser['day3':]
print(x)
print(type(x))

In [None]:
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8], "c":[9,10,11,12],"d":[13,14,15,16]})
print(df)
print(df.columns)
print(df.index)

In [None]:
df.index = ['one','two','three','four']
display(df)

In [None]:
x = df['b']
print(type(x))
print(x)
x = df[['b','d']]
print(type(x))
print(x)

In [None]:
x = df[0]  # raises KeyError: df[key] looks up a column label, and there is no column named 0
print(type(x))
print(x)

In [None]:
x = df[0:1] 
print(type(x))
print(x)

In [None]:
x = df['two']  # raises KeyError: 'two' is a row label, not a column label
print(type(x))
print(x)

In [None]:
x = df['two':] 
print(type(x))
print(x)

In [None]:
df = pd.DataFrame(df.values)
print(df)

In [None]:
print(df[[2,1]])
print()
print(df[np.arange(0,3,2)])

In [None]:
print(df[0:3:2])

#### Quiz: how to find the first 2 rows of the first 3 columns

In [None]:
x = df[[0,1,2]][0:2]
# or 
x = df[0:2][[0,1,2]]
# or 
x = df[0:2][np.arange(3)]
# or 
x = df[np.arange(3)][0:2]
print(x)

#### Use `loc` and `iloc`
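`loc` selects by label and `iloc` selects by integer position. One difference that is easy to miss is how slices end; the short sketch below (using a small illustrative frame `df_demo`) selects the same block both ways:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
                       index=["one", "two", "three", "four"])

# loc: label-based, the end label IS included
by_label = df_demo.loc["two":"three", "a":"b"]

# iloc: position-based, the end position is NOT included (like Python slices)
by_position = df_demo.iloc[1:3, 0:2]

print(by_label)
print(by_label.equals(by_position))  # True
```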

In [None]:
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8], "c":[9,10,11,12],"d":[13,14,15,16]})
df.index = ['one','two','three','four']
print(df)
print(df.columns)
print(df.index)

In [None]:
x = df.loc['two']
print(type(x))
print(x)

In [None]:
x = df.iloc[1]
print(type(x))
print(x)

In [None]:
x = df.loc[:,'b']
print(type(x))
print(x)

In [None]:
x = df['b']
print(type(x))
print(x)

In [None]:
x = df.iloc[:,1]
print(type(x))
print(x)

In [None]:
x = df.loc['two':'four']
print(type(x))
print(x)
print()
x = df.loc[['two','three','four']]
print(type(x))
print(x)
print()
x = df.iloc[1:4]
print(type(x))
print(x)

#### Quiz: use `iloc` and `loc` to obtain the DataFrame consisting of the first 3 columns of the last 2 rows of the above `df`.

In [None]:
display(df)

In [None]:
x = df[['a','b','c']][-2:]
print(x)
print()
x = df[-2:][['a','b','c']]
print(x)

In [None]:
x = df.iloc[-2:,:3]
print(x)
print()
x = df.iloc[[2,3],[0,1,2]]
print(x)

In [None]:
x = df.loc['three':,'a':'c']
print(x)
print()
x = df.loc[['three','four'],['a','b','c']]
print(x)

### 2.4 Manipulation <a class="anchor" id="2.4"></a>
- transpose `T`
- delete `drop()`
- merge `merge()`,`concat()`,`join()`
- sort `sort_values()`
- change data type `astype()`, `to_datetime()`,`to_numeric()`

In [None]:
import pandas as pd, numpy as np

In [None]:
df=pd.DataFrame({"key":["a","b","c","d","e"], "data":np.arange(5)})
display(df)

In [None]:
df_new = df.T
display(df)
display(df_new)

In [None]:
df = pd.DataFrame(np.arange(12).reshape(3, 4),columns=['A', 'B', 'C', 'D'])
display(df)

In [None]:
# Drop columns
df_new = df.drop(columns = ['B', 'C'])
display(df_new)
display(df)

In [None]:
# Drop columns method 2
df_new = df.drop(['B', 'C'], axis=1) # by default axis = 0
display(df_new)

In [None]:
# Drop one row by index
df_new = df.drop(1)
display(df_new)

In [None]:
# Drop rows by index
df_new = df.drop([0,2])
display(df_new)
# equivalently
df_new = df.drop(index = [0,2])
display(df_new)

In [None]:
display(df)
df_new = df.drop(columns =['B'], index = 1, inplace = True)  # inplace=True modifies df itself and returns None
print(df_new)
display(df)

In [None]:
df['E']=[5,6]
display(df)

In [None]:
df.loc[3]=[0,0,0,0]
display(df)

In [None]:
df.loc[5,'F'] = 100
display(df)

#### `concat()` in Pandas is similar to `concatenate()` in NumPy

In [None]:
df1=pd.DataFrame({"key":["a","b","c","d","e"], "data":np.arange(5)})
df2=pd.DataFrame({"key":["a","b","c"], "data":np.arange(3)})
display(df1)
display(df2)

In [None]:
data=pd.concat([df1,df2]) 
display(data)

In [None]:
data1=pd.concat([df1,df2],axis=1)
display(data1)

In [None]:
data1.loc[:,'key']

In [None]:
data1.iloc[:,0]


`DataFrame.join()` is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. The data alignment is **on the indexes (row labels)**.

If the two DataFrames have **overlapping** column names, `join()` raises an error unless you supply a suffix (`lsuffix` and/or `rsuffix`) to rename the overlapping columns.
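A small sketch of the overlapping-column case, with made-up frames `left_df` and `right_df` that both contain a `key` column:

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right_df = pd.DataFrame({"key": ["a", "c"], "y": [3, 4]})

# Both frames have a "key" column, so a plain join is ambiguous
try:
    left_df.join(right_df)
except ValueError as err:
    print("join failed:", err)

# Supplying suffixes disambiguates the overlapping column names
joined = left_df.join(right_df, lsuffix="_left", rsuffix="_right")
print(joined.columns.tolist())  # ['key_left', 'x', 'key_right', 'y']
```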

In [None]:
df3=pd.DataFrame({"key":["a","b","c","d","e"], "data1":np.arange(5)})
df4=pd.DataFrame({"key":["a","b","c","D"],"data2":np.arange(4)}) 
display(df3, df4)

In [None]:
df3.join(df4,how = 'outer', lsuffix='_1',rsuffix='_2')
# try how = 'inner','right'
# by default how = 'left'

In [None]:
# Compare with concat with axis = 1
pd.concat([df3,df4],axis=1)

`join()` takes an optional `on` argument, which may be one column name or a list of column names. It specifies that the index of the passed DataFrame is to be aligned with that column (or those columns) of the calling DataFrame.

In that case, `join()` and `merge()` calls are completely equivalent.

In [None]:
left = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"],
                     "key": ["K0", "K1", "K2", "K3"]})

right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
display(left)
display(right)

In [None]:
result_1 = left.join(right, on="key", how="outer")
result_2 = pd.merge(left, right, left_on="key", right_index=True, how="outer")
display(result_1)
display(result_2)

Pandas provides a single function, `merge()`, as the entry point for all standard database join operations between `DataFrame` objects or named `Series` objects.
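To see the standard join types side by side, the sketch below merges two made-up tables (`customers` and `orders`, both hypothetical) on a shared `cid` column and counts the rows each `how` option returns:

```python
import pandas as pd

customers = pd.DataFrame({"cid": [1, 2, 3], "name": ["ann", "bob", "cat"]})
orders = pd.DataFrame({"cid": [2, 3, 4], "amount": [20, 30, 40]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    # how= selects the join type; indicator=True adds a "_merge" column
    # telling whether each row came from the left, the right, or both
    result = pd.merge(customers, orders, on="cid", how=how, indicator=True)
    row_counts[how] = len(result)
    print(how)
    print(result)

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```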

In [None]:
display(df3, df4)

`merge()` behaves much like `join()` when called with `left_index = True, right_index = True`, that is, when joining on the index.

In [None]:
df3.merge(df4,left_index = True,right_index = True,suffixes =['_1','_2'], how = 'inner')

In [None]:
df3.join(df4, lsuffix='_1',rsuffix='_2', how = 'inner')

`merge()` is often used when we don’t want to join on the index.

In [None]:
df3.merge(df4, on = 'key', how = 'right')

In [None]:
df4['data2']=[0,1,2,3]
df4['key']=['A','B','C',"D"]
display(df3, df4)

In [None]:
df3.merge(df4,left_on = 'data1', right_on = 'data2', suffixes =['_1','_2'],how='outer')

A detailed tutorial on `concat()`, `join()` and `merge()`:

   - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

`sort_values()`:

- `Series_obj.sort_values()`
- `DataFrame_obj.sort_values([col_name])`
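`sort_values()` also accepts several columns at once, with a separate sort direction for each. A brief sketch, assuming an illustrative frame `people` with an extra `Team` column that does not appear in the lecture data:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["jack", "alice", "zoe", "henry"],
                       "Team": ["red", "blue", "red", "blue"],
                       "Height": [180, 167, 140, 129]})

# Sort by Team ascending, then by Height descending within each team
ranked = people.sort_values(["Team", "Height"], ascending=[True, False])
print(ranked)

# ignore_index=True renumbers the rows 0..n-1 after sorting
ranked_reset = people.sort_values("Height", ignore_index=True)
print(ranked_reset["Name"].tolist())  # ['henry', 'zoe', 'alice', 'jack']
```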

In [None]:
df = pd.DataFrame({'Name':['jack','alice','zoe','henry'],'Height':[180,167,140,129]})
display(df)

In [None]:
name = df['Name'].sort_values()

In [None]:
print(name)
print(df['Name'])

In [None]:
df_sorted = df.sort_values(['Name'])
display(df_sorted)

In [None]:
df.sort_values(['Height'], inplace = True)
display(df)

In [None]:
df.sort_values(['Height'], ascending=False, inplace = True)
display(df)

In [None]:
df = pd.DataFrame({'Name':['jack','alice','zoe','henry'],
                   'Height':['180','167','140','129'],
                   'Date':['2022-5-1','2021-12-1','2022-1-11','2020-9-10']})
print(df.dtypes)
display(df)

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Height']=pd.to_numeric(df['Height'])
print(df.dtypes)
display(df)

In [None]:
print([x.year for x in df['Date']])
print([x.month for x in df['Date']])
print([x.day for x in df['Date']])
print(df.loc[3,'Date'] > df.loc[1,'Date'])

In [None]:
df['Height']=df['Height'].astype('float64')
print(df.dtypes)
display(df)

In [None]:
s5 = pd.Series(np.array([10,20,30,40,50,60]),index = ["a","b","c","d","e","f"])
s6 = pd.Series(np.array([1,1,1,1,1,1,]),index = ["a","b","C","d","e","F"])
print(s5)
print(s6)

In [None]:
s5-s6

### 2.5 Data Cleaning <a class="anchor" id="2.5"></a>

#### Some tips in data cleaning
- A quick look at the data
- `describe()`
- Missing data
- Visualization
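The first three tips can be tried on a toy table before loading a real file. The sketch below uses made-up numbers; the column names only mimic the Bitcoin data and are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data with two missing entries
toy = pd.DataFrame({"Price": [6400.0, 6550.0, np.nan, 6700.0],
                    "Turnover": [1.2, np.nan, 0.9, 1.1]})

print(toy.head(2))            # a quick look at the first rows
print(toy.describe())         # summary statistics per numeric column
missing = toy.isnull().sum()  # number of missing values in each column
print(missing)
```

Here `missing` reports one missing value in each column. `describe()` skips missing values, so its `count` row shows 3 rather than 4 for both columns.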

In [None]:
Bitcoin = pd.read_csv("data/Bitcoin.csv")
display(Bitcoin)

In [None]:
Bitcoin.head(3)

In [None]:
Bitcoin.tail()

In [None]:
Bitcoin.loc[[0,5,10,19]]

In [None]:
Bitcoin[["Turnover"]].head() 

In [None]:
Bitcoin[["Date","Price","Volatility"]].head() 

In [None]:
Bitcoin.loc[:,["Date","Price","Volatility"]].head() 

In [None]:
Bitcoin.loc[[0,5,10,19],["Date","Price","Volatility"]]

In [None]:
Bitcoin.describe()

In [None]:
print(Bitcoin['Price']>6500)
print(sum(Bitcoin['Price']>6500))

In [None]:
Bitcoin[Bitcoin['Price']>6500]

In [None]:
Bitcoin = pd.read_csv("data/Bitcoin_missingdata.csv") 
Bitcoin.tail()

In [None]:
type(Bitcoin.loc[1,'Date'])

In [None]:
date = pd.to_datetime(Bitcoin['Date'])
print(date)

In [None]:
turn = Bitcoin["Turnover"]
print(turn)

In [None]:
turn.isnull()

In [None]:
sum(turn.isnull())

In [None]:
Bitcoin[turn.isnull()]

In [None]:
turn.fillna(50)  # replace every missing value with the constant 50

In [None]:
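# Backward fill: replace each missing value with the next valid value
turn.bfill()
# try forward fill:
# turn.ffill()
# Older spelling, deprecated in recent pandas: turn.fillna(method = 'bfill') / turn.fillna(method = 'ffill')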

In [None]:
turn.fillna(turn.median())

In [None]:
help(pd.DataFrame.fillna)

In [None]:
df = pd.read_csv("data/Bitcoin_missingdata.csv") 
print(df.isnull())

In [None]:
values = {"Hash_rate": df['Hash_rate'].min(), "Miners_profits": df['Miners_profits'].max(), "Turnover": 0}
df_fill = df.fillna(value = values)
display(df_fill)

In [None]:
df_new=df.dropna()
print(df_new.shape)
display(df_new)

In [None]:
df_all = df.dropna(how = "all")
print(df_all.shape)
df_all

In [None]:
df_any = df.dropna(how = "any")
print(df_any.shape)

## 3 Matplotlib <a class="anchor" id="3"></a>

Stay tuned!