# Q5

#[Q. how to compile python code to be an executable file?]

In [None]:
#pip install pyinstaller;

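* Install `pyinstaller` (`pip install pyinstaller`).
* Assume `path` is the folder that contains your Python script.
* Run the following in a terminal:<br>
  `cd path`<br>
  `pyinstaller --onefile PythonScriptName.py`
* On Windows, the executable is written to `path\dist\PythonScriptName.exe`.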


## Q6

![flowchart](https://raw.githubusercontent.com/songssssss/notebook-repo/0b67215646f8916048ca2cfab0827bff81975b04/chp6%20flowchart.jpg)

In [None]:
#[Q. How to add the initial stock price at a particular position in an array? I am just curious...]

In [None]:
import numpy as np

In [3]:
# numpy.insert: without `axis`, the array is flattened before the values are inserted at index 3
a = np.array([[1,2,3], [4,5,6]])
print(np.insert(a,3,[11,12]))

[ 1  2  3 11 12  4  5  6] 
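For the stock-price question specifically, here is a minimal sketch. Because `np.insert` flattens the array when `axis` is omitted (as in the output above), passing `axis` keeps the 2-D shape. The names `paths` and `S0` and all numbers are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical example: two simulated price paths (rows) without the initial price
paths = np.array([[101.0, 102.5, 99.8],
                  [100.4, 98.7, 97.9]])
S0 = 100.0  # assumed initial stock price

# Insert S0 as a new first column (position 0 along axis=1) of every path
paths_with_s0 = np.insert(paths, 0, S0, axis=1)
print(paths_with_s0)
```

The scalar `S0` is broadcast to every row, so each path now starts at the same initial price.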



In [None]:
# some examples for list unpacking

In [4]:
# list unpacking
name = ['Tom', 'Alice', 'Jerry']
Tom = name[0]
Alice = name[1]
Jerry = name[2]
print(Tom, Alice, Jerry)

Tom Alice Jerry


In [5]:
Tom, Alice, Jerry = name
print(Tom, Alice, Jerry)

Tom Alice Jerry
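As a small extension of the unpacking examples above, the sketch below shows starred unpacking and unpacking inside a `for` loop. The variable names are made up for illustration.

```python
name = ['Tom', 'Alice', 'Jerry']

# The number of targets must match the number of items, unless one target is starred
first, *rest = name
print(first)
print(rest)

# Unpacking also works inside a for loop over pairs
pairs = [('Tom', 1), ('Alice', 2)]
for person, number in pairs:
    print(person, number)
```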


QuantLib-Python Module Reference
https://quantlib-python-docs.readthedocs.io/_/downloads/en/latest/pdf/<br>
http://gouthamanbalaraman.com/blog/quantlib-python-tutorials-with-examples.html<br>
https://quantlib-python-docs.readthedocs.io/en/latest/basics.html

Derivatives Pricing using QuantLib: An Introduction
https://web.iima.ac.in/assets/snippets/workingpaperpdf/7473160052015-03-16.pdf


* `SimpleQuote` Class Reference
    * market element returning a stored value
    * https://www.quantlib.org/reference/class_quant_lib_1_1_simple_quote.html#details
* `FlatForward` Class Reference
    * Flat interest-rate curve  
* `BlackScholesProcess` Class Reference
    * https://www.quantlib.org/reference/class_quant_lib_1_1_black_scholes_process.html
* `VanillaOption` Class Reference
    * https://rkapl123.github.io/QLAnnotatedSource/d2/d47/class_quant_lib_1_1_vanilla_option.html

Pricing with QuantLib mainly involves working with classes and inheritance.

Structure of QuantLib: Important classes

The price of any derivative, be it a plain-vanilla option or a complex structured product, depends on the following inputs:
* Prices of the underlying securities as of the pricing date, and their data feeds
* Term structure of interest rates, volatility, inflation and default probabilities
* Stochastic process for the underlying
* Pricing engine (the numerical method used for pricing)

#### Quantlib Basics

##### Date Class

A `Date` object can be created with the constructor `Date(day, month, year)`. Note that `day` is the first argument, followed by `month` and then `year`. This differs from Python's `datetime.date`, whose constructor takes `(year, month, day)`.

In [None]:
import QuantLib as ql

In [33]:
date = ql.Date(23, 7, 2021)
print(date)

July 23rd, 2021


##### Calendar Class

Plain `Date` arithmetic does not take holidays into account, but the valuation of many securities requires the holidays observed by a specific exchange or country. The `Calendar` class implements this functionality for all the major exchanges.

In [20]:
hk_calendar = ql.HongKong()
period = ql.Period(2, ql.Days)

print("Add 2 days in HK:", date + period)
print("Add 2 business days in HK:", hk_calendar.advance(date, period))

Add 2 days in HK: July 25th, 2021
Add 2 business days in HK: July 27th, 2021


##### Interest Rate

The `InterestRate` class can be used to store the interest rate with the compounding type, day count and the frequency of compounding. Below we show how to create an interest rate of 6.0% compounded annually, using Actual/Actual day count convention.

In [54]:
annual_rate = 0.06
day_count = ql.ActualActual()  #`ActualActual()` Actual/Actual day count
compound_type = ql.Compounded 
frequency = ql.Annual
interest_rate = ql.InterestRate(annual_rate, day_count, compound_type, frequency)

In [55]:
t = 2.0
print(interest_rate.compoundFactor(t))
print((1+annual_rate)*(1.0+annual_rate))

1.1236000000000002
1.1236000000000002
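The arithmetic behind the compound factor can be written out in plain Python. This is only a sketch of the standard formula $(1 + r/n)^{nt}$; the helper `compound_factor` is a hypothetical name and is not part of QuantLib.

```python
def compound_factor(rate, t, frequency=1):
    """Growth of 1 unit over t years with `frequency` compounding periods per year."""
    return (1.0 + rate / frequency) ** (frequency * t)

annual = compound_factor(0.06, 2.0)                    # annual compounding, as above
semiannual = compound_factor(0.06, 2.0, frequency=2)   # semiannual compounding
print(annual)
print(semiannual)
```

With annual compounding this reproduces the factor printed above; a higher compounding frequency gives a slightly larger factor for the same nominal rate.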


##### Settings

In [30]:
# today's date is returned if the evaluation date is set to the null date (its default value) 
d = ql.Settings.instance().evaluationDate
print('Eval Date :', d)

# we can set it to a new value
ql.Settings.instance().evaluationDate = ql.Date(1, ql.January, 2020)
d = ql.Settings.instance().evaluationDate
print('New Eval Date :', d)

Eval Date : July 30th, 2021
New Eval Date : January 1st, 2020


##### Instruments and pricing engines

Take a European option as a sample instrument. Building the option requires only the specification of its contract: its payoff (a call option with strike 100) and its exercise date (3 September 2021 in the code below). Market data will be selected and passed later, depending on the calculation method.

In [32]:
option = ql.EuropeanOption(ql.PlainVanillaPayoff(ql.Option.Call, 100.0), 
                        ql.EuropeanExercise(ql.Date(3, ql.September, 2021)))

Take the analytic Black-Scholes formula as a sample pricing engine.

First, we collect the quoted market data. We’ll assume a flat risk-free rate and a flat volatility, so they can be expressed by `SimpleQuote` instances. The underlying value is 100, the risk-free rate 1%, and the volatility 20%.

* `Quote` instances model numbers whose value can change and that notify observers when this happens.

In [34]:
u = ql.SimpleQuote(100.0)
r = ql.SimpleQuote(0.01)
sigma = ql.SimpleQuote(0.2)

In order to build the engine, the market data are encapsulated in a Black-Scholes process object. 

Before that, we need to build flat curves for the risk-free rate and the volatility. 

In [39]:
riskFreeCurve = ql.FlatForward(0, hk_calendar, ql.QuoteHandle(r), day_count) 
volatility = ql.BlackConstantVol(0, hk_calendar, ql.QuoteHandle(sigma), day_count)

We instantiate the process with the underlying value and the curves we just built. 

* The `Handle` class is a smart pointer to a pointer. The inputs are all stored in handles, so that we can later change the quotes and curves used if we want to.

In [47]:
process = ql.BlackScholesProcess(ql.QuoteHandle(u),
                                 ql.YieldTermStructureHandle(riskFreeCurve), 
                                 ql.BlackVolTermStructureHandle(volatility))

After having the process, we can finally use it to build the engine.

In [48]:
engine = ql.AnalyticEuropeanEngine(process)

After having the engine, we can set it to the option and evaluate the latter.

In [49]:
option.setPricingEngine(engine)
print(option.NPV())

11.041153344216704
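For reference, the closed-form Black-Scholes price of a European call (no dividends) can be sketched in plain Python. The function names and the inputs below are illustrative assumptions, not the notebook's values. QuantLib derives the time to expiry from the evaluation date, the exercise date and the day counter, so this sketch only matches `option.NPV()` when `t` is computed the same way.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(spot, strike, rate, vol, t):
    """Black-Scholes price of a European call without dividends."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    d2 = d1 - vol * sqrt(t)
    return spot * norm_cdf(d1) - strike * exp(-rate * t) * norm_cdf(d2)

# Illustrative inputs (not the notebook's): 1 year to expiry, 5% rate, 20% volatility
price = bs_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(price, 4))
```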


##### Market changes
As mentioned before, market data are stored in `Quote` instances and thus can notify the option when any of them changes. We don’t have to do anything explicitly to tell the option to recalculate: once we set a new value to the underlying, we can simply ask the option for its NPV again and we’ll get the updated value.

In [60]:
u.setValue(105.0) 
r.setValue(0.006) 
print(option.NPV())

13.750323541502016


6. Iteratively calculate the continuation value and the value vector in the given time:

We ran the backward part of the algorithm from time $T-1$ to $0$. At each of these steps, we estimated the expected continuation value as a cross-sectional linear regression. We fitted the $5^{th}$-degree polynomial to the data using `np.polyfit`. Then, we evaluated the polynomial at specific values (using `np.polyval`), which is the same as getting the fitted values from a linear regression. We compared the expected continuation value to the payoff to see if the option should be exercised. If the payoff was higher than the expected value from continuation, we set the value to the payoff. Otherwise, we set it to the discounted one-step-ahead value. We used `np.where` for this selection.
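The backward step described above (commonly known as Least Squares Monte Carlo) can be sketched as follows for an American put. All parameter values, the random seed and the variable names are assumptions chosen for illustration; only the use of `np.polyfit`, `np.polyval` and `np.where` comes from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed illustrative parameters (not taken from the text)
S0, K, r, sigma, T = 36.0, 40.0, 0.06, 0.2, 1.0
n_steps, n_paths, degree = 50, 10_000, 5
dt = T / n_steps
discount = np.exp(-r * dt)

# Simulate geometric Brownian motion paths, shape (n_steps + 1, n_paths)
z = rng.standard_normal((n_steps, n_paths))
log_increments = (r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
paths = S0 * np.exp(np.vstack([np.zeros(n_paths),
                               np.cumsum(log_increments, axis=0)]))

payoff = np.maximum(K - paths, 0.0)  # put payoff at every time step

# Backward induction: start from the payoff at maturity
value = payoff[-1].copy()
for t in range(n_steps - 1, 0, -1):
    # Cross-sectional regression of discounted next-step values on current prices
    coeffs = np.polyfit(paths[t], value * discount, degree)
    continuation = np.polyval(coeffs, paths[t])
    # Exercise where the payoff beats the expected continuation value,
    # otherwise carry the discounted one-step-ahead value
    value = np.where(payoff[t] > continuation, payoff[t], value * discount)

price = discount * value.mean()
print(price)
```

The regression here uses all paths at each step; restricting it to in-the-money paths is a common variation that this sketch does not implement.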

## Q7

In [None]:
#pip install yfinance==0.1.62

#[Q. Can you provide a bit more explanation here why we need to downgrade the yfinance package]

Earlier versions of yfinance had bugs when `yf.download` was used to download data for multiple tickers. When I was writing this chapter, version 0.1.62 or greater had not been released, so I wrote 'upgrade'. I think this will no longer be a problem for students.

In [1]:
#[Q. Can we define the formulas here as well? Like what is Calmar ratio in notation?]

Notes:



* $\text{Sharpe Ratio} = \frac{R_p - R_f}{\sigma_p}$
    * $R_p$ = return of the portfolio
    * $R_f$ = risk-free rate
    * $\sigma_p$ = standard deviation of the portfolio's excess return
* $\text{Maximum Drawdown} = \frac{\text{Trough Value} - \text{Peak Value}}{\text{Peak Value}}$
* $\text{Calmar Ratio} = \frac{R_p - R_f}{|\text{Maximum Drawdown}|}$
* $\text{Sortino Ratio} = \frac{R_p - R_f}{\sigma_d}$
    * $\sigma_d$ = standard deviation of the downside returns
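The formulas above can be evaluated on a return series with NumPy. The sketch below uses made-up daily returns and an assumed zero risk-free rate, does not annualize anything, and uses the magnitude of the maximum drawdown as the Calmar denominator. Conventions vary between libraries, so treat it as an illustration rather than a reference implementation.

```python
import numpy as np

# Hypothetical daily portfolio returns and an assumed zero risk-free rate
returns = np.array([0.01, -0.02, 0.015, -0.005, 0.02, -0.01, 0.005])
risk_free = 0.0
excess = returns - risk_free

sharpe = excess.mean() / excess.std(ddof=1)

# Maximum drawdown: largest relative drop of the cumulative value from its running peak
wealth = np.cumprod(1.0 + returns)
running_peak = np.maximum.accumulate(wealth)
max_drawdown = ((wealth - running_peak) / running_peak).min()

# Calmar ratio, using the magnitude of the (negative) drawdown as the denominator
calmar = excess.mean() / abs(max_drawdown)

# Sortino ratio: only negative excess returns contribute to the downside deviation
downside = np.minimum(excess, 0.0)
sortino = excess.mean() / np.sqrt(np.mean(downside ** 2))

print(sharpe, max_drawdown, calmar, sortino)
```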

In [13]:
#range arange linspace

In [None]:
#[Q. what is the difference between linspace and range]

chapter 3

Notes:

*Generating a sequence of numbers*
* `numpy.linspace(start, stop, num=50)`
    * Return evenly spaced numbers over a specified interval.
    * `num`: *int, optional*. Number of samples to generate. Default is 50. Must be non-negative.
 
* `numpy.arange([start, ]stop, [step, ])`
    * Return evenly spaced values within a given interval.
    * `start` : *optional*. Start of interval range. By default start = 0
    * `step`  : *optional*. Step size of interval. By default step size = 1
    
* `range([start, ]stop, [step, ])`
    * Return an object that produces a sequence of integers from start.
    * `stop`: *integer*. Before which the sequence of integers is to be returned. The range of integers ends at stop - 1.
    * `step`: *optional*. It is an integer value which determines the increment between each integer in the sequence


In [1]:
import numpy as np

In [2]:
# linspace
# If dtype is not given, the data type is inferred from start and stop. 
# The inferred dtype will never be an integer; float is chosen even if the arguments would produce an array of integers.
print(np.linspace(1, 5, num=5))

[1. 2. 3. 4. 5.]


In [77]:
# arange
# default start is 0
print(np.arange(3))
# default step is 1
print(np.arange(3,7))
print(np.arange(3,7,2))

[0 1 2]
[3 4 5 6]
[3 5]


In [78]:
# range
# empty range
print(list(range(0)))
# using range(stop)
print(list(range(10)))
# using range(start, stop)
print(list(range(1, 10)))
# using range(start, stop, step)
print(list(range(1, 10, 3)))

[]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 4, 7]


* `numpy.linspace` VS `numpy.arange`
    * `linspace` includes both end points by default (`endpoint=True`); `arange` never includes `stop`.
    * Use `linspace` when you know the number of points needed and the exact values of the start and end points of your range are important.
    * Use `arange` when you know the step size.

In [2]:
import numpy as np

In [4]:
np.linspace(0.3, 1, 7)

array([0.3       , 0.41666667, 0.53333333, 0.65      , 0.76666667,
       0.88333333, 1.        ])

In [17]:
np.linspace(1,2,5) # includes 2

array([1.  , 1.25, 1.5 , 1.75, 2.  ])

In [9]:
np.arange(1,2,0.2)

array([1. , 1.2, 1.4, 1.6, 1.8])
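To make the relationship between the two functions concrete, the short sketch below shows that `linspace` with `endpoint=False` produces the same grid as the `arange` call above, and that `retstep=True` reports the spacing `linspace` computed.

```python
import numpy as np

# linspace with endpoint=False stops before `stop`, like arange
by_count = np.linspace(1, 2, 5, endpoint=False)
by_step = np.arange(1, 2, 0.2)
print(by_count)
print(by_step)
print(np.allclose(by_count, by_step))

# retstep=True also returns the spacing that linspace computed
values, step = np.linspace(1, 2, 5, retstep=True)
print(step)
```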

* `numpy.linspace` VS `range`
    * `range` returns an object that produces a sequence of numbers, but it only accepts integers.
    * In Python 3, `range` returns a lazy iterable `range` object rather than a list (as it did in Python 2).

In [13]:
# no restriction
np.linspace(0,1.5,3)

array([0.  , 0.75, 1.5 ])

In [12]:
# only for integers
range(0, 1.5)

TypeError: 'float' object cannot be interpreted as an integer

In [23]:
# range returns an iterable object 
print(type(range(2)))
print(list(range(2)))

<class 'range'>
[0, 1]


In [11]:
#[Q. What is 'b--'?] # blue dashed line

#https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

Notes:
* Format Strings
    * A format string consists of a part for color, marker and line: `fmt = '[marker][line][color]'`
        * Each of them is optional. If not provided, the value from the style cycle is used. Exception: If line is given, but no marker, the data will be a line without markers.
        * Other combinations such as `[color][marker][line]` are also supported, but note that their parsing may be ambiguous.
    * Markers
        * `'o'`	circle marker
    * Line Styles
        * `'--'` dashed line style
        * `'-'`	solid line style
    * Colors: the supported color abbreviations are the single letter codes
        * `'b'`	blue
        * `'g'`	green
* See also https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
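A minimal sketch of two format strings in use. The data are made up, and the `Agg` backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

fig, ax = plt.subplots()
(dashed,) = ax.plot(x, y, 'b--')   # blue dashed line, no markers
(circles,) = ax.plot(x, x, 'go')   # green circle markers, no connecting line

# The format string is parsed into separate line properties
print(dashed.get_color(), dashed.get_linestyle())
print(circles.get_color(), circles.get_marker())
```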

In [24]:
#[Q. What is x['fun']]?
# 'fun' is an element in the scipy.optimize.OptimizeResult. 

Remarks:
* We extracted the volatility from the `scipy.optimize.OptimizeResult` object by accessing the `'fun'` element. It holds the value of the objective function at the optimum, which in this case is the portfolio volatility.

In [None]:
#[Q. Any example for inequality constraint? ]

In [4]:
import scipy.optimize as sco
import numpy as np

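Example for nonlinear programming. `sco.minimize` only minimizes, so the code below maximizes $f$ by minimizing $-f$, and the reported `fun` is the negative of the maximum.

$max \space f(x) = x_1^2 +x_2^2+x_3^2+8$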

$s.t. \space x_1^2-x_2+x_3^2 \geq 0$<br>
$\space x_1+x_2^2+x_3^2 \leq 20$<br>
$\space -x_1-x_2^2+2 = 0$<br>
$\space x_2+2x_3^2 = 3$<br>
$\space x_1,x_2,x_3\geq 0$<br>

In [2]:
# objective function, negated: minimizing -f(x) maximizes f(x)
def objective(x):
    return -(x[0]**2+x[1]**2+x[2]**2+8)

In [10]:
b=(0.0, None)
bounds= (b,b,b)
#initial_guess
#x0 = np.array([0,0,0])
x0 = np.array([1,2,3])

In [11]:
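# 'ineq' constraints require fun(x) >= 0; 'eq' constraints require fun(x) == 0
# (despite its name, this tuple holds both kinds)
ineq_cons = ({'type': 'ineq', 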
             'fun' : lambda x: x[0]**2 - x[1] + x[2]**2},
             {'type': 'ineq', 
             'fun' : lambda x: 20 - x[0] - x[1]**2 - x[2]**2},
             {'type': 'eq', 
             'fun' : lambda x: -x[0] - x[1]**2 + 2},
             {'type': 'eq', 
             'fun' : lambda x: x[1] + 2*x[2]**2 -3 })

In [12]:
result = sco.minimize(objective, x0, method='SLSQP', 
                      constraints=ineq_cons,bounds=bounds)

In [17]:
result

     fun: -13.500000000000213
     jac: array([-4.        ,  0.        , -2.44948971])
 message: 'Optimization terminated successfully'
    nfev: 37
     nit: 9
    njev: 9
  status: 0
 success: True
       x: array([2.00000000e+00, 1.05762651e-18, 1.22474487e+00])

In [16]:
result['fun'] # objective function

-13.500000000000213

In [13]:
result.x

array([2.00000000e+00, 1.05762651e-18, 1.22474487e+00])
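A smaller example with inequality constraints only may be easier to follow. It is adapted from the example in the `scipy.optimize.minimize` documentation. With SLSQP, an `'ineq'` constraint means `fun(x) >= 0`, so a constraint of the form $g(x) \leq c$ is passed as $c - g(x) \geq 0$.

```python
import numpy as np
import scipy.optimize as sco

# Objective: squared distance to the point (1, 2.5)
def objective(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2

# Each 'ineq' entry means fun(x) >= 0 at the solution
constraints = (
    {'type': 'ineq', 'fun': lambda x: x[0] - 2 * x[1] + 2},
    {'type': 'ineq', 'fun': lambda x: -x[0] - 2 * x[1] + 6},
    {'type': 'ineq', 'fun': lambda x: -x[0] + 2 * x[1] + 2},
)
bounds = ((0, None), (0, None))  # x[0] >= 0 and x[1] >= 0

result = sco.minimize(objective, x0=np.array([2.0, 0.0]), method='SLSQP',
                      bounds=bounds, constraints=constraints)
print(result.x)       # optimal point
print(result['fun'])  # objective value at the optimum
```

The unconstrained minimum (1, 2.5) violates the first constraint, so the solver returns the closest feasible point, which lies on that constraint's boundary.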

cvxpy installation

https://stackoverflow.com/questions/61365790/error-could-not-build-wheels-for-scipy-which-use-pep-517-and-cannot-be-installe

Try updating pip first.

In [None]:
# !conda install -c conda-forge cvxpy

For the difference between installing with pip and with conda, see the comparison table later in these notes.

## Q8

In [None]:
#[Q. Can you teach me a bit what the first line does? ]
rename_dict = {x: y for x, y in zip(df.loc[:, 'pay_0':'pay_amt6'].columns, new_column_names)}

Notes:


* Producing dictionaries when you know the keys and values
    * `dictionary = dict(zip(keys, values))`
    * `dictionary = {x: y for x, y in zip(keys, values)}`
        * Dictionary comprehension in Python
            * Similar to list comprehension, dictionary comprehension is an elegant and concise way to create dictionaries.
            * `dictionary = {key: value for key, value in iterable}`

In [7]:
keys = ['cat', 'dog']
values = ['kitten', 'puppy']

In [2]:
keys = ['cat', 'dog']
values = ['kitten', 'puppy']
dict(zip(keys,values))

{'cat': 'kitten', 'dog': 'puppy'}

In [8]:
{x: y for x, y in zip(keys, values)}

{'cat': 'kitten', 'dog': 'puppy'}
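Applied to the renaming question, the same pattern looks like the sketch below. The lists `old_names` and `new_names` are hypothetical stand-ins for the DataFrame's column slice and `new_column_names`.

```python
# Hypothetical old and new column names, for illustration only
old_names = ['pay_0', 'pay_2', 'pay_3']
new_names = ['payment_status_sep', 'payment_status_aug', 'payment_status_jul']

# Pair each old name with its new name and build the mapping
rename_dict = {old: new for old, new in zip(old_names, new_names)}
print(rename_dict)

# A comprehension can also transform or filter while building the dictionary
upper_dict = {old: new.upper()
              for old, new in zip(old_names, new_names)
              if old != 'pay_0'}
print(upper_dict)
```

A dictionary like `rename_dict` is the kind of mapping that `DataFrame.rename(columns=...)` accepts.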

In [None]:
#[Q. How can I see what is inside (zip(df.loc[:, 'pay_0':'pay_amt6'].columns, new_column_names))?]
# we can use list, tuple, dict or list comprehension, etc.

In [9]:
list(zip(keys,values))

[('cat', 'kitten'), ('dog', 'puppy')]

In [10]:
set(zip(keys,values))

{('cat', 'kitten'), ('dog', 'puppy')}

In [6]:
[(x,y) for x,y in zip(keys, values)]

[('cat', 'kitten'), ('dog', 'puppy')]

chapter 6

Notes:
* `zip()`
    * The built-in function `zip()` aggregates the elements from multiple iterable objects (lists, tuples, etc.). It is often used to iterate over several lists in parallel in a for loop.
    * By passing several iterable objects (lists, tuples, etc.) as arguments to `zip()`, you obtain one element from each of them at every step of the for loop.
    * The result of `zip()` is an iterator: an object that yields its elements one at a time (via `next(iterator)`) and can be traversed only once. This is more memory-efficient and more general-purpose than building and returning a list. To display or reuse the contents, convert the iterator into the container you want (e.g. set, list, tuple).

In [35]:
# zip()
a = [1,2]
b = [3,4]
c = [3,4,5]

example1 = list(zip(a,b))
print(example1)

# using `*` to unzip 
print(type(example1))
a, b = zip(*example1) 
print(a,b)

[(1, 3), (2, 4)]
<class 'list'>
(1, 2) (3, 4)


In [3]:
# zip lists of different lengths
example2=zip(a,c)
print(example2) # zip object
print(tuple(example2)) # Python simply ignores the remaining elements of the longer list

<zip object at 0x00000140F3C25580>
((1, 3), (2, 4))


In [4]:
example3 = zip(a,b) 
print(tuple(example3))
# the iterator is now exhausted: calling tuple(example3) again would return ()
print(example3)

((1, 3), (2, 4))
<zip object at 0x00000140F3CC5780>
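Printing the `zip` object does not show that it is exhausted. A second conversion makes the point explicit, as in this small sketch:

```python
a = [1, 2]
b = [3, 4]

pairs = zip(a, b)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing left to yield
print(first_pass)
print(second_pass)

# next() pulls one element at a time from a fresh iterator
fresh = zip(a, b)
print(next(fresh))
```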


In [5]:
# display a readable version of the result
print(tuple(zip(a,b)))  # use the tuple() function
print([x for x in zip(*(zip(a,b)))]) # use list comprehension

((1, 3), (2, 4))
[(1, 2), (3, 4)]


In [39]:
print([x for x in zip(a,b)]) # use list comprehension

[(1, 3), (2, 4)]


In [None]:
import pandas as pd
example = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
#print(example)
rule={'cat': 'kitten', 'dog': 'puppy'}
#print(example.map(rule))
#print(example.map('I have a {}'.format))
#print(example.map('I have a {}'.format,na_action='ignore'))

#[Q. What are you trying to do here? ]
# I am trying to show the usage of `na_action` option.

In [11]:
import pandas as pd
example = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
example

0       cat
1       dog
2       NaN
3    rabbit
dtype: object

In [14]:
example.map(lambda x: x.find('a'),na_action='ignore') # nan will be ignored

0    1.0
1   -1.0
2    NaN
3    1.0
dtype: float64

In [None]:
#[Q. What is !head?]

chapter 4
* `!`
    * In Jupyter Notebook you can execute Terminal commands in the notebook cells by prepending an exclamation point/bang(`!`) to the beginning of the command. This can be useful for many things such as getting information without having to open a Terminal/Command Prompt, or installing a conda package you are trying to use.

For Linux and Unix:

`head`: Display the first lines of a file

In [None]:
#refer to https://ss64.com/osx/head.html
# it seems `head` also works for macOS

In [24]:
import copy
a = [1,2,[1,2]]  
print(a)
b = a # assign
c = a.copy() # shallow copy
d = copy.deepcopy(a) # deep copy
a[-1].append(3)
a.append(3)
print(a)
print(b) 
print(c)
print(d)

#[Q. I am still quite confused about the difference between shallow and deep copy... ]
#[Q. We may refer to some illustrations here at
# https://www.geeksforgeeks.org/copy-python-deep-copy-shallow-copy/
# or https://www.programiz.com/python-programming/shallow-deep-copy
# or especially : https://ithelp.ithome.com.tw/articles/10221255]

[1, 2, [1, 2]]
[1, 2, [1, 2, 3], 3]
[1, 2, [1, 2, 3], 3]
[1, 2, [1, 2, 3]]
[1, 2, [1, 2]]


* Attributes of Python Objects
    * Identity: the object's memory address.
    * Type: the kind of object that is created, for example integer, list, or string.
    * Value: the value stored by the object. For example, `List=[1,2,3]` holds the numbers 1, 2 and 3.
        * Identity and type cannot be changed once the object is created.

* Mutable and Immutable Data Types in Python
    * Objects whose value can change are said to be mutable. Objects whose value is unchangeable once they are created are called immutable.
    * Some common mutable types: list, dictionary, set, user-defined classes.
    * Some common immutable types: int, float, decimal, bool, string, tuple.
    * See also https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747

In [26]:
import numpy as np

# mutable
example = ['cat', 'dog', np.nan, 'rabbit']
id(example)

2539579867520

In [27]:
example[2]='bird' # change the value 
print(example)
id(example) # the identity will not change 

['cat', 'dog', 'bird', 'rabbit']


2539579867520

In [28]:
aa = 1
bb = aa # assign

In [29]:
print(id(aa))
print(id(bb))

140721832929056
140721832929056


In [30]:
aa += 1 # int is immutable: this rebinds aa to a new int object instead of changing the old one

In [31]:
print(id(aa)) # the id changed
print(id(bb))

140721832929088
140721832929056


In [32]:
aa # value updated

2

In [33]:
bb

1

* A shallow copy copies the outer-level items of the list; items nested inside the list are still shared by reference.
* Deep copy goes through all nested items and copies each single one of them.

* A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original.
* A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.
* Therefore, as the identity of a mutable object will not change even when its value changes, the value of the nested items in a shallow copy will change accordingly. On the contrary, a deep copy will not be influenced by any changes.

In [None]:
import copy
a = [1,2,[1,2]]  
print(a)
b = a # assign
c = a.copy() # shallow copy
d = copy.deepcopy(a) # deep copy
a[-1].append(3) # change the value of the nested mutable object (the inner list)
a.append(3)  # change the original object (append an object to the end of the outer list)
print('assign:', b)
print('shallow copy:', c)
print('deep copy:', d)
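Identity checks with `is` make the difference visible without mutating anything. This is a small sketch using the same list as above:

```python
import copy

a = [1, 2, [1, 2]]
b = a                  # assignment: same object
c = a.copy()           # shallow copy: new outer list, shared inner list
d = copy.deepcopy(a)   # deep copy: new outer list and new inner list

print(b is a)          # True
print(c is a)          # False
print(c[2] is a[2])    # True: the nested list is shared
print(d[2] is a[2])    # False: the nested list was copied
```

Because `c[2]` and `a[2]` are the same object, appending to the nested list through `a` is visible through `c`, while appending to the outer list `a` is not.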

In [None]:
#[Q. The statistics look the same, don't they?]

![memory](memory_usage.png)


![memory](memory_usage_after.png)


In [None]:
#[Q. What is the point of loading this file if we can download the data online?]
# no difference except for the speed

# In our case, however, the original dataset contains no missing values and the categorical variables are already encoded as numbers.
# To show the entire pipeline of working with and preparing potentially messy data, we applied some transformations.
# If we uploaded the transformed data online, we could import it directly and would not need to store the file.

In [81]:
df=pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls', index_col=0, na_values='')

In [90]:
df

Unnamed: 0,limit_bal,sex,education,marriage,age,payment_status_sep,payment_status_aug,payment_status_jul,payment_status_jun,payment_status_may,...,bill_statement_jun,bill_statement_may,bill_statement_apr,previous_payment_sep,previous_payment_aug,previous_payment_jul,previous_payment_jun,previous_payment_may,previous_payment_apr,default_payment_next_month
0,20000,Female,University,Married,24.0,Payment delayed 2 months,Payment delayed 2 months,Payed duly,Payed duly,Unknown,...,0,0,0,0,689,0,0,0,0,1
1,120000,Female,University,Single,26.0,Payed duly,Payment delayed 2 months,Unknown,Unknown,Unknown,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,Female,University,Single,34.0,Unknown,Unknown,Unknown,Unknown,Unknown,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,Female,University,Married,37.0,Unknown,Unknown,Unknown,Unknown,Unknown,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,Male,University,Married,57.0,Payed duly,Unknown,Payed duly,Unknown,Unknown,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,220000,,High school,Married,39.0,Unknown,Unknown,Unknown,Unknown,Unknown,...,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,150000,Male,High school,Single,43.0,Payed duly,Payed duly,Payed duly,Payed duly,Unknown,...,8979,5190,0,1837,3526,8998,129,0,0,0
29997,30000,Male,University,Single,37.0,Payment delayed 4 months,Payment delayed 3 months,Payment delayed 2 months,Payed duly,Unknown,...,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,80000,Male,High school,Married,41.0,Payment delayed 1 month,Payed duly,Unknown,Unknown,Unknown,...,52774,11855,48944,85900,3409,1178,1926,52964,1804,1


In [85]:
df = pd.read_csv('credit_card_default.csv', index_col=0, na_values='')

In [88]:
#df.describe(include='object').transpose()
# [Q. What does (include='object') mean?]
# get summary statistics only for categorical variables

In [89]:
df.dtypes

limit_bal                       int64
sex                            object
education                      object
marriage                       object
age                           float64
payment_status_sep             object
payment_status_aug             object
payment_status_jul             object
payment_status_jun             object
payment_status_may             object
payment_status_apr             object
bill_statement_sep              int64
bill_statement_aug              int64
bill_statement_jul              int64
bill_statement_jun              int64
bill_statement_may              int64
bill_statement_apr              int64
previous_payment_sep            int64
previous_payment_aug            int64
previous_payment_jul            int64
previous_payment_jun            int64
previous_payment_may            int64
previous_payment_apr            int64
default_payment_next_month      int64
dtype: object

Notes:
* Comparison of conda and pip

|     |conda | pip |
| --- | --- | --- |
| manages | binaries | wheel or source |
| can require compilers | no | yes |
| package types | any | Python-only |
| create environment | yes, built-in | no, requires virtualenv |
| dependency checks | yes | no |
| package sources | Anaconda repo and cloud | PyPI |


In [None]:
#[Q. What is sparse?]

In [13]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [14]:
X = [['Male', 1], ['Female', 3],['Female', 2]]

In [21]:
eg = OneHotEncoder(handle_unknown='ignore',sparse=True)
eg.fit(X)
eg.transform([['Female', 1], ['Male', 4],['Male', 1]]).toarray()

array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0.]])

In [25]:
eg.get_feature_names()

array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)

In [22]:
eg.transform([['Female', 1], ['Male', 4],['Male', 1]]).get_features()

AttributeError: get_features not found

`sparse`: *bool, default=True*. Returns a sparse matrix if `True`, otherwise a dense array. (In scikit-learn 1.2 this parameter was renamed `sparse_output`.)

In [15]:
eg = OneHotEncoder(handle_unknown='ignore',sparse=True)
eg.fit(X)
type(eg.transform([['Female', 1], ['Male', 4],['Male', 1]]))

scipy.sparse.csr.csr_matrix

In [16]:
eg = OneHotEncoder(handle_unknown='ignore',sparse=False)
eg.fit(X)
type(eg.transform([['Female', 1], ['Male', 4],['Male', 1]]))

numpy.ndarray

In [None]:
#[Q. What does "one_hot" here do?]

In [None]:
one_hot_transformer = ColumnTransformer(
    [("one_hot", one_hot_encoder, ['sex'])] 
)
# "one_hot" is the name of the transformer object and you can change it 

# ("one_hot", one_hot_encoder, ['sex'])
# list of tuples: List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.


In [None]:
#[Q. Can you tell me a bit more about what the Category Encoders can do here? What is the benefit of using it instead of the original encoder? How does the smoothing work here?]

`OneHotEncoder` in `category_encoders`

* automatically encodes only the columns containing strings

* returns a `pandas` DataFrame with the adjusted column names

In [17]:
import category_encoders as ce
X = [['Male', 1], ['Female', 3],['Female', 2]]

In [35]:
# create the encoder object
one_hot_encoder_ce = ce.OneHotEncoder(use_cat_names=True)
one_hot_encoder_ce.fit(X)
one_hot_encoder_ce.transform([['Female', 1], ['Male', 4],['Male', 1]]) # return a data frame


Unnamed: 0,0_Male,0_Female,1
0,0,1,1
1,1,0,4
2,1,0,1


In [36]:
one_hot_encoder_ce.get_feature_names()  # encode only categorical features (column containing strings)

['0_Male', '0_Female', 1]

In [37]:
from sklearn.preprocessing import OneHotEncoder

In [38]:
eg = OneHotEncoder(handle_unknown='ignore',sparse=False)
eg.fit(X)
eg.transform([['Female', 1], ['Male', 4],['Male', 1]]) # return an array

array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0.]])

In [39]:
eg.get_feature_names() # encode all features including numerical features

array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)

In [None]:
#[Q. How does the smoothing work here?]
# smoothing effect to balance categorical average vs prior. Higher value means stronger regularization. 
# The value must be strictly bigger than 0.

In [None]:
target_encoder.transform(X_train.sex).head()

#[Q. This is to fill up the missing values?]
# to encode "sex" feature

In [44]:
import category_encoders as ce
X = ['Male', 'Female', 'Female']
y = [1,0,1]
eg= ['Female', 'Male', 'Male'] 

In [48]:
target_encoder = ce.TargetEncoder(smoothing=0) 
target_encoder.fit(X, y)
target_encoder.transform(eg)


Unnamed: 0,0
0,0.5
1,0.666667
2,0.666667


In [54]:
target_encoder = ce.TargetEncoder(smoothing=1) 
target_encoder.fit(X, y)
target_encoder.transform(eg)


Unnamed: 0,0
0,0.544824
1,0.666667
2,0.666667
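The numbers above are consistent with a sigmoid-weighted blend of the category mean and the global prior. The sketch below assumes that formula, with `min_samples_leaf=1`. This is an assumption about the `category_encoders` version that produced these outputs, and newer releases may use different defaults. `smoothed_target_mean` is a hypothetical helper, not a library function.

```python
import math

def smoothed_target_mean(category_mean, category_count, prior,
                         smoothing, min_samples_leaf=1):
    """Blend a category's target mean with the global prior (assumed formula)."""
    weight = 1.0 / (1.0 + math.exp(-(category_count - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + category_mean * weight

# 'Female' appears twice with targets 0 and 1; the overall target mean is 2/3
prior = (1 + 0 + 1) / 3
female = smoothed_target_mean(category_mean=0.5, category_count=2,
                              prior=prior, smoothing=1.0)
print(round(female, 6))
```

A larger `smoothing` pulls the weight towards 0.5, so the encoded value moves further from the category mean and closer to the prior. In the outputs above, 'Male' occurs only once in the training data and is encoded with the prior itself (0.666667).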


In [None]:
# https://www.section.io/engineering-education/custom-transformer/
# https://deepnote.com/@abalaji/How-to-use-custom-sklearn-classes-and-pipelines-N1zNlrxpRE6PlKGbEzLU-Q

## pipeline-transformer

pipeline: actions

transformer: action and target

you can put a transformer into a pipeline


## steps:
1. num pipeline(simple imputer)

2. cat pipeline(imputer, onehot)

3. define a `ColumnTransformer` to hold the two pipelines and specify the columns for each transformation

4. define a joint pipeline (transformer, classifier)

5. fit the pipeline to the data 

6. evaluate the performance

A `Pipeline` lets you run multiple operations on an input dataset in succession before an estimator is fitted. It was originally designed as a linear, step-by-step transformation template, but there are now additional tools in the pipeline toolkit (such as `ColumnTransformer`) that allow transformed columns to be joined side by side.
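The six steps listed above can be sketched with scikit-learn as follows. The toy DataFrame, the column names, the imputation strategies and the choice of a decision tree are all assumptions made for illustration; they are not the dataset or model used in the chapter.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with missing values in both column types
df = pd.DataFrame({
    'age': [24, 26, np.nan, 37, 57, 39, 43, 41] * 5,
    'sex': ['Female', 'Female', 'Male', np.nan, 'Male', 'Female', 'Male', 'Male'] * 5,
    'default': [1, 1, 0, 0, 0, 0, 0, 1] * 5,
})
X, y = df[['age', 'sex']], df['default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 1. numerical pipeline: impute missing values
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median'))])

# 2. categorical pipeline: impute, then one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# 3. column transformer: which pipeline applies to which columns
preprocessor = ColumnTransformer([
    ('numerical', num_pipeline, ['age']),
    ('categorical', cat_pipeline, ['sex']),
])

# 4. joint pipeline: preprocessing followed by a classifier
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42)),
])

# 5. fit the pipeline, 6. evaluate it on held-out data
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Keeping the preprocessing inside the pipeline means the imputers and the encoder are fitted on the training data only and then reused unchanged on the test data.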