# The Log-Logistic Distribution

Welcome to our Kaggle project! Today, we will attempt to further our understanding of the log-logistic distribution by fiddling around with various sliders and graphs! You might need to click the large black button in the upper right-hand corner that says Save and Edit in order to get all the functionality out of this document. You'll need to run each cell in order. To run a cell, select it and press Shift+Enter simultaneously.

Below is just a lot of code, don't worry!



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting library
import scipy
import random

from scipy.stats import beta as betafunction # Just want to try something...

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

By the way, these are markdown cells – they're basically just presentable text. They will provide instructions to the readers (that's you) throughout the lesson. First, we need to run a bunch of cells to initialize our plotting.

In [None]:
#Takes in an array x, and alpha, beta parameter values, and makes sure they are valid
def log_logistic_test_PDF(x, alpha, beta):
    #The idiot loop
    if beta <=0:
        print("Beta is either 0, or less than 0 and that's not allowed. Stop it.")
        return False
    if alpha <=0:
        print("The function is undefined for values of alpha that are less than 0, or equal to 0. I'm going to have to ask you to stop.")
        return False
    
    pdf = []
    for i in x:
        if i<0:
            print("Warning: cannot compute a negative probability")
        pdf.append(log_logistic_evalPDF(i, alpha, beta))
    return pdf

In [None]:
#Takes in an array x, and alpha, beta parameter values, and makes sure they are valid
def log_logistic_test_CDF(x, alpha, beta):
    #The idiot loop
    if beta <=0:
        print("Beta is either 0, or less than 0 and that's not allowed. Stop it.")
        return False
    if alpha <=0:
        print("The function is undefined for values of alpha that are less than 0, or equal to 0. I'm going to have to ask you to stop.")
        return False
    
    cdf = []
    for i in x:
        if i<0:
            print("Warning: cannot compute a negative probability")
        cdf.append(log_logistic_evalCDF(i, alpha, beta))
    return cdf

In [None]:
#Evaluates the PDF at a point, x
def log_logistic_evalPDF(x, alpha, beta):
    if beta >=1:
        val = ((beta/alpha)*(x/alpha)**(beta-1))/((1+(x/alpha)**beta)**2)
    elif (beta > 0 and beta < 1):
        # Treat discontinuities at 0
        if(x==0):
            return np.inf
        else:
            try:
                power = 1-beta
                val = ((beta/alpha))/((x/alpha)**(power)*(1+(x/alpha)**beta)**2)
            except Exception as err:  # `Error` is not a built-in name, so catch Exception instead
                print("Caught an error: {0}. Returning infinity.".format(err))
                return np.inf
    return val

In [None]:
# Import all the interaction widgets
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [None]:
#Evaluates the CDF at a point, x
def log_logistic_evalCDF(x, alpha, beta): 
    if beta==0:
        print("Error, beta cannot be equal to 0.")
        return
    elif beta >= 1: 
        val=(x**beta)/((alpha**beta)+(x**beta))
    else: # For beta between 0 and 1
        if (x==0): # Account for numpy's weird behaviour at 0
            return 0
        else:
            val=(x**beta)/((alpha**beta)+(x**beta))
    return val

In [None]:
def moment(alpha, beta, n):
    try:
        if(beta > n):
            return alpha**n*(n*np.pi/beta)/(np.sin(n*np.pi/beta))
        else:
            return np.inf
    except Exception as err:
        return np.inf
    
# Not sure yet if I can generate arbitrary moments, or just one at a time
def moment_generator(alpha, beta):
    # From our four moments about the origin...
    mu1=moment(alpha, beta, 1)
    mu2=moment(alpha, beta, 2)
    mu3=moment(alpha, beta, 3)
    mu4=moment(alpha, beta, 4)
    
    #... we can compute the moments about the mean
    if(np.isfinite(mu1)):
      mean = mu1
      mean = np.around(mean, 3)
    else:
      mean = "Undefined"
    if(np.isfinite(mu2)):
      stdev = np.sqrt((mu2 - mu1**2))
      stdev = np.around(stdev, 3)
    else:
      stdev = "Undefined"
    if(np.isfinite(mu3)):
      skewness = (mu3 - 3*mu1*mu2 + 2*(mu1**3))/(stdev**3)  # third central moment uses mu1 cubed
      skewness = np.around(skewness, 3)
    else:
      skewness = "Undefined"
    if(np.isfinite(mu4)):
      kurtosis = ((mu4 - 4*mu1*mu3 + 6*(mu1**2)*mu2 - 3*(mu1**4))/(stdev**4))-3
      kurtosis = np.around(kurtosis, 3)
    else:
      kurtosis = "Undefined"

    return mean, stdev, skewness, kurtosis

First, we will use sliders to visualize how different values of our scale parameter alpha ($\alpha$) and our shape parameter beta ($\beta$) affect the pdf and the cdf of the log-logistic distribution. Throughout these exercises, try to evaluate what alpha and beta do. What happens when either is large? Small? How does each affect kurtosis and skewness? Does one parameter seem to dominate over the other?

For reference, the pdf of the log-logistic distribution is given by the following function:
$$ f(x;\alpha,\beta)=\frac{(\beta/\alpha)(x/\alpha)^{\beta-1}}{(1+(x/\alpha)^\beta)^2} $$

Conveniently, our scale parameter $\alpha$ gives the median of this distribution.
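As a quick sanity check on the formula above, here is a minimal sketch that compares it against `scipy.stats.fisk`, which is SciPy's name for the log-logistic distribution. The helper name `loglogistic_pdf` and the parameter values are our own choices for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
from scipy.stats import fisk

def loglogistic_pdf(x, alpha, beta):
    """Closed-form log-logistic PDF for x > 0 (illustrative helper)."""
    z = (x / alpha) ** beta
    return (beta / alpha) * (x / alpha) ** (beta - 1) / (1 + z) ** 2

alpha, beta = 2.0, 3.0            # arbitrary example values
x = np.linspace(0.1, 10, 50)

# SciPy takes the shape parameter as `c` and the scale through `scale`
matches_scipy = np.allclose(loglogistic_pdf(x, alpha, beta),
                            fisk.pdf(x, beta, scale=alpha))
median = fisk.median(beta, scale=alpha)

print("Formula agrees with scipy.stats.fisk:", matches_scipy)
print("Median:", median, "(alpha was", alpha, ")")
```

If the two agree, the median printed on the last line should coincide with the value of alpha you picked.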



In [None]:
# Creates a PDF we can interact with!
def log_logistic_interactionPDF(alpha, beta):
    fig, ax = plt.subplots(figsize=(12, 6))
    
    #Concatenate two arrays to improve resolution near 0
    x = np.concatenate((np.linspace(0,0.1,10), np.linspace(0.1,10,100)), axis=None)
    pdf = log_logistic_test_PDF(x, alpha, beta)
    
    ax.plot(x, pdf)
    
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.xlim([0, 10])
    plt.ylim([0, 2])
    plt.title("Probability Distribution")
    
    moments = moment_generator(alpha, beta)
    plt.text(6, 1.8, "Mean = {0}".format(moments[0]))
    plt.text(6, 1.6, "Standard Deviation = {0}".format(moments[1]))
    plt.text(6, 1.4, "Skewness = {0}".format(moments[2]))
    plt.text(6, 1.2, "Kurtosis = {0}".format(moments[3]))
    fig.patch.set_facecolor('white') # we want to see the axis values in dark mode
    #plt.plot(x, pdf)
    #plt.yscale("log") # The log scale makes it difficult to interpret these graphs...

As a first exercise, let's play around with some sliders, and see how they affect the log-logistic PDF. As you move the sliders, try making observations about how the graph changes – how do alpha and beta affect skewness and kurtosis? What happens when alpha and beta are big or small?

You can change the values with the sliders, or just input the values you desire in the textbox beside the sliders. Explore what happens to the distribution as each parameter becomes large or small.

Take note of any observations you are able to make about the behaviour of log-logistic PDFs.



In [None]:
interact(log_logistic_interactionPDF, alpha=widgets.FloatSlider(min=0.1, max=10, step=0.1, value=1), beta=widgets.FloatSlider(min=0.1, max=10, step=0.1, value=1))

Notice that the mean is undefined for beta no greater than 1. Why do you think that is? More subtly, the standard deviation is undefined for beta no greater than 2. As a hint, the $r^\text{th}$ moment of the log-logistic distribution (about the origin) exists only for $\beta>r$ and is given by $$\mu_r=\alpha^r\frac{r\pi/\beta}{\sin(r\pi/\beta)}$$
The four quantities we are primarily interested in are the Mean, Standard Deviation, Skewness, and Kurtosis. As in Section 3.6 of the textbook, the second, third, and fourth moments *about the mean* are given by:
$$\mu_2'=\mu_2-{\mu_1}^2$$
$$\mu_3'=\mu_3-3\mu_1\mu_2+2{\mu_1}^3 $$
$$ \mu_4' = \mu_4-4\mu_1\mu_3+6{\mu_1}^2\mu_2-3{\mu_1}^4$$
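To see these formulas at work, the sketch below computes the raw moments from the closed form $\alpha^r\,(r\pi/\beta)/\sin(r\pi/\beta)$, checks one of them against numerical integration of the PDF, and then builds the skewness and excess kurtosis from the central moments. The function names and the example values of alpha and beta are assumptions made for this illustration only.

```python
import numpy as np
from scipy.integrate import quad

def raw_moment(alpha, beta, r):
    """r-th moment about the origin; finite only when beta > r."""
    if beta <= r:
        return np.inf
    t = r * np.pi / beta
    return alpha**r * t / np.sin(t)

def loglogistic_pdf(x, alpha, beta):
    return (beta / alpha) * (x / alpha) ** (beta - 1) / (1 + (x / alpha) ** beta) ** 2

def skew_and_excess_kurtosis(alpha, beta):
    """Standardized third and fourth central moments (needs beta > 4)."""
    m1, m2, m3, m4 = (raw_moment(alpha, beta, r) for r in (1, 2, 3, 4))
    var = m2 - m1**2
    c3 = m3 - 3 * m1 * m2 + 2 * m1**3
    c4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4
    return c3 / var**1.5, c4 / var**2 - 3

alpha, beta = 2.0, 8.0  # arbitrary example values with beta > 4
numeric_m2, _ = quad(lambda x: x**2 * loglogistic_pdf(x, alpha, beta), 0, np.inf)
print("Second raw moment, formula vs integration:", raw_moment(alpha, beta, 2), numeric_m2)

# Changing only the scale parameter should leave both statistics unchanged
print("alpha = 1:", skew_and_excess_kurtosis(1.0, beta))
print("alpha = 5:", skew_and_excess_kurtosis(5.0, beta))
```

The last two lines are a useful experiment: skewness and kurtosis are standardized, so rescaling $x$ through $\alpha$ should not move them.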


Now that you've had a chance to investigate the parameters for yourself, let's discuss what they represent, for good measure! Beta controls the skewness. Smaller values of $\beta$ pile the probability up near zero and leave a long tail to the right, so the distribution is strongly right-skewed. As $\beta$ increases, the peak moves toward $\alpha$ and the distribution becomes more symmetric.

Meanwhile, alpha is a scale parameter. As $\alpha$ grows larger, the distribution stretches along the $x$-axis, so the peak becomes lower and the tail longer. Because $\alpha$ only rescales $x$, the skewness and kurtosis statistics themselves depend on $\beta$ alone.

Notice that as $\beta$ gets large, the mean of the distribution tends to $\alpha$.
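A short numerical illustration of this limit, assuming the closed-form mean $\alpha\,(\pi/\beta)/\sin(\pi/\beta)$ for $\beta>1$ (the helper name `loglogistic_mean` is ours):

```python
import numpy as np

def loglogistic_mean(alpha, beta):
    """Mean of the log-logistic distribution; defined only for beta > 1."""
    b = np.pi / beta
    return alpha * b / np.sin(b)

alpha = 2.0  # arbitrary example value
for beta in (1.5, 2, 5, 20, 100):
    print("beta =", beta, " mean =", loglogistic_mean(alpha, beta))
```

Since $t/\sin t \to 1$ as $t \to 0$, the printed means should shrink toward alpha as beta grows.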


We can also look at the cumulative distribution. Once again, run the cells below and explore the limiting cases of each parameter. The cdf is given by the following formula: $$ F(x;\alpha,\beta)=\frac{x^\beta}{\alpha^\beta+x^\beta} $$

where $\alpha$ and $\beta$ are the same scale and shape parameters as before. Fiddle around, and try to take note of any other observations.

In [None]:
# Makes an interactive CDF
def log_logistic_interactionCDF(alpha, beta):
    fig, ax = plt.subplots(figsize=(12, 6))
    
    #Concatenate two arrays to improve resolution near 0
    x = np.concatenate((np.linspace(0,0.1,100), np.linspace(0.1,10,100)), axis=None)
    cdf = log_logistic_test_CDF(x, alpha, beta)
    
    ax.plot(x, cdf)
    
    plt.xlabel("x")
    plt.ylabel("F(x)")
    plt.xlim([0, 10])
    plt.ylim([0, 2])
    plt.title("Cumulative Distribution")
    fig.patch.set_facecolor('white') # we want to see the axis values in dark mode
    #plt.plot(x, pdf)
    #plt.yscale("log")

In [None]:
interact(log_logistic_interactionCDF, alpha=widgets.FloatSlider(min=0.1, max=10, step=0.1, value=1), beta=widgets.FloatSlider(min=0.1, max=10, step=0.1, value=1))
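Because the CDF formula above is so simple, it can be inverted by hand: solving $F(x)=p$ gives $x=\alpha\,(p/(1-p))^{1/\beta}$. The sketch below (helper names are our own) checks that round trip and confirms that the 50% point lands on alpha.

```python
import numpy as np

def loglogistic_cdf(x, alpha, beta):
    return x**beta / (alpha**beta + x**beta)

def loglogistic_quantile(p, alpha, beta):
    """Inverse of the CDF for 0 < p < 1."""
    return alpha * (p / (1 - p)) ** (1 / beta)

alpha, beta = 3.0, 2.5  # arbitrary example values
p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
x = loglogistic_quantile(p, alpha, beta)

print("Quantiles:", x)
print("CDF at those quantiles:", loglogistic_cdf(x, alpha, beta))  # should recover p
```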

Why not look at both side by side? In this next exercise, try to see how changes affect both the PDF and CDF, and how they relate to each other. Perhaps you will be able to make further observations about their behaviour.

In [None]:
# Displays both at the same time
def log_logistic_double_display(alpha, beta):
    fig = plt.figure(figsize=(12, 6))
    
    #Concatenate two arrays to improve resolution near 0
    x = np.concatenate((np.linspace(0,0.1,10), np.linspace(0.1,5,50)), axis=None)
    pdf = log_logistic_test_PDF(x, alpha, beta)
    cdf = log_logistic_test_CDF(x, alpha, beta)
    
    ax = fig.add_subplot(121)
    ax2 = fig.add_subplot(122)
    
    ax.plot(x, pdf)
    ax2.plot(x, cdf)
    
    ax.set_xlabel("x")
    ax.set_ylabel("f(x)")
    ax2.set_xlabel("x")
    ax2.set_ylabel("F(x)")
    
    ax.set_xlim([0, 5])
    ax.set_ylim([0, 2])
    ax.set_title("Probability Distribution")
    ax2.set_xlim([0, 5])
    ax2.set_ylim([0, 2])
    ax2.set_title("Cumulative Distribution")
    
    fig.patch.set_facecolor('white') # we want to see the axis values in dark mode

In [None]:
interact(log_logistic_double_display, alpha=widgets.FloatSlider(min=0.1, max=10, step=0.1, value=1), beta=widgets.FloatSlider(min=0.1, max=10, step=0.1, value=1))
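One relationship to look for in the side-by-side plots is that the PDF is the slope of the CDF. Here is a rough numerical check of that idea using a central finite difference; the helper names, the step size `h`, and the parameter values are illustrative assumptions.

```python
import numpy as np

def loglogistic_cdf(x, alpha, beta):
    return x**beta / (alpha**beta + x**beta)

def loglogistic_pdf(x, alpha, beta):
    return (beta / alpha) * (x / alpha) ** (beta - 1) / (1 + (x / alpha) ** beta) ** 2

alpha, beta = 1.5, 3.0          # arbitrary example values
x = np.linspace(0.5, 5, 10)
h = 1e-5                        # small step for the finite difference

# Central difference approximation to dF/dx
slope = (loglogistic_cdf(x + h, alpha, beta) - loglogistic_cdf(x - h, alpha, beta)) / (2 * h)
max_gap = np.max(np.abs(slope - loglogistic_pdf(x, alpha, beta)))
print("Largest gap between dF/dx and f(x):", max_gap)
```

This is why the CDF climbs fastest exactly where the PDF peaks.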

We hope this demonstration has shed some light on the behaviour of the log-logistic distribution!

## The log-logistic distribution in Economics: The Fisk Distribution

In Economics, the log-logistic distribution takes on new meaning as the Fisk Distribution, where it is used as a simple model of wealth distribution. Our shape parameter ($\beta$) takes on a new meaning too: the Gini coefficient is given by $G=\frac{1}{\beta}$, and is a measure of wealth dispersion. Essentially, $G$ assigns a value to the magnitude of income inequality in a country. A Gini coefficient of 0 denotes perfect equality, while a coefficient of 1 indicates complete inequality. Since $G=\frac{1}{\beta}$, we can only have perfect equality in the limit as $\beta$ tends to infinity, whereas complete inequality occurs when $\beta$ equals 1.

Sources:

https://corporatefinanceinstitute.com/resources/knowledge/economics/gini-coefficient/

https://uwaterloo.ca/canadian-index-wellbeing/what-we-do/domains-and-indicators/gini-coefficient-income-gap
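If you would like to convince yourself that $G=\frac{1}{\beta}$, here is a sketch of one way to check it numerically. It assumes the standard identity $G=\frac{1}{\mu}\int_0^\infty F(x)\,(1-F(x))\,dx$, where $\mu$ is the mean and $F$ the CDF; this identity is not derived in this notebook, and the function name is our own.

```python
import numpy as np
from scipy.integrate import quad

def gini_numeric(alpha, beta):
    """Gini coefficient via (1/mean) * integral of F(1 - F); needs beta > 1."""
    def f_times_one_minus_f(x):
        z = (x / alpha) ** beta
        return z / (1 + z) ** 2      # F(x) * (1 - F(x)) for the log-logistic CDF
    mean = alpha * (np.pi / beta) / np.sin(np.pi / beta)
    integral, _ = quad(f_times_one_minus_f, 0, np.inf)
    return integral / mean

for beta in (2.0, 3.0, 6.0):
    print("beta =", beta, " numeric G =", gini_numeric(1.0, beta), " 1/beta =", 1 / beta)
```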

Below is a data set that we will try to model with the Fisk distribution; we will use a $\chi^2$ method to evaluate our fit. If we are trying to fit $n$ data points, then we sum the squares of the residuals between our model and the data at each point $i\in\{1,2,...,n\}$ to obtain our $\chi^2$ value: $$ \chi^2 = \sum_{i=1}^n(x_{\text{model, }i} -  x_{\text{observed, }i})^2$$ This $\chi^2$ statistic is a way of evaluating the fit of an arbitrary curve to data points. A high $\chi^2$ value means that there is a significant difference between your model and the observed data, whereas a small one means your model describes the data relatively well. So, as statisticians, our goal is to choose parameters for our log-logistic distribution that minimize $\chi^2$.
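To make the sum of squared residuals concrete, here is a small worked sketch. The data points are the 2017:Q1 values that this notebook hardcodes in its fallback cell further down; the two (alpha, beta) pairs are arbitrary trial guesses, and the helper names are ours.

```python
import numpy as np

def loglogistic_cdf(x, alpha, beta):
    x = np.asarray(x, dtype=float)
    return x**beta / (alpha**beta + x**beta)

def sum_squared_residuals(model, observed):
    return float(np.sum((np.asarray(model) - np.asarray(observed)) ** 2))

# 2017:Q1 values copied from the notebook's hardcoded fallback cell
incbrak = [0.2, 0.4, 0.6, 0.8, 1.0]
cumnetworth = [0.71378521, 0.85827444, 0.9341023, 0.97637864, 1.0]

trial_results = {}
for alpha, beta in [(1.0, 1.0), (0.12, 2.0)]:   # two arbitrary trial guesses
    model = loglogistic_cdf(incbrak, alpha, beta)
    trial_results[(alpha, beta)] = sum_squared_residuals(model, cumnetworth)
    print("alpha =", alpha, "beta =", beta, "chi^2 =", trial_results[(alpha, beta)])
```

The guess with the smaller printed value is the better fit of the two.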



In [None]:
pandaData = pd.read_csv("../input/various-wealth-distributions/dfa-income-levels.csv")

pandaData

Note that the data set describes wealth. Along the $x$-axis, we have the income bracket, which lists the population by percentage in increasing order of wealth. The code below gives us the share of total wealth held by each income bracket, so adding them up gives us the points of the cumulative distribution (CDF) quite naturally. From here, your goal is to choose parameters for a log-logistic CDF, using the sliders below, that minimize $\chi^2$.


In [None]:
#incbrak = [0.10, 0.30, 0.50, 0.70, 0.895, 0.995] # Need to interpret our income brackets as x-values...
incbrak = [0.2, 0.4, 0.6, 0.8, 1] # Use this one. We combine the two lowest data bins
yrqrtr = 110 # we pick the quarter we want (pick between 0 and 124)

print("Quarter picked: " + pandaData['Date'][yrqrtr*6]) # prints the quarter we picked
nworth = []
for i in range(0, 6):
    nworth.append(pandaData['Net worth'][i+yrqrtr*6])
    
print("(Low to high) Net worth by income: " + str(nworth)) # prints net worth of various income brackets

totalnetworth=0

for i in range(0, len(nworth)):
    totalnetworth+=nworth[i]
print("Total net worth: " + str(totalnetworth)) # prints total wealth

nworthnormal=np.array(nworth)/totalnetworth # convert the list to an array so the division is element-wise

print("(Low to high) Net worth by income by %: " + str(nworthnormal)) # prints net worth of various income brackets as a % of total net worth

cumnetworth=[]
sumnetworth=nworthnormal[0]
for i in range (1, len(nworth)):
    sumnetworth+=nworthnormal[i]
    cumnetworth.append(round(sumnetworth,8))
    
print("(High to low) Cumulative net worth: " + str(cumnetworth))

If you are having difficulty loading the above data, then uncomment code below and run the cell. We have hardcoded the values for the wealth distribution of the first quarter of 2017 (2017:Q1) here for your convenience. To uncomment, delete the \# character at the start of each line.

In [None]:
# If you can't load the data, run this cell!
#incbrak = [0.2, 0.4, 0.6, 0.8, 1]
#cumnetworth = [0.71378521, 0.85827444, 0.9341023,  0.97637864, 1.]

In [None]:
def chi2(expected, measured):
    chi2 = 0
    for i in range(0, len(expected)):
        chi2 += (expected[i] - measured[i])**2
    return chi2

In [None]:
# This is the Fisk cumulative distribution function
def fisk_CDF(alpha, beta):
    fig, ax = plt.subplots(figsize=(12, 6))
    
    #Concatenate two arrays to improve resolution near 0
    x = np.concatenate((np.linspace(0,0.1,50), np.linspace(0.1,1,50)), axis=None)
    cdf = log_logistic_test_CDF(x, alpha, beta)
    # Need to evaluate the CDF at each point in the income bracket for our tests
    compare_points = log_logistic_test_CDF(incbrak, alpha, beta)
    
    ax.plot(x, cdf)
    ax.plot(incbrak, cumnetworth, 'r.')
    
    plt.xlabel("Fraction of population")
    plt.ylabel("Fraction of Wealth")
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.title("Cumulative Distribution. Current fit: Chi^2 = {0}".format(np.round(chi2(cumnetworth, compare_points), 4)))
    fig.patch.set_facecolor('white') # we want to see the axis values in dark mode

Below, you will try to minimize $\chi^2$ using the sliders for $\alpha$ and $\beta$. The red data points represent the (normalized) cumulative wealth of the population at that percentage of the total population. What is the lowest value of $\chi^2$ you can get? Would you say that this approximation describes the data well? What can you say about this graph's skewness and kurtosis? Also, try plugging your values of alpha and beta into the pdf, and see whether the result makes sense for our initial problem. Given this, what could you say about wealth equality or inequality in the United States? Recall that the Gini coefficient is $G=\frac{1}{\beta}$.

To get you started, recall that $\alpha$ gives the median of our distribution. Since we plot the income brackets as a fraction between 0 and 1, this probably means that $\alpha$ should be less than 1.


In [None]:
interact(fisk_CDF, alpha=widgets.FloatSlider(min=0.01, max=10, step=0.01, value=1), beta=widgets.FloatSlider(min=0.01, max=10, step=0.01, value=1))

This graph suggests things about the wealth distribution: $x$ percent of the population holds $F(x)$ percent of the wealth. For example, extrapolating based on the 2017:Q1 data, it appears that a mere 20\% of the population of the United States holds just over 50\% of the country's wealth.

Now that you have estimated alpha and beta, let's see how your answer compares to our algorithm, which computes $\chi^2$ for every combination of alpha and beta very quickly. Were you close? Far? If you were further off than what you would have thought, why? Below is our code to construct our best fit alpha and beta!

In [None]:
# to find our minimal chi^2
def minimizechi2(expected):
    # Brute-force grid search over alpha and beta in 0.01, 0.02, ..., 4.99.
    # Building the grid from integers avoids the drift of repeated += 0.01.
    grid = np.arange(1, 500) * 0.01
    
    minchi2value = np.inf
    minalpha = grid[0]
    minbeta = grid[0]
    
    for alpha in grid:
        for beta in grid:
            # Model CDF at each income bracket, compared with the data passed in
            compare_points = log_logistic_test_CDF(incbrak, alpha, beta)
            chi2value = chi2(expected, compare_points)
            
            if(chi2value <= minchi2value):
                minchi2value = chi2value
                minalpha = alpha
                minbeta = beta
        
    return minchi2value, minalpha, minbeta

In [None]:
minchi2, minAlpha, minBeta = minimizechi2(cumnetworth) #Just for this example specifically
minchi2 = np.round(minchi2, 5)
minAlpha = np.round(minAlpha, 2)
minBeta = np.round(minBeta, 2)
print("Minimum Chi-Squared Value: {0}".format(minchi2))
print("Minimum alpha: {0}".format(minAlpha))
print("Minimum beta: {0}".format(minBeta))
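As a cross-check on the grid search, a general-purpose optimizer can be pointed at the same $\chi^2$. The sketch below swaps in SciPy's Nelder-Mead routine, which is not the method used in this notebook, and runs it on the hardcoded 2017:Q1 fallback values so that it works even without the CSV file. The starting guess is an assumption, roughly what one might find with the sliders.

```python
import numpy as np
from scipy.optimize import minimize

# 2017:Q1 values copied from the notebook's hardcoded fallback cell
incbrak = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
cumnetworth = np.array([0.71378521, 0.85827444, 0.9341023, 0.97637864, 1.0])

def chi2_of_logs(log_params):
    # Search over log(alpha) and log(beta) so both parameters stay positive
    alpha, beta = np.exp(log_params)
    model = incbrak**beta / (alpha**beta + incbrak**beta)
    return np.sum((model - cumnetworth) ** 2)

start = np.log([0.15, 2.0])   # rough starting guess (an assumption)
result = minimize(chi2_of_logs, start, method="Nelder-Mead")
fit_alpha, fit_beta = np.exp(result.x)

print("alpha =", fit_alpha, "beta =", fit_beta, "chi^2 =", result.fun)
```

Unlike the grid search, this is not restricted to steps of 0.01, so small differences between the two answers are to be expected.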

Above, we list the best-fit values for $\alpha$ and $\beta$ to match the quarter you chose. Were you able to find something similar playing around with the sliders? If not, what did you get? Try plotting these values in the sliders below to see what the pdfs and cdfs look like! Do they look similar to the ones you plotted before knowing the exact answer? If it differs, what would you say about G here? How do you think the United States compares to other countries?

Feel free to also try out different quarters, and see what the Gini Coefficient ends up being at different times. Is there a big change, or does it stay relatively the same? Has income inequality gotten worse or better over the years? If you have the opportunity (and if our data loads correctly), try computing the Gini coefficient in 2008 – there was a pretty big housing crisis back then that made everyone upset. Did it have any impact on the wealth disparity in the United States?


## Examining the PDF from our wealth distribution fit

Below we have included another graph which compares the PDF to the CDF, this time at the best-fit values found by our little method. What do the kurtosis and skewness tell us about wealth distribution? Do these answers align with qualitative observations made from the graph?

Using what you know about the Gini coefficient, try to plot distributions for a country which has high income equality. Compare that to a model of a country with high inequality.

In [None]:
def log_logistic_double_display_wealth(alpha, beta):
    fig = plt.figure(figsize=(12, 6))
    
    #Concatenate two arrays to improve resolution near 0
    x = np.concatenate((np.linspace(0,0.2,50), np.linspace(0.2,1,50)), axis=None)
    pdf = log_logistic_test_PDF(x, alpha, beta)
    cdf = log_logistic_test_CDF(x, alpha, beta)

    ax = fig.add_subplot(121)
    ax2 = fig.add_subplot(122)
    
    ax.plot(x,pdf)
    ax2.plot(x, cdf)
    
    ax.set_xlabel("Income Bracket (%)")
    ax.set_ylabel("Change in Total Wealth")
    ax2.set_xlabel("Income Bracket (%)")
    ax2.set_ylabel("Total Wealth")
    
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 6])
    ax.set_title("Probability Distribution of Wealth by Income Bracket")
    ax2.set_xlim([0, 1])
    ax2.set_ylim([0, 1])
    ax2.set_title("Cumulative Distribution of Wealth by Income Bracket")
    
    fig.patch.set_facecolor('white') # we want to see the axis values in dark mode

In [None]:
interact(log_logistic_double_display_wealth, alpha=widgets.FloatSlider(min=0.01, max=10, step=0.01, value=minAlpha), beta=widgets.FloatSlider(min=0.01, max=10, step=0.01, value=minBeta))


Note that the log-logistic distribution is used to model skewed data: phenomena that rise rapidly at first and then decrease slowly. The PDF for this wealth distribution tells us something about the marginal wealth of the United States, that is, how much we expect wealth to change given a small change in income bracket. Wealth equality is easier to visualize with the CDF. Notice that the top 50% income bracket has around 83% of the wealth, and that the top 20% has 55% of the wealth. What about the bottom 20%, which has only around 5%?

Looking at the PDF, we have the idea of marginal wealth: it measures how wealth changes if we change the income bracket percentage slightly. In a way, this abstract idea tells us what happens to the total wealth of an income bracket relative to the one before it. Our result makes sense as well: even though the top 1% has the most wealth individually, there are far more people in the 80-99% bracket, so they hold more total wealth. This idea is hard to wrap our heads around because the number of people is discrete while our function is continuous, but the upshot is that "everyone" at around the top 15% level of income holds the most wealth.

The CDF is a lot more intuitive in this case, and it is also the one that is more widely used, since it shows how wealth is distributed, whereas the PDF shows how it changes.

Another thing we want to note is that, given our data, it is really hard to pin down a good value for $G$. Recall that $G=\frac{1}{\beta}$, so it depends entirely on $\beta$. However, the range of $\beta$ values that make for a reasonable model is very wide. For example, any value between $\beta=1.7$ and $\beta=2.3$ results in a $\chi^2$ value of less than 0.01. This makes $G$ range between 0.43 and 0.59, a pretty wide interval. For reference, the Gini coefficient of the United States is sitting at around 0.45.
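One way to explore how loosely the data pin down $\beta$ is to scan over it: for each $\beta$, keep the best $\chi^2$ over a grid of $\alpha$ values, then collect every $\beta$ that stays under the 0.01 threshold mentioned above. The sketch below does this on the hardcoded 2017:Q1 fallback values; the grids and the function name are our own choices, and the range it prints need not match the one quoted above.

```python
import numpy as np

# 2017:Q1 values copied from the notebook's hardcoded fallback cell
incbrak = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
cumnetworth = np.array([0.71378521, 0.85827444, 0.9341023, 0.97637864, 1.0])

def best_chi2_for_beta(beta, alphas):
    """Smallest chi^2 over the alpha grid, with beta held fixed."""
    x_pow = incbrak[None, :] ** beta
    model = x_pow / (alphas[:, None] ** beta + x_pow)   # one row per alpha
    return np.min(np.sum((model - cumnetworth) ** 2, axis=1))

alphas = np.arange(0.01, 1.0, 0.01)   # illustrative grid
betas = np.arange(1.0, 4.0, 0.05)     # illustrative grid
acceptable = [b for b in betas if best_chi2_for_beta(b, alphas) < 0.01]

gini_low, gini_high = 1 / max(acceptable), 1 / min(acceptable)
print("Acceptable beta range:", min(acceptable), "to", max(acceptable))
print("Implied Gini range:", gini_low, "to", gini_high)
```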