# Estimation of PDF

* The CDF of the random variable X is estimated to be of the form,
\begin{equation}
F(x) = \begin{cases}
0             & \text{if } x < 0 \\
\frac{x^3}{a} & \text{if } x \in [0,3)\\
\frac{x-2}{b} & \text{if } x \in [3,5)\\
1  & \text{if } x\geq5
\end{cases}
\end{equation}

* Taking the derivative of the CDF gives the PDF of the distribution.
\begin{equation}
f(x) = \begin{cases}
\frac{3x^2}{a} & \text{if } x \in [0,3)\\
\frac{1}{b} & \text{if } x \in [3,5)\\
0             & \text{otherwise} 
\end{cases}
\end{equation}
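
As a quick sanity check of the differentiation step, the sketch below compares a central finite difference of the CDF with the stated PDF at a few interior points. The function names `cdf` and `pdf` are hypothetical helpers, and the values `a = 81`, `b = 3` are chosen only for illustration; they are not estimates at this stage.

```python
def cdf(x, a, b):
    """Piecewise CDF F(x) from the text, for given a and b."""
    if x < 0:
        return 0.0
    if x < 3:
        return x**3 / a
    if x < 5:
        return (x - 2) / b
    return 1.0


def pdf(x, a, b):
    """Piecewise PDF f(x) obtained by differentiating the CDF."""
    if 0 <= x < 3:
        return 3 * x**2 / a
    if 3 <= x < 5:
        return 1 / b
    return 0.0


a, b = 81.0, 3.0  # illustrative values only (assumption)
h = 1e-6          # step size for the central difference

for x in (0.5, 1.5, 2.5, 3.5, 4.5):
    slope = (cdf(x + h, a, b) - cdf(x - h, a, b)) / (2 * h)
    print(x, slope, pdf(x, a, b))
```

The two printed columns should agree closely at every point away from the breakpoints $x = 0, 3, 5$.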

* There are two unknowns, $a$ and $b$, so two conditions are needed to determine them. The first comes from the axiom of probability that the total probability over the entire sample space equals one.
\begin{equation}
\int_{-\infty}^{\infty} f(x) \,dx = \int_{0}^{3} \frac{3x^2}{a} \,dx + \int_{3}^{5} \frac{1}{b} \,dx  = 1
\end{equation}
* Solving the above equation gives,
$$ \frac{1}{a} = \frac{1}{27} - \frac{2}{27b}$$
* Substituting this expression for $1/a$ into $f(x)$ eliminates $a$,
\begin{equation}
f(x) = \begin{cases}
\frac{x^2}{9}\left(1 - \frac{2}{b}\right) & \text{if } x \in [0,3)\\
\frac{1}{b}& \text{if } x \in [3,5)\\
0             & \text{otherwise} 
\end{cases}
\end{equation}
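
The one-parameter form should integrate to one for every admissible $b$. The sketch below checks this in exact arithmetic using the closed-form integrals of each piece; `total_probability` is a hypothetical helper name, and the tested values of `b` are arbitrary.

```python
from fractions import Fraction


def total_probability(b):
    """Total mass of the one-parameter PDF, integrated in closed form."""
    b = Fraction(b)
    # Integral of (x^2 / 9) * (1 - 2/b) over [0, 3) is (3^3 / 27) * (1 - 2/b).
    mass_cubic = Fraction(3**3, 27) * (1 - Fraction(2) / b)
    # Integral of 1/b over [3, 5) is (5 - 3) / b.
    mass_flat = Fraction(5 - 3) / b
    return mass_cubic + mass_flat


for b in (3, 4, 10):  # arbitrary test values with b > 2
    print(b, total_probability(b))
```
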
* Only $b$ remains to be estimated. Any estimation method could be applied to the observed dataset $\mathcal{D}$; here we use maximum likelihood.


* Let $n_1$ and $n_2$ be the numbers of samples observed in the intervals $[0,3)$ and $[3,5)$ respectively.
* The likelihood function can be expressed as,
\begin{align}
\mathcal{L}(b;\mathcal{D}) &=\left( \prod_{i:0\leq x_i <3}\frac{x_i^2}{9}\left(1 - \frac{2}{b}\right) \right) \left( 
                   \prod_{i:3\leq x_i <5} \frac{1}{b}\right) \\
                &= \left( \prod_{i=1}^{n_1}\frac{x_{i}^{2}}{9}\left(1 - \frac{2}{b}\right) \right) \left(\frac{1}{b}\right)^{n_2}
\end{align}
* Taking the logarithm of both sides gives,
\begin{align}
\log{\mathcal{L}(b;\mathcal{D})} &= \sum_{i=1}^{n_1} \log{\left(\frac{x_{i}^{2}}{9}\left(1 - \frac{2}{b}\right) \right)} - n_2 \log{b}\\
                                 &= \sum_{i=1}^{n_1} \left(\log{\frac{x_{i}^{2}}{9}} + \log{\left(1 - \frac{2}{b}\right)} \right)  - n_2 \log{b} \\
                                 &= n_1 \log{\left(1 - \frac{2}{b}\right)}  - n_2 \log{b}  +\sum_{i=1}^{n_1}\left(\log{\frac{x_{i}^{2}}{9}}\right)
\end{align}
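
The final expression for the log-likelihood can be evaluated directly. The following is a minimal sketch, assuming the samples are available as a plain list of floats; `log_likelihood` and the sample values are hypothetical and are not taken from the dataset.

```python
import math


def log_likelihood(b, xs):
    """Log-likelihood of b for samples xs under the one-parameter PDF."""
    if b <= 2:
        raise ValueError("b must exceed 2 so the density on [0, 3) is positive")
    low = [x for x in xs if 0 <= x < 3]      # samples in [0, 3)
    n1 = len(low)
    n2 = sum(1 for x in xs if 3 <= x < 5)    # samples in [3, 5)
    # Term that does not depend on b; a sample at exactly x = 0 has zero
    # density, so math.log would raise ValueError for it.
    const = sum(math.log(x * x / 9) for x in low)
    return n1 * math.log(1 - 2 / b) - n2 * math.log(b) + const


sample = [0.5, 1.2, 2.8, 3.1, 4.0, 4.9]  # hypothetical data
for b in (2.5, 3.0, 4.0, 6.0):
    print(b, log_likelihood(b, sample))
```
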
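* Maximising the log-likelihood requires its derivative with respect to $b$ to vanish. Using $\frac{d}{db}\log{\left(1 - \frac{2}{b}\right)} = \frac{2}{b(b-2)}$,
\begin{align*}
\frac{d}{db}\log{\mathcal{L}(b;\mathcal{D})} &= 0\\
\implies \frac{2n_1}{b(b-2)} - \frac{n_2}{b} &= 0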
\end{align*}
* Solving for $b$ (with $b \neq 0$ and $b \neq 2$) gives the estimate,
$$\hat{b} = 2\left(1 + \frac{n_1}{n_2}\right)$$
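
The closed-form estimate can be cross-checked against a brute-force search over the part of the log-likelihood that depends on $b$. This is only a sketch: `mle_b` and `profile` are hypothetical helper names, and the counts `n1 = 30`, `n2 = 70` are made up for illustration.

```python
import math


def mle_b(n1, n2):
    """Closed-form estimate b_hat = 2 * (1 + n1 / n2); requires n2 > 0."""
    if n2 <= 0:
        raise ValueError("n2 must be positive")
    return 2 * (1 + n1 / n2)


def profile(b, n1, n2):
    """b-dependent part of the log-likelihood (the sum over x_i is constant in b)."""
    return n1 * math.log(1 - 2 / b) - n2 * math.log(b)


n1, n2 = 30, 70  # hypothetical counts (assumption)
b_hat = mle_b(n1, n2)

# Grid search over b in (2, 12) with step 0.01 as an independent check.
grid = [2.01 + 0.01 * k for k in range(1000)]
b_grid = max(grid, key=lambda b: profile(b, n1, n2))

print(b_hat, b_grid)
```

The grid maximiser should land within one grid step of the closed-form value, since the derivative is positive below $\hat{b}$ and negative above it.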


In [1]:
#Importing the libraries to be used.
import pandas as pd

In [2]:
df = pd.read_csv(r'Q1.csv')

In [3]:
df1 = df[(df['X'] >= 0) & (df['X'] < 3)]   # Samples in the interval [0,3).
df2 = df[(df['X'] >= 3) & (df['X'] < 5)]   # Samples in the interval [3,5).
n_1, n_2 = len(df1), len(df2)

In [4]:
print('The number of samples in the interval [0,3): {} '.format(n_1))
print('The number of samples in the interval [3,5): {} '.format(n_2))

The number of samples in the interval [0,3): 3360 
The number of samples in the interval [3,5): 6640 


* Substituting the values of $n_1$ and $n_2$ into the expression for $\hat{b}$ gives the estimated parameter,
\begin{equation*}
\hat{b} = 2\left(1 + \frac{3360}{6640} \right) \approx 3.012 \approx 3
\end{equation*}
* Therefore, taking $\hat{b} \approx 3$, the estimated **PDF** of the distribution is given by,
\begin{equation}
f(x) = \begin{cases}
\frac{x^2}{27} & \text{if } x \in [0,3)\\
\frac{1}{3}& \text{if } x \in [3,5)\\
0             & \text{otherwise} 
\end{cases}
\end{equation}
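
As a consistency check on the result, the sketch below compares the probability that the fitted model assigns to $[0,3)$ with the fraction of samples observed there, using the counts reported above. It uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

n1, n2 = 3360, 6640  # sample counts reported in the notebook output

# Exact maximum likelihood estimate, kept as a fraction.
b_hat = 2 * (1 + Fraction(n1, n2))

# P(0 <= X < 3) is the integral of (x^2 / 9) * (1 - 2/b) over [0, 3), i.e. 1 - 2/b.
mass_low_mle = 1 - 2 / b_hat             # using the unrounded estimate
mass_low_rounded = 1 - Fraction(2, 3)    # using the rounded value b = 3
observed = Fraction(n1, n1 + n2)         # empirical fraction of samples in [0, 3)

print(float(b_hat))
print(float(mass_low_mle), float(mass_low_rounded), float(observed))
```

With the unrounded estimate the model mass on $[0,3)$ equals the observed fraction $n_1/(n_1+n_2)$ exactly; rounding $\hat{b}$ to 3 shifts it slightly to $1/3$.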