## **PROBABILITY DISTRIBUTIONS**

- Now, through some calculations, a fellow analyst of the company has arrived at the net revenue for each of these scenarios. She creates a probability distribution with this data:

 

| X (Net Revenue of Project, in ₹ crore) | P(x) |
|---|---|
| -305 | 0.1 |
| +15 | 0.7 |
| +95 | 0.2 |
 

- Now, you finally have a probability distribution for X, the net revenue of the project. Using this probability distribution, you can find the answer to our original question: “Can the company expect a profit from this project? Or, should it expect a loss?”. However, to answer this, you will have to learn the concept of expected value, which is what we will cover next.

#### **EXPECTED VALUE** ####

- So, the expected value for a variable X is the average value of X that we would “expect” to get if the experiment were repeated a very large number of times. It is also called the expectation, average or mean value. Mathematically speaking, for a random variable X that can take the values x1, x2, x3, ..., xn, the expected value (EV) is given by:

- EV(X) = x1*P(X=x1) + x2*P(X=x2) + x3*P(X=x3) + ... + xn*P(X=xn).
- As you may recall, for our red ball game, the expected value came out to be 2.385.
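
As a quick check of the formula, the sketch below applies it to the net revenue distribution of the project given in the table earlier. The function name `expected_value` is illustrative and not part of the original notebook.

```python
def expected_value(distribution):
    """Return the sum of x * P(X = x) over a list of (x, probability) pairs."""
    return sum(x * p for x, p in distribution)

# Net revenue of the project (in ₹ crore) and its probability, from the table
net_revenue = [(-305, 0.1), (15, 0.7), (95, 0.2)]

ev_project = expected_value(net_revenue)
print(round(ev_project, 2))
```

With these numbers the sum is -30.5 + 10.5 + 19 = -1, so on average the company should expect a small loss of about ₹1 crore on this project.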

Suppose we change the game’s rules to the following:

| Outcome | Prize |
|---|---|
| 4 red balls | +150 |
| 4 blue balls | -150 |
| Any other outcome | -10 |

What will be the expected value now for X (the amount of money won by a player after playing the game once)?

E(X)=150*P(X=150)+(-150)*P(X=-150)+(-10)*P(X=-10)

#### **EXPECTED LOSS FOR A BANK (CASE STUDY)**

In [2]:
import pandas as pd
df=pd.read_csv("student_loan.csv")
df.head()

Unnamed: 0,Customer No.,Exposure at Default (in lakh Rs.),Recovery (%),Probability of Default,Unnamed: 4,Unnamed: 5
0,1,11.5,20.00%,0.007,,
1,2,0.24,5.10%,0.0033,,
2,3,0.04,24.86%,0.0022,,
3,4,13.81,2.29%,0.0066,,
4,5,19.84,3.47%,0.002,,


In [9]:
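# Column 2 is 'Recovery (%)', stored as strings such as '20.00%'; convert it to a float percentage
df['recovery']=df.iloc[:,2].apply(lambda x : float(x.replace('%','')))
# 'lda' is the percentage of the exposure that is NOT recovered (100 - recovery)
df['lda']=df.iloc[:,2].apply(lambda x :100- float(x.replace('%','')))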
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
Customer No.                         10000 non-null int64
Exposure at Default (in lakh Rs.)    10000 non-null float64
Recovery (%)                         10000 non-null object
Probability of Default               10000 non-null float64
Unnamed: 4                           0 non-null float64
Unnamed: 5                           0 non-null float64
recovery                             10000 non-null float64
lda                                  10000 non-null float64
dtypes: float64(6), int64(1), object(1)
memory usage: 625.1+ KB


In [10]:
df.head()

Unnamed: 0,Customer No.,Exposure at Default (in lakh Rs.),Recovery (%),Probability of Default,Unnamed: 4,Unnamed: 5,recovery,lda
0,1,11.5,20.00%,0.007,,,20.0,80.0
1,2,0.24,5.10%,0.0033,,,5.1,94.9
2,3,0.04,24.86%,0.0022,,,24.86,75.14
3,4,13.81,2.29%,0.0066,,,2.29,97.71
4,5,19.84,3.47%,0.002,,,3.47,96.53


In [13]:
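# Expected loss (in lakh Rs.) = exposure at default * probability of default * unrecovered fraction, summed over all customers
((df.iloc[:,1]*df.iloc[:,3]*df.lda)/100).sum()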

493.39131440619997
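
The calculation above appears to follow the usual credit-risk pattern: the expected loss for one customer is the exposure at default, times the probability of default, times the share of the exposure that is not recovered. Below is a minimal pure-Python sketch of that idea using only the five rows displayed by `df.head()`, so its total is not comparable to the full 10,000-row result. The name `expected_loss` is an assumption made for illustration.

```python
# (exposure at default in lakh Rs., recovery %, probability of default)
# These are only the five rows displayed by df.head(), not the full dataset.
customers = [
    (11.5, 20.00, 0.007),
    (0.24, 5.10, 0.0033),
    (0.04, 24.86, 0.0022),
    (13.81, 2.29, 0.0066),
    (19.84, 3.47, 0.002),
]

def expected_loss(exposure, recovery_pct, prob_default):
    """Expected loss for one customer, in the same units as the exposure."""
    unrecovered_fraction = (100 - recovery_pct) / 100
    return exposure * prob_default * unrecovered_fraction

total = sum(expected_loss(*row) for row in customers)
print(round(total, 4))
```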

### **BINOMIAL DISTRIBUTION**

So, the formula for finding binomial probability is given by:

 
$$P(X = r) = {}^{n}C_{r} \, p^{r} (1 - p)^{n - r}$$


Where n is the number of trials, p is the probability of success, and r is the number of successes after n trials.

 

However, there are some conditions that need to be met in order for us to be able to apply the formula.

- The total number of trials is fixed at n.
- Each trial is binary, i.e., it has only two possible outcomes: success or failure.
- The probability of success is the same in all trials, denoted by p.
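
When these conditions hold (and assuming the trials are also independent of each other), the formula translates directly into code. The sketch below uses `math.comb` from the standard library (Python 3.8+); the function name `binomial_pmf` and the values n = 10, p = 0.05 are illustrative.

```python
from math import comb

def binomial_pmf(n, r, p):
    """P(X = r) for n trials, each succeeding with probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The probabilities over r = 0..n should add up to 1
total_probability = sum(binomial_pmf(10, r, 0.05) for r in range(11))
print(total_probability)
```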

#### Q. Let’s define X as the number of packets found to be defective after the 10 packets have been tested. What will be the expected value of X?

X = number of packets found to be defective after 10 packets are tested.

- E(X)=x1*P(X=x1)+x2*P(X=x2)+......+xn*P(X=xn)
- 0*P(X=0)=0, so the X=0 term contributes nothing to the sum
- 1*P(X=1)=1*(10C1)*(0.05)*(0.95)^9 & so on, taking 0.05 as the probability that a single packet is defective

In [4]:
import operator as op
from functools import reduce

def ncr(n, r):
    r = min(r, n-r)
    numer = reduce(op.mul, range(n, n-r, -1), 1)
    denom = reduce(op.mul, range(1, r+1), 1)
    return numer // denom


In [8]:
E=0
for i in range(1,11):
    E+=i*ncr(10,i)*pow(0.05,i)*pow(0.95,10-i)
E

0.4999999999999998
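
The result is 0.5 up to floating-point rounding, which agrees with the standard shortcut for a binomial variable, E(X) = n × p = 10 × 0.05. A small sketch comparing the two approaches, using `math.comb` (Python 3.8+) in place of the notebook's `ncr` helper:

```python
from math import comb, isclose

n, p = 10, 0.05

# Direct sum of r * P(X = r) over every possible number of defective packets
expected_by_sum = sum(r * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))

# Shortcut for a binomial random variable
expected_by_shortcut = n * p

print(isclose(expected_by_sum, expected_by_shortcut))
```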

### **SOLVING EXPECTED VALUE AND ITS MEANING**

Let’s understand the process with an example. Suppose you’re playing a game involving a 6-sided die. Could you tell what is the average outcome that you’d expect each time the die is thrown? Answering this question requires us to calculate the expected value.

 

 

Let’s solve this problem step by step:

The first step is defining the random variable. The random variable (X) is the outcome of a die throw. So, X = {1, 2, 3, 4, 5, 6}

The second step is to calculate the probabilities related to each outcome. The probability of each outcome is $\frac{1}{6}$ in a die throw.

 

Now, you have X and P(X). If you plug these values in the formula $E[X] = \sum (X \times P(X))$, you’ll get 3.5 as the expected value. So how to interpret this number? This means if you were to throw the die a large number of times, the average of those numbers will tend towards 3.5.
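
The two steps for the die can be written out as a short sketch. Exact fractions are used here to avoid rounding; the variable names are illustrative.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
probability = Fraction(1, 6)  # every face of a fair die is equally likely

expected_die = sum(x * probability for x in outcomes)
print(expected_die, float(expected_die))
```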

#### **Expected Value: Real-World Example**
Suppose you’re interested in investing in the stock market. Spreading your money across multiple stocks is generally considered less risky than holding a single stock. You can estimate your returns using the concept of expected value. Let's take a simple hypothetical situation. Our random variable, in this case, is the return of each stock. To calculate a stock's expected return, you need the probability of each possible return for that stock. Combining these gives the expected return of your entire portfolio, which helps you invest more wisely.

**Q. Suppose you go to a shoe shop to buy a pair of shoes. There are 50 pairs of shoes in the shop in total. Every time you ask the shopkeeper to show you a pair of shoes, he draws 3 pairs of shoes randomly from his stock of 50 pairs. If you don’t like any of the pairs, he places all 3 of them back and then draws 3 pairs again, randomly. Out of all the 50 pairs, 7 pairs of shoes are defective. What is the expected number of defective pairs in a given trial?**

In [12]:
## Sample space for defective shoes in each trial={0,1,2,3}
## 0*P(X=0)+1*P(X=1)+.....+3*P(X=3)
## Each drawn pair is treated as defective with probability 7/50 (binomial model)
E2=0
for i in range(1,4):
    E2+=i*ncr(3,i)*pow(7/50,i)*pow(43/50,3-i)
E2

0.42000000000000004
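
Because the three pairs are drawn together, without replacement, from a stock of 50, the exact model for X is the hypergeometric distribution rather than the binomial one used above. The expected value is the same either way, 3 × 7/50 = 0.42. A sketch of the exact calculation, where the helper name `hypergeom_pmf` is illustrative:

```python
from math import comb, isclose

def hypergeom_pmf(k, population, defective, draws):
    """P(exactly k defective pairs) when drawing without replacement."""
    good = population - defective
    return comb(defective, k) * comb(good, draws - k) / comb(population, draws)

# 50 pairs in stock, 7 defective, 3 drawn per trial
expected_defective = sum(k * hypergeom_pmf(k, 50, 7, 3) for k in range(4))
print(round(expected_defective, 2))
```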

**Q.Suppose you’re playing a game using two dice. Upon throwing the two dice simultaneously, if their sum equals 4, then you get 1000 rupees. If their sum is anything other than 4, you lose 100 rupees. What is the expected earning/loss in this game?**

In [15]:
#1. X = {+1000, -100}
#2. P(X) = {3/36, 33/36}, since 3 of the 36 outcomes, (1,3), (2,2) and (3,1), sum to 4
#If you use the expected value formula, you'll get -8.33 as the answer.

E3=(1000)*(3/36)+(-100)*(33/36)
E3

-8.333333333333329
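
The probability 3/36 can be checked by listing all 36 outcomes of the two dice. The sketch below does this with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) outcomes
rolls = list(product(range(1, 7), repeat=2))
wins = [roll for roll in rolls if sum(roll) == 4]

p_win = Fraction(len(wins), len(rolls))
expected_earning = 1000 * p_win + (-100) * (1 - p_win)
print(p_win, float(expected_earning))
```

The exact expected earning is -25/3 rupees per game, i.e. an average loss of about 8.33 rupees.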

**Q.Rahul wants to play a poker game. The entry charge for the game is a non-refundable INR 2000 and probability that Rahul wins a poker game is 3%. The prize money is INR 50000. If Rahul wins, he gets the prize money. If he loses, he gets nothing. What's his expected earning/loss per game?**

In [17]:
## X={48000,-2000}
## P(X)={0.03,0.97}
E4=48000*0.03-2000*0.97
E4

-500.0

**Q.Suppose a new cancer treatment has been discovered, claiming to increase the one-year survival rate for pancreatic cancer patients to 40%. In other words, the probability that a patient suffering from pancreatic cancer would survive for at least one year after receiving this treatment is 40%.**

**- 1.The hospital has a total of 10 patients suffering from pancreatic cancer. What is the probability that exactly 4 of these patients would survive the first year after receiving this treatment?**

**- 2.What is the probability that the number of patients that survive the first year after receiving the treatment would not be more than 2?**

In [18]:
## For 1st--we use the binomial distribution, i.e. each patient either survives or does not
## P(survive)=0.4, P(not)=0.6
E5=ncr(10,4)*pow(0.4,4)*pow(0.6,6)
E5

0.250822656

In [20]:
## For 2nd P(X=0)+P(X=1)+P(X=2)
E6=0
for i in range(0,3):
    E6+=ncr(10,i)*pow(0.4,i)*pow(0.6,10-i)
E6

0.16728975359999998

In [39]:
p1=0.85*0.6
p2=0.6*0.3
p3=0.3*0.85
all_p=0.85*0.3*0.6
p1+p2+p3

0.945

In [44]:
### 7 normal,5-intermediate,2-high
### P(A U B U C)=P(A)+P(B)+P(C)-P(AintB)-P(BintC)-P(CintA)+P(AintBintC)
P_A=ncr(10,7)*pow(0.85,7)*pow(0.15,3)
P_B=ncr(10,5)*pow(0.6,5)*pow(0.4,5)
P_C=ncr(10,2)*pow(0.3,2)*pow(0.7,8)
P_A*P_B*P_C

0.00608252070404931

In [38]:
P_A_B=ncr(10,5)*pow(p1,5)*pow(1-p1,5)
P_B_C=ncr(10,2)*pow(p2,2)*pow(1-p2,8)
P_C_A=ncr(10,2)*pow(p3,2)*pow(1-p3,8)
P_A_B+P_B_C+P_C_A

0.8213167060590538

In [41]:
P_A_B_C=ncr(10,2)*pow(all_p,2)*pow(1-all_p,8)
P_A_B_C

0.2790375551213251

In [43]:
final=P_A+P_B+P_C-(P_A_B+P_B_C+P_C_A)+P_A_B_C
final

0.02168713511617737

### **CDF AND PDF (Probability Density Function)**

![title](img/cdf.png)

![title](img/probdistfunc.png)

For example, the area under the curve between 20, the smallest possible value of X, and 28 gives the cumulative probability for X = 28, i.e., P(X ≤ 28).
- The total area under the curve will always be equal to 1.
- The area under the curve to the left of any value x gives the cumulative probability at that value.

![title](img/uniform_pdf.png)
Here X is uniformly distributed between 0 and 10, so the total area under the PDF is the area of a rectangle with length 10 and unknown height h. Because the total area must equal 1, you can say that 10 * h = 1, which gives us h = 0.1. So, the value of the PDF for all values between 0 and 10 is 0.1.
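
The rectangle argument can be sketched in code. The functions below assume a uniform distribution on the interval from 0 to 10, as in the example; the names `uniform_pdf` and `uniform_cdf` are illustrative.

```python
def uniform_pdf(x, low=0.0, high=10.0):
    """Height of the uniform density: constant inside [low, high], zero outside."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def uniform_cdf(x, low=0.0, high=10.0):
    """Cumulative probability P(X <= x): area of the rectangle to the left of x."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) * uniform_pdf(x, low, high)

print(uniform_pdf(4), uniform_cdf(4), uniform_cdf(10))
```

The PDF gives the constant height 0.1, while the CDF grows linearly from 0 at x = 0 to 1 at x = 10.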

### **NORMAL DISTRIBUTION** ##
![title](img/normal_dist.png)

##### **IMPORTANT NOTE: Changing the mean shifts the curve to the left or right without changing its shape, whereas the standard deviation controls how spread out the curve is. The higher the standard deviation, the flatter and wider the curve will be.**
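
A small sketch of this behaviour using `statistics.NormalDist` from the standard library (Python 3.8+). The means and standard deviations chosen here are hypothetical values for illustration only.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=2)     # same centre, larger spread
shifted = NormalDist(mu=3, sigma=1)  # same spread, centre moved right

# The peak of each curve sits at its mean
print(narrow.pdf(0), wide.pdf(0), shifted.pdf(3))
```

Doubling the standard deviation halves the height of the peak, while moving the mean leaves the peak height unchanged.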


#  **SAMPLING AND CENTRAL LIMIT THEOREM**





![title](img/central_limit2.png)

In [13]:
import pandas as pd
import math
df2=pd.read_csv("dataset/Inferntial Statistics - UpGrad Samples.csv")
df2.head()

Unnamed: 0,Sample No.,Sample Mean
0,1,3.2
1,2,2.6
2,3,2.8
3,4,2.0
4,5,3.0


In [14]:
mean=df2['Sample Mean'].sum()/len(df2)
mean

2.3480000000000003

In [16]:
# Standard deviation of the sample means: square root of the mean squared deviation from the mean
sd=pow((mean-df2['Sample Mean']),2)/len(df2)
math.sqrt(sd.sum())

0.4248482081873478

#### **SAMPLING AND SAMPLING DISTRIBUTION**

So, there are two important properties of a sampling distribution of the mean:

- Sampling distribution’s mean ($\mu_{\bar{X}}$) = Population mean (μ)

- Sampling distribution’s standard deviation (Standard error) = $\frac{\sigma}{\sqrt{n}}$, where σ is the population’s standard deviation and n is the sample size

IMPORTANT-- If two samples have sizes n1 and n2 with n1>n2, then their standard errors satisfy sd1<sd2, because the sampling distribution is narrower when the sample size is larger.
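
The effect of sample size on the standard error can be seen with a short sketch. The population standard deviation and the two sample sizes below are hypothetical values chosen for illustration.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / sqrt(n)

sigma = 2.0  # hypothetical population standard deviation
se_large = standard_error(sigma, 100)  # n1 = 100
se_small = standard_error(sigma, 25)   # n2 = 25
print(se_large, se_small)
```

Quadrupling the sample size halves the standard error, since n sits under a square root.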

#### **CENTRAL LIMIT THEOREM**

So, the central limit theorem says that for any kind of data, provided a high number of samples has been taken, the following properties hold true:

- Sampling distribution’s mean ($\mu_{\bar{X}}$) = Population mean (μ),

- Sampling distribution’s standard deviation (standard error) = $\frac{\sigma}{\sqrt{n}}$, and

- For n > 30, the sampling distribution becomes approximately a normal distribution.
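
The first two properties can be checked with a small simulation. This is an independent sketch, not the contents of the CLT_DEMO file: it assumes a population of fair die throws, a sample size of 36 and 5,000 samples, all chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

faces = [1, 2, 3, 4, 5, 6]
population_mean = mean(faces)  # 3.5
population_sd = pstdev(faces)  # about 1.708

sample_size = 36
num_samples = 5000

# Mean of each sample of 36 die throws
sample_means = [
    sum(random.choices(faces, k=sample_size)) / sample_size
    for _ in range(num_samples)
]

# Compare the sampling distribution with what the theorem predicts
print(population_mean, mean(sample_means))
print(population_sd / sqrt(sample_size), pstdev(sample_means))
```

Even though a single die throw is uniformly distributed and not normal, the mean of the sample means should land close to 3.5 and their standard deviation close to σ/√n.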

**For a demo of the CLT, you can refer to the file CLT_DEMO.**