# Probability distributions.
    
1. The first part answers questions about an artificial *data set* containing one sample from a normal distribution and one from a binomial distribution.
2. The second part analyzes the distribution of a variable from the [Pulsar Star](https://archive.ics.uci.edu/ml/datasets/HTRU2) _data set_.

In [52]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sct
import seaborn as sns
from statsmodels.distributions.empirical_distribution import ECDF
from scipy.stats import norm

In [69]:
%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

### First part

In [44]:
np.random.seed(42)
    
dataframe = pd.DataFrame({"normal": sct.norm.rvs(20, 4, size=10000),
                     "binomial": sct.binom.rvs(100, 0.2, size=10000)})

#### Analysis start

In [45]:
dataframe.head()

Unnamed: 0,normal,binomial
0,21.986857,18
1,19.446943,15
2,22.590754,14
3,26.092119,15
4,19.063387,21


#### Question 1

What is the difference between the quartiles (Q1, Q2 and Q3) of the `normal` and `binomial` variables of `dataframe`? Respond as a tuple of three elements rounded to three decimal places.

In other words, let `q1_norm`, `q2_norm` and `q3_norm` be the quartiles of the variable `normal`, and `q1_binom`, `q2_binom` and `q3_binom` the quartiles of the variable `binomial`. What is the difference `(q1_norm - q1_binom, q2_norm - q2_binom, q3_norm - q3_binom)`?

In [46]:
def q1():
    # Quartiles (Q1, Q2, Q3) of each variable.
    normal = np.percentile(dataframe.normal, [25, 50, 75])
    binomial = np.percentile(dataframe.binomial, [25, 50, 75])
    # Element-wise differences, rounded to three decimal places.
    return tuple(round(float(d), 3) for d in normal - binomial)

q1()

(0.31, -0.01, -0.316)
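
As a sanity check on the result above, the sample quartile differences can be compared with the differences between the theoretical quartiles of the two generating distributions. The sketch below assumes the same parameters used to build `dataframe` (normal with mean 20 and standard deviation 4, binomial with n = 100 and p = 0.2). The names `probs`, `normal_q`, `binomial_q` and `theoretical_diff` are illustrative and not part of the original notebook.

```python
import numpy as np
import scipy.stats as sct

# Quartile levels Q1, Q2 and Q3.
probs = [0.25, 0.50, 0.75]

# Theoretical quartiles of the generating distributions
# (parameters assumed to match those used to build `dataframe`).
normal_q = sct.norm.ppf(probs, 20, 4)
binomial_q = sct.binom.ppf(probs, 100, 0.2)

# Differences in the same order as the question: normal minus binomial.
theoretical_diff = tuple(float(d) for d in np.round(normal_q - binomial_q, 3))
print(theoretical_diff)
```

Because the binomial distribution is discrete, its quartiles are integers, so small non-zero differences at Q1 and Q3 are expected even though both distributions share the same mean and variance.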

#### Question 2

Consider the interval $[\bar{x} - s, \bar{x} + s]$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation. What is the probability of this interval, calculated with the empirical cumulative distribution function (empirical CDF) of the `normal` variable? Respond as a single scalar rounded to three decimal places.

In [61]:
def q2():
    ecdf = ECDF(dataframe.normal)
    mean = dataframe.normal.mean()
    deviation = dataframe.normal.std()
    
    # Probability mass of the empirical CDF between mean - s and mean + s.
    probability = ecdf(mean + deviation) - ecdf(mean - deviation)
    return round(float(probability), 3)

q2()

0.684
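
For comparison, the theoretical probability that a normal variable falls within one standard deviation of its mean can be computed from the standard normal CDF. This is a minimal sketch; the name `theoretical` is illustrative and not part of the original notebook.

```python
from scipy.stats import norm

# Probability mass of a normal distribution within one standard
# deviation of its mean. After standardization the interval becomes
# [-1, 1], so the result does not depend on the mean or the scale.
theoretical = norm.cdf(1) - norm.cdf(-1)
print(round(float(theoretical), 3))  # 0.683
```

The empirical value is close to this theoretical one, which is what one would expect from a sample of 10000 normal draws.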

#### Question 3

What is the difference between the means and the variances of the `binomial` and `normal` variables? Respond as a tuple of two elements rounded to three decimal places.

In other words, let `m_binom` and `v_binom` be the mean and variance of the `binomial` variable, and `m_norm` and `v_norm` be the mean and variance of the `normal` variable. What are the differences `(m_binom - m_norm, v_binom - v_norm)`?

In [62]:
def q3():
    m_norm  = dataframe.normal.mean()
    v_norm  = dataframe.normal.var()
    m_binom = dataframe.binomial.mean()
    v_binom = dataframe.binomial.var()
    return (round(m_binom - m_norm, 3), round(v_binom - v_norm, 3))

q3()

(0.106, 0.22)
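
The small sample differences above can be set against the theoretical moments. A binomial distribution with n = 100 and p = 0.2 has mean np = 20 and variance np(1 - p) = 16, the same as a normal distribution with mean 20 and standard deviation 4. The sketch below checks this with `scipy.stats`, assuming the parameters used to build `dataframe`; the `_theory` variable names are illustrative only.

```python
import scipy.stats as sct

# Theoretical mean and variance of each generating distribution
# (parameters assumed to match those used to build `dataframe`).
m_norm_theory, v_norm_theory = sct.norm.stats(20, 4, moments="mv")
m_binom_theory, v_binom_theory = sct.binom.stats(100, 0.2, moments="mv")

# Both differences are zero up to floating-point error.
mean_diff = float(m_binom_theory - m_norm_theory)
var_diff = float(v_binom_theory - v_norm_theory)
print(mean_diff, var_diff)
```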

### Second part

In [63]:
stars = pd.read_csv("pulsar_stars.csv")

stars.rename({old_name: new_name
              for (old_name, new_name)
              in zip(stars.columns,
                     ["mean_profile", "sd_profile", "kurt_profile", "skew_profile", "mean_curve", "sd_curve", "kurt_curve", "skew_curve", "target"])
             },
             axis=1, inplace=True)

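# Plain column assignment reliably changes the dtype to bool;
# assigning through `.loc[:, ...]` may keep the original dtype in recent pandas versions.
stars["target"] = stars["target"].astype(bool)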

#### Analysis start

In [64]:
stars.head()

Unnamed: 0,mean_profile,sd_profile,kurt_profile,skew_profile,mean_curve,sd_curve,kurt_curve,skew_curve,target
0,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,False
1,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,False
2,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,False
3,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,False
4,93.570312,46.698114,0.531905,0.416721,1.636288,14.545074,10.621748,131.394004,False


#### Question 4

Considering the `mean_profile` variable of `stars`:

1. Filter only the values of `mean_profile` where `target == 0` (i.e., where the star is not a pulsar).
2. Standardize the `mean_profile` variable previously filtered to have mean 0 and variance 1.

We will call the resulting variable `false_pulsar_mean_profile_standardized`.

Find the theoretical quantiles for a normal distribution of mean 0 and variance 1 for 0.80, 0.90 and 0.95 using the `norm.ppf()` function available in `scipy.stats`.

What are the probabilities associated with these quantiles using the empirical CDF of the variable `false_pulsar_mean_profile_standardized`? Respond as a tuple of three elements rounded to three decimal places.

In [65]:
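# Keep `mean_profile` only for non-pulsar observations.
non_pulsar = stars.loc[stars["target"] == False, "mean_profile"]
# Standardize to mean 0 and variance 1 (pandas `std` uses ddof=1 by default).
standardized = (non_pulsar - non_pulsar.mean()) / non_pulsar.std()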

In [66]:
def q4():
    ecdf = ECDF(standardized)
    # Theoretical quantiles of the standard normal N(0, 1).
    q80 = norm.ppf(0.80, loc=0, scale=1)
    q90 = norm.ppf(0.90, loc=0, scale=1)
    q95 = norm.ppf(0.95, loc=0, scale=1)

    # Empirical probabilities at those quantiles.
    return (round(ecdf(q80), 3),
            round(ecdf(q90), 3),
            round(ecdf(q95), 3))

q4()

(0.806, 0.911, 0.959)
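
To make explicit what the empirical CDF measures in the answer above, here is a minimal NumPy sketch: the value at `x` is the fraction of observations less than or equal to `x`. The helper `empirical_cdf` and the `toy` sample are hypothetical illustrations. This is not the `statsmodels` implementation, although it is assumed to agree with the default right-continuous behaviour of `ECDF`.

```python
import numpy as np

def empirical_cdf(sample, x):
    """Return the fraction of observations in `sample` that are <= x."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts every observation less than or equal to x.
    count = np.searchsorted(sorted_sample, x, side="right")
    return count / sorted_sample.size

toy = [1.0, 2.0, 2.0, 3.0, 5.0]
print(empirical_cdf(toy, 2.0))  # 0.6
print(empirical_cdf(toy, 0.0))  # 0.0
print(empirical_cdf(toy, 5.0))  # 1.0
```

If the standardized variable were exactly normal, evaluating its empirical CDF at the theoretical 0.80, 0.90 and 0.95 quantiles would return values close to 0.80, 0.90 and 0.95.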

#### Question 5

What is the difference between the quartiles Q1, Q2 and Q3 of `false_pulsar_mean_profile_standardized` and the corresponding theoretical quartiles of a normal distribution of mean 0 and variance 1? Respond as a tuple of three elements rounded to three decimal places.

In [67]:
def q5():
    # Sample quartiles of the standardized variable.
    quartiles = np.percentile(standardized, [25, 50, 75])
    # Theoretical quartiles of the standard normal N(0, 1).
    theoretical = norm.ppf([0.25, 0.50, 0.75], loc=0, scale=1)
    return tuple(round(float(d), 3) for d in quartiles - theoretical)

q5()

(0.027, 0.04, -0.004)
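
The comparison above uses only three quantiles. One way to extend it to the whole distribution is a quantile-quantile comparison, which `scipy.stats.probplot` computes. The sketch below uses a synthetic normal sample as a stand-in for the standardized variable, since the pulsar data is not loaded here; `rng`, `sample` and the unpacked names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the standardized variable (NOT the pulsar data).
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line fitted through the resulting points.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)

# A correlation coefficient r close to 1 suggests the sample quantiles
# track the normal quantiles closely.
print(round(float(r), 3))
```

Replacing `sample` with the standardized variable would give a fuller picture of how far it departs from normality than the three quartile differences alone.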