# 1. Introduction

![Volkswagen](http://www.investopedia.com/thmb/5zbLZHbrLNwpLLbZjbteyvgDdY4=/1024x683/filters:fill(auto,1)/GettyImages-1135311347-4aabd8ab95354b9f9eae9e1d8b61ee33.jpg)

# 1.1. What are analytic distributions? 

By its dictionary meaning, a distribution is "the way in which something is shared out among a group or spread over an area". A histogram is one way to represent a distribution, and statistics such as the mean, variance, and effect size summarize it. What these have in common is that they are based on empirical observations. Together with probability mass functions (PMFs) and cumulative distribution functions (CDFs) computed from a sample, they describe empirical distributions.

In contrast to empirical distributions, analytic distributions are characterized by a CDF or a PDF that is a mathematical function. An analytic distribution can be used to model an empirical distribution. Some examples of analytic distributions are the normal, exponential, and lognormal distributions. In this notebook, I'm going to model certain variables from the dataset and check whether they fit any of these analytic distributions. That said, I may wave my hands at many of the concepts and use them quite implicitly, for that is the purpose of the notebook.
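To make the distinction concrete, here is a minimal sketch that compares an empirical CDF with an analytic one. It uses synthetic, normally distributed numbers as a stand-in (an assumption, not the Volkswagen data), and the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import norm

# Synthetic stand-in data (an assumption, not the notebook's dataset)
rng = np.random.default_rng(42)
sample = np.sort(rng.normal(loc=20.0, scale=5.0, size=5000))

# Empirical CDF: fraction of observations <= each sorted value
empirical_cdf = np.arange(1, len(sample) + 1) / len(sample)

# Analytic CDF: a normal model described by just two parameters
analytic_cdf = norm.cdf(sample, loc=sample.mean(), scale=sample.std())

# If the model is a good fit, the two curves stay close everywhere
max_gap = np.max(np.abs(empirical_cdf - analytic_cdf))
print(f"largest gap between the two CDFs: {max_gap:.4f}")
```

The empirical curve is built from every observation, while the analytic curve needs only a mean and a standard deviation.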

# 1.2. Knowing my data

I have enough information to carry out an analysis with a few variables (columns) of the dataset. Still, for the benefit of the audience, I'm attaching the link to the dataset:

[100,000 UK Used Car Data set](http://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes)

The dataset has 13 sub-datasets. This notebook needs only one of them, so I'll use the data for Volkswagen cars.

There are 9 variables in the dataset, but I'll use 3 of them, namely:
1. year: the year in which the car was registered.
2. price: price of the vehicle in Euro.
3. fuelType: the type of engine fuel.

For my purpose, I'll only use data from one year and one fuel type, namely 2019 and Petrol, respectively.

# 2. Importing the libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import math
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
from scipy import stats
from scipy.stats import norm
import random

# 3. Exploratory Data Analysis

In [None]:
vw = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv')

In [None]:
vw.head()

In [None]:
vw.describe()

In [None]:
vw.info()

In [None]:
vw.fuelType.value_counts()

In [None]:
t19_petrol = vw[(vw['year'] == 2019) & (vw['fuelType'] == 'Petrol')]

In [None]:
t19_petrol.head()

In [None]:
t19_petrol_price = t19_petrol.price.values

# 3.1. Transforming the data

I'm transforming (modifying) one variable of the dataset, "price": I divide the prices by 1000 and sort them. I'm doing this because the prices in their original form each produce a CDF value of 0.0. After the transformation, a value of, say, 32000 euros becomes 32 (that is, 32K euros).

In [None]:
t19_petrol_price = sorted(t19_petrol_price / 1000)

# 4. Brief introduction to CDF

Let's say we have a sample of 10 values and we want to find the percentile rank of a number 'x' that may appear in the sample. The percentile rank of 'x' is the percentage of values in the sample that are less than or equal to 'x'.

A cumulative distribution function is a normalized percentile rank. That is, we define the CDF on a scale of 0-1, whereas the percentile rank is measured on a 0-100 scale.
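As a small illustration of this relationship, here is a sketch with a made-up sample of 10 values. The helper names `percentile_rank` and `cdf_value` and the numbers are hypothetical, chosen only for this example.

```python
def percentile_rank(sample, x):
    """Percentage of values in the sample that are <= x (0-100 scale)."""
    count = sum(1 for value in sample if value <= x)
    return 100.0 * count / len(sample)


def cdf_value(sample, x):
    """The same quantity rescaled to the 0-1 range."""
    return percentile_rank(sample, x) / 100.0


# Made-up sample of 10 values (not taken from the dataset)
prices = [12, 15, 18, 20, 22, 25, 27, 30, 35, 40]

rank = percentile_rank(prices, 22)  # 5 of the 10 values are <= 22
prob = cdf_value(prices, 22)
print(rank, prob)
```

Here five of the ten values are at or below 22, so the percentile rank is 50 and the CDF value is 0.5.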

We can use the following function to calculate the CDF of a sample. In this scenario, the sample is the prices of the various petrol Volkswagen models registered in the year 2019.

In [None]:
def EvalCdf(sample, x):
    count = 0
    for i in sample:
        if i <= x:
            count += 1
    prob = count / len(sample)
    
    return prob

# 4.1. Applying the cdf on the sample

Now that we have both the sample and the function needed to convert it into cumulative values (normalized percentile ranks), we apply the function to each value of the sample and plot the distribution as a line plot with the matplotlib library.

In [None]:
t19_petrol_price_cdf = []

for i in t19_petrol_price:
    t19_petrol_price_cdf.append(EvalCdf(t19_petrol_price, i))
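
The loop above calls `EvalCdf` once per value, so it scans the whole sample for every element. On larger samples, a vectorized version may be faster. The sketch below is one possible alternative using `np.searchsorted`; the function name `eval_cdf_vectorized` is hypothetical, and the tiny input is made up for illustration.

```python
import numpy as np


def eval_cdf_vectorized(sample):
    """Return the sorted sample and the empirical CDF value of each element."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts how many values are <= each element (ties included)
    counts = np.searchsorted(sorted_sample, sorted_sample, side="right")
    return sorted_sample, counts / len(sorted_sample)


# Made-up values with a tie, to show that duplicates share one CDF value
values, cdf = eval_cdf_vectorized([3.0, 1.0, 2.0, 2.0])
print(values, cdf)
```

Under these assumptions, `eval_cdf_vectorized(t19_petrol_price)` should give the same CDF values as the loop.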

In [None]:
plt.figure(figsize = (15,8))

plt.xlabel('Price')
plt.ylabel('CDF')
plt.plot(t19_petrol_price, t19_petrol_price_cdf)

After plotting price against CDF (each price with its corresponding cumulative distribution value), we can say that the shape of the plot is nearly sigmoid.

The rest of the discussion is about this plot. Many of you might wonder why: we have the plot, so why not infer the obvious and move on?
While it is true that I will draw some conclusions from the plot and move on, I will do that through a different process.

The sigmoid shape of the plot may be hiding a more useful insight that deserves attention.

# 5. Fitting into a different model

Plotting one variable is an example of an empirical distribution, for we haven't done anything hypothetical so far. The study of the plot, however, will use the concepts of analytic distributions.

**What does the shape indicate?**

The shape of the plot is approximately sigmoid, which is also the shape of the CDF of the normal and lognormal distributions. But how do we know? The shape may resemble that of a *normal* or *lognormal* distribution, but is there any way to cross-check? The answer is yes.

# 6. Normal Probability Plot

# 6.1. For normal distribution

To check whether the distribution is approximately normal, we can use a normal probability plot. In a normal probability plot, we plot the ordered values (the values in our sample, that is, price) against the theoretical values (the corresponding quantiles of a standard normal distribution, that is, mean = 0 and standard deviation = 1).

If the distribution of the sample is approximately normal, the result is a straight line with intercept mu and slope sigma.
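The intercept and slope claim can be checked by hand. The sketch below is a rough illustration on synthetic normal data (an assumption, not the price data): it pairs the sorted sample with standard normal quantiles and fits a straight line with `np.polyfit`. The plotting positions `(i - 0.5) / n` are one common convention, picked here for simplicity.

```python
import numpy as np
from scipy.stats import norm

# Synthetic normal sample with known parameters (an assumption for this demo)
mu, sigma = 20.0, 5.0
rng = np.random.default_rng(0)
ordered_values = np.sort(rng.normal(mu, sigma, size=2000))

# Standard normal quantiles at evenly spaced plotting positions
n = len(ordered_values)
positions = (np.arange(1, n + 1) - 0.5) / n
theoretical_quantiles = norm.ppf(positions)

# Fit a straight line: ordered_values ~ intercept + slope * quantiles
slope, intercept = np.polyfit(theoretical_quantiles, ordered_values, 1)
print(f"intercept = {intercept:.2f} (mu = {mu}), slope = {slope:.2f} (sigma = {sigma})")
```

For a sample that really is normal, the fitted intercept lands near mu and the fitted slope near sigma.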

In Python, ***scipy.stats provides a built-in function called "probplot"*** which not only does the calculations and transformations but also plots the values.

In [None]:
plt.figure(figsize = (15, 8))

#plt.plot(stats.probplot(t19_petrol_price)[0][0], stats.probplot(t19_petrol_price)[0][1])
stats.probplot(t19_petrol_price, plot = plt)

**What does the plot tell us?**

As we can see, the plot does not quite follow the red straight line. Although most of the points align with the line, both tails deviate substantially from it, which indicates that the normal model matches the data near the mean and deviates in the tails.

In [None]:
sum(stats.probplot(t19_petrol_price)[0][0])

In [None]:
t19_petrol_price[-5:]

In [None]:
stats.probplot(t19_petrol_price)[0][1]

# 6.2. For lognormal distribution

Normal probability plots can also be used to check whether a distribution is lognormal. It is worth considering that the data may fit a lognormal distribution, because the CDF of a lognormal distribution is also sigmoid.

The concept is the same; the only difference is that we take the logarithm of the ordered values (price in our case, which is the y-axis). The code uses the natural logarithm (`math.log`); the choice of base only rescales the axis and does not change how straight the plot is.

In [None]:
log_ordered_values = [math.log(x) for x in stats.probplot(t19_petrol_price)[0][1]]

In [None]:
plt.figure(figsize = (15, 8))

stats.probplot(log_ordered_values, plot = plt)

**What does the plot tell us?**

The plot is slightly different from the previous one. The curve matches near the mean and deviates only "slightly" in the tails. This suggests that the lognormal distribution fits the data better than the normal distribution does.
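Judging the two plots by eye can be backed up with a number. `stats.probplot` also returns the correlation coefficient `r` of its least-squares fit, and a value closer to 1 means the points sit closer to the straight line. The sketch below uses synthetic lognormal data as a stand-in (an assumption, not the price data).

```python
import numpy as np
from scipy import stats

# Synthetic lognormal sample (an assumption standing in for the prices)
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=3.0, sigma=0.5, size=2000)

# probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (_, _, r_raw) = stats.probplot(sample)
_, (_, _, r_log) = stats.probplot(np.log(sample))

print(f"r for raw values: {r_raw:.4f}")
print(f"r for log values: {r_log:.4f}")
```

The same comparison could be run on `t19_petrol_price` and its logarithm to see which model gives the straighter plot.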

# 7. Which model is the best?

Frankly speaking, the answer to the question "which model is best?" depends on what we want to infer. I leave it to my generous audience to explore the applications of the normal and lognormal distributions. This will take a lot of brainstorming, but that will be fun. I enjoyed investing time in this exercise.

# 8. Epilogue

Analytic models leave out details that are unnecessary for our purposes. That said, it is important to recall that many real-world phenomena can be modelled with analytic distributions. These models smooth out measurement errors from the observed distribution. When an analytic model fits a dataset, a large amount of data can be summarized using a small set of parameters.
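As a rough illustration of summarizing data with a few parameters, the sketch below fits a normal model to 10,000 synthetic values (an assumption, not the price data) with `norm.fit` and then answers a question from the two fitted parameters alone. The threshold of 30 is arbitrary.

```python
import numpy as np
from scipy.stats import norm

# 10,000 synthetic observations (an assumption for this demo)
rng = np.random.default_rng(2)
sample = rng.normal(loc=25.0, scale=4.0, size=10_000)

# Maximum likelihood fit: the whole sample is reduced to two numbers
mu_hat, sigma_hat = norm.fit(sample)

# Fraction of values at or below 30, from the data and from the model
fraction_data = np.mean(sample <= 30.0)
fraction_model = norm.cdf(30.0, loc=mu_hat, scale=sigma_hat)
print(f"mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
print(f"data: {fraction_data:.3f}, model: {fraction_model:.3f}")
```

When the model fits well, the two fractions agree closely even though the model keeps only two numbers.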

At the same time, it is important to be aware that no model is perfect.
"Models are useful if they capture the relevant aspects of the real world and leave out unneeded details. But what is “relevant” or “unneeded” depends on what you are planning to use the model for." - Allen B. Downey, Think Stats

What is the purpose of this notebook?

I was exploring these concepts and was eager to apply them to datasets. I think this is a good way to learn. A lot can be done with this kind of dataset, but as someone rightly pointed out, "many a little makes a mickle": understanding concepts in small portions and applying them in the real world is the best way to learn. That said, I wish to come up with more notebooks in tandem with my work on these datasets.

Please give your valuable feedback :)