# Descriptive Statistics and Python Implementation


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(r'data.csv')
df.head()

Unnamed: 0,Mthly_HH_Income,Mthly_HH_Expense,No_of_Fly_Members,Emi_or_Rent_Amt,Annual_HH_Income,Highest_Qualified_Member,No_of_Earning_Members
0,5000,8000,3,2000,64200,Under-Graduate,1
1,6000,7000,2,3000,79920,Illiterate,1
2,10000,4500,2,0,112800,Under-Graduate,1
3,10000,2000,1,0,97200,Illiterate,1
4,12500,12000,2,3000,147000,Graduate,1


## Mean:

- The **arithmetic mean** is the sum of all values in a collection of numbers divided by how many numbers the collection contains.

#### Formula:

 
 $ \text{Mean}(\mu) = \frac {\sum_{i=1}^{N} x_{i}}{N}$
  

Where 

**N = Total number of observations.**

**$x_{i}$ = $i^{th}$ element.**


In [3]:
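## Python implementation of the mean.
for col in df.columns:
    if df[col].dtypes != 'O':      # skip non-numeric (object) columns
        total = 0                  # named total to avoid shadowing the built-in sum()
        count = 0
        for value in df[col]:
            total += value
            count += 1
        mean = total / count
        print("Mean of",col," =  ",mean)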

Mean of Mthly_HH_Income  =   41558.0
Mean of Mthly_HH_Expense  =   18818.0
Mean of No_of_Fly_Members  =   4.06
Mean of Emi_or_Rent_Amt  =   3060.0
Mean of Annual_HH_Income  =   490019.04
Mean of No_of_Earning_Members  =   1.46
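
As a cross-check on the loop above, the same formula can be written as a small function and compared with the standard library. This is only a minimal sketch: `arithmetic_mean` is a hypothetical helper name, and the sample values are the five `Mthly_HH_Income` rows shown by `df.head()`, not the full dataset.

```python
import statistics

def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count

# First five Mthly_HH_Income values from df.head() (illustrative subset only).
income_sample = [5000, 6000, 10000, 10000, 12500]

manual = arithmetic_mean(income_sample)
builtin = statistics.mean(income_sample)
print(manual, builtin)
```

On the full DataFrame, `df.mean(numeric_only=True)` should give the same numbers as the loop.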


## Median:

- **The median** of a data set is the middle value once the values are sorted: half of the observations lie at or below it and half at or above it.

#### Formula:
$$
Median = \begin{cases}
X\left[\frac{n+1}{2}\right] & \text{if } n \text{ is odd} \\
\frac{X\left[\frac{n}{2}\right] + X\left[\frac{n}{2}+1\right]}{2} & \text{if } n \text{ is even}
\end{cases}
$$

Where    
**_n = no. of elements_**    
**_X = the data sorted in ascending order, indexed from 1_**

In [4]:
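for col in df.columns:

    if df[col].dtypes != 'O':      # numeric columns only

        n = len(df[col])
        ordered = sorted(df[col])
        if n % 2 != 0:
            # odd n: the single middle element (0-based index)
            median = ordered[(n-1)//2]
        else:
            # even n: average of the two middle elements
            median = (ordered[n//2] + ordered[(n-1)//2]) / 2

        print("Median of ", col ," = ", median)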


Median of  Mthly_HH_Income  =  35000.0
Median of  Mthly_HH_Expense  =  15500.0
Median of  No_of_Fly_Members  =  4.0
Median of  Emi_or_Rent_Amt  =  0.0
Median of  Annual_HH_Income  =  447420.0
Median of  No_of_Earning_Members  =  1.0
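
The formula above uses 1-based positions, while Python lists are 0-based, which is an easy place to slip. The following sketch shows the index translation in isolation; `manual_median` is a hypothetical helper name and the samples are rows taken from `df.head()`, not the full dataset.

```python
import statistics

def manual_median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n+1)/2 is 0-based index n//2
        return ordered[mid]
    # 1-based positions n/2 and n/2+1 are 0-based indices mid-1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [8000, 7000, 4500, 2000, 12000]   # Mthly_HH_Expense rows from df.head()
even_sample = [5000, 6000, 10000, 10000]       # first four Mthly_HH_Income rows

odd_result = manual_median(odd_sample)
even_result = manual_median(even_sample)
print(odd_result, even_result)
```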


## Mode:

- **The mode** is the value that appears most frequently in a data set. A set of data may have one mode, more than one mode, or no mode at all. Other popular measures of central tendency include the mean, or the average of a set, and the median, the middle value in a set.

In [5]:
# Manual frequency count for one column, without value_counts().
counts = dict()
for value in df['Highest_Qualified_Member']:
    if value not in counts:
        counts[value] = 0
    counts[value] += 1
# The mode is the key with the highest count.
mode = max(counts, key=counts.get)

In [6]:
for col in df.columns:

    # value_counts() sorts by frequency, so the first index entry is the most frequent value.
    mode = df[col].value_counts().index[0]
    print("Mode of ", col , "= ", mode)

Mode of  Mthly_HH_Income =  45000
Mode of  Mthly_HH_Expense =  25000
Mode of  No_of_Fly_Members =  4
Mode of  Emi_or_Rent_Amt =  0
Mode of  Annual_HH_Income =  590400
Mode of  Highest_Qualified_Member =  Graduate
Mode of  No_of_Earning_Members =  1
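
Taking only the first entry of the frequency table reports a single value even when several values tie for the highest count. The sketch below returns every tied value instead; `all_modes` is a hypothetical helper name and the inputs are made-up toy lists.

```python
from collections import Counter

def all_modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [v for v, c in counts.items() if c == highest]

one_mode = all_modes([1, 2, 2, 3])       # a single most frequent value
two_modes = all_modes([1, 1, 2, 2, 3])   # two values tie
all_tied = all_modes([1, 2, 3])          # every value occurs once
print(one_mode, two_modes, all_tied)
```

When every value occurs exactly once, conventions differ: some texts say the data has no mode, while this helper simply returns all values. In pandas, `Series.mode()` likewise returns all tied values.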


## Variance:   

- The term **variance** refers to a statistical measurement of the spread between numbers in a data set. More specifically, variance measures how far each number in the set is from the mean and thus from every other number in the set. Variance is often denoted by the symbol $\sigma^2$.

#### Formula:

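 $$
 \text{Variance}(\sigma^2) = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})^2}{N-1}
 $$

Dividing by $N-1$ rather than $N$ gives the sample variance, which is also what pandas' `var()` computes by default.

Where   
**N = Total No. of Datapoints   
$x_{i} = i^{th}$ Datapoint   
$\bar{x}$ = Mean**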


In [7]:
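for col in df.columns:
    if df[col].dtypes != 'O':
        mean = df[col].mean()
        squared_dev_sum = 0            # running sum of squared deviations from the mean
        for value in df[col]:
            squared_dev_sum += (value - mean)**2
        var = squared_dev_sum/(len(df[col])-1)   # N-1 in the denominator: sample variance
        print("Variance of ",col,"= ", var)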

Variance of  Mthly_HH_Income =  681100853.0612245
Variance of  Mthly_HH_Expense =  146173342.85714287
Variance of  No_of_Fly_Members =  2.302448979591837
Variance of  Emi_or_Rent_Amt =  38955510.20408163
Variance of  Annual_HH_Income =  102486925397.91666
Variance of  No_of_Earning_Members =  0.5391836734693878


In [8]:
for i in df.columns:
    if df[i].dtypes != 'O':
        print(df[i].var())

681100853.0612245
146173342.85714287
2.302448979591837
38955510.20408163
102486925397.91666
0.5391836734693878
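
The manual loop matches `df[i].var()` because both divide by N-1. Other tools divide by N by default (for example `np.var`), so it helps to see both denominators side by side. In this sketch, `variance` is a hypothetical helper and its `ddof` argument borrows the NumPy/pandas naming; the sample is the `No_of_Fly_Members` column from `df.head()`, not the full dataset.

```python
import statistics

def variance(values, ddof=1):
    """Sum of squared deviations divided by (n - ddof).

    ddof=1 gives the sample variance, ddof=0 the population variance.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)

members_sample = [3, 2, 2, 1, 2]   # No_of_Fly_Members rows from df.head()

sample_var = variance(members_sample, ddof=1)
population_var = variance(members_sample, ddof=0)
print(sample_var, population_var)
```

The gap between the two shrinks as the number of observations grows, but on small samples it is large enough to explain mismatches between libraries.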


## Standard Deviation

**Standard deviation** measures how far the values in a data set are spread out from their mean. It is the square root of the variance, so it is expressed in the same units as the data; the greater the dispersion or variability, the larger the standard deviation.


#### Formula:

 $$
 \text{Standard Deviation}(\sigma) = \sqrt{\frac {\sum_{i=1}^{N} (x_{i} - \bar{x})^2}{N-1}}
 $$

Where   
**_N = Total No. of Datapoints   
$x_{i} = i^{th}$ Datapoint   
$\bar{x}$ = Mean_**

In [9]:
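for col in df.columns:
    if df[col].dtypes != 'O':
        mean = df[col].mean()
        squared_dev_sum = 0            # running sum of squared deviations from the mean
        for value in df[col]:
            squared_dev_sum += (value - mean)**2
        var = squared_dev_sum/(len(df[col])-1)   # sample variance
        std = var**0.5                           # standard deviation is its square root
        print("Standard Deviation of ",col,"= ", std)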

Standard Deviation of  Mthly_HH_Income =  26097.908978713687
Standard Deviation of  Mthly_HH_Expense =  12090.216824240286
Standard Deviation of  No_of_Fly_Members =  1.5173822786601394
Standard Deviation of  Emi_or_Rent_Amt =  6241.434947516607
Standard Deviation of  Annual_HH_Income =  320135.79212252516
Standard Deviation of  No_of_Earning_Members =  0.7342912729083656


In [10]:
for i in df.columns:
    if df[i].dtypes != 'O':
        print(df[i].std())

26097.908978713687
12090.216824240286
1.5173822786601394
6241.434947516607
320135.79212252516
0.7342912729083656
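
Since the standard deviation is defined as the square root of the variance, the two sets of printed results can be checked against each other without reloading the data. The sketch below copies the variance and standard deviation values printed in the outputs above; `is_sqrt_of` is a hypothetical helper name.

```python
import math

# (variance, standard deviation) pairs copied from the printed outputs above.
printed_pairs = {
    "Mthly_HH_Income": (681100853.0612245, 26097.908978713687),
    "Mthly_HH_Expense": (146173342.85714287, 12090.216824240286),
    "No_of_Fly_Members": (2.302448979591837, 1.5173822786601394),
    "Emi_or_Rent_Amt": (38955510.20408163, 6241.434947516607),
    "Annual_HH_Income": (102486925397.91666, 320135.79212252516),
    "No_of_Earning_Members": (0.5391836734693878, 0.7342912729083656),
}

def is_sqrt_of(variance, std, rel_tol=1e-9):
    """True if std equals sqrt(variance) up to floating-point tolerance."""
    return math.isclose(math.sqrt(variance), std, rel_tol=rel_tol)

checks = {name: is_sqrt_of(var, std) for name, (var, std) in printed_pairs.items()}
print(checks)
```

The variance of an income column is in squared currency units, which is hard to interpret; the standard deviation is back in the original units and can be compared directly with the mean.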


## Correlation:
**Correlation** means association; more precisely, it is a measure of the extent to which two variables are related. A correlational study has three possible results: a positive correlation, a negative correlation, and no correlation.


A **positive correlation** is a relationship between two variables in which both variables move in the same direction: when one variable increases the other tends to increase, and when one decreases the other tends to decrease. An example of positive correlation would be height and weight. Taller people tend to be heavier.

A **negative correlation** is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of negative correlation would be height above sea level and temperature. As you climb the mountain (increase in height) it gets colder (decrease in temperature).

A **zero correlation** exists when there is no relationship between two variables. For example, there is no relationship between the amount of tea drunk and level of intelligence.


#### Formula:

$$
 \text{Correlation}(r_{xy}) = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_{i} - \bar{x})^2 \sum_{i=1}^{N} (y_{i} - \bar{y})^2}}
$$

***Where***

**_N = Total No. of Datapoints   
$x_{i} = i^{th}$ Datapoint in column x.      
$\bar{x}$ = Mean of column x.              
$y_{i} = i^{th}$ Datapoint in column y.         
$\bar{y}$ = Mean of column y_**

![Alt Text](https://www.mathsisfun.com/data/images/correlation-examples.svg "Correlation")

In [11]:
# Accumulate the three sums that appear in the correlation formula.
a=df['Mthly_HH_Expense'].mean()
b=df['Mthly_HH_Income'].mean()
p=0   # sum of (x - mean_x)*(y - mean_y)
q=0   # sum of (x - mean_x)**2
r=0   # sum of (y - mean_y)**2
# zip pairs each expense with the income from the same row; a nested loop
# would pair every expense with every income instead.
for i,j in zip(df['Mthly_HH_Expense'],df['Mthly_HH_Income']):
    p+=((i-a)*(j-b))
    q+=(i-a)**2
    r+=(j-b)**2
print (p)
print (q)
print (r)

In [12]:
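# Means and accumulators are set once, before the loop; resetting them inside
# the loop would leave only the last row's contribution.
a=df['Mthly_HH_Expense'].mean()
b=df['Mthly_HH_Income'].mean()
p=0
q=0
r=0
for i,j in zip(df['Mthly_HH_Expense'],df['Mthly_HH_Income']):
    p+=(i-a)*(j-b)
    q+=(i-a)**2
    r+=(j-b)**2
corre = p/((q*r)**0.5)
# Should agree with df['Mthly_HH_Expense'].corr(df['Mthly_HH_Income']).
print(corre)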
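
The correlation formula sums over all rows, so the accumulators have to be initialised once before the loop. If they are reset inside the loop, only the last pair survives, and for a single pair the ratio is always +1 or -1. The sketch below packages the computation as a function and tries it on made-up toy data; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Accumulators are initialised once, outside the loop.
    cov_sum = 0.0
    x_sq_sum = 0.0
    y_sq_sum = 0.0
    for x, y in zip(xs, ys):
        cov_sum += (x - mean_x) * (y - mean_y)
        x_sq_sum += (x - mean_x) ** 2
        y_sq_sum += (y - mean_y) ** 2
    return cov_sum / math.sqrt(x_sq_sum * y_sq_sum)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * v + 1 for v in x])   # exact increasing linear relation
r_down = pearson_r(x, [-v for v in x])        # exact decreasing linear relation
r_mixed = pearson_r(x, [2, 1, 4, 3, 5])       # noisy increasing relation
print(r_up, r_down, r_mixed)
```

A result of exactly 1.0 or -1.0 on real survey data is a strong hint that something in the computation has gone wrong, since it would mean the two columns are perfectly linearly related.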


# Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, the normal distribution appears as a bell curve.

$$f(x)= \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Where   
**$f(x)$ = probability density function   
$\sigma$ = standard deviation   
$\mu$ = mean**
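
The density formula translates almost term by term into code. This is a minimal sketch: `normal_pdf` is a hypothetical function name, and the defaults `mu=0.0`, `sigma=1.0` (the standard normal) are an assumption chosen for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution at x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

peak = normal_pdf(0.0)     # density at the mean, the highest point of the curve
left = normal_pdf(-1.0)    # one standard deviation below the mean
right = normal_pdf(1.0)    # one standard deviation above the mean
print(peak, left, right)
```

The values at -1 and +1 are identical, which is the symmetry about the mean, and both are lower than the value at the mean, which is the bell shape.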






![Alt Text](https://www.simplypsychology.org/normal-distribution.jpg "Normal Distribution")

# Features of Normal Distribution

## 1. It is symmetric
A normal distribution comes with a perfectly symmetrical shape. This means that the distribution curve can be divided in the middle to produce two equal halves. The symmetric shape occurs when one-half of the observations fall on each side of the curve.

## 2. The mean, median, and mode are equal
The middle point of a normal distribution is the point with the maximum frequency, which means that it has the most observations of the variable. The midpoint is also the point where these three measures fall: in a perfectly normal distribution the mean, median, and mode are equal.

## 3. Empirical rule
In normally distributed data, a constant proportion of the area under the curve lies between the mean and a given number of standard deviations from the mean. For example, about 68.27% of all cases fall within +/- one standard deviation of the mean, about 95.45% fall within +/- two standard deviations, and about 99.73% fall within +/- three standard deviations.
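
These percentages can be reproduced from the error function, since for a normal distribution the probability of falling within k standard deviations of the mean is erf(k/√2). The helper name `within_k_sigma` is hypothetical.

```python
import math

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

percentages = {k: round(100 * within_k_sigma(k), 2) for k in (1, 2, 3)}
print(percentages)   # {1: 68.27, 2: 95.45, 3: 99.73}
```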

## 4. Skewness and kurtosis
Skewness and kurtosis are coefficients that measure how different a distribution is from a normal distribution. Skewness measures the asymmetry of a distribution, while kurtosis measures the thickness of its tails relative to the tails of a normal distribution.
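
One common way to compute both coefficients is from central moments. The sketch below uses the simple population (biased) versions on made-up toy lists; the function names are hypothetical, and pandas' `Series.skew()` and `Series.kurt()` apply bias corrections, so their numbers will differ somewhat on small samples.

```python
def central_moment(values, order):
    """Average of (x - mean)**order over the data."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Third central moment scaled by variance**1.5; 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by variance**2, minus 3 (the normal value)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 10]       # long tail towards large values
left_tailed = [-10, -3, -2, -2, -1]   # mirror image of right_tailed

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)
print(skew_symmetric, skew_right, skew_left, excess_kurtosis(symmetric))
```

The sign of the skewness follows the direction of the long tail: positive for the right-tailed list, negative for its mirror image, and zero for the symmetric list.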

## Positively Skewed & Negatively Skewed Distributions
### Skewed Left (Negative Skew):
A left-skewed distribution is sometimes called a negatively skewed distribution because its long tail points in the negative direction on a number line.

A common misconception is that the position of the peak is what defines skewness; in other words, that a distribution whose peak sits to the left is left skewed. This is incorrect. Two main things make a distribution skewed left:

- The mean is to the left of the peak. This is the main idea behind skewness, which is technically a measure of the distribution of values around the mean.
- The tail is longer on the left.

In most cases, the mean is also to the left of the median. This isn't a reliable test for skewness, though, as some distributions (e.g. many multimodal distributions) violate this rule. Treat it as a rule of thumb rather than a set-in-stone one.

### Right Skewed or Positive Skewed:
A right-skewed distribution has a long tail that extends to the right, or positive side, of the x-axis.

- Mean greater than the Mode
- Median greater than the Mode
- Mean greater than Median

_The first and second typically hold for a right-skewed distribution, but the third may not be valid sometimes._


![Alt Text](https://cdn.corporatefinanceinstitute.com/assets/skewness2.png "Normal Distribution")

### Effect on Mean, Median and Mode due to Skewness
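In a negatively skewed distribution, the outliers in the left tail pull the mean down the scale a great deal. The median might be slightly lower due to the outliers, but the mode is unaffected. Thus, with a negatively skewed distribution the mean is numerically lower than the median or mode. In a positively skewed distribution the effect is reversed: outliers in the right tail pull the mean up, so the mean is numerically higher than the median or mode.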
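
A small numeric illustration of how a single outlier in one tail moves the mean while leaving the median and mode where they are. The two lists are made-up toy data and `summarize` is a hypothetical helper name.

```python
import statistics

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

left_skewed = [1, 7, 8, 8, 9]    # one low outlier stretches the left tail
right_skewed = [1, 2, 2, 3, 9]   # one high outlier stretches the right tail

left_mean, left_median, left_mode = summarize(left_skewed)
right_mean, right_median, right_mode = summarize(right_skewed)
print(left_mean, left_median, left_mode)
print(right_mean, right_median, right_mode)
```

In the left-skewed list the mean falls below the median and mode; in the right-skewed list it rises above them.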