# Descriptive Statistics

Descriptive statistics are measures that summarize important features of a data set, often with a single number. Computing them is a common first step after cleaning and preparing data for analysis. A descriptive statistic quantitatively describes one feature of a collection of observations, while descriptive statistics, as a practice, is the process of using and analysing those measures. They help you explore the center, spread and shape of your data, inform the direction of an analysis, and let you communicate insights to others quickly.

1. Measure of Central Tendency
    * Mean
    * Median
    * Mode
2. Measure of Spread
    * Variance
    * Range
    * Standard Deviation
3. Measure of Shape
    * Skewness
    * Kurtosis


# 1. Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

# 2. Import the data

In [4]:
sn.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [5]:
# Note: the variable is called mtcars, but the data loaded is seaborn's "mpg" data set
mtcars = sn.load_dataset("mpg")

In [6]:
mtcars.to_csv("C:/Users/PRADISH/ML 2024/car.csv")

In [7]:
print(mtcars)

      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    18.0          8         307.0       130.0    3504          12.0   
1    15.0          8         350.0       165.0    3693          11.5   
2    18.0          8         318.0       150.0    3436          11.0   
3    16.0          8         304.0       150.0    3433          12.0   
4    17.0          8         302.0       140.0    3449          10.5   
..    ...        ...           ...         ...     ...           ...   
393  27.0          4         140.0        86.0    2790          15.6   
394  44.0          4          97.0        52.0    2130          24.6   
395  32.0          4         135.0        84.0    2295          11.6   
396  28.0          4         120.0        79.0    2625          18.6   
397  31.0          4         119.0        82.0    2720          19.4   

     model_year  origin                       name  
0            70     usa  chevrolet chevelle malibu  
1            70     usa      

# 3. Remove unnecessary data

In [None]:
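del mtcars["name"]  # text column, not needed for numeric summaries

In [None]:
del mtcars["origin"]  # categorical text column, dropped so only numeric columns remain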

In [None]:
mtcars.head()

# 4. Measure of Central Tendency
* Mean
* Median
* Mode

Measures of center are statistics that give us a sense of the "middle" of a numeric variable. In other words, centrality measures give you a sense of a typical value you'd expect to see. Common measures of center include the mean, median and mode.

# 4.1 Mean of each column

In [None]:
mtcars.mean()  
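One detail worth keeping in mind when averaging columns: by default, pandas skips missing values (`skipna=True`) rather than returning `NaN`. The sketch below uses a small hypothetical column, not the mpg data, to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (not the mpg data)
col = pd.Series([100.0, np.nan, 200.0])

mean_skipping_nan = col.mean()          # NaN is ignored: (100 + 200) / 2
mean_with_nan = col.mean(skipna=False)  # NaN propagates into the result

print(mean_skipping_nan)  # 150.0
print(mean_with_nan)      # nan
```

If any column in your own data has gaps, its mean is therefore computed over the non-missing rows only.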

# 4.2 Mean of each Row

In [None]:
mtcars.mean(axis=1)

# 4.3 Median
The median of a distribution is the value with 50% of the data below it and 50% above it. In essence, the median splits the data in half.

In [None]:
mtcars.median()

# 4.4 Median of each row

In [None]:
mtcars.median(axis=1)

Although the mean and median both give us some sense of the center of a distribution, they aren't always the same. The median always splits the data into two halves, while the mean is an arithmetic average, so extreme values can have a significant impact on it. In a perfectly symmetric distribution, the mean and median are the same.
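A small illustration of this point, using hypothetical toy values rather than the mpg data: replacing one value with an extreme outlier moves the mean a long way while the median stays put.

```python
import pandas as pd

# Hypothetical toy data, chosen only for illustration
symmetric = pd.Series([10, 20, 30, 40, 50])
with_outlier = pd.Series([10, 20, 30, 40, 500])  # last value is an outlier

sym_mean, sym_median = symmetric.mean(), symmetric.median()
out_mean, out_median = with_outlier.mean(), with_outlier.median()

print(sym_mean, sym_median)  # 30.0 30.0  (symmetric: mean equals median)
print(out_mean, out_median)  # 120.0 30.0 (outlier pulls the mean only)
```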

# 4.5 Mode

The mode of a variable is simply the value that appears most frequently. Unlike the mean and median, the mode can be taken of a categorical variable, and a variable can have more than one mode.

In [None]:
mtcars.mode()
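As a minimal sketch of both properties, the hypothetical categorical series below has two values tied for most frequent, so `mode()` returns both. When called on a whole DataFrame, `mode()` returns one row per mode and pads columns that have fewer modes with `NaN`.

```python
import pandas as pd

# Hypothetical categorical data with a tie for the most frequent value
colors = pd.Series(["red", "blue", "red", "green", "blue"])

color_modes = colors.mode()  # modes are returned in sorted order
print(color_modes.tolist())  # ['blue', 'red']
```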

# 5. Measure of Spread
* Variance
* Range
* Standard Deviation

Measures of spread are statistics that describe how data varies. While measures of center give us an idea of the typical value, measures of spread give us a sense of how much the data tends to diverge from the typical value. One of the simplest measures of spread is the range. Range is the distance between the maximum and minimum observations:

# 5.1 Range

In [None]:
mtcars["mpg"].max() - mtcars["mpg"].min()

In [None]:
print(mtcars["mpg"].min())
print(mtcars["mpg"].max())

As noted earlier, the median represents the 50th percentile of a data set. A summary of several percentiles can be used to describe a variable's spread. We can extract the minimum value (0th percentile), first quartile (25th percentile), median, third quartile (75th percentile) and maximum value (100th percentile) using the quantile() method:

# 5.2 Quantile Values

In [None]:
Quantile_Values = [mtcars["mpg"].quantile(0),   
            mtcars["mpg"].quantile(0.25),
            mtcars["mpg"].quantile(0.50),
            mtcars["mpg"].quantile(0.75),
            mtcars["mpg"].quantile(1)]

In [None]:
Quantile_Values
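The same five-number summary can also be requested in a single call, since `quantile()` accepts a list of probabilities and returns a Series indexed by them. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy data: the integers 1 through 9
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# One call returns min, Q1, median, Q3 and max, indexed by probability
five_numbers = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_numbers.tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

Applied to the notebook's data, the equivalent call would be `mtcars["mpg"].quantile([0, 0.25, 0.5, 0.75, 1])`.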

### Describe()
The describe() method returns a summary of a numeric variable: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum.

In [None]:
mtcars["mpg"].describe()
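To make the layout of that summary concrete, here is a small sketch on hypothetical toy values. The result is a Series whose index labels name each statistic, so individual entries can be looked up by label.

```python
import pandas as pd

# Hypothetical toy data
values = pd.Series([1, 2, 3, 4, 5])

summary = values.describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# The "50%" entry is the median
print(summary["50%"], values.median())  # 3.0 3.0
```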

# 5.3 Interquartile Range

The interquartile range (IQR) is another common measure of spread. The IQR is the distance between the 3rd quartile and the 1st quartile:

In [None]:
mtcars["mpg"].quantile(0.75) - mtcars["mpg"].quantile(0.25)
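One reason to prefer the IQR over the range is that it ignores the extremes. The sketch below wraps the calculation in a hypothetical helper, `iqr`, and compares it with the range on toy data with and without an outlier.

```python
import pandas as pd

def iqr(series):
    """Distance between the 75th and 25th percentiles (hypothetical helper)."""
    return series.quantile(0.75) - series.quantile(0.25)

# Hypothetical toy data; the second series swaps the largest value for an outlier
typical = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_outlier = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 90])

range_typical = typical.max() - typical.min()
range_outlier = with_outlier.max() - with_outlier.min()

print(range_typical, range_outlier)     # 8 89   (range reacts to the outlier)
print(iqr(typical), iqr(with_outlier))  # 4.0 4.0 (IQR is unchanged)
```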

# 5.4 Box Plot

In [None]:
mtcars.boxplot(column="mpg",
               return_type='axes',
               figsize=(8,8))

plt.text(x=0.74, y=29, s="3rd Quartile")
plt.text(x=0.8, y=23, s="Median")
plt.text(x=0.75, y=17.5, s="1st Quartile")
plt.text(x=0.9, y=10, s="Min")
plt.text(x=0.9, y=45, s="Max")
plt.text(x=0.6, y=19.5, s="IQR", rotation=90, size=25);

# 5.5 Variance & Standard Deviation

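Variance and standard deviation are two other common measures of spread. The variance of a distribution is the average of the squared deviations (differences) from the mean, and the standard deviation is the square root of the variance, which expresses the spread in the same units as the data. Note that pandas' var() and std() compute the sample versions by default (ddof=1), dividing by n - 1 rather than n.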

# 5.5.1 Variance

In [None]:
mtcars["mpg"].var()

# 5.5.2 Standard Deviation

In [None]:
mtcars["mpg"].std()
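To connect the definition to the pandas calls, the sketch below computes the variance by hand on hypothetical toy values and compares it with `var()`. It also shows the effect of the `ddof` parameter: `ddof=0` divides by n (population variance), while the pandas default `ddof=1` divides by n - 1 (sample variance).

```python
import pandas as pd

# Hypothetical toy data with mean 5
data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

squared_devs = (data - data.mean()) ** 2  # squared deviations, sum is 32

population_var = squared_devs.sum() / len(data)   # divide by n     -> 4.0
sample_var = squared_devs.sum() / (len(data) - 1) # divide by n - 1 -> 32/7

print(population_var, data.var(ddof=0))  # both are the population variance
print(sample_var, data.var())            # both are the sample variance
print(data.std(ddof=0))                  # square root of the population variance
```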

Since variance and standard deviation are both derived from the mean, they are susceptible to the influence of skew and outliers. The median absolute deviation (MAD) is an alternative measure of spread based on the median, and it inherits the median's robustness against skew and outliers. It is the median of the absolute values of the deviations from the median. The result is commonly multiplied by a scaling factor of about 1.4826, as in the cells that follow, so that for normally distributed data it is comparable to the standard deviation.

In [None]:
abs_median_devs = abs(mtcars["mpg"] - mtcars["mpg"].median())

In [None]:
abs_median_devs.median() * 1.4826
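The robustness claim can be checked on a small scale. The sketch below uses a hypothetical helper, `scaled_mad`, and toy data in which one value is replaced by an outlier: the scaled MAD does not move, while the standard deviation grows many times over.

```python
import pandas as pd

def scaled_mad(series, scale=1.4826):
    """Median absolute deviation times a normal-consistency factor (hypothetical helper)."""
    abs_devs = (series - series.median()).abs()
    return abs_devs.median() * scale

# Hypothetical toy data; the second series swaps the largest value for an outlier
clean = pd.Series([10, 12, 13, 15, 16])
with_outlier = pd.Series([10, 12, 13, 15, 160])

print(scaled_mad(clean), scaled_mad(with_outlier))  # identical values
print(clean.std(), with_outlier.std())              # second is far larger
```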

# 6. Measure of Shape
* Skewness
* Kurtosis

Beyond measures of center and spread, descriptive statistics include measures that give you a sense of the shape of a distribution.

### 6.1 Skewness

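Skewness measures the asymmetry of a distribution. A value near 0 indicates a roughly symmetric distribution, a positive value indicates a longer tail on the right, and a negative value indicates a longer tail on the left.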

In [8]:
mtcars["mpg"].skew()

0.45706634399491913
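To see how the sign of the skewness relates to the shape, the sketch below compares three hypothetical toy series: one symmetric, one with a long right tail, and its mirror image with a long left tail.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the sign of the skewness
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])      # long tail to the right
left_skewed = pd.Series([-10, -3, -2, -2, -1, -1]) # mirror image, tail to the left

print(symmetric.skew())     # approximately 0
print(right_skewed.skew())  # positive
print(left_skewed.skew())   # negative

# In the right-skewed series the tail pulls the mean above the median
print(right_skewed.mean() > right_skewed.median())  # True
```

By the same reading, the positive value reported for mpg suggests a modest tail toward higher mpg values.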

### 6.2 Kurtosis

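Kurtosis measures how much of the data is in the tails of a distribution versus the centre. The pandas kurt() method reports excess kurtosis, for which a normal distribution scores 0: negative values indicate lighter tails than a normal distribution, and positive values indicate heavier tails.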

In [9]:
mtcars["mpg"].kurt()

-0.5107812652123154
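As a rough illustration of the sign convention, the sketch below compares two hypothetical toy series: evenly spread values with no pronounced tails, and values concentrated at the centre with two extreme observations.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the sign of excess kurtosis
flat = pd.Series(range(1, 11))              # evenly spread, no pronounced tails
heavy_tails = pd.Series([-10] + [0] * 8 + [10])  # mostly central, two extremes

flat_kurt = flat.kurt()
heavy_kurt = heavy_tails.kurt()

print(flat_kurt)   # negative: lighter tails than a normal distribution
print(heavy_kurt)  # positive: heavier tails than a normal distribution
```

Read this way, the negative value for mpg points to somewhat lighter tails than a normal distribution would have.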