# Descriptive Statistics With Python

## Load packages

In [3]:
import pandas as pd
import seaborn as sns

## Load data

https://seaborn.pydata.org/generated/seaborn.load_dataset.html

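Load the "mpg" example dataset through seaborn. The data is fetched from seaborn's online repository, so an internet connection is needed the first time it is loaded.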

In [4]:
df = sns.load_dataset("mpg")

In [5]:
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,usa,ford mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup
395,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge rampage
396,28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger


## Descriptive statistics for categorical variables

### One-dimensional frequency table

https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

Count how often each unique value occurs in a column of a pandas DataFrame.

In [6]:
df['name'].value_counts()

ford pinto             6
toyota corolla         5
amc matador            5
ford maverick          5
chevrolet chevette     4
                      ..
chevrolet monza 2+2    1
ford mustang ii        1
pontiac astro          1
amc pacer              1
chevy s-10             1
Name: name, Length: 305, dtype: int64

Count how often each unique value occurs in a column of a pandas DataFrame and return the result as relative frequencies (proportions that sum to 1, not percentages).

In [7]:
df['name'].value_counts(normalize = True)

ford pinto             0.015075
toyota corolla         0.012563
amc matador            0.012563
ford maverick          0.012563
chevrolet chevette     0.010050
                         ...   
chevrolet monza 2+2    0.002513
ford mustang ii        0.002513
pontiac astro          0.002513
amc pacer              0.002513
chevy s-10             0.002513
Name: name, Length: 305, dtype: float64
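
Absolute and relative frequencies are often easier to read side by side. The following is a minimal sketch, assuming a small hypothetical `origin` column (`toy`, not the mpg data) so that it runs without downloading anything. The names `toy`, `counts`, `shares` and `freq_table` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data (not the mpg dataset)
toy = pd.Series(["usa", "usa", "japan", "europe", "usa", "japan"], name="origin")

counts = toy.value_counts()                # absolute frequencies
shares = toy.value_counts(normalize=True)  # proportions that sum to 1

# Combine both into one frequency table, with the share shown as a percentage
freq_table = pd.DataFrame({"count": counts, "percent": (shares * 100).round(1)})
print(freq_table)
```

Multiplying the normalized counts by 100 is what turns the proportions into percentages.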

### Multi-dimensional frequency table

https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

Create a pivot table from a pandas DataFrame that shows the number of unique values in the `values` column for each group defined by the `index` column.

In [8]:
pd.pivot_table(data = df, index = "origin", values = ['name'], aggfunc = 'nunique' )

origin,name
europe,58
japan,60
usa,187
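
The same kind of count can also be obtained with `groupby`, which may help to see what the `'nunique'` aggregation does. This is a sketch on a hypothetical toy DataFrame (`toy`, with made-up rows rather than the mpg data):

```python
import pandas as pd

# Hypothetical toy data (not the mpg dataset)
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "europe"],
    "name": ["ford pinto", "ford pinto", "amc pacer",
             "toyota corolla", "datsun 510", "vw rabbit"],
})

# Number of distinct names per origin via pivot_table (returns a DataFrame) ...
by_pivot = pd.pivot_table(data=toy, index="origin", values=["name"], aggfunc="nunique")

# ... and via groupby (returns a Series)
by_groupby = toy.groupby("origin")["name"].nunique()

print(by_pivot)
print(by_groupby)
```

Note that "ford pinto" appears twice for `usa` but is counted only once, because `nunique` counts distinct values, not rows.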


Create a pivot table from a pandas DataFrame that shows the number of unique values in the `values` column for each group defined by *multiple* index columns.

In [9]:
pd.pivot_table(data = df, index = ["model_year", "origin"], values = ['name'], aggfunc = 'nunique' )

model_year,origin,name
70,europe,5
70,japan,2
70,usa,22
71,europe,4
71,japan,4
71,usa,20
72,europe,5
72,japan,5
72,usa,18
73,europe,7


In [10]:
# Double check one group: cars from Europe with model year 82
df.query("origin == 'europe' & model_year == 82")

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
375,36.0,4,105.0,74.0,1980,15.3,82,europe,volkswagen rabbit l
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup
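
The pivot tables in this section count distinct car names. For a contingency table that counts rows for each combination of two categorical variables, `pd.crosstab` is one option. A minimal sketch, assuming a hypothetical toy DataFrame (`toy`, not the mpg data):

```python
import pandas as pd

# Hypothetical toy data (not the mpg dataset)
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71],
    "origin": ["usa", "usa", "japan", "usa", "europe", "europe"],
})

# Rows: model_year, columns: origin, cells: number of rows per combination.
# margins=True adds an "All" row and column with the totals.
table = pd.crosstab(index=toy["model_year"], columns=toy["origin"], margins=True)
print(table)
```

Combinations that never occur (for example model year 70 and `europe` in the toy data) show up as 0.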


## Descriptive statistics for numerical variables

### Summary statistics

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html


Return the common descriptive statistics for all numerical variables of a dataframe:

In [None]:
df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
count,398.0,398.0,398.0,392.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,104.469388,2970.424623,15.56809,76.01005
std,7.815984,1.701004,104.269838,38.49116,846.841774,2.757689,3.697627
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0
25%,17.5,4.0,104.25,75.0,2223.75,13.825,73.0
50%,23.0,4.0,148.5,93.5,2803.5,15.5,76.0
75%,29.0,8.0,262.0,126.0,3608.0,17.175,79.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0
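
By default, `describe()` only summarizes the numerical columns. Passing `include="all"` also reports `count`, `unique`, `top` and `freq` for non-numerical columns. A short sketch on a hypothetical toy DataFrame (`toy`, not the mpg data):

```python
import pandas as pd

# Hypothetical toy data (not the mpg dataset)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 32.0, 28.0],
    "origin": ["usa", "usa", "japan", "usa"],
})

# include="all" summarizes numerical and non-numerical columns together;
# statistics that do not apply to a column are filled with NaN
summary = toy.describe(include="all")
print(summary)
```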


### Mean

https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html

Calculate the arithmetic mean of a column.

In [12]:
df['mpg'].mean()

23.514572864321607

### Median

https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html

Return the median (the middle value of the sorted data) of a column.

In [11]:
df['mpg'].median()

23.0

### Mode

https://pandas.pydata.org/docs/reference/api/pandas.Series.mode.html

Return the mode(s) of a column.

In [13]:
df['mpg'].mode()

0    13.0
Name: mpg, dtype: float64

In [14]:
# mode() also works for categorical data
df['origin'].mode()

0    usa
Name: origin, dtype: object
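
`mode()` returns a Series rather than a single number because a column can have several modes. A small sketch with hypothetical values (not taken from the mpg data) in which two values are tied:

```python
import pandas as pd

# Hypothetical values: 13.0 and 18.0 both occur twice
toy = pd.Series([13.0, 13.0, 18.0, 18.0, 25.0])

modes = toy.mode()     # one entry per mode, in ascending order
print(modes.tolist())  # → [13.0, 18.0]
```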

### Range


https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.min.html

Calculate the range (maximum minus minimum) of the values in a column.

In [None]:
df['mpg'].max() - df['mpg'].min()

37.6

### Variance

https://pandas.pydata.org/docs/reference/api/pandas.Series.var.html

Return the unbiased variance of a column. By default the result is normalized by N-1; this can be changed using the `ddof` argument.

In [15]:
df['mpg'].var()

61.089610774274405

In [20]:
# ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables
df['mpg'].var(ddof=0)

60.93611928991693
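
To make the role of `ddof` concrete, the two normalizations can be computed by hand. The sketch below uses only the Python standard library and the first five `mpg` values from the data preview (18.0, 15.0, 18.0, 16.0, 17.0). The variable names are illustrative.

```python
import statistics

# First five mpg values from the data preview
values = [18.0, 15.0, 18.0, 16.0, 17.0]

n = len(values)
mean = sum(values) / n
squared_deviations = sum((x - mean) ** 2 for x in values)

sample_variance = squared_deviations / (n - 1)  # corresponds to ddof=1 (pandas default)
population_variance = squared_deviations / n    # corresponds to ddof=0

# Cross-check against the standard library implementations
assert abs(sample_variance - statistics.variance(values)) < 1e-9
assert abs(population_variance - statistics.pvariance(values)) < 1e-9

print(sample_variance, population_variance)
```

The divisor is `N - ddof`, so the two results differ by the factor `(N - 1) / N`, which becomes negligible for large N.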

### Standard deviation

https://pandas.pydata.org/docs/reference/api/pandas.Series.std.html

Return the sample standard deviation of a column. By default the result is normalized by N-1; this can be changed using the `ddof` argument.

In [21]:
df['mpg'].std()

7.815984312565782

In [22]:
df['mpg'].std(ddof=0)

7.806159061274433

### Quantiles (Percentiles)

https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html

Return the value at the given quantile.

A quantile is a cut point in the rank-ordered data: the q-quantile is the value below which a fraction q of the observations fall. For example, the 25th percentile (the first quartile) is the value below which 25% of the data fall.

In [23]:
# 25% quantile
df['mpg'].quantile(q = 0.25)

17.5

In [24]:
# 75% quantile
df['mpg'].quantile(q = 0.75)

29.0

In [26]:
# both quantiles
df['mpg'].quantile(q = [0.25, 0.75])

0.25    17.5
0.75    29.0
Name: mpg, dtype: float64
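
When the requested quantile falls between two observations, pandas interpolates linearly by default (`interpolation='linear'`). The following pure-Python sketch mirrors that rule. `linear_quantile` is a hypothetical helper written for illustration, not part of pandas, and the input values are made up.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical values (not the mpg data)
values = [10.0, 20.0, 30.0, 40.0]

print(linear_quantile(values, 0.25))  # → 17.5
print(linear_quantile(values, 0.75))  # → 32.5
```

With four values the 25% quantile sits at fractional index 0.75, three quarters of the way from 10.0 to 20.0, which gives 17.5.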

In [27]:
q1, q3 = df['mpg'].quantile(q = [0.25, 0.75])

Calculate the interquartile range (IQR), a measure of the spread of the data.

The IQR is also called the midspread or the middle 50%. It is defined as the difference between the 75th and 25th percentiles of the data.

In [28]:
iqr = q3 - q1
iqr

11.5

## Automated exploratory data analysis

### Load package

Use [sweetviz](https://pypi.org/project/sweetviz/) to calculate descriptive statistics for all your data.
See the [documentation here](https://colab.research.google.com/drive/1-md6YEwcVGWVnQWTBirQSYQYgdNoeSWg?usp=sharing).

In [None]:
# Using pip
!pip install sweetviz

In [None]:
# Using conda (for Anaconda distribution)
!conda install -c conda-forge sweetviz

In [33]:
import sweetviz as sv

### Analyze

In [None]:
sv_report = sv.analyze(df)

In [37]:
sv_report.show_notebook(w=900, h=450, scale=0.8)

#### Analyze with a target variable

In [None]:
sv_report = sv.analyze(df, "mpg")

In [39]:
sv_report.show_notebook(w=900, h=450, scale=0.8)

### Compare two datasets

In [None]:
df1 = df.query("model_year < 76")
df2 = df.query("model_year >= 76")

In [None]:
sv_report = sv.compare(source = [df1, "< 76"], compare = [df2,">= 76"])

In [None]:
sv_report.show_notebook(w=900, h=450, scale=0.8)

#### Compare two datasets with target

In [None]:
sv_report = sv.compare(source = [df1, "< 76"], compare = [df2,">= 76"], target_feat = "mpg")

In [None]:
sv_report.show_notebook(w=900, h=450, scale=0.8)

### Compare intra

Compare two subpopulations inside one dataset, split by a boolean condition.

In [None]:
sv_report = sv.compare_intra(df, df['origin'] == 'usa', ["US", "Non-US"])

In [None]:
sv_report.show_notebook(w=900, h=450, scale=0.8)

#### Compare subpopulations inside one dataset with a target

In [None]:
sv_report = sv.compare_intra(df, df['origin'] == 'usa', ["US", "Non-US"], "mpg")

In [None]:
sv_report.show_notebook(w=900, h=450, scale=0.8)

## Bonus

**More Alternatives for Automated EDA:**

* ydata-profiling (formerly pandas-profiling): https://ydata-profiling.ydata.ai/docs/master/index.html

* dtale: https://github.com/man-group/dtale

* autoviz: https://github.com/AutoViML/AutoViz
