**Descriptive Statistics for a pandas DataFrame**

https://chrisalbon.com/python/data_wrangling/pandas_dataframe_descriptive_stats/

In [1]:
import pandas as pd

**Create the DataFrame**

In [2]:
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
df

    name  age  preTestScore  postTestScore
0  Jason   42             4             25
1  Molly   52            24             94
2   Tina   36            31             57
3   Jake   24             2             62
4    Amy   73             3             70


**The sum of all ages**

In [3]:
df['age'].sum()

227

**Mean preTestScore**

In [4]:
df['preTestScore'].mean()

12.8

**Cumulative sum of preTestScore, accumulating from the top row down**

In [5]:
df['preTestScore'].cumsum()

0     4
1    28
2    59
3    61
4    64
Name: preTestScore, dtype: int64

**Summary Statistics on preTestScore**

In [6]:
df['preTestScore'].describe()

count     5.000000
mean     12.800000
std      13.663821
min       2.000000
25%       3.000000
50%       4.000000
75%      24.000000
max      31.000000
Name: preTestScore, dtype: float64

**Count the number of non-NA values**

In [7]:
df['preTestScore'].count()

5
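
The example data has no missing values, so `count()` happens to equal the number of rows. A minimal sketch with a hypothetical series containing one `NaN` (not part of the original data) shows the difference between the row count and the non-NA count:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value, for illustration only
scores = pd.Series([4, 24, np.nan, 2, 3])

n_rows = len(scores)       # counts every row, including the NaN
n_valid = scores.count()   # counts only non-NA values
n_missing = scores.isna().sum()
```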

**Minimum value of preTestScore**

In [8]:
df['preTestScore'].min()

2

**Maximum value of preTestScore**

In [9]:
df['preTestScore'].max()

31

**Median value of preTestScore**

In [10]:
df['preTestScore'].median()

4.0

**Sample variance of preTestScore**

**Variance measures how far a set of (random) numbers is spread out from its average value.**

In [11]:
df['preTestScore'].var()

186.7
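
As a rough check, the sample variance can be reproduced from its definition: the sum of squared deviations from the mean, divided by n - 1. The sketch below rebuilds the `preTestScore` column on its own; the variable names are illustrative. pandas uses `ddof=1` by default, and passing `ddof=0` gives the population variance instead.

```python
import pandas as pd

pre = pd.Series([4, 24, 31, 2, 3], name='preTestScore')

# Sample variance: squared deviations from the mean, divided by n - 1
deviations = pre - pre.mean()
manual_var = (deviations ** 2).sum() / (len(pre) - 1)

# pandas defaults to ddof=1 (sample variance)
sample_var = pre.var()

# ddof=0 divides by n instead (population variance)
population_var = pre.var(ddof=0)
```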

**Sample standard deviation of preTestScore values**

**Standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.**

In [13]:
df['preTestScore'].std()

13.663820841916802
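
The sample standard deviation is the square root of the sample variance. One point that often causes confusion: pandas divides by n - 1 by default, while NumPy's `np.std` divides by n, so the two give different numbers on the same data. A small sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

pre = pd.Series([4, 24, 31, 2, 3], name='preTestScore')

sample_std = pre.std()            # pandas default: ddof=1
root_of_var = pre.var() ** 0.5    # same value, via the variance

# NumPy default is ddof=0 (population standard deviation)
population_std = np.std(pre.to_numpy())

# Passing ddof=1 to NumPy matches the pandas result
numpy_sample_std = np.std(pre.to_numpy(), ddof=1)
```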

**Skewness of preTestScore values**

**Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.**

In [14]:
df['preTestScore'].skew()

0.74334524573267513
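
pandas documents `skew()` as an unbiased estimate. The sketch below assumes this is the adjusted Fisher-Pearson coefficient: the ratio of the third central moment to the 1.5 power of the second, multiplied by a small-sample correction. On this data it reproduces the value above. A positive result means the right tail is longer, which fits a column where most scores are low and two are much higher.

```python
import pandas as pd

pre = pd.Series([4, 24, 31, 2, 3], name='preTestScore')
n = len(pre)
dev = pre - pre.mean()

# Biased (divide-by-n) central moments
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()

# Uncorrected moment coefficient of skewness
g1 = m3 / m2 ** 1.5

# Small-sample correction (adjusted Fisher-Pearson coefficient)
manual_skew = g1 * (n * (n - 1)) ** 0.5 / (n - 2)

pandas_skew = pre.skew()
```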

**Kurtosis of preTestScore values**

**Kurtosis (from Greek: κυρτός, kyrtos or kurtos, meaning "curved, arching") is a measure of the "tailedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution.**

**The standard measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population. This number is related to the tails of the distribution, not its peak; hence, the sometimes-seen characterization as "peakedness" is mistaken. For this measure, higher kurtosis is the result of infrequent extreme deviations (or outliers), as opposed to frequent modestly sized deviations.**

In [15]:
df['preTestScore'].kurt()

-2.4673543738411547
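
Note that pandas reports excess kurtosis (Fisher's definition, where a normal distribution scores 0) with a small-sample correction, rather than Pearson's raw fourth-moment ratio, which is why the result can be negative. The sketch below assumes the usual bias-corrected sample formula and reproduces the value above; the variable names are illustrative.

```python
import pandas as pd

pre = pd.Series([4, 24, 31, 2, 3], name='preTestScore')
n = len(pre)
dev = pre - pre.mean()

# Biased (divide-by-n) central moments
m2 = (dev ** 2).mean()
m4 = (dev ** 4).mean()

# Pearson's kurtosis is m4 / m2**2; subtracting 3 gives excess kurtosis
pearson_kurt = m4 / m2 ** 2
g2 = pearson_kurt - 3

# Small-sample bias correction applied to the excess kurtosis
manual_kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

pandas_kurt = pre.kurt()
```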

**Correlation matrix of values**

**Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to how close two variables are to having a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a limited supply product and its price.**

In [17]:
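# Keep only numeric columns: pandas 2.0+ raises an error on the string column 'name'
df.select_dtypes(include='number').corr()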

                     age  preTestScore  postTestScore
age             1.000000     -0.105651       0.328852
preTestScore   -0.105651      1.000000       0.378039
postTestScore   0.328852      0.378039       1.000000
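
Each entry in the matrix is a Pearson correlation coefficient: the covariance of the two columns divided by the product of their standard deviations. A minimal sketch for one pair of columns, rebuilding the numeric data on its own (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [42, 52, 36, 24, 73],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})

pre, post = df['preTestScore'], df['postTestScore']

# Pearson r = cov(x, y) / (std(x) * std(y)), all using the n - 1 denominator
manual_r = pre.cov(post) / (pre.std() * post.std())

# Series.corr defaults to method='pearson'
pandas_r = pre.corr(post)
```

Because the result is scaled by both standard deviations, it always lies between -1 and 1 and does not depend on the units of either column.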


**Covariance matrix of values**

**Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.**

In [18]:
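# Keep only numeric columns: pandas 2.0+ raises an error on the string column 'name'
df.select_dtypes(include='number').cov()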

                  age  preTestScore  postTestScore
age            340.80        -26.65         151.20
preTestScore   -26.65        186.70         128.65
postTestScore  151.20        128.65         620.30
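
To make the definition concrete, the sketch below computes one off-diagonal entry by hand: the sum of the products of paired deviations from each column's mean, divided by n - 1. It also checks that a diagonal entry is simply that column's sample variance. The numeric data is rebuilt on its own and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [42, 52, 36, 24, 73],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})

age, pre = df['age'], df['preTestScore']

# Sample covariance: products of paired deviations, divided by n - 1
manual_cov = ((age - age.mean()) * (pre - pre.mean())).sum() / (len(df) - 1)

cov_matrix = df.cov()
pandas_cov = cov_matrix.loc['age', 'preTestScore']

# The diagonal holds each column's sample variance
diag_pre = cov_matrix.loc['preTestScore', 'preTestScore']
```

The negative value for age and preTestScore says only that older participants tended to score slightly lower before the test; unlike correlation, its magnitude depends on the units of both columns.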
