
# Descriptive Statistics in Python!

### Introduction

In this notebook we'll learn how to generate basic descriptive statistics in Python. Specifically, we'll use the statsmodels library to load the Galton Families data set into a pandas DataFrame, and then use several of the DataFrame's methods to view statistics that describe the data.

These data are height measurements (in inches) for parents and their children gathered by Francis Galton in 1886.

### Loading the Data

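For today's examples, we'll rely on the Galton Families data set. The code below imports the statsmodels library and uses its *get_rdataset()* function to load the Galton Families data set (part of R's HistData package) into a pandas DataFrame object. Note that *get_rdataset()* downloads the data, so an internet connection is required: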

In [0]:
#import the statsmodels library
import statsmodels.api as sm

#get the Galton Families dataset and assign it to a pandas dataframe
df = sm.datasets.get_rdataset('GaltonFamilies', package='HistData').data

### Familiarizing Ourselves with the Data

Understanding the nature and structure of your data is always an important first step in a data analysis project. To view the first several rows of data, we can use the DataFrame's *head()* function, which accepts the number of rows to return. Run the code below to view the first 20 rows of data in the data set. Notice that there is one record for each child, and that records are grouped into families.

In [0]:
#view the first 20 rows of data in the dataframe
df.head(20)

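| | family | father | mother | midparentHeight | children | childNum | gender | childHeight |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 75.43 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 75.43 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 75.43 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 75.43 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 |
| 5 | 2 | 75.5 | 66.5 | 73.66 | 4 | 2 | male | 72.5 |
| 6 | 2 | 75.5 | 66.5 | 73.66 | 4 | 3 | female | 65.5 |
| 7 | 2 | 75.5 | 66.5 | 73.66 | 4 | 4 | female | 65.5 |
| 8 | 3 | 75.0 | 64.0 | 72.06 | 2 | 1 | male | 71.0 |
| 9 | 3 | 75.0 | 64.0 | 72.06 | 2 | 2 | female | 68.0 |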
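
Beyond *head()*, a few other calls help when getting to know a data set. The sketch below is only an illustration: it uses a small hand-typed DataFrame named `sample` (a hypothetical name) holding four of the rows shown above rather than the full data set, so that it runs without downloading anything.

```python
import pandas as pd

# Hand-typed copy of four rows from the output above (not the full data set)
sample = pd.DataFrame({
    "family": [1, 1, 2, 2],
    "father": [78.5, 78.5, 75.5, 75.5],
    "mother": [67.0, 67.0, 66.5, 66.5],
    "gender": ["male", "female", "male", "female"],
    "childHeight": [73.2, 69.2, 73.5, 65.5],
})

print(sample.head(2))   # first two rows
print(sample.tail(2))   # last two rows
print(sample.dtypes)    # data type of each column

# Count the records in each family to confirm there is one row per child
children_per_family = sample.groupby("family").size()
print(children_per_family)
```

The same calls should work unchanged on the full `df` loaded earlier.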


### Shape of the Data Set

What if we wanted to know how many rows and columns our data set contains? All we need to do to get these answers is to examine the DataFrame's *shape* property. The first value returned is the number of rows, while the second value is the number of columns:

In [0]:
#print the shape of the dataframe
print(df.shape)

#we can also identify just the number of rows or columns by specifying an index value:
print(df.shape[0])
print(df.shape[1])

(934, 8)
934
8
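
Because *shape* is an ordinary Python tuple, it can also be unpacked into two variables. The following is a minimal sketch on a tiny made-up DataFrame (the name `frame` and its values are assumptions for illustration only):

```python
import pandas as pd

# Tiny made-up frame: 3 rows, 2 columns
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple in one step
n_rows, n_cols = frame.shape
print(n_rows, n_cols)

# len() gives the number of rows, the same value as frame.shape[0]
print(len(frame))

# size gives the total number of cells (rows * columns)
print(frame.size)
```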


### Measures of Central Tendency

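Standard measures of central tendency include the mean (average), the median (middle value), and the mode (most common value). Let's find these values for the *childHeight* column in our data set by calling the *mean()*, *median()*, and *mode()* functions on that column, which pandas returns as a Series. Because a data set can have more than one mode, *mode()* returns a Series of values rather than a single number: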

In [0]:
#print the mean, median, and mode for the 'childHeight' column
print(df['childHeight'].mean())
print(df['childHeight'].median())
print(df['childHeight'].mode())

#assign the data series to a variable
heights = df['childHeight']
print(heights.mean())

66.74593147751597
66.5
0    65.0
dtype: float64
66.74593147751597
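
The output of *mode()* above looks different from the other two results because it is a Series, not a single number. A small sketch with made-up values (the Series `values` is an assumption for illustration) shows why: when two values tie for most common, both are returned.

```python
import pandas as pd

# Made-up heights in which 62.0 and 65.0 each appear twice
values = pd.Series([60.0, 62.0, 62.0, 65.0, 65.0, 70.0])

modes = values.mode()          # Series holding every tied mode, sorted
print(modes.tolist())          # [62.0, 65.0]

print(values.mean())           # 64.0
print(values.median())         # 63.5

# When only the first mode is wanted, index into the result
first_mode = modes.iloc[0]
print(first_mode)              # 62.0
```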


### Minimum, Maximum, and Range of Values

What if we wanted to know the minimum, maximum, and range of values for our *childHeight* column? We can easily view the minimum and maximum values for the children's heights by using the *min()* and *max()* functions:

In [0]:
#print the minimum and maximum values for the 'childHeight' column
print(df['childHeight'].min())
print(df['childHeight'].max())

56.0
79.0


The range is simply the difference between the maximum and the minimum:

In [0]:
#print the range of children's heights (maximum minus minimum)
print(df['childHeight'].max() - df['childHeight'].min())

23.0
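
If the minimum and maximum are both needed, they can be requested in one call with *agg()* and then subtracted. This is a sketch on five made-up values (the Series `heights_sample` is hypothetical, chosen so its extremes match the 56.0 and 79.0 seen above):

```python
import pandas as pd

# Made-up values whose minimum and maximum match the output above
heights_sample = pd.Series([56.0, 64.0, 66.5, 69.7, 79.0])

# agg() with a list returns a Series indexed by the statistic names
summary = heights_sample.agg(["min", "max"])
print(summary)

# The range is the maximum minus the minimum
value_range = summary["max"] - summary["min"]
print(value_range)  # 23.0
```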


### Quartiles and Percentiles

If we'd like to know the first, second, and third quartiles (the 25th, 50th, and 75th percentiles) of the children's heights, we simply need to use the *quantile()* function. (Note that *quantile* is a generic term for a cut point in ranked data, such as a percentile or a quartile.) In this example, we will ask pandas to give us the three quartiles simultaneously by passing a list of values into the *quantile()* function:

In [0]:
#print the 25%, 50%, and 75% quartiles for the children's heights
print(df['childHeight'].quantile([0.25, 0.5, 0.75]))

0.25    64.0
0.50    66.5
0.75    69.7
Name: childHeight, dtype: float64


As noted above, we can use the *quantile()* function to give us any percentile that we need. For example, imagine that we'd like to see the 10th and 90th percentiles:

In [0]:
#print the 10th and 90th percentiles for the children's heights
print(df['childHeight'].quantile([0.1, 0.9]))

0.1    62.0
0.9    71.5
Name: childHeight, dtype: float64
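
A requested quantile often falls between two observed values. By default, pandas interpolates linearly between them. The sketch below works this through on four made-up numbers (the Series `values` is an assumption for illustration) and also computes the interquartile range, the distance between the first and third quartiles.

```python
import pandas as pd

# Four made-up values, already sorted
values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Linear interpolation: position = q * (n - 1)
# q = 0.25 -> position 0.75, i.e. 75% of the way from 1.0 to 2.0
q1 = values.quantile(0.25)
print(q1)   # 1.75

# q = 0.75 -> position 2.25, i.e. 25% of the way from 3.0 to 4.0
q3 = values.quantile(0.75)
print(q3)   # 3.25

# Interquartile range: spread of the middle half of the data
iqr = q3 - q1
print(iqr)  # 1.5

# interpolation="lower" returns an observed value instead of interpolating
q1_lower = values.quantile(0.25, interpolation="lower")
print(q1_lower)  # 1.0
```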


### Variance and Standard Deviation

If we'd like to gain some insights into how spread out the children's heights are, we can examine their variance and standard deviation by using the *var()* and *std()* functions:

In [0]:
#print the variance and standard deviation for the children's heights
print(df['childHeight'].var())
print(df['childHeight'].std())

#note the relationship between variance and standard deviation
df['childHeight'].std()**2

12.811039163574018
3.5792512015188342


12.81103916357402
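
One detail worth knowing: by default, pandas computes the sample variance and standard deviation, dividing by n - 1 rather than n. The sketch below uses eight made-up values (the Series `values` is an assumption for illustration) to compare the sample and population versions via the `ddof` argument.

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5.0; squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# pandas default (ddof=1): divide by n - 1 = 7
sample_var = values.var()
print(sample_var)          # 32 / 7, about 4.57

# ddof=0: divide by n = 8 (population variance)
population_var = values.var(ddof=0)
print(population_var)      # 4.0

# NumPy's default is ddof=0, so it matches the population variance
numpy_var = np.var(values.to_numpy())
print(numpy_var)           # 4.0

# Standard deviation is the square root of the variance in both cases
population_std = values.std(ddof=0)
print(population_std)      # 2.0
```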

### The Lazy Way!

Viewing descriptive statistics individually can be useful, but it is often best for a data scientist to view a variety of descriptive statistics at the same time. For this purpose, we can use the *describe()* function:

In [0]:
#print summary statistics for the children's heights
print(df['childHeight'].describe())

count    934.000000
mean      66.745931
std        3.579251
min       56.000000
25%       64.000000
50%       66.500000
75%       69.700000
max       79.000000
Name: childHeight, dtype: float64


The *describe()* function can also be used to generate descriptive statistics for multiple numeric variables simultaneously:

In [0]:
#print summary statistics for all of the numeric variables in the dataframe
df.describe()

| | father | mother | midparentHeight | children | childNum | childHeight |
|---|---|---|---|---|---|---|
| count | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 |
| mean | 69.197109 | 64.089293 | 69.206773 | 6.171306 | 3.585653 | 66.745931 |
| std | 2.476479 | 2.290886 | 1.80237 | 2.729025 | 2.36141 | 3.579251 |
| min | 62.0 | 58.0 | 64.4 | 1.0 | 1.0 | 56.0 |
| 25% | 68.0 | 63.0 | 68.14 | 4.0 | 2.0 | 64.0 |
| 50% | 69.0 | 64.0 | 69.248 | 6.0 | 3.0 | 66.5 |
| 75% | 71.0 | 65.875 | 70.14 | 8.0 | 5.0 | 69.7 |
| max | 78.5 | 70.5 | 75.43 | 15.0 | 15.0 | 79.0 |
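
The *describe()* function can be adjusted in a couple of useful ways. The sketch below, which uses a small hand-typed DataFrame named `sample` (hypothetical, copied from the first four rows shown earlier), requests custom percentiles and also summarizes a text column, for which pandas reports counts and the most frequent value instead of a mean.

```python
import pandas as pd

# Hand-typed copy of the first four rows (family 1) from the data set
sample = pd.DataFrame({
    "gender": ["male", "female", "female", "female"],
    "childHeight": [73.2, 69.2, 69.0, 69.0],
})

# Ask for the 10th and 90th percentiles instead of the default quartiles
custom = sample["childHeight"].describe(percentiles=[0.1, 0.9])
print(custom)

# For a text column, describe() reports count, unique, top, and freq
gender_summary = sample["gender"].describe()
print(gender_summary)
```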


In [0]:
#let's convert all of the children's heights to centimeters using Python's handy list comprehension feature!
df['childHeightCm'] = [i * 2.54 for i in df['childHeight']]
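
The list comprehension above works well. As an alternative sketch, pandas can also multiply an entire column at once, which is usually shorter to write and faster on large data sets. The example below uses three hand-typed heights (the DataFrame `frame` and the column name `childHeightCm_loop` are assumptions for illustration) to show that both approaches give the same numbers.

```python
import pandas as pd

# Three hand-typed heights in inches
frame = pd.DataFrame({"childHeight": [73.2, 69.2, 69.0]})

# Approach 1: list comprehension, one value at a time
frame["childHeightCm_loop"] = [h * 2.54 for h in frame["childHeight"]]

# Approach 2: vectorized arithmetic on the whole column
frame["childHeightCm"] = frame["childHeight"] * 2.54

# Largest absolute difference between the two results
max_difference = (frame["childHeightCm_loop"] - frame["childHeightCm"]).abs().max()
print(frame)
print(max_difference)
```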

In [0]:
df.describe()

| | father | mother | midparentHeight | children | childNum | childHeight | childHeightCm |
|---|---|---|---|---|---|---|---|
| count | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 |
| mean | 69.197109 | 64.089293 | 69.206773 | 6.171306 | 3.585653 | 66.745931 | 169.534666 |
| std | 2.476479 | 2.290886 | 1.80237 | 2.729025 | 2.36141 | 3.579251 | 9.091298 |
| min | 62.0 | 58.0 | 64.4 | 1.0 | 1.0 | 56.0 | 142.24 |
| 25% | 68.0 | 63.0 | 68.14 | 4.0 | 2.0 | 64.0 | 162.56 |
| 50% | 69.0 | 64.0 | 69.248 | 6.0 | 3.0 | 66.5 | 168.91 |
| 75% | 71.0 | 65.875 | 70.14 | 8.0 | 5.0 | 69.7 | 177.038 |
| max | 78.5 | 70.5 | 75.43 | 15.0 | 15.0 | 79.0 | 200.66 |


In [0]:
#show the number of children who have each height
df['childHeight'].value_counts()

65.0    63
66.0    59
68.0    57
70.0    56
67.0    52
        ..
60.1     1
59.0     1
58.0     1
79.0     1
61.2     1
Name: childHeight, Length: 67, dtype: int64
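
By default, *value_counts()* sorts by frequency, which is why 65.0 appears first above and matches the mode found earlier. The sketch below uses six made-up heights (the Series `heights_sample` is an assumption for illustration) to show two common variations: proportions instead of counts, and ordering by height instead of by frequency.

```python
import pandas as pd

# Six made-up heights: 65.0 three times, 66.0 twice, 68.0 once
heights_sample = pd.Series([65.0, 65.0, 66.0, 68.0, 65.0, 66.0])

# Default: counts sorted from most to least frequent
counts = heights_sample.value_counts()
print(counts)

# normalize=True returns the share of records instead of the count
shares = heights_sample.value_counts(normalize=True)
print(shares)

# sort_index() orders the result by height rather than by frequency
by_height = counts.sort_index()
print(by_height)
```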