# Practicing Descriptive Statistics with `NumPy` and `pandas`

As in the `R` practice notebook, data exploration isn't complete without manipulating the data. Splitting the data by category and examining the features of each category can lead to better, more powerful insights.

In this practice, we will work with the data manipulation capabilities of `pandas` and the statistical functionality that `NumPy` provides to further investigate the *Game of Thrones* data.

In [1]:
import pandas as pd
import numpy as np

with open('../../datasets/game-of-thrones/GoT_age_at_death.csv') as file:
    df = pd.read_csv(file)
    df.columns  = ['character', 'age', 'dead', 'gender', 'affiliation'] # rename the columns
    
    # turn dead, gender, and affiliation into categorical data
    df['dead'] = df['dead'].astype('category') 
    df['gender'] = df['gender'].astype('category')
    df['affiliation'] = df['affiliation'].astype('category')

In [2]:
df.head(7)

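|   | character | age | dead | gender | affiliation |
|---|---|---|---|---|---|
| 0 | Sandor Clegan | 29 | 1 | 1 | 4 |
| 1 | Benjen Stark | 35 | 1 | 1 | 10 |
| 2 | Syrio Forel | 41 | 1 | 1 | 1 |
| 3 | Tysha | 29 | 0 | 0 | 4 |
| 4 | Jeyne Pool | 12 | 1 | 0 | 1 |
| 5 | Imry Florent | 35 | 1 | 1 | 2 |
| 6 | Sorcerer in the Box | 60 | 1 | 1 | 0 |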


In module 1, we didn't introduce you to `pandas`' `groupby` method, but it operates on a data frame the same way that `dplyr`'s `group_by` function does. Below is how we would group the data frame by `gender` (coded 0 and 1) and compute the mean age for each group.

In [3]:
df.groupby(['gender'])['age'].mean()

gender
0    29.467391
1    37.635379
Name: age, dtype: float64

We could go even more fine-grained than this. 

**Activity 1**: *Find the mean age per `gender` and per `affiliation`.*

In [4]:
# Activity 1 code goes here
# -------------------------


df.groupby(['gender'])['age'].mean()
df.groupby(['affiliation'])['age'].mean()

affiliation
0     32.181818
1     34.872340
2     33.023810
3     36.508475
4     39.263158
5     26.500000
6     32.800000
8     59.000000
9     26.666667
10    35.939394
11    36.666667
12    33.866667
13    46.400000
14    32.705882
15    47.100000
16    34.692308
Name: age, dtype: float64
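
If the activity is read as asking for the mean age of each `gender` and `affiliation` combination, one option is to pass both column names to `groupby`. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the *Game of Thrones* data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'age': [20, 30, 40, 50, 60, 70],
    'gender': [0, 0, 1, 1, 1, 0],
    'affiliation': [1, 1, 1, 2, 2, 2],
})

# Passing two column names groups by each (gender, affiliation) pair
# and returns a Series indexed by both keys.
mean_by_both = toy.groupby(['gender', 'affiliation'])['age'].mean()
print(mean_by_both)
```

With categorical columns such as those in `df`, `groupby` may also list combinations that never occur in the data; passing `observed=True` limits the result to the combinations that are present.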

In the `R` practice, we emphasized descriptive statistics of those characters who have died in the series. Just for the sake of comparison, let's look at those who are alive. 

**Activity 2**: *Create a subset from the data frame of only those who are living. Call this subset `alive_chars`.*

In [6]:
# Activity 2 code goes here
# -------------------------


alive_chars = df[df['dead']==0]

Remember that in the lab we used `NumPy` to find the mean age of the entire dataset; the syntax is shown below.

**Activity 3**: *Find the mean age of those who are alive.*

In [7]:
np.mean(df.age)

35.59891598915989

In [8]:
# Activity 3 code goes here
# -------------------------


np.mean(alive_chars.age)

33.00568181818182

Below is one of the ways that we found the standard deviation.

In [9]:
np.std(df.age)

18.99184246263994

But remember, without any extra arguments passed to the function, we get the population standard deviation (`ddof=0`). What we want is the sample standard deviation.

**Activity 4**: *Find the sample standard deviation of the ages of the characters that are still alive.*

In [10]:
# Activity 4 code goes here
# -------------------------


np.std(alive_chars.age, ddof=1)

17.829829631135702
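
To make the role of `ddof` concrete, here is a minimal sketch on a made-up series (`ages` and its values are hypothetical). It compares the population and sample standard deviations from `np.std` with the `pandas` method `Series.std`, whose default is `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen so the arithmetic is easy to follow.
ages = pd.Series([10, 20, 30, 40])

# Population standard deviation: divides by n (ddof=0 is NumPy's default).
pop_sd = np.std(ages.to_numpy())

# Sample standard deviation: divides by n - 1.
sample_sd = np.std(ages.to_numpy(), ddof=1)

# pandas uses ddof=1 by default, so this matches the sample value.
pandas_sd = ages.std()

print(pop_sd, sample_sd, pandas_sd)
```

The sample value is always at least as large as the population value because it divides the same sum of squares by `n - 1` instead of `n`.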

In the `R` practice we referred to the symmetry of a distribution. As a reminder, a mean greater than the median usually indicates that the data are skewed to the right, and a median greater than the mean usually indicates a skew to the left. When the mean and median are equal, the data are roughly symmetrical. In other words, in a right-skewed distribution most of the data points are below the mean, but some high-value points pull the mean higher; the opposite is true for a left skew.

<img src="../images/distr_sym.gif" width="750">

Below is an example of how we found the median before.

In [11]:
np.median(df.age)

35.0

**Activity 5**: *Is the distribution of the age of those who are alive skewed to the right, the left, or symmetrical?*

In [15]:
# Activity 5 code goes here
# -------------------------

median = np.median(alive_chars.age)
mean = np.mean(alive_chars.age)


if mean > median:
    print('right')
elif median > mean:
    print('left')
else:
    print('symmetric')

right
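
The mean-versus-median comparison is a rule of thumb rather than a formal test. As a rough cross-check, `pandas` also offers `Series.skew()`, which returns a sample skewness coefficient (positive for a right skew, negative for a left skew). The sketch below uses made-up values (`values` is hypothetical) with one large observation pulling the mean upward.

```python
import pandas as pd

# Hypothetical data: mostly small values plus one large value.
values = pd.Series([1, 2, 2, 3, 3, 3, 20])

mean = values.mean()
median = values.median()

# Sample skewness coefficient computed by pandas.
skewness = values.skew()

print(mean, median, skewness)
# Here the mean exceeds the median and the skewness is positive,
# so both indicators point to a right skew.
```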


`NumPy` also lets us see the value below which a certain percentage of the rows fall, by calling the `percentile` function. We did this to find the 65th percentile of ages in the entire data frame as follows.

In [16]:
np.percentile(df.age, 65)

41.0

Now, remember that we discussed the interquartile range in the `R` lab notebook.

**Activity 6**: *Find the interquartile range (IQR) of the `alive_chars` age variable.*

In [18]:
# Activity 6 code goes here
# -------------------------


q75, q25 = np.percentile(alive_chars.age, [75, 25])
iqr = q75 - q25
iqr

26.25
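
The same calculation can be sketched with `pandas` alone. The example below, on made-up ages (`ages` is hypothetical), computes the IQR with `np.percentile` and with `Series.quantile`; note that `quantile` takes fractions between 0 and 1 rather than percentages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration.
ages = pd.Series([10, 20, 30, 40, 50])

# NumPy: percentiles are given on a 0-100 scale.
q25, q75 = np.percentile(ages.to_numpy(), [25, 75])
iqr_numpy = q75 - q25

# pandas: quantiles are given on a 0-1 scale.
iqr_pandas = ages.quantile(0.75) - ages.quantile(0.25)

print(iqr_numpy, iqr_pandas)
```

Both functions use linear interpolation between data points by default, so the two results agree here.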

## Bivariate Analysis

Now we are going to switch back over to the *Stature Hand and Foot* data frame, and practice some bivariate analysis on some variables.

In [19]:
with open('../../datasets/stature-hand-foot/stature-hand-foot.csv') as file2:
    df2 = pd.read_csv(file2)
    df2['gender'] = df2['gender'].astype('category')
    df2.columns = ['gender', 'height', 'hand_length', 'foot_length']

First, we should split the data by gender.

**Activity 7**: *Create two subsets of the data, one for females and the other for males. Name the two subsets `female` and `male` respectively.*

In [23]:
# Activity 7 code goes here
# -------------------------
# male is 1, female is 2

female = df2[df2['gender']==2]
male = df2[df2['gender']==1]

You will recall that we could create a correlation and covariance matrix in `R`. The same functionality is available in `pandas`. This is a good way to quickly spot some linear relationships between variables. Below is an example of how to create a correlation matrix on the entire data frame.

In [24]:
df2.corr()

|   | height | hand_length | foot_length |
|---|---|---|---|
| height | 1.0 | 0.873295 | 0.88128 |
| hand_length | 0.873295 | 1.0 | 0.788224 |
| foot_length | 0.88128 | 0.788224 | 1.0 |


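Simple enough. Here `pandas` ignores the non-quantitative `gender` column. In `pandas` 2.0 and later, the `numeric_only` argument of `corr` defaults to `False`, so you may need `df2.corr(numeric_only=True)` to get the same result.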

**Activity 8**: *Create covariance and correlation matrices for both the males and the females of this dataset.*

In [25]:
# Activity 8 code goes here
# -------------------------


female.corr()
male.corr()

|   | height | hand_length | foot_length |
|---|---|---|---|
| height | 1.0 | 0.722356 | 0.715975 |
| hand_length | 0.722356 | 1.0 | 0.473 |
| foot_length | 0.715975 | 0.473 | 1.0 |
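
The activity also asks for covariance matrices, which `DataFrame.cov()` produces in the same way. Note that a notebook cell displays only its last expression, so wrapping each matrix in `print` (or using separate cells) shows all of them. The sketch below uses made-up measurements (`toy` is hypothetical, not the stature data) and also checks how the two matrices relate: a correlation is a covariance divided by the product of the two standard deviations.

```python
import pandas as pd

# Hypothetical measurements for illustration only.
toy = pd.DataFrame({
    'height': [160.0, 165.0, 170.0, 175.0, 180.0],
    'hand_length': [17.0, 18.0, 18.5, 19.5, 20.0],
    'foot_length': [23.0, 24.5, 24.0, 26.0, 27.0],
})

cov_matrix = toy.cov()    # sample covariances (divides by n - 1)
corr_matrix = toy.corr()  # Pearson correlations by default

print(cov_matrix)
print(corr_matrix)

# Recover one correlation from the covariance matrix by hand.
sd = toy.std()  # sample standard deviations (ddof=1)
manual_corr = cov_matrix.loc['height', 'hand_length'] / (
    sd['height'] * sd['hand_length']
)
print(manual_corr)
```

On the real data, the equivalent calls would be along the lines of `female.cov()` and `male.cov()`, with `numeric_only=True` added if the installed `pandas` version requires it.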
