# Practicing Descriptive Statistics with `NumPy` and `pandas`

Data exploration isn't complete without diving deeper into the potential subgroups and categories that exist within the data. 
Splitting the data across categories and discovering the underlying features of the categories could lead to better, more powerful insights. 

In this practice, we will use the data manipulation capabilities of `pandas` and the statistical functionality that `NumPy` provides to further investigate the *Game of Thrones* data.

In [1]:
import pandas as pd
import numpy as np

with open('/dsa/data/all_datasets/game-of-thrones/GoT_age_at_death.csv') as file:
    df = pd.read_csv(file)
    df.columns  = ['character', 'age', 'dead', 'gender', 'affiliation'] # rename the columns
    
    # turn dead, gender, and affiliation into categorical data
    df['dead'] = df['dead'].astype('category') 
    df['gender'] = df['gender'].astype('category')
    df['affiliation'] = df['affiliation'].astype('category')

In [2]:
df.head(7)

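|   | character | age | dead | gender | affiliation |
|---|---|---|---|---|---|
| 0 | Sandor Clegan | 29 | 1 | 1 | 4 |
| 1 | Benjen Stark | 35 | 1 | 1 | 10 |
| 2 | Syrio Forel | 41 | 1 | 1 | 1 |
| 3 | Tysha | 29 | 0 | 0 | 4 |
| 4 | Jeyne Pool | 12 | 1 | 0 | 1 |
| 5 | Imry Florent | 35 | 1 | 1 | 2 |
| 6 | Sorcerer in the Box | 60 | 1 | 1 | 0 |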


Suppose we are now interested in the mean age of the characters by gender. We could subset the dataframe by gender and then calculate the mean of each subset. However, `pandas` has a function that does this for us: `groupby`. `groupby` is a great starting point for exploring data because it lets us split the data into groups, apply a function to each group, and combine the results. See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) for more information.

So, to get the mean age of the characters by gender, we use the code below. It groups the rows by `gender` (so 0 and 1) and returns a Series holding the `mean` age of each group.

In [3]:
df.groupby(['gender'])['age'].mean()

gender
0    29.467391
1    37.635379
Name: age, dtype: float64
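To make the split, apply, combine idea concrete, here is a minimal sketch that compares manual subsetting with `groupby`. The `toy` data frame and its values are made up for illustration and are not part of the *Game of Thrones* data.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "gender": [0, 0, 1, 1, 1],
    "age": [20, 30, 40, 50, 60],
})

# Manual approach: subset the rows for each gender value, then average.
manual = {g: float(toy[toy["gender"] == g]["age"].mean()) for g in (0, 1)}

# groupby approach: split by gender, apply mean, combine into one Series.
grouped = toy.groupby("gender")["age"].mean()

print(manual)
print(grouped)
```

Both approaches give the same numbers; `groupby` simply avoids writing one subset per category.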

We could go even more fine-grained than this. 

**Activity 1**: *Find the mean age per `gender` and per `affiliation`.*

In [6]:
# Activity 1 code goes here
# -------------------------
df.groupby(['gender','affiliation'])['age'].mean()



gender  affiliation
0       0              29.000000
        1              33.000000
        2              25.300000
        3              31.750000
        4              28.714286
        5              24.166667
        6              20.600000
        8                    NaN
        9              19.000000
        10                   NaN
        11                   NaN
        12             30.333333
        13             75.000000
        14             30.666667
        15             68.000000
        16             35.666667
1       0              33.375000
        1              35.588235
        2              35.437500
        3              38.279070
        4              41.645161
        5              31.166667
        6              40.933333
        8              59.000000
        9              30.500000
        10             35.939394
        11             36.666667
        12             36.222222
        13             44.894737
        14             
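The `NaN` rows in this output appear because the grouping columns are categorical: `groupby` can report every combination of categories, including combinations that have no rows, and the mean of an empty group is `NaN`. The sketch below illustrates this with the `observed` argument of `groupby` on a made-up `toy` data frame (hypothetical values, not the course data).

```python
import pandas as pd

# Hypothetical toy data with categorical grouping columns.
toy = pd.DataFrame({
    "gender": pd.Categorical([0, 0, 1]),
    "affiliation": pd.Categorical([1, 2, 1]),
    "age": [20, 30, 40],
})

# observed=False keeps every category combination, even empty ones (mean is NaN).
all_combos = toy.groupby(["gender", "affiliation"], observed=False)["age"].mean()

# observed=True keeps only the combinations that actually occur in the data.
seen_only = toy.groupby(["gender", "affiliation"], observed=True)["age"].mean()

print(all_combos)
print(seen_only)
```

The default value of `observed` differs between `pandas` versions, so passing it explicitly keeps the result predictable.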

In the lab, we learned how to subset data. We now want to look at only characters that are still alive. 

**Activity 2**: *Create a subset from the data frame of only those who are living. Call this subset `alive_chars`.*

In [9]:
# Activity 2 code goes here
# -------------------------
alive_chars = df[df["dead"] == 0]

alive_chars.head()

Unnamed: 0,character,age,dead,gender,affiliation
3,Tysha,29,0,0,4
7,Jhiqui,17,0,0,3
12,Craster's Younger Wife,23,0,0,12
13,Palla,18,0,0,1
16,Hallis Mollen,40,0,1,1


Remember that in the lab we used `NumPy` to find the mean age of the entire dataset. The syntax is shown below.

In [None]:
np.mean(df.age)

**Activity 3**: *Find the mean age of those who are alive.*

In [14]:
# Activity 3 code goes here
# -------------------------

alive_chars["age"].mean()
np.mean(alive_chars.age)

33.00568181818182

Below is one of the ways that we found the standard deviation.

In [None]:
np.std(df.age)

But remember, without any extra arguments passed to the method, we get the population standard deviation. What we want is the sample standard deviation.
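As a minimal sketch of the difference, `np.std` takes a `ddof` ("delta degrees of freedom") argument: the sum of squared deviations is divided by `n - ddof`. The `ages` array below is made up for illustration.

```python
import numpy as np

# Hypothetical ages, invented for illustration only.
ages = np.array([10.0, 20.0, 30.0, 40.0])
n = len(ages)

# Default ddof=0 divides by n: population standard deviation.
pop_sd = np.std(ages)

# ddof=1 divides by n - 1: sample standard deviation.
sample_sd = np.std(ages, ddof=1)

# The same sample standard deviation computed by hand.
by_hand = np.sqrt(np.sum((ages - ages.mean()) ** 2) / (n - 1))

print(pop_sd, sample_sd, by_hand)
```

Note that `pandas` makes the opposite choice: `Series.std()` uses `ddof=1` by default.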

**Activity 4**: *Find the sample standard deviation of the ages of the characters that are still alive.*

In [15]:
# Activity 4 code goes here
# -------------------------
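# ddof=1 divides by n - 1, giving the sample standard deviation
# (np.std defaults to ddof=0, the population standard deviation)
round(np.std(alive_chars.age, ddof=1), 6)



17.82983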

We have already been introduced to the mean, median, and quartiles. To take this one step further, we want to look at the symmetry of a distribution. As a rule of thumb, when the mean is greater than the median, the data are skewed to the right, and when the median is greater than the mean, the data are skewed to the left. When the mean and median are about the same, the data are typically symmetrical. In other words, in a right-skewed distribution, most of the data points are below the mean, but a few high values pull the mean higher; the opposite is true for a left skew.

<img src="../images/distr_sym.gif" style="width: 750px;"/>
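The rule of thumb can be checked on small made-up arrays. This is only a sketch with hypothetical values; comparing the mean and median is a quick heuristic, not a formal test of skewness.

```python
import numpy as np

# Hypothetical samples, invented for illustration only.
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20])        # one large value
left_skewed = np.array([1, 17, 18, 18, 19, 19, 19, 20])   # one small value
symmetric = np.array([1, 2, 3, 4, 5])

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    # mean > median hints at right skew; mean < median hints at left skew
    print(name, np.mean(data), np.median(data))
```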

Below is an example of how we found the median before.

In [None]:
np.median(df.age)

**Activity 5**: *Is the distribution of the age of those who are alive skewed to the right, skewed to the left, or symmetrical, and what does this tell us?*

In [16]:
# Activity 5 code goes here
# -------------------------
np.median(alive_chars.age)


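# Text Answer to the question should be a comment here.
# Roughly symmetric, or very slightly right skewed: the mean (about 33.0) is a little
# above the median (32.0), so a few older characters pull the mean upward.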




32.0

`NumPy` also lets us find the value below which a given percentage of the rows fall by calling the `percentile` function. We used it to find the 65th percentile of ages across the entire data frame as follows:

In [17]:
np.percentile(df.age, 65)

41.0

Now, remember that we discussed the Quartile Range (the interquartile range, or IQR) in the lab notebook.

**Activity 6**: *Find the Quartile Range of the `alive_chars` age variable.*

In [18]:
# Activity 6 code goes here
# -------------------------
alive_chars.age.describe()



count    176.000000
mean      33.005682
std       17.829830
min        1.000000
25%       18.000000
50%       32.000000
75%       44.250000
max       92.000000
Name: age, dtype: float64
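The summary above lists the 25% and 75% quartiles but does not subtract them. One way to get the range directly is with `np.percentile`, sketched here on a made-up `ages` array (hypothetical values, not the course data).

```python
import numpy as np

# Hypothetical ages, invented for illustration only.
ages = np.array([1, 18, 25, 32, 40, 45, 92])

# Passing a list of percentiles returns one value per percentile.
q1, q3 = np.percentile(ages, [25, 75])

# The interquartile range is the spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q3, iqr)
```

Applied to the `describe()` output above, the same subtraction gives 44.25 - 18.00 = 26.25.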

## Bivariate Analysis

Now we are going to switch back over to the *Stature Hand and Foot* data frame and practice bivariate analysis on some of its variables.

In [19]:
with open('/dsa/data/all_datasets/stature-hand-foot/stature-hand-foot.csv') as file2:
    df2 = pd.read_csv(file2)
    df2.columns = ['gender', 'height', 'hand_length', 'foot_length'] # rename the columns
    df2['gender'] = df2['gender'].astype('category') # treat gender as categorical data

First, we should actually split the data based on male and female. 

**Activity 7**: *Create two subsets of the data, one for females and the other for males. Name the two subsets `female` and `male` respectively.*

In [37]:
# Activity 7 code goes here
# -------------------------


female = df2[df2["gender"] == 2]
male = df2[df2["gender"] == 1]




You will recall that we computed the covariance and correlation of two variables. However, we can also create correlation and covariance matrices for a whole data frame. This is a good way to quickly spot linear relationships between variables. Below is an example of how to create a correlation matrix on the entire data frame.

In [40]:
df2.corr()

|   | height | hand_length | foot_length |
|---|---|---|---|
| height | 1.0 | 0.873295 | 0.88128 |
| hand_length | 0.873295 | 1.0 | 0.788224 |
| foot_length | 0.88128 | 0.788224 | 1.0 |


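Simple enough. Here `pandas` left the non-quantitative `gender` column out of the matrix. Note that in `pandas` 2.0 and later this is no longer the default behavior: `corr()` and `cov()` need `numeric_only=True` to skip non-numeric columns.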
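As a minimal sketch, the call below restricts the matrix to numeric columns explicitly, which gives the same kind of result regardless of the default. The `toy` data frame is made up for illustration, and the `numeric_only` argument assumes a reasonably recent `pandas` (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "gender": pd.Categorical(["m", "f", "m", "f"]),
    "height": [180.0, 165.0, 175.0, 160.0],
    "foot_length": [27.0, 24.0, 26.5, 23.0],
})

# numeric_only=True drops the categorical gender column before correlating.
corr_matrix = toy.corr(numeric_only=True)

print(corr_matrix)
```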

**Activity 8**: *Create covariance and correlation matrices for both the males and the females of this dataset.*

In [43]:
# Activity 8 code goes here
# -------------------------
male.corr()
female.corr()

male.cov()
female.cov()


|   | height | hand_length | foot_length |
|---|---|---|---|
| height | 2424.11756 | 325.975892 | 416.985944 |
| hand_length | 325.975892 | 87.141622 | 68.931676 |
| foot_length | 416.985944 | 68.931676 | 146.760912 |
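Covariances depend on the units and spread of each variable, which makes them hard to compare with one another. Correlation rescales each covariance by the two standard deviations involved, so every entry lies between -1 and 1. The sketch below shows that relationship on a made-up `toy` data frame (hypothetical values, not the course data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "height": [1600.0, 1650.0, 1700.0, 1750.0, 1800.0],
    "hand_length": [175.0, 182.0, 185.0, 195.0, 198.0],
})

cov = toy.cov()

# The diagonal of the covariance matrix holds each variable's variance.
sd = np.sqrt(np.diag(cov))

# Dividing each covariance by the product of the two standard deviations
# yields the correlation matrix.
corr_from_cov = cov / np.outer(sd, sd)

print(corr_from_cov)
print(toy.corr())
```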


# Save your notebook, then `File > Close and Halt`