# Exploratory Data Analysis

## 1. Operations on data

In many data analysis tasks, insights emerge only once the data are aggregated or grouped.

Grouping helps readers get a feel for the dataset they are dealing with, and it reveals key points that can inform the decision-making process.

This section presents some of these operations, including grouping, summarizing and aggregating.

### a. Groupby 

The pandas **DataFrame.groupby()** method splits the data into groups based on some criteria.

In abstract terms, grouping provides a mapping of labels to group names.
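
The mapping of labels to group names can be inspected directly. The sketch below uses a small made-up frame (the `toy` data is an assumption for illustration, not taken from the gapminder file) and reads the `groups` attribute of the grouped object.

```python
import pandas as pd

# Hypothetical toy data, not taken from the gapminder file
toy = pd.DataFrame({
    "continent": ["Africa", "Asia", "Africa", "Europe"],
    "lifeExp": [43.1, 55.2, 45.7, 70.3],
})

grouped = toy.groupby("continent")

# .groups maps each group name to the row labels that belong to it
group_labels = {name: rows.tolist() for name, rows in grouped.groups.items()}
print(group_labels)
```

Each key is a group name and each value lists the index labels of the rows assigned to that group.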

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/gapminder.tsv', sep="\t")
df.head(5)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Algeria,Africa,1952,43.077,9279525,2449.008185
1,Algeria,Africa,1957,45.685,10270856,3013.976023
2,Algeria,Africa,1962,48.303,11000948,2550.81688
3,Algeria,Africa,1967,51.407,12760499,3246.991771
4,Algeria,Africa,1972,54.518,14760787,4182.663766


In [4]:
# Group the data by year
g1 = df.groupby('year')
g1

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f41444ba4a8>

Although the data are now grouped, the result is only a `DataFrameGroupBy` object. It shows nothing useful yet, because we have asked only for the data to be grouped by year and have not requested any computation on the groups.
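
To see what a grouped object holds before any aggregation, one option is to iterate over it or pull out a single group. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data, not taken from the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [43.0, 69.0, 45.0, 71.0],
})

by_year = toy.groupby("year")

# Iterating yields (group name, sub-DataFrame) pairs
sizes = {}
for year, frame in by_year:
    sizes[int(year)] = len(frame)

# get_group returns the rows of a single group
rows_1952 = by_year.get_group(1952)

print(sizes)
print(rows_1952)
```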

**What if we want the average life expectancy for the data grouped by year?**

In [5]:
g1.lifeExp.mean()

year
1952    49.158221
1957    51.603135
1962    53.683545
1967    55.757694
1972    57.729649
1977    59.654245
1982    61.624571
1987    63.309779
1992    64.267950
1997    65.124079
2002    65.792136
2007    67.105736
Name: lifeExp, dtype: float64

**Or the variance of the population column for the data grouped by continent?**

In [6]:
g2 = df.groupby('continent')
# Bracket indexing avoids confusion with the DataFrame.pop method
g2['pop'].var()

continent
Africa      2.399687e+14
Americas    2.598902e+15
Asia        4.402000e+16
Europe      4.279715e+14
Oceania     4.233249e+13
Name: pop, dtype: float64
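
Note that `var()` computes the sample variance by default (`ddof=1`, dividing by n - 1). The sketch below, on made-up numbers (the `toy` frame is an assumption for illustration), contrasts it with the population variance obtained by passing `ddof=0`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the gapminder file
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Asia", "Asia"],
    "pop": [1.0, 3.0, 10.0, 14.0],
})

by_continent = toy.groupby("continent")

# Default: sample variance, divides by n - 1
sample_var = by_continent["pop"].var()

# ddof=0: population variance, divides by n
population_var = by_continent["pop"].var(ddof=0)

print(sample_var)
print(population_var)
```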

**Or the descriptive statistics of the life expectancy for the data grouped by year?**

In [7]:
g3 = df.groupby('year')
g3.lifeExp.describe()

year,count,mean,std,min,25%,50%,75%,max
1952,140.0,49.158221,12.181478,30.0,39.115,45.1355,59.964,72.67
1957,140.0,51.603135,12.169108,31.57,41.3265,48.3605,63.18325,73.47
1962,140.0,53.683545,12.007452,32.767,43.5495,50.881,65.337,73.68
1967,140.0,55.757694,11.624418,34.113,46.17325,53.825,67.5025,74.16
1972,140.0,57.729649,11.28435,35.4,48.62675,56.53,69.2925,74.72
1977,140.0,59.654245,11.136169,31.22,50.7265,59.672,70.42,76.11
1982,140.0,61.624571,10.663845,38.445,52.954,62.4415,70.98575,77.11
1987,140.0,63.309779,10.433923,39.906,54.97025,65.834,71.80575,78.67
1992,140.0,64.26795,11.127546,23.599,56.32925,67.703,72.63875,79.36
1997,140.0,65.124079,11.453762,36.087,55.78525,69.394,74.329,80.69


We can take this one step further and compute the average of gdpPercap for the data grouped by both year and continent.

In [8]:
g4 = df.groupby(['year' , 'continent'])

g4.gdpPercap.mean()

year  continent
1952  Africa        1252.572466
      Americas      4079.062552
      Asia          5333.485213
      Europe        5801.057480
      Oceania      10298.085650
1957  Africa        1385.236062
      Americas      4616.043733
      Asia          5942.947937
      Europe        7136.141387
      Oceania      11598.522455
1962  Africa        1598.078825
      Americas      4901.541870
      Asia          5881.753028
      Europe        8574.197085
      Oceania      12696.452430
1967  Africa        2050.363801
      Americas      5668.253496
      Asia          6131.641381
      Europe       10398.431578
      Oceania      14495.021790
1972  Africa        2339.615674
      Americas      6491.334139
      Asia          8420.202687
      Europe       12795.649490
      Oceania      16417.333380
1977  Africa        2585.938508
      Americas      7352.007126
      Asia          8010.226541
      Europe       14654.702392
      Oceania      17283.957605
1982  Africa        2481
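
Grouping by two keys returns a Series indexed by (year, continent) pairs. The sketch below, on made-up values (the `toy` frame is an assumption for illustration), shows one way to select a single pair and to reshape the result into a year-by-continent table with `unstack()`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Africa", "Africa", "Asia", "Africa", "Asia", "Asia"],
    "gdpPercap": [1000.0, 2000.0, 5000.0, 1800.0, 5500.0, 6500.0],
})

means = toy.groupby(["year", "continent"])["gdpPercap"].mean()

# Select one (year, continent) pair from the two-level index
africa_1952 = means.loc[(1952, "Africa")]

# unstack() moves the inner index level (continent) into columns
wide = means.unstack()
print(wide)
```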

You might have noticed that so far we have used only the aggregation methods built into pandas, such as mean(), max(), min(), std() and var().
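
Several of these built-in aggregations can also be requested in a single call by passing their names to `agg`. This is a minimal sketch on made-up data (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy data, not taken from the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [40.0, 60.0, 44.0, 64.0],
})

# One column per requested aggregation, one row per group
summary = toy.groupby("year")["lifeExp"].agg(["mean", "min", "max"])
print(summary)
```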

What about those who want to apply their own function?

The first approach is to use the built-in functions of the NumPy library.

The command below returns the median of gdpPercap for the data grouped by year.

In [None]:
import numpy as np

In [9]:
# Pass a NumPy function to agg to compute the median per year
g5 = df.groupby('year')['gdpPercap'].agg(np.median)
g5

You can also define your own function.

The function below returns the average of each yearly group expressed in centuries, that is, the mean divided by 100.

In [26]:
# Define a custom aggregation function
import numpy as np

def century(values):
    """Return the average of a group expressed
    in centuries (the mean divided by 100)."""
    n = len(values)
    s = np.sum(values)
    m = s / (n * 100)
    return m

In [27]:
g6 = df.groupby('year')
g6.lifeExp.agg(century)

year
1952    0.491582
1957    0.516031
1962    0.536835
1967    0.557577
1972    0.577296
1977    0.596542
1982    0.616246
1987    0.633098
1992    0.642680
1997    0.651241
2002    0.657921
2007    0.671057
Name: lifeExp, dtype: float64
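
As a sanity check, the custom aggregation can be compared with the built-in mean divided by 100. The sketch below restates `century` so that it runs on its own and uses made-up data (the `toy` frame is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [40.0, 60.0, 44.0, 64.0],
})

def century(values):
    """Return the mean of a group divided by 100."""
    return np.sum(values) / (len(values) * 100)

by_year = toy.groupby("year")["lifeExp"]

custom = by_year.agg(century)    # custom function applied to each group
shortcut = by_year.mean() / 100  # same quantity via the built-in mean

same = bool(np.allclose(custom, shortcut))
print(same)
```

When a built-in method already expresses the computation, it is usually the simpler choice; a custom function is most useful when no built-in equivalent exists.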