# Do the following assignment:
1. Load the web_events.csv data set into a Pandas dataframe.
  * Convert the values in the timestamp field to datetimes.
  * Extract different time units from the timestamp field.
  * Aggregate on each one, counting the number of records, and see what insights you can discover for each type of event.
  * Round datetimes by hour, aggregate, and see what insights you can discover.
2. Load the life_expectancy.csv data set into a Pandas dataframe.
  * Transform/melt the data so that the years are listed in a single column instead of separate columns.
  * Practice addressing missing values for countries using the different approaches (imputation, interpolation, and deletion).

In [1]:
import pandas as pd
import numpy as np

In [2]:
#web events data
web = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/web_events.csv')
web.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


In [3]:
web.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   timestamp      int64  
 1   visitorid      int64  
 2   event          object 
 3   itemid         int64  
 4   transactionid  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 105.1+ MB


In [4]:
#Convert timestamp column to datetimes
web['timestamp'] = pd.to_datetime(web['timestamp'], unit='ms')
web.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype         
---  ------         -----         
 0   timestamp      datetime64[ns]
 1   visitorid      int64         
 2   event          object        
 3   itemid         int64         
 4   transactionid  float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 105.1+ MB


In [5]:
web.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,2015-06-02 05:02:12.117,257597,view,355908,
1,2015-06-02 05:50:14.164,992329,view,248676,
2,2015-06-02 05:13:19.827,111016,view,318965,
3,2015-06-02 05:12:35.914,483717,view,253185,
4,2015-06-02 05:02:17.106,951259,view,367447,


In [6]:
web.timestamp.dt.month

0          6
1          6
2          6
3          6
4          6
          ..
2756096    8
2756097    8
2756098    8
2756099    8
2756100    8
Name: timestamp, Length: 2756101, dtype: int64

In [7]:
#extract different time units
web['month'] = web.timestamp.dt.month
web.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,month
0,2015-06-02 05:02:12.117,257597,view,355908,,6
1,2015-06-02 05:50:14.164,992329,view,248676,,6
2,2015-06-02 05:13:19.827,111016,view,318965,,6
3,2015-06-02 05:12:35.914,483717,view,253185,,6
4,2015-06-02 05:02:17.106,951259,view,367447,,6


In [8]:
web.month.unique()

array([6, 7, 8, 9, 5])

In [9]:
web.month.value_counts()

7    697984
6    610393
5    590652
8    553362
9    303710
Name: month, dtype: int64

In [10]:
#extracting year
web['year'] = web.timestamp.dt.year
web.year.unique()

array([2015])

In [11]:
#extracting day
web['day'] = web.timestamp.dt.day
web.day.unique()

array([ 2,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11, 13, 14, 12, 16, 15, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])

In [12]:
#value counts for each day
web.day.value_counts()

14    103072
7     102839
15    102451
10    100289
8      99072
13     98174
4      98103
11     97610
6      97340
3      97334
9      96252
17     95948
12     94946
5      94349
26     93553
16     91599
25     90921
18     89176
27     85541
19     84312
20     82351
22     80932
28     80799
29     80723
24     80703
21     78672
2      77508
1      77197
23     75639
30     74730
31     53966
Name: day, dtype: int64

In [13]:
#extracting hour
web['hour'] = web.timestamp.dt.hour
web.hour.unique()

array([ 5,  4, 20, 21,  3, 16, 15, 18, 17, 12, 13, 14, 23, 22,  1,  2, 19,
        8,  7,  6,  9, 10, 11,  0])

In [14]:
#value counts for each hour
web.hour.value_counts()

20    187919
21    184297
19    183348
18    181200
17    179651
22    175956
16    161784
23    159084
3     150860
4     147184
2     145879
0     144303
1     140702
15    129092
5     119572
14     81823
6      76972
13     51089
7      43944
12     31486
8      25309
11     20330
9      17909
10     16408
Name: hour, dtype: int64

In [15]:
#extracting the minute
web['minute'] = web['timestamp'].dt.minute
web.minute.unique()

array([ 2, 50, 13, 12, 48, 34, 54,  0, 16,  8, 41, 47, 22, 33, 57,  1, 58,
       15, 55, 51,  3, 56, 32, 40, 37, 20,  7, 42, 43, 59, 39, 19, 31, 36,
       52, 44, 49, 14, 29, 45,  4, 28,  9, 23, 27, 46, 53, 17,  5, 18, 26,
       10, 25, 24,  6, 21, 38, 11, 35, 30])

In [16]:
#minute value counts
web.minute.value_counts()

54    48316
55    48121
56    47605
53    47056
57    46776
51    46726
52    46649
58    46626
47    46560
41    46545
43    46427
48    46361
36    46269
44    46206
37    46195
34    46190
26    46097
39    46079
24    46073
42    46022
50    46022
27    45999
31    45978
22    45938
35    45916
49    45908
25    45867
29    45867
45    45847
46    45840
38    45837
59    45816
18    45787
32    45768
33    45760
12    45725
23    45657
21    45655
30    45630
28    45619
19    45619
7     45587
16    45578
8     45575
40    45555
11    45535
17    45493
9     45476
1     45470
13    45374
14    45352
2     45350
4     45335
6     45289
15    45278
5     45240
20    45149
10    45047
0     44882
3     44582
Name: minute, dtype: int64

In [17]:
web.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,month,year,day,hour,minute
0,2015-06-02 05:02:12.117,257597,view,355908,,6,2015,2,5,2
1,2015-06-02 05:50:14.164,992329,view,248676,,6,2015,2,5,50
2,2015-06-02 05:13:19.827,111016,view,318965,,6,2015,2,5,13
3,2015-06-02 05:12:35.914,483717,view,253185,,6,2015,2,5,12
4,2015-06-02 05:02:17.106,951259,view,367447,,6,2015,2,5,2


In [18]:
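#rounding to the nearest hour (lowercase 'h'; the uppercase 'H' alias is deprecated in recent pandas)
#note: times from 23:30 onward round up to hour 0 of the next day
hour_round = web['timestamp'].dt.round('h')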
web['hour_round'] = hour_round.dt.hour
web.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,month,year,day,hour,minute,hour_round
0,2015-06-02 05:02:12.117,257597,view,355908,,6,2015,2,5,2,5
1,2015-06-02 05:50:14.164,992329,view,248676,,6,2015,2,5,50,6
2,2015-06-02 05:13:19.827,111016,view,318965,,6,2015,2,5,13,5
3,2015-06-02 05:12:35.914,483717,view,253185,,6,2015,2,5,12,5
4,2015-06-02 05:02:17.106,951259,view,367447,,6,2015,2,5,2,5


In [19]:
#value counts for rounded hour
web.hour_round.value_counts()

21    186821
20    186163
18    182608
19    181444
22    180913
17    172255
23    168257
4     151716
0     151210
3     148576
16    147325
2     142286
1     141481
5     135492
15    104702
6      99396
14     63954
7      58369
13     40588
8      33111
12     24735
9      20755
11     17686
10     16258
Name: hour_round, dtype: int64
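The `hour_round` column above keeps only the hour of day, so the date is lost. The sketch below shows one way to aggregate on the full rounded timestamp and how `round` differs from `floor`. The `events` frame is a made-up sample, not rows from `web_events.csv`, and the `addtocart` label is an assumption.

```python
import pandas as pd

# Tiny made-up sample; the real notebook would use the `web` dataframe instead.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-06-02 05:02:12",
        "2015-06-02 05:50:14",
        "2015-06-02 23:45:00",
        "2015-06-03 00:10:00",
    ]),
    "event": ["view", "view", "addtocart", "view"],  # "addtocart" is assumed
})

# round() snaps to the nearest hour; floor() truncates to the start of the hour.
events["rounded"] = events["timestamp"].dt.round("h")
events["floored"] = events["timestamp"].dt.floor("h")

# Grouping on the full rounded timestamp keeps the date, so 23:45 on June 2
# and 00:10 on June 3 both land in the 2015-06-03 00:00 bucket.
per_hour = events.groupby("rounded").size()
print(per_hour)
```

Whether `round` or `floor` is the better choice depends on the question: `floor` keeps each event inside the clock hour in which it occurred, while `round` centers each bucket on the hour mark.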

In [20]:
#earliest date
web['timestamp'].min()

Timestamp('2015-05-03 03:00:04.384000')

In [21]:
#latest date
web['timestamp'].max()

Timestamp('2015-09-18 02:59:47.788000')

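*Most web events occur late in the day, between the rounded hours of 18:00 and 22:00, and activity is lowest between 09:00 and 11:00 (timestamps converted from epoch milliseconds are in UTC, so local peak hours may differ). July and June have the most events, although May and September are only partially covered by the data, so their counts are not directly comparable. The most frequent days of the month are the 14th, 7th, 15th, and 10th.*

*The earliest timestamp in the dataset is May 3, 2015, just after 3:00 am, and the latest is September 18, 2015, just before 3:00 am.*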
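The counts above pool all event types together, while the assignment asks for insights per type of event. The sketch below shows one way to split the hourly counts by event. The `sample` frame is made up for illustration. Only `view` appears in the rows displayed above, so the other event labels are assumptions.

```python
import pandas as pd

# Made-up sample standing in for `web`; event labels other than "view"
# are assumptions, not values confirmed by the output shown above.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-06-02 05:02:12", "2015-06-02 05:50:14",
        "2015-06-02 20:13:19", "2015-06-02 20:45:00",
        "2015-06-02 21:02:17",
    ]),
    "event": ["view", "view", "view", "addtocart", "transaction"],
})
sample["hour"] = sample["timestamp"].dt.hour

# Count records per (hour, event) pair, then pivot the events into columns.
by_event = (
    sample.groupby(["hour", "event"])
          .size()
          .unstack(fill_value=0)
)
print(by_event)
```

The same pattern should work with `month`, `day`, or `minute` in place of `hour`.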

In [22]:
#life expectancy data
life = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/life_expectancy.csv')
life.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,65.662,66.074,66.444,66.787,67.113,67.435,67.762,68.095,68.436,68.784,69.14,69.498,69.851,70.191,70.519,70.833,71.14,71.441,71.736,72.023,72.293,72.538,72.751,72.929,73.071,73.181,73.262,73.325,73.378,73.425,73.468,73.509,73.544,73.573,73.598,73.622,73.646,73.671,73.7,73.738,73.787,73.853,73.937,74.038,74.156,74.287,74.429,74.576,74.725,74.872,75.016,75.158,75.299,75.44,75.582,75.725,75.867
1,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.292,32.742,33.185,33.624,34.06,34.495,34.928,35.361,35.796,36.234,36.678,37.128,37.587,38.056,38.54,39.039,39.556,40.092,40.65,41.234,41.853,42.513,43.217,43.963,44.747,45.566,46.417,47.288,48.164,49.028,49.856,50.627,51.331,51.968,52.539,53.055,53.533,53.997,54.468,54.959,55.482,56.044,56.637,57.25,57.875,58.5,59.11,59.694,60.243,60.754,61.226,61.666,62.086,62.494,62.895,63.288,63.673
2,Angola,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,33.251,33.573,33.914,34.272,34.645,35.031,35.426,35.828,36.234,36.64,37.047,37.46,37.878,38.297,38.712,39.11,39.478,39.81,40.099,40.344,40.546,40.71,40.848,40.97,41.085,41.193,41.292,41.382,41.471,41.572,41.696,41.855,42.06,42.329,42.677,43.125,43.695,44.385,45.192,46.105,47.113,48.2,49.341,50.508,51.676,52.833,53.974,55.096,56.189,57.231,58.192,59.042,59.77,60.373,60.858,61.241,61.547
3,Albania,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,62.279,63.298,64.187,64.911,65.461,65.848,66.108,66.302,66.485,66.687,66.933,67.235,67.58,67.951,68.341,68.734,69.108,69.447,69.741,69.99,70.207,70.416,70.635,70.876,71.134,71.388,71.605,71.76,71.843,71.86,71.836,71.803,71.802,71.86,71.992,72.205,72.495,72.838,73.208,73.588,73.955,74.286,74.575,74.82,75.028,75.217,75.418,75.656,75.943,76.281,76.652,77.031,77.389,77.702,77.963,78.174,78.345
4,Andorra,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [23]:
#Transform/melt years columns
ids = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code']

In [24]:
melt_fields = list(life.columns.drop(ids))

In [25]:
#melted dataset
melted = pd.melt(life,
                 id_vars=ids,
                 value_vars=melt_fields,
                 var_name='Year',
                 value_name='Value')
melted.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,Year,Value
0,Aruba,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,65.662
1,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,32.292
2,Angola,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,33.251
3,Albania,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,62.279
4,Andorra,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,
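After melting, the `Year` column holds the former column labels, which are strings when they come from a CSV header. The sketch below casts them to integers and sorts by country and year, using a small made-up table rather than the real `life` data.

```python
import pandas as pd

# Small made-up wide table shaped like `life`; values are illustrative only.
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [65.0, None],
    "1961": [66.0, 40.0],
})

tidy = pd.melt(wide, id_vars=["Country Name"], var_name="Year", value_name="Value")

# The melted Year values are the old column labels (strings), so cast them
# to integers before sorting or doing arithmetic on them.
tidy["Year"] = tidy["Year"].astype(int)
tidy = tidy.sort_values(["Country Name", "Year"]).reset_index(drop=True)
print(tidy)
```

Sorting by country and then year also puts each country's observations next to each other, which matters for any fill method that looks at neighboring rows.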


**IMPUTATION**

*for missing values*

In [30]:
value_mean = melted.Value.fillna(melted['Value'].mean())
value_mean.head()

0    65.662000
1    32.292000
2    33.251000
3    62.279000
4    63.544406
Name: Value, dtype: float64

In [31]:
value_med = melted.Value.fillna(melted['Value'].median())
value_med.head()

0    65.662
1    32.292
2    33.251
3    62.279
4    66.328
Name: Value, dtype: float64

In [32]:
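#NOTE: mode() returns a Series, so fillna aligns on index labels and row 4 stays NaN;
#fillna(melted['Value'].mode()[0]) would fill with the first mode as a scalar instead
value_mode = melted.Value.fillna(melted['Value'].mode())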
value_mode.head()

0    65.662
1    32.292
2    33.251
3    62.279
4       NaN
Name: Value, dtype: float64
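Row 4 is still `NaN` in the mode output above because `mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on index labels. The sketch below reproduces the effect on a made-up Series and shows a scalar alternative.

```python
import pandas as pd
import numpy as np

# Made-up values; 70.0 is the only mode.
values = pd.Series([70.0, 70.0, 65.0, np.nan, np.nan], name="Value")

# mode() returns a Series (there can be ties), indexed 0, 1, ...
modes = values.mode()

# Passing that Series to fillna aligns on index labels, so the NaNs at
# labels 3 and 4 find no matching label and stay missing.
aligned = values.fillna(modes)

# Taking the first mode gives a scalar that fills every missing value.
filled = values.fillna(modes.iloc[0])
print(filled)
```

For continuous measurements such as life expectancy, the mode may be an arbitrary repeated value, so the mean or median is often the more meaningful choice.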

In [33]:
value_zero = melted.Value.fillna(0)
value_zero.head()

0    65.662
1    32.292
2    33.251
3    62.279
4     0.000
Name: Value, dtype: float64

**INTERPOLATION**

*for missing values*

In [34]:
front_fill = melted.Value.ffill()
front_fill.head()

0    65.662
1    32.292
2    33.251
3    62.279
4    62.279
Name: Value, dtype: float64

In [35]:
back_fill = melted.Value.bfill()
back_fill.head()

0    65.662000
1    32.292000
2    33.251000
3    62.279000
4    46.825065
Name: Value, dtype: float64

In [36]:
smoothed = (front_fill + back_fill) / 2
smoothed.head()

0    65.662000
1    32.292000
2    33.251000
3    62.279000
4    54.552032
Name: Value, dtype: float64
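Because `melted` is ordered by year and then by country, the forward and backward fills above borrow values from neighboring countries: row 4 (Andorra) receives Albania's value going forward and the next row's value going backward. The sketch below shows one possible alternative that interpolates along each country's own time series. The data is made up and the column names simply mirror `melted`.

```python
import pandas as pd
import numpy as np

# Made-up long-format data shaped like `melted`; values are illustrative only.
tidy = pd.DataFrame({
    "Country Name": ["A", "B", "A", "B", "A", "B"],
    "Year": [1960, 1960, 1961, 1961, 1962, 1962],
    "Value": [60.0, 30.0, np.nan, np.nan, 62.0, 34.0],
})

# Sort so each country's years are contiguous and in order, then interpolate
# linearly within each country only.
tidy = tidy.sort_values(["Country Name", "Year"]).reset_index(drop=True)
tidy["Interpolated"] = (
    tidy.groupby("Country Name")["Value"]
        .transform(lambda s: s.interpolate(method="linear"))
)
print(tidy)
```

A country with no observations at all, as the Andorra row shown earlier appears to be, stays missing under this approach and would need imputation or deletion instead.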

**DELETION**

*for missing values*

In [37]:
dropped = melted['Value'].dropna()
dropped.head()

0    65.662000
1    32.292000
2    33.251000
3    62.279000
5    46.825065
Name: Value, dtype: float64
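Dropping from the `Value` Series alone discards the country and year that each remaining value belongs to. The sketch below drops incomplete rows from the whole dataframe instead, using made-up data with the same column names as `melted`.

```python
import pandas as pd
import numpy as np

# Made-up rows shaped like `melted`; values are illustrative only.
tidy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Year": [1960, 1960, 1960],
    "Value": [65.0, np.nan, 33.0],
})

# Drop rows where Value is missing, but keep the identifying columns.
complete = tidy.dropna(subset=["Value"])
print(complete)
```

As in the output above, where the index jumps from 3 to 5, the surviving rows keep their original index labels. Calling `reset_index(drop=True)` afterwards renumbers them if needed.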