In [1]:
import pandas as pd

Load the web_events.csv data set into a Pandas dataframe.

In [3]:
df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/web_events.csv")
df.head()

,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


Convert the values in the timestamp field to datetimes.

In [4]:
df.timestamp = pd.to_datetime(df.timestamp, unit="ms")
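
As a quick sanity check on the `unit="ms"` argument, the sketch below converts the first raw timestamp from the preview above by hand. The variable names (`raw_ms`, `ts`, `wrong`) are illustrative only and are not part of the exercise.

```python
import pandas as pd

# First raw value from the preview: milliseconds since the Unix epoch.
raw_ms = 1433221332117

# unit="ms" tells pandas how to interpret the integer.
ts = pd.to_datetime(raw_ms, unit="ms")
print(ts)

# Without a unit, pandas reads the integer as nanoseconds,
# which lands a few minutes after the start of 1970.
wrong = pd.to_datetime(raw_ms)
print(wrong)
```

If the converted values fall in 1970, the unit is almost certainly wrong.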

Extract different time units from the timestamp field.

In [9]:
year = df.timestamp.dt.year
month = df.timestamp.dt.month
day = df.timestamp.dt.day
week = df.timestamp.dt.isocalendar().week  # .dt.week is deprecated and removed in pandas 2.x
hour = df.timestamp.dt.hour
minute = df.timestamp.dt.minute
second = df.timestamp.dt.second
microsecond = df.timestamp.dt.microsecond  # sub-second part, in microseconds (not milliseconds)
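
Two of these accessors are easy to trip over, so here is a minimal sketch on two made-up timestamps. It assumes a pandas version that provides `dt.isocalendar()` (1.1 or later); the sample values and the names `iso` and `millis` are hypothetical.

```python
import pandas as pd

# Two made-up timestamps, only for illustration.
ts = pd.Series(pd.to_datetime(["2015-06-02 05:02:12.117",
                               "2015-09-18 20:15:00.500"]))

# isocalendar() returns a DataFrame with ISO year, week and day columns.
iso = ts.dt.isocalendar()
week = iso.week
print(week.tolist())

# .dt.microsecond is the sub-second part in microseconds;
# integer-divide by 1000 to express it in milliseconds.
millis = ts.dt.microsecond // 1000
print(millis.tolist())
```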


Aggregate on each one, counting the number of records, and see what insights you can discover for each type of event.

In [20]:
print(f"Year: \n{year.value_counts()}\n")
print(f"Months: \n{month.value_counts()}\n")
print(f"Days: \n{day.value_counts()}\n")
print(f"Week: \n{week.value_counts()}\n")
print(f"Hour: \n{hour.value_counts()}\n")
print(f"Minute: \n{minute.value_counts()}\n")

Year: 
2015    2756101
Name: timestamp, dtype: int64

Months: 
7    697984
6    610393
5    590652
8    553362
9    303710
Name: timestamp, dtype: int64

Days: 
14    103072
7     102839
15    102451
10    100289
8      99072
13     98174
4      98103
11     97610
6      97340
3      97334
9      96252
17     95948
12     94946
5      94349
26     93553
16     91599
25     90921
18     89176
27     85541
19     84312
20     82351
22     80932
28     80799
29     80723
24     80703
21     78672
2      77508
1      77197
23     75639
30     74730
31     53966
Name: timestamp, dtype: int64

Week: 
30    175437
28    161845
21    152514
29    152243
20    148861
23    146445
25    145976
31    144337
22    141819
26    141688
27    139233
37    138789
19    133775
24    131852
32    128559
35    127810
34    126924
33    120763
36    119057
38     64491
18     13683
Name: timestamp, dtype: int64

Hour: 
20    187919
21    184297
19    183348
18    181200
17    179651
22    175956
16    161
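
The counts above pool every event type together. One way to split them per event type is a cross-tabulation, sketched here on a tiny hypothetical sample. Only `view` appears in the preview of the real data; the `addtocart` and `transaction` labels are assumptions made for illustration.

```python
import pandas as pd

# Tiny hypothetical sample with the same column names as web_events.csv.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-06-02 17:05", "2015-06-02 17:40", "2015-06-02 20:10",
        "2015-06-03 20:30", "2015-06-03 20:45",
    ]),
    # Event labels other than "view" are assumed, not taken from the data.
    "event": ["view", "addtocart", "view", "view", "transaction"],
})

# Rows: hour of day; columns: event type; cells: number of records.
per_hour = pd.crosstab(sample.timestamp.dt.hour, sample.event)
print(per_hour)
```

The same call should work on the full frame by swapping `sample` for `df`, and on any of the other time units by swapping `.dt.hour`.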

Round datetimes by hour, aggregate, and see what insights you can discover.

In [22]:
by_hour = df.timestamp.dt.round("H")

In [24]:
by_hour.dt.hour.value_counts()

21    186821
20    186163
18    182608
19    181444
22    180913
17    172255
23    168257
4     151716
0     151210
3     148576
16    147325
2     142286
1     141481
5     135492
15    104702
6      99396
14     63954
7      58369
13     40588
8      33111
12     24735
9      20755
11     17686
10     16258
Name: timestamp, dtype: int64

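After rounding to the nearest hour, activity still peaks between 17:00 and 23:00. Those seven hours account for about 46% of all events (1,258,461 of 2,756,101), so they are the busiest stretch of the day but not a majority.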
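
Rounding moves events from the second half of an hour into the next hour, which is why these counts differ from the plain `.dt.hour` counts. The sketch below contrasts `round` with `floor` on three made-up timestamps. It uses the lowercase `"h"` alias preferred by recent pandas releases; the notebook's `"H"` is the older spelling of the same frequency.

```python
import pandas as pd

# Made-up timestamps chosen to fall on either side of the half hour.
ts = pd.Series(pd.to_datetime([
    "2015-06-02 17:29:59",
    "2015-06-02 17:30:01",
    "2015-06-02 23:45:00",
]))

rounded = ts.dt.round("h")   # nearest hour; may roll into the next day
floored = ts.dt.floor("h")   # start of the current hour

print(rounded.dt.hour.tolist())
print(floored.dt.hour.tolist())
```

Note that the last timestamp rounds to midnight of the following day, so rounded counts for hour 0 include events from 23:30 onward.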

Load the life_expectancy.csv data set into a Pandas dataframe.

In [25]:
life_df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/life_expectancy.csv")

Transform/melt the data so that the years are listed in a single column instead of separate columns.

In [43]:
ids = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]
melt_fields = list(life_df.columns.drop(ids))

melted = pd.melt(
    life_df, 
    id_vars=ids, 
    value_vars=melt_fields,
    var_name="Year",
    value_name="Value"
    )

melted.head()

,Country Name,Country Code,Indicator Name,Indicator Code,Year,Value
0,Aruba,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,65.662
1,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,32.292
2,Angola,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,33.251
3,Albania,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,62.279
4,Andorra,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,
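
To make the reshaping easier to follow, here is a minimal sketch of the same `pd.melt` call on a toy wide table. The countries and values are invented. It also converts `Year` to integers, since melted column names arrive as strings; that conversion is an optional extra step, not something the exercise requires.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (invented values).
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [60.0, None],
    "1961": [61.0, 41.0],
})

# Every column not listed in id_vars becomes a (Year, Value) pair.
tidy = pd.melt(wide, id_vars=["Country Name"],
               var_name="Year", value_name="Value")

# Column names were strings, so Year is a string column until converted.
tidy["Year"] = tidy["Year"].astype(int)
print(tidy)
```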


Practice addressing missing values for countries using the different approaches (imputation, interpolation, and deletion).

### Imputation

In [34]:
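# Each call returns a new Series, so melted keeps its NaNs for the next approach.
mean_filled = melted.Value.fillna(melted.Value.mean())

In [37]:
median_filled = melted.Value.fillna(melted.Value.median())

In [39]:
# mode() returns a Series (there can be ties), so take its first entry.
mode_filled = melted.Value.fillna(melted.Value.mode().iloc[0])

In [41]:
zero_filled = melted.Value.fillna(0)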
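
The fills above use one global statistic for every country. Since the task asks about missing values per country, a per-country fill may be closer to the intent. The following is a sketch under that assumption, on invented data with hypothetical country codes.

```python
import pandas as pd

# Invented long-format data with hypothetical country codes.
tidy = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Year": [1960, 1961, 1962] * 2,
    "Value": [60.0, None, 62.0, 40.0, 42.0, None],
})

# transform("mean") repeats each country's mean on every one of its rows,
# so the result lines up with the original index.
country_mean = tidy.groupby("Country Code")["Value"].transform("mean")

# Fill each gap with the mean of that row's own country.
filled = tidy["Value"].fillna(country_mean)
print(filled.tolist())
```

A country with no observations at all has no mean, so its rows would stay missing under this approach.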

### Interpolation

In [46]:
# fillna(method=...) is deprecated in recent pandas; ffill()/bfill() do the same.
ffill = melted.Value.ffill()
bfill = melted.Value.bfill()
smoothed = (ffill + bfill) / 2
smoothed

0        65.662000
1        32.292000
2        33.251000
3        62.279000
4        54.552032
           ...    
15043    71.646341
15044    64.953000
15045    62.774000
15046    61.874000
15047    61.163000
Name: Value, Length: 15048, dtype: float64
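
One caveat with the averaging above: the melted rows are ordered by year and then by country, so the previous and next rows belong to other countries. Row 4 (Andorra), for example, gets the average of rows 3 and 5. A possible alternative, sketched here on invented data, is to interpolate along each country's own series.

```python
import pandas as pd

# Invented data in the same row order as the melted frame:
# all countries for 1960, then all countries for 1961, and so on.
tidy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "Year": [1960, 1960, 1961, 1961, 1962, 1962],
    "Value": [60.0, 40.0, None, None, 64.0, 50.0],
})

# Row-order approach from the cell above: neighbours mix countries.
naive = (tidy["Value"].ffill() + tidy["Value"].bfill()) / 2

# Per-country approach: linear interpolation within each country's years.
per_country = tidy.groupby("Country Code")["Value"].transform(
    lambda s: s.interpolate()
)

print(naive.tolist())
print(per_country.tolist())
```

This relies on the rows within each country already being in year order, which holds for the melted frame because the year columns were melted left to right.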

### Deletion

In [47]:
melted.Value.dropna()

0        65.662000
1        32.292000
2        33.251000
3        62.279000
5        46.825065
           ...    
15043    71.646341
15044    64.953000
15045    62.774000
15046    61.874000
15047    61.163000
Name: Value, Length: 13747, dtype: float64
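
In the output above the length drops from 15048 to 13747, so 1301 missing values were removed. Calling `dropna()` on the column returns only that column, though. To drop the corresponding rows from the whole frame, `DataFrame.dropna` with `subset` is one option, sketched here on invented data.

```python
import pandas as pd

# Invented long-format rows; the middle one has no value.
tidy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Year": [1960, 1960, 1960],
    "Value": [60.0, None, 62.0],
})

# Drop rows only when Value is missing; NaNs in other columns are ignored.
kept = tidy.dropna(subset=["Value"])
print(kept)
```

The original index labels are preserved, so `reset_index(drop=True)` can be chained on if a contiguous index is needed.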