
* Author: Stephanie Jung

# Import data

In [1]:
import pandas as pd
import numpy as np

Load the data and parse the 'ts' column as datetime values.

In [270]:
data = pd.read_csv('q1_data.csv', parse_dates=['ts'])

In [28]:
data.head()

Unnamed: 0,ts,user_id,country_id,site_id
0,2019-02-01 00:01:24,LC36FC,TL6,N0OTG
1,2019-02-01 00:10:19,LC39B6,TL6,N0OTG
2,2019-02-01 00:21:50,LC3500,TL6,N0OTG
3,2019-02-01 00:22:50,LC374F,TL6,N0OTG
4,2019-02-01 00:23:44,LCC1C3,TL6,QGO3G


In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3553 entries, 0 to 3552
Data columns (total 4 columns):
ts            3553 non-null datetime64[ns]
user_id       3553 non-null object
country_id    3553 non-null object
site_id       3553 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 111.2+ KB
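The effect of `parse_dates` seen in the `info()` output above (a `datetime64[ns]` dtype for 'ts') can be reproduced on a tiny in-memory CSV. This is only a sketch: the two rows in `csv_text` are copied from the `head()` output, and reading from `io.StringIO` instead of `q1_data.csv` is an assumption made so the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV mimicking the layout of q1_data.csv.
csv_text = (
    "ts,user_id,country_id,site_id\n"
    "2019-02-01 00:01:24,LC36FC,TL6,N0OTG\n"
    "2019-02-01 00:10:19,LC39B6,TL6,N0OTG\n"
)

# Without parse_dates the timestamps stay as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column becomes a datetime dtype,
# which is what makes the later time-window filters work.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["ts"])

print(raw.ts.dtype, parsed.ts.dtype)
```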


Check each column for missing values.

In [271]:
for col in data.columns:
    print(sum(data[col].isna()))

0
0
0
0
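The same check can also be written without an explicit loop, which keeps the column names next to the counts. A minimal sketch on a made-up frame (`toy` and its values are hypothetical, with one value deliberately missing):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; one user_id is missing.
toy = pd.DataFrame({
    "user_id": ["U1", "U2", None],
    "site_id": ["S1", "S1", "S2"],
})

# Per-column count of missing values, labelled by column name.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single boolean answering "is anything missing at all?"
has_missing = bool(toy.isna().any().any())
print(has_missing)
```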


# Q1

Consider only the rows with country_id = "BDV" (there are 844 such rows). For each site_id, we can compute the number of unique user_id's found in these 844 rows. Which site_id has the largest number of unique users? And what's the number?

First, select records with country_id = "BDV". Then, group by 'site_id' and count the number of unique user_id's and choose the top one.

In [266]:
data.loc[data.country_id == "BDV"].groupby('site_id').user_id.nunique().nlargest(1).reset_index()

Unnamed: 0,site_id,user_id
0,5NPAU,544
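To illustrate why `nunique` rather than `count` is the aggregation used here, the following sketch runs both on made-up data in which one user visits the same site twice. The frame `toy`, the country code "AA", and all ids are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: user U1 visits site S1 twice from country "AA".
toy = pd.DataFrame({
    "country_id": ["AA", "AA", "AA", "AA", "BB"],
    "site_id":    ["S1", "S1", "S1", "S2", "S2"],
    "user_id":    ["U1", "U1", "U2", "U3", "U4"],
})

subset = toy.loc[toy.country_id == "AA"]

# count() counts rows (visits); nunique() counts distinct users.
visits = subset.groupby("site_id").user_id.count()
unique_users = subset.groupby("site_id").user_id.nunique()

top_site = unique_users.idxmax()
top_count = int(unique_users.max())
print(top_site, top_count)
```

On this toy data S1 has three visits but only two distinct users, so the two aggregations disagree.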


# Q2

Between 2019-02-03 00:00:00 and 2019-02-04 23:59:59, there are four users who visited a certain site more than 10 times. Find these four users and the sites each of them visited more than 10 times. (Simply provide four triples in the form (user_id, site_id, number of visits) in the box below.)

First, select the records with ts between 2019-02-03 00:00:00 and 2019-02-04 23:59:59.

Then group by both 'user_id' and 'site_id', so that each group is one unique (user, site) pair.  
Count the records in each group to get the number of visits per pair.  
Keep the pairs with more than 10 visits.

In [272]:
datac = data.loc[(data.ts>='2019-02-03 00:00:00') & (data.ts<='2019-02-04 23:59:59')]
datac = datac.groupby(['user_id', 'site_id']).count().reset_index()
datac = datac.loc[datac.ts>10]
datac

Unnamed: 0,user_id,site_id,ts,country_id
3,LC06C3,N0OTG,25,25
417,LC3A59,N0OTG,26,26
485,LC3C7E,3POLC,15,15
493,LC3C9D,N0OTG,17,17


Print the answers in the requested format.

In [74]:
for i in range(len(datac)):
    print(f'({datac.iloc[i].user_id}, {datac.iloc[i].site_id}, {datac.iloc[i].ts})')

(LC06C3, N0OTG, 25)
(LC3A59, N0OTG, 26)
(LC3C7E, 3POLC, 15)
(LC3C9D, N0OTG, 17)
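The window filter and the visit count can also be expressed with `Series.between` and `groupby(...).size()`, which yields a single named count column instead of one count per remaining column. This is a sketch on made-up rows placed at the window edges. The frame `toy`, the ids, and the threshold of 2 are hypothetical; the question itself uses 10.

```python
import pandas as pd

# Hypothetical visits by one user to one site, placed around the window edges.
toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2019-02-02 23:59:59",  # just before the window
        "2019-02-03 00:00:00",  # window start (inclusive)
        "2019-02-03 12:00:00",
        "2019-02-04 23:59:59",  # window end (inclusive)
        "2019-02-05 00:00:00",  # just after the window
    ]),
    "user_id": ["U1"] * 5,
    "site_id": ["S1"] * 5,
})

start = pd.Timestamp("2019-02-03 00:00:00")
end = pd.Timestamp("2019-02-04 23:59:59")

# between() includes both endpoints by default.
in_window = toy.loc[toy.ts.between(start, end)]

# size() counts rows per (user, site) pair.
visits = (
    in_window.groupby(["user_id", "site_id"])
    .size()
    .reset_index(name="visits")
)

threshold = 2  # hypothetical; the question above uses 10
frequent = visits.loc[visits.visits > threshold]

# Plain (user_id, site_id, visits) tuples, ready for printing.
triples = list(frequent.itertuples(index=False, name=None))
print(triples)
```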


# Q3

For each site, compute the unique number of users whose last visit (found in the original data set) was to that site. For instance, user "LC3561"'s last visit is to "N0OTG" based on timestamp data. Based on this measure, what are top three sites? (hint: site "3POLC" is ranked at 5th with 28 users whose last visit in the data set was to 3POLC; simply provide three pairs in the form (site_id, number of users).)

First, sort the data by timestamp in descending order.

Then group by 'user_id' and take the first record of each group, which is that user's latest visit.

In [267]:
datac = data.sort_values('ts', ascending=False).groupby('user_id', as_index=False).nth(0)
datac

Unnamed: 0,ts,user_id,country_id,site_id
3552,2019-02-07 23:59:37,LC3842,HVQ,3POLC
3550,2019-02-07 23:58:56,LC35EB,TL6,QGO3G
3548,2019-02-07 23:56:57,LC3F13,TL6,QGO3G
3547,2019-02-07 23:55:07,LC3837,TL6,RT9Z6
3545,2019-02-07 23:44:34,LC3561,TL6,N0OTG
...,...,...,...,...
12,2019-02-01 00:42:13,LC39C8,TL6,QGO3G
11,2019-02-01 00:41:50,LCC3C3,QLT,5NPAU
4,2019-02-01 00:23:44,LCC1C3,TL6,QGO3G
2,2019-02-01 00:21:50,LC3500,TL6,N0OTG


Then, to find the sites that were the last visit of the most users, group by 'site_id', count the 'user_id's, and display the top 3 rows.

In [268]:
datac1 = datac.groupby('site_id', as_index=False).user_id.count().nlargest(3,'user_id')
datac1

Unnamed: 0,site_id,user_id
1,5NPAU,992
5,N0OTG,561
6,QGO3G,289


Print the answers in the requested format.

In [269]:
for i in range(len(datac1)):
    print(f'({datac1.iloc[i].site_id}, {datac1.iloc[i].user_id})')

(5NPAU, 992)
(N0OTG, 561)
(QGO3G, 289)
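An equivalent way to pick each user's last visit is to sort ascending and keep the last duplicate per 'user_id', then count sites with `value_counts`. A minimal sketch on made-up data (`toy` and all ids are hypothetical):

```python
import pandas as pd

# Hypothetical visits: U1 and U2 end on S2, U3 has a single visit to S1.
toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2019-02-01 10:00", "2019-02-02 10:00",
        "2019-02-01 11:00", "2019-02-03 09:00",
        "2019-02-01 12:00",
    ]),
    "user_id": ["U1", "U1", "U2", "U2", "U3"],
    "site_id": ["S1", "S2", "S1", "S2", "S1"],
})

# After an ascending sort, keep="last" retains each user's latest row.
last_visits = toy.sort_values("ts").drop_duplicates("user_id", keep="last")

# value_counts() returns sites ordered by number of users, largest first.
last_site_counts = last_visits.site_id.value_counts()
print(last_site_counts.head(3))
```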


# Q4

For each user, determine the first site he/she visited and the last site he/she visited based on the timestamp data. Compute the number of users whose first/last visits are to the same website. What is the number?

## Include users with only 1 record

Some users have only one record in the whole data set. For these users the first and last visited sites are trivially the same, so this variant counts them.

First, sort the data by 'ts'.  
Then group by 'user_id' and take the first and last record of each user group.

In [278]:
datac = data.sort_values('ts').groupby('user_id', as_index=False).nth([0, -1]).sort_values(['user_id', 'ts'])
datac

Unnamed: 0,ts,user_id,country_id,site_id
1422,2019-02-03 18:52:50,LC00C3,QLT,5NPAU
1767,2019-02-04 11:35:10,LC01C3,QLT,5NPAU
733,2019-02-02 14:14:44,LC05C3,BDV,5NPAU
526,2019-02-01 22:49:39,LC06C3,TL6,N0OTG
3078,2019-02-07 01:16:12,LC06C3,TL6,N0OTG
...,...,...,...,...
470,2019-02-01 20:49:13,LCFC3E,BDV,5NPAU
580,2019-02-02 01:19:49,LCFEC3,TL6,N0OTG
3173,2019-02-07 06:23:59,LCFEC3,HVQ,3POLC
1027,2019-02-02 22:36:23,LCFFC3,XA7,N0OTG


To check whether each user's first and last records have the same 'site_id', group by 'user_id' and count the unique 'site_id' values.

A value of 1 means the first and last visits were to the same site.

In [279]:
datac = datac.groupby('user_id').site_id.nunique().reset_index()
datac

Unnamed: 0,user_id,site_id
0,LC00C3,1
1,LC01C3,1
2,LC05C3,1
3,LC06C3,1
4,LC07C3,1
...,...,...
1911,LCFC3B,1
1912,LCFC3D,1
1913,LCFC3E,1
1914,LCFEC3,2


Count the users whose number of unique 'site_id' values is 1.

In [290]:
datac.loc[datac.site_id==1].user_id.count()

1670
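The first/last comparison can also be made explicit by aggregating each user's sites with `first` and `last` and comparing the two columns directly. The sketch below assumes pandas 0.25 or later (for named aggregation) and uses made-up data; `toy`, `first_site`, `last_site`, and all ids are hypothetical names.

```python
import pandas as pd

# Hypothetical visits: U1 returns to S1, U2 moves S1 -> S2, U3 visits once.
toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2019-02-01 10:00", "2019-02-02 10:00", "2019-02-03 10:00",
        "2019-02-01 11:00", "2019-02-03 09:00",
        "2019-02-01 12:00",
    ]),
    "user_id": ["U1", "U1", "U1", "U2", "U2", "U3"],
    "site_id": ["S1", "S2", "S1", "S1", "S2", "S3"],
})

# One row per user with the chronologically first and last site.
first_last = (
    toy.sort_values("ts")
    .groupby("user_id")
    .site_id.agg(first_site="first", last_site="last")
)

# Single-visit users (U3) have first == last and are therefore counted.
n_same = int((first_last.first_site == first_last.last_site).sum())
print(n_same)
```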

## Ignore users with only 1 record

Sort the data by 'ts', group by 'user_id', and select the first and last records of each user group.

In [293]:
datac = data.sort_values('ts').groupby('user_id', as_index=False).nth([0, -1])
datac

Unnamed: 0,ts,user_id,country_id,site_id
0,2019-02-01 00:01:24,LC36FC,TL6,N0OTG
1,2019-02-01 00:10:19,LC39B6,TL6,N0OTG
2,2019-02-01 00:21:50,LC3500,TL6,N0OTG
3,2019-02-01 00:22:50,LC374F,TL6,N0OTG
4,2019-02-01 00:23:44,LCC1C3,TL6,QGO3G
...,...,...,...,...
3545,2019-02-07 23:44:34,LC3561,TL6,N0OTG
3547,2019-02-07 23:55:07,LC3837,TL6,RT9Z6
3548,2019-02-07 23:56:57,LC3F13,TL6,QGO3G
3550,2019-02-07 23:58:56,LC35EB,TL6,QGO3G


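Count the number of records for each user and exclude users with only 1 record. Because 'datac' holds at most two rows per user (the first and the last visit), a count of 1 identifies a user with a single visit.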

In [253]:
datac1 = datac.groupby('user_id', as_index=False).ts.count()
datac1 = datac1.loc[datac1.ts>1]
datac1

Unnamed: 0,user_id,ts
3,LC06C3,2
6,LC0C32,2
10,LC0C3E,2
11,LC0CC3,2
14,LC0FC3,2
...,...,...
1908,LCFC34,2
1909,LCFC36,2
1913,LCFC3E,2
1914,LCFEC3,2


Inner join this dataframe with 'datac' to get the first and last visited 'site_id's for the selected users.

In [254]:
datac2 = pd.merge(datac, datac1, on='user_id', how='inner')
datac2

Unnamed: 0,ts_x,user_id,country_id,site_id,ts_y
0,2019-02-01 00:01:24,LC36FC,TL6,N0OTG,2
1,2019-02-07 00:24:50,LC36FC,TL6,N0OTG,2
2,2019-02-01 00:22:50,LC374F,TL6,N0OTG,2
3,2019-02-03 04:50:43,LC374F,TL6,N0OTG,2
4,2019-02-01 00:24:21,LC3E1D,HVQ,GVOFK,2
...,...,...,...,...,...
1305,2019-02-07 19:30:22,LC362E,QLT,5NPAU,2
1306,2019-02-07 19:36:05,LC3D07,TL6,N0OTG,2
1307,2019-02-07 21:42:23,LC3D07,BDV,5NPAU,2
1308,2019-02-07 19:38:55,LC3557,BDV,5NPAU,2


Group by 'user_id' and get the number of unique 'site_id's for each user group. If the value is 1, the first and last visits were made to the same site.

In [255]:
datac3 = datac2.groupby('user_id').site_id.nunique().reset_index()
datac3

Unnamed: 0,user_id,site_id
0,LC06C3,1
1,LC0C32,2
2,LC0C3E,1
3,LC0CC3,1
4,LC0FC3,1
...,...,...
650,LCFC34,1
651,LCFC36,1
652,LCFC3E,1
653,LCFEC3,2


Count the number of users for whom the number of unique sites is 1.

In [259]:
datac3.loc[datac3.site_id==1].user_id.nunique()

409
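The exclusion of single-visit users can also be done up front, before picking first and last sites, by broadcasting each user's visit count back onto the rows with `transform("size")`. This is a sketch of that alternative on made-up data, not the approach used in the cells above; `toy` and all ids are hypothetical.

```python
import pandas as pd

# Hypothetical visits: U1 returns to S1, U2 moves S1 -> S2, U3 visits once.
toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2019-02-01 10:00", "2019-02-02 10:00", "2019-02-03 10:00",
        "2019-02-01 11:00", "2019-02-03 09:00",
        "2019-02-01 12:00",
    ]),
    "user_id": ["U1", "U1", "U1", "U2", "U2", "U3"],
    "site_id": ["S1", "S2", "S1", "S1", "S2", "S3"],
})

# Number of visits of the row's user, aligned with the original rows.
visits_per_user = toy.groupby("user_id").user_id.transform("size")

# Drop single-visit users, then sort so first()/last() follow time order.
multi = toy.loc[visits_per_user > 1].sort_values("ts")

first_site = multi.groupby("user_id").site_id.first()
last_site = multi.groupby("user_id").site_id.last()

# Only U1 has more than one visit and the same first and last site.
n_same = int((first_site == last_site).sum())
print(n_same)
```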