## In-class exercises week 7

In the final tutorial, we will work with two datasets containing digital traces and think through how to answer questions about them.
Both datasets are fictional and come from an online service that is currently running multiple communication campaigns to attract users to its website.

* `da56_sessions.pkl.gz` contains all visits (sessions) to the website. It includes the following columns:
    * `session_id` - unique id of the session
    * `session_timestamp` - starting time of the session
    * `user_agent` - information about the browser from which the session takes place
    * `referral` - the website from which the visitor came
    * `paid_campaign` - categorical variable indicating the type of paid campaign that the visitor came from (one of four). NaN indicates that the visitor did not come from a paid campaign.
    * `user_id` - if the visitor has an account, their user id is recorded here
    
* `da56_users.pkl` contains information on registered users of the website. It includes the following columns:
    * `id` - unique user id for registered users
    * `reg_name` - users' name used for registration
    * `age` - users' age provided in the registration
    * `registration_date` - registration date
    * `initial_referrer` - the website from which this user originally came when registering
    * `preferential_client` - takes the value 1 when the user has preferential status
    

How would you do the following:
* Deal with missing values in the dataset? What are my options? What do missings mean?
* Merge the two datasets: what are my options? What are the consequences of different types of merge for unit of analysis and for missing values?
    * Scenario: I want to know if a visitor was a preferential client. What merge type do I need? What do I do with missing values?
* Minimize the datasets: what do I need to know? What steps do I need to take?
    * Scenario: I am only interested in referrals, type of paid campaign and if visitor was a preferential client. What do I need to do to minimize the data?
* Create categorical variables: how do I "recode" the existing columns? What steps do I need to take?
    * Scenario: I want to have a new variable about visits that is binary and tells me if a visitor came to my website from social media (Instagram or Facebook) or not. What column do I need to "recode"? How do I do that?

In [1]:
import pandas as pd

In [2]:
sessions = pd.read_pickle('da56_sessions.pkl.gz')

In [3]:
users = pd.read_pickle('da56_users.pkl')

# Exploring the data sets

In [4]:
users.head()

Unnamed: 0,id,reg_name,age,registration_date,initial_referrer,preferential_client
0,55885858,Melissa Hanson,21,2019-10-29,instagram.com,
1,55885859,Danielle Evans,20,2019-11-08,instagram.com,
2,55885860,Erika Horton,18,2021-01-27,google.com,
3,55885861,Nicole Campbell,46,2020-06-21,google.com,
4,55885862,Jessica Sanchez,64,2020-02-15,massey.com,


In [5]:
sessions.head()

Unnamed: 0,session_id,session_timestamp,user_agent,referral,paid_campaign,user_id
0,5555694754,2021-09-20 06:59:09,Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...,google.com,1.0,
1,5555694755,2021-09-23 19:07:17,Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple...,instagram.com,4.0,
2,5555694756,2021-09-25 14:12:23,Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge...,instagram.com,,
3,5555694757,2021-09-20 11:12:36,Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...,google.com,,
4,5555694758,2021-09-24 11:12:22,Opera/9.52.(X11; Linux i686; bho-IN) Presto/2....,google.com,2.0,


In [6]:
users.columns

Index(['id', 'reg_name', 'age', 'registration_date', 'initial_referrer',
       'preferential_client'],
      dtype='object')

In [7]:
sessions.columns

Index(['session_id', 'session_timestamp', 'user_agent', 'referral',
       'paid_campaign', 'user_id'],
      dtype='object')

In [8]:
sessions.shape

(50000, 6)

In [9]:
users.shape

(1000, 6)

# Deal with missing values in the dataset? What are my options? What do missings mean?
My options include...
- deleting rows with missing values in specific columns
- filling them with a value (depending on whether a meaningful replacement is possible)
- leaving them in

Some missing values in the data set are __meaningful__. For instance, in the __sessions__ data set, the missing values in user_id indicate that these visits were __not from a registered user__. We therefore do not have to delete rows with missing values in this column if we are interested in any visit to the website. 
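The three options listed above can be sketched on a small toy data frame. The frame `toy_sessions` and its values are made up for illustration, and using 0 as a code for "no paid campaign" is an assumption, not something the dataset description prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sessions data (values are invented for illustration).
toy_sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4"],
    "paid_campaign": [1.0, np.nan, 4.0, np.nan],
    "user_id": ["u1", np.nan, np.nan, "u2"],
})

# Option 1: delete rows with a missing value in a specific column.
# This keeps only visits from registered users.
registered_only = toy_sessions.dropna(subset=["user_id"])

# Option 2: fill missing values with a meaningful replacement.
# Assumption: 0 is used here as the code for "no paid campaign".
filled = toy_sessions.assign(
    paid_campaign=toy_sessions["paid_campaign"].fillna(0)
)

# Option 3: leave the missing values in, but make their meaning explicit
# in a new boolean column.
flagged = toy_sessions.assign(is_registered=toy_sessions["user_id"].notna())
```

Which option fits depends on what the missing value means for the research question. Deleting rows changes the set of visits under study, whereas filling or flagging keeps every visit.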

In [10]:
sessions.isna().sum()

session_id               0
session_timestamp        0
user_agent               0
referral                 0
paid_campaign        21569
user_id              34877
dtype: int64

In [11]:
users.isna().sum()

id                       0
reg_name                 0
age                      0
registration_date        0
initial_referrer         0
preferential_client    831
dtype: int64

# Merge the two datasets: what are my options? What are the consequences of different types of merge for unit of analysis and for missing values?
In the sessions data frame, the units of analysis are visits. In the users data frame, the units of analysis are registered users.
Depending on your research question and the unit of analysis it implies, you prioritise one of the data frames when merging.
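To see how the merge type changes the number of rows, and with it the unit of analysis, here is a minimal sketch on two toy frames. The frames `toy_sessions` and `toy_users` and their values are invented for illustration. Only the column names follow the datasets described above.

```python
import pandas as pd

# Five visits: two anonymous (no user_id), two from u1, one from u2.
toy_sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4", "s5"],
    "user_id": ["u1", None, "u1", "u2", None],
})

# Three registered users; u3 never visited.
toy_users = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "preferential_client": [1.0, None, 1.0],
})

# Count the rows that each merge type produces.
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(
        toy_sessions, toy_users,
        how=how, left_on="user_id", right_on="id",
    )
    row_counts[how] = len(merged)

print(row_counts)
```

A left merge keeps one row per visit, including anonymous visits, with missing values in the user columns. An inner merge drops anonymous visits as well as users who never visited. A right or outer merge adds a row for users without visits, so those rows are no longer visits.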

# Scenario: I want to know if a visitor was a preferential client. What merge type do I need? What do I do with missing values?
The unit of analysis is __visits__. Therefore, sessions is my primary dataframe. I will do a left merge of sessions (left) with users (right) to keep all rows in sessions. Note that the ID keys are named differently in the two dataframes: 'id' in users and 'user_id' in sessions.

There will be a lot of missing values. The missing values in preferential_client in the users dataframe can be filled with zero, because we can interpret them as "not a preferential client" (based on the dataset description). Visits without a user_id do not match any registered user, so they also end up with a missing preferential_client after the merge. This means that all missing values in preferential_client in the merged dataframe are __meaningful__: they indicate that the visit is __not from a preferential client__. I will fill them with zero.
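Before filling, it can help to check where the missing values come from. The sketch below uses invented toy frames and two optional arguments of `pd.merge`: `indicator=True`, which records whether each visit matched a user, and `validate="many_to_one"`, which raises an error if a user id occurs more than once in the users frame.

```python
import pandas as pd

# Toy frames with invented values; column names follow the datasets above.
toy_sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3"],
    "user_id": ["u1", None, "u2"],
})
toy_users = pd.DataFrame({
    "id": ["u1", "u2"],
    "preferential_client": [1.0, None],
})

merged = pd.merge(
    toy_sessions, toy_users,
    how="left", left_on="user_id", right_on="id",
    validate="many_to_one",  # many visits per user, each user id only once
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Missing values now have two sources: anonymous visits ('left_only') and
# registered users without preferential status. Both are recoded as 0.
merged["preferential_client"] = merged["preferential_client"].fillna(0).astype(int)
```

The `_merge` column makes it possible to tell anonymous visits apart from registered non-preferential users later on, even though both now carry a 0.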

In [12]:
sessions_plus = pd.merge(sessions, users, how="left", left_on="user_id", right_on="id")
sessions_plus.shape

(50000, 12)

In [13]:
# same length as the original sessions data frame
sessions.shape

(50000, 6)

In [14]:
sessions_plus.head()

Unnamed: 0,session_id,session_timestamp,user_agent,referral,paid_campaign,user_id,id,reg_name,age,registration_date,initial_referrer,preferential_client
0,5555694754,2021-09-20 06:59:09,Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...,google.com,1.0,,,,,,,
1,5555694755,2021-09-23 19:07:17,Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple...,instagram.com,4.0,,,,,,,
2,5555694756,2021-09-25 14:12:23,Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge...,instagram.com,,,,,,,,
3,5555694757,2021-09-20 11:12:36,Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...,google.com,,,,,,,,
4,5555694758,2021-09-24 11:12:22,Opera/9.52.(X11; Linux i686; bho-IN) Presto/2....,google.com,2.0,,,,,,,


In [15]:
sessions_plus.isna().sum()

session_id                 0
session_timestamp          0
user_agent                 0
referral                   0
paid_campaign          21569
user_id                34877
id                     34877
reg_name               34877
age                    34877
registration_date      34877
initial_referrer       34877
preferential_client    47488
dtype: int64

In [16]:
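# fill NAs in preferential_client with 0, then check the merged data frame
sessions_plus['preferential_client'] = sessions_plus['preferential_client'].fillna(0)
sessions_plus.isna().sum()

session_id                 0
session_timestamp          0
user_agent                 0
referral                   0
paid_campaign          21569
user_id                34877
id                     34877
reg_name               34877
age                    34877
registration_date      34877
initial_referrer       34877
preferential_client        0
dtype: int64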

In [17]:
# 2512 visits come from preferential clients.
sessions_plus.preferential_client.value_counts()

0.0    47488
1.0     2512
Name: preferential_client, dtype: int64

In [18]:
# Note that there are multiple visits from the same id, i.e. from the same registered user.
sessions_plus.id.value_counts()

55886491    27
55886625    26
55886845    26
55886116    26
55886018    26
            ..
55885964     7
55886220     7
55885858     6
55885965     6
55886481     6
Name: id, Length: 1000, dtype: int64

# Minimize the datasets: what do I need to know? What steps do I need to take?
I need to know which data are essential to my analysis. I want to avoid unnecessarily linking potentially privacy-sensitive data. Therefore, I merge only the columns that I am interested in.

# Scenario: I am only interested in referrals, type of paid campaign and if visitor was a preferential client. What do I need to do to minimize the data?
Referrals and type of paid campaign are columns in the __sessions__ data set. Preferential client is a column in the __users__ data set. We do not need the other data. I will avoid merging the other columns by first selecting only what I need: the id columns (for the merge) and my columns of interest.

In [19]:
# left merge, prioritising sessions
# select only the columns I need before merging:
# user_id, referral and paid_campaign from sessions; id and preferential_client from users
sessions_min = pd.merge(sessions[['user_id', 'referral', 'paid_campaign']], users[['id', 'preferential_client']], how='left', 
                         left_on='user_id', right_on='id')
sessions_min.head()

Unnamed: 0,user_id,referral,paid_campaign,id,preferential_client
0,,google.com,1.0,,
1,,instagram.com,4.0,,
2,,instagram.com,,,
3,,google.com,,,
4,,google.com,2.0,,


In [20]:
sessions_min.shape

(50000, 5)
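The merged frame still carries the `user_id` and `id` keys, which were only needed to link the two data sets. A possible last minimization step, sketched here on invented toy frames, is to drop both keys once the merge is done.

```python
import pandas as pd

# Toy frames with invented values; column names follow the datasets above.
toy_sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3"],
    "user_agent": ["Mozilla/5.0", "Opera/9.52", "Mozilla/5.0"],
    "referral": ["google.com", "instagram.com", "facebook.com"],
    "paid_campaign": [1.0, None, 4.0],
    "user_id": ["u1", None, "u2"],
})
toy_users = pd.DataFrame({
    "id": ["u1", "u2"],
    "reg_name": ["Name A", "Name B"],
    "preferential_client": [1.0, None],
})

# Merge only the key columns and the columns of interest.
toy_min = pd.merge(
    toy_sessions[["user_id", "referral", "paid_campaign"]],
    toy_users[["id", "preferential_client"]],
    how="left", left_on="user_id", right_on="id",
)

# The keys have served their purpose, so remove them from the result.
toy_min = toy_min.drop(columns=["user_id", "id"])
```

Dropping the keys means the rows can no longer be linked back to individual accounts, so this step only makes sense once no further merges are planned.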

# Create categorical variables: how do I "recode" the existing columns? What steps do I need to take?
We can recode numerical or categorical variables by writing a function that is applied to each value of a column.
First, we need to explore the data we want to "recode". What does it look like?
Second, we need to write a function that takes a single value (a number or a string) as input and returns the new category.
Third, we need to apply this function to a specific column of the dataframe.
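The same three steps also work for a numerical column. As a minimal sketch, the example below recodes `age` into groups. The toy frame and the cut-off points (30 and 60) are hypothetical choices for illustration, not part of the exercise.

```python
import pandas as pd

# Step 1: explore the column (toy data with invented ages).
toy_users = pd.DataFrame({"age": [18, 21, 46, 64]})
print(toy_users["age"].describe())

# Step 2: write a function that takes one value and returns a category.
# The cut-offs 30 and 60 are arbitrary, hypothetical boundaries.
def age_group(age):
    if age < 30:
        return "under 30"
    elif age < 60:
        return "30-59"
    else:
        return "60+"

# Step 3: apply the function to the column and store the result.
toy_users["age_group"] = toy_users["age"].apply(age_group)
print(toy_users)
```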

# Scenario: I want to have a new variable about visits that is binary and tells me if a visitor came to my website from social media (Instagram or Facebook) or not. What column do I need to "recode"? How do I do that?
This information is in the __referral__ column of the sessions data set. After exploring what this column looks like, I noticed instagram.com and facebook.com. I write a function that takes a string (the website) as input and uses a conditional statement: if 'facebook' or 'instagram' is in the string, return 1 (social media), else return 0 (other website). I apply this function to the referral column and store the binary labels in a new column.

In [21]:
# explore what the referral column looks like
sessions.referral.value_counts()

google.com              11168
instagram.com           11136
facebook.com            10938
smith.com                 135
johnson.com               104
                        ...  
newton-stewart.org          1
vance-anderson.com          1
valencia-stevens.com        1
townsend.net                1
morales-patel.com           1
Name: referral, Length: 8630, dtype: int64

In [22]:
sessions.dtypes

session_id                   object
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
dtype: object

In [23]:
# write a function that categorizes websites into social media or not.
def social_media(website):  # the function expects a string as argument
    # if 'facebook' or 'instagram' is in the string, return 1
    if 'facebook' in website or 'instagram' in website:
        return 1
    else:  # in all other cases, return 0
        return 0

In [24]:
# make a new column 'social_media_ref' (1 = referral from social media, 0 = other website)
sessions['social_media_ref'] = sessions['referral'].apply(social_media)
sessions.social_media_ref.value_counts()

0    27926
1    22074
Name: social_media_ref, dtype: int64
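As an alternative sketch, the same recode can be written without a custom function by using the pandas string method `str.contains`. The toy frame below is invented for illustration. Passing `na=False` treats a missing referral as "not social media", which is an assumption that does not matter for the real data, where referral has no missing values.

```python
import pandas as pd

# Toy stand-in for the referral column (values taken from the output above).
toy_sessions = pd.DataFrame({
    "referral": ["google.com", "instagram.com", "facebook.com", "smith.com"],
})

# 'facebook|instagram' is a regular expression meaning "facebook OR instagram".
# The boolean result is converted to 1 (social media) / 0 (other website).
toy_sessions["social_media_ref"] = (
    toy_sessions["referral"]
    .str.contains("facebook|instagram", na=False)
    .astype(int)
)
print(toy_sessions)
```

The function-based version is easier to extend with further conditions, while the vectorized version is shorter and usually faster on large data frames.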