# Smart TV Privacy Analysis

In this assignment, we will analyze real traces collected from Roku TV and Amazon FireTV to learn more about the behavior of TV channels and whether they raise any privacy concerns.

In [70]:
import matplotlib.pyplot as plt
import pandas as pd

First, let's read in the data. Given the large size, all data files are stored as "pickles" (https://docs.python.org/3/library/pickle.html) to save space. 

We have four files here: two files for Roku and two files for Amazon FireTV, where "vanilla" corresponds to the default configuration, and "limitad" corresponds to having the "limit ad tracking" option turned on. 

All four files contain HTTP response traffic.

In [71]:
# Roku data
roku_limitad = pd.read_pickle("roku-data-limitad_http_resp.pickle")
roku_vanilla = pd.read_pickle("roku-data_http_resp.pickle")

# Amazon data
amzn_limitad = pd.read_pickle("amazon-data-limitad_http_resp.pickle")
amzn_vanilla = pd.read_pickle("amazon-data_http_resp.pickle")
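
For readers unfamiliar with pickles, the round trip can be sketched as follows. The toy dataframe and the temporary file name are made up for illustration; only `DataFrame.to_pickle` and `pd.read_pickle` are actual pandas calls.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a trace file (values are made up for illustration).
toy = pd.DataFrame({
    "channel_id": [13, 13, 42],
    "domain_by_dns": ["images-amazon.com", "images-amazon.com", "roku.com"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_http_resp.pickle")  # hypothetical file name
    toy.to_pickle(path)              # serialize the dataframe to disk
    restored = pd.read_pickle(path)  # load it back as a dataframe

print(restored.equals(toy))
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.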

Let's see how many records we have in each file.

In [72]:
# Roku: number of records
print("Roku:", len(roku_vanilla), len(roku_limitad))

# Amazon: number of records
print("Amazon:", len(amzn_vanilla), len(amzn_limitad))

Roku: 4279 3984
Amazon: 8695 8124


Now, let's take a closer look at the data. The pickle file is read into a "dataframe" (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). You can think of it as a SQL table or spreadsheet with columns and rows.

For example, let's check out the first five rows in "roku_vanilla".

In [73]:
roku_vanilla.head(5)

Unnamed: 0,channel_id,data,time,body,location,code,set_cookie,user_agent,ip_src,tcp_srcport,tcp_stream,http2,host_by_dns,domain_by_dns,channel_name,rank,category,status,playback
0,13,01d27b2269223a2263613036333836342d333432612d34...,1556908916.319217,,,,,,54.175.224.52,2350,17,False,,,Prime Video,3,Movies & TV,TERMINATED,False
1,13,,1556908947.88691,ï¿½PNG\r\n\n,,200.0,,,54.230.51.79,80,62,False,g-ecx.images-amazon.com,images-amazon.com,Prime Video,3,Movies & TV,TERMINATED,False
2,13,,1556908964.537949,ï¿½PNG\r\n\n,,200.0,,,54.230.51.130,80,96,False,g-ecx.images-amazon.com,images-amazon.com,Prime Video,3,Movies & TV,TERMINATED,False
3,13,,1556909007.42293,ï¿½PNG\r\n\n,,200.0,,,54.230.51.79,80,146,False,g-ecx.images-amazon.com,images-amazon.com,Prime Video,3,Movies & TV,TERMINATED,False
4,13,,1556909041.44391,ï¿½PNG\r\n\n,,200.0,,,54.230.51.130,80,193,False,g-ecx.images-amazon.com,images-amazon.com,Prime Video,3,Movies & TV,TERMINATED,False


We can see that all five records come from the same channel (channel_id = 13), which is "Prime Video" (channel_name). 

In rows 1 through 4, the channel queries the "images-amazon.com" domain (domain_by_dns); row 0 has no resolved host or domain. 
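
Observations like these can be checked by looking at the distinct values in the relevant columns. The sketch below rebuilds the five rows shown above by hand instead of loading the real file, and assumes that the blank `domain_by_dns` in row 0 is stored as an empty string.

```python
import pandas as pd

# The five rows shown above, reduced to the three relevant columns.
# Assumption: the blank domain in row 0 is an empty string.
first_rows = pd.DataFrame({
    "channel_id": [13, 13, 13, 13, 13],
    "channel_name": ["Prime Video"] * 5,
    "domain_by_dns": ["", "images-amazon.com", "images-amazon.com",
                      "images-amazon.com", "images-amazon.com"],
})

print(first_rows["channel_id"].unique())           # distinct channel IDs
print(first_rows["domain_by_dns"].value_counts())  # number of rows per domain
```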

# Part 1: analyze/compare popular domains

First, let's find out which domains are the most popular.

Complete the following "get_popular_domains" function that outputs the **top five domains** that are queried by the largest number of **distinct channels**. \
For each domain, output the number of distinct channels that queried it. 

For example: \
domain1, num_distinct_channels \
domain2, num_distinct_channels \
...

Hint: use the **domain_by_dns** column to extract domain, and the **channel_id** column to extract distinct/unique channel.

You can either print the output or return it as a table in the format above.

In [74]:
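def get_popular_domains(df):
    # Keep only the two columns needed for the count.
    df = df[["domain_by_dns", "channel_id"]]
    # Count the distinct channels that queried each domain.
    df = df.groupby("domain_by_dns", as_index=False).nunique()
    # Most widely queried domains first.
    df = df.sort_values("channel_id", ascending=False)
    df.columns = ["domain", "num_distinct_channels"]
    print(df.head(5))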
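
To see the counting logic in isolation, here is a minimal sketch on made-up rows. The helper `count_channels_per_domain` and its `drop_blank` flag are hypothetical and not part of the assignment; the variant returns the table instead of printing it and can optionally skip records whose `domain_by_dns` is blank, assuming blanks are stored as empty strings.

```python
import pandas as pd

def count_channels_per_domain(df, drop_blank=False):
    """Return one row per domain with its number of distinct channels."""
    if drop_blank:
        df = df[df["domain_by_dns"] != ""]  # assumes blanks are empty strings
    return (
        df.groupby("domain_by_dns")["channel_id"]
        .nunique()                          # distinct channels per domain
        .sort_values(ascending=False)
        .rename("num_distinct_channels")
        .rename_axis("domain")
        .reset_index()
    )

# Made-up rows: a.com is queried by channels 1 and 2, b.com by channel 3.
toy = pd.DataFrame({
    "domain_by_dns": ["a.com", "a.com", "a.com", "b.com", "", ""],
    "channel_id":    [1,       1,       2,       3,       1,  4],
})
result = count_channels_per_domain(toy, drop_blank=True)
print(result.head(5))
```

Repeated queries from the same channel (channel 1 to a.com here) count only once, which is the difference between `nunique()` and `count()`.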

Now, let's run it and get the top five domains from each file (roku/amazon + vanilla/limitads)

In [75]:
# Roku vanilla
get_popular_domains(roku_vanilla)

                    domain  num_distinct_channels
0                                              96
100               roku.com                     29
59                ifood.tv                     14
104  scorecardresearch.com                     13
39              demdex.net                     11


In [76]:
# Roku w/ limit ads option
get_popular_domains(roku_limitad)

                    domain  num_distinct_channels
0                                              97
102               roku.com                     26
105  scorecardresearch.com                     15
62                ifood.tv                     14
38              demdex.net                     12


In [77]:
# Amazon vanilla
get_popular_domains(amzn_vanilla)

                 domain  num_distinct_channels
21           amazon.com                     81
0                                           53
70      crashlytics.com                     31
19       amazon-dss.com                     30
18  amazon-adsystem.com                     27


In [78]:
# Amazon w/ limit ads option
get_popular_domains(amzn_limitad)

                 domain  num_distinct_channels
19           amazon.com                     76
0                                           42
65      crashlytics.com                     31
17       amazon-dss.com                     20
30  app-measurement.com                     19


## Question 1: Describe what you observe from the above results. Do you think the "limit ads" option reduces the number of channels that queried domains on Roku and Amazon FireTV? Why or why not?

Based on the above results, I think that the "limit ads" option does not reduce the number of channels that query the top domains on Roku, but does reduce that number somewhat on Amazon FireTV. If the option were effective, we would expect fewer channels to query each domain. On Roku, however, three of the top five domains (the blank domain, scorecardresearch.com, and demdex.net) were queried by more channels with the option on, ifood.tv was unchanged at 14, and only roku.com dropped (29 to 26). On Amazon FireTV, the counts fell for amazon.com (81 to 76), the blank domain (53 to 42), and amazon-dss.com (30 to 20); amazon-adsystem.com (27 channels in the vanilla data) dropped out of the top five entirely, and crashlytics.com stayed at 31. This suggests that the "limit ads" option limits the number of channels that query these domains somewhat on Amazon FireTV, but not on Roku. 
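
The comparison can be made explicit by subtracting the two tables. This is a minimal sketch that uses the Roku channel counts printed above, typed in by hand rather than recomputed from the pickle files; the empty string stands for the blank domain.

```python
import pandas as pd

# Channel counts copied from the Roku outputs above; "" is the blank domain.
vanilla = pd.Series({"": 96, "roku.com": 29, "ifood.tv": 14,
                     "scorecardresearch.com": 13, "demdex.net": 11})
limitad = pd.Series({"": 97, "roku.com": 26, "scorecardresearch.com": 15,
                     "ifood.tv": 14, "demdex.net": 12})

# Subtraction aligns on the domain label, so the row order does not matter.
change = (limitad - vanilla).sort_values()
print(change)
```

A negative value means fewer channels queried the domain with "limit ad tracking" turned on.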

# Part 2: analyze/compare channels

Now, let's take a look at the channels. We want to find out which channels queried the largest number of distinct domains. 

Complete the following "get_channels_with_most_domains" function that outputs the **top five channel names** that queried the largest number of **distinct domains**. \
For each channel name, output its channel category and the number of distinct domains it queried.

For example: \
channel1, category, num_distinct_domains \
channel2, category, num_distinct_domains \
...

Hint: use the **channel_name** column to extract channel name, and the **category** column to extract channel category.

You can either print the output or return it as a table.

In [79]:
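def get_channels_with_most_domains(df, head=5):
    # Keep only the columns needed for the count.
    df = df[["channel_name", "category", "domain_by_dns"]]
    # Count the distinct domains queried by each (channel, category) pair.
    df = df.groupby(["channel_name", "category"], as_index=False).nunique()
    df = df.sort_values("domain_by_dns", ascending=False)
    df.columns = ["channel", "category", "num_distinct_domains"]
    # Honor the `head` argument instead of a hard-coded 5.
    print(df.head(head))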
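
Grouping on two keys at once can be illustrated on a toy dataframe. The channel names, categories, and domains below are made up; the sketch only shows how the category is carried along while the distinct domains are counted.

```python
import pandas as pd

# Made-up rows: "News A" contacts two distinct domains, "Kids B" only one.
toy = pd.DataFrame({
    "channel_name":  ["News A", "News A", "News A", "Kids B", "Kids B"],
    "category":      ["News", "News", "News", "Kids", "Kids"],
    "domain_by_dns": ["x.com", "y.com", "x.com", "x.com", "x.com"],
})

# Group on both keys so the category stays attached to the channel name.
per_channel = (
    toy.groupby(["channel_name", "category"])["domain_by_dns"]
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name="num_distinct_domains")
)
print(per_channel)
```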

Now, let's run it and get the top five channels from each file (roku/amazon + vanilla/limitads)

In [80]:
# Roku vanilla
get_channels_with_most_domains(roku_vanilla)

             channel          category  num_distinct_domains
20   CopyKat Recipes              Food                    18
58          NBC News    News & Weather                    16
15             CNNgo    News & Weather                    10
55  Models In Motion  Special Interest                    10
67       Nickelodeon     Kids & Family                     9


In [81]:
# Roku w/ limit ads option
get_channels_with_most_domains(roku_limitad)

        channel          category  num_distinct_domains
81  Sexy Shorts  Special Interest                    23
58     NBC News    News & Weather                    17
59   NBC Sports            Sports                    11
57          NBA            Sports                    10
67  Nickelodeon     Kids & Family                     9


In [82]:
# Amazon vanilla
get_channels_with_most_domains(amzn_vanilla)

                                              channel     category  \
20  CuriosityStream - Watch Documentaries Online (TV)      Medical   
71                                  The CW on Fire TV  Movies & TV   
14                                 CBS News - Fire TV         News   
48                                           NBC News         News   
47                                                NBC  Movies & TV   

    num_distinct_domains  
20                    40  
71                    25  
14                    23  
48                    21  
47                    21  


In [83]:
# Amazon w/ limit ads option
get_channels_with_most_domains(amzn_limitad)

                                              channel     category  \
20  CuriosityStream - Watch Documentaries Online (TV)      Medical   
47                                                NBC  Movies & TV   
71                                  The CW on Fire TV  Movies & TV   
15                 CBS Sports Stream &amp; Watch Live       Sports   
14                                 CBS News - Fire TV         News   

    num_distinct_domains  
20                    41  
47                    28  
71                    20  
15                    19  
14                    19  


## Question 2: Describe what you observe from the above results. Do you think the "limit ads" option reduces the number of domains queried by channels on Roku and Amazon FireTV? Why or why not?

Based on the above results, I think that the "limit ads" option does not reduce the number of distinct domains queried by channels on either Roku or Amazon FireTV. On Roku, the channels that appear in both top-five lists barely changed: NBC News went from 16 to 17 domains and Nickelodeon stayed at 9, while the top channel with the option on (Sexy Shorts, 23 domains) queried more domains than the top channel in the vanilla data (CopyKat Recipes, 18). On Amazon FireTV the picture is mixed: CuriosityStream (40 to 41) and NBC (21 to 28) increased, while The CW (25 to 20) and CBS News (23 to 19) decreased. Since there is no consistent drop among the top channels on either platform, I conclude that the "limit ads" option does not reduce the number of distinct domains queried by channels.
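
Because the two top-five lists do not contain the same channels, a fair comparison should be restricted to channels that appear in both. A minimal sketch with an inner join, using the Amazon domain counts printed above (typed in by hand, with channel names shortened for readability):

```python
import pandas as pd

# Domain counts copied from the Amazon outputs above (channel names shortened).
vanilla = pd.DataFrame({
    "channel": ["CuriosityStream", "The CW", "CBS News", "NBC News", "NBC"],
    "vanilla": [40, 25, 23, 21, 21],
})
limitad = pd.DataFrame({
    "channel": ["CuriosityStream", "NBC", "The CW", "CBS Sports", "CBS News"],
    "limitad": [41, 28, 20, 19, 19],
})

# An inner join keeps only the channels that appear in both top-five lists.
both = vanilla.merge(limitad, on="channel", how="inner")
both["change"] = both["limitad"] - both["vanilla"]
print(both)
```

Running the same join on the full per-channel tables, rather than only the top five, would give a more complete picture.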

# Part 3: analyze "data leak"

Finally, let's take a deeper look into which data is being collected by the channels. We call it "data leak". Every channel may collect different information about the user, e.g., device information and geolocation.

Here, we need to use new pickle files with the "data leak" information. Let's load the data. 

In [84]:
# Roku data
roku_limitad_leak = pd.read_pickle("roku-data-limitad_leak.pickle")
roku_vanilla_leak = pd.read_pickle("roku-data_leak.pickle")

# Amazon data
amzn_limitad_leak = pd.read_pickle("amazon-data-limitad_leak.pickle")
amzn_vanilla_leak = pd.read_pickle("amazon-data_leak.pickle")

As before, the data is stored as a dataframe. 

We will focus on the **id_type** column here. This column indicates *what type of data is being collected and sent by the channel*. \
Let's see what's in the **id_type**.

In [85]:
list(roku_vanilla_leak.id_type.unique())

['Channel name', 'AD ID', 'Build Number', 'Serial No', 'Zip', 'City']

Note that the **id_type** is a bit different in Amazon.

In [86]:
list(amzn_vanilla_leak.id_type.unique())

['Android ID',
 'Channel name',
 'AD ID',
 'Serial No',
 'Device name',
 'Zip',
 'MAC',
 'City']
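
The difference between the two platforms can be spelled out with set operations. The sketch below copies the two lists printed above into Python sets by hand.

```python
# id_type values copied from the two outputs above.
roku_types = {"Channel name", "AD ID", "Build Number", "Serial No", "Zip", "City"}
amzn_types = {"Android ID", "Channel name", "AD ID", "Serial No",
              "Device name", "Zip", "MAC", "City"}

print("Only on Roku:  ", sorted(roku_types - amzn_types))
print("Only on Amazon:", sorted(amzn_types - roku_types))
print("On both:       ", sorted(roku_types & amzn_types))
```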

Now, for each of the **id_type** above, we want to know: 
* How many times it's leaked (i.e., the number of records that have this id_type)
* How many channels leak this data (i.e., the number of distinct channels that have this id_type)

Complete the following "analyze_data_leak" function that does the above. It should output the number of leaks and the number of channels for each **id_type**.

For example: \
id_type_1, num_leaks, num_channels \
id_type_2, num_leaks, num_channels \
...

You can either print the output or return it as a table.

In [87]:
def analyze_data_leak(leak_df):
    df = leak_df[["id_type", "channel_id"]]
    grouped = df.groupby("id_type", as_index=False)
    # count() gives the records per id_type; nunique() gives the distinct channels.
    summary = grouped.count().merge(grouped.nunique(), on="id_type")
    summary.columns = ["id_type", "num_leaks", "num_channels"]
    print(summary)
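
The same two statistics can also be computed in a single pass with named aggregation. This is an alternative sketch on made-up rows, not the method the assignment requires.

```python
import pandas as pd

# Made-up leak records: "AD ID" leaks 3 times from 2 channels,
# "Zip" leaks twice from a single channel.
toy_leaks = pd.DataFrame({
    "id_type":    ["AD ID", "AD ID", "AD ID", "Zip", "Zip"],
    "channel_id": [1,       1,       2,       3,     3],
})

summary = (
    toy_leaks.groupby("id_type")["channel_id"]
    .agg(num_leaks="count", num_channels="nunique")  # one output column each
    .reset_index()
)
print(summary)
```

Named aggregation avoids the intermediate merge and names the output columns directly.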

Now, let's run it and find out the data leaks from each file (roku/amazon + vanilla/limitads)

In [88]:
# Roku vanilla
analyze_data_leak(roku_vanilla_leak)

        id_type  num_leaks  num_channels
0         AD ID        655            30
1  Build Number        450            34
2  Channel name       1747            25
3          City          2             2
4     Serial No        246            14
5           Zip          8             1


In [89]:
# Roku w/ limit ads option
analyze_data_leak(roku_limitad_leak)

        id_type  num_leaks  num_channels
0  Build Number        394            34
1  Channel name       1555            26
2          City         16             3
3     Serial No        213            13
4         State          7             1
5           Zip         22             2


In [90]:
# Amazon vanilla
analyze_data_leak(amzn_vanilla_leak)

        id_type  num_leaks  num_channels
0         AD ID        269            35
1    Android ID        923            69
2  Channel name       1437            20
3          City         12             1
4   Device name          8             4
5           MAC         26             3
6     Serial No        201            33
7           Zip         21             2


In [91]:
# Amazon w/ limit ads option
analyze_data_leak(amzn_limitad_leak)

        id_type  num_leaks  num_channels
0         AD ID        218            16
1    Android ID       1000            65
2  Channel name       1419            20
3          City         24             1
4   Device name          5             2
5           MAC         30             3
6     Serial No        161            24
7           Zip         28             2


## Question 3: Describe what you observe from the above results. Does the "limit ads" option reduce or eliminate "data leaks"? If so, what's the degree of reduction? Please discuss Roku and Amazon FireTV separately. 

Based on the above results, I think that the "limit ads" option substantially reduces "data leaks" on Roku devices but does not reduce them overall on Amazon FireTV devices.

On Roku, the number of leaks dropped for most id types: "Build Number" fell from 450 to 394, "Channel name" from 1747 to 1555, and "Serial No" from 246 to 213. The most noticeable change is that the "AD ID" id type appears in the vanilla data but not in the limited data, while the "State" id type appears in the limited data but not in the vanilla data. Because "State" leaked only 7 times in the limited data, it is probably an id type that leaks infrequently and simply did not show up in the vanilla capture. "AD ID", by contrast, leaked 655 times in the vanilla data, so it is sent frequently and is probably excluded intentionally when the "limit ads" option is on.

On Amazon FireTV, the number of leaks stayed roughly the same for most id types. The notable exceptions are "AD ID" (269 to 218 leaks, 35 to 16 channels) and "Serial No" (201 to 161 leaks, 33 to 24 channels), which were reduced somewhat but not eliminated. 
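
To put a number on the degree of reduction, the leak counts can be turned into percent changes. The sketch below copies `num_leaks` from the four outputs above by hand; `percent_change` is a hypothetical helper, and treating an id type that is missing on one side as 0 leaks is an assumption.

```python
import pandas as pd

def percent_change(vanilla, limitad):
    """Percent change in num_leaks per id type (hypothetical helper).

    Id types missing on one side are treated as 0 leaks, so a type that is
    absent from the vanilla data shows up as inf.
    """
    table = pd.DataFrame({"vanilla": vanilla, "limitad": limitad}).fillna(0)
    diff = table["limitad"] - table["vanilla"]
    table["pct_change"] = (diff / table["vanilla"] * 100).round(1)
    return table

# num_leaks copied from the outputs above.
roku = percent_change(
    pd.Series({"AD ID": 655, "Build Number": 450, "Channel name": 1747,
               "City": 2, "Serial No": 246, "Zip": 8}),
    pd.Series({"Build Number": 394, "Channel name": 1555, "City": 16,
               "Serial No": 213, "State": 7, "Zip": 22}),
)
amzn = percent_change(
    pd.Series({"AD ID": 269, "Android ID": 923, "Channel name": 1437,
               "City": 12, "Device name": 8, "MAC": 26,
               "Serial No": 201, "Zip": 21}),
    pd.Series({"AD ID": 218, "Android ID": 1000, "Channel name": 1419,
               "City": 24, "Device name": 5, "MAC": 30,
               "Serial No": 161, "Zip": 28}),
)

for name, table in [("Roku", roku), ("Amazon", amzn)]:
    overall = (table["limitad"].sum() / table["vanilla"].sum() - 1) * 100
    print(f"{name}: overall change in leaks {overall:.1f}%")
    print(table, end="\n\n")
```

The overall figure sums all id types, so it is dominated by the frequent ones such as "Channel name"; the per-type column is more informative for rare types.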

It is also interesting that more location data ("City", "State", and "Zip") is leaked when the "limit ads" option is on, for both Roku and Amazon FireTV. This could mean that when these services "limit ad tracking", they only turn off direct consumer targeting based on device information but still use location data to serve somewhat relevant ads. However, the number of location leaks is relatively small, so I cannot draw a definitive conclusion.
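
The location totals behind this observation can be tallied directly from the outputs above. Treating "City", "State", and "Zip" as the location-related id types is an assumption of this sketch.

```python
# num_leaks for location-related id types, copied from the outputs above.
location_leaks = {
    "Roku vanilla":   {"City": 2, "Zip": 8},
    "Roku limitad":   {"City": 16, "State": 7, "Zip": 22},
    "Amazon vanilla": {"City": 12, "Zip": 21},
    "Amazon limitad": {"City": 24, "Zip": 28},
}

totals = {name: sum(counts.values()) for name, counts in location_leaks.items()}
for name, total in totals.items():
    print(f"{name}: {total} location leaks")
```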