# Lab 1 - Data Structures and Sorting

### Name:  Samantha Cohen
### Uniqname: samcoh
### People you worked with: Rhea Kulkarni

## Submission instructions

Fill in the three fields above and turn in the Lab by tomorrow night (Thurs) at 11:59pm.

Turn in the following files via Canvas -> Assignments -> Lab 1:

* Your Notebook, named `si330-hw1-YOUR_UNIQUE_NAME.ipynb`
* the HTML file, named `si330-hw1-YOUR_UNIQUE_NAME.html`

## Objectives
After completing this lab, you should know how to
* use compound data structures
* perform simple and complex sorting
* use lambda functions

In addition, this assignment will provide an opportunity to work with a large (100,000 row) data set.

## Background

Massive Open Online Courses (MOOCs) are a popular way for people to learn new skills.  The University of Michigan
offers many different MOOCs, which are produced by faculty members and supported by the Office of Academic 
Innovation.

MOOCs tend to be used by hundreds to hundreds of thousands of users.  These users leave "digital exhaust" when
they work through the MOOC in the form of web server log entries.  We have obtained a small sample of these data
files from Prof. Chris Brooks, who is a colleague here at UMSI.  The data files are de-identified: anything
that could identify a person, such as their UMID or their IP address, is "hashed" (replaced with a code that is hard to reverse).  Each line in the
data file represents a "page view" by a user.  The schema for each line is:

```umich_user_id, hashed_session_cookie_id, server_timestamp, hashed_ip, user_agent, url, initial_referrer_url, browser_language, course_id, country_cd, region_cd, timezone, os, browser, key, value```

**For this lab we will only concern ourselves with ```umich_user_id```, which identifies each user.**

We will use the files **mooc_small.csv** and **countrycodes.tsv** for this lab.

In the lab, we will walk through some basic manipulations of the MOOC log data. These concepts will be tested in your homework assignment, where you will use these manipulations to answer some real-world questions.

## Setup

In [2]:
import csv
from collections import defaultdict

## The data

We will use two data files:

- `mooc_small.csv`: MOOC usage logs, as described above
- `countrycodes.tsv`: a table of country codes and corresponding country names, so that we can determine the names of countries which are represented as country codes in ```mooc_small.csv```. 

## Part 0: Inspecting the data

A good first step, **before** writing/executing code, is to inspect the data files you'll use.

Double-click on `countrycodes.tsv` and `mooc_small.csv` in Jupyter Lab's left-hand file pane to open them up in Jupyter Lab's built-in data file viewer.
We will be trying to use the `country_cd` column in `mooc_small.csv` to look up countries in `countrycodes.tsv`.

## Part 1: Importing the data

### Part 1.1: MOOC logs

First, we'll import the MOOC usage logs from `mooc_small.csv` into a list-of-dictionaries data structure called `mooc_data`:

In [6]:
mooc_data_file_name = "mooc_small.csv"

with open(mooc_data_file_name, "r") as csvfile:
    mooc_data = list(csv.DictReader(csvfile))
    # CHANGE ME
    # insert lines here to create a variable called `mooc_data` containing a
    # list-of-dictionaries, where each dictionary is a row from "mooc_small.csv"
    #pass
#mooc_data
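
To make the list-of-dictionaries shape concrete, here is a minimal sketch that runs `csv.DictReader` on a tiny in-memory sample instead of the real file. The two sample rows and the names `sample_csv` and `sample_rows` are made up for illustration; the real `mooc_small.csv` has many more columns.

```python
import csv
import io

# Hypothetical two-row sample with only two of the real columns.
sample_csv = "umich_user_id,country_cd\nuser_a,US\nuser_b,IT\n"

# DictReader treats the first line as the header and uses those names
# as the keys of one dictionary per remaining line.
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(len(sample_rows))                 # number of data rows (header excluded)
print(sample_rows[0]["umich_user_id"])  # access a field by column name
print(sample_rows[1]["country_cd"])
```

Each element of the list is one row, and each row is addressed by column name rather than by position, which is why the later cells can write `row["country_cd"]`.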

Print the **first 10 lines** from ```mooc_data```. We will output the ```user id``` and ```country code```:

In [7]:
for row in mooc_data[:10]: # CHANGE ME: As-is, this will print all the rows. Change it so that it prints only the first 10
    print(row["umich_user_id"],row["country_cd"]) # CHANGE ME: Write down the code within the print statement so that it prints user id and country code

0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF
0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF
5450de1c9e1874d613a9649a39352a10313a3b8f IT
0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF
25424b1007637699cf0c672edc7a64c2b65268fa US
a95f04999ccf8fcd0f26fb0851745073a147e009 CZ
4ea0a18ab02a30290dda02bdb2da8a7a6a469245 US
25424b1007637699cf0c672edc7a64c2b65268fa US
44185055eece5d1bc7986d743d240a7633d968ff US
005aa91c779e6fe84b49398793dbda670dd6c352 NO


### Part 1.2: Country names

For a given two-letter ```country_cd``` in ```mooc_small.csv```, we want to know the complete name of the country. To do that, we will import the country codes into a dictionary.

Double-click on `countrycodes.tsv` to open it.

**Q1.2.1. What is the name of the column that corresponds to the country name?**

> Country or Area Name

**Q1.2.2: What is the name of the column that corresponds to the two-letter country code?**

> ISO ALPHA-2 Code

To look up the country name corresponding to a particular country code, we will create a dictionary with these properties:
- **keys** in the dictionary are country codes
- **values** in the dictionary are country names

In [4]:
with open("countrycodes.tsv", "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = "\t", quotechar = '"')
    country_names = {}
    for row in reader:
        country_names[row['ISO ALPHA-2 Code']]=row['Country or Area Name']

Let's see what it looks like:

In [5]:
for country_code, country_name in country_names.items():  # CHANGE ME: Change XXXX to the appropriate expression to iterate over
    print(country_code, country_name) #CHANGE ME: print out the country code and country name

AF Afghanistan
AX Aland Islands
AL Albania
DZ Algeria
AS American Samoa
AD Andorra
AO Angola
AI Anguilla
AQ Antarctica
AG Antigua and Barbuda
AR Argentina
AM Armenia
AW Aruba
AU Australia
AT Austria
AZ Azerbaijan
BS Bahamas
BH Bahrain
BD Bangladesh
BB Barbados
BY Belarus
BE Belgium
BZ Belize
BJ Benin
BM Bermuda
BT Bhutan
BO Bolivia
BA Bosnia and Herzegovina
BW Botswana
BV Bouvet Island
BR Brazil
VG British Virgin Islands
IO British Indian Ocean Territory
BN Brunei Darussalam
BG Bulgaria
BF Burkina Faso
BI Burundi
KH Cambodia
CM Cameroon
CA Canada
CV Cape Verde
KY Cayman Islands
CF Central African Republic
TD Chad
CL Chile
CN China
HK Hong Kong, SAR China
MO Macao, SAR China
CX Christmas Island
CC Cocos (Keeling) Islands
CO Colombia
KM Comoros
CG Congo (Brazzaville)
CD Congo, (Kinshasa)
CK Cook Islands
CR Costa Rica
CI Cote d'Ivoire
HR Croatia
CU Cuba
CY Cyprus
CZ Czech Republic
DK Denmark
DJ Djibouti
DM Dominica
DO Dominican Republic
EC Ecuador
EG Egypt
SV El Salvador
GQ Equatorial Gui

Now, let's re-write the code you used to create `country_names` to use a *dictionary comprehension* instead of a loop:

In [6]:
with open("countrycodes.tsv", "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = "\t", quotechar = '"')
    country_names = {row["ISO ALPHA-2 Code"]: row['Country or Area Name'] for row in reader}# CHANGE ME: Change this line to a dictionary comprehension that creates the same data structure you created above

Re-run the code block that prints country codes and names to make sure your dictionary comprehension version works.
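
As a quick check that the two approaches are equivalent, the sketch below builds the same mapping with a loop and with a dictionary comprehension and compares them. The rows in `sample_rows` are hypothetical stand-ins for what `csv.DictReader` yields from `countrycodes.tsv`.

```python
# Hypothetical rows shaped like those read from countrycodes.tsv.
sample_rows = [
    {"ISO ALPHA-2 Code": "IT", "Country or Area Name": "Italy"},
    {"ISO ALPHA-2 Code": "NO", "Country or Area Name": "Norway"},
]

# Loop version: start empty and add one key/value pair per row.
names_from_loop = {}
for row in sample_rows:
    names_from_loop[row["ISO ALPHA-2 Code"]] = row["Country or Area Name"]

# Comprehension version: the same key and value expressions in one expression.
names_from_comprehension = {
    row["ISO ALPHA-2 Code"]: row["Country or Area Name"] for row in sample_rows
}

print(names_from_loop == names_from_comprehension)  # True
```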

## Part 2: Combining `mooc_data` and `country_names`

Print the **first 10 lines** from ```mooc_data``` again. This time, we will output the ```user id```, ```country code```, **and full country name**.

Hint: you'll need to use `country_names` to look up the country name.

In [7]:
for row in mooc_data[:10]: # CHANGE ME: As-is, this will print all the rows. Change it so that it prints only the first 10
    print(row["umich_user_id"], row["country_cd"], country_names[row["country_cd"]]) # CHANGE ME: Write down the code within the print statement so that it prints user id, country code, and country name

0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF French Polynesia
0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF French Polynesia
5450de1c9e1874d613a9649a39352a10313a3b8f IT Italy
0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF French Polynesia
25424b1007637699cf0c672edc7a64c2b65268fa US United States of America
a95f04999ccf8fcd0f26fb0851745073a147e009 CZ Czech Republic
4ea0a18ab02a30290dda02bdb2da8a7a6a469245 US United States of America
25424b1007637699cf0c672edc7a64c2b65268fa US United States of America
44185055eece5d1bc7986d743d240a7633d968ff US United States of America
005aa91c779e6fe84b49398793dbda670dd6c352 NO Norway


**Q1.2.3: Why is a dictionary a good data structure for `country_names`? Hint: think about (1) what operations we needed to perform on this data structure and
(2) what operations are fast or slow for this data structure.**

> The operation we need to perform on this data structure is looking up a value (the country name) by its key (the country code).
Dictionaries are fast at getting a value by key, which makes them a good data structure to use here. The operations that are fast for this data structure: getting/setting items by key, testing for key membership, and deleting items by key. The operation that is slow for a dictionary is testing for value membership, because every value may have to be checked.
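
The difference between key-based and value-based operations can be sketched as follows. The dictionary `sample_names` is a small made-up subset, and the `"Unknown"` fallback is an assumption for illustration, not something the lab requires.

```python
# Hypothetical subset of the country_names dictionary.
sample_names = {"IT": "Italy", "NO": "Norway", "CZ": "Czech Republic"}

# Key-based operations go straight to the entry via the key's hash.
name = sample_names["IT"]
has_code = "NO" in sample_names

# Value membership has no such shortcut: values are checked one by one.
has_name = "Norway" in sample_names.values()

# .get() returns a fallback instead of raising KeyError for an unknown code.
fallback = sample_names.get("XX", "Unknown")

print(name, has_code, has_name, fallback)
```

With only three entries the speed difference is invisible, but for a large dictionary the value scan grows with the number of entries while the key lookup typically does not.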

## Part 3: Organizing user ids by country code

Next, we want to re-organize the `mooc_data` into a data structure which will make it easier for us to get all user ids for one country.

We will create a dictionary of lists, where:

- each **key** is a country code
- each **value** is a list of user ids from that country

We will use a `defaultdict(list)` object instead of just a normal dictionary. A `defaultdict(list)` acts just like a normal dictionary,
except that when you try to retrieve the value for a key that is not in the dictionary, it calls the `list()` function, which creates a new list.
It then takes that list, assigns it to that key, and returns it.
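
A minimal sketch of that behavior, using made-up `(country code, user id)` pairs rather than the real log rows; `users_by_country` and `sample_pairs` are illustrative names only.

```python
from collections import defaultdict

users_by_country = defaultdict(list)

# Hypothetical (country code, user id) pairs.
sample_pairs = [("US", "user_a"), ("IT", "user_b"), ("US", "user_c")]

for country_code, user_id in sample_pairs:
    # The first time a code is seen, defaultdict creates an empty list for it,
    # so .append() works without checking whether the key already exists.
    users_by_country[country_code].append(user_id)

print(dict(users_by_country))  # {'US': ['user_a', 'user_c'], 'IT': ['user_b']}

# Side effect to be aware of: just reading a missing key also inserts it.
empty = users_by_country["NO"]
print(empty, "NO" in users_by_country)  # [] True
```

With a normal dictionary, the same loop would need an explicit `if country_code not in ...` check (or `setdefault`) before appending.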

In [10]:
user_data_by_country = defaultdict(list)

for row in mooc_data:
    user_data_by_country[row["country_cd"]].append(row["umich_user_id"])
    # CHANGE ME: insert code here to add the user id for each row to the correct list within `user_data_by_country`.
#print(user_data_by_country)

## Part 4: Getting all user ids for just the US

We want to find out the number of different users overall and from the US. For that, we first need to filter ```user_data_by_country``` to retrieve the data from the US and store it in ```us_user_data```. Since we are using a dictionary, this is relatively straightforward (nothing for you to change here):

In [9]:
us_user_data = user_data_by_country['US']

## Part 5: Counting unique users

To get the number of unique users, we can do it in two different ways - **dictionaries** and **sets**.

Sets are useful if you just want to know what the unique user ids are. Dictionaries are useful if you want to associate some additional information with each unique user, such as the number of logs for that user.
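
The trade-off can be seen on a small made-up list of user ids (`sample_ids` is hypothetical): both structures agree on how many unique users there are, but only the dictionary remembers how often each one appeared.

```python
from collections import defaultdict

# Hypothetical user ids, one per log line; user_a appears three times.
sample_ids = ["user_a", "user_b", "user_a", "user_c", "user_a"]

# A set keeps each id once and nothing else.
unique_ids = set(sample_ids)

# A defaultdict(int) starts every new key at 0, so += 1 counts occurrences.
log_counts = defaultdict(int)
for user_id in sample_ids:
    log_counts[user_id] += 1

print(len(unique_ids), len(log_counts))  # 3 3
print(log_counts["user_a"])              # 3
```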

First, you will use **dictionaries** to count unique users and store the number of logs for each. You will do this both for all users globally (`unique_user_log_count_overall`) and for just users in the US (`unique_user_log_count_us`). It is easiest to do this using ```defaultdict```:

In [11]:
unique_user_log_count_overall = defaultdict(int) # CHANGE ME: should be defaultdict(XXX) --- figure out what XXX should be!
unique_user_log_count_us = defaultdict(int)       # CHANGE ME
for row in mooc_data:
    unique_user_log_count_overall[row["umich_user_id"]] += 1
    if row["country_cd"] == "US": 
        unique_user_log_count_us[row["umich_user_id"]] += 1
    #CHANGE ME
    #pass
#print(unique_user_log_count_us)
print("Total no. of unique users: ", len(unique_user_log_count_overall))
print("No. of unique users from the US: ", len(unique_user_log_count_us))

defaultdict(<class 'int'>, {'25424b1007637699cf0c672edc7a64c2b65268fa': 2, '4ea0a18ab02a30290dda02bdb2da8a7a6a469245': 2, '44185055eece5d1bc7986d743d240a7633d968ff': 3, '95cffc5948af183853735930299a0bc48c1cdc6c': 3, '90e1b71e4948330711a9bb9f2df6818532945829': 1, '81f41b99410e89adfd934416faf87e361da5815f': 1, 'f9a7816134e0397b5f380ab082cdef5b37d3a323': 1, '7fe732fdcf374ef1ea747a19e1eb28478c6ad656': 2, 'c7e0b7e873392815abee61a53c231a1d5866a659': 6, '1066a697903937dcb2bba46698b65c9067602b13': 4, 'ba41844ae5dba3fe42d7cc13c124d68062656cd2': 1, '6834ae47d226b2053f130fb9e52eab3c63212ca1': 1, 'b8ee892ce61b98c391c7f49dac24a32569548d15': 1, 'bec8d341e667a015f933c171705ab73c3affdf03': 1, '1e791177b039f48de777a632003765c1ee10f349': 1, '7c08d6e31857a6cf120b5a22b5c659ead920e4dc': 1, '70d530b2e677aa82a680b36ba534dbabc884e010': 3, '95b5bdb0c8007d9deb2adfdbc585300ca4346b9f': 1, '32b73c34b1f858c72bfdb95114891bcc90f6856d': 1, '5ea1e0b32644f577a0f9f08f43d793a7f8c864ef': 2, '02b921627914e08b888160fafef88ab

Second, you will count unique users using **sets**. In the following block you will create two sets to store the unique user ids of all users globally (`unique_users_overall`)
and from just the US (`unique_users_us`). Remember that a set will only store unique ids, unlike a list.

In the following chunk, write down the **set comprehensions** to get the sets of all the unique users globally and in the US:

In [4]:
unique_users_overall = {row["umich_user_id"] for row in mooc_data}   # CHANGE ME: write down a set comprehension to get all unique users 
unique_users_us = {row["umich_user_id"] for row in mooc_data if row["country_cd"]=="US"}         # CHANGE ME: write down a set comprehension to get US unique users 

print("Total no. of unique users: ", len(unique_users_overall))
print("No. of unique users from the US: ", len(unique_users_us))

Total no. of unique users:  51
No. of unique users from the US:  25


## Part 6: Getting the average number of logs per user
Here we want to get the average number of logs for users in the world versus users just in the US. We can use the dictionaries that we created previously, ```unique_user_log_count_overall``` and ```unique_user_log_count_us``` to compute this. The easiest way to do this is using the `sum()` function --- look it up if you are not familiar with it.

In [12]:
#print(unique_user_log_count_overall)
#print(sum([log_count for log_count in unique_user_log_count_overall.values()]))
#print(len(unique_user_log_count_overall))
mean_logs_per_user_overall = sum(unique_user_log_count_overall.values())/ len(unique_user_log_count_overall)  # CHANGE ME: Calculate the sum over the values of the dictionary. Divide by the length of the dictionary
mean_logs_per_user_us = sum(unique_user_log_count_us.values())/ len(unique_user_log_count_us)

print("Mean no. of logs per user overall: ", mean_logs_per_user_overall)
print("Mean no. of logs per user from the US: ", mean_logs_per_user_us)

Mean no. of logs per user overall:  1.9607843137254901
Mean no. of logs per user from the US:  1.76
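
One way to sanity-check a mean like this is to confirm that the per-user counts add back up to the total number of log rows. The sketch below does that on made-up counts (`sample_counts` is hypothetical) and cross-checks the result against `statistics.mean` from the standard library.

```python
import math
import statistics

# Hypothetical per-user log counts.
sample_counts = {"user_a": 3, "user_b": 1, "user_c": 1}

total_logs = sum(sample_counts.values())   # should equal the number of log rows
mean_logs = total_logs / len(sample_counts)

# Cross-check with the standard library's mean.
mean_check = statistics.mean(sample_counts.values())

print(total_logs)
print(math.isclose(mean_logs, mean_check))
```

Applied to the lab data, the same check means `sum(unique_user_log_count_overall.values())` should match `len(mooc_data)`.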


## Part 7: Top 5 users

Let's get the ```user_id``` of the top 5 users who visited the most pages, globally and from the US.

To do this, you will need to use the ```sorted``` function and pass a ```lambda``` function through the ```key``` parameter. **Hint:** you should sort `unique_user_log_count_overall.items()` and `unique_user_log_count_us.items()`.
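
As an illustration of the pattern on made-up data (the counts in `sample_counts` are hypothetical): `.items()` yields `(user_id, count)` tuples, the `lambda` picks out the count as the sort key, and `reverse=True` puts the largest counts first.

```python
# Hypothetical per-user log counts.
sample_counts = {"user_a": 3, "user_b": 1, "user_c": 5}

# Sort the (user_id, count) pairs by count, largest first.
ranked = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)

# Keep only the user ids of the top two entries.
top_two = [user_id for user_id, _ in ranked[:2]]

print(ranked[0])  # ('user_c', 5)
print(top_two)    # ['user_c', 'user_a']
```

Because `sorted` is stable, users with equal counts keep the order they had in the dictionary.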

First, globally:

In [13]:
sorted_users_overall = [tup[0] for tup in sorted(unique_user_log_count_overall.items(), key= lambda x: x[1], reverse= True)] # CHANGE ME: Fill in within the parenthesis to sort the dictionary `unique_user_log_count_overall` by the number of logs in the descending order
sorted_users_overall[:5] # Do not change this. This will output the top 5 users.

#print(sorted_users_overall)

['bb116a9af763fa2f53139c1dfa851c760a667169',
 'c7e0b7e873392815abee61a53c231a1d5866a659',
 '0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9',
 '17c1eed00cc46d9dfe62d95570d1b8e8846d5239',
 '005aa91c779e6fe84b49398793dbda670dd6c352']

Then for US only:

In [14]:
sorted_users_us = [tup[0] for tup in sorted(unique_user_log_count_us.items(), key= lambda x: x[1], reverse = True)] # CHANGE ME: Similar to above
sorted_users_us[:5] # Do not change this. This will output the top 5 users.



['c7e0b7e873392815abee61a53c231a1d5866a659',
 '1066a697903937dcb2bba46698b65c9067602b13',
 '44185055eece5d1bc7986d743d240a7633d968ff',
 '95cffc5948af183853735930299a0bc48c1cdc6c',
 '70d530b2e677aa82a680b36ba534dbabc884e010']